How To Fix Broken UTF-8 Encoding In Python?

My string is Niệm Bồ Tát (Thiá»n sÆ° Nhất Hạnh) and I want to decode it to Niệm Bồ Tát (Thiền sư Nhất Hạnh). I have seen a website that can do this.

Solution 1:

The only thing that helped me with a broken Cyrillic string was ftfy: https://github.com/LuminosoInsight/python-ftfy

This module fixes pretty much everything and works much better than online decoders.

>>> from ftfy import fix_encoding
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> fix_encoding(mystr)
'09. Bát Nhã Tâm Kinh'

It can be installed with pip install ftfy.
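If there are several strings to clean, the same call can be applied in a loop. The following is only a rough sketch: repair_all and titles are hypothetical names made up for illustration, and the choice to pass strings through untouched when ftfy is not installed is an assumption of this example, not something ftfy does itself.

```python
def repair_all(strings):
    """Return the strings with mojibake repaired by ftfy, if ftfy is installed."""
    try:
        from ftfy import fix_encoding  # third-party: pip install ftfy
    except ImportError:
        # Assumption of this sketch: without ftfy, return the input unchanged.
        return list(strings)
    return [fix_encoding(s) for s in strings]


# Hypothetical sample data: one damaged title, one that needs no repair.
titles = ["09. BÃ¡t NhÃ£ TÃ¢m Kinh", "Plain ASCII title"]
for before, after in zip(titles, repair_all(titles)):
    print(repr(before), "->", repr(after))
```

Comparing before and after is a cheap way to see which entries were actually damaged.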

Solution 2:

I'm not sure what you can do with this kind of data in general, but for the example in your original post, this works in Python 2:

>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> s = mystr.decode('utf8').encode('latin1').decode('utf8')
>>> s
u'09. B\xe1t Nh\xe3 T\xe2m Kinh'
>>> print(s)
09. Bát Nhã Tâm Kinh
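In Python 3, str has no decode method, so the first step of that chain disappears. A minimal sketch of the same round trip follows. It assumes the damage came from UTF-8 bytes being read as Latin-1; fix_mojibake and its wrong_encoding parameter are hypothetical names chosen for this example.

```python
def fix_mojibake(text: str, wrong_encoding: str = "latin-1") -> str:
    """Undo UTF-8 text that was mistakenly decoded with `wrong_encoding`.

    Returns the input unchanged when the round trip is not possible.
    """
    try:
        # Recover the original bytes, then decode them as the UTF-8 they were.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside `wrong_encoding`,
        # or the recovered bytes are not valid UTF-8.
        return text


print(fix_mojibake("09. BÃ¡t NhÃ£ TÃ¢m Kinh"))  # 09. Bát Nhã Tâm Kinh
```

This only works when the whole string is damaged in the same way. The string in the question mixes intact characters such as ệ with damaged ones, so it cannot be encoded as Latin-1 in one piece, and this helper returns it unchanged. Cases like that need per-segment repair or a tool such as ftfy.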
