How To Fix Broken UTF-8 Encoding In Python?
My string is Niệm Bồ Tát (Thiá»n sÆ° Nhất Hạnh) and I want to decode it to Niệm Bồ Tát (Thiền sư Nhất Hạnh). I have seen a website that can do this. How can I do it in Python?
Solution 1:
The only thing that helped me with a broken Cyrillic string was ftfy: https://github.com/LuminosoInsight/python-ftfy
This module fixes pretty much everything and works much better than online decoders.
>>> from ftfy import fix_encoding
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> fix_encoding(mystr)
'09. Bát Nhã Tâm Kinh'
It can be installed with pip install ftfy.
Solution 2:
I'm not sure what you can do with this kind of data, but for the example in your original post, this works (Python 2):
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> s = mystr.decode('utf8').encode('latin1').decode('utf8')
>>> s
u'09. B\xe1t Nh\xe3 T\xe2m Kinh'
>>> print(s)
09. Bát Nhã Tâm Kinh
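The session above relies on Python 2 byte strings; in Python 3 a str has no .decode method, so the first step is dropped. Here is a rough sketch of the same round trip using only the standard library. The helper name fix_mojibake and its wrong_encoding parameter are made up for illustration, and the sketch assumes the text was UTF-8 that got decoded as Latin-1. If it was decoded as Windows-1252 instead, passing "cp1252" might work better.

```python
def fix_mojibake(text, wrong_encoding="latin-1"):
    """Undo UTF-8 text that was mistakenly decoded as `wrong_encoding`.

    Hypothetical helper, not part of any library.
    """
    try:
        # Re-encode to get the original bytes back, then decode them as UTF-8.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit this kind of mojibake; return it unchanged.
        return text


broken = "09. BÃ¡t NhÃ£ TÃ¢m Kinh"
print(fix_mojibake(broken))  # 09. Bát Nhã Tâm Kinh
```

Because the helper returns its input when either conversion fails, text that is already correct passes through unchanged. For strings that mix several kinds of damage, ftfy from Solution 1 is likely the safer choice.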