
Detect Encoding In Wrongly Encoded UTF-8 Text File

I have an encoding issue. I have millions of text files that I need to parse for a language data science project. Each text file is supposedly encoded as UTF-8, but I just found that some of them are not…

Solution 1:

Eventually, I figured it out. Using CharsetNormalizerMatches seems to work, properly detecting the encoding. Anyway, this is how I implemented it, and it works like a charm, correctly detecting gb18030 as the encoding for the file in question:

from charset_normalizer import CharsetNormalizerMatches as CnM

# pick the best match for the file and read off its detected encoding
encoding = CnM.from_path(path).best().first().encoding

Note: The answer was hinted to me by someone who suggested using CharsetNormalizerMatches but later deleted their post here. Too bad; I'd love to give them the credit.
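For what it's worth, newer releases of charset_normalizer (2.0 and later) dropped the CharsetNormalizerMatches class in favour of a module-level from_path helper; a minimal sketch of the same lookup, assuming one of those versions is installed:

from charset_normalizer import from_path

# from_path scores candidate encodings; best() picks the most plausible match
best_match = from_path(path).best()
if best_match is not None:
    print(best_match.encoding)  # e.g. 'gb18030'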

Solution 2:

You can't get a definite encoding for your linked example.txt file, because it is a concatenation of two parts with different encodings:

path = r'D:\Downloads\example.txt'
with open(path, 'rb') as f:
    data = f.read()

# double mojibake: the first 37 bytes are GB2312 text that was mis-read as
# Latin-1 and then re-saved as UTF-8, so the chain must be unwound in reverse
print(data[:37].decode('utf-8').encode('latin1').decode('gb2312'))

# the rest of the file is plain GB2312-encoded Chinese
print(data[37:].decode('gb2312'))
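To see why that first decode chain works, it helps to replay how such bytes arise: GB2312 bytes get mis-read as Latin-1 (which accepts any byte), and the resulting gibberish is then saved as UTF-8. Unwinding the steps in reverse recovers the original. A self-contained round trip with a hypothetical subject line:

# hypothetical GB2312-encodable subject line, for illustration only
original = '主题: 回复: 我升级到'

mangled = original.encode('gb2312').decode('latin1')  # step 1: bytes mis-read as Latin-1
stored = mangled.encode('utf-8')                      # step 2: gibberish re-saved as UTF-8

# reversing the chain restores the original text
repaired = stored.decode('utf-8').encode('latin1').decode('gb2312')
assert repaired == original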

Pasting the result into Google Translate gives

Subject: Re: I upgraded to

The orange version of the orange version, should be corrected

Unfortunately, SO thinks that the Chinese text in the result is spam, so I can't embed it here…

Body cannot contain "".

This appears to be spam. If you think we've made an error, make a post in meta.

Edit: print(data[:37].decode('gb18030')) returns

Subject: 禄脴赂麓: 脦脪脡媒录露碌陆

Google Translate then gives Subject: Lulululu: Lululu Lulu as an English equivalent for the latter string. Anyway, the above-mentioned Subject: Re: I upgraded to (or Re: My promotion, suggested by Mark Tolonen) looks more meaningful than this…
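Note that the gb18030 decode above succeeds without raising an error yet still yields gibberish: gb18030 assigns a character to almost every byte pair, so a bare try/except decode probe proves nothing. Statistical detectors score candidate encodings instead of merely checking decodability; a minimal sketch with chardet, assuming it is installed:

import chardet

with open(path, 'rb') as f:
    data = f.read()

# detect() reports the statistically most plausible encoding and a confidence score
guess = chardet.detect(data)
print(guess['encoding'], guess['confidence'])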
