Detect Encoding In Wrongly Encoded Utf-8 Text File
Solution 1:
Eventually, I figured it out. Using CharsetNormalizerMatches seems to work, properly detecting the encoding. Anyway, this is how I implemented it, and it works like a charm, correctly detecting gb18030 encoding for the file in question:
from charset_normalizer import CharsetNormalizerMatches as CnM
encoding = CnM.from_path(path).best().first().encoding
Note: The answer was hinted to me by someone who suggested using CharsetNormalizerMatches but later deleted their post here. Too bad, I'd have loved to give them the credit.
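For reference: in newer releases of charset_normalizer, the CharsetNormalizerMatches class has been removed in favor of top-level helper functions. A minimal sketch with the current API (an assumption based on charset_normalizer ≥ 2.0; path is the same file path as above):

from charset_normalizer import from_path

# from_path() reads the file and ranks candidate encodings;
# best() returns the most probable match, or None if nothing fits
match = from_path(path).best()
if match is not None:
    print(match.encoding)  # expected: 'gb18030' for the file in question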
Solution 2:
You can't get a single definite encoding for your linked example.txt file, as it concatenates two different encodings:
path = r'D:\Downloads\example.txt'
with open(path, 'rb') as f:
    data = f.read()

# double mojibake
print(data[:37].decode('utf-8').encode('latin1').decode('gb2312'))

# Chinese
print(data[37:].decode('gb2312'))
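To see why the first decode chain works, it helps to spell out how the double mojibake arises: gb2312 bytes get misread as latin1 (which maps every byte to some code point) and the resulting garbage is then saved as UTF-8. The chain above simply walks that back. A sketch with a hypothetical gb2312 string, not the actual content of the file:

original = '主题: 回复: 我升级到'.encode('gb2312')   # hypothetical source bytes
mangled = original.decode('latin1').encode('utf-8')  # misread as latin1, re-saved as UTF-8
restored = mangled.decode('utf-8').encode('latin1')  # walk both steps back
assert restored == original
print(restored.decode('gb2312'))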
Pasting the result into Google Translate gives
Subject: Re: I upgraded to
The orange version of the orange version, should be corrected
Unfortunately, SO thinks that the Chinese text in the result is spam, so I can't embed it here…
Body cannot contain "".
This appears to be spam. If you think we've made an error, make a post in meta.
Edit:
print(data[:37].decode('gb18030'))
returns
Subject: 禄脴赂麓: 脦脪脡媒录露碌陆
Google Translate then gives
Subject: Lulululu: Lululu Lulu
as an English equivalent for the latter string.
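That result is plausible because gb18030 is a superset of gb2312: the body of the file decodes identically under either codec, and gb18030 can decode almost any byte sequence, so the mojibake header doesn't make it fail. A quick check (a sketch, reusing data from the snippet above):

# gb18030 extends gb2312, so the gb2312-encoded body decodes identically
assert data[37:].decode('gb18030') == data[37:].decode('gb2312')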
Anyway, the abovementioned Subject: Re: I upgraded to
(or Re: My promotion,
as suggested by Mark Tolonen) looks more meaningful than this…