How To Find Out The Correct Encoding When Using Beautifulsoup?
Solution 1:
Requests determines encoding like this (quoting its documentation):

When you receive a response, Requests makes a guess at the encoding to use for decoding the response when you access the Response.text attribute. Requests will first check for an encoding in the HTTP header, and if none is present, will use chardet to attempt to guess the encoding. The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.
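The two remedies named at the end of that quote look like this in practice; a minimal sketch, reusing the URL from Solution 2:

import requests

req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')

# Remedy 1: set the encoding yourself *before* reading req.text,
# so requests decodes the body with your choice rather than its guess.
req.encoding = 'UTF-8'
text = req.text

# Remedy 2: bypass requests' decoding entirely and decode the raw bytes yourself.
text = req.content.decode('UTF-8')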
Inspecting the response headers shows that indeed "no explicit charset is present in the HTTP headers and the Content-Type header contains text":
>>> req.headers['content-type']
'text/html'
So requests faithfully follows the standard and decodes as ISO-8859-1 (latin-1).
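You can confirm the default that requests picked; given the header above, it should report the RFC 2616 fallback:
>>> req.encoding
'ISO-8859-1'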
In the response content, a charset is specified:
<META http-equiv="Content-Type" content="text/html; charset=utf-16">
However, this is wrong: decoding as UTF-16 produces mojibake. chardet correctly identifies the encoding as UTF-8.
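You do not need to run chardet by hand: requests exposes its guess as the Response.apparent_encoding property, which here should agree with chardet and report UTF-8 (the exact spelling of the name can vary depending on whether chardet or charset_normalizer is installed):
>>> req.apparent_encoding
'utf-8'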
So to summarise:
- there is no general way to determine text encoding with complete accuracy
- in this particular case, the correct encoding is UTF-8.
Working code:
>>> import bs4
>>> req.encoding = 'UTF-8'
>>> soup = bs4.BeautifulSoup(req.text, 'lxml')
>>> soup.find('h1').text
'\r\n CÂMARA MUNICIPAL DE SÃO PAULO'
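A more general pattern, sketched under the caveats of the summary above: trust the HTTP headers only when they declare a charset explicitly, and otherwise fall back to the chardet guess that requests exposes as apparent_encoding. The helper name fetch_soup is just for illustration:

import requests
import bs4

def fetch_soup(url):
    resp = requests.get(url)
    content_type = resp.headers.get('content-type', '')
    if 'charset' not in content_type.lower():
        # No explicit charset, so resp.encoding is only the ISO-8859-1
        # default; prefer the guess derived from the actual bytes.
        resp.encoding = resp.apparent_encoding
    return bs4.BeautifulSoup(resp.text, 'lxml')

soup = fetch_soup('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')
print(soup.find('h1').text)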
Solution 2:
When you use requests, you can read the encoding it detected from the Response.encoding attribute and decode the raw bytes yourself, for example:
import requests

req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')
# Encoding requests derived from the HTTP headers
# (ISO-8859-1 here, as shown in Solution 1).
encoding = req.encoding
# Raw, undecoded bytes of the body.
text = req.content
decoded_text = text.decode(encoding)
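Note that for this particular URL, req.encoding is the ISO-8859-1 default (see Solution 1), so the decode above reproduces the mojibake. An alternative, sketched below, is to hand BeautifulSoup the raw bytes and override the page's incorrect declared charset via the from_encoding argument; BeautifulSoup records whatever encoding it actually used in soup.original_encoding:

import requests
import bs4

req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')

# Pass raw bytes and name the encoding explicitly, since the page's own
# <meta> tag wrongly declares utf-16 (see Solution 1).
soup = bs4.BeautifulSoup(req.content, 'lxml', from_encoding='UTF-8')

print(soup.original_encoding)
print(soup.find('h1').text)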