BeautifulSoup4 stripped_strings gives me byte objects?
I'm trying to get the text out of a blockquote which looks like this:
01 Oyasumi 02 DanSin' &l
Solution 1:
The problem is that when from_encoding is not specified, Beautiful Soup has to guess the document's original encoding before converting it to Unicode. It does this with a sub-library called Unicode, Dammit, and here the guess is wrong. More info in the Encodings section of the documentation.
>>> from bs4 import BeautifulSoup
>>> doc = '''<blockquote class="postcontent restore ">
... 01 Oyasumi
... <br></br>
... 02 DanSin'
... <br></br>
... 03 w.t.s.
... <br></br>
... 04 Lovism
... <br></br>
... 05 NoName
... <br></br>
... 06 Gakkou
... <br></br>
... 07 Happy☆Day
... <br></br>
... 08 Endless End.
... </blockquote>'''
>>> soup = BeautifulSoup(doc, 'html5lib')
>>> soup.original_encoding
u'windows-1252'
>>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
>>> for line in content:
...     print(line)
...
01 Oyasumi
02 DanSin'
03 w.t.s.
04 Lovism
05 NoName
06 Gakkou
07 Happyâ˜†Day
08 Endless End.
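The garbled line in the output above is what UTF-8 bytes look like when they are decoded as windows-1252. A minimal sketch of that mechanism, using only the standard library (the variable names are illustrative and not part of the original session):

```python
# The star character takes three bytes in UTF-8.
original = "07 Happy☆Day"
utf8_bytes = original.encode("utf-8")

# Decoding those bytes with the wrong codec turns each byte
# into a separate windows-1252 character.
garbled = utf8_bytes.decode("windows-1252")
print(garbled)

# The damage is reversible as long as nothing was dropped:
# re-encode with the wrong codec, then decode with the right one.
repaired = garbled.encode("windows-1252").decode("utf-8")
print(repaired)
```

This only demonstrates why the text is garbled. The cleaner fix is to give the parser the right encoding up front.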
To fix this you have two options:
Pass the correct from_encoding parameter, or use exclude_encodings to rule out the wrong encoding that Unicode, Dammit is guessing. One problem is that not all parsers support the exclude_encodings argument. For example, the html5lib tree builder doesn't support it, so pass from_encoding instead:
>>> soup = BeautifulSoup(doc, 'html5lib', from_encoding='utf-8')
>>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
>>> for line in content:
...     print(line)
...
01 Oyasumi
02 DanSin'
03 w.t.s.
04 Lovism
05 NoName
06 Gakkou
07 Happy☆Day
08 Endless End.
Use the lxml parser:
>>> soup = BeautifulSoup(doc, 'lxml')
>>> soup.original_encoding
'utf-8'
>>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
>>> for line in content:
...     print(line)
...
01 Oyasumi
02 DanSin'
03 w.t.s.
04 Lovism
05 NoName
06 Gakkou
07 Happy☆Day
08 Endless End.
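Both options assume the markup reaches Beautiful Soup as bytes. Encoding detection only runs on byte input, and from_encoding is ignored when the markup is already a Unicode string. The sketch below shows this for Python 3, under a few assumptions: the sample markup is a shortened stand-in for the real page, the helper name blockquote_lines is hypothetical, and the built-in html.parser is swapped in for html5lib or lxml so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: shortened markup as raw UTF-8 bytes, the way it
# would arrive from a file opened in binary mode or an HTTP response body.
raw = (
    '<blockquote class="postcontent restore ">'
    "06 Gakkou<br/>07 Happy☆Day<br/>08 Endless End."
    "</blockquote>"
).encode("utf-8")


def blockquote_lines(markup: bytes, encoding: str = "utf-8") -> list:
    """Return the stripped text lines of the first blockquote.

    from_encoding tells Beautiful Soup how to decode the bytes,
    so it does not have to guess.
    """
    soup = BeautifulSoup(markup, "html.parser", from_encoding=encoding)
    quote = soup.find("blockquote")
    return list(quote.stripped_strings)


lines = blockquote_lines(raw)
for line in lines:
    print(line)
```

Every item yielded by stripped_strings here is a str, not a bytes object, and the star character survives because the bytes were decoded as UTF-8.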