
Beautifulsoup4 Stripped_strings Gives Me Byte Objects?

I'm trying to get the text out of a blockquote which looks like this:
01 Oyasumi
02 DanSin' &l

Solution 1:

The problem is that when from_encoding is not specified, Beautiful Soup has to guess the document's original encoding before converting it to Unicode. It does this with a sub-library called Unicode, Dammit, and here the guess is wrong. More info is in the Encodings section of the documentation.

>>> from bs4 import BeautifulSoup
>>> doc = '''<blockquote class="postcontent restore ">
...     01 Oyasumi
...     <br></br>
...     02 DanSin'
...     <br></br>
...     03 w.t.s.
...     <br></br>
...     04 Lovism
...     <br></br>
...     05 NoName
...     <br></br>
...     06 Gakkou
...     <br></br>
...     07 Happy☆Day
...     <br></br>
...     08 Endless End.
... </blockquote>'''
>>> soup = BeautifulSoup(doc, 'html5lib')
>>> soup.original_encoding
u'windows-1252'
>>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
>>> for line in content:
...     print(line)
...
01 Oyasumi
02 DanSin'
03 w.t.s.
04 Lovism
05 NoName
06 Gakkou
07 Happyâ˜†Day
08 Endless End.
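
The garbled seventh line is what UTF-8 bytes look like when they are decoded as windows-1252. As a minimal sketch using only the standard library (no Beautiful Soup involved; the variable names are illustrative and not part of the original answer), the effect can be reproduced like this:

```python
# UTF-8 bytes for one of the track titles from the example above.
raw = "07 Happy☆Day".encode("utf-8")

# Decoding with the wrong codec splits the 3-byte star into three characters.
wrong = raw.decode("windows-1252")

# Decoding with the codec the bytes were written in restores the star.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))

# The damage is reversible here because every byte survived the round trip.
repaired = wrong.encode("windows-1252").decode("utf-8")
print(repaired == right)
```

The star occupies three bytes in UTF-8, so the mis-decoded string is two characters longer than the correct one. Both fixes below come down to making sure the bytes are decoded as UTF-8 in the first place.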

To fix this you have two options:

  1. Pass in the correct from_encoding parameter, or exclude the wrong encoding that Dammit is guessing with exclude_encodings. One problem is that not all parsers support the exclude_encodings argument. For example, the html5lib tree builder doesn't support it.

    >>> soup = BeautifulSoup(doc, 'html5lib', from_encoding='utf-8')
    >>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
    >>> for line in content:
    ...     print(line)
    ...
    01 Oyasumi
    02 DanSin'
    03 w.t.s.
    04 Lovism
    05 NoName
    06 Gakkou
    07 Happy☆Day
    08 Endless End.
    >>>
  2. Use the lxml parser, which detects UTF-8 for this document:

    >>> soup = BeautifulSoup(doc, 'lxml')
    >>> soup.original_encoding
    'utf-8'
    >>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
    >>> for line in content:
    ...     print(line)
    ...
    01 Oyasumi
    02 DanSin'
    03 w.t.s.
    04 Lovism
    05 NoName
    06 Gakkou
    07 Happy☆Day
    08 Endless End.
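
For reference, option 1 could also be written as a standalone Python 3 script rather than a REPL session. This is only a sketch under a few assumptions that are not in the original answer: the markup arrives as UTF-8 bytes, the sample is shortened to three tracks, and the built-in html.parser is used so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Raw bytes, as they would arrive from a file or an HTTP response (assumed UTF-8).
doc = """<blockquote class="postcontent restore ">
    06 Gakkou
    <br>
    07 Happy☆Day
    <br>
    08 Endless End.
</blockquote>""".encode("utf-8")

# State the encoding explicitly so that no guessing takes place.
soup = BeautifulSoup(doc, "html.parser", from_encoding="utf-8")

# class_ matches a single CSS class, so the trailing space in the attribute is irrelevant.
quote = soup.find("blockquote", class_="postcontent")

# stripped_strings is a generator of text strings, not bytes.
lines = list(quote.stripped_strings)
for line in lines:
    print(line)
```

Note that from_encoding only applies when the markup is passed as bytes. If the markup is already a str, it has been decoded before Beautiful Soup sees it, and any mis-decoding has to be fixed at the point where the bytes were read.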
    
