
Beautiful Soup Replaces < With &lt;

I've found the text I want to replace, but when I print soup the format gets changed.
<div id='content'>stuff here</div>
becomes &lt;div id='content'&gt;stuff h

Solution 1:

The found object is not a Python string; it's a Tag that just happens to have a nice string representation. You can verify this by running

type(found)

A Tag is part of the hierarchy of objects that Beautiful Soup creates for you to be able to interact with the HTML. Another such object is NavigableString. NavigableString is a lot like a string, but it can only contain things that would go into the content portion of the HTML.
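As a rough illustration of the two object types, here is a minimal sketch. The HTML snippet and the way found is obtained (soup.find with an id) are assumptions made for the example, since the question's original code is not shown here.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical document, invented for this illustration.
soup = BeautifulSoup('<div id="content">old text</div>', "html.parser")

found = soup.find(id="content")  # the element itself
text = found.string              # the single text node inside the element

print(type(found))
print(type(text))
print(isinstance(found, Tag), isinstance(text, NavigableString))  # True True
```

NavigableString subclasses Python's str, which is why it behaves so much like an ordinary string.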

When you do

found.replace_with('<div id="content">stuff here</div>')

you are asking the Tag to be replaced with a NavigableString containing that literal text. The only way for HTML to display that string as text is to escape the angle brackets as &lt; and &gt;, which is exactly what Beautiful Soup is doing.
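The escaping can be reproduced with a small sketch. The input document and the id "target" are made up for the example; only the replacement string comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical document, invented for this illustration.
soup = BeautifulSoup('<p>before</p><span id="target">old</span>', "html.parser")
found = soup.find(id="target")

# A plain str is inserted as text, not parsed as markup.
found.replace_with('<div id="content">stuff here</div>')

print(soup)
# <p>before</p>&lt;div id="content"&gt;stuff here&lt;/div&gt;

print(soup.find("div"))  # None: no <div> element was created
```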

Instead of that mess, you probably want to keep your Tag, and replace only its content:

found.string.replace_with('stuff here')

Notice that the correct replacement does not attempt to overwrite the tags.
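A minimal sketch of that replacement follows, again with an invented input document. The second half is an alternative that the answer above does not describe: if you really do need to swap in new markup, one option is to parse the markup first and replace with the resulting Tag rather than with a string.

```python
from bs4 import BeautifulSoup

# Hypothetical document, invented for this illustration.
soup = BeautifulSoup('<div id="content">old text</div>', "html.parser")
found = soup.find(id="content")

# Replace only the text node; the surrounding <div> is untouched.
found.string.replace_with("stuff here")
print(soup)  # <div id="content">stuff here</div>

# Alternative (an assumption about what you might want, not part of the
# answer above): parse the new markup and replace with the parsed Tag.
soup2 = BeautifulSoup('<span id="target">old</span>', "html.parser")
fragment = BeautifulSoup('<div id="content">stuff here</div>', "html.parser")
soup2.find(id="target").replace_with(fragment.div)
print(soup2)  # <div id="content">stuff here</div>
```

Note that found.string is None when the tag has more than one child, so this pattern only works when the tag contains a single text node.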

When you do found.replace_with(...), the object referred to by the name found gets replaced in the parent hierarchy. However, the name found keeps pointing to the same outdated object as before. That is why printing soup shows the update, but printing found does not.
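The stale reference is easy to see in a short sketch (the document and the id "target" are assumptions for the example):

```python
from bs4 import BeautifulSoup

# Hypothetical document, invented for this illustration.
soup = BeautifulSoup('<p><span id="target">old</span></p>', "html.parser")
found = soup.find(id="target")

found.replace_with("new text")

print(soup)          # <p>new text</p>
print(found)         # <span id="target">old</span>
print(found.parent)  # None: the old element is detached from the tree
```

If you need to keep working with the new content, look it up again from soup instead of reusing found.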
