HTML Elements In Lxml Get Incorrectly Encoded Like &#1053;&#1072;&#1081;

July 09, 2022

I need to print the RSS link from a web page, but the link is encoded incorrectly. Here is my code:

import urllib2
from lxml import html, etree
import chardet
data = urllib2.urlopen(

Solution 1:

Here is a simplified version of your program that works:

from lxml import html

url = 'http://facts-and-joy.ru/'
content = html.parse(url)
rsslinks = content.xpath('//link[@type="application/rss+xml"]')

for link in rsslinks:
    print link.get('title')
    print html.tostring(link, encoding="utf-8")

Output:

Позитивное мышление RSS Feed
<link rel="alternate" type="application/rss+xml" title="Позитивное мышление RSS Feed" href="http://facts-and-joy.ru/feed/">&#13;

The crucial line is

print html.tostring(link, encoding="utf-8")

That is the only thing you must change in your original program.
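If you are on Python 3, the same idea can be sketched as follows. This is only an illustration, not part of the original program: the inline PAGE string is a hypothetical stand-in for http://facts-and-joy.ru/ so that the snippet runs without network access, and it assumes lxml is installed.

```python
from lxml import html

# Hypothetical inline page standing in for http://facts-and-joy.ru/,
# so the example runs without network access.
PAGE = (
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" '
    'title="Позитивное мышление RSS Feed" '
    'href="http://facts-and-joy.ru/feed/">'
    '</head><body></body></html>'
)

doc = html.fromstring(PAGE)
rsslinks = doc.xpath('//link[@type="application/rss+xml"]')

for link in rsslinks:
    title = link.get('title')
    # With encoding="utf-8", tostring() returns UTF-8 encoded bytes on Python 3.
    serialized = html.tostring(link, encoding="utf-8")
    print(title)
    # Decode the bytes before printing so the Cyrillic text is shown as text.
    print(serialized.decode("utf-8"))
```

On Python 3, print is a function and html.tostring(..., encoding="utf-8") gives you bytes, so the result is decoded before printing. Passing encoding="unicode" instead should give you a str directly.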

Using html.tostring() instead of etree.tostring() produces actual characters instead of numeric character references. You could also use etree.tostring(link, method="html", encoding="utf-8").
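To see what a numeric character reference is, here is a small sketch that uses only the Python 3 standard library. It does not show what lxml does internally; it only illustrates that text like &#1053;&#1072;&#1081; and the actual characters are two spellings of the same string. Note that the html module imported here is the standard library one, not lxml.html.

```python
import html

text = "Най"

# ASCII cannot represent Cyrillic letters directly, so with the
# "xmlcharrefreplace" error handler each letter is replaced by a
# decimal numeric character reference.
as_references = text.encode("ascii", "xmlcharrefreplace")
print(as_references)            # b'&#1053;&#1072;&#1081;'

# UTF-8 can represent the letters, so the actual characters are kept
# (as multi-byte sequences).
as_utf8 = text.encode("utf-8")
print(as_utf8.decode("utf-8"))  # Най

# html.unescape() turns the references back into the same text,
# which is what a browser does when it renders them.
print(html.unescape(as_references.decode("ascii")) == text)  # True
```

So output containing numeric character references is still valid markup. It is just harder to read than the UTF-8 form.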

It is not clear why the "html" and "xml" output methods differ in this way. This post to the lxml mailing list did not get any replies: https://mailman-mail5.webfaction.com/pipermail/lxml/2011-September/006131.html
