HTML Elements In Lxml Get Incorrectly Encoded Like &#1053;&#1072;&#1081;
I need to print the RSS link from a web page, but the link comes out incorrectly encoded. Here is my code:
import urllib2
from lxml import html, etree
import chardet
data = urllib2.urlopen(
Solution 1:
Here is a simplified version of your program that works:
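from lxml import html

url = 'http://facts-and-joy.ru/'
# Let lxml fetch and parse the page
content = html.parse(url)
# Select every <link> element that advertises an RSS feed
rsslinks = content.xpath('//link[@type="application/rss+xml"]')
for link in rsslinks:
    print link.get('title')
    # Serialize with the HTML method and an explicit encoding
    print html.tostring(link, encoding="utf-8")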
Output:
Позитивное мышление RSS Feed
<link rel="alternate" type="application/rss+xml" title="Позитивное мышление RSS Feed" href="http://facts-and-joy.ru/feed/">
The crucial line is:
print html.tostring(link, encoding="utf-8")
That is the only thing you must change in your original program.
Using html.tostring() instead of etree.tostring() produces actual characters instead of numeric character references. You could also use etree.tostring(link, method="html", encoding="utf-8").
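To see the difference side by side without hitting the live site, here is a small sketch. It assumes Python 3 (so print is a function and the serializers return bytes unless told otherwise) and uses a hypothetical inline PAGE string in place of the downloaded document; the title and href are copied from the output above. It only shows what the calls return in this setup, not why libxml2 behaves this way.

```python
from lxml import etree, html

# Hypothetical stand-in for the downloaded page, so no network access is needed.
PAGE = (
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" '
    'title="Позитивное мышление RSS Feed" href="http://facts-and-joy.ru/feed/">'
    '</head><body></body></html>'
)

doc = html.document_fromstring(PAGE)
link = doc.xpath('//link[@type="application/rss+xml"]')[0]

# No encoding given: the result is ASCII-only bytes, so the Cyrillic title
# cannot appear literally and is written as character references.
ascii_bytes = etree.tostring(link)

# HTML output method with an explicit UTF-8 encoding keeps the characters.
utf8_bytes = html.tostring(link, encoding="utf-8")
same_via_etree = etree.tostring(link, method="html", encoding="utf-8")

# encoding="unicode" returns text (str) rather than bytes.
text = html.tostring(link, encoding="unicode")

print(ascii_bytes.decode("ascii"))
print(utf8_bytes.decode("utf-8"))
print(text)
```

On Python 3, remember to decode the UTF-8 bytes (or ask for encoding="unicode") before printing, otherwise you will see the b'...' representation instead of the text.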
It is not clear why this difference exists between the "html" and "xml" output methods. This post to the lxml mailing list didn't get any replies: https://mailman-mail5.webfaction.com/pipermail/lxml/2011-September/006131.html.