
HTML Elements In Lxml Get Incorrectly Encoded Like Най

I need to print the RSS link from a web page, but the link comes out incorrectly encoded. Here is my code:

import urllib2
from lxml import html, etree
import chardet

data = urllib2.urlopen(

Solution 1:

Here is a simplified version of your program that works:

from lxml import html

url = 'http://facts-and-joy.ru/'
content = html.parse(url)
rsslinks = content.xpath('//link[@type="application/rss+xml"]')

for link in rsslinks:
    print link.get('title')
    print html.tostring(link, encoding="utf-8")

Output:

Позитивное мышление RSS Feed
<link rel="alternate" type="application/rss+xml" title="Позитивное мышление RSS Feed" href="http://facts-and-joy.ru/feed/">&#13;

The crucial line is

print html.tostring(link, encoding="utf-8")

That is the only line you need to change in your original program.

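html.tostring() serializes with the "html" output method, while etree.tostring() defaults to the "xml" method. With the "html" method and encoding="utf-8", the title is written as actual characters instead of numeric character references (for example, &#1053;&#1072;&#1081; for Най). You could also use etree.tostring(link, method="html", encoding="utf-8").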

It is not clear why the "html" and "xml" output methods differ in this way. A post to the lxml mailing list asking about it didn't get any replies: https://mailman-mail5.webfaction.com/pipermail/lxml/2011-September/006131.html

