
Unicode Problems With Web Pages In Python's Urllib

I seem to have the all-familiar problem of correctly reading and viewing a web page. It looks like Python reads the page in UTF-8 but when I try to convert it to something more vie

Solution 1:

As noted by Lennart, your problem is not the decoding. It is trying to encode into "ascii", which is often a problem with print statements. I suspect the line

print str

is your problem. You need to encode the str into whatever your console is using to have that line work.
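A minimal sketch of that idea, not the asker's actual code: the helper name `encode_for_console`, the sample string and the UTF-8 fallback are all assumptions made for illustration. It encodes with whatever encoding the console reports and replaces characters the console cannot represent.

```python
import sys

def encode_for_console(text, fallback="utf-8"):
    """Encode unicode text with the console's encoding (hypothetical helper)."""
    # sys.stdout.encoding can be None, e.g. when output is piped or redirected;
    # falling back to UTF-8 here is an assumption, pick what suits your output.
    encoding = getattr(sys.stdout, "encoding", None) or fallback
    # errors="replace" substitutes '?' for characters the encoding lacks
    return text.encode(encoding, "replace")

data = encode_for_console(u"J\xe4rvinen")  # made-up sample text with a-with-diaeresis
print(repr(data))                          # show the encoded bytes safely
```

In Python 2 you would then `print` the returned byte string in place of the unicode object.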

Solution 2:

It doesn't look like Python is "reading it in UTF-8" at all. As already pointed out, you have an encoding problem, NOT a decoding problem. That error cannot have come from the line you say it did. When asking a question like this, always give the full traceback and error message.

Kathy's suspicion is correct; in fact the print str line is the only possible source of that error, and that can only happen when sys.stdout.encoding is not set so Python punts on 'ascii'.
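To see what that fallback amounts to, here is a rough sketch that performs the 'ascii' encode explicitly instead of leaving it to print. The sample string is made up; only the exception type and codec match the question.

```python
text = u"J\xe4rvinen"  # made-up sample containing a-with-diaeresis (u'\xe4')

try:
    # Roughly what Python 2's print does to a unicode object when
    # sys.stdout.encoding is None: encode it with the 'ascii' codec.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # Keep the details; the exception records the codec and the position
    failed = (exc.encoding, exc.start)
    print("cannot encode with %s at position %d" % failed)
```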

Variables that may affect the outcome are what version of Python you are using, what platform you are running on and exactly how you run your script -- none of which you have told us; please do.

Example: I'm using Python 2.6.2 on Windows XP and I'm running your script with some diagnostic additions: (1) import sys; print sys.stdout.encoding up near the front (2) print repr(str) before print str so that I can see what you've got before it crashes.
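Those two diagnostics could look something like the sketch below. The function name `describe_output` and the sample text are assumptions; the escaping is done with the `unicode_escape` codec so the diagnostic itself cannot trigger the same encode error.

```python
import sys

def describe_output(text):
    """Return stdout's encoding and an ASCII-only escaped form of text."""
    encoding = getattr(sys.stdout, "encoding", None)
    # unicode_escape turns non-ASCII characters into backslash escapes,
    # so the result is safe to print on any console
    escaped = text.encode("unicode_escape").decode("ascii")
    return encoding, escaped

enc, shown = describe_output(u"J\xe4rvinen")  # made-up sample text
print(enc)    # the console encoding, or None when it is not set
print(shown)
```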

In a Command Prompt window, if I do \python26\python hockey.py it prints cp850 as the encoding and just works.

However if I do

\python26\python hockey.py | more

or

\python26\python hockey.py >hockey.txt

it prints None as the encoding and crashes with your error message on the first line with the a-with-diaeresis:

C:\junk>\python26\python hockey.py >hockey.txt
Traceback (most recent call last):
  File "hockey.py", line 18, in <module>
    print str
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 2: ordinal not in range(128)

If that fits your case, the fix in general is to explicitly encode your output with an encoding suited to the display mechanism you plan to use.
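One way to do that, sketched under assumptions: wrap the byte stream in a writer from the standard `codecs` module so every write is encoded with an encoding you chose. The helper name `make_text_writer` and the `BytesIO` buffer are stand-ins; in a real script the stream would be your output file or the redirected stdout.

```python
import codecs
import io

def make_text_writer(byte_stream, encoding="utf-8"):
    """Wrap a byte stream so unicode written to it is encoded explicitly."""
    # codecs.getwriter returns a StreamWriter class for the given encoding
    return codecs.getwriter(encoding)(byte_stream, errors="replace")

buf = io.BytesIO()                      # stands in for a file or redirected stdout
writer = make_text_writer(buf, "cp850") # cp850 is the console encoding reported above
writer.write(u"J\xe4rvinen\n")          # made-up sample text

print(repr(buf.getvalue()))             # the bytes that would reach the output
```

Choosing the encoding explicitly means the result no longer depends on whether sys.stdout.encoding happens to be set.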

Solution 3:

That text is indeed ISO-8859-1, and I can decode it without a problem, and indeed your code runs without a hitch.

Your error, however, is an ENCODE error, not a decode error, and you don't do any encoding in your code. Possibly you have gotten encoding and decoding confused; it's a common problem.

You DECODE from Latin-1 to Unicode. You ENCODE the other way. Remember that Latin-1, UTF-8, etc. are called "encodings".
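A small sketch of the two directions, using made-up sample bytes rather than the page from the question:

```python
raw = b"J\xe4rvinen"             # bytes as a Latin-1 page might deliver them

text = raw.decode("iso-8859-1")  # DECODE: bytes -> Unicode text
utf8 = text.encode("utf-8")      # ENCODE: Unicode text -> bytes (here UTF-8)

print(repr(raw))
print(repr(utf8))                # the a-with-diaeresis now takes two bytes
```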
