UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I’m having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.
The problem is that the error is not always reproducible; it works with some pages, and with others it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.
One of the sections of code that is causing problems is shown below:
agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
Here is a stack trace produced on SOME strings when the snippet above is run:
Traceback (most recent call last):
  File "foobar.py", line 792, in <module>
    p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption, so there are no issues relating to internationalization or dealing with text written in anything other than English.
Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?
This is a classic Python Unicode pain point! Consider the following:
a = u'bats\u00E0'
print a
=> batsà
All good so far, but if we call str(a), let’s see what happens:
str(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
Oh dip, that’s not gonna do anyone any good! To fix the error, encode the unicode string explicitly with .encode and tell Python what codec to use:
a.encode('utf-8')
=> 'bats\xc3\xa0'
print a.encode('utf-8')
=> batsà
The issue is that when you call str() on a unicode object, Python 2 tries to encode it with the default character encoding (ASCII), and that fails whenever the text contains a non-ASCII character, which in your case happens only on some pages. To fix the problem, you have to tell Python which encoding to use by calling .encode('whatever_codec'). Most of the time, you should be fine using utf-8.
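To make the difference between the implicit and the explicit conversion concrete, here is a minimal sketch. It is written so that it should run on both Python 2.7 and Python 3, and the variable names are only for illustration. It assumes the interpreter's default encoding is ASCII, so that the implicit conversion done by str() on Python 2 behaves like .encode('ascii').

```python
text = u'bats\u00e0'  # a unicode string containing one non-ASCII character

# On Python 2, str(text) implicitly does the equivalent of this
# (assuming the default encoding has not been changed from ASCII):
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Naming a codec that can represent the character works.
encoded = text.encode('utf-8')

# Decoding with the same codec gives back the original text.
round_tripped = encoded.decode('utf-8')

print(ascii_failed)           # True
print(round_tripped == text)  # True
```

The point of the sketch is that the failure comes from the choice of codec, not from the text itself: the same string encodes cleanly once a codec that covers the character is named.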
For an excellent exposition on this topic, see Ned Batchelder’s PyCon talk here: http://nedbatchelder.com/text/unipain.html
You need to read the Python Unicode HOWTO. This error is the very first example.
Basically, stop using str to convert from unicode to encoded text / bytes. Instead, properly use .encode() to encode the string:
p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()
or work entirely in unicode.
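"Working entirely in unicode" usually means decoding bytes once where they enter the program, handling only unicode strings in between, and encoding once where bytes are needed again. The following is a rough sketch of that idea, not the asker's actual code: the helper build_agent_info and the sample values are hypothetical, and the input bytes are assumed to be UTF-8.

```python
def build_agent_info(agent_contact, agent_telno):
    """Join two unicode strings and strip surrounding whitespace.

    Hypothetical helper: both arguments are assumed to be unicode already.
    """
    return u' '.join((agent_contact, agent_telno)).strip()


# Input boundary: decode raw bytes once (UTF-8 is an assumption here).
raw_contact = b'Jane Smith\xc2\xa0'    # ends with an encoded non-breaking space
contact = raw_contact.decode('utf-8')  # now contains u'\xa0'
telno = u'020 7946 0000'               # made-up sample value

# Middle: only unicode strings are combined, so nothing is implicitly
# pushed through the ASCII codec.
info = build_agent_info(contact, telno)

# Output boundary: encode once, when bytes are actually required.
info_bytes = info.encode('utf-8')
```

With this layout the u'\xa0' from the original traceback passes through untouched and is only turned into bytes at the very end, with a codec that can represent it.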
Well, I tried everything and nothing helped. After googling around, I came up with the following, and it worked for me.
I am using Python 2.7.
# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
I found an elegant workaround that removes the offending symbols and keeps the string as a string:
yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')
It’s important to note that using the ignore option is dangerous because it silently drops any Unicode (and internationalization) support from the code that uses it, as seen here (convert unicode):
>>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'
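If the output really has to be ASCII, the standard codecs offer other error handlers that lose less information than ignore, and unicodedata.normalize can fold some accented letters to their base letter first. The comparison below is only a sketch (the variable names are illustrative), and the NFKD trick only helps for characters that decompose into an ASCII letter plus combining marks.

```python
import unicodedata

city = u'City: Malm\u00f6'  # same text as above, written with an escape

# 'ignore' silently drops the character.
dropped = city.encode('ascii', 'ignore').decode('ascii')

# 'replace' leaves a visible marker where something was lost.
marked = city.encode('ascii', 'replace').decode('ascii')

# 'xmlcharrefreplace' keeps the information as a numeric character reference.
escaped = city.encode('ascii', 'xmlcharrefreplace').decode('ascii')

# NFKD splits the letter into 'o' plus a combining diaeresis;
# 'ignore' then drops only the combining mark.
folded = unicodedata.normalize('NFKD', city).encode('ascii', 'ignore').decode('ascii')

print(dropped)  # City: Malm
print(marked)   # City: Malm?
print(escaped)  # City: Malm&#246;
print(folded)   # City: Malmo
```

Which handler is appropriate depends on where the text is going; none of them is a substitute for keeping the text as unicode and encoding to UTF-8 when that is an option.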