Remove Offending Characters From Strings In List

Sample data to parse (a list of unicode strings): [u'\n', u'1\xa0', u'Some text here.', u'\n', u'1\xa0', u'Some more text here.', u'\n', u'1\xa0', u'Some more text here.'] I want

Solution 1:

The problem is different in each version of your code. Let's start with this:

newli = re.sub(x, '', li)
l[li].replace(newli)

First, newli is already the line you want—that's what re.sub does—so you don't need replace here at all. Just assign newli.

Second, l[li] isn't going to work, because li is the value of the line, not the index.


In this version, it's a bit more subtle:

li = re.sub(x, '', li)

re.sub is returning a new string, and you're assigning that string to li. But that doesn't affect anything in the list, it's just saying "li no longer refers to the current line in the list, it now refers to this new string".
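
A minimal sketch of the difference, using a made-up two-element list (the names lines and unchanged are only for illustration): rebinding the loop variable leaves the list alone, while assigning through the index changes it.

```python
import re

lines = [u'1\xa0', u'Some text here.']

# Rebinding the loop variable only changes what the name `li` refers to;
# the list keeps holding the original strings.
for li in lines:
    li = re.sub(u'\xa0', u'', li)
unchanged = list(lines)  # snapshot: still contains u'1\xa0'

# Assigning through the index stores the new string in the list itself.
for index, li in enumerate(lines):
    lines[index] = re.sub(u'\xa0', u'', li)

print(unchanged == [u'1\xa0', u'Some text here.'])  # True
print(lines == [u'1', u'Some text here.'])          # True
```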


The only way to replace the list elements is by index, using the [] operator. To get the index, use enumerate.

So:

import re

def remove_from_list(l, x):
  for index, li in enumerate(l):
    l[index] = re.sub(x, '', li)
  return l

But really, you probably do want to use str.replace—it's just that you want to use it instead of re.sub:

def remove_from_list(l, x):
  for index, li in enumerate(l):
    l[index] = li.replace(x, '')
  return l

Then you don't have to worry about what happens if x is a special character in regular expressions.
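
To see why that matters, here is a small sketch. The sample string and the choice of '.' as the character to remove are assumptions picked for illustration. If you do need re.sub for some other reason, re.escape makes the pattern literal.

```python
import re

text = u'Some text here.'
x = u'.'  # meant as a literal dot, but in a regex '.' matches any character

as_regex = re.sub(x, u'', text)            # every character matches, so nothing is left
as_literal = text.replace(x, u'')          # only the literal dot is removed
escaped = re.sub(re.escape(x), u'', text)  # escaping the pattern gives the literal behaviour

print(repr(as_regex))
print(repr(as_literal))
print(repr(escaped))
```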


Also, in Python, you almost never want to modify an object in-place, and also return it. Either modify it and return None, or return a new copy of the object. So, either:

def remove_from_list(l, x):
  for index, li in enumerate(l):
    newli = li.replace(x, '')
    l[index] = newli

… or:

def remove_from_list(l, x):
  new_list = []
  for li in l:
    newli = li.replace(x, '')
    new_list.append(newli)
  return new_list

And you can simplify the latter to a list comprehension, as in unutbu's answer:

def remove_from_list(l, x):
  new_list = [li.replace(x, '') for li in l]
  return new_list

The fact that the second one is easier to write (no need for enumerate, has a handy shortcut, etc.) is no coincidence—it's usually the one you want, so Python makes it easy.
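
The built-in list type follows the same convention described above, which may make it easier to remember: sorted() returns a new list and leaves its argument alone, while list.sort() changes the list in place and returns None. A short sketch with made-up data:

```python
original = [u'b', u'a', u'c']

# sorted() returns a new list; the argument is not modified.
new_list = sorted(original)
snapshot = list(original)  # copy taken before the in-place sort below

# list.sort() reorders the list in place and returns None.
result = original.sort()

print(new_list, snapshot, result, original)
```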


I don't know how else to make this clearer, but one last try:

If you choose the version that returns a fixed-up new copy of the list instead of modifying the list in-place, your original list will not be modified in any way. If you want to use the fixed-up new copy, you have to use the return value of the function. For example:

>>> def remove_from_list(l, x):
...     new_list = [li.replace(x, '') for li in l]
...     return new_list
...
>>> a = [u'\n', u'1\xa0']
>>> b = remove_from_list(a, u'\xa0')
>>> a
[u'\n', u'1\xa0']
>>> b
[u'\n', u'1']

The problem you're having with your actual code, where everything turns into a list of 1-character and 0-character strings, is that you don't actually have a list of strings in the first place; you have one string that's a repr of a list of strings. So, for li in l means "for each character li in the string l", not "for each string li in the list l".
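
A sketch of what that looks like, with a hypothetical raw string standing in for your data. Using ast.literal_eval to turn the repr back into a list is a suggestion of mine, not something from the question, and it assumes the string is a valid Python literal. If possible, it is better to fix whatever code stringified the list in the first place.

```python
import ast

# One string that merely looks like a list (hypothetical input).
raw = "[u'\\n', u'1\\xa0']"

# Iterating over a string yields single characters, not the list items.
first_chars = [c for c in raw][:3]

# ast.literal_eval safely parses the repr back into a real list of strings.
parsed = ast.literal_eval(raw)
cleaned = [item.replace(u'\xa0', u'') for item in parsed]

print(first_chars)
print(cleaned == [u'\n', u'1'])  # True
```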


Solution 2:

Another option, if you're only interested in keeping ASCII characters (this also happens to work for the posted example):

[text.encode('ascii', 'ignore') for text in your_list]
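
Two caveats, sketched below with made-up sample strings. On Python 3, encode returns bytes, so you need to decode again to get text back. And 'ignore' drops every non-ASCII character, not just \xa0, so accented letters disappear too.

```python
items = [u'1\xa0', u'caf\xe9']

# encode(..., 'ignore') silently drops anything outside ASCII;
# decode('ascii') turns the resulting bytes back into text.
ascii_only = [text.encode('ascii', 'ignore').decode('ascii') for text in items]

print(ascii_only == [u'1', u'caf'])  # True: the accented e is lost as well
```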

Solution 3:

You could use a list comprehension and str.replace:

>>> items
[u'\n',
 u'1\xa0',
 u'Some text here.',
 u'\n',
 u'1\xa0',
 u'Some more text here.',
 u'\n',
 u'1\xa0',
 u'Some more text here.']
>>> [item.replace(u'\xa0', u'') for item in items]
[u'\n',
 u'1',
 u'Some text here.',
 u'\n',
 u'1',
 u'Some more text here.',
 u'\n',
 u'1',
 u'Some more text here.']