
Parsing Xml In Python With Regex

I'm trying to use regex to parse an XML file (in my case this seems the simplest way). For example, a line might be: line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

Solution 1:

You normally don't want to use re.match. Quoting from the docs:

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

Note:

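>>> import re
>>> line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
>>> print(re.match('>.*<', line))
None
>>> print(re.search('>.*<', line))
<re.Match object; span=(11, 38), match='>PLAINSBORO, NJ 08536-1906<'>
>>> print(re.search('>.*<', line).group(0))
>PLAINSBORO, NJ 08536-1906<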
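If you only want the text between the tags, a capturing group saves you from stripping the angle brackets afterwards. This is just a rough sketch, assuming each line holds a single element with no attributes; the helper name `extract_text` is made up for illustration:

```python
import re

line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

def extract_text(xml_line, tag):
    """Return the text inside <tag>...</tag>, or None if the tag is absent."""
    # re.escape guards against regex metacharacters in the tag name;
    # (.*?) is non-greedy, so it stops at the first matching closing tag.
    pattern = r'<{0}>(.*?)</{0}>'.format(re.escape(tag))
    m = re.search(pattern, xml_line)
    return m.group(1) if m else None

print(extract_text(line, 'City_State'))
```

This will not cope with attributes, nested elements of the same name, entities such as `&amp;`, or elements that span several lines, which is where a real parser earns its keep.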

Also, why parse XML with regex when you can use something like BeautifulSoup? :)

>>> from bs4 import BeautifulSoup as BS
>>> line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
>>> # HTML parsers lowercase tag names, hence 'city_state' below
>>> soup = BS(line, 'html.parser')
>>> print(soup.find('city_state').text)
PLAINSBORO, NJ 08536-1906

Solution 2:

Please, just use an XML parser like ElementTree:

>>> from xml.etree import ElementTree as ET
>>> line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
>>> ET.fromstring(line).text
'PLAINSBORO, NJ 08536-1906'
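The same approach extends to a whole document rather than a single line. Here is a minimal sketch; the `Address` and `Name` elements are invented for illustration, and only `City_State` comes from the question:

```python
from xml.etree import ElementTree as ET

# Hypothetical document: only the City_State element is from the question.
doc = (
    "<Address>"
    "<Name>Example Recipient</Name>"
    "<City_State>PLAINSBORO, NJ 08536-1906</City_State>"
    "</Address>"
)

root = ET.fromstring(doc)

# findtext returns the text of the first matching child element,
# or the given default when no such child exists.
city_state = root.findtext('City_State')
missing = root.findtext('Zip', default='')

print(city_state)

# Iterating over an element yields its direct children in document order.
for child in root:
    print(child.tag, '->', child.text)
```

Unlike a regex, this keeps working if the file is later reformatted, gains attributes, or puts several elements on one line.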

Solution 3:

re.match returns a match only if the pattern matches at the beginning of the string. It does not have to match the entire string; that is what re.fullmatch is for. To find a match anywhere in the string, use re.search.
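A small sketch of how the three functions behave on the example line (the variable names are only for illustration): `re.match` anchors at position 0, `re.search` scans the whole string, and `re.fullmatch` requires the pattern to cover the string from start to end.

```python
import re

line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

# match: the pattern must fit starting at position 0.
at_start = re.match(r'>.*<', line)           # None, the line starts with '<'
prefix = re.match(r'<City_State>', line)     # matches, although the line continues

# search: the pattern may fit anywhere in the string.
anywhere = re.search(r'>.*<', line)

# fullmatch: the pattern must cover the entire string.
whole = re.fullmatch(r'<City_State>', line)  # None, text follows the tag

print(at_start, prefix.group(0), anywhere.group(0), whole)
```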

And yes, this is a simple way to parse XML, but I would highly encourage you to use a library specifically designed for the task.
