How Do I Capture All Of The Element Names Of An XML File Using LXML In Python?

I am able to use lxml to accomplish most of what I would like to do, although it was a struggle to go through the obfuscating examples and tutorials. In short, I am able to read a

Solution 1:

I believe you are looking for element.xpath().

XPath is not a concept introduced by lxml but a general query language for selecting nodes from an XML document, supported by most tools and libraries that deal with XML. Think of it as something similar to CSS selectors, but more powerful (and a bit more complicated). See XPath Syntax.

Your document uses namespaces - I'll ignore that for now and explain at the end of the post how to deal with them, because it keeps the examples more readable that way (but they won't work as-is for your document).

So, for example,

tree.xpath('/net/endAddress')

would select the <endAddress>79.255.255.255</endAddress> element directly below the <net /> node, but not the <endAddress /> inside the <netBlock>.

The XPath expression

tree.xpath('//endAddress')

however would select all <endAddress /> nodes anywhere in the document.
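
To make the difference between the two expressions concrete, here is a minimal sketch. The sample XML is an assumption: a small, namespace-free stand-in shaped like the document discussed here, not the original file.

```python
from lxml import etree

# Hypothetical, namespace-free sample (not the original document).
SAMPLE = b"""<net>
  <endAddress>79.255.255.255</endAddress>
  <startAddress>79.0.0.0</startAddress>
  <netBlocks>
    <netBlock>
      <endAddress>79.255.255.255</endAddress>
      <startAddress>79.0.0.0</startAddress>
    </netBlock>
  </netBlocks>
</net>"""

tree = etree.fromstring(SAMPLE)

# Absolute path: only the <endAddress> that is a direct child of the root <net>.
direct = tree.xpath('/net/endAddress')

# '//' searches all descendants: every <endAddress> anywhere in the document.
anywhere = tree.xpath('//endAddress')

print(len(direct), len(anywhere))
```

With this sample, the first query matches one element and the second matches two.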

You can of course further query the nodes you get back with XPath expressions:

netblocks = tree.xpath('/net/netBlocks/netBlock')
for netblock in netblocks:
    start = netblock.xpath('./startAddress/text()')[0]
    end = netblock.xpath('./endAddress/text()')[0]
    print("%s - %s" % (start, end))

would give you

79.0.0.0 - 79.255.255.255

Notice that .xpath() always returns a list when the expression selects nodes, even if only one node matches - so if you want just one, account for that (for example by indexing with [0], as above).
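
One way to account for that is a small helper that returns the first match or a fallback. This is only a sketch: the name first() and the sample XML are made up for illustration and are not part of lxml.

```python
from lxml import etree

def first(node, expression, default=None):
    """Return the first result of a node-selecting XPath, or `default` if nothing matched."""
    results = node.xpath(expression)
    return results[0] if results else default

# Hypothetical sample document.
doc = etree.fromstring(b"<net><name>EXAMPLE-NET</name></net>")

name = first(doc, './name/text()')                        # a match exists
missing = first(doc, './handle/text()', default='(none)')  # no <handle> element

print(name, missing)
```

This avoids the IndexError you would get from indexing an empty result list with [0].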

You can also select elements by their attributes:

comment = tree.xpath('/net/comment')[0]
line_2 = comment.xpath("./line[@number='2']")[0]

This would select the <line /> element with number="2" from the first comment.

You can also select attributes themselves:

numbers = tree.xpath('//line/attribute::number')

['0', '1', '2']
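
As a self-contained sketch of both attribute techniques (the sample XML is an assumption, not the original document): @number is the abbreviated spelling of attribute::number, and .get() reads an attribute from an element you have already selected.

```python
from lxml import etree

# Hypothetical sample with numbered comment lines.
doc = etree.fromstring(b"""<net>
  <comment>
    <line number="0">first</line>
    <line number="1">second</line>
    <line number="2">third</line>
  </comment>
</net>""")

# Select the attribute values themselves (same as //line/attribute::number).
numbers = doc.xpath('//line/@number')

# Select an element by its attribute value, then read the attribute and text.
line_2 = doc.xpath("//line[@number='2']")[0]

print(numbers, line_2.get('number'), line_2.text)
```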

To get the list of element names you asked about last, you could do something like this:

names = [node.tag for node in tree.xpath('/net/*')]

['registrationDate', 'ref', 'endAddress', 'handle', 'name', 'netBlocks', 'orgRef', 'comment', 'startAddress', 'updateDate', 'version']

But given the power of XPath, it's probably better to just query the document for what you want to know from it, as specific or loose as you see fit.
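
If you do want every element name in the whole document rather than only the children of <net />, '//*' selects all elements at any depth. A minimal sketch, assuming a namespace-free sample; dict.fromkeys() is used here only to drop duplicates while keeping document order:

```python
from lxml import etree

# Hypothetical, namespace-free sample (not the original document).
doc = etree.fromstring(b"""<net>
  <name>EXAMPLE-NET</name>
  <netBlocks>
    <netBlock>
      <startAddress>79.0.0.0</startAddress>
      <endAddress>79.255.255.255</endAddress>
    </netBlock>
    <netBlock>
      <startAddress>80.0.0.0</startAddress>
      <endAddress>80.255.255.255</endAddress>
    </netBlock>
  </netBlocks>
</net>""")

# '//*' matches every element in the document, in document order.
all_tags = [el.tag for el in doc.xpath('//*')]

# Unique names, first occurrence wins.
unique_names = list(dict.fromkeys(all_tags))

print(unique_names)
```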

Now, namespaces. As you noticed, if your document uses XML namespaces, you need to take that into consideration in many places, and XPath is no different. When querying a namespaced document, you pass the xpath() method the namespace map like this:

NSMAP = {'ns':  'http://www.arin.net/whoisrws/core/v1',
         'ns2': 'http://www.arin.net/whoisrws/rdns/v1',
         'ns3': 'http://www.arin.net/whoisrws/netref/v2'}

names = [node.tag for node in tree.xpath('/ns:net/*', namespaces=NSMAP)]

In many other places in lxml you can specify the default namespace by using None as the dictionary key in the namespace map. Unfortunately not with xpath(): that will raise an exception:

TypeError: empty namespace prefix is not supported in XPath

So you unfortunately have to prefix every node name in your XPath expression with ns: (or whatever you choose to map that namespace to).
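
Here is a small sketch of what that looks like in practice. The sample XML is an assumption that reuses only the core namespace URI from above. Note that node.tag on a namespaced element comes back in the form {uri}name; etree.QName(node).localname is one way to strip the namespace part.

```python
from lxml import etree

NSMAP = {'ns': 'http://www.arin.net/whoisrws/core/v1'}

# Hypothetical sample using the core namespace as its default namespace.
doc = etree.fromstring(b"""<net xmlns="http://www.arin.net/whoisrws/core/v1">
  <name>EXAMPLE-NET</name>
  <endAddress>79.255.255.255</endAddress>
</net>""")

children = doc.xpath('/ns:net/*', namespaces=NSMAP)

# Tags include the namespace URI in curly braces.
tags = [node.tag for node in children]

# Plain element names without the namespace part.
local_names = [etree.QName(node).localname for node in children]

# Without the prefix, the query matches nothing in a namespaced document.
unprefixed = doc.xpath('/net/*')

print(tags, local_names, unprefixed)
```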

For more information on the XPath syntax, see for example the XPath Syntax page in the W3Schools XPath Tutorial.

To get going with XPath it can also be very helpful to fiddle around with your document in one of the many XPath testers. Also, the Firebug plugin for Firefox or the Google Chrome inspector can show you an XPath (or rather, one of many possible ones) for the selected element.

