How to parse XML and count instances of a particular node attribute? – Dev

The best answers to the question “How to parse XML and count instances of a particular node attribute?” in the category Dev.

QUESTION:

I have many rows in a database that contains XML and I’m trying to write a Python script to count instances of a particular node attribute.

My tree looks like:

<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>

How can I access the attributes "1" and "2" in the XML using Python?

ANSWER:

minidom is the quickest and pretty straight forward.

XML:

<data>
    <items>
        <item name="item1"></item>
        <item name="item2"></item>
        <item name="item3"></item>
        <item name="item4"></item>
    </items>
</data>

Python:

from xml.dom import minidom
xmldoc = minidom.parse('items.xml')
itemlist = xmldoc.getElementsByTagName('item')
print(len(itemlist))
print(itemlist[0].attributes['name'].value)
for s in itemlist:
    print(s.attributes['name'].value)

Output:

4
item1
item1
item2
item3
item4

ANSWER:

I suggest ElementTree. There are other compatible implementations of the same API, such as lxml, and cElementTree in the Python standard library itself; but, in this context, what they chiefly add is even more speed — the ease of programming part depends on the API, which ElementTree defines.

First build an Element instance root from the XML, e.g. with the XML function, or by parsing a file with something like:

import xml.etree.ElementTree as ET
root = ET.parse('thefile.xml').getroot()

Or any of the many other ways shown at ElementTree. Then do something like:

for type_tag in root.findall('bar/type'):
    value = type_tag.get('foobar')
    print(value)

And similar, usually pretty simple, code patterns.

ANSWER:

There are many options out there. cElementTree looks excellent if speed and memory usage are an issue. It has very little overhead compared to simply reading in the file using readlines.

The relevant metrics can be found in the table below, copied from the cElementTree website:

library                         time    space
xml.dom.minidom (Python 2.1)    6.3 s   80000K
gnosis.objectify                2.0 s   22000k
xml.dom.minidom (Python 2.4)    1.4 s   53000k
ElementTree 1.2                 1.6 s   14500k  
ElementTree 1.2.4/1.3           1.1 s   14500k  
cDomlette (C extension)         0.540 s 20500k
PyRXPU (C extension)            0.175 s 10850k
libxml2 (C extension)           0.098 s 16000k
readlines (read as utf-8)       0.093 s 8850k
cElementTree (C extension)  --> 0.047 s 4900K <--
readlines (read as ascii)       0.032 s 5050k   

As pointed out by @jfs, cElementTree comes bundled with Python:

  • Python 2: from xml.etree import cElementTree as ElementTree.
  • Python 3: from xml.etree import ElementTree (the accelerated C version is used automatically).

ANSWER:

You can use BeautifulSoup:

from bs4 import BeautifulSoup

x="""<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

y=BeautifulSoup(x)
>>> y.foo.bar.type["foobar"]
u'1'

>>> y.foo.bar.findAll("type")
[<type foobar="1"></type>, <type foobar="2"></type>]

>>> y.foo.bar.findAll("type")[0]["foobar"]
u'1'
>>> y.foo.bar.findAll("type")[1]["foobar"]
u'2'