Extracting Page Titles and Contributors from MediaWiki XML
Solution 1:
Try pulling the 'title' elements directly out during iterative parsing instead of doing a secondary loop:
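    from xml.etree.ElementTree import iterparse

    NS = '{http://www.mediawiki.org/xml/export-0.3/}'

    # Binary mode lets the XML parser handle the encoding itself.
    with open('XMLFile.xml', 'rb') as f:
        for event, elem in iterparse(f):
            # Each title is printed at its own "end" event, before it is cleared.
            if elem.tag == NS + 'title':
                print(elem.text)
            elem.clear()
This seems to work for me.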
Solution 2:
You get None when printing the text content of the title element because you are using elem.clear() "too early". By default, iterparse() only generates "end" events. When the "end" event for page is emitted, all its subelements, including title, have already been cleared (emptied).
If elem.clear() in the code in the question is moved just one indentation level (four spaces) to the right, it will work as expected. Another way to make your code work is to change iterparse(f) to iterparse(f, events=["start"]), although an element's text and children are not guaranteed to be fully read when its "start" event fires, so the first fix is the more reliable one.
And node.text() should be node.text, since text is an attribute, not a method.
See http://effbot.org/zone/element-iterparse.htm for more details on iterparse().
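The question's code is not reproduced here, so the following is only a sketch of the effect described above, not the asker's actual program. It assumes a hypothetical one-page dump inlined as bytes, and a hypothetical helper named titles() whose flag switches between clearing every element (too early) and clearing only the finished page.

```python
import io
from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

# Hypothetical one-page dump, inlined so the sketch is self-contained.
XML = (b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">'
       b'<page><title>Unique Page title 1</title></page>'
       b'</mediawiki>')


def titles(clear_every_element):
    """Return the title text visible at each page "end" event."""
    found = []
    for event, elem in iterparse(io.BytesIO(XML)):
        if elem.tag == NS + 'page':
            title = elem.find(NS + 'title')
            found.append(title.text)
            elem.clear()  # safe: the page has been fully processed
        elif clear_every_element:
            # Too early: title is emptied at its own "end" event,
            # which fires before the "end" event of the enclosing page.
            elem.clear()
    return found


too_early = titles(clear_every_element=True)
fixed = titles(clear_every_element=False)
print(too_early)
print(fixed)
```

With the flag set, the title element is still a child of page when the page ends, but its text has already been wiped, which is where the None comes from.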
Assume that the XML dump (mw.xml) looks like this:
    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
      <page>
        <title>Unique Page title 1</title>
        <id>11</id>
        <restrictions>sysop</restrictions>
        <revision>
          <id>11</id>
          <timestamp>2005-10-26T02:23:03Z</timestamp>
          <contributor>
            <username>Alice</username>
          </contributor>
          <text xml:space="preserve">i</text>
        </revision>
      </page>
      <page>
        <title>Unique Page title 2</title>
        <id>11</id>
        <restrictions>sysop</restrictions>
        <revision>
          <id>11</id>
          <timestamp>2005-10-26T02:23:03Z</timestamp>
          <contributor>
            <username>Bob</username>
          </contributor>
          <text xml:space="preserve">j</text>
        </revision>
      </page>
    </mediawiki>
Here is a suggestion on how you can get the title and contributor:
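    from xml.etree.ElementTree import iterparse

    NS = '{http://www.mediawiki.org/xml/export-0.3/}'

    # Binary mode lets the XML parser handle the encoding itself.
    with open('mw.xml', 'rb') as f:
        for event, elem in iterparse(f):
            # Act only when a whole page has been read ("end" event).
            if elem.tag == '{0}page'.format(NS):
                title = elem.find("{0}title".format(NS))
                contr = elem.find(".//{0}username".format(NS))
                if title is not None:
                    print(title.text)
                if contr is not None:
                    print(contr.text)
                # Free the page only after its data has been used.
                elem.clear()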
Output:
Unique Page title 1
Alice
Unique Page title 2
Bob
I'm assuming that you want the username of the contributor. According to the latest XML Schema, contributor can contain username, ip, and/or id child elements (this is also true for the 0.3 version of the schema).
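Since a contributor may carry an ip instead of a username, a lookup could fall back from one to the other. The following is a minimal sketch under that assumption; the inlined dump, the IP address, and the helper names contributor_name() and pages() are all hypothetical and chosen only for illustration.

```python
import io
from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

# Hypothetical dump: one registered contributor, one anonymous (ip only).
XML = (b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">'
       b'<page><title>A</title><revision><contributor>'
       b'<username>Alice</username><id>7</id></contributor></revision></page>'
       b'<page><title>B</title><revision><contributor>'
       b'<ip>192.0.2.1</ip></contributor></revision></page>'
       b'</mediawiki>')


def contributor_name(page):
    """Return the username if present, else the ip, else None."""
    for child in ('username', 'ip'):
        node = page.find('.//{0}contributor/{0}{1}'.format(NS, child))
        if node is not None:
            return node.text
    return None


def pages(source):
    """Collect (title, contributor) pairs, one per page element."""
    results = []
    for event, elem in iterparse(source):
        if elem.tag == NS + 'page':
            title = elem.find(NS + 'title')
            results.append((title.text if title is not None else None,
                            contributor_name(elem)))
            elem.clear()  # the page is done, so it can be freed
    return results


result = pages(io.BytesIO(XML))
print(result)
```

Like the code above it, this only looks at the first matching contributor of each page; a page with several revisions would need findall() instead of find().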
Solution 3:
I have no experience in using Python and iterparse, but generally, the way you'd do this with an iterative XML parser would be like this:
- Outside the parsing loop, set up variables to store the current page title and list of contributors.
- Inside the loop, whenever a page tag is opened, reset the variables.
- When you encounter a title tag, set the page title variable to its contents.
- When you encounter a contributor tag, add its contents to the list of contributors.
- When the page tag is closed, output the collected title and the list of contributors.
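The steps above can be sketched in Python with iterparse() by asking for both "start" and "end" events. This is one possible reading of the outline, not a definitive implementation: it assumes the username is the part of contributor that is wanted, and the inlined dump (including the second contributor, Carol) and the function name collect() are hypothetical.

```python
import io
from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

# Hypothetical dump; the first page has two revisions by different users.
XML = (b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">'
       b'<page><title>Unique Page title 1</title>'
       b'<revision><contributor><username>Alice</username></contributor></revision>'
       b'<revision><contributor><username>Carol</username></contributor></revision>'
       b'</page>'
       b'<page><title>Unique Page title 2</title>'
       b'<revision><contributor><username>Bob</username></contributor></revision>'
       b'</page>'
       b'</mediawiki>')


def collect(source):
    """Return a list of (title, [usernames]) tuples, one per page."""
    results = []
    title, contributors = None, []  # state kept outside the loop
    for event, elem in iterparse(source, events=('start', 'end')):
        if event == 'start':
            if elem.tag == NS + 'page':
                title, contributors = None, []  # page opened: reset state
            continue
        # From here on the event is "end", so elem.text is complete.
        if elem.tag == NS + 'title':
            title = elem.text
        elif elem.tag == NS + 'username':
            contributors.append(elem.text)
        elif elem.tag == NS + 'page':
            results.append((title, contributors))  # page closed: output
            elem.clear()  # free the finished page
    return results


result = collect(io.BytesIO(XML))
print(result)
```

Text is read only on "end" events because an element's content is not guaranteed to be available when its "start" event fires; the "start" event is used purely to reset the per-page state.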