Extracting Page Titles And Contributors From Mediawiki Xml

June 02, 2023

I have a very large (7GB) MediaWiki XML dump, which consists of records of each change made to each page of the Wiki. I am trying to record which users have contributed to each pag

Solution 1:

Try pulling the 'title' elements directly out during iterative parsing instead of doing a secondary loop:

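from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

# Open in binary mode so the parser detects the encoding itself.
with open('XMLFile.xml', 'rb') as f:
    for event, elem in iterparse(f):
        if elem.tag == NS + 'title':
            print(elem.text)
        # Free each element once its "end" event has been handled.
        elem.clear()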

This seems to work for me.

Solution 2:

You get None when printing the text content of the title element because you are using elem.clear() "too early". By default, iterparse() only generates "end" events. When the "end" event for page is emitted, all its subelements, including title, have already been cleared (emptied).

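If elem.clear() in the code in the question is moved just one indentation level (four spaces) to the right, it will work as expected. Another way to make your code work is to change iterparse(f) to iterparse(f, events=["start"]), although on a "start" event an element's text and children are not guaranteed to have been parsed yet, so the first fix is the safer one.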

And node.text() should be node.text.

See http://effbot.org/zone/element-iterparse.htm for more details on iterparse().
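The effect of clearing "too early" can be reproduced on a tiny in-memory document. The following is only a sketch: the question's original code is not shown here, so the function page_titles, its clear_every_element flag, and the SAMPLE document are hypothetical stand-ins for it.

```python
import io
from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

# Hypothetical one-page sample, used only for this demonstration.
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">'
    b'<page><title>Unique Page title 1</title></page>'
    b'</mediawiki>'
)


def page_titles(source, clear_every_element):
    """Collect the title text seen when each page's "end" event fires."""
    titles = []
    for event, elem in iterparse(source):
        if elem.tag == NS + 'page':
            title = elem.find(NS + 'title')
            titles.append(title.text)
            elem.clear()
        elif clear_every_element:
            # Clearing here empties <title> before <page> is processed.
            elem.clear()
    return titles


# Clearing every element: the title text is already gone.
print(page_titles(io.BytesIO(SAMPLE), clear_every_element=True))   # [None]
# Clearing only finished pages: the title text is still there.
print(page_titles(io.BytesIO(SAMPLE), clear_every_element=False))  # ['Unique Page title 1']
```

The cleared title element is still a child of page, which is why find() succeeds but its text is None.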

Assume that the XML dump (mw.xml) looks like this:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Unique Page title 1</title>
    <id>11</id>
    <restrictions>sysop</restrictions>
    <revision>
      <id>11</id>
      <timestamp>2005-10-26T02:23:03Z</timestamp>
      <contributor>
        <username>Alice</username>
      </contributor>
      <text xml:space="preserve">i</text>
    </revision>
  </page>
  <page>
    <title>Unique Page title 2</title>
    <id>11</id>
    <restrictions>sysop</restrictions>
    <revision>
      <id>11</id>
      <timestamp>2005-10-26T02:23:03Z</timestamp>
      <contributor>
        <username>Bob</username>
      </contributor>
      <text xml:space="preserve">j</text>
    </revision>
  </page>
</mediawiki>

Here is a suggestion on how you can get the title and contributor:

from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

with open('mw.xml', 'rb') as f:
    for event, elem in iterparse(f):
        # Only act when a whole page has been parsed ("end" event).
        if elem.tag == '{0}page'.format(NS):
            title = elem.find("{0}title".format(NS))
            contr = elem.find(".//{0}username".format(NS))

            if title is not None:
                print(title.text)
            if contr is not None:
                print(contr.text)

            # Clear the page only after its children have been read.
            elem.clear()

Output:

Unique Page title 1 
Alice
Unique Page title 2 
Bob

I'm assuming that you want the username of the contributor. According to the latest XML Schema, contributor can contain username, ip, and/or id child elements (this is true also for the 0.3 version of the schema).
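Since a contributor may carry an ip instead of a username, and a page can have many revisions, a slightly fuller sketch might collect every contributor per page and fall back to the ip when no username is present. The helper names contributor_name and contributors_by_page are hypothetical, and so is the choice to prefer username over ip.

```python
from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'


def contributor_name(contributor):
    """Return the username if present, otherwise the ip, otherwise None."""
    for child in ('username', 'ip'):
        text = contributor.findtext(NS + child)
        if text:
            return text
    return None


def contributors_by_page(source):
    """Map each page title to the set of contributors found in its revisions."""
    result = {}
    for event, elem in iterparse(source):
        if elem.tag == NS + 'page':
            title = elem.findtext(NS + 'title')
            names = set()
            for contributor in elem.iter(NS + 'contributor'):
                name = contributor_name(contributor)
                if name is not None:
                    names.add(name)
            result[title] = names
            # The whole page has been read, so it is safe to clear it now.
            elem.clear()
    return result
```

Calling contributors_by_page('mw.xml') would return a dictionary keyed by page title. For a 7GB dump it may be preferable to write each page's result out as it is produced rather than keep the whole dictionary in memory.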

Solution 3:

I have no experience in using Python and iterparse, but generally, the way you'd do this with an iterative XML parser would be like this:

Outside the parsing loop, set up variables to store the current page title and list of contributors.
Inside the loop, whenever a page tag is opened, reset the variables.
When you encounter a title tag, set the page title variable to its contents.
When you encounter a contributor tag, add its contents to the list of contributors.
When the page tag is closed, output the collected title and the list of contributors.
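In Python, the steps above could be sketched roughly as follows. This is a minimal illustration, not a tested solution for the full dump: the generator name iter_pages is hypothetical, and it assumes that the "contents" of a contributor tag means the text of its username child.

```python
from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'


def iter_pages(source):
    """Yield (title, contributors) once for every page in the dump."""
    # State kept outside the parsing loop.
    title = None
    contributors = []
    for event, elem in iterparse(source, events=('start', 'end')):
        if event == 'start':
            # A page tag was opened: reset the variables.
            if elem.tag == NS + 'page':
                title, contributors = None, []
        elif elem.tag == NS + 'title':
            # Text is only reliable on the "end" event.
            title = elem.text
        elif elem.tag == NS + 'username':
            if elem.text not in contributors:
                contributors.append(elem.text)
        elif elem.tag == NS + 'page':
            # The page tag was closed: output what was collected.
            yield title, contributors
            elem.clear()
```

A caller could then loop with for title, contributors in iter_pages('mw.xml') and print or store each pair as it arrives.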
