How Can I Select And Update Text Nodes In Mixed Content Using Lxml?

I need to check all words in all text() nodes in an XML file. I'm using the XPath //text() to select the text nodes and a regex to select the words. If the word exists in a set of

Solution 1:

I found the key to this solution in the docs: Using XPath to find text

Specifically the is_text and is_tail properties of _ElementUnicodeResult.

Using these properties I can tell if I need to update the .text or .tail property of the parent _Element.

This is a little tricky to understand at first: when you call getparent() on a text node (_ElementUnicodeResult) that is the tail of its preceding sibling (.is_tail == True), the preceding sibling is what's returned as the parent, not the enclosing element.
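A minimal sketch of that behaviour, assuming a small made-up document (the `p` and `b` element names are only for illustration and are not part of the original question):

```python
from lxml import etree

# Hypothetical document for illustration: <b> has both its own text and a tail.
root = etree.fromstring("<p>head <b>bold</b> tail</p>")

# For every text node, record the string, its is_text / is_tail flags,
# and the tag of whatever getparent() returns.
rows = [
    (str(node), node.is_text, node.is_tail, node.getparent().tag)
    for node in root.xpath("//text()")
]

for row in rows:
    print(row)
```

The string " tail" sits inside `<p>`, but because it is stored as the tail of `<b>`, getparent() reports `b` for it, so that is the element whose .tail has to be updated.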

Example...

Python

import re
from lxml import etree

xml = """<doc>
    <para>I think the only card she has <gotcha>is the</gotcha> Lorem card. We have so many things that we have to do
        better... and certainly ipsum is one of them. When other <gotcha>websites</gotcha> give you text, they're not
        sending the best. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of problems
        and they're <gotcha>bringing</gotcha> those problems with us. They're bringing mistakes. They're bringing
        misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>
"""


def update_text(match, word_list):
    if match in word_list:
        return f"[{match}]"
    else:
        return match


root = etree.fromstring(xml)

keywords = {"ipsum", "is", "the", "best", "problems", "mistakes"}

for text in root.xpath("//text()"):
    parent = text.getparent()
    updated_text = re.sub(r"[\w]+", lambda match: update_text(match.group(), keywords), text)
    if text.is_text:
        parent.text = updated_text
    elif text.is_tail:
        parent.tail = updated_text

etree.dump(root)

Output (dumped to console)

<doc><para>I think [the] only card she has <gotcha>[is] [the]</gotcha> Lorem card. We have so many things that we have to do
        better... and certainly [ipsum] [is] one of them. When other <gotcha>websites</gotcha> give you text, they're not
        sending [the] [best]. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of [problems]
        and they're <gotcha>bringing</gotcha> those [problems] with us. They're bringing [mistakes]. They're bringing
        misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para></doc>
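If the same update is needed in more than one place, the loop can be wrapped in a helper. This is only a sketch under assumptions: the function name `bracket_keywords` and the short sample document are hypothetical, and the logic is the same is_text / is_tail check used in the example.

```python
import re
from lxml import etree


def bracket_keywords(root, keywords):
    """Wrap each keyword in square brackets, in place, in .text and .tail strings."""
    def replace(match):
        word = match.group()
        return f"[{word}]" if word in keywords else word

    for node in root.xpath("//text()"):
        owner = node.getparent()  # the element that holds this string as .text or .tail
        updated = re.sub(r"\w+", replace, node)
        if node.is_text:
            owner.text = updated
        elif node.is_tail:
            owner.tail = updated
    return root


# Hypothetical sample input, much shorter than the document in the answer.
doc = etree.fromstring("<p>ipsum <b>is the</b> best card</p>")
bracket_keywords(doc, {"ipsum", "is", "the", "best"})
result = etree.tostring(doc, encoding="unicode")
print(result)
```

The tree is modified in place, so the return value is only a convenience for chaining.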
