How Can I Select And Update Text Nodes In Mixed Content Using Lxml?
I need to check all words in all text() nodes in an XML file. I'm using the XPath //text() to select the text nodes and a regex to select the words. If the word exists in a set of
Solution 1:
I found the key to this solution in the lxml docs, under "Using XPath to find text": specifically, the is_text and is_tail properties of _ElementUnicodeResult. Using these properties, I can tell whether I need to update the .text or the .tail property of the parent _Element.
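To make the two properties concrete, here is a minimal sketch on a tiny document. The sample XML and the `kinds` list are made up for illustration and are not part of the original question.

```python
from lxml import etree

# Hypothetical sample: one element with text, one child, and a tail after the child.
root = etree.fromstring("<p>Hello <b>bold</b> world</p>")

kinds = []
for node in root.xpath("//text()"):
    # Each result is a "smart string" that knows where it came from.
    if node.is_text:
        kind = "text"
    elif node.is_tail:
        kind = "tail"
    else:
        kind = "other"
    kinds.append((str(node), kind))

for value, kind in kinds:
    print(repr(value), kind)
```

"Hello " and "bold" are reported as text, while " world" is reported as a tail, because it follows the closing tag of `<b>`.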
This is a little tricky to understand at first: when you call getparent() on a text node (_ElementUnicodeResult) that is the tail of its preceding sibling (.is_tail == True), the preceding sibling is what's returned as the parent, not the actual parent.
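A small sketch of that behaviour, assuming a made-up sample document. The `logical_parent` helper is hypothetical (it is not part of lxml); it only shows how to get from the tail's owner to the element that really contains the text.

```python
from lxml import etree

# Hypothetical sample: " world" is the tail of <b>, but it sits inside <p>.
root = etree.fromstring("<p>Hello <b>bold</b> world</p>")


def logical_parent(node):
    """Hypothetical helper: return the element that actually contains the text node."""
    owner = node.getparent()
    # For tail text, getparent() gives the preceding sibling, so go one level up.
    return owner.getparent() if node.is_tail else owner


tail_node = root.xpath("//text()")[-1]  # " world"
owner = tail_node.getparent()

print(owner.tag)                      # element whose .tail holds the text
print(logical_parent(tail_node).tag)  # element that contains the text in the document
```

Here `owner` is the `<b>` element, which is why assigning to `parent.tail` updates the right string, while `logical_parent` returns `<p>`.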
Example...
Python
import re
from lxml import etree

xml = """<doc>
<para>I think the only card she has <gotcha>is the</gotcha> Lorem card. We have so many things that we have to do
better... and certainly ipsum is one of them. When other <gotcha>websites</gotcha> give you text, they're not
sending the best. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of problems
and they're <gotcha>bringing</gotcha> those problems with us. They're bringing mistakes. They're bringing
misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>
"""


def update_text(match, word_list):
    # Wrap the word in brackets if it is one of the keywords.
    if match in word_list:
        return f"[{match}]"
    else:
        return match


root = etree.fromstring(xml)
keywords = {"ipsum", "is", "the", "best", "problems", "mistakes"}

for text in root.xpath("//text()"):
    # For tail text, getparent() returns the element the tail belongs to.
    parent = text.getparent()
    updated_text = re.sub(r"[\w]+", lambda match: update_text(match.group(), keywords), text)
    if text.is_text:
        parent.text = updated_text
    elif text.is_tail:
        parent.tail = updated_text

etree.dump(root)
Output (dumped to console)
<doc><para>I think [the] only card she has <gotcha>[is] [the]</gotcha> Lorem card. We have so many things that we have to do
better... and certainly [ipsum] [is] one of them. When other <gotcha>websites</gotcha> give you text, they're not
sending [the] [best]. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of [problems]
and they're <gotcha>bringing</gotcha> those [problems] with us. They're bringing [mistakes]. They're bringing
misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para></doc>