Preface
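Hello everyone. Have you come across XPath before? As the name suggests, XPath is a language that locates nodes in XML documents (and also in HTML, XHTML, or even general SGML documents) using a path-like syntax.
XPath, also called XML Path Language, is a language for querying XML documents.
I won't go over the basics of XPath here. If you are interested, the following resources are a good place to start: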
- https://www.runoob.com/xpath/xpath-tutorial.html
- https://www.w3schools.com/xml/xpath_intro.asp
- https://www.w3.org/TR/xpath
- https://www.w3.org/TR/xpath-functions-3/
- https://www.ibm.com/docs/en/baw/23.x?topic=expressions-xpath-...
- https://en.wikipedia.org/wiki/XPath
- https://developer.mozilla.org/en-US/docs/Web/XPath
Version | W3C Recommendation Date |
---|---|
XPath 1.0 | November 16, 1999 |
XPath 2.0 | January 23, 2007 |
XPath 3.0 | April 8, 2014 |
XPath 3.1 | March 21, 2017 |
Tip: once you are familiar with XPath, you may also want to learn XQuery, XLink, XSLT, and XPointer.
[Figure: XML languages and their scope]
[Figure: XPath 3.1 and XQuery 3.1 Type System: 1. Items]
[Figure: XPath 3.1 and XQuery 3.1 Type System: 2. Simple and Complex Types]
[Figure: XPath 3.1 and XQuery 3.1 Type System: 3. Atomic Types]
Examples
When querying XML that uses a namespace, the node names in the query must carry a namespace prefix. Take the following XML document, for example:
<?xml version="1.0"?>
<root xmlns="http://example.com/">
<element>Text</element>
</root>
To query the element node with Python's lxml library, we need to write:
from lxml.etree import fromstring
# Parse the XML
root = fromstring('''\
<?xml version="1.0"?>
<root xmlns="http://example.com/">
<element>Text</element>
</root>''')
# Define the namespace mapping
nsmap = {'ns': 'http://example.com/'}
# Query with XPath
result = root.xpath('//ns:element', namespaces=nsmap)
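In this example, we first parse the XML document and then define a namespace mapping, nsmap, which maps our own prefix ns to the actual namespace URI ("http://example.com/"). We then run the XPath query, in which ns:element denotes the element node in the namespace bound to the prefix ns.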
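To make the role of the prefix concrete, here is a small sketch built on the same document. It shows that an unprefixed name matches nothing, and two standard lxml alternatives to a prefix mapping: Clark notation ({uri}localname) with findall, and the XPath local-name() function. The variable names are chosen only for this illustration.

```python
from lxml.etree import fromstring

root = fromstring('''\
<?xml version="1.0"?>
<root xmlns="http://example.com/">
<element>Text</element>
</root>''')

# In XPath 1.0 an unprefixed name means "in no namespace", so nothing matches.
no_prefix = root.xpath('//element')

# Clark notation spells out the namespace URI directly; no mapping is needed.
clark = root.findall('{http://example.com/}element')

# local-name() compares only the local part and ignores the namespace,
# which is convenient but can over-match when namespaces are mixed.
by_local_name = root.xpath('//*[local-name()="element"]')

print(no_prefix, clark, by_local_name)
```

Clark notation works with find, findall and iter, but not inside xpath expressions, which is why the prefix mapping is still needed there.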
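In my view, specifying namespaces by hand is rather tedious, so I wrote a module, xml_util.py, to take some of that work away. Its source code is given in the next section.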
Suppose we have the following XML document, which is the content.opf file of an e-book:
<?xml version="1.0" encoding="utf-8"?>
<package version="3.0" unique-identifier="pub-identifier" prefix="ibooks: http://vocabulary.itunes.apple.com/rdf/ibooks/vocabulary-extensions-1.0/" xmlns="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/">
<metadata xmlns:opf="http://www.idpf.org/2007/opf">
<dc:identifier id="pub-identifier">9781492097846</dc:identifier>
<meta id="meta-identifier" property="dcterms:identifier">9781492097846</meta>
<dc:title id="pub-title">ColorWise</dc:title>
<meta property="dcterms:title" id="meta-title">ColorWise</meta>
<dc:language id="pub-language">en</dc:language>
<meta property="dcterms:language" id="meta-language">en</meta>
<meta property="dcterms:modified">2022-11-14T20:02:41Z</meta>
<dc:publisher>O'Reilly Media, Inc.</dc:publisher>
<meta property="dcterms:publisher">O'Reilly Media, Inc.</meta>
<dc:date>2022-11-21</dc:date>
<meta property="dcterms:date">2022-11-21</meta>
<dc:description>Data has become the most powerful tool in business today, and telling its story effectively is critical to decision making. Yet color is the most neglected tool in data visualization. With this book, author and DATAcated founder Kate Strachnyi provides the ultimate guide to the correct use of color for representing data in graphs, charts, tables, and infographics. Data and business analysts, data scientists, and others who design infographics and data visualizations will explore color tips and tricks, including the theories behind them and why they work the way they do.</dc:description>
<meta property="dcterms:description">Data has become the most powerful tool in business today, and telling its story effectively is critical to decision making. Yet color is the most neglected tool in data visualization. With this book, author and DATAcated founder Kate Strachnyi provides the ultimate guide to the correct use of color for representing data in graphs, charts, tables, and infographics. Data and business analysts, data scientists, and others who design infographics and data visualizations will explore color tips and tricks, including the theories behind them and why they work the way they do.</meta>
<dc:creator>Kate Strachnyi</dc:creator>
<meta property="dcterms:creator">Kate Strachnyi</meta>
<meta name="cover" content="cover-image" />
<meta property="ibooks:specified-fonts">true</meta>
</metadata>
<manifest>
<item id="toc.ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
<item id="epub-css" href="epub.css" media-type="text/css"/>
<item id="epub.embedded.asset.1" href="DejaVuSans-Bold.otf" media-type="application/vnd.ms-opentype"/>
<item id="epub.embedded.asset.2" href="DejaVuSerif.otf" media-type="application/vnd.ms-opentype"/>
<item id="epub.embedded.asset.3" href="UbuntuMono-Bold.otf" media-type="application/vnd.ms-opentype"/>
<item id="epub.embedded.asset.4" href="UbuntuMono-BoldItalic.otf" media-type="application/vnd.ms-opentype"/>
<item id="epub.embedded.asset.5" href="UbuntuMono-Italic.otf" media-type="application/vnd.ms-opentype"/>
<item id="epub.embedded.asset.6" href="UbuntuMono-Regular.otf" media-type="application/vnd.ms-opentype"/>
<item id="epub.embedded.asset.7" href="css_assets/titlepage_footer_ebook.png" media-type="image/png"/>
<item id="cover" href="cover.xhtml" media-type="application/xhtml+xml"/>
<item id="cover-image" href="assets/cover.png" media-type="image/png" properties="cover-image"/>
<item id="img-idm45649706868752" href="assets/cowi_P001.png" media-type="image/png"/>
<item id="img-idm45649706755360" href="assets/cowi_0101.png" media-type="image/png"/>
<item id="img-idm45649706872976" href="assets/cowi_0102.png" media-type="image/png"/>
<item id="img-idm45649706719136" href="assets/cowi_0103.png" media-type="image/png"/>
<item id="img-idm45649706756352" href="assets/cowi_0104.png" media-type="image/png"/>
<item id="img-idm45649706654896" href="assets/cowi_0105.png" media-type="image/png"/>
<item id="img-idm45649706651664" href="assets/cowi_0106.png" media-type="image/png"/>
<item id="img-idm45649706792176" href="assets/cowi_0107.png" media-type="image/png"/>
<item id="img-idm45649706800848" href="assets/cowi_0108.png" media-type="image/png"/>
<item id="img-idm45649706603392" href="assets/cowi_0109.png" media-type="image/png"/>
<item id="img-idm45649706782320" href="assets/cowi_0110.png" media-type="image/png"/>
<item id="img-idm45649706592976" href="assets/cowi_0111.png" media-type="image/png"/>
<item id="img-idm45649709416752" href="assets/cowi_0112.png" media-type="image/png"/>
<item id="img-idm45649706778096" href="assets/cowi_0113.png" media-type="image/png"/>
<item id="img-idm45649706479280" href="assets/cowi_0114.png" media-type="image/png"/>
<item id="img-idm45649706470416" href="assets/cowi_0115.png" media-type="image/png"/>
<item id="img-idm45649706509376" href="assets/cowi_0116.png" media-type="image/png"/>
<item id="img-idm45649706584336" href="assets/cowi_0117.png" media-type="image/png"/>
<item id="img-idm45649706353552" href="assets/cowi_0201.png" media-type="image/png"/>
<item id="img-idm45649706326320" href="assets/cowi_0202.png" media-type="image/png"/>
<item id="img-idm45649706312880" href="assets/cowi_0203.png" media-type="image/png"/>
<item id="img-idm45649706431440" href="assets/cowi_0204.png" media-type="image/png"/>
<item id="img-idm45649713883248" href="assets/cowi_0205.png" media-type="image/png"/>
<item id="img-idm45649706214768" href="assets/cowi_0207.png" media-type="image/png"/>
<item id="img-idm45649706830976" href="assets/cowi_0208.png" media-type="image/png"/>
<item id="img-idm45649706165664" href="assets/cowi_0209.png" media-type="image/png"/>
<item id="img-idm45649706272736" href="assets/cowi_0210.png" media-type="image/png"/>
<item id="img-idm45649706216080" href="assets/cowi_0211.png" media-type="image/png"/>
<item id="img-idm45649706364544" href="assets/cowi_0212.png" media-type="image/png"/>
<item id="img-idm45649706113616" href="assets/cowi_0213.png" media-type="image/png"/>
<item id="img-idm45649713916640" href="assets/cowi_0301.png" media-type="image/png"/>
<item id="img-idm45649706365952" href="assets/cowi_0302.png" media-type="image/png"/>
<item id="img-idm45649706101328" href="assets/cowi_0303.png" media-type="image/png"/>
<item id="img-idm45649706210592" href="assets/cowi_0304.png" media-type="image/png"/>
<item id="img-idm45649706689680" href="assets/cowi_0305.png" media-type="image/png"/>
<item id="img-idm45649706036752" href="assets/cowi_0306.png" media-type="image/png"/>
<item id="img-idm45649706383968" href="assets/cowi_0307.png" media-type="image/png"/>
<item id="img-idm45649706134272" href="assets/cowi_0308.png" media-type="image/png"/>
<item id="img-idm45649706382176" href="assets/cowi_0309.png" media-type="image/png"/>
<item id="img-idm45649706038960" href="assets/cowi_0310.png" media-type="image/png"/>
<item id="img-idm45649705985424" href="assets/cowi_0311.png" media-type="image/png"/>
<item id="img-idm45649706688336" href="assets/cowi_0401.png" media-type="image/png"/>
<item id="img-idm45649706451200" href="assets/cowi_0402.png" media-type="image/png"/>
<item id="img-idm45649705931728" href="assets/cowi_0403.png" media-type="image/png"/>
<item id="img-idm45649708906784" href="assets/cowi_0404.png" media-type="image/png"/>
<item id="img-idm45649705976656" href="assets/cowi_0405.png" media-type="image/png"/>
<item id="img-idm45649705897360" href="assets/cowi_0406.png" media-type="image/png"/>
<item id="img-idm45649706406000" href="assets/cowi_0407.png" media-type="image/png"/>
<item id="img-idm45649705921216" href="assets/cowi_0408.png" media-type="image/png"/>
<item id="img-idm45649705863728" href="assets/cowi_0409.png" media-type="image/png"/>
<item id="img-idm45649706220448" href="assets/cowi_0410.png" media-type="image/png"/>
<item id="img-idm45649705924032" href="assets/cowi_0411.png" media-type="image/png"/>
<item id="img-idm45649706237536" href="assets/cowi_0501.png" media-type="image/png"/>
<item id="img-idm45649705775760" href="assets/cowi_0502.png" media-type="image/png"/>
<item id="img-idm45649705765216" href="assets/cowi_0503.png" media-type="image/png"/>
<item id="img-idm45649705957664" href="assets/cowi_0504.png" media-type="image/png"/>
<item id="img-idm45649706260768" href="assets/cowi_0505.png" media-type="image/png"/>
<item id="img-idm45649706166464" href="assets/cowi_0506.png" media-type="image/png"/>
<item id="img-idm45649705753040" href="assets/cowi_0507.png" media-type="image/png"/>
<item id="img-idm45649706025376" href="assets/cowi_0508.png" media-type="image/png"/>
<item id="img-idm45649706910704" href="assets/cowi_0509.png" media-type="image/png"/>
<item id="img-idm45649705829664" href="assets/cowi_0510.png" media-type="image/png"/>
<item id="img-idm45649705786240" href="assets/cowi_0511.png" media-type="image/png"/>
<item id="img-idm45649705642272" href="assets/cowi_0512.png" media-type="image/png"/>
<item id="img-idm45649705833584" href="assets/cowi_0513.png" media-type="image/png"/>
<item id="img-idm45649705592736" href="assets/cowi_0514.png" media-type="image/png"/>
<item id="img-idm45649706591328" href="assets/cowi_0515.png" media-type="image/png"/>
<item id="img-idm45649705575152" href="assets/cowi_0516.png" media-type="image/png"/>
<item id="img-idm45649705554320" href="assets/cowi_0517.png" media-type="image/png"/>
<item id="img-idm45649706009696" href="assets/cowi_0518.png" media-type="image/png"/>
<item id="img-idm45649705733472" href="assets/cowi_0519.png" media-type="image/png"/>
<item id="img-idm45649705521520" href="assets/cowi_0520.png" media-type="image/png"/>
<item id="img-idm45649705507136" href="assets/cowi_0521.png" media-type="image/png"/>
<item id="img-idm45649705495120" href="assets/cowi_0522.png" media-type="image/png"/>
<item id="img-idm45649705563968" href="assets/cowi_0523.png" media-type="image/png"/>
<item id="img-idm45649705608176" href="assets/cowi_0524.png" media-type="image/png"/>
<item id="img-idm45649705453344" href="assets/cowi_0525.png" media-type="image/png"/>
<item id="img-idm45649705456400" href="assets/cowi_0526.png" media-type="image/png"/>
<item id="img-idm45649705427504" href="assets/cowi_0527.png" media-type="image/png"/>
<item id="img-idm45649705417136" href="assets/cowi_0528.png" media-type="image/png"/>
<item id="img-idm45649706056736" href="assets/cowi_0601.png" media-type="image/png"/>
<item id="img-idm45649705462128" href="assets/cowi_0602.png" media-type="image/png"/>
<item id="img-idm45649705376576" href="assets/cowi_0603.png" media-type="image/png"/>
<item id="img-idm45649705784752" href="assets/cowi_0701.png" media-type="image/png"/>
<item id="img-idm45649706057712" href="assets/cowi_0702.png" media-type="image/png"/>
<item id="img-idm45649705886656" href="assets/cowi_0703.png" media-type="image/png"/>
<item id="img-idm45649706347088" href="assets/cowi_0801.png" media-type="image/png"/>
<item id="img-idm45649705325344" href="assets/cowi_0901.png" media-type="image/png"/>
<item id="img-idm45649705222352" href="assets/cowi_0902.png" media-type="image/png"/>
<item id="img-idm45649705186048" href="assets/cowi_0903.png" media-type="image/png"/>
<item id="img-idm45649705130064" href="assets/cowi_0904.png" media-type="image/png"/>
<item id="img-idm45649706047920" href="assets/cowi_0905.png" media-type="image/png"/>
<item id="img-idm45649705477616" href="assets/cowi_0906.png" media-type="image/png"/>
<item id="img-idm45649705107504" href="assets/cowi_0907.png" media-type="image/png"/>
<item id="img-idm45649705150240" href="assets/cowi_0908.png" media-type="image/png"/>
<item id="img-idm45649705138912" href="assets/cowi_0909.png" media-type="image/png"/>
<item id="img-idm45649705674304" href="assets/cowi_0910.png" media-type="image/png"/>
<item id="img-idm45649705227072" href="assets/cowi_0911.png" media-type="image/png"/>
<item id="img-idm45649706274672" href="assets/cowi_0912.png" media-type="image/png"/>
<item id="img-idm45649705735744" href="assets/cowi_1001.png" media-type="image/png"/>
<item id="img-idm45649705018912" href="assets/cowi_1002.png" media-type="image/png"/>
<item id="img-idm45649705010192" href="assets/cowi_1003.png" media-type="image/png"/>
<item id="img-idm45649705716880" href="assets/cowi_1004.png" media-type="image/png"/>
<item id="img-idm45649706156432" href="assets/cowi_1005.png" media-type="image/png"/>
<item id="img-idm45649705707600" href="assets/cowi_1006.png" media-type="image/png"/>
<item id="img-idm45649705006688" href="assets/cowi_1007.png" media-type="image/png"/>
<item id="img-idm45649704932368" href="assets/cowi_1008.png" media-type="image/png"/>
<item id="img-idm45649705134064" href="assets/cowi_1009.png" media-type="image/png"/>
<item id="img-idm45649704927056" href="assets/cowi_1010.png" media-type="image/png"/>
<item id="img-idm45649704918256" href="assets/cowi_1011.png" media-type="image/png"/>
<item id="img-idm45649705709152" href="assets/cowi_1012.png" media-type="image/png"/>
<item id="img-idm45649704901232" href="assets/cowi_1013.png" media-type="image/png"/>
<item id="img-idm45649704892192" href="assets/cowi_1014.png" media-type="image/png"/>
<item id="img-idm45649705346736" href="assets/cowi_1015.png" media-type="image/png"/>
<item id="img-idm45649705050160" href="assets/cowi_1016.png" media-type="image/png"/>
<item id="img-idm45649704871840" href="assets/cowi_1017.png" media-type="image/png"/>
<item id="img-idm45649704924784" href="assets/cowi_1018.png" media-type="image/png"/>
<item id="img-idm45649704852912" href="assets/cowi_1019.png" media-type="image/png"/>
<item id="dedication-idm45649707006464" href="dedication01.xhtml" media-type="application/xhtml+xml"/>
<item id="titlepage-idm45649706954560" href="titlepage01.xhtml" media-type="application/xhtml+xml"/>
<item id="copyright-page-idm45649706950224" href="copyright-page01.xhtml" media-type="application/xhtml+xml"/>
<item id="toc-idm45649711153280" href="toc01.xhtml" media-type="application/xhtml+xml" properties="nav"/>
<item id="preface-idm45649706916816" href="preface01.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter-idm45649713752832" href="ch01.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter-idm45649713816720" href="ch02.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter-idm45649706747904" href="ch03.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter-idm45649706163568" href="ch04.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter-idm45649705977648" href="ch05.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter-idm45649705834560" href="ch06.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter-idm45649705732000" href="ch07.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter-idm45649705942688" href="ch08.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter-idm45649705357504" href="ch09.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter-idm45649705438064" href="ch10.xhtml" media-type="application/xhtml+xml"/>
<item id="afterword-idm45649704840816" href="afterword01.xhtml" media-type="application/xhtml+xml"/>
<item id="index-idm45649704821328" href="ix01.xhtml" media-type="application/xhtml+xml"/>
<item id="colophon-idm45649704794064" href="colophon01.xhtml" media-type="application/xhtml+xml"/>
<item id="colophon-idm45649704914880" href="colophon02.xhtml" media-type="application/xhtml+xml"/>
<item id="sgc-nav.css" href="sgc-nav.css" media-type="text/css"/>
<item id="nav.xhtml" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
</manifest>
<spine toc="toc.ncx">
<itemref idref="cover"/>
<itemref idref="dedication-idm45649707006464"/>
<itemref idref="titlepage-idm45649706954560"/>
<itemref idref="copyright-page-idm45649706950224"/>
<itemref idref="preface-idm45649706916816"/>
<itemref idref="chapter-idm45649713752832"/>
<itemref idref="chapter-idm45649713816720"/>
<itemref idref="chapter-idm45649706747904"/>
<itemref idref="chapter-idm45649706163568"/>
<itemref idref="chapter-idm45649705977648"/>
<itemref idref="chapter-idm45649705834560"/>
<itemref idref="chapter-idm45649705732000"/>
<itemref idref="chapter-idm45649705942688"/>
<itemref idref="chapter-idm45649705357504"/>
<itemref idref="chapter-idm45649705438064"/>
<itemref idref="afterword-idm45649704840816"/>
<itemref idref="index-idm45649704821328"/>
<itemref idref="colophon-idm45649704794064"/>
<itemref idref="colophon-idm45649704914880"/>
<itemref idref="nav.xhtml" linear="no"/>
</spine>
<guide>
<reference type="cover" title="Cover" href="cover.xhtml"/>
<reference type="text" title="Start of Text" href="titlepage01.xhtml"/>
</guide>
</package>
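A practical note on parsing a document like this one: it starts with an XML declaration that names an encoding, and lxml refuses str input that carries such a declaration. The sketch below, using a shortened stand-in document (an assumption for brevity), shows the failure and the usual workaround of passing bytes instead.

```python
from lxml.etree import fromstring

# A shortened stand-in for content.opf, kept as str on purpose.
doc = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<package xmlns="http://www.idpf.org/2007/opf"/>'
)

try:
    fromstring(doc)  # str input with an encoding declaration
except ValueError as exc:
    error = exc      # lxml asks for bytes input instead
else:
    error = None

# Encoding the text with the declared encoding makes the parse succeed.
root = fromstring(doc.encode("utf-8"))
print(type(error).__name__, root.tag)
```

Reading the file in binary mode, or encoding the text before parsing, avoids the problem altogether.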
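Now suppose we have parsed the document above with lxml.etree.fromstring and stored the result in the variable etree, and that we have also loaded the functions from xml_util.py. We can then query nodes with much shorter XPath expressions.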
>>> # Query the top-level node: package
>>> generic_find(etree, '/package')
<Element {http://www.idpf.org/2007/opf}package at 0x111f93e40>
>>> # Query the node: metadata
>>> generic_find(etree, '/package/metadata')
<Element {http://www.idpf.org/2007/opf}metadata at 0x11685c300>
>>> # Query the node: manifest
>>> generic_find(etree, '/package/manifest')
<Element {http://www.idpf.org/2007/opf}manifest at 0x111d38840>
>>> # Query all child nodes of metadata
>>> generic_xpath(etree, '/package/metadata/*')
[<Element {http://purl.org/dc/elements/1.1/}identifier at 0x116ab4c80>,
<Element {http://www.idpf.org/2007/opf}meta at 0x116b17b80>,
<Element {http://purl.org/dc/elements/1.1/}title at 0x116a78f80>,
<Element {http://www.idpf.org/2007/opf}meta at 0x1169bcac0>,
<Element {http://purl.org/dc/elements/1.1/}language at 0x1169bea80>,
<Element {http://www.idpf.org/2007/opf}meta at 0x111d56f80>,
<Element {http://www.idpf.org/2007/opf}meta at 0x111d54c80>,
<Element {http://purl.org/dc/elements/1.1/}publisher at 0x111d56580>,
<Element {http://www.idpf.org/2007/opf}meta at 0x111d54f80>,
<Element {http://purl.org/dc/elements/1.1/}date at 0x111c62780>,
<Element {http://www.idpf.org/2007/opf}meta at 0x111d57580>,
<Element {http://purl.org/dc/elements/1.1/}description at 0x111e86780>,
<Element {http://www.idpf.org/2007/opf}meta at 0x111e87040>,
<Element {http://purl.org/dc/elements/1.1/}creator at 0x11200af00>,
<Element {http://www.idpf.org/2007/opf}meta at 0x114aa6240>,
<Element {http://www.idpf.org/2007/opf}meta at 0x114aa7a40>,
<Element {http://www.idpf.org/2007/opf}meta at 0x114aa5280>]
>>> # Query all nodes in the dc namespace under metadata
>>> generic_xpath(etree, '/package/metadata/dc:*')
[<Element {http://purl.org/dc/elements/1.1/}identifier at 0x116ab4c80>,
<Element {http://purl.org/dc/elements/1.1/}title at 0x116a78f80>,
<Element {http://purl.org/dc/elements/1.1/}language at 0x1169bea80>,
<Element {http://purl.org/dc/elements/1.1/}publisher at 0x111d56580>,
<Element {http://purl.org/dc/elements/1.1/}date at 0x111c62780>,
<Element {http://purl.org/dc/elements/1.1/}description at 0x111e86780>,
<Element {http://purl.org/dc/elements/1.1/}creator at 0x11200af00>]
>>> # Query all item nodes under manifest
>>> generic_xpath(etree, '/package/manifest/item')
[<Element {http://www.idpf.org/2007/opf}item at 0x114aa4f00>,
<Element {http://www.idpf.org/2007/opf}item at 0x1169be880>,
<Element {http://www.idpf.org/2007/opf}item at 0x111d46100>,
<Element {http://www.idpf.org/2007/opf}item at 0x116b17740>,
<Element {http://www.idpf.org/2007/opf}item at 0x116b177c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x1169bc300>,
<Element {http://www.idpf.org/2007/opf}item at 0x1169be000>,
<Element {http://www.idpf.org/2007/opf}item at 0x1169be1c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x1169bd2c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x1169bc140>,
<Element {http://www.idpf.org/2007/opf}item at 0x1169bc380>,
<Element {http://www.idpf.org/2007/opf}item at 0x1167274c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116727580>,
<Element {http://www.idpf.org/2007/opf}item at 0x116726a40>,
<Element {http://www.idpf.org/2007/opf}item at 0x116726e00>,
<Element {http://www.idpf.org/2007/opf}item at 0x116725cc0>,
<Element {http://www.idpf.org/2007/opf}item at 0x107b30600>,
<Element {http://www.idpf.org/2007/opf}item at 0x107b305c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x107b31d00>,
<Element {http://www.idpf.org/2007/opf}item at 0x107b31840>,
<Element {http://www.idpf.org/2007/opf}item at 0x107b33800>,
<Element {http://www.idpf.org/2007/opf}item at 0x107b33780>,
<Element {http://www.idpf.org/2007/opf}item at 0x107b32540>,
<Element {http://www.idpf.org/2007/opf}item at 0x1078ee200>,
<Element {http://www.idpf.org/2007/opf}item at 0x116b8be80>,
<Element {http://www.idpf.org/2007/opf}item at 0x116b8a500>,
<Element {http://www.idpf.org/2007/opf}item at 0x116b8b500>,
<Element {http://www.idpf.org/2007/opf}item at 0x116b8ba80>,
<Element {http://www.idpf.org/2007/opf}item at 0x116cfa5c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x11684bbc0>,
<Element {http://www.idpf.org/2007/opf}item at 0x1168490c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x11684b700>,
<Element {http://www.idpf.org/2007/opf}item at 0x116849a40>,
<Element {http://www.idpf.org/2007/opf}item at 0x111c7e940>,
<Element {http://www.idpf.org/2007/opf}item at 0x107fe7840>,
<Element {http://www.idpf.org/2007/opf}item at 0x107fe4300>,
<Element {http://www.idpf.org/2007/opf}item at 0x107fe4500>,
<Element {http://www.idpf.org/2007/opf}item at 0x107fe45c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x107fe7f80>,
<Element {http://www.idpf.org/2007/opf}item at 0x107fe7a00>,
<Element {http://www.idpf.org/2007/opf}item at 0x107fe51c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x107fe4540>,
<Element {http://www.idpf.org/2007/opf}item at 0x107fe78c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116747580>,
<Element {http://www.idpf.org/2007/opf}item at 0x116744640>,
<Element {http://www.idpf.org/2007/opf}item at 0x116746e00>,
<Element {http://www.idpf.org/2007/opf}item at 0x116746f00>,
<Element {http://www.idpf.org/2007/opf}item at 0x116746880>,
<Element {http://www.idpf.org/2007/opf}item at 0x1167452c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116747a00>,
<Element {http://www.idpf.org/2007/opf}item at 0x116745b80>,
<Element {http://www.idpf.org/2007/opf}item at 0x116746140>,
<Element {http://www.idpf.org/2007/opf}item at 0x1167443c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x1167474c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x1167453c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116744280>,
<Element {http://www.idpf.org/2007/opf}item at 0x116745000>,
<Element {http://www.idpf.org/2007/opf}item at 0x116744780>,
<Element {http://www.idpf.org/2007/opf}item at 0x116745300>,
<Element {http://www.idpf.org/2007/opf}item at 0x116746840>,
<Element {http://www.idpf.org/2007/opf}item at 0x116745540>,
<Element {http://www.idpf.org/2007/opf}item at 0x116747e00>,
<Element {http://www.idpf.org/2007/opf}item at 0x1167450c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116747480>,
<Element {http://www.idpf.org/2007/opf}item at 0x1167457c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116746ec0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116746fc0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116744e80>,
<Element {http://www.idpf.org/2007/opf}item at 0x116744800>,
<Element {http://www.idpf.org/2007/opf}item at 0x116744f00>,
<Element {http://www.idpf.org/2007/opf}item at 0x116746180>,
<Element {http://www.idpf.org/2007/opf}item at 0x1167471c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x1167473c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116744940>,
<Element {http://www.idpf.org/2007/opf}item at 0x116744580>,
<Element {http://www.idpf.org/2007/opf}item at 0x116744ec0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116747fc0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116747c80>,
<Element {http://www.idpf.org/2007/opf}item at 0x116745380>,
<Element {http://www.idpf.org/2007/opf}item at 0x116745c00>,
<Element {http://www.idpf.org/2007/opf}item at 0x116745cc0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116744480>,
<Element {http://www.idpf.org/2007/opf}item at 0x116744200>,
<Element {http://www.idpf.org/2007/opf}item at 0x1165446c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116547280>,
<Element {http://www.idpf.org/2007/opf}item at 0x116546d00>,
<Element {http://www.idpf.org/2007/opf}item at 0x116544d40>,
<Element {http://www.idpf.org/2007/opf}item at 0x116544ac0>,
<Element {http://www.idpf.org/2007/opf}item at 0x1165445c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116547340>,
<Element {http://www.idpf.org/2007/opf}item at 0x116544bc0>,
<Element {http://www.idpf.org/2007/opf}item at 0x1165442c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116544780>,
<Element {http://www.idpf.org/2007/opf}item at 0x1078f65c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x1078f4380>,
<Element {http://www.idpf.org/2007/opf}item at 0x1078f7340>,
<Element {http://www.idpf.org/2007/opf}item at 0x1078f6280>,
<Element {http://www.idpf.org/2007/opf}item at 0x1078f5340>,
<Element {http://www.idpf.org/2007/opf}item at 0x1078f4100>,
<Element {http://www.idpf.org/2007/opf}item at 0x1078f4bc0>,
<Element {http://www.idpf.org/2007/opf}item at 0x1078f6ec0>,
<Element {http://www.idpf.org/2007/opf}item at 0x1078f4f40>,
<Element {http://www.idpf.org/2007/opf}item at 0x1078f4800>,
<Element {http://www.idpf.org/2007/opf}item at 0x11660aac0>,
<Element {http://www.idpf.org/2007/opf}item at 0x11660be40>,
<Element {http://www.idpf.org/2007/opf}item at 0x11660be00>,
<Element {http://www.idpf.org/2007/opf}item at 0x11660bdc0>,
<Element {http://www.idpf.org/2007/opf}item at 0x11660a5c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116609f80>,
<Element {http://www.idpf.org/2007/opf}item at 0x111d54a40>,
<Element {http://www.idpf.org/2007/opf}item at 0x111d55480>,
<Element {http://www.idpf.org/2007/opf}item at 0x111d56880>,
<Element {http://www.idpf.org/2007/opf}item at 0x111d55540>,
<Element {http://www.idpf.org/2007/opf}item at 0x111d54f40>,
<Element {http://www.idpf.org/2007/opf}item at 0x111d57a80>,
<Element {http://www.idpf.org/2007/opf}item at 0x111d55080>,
<Element {http://www.idpf.org/2007/opf}item at 0x111d54740>,
<Element {http://www.idpf.org/2007/opf}item at 0x111d56cc0>,
<Element {http://www.idpf.org/2007/opf}item at 0x111d55440>,
<Element {http://www.idpf.org/2007/opf}item at 0x111d55340>,
<Element {http://www.idpf.org/2007/opf}item at 0x1121d4b00>,
<Element {http://www.idpf.org/2007/opf}item at 0x1121d6200>,
<Element {http://www.idpf.org/2007/opf}item at 0x1121d66c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x1121d5140>,
<Element {http://www.idpf.org/2007/opf}item at 0x1121d7300>,
<Element {http://www.idpf.org/2007/opf}item at 0x1121d77c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x1121d7700>,
<Element {http://www.idpf.org/2007/opf}item at 0x1121d6a40>,
<Element {http://www.idpf.org/2007/opf}item at 0x1121d5340>,
<Element {http://www.idpf.org/2007/opf}item at 0x1121d5300>,
<Element {http://www.idpf.org/2007/opf}item at 0x114aa4f80>,
<Element {http://www.idpf.org/2007/opf}item at 0x114aa53c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x114aa7400>,
<Element {http://www.idpf.org/2007/opf}item at 0x114aa73c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x114aa5240>,
<Element {http://www.idpf.org/2007/opf}item at 0x114aa5e00>,
<Element {http://www.idpf.org/2007/opf}item at 0x116773a00>,
<Element {http://www.idpf.org/2007/opf}item at 0x116771ec0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116772100>,
<Element {http://www.idpf.org/2007/opf}item at 0x116772fc0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116770400>,
<Element {http://www.idpf.org/2007/opf}item at 0x116772c00>,
<Element {http://www.idpf.org/2007/opf}item at 0x116773200>,
<Element {http://www.idpf.org/2007/opf}item at 0x116773400>,
<Element {http://www.idpf.org/2007/opf}item at 0x1167732c0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116773680>,
<Element {http://www.idpf.org/2007/opf}item at 0x116772f40>,
<Element {http://www.idpf.org/2007/opf}item at 0x116772480>,
<Element {http://www.idpf.org/2007/opf}item at 0x116773bc0>,
<Element {http://www.idpf.org/2007/opf}item at 0x116770080>]
>>> # Query all itemref nodes under spine
>>> generic_xpath(etree, '/package/spine/itemref')
[<Element {http://www.idpf.org/2007/opf}itemref at 0x116e35280>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116e36fc0>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116e35a40>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116e35400>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116e35300>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116e356c0>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116e365c0>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116e340c0>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116e342c0>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116e34840>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x106baac40>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x114aa6940>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116ef2000>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116ef03c0>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116ef3640>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116ef0f80>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116ef3680>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116ba4c00>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116931100>,
<Element {http://www.idpf.org/2007/opf}itemref at 0x116930fc0>]
Source code
Below is the source code of the xml_util.py module. The code carries static type annotations and passes mypy checks.
#!/usr/bin/env python
# coding: utf-8
__author__ = "ChenyangGao <https://chenyanggao.github.io/>"
__version__ = (0, 0, 1)
__all__ = [
    "fromstring", "tostring", "make_element", "html_fromstring", "html_tostring", "make_html_element",
    "xml_fromstring", "to_xhtml", "xpath_of", "generalize_xpath", "generic_xpath", "generic_find",
]
__requirements__ = ["lxml", "lxml-stubs"]
from functools import partial
from re import compile as re_compile, Match
from typing import cast, Callable, Final, Iterator, Mapping, NamedTuple, Optional
from lxml.etree import fromstring, tostring, Element as make_element, _Comment, _Element, _ElementTree
from lxml.html import fromstring as html_fromstring, tostring as html_tostring, Element as make_html_element
CRE_XML_ENCODING: Final = re_compile(r'(?<=\bencoding=")[^"]+|(?<=\bencoding=\')[^\']+')
XPATH_TOKEN_PATS: Final = [
    ("DSLASH", r"//"),
    ("SLASH", r"/"),
    ("LBRACKET", r"\["),
    ("RBRACKET", r"\]"),
    ("LPARAN", r"\("),
    ("RPARAN", r"\)"),
    ("DCOLON", r"::"),
    ("COLON", r":"),
    ("DDOT", r"\.\."),
    ("DOT", r"\."),
    ("AT", r"@"),
    ("DOLLAR", r"\$"),
    ("COMMA", r","),
    ("STAR", r"\*"),
    ("VERTICAL_BAR", r"\|"),
    ("QM", r"\?"),
    ("NAME", r"\w[\w0-9._-]*"),
    ("NUMBER", r"\d+(?:\.\d*)?|\.\d+"),
    ("STRING", r"'[^'\\]*(?:\\.[^'\\]*)*'|" + r'"[^"\\]*(?:\\.[^"\\]*)*"'),
    ("WHITESPACES", r"\s+"),
    ("COMP", r"!=|<=|>=|=|<|>"),
    ("ANY", r"(?s:.)"),
]
# NOTE: Alternatives are tried in order, so longer tokens (e.g. "//") come before
#       their prefixes (e.g. "/"), and "ANY" is the catch-all at the end.
CRE_XPATH_TOKEN: Final = re_compile("|".join("(?P<%s>%s)" % pair for pair in XPATH_TOKEN_PATS))
def xml_fromstring(
    doc: bytes | str,
    /,
    parser=None,
    *,
    base_url=None,
) -> _Element:
    doc = doc.lstrip()
    # NOTE: lxml rejects a `str` that carries an encoding declaration, so encode
    #       it to bytes with the declared encoding (looked up in the first line).
    if isinstance(doc, str):
        index = doc.find("\n")
        if index != -1:
            m = CRE_XML_ENCODING.search(doc[:index])
            if m is not None:
                doc = bytes(doc, m[0])
    return fromstring(doc, parser, base_url=base_url)
xml_fromstring.__doc__ = fromstring.__doc__
def to_xhtml(
    etree: _Element | _ElementTree,
    ensure_epub: bool = False,
) -> Callable[..., bytes]:
    """Convert an element node and its child nodes into XHTML format.
    :param etree: Element node or node tree to be processed.
    :param ensure_epub: Whether to add the epub namespace to the root element;
                        this only applies when `etree` is a root element or a node tree.
    :return: A helper function (ignore it if not needed) that serializes the current
             element node or node tree (depending on what was provided).
    """
    if isinstance(etree, _ElementTree):
        root = etree.getroot()
        is_root = True
    else:
        root = etree.getroottree().getroot()
        is_root = etree is root
    # NOTE: The Sigil editor reports an error for a double hyphen (--) inside a
    #       comment, so all double hyphens are escaped 😂.
    comments: list[_Comment] = etree.xpath(".//comment()") # type: ignore
    if comments:
        for comment in comments:
            if comment.text and "--" in comment.text:
                comment.text = comment.text.replace("--", "&#45;&#45;")
    # NOTE: Converting HTML to XHTML may require `lxml.etree.tostring`, which
    #       serializes an element without children as a self-closing tag. There
    #       is no such thing in HTML, but HTML does have the concept of a void element:
    #
    #       - https://html.spec.whatwg.org/multipage/syntax.html#void-elements
    #       - https://developer.mozilla.org/en-US/docs/Glossary/Void_element
    #
    #       A void element is an element in HTML that cannot have any child nodes
    #       (i.e., nested elements or text nodes). Void elements only have a start
    #       tag; end tags must not be specified for void elements.
    #       To keep non-void elements from being serialized as self-closing tags,
    #       their text is set to "" whenever it is None.
    for el in etree.iter("*"):
        # NOTE: Some elements that used to be void elements, such as <param> and
        #       <keygen>, have since been deprecated and removed. The HTML standard
        #       no longer reserves an obsolete element, so users may give it a new
        #       meaning and it cannot simply be treated as a void element.
        #
        #       - https://developer.mozilla.org/en-US/docs/Web/HTML/Element/param
        #       - https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Releases/69
        if el.tag.lower() not in (
            "area", "base", "br", "col", "embed", "hr", "img", "input", "link",
            "meta", "source", "track", "wbr",
        ):
            if el.text is None:
                el.text = ""
    # NOTE: epub:type is needed for expressing structural semantics,
    #       so there are two namespaces that need to be defined.
    #
    #       - https://www.w3.org/TR/epub/#app-structural-semantics
    #       - https://www.w3.org/TR/xml-names/
    if is_root:
        if "xmlns" not in root.attrib:
            root.attrib["xmlns"] = "http://www.w3.org/1999/xhtml"
        if ensure_epub and "xmlns:epub" not in root.attrib:
            root.attrib["xmlns:epub"] = "http://www.idpf.org/2007/ops"
    # NOTE: UTF-8 is currently the most recommended encoding.
    kwargs = {"encoding": "utf-8"}
    if is_root:
        # NOTE: Specify the DOCTYPE as HTML5 (<!DOCTYPE html>), ignoring the original.
        kwargs["doctype"] = "<!DOCTYPE html>"
        # NOTE: Sigil editor does not support XML version > 1.0 🫠
        kwargs["xml_declaration"] = '<?xml version="1.0" encoding="UTF-8"?>'
    return partial(tostring, etree, **kwargs) # type: ignore
def xpath_of(
    el: _Element,
    with_name: bool = True,
) -> str:
    """Generates the XPath of a given XML (HTML, XHTML, SGML, ...) element in the document tree.
    :param el: The element for which the XPath is to be generated.
    :param with_name: A flag to determine if the tag names should be included in the XPath.
                      If set to True, the tag names are included, otherwise only the structure
                      is represented in the XPath.
    :return: The XPath of the given XML element as a string.
    :Note:
        - The function traverses up the tree from the given element to the root,
          building the XPath along the way.
        - In case of elements with namespaces, the local name of the tag (i.e., without the namespace) is used.
        - If there are multiple sibling elements with the same tag, the index of the element among its siblings
          is included in the XPath.
        - If `with_name` is set to False, the function can use `el.getroottree().getpath(el)` instead to get
          the XPath without the tag names.
    """
    ls: list[str] = []
    add = ls.append
    if with_name:
        get_basetag = lambda tag: tag[tag.find("}")+1:]
        while True:
            tag = el.tag
            basetag = get_basetag(tag)
            if tag == basetag:
                add(f"/{tag}")
            else:
                add(f'/*[local-name()="{basetag}"]')
            pel = el.getparent()
            if pel is None:
                break
            elif any(get_basetag(sel.tag) == basetag and sel is not el for sel in pel.iterchildren("*")):
                i = 0
                for sel in pel.iterchildren("*"):
                    if get_basetag(sel.tag) == basetag:
                        i += 1
                        if sel is el:
                            break
                ls[-1] += f"[{i}]"
            el = pel
    else:
        # NOTE: You can use `el.getroottree().getpath(el)` instead.
        while True:
            pel = el.getparent()
            add("/*")
            if pel is None:
                break
            i = 0
            for sel in pel.iterchildren("*"):
                i += 1
                if el is sel:
                    break
            ls[-1] += f"[{i}]"
            el = pel
    return "".join(reversed(ls))
class Token(NamedTuple):
    type: str
    value: str
    start: int
    stop: int
    match: Match
def tokenize(
    xpath: str,
    _tokeniter: Callable[[str], Iterator[Match]] = CRE_XPATH_TOKEN.finditer,
) -> Iterator[Token]:
    # Reference:
    #   - https://www.python.org/community/sigs/retired/parser-sig/towards-standard/
    #   - https://www.w3.org/TR/xpath/
    #   - https://github.com/antlr/antlr4
    #   - https://www.gnu.org/software/bison/
    #   - https://www.antlr3.org/grammar/list.html
    #   - https://github.com/antlr/grammars-v4/tree/master/xpath
    #   - https://github.com/lark-parser/lark
    #   - https://github.com/dabeaz/ply
    for match in _tokeniter(xpath):
        token_type = cast(str, match.lastgroup)
        token_value = match.group(token_type)
        yield Token(token_type, token_value, *match.span(), match)
def generalize_xpath(
    xpath: str,
    /,
    prefix: Optional[str] = None,
) -> str:
    """Generalizes a given XPath expression, so that namespaces can be disregarded.
    :param xpath: The XPath expression to be generalized.
    :param prefix: An optional namespace prefix to be used in the generalized XPath.
                   If provided, the prefix is added before the tag names in the XPath.
                   If not provided, the 'local-name()' function is used to match the tag names in the XPath.
    :return: The generalized XPath as a string.
    """
    # TODO: Research is ongoing for XPath of more complex nested structures, even XSLT.
    parts: list[str] = []
    add_part = parts.append
    step_begin = True
    pred_level = 0
    #para_level = 0
    cache_name = None
    if prefix:
        expand = lambda name: f"{prefix}:{name}"
    else:
        expand = lambda name: f'*[local-name()="{name}"]'
    last_type = "ANY"
    for token in tokenize(xpath):
        type = token.type
        value = token.value
        if type == "WHITESPACES":
            # if cache_name:
            #     add_part(cache_name)
            #     cache_name = None
            add_part(value)
            last_type = type
            continue
        if step_begin:
            if type == "NAME":
                # if cache_name:
                #     add_part(cache_name)
                #     step_begin = False
                cache_name = value
                last_type = type
                continue
            # NOTE: axes end
            elif type == "DCOLON":
                if cache_name:
                    if last_type == "WHITESPACES":
                        parts[-1:] = [cache_name, parts[-1]]
                    else:
                        add_part(cache_name)
                    cache_name = None
                else:
                    step_begin = False
                add_part(value)
                last_type = type
                continue
        # NOTE: Outside predicates, a step separator ends the current step,
        #       so a pending name is an element name and gets expanded.
        if not pred_level and type in ("SLASH", "DSLASH", "VERTICAL_BAR"):
            if cache_name:
                if last_type == "WHITESPACES":
                    parts[-1:] = [expand(cache_name), parts[-1]]
                else:
                    add_part(expand(cache_name))
                cache_name = None
            add_part(value)
            step_begin = True
            last_type = type
            continue
        if cache_name:
            if not pred_level and type == "LBRACKET":
                if last_type == "WHITESPACES":
                    parts[-1:] = [expand(cache_name), parts[-1]]
                else:
                    add_part(expand(cache_name))
            elif last_type == "WHITESPACES":
                parts[-1:] = [cache_name, parts[-1]]
            else:
                add_part(cache_name)
            cache_name = None
        if type == "LBRACKET":
            pred_level += 1
        elif type == "RBRACKET" and pred_level:
            pred_level -= 1
        add_part(value)
        step_begin = False
        last_type = type
    if cache_name:
        if last_type == "WHITESPACES":
            parts[-1:] = [expand(cache_name), parts[-1]]
        else:
            add_part(expand(cache_name))
    return "".join(parts)
def generic_xpath(
    el: _Element | _ElementTree,
    xpath: str,
    /,
    **kwargs,
) -> list[_Element]:
    """Executes a generalized XPath expression on an XML element or tree.
    :param el: The element or tree on which the XPath expression is to be executed.
    :param xpath: The XPath expression to be executed. It will be generalized before execution.
    :param kwargs: Additional arguments to be passed to the 'xpath' method of the XML element or tree.
    :return: A list of XML elements that match the XPath expression.
    :Note:
        - The function first generalizes the input XPath expression using the 'generalize_xpath' function.
        - If the 'namespaces' argument is not provided in the kwargs, it adds the namespaces from the
          XML element or tree to the kwargs.
        - It then executes the XPath expression on the XML element or tree using the 'xpath' method with
          the updated kwargs and returns the result.
    """
    xpath = generalize_xpath(xpath)
    if "namespaces" not in kwargs:
        nsmap = (el.getroot() if isinstance(el, _ElementTree) else el).nsmap
        if nsmap:
            kwargs["namespaces"] = {k: nsmap[k] for k in nsmap if k}
    return cast(list[_Element], el.xpath(xpath, **kwargs))
def generic_find(
    el: _Element | _ElementTree,
    path: str,
    /,
    namespaces: Optional[Mapping[Optional[str], str]] = None,
) -> Optional[_Element]:
    """Finds the first XML element that matches a given path expression in an XML element or tree.
    :param el: The XML element or tree in which to find the matching element.
    :param path: The path expression to be used to find the matching element.
                 If it starts with '/', it's treated as an absolute path and the search starts from the root of the tree.
                 If it doesn't start with '/', it's treated as a relative path and the search starts from the input element.
    :param namespaces: An optional dictionary of namespace prefixes to URIs.
                       If provided, these namespaces are used in the path expression.
                       If not provided, the namespaces from the XML element or tree are used.
    :return: The first XML element that matches the path expression, or None if no matching element is found.
    :Note:
        - If the path expression starts with '/', the function first checks whether the root element
          matches the first step of the path expression.
        - If the root element matches, it rewrites the path expression to be relative to the root element.
        - If the root element doesn't match and the path expression starts with '//',
          it rewrites the path expression to search all descendants of the root element.
        - If the root element doesn't match and the path expression doesn't start with '//', it returns None.
        - If the 'namespaces' argument is not provided, it uses the namespaces from the XML element or tree.
        - It then uses the 'find' method of the XML element or tree to find the first element that matches
          the path expression and returns it.
    """
    if path.startswith("/"):
        prefix, _, expr2 = path.strip("/").partition("/")
        if not prefix:
            raise ValueError(f"Invalid expression: {path!r}")
        el = (el if isinstance(el, _ElementTree) else el.getroottree()).getroot()
        if generic_xpath(el, "/"+prefix):
            if not expr2:
                return el
            path = "./" + expr2
        elif path.startswith("//"):
            path = ".//" + path.strip("/")
        else:
            return None
    if namespaces is None:
        namespaces = (el.getroot() if isinstance(el, _ElementTree) else el).nsmap
    if namespaces:
        # NOTE: `find` cannot address the default namespace (key None) directly,
        #       so it is bound to the placeholder prefix "_" and unprefixed steps get "_:".
        if None in namespaces:
            path = '/'.join(map(
                lambda p: p if not p or p.startswith(("*", ".", "..")) or ":" in p else f"_:{p}",
                path.split('/'),
            ))
            namespaces = {k or "_": namespaces[k] for k in namespaces}
        return el.find(path, cast(Mapping[str, str], namespaces))
    return el.find(path)
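To make the idea behind `tokenize` and `generalize_xpath` easier to follow, here is a minimal, standard-library-only sketch. It is not the module's implementation: `TOKEN_PATTERNS`, `tokenize_simple` and `generalize_simple` are hypothetical names made up for this illustration, and the sketch assumes the path consists only of `/`, `//` and bare element names (no axes, predicates, attributes or function calls). It shows two things: a regex built from named groups, where `lastgroup` tells which alternative matched, and the rewrite of each element name into a `local-name()` test so that the namespace no longer matters.

```python
import re

# Reduced token table (an assumption for this sketch; the module's table is larger).
# Order matters: "//" must be tried before "/".
TOKEN_PATTERNS = [
    ("DSLASH", r"//"),
    ("SLASH", r"/"),
    ("NAME", r"[A-Za-z_][\w.-]*"),
    ("OTHER", r"(?s:.)"),  # catch-all, so every character lands in some token
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))


def tokenize_simple(xpath):
    """Yield (token_type, text) pairs; `lastgroup` names the alternative that matched."""
    for m in TOKEN_RE.finditer(xpath):
        yield m.lastgroup, m.group()


def generalize_simple(xpath):
    """Rewrite bare element names so that they match in any namespace.

    Only valid for paths made of '/', '//' and element names; anything else
    is copied through unchanged.
    """
    return "".join(
        f'*[local-name()="{text}"]' if kind == "NAME" else text
        for kind, text in tokenize_simple(xpath)
    )


if __name__ == "__main__":
    # Illustrative path; any simple element path works the same way.
    print(generalize_simple("//spine/itemref"))
```

The real `generalize_xpath` has to track more state (axes, predicate depth, whitespace) precisely because a name is not always an element name: inside `[...]`, after `@`, or before `::` it must be left untouched, which this sketch does not attempt.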