The code is a bit messy, but that is how the original page looks; feel free to reformat it.
<li class="list__item"><div class="list__title">The world this week</div><a itemProp="url" class="link-button list__link" href="/node/21752687"><span class="print-edition__link-flytitle">Print-edition redesign</span><span class="print-edition__link-title">Introducing our new look</span></a><a itemProp="url" class="link-button list__link" href="/node/21752688"><span class="print-edition__link-title-sub">Politics this week</span></a><a itemProp="url" class="link-button list__link" href="/node/21752686"><span class="print-edition__link-title-sub">Business this week</span></a><a itemProp="url" class="link-button list__link" href="/node/21752685"><span class="print-edition__link-title-sub">KAL’s cartoon</span></a></li><li class="list__item"><div class="list__title">Leaders</div><a itemProp="url" class="link-button list__link" href="/node/21752616"><span class="print-edition__link-flytitle">Politics and power</span><span class="print-edition__link-title">China v America</span></a><a itemProp="url" class="link-button list__link" href="/node/21752617"><span class="print-edition__link-flytitle">Germany</span><span class="print-edition__link-title">Not so grand</span></a><a itemProp="url" class="link-button list__link" href="/node/21752619"><span class="print-edition__link-flytitle">Criminal justice</span><span class="print-edition__link-title">Against pessimism</span></a><a itemProp="url" class="link-button list__link" href="/node/21752620"><span class="print-edition__link-flytitle">Oil markets</span><span class="print-edition__link-title">Beyond boom and bust?</span></a><a itemProp="url" class="link-button list__link" href="/node/21752618"><span class="print-edition__link-flytitle">In praise of the basics</span><span class="print-edition__link-title">Captain Sensible</span></a></li>
After the browser renders it, it looks like this:
Here is the structure: each <li class="list__item"> is one section. It contains one <div class="list__title"> and N hyperlinks (<a> tags), which in the markup above sit directly inside the li.
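Question:
For each <li class="list__item">, I need to pull out the text of its <div class="list__title"> together with the href values (for example href="/node/21752688") of the links inside that same li. The result should look like this (the type does not have to be a list; it is only used here to make things clearer):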
['The world this week','/node/21752687','/node/21752688','/node/21752685'],['Against pessimism','/node/21752618','/node/21752617','/node/21752183']...
XPath supports an "or" relation: join several expressions with the | operator and they are all selected in one query.
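As a rough illustration of the | operator, here is a minimal sketch using lxml (a third-party package). The SNIPPET string is a trimmed-down sample modelled on the markup in the question, and the variable names are made up for this example. Note that the union comes back as one flat list, so with several li elements you lose track of which href belongs to which title.

```python
from lxml import html  # third-party: pip install lxml

# Trimmed-down sample modelled on the question's markup (illustrative only).
SNIPPET = (
    '<ul><li class="list__item">'
    '<div class="list__title">Leaders</div>'
    '<a class="list__link" href="/node/21752616">China v America</a>'
    '<a class="list__link" href="/node/21752617">Not so grand</a>'
    '</li></ul>'
)

tree = html.fromstring(SNIPPET)

# "|" takes the union of two node sets: the title text and the href attributes.
union_result = tree.xpath(
    '//li[@class="list__item"]/div[@class="list__title"]/text()'
    ' | //li[@class="list__item"]/a/@href'
)
print(union_result)
```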
Since the two elements need different things extracted from them, I would suggest using separate XPath expressions:
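Extracting the div's text:
If you are using Python's lxml library, you can also skip XPath's text() expression: once you have the Element node, read its text attribute to get the str with entities already unescaped.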
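The two options might look roughly like this. It is only a sketch: the sample markup and the names via_xpath and via_attr are invented for illustration, and the &amp;amp; entity is there just to show that both routes return the unescaped string.

```python
from lxml import html  # third-party: pip install lxml

# Illustrative sample; "&amp;" shows that entities come back unescaped.
SNIPPET = (
    '<ul><li class="list__item">'
    '<div class="list__title">Books &amp; arts</div>'
    '</li></ul>'
)

tree = html.fromstring(SNIPPET)

# Option 1: let XPath return the text nodes directly.
via_xpath = tree.xpath('//div[@class="list__title"]/text()')

# Option 2: stop at the Element and read its .text attribute in Python.
via_attr = [div.text for div in tree.xpath('//div[@class="list__title"]')]

print(via_xpath, via_attr)
```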
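Extracting the href attribute of the a tags under the li element:
As above, you can also select only down to the Element level and then use the corresponding Python method for reading an attribute to get the href, instead of handling it in XPath. Finally, concatenate the results of the two XPath expressions.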
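Putting the pieces together, one possible shape is to loop over each li and run both expressions relative to it, so titles and links stay grouped. This is a sketch under assumptions, not the only way to do it: extract_sections is a hypothetical helper name and SNIPPET is a shortened version of the markup from the question.

```python
from lxml import html  # third-party: pip install lxml

# Shortened version of the question's markup (illustrative only).
SNIPPET = (
    '<ul>'
    '<li class="list__item"><div class="list__title">The world this week</div>'
    '<a class="link-button list__link" href="/node/21752687">Introducing our new look</a>'
    '<a class="link-button list__link" href="/node/21752688">Politics this week</a></li>'
    '<li class="list__item"><div class="list__title">Leaders</div>'
    '<a class="link-button list__link" href="/node/21752616">China v America</a></li>'
    '</ul>'
)


def extract_sections(markup):
    """Return one [title, href, href, ...] list per li.list__item (hypothetical helper)."""
    tree = html.fromstring(markup)
    sections = []
    for item in tree.xpath('//li[@class="list__item"]'):
        # Expression 1: the title text, relative to this li only.
        titles = item.xpath('./div[@class="list__title"]/text()')
        # Expression 2: stop at the <a> Elements and read href in Python;
        # item.xpath('./a/@href') would do the same inside XPath.
        hrefs = [a.get('href') for a in item.xpath('./a')]
        # Concatenate the two results for this section.
        sections.append(titles + hrefs)
    return sections


sections = extract_sections(SNIPPET)
print(sections)
```

Querying with a leading ./ keeps each expression scoped to the current li, which is what preserves the title-to-links grouping.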