Python crawler: my XPath extractor cannot get the image URLs. How should I change it?

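I am using XPath in a Python crawler to pull data out of a page, but it does not return the image URLs. How should I change it?
Here is the page markup I need to extract from: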

<script type="text/x-handlebars-template" id="descTemplate">
<p>
<img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i1/678878759/TB24N1cenAKL1JjSZFCXXXFspXa_!!678878759.png" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i2/678878759/TB2YMdtclxRMKJjy0FdXXaifFXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i4/678878759/TB2NCcvcbsTMeJjy1zbXXchlVXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i1/678878759/TB2soQ.dEl7MKJjSZFDXXaOEpXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i3/678878759/TB2z6gDcgoQMeJjy0FpXXcTxpXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i2/678878759/TB2Kst_elcHL1JjSZJiXXcKcpXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i1/678878759/TB21M4tclxRMKJjy0FdXXaifFXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i1/678878759/TB20qx8eoEIL1JjSZFFXXc5kVXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i3/678878759/TB2nVADcgMPMeJjy1XbXXcwxVXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i3/678878759/TB2_6KcenAKL1JjSZFCXXXFspXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i1/678878759/TB2bICced.LL1JjSZFEXXcVmXXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i2/678878759/TB2IayieoQIL1JjSZFhXXaDZFXa_!!678878759.jpg" align="absmiddle" width="750"> </p>
</script>

Here is the code I wrote:

describe_image_urls_list = selector.xpath('//*[@id="descTemplate"]/p/img/@src').extract()
if len(describe_image_urls_list) == 0:
    describe_image_urls_list = selector.xpath('//*[@id="main-con"]/div[2]/div/div[2]/p[2]/img/@src').extract()
if len(describe_image_urls_list) == 0:
    describe_image_urls_list = selector.xpath('//*[@id="main-con"]/div[2]/div/div[2]/h1/img/@src').extract()

item["describe_url"] = describe_image_urls_list

No matter what I try, nothing is extracted. Could someone take a look?

2 answers

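I used the BeautifulSoup library. BeautifulSoup handles the contents of a script tag separately (as text, not as nested tags), so the code below first takes the text out of the script tag and then parses that text again: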

html = '''<script type="text/x-handlebars-template" id="descTemplate">
<p>
<img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i1/678878759/TB24N1cenAKL1JjSZFCXXXFspXa_!!678878759.png" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i2/678878759/TB2YMdtclxRMKJjy0FdXXaifFXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i4/678878759/TB2NCcvcbsTMeJjy1zbXXchlVXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i1/678878759/TB2soQ.dEl7MKJjSZFDXXaOEpXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i3/678878759/TB2z6gDcgoQMeJjy0FpXXcTxpXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i2/678878759/TB2Kst_elcHL1JjSZJiXXcKcpXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i1/678878759/TB21M4tclxRMKJjy0FdXXaifFXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i1/678878759/TB20qx8eoEIL1JjSZFFXXc5kVXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i3/678878759/TB2nVADcgMPMeJjy1XbXXcwxVXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i3/678878759/TB2_6KcenAKL1JjSZFCXXXFspXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i1/678878759/TB2bICced.LL1JjSZFEXXcVmXXa_!!678878759.jpg" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i2/678878759/TB2IayieoQIL1JjSZFhXXaDZFXa_!!678878759.jpg" align="absmiddle" width="750"> </p>
</script>'''

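from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

# First pass: the <script> body is kept as plain text, not as tags.
soup = BeautifulSoup(html, 'html.parser')
print(soup)
img = soup.script.get_text()

# Second pass: parse that text as HTML so the <img> tags become real nodes.
soup2 = BeautifulSoup(img, 'html.parser')
src = soup2.find_all('img')
for i in src:
    print(i.get('src'))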
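
For comparison, the same two-pass idea can be sketched with only the standard library's html.parser module (the parser BeautifulSoup is asked to use above). This is a rough sketch, not a drop-in replacement: the class names, the helper extract_img_urls_from_script and the shortened SAMPLE_PAGE are made up for illustration.

```python
from html.parser import HTMLParser

# Shortened copy of the markup from the question (two of the twelve images).
SAMPLE_PAGE = '''<script type="text/x-handlebars-template" id="descTemplate">
<p>
<img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i1/678878759/TB24N1cenAKL1JjSZFCXXXFspXa_!!678878759.png" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i2/678878759/TB2YMdtclxRMKJjy0FdXXaifFXa_!!678878759.jpg" align="absmiddle" width="750"> </p>
</script>'''


class ScriptTextCollector(HTMLParser):
    """First pass: collect the raw text inside <script id=...>."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.inside = False

    def handle_data(self, data):
        # Script content arrives here as text, possibly in several chunks.
        if self.inside:
            self.chunks.append(data)


class ImgSrcCollector(HTMLParser):
    """Second pass: collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


def extract_img_urls_from_script(page_html, script_id):
    first = ScriptTextCollector(script_id)
    first.feed(page_html)
    first.close()
    second = ImgSrcCollector()
    second.feed("".join(first.chunks))
    second.close()
    return second.urls


for url in extract_img_urls_from_script(SAMPLE_PAGE, "descTemplate"):
    print(url)
```

If the id is not present on the page, the helper simply returns an empty list, so it can slot into the same "try the next selector" fallback chain used in the question.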

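The content inside a script tag is not parsed into tags. It looks like markup, but it is only a string, which is why an XPath such as /p/img/@src finds nothing under it. You could first locate the script tag itself with XPath and then pull the URLs out with a regular expression. If you are using Scrapy, calling the re method on the selector to grab the src part should probably be enough.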
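
A minimal sketch of that regex approach, using only the standard re module so it can be run on its own. The function name extract_template_image_urls, the two patterns and the shortened PAGE_HTML sample are assumptions for illustration. In a Scrapy spider the rough equivalent would be the one-liner in the final comment, which is not executed here.

```python
import re

# Shortened copy of the markup from the question (two of the twelve images).
PAGE_HTML = '''<script type="text/x-handlebars-template" id="descTemplate">
<p>
<img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i1/678878759/TB24N1cenAKL1JjSZFCXXXFspXa_!!678878759.png" align="absmiddle" width="750"><img style="max-width:750.0px;" src="https://img.alicdn.com/imgextra/i2/678878759/TB2YMdtclxRMKJjy0FdXXaifFXa_!!678878759.jpg" align="absmiddle" width="750"> </p>
</script>'''

# Step 1: grab the body of the script tag whose id is descTemplate.
SCRIPT_RE = re.compile(
    r'<script[^>]*\bid="descTemplate"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)
# Step 2: inside that string, capture the value of each img src attribute.
SRC_RE = re.compile(r'<img[^>]*\bsrc="([^"]+)"', re.IGNORECASE)


def extract_template_image_urls(page_html):
    """Return the image URLs found inside the descTemplate script, or []."""
    match = SCRIPT_RE.search(page_html)
    if match is None:
        return []
    return SRC_RE.findall(match.group(1))


for url in extract_template_image_urls(PAGE_HTML):
    print(url)

# Rough Scrapy equivalent (assumes `selector` is the Selector from the question):
# selector.xpath('//*[@id="descTemplate"]/text()').re(r'src="([^"]+)"')
```

The patterns assume double-quoted attributes, as in the sample markup. A page that uses single quotes or unquoted attributes would need a looser pattern.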
