Python crawler: problem extracting a string with BeautifulSoup

Posted on
2016-07-08

[Image]

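I want the content of the part highlighted in blue, but neither of the two BeautifulSoup options, .strings and get_text(), works: both also pick up the strings inside the span below it, such as "Good Sister-in-law: Forbidden love". .string returns nothing at all, because it cannot decide which string to return when the tag has more than one child.
So how do I extract only a tag's own string when the tag contains nested tags?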

python web-crawler beautifulsoup


4 Answers

cloverstd

✓ Accepted

In [1]: from bs4 import BeautifulSoup

In [2]: html_doc = "<a>123<span>321</span></a>"

In [3]: soup = BeautifulSoup(html_doc, 'html.parser')

In [4]: soup.a.contents[0]
Out[4]: u'123'

In [5]: soup.a.contents
Out[5]: [u'123', <span>321</span>]

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children
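
The contents-based approach above can be generalized when the tag's own text is split into several pieces around the nested tags. The following is only a minimal sketch: the helper name own_text and the sample markup are made up for illustration and are not part of BeautifulSoup or of the original question.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString


def own_text(tag):
    """Join the strings that are direct children of `tag`.

    Hypothetical helper: text inside nested tags (and HTML comments)
    is skipped, because only direct children are inspected.
    """
    parts = [
        str(child)
        for child in tag.children
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    ]
    return "".join(parts).strip()


# Sample markup (an assumption): text both before and after the nested span.
html_doc = "<a>123<span>321</span>456</a>"
soup = BeautifulSoup(html_doc, "html.parser")

print(own_text(soup.a))
```

Unlike soup.a.contents[0], this does not assume that the wanted text is the first child, so it still works if the nested tag comes first.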

dokelung


Posted on
2016-07-08

Updated on
2016-07-08

@洛克's idea is a good one: extract or remove the unwanted tags first, then take the string:

>>> from bs4 import BeautifulSoup
>>> html = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> a_tag = soup.a
>>> i_tag = soup.i.extract()
>>> a_tag.string
'I linked to '
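
Note that extract() modifies the parsed tree, so the <i> tag is gone from soup afterwards. If the original tree is still needed, one option is to work on a copy. This is a rough sketch under that assumption; the helper name text_without is invented here, and it relies on copy.copy() support for tags, which Beautiful Soup 4 documents for version 4.4.0 and later.

```python
import copy

from bs4 import BeautifulSoup


def text_without(tag, unwanted):
    """Return the text of `tag` with every `unwanted` descendant removed.

    Hypothetical helper: the removal happens on a detached copy,
    so the original parse tree is left untouched.
    """
    clone = copy.copy(tag)
    for child in clone.find_all(unwanted):
        child.decompose()
    return clone.get_text()


html = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(html, "html.parser")

result = text_without(soup.a, "i")
print(repr(result))
print(soup.i)  # the <i> tag is still present in the original tree
```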

Or, as @cloverstd said:

>>> from bs4 import BeautifulSoup
>>> html = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> a_tag = soup.a
>>> list(a_tag.strings)
['I linked to ', 'example.com']
>>> list(a_tag.strings)[0]
'I linked to '
>>> a_tag.contents[0]
'I linked to '

In short, there are many ways to do this; mix and match them as you like.

Questions I have answered: Python-QA