How do I extract XML (CDATA) data embedded in a page scraped with BS4 in Python?

Posted
2018-01-09

Updated
2018-03-13


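The HTML I scraped with Python contains a block of data from which I want to extract the link URL, the title, and the publication date, but find_all() returns nothing for it. How can I get at this data?
The scraped data looks like this: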

<record><![CDATA[
<tr><td height="26" align="left" style="border-bottom:dashed 1px #ccc"><span style="padding-right:8px;"><img src="/picture/0/s1609271437127167930.gif" align="absmiddle" border="0"></span><a  style="font-size:12px;" href='/art/2018/1/2/art_275_32953.html' class='bt_link' title='考核合格名单的通知' target="_blank">2017年度学科带头人考核合格名单的通知</a></td><td width="80" align="center" class="bt_time" style="border-bottom:dashed 1px #ccc">[2018-01-02]</td></tr>]]></record>
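
One way to approach this, offered as a sketch rather than a definitive fix: a likely reason find_all() comes back empty is that the rows sit inside a `<![CDATA[ ... ]]>` section, which an HTML parser may hand over as one block of text instead of as tags (the exact behaviour depends on the parser). The example below pulls the CDATA payload out with a regular expression and then parses that payload with BeautifulSoup. The `parse_records` helper and the `BASE_URL` constant are illustrative names, not something from the original page; the base URL is the one that appears in the accepted answer.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Sample row from the question, wrapped in a CDATA section.
raw = """<record><![CDATA[
<tr><td height="26" align="left" style="border-bottom:dashed 1px #ccc"><span style="padding-right:8px;"><img src="/picture/0/s1609271437127167930.gif" align="absmiddle" border="0"></span><a  style="font-size:12px;" href='/art/2018/1/2/art_275_32953.html' class='bt_link' title='考核合格名单的通知' target="_blank">2017年度学科带头人考核合格名单的通知</a></td><td width="80" align="center" class="bt_time" style="border-bottom:dashed 1px #ccc">[2018-01-02]</td></tr>]]></record>"""

BASE_URL = "http://www.zjedu.org"  # assumption: site root taken from the accepted answer


def parse_records(raw_html, base_url=BASE_URL):
    """Return a list of dicts (title, url, date), one per <tr> found inside CDATA sections."""
    items = []
    # Step 1: grab the text between <![CDATA[ and ]]> without relying on an HTML parser.
    for payload in re.findall(r'<!\[CDATA\[(.*?)\]\]>', raw_html, re.S):
        # Step 2: the payload is ordinary HTML, so BeautifulSoup can parse it as usual.
        fragment = BeautifulSoup(payload, 'html.parser')
        for row in fragment.find_all('tr'):
            link = row.find('a')
            date_cell = row.find('td', class_='bt_time')
            if link is None or date_cell is None:
                continue  # skip rows that do not have the expected layout
            items.append({
                'title': link.get_text(strip=True),
                'url': urljoin(base_url, link['href']),
                'date': date_cell.get_text(strip=True).strip('[]'),
            })
    return items


for item in parse_records(raw):
    print(item['title'], item['date'], item['url'])
```

Parsing the payload separately keeps the tag lookups (`find('a')`, `class_='bt_time'`) in BeautifulSoup, so the regular expression only has to locate the CDATA boundaries.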

beautifulsoup

python


2 answers

skyrainer

✓ Accepted answer

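Following the approach you suggested, I worked out my own solution.
First use BS to locate the section of the target page that holds the data, then use regular expressions to extract the values inside it.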


import re

from bs4 import BeautifulSoup


# Container type for one notice/news item
class News(object):
    def __init__(self):
        self.__url = None
        self.__title = None
        self.__posttime = None

    def print_info(self):
        print('%s: %s:%s' % (self.__title, self.__posttime, self.__url))

    def set_url(self, url):
        self.__url = url

    def set_title(self, title):
        self.__title = title

    def set_posttime(self, posttime):
        self.__posttime = posttime

    def get_url(self):
        return self.__url

    def get_title(self):
        return self.__title

    def get_posttime(self):
        return self.__posttime



newslist = []
# Holds the list of latest notices.
# NOTE: `soup` is assumed to be the BeautifulSoup object built from the fetched page;
# the original answer does not show how it is created.
for link in soup.find_all(attrs={'id': '494'}):
    # Grab the contents of every <tr>...</tr> row inside the matched element
    tr = re.findall(r'<tr[^>]*>(.*?)</tr>', str(link), re.I | re.M)
    for trs in tr:
        notice = News()
        # Each row holds two <td> cells
        td = re.findall(r'<td[^>]*>(.*?)</td>', str(trs), re.I | re.M)
        i = 1
        for newid in td:
            if (i % 2) == 0:
                # The second <td> holds the publication date
                posttime = newid
                notice.set_posttime(posttime)
                i = i + 1
                newslist.append(notice)
            else:
                # The first <td> holds the link and the title;
                # break it down further to get the href and title attributes
                url = re.findall(r'href=\'(\S+)\'', str(newid))
                finalurl = "http://www.zjedu.org" + str(url[0])
                title = re.findall(r'title=\'(.*?)\'', str(newid))
                stitle = str(title[0]).strip()
                notice.set_url(finalurl)
                print(stitle)
                notice.set_title(stitle)
                i = i + 1


The output looks like this:
2017年度学科带头人考核合格名单的通知: [2018-01-02]:/art/2018/1/2/art_275_33408.html
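
The same "regular expressions over the row HTML" idea can be written more compactly. The following is only a sketch under a few assumptions: the `Notice` dataclass, `ROW_RE`, and `parse_rows` are hypothetical names introduced here, the row layout is the one shown in the question, and the title is taken from the link text rather than from the `title` attribute used in the code above.

```python
import re
from dataclasses import dataclass
from urllib.parse import urljoin


@dataclass
class Notice:
    """One notice: plain attributes instead of hand-written getters and setters."""
    title: str
    url: str
    posttime: str


# One pattern per row: the href and text of the <a>, then the date in the bt_time cell.
ROW_RE = re.compile(
    r"<a[^>]*href='(?P<href>[^']+)'[^>]*>(?P<title>.*?)</a>"
    r'.*?class="bt_time"[^>]*>\[(?P<date>[^\]]+)\]',
    re.S,
)


def parse_rows(html, base_url="http://www.zjedu.org"):
    """Return a Notice for every row in `html` that matches ROW_RE."""
    return [
        Notice(m['title'].strip(), urljoin(base_url, m['href']), m['date'])
        for m in ROW_RE.finditer(html)
    ]


# Shortened version of the row from the question.
sample = """<tr><td><a  style="font-size:12px;" href='/art/2018/1/2/art_275_32953.html' class='bt_link' title='考核合格名单的通知' target="_blank">2017年度学科带头人考核合格名单的通知</a></td><td width="80" align="center" class="bt_time">[2018-01-02]</td></tr>"""

for notice in parse_rows(sample):
    print('%s: %s: %s' % (notice.title, notice.posttime, notice.url))
```

A pattern like this is tied to the exact quoting and attribute order of the page, so it would need adjusting if the markup changes.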

东哥起飞


Posted
2018-01-09

Updated
2018-01-09

from bs4 import BeautifulSoup

a = """<record><![CDATA[ \
<tr><td height="26" align="left" style="border-bottom:dashed 1px #ccc"><span style="padding-right:8px;"> \
<img src="/picture/0/s1609271437127167930.gif" align="absmiddle" border="0"></span>2017年度学科带头人考核 \
合格名单的通知</td><td width="80" align="center" class="bt_time" style="border-bottom:dashed 1px #ccc">[20 \
18-01-02]</td></tr>]]></record>"""

soup= BeautifulSoup(a, 'lxml')
img = soup.img.get('src')
print(img)
td = soup.find_all('td')
for x in td:
    print(x.string)

Result:

/picture/0/s1609271437127167930.gif
None
[20 18-01-02]

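With BeautifulSoup alone I could not get the title, 2017年度学科带头人考核合格名单的通知. The reason is that .string only returns a value when the tag has a single child. Here the first td contains a span (which wraps the img) followed by the title text, so .string returns None. I therefore combined it with a regular expression, as follows: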

from bs4 import BeautifulSoup
import re

a = """<record><![CDATA[ \
<tr><td height="26" align="left" style="border-bottom:dashed 1px #ccc"><span style="padding-right:8px;"> \
<img src="/picture/0/s1609271437127167930.gif" align="absmiddle" border="0"></span>2017年度学科带头人考核 \
合格名单的通知</td><td width="80" align="center" class="bt_time" style="border-bottom:dashed 1px #ccc">[20 \
18-01-02]</td></tr>]]></record>"""

soup= BeautifulSoup(a, 'lxml')
img = soup.img.get('src')
td = soup.find_all('td')
pattern = re.compile(r'<td.+?><span.+?>.+?</span>(.+?)</td>')
title = ''.join(re.findall(pattern, str(td[0]))[0])

print(img)
print(title)
print(td[1].string)

Output:

/picture/0/s1609271437127167930.gif
2017年度学科带头人考核 合格名单的通知
[20 18-01-02]
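
As a possible alternative to the regular expression above, `get_text()` collects all the text inside a tag no matter how many children it has. A minimal sketch, using a shortened, hypothetical version of the sample row and the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the row: a <span> with an <img>, then the title text.
html = (
    '<table><tr>'
    '<td><span><img src="/picture/0/s1609271437127167930.gif"></span>'
    '2017年度学科带头人考核合格名单的通知</td>'
    '<td class="bt_time">[2018-01-02]</td>'
    '</tr></table>'
)

soup = BeautifulSoup(html, 'html.parser')
first_td, second_td = soup.find_all('td')

# .string is None here because the cell has two children: the <span> and a text node.
print(first_td.string)
# get_text() joins every text node under the tag; strip=True trims surrounding whitespace.
print(first_td.get_text(strip=True))
# The second cell has exactly one child (a text node), so .string works.
print(second_td.string)
```

This keeps the extraction inside BeautifulSoup, which tends to hold up better than a pattern when attributes or whitespace in the markup change.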
