I'm new to this, so please bear with me.

I'm trying to use Python and Beautiful Soup to extract the content part of the following tags:

 <meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://superfunevents.com/events/super-fun-event-1/" />

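BeautifulSoup loads the page fine and finds other things (the code also pulls the article id from the id attribute hidden in the source), but I don't know the correct way to search the HTML for these bits. I've tried various forms of find and findAll to no avail. The code currently iterates over a list of URLs…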

 #!/usr/bin/env python
# -*- coding: utf-8 -*-

#importing the libraries
from urllib import urlopen
from bs4 import BeautifulSoup

def get_data(page_no):
    webpage = urlopen('http://superfunevents.com/?p=' + str(page_no)).read()
    soup = BeautifulSoup(webpage, "lxml")
    for tag in soup.find_all("article") :
        id = tag.get('id')
        print id
# the hard part that doesn't work - I know this example is well off the mark!
    title = soup.find("og:title", "content")
    print (title.get_text())
    url = soup.find("og:url", "content")
    print (url.get_text())
# end of problem

for i in range (1,100):
    get_data(i)

If anyone can help me sort out the part that finds the og:title and og:url content, that would be fantastic!

Originally posted by the_t_test_1; translation licensed under CC BY-SA 4.0

python html web-scraping beautifulsoup


2 Answers


Community wiki

Posted on 2022-12-19

✓ Accepted

Pass the tag name, meta, as the first argument to find(). Then use keyword arguments to match specific attributes:

 title = soup.find("meta", property="og:title")
url = soup.find("meta", property="og:url")

print(title["content"] if title else "No meta title given")
print(url["content"] if url else "No meta url given")

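The if/else checks here are optional if you know the title and url meta properties will always be present. Without them, a missing tag makes find() return None, and indexing None with ["content"] raises a TypeError.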
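As a self-contained illustration of the approach above, here is a minimal sketch that runs against the sample markup from the question instead of a downloaded page. The helper name `get_meta_content` and the use of the built-in `html.parser` (rather than `lxml`) are assumptions made for this example, not part of the original answer. It also shows why the attempt in the question finds nothing: `find("og:title", "content")` looks for a tag *named* `og:title` with the CSS class `content`.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; a real page would be downloaded first.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://superfunevents.com/events/super-fun-event-1/" />
</head><body></body></html>
"""


def get_meta_content(soup, prop):
    """Hypothetical helper: return the content of <meta property=prop>, or None."""
    tag = soup.find("meta", property=prop)
    # find() returns None when nothing matches; .get() returns None
    # instead of raising KeyError when the content attribute is absent.
    return tag.get("content") if tag else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

title = get_meta_content(soup, "og:title")
url = get_meta_content(soup, "og:url")
missing = get_meta_content(soup, "og:image")  # not present in the sample

# The call from the question: the first argument is a tag name and a string
# second argument is matched against the class attribute, so nothing is found.
wrong = soup.find("og:title", "content")

print(title)
print(url)
print(missing)
print(wrong)
```

Returning None for a missing tag keeps the caller in charge of the fallback message, which mirrors the if/else check in the answer.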

Originally posted by alecxe; translation licensed under CC BY-SA 4.0

Community wiki

Posted on 2022-12-19

Try this:

soup = BeautifulSoup(webpage, "lxml")
for tag in soup.find_all("meta"):
    # og:title and og:url are handled the same way, so test for both at once
    if tag.get("property") in ("og:title", "og:url"):
        print(tag.get("content"))
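The loop above can be generalized to collect every Open Graph property in one pass. The following is only a sketch under stated assumptions: the function name `collect_og_properties`, the sample markup, and the choice of `html.parser` are illustrative and do not come from the original answer.

```python
from bs4 import BeautifulSoup

# Illustrative markup: two og: tags from the question plus unrelated meta tags.
SAMPLE_HTML = """
<html><head>
<meta charset="utf-8" />
<meta name="description" content="Not an Open Graph tag" />
<meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://superfunevents.com/events/super-fun-event-1/" />
</head><body></body></html>
"""


def collect_og_properties(html):
    """Hypothetical helper: map each og:* property name to its content value."""
    soup = BeautifulSoup(html, "html.parser")
    og = {}
    for tag in soup.find_all("meta"):
        prop = tag.get("property")    # None for meta tags without this attribute
        content = tag.get("content")  # None if the tag carries no content
        if prop and prop.startswith("og:") and content is not None:
            og[prop] = content
    return og


og_data = collect_og_properties(SAMPLE_HTML)
for name, value in og_data.items():
    print(name, value)
```

A dictionary makes later lookups explicit, for example `og_data.get("og:title")`, and avoids repeating one branch per property.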

Originally posted by Hackaholic; translation licensed under CC BY-SA 3.0
