New here, so please bear with me.

I'm writing my first "real" project, a web crawler, and I don't know how to fix this error. Here is my code:

import requests
from bs4 import BeautifulSoup

def main_spider(max_pages):
    page = 1
    for page in range(1, max_pages+1):
        url = "https://en.wikipedia.org/wiki/Star_Wars" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"):
            href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
            print(href)
    page += 1

main_spider(1)

Here is the error:

href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
TypeError: must be str, not NoneType

Originally posted by Dylan Boyd; translated under the CC BY-SA 4.0 license.

python

2 answers

Community wiki

Posted on 2023-01-09

✓ Accepted

As @Shiping pointed out, your code was not indented correctly; I have corrected it below. Also, link.get('href') does not return a string in one of the cases.

import requests
from bs4 import BeautifulSoup

def main_spider(max_pages):
    for page in range(1, max_pages+1):
        url = "https://en.wikipedia.org/wiki/Star_Wars" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"):

            href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
            print(href)

main_spider(1)

To see what was happening, I added a couple of lines between your existing ones and (temporarily) removed the offending line.

        soup = BeautifulSoup(plain_text, "html.parser")
        print('All anchor tags:', soup.findAll('a'))     ### ADDED
        for link in soup.findAll("a"):
            print(type(link.get("href")), link.get("href"))  ### ADDED

The result of my additions is this (truncated for brevity). Note: the first anchor has no href attribute, so link.get('href') has no value to return and returns None.

[<a id="top"></a>, <a href="#mw-head">navigation</a>,
<a href="#p-search">search</a>,
<a href="/wiki/Special:SiteMatrix" title="Special:SiteMatrix">sister...
<class 'NoneType'> None
<class 'str'> #mw-head
<class 'str'> #p-search
<class 'str'> /wiki/Special:SiteMatrix
<class 'str'> /wiki/File:Wiktionary-logo-v2.svg
...
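The behaviour seen in that output can be reproduced without any network access. The sketch below is only an illustration: it uses a plain dict as a stand-in for the attributes of `<a id="top"></a>`, on the assumption that a tag's `.get()` behaves like `dict.get()` and returns None for a missing attribute. The names `attrs`, `safe_href` and `error_message` are made up for this example.

```python
# A plain dict standing in for the attributes of <a id="top"></a>.
# (Assumption: a tag's .get() behaves like dict.get() for a missing key.)
attrs = {"id": "top"}

href = attrs.get("href")  # no "href" key, so .get() returns None
print(type(href), href)

error_message = None
try:
    # Concatenating a str with None is what raises the TypeError.
    url = "https://en.wikipedia.org/wiki/Star_Wars" + href
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# Passing a default to .get() keeps the value a string.
safe_href = attrs.get("href", "")
print(repr(safe_href))
```

The exact wording of the TypeError message differs between Python versions, so it is printed rather than hard-coded here.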

To prevent the error, one possible solution is to add a conditional or a try/except block to the code. I'll demonstrate the conditional.

        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"):
            # Skip anchors that have no href attribute (get() returns None).
            if link.get('href') is None:
                continue
            else:
                href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
                print(href)
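The try/except alternative mentioned above could look roughly like the following. This is a minimal sketch, not the answerer's code: `build_href` and `BASE_URL` are hypothetical names, and the function takes the href value directly so it can be tried without fetching a page.

```python
# Hypothetical constant; the value is the URL used in the question.
BASE_URL = "https://en.wikipedia.org/wiki/Star_Wars"


def build_href(base_url, href):
    """Concatenate base_url and href, returning None if href is missing."""
    try:
        return base_url + href
    except TypeError:
        # href was None (the anchor had no href attribute).
        return None


# Values like those in the truncated output above.
for value in [None, "#mw-head", "/wiki/Special:SiteMatrix"]:
    result = build_href(BASE_URL, value)
    if result is not None:
        print(result)
```

Inside the crawler loop, the same idea would be to wrap the concatenation in try/except TypeError and `continue` on failure. The explicit None check is arguably easier to read, since a missing href is an expected case rather than an exceptional one.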

Originally posted by E. Ducateme; translated under the CC BY-SA 3.0 license.

Community wiki

Posted on 2023-01-09

The first "a" link on the Wikipedia page is

<a id="top"></a>

So link.get("href") returns None, because there is no href.

To fix this, check for None first:

if link.get('href') is not None:
    href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
    # do stuff here
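Once None values are skipped, plain concatenation still glues a site-relative href such as /wiki/Special:SiteMatrix onto the end of the page URL, which is probably not the intended address. As a possible refinement (not part of the original answers), the sketch below resolves each href with the standard library's `urllib.parse.urljoin`. `absolute_links` and `BASE_URL` are hypothetical names, and the input list mimics the values printed earlier in this thread.

```python
from urllib.parse import urljoin

# Hypothetical constant; the value is the page URL used in the question.
BASE_URL = "https://en.wikipedia.org/wiki/Star_Wars"


def absolute_links(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url, skipping missing (None) values."""
    resolved = []
    for href in hrefs:
        if href is None:
            # Anchor without an href attribute, e.g. <a id="top"></a>.
            continue
        resolved.append(urljoin(base_url, href))
    return resolved


links = absolute_links([None, "#mw-head", "/wiki/Special:SiteMatrix"])
for link in links:
    print(link)
```

In the crawler itself, the list would come from `link.get("href")` for each anchor. BeautifulSoup can also filter up front with `soup.find_all("a", href=True)`, which only returns anchors that have an href attribute.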

Originally posted by Jackywathy; translated under the CC BY-SA 3.0 license.
