Can't download a web page: urllib.error.HTTPError: HTTP Error 403: Forbidden?

I want to extract the data from this page:

from urllib.request import urlretrieve
import urllib.request
import random

url = "https://cn.investing.com/indices/hnx-30-components"
opener = urllib.request.build_opener()
ua_list = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0',
           'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
           'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.62',
           'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0',
           'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36 SE 2.X MetaSr 1.0'
           ]
opener.addheaders = [('User-Agent', random.choice(ua_list))]
urllib.request.install_opener(opener)
urlretrieve(url, '/tmp/test.html')

The script cannot open the page, even though it opens fine in a browser:

  File "/usr/local/lib/python3.11/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

How can I fix this?

AI Bot (BETA):

This happens because the server detected that the User-Agent in your HTTP request headers is not one it accepts and rejected the request: it considers your client illegitimate or untrusted, so it returns 403 Forbidden.

You can try to work around this by changing the User-Agent header. User-Agent is the part of the HTTP request headers that tells the server what kind of client is making the request.

You can try code like the following to rotate the User-Agent:

from urllib.request import urlretrieve
import urllib.request
import urllib.error
import random

url = "https://cn.investing.com/indices/hnx-30-components"
ua_list = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.62',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0',
            'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36 SE 2.X MetaSr 1.0'
           ]

random.shuffle(ua_list)  # try each User-Agent at most once, in random order
for ua in ua_list:
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent', ua)]
    try:
        urllib.request.install_opener(opener)
        urlretrieve(url, '/tmp/test.html')
        break  # download succeeded, stop trying
    except urllib.error.HTTPError as e:
        if e.code == 403:  # still 403 Forbidden, try the next User-Agent
            continue
        else:  # any other error: give up
            break

This code tries the User-Agent strings in random order: if the server answers 403 Forbidden it moves on to the next one, until the page downloads successfully or every User-Agent has been tried; any other error stops the loop. Note that if the block is based on the TLS fingerprint rather than the request headers (see the answer below), rotating User-Agent strings alone will not get past it.

1 Answer

The site is blocking you by TLS fingerprint. You can scrape it with the curl_cffi library instead:

import random

from curl_cffi import requests

ua_list = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0',
           'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
           'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.62',
           'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0',
           'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36 SE 2.X MetaSr 1.0'
           ]
headers = {'User-Agent': random.choice(ua_list)}
url = "https://cn.investing.com/indices/hnx-30-components"
# impersonate="chrome110" makes curl_cffi mimic Chrome 110's TLS handshake,
# which is what gets past the fingerprint check
resp = requests.get(url, headers=headers, impersonate="chrome110")
print(resp.status_code)
with open('temp.html', 'wb') as fw:
    fw.write(resp.content)
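
Since the goal is to extract the data rather than just save the HTML, here is a minimal follow-up sketch for pulling the table out of the response. It assumes the components list is served as a plain HTML <table> (if it is rendered by JavaScript instead, see the selenium note below) and that pandas and lxml are installed:

import io

import pandas as pd
from curl_cffi import requests

url = "https://cn.investing.com/indices/hnx-30-components"
resp = requests.get(url, impersonate="chrome110")
assert resp.status_code == 200, resp.status_code

# read_html returns one DataFrame per <table> element found in the document
tables = pd.read_html(io.StringIO(resp.text))
print(f"found {len(tables)} table(s)")
print(tables[0].head())  # assumption: the first table is the components list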

If the page data is loaded asynchronously (rendered by JavaScript after the initial response), use a browser-automation library such as selenium instead, along the lines of the sketch below.
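
A rough sketch, assuming selenium 4 and a local Chrome install (Selenium Manager fetches a matching driver automatically); the output file name is illustrative:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://cn.investing.com/indices/hnx-30-components")
    # page_source holds the DOM after JavaScript has run
    with open('temp_selenium.html', 'w', encoding='utf-8') as fw:
        fw.write(driver.page_source)
finally:
    driver.quit()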
