想提取这个网页的数据
from urllib.request import urlretrieve
import urllib
import random
url="https://cn.investing.com/indices/hnx-30-components"
opener = urllib.request.build_opener()
ua_list = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.62',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36 SE 2.X MetaSr 1.0'
]
opener.addheaders = [('User-Agent', random.choice(ua_list))]
urllib.request.install_opener(opener)
urlretrieve(url, '/tmp/test.html')
网页无法打开,浏览器可以打开
File "/usr/local/lib/python3.11/urllib/request.py", line 643, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
请问,如何解决?
被TLS指纹反爬虫了,可以用curl_cffi库爬
如果页面数据是异步加载的,还是用selenium这类库爬吧