Target page:
https://www.investing.com/equities/hoa-phat-group-jsc-ratios
I want to extract the table that contains the P/E Ratio.
My attempt:
import lxml.html
from urllib.request import urlopen
url = "https://www.investing.com/equities/hoa-phat-group-jsc-ratios"
file= urlopen(url).read()
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
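In the browser's Network tab there is a long list of GET and POST requests. I tested many of them but could not work out which request actually returns the data.
How can I extract the data?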
curl https://www.investing.com/equities/hoa-phat-group-jsc-ratios
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access this resource.</p>
<script defer src="https://static.cloudflareinsights.com/beacon.min.js/vb26e4fa9e5134444860be286fd8771851679335129114" integrity="sha512-M3hN/6cva/SjwrOtyXeUa5IuCT0sedyfT+jK/OV+s+D0RnzrTfwjwJHhd+wYfMm9HJSrZ1IKksOdddLuN6KOzw==" data-cf-beacon='{"rayId":"7b1d7f836ae31e61","version":"2023.3.0","b":1,"token":"00ab903b5e184b1a9d53b0a7a5085300","si":100}' crossorigin="anonymous"></script>
</body></html>
GET https://www.investing.com/equities/hoa-phat-group-jsc-ratios
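It is simply the request for the current URL. The page is server-side rendered, so the response to this request already contains the data you need.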
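A minimal sketch of issuing that GET from Python. By default urllib identifies itself as `Python-urllib/x.y`, which some sites reject, so sending browser-like headers is a common first thing to try. Whether these headers are enough for this site is an assumption: the 403 response shown above includes a Cloudflare script, and the site may apply further bot checks, in which case the call still raises `HTTPError`. The header values and the helper names `build_request` and `fetch_html` are illustrative only.

```python
from urllib.request import Request, urlopen

URL = "https://www.investing.com/equities/hoa-phat-group-jsc-ratios"

# Assumed browser-like headers; the exact set the site accepts is not known.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Build a GET request for the page that carries browser-like headers."""
    return Request(url, headers=dict(headers))


def fetch_html(url):
    """Fetch the page and return the body as text.

    Raises urllib.error.HTTPError if the server still answers 403.
    """
    with urlopen(build_request(url), timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Example call (performs a real network request, so it is left commented out):
# html_text = fetch_html(URL)
```

If the request is still refused, the same headers can be compared against the ones the browser sends for this URL in the Network tab.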
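So all you need to do is parse the HTML in that response. You could use the scrapy framework to handle the whole process, or beautifulsoup to parse just the HTML response body. Additional code snippet: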
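A minimal sketch of the parsing step, not the answerer's own snippet. It uses the standard library's `html.parser` as a stand-in for beautifulsoup so that it runs without extra installs. It also assumes that the ratios sit in a plain, non-nested `<table>` whose cells contain the text "P/E Ratio". The names `TableCollector`, `find_tables_with_label` and `SAMPLE_HTML` are hypothetical, and the numbers in the sample are placeholders rather than real figures.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # rows of the table currently open
        self._row = None    # cells of the row currently open
        self._cell = None   # text fragments of the cell currently open

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None and self._table is not None:
            self._table.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif self._table is None:
            return
        elif tag == "tr":
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()
        elif tag == "table" and self._table is not None:
            self._close_row()
            self.tables.append(self._table)
            self._table = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def find_tables_with_label(html_text, label="P/E Ratio"):
    """Return every table in which at least one cell contains `label`."""
    parser = TableCollector()
    parser.feed(html_text)
    parser.close()
    return [
        table for table in parser.tables
        if any(label in cell for row in table for cell in row)
    ]


# Placeholder markup with made-up values, used only to exercise the parser.
SAMPLE_HTML = """
<table><tr><th>Name</th><th>Company</th><th>Industry</th></tr>
<tr><td>P/E Ratio <i>TTM</i></td><td>1.0</td><td>2.0</td></tr>
<tr><td>Price to Sales <i>TTM</i></td><td>3.0</td><td>4.0</td></tr></table>
<table><tr><td>Unrelated</td></tr></table>
"""

for row in find_tables_with_label(SAMPLE_HTML)[0]:
    print(row)
```

For the real page, pass the decoded response body to `find_tables_with_label` in place of `SAMPLE_HTML`. With beautifulsoup installed, the same idea reduces to iterating over `soup.find_all("table")` and keeping the tables whose text contains the label.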