I am using Python's requests library to scrape the table data at http://app1.sfda.gov.cn/datas..., but the content requests returns differs from what I see in the browser. Here is my code:
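import requests

def testLoadRequest():
    params1 = {
        'tableId': '27',
        'tableName': 'TABLE27',
        # requests percent-encodes parameter values itself, so pass the raw
        # GBK bytes (%BD%F8%BF%DA%C6%F7%D0%B5) instead of a pre-encoded string,
        # whose '%' characters would otherwise be encoded a second time.
        'tableView': '进口器械'.encode('gbk'),
        'Id': '24583'
    }
    headers1 = {
        'Content-Type': "text/html;encoding=gbk",
        'X-Requested-With': 'XMLHttpRequest'
    }
    url1 = 'http://app1.sfda.gov.cn/datasearch/face3/content.jsp'
    try:
        r = requests.get(url1, params=params1, headers=headers1)
        print(r.text)         # response body
        print(r.cookies)      # cookies set by the server
        print(r.status_code)  # HTTP status code
        print(r.url)          # final URL, including the encoded query string
    except requests.RequestException as e:
        # Network-level failures (e.g. connection reset) end up here
        print(e)

testLoadRequest()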
Below is what the browser shows:
But the HTML fetched with requests looks like this:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta http-equiv="Cache-Control" content="no-store, no-cache, must-revalidate, post-check=0, pre-check=0"/>
<meta http-equiv="Connection" content="Close"/>
<script type="text/javascript">function stringToHex(str) {
var val = "";
for (var i = 0; i < str.length; i++) {
if (val == "")val = str.charCodeAt(i).toString(16); else val += str.charCodeAt(i).toString(16);
}
return val;
}
function YunSuoAutoJump() {
var width = screen.width;
var height = screen.height;
var screendate = width + "," + height;
var curlocation = window.location.href;
if (-1 == curlocation.indexOf("security_verify_")) {
document.cookie = "srcurl=" + stringToHex(window.location.href) + ";path=/;";
}
self.location = "/datasearch/face3/content.jsp?tableView=½ø¿ÚÆ÷е&Id=24583&tableName=TABLE27&tableId=27&security_verify_data=" + stringToHex(screendate);
}</script>
<script>setTimeout("YunSuoAutoJump()", 50);</script>
</head>
</html>
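Clearly the scraped content is not the table data. On top of that, the request sometimes fails outright with
('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
Does anyone know the cause of this error? I would appreciate it if someone could point me in the right direction. Thanks.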
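For reference, the page returned above is not the table but a JavaScript challenge: `YunSuoAutoJump` stores the hex-encoded current URL in a `srcurl` cookie, then reloads the same URL with an extra `security_verify_data` parameter holding the hex-encoded screen size. The following is a rough Python sketch of those two steps only. The function names and the 1920x1080 screen size are assumptions made for illustration, and whether the server accepts a replayed request built this way has not been verified.

```python
from urllib.parse import urlsplit, urlunsplit


def string_to_hex(s):
    """Port of the page's stringToHex: each character code in hex, unpadded.

    Matches the JavaScript version for characters in the Basic Multilingual
    Plane, which covers the URLs and "width,height" strings used here.
    """
    return "".join(format(ord(ch), "x") for ch in s)


def build_verify_request(url, width=1920, height=1080):
    """Return (cookies, follow_up_url) imitating YunSuoAutoJump for `url`.

    width/height stand in for screen.width/screen.height (assumed values).
    """
    cookies = {}
    # The script only sets srcurl when not already on a verification URL.
    if "security_verify_" not in url:
        cookies["srcurl"] = string_to_hex(url)

    token = string_to_hex(f"{width},{height}")
    parts = urlsplit(url)
    separator = "&" if parts.query else ""
    query = f"{parts.query}{separator}security_verify_data={token}"
    follow_up = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, query, parts.fragment)
    )
    return cookies, follow_up


# string_to_hex("1920,1080") -> "313932302c31303830"
```

With requests, the result could then be tried as `session.get(follow_up, cookies=cookies)` on a `requests.Session()`, passing `r.url` from the first response as `url`. Treat that as an experiment rather than a known fix.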
I tested this for you. The problem is with the request target,
url1
After I switched to a different link, the scrape succeeded.
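As for the intermittent `Connection reset by peer` mentioned in the question, the thread does not establish its cause. If the resets are transient, retrying with a short pause may be enough. If the server is deliberately dropping the client, retrying will not help. A minimal sketch of such a retry wrapper follows; `get_with_retry` and its parameters are hypothetical names, not part of requests.

```python
import time


def get_with_retry(fetch, attempts=3, delay=1.0):
    """Call fetch() up to `attempts` times, pausing `delay` seconds after a failure.

    OSError is caught because both the built-in ConnectionResetError and
    requests.exceptions.RequestException derive from it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as exc:
            last_error = exc
            # Sleep only if another attempt is still to come.
            if attempt < attempts - 1:
                time.sleep(delay)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

With the code from the question, this could be called as `get_with_retry(lambda: requests.get(url1, params=params1, headers=headers1, timeout=10))`.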