import re
import urllib.request
from bs4 import BeautifulSoup
import ssl
ssl._create_default_https_context = ssl._create_stdlib_context

def getcontent(url, page):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
    req = urllib.request.Request(url=url, headers=headers)
    res = urllib.request.urlopen(req).read().decode('utf-8')
    print(res)

for i in range(1):
    url = 'http://www.qiushibaike.com/8hr/page/' + str(i) + '/'
    getcontent(url, i)

Could someone more experienced tell me where I went wrong? The page I fetch is always Qiushibaike's error page:

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>糗百君的飞船出了一点小毛病</title>
<style>
body {
position: absolute;
top: 0;
right: 0;
bottom: 0;
left: 0;
background-color: #2e344a;
color: #fff;
font-size: 14px;
text-align: center;
font-family: arial, sans-serif;
}
.dialog {
position: absolute;
left: 50%;
bottom: 100px;
margin-left: -120px;
text-align: center;
z-index: 100;
}
h1 {
font-size: 16px;
color: #fff;
line-height: 1.5em;
/* padding-top: 360px; */
}
a:link { text-decoration: none; color: #ff9900 }
a:active { text-decoration:blink }
a:hover { text-decoration: none; color: #ff9900 }
a:visited { text-decoration: none; color: #ff9900 }
</style>
</head>
<body>
<iframe width="100%" height="100%" allowtransparency="true" style="background-color:transparent" frameborder="0" src="https://editor.3dpunk.com/editor3?oid=yE51u114009rKS7F&mode=1&transparent=1&startMovie=0&zoom=0&showLoading=0&toolBar=0"></iframe>
<div class="dialog">
<h1>糗百君的飞船出了一点小毛病……</h1>
<p>莫慌, 点击<a href="https://www.qiushibaike.com"> 这里</a> 可以找到出路</p>
</div>
<script>
var _hmt = _hmt || [];
(function () {
var hm = document.createElement("script");
hm.src = "//hm.baidu.com/hm.js?18a964a3eb14176db6e70f1dd0a3e557";
var s = document.getElementsByTagName("script")[0];
s.parentNode.insertBefore(hm, s);
})();
</script>
</body>
</html>
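
This is a small mistake of detail: in Python, range starts counting from 0, so range(1) yields only 0 and the URL you end up requesting is http://www.qiushibaike.com/8hr/page/0/, a page that does not exist. The loop should start from 1: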
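
A minimal sketch of that fix, assuming (as stated above) that the listing pages are numbered from 1. The helper names `page_urls` and `crawl` and the constants `BASE_URL` and `HEADERS` are hypothetical, introduced only for illustration; the request logic is the same as in the question, and whether the site answers these URLs today is not something this sketch can guarantee.

```python
import urllib.request

BASE_URL = 'http://www.qiushibaike.com/8hr/page/'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}


def page_urls(last_page):
    """Build the listing URLs for pages 1..last_page (hypothetical helper)."""
    # range(1, last_page + 1) starts at 1, so the nonexistent page 0 is skipped.
    return [BASE_URL + str(i) + '/' for i in range(1, last_page + 1)]


def getcontent(url):
    """Fetch one page and return its HTML as text."""
    req = urllib.request.Request(url=url, headers=HEADERS)
    return urllib.request.urlopen(req).read().decode('utf-8')


def crawl(last_page):
    """Fetch and print every page from 1 to last_page."""
    for url in page_urls(last_page):
        print(getcontent(url))
```

Calling `crawl(1)` would request only `.../8hr/page/1/`, which corresponds to changing `range(1)` to `range(1, 2)` in the original loop.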