I'm writing a scraper to collect the PDF links from a search results page so I can batch-download the files. The site is PMC (https://www.ncbi.nlm.nih.gov/pmc).
I search for a keyword such as "human placenta". Each result item shows text like "PDF–932K", and I want to extract the link behind it. I can already get all the links on the first results page, and I've set the page to display 100 records at a time. But when I try to fetch the next page, no matter what parameters I add to the POST data, it doesn't work: the page I get back is https://www.ncbi.nlm.nih.gov/
That is, I end up scraping the site's home page, and I don't understand why. The first page of results works fine, but from the second page onward all I get is the home page. Code below:
import re
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup, NavigableString, Tag
def listWriteToFile(baseDir, fileName, data):
    path = baseDir + fileName
    with open(path, "a") as f:
        f.write(data)

def saveFile(baseDir, fileName, data):
    path = baseDir + fileName
    with open(path, "wb") as f:
        f.write(data)
def getContent(stuff):
    text = ''
    if stuff is None:  # was: "stuff == none", a NameError
        return None
    for parse in stuff.descendants:
        if isinstance(parse, NavigableString):
            text += parse
        elif parse.name == 'p':
            text += '\n'
        elif re.match(r"h[0-9]+", parse.name):
            text += '\n'
        elif parse.name == "li":
            text += '\n\t'
    return text
def parseFile(data):
    soup = BeautifulSoup(data, 'html.parser')
    div_links = soup.find_all('div', {'class': "links"})  # always a list, never None
    pdfUrl_list = []
    for links in div_links:
        for child in links.children:
            # children can include NavigableStrings, which have no get()
            if not isinstance(child, Tag):
                continue
            href = child.get('href')
            if href and ".pdf" in href:
                pdfUrl_list.append(href)
    print(pdfUrl_list)
    return pdfUrl_list
# Get the total number of result pages
def getTotalNumber(data):
    soup = BeautifulSoup(data, 'html.parser')
    page_links = soup.find('h3', {'class': "page"})
    totalPage = 0
    if page_links is not None:
        # the pager element carries the last page number in its 'last' attribute
        totalPage = page_links.contents[1]['last']
        print(totalPage)
    return totalPage
webUrl = "https://www.ncbi.nlm.nih.gov/pmc/?term=sulfur"
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0',
    'Referer': 'https://www.ncbi.nlm.nih.gov/pmc',
    'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
}
# Also tried the following parameters, with no luck:
# requestData = urllib.parse.urlencode({
#     'EntrezSystem2.PEntrez.PMC.Pmc_ResultsPanel.Pmc_DisplayBar.PageSize': 100,
#     'EntrezSystem2.PEntrez.PMC.Pmc_ResultsPanel.Pmc_DisplayBar.sPageSize': 100,
#     'EntrezSystem2.PEntrez.PMC.Pmc_ResultsPanel.Pmc_DisplayBar.sPageSize2': 100,
#     'EntrezSystem2.PEntrez.DbConnector.Cmd': 'PageChanged',
#     'EntrezSystem2.PEntrez.PMC.Pmc_ResultsPanel.Entrez_Pager.CurrPage': 1,
#     'EntrezSystem2.PEntrez.PMC.Pmc_ResultsPanel.HistoryDisplay.Cmd': 'PageChanged',
#     'EntrezSystem2.PEntrez.DbConnector.Db': 'PMC'
# })
requestData = urllib.parse.urlencode({
    'EntrezSystem2.PEntrez.PMC.Pmc_ResultsPanel.Pmc_DisplayBar.PageSize': 100,
    'EntrezSystem2.PEntrez.PMC.Pmc_ResultsPanel.Pmc_DisplayBar.sPageSize': 100,
    'EntrezSystem2.PEntrez.PMC.Pmc_ResultsPanel.Pmc_DisplayBar.sPageSize2': 100,
    'EntrezSystem2.PEntrez.PMC.Pmc_ResultsPanel.Entrez_Pager.CurrPage': 1
})
requestData = requestData.encode('utf-8')
req = urllib.request.Request(webUrl, data=requestData, headers=headers)
res = urllib.request.urlopen(req)
webData = res.read()
saveFile("/home/yang/Documents/PMCSpider/data/sulfur/", "web", webData)
pdfUrl_list = parseFile(webData)
listWriteToFile("/home/yang/Documents/PMCSpider/data/sulfur/", "list", str(pdfUrl_list))
I just had a look at the site the OP is scraping.

Since the next-page request is a POST, start by inspecting what gets POSTed and look for a pattern in the submitted fields. To move to the next page, you can then POST the corresponding content directly.

Update:

After a closer look: the search uses a POST request whose main fields are:
1. the search keyword;
2. the sort order, current page number, results per page, and similar settings.

Example:
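As a concrete illustration, here is a sketch of what a page-change payload might look like. The field names are taken from the OP's own code (including the commented-out attempt); whether this exact set of fields is what PMC's pager actually requires has to be verified against the browser's network tab, so treat it as an assumption, not a confirmed recipe.

```python
import urllib.parse

def buildPagePayload(page, page_size=100):
    """Build an encoded POST body for requesting a given results page.

    Field names mirror those in the OP's code; the real required set
    should be confirmed by inspecting the browser's POST request.
    """
    fields = {
        'EntrezSystem2.PEntrez.PMC.Pmc_ResultsPanel.Pmc_DisplayBar.PageSize': page_size,
        'EntrezSystem2.PEntrez.PMC.Pmc_ResultsPanel.Pmc_DisplayBar.sPageSize': page_size,
        'EntrezSystem2.PEntrez.PMC.Pmc_ResultsPanel.Entrez_Pager.CurrPage': page,
        # pager command fields from the OP's commented-out attempt:
        'EntrezSystem2.PEntrez.DbConnector.Cmd': 'PageChanged',
        'EntrezSystem2.PEntrez.DbConnector.Db': 'PMC',
    }
    return urllib.parse.urlencode(fields).encode('utf-8')

# Payload for page 2; pass this as the data= argument of the request.
payload = buildPagePayload(2)
print(payload)
```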
So to fetch each page, the OP can simply POST the corresponding fields; exactly what each POSTed field should contain still needs careful inspection.
Also, some things worth considering:
1. When POSTing, could the per-page count be set to a larger value, so that fewer requests are needed?
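One more thing worth checking, purely as a hypothesis: NCBI ties result paging to a server-side session, and the OP's code opens a fresh connection per request without carrying cookies forward, which is one plausible reason page 2 falls back to the home page. Below is a sketch of a paging loop that reuses a single cookie-aware opener; the payload fields are just the ones from the OP's code and are not verified to be sufficient.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

webUrl = "https://www.ncbi.nlm.nih.gov/pmc/?term=sulfur"

# One opener shared across requests, so session cookies set by the
# first response are sent back with every subsequent page request.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(CookieJar()))

def fetchPage(page):
    # Field names copied from the OP's code; verify against the
    # browser's network tab before relying on them.
    data = urllib.parse.urlencode({
        'EntrezSystem2.PEntrez.PMC.Pmc_ResultsPanel.Pmc_DisplayBar.PageSize': 100,
        'EntrezSystem2.PEntrez.PMC.Pmc_ResultsPanel.Entrez_Pager.CurrPage': page,
    }).encode('utf-8')
    req = urllib.request.Request(webUrl, data=data,
                                 headers={'User-Agent': 'Mozilla/5.0'})
    with opener.open(req) as res:
        return res.read()

# Usage, together with the OP's helpers (not run here):
# for page in range(1, int(getTotalNumber(fetchPage(1))) + 1):
#     parseFile(fetchPage(page))
```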