1. I'm scraping a website. The program itself has no problems and has been tested, but when I scrape with multiple threads it starts out fast: whether I use 10 threads or 30, the first 400 or so pages come back quickly. After roughly page 400 (an approximate figure, not an exact one), it noticeably slows down.
2. I tried proxies and random User-Agents, but the situation did not improve. I really can't understand the logic behind this and hope someone can help explain it.
3. Here is my complete code:
# -*- coding: utf-8 -*-
import re
import time
import random
import openpyxl  # used by pandas as the Excel writer engine
import requests
import pandas
import traceback
import threading
from lxml import etree
from ip_proxy import user_agent
from ip_proxy import page_parse_xici


def get_html(url):
    try:
        headers = random.choice(agent_list)
        proxies = random.choice(proxies_list)
        req = requests.get(url, headers=headers, proxies=proxies)
        req.encoding = "utf-8"
        html = etree.HTML(req.text)
        return html
    except Exception:
        traceback.print_exc()
        return None  # signal failure; callers must check for None


def get_page_list():
    for i in range(1, 15355):
        url = 'http://www.lianhanghao.com/index.php/Index/index/p/{}.html'.format(i)
        page_list.append(url)


def page_list_parse():
    while True:
        # pop while still holding the lock; checking len() and popping
        # outside the lock is a race between threads
        lock.acquire()
        if len(page_list) == 0:
            lock.release()
            break
        url = page_list.pop()
        lock.release()
        html = get_html(url)
        if html is None:
            continue  # request failed, skip this page
        cnaps_list = html.xpath('//td[@align="center"]/text()')
        opening_bank_list = html.xpath('//tbody/tr/td[2]/text()')
        telephone_list = [href.replace("|", " ") for href in html.xpath('//tbody/tr/td[3]/text()')]
        address_list = [href.replace("扫一扫免费领取秒到个人POS机", " ") for href in html.xpath('//tbody/tr/td[4]//text()')]
        # each page has 10 rows; drop the extra blank cells left by the replaces
        if len(telephone_list) > 10:
            for i in range(10, len(telephone_list)):
                telephone_list.remove(" ")
        if len(address_list) > 10:
            for x in range(10, len(address_list)):
                address_list.remove(" ")
        print(url)
        if len(opening_bank_list) != len(telephone_list):
            print(url)
        for cnaps, opening_bank, telephone, address in zip(cnaps_list, opening_bank_list, telephone_list, address_list):
            info = {
                "行号": cnaps,
                "开户银行": opening_bank,
                "电话": telephone,
                "地址": address,
            }
            infos.append(info)
        lock.acquire()
        # to_excel has no append mode; this rewrites the whole file with
        # everything collected so far (the original's second argument "a"
        # was being taken as the sheet name, not an append flag)
        df = pandas.DataFrame(infos)
        df.to_excel("f:/lianhanghao.xlsx", index=False)
        lock.release()


def main():
    for i in range(30):
        th = threading.Thread(target=page_list_parse)
        th.start()


if __name__ == "__main__":
    agent_list = user_agent()
    proxies_list = page_parse_xici()
    infos = []
    page_list = []
    get_page_list()
    lock = threading.Lock()
    main()
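By the way, the standard library's queue.Queue makes the lock-and-pop pattern above unnecessary, since its operations are thread-safe on their own. A minimal sketch of the same worker setup under that assumption (the worker body is elided and the names are illustrative, not from the original code):

import queue
import threading

page_queue = queue.Queue()  # thread-safe; no manual lock needed
for i in range(1, 15355):
    page_queue.put('http://www.lianhanghao.com/index.php/Index/index/p/{}.html'.format(i))

def worker():
    while True:
        try:
            url = page_queue.get_nowait()  # raises queue.Empty when drained
        except queue.Empty:
            break
        # ... fetch and parse url here ...
        page_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(30)]
for th in threads:
    th.start()
for th in threads:
    th.join()

Here get_nowait() raising queue.Empty replaces the manual length check under the lock, and join() lets the main thread wait for all workers to finish.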
Hope someone can help me sort this out. Thanks!
This is probably related to the site's caching. You can use curl to time pages from different ranges.
As above; when I tested it here, there was a clear difference between ranges.
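To make that test concrete, here is a minimal Python timing sketch along the same lines as the curl suggestion (the page numbers 10 and 450 are just example points from the fast and slow ranges):

import time
import requests

# time one page from the "fast" range and one from the "slow" range
for page in (10, 450):
    url = 'http://www.lianhanghao.com/index.php/Index/index/p/{}.html'.format(page)
    start = time.time()
    requests.get(url, timeout=30)
    print('page {}: {:.2f}s'.format(page, time.time() - start))

If the later pages are consistently slower even for a single sequential request like this, the bottleneck is on the server side rather than in your threading code.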