Why does my multithreaded Python scraper start fast and then slow down?

1. I was scraping a website. The program itself has no problems and has been tested, but when I scrape with multiple threads it starts out fast: whether I use 10 threads or 30, the first ~400 pages come back quickly. After roughly page 400 (an approximate number, not an exact one) it noticeably slows down.
2. I tried proxies and random User-Agents, but the situation did not improve. I can't work out the logic behind this, and I hope someone can help explain it.
3. My full code is below:

# -*- coding: utf-8 -*-

import re
import time
import random
import openpyxl
import requests
import pandas
import traceback
import threading
from lxml import etree
from ip_proxy import user_agent
from ip_proxy import page_parse_xici

def get_html(url):
    try:
        headers = random.choice(agent_list)
        proxies = random.choice(proxies_list)
        # a timeout keeps a dead proxy from blocking a thread indefinitely
        req = requests.get(url, headers=headers, proxies=proxies, timeout=10)
        req.encoding = "utf-8"
        html = etree.HTML(req.text)
        return html
    except:
        traceback.print_exc()

def get_page_list():
    for i in range(1, 15355):
        url = 'http://www.lianhanghao.com/index.php/Index/index/p/{}.html'.format(i)
        page_list.append(url)


def page_list_parse():
    while True:
        # the empty check and the pop must happen under the same lock
        # acquisition; otherwise another thread can empty the list in between
        lock.acquire()
        if len(page_list) == 0:
            lock.release()
            break
        url = page_list.pop()
        lock.release()
        html = get_html(url)
        if html is None:  # get_html returns None on a failed request
            continue
        cnaps_list = html.xpath('//td[@align="center"]/text()')
        opening_bank_list = html.xpath('//tbody/tr/td[2]/text()')
        telephone_list = [href.replace("|", "  ") for href in html.xpath('//tbody/tr/td[3]/text()')]
        # strip the ad string injected into the address column
        address_list = [href.replace("扫一扫免费领取秒到个人POS机", " ") for href in html.xpath('//tbody/tr/td[4]//text()')]
        if len(telephone_list) > 10:
            for i in range(10, len(telephone_list)):
                telephone_list.remove("    ")
        if len(address_list) > 10:
            for x in range(10, len(address_list)):
                address_list.remove(" ")
        print(url)
        if len(opening_bank_list) != len(telephone_list):
            print(url)
        for cnaps, opening_bank, telephone, address in zip(cnaps_list, opening_bank_list, telephone_list, address_list):
            info = {
                "行号": cnaps,
                "开户银行": opening_bank,
                "电话": telephone,
                "地址": address,
            }
            infos.append(info)
        lock.acquire()
        # note: this rewrites the whole file on every page; to_excel's second
        # positional argument is the sheet name, not an append mode
        df = pandas.DataFrame(infos)
        df.to_excel("f:/lianhanghao.xlsx", sheet_name="a")
        lock.release()

def main():
    for i in range(30):
        th = threading.Thread(target=page_list_parse)
        th.start()

if __name__ == "__main__":
    agent_list = user_agent()
    proxies_list = page_parse_xici()
    infos = []
    page_list = []
    get_page_list()
    lock = threading.Lock()
    main()
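As an aside on the code above: the list-plus-lock hand-off can also be written with the stdlib `queue.Queue`, whose `get_nowait()` is atomic, so no manual lock is needed around the check-and-pop. A minimal sketch of that pattern (the worker body is reduced to a placeholder instead of the real fetch-and-parse):

```python
# Sketch: replacing the shared list + lock with queue.Queue.
# get_nowait() atomically removes an item or raises queue.Empty.
import queue
import threading

page_queue = queue.Queue()
results = []

def worker():
    while True:
        try:
            url = page_queue.get_nowait()  # atomic check-and-pop
        except queue.Empty:
            break
        # placeholder for get_html(url) + parsing; here we just record the url
        results.append(url)
        page_queue.task_done()

# small range for illustration; the question uses range(1, 15355)
for i in range(1, 11):
    page_queue.put('http://www.lianhanghao.com/index.php/Index/index/p/{}.html'.format(i))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

This removes the race between checking `len(page_list)` and popping, though it does not by itself explain the slowdown.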

I hope someone can help me solve this. Thanks.

1 answer

This is probably related to the site's caching. You can use curl to measure response times for pages in different ranges:

➜  ~ curl -o /dev/null -s -w "time_connect: %{time_connect}\ntime_starttransfer: %{time_starttransfer}\ntime_total: %{time_total}\n" "http://www.lianhanghao.com/index.php/Index/index/p/1.html"

time_connect: 0.074130
time_starttransfer: 0.275095
time_total: 0.397056

➜  ~ curl -o /dev/null -s -w "time_connect: %{time_connect}\ntime_starttransfer: %{time_starttransfer}\ntime_total: %{time_total}\n" "http://www.lianhanghao.com/index.php/Index/index/p/1000.html"

time_connect: 0.067994
time_starttransfer: 0.313416
time_total: 1.221437

As shown above, there is a clear difference in my test.
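The same measurement can be reproduced from Python with the standard library alone; the URL pattern below is taken from the question's code, and the elapsed time corresponds roughly to curl's `time_total`:

```python
# Time a single listing page, mirroring the curl test above.
import time
from urllib.request import urlopen

# URL pattern from the question's code
PAGE_URL = "http://www.lianhanghao.com/index.php/Index/index/p/{}.html"

def time_page(page, timeout=30):
    """Fetch one listing page and return elapsed wall-clock seconds."""
    url = PAGE_URL.format(page)
    start = time.perf_counter()
    urlopen(url, timeout=timeout).read()
    return time.perf_counter() - start

# Usage (hits the network):
# for p in (1, 400, 1000):
#     print("page {}: {:.3f}s".format(p, time_page(p)))
```

If deep pages are consistently slower even single-threaded, the bottleneck is server-side, not in your threading code.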
