How can I speed up a Python crawler?

I'm writing a Python crawler, and single-threaded urllib is far too slow to meet my data volume requirement (on the order of 100,000 pages). What methods can improve crawling efficiency?

2 Answers
  1. Use a dedicated framework: Scrapy (see the sketch after this list)

    https://www.runoob.com/w3cnote/scrapy-detail.html
  2. Use multithreading, multiprocessing, or coroutines.
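
For item 1, here is a minimal sketch of what a Scrapy spider looks like (the spider name, start URL, and settings below are illustrative placeholders, not from the original answer); Scrapy handles concurrent request scheduling, retries, and throttling out of the box:

import scrapy


class PageSpider(scrapy.Spider):
    # All names and URLs here are illustrative placeholders.
    name = "pages"
    start_urls = ["http://httpbin.org/get"]
    # Scrapy fetches many pages concurrently; tune the limit per project.
    custom_settings = {"CONCURRENT_REQUESTS": 32}

    def parse(self, response):
        # Replace with the fields your pages actually contain.
        yield {"url": response.url, "length": len(response.text)}

Save it as spider.py and run it with: scrapy runspider spider.py -o out.json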

Multithreaded version:

import threading
import time
import requests


def fetch():
    # Each thread issues one blocking HTTP request.
    r = requests.get('http://httpbin.org/get')
    print(r.text)


t1 = time.time()

# Start 100 worker threads, then wait for all of them to finish.
t_list = []
for i in range(100):
    t = threading.Thread(target=fetch, args=())
    t_list.append(t)
    t.start()

for t in t_list:
    t.join()

print("Multithreaded crawler elapsed:", time.time() - t1)

Multiprocess version:

import requests
import time
import multiprocessing
from multiprocessing import Pool

# Use one worker process per CPU core.
MAX_WORKER_NUM = multiprocessing.cpu_count()


def fetch():
    r = requests.get('http://httpbin.org/get')
    print(r.text)


if __name__ == '__main__':
    t1 = time.time()
    p = Pool(MAX_WORKER_NUM)
    # Submit 100 jobs, then wait for the pool to drain.
    for i in range(100):
        p.apply_async(fetch, args=())
    p.close()
    p.join()

    print('Multiprocess crawler elapsed:', time.time() - t1)
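
One caveat: crawling is I/O-bound, so extra processes mostly pay serialization and startup overhead without downloading any faster. If you prefer the Pool API, the standard library's multiprocessing.dummy exposes the same interface backed by threads; a minimal sketch under that assumption:

import time

import requests
from multiprocessing.dummy import Pool  # same API as Pool, backed by threads


def fetch(url):
    return requests.get(url).text


if __name__ == '__main__':
    t1 = time.time()
    with Pool(32) as p:
        # map blocks until all 100 requests have completed.
        pages = p.map(fetch, ['http://httpbin.org/get'] * 100)
    print('Thread-backed Pool elapsed:', time.time() - t1)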

Coroutine version:

import aiohttp
import asyncio
import time


async def fetch(client):
    async with client.get('http://httpbin.org/get') as resp:
        assert resp.status == 200
        return await resp.text()


async def main():
    # Share one session across all requests and run them concurrently.
    # (The original created 100 tasks but only awaited a single main(),
    # so most of them never ran to completion.)
    async with aiohttp.ClientSession() as client:
        results = await asyncio.gather(*(fetch(client) for _ in range(100)))
        for html in results:
            print(html)


t1 = time.time()

asyncio.run(main())

print("aiohttp crawler elapsed:", time.time() - t1)

  1. Multithreading / multiprocessing
  2. Asynchronous programming
  3. Distributed crawling (see the sketch below)
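
On point 3: the usual pattern for a distributed crawler is a shared URL queue that workers on many machines pull from. A minimal sketch using a Redis list as that queue (the host, queue name, and seed URL are hypothetical; for Scrapy projects, scrapy-redis packages this same pattern):

import redis
import requests

# Hypothetical shared queue: run this worker on any number of machines,
# all pointed at the same Redis instance.
r = redis.Redis(host='localhost', port=6379)
QUEUE = 'crawler:urls'  # illustrative queue name

# Seed it once from the coordinating machine, e.g.:
# r.lpush(QUEUE, 'http://httpbin.org/get')

while True:
    # Block until a URL is available, then pop it atomically so no two
    # workers ever fetch the same URL.
    item = r.brpop(QUEUE, timeout=30)
    if item is None:
        break  # queue stayed empty for 30s; assume we're done
    _, url = item
    resp = requests.get(url.decode())
    print(url.decode(), resp.status_code)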
