How can I speed up a Python crawler?

I'm writing a Python crawler, and single-threaded urllib is far too slow to meet my data volume requirement (on the order of 100,000 pages). What methods can improve crawling efficiency?

2 Answers
  1. Use a dedicated framework: Scrapy (see the sketch after this list)

    https://www.runoob.com/w3cnote/scrapy-detail.html
  2. Use multithreading, multiprocessing, or coroutines.
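
For item 1, here is a minimal sketch of what a Scrapy spider looks like (the spider name, start URL, and settings below are illustrative placeholders, not from the original answer); Scrapy handles concurrent request scheduling, retries, and throttling out of the box:

import scrapy


class PageSpider(scrapy.Spider):
    # All names and URLs here are illustrative placeholders.
    name = "pages"
    start_urls = ["http://httpbin.org/get"]
    # Scrapy fetches many pages concurrently; tune the limit per project.
    custom_settings = {"CONCURRENT_REQUESTS": 32}

    def parse(self, response):
        # Replace with the fields your pages actually contain.
        yield {"url": response.url, "length": len(response.text)}

Save it as spider.py and run it with: scrapy runspider spider.py -o out.json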

Multithreaded version:

import threading
import time
import requests


def fetch():
    # Each thread issues one blocking HTTP request.
    r = requests.get('http://httpbin.org/get')
    print(r.text)


t1 = time.time()

# Start 100 worker threads, then wait for all of them to finish.
t_list = []
for i in range(100):
    t = threading.Thread(target=fetch, args=())
    t_list.append(t)
    t.start()

for t in t_list:
    t.join()

print("Multithreaded crawler elapsed:", time.time() - t1)

Multiprocess version:

import requests
import time
import multiprocessing
from multiprocessing import Pool

# Use one worker process per CPU core.
MAX_WORKER_NUM = multiprocessing.cpu_count()


def fetch():
    r = requests.get('http://httpbin.org/get')
    print(r.text)


if __name__ == '__main__':
    t1 = time.time()
    p = Pool(MAX_WORKER_NUM)
    # Submit 100 jobs, then wait for the pool to drain.
    for i in range(100):
        p.apply_async(fetch, args=())
    p.close()
    p.join()

    print('Multiprocess crawler elapsed:', time.time() - t1)
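
One caveat: crawling is I/O-bound, so extra processes mostly pay serialization and startup overhead without downloading any faster. If you prefer the Pool API, the standard library's multiprocessing.dummy exposes the same interface backed by threads; a minimal sketch under that assumption:

import time

import requests
from multiprocessing.dummy import Pool  # same API as Pool, backed by threads


def fetch(url):
    return requests.get(url).text


if __name__ == '__main__':
    t1 = time.time()
    with Pool(32) as p:
        # map blocks until all 100 requests have completed.
        pages = p.map(fetch, ['http://httpbin.org/get'] * 100)
    print('Thread-backed Pool elapsed:', time.time() - t1)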

Coroutine version:

import aiohttp
import asyncio
import time


async def fetch(client):
    async with client.get('http://httpbin.org/get') as resp:
        assert resp.status == 200
        return await resp.text()


async def main():
    # Share one session across all requests and run them concurrently.
    # (The original created 100 tasks but only awaited a single main(),
    # so most of them never ran to completion.)
    async with aiohttp.ClientSession() as client:
        results = await asyncio.gather(*(fetch(client) for _ in range(100)))
        for html in results:
            print(html)


t1 = time.time()

asyncio.run(main())

print("aiohttp crawler elapsed:", time.time() - t1)

  1. Multithreading / multiprocessing
  2. Asynchronous programming
  3. Distributed crawling (see the sketch below)
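
On point 3: the usual pattern for a distributed crawler is a shared URL queue that workers on many machines pull from. A minimal sketch using a Redis list as that queue (the host, queue name, and seed URL are hypothetical; for Scrapy projects, scrapy-redis packages this same pattern):

import redis
import requests

# Hypothetical shared queue: run this worker on any number of machines,
# all pointed at the same Redis instance.
r = redis.Redis(host='localhost', port=6379)
QUEUE = 'crawler:urls'  # illustrative queue name

# Seed it once from the coordinating machine, e.g.:
# r.lpush(QUEUE, 'http://httpbin.org/get')

while True:
    # Block until a URL is available, then pop it atomically so no two
    # workers ever fetch the same URL.
    item = r.brpop(QUEUE, timeout=30)
    if item is None:
        break  # queue stayed empty for 30s; assume we're done
    _, url = item
    resp = requests.get(url.decode())
    print(url.decode(), resp.status_code)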
