
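I have written a script in Python with Selenium that scrapes the links of different posts from a landing page and then fetches the title of each post by following the URL to its inner page. The content I parse here is static, but I used Selenium anyway to see how it behaves with multiprocessing.

My goal is to scrape using multiprocessing. Until now I believed that Selenium does not support multiprocessing, but it seems I was wrong.

My question: how can I reduce the execution time with Selenium when it runs under multiprocessing?

This is my attempt (it works):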

import requests
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver

def get_links(link):
  # Collect absolute URLs of the question pages linked from the landing page.
  res = requests.get(link)
  soup = BeautifulSoup(res.text,"lxml")
  titles = [urljoin(link,items.get("href")) for items in soup.select(".summary .question-hyperlink")]
  return titles

def get_title(url):
  # Start a headless Chrome for this single URL and print the post title.
  chromeOptions = webdriver.ChromeOptions()
  chromeOptions.add_argument("--headless")
  driver = webdriver.Chrome(options=chromeOptions)  # `chrome_options=` is deprecated
  driver.get(url)
  sauce = BeautifulSoup(driver.page_source,"lxml")
  item = sauce.select_one("h1 a").text
  print(item)

if __name__ == '__main__':
  url = "https://stackoverflow.com/questions/tagged/web-scraping"
  ThreadPool(5).map(get_title,get_links(url))

Originally posted by robots.txt; translation licensed under CC BY-SA 4.0.

python python-3.x selenium web-scraping multiprocessing


2 Answers


Community wiki

Posted on
2022-11-16

✓ Accepted

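How can I reduce the execution time with Selenium when it runs under multiprocessing?

In your solution, a lot of time is spent launching a webdriver for each URL. You can reduce this time by launching the driver only once per thread: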

(... skipped for brevity ...)

import threading

threadLocal = threading.local()

def get_driver():
  # Reuse this thread's driver if one exists; otherwise create and cache it.
  driver = getattr(threadLocal, 'driver', None)
  if driver is None:
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(options=chromeOptions)
    setattr(threadLocal, 'driver', driver)
  return driver

def get_title(url):
  driver = get_driver()
  driver.get(url)
  (...)

(...)

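On my system this reduces the run time from 1 min 7 s to just 24.895 s, a reduction of roughly 63% (the new run takes about 37% of the original time). To test it yourself, download the full script.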
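The saving comes from constructing fewer drivers. Here is a minimal sketch of the same thread-local pattern, using a hypothetical `FakeDriver` stand-in (not part of Selenium) and placeholder URLs so that it runs without a browser. It counts how many driver instances five worker threads create for twenty URLs:

```python
import threading
from multiprocessing.pool import ThreadPool

thread_local = threading.local()
created = []                      # every FakeDriver that was constructed
created_lock = threading.Lock()   # protects `created` across worker threads


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome; it only records its creation."""

    def __init__(self):
        with created_lock:
            created.append(self)

    def get(self, url):
        return f"title of {url}"


def get_driver():
    # Reuse the driver cached for the current thread, creating it on first use.
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        driver = FakeDriver()
        thread_local.driver = driver
    return driver


def get_title(url):
    return get_driver().get(url)


urls = [f"https://example.com/q/{i}" for i in range(20)]  # placeholder URLs
with ThreadPool(5) as pool:
    titles = pool.map(get_title, urls)

print(len(titles), "pages fetched with", len(created), "drivers")
```

With one driver per URL the count would be twenty; here it can never exceed the number of worker threads.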

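Note: ThreadPool uses threads, which are constrained by the Python GIL. That is fine as long as the work is mostly I/O bound. Depending on the post-processing you do with the scraped results, you may want to use multiprocessing.Pool instead. It launches parallel processes, which as a group are not constrained by the GIL. The rest of the code stays the same.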
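A minimal sketch of that swap, assuming a hypothetical CPU-bound post-processing step called `count_words` and made-up sample titles. `Pool` offers the same `map` call as `ThreadPool`; the worker function has to live at module level so it can be pickled, and the `__main__` guard is needed on platforms that spawn new interpreters:

```python
from multiprocessing import Pool


def count_words(text):
    """Hypothetical CPU-bound post-processing step for one scraped title."""
    return len(text.split())


if __name__ == "__main__":
    # Sample data standing in for scraped titles.
    titles = ["How to scrape a page", "Selenium with multiprocessing"]
    with Pool(2) as pool:
        print(pool.map(count_words, titles))
```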

Originally posted by miraculixx; translation licensed under CC BY-SA 4.0.

Community wiki

Posted on
2022-11-16

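One potential problem I see with the clever one-driver-per-thread answer is that it omits any mechanism for quitting the drivers, which leaves open the possibility of processes left hanging around. I would make the following changes:

Use a class Driver instead. It creates the driver instance and stores it in thread-local storage, and it also has a destructor that calls quit on the driver when the thread-local storage is deleted: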

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit() # clean up driver when we are cleaned up
        #print('The driver has been "quitted".')

create_driver (which takes the place of get_driver) now becomes:

threadLocal = threading.local()

def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver

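Finally, once you no longer need the ThreadPool instance but before it is terminated, add the following lines to delete the thread-local storage and force the destructors of the Driver instances to be called (hopefully):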

import gc

del threadLocal  # drop the thread-local storage so each Driver's __del__ can run
gc.collect() # a little extra insurance
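Since this approach depends on when the destructors run, a different option is to track every driver explicitly and quit them all in a `finally` block. The sketch below is that alternative technique, not the method from this answer. It uses a hypothetical `FakeDriver` stand-in and placeholder URLs so it runs without a browser, and it assumes that a real driver can safely be quit from the main thread once the workers have finished:

```python
import threading
from multiprocessing.pool import ThreadPool

thread_local = threading.local()
all_drivers = []                      # every driver created by any worker thread
all_drivers_lock = threading.Lock()


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome."""

    def __init__(self):
        self.closed = False

    def get(self, url):
        return f"title of {url}"

    def quit(self):
        self.closed = True


def get_driver():
    # One driver per thread, also registered globally for later cleanup.
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        driver = FakeDriver()
        thread_local.driver = driver
        with all_drivers_lock:
            all_drivers.append(driver)
    return driver


def get_title(url):
    return get_driver().get(url)


urls = [f"https://example.com/q/{i}" for i in range(10)]  # placeholder URLs
pool = ThreadPool(5)
try:
    titles = pool.map(get_title, urls)
finally:
    pool.close()
    pool.join()                       # wait until all workers are done
    for driver in all_drivers:
        driver.quit()                 # explicit cleanup, independent of __del__
```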

Originally posted by Booboo; translation licensed under CC BY-SA 4.0.
