My dream is worth fighting for. My life today is by no means a cold copy of my life yesterday. —— Stendhal, "The Red and the Black"

I. Overview

I am not a professional crawler engineer, but I am interested in web crawling, and I have used requests, scrapy, and other Python libraries to crawl data from some websites. Recently I started doing some crawler-related work out of necessity. The purpose of this article is to record my learning process and the problems I encountered: on the one hand to consolidate what I have learned, and on the other hand in the hope of helping friends who run into the same problems.

This article covers the following aspects (which is also the order in which I learned them):

  1. Why use selenium
  2. The traditional way to configure and use selenium
  3. Using python + selenium to crawl Toutiao news data

II. Why use selenium

When using a crawler tool such as requests, I would fetch a page with requests.get(url) and find that the response did not contain the content I needed. That is because some sites separate the front end from the back end: the initial HTML contains no data, and the browser must execute js scripts that issue ajax requests to fetch the data and render it into the page, so requesting the page address directly returns HTML with no data in it. On other sites, the page content is generated by js embedded in the original HTML itself, without any ajax request, but the result is the same: the raw response does not contain the rendered data.
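To make this concrete, imagine the raw HTML that such a JS-rendered page might return (the markup below is made up for illustration, no network involved): the response contains only an empty container and a script tag, so the text you see in the browser is simply not there.

```python
# Raw HTML that a JS-rendered (front/back separated) page might return.
# The page body is fetched later by the script via ajax, so it is
# absent from this initial response. (Hypothetical example markup.)
raw_html = """
<html>
  <body>
    <div id="app"></div>            <!-- empty container, filled by JS -->
    <script src="/static/app.js"></script>
  </body>
</html>
"""

# What requests.get(url).text would give you: none of the rendered text.
print("rendered text present:", "article text" in raw_html)  # -> False
```

This is exactly the symptom described above: the HTML is syntactically valid, but the data never appears until js runs.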

How can this be solved? Normally you can analyze the js, find the interface it calls, and request that interface directly to get the data; but such interfaces are often encrypted or protected by verification, which makes requesting them troublesome. And for sites whose pages are generated entirely by executing js, there is no data interface to call at all. For convenience, we can instead use selenium together with a browser driver (firefox here) to simulate the behavior of a real browser: it executes the js scripts for us and hands back the fully rendered web page.

Selenium is a browser automation testing tool.

For details and usage, please check the official document: https://www.selenium.dev/documentation/

For example, the data on Toutiao news pages is protected by algorithmic signing, so you cannot request the interface directly without cracking its encryption rules. Many write-ups on cracking Toutiao can be found online; in short, it is full of pitfalls and trouble. Below I will instead introduce how to use selenium to grab Toutiao data.

With selenium you can get the data of almost any webpage, but it has the following disadvantages:

  1. Low efficiency

    Each request is equivalent to opening a browser. Compared with calling an interface directly, this startup is very slow and usually takes several seconds.

  2. High resource consumption

    Selenium simulates the full behavior of a browser, so a large number of requests consumes a great deal of resources.

III. The traditional way to configure and use selenium

1. Configure selenium in windows

The main demonstration here is to use python + selenium to crawl data, so only the installation method of python will be introduced below. For other installation methods, please check the official documentation.

Install the Selenium library

Use the following command to install the selenium library:

pip install selenium

Install firefox browser

firefox download: http://www.firefox.com.cn/download/

Download the installation package for your environment. Since we are configuring on windows here, download the windows version. After the download completes, double-click the .exe file and click Next until the installation finishes.


Install firefox browser driver

After installing the browser, you also need to install the browser's driver so that selenium can operate the browser. Since the firefox browser is used here, the corresponding driver geckodriver needs to be installed.

If geckodriver is not installed, running the following code:

import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.toutiao.com/a6969138023774667264/")
time.sleep(2)
html = driver.page_source
print(html)
driver.quit()

The following error will be reported:

FileNotFoundError: [WinError 2] The system cannot find the file specified.
Traceback (most recent call last):
    raise WebDriverException(
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH. 

The official description of geckodriver is as follows:

geckodriver: a proxy for using W3C WebDriver-compatible clients to interact with Gecko-based browsers.

This program provides the HTTP API described by the WebDriver protocol to communicate with Gecko browsers such as Firefox. It translates WebDriver calls into the Marionette remote protocol by acting as a proxy between the local and remote ends.

geckodriver download link: https://github.com/mozilla/geckodriver/releases

1. Choose the download according to your system version, as shown in the figure below:


2. After downloading and decompressing, add geckodriver.exe to the Path environment variable.

If you don't want to add it to the environment variable, you can also specify the location of geckodriver when creating the firefox driver instance:

webdriver.Firefox(executable_path="E:/Downloads/geckodriver/geckodriver.exe")

3. After adding geckodriver to the environment variable, you need to restart cmd or your IDE for the change to take effect.
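After restarting the terminal, you can confirm from Python itself that the driver is visible on Path, using only the standard library:

```python
import shutil

# shutil.which returns the full path of an executable found on PATH,
# or None if it cannot be found (i.e. the variable is not set up yet).
path = shutil.which("geckodriver")
print(path)  # a full path such as C:\...\geckodriver.exe, or None
```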

Driver download addresses for other browsers

| Browser | Supported operating systems | Maintainer | Download | Issue tracking |
| --- | --- | --- | --- | --- |
| Chromium/Chrome | Windows/macOS/Linux | Google | download | issues |
| Firefox | Windows/macOS/Linux | Mozilla | download | issues |
| Edge | Windows 10 | Microsoft | download | issues |
| Internet Explorer | Windows | Selenium project team | download | issues |
| Safari | macOS El Capitan and later | Apple | built-in | issues |
| Opera | Windows/macOS/Linux | Opera | download | issues |

2. Configure selenium in linux

The configuration steps in linux are the same as those in windows, here is a brief introduction.

Install the Selenium library

Use the following command to install the selenium library:

pip install selenium

Install firefox browser

firefox download: http://www.firefox.com.cn/download/

Use the following command to download the linux version of firefox browser:

wget https://download-ssl.firefox.com.cn/releases/firefox/esr/91.0/zh-CN/Firefox-latest-x86_64.tar.bz2

After the download is complete, use the following command to decompress it and obtain Firefox-latest-x86_64.tar:

bunzip2 -d Firefox-latest-x86_64.tar.bz2

Use the following command again to decompress:

tar -xvf Firefox-latest-x86_64.tar
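The two-step decompression above can also be done in a single command, since tar can drive bzip2 itself with the -j flag (same archive name assumed):

```shell
# -x extract, -j filter through bzip2, -v verbose, -f archive file
tar -xjvf Firefox-latest-x86_64.tar.bz2
```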

Install firefox browser driver

geckodriver driver download address: https://github.com/mozilla/geckodriver/releases

Use the following command to download the driver of the linux system:

wget https://github.com/mozilla/geckodriver/releases/download/v0.29.1/geckodriver-v0.29.1-linux64.tar.gz

After extracting, copy geckodriver to the /usr/local/bin/ directory:

tar -zxvf geckodriver-v0.29.1-linux64.tar.gz
cp geckodriver /usr/local/bin/

The same applies to the IE and Chrome browsers: IEDriverServer and chromedriver are installed in the same way.

IV. Using python + selenium to get Toutiao news data

The following code takes the url of a Toutiao news article and extracts its title, publish time, source, body content, and image addresses; see the code comments for a detailed explanation:

from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC  # used together with WebDriverWait below
from selenium.webdriver.support.wait import WebDriverWait


def html_selenium_firefox(url):
    """
    Get the page source of the given url using selenium.
    :param url: url
    :return: page source
    """
    opt = webdriver.FirefoxOptions()
    # run headless (no browser window)
    opt.add_argument("--headless")
    # disable the gpu
    opt.add_argument('--disable-gpu')
    # path to the firefox executable; not needed if it is in the environment variables
    firefox_binary = "C:\\Program Files (x86)\\Mozilla Firefox\\firefox.exe"
    # path to geckodriver; not needed if it is in the environment variables
    executable_path = "E:\\Downloads\\geckodriver\\geckodriver.exe"
    driver = webdriver.Firefox(firefox_binary=firefox_binary, executable_path=executable_path, options=opt)
    # send the request
    driver.get(url)
    # explicit wait: block until the given elements have been loaded
    wait = WebDriverWait(driver, 20)
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'article-content')))
    wait.until(EC.presence_of_element_located((By.TAG_NAME, 'span')))
    # get the page source
    html = driver.page_source
    # close the browser and release its resources
    driver.quit()
    return html


def get_news_content(url):
    html = html_selenium_firefox(url)
    tree = etree.HTML(html)
    title = tree.xpath('//div[@class="article-content"]/h1/text()')[0]
    # xpath for an element without a class attribute: span[not(@class)]
    pubtime = tree.xpath('//div[@class="article-meta mt-4"]/span[not(@class)]/text()')[0]
    # xpath for an element with class="name": span[@class="name"]
    source = tree.xpath('//div[@class="article-meta mt-4"]/span[@class="name"]/a/text()')[0]
    # xpath for a whole tag and everything inside it: //article
    content = tree.xpath('//article')[0]
    # serialize content back to a string to avoid garbled characters
    content = str(etree.tostring(content, encoding='utf-8', method='html'), 'utf-8')
    # extract the addresses of all images in content
    images = etree.HTML(content).xpath('//img/@src')

    result = {
        "title": title,
        "pubtime": pubtime,
        "source": source,
        "content": content,
        "images": images,
    }
    return result


if __name__ == '__main__':
    url = "https://www.toutiao.com/a6969138023774667264/"
    result = get_news_content(url)
    print(result)
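The xpath patterns used above (span[not(@class)] and span[@class="name"]) can be exercised against a small static snippet, without launching a browser. A sketch assuming lxml is installed; the markup below is made up to mimic the Toutiao structure:

```python
from lxml import etree

# Hypothetical markup imitating the structure the xpath above targets.
html = """
<div class="article-meta mt-4">
  <span class="name"><a>Demo Source</a></span>
  <span>2021-08-15 10:00</span>
</div>
"""
tree = etree.HTML(html)

# span without a class attribute -> publish time
pubtime = tree.xpath('//div[@class="article-meta mt-4"]/span[not(@class)]/text()')[0]
# span with class="name" -> source
source = tree.xpath('//div[@class="article-meta mt-4"]/span[@class="name"]/a/text()')[0]
print(pubtime, source)  # -> 2021-08-15 10:00 Demo Source
```

Testing selectors this way is much faster than re-running the whole selenium pipeline after every xpath tweak.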

For more on selenium and xpath, please check the official documentation; I won't go into further detail here.

Reference article:

https://blog.csdn.net/rhx_qiuzhi/article/details/80296801

https://github.com/mozilla/geckodriver

https://www.selenium.dev/documentation


惜鸟