My dream is worth fighting for. My life today is by no means a cold copy of my life yesterday. —— Stendhal, "The Red and the Black"

I. Overview

I am not a professional crawler engineer, but I am interested in web crawling, and I have used requests, scrapy, and other Python libraries to crawl data from some websites. Recently I started doing some crawler-related work out of necessity. The purpose of this article is to record my learning process and the problems I encountered: on the one hand to consolidate what I have learned, and on the other hand in the hope of helping friends who run into the same problems.

This article covers the following aspects (which is also the order in which I learned them):

  1. Why use selenium
  2. The traditional way to configure and use selenium
  3. Using python + selenium to crawl Toutiao news data

II. Why use selenium

When using a crawler tool such as requests, I would fetch a page with requests.get(url) and find that the response did not contain the content I needed. That is because some sites separate the front end from the back end: the initial HTML contains no data, and the browser must execute js scripts that issue ajax requests to fetch the data and render it into the page, so requesting the page address directly returns HTML with no data in it. On other sites, the page content is generated by js embedded in the original HTML itself, without any ajax request, but the result is the same: the raw response does not contain the rendered data.
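To make this concrete, imagine the raw HTML that such a JS-rendered page might return (the markup below is made up for illustration, no network involved): the response contains only an empty container and a script tag, so the text you see in the browser is simply not there.

```python
# Raw HTML that a JS-rendered (front/back separated) page might return.
# The page body is fetched later by the script via ajax, so it is
# absent from this initial response. (Hypothetical example markup.)
raw_html = """
<html>
  <body>
    <div id="app"></div>            <!-- empty container, filled by JS -->
    <script src="/static/app.js"></script>
  </body>
</html>
"""

# What requests.get(url).text would give you: none of the rendered text.
print("rendered text present:", "article text" in raw_html)  # -> False
```

This is exactly the symptom described above: the HTML is syntactically valid, but the data never appears until js runs.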

How can this be solved? Normally you can analyze the js, find the interface it calls, and request that interface directly to get the data; but such interfaces are often encrypted or protected by verification, which makes requesting them troublesome. And for sites whose pages are generated entirely by executing js, there is no data interface to call at all. For convenience, we can instead use selenium together with a browser driver (firefox here) to simulate the behavior of a real browser: it executes the js scripts for us and hands back the fully rendered web page.

Selenium is a browser automation testing tool.

For details and usage, please check the official document: https://www.selenium.dev/documentation/

For example, the data on Toutiao news pages is protected by algorithmic signing, so you cannot request the interface directly without cracking its encryption rules. Many write-ups on cracking Toutiao can be found online; in short, it is full of pitfalls and trouble. Below I will instead introduce how to use selenium to grab Toutiao data.

With selenium you can get the data of almost any webpage, but it has the following disadvantages:

  1. Low efficiency

    Each request is equivalent to opening a browser. Compared with calling an interface directly, this startup is very slow and usually takes several seconds.

  2. High resource consumption

    Selenium simulates the full behavior of a browser, so a large number of requests consumes a great deal of resources.

III. The traditional way to configure and use selenium

1. Configure selenium in windows

The main demonstration here is to use python + selenium to crawl data, so only the installation method of python will be introduced below. For other installation methods, please check the official documentation.

Install the Selenium library

Use the following command to install the selenium library:

pip install selenium

Install firefox browser

firefox download: http://www.firefox.com.cn/download/

Download the installation package for your environment. Since we are configuring on windows here, download the windows version. After the download completes, double-click the .exe file and click Next until the installation finishes.


Install firefox browser driver

After installing the browser, you also need to install the browser's driver so that selenium can operate the browser. Since the firefox browser is used here, the corresponding driver geckodriver needs to be installed.

If geckodriver is not installed, running the following code:

import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.toutiao.com/a6969138023774667264/")
time.sleep(2)
html = driver.page_source
print(html)
driver.quit()

The following error will be reported:

FileNotFoundError: [WinError 2] The system cannot find the file specified.
Traceback (most recent call last):
    raise WebDriverException(
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH. 

The official description of geckodriver is as follows:

geckodriver: a proxy for using W3C WebDriver-compatible clients to interact with Gecko-based browsers.

This program provides the HTTP API described by the WebDriver protocol to communicate with Gecko browsers such as Firefox. It translates WebDriver calls into the Marionette remote protocol by acting as a proxy between the local and remote ends.

geckodriver download link: https://github.com/mozilla/geckodriver/releases

1. Choose the download according to your system version, as shown in the figure below:


2. After downloading and decompressing, add geckodriver.exe to the Path environment variable.

If you don't want to add it to the environment variable, you can also specify the location of geckodriver when creating the firefox driver instance:

webdriver.Firefox(executable_path="E:/Downloads/geckodriver/geckodriver.exe")

3. After adding geckodriver to the environment variable, you need to restart cmd or your IDE for the change to take effect.
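After restarting the terminal, you can confirm from Python itself that the driver is visible on Path, using only the standard library:

```python
import shutil

# shutil.which returns the full path of an executable found on PATH,
# or None if it cannot be found (i.e. the variable is not set up yet).
path = shutil.which("geckodriver")
print(path)  # a full path such as C:\...\geckodriver.exe, or None
```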

Driver download addresses for other browsers

| Browser | Supported operating systems | Maintainer | Download | Issue tracking |
| --- | --- | --- | --- | --- |
| Chromium/Chrome | Windows/macOS/Linux | Google | download | issues |
| Firefox | Windows/macOS/Linux | Mozilla | download | issues |
| Edge | Windows 10 | Microsoft | download | issues |
| Internet Explorer | Windows | Selenium project team | download | issues |
| Safari | macOS El Capitan and later | Apple | built-in | issues |
| Opera | Windows/macOS/Linux | Opera | download | issues |

2. Configure selenium in linux

The configuration steps in linux are the same as those in windows, here is a brief introduction.

Install the Selenium library

Use the following command to install the selenium library:

pip install selenium

Install firefox browser

firefox download: http://www.firefox.com.cn/download/

Use the following command to download the linux version of firefox browser:

wget https://download-ssl.firefox.com.cn/releases/firefox/esr/91.0/zh-CN/Firefox-latest-x86_64.tar.bz2

After the download is complete, use the following command to decompress it and obtain Firefox-latest-x86_64.tar:

bunzip2 -d Firefox-latest-x86_64.tar.bz2

Use the following command again to decompress:

tar -xvf Firefox-latest-x86_64.tar
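The two-step decompression above can also be done in a single command, since tar can drive bzip2 itself with the -j flag (same archive name assumed):

```shell
# -x extract, -j filter through bzip2, -v verbose, -f archive file
tar -xjvf Firefox-latest-x86_64.tar.bz2
```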

Install firefox browser driver

geckodriver driver download address: https://github.com/mozilla/geckodriver/releases

Use the following command to download the driver of the linux system:

wget https://github.com/mozilla/geckodriver/releases/download/v0.29.1/geckodriver-v0.29.1-linux64.tar.gz

After extracting, copy geckodriver to the /usr/local/bin/ directory:

tar -zxvf geckodriver-v0.29.1-linux64.tar.gz
cp geckodriver /usr/local/bin/

The same applies to the IE and Chrome browsers: IEDriverServer and chromedriver are installed in the same way.

IV. Using python + selenium to get Toutiao news data

The following code takes the url of a Toutiao news article and extracts its title, publish time, source, body content, and image addresses; see the code comments for a detailed explanation:

from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC  # used together with WebDriverWait below
from selenium.webdriver.support.wait import WebDriverWait


def html_selenium_firefox(url):
    """
    Get the page source of the given url using selenium.
    :param url: url
    :return: page source
    """
    opt = webdriver.FirefoxOptions()
    # run headless (no browser window)
    opt.add_argument("--headless")
    # disable the gpu
    opt.add_argument('--disable-gpu')
    # path to the firefox executable; not needed if it is in the environment variables
    firefox_binary = "C:\\Program Files (x86)\\Mozilla Firefox\\firefox.exe"
    # path to geckodriver; not needed if it is in the environment variables
    executable_path = "E:\\Downloads\\geckodriver\\geckodriver.exe"
    driver = webdriver.Firefox(firefox_binary=firefox_binary, executable_path=executable_path, options=opt)
    # send the request
    driver.get(url)
    # explicit wait: block until the given elements have been loaded
    wait = WebDriverWait(driver, 20)
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'article-content')))
    wait.until(EC.presence_of_element_located((By.TAG_NAME, 'span')))
    # get the page source
    html = driver.page_source
    # close the browser and release its resources
    driver.quit()
    return html


def get_news_content(url):
    html = html_selenium_firefox(url)
    tree = etree.HTML(html)
    title = tree.xpath('//div[@class="article-content"]/h1/text()')[0]
    # xpath for an element without a class attribute: span[not(@class)]
    pubtime = tree.xpath('//div[@class="article-meta mt-4"]/span[not(@class)]/text()')[0]
    # xpath for an element with class="name": span[@class="name"]
    source = tree.xpath('//div[@class="article-meta mt-4"]/span[@class="name"]/a/text()')[0]
    # xpath for a whole tag and everything inside it: //article
    content = tree.xpath('//article')[0]
    # serialize content back to a string to avoid garbled characters
    content = str(etree.tostring(content, encoding='utf-8', method='html'), 'utf-8')
    # extract the addresses of all images in content
    images = etree.HTML(content).xpath('//img/@src')

    result = {
        "title": title,
        "pubtime": pubtime,
        "source": source,
        "content": content,
        "images": images,
    }
    return result


if __name__ == '__main__':
    url = "https://www.toutiao.com/a6969138023774667264/"
    result = get_news_content(url)
    print(result)
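The xpath patterns used above (span[not(@class)] and span[@class="name"]) can be exercised against a small static snippet, without launching a browser. A sketch assuming lxml is installed; the markup below is made up to mimic the Toutiao structure:

```python
from lxml import etree

# Hypothetical markup imitating the structure the xpath above targets.
html = """
<div class="article-meta mt-4">
  <span class="name"><a>Demo Source</a></span>
  <span>2021-08-15 10:00</span>
</div>
"""
tree = etree.HTML(html)

# span without a class attribute -> publish time
pubtime = tree.xpath('//div[@class="article-meta mt-4"]/span[not(@class)]/text()')[0]
# span with class="name" -> source
source = tree.xpath('//div[@class="article-meta mt-4"]/span[@class="name"]/a/text()')[0]
print(pubtime, source)  # -> 2021-08-15 10:00 Demo Source
```

Testing selectors this way is much faster than re-running the whole selenium pipeline after every xpath tweak.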

For more on selenium and xpath, please check the official documentation; I won't go into further detail here.

Reference article:

https://blog.csdn.net/rhx_qiuzhi/article/details/80296801

https://github.com/mozilla/geckodriver

https://www.selenium.dev/documentation


惜鸟