1. Introduction to Python
Python is an easy-to-learn yet powerful programming language. It provides efficient high-level data structures along with simple but effective support for object-oriented programming. Python's elegant syntax, dynamic typing, and interpreted nature make it an ideal language for scripting in many fields.
At the same time, the Python interpreter is easy to extend: you can add new functions and data types in C or C++ (or any other language callable from C), and Python can also serve as an extension language embedded in customizable software. In addition, the Python interpreter, the rich standard library, and prebuilt binaries for most platforms can all be downloaded freely from the official Python website.
Like other scripting languages, Python grew out of many earlier languages, borrowing ideas from ABC, Modula-3, C, C++, Algol-68, Smalltalk, and the Unix shell, among others. Compared with other scripting languages, Python has the following characteristics:
- Easy to learn and read: Python has relatively few keywords, a simple structure, and a clearly defined grammar, and its code layout is simple and easy to read.
- Easy to maintain: much of Python's success lies in source code that is fairly easy to maintain.
- Extensive standard library: one of Python's biggest advantages is its rich cross-platform standard library, compatible with UNIX, Windows, and Macintosh.
- Powerful extensibility: if you want to write algorithms or modules that you do not want to open up, you can implement the corresponding functionality in C or C++ and then call it from your Python program.
- Portable: thanks to its open-source nature, Python can be ported to many platforms.
- Database support: Python provides interfaces to all major commercial databases.
- GUI programming: Python supports GUI programming, and the resulting programs can be ported to multiple systems.
2. Installing Python on macOS
Mac OS X 10.8 and later ship with Python 2.7 preinstalled, but Python 2.7 is very old and many libraries no longer support it. It is recommended to install Python 3.7 or later from the official Python website (https://www.python.org).
After the installation is complete:
- There will be a Python 3.9 folder in your Applications folder. Here you can find IDLE, the development environment that is a standard part of official Python distributions, and Python Launcher, which handles double-clicking Python scripts in the Finder.
- A framework, /Library/Frameworks/Python.framework, containing the Python executable and libraries. The installer adds this location to the shell path, and a symbolic link to the Python executable is placed in /usr/local/bin/.
At the same time, the Python version provided by Apple is installed in /System/Library/Frameworks/Python.framework, with its executable at /usr/bin/python. Never modify or delete these, as they are controlled by Apple and used by Apple and third-party software.
After the Python installation completes, open the ~/.bash_profile file to register the new Python: run open -e .bash_profile
in the terminal (from your home directory) to open the .bash_profile file, then add the following line:
export PATH="/Library/Frameworks/Python.framework/Versions/3.9/bin:${PATH}"
Then execute the following commands in the terminal:
alias python="/Library/Frameworks/Python.framework/Versions/3.9/bin/python3.9"
source .bash_profile
After the execution is complete, run python -V
to check the version; you will see that it now points to the newly installed Python.
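For example, the check in the terminal might look like the following sketch (the exact version string depends on the installer you downloaded):

$ python -V
Python 3.9.x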
3. Development tools
Tools that support Python development include Jupyter Notebook, PyCharm, and Sublime/VS Code/Atom with Kite, among others.
3.1 Jupyter notebook
Jupyter Notebook runs as a web page: you write and run code directly in the browser, and the result of each code block is displayed immediately below it. After installing it via pip, enter jupyter notebook on the command line and it will open in your default browser. In the eyes of some Python developers, Jupyter Notebook is the best IDE because it takes Python's interactive features to the extreme. It has the following advantages:
- Shareable
- Support more than 40 programming languages
- Lightweight
- Interactive
- Excellent visualization service
- Support Markdown
Installation reference: Jupyter Notebook introduction, installation and use tutorial
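As a minimal sketch, installation and startup via pip look like this (assuming pip is already on your PATH):

pip install notebook
jupyter notebook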
3.2 PyCharm
PyCharm is a Python IDE created by JetBrains (the refactoring plug-in ReSharper for Visual Studio also comes from JetBrains). Like other IDEs, PyCharm supports the usual code completion, smart prompts, and syntax checking. It also integrates version control, unit testing, and git functions, and can quickly scaffold Django, Flask, and other Python web frameworks, which makes it very pleasant to use; it is often chosen for large-scale projects. Its drawbacks are that startup can be a little sluggish and that it is not free, although the Community edition can be downloaded free of charge.
3.3 Sublime/VS Code/Atom + Kite
Sublime Text is a lightweight, cross-platform code editor that supports dozens of programming languages, including Python, Java, and C/C++. It is small and flexible, runs fast, supports code highlighting, auto-completion, and syntax prompts, and has rich plug-in extensions. It is a very good code editor, and after configuring the relevant files you can run Python programs directly from it.
VS Code is a cross-platform code editor developed by Microsoft. It supports all common programming languages, has rich plug-in extensions, offers intelligent completion, syntax checking, and code highlighting as well as built-in git support, and runs smoothly. It is a very good editor; after installing the relevant plug-in you can run Python programs directly.
Atom is a code editor developed by GitHub for programmers, and it is also an extensible platform. The interface is simple and intuitive, it is very convenient to use, and it offers auto-completion, code highlighting, and syntax prompts with fast startup and running speed. For beginners, it is a very good code editor.
4. Running Python
Currently, there are three ways to run Python:
4.1 Interactive interpreter
We can enter Python from a command-line window and write Python code directly in the interactive interpreter; this works on Unix, DOS, or any other system that provides a command line or shell.
$ python # Unix/Linux
or
C:>python # Windows/DOS
Common parameters of the Python command line are listed below (see the example after this list):
- -d: display debugging information during parsing
- -O: generate optimized bytecode (.pyo files)
- -S: do not look up Python's site path on startup
- -V: print the Python version number
- -X: disable class-based built-in exceptions (use strings instead); obsolete since version 1.6
- -c cmd: execute the Python statements passed in as the cmd string
- file: execute the Python script in the given file
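For instance, the -c option runs statements passed directly on the command line; a quick sketch:

$ python -c "print('hello from the command line')"
hello from the command line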
4.2 Command line script
You can also execute a Python script from the command line by invoking the interpreter on the script file, as shown below:
$ python script.py # Unix/Linux
or
C:>python script.py # Windows/DOS
Note: when executing a script this way, check whether the script file has executable permission (see the example below).
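For example, on Unix-like systems a script can be made directly executable by adding a shebang line at the top (script.py here is just a placeholder name):

#!/usr/bin/env python3
print("hello")

Then grant execute permission and run it:

$ chmod +x script.py
$ ./script.py
hello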
4.3 Integrated Development Environment
PyCharm is a Python IDE created by JetBrains. It supports macOS, Windows, and Linux systems. You can run Python programs by clicking the run button on the panel.
5. Using Requests to implement a web crawler
5.1 Basic Principles of Web Crawlers
A so-called crawler is a program or script that automatically harvests information from the World Wide Web according to certain rules. The basic principle behind it is that the crawler program sends an HTTP request to the target server, the target server returns a response, and the crawler client receives the response, extracts data from it, and then performs data cleaning and data storage.
A web crawl is therefore just an HTTP request/response cycle. Take a browser visiting a website as an example: starting from the user entering a URL, the client resolves the target server's IP address via DNS and establishes a TCP connection with it. Once the connection succeeds, the browser constructs an HTTP request and sends it to the server. The server receives the request, finds the corresponding data in its database, wraps it in an HTTP response, and returns the response to the browser. The browser then parses, extracts, and renders the response content and finally displays it to the user.
It should be noted that HTTP requests and responses must follow fixed formats. Only if every client follows a unified request format can the server correctly parse requests sent by different clients; likewise, only if every server follows a unified response format can the client correctly parse the responses returned by different websites.
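To illustrate that fixed format, a minimal GET exchange looks roughly like this (the host name and header values are placeholders):

GET /index.html HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0

HTTP/1.1 200 OK
Content-Type: text/html

<html>...</html>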
5.2 Web crawler example
Python provides many tools for making HTTP requests, and third-party open-source libraries offer even richer functionality, so developers do not need to start from raw socket communication.
Before initiating a request, we must first construct a Request object, specifying the URL, the request method, and the request headers. The request body is empty here because we do not need to submit data to the server, so there is no need to specify it. The urlopen function automatically establishes a connection with the target server and sends the HTTP request; its return value is a Response object, which contains the response headers, the response body, the status code, and other attributes.
However, the built-in module provided by Python is quite low-level and requires a lot of code. For a simple crawler, consider Requests instead: with nearly 30k stars on GitHub, it is a very Pythonic library.
The following is sample code that requests a URL using the built-in Python module urllib:
import ssl
from urllib.request import Request, urlopen

def print_hi():
    context = ssl._create_unverified_context()
    request = Request(url="https://foofish.net/pip.html",
                      method="GET",
                      headers={"Host": "foofish.net"},
                      data=None)
    response = urlopen(request, context=context)
    headers = response.info()    # response headers
    content = response.read()    # response body
    code = response.getcode()    # status code
    print(headers)
    print(content)
    print(code)

if __name__ == '__main__':
    print_hi()
Execute the above code and you will see the captured headers, body, and status code printed on the Python console.
Next, let us get acquainted with the process and methods of using this Pythonic library.
5.2.1 Installing Requests
Installing Requests is very simple: just use pip's install command.
pip install requests
5.2.2 Basic requests
GET request
A basic GET request is simple: just call the get() method of requests.
import requests
url = ''
headers = {'User-Agent':''}
res = requests.get(url, headers=headers)
res.encoding = 'utf-8'
print(res.text)
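As a concrete, runnable sketch, here is the same pattern pointed at the public httpbin.org testing service (the URL and User-Agent value are stand-ins for whatever site you actually want to crawl):

import requests

url = 'http://httpbin.org/get'
headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get(url, headers=headers)
res.encoding = 'utf-8'
print(res.status_code)  # 200 on success
print(res.text)         # the response body as text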
POST request
A POST request is just as simple: call the post() method of requests.
...
data = {}
res = requests.post(url, headers=headers, data=data)
...
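For instance, posting form data to httpbin.org echoes it back in the response (the endpoint is chosen purely for demonstration):

import requests

data = {'name': 'python'}
res = requests.post('http://httpbin.org/post', data=data)
print(res.json()['form'])  # {'name': 'python'}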
5.2.3 Advanced requests
Request parameters
In front-end development, the parameters of a GET request are spliced onto the end of the request URL as a query string; with Python's Requests, you pass them through the params argument instead.
...
params = {}
res = requests.get(url, headers=headers, params=params)
...
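A quick sketch of how params ends up in the URL (again using httpbin.org as a stand-in):

import requests

res = requests.get('http://httpbin.org/get', params={'page': 1, 'size': 10})
print(res.url)  # http://httpbin.org/get?page=1&size=10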
Specify cookies
Simulating a logged-in state by sending cookies:
...
headers = {
'User-Agent' : '',
'Cookie' : '',
}
res = requests.get(url, headers=headers)
...
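Alternatively, Requests can send cookies through its dedicated cookies parameter instead of a raw header; a minimal sketch (the cookie name and value are made up):

import requests

res = requests.get('http://httpbin.org/cookies', cookies={'token': 'abc123'})
print(res.text)  # {"cookies": {"token": "abc123"}}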
Session
If you want to keep a login (session) state with the server without specifying cookies on every request, you can use a session. A Session object provides the same API as the top-level requests functions.
import requests
s = requests.Session()
s.cookies = requests.utils.cookiejar_from_dict({"a": "c"})
r = s.get('http://httpbin.org/cookies')
print(r.text)
# '{"cookies": {"a": "c"}}'
r = s.get('http://httpbin.org/cookies')
print(r.text)
# '{"cookies": {"a": "c"}}'
Client authentication
When a website requires client authentication, the request usually carries an auth field, as shown below.
...
auth = ('username', 'password')
res = requests.get(url, headers=headers, auth=auth)
...
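Passing a (user, password) tuple performs HTTP basic authentication; httpbin.org offers a test endpoint for it (the credentials below are whatever the endpoint path declares):

import requests

res = requests.get('http://httpbin.org/basic-auth/user/passwd', auth=('user', 'passwd'))
print(res.status_code)  # 200 when the credentials match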
Set a timeout
Sometimes we need to bound how long a request may take; specify the timeout parameter to set this, as shown below.
requests.get('https://google.com', timeout=5)
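If the server does not respond within the limit, Requests raises an exception that you can catch; a minimal sketch:

import requests

try:
    requests.get('https://google.com', timeout=5)
except requests.exceptions.Timeout:
    print('the request timed out')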
Set a proxy
Sending too many requests in a short period makes it easy for the server to classify the client as a crawler, so we often use a proxy IP to disguise the client's real IP, for example:
import requests
proxies = {
'http': 'http://127.0.0.1:1080',
'https': 'http://127.0.0.1:1080',
}
r = requests.get('http://www.kuaidaili.com/free/', proxies=proxies, timeout=2)
5.2.4 A small exercise
With the basics covered, let us use Requests to complete a small example: crawling the articles of a Zhihu column. How do we find the right request? Open the browser's developer tools, click through the requests listed on the left one by one, and check whether the response shown on the right contains the data we want; static resources ending in .jpg, .js, or .css can be ignored outright.
These are everyday tricks from front-end development. Copying the request URL into the browser confirms that it really returns the data we are after. Next, let us analyze how this request is constructed:
- Request URL: https://www.zhihu.com/api/v4/members/6c58e1e8c4d898befed2fafeb98adafe/profile/creations/feed?type=answer&column_id=c_1413092853995851776
- Request method: GET
- Request header: User-Agent: Mozilla/5.0 (Linux; Android 6.0.1; Moto G (4)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36
- Query parameters:
  - type: answer
  - column_id: c_1413092853995851776
With this request information in hand, we can use the requests library to build the request and grab the data with Python, as shown below.
import requests

class SimpleCrawler:

    def crawl(self):
        url = "https://www.zhihu.com/api/v4/members/6c58e1e8c4d898befed2fafeb98adafe/profile/creations/feed"
        # query parameters
        params = {"type": "answer",
                  "column_id": "c_1413092853995851776"}
        # the User-Agent must be specified, otherwise the Zhihu server rejects the request as invalid
        headers = {
            "authority": "www.zhihu.com",
            "user-agent": "Mozilla/5.0 (Linux; Android 6.0.1; Moto G (4)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36",
        }
        response = requests.get(url, headers=headers, params=params)
        print("Response data:", response.text)
        # parse the returned JSON data
        for follower in response.json().get("data"):
            print(follower)

if __name__ == '__main__':
    SimpleCrawler().crawl()
Then run the above code; each entry of the column feed is printed in turn.
The above is a single-threaded crawler based on Requests, and it is very simple. Through this example we have learned its usage and workflow. As you can see, Requests is very flexible: request headers, request parameters, and cookie information can be passed directly to the request method, and if the response is in JSON format you can call the json() method to get it back as a Python object.