Why should crawler engineers have some basic back-end common sense?

In the reader group chat today, someone said he had found a bug in Requests and fixed it:

The corresponding picture in the chat history is:

From his screenshot, I could more or less tell what problem he had run into and why he mistook it for a bug in Requests.

To explain this, we first need to understand the two display forms of a JSON string and the ensure_ascii parameter of json.dumps.

Suppose we have a dictionary in Python:

info = {'name': '青南', 'age': 20}

When we want to convert it into a JSON string, we might write code like this:

import json
info = {'name': '青南', 'age': 20}
info_str = json.dumps(info)
print(info_str)

The output is shown in the figure below; the Chinese characters have become Unicode escape sequences:

We can also pass ensure_ascii=False so that the Chinese characters are kept as they are:

info_str = json.dumps(info, ensure_ascii=False)

The output is shown in the figure below:

This reader's reasoning went as follows. Compared as strings, {"name": "\u9752\u5357", "age": 20} and {"name": "青南", "age": 20} are obviously not equal. When Requests sends POST data, it does not pass this parameter, and for json.dumps, omitting it is equivalent to ensure_ascii=True.
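The default value can be checked directly against the standard library. A minimal sketch that reads the default from the signature of json.dumps and compares the output with and without the parameter, using the same info dictionary as above:

```python
import inspect
import json

info = {'name': '青南', 'age': 20}

# Read the default value of ensure_ascii from the function signature
default_flag = inspect.signature(json.dumps).parameters['ensure_ascii'].default

implicit = json.dumps(info)                     # parameter omitted
explicit = json.dumps(info, ensure_ascii=True)  # parameter spelled out

print(default_flag)          # True
print(implicit == explicit)  # True
print(implicit)              # {"name": "\u9752\u5357", "age": 20}
```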

So, he concluded, when the POST data contains Chinese, Requests converts it into Unicode escapes before sending it to the server, the server never receives the original Chinese text, and that is what causes the error.

But in fact, this is not the case. I often tell people in the group that anyone who writes crawlers should have some basic back-end knowledge, so as not to be misled by things like this. To explain why this reader's understanding is wrong and why this is not a bug in Requests, let's write a service that accepts POST requests and see whether the received data differs between the two cases. To show that the behavior has nothing to do with the web framework, I will demonstrate with Flask, FastAPI, and Gin.

First, let's take a look at the Requests test code. Here are three ways to send data in JSON format:

import requests 
import json 

body = {
    'name': '青南',
    'age': 20
}
url = 'http://127.0.0.1:5000/test_json'

# Send the dictionary directly with json=
resp = requests.post(url, json=body).json() 
print(resp)

headers = {
    'Content-Type': 'application/json'
}

# Serialize the dictionary to a JSON string first; Chinese becomes Unicode escapes, equivalent to the first method
resp = requests.post(url,
                     headers=headers,
                     data=json.dumps(body)).json()
print(resp)

# Serialize the dictionary to a JSON string first, keeping the Chinese characters as they are
resp = requests.post(url,
                     headers=headers,
                     data=json.dumps(body, ensure_ascii=False).encode()).json()
print(resp)

This test code sends the POST request in three ways. The first uses the json= parameter built into Requests: the value is a dictionary, and Requests converts it into a JSON string automatically. In the other two, we convert the dictionary into a JSON string ourselves and then send it to the server through data=. These two need to set 'Content-Type': 'application/json' so that the server knows a JSON string is being sent.

Let's take a look at the back-end code written with Flask:

from flask import Flask, request
app = Flask(__name__)


@app.route('/')
def index():
    return {'success': True}


@app.route('/test_json', methods=["POST"])
def test_json():
    body = request.json 
    msg = f'Received POST data, {body["name"]=}, {body["age"]=}'
    print(msg)
    return {'success': True, 'msg': msg}

The result is shown in the figure below:

As you can see, whichever of the three POST methods is used, the backend receives the correct information.

Now let's look at the FastAPI version:

from fastapi import FastAPI
from pydantic import BaseModel 


class Body(BaseModel):
    name: str
    age: int 

app = FastAPI()



@app.get('/')
def index():
    return {'success': True}


@app.post('/test_json')
def test_json(body: Body):
    msg = f'Received POST data, {body.name=}, {body.age=}'
    print(msg)
    return {'success': True, 'msg': msg}

The result is shown in the figure below. The data sent by all three POSTs is parsed correctly by the backend:

Finally, let's take a look at the Gin version of the backend:

package main

import (
    "fmt"
    "net/http"

    "github.com/gin-gonic/gin"
)

type Body struct {
    Name string `json:"name"`
    Age  int16  `json:"age"`
}

func main() {
    r := gin.Default()
    r.GET("/", func(c *gin.Context) {
        c.JSON(http.StatusOK, gin.H{
            "message": "running",
        })
    })
    r.POST("/test_json", func(c *gin.Context) {
        body := Body{}
        // BindJSON answers with HTTP 400 by itself when the body cannot be parsed
        if err := c.BindJSON(&body); err != nil {
            return
        }
        msg := fmt.Sprintf("Received POST data, name=%s, age=%d", body.Name, body.Age)
        fmt.Println(">>>", msg)
        c.JSON(http.StatusOK, gin.H{
            "msg": msg,
        })
    })
    r.Run()
}

The result is as follows; the data from all three request methods is exactly the same:

From this we can see that whether the Chinese in the POSTed JSON string appears as Unicode escapes or directly as Chinese characters, the back-end service parses it correctly.
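The same experiment can be reproduced without installing any framework. The following is only a sketch using Python's standard library: EchoHandler, post_json, and compare_both_forms are names made up for this illustration, and the server simply deserializes the body and echoes it back. It assumes the machine allows a local socket on 127.0.0.1.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Deserialize the POSTed JSON body and echo the parsed object back."""

    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        parsed = json.loads(self.rfile.read(length))  # bytes -> Python object
        payload = json.dumps(parsed).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # silence the default per-request logging


def post_json(url, raw_body):
    """POST raw bytes as application/json and return the parsed response."""
    req = urllib.request.Request(
        url, data=raw_body, headers={'Content-Type': 'application/json'})
    # An empty ProxyHandler keeps the local request away from any system proxy
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req) as resp:
        return json.loads(resp.read())


def compare_both_forms():
    body = {'name': '青南', 'age': 20}
    server = HTTPServer(('127.0.0.1', 0), EchoHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f'http://127.0.0.1:{server.server_port}/test_json'
    try:
        escaped = post_json(url, json.dumps(body).encode('utf-8'))
        literal = post_json(
            url, json.dumps(body, ensure_ascii=False).encode('utf-8'))
    finally:
        server.shutdown()
        server.server_close()
    return escaped, literal


if __name__ == '__main__':
    escaped, literal = compare_both_forms()
    print(escaped == literal == {'name': '青南', 'age': 20})
```

Both requests put different bytes on the wire, yet the handler ends up with the same dictionary in both cases.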

Why do I say it doesn't matter which form the Chinese takes in the JSON string? Because the process of converting a JSON string back into an object in a programming language (called deserialization) handles both forms correctly. Let's look at the picture below:

The ensure_ascii parameter only controls how the JSON is displayed. When ensure_ascii is True, the JSON string is guaranteed to contain only ASCII characters, so every character outside the 128 ASCII characters is converted into a \uXXXX escape. When ensure_ascii is False, these non-ASCII characters are written as they are. It is like the same person dressed differently: the essence does not change. When a modern programming language deserializes the string, both forms are recognized correctly.
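This can be verified in a few lines. A minimal sketch with the two strings from the beginning of the article: the text differs, but the deserialized data does not.

```python
import json

escaped = '{"name": "\\u9752\\u5357", "age": 20}'  # form produced by ensure_ascii=True
literal = '{"name": "青南", "age": 20}'              # form produced by ensure_ascii=False

print(escaped == literal)                          # False: the text is different
print(json.loads(escaped) == json.loads(literal))  # True: the data is the same
print(escaped.isascii(), literal.isascii())        # True False
```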

So, if you write the backend with a modern web framework, there should be no difference between these two JSON forms. The json= parameter of Requests behaves like ensure_ascii=True by default, and any modern web framework can correctly parse content submitted that way.

Of course, if you insist on hand-writing the back-end interface in C, assembly, or something similar, the two forms may indeed behave differently. But who in their right mind would do that?

In summary, the problem this reader ran into is not a bug in Requests but a problem with the back-end interface he was calling. Maybe that backend uses some half-baked web framework that does not deserialize the POSTed data and leaves it as a raw JSON string, and the back-end programmer then extracts the data from that string with regular expressions. When the code finds no Chinese in the JSON string, it reports an error.
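What such a fragile backend might look like can be sketched as follows. This is a guess at the failure mode, not the actual code of that backend; fragile_extract_name and robust_extract_name are hypothetical names used only for this illustration.

```python
import json
import re


def fragile_extract_name(raw: str):
    """Hypothetical fragile backend: pulls the field out of the raw text with a regex."""
    match = re.search(r'"name":\s*"([^"]*)"', raw)
    return match.group(1) if match else None


def robust_extract_name(raw: str):
    """What a normal framework does: deserialize first, then read the field."""
    return json.loads(raw)['name']


body = {'name': '青南', 'age': 20}
escaped = json.dumps(body)                      # what json= sends by default
literal = json.dumps(body, ensure_ascii=False)  # Chinese kept as-is

print(fragile_extract_name(literal))  # 青南
print(fragile_extract_name(escaped))  # \u9752\u5357 (raw escape text, not the name)
print(robust_extract_name(escaped) == robust_extract_name(literal))  # True
```

The regex version only works when the sender happens to use the literal form, which is exactly the kind of behavior that can be mistaken for a client-side bug.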

Besides the problem of sending JSON with POST, here is another story. I once had a team member who, while using Scrapy, did not know how to write the code for a POST request. On a whim, he appended the POST fields to the URL and sent a GET request instead, and found that he could still get the data, something like this:

import requests

body = {'name': '青南', 'age': 20}
url = 'http://www.xxx.com/api/yyy'
requests.post(url, json=body).text

requests.get('http://www.xxx.com/api/yyy?name=青南&age=20').text

From this he concluded that it was a universal law: every POST request can be turned into a GET request in this way.

But obviously, this conclusion is also wrong. It only shows that the back-end programmers of that particular website made the interface accept both ways of submitting data, which requires them to write extra code. By default, GET and POST are two completely different request methods, and one cannot be converted into the other like this.
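The difference can be sketched with two toy handlers. post_only_handler and dual_handler are hypothetical functions invented for this illustration, not any framework's API; they only model where each method carries its data and what the backend chooses to read.

```python
import json
from urllib.parse import urlencode, parse_qs

body = {'name': '青南', 'age': 20}

post_body = json.dumps(body).encode('utf-8')  # POST: data travels in the request body
get_query = urlencode(body)                   # GET: data travels in the URL query string


def post_only_handler(method, query_string, raw_body):
    """Hypothetical route registered for POST only; it never reads the query string."""
    if method != 'POST':
        return 405, None  # Method Not Allowed
    return 200, json.loads(raw_body)


def dual_handler(method, query_string, raw_body):
    """Hypothetical route where the backend author wrote extra code for GET."""
    if method == 'GET':
        params = {k: v[0] for k, v in parse_qs(query_string).items()}
        params['age'] = int(params['age'])  # query values always arrive as strings
        return 200, params
    return 200, json.loads(raw_body)


print(post_only_handler('POST', '', post_body))  # (200, {'name': '青南', 'age': 20})
print(post_only_handler('GET', get_query, b''))  # (405, None)
print(dual_handler('GET', get_query, b''))       # (200, {'name': '青南', 'age': 20})
```

The GET request only succeeds against the handler whose author deliberately added a branch for it.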

If he had known even a little about back-end development, he could have written a small back-end program right away to test his guess.

Here is another example. On some websites, a URL may contain another URL as a parameter, for example:

https://kingname.info/get_info?url=https://abc.com/def/xyz?id=123&db=admin

If you have no basic back-end knowledge, you may not see anything wrong with the URL above. But if you do, you might ask: does &db=admin belong to https://kingname.info/get_info, as a parameter at the same level as url=, or is it part of the inner URL https://abc.com/def/xyz?id=123&db=admin? You would be confused, and so would the backend, which is why we need urlencode here. After all, the following two URLs mean completely different things:

https://kingname.info/get_info?url=https%3A%2F%2Fabc.com%2Fdef%2Fxyz%3Fid%3D123%26db%3Dadmin

https://kingname.info/get_info?url=https%3A%2F%2Fabc.com%2Fdef%2Fxyz%3Fid%3D123&db=admin
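A short sketch with Python's urllib.parse shows how a backend would read the raw URL and the encoded one. The variable names are chosen only for this illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = 'https://kingname.info/get_info'
inner = 'https://abc.com/def/xyz?id=123&db=admin'

raw_url = base + '?url=' + inner                  # inner URL pasted in unescaped
safe_url = base + '?' + urlencode({'url': inner})  # inner URL percent-encoded

print(safe_url)
# https://kingname.info/get_info?url=https%3A%2F%2Fabc.com%2Fdef%2Fxyz%3Fid%3D123%26db%3Dadmin

raw_params = parse_qs(urlsplit(raw_url).query)
safe_params = parse_qs(urlsplit(safe_url).query)

print(raw_params)   # {'url': ['https://abc.com/def/xyz?id=123'], 'db': ['admin']}
print(safe_params)  # {'url': ['https://abc.com/def/xyz?id=123&db=admin']}
```

Without encoding, db=admin is read as a parameter of the outer URL and the inner URL is cut short; with encoding, the whole inner URL arrives intact as the value of url.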

Finally, I will end with a sentence from the preface of my crawler book:

Crawling is a miscellaneous subject. If you can only crawl, then you are not good at crawling.

青南
