Why should crawler engineers have some basic back-end knowledge?

In the fan chat group today, a student said that he found a bug in Requests and fixed it:

The corresponding picture in the chat history is:

Looking at the student's screenshot, I could see what the problem was and why he mistook it for a bug in Requests.

To explain this, we first need to understand the two representations of a JSON string and the ensure_ascii parameter of json.dumps.

Suppose we have a dictionary in Python:

info = {'name': '青南', 'age': 20}

When we want to convert it to a JSON string, we might write code like this:

import json
info = {'name': '青南', 'age': 20}
info_str = json.dumps(info)
print(info_str)

The result is shown in the following picture, where the Chinese characters become Unicode escape sequences:

We can also pass the parameter ensure_ascii=False so that the Chinese characters appear as is:

info_str = json.dumps(info, ensure_ascii=False)

The result is shown in the figure below:

The student reasoned that {"name": "\u9752\u5357", "age": 20} and {"name": "青南", "age": 20} are obviously not equal when compared as strings. For json.dumps, omitting the parameter is equivalent to ensure_ascii=True:

He therefore concluded that Requests actually sends Unicode escape sequences to the server when the POST data contains Chinese, so the server cannot recover the original Chinese text at all, which causes an error.

But that is not actually the case. I often tell my students that crawler engineers should have some basic back-end knowledge so that they are not misled by this kind of phenomenon. To explain why this student's understanding is wrong and why this is not a bug in Requests, let's write a service that accepts POST requests and see whether the data it receives differs between the two cases. To show that this behavior does not depend on the web framework, I will demonstrate it with Flask, FastAPI, and Gin.

First, let’s look at the Requests test code. Here we send JSON data in three ways:

import requests
import json

body = {
    'name': '青南',
    'age': 20
}
url = 'http://127.0.0.1:5000/test_json'

# Send the dictionary directly with json=
resp = requests.post(url, json=body).json()
print(resp)

headers = {
    'Content-Type': 'application/json'
}

# Serialize the dictionary to a JSON string in advance; Chinese is converted
# to Unicode escapes, which is equivalent to the first method
resp = requests.post(url,
                     headers=headers,
                     data=json.dumps(body)).json()
print(resp)

# Serialize the dictionary to a JSON string in advance, keeping the Chinese
# characters as is, then encode the string to bytes
resp = requests.post(url,
                     headers=headers,
                     data=json.dumps(body, ensure_ascii=False).encode()).json()
print(resp)

This test code sends the POST request in three ways. The first uses the json= argument that Requests provides, whose value is a dictionary; Requests automatically converts it to a JSON string. In the latter two, we manually convert the dictionary to a JSON string in advance and send it to the server with the data= argument. With these two, we must specify 'Content-Type': 'application/json' in the headers so that the server knows it is receiving a JSON string.

Let's take a look at the back-end code written with Flask:

from flask import Flask, request
app = Flask(__name__)


@app.route('/')
def index():
    return {'success': True}


@app.route('/test_json', methods=["POST"])
def test_json():
    # request.json holds the already deserialized JSON body
    body = request.json
    msg = f'Received POST data, {body["name"]=}, {body["age"]=}'
    print(msg)
    return {'success': True, 'msg': msg}

The result is shown in the figure below:

As you can see, the back end receives the correct information no matter which POST method is used.

Let's look at the FastAPI version:

from fastapi import FastAPI
from pydantic import BaseModel


class Body(BaseModel):
    name: str
    age: int


app = FastAPI()


@app.get('/')
def index():
    return {'success': True}


@app.post('/test_json')
def test_json(body: Body):
    # FastAPI has already parsed and validated the JSON body into a Body object
    msg = f'Received POST data, {body.name=}, {body.age=}'
    print(msg)
    return {'success': True, 'msg': msg}

The result is shown in the figure below. The back end correctly recognizes the data sent by all three POST methods:

Let’s take a look at the Gin version of the back end:

package main

import (
    "fmt"
    "net/http"

    "github.com/gin-gonic/gin"
)

// Body mirrors the JSON payload sent by the test client.
type Body struct {
    Name string `json:"name"`
    Age  int16  `json:"age"`
}

func main() {
    r := gin.Default()
    r.GET("/", func(c *gin.Context) {
        c.JSON(http.StatusOK, gin.H{
            "message": "running",
        })
    })
    r.POST("/test_json", func(c *gin.Context) {
        json := Body{}
        // Deserialize the request body into the struct
        c.BindJSON(&json)
        msg := fmt.Sprintf("Name=%s, age=%d", json.Name, json.Age)
        fmt.Println(">>>", msg)
        c.JSON(http.StatusOK, gin.H{
            "msg": msg,
        })
    })
    r.Run()
}

The result is as follows. The data from the three request methods is identical:

From this we can see that the back-end service parses the data correctly whether the Chinese in the JSON string we POST is written as Unicode escape sequences or as literal Chinese characters.

Why do I say it doesn't matter which form the Chinese takes in a JSON string? Because the process by which a programming language converts a JSON string back into an object (called deserialization) handles both forms properly. Take a look at the picture below:

The ensure_ascii argument only controls how the JSON string is displayed. When ensure_ascii is True, it guarantees that the JSON string contains only ASCII characters, so every character outside the 128 ASCII characters is escaped. When ensure_ascii is False, non-ASCII characters are written as is. It is like a person with or without makeup: the essence does not change. Modern programming languages handle both forms correctly when deserializing.
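A quick way to check this claim is to serialize the same dictionary both ways and then deserialize each result. The sketch below uses only the standard-library json module and the name 青南 (\u9752\u5357); the variable names are illustrative.

```python
import json

info = {'name': '青南', 'age': 20}

escaped = json.dumps(info)                    # ensure_ascii defaults to True
plain = json.dumps(info, ensure_ascii=False)  # keep non-ASCII characters as is

print(escaped)  # {"name": "\u9752\u5357", "age": 20}
print(plain)    # {"name": "青南", "age": 20}

# The two strings differ character by character...
strings_equal = escaped == plain
# ...but deserializing either one yields the same Python object.
objects_equal = json.loads(escaped) == json.loads(plain) == info
print(strings_equal, objects_equal)  # False True
```

Comparing the serialized strings is what misled the student; comparing the deserialized objects is what a back end actually does.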

So, if you write the back end with a modern web framework, the two JSON forms should make no difference. The default behavior of the json= argument in Requests is equivalent to ensure_ascii=True, and any modern web framework should correctly parse what it submits via POST.

Of course, if you're writing your back-end interface by hand in C, assembly, or some other language without a framework, it might make a difference. But who in their right mind would do that?

To sum up, the problem this student ran into was not a bug in Requests but a problem with the back-end interface itself. Perhaps the back end uses some poorly designed web framework that hands over the POSTed data as a raw JSON string without deserializing it, and the back-end programmer extracts data from that string with regular expressions. When that code finds no Chinese characters in the JSON string, it raises an error.
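To make that failure mode concrete, here is a minimal sketch of such a fragile back end next to one that deserializes properly. It is purely hypothetical: the student's real back-end code is unknown, and NAME_PATTERN, fragile_extract, and robust_extract are names invented for this illustration.

```python
import json
import re

body = {'name': '青南', 'age': 20}
escaped = json.dumps(body)                    # what json= sends by default
plain = json.dumps(body, ensure_ascii=False)  # literal Chinese characters

# Hypothetical fragile approach: pull the name out of the raw text with a
# regex that only accepts literal CJK characters.
NAME_PATTERN = re.compile(r'"name":\s*"([\u4e00-\u9fff]+)"')


def fragile_extract(raw):
    """Return the name found by the regex, or None if nothing matches."""
    match = NAME_PATTERN.search(raw)
    return match.group(1) if match else None


def robust_extract(raw):
    """Deserialize first, then read the field."""
    return json.loads(raw)['name']


print(fragile_extract(plain), fragile_extract(escaped))
print(robust_extract(plain), robust_extract(escaped))
```

The regex version only works when the client happens to send literal Chinese characters, while the deserializing version returns the same name for both forms.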

Besides the issue of sending JSON via POST, one of my subordinates once needed to send a POST request with Scrapy. Because he did not know how to write the POST code, he had the idea of putting the POST fields into the URL and requesting it with GET, and found that he could still get the data. Something like:

body = {'name': '青南', 'age': 20}
url = 'http://www.xxx.com/api/yyy'
requests.post(url, json=body).text

requests.get('http://www.xxx.com/api/yyy?name=青南&age=20').text

So he concluded that this was a general rule: every POST request could be converted into a GET request in this way.

But obviously this conclusion is also wrong. It only means that the site's back-end programmers made the interface accept both ways of submitting data, which required them to write extra code. By default, GET and POST are two completely different kinds of request and cannot be converted like this.
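One way to see the difference is to build both requests and look at where the fields end up. The sketch below uses the standard library's urllib rather than Requests or Scrapy, and it only constructs request objects without sending anything; the placeholder URL is the one used above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

body = {'name': '青南', 'age': 20}
url = 'http://www.xxx.com/api/yyy'

# POST: the fields travel in the request body as JSON.
post_req = Request(
    url,
    data=json.dumps(body).encode('utf-8'),
    headers={'Content-Type': 'application/json'},
    method='POST',
)

# GET: the fields travel in the URL's query string, and there is no body.
get_req = Request(f'{url}?{urlencode(body)}', method='GET')

print(post_req.get_method(), post_req.full_url, post_req.data)
print(get_req.get_method(), get_req.full_url, get_req.data)
```

A back end that reads only the JSON body sees nothing useful in the GET request, and one that reads only the query string sees nothing useful in the POST request, unless someone wrote code to accept both.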

If he had learned some simple back-end development, he could have immediately written a back-end program to verify his conjecture.

As another example, some websites include another URL as a parameter inside their URL, for example:

https://kingname.info/get_info?url=https://abc.com/def/xyz?id=123&db=admin

If you don't have basic back-end knowledge, you probably won't see what is wrong with the URL above. But with some back-end common sense, you might ask: does &db=admin in that URL belong to https://kingname.info/get_info, at the same level as url=, or is it a parameter of https://abc.com/def/xyz?id=123&db=admin? You will be confused, and so will the back end. That is why we need urlencode here: the two encoded forms below mean completely different things:

https://kingname.info/get_info?url=https%3A%2F%2Fabc.com%2Fdef%2Fxyz%3Fid%3D123%26db%3Dadmin

https://kingname.info/get_info?url=https%3A%2F%2Fabc.com%2Fdef%2Fxyz%3Fid%3D123&db=admin
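The ambiguity can be checked with the standard library's urllib.parse. This is a small sketch, not part of the original post: query_params is a helper name chosen here, and quote(..., safe="") is used so that every reserved character in the inner URL is percent-encoded.

```python
from urllib.parse import parse_qs, quote, urlsplit

inner = 'https://abc.com/def/xyz?id=123&db=admin'
base = 'https://kingname.info/get_info'

# Inner URL pasted in as is, versus fully percent-encoded.
raw_url = f'{base}?url={inner}'
encoded_url = f'{base}?url={quote(inner, safe="")}'


def query_params(url):
    """Parse the query string the way a typical back end would."""
    return parse_qs(urlsplit(url).query)


print(encoded_url)
print(query_params(raw_url))      # db is a separate top-level parameter
print(query_params(encoded_url))  # db stays inside the url parameter
```

Without encoding, the parser treats db=admin as a parameter of the outer URL and truncates the inner URL at id=123. With encoding, the whole inner URL arrives intact as the value of url.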

To conclude with a sentence from the preface to my book on crawlers:

Crawling is a mixed discipline: if you only know crawling, you cannot learn crawling well.
