
Preface

In the previous section, we described how to exchange data between a client and a server. We can use the GET and POST methods to interact with the server, and sensitive data should only be sent with POST requests so that it is not exposed in the URL. Of course, servers support other HTTP methods as well, such as PUT and DELETE, but HTML forms support only GET and POST.
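To make the difference concrete, here is a minimal sketch using Python 3's standard library. The URL and the form fields are made up for illustration, and no request is actually sent. Note that POST only keeps the data out of the URL; protecting it in transit still requires HTTPS.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields, for illustration only.
form = {"username": "alice", "password": "s3cret"}
encoded = urlencode(form)

# GET: the form data becomes part of the URL (visible in logs and history).
get_request = Request("http://example.com/login?" + encoded)

# POST: the form data travels in the request body instead.
post_request = Request("http://example.com/login", data=encoded.encode("ascii"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```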

I. About forms

The browser on the client needs to interact with the web server, and the server needs to return corresponding information based on user input.

Consider an example from w3school:

www.w3school.com.cn/html/html_f…

See section 5.2 for how GET, POST and the server interact.

Let’s focus on a problem with the login form.

II. Managing cookies

1. Log in using cookies

The HTTP protocol itself is stateless, so how can a website remember who is visiting, or that a user has logged in?

We need some mechanism on top of HTTP to identify users. That is what sessions and cookies are for.

What is a Cookie and what is a Session?

Session tracking is a common technique in web applications for following a user across an entire session. The two usual approaches are cookies and sessions. Cookies identify users by recording information on the client, and sessions identify users by recording information on the server.
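The server-side half of this can be sketched as follows. This is only an illustration: the in-memory `SESSIONS` dictionary and the function names are hypothetical, and a real site would use its framework's session support. The point is that the user data stays on the server, and the client only holds an opaque session ID (usually stored in a cookie).

```python
import secrets

# Hypothetical in-memory session store: session ID -> data kept on the server.
SESSIONS = {}


def create_session(username):
    """Record the user on the server and return an opaque ID for the client."""
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = {"username": username}
    return session_id


def lookup_user(session_id):
    """Return the username for a session ID, or None if it is unknown."""
    data = SESSIONS.get(session_id)
    return data["username"] if data else None


sid = create_session("alice")
print(lookup_user(sid))
print(lookup_user("not-a-real-id"))
```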

A cookie (the word literally means a small biscuit) is a mechanism first developed by Netscape and later standardized by the IETF. Cookies are now supported by all major browsers, such as Internet Explorer, Netscape, Firefox, and Opera. Because HTTP is a stateless protocol, the server has no way of knowing the identity of the client from the network connection alone. The solution is to issue each client a pass: every client gets its own, and must present it on every visit, so the server can identify the client from the pass. That is how cookies work.

A cookie is actually a small piece of text. When the client sends a request and the server needs to record the user's state, the server issues a cookie to the client browser in the response. The browser saves the cookie. When the browser requests the site again, it submits the cookie to the server along with the requested URL. The server checks the cookie to identify the user's state. The server can also modify the contents of the cookie as needed.
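The round trip described above can be simulated without any network using Python 3's `http.cookies` module. This is a sketch only: the cookie name `session_id` and its value are invented for the example.

```python
from http.cookies import SimpleCookie

# 1. Server side: issue a cookie in the response (the Set-Cookie header value).
issued = SimpleCookie()
issued["session_id"] = "abc123"   # hypothetical name and value
issued["session_id"]["path"] = "/"
set_cookie_header = issued["session_id"].OutputString()

# 2. Client side: the browser parses and stores the cookie.
browser_store = SimpleCookie()
browser_store.load(set_cookie_header)

# 3. Client side: on the next request, only name=value pairs are sent back.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in browser_store.items()
)

# 4. Server side: read the cookie to identify the user's state.
received = SimpleCookie()
received.load(cookie_header)

print("Set-Cookie:", set_cookie_header)
print("Cookie:", cookie_header)
print("Server sees session_id =", received["session_id"].value)
```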

Let’s look at an example of how to use cookies to log in. Sometimes a crawler can only grab information from web pages after logging in, as on Weibo, Zhihu, Renren, and so on.
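One simple approach is to log in with a normal browser, copy the Cookie value from the browser's developer tools, and attach it to the crawler's requests. The sketch below uses Python 3's `urllib.request` (the successor of `urllib2`); the URL, the cookie string, and the function name are placeholders, not values from any real site.

```python
import urllib.request


def build_logged_in_request(url, cookie_string):
    """Build a request that carries a cookie copied from a logged-in browser."""
    headers = {
        "Cookie": cookie_string,
        # Many sites expect a browser-like User-Agent header.
        "User-Agent": "Mozilla/5.0",
    }
    return urllib.request.Request(url, headers=headers)


# Placeholder values: replace them with a real page and your own cookie.
request = build_logged_in_request(
    "http://example.com/home", "sid=abc123; theme=dark"
)
print(request.get_header("Cookie"))

# To actually fetch the page:
# html = urllib.request.urlopen(request).read()
```

Cookies copied this way expire, so the crawler stops working once the session ends and the cookie has to be refreshed by hand.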

For more details about cookies, see: www.w3cschool.cn/pegosu/skj8…

2. Supplement: using CookieJar

Cookies have expiry times, domain restrictions, encoding issues, and so on. Managing cookies by hand is tedious and error-prone, especially when there are many of them to track.

If a 302 redirect is returned after a web login, the Set-Cookie information is lost from the urllib2 response, and the login fails.

We need a generic cookie-handling tool that processes Set-Cookie headers automatically, discards expired cookies, and sends the right cookies to the corresponding domain. CookieJar addresses these issues.
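A minimal sketch of this in Python 3, where `cookielib` and `urllib2` have become `http.cookiejar` and `urllib.request`. To keep the example runnable without a network, it feeds the jar a hand-built response through the same two methods that `HTTPCookieProcessor` calls internally; the `FakeResponse` class, the URLs, and the cookie value are assumptions made for the demo.

```python
import email.message
import http.cookiejar
import urllib.request

# One jar shared by every request made through this opener.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
# opener.open(url) would now store and resend cookies automatically,
# including across redirects.


class FakeResponse:
    """Stand-in for a server response, so the demo needs no network."""

    def __init__(self, set_cookie_value):
        self._headers = email.message.Message()
        self._headers["Set-Cookie"] = set_cookie_value

    def info(self):
        return self._headers


# Hypothetical login exchange: the server sets a session cookie.
login_request = urllib.request.Request("http://example.com/login")
jar.extract_cookies(FakeResponse("sid=abc123; Path=/"), login_request)

# A later request to the same site gets the cookie attached by the jar.
follow_up = urllib.request.Request("http://example.com/profile")
jar.add_cookie_header(follow_up)
print(follow_up.get_header("Cookie"))
```

In a real crawler only the first three statements are needed; every later call goes through `opener.open(...)`.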

III. CAPTCHA

A CAPTCHA is a defensive measure that websites use to block automated programs carrying out fraud or attacks. The technology is said to have been introduced by PayPal and is now widely used across the Internet.

Generally, there are two ways to handle a CAPTCHA:

1) When a CAPTCHA is required, the program shows the image to the user, who types in the answer;

2) The program uses image recognition to read the characters in the image.
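The first approach can be sketched in a few lines. The function name, the file name `captcha.png`, and the `prompt` parameter are all illustrative choices; `image_bytes` stands for whatever the crawler downloaded from the site's CAPTCHA URL.

```python
import tempfile
from pathlib import Path


def ask_user_for_captcha(image_bytes, prompt=input, directory=None):
    """Save the CAPTCHA image to disk and ask a person to read it."""
    directory = Path(directory or tempfile.gettempdir())
    image_path = directory / "captcha.png"
    image_path.write_bytes(image_bytes)
    answer = prompt(f"Open {image_path} and type the characters you see: ")
    return answer.strip()


# Usage (interactive):
# code = ask_user_for_captcha(downloaded_image_bytes)
```

Passing the prompt function as a parameter keeps the helper easy to test or to swap for a GUI dialog later.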

OCR (Optical Character Recognition) refers to the process in which electronic devices (such as scanners or digital cameras) examine characters printed on paper, determine their shapes by detecting patterns of dark and light, and then translate those shapes into computer text using character recognition methods.

Ways for a program to handle complex CAPTCHAs:

1. Tesseract, an open source OCR engine whose development was long sponsored by Google;

Install Tesseract on Ubuntu, then install the pytesseract Python wrapper:

         sudo apt-get install tesseract-ocr

         pip install pytesseract

Training and testing: www.cnblogs.com/cnlian/p/57…

Simple Python test code:

         from PIL import Image
         from pytesseract import image_to_string

         # Load the image
         image = Image.open('test1.jpg')

         # Run OCR on the image and print the recognized text
         text = image_to_string(image)
         print(text)

2. Use an online recognition service, such as Baidu AI.