
Cookies were created for the interactive Web and are mainly used for three purposes:

  1. Session state management (such as login status, shopping carts, game scores, or other information that needs to be recorded)
  2. Personalization (such as user-defined settings, themes, etc.)
  3. Browser behavior tracking (e.g. tracking and analyzing user behavior)


Today we will use the Requests library to log in to Douban and then crawl movie reviews as an example, using code to explain how cookies support session state management (login).

This tutorial is for learning only, not for commercial profit! If it infringes on any company's interests, please get in touch promptly!

I. Background

Previously I crawled the bullet comments on Youku and generated word cloud images, and found that the quality of Youku's bullet comments is not high: there are many interjections and other meaningless words, such as "ha", "ah", "these", "those"… Douban has a good reputation, and some of its book and movie recommendations are very good, so today we will crawl Douban movie reviews and then generate a word cloud to see the effect!

II. Function description

We use the Requests library to log in to Douban, then crawl the movie reviews, and finally generate a word cloud!

Why did the previous cases (Jingdong, Youku, etc.) not require logging in, while crawling Douban today does? Because Douban only lets you view the first 200 reviews without logging in; to view more, you have to log in. This is also a kind of anti-crawling measure!

III. Technical approach

Let’s take a look at the simple technical solution, which can be roughly divided into three parts:

  1. Analyze the Douban login interface and use the Requests library to log in and save cookies
  2. Analyze the Douban movie review interface and crawl the data in batches
  3. Use a word cloud to analyze the movie review data


With the plan settled, let's get started.

IV. Log in to Douban

Before writing the crawler, we start in the browser and use the debug window (developer tools) to find the URL.

1. Analyze the Douban login interface

Open the login page, bring up the debug window, enter the user name and password, and click Login.



It is recommended to enter a wrong password, so that a page redirect does not make you miss the request! Here we get the URL of the login request:

Accounts.douban.com/j/mobile/lo…

Since this is a POST request, we also need to look at the parameters sent with the login request, so we scroll down in the debug window to Form Data.

2. Log in to Douban with code

Once we have the login request URL and parameters, we can use the Requests library to write a login feature!
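A minimal sketch of such a login function is shown below. The login URL is truncated above, so `LOGIN_URL` here is only a placeholder to be replaced with the address copied from the debug window. The form field names (`name`, `password`, `remember`) and the User-Agent value are assumptions, so replace them with whatever Form Data actually shows. The `http` argument is any object with a Requests-style `post` method, so the `requests` module itself can be passed in.

```python
# Placeholder: replace with the full login URL copied from the debug window.
LOGIN_URL = "https://example.invalid/login"

# Assumed header: many sites reject requests without a browser-like User-Agent.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_login_form(username, password):
    """Build the POST body. The field names are assumptions; copy the real
    ones from the Form Data panel."""
    return {"name": username, "password": password, "remember": "false"}


def login(http, username, password, url=LOGIN_URL):
    """POST the credentials and return the response.

    `http` is any object with a Requests-style post(), for example the
    `requests` module itself.
    """
    form = build_login_form(username, password)
    response = http.post(url, headers=HEADERS, data=form)
    response.raise_for_status()  # fail loudly on HTTP 4xx/5xx
    return response
```

With Requests installed, `import requests` followed by `login(requests, "your_name", "your_password")` sends the request, and printing `response.text` shows whether the site accepted the login.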

3. Save the session status

How do we get our code to auto-save cookies?

You may have seen or used the urllib library, where cookies have to be stored explicitly: you create a cookie jar, wrap it in a handler, and build an opener from it.
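With the standard library alone, that wiring typically looks like the sketch below. It uses only `http.cookiejar` and `urllib.request`; the helper name `make_cookie_opener` is made up for illustration.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return (opener, jar); the opener stores received cookies in the jar."""
    jar = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(jar)
    opener = urllib.request.build_opener(handler)
    return opener, jar


opener, jar = make_cookie_opener()
# opener.open(url) would now record any Set-Cookie headers in `jar`
# and send those cookies back on later requests made through this opener.
```

Every request has to go through this particular `opener` for the cookies to be reused, which is the bookkeeping that Requests hides.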


But as we said earlier about the Requests library:

Requests is a third-party network library built on urllib3. It is known for its powerful features and elegant API. The official Python documentation also recommends Requests as an HTTP client, and in practice Requests is used more often.


So today let's see how gracefully the Requests library saves cookies for us. We tweak the login code slightly so that it saves cookies automatically and maintains session state!



Compared with the previous version, we made two changes:

  1. Add a line at the top, s = requests.Session(), which creates a Session object to store cookies
  2. Requests are no longer sent through the requests module, but through the Session object


Requests are now sent through a Session object, which sends them in the same way as the requests module, except that the stored cookies are automatically attached to every request.
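The two changes can be sketched as follows. The helper names `login_with_session` and `get_text` are made up for illustration, and the URLs and form fields are placeholders. Both functions take the session as an argument, so the same object carries the cookies from the login to every later request.

```python
def login_with_session(session, login_url, form):
    """Log in through `session`. Cookies set by the response are kept on
    the session object (session.cookies)."""
    response = session.post(login_url, data=form)
    response.raise_for_status()
    return response


def get_text(session, url):
    """Fetch a page through the same session, so the stored cookies are
    sent along automatically."""
    response = session.get(url)
    response.raise_for_status()
    return response.text


# Typical use (needs the third-party requests package; the URLs and form
# fields are placeholders):
#   import requests
#   s = requests.Session()
#   login_with_session(s, login_url, {"name": "...", "password": "..."})
#   html = get_text(s, page_url)
```

After the login call, `s.cookies` holds whatever cookies the server set, and there is no need to copy them into later requests by hand.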

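4. Is this what we call a session?

Is the requests.Session object the same thing as the session we usually talk about?

Of course not. On the client side, the requests.Session object is essentially an object that stores cookies and reuses them across requests, whereas session technology refers to state that the server keeps for each user.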



So don't confuse the requests.Session object with session technology!

V. Crawl movie reviews


Now that we’ve logged in and saved session state, we’re ready to get down to business!

1. Analyze the Douban movie review interface

First, find the movie you want to analyze on Douban. Here we choose an American movie, Into the Wild.



Then scroll down to the movie reviews, bring up the debug window, and find the URL that loads the reviews.

2. Crawl one page of movie review data




But what comes back is an HTML page, and we need to extract the review data from it.

3. Extract the review content

The crawl returns HTML, and the review data is nested inside HTML tags. How do we extract the review content?

Here we use a regular expression to match the tag content we want. There are of course more advanced ways to extract it, such as parsing the HTML with a library (BS4, XPath, etc.), and using such libraries is relatively efficient, but that is a topic for later. Today we will use re!

Let's first examine the structure of the returned HTML page.



We found that the reviews are inside the <span class="short"></span> tag, so we can write a regular expression to match the content of this tag.
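A minimal sketch of that regular expression is shown below. The sample HTML is made up for the demonstration, and the pattern assumes the tag is written exactly as `<span class="short">` with no extra attributes. If the real page differs, the pattern needs adjusting.

```python
import html
import re

# Non-greedy match of everything between the opening and closing tag;
# re.S lets "." span line breaks inside a review.
SHORT_COMMENT = re.compile(r'<span class="short">(.*?)</span>', re.S)


def extract_comments(page_html):
    """Return the text of every <span class="short"> element in the page."""
    return [html.unescape(text).strip() for text in SHORT_COMMENT.findall(page_html)]


sample = (
    '<div><span class="short">Great film &amp; soundtrack</span></div>'
    '<div><span class="short">\n  Moving.\n</span></div>'
)
print(extract_comments(sample))  # ['Great film & soundtrack', 'Moving.']
```

The non-greedy `(.*?)` matters here: a greedy `(.*)` would run from the first opening tag to the last closing tag and merge all reviews into one match.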



Check the extracted content

4. Batch crawl

After crawling, extracting, and saving one page of data, we move on to crawling in batches. From previous crawls we know that the key to batch crawling is finding the paging parameter, and we can quickly spot a start parameter in the URL, which is the parameter that controls paging.
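A paging loop over the start parameter might look like the sketch below. The page size of 20 and the one-second delay are assumptions, and `comments_url` stands for the review URL found in the debug window. `session` is a logged-in `requests.Session` and `extract` is a function that pulls the reviews out of one page of HTML.

```python
import time


def crawl_comments(session, comments_url, extract,
                   page_size=20, max_pages=25, delay=1.0):
    """Collect reviews page by page until a page comes back empty."""
    comments = []
    for page in range(max_pages):
        # `start` is the offset of the first review on the requested page.
        response = session.get(comments_url, params={"start": page * page_size})
        response.raise_for_status()
        found = extract(response.text)
        if not found:
            break  # no more reviews
        comments.extend(found)
        time.sleep(delay)  # pause between requests to go easy on the site
    return comments
```

Stopping on the first empty page means the loop also ends cleanly if the site serves fewer pages than `max_pages`.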



We get only 25 pages. Checking in the browser confirms that there really are only 25 pages!

VI. Analyze movie reviews

After the data is captured, let's use a word cloud to analyze the movie!

We have already covered two word cloud cases before, so Brother Pig will only give a brief explanation here!

1. Word segmentation with jieba

Because the reviews we downloaded are passages of text, and a word cloud is built by counting how often each word occurs, we need to segment the text into words first!
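The counting step can be sketched as follows. The function name `word_frequencies` and the stop word set are made up for illustration, and the segmenter is passed in as `cut` so the example runs without extra packages. For Chinese reviews, `jieba.lcut` is the kind of function that would be passed here.

```python
from collections import Counter


def word_frequencies(comments, cut, stopwords=frozenset()):
    """Count how often each word occurs across all comments.

    `cut` splits one string into a list of words, e.g. jieba.lcut for
    Chinese text; `stopwords` are words to leave out of the count.
    """
    counts = Counter()
    for comment in comments:
        for word in cut(comment):
            word = word.strip()
            if word and word not in stopwords:
                counts[word] += 1
    return counts


# str.split stands in for a real segmenter in this small demonstration.
freqs = word_frequencies(["free and wild", "wild and young"], str.split, {"and"})
print(freqs.most_common(1))  # [('wild', 2)]
```

The resulting counts are a plain word-to-frequency mapping, which is the form a word cloud library can draw from, for example via `WordCloud.generate_from_frequencies` in the wordcloud package.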

2. Use word clouds


Final Results:



From these words we can tell that this is a movie about the pursuit of self and real life.

VII. Summary

Today we took Douban as an example and learned quite a lot. To sum up:

  1. Learned how to make POST requests using the Requests library
  2. Learned how to use the Requests library to log in to a website
  3. Learned how to use the Requests library's Session object to maintain session state
  4. Learned how to use regular expressions to extract content from HTML tags
