[I. Project Background]

Douban Movie provides the latest movie introductions and reviews, along with movie news and ticket booking services. You can record the movies and TV shows you want to watch, are watching, or have watched, rate them, and write reviews. It has made life considerably more convenient.

Today we take TV series (American TV series) as an example, crawl the corresponding titles in batches, and write them to a CSV file. Users can then use the ratings to pick what they want to watch.

[II. Project Objectives]

Get each title's name, rating, and detail-page link, download its picture, and save everything to disk.

[III. Libraries and Websites involved]

1. The request URL is as follows:

https://movie.douban.com/j/search_subjects?type=tv&tag=%E7%BE%8E%E5%89%A7&sort=recommend&page_limit=20&page_start={}

2. Libraries involved: requests, fake_useragent, json, csv

3. Software: PyCharm

[IV. Project Analysis]

1. How do we request multiple pages?

Each time you go to the next page, the page_start parameter in the URL increases by 20. Replace that changing value with {}, then use a for loop to fill it in and request each URL in turn.
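The paging idea can be sketched as follows. This is only an illustration: the URL template is the one given in section III, while the helper name build_page_urls and its parameters are made up for this example.

```python
# URL template from section III; {} stands in for the page_start value.
BASE_URL = (
    "https://movie.douban.com/j/search_subjects"
    "?type=tv&tag=%E7%BE%8E%E5%89%A7&sort=recommend"
    "&page_limit=20&page_start={}"
)


def build_page_urls(first_page, last_page, page_size=20):
    """Return one URL per page, where page_start = page index * page_size.

    build_page_urls is a hypothetical helper, not part of any library.
    """
    return [
        BASE_URL.format(page * page_size)
        for page in range(first_page, last_page + 1)
    ]
```

Calling build_page_urls(0, 2) yields three URLs whose page_start values are 0, 20, and 40.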

2. How do we find the real request address?

If you request the page itself, the data you want is not in the returned HTML. Douban loads this content dynamically with JavaScript, which also makes it harder to collect.

1) Press F12 (or right-click and choose Inspect), open the Network tab, find the fifth entry in the Name column on the left, and click Preview.

2) Expand subjects: title is the corresponding name and rate is the corresponding rating. Parse the response as JSON and read the required fields from subjects.
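A minimal sketch of reading those fields with the json module, run against a hand-written sample rather than a live response. The subjects, title, and rate names come from the Preview pane described above; the url and cover keys, the sample values, and the helper name parse_subjects are assumptions made for illustration.

```python
import json

# Hand-written sample shaped like the Preview pane; all values are made up.
SAMPLE_RESPONSE = """
{"subjects": [
  {"title": "Example Show A", "rate": "9.1",
   "url": "https://example.com/a", "cover": "https://example.com/a.jpg"},
  {"title": "Example Show B", "rate": "8.4",
   "url": "https://example.com/b", "cover": "https://example.com/b.jpg"}
]}
"""


def parse_subjects(text):
    """Parse the JSON text and return one dict per entry in "subjects"."""
    data = json.loads(text)
    rows = []
    for item in data.get("subjects", []):
        rows.append({
            "title": item.get("title"),  # name
            "rate": item.get("rate"),    # rating
            "url": item.get("url"),      # assumed key for the details link
            "cover": item.get("cover"),  # assumed key for the picture address
        })
    return rows
```

Using .get() means a missing key gives None instead of raising KeyError, which is convenient while you are still checking what the real response contains.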

3. How is each page accessed?

    Movie.douban.com/j/search_su…

    Movie.douban.com/j/search_su…

    Movie.douban.com/j/search_su…

    Movie.douban.com/j/search_su…

The only difference between these addresses is the page_start value, which goes up by 20 per page, so a single URL with {} in its place plus a for loop covers all of them.

[V. Project Implementation]

1. Define a class that inherits from object, with an __init__ method and a main method that each take self. Import the required libraries and set the request URL.

2. Generate a random User-Agent and build the request headers to get past anti-crawling checks.

3. Send the request and get the response. Put this in its own method so that later requests can reuse it.

4. Parse the response with json to get the corresponding dictionary.

5. Loop over the entries with for and get each name, rating, and detail-page link.

6. Create a CSV file for writing, define the header row, and save the data.

7. Request the picture address, define the image file name, and save the file.

8. Call the methods to run the crawler.

9. Project optimization:

1) Add a delay between requests.

2) Define a variable u in the for loop to indicate which page is being crawled (clearer and nicer to read).
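The steps above can be put together roughly as follows. This is a sketch under stated assumptions, not the original author's code: the class name DoubanSpider, the out_dir and delay parameters, the douban.csv file name, and the url and cover keys are all chosen for illustration. requests and fake_useragent are imported inside the methods that use them, so the parsing and CSV parts work without those packages installed.

```python
import csv
import json
import os
import time


class DoubanSpider(object):
    """Hypothetical crawler class following steps 1 to 9 above."""

    def __init__(self, out_dir="douban_out", delay=1.0):
        self.url = ("https://movie.douban.com/j/search_subjects"
                    "?type=tv&tag=%E7%BE%8E%E5%89%A7&sort=recommend"
                    "&page_limit=20&page_start={}")
        self.out_dir = out_dir  # assumed output folder
        self.delay = delay      # seconds to wait between pages

    def build_headers(self):
        # Step 2: random User-Agent from the fake_useragent package.
        from fake_useragent import UserAgent
        return {"User-Agent": UserAgent().random}

    def fetch(self, url):
        # Step 3: one reusable GET request; returns the raw response bytes.
        import requests
        response = requests.get(url, headers=self.build_headers(), timeout=10)
        response.raise_for_status()
        return response.content

    def parse(self, body):
        # Steps 4 and 5: JSON body -> (title, rate, url, cover) tuples.
        # "url" and "cover" are assumed key names; check them in Preview.
        subjects = json.loads(body).get("subjects", [])
        return [(s.get("title"), s.get("rate"), s.get("url"), s.get("cover"))
                for s in subjects]

    def save_csv(self, rows):
        # Step 6: utf-8-sig so spreadsheet tools display Chinese titles.
        path = os.path.join(self.out_dir, "douban.csv")
        with open(path, "w", newline="", encoding="utf-8-sig") as f:
            writer = csv.writer(f)
            writer.writerow(["title", "rate", "url"])
            writer.writerows(row[:3] for row in rows)
        return path

    def save_image(self, title, cover_url):
        # Step 7: name the file after the title, avoiding path separators.
        safe_name = title.replace("/", "_")
        path = os.path.join(self.out_dir, "{}.jpg".format(safe_name))
        with open(path, "wb") as f:
            f.write(self.fetch(cover_url))
        return path

    def main(self, start_page, end_page):
        # Steps 8 and 9: u is the page index, starting from 0.
        os.makedirs(self.out_dir, exist_ok=True)
        rows = []
        for u in range(start_page, end_page + 1):
            print("Crawling page {}".format(u))
            rows.extend(self.parse(self.fetch(self.url.format(u * 20))))
            time.sleep(self.delay)
        for title, _rate, _url, cover in rows:
            if title and cover:
                self.save_image(title, cover)
        return self.save_csv(rows)
```

To try it, call DoubanSpider().main(0, 0) to crawl only the first page. Keeping fetch as a separate method also makes it easy to override in a subclass when you want to test the parsing and saving logic without touching the network.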

[VI. Effect Display]

1. Click the green triangle to run the script, then enter the start page and the end page (pages are numbered from 0).

2. The console shows a message for each successful download.

3. The CSV file is saved.

4. The downloaded pictures are displayed.

[VII. Summary]

1. Do not crawl too much data; it can easily put a heavy load on the server.

2. This article covers the difficulties and key points of crawling Douban with Python, including how to deal with anti-crawling measures, and offers a solution for each.

3. I hope this project helps you understand the basic process of parsing a page with JSON, how to concatenate strings, and how to use the format function.

4. This article uses Python crawler libraries to collect Douban titles and pictures. You will always run into problems when implementing it yourself. Do not aim high while neglecting hands-on work; only by practicing repeatedly will you understand it more deeply.

This article is reprinted; copyright belongs to the original author. If there is any infringement, please contact the editor to have it removed.

Original address: www.tuicool.com/articles/mu…
