A Hands-On Crawler Project

Static Web Pages in Practice

In this section we walk through a complete crawler so you can see the general workflow. The project extracts every entry on Maoyan's TOP100 movie list and stores the results in a CSV file; the list's home page is http://maoyan.com/board/4. In section 3.2.2 we already fetched all the movies on the first page. To fetch the second and third pages we need their URLs, so we keep turning pages in the browser and watch how the URL in the address bar changes:

Page 2: http://maoyan.com/board/4?offset=10
Page 3: http://maoyan.com/board/4?offset=20
Page 4: http://maoyan.com/board/4?offset=30
...
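The offset rule can be captured in a small helper. This is only a sketch of the URL construction; `build_page_url` is a name made up here for illustration:

```python
from urllib.parse import urlencode

BASE_URL = 'http://maoyan.com/board/4'

def build_page_url(page):
    """Return the board URL for a zero-based page number."""
    if page == 0:
        # The first page is served without any offset parameter
        return BASE_URL
    return BASE_URL + '?' + urlencode({'offset': page * 10})

print(build_page_url(2))  # http://maoyan.com/board/4?offset=20
```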
We can write a function that fetches the data for a given page; it takes the page number as its parameter:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}

# The offset parameter defaults to 0, which is the first page
params = {
    'offset': 0
}


def get_html(page):
    """Fetch one HTML page.

    :param page: page number
    :return: the HTML of that page
    """
    params['offset'] = page * 10
    url = 'http://maoyan.com/board/4'
    try:
        response = requests.get(url, headers=headers, params=params)
        if response.status_code == 200:
            return response.text
        else:
            return -1
    except requests.RequestException:
        return None

Once we have the HTML page, we can extract each movie's information: title, stars, release date, score, and so on. There are many ways to extract it; here we use regular expressions:
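Before looking at the full parser, here is a minimal, self-contained sketch of the technique: a lazy pattern compiled with re.S pulling several fields out of one entry. The snippet below is invented for illustration; the real Maoyan markup is more complex.

```python
import re

# An invented, simplified stand-in for one movie entry on the list page
snippet = '''
<dd>
  <p class="name"><a href="/films/1">Farewell My Concubine</a></p>
  <p class="star">Stars: Leslie Cheung</p>
  <p class="releasetime">Release: 1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

# re.S lets '.' match newlines, so one pattern can span the whole entry;
# the lazy '.*?' stops at the first closing tag instead of the last one
pat = re.compile(
    'name"><a.*?>(.*?)</a>'
    '.*?star">(.*?)</p>'
    '.*?releasetime">(.*?)</p>'
    '.*?integer">(.*?)</i>'
    '.*?fraction">(.*?)</i>',
    re.S)

name, star, releasetime, left, right = re.findall(pat, snippet)[0]
print(name, star, releasetime, left + right)
```

The full parser applies the same idea to the real page's markup.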

def parse_infor(html):
    """Extract movie info from an HTML page.

    :param html: the HTML page
    :return: a list of movie-info dicts for this page
    """
    # Regular expression that pulls the movie name, stars, release date and
    # the two halves of the score out of each <dd> entry on the board page
    pat = re.compile(
        '<dd>.*?name"><a.*?>(.*?)</a>'
        '.*?star">(.*?)</p>'
        '.*?releasetime">(.*?)</p>'
        '.*?integer">(.*?)</i>'
        '.*?fraction">(.*?)</i>',
        re.S)
    # findall() returns a list of tuples, one tuple per movie
    results = re.findall(pat, html)
    one_page_film = []
    if results:
        for result in results:
            film_dict = {}
            # Movie name
            film_dict['name'] = result[0]
            # Lead actors: drop newlines, strip whitespace on both sides and
            # slice off the three-character prefix '主演：' ("stars:")
            start = result[1].replace('\n', '')
            start = start.strip()[3:]
            film_dict['start'] = start
            # Release date: slice off the five-character prefix
            # '上映时间：' ("release time:")
            releasetime = result[2]
            releasetime = releasetime[5:]
            film_dict['releasetime'] = releasetime
            # The score is split into two pieces on the page, so join the
            # two captured halves back together
            left_half = result[3]
            right_half = result[4]
            film_dict['score'] = left_half + right_half
            # Print the movie info
            print(film_dict)
            # Collect this movie's dict into the page's list
            one_page_film.append(film_dict)
        return one_page_film
    else:
        return None
Readers who are rusty on regular expressions should review the earlier material. Regexes can be fiddly to write, but they are among the most efficient ways to extract data. Next we store the extracted movie information, here as a CSV file:

def save_infor(one_page_film):
    """Store the extracted movie info.

    :param one_page_film: list of movie-info dicts
    :return: None
    """
    with open('top_film.csv', 'a', newline='', encoding='utf-8') as f:
        csv_file = csv.writer(f)
        for one in one_page_film:
            csv_file.writerow([one['name'], one['start'], one['releasetime'], one['score']])
The above covers fetching one HTML page, extracting its movie information, and writing it to CSV. Next we build the URLs for all ten pages to fetch and store every movie on the Maoyan TOP100 list. The complete program is as follows:

import requests
import re
import csv
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
params = {
    'offset': 0
}


def get_html(page):
    """Fetch one HTML page.

    :param page: page number
    :return: the HTML of that page
    """
    params['offset'] = page * 10
    url = 'http://maoyan.com/board/4'
    try:
        response = requests.get(url, headers=headers, params=params)
        if response.status_code == 200:
            return response.text
        else:
            return -1
    except requests.RequestException:
        return None


def parse_infor(html):
    """Extract movie info from an HTML page.

    :param html: the HTML page
    :return: a list of movie-info dicts for this page
    """
    # Pull the movie name, stars, release date and the two halves of the
    # score out of each <dd> entry on the board page
    pat = re.compile(
        '<dd>.*?name"><a.*?>(.*?)</a>'
        '.*?star">(.*?)</p>'
        '.*?releasetime">(.*?)</p>'
        '.*?integer">(.*?)</i>'
        '.*?fraction">(.*?)</i>',
        re.S)
    results = re.findall(pat, html)
    one_page_film = []
    if results:
        for result in results:
            film_dict = {}
            # Movie name
            film_dict['name'] = result[0]
            # Lead actors: drop newlines, strip whitespace and slice off the
            # three-character prefix '主演：' ("stars:")
            start = result[1].replace('\n', '')
            film_dict['start'] = start.strip()[3:]
            # Release date: slice off the five-character prefix '上映时间：'
            film_dict['releasetime'] = result[2][5:]
            # The score is split in two on the page; join the halves
            film_dict['score'] = result[3] + result[4]
            # Print the movie info
            print(film_dict)
            one_page_film.append(film_dict)
        return one_page_film
    else:
        return None


def save_infor(one_page_film):
    """Store the extracted movie info.

    :param one_page_film: list of movie-info dicts
    :return: None
    """
    with open('top_film.csv', 'a', newline='', encoding='utf-8', errors='ignore') as f:
        csv_file = csv.writer(f)
        for one in one_page_film:
            csv_file.writerow([one['name'], one['start'], one['releasetime'], one['score']])


if __name__ == "__main__":
    # Build the ten page numbers with a loop
    for page in range(10):
        # Request the page
        html = get_html(page)
        if html and html != -1:
            # Extract its movie info
            one_page_film = parse_infor(html)
            if one_page_film:
                # Store it
                save_infor(one_page_film)
        # Pause briefly between requests
        time.sleep(1)

Dynamic Web Pages in Practice

In this section we crawl Maoyan's real-time box-office data and learn how to get the data we want out of a dynamic web page. First, open Maoyan Pro – Real-time Box Office at https://piaofang.maoyan.com/dashboard. You can see the current box-office figures, and the "today" numbers keep ticking up in real time:

If we view the page source, there is no box-office information in it, so the page is most likely using Ajax (Asynchronous JavaScript and XML). A dynamic web page is a web-programming technique contrasted with static pages: once a static page's HTML is generated, its content and appearance stay essentially the same unless the page code itself is edited, whereas a dynamic page's code is unchanged but its content can vary over time as a result of user interaction or database operations. We can analyze the page with the browser's developer tools:

We find that a new request appears every so often, and its type is XHR, which is exactly the request type Ajax uses. These requests are probably the real-time box-office updates, so the data we need is likely in these responses. Let's pick one to analyze:

In the Preview pane we can see plenty of movie information, including the real-time box-office figures we want. The content is JSON, which the developer tools parse automatically for easy inspection. All we need to do is reproduce this Ajax request in Python, fetch the data, and parse it. Since Ajax requests are still ordinary HTTP requests, we just grab the URL and issue the same request from Python; we can copy it directly, as shown below:



Having copied the request URL, we can emulate the request in Python:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}


def get_html():
    """Fetch the JSON data.

    :return: the data in JSON format
    """
    # The URL of the second.json request
    url = 'https://box.maoyan.com/promovie/api/box/second.json'
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # The response is JSON, so return it parsed for later extraction
            return response.json()
        else:
            return None
    except requests.RequestException:
        return None

After obtaining the JSON data, we can extract what we need from it.
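To make the structure concrete, here is a hand-written sample shaped like the second.json response. The field names (data, list, movieName, releaseInfo, boxInfo, boxRate, sumBoxInfo) are the ones the extraction code relies on; the values are invented for illustration:

```python
import json

# Invented sample payload mirroring the nesting of the real response
sample = '''
{
  "data": {
    "list": [
      {"movieName": "Example Film", "releaseInfo": "Day 10 of release",
       "boxInfo": "1234.5", "boxRate": "12.3%", "sumBoxInfo": "9876.5"}
    ]
  }
}
'''

payload = json.loads(sample)
# Walk down the structure layer by layer with get(), as the extractor does
for item in payload.get('data').get('list'):
    print(item.get('movieName'), item.get('boxInfo'))
```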

def parse_infor(json):
    """Extract movie box-office data from the JSON data: movie name, release
    info, gross box office, box-office share and cumulative box office.

    :param json: data in JSON format
    :return: yields one movie's data as a dict per loop iteration
    """
    if json:
        # Use get() to walk down the JSON structure layer by layer
        items = json.get('data').get('list')
        for item in items:
            piaofang = {}
            piaofang['Movie name'] = item.get('movieName')
            piaofang['Release Info'] = item.get('releaseInfo')
            piaofang['General box office'] = item.get('boxInfo')
            piaofang['Box office percentage'] = item.get('boxRate')
            piaofang['Cumulative box office'] = item.get('sumBoxInfo')
            # As a generator, hand back one movie's data per iteration
            yield piaofang
    else:
        return None
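The function above ends in yield, which makes it a generator: each loop iteration hands one movie's dict back to the caller on demand. A tiny standalone example of the same mechanism:

```python
def count_up_to(n):
    """Yield the integers 1..n one at a time instead of building a list."""
    i = 1
    while i <= n:
        yield i  # execution pauses here and resumes on the next iteration
        i += 1

# The consumer pulls values lazily; list() drains the generator
print(list(count_up_to(3)))  # [1, 2, 3]
```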

Readers may notice we did not use a conventional return here: the function is a generator, so it yields one movie's data per loop iteration. Readers who want the details can consult a Python generator tutorial, such as the one on Liao Xuefeng's site. Next we store the extracted box-office information as a formatted HTML file:

def save_infor(results):
    """Store the movie box-office data as a formatted HTML file.

    :param results: generator of movie box-office data
    :return: None
    """
    rows = ''
    for piaofang in results:
        # Fill one HTML table row with Python's str.format()
        row = '<tr><td>{}</td><td>{}</td><td>{}</td><td>{}</td><td>{}</td></tr>'.format(
            piaofang['Movie name'],
            piaofang['Release Info'],
            piaofang['General box office'],
            piaofang['Box office percentage'],
            piaofang['Cumulative box office'])
        # Append each formatted row to the accumulated table body
        rows = rows + '\n' + row
    # Assemble the full HTML page by string concatenation
    piaofang_html = '''
    <html lang="en">
    <head><meta charset="utf-8"></head>
    <body>
    <table border="1">
    <tr><th>Movie name</th><th>Release info</th><th>Gross box office</th><th>Box office share</th><th>Cumulative box office</th></tr>''' + rows + '''
    </table>
    </body>
    </html>'''
    # Write the formatted HTML page to disk
    with open('piaofang.html', 'w', encoding='utf-8') as f:
        f.write(piaofang_html)
Putting the pieces together, we get the complete box-office crawler:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}


def get_html():
    """Fetch the JSON data.

    :return: the data in JSON format
    """
    # The URL of the second.json request
    url = 'https://box.maoyan.com/promovie/api/box/second.json'
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # The response is JSON, so return it parsed for later extraction
            return response.json()
        else:
            return None
    except requests.RequestException:
        return None


def parse_infor(json):
    """Extract movie box-office data from the JSON data: movie name, release
    info, gross box office, box-office share and cumulative box office.

    :param json: data in JSON format
    :return: yields one movie's data as a dict per loop iteration
    """
    if json:
        # Use get() to walk down the JSON structure layer by layer
        items = json.get('data').get('list')
        for item in items:
            piaofang = {}
            piaofang['Movie name'] = item.get('movieName')
            piaofang['Release Info'] = item.get('releaseInfo')
            piaofang['General box office'] = item.get('boxInfo')
            piaofang['Box office percentage'] = item.get('boxRate')
            piaofang['Cumulative box office'] = item.get('sumBoxInfo')
            # As a generator, hand back one movie's data per iteration
            yield piaofang
    else:
        return None


def save_infor(results):
    """Store the movie box-office data as a formatted HTML file.

    :param results: generator of movie box-office data
    :return: None
    """
    rows = ''
    for piaofang in results:
        # Fill one HTML table row with Python's str.format()
        row = '<tr><td>{}</td><td>{}</td><td>{}</td><td>{}</td><td>{}</td></tr>'.format(
            piaofang['Movie name'],
            piaofang['Release Info'],
            piaofang['General box office'],
            piaofang['Box office percentage'],
            piaofang['Cumulative box office'])
        # Append each formatted row to the accumulated table body
        rows = rows + '\n' + row
    # Assemble the full HTML page by string concatenation
    piaofang_html = '''
    <html lang="en">
    <head><meta charset="utf-8"></head>
    <body>
    <table border="1">
    <tr><th>Movie name</th><th>Release info</th><th>Gross box office</th><th>Box office share</th><th>Cumulative box office</th></tr>''' + rows + '''
    </table>
    </body>
    </html>'''
    # Write the formatted HTML page to disk
    with open('piaofang.html', 'w', encoding='utf-8') as f:
        f.write(piaofang_html)


if __name__ == "__main__":
    # Fetch the data
    json = get_html()
    # Extract the information
    results = parse_infor(json)
    # Store the information
    save_infor(results)
The stored HTML file looks like this:



As you can see, crawling a dynamic page can actually be simpler: the key is to find the right XHR request, and such responses are usually JSON, which is comparatively easy to extract from. Readers may ask why we store the result as an HTML file. Movie fans often open Maoyan every day just to check the box office; instead, you could apply what you have learned here to build a small crawler that fetches the box-office data on a schedule and pushes it to your personal email, saving the daily trip to the website and letting the data come to you. Interested readers can try it themselves. The author has built exactly such a crawler, which scrapes the figures and emails them every day; the daily real-time box-office emails are shown in the figure below:
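As a rough sketch of the email idea: the standard library's email.message can wrap the rendered HTML in a message, and smtplib can send it. The host, credentials, and addresses below are placeholders you would replace, and the actual send step is commented out:

```python
import smtplib  # only needed for the (commented-out) send step
from email.message import EmailMessage

def build_report_mail(html_body):
    """Wrap the rendered box-office HTML table in an email message."""
    msg = EmailMessage()
    msg['Subject'] = 'Daily Maoyan box office report'
    msg['From'] = 'crawler@example.com'   # placeholder sender
    msg['To'] = 'me@example.com'          # placeholder recipient
    msg.set_content('Your mail client does not display HTML.')
    msg.add_alternative(html_body, subtype='html')
    return msg

msg = build_report_mail('<table><tr><td>demo</td></tr></table>')
# To actually send (placeholders, untested):
# with smtplib.SMTP_SSL('smtp.example.com', 465) as server:
#     server.login('crawler@example.com', 'app-password')
#     server.send_message(msg)
```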


