Data Science Club

Chinese data scientist community

Author: Suke, author of the column "Python Crawlers and Data Analysis from Scratch"

Blog: www.makcyun.top

Crawlers are the fastest and easiest way to get started with Python: basic crawlers follow a fairly fixed routine built from a handful of common methods, so they are easy to pick up. This article uses the Maoyan ("Cat's Eye") Top 100 movies page, which has a simple structure, as a practice case. The focus is on extracting the key content with four different parsing methods; having several solutions to the same problem broadens your thinking and keeps it flexible with practice.

Covered: the two major parsing libraries (lxml and Beautiful Soup), used with regular expressions, XPath, and CSS selectors.

1. Why crawl this page?

  • Laziness: rather than paging through 100 movie introductions one screen at a time, it is nicer to browse them all in one place (an Excel spreadsheet, for example);

  • To dig into some interesting questions, such as: Which movie has the highest rating? Which actor appears most often on the list? Which country/region has the most films on the list? Which year contributed the most movies? This information is not easy to read directly off the web page, so a crawler is needed.

2. Crawler target

  • Extract each Top 100 movie's name, cover image, ranking, rating, actors, release country/region, and release time from the web page, and save them to a CSV file.
  • Based on the crawled results, carry out a simple visual analysis.

Platform: Windows 7 + Sublime Text 3

3. Crawl steps

3.1. URL analysis

First, open the Maoyan Top 100 URL: http://maoyan.com/board/4?offset=0. The page is very simple and contains all the information listed as the crawler target. Scrolling to the bottom and clicking page 2 changes the URL to http://maoyan.com/board/4?offset=10, which reveals the pattern: offset is the offset and 10 is the page size, i.e. page 1 shows movies 1-10 (offset 0) and page 2 shows movies 11-20 (offset 10). So to get all 100 movies, we only need to construct 10 URLs, fetch each page's content in turn, and then extract what we need with the different methods below. Let's start by fetching the first page with Requests.
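As a quick sketch of that URL rule (the base URL is the same one used in the code below), the 10 page URLs can be generated like this:

```python
# A minimal sketch: build the 10 page URLs from the offset rule described above
base = 'http://maoyan.com/board/4?offset='
urls = [base + str(offset) for offset in range(0, 100, 10)]
# offsets 0, 10, ..., 90 -> pages 1 through 10
```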

3.2. Fetching the first page with Requests

Start by defining a function get_one_page() that fetches a single page, taking the URL as an argument.

```python
import requests
from requests.exceptions import RequestException

def get_one_page(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}
        # The request is blocked without a User-Agent header
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        else:
            return None
    except RequestException:
        # try-except catches request exceptions
        return None
```

Next, set the URL in the main() function.

```python
def main():
    url = 'http://maoyan.com/board/4?offset=0'
    html = get_one_page(url)
    print(html)


if __name__ == '__main__':
    main()
```

Running the above program fetches the source code of the first page, as shown below:

Next, we need to extract several pieces of content from the whole page, using, in turn, the four methods mentioned above; each is explained below.

3.3. Four methods of content parsing and extraction

3.3.1. Regular expression extraction

The first method is regular expression extraction. What is a regular expression? The jumble of symbols below is regular expression syntax:

```python
'<dd>.*?board-index.*?>(\d+)</i>.*?src="(.*?)".*?name"><a.*?>(.*?)</a>.*?'
```

It is a powerful string-processing tool, called a regular expression because it recognizes strings that follow a rule: "give me a string that matches the rule and I return it; if the string doesn't match, I ignore it." The web page fetched with Requests is one big string, which can be processed this way to extract what we want.
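As a toy illustration of that "match the rule, ignore the rest" behavior (the sample string here is made up):

```python
import re

text = 'rank: 1, rank: 2, rating: 9.6'
print(re.findall(r'rank: (\d+)', text))   # ['1', '2'] -- strings matching the rule are returned
print(re.findall(r'year: (\d+)', text))   # []         -- non-matching strings are ignored
```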

If you don’t already know about it, check out the following tutorial:

www.runoob.com/regexp/rege… https://www.w3cschool.cn/regexp/zoxa1pq7.html

Now, start extracting the key content. Right-click the page, choose Inspect, open the Network tab, select the first file on the left, and navigate to the location of the movie information, as shown below:

You can see that all the information for each movie lives in a dd node, so it can be extracted from that node with the re module. The first thing to extract is the movie's ranking, which sits inside the i node with class="board-index". For content that does not need to be captured, use '.*?' as a placeholder; the digits we want are wrapped in parentheses and written as (\d+). The regular expression can be written as:

```python
'<dd>.*?board-index.*?>(\d+)</i>'
```

The URL of the image sits in the src attribute of the img node, so the regular expression can be written as:

```python
'src="(.*?)".*?'
```

The markup between the first and second patterns is not needed, so replace it with '.*?'. The two parts joined together become:

```python
'<dd>.*?board-index.*?>(\d+)</i>.*?src="(.*?)"'
```

Similarly, the main actors, release time, and rating can be added in sequence. The complete regular expression is:

```python
'<dd>.*?board-index.*?>(\d+)</i>.*?src="(.*?)".*?name"><a.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>'
```

Once the regular expression is written, define a page-parsing method, parse_one_page(), to extract the content:

```python
import re

def parse_one_page(html):
    pattern = re.compile(
        '<dd>.*?board-index.*?>(\d+)</i>.*?src="(.*?)".*?name"><a.*?>(.*?)</a>'
        '.*?star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>'
        '.*?fraction">(.*?)</i>.*?</dd>', re.S)
    # re.S lets '.' match any character; without it '.' cannot match newlines
    items = re.findall(pattern, html)
    # print(items)
    for item in items:
        yield {
            'index': item[0],
            'thumb': get_thumb(item[1]),  # get_thumb() further processes the image URL
            'name': item[2],
            'star': item[3].strip()[3:],
            # 'time': item[4].strip()[5:],
            # Split the raw release-time string into a time part and an area part
            'time': get_release_time(item[4].strip()[5:]),
            'area': get_release_area(item[4].strip()[5:]),
            'score': item[5].strip() + item[6].strip()
            # The score is stored in two parts: integer + fraction
        }
```

Tips: re.S makes '.' match any character including newlines; without it, newlines break the match. yield turns the function into a generator, so you can iterate over the results and collect each movie into a tidy dictionary; see blog.csdn.net/zhangpingha… for details. .strip() strips leading and trailing whitespace from a string.
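A two-line experiment (on a made-up snippet) shows why re.S matters here:

```python
import re

snippet = '<dd>\n<i>1</i>\n</dd>'  # hypothetical HTML spanning several lines
print(re.findall('<dd>.*?<i>(.*?)</i>', snippet))        # [] -- '.' stops at newlines
print(re.findall('<dd>.*?<i>(.*?)</i>', snippet, re.S))  # ['1'] -- re.S lets '.' cross them
```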

get_thumb(), get_release_time(), and get_release_area() are three helper methods:

```python
# Get the full-size cover image
def get_thumb(url):
    pattern = re.compile(r'(.*?)@.*?')
    thumb = re.search(pattern, url)
    return thumb.group(1)
# e.g. http://p0.meituan.net/movie/5420be40e3b755ffe04779b9b199e935256906.jpg@160w_220h_1e_1c
# removing the @160w_220h_1e_1c suffix yields the larger image


# Extract the release time
def get_release_time(data):
    pattern = re.compile(r'(.*?)(\(|$)')
    items = re.search(pattern, data)
    if items is None:
        return 'unknown'
    return items.group(1)  # group(1) is the first parenthesized match (.*?), i.e. the time


# Extract the country/region
def get_release_area(data):
    pattern = re.compile(r'.*\((.*)\)')
    # In get_release_time, (\(|$) matches either a '(' or the end of the string,
    # so the pattern works whether or not an area in parentheses follows the date
    items = re.search(pattern, data)
    if items is None:
        return 'unknown'
    return items.group(1)
```

Tips: the r prefix marks the pattern as a raw string, so backslashes are not reinterpreted as escapes; it is good practice to always prefix regex patterns with r. In a regex, | means "or" and $ matches the end of the string. .group(1) returns the contents of the first parenthesized group of the match, i.e. (.*?); for example, group() returns the whole match '2013-12-18(' while group(1) returns just '2013-12-18'.
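The group() behavior can be checked directly on a release-time string of the form used by this page:

```python
import re

data = '1993-01-01(Hong Kong, China)'  # same format as the scraped releasetime text
m = re.search(r'(.*?)(\(|$)', data)
print(m.group())   # '1993-01-01(' -- the whole match
print(m.group(1))  # '1993-01-01'  -- only the first parenthesized group
```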

Next, modify the main() function to print the contents of the crawl:

```python
def main():
    url = 'http://maoyan.com/board/4?offset=0'
    html = get_one_page(url)

    for item in parse_one_page(html):
        print(item)


if __name__ == '__main__':
    main()
```

Tips: the code under if __name__ == '__main__': runs when the .py file is executed directly, but not when the file is imported as a module. Reference: blog.csdn.net/yjk13703623…
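A minimal sketch of that behavior (foo.py is a hypothetical file name):

```python
# foo.py -- hypothetical file name for illustration
def main():
    print('running as a script')

if __name__ == '__main__':
    main()  # executes on `python foo.py`, but not on `import foo`
```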

If you run the program, you can successfully extract the desired content with the following result:

```python
{'index': '1', 'thumb': 'http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg', 'name': 'Farewell My Concubine', 'star': 'Leslie Cheung, Zhang Fengyi, Gong Li', 'time': '1993-01-01', 'area': 'Hong Kong, China', 'score': '9.6'}
{'index': '2', 'thumb': 'http://p0.meituan.net/movie/54617769d96807e4d81804284ffe2a27239007.jpg', 'name': 'Roman Holiday', 'star': 'Gregory Peck, Audrey Hepburn, Eddie Albert', 'time': '1953-09-02', 'area': 'the United States', 'score': '9.1'}
{'index': '3', 'thumb': 'http://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg', 'name': 'The Shawshank Redemption', 'star': 'Tim Robbins, Morgan Freeman, Bob Gunton', 'time': '1994-10-14', 'area': 'the United States', 'score': '9.5'}
{'index': '4', 'thumb': 'http://p0.meituan.net/movie/e55ec5d18ccc83ba7db68caae54f165f95924.jpg', 'name': 'Léon: The Professional', 'star': 'Jean Reno, Gary Oldman, Natalie Portman', 'time': '1994-09-14', 'area': 'France', 'score': '9.5'}
{'index': '5', 'thumb': 'http://p1.meituan.net/movie/f5a924f362f050881f2b8f82e852747c118515.jpg', 'name': 'The Godfather', 'star': 'Marlon Brando, Al Pacino, James Caan', 'time': '1972-03-24', 'area': 'the United States', 'score': '9.3'}
...
[Finished in 1.9s]
```

That's the first extraction method. If you're not comfortable with the complex syntax of regular expressions, try the second method below.

3.3.2. lxml and XPath extraction

This approach uses the parsing library lxml together with XPath syntax and its path-selection expressions to extract the desired content efficiently. lxml is a third-party package that must be installed separately (pip install lxml). If you are not familiar with XPath syntax, see the following tutorial: www.w3school.com.cn/xpath/xpath…

```html
</div>

<div class="container" id="app" class="page-board/index">

<div class="content">
    <div class="wrapper">
        <div class="main">
            <p class="update-time">2018-08-18<span class="has-fresh-text">updated</span></p>
            <p class="board-content">List rules: the top 100 classic films from the Maoyan film library, ranked by a combination of rating and number of viewers, updated daily at 10 a.m. Data comes from the Maoyan movie library.</p>
            <dl class="board-wrapper">
                <dd>
                    <i class="board-index board-index-1">1</i>
                    <a href="/films/1203" title="Farewell My Concubine" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">
                        <img src="//ms0.meituan.net/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
                        <img src="http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c" alt="Farewell My Concubine" class="board-img" />
                    </a>
                    <div class="board-item-main">
                        <div class="board-item-content">
                            <div class="movie-item-info">
                                <p class="name"><a href="/films/1203" title="Farewell My Concubine" data-act="boarditem-click" data-val="{movieId:1203}">Farewell My Concubine</a></p>
                                <p class="star">Starring: Leslie Cheung, Zhang Fengyi, Gong Li</p>
                                <p class="releasetime">Release time: 1993-01-01(Hong Kong, China)</p>
                            </div>
                            <div class="movie-item-number score-num">
                                <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
                            </div>
                        </div>
                    </div>
                </dd>
                <dd>
```

Based on the HTML above, there are two ways to write the XPath for the first movie's ranking. First: copy it directly from the browser, which gives //*[@id="app"]/div/div/div[1]/dl/dd[1]/i. To select every movie on the page instead of just the first, remove the [1] after dd and append /text() to get the text: //*[@id="app"]/div/div/div[1]/dl/dd/i/text(), which can be shortened to //*[@id="app"]//div//dd/i/text().

Second: write it yourself by observing the page structure. Notice the div node with id="app" first: since an id is unique within a page, that div makes a safe starting point for the XPath. Looking down, the three nested div levels can be collapsed to //div, followed by two parallel p nodes, the dl node, the dd nodes, and finally the text of the i node. Intermediate nodes can be omitted as long as the path still selects the unique text value '1'; for example, the p and dl nodes can be dropped. The full path then becomes //*[@id="app"]//div//dd/i/text().

Along the same lines, you can write XPath paths for the other fields. Each one starts with the same prefix //*[@id="app"]//div//dd, so to simplify the code we assign that shared path to a variable, items. The extraction code is as follows:

```python
from lxml import etree

# Method 2: extract content using lxml with XPath
def parse_one_page2(html):
    parse = etree.HTML(html)
    items = parse.xpath('//*[@id="app"]//div//dd')
    # Copied full path: //*[@id="app"]/div/div/div[1]/dl/dd
    # * matches all nodes, @ selects attributes
    # The first movie is dd[1]; drop the [1] to select every movie on the page
    for item in items:
        yield {
            'index': item.xpath('./i/text()')[0],
            # The leading dot in ./i/text() means the path starts from the current dd node
            # /text() extracts the node's text
            'thumb': get_thumb(str(item.xpath('./a/img[2]/@src')[0].strip())),
            # Locate 'thumb' in the Network tab, not Elements; using the Elements
            # version of the attribute raises a "list index out of range" error
            'name': item.xpath('./a/@title')[0],
            'star': item.xpath('.//p[@class = "star"]/text()')[0].strip(),
            'time': get_release_time(item.xpath(
                './/p[@class = "releasetime"]/text()')[0].strip()[5:]),
            'area': get_release_area(item.xpath(
                './/p[@class = "releasetime"]/text()')[0].strip()[5:]),
            'score': item.xpath('.//p[@class = "score"]/i[1]/text()')[0] +
                     item.xpath('.//p[@class = "score"]/i[2]/text()')[0]
        }
```

Tips: [0] is appended to each xpath() call because it returns a list containing a single string; indexing with [0] unwraps the list into the string itself. Network tab: locate the nodes in the Network tab's raw response, not in Elements, or you won't be able to extract the content. Class attribute: p[@class = "star"]/text() selects the text of the p node whose class attribute is "star". img[2]/@src extracts the src attribute value of the img node; no '/text()' is needed for attributes.

When you run the program, you successfully extract what you want, with the same results as the first method.

That’s the second extraction method, but if you’re not comfortable with xpath syntax, try the third one below.

3.3.3. Beautiful Soup + CSS selector

Beautiful Soup, like lxml, is a very powerful Python parsing library that extracts content efficiently from HTML or XML files. For its usage, see the following tutorial: www.crummy.com/software/Be…

A CSS selector is a pattern for selecting the elements to be styled; the same syntax can be used to quickly locate the desired nodes and extract their content. For usage, see the following tutorial: www.w3school.com.cn/cssref/css_…
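A minimal sketch of the selector patterns used below, run on a toy fragment of the page's markup:

```python
from bs4 import BeautifulSoup

html = '<dd><i class="board-index">1</i><p class="name"><a>Farewell My Concubine</a></p></dd>'
soup = BeautifulSoup(html, 'lxml')
print(soup.select('dd i.board-index')[0].string)  # tag plus class selector -> '1'
print(soup.select('.name a')[0].string)           # class then descendant -> 'Farewell My Concubine'
```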

The following uses this method for extraction:

```python
from bs4 import BeautifulSoup

# Method 3: extract with Beautiful Soup + CSS selectors
def parse_one_page3(html):
    soup = BeautifulSoup(html, 'lxml')
    items = range(10)
    for item in items:
        yield {
            'index': soup.select('dd i.board-index')[item].string,
            # matches <i class="board-index board-index-1">
            'thumb': get_thumb(soup.select('a > img.board-img')[item]["src"]),
            # In the Elements tab the attribute is data-src; in the Network response it is src
            'name': soup.select('.name a')[item].string,
            'star': soup.select('.star')[item].string.strip()[3:],
            'time': get_release_time(soup.select('.releasetime')[item].string.strip()[5:]),
            'area': get_release_area(soup.select('.releasetime')[item].string.strip()[5:]),
            'score': soup.select('.integer')[item].string + soup.select('.fraction')[item].string
        }
```

Run the above program, the result is the same as the first method.

3.3.4. Beautiful Soup + find_all function extraction

Beautiful Soup can also skip the CSS selector and extract directly with its own find_all function. find_all, as the name implies, queries all elements that meet given criteria; you can pass it tag names, attributes, or text to get the matching elements. Its API is as follows:

```python
find_all(name, attrs, recursive, text, **kwargs)
```

Common usage: soup.find_all(name='ul') searches for all ul nodes, and calls can be nested to search within results; li.string and li.get_text() both get the text of an li node, but the latter is recommended; soup.find_all(attrs={'id': 'list-1'}) is equivalent to soup.find_all(id='list-1'); since class is a Python keyword, match it with soup.find_all(class_='element').
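These rules can be verified on a toy snippet (the markup below is made up to match the examples):

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))                       # all ul nodes
print(soup.find_all(attrs={'id': 'list-1'}))          # same result as soup.find_all(id='list-1')
print([li.get_text() for li in soup.find_all(class_='element')])  # ['Foo', 'Bar']
```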

According to the above common syntax, you can extract the required content of the web page:

```python
# Method 4: extract with Beautiful Soup + find_all
def parse_one_page4(html):
    soup = BeautifulSoup(html, 'lxml')
    items = range(10)
    for item in items:
        yield {
            'index': soup.find_all(class_='board-index')[item].string,
            'thumb': soup.find_all(class_='board-img')[item].attrs['src'],
            # attrs['src'] is equivalent to get('src')
            'name': soup.find_all(name='p', attrs={'class': 'name'})[item].string,
            'star': soup.find_all(name='p', attrs={'class': 'star'})[item].string.strip()[3:],
            'time': get_release_time(soup.find_all(class_='releasetime')[item].string.strip()[5:]),
            # Note: 'area' must call get_release_area, not get_release_time
            'area': get_release_area(soup.find_all(class_='releasetime')[item].string.strip()[5:]),
            'score': soup.find_all(name='i', attrs={'class': 'integer'})[item].string.strip() +
                     soup.find_all(name='i', attrs={'class': 'fraction'})[item].string.strip()
        }
```

So those are the four different methods of content extraction.

3.4. Data storage

The output above is in dictionary format; dictionary data can be written to a CSV file with the DictWriter class of the csv package.

```python
import csv

# Store the data in a CSV file
def write_to_csv(item):
    with open('maoyan_top100.csv', 'a', encoding='utf_8_sig', newline='') as f:
        # 'a' opens the file in append mode
        # utf_8_sig keeps the exported CSV from showing garbled characters in Excel
        fieldnames = ['index', 'thumb', 'name', 'star', 'time', 'area', 'score']
        w = csv.DictWriter(f, fieldnames=fieldnames)
        # w.writeheader()
        w.writerow(item)
```
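Note that writeheader() is commented out and the file is opened in append mode, so repeated runs append rows without a header. One possible refinement (an assumption about the desired behavior, not part of the original code) is to write the header only when the file does not yet exist:

```python
import csv
import os

def write_header_once(filename, fieldnames):
    # Hypothetical helper: create the file with a header row on the first run only
    if not os.path.exists(filename):
        with open(filename, 'w', encoding='utf_8_sig', newline='') as f:
            csv.DictWriter(f, fieldnames=fieldnames).writeheader()
```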

Then modify the main() method:

```python
def main():
    url = 'http://maoyan.com/board/4?offset=0'
    html = get_one_page(url)

    for item in parse_one_page(html):
        # print(item)
        write_to_csv(item)


if __name__ == '__main__':
    main()
```

The result is shown below:

Then download the cover images:

```python
# Download a cover image
def download_thumb(name, url, num):
    try:
        response = requests.get(url)
        # The mode must be 'wb', not 'w', because the image content is binary
        with open('cover_images/' + name + '.jpg', 'wb') as f:
            f.write(response.content)
            print('Downloaded cover %s' % num)
            print('------')
    except RequestException as e:
        print(e)
        pass
```
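A hedged usage sketch: the folder creation and the loop below are illustrative additions, not from the original code, and assume html holds a page fetched with get_one_page():

```python
import os

os.makedirs('cover_images', exist_ok=True)  # make sure the target folder exists
for num, item in enumerate(parse_one_page(html), start=1):
    download_thumb(item['name'], item['thumb'], num)
```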

3.5. Paging crawl

The extraction of one page of movie data is now complete; next come the remaining 9 pages, 90 movies in total. To traverse the URLs, pass an offset parameter into the URL, modified as follows:

```python
def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)

    for item in parse_one_page(html):
        # print(item)
        write_to_csv(item)


if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)
```
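Since this fires 10 requests in quick succession, a short pause between pages is a common courtesy (the sleep below is an optional addition, not part of the original code):

```python
import time

if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)  # brief pause so the site isn't hammered
```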

This completes all the movie crawls. The results are as follows:

4. Visual analysis

As the saying goes, "text is not as good as a table, and a table is not as good as a chart." Below is a simple visual analysis of the crawled data, presented as charts.

4.1. Top 10 movie ratings

First, which are the 10 highest-rated movies on the list?

The procedure is as follows:

```python
import pandas as pd
import matplotlib.pyplot as plt
import pylab as pl  # used to rotate the x-axis tick labels

plt.style.use('ggplot')   # the default style is plain; ggplot looks nicer
fig = plt.figure(figsize=(8, 5))   # set the figure size
colors1 = '#6D6D6D'  # color for the chart title and text labels

columns = ['index', 'thumb', 'name', 'star', 'time', 'area', 'score']  # table header
df = pd.read_csv('maoyan_top100.csv', encoding='utf-8', header=None, names=columns, index_col='index')
# index_col='index' sets the 'index' column as the DataFrame index

df_score = df.sort_values('score', ascending=False)  # sort by score, descending

name1 = df_score.name[:10]      # x values
score1 = df_score.score[:10]    # y values
plt.bar(range(10), score1, tick_label=name1)  # range() keeps the x-axis in ranked order
plt.ylim((9, 9.8))  # set the y-axis range
plt.title('Top 10 movie ratings', color=colors1)
plt.xlabel('Movie title')
plt.ylabel('Score')

# Add a numeric label above each bar
for x, y in enumerate(list(score1)):
    plt.text(x, y + 0.01, '%s' % round(y, 1), ha='center', color=colors1)

pl.xticks(rotation=270)   # the names are too long and overlap; rotate them to vertical
plt.tight_layout()    # adjust margins so all x-axis labels are visible
# plt.savefig('top10_movie_ratings.png')  # save the figure
plt.show()
```

The result is shown below:

"Farewell My Concubine" and "A Chinese Odyssey" top the list, along with "The Shawshank Redemption" and "The Godfather". Almost all of them are familiar.

4.2. Comparison of the number of films by country

Then, which countries do these 100 movies come from? The procedure is as follows:

```python
area_count = df.groupby(by='area').area.count().sort_values(ascending=False)

# Plotting, method 1
area_count.plot.bar(color='#4652B1')  # set a blue-violet color
pl.xticks(rotation=0)   # keep the x-axis labels horizontal


# Plotting, method 2
# plt.bar(range(11), area_count.values, tick_label=area_count.index)

for x, y in enumerate(list(area_count.values)):
    plt.text(x, y + 0.5, '%s' % round(y, 1), ha='center', color=colors1)
plt.title('Number of films by country/region', color=colors1)
plt.xlabel('Country/region')
plt.ylabel('Count (films)')
plt.show()
# plt.savefig('films_by_country.png')
```

The result is shown below:

Setting aside films for which the site lists no country, just 10 countries/regions have "contracted" the whole list. The U.S. tops it with 30 films, followed by Japan with 8 and South Korea with 7.

I have to say: Hong Kong has 5, and mainland China none…

4.3. Years with the most films on the list

Next, let's see which years contributed the most movies across the span covered by these 100 films.

```python
# Extract the year from the release date (dates look like '1993-01-01', so split on '-')
df['year'] = df['time'].map(lambda x: x.split('-')[0])
# print(df.info())
# print(df.head())

# Count the number of movies released in each year
grouped_year = df.groupby('year')
grouped_year_amount = grouped_year.year.count()
top_year = grouped_year_amount.sort_values(ascending=False)


# Plot
top_year.plot(kind='bar', color='orangered')  # set the color to salmon red
for x, y in enumerate(list(top_year.values)):
    plt.text(x, y + 0.1, '%s' % round(y, 1), ha='center', color=colors1)
plt.title('Number of films by year', color=colors1)
plt.xlabel('Year')
plt.ylabel('Count (films)')

plt.tight_layout()
# plt.savefig('films_by_year.png')

plt.show()
```

The result is shown below:

You can see the 100 movies span 37 different years. 2011 has the most films on the list with 9, followed by 2010 with 7. Those were my first two years of college; yet apart from Avatar, I can't remember much else from that time.

On the other hand, 1994, dubbed the "miracle year in movie history" on the Internet, only ranks sixth here, which makes me further question the authority of the Maoyan list. Looking further back, movies from 1939 and 1940 also made the list, from the era of black-and-white film. It seems a movie's reputation has no absolute relationship with the technology of its time: quality is king.

4.3.1. Actor with the most film credits

Finally, see which actors have the most credits in the top 100 films. The procedure is as follows:

```python
# The actors sit in one column, comma-separated; split them into a flat list
starlist = []
star_total = df.star
for i in df.star.str.replace(' ', '').str.split(','):
    starlist.extend(i)
# print(starlist)
# print(len(starlist))

# Use a set to deduplicate actor names
starall = set(starlist)
# print(starall)
# print(len(starall))

starall2 = {}
for i in starall:
    if starlist.count(i) > 1:
        # Keep only actors with more than one film on the list
        starall2[i] = starlist.count(i)

starall2 = sorted(starall2.items(), key=lambda starlist: starlist[1], reverse=True)

starall2 = dict(starall2[:10])  # convert the sorted list of tuples back into a dict

# Plot
x_star = list(starall2.keys())      # x values
y_star = list(starall2.values())    # y values

plt.bar(range(10), y_star, tick_label=x_star)
pl.xticks(rotation=270)
for x, y in enumerate(y_star):
    plt.text(x, y + 0.1, '%s' % round(y, 1), ha='center', color=colors1)

plt.title('Number of films per actor', color=colors1)
plt.xlabel('Actor')
plt.ylabel('Count (films)')
plt.tight_layout()
plt.show()
# plt.savefig('films_per_actor.png')
```
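As an aside, the same tally can be written more concisely with collections.Counter (an alternative sketch, not the original approach):

```python
from collections import Counter

starlist = []
for names in df.star.str.replace(' ', '').str.split(','):
    starlist.extend(names)
top10 = Counter(starlist).most_common(10)  # [(actor, count), ...], sorted by count desc
```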

The result is shown below:

Leslie Cheung ranks first, which I hadn't expected. Tony Leung Chiu-wai and Stephen Chow follow, then Brad Pitt. Surprisingly, six of the top 10 are Hong Kong stars; it makes me half suspect this is a Hong Kong edition of the Top 100…

Curious, I checked which seven films put Leslie Cheung at the top with such a clear lead:

```python
df['star1'] = df['star'].map(lambda x: x.split(',')[0])  # first-billed actor
df['star2'] = df['star'].map(lambda x: x.split(',')[1])  # second-billed actor
star_most = df[(df.star1 == 'Leslie Cheung') | (df.star2 == 'Leslie Cheung')][['star', 'name']].reset_index('index')
# | combines the two conditions with OR; then reset the index
print(star_most)
```

"Farewell My Concubine" ranks No. 1, "Happy Together" No. 17, and "The Eagle Shooting Heroes" No. 27. It struck me that I had only seen "Max Payne"… His other works are worth watching when there's time.

```
  index                                              star                       name
0      1              Leslie Cheung, Zhang Fengyi, Gong Li      Farewell My Concubine
1     17    Leslie Cheung, Tony Leung Chiu-wai, Chang Chen             Happy Together
2     27  Leslie Cheung, Tony Leung Chiu-wai, Jacky Cheung  The Eagle Shooting Heroes
3     37    Leslie Cheung, Tony Leung Chiu-wai, Carina Lau              Ashes of Time
4     70                   Leslie Cheung, Joey Wong, Wu Ma      A Chinese Ghost Story
5     99            Leslie Cheung, Maggie Cheung, Andy Lau         Days of Being Wild
6    100              Ti Lung, Leslie Cheung, Chow Yun-fat          A Better Tomorrow
```

Given the limited amount of data, only the brief analysis above is given.

4.4. Complete code

Due to limited space, follow the official account "Data Science Club" and reply "Maoyan" to get the download address for this article's complete code.
