The text and images in this article are taken from the internet and are for learning and exchange only, with no commercial use; all copyrights belong to the original authors. If you have any questions, please contact us promptly so we can handle them.

1. Overview

It has been five years since the launch of Honor of Kings, one of the most popular mobile games in China. Beyond the entertainment it provides, we take its fifth anniversary as an occasion to have some fun with its official website and learn some simple, basic Python crawler operations.

This article introduces a simple Python crawler, covering web page analysis, data requests, data parsing and data saving. It is suitable for websites with no anti-crawling measures and is intended for learning and exchange only; do not use it for any commercial or illegal purpose.

Page analysis means opening the page whose data you need and pressing "F12" to see what the page's source data looks like (it helps if you have some web knowledge; I never studied it systematically, just became familiar through practice);

For data requests we use the popular "requests" library; you can check its documentation for more detailed usage (requests.readthedocs.io/zh_CN/lates…);

Data parsing depends on the format of the requested data. If the data is in HTML format, I will introduce "BS4" and "xpath" for parsing; if the data is in JSON format, I will introduce both "json" and "eval" for parsing;

There are two ways to save the data. For images, I use the "open" and "write" functions; for text, I use "to_excel" in pandas to save it in spreadsheet form.

2. Web page analysis

As mentioned in the overview, the requested data will be in either HTML or JSON format, and the two cases actually correspond to different real request addresses. How do you tell which one applies? As a beginner, my experience is simply to try; this chapter introduces both approaches, and in practice you choose according to the situation.

2.1. HTML page source data

Take the hero list page below as an example: press "F12", then click the mouse-arrow icon in the upper left corner of developer mode and select the data area you need on the left side of the page. The information for that area will then appear in the developer-mode panel, such as the "detail page address", "avatar image address" and "name" here. This is all the information we need, so we can simply request this page's link.

2.2. JSON source data

In this case, we can select "Network -> XHR" in developer mode, refresh the page, and search through the names; in general we can find the data we need there.

Click "Preview" to confirm it is the source data we need, and then under "Headers" you can find the actual link used to request it. In this case the request method is "get", which we will cover in the next section.

3. Data request

As mentioned, we use the requests library for data requests. There are two common request types, POST and GET; here GET is enough. A request can also take many parameters, such as request headers, cookies and so on; see the documentation link in the overview to learn more.

A simple example:

import requests

# Hero list page address
url = 'https://pvp.qq.com/web201605/herolist.shtml'
resp = requests.get(url)
# Set the decoding mode (the requested Chinese data comes back garbled, so it is decoded here)
resp.encoding = resp.apparent_encoding

Requested HTML source data

import requests

# In-game item details page address
url = 'https://pvp.qq.com/web201605/js/item.json'
resp = requests.get(url)
# Set the decoding mode (the requested Chinese data comes back garbled, so it is decoded here)
resp.encoding = resp.apparent_encoding

In-game item JSON data
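As mentioned above, a request can also carry extra parameters such as request headers and cookies. Here is a minimal sketch of how they are passed to requests.get; the header and cookie values below are placeholders for illustration only, not ones this site requires:

import requests

# Placeholder values, only to show where headers and cookies go
headers = {'User-Agent': 'Mozilla/5.0'}
cookies = {'example_cookie': 'example_value'}

url = 'https://pvp.qq.com/web201605/herolist.shtml'
resp = requests.get(url, headers=headers, cookies=cookies)
resp.encoding = resp.apparent_encoding
print(resp.status_code)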

4. Data parsing

Different source data calls for different parsing methods. For HTML data there are two common entry-level options, "BS4" and "xpath"; "JSON" data is actually relatively easy to handle, and here we show two simple ways using "json" and "eval".

4.1. HTML data parsing

4.1.1. BS4

Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with your favorite parser and provides idiomatic ways of navigating, searching and modifying the document.

For more details on its operations, see the documentation (beautifulsoup.readthedocs.io/zh_CN/v4.4…) ~

Looking at the HTML structure, the data we want sits in all the "li" nodes under the "ul" node with "class="herolist clearfix"". With BS4, use the "find_all" method to locate them (see the code comments for details).

from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.text, 'html.parser')
# Locate the ul node with class="herolist clearfix"
ul = soup.find_all('ul', class_="herolist clearfix")[0]
# Get all li nodes under it
lis = ul.find_all('li')
# Create an empty list to store the data
herolists = []
# Loop over all li nodes
for li in lis:
    herolist = {}
    # get_text() returns the node's text, i.e. the hero name
    herolist['hero name'] = li.get_text()
    # get() reads an attribute value
    herolist['hero detail page'] = li.find('a').get('href')
    herolist['hero avatar'] = li.find('a').find('img').get('src')
    herolists.append(herolist)

Data parsing results
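A quick sanity check on the parsed result (the keys are the ones defined in the loop above):

# Print the first record to confirm the structure
print(herolists[0])
# Roughly: {'hero name': '...', 'hero detail page': '...', 'hero avatar': '//game.gtimg.cn/...'}
print(len(herolists))  # number of heroes parsed from the page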

4.1.2. XPath

XPath is a language for finding information in an XML document; it can be used to traverse the document's elements and attributes. For more on the syntax, see (www.w3school.com.cn/xpath/xpath…).

The overall process is essentially the same as with BS4, only the syntax and function calls differ, so I won't introduce it in detail; just read the code first to get a feel for it.

from lxml import etree

html = etree.HTML(resp.text)
# Locate the ul node with class="herolist clearfix"
html_ul = html.xpath('//ul[@class="herolist clearfix"]')[0]
# Get all li nodes under it
html_lis = html_ul.xpath('./li')
herolists = []
for html_li in html_lis:
    herolist = {}
    herolist['hero name'] = html_li.xpath('./a/text()')[0]
    herolist['hero detail page'] = html_li.xpath('./a/@href')[0]
    herolist['hero avatar'] = html_li.xpath('./a/img/@src')[0]
    herolists.append(herolist)

4.2. JSON data parsing

When the requested data is in JSON format, checking its data type directly shows that it is a str, as follows:

We can use the json.loads() and eval methods to convert this into a list in the same format as the parsed HTML data above.

import json

js = resp.text
# json.loads() parses the JSON string into Python objects (a list here)
li = json.loads(js)

JSON data parsing
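As noted above, eval can also turn this string into Python objects. A minimal comparison sketch (the sample string below is made up for illustration); json.loads is the safer choice, since eval will execute whatever expression it is given:

import json

# A made-up string in the same shape as the item data
js = '[{"item_id": 1111, "item_name": "example"}]'
li_json = json.loads(js)   # parse with the json module
li_eval = eval(js)         # eval also works for plain list/dict literals
print(li_json == li_eval)  # True for simple JSON-style data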

5. Data saving

For image data, request the image and write it to a local file; for text data in tabular form, convert it to a DataFrame and save it to an Excel file (using the pandas library).

5.1. Image data storage

Our hero list already contains the hero avatar addresses, so here is how to save the avatar images locally.

for li in herolists:
    # Get the hero avatar address, e.g. '//game.gtimg.cn/images/yxzj/img201606/heroimg/506/506.jpg'
    head_url = li['hero avatar']
    # Prepend https: to build a complete URL
    url = f'https:{head_url}'
    head_data = requests.get(url)
    # Set the full local path for the image file
    head_path = f'save address'  # replace with your own save path
    # Open a new file and write the image data to it
    open(head_path, 'wb').write(head_data.content)

Hero avatar data storage
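If you prefer, here is a hedged variant of the loop above that builds the file name from the avatar URL and makes sure the target folder exists (the 'heads' folder name is just an assumption for illustration):

import os
import requests

save_dir = 'heads'                    # assumed output folder
os.makedirs(save_dir, exist_ok=True)  # create it if it does not exist

for li in herolists:
    head_url = li['hero avatar']
    url = f'https:{head_url}'
    # Use the last part of the URL (e.g. 506.jpg) as the file name
    file_name = head_url.split('/')[-1]
    head_path = os.path.join(save_dir, file_name)
    # 'with' closes the file handle automatically
    with open(head_path, 'wb') as f:
        f.write(requests.get(url).content)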

5.2. Text data form storage

import pandas as pd

# Convert the parsed list into a DataFrame
df = pd.DataFrame(li)

Data preview

# Replace the HTML tags in the description columns
df['des1'] = df['des1'].str.replace('<br>', ', ').str.replace('<p>', '').str.replace('</p>', '')
df['des2'] = df['des2'].str.replace('<br>', ', ').str.replace('<p>', '').str.replace('</p>', '')
# Save as an Excel spreadsheet
df.to_excel(r'save address', index=False, sheet_name='sheet_name')

Item text data form
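As an alternative to chaining several replace calls, a short sketch that keeps the <br>-to-comma behaviour and then strips any remaining HTML tag from the description columns with a single regular-expression replace:

# Remove every other tag in one pass (regex=True enables pattern matching)
for col in ['des1', 'des2']:
    df[col] = df[col].str.replace('<br>', ', ').str.replace(r'<[^>]+>', '', regex=True)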

6. Come and play, too

In the title we mentioned that Honor of Kings has 102 heroes and 326 skins at its fifth anniversary, yet the hero list we scraped from the HTML page contains only 93 entries. How do we get all of them? You can follow the JSON data request approach and look at the patterns in the related data (for example, the changing part of the avatar address is actually the hero ID, and the same goes for the hero detail page). See the sketch below.
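For example, a minimal sketch that requests the hero list as JSON instead of HTML; the herolist.json address here is an assumption modelled on the item.json address above, so verify the real address in the Network -> XHR panel:

import json
import requests

# Assumed address, patterned after item.json; check it in Network -> XHR
url = 'https://pvp.qq.com/web201605/js/herolist.json'
resp = requests.get(url)
resp.encoding = resp.apparent_encoding
heroes = json.loads(resp.text)
print(len(heroes))  # should cover all heroes, not just the 93 on the HTML page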

From there you can see the new heroes and new skins; try crawling the related data yourself.

6.1. Number of new skins for heroes

In five years online, a total of 93 heroes have received new skins; among them, Diao Chan, Mulan and Sun Wukong have the most, as many as 5 each!!

Of the 93 heroes, most only have 1 new skin

6.2. When new skins went online

Looking at the launch month, January is the peak, which is related to the Spring Festival; after all, Spring Festival is also the most profitable month for this product.

By launch year: the game only went live at the end of October 2015, so there were not many new skins at first; during the rapid growth of 2016 and 2017 the team's output did not improve much (perhaps productivity had not caught up?); from 2018 onward, with a bigger team, the number of new skins has been skyrocketing!

6.3. Hero production capacity

When Honor of Kings launched on October 28, 2015, there were 33 heroes; the familiar Arthur, Xiang Yu, Angela and so on were in the first batch. In the 5 years since, 69 new heroes have been added.

By launch month, February, August, November and January have seen the most new heroes. Why?

By launch year, 7 new heroes were released within 2 months of the 2015 launch, which is very fast; after all, a good number had been stockpiled in advance. We can then see that new hero output peaked in 2016 and has trended downward year by year since. Why? "After all, hero design takes a lot of brainpower!!"