
The following article comes from Tencent Cloud. Author: Wu Yu Beichen


This article is a small exercise after learning the basics of Python. It crawls the Honor of Kings streamer (anchor) ranking from Panda TV's web page and demonstrates how a crawler works without the help of a third-party framework.

First, the idea behind implementing a Python crawler

Step 1: Define your purpose

1. Find the page whose data you want to crawl.
2. View the page's source code and locate the key values to extract.

Step 2: Simulate the HTTP request, extract the data, and process it

1. Simulate an HTTP request: send the request to the server and obtain the HTML it returns.
2. Use regular expressions to extract the data we need from the HTML (here, the streamer's name and popularity).
3. Process the extracted data and display it in a form that is easy to read.
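The three steps above can be sketched roughly as follows. This is a minimal illustration, not the article's crawler: the function names, the made-up HTML, and the `<li><b>…</b><em>…</em></li>` markup are all assumptions chosen so the sketch runs without a network connection.

```python
import re
from urllib import request


def fetch_html(url, timeout=10):
    """Step 1: send an HTTP GET request and decode the body as UTF-8."""
    with request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def extract_pairs(html):
    """Step 2: use a regular expression to pull out (name, popularity) pairs."""
    return re.findall(r'<li><b>([\s\S]*?)</b><em>([\s\S]*?)</em></li>', html)


def to_rows(pairs):
    """Step 3: tidy the raw strings into lines that are easy to read."""
    return [f"{name.strip()}: {popularity.strip()}" for name, popularity in pairs]


# Made-up HTML standing in for a downloaded page, so the sketch runs offline.
# With a real page this would be: html = fetch_html(page_url)
html = "<ul><li><b> host_a </b><em>120</em></li><li><b>host_b</b><em> 45 </em></li></ul>"

rows = to_rows(extract_pairs(html))
for row in rows:
    print(row)
```

Keeping each step in its own function makes it possible to try the extraction and formatting on a saved HTML string before sending any real request.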

Second, view the page source and locate the key values

First, find the page we want to process, namely the Honor of Kings page on Panda TV, then look at its source code to see where the data we care about sits. Below is a screenshot of the page:



[Screenshot: the web page]

Next, view the HTML source of the current page in the browser. How to do this differs between browsers, so search online if you are unsure. This time we need the name of each streamer and the number of video views. In the source code below, the location of these key values is easy to find, as shown in the figure:



[Screenshot: the HTML source code]
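To see how such key values can be picked out of the source, here is a small sketch run against a hypothetical fragment. Only the `video-info` and `video-number` class names come from the regular expressions in this article's code; the rest of the markup, the host names, and the numbers are assumed for illustration. It also shows why the patterns use `[\s\S]*?` rather than `.*?` or a greedy `*`.

```python
import re

# Hypothetical fragment: only the "video-info" and "video-number" class names
# come from the article; the rest of the markup is assumed for illustration.
SAMPLE_HTML = """
<div class="video-info">
    <span class="video-nickname"><i class="icon-hostlevel"></i>
        host_one
    </span>
    <span class="video-number">1203</span>
</div>
<div class="video-info">
    <span class="video-nickname"><i class="icon-hostlevel"></i>
        host_two
    </span>
    <span class="video-number">864</span>
</div>
"""

# "." does not match newlines by default, so a block that spans several
# lines is missed; [\s\S] matches any character, newlines included.
dot_blocks = re.findall(r'<div class="video-info">(.*?)</div>', SAMPLE_HTML)
any_blocks = re.findall(r'<div class="video-info">([\s\S]*?)</div>', SAMPLE_HTML)

# Without the "?" the match is greedy and runs to the last </div>,
# merging both entries into a single block.
greedy_blocks = re.findall(r'<div class="video-info">([\s\S]*)</div>', SAMPLE_HTML)

print(len(dot_blocks), len(any_blocks), len(greedy_blocks))

# Pull the name and the number out of each block found by the lazy pattern.
results = []
for block in any_blocks:
    name = re.findall(r'</i>([\s\S]*?)</span>', block)[0].strip()
    number = re.findall(r'<span class="video-number">([\s\S]*?)</span>', block)[0]
    results.append((name, number))
print(results)
```

Matching the enclosing block first and then searching inside it keeps each name paired with its own number, which is harder to guarantee when both are searched for across the whole page.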

Third, implementing the Python crawler in practice

Here we create a crawler class, Spider, use several regular expressions to pull the data out of the HTML tags, then rearrange it and print it. The code is as follows:

    from urllib import request  # urllib: send the network request and fetch the data
    import re                   # re: extract information with regular expressions


    class Spider():
        # URL of the page to crawl
        url = "https://www.panda.tv/cate/kingglory"

        # Regex: the div block that holds one video's information
        reString_div = r'<div class="video-info">([\s\S]*?)</div>'
        # Regex: the anchor (streamer) name
        reString_name = r'</i>([\s\S]*?)</span>'
        # Regex: the number of video views
        reString_number = r'<span class="video-number">([\s\S]*?)</span>'

        def __fetch_content(self):
            '''Request the page and return its HTML as a string.'''
            r = request.urlopen(Spider.url)
            data = r.read()
            htmlString = str(data, encoding="utf-8")
            return htmlString

        def __alalysis(self, htmlString):
            '''Get the data preliminarily using the regexes: one dict per anchor.'''
            videoInfos = re.findall(Spider.reString_div, htmlString)
            anchors = []
            #print(videoInfos[0])
            for html in videoInfos:
                name = re.findall(Spider.reString_name, html)
                number = re.findall(Spider.reString_number, html)
                anchor = {"name": name, "number": number}
                anchors.append(anchor)
            #print(anchors[0])
            return anchors

        def __refine(self, anchors):
            '''Refine the data further: take the first match and strip whitespace.'''
            f = lambda anchor: {"name": anchor["name"][0].strip(), "number": anchor["number"][0]}
            newAnchors = list(map(f, anchors))
            #print(newAnchors)
            return newAnchors

        def __sort(self, anchors):
            '''Data analysis: sort by number of views, largest first.'''
            anchors = sorted(anchors, key=self.__sort_seed, reverse=True)
            return anchors

        def __sort_seed(self, anchor):
            '''Sort key: convert the view-count string into a number.'''
            # Match the leading number, including a decimal part such as "1.2"
            list_nums = re.findall(r'\d+\.?\d*', anchor["number"])
            number = float(list_nums[0])
            if '万' in anchor["number"]:  # 万 means ten thousand
                number = number * 10000
            return number

        def __show(self, anchors):
            '''Display the data: print the results that have been sorted.'''
            for rank in range(0, len(anchors)):
                print("No. " + str(rank + 1) + ": " + anchors[rank]["number"] + "\t" + anchors[rank]["name"])

        def startRun(self):
            '''Program entry point: run the crawler.'''
            htmlString = self.__fetch_content()
            anchors = self.__alalysis(htmlString)
            anchors = self.__refine(anchors)
            anchors = self.__sort(anchors)
            self.__show(anchors)


    # Create the crawler object and start crawling
    spider = Spider()
    spider.startRun()

Running it, we should see output like the following:



[Screenshot: output of running the crawler]
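The sorting rule can also be tried on its own, without fetching any page. The sketch below is an illustration under assumptions, not the article's exact method: the helper name `popularity_to_number` and the sample records are made up, and it assumes the site abbreviates ten thousand with the character 万, which is what multiplying by 10000 suggests.

```python
import re


def popularity_to_number(text):
    """Convert a view-count string such as '864' or '1.5万' into a float."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        return 0.0  # no digits found; treat as zero so sorting still works
    number = float(match.group())
    if '万' in text:  # 万 means ten thousand (assumed site convention)
        number *= 10000
    return number


# Made-up sample records in the shape produced by the refine step.
anchors = [
    {"name": "host_a", "number": "864"},
    {"name": "host_b", "number": "1.5万"},
    {"name": "host_c", "number": "2万"},
]

# Sort from the largest count to the smallest, then print the ranking.
ranked = sorted(anchors, key=lambda a: popularity_to_number(a["number"]), reverse=True)
for rank, anchor in enumerate(ranked, start=1):
    print(f"No. {rank}: {anchor['number']}\t{anchor['name']}")
```

Sorting on the converted number rather than on the raw string matters here, because as plain text "864" would sort above "2万".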