0x1 Tools Required

To do a good job, you must first sharpen your tools, and the foundational tool for corpus crawling is Python.

We develop with Python 3, using the following modules: requests, lxml, and json.

This section briefly describes what each module does.

01 | requests

Requests is a third-party Python library that makes working with URL resources particularly convenient. Its official documentation bears the tagline "HTTP for Humans." In my experience, Requests is an order of magnitude nicer to use than Python's built-in urllib.

Let’s do a quick comparison:

urllib:

import urllib.parse
import urllib.request

URL_GET = "https://api.douban.com/v2/event/list"
# Build request parameters
params = urllib.parse.urlencode({'loc': '108288', 'day_type': 'weekend', 'type': 'exhibition'})

# Send the request
response = urllib.request.urlopen('?'.join([URL_GET, params]))
# Response headers
print(response.info())
# Response code
print(response.getcode())
# Response body
print(response.read())

Requests:

import requests

URL_GET = "https://api.douban.com/v2/event/list"
# Build request parameters
params = {'loc': '108288', 'day_type': 'weekend', 'type': 'exhibition'}

# Send the request
response = requests.get(URL_GET, params=params)
# Response headers
print(response.headers)
# Response code
print(response.status_code)
# Response body
print(response.text)

We can see that there are some differences between the two libraries:

  1. Parameter construction: urllib requires you to urlencode the parameters yourself, which is a bit of a hassle; Requests needs no extra encoding step and is concise.

  2. Request sending: with urllib you have to assemble the URL and the encoded parameters into the required form yourself; with Requests, get() takes the URL and the parameters directly, which is much simpler.

  3. Connection handling: urllib sends "Connection": "close", so the socket is torn down at the end of every request, whereas Requests uses urllib3 under the hood to reuse the socket and sends "Connection": "keep-alive"; serving multiple requests over one connection consumes fewer resources (a short Session sketch follows after this comparison).

  4. The Accept-Encoding header sent by Requests is more complete; we won't go into that here.

In summary, Requests is simpler and more readable, which makes development considerably easier.
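
To make point 3 concrete, here is a minimal sketch (reusing the Douban example above) of how a requests.Session keeps one connection alive across several requests:

import requests

# A Session keeps the underlying connection pool (and its keep-alive sockets)
# open across calls, so repeated requests to the same host reuse one socket.
URL_GET = "https://api.douban.com/v2/event/list"
params = {'loc': '108288', 'day_type': 'weekend', 'type': 'exhibition'}

with requests.Session() as session:
    for _ in range(3):
        response = session.get(URL_GET, params=params)
        print(response.status_code, response.headers.get('Connection'))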

02 | LXML

BeautifulSoup is a library, XPath is a query technique, and the most commonly used XPath library in Python is lxml.

Once Requests has returned the page, how do we get the data we want? This is where lxml, a powerful HTML/XML parsing tool, comes in. Python has no shortage of parsing libraries, so why lxml? For comparison, we pick BeautifulSoup, another well-known HTML parsing library.

Let’s do a quick comparison:

BeautifulSoup:

from bs4 import BeautifulSoup  # import the library
# Assume html is the HTML that needs to be parsed

# Passing html into the BeautifulSoup constructor gives you a document object
soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
# Find all h4 tags
links = soup.find_all("h4")

lxml:

from lxml import etree
# Assume html is the HTML that needs to be parsed

# Passing html into etree.HTML gives you a document object
root = etree.HTML(html)
# Find all h4 tags
links = root.xpath("//h4")

We can see that there are some differences between the two libraries:

  1. Parsing HTML: BeautifulSoup is written in a jQuery-like style, with a user-friendly API that supports CSS selectors; lxml's XPath syntax has some learning cost.

  2. Performance: BeautifulSoup is DOM-based; it loads the whole document and parses the entire DOM tree, so its time and memory costs are much higher. BeautifulSoup is also written in Python, while lxml is written in C, so lxml clearly outperforms BeautifulSoup.

To sum up, BeautifulSoup is simpler to get started with; lxml has some learning cost but is on the whole easy to understand and, most importantly, it is written in C and is much faster, so lxml is what we use.
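
For reference, a few common XPath patterns with lxml, shown against a made-up HTML fragment (the markup below is purely illustrative):

from lxml import etree

html = "<div class='list'><h4><a href='/a'>First</a></h4><h4>Second</h4></div>"  # made-up fragment
root = etree.HTML(html)

# All h4 elements
print(root.xpath("//h4"))
# The text of every h4 under a div whose class is 'list'
print(root.xpath("//div[@class='list']//h4//text()"))
# The href attribute of links nested inside h4 tags
print(root.xpath("//h4/a/@href"))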

03 | json

Python ships with a json module, which is sufficient for basic JSON processing. But if you want to be lazier, you can use a third-party JSON library such as demjson or simplejson.

Of the two, simplejson comes out ahead in module import speed, encoding and decoding speed, and compatibility, so if you do want a third-party library, use simplejson.
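
A quick sketch with the built-in json module (simplejson is a drop-in replacement with the same loads/dumps interface):

import json

# Serialize a Python dict to a JSON string
record = {'game': 'League of Legends', 'viewers': 12345}  # made-up example data
text = json.dumps(record, ensure_ascii=False)

# Parse it back into a Python object
parsed = json.loads(text)
print(parsed['game'], parsed['viewers'])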

0x2 Determine the corpus source

With our tools ready, the next step is to decide what to crawl.

Take esports as an example: we will crawl an esports-related corpus. The esports platforms everyone knows are Penguin Esports, Penguin Esports, and Penguin Esports (wink), so we take the live games on Penguin Esports as our data source.

We open the official Penguin Esports site and go to the game list page, where we can see a large number of games. Typing these game names in by hand is obviously poor value for the effort, so we start with our first crawler: scraping the game list.

import requests
from lxml import etree

# Update the game list
def _updateGameList():
    # HTTP request headers, used to masquerade as a browser
    heads = {
        'Connection': 'Keep-Alive',
        'Accept': 'text/html, application/xhtml+xml, */*',
        'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
        'Accept-Encoding': 'gzip, deflate',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko'
    }
    # The game list page we need to crawl
    url = 'https://egame.qq.com/gamelist'

    # Don't verify the TLS certificate; give up after 10 seconds
    res = requests.get(url, headers=heads, verify=False, timeout=10)
    # Decode as UTF-8 to avoid encoding errors
    res.encoding = 'utf-8'
    # Build the HTML into an XPath-queryable tree
    root = etree.HTML(res.content)
    # Use XPath to pull out the game names
    gameList = root.xpath("//ul[@class='livelist-mod']//li//p//text()")
    # Print the game names we crawled
    print(gameList)
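
Because we pass verify=False, requests will emit an InsecureRequestWarning on every call; a minimal way to run the function and silence that warning (the suppression line is optional):

import urllib3

# Optional: silence the warning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

if __name__ == '__main__':
    _updateGameList()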

Once we have these dozens of game names, the next step is to crawl a corpus for each of them. The question is where to crawl the guides for these dozens of games: TapTap? 17173? Some other guide site? After analyzing these sites, we found that they only carry articles for a handful of popular games; for unpopular or low-profile games such as "Soul Chip", "Miracle: Awakening", and "Death is Coming", it is hard to find any sizable number of articles, as shown in the figure:

We can see that there are very few articles for "Miracle: Awakening" and "Soul Chip", far too few for our needs. So is there a more general-purpose resource site with a corpus of articles rich enough to meet our needs?

Calming down and thinking about it, there is such a resource site, and we use it every day: Baidu. We search Baidu News for the relevant games and get a list of search results; the links in that list are almost all strongly related to the query, which easily solves the problem of insufficient data sources. But this raises a new, and much harder, problem: how do we extract the article content from an arbitrary web page?
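
As a minimal sketch of this step, the query endpoint and the link-filtering rule below are assumptions and would need to be checked against the real search results page:

import requests
from lxml import etree

def searchArticleLinks(keyword):
    # Assumed Baidu News query endpoint and parameters; verify against the live page
    res = requests.get('https://www.baidu.com/s', params={'tn': 'news', 'word': keyword}, timeout=10)
    res.encoding = 'utf-8'
    root = etree.HTML(res.text)
    # Collect every absolute link on the results page; a real crawler would filter these further
    return [href for href in root.xpath("//a/@href") if href.startswith('http')]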

Different sites have different page structures, we cannot predict in advance which sites our links will point to, and writing a dedicated crawler for every site would be an unimaginable amount of work. On the other hand, we cannot simply scrape every word on the page either; training on that kind of corpus would be a nightmare!

After a battle of wits with each website, plus some research and thinking, I finally arrived at a fairly general scheme. Here is the idea.

0x3 Crawling an article corpus from any website

01 | Extraction methods

1) Text extraction based on the DOM tree

2) Searching for text blocks based on web page segmentation

3) Text extraction based on tag windows

4) Based on data mining or machine learning

5) Text extraction based on the row block distribution function

02 | Extraction principle

These names may look a little confusing at first: how exactly does each of them extract the text? Let me go through them one by one.

1) Text extraction based on the DOM tree:

This method builds a DOM tree from reasonably standard HTML and then traverses the tree to identify and strip the various kinds of non-text information: advertisements, links, and other unimportant nodes. Once the non-text information is removed, what is left is naturally the body text.

But there are two problems with this approach (a short lxml sketch follows below):

It relies heavily on the HTML being well structured; if the page we crawl is not written to the W3C specification, the method may not work well.

Building and traversing the tree has high time and space complexity, and the traversal logic differs depending on the HTML tags involved.
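
A minimal sketch of the idea using lxml (the set of tags treated as non-text here is an assumption; a real implementation would identify ads, links, and other noise far more carefully):

import lxml.html
from lxml import etree

def domExtract(html):
    tree = lxml.html.fromstring(html)
    # Strip nodes that usually carry no body text: scripts, styles, navigation, footers.
    # This tag list is illustrative, not an exhaustive rule.
    etree.strip_elements(tree, 'script', 'style', 'nav', 'footer', 'aside', with_tail=False)
    # Whatever text remains is the candidate body text
    return tree.text_content()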

2) Searching for text blocks based on web page segmentation:

This method segments the page using the dividing lines implied by HTML tags together with some visual information (text color, font size, amount of text, and so on), and then searches the resulting blocks for the body text.

There is a problem with this approach:

Every site's HTML style is different, so there is no unified way to segment pages, and generality cannot be guaranteed.

3) Text extraction based on tag windows:

First, a concept: a tag window is a pair of tags together with the text they enclose. For example, in <h1>I am h1</h1>, "I am h1" is the content of the h1 tag window.

This method first takes the article title and every tag window in the HTML and segments their text into words. It then computes the distance L between the word sequence of the title and the word sequence of each tag window; if L is below some threshold, the text in that tag window is taken to be the body text (a rough sketch follows after the list of problems below).

While this approach looks good, there are problems:

Segmenting all of the text on a page into words is not efficient.

The word-distance threshold is hard to determine, and different articles call for different thresholds.
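
A rough sketch of the tag-window idea, using a simple set-overlap (Jaccard) distance and a whitespace tokenizer as stand-ins; the actual distance measure, word segmenter, and the 0.5 threshold are all assumptions:

from lxml import etree

def tagWindowExtract(html, title, threshold=0.5):
    # Whitespace tokenization is a placeholder; Chinese text needs a proper word segmenter
    title_words = set(title.split())
    root = etree.HTML(html)
    best = ""
    for node in root.iter():
        text = (node.text or "").strip()
        if not text:
            continue
        words = set(text.split())
        # Jaccard distance between the title words and the words in this tag window
        distance = 1 - len(title_words & words) / len(title_words | words)
        # Keep the longest window whose distance to the title is under the threshold
        if distance < threshold and len(text) > len(best):
            best = text
    return best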

4) Based on data mining or machine learning

Use a large amount of data to train a model to extract the body text.

This approach is certainly great, but it requires HTML plus labeled body-text data and then training. We won't go into it here.

5) Text extraction based on the row block distribution function

In any web page, the body text and the tags are always mixed together. The core of this method rests on two observations: the density of text in a region, and the length of its row blocks. The body of a page is certainly one of the regions where text is most densely distributed, but it is not necessarily the largest one (comments can be long while the body is short), so row block length is used as an additional criterion.

Implementation idea:

Strip all tags from the HTML, keeping only the text, and remove the blank lines the tags leave behind; each remaining line of text is called Ctext;

For each Ctext, take it together with the surrounding k lines and merge them into a row block called Cblock;

Remove all whitespace from each Cblock; the total length of the remaining text is called Clen;

Build a coordinate system with the Ctext line number on the horizontal axis and the Clen of each row block on the vertical axis.

Take this page as an example: www.gov.cn/ldhd/2009-1… The body of the page spans lines 145 to 182.

On the resulting distribution plot, the correct body region is the densest continuous region, and it usually contains one sharp rise point and one sharp drop point. The problem of extracting the page body is therefore transformed into finding these two boundary points on the row block distribution function: the region between them contains the longest row blocks of the page and is continuous.

Extensive experiments show that this method extracts the body of Chinese web pages with high accuracy. Its advantage is that the row block function is independent of the HTML source and its tags, so it is simple to implement and accurate.

The main logic code is as follows:

# Assume content is the HTML you have already downloaded
import re

def extractText(content, max_text_len=100, k=3):
    # max_text_len is a tunable threshold for what counts as a "dense" row block
    # Strip scripts, styles, comments and all remaining tags, keeping only the text
    content = re.sub(r'(?is)<(script|style)[^>]*>.*?</\1>', '', content)
    content = re.sub(r'(?s)<!--.*?-->|<[^>]*>', '', content)
    lines = [line.strip() for line in content.split('\n')]
    # Clen of each row block: the current line plus the next k lines, whitespace removed
    Ctext_len = [len(re.sub(r'\s', '', ''.join(lines[i:i + k])))
                 for i in range(len(lines))]

    result, start, end = '', -1, -1
    boolstart, boolend = False, False
    for i in range(len(Ctext_len) - 3):
        # A sudden rise: a dense block followed by non-empty blocks marks the start of the body
        if Ctext_len[i] > max_text_len and (not boolstart):
            if Ctext_len[i + 1] != 0 or Ctext_len[i + 2] != 0 or Ctext_len[i + 3] != 0:
                boolstart = True
                start = i
                continue
        if boolstart:
            # A sudden drop: an empty block marks the end of the body region
            if Ctext_len[i] == 0 or Ctext_len[i + 1] == 0:
                end = i
                boolend = True
            tmp = []
            # Once the end boundary is found, collect the lines between start and end
            if boolend:
                for ii in range(start, end + 1):
                    if len(lines[ii]) < 5:  # skip very short lines
                        continue
                    tmp.append(lines[ii] + '\n')
                result += ''.join(tmp)
                boolstart, boolend = False, False
    return result
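
Putting the pieces together, a minimal usage sketch that feeds a downloaded page into the extractor (the URL here is just a placeholder):

import requests

if __name__ == '__main__':
    # Placeholder URL; substitute one of the article links collected earlier
    res = requests.get('http://www.gov.cn/', timeout=10)
    res.encoding = 'utf-8'
    print(extractText(res.text))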

0x4 Epilogue

At this point we can obtain article corpora from pages of any kind, but this is just the beginning. Once we have the raw corpus, it still needs cleaning, word segmentation, part-of-speech tagging, and so on before it becomes a corpus we can actually use.