1. Introduction

First, let me introduce myself. I am a Java developer, and since the second half of this year I have been publishing articles on the major technology blog platforms for a few months now. Earlier, I added the site statistics code to my cnblogs blog and watched the traffic grow day by day. With some free time over the National Day holiday, I thought I would write a crawler to see how much my read counts were growing, and that is where this article came from.

2. Technology selection

As for the crawler itself, my personal view is that it can be written in any language, as long as that language can send a normal HTTP request and extract the data we need from the static HTML page that comes back in the response; the principle is very simple. Of course these statistics could be collected by hand, but there are quite a few platforms after all, so writing a crawler to pull the data down, save it to a database, and then write a few SQL statements for a statistical report is far more convenient. If I have time, I will also write a simple front-end/back-end separated report demo and share it.

Python crawler courses are very popular on the Internet right now, and frankly I was itching to try Python out rather than writing this crawler in my familiar Java. It turns out, of course, that writing a crawler in Python is much easier than writing one in Java.

3. Environment preparation

First of all, my computer runs Win10 and the Python version I chose is 3.7.4. There do not seem to be many crawler tutorials for Python 3 online yet, so I still ran into quite a few problems, which I will share below as well.

I chose VS Code as the development tool. I recommend this open-source product from Microsoft: it is very lightweight, you install only the plug-ins you need and skip the ones you don't, and it stays out of your way. If that feels like too much to manage, you can choose JetBrains PyCharm instead; it comes in a community edition and a paid edition, and for our purposes the community edition is perfectly fine.

Here I simply created a new folder and, inside it, a file named spiderdemo.py; this is the crawler file we are going to write. Here is a look at my development environment:

This is actually a screenshot taken after a successful debugging run. As you can see from the log printed at the bottom, I captured data from three platforms.

4. The database

The database is MySQL 5.7.19, and the tables are created with the utf8mb4 character set. Note that MySQL's utf8 is not full UTF-8; utf8mb4 should be used instead.

Just as you need a driver to connect to a MySQL database in Java, you need one in Python as well; here we install pymysql:

pip install pymysql

pip is Python's package management tool. My personal understanding is that it is similar to Maven: all the third-party packages we need can be downloaded through it.

Of course, the installation may time out depending on your network; when I ran this command at night I got all kinds of timeouts. Just as Maven has domestic mirror repositories, pip has mirror stations too. Here is a list:

  • Aliyun: https://mirrors.aliyun.com/pypi/simple/
  • University of Science and Technology of China: pypi.mirrors.ustc.edu.cn/simple/
  • Douban: pypi.douban.com/simple/
  • Tsinghua University: pypi.tuna.tsinghua.edu.cn/simple/

The command is as follows:

pip install -i https://mirrors.aliyun.com/pypi/simple/ <library name>

I have only tried the Aliyun and Tsinghua University mirror stations; I have not tried the others. The list above comes from the Internet.
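If you do not want to pass -i every time, pip (version 10 and later) also lets you set a default index. Here is a one-line example, using the Aliyun mirror purely as an assumed choice rather than anything recommended by this article:

pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/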

The table structure is designed as follows. The design here is very rough: to keep things simple there is only one table, and I won't go into the redundant details. Take a look at the picture; every field has a comment:

The table creation statements have been committed to the GitHub repository for those who need them.
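Since the table design is only shown as a screenshot here, below is a minimal sketch of what the table might look like, reconstructed from the fields used in the INSERT statement later in this article; the column types, lengths and comments are my own assumptions, not the author's actual DDL:

CREATE TABLE spider_data (
    id          VARCHAR(36) NOT NULL COMMENT 'primary key (UUID)',
    plantform   VARCHAR(32) COMMENT 'platform the data was crawled from',
    read_num    INT COMMENT 'total number of reads',
    fans_num    INT COMMENT 'total number of followers',
    rank_num    INT COMMENT 'ranking',
    like_num    INT COMMENT 'total number of likes',
    create_date DATETIME COMMENT 'time the record was created',
    PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;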

5. Hands-on

The overall idea is divided into the following steps (a rough sketch of the flow follows the list):

  1. Request the HTML of the whole page with a GET request
  2. Match the data we need with some matching rules
  3. Store the data in the database
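To make the flow concrete, here is a minimal, hypothetical outline of these three steps for a single platform; the helper names fetch_html and extract_first are my own and do not come from the article, and the real code for each step follows in the sections below:

from urllib import request
from lxml import etree

def fetch_html(url):
    # Step 1: send a GET request and return the decoded HTML of the page
    req = request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0')
    return request.urlopen(req).read().decode('utf-8')

def extract_first(html, xpath_expr):
    # Step 2: match the data we need with an XPath expression
    return etree.HTML(html).xpath(xpath_expr)[0]

# Step 3: store the extracted values in MySQL (see section 5.3 for the pymysql code)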

5.1 Requesting HTML static resources

Python3 provides urllib, a standard library that requires no additional installation.

from urllib import request

Next we send a GET request using urllib, as follows:

# Build the GET request for the CSDN profile page and pretend to be a normal browser
req_csdn = request.Request('https://blog.csdn.net/meteor_93')
req_csdn.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36')
html_csdn = request.urlopen(req_csdn).read().decode('utf-8')

User Agent (UA) is a special string header that enables the server to identify the operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plug-in, etc.

The header is added here to simulate a request from a normal browser. Many servers check it and will simply refuse requests that do not look like they come from a normal browser. Although, as it turned out, none of the platforms I crawled actually checks it, it does no harm to add it. Of course, a real browser sends more than just the UA in its request headers; there is other information as well, as shown below:

The UA string here was copied directly from the request headers shown above. With the code written so far we have obtained the static resource html_csdn; the next step is to parse this resource and match the information we need.

5.2 xpath data matching

What is xpath?

XPath is a language for finding information in XML documents. XPath can be used to traverse elements and attributes in an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer build on XPath expressions.

As the quote above says, XPath is used to search XML, and our HTML can be regarded as an XML document with slightly non-standard syntax, which happens to give us a way to parse an HTML document.

Before we can use XPath we need to install lxml, the library that provides it, since it is not part of Python's standard library.

pip install lxml

If your network is slow, you can use one of the mirror stations listed above for the installation.

XPath expressions themselves are very simple. For the full syntax you can refer to the W3School tutorial (www.w3school.com.cn/xpath/xpath)… ; I will not go through it here. The expressions used in this example are as follows:

from lxml import etree

# Total reads, followers, ranking and likes from the CSDN profile sidebar
read_num_csdn = etree.HTML(html_csdn).xpath('//*[@id="asideProfile"]/div[3]/dl[2]/dd/@title')[0]
fans_num_csdn = etree.HTML(html_csdn).xpath('//*[@id="fan"]/text()')[0]
rank_num_csdn = etree.HTML(html_csdn).xpath('//*[@id="asideProfile"]/div[3]/dl[4]/@title')[0]
like_num_csdn = etree.HTML(html_csdn).xpath('//*[@id="asideProfile"]/div[2]/dl[3]/dd/span/text()')[0]

Here I mainly get the total number of reads, total number of followers, ranking and total number of likes.

Here are a few basic uses that are sufficient for this example (a small illustration follows the table):

expression   description
nodename     Selects all child nodes of the named node.
/            Selects from the root node.
//           Selects nodes in the document that match the selection, no matter where they are.
.            Selects the current node.
..           Selects the parent of the current node.
@            Selects attributes.
text()       Selects the text content of the current node.
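As a quick illustration of these expressions (my own toy example, not from the original article), here is how they behave against a tiny HTML snippet parsed with lxml:

from lxml import etree

html = '<div id="profile"><dl><dd title="12345">reads</dd></dl><span id="fan">67</span></div>'
tree = etree.HTML(html)

# // searches the whole document, @ selects an attribute, text() selects text content
print(tree.xpath('//*[@id="profile"]/dl/dd/@title')[0])  # '12345'
print(tree.xpath('//*[@id="fan"]/text()')[0])            # '67'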

Another easy way to get xpath expressions is through the Chrome browser. See the screenshot below:

Press F12 to open the developer tools, right-click the element you want in the Elements panel, and choose Copy -> Copy XPath to generate the XPath expression.

One thing to note here is that the value we extract through XPath is not a basic data type. If we want to do arithmetic or string concatenation with it, we need an explicit type conversion first, otherwise an error will be reported, as follows:

import re

# Request a page of the cnblogs blog so we can find the maximum page number
req_cnblog = request.Request('https://www.cnblogs.com/babycomeon/default.html?page=2')
req_cnblog.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36')
html_cnblog = request.urlopen(req_cnblog).read().decode('utf-8')

max_page_num = etree.HTML(html_cnblog).xpath('//*[@id="homepage_top_pager"]/div/text()')

# Maximum number of pages
max_page_num = re.findall(r"\d+\.?\d*", str(max_page_num))[0]

The max_page_num returned by xpath here is a list, so it is first converted to a string and the page number is then pulled out of it with a regular expression.
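Note that the result of re.findall is still a string; if you want to use max_page_num as a number, for example to loop over every page, it still needs to be converted. A small example (the loop itself is mine, not from the article):

# max_page_num is a string such as '5'; convert it before using it as a number
for page in range(1, int(max_page_num) + 1):
    url = 'https://www.cnblogs.com/babycomeon/default.html?page=' + str(page)
    # ... request and parse each page here ...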

5.3 Writing to the Database

I will not say much about the database operations themselves. Anyone who has written Java should be very familiar with JDBC: first obtain a connection using the IP, port, user name, password, database name, character set and so on, then open the connection, write a SQL statement, fill in its parameters, execute it, commit the data, and finally close the connection, as follows:

import pymysql

def connect():
    conn = pymysql.connect(host='localhost',
                           port=3306,
                           user='root',
                           password='123456',
                           database='test',
                           charset='utf8mb4')

    # Get the operation cursor
    cursor = conn.cursor()
    return {"conn": conn, "cursor": cursor}

connection = connect()
conn, cursor = connection['conn'], connection['cursor']

sql_insert = "insert into spider_data(id, plantform, read_num, fans_num, rank_num, like_num, create_date) values (UUID(), %(plantform)s, %(read_num)s, %(fans_num)s, %(rank_num)s, %(like_num)s, now())"


In this example the crawler only collects data once, so a single INSERT statement is all that is needed. After each platform has been crawled, the placeholders in the SQL statement are filled in and the data is committed. The example code is as follows:

csdn_data = {
    "plantform": 'csdn'."read_num": read_num_csdn,
    "fans_num": fans_num_csdn,
    "rank_num": rank_num_csdn,
    "like_num": like_num_csdn
}

cursor.execute(sql_insert, csdn_data)
conn.commit()
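The text above mentions closing the connection, but the snippet stops at the commit. A minimal way to finish, assuming the same conn and cursor objects as above, would be:

# Release the cursor and the connection once all platforms have been written
cursor.close()
conn.close()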

6. Summary

After actually writing a Python crawler, I do feel that Python's syntax is very simple. The whole program is 130+ lines; roughly speaking, the same functionality in Java would take 200+ lines, since sending a GET request with HttpClient and parsing the response is nowhere near as concise as Python's two or three lines of code. The crawler in this example is still far from perfect: at the moment it can only crawl platforms that do not require login, while some platforms only show statistics after login, which would require simulating the login process with cookies. I will keep refining this little crawler when I have time.
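For platforms that require login, one simple approach hinted at above is to send the cookies from a logged-in browser session along with the request. This is only a rough sketch with placeholder values (the URL and cookie names are not real), and whether it is enough depends on the platform:

req = request.Request('https://example.com/your/stats/page')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36')
# Copy the Cookie header from a logged-in browser session (placeholder value)
req.add_header('Cookie', 'session_id=xxxx; other_cookie=yyyy')
html = request.urlopen(req).read().decode('utf-8')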

This Python crawler was more of a technical taste test: a large part of the code was found through search engines, and I wrote it somewhat blindly. I have decided to learn Python systematically later on and will share the learning process then; interested readers are welcome to study and discuss it with me.

As usual, the code for this article has also been committed to a GitHub repository and a Gitee repository; feel free to grab it if you need it. The repository is named Python-Learn, which also serves as a way of keeping myself accountable for learning.

7. Sample code

Example code -Github

Example code -Gitee

8. Reference

XPath tutorial