Hi, everyone. I believe those of you who clicked here are very interested in crawlers, and so is the blogger. The blogger was hooked the first time he saw a crawler run, because it felt SO COOL. You feel a sense of accomplishment when a stream of data scrolls across the screen after a few lines of code, don’t you? Even better, the technique applies to many everyday scenarios: voting automatically, batch-downloading articles, novels, and videos you are interested in, building a WeChat bot, or crawling important data for analysis. The code really feels like it was written for you and works for you, and it can help others too. So: life is short, I choose the crawler.

1. What is a crawler?

First of all, we should understand what a crawler is and why we crawl. The blogger looked it up on Baidu, and the explanation goes like this:

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less commonly used names include ant, auto-indexer, emulator, or worm.

Basically, a crawler can mimic the behavior of a browser and do whatever you want: customize what you search for and download, and automate it. For example, a browser can download a novel, but it often cannot download novels in bulk, and that is where a crawler comes in handy.

There are many environments in which crawlers can be written: Java, Python, C++ and so on can all do the job. But the blogger chose Python, and I’m sure many people do, because Python is really well suited to crawling: its rich third-party libraries are powerful, a few lines of code do what you want, and, most importantly, Python is also great for data mining and analysis. Being able to both crawl and analyze data in Python feels great.

2. Crawler learning route

Now that we know what a crawler is, let me describe the basic learning route the blogger has summarized. It is only for reference, since everyone has their own method; the point here is just to offer some ideas.

The general steps for learning a Python crawler are as follows:

  • Learn basic Python syntax first
  • Learn the important built-in libraries that Python crawlers use, such as urllib and http, which are used to download web pages
  • Learn the web page parsing tools: regular expressions (re), BeautifulSoup (bs4), and XPath (lxml) (a short sketch follows this list)
  • Start with some simple site crawls (the blogger started from Baidu, haha) to get a feel for the crawling process
  • Learn about common anti-crawling mechanisms: headers, robots.txt, request intervals, proxy IPs, hidden fields, etc.
  • Learn to crawl some special sites, solving problems such as login, cookies, and dynamic JS-rendered pages
  • Learn selenium, an automation tool, for pages that load asynchronously
  • Understand how a crawler works with a database and how to store crawled data: MySQL, MongoDB
  • Learn to use Python multithreading and asynchronous requests to improve crawler efficiency
  • Learn crawler frameworks such as Scrapy and PySpider
  • Learn distributed crawling with Redis (for large-scale data requirements)
  • Learn incremental crawling
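
To make the parsing step above a little more concrete, here is a minimal sketch of downloading a page with urllib and extracting its links with BeautifulSoup. It assumes the third-party package beautifulsoup4 is installed (pip install beautifulsoup4); the URL is just an example:

import urllib.request
from bs4 import BeautifulSoup  # third-party parsing library

# Download the page
response = urllib.request.urlopen('http://python.org/')
html = response.read().decode('utf-8')

# Parse the HTML and print the href of every <a> tag
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))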

The above is only a general overview; there is plenty of content the blogger still needs to keep learning as well. The details of each step will be shared gradually in later posts with practical examples, and of course there will also be some fun crawler content along the way.

3. Start with the first crawler

I think the first crawler should start from urllib. When the blogger began learning, it was with the urllib library that a few lines of code implemented a simple data-crawling function, and I think most of you came the same way. My reaction was: wow, that’s amazing, a seemingly complicated task done in just a few lines of code. How does it do that, and how do you do more sophisticated crawls? With these questions in mind, I began studying urllib.

First, I have to describe the data-crawling process so that we understand exactly what is going on; learning urllib will then be much easier.

Crawler process

In fact, the crawler process is the same as the process of browsing a web page in a browser. We all know how that works: when we type a URL and hit search, the request first goes through a DNS server, which resolves the domain name in the URL and finds the real server. Then we send a GET or POST request to that server over HTTP. If the request succeeds, we get back the web page we wanted, generally built with front-end technologies such as HTML, CSS, and JS. If the request fails, the server returns a failure status code, commonly 503, 403, and so on.

The crawler process is the same, by making a request to the server to get the HTML page, and then parsing the downloaded page to get the content we want. Of course, this is an overview of a crawler process, and there are a lot of details we need to deal with, which will be shared later.
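
As a small preview of this flow (send a request, check the status code, read the HTML), here is a minimal sketch using only the standard library; the URL is just an example:

import urllib.request
import urllib.error

try:
    # Send a GET request to the server
    response = urllib.request.urlopen('http://python.org/')
    # 200 means the request succeeded
    print('Status code:', response.getcode())
    # The body of the response is the HTML of the page
    html = response.read().decode('utf-8')
except urllib.error.HTTPError as e:
    # A failed request comes back with a status code such as 403 or 503
    print('Request failed with status code:', e.code)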

After understanding the basic crawler process, we can start our real crawler journey.

Urllib library

Python ships with a built-in urllib library, which is an essential part of the crawler workflow. With this built-in library alone we can send requests to the server and fetch web pages, so it is also the first step in learning crawlers.

The blogger uses Python 3.x, where the structure of the urllib library is a bit different from Python 2.x: the urllib2 and urllib libraries of Python 2.x were merged into a single urllib library.

First, let’s take a look at what the urllib library of Python 3.x contains.

The blogger’s IDE is PyCharm, which is very convenient for editing and debugging. Enter the following code in the console:

>>> import urllib
>>> dir(urllib)

['__builtins__','__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__','__path__', '__spec__', 'error', 'parse', 'request', 'response']

As you can see, besides the built-in attributes that start and end with double underscores, urllib has four important modules: error, parse, request, and response.

The docstrings at the top of each module in Python’s urllib library describe them briefly as follows:

  • error: “Exception classes raised by urllib.” — the exceptions raised by urllib
  • parse: “Parse (absolute and relative) URLs.” — parses absolute and relative URLs
  • request: “An extensible library for opening URLs using a variety of protocols” — an extensible library for opening URLs with a variety of protocols
  • response: “Response classes used by urllib.” — the response classes used by urllib

Of these four modules, the most important is request, which covers most of a crawler’s work. Let’s take a look at how request is used.

Using request

The simplest operation in request is the urlopen method, which is used as follows:

import urllib.request
response = urllib.request.urlopen('http://python.org/')
result = response.read()
print(result)

The running results are as follows:

b'<!doctype html>\n<!--[if lt IE 7]>... </body>\n</html>\n'

The result looks garbled!! Don’t worry, this is just an encoding issue; we simply need to decode the bytes read from the response’s file-like object.

Modify the code as follows:

import urllib.request
response = urllib.request.urlopen('http://python.org/')
result = response.read().decode('utf-8')
print(result)

The running results are as follows:

<!doctype html> <!--[if lt IE 7]> <html class="no-js ie6 lt-ie7 lt-ie8">.. <!--[if IE 7]> <html class="no-js ie7 lt-ie8 lt-ie9">.. <!--[if IE 8]> <html class="no-js ie8 lt-ie9"> <![endif]--> <!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr"> <head> <meta charset="utf-8"> ...

This is the HTML page we want, how about it? Simple.

Let’s take a look at the urlopen method and the parameters it accepts.

Urlopen method

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None,
            cadefault=False, context=None):

urlopen is a method of the request module that opens a URL, which can be either a string (as in the example above) or a Request object (described later).

  • url: the URL we want to request (for example: www.xxxx.com/);

  • data: additional information sent to the server with the request (such as the user information you fill in when logging in to a page). If a data argument is supplied, the request is a POST; if not, it is a GET.

  • In general, the data argument only makes sense for requests over HTTP.

  • The data argument must be given as a bytes object.

  • The data argument should be in the standard urlencoded form format, which is what urllib.parse.urlencode() produces; a short sketch of how to use data follows this parameter list.

  • timeout: an optional timeout in seconds, to keep a request from hanging too long. If not specified, the global default timeout is used.

  • cafile: points to a single file containing a bundle of CA certificates (rarely used; the default is fine);

  • capath: points to a directory of certificate files, also used for CA verification (rarely used; the default is fine);

  • cadefault: can be ignored (rarely used; the default is fine);

  • context: used to configure SSL encrypted transport (rarely used; the default is fine).
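
Here is a minimal sketch of the data and timeout parameters. The target URL and the form field names are made up for illustration; a real site defines its own fields:

import urllib.parse
import urllib.request

# Hypothetical form fields; a real login page defines its own names
form_data = {'username': 'test', 'password': 'secret'}

# urlencode() produces 'username=test&password=secret'; encode it to bytes
data = urllib.parse.urlencode(form_data).encode('utf-8')

# Passing data makes this a POST request; timeout is in seconds
response = urllib.request.urlopen('http://httpbin.org/post', data=data, timeout=10)
print(response.read().decode('utf-8'))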

urlopen returns a file-like object, on which we can perform various operations (like the read() above, which reads the whole HTML). Other common methods include:

  • geturl(): returns the URL that was actually retrieved, which is useful for checking whether a redirect occurred.

result = response.geturl()

Results: https://www.python.org/

  • info(): returns meta information about the page, such as the HTTP headers.

result = response.info()

Results:

x-xss-protection: 1; mode=block X-Clacks-Overhead: GNU Terry Pratchett … Vary: Cookie Strict-Transport-Security: max-age=63072000; includeSubDomains

  • getcode(): returns the HTTP status code of the reply: 200 on success, perhaps 503 on failure, and so on. It can be used, for example, to check whether a proxy IP is still usable.

result = response.getcode()

Results: 200

The Request class

class Request:
    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False,
                 method=None):

As defined above, Request is a class; its initializer takes the parameters the request needs:

  • url and data are the same as described for urlopen above.
  • headers carries the HTTP request header information, such as the User-Agent, which lets the crawler disguise itself as a browser so the server does not know a crawler is being used.
  • origin_req_host, unverifiable, and method are not often used.

headers is very useful. Some websites have anti-crawler mechanisms, and a request without proper headers will be rejected with an error.

So how do you find headers for your browser?

Take Chrome as an example: press F12, switch to the Network tab, and click on a request to view its headers. You can copy your browser’s header information and use it in your own requests.

! [](https://pic1.zhimg.com/v2-1c761492064340555618877f871f2fb4_b.jpg)

Here’s how Request works:

import urllib.request

# Paste your own browser's User-Agent string between the quotes
headers = {'User-Agent': ''}
request = urllib.request.Request('http://python.org/', headers=headers)
html = urllib.request.urlopen(request)
result = html.read().decode('utf-8')
print(result)

The result is the same as before: urlopen accepts a Request object as well as a plain URL string, and the Request object carries the specified parameters. Remember to fill in your own browser information in headers.

There are many other features in urllib’s request module, such as proxies, timeouts, authentication, and HTTP POST requests, which will be shared next time. This time we focus on the basic functions.

Now let’s talk about exceptions, i.e. urllib’s error module.

Using error

The error module contains two important exception classes: URLError and HTTPError.

1. URLError class

def __init__(self, reason, filename=None):
    self.args = reason,
    self.reason = reason
    if filename is not None:
        self.filename = filename

  • URLError is a subclass of OSError. It inherits from OSError, has no behavior of its own, and serves as the base class for the other error types.
  • URLError’s initializer defines a reason argument, which means that when you catch a URLError you can inspect the reason for the failure.
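
A minimal sketch of reason in action (the hostname below is deliberately unreachable, just to trigger the exception):

import urllib.request
import urllib.error

try:
    # A non-existent host forces a URLError (e.g. a DNS failure)
    urllib.request.urlopen('http://a-host-that-does-not-exist.example/')
except urllib.error.URLError as e:
    print('Request failed, reason:', e.reason)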

2. HTTPError class

def __init__(self, url, code, msg, hdrs, fp):
    self.code = code
    self.msg = msg
    self.hdrs = hdrs
    self.fp = fp
    self.filename = url

  • HTTPError is a subclass of URLError; it is raised when an HTTP error occurs.
  • An HTTPError is also a valid HTTP response, since an HTTP protocol error is itself a valid response, complete with a status code, headers, and a body. That is why HTTPError’s initializer defines parameters for these response fields.
  • When you catch an HTTPError you can therefore inspect the status code, the headers, and so on.

Let’s use an example to see how to use these two Exception classes.

import urllib.request
import urllib.error

try:
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; '
                             'rv:57.0) Gecko/20100101 Firefox/57.0'}
    request = urllib.request.Request('http://python.org/', headers=headers)
    html = urllib.request.urlopen(request)
    result = html.read().decode('utf-8')
except urllib.error.HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first
    if hasattr(e, 'code'):
        print('Error status code: ' + str(e.code))
except urllib.error.URLError as e:
    if hasattr(e, 'reason'):
        print('Error reason: ' + str(e.reason))
else:
    print('Request succeeded.')

The code above uses a try..except structure to implement a simple page crawl: it prints the status code when an HTTPError occurs and the reason when some other URLError occurs. Adding exception handling enriches the crawling code and makes it more robust.

Why is it more robust?

Don’t underestimate these exceptions; they are very useful and critical. Think about it: when you write code that has to crawl and parse automatically over and over again, you do not want the program interrupted in the middle. Without exception handling, a single error is very likely to pop up and terminate the whole run; with complete exception handling, the error-handling code runs when an error is met (for example, printing the error code as above) and the program continues without interruption.

These interruptions can take many forms, especially when you use a proxy IP pool, where many different errors can occur, and that is exactly when exceptions come in handy.
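
For example, a long-running crawl can wrap each request in try..except so that one bad URL (or one bad proxy) does not kill the whole job. This is only a sketch; the URL list and the sleep interval are made up:

import time
import urllib.request
import urllib.error

# Hypothetical list of pages to crawl
urls = ['http://python.org/', 'http://python.org/about/']

for url in urls:
    try:
        response = urllib.request.urlopen(url, timeout=10)
        html = response.read().decode('utf-8')
        print(url, 'ok,', len(html), 'characters')
    except urllib.error.HTTPError as e:
        # Log the status code and keep going instead of crashing
        print(url, 'HTTP error:', e.code)
    except urllib.error.URLError as e:
        print(url, 'URL error:', e.reason)
    time.sleep(1)  # be polite: pause between requests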

4. To summarize

  • Introduced the definition of a crawler and a learning route
  • Introduced the crawler process
  • Introduced using the urllib library to start learning crawlers, covering:
  • request: the urlopen method and the Request class
  • error: exception handling with URLError and HTTPError

That is it for today’s sharing. If you found it useful, please give it a thumbs up and bookmark the article. Many people give up while learning Python because they have no materials or nobody to guide them, so the blogger has set up a Python learning and communication group where PDF books, tutorials, and so on are shared for free and we can learn together. Everyone is welcome.