This article is from NetEase Cloud community

Author: Wang Tao


Outline of this article:

  • A brief introduction to the two Python libraries used for crawler development today

  • A detailed look at the key functions and arguments in the Requests library

  • A detailed look at how to use HttpClient in Tornado

  • Conclusion

Goal: Learn about a quick crawler toolkit used in Python.

Basics: Basic Python syntax (2.7)

Here we go!

Simple crawlers: by "simple crawler" I mean one-off, customized code that is not generic. This is unlike a crawler framework, where a new crawler requirement can be implemented simply through configuration. For beginners, this article should basically meet your needs. If you are interested in frameworks, understanding the Tornado material in this article will also help you understand the PySpider framework (PySpider is built on Tornado).

Introduction to Requests and Tornado

With the development of big data, artificial intelligence, and machine learning, Python’s standing as a programming language continues to rise. Leaving aside its other strengths (which I am less familiar with), let’s talk about how Python works for crawlers.

1. Requests Basics

I believe that most people who want to get started with crawlers quickly will choose Python and the Requests library. So how do you fetch the source of the Baidu home page? Step 1: download and install Python. Step 2: pip install requests. Step 3: execute python -c 'import requests; print requests.get("http://www.baidu.com").content'

python -c 'import requests; print requests.get("http://www.baidu.com").content'
<!DOCTYPE html>
<!--STATUS OK--><html>
<head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>baidu once, you'll know</title></head>
<body link=#0000cc>
...
About Baidu
© 2017 Baidu <a href=http://www.baidu.com/duty/>before using Baidu required</a> <a href=http://jianyi.baidu.com/>...</a> <img src=//www.baidu.com/img/gs.gif>
...
2. Efficient fetching with Requests

To fetch efficiently, we need to change serial execution to parallel. When people talk about concurrency, they usually think of multi-threading and multi-processing.

But as you may know, multithreading in CPython is essentially pseudo-multithreading, i.e. effectively single-threaded, because the Python interpreter uses a big lock, the GIL, to ensure that only one thread at a time is scheduled to execute in the interpreter (the Python VM). Note: the GIL exists in CPython; Jython does not have this restriction (www.jython.org/jythonbook/…

For the sake of simplicity, we can just run with multiple threads. After all, most of Python’s built-in data structures (list, dict, tuple, etc.) are thread-safe, so you don’t have to worry as much about race conditions in your code.
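As a rough sketch of this approach (not from the original article), the following uses a thread pool from multiprocessing.dummy together with Requests; the URL list and pool size are just placeholders for illustration:

import requests
from multiprocessing.dummy import Pool as ThreadPool  # thread-based Pool, same API as multiprocessing

urls = ["http://www.baidu.com"] * 4  # placeholder URL list


def fetch(url):
    # each worker thread blocks on its own network IO, so the GIL is not a bottleneck here
    return requests.get(url).status_code


pool = ThreadPool(4)          # 4 worker threads
print pool.map(fetch, urls)   # e.g. [200, 200, 200, 200]
pool.close()
pool.join()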

The concept of coroutines is already supported in many programming languages. In Python, coroutines are implemented using the yield keyword. Tornado is an asynchronous non-blocking framework based on coroutines. It can be used to implement network requests more efficiently than multithreaded requests.

3. Tornado Introduction

Before we introduce Tornado, let’s briefly introduce the concept of coroutine.

3.1 Coroutines

In single-threaded, procedural programming, we encapsulate blocks of code into functions; a function has one entry point and one exit. When we call a function, we wait for it to finish before we can continue with the rest of the code. A coroutine, by contrast, is a function (still running on a single thread) that can be entered and left multiple times: when a coroutine function reaches a breakpoint, it can temporarily return and let other coroutine functions run, then later resume from that breakpoint. (This is a bit like multithreading, where one thread blocks and the CPU schedules another thread.) To see how this works, the example below is simple: we add show_my_sleep to the IOLoop four times, each time with a different argument. show_my_sleep prints a message, sleeps, then prints another message. From the output we can see that show_my_sleep surrenders control at the yield statement when it goes to sleep, and when the sleep ends, execution continues from that yield statement.

import random

from tornado.ioloop import IOLoop
from tornado import gen


@gen.coroutine
def show_my_sleep(idx):
    interval = random.uniform(5, 20)
    print "[{}] is going to sleep {} seconds!".format(idx, interval)
    yield gen.sleep(interval)  # this yield is the breakpoint that surrenders the right to run
    print "[{}] wake up!!".format(idx)


def main():
    io_loop = IOLoop.current()
    io_loop.spawn_callback(show_my_sleep, 1)  # schedule this function for the next loop
    io_loop.spawn_callback(show_my_sleep, 2)
    io_loop.spawn_callback(show_my_sleep, 3)
    io_loop.spawn_callback(show_my_sleep, 4)
    io_loop.start()


if __name__ == "__main__":
    main()

Results:

[1] is going to sleep 5.19272014406 seconds!
[2] is going to sleep 9.42334286914 seconds!
[3] is going to sleep 5.11032311172 seconds!
[4] is going to sleep 13.0816614451 seconds!
[3] wake up!!
[1] wake up!!
[2] wake up!!
[4] wake up!!
3.2 Introduction to Tornado

(Translated from the Tornado documentation: www.tornadoweb.org/en/stable/g…)

Tornado is a Python-based asynchronous network framework that uses non-blocking IO and can support tens of thousands of concurrent connections, making it ideal for long polling, WebSockets, and other applications that require a persistent connection. Tornado has four main parts:

- A web framework, including RequestHandler (used to create web applications) and various supporting classes.
- Client-side and server-side HTTP implementations (including HTTPServer and AsyncHTTPClient).
- An asynchronous networking library, containing IOLoop and IOStream, which the HTTP components are built on and which can also be used to implement other protocols.
- A coroutine library, tornado.gen, which makes asynchronous code read more directly than chained callbacks.

The Tornado web framework together with its HTTP server can serve as a full-stack alternative to WSGI. You can use the Tornado web framework in a WSGI container, or use Tornado's HTTP server as a container for other WSGI frameworks, but either combination has limitations. To take full advantage of Tornado, you need to use Tornado's web framework together with its HTTP server.

We mainly use Tornado’s HTTPClient and coroutine library to make concurrent network requests on a single thread. Here is the code:

import traceback

from tornado.ioloop import IOLoop
from tornado import gen
from tornado.curl_httpclient import CurlAsyncHTTPClient
from tornado.httpclient import HTTPRequest


@gen.coroutine
def fetch_url(url):
    """Grab url"""
    try:
        c = CurlAsyncHTTPClient()  # define an HttpClient
        req = HTTPRequest(url=url)  # define a request
        response = yield c.fetch(req)  # make the request
        print response.body
        IOLoop.current().stop()  # stop the ioloop
    except:
        print traceback.format_exc()


def main():
    io_loop = IOLoop.current()
    io_loop.spawn_callback(fetch_url, "http://www.baidu.com")  # add the coroutine function to the IOLoop
    io_loop.start()


if __name__ == "__main__":
    main()
4. Tornado concurrency

The simple idea here is to achieve parallel requests by adding multiple callbacks to the IOLoop.

def main():
    io_loop = IOLoop.current()
    io_loop.spawn_callback(fetch_url, "http://www.baidu.com")  # schedule this function for the next loop
    '''
    io_loop.spawn_callback(fetch_url, url1)
    ...
    io_loop.spawn_callback(fetch_url, urln)
    '''
    io_loop.start()


if __name__ == "__main__":
    main()
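One caveat: fetch_url above stops the IOLoop as soon as its first response arrives, so the commented-out extra callbacks would be cut off. A minimal sketch of one way to wait for all responses, assuming the fetches are gathered with gen.multi and driven with run_sync (the URL list is a placeholder):

from tornado.ioloop import IOLoop
from tornado import gen
from tornado.curl_httpclient import CurlAsyncHTTPClient


@gen.coroutine
def fetch_all(urls):
    client = CurlAsyncHTTPClient()
    # issue all requests at once and wait until every one has finished
    responses = yield gen.multi([client.fetch(url) for url in urls])
    for url, response in zip(urls, responses):
        print url, response.code


def main():
    urls = ["http://www.baidu.com", "http://www.baidu.com"]  # placeholder URLs
    IOLoop.current().run_sync(lambda: fetch_all(urls))  # run until the coroutine completes, then stop


if __name__ == "__main__":
    main()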

After this brief introduction to the two libraries, let’s look at their key functions and parameters in detail.

Requests key functions and parameters

Besides adding custom HTTP headers to deal with anti-crawler strategies, here we introduce the two key Requests functions, get and post, from an application perspective. Their definitions:

def get(url, params=None, **kwargs):
    """Sends a GET request.

    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    kwargs.setdefault('allow_redirects', True)
    return request('get', url, params=params, **kwargs)
def post(url, data=None, json=None, **kwargs):
    """Sends a POST request.

    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    return request('post', url, data=data, json=json, **kwargs)
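As a quick, illustrative sketch of calling these two functions (httpbin.org is used here only as a convenient echo service; it is not part of the original article):

import requests

# GET with query-string parameters
resp = requests.get("http://httpbin.org/get", params={"q": "python"})
print resp.status_code        # e.g. 200
print resp.url                # http://httpbin.org/get?q=python

# POST with a form body (data=) or a JSON body (json=)
resp = requests.post("http://httpbin.org/post", data={"name": "tornado"})
print resp.json()["form"]     # {u'name': u'tornado'}

resp = requests.post("http://httpbin.org/post", json={"name": "tornado"})
print resp.json()["json"]     # {u'name': u'tornado'}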

As we can see, both get and post in Requests call request, which is defined as follows:

    def request(self, method, url,
        params=None,
        data=None,
        headers=None,
        cookies=None,
        files=None,
        auth=None,
        timeout=None,
        allow_redirects=True,
        proxies=None,
        hooks=None,
        stream=None,
        verify=None,
        cert=None,
        json=None):
        """Constructs a :class:`Request <Request>`, prepares it and sends it.
        Returns :class:`Response <Response>` object.

        :param method: method for the new :class:`Request` object.
        :param url: URL for the new :class:`Request` object.
        :param params: (optional) Dictionary or bytes to be sent in the query          
          string for the :class:`Request`.
        :param data: (optional) Dictionary, bytes, or file-like object to send         
           in the body of the :class:`Request`.
        :param json: (optional) json to send in the body of the
            :class:`Request`.
        :param headers: (optional) Dictionary of HTTP Headers to send with the
            :class:`Request`.
        :param cookies: (optional) Dict or CookieJar object to send with the
            :class:`Request`.
        :param files: (optional) Dictionary of ``'filename': file-like-objects``         
           for multipart encoding upload.
        :param auth: (optional) Auth tuple or callable to enable
            Basic/Digest/Custom HTTP Auth.
        :param timeout: (optional) How long to wait for the server to send
            data before giving up, as a float, or a :ref:`(connect timeout,
            read timeout) <timeouts>` tuple.
        :type timeout: float or tuple
        :param allow_redirects: (optional) Set to True by default.
        :type allow_redirects: bool
        :param proxies: (optional) Dictionary mapping protocol or protocol and
            hostname to the URL of the proxy.
        :param stream: (optional) whether to immediately download the response
            content. Defaults to ``False``.
        :param verify: (optional) whether the SSL cert will be verified.
            A CA_BUNDLE path can also be provided. Defaults to ``True``.
        :param cert: (optional) if String, path to ssl client cert file (.pem).
            If Tuple, ('cert', 'key') pair.
        :rtype: requests.Response
        """Copy the code




For more information about NetEase’s r&d, product and operation experience, please visit NetEase Cloud Community.

