The structure and complete chain of a crawler

Hello, everyone. I’m Book-learning.

Many of us have seen plenty of crawler tutorials and would really like to know what the main work of a crawler engineer involves and what skills are needed to master it.

Today I will bring you the structure and complete chain of the crawler.

Crawlers aren’t like the social and entertainment apps you install on your phone, but they may well be related to them: the news and stock charts you read in the morning could have been gathered by a crawler. The core of a crawler is data; everything it does revolves around data.

The chain of a crawler

  • Clarifying requirements
  • Analyzing the target
  • Making network requests
  • Parsing the text
  • Storing the data
  • The data warehouse
  • Applications: timely display in search engines, information aggregation, data analysis, deep-learning samples, operational reference

Libraries commonly used by crawler engineers

Making network requests is the beginning of a crawler and one of its most important parts. We’ll start with Requests, the network request library most used by Python crawler engineers, known for its simplicity, ease of use, and stability. It is installed as follows:

pip install requests

Once installed, we can experiment with a simple HTTP GET request, sent to http://httpbin.org/get.

The attributes of the response object can be accessed through the “.” operator.

import requests


url = 'http://httpbin.org/get'
response = requests.get(url)
status_code = response.status_code	# response status code
text = response.text	# response body
headers = response.headers	# response headers
print(text)
print('\n')
print(status_code, headers)

The code results are as follows:

{ "args": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "Python-requests /2.25.1"," x-amzn-trace-id ": "Root= 1-60CC838F-122C533C14CC8b360f325fb4 "}, "origin": "120.236.204.4", "url": "http://httpbin.org/get"} 200 {'Date': 'Fri, 18 Jun 2021 11:29:19 GMT', 'content-type ': 'application/json', 'Content-Length': '306', 'Connection': 'keep-alive', 'Server': 'Gunicorn /19.9.0',' access-Control-allow-Origin ': '*', 'access-Control-allow-credentials ': 'true'}Copy the code

As you can see from the response text above, the User-Agent at this point is python-requests/2.25.1.

Often, however, we want to fake the request headers to get past the server’s validation checks, which can be done by customizing the request headers. For example, you can disguise the User-Agent as Chrome’s identifier, with the following code:

headers = {
	 "User-Agent": "Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36"
}
response = requests.get(url, headers=headers)
text = response.text	# response body
print(text)

The code runs as follows:

{ "args": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36", "x-amzn-trace-id ": Root= 1-60cc8616-5C121b946f9ebc6c7c30beA1 "}, "origin": "120.236.204.4", "url": "http://httpbin.org/get"}Copy the code

As you can see from the above result, the User-Agent has been successfully disguised as Chrome’s identifier.

Some requests require the client to carry parameters. For example, when searching for “crawler” on Baidu, the requested URL is actually https://www.baidu.com/s?wd=crawler.

As you can see from the URL above, we made a GET request to Baidu with a query parameter wd whose value is “crawler”.

The specific code is as follows:

import requests


url = 'https://www.baidu.com/s'
params = {
	'wd': 'crawler'
}
headers = {
	 "User-Agent": "Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36"
}
response = requests.get(url, headers=headers, params=params)

For example, in a login scenario an HTTP POST request is made to send the username and password to the server, so we need to send a request body along with the network request.

The specific code is as follows:

import requests

url = 'http://httpbin.org/post'
headers = {
	 "User-Agent": "Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36"
}
info = {'username': 'Book-learning', 'password': '1234567'}
response = requests.post(url, headers=headers, data=info)
text = response.text	# response body
print(text)

The code runs as follows:

{ "args": {}, "data": "", "files": {}, "form": { "password": "1234567", "username": "\u5543\u4e66\u541b" }, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "53", "Content-Type": "Application/X-www-form-urlencoded ", "Host": "httpbin.org"," user-agent ": "Mozilla/5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36", "x-amzn-trace-id ": Root= 1-60cc8a30-7fea4a67361996737CAa7b97 "}, "json": null, "origin": "120.236.204.4", "url": "http://httpbin.org/post" }Copy the code

From the above result, we can see that the username and password we submitted were uploaded to the server as form data.
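Besides form data, Requests can also send a JSON body through its json= parameter. A minimal sketch against the same httpbin endpoint (not part of the original example):

import requests


url = 'http://httpbin.org/post'
# json= serializes the dict and sets the Content-Type header to application/json
response = requests.post(url, json={'username': 'Book-learning', 'password': '1234567'})
print(response.json()['json'])    # httpbin echoes the parsed JSON body back under "json"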

For more information on how to use the Requests network request library, see the official documentation linked below:

Requests Official Documentation

Docs.python-requests.org/en/latest/u…
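Two features covered there that crawler engineers rely on constantly are sessions and timeouts; here is a minimal sketch (illustrative, not from the original article):

import requests


# A Session reuses the underlying TCP connection and shared headers across requests
with requests.Session() as session:
    session.headers.update({'User-Agent': 'Mozilla/5.0'})
    # timeout (in seconds) keeps a request from hanging forever
    response = session.get('http://httpbin.org/get', timeout=10)
    print(response.status_code)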

Asynchronous network request library Aiohttp

First we need to understand what synchronous and asynchronous mean. Synchronous means that once a call is made, the next call does not proceed until the result is returned; if no result comes back, the code stays in a waiting state and does not move on. Asynchronous, by contrast, means the caller does not block waiting for the result: it can carry on with other work and pick up the result once it is ready.

Since the concept of coroutines was introduced and Python gained the async and await keywords, Python coroutines have flourished.

Coroutines can be understood as an optimization of threads, and are often called micro-threads. They are a scheduling mechanism that runs in user space, which saves resources and is more efficient than threads.

A coroutine program executes only one of its tasks at a time; only when the current task blocks does it switch to the next task. This mechanism keeps multiple tasks coordinated while avoiding the complexity of locks in threaded code, which simplifies development. Several Python modules can implement coroutines, such as asyncio, Tornado, and gevent.

Using asyncio as an example, let’s first look at the concepts used to create coroutines:

  • event_loop: the mechanism that drives coroutines. The program starts an event loop and calls the corresponding coroutine function when an event occurs.
  • async/await keywords: async defines a coroutine; await suspends the coroutine at a blocking asynchronous call.

The code for a simple coroutine implementation is as follows:

# import the asyncio module
import asyncio


# Define the coroutine handler
async def demo(x):
	print(x)
	r = await asyncio.sleep(5)
	print(x, 'again')


# Generate a coroutine object by calling demo
coroutine = demo('python.org')
loop = asyncio.get_event_loop()
try:
	# Register the coroutine object with the event loop and start running it
	loop.run_until_complete(coroutine)
finally:
	# When the program ends, close the event loop
	loop.close()

The running result is as follows:

python.org
python.org again

The code above prints python.org, and then prints python.org again after 5 seconds.

The process of implementing coroutines consists of the following steps:

1. Define the coroutine function

2. Generate the coroutine object

3. Register the coroutine object with the event loop

4. Close the event loop

Note: async and await were added in Python 3.5, so the code above requires Python 3.5+.

Coroutines are more encapsulated and easier to work with than threads. Coroutines execute multiple tasks sequentially: only when the current task is suspended does execution switch to another task. Threads, by contrast, are scheduled by the operating system, which rotates them on the CPU.

The two mechanisms therefore work in essentially different ways, and neither can replace the other. In practice, threads and coroutines should be used flexibly according to the requirements of the task.
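As a rough illustration of the thread model described above, here is a minimal sketch using Python’s standard threading module (illustrative, not from the original article): two blocking functions run in separate OS threads, and the operating system decides when to switch between them.

import threading
import time


def blocking_print(delay, what):
    time.sleep(delay)    # blocks this thread only; the OS runs the other thread meanwhile
    print(what)


t1 = threading.Thread(target=blocking_print, args=(1, 'hello'))
t2 = threading.Thread(target=blocking_print, args=(2, 'world'))
t1.start()
t2.start()
t1.join()    # wait for both threads to finish
t2.join()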

Note: from Python 3.7 onward, you can simply call asyncio.run() to execute a coroutine, which is much simpler.

The specific code is as follows:

import asyncio
import time


async def say_after(delay, what):
	await asyncio.sleep(delay)
	print(what)


async def main():
	print(f"started at {time.strftime('%X')}")

	await say_after(1, 'hello')
	await say_after(2, 'world')

	print(f"finished at {time.strftime('%X')}")


asyncio.run(main())

The running result is as follows:

started at 13:22:35
hello
world
finished at 13:22:38

The asyncio.create_task() function wraps coroutines in asyncio Tasks so that they run concurrently.

The specific code is as follows:

# Reuses say_after() and the imports from the previous example
async def main():
	task1 = asyncio.create_task(say_after(1, 'hello'))
	task2 = asyncio.create_task(say_after(2, 'world'))

	print(f"started at {time.strftime('%X')}")

	# The tasks were scheduled by create_task(); await waits for them to finish
	await task1
	await task2

	print(f"finished at {time.strftime('%X')}")


asyncio.run(main())

The running result is as follows:

started at 14:24:20
hello
world
finished at 14:24:22

As you can see, the program now finishes one second faster than before (2 seconds instead of 3), because the two sleeps overlap instead of running one after the other.
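An equivalent way to run several coroutines concurrently is asyncio.gather(), which schedules them all and waits for every result; a minimal, self-contained sketch (not from the original article):

import asyncio
import time


async def say_after(delay, what):
    await asyncio.sleep(delay)
    print(what)


async def main():
    print(f"started at {time.strftime('%X')}")
    # gather() runs both coroutines concurrently and waits for both to finish
    await asyncio.gather(say_after(1, 'hello'), say_after(2, 'world'))
    print(f"finished at {time.strftime('%X')}")


asyncio.run(main())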

Awaitable objects

Python coroutines are awaitable objects, so they can be awaited from other coroutines.

The specific code is as follows:

import asyncio


async def nested():
	return 42


async def main():
	nested()	# creates a coroutine object but never awaits it, so it never runs
	print(await nested())	# awaiting the coroutine actually runs it and prints 42


asyncio.run(main())

Calling nested() directly merely creates a coroutine object; because it is never awaited, it does not run (Python will even warn that the coroutine was never awaited). Only the awaited call actually executes and prints 42.

Note:

  • Coroutine function: a function defined with async def
  • Coroutine object: the object returned by calling a coroutine function

Create a task

Tasks are used to schedule coroutines so that they can run concurrently.

Wrap a coroutine in a Task with asyncio.create_task(), and it is automatically scheduled for execution.

The specific code is as follows:

import asyncio


async def nested():
	return 42


async def main():
	task = asyncio.create_task(nested())
	print(await task)

asyncio.run(main())

For more information on coroutines and tasks, refer to the official documentation.

Coroutines and tasks

Aiohttp

Now that we understand the difference between synchronous and asynchronous, let’s move on to the asynchronous network request library: aiohttp.

aiohttp is not only a full-featured HTTP client; it also includes an HTTP server, so it can be used to build web applications. It supports both the HTTP protocol and the WebSocket protocol, and is an invaluable tool for crawler engineers.

Installation method:

pip install aiohttp

Sample code, as shown below:

import asyncio
import aiohttp


async def main():

    async with aiohttp.ClientSession() as session:
        async with session.get('http://python.org') as response:
            print('Status:', response.status)
            print('content-type:', response.headers['content-type'])

            html = await response.text()
            print(html)

asyncio.run(main())

Compared with the synchronous urllib.request.urlopen(), this code makes the request inside async with aiohttp.ClientSession() as session, so the request is handled asynchronously.
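As mentioned above, aiohttp also ships with a server side; a minimal sketch of a web application built on aiohttp.web (illustrative, not part of the crawler):

from aiohttp import web


async def handle(request):
    # Respond to GET / with a plain-text body
    return web.Response(text='Hello, aiohttp')


app = web.Application()
app.add_routes([web.get('/', handle)])

if __name__ == '__main__':
    web.run_app(app)    # listens on port 8080 by default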

Hands-on practice

Here we take Dangdang as an example and write an asynchronous crawler to improve crawling efficiency.

from lxml import etree
import asyncio
import aiohttp


async def fetch(session, url):
    async with session.get(url) as response:
        # print(type(response.text(encoding='gb2312', errors='ignore')))

        return await response.text(encoding='gb2312', errors='ignore')


async def get_data():
    url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
    async with aiohttp.ClientSession() as session:

        html = etree.HTML(await fetch(session, url))
        book_names = html.xpath('//div[@class="name"]/a/@title')
        return book_names


async def save_data(book_names):
    # Write the book titles to a text file, one per line
    with open('book1.txt', 'w', encoding='utf-8') as f:
        for book_name in book_names:
            f.write(book_name)
            f.write('\n')


async def main():
    book_names = await get_data()
    print(book_names)
    await save_data(book_names)


if __name__ == '__main__':
    asyncio.run(main())

In this way the program can run more efficiently, overlapping the time spent waiting for the server’s response with other work.
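The hands-on example above fetches only a single page, so the asynchronous client has nothing to overlap yet. Below is a hedged sketch of how several ranking pages might be fetched concurrently with asyncio.gather(); the assumption that the trailing number in the URL is the page number is mine, not stated in the original.

from lxml import etree
import asyncio
import aiohttp


# Assumption: the trailing number of the ranking URL is the page number
BASE_URL = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-{}'


async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text(encoding='gb2312', errors='ignore')


async def get_page(session, page):
    # Parse the book titles on one ranking page, reusing the XPath from the example above
    html = etree.HTML(await fetch(session, BASE_URL.format(page)))
    return html.xpath('//div[@class="name"]/a/@title')


async def main():
    async with aiohttp.ClientSession() as session:
        # gather() sends all page requests concurrently instead of one after another
        results = await asyncio.gather(*(get_page(session, p) for p in range(1, 4)))
    for book_names in results:
        print(book_names)


if __name__ == '__main__':
    asyncio.run(main())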

Finally

After reading this article, if you found it helpful, please tap [Looking]. I will keep working hard and grow together with you.

Every word of this article was typed out with care; I only hope to live up to everyone who follows me.

Click on it to let me know you’re doing the best you can with your life.

I am someone who focuses on learning. The more you know, the more you realize you don’t know. There is more wonderful content to come; see you next time!