1. Crawler theory: how to capture packets, and how to analyze the captured data synchronously and asynchronously. Our crawlers are targeted (directional) crawlers.

  2. urllib2: use urllib2 to fetch data.
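
A minimal sketch, assuming Python 2 (in Python 3 the same API lives in urllib.request); the URL and User-Agent string are placeholders:

```python
# Python 2 example; in Python 3, import urllib.request instead.
import urllib2

req = urllib2.Request('http://example.com',
                      headers={'User-Agent': 'Mozilla/5.0'})  # placeholder UA
resp = urllib2.urlopen(req, timeout=10)  # fetch the page, wait at most 10 s
print(resp.read()[:200])                 # first bytes of the response body
```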

  3. requests: key, key, key (important enough to repeat three times). You must master the use of requests; its main parameters are described below.

Headers: used to masquerade as a browser and get past the other site’s anti-crawling measures. For a site that won’t return data directly, try setting Headers first.

Cookies: mainly used after cracking the login flow. Once the login algorithm is cracked, save the cookies and use them to access pages that require login, such as the personal home page. Cookies are data the server writes to the browser, stored locally.

Params: key/value pairs appended to the URL after the “?”, joined with “&”; each pair corresponds to one dictionary key.

Data: the form data sent in the body of a POST request.

Timeout: the maximum time to wait for the third-party site to respond.

Proxies: if the third-party site has blocked your IP, use proxies (proxy IP addresses) so requests go out through a different address.
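
A minimal sketch tying these parameters together; the URLs, cookie, and proxy values are placeholders:

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0'}      # masquerade as a browser
params = {'page': 1, 'q': 'python'}          # appended after "?" as page=1&q=python
cookies = {'sessionid': 'xxx'}               # session data obtained after login
proxies = {'http': 'http://127.0.0.1:8888'}  # route traffic through a proxy IP

# GET: params become the query string
r = requests.get('http://example.com/search', headers=headers,
                 params=params, cookies=cookies, timeout=10, proxies=proxies)

# POST: data is sent in the request body
r = requests.post('http://example.com/login',
                  data={'user': 'name', 'pwd': 'secret'}, timeout=10)
print(r.status_code, r.text[:200])
```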

  1. Data extraction: key, key, key (repeat three times). Crawling without extracting the data is meaningless.

Regular expressions: the simplest approach; use the non-greedy capture group (.*?) between anchors to extract the target text.
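
For example, a non-greedy group between two tag anchors (the HTML snippet is illustrative):

```python
import re

html = '<li>Item A</li><li>Item B</li>'
# (.*?) is non-greedy: it matches as little as possible before the next </li>
items = re.findall(r'<li>(.*?)</li>', html)
print(items)  # ['Item A', 'Item B']
```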

Bs4: the document must first be converted into a BeautifulSoup object, which is parsed with the lxml parser.
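
A minimal sketch (the HTML snippet is illustrative):

```python
from bs4 import BeautifulSoup

html = '<div class="title"><a href="/post/1">Hello</a></div>'
soup = BeautifulSoup(html, 'lxml')    # build the BeautifulSoup object with lxml
link = soup.find('a')                 # first <a> tag in the document
print(link.get_text(), link['href'])  # Hello /post/1
```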

XPath: parsed with the lxml parser; supports looking up parent nodes and child nodes.
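
A minimal sketch (the HTML snippet is illustrative):

```python
from lxml import etree

html = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
tree = etree.HTML(html)
texts = tree.xpath('//li/a/text()')  # text of every <a> under an <li>
hrefs = tree.xpath('//a/@href')      # href attribute of every <a>
parents = tree.xpath('//a/..')       # ".." walks up to the parent node (the <li>)
print(texts, hrefs, [p.tag for p in parents])
```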

  1. Selenium can be used together with the PhantomJS browser (which has no graphical interface) for sites where data cannot be retrieved by synchronous or asynchronous requests.
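
A sketch of the idea. Recent Selenium releases removed PhantomJS support, so this uses headless Chrome in the same role; on an old Selenium install, webdriver.PhantomJS() would be the drop-in equivalent:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')      # no graphical interface, like PhantomJS
driver = webdriver.Chrome(options=options)
driver.get('http://example.com')        # JavaScript runs, unlike plain requests
print(driver.page_source[:200])         # the fully rendered HTML
driver.quit()
```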

  2. RabbitMQ: a first-in, first-out message queue, used as middleware for building distributed programs. The producer pushes its data into RabbitMQ using the methods defined by the consumer; the consumer monitors the RabbitMQ queue and starts working as soon as data arrives. Consumer start command: celery -A ss worker …, using multiple processes plus coroutines. The producer stores its messages in RabbitMQ with the delay call.
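
A minimal sketch of this producer/consumer split, assuming Celery with RabbitMQ as the broker; the module name ss matches the start command above, and the task and worker flags are illustrative:

```python
# ss.py -- a minimal Celery app backed by RabbitMQ.
from celery import Celery

app = Celery('ss', broker='amqp://guest@localhost//')

@app.task
def crawl(url):
    # the consumer's work: fetch and process one URL
    print('crawling %s' % url)

# Producer side: .delay() stores the message in the RabbitMQ queue.
# crawl.delay('http://example.com')

# Consumer side (run in a shell); prefork gives processes, gevent gives coroutines:
#   celery -A ss worker --concurrency=4 -P gevent
```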

  3. Scrapy’s flow starts at the engine, which asks the spiders for the first URLs to crawl. The engine hands each URL to the scheduler; the scheduler returns URLs to the engine, which passes them to the downloader. When the download finishes, the downloader returns the downloaded data to the engine, which gives it to the spider for parsing. If parsing produces more URLs to fetch, the spider repeats the cycle with yield Request, sending them back through the engine; the data the spider wants to save is finally handed to the Item Pipelines for storage.
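
A minimal spider sketch showing the two halves of that loop, yield Request and yield item; the name, URL, and selectors are illustrative:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://example.com']   # first URLs handed to the engine

    def parse(self, response):
        # yield data: the engine routes it to the item pipelines
        yield {'title': response.css('title::text').get()}
        # yield a new Request: the engine routes it back to the scheduler
        for href in response.css('a::attr(href)').getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
```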

Items: defines the data fields to be saved; the spider assigns values to these fields. Pipelines: receives the items the spider has filled in and stores them in the database.
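
A sketch of the two pieces; the field names and the storage step are illustrative:

```python
import scrapy

class ArticleItem(scrapy.Item):
    # the data fields to be saved; the spider assigns them
    title = scrapy.Field()
    url = scrapy.Field()

class SavePipeline(object):
    def process_item(self, item, spider):
        # receives every item the spider yields; write it to the database here
        print(dict(item))
        return item
```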

Configuration in common setting:

DOWNLOAD_DELAY = 2 # delay between two requests, in seconds

DOWNLOAD_TIMEOUT = 30 # like timeout in requests: the maximum time to wait for the other site to respond

CONCURRENT_REQUESTS_PER_IP = 1 # Maximum number of concurrent requests for a single IP address

COOKIES_ENABLED = True # Enable cookies


DEFAULT_REQUEST_HEADERS = {} # default headers attached to every request

USER_AGENT # masquerade as a browser

  1. scrapy-redis

Scrapy’s own queues do not support distributed crawling with multiple spiders; Redis provides the shared queue that makes this possible.
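
A typical settings.py sketch for switching Scrapy’s scheduler and duplicate filter over to scrapy-redis; the Redis URL is a placeholder:

```python
# settings.py -- let all spiders share one Redis-backed queue
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True              # keep the queue between runs
REDIS_URL = 'redis://localhost:6379'  # the queue shared by all spiders
```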

Redis is an in-memory database, also based on key/value storage.

The set method takes three parameters: key, value, and expiry duration.

Get takes only one parameter: the key.
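
A minimal sketch using the redis-py client; the key, value, and expiry are illustrative:

```python
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)
r.set('page:1', '<html>...</html>', ex=3600)  # key, value, expiry in seconds
print(r.get('page:1'))                        # get takes only the key
```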

  1. OCR: for simple verification codes, phone numbers, and the like, OCR can be used to recognize the characters in an image. The recognition rate is not very high, so treat it as a rough aid. For websites with anti-crawler measures, we can also use proxy IP addresses.
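
A minimal OCR sketch, assuming pytesseract with a local Tesseract install; captcha.png is a placeholder:

```python
import pytesseract
from PIL import Image

# Tesseract must be installed separately; accuracy on noisy captchas is limited.
img = Image.open('captcha.png')
text = pytesseract.image_to_string(img)  # recognize the characters in the image
print(text.strip())
```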