
It takes about 15 minutes to read this article.

Since my previous work involved crawler and data-collection development, I inevitably had to deal with “proxy IPs” along the way. This article records how to build a crawler proxy service, focusing mainly on the design ideas.

Background

Anyone who has built crawlers knows there are plenty of websites and plenty of data to scrape. If the crawler goes too fast, it will inevitably trigger the site’s anti-crawling mechanism, and the way almost every site deals with crawlers is the same: block the IP.

So if we still want to scrape these sites stably and continuously, how do we solve this? There are generally two options:

  • Keep crawling from the same server IP, but slow the crawl rate down
  • Scrape with a pool of multiple proxy IPs

The first option sacrifices speed, and our time is generally valuable; ideally we want as much data as possible in as little time as possible, so the second option is recommended. The question then becomes: where do we find that many proxy IPs?

Finding proxies

The most direct way is to search for them.

For example, type the keyword “free proxy IP” into Google, Bing, or Baidu; the first few pages of results are almost all websites that provide proxy IPs. Open them one by one and you will find that nearly every one is a list page showing dozens or even hundreds of proxy IPs.

However, look closely and you will find that each site offers only a limited number of free IPs, and once you try them, some turn out to be dead already. After all, they would much rather you buy their paid proxies.

As a resourceful programmer, you can’t be deterred by that. Think about it: a search engine turns up plenty of proxy websites, and each one lists dozens or hundreds of proxy IPs. With just 10 such sites, that already adds up to hundreds or even thousands of proxies.

So the plan is simple: collect these proxy-list websites and write a scraper that grabs the free proxy IPs from them. Doesn’t that sound easy?
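As a very rough sketch, a collector in Python might look like the snippet below. The URL is only a placeholder for whichever proxy-list pages you actually gather, and the ip:port regular expression is an assumption; real sites usually need their own parsing rules.

    import re
    import requests

    # Placeholder: replace with the proxy-list pages you actually collected.
    PROXY_SITES = [
        "http://example.com/free-proxy-list",
    ]

    # Very loose "ip:port" pattern; real pages usually need per-site parsing rules.
    IP_PORT_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):(\d{2,5})")

    def collect_proxies():
        proxies = set()
        for site in PROXY_SITES:
            try:
                html = requests.get(site, timeout=10).text
            except requests.RequestException:
                continue  # skip sites that are down or too slow
            for ip, port in IP_PORT_RE.findall(html):
                proxies.add(f"{ip}:{port}")
        return proxies

    if __name__ == "__main__":
        for proxy in collect_proxies():
            print(proxy)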

Testing the proxies

With this idea, you can write a collector and end up with hundreds or thousands of proxy IPs. Naturally, the more proxy websites you collect, the more proxy IPs you get, so we try to gather as many proxy websites as possible.

Wait. With this many proxy IPs, is someone really giving them all away for free?

Of course not. As mentioned earlier, many of these proxies are already dead. So what do we do? How do we know which proxies work and which don’t?

Just write a small HTTP program that sends a request through each proxy to a stable website and checks whether the request succeeds. If it does, the proxy IP is usable; if it fails, the proxy is invalid.

The quickest way to check a single proxy is with curl:

    # Use proxy 48.139.133.93:3128 to visit the NetEase homepage
    curl -x "48.139.133.93:3128" "http://www.163.com"

Of course, curl is just for a quick demonstration. In practice the best approach is to write a multithreaded proxy tester that sends a request through each proxy to a target website and, based on the results, outputs the batch of proxies that actually work.
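Here is a minimal sketch of such a tester, assuming Python with the requests library, a raw_proxies.txt file of ip:port entries produced by the collector, http://www.163.com as the stable test page from the curl example, and an arbitrary thread count; all of those are assumptions you can swap out.

    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    TEST_URL = "http://www.163.com"   # the stable page used in the curl example

    def check_proxy(proxy):
        """Return (proxy, response_time) if the proxy works, otherwise None."""
        proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        start = time.time()
        try:
            resp = requests.get(TEST_URL, proxies=proxies, timeout=5)
            if resp.status_code == 200:
                return proxy, time.time() - start
        except requests.RequestException:
            pass
        return None

    def main():
        with open("raw_proxies.txt") as f:            # one ip:port per line
            candidates = [line.strip() for line in f if line.strip()]

        with ThreadPoolExecutor(max_workers=50) as pool:
            results = [r for r in pool.map(check_proxy, candidates) if r]

        with open("good_proxies.txt", "w") as f:      # proxy plus its response time
            for proxy, elapsed in sorted(results, key=lambda r: r[1]):
                f.write(f"{proxy} {elapsed:.2f}\n")

    if __name__ == "__main__":
        main()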

Using the proxies

With the proxy tester in place, we can find out which proxies are actually usable.

Using them is then straightforward.

For example, if we write the available proxy IPs we just obtained to a file, one IP per line, our crawler can then do the following (a short sketch follows the list):

  • Read the proxy file and load the proxy list into an array
  • Randomly pick a proxy IP from the array and use it to make the HTTP request
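A minimal sketch of that usage pattern, assuming the good_proxies.txt file written by the tester above (the proxy is the first token on each line) and the requests library:

    import random

    import requests

    def load_proxies(path="good_proxies.txt"):
        with open(path) as f:
            # the first token on each line is the ip:port; ignore any extra columns
            return [line.split()[0] for line in f if line.strip()]

    def fetch(url, proxy_list):
        proxy = random.choice(proxy_list)             # pick a random proxy per request
        proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        return requests.get(url, proxies=proxies, timeout=10)

    if __name__ == "__main__":
        pool = load_proxies()
        print(fetch("http://www.163.com", pool).status_code)

Each request picks a fresh random proxy, so the load is spread across the whole pool.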

This way, as long as the pool of available proxies stays in the hundreds or thousands, we can basically keep scraping a website for a while; generally speaking, collecting thousands or tens of thousands of records is not a problem.

However, if we want to collect data from a website continuously, or crawl millions or even billions of pages, this batch of proxy IPs will gradually go stale. What do we do then?

A continuous supply of proxies

With the method just described, we scrape a batch of proxy websites and run the tester to output a list of available proxy IPs. But that is a one-off exercise, and the number of proxies is usually small, which is certainly not enough for continuous collection. So how do we keep finding usable proxy IPs?

Building on the approach above, we can optimize as follows:

  • Collect more proxy-list websites (to broaden the data source)
  • Monitor these websites regularly and collect their proxy IP lists
  • Automatically test each proxy IP’s availability and output the usable ones (to a file or database)
  • Have the crawler load the file or database and randomly pick a proxy IP for each HTTP request

Following this idea, we can write a program that collects and tests proxy IPs automatically, and our crawler can then periodically read the file or database to get fresh, usable proxies.
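A simple way to keep this running is a scheduling loop. The sketch below assumes the collect_proxies() and check_proxy() helpers from the earlier sketches (the module names collector and tester are hypothetical) and an arbitrary 30-minute interval.

    import time

    # collect_proxies() and check_proxy() are the helpers sketched earlier; the
    # module names collector and tester are hypothetical.
    from collector import collect_proxies
    from tester import check_proxy

    REFRESH_INTERVAL = 30 * 60   # re-collect every 30 minutes (arbitrary choice)

    def refresh_once():
        candidates = collect_proxies()
        good = [r for r in map(check_proxy, candidates) if r]
        with open("good_proxies.txt", "w") as f:
            for proxy, elapsed in sorted(good, key=lambda r: r[1]):
                f.write(f"{proxy} {elapsed:.2f}\n")

    if __name__ == "__main__":
        while True:
            refresh_once()
            time.sleep(REFRESH_INTERVAL)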

But there is still a small problem: how do we know the quality of each proxy IP, that is, how fast each proxy is?

While testing availability, we can record the response time of the test request; that response time is the proxy’s quality. The shorter the response time, the higher the quality of the proxy IP.

With a quality score for each proxy, we can give priority to the high-quality ones when using them and improve the success rate of the crawler’s page fetches.

However, we shouldn’t lean on these high-quality proxies too heavily, or the target sites will quickly block them. So what do we do about that?

We keep optimizing. When using the proxies, we can limit how many times each proxy IP may be used within a short period, for example at most 10 times within 5 minutes.

That way we keep the scraping quality high while ensuring that no proxy IP gets blocked for heavy use in a short period.
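One way to implement both rules, preferring faster proxies while capping each at 10 uses per 5 minutes, is to keep a sliding window of recent-use timestamps per proxy. This is only a sketch; the numbers come straight from the example above.

    import time
    from collections import defaultdict, deque

    MAX_USES = 10        # at most 10 uses ...
    WINDOW = 5 * 60      # ... per 5 minutes

    recent_uses = defaultdict(deque)   # proxy -> timestamps of its recent uses

    def pick_proxy(ranked_proxies):
        """ranked_proxies: ip:port strings sorted by response time, fastest first."""
        now = time.time()
        for proxy in ranked_proxies:                # prefer the highest-quality proxy
            uses = recent_uses[proxy]
            while uses and now - uses[0] > WINDOW:  # drop timestamps outside the window
                uses.popleft()
            if len(uses) < MAX_USES:
                uses.append(now)
                return proxy
        return None                                 # every proxy is rate-limited right now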

As a service

After this series of optimizations we have a working proxy service, but it is still based on a file or a database.

To use the proxies, the crawler has to read the file or database and then select a proxy IP according to certain rules, which is rather tedious. Can we make it easier for the crawler? This is where turning proxy access into a service comes in.

There is a well-known piece of server software called Squid, a forward proxy, and its cache_peer mechanism handles this job very well.

We simply write the list of available proxy IPs into a Squid configuration file in the format its cache_peer rules require. cache_peer also supports setting a weight and usage limits for each proxy IP, which means Squid can automatically schedule and select among our proxies based on the configuration.
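Rather than reproduce a full squid.conf here, the sketch below shows roughly how the generated cache_peer lines could look, using Python to render them from (proxy, response time) pairs. The exact option set (no-query, weighted-round-robin, weight=, max-conn=) is an assumption based on common Squid setups, so check it against the documentation for your Squid version.

    # Hedged sketch: turn (ip:port, response_time) pairs into cache_peer lines.
    # The option set below is an assumption based on common Squid configurations:
    # no-query skips ICP, weighted-round-robin balances by weight, and max-conn
    # caps concurrent connections to that peer. Verify against your Squid docs.
    PEER_TEMPLATE = (
        "cache_peer {ip} parent {port} 0 "
        "no-query weighted-round-robin weight={weight} max-conn={max_conn}"
    )

    def build_cache_peer_lines(proxies):
        """proxies: iterable of (ip_port, response_time_in_seconds)."""
        lines = []
        for ip_port, elapsed in proxies:
            ip, port = ip_port.split(":")
            weight = max(1, int(10 - elapsed * 2))   # faster proxy -> larger weight
            lines.append(PEER_TEMPLATE.format(ip=ip, port=port,
                                              weight=weight, max_conn=10))
        return "\n".join(lines)

    if __name__ == "__main__":
        sample = [("48.139.133.93:3128", 0.8), ("121.10.10.1:8080", 2.5)]
        print(build_cache_peer_lines(sample))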

Once the configuration is in place, start Squid, and all of the configured proxy IPs can be used through Squid’s listening port.

Suppose our crawler is deployed on server A, Squid on server B, the target website is server C, and our proxy IPs are D, E, and F.

  • Without a proxy: crawler server A requests website server C directly
  • With proxies used the ordinary way: crawler server A sets proxy IP D/E/F on each request to website server C
  • With Squid: crawler server A requests Squid on server B, whose cache_peer mechanism automatically schedules proxy IPs D/E/F and finally reaches website server C

The crawler no longer needs to think about how to load and select proxies. We just write the proxy IP list into Squid’s configuration file according to its rules, and Squid manages and selects the proxies for us. Most importantly, the crawler only needs to point at Squid’s port to use the whole proxy pool!
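From the crawler’s point of view the change is tiny: every request simply uses Squid’s address as its proxy. A minimal sketch, assuming Squid on server B listening on its default port 3128 (adjust to whatever http_port you configured):

    import requests

    # Server B runs Squid; 3128 is Squid's default listening port. Both the host
    # name and the port are assumptions, so use your own http_port setting.
    SQUID_PROXY = {"http": "http://b-server-ip:3128", "https": "http://b-server-ip:3128"}

    def fetch(url):
        # Squid picks one of the configured cache_peer proxies for us
        return requests.get(url, proxies=SQUID_PROXY, timeout=10)

    if __name__ == "__main__":
        print(fetch("http://www.163.com").status_code)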

Integration

OK, now that the service layer is in place, the only step left is to integrate everything. Let’s walk through the whole workflow:

  1. Collect as many proxy-list websites as possible, run the proxy collection program against them periodically (every 30 minutes or every hour is fine), parse out all the proxy IPs, and write them to the database

  2. The proxy tester pulls all the proxy IPs from the database, sends a request through each one to a stable website, marks in the database whether the proxy is usable based on the result, and also records the response time

  3. Write a program that loads all usable proxies from the database, computes a usage weight and usage limit for each based on its response time, and writes them into the Squid configuration file in cache_peer format

  4. Write a program or script that triggers Squid to reload its configuration file so it picks up the latest proxy IP list (a sketch of steps 3 and 4 follows this list)

  5. Repeat steps 1-4 periodically so that fresh, usable proxies keep flowing into Squid

  6. The crawler simply points at Squid’s service IP and port and focuses purely on collecting data from the target websites
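For steps 3 and 4, a hedged sketch of the glue code: regenerate the cache_peer section, rewrite the config, and ask Squid to reload it with squid -k reconfigure (Squid’s standard reload command). The file paths and the load_available_proxies() helper are assumptions for illustration.

    import subprocess

    # build_cache_peer_lines() is the helper sketched earlier; load_available_proxies()
    # is a hypothetical function that reads (proxy, response_time) rows from the database.
    from peers import build_cache_peer_lines
    from storage import load_available_proxies

    SQUID_CONF = "/etc/squid/squid.conf"       # path is an assumption for your install
    BASE_CONF = "/etc/squid/squid.base.conf"   # static part of the config, kept separate

    def refresh_squid():
        peers = build_cache_peer_lines(load_available_proxies())
        with open(BASE_CONF) as base, open(SQUID_CONF, "w") as out:
            out.write(base.read() + "\n" + peers + "\n")
        # ask Squid to reload its configuration without a full restart
        subprocess.run(["squid", "-k", "reconfigure"], check=True)

    if __name__ == "__main__":
        refresh_squid()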

With that, a complete proxy service is in place, one that continuously produces proxy IPs of known quality. The crawler doesn’t have to care about collecting or testing proxies at all; it just uses Squid as the unified entry point and crawls data.

The above only lays out the design ideas for building such a proxy service; once the ideas are clear, writing the code is relatively simple.

Crawler series:

  • How to build a crawler proxy service?
  • How to build a universal vertical crawler platform?
  • Scrapy source code analysis (1): Architecture overview
  • Scrapy source code analysis (2): How does Scrapy run?
  • Scrapy source code analysis (3): What are Scrapy's core components?
  • Scrapy source code analysis (4): How is a scraping task completed?

My advanced Python series:

  • Python Advanced – How to implement a decorator?
  • Python Advanced – How to use magic methods correctly? (Part 1)
  • Python Advanced – How to use magic methods correctly? (Part 2)
  • Python Advanced – What is a metaclass?
  • Python Advanced – What is a context manager?
  • Python Advanced – What is an iterator?
  • Python Advanced – How to use yield correctly?
  • Python Advanced – What is a descriptor?
  • Python Advanced – Why does the GIL make multithreading so useless?

Want to read more hardcore technical articles? Follow the “Water Drops and Silver Bullets” public account to get quality technical content first. I’m a senior back-end developer with 7 years of experience, and I explain technology in plain, simple terms.