This is the fourth post in the Java crawler series. In the previous article, we looked at how a Java crawler can handle asynchronously loaded data, covering both the built-in browser kernel approach and reverse parsing of the Ajax requests. In this article, we talk briefly about how resource sites block crawlers based on user access behavior, and what we can do about it.

Crawler blocking is a protective measure taken by resource websites. The most common anti-crawler strategy is based on user access behavior: for example, each IP may only visit a page X times within a given period, and anything beyond that is treated as a crawler. Whether a visitor is a crawler is judged from its access behavior as a whole, not just the number of visits; the User-Agent request header of each request and the interval between visits also count. In general the decision combines several factors, with the number of visits being the main one.
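To make this concrete, below is a minimal, hypothetical sketch in Java of how a site might count requests per IP within a fixed time window and flag heavy callers. The one-minute window and the 40-request threshold are made-up values for illustration, not Douban's actual rules.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SimpleIpRateLimiter {

    // Assumed values for illustration only
    private static final long WINDOW_MILLIS = 60_000; // length of the counting window
    private static final int MAX_REQUESTS = 40;       // requests allowed per window

    // ip -> { window start time, request count in this window }
    private final Map<String, long[]> counters = new ConcurrentHashMap<>();

    /** Returns true if this request should be rejected, e.g. with HTTP 403. */
    public boolean isBlocked(String ip) {
        long now = System.currentTimeMillis();
        long[] state = counters.compute(ip, (k, v) -> {
            if (v == null || now - v[0] > WINDOW_MILLIS) {
                return new long[]{now, 1};  // start a new window for this IP
            }
            v[1]++;                         // same window: count this request
            return v;
        });
        return state[1] > MAX_REQUESTS;
    }
}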

Anti-crawler measures are how a resource website protects itself from having its resources consumed by crawlers. For example, douban.com, which we used before, blocks crawlers based on user access behavior: once the number of visits per minute from an IP reaches a certain threshold, the site returns a 403 error to subsequent requests for a while, meaning you have no permission to access the page. So today we take Douban as an example and use a program to reproduce this phenomenon. Below is the Douban movie collector I wrote:

/**
 * Collect Douban movies
 */
public class CrawlerMovie {

    public static void main(String[] args) {
        try {
            CrawlerMovie crawlerMovie = new CrawlerMovie();
            // Douban movie link
            List<String> movies = crawlerMovie.movieList();
            // Create a pool of 10 threads
            ExecutorService exec = Executors.newFixedThreadPool(10);
            for (String url : movies) {
                // Execute thread
                exec.execute(new CrawlMovieThread(url));
            }
            // Shut down the thread pool
            exec.shutdown();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Reverse parsing to get the movie link list
     *
     * @return
     */
    public List<String> movieList() throws Exception {
        // Get 100 movie links
        String url = "Https://movie.douban.com/j/search_subjects?type=movie&tag= hot & sort = recommend&page _limit = 200 & page_start = 0";
        CloseableHttpClient client = HttpClients.createDefault();
        List<String> movies = new ArrayList<>(100);
        try {
            HttpGet httpGet = new HttpGet(url);
            CloseableHttpResponse response = client.execute(httpGet);
            System.out.println("Get douban movie list, return verification code:" + response.getStatusLine().getStatusCode());
            if (response.getStatusLine().getStatusCode() == 200) {
                HttpEntity entity = response.getEntity();
                String body = EntityUtils.toString(entity, "utf-8");
                // Format the request result as JSON
                JSONObject jsonObject = JSON.parseObject(body);
                JSONArray data = jsonObject.getJSONArray("subjects");
                for (int i = 0; i < data.size(); i++) {
                    JSONObject movie = data.getJSONObject(i);
                    movies.add(movie.getString("url"));
                }
            }
            response.close();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            client.close();
        }
        return movies;
    }
}

/**
 * Collector thread
 */
class CrawlMovieThread extends Thread {
    // Link to be collected
    String url;

    public CrawlMovieThread(String url) {
        this.url = url;
    }
    public void run() {
        try {
            Connection connection = Jsoup.connect(url)
                    .method(Connection.Method.GET)
                    .timeout(50000);
            Connection.Response response = connection.execute();
            System.out.println("Collect Douban movie, returned status code: " + response.statusCode());
        } catch (Exception e) {
            System.out.println("Collecting Douban movies, collecting anomalies:"+ e.getMessage()); }}}Copy the code

The logic of this program is fairly simple: first collect the Douban popular movie list, here by calling the Ajax endpoint directly, then parse out the detail page links and access them with multiple threads, because only multithreaded access generates enough requests to trigger Douban's limit. Douban's popular movies page looks like this:

Run the above program several times and you should eventually get the result shown below:

As you can see from the image above, the status code returned to HttpClient is 403, which means we have no permission to access the page; in other words, douban.com has identified us as a crawler and rejected our requests. Let's analyze our current access architecture. Since we access douban.com directly, the architecture looks like this:

If we want to break through this limitation, we cannot access Douban's servers directly; we need to bring in a third party to make the requests on our behalf, using a different intermediary for each visit so that we are never restricted. This is what an IP proxy does. The access architecture then looks like this:

To use IP proxies we need an IP proxy pool, so let's talk about IP proxy pools.
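Before building the pool, here is a minimal sketch of the proxy mechanism itself with Apache HttpClient. The proxy host and port below are placeholders; in the full collector later in this article they come from the proxy pool.

import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ProxyRequestSketch {

    public static void main(String[] args) throws Exception {
        CloseableHttpClient client = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet("https://movie.douban.com");
        // Route the request through a proxy server (placeholder address)
        RequestConfig config = RequestConfig.custom()
                .setProxy(new HttpHost("127.0.0.1", 8888))
                .setConnectTimeout(10000)
                .setSocketTimeout(10000)
                .build();
        httpGet.setConfig(config);
        try (CloseableHttpResponse response = client.execute(httpGet)) {
            // The target site sees the proxy's IP instead of ours
            System.out.println("Status via proxy: " + response.getStatusLine().getStatusCode());
        } finally {
            client.close();
        }
    }
}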

IP proxy pool

Many vendors sell proxy servers; I won't name them here, as a quick search for IP proxies turns up plenty. These vendors offer both free and paid proxy IPs; the paid ones have higher availability and speed, so if you need proxies in a production environment, use paid proxy IPs. For our own study, we can just collect the free public proxy IPs these vendors publish. Their performance and availability are poor, but that doesn't stop us from using them.

Since this is just a demo project, we will build our own IP proxy pool. How do we design an IP proxy pool? The following diagram shows the architecture of a simple IP proxy pool:

As shown in the architecture diagram above, an IP proxy pool system involves four modules: an IP collection module, an IP storage module, an IP detection module, and an API interface module.

  • IP collection module

Responsible for collecting proxy IPs from the major proxy vendors' sites. The more sites it collects from, the higher the availability of the proxy IPs.

  • IP storage module

Stores the collected proxy IPs; a high-performance store such as Redis is commonly used. We need to keep two kinds of data: proxy IPs that have been checked and found usable, and collected proxy IPs that have not been checked yet (a rough storage sketch follows this list).

  • IP detection module

Checks whether the collected IPs are usable, so that the IPs we provide have higher availability; unusable IPs are filtered out first.

  • API interface module

Provides the available proxy IPs to callers through an API.
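To illustrate the storage module and the API interface module together, here is a rough sketch assuming the Jedis Redis client and two made-up set keys: proxy:raw for collected but unchecked IPs and proxy:usable for IPs the detection module has verified. The real proxy_pool project introduced below implements this in Python and organizes its keys differently.

import redis.clients.jedis.Jedis;

public class ProxyPoolStoreSketch {

    private final Jedis jedis = new Jedis("127.0.0.1", 6379);

    // Collection module: drop a freshly scraped proxy into the unchecked set
    public void addRaw(String proxy) {
        jedis.sadd("proxy:raw", proxy);
    }

    // Detection module: promote a proxy after a successful test request
    public void markUsable(String proxy) {
        jedis.srem("proxy:raw", proxy);
        jedis.sadd("proxy:usable", proxy);
    }

    // API module: hand out a random verified proxy, e.g. "8.8.8.8:3128"
    public String getRandomUsable() {
        return jedis.srandmember("proxy:usable");
    }
}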

That is the design of an IP proxy pool; a rough understanding is enough, because there is no need to write an IP proxy pool service ourselves. There are already plenty of good open-source projects on GitHub, and there is no need to reinvent the wheel. I picked proxy_pool, an open-source IP proxy pool project with 8K stars on GitHub, for us to use. For details see: https://github.com/jhao104/proxy_pool

Deploy proxy_pool

proxy_pool is written in Python, but that doesn't matter, because it can be deployed as a container. A containerized deployment hides the environment setup: you just run the image to get the service, without needing to know the implementation, so Java programmers who don't know Python can use this project too. proxy_pool uses Redis to store the collected IPs, so you need to start a Redis service before starting proxy_pool. The Docker startup steps for proxy_pool are as follows.

  • Pull the image

docker pull jhao104/proxy_pool

  • Run the image

docker run --env db_type=REDIS --env db_host=127.0.0.1 --env db_port=6379 --env db_password=pwd_str -p 5010:5010 jhao104/proxy_pool

After running the image, wait a while, because the first round of IP collection and processing takes some time. Then visit http://{your_host}:5010/get_all/. If you get a result like the one shown below, the proxy_pool project has been deployed successfully.

Using the IP proxy

After setting up the IP proxy pool, we can use proxy IPs to collect Douban movies. We already know that, besides the IP, the User-Agent request header is another factor Douban uses to decide whether a visitor is a crawler, so we also forge the User-Agent request header, using a different User-Agent for each request.

We add the IP proxy and a random User-Agent request header to the Douban movie collector; the code is as follows:

public class CrawlerMovieProxy {

    /**
     * Common user agent list
     */
    static List<String> USER_AGENT = new ArrayList<String>(10) {
        {
            add("Mozilla / 5.0 (Linux; Android 4.4.1. Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19");
            add("Mozilla / 5.0 (Linux; U; Android 4.0.4; en-gb; Gt-i9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30");
            add("Mozilla / 5.0 (Linux; U; The Android 2.2. en-gb; Gt-p1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1");
            add("Mozilla / 5.0 (Windows NT 6.2; WOW64; The rv: 21.0) Gecko / 20100101 Firefox / 21.0");
            add("Mozilla / 5.0 (Android; Mobile; The rv: 14.0) Gecko/Firefox / 14.0 14.0");
            add("Mozilla / 5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36");
            add("Mozilla / 5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19");
            add("Mozilla / 5.0 (the device; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3");
            add("Mozilla / 5.0 (iPod; U; CPU like Mac OS X; En) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/3A101a Safari/419.3"); }};/** * Get user agent ** randomly@return* /
    public String randomUserAgent() {
        Random random = new Random();
        int num = random.nextInt(USER_AGENT.size());
        return USER_AGENT.get(num);
    }

    /**
     * Fill the proxy IP pool
     *
     * @param queue the queue to hold proxy IPs
     * @throws IOException
     */
    public void proxyIpPool(LinkedBlockingQueue<String> queue) throws IOException {


        // Get all available proxy IPs at once
        String proxyUrl = "http://192.168.99.100:5010/get_all/";

        CloseableHttpClient httpclient = HttpClients.createDefault();

        HttpGet httpGet = new HttpGet(proxyUrl);
        CloseableHttpResponse response = httpclient.execute(httpGet);
        if (response.getStatusLine().getStatusCode() == 200) {
            HttpEntity entity = response.getEntity();
            String body = EntityUtils.toString(entity, "utf-8");

            JSONArray jsonArray = JSON.parseArray(body);
            int size = Math.min(100, jsonArray.size());
            for (int i = 0; i < size; i++) {
                // Format the request result as JSON
                JSONObject data = jsonArray.getJSONObject(i);
                String proxy = data.getString("proxy");
                queue.add(proxy);
            }
        }
        response.close();
        httpclient.close();
        return;
    }


    /**
     * Randomly obtain one proxy IP address
     *
     * @return
     * @throws IOException
     */
    public String randomProxyIp() throws IOException {

        // A proxy IP address can be obtained randomly at a time
        String proxyUrl = "http://192.168.99.100:5010/get/";

        String proxy = "";

        CloseableHttpClient httpclient = HttpClients.createDefault();

        HttpGet httpGet = new HttpGet(proxyUrl);
        CloseableHttpResponse response = httpclient.execute(httpGet);
        if (response.getStatusLine().getStatusCode() == 200) {
            HttpEntity entity = response.getEntity();
            String body = EntityUtils.toString(entity, "utf-8");
            // Format the request result as JSON
            JSONObject data = JSON.parseObject(body);
            proxy = data.getString("proxy");
        }
        return proxy;
    }

    /**
     * Douban movie link list
     *
     * @return
     */
    public List<String> movieList(LinkedBlockingQueue<String> queue) {
        // Get 40 movie links
        String url = "https://movie.douban.com/j/search_subjects?type=movie&tag=热门&sort=recommend&page_limit=40&page_start=0";
        List<String> movies = new ArrayList<>(40);
        try {
            CloseableHttpClient client = HttpClients.createDefault();
            HttpGet httpGet = new HttpGet(url);
            // Set the IP proxy
            HttpHost proxy = null;
            // Randomly obtain a proxy IP address
            String proxy_ip = randomProxyIp();
            if (StringUtils.isNotBlank(proxy_ip)) {
                String[] proxyList = proxy_ip.split(":");
                System.out.println(proxyList[0]);
                proxy = new HttpHost(proxyList[0], Integer.parseInt(proxyList[1]));
            }
            // Get a random request header
            httpGet.setHeader("User-Agent", randomUserAgent());
            RequestConfig requestConfig = RequestConfig.custom()
                    .setProxy(proxy)
                    .setConnectTimeout(10000)
                    .setSocketTimeout(10000)
                    .setConnectionRequestTimeout(3000)
                    .build();
            httpGet.setConfig(requestConfig);
            CloseableHttpResponse response = client.execute(httpGet);
            System.out.println("Get douban movie list, return verification code:" + response.getStatusLine().getStatusCode());
            if (response.getStatusLine().getStatusCode() == 200) {
                HttpEntity entity = response.getEntity();
                String body = EntityUtils.toString(entity, "utf-8");
                // Format the request result as JSON
                JSONObject jsonObject = JSON.parseObject(body);
                JSONArray data = jsonObject.getJSONArray("subjects");
                for (int i = 0; i < data.size(); i++) {
                    JSONObject movie = data.getJSONObject(i);
                    movies.add(movie.getString("url"));
                }
            }
            response.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return movies;
    }


    public static void main(String[] args) {
        // Store queue of proxy IP
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue(100);

        try {
            CrawlerMovieProxy crawlerProxy = new CrawlerMovieProxy();
            // Initializes the IP proxy queue
            crawlerProxy.proxyIpPool(queue);
            // Get the douban movie list
            List<String> movies = crawlerProxy.movieList(queue);

            // Create a thread pool of fixed size
            ExecutorService exec = Executors.newFixedThreadPool(5);
            for (String url : movies) {
                // Execute thread
                exec.execute(new CrawlMovieProxyThread(url, queue, crawlerProxy));
            }
            // Thread closed
            exec.shutdown();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

/**
 * Collector thread
 */
class CrawlMovieProxyThread extends Thread {
    // Link to be collected
    String url;
    // Proxy IP queue
    LinkedBlockingQueue<String> queue;
    // Proxy helper class
    CrawlerMovieProxy crawlerProxy;

    public CrawlMovieProxyThread(String url, LinkedBlockingQueue<String> queue, CrawlerMovieProxy crawlerProxy) {
        this.url = url;
        this.queue = queue;
        this.crawlerProxy = crawlerProxy;
    }

    public void run() {
        String proxy;
        String[] proxys = new String[2];
        try {
            Connection connection = Jsoup.connect(url)
                    .method(Connection.Method.GET)
                    .timeout(50000);

            // If the proxy IP queue is empty, re-obtain the IP proxy
            if (queue.size() == 0) crawlerProxy.proxyIpPool(queue);
            // Get the proxy IP from the queue
            proxy = queue.poll();
            // Resolve the proxy IP address
            proxys = proxy.split(":");
            // Set the proxy IP address
            connection.proxy(proxys[0], Integer.parseInt(proxys[1]));
            // Set user agent
            connection.header("User-Agent", crawlerProxy.randomUserAgent());
            Connection.Response response = connection.execute();
            System.out.println("Collect Douban movie, returned status code: " + response.statusCode() + ", request IP: " + proxys[0]);
        } catch (Exception e) {
            System.out.println("Collecting Douban movies, collecting anomalies:" + e.getMessage() + ", request IP:" + proxys[0]); }}}Copy the code

Running the modified collector may take several attempts, because your proxy IP will not work every time. If the proxy IP works, you will get results like the following:

As the results show, of the 40 visits to movie detail pages, a large number of proxy IPs were invalid and only a small portion worked. This directly proves that free proxy IPs are not very usable, so if you need proxy IPs in production, it is best to pay for them. Even though the availability of the proxy pool we built ourselves is not high, the requests to Douban movies that did go through a working proxy succeeded, so using IP proxies has successfully bypassed Douban's restrictions.
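Because so many free proxies are dead, one practical tweak (not part of the collector above) is to retry a failed request with the next proxy from the queue. A rough sketch of that idea:

import java.util.concurrent.LinkedBlockingQueue;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class RetryWithNextProxySketch {

    /** Tries up to three proxies from the queue; returns the status code, or -1 if all fail. */
    public static int fetchWithRetry(String url, LinkedBlockingQueue<String> queue) {
        for (int attempt = 0; attempt < 3 && !queue.isEmpty(); attempt++) {
            String[] proxy = queue.poll().split(":");
            try {
                Connection.Response response = Jsoup.connect(url)
                        .proxy(proxy[0], Integer.parseInt(proxy[1]))
                        .timeout(10000)
                        .execute();
                return response.statusCode();   // success, stop retrying
            } catch (Exception e) {
                System.out.println("Proxy " + proxy[0] + " failed, trying the next one");
            }
        }
        return -1;                              // every attempt failed
    }
}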

There are many reasons a crawler's server can get blocked. This article mainly showed how to bypass douban.com's access restrictions by setting an IP proxy and forging the User-Agent request header. So how do we keep our programs from being treated as crawlers by resource sites? We need to do the following three things:

  • Forge the User-Agent request header
  • Use IP proxies
  • Randomize the collection interval (a short sketch follows this list)
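The first two points are demonstrated by the code above. For the third, a minimal sketch of a randomized pause between two requests might look like this; the 1-5 second range is an arbitrary example, not a value Douban publishes.

import java.util.concurrent.ThreadLocalRandom;

public class RandomIntervalSketch {

    // Sleep for a random interval before sending the next request
    public static void politePause() throws InterruptedException {
        long millis = ThreadLocalRandom.current().nextLong(1000, 5000);
        Thread.sleep(millis);
    }
}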

I hope you found this article helpful. The next article will explore multi-threaded crawlers. If you are interested in crawlers, feel free to follow along; let's learn from each other and improve together.

Where this article falls short, I hope you will point it out, so we can learn and progress together.

Finally

A small advertisement: you are welcome to scan the QR code and follow the WeChat public account "The technical blog of the flathead brother", and let's make progress together.