This is the fifth post in the Java crawler series. In the previous one, “The Java crawler server was blocked? Don’t panic, just switch servers,” we briefly discussed anti-crawler strategies and countermeasures, mainly for the case where an IP gets blocked. We have covered the basics of crawlers in the previous articles; in this one we will talk about crawler architecture.

In the previous posts our crawler was single-threaded. That is perfectly fine while debugging, but when a single-threaded crawler is used to collect web pages in a production environment, it exposes two fatal problems:

  • Collection efficiency is very low, because everything runs serially: the next page can only be fetched after the previous one has finished
  • Server CPU utilization is low: if our server has 8 cores and 16G or 32G of memory, running only one thread is a big waste

Unlike our local tests, where all we care about is that the results are extracted correctly, the online environment cannot ignore collection efficiency. In an era where time is money, nobody will give you unlimited time to collect slowly, so the single-threaded crawler is not a viable option. We need to turn the single-threaded design into a multi-threaded one to improve collection efficiency and make better use of the machine.

Designing a multi-threaded crawler is much more complicated than a single-threaded one, but unlike other high-concurrency businesses that must guarantee data safety, a multi-threaded crawler has fairly relaxed data-safety requirements, because each page can be treated as an independent unit. To do multi-threaded crawling well, two things must be handled properly: the first is unified maintenance of the URLs to be collected, and the second is URL deduplication. Let’s briefly talk about both.

Maintaining the URLs to be collected

In a multi-threaded crawler, each thread cannot maintain its own URLs the way a single-threaded crawler does; if it did, every thread would collect the same pages, and you would not really be crawling with multiple threads, you would just be collecting the same page several times. So the URLs to be collected must be maintained in one shared place. Each thread takes a URL from this shared store to carry out its collection task, and when new links are found on a page they are added back into the same container. Several containers are suitable for this unified URL maintenance:

  • Thread-safe JDK queues, such as LinkedBlockingQueue (see the sketch after this list)
  • High-performance NoSQL stores, such as Redis and MongoDB
  • MQ messaging middleware
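To make the first option concrete, here is a minimal sketch of several worker threads sharing one LinkedBlockingQueue of URLs to be collected. The class name, the seed URLs and the loop body are placeholders for this example, not part of the crawler we build later:

import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch: several workers pull from a single shared queue of URLs.
public class SharedQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<String> taskQueue = new LinkedBlockingQueue<>();
        // Seed the shared queue with a few start pages (placeholder URLs)
        taskQueue.put("https://voice.hupu.com/nba");
        taskQueue.put("https://voice.hupu.com/nba/2");
        taskQueue.put("https://voice.hupu.com/nba/3");

        for (int i = 0; i < 3; i++) {
            new Thread(() -> {
                String url;
                // poll() returns null when the queue is momentarily empty; a real crawler
                // would also check whether other workers are still producing new links
                while ((url = taskQueue.poll()) != null) {
                    System.out.println(Thread.currentThread().getName() + " would collect " + url);
                    // links found on the page would be put() back into taskQueue here
                }
            }, "worker-" + i).start();
        }
    }
}

Because LinkedBlockingQueue is thread-safe, the workers can poll and put concurrently without any extra locking.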

URL deduplication

URL deduplication is also a key step in multi-threaded collection. Without it we would collect a large number of duplicate URLs, which does nothing for efficiency. For example, when we collect the first page of a paginated news list, we also get the links to pages 2, 3, 4 and 5; while collecting the second page we get the links to pages 1, 3, 4 and 5 again. The queue of URLs to be collected would fill up with list-page links, leading to repeated collection and possibly even an infinite loop, so URL deduplication is needed. There are many ways to deduplicate URLs; here are a few common ones:

  • Save URLs in a database for deduplication, for example Redis or MongoDB
  • Store URLs in an in-memory hash table, such as a HashSet
  • Store the MD5 of each URL in the hash table, which saves space compared with the previous method (see the sketch after this list)
  • Use a Bloom filter for deduplication, which saves much more space but is less accurate.
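As a small illustration of the MD5 approach in the list above, here is a minimal sketch (the class and method names are made up for this example) that stores the 32-character MD5 digest of each URL in a HashSet instead of the full URL:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Minimal sketch: deduplicate URLs by keeping only their MD5 digests.
public class Md5UrlDeduper {
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time a URL is offered, false for any repeat.
    public synchronized boolean add(String url) {
        return seen.add(md5(url));
    }

    private static String md5(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        Md5UrlDeduper deduper = new Md5UrlDeduper();
        System.out.println(deduper.add("https://voice.hupu.com/nba")); // true
        System.out.println(deduper.add("https://voice.hupu.com/nba")); // false
    }
}

The digest has a fixed length, so very long URLs no longer drive memory usage; unlike a Bloom filter, though, memory still grows linearly with the number of URLs.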

Now that we know the two core points of a multi-threaded crawler, here is a simple multi-threaded crawler architecture diagram I drew, shown below:

Above we looked at the design of a multi-threaded crawler architecture; next, let’s try it out in Java, using Hupu news collection as a hands-on example. A Java multi-threaded crawler needs to take care of both maintaining the URLs to be collected and URL deduplication; since this is only a demonstration, we will use the JDK’s built-in containers for both: LinkedBlockingQueue as the container for the URLs to be collected, and HashSet for deduplication. Below is the core code of the Java multi-threaded crawler; the full code is on GitHub, with the address at the end of the article:

import java.io.IOException;
import java.util.Date;
import java.util.HashSet;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Multi-threaded crawler
 */
public class ThreadCrawler implements Runnable {
    // The number of pages collected
    private final AtomicLong pageCount = new AtomicLong(0);
    // Regular expression for list page links (also used as the start page)
    public static final String URL_LIST = "https://voice.hupu.com/nba";
    protected Logger logger = LoggerFactory.getLogger(getClass());
    // Queue of URLs to be collected
    LinkedBlockingQueue<String> taskQueue;
    // Set of collected links
    HashSet<String> visited;
    // Thread pool (CountableThreadPool is a helper class; see the full source on GitHub)
    CountableThreadPool threadPool;

    /**
     * @param url       start page
     * @param threadNum number of threads
     * @throws InterruptedException
     */
    public ThreadCrawler(String url, int threadNum) throws InterruptedException {
        this.taskQueue = new LinkedBlockingQueue<>();
        this.threadPool = new CountableThreadPool(threadNum);
        this.visited = new HashSet<>();
        // Add the start page to the queue of URLs to be collected
        this.taskQueue.put(url);
    }

    @Override
    public void run() {
        logger.info("Spider started!");
        while (!Thread.currentThread().isInterrupted()) {
            // Take the next URL to collect from the queue
            final String request = taskQueue.poll();
            // If the queue is empty and no worker threads are still running, we are done
            if (request == null) {
                if (threadPool.getThreadAlive() == 0) {
                    break;
                }
            } else {
                // Submit a collection task to the thread pool
                threadPool.execute(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            processRequest(request);
                        } catch (Exception e) {
                            logger.error("process request " + request + " error", e);
                        } finally {
                            // One more page collected
                            pageCount.incrementAndGet();
                        }
                    }
                });
            }
        }
        threadPool.shutdown();
        logger.info("Spider closed! {} pages downloaded.", pageCount.get());
    }

    /**
     * Process a collection request
     *
     * @param url
     */
    protected void processRequest(String url) {
        // Determine whether it is a list page
        if (url.matches(URL_LIST)) {
            // Parse the detail page links on the list page and add them to the queue
            processTaskQueue(url);
        } else {
            // Parse the detail page
            processPage(url);
        }
    }

    /**
     * Process a list page and add the extracted URLs to the queue
     *
     * @param url
     */
    protected void processTaskQueue(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            // Detail page links
            Elements elements = doc.select("div.news-list > ul > li > div.list-hd > h4 > a");
            elements.forEach(element -> {
                String request = element.attr("href");
                // If the link is neither in the queue nor in the visited set, add it to the queue
                if (!visited.contains(request) && !taskQueue.contains(request)) {
                    try {
                        taskQueue.put(request);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            });
            // List page (pagination) links
            Elements listUrls = doc.select("div.voice-paging > a");
            listUrls.forEach(element -> {
                String request = element.absUrl("href");
                // Check whether the extracted link matches the list page pattern
                if (request.matches(URL_LIST)) {
                    // If the link is neither in the queue nor in the visited set, add it to the queue
                    if (!visited.contains(request) && !taskQueue.contains(request)) {
                        try {
                            taskQueue.put(request);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                    }
                }
            });
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Parse a detail page
     *
     * @param url
     */
    protected void processPage(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            String title = doc.select("body > div.hp-wrap > div.voice-main > div.artical-title > h1").first().ownText();

            System.out.println(Thread.currentThread().getName() + " picked up Hupu news at " + new Date() + ": " + title);
            // Record the collected URL in the visited set
            visited.add(url);

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        try {
            new ThreadCrawler("https://voice.hupu.com/nba", 5).run();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

Let’s use 5 threads to collect the Hupu news list pages and see the effect. Run the program and we get the following result:

As you can see, we started 5 threads and collected 61 pages, which took 2 seconds, so the result is pretty good. Now let’s compare it with a single thread and see how big the difference is. Set the number of threads to 1, start the program again, and we get the following result:

As you can see, a single thread takes 7 seconds to collect the same 61 news items, almost 4 times as long as the multi-threaded run. Keep in mind that this is only 61 pages; the gap will only grow as the number of pages increases, so the efficiency gain of a multi-threaded crawler is substantial.

Distributed crawler architecture

A distributed crawler architecture is only needed for large-scale collection programs; in general, a single machine running multiple threads can meet business needs. I have no project experience with distributed crawlers myself, so there is not much I can teach from practice, but as technical people we should keep up with the technology even if we are not using it yet. It does no harm to understand it, so I went through a lot of material and came to the following conclusions:

A distributed crawler architecture follows the same idea as our multi-threaded crawler architecture; we only need to improve on the multi-threaded design to arrive at a simple distributed one. Because the crawlers in a distributed architecture are deployed on different machines, the URLs to be collected and the URLs already collected can no longer live in the memory of a single crawler machine; we need to maintain them in one shared place, such as Redis or MongoDB. Each machine then takes links from Redis or MongoDB instead of from an in-memory queue like LinkedBlockingQueue, and a simple distributed crawler architecture emerges. Of course there are many more details; since I have no experience with distributed architectures, I cannot speak to them, but if you are interested, you are welcome to discuss.
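As a minimal sketch of this idea, the class below keeps the queue of URLs to be collected in a Redis list and the collected URLs in a Redis set through the Jedis client, so every crawler machine shares the same state. The class name, key names and method names are assumptions for illustration, not code from the project:

import redis.clients.jedis.Jedis;

// Minimal sketch: replace the in-memory LinkedBlockingQueue/HashSet with shared Redis structures.
public class RedisUrlStore {
    private static final String QUEUE_KEY = "crawler:task-queue";
    private static final String VISITED_KEY = "crawler:visited";

    private final Jedis jedis = new Jedis("localhost", 6379);

    // Push a URL onto the shared queue only if no machine has collected it yet.
    public void offer(String url) {
        if (!jedis.sismember(VISITED_KEY, url)) {
            jedis.lpush(QUEUE_KEY, url);
        }
    }

    // Pop the next URL to collect; returns null when the queue is empty.
    public String poll() {
        return jedis.rpop(QUEUE_KEY);
    }

    // Mark a URL as collected so the other machines skip it.
    public void markVisited(String url) {
        jedis.sadd(VISITED_KEY, url);
    }
}

Each crawler machine would then call poll() instead of reading from its local LinkedBlockingQueue, and markVisited() after processing a page.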

Source: source code

Where this article falls short, I hope you will point it out, so that we can learn and make progress together.

Finally

A small advertisement: you are welcome to scan the QR code and follow the WeChat official account “The technical blog of the flathead brother”, and let’s make progress together.