This is the first day of my participation in Gwen Challenge

Introduction

crawler4j is an open source Java crawler framework with more than 4K stars. I think its source code is worth studying, so I wrote this article. If you find a mistake, please let me know!

The focus of this article is not the full logic and details of the crawler4j source code. Instead, it uses crawler4j as an example of how wait and notify are used in Java, to see how a popular open source project applies them in real code.

For a quick read, you can skip straight to the core logic section or to the minimal wait/notify implementation.

Body

The two most important classes in crawler4j are CrawlController and WebCrawler. The first sets up and starts the crawl, and the second is the core implementation class of the crawler. Most of the code discussed here is in the CrawlController class.

If you have used CrawlController, you know there are two ways to start it:

// Usage 1: blocking. The line after start runs only once all crawler threads have finished
controller.start(factory, numberOfCrawlers);

// Usage 2: non-blocking. After startNonBlocking, the code before waitUntilFinish runs immediately, and the caller blocks at waitUntilFinish
controller.startNonBlocking(factory, numberOfCrawlers);
// Everything in between runs while the crawl is in progress
controller.waitUntilFinish();

The source walkthrough below focuses on how these two entry points use wait and notify.

Two important variables

First, there are two important variables defined in CrawlController that are required for this functionality:

/** Is the crawling of this session finished? */
protected boolean finished;

protected final Object waitingLock = new Object();

  • finished indicates whether the crawl is complete
  • waitingLock is the object used as the lock

Blocking start method

To keep the focus on the important parts, unrelated code has been omitted and replaced with comments.

The entry point is the start method that we call:

/**
* Start the crawling session and wait for it to finish.
*
* @param crawlerFactory
*            factory to create crawlers on demand for each thread
* @param numberOfCrawlers
*            the number of concurrent threads that will be contributing in
*            this crawling session.
* @param <T> Your class extending WebCrawler
*/
public <T extends WebCrawler> void start(WebCrawlerFactory<T> crawlerFactory,
                                        int numberOfCrawlers) {
    this.start(crawlerFactory, numberOfCrawlers, true);
}

It delegates to an overloaded start method that takes one more argument, isBlocking. That method is shown below, with comments:

protected <T extends WebCrawler> void start(final WebCrawlerFactory<T> crawlerFactory,
                                            final int numberOfCrawlers, boolean isBlocking) {

    // Create the specified number of crawler threads from the supplied crawlerFactory and start them

    // Create the monitor thread
    Thread monitorThread = new Thread(new Runnable() {

        @Override
        public void run() {
            try {
                synchronized (waitingLock) {

                    while (true) {
                        // Sleep for one monitoring interval
                        sleep(config.getThreadMonitoringDelaySeconds());
                        boolean someoneIsWorking = false;

                        // Part 1:
                        // Check whether each crawler thread is running normally; if not, take corrective action
                        // (code omitted; see the source on GitHub if you are interested)

                        // Part 2:
                        // Check whether any thread is still working; if not, prepare to exit and release resources
                        // (the source logs "It looks like no thread is working, waiting for ..." and similar messages here)
                        // notifyAll is called during shutdown
                        if (!someoneIsWorking && shutOnEmpty) {
                            // Check again that no thread is working and no URLs are waiting in the queue

                            // Release resources

                            waitingLock.notifyAll();

                            // Release resources
                        }
                    }
                }
            } catch (Throwable e) {
                if (config.isHaltOnError()) {
                    // An error occurred
                    setError(e);
                    synchronized (waitingLock) {
                        // Release resources

                        waitingLock.notifyAll();

                        // Release resources
                    }
                } else {
                    logger.error("Unexpected Error", e);
                }
            }
        }
    });

    monitorThread.start();

    // If blocking is requested, call waitUntilFinish; the calling thread blocks here
    if (isBlocking) {
        waitUntilFinish();
    }
}

As you can see from the code, the blocking happens in the last few lines: the call to waitUntilFinish, made after the monitor thread has been started.

Once the monitor thread determines that the crawl is over, it calls waitingLock.notifyAll() to end the blocking. How does this work? Let's look at the waitUntilFinish method.

How waitUntilFinish blocks the start method

This method is short, so here it is in full.

/**
* Wait until this crawling session finishes.
*/
public void waitUntilFinish() {
    while (!finished) {
        synchronized (waitingLock) {
            if (config.isHaltOnError()) {
                Throwable t = getError();
                if (t != null && config.isHaltOnError()) {
                    if (t instanceof RuntimeException) {
                        throw (RuntimeException) t;
                    } else if (t instanceof Error) {
                        throw (Error) t;
                    } else {
                        throw new RuntimeException("error on monitor thread", t);
                    }
                }
            }
            if (finished) {
                return;
            }
            try {
                // Release the lock and wait until notified
                waitingLock.wait();
            } catch (InterruptedException e) {
                logger.error("Error occurred", e);
            }
        }
    }
}
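
The error handling in waitUntilFinish (a background thread records a Throwable, and the waiting thread rethrows it) can be tried out in isolation. The following is a minimal sketch, not crawler4j code: the class ErrorPropagationDemo and the method runInBackground are hypothetical names, and unlike the excerpt it marks the work as finished even when the job fails.

```java
class ErrorPropagationDemo {
    private final Object waitingLock = new Object();
    private boolean finished = false; // guarded by waitingLock
    private Throwable error = null;   // guarded by waitingLock

    /** Runs the job on a new thread and records any failure (hypothetical helper). */
    void runInBackground(Runnable job) {
        new Thread(() -> {
            Throwable failure = null;
            try {
                job.run();
            } catch (Throwable t) {
                failure = t;
            }
            synchronized (waitingLock) {
                error = failure;
                finished = true;
                waitingLock.notifyAll(); // wake up the thread in waitUntilFinish
            }
        }).start();
    }

    /** Blocks until the job is done, then rethrows its failure on the calling thread. */
    void waitUntilFinish() {
        synchronized (waitingLock) {
            while (!finished) {
                try {
                    waitingLock.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("interrupted while waiting", e);
                }
            }
            if (error instanceof RuntimeException) {
                throw (RuntimeException) error;
            } else if (error instanceof Error) {
                throw (Error) error;
            } else if (error != null) {
                throw new RuntimeException("error on background thread", error);
            }
        }
    }

    public static void main(String[] args) {
        ErrorPropagationDemo demo = new ErrorPropagationDemo();
        demo.runInBackground(() -> { throw new IllegalStateException("boom"); });
        try {
            demo.waitUntilFinish();
        } catch (IllegalStateException e) {
            System.out.println("caught on the waiting thread: " + e.getMessage());
        }
    }
}
```

The three-way rethrow mirrors the excerpt: unchecked exceptions and errors are rethrown as they are, and anything else is wrapped in a RuntimeException so the method signature needs no throws clause.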

First, the monitor thread started by start and the waitUntilFinish method both guard their critical sections with synchronized on the same lock, waitingLock, so only one of them can be inside its block at a time. What we want is for waitUntilFinish to block until the crawler threads are done (which the monitor thread detects inside its synchronized block) and only then return. Calling waitingLock.wait() releases the lock while the caller waits, and notifyAll() wakes the caller up again. That is how this part of the source works, and it is the idea behind wait and notify.
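
The lock handoff described above can be observed directly. The sketch below is an illustration under assumptions, not crawler4j code: WaitReleasesLockDemo and its event strings are made up for the example. It records events only while holding the lock, so the recorded order shows that wait() gives up the lock and that the waiter resumes only after the notifier has left its synchronized block.

```java
import java.util.ArrayList;
import java.util.List;

class WaitReleasesLockDemo {
    private static final Object lock = new Object();
    private static boolean ready = false;                          // guarded by lock
    private static final List<String> events = new ArrayList<>(); // guarded by lock

    static List<String> demo() {
        synchronized (lock) {
            events.clear();
            ready = false;
        }
        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                events.add("waiter: acquired lock, about to wait");
                while (!ready) {
                    try {
                        lock.wait(); // releases the lock while waiting
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                events.add("waiter: resumed after wait");
            }
        });
        waiter.start();
        try {
            boolean notified = false;
            while (!notified) {
                synchronized (lock) {
                    // Seeing the waiter's event here means the waiter is inside wait():
                    // it still sits in its synchronized block, yet we hold the lock.
                    if (!events.isEmpty()) {
                        events.add("notifier: acquired lock while waiter is waiting");
                        ready = true;
                        lock.notifyAll();
                        events.add("notifier: notified, still holding lock");
                        notified = true;
                    }
                }
                if (!notified) {
                    Thread.sleep(10); // give the waiter time to start
                }
            }
            waiter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        synchronized (lock) {
            return new ArrayList<>(events);
        }
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

The waiter's "resumed" event always comes last, after both notifier events, because the waiter cannot reacquire the lock until the notifier exits its synchronized block.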

Core logic

Let's review the logic of the source code:

  1. The monitorThread acquires waitingLock with synchronized in its run method and loops, checking whether all crawler threads and crawl tasks have completed.
  2. waitUntilFinish acquires waitingLock with synchronized and checks the finished flag. If the crawl is not finished, it calls wait, which releases the lock so that the monitorThread's run method can take it.
  3. Whenever waitUntilFinish regains waitingLock after a wait call, it checks finished again to decide whether to loop and call wait once more.
  4. After confirming that all crawler threads have finished, the monitorThread marks the crawl as finished and calls notifyAll (like notify, except that it wakes every thread waiting on the lock rather than just one) so that waitUntilFinish can continue from wait.
  5. waitUntilFinish reacquires the lock and resumes at the code after the wait call. When the loop checks finished and finds that the crawl is complete, it returns, and the whole process is done.

Many details are not covered here, such as the delay settings, the monitoring interval, and resource cleanup. They are not the focus of this article; refer to the source code for them.
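
The five steps above can be sketched with plain threads. This is a simplified illustration under assumptions, not the crawler4j implementation: MiniController is a hypothetical class, workers are ordinary Runnable tasks, and unlike the excerpt the monitor takes the lock only briefly when it publishes the result instead of holding it for the whole loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class MiniController {
    private final Object waitingLock = new Object();
    private boolean finished = false; // guarded by waitingLock

    /** Starts the workers and a monitor thread, then returns immediately. */
    void startNonBlocking(List<Runnable> tasks) {
        List<Thread> workers = new ArrayList<>();
        for (Runnable task : tasks) {
            Thread worker = new Thread(task);
            workers.add(worker);
            worker.start();
        }
        Thread monitor = new Thread(() -> {
            while (true) {
                boolean someoneIsWorking = false;
                for (Thread worker : workers) {
                    if (worker.isAlive()) {
                        someoneIsWorking = true;
                    }
                }
                if (!someoneIsWorking) {
                    synchronized (waitingLock) {
                        finished = true;         // step 4: mark the work as done
                        waitingLock.notifyAll(); // and wake up waitUntilFinish
                    }
                    return;
                }
                try {
                    Thread.sleep(50); // monitoring interval (arbitrary value)
                } catch (InterruptedException e) {
                    return; // the sketch never interrupts the monitor
                }
            }
        });
        monitor.start();
    }

    /** Blocking variant: start everything, then wait for the monitor's signal. */
    void start(List<Runnable> tasks) {
        startNonBlocking(tasks);
        waitUntilFinish();
    }

    void waitUntilFinish() {
        synchronized (waitingLock) {
            while (!finished) { // steps 2, 3 and 5: re-check the flag after every wakeup
                try {
                    waitingLock.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger completed = new AtomicInteger();
        List<Runnable> tasks = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            tasks.add(completed::incrementAndGet); // stand-in for crawling work
        }
        new MiniController().start(tasks);
        System.out.println("tasks completed: " + completed.get());
    }
}
```

Holding the lock only while setting the flag keeps waitUntilFinish from competing with the monitor's sleep, which is a design choice of this sketch rather than something taken from the source.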

Looking past the surface to the essence

One thread calls wait to give up the lock and pause. Another thread calls notify/notifyAll to tell the waiting thread that it can continue.

A few more details:

  • When notify is called, the thread that called wait does not acquire the lock immediately. It can acquire the lock only after the thread that called notify leaves its synchronized block and releases it.
  • After wait releases the lock and later regains it, execution continues on the line after the wait call rather than returning to the beginning of the synchronized block. This is why the source code uses a while loop to re-check the flag: without the loop, if the thread woke up and regained the lock before the crawl was complete (that is, while finished is still false), waitUntilFinish would return too early.
  • wait also has an overload, wait(long timeout), that wakes up on its own once the timeout in milliseconds has elapsed.
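
The timed variant combines naturally with the while loop from the second point. Below is a minimal sketch, assuming a hypothetical TimedWaitDemo class: it recomputes the remaining time after every wakeup so that early wakeups do not extend the total wait.

```java
class TimedWaitDemo {
    private final Object lock = new Object();
    private boolean finished = false; // guarded by lock

    void finish() {
        synchronized (lock) {
            finished = true;
            lock.notifyAll();
        }
    }

    /** Waits at most timeoutMillis; returns true if finished, false on timeout. */
    boolean waitUntilFinish(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (lock) {
            while (!finished) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false; // timed out
                }
                // wait(0) means "wait forever", so round up to at least 1 ms
                long remainingMillis = Math.max(1, remainingNanos / 1_000_000L);
                try {
                    lock.wait(remainingMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return finished;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        TimedWaitDemo demo = new TimedWaitDemo();
        System.out.println("finished before anyone called finish(): " + demo.waitUntilFinish(100));
        new Thread(demo::finish).start();
        System.out.println("finished after finish(): " + demo.waitUntilFinish(5000));
    }
}
```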

A minimal implementation

To make this easier to follow, here is a minimal version of what crawler4j does (only the wait/notify part is implemented):

package thread_practice;

public class WaitNotify {

    private final Object waitingLock = new Object();
    private boolean isFinished = false;

    public void start() {
        synchronized (waitingLock) {
            isFinished = false;
            System.out.println("doing sth...");
            try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            System.out.println("done.");
            isFinished = true;
            waitingLock.notifyAll();
        }
    }

    public void waitUntilFinish() {
        synchronized (waitingLock) {
            // Loop so that a wakeup before isFinished is set does not end the wait early
            while (!isFinished) {
                try {
                    waitingLock.wait();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) {
        WaitNotify wn = new WaitNotify();
        new Thread(() -> wn.start()).start();
        wn.waitUntilFinish();
        System.out.println("continue another thing...");
    }
}

Within the first 5 seconds of running the program, only "doing sth..." is printed.

After 5 seconds, "done." is printed, followed by "continue another thing...".

You can see that the main thread does block at wn.waitUntilFinish() and continues after 5 seconds. The logic is the same as explained in the previous sections; this simplified version extracts only the core parts.

Conclusion

This article used crawler4j as an example to discuss how wait and notify are used in a practical scenario, and implemented a simplified version of the same behavior. Communication between threads can be built on wait and notify, but they are not the only mechanism. I will do more research in this area in the future.
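
As one example of a mechanism other than wait and notify, the standard library's java.util.concurrent.CountDownLatch can express "block until N workers are done" without an explicit lock object. This is a minimal sketch for comparison only; LatchDemo and runWorkers are hypothetical names, and crawler4j itself is not shown to use a latch here.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class LatchDemo {
    /** Starts workerCount threads and blocks until every one has counted down. */
    static int runWorkers(int workerCount) {
        CountDownLatch allDone = new CountDownLatch(workerCount);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < workerCount; i++) {
            new Thread(() -> {
                try {
                    completed.incrementAndGet(); // stand-in for real work
                } finally {
                    allDone.countDown(); // plays the role of notifyAll
                }
            }).start();
        }
        try {
            allDone.await(); // plays the role of waitUntilFinish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed workers: " + runWorkers(4));
    }
}
```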

If there are any mistakes in this article, please contact me to correct them.

Extension

Here, a single thread notifies a single thread that the task has completed. What if multiple threads need to notify a single thread?

  • If the waiting thread should be notified only after all of the threads have finished, you can use a monitoring thread that loops and checks whether every thread is done, as crawler4j does.
  • If the waiting thread should be notified as soon as any one of the threads finishes, how would you implement that?
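
One possible answer to the second question, still using only wait and notify, is to let the first worker that finishes record itself under the lock and call notifyAll. The following is a sketch under assumptions rather than a definitive design: FirstFinisherDemo, startWorker and waitForFirst are hypothetical names, and the work is simulated with Thread.sleep.

```java
class FirstFinisherDemo {
    private final Object lock = new Object();
    private String winner = null; // guarded by lock; null means nobody has finished yet

    /** Starts a worker that "works" for workMillis and then reports completion. */
    void startWorker(String name, long workMillis) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(workMillis); // stand-in for real work
            } catch (InterruptedException e) {
                return;
            }
            synchronized (lock) {
                if (winner == null) { // only the first finisher notifies
                    winner = name;
                    lock.notifyAll();
                }
            }
        });
        worker.setDaemon(true); // slower workers should not keep the JVM alive
        worker.start();
    }

    /** Blocks until any one worker has finished and returns its name. */
    String waitForFirst() {
        synchronized (lock) {
            while (winner == null) {
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
            return winner;
        }
    }

    public static void main(String[] args) {
        FirstFinisherDemo demo = new FirstFinisherDemo();
        demo.startWorker("slow", 2000);
        demo.startWorker("fast", 50);
        System.out.println("first to finish: " + demo.waitForFirst());
    }
}
```

The waiting side is unchanged from the single-notifier case: it still loops on a condition guarded by the lock, and only the condition itself (winner != null instead of a finished flag) differs.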