This article is intended for those with basic Java knowledge

Author: HelloGitHub – Salieri

This is part of HelloGitHub's series introducing open source projects.

Project Address:

Github.com/KFCFans/Pow…

PowerJob's online logging is a well-received feature: it displays the logs that developers' tasks generate during processing in real time on the front-end interface, helping developers monitor task execution more easily. The following figure shows the feature in action (the front-end interface is a little rough, please overlook that).

At first glance, online logging sounds very simple: the worker sends log data to the server, and the server displays it on the front end. However, for PowerJob, a system in which every node can be deployed in a distributed fashion and which supports distributed computing, there are quite a few difficulties. Briefly, they are the following:

  1. The many-to-many problem: in PowerJob's ideal deployment mode there are multiple servers and multiple workers. Once a task starts distributed computation, its logs are scattered across many machines, so displaying them uniformly on the front end requires a collector that gathers the scattered logs together.
  2. The concurrency problem: when the worker cluster is large, a running distributed computing task generates logs at a very high QPS. To comfortably support distributed tasks on the scale of millions of subtasks, this high-QPS concurrency problem has to be solved.
  3. The sorting problem: in distributed computation the logs are spread across different machines, and even after they are collected on the same machine, network delays and other factors mean their arrival order cannot be trusted. Sorting logs by time is a hard requirement, so large volumes of log data must be sortable.
  4. Data storage: when the volume of log data is very large, efficiently storing and reading it is also a problem that must be solved.

Therefore, to implement online logging properly, PowerJob internally built a fully functional distributed logging system. Without further ado, let's analyze the problems one by one.

First, the many-to-many problem

This problem was in fact already solved when PowerJob tackled server election for the multi-worker, multi-server setup. In short, at runtime all workers in a given group are connected to exactly one server, and tasks are never executed across groups. Therefore, all log data generated while a task runs is reported to the server currently connected to that group, and the logs are naturally collected in one place.
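
To illustrate the routing idea, here is a minimal sketch with hypothetical names (this is not PowerJob's actual code): every worker in a group tracks the single server it is currently connected to, so all log reports from that group converge on one server.

    // Hypothetical sketch, not PowerJob's actual classes.
    public class WorkerLogReporter {

        // Maintained elsewhere by the worker's server-discovery/heartbeat logic.
        private volatile String currentServerAddress;

        public void reportLog(long instanceId, String logLine) {
            // Tasks never run across groups, so every log line produced by
            // this group's tasks ends up on the same server.
            send(currentServerAddress, instanceId, logLine);
        }

        private void send(String serverAddress, long instanceId, String logLine) {
            // Transport omitted; PowerJob uses its own remoting layer here.
        }
    }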

Second, the concurrency problem

The concurrency problem is also easy to solve.

You have all heard of message-oriented middleware, and one of its signature capabilities is peak clipping. After introducing message-oriented middleware, what used to be synchronous calls become asynchronous ones: a queue absorbs the instantaneous traffic peak at one end, while messages are pushed out smoothly at the other. Like a reservoir, message-oriented middleware holds back the upstream flood and reduces the peak flow entering the downstream channel, achieving flood relief.

PowerJob handles the high concurrency of log traffic in a similar way. It introduces a local queue to buffer the messages bound for the server, turning synchronous sends into asynchronous ones, and ships them in batches: every transfer opportunity carries as much data as possible, which greatly reduces the pressure on the server.
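
The mechanism can be sketched roughly as follows; the class name, queue size, and flush parameters are illustrative, not PowerJob's actual implementation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    // Hypothetical sketch of the queue-and-batch idea.
    public class BatchingLogSender {

        private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>(8192);
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public BatchingLogSender() {
            // Flush periodically instead of sending one message per log line.
            scheduler.scheduleWithFixedDelay(this::flush, 1, 1, TimeUnit.SECONDS);
        }

        /** Called on the hot path: enqueue and return immediately (asynchronous). */
        public void log(String line) {
            // offer() never blocks; if the buffer is full the line is dropped,
            // so log producers are never slowed down by the network.
            buffer.offer(line);
        }

        private void flush() {
            List<String> batch = new ArrayList<>();
            buffer.drainTo(batch, 1000); // ship up to 1000 lines per report
            if (!batch.isEmpty()) {
                sendToServer(batch);
            }
        }

        private void sendToServer(List<String> batch) {
            // Network transport omitted.
        }
    }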

Third, the sorting problem

3.1 Log Storage

Before getting to sorting, let's first talk about how the server handles the log data it receives, that is, how it stores logs.

It is not a difficult choice; a simple process of elimination leads to the answer:

  1. Internal or external? As task-scheduling middleware, PowerJob treats minimal dependencies as a guiding principle to be firmly upheld. Given that the only mandatory dependency is the database, bringing in an external storage medium looks unattractive; at the very least, the server should not forward the incoming logs straight to an external store, because that second wave of huge QPS would impose very demanding performance requirements on the external component, which runs against the framework's design principles. The primary storage medium for online logs should therefore be the server itself.
  2. Memory or disk? Once the server is chosen to hold the raw data, the remaining choice is between memory and disk. But is it really a choice? Millions of lines of text held in memory, without an OutOfMemoryError? Obviously, it goes to disk.

After this quick round of elimination, the first-level storage scheme for logs is settled: the server's local disk. So what problems come with disk storage?

Apart from the complexity and awkwardness of manipulating files, one simple requirement is enough to push this scheme off a cliff: sorting.

As we all know, logs must be sorted by time, or they are unreadable. PowerJob is a purely distributed system, so there is no hope that all log data arrives at the server in order; re-sorting the logs is a must. But consider how hard that would be.

  • First, logs are plain text. To sort them, you must first turn the whole log file into a collection of individual log records, that is, split it line by line.
  • Second, since each log line has already been rendered in a human-readable time format, yyyy-MM-dd HH:mm:ss.SSS, parsing it back into a sortable timestamp would be a hassle (a small illustration follows this list).
  • And finally, the final boss: the sort itself. Remember, disk storage was chosen precisely because memory was insufficient, which means this sort cannot be done in memory. The difficulty and inefficiency of external sorting need no elaboration; I suspect most programmers (myself included) have never implemented one, so why put myself through it?
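
As a small illustration of the second point, recovering a sortable epoch timestamp from an already-formatted time string looks something like this (standard Java time API, sample value made up):

    import java.time.LocalDateTime;
    import java.time.ZoneId;
    import java.time.format.DateTimeFormatter;

    public class LogTimeParser {

        private static final DateTimeFormatter FMT =
                DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

        // Parse a formatted log time back into epoch milliseconds for sorting.
        public static long toEpochMillis(String formatted) {
            LocalDateTime dt = LocalDateTime.parse(formatted, FMT);
            return dt.atZone(ZoneId.systemDefault()).toInstant().toEpochMilli();
        }

        public static void main(String[] args) {
            System.out.println(toEpochMillis("2020-05-20 12:34:56.789"));
        }
    }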

3.2 Introduction to H2 Database

So, is there any framework or software that can use the disk for both storage and sorting? Could something that good really exist? As it turns out, there is, and it is not far away at all; it is something programmers can hardly live without: the database.

"Wait, didn't you just say you wouldn't use the database as the first-level storage medium? Changed your mind?"

"Well, young man, this database is not that database. This one is H2, the database embedded in powerjob-server."

H2 is an embedded database written in Java. It is just a class library, a single JAR file that can be embedded directly into an application. In embedded mode, the application starts H2 inside its own JVM and connects to it via JDBC; this mode supports both persistent (file-based) and in-memory storage.

H2 is easy to use. After adding the dependency to a project, it starts automatically with the JVM, and applications connect through a JDBC URL, which also specifies the storage mode to use. For example, connect with the following JDBC URL:

jdbc:h2:file:~/powerjob-server/powerjob_server_db
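
To make this concrete, here is a minimal, self-contained JDBC sketch that connects to the same embedded H2 file database; the table and column names are made up for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Requires the h2 JAR on the classpath; H2 starts inside this JVM.
    public class H2Demo {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:h2:file:~/powerjob-server/powerjob_server_db";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS demo_log " +
                        "(id BIGINT AUTO_INCREMENT, log_time BIGINT, content VARCHAR)");
                stmt.execute("INSERT INTO demo_log (log_time, content) " +
                        "VALUES (1, 'hello h2')");
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT content FROM demo_log ORDER BY log_time")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }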

H2 also supports a fairly standard SQL dialect and is fully compatible with ORM frameworks such as Spring Data JPA and MyBatis, which makes it very easy to use. In powerjob-server I access H2 through Spring Data JPA, and the experience has been very pleasant (although configuring multiple data sources was decidedly unpleasant!).

With the embedded H2 database, log storage and sorting are no longer difficult problems.

3.3 Storage and Sorting

With H2 introduced, powerjob-server processes online logs as follows:

  1. Receive log data from the workers and write it directly into the embedded H2 database.
  2. When a user views the log online, sort and output the records using an ORDER BY log_time clause in the SQL query (a minimal sketch follows this list).
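
Here is a minimal sketch of step 2 with Spring Data JPA; the entity and field names are illustrative, not PowerJob's actual schema:

    import java.util.List;
    import javax.persistence.Entity;
    import javax.persistence.GeneratedValue;
    import javax.persistence.Id;
    import org.springframework.data.jpa.repository.JpaRepository;

    // Illustrative entity: one row per received log line.
    @Entity
    public class InstanceLog {
        @Id @GeneratedValue
        private Long id;
        private Long instanceId; // which task instance produced the line
        private Long logTime;    // epoch millis, assigned on the worker side
        private String content;
        // getters/setters omitted for brevity
    }

    // Spring Data JPA derives "ORDER BY log_time ASC" from the method name,
    // so H2 does the sorting on disk instead of the application doing it in memory.
    interface InstanceLogRepository extends JpaRepository<InstanceLog, Long> {
        List<InstanceLog> findByInstanceIdOrderByLogTimeAsc(Long instanceId);
    }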

As you can see, the right technology choice can make a problem dramatically easier to solve ~

Fourth, some other optimizations

This article has introduced the core principles and implementation of PowerJob's distributed log component. In practice, quite a few further optimizations were added as well; space is limited, so they are only mentioned briefly here, and interested readers can explore the source code ~

  • Reducing the pressure of high-frequency online viewing: if every log view required querying the database and rendering the output from scratch, it would be slow and inefficient; once the data volume reaches a certain scale, disk I/O alone takes considerable time. Therefore, powerjob-server generates a cache file for each query, and log views within a certain time window are served straight from this file cache instead of hitting the DB every time.
  • Log pagination: behind millions of records, the resulting file is far larger than ordinary network bandwidth can comfortably deliver. To display logs quickly on the front-end console, pagination is introduced so that only part of the log data is shown at a time, which involves some fairly intricate file manipulation.
  • Remote storage: keeping all logs on one server clearly falls short of the high-availability design goal; after all, replacing a server would mean losing all log data. PowerJob therefore introduces MongoDB as the persistent storage medium for logs, letting users work directly with GridFS, MongoDB's underlying distributed file system. After careful consideration, I judged this an acceptable and relatively powerful optional dependency, so I chose to introduce it (a brief sketch follows this list).
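
For the remote-storage extension, writing a finished log file into GridFS with the official MongoDB Java driver looks roughly like this; the connection string, database, bucket, and file names are illustrative.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.gridfs.GridFSBucket;
    import com.mongodb.client.gridfs.GridFSBuckets;
    import org.bson.types.ObjectId;

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class GridFsLogArchiver {
        public static void main(String[] args) throws Exception {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                GridFSBucket bucket = GridFSBuckets.create(
                        client.getDatabase("powerjob"), "instance-logs");
                Path logFile = Paths.get(System.getProperty("user.home"),
                        "powerjob", "instance-123.log");
                try (InputStream in = Files.newInputStream(logFile)) {
                    // GridFS chunks the file and stores it in the MongoDB cluster,
                    // so the log survives the loss of any single server.
                    ObjectId fileId = bucket.uploadFromStream(
                            logFile.getFileName().toString(), in);
                    System.out.println("stored as " + fileId);
                }
            }
        }
    }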

Fifth, final words

In the next issue, I will talk about how PowerJob, a framework whose nodes communicate with one another constantly, chose its underlying serialization framework and designed its serialization scheme.

I'll see you next time!


Follow the HelloGitHub public account