The number of business lines in our department keeps growing, and any online application can run into problems for a variety of reasons. At the business level, order volume may drop compared with the previous week, or traffic may suddenly fall off; at the technical level, the system may throw ERRORs or interfaces may respond slowly. For the transportation business in China, one obvious characteristic is the reliance on many suppliers' services, so we also need to watch for exceptions when invoking supplier interfaces.

To let every business line under the transportation umbrella discover and resolve problems as early as possible through alarms, and thereby improve the service quality of our business systems, we decided to build a unified monitoring and alarm system. On the one hand, it should surface system anomalies immediately so they can be fixed in time; on the other hand, it should expose potential problems early. For example, an operation may not affect the normal business logic yet take a long time to run; left unhandled, such issues may hold back the business later.

This article introduces the positioning and overall architecture of the Hornet's Nest transportation business monitoring and alarm system, along with the pitfalls we ran into while putting it into practice.

Architecture design and implementation

We want the monitoring and alarm system to have three capabilities:

1. Automatic alarms for common components: create default alarm rules for the framework components that business systems commonly use (RPC, HTTP, etc.), so that monitoring is unified at the framework level.

2. Business-defined alarms: business developers define their own buried-point fields to record the particular running status of each business and system module.

3. Fast problem location: finding a problem is not the goal; solving it is. After an alarm message is sent, developers should be able to see the cause at a glance and fix it quickly.

With these goals in mind, the overall architecture and key flows of the alarm center are as follows:

Split vertically by Kafka, the alarm center sits on the left and the business systems on the right.

The alarm center itself is divided into three layers. The top layer is the web management console, used mainly to maintain alarm rules and query alarm records; the middle layer is the core of the alarm center; the bottom layer is the data layer. Business systems access the alarm center through a JAR package called Mes-client-starter.

We can divide the work of the alarm center into five modules:

1. Data collection

We find system problems by collecting and reporting metrics, that is, recording the indicators we care about while the system runs and uploading them, for example via logs or UDP.

For data collection we did not reinvent the wheel; we built directly on MES (the big-data analysis tool used inside Hornet's Nest), mainly for three reasons: first, data analysis and alarming draw on similar data sources; second, it saves a great deal of development cost; and third, it makes it easier for systems to plug into alarming.

So what exactly should be collected? Take a user's order request in a transportation business scenario as an example: the full link may include an HTTP request, Dubbo calls, SQL operations, plus validation, conversion, and assignment logic. One such call chain touches many classes and methods, and collecting every class and every method call would be both costly and pointless.

To find as many problems as possible at the lowest cost, we picked the framework components commonly used across systems, such as HTTP, SQL, and the RPC framework Dubbo, and monitor them uniformly at the framework level.

On the business side, every system cares about different metrics. For indicators specific to a given business, such as the number of successful orders, developers can define their own metrics for each business and system module through the API the system provides.
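
As a rough illustration of what such a business-defined buried point might look like, here is a minimal Java sketch. The class and method names (MetricReporter, report) are hypothetical, since the actual Mes-client-starter API is not shown here; only the idea of reporting a named indicator with arbitrary dimension fields comes from the text.

import java.util.HashMap;
import java.util.Map;

// Hypothetical facade; the real Mes-client-starter API may look different.
public class MetricReporter {

    public static void report(Map<String, Object> fields) {
        // A real client would buffer the point and ship it asynchronously
        // (via log or UDP); printing keeps this sketch self-contained.
        System.out.println("buried point: " + fields);
    }

    public static void main(String[] args) {
        Map<String, Object> point = new HashMap<>();
        point.put("app_name", "order-service");      // which system produced the point
        point.put("metric", "order_success_count");  // business-defined indicator
        point.put("order_type", "train");            // any extra dimension the team cares about
        report(point);
    }
}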

2. Data storage

Elasticsearch is used to store the collected dynamic indicator data for two reasons:

One is dynamic field storage. Each business system cares about different indicators, and each middleware has its own concerns, so which fields will be buried and what type each field has cannot be predicted in advance. This calls for a store that can add fields dynamically. Elasticsearch does not require fields and types to be predefined; buried-point data can simply be inserted and the fields are mapped automatically.

The other is the ability to handle massive data. Every user request passing through each monitored component produces multiple buried points, so the data volume is very large. Elasticsearch can store huge amounts of data and scales out well.

Elasticsearch also supports aggregations, making it easy to compute count, sum, avg, and so on quickly.

3. Alarm rules

With buried-point data in place, the next step is to define alarm rules that turn our concerns into concrete numbers and check them against preset thresholds. This is the most complex part and the core of the whole alarm center.

The most central part of the architecture diagram above is the rule execution engine, which drives the system through scheduled tasks. The engine queries all enabled rules, filters and aggregates data in Elasticsearch according to each rule's description, compares the aggregated result with the preset threshold, and sends an alarm message if the threshold condition is met.

This process involves several key technical points:

1). Scheduled tasks

Alarm rules are executed on a per-minute cycle, once every minute. To keep the system available and avoid a single point of failure taking down the whole monitoring and alarm system, we use Elastic-Job for distributed task scheduling, which also makes it easy to start and stop tasks.
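
As a minimal sketch of how such a rule-execution job can be wired up with Elastic-Job Lite 2.x (the class name RuleExecuteJob and the ZooKeeper address are illustrative; the alarm center's actual wiring is not shown in the article):

import com.dangdang.ddframe.job.api.ShardingContext;
import com.dangdang.ddframe.job.api.simple.SimpleJob;
import com.dangdang.ddframe.job.config.JobCoreConfiguration;
import com.dangdang.ddframe.job.config.simple.SimpleJobConfiguration;
import com.dangdang.ddframe.job.lite.api.JobScheduler;
import com.dangdang.ddframe.job.lite.config.LiteJobConfiguration;
import com.dangdang.ddframe.job.reg.base.CoordinatorRegistryCenter;
import com.dangdang.ddframe.job.reg.zookeeper.ZookeeperConfiguration;
import com.dangdang.ddframe.job.reg.zookeeper.ZookeeperRegistryCenter;

// Illustrative job: Elastic-Job triggers it once per minute.
public class RuleExecuteJob implements SimpleJob {

    @Override
    public void execute(ShardingContext shardingContext) {
        // Placeholder body: load the enabled rules, evaluate each one against
        // Elasticsearch, and send alarms where thresholds are breached.
        System.out.println("executing alarm rules, shard=" + shardingContext.getShardingItem());
    }

    public static void main(String[] args) {
        // ZooKeeper registry center used by Elastic-Job Lite for coordination.
        CoordinatorRegistryCenter regCenter = new ZookeeperRegistryCenter(
                new ZookeeperConfiguration("localhost:2181", "alarm-center"));
        regCenter.init();

        // Cron "0 * * * * ?" fires at the start of every minute; one shard is enough here.
        JobCoreConfiguration coreConfig =
                JobCoreConfiguration.newBuilder("ruleExecuteJob", "0 * * * * ?", 1).build();
        SimpleJobConfiguration jobConfig =
                new SimpleJobConfiguration(coreConfig, RuleExecuteJob.class.getCanonicalName());
        new JobScheduler(regCenter, LiteJobConfiguration.newBuilder(jobConfig).build()).init();
    }
}

Because the scheduling state lives in ZooKeeper, any node in the cluster can pick up the job, which is what removes the single point of failure.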

2). “three-step” alarm rule

We implement an alarm rule in three steps: filter, aggregate, and compare. For example, suppose these are the buried error logs collected (service A is the one we care about):

app_name=B   is_error=false  warn_msg=aa   datetime=2019-04-01 11:12:00
app_name=A   is_error=false                datetime=2019-04-02 12:12:00
app_name=A   is_error=true   error_msg=bb  datetime=2019-04-02 15:12:00
app_name=A   is_error=true   error_msg=bb  datetime=2019-04-02 16:12:09


Alarm rules are defined as follows:

  • Filter: delineate a data set with several conditions, e.g. app_name=A, is_error=true, datetime between '2019-04-02 16:12:00' and '2019-04-02 16:13:00'.

  • Aggregate: reduce the filtered data set to a single value with a predefined aggregation type such as count, avg, sum, or max. For the case above we choose count, to count the number of errors.

  • Compare: compare the value from the previous step with the configured threshold (a sketch of these three steps against Elasticsearch follows this list).
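
A minimal sketch of the three steps, assuming Elasticsearch's high-level REST client (7.x; import paths for aggregation result classes vary slightly across client versions) and a hypothetical index name mes-log-sample:

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.metrics.ValueCount;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class FilterAggregateCompareExample {

    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Step 1, filter: delineate the data set with the rule's conditions.
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.boolQuery()
                            .filter(QueryBuilders.termQuery("app_name", "A"))
                            .filter(QueryBuilders.termQuery("is_error", true))
                            .filter(QueryBuilders.rangeQuery("datetime")
                                    .gte("2019-04-02 16:12:00")
                                    .lt("2019-04-02 16:13:00")))
                    // Step 2, aggregate: reduce the filtered set to one number (count).
                    .aggregation(AggregationBuilders.count("err_count").field("is_error"))
                    .size(0);   // only the aggregation is needed, not the hits

            SearchResponse response = client.search(
                    new SearchRequest("mes-log-sample").source(source), RequestOptions.DEFAULT);
            ValueCount errCount = response.getAggregations().get("err_count");

            // Step 3, compare: check the aggregated value against the rule's threshold.
            long threshold = 1;
            if (errCount.getValue() > threshold) {
                System.out.println("alarm: " + errCount.getValue() + " errors in the window");
            }
        }
    }
}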

How do we implement alarms for more complex conditions, such as the failure rate and traffic fluctuations mentioned earlier?

Suppose the requirement is: if the failure rate of service A is greater than 80% and the total number of requests is greater than 100, send an alarm.

As we know, the failure rate is simply the number of failures divided by the total number of requests, and both counts can be obtained through the "filter + aggregate" steps above, so the requirement can be expressed by the following formula:

failedCount / totalCount > 0.8 && totalCount > 100

We then use the expression engine fast-el to evaluate the expression and compare the result with the configured threshold.
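
Since the fast-el API itself is not shown in the article, the sketch below simply writes out in plain Java the logic the expression engine evaluates, to make the substitution of the two aggregated values concrete:

import java.util.HashMap;
import java.util.Map;

public class CompositeRuleExample {

    public static void main(String[] args) {
        // Sample values produced by two "filter + aggregate" steps.
        Map<String, Long> aggregates = new HashMap<>();
        aggregates.put("failedCount", 90L);
        aggregates.put("totalCount", 105L);

        // The alarm center binds these variables into the expression
        // "failedCount / totalCount > 0.8 && totalCount > 100" and lets fast-el
        // evaluate it; here the same check is written out directly.
        double failedCount = aggregates.get("failedCount");
        double totalCount = aggregates.get("totalCount");
        boolean triggered = failedCount / totalCount > 0.8 && totalCount > 100;

        System.out.println("alarm triggered: " + triggered);   // true for these sample numbers
    }
}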

3). Automatically create default alarm rules

Common components such as Dubbo and HTTP involve many classes and methods, so their default rules are created automatically rather than entered one by one; developers can still maintain the rules through the web management console. Alarm rules are stored in MySQL and cached in Redis at the same time.

Taking Dubbo as an example, we first obtain all providers and consumers through Dubbo's ApplicationModel, then combine the class and method information with a rule template (which can be understood as a rule with the concrete class and method left out) to create a rule for each method of each class.

For example: if the average response time per minute of the Dubbo interface /order/getOrderById provided by service A exceeds 1 second, raise an alarm; if the Dubbo interface /train/grabTicket/ called by service B returns a false status more than 10 times in a minute, raise an alarm; and so on.
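
To make the Dubbo-based rule creation concrete, here is a rough sketch assuming Dubbo 2.6.x, whose ApplicationModel exposes the providers registered in the current process (package and method names differ in other Dubbo versions, and saveRuleIfAbsent is a hypothetical persistence call):

import java.util.List;

import com.alibaba.dubbo.config.model.ApplicationModel;
import com.alibaba.dubbo.config.model.ProviderMethodModel;
import com.alibaba.dubbo.config.model.ProviderModel;

public class DefaultDubboRuleCreator {

    public void createDefaultRules() {
        // Iterate every provider this process has registered with Dubbo.
        for (ProviderModel provider : ApplicationModel.allProviderModels()) {
            List<ProviderMethodModel> methods = provider.getAllMethods();
            for (ProviderMethodModel method : methods) {
                // Combine service + method with the rule template, e.g.
                // "avg response time of <service>.<method> per minute > 1s => alarm".
                saveRuleIfAbsent(provider.getServiceName(), method.getMethodName());
            }
        }
    }

    // Hypothetical persistence call: write the rule to MySQL and refresh the Redis cache.
    private void saveRuleIfAbsent(String serviceName, String methodName) {
        System.out.println("default rule for " + serviceName + "#" + methodName);
    }
}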

4. Alarm behavior

At present, when an alarm rule is triggered, notifications go out in two main ways:

  • Email alarm: each type of alarm has its own designated owner, so the relevant people learn about a system anomaly immediately.

  • WeChat alarm: a supplement to the email alarm.

We will keep refining the alarm strategy, for example by using different notification channels for different severity levels, so that developers notice problems quickly without having their work on new features constantly interrupted.

5. Assisted problem location

To help developers locate problems quickly, we designed a hit-sampling feature:

First, we extract the tracer_id of the record that hit the rule and provide a link that jumps straight to Kibana, so the relevant logs can be viewed and the call chain reconstructed.

Second, developers can also specify the fields they care about, and we extract the corresponding values so the problem can be seen at a glance.

Technically, we define a hit-sampling field in which the user can enter one or more ${} placeholders. For example, when watching a supplier's interface performance, the hit-sample fields might include things like the supplier name and the time the call took. When an alarm message needs to be sent, we extract these fields, query the corresponding values in Elasticsearch, and substitute them with Freemarker; the final message sent to the developer then shows at a glance what is wrong with the system.
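
A minimal sketch of the placeholder substitution with Freemarker; the template text and field names here are illustrative, not the actual configuration:

import freemarker.template.Configuration;
import freemarker.template.Template;

import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

public class HitSampleRenderExample {

    public static void main(String[] args) throws Exception {
        // Template text as a developer might configure it; the ${...} placeholders are
        // the hit-sampling fields whose values are pulled from the matching ES document.
        String templateText = "supplier=${supplier_name}, interface=${interface_name}, cost=${cost_ms}ms";

        // Values extracted from Elasticsearch for the document that triggered the rule
        // (field names are illustrative).
        Map<String, Object> model = new HashMap<>();
        model.put("supplier_name", "supplier-x");
        model.put("interface_name", "/train/grabTicket");
        model.put("cost_ms", 3200);

        Configuration cfg = new Configuration(Configuration.VERSION_2_3_28);
        Template template = new Template("alarmMsg", new StringReader(templateText), cfg);
        StringWriter out = new StringWriter();
        template.process(model, out);      // Freemarker substitutes the placeholders

        System.out.println(out.toString()); // appended to the alarm message sent to developers
    }
}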

Pitfalls and evolution direction

Building the transportation business monitoring and alarm system was a zero-to-one process. Along the way we ran into quite a few problems, such as memory filling up instantly, ES getting slower and slower, and frequent Full GC. Below is what we learned while tackling each of them.

Pitfalls we hit

1. Memory filled up instantly

Every system has its limits, so you need a dam that can hold back the flood when it comes.

The same is true for the alarm center, whose biggest bottleneck is receiving the MES buried-point logs delivered through Kafka. Early after launch, an anomaly in a business system flooded the alarm center with buried-point logs in an instant and filled up its memory.

The solution is to estimate the maximum load each node can bear and protect the system accordingly. We took a rate-limiting approach: since Kafka consumers pull messages, we only need to control the pull rate, for example with Guava's RateLimiter:

import com.google.common.util.concurrent.RateLimiter;

// Create the limiter once and share it across messages; creating a new limiter
// per message would defeat the throttling. 20,000 permits per second reflects
// the capacity of a single node.
RateLimiter messageRateLimiter = RateLimiter.create(20000);

messageHandler = (message) -> {
    // Block until a permit is available, throttling the Kafka pull rate.
    double waitSeconds = messageRateLimiter.acquire();
    // save the buried point ...
};

2. ES is getting slower and slower

MES produces a huge volume of logs, with clear hot and cold parts. To keep query performance up and make data migration easy, we create Elasticsearch indexes at the granularity of application + month.
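
As a tiny illustration of the naming idea (the actual index name pattern is not shown in the article, so the mes-{application}-{yyyy.MM} pattern below is an assumption):

import java.time.YearMonth;
import java.time.format.DateTimeFormatter;

public class IndexNameExample {

    public static void main(String[] args) {
        // One index per application per month, so old months stay cold and can be
        // closed or migrated without touching the index that is being written to.
        String appName = "order-service";
        String month = YearMonth.now().format(DateTimeFormatter.ofPattern("yyyy.MM"));
        System.out.println("mes-" + appName + "-" + month);   // e.g. mes-order-service-2019.04
    }
}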

3. Frequent Full GC

We use Logback as the logging framework and defined an Appender to collect ERROR and WARN logs. To also collect logs produced before the Spring container has started (at which point TalarmLogbackAppender has not yet been initialized), the DelegatingLogbackAppender from a Logback extension jar provides a caching mechanism, and that cache is where the memory leak lives.

Normally, once the system has started, the Spring context held by ApplicationContextHolder is no longer null and the cached log events are drained automatically. But if ApplicationContextHolder is, for whatever reason, never initialized, the events keep accumulating in the cache, which leads to frequent Full GC.

Solutions:

1. Ensure that ApplicationContextHolder is initialized

2. DelegatingLogbackAppender has three cache modes: OFF, SOFT, and ON. If caching must be enabled, use SOFT mode: the cache is then held in a list wrapped in a SoftReference, which the garbage collector can reclaim when memory runs low (see the sketch below).
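
For reference, a minimal illustration of the SOFT idea, not the library's actual implementation: the buffered events sit behind a SoftReference, so the collector may drop them under memory pressure instead of letting them pile up until Full GC.

import java.lang.ref.SoftReference;
import java.util.ArrayList;
import java.util.List;

public class SoftCacheExample {

    // Events buffered before the Spring context is ready are held behind a
    // SoftReference, so the GC may reclaim the whole buffer when memory is tight.
    private SoftReference<List<Object>> buffer = new SoftReference<>(new ArrayList<>());

    public void cache(Object event) {
        List<Object> events = buffer.get();
        if (events == null) {                    // the collector reclaimed the buffer
            events = new ArrayList<>();
            buffer = new SoftReference<>(events);
        }
        events.add(event);
    }
}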

Near-term plans

The system still has rough edges, and these are our plans for the near future:

  • Easier to use: provide more help hints so developers can get familiar with the system quickly.

  • More alarm dimensions: automatic alarms currently cover HTTP, SQL, and Dubbo; MQ, Redis, scheduled tasks, and more will be supported later.

  • Graphical display: present buried-point data in charts, which shows the system's behavior more intuitively and also helps developers set reasonable thresholds.

Summary

To sum up, the architecture of the transportation business monitoring and alarm system has these characteristics:

  • Flexible alarm rule configuration with rich filtering logic

  • Automatic alarms for common components; Dubbo and HTTP are covered out of the box

  • Simple access: any system already connected to MES can start using it quickly

Production operation and maintenance boils down to three things: finding problems, locating problems, and solving problems. Finding a problem means notifying the system owner as soon as an anomaly appears; locating and solving it means giving developers, as accurately as possible, the information they need to fix the system quickly.

The alarm system is positioned as the first step and entry point of the online troubleshooting chain. By linking core clue data with the tracing system (trace links, etc.) and the deployment and release system, it can greatly improve the efficiency of resolving online problems and better safeguard production.

No matter what we do, our ultimate goal is to improve the quality of service.

Author: Song Kaojun, senior R&D engineer of Hornet’s Nest transportation platform.
