Introduction | Hawk-Eye is a massive-scale distributed real-time monitoring and log analysis system run by Tencent PCG Technical Operations. In response to the company's strategic requirements, this legacy business was migrated to the cloud, which ultimately produced welcome changes. This article introduces the overall cloud migration solution for the distributed log system (Hawk-Eye), and we hope to exchange ideas with you.

I. Introduction to the Hawk-Eye Platform

Hawk-Eye is a massive-scale distributed real-time monitoring and log analysis system maintained by the PCG Technical Operations Department, and it supports multi-language reporting. The domain name is:

http://log2.oa.com/

ATTA supports reporting from multiple languages (Java, Python, C++, etc.). After data is reported, Hawk-Eye pulls it from the ATTA system and finally writes it to ES, relying on ES's inverted index mechanism for fast querying and writing.

Leveraging ES's inverted index mechanism, which returns queries over billions of records within seconds, Hawk-Eye provides the following functions:

1. Real-time log query

After service data is reported to ATTA, developers can promptly query logs through Hawk-Eye to locate problems, and operations staff can check the running status of services in real time through the data statistics interface that Hawk-Eye provides.

2. Data analysis ability

After Hawk-Eye stores the data, users can query it directly through the API for OLAP analysis.

3. Error log alarm service

If an error occurs, the program can report the error log according to the Hawk-Eye specification. Hawk-Eye tokenizes the log text and generates minute-level alarms grouped by error code.
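
As a rough illustration of the idea (the actual Hawk-Eye reporting specification is not shown here, so the `err_code=` token convention and the threshold below are assumptions), a minute-level alarm can be built by tokenizing each error log, counting occurrences per error code, and flushing the counters once a minute:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Minimal sketch of minute-level alarming keyed by error code. The
 *  "err_code=" convention and the threshold are illustrative assumptions,
 *  not Hawk-Eye's actual specification. */
public class ErrorAlarm {
    private static final long THRESHOLD = 100;  // assumed alarms-per-minute threshold
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    public void onErrorLog(String log) {
        // Simple word segmentation: split on whitespace, pick out error-code tokens.
        for (String token : log.split("\\s+")) {
            if (token.startsWith("err_code=")) {
                counts.computeIfAbsent(token, k -> new LongAdder()).increment();
            }
        }
    }

    /** Invoked by a scheduler once per minute. */
    public void flushMinute() {
        counts.forEach((code, n) -> {
            if (n.sum() >= THRESHOLD) {
                System.out.println("ALARM: " + code + " occurred " + n.sum() + " times in the last minute");
            }
        });
        counts.clear();
    }
}
```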

4. Grafana analyzes alarms in real time

Data reported to Hawk-Eye is analyzed and alerted on in real time through Grafana. (Because ES does not handle highly concurrent queries well, real-time analysis over very large data volumes is not possible.)

II. Background of the Cloud Migration

The company adjusted its strategy, set up a new cloud business group, established an internal “Technical Committee”, and launched two strategic directions: “open source collaboration” and “business on the cloud”.

What benefits did the Hawk-Eye team gain from moving to the cloud during this architecture evolution? What is the value of the migration?

1. Business value

  • Focus on the business and improve R&D efficiency;

  • Accelerate technological upgrading and maintain technological advantages (traditional Internet vs cloud era);

  • Use better open source component services on the cloud (availability, stability, documentation, APIs…);

  • Computing resource reuse, elastic scaling, cost optimization;

  • Standardize CI/CD processes.

2. Engineer value

  • Broaden technical horizons and avoid working behind closed doors;

  • Skills are more valuable;

  • Export excellent components to the cloud for greater impact.

3. Tencent Cloud Value

  • Share business cloud-migration experience with customers;

  • Help Tencent Cloud polish its cloud components.

III. Component Selection for the Cloud Architecture

To ensure business continuity during the architecture evolution, the main data import flow was changed as little as possible: Kafka is replaced directly by cloud CKafka, and ES directly uses cloud ES.

ES and Kafka use the managed components on the cloud directly, while the other components needed to be refactored.

1. Refactoring LogSender

The producer program had a severe performance bottleneck when writing to Kafka, with especially serious data loss during peak periods.

The producer's data flow: read the BOSS subscription -> IP resolution -> write to Kafka.

(1) IP resolution performance bottleneck

The producer program was previously a C++ version. After adding log output, we found that IP resolution time was especially long at peak hours. Inspecting the code showed that IP resolution held a lock, which is why data loss was particularly serious during peaks. The solution: switch IP resolution to a binary search over sorted IP ranges, then remove the lock. This solved the problem.
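
A minimal sketch of the fix, written in Java for consistency with the other examples (the real fix was in the C++ producer, and all names here are illustrative): the IP library is loaded once into sorted, immutable arrays, so concurrent readers need no lock and each lookup is an O(log n) binary search.

```java
import java.util.Arrays;

/** Lock-free IP-to-location lookup via binary search. The range table is
 *  loaded once at startup into immutable arrays, so concurrent readers
 *  never contend. Field and class names are illustrative. */
public class IpResolver {
    private final long[] rangeStarts;   // sorted start address of each IP range
    private final String[] locations;   // location for rangeStarts[i] .. rangeStarts[i+1]-1

    public IpResolver(long[] rangeStarts, String[] locations) {
        this.rangeStarts = rangeStarts;
        this.locations = locations;
    }

    public String resolve(long ip) {
        int idx = Arrays.binarySearch(rangeStarts, ip);
        if (idx < 0) idx = -idx - 2;    // not an exact start: take the covering range
        return idx >= 0 ? locations[idx] : "unknown";
    }
}
```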

(2) Kafka performance bottleneck problem

Because our producer program reads many, many topics and then writes them to Kafka, we tried sending with both a single producer and multiple producers, and performance did not improve.

When the Kafka client sends, it locks the queue for each topic partition, and sends a batch of messages once the queue is full. So the solution is to give each BOSSID its own sending client (see the sketch after this list):

  • Topics with large data volumes get multiple dedicated Kafka clients;

  • A batch of topics with small data volumes can share a single Kafka producer.
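
A sketch of this routing, with all names and the hot/cold split rule as illustrative assumptions: high-volume BOSSIDs get a dedicated `KafkaProducer`, so their per-partition batch queues are not contended by the long tail, while the small topics share one producer.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Sketch: dedicated producers for high-volume BOSSIDs, one shared producer
 *  for the long tail. Thresholds and names are illustrative assumptions. */
public class ProducerPool {
    private final Properties props;
    private final KafkaProducer<String, String> shared;
    private final Map<String, KafkaProducer<String, String>> dedicated = new ConcurrentHashMap<>();
    private final Set<String> hotBossIds;   // BOSSIDs known to be high-volume

    public ProducerPool(Properties props, Set<String> hotBossIds) {
        this.props = props;
        this.shared = new KafkaProducer<>(props);
        this.hotBossIds = hotBossIds;
    }

    public void send(String bossId, String topic, String value) {
        KafkaProducer<String, String> p = hotBossIds.contains(bossId)
                ? dedicated.computeIfAbsent(bossId, id -> new KafkaProducer<>(props))
                : shared;   // small topics share one producer
        p.send(new ProducerRecord<>(topic, value));
    }
}
```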

After optimization: previously, due to program performance limits, a single node could process at most about 130,000 records per minute under very large data volumes; after the improvement, a single node can process about 550,000 records per minute, roughly a 4x performance gain.

2. Kafka version selection

In general, higher Kafka versions support more features than lower ones, such as transactions and moving data between disks without degrading write performance. We chose the highest available version.

Of course, CKafka does not give us the chance to choose the version, so it is important to make sure the client version matches the Kafka server version to avoid unnecessary problems.

For example, when an older client writes to a newer Kafka broker and uses data compression, the broker has to decompress the data after receiving it and recompress it into the corresponding message format (if the versions match, this conversion does not happen), increasing the broker's operating cost.
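
For illustration, a hedged sketch of producer settings along these lines (the endpoint, codec, and batch values are assumptions, not CKafka's required configuration): the point is to keep the client version and compression codec aligned with the broker, so messages are compressed once on the client and not re-encoded server-side.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

/** Illustrative producer settings; endpoint and tuning values are assumed. */
public class ProducerSettings {
    public static KafkaProducer<String, String> build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "ckafka-endpoint:9092");   // placeholder address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("compression.type", "lz4");   // compressed once on the client
        props.put("batch.size", "262144");      // larger batches amortize per-message cost
        props.put("linger.ms", "50");           // wait briefly to fill batches
        return new KafkaProducer<>(props);
    }
}
```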

A single CKafka node can deliver up to 400 MB/s, while a single node of the Kafka clusters we built ourselves topped out at about 100 MB/s, a 4x performance improvement.

3. Refactoring Hangout

For the ES write stage there are many components in the industry, the most famous being Logstash. Because its performance was insufficient, we developed our own component for writing from Kafka to ES.

The core optimization is a large reduction in disk IO: under extreme tuning, performance can keep increasing by more than 2x on top of that. Overall, ES write performance improved by about 6x.
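
A minimal sketch of what such a writer's hot path can look like against ES 6.8 over HTTP (index name, type, and error handling are illustrative, not the actual Hangout code): the key point is batching many documents into one bulk call, which cuts per-document request overhead and disk IO.

```java
import java.io.IOException;
import java.util.List;
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

/** Sketch of a bulk writer for ES 6.8 over HTTP. Batching log lines into
 *  one bulk call is what reduces per-document overhead. */
public class EsBulkWriter implements AutoCloseable {
    private final RestHighLevelClient client;

    public EsBulkWriter(String host) {
        this.client = new RestHighLevelClient(
                RestClient.builder(new HttpHost(host, 9200, "http")));
    }

    public void writeBatch(List<String> jsonDocs) throws IOException {
        BulkRequest bulk = new BulkRequest();
        for (String doc : jsonDocs) {
            // Index and type names are illustrative placeholders.
            bulk.add(new IndexRequest("hawkeye-log", "_doc").source(doc, XContentType.JSON));
        }
        BulkResponse resp = client.bulk(bulk, RequestOptions.DEFAULT);
        if (resp.hasFailures()) {
            // Retrying the whole batch gives at-least-once semantics (see Section V).
            throw new IOException(resp.buildFailureMessage());
        }
    }

    @Override public void close() throws IOException { client.close(); }
}
```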

4. ES version selection

Earlier ES versions support both TCP (transport) writes and HTTP writes, while later versions support only HTTP writes. The differences are as follows:

  • TCP writes are faster than HTTP;

  • HTTP writes are more stable. TCP writes go directly to specific nodes, which easily causes load imbalance; HTTP requests are easier to load-balance across the data nodes.

So we went with the cloud version ES 6.8.2.

The effect after moving to the cloud:

  • To write 1 TB of data per day on average, off the cloud we needed 80 cores, 256 GB of memory and 12 TB of disk (BX1 model);

  • On the cloud, 3 x (16 cores, 64 GB memory, 5 TB disk) is enough;

  • On average, this roughly halves resource usage.

IV. Changes After Moving to the Cloud

There are more than 50 ES clusters and 12 Kafka clusters.

1. Less operational work

If we had not moved to the cloud, building these clusters would require on average 20 machines per ES cluster. Counting machine requisition, machine initialization, disk RAID setup and ES installation, each cluster takes 3-4 person-days on average, so construction alone would already cost more than 200 person-days (62 clusters x 3-4 person-days), not to mention the cluster operation and maintenance cost, which far exceeds the Hawk-Eye team's manpower.

2. Lower cost

With the optimization of each component, overall performance improves by at least 2-3x, the required resources shrink accordingly by 2-3x, and the annual cost savings are at least 20 million RMB.

3. More focus

After moving to the cloud:

  • Hawk-Eye focuses on write-performance optimization, greatly improving write efficiency;

  • A monitoring system was established: after data is reported to ATTA, the data counts are reconciled and an alarm is raised promptly when delayed data is found;

  • For new feature development, ES supports querying across days; if the same day's data volume spikes, write capacity can be increased through a mechanism of creating backup indexes (see the sketch after this list).
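
A hedged sketch of the backup-index idea (index naming and the saturation signal are assumptions, not the actual mechanism): when the day's primary index cannot absorb a write spike, new writes are routed to a backup index for the same day, while a query alias spanning both keeps reads unchanged.

```java
import java.time.LocalDate;

/** Sketch of write routing for the backup-index mechanism. The naming
 *  scheme and saturation flag are illustrative assumptions. */
public class IndexRouter {
    private volatile boolean primarySaturated = false;  // flipped by a write-latency monitor

    public String indexForToday(LocalDate day) {
        String base = "hawkeye-log-" + day;             // e.g. hawkeye-log-2019-11-20
        return primarySaturated ? base + "-backup" : base;
    }

    public void markSaturated() { primarySaturated = true; }
}
```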

V. Subsequent Architecture Evolution

1. Monitoring system construction

Core modules should have both logs and monitoring, and the monitoring dimensions of different modules should correspond to one another. When a business anomaly occurs, the basic data (such as CPU/memory), metric data and log data can then be pulled up promptly, forming a complete monitoring system.

2. Architecture continues to evolve

Currently, the self-developed Hangout writer can only guarantee at-least-once delivery, not exactly-once. We plan to try Flink's checkpoint mechanism to ensure the integrity of the data link.
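
A minimal sketch of enabling Flink checkpointing for this pipeline (intervals are illustrative, and the Kafka source and ES sink are elided): checkpoints commit the Kafka offsets atomically with operator state, and exactly-once into ES additionally needs an idempotent sink, e.g. deterministic document IDs.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

/** Sketch: checkpointed Flink job skeleton for the Kafka -> ES link. */
public class CheckpointedPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE); // checkpoint every 60s
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000); // breathing room between checkpoints
        env.getCheckpointConfig().setCheckpointTimeout(120_000);         // abort checkpoints that hang
        // Add the Kafka source and an idempotent ES sink here, then:
        // env.execute("hawkeye-es-writer");
    }
}
```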