This article is an original piece from AI Front.

Alibaba's databases enter the era of network-wide second-level real-time monitoring


Author | Wu Bilang (Wei Li)


Editor | Emily

AI Front introduction: "In 2017, Double 11 once again set a record: 325,000 transactions created per second. Behind that figure are databases handling tens of millions of requests per second. How to manage databases at this scale with automation, keep them stable, and find problems quickly is a major challenge, and it is exactly the task the database management and control platform has to accomplish."


With the continuous growth of Alibaba's database fleet, the database management and control platform has gone through several stages: from scripts, to tools, to a platform, and now to DBPaaS. In this year's Double 11, DBPaaS for the first time covered scenarios as diverse as the group itself, subsidiaries' local databases, the public cloud, and the hybrid cloud. This Double 11 the databases ran entirely on containers, and the elastic use of offline resources and public cloud resources was pushed much further. The fully optimized monitoring and collection pipeline achieved second-level collection, monitoring, display, and diagnosis for every database instance on the whole network, processing more than 10 million monitoring indicators per second in real time so that no anomaly can hide. DBPaaS also keeps breaking new ground in the automation, scale, digitalization, and intelligence of database management.

Among these efforts, the construction of the database monitoring system is a typical example.

When an online system fails during normal business operation, finding the anomaly among tens of thousands of databases and diagnosing it quickly is extremely challenging. During Double 11 full-link stress tests, when system throughput falls short of expectations or service RT jitters, diagnosing and locating the database problem quickly is a very practical concern. In addition, post-mortem analysis of complex database failures, scene reconstruction, and historical event tracing all push us to build a monitoring system that covers every environment, every database instance, and every event online.

To achieve this, the monitoring system must:

1) Cover all data centers of Alibaba's subsidiaries worldwide.

2) Cover all businesses of the Alibaba ecosystem, including new retail, new finance, new manufacturing, new technology, and new energy.

3) Cover all database hosts, operating systems, containers, databases, and networks.

4) Monitor all performance indicators continuously at 1-second granularity.

5) Run stably and continuously around the clock.


How DBPaaS monitoring ran during Double 11

On November 11, 2017, the DBPaaS second-level monitoring system processed 10 million performance indicators per second on average, peaking at 14 million, safeguarding the healthy operation of every online database instance across China, the United States, Europe, and Southeast Asia. Real-time second-level monitoring means that at any moment, a DBA can see the performance trend of any database instance up to one second earlier.

The DBPaaS monitoring system uses only 0.5% of the machines in the database resource pool to support the entire collection pipeline, computation pipeline, storage, and display/diagnosis system. It faithfully recorded every RT jitter in every full-link stress test this year, helping DBAs diagnose database problems quickly and providing input for subsequent system optimization.

During the Double 11 promotion we neither added machines nor degraded the service, so DBAs could enjoy Double 11 over a cup of tea. For day-to-day business support we likewise provide 7x24 service.


How we did it

Building a real-time, second-level monitoring system that supports tens of thousands of database instances means overcoming a series of technical challenges. As the saying goes, good architectures are evolved; our monitoring system, too, has been iterated as scale and complexity grew. By 2017 it had gone through four generations of improvement.

First-generation monitoring system

The architecture of the first-generation monitoring system was very simple: the collection Agent wrote performance data directly into a database, and the monitoring system queried that database directly.
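As a minimal sketch (the table, columns, and connection details are invented for illustration, not the actual schema), the first-generation pipeline was essentially this:

```java
// Minimal sketch of the first-generation pipeline: the agent writes each
// sample straight into a relational table, and the monitoring UI queries
// the same table. All names here are illustrative assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class DirectWriteAgent {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://monitor-db:3306/perf", "agent", "secret")) {
            // One row per instance per second; every metric is its own column.
            PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO db_perf (instance_id, ts, qps, tps, rt_ms) VALUES (?, ?, ?, ?, ?)");
            ps.setString(1, "mysql-001");
            ps.setLong(2, System.currentTimeMillis() / 1000);
            ps.setDouble(3, 1234.0);   // collected QPS
            ps.setDouble(4, 567.0);    // collected TPS
            ps.setDouble(5, 0.8);      // collected average RT in ms
            ps.executeUpdate();
        }
    }
}
```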

As the database clusters grew, the drawbacks of this simple architecture became apparent.

First, the capacity of a single-machine database does not scale. As the number of monitored databases grew, the daily volume of performance-indicator writes became very large and database capacity ran short. Long-accumulated monitoring history frequently triggered disk-space alarms, and we were often forced to delete older data.

Second, the set of monitoring indicators was not extensible enough. At first there were only a dozen or so database monitoring items, but it quickly became clear this was not enough: people would come with the MySQL documentation and say, "I want to see this, I want to see that, can you add it to the monitoring system?" A performance metric can only be displayed if it is stored first, so the lack of extensibility in the storage layer was a real headache. For this requirement, both wide tables and narrow tables have obvious drawbacks. With a wide table, every new batch of performance metrics requires a DDL change; predefined extension fields mitigate this, but ultimately cap what the product can grow into. A narrow table structurally accommodates any performance indicator, but amplifies write volume and inflates storage space. In the end, the system's overall read/write capacity was low and it could not scale horizontally.
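To make the trade-off concrete, the following sketch contrasts the two layouts; the schemas and names are invented for illustration, not the actual tables:

```java
// Illustrative contrast of the two layouts (schemas are assumptions):
//
// Wide table  : db_perf(instance_id, ts, qps, tps, rt_ms, ...)
//               adding a metric means ALTER TABLE ... ADD COLUMN (a DDL change).
// Narrow table: db_perf_kv(instance_id, ts, metric_name, metric_value)
//               any metric fits, but one second of N metrics becomes N rows.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.Map;

public class NarrowTableWriter {
    public static void write(Connection conn, String instanceId,
                             long ts, Map<String, Double> metrics) throws Exception {
        PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO db_perf_kv (instance_id, ts, metric_name, metric_value) VALUES (?, ?, ?, ?)");
        // Write amplification: every metric of the same second is its own row.
        for (Map.Entry<String, Double> m : metrics.entrySet()) {
            ps.setString(1, instanceId);
            ps.setLong(2, ts);
            ps.setString(3, m.getKey());
            ps.setDouble(4, m.getValue());
            ps.addBatch();
        }
        ps.executeBatch();
    }
}
```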

All of these problems led to the birth of the second-generation monitoring system.

Second-generation monitoring system

The second-generation monitoring system introduced a DataHub module and a distributed document database. The data path became: collection Agent, to DataHub, to the distributed document database, with the monitoring system reading from the distributed document database.

The collection Agent focuses on the logic of gathering performance data: it builds a unified data format and calls the DataHub interface to hand the data over, without needing to care where the data is ultimately stored. DataHub, as the connecting node, decouples collection from storage. First, it hides the storage details from the collection Agent, exposing only the simplest data-delivery interface. Second, DataHub chooses the write model best suited to the characteristics of the storage engine, such as batch writes and compression. Third, LVS and LSB techniques allow DataHub to scale out horizontally. The distributed document database solves the scalability problems: horizontal expansion addresses the lack of storage capacity, and its schema-free nature solves the extensibility of performance indicators.
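The following is a hypothetical sketch of this decoupling, not the actual DataHub API: the Agent sees only a tiny delivery interface, while the implementation owns batching, compression, and the choice of storage engine.

```java
// Hypothetical sketch of the decoupling described above. All names are
// illustrative assumptions, not the real DataHub interfaces.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Unified data format built by the collection Agent.
record MetricSample(String instanceId, long timestamp, Map<String, Double> metrics) {}

// The only thing the Agent ever sees.
interface MetricSink {
    void deliver(MetricSample sample);
}

// One possible DataHub-side implementation: buffer samples, then flush in
// batches using whatever write model suits the storage engine.
class BatchingSink implements MetricSink {
    private final List<MetricSample> buffer = new ArrayList<>();
    private final int batchSize;

    BatchingSink(int batchSize) { this.batchSize = batchSize; }

    @Override
    public synchronized void deliver(MetricSample sample) {
        buffer.add(sample);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    private void flush() {
        // A real implementation would compress and bulk-write to the document
        // database here; this sketch simply clears the buffer.
        buffer.clear();
    }
}
```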

As the monitoring system kept running, database instances multiplied, performance indicators kept increasing, and the user base of the monitoring system expanded, and new problems appeared. First, DBAs often need to look at a database's performance trend over several months to predict future traffic, and at that query scale the system was essentially unusable. Second, the cost of storing a full year of performance data was becoming unbearable, yet every year during Double 11 stress testing, DBAs ask to see the performance trend from the previous Double 11. Third, DataHub carried the risk of losing collected data: because raw data is first buffered in DataHub's memory, any process restart loses whatever is in memory.

Third-generation monitoring system

To solve the slow-query problem, we first looked at how the data was laid out. Document databases and relational databases are both row-oriented: the row is the basic unit of reading and writing. Performance data was stored as one row per second, with N performance indicators in a table keyed by time. Although all performance metrics of the same instant sit in the same row, they are not closely related to one another, because the typical monitoring and diagnosis need is to look at the trend of one or a few indicators over a period of time, not every indicator at a single instant. For example:

To find the performance trend of a single indicator, the storage engine has to scan the data of all indicators, which burns a great deal of CPU and memory, most of it wasted. Column-family techniques can alleviate this to some extent, but deciding how to define the column families becomes a challenge in itself: should the storage layer's layout be coupled to the requirements of the monitoring and diagnosis layer? That does not seem like a good idea.

We therefore turned to column-oriented storage. The read/write pattern of monitoring metrics suits columnar storage very well, and time-series databases, represented by OpenTSDB, came into view. OpenTSDB describes each monitored object with a time series (a timeline), and reads and writes to different timelines are independent of each other. OpenTSDB naturally became part of the third-generation monitoring architecture.
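To make the timeline model concrete, here is an illustrative write through OpenTSDB's standard HTTP /api/put endpoint; the metric name, tags, timestamp, and host are assumptions for the example:

```java
// Illustrative OpenTSDB write: each (metric, tags) pair identifies one
// timeline; here, the per-instance QPS of one database instance.
// Metric/tag names and the endpoint host are illustrative assumptions.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TsdbPutExample {
    public static void main(String[] args) throws Exception {
        String dataPoint = """
            {
              "metric": "mysql.qps",
              "timestamp": 1510329600,
              "value": 1234,
              "tags": { "cluster": "trade", "unit": "unit01", "instance": "mysql-001" }
            }
            """;
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://tsdb-host:4242/api/put"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(dataPoint))
                .build();
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```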

To eliminate the stability risk in DataHub, a distributed message queue was introduced to smooth peaks and fill valleys. Even if every DataHub node crashes, the data can be recovered by re-consuming messages. Distributed message queues such as Kafka or RocketMQ are by now mature and highly available.
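A minimal sketch of this buffering step, assuming Kafka; the topic name, record key, and payload format are invented for illustration:

```java
// Sketch of the buffering step: the collection side publishes metric packets
// to a Kafka topic, so downstream consumers can replay them after a crash.
// Topic name, key, and payload format are illustrative assumptions.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MetricPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by cluster+unit keeps related instances in one partition,
            // which the data-locality optimization described later relies on.
            String key = "trade:unit01";
            String payload = "{\"instance\":\"mysql-001\",\"ts\":1510329600,\"qps\":1234}";
            producer.send(new ProducerRecord<>("db-metrics", key, payload));
        }
    }
}
```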

The third-generation architecture was a great step forward compared with its predecessors. During Double 11 2016, the entire network of databases was monitored at 10-second granularity, and the core database clusters at 1-second granularity.

With the expansion of the Alibaba ecosystem and deepening globalization, various wholly-owned subsidiaries have been fully integrated into the Alibaba system; besides mainland China, there are businesses in the United States, Europe, Russia, and Southeast Asia. At the same time, new technologies keep emerging in Alibaba's database field: unitized deployment has become the norm, containerized scheduling is covering the whole network, storage-compute separation keeps advancing, and the same business database cluster may follow different deployment strategies in different units. Meanwhile the DBA team has not grown correspondingly: one DBA supporting the businesses of several subsidiaries is normal, and some DBAs also drive the adoption of new technologies. For database performance diagnosis, it has become ever more urgent to win back efficiency for DBAs and give them a macro-to-micro diagnosis path: a one-stop drill-down from the overall dashboard to cluster, to unit, to instance, and on to host, container, and so on.

Against these diagnostic requirements, the third-generation monitoring architecture started to fall short, mainly on the query side:

1) Queries for high-dimension performance diagnosis are slow. Take cluster QPS as an example: because the QPS of each instance is stored separately in OpenTSDB, querying QPS at the cluster dimension means scanning the QPS of every instance in the cluster and then summing them grouped by timestamp, which requires scanning a large amount of raw data.

2) OpenTSDB cannot express complex monitoring requirements, such as the trend of a cluster's average RT. The cluster average RT is not avg(RT of all instances) but sum(execution time) / sum(execution count). To get it, the monitoring system has to pull two timelines, compute the ratio internally, and only then render the page, so the user-perceived response time is long (see the sketch after this list).

3) Diagnosis over long time spans is slow. For example, displaying one month of performance trend means scanning 2,592,000 raw second-level data points (30 days x 86,400 seconds). Given browser performance, raw second-level data neither can nor needs to be shown; 15-minute granularity is enough.
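Point 2 deserves a concrete illustration. A correct cluster RT sums execution time and execution count separately before dividing; the sketch below, with made-up per-instance numbers, shows how far a naive average of per-instance RT can drift:

```java
// Why avg(per-instance RT) is wrong: instances with more executions must
// weigh more. The numbers are invented purely to show the difference.
public class ClusterRt {
    public static void main(String[] args) {
        // {execution time in ms, execution count} per instance
        double[][] instances = { {1000, 2000}, {300, 100} };

        double sumTime = 0, sumCount = 0, sumRt = 0;
        for (double[] inst : instances) {
            sumTime += inst[0];
            sumCount += inst[1];
            sumRt += inst[0] / inst[1];          // per-instance RT
        }
        double correct = sumTime / sumCount;      // sum(time) / sum(count) ~= 0.619 ms
        double naive = sumRt / instances.length;  // avg of per-instance RT  =  1.750 ms
        System.out.printf("correct=%.3f ms, naive=%.3f ms%n", correct, naive);
    }
}
```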

The OpenTSDB community has recognized this as well: since version 2.4 it has offered simple pre-aggregation capabilities. But in terms of functional flexibility, stability, and performance, OpenTSDB still cannot meet the requirements of DBPaaS second-level monitoring.


DBPaaS next-generation architecture

In the next-generation architecture, we upgraded from OpenTSDB to the more capable HiTSDB, and replaced the simple DataHub with a real-time pre-aggregation engine built on stream computing.

In the division of responsibilities, the complexity of monitoring and diagnosis requirements is pushed into the real-time pre-aggregation engine, so that queries to the time-series database are confined to a single timeline. This in turn requires the time-series database to deliver extreme performance on a single timeline. HiTSDB, the high-performance time-series database developed by a sister team at Alibaba, achieves extreme compression and read/write throughput by exploiting the characteristics of time-series data: evenly spaced timestamps and small changes between consecutive values. It is fully compatible with the OpenTSDB protocol and has been battle-tested on Alibaba Cloud.

The new architecture lets us design monitoring and diagnosis without being constrained by the storage layer. First, so that high-dimension performance trends can be queried directly, the pre-aggregation engine computes performance indicators at the business cluster, unit, and instance levels and writes each level into HiTSDB. Second, a library of aggregation functions for performance indicators was built; the aggregation formula of every indicator is configurable, so monitoring indicators can be defined freely. Third, time precision is reduced in advance into six levels: 1 second, 5 seconds, 15 seconds, 1 minute, 5 minutes, and 15 minutes, with different compression strategies applied to data at different precisions.
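As a rough illustration of the second and third points (the names and structure are assumptions, not the actual DBPaaS configuration), an aggregation-formula registry and the six precisions might look like this:

```java
// Hypothetical sketch of a configurable aggregation-formula registry and the
// six downsampling precisions described above. All names are illustrative.
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

public class AggregationConfig {
    // Counters already summed over the aggregation dimension
    // (instance, unit, or cluster) by the pre-aggregation engine.
    record Counters(double qps, double execTimeMs, double execCount) {}

    // Each derived indicator is a configurable formula over the summed counters.
    static final Map<String, ToDoubleFunction<Counters>> FORMULAS = Map.of(
        "cluster.qps",    c -> c.qps(),                       // plain sum
        "cluster.avg_rt", c -> c.execTimeMs() / c.execCount() // sum(time)/sum(count)
    );

    // The six precisions (in seconds) into which data is pre-downsampled.
    static final List<Integer> PRECISIONS_SECONDS = List.of(1, 5, 15, 60, 300, 900);
}
```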

Real-time computing engine

The real-time computing engine performs hierarchical aggregation at the instance, unit, and cluster levels, and a dedicated Bolt writes each level of aggregated results into HiTSDB. The choice of stream-processing platform is open; our program currently runs on JStorm, which gives us high availability for free.
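Independent of the streaming framework, the roll-up that these Bolts perform can be sketched as follows; the cluster, unit, and instance names are invented for illustration:

```java
// Framework-independent sketch of the hierarchical roll-up: per-instance
// values are summed into unit-level and cluster-level series before being
// written out. Sample values and names are illustrative assumptions.
import java.util.HashMap;
import java.util.Map;

public class HierarchicalRollup {
    record Sample(String cluster, String unit, String instance, long ts, double qps) {}

    public static void main(String[] args) {
        Sample[] batch = {
            new Sample("trade", "unit01", "mysql-001", 1510329600L, 1200),
            new Sample("trade", "unit01", "mysql-002", 1510329600L, 800),
            new Sample("trade", "unit02", "mysql-003", 1510329600L, 500)
        };

        Map<String, Double> unitQps = new HashMap<>();
        Map<String, Double> clusterQps = new HashMap<>();
        for (Sample s : batch) {
            // Instance-level values go to the TSDB as-is; higher levels are sums.
            unitQps.merge(s.cluster() + ":" + s.unit() + ":" + s.ts(), s.qps(), Double::sum);
            clusterQps.merge(s.cluster() + ":" + s.ts(), s.qps(), Double::sum);
        }
        System.out.println("unit level:    " + unitQps);     // unit01 sums to 2000, unit02 to 500
        System.out.println("cluster level: " + clusterQps);  // cluster "trade" sums to 2500
    }
}
```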


Real-time computing engine performance

The real-time computing engine uses only 0.1% of the total machines in the database fleet to compute second-level monitoring data for the whole network, processing more than 10 million performance indicators per second on average, with an average TPS of 6 million and a peak TPS of 14 million. The figure below shows the trend curve of HiTSDB TPS during Double 11.


Key optimization points

Achieving such high throughput with so little compute naturally required quite a few tricks:

1) The pre-aggregation uses incremental, iterative computation. We do not wait until all performance indicators within a time window have arrived before computing; instead of buffering every data point, we keep only the intermediate results, which saves an enormous amount of memory. Compared with the conventional approach, this optimization saves at least 95% of memory (see the sketch after this list).

2) On the collection side, performance data packets are merged: similar and adjacent packets are combined before being reported to Kafka, so that the JStorm program can process data in batches.

3) We use the characteristics of stream computing to achieve data locality: data collected from instances of the same cluster unit lands in the same Kafka partition. This reduces network transfer and Java serialization/deserialization during computation, cutting network traffic by 50%. A question for the interested reader: why not partition by instance, or by cluster?

4) We use JStorm's custom scheduling feature to place Bolts whose data is correlated in the same JVM, working together with the optimization above so that data flows within a single JVM as much as possible.

5) For the map/reduce-style data transfer that cannot be avoided, we transfer in batches wherever possible and reuse and trim the transmitted data structures, reducing redundant transmission and the pressure of serialization and deserialization.
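Referring back to point 1, the sketch below shows incremental window aggregation: each arriving sample is folded into a small running accumulator instead of being buffered until the window closes. The class and key format are illustrative assumptions:

```java
// Sketch of incremental window aggregation: rather than buffering all samples
// of a window and computing at the end, each arriving sample is folded into a
// small running accumulator per (key, window). Names are illustrative.
import java.util.HashMap;
import java.util.Map;

public class IncrementalAggregator {
    // Only the intermediate result survives, not the raw samples.
    static final class Acc {
        double sumExecTimeMs;
        double sumExecCount;
    }

    private final Map<String, Acc> windows = new HashMap<>();

    // Called once per incoming sample; constant memory per active (key, window).
    public void accept(String clusterUnit, long windowStart,
                       double execTimeMs, double execCount) {
        Acc acc = windows.computeIfAbsent(clusterUnit + "@" + windowStart, k -> new Acc());
        acc.sumExecTimeMs += execTimeMs;
        acc.sumExecCount += execCount;
    }

    // Called when the window closes: derive the indicator and drop the state.
    public double closeWindow(String clusterUnit, long windowStart) {
        Acc acc = windows.remove(clusterUnit + "@" + windowStart);
        return acc == null ? 0.0 : acc.sumExecTimeMs / acc.sumExecCount; // average RT
    }
}
```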


The future

Alibaba DBPaaS's network-wide second-level monitoring makes database management and control digital. Over the past year we have accumulated a large amount of valuable structured data, and as big data and machine learning technologies develop, intelligent database management and control becomes possible:

1) Intelligent diagnosis: building on the existing all-round, blind-spot-free monitoring and combining it with event tracking to locate problems intelligently.

2) Scheduling optimization: by analyzing the resource profile of each database instance, instances with complementary resource usage can be scheduled together, ultimately saving cost.

3) Budget estimation: before each big promotion, analyze the historical operating state of the databases and derive each database cluster's capacity needs from the business transaction-volume target, providing a basis for automatic scale-out.
