Alibaba’s monitoring platform has gone through many iterations and replacements, and along a winding path it has gradually evolved from simple automation toward intelligent operation and maintenance.

On May 18-19, 2018, the Global Software and Operation Technology Summit hosted by 51CTO was held in Beijing.

In the “AIOps under Containers” sub-forum, Cheng Chao, head of monitoring at Alibaba Group, gave a wonderful talk on the theme “The Development Road of Alibaba Monitoring from Automation to Intelligence”.

This article shares Alibaba’s best practices in building a very large monitoring platform with second-level granularity, in the following three parts:

  • Have to upgrade

  • The science of uniting the internal

  • Looking at the stars

Have to upgrade

Like most other companies, Alibaba initially adopted the open source combination of Nagios + Cacti.

The biggest problem with this combination is that it doesn’t scale: once the data volume reaches a certain level, all sorts of problems arise.

In addition, since we had not studied these open source tools in depth, we couldn’t customize or modify them. In 2009, we decided to abandon this combination and build our own monitoring system.

As the chart above shows, the self-built system supported Alibaba’s growth for the next five years, and some functions are still used today.

This system introduced the concept of a Domain, that is, a Monitor Domain. Its highlight is that it solved the data volume problem and supports horizontal scaling.

On the database side, there were no mature HBase or other NoSQL solutions at the time, so we used MySQL. But it is well known that MySQL is not well suited to monitoring data.

For example, each time we needed to store the data of a new monitoring item, we had to create a database and then a table for it, which was obviously quite tedious.

Then, as HBase matured, we switched the entire database over to HBase. This brought a lot of convenience to the development team and improved the monitoring quality of the whole system.

The picture above shows Alibaba’s newest and most important monitoring platform, Sunfire. In terms of storage, we used HBase before, but now we have changed to HiTSDB (High-performance Time Series Database).

In addition, for data collection we used to install an Agent on each machine, but the current system mainly collects logs, including service logs, system logs, message queue logs, etc.

By exposing a SQL interface, we abstract away the data access layer while keeping the upper layer unchanged, which also lets us express the concept of a “tenant.”

Unlike many monitoring systems that push data, our architecture pulls data from the top down.

We have something in common with Prometheus (which supports a multi-dimensional metric data model, has its servers fetch data periodically over HTTP, and offers flexible queries for monitoring purposes), although our back end is a little stronger.

The current scale of the system is reflected in:

  • The number of internal tenants exceeds 90. Tenants here refer to Tmall, Taobao, Hema, Youku, AutoNavi and other application systems.

  • The number of machines is more than 4,000. This was the figure for last year’s Singles’ Day; these are not all physical machines, but mostly 4-core, 8 GB virtual machines.

  • The number of applications is more than 11,000.

  • Processing capacity is about 2 terabytes of data per minute. This is also the figure from last year’s Singles’ Day.

The science of uniting the internal

Let’s take a look at the specific features of Alibaba’s current monitoring system and the business pain points it can solve:

Zero-Copy

Based on past monitoring experience, when business teams discover that the CPU jitter they observe is actually caused by the monitoring system itself, they would rather not be monitored at all.

Therefore, after optimization, the Agent on each monitored host does not consume business resources or do any processing; the raw data is simply gathered and “pulled” to the central node. “Trade bandwidth for CPU” is a principle we follow when designing the Agent of the monitoring system.

Also, we don’t even compress the logs, because compression also uses the CPU of each host.

Light-Akka

On the framework side, we initially built on Akka, attracted by its advanced design concepts and performance.

But it turned out that, because it is written in Scala, we could not get messages delivered “once and for all,” meaning 100% reachable. So we extracted what we needed most and re-implemented it in Java.

Full-Asynchronous

Because of the large amount of data, a single blocked node anywhere in the running monitoring system is fatal.

We “asynchronize” the architecture’s key core links by sending tasks to the RegisterMapper.

To make the core operations along the entire monitoring link asynchronous, we use technologies such as Netty for network transmission and an asynchronous Java HTTP client. With a few modifications, you can achieve a fully asynchronous effect.

LowPower-Agent

Since the main task of the Agent is to fetch logs, we greatly reduce CPU consumption and achieve a low-power Agent by continually estimating how often each log is written and by reading sequentially from the “cursor” recorded at the last read.
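The cursor idea can be illustrated with a minimal sketch. It assumes the cursor is simply a byte offset into the log file and that a shrinking file means rotation; the function name `read_new_lines` is hypothetical and this is not the actual Agent code.

```python
import os


def read_new_lines(path, cursor):
    """Return (lines, new_cursor) for data appended after byte offset `cursor`."""
    size = os.path.getsize(path)
    if size < cursor:
        # Assumption: a smaller file means it was rotated or truncated, so start over.
        cursor = 0
    if size == cursor:
        # Nothing new was written; the file content is not touched at all.
        return [], cursor
    with open(path, "rb") as f:
        f.seek(cursor)
        data = f.read(size - cursor)
    # Consume complete lines only; a partial trailing line waits for the next poll.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, cursor + end
```

Each poll costs one `stat` call plus a read of only the new bytes, so an idle log costs almost nothing. A real Agent would also persist the cursor and adapt the polling interval to the observed write cycle, which this sketch leaves out.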

MQL and self-ops are also two important aspects in the figure above, which we will discuss further:

As the various services have many functions, a huge amount of data needs to be monitored, and data types and formats are complex, which has led to a variety of API invocation methods.

At Alibaba, we have developed a query language for monitoring data that follows standard SQL syntax: Monitor Query Language (MQL). It unifies the different kinds of requirements so that all monitoring systems can be queried in the same way.

In theory, no matter how complex the request, it can be expressed in MQL as an SQL statement. The syntax is defined by us, as shown in the white text above.

At the bottom of the figure above is a real-world example that queries CPU data at 5-minute intervals from 1 hour ago.

As you can see, it is very simple to use. People familiar with SQL can write it almost without learning anything new.

Above is the self-ops interface, which is a little rough because it’s for our own internal use.

Given the daily operation and maintenance workload of 4,000 machines, and although different business systems have different monitoring tools, we think it is necessary to make our own monitoring system self-operating and self-maintaining.

Therefore, we set up our own internal CMDB from the perspective of machine management, including software version control, release packaging and other functions.

In this way, we no longer depend on various middleware and other components, which also secures the overall stability of the monitoring system. In addition, the system brings us some additional benefits.

For example, Alibaba can easily “go abroad” and take over the systems of overseas acquisitions such as Lazada.

As we all know, the monitoring system is generally established after the service system. Different services have different types of logs, and the same features in the logs will have different formats.

Therefore, we put a lot of effort into the Agent to make our system compatible with all possibilities. For example, different systems express a date in many different ways.

So we’ve included seven common and uncommon date formats here. At the same time, we also accommodate different ways of writing log directory paths.
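A tolerant timestamp parser of this kind can be sketched as follows. The article does not list the seven formats, so the entries in `CANDIDATE_FORMATS` are assumptions chosen for illustration, and `parse_log_time` is a hypothetical name.

```python
from datetime import datetime

# Assumed formats for illustration; the real Agent's list is not given in the talk.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y/%m/%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%d/%m/%Y %H:%M:%S",
    "%Y%m%d%H%M%S",
]


def parse_log_time(text):
    """Try each known format in turn; return a datetime, or None if none match."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # This format does not fit; try the next one.
    return None
```

Supporting a new business format then means adding one entry to the list rather than asking the business side to change its logs. A low-power Agent would probably remember which format matched for a given log file instead of retrying all of them on every line.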

It can be seen that when building the Agent, we should not always expect the business side to adapt to us; the core value of the whole monitoring system lies in adapting to the business.

As mentioned earlier, we implemented our own MQL, but we still used HBase on the back end. While HBase was very stable, it was a bit “sluggish” when it came to further development. It struggles even to support level 2 caching, not to mention aggregation across dimensions.

Therefore, to make MQL work well, we need to switch to HiTSDB, Alibaba’s internal time series database based on the OpenTSDB specification.

To adapt it to large-scale monitoring, we are continuously optimizing HiTSDB, and this work is expected to be completed before this year’s Singles’ Day.

Above is an overall framework diagram; our monitoring platform is located in the upper part. Of course, inside Alibaba there are actually several different monitoring systems, each of which has its own unique value in its vertical field.

Since our system is the largest, we wanted to unify the various technical components under the monitoring platform.

As shown in the red “Computing framework” section, it makes up a very large part of the structure, so we have built disaster recovery, performance monitoring, and asynchrony all into it.

At present, a single application at Alibaba can involve more than 10,000 virtual machines. So how do the thousands of monitoring machines responsible for collecting log events perform the Map step and write the results directly into HBase after collecting this application’s indicators?

In the current transaction pattern, each transaction generates a row of logs. We collect a huge amount of log information in a minute.

To turn them into transaction counts, the common practice is a two-step process as in Hadoop: extract the numbers in the Map layer and aggregate them in the Reduce layer.

The problem is that the aggregation dimensions are hard-coded in the Reduce layer. For example, if the 10,000 machines were previously split along several unit dimensions and we now need to add another one, we have to modify the code accordingly.

After the data is collected, it is aggregated by equipment room, application, and group and then stored in HBase.

Now with the HiTSDB solution, we simply do a Map, convert log data to Key/Value, and throw it directly into HiTSDB, so there is no need for the Reduce layer.
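The difference can be sketched in a few lines. The log layout, the metric name `trade.amount`, and both function names are assumptions made for illustration; the in-memory `aggregate` function only stands in for what the time series database does at query time and is not the HiTSDB API.

```python
from collections import defaultdict


def map_log_line(line):
    """Map step: turn one transaction log line into a tagged key/value data point."""
    # Assumed layout: "<epoch_seconds> app=<app> idc=<room> unit=<unit> amount=<n>"
    ts, *fields = line.split()
    tags = dict(field.split("=", 1) for field in fields)
    value = float(tags.pop("amount"))
    return {"metric": "trade.amount", "timestamp": int(ts), "tags": tags, "value": value}


def aggregate(points, by):
    """Sum values grouped by any tag, chosen at query time instead of in a Reduce step."""
    totals = defaultdict(float)
    for point in points:
        totals[point["tags"].get(by, "unknown")] += point["value"]
    return dict(totals)
```

Because every dimension travels with the data point as a tag, grouping by a new dimension is a different query argument, for example `aggregate(points, "unit")` instead of `aggregate(points, "idc")`, and no collection-side code has to change.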

The advantage is that, by omitting the other steps and using only the MQL API, we can produce simple statistics.

To sum up: “Make the front light, make the back heavy.” This is also a big change in our architecture.

This is a diagram of the Prometheus architecture, which is very similar to our Sunfire and also operates in a “pull” fashion.

So we’re trying to be fully compatible with Prometheus’s front-end ecosystem, but not its back end.

As shown on the right, the Prometheus front end offers many exporters, even exporters for IoT. Because we pull data in the same way, compatibility takes little effort.

As mentioned, we use 4,000 machines for the monitoring system, which is a lot of overhead. Another benefit of compatibility is cost savings.

The old pattern of pulling logs as they are consumes both bandwidth and central computing resources.

Now, following the Prometheus approach, we pull statistics that only count transactions per unit of time, so the total volume is much smaller.
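To make the saving concrete, here is a minimal sketch of counting on the host and exposing only the totals in the Prometheus text exposition format. The log layout, the metric name `transactions_total`, and the function names are assumptions for illustration, not part of Sunfire.

```python
from collections import Counter


def count_transactions(log_lines):
    """Count transactions per (app, status) locally instead of shipping raw lines."""
    counts = Counter()
    for line in log_lines:
        # Assumed layout: "<epoch_seconds> <app> <status>"
        _, app, status = line.split()
        counts[(app, status)] += 1
    return counts


def render_exposition(counts):
    """Render the counters as Prometheus text exposition lines for the server to pull."""
    lines = ["# TYPE transactions_total counter"]
    for (app, status), n in sorted(counts.items()):
        lines.append(f'transactions_total{{app="{app}",status="{status}"}} {n}')
    return "\n".join(lines) + "\n"
```

However many log lines arrive in a scrape interval, the pulled payload stays at one line per label combination, which is where the bandwidth and central computing savings come from. The trade-off is that a little counting work moves back onto the monitored host.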

In terms of alarm and notification, we achieved the following two effects through “cutting out” attempts:

  • Roughly cut out false positives in alarms and notifications.

  • Suppress alarm and notification outbreaks to avoid alarm storms.
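The talk does not say how these two effects are implemented. As a sketch only, two common techniques are shown here: requiring several consecutive breaches before notifying (to drop transient false positives) and a quiet period per alarm key (to damp storms). The class name `AlarmGate` and its parameters are hypothetical.

```python
class AlarmGate:
    """Decide whether a breached check should actually produce a notification."""

    def __init__(self, min_consecutive=3, quiet_seconds=300):
        self.min_consecutive = min_consecutive  # breaches required before notifying
        self.quiet_seconds = quiet_seconds      # minimum gap between notifications
        self._streak = {}                       # alarm key -> consecutive breaches
        self._last_sent = {}                    # alarm key -> time of last notification

    def should_notify(self, key, breached, now):
        if not breached:
            self._streak[key] = 0
            return False
        self._streak[key] = self._streak.get(key, 0) + 1
        if self._streak[key] < self.min_consecutive:
            return False  # Probably a transient spike; wait for confirmation.
        last = self._last_sent.get(key)
        if last is not None and now - last < self.quiet_seconds:
            return False  # Already notified recently; suppress the repeat.
        self._last_sent[key] = now
        return True
```

With `AlarmGate(min_consecutive=2, quiet_seconds=60)`, a single breached sample stays silent, the second consecutive breach notifies, and further breaches within the next 60 seconds are swallowed.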

Looking at the stars

We want to connect systems with a comprehensive, full-link diagram. I don’t think business links can be generated automatically.

The diagram above illustrates the relationship between the application and the machine. But most application developers don’t like to use this diagram because it’s too complex, detailed, and not layered.

So we brought in people from the business side to draw a diagram by hand that reflected their focus in detail. Based on their hand-drawn diagrams, we made the demo diagram above. In this year’s 618 campaign, we implemented monitoring of various systems based on this diagram.

Although most of us in monitoring come from operation and maintenance backgrounds in development and script writing, we should not limit ourselves to solving current operation and maintenance problems; we should also pay more attention to the business.

Last year Alibaba disbanded its entire operations team and integrated its members into development. With DevOps, we’re adding platform layers, tool layers, automation, intelligence, and so on.

Without nanny-like operation and maintenance services, the tools team and development team need to build a set of tools that, horizontally, follows an all-dimension + full-link model.

Vertically, it includes: network quality, application, line index, APM, network itself, IDC, and data. And this diagram ties them all together.

If you look closely, you’ll notice that in the figure above, the metrics below each square are exactly the same. Why is that?

In its chapter on monitoring, the book Site Reliability Engineering: How Google Runs Production Systems identifies the “four golden signals”: traffic, latency, success rate (the book calls this signal “errors”), and saturation.

We can then use these four indicators to assess and troubleshoot any business, system, or application.

So here, we also use these four indicators as business and application standards to measure various business problems.

For example, as part of standardization at the data level, the year before last we started building an AI-related “smart baseline”.

You may have noticed that Taobao’s transaction volume follows a certain pattern: it is relatively low at night when people sleep and at mealtimes; after eight o’clock, volume increases; and the 10 pm Juhuasuan sale also drives volume up.

So the idea of a smart baseline is to simulate this curve and be able to predict what’s going to happen in the next half hour.

With this curve, the business side, development, operation and maintenance, and business operations can all configure their own alarm rules.

For example, we could configure an alarm between 2 am and 4 am if the actual transaction data falls by more than 20% relative to the baseline (trading volume is low at night and volatility becomes apparent once the transaction volume is low). During the day, an alarm is given when there is a 5% deviation from baseline.
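The two example rules can be written down as a small sketch. The 20% and 5% thresholds and the 2 am to 4 am window come from the text above; treating every other hour as “day” is a simplifying assumption, and `should_alarm` is a hypothetical name.

```python
def should_alarm(hour, actual, baseline):
    """Apply the example rules: a >20% drop at 2-4 am, a >5% deviation otherwise."""
    if baseline <= 0:
        return False  # No meaningful baseline to compare against.
    deviation = (actual - baseline) / baseline
    if 2 <= hour < 4:
        # Low night-time volume is noisy, so only a drop beyond 20% raises an alarm.
        return deviation < -0.20
    # Assumption: all other hours use the daytime rule, in either direction.
    return abs(deviation) > 0.05
```

For instance, 70 transactions against a baseline of 100 at 3 am raises an alarm, while 85 does not; at 2 pm, 106 against a baseline of 100 already does.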

In this way, we successively developed thousands of smart baselines and applied them as basic specifications in smart monitoring scenarios, eventually increasing the accuracy rate to more than 80%.

It can be seen that by standardizing business indicators, we can measure the problems a system encounters through indicators such as success rate.

When it comes to AI, I think we are still at the stage of “weak intelligence”, and we cannot jump directly to strong AI.

There is a saying that “nowadays there is more demand for weak intelligence than strong intelligence”, so we need to have a transitional stage.

If we click underneath the little squares on the previous page, this picture will appear (although the real scene is more complicated than this picture). The figure reflects the business metrics and system metrics, while on the right is the intelligence analysis made.

In the previous “full link” picture, there was a red order. In the traditional model, developers create a process in their heads: start with one metric, check it, and if it shows up as normal, quickly move on to the next, and so on.

Our system, then, should scan on the developers’ behalf all the possibilities they would otherwise work through in their heads, namely the relevant metrics or blocks in the image above, and use our algorithms to find the point of failure.

It was simple, not even smart, but it worked. This is what I call “weak intelligence,” and we’re going to roll it out on a large scale this year.
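The talk does not describe the algorithm, so the following is only a sketch of the idea under one assumption: each related metric has a current value and a baseline, and the scan ranks metrics by how far they deviate. The function name `find_suspects` and the 20% threshold are hypothetical.

```python
def find_suspects(metrics, threshold=0.2):
    """Scan every related metric and return the abnormal ones, most deviant first.

    metrics maps a metric name to an (actual, baseline) pair.
    """
    suspects = []
    for name, (actual, baseline) in metrics.items():
        if baseline == 0:
            continue  # Relative deviation is undefined without a baseline.
        deviation = abs(actual - baseline) / abs(baseline)
        if deviation > threshold:
            suspects.append((deviation, name))
    suspects.sort(reverse=True)
    return [name for _, name in suspects]
```

Instead of a developer checking one metric after another, the scan covers all of them at once and hands back a short ordered list of where to look first.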

It can be seen that “weak intelligence” is more important than “strong intelligence”, which is also an example of AI landing in the field of monitoring.

Finally, I hope you can look up to the stars while doing your daily development and operation work.

So I’ve got a picture for you here, taken from a much larger picture. I used it to explain the value of CMDB to my boss, and it worked.

As shown in the figure, you can assume that you are the boss of a business and try to think from the boss’s point of view about how to increase revenue and reduce costs for the business, especially for IT.

For example, under normal circumstances monitoring does not directly generate revenue on Aliyun, which shows up in the revenue dimension. The costs we measure also include additional costs, the “EX costs” shown in the figure.

For example, we can consider whether it is really necessary to spend the cost of more than 4,000 machines on the monitoring system.

Thus, the stargazing “observation points” can start from the three green points in the figure, namely MTTR (mean time to recovery), prevention, and measurement.

These are the areas that business operators are most concerned about. And you can think a little bit more about the other nodes in the diagram.

Author: Cheng Chao

Editors: Chen Jun, Tao Jialong, Sun Shujuan


More than 10 years of experience in operation and maintenance system development, now working in Alibaba Infrastructure Business Group, responsible for monitoring across Alibaba Group. Led the construction of the first generation of the Alibaba CMDB system. The monitoring business now covers all business groups of Alibaba.
