1

Preface

As Internet technology has become ubiquitous, data monitoring has become essential for every company. In recent years, with the maturity of excellent monitoring tools (such as Zabbix, Graphite, and Prometheus), companies have built their own monitoring systems to analyze overall business traffic and handle abnormal alarms. However, as system complexity grows and microservices mature, monitoring faces new problems, such as tracing call relationships along a link and locating faults across systems.

To reduce the monitoring development burden on each business line, the iQIYI technology product team has built a fully automated full-link monitoring platform. It provides unified monitoring standards and basic monitoring capabilities, strengthens fault location and in-depth analysis, and improves monitoring accuracy and transparency. Based on our monitoring experience, this article shares the design of this full-link automatic monitoring platform.

2

Background

In recent years, monitoring tools such as the ELK Stack, Cat, and Google Dapper have also tried to solve new problems in machine data analysis and real-time log processing. Our analysis of these tools concluded that the ELK Stack relies heavily on ES, which makes its storage and query capabilities hard to scale; Cat focuses on the Java back end; and although full-link monitoring based on the Google Dapper idea is relatively mature, most open-source implementations lack in-depth analysis and have poor query performance, as shown in the following table:

| Dimension | ELK Stack | Cat | Pinpoint/SkyWalking |
|---|---|---|---|
| Visualization | Weak | Average | Average |
| Reports | Rich | Rich | Medium |
| Metrics | No | Yes | No |
| Topology | No | Simple dependency graph | Good |
| Instrumentation | Logstash/Beats | Code-intrusive | Non-intrusive, bytecode enhancement |
| Query | Weak | Weak | Weak |
| Community | Good, Chinese documentation available | Good, well documented | Average, documentation lacking, no Chinese |
| Adoption cases | Many companies | Ctrip, Dianping, Lufax, Liepin | None listed |
| Origin | ELK | eBay CAL | Google Dapper |

On the other hand, as microservices mature, real-time monitoring becomes more important. Basic monitoring such as Prometheus solves basic metrics and alarm problems, while partial implementations of full-link monitoring solve link tracing. The two capabilities complement each other, yet they have not been integrated into a unified full-link monitoring platform.

Based on the analysis of these tools, we built a unified full-link automatic monitoring platform on top of our existing basic monitoring and log collection, incorporating the idea of Google Dapper, which the company's other business lines can access flexibly and quickly. On top of Google Dapper we added caching and offline processing, which greatly improves query performance; an in-depth analysis module that automatically diagnoses individual users' faults; and monitoring metrics on the link UI, so that metrics are visible while viewing service links. With this upgraded experience, performance bottlenecks are easier to find, resource scaling can be guided, and capacity warnings are visible.

Below, we break full-link monitoring into four parts, namely link collection, metric collection, log collection, and in-depth analysis, and implement each of them in the full-link monitoring platform.

Figure 1 Overall practice

3

Platform introduction

General overview

  • Link collection, covering the call chain and the service topology, is the thread that ties full-link analysis together.
  • Metric collection is integrated into the service link, giving the whole link basic monitoring capabilities.
  • The data source for log collection is also the data source for full-link analysis.
  • In-depth analysis includes offline and online modules, which meet the fault location requirements across the whole link.

Figure 2 Full link analysis process

Link collection

Link collection is divided into two parts: the call chain and the service topology.

  • Call chain: A call between two systems is recorded as a Span, and each Span records service information and context. Spans are stitched together by the Trace ID, a unique value generated for each request. A call chain is a directed acyclic graph (DAG) composed of multiple Spans that represents the complete processing of a request. Each node in the graph is a Span, and the edges represent the call relationships between services (or within a service). By analyzing Spans in depth, we can reconstruct the call chain of every request. A minimal sketch of this reconstruction follows the figure below.

Figure 3 Directed acyclic graph (DAG)
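To make the relationship between Spans and the Trace ID concrete, here is a minimal Java sketch (the class and field names are hypothetical, not the platform's actual schema): Spans that share a Trace ID are linked to their parents by parentSpanId, which recovers the DAG of one request.

```java
import java.util.*;

// Hypothetical span record; field names are illustrative, not the platform's schema.
class Span {
    final String traceId;      // same value for every span of one request
    final String spanId;       // unique within the trace
    final String parentSpanId; // null for the root span
    final String service;      // service that produced the span
    final long startMs;
    final long durationMs;
    final Map<String, String> tags = new HashMap<>(); // custom Tag information

    Span(String traceId, String spanId, String parentSpanId,
         String service, long startMs, long durationMs) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.service = service;
        this.startMs = startMs;
        this.durationMs = durationMs;
    }
}

// Rebuilds the call chain (a DAG keyed by parent span) for the spans of a single Trace ID.
class CallChainBuilder {
    static Map<String, List<Span>> buildTree(List<Span> spansOfOneTrace) {
        Map<String, List<Span>> children = new HashMap<>();
        for (Span s : spansOfOneTrace) {
            if (s.parentSpanId != null) {
                children.computeIfAbsent(s.parentSpanId, k -> new ArrayList<>()).add(s);
            }
        }
        return children; // parent spanId -> child spans (the edges of the DAG)
    }
}
```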

To build the call chain, services must first be instrumented; an Agent then collects their logs for in-depth analysis. The instrumentation process must guarantee low overhead (the impact on the system is small enough) and good scalability (it can fully keep up with growth over the next few years). There are two instrumentation approaches: (1) Code-intrusive mode: instrument the relevant components manually and report data according to the specification (a sketch follows the architecture figure below). (2) Non-intrusive mode (guaranteeing application-level transparency): agents for Java, Go, Lua, and other languages are supported; they use probe technology, require no code changes in the client application, and are easy to use and integrate.

Figure 4 Link collection architecture diagram
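For the code-intrusive mode, a minimal manual instrumentation sketch might look like the following. The header names and the reporting step are assumptions for illustration; real tracing systems define their own propagation format and collectors.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.UUID;

// Hypothetical header names; the platform's real propagation format may differ.
class ManualTracing {
    static final String TRACE_HEADER = "X-Trace-Id";
    static final String SPAN_HEADER  = "X-Parent-Span-Id";

    // Reuse the upstream Trace ID if present, otherwise start a new trace at the entry point.
    static String ensureTraceId(String incomingTraceId) {
        return incomingTraceId != null ? incomingTraceId : UUID.randomUUID().toString();
    }

    // Wrap a downstream HTTP call in a span and propagate the trace context.
    static HttpResponse<String> tracedCall(HttpClient client, String url,
                                           String traceId, String parentSpanId) throws Exception {
        String spanId = UUID.randomUUID().toString();
        long start = System.currentTimeMillis();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header(TRACE_HEADER, traceId)
                .header(SPAN_HEADER, spanId)
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        long durationMs = System.currentTimeMillis() - start;
        // In a real integration the span would be reported to the collector (e.g. via Kafka).
        System.out.printf("span trace=%s span=%s parent=%s url=%s status=%d rt=%dms%n",
                traceId, spanId, parentSpanId, url, response.statusCode(), durationMs);
        return response;
    }
}
```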

  • Our design

1) Basic link analysis capabilities:

① Call chain retrieval: Trace links go down to the interface level, and the call relationships of a request can be viewed by its Trace ID.

② Each call relationship contains the response time of every node, the request method and parameters, and custom Tag information, which makes it easier to query and optimize links.

2) Link analysis optimizations:

① Conditions are added to every query to eliminate the response delay caused by full table scans.

② Interaction is improved; the folding operations on the old pages were inconvenient.

③ The non-intrusive Agent coverage provided by SkyWalking is incomplete, so we did secondary development for some frameworks and versions.

  • Service topology: Kibana in the ELK Stack has no service topology capability. SkyWalking and Zipkin do provide service topology, but their visualization is weak and their functionality limited. In addition, the full-link implementations we have seen so far do not include the client node.
  • Our design

We package the client logs and add a front-end node to the link. Building on SkyWalking, the UI pages were upgraded to improve interaction and visual presentation. The storage component was changed from a relational database to a graph database, giving the UI more room for logical presentation and faster response times. Together this completes the whole link and provides a friendlier visualization (for example, we support three levels of presentation: business lines first, then services within a business line, then calls within a service). A sketch of how spans are rolled up into topology edges follows the figures below.

Figure 5 Line of business dimensions

Figure 6 Service dimensions

Figure 7 Service dimension switching view
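Before the service topology can be written to the graph database, spans have to be rolled up into service-level edges. The following sketch (reusing the hypothetical Span class from the earlier snippet; it does not reflect the platform's actual pipeline) counts calls and accumulates latency per caller/callee pair, which is exactly the edge data a node-and-edge graph model needs.

```java
import java.util.*;

// Aggregates spans into service-level topology edges: caller -> callee with call count and latency.
// Reuses the hypothetical Span class from the earlier sketch; names are illustrative only.
class TopologyAggregator {
    static class Edge {
        long calls;
        long totalDurationMs;
        double avgLatencyMs() { return calls == 0 ? 0 : (double) totalDurationMs / calls; }
    }

    static Map<String, Edge> aggregate(List<Span> spans) {
        // Index spans by spanId so a child can look up the service of its parent.
        Map<String, Span> byId = new HashMap<>();
        for (Span s : spans) byId.put(s.spanId, s);

        Map<String, Edge> edges = new HashMap<>();
        for (Span s : spans) {
            Span parent = s.parentSpanId == null ? null : byId.get(s.parentSpanId);
            if (parent == null || parent.service.equals(s.service)) continue; // keep cross-service edges only
            String key = parent.service + " -> " + s.service;
            Edge e = edges.computeIfAbsent(key, k -> new Edge());
            e.calls++;
            e.totalDurationMs += s.durationMs;
        }
        return edges; // each entry becomes one edge in the graph database
    }
}
```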

In addition, monitoring metrics are added to every node of the link, aggregating machine performance and service metrics. Monitoring is displayed directly on the links, which makes architecture bottlenecks easier to analyze intuitively. User-defined metrics can also be connected to the existing ones (as extensions of the basic metrics). When a node is abnormal, besides triggering an alarm, its color in the visualization is made eye-catching (Warn is yellow and Error is red), so bottlenecks can be spotted from the global view and used to guide scaling of the affected services or flow control and degradation. A sketch of such a threshold-to-status mapping follows the figure below.

Figure 8 Monitoring metrics
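As a small illustration of how a node could be colored, here is a hedged sketch of a threshold-to-status mapping; the thresholds and names are invented for the example and are not the platform's actual alarm rules.

```java
// Hypothetical status mapping for coloring topology nodes; thresholds are illustrative only.
enum NodeStatus { NORMAL, WARN, ERROR }

class NodeStatusMapper {
    // Example rule: color by success rate, with assumed thresholds of 99.9% (warn) and 99% (error).
    static NodeStatus fromSuccessRate(double successRatePercent) {
        if (successRatePercent < 99.0)  return NodeStatus.ERROR; // rendered red on the link view
        if (successRatePercent < 99.9)  return NodeStatus.WARN;  // rendered yellow on the link view
        return NodeStatus.NORMAL;
    }
}
```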

We treat the front-end and back-end service dependency links described above as logical links, and we are adding another layer of physical links to display the network and machine-room topology.

Metric collection

Metric collection itself can be implemented with Graphite, Prometheus, or another time-series-database-based monitoring system. The problem is that each business line has its own monitoring, and for the same metric, such as success rate, the algorithms differ because of storage or performance constraints: some divide total successes by total requests, while others take each machine's per-minute success rate and then compute an arithmetic or weighted average over the machines. Only when monitoring is unified can the data analysis of the overall architecture be described consistently. The worked example below shows how the two algorithms can disagree.
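A small worked example (with invented numbers) makes the discrepancy concrete: under uneven traffic, the arithmetic mean of per-machine success rates and the ratio of total successes to total requests give noticeably different results.

```java
// Illustrative only: shows why per-machine averaging and global totals disagree under uneven traffic.
class SuccessRateExample {
    public static void main(String[] args) {
        long[] requests  = {1000, 10};  // machine A is busy, machine B is almost idle
        long[] successes = { 980, 10};  // A: 98% success, B: 100% success

        // Algorithm 1: arithmetic mean of per-machine success rates.
        double meanOfRates = 0;
        for (int i = 0; i < requests.length; i++) {
            meanOfRates += (double) successes[i] / requests[i];
        }
        meanOfRates = meanOfRates / requests.length * 100;

        // Algorithm 2: total successes divided by total requests (traffic-weighted).
        long totalReq = 0, totalOk = 0;
        for (int i = 0; i < requests.length; i++) { totalReq += requests[i]; totalOk += successes[i]; }
        double weighted = (double) totalOk / totalReq * 100;

        System.out.printf("mean of per-machine rates: %.2f%%%n", meanOfRates); // 99.00%
        System.out.printf("total-based success rate:  %.2f%%%n", weighted);    // 98.02%
    }
}
```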

  • Our design

Our design mainly optimizes the process: monitoring metrics are unified, and every business line reports data according to the same specification. The monitoring and alarm algorithms are then also unified, covering Success Rate, QPS, RT, P999, jitter alarms, anomaly detection, capacity estimation, and so on.

With unified monitoring, every business line measures the same thing; once the data is accurate, more problems can be found through comparison. It also reduces each business line's monitoring development cost.

The resulting metrics are displayed on the links described above, improving the experience and making it easy to find bottlenecks in the overall architecture.

Log collection

  • Log collection is divided into two stages:
  1. The ELK Stack log monitoring stage uses Logstash/Beats + Kafka + ES. The advantage is flexible collection; the disadvantage is that ES storage and query capabilities are weak.
  2. In the full-link monitoring stage, Mogujie, for example, uses Logstash + Kafka + ES + Hadoop. This solves the ES storage capacity problem but not the query capacity problem.
  • Our design

At launch, although we used Spark/Flink and other big data tools to collect the data, the OPS was so high that there was still a delay. After several optimizations, the final solution is as follows:

Figure 9 Log collection process

Implementation details:

① Client logs are sent to Kafka over HTTP; back-end logs are collected automatically by Logstash.

② When data lands in Kafka, the topics are partitioned in advance according to traffic volume. This avoids repartitioning in the downstream Spark streaming tasks, which is also a time-consuming operation.

③ Spark streaming tasks set the parallelism and are split into multiple tasks that consume Kafka in parallel and write to the corresponding storage components. Writing to multiple storage components from a single stream is avoided, which reduces total latency several-fold.

④ Hikv and ES store only the required indexes; the original logs are stored in HBase, and the data in the different stores is associated through these indexes.

⑤ For ES, we applied most of the well-known technical optimizations, such as increasing index.refresh_interval, disabling refresh and setting the number of replicas to 0 during bulk writes, using ES auto-generated IDs, adjusting the mapping fields, keeping each bulk request under 10,000 documents, and tuning the batch size and batch interval. Finally, we optimized at the machine level, upgrading old CPUs to new ones and HDDs to SSDs, and the effect was obvious (the difference between a ten-year-old car and a new Ferrari).

⑥ The large volume of logs requires a lot of storage space, and the cache component is expensive. For the Hikv store, because the log value is more than a plain string, we first serialize it with Protocol Buffers (compared with XML, the serialized data is roughly 1/3 to 1/10 of the size and parsing is about 20 to 100 times faster), which improves serialization performance, and then GZIP the PB output (about 5x compression) to improve the compression ratio (a sketch of this encoding follows this list). In addition, introducing the cache component, together with the in-depth analysis described later, fundamentally reduces ES query pressure and keeps the architecture stable.

⑦ Storage is isolated by business line: each business line writes to its own storage, so they do not affect each other's performance. At present, a single business line's peak log storage OPS is in the hundreds of thousands per second, the Hikv data volume is several terabytes (with an average record length under 1 KB after compression), ES holds hundreds of gigabytes per hour (depending on how many hours are retained), and HBase lands tens of terabytes of logs per day. Log collection is near real-time.
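As a concrete illustration of step ⑥, here is a minimal Java sketch of the encode path before a value is written to the cache store. The Protocol Buffers step is assumed to have already produced a byte array (generated PB classes are omitted); only the GZIP wrapping, using the standard JDK classes, is shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Compresses a PB-serialized log value before it is written to the cache store.
// The PB serialization itself (e.g. message.toByteArray()) is assumed and omitted here.
class LogValueCodec {
    static byte[] compress(byte[] pbBytes) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(pbBytes);
        }
        return bos.toByteArray(); // several times smaller for typical log payloads, per the text above
    }

    static byte[] decompress(byte[] gzipBytes) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(gzipBytes))) {
            return gzip.readAllBytes(); // feed the result back into the PB parser
        }
    }
}
```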

In-depth analysis

  • Common alarm metrics such as OPS, RT, Success Rate, and P999 can only reflect the overall quality of a service. When an individual user runs into a problem with the app, the traditional approach is manual troubleshooting by developers, which requires solid architectural skills and rich business experience; the troubleshooting cycle is long and the conclusions are vague. This is a pain point across the industry.
  • Our design

The system connects the client with the user feedback system, uses link information and back-end logs to discover fault points automatically, and combines offline and real-time diagnosis to locate faults in near real time.

Figure 10 Linked diagnosis

Convergence analysis approach

① Client errors are connected with user feedback from the customer service system, and a single record is used as the starting point of the analysis. Then, following the link relationships, the logs of all downstream services on the link are aggregated offline around that starting point. The index we aggregate on is the Device ID. Because some services cannot obtain this parameter, we optimized the Trace ID algorithm to embed the Device ID: a Trace ID is automatically generated for every request at the very beginning, ensuring that all downstream services carry a Trace ID and can extract the Device ID from it (see the sketch after this list).

② On the logs of all nodes, a multi-dimensional prioritized diagnosis is performed until the error point is found or the traversal is complete. Together with Easy Rules, dedicated diagnostic strategies are developed for different scenarios, which is flexible and extensible. Besides these strategies, we also provide user behavior analysis, which shows a user's requests and behavior over time.

③ Offline aggregation absorbs in advance the performance pressure of later queries and analyses. A large amount of error-free information can also be deleted when its TTL expires to save resources. Before offline aggregation finishes, real-time aggregation is enabled; the two channels together ensure near-real-time problem location.

④ Besides the metrics generated by the time-series database, we store the metrics that need further aggregation in ClickHouse to support aggregation across more dimensions, complementing the monitoring capability and safeguarding the quality of log collection and in-depth analysis. As long as logs are fully reported, this approach covers the problems that normal development and operations can anticipate, and the continuously added strategies make the in-depth analysis smarter.
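The Device ID-aware Trace ID described in ① can be sketched as follows; the layout and separator are hypothetical and not the platform's actual format.

```java
import java.util.UUID;

// Hypothetical Trace ID layout: <deviceId>-<timestamp>-<random>; assumes the Device ID contains no '-'.
class DeviceAwareTraceId {
    private static final String SEP = "-";

    // Generated once at the entry point of a request, embedding the client's Device ID.
    static String newTraceId(String deviceId) {
        return deviceId + SEP + System.currentTimeMillis() + SEP
                + UUID.randomUUID().toString().replace("-", "").substring(0, 12);
    }

    // Any downstream service can recover the Device ID without receiving it as a separate parameter.
    static String extractDeviceId(String traceId) {
        int idx = traceId.indexOf(SEP);
        return idx > 0 ? traceId.substring(0, idx) : null;
    }
}
```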

4

Overall architecture design

Figure 11 Overall architecture design

Currently the link runs through both the client and the back end, stays compatible with different back-end architectures and service implementations, and the whole process is automated. A single business line's collection peaks at hundreds of thousands of OPS per second; the Hikv data volume is several terabytes (compressed record length under 1 KB); ES holds hundreds of gigabytes per hour (depending on how many hours are retained); and HBase lands tens of terabytes of logs per day. Link, metric, and log collection and in-depth analysis are all timely.

Benefits of integration:

① Unified metric monitoring

② Rich alarm mechanisms

③ Root cause location for alarms

④ Resource scaling analysis

⑤ Automatic log analysis

⑥ Detection of cross-machine-room calls

5

Conclusion

Since the launch of the iQIYI full-link automatic monitoring platform, the gap in link monitoring on mobile has been filled, the scope of monitoring has expanded, problem location has become more efficient, and the overall quality of mobile services is guaranteed from the link point of view. Full-link monitoring is now part of the company's microservice reference architecture through a unified technical specification. Dependencies are identified automatically, links can be visualized, aggregated metrics and alarms are generated at minute granularity, and fault points and dependencies are found in time. Rule-engine-based automatic analysis solves the problem that error logs could not be located because they were kept for too short a time, raises error log search efficiency by more than 50%, and speeds up responses to customer complaints. In addition, call link analysis accurately pinpoints link performance bottlenecks for optimization and improves the quality of the overall architecture.

In the future, we will add full-link load testing on top of the current platform (common load testing is mostly single-system load testing), so that the system gains online and simulated load testing capabilities, can perceive its load capacity in advance, and can scale resources intelligently to cope with traffic surges such as holidays or hit series.

