Introduction: This document describes the value of application performance monitoring (APM) and a solution for it.

1. What is full observability?

To understand full observability, let’s first look at some problems with traditional operations and maintenance (O&M).

  • Data is isolated and scattered across departments, which makes troubleshooting difficult;
  • Multiple tools from multiple vendors cannot be combined for automated, unified analysis;
  • Faults are multidimensional, but logs and metrics each show only one side of them;
  • Merely collecting data, without in-depth analysis, does not realize the value of big data.

Full observability is an improvement on traditional O&M. It brings logs, metrics, and APM data together on one platform so that O&M, development, and business staff can observe and analyze all data from a unified perspective. This makes it possible to:

  • Establish a unified visual view with aligned time ranges and filter conditions;
  • Establish unified rule-based monitoring and alerting;
  • Establish unified, machine-learning-based intelligent monitoring and alerting.

Of the three parts of full observability (logs, metrics, and APM), APM is probably the least familiar.

2. What is application performance monitoring (APM)?

APM definition: Enterprises use APM to monitor, diagnose, and analyze the running status of complex software and applications. This shortens fault-location time and improves its accuracy, improves application operating efficiency, and optimizes the user experience.

APM draws on technologies including artificial intelligence, big data, and cloud computing. Its core goals are improving the user experience, application reliability, and application quality, and reducing the total cost of ownership (TCO) of IT.

As applications become more diverse and complex, APM provides end-to-end analysis of business performance. It helps us understand our services and where their time is spent: for example, what caused a service to crash, or where the bottleneck of the whole service lies. With this information we can better track and optimize the end-user experience.

3. Application performance monitoring (APM) scenarios

3.1 APM application scenarios and pain points

• Application exception diagnostics

– It is difficult to locate faults in distributed microservice architecture applications.

– Complicated service logic makes it difficult for enterprises to organize and manage application architectures.

• Application experience management

– User experience directly affects the prospects of an application service, but it is difficult to learn the real, specific conditions under which users access the system. New faults and customer feedback need to be located or reproduced promptly so that faults are resolved efficiently and customers are not lost.

• Application root cause analysis

– Analyze associated metrics and alert data from multiple perspectives and generate a root cause analysis report

– Analyze the cause of abnormal transactions in real time based on historical data and O&M experience

3.2 APM capability and business value

• Active and passive monitoring, focusing on end-user experience optimization

• Real-time, visual application architecture to help users fully understand complex infrastructure

• Application data accumulation and real-time update to provide data support for solving problems on different platforms

• Path tracking and timely warning to reduce failure losses

• In-depth monitoring of application components, focusing on monitoring the effectiveness of inter-tool operations, helping users quickly locate and resolve problems

4. Release of the Alibaba Cloud Elasticsearch application performance monitoring feature

The Alibaba Cloud Elasticsearch APM Server node is built on open source Elastic APM. It provides one-click, cloud-hosted startup of the APM Server node service, uses Elasticsearch as its data store, and allows real-time performance monitoring of thousands of applications.

You can use the Agent to collect detailed performance information, including incoming requests, database queries, cache calls, external HTTP requests, errors, and exceptions. Elasticsearch is used to store and visually analyze performance information, providing enterprises and developers with efficient application performance optimization and monitoring capabilities.

4.1 Collect data based on the default Agent and data collection template

The agent is an open source library written in the same language as your service. It hooks into the application, collects performance metrics and errors, and sends all collected data to the Server.
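To make the idea of an agent that "hooks into" the application more concrete, here is a minimal sketch in plain Python. It is not the Elastic APM agent: the `traced` decorator, the `collected_events` list, and the `load_user` function are all hypothetical names, and the in-memory list only stands in for the queue that a real agent would flush to the Server.

```python
import functools
import time

# Hypothetical in-memory buffer standing in for the agent's send queue.
collected_events = []


def traced(name):
    """Wrap a function so that every call records a span, plus an error if it raises."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Record the error, then re-raise so application behavior is unchanged.
                collected_events.append(
                    {"type": "error", "name": name, "message": str(exc)}
                )
                raise
            finally:
                # The span is recorded whether the call succeeded or failed.
                duration_ms = (time.perf_counter() - start) * 1000.0
                collected_events.append(
                    {"type": "span", "name": name, "duration_ms": duration_ms}
                )
        return wrapper
    return decorator


@traced("load_user")
def load_user(user_id):
    """Hypothetical application function being monitored."""
    if user_id < 0:
        raise ValueError("invalid user id")
    return {"id": user_id}
```

Calling `load_user(1)` adds one span event, while `load_user(-1)` adds an error event followed by a span event and still raises `ValueError`. Real agents typically instrument frameworks and database drivers automatically rather than requiring a decorator on each function.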

4.2 Creating and managing a cloud-hosted Alibaba Cloud ES APM Server instance

A Server node can be started, flexibly scaled, and configured with one click. The Server receives data from agents through a JSON HTTP API, and a single node can typically process data from hundreds of agents.
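As a rough illustration of what an agent might send over the JSON HTTP API, the sketch below builds a newline-delimited JSON body with a metadata line followed by one line per event. This layout is an assumption modeled on the open source Elastic APM intake format; the field set is heavily simplified and would not pass the real Server's validation, and `build_intake_payload` is a hypothetical helper. No request is actually sent.

```python
import json


def build_intake_payload(service_name, events):
    """Serialize a metadata line plus one JSON line per (event_type, body) pair."""
    # Assumption: the first line describes the sending service.
    lines = [json.dumps({"metadata": {"service": {"name": service_name}}})]
    for event_type, body in events:
        # Each event is wrapped in an object keyed by its type.
        lines.append(json.dumps({event_type: body}))
    # Newline-delimited JSON: one object per line, with a trailing newline.
    return "\n".join(lines) + "\n"


payload = build_intake_payload(
    "checkout",
    [("transaction", {"name": "GET /cart", "duration": 12.5})],
)
print(payload)
```

Batching many events into one body like this is one reason a single Server node can serve many agents: each agent makes relatively few HTTP requests.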

4.3 Configuring and associating an Alibaba Cloud ES instance with Kibana to store and analyze performance metric data

With the ES Indexing Service and the Openstore mass storage engine, Alibaba Cloud ES achieves highly concurrent writes and can store and search massive data at low cost in near real time. Free, cloud-hosted Kibana nodes provide rich data analysis and visualization capabilities.

5. Technical difficulties and solutions for full observability scenarios

This section describes how the cloud-hosted Elastic Stack addresses the pain points of the log scenario in full observability.

5.1 Pain points of full observability scenarios

  • Logs and metrics are difficult to obtain

Machines, business systems, network links, and operating systems all expose metrics and logs in different ways, so the rollout process is complex;

  • Log/metric normalization requirements are high

With many upstream and downstream links to coordinate and connect, how can useful information be extracted from massive logs?

  • Highly concurrent writes and poor system stability

Business and traffic jitter produce high log write peaks, which challenge the stability of the bypass system.

  • Massive data storage costs are high

The log scenario involves massive data volumes, at TB or even PB scale.

  • Log analysis and metric monitoring are difficult to unify

A time series system handles monitoring well but makes anomaly analysis difficult, and the reverse is also true. How can both be done on a unified platform?

  • The system has high scalability requirements

Business changes constantly drive technical evolution. Technical components are updated quickly, so the O&M framework needs strong compatibility.

5.2 Full observability solution capabilities of ELK on the cloud

  • Beats/APM collect logs and metrics

Lightweight collectors gather various types of metrics, logs, and APM data;

  • Easier data cleansing with SQL

Log/metric collection templates are supported for various network formats, and real-time computing with Flink provides complete streaming SQL capability;

  • Hosted, highly stable cloud ES writes

Indexing Service provides a self-developed hosted ES write service, with cross-data-center deployment, same-city disaster recovery, and scenario-specific kernel optimization.

  • Low-cost data storage

Alibaba Cloud ES provides hot-cold separated data storage, and its self-developed storage engine Openstore uses an optimized storage compression algorithm;

  • Log analysis, metric monitoring, and APM capabilities are available

The fully hosted Elastic Stack provides one-stop log analysis, monitoring, and tracing capabilities;

The engine is optimized for time series scenarios to ensure the performance of time series log monitoring and analysis;

  • The open source ecosystem is highly extensible

Based on a distributed architecture and a flexible, open REST API and plugin framework, it supports a variety of extensions.

6. ES full observability solution for log monitoring, O&M, and analysis

  • Solution selection: 100% compatible with open source and connects seamlessly with open source ecosystem components; supports multi-cloud and cross-cloud log monitoring and O&M analysis scenarios
  • Solution advantages: Provides end-to-end collection, transmission, and analysis with Elasticsearch on the cloud, offering high-performance reads and writes for massive data, high flexibility, and low cost

7. Analysis of pain points in time series log scenarios

What problems arise in log scenarios with many writes and few reads?

(1) Scaling out elastically to absorb peak write pressure is hard to implement effectively

(2) Massive computing and storage resources are costly, and they sit idle during off-peak periods

(3) Cluster O&M is complex, which makes system stability hard to ensure

8. Alibaba Cloud Elasticsearch Log Enhanced Edition

Hosted writes and mass storage for full observability data, based on self-developed cloud-native engine technology

  • Serverless log writes

Indexing Service hosts writes for log scenarios that send massive data to ES and charges by actual write traffic. Users can handle traffic peaks without reserving resources or maintaining large clusters.

  • Openstore mass storage

Cluster storage capacity does not need to be reserved in advance, and the data remains compatible with native ES queries. A single node can store 100 TB of data, and data is managed through flexible, easy-to-use data lifecycle policies

  • 10x elastic write scaling on the cloud

Massive computing instances on the cloud break through the write bottleneck, with no need to reserve resources in advance and no idle waste during off-peak periods

  • Cost reduction of more than 50%

Resources are used on demand and writes are billed by actual traffic, which optimizes resource cost

  • Ultra-low storage cost

Storage costs 70% less than on efficient cloud disks, with no need to reserve resources in advance and no idle waste during off-peak periods

  • Massive data can be queried

Storage costs 70% less than on efficient cloud disks, and Serverless storage is billed by actual usage
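The data lifecycle management and hot-cold separation mentioned above can be pictured with a policy like the one below. This is only a sketch: it builds a policy body in the style of open source Elasticsearch index lifecycle management (ILM), and whether Openstore is configured through exactly this kind of policy is an assumption. The helper name `build_lifecycle_policy` and the default values are illustrative, not recommendations.

```python
import json


def build_lifecycle_policy(hot_rollover_size="50gb", cold_after="7d", delete_after="180d"):
    """Return an ILM-style policy body with hot, cold, and delete phases."""
    return {
        "policy": {
            "phases": {
                # Hot phase: roll over to a new index once the current one grows large.
                "hot": {"actions": {"rollover": {"max_size": hot_rollover_size}}},
                # Cold phase: older data is kept but given the lowest recovery priority.
                "cold": {
                    "min_age": cold_after,
                    "actions": {"set_priority": {"priority": 0}},
                },
                # Delete phase: drop data once it is past the retention period.
                "delete": {"min_age": delete_after, "actions": {"delete": {}}},
            }
        }
    }


policy = build_lifecycle_policy()
print(json.dumps(policy, indent=2))
```

In open source Elasticsearch, a body like this would be submitted to the ILM policy API, for example from Kibana Dev Tools, and then attached to an index template so that new log indexes pick it up.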

9. Application service data link tracing and analysis

This section presents the case of an automobile brand (SLA/KPI metric tracking, sales support system link tracing, and log analysis): Elasticsearch-based application service data link tracing and log analysis in the automotive industry.

(1) Scenario requirements

As the automotive industry digitally transforms its entire business process, the internal support systems and the IT components they depend on (such as mobile gateways) generate large amounts of metric, trace log, log, and other data, and a solution for this data needs to be deployed on the cloud quickly.

The IT department of the automobile brand runs a content management system (CMS), a distributor operation office system (DMO), an operation quality monitoring system (QIS), a marketing operation analysis system (MMP), a BI system, and other internal support systems.

• The IT business systems are complex. They must keep meeting ongoing business needs while also moving to the cloud as a whole. The department needs products that can be rolled out quickly across systems, connect with the existing IT systems on and off the cloud, and keep the technical architecture flexible and open to support later expansion;

• Log data is expected to exceed a petabyte (180 days of retention), so the underlying technical architecture needs to combine low-cost storage, fast access, and on-demand retrieval and analysis;

(2) Solution value points

  1. Extremely low migration/transformation cost: The IT architecture of foreign-owned and joint-venture automakers follows the overseas IT architecture of the foreign partner, where ES is a very popular technical solution. Alibaba Cloud ES is fully compatible with open source, so the cost of migrating or transforming the customer's O&M system to the cloud is extremely low, and the system can go live in as little as one week.
  2. Low storage cost: A large amount of data is stored (240 TB of storage capacity for one customer log cluster), and tiered storage media are provided. For example, storing 1 PB of logs in OSS costs 12.6W yuan/month (W = 10,000); for 3W yuan/month more, the logs gain fuzzy retrieval within seconds, aggregation analysis queries, and other capabilities (20.9W yuan/month cheaper than a self-built ELK directly on efficient cloud disks);
  3. True elastic scaling: A Serverless (service-oriented) architecture that separates storage and compute is provided. Writes are billed by traffic, and nothing is charged when there is no traffic, which is “instantaneous elastic scaling” in the true sense.
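The monthly figures in item 2 can be worked through as follows. This sketch rests on two assumptions: that "W" means 10,000 yuan, and that the 20.9W saving is measured against the searchable total rather than against the OSS-only price. Under a different reading, the implied self-built cost would differ.

```python
# All amounts are in yuan per month, for 1 PB of logs.
# Assumption: "W" in the figures above means 10,000 yuan.
OSS_ONLY = 126_000               # 12.6W: raw log storage in OSS
SEARCH_SURCHARGE = 30_000        # 3W: extra cost to make the logs searchable
SAVING_VS_SELF_BUILT = 209_000   # 20.9W: stated saving versus self-built ELK

# Total monthly cost of the searchable, tiered solution.
searchable_total = OSS_ONLY + SEARCH_SURCHARGE

# Assumption: the saving is relative to the searchable total, which implies
# this monthly cost for self-built ELK on efficient cloud disks.
implied_self_built = searchable_total + SAVING_VS_SELF_BUILT

print(searchable_total)
print(implied_self_built)
```

Under these assumptions the searchable solution costs 156,000 yuan/month and the implied self-built cost is 365,000 yuan/month.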

Overall solution framework

10. Creating an ES APM Server

It takes about 3 minutes to start an APM Server for data transmission. The minimum cost is 180 yuan per month.

In the APM Server Console list, you can see how many APM Servers are running.

The console shows the access address of the APM Server. Configure this address in the APM Agent. APM Agents are available for multiple programming languages, which makes data collection quick to configure.
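One common way to give an agent the Server address is through environment variables. The sketch below assembles the variables used by the open source Elastic APM agents; the helper `agent_environment`, the service name, and the URL are hypothetical placeholders, and whether the hosted Server requires a secret token is not stated here, so the token is optional.

```python
def agent_environment(service_name, server_url, secret_token=None):
    """Build the environment variables an Elastic APM agent reads at startup."""
    env = {
        "ELASTIC_APM_SERVICE_NAME": service_name,
        "ELASTIC_APM_SERVER_URL": server_url,
    }
    if secret_token:
        # Only set the token when the Server is configured to require one.
        env["ELASTIC_APM_SECRET_TOKEN"] = secret_token
    return env


# Placeholder values: replace the URL with the access address from the console.
env = agent_environment("sales-support", "http://apm.example.internal:8200")
for key, value in env.items():
    print(f"{key}={value}")
```

These variables would be exported in the service's runtime environment before the application starts, so that the agent picks them up without code changes.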

Once the data is collected, we can go to the Kibana interface and use Dev Tools to create some indexes.

The Kibana interface allows you to view all APM service data, such as average response time, P95 value, exception occurrence time and so on.
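To show what the average response time and the P95 value mean, and why they can differ sharply, here is a small sketch over made-up durations. It uses the simple nearest-rank definition of a percentile; this is an illustrative choice and not necessarily how Kibana or Elasticsearch computes percentiles, since they may use approximate algorithms.

```python
import math


def average(values):
    """Arithmetic mean of the response times."""
    return sum(values) / len(values)


def percentile_nearest_rank(values, pct):
    """Smallest value such that at least pct percent of samples are <= it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    # Clamp to 1 so that very small percentiles still return the minimum.
    return ordered[max(rank, 1) - 1]


# Made-up response times in milliseconds, with one slow outlier.
durations_ms = [12, 15, 11, 240, 14, 13, 16, 18, 17, 19]

print(average(durations_ms))
print(percentile_nearest_rank(durations_ms, 95))
```

With these samples the average is 37.5 ms while the P95 is 240 ms: nine of the ten requests finish in under 20 ms, so the average hides the slow request that the P95 exposes.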

To view the detailed data of a service:

Click a specific request to see its waterfall view:

View details of the waterfall view:

For example, if you find that a number of SELECT statements are being executed, you can click one to see the details:

Viewing full-link data:


This article is original content from Alibaba Cloud and may not be reproduced without permission.