Author: Liu Haoyang

To help you better understand the design and implementation of the APM system in MSP, we are writing a series of articles on microservice observation that covers the APM product, its architecture design, and the underlying technology. This article, the first in the series, shares some of our thoughts on observability.


Erda Cloud is our upcoming one-stop cloud platform for developers. It provides enterprise development teams with cloud native services for DevOps (_DevOps Platform, DOP_), microservice governance (_MicroService Platform, MSP_), multi-cloud management (_Cloud Management Platform, CMP_), and fast data governance (_FastData Platform, FDP_).

As the core platform in Erda Cloud, MSP provides a managed microservice solution that includes an API gateway, a registry, a configuration center, application monitoring, and logging services. It helps users handle the technical complexity that comes with splitting business systems into microservices. Along with this product upgrade, we also designed a new APM (Application Performance Monitoring) product centered on service observation, to explore best practices for applying observability to application monitoring.

From monitoring to observability

As cloud native concepts and cloud native architecture design have become popular in recent years, more and more development teams have adopted the DevOps model and split large systems into small service modules that are easier to deploy in containers. Cloud native capabilities such as DevOps, microservices, and containerization help business teams deliver systems quickly, consistently, reliably, and at scale. At the same time, system complexity grows sharply, which brings new operations and maintenance challenges, such as:

  • Calls between modules change from in-process function calls to inter-process calls, and the network is never fully reliable.
  • As service invocation paths grow longer, traffic flow becomes harder to control and troubleshooting becomes more difficult.
  • With the introduction of cloud native systems like Kubernetes, Docker, and Service Mesh, the infrastructure layer becomes more of a black box for business development teams.

In traditional monitoring systems, we tend to watch indicators such as CPU, memory, network, the request volume of application interfaces, and the resource utilization of virtual machines. In complex cloud native systems, however, focusing on a single point or a single dimension of indicators is not enough to understand the overall operating state of the system. This is the context in which the "observability" of distributed systems emerged. In our view, the biggest change from monitoring to observability is that the data the system must process expands from indicators alone to a wider range of data. Broadly, the following types of data are considered the pillars of observability:

  • Metrics
  • Tracing
  • Logging

Relationships between Metrics, tracing, and logging

To unify data collection and standards across observability systems and to provide a vendor-neutral interface, CNCF merged OpenTracing and OpenCensus into the OpenTelemetry project. Through its specifications, OpenTelemetry defines the data model for observability data and how that data is collected, processed, and exported. It is not concerned with how the data is used, stored, presented, or alerted on. The current official recommendation is:

  • Use Prometheus and Grafana to store and present Metrics.
  • Use Jaeger for the storage and presentation of distributed tracing.

Thanks to the thriving cloud native open source ecosystem, a technical team can easily build a monitoring system, for example with Prometheus + Grafana for basic monitoring, SkyWalking or Jaeger for tracing, and ELK or Loki for logging. For users of such a system, however, the different types of observability data live in different backends, so troubleshooting still means jumping between multiple systems, and neither efficiency nor user experience can be guaranteed. To solve the problem of storing and analyzing observability data together, we developed our own unified storage and query engine, which provides seamless correlation analysis of metric, trace, and log data. In the rest of this article, we describe how we provide observability analysis capabilities for services.

Observation portal: observability topology

The three types of observability data are related to each other: Metrics and Tracing can be associated through tags, and Tracing and Logging through the request context. As a result, interface exceptions in an online application are commonly located as follows: _Use Metrics and alerts to discover problems, then use Tracing to locate the modules where the exception may have occurred, and finally use Logging to locate the root cause of the error_.
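
The association described above can be sketched in a few lines of Python. This is only an illustration, not the MSP implementation: the record types, the field names (`tags`, `trace_id`), and the sample data are all hypothetical. Metrics and spans are joined on shared tags, and spans and logs are joined on a request context, represented here by a trace ID.

```python
from dataclasses import dataclass


@dataclass
class MetricPoint:
    name: str
    value: float
    tags: dict  # hypothetical tag set, e.g. {"service": ..., "endpoint": ...}


@dataclass
class Span:
    trace_id: str
    service: str
    endpoint: str
    error: bool


@dataclass
class LogRecord:
    trace_id: str  # request context carried into the log line
    message: str


def spans_for_metric(metric, spans):
    """Metrics -> Tracing: keep spans whose attributes match the metric's tags."""
    return [
        s for s in spans
        if s.service == metric.tags.get("service")
        and s.endpoint == metric.tags.get("endpoint")
    ]


def logs_for_span(span, logs):
    """Tracing -> Logging: keep logs that share the span's request context."""
    return [record for record in logs if record.trace_id == span.trace_id]


# Hypothetical sample data: an alerting metric, three spans, two log lines.
alerting_metric = MetricPoint(
    "http_error_count", 3.0, {"service": "order", "endpoint": "/pay"}
)
spans = [
    Span("t1", "order", "/pay", error=True),
    Span("t2", "order", "/pay", error=False),
    Span("t3", "user", "/login", error=False),
]
logs = [
    LogRecord("t1", "payment failed: upstream timeout"),
    LogRecord("t2", "payment accepted"),
]

# Metric -> failing spans -> their logs.
suspect_spans = [s for s in spans_for_metric(alerting_metric, spans) if s.error]
root_cause_logs = [
    record.message for s in suspect_spans for record in logs_for_span(s, logs)
]
print(root_cause_logs)
```

Even in this toy form, each hop is a separate lookup with its own key, which is the friction the rest of this section discusses.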

While this method works most of the time, we do not consider it the best practice for observing systems:

  • Although Metrics help us find problems in a timely manner, what we find is often a large number of isolated, single-point problems, with no global view of the state of the system.
  • Business development teams need to be familiar with the concepts and usage of the Metrics, Tracing, and Logging systems. If the monitoring system is assembled from open source components, investigating a single problem means jumping from one system to another, which is very common in many companies today.

In working with the monitoring needs of users in different fields, we found that the topology can be a natural entry point to the observation system. Unlike common distributed tracing platforms, we do not treat the topology only as a view of the application's runtime architecture. We draw the topology relationships from 100% sampling of real requests, and we go further by showing service request state and service instance state on the topology nodes. In the future we will add more observation data, such as traffic rates and the state of physical nodes.
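
As a rough illustration of drawing a topology from real requests, the sketch below aggregates call records into nodes and edges with request and error counts. The `Call` record, its fields, and the node names are assumptions made for this example; how MSP actually stores and aggregates requests is not described in this article.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Call:
    source: str  # calling node, e.g. an API gateway or a service
    target: str  # called node, e.g. a service or a middleware
    error: bool


def build_topology(calls):
    """Aggregate every call (no sampling) into per-edge request and error counts."""
    edges = defaultdict(lambda: {"requests": 0, "errors": 0})
    nodes = set()
    for call in calls:
        nodes.update((call.source, call.target))
        edge = edges[(call.source, call.target)]
        edge["requests"] += 1
        edge["errors"] += int(call.error)
    return nodes, dict(edges)


# Hypothetical traffic: two gateway requests to a service, one database call.
calls = [
    Call("api-gateway", "order", error=False),
    Call("api-gateway", "order", error=True),
    Call("order", "mysql", error=False),
]
nodes, edges = build_topology(calls)
print(sorted(nodes))
print(edges[("api-gateway", "order")])
```

Because every request contributes to an edge, the counts on each edge can double as the request state shown on the topology, rather than being an estimate extrapolated from a sample.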

The topology page is divided into left and right columns. The status bar on the right displays the key indicators of the system that we need to observe, such as the number of service instances, failed service requests, code exceptions, and alert counts. When we click a topology node, the status bar detects the node type and displays the corresponding status data. The node types that currently support status display are API Gateway, Services, External Services, and Middleware.

When a service node is clicked, the status bar displays a state overview of the service, a transaction invocation overview, and a QPS line chart.

How to observe the service?

With the observable topology, we can easily view the overall state of the system from a global perspective. We also provide a way to drill down from the topology into a service to quickly locate service failures. When a service exception is found, a link leads to _service analysis_, which provides observation and analysis along three dimensions: transaction, exception, and process.

Taking the interface exception mentioned above as an example, our troubleshooting method is as follows:

  1. On the transaction analysis page, query the interface (/exception) that triggered the exception.
  2. Click a data point on the request or latency trend graph to bring up the slow transaction traces and error transaction traces sampled for that interface.
  3. In the pop-up trace list, view the details of the request link and the logs associated with the request to locate the root cause of the error.
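
The three steps above could be modeled roughly as follows. This is a sketch under stated assumptions: the `Trace` type, the `/checkout` endpoint, the one-minute bucket, and the 500 ms slow threshold are illustrative choices, not values taken from the product.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    trace_id: str
    endpoint: str
    timestamp: int       # seconds since epoch
    duration_ms: float
    error: bool
    logs: list = field(default_factory=list)  # log lines linked to this request


BUCKET_SECONDS = 60      # assumed width of one data point on the trend graph
SLOW_THRESHOLD_MS = 500  # assumed cutoff for a "slow" transaction


def traces_at_point(traces, endpoint, point_ts):
    """Steps 1-2: filter by endpoint, then by the time bucket of the clicked point."""
    bucket = point_ts // BUCKET_SECONDS
    hits = [
        t for t in traces
        if t.endpoint == endpoint and t.timestamp // BUCKET_SECONDS == bucket
    ]
    slow = [t for t in hits if t.duration_ms >= SLOW_THRESHOLD_MS and not t.error]
    failed = [t for t in hits if t.error]
    return slow, failed


# Hypothetical traces for two endpoints.
traces = [
    Trace("t1", "/checkout", 120, 35.0, True, ["ERROR: unhandled exception"]),
    Trace("t2", "/checkout", 130, 900.0, False, ["WARN: slow query"]),
    Trace("t3", "/checkout", 300, 20.0, False),
    Trace("t4", "/health", 125, 5.0, False),
]

slow, failed = traces_at_point(traces, "/checkout", point_ts=150)

# Step 3: read the logs attached to each failed trace.
for t in failed:
    print(t.trace_id, t.logs)
```

The point of the sketch is that the user supplies only an endpoint and a point in time; the joins to traces and logs happen behind that single query.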

Found the failed transaction request

Automatically associate the invocation link for this request

Automatically associate the log context for the request link

Where are we going?

Because of space limits, this article does not go into much product detail. Using the scenarios above, we propose a direction for the design of observable APM products: integrate and analyze the different kinds of data on the back end around the observation of systems and services, instead of deliberately emphasizing support for querying the three kinds of observability data separately, and hide the separation of Metrics, Tracing, and Logging from users as far as possible in product functions and interaction logic. In addition, we will continue to explore code-level diagnostics, full-link analysis, and intelligent operations in the field of observability.


Welcome to Open Source

As an open source, one-stop cloud native PaaS platform, Erda provides platform-level capabilities such as DevOps, microservice observation and governance, multi-cloud management, and fast data governance. Use the links below to take part in the open source project, talk with other developers, and help build the community. Everyone is welcome to follow the project, contribute code, and star it!

  • Erda GitHub address:
  • Erda Cloud Website: