Author | Zhang Anzhe (Luo Yu)

Source | Erda official account

Takeaway: To give you a better understanding of how the APM system in MSP is designed and implemented, we are writing a series of articles, “Micro Service Observation in Detail”, that covers the APM system's products, architecture design and underlying technology. This article, the second in the series, focuses on how to use Erda Dashboard, a dashboard system we developed, and on our vision for its future.

Articles in the “Micro Service Observation in Detail” series:

  • “From Monitoring to Observability, Where Are We Going?”
  • “Only after I got started did I know that the dashboard system was really cool to use!” (this article)

Introduction

Containerization and microservices have made systems far more scalable and robust than before, but they also bring a problem: the operations and maintenance (O&M) workload keeps growing.

The microservices that make up an application, and all the infrastructure that supports them, each produce their own logs and metric data, and each has monitoring and logging systems built on top of it. Scattered data and systems place a heavy burden on the support team and can ultimately become a roadblock for development and operations.

Our thinking

We have unified the analysis, aggregation and storage of logs, metrics and other data from different services. However, different tenants, applications, services and instances may each need their own monitoring views for analysis and troubleshooting, which places high demands on the flexibility of monitoring.

Many customizable, visual open source dashboards are already available, such as Kibana, Grafana and Dash. They have shortcomings, though: they can be hard to use and costly to learn, and the biggest problem is that the workflow is not coherent, which gets in the way of quick troubleshooting. Some systems ship with their own monitoring charts, but the data ranges, query rules, and the ways they can be arranged and combined are fixed. In more complex situations you still have to turn to third-party tools for analysis and troubleshooting, which is both time-consuming and laborious. As a PaaS platform, we developed Erda Dashboard to unify the user experience.

Introduction to Erda Dashboard

Erda Dashboard is a self-developed dashboard system. Its front end is built on ECharts and React-Grid-Layout, and its back end uses the InfluxQL time-series query language and a self-developed Metrics Search Engine. Most of the charts on Erda are built with it, and it also offers features such as custom dashboards.

Architecture diagram



We need to store and analyze a large amount of data, most scenarios require charts plotted over time, and searches run across tens of millions of records, so the load is very high. On top of that, we have customization requirements, which means the components must be open source. For these reasons we chose the distributed, open source Elasticsearch for storage and adapted it to the following structure:

    {
        "_index": "spot-application_cache-full_cluster-r-000001",
        "_type": "spot",
        "_id": "xxx",
        "_score": 1,
        "_source": {
          "name": "application_cache",
          "timestamp": 1621564500000000000,
          "tags": {
            "_meta": "true",
            "source_application_id": "9",
            "source_application_name": "xxx",
            "source_org_id": "1",
            "source_project_id": "5",
            "source_project_name": "xxx",
            "source_runtime_id": "26",
            "source_runtime_name": "xxx",
            "source_service_id": "xxx",
            "source_service_instance_id": "xxx",
            "source_service_name": "xxx",
            "source_workspace": "DEV"
          },
          "fields": {
            "elapsed_count": 2,
            "elapsed_max": 345831,
            "elapsed_mean": 331554,
            "elapsed_min": 317277,
            "elapsed_sum": 663108
          },
          "@timestamp": 1621564500000
        }
    }
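To make the layout of such a document concrete, here is a minimal Python sketch that reads the `_source` part into a small data class. The class name `MetricPoint` and the helper `parse_point` are hypothetical and exist only for this illustration; the sample keeps just a few of the tags and fields from the document above. It also checks two relationships visible in that sample: `timestamp` is in nanoseconds while `@timestamp` is in milliseconds, and `elapsed_mean` equals `elapsed_sum / elapsed_count`.

```python
import json
from dataclasses import dataclass


@dataclass
class MetricPoint:
    """One metric document: a name, a timestamp, string tags and numeric fields."""
    name: str
    timestamp_ns: int
    tags: dict
    fields: dict

    @property
    def timestamp_ms(self) -> int:
        # "timestamp" is in nanoseconds; "@timestamp" is in milliseconds.
        return self.timestamp_ns // 1_000_000


def parse_point(source: dict) -> MetricPoint:
    """Build a MetricPoint from the "_source" part of a stored document."""
    return MetricPoint(
        name=source["name"],
        timestamp_ns=source["timestamp"],
        tags=dict(source["tags"]),
        fields=dict(source["fields"]),
    )


# A shortened copy of the "_source" object shown above.
raw = json.loads("""
{
  "name": "application_cache",
  "timestamp": 1621564500000000000,
  "tags": {"source_org_id": "1", "source_workspace": "DEV"},
  "fields": {"elapsed_count": 2, "elapsed_mean": 331554, "elapsed_sum": 663108},
  "@timestamp": 1621564500000
}
""")

point = parse_point(raw)
mean = point.fields["elapsed_sum"] / point.fields["elapsed_count"]
```

Tags carry the dimensions used for filtering and grouping, while fields carry the numeric values that are aggregated.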

For ease of use and generality, we moved away from the native DSL query mode. Instead, we wrapped the time-series query language InfluxQL and extended it with custom advanced functions to support complex queries and analysis. For routine analysis, a combination of low-code configuration and custom function expressions is enough to produce analysis charts quickly.
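As a rough illustration of what hiding the native DSL means, the sketch below builds the kind of Elasticsearch search body that a simple "average of one field, grouped by one tag, bucketed by time" query could translate into. This is not the actual Metrics Search Engine; the function name `build_es_query` and its parameters are assumptions made for this example, and the field paths simply follow the document structure shown earlier.

```python
def build_es_query(metric, value_field, group_tag, start_ms, end_ms, interval="1m"):
    """Return a search body for avg(value_field) per tag value per time bucket."""
    return {
        "size": 0,  # only aggregation results are needed, not raw hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"name": metric}},
                    {"range": {"@timestamp": {"gte": start_ms, "lte": end_ms}}},
                ]
            }
        },
        "aggs": {
            "by_tag": {
                "terms": {"field": f"tags.{group_tag}"},
                "aggs": {
                    "by_time": {
                        # Assumption: a recent Elasticsearch version; older
                        # versions name this parameter "interval".
                        "date_histogram": {
                            "field": "@timestamp",
                            "fixed_interval": interval,
                        },
                        "aggs": {
                            "value": {"avg": {"field": f"fields.{value_field}"}}
                        },
                    }
                },
            }
        },
    }


body = build_es_query(
    metric="application_cache",
    value_field="elapsed_mean",
    group_tag="source_service_name",
    start_ms=1621564500000,
    end_ms=1621568100000,
)
```

Even for this small case the DSL body is deeply nested, which is the usability cost that a SQL-like query layer avoids.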

Initial use

In Erda MSP, we provide a large number of built-in dashboards to help troubleshoot common system problems, covering process analysis, error analysis, distributed tracing, transaction analysis, and more.

Each dashboard is made up of several different charts, and the time range of each chart can be adjusted individually:

When the time span or the amount of data is large, a chart can be viewed in full screen to study overall trends or analyze specific details:

Product features

O&M dashboards

O&M dashboards are used to generate highly customizable analysis charts for developers and operations personnel. They are currently available in the multi-cloud management platform and the microservice governance platform, and provide charts with a high degree of freedom, extensibility and customization.



Go to the O&M dashboards page and create a new dashboard. After you add a chart, the chart editor opens and provides a wealth of configuration options:

For example, metric grouping (FROM), dimension (GROUP BY), value (SELECT), result filter (WHERE), result sorting (ORDER BY) and result truncation (LIMIT) map one-to-one to SQL clauses. After a simple configuration the chart is generated, and saving it produces a dashboard.
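The one-to-one mapping between editor fields and SQL clauses can be sketched as follows. The `ChartQuery` class and its field names are hypothetical and only stand in for whatever the chart editor stores internally; the example values reuse the diffps query that appears later in this article.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChartQuery:
    """Editor settings, one attribute per SQL clause."""
    metric: str                                          # FROM
    select: List[str]                                    # SELECT
    group_by: List[str] = field(default_factory=list)    # GROUP BY
    where: List[str] = field(default_factory=list)       # WHERE
    order_by: Optional[str] = None                       # ORDER BY
    limit: Optional[int] = None                          # LIMIT

    def to_sql(self) -> str:
        """Assemble the clauses in SQL order, skipping the ones left empty."""
        # Note: values are concatenated as-is; a real editor would need to
        # validate or escape user input before building the statement.
        parts = [f"SELECT {', '.join(self.select)}", f"FROM {self.metric}"]
        if self.where:
            parts.append("WHERE " + " AND ".join(self.where))
        if self.group_by:
            parts.append("GROUP BY " + ", ".join(self.group_by))
        if self.order_by:
            parts.append(f"ORDER BY {self.order_by}")
        if self.limit is not None:
            parts.append(f"LIMIT {self.limit}")
        return " ".join(parts)


query = ChartQuery(
    metric="docker_container_summary",
    select=["time()", "application_name::tag", "diffps(rx_bytes::field)"],
    group_by=["time()", "application_name::tag"],
    limit=5,
)
sql = query.to_sql()
```

Because every editor field corresponds to exactly one clause, a configuration made in the form can be expressed as a query string and vice versa.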

Time-series data can be displayed in a variety of chart types, and you can switch between them seamlessly:

To support more complex queries and analysis, SQL queries are also available and can be freely combined:

An export function generates a snapshot of the dashboard with one click for sharing:

Extensions with custom functions

Erda Dashboard provides a wealth of custom extension methods, such as:

  • diffps: used to calculate rates such as disk I/O and network I/O.

For example, some raw Docker container metrics, such as network I/O, are exposed only as counter values, with no rate. The counter has to be processed to derive a rate. A common solution is to run a second aggregation in a stream computing engine such as Flink. However, that approach is not generic and introduces extra dependencies, for example for custom metrics, so we chose to implement the calculation on the query side, with support for grouping.

    SELECT time(), application_name::tag, diffps(rx_bytes::field)
    FROM docker_container_summary
    GROUP BY time(), application_name::tag
    LIMIT 5

  • The result is clearer and more intuitive when shown as a time-series chart combined with a custom expression.
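The idea behind a query-side rate function can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Erda's implementation: it assumes diffps means "difference between consecutive counter samples divided by the elapsed seconds", computed separately for each group, and it simply skips samples where the counter went backwards (for example after a container restart).

```python
from collections import defaultdict


def diffps(samples):
    """Turn counter samples into per-second rates, grouped by a key.

    samples: iterable of (group, timestamp_seconds, counter_value).
    Returns {group: [(timestamp_seconds, rate_per_second), ...]}.
    """
    by_group = defaultdict(list)
    for group, ts, value in samples:
        by_group[group].append((ts, value))

    rates = {}
    for group, points in by_group.items():
        points.sort()  # order each group's samples by timestamp
        series = []
        for (t0, v0), (t1, v1) in zip(points, points[1:]):
            if t1 == t0 or v1 < v0:
                continue  # skip duplicate timestamps and counter resets
            series.append((t1, (v1 - v0) / (t1 - t0)))
        rates[group] = series
    return rates


# Hypothetical rx_bytes counter samples for two applications.
samples = [
    ("app-a", 0, 1000), ("app-a", 10, 3000), ("app-a", 20, 3500),
    ("app-b", 0, 500), ("app-b", 10, 500),
]
rates = diffps(samples)
```

Doing this at query time means no separate stream job has to be maintained for each counter metric, at the cost of some extra computation per query.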

Vision

In the future, Erda Dashboard will unify all charts on Erda MSP, add more chart types and customization methods, and support more data sources. The planned template market will allow dashboards to be built instantly, shared within an organization, and published to an open market, which will greatly improve operations and development efficiency. In addition, the dashboard will be linked with the alerting system to produce analysis reports, bringing us closer to the goal of “finding problems in 1 minute and locating fault causes in 3 minutes”.

Welcome to Open Source

As an open source, one-stop cloud native PaaS platform, Erda provides platform-level capabilities such as DevOps, microservice observation and governance, multi-cloud management and fast data governance. Use the links below to take part in the open source project, talk with other developers, and help build the community. Everyone is welcome to follow the project, contribute code and star it!

  • Erda GitHub address: https://github.com/erda-project/erda
  • Erda Cloud website: https://www.erda.cloud/