In large-scale website system design, as distributed architectures, and microservice architectures in particular, have become popular, we decouple systems into smaller units and build complex systems by continually adding new small modules or reusing existing ones. As the number of modules grows, a single request may involve dozens or even hundreds of services working together, so accurately and quickly locating online faults and performance bottlenecks has become a thorny problem that we have to face.


What is Jaeger

To address fault location and performance problems in complex distributed architectures, Google described the design and construction of a distributed tracing system in the paper Dapper, a Large-Scale Distributed Systems Tracing Infrastructure.

Inspired by Dapper, Jaeger is a distributed tracing platform created by Uber that can be used to monitor and trace distributed systems built on the microservices model. Jaeger was open-sourced in April 2017 and entered CNCF incubation in September of the same year. In October 2019, Jaeger officially graduated and became a top-level CNCF project.

Jaeger's popularity can be attributed to its backing by a major company and a strong organization, as well as its native support for the OpenTracing standard (it can be regarded as a reference implementation of the OpenTracing protocol). It supports several major languages (Java, .NET, Go, Python, Node.js, etc.), and the community offers a large number of OpenTracing ecosystem components that can be used directly.

Jaeger uses a pluggable design to support multiple storage backends, including in-memory storage, Badger, Cassandra, Elasticsearch, and gRPC plug-ins. Newer Jaeger releases also implement a streaming architecture for data analysis, at the cost of introducing additional Kafka and Flink components.

However, when trying to achieve complete observability for a microservice system, we found that Jaeger itself has certain limitations:

  • Compared with other observability systems, Jaeger focuses on tracing and offers only limited support for logs and metrics. Because it lacks monitoring and alerting mechanisms, Jaeger is often deployed together with other systems such as Prometheus and ELK.

  • As part of an observability/monitoring system, Jaeger is an important tool for development and operations engineers to locate and discover problems in business systems. The monitoring system must stay available longer than the business system it watches. As an open source project, however, Jaeger only provides the software: it does not tell you how to size a deployment or how to keep the service highly available. Operations engineers have to design a concrete deployment scheme themselves, based on their experience with highly available services and their assessment of the scale of the business systems.

So how can we reduce the complexity of the observability platform in this situation? And how can we provide highly available, high-performance back-end services?

The best approach is to find a Jaeger-compatible back-end system that offers high reliability and high performance.

When Jaeger meets Erda

As a collaborative development platform for cloud applications, Erda provides observability as an out-of-the-box SaaS cloud service, eliminating the complexity of operating and maintaining multiple monitoring and logging systems. Erda also provides complete microservice observability capabilities, including but not limited to:

  • Service performance monitoring, including interface call monitoring, SQL call monitoring, slow transaction analysis, JVM monitoring, etc.
  • Distributed tracing, with waterfall and flame graph views of the call chain and other analysis modes
  • Distributed log query and analysis
  • Visual and flexible alert configuration, supporting alert convergence and noise reduction
  • Custom dashboard analysis

In general, there are two different ways to replace a Jaeger backend:

  • The raw data is generated by the Jaeger SDK, and queries continue to go through the Jaeger UI. Application developers keep working the way they did before, but they are limited to the Trace capabilities that Jaeger provides.

  • The raw data is generated by the Jaeger SDK and queried through the Erda microservice observability platform.

On Erda, we currently support only the second approach, because in addition to Trace capabilities, Erda can also use Jaeger data to automatically discover the service call topology, automatically analyze the call performance of service interfaces, and more.

Next, let’s take a look at how to use the Jaeger SDK to send data to the Erda microservice observability platform.

First, create a monitoring project in the admin center. (The difference between a monitoring project and an R&D project is that the latter includes full DevOps R&D capabilities in addition to observability.)

Next, find the newly created monitoring project in the microservice governance platform, open it, and click “Environment Settings” > “Access Configuration Page”.

Currently, Erda supports connecting the Jaeger SDK directly to the back end. To authenticate and distinguish the tracing data reported by different users, we need to obtain three variables, [Access point], [Environment ID], and [Environment Token], by following the prompts on the page.

Using the Java SDK as an example, we can use the Jaeger Spring Cloud Starter or any other OpenTracing-compatible SDK, add the environment ID and environment Token obtained above as Jaeger tags, and change the SDK’s access point to collector.erda.cloud/api/jaeger/…, for example:

opentracing:
  jaeger:
    service-name: <your_service_name>
    http-sender:
      url: https://collector.erda.cloud/api/jaeger/traces
    log-spans: true
    tags:
      erda.env.id: <your_env_id>
      erda.env.token: <your_token>
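To keep the environment token out of source control, the same settings can be generated at deploy time instead of being hard-coded. The following is a minimal Python sketch under stated assumptions: the property names and URL mirror the YAML above, while the helper names (`build_jaeger_settings`, `to_yaml`) and the environment variable names (`ERDA_ENV_ID`, `ERDA_ENV_TOKEN`) are hypothetical and not part of Erda or Jaeger.

```python
import os

# URL taken from the YAML example above.
COLLECTOR_URL = "https://collector.erda.cloud/api/jaeger/traces"


def build_jaeger_settings(service_name, env_id, env_token, url=COLLECTOR_URL):
    """Return the opentracing.jaeger settings as a nested dict.

    Raises ValueError if any required value is empty, so a missing
    token is caught before the application starts reporting spans.
    """
    required = (
        ("service_name", service_name),
        ("env_id", env_id),
        ("env_token", env_token),
    )
    missing = [name for name, value in required if not value]
    if missing:
        raise ValueError("missing required values: " + ", ".join(missing))
    return {
        "opentracing": {
            "jaeger": {
                "service-name": service_name,
                "http-sender": {"url": url},
                "log-spans": True,
                "tags": {
                    "erda.env.id": env_id,
                    "erda.env.token": env_token,
                },
            }
        }
    }


def to_yaml(node, indent=0):
    """Render a nested dict of scalars as YAML lines (no lists, no quoting)."""
    lines = []
    pad = " " * indent
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(value, indent + 2))
        elif isinstance(value, bool):
            lines.append(f"{pad}{key}: {'true' if value else 'false'}")
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


if __name__ == "__main__":
    # ERDA_ENV_ID and ERDA_ENV_TOKEN are hypothetical variable names.
    settings = build_jaeger_settings(
        service_name="demo-service",
        env_id=os.environ.get("ERDA_ENV_ID", "<your_env_id>"),
        env_token=os.environ.get("ERDA_ENV_TOKEN", "<your_token>"),
    )
    print("\n".join(to_yaml(settings)))
```

Running the script prints a YAML document with the same shape as the example above, which can then be written to the application's configuration file.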

Jaeger and Erda feature comparison

Topology analysis automatically computes and generates the dependency topology from Trace data. Compared with Jaeger, Erda adds many metric calculations, including QPS, error rate, average latency, and status code distribution.

Erda can automatically derive service nodes from Jaeger Trace data and generate a global top-N ranking of services.

Erda provides server-side APM functionality that Jaeger does not have.

Erda can compute and analyze Trace data and supports a large number of customizable alert policies on top of it, which Jaeger does not yet support.

In addition, Erda enhances trace analysis and supports a flame graph mode.
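To make the idea concrete, here is a minimal Python sketch of how per-edge metrics could be derived from span data. It is not Erda's implementation: the `Span` record is a simplified, hypothetical structure, and treating status codes of 500 and above as errors is an assumption made for this example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Simplified, hypothetical span record; real Jaeger spans carry more fields.
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_us: int
    status_code: int


def edge_metrics(spans, window_seconds):
    """Aggregate metrics for each caller -> callee service edge.

    An edge exists whenever a span's parent belongs to a different service.
    """
    by_id = {(s.trace_id, s.span_id): s for s in spans}
    grouped = defaultdict(list)
    for span in spans:
        parent = by_id.get((span.trace_id, span.parent_id))
        if parent is not None and parent.service != span.service:
            grouped[(parent.service, span.service)].append(span)

    metrics = {}
    for edge, calls in grouped.items():
        # Assumption for this sketch: status codes >= 500 count as errors.
        errors = sum(1 for c in calls if c.status_code >= 500)
        metrics[edge] = {
            "qps": len(calls) / window_seconds,
            "error_rate": errors / len(calls),
            "avg_latency_ms": sum(c.duration_us for c in calls) / len(calls) / 1000,
            "status_codes": dict(Counter(c.status_code for c in calls)),
        }
    return metrics


if __name__ == "__main__":
    sample = [
        Span("t1", "a1", None, "gateway", 30000, 200),
        Span("t1", "b1", "a1", "order", 20000, 200),
        Span("t1", "c1", "b1", "db", 5000, 200),
        Span("t2", "a2", None, "gateway", 50000, 500),
        Span("t2", "b2", "a2", "order", 40000, 500),
    ]
    for edge, values in edge_metrics(sample, window_seconds=10).items():
        print(edge, values)
```

The keys of the returned dictionary are the edges of the topology graph, so the same pass yields both the dependency graph and the numbers drawn on it.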
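As a rough illustration of what an alert policy over trace-derived metrics involves, the sketch below evaluates a threshold rule that must hold for several consecutive time windows before it fires, which is one simple way to reduce noise. The `AlertRule` structure and `evaluate` function are hypothetical and do not reflect Erda's actual rule model.

```python
import operator
from dataclasses import dataclass

# Comparison operators a rule may use.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass
class AlertRule:
    # Hypothetical rule shape for this sketch.
    name: str
    metric: str
    op: str
    threshold: float
    for_windows: int = 1  # consecutive windows that must breach the threshold


def evaluate(rule, windows):
    """Return True if the most recent `for_windows` windows all breach the rule.

    `windows` is a list of per-window metric dicts, oldest first.
    """
    if rule.for_windows < 1:
        raise ValueError("for_windows must be at least 1")
    if len(windows) < rule.for_windows:
        return False  # not enough history yet
    compare = OPS[rule.op]
    recent = windows[-rule.for_windows:]
    return all(compare(w[rule.metric], rule.threshold) for w in recent)


if __name__ == "__main__":
    rule = AlertRule("high error rate", "error_rate", ">", 0.05, for_windows=3)
    history = [
        {"error_rate": 0.01},
        {"error_rate": 0.08},
        {"error_rate": 0.09},
        {"error_rate": 0.10},
    ]
    print(rule.name, "firing:", evaluate(rule, history))
```

Requiring several consecutive breaches means a single slow or failed window does not page anyone, at the cost of a slightly later alert.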
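A flame graph view can be derived from the same parent/child span data that drives the waterfall view. The sketch below, which is an illustration rather than Erda's implementation, folds one trace into "folded stack" entries (semicolon-separated call paths with a value), the input format commonly used by flame graph tools. The span dictionaries and their field names are assumptions made for this example.

```python
from collections import defaultdict


def fold_trace(spans):
    """Fold spans of one trace into {call path: self time in microseconds}.

    Each span is a dict with span_id, parent_id, name and duration_us.
    Self time is the span's duration minus the total duration of its direct
    children, clamped at zero because children may overlap in time.
    """
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    known_ids = {span["span_id"] for span in spans}
    # Roots have no parent, or a parent that is missing from this trace.
    roots = [s for s in spans if s["parent_id"] not in known_ids]

    folded = {}

    def walk(span, path):
        stack = path + [span["name"]]
        kids = children[span["span_id"]]
        child_total = sum(c["duration_us"] for c in kids)
        self_time = max(span["duration_us"] - child_total, 0)
        key = ";".join(stack)
        folded[key] = folded.get(key, 0) + self_time
        for child in kids:
            walk(child, stack)

    for root in roots:
        walk(root, [])
    return folded


if __name__ == "__main__":
    trace = [
        {"span_id": "1", "parent_id": None, "name": "GET /order", "duration_us": 100000},
        {"span_id": "2", "parent_id": "1", "name": "order.create", "duration_us": 60000},
        {"span_id": "3", "parent_id": "2", "name": "db.query", "duration_us": 25000},
        {"span_id": "4", "parent_id": "1", "name": "cache.get", "duration_us": 10000},
    ]
    for path, value in fold_trace(trace).items():
        print(path, value)
```

Each output line corresponds to one bar in the flame graph: the path gives its position in the stack and the value gives its width.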

Summary

As a representative implementation of the OpenTracing protocol and a top-level CNCF project, Jaeger is the first choice for many cloud native frameworks that need Trace capabilities. If you are already using Jaeger, it is easy to plug your data into Erda for statistics and analysis without changing any code.

In addition, Erda 2.0 is officially released today. This version completely revamps the product's visual and interaction design and deeply optimizes the user experience. It also introduces a project-level R&D workflow: building on single-application CI/CD, it provides core functions such as project-level pipelines, artifacts, and environment deployment, making software development and delivery simpler and more elegant. We will analyze the new version in detail in follow-up articles, so stay tuned!

References & further reading

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

Highly Reliable Deployment Solutions for Jaeger Using SLS Trace