[Please credit the source when reproducing]: juejin.cn/post/684790…

SkyWalking is an observability analysis platform and application performance management (APM) system. It provides an integrated solution for distributed tracing, service mesh telemetry analysis, metric aggregation, and visualization.

Features:

  • Multiple monitoring approaches: language probes and service mesh
  • Automatic probes for multiple languages: Java, .NET Core, and Node.js
  • Lightweight and efficient; no big data platform is required
  • Modular: the UI, storage, and cluster management are selectable
  • Alarm support
  • Good visualization

SkyWalking technical architecture

The whole system is divided into three parts:

  • Agent: collects tracing and metrics data and reports it
  • OAP: receives the tracing and metrics data, writes it to a persistent data container (ES, H2 (an in-memory database), MySQL, etc.) through the Analysis Core module, computes secondary statistics, and handles monitoring and alarms
  • Webapp: the front end and back end are separated. The front end handles rendering, wraps each query request as GraphQL, and submits it to the back end. The back end forwards the query to the OAP cluster through Ribbon for load balancing and returns the results to the front end for display

SkyWalking also offers a number of other features:

  • Configuration overriding: default configuration can be overridden with JVM parameters, and dynamic configuration management is supported
  • Cluster management: this mainly concerns OAP, which is deployed as a cluster to share both the traffic load of data reporting and the computing load of secondary statistics. Nodes in the cluster can also be configured with separate roles for data collection and for alarms. Note that the agent does not currently support load balancing across multiple collectors; it randomly selects one instance from the cluster to report to
  • Supports K8S and mesh
  • Data containers are extensible: ES is the officially recommended one, and other data containers can be supported by implementing the extension interface
  • Data receivers are extensible: at present the agent mainly reports over gRPC, but plug-ins can be implemented to support other kinds of data reporting (receivers for Zipkin, telemetry, and Envoy are officially implemented by default)
  • Both client-side and server-side sampling are supported, though server-side sampling is the more meaningful of the two
  • OAL (Observability Analysis Language), a query script specification with a LINQ-like syntax, was developed to reduce the work of extending data queries
  • Supports monitoring and early warning: alarms are triggered by comparing OAL metric data with thresholds. Webhook-based alarm extensions, a user-defined statistical period, and an alarm silence period that prevents repeated alarms are supported
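
The threshold, statistical period, and silence mechanism described in the last item can be illustrated with a small sketch. This is not SkyWalking's alarm implementation: the `AlarmRule` class, its field names, and the sample values are all hypothetical and only show how the three settings interact.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlarmRule:
    """Hypothetical rule: fire when at least `count` of the last `period`
    values exceed `threshold`, then stay silent for `silence_period` values."""
    threshold: float
    period: int          # number of most recent time buckets to inspect
    count: int           # how many of them must breach the threshold
    silence_period: int  # buckets to stay quiet after an alarm fires
    _window: List[float] = field(default_factory=list)
    _silence_left: int = 0

    def feed(self, value: float) -> bool:
        """Add one metric value (one time bucket); return True if an alarm fires."""
        self._window.append(value)
        self._window = self._window[-self.period:]
        if self._silence_left > 0:
            # Still inside the silence period: suppress repeated alarms.
            self._silence_left -= 1
            return False
        breaches = sum(1 for v in self._window if v > self.threshold)
        if breaches >= self.count:
            self._silence_left = self.silence_period
            return True
        return False


rule = AlarmRule(threshold=1000, period=3, count=2, silence_period=2)
fired = [rule.feed(v) for v in [200, 1200, 1500, 1600, 1700, 1800]]
print(fired)
```

With these made-up values the rule fires on the third value, stays silent for the next two even though they also breach the threshold, and fires again on the sixth.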

Data containers

SkyWalking does not implement its own data container, nor does it combine several data containers, which would add complexity. Instead it relies mainly on ElasticSearch (most open-source projects do the same to keep things simple; Pinpoint, for example, uses only HBase). As a result, the characteristics of the data container and SkyWalking's own data structure largely determine the upper limit of what the system can do. Take ES as an example:

  • ES has strong query capabilities and is well ahead of the other containers in data filtering (SkyWalking's default query dimensions are much richer than those of HBase-based Pinpoint)
  • Supports sharding and replicas, with very good support for high availability, high performance, and large data volumes
  • Supports batch inserts, which greatly improves insert performance under high concurrency
  • Low data density: ES builds a large number of indexes in advance to optimize search, which is the price of its powerful querying and good performance. Trace data, however, often carries a lot of context. SkyWalking therefore serializes this context to binary, Base64-encodes it into the data_binary field, and marks the field as not_analyzed to avoid preprocessing and indexing

Overall, SkyWalking tries to exploit the strengths of ES in large data volumes and querying while minimizing the impact of its low data density. For now, ES is a good fit as the data container for call chain tracing, and it can also handle metric data fairly well. It falls somewhat short of a time-series database there, but data volumes below the petabyte level should not be much of a problem.

Data structure

If the data container determines the upper limit, the data structure determines how high the system actually reaches. SkyWalking's data is organized mainly as follows:

  • Data dimensions (ES indexes: skywalking_*_inventory)
    1. service: a service
    2. instance: an instance
    3. endpoint: an interface
    4. network_address: an external dependency
  • Data content
    1. Raw data
      • Call chain trace data (ES index: skywalking_segment; this is where most of SkyWalking's data consumption goes)
      • Metrics (mainly JVM or Envoy runtime metrics, for example ES index skywalking_instance_jvm_cpu)
    2. Secondary statistics
      • Indicators (computed by dimension and time, such as pXX and SLA, for example ES index skywalking_database_access_p75_month)
      • Slow database query records (ES index: skywalking_top_n_database_statement)
    3. Relations (relations between dimensions and indicators; ES index: skywalking_relation)
    4. Special records
      • Alarm information (ES index: skywalking_alarm_record)
      • Concurrency control (ES index: skywalking_register_lock)

Call chain trace data and the various indicators make up the bulk of the data. An expiration period can be set for them through OAP to reduce the impact of historical data on disk usage and query efficiency.

Call chain trace data

As the core data of SkyWalking, skywalking_segment is the foundation of the whole system. To understand call chain tracing in detail, however, we first need to look at OpenTracing.

OpenTracing is basically the de facto standard for open-source call chain tracing systems. It defines the basic flow and data structures of call chain tracing and provides implementations in various languages. Represented as a diagram, OpenTracing looks like this:

Among them:

  • SpanContext: a component similar to MDC (SLF4J) or ThreadLocal that holds and passes context throughout the collection of call chain data
  • Trace: the complete record of one call
    • Span: a node/step in a call, similar to one layer of stack information. A Trace is made up of multiple Spans, and the parent-child or sibling relationships between Spans mark where each node/step sits in the call
      • Tag: key information about a node/step
      • Log: detailed records of a node/step, such as the exception stack when an exception occurs
    • Baggage: like SpanContext, it is not a data structure but a mechanism, used mainly to pass context across Spans or across instances. Baggage data is runtime data rather than persisted data

Take a Trace as an example:

First, an external request calls A, and A then synchronously calls B and C in turn. B synchronously calls D; C synchronously calls E and F; during the call to F, G asynchronously calls H, after which the whole call completes.

In the figure above, a Trace is represented by the dependencies between Spans. On a timeline, it can be expressed as follows:

Of course, if a call is synchronous, the time consumed by the parent Span includes the time consumed by its child Spans.
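
The parent-child relationship and the timing rule above can be sketched as follows. The `Span` class, the helper functions, and all timestamps are made up for illustration (this is not the OpenTracing API), and only the synchronous part of the example trace is modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    name: str
    parent: Optional[str]  # name of the parent span, None for the root
    start: int             # start time in ms (made-up value)
    end: int               # end time in ms (made-up value)

    @property
    def duration(self) -> int:
        return self.end - self.start


# Synchronous part of the example trace: A -> (B -> D), (C -> E, F).
trace: List[Span] = [
    Span("A", None, 0, 100),
    Span("B", "A", 5, 40),
    Span("D", "B", 10, 30),
    Span("C", "A", 45, 95),
    Span("E", "C", 50, 60),
    Span("F", "C", 65, 90),
]

by_name: Dict[str, Span] = {s.name: s for s in trace}


def children(name: str) -> List[Span]:
    return [s for s in trace if s.parent == name]


def self_time(span: Span) -> int:
    """Time spent in the span itself, assuming its children run one after
    another (no overlap), as in a chain of synchronous calls."""
    return span.duration - sum(c.duration for c in children(span.name))


for s in trace:
    for c in children(s.name):
        # A synchronous child runs entirely inside its parent's time window.
        assert s.start <= c.start and c.end <= s.end

print({s.name: self_time(s) for s in trace})
```

An asynchronous child would not satisfy the containment check in the loop, which is why the asynchronous part of the example is left out here.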

In SkyWalking, take one record from skywalking_segment as an example:

{
	"trace_id": "52.70.15530767312125341",
	"endpoint_name": "Mysql/JDBI/Connection/commit",
	"latency": 0,
	"end_time": 1553076731212,
	"endpoint_id": 96142,
	"service_instance_id": 52,
	"version": 2,
	"start_time": 1553076731212,
	"data_binary": "CgwKCjRGnPvp5eikyxsSXhD///////////8BGMz62NSZLSDM+tjUmS0wju8FQChQAVgBYCF6DgoHZGIudHlwZRIDc3FsehcKC2RiLmluc3RhbmNlEghyaXNrZGF0YXoOCgxkYi5zdGF0ZW1lbnQYAiA0",
	"service_id": 2,
	"time_bucket": 20190320181211,
	"is_error": 0,
	"segment_id": "52.70.15530767312125340"
}

Among them:

  • trace_id: the unique ID of this call, generated in the Snowflake style
  • endpoint_name: the interface that was called
  • latency: time taken
  • end_time: end timestamp
  • endpoint_id: the unique ID of the interface that was called
  • service_instance_id: the unique ID of the instance that was called
  • version: version number of the data structure
  • start_time: start timestamp
  • data_binary: all Span data of this call, serialized and Base64-encoded; it is not analyzed and is not used for queries
  • service_id: the unique ID of the service
  • time_bucket: the time period in which the call took place
  • is_error: whether the call failed
  • segment_id: the unique ID of the record itself, similar to a primary key, generated in the Snowflake style
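
As a rough illustration of two of these fields, the sketch below Base64-decodes the data_binary value from the sample record and checks how time_bucket relates to start_time. The payload is a serialized binary message, so without its schema only plain-text fragments such as tag keys are recognizable. The UTC+8 time zone and the yyyyMMddHHmmss format are assumptions inferred from this one record, not documented behavior.

```python
import base64
from datetime import datetime, timedelta, timezone

# Fields copied from the sample skywalking_segment record.
record = {
    "start_time": 1553076731212,
    "time_bucket": 20190320181211,
    "data_binary": (
        "CgwKCjRGnPvp5eikyxsSXhD///////////8BGMz62NSZLSDM+tjUmS0wju8FQChQAVgBYCF6"
        "DgoHZGIudHlwZRIDc3FsehcKC2RiLmluc3RhbmNlEghyaXNrZGF0YXoOCgxkYi5zdGF0ZW1l"
        "bnQYAiA0"
    ),
}

# Strip any stray whitespace before decoding.
raw = base64.b64decode("".join(record["data_binary"].split()))

# Look for readable tag keys and values inside the binary payload.
candidates = (b"db.type", b"sql", b"db.instance", b"db.statement")
tags_seen = [t for t in candidates if t in raw]

# Assumption: time_bucket is start_time formatted as yyyyMMddHHmmss in UTC+8.
tz = timezone(timedelta(hours=8))
start = datetime.fromtimestamp(record["start_time"] / 1000, tz)
bucket = int(start.strftime("%Y%m%d%H%M%S"))

print(len(raw), tags_seen, bucket == record["time_bucket"])
```

The tag keys found here (db.type, db.instance, db.statement) live only inside the opaque payload, which is why they cannot be used as query conditions.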

As can be seen here, although SkyWalking offers more query dimensions than Pinpoint, they are still very limited. Apart from the endpoint, no field is related to the business, so queries can only use non-business dimensions: time, service, instance, interface, success flag, and duration. To support business-specific searches later, fields holding dynamic content (such as messageId, orderId, and other business keys) should be added for quick lookup.

Indicators

Compared with tracing, indicator data (metrics) is much simpler. An indicator generally consists of an identifier, a timestamp, and a value. Indicators in SkyWalking fall into two categories. The first is raw collected values, such as JVM runtime indicators (CPU consumption, memory structure, GC information, and so on). The second is secondary statistics, such as TP performance indicators and SLA, plus aggregations at coarser time dimensions for easier querying: minute, hour, day, week, and month.

For example, the following is a record in the index skywalking_endpoint_cpm_hour, which holds the CPM indicator of an interface within one hour:

{
	"total": 8900,
	"service_id": 5,
	"time_bucket": 2019031816,
	"service_instance_id": 5,
	"entity_id": "7",
	"value": 148
}

Each field has the following meanings:

  • total: total number of calls in the statistical period (one hour for this index)
  • service_id: the unique ID of the owning service
  • time_bucket: the statistical period
  • service_instance_id: the unique ID of the owning instance
  • entity_id: the unique ID of the interface (endpoint)
  • value: the CPM value (CPM = calls per minute, that is, total/60 for an hourly record)
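
The relationship between total and value can be checked against the sample record. This is a minimal sketch: the yyyyMMddHH reading of time_bucket and the use of integer division are assumptions that happen to match this record.

```python
from datetime import datetime

# Fields copied from the sample skywalking_endpoint_cpm_hour record.
record = {"total": 8900, "time_bucket": 2019031816, "value": 148}

MINUTES_PER_HOUR = 60

# Assumption: the stored value is total/60 with the fraction dropped.
# 8900 / 60 is about 148.33, so truncation and rounding both give 148 here.
cpm = record["total"] // MINUTES_PER_HOUR

# Assumption: an hourly time_bucket is formatted as yyyyMMddHH.
period_start = datetime.strptime(str(record["time_bucket"]), "%Y%m%d%H")

print(cpm, period_start.isoformat())
```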

Agent

The agent (apm-sniffer) is SkyWalking's Java probe implementation. It is mainly responsible for:

  • Collecting JVM metrics of the application instance
  • Instrumenting code through aspect-oriented programming (AOP) to collect call chain data
  • Reporting the collected data over RPC

The agent also implements client-side sampling, but in an APM monitoring system client-side sampling has little value, so I will not go into it here.

First, the agent is pluggable as a whole through org.apache.skywalking.apm.agent.core.boot.BootService: the agent loads all BootService implementations and manages the life cycle of these plug-ins through ServiceManager. JVM metric collection, gRPC connection management, call chain data maintenance, and data reporting to OAP are all services extended in this way.
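
The plug-in pattern described above can be sketched in a few lines. This is a hypothetical Python analogue, not the agent's Java code: the class and method names below only borrow the idea of services with a managed life cycle, and the two example services are placeholders.

```python
from typing import Dict, List, Type


class BootService:
    """Hypothetical base class for a pluggable agent service."""

    def boot(self, log: List[str]) -> None:
        raise NotImplementedError

    def shutdown(self, log: List[str]) -> None:
        raise NotImplementedError


class JvmMetricsService(BootService):
    def boot(self, log: List[str]) -> None:
        log.append("jvm-metrics: boot")

    def shutdown(self, log: List[str]) -> None:
        log.append("jvm-metrics: shutdown")


class ReporterService(BootService):
    def boot(self, log: List[str]) -> None:
        log.append("reporter: boot")

    def shutdown(self, log: List[str]) -> None:
        log.append("reporter: shutdown")


class ServiceManager:
    """Creates every registered service and drives its life cycle."""

    def __init__(self, service_types: List[Type[BootService]]) -> None:
        self._services: Dict[Type[BootService], BootService] = {
            t: t() for t in service_types
        }

    def boot(self, log: List[str]) -> None:
        for service in self._services.values():
            service.boot(log)

    def shutdown(self, log: List[str]) -> None:
        # Shut down in reverse order of start-up.
        for service in reversed(list(self._services.values())):
            service.shutdown(log)

    def find_service(self, service_type: Type[BootService]) -> BootService:
        return self._services[service_type]


log: List[str] = []
manager = ServiceManager([JvmMetricsService, ReporterService])
manager.boot(log)
manager.shutdown(log)
print(log)
```

Adding a new capability then means writing one more service class and registering it, without touching the manager.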

Then, the agent uses ByteBuddy in javaagent mode to build an AOP environment through bytecode enhancement, and provides the PluginDefine specification to make probe development easier. This is how it achieves non-invasive instrumentation for collecting call chain data.

OAP

Similar to the agent, OAP, as the core module of SkyWalking, also implements its own extension mechanism, here called Module; see library-module for details. On top of the Module mechanism, SkyWalking implements its core components:

  • Core: specifications and interfaces for all OAP core services (Remoting, Cluster, Storage, Analysis, Query, and Alarm)
  • Cluster: implementation of cluster management
  • Storage: concrete implementation of the data container
  • Query: concrete implementation of the query interface provided to the front end
  • Receiver: implementation of the receivers for data reported by probes
  • Alarm: implementation of monitoring and alarms

And an optional component:

  • Telemetry: Monitors the health of the OAP itself

The high extensibility of OAP mentioned earlier comes from the fact that the specifications of the core services are all defined in Core. To extend the system, you only need to provide your own implementation, without changing Core. The most typical example is storage: the official distribution supports not only H2, an in-memory database for single-node demos, and the classic ES, but even the open-source TiDB.
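
The idea of a contract defined in the core with interchangeable providers can be sketched as follows. Everything here is hypothetical (the `Storage` interface, the `provider` registry, and the `"memory"` provider name); it is not the library-module API, only an illustration of selecting an implementation by configuration.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class Storage(ABC):
    """Contract defined by the core; providers only implement it."""

    @abstractmethod
    def save(self, index: str, record: dict) -> None: ...

    @abstractmethod
    def count(self, index: str) -> int: ...


# Registry mapping a provider name to a factory for that provider.
PROVIDERS: Dict[str, Callable[[], Storage]] = {}


def provider(name: str):
    """Class decorator that registers a storage provider under a name."""
    def register(cls):
        PROVIDERS[name] = cls
        return cls
    return register


@provider("memory")
class InMemoryStorage(Storage):
    def __init__(self) -> None:
        self._data: Dict[str, List[dict]] = {}

    def save(self, index: str, record: dict) -> None:
        self._data.setdefault(index, []).append(record)

    def count(self, index: str) -> int:
        return len(self._data.get(index, []))


def load_storage(config: dict) -> Storage:
    """Pick the provider named in the configuration; the core never
    references a concrete provider class directly."""
    return PROVIDERS[config["storage"]]()


storage = load_storage({"storage": "memory"})
storage.save("segment", {"trace_id": "52.70.15530767312125341"})
print(storage.count("segment"))
```

Supporting another data container would mean registering one more provider class, leaving `load_storage` and the `Storage` contract unchanged.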

Installation

  1. Download the latest installation package
  2. Decompress the package and run startup.sh in the bin directory to start it
  3. Visit http://localhost:8080/ to see the dashboard
  4. Add the following JVM parameters when starting your service:
-javaagent:${agent_home}/agent/skywalking-agent.jar -Dskywalking.agent.service_name=${service_name}