How to use Skywalking for full link monitoring

Contents of this article

  • Skywalking full link monitoring
  • Skywalking parameter configuration
  • Skywalking UI monitoring perspective and indicators
  • Some interesting points

Skywalking full link monitoring

Below is a common microservice architecture diagram I found online. It uses Spring Cloud framework components, and the back-end services are written in Java. By full link monitoring, I mean monitoring the whole link from Nginx down to the database.


We know that SkyWalking can easily monitor back-end Java applications through its Java agent. For installation, please refer to the official SkyWalking documentation [1].

Here are some screenshots of the interface. Skywalking traces a request from the service entry point all the way to the database, including the SQL statement and even its parameters (displaying SQL parameters requires separate configuration, which is discussed later).




However, this does not yet cover the upstream source of the request, which is the Nginx entry. If we can also follow every request from the Nginx entry through the Java services and finally to the database, we have full link monitoring of the request. So far we have covered the bottom half of the link; now let's deal with the top half.

skywalking-nginx-lua [2] is another SkyWalking project, and it monitors Nginx. It implements the agent in Lua, so your Nginx must either have the Lua module installed, or you can use a distribution such as OpenResty that bundles Lua support.

I'm using OpenResty, so monitoring only requires configuration (note the comments):

http {
    lua_package_path "/Path/to/.../skywalking-nginx-lua/lib/skywalking/?.lua;;";

    # Buffer represents the register inform and the queue of the finished segment
    lua_shared_dict tracing_buffer 100m;

    # Init is the timer setter and keeper
    # Setup an infinite loop timer to do register and trace report.
    init_worker_by_lua_block {
        local metadata_buffer = ngx.shared.tracing_buffer

        -- Set service name
        metadata_buffer:set('serviceName', 'User Service Name')
        -- Instance means the number of Nginx deployment, does not mean the worker instances
        metadata_buffer:set('serviceInstanceName', 'User Service Instance Name')

        -- This is your SkyWalking Server address
        require("client"):startBackendTimer("http://127.0.0.1:12800")
    }

    server {
        listen 8080;

        location /ingress {
            default_type text/html;

            rewrite_by_lua_block {
                ------------------------------------------------------
                -- NOTICE, this should be changed manually
                -- This variable represents the upstream logic address
                -- Please set them as service logic name or DNS name
                --
                -- Currently, we can not have the upstream real network address
                ------------------------------------------------------
                require("tracer"):start("upstream service")
                -- If you want correlation custom data to the downstream service
                -- require("tracer"):start("upstream service", {custom = "custom_value"})
            }

            # This is your target downstream service, such as the Java microservice gateway
            proxy_pass http://127.0.0.1:8080/backend;

            body_filter_by_lua_block {
                if ngx.arg[2] then
                    require("tracer"):finish()
                end
            }

            log_by_lua_block {
                require("tracer"):prepareForReport()
            }
        }
    }
}

Here are some screenshots



With that, monitoring of the whole link is complete.

Skywalking parameter configuration

Some Chinese-language documentation:

  • Agent documentation [3]
  • UI documentation [4]

The following capabilities can be enabled by modifying the agent configuration file, agent/config/agent.config.

The options below follow the v8.0.0 agent documentation at https://github.com/apache/skywalking/blob/v8.0.0/docs/en/setup/service-agent/java-agent/README.md

  1. Collect SQL parameters. Parameters are not collected by default. When you enable collection, you can also cap the parameter length, because collecting parameters can cause performance problems.

     Property key | Description | Default
     plugin.mysql.trace_sql_parameters | If set to true, the parameters of the SQL (typically java.sql.PreparedStatement) are collected. | false
     plugin.mysql.sql_parameters_max_length | If set to a positive number, db.sql.parameters is truncated to this length; otherwise it is saved in full, which may cause performance problems. | 512
  2. Collect HTTP parameters by setting plugin.tomcat.collect_http_params or plugin.springmvc.collect_http_params, for example:

plugin.springmvc.collect_http_params=true
# The maximum length of a request parameter to be collected. Too many characters may affect performance.
plugin.http.http_params_length_threshold=1024
  3. Configure the data retention period in the OAP server configuration file (config/application.yml):

core:
  selector: ${SW_CORE:default}
  default:
    # Mixed: Receive agent data, Level 1 aggregate, Level 2 aggregate
    # Receiver: Receive agent data, Level 1 aggregate
    # Aggregator: Level 2 aggregate
    role: ${SW_CORE_ROLE:Mixed} # Mixed/Receiver/Aggregator
    restHost: ${SW_CORE_REST_HOST:0.0.0.0}
    restPort: ${SW_CORE_REST_PORT:12800}
    restContextPath: ${SW_CORE_REST_CONTEXT_PATH:/}
    gRPCHost: ${SW_CORE_GRPC_HOST:0.0.0.0}
    gRPCPort: ${SW_CORE_GRPC_PORT:11800}
    gRPCSslEnabled: ${SW_CORE_GRPC_SSL_ENABLED:false}
    gRPCSslKeyPath: ${SW_CORE_GRPC_SSL_KEY_PATH:""}
    gRPCSslCertChainPath: ${SW_CORE_GRPC_SSL_CERT_CHAIN_PATH:""}
    gRPCSslTrustedCAPath: ${SW_CORE_GRPC_SSL_TRUSTED_CA_PATH:""}
    downsampling:
      - Hour
      - Day
      - Month
    # Set a timeout on metrics data. After the timeout has expired, the metrics data will automatically be deleted.
    enableDataKeeperExecutor: ${SW_CORE_ENABLE_DATA_KEEPER_EXECUTOR:true} # Turn it off then automatically metrics data delete will be close.
    dataKeeperExecutePeriod: ${SW_CORE_DATA_KEEPER_EXECUTE_PERIOD:5} # How often the data keeper executor runs periodically, unit is minute
    recordDataTTL: ${SW_CORE_RECORD_DATA_TTL:3} # Unit is day
    metricsDataTTL: ${SW_CORE_METRICS_DATA_TTL:7} # Unit is day

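These are the four lines that control data retention. recordDataTTL applies to record data such as traces, and metricsDataTTL applies to aggregated metrics; both are in days. Each value can be overridden with the environment variable named in ${...}, for example SW_CORE_RECORD_DATA_TTL: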

enableDataKeeperExecutor: ${SW_CORE_ENABLE_DATA_KEEPER_EXECUTOR:true} # Turn it off then automatically metrics data delete will be close.
dataKeeperExecutePeriod: ${SW_CORE_DATA_KEEPER_EXECUTE_PERIOD:5} # How often the data keeper executor runs periodically, unit is minute
recordDataTTL: ${SW_CORE_RECORD_DATA_TTL:3} # Unit is day
metricsDataTTL: ${SW_CORE_METRICS_DATA_TTL:7} # Unit is day
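The retention rule behind these two TTLs can be sketched in a few lines of Python. This is a minimal illustration, not SkyWalking's code: it assumes data becomes eligible for deletion once its age exceeds the TTL, and the helper name `is_expired` and the sample dates are hypothetical. The real data keeper runs every `dataKeeperExecutePeriod` minutes against the storage backend, so actual deletion may happen on a later run rather than at the exact boundary.

```python
from datetime import datetime, timedelta

# Defaults taken from the OAP configuration above (unit: days).
RECORD_DATA_TTL_DAYS = 3
METRICS_DATA_TTL_DAYS = 7


def is_expired(written_at: datetime, now: datetime, ttl_days: int) -> bool:
    """Hypothetical helper: True if data written at `written_at` is older than its TTL."""
    return now - written_at > timedelta(days=ttl_days)


if __name__ == "__main__":
    now = datetime(2020, 8, 10, 12, 0)
    four_days_ago = now - timedelta(days=4)
    # A trace record written 4 days ago exceeds the 3-day record TTL...
    print(is_expired(four_days_ago, now, RECORD_DATA_TTL_DAYS))   # True
    # ...while a metric written at the same time is kept for 7 days.
    print(is_expired(four_days_ago, now, METRICS_DATA_TTL_DAYS))  # False
```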

Skywalking UI monitoring perspective and indicators

CPM Requests per minute

CPM stands for calls per minute and is a throughput metric. The following figure shows the throughput and average throughput at the global, service, instance and endpoint (interface) levels.


For example, the first value, 185 CPM, corresponds to 185 / 60 ≈ 3.08 requests per second.
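The conversion is plain arithmetic. The Python sketch below converts CPM to requests per second and shows how a CPM value could be counted from raw call timestamps; the timestamp input is hypothetical, since SkyWalking aggregates CPM on the OAP server from agent data.

```python
def cpm_to_rps(cpm: float) -> float:
    """Convert calls per minute (CPM) to requests per second."""
    return cpm / 60


def cpm_from_timestamps(timestamps_s: list[float]) -> dict[int, int]:
    """Count calls in each one-minute bucket, given call timestamps in seconds."""
    counts: dict[int, int] = {}
    for ts in timestamps_s:
        minute = int(ts // 60)
        counts[minute] = counts.get(minute, 0) + 1
    return counts


if __name__ == "__main__":
    print(round(cpm_to_rps(185), 2))  # 3.08
    print(cpm_from_timestamps([0.5, 10.0, 59.9, 61.0]))  # {0: 3, 1: 1}
```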

SLA Service level agreement

SLA stands for Service Level Agreement. In SkyWalking, the SLA metric is the percentage of successful calls.


In IT, SLAs are used to measure the availability of a platform, usually expressed as a number of nines. The maximum downtime per year for each level is:

  1. 1 year = 365 days = 8760 hours
  2. 99% => 8760 hours × 1% = 87.6 hours ≈ 3.65 days
  3. 99.9% => 8760 hours × 0.1% = 8.76 hours
  4. 99.99% => 8760 hours × 0.01% ≈ 52.6 minutes
  5. 99.999% => 8760 hours × 0.001% ≈ 5.26 minutes
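All of these budgets follow from one formula, downtime = 8760 hours × (100% − SLA). A small Python check of the list, including the per-week figure for two nines under the assumption of four weeks per month (48 weeks per year):

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def allowed_downtime_hours(sla_percent: float) -> float:
    """Yearly downtime budget, in hours, for a given availability percentage."""
    return HOURS_PER_YEAR * (100 - sla_percent) / 100


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(sla)
        print(f"{sla}%: {hours:.2f} h/year = {hours * 60:.1f} min/year")
    # Two nines per week, assuming 4 weeks per month (48 weeks per year).
    print(f"99% per week: {allowed_downtime_hours(99.0) / 48:.3f} h")
```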

Therefore, a single major outage in a year makes four nines impossible, and three nines is about what an ordinary platform achieves. Two nines is essentially unusable: it allows 87.6 hours of downtime a year, or 1.825 hours a week (assuming four weeks per month, i.e. 48 weeks a year). The following figure shows the SLA of services, instances and endpoints by year and month.

Percent Response Statistics

A percentile is the value below which a given percentage of the collected samples fall. SkyWalking shows a series of percentiles: p50, p75, p90, p95 and p99. For example, p99: 390 means that 99% of requests have a response time of at most 390 ms. p99 is commonly used because it discards a few extreme values while still representing the vast majority of requests.
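A percentile can be computed with the nearest-rank method, sketched below in Python. SkyWalking computes its percentiles on the OAP server with its own bucketed method, so its values may differ slightly; the latency samples here are hypothetical. Note how a single 5000 ms outlier does not move p99 at all, which is exactly why p99 is preferred over the maximum.

```python
import math


def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Hypothetical response times: 99 ordinary requests plus one extreme outlier.
    latencies = [100 + i for i in range(99)] + [5000]
    for p in (50, 75, 90, 95, 99):
        print(f"p{p}: {percentile(latencies, p)} ms")
```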

Slow Endpoint

An endpoint represents a specific entry point of a service, such as an interface. The following figure shows the global Top N slowest endpoints, which let you observe the performance of the platform.
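A Top N list of slow endpoints boils down to ranking endpoints by a latency statistic. The Python sketch below assumes ranking by average response time; the endpoint names and timings are hypothetical.

```python
from statistics import mean


def top_n_slow_endpoints(samples: dict[str, list[float]], n: int) -> list[tuple[str, float]]:
    """Rank endpoints by average response time (ms), slowest first, and keep the top n."""
    averages = [(endpoint, mean(times)) for endpoint, times in samples.items() if times]
    averages.sort(key=lambda item: item[1], reverse=True)
    return averages[:n]


if __name__ == "__main__":
    # Hypothetical endpoints with response times in milliseconds.
    samples = {
        "GET:/user/{id}": [120.0, 80.0, 100.0],
        "POST:/order": [900.0, 1100.0],
        "GET:/health": [2.0, 3.0],
    }
    print(top_n_slow_endpoints(samples, 2))
    # [('POST:/order', 1000.0), ('GET:/user/{id}', 100.0)]
```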

Heatmap

The heatmap works much like the GitHub contributions graph: the darker a cell, the more requests it contains. One axis is time and the other is response time, and hovering the mouse over a cell shows the exact count. The heatmap gives you an intuitive feel for both the overall traffic of the platform and its overall performance.
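Each heatmap cell is simply a count of requests in one time slot and one response-time range. A minimal Python sketch of that counting, assuming one-minute slots and hypothetical bucket boundaries (SkyWalking's own bucket sizes may differ):

```python
from collections import Counter

# Hypothetical lower bounds of the response-time buckets, in ms (ascending).
BUCKETS_MS = [0, 100, 200, 500, 1000]


def bucket_of(latency_ms: float) -> int:
    """Return the lower bound of the highest bucket that latency_ms reaches."""
    chosen = BUCKETS_MS[0]
    for bound in BUCKETS_MS:
        if latency_ms >= bound:
            chosen = bound
    return chosen


def heatmap(calls: list[tuple[float, float]]) -> Counter:
    """Count calls per (minute, latency bucket) cell; each call is (timestamp_s, latency_ms)."""
    return Counter((int(ts // 60), bucket_of(lat)) for ts, lat in calls)


if __name__ == "__main__":
    calls = [(5, 50), (20, 80), (30, 250), (70, 1200)]
    for (minute, bucket), count in sorted(heatmap(calls).items()):
        print(f"minute {minute}, >= {bucket} ms: {count} call(s)")
```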

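Apdex


Apdex (Application Performance Index) is a measure of application performance based on how satisfied users are with response times. It sorts requests into three levels: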

  • Satisfied: the response time is less than or equal to T.
  • Tolerating: the response time is greater than T and less than or equal to 4T.
  • Frustrated: the response time is greater than 4T.

T is a user-defined threshold, for example 500 ms. Apdex = (Satisfied count + Tolerating count / 2) / Total count, so it ranges from 0 (all requests frustrated) to 1 (all satisfied). For example, service A defines T = 200 ms. Among 100 samples, 20 requests take at most 200 ms, 60 take between 200 ms and 800 ms, and 20 take more than 800 ms, so Apdex = (20 + 60 / 2) / 100 = 0.5.
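The formula translates directly into code. This Python sketch classifies each sample against T and reproduces the worked example; the sample latencies of 150, 500 and 1000 ms are hypothetical values chosen to fall into each of the three levels.

```python
def apdex(latencies_ms: list[float], t_ms: float) -> float:
    """Apdex = (satisfied + tolerating / 2) / total, with threshold T."""
    if not latencies_ms:
        raise ValueError("no samples")
    satisfied = sum(1 for x in latencies_ms if x <= t_ms)
    tolerating = sum(1 for x in latencies_ms if t_ms < x <= 4 * t_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)


if __name__ == "__main__":
    # The example from the text: T = 200 ms, 100 samples.
    samples = [150] * 20 + [500] * 60 + [1000] * 20
    print(apdex(samples, 200))  # 0.5
```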

Some interesting points


In the topology map, red indicates that requests to the node have been failing for a period of time. When a node turns completely red, that service is entirely unavailable during that period. The topology map helps you quickly spot, troubleshoot and prevent potential problems in a service.

Look carefully at the flowing lines between nodes. They are either one-way (left to right or right to left) or two-way, which tells you which services depend on which. A two-way line indicates that your services have a circular dependency.
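To make the two-way rule concrete, here is a small Python sketch over a hypothetical list of caller → callee edges (the service names are made up, and the edge list stands in for what the topology map draws). A two-way line is a cycle of length two; a longer loop such as A → B → C → A does not show up as a two-way line, so the sketch also runs a depth-first search for cycles in general.

```python
def two_way_pairs(edges: set[tuple[str, str]]) -> set[frozenset[str]]:
    """Service pairs that call each other (a two-way line in the topology)."""
    return {frozenset((a, b)) for a, b in edges if (b, a) in edges and a != b}


def has_cycle(edges: set[tuple[str, str]]) -> bool:
    """Detect any dependency cycle with a depth-first search."""
    graph: dict[str, list[str]] = {}
    for caller, callee in edges:
        graph.setdefault(caller, []).append(callee)
        graph.setdefault(callee, [])
    state = {node: 0 for node in graph}  # 0 = unvisited, 1 = on current path, 2 = done

    def visit(node: str) -> bool:
        state[node] = 1
        for nxt in graph[node]:
            if state[nxt] == 1 or (state[nxt] == 0 and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(state[n] == 0 and visit(n) for n in graph)


if __name__ == "__main__":
    edges = {
        ("gateway", "user-service"),
        ("user-service", "order-service"),
        ("order-service", "user-service"),
    }
    print(two_way_pairs(edges))  # the user-service / order-service pair
    print(has_cycle(edges))  # True
    print(has_cycle({("a", "b"), ("b", "c"), ("c", "a")}))  # True, though no two-way line
```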

Version 8.1, the latest at the time of writing, adds endpoint dependency analysis, which shows interface-level dependencies: which endpoints call a given endpoint and which endpoints it calls.
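The caller and callee views can be pictured as two indexes over endpoint-level calls. A minimal Python sketch, assuming the calls are available as hypothetical (caller, callee) pairs; the endpoint names are made up:

```python
from collections import defaultdict


def build_endpoint_index(
    calls: list[tuple[str, str]],
) -> tuple[dict[str, set[str]], dict[str, set[str]]]:
    """Index endpoint-to-endpoint calls in both directions."""
    callers: dict[str, set[str]] = defaultdict(set)  # endpoint -> who calls it
    callees: dict[str, set[str]] = defaultdict(set)  # endpoint -> what it calls
    for caller, callee in calls:
        callees[caller].add(callee)
        callers[callee].add(caller)
    return callers, callees


if __name__ == "__main__":
    calls = [
        ("gateway:/api/order", "order-service:/order/create"),
        ("order-service:/order/create", "user-service:/user/{id}"),
        ("admin:/report", "user-service:/user/{id}"),
    ]
    callers, callees = build_endpoint_index(calls)
    print(sorted(callers["user-service:/user/{id}"]))
    # ['admin:/report', 'order-service:/order/create']
    print(sorted(callees["order-service:/order/create"]))
    # ['user-service:/user/{id}']
```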


References

[1] Skywalking official documentation: https://github.com/apache/skywalking/blob/master/docs/en/setup/README.md

[2] skywalking-nginx-lua project: https://github.com/apache/skywalking-nginx-lua/

[3] Skywalking agent documentation (Chinese translation): https://skyapm.github.io/document-cn-translation-of-skywalking/zh/8.0.0/setup/service-agent/java-agent/

[4] Skywalking UI documentation (Chinese translation): https://skyapm.github.io/document-cn-translation-of-skywalking/zh/8.0.0/ui/
