Background: In production operations we often have to take the blame for incidents, and then keep struggling because there is no sound observability scheme or platform. This article takes you one step further by explaining how Traefik's observability solution works and how to use it.

Note: In the Traefik 2.x ecosystem, observability is divided into the following sections and promoted to dedicated documentation (traefik-observability).

  • Service logs: logs related to the Traefik process itself
  • Access logs: access logs (access.log) of the services proxied by Traefik
  • Metrics: detailed metrics exposed by Traefik itself
  • Tracing: Traefik also provides tracing interfaces for visualizing calls in distributed or microservice systems

Service log

Note: By default, Traefik writes logs to stdout in text format. If Traefik runs in Docker, use docker logs container_name to view them.

Related configuration

$ cat traefik.toml
[log]
  filePath = "/path/to/traefik.log"   # log file path
  format = "json"                     # log file format [text (text | json)]
  level = "DEBUG"                     # log output level [ERROR (ERROR | DEBUG | INFO | PANIC | FATAL | WARN)]

CLI configuration


--log.filePath=/path/to/traefik.log --log.format=json --log.level=DEBUG

Note: Logging configuration parameters must match the major version of Traefik in your environment; otherwise unexpected problems may occur.
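Once the service log is written in JSON format, it is easy to filter programmatically. A minimal sketch, assuming the logrus-style `level`/`msg`/`time` fields shown in Traefik's JSON output (the sample lines below are hypothetical):

```python
import json

def filter_by_level(lines, wanted=("error", "warning")):
    """Yield parsed JSON service-log records whose level is in `wanted`."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. a startup banner)
        if record.get("level") in wanted:
            yield record

# hypothetical sample lines in Traefik's JSON log shape
logs = [
    '{"level":"info","msg":"Configuration loaded","time":"2020-04-07T08:53:38Z"}',
    '{"level":"warning","msg":"Endpoints not available","time":"2020-04-07T08:53:38Z"}',
]
print([r["msg"] for r in filter_by_level(logs)])  # ['Endpoints not available']
```

In practice you would pipe `docker logs` or the mounted log file into such a filter instead of a hard-coded list.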

Access log

The access log records the details of each request that passes through Traefik, including the HTTP request headers and the response time; its format is similar to Nginx's access.log. Access logs can be used to analyze overall Traefik traffic as well as per-service traffic and status details.

Note: Access logs are also written to stdout in text format by default.

Related configuration

$ cat traefik.toml
[accessLog]
  filePath = "/path/to/access.log"
  format = ""            # access log format [common (common | json)]
  bufferingSize = 100    # number of log lines to buffer; required for asynchronous log writes
  [accessLog.filters]    # filters are logically OR-ed; specifying several filters keeps more entries than specifying only one
    statusCodes = ["200", "300-302"]   # keep logs whose status code falls in the given ranges
    retryAttempts = true               # keep logs when there are retries
    minDuration = "10ms"               # keep logs when the request takes longer than the specified duration

  [accessLog.fields]          # limit the fields recorded in the access log (fields.names and fields.headers control field output)
    defaultMode = "keep"      # each field can be set to "keep", "drop" or "redact"
    [accessLog.fields.names]  # per-field rules by field name
      "ClientUsername" = "drop"
    [accessLog.fields.headers]        # header handling
      defaultMode = "keep"            # keep all headers by default
      [accessLog.fields.headers.names]   # per-header rules
        "User-Agent" = "redact"
        "Authorization" = "drop"
        "Content-Type" = "keep"
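The filters above are OR-ed: an entry is kept if any one of them matches. A rough sketch of that decision logic, assuming the three filter settings shown (this is an illustration of the rule semantics, not Traefik's actual code):

```python
def in_status_ranges(status, ranges=("200", "300-302")):
    """True if an integer status code falls in any configured range."""
    for r in ranges:
        if "-" in r:
            lo, hi = map(int, r.split("-"))
            if lo <= status <= hi:
                return True
        elif status == int(r):
            return True
    return False

def should_log(status, retried, duration_ms, min_duration_ms=10, retry_filter=True):
    """Keep the access-log entry if ANY filter matches (filters are OR-ed)."""
    return (in_status_ranges(status)
            or (retry_filter and retried)
            or duration_ms > min_duration_ms)

print(should_log(503, retried=False, duration_ms=2))   # False: no filter matches
print(should_log(503, retried=False, duration_ms=25))  # True: exceeds minDuration
```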

CLI configuration

--accesslog=true
--accesslog.filepath=/path/to/access.log
--accesslog.format=json
--accesslog.bufferingsize=100
--accesslog.filters.statuscodes=200,300-302
--accesslog.filters.retryattempts
--accesslog.filters.minduration=10ms
--accesslog.fields.defaultmode=keep
--accesslog.fields.names.ClientUsername=drop
--accesslog.fields.headers.defaultmode=keep
--accesslog.fields.headers.names.User-Agent=redact
--accesslog.fields.headers.names.Authorization=drop
--accesslog.fields.headers.names.Content-Type=keep
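The keep/drop/redact field modes above can be illustrated with a small sketch (just the rule semantics, not Traefik's implementation):

```python
def apply_field_rules(headers, default_mode="keep", rules=None):
    """Apply Traefik-style keep/drop/redact rules to a header map."""
    rules = rules or {}
    out = {}
    for name, value in headers.items():
        mode = rules.get(name, default_mode)
        if mode == "keep":
            out[name] = value
        elif mode == "redact":
            out[name] = "REDACTED"
        # mode == "drop": omit the field entirely
    return out

headers = {"User-Agent": "curl/7.64", "Authorization": "Bearer xyz", "Content-Type": "text/html"}
rules = {"User-Agent": "redact", "Authorization": "drop", "Content-Type": "keep"}
print(apply_field_rules(headers, rules=rules))
# {'User-Agent': 'REDACTED', 'Content-Type': 'text/html'}
```

This is why "Authorization" disappears from the log entirely while "User-Agent" is still present but masked.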


Note: Since Traefik acts as the edge node of the Kubernetes cluster and proxies internal HTTP services, Traefik is deployed inside the cluster, with the process log and access log mounted as volumes into a data directory on the edge node.

Traefik is deployed in the K8s cluster as a DaemonSet. The specific configuration is as follows:

$ cat traefik-ds.yml
---
kind: DaemonSet
apiVersion: extensions/v1beta1
metadata:
  name: traefik-ingress-controller
  namespace: kube-system
  labels:
    k8s-app: traefik-ingress-lb
spec:
  template:
    metadata:
      labels:
        k8s-app: traefik-ingress-lb
        name: traefik-ingress-lb
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: DoesNotExist
      serviceAccountName: traefik-ingress-controller
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      containers:
      - image: traefik:v1.7.16
        name: traefik-ingress-lb
        ports:
        - name: http
          containerPort: 80
          hostPort: 80
        - name: admin
          containerPort: 8080
        securityContext:
          capabilities:
            drop:
            - ALL
            add:
            - NET_BIND_SERVICE
        args:
        - --api
        - --kubernetes
        - --logLevel=INFO
        - --traefikLog.filePath=/logdata/traefik.log
        - --configfile=/config/traefik.toml
        - --accesslog.filepath=/logdata/access.log
        - --accesslog.bufferingsize=100
        volumeMounts:
        - mountPath: /config
          name: config
        - mountPath: /logdata
          name: access-log
      volumes:
      - configMap:
          name: traefik-config
        name: config
      - name: access-log
        hostPath:
          path: /opt/logs/ingress/

Check Traefik status

$ kubectl get pods -n kube-system | grep traefik
traefik-ingress-controller-2dx7k   1/1   Running   0   5h28m
...

$ kubectl get svc -n kube-system | grep traefik
traefik-ingress-service   ClusterIP   10.253.132.216   <none>   80/TCP,8080/TCP   123d
traefik-web-ui            ClusterIP   10.253.54.184    <none>   80/TCP            172d

You can check whether the Traefik service is running properly through the node's ping and admin interfaces:

$ curl 10.253.132.216/ping
OK

$ curl 10.253.132.216:8080
<a href="/dashboard/">Found</a>

View the process log and access log on the node where Traefik is scheduled:

$ tree -L 2 /opt/logs/ingress/
/opt/logs/ingress/
├── access.log
└── traefik.log

$ tail -n 10 /opt/logs/ingress/traefik.log
time="2020-04-07T08:53:38Z" level=warning msg="Endpoints not available for my-data/my-data-selfaccess-dev"
time="2020-04-07T08:53:38Z" level=warning msg="Endpoints not available for my-data/my-data-metadata-prod1"

$ tail -n 10 /opt/logs/ingress/access.log
172.16.21.28 - - [07/Apr/2020:08:52:54 +0000] "POST /.kibana/_search?ignore_unavailable=true&filter_path=aggregations.types.buckets HTTP/1.1" 503 161 "-" "-" 491674 "prod-es-cluster.soulapp-inc.cn" "http://20.0.41.10:9200" 1ms
172.16.21.28 - - [07/Apr/2020:08:52:54 +0000] "POST /.kibana/_search?ignore_unavailable=true&filter_path=aggregations.types.buckets HTTP/1.1" 503 161 "-" "-" 491671 "prod-es-cluster.soulapp-inc.cn" "http://20.0.26.20:9200" 1ms
172.16.21.28 - - [07/Apr/2020:08:52:54 +0000] "GET /.kibana/doc/config%3A6.4.0 HTTP/1.1" 503 301 "-" "-" 491675 "prod-es-cluster.soulapp-inc.cn" "http://20.0.14.6:9200" 1ms


As you can see from the output, Traefik's access log format is similar to Nginx's. With this log, we can use an ELK-style log analysis scheme to break down the overall state of the site, such as UV, PV, region distribution, status distribution, and response time.
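Because the format is Nginx-like, a quick PV/UV/status breakdown can be scripted even without a full ELK stack. A rough sketch (the regular expression only covers the leading fields shown in the sample output above, and the sample lines are simplified):

```python
import re
from collections import Counter

# matches the leading '<client> - <user> [time] "<request>" <status>' part of a CLF-style line
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3})')

def summarize(lines):
    """Return page views, unique client IPs, and status-code distribution."""
    clients, statuses, pv = set(), Counter(), 0
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        pv += 1
        clients.add(m.group(1))     # client address -> UV
        statuses[m.group(2)] += 1   # status code distribution
    return {"pv": pv, "uv": len(clients), "status": dict(statuses)}

sample = [
    '172.16.21.28 - - [07/Apr/2020:08:52:54 +0000] "GET /ping HTTP/1.1" 200 2 "-" "-" 1 "web" "http://20.0.14.6:80" 1ms',
    '172.16.21.29 - - [07/Apr/2020:08:52:55 +0000] "GET /x HTTP/1.1" 503 161 "-" "-" 2 "web" "http://20.0.14.6:80" 1ms',
]
print(summarize(sample))  # {'pv': 2, 'uv': 2, 'status': {'200': 1, '503': 1}}
```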

In addition, access logs are persisted directly on the node, so a log-collection agent on the node host can later ship them to the ELK stack for analysis. Of course, the ELK stack's collection agent could also be deployed directly in the Traefik Pod.

Metrics

Traefik supports four metrics backends by default:

  • Datadog
  • Influxdb
  • Prometheus
  • StatsD

To enable metrics support, do the following:

# toml configuration file
[metrics]

YAML configuration file

metrics: {}

CLI configuration


--metrics=true

Datadog backend support

Configuration details:

# toml configuration file
[metrics]
  [metrics.datadog]
    address = "127.0.0.1:8125"    # datadog address
    addEntryPointsLabels = true   # add metrics labels on entrypoints [true]
    addServicesLabels = true      # enable metrics on services [true]
    pushInterval = "10s"          # interval for pushing metrics to datadog ["10s"]

CLI configuration

--metrics.datadog=true
--metrics.datadog.address=127.0.0.1:8125
--metrics.datadog.addEntryPointsLabels=true
--metrics.datadog.addServicesLabels=true
--metrics.datadog.pushInterval=10s


InfluxDB backend support

Configuration details:

# toml configuration file
[metrics]
  [metrics.influxDB]
    address = "localhost:8089"      # influxdb address
    protocol = "udp"                # influxdb transfer protocol [udp (udp | http)]
    database = "db"                 # database that metrics are written into [""]
    retentionPolicy = "two_hours"   # retention policy for metrics in influxdb [""]
    username = ""                   # [""]
    password = ""                   # [""]
    addEntryPointsLabels = true     # add metrics labels on entrypoints [true]
    addServicesLabels = true        # enable metrics on services [true]
    pushInterval = "10s"            # interval for pushing metrics to influxdb ["10s"]

CLI configuration


--metrics.influxdb=true --metrics.influxdb.address=localhost:8089 --metrics.influxdb.protocol=udp --metrics.influxdb.database=db --metrics.influxdb.retentionPolicy=two_hours --metrics.influxdb.username=john --metrics.influxdb.password=secret --metrics.influxdb.addEntryPointsLabels=true --metrics.influxdb.addServicesLabels=true --metrics.influxdb.pushInterval=10s

Prometheus backend support

Configuration details:

# toml configuration file
[metrics]
  [metrics.prometheus]
    buckets = [0.1, 0.3, 1.2, 5.0]  # histogram buckets [0.100000, 0.300000, 1.200000, 5.000000]
    addEntryPointsLabels = true     # add metrics labels on entrypoints [true]
    addServicesLabels = true        # enable metrics on services [true]
    entryPoint = "traefik"          # entrypoint that exposes metrics [traefik (default: admin port 8080, path /metrics)]
    manualRouting = true            # if true, disables the default internal router so you can define your own [false]

CLI configuration

--metrics.prometheus=true
--metrics.prometheus.buckets=0.100000,0.300000,1.200000,5.000000
--metrics.prometheus.addEntryPointsLabels=true
--metrics.prometheus.addServicesLabels=true

Alternatively, define a dedicated metrics entrypoint and specify its port:

--metrics.prometheus.entryPoint=metrics --entryPoints.metrics.address=:8082 --metrics.prometheus.manualrouting=true


Note: Unlike the push-based backends above, Prometheus only exposes metrics; a Prometheus server must pull (scrape) them periodically to collect the data.
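Outside Kubernetes, a plain Prometheus scrape job for this would look roughly like the fragment below; the job name and target address are assumptions, and inside a cluster a ServiceMonitor can take this role instead:

```yaml
# prometheus.yml (fragment): scrape Traefik's metrics endpoint every 30s
scrape_configs:
  - job_name: "traefik"
    scrape_interval: 30s
    metrics_path: /metrics
    static_configs:
      - targets: ["traefik-host:8082"]  # the entrypoint from --entryPoints.metrics.address=:8082
```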

After the configuration takes effect, you can access the following ports for testing:

$ curl localhost:8082/metrics

Or use the default Traefik endpoint (the admin port):


$ curl localhost:8080/metrics
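The output is in Prometheus's text exposition format, one `name{labels} value` sample per line. A minimal parser sketch for pulling one metric's samples out of such output (the scrape text below is a hypothetical example in that format):

```python
def parse_samples(text, metric):
    """Extract (labels, value) pairs for one metric from Prometheus text-format output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # skip comments and other metrics; metric+"{" avoids matching e.g. <metric>_sum
        if line.startswith("#") or not line.startswith(metric + "{"):
            continue
        name_part, value = line.rsplit(" ", 1)
        labels = name_part[len(metric):].strip("{}")
        samples.append((labels, float(value)))
    return samples

# hypothetical scrape output in Prometheus exposition format
text = """# HELP traefik_entrypoint_requests_total total requests
# TYPE traefik_entrypoint_requests_total counter
traefik_entrypoint_requests_total{code="200",entrypoint="http",method="GET",protocol="http"} 421
traefik_entrypoint_requests_total{code="503",entrypoint="http",method="GET",protocol="http"} 3
"""
print(parse_samples(text, "traefik_entrypoint_requests_total"))
```

For real workloads, use an existing exposition-format parser or Prometheus itself rather than this sketch.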

StatsD backend support

Detailed configuration:

# toml configuration file
[metrics]
  [metrics.statsD]
    address = "localhost:8125"    # statsd address
    addEntryPointsLabels = true   # add metrics labels on entrypoints [true]
    addServicesLabels = true      # enable metrics on services [true]
    pushInterval = "10s"          # interval for pushing metrics to statsd ["10s"]
    prefix = "traefik"            # prefix for metrics collection [traefik]

CLI configuration


--metrics.statsd=true --metrics.statsd.address=localhost:8125 --metrics.statsd.addEntryPointsLabels=true --metrics.statsd.addServicesLabels=true --metrics.statsd.pushInterval=10s --metrics.statsd.prefix="traefik"

Metrics analysis based on Prometheus backend

Note: Because traefik-1.7.6 is used in our production environment, some of the configuration parameters above may not apply to it; check the parameters supported by your specific version. In addition, we use Traefik as the Ingress solution in a Kubernetes cluster, so the following operations are performed inside an available K8s cluster.

traefik-1.7-metrics

1. Traefik metrics configuration

$ cat traefik-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik-config
  namespace: kube-system
data:
  traefik.toml: |
    defaultEntryPoints = ["http","https"]
    debug = false
    logLevel = "INFO"
    InsecureSkipVerify = true
    [entryPoints]
      [entryPoints.http]
      address = ":80"
      compress = true
      [entryPoints.https]
      address = ":443"
        [entryPoints.https.tls]
    [web]
    address = ":8080"
    [kubernetes]
    [metrics]
      [metrics.prometheus]
      buckets = [0.1,0.3,1.2,5.0]
      entryPoint = "http"

After the Pods are rescheduled, you can view the exposed endpoints:

$ kubectl get ep -A | grep traefik
kube-system   traefik-ingress-service   172.16.171.163:80,172.16.21.26:80,172.16.21.27:80 + 11 more...         122d
kube-system   traefik-web-ui            172.16.171.163:8080,172.16.21.26:8080,172.16.21.27:8080 + 4 more...    171d

$ kubectl get svc -n kube-system | grep traefik
traefik-ingress-service   ClusterIP   10.253.132.216   <none>   80/TCP,8080/TCP   122d
traefik-web-ui            ClusterIP   10.253.54.184    <none>   80/TCP            171d

Test access to the metrics service (because both services associated with Traefik route to the traefik-ingress Pods, the two addresses behave the same):

$ curl -s 10.253.54.184/metrics | head -10


Note: It is recommended to use the SVC address exposed by traefik-web-ui.

Metrics and their meanings:

Metric                                              Meaning
process_max_fds                                     Maximum number of fds for the traefik process
process_open_fds                                    Number of fds the process has open
process_resident_memory_bytes                       Resident memory used by the process
process_start_time_seconds                          Process start time
process_virtual_memory_bytes                        Virtual memory used by the process
traefik_backend_open_connections                    Open connections on a backend
traefik_backend_request_duration_seconds_bucket     Backend request processing time (histogram buckets)
traefik_backend_request_duration_seconds_sum        Total backend request processing time
traefik_backend_request_duration_seconds_count      Total number of observed backend requests
traefik_backend_requests_total                      Total requests processed by a backend (by status code, protocol, and method)
traefik_backend_server_up                           Whether the backend is up (0 = down, 1 = up)
traefik_config_last_reload_failure                  Timestamp of the last failed reload
traefik_config_last_reload_success                  Timestamp of the last successful reload
traefik_config_reloads_failure_total                Total number of failed reloads
traefik_config_reloads_total                        Total number of reloads
traefik_entrypoint_open_connections                 Open connections on an entrypoint (by method and protocol)
traefik_entrypoint_request_duration_seconds_bucket  Entrypoint request processing time (by status code, protocol, and method)
traefik_entrypoint_requests_total                   Total requests processed by an entrypoint (by status code, protocol, and method)
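From counters such as traefik_backend_requests_total, a service error ratio can be derived (in Prometheus you would normally do this with rate() over a time window; the sketch below just shows the arithmetic on a hypothetical counter snapshot):

```python
def error_ratio(requests_by_code):
    """Share of 5xx responses among total requests, from one counter snapshot."""
    total = sum(requests_by_code.values())
    errors = sum(v for code, v in requests_by_code.items() if code.startswith("5"))
    return errors / total if total else 0.0

# hypothetical traefik_backend_requests_total values, keyed by status code
snapshot = {"200": 421, "302": 35, "503": 4}
print(round(error_ratio(snapshot), 4))  # 0.0087
```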

2. Configure Prometheus Server

Note that at this point we need to configure Prometheus Server to periodically pull the metrics exposed by Traefik.

# create a Prometheus ServiceMonitor for Traefik
$ cat prometheus-traefik.yml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: traefik-ingress-lb
  name: traefik-metrics
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: admin
    path: /metrics
  jobLabel: k8s-app
  # match the k8s-app=traefik-ingress-lb svc in the kube-system namespace
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: traefik-ingress-lb

$ kubectl apply -f prometheus-traefik.yml

After configuration, you can view metrics data in Prometheus.

prometheus-traefik-metrics

3. Configure Grafana dashboards based on these metrics

Based on these Traefik metrics, there are many open-source dashboard templates on Grafana's official website. I have also open-sourced a template built from actual requirements, which can be imported and used directly (the import procedure is easy to find online).

Traefik-Process-All-Grafana-Template

Traefik- Global monitoring details

With metrics visualized, it is easy to see how overall traffic changes during rolling upgrades and traffic-switching releases of HTTP services.

Tracing

Tracing systems allow developers to visualize the flow of calls in their infrastructure.

Traefik follows the OpenTracing specification, an open standard designed for distributed tracing.

Traefik supports the following tracing backends:

  • Jaeger
  • Zipkin
  • Datadog
  • Instana
  • Haystack
  • Elastic

Note: Datadog, Instana, and Haystack are commercial solutions and will not be covered below.

1. General configuration

Note: By default, Traefik uses Jaeger as the tracing backend.

# toml configuration file
$ cat traefik.toml
[tracing]
  serviceName = "traefik"   # service name reported to the tracing backend [traefik]
  spanNameLimit = 150       # truncate span names longer than this (prevents some tracing providers from dropping traces that exceed their length limit) [0 (no truncation)]

CLI configuration

--tracing=true --tracing.serviceName=traefik --tracing.spanNameLimit=150


2.Jaeger

Related configurations:

# toml configuration file
$ cat traefik.toml
[tracing]
  [tracing.jaeger]   # enable jaeger tracing support
    samplingServerURL = "http://localhost:5778/sampling"   # address of the jaeger-agent HTTP sampling endpoint
    samplingType = "const"       # sampling type [const (const | probabilistic | rateLimiting)]
    samplingParam = 1.0          # value of the sampling parameter [1.0 (const: 0 | 1, probabilistic: 0-1, rateLimiting: spans per second)]
    localAgentHostPort = "127.0.0.1:6831"   # address where spans are sent to the jaeger-agent
    gen128Bit = true             # generate 128-bit trace IDs, compatible with OpenCensus
    propagation = "jaeger"       # header type used to propagate trace data [jaeger (jaeger | b3, compatible with OpenZipkin)]
    traceContextHeaderName = "uber-trace-id"   # HTTP header used to propagate the tracing context
    [tracing.jaeger.collector]   # send spans directly to a jaeger-collector endpoint
      endpoint = "http://127.0.0.1:14268/api/traces?format=jaeger.thrift"
      user = "my-user"           # HTTP basic auth user when submitting to the collector [""]
      password = "my-password"   # HTTP basic auth password when submitting to the collector [""]

CLI configuration

--tracing.jaeger=true
--tracing.jaeger.samplingServerURL=http://localhost:5778/sampling
--tracing.jaeger.samplingType=const
--tracing.jaeger.samplingParam=1.0
--tracing.jaeger.localAgentHostPort=127.0.0.1:6831
--tracing.jaeger.gen128Bit
--tracing.jaeger.propagation=jaeger
--tracing.jaeger.traceContextHeaderName=uber-trace-id
--tracing.jaeger.collector.endpoint=http://127.0.0.1:14268/api/traces?format=jaeger.thrift
--tracing.jaeger.collector.user=my-user
--tracing.jaeger.collector.password=my-password
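For reference, the uber-trace-id header carries the trace context as {trace-id}:{span-id}:{parent-span-id}:{flags} in hex, where flags bit 1 means "sampled". A small sketch of building and parsing such a header value (the helper names are my own, not a Jaeger client API):

```python
import secrets

def make_uber_trace_id(flags=1, gen128bit=True):
    """Build a Jaeger 'uber-trace-id' value: {trace-id}:{span-id}:{parent-span-id}:{flags}."""
    trace_id = secrets.token_hex(16 if gen128bit else 8)  # 128-bit or 64-bit trace id
    span_id = secrets.token_hex(8)
    parent_id = "0"  # a root span has no parent
    return f"{trace_id}:{span_id}:{parent_id}:{flags:x}"

def parse_uber_trace_id(value):
    trace_id, span_id, parent_id, flags = value.split(":")
    return {"trace_id": trace_id, "span_id": span_id,
            "parent_id": parent_id, "sampled": bool(int(flags, 16) & 1)}

header = make_uber_trace_id()
print(parse_uber_trace_id(header)["sampled"])  # True
```

This is the header Traefik forwards to upstream services so their spans join the same trace.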


3.Zipkin

Related configurations:

# toml configuration file
[tracing]
  [tracing.zipkin]   # enable zipkin tracing support
    httpEndpoint = "http://localhost:9411/api/v2/spans"   # zipkin collector endpoint
    id128Bit = true    # use 128-bit zipkin trace IDs [true]
    sampleRate = 0.2   # rate of requests to trace [1.0 (0.1 - 1.0)]

CLI configuration

--tracing.zipkin=true
--tracing.zipkin.httpEndpoint=http://localhost:9411/api/v2/spans
--tracing.zipkin.sameSpan=true
--tracing.zipkin.id128Bit=false
--tracing.zipkin.sampleRate=0.2


4.Elastic

Related configurations:

# toml configuration file
$ cat traefik.toml
[tracing]
  [tracing.elastic]
    serverURL = "http://apm:8200"   # Elastic APM server URL
    secretToken = ""                # token for the Elastic APM service [""]
    serviceEnvironment = ""         # environment name reported to the APM server [""]

CLI configuration

--tracing.elastic=true --tracing.elastic.serverurl="http://apm:8200" --tracing.elastic.secrettoken="mytoken" --tracing.elastic.serviceenvironment="production"


