In a Kubernetes cluster, an Ingress controller is commonly used to route traffic into the cluster; the most common choices are Traefik and Nginx, while traditional Nginx is also widely used as a reverse proxy and load balancer inside enterprises. Different scenarios require specific configuration to meet their requirements, otherwise bizarre exception status codes appear. In this article we take a look at the 500 and 502 exceptions we encountered with Traefik.

One, foreword

Before we begin, let’s take a look at a few exception status codes that are commonly encountered in Nginx as a reverse proxy tool:

  • 499: The client disconnected. Usually the request did not return within the time the client specified, so the client actively closed the connection and received no data (Nginx logs a 499). This is generally a client timeout.
  • 500: Internal server error. The server encountered an unknown error that prevented it from completing the request, usually a bug in the backend business logic itself.
  • 502: Bad gateway. The gateway did not receive the expected response from the upstream service (the upstream returned no data, or data that does not conform to the protocol), so the gateway gives up and returns a gateway error. This generally means the backend server is down or the business logic timed out.
  • 504: Gateway timeout. The gateway did not obtain a response from the upstream in time. This is usually caused by a timeout of the Nginx gateway acting as a client to the upstream server; the upstream server itself keeps executing until it finishes. (Timeout of the Nginx gateway as a client.)
# In the case of 499, the client specified a timeout of N seconds, but the request actually took M seconds to execute on the server (M > N), so the client got impatient and closed the connection

# For 499 status, the solution is to optimize the back-end code logic or modify the nginx parameters

$ cat nginx.conf

proxy_ignore_client_abort    on;

# Simulate a client-side timeout of 3 seconds
$ curl -I -m 3 http://127.0.0.1/hello.php





# 502 is usually a case where the server on the back end of the Nginx gateway is down

# For example, request_terminate_timeout is set to 5s in php-fpm.conf; when the business logic runs longer than that, PHP-FPM terminates the worker and abnormal data is returned to the Nginx gateway

# Several back-end error logs commonly seen with 502:

 recv() failed (104: Connection reset by peer) while reading response header from upstream

 upstream prematurely closed connection while reading response header from upstream

 connect() failed (111: Connection refused) while connecting to upstream

In general, these problems occur because the back end has hung up, or because the back end is overloaded and temporarily unresponsive.

Adding proxy_read_timeout to Nginx can provide temporary relief:

$ cat nginx.conf

proxy_read_timeout  20s;



# In the 504 case, the chain is client -> Nginx -> Backend. Nginx acts as a client to the Backend service, but Nginx's own client-side timeout expires before the Backend finishes executing, so a 504 exception occurs (but what does the client actually get back?).

# For the 504 scenario, the common practice is to optimize the Backend logic to reduce execution time appropriately; you can also increase the timeout for Nginx acting as a client

# Timeout parameters when Nginx is used as a proxy for different backend protocols (uwsgi, fastcgi, http)

$ cat nginx.conf

uwsgi_connect_timeout 5;

uwsgi_send_timeout 5;

uwsgi_read_timeout 5;

fastcgi_read_timeout 5;

fastcgi_send_timeout 5;

proxy_connect_timeout      90;

proxy_send_timeout         90;

proxy_read_timeout         90;


Traefik’s quirky 500 and 502

Deployment configuration of Traefik in a Kubernetes cluster

The traefik configuration for our current cluster is as follows:

# Traefik configmap configuration file

$ cat traefik-config.yaml

apiVersion: v1

kind: ConfigMap

metadata:

  name: traefik-config

  namespace: kube-system

data:

  traefik.toml: |

    defaultEntryPoints = ["http", "https"]

    debug = false

    logLevel = "INFO"



    InsecureSkipVerify = true

    [entryPoints]

      [entryPoints.http]

      address = ": 80"

      compress = true

      [entryPoints.https]

      address = ": 443"

        [entryPoints.https.tls]

    [web]

      address = ": 8080"

    [kubernetes]

    [metrics]

      [metrics.prometheus]

      buckets = [0.1, 0.3, 1.2, 5.0]

      entryPoint = "traefik"

    [ping]

    entryPoint = "http"



# Traefik DaemonSet configuration

$ cat traefik-ds-v1.7.16.yaml

---

apiVersion: v1

kind: ServiceAccount

metadata:

  name: traefik-ingress-controller

  namespace: kube-system

---

kind: DaemonSet

apiVersion: extensions/v1beta1

metadata:

  name: traefik-ingress-controller

  namespace: kube-system

  labels:

    k8s-app: traefik-ingress-lb

spec:

  template:

    metadata:

      labels:

        k8s-app: traefik-ingress-lb

        name: traefik-ingress-lb

    spec:

      affinity:

        nodeAffinity:

          requiredDuringSchedulingIgnoredDuringExecution:

            nodeSelectorTerms:

            - matchExpressions:

              - key: node-role.kubernetes.io/master

                operator: DoesNotExist

      serviceAccountName: traefik-ingress-controller

      terminationGracePeriodSeconds: 30

      hostNetwork: true

      containers:

      - image: traefik:v1.7.16

        name: traefik-ingress-lb

        ports:

        - name: http

          containerPort: 80

          hostPort: 80

        - name: admin

          containerPort: 8080

        securityContext:

          capabilities:

            drop:

            - ALL

            add:

            - NET_BIND_SERVICE

        args:

        - --api

        - --kubernetes

        - --logLevel=INFO

        - --traefikLog.filePath=/logdata/traefik.log

        - --configfile=/config/traefik.toml

        - --accesslog.filepath=/logdata/access.log

        - --accesslog.bufferingsize=100

        volumeMounts:

        - mountPath: /config

          name: config

        - mountPath: /logdata

          name: access-log

      volumes:

      - configMap:

          name: traefik-config

        name: config

      - name: access-log

        hostPath:

          path: /opt/logs/ingress/

---

kind: Service

apiVersion: v1

metadata:

  name: traefik-ingress-service

  namespace: kube-system

  labels:

    k8s-app: traefik-ingress-lb

spec:

  selector:

    k8s-app: traefik-ingress-lb

  ports:

    - protocol: TCP

      port: 80

      name: web

    - protocol: TCP

      port: 8080

      name: admin


Python’s external API interface

# interface external ingress

$ kubectl  get ingress -n s-data

NAME                     HOSTS                    ADDRESS   PORTS   AGE

data-api.bgbiao.cn   data-api.bgbiao.cn             80      236d

ops.bgbiao.cn       ops.bgbiao.cn                 80      236d



# Test the external interface

$ curl data-api.bgbiao.cn  -i

HTTP/1.1 401 Unauthorized

Access-Control-Allow-Headers: Content-Type, X-TOKEN

Access-Control-Allow-Origin: *

Content-Length: 58

Content-Type: application/json

Vary: Accept-Encoding

Date: Sun, 28 Jun 2020 14:55:00 GMT



# The interface requires login, so we pressure test the login interface to reproduce the problem

$ curl -X POST  --data '@/root/login.json' -H 'Content-type:application/json' http://data-api.bgbiao.cn/account/users/login/   -i

HTTP/1.1 200 OK

Access-Control-Allow-Headers: Content-Type, X-TOKEN

Access-Control-Allow-Origin: *

Content-Length: 250

Content-Type: application/json

Vary: Accept-Encoding

Date: Sun, 28 Jun 2020 14:56:33 GMT


The weird 500 and 502

After the service was deployed, everything was fine, but a simple pressure test revealed that some requests to the service failed.

# Pressure test using the ab tool

# According to the results: 200 requests at a concurrency of 20, of which 7 failed

$ ab -c 20 -n 200   -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/

.

Benchmarking data-api.bgbiao.cn (be patient)

Completed 100 requests

Completed 200 requests

Finished 200 requests





Server Software:

Server Hostname:        data-api.bgbiao.cn

Server Port:            80



Document Path:          /account/users/login/

Document Length:        250 bytes



Concurrency Level:      20

Time taken for tests:   1.340 seconds

Complete requests:      200

Failed requests:        7

   (Connect: 0, Receive: 0, Length: 7, Exceptions: 0)

Write errors:           0

Non-2xx responses:      7

Total transferred:      91371 bytes

Total body sent:        46400

HTML transferred:       48387 bytes

Requests per second:    149.21 [#/sec] (mean)

Time per request:       134.035 [ms] (mean)

Time per request:       6.702 [ms] (mean, across all concurrent requests)

Transfer rate:          66.57 [Kbytes/sec] received

                        33.81 kb/s sent

                        100.38 kb/s total



Connection Times (ms)

              min  mean[+/-sd] median   max

Connect:        1    1   0.1      1       1

Processing:     2  116  27.8    114     179

Waiting:        2  116  27.8    114     179

Total:          3  117  27.8    116     180



Percentage of the requests served within a certain time (ms)

  50%    116

  66%    121

  75%    125

  80%    129

  90%    154

  95%    167

  98%    173

  99%    175

 100%    180 (longest request)





# Save the pressure test results to a text file for simple analysis

# A simple analysis of the 200 pressure test requests shows 4 failed: two 500 errors and two 502 errors

$ ab -v 10 -c 20 -n 200   -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/  > ab-log.txt

$ cat ab-log.txt | grep HTTP | sort | uniq -c

    196 HTTP/1.0 200 OK

      2 HTTP/1.0 500 Internal Server Error

      2 HTTP/1.0 502 Bad Gateway

      1 POST /account/users/login/ HTTP/1.0


Traefik 500 and 502 troubleshooting

In a Kubernetes cluster, Traefik plays a role similar to Nginx. At the beginning of this article we covered the causes of the 500 and 502 status codes in the Nginx scenario and the related solutions.

At the beginning we showed the Traefik configuration used in the cluster. For an SRE, any service put into production must expose observability data, so by default we persist Traefik's access log and process log (access.log and traefik.log respectively) and expose Traefik's Prometheus metrics endpoint.
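
A quick way to confirm that the metrics endpoint is actually reachable is to query it from one of the nodes running the DaemonSet. This is only a sketch: the node address is a placeholder, and the metric names are the series normally exposed by Traefik 1.7's Prometheus provider, so adjust them if your build differs.

$ curl -s http://<traefik-node-ip>:8080/metrics | grep -E 'traefik_(entrypoint|backend)_requests_total' | head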

For the pressure test above, we captured the following abnormal entries in the access log:

$ tail -f access.log | grep data-api.bgbiao.cn | grep -v 'HTTP/1.0" 200'

192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267376 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 0ms

192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267385 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms

192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267410 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms

192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267418 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms

192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267484 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms

192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122267518 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms

192.168.0.23 - - [28/Jun/2020:14:57:39 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267550 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 4ms

192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122272696 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 2ms

192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122272711 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms

192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122272836 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 0ms

192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122272837 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 0ms


As you can see, the status codes of the failed requests match those in our pressure test results: 500 and 502. There are more than four entries here, but that is not important for now.
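
To quantify this without eyeballing the log, the status codes can be counted directly. A small sketch, assuming the CLF-style access log layout shown above, in which the HTTP status code is the 9th whitespace-separated field:

$ grep data-api.bgbiao.cn access.log | awk '{print $9}' | sort | uniq -c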

Usually, when people see a 500 or 502 at the proxy layer, they subconsciously assume the upstream service must be at fault. That assumption can be eliminated quickly; the elimination methods are basically as follows:

  • Keep accessing the upstream service directly while the 500s and 502s occur during the pressure test
  • Pressure test the upstream service directly with the same parameters (see the sketch after this list)
  • Deploy the upstream service on a separate ECS instance and pressure test it with the same parameters
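
A minimal sketch of the second check, pointing ab at one of the upstream pod endpoints seen in the access log (20.0.41.8:8080 here) with the same payload and concurrency, so that Traefik is taken out of the path entirely:

$ ab -c 20 -n 200 -T 'application/json' -p /root/login.json http://20.0.41.8:8080/account/users/login/
# If "Failed requests" stays at 0 here (and in the other two checks), the upstream service is effectively ruled out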

After these three tests we had basically ruled out a problem with the upstream service, so we began to suspect that Traefik itself had a performance or configuration problem.

Change the log level in Traefik to DEBUG:

$ cat traefik-ds-v1.7.6.yaml

.

        args:

        - --api

        - --kubernetes

        - --logLevel=DEBUG

.


The following information is displayed in the log:

# 500 Related logs

time="2020-06-28T15:35:05Z" level=debug msg="'500 Internal Server Error' caused by: http: server closed idle connection"

time="2020-06-28T15:35:05Z" level=debug msg="Vulcand/foxy HTTP: / forward/Round trip: http://20.0.26.9:8080, code: 500, Length: 21, duration: 1.486276 ms"



# 502 Related logs

time="2020-06-28T15:35:05Z" level=debug msg="'502 Bad Gateway' caused by: EOF"

time="2020-06-28T15:35:05Z" level=debug msg="Vulcand/foxy HTTP: / forward/Round trip: http://20.0.26.9:8080, code: 502, Length: 11, duration: 1.530677 ms"


The 500 above is probably caused by Traefik's server process actively closing an idle connection, while the 502 below is caused by an EOF: the connection was closed before Traefik finished reading the response, so Traefik returned 502. This situation also occurs with Nginx, where it is easy to fix by adjusting configuration parameters (mentioned at the beginning of this article), but for Traefik the deployment pattern and parameter tuning require a careful read of the documentation.
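
For reference, the analogous adjustment on the Nginx side is to let the proxy reuse upstream keep-alive connections cleanly instead of racing against idle-connection closes. The following is only a rough sketch with a hypothetical upstream block built from the two pod endpoints seen in the access log; it is not part of our actual setup:

upstream data_api_backend {
    server 20.0.41.8:8080;
    server 20.0.26.9:8080;
    keepalive 32;                        # pool of idle keep-alive connections kept per worker
}

server {
    listen 80;
    server_name data-api.bgbiao.cn;

    location / {
        proxy_pass http://data_api_backend;
        proxy_http_version 1.1;          # upstream keep-alive requires HTTP/1.1
        proxy_set_header Connection "";  # clear the Connection header so connections can be reused
        proxy_read_timeout 60s;
    }
}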

A search of Traefik-related issues on GitHub showed that this problem had come up several times:

  • Traefik return 500 internal error – no 500 on backend
  • Sporadic 502 response only when running through traefik

One is 500 issues and one is 502 issues, but usually they come in pairs.

500 and 502 problem solutions

As mentioned in the first issue above, Traefik enables HTTP keep-alive by default when acting as an HTTP reverse proxy, but the Python application did not. Since the test service was in fact developed in Python, we adjusted that parameter first.

# Add keep-alive to uwsgi.ini

$ cat uwsgi.ini

[uwsgi]

http = 0.0.0.0:8080

http-keepalive = 1

chdir = /opt/app/

wsgi-file = /opt/app/main.py

callable = app

stats = 0.0.0.0:8081

processes = 2

threads = 10



# After restarting the application, test again

# The 502 and 500 errors that appeared before have almost disappeared

# 200 concurrency, 10000 requests in total, 0 failures, about 1 minute overall (QPS around 160 😸)

$ ab  -c 200 -n 10000   -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/

.

.



Concurrency Level:      200

Time taken for tests:   59.323 seconds

Complete requests:      10000

Failed requests:        0

Write errors:           0

Total transferred:      4670000 bytes

Total body sent:        2320000

HTML transferred:       2500000 bytes

Requests per second:    168.57 [#/sec] (mean)

Time per request:       1186.454 [ms] (mean)

Time per request:       5.932 [ms] (mean, across all concurrent requests)

Transfer rate:          76.88 [Kbytes/sec] received

                        38.19 kb/s sent

                        115.07 kb/s total



Connection Times (ms)

              min  mean[+/-sd] median   max

Connect:        1    3  38.6      1    1035

Processing:   101  942 1457.7    857   32684

Waiting:      101  942 1457.7    857   32684

Total:        102  945 1458.0    861   32685



# P99 reaches 7.3s

Percentage of the requests served within a certain time (ms)

  50%    861

  66%   1033

  75%   1136

  80%   1191

  90%   1886

  95%   2281

  98%   4209

  99%   7399

 100%  32685 (longest request)


The 500 and 502 problems were temporarily solved by enabling HTTP keep-alive at the application layer. Can the problem also be mitigated through Traefik parameters? That is what the second issue above mentions.

That is, modify the following traefik parameters and redeploy the traefik cluster:

# Disable Traefik's keep-alive (idle connection) pool. The default value is 200; if the parameter is 0, Go's DefaultMaxIdleConnsPerHost from the standard library is used

# keep-alive is mainly used to reuse connections and reduce open file descriptors, but with a large number of short connections this problem can occur

--maxidleconnsperhost=-1

# Set the number of retries (default is 0), and increase the idle-connection and response-forwarding timeouts below

--retry.attempts=10

# This parameter has been replaced by --respondingtimeouts.idletimeout; the default is 3m0s

--idletimeout=60s

# The default is 0s

--forwardingtimeouts.responseheadertimeout=60s



# Traefik idle connection timeout

$ ./traefik --help | grep idletimeout

    --idletimeout (Deprecated) maximum amount of time an idle (keep-alive) connection will remain (default "0s")

    --respondingtimeouts.idletimeout IdleTimeout is the maximum amount duration an idle (keep-alive) connection will (default "3m0s")



# Parameters related to response timeouts

 $ ./traefik --help | grep respondingtimeouts

    --respondingtimeouts Timeouts for incoming requests to the Traefik instance (default "true")

    --respondingtimeouts.idletimeout IdleTimeout is the maximum amount duration an idle (keep-alive) connection will (default "3m0s")

    --respondingtimeouts.readtimeout ReadTimeout is the maximum duration for reading the entire request, including (default "0s")

    --respondingtimeouts.writetimeout WriteTimeout is the maximum duration before timing out writes of the response. (default "0s")





# forwarding timeout

$ ./traefik --help | grep forwardingtimeouts

    --forwardingtimeouts Timeouts for requests forwarded to the backend servers (default "true")

    --forwardingtimeouts.dialtimeout The amount of time to wait until a connection to a backend server can be (default "30s")

    --forwardingtimeouts.responseheadertimeout The amount of time to wait for a server's response headers after fully writing (default "0s")


The final modified traefik parameters are as follows:

        - --maxidleconnsperhost=-1

        - --api

        - --kubernetes

        - --logLevel=INFO

        - --traefikLog.filePath=/logdata/traefik.log

        - --configfile=/config/traefik.toml

        - --accesslog.filepath=/logdata/access.log

        - --accesslog.bufferingsize=100

        - --forwardingtimeouts.responseheadertimeout=60s

        - --respondingtimeouts.idletimeout=180s

        - --retry.attempts=10

        - --idletimeout=180s


The creepy 503 in SLB

After the domain name was resolved directly to the Traefik nodes, the 502s and 500s no longer appeared. We then mounted the Traefik nodes behind Alibaba Cloud's internal SLB, and a strange 503 problem began to appear.

A simple pressure test after connecting the SLB (the internal network uses the free SLB specification 😹):

# client->ali-slb->traefik->pods

$ ab -v 10 -c 200 -n 2000   -T 'application/json' -p postfile.json http://data-api.soulapp.cn/get_doc > slb-log.txt

$ cat slb-log.txt | grep 'HTTP/1.1 200' | wc -l

1322

$ cat slb-log.txt | grep 'HTTP/1.1 503' | wc -l

678





# client->traefik->pods

$ ab -v 10 -c 200 -n 2000   -T 'application/json' -p postfile.json http://data-api.c.bgbiao.cn/get_doc > traefik-log.txt

Completed 200 requests

Completed 400 requests

Completed 600 requests

Completed 800 requests

Completed 1000 requests

Completed 1200 requests

Completed 1400 requests

Completed 1600 requests

Completed 1800 requests

Completed 2000 requests

Finished 2000 requests



$ cat traefik-log.txt | grep 'HTTP/1.0 200' | wc -l

2000


Later we found the following note in the Alibaba Cloud documentation. In essence, the SLB side applies rate-limiting policies to SLB instances of different specifications, and when the limit is hit it returns 503 to the client.

Ali cloud TPS – 503
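
A quick way to confirm that these 503s are generated by the SLB itself rather than by Traefik or the backend is to check whether they show up in Traefik's access log at all; requests rejected by SLB rate limiting never reach the ingress layer. A sketch, assuming the access-log path mounted in the DaemonSet above and the CLF layout in which the status code is the 9th field:

$ grep data-api /opt/logs/ingress/access.log | awk '$9 == 503' | wc -l
# If this stays at 0 while the client saw 678 of them, the 503s are coming from the SLB layer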

