Abstract: Alibaba Cloud SLB distributes traffic across multiple cloud servers (ECS) and supports TCP layer 4 load balancing and HTTP/HTTPS layer 7 load balancing. SLB reduces the impact on services when a single ECS instance becomes abnormal, which improves system availability. In addition, combined with Elastic Scaling Service (ESS), back-end servers can be scaled out or in dynamically to respond quickly to changes in service traffic.

The SLB layer 7 access log is rich in content and provides nearly 30 fields, such as the time the request was received, client IP address, latency, request URI, back-end real server (ECS) address, and returned status code. After the layer 7 access log function is enabled, SLB writes all access logs of the corresponding instance to Log Service. Through two topics, this article shows how to use Log Service to uncover some of the value behind SLB access logs.

Where do requests come from

This looks like a question that can be answered directly from the client_ip field of the access log. But sometimes we find that client_ip always takes the same handful of values, and our gut tells us something is wrong:

A client request travels from its original IP address to the SLB. If it does not pass through a proxy, client_ip records the original client IP address. If the request is forwarded by one or more proxies, the client_ip in the access log may not reflect the true source of the request.

Fortunately, two other fields in the SLB access log can help us recover the real client IP:

  • http_x_forwarded_for comes from the X-Forwarded-For HTTP header, a de facto standard (RFC 7239 standardizes its successor, the Forwarded header). Suppose the client is client_0 and the request passes through three proxies, proxy_1, proxy_2 and proxy_3, in sequence before reaching the server, with proxy_3 connected directly to the load balancer. Then proxy_3 appends the IP address of proxy_2 to X-Forwarded-For, indicating that it is forwarding the request on behalf of proxy_2. Layer by layer, this builds a comma-separated string "client_0_IP, proxy_1_IP, proxy_2_IP", whose first entry is the original client IP address.
  • http_x_real_ip comes from the custom HTTP header X-Real-IP, an informal convention that is widely used in the industry. It is the most convenient value, and it is correct provided that every proxy layer always records the original client IP address in it.
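As a small illustration of the comma-separated format described above, the following Python sketch pulls the left-most entry out of an X-Forwarded-For value. The function name is hypothetical, and treating "-" as "header absent" is an assumption about how the log renders missing values.

```python
from typing import Optional


def first_forwarded_ip(x_forwarded_for: Optional[str]) -> Optional[str]:
    """Return the left-most entry of an X-Forwarded-For value.

    Returns None when the value is missing, empty, or the "-" placeholder
    (assumed here to mean that the header was not present).
    """
    if not x_forwarded_for or x_forwarded_for.strip() == "-":
        return None
    # Entries are comma-separated; the first one is the original client.
    first = x_forwarded_for.split(",")[0].strip()
    return first or None


print(first_forwarded_ip("client_0_IP, proxy_1_IP, proxy_2_IP"))
```

The sketch does not validate that the entry is a well-formed IP address, and it trusts whatever the first proxy wrote.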

Note that both X-Forwarded-For and X-Real-IP can be inaccurate. If you're interested, read this article about the X-Forwarded-For header.

In this article, the real request source IP is computed with an X-Real-IP-first policy, and the algorithm is expressed as the following decision tree:

A value of "-" in http_x_forwarded_for or http_x_real_ip means that the field is invalid. In SQL, the decision is written with CASE WHEN.
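The same decision can be sketched in Python for readers who want to apply it outside Log Service. The fallback order used here (X-Real-IP, then the first X-Forwarded-For entry, then client_ip) is our reading of the X-Real-IP-first policy rather than a confirmed reproduction of the original query, and the function names are hypothetical.

```python
from typing import Optional

INVALID = "-"  # placeholder the access log uses for an invalid field


def _is_valid(value: Optional[str]) -> bool:
    """A field is usable when it is non-empty and not the "-" placeholder."""
    return bool(value) and value.strip() != INVALID


def real_client_ip(client_ip: str,
                   http_x_real_ip: Optional[str],
                   http_x_forwarded_for: Optional[str]) -> str:
    """Pick the most trustworthy source IP, preferring X-Real-IP."""
    if _is_valid(http_x_real_ip):
        return http_x_real_ip.strip()
    if _is_valid(http_x_forwarded_for):
        # The left-most entry is the original client.
        first = http_x_forwarded_for.split(",")[0].strip()
        if first:
            return first
    # Neither header is usable: fall back to the directly connected peer.
    return client_ip
```

For example, real_client_ip("10.0.0.1", "-", "198.51.100.2, 10.0.0.9") falls through to the X-Forwarded-For branch and returns "198.51.100.2".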

real_client_ip is the real client IP obtained by this algorithm:

Based on real_client_ip, you can use the IP geolocation functions of Log Service to compute geographic information about the access source (country, province, carrier, latitude and longitude). For example, the PV distribution can be computed by province:

What does the HTTP status code say

408 Request Timeout

Symptom

Clients request services deployed on SLB, but network timeouts often occur.

Troubleshooting

First, use a query to check whether there are any abnormal status codes:

not (status : 200) | select status, count(*) as pv group by status order by pv desc

The analysis shows that some requests in the last 15 minutes of the access log returned 408:

The 408 status code indicates that the server did not receive a complete request within a certain period of time. The server then decides not to wait any longer, sets the Connection header to close in the response, and actively closes the connection.

A 408 is reported as Request Timeout. The two most likely causes are that the client did not send the complete request to the server within the timeout period, or that the server is overloaded and cannot process requests in time. If monitoring rules out server-side load, you can shift the focus to the client side.

Look at the client_ip distribution of the 408 requests:

status : 408 | select client_ip, count(*) as pv group by client_ip order by pv desc

If client_ip is concentrated on a few specific sources, it is more likely that network problems on those individual clients are the cause.
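One way to put a number on "concentrated" is the share of 408 requests that comes from the few busiest sources. The following Python sketch assumes the client_ip values of the 408 requests have already been exported as a list; the function name and the default of three sources are illustrative choices, not part of the original analysis.

```python
from collections import Counter


def top_source_share(client_ips, top_n=3):
    """Fraction of requests that come from the top_n most frequent IPs."""
    counts = Counter(client_ips)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(pv for _, pv in counts.most_common(top_n))
    return top / total
```

A share close to 1.0 points at a few specific clients, while a low share suggests the timeouts are spread across many sources.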

At the same time, inspecting the logs with status code 408 shows that upstream_addr and upstream_status are not recorded for the abnormal requests, which means the requests never reached the back-end real server. At this point, a network timeout caused by client-side problems is very likely.
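The check described here can be sketched as a small Python predicate over one exported log record. It assumes the record is a dict keyed by the field names above and that an unrecorded field shows up as "-", an empty string, or is missing entirely; both the function name and that representation are assumptions.

```python
MISSING = ("", "-", None)  # assumed renderings of "not recorded"


def looks_client_side_408(record):
    """True when a 408 request never reached a back-end real server."""
    return (
        str(record.get("status")) == "408"
        and record.get("upstream_addr") in MISSING
        and record.get("upstream_status") in MISSING
    )
```

A True result only supports the client-side hypothesis; it does not rule out an overloaded SLB or server, which still needs to be checked through monitoring.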

Next, investigate on the client side using network monitoring or packet capture.

499 Client Closed Request

Symptom

Traffic on the SLB dropped, and no 5xx errors were seen on the back-end servers.

Troubleshooting

As usual, we first look at the distribution of abnormal status codes, and this time 499 is the suspect:

The 499 status code indicates that the client actively closed the connection while the server (Nginx) was still processing the request.

The abnormal access logs show that upstream_addr is recorded, so the request was dispatched to a real server, but the back-end status code upstream_status is not recorded, which indicates that the back-end server did not finish processing the request. In addition, the total request processing time request_time exceeded 10 seconds, so the long wait may have led the user to cancel the download.
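The pattern above (back end reached, no back-end status, long request_time) can be expressed as a Python predicate over one exported log record. This is a sketch under assumptions: the record is a dict keyed by the log field names, an unrecorded field appears as "-" or an empty string, request_time is in seconds, and the 10-second threshold simply mirrors the observation in this case.

```python
MISSING = ("", "-", None)  # assumed renderings of "not recorded"


def looks_like_abandoned_499(record, slow_seconds=10.0):
    """True when a 499 matches: reached back end, no reply, long wait."""
    if str(record.get("status")) != "499":
        return False
    reached_backend = record.get("upstream_addr") not in MISSING
    no_backend_status = record.get("upstream_status") in MISSING
    try:
        request_time = float(record.get("request_time", 0))
    except (TypeError, ValueError):
        # Unparseable duration: do not classify the record.
        return False
    return reached_backend and no_backend_status and request_time > slow_seconds
```

Records that match are candidates for slow back-end processing; the threshold should be tuned to what users of the service actually tolerate.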
