The problem

Recently, the distribution of requests across the nodes of an online service cluster has become very uneven; in the worst case, the traffic on the busiest node is 4-5 times that of the least-loaded node.

The service in question is a traffic replay service (referred to as service A below). Its traffic comes from two sources: front-end request traffic from the console, and continuous replay traffic pushed by DDMQ. The former is tiny and can be ignored, while the latter makes up the bulk of the traffic, estimated at roughly 3k~4k QPS.

The console traffic is resolved by domain name to Kirin and forwarded directly to the RS.

The DDMQ push traffic goes from DDMQ to the LVS, which forwards it to the RS.

To eliminate interference from the console during the investigation, the domain-name path that forwards traffic to the RS was temporarily disabled, leaving only the DDMQ traffic.

Initial hypotheses

The first reaction was to ask whether the container weights configured on the Elastic Cloud page differed across nodes, which would make the machines carry inconsistent load. A check showed that every node's weight was the default value of 100, with no anomalies, so this hypothesis was ruled out.

The second intuition was that a node's QPS may be positively correlated with the number of TCP connections the upstream DDMQ establishes to it, assuming the DDMQ httpClient balances requests reasonably across its downstream HTTP connection pool. A further assumption was that the total number of TCP connections on a service machine is positively correlated with the number of connections the upstream establishes to it.

However, the TCP connection monitoring showed the opposite: machine 0, which had the lowest QPS, maintained the most long-lived connections, while machine 4, which had the highest QPS, was not particularly prominent in total TCP connection count.

A closer look showed that most of the connections on machine 0 were TCP connections established by the local Service Mesh process Chorus (the department's self-developed Service Mesh proxy) to the local service, and only 6 were external TCP connections to the local Service Mesh inbound port. Machine 4, despite its relatively small total TCP connection count, had 24 external TCP connections to the local Service Mesh port. The hypothesis that a machine's total TCP connection count is positively correlated with the number of connections established by the upstream was therefore overturned, and the TCP connection monitoring chart was set aside for the moment.
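For reference, the per-machine connection breakdown above can be reproduced with a small script along the following lines. This is a minimal sketch assuming psutil is available; the port number 8000 stands in for the Service Mesh inbound port and is a placeholder, not the real value.

```python
# Sketch: count established TCP connections on the mesh inbound port, split into
# loopback peers (local processes such as Chorus) vs external peers (upstream
# traffic forwarded by LVS). Port 8000 is a placeholder, not the real port.
import psutil

MESH_INBOUND_PORT = 8000  # placeholder for the Service Mesh inbound port

local_cnt, external_cnt = 0, 0
for conn in psutil.net_connections(kind="tcp"):
    if conn.status != psutil.CONN_ESTABLISHED:
        continue
    if not conn.laddr or conn.laddr.port != MESH_INBOUND_PORT:
        continue
    peer_ip = conn.raddr.ip if conn.raddr else ""
    if peer_ip.startswith("127.") or peer_ip == "::1":
        local_cnt += 1      # connection from a local process
    else:
        external_cnt += 1   # connection from an external upstream

print(f"loopback: {local_cnt}, external: {external_cnt}")
```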

In other words, machine 4, which carried the largest load, maintained 24 external TCP connections to the inbound port. That finding is in fact consistent with the second hypothesis: QPS does appear to be positively correlated with the number of TCP connections the upstream (DDMQ) establishes to the service.

The imbalance might therefore be introduced when LVS forwards the first SYN packet of a connection, for which no session exists yet: if those new sessions are distributed unevenly across the downstream machines, TCP connection establishment becomes uneven. For a horizontal comparison, we looked at another service B in the department that has similar traffic and also relies on LVS to expose its service. Service B also showed a slight traffic imbalance, but nothing as extreme as service A: in roughly an hour of traffic data, the worst-case ratio of maximum to minimum per-machine QPS was only about 1.3~1.4, far better than service A.

Looking at service B's externally established TCP connections, it has more upstream callers and therefore many more established TCP connections, which is why its load balance is so much better than service A's.

Verification

We consulted the people responsible for LVS (DGW), who described the principle of DGW layer-4 load balancing as follows:

So-called layer-4 load balancing: the client accesses DGW through a VIP and vport (TCP/UDP port). DGW schedules every packet belonging to the same session (client_ip:client_port -> dgw_vip:dgw_vport) to the same rs_ip:rs_port (the real server). Different sessions are scheduled to different RS, which is how load balancing is achieved.

The difference between DGW layer-4 load balancing and a layer-7 proxy is that layer-4 load balancing only rewrites the IP and TCP/UDP headers of the client's packets and then forwards them to the RS. DGW does not establish its own connection to the RS; the real TCP/UDP connection negotiation still happens directly between client and server.
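To make the session-scheduling description concrete, here is a minimal illustrative sketch (not DGW's actual code): packets of an existing session, identified by the client/VIP 4-tuple, always go to the RS recorded in the session table, and only a brand-new session triggers a scheduling decision.

```python
# Illustrative sketch of layer-4 session scheduling; not DGW's real implementation.
from itertools import cycle

RS_LIST = ["rs-0", "rs-1", "rs-2"]   # downstream real servers (placeholders)
rs_picker = cycle(RS_LIST)           # scheduling decision made only for new sessions
session_table = {}                   # (client_ip, client_port, vip, vport) -> rs

def forward(client_ip, client_port, vip, vport):
    """Return the RS this packet should be rewritten to and forwarded to."""
    key = (client_ip, client_port, vip, vport)
    if key not in session_table:     # first SYN of a session: pick an RS
        session_table[key] = next(rs_picker)
    return session_table[key]        # later packets of the session stick to that RS

# All packets of one client connection land on the same RS:
print(forward("10.0.0.1", 50001, "10.1.1.1", 80))
print(forward("10.0.0.1", 50001, "10.1.1.1", 80))
# A different session may be scheduled to a different RS:
print(forward("10.0.0.2", 50002, "10.1.1.1", 80))
```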

In other words, LVS achieves load balancing mainly by choosing an RS for each new session, i.e., when it forwards a SYN packet that has no existing session. The imbalance of service A therefore seems to be introduced at session-establishment time. On further consultation, the LVS owners explained that a DGW cluster generally has 24 server nodes sharing the LVS traffic, and the default scheduling policy is a weighted mode whose weight is based on the CPU core count and defaults to 8. In the extreme case a single LVS node can therefore differ by 8 TCP connections between downstream RS, and 24 LVS nodes can magnify that difference by a factor of 24. When the total number of TCP connections is small, worst-case scheduling can thus produce a large difference in traffic between RS.
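A rough simulation makes the scale of the problem visible. It assumes that the weighted mode with weight 8 behaves like "each LVS node hands up to 8 consecutive new sessions to the same RS before moving on", which is our reading of the description above rather than DGW's documented behavior, and all parameters are illustrative. It also previews why scheme 2 (round-robin) below should help.

```python
# Rough simulation: schedule incoming connections from 24 LVS nodes onto the RS,
# comparing a "batch of 8" weighted policy (our reading of the default weight-8
# mode) against plain round-robin. All parameters are illustrative assumptions.
import random
from collections import Counter

NUM_LVS, NUM_RS, WEIGHT = 24, 8, 8

def simulate(total_conns, batch):
    # Each new connection first lands on one LVS node (ECMP-like), and that node
    # keeps assigning `batch` consecutive new sessions to the same RS.
    state = {}                                   # lvs_id -> (current_rs, remaining)
    load = Counter({rs: 0 for rs in range(NUM_RS)})
    for _ in range(total_conns):
        lvs = random.randrange(NUM_LVS)
        rs, left = state.get(lvs, (random.randrange(NUM_RS), batch))
        load[rs] += 1
        left -= 1
        if left == 0:                            # batch used up: move to the next RS
            rs, left = (rs + 1) % NUM_RS, batch
        state[lvs] = (rs, left)
    return load

random.seed(0)
for total in (30, 3000):                         # few long-lived conns vs many conns
    for batch, name in ((WEIGHT, "weighted(8)"), (1, "round-robin")):
        load = simulate(total, batch)
        print(f"{total:5d} conns  {name:12s} per-RS min/max: "
              f"{min(load.values())}/{max(load.values())}")
```

With only a handful of long-lived connections (the DDMQ-only case) the per-RS counts diverge noticeably, while with thousands of connections the differences average out, which matches what was observed on services A and B.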

This scenario matches the upstream of service A exactly: DDMQ push mode uses keepalive and maintains an HTTP connection pool toward the downstream. If DDMQ is the only upstream, the number of connections to the downstream stays small, so under DGW's default weighted scheduling policy the TCP connections end up unevenly distributed, and the request volume becomes uneven along with them. Service B's upstream is more varied; it has the same connection-level unevenness, but because it maintains far more connections, its request load imbalance is much milder than service A's.
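The bounded connection count on the pusher side can be illustrated with a generic HTTP client sketch; the pool sizes and URL below are placeholder assumptions, not DDMQ's actual configuration.

```python
# Sketch: an upstream pusher that reuses a small keepalive connection pool.
# Pool sizes and the URL are placeholders, not DDMQ's real configuration.
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# At most ~10 TCP connections to the VIP, no matter how many requests are pushed,
# so LVS only ever gets a handful of new sessions to schedule across the RS.
session.mount("http://", HTTPAdapter(pool_connections=1, pool_maxsize=10))

for i in range(10_000):
    # Every request rides on one of the pooled keepalive connections.
    session.post("http://replay-service.example.com/replay", json={"seq": i})
```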

Improvement plans

Scheme 1: forcibly degrade the upstream's keepalive HTTP requests by actively closing the upstream's long-lived HTTP connections, so that the upstream keeps re-establishing connections to the downstream and LVS gets repeated chances to re-schedule them across the RS. DGW did not disappoint: with this in place the load balancing effect was indeed ideal.
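On the service side, "forcibly degrading keepalive" roughly means answering with Connection: close, so the upstream has to reconnect and LVS gets a fresh scheduling decision for every new connection. Below is a minimal standard-library sketch of the idea, not service A's actual code; the port is a placeholder.

```python
# Sketch: force the upstream to re-establish connections by replying with
# "Connection: close"; each reconnect gives LVS another chance to re-balance.
# Illustrative only; not service A's actual implementation.
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class CloseConnectionHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_POST(self):
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        # ... handle the replayed request here ...
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.send_header("Connection", "close")  # degrade keepalive on purpose
        self.end_headers()
        self.wfile.write(b"ok")
        self.close_connection = True             # drop the TCP connection afterwards

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8000), CloseConnectionHandler).serve_forever()
```

A gentler variant is to close only every Nth request, which still forces periodic rescheduling without paying the reconnect cost on every call.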

However, this approach clearly wastes a lot of resources on connection setup and teardown and is not recommended. Instead, we intend to switch the LVS scheduling of downstream load from the weighted mode to round-robin mode (to be confirmed: SRE has not made the change yet, but for now we expect this approach to work better).

Scheme 2: change the default weighted mode to round-robin mode.

Scheme 3: change MQ push to pull. This is outside the scope of this discussion, so it is skipped.


Conclusion

In scenarios where the upstream request volume is large but the TCP connections are long-lived and few in number (for example, when the only upstream traffic is pushed downstream by DDMQ), it is not recommended to expose the service through DGW. If DGW must be used, it is advisable to change the default weighted mode to round-robin mode.