This article has also been collected in my GitHub repository.

Preface: Our company recently switched cloud service providers. I am responsible for a service that aggregates third-party API calls, and while migrating it from Alibaba Cloud to Tencent Cloud we found that calls from Tencent Cloud to the Jingdong interface showed severe TP999 jitter during load testing. Although the business layer retried failed calls, there were still too many timeouts to meet the business requirements... So we set out on the path of optimizing the various problems found along the way.

How it started

It was an ordinary night when alarm messages suddenly started pouring into the group: multiple business lines were timing out on external API calls, and the dubbo thread pool was running out of threads. Fortunately the service recovered quickly, but it was a big wake-up call for me. The service I maintain is the aggregation exit for business calls to third parties, and any anomaly in it leads to a large number of business failures. We already knew from load testing that the availability of API calls would drop noticeably after the cloud migration, but such a big impact before the C-side business had even been migrated caught us unprepared.

Analyzing the surface problem

Is it a good or a bad instinct to start solving a problem from the easiest place? We analyzed the problem first. The alarm was about a skyrocketing thread pool, and by combining logs with our APM tool (SkyWalking) we found that both thread count and CPU spiked at the time of the alarm. Since the main function of the service is making third-party calls, many threads were stuck in the step of calling the third-party API and waiting for a response. The explanation is straightforward: the third-party API jittered, and the service kept creating too many HTTP connections, which first made the application respond slowly; the slow responses then triggered client timeouts and retries, doubling the load on the application and making the slowness even worse. Without rate limiting or the ability to degrade, this situation can easily turn into an avalanche. Fortunately there was not too much business loss, but a mature programmer cannot leave their own pits unfilled, so hurry up and fill them!
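As an aside, here is a minimal sketch of what "limiting plus fast degradation" can look like, using a plain semaphore as a bulkhead in front of the third-party calls. The class, method names, and limits are hypothetical illustrations, not the service's actual implementation:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical bulkhead that fails fast instead of letting blocked threads pile up. */
public class ThirdPartyBulkhead {

    // Cap the number of in-flight third-party requests (value is illustrative).
    private final Semaphore permits = new Semaphore(50);

    public String callWithLimit(ThirdPartyClient client, String request) throws InterruptedException {
        // Wait briefly for a permit, then degrade immediately rather than queueing forever.
        if (!permits.tryAcquire(100, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("third-party capacity exhausted, degrading");
        }
        try {
            return client.callApi(request);
        } finally {
            permits.release();
        }
    }

    /** Stand-in for the real third-party SDK client. */
    public interface ThirdPartyClient {
        String callApi(String request);
    }
}
```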

Filling the pits on multiple fronts

Once the problem is identified, there is no shortage of solutions.

  1. Change the HTTP calls to use an HTTP connection pool

This option had actually been considered during the earlier load tests. The third-party SDK makes its HTTP calls with Java's native HttpURLConnection, which keeps creating new connections under high interface concurrency, so its performance is genuinely poor. Considering the team's familiarity with it, we overrode the SDK's HttpUtil implementation with HttpClient, which required only minor changes. After the change, another load test showed that both TP999 and CPU utilization dropped significantly. A beautiful move: high return on a small investment! A minimal sketch of the pooled client follows.
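The sketch below assumes Apache HttpClient 4.x; the pool sizes are illustrative, not the values used in production:

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class PooledHttpClientFactory {

    public static CloseableHttpClient create() {
        // Reuse TCP connections instead of opening a new one per request,
        // which is effectively what HttpURLConnection did here under load.
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(200);            // total connections across all routes
        cm.setDefaultMaxPerRoute(50);   // connections per target host

        return HttpClients.custom()
                .setConnectionManager(cm)
                .build();
    }
}
```

The returned client can then be wrapped by the overridden HttpUtil so the rest of the SDK code stays untouched.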

  2. Is the first fix enough? Add request timeout tiers and thread isolation

After the first fix went live, a new problem appeared: the business side was still seeing obvious timeouts. The reason is easy to see. We had configured only a single timeout for all businesses, so although some of them (mainly C-side businesses, which are timeout sensitive) would time out early and trigger their own retries, our application threads were still blocked on the original requests, wasting resources. Thinking it through, the earlier problem had not really been solved: with client retries on top, the dubbo thread count would still explode. Since different business parties have different response-time requirements for the interface, we implemented connection pool isolation, and callers can specify a request level when they invoke us. Under a rough classification we created three tiers, 'FAST', 'STANDARD' and 'SLOW', and each tier is configured with a different timeout scheme (a sketch of the tier configuration follows the timeout note below). Service stability took another step forward, and once again the problem seemed solved.

Note: when HttpClient is used with a connection pool, all three timeout parameters must be configured:

- ConnectionRequestTimeout: how long to wait to obtain a connection from the connection pool
- ConnectTimeout: how long to wait to establish the connection
- SocketTimeout: how long to wait when reading data from the socket
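A minimal sketch of the FAST/STANDARD/SLOW tiers mentioned above, again assuming Apache HttpClient 4.x; the millisecond values are assumptions for illustration, not the production settings:

```java
import org.apache.http.client.config.RequestConfig;

/** One RequestConfig per request level; callers pick a tier when they invoke the service. */
public enum TimeoutTier {
    FAST(100, 200, 300),
    STANDARD(200, 500, 1000),
    SLOW(500, 1000, 3000);

    private final RequestConfig config;

    TimeoutTier(int connectionRequestTimeoutMs, int connectTimeoutMs, int socketTimeoutMs) {
        this.config = RequestConfig.custom()
                .setConnectionRequestTimeout(connectionRequestTimeoutMs) // wait for a pooled connection
                .setConnectTimeout(connectTimeoutMs)                     // establish the TCP connection
                .setSocketTimeout(socketTimeoutMs)                       // wait for response data
                .build();
    }

    public RequestConfig config() {
        return config;
    }
}
```

A latency-sensitive C-side caller would then have its requests executed with `TimeoutTier.FAST.config()`, while more tolerant B-side callers use STANDARD or SLOW.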
  3. Apply retries reasonably (consumer-side retries)

    Even though the provider service has been optimized at the code level, retries still need to be used properly to improve the business success rate. The point to note is that when a downstream service is responding slowly, multi-level retries by consumers must be avoided: if every layer of the call chain retries, the pressure on the slow service keeps growing, and in serious cases a retry storm crushes the service outright. Setting reasonable retries is therefore a key link, and a circuit-breaker/degradation scheme should also be considered to avoid accidents. A hedged sketch of consumer-side settings follows.
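The sketch assumes Apache Dubbo's annotation-based configuration (2.7+); the facade interface and the values are hypothetical, chosen only to show restrained retry settings:

```java
import org.apache.dubbo.config.annotation.DubboReference;

public class AggregationConsumer {

    // Latency-sensitive caller: fail fast and do not retry automatically;
    // rely on circuit breaking / degradation instead of adding load to a slow downstream.
    @DubboReference(timeout = 500, retries = 0)
    private ThirdPartyFacade fastFacade;

    // Tolerant caller: a single retry is acceptable if the downstream call is idempotent.
    @DubboReference(timeout = 3000, retries = 1)
    private ThirdPartyFacade slowFacade;

    /** Hypothetical dubbo service interface, standing in for the real aggregation facade. */
    public interface ThirdPartyFacade {
        String invoke(String request);
    }
}
```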

After these three moves in sequence, the problem basically came to an end. On a 2C4G machine with 200 dubbo threads and 30 HTTP threads, with throughput fixed at 1K, the service's TP999 stayed below 500 ms throughout the load test, meeting everyone's requirements.

The problem resurfaces

As more and more applications came on board, our C-side business side raised a new question: why were the timeouts of this service so much worse than those of the Alibaba Cloud deployment (specifically for the Jingdong interface)? Since the new business went live directly on Tencent Cloud, they suspected a performance problem in our service. I honestly had not paid much attention to this while there were only B-side services, but now it had to be checked carefully, again starting from the APM tool. Sure enough, the interface's TP999 jittered badly, with many requests hitting the timeout threshold. C-side services attach great importance to these indicators, which is worth learning from: they readily expose the root problem, whereas TP90 and TP99 can look perfectly normal under high concurrency because the small number of very slow requests is hidden among all the fast ones.

Struggling to solve it

This time the problem was no longer simple. Faced with the TP999 timeouts, we looked at the application's overall traffic, the JVM, and so on, but could not find a bottleneck. Comparing with the previous Alibaba Cloud deployment, we believed the problem most likely lay at the network request level. So we started working through it step by step.

0. Checking the APM tool, we found two patterns: some HTTP requests really did time out, while others did not take long at the HTTP level, yet according to the APM call-link statistics the time spent inside the provider service was long enough to cause a timeout. For the first pattern we suspected the HTTP connection pool: perhaps connections sitting in the pool had already timed out or been closed on the server side. The httpClient logs could not confirm this, because the response headers only returned Keep-Alive and carried no timeout value. For the second pattern I searched repeatedly but could not find a reason; the service was not under much pressure and there were no long GC pauses, so it was genuinely hard to explain.
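For the stale-connection suspicion above, one common safeguard (shown only as a hedged sketch; it is not the fix that ultimately mattered here) is to cap how long a pooled connection may be kept alive when the server's Keep-Alive header carries no timeout, and to evict idle connections, assuming Apache HttpClient 4.5+:

```java
import java.util.concurrent.TimeUnit;

import org.apache.http.HttpResponse;
import org.apache.http.conn.ConnectionKeepAliveStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultConnectionKeepAliveStrategy;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.protocol.HttpContext;

public class KeepAliveSafeguard {

    public static CloseableHttpClient create() {
        // If the server only answers "Connection: keep-alive" with no timeout value,
        // assume a short lifetime instead of trusting the connection forever.
        ConnectionKeepAliveStrategy strategy = new DefaultConnectionKeepAliveStrategy() {
            @Override
            public long getKeepAliveDuration(HttpResponse response, HttpContext context) {
                long duration = super.getKeepAliveDuration(response, context);
                return duration > 0 ? duration : 30_000; // fall back to 30 seconds
            }
        };

        return HttpClients.custom()
                .setKeepAliveStrategy(strategy)
                .evictExpiredConnections()
                .evictIdleConnections(60, TimeUnit.SECONDS)
                .build();
    }
}
```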

1. To observe the application's request and response traffic, we captured packets at the HTTP egress. Analyzing the captures for a large number of requests, we found that whenever the response time was high, the captured packets contained an IP address located in Hong Kong.

2. Our servers and egress IPs are all in Beijing, so why would DNS resolution return an IP in Hong Kong? With this question we asked Tencent Cloud for help. While they were troubleshooting, we ran other tests ourselves, including a small Python script on the server that periodically resolved the domain name in question. Tencent Cloud then came back with an answer: because the DNS resolution servers operated by Jingdong itself could not accurately identify the region of Tencent Cloud IPs, resolution occasionally returned a Hong Kong address. Their solution was to push Jingdong to recognize Tencent Cloud IPs correctly.

3. The official fix would be slow, so was there any temporary workaround on our side? While we were stuck, we had also connected to the Kong egress gateway built for us by the architecture team, which had a dedicated network egress, higher bandwidth, and complete Grafana monitoring; combined with that monitoring, the request jitter was even more obvious. We therefore thought of two options: pinning the IP in the hosts file, or changing the DNS servers. Pinning hosts was too risky, so we tried changing DNS. The machines had been using Tencent Cloud's default DNS; we tested Alibaba's public DNS and 114DNS separately, and the results showed little difference between the two, so we finally adopted Alibaba's DNS. After the switch the improvement was obvious. The outcome of this problem is clear, but the cause is still at the guessing stage: our guess is that 114DNS and AliDNS have optimizations for this case and therefore behave better than Tencent's DNS in this scenario. If you have a way to verify this, you are welcome to share it.
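For reference, the temporary change amounts to pointing the machine's resolver at AliDNS instead of the default Tencent Cloud DNS, roughly like the following /etc/resolv.conf (the addresses shown are the commonly published public AliDNS pair; adjust to whatever your environment actually uses):

```
# /etc/resolv.conf -- switch from the default Tencent Cloud resolver to AliDNS
nameserver 223.5.5.5
nameserver 223.6.6.6
```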

One more pitfall: the DNS resolver used by the Kong gateway is set in the kong.conf configuration file. If it is not configured, Kong reads the system's /etc/resolv.conf at each startup and caches the result in memory, so changing the system's nameserver after Kong has started has no effect on Kong! Fortunately we had tested the DNS change on a dedicated machine beforehand, so we were confident the issue was Kong's configuration, but it still took quite a while to find.
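A sketch of the relevant kong.conf line; if dns_resolver is left unset, Kong falls back to reading /etc/resolv.conf once at startup and caches it, which is exactly the trap described above (the addresses are placeholders):

```
# kong.conf -- give Kong its own DNS servers explicitly
dns_resolver = 223.5.5.5,223.6.6.6
```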

A side note: if you need someone else's cooperation to troubleshoot, come prepared with comparison data. It avoids the other party declining to cooperate out of doubt, and gives you something to fall back on when the joint results do not match expectations.

4. Remember the issue at the beginning whose cause we could not find? "Some HTTP requests did not take long at the HTTP level, but according to APM statistics the call link spent a long time in the provider service, resulting in a timeout." I compared Kong's httpClient logs with the server's httpClient logs and found that the response headers came back quickly, but the body had not been read when the timeout was triggered. That is why the HTTP call looks so fast in SkyWalking: as far as it is concerned, the request is over once the response starts. Then why was only the header returned quickly? I pinged the problematic Hong Kong IP: the response time was clearly slower than Beijing's, and there was occasional packet loss. That explains the problem.

There is always something to gain

This round of problem solving, testing, and verifying took almost two weeks, and the availability of the service improved by a level. It was a small pit, but it forced us to go back over the entire HTTP request process and understand it more deeply. That is my whole journey of chasing down these timeout problems; thank you for reading this far, and I hope we can make progress together.