Gopher refers to north

Never urge others to be kind until you have suffered what they have suffered

After more than two weeks of hard work tracking down why one of our production interfaces was timing out 100% of the time, I finally have some results. Some doubts remain, but given the time constraints and the limits of my own ability, I am writing this summary now so that I am better prepared for the next fight.

The egress bandwidth was saturated

The discovery was a fluke. As I vaguely recall, it was a stormy night, and the wind and rain hinted that it would be no ordinary one. Sure enough, that night the root cause of the 100% timeouts in production was found!

Our production interface has to make requests to external services, and our egress bandwidth was saturated, so those requests took a long time and timed out. Of course, that is only the conclusion; the hardship along the way is far more than Lao Xu can put into words.

Reflection

The outcome leaves plenty to reflect on. For example, why was there no alert before the egress bandwidth filled up? Both overconfidence and lack of experience are worth writing down.

Before the bandwidth problem was actually confirmed, Lao Xu already suspected bandwidth, but he never verified it seriously and instead listened to other people's guesses, which delayed the discovery.

httptrace

Go's excellent support for HTTP tracing deserves some praise here. Lao Xu built a demo on top of it that prints the time spent in each stage of an HTTP request.

The above is the output for each stage of one HTTP request. If you are interested, the source is at github.com/Isites/go-c…

Lao Xu's suspicion about bandwidth came mainly from running the code in this demo against production and reasoning from the measurements.

Framework problem

This part is mainly of interest to colleagues at Tencent; readers who do not use Tencent's technology stack can skip it.

Our company's framework is TarsGo, and in production we set handleTimeout to 1500ms. This parameter caps the total handling time of an interface at 1500ms. Our timeout alarm threshold is 3s, so even with the bandwidth saturated, the 100% timeout alarm should never have fired.

To find out why it did, Lao Xu had to spend some time reading the source code, and eventually found that handleTimeout had no effect in the TarsGo version we were running.

Take a look at the source code in question:

func (s *TarsProtocol) InvokeTimeout(pkg []byte) []byte {
	rspPackage := requestf.ResponsePacket{}
	rspPackage.IRet = 1
	rspPackage.SResultDesc = "server invoke timeout"
	return s.rsp2Byte(&rspPackage)
}

When the total execution time of an interface exceeds handleTimeout, InvokeTimeout is called to tell the client that the call has timed out. However, the code above never sets IRequestId on the response, so the client cannot match the response packet to any outstanding request. The client therefore keeps waiting until its own timeout fires.

The final modification is as follows:

func (s *TarsProtocol) InvokeTimeout(pkg []byte) []byte {
	rspPackage := requestf.ResponsePacket{}
	// The timeout reply must echo the caller's IRequestId so that the
	// client can match it to the pending request.
	reqPackage := requestf.RequestPacket{}
	// Skip the 4-byte length prefix and decode the request header.
	is := codec.NewReader(pkg[4:])
	reqPackage.ReadFrom(is)
	rspPackage.IRequestId = reqPackage.IRequestId
	rspPackage.IRet = 1
	rspPackage.SResultDesc = "server invoke timeout"
	return s.rsp2Byte(&rspPackage)
}

Afterwards, Lao Xu verified with a local demo that handleTimeout finally took effect. He has also filed an issue and a PR on GitHub for this change, and the PR has been merged into master. The related issue and PR are:

Github.com/TarsCloud/T…

Github.com/TarsCloud/T…

Remaining doubts

Even so, the matter is not fully resolved.

The figure above shows the maximum time taken by our external requests. The spikes are severe, and the durations look almost unnatural. The part marked in red took about 881 seconds, even though we enforce a strict timeout when we issue HTTP requests. This is the problem that troubles Lao Xu the most; the acne on his face these days is proof of the late nights.

Even more surprising, when we replaced the standard library's net/http with fasthttp, the spikes disappeared! Lao Xu thought he had at least a basic understanding of Go's HTTP source code, but the cruel reality made him question everything.

Lao Xu has since skimmed the HTTP source code again and still has not found the problem. It will most likely remain an open case, so he hopes experienced readers can offer a pointer or two and at least give this article a proper ending.

Note: the egress bandwidth was not saturated at the time we switched to fasthttp.

Best wishes

Finally, without further ado, here is the picture!