From the official account Gopher指北

Do not urge others to be kind until you have suffered what they have suffered.

After more than two weeks of hard fighting to investigate why an online interface was timing out 100% of the time, there are finally some results. Some doubts remain, but given limited time and my own limited ability, I am summarizing what I have so far to prepare for future battles.

Outbound bandwidth saturated

Finding this problem was a fluke. I vaguely remember it was a stormy night; with that wind and that rain, the night was destined to be extraordinary. Sure enough, the root cause of the 100% online timeouts was discovered!

Our online interface has to make requests to external services. With our outbound bandwidth saturated, those requests naturally took a long time, which led to the timeouts. Of course, this is only the conclusion; the hardship of getting there is far beyond what Lao Xu's words can describe.

Reflection

The result is in, but some reflection is still called for. For example, why was there no early warning when the outbound bandwidth filled up? Whether the cause was overconfidence that the bandwidth was sufficient or a lack of experience, it is a lesson worth remembering for Lao Xu.

Before the bandwidth problem was actually confirmed, Lao Xu had already suspected the bandwidth, but he did not verify the suspicion seriously. He only listened to other people's speculation, which delayed the discovery of the problem.

httptrace

At this point I have to sing the praises of Go's good support for HTTP tracing. Lao Xu built a demo on top of it that prints the time spent in each stage of an HTTP request.

The above is the time spent in each stage of an HTTP request. Those who are interested can get the source code at https://github.com/Isites/go-coder/blob/master/httptrace/trace.go.

Lao Xu's suspicion about the bandwidth came mainly from analysis and testing done online with the source code of this demo.

Framework problem

This part is mainly for colleagues at Tencent; readers outside Tencent can skip it.

Our framework is TarsGo. Online we set handletimeout to 1500ms. This parameter is meant to cap the total time of an interface call at 1500ms, and our timeout alarms are all set at 3s, so even with the bandwidth saturated, this 100% timeout alarm should never have appeared.

To find out why, Lao Xu spent bits and pieces of spare time reading the source code and found that the handletimeout control in TarsGo@v1.1.6 does not take effect.

Let's take a look at the source code in question:

func (s *TarsProtocol) InvokeTimeout(pkg []byte) []byte {
    rspPackage := requestf.ResponsePacket{}
    rspPackage.IRet = 1
    rspPackage.SResultDesc = "server invoke timeout"
    return s.rsp2Byte(&rspPackage)
}

When the total execution time of an interface exceeds handletimeout, the InvokeTimeout method is called to tell the client that the call has timed out. The logic above leaves IRequestId unset, so when the client receives the response packet it cannot match it to any request, and it keeps waiting for a response until its own timeout expires.

The final modification is as follows:

func (s *TarsProtocol) InvokeTimeout(pkg []byte) []byte {
    rspPackage := requestf.ResponsePacket{}
    //  invoketimeout need to return IRequestId
    reqPackage := requestf.RequestPacket{}
    is := codec.NewReader(pkg[4:])
    reqPackage.ReadFrom(is)
    rspPackage.IRequestId = reqPackage.IRequestId
    rspPackage.IRet = 1
    rspPackage.SResultDesc = "server invoke timeout"
    return s.rsp2Byte(&rspPackage)
}

Afterwards, Lao Xu used a demo to verify that handletimeout finally took effect. Lao Xu has submitted an issue and a PR for this change on GitHub, and it has now been merged into master. The related issue and PR are as follows:

https://github.com/TarsCloud/TarsGo/issues/294

https://github.com/TarsCloud/TarsGo/pull/295

Remaining doubts

Up to this point, the matter has still not been fully resolved.

The picture above shows the maximum latency we recorded for external requests. The spikes are severe and the durations are simply unreasonable: the red part in the picture takes about 881 seconds, even though we enforce strict timeout control when making HTTP requests. This is the problem that has troubled Lao Xu the most. The acne on his face these days is proof of the late nights it caused.

What is even more frightening is that after we replaced Go's official http package with fasthttp, the spikes disappeared! Lao Xu thought he had at least a modest understanding of Go's http source code, and this cruel reality makes him doubt life.

Since then, Lao Xu has briefly read through the http source code once more and still has not found any problem. There is a high probability that this will become an unsolved case. I hope experienced experts will share a few words of advice so that this article can at least have an ending as well as a beginning.

When we switched to fasthttp, the bandwidth had not been found to be full.

Beautiful vision

Finally, there is nothing more to say; just look at the picture!


Gopher指北