Author: Liu Hao

License statement: Creative Commons Attribution 4.0 International License.

Background

It was a sunny morning, and I arrived at the office with my power bank as usual. Recently I have been maintaining the company's basic push service. Because pushing is asynchronous, the business side cannot see error logs or delivery status right after calling the interface, and log queries lag behind by some delay. A debugging tool called Pushup was created to address this problem. Pushup shows the delivery status of a message with almost zero delay and offers troubleshooting suggestions for error logs, which makes integration more efficient. It looks something like this:

A colleague from one business line reported that Huawei push messages were intermittently lost during their self-service integration. At first I was stunned, and then I assumed they must be using it wrong: Huawei's push has few restrictions and is relatively stable. Still, in a rigorous spirit, I got ready to check. I jumped in at lightning speed to try it myself and found that messages were indeed being lost [dog head][dog head][dog head].

Locating the problem with a trace

The push service involves many service modules. Since this problem is easy to reproduce, I queried by the Trace ID of a failed request. The cause of the problem is shown in the following figure:

Summary

It appears that the cause of the failure is a DNS resolution timeout in the Go service.

Testing on a machine

When you find a problem, first check whether it is a problem on your own side. For a DNS resolution timeout, the simplest and most direct approach is to test whether DNS resolution itself works. I tested this in the following two steps.

dig

dig is a common DNS resolution testing tool that I believe everyone is familiar with. The test results were unexpected, though: responses came back in milliseconds, with no anomalies across many runs. Not ready to give up, I tried curl and found that it also responded in milliseconds. I was a little dumbfounded...

tcpdump

dig found no problem, so I captured packets to take a look. In the test environment, a large number of responses for this domain carried the SERVFAIL status code.
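
For readers who have not looked at raw DNS packets: the status code shown in a capture lives in the low four bits of the flags field of the 12-byte DNS header (RFC 1035). The Go sketch below decodes it. The `rcodeOf` helper and the sample header bytes are made up for illustration and are not taken from the capture described above.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// rcodeNames maps the RCODE values defined in RFC 1035 to their names.
var rcodeNames = map[uint16]string{
	0: "NOERROR",
	1: "FORMERR",
	2: "SERVFAIL",
	3: "NXDOMAIN",
	4: "NOTIMP",
	5: "REFUSED",
}

// rcodeOf extracts the response code from a raw DNS message.
// The header is 12 bytes; bytes 2-3 hold the flags, and the low 4 bits are the RCODE.
func rcodeOf(msg []byte) (string, error) {
	if len(msg) < 12 {
		return "", errors.New("message shorter than DNS header")
	}
	flags := binary.BigEndian.Uint16(msg[2:4])
	code := flags & 0x000F
	if name, ok := rcodeNames[code]; ok {
		return name, nil
	}
	return fmt.Sprintf("RCODE%d", code), nil
}

func main() {
	// Hypothetical header: ID 0x1234, flags 0x8182 (response, RD, RA, RCODE=2), one question.
	header := []byte{0x12, 0x34, 0x81, 0x82, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}
	name, err := rcodeOf(header)
	fmt.Println(name, err) // prints: SERVFAIL <nil>
}
```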

Summary

curl responded normally and dig responded normally, but the packet capture showed SERVFAIL status codes when the domain name was resolved. The response times observed with dig were not particularly bad, yet DNS resolution in the net library took more than 5 seconds, and in some tests more than 10 seconds. (At this point I called in SRE to investigate together.)

Debugging the source code

nslookup and the other tools are all fast, yet resolution in the net library is slow, which is confusing. Puzzled, I chose to debug the source code of the net library. There are generally two ways to debug source code: setting breakpoints with debugging tools such as dlv or gdb, or printf debugging.

Since the problem only involved DNS resolution, I stripped out the business logic and wrote a few lines of debugging code as simple as this:

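package main

import (
	"log"
	"net"
	"os"
	"runtime"

	"github.com/pkg/profile"
)

func main() {
	log.Printf("dig %s\n", os.Args[1])
	// Raise the CPU sampling rate before profiling starts; the run is too short for the default rate.
	runtime.SetCPUProfileRate(5000)
	// profile.Start defaults to CPU profiling; Stop flushes the profile on exit.
	p := profile.Start()
	defer p.Stop()
	// Resolve the host name with the Go net library and log the result.
	log.Println(net.LookupHost(os.Args[1]))
}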

This code serves two purposes: first, it strips away the business logic to test whether DNS resolution in the Go net library itself has a problem; second, it profiles where the running time goes. Because the program runs for such a short time, you need to raise the CPU sampling rate with runtime.SetCPUProfileRate(5000); otherwise the profile file comes out empty.

In addition, since I use a Mac with an M1 chip, I cross-compiled the executable on my machine:
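
A rough estimate shows why the default rate is too low: the number of samples is about the on-CPU time multiplied by the sampling rate. The sketch below assumes Go's default CPU profiling rate of about 100 Hz and a made-up on-CPU time of 2 ms. `expectedSamples` is a hypothetical helper, not part of any library.

```go
package main

import (
	"fmt"
	"time"
)

// expectedSamples estimates how many CPU profile samples a run collects:
// on-CPU time multiplied by the sampling rate in Hz, rounded down.
func expectedSamples(cpuTime time.Duration, hz int64) int64 {
	return cpuTime.Nanoseconds() * hz / int64(time.Second)
}

func main() {
	cpuTime := 2 * time.Millisecond // hypothetical on-CPU time of one lookup

	fmt.Println(expectedSamples(cpuTime, 100))  // default rate, prints: 0
	fmt.Println(expectedSamples(cpuTime, 5000)) // raised rate, prints: 10
}
```

Time spent waiting for the network does not count as on-CPU time, so even a lookup that takes seconds of wall-clock time can yield very few samples.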

 GOOS=linux GOARCH=386 go build

After running it, I found that the problem was indeed there; more surprisingly, this run took more than 10s to return the resolved IP. I then downloaded the cpu.pprof file and opened it with go tool pprof, which gave the following result (this may not be the result of the first run, but that does not affect the analysis):

Although this graph shows roughly where the time goes, it is still hard to infer the problem from it, so I turned to printf debugging.

Printf is a good solution

Printf debugging is an ancient magic that has been handed down to the present day. It is still very useful, and even highly respected, in diagnosing certain problems. Put simply, it is the process of printing key logs and inferring the program's behavior from them. To troubleshoot this problem, I needed to insert log statements into the DNS resolution code of the net library.

A word of caution

Changing the net library code is risky: who knows what else might get changed along the way. Put the code under git before editing so that every change can be tracked and reverted.

What to print

Debug logs should be detailed and focused on key information. They should include the execution time, the function name, and key parameters. Pay attention to formatting too: in a neatly formatted log, problems are often easier to spot.
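
As an illustration of such a log line, here is a minimal sketch. The `debugLine` helper and the sample values are hypothetical; the actual log statements inserted into the net library are not shown in this article.

```go
package main

import (
	"fmt"
	"time"
)

// debugLine builds one aligned debug line: elapsed time since the start of the run,
// function name, and key parameters. Fixed column widths keep the lines easy to scan.
func debugLine(elapsed time.Duration, fn string, params string) string {
	return fmt.Sprintf("[%10s] %-24s %s", elapsed.Round(time.Microsecond), fn, params)
}

func main() {
	start := time.Now()
	// Hypothetical call sites; in practice each line sits inside the function being traced.
	fmt.Println(debugLine(time.Since(start), "dnsPacketRoundTrip", "name=example.com qtype=A"))
	fmt.Println(debugLine(time.Since(start), "dnsPacketRoundTrip", "name=example.com qtype=AAAA"))
}
```

With the elapsed time in a fixed-width first column, a gap of several seconds between two adjacent lines stands out immediately.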

Where to print

First, pprof has already identified the most time-consuming functions, which can serve as a reference. Second, when locating a problem, printing by bisection is also recommended. Finally, although printing everywhere by brute force is not a crime, it is not encouraged [dog head].

The printed result

After using logs to narrow the problem down to the dnsPacketRoundTrip method, I added detailed log output to the execution of that method (sorry, I was rash; my apologies to everyone). The results show that resolving the A record is very fast, while resolving the AAAA record is time-consuming; in several tests some queries even timed out.

I don't know if you noticed: some requests came back quickly, but some came back very late and were still hanging. That's right, those are the AAAA queries!

Several key concepts of DNS

If you have ever configured DNS, you will often have come across the following record types:

  • A: an IPv4 address record
  • CNAME: a record pointing to another domain name
  • AAAA: an IPv6 address record
  • SOA: Start of Authority, the record that marks the start of a zone of authority

Does the domain name support IPv6?

According to the information above, only AAAA record resolution has a problem. So does this domain support IPv6? The official answer is no.

What could go wrong with SERVFAIL?

Under normal circumstances, a domain such as Xiaomi's api.xmpush.xiaomi.com returns NOERROR with an SOA attached when it has no AAAA record. The server then caches the empty result for the TTL seconds carried in the SOA. After SERVFAIL is returned, however, the server caches the result for only 1 second by default, which very easily leads to cache breakdown. This is similar to the reason empty records should be cached in service development.
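
The difference in caching time can be made concrete with a small sketch. It assumes a resolver that caches an empty NOERROR answer for the SOA TTL and a SERVFAIL answer for 1 second, as described above. The function names and the 5-minute SOA TTL are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// negativeTTL sketches how long a resolver might cache an answer that carries no usable records.
func negativeTTL(rcode string, soaTTL time.Duration) time.Duration {
	switch rcode {
	case "NOERROR":
		// Empty answer with an SOA: cache "no such record" for the SOA TTL.
		return soaTTL
	case "SERVFAIL":
		// Failure: cached only briefly (1 second, the default described above).
		return time.Second
	default:
		return 0
	}
}

// upstreamQueries estimates how many queries reach the upstream server during a window
// in which the same name is requested continuously (ceiling of window / ttl).
func upstreamQueries(window, ttl time.Duration) int64 {
	if ttl <= 0 {
		return 0 // not meaningful without a positive TTL
	}
	return int64((window + ttl - 1) / ttl)
}

func main() {
	window := 10 * time.Minute
	soaTTL := 5 * time.Minute // hypothetical SOA TTL

	fmt.Println(upstreamQueries(window, negativeTTL("NOERROR", soaTTL)))  // prints: 2
	fmt.Println(upstreamQueries(window, negativeTTL("SERVFAIL", soaTTL))) // prints: 600
}
```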

Who returned SERVFAIL?

The classroom was full of DNS servers: Alibaba Cloud DNS, 114.114, 8.8, and Tencent DNS. Student A went around the room asking who had returned SERVFAIL and found that almost everyone raised a hand 🙋. Then student A noticed that 8.8 had not raised a hand: why are you doing so well? So student A asked the Ali student, "Why did you fail when others didn't?" The Ali student looked through his notebook and said, "The 8.8.8.8 server took the SOA record directly from huawei.com, while our DNS went one more level, to cloud.huawei.com. Local DNS servers simply handle this differently." So nobody was wrong! We then reported the problem to the vendor but did not get a reply.

So, can we do without IPv6 resolution?

The debugging above shows that the main problem lies in the AAAA record, that is, in resolving the IPv6 address. A dig test of AAAA resolution against 114.114.114.114 showed that Huawei has not configured an AAAA record. Unfortunately, in the Go source I looked at (1.16.3) the query types are hard-coded (I checked the master branch, which has some minor changes in this area, but the types are still hard-coded).

func (r *Resolver) goLookupIPCNAMEOrder(ctx context.Context, name string, order hostLookupOrder) (addrs []IPAddr, cname dnsmessage.Name, err error) {
	// ...
	qtypes := [...]dnsmessage.Type{dnsmessage.TypeA, dnsmessage.TypeAAAA}
	// ...
}

Summary

The net library resolves the A record and the AAAA record at the same time. When the AAAA record is resolved, however, the client either receives no DNS response at all or receives a SERVFAIL response. As defined in RFC 1035, SERVFAIL generally indicates a server-side failure and should not be returned merely because a record does not exist.

In testing, 114.114.114.114 and Aliyun DNS return SERVFAIL, while 8.8.8.8 returns NOERROR with an SOA.

Is there a problem with the internal DNS service?

SRE said that the self-built DNS server forwards directly to Ali's DNS service when it finds that a name is an external domain. In our tests, however, although Ali returned SERVFAIL for the AAAA record, it responded quickly. SRE therefore continued capturing packets on the DNS server and found that when the local recursive server receives a SERVFAIL response, it triggers an iterative query. That query runs into problems as well and is retried, which wastes a lot of time.
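
Why a slow AAAA answer delays the whole lookup can be illustrated with a simulation. This sketch assumes the lookup returns only after both queries have finished, which matches the behavior seen in the logs above. The delays are made up and no real DNS traffic is sent.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// query simulates one DNS query that takes a fixed delay; it stands in for a real round trip.
type query struct {
	qtype string
	delay time.Duration
}

// resolveAll fires all queries concurrently and returns only after every one has finished,
// so the total time is dominated by the slowest query.
func resolveAll(queries []query) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for _, q := range queries {
		wg.Add(1)
		go func(q query) {
			defer wg.Done()
			time.Sleep(q.delay)
		}(q)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	total := resolveAll([]query{
		{qtype: "A", delay: 5 * time.Millisecond},      // fast answer
		{qtype: "AAAA", delay: 200 * time.Millisecond}, // slow answer
	})
	fmt.Printf("lookup returned after about %v\n", total.Round(10*time.Millisecond))
}
```

Even though the A answer arrives within milliseconds, the caller waits roughly as long as the AAAA query takes.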

How to solve it?

The problem is pretty clear at this point. The rest is easier to solve:

  • SRE configured the cloud.huawei.com domain to forbid iterative queries: the forwarding service returns whatever the upstream responds, and the domain name is not iterated again.
  • Since the Go service queries DNS for every request, consider caching DNS results locally.
  • Report the DNS AAAA resolution issue to the vendor (we have been doing so).
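
The local caching idea from the list above could look roughly like this. It is a minimal sketch with a fixed TTL rather than the TTL from the DNS answer. `dnsCache` and `newDNSCache` are hypothetical names, and a fake resolver is used so that the example runs without network access.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// lookupFunc resolves a host to addresses; net.LookupHost has this signature.
type lookupFunc func(host string) ([]string, error)

type entry struct {
	addrs   []string
	expires time.Time
}

// dnsCache is a small in-process cache with a fixed TTL.
type dnsCache struct {
	mu     sync.Mutex
	ttl    time.Duration
	lookup lookupFunc
	items  map[string]entry
}

func newDNSCache(ttl time.Duration, lookup lookupFunc) *dnsCache {
	return &dnsCache{ttl: ttl, lookup: lookup, items: make(map[string]entry)}
}

// LookupHost returns cached addresses while they are fresh and calls the
// underlying resolver otherwise. Errors are not cached.
func (c *dnsCache) LookupHost(host string) ([]string, error) {
	c.mu.Lock()
	e, ok := c.items[host]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.addrs, nil
	}
	addrs, err := c.lookup(host)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.items[host] = entry{addrs: addrs, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return addrs, nil
}

func main() {
	calls := 0
	// Fake resolver for the sketch; swap in net.LookupHost for real use.
	fake := func(host string) ([]string, error) {
		calls++
		return []string{"192.0.2.1"}, nil
	}
	cache := newDNSCache(30*time.Second, fake)
	cache.LookupHost("example.com")
	cache.LookupHost("example.com")
	fmt.Println("upstream lookups:", calls) // prints: upstream lookups: 1
}
```

A fixed TTL means address changes are picked up late, so it should stay short; the cache also has to be consulted wherever the service resolves the vendor's host name, which is not shown here.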

Conclusion

This article records the whole process for two reasons: first, to share the approach and methods used to track down the problem; second, to exchange ideas with everyone and grow together. Reviewing the process, analyzing what went well and what did not, and accumulating experience all help in choosing the best way to handle problems and in troubleshooting more efficiently.

Reference

  • Ali DNS: The thing about failed domain name resolution