Recently we upgraded the K8S cluster on our intranet and found that the APISIX gateway service started returning 503, so we investigated. We use APISIX as a traffic gateway both on the intranet and in production, and we have contributed 6 PRs to APISIX, so we know its source code fairly well. The investigation below took several wrong turns, with emotional ups and downs along the way, so please read patiently.

The symptom

All API requests proxied by APISIX returned 503, as shown below.

The network topology is also very simple, with APISIX forwarding traffic to the Java service at the back end.

The APISIX error log is as follows.

The request can be reproduced with curl:

curl "http://school-performance-http.easicare-test-2:8080/school-performance/student-archive/schools/30375dee54dc47ef8410b6508cd7aa6a/archive-bags/0054616e455f4ccc91f64e9cf11e5571/students/335cb51e8c0343918969e939b1461e8f" \
     -H 'accesstoken: masaike'

When in doubt, capture packets first

The packet capture of the DNS requests sent by APISIX is shown below. The A (IPv4) query returns a normal result, while the AAAA (IPv6) query returns No such name. My first guess was that the Lua code failed to obtain an IP address, so it never went on to the TCP three-way handshake or sent the request, and the request was terminated inside APISIX.

The same behavior can be confirmed with nslookup:

$ nslookup -type=A school-performance-http.easicare-test-2
Server:   169.254.20.10
Address:  169.254.20.10#53

Name:     school-performance-http.easicare-test-2.svc.kubernetes.local
Address:  10.96.136.142

$ nslookup -type=AAAA school-performance-http.easicare-test-2
Server:   169.254.20.10
Address:  169.254.20.10#53

** server can't find school-performance-http.easicare-test-2: NXDOMAIN

As you can see, the A (IPv4) record is returned correctly, while the AAAA (IPv6) query returns NXDOMAIN. NXDOMAIN is the DNS response code RCODE=3, which means the queried domain name does not exist.
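To make the response code concrete, here is a minimal Python sketch that reads the RCODE from a raw DNS message header. The two headers are built by hand for illustration and are not taken from our capture; the function name `rcode_of` is hypothetical.

```python
import struct

# A few standard DNS response codes; the table is deliberately incomplete.
RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def rcode_of(message: bytes) -> str:
    """Return the response code name found in a raw DNS message header."""
    # The header starts with a 16-bit ID followed by a 16-bit flags word.
    _msg_id, flags = struct.unpack("!HH", message[:4])
    rcode = flags & 0x000F  # RCODE is the low 4 bits of the flags word
    return RCODE_NAMES.get(rcode, f"RCODE{rcode}")


# Hand-built 12-byte headers (illustrative values, not real captured traffic).
# 0x8180: response bit set, recursion desired and available, RCODE 0.
ok_header = struct.pack("!HHHHHH", 0x1234, 0x8180, 1, 1, 0, 0)
# 0x8183: the same flags with RCODE 3.
nx_header = struct.pack("!HHHHHH", 0x1234, 0x8183, 1, 0, 0, 0)

print(rcode_of(ok_header))  # NOERROR
print(rcode_of(nx_header))  # NXDOMAIN
```

The point is only that NXDOMAIN is a statement about the name itself, which is why it looked like a plausible culprit at this stage.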

Verifying whether the AAAA NXDOMAIN response is the cause

With that in mind, I looked at the code of the latest APISIX version and found that in January of this year logic was added that lets users turn off IPv6 resolution via apisix.enable_ipv6. The PR is here: github.com/apache/apis… It adds handling of the enable_ipv6 parameter to the nginx resolver configuration and to core/dns/client.lua.

Changes to the nginx configuration file section

Lua code changes

So we rebuilt and pushed the image with the latest version of APISIX, and sure enough, the problem went away.

At this point I thought I had found the root cause and put the problem aside.

It’s too early to rejoice

Later I thought: a major version upgrade brings a lot of changes, so how could I be sure this particular change was the fix? So I patched the APISIX 2.10.1 code to remove AAAA (IPv6) resolution, as shown below.

diff --git a/apisix/core/dns/client.lua b/apisix/core/dns/client.lua
index a6dbfb37..c5c1b8c3 100644
--- a/apisix/core/dns/client.lua
+++ b/apisix/core/dns/client.lua
@@ -137,7 +137,14 @@ function _M.new(opts)
     -- make sure each client has its separate room
     package_loaded["resty.dns.client"] = nil
     local dns_client_mod = require("resty.dns.client")
+    local table_remove = table.remove
 
+    for i, v in ipairs(opts.order) do
+        if v == "AAAA" then
+            table_remove(opts.order, i)
+            break
+        end
+    end

I expected this change to solve the problem, but the service still returned 503 and nothing was fixed. Moreover, the packet capture showed that no AAAA queries were being sent anymore, so my change had taken effect. This means the problem was not caused by the AAAA query returning NXDOMAIN.

I started to doubt everything. The packet capture showed that the A record had resolved successfully, so why did APISIX still treat the domain name resolution as failed?

Since the latest version, 2.13.0, works, let's compare the code and see how the DNS logic differs.

DNS resolution in APISIX is implemented by the lua-resty-dns-client library, which belongs to Kong, a peer project of APISIX: github.com/Kong/lua-re… The dependency comparison between 2.10.1 and 2.13.0 is as follows.

As you can see, APISIX 2.10.1 uses lua-resty-dns-client 5.2.0, while APISIX 2.13.0 uses 6.0.2. I overwrote the library in the older APISIX with the newer lua-resty-dns-client code:

cp -rf lua-resty-dns-client-5.2.3/src/resty/dns/* /usr/local/apisix/deps/share/lua/5.1/resty/dns/

After restarting APISIX the problem was gone, which confirmed that it had nothing to do with IPv4 versus IPv6. That also makes sense: treating the whole lookup as failed when the A record resolves and only the AAAA query fails would be too basic a mistake for APISIX to make.

At this point the problem was narrowed down to the library. Fortunately the library does not have many versions, so I bisected between 6.0.2 and 5.2.0, overwriting the library with each candidate version and testing.

It quickly became clear that version 5.2.2 works and lower versions do not, so I compared 5.2.2 with 5.2.1 to see what had changed.
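The bisection itself can be sketched in a few lines of Python. This is only an illustration: the version list contains just the versions named in this article (the real release list may have more), and `works` is a hypothetical stand-in for "overwrite the library, restart APISIX, retry the request", hard-coded to the outcome reported above.

```python
from typing import Callable, Sequence


def first_good(versions: Sequence[str], works: Callable[[str], bool]) -> str:
    """Binary-search an ordered version list for the first version that works.

    Assumes every version before the fix fails, every version from the fix
    onward works, and the last version in the list is known to work.
    """
    lo, hi = 0, len(versions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if works(versions[mid]):
            hi = mid       # the fix is at mid or earlier
        else:
            lo = mid + 1   # the fix is after mid
    return versions[lo]


# Only the versions mentioned in this article, oldest first.
VERSIONS = ["5.2.0", "5.2.1", "5.2.2", "5.2.3", "6.0.2"]


def works(version: str) -> bool:
    # Stand-in for the manual test; hard-coded to the reported result.
    return VERSIONS.index(version) >= VERSIONS.index("5.2.2")


print(first_good(VERSIONS, works))  # 5.2.2
```

With so few versions a linear scan would do just as well; bisection only pays off when each test is slow, which a restart-and-retry cycle is.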

The changed logic deals with a trailing dot at the end of a domain name. Our /etc/resolv.conf is configured as follows:

$ cat /etc/resolv.conf
nameserver 169.254.20.10
search imdach-dev-dev.svc.kubernetes.local. svc.kubernetes.local. kubernetes.local. gz.cvte.cn
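To see how search entries turn into query names, here is a simplified Python sketch. It only appends each search domain to a relative name and keeps the dots exactly as written; real resolvers also apply rules such as ndots and may try the bare name, which this sketch ignores. The function names are hypothetical.

```python
from typing import List


def parse_search(resolv_conf: str) -> List[str]:
    """Return the search domains from resolv.conf text (last 'search' line wins)."""
    domains: List[str] = []
    for line in resolv_conf.splitlines():
        fields = line.split()
        if fields and fields[0] == "search":
            domains = fields[1:]
    return domains


def candidates(name: str, search: List[str]) -> List[str]:
    """Append each search domain to a relative name, keeping dots as written."""
    return [f"{name}.{domain}" for domain in search]


RESOLV_CONF = """\
nameserver 169.254.20.10
search imdach-dev-dev.svc.kubernetes.local. svc.kubernetes.local. kubernetes.local. gz.cvte.cn
"""

name = "app-537b340e61adc0ec7ccd840bdcd8a59989cb6598-3.imdach-dev-dev"
for qname in candidates(name, parse_search(RESOLV_CONF)):
    print(qname)
```

Any candidate built from a search entry that ends with a dot also ends with a dot, and that is the form of the name the client ends up holding.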

For example, when we query the domain app-537b340e61adc0ec7ccd840bdcd8a59989cb6598-3.imdach-dev-dev, the resolver tries each search domain in turn, as shown below.

As you can see, neither the DNS queries nor the DNS responses carry the trailing dot. For example, when querying

app-537b340e61adc0ec7ccd840bdcd8a59989cb6598-3.imdach-dev-dev.svc.kubernetes.local.

the domain name in both the request and the response is

app-537b340e61adc0ec7ccd840bdcd8a59989cb6598-3.imdach-dev-dev.svc.kubernetes.local

with no final dot.

However, the Lua code matches names by string comparison. The qname has a trailing dot (.), while the name in the DNS response, which does carry the resolved IP address, has no trailing dot, so the two strings do not match and the lookup is treated as failed.
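The mismatch is easy to reproduce outside Lua. The following Python sketch is not the library's code and does not claim to show how version 5.2.2 fixes it; it only contrasts a plain string comparison with a comparison that ignores one trailing dot. Both function names are hypothetical.

```python
def same_name_strict(qname: str, answer_name: str) -> bool:
    """Plain string comparison: a trailing dot makes the names differ."""
    return qname == answer_name


def same_name_normalized(qname: str, answer_name: str) -> bool:
    """Compare after dropping one trailing dot and ignoring case."""
    def norm(name: str) -> str:
        return name[:-1].lower() if name.endswith(".") else name.lower()
    return norm(qname) == norm(answer_name)


# The name as built from the search entry, and the name as returned by DNS.
qname = "app-537b340e61adc0ec7ccd840bdcd8a59989cb6598-3.imdach-dev-dev.svc.kubernetes.local."
answer_name = "app-537b340e61adc0ec7ccd840bdcd8a59989cb6598-3.imdach-dev-dev.svc.kubernetes.local"

print(same_name_strict(qname, answer_name))      # False
print(same_name_normalized(qname, answer_name))  # True
```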

At this point the cause was basically clear. But why did the problem only show up recently? I asked the colleagues who operate our K8S clusters, and they confirmed that the upgrade had introduced the change.

To be 100% sure, I manually edited /etc/resolv.conf to remove the trailing dots from the search domains and rolled APISIX back to the original, problematic version. The problem was again resolved and access was normal.

$ cat /etc/resolv.conf
nameserver 169.254.20.10
search imdach-dev-dev.svc.kubernetes.local svc.kubernetes.local kubernetes.local gz.cvte.cn
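The manual edit amounts to the following transformation, sketched here in Python on the file's text. It is an illustration of the change, not a tool we actually ran, and `strip_search_dots` is a hypothetical name.

```python
def strip_search_dots(resolv_conf: str) -> str:
    """Remove one trailing dot from each domain on 'search' lines."""
    out = []
    for line in resolv_conf.splitlines():
        fields = line.split()
        if fields and fields[0] == "search":
            # Leave a lone "." untouched so no entry becomes empty.
            domains = [d[:-1] if d.endswith(".") and len(d) > 1 else d
                       for d in fields[1:]]
            line = " ".join(["search"] + domains)
        out.append(line)
    return "\n".join(out) + "\n"


before = (
    "nameserver 169.254.20.10\n"
    "search imdach-dev-dev.svc.kubernetes.local. svc.kubernetes.local. "
    "kubernetes.local. gz.cvte.cn\n"
)
print(strip_search_dots(before), end="")
```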

Should a domain name end with a dot?

Strictly speaking, a fully qualified DNS domain name ends with a dot, but people usually omit it in everyday use. The final "." is the root domain. Resolving any domain name essentially starts from the root: for care.seewo.com., for example, resolution in theory starts by asking the root DNS servers where the servers for com. are.
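As a small illustration of "starting from the root", this Python sketch lists the names a resolver would in theory walk through, from the root down to the full name. It is pure string manipulation under the assumption that every label is its own delegation step, which real zones do not have to follow; `delegation_chain` is a hypothetical name.

```python
from typing import List


def delegation_chain(name: str) -> List[str]:
    """List the names from the root down to the fully qualified name."""
    fqdn = name if name.endswith(".") else name + "."
    labels = fqdn[:-1].split(".") if fqdn != "." else []
    chain = ["."]  # the root domain
    for i in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[i:]) + ".")
    return chain


print(delegation_chain("care.seewo.com"))
# ['.', 'com.', 'seewo.com.', 'care.seewo.com.']
```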

There is a special article on this topic for those who are interested in studying it

www.dns-sd.org/trailingdot…

Summary

After the K8S upgrade on the intranet, a trailing dot was added to the search domains in /etc/resolv.conf. As a result, domain name resolution failed in earlier APISIX versions (2.12 or lower). It had nothing to do with the AAAA (IPv6) query returning NXDOMAIN.

Afterword

When analyzing a problem, stay calm and dig carefully for the root cause; do not stop just because something happened to work.

Today I read a sentence that I think is quite good, and I want to share it with you: “Experience is used to deal with special scenes, while methodology is used to deal with general scenes. Without experience, it may be slow, and without methodology, it may be difficult to make progress.”

If the above analysis gives you some insight, that’s great.