Here’s what happened.

That morning, I breezed through the development of a new feature. My plan was to have lunch at noon, take a lovely nap, then run a few tests and slack off for the rest of the afternoon.

Just as I was feeling proud of my perfect plan, a message flashed up in the WeChat group: someone had @-mentioned me.

Fuck!!

At a time like this! Someone @-mentioned me!!

I didn’t like it one bit. Being @-mentioned by name is never a good sign.



I picked up my phone and, sure enough, there had been an incident in production. And it looked urgent.



Let me give you a brief description of what happened.

We had developed a feature that collects hardware metrics from hosts. One of those metrics is the host’s IPv6 address, and it must not be the fe80 address (an fe80 address is a link-local address, loosely comparable to a 192.168 LAN address in IPv4, so it should be skipped). Instead of a proper address, though, the feature was returning a meaningless string of digits.



Here is the timeline. The customer submitted the upgrade order and, wary of problems, upgraded 2,000 machines first, planning to upgrade the remaining 20,000 that evening.

Then, near noon, it turned out that about a hundred of those two thousand machines were reporting the wrong address.

If you find me long-winded, ask my master to translate.



To put it simply, the customer discovered a problem with the program at midday, and it had to be fixed that afternoon because they were going to upgrade the entire production environment that night. We could not hold up their upgrade. This was truly a moment of life and death.

What’s going on?

I was completely dumbfounded!

But what can you do? This is how we earn our bowl of rice. So I packed up quickly, gave up my nap, slung my blue bag over my shoulder, caught the bus, and went straight to the customer’s site.



At that point I had no idea what was wrong. I had never run into this problem before, so I ran through the code’s logic in my head, looking for somewhere to shift the blame. Aha, testing! Testing must have missed it and let a bug go live!

As a result:



I was told very clearly that the tests found no problems.

This was awkward. No problems had been found in our company’s environment, nor in the PIT environment (note: the PIT environment is a set of machines provided by the customer that is as close to production as possible). In other words, the problem seemed to occur only in production.

That’s ridiculous…

What the hell is this?

Nothing for it. I had to go on site.



After the long and complicated entry procedures at the customer’s campus, we finally reached the production machine room. The customer picked out two of the affected machines, said “Go ahead and debug!”, and went back to his own work, leaving me alone.

If you have ever worked on production machines, you know not to do anything the customer has not asked for. Don’t even think about dropping the database and running away on a boat to Singapore, Vietnam, or Cambodia.

That goes double when you have been given root permissions, as I had. The customer was extra cautious and required their approval even to restart a process. Dangerous commands were out of the question.

Enough to make anyone chicken out.



But this time I could not afford to be timid. Whether this old man could keep his reputation hung on this one move.

First, I stressed to the customer that this was a once-in-a-century problem and would certainly not be easy to solve. Even if I found the root cause that afternoon, we would still have to go back, change the code, and test it. It might be too late for that evening’s upgrade.

With that warning shot delivered, the customer understood that the problem was thorny and was mentally prepared. He said: find the cause first, and if it really is hard to fix, we will postpone tonight’s upgrade, as long as it can go live before Friday.

So I could get to work with a free hand.



Let me first explain how our program gets the IPv6 address.

Our program is written in C. To support multiple platforms, we brought in a third-party library called sigar. The sigar library is built for collecting a machine’s hardware metrics and has an implementation for each operating system.

On Linux, the IPv6 address is obtained by reading the file /proc/net/if_inet6. The first field of each entry in this file is the IPv6 address.

In short, getting the IPv6 address comes down to reading a particular string out of a file. It sounds so simple. What could go wrong with this logic?

In the sigar library, this logic lives in the sigar_net_interface_ipv6_config_get function in src/os/linux/linux_sigar.c. Of course we made a few tweaks, and the final result looked like this:

while (fscanf(fp, "%32s %02x %02x %02x %02x %16s\n",
              addr, &idx, &prefix, &scope, &flags, ifname) != EOF) {
    if (strEQ(name, ifname) && addr != strstr(addr, "fe80")) {
        status = SIGAR_OK;
        break;
    }
}
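The condition addr != strstr(addr, "fe80") is easy to misread, so here is a minimal sketch of what it tests. strstr returns a pointer to the first match, which equals addr only when the match sits at offset 0, so the condition holds for any address that does not begin with fe80. The helper names below are hypothetical and are not part of sigar.

```c
#include <string.h>

/* Returns 1 when addr begins with "fe80", 0 otherwise. */
static int starts_with_fe80(const char *addr)
{
    return strncmp(addr, "fe80", 4) == 0;
}

/* The test exactly as it is written in the loop: it is true when "fe80"
 * is absent (strstr returns NULL) or occurs only after offset 0. */
static int original_check_accepts(const char *addr)
{
    return addr != strstr(addr, "fe80");
}
```

For every input, original_check_accepts(addr) should be the logical opposite of starts_with_fe80(addr); the strncmp form just states the intent more directly.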

Now look at the file /proc/net/if_inet6. It lists the interfaces and their unicast addresses. Its format is as follows:

Field      Width                   Meaning
addr       32 hex digits           IPv6 address
if_index   usually 2 hex digits    interface index
prefix     usually 2 hex digits    prefix length
scope      usually 2 hex digits    address scope
flags      usually 2 hex digits    flags
ifname     up to 16 characters     network interface name

Looked at this way, the parsing code above seems fine.

But wait, what the hell is this?



I smelled a crime. I couldn’t say what was wrong yet, but I sensed that something was off here. The second field, if_index: why did so many of them have three digits?

Parsing a three-digit value with %02x is going to cause trouble.

But why do we have three digits?



I didn’t know. Nothing for it but to call on Baidu!

Alas, a Baidu search turned up this:



As we all know, eighty percent of Chinese technical articles have been hijacked by a certain website whose initials are C, S, D, and N. Whatever you search for, the links that come up all point to that site, and the content is highly repetitive and of astonishingly low value.

Since Baidu would not help me, I could only turn to Google.

But!!!!



I’m done for. Now what?

Then I remembered that a childhood friend from my high school class works in overseas business; he started his own company and became the boss. Why not try freeloading off him?



Well… It took a lot of trouble, but somehow I got access.

After a tour of Google, most results merely explained what the fields of the if_inet6 file mean. None of them told me how many digits if_index can have.

Then I found this definition of if_index in the if.h header file:

struct if_nameindex
  {
    unsigned int if_index;  /* 1, 2, ... */
    char *if_name;      /* null terminated name: "eth0", ... */
  };

It is an unsigned int, so in theory it can run to 8 hex digits. Yikes ~
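As a quick sanity check on that digit count, here is a minimal sketch that asks snprintf how many characters %x would need for a given value. The function names are hypothetical, and the figure of 8 digits assumes a platform where unsigned int is 32 bits wide.

```c
#include <limits.h>
#include <stdio.h>

/* Number of hex digits needed to print v without padding.
 * snprintf with a NULL buffer and size 0 only reports the length. */
static int hex_digits(unsigned int v)
{
    return snprintf(NULL, 0, "%x", v);
}

/* Widest value the if_index field could ever hold on this platform. */
static int max_hex_digits(void)
{
    return hex_digits(UINT_MAX);
}
```

hex_digits(0x32) is 2 and hex_digits(0x321) is 3, so a width of two is already too small for the second value, and max_hex_digits() gives the upper bound a parser has to allow for.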

Well, never mind. Just to verify my guess that a three-digit if_index was the cause, I asked my testing colleagues to find a machine in the PIT environment whose if_inet6 file contained a three-digit if_index, and the test soon bore it out.

Then came the fun part: changing the code. If we hard-coded another width, we would hit the same problem again the next time the length changed.

while (fscanf(fp, "%s %s %s %s %s %s\n",
              addr, idx, prefix, scope, flags, ifname) != EOF) {
    if (strEQ(name, ifname) && addr != strstr(addr, "fe80")) {
        status = SIGAR_OK;
        break;
    }
}
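For comparison, here is a minimal sketch of a stricter variant. It is not the patch that shipped, only an illustration under my own assumptions: the numeric fields stay numeric but carry no width limit, the two string fields are bounded so they cannot overflow their buffers, and the conversion count is checked instead of only comparing against EOF. The struct and function names are hypothetical.

```c
#include <stdio.h>

/* One parsed record of the if_inet6 file (names are illustrative). */
struct inet6_entry {
    char addr[33];            /* 32 hex digits plus terminator */
    unsigned int if_index;
    unsigned int prefix_len;
    unsigned int scope;
    unsigned int flags;
    char ifname[17];          /* up to 16 characters plus terminator */
};

/* Parses one line. Returns 1 when all six fields were converted,
 * 0 for a short or malformed line. */
static int parse_inet6_line(const char *line, struct inet6_entry *e)
{
    int n = sscanf(line, "%32s %x %x %x %x %16s",
                   e->addr, &e->if_index, &e->prefix_len,
                   &e->scope, &e->flags, e->ifname);
    return n == 6;
}
```

Feeding it one line at a time, for example from fgets, also means a single odd record cannot shift the fields of the records that follow it.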

In fact, when fscanf parses a field with a fixed width and the value has three digits but the format consumes only two, the third digit is not discarded; it spills over into the next field.

Let me give you an example.



Take the second record in the figure above, whose if_index is 321. With the original code, 32 is taken as if_index, the leftover 1 as prefix, 40 as scope, 20 as flags, and 80 as ifname. veth1ffa1a8 should be the interface name, but it gets treated as the next IPv6 address, so every field after it is misaligned.
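The same spill-over can be reproduced on a single made-up record with sscanf. This is a minimal sketch: the struct and function names are hypothetical, the sample address is invented, and the width is spelled %2x here where the original format spells it %02x.

```c
#include <stdio.h>

/* The fields as the fixed-width format sees them. */
struct fixed_parse {
    char addr[33];
    unsigned int idx;
    unsigned int prefix;
    unsigned int scope;
    unsigned int flags;
    char ifname[17];
};

/* Parses one record with two-digit widths on the numeric fields.
 * Returns the number of fields sscanf converted. */
static int parse_fixed_width(const char *line, struct fixed_parse *out)
{
    return sscanf(line, "%32s %2x %2x %2x %2x %16s",
                  out->addr, &out->idx, &out->prefix,
                  &out->scope, &out->flags, out->ifname);
}
```

Given a 32-digit address followed by " 321 40 20 80 veth1ffa1a8", the call still reports six converted fields, but idx is 0x32, prefix is 0x1, scope is 0x40, flags is 0x20, and ifname is "80". The real interface name is left unread, which is why fscanf picks it up as the address of the next record.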

Well… The next step was to fix the code, reissue the patch, and have the customer update and test it.

Although the process was tortuous, the result was good.



And so the story comes to an end for now. It’s just a pity about my nap.