Author | Tao Jianhui

Originally published:

Tracking Down an Elusive Crash

When writing C programs, we often run into crashes. Most are caused by null pointers or wild pointers, and the call stack usually makes the problem easy to find. One kind of crash is much harder to debug: a memory overrun. The overrun happens to overwrite data (such as a structure) used by another thread, so when that thread reads the data it gets invalid values, which often leads to unpredictable errors or even a crash. Because the thread that caused the overrun has long since left the scene, the problem is hard to locate. This is an internal blog post I wrote in May 2020. It uses a mistake I made at the time as an example of how to track down this kind of problem, for your reference.

Specific problems

On the feature/query branch of the community repository, running the following script causes a crash.

./test.sh -f general/parser/col_arithmetic_operation.sim

Reproducing the problem

I logged in to the designated machine and confirmed that there was a core dump. The call stack screenshot is as follows:

Step 1: Where does it crash? At line 250 of shash.c. Use the GDB command “f 1” to select frame 1 and inspect *pObj: hashFp is NULL, which is what causes the crash. But why is it NULL? What about the other fields? dataSize is 0, which must also be wrong. So we can conclude that this structure is corrupt. The next step is to check whether the caller one level up passed in the correct arguments.

Step 2: Use the GDB command “f 2” to select frame 2, at rpcmain.c:605, and inspect *pRpc. The fields in this structure look normal, including the pointer to the hash table. So we can assume the caller is fine and that it passed the correct arguments to taosGetStrHashData.

Step 3: Since the arguments are correct, look at shash.c. Within that module, the only way this could happen is if the SHashObj structure had already been freed, in which case any access to it would naturally be invalid. The only function that could do that is taosCleanUpStrHash, so I immediately added a log statement to it. (Mind TDengine's log output control: asyncLog needs to be set to 0 in taos.cfg so that logs are written synchronously. Otherwise the log may not be written out when the crash happens.) I reran the script and checked the log: taosCleanUpStrHash had not been called. That leaves only one possibility: this chunk of memory was corrupted by another thread.

Step 4: Fortunately, we have a great runtime memory checking tool, Valgrind, which we can run to look for clues:

valgrind --leak-check=yes --track-origins=yes taosd -c test/dnode1/cfg

The screenshot is as follows:

Step 5: Valgrind reports an invalid write (memcpy) at rpcmain.c:585. From a coding standpoint this should not be a problem, because it copies a fixed-size structure, SRpcConn, and this code runs every time execution reaches this point. So the only explanation is that pConn points to an invalid region of memory. But how can pConn be invalid? Let’s take a look at the code:

See line 584: pConn = pRpc->connList + sid. This sid is allocated by taosAllocateId. If sid falls outside the range covered by pRpc->sessions, pConn definitely points to an invalid region. So how do we confirm that?
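One way to make an out-of-range sid visible immediately, instead of through memory corruption later, is to check it before computing the pointer. The following is a minimal sketch of that idea, not the code in rpcmain.c: DemoConn, DemoRpc, and demoGetConn are hypothetical names, and the sketch assumes connList holds exactly `sessions` elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real structures. */
typedef struct {
  int sid;
} DemoConn;

typedef struct {
  int       sessions;  /* number of elements in connList (assumption) */
  DemoConn *connList;  /* valid indexes are 0 .. sessions-1 */
} DemoRpc;

/* Returns the slot for sid, or NULL when sid would point outside connList. */
DemoConn *demoGetConn(DemoRpc *pRpc, int sid) {
  assert(pRpc != NULL && pRpc->connList != NULL);
  if (sid < 0 || sid >= pRpc->sessions) return NULL;
  return pRpc->connList + sid;
}
```

With 100 slots, demoGetConn returns a valid pointer for sid 99 and NULL for sid 100, so the caller can log the bad ID at the point where it first appears.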

Step 6: Add a log statement at line 578 that prints the allocated ID, recompile, and rerun the test script.

Step 7: It crashes again. The log shows sid values printed up to 99 (the maximum is 100) with everything working normally, and then the crash. So we can conclude that the allocated ID went beyond what pRpc->sessions allows.

Step 8: Look at tidpool.c to find out why. It allocates IDs from 1 to max, while the RPC module can only use 1 to max-1. So when the returned ID equals max, the RPC module naturally performs an invalid write.
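The mismatch can be reproduced with a toy model. The sketch below is not tidpool.c. DemoIdPool and its functions are hypothetical, and the only behavior taken from the text is that the pool hands out IDs 1 to max while the consumer indexes an array whose valid indexes are 0 to sessions-1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy pool: hands out 1..maxId once each, then returns -1. */
typedef struct {
  int maxId;
  int next;
} DemoIdPool;

void demoInitIdPool(DemoIdPool *pool, int maxId) {
  assert(pool != NULL && maxId >= 0);
  pool->maxId = maxId;
  pool->next = 1;
}

int demoAllocateId(DemoIdPool *pool) {
  if (pool->next > pool->maxId) return -1;
  return pool->next++;
}

/* Drains a pool of size poolMax and counts the IDs that would index
 * past an array of `sessions` slots (valid indexes 0..sessions-1). */
int demoCountOutOfRange(int poolMax, int sessions) {
  DemoIdPool pool;
  int id;
  int bad = 0;
  demoInitIdPool(&pool, poolMax);
  while ((id = demoAllocateId(&pool)) != -1) {
    if (id >= sessions) bad++;
  }
  return bad;
}
```

In this model, a pool created with max 100 against 100 slots yields exactly one out-of-range ID (the value 100), while a pool created with max 99 yields none.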

The solution

Now that we know the cause, there are two ways to fix it:

1. In taosInitIdPool in tidpool.c, reduce maxId by one so that the allocated IDs range only from 1 to max-1.

2. In the rpcOpen() function in rpcmain.c, change

pRpc->idPool = taosInitIdPool(pRpc->sessions);

to

pRpc->idPool = taosInitIdPool(pRpc->sessions-1);

With this change, if the application asks for a maximum of 100 sessions, the RPC module can create at most 99. To keep the requested number available, also change

pRpc->sessions = pInit->sessions;

to

pRpc->sessions = pInit->sessions + 1;
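The arithmetic behind the second fix can be written out as a small model. This is only an illustration under the assumptions stated in the text (the array has `sessions` slots and the pool is created with sessions-1). DemoSizing and demoSizing are hypothetical names, not TDengine code.

```c
#include <assert.h>

/* Hypothetical model of the sizing choices in the second fix. */
typedef struct {
  int sessions;  /* length of the connection array */
  int poolMax;   /* largest ID the pool can hand out */
} DemoSizing;

/* requested: sessions the application asks for.
 * reserveExtraSlot: nonzero to size the array as requested + 1. */
DemoSizing demoSizing(int requested, int reserveExtraSlot) {
  DemoSizing s;
  assert(requested >= 1);
  s.sessions = reserveExtraSlot ? requested + 1 : requested;
  s.poolMax = s.sessions - 1;  /* highest ID equals highest valid index */
  return s;
}
```

For a request of 100 sessions, the model gives 99 usable IDs without the extra slot and 100 usable IDs with it, which is why the second fix changes both lines.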

Validation

With either fix, after recompiling and rerunning the test script, the crash no longer occurs.

Lessons learned

When memory has been corrupted, always run Valgrind once and check for invalid writes. Because it is a dynamic checker, the errors it reports should be real. Resolve the invalid write first, and only then look into the crash.

How to avoid similar problems

At the heart of this bug is the fact that tidpool.c allocates IDs in the range 1 to max, while the RPC module assumes IDs range from 1 to max-1. In other words, the convention between the two modules was broken.

How can this be avoided? Each module should clearly specify its external API, everyone should be notified when an API changes, and the complete set of test cases should be run so that no convention gets broken.
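A convention like this can also be pinned down in a test that drains an allocator and checks every ID against the documented range. The sketch below shows one possible shape for such a check. DemoAllocFn, DemoCounter, demoCounterAlloc, and demoAllocatorHonorsRange are all hypothetical names, and the sample allocator merely stands in for a real ID pool.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical allocator interface: returns an ID, or -1 when exhausted. */
typedef int (*DemoAllocFn)(void *ctx);

/* Sample allocator used for illustration: yields next..last, then -1. */
typedef struct {
  int next;
  int last;
} DemoCounter;

int demoCounterAlloc(void *ctx) {
  DemoCounter *c = (DemoCounter *)ctx;
  if (c->next > c->last) return -1;
  return c->next++;
}

/* Returns 1 if every ID handed out lies in [minId, maxId], 0 otherwise.
 * Stops when the allocator is exhausted or after `limit` allocations. */
int demoAllocatorHonorsRange(DemoAllocFn alloc, void *ctx,
                             int minId, int maxId, int limit) {
  int i;
  assert(alloc != NULL);
  for (i = 0; i < limit; i++) {
    int id = alloc(ctx);
    if (id == -1) return 1;                 /* exhausted, no violation */
    if (id < minId || id > maxId) return 0; /* contract broken */
  }
  return 1;
}
```

A consumer that can only use IDs 1 to 99 would run this check with that range. An allocator that hands out 1 to 100 then fails the check in the test run instead of corrupting memory in production.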

Git commit ID: 89d9d62