Author | Tao Jianhui

Originally published on:

Tracking Down a Crash That Makes You Go Bald

When we write C programs, we often run into crashes, and most of them are caused by null pointers or wild pointers. With the call stack in hand, the problem is generally easy to find. One type of crash, however, is hard to debug: memory corruption caused by an out-of-bounds write. The stray write happens to overwrite data used by another thread (a structure, for example), so when that thread later reads the data it gets garbage, which often leads to unpredictable errors or even a crash. Because the thread that did the overwriting has long since left the scene, the problem can be very hard to locate. This is an internal blog post I wrote in May 2020. I used one of my own mistakes from that time as an example to share how to track down this kind of problem, for your reference.
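To make the failure mode concrete, here is a minimal single-threaded simulation. All names (Arena, Victim, slot_clear) are hypothetical and have nothing to do with TDengine's code. Everything lives inside one flat byte buffer, so the stray write is well-defined C here; in a real program the same mistake would be undefined behavior on the heap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SLOT_SIZE = 16, SLOT_COUNT = 4 };

/* Hypothetical record that "another thread" would read later. */
typedef struct {
  int dataSize;
  int magic;
} Victim;

/* One flat buffer: SLOT_COUNT slots, then one more slot-sized region
   whose first bytes hold the Victim. */
typedef struct {
  unsigned char bytes[(SLOT_COUNT + 1) * SLOT_SIZE];
} Arena;

void arena_init(Arena *a) {
  Victim v = { 128, 0x5AFE };
  assert(sizeof(Victim) <= SLOT_SIZE);
  memset(a->bytes, 0, sizeof a->bytes);
  memcpy(a->bytes + SLOT_COUNT * SLOT_SIZE, &v, sizeof v);
}

Victim arena_victim(const Arena *a) {
  Victim v;
  memcpy(&v, a->bytes + SLOT_COUNT * SLOT_SIZE, sizeof v);
  return v;
}

/* Zero-fills slot idx. Valid indexes are 0..SLOT_COUNT-1, but nothing
   here checks that, which is exactly the bug being simulated. */
void slot_clear(Arena *a, int idx) {
  memset(a->bytes + (size_t)idx * SLOT_SIZE, 0, SLOT_SIZE);
}
```

Calling slot_clear(&a, SLOT_COUNT - 1) leaves the Victim intact, while slot_clear(&a, SLOT_COUNT) silently zeroes it. The writer returns normally, and the damage only shows up when someone reads the Victim later.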

The specific problem

On the feature/query branch of the community repository, running the following script causes a crash.

./ -f general/parser/col_arithmetic_operation.sim

Reproducing the problem

I logged in to the specified machine, looked at the core dump, and the crash had indeed happened. A screenshot of the call stack is as follows:

Step 1: Look at where the crash happened. Printing *pObj in stack frame 1 shows a field set to NULL, which is what causes the crash. But why is it NULL? And what about the other fields? dataSize is 0, which must be wrong as well. Therefore, it can be concluded that this structure is corrupted. We need to check whether the caller passed in the right arguments.

Step 2: Go to rpcmain.c line 605 and look at *pRpc. The fields of this structure all appear normal, including the pointer to the hash. So we can conclude that the call itself is fine and that taosGetStrHashData was given the correct arguments.

Step 3: Could the SHashObj structure have already been freed, so that accessing it naturally returns invalid data? Freeing it requires a call to taosCleanUpStrHash, so I immediately added a line to that function to print a log message (note TDengine's log output control: the asyncLog parameter in the system configuration file taos.cfg should be set to 0, otherwise the log may not be written out when the crash happens). Rerunning the script and checking the log shows that taosCleanUpStrHash was never called. That leaves only one possibility: this piece of memory was corrupted by another thread.

Step 4: Fortunately, we have a great runtime memory checker, valgrind, which we can run to look for clues. Start the server under valgrind:

valgrind --leak-check=yes --track-origins=yes taosd -c test/dnode1/cfg

Rerun the script, and valgrind reports an invalid write. The screenshot is as follows:

Step 5: Valgrind points to an invalid write at rpcmain.c:585, which is a memcpy. From a coding point of view this should not be a problem, because it copies a fixed-size structure, SRpcConn, and this line is executed every time the code gets there. The only possibility is that pConn points to an invalid memory region. How can pConn be invalid? Let's look at the code:

Look at line 584: pConn = pRpc->connList + sid. This sid is assigned by taosAllocateId. If sid goes beyond pRpc->sessions, then pConn definitely points to an invalid region. How can we know for sure?
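The pointer arithmetic on line 584 is only valid when the index lies inside the array. A defensive version of that lookup could look like the sketch below. DemoConn, DemoRpc and demoConnAt are hypothetical, heavily simplified stand-ins; only the field names sessions and connList come from the text, and the sketch assumes connList holds exactly sessions elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the real TDengine structures. */
typedef struct {
  int sid;
} DemoConn;

typedef struct {
  int       sessions;  /* assumed: number of elements in connList */
  DemoConn *connList;
} DemoRpc;

/* Returns the slot for sid, or NULL when sid is outside 0..sessions-1,
   instead of blindly computing connList + sid. */
DemoConn *demoConnAt(DemoRpc *pRpc, int sid) {
  assert(pRpc != NULL && pRpc->connList != NULL);
  if (sid < 0 || sid >= pRpc->sessions) return NULL;
  return pRpc->connList + sid;
}
```

With sessions set to 100, demoConnAt(&rpc, 99) returns the last slot and demoConnAt(&rpc, 100) returns NULL, so a bad ID is caught at the point of use rather than showing up later as corrupted memory.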

Step 6: Add a log statement at line 578 to print the assigned ID, recompile, and rerun the test script.

Step 7: It crashes again. The log shows sid values being printed all the way up to 99 (max is 100) with everything still normal, and then the crash occurs. Therefore, it can be asserted that the crash happens because the assigned ID goes beyond the range allowed by pRpc->sessions.

Looking at tidpool.c, the IDs it assigns range from 1 to max, while the RPC module can only use 1 to max-1. When the ID pool returns max, the RPC module naturally performs an invalid write.
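The mismatch can be reproduced in a few lines. The pool below is a hypothetical toy, not TDengine's tidpool.c; it only shares the contract of handing out IDs from 1 to maxId. The helper counts how many of those IDs would fall outside an array of a given length.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical ID pool: hands out IDs in 1..maxId. */
typedef struct {
  int   maxId;
  char *used;  /* used[id] != 0 once id is handed out; index 0 is unused */
} DemoIdPool;

DemoIdPool *demoInitIdPool(int maxId) {
  assert(maxId > 0);
  DemoIdPool *p = malloc(sizeof *p);
  if (p == NULL) return NULL;
  p->used = calloc((size_t)maxId + 1, 1);
  if (p->used == NULL) { free(p); return NULL; }
  p->maxId = maxId;
  return p;
}

/* Returns the lowest free ID in 1..maxId, or -1 when exhausted. */
int demoAllocateId(DemoIdPool *p) {
  for (int id = 1; id <= p->maxId; ++id) {
    if (!p->used[id]) { p->used[id] = 1; return id; }
  }
  return -1;
}

void demoFreeIdPool(DemoIdPool *p) {
  free(p->used);
  free(p);
}

/* Drains a pool of size poolMax and counts the IDs that are not valid
   indexes into an array of arrayLen elements (valid: 0..arrayLen-1). */
int demoCountOutOfRange(int poolMax, int arrayLen) {
  DemoIdPool *p = demoInitIdPool(poolMax);
  if (p == NULL) return -1;
  int bad = 0;
  for (int id; (id = demoAllocateId(p)) != -1; ) {
    if (id >= arrayLen) ++bad;
  }
  demoFreeIdPool(p);
  return bad;
}
```

demoCountOutOfRange(100, 100) finds exactly one offending ID, the value 100 itself, which matches the log in step 7: everything is fine up to 99 and the very next allocation lands past the end of the array.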

The solution

Now that the cause is known, the fix is easy. There are two ways to do it:

  1. In the taosInitIdPool function of tidpool.c, decrement the maximum ID by one so that the assigned IDs only range from 1 to max-1.
  2. In the rpcOpen() function of rpcmain.c, change

pRpc->idPool = taosInitIdPool(pRpc->sessions);

to

pRpc->idPool = taosInitIdPool(pRpc->sessions-1);

With this change, if the application asks for a maximum of 100 sessions, RPC will create at most 99. To guarantee the full 100, also change

pRpc->sessions = pInit->sessions;

to

pRpc->sessions = pInit->sessions + 1;


Either way, recompile, run the test script and the crash doesn’t happen again.
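The arithmetic behind the two fixes can be checked with a small sketch. DemoSizing and the helper functions are hypothetical names; the sketch assumes the connection array has exactly sessions elements and that the pool hands out IDs from 1 to its maximum.

```c
#include <assert.h>

typedef struct {
  int sessions;  /* array length, valid indexes 0..sessions-1 */
  int poolMax;   /* largest ID the pool may hand out */
} DemoSizing;

/* Before the fix: the pool can hand out an ID equal to sessions. */
DemoSizing demoSizingOriginal(int requested) {
  DemoSizing s = { requested, requested };
  return s;
}

/* Shrinking only the pool: safe, but one session short. */
DemoSizing demoSizingShrinkPool(int requested) {
  DemoSizing s = { requested, requested - 1 };
  return s;
}

/* Growing the array by one and sizing the pool as sessions-1. */
DemoSizing demoSizingGrowArray(int requested) {
  DemoSizing s;
  s.sessions = requested + 1;
  s.poolMax  = s.sessions - 1;
  return s;
}

/* Safe means the largest ID is still a valid array index. */
int demoIsSafe(DemoSizing s) {
  return s.poolMax < s.sessions;
}

/* Number of sessions that can actually be created (IDs 1..poolMax). */
int demoUsableSessions(DemoSizing s) {
  assert(demoIsSafe(s));
  return s.poolMax;
}
```

For a request of 100 sessions, the original sizing is unsafe, shrinking only the pool yields 99 usable sessions, and growing the array by one restores the full 100.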


Whenever memory appears to have been corrupted, be sure to run valgrind and check whether it reports an invalid write. Because it is a dynamic checker, the errors it reports are real. Only after every invalid write has been eliminated can this kind of crash be ruled out.

How to avoid such problems

At the heart of this bug is the fact that tidpool.c assigns IDs in the range 1 to max, whereas the RPC module assumes IDs in the range 1 to max-1. So the real problem lies in the contract between the two modules.

How to avoid it? Each module should clearly document its external API, announce any change to that API, and run the full set of test cases so that no contract is broken.
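One way to turn such a contract into something a test suite can enforce is a small range check that drains an allocator and verifies every ID it returns. The sketch below is hypothetical: DemoAllocFn, demoHonorsRange and the toy counter allocator are illustrative names, and it assumes an allocator that returns -1 once it is exhausted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical allocator signature: next ID, or -1 when exhausted. */
typedef int (*DemoAllocFn)(void *state);

/* Drains the allocator; returns 1 if every ID lies in [lo, hi], else 0. */
int demoHonorsRange(DemoAllocFn alloc, void *state, int lo, int hi) {
  assert(alloc != NULL && lo <= hi);
  int ok = 1;
  for (int id; (id = alloc(state)) != -1; ) {
    if (id < lo || id > hi) ok = 0;
  }
  return ok;
}

/* Toy allocator for the demo: hands out next..limit in order. */
typedef struct {
  int next;
  int limit;
} DemoCounter;

int demoCounterAlloc(void *state) {
  DemoCounter *c = state;
  return (c->next <= c->limit) ? c->next++ : -1;
}
```

A consumer that can only use IDs 1 to 99 would run demoHonorsRange(alloc, state, 1, 99) against whatever allocator it is given. An allocator that hands out 1 to 100 fails the check in a unit test instead of corrupting memory at runtime.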

Git Commit ID: 89d9d62. This is a real example of TDengine.