Author: Tao Jianhui
This is an internal blog post I wrote in May 2020, hoping to help our R&D and technical support colleagues quickly locate bugs and solve problems for users. I iterated on it again in December 2020 and ran some internal training based on it. Following the questions in our WeChat groups, I have noticed that many users find it difficult to analyze the logs of TDengine, or of any other distributed system. In this post I present a method for analyzing TDengine's logs. Once you master it, analyzing the logs of a distributed system becomes much easier.
TDengine is a cluster system, and any operation involves APP, TAOSC, MNode, VNode and other logical nodes, which communicate with each other over sockets. Moreover, a test may run multiple instances of TDengine, which complicates analysis further. For a given operation, the key to analyzing a problem is matching and correlating the logs of the different logical nodes and filtering them efficiently.
This article summarizes a set of methods that let you find the bug quickly.
Turn on the associated log switch
Each module of TDengine has its own debugFlag, including TAOSC, DNode, VNode, MNode, TSDB, WAL, Sync, Query, RPC, Timer and so on. The log output of each module can currently be set to:
- FATAL/ERROR: errors are written to the log
- WARN: warnings are written to the log
- INFO: important information
- DEBUG: general debug messages
- TRACE: very detailed, frequently repeated debug information
- DUMP: raw data
The destination of the log output can also be controlled:
- the log file
- the screen
All of the above options are controlled by individual bits of the single-byte debugFlag value. The bit layout of this byte is as follows:
Therefore, if you want ERROR, WARN, INFO, and DEBUG output written to the log file, debugFlag should be set to 135; if you also want TRACE-level logs, set it to 143; if only ERROR and WARN should be printed, set it to 131 (so 135 = 131 + 4, which adds INFO and DEBUG, and 143 = 135 + 8, which adds TRACE). Normally, 135 is the recommended setting.
The debugFlag of each module is controlled in the configuration file taos.cfg. The configuration parameter for each module, and the module name shown in the log, are listed in the table below:
If there are too many parameters to set, the simplest way is to set the global debugFlag parameter. Once it is set, the debug flag of every module is set to the same value, except for the timer (TMR) log. The default value of debugFlag is 0; when it is non-zero, it overrides all per-module logging parameters. Setting the UTIL module's flag is not recommended except in special cases. For the timer module, whose log volume is large, 131 is more appropriate than 135.
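As a sketch, the two settings just discussed would look like this in taos.cfg (the values come from this section; check the exact parameter names against your TDengine version):

```
# taos.cfg — global switch: ERROR/WARN/INFO/DEBUG to the log file for all modules
debugFlag     135

# the timer module logs heavily, so keep it lower
tmrDebugFlag  131
```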
The log file
The default log directory is /var/log/taos, but it can be changed via the logDir configuration parameter in taos.cfg.
- The client log file is named taoslog*.x (a machine can produce multiple client log files because multiple clients may run on it)
- The server-side log file is taosdlog.x
The size of each log file is bounded: after a file reaches a certain number of lines (the numOfLogLines configuration parameter in taos.cfg), a new log file is created. TDengine keeps only two log files, whose names alternate between the suffixes 0 and 1.
Each log line consists of four fields, from left to right:
- Timestamp, accurate to the microsecond
- Thread ID; because the system is multi-threaded, this field matters: only log lines from the same thread are guaranteed to be in time order and to follow the designed flow
- Module name, three letters
- The module's log message
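To make the four fields concrete, here is a hypothetical client log line (the timestamp, thread ID, pSql value, and SQL statement are all made up for illustration):

```
08/10 14:35:22.314685 0x7f3a2cff9700 TSC pSql:0x5572c4fab3a0, SQL: create table t1 (ts timestamp, v int)
```

From left to right: timestamp (to the microsecond), thread ID, module name (TSC), and the module's message.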
There are several steps to analyzing logs
When a tester or customer reports a bug, whether found manually or automatically, it is triggered by performing a specific action, typically by executing a SQL statement. The problem may be caused by either the client or the server code. The following walks through log analysis using a create-table example; the timestamps have been removed from the screenshots to make them easier to read.
Look at the client first
The first thing you need to look at is the client log. The sample screenshot is as follows:
- Find the offending SQL statement: search for "SQL:" in the client log (line 2 of the screenshot). A statement that completes successfully is followed by "SQL result:success". To find a failed SQL statement quickly, you need to know the approximate time range and roughly what the statement is, so the search will be fast.
- In the TAOSC log, one of the most important values is pSql, the address of the internal SQL Obj. TAOSC places this address at the front of every TAOSC log line, immediately after TSC. This value is critical: it is the key to correlating client and server logs. In the screenshot above, it is highlighted with a green background.
- The pSql value is passed to the RPC module as ahandle, and RPC prints it in all of its messages (green background). Since many modules call the RPC module, RPC also prints who made the call; for example, in the screenshot, calls from TSC are printed as RPC TSC.
- RPC sends the create-table message to the server, and the RPC log (line 8 of the screenshot) tells you which dnode's End Point it was sent to. In the screenshot, the hostname is 9BE7010A917E and the port is 6030. If there is a problem, we need to check the server log on the machine that End Point refers to.
- You can then see that the RPC module received a response from the server; to avoid conversion costs, the log shows only the hexadecimal IP address (line 9, 0x20012ac) and the port number. The RPC module's logging is critical because it links the various logical nodes together.
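If you want to decode such a hexadecimal address by hand, here is a sketch. It assumes the address is an IPv4 address in network byte order printed as a little-endian integer, so the lowest byte is the first octet; 0x20012ac is the value from the screenshot:

```shell
# Decode the hex IP printed by the RPC log into dotted-quad form.
# Assumption: lowest byte = first octet (little-endian reading of a
# network-byte-order IPv4 address).
ip=$(( 0x20012ac ))
printf '%d.%d.%d.%d\n' \
  $(( ip & 0xff )) \
  $(( (ip >> 8) & 0xff )) \
  $(( (ip >> 16) & 0xff )) \
  $(( (ip >> 24) & 0xff ))
# → 172.18.0.2
```

Under that byte-order assumption, 0x20012ac corresponds to 172.18.0.2, which fits the container-style hostname 9BE7010A917E seen earlier.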
Now let’s look at the server
After the client log has been analyzed, the server log is what matters next. Staying with the create-table example, please see the screenshot:
- Take the pSql value 0x5572c4fab3a0 found in the client log and search for it directly in taosdlog; you will find the lines with the green background in the screenshot. pSql is thus the key parameter for stringing client and server logs together.
- The create-table operation is handled by an MNode. In the screenshot, because this is the first table being created, a vnode has to be created first, and only then the table itself; many modules are involved, so I will not walk through every step.
- Finally, the MNode creates the table and sends the response back through the RPC module (line 52 and the last line of the screenshot), so it is clear that the server worked normally.
Note: when the DNode module receives a message, it dispatches it to the MNode or VNode message queue according to the message type; a worker thread then consumes and processes the messages from the queue. For a VNode, the submodules TSDB, WAL, Sync, and CQ are all executed by a single thread, so after finding the pSql line (the second line of the screenshot), follow the same thread ID downward and you can see the whole flow, which makes analysis easy.
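That "follow the thread ID" step can be sketched in the shell. The mock taosdlog below is entirely made up (lines, pSql value, and thread IDs are illustrative; the real file lives in /var/log/taos), but the mechanics are the same:

```shell
# Build a tiny mock server log (field 3 of each line is the thread ID)
cat > /tmp/taosdlog.0 <<'EOF'
08/10 14:35:22.314685 0x7f3a2cff9700 DND create-table msg put into vnode queue
08/10 14:35:22.315002 0x7f3a2cffa800 VND pSql:0x5572c4fab3a0 processing create table
08/10 14:35:22.315100 0x7f3a2cffa800 TDB table meta written
08/10 14:35:22.315250 0x7f3a2cffa800 WAL entry appended
08/10 14:35:22.316000 0x7f3a2cff9700 RPC response sent back to client
EOF

# Locate the pSql line and pull out its thread ID (third whitespace field)...
TID=$(grep '0x5572c4fab3a0' /tmp/taosdlog.0 | head -n1 | awk '{print $3}')

# ...then follow everything that worker thread did
grep "$TID" /tmp/taosdlog.0
```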
A few key points
- Find the failed SQL statement first
- Find the value of pSql and copy it; say it is XXXXX
- Run grep "XXXXX" taoslog*.x to pull out the client log lines for this SQL and see whether the problem is visible there
- Open the server log taosdlog.x, search for the same pSql value XXXXX, and check the timestamps to see where the operation failed
- Then analyze the server-side log in detail
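The steps above can be sketched as shell commands. The mock client log and pSql value below are made up for illustration; in a real deployment you would grep /var/log/taos/taoslog*.x and taosdlog.x:

```shell
# Mock client log; the pSql value 0x5572c4fab3a0 and all lines are illustrative
mkdir -p /tmp/taoslogs
cat > /tmp/taoslogs/taoslog0.0 <<'EOF'
08/10 14:35:22.314685 0x7f3a2cff9700 TSC pSql:0x5572c4fab3a0, SQL: create table t1 (ts timestamp, v int)
08/10 14:35:22.401200 0x7f3a2cff9700 RPC TSC msg sent to 9BE7010A917E:6030, ahandle:0x5572c4fab3a0
08/10 14:35:22.455800 0x7f3a2cff9700 TSC pSql:0x5572c4fab3a0, SQL result:success
EOF

# Step 1: find the SQL statement
grep "SQL:" /tmp/taoslogs/taoslog0.0

# Steps 2-3: copy the pSql value and follow it through the client log
PSQL=0x5572c4fab3a0
grep "$PSQL" /tmp/taoslogs/taoslog0.0

# Step 4: the same grep against the server log (taosdlog.x) links the two sides
```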
The RPC module's messages are critical. Note that for each RPC message, the result of protocol parsing is printed as Parse Code: xx, where 0 means the protocol was parsed without problems and any other value means parsing failed. At the same time, the message itself carries a code: 0xXX field, an error code set by the sender; it generally travels from server to client, and is 0 on success, otherwise it reports the error.
Another log matching method
When a client sends a message through the RPC module, the log contains a message carrying the RPC source ID, dest ID, and transaction ID, which together uniquely identify a link from a given client. The transaction ID is incremented by one for each new message sent, so it is unique for a period of time (it is two bytes long and wraps around).
Version 1.6 could only rely on matching the sig string between client and server logs, but that requires reading a lot of context, which is cumbersome and inefficient.
Version 2.0 passes pSql to the server side, so client-server log matching is much faster.
Be familiar with logging methods
- The first step is to understand the design of TDengine and the flow of each major operation.
- Turn on all log switches (set debugFlag to 135), execute every kind of SQL statement once, and check the client and server logs against each statement.
View the SQL statements executed by the client
The client produces a large volume of logs; looking at which SQL statements were actually executed makes it much easier to analyze and reproduce a problem. There are several ways to find out which SQL statements the system actually executed:
- If the client log is open, execute: grep “SQL:” taoslog* and you will see all the SQL statements executed in the log.
- If you execute SQL statements manually through the taos shell, look at the hidden .taos_history file in your home directory, which contains all the commands the shell has executed.
- On the client, set the tscEnableRecordSql parameter to 1 in the configuration file; the SQL statements entered by the client are then printed to a separate file (tscnote-xxxx.0, where xxxx is the PID) in the same directory as the client log.
- For the RESTful interface, set the httpEnableRecordSql parameter to 1 in the taosd configuration file; HTTP requests are then printed to a separate file (httpnote.0) in the same directory as the server log.
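The last two items as a config sketch (parameter names as I recall them from TDengine 2.x; verify the exact spelling against your taos.cfg template):

```
# client-side taos.cfg: write every SQL statement the client runs to tscnote-<pid>.0
tscEnableRecordSql   1

# server-side taos.cfg (taosd): write RESTful HTTP requests to httpnote.0
httpEnableRecordSql  1
```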
Dynamically changing the log settings
Sometimes the server or client cannot be restarted but the log settings are wrong; in that case they have to be changed dynamically. Proceed as follows:
show dnodes;
alter dnode <id> debugFlag 143; // set the appropriate debugFlag
The dnode id used in the second statement is taken from the output of the first.
Sometimes you need to redirect subsequent log output to a new file to make the log easier to view and search; in that case execute:
alter dnode id resetlog;
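Putting the dynamic steps together in the taos shell (the dnode id 1 below is illustrative; take the real id from the show dnodes output):

```sql
show dnodes;
-- suppose the output lists a dnode with id 1
alter dnode 1 debugFlag 143;
alter dnode 1 resetlog;
```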
Sometimes the shell cannot connect at all. In this case, you can send the SIGUSR1 signal to the taosd process on the machine where it runs, for example:
kill -SIGUSR1 <pid-of-taosd>
Originally published at: https://mp.weixin.qq.com/s/LUz1niYajOR35UpOlfszIQ