Recently, a new function was launched in production. During the use of the function, we found a weird problem by checking the log. This article records the troubleshooting process and solutions of this problem.

The program in question uses the Redis client hiredis of THE C language to connect to and operate Redis. Since we are in a distributed environment, we use the Redis transaction feature to avoid the situation where multiple clients make changes to the Redis data at the same time. For an introduction to the use of Hiredis and Redis transactions, please refer to the article “Hiredis C client Hiredis use” and “Hiredis implementation Redis pipeline”.

The original code of the program is as follows:

void GFRedisClient::QueryRowsByFields(const char *key, map<string, string> &mapField2Value, vector<map<string, string> > &outRecords) { if (! Key | | strlen (key) < = 0) {KSBLog (WARN, "into service code is empty"); return; } // get the Redis connection redisContext * CTX = CreateContext(); If (CTX == NULL) {KSBLog(WARN, "query error, cannot get Redis connection, service: %s", key); return; } // Monitor table data SendWatch(CTX, key); Vector <string> vecRecordRowIds; GetSpecRecords(ctx, key, mapField2Value, vecRecordRowIds); If (vecRecordrowids.empty ()) {KSBLog(INFO, "error: %s", key); vecrecordrowids.empty () {KSBLog(INFO," error: %s", key); ReleaseContext(ctx); return; } // Implement int pipelineCount = 0; SendMulti(ctx, pipelineCount); SendQueryFields(CTX, Key, pipelineCount); SendQuerySpecRecords(CTX, Key, vecRecordRowIds, pipelineCount); SendExec(CTX, pipelineCount); ProposeQueryRows(CTX, key, pipelineCount, outRecords); KSBLog(INFO, "query 2d table return Redis connection, service: %s", key); ReleaseContext(ctx); }Copy the code

The main function of this code is to query the specified two-dimensional table records from Redis. Because we need to use the Redis pipelined transaction feature, and some Redis command operations depend on the results of previous Redis command execution, we use Redis WATCH command to monitor whether the specified Redis has been changed. If the key has been changed, EXEC fails and returns nil; If the key has not been changed, the EXEC command returns normal results. The code above has been running for several months without any problems.

Recently in production, we observed an abnormal situation by viewing the exception log. The abnormal log output is as follows:

DEBUG|2017-12-13 11:30:38 -185|gfredis.cpp(430)-<ProposeQueryRows>: Perform assembly line command returns, type: 5, return the result: OK, returned a result: 0 DEBUG | | 11:30:38 2017-12-13-185 gfredis. CPP (430) - < ProposeQueryRows > : Perform assembly line command returns, type: 5, return the result: QUEUED, returned a result: 0 DEBUG | | 11:30:38 2017-12-13-185 gfredis. CPP (430) - < ProposeQueryRows > : Perform assembly line command returns, type: 5, return the result: QUEUED, returned a result: 0 DEBUG | | 11:30:38 2017-12-13-185 gfredis. CPP (430) - < ProposeQueryRows > : Pipeline command, type: 4, result: (null), child result: 0Copy the code

The last line is the output log for EXEC execution, i.e. EXEC command execution failed? ! By checking logs, you can confirm that the same key is not updated at this point in time. So what causes EXEC failure? At first, we even wondered if the hiredis implementation was buggy, but we immediately abandoned the idea. Hiredis has been around for a long time and is widely used, so there should be no low-level bugs. With no other clue, I had to go back to the code itself. The code includes the following handling:

If (vecRecordrowids.empty ()) {KSBLog(INFO, "error: %s", key); vecrecordrowids.empty () {KSBLog(INFO," error: %s", key); ReleaseContext(ctx); return; }Copy the code

If the specified record cannot be fetched, it is returned directly rather than continuing with the MULTI and EXEC operations below. Could that be the reason? Just to be sure, I checked the logs from the last two days. The specified log cannot be found:

INFO 168 | | 2017-12-11 19:23:30 gfredis. CPP (254) - < QueryRowsByFields > : from Redis find no record of the specified conditions, service:Copy the code

Following this idea, we used Redis-CLI to do a test, the test process is as follows:

time The client A Client B
T1 WATCH abcd
T2 SET abcd 1
T3 WATCH abcd
T4 GET abcd
T5 MULTI
T6 SET abcd 2
T7 EXEC

After client A runs the WATCH command for the key abcd, client B updates the key abcd. Then client A continued to execute WATCH. Although no client modified the key between the second execution of WATCH and EXEC, EXEC still returned nil, that is, transaction execution failed.

We continue to analyze the source code of Redis, and we can see that when the Redis service executes the WATCH command, it will determine whether the client has been monitored, and if so, it will return directly. That is, executing WATCH multiple times on the same client before executing MULTI has the same effect as executing WATCH the first time.

/* Watch for the specified key */ void watchForKey(redisClient *c, robj *key) { list *clients = NULL; listIter li; listNode *ln; watchedKey *wk; /* Check if we are already watching for this key */ listRewind(c->watched_keys,&li); while((ln = listNext(&li))) { wk = listNodeValue(ln); if (wk->db == c->db && equalStringObjects(key,wk->key)) return; /* Key already watched */ } /* This key is not already watched in this DB. Let's add it */ clients = dictFetchValue(c->db->watched_keys,key); if (! clients) { clients = listCreate(); dictAdd(c->db->watched_keys,key,clients); incrRefCount(key); } listAddNodeTail(clients,c); /* Add the new key to the list of keys watched by this client */ wk = zmalloc(sizeof(*wk)); wk->key = key; wk->db = c->db; incrRefCount(key); listAddNodeTail(c->watched_keys,wk); }Copy the code

This makes it clear what caused the problem. The program executed the WATCH command once on December 11 and did not proceed with the EXEC command, because the specified record could not be queried from Redis. Then on December 13, we execute the WATCH command again. However, between December 11 and December 13, the same key was updated by another client. Therefore, when MULTI and EXEC are executed again on December 13, the transaction fails and EXEC returns nil.

To fix this problem, we need to add an UNWATCH command before the specified record is returned from Redis. For an introduction to the UNWATCH command see redis. IO /commands/ UN… This way, the next time a transaction is executed, it is not affected by the original WATCH command. The corrected code looks like this:

If (vecRecordrowids.empty ()) {KSBLog(INFO, "error: %s", key); vecrecordrowids.empty () {KSBLog(INFO," error: %s", key); SendUnWatch (CTX); ReleaseContext(ctx); return; }Copy the code

summary

This article records the discovery of a BUG in the Redis application. Because the BUG is difficult to reproduce in the test environment, it is difficult to fix the BUG. Of course, if a BUG can be reproduced, it can be fixed. Through analysis of program code and logs, the unexpected results of a bright future were obtained. Therefore, in the actual process of troubleshooting bugs, you can start by trying to analyze the program code and logs so that you can eventually fix the problem.

The resources

  1. Redis. IO/switchable viewer/tran…
  2. Redis. IO/commands/wa…
  3. Redis. IO/commands/UN…
  4. Redis. IO/commands/di…
  5. Redisbook. Readthedocs. IO/en/latest/f…