“Ding…” My beautiful Saturday was interrupted by an alert.

Colleagues on the business team told me that many user accounts had been forced offline that day. The normal logic of our account system is that a token stays valid for one day after the user logs in. The problem was that users were being asked to log in again every 10 minutes or so. There are generally two causes for this kind of issue: 1. something went wrong when the token was generated; 2. something went wrong during token verification.

Checking the logs, I found that the corresponding token was missing from Redis at verification time. Since the expiration time set in Redis when a new token is generated was correct, the problem could basically be pinned on Redis itself.

Then I checked the Redis monitoring and found that Redis had been forced to evict keys several times during that period because of high memory usage. However, according to the logs there was no traffic spike in that window, and the number of keys in Redis had not grown significantly. So what was overloading Redis? After confirming that the memory growth was not caused by us, I contacted the business team for help. They said they had indeed shipped a new service recently, and the new feature used Redis. That still puzzled me: the key count had not grown, and I could not see any other business keys. After some digging, I learned that the business team was using Redis DB1, while I had been using (and had only checked) DB0. That was my mistake in the troubleshooting.

So can different Redis DBs affect each other? Normally we use different DBs for data isolation, which is fine. **However, when evicting, Redis does not clean only the DB that occupies the most memory; it cleans across all DBs.** I did not know this for sure at the time; it was just a guess based on the symptoms.

Curiosity drove me to verify this guess, so I went straight to the Redis source code. The key-eviction code lives in the evict.c file.

Redis keeps an "expiration pool" that holds candidate keys for eviction. The entries are stored in the following structure:

struct evictionPoolEntry {
    unsigned long long idle;    /* Object idle time (inverse frequency for LFU) */
    sds key;                    /* Key name. */
    sds cached;                 /* Cached SDS object for key name. */
    int dbid;                   /* Key DB number. */
};

idle is the object's idle time (for LFU, the inverse of access frequency). Redis has two eviction algorithms: approximate LRU and LFU. The default is approximate LRU.

Approximate LRU

Before explaining approximate LRU, let's take a quick look at true LRU. When Redis's memory usage exceeds maxmemory, it cleans up keys that have not been used for a long time. Under a true LRU algorithm, we would sort all keys (or, depending on configuration, only keys with an expiration time) by idle time and evict the longest-idle ones first, until Redis's memory footprint drops back below the limit.

The downside of true LRU is that it must maintain a list of all (or all expiring) keys ordered by recency of use. That costs a lot of memory, and re-sorting on every key access costs extra CPU. Neither is acceptable in a performance-critical system like Redis.

Therefore, Redis uses an algorithm that approximates LRU. When Redis receives a new write command and memory is exhausted, the approximate LRU algorithm is triggered to forcibly clean some keys. Redis samples a few keys (five by default) and puts the eligible ones into the "expiration pool", which is kept sorted by idle time. Redis then evicts the longest-idle keys first, until memory usage falls below maxmemory.
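To build intuition before reading the real source, here is a toy simulation of the sampling idea, not Redis's actual code: each key carries an idle time, and on memory pressure we repeatedly sample a handful of keys at random and evict the longest-idle one in the sample (the key counts and "1 unit per key" memory model are made up for illustration).

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model (not Redis code): idle[i] is key i's idle time;
 * an entry of -1 marks a key that has already been evicted. */
#define SAMPLES 5

static int evict_once(int idle[], int nkeys) {
    int best = -1;
    for (int i = 0; i < SAMPLES; i++) {
        int j = rand() % nkeys;                 /* random sample */
        if (idle[j] >= 0 && (best == -1 || idle[j] > idle[best]))
            best = j;                           /* longest idle in sample */
    }
    if (best != -1) idle[best] = -1;            /* evict it */
    return best;
}

/* Evict until usage drops to the limit; each key counts as 1 unit. */
int approximate_lru(int idle[], int nkeys, int used, int maxmemory) {
    while (used > maxmemory) {
        if (evict_once(idle, nkeys) != -1) used--;
    }
    return used;
}
```

The evicted keys are not guaranteed to be the globally longest-idle ones, only the longest-idle within each random sample, which is exactly the trade-off approximate LRU makes.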

The cleaning effect of the approximate LRU algorithm is shown in the figure below (image from the official Redis documentation).

If that is still unclear, let's go straight to the code.

Source code analysis

The figure above shows the main call path of the approximate LRU algorithm in the code.

The main logic is in the freeMemoryIfNeeded function.

It first calls the getMaxmemoryState function to determine the current memory state:

int getMaxmemoryState(size_t *total, size_t *logical, size_t *tofree, float *level) {
    size_t mem_reported, mem_used, mem_tofree;

    mem_reported = zmalloc_used_memory();
    if (total) *total = mem_reported;

    int return_ok_asap = !server.maxmemory || mem_reported <= server.maxmemory;
    if (return_ok_asap && !level) return C_OK;

    mem_used = mem_reported;
    size_t overhead = freeMemoryGetNotCountedMemory();
    mem_used = (mem_used > overhead) ? mem_used-overhead : 0;

    if (level) {
        if (!server.maxmemory) {
            *level = 0;
        } else {
            *level = (float)mem_used / (float)server.maxmemory;
        }
    }

    if (return_ok_asap) return C_OK;

    if (mem_used <= server.maxmemory) return C_OK;

    mem_tofree = mem_used - server.maxmemory;

    if (logical) *logical = mem_used;
    if (tofree) *tofree = mem_tofree;

    return C_ERR;
}

C_OK is returned if the memory usage is less than maxMemory, and C_ERR is returned otherwise. In addition, this function returns some additional information by passing a pointer argument.

  • total: total number of bytes used; valid whether C_OK or C_ERR is returned.
  • logical: used memory after subtracting the replica and AOF buffers; valid only when C_ERR is returned.
  • tofree: amount of memory that must be freed; valid only when C_ERR is returned.
  • level: fraction of memory used, usually between 0 and 1, greater than 1 when the limit is exceeded; valid whether C_OK or C_ERR is returned.

After determining the memory state, the function returns immediately if usage is under the limit; otherwise execution continues. Now that we know how much memory must be freed, the release loop can begin. Each round records how much memory it freed, and the loop continues until the total freed is at least tofree.

First, according to maxmemory_policy, there are different implementation methods for different cleaning policies. Let’s take a look at the specific implementation of LRU.

for (i = 0; i < server.dbnum; i++) {
    db = server.db+i;
    dict = (server.maxmemory_policy & MAXMEMORY_FLAG_ALLKEYS) ?
            db->dict : db->expires;
    if ((keys = dictSize(dict)) != 0) {
        evictionPoolPopulate(i, dict, db->dict, pool);
        total_keys += keys;
    }
}

The first step is to populate the "expiration pool". The loop walks every db (confirming my earlier guess) and fills the pool via the evictionPoolPopulate function.
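A tiny model of that cross-db behavior, with made-up dbid and idle values: because the loop visits every db, the resulting candidate pool can mix entries from db0 and db1, which is exactly why my account keys in DB0 were evicted when the business team's data in DB1 filled up memory.

```c
#include <assert.h>

/* Toy sketch (not Redis code): collect eviction candidates from
 * every db into one shared pool, as the loop above does. */
#define NDB     2
#define DB_KEYS 3

struct entry { int dbid; int idle; };

int populate_all_dbs(struct entry pool[], int idles[NDB][DB_KEYS]) {
    int n = 0;
    for (int db = 0; db < NDB; db++)          /* every db, not just one */
        for (int k = 0; k < DB_KEYS; k++)
            pool[n++] = (struct entry){ db, idles[db][k] };
    return n;
}
```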

void evictionPoolPopulate(int dbid, dict *sampledict, dict *keydict, struct evictionPoolEntry *pool) {
    int j, k, count;
    dictEntry *samples[server.maxmemory_samples];

    count = dictGetSomeKeys(sampledict,samples,server.maxmemory_samples);
    for (j = 0; j < count; j++) {
        unsigned long long idle;
        sds key;
        robj *o;
        dictEntry *de;

        de = samples[j];
        key = dictGetKey(de);
        /* some code */
        if (server.maxmemory_policy & MAXMEMORY_FLAG_LRU) {
            idle = estimateObjectIdleTime(o);
        }

        /* some code */
        k = 0;
        while (k < EVPOOL_SIZE &&
               pool[k].key &&
               pool[k].idle < idle) k++;
        if (k == 0 && pool[EVPOOL_SIZE-1].key != NULL) {
            /* Idle time too small and the pool is full: skip this key. */
            continue;
        } else if (k < EVPOOL_SIZE && pool[k].key == NULL) {
            /* Free slot: insert here directly. */
        } else {
            /* Insert in the middle. */
            if (pool[EVPOOL_SIZE-1].key == NULL) {
                /* Free space on the right: shift elements right. */
                sds cached = pool[EVPOOL_SIZE-1].cached;
                memmove(pool+k+1,pool+k,
                        sizeof(pool[0])*(EVPOOL_SIZE-k-1));
                pool[k].cached = cached;
            } else {
                /* No free space: drop the smallest-idle element. */
                k--;
                sds cached = pool[0].cached; /* Save SDS before overwriting. */
                if (pool[0].key != pool[0].cached) sdsfree(pool[0].key);
                memmove(pool,pool+1,sizeof(pool[0])*k);
                pool[k].cached = cached;
            }
        }
        /* some code */
    }
}

For brevity I have excerpted only part of the code. From it we can see that Redis first samples a subset of keys; the sample count is maxmemory_samples, which defaults to 5 and is configurable. The larger the sample, the closer the result approximates true LRU.

After sampling, we get each key's idle time and insert it at the appropriate position in the expiration pool. The pool is sorted by idle time in ascending order, so the keys with the largest idle time sit at the far right.
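The sorted-insert step above can be sketched in isolation. This is a simplified model, not the real code: entries are bare idle times (no keys, no cached SDS strings), but the three cases are the same, a candidate smaller than everything in a full pool is skipped, an insert into a non-full pool shifts larger entries right, and an insert into a full pool drops the smallest-idle entry at slot 0.

```c
#include <assert.h>
#include <string.h>

#define POOL_SIZE 4   /* Redis uses EVPOOL_SIZE = 16; shrunk here */

/* Keep pool[0..*used-1] sorted by idle, ascending. */
void pool_insert(long long pool[], int *used, long long idle) {
    int k = 0;
    while (k < *used && pool[k] < idle) k++;   /* find insert position */
    if (k == 0 && *used == POOL_SIZE)
        return;                                /* too small, pool full: skip */
    if (*used == POOL_SIZE) {
        /* Full: drop smallest (slot 0), shift left, insert at k-1. */
        memmove(pool, pool+1, sizeof(pool[0]) * (k-1));
        pool[k-1] = idle;
    } else {
        /* Not full: shift larger entries right, insert at k. */
        memmove(pool+k+1, pool+k, sizeof(pool[0]) * (*used - k));
        pool[k] = idle;
        (*used)++;
    }
}
```

Keeping the pool sorted is what lets the eviction loop later walk it from right to left, taking the best (longest-idle) candidates first.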

Once the expiration pool is filled, the best key to evict is selected by scanning the pool from back to front.

/* Go backward from best to worst element to evict. */
for (k = EVPOOL_SIZE-1; k >= 0; k--) {
    if (pool[k].key == NULL) continue;
    bestdbid = pool[k].dbid;

    if (server.maxmemory_policy & MAXMEMORY_FLAG_ALLKEYS) {
        de = dictFind(server.db[pool[k].dbid].dict,
                      pool[k].key);
    } else {
        de = dictFind(server.db[pool[k].dbid].expires,
                      pool[k].key);
    }
    /* some code */
    if (de) {
        bestkey = dictGetKey(de);
        break;
    }
}

After finding the key to delete, Redis removes it synchronously or asynchronously depending on the lazy-free configuration.

if (server.lazyfree_lazy_eviction)
    dbAsyncDelete(db,keyobj);
else
    dbSyncDelete(db,keyobj);

Finally, the amount of memory freed in this round is recorded; the loop condition uses it to decide whether to keep evicting.

delta -= (long long) zmalloc_used_memory();
mem_freed += delta;

Eviction policies

Finally, let's look at the eviction policies Redis supports:

  • noeviction: stop serving write requests (DEL and a few other commands still work).
  • allkeys-lru: approximate LRU over all keys.
  • volatile-lru: approximate LRU over keys with an expiration time.
  • allkeys-random: evict random keys among all keys.
  • volatile-random: evict random keys among those with an expiration time.
  • volatile-ttl: evict keys with an expiration time, shortest TTL first.

Redis 4.0 added LFU policies which, like LRU, come in two flavors:

  • volatile-lfu: LFU over keys with an expiration time.
  • allkeys-lfu: LFU over all keys.
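For reference, a policy is chosen in redis.conf (or via CONFIG SET). A hypothetical excerpt, with illustrative numbers, might look like this:

```conf
# Cap memory at 2 GB; when exceeded, evict by approximate LRU,
# but only among keys that have an expiration set.
maxmemory 2gb
maxmemory-policy volatile-lru

# Larger sample sizes approximate true LRU more closely (default 5).
maxmemory-samples 10
```

Note that with a volatile-* policy, if no keys have an expiration set, Redis cannot free memory and writes will fail just as under noeviction.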

Closing thoughts

Now I know what Redis does when it hits its memory limit. The next time a problem occurs, I will not check only my own DB.

As for the follow-up to this incident, I first asked the business team to roll back their code, and then asked them to move to a separate Redis instance, so that similar business problems can no longer affect our account service and the blast radius stays controllable.