1. How is caching used in a project? Why cache? What are the consequences of improper cache use?

Interviewer psychoanalysis

Internet companies almost always ask this question; if a candidate doesn't even understand caching, it is genuinely embarrassing.

Whenever caching comes up, the first questions are: where does your project use a cache? Why use it? What if you didn't? And what adverse consequences can improper use cause?

This tests whether you have actually thought about why you use a cache. If you can't give the interviewer a reasonable answer, they will conclude that you don't think deeply about your own work, and you won't leave a good impression.

Analysis of interview questions

How is the cache used in a project?

This has to be answered in terms of your own project's business scenarios.

Why cache?

There are two main uses for caching: high performance and high concurrency.

High performance

Consider this scenario: a request comes in, you run a pile of complicated MySQL queries, and after a long time you finally get a result — 600ms later. But that result may not change for hours, or it doesn't matter if users see the change a little late. What do you do?

Cache it. A result that takes 600ms to compute from MySQL can be looked up by key in the cache in about 2ms — a 300x performance improvement.

In other words, for results that are expensive and slow to compute but rarely change and are read frequently, put the query result straight into the cache and serve subsequent reads from the cache.
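For illustration, here is a minimal read-through sketch of that idea. It assumes the Jedis client; the key name, TTL, and the MySQL query method are hypothetical placeholders.

```java
import redis.clients.jedis.Jedis;

public class ProductQueryService {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public String getProductDetail(long productId) {
        String cacheKey = "product:detail:" + productId;
        String cached = jedis.get(cacheKey);       // ~2ms key lookup
        if (cached != null) {
            return cached;
        }
        String detail = queryFromMysql(productId); // the expensive ~600ms query
        jedis.setex(cacheKey, 3600, detail);       // keep it for 1 hour
        return detail;
    }

    private String queryFromMysql(long productId) {
        // placeholder for the complex, slow multi-table query described above
        return "{...}";
    }
}
```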

High concurrency

Suppose a system receives 10,000 requests per second at peak. A single MySQL machine will definitely fall over under that load, so at that point most of the traffic has to go to the cache rather than MySQL. A cache is simple — it is essentially key-value operations — and a single machine can easily support tens of thousands of requests per second, dozens of times the concurrent load a single MySQL machine can carry.

Caches run on memory, which naturally supports high concurrency.

What are the negative consequences of using caching?

Common cache problems are as follows:

Cache and database dual-write inconsistency, cache avalanche, cache penetration, and cache concurrency contention; each is explained in the later questions.

2. What is the difference between Redis and memcached? What is the threading model of Redis? Why can Redis single thread support high concurrency?

Interviewer psychoanalysis

Redis actually uses a single-threaded working model. If you don't know this, then you don't really know what you're doing when you use Redis.

You may also be asked about the difference between Redis and Memcached. Memcached was a popular caching solution at Internet companies in the early years, but these days it is mostly Redis; few companies still use Memcached.

Analysis of interview questions

What’s the difference between Redis and memcached?

Redis supports complex data structures

Redis has more data structures than Memcached and can support richer data operations. If you need a cache that can support more complex structures and operations, Redis is a good choice.

Redis natively supports the cluster mode

Since Redis 3.x, cluster mode is supported natively. Memcached has no native cluster mode and relies on the client to shard data across the cluster.

The performance comparison

Because Redis uses only a single core while Memcached can use multiple cores, Redis performs better than Memcached per core on average when storing small values. For values over 100KB, Memcached outperforms Redis. Redis has since been optimized for storing large values, but it still lags behind Memcached there.

Redis thread model

Redis uses a file event handler internally. This file event handler is single-threaded, which is why Redis is described as a single-threaded model. It uses an IO multiplexing mechanism to monitor many sockets at once, pushes the sockets that generate events into a memory queue, and an event dispatcher then picks the appropriate event handler based on the event type on each socket.

The structure of the file event handler consists of four parts:

  • Multiple sockets
  • The IO multiplexing program
  • File event dispatcher
  • Event handler (connection reply handler, command request handler, command reply handler)

Multiple sockets may concurrently produce different operations, each corresponding to a different file event. The IO multiplexing program listens on all of them, queues the sockets that generate events, and the event dispatcher takes one socket from the queue at a time and hands it, according to its event type, to the corresponding event handler.

Let’s look at a communication process between the client and Redis:

![](https://pic1.zhimg.com/80/v2-9ac953f7d296c3a23f3d3c144f50c158_720w.jpg)

To be clear, the communication happens over sockets; if you're not familiar with them, take a look at socket network programming first.

The client's socket01 requests a connection to the server socket of the Redis process, and the server socket produces an AE_READABLE event. The IO multiplexing program, which is listening on the server socket, detects the event and pushes the server socket into the queue.

The file event dispatcher takes the socket from the queue and hands it to the connection reply handler. The connection reply handler creates socket01, which can communicate with the client, and associates socket01's AE_READABLE event with the command request handler.

Suppose the client then sends a set key value request. Socket01 in Redis produces an AE_READABLE event, and the IO multiplexing program pushes socket01 into the queue. The event dispatcher takes socket01's AE_READABLE event from the queue; since that event has already been associated with the command request handler, the dispatcher hands it to the command request handler, which reads the set key value command from socket01 and sets the key and value in memory. When the operation completes, it associates socket01's AE_WRITABLE event with the command reply handler.

If the client is ready to receive the result, socket01 in Redis produces an AE_WRITABLE event, which is likewise pushed into the queue. The event dispatcher finds the associated command reply handler, which writes the result of the operation to socket01 — for example, ok — and then disassociates socket01's AE_WRITABLE event from the command reply handler.

This completes one round of communication.

Why is the Redis single-threaded model so efficient?

  • Pure memory operation
  • The core is a non-blocking IO multiplexing mechanism
  • Being single-threaded actually avoids the frequent context-switching overhead of multithreading

3. What are the data types of Redis? In which scenarios are they appropriate?

Interviewer psychoanalysis

The interviewer will only ask this if they suspect you are a junior engineer with less than three years of experience who may not have a deep grasp of the technology; otherwise they won't spend precious interview time on it.

There are two main reasons for asking this question:

  • To see whether you have a comprehensive picture of what features Redis offers, how to use them, and which data type fits which scenario — they're afraid you only know the simplest KV operations;
  • To see how you have actually used Redis in your projects.

If you can't give a good answer — naming the data types and a couple of scenarios for each — you're done: the interviewer will conclude you only ever do simple set and get operations.

Analysis of interview questions

Redis has the following data types:

  • string
  • hash
  • list
  • set
  • Zset (sorted set)

string

This is the simplest type, just plain set and get, simple KV cache.

hash

This is a map-like structure. It is generally used to cache structured data, such as an object (provided the object does not nest other objects), in Redis; you can then read or write individual fields of the hash each time you access the cache.

hset person name bingo
hset person age 20
hset person id 1
hget person name

person = {
 "name": "bingo",
 "age": 20,
 "id": 1
}

list

A list is an ordered list, and you can do a lot of things with this.

For example, you can store a list of data structures, such as a list of fans, a list of comments on articles, and so on.

For example, you can read the elements in a closed interval with the lrange command and implement paging on top of a list. This is a great feature: with Redis you can build simple, high-performance paging, like the endless pull-down pagination in a microblog feed, going page by page.

The start position is 0 and the end position is -1; -1 means the last position in the list, i.e. view everything: lrange mylist 0 -1
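As a rough illustration of list-based paging (assuming the Jedis client; the key and page size are hypothetical), each page is just a closed index interval:

```java
import java.util.List;
import redis.clients.jedis.Jedis;

public class CommentPager {
    private static final int PAGE_SIZE = 10;
    private final Jedis jedis = new Jedis("localhost", 6379);

    // page 0 -> elements 0..9, page 1 -> elements 10..19, and so on
    public List<String> getPage(String listKey, int page) {
        long start = (long) page * PAGE_SIZE;
        long stop = start + PAGE_SIZE - 1;
        return jedis.lrange(listKey, start, stop);
    }
}
```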

For example, you could also build a simple message queue: push elements in at the head of the list with lpush and pop them off at the tail with rpop.

lpush mylist 1
lpush mylist 2
lpush mylist 3 4 5
# rpop returns the element at the tail, here 1
rpop mylist

set

Set is an unordered set, automatically deduplicated.

If you need to quickly deduplicate some data globally, you could use a HashSet in JVM memory — but what if your system is deployed across multiple machines?

Then use a Redis set for global, distributed deduplication.

You can also compute intersections, unions, and differences: for example, put the followers of two big Vs into two sets and intersect them to find their common followers.

# check whether the set contains a value
sismember mySet 3
# remove some elements
srem mySet 1
# count the elements
scard mySet
# randomly pop an element
spop mySet
# ------- operate on multiple sets -------
# move an element from one set to another
smove yourSet mySet someMember
# intersection of two sets
sinter yourSet mySet
# union of two sets
sunion yourSet mySet
# elements that are in yourSet but not in mySet
sdiff yourSet mySet

sorted set

A sorted set is a set that is sorted: elements cannot repeat, but they are ordered — each element is given a score when it is written, and the score determines the order.

zadd board 85 zhangsan
zadd board 72 lisi
zadd board 96 wangwu
zadd board 63 zhaoliu
# list members from highest score to lowest (indexes 0 through 3)
zrevrange board 0 3

4. What are Redis' expiration strategies? What are the memory eviction policies? Can you hand-write an LRU implementation?

Interviewer psychoanalysis

If you don't even know this, you'll be stumped. Worse, when you write production code you will take it for granted that data written into Redis is always there, which leads to all kinds of bugs in the system. Who is responsible for that?

There are two common problems:

(1) Why did data written to Redis disappear?

Some people find that Redis in production keeps losing data: you write something in and a while later it's gone. That doesn't mean Redis is useless — Redis is a cache, and if you store data there, you store it as a cache. Were you treating it as durable storage?

What is a cache? Caching uses memory. Is memory infinite? Memory is precious and limited; disks are cheap and plentiful. A machine may have only a few tens of gigabytes of memory but several terabytes of disk. Redis is based primarily on memory, which is what gives it high-performance, high-concurrency reads and writes.

Since memory is limited — say Redis is only allowed to use 10GB — what happens when you write 20GB of data into it? Of course 10GB of data gets evicted and 10GB is kept. Which data is evicted and which is kept? Ideally, the infrequently used data is evicted and the frequently used data is kept.

(2) Why does data that has clearly expired still occupy memory?

This is determined by Redis' expiration policy.

Analysis of interview questions

Redis expiration policy

The Redis expiration policy is: delete periodically + delete lazily.

Periodic deletion means that by default, every 100ms, Redis randomly samples some keys that have an expiration time set, checks whether they have expired, and deletes them if they have.

Suppose you put 100,000 keys into Redis, all with expiration times. If Redis checked all 100,000 keys every few hundred milliseconds it would basically die — the CPU would be consumed entirely by checking expired keys. Note that Redis does not walk through every key with an expiration time every 100ms; that would be a performance disaster. Instead it randomly samples some keys every 100ms to check and delete.

The problem is that periodic deletion may leave many expired keys undeleted when their time comes. What then? Lazy deletion: when you fetch a key, Redis checks whether it has expired; if it has, the key is deleted at that moment and nothing is returned to you.
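From the client's point of view, an expired key simply behaves as if it were gone, whichever of the two strategies actually removed it. A small sketch, assuming the Jedis client (key name and TTL are illustrative):

```java
import redis.clients.jedis.Jedis;

public class ExpireDemo {
    public static void main(String[] args) throws InterruptedException {
        Jedis jedis = new Jedis("localhost", 6379);
        jedis.setex("session:abc", 5, "user-1");      // key with a 5 second TTL
        System.out.println(jedis.ttl("session:abc")); // remaining TTL, e.g. 5
        System.out.println(jedis.get("session:abc")); // "user-1"
        Thread.sleep(6000);
        System.out.println(jedis.get("session:abc")); // null: the key has expired
    }
}
```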

But this still isn't enough: what if periodic deletion misses a lot of expired keys, and you never read them again, so lazy deletion never kicks in either? Large numbers of expired keys pile up and eat Redis memory. What then?

The answer: the memory eviction mechanism.

Memory eviction mechanism

Redis has the following memory eviction policies (a brief configuration sketch follows this list):

  • noeviction: when memory is insufficient to hold newly written data, new write operations report an error.
  • allkeys-lru: when memory is insufficient to hold new writes, remove the least recently used key from the whole key space (this is the most commonly used policy).
  • allkeys-random: when memory is insufficient to hold new writes, remove a random key from the whole key space.
  • volatile-lru: when memory is insufficient to hold new writes, remove the least recently used key from among the keys that have an expiration time set (this is generally not appropriate).
  • volatile-random: when memory is insufficient to hold new writes, remove a random key from among the keys that have an expiration time set.
  • volatile-ttl: when memory is insufficient to hold new writes, remove, from among the keys that have an expiration time set, the key whose expiration is soonest.
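A minimal configuration sketch for the most common setup — capping memory and evicting with allkeys-lru. The maxmemory and maxmemory-policy parameters are standard Redis configuration; setting them at runtime through an assumed Jedis client, as below, is just one way to do it (they can equally go in redis.conf):

```java
import redis.clients.jedis.Jedis;

public class EvictionConfig {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379);
        // cap Redis at 10GB and, when the limit is hit, evict the least
        // recently used key across the whole key space
        jedis.configSet("maxmemory", "10gb");
        jedis.configSet("maxmemory-policy", "allkeys-lru");
        System.out.println(jedis.configGet("maxmemory-policy"));
    }
}
```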

Write an LRU algorithm

You could be asked to hand-write the original LRU algorithm on the spot, but writing it completely from scratch involves too much code and isn't practical in an interview.

Don’t build your own LRU from the ground up, but at least know how to implement a Java version of LRU using existing JDK data structures.

import java.util.LinkedHashMap;
import java.util.Map;

class LRUCache<K, V> extends LinkedHashMap<K, V> {
    private final int CACHE_SIZE;

    /**
     * Pass in the maximum number of entries the cache can hold.
     * @param cacheSize cache size
     */
    public LRUCache(int cacheSize) {
        // true makes the LinkedHashMap ordered by access: the most recently
        // accessed entry moves to the head, the least recently accessed to the tail.
        super((int) Math.ceil(cacheSize / 0.75) + 1, 0.75f, true);
        CACHE_SIZE = cacheSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // once the number of entries exceeds the cache size, remove the eldest entry
        return size() > CACHE_SIZE;
    }
}
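A quick usage sketch of the class above (types and values are arbitrary):

```java
public class LRUCacheDemo {
    public static void main(String[] args) {
        LRUCache<Integer, String> cache = new LRUCache<>(3);
        cache.put(1, "a");
        cache.put(2, "b");
        cache.put(3, "c");
        cache.get(1);                       // touch key 1: now most recently used
        cache.put(4, "d");                  // over capacity: evicts key 2, the LRU entry
        System.out.println(cache.keySet()); // prints [3, 1, 4]
    }
}
```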

5. How do you ensure high concurrency and high availability with Redis? Can you introduce Redis' master-slave replication principle? Can you introduce Redis' sentinel principle?

Interviewer psychoanalysis

This question mainly tests whether you know how much concurrency a single Redis machine can carry, how to scale out to carry more concurrency when one machine isn't enough, whether Redis can fail, and — since it can — how to make Redis highly available.

These are questions that you must consider in your project, and if you haven’t, you haven’t given enough thought to the problems in your production system.

Analysis of interview questions

If you are using Redis as a cache, you must consider how to scale it across multiple machines to guarantee high concurrency, and how to keep the service running when instances fail — that is, how to make Redis highly available.

Since this topic is quite extensive, it is split into two parts:

  • Redis master-slave architecture
  • High availability based on sentinels

Redis achieves high concurrency mainly through the master-slave architecture — one master, many slaves. For most projects this is actually enough: a single master handles writes (a single machine can do tens of thousands of write QPS) and multiple slaves handle reads (several slave instances can serve on the order of 100,000 read QPS).

If you also need to hold large amounts of data while achieving high concurrency, you need a Redis cluster, which can provide hundreds of thousands of reads and writes per second.

For high availability, if you deploy a master-slave architecture, add sentinels: if any instance goes down, a master/standby switchover is performed automatically.

6. What persistence options does Redis have? What are the pros and cons of the different persistence mechanisms? How is persistence implemented under the hood?

Interviewer psychoanalysis

Redis keeps its data in memory, so if Redis goes down and restarts, everything in memory is lost.

You must use Redis' persistence mechanism, which asynchronously writes the in-memory data to disk files in the background, to persist the data.

If Redis goes down and restarts, it automatically loads back the data it previously persisted to disk. You may lose a little data, but at least you won't lose all of it.

The interviewer's question: if Redis goes down and restarts, the data in memory is lost — can the data be recovered after the reboot?

Analysis of interview questions

Persistence is mainly for disaster recovery and data recovery; it can also be counted as part of high availability. For example, if your whole Redis instance fails and becomes unavailable, your job is to make it available again as quickly as possible.

Restart Redis and expose it to the outside world as soon as possible. If no data backup was done, then Redis starts up empty and is effectively not usable.

A flood of requests then comes in, nothing hits the cache, no data can be found in Redis — and you're dead: that is the cache avalanche problem. Every request that misses Redis falls through to the MySQL database, MySQL suddenly receives high-concurrency traffic, and it dies too.

If you do Redis persistence, backup, and recovery properly, at an enterprise-grade level, then even if Redis fails you can restore the data quickly from backups and resume serving traffic as soon as it is back.

Redis offers two persistence mechanisms

  • RDB: the RDB persistence mechanism periodically persists a snapshot of the data in Redis.
  • AOF: the AOF mechanism logs every write command and appends it to a log file in append-only mode; when Redis restarts, the entire data set can be rebuilt by replaying the write commands in the AOF log.
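A minimal configuration sketch for the two mechanisms (illustrative values; save, appendonly, and appendfsync are standard Redis configuration parameters, set here at runtime through an assumed Jedis client, though they can equally live in redis.conf):

```java
import redis.clients.jedis.Jedis;

public class PersistenceConfig {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379);
        // RDB: snapshot if >= 1 key changed in 900s, or >= 10 keys in 300s
        jedis.configSet("save", "900 1 300 10");
        // AOF: log every write command, fsync the log roughly once per second
        jedis.configSet("appendonly", "yes");
        jedis.configSet("appendfsync", "everysec");
    }
}
```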

Through RDB or AOF, redis memory data can be persisted to disk, and then these data can be backed up to other places, such as Ali Cloud and other cloud services.

If Redis fails, the memory and disk data on the server are lost. You can copy back the previous data from the cloud service and put it in the specified directory. Then restart Redis, redis will automatically restore the data in the memory according to the data in the persistent data file and continue to provide services.

If both RDB and AOF persistence mechanisms are used, then when Redis restarts, AOF will be used to rebuild the data, because the data in AOF is more complete.

RDB pros and cons

  • RDB generates multiple data files, each representing the data in Redis at a point in time. This approach of keeping multiple data files is very suitable for cold backups: such complete files can be shipped to remote, secure storage — for example Amazon S3, or in China the ODPS distributed storage of Alibaba Cloud — to back up the Redis data on a predetermined backup policy.
  • RDB has very little impact on the read/write service Redis provides, so Redis keeps its high performance: the main Redis process only needs to fork a child process, and the child process performs the disk I/O for RDB persistence.
  • Compared with AOF, restarting and recovering a Redis process directly from an RDB data file is much faster.
  • If you want to lose as little data as possible when Redis fails, RDB is worse than AOF. RDB snapshot files are typically generated every 5 minutes or more, so you have to accept that if the Redis process goes down, the last roughly 5 minutes of data are lost.
  • Each time Redis forks a child process to generate an RDB snapshot, if the data set is very large, the service provided to clients may pause for milliseconds or even seconds.

AOF the pros and cons

  • AOF protects better against data loss: typically AOF performs an fsync every second via a background thread, so at most one second of data is lost.
  • AOF log files are written in append-only mode, so there is no disk-seek overhead and write performance is very high; the file is also less prone to corruption, and even if its tail is corrupted it is easy to repair.
  • Even when AOF log files grow very large, the background rewrite operation does not affect client reads and writes, because the rewrite compacts the log into the minimal set of commands needed to rebuild the current data. While the new log file is being created, the old one keeps being written as usual; once the new file is ready, the old and new files are swapped.
  • The commands in an AOF log are recorded in a very readable way, which makes it ideal for emergency recovery after catastrophic mistakes. For example, if someone accidentally runs flushall and wipes all the data, then as long as the background rewrite has not happened yet, you can immediately copy the AOF file, remove the last flushall command from it, put the file back, and restore all the data automatically through the recovery mechanism.
  • For the same data set, AOF log files are usually larger than RDB snapshot files.
  • With AOF enabled, the supported write QPS is lower than with RDB, because AOF is typically configured to fsync the log once per second (though performance is still very high; if you fsync on every write, QPS and overall Redis performance drop dramatically).
  • In the past AOF had a bug where replaying the recorded log did not restore exactly the same data. A mechanism like AOF, which works through complex command logging, merging, and replay, is inherently more fragile and bug-prone than RDB, which simply persists a complete snapshot of the data each time. To avoid bugs in the rewrite process, however, AOF does not rewrite by merging the old log; each rewrite is rebuilt from the data in memory at that moment, which is much more robust.

How to choose BETWEEN RDB and AOF

  • Don't use only RDB, because that can cause you to lose a lot of data.
  • Don't use only AOF either, because that has two problems: first, using AOF for cold backup means slower recovery than with RDB snapshots; second, RDB's approach of periodically producing a complete data snapshot is simpler and more robust and avoids the bugs that a complex backup-and-recovery mechanism like AOF can have.

7. How does a Redis cluster work? How is a Redis key addressed in cluster mode? What distributed addressing algorithms are there? Do you understand the consistent hashing algorithm?

Interviewer psychoanalysis

In earlier years, if you wanted several Redis nodes with each node storing part of the data, you had to use middleware such as Codis or Twemproxy to do it: you read and write through the Redis middleware, and the middleware is responsible for distributing your data across Redis instances on multiple machines.

Over the last couple of years Redis has kept evolving and releasing new versions. The current Redis cluster mode lets you deploy multiple Redis instances on multiple machines, each instance storing part of the data, and each Redis primary instance can have slave instances attached, so that if a primary instance fails, the cluster automatically fails over to one of its slaves.

With the newer versions of Redis, everyone uses Redis Cluster — the natively supported cluster mode — so the interviewer will certainly grill you on it for a few rounds. If you haven't used Redis Cluster, that's fine; many people still use Codis or other clients that support sharding. But you should at least have studied Redis Cluster.

If you have a small amount of data and mainly need high concurrency and high performance — say your cache is only a few gigabytes — a single machine is enough. Use replication: one master and several slaves (how many slaves depends on the read throughput you need), plus a sentinel cluster to keep the master-slave architecture highly available.

Redis Cluster is mainly for scenarios with massive data + high concurrency + high availability. A Redis cluster supports N Redis master nodes, and each master node can mount multiple slave nodes. This lets Redis scale horizontally: if you want a larger cache, add more master nodes so the cluster can hold more data.

Analysis of interview questions

Redis cluster is introduced

  • Automatically shards data, placing some data on each master
  • With built-in high availability support, it is possible to continue working when part of the master is not available

In the Redis cluster architecture, each Redis instance opens two ports: its normal service port, e.g. 6379, and that port plus 10000, e.g. 16379.

Port 16379 is used for communication between nodes, i.e. the cluster bus, which handles failure detection, configuration updates, and failover authorization. The cluster bus uses a different, binary protocol — the gossip protocol — for efficient metadata exchange between nodes, consuming less network bandwidth and processing time.

Internal communication mechanism between nodes

Basic Communication Principles

Cluster metadata can be maintained in two modes: centralized mode and Gossip protocol. Redis Cluster nodes communicate with each other using the Gossip protocol.

Centralized means storing several types of cluster metadata (node information, failures, and so on) on a single node. A good example of centralized metadata storage is Storm from the big-data world: a distributed real-time computing engine with a centralized metadata storage structure, where the storage and maintenance of all metadata is based on ZooKeeper (a distributed coordination middleware) underneath.

![](https://pic3.zhimg.com/80/v2-c06a22979dc0a89968da1498603bc94e_720w.jpg)

Redis maintains cluster metadata the other way, via the gossip protocol: every node holds a copy of the metadata, and when a node's metadata changes, it keeps sending the metadata to other nodes so that they update their copies as well.

![](https://pic1.zhimg.com/80/v2-7e404ee2435fe5f03de006e57925ce64_720w.jpg)

The advantage of centralized mode is that metadata is read and updated with very good timeliness. Once metadata is changed, it will be updated to the centralized storage immediately, and other nodes will be aware of it when reading. The downside is that all metadata update pressure is concentrated in one place, which can cause metadata storage pressure.

The advantage of the gossip approach is that metadata updates are spread out rather than concentrated in one place: update requests trickle out to all the nodes over time, which reduces the pressure. The downside is that metadata updates are delayed, which can make some operations in the cluster lag.

Port offset 10000: every node has a dedicated port for node-to-node communication, namely its own service port plus 10000 — for example, a node serving on port 7001 uses port 17001 for inter-node communication. Each node sends ping messages to several other nodes at regular intervals, and the receiving nodes reply with pong.

Information exchanged: Information includes fault information, node addition and deletion, hash slot information, and so on.

Gossip protocols

The Gossip protocol contains multiple messages, including ping, pong, meet, fail, and so on.

  • meet: a node sends a meet message to a new node so that it joins the cluster, after which the new node starts communicating with the other nodes. The redis-trib.rb add-node command sends a gossip meet message to the newly added node, telling it to join our cluster.
  • ping: each node frequently sends pings to other nodes, carrying its own status and cluster metadata, so that nodes exchange metadata with each other.
  • pong: the reply to ping and meet messages, containing the node's own status and other information; it is also used for broadcasting and updating information.
  • fail: after a node decides that another node has failed, it sends a fail message to the other nodes to tell them that the node is down.

Ping Message Depth

Some metadata must be carried during pings. If pings are frequent, the network may be overloaded.

Each node sends 10 pings per second, each time choosing the 5 other nodes it has not communicated with for the longest time. Of course, if the communication delay with some node reaches cluster_node_timeout / 2, that node is pinged immediately to avoid long delays in metadata exchange — otherwise, if two nodes haven't exchanged data for, say, 10 minutes, the cluster's metadata would be badly inconsistent. cluster_node_timeout can therefore be tuned: a larger value lowers the ping frequency.

Each ping carries the sending node's own information plus information about 1/10 of the other nodes it knows, and is sent out for exchange. It contains information about at least 3 other nodes and at most the total number of nodes minus 2.

Distributed addressing algorithm

  • Hash algorithm (massive cache reconstruction)
  • Consistent Hash algorithm (automatic cache migration) + Virtual Node (automatic load balancing)
  • Hash Slot algorithm of redis cluster

The hash algorithm

Given a key, first compute its hash value, then take it modulo the number of nodes; the request is routed to the corresponding master node. The problem: once any master node goes down, all requests are re-distributed modulo the new number of remaining nodes, so almost no request can find its data in the cache, and a huge amount of traffic floods into the database.

![](https://pic3.zhimg.com/80/v2-fde5f5f915f54ab150fe99ea694e0c52_720w.jpg)

Consistent hash algorithm

The consistent hash algorithm organizes the entire hash space into a virtual circle. The entire space is organized clockwise. The next step is to hash each master node (using the IP address or host name of the server). This determines the position of each node on its hash ring.

When a key is given, the hash value is computed and the position of the data on the ring is determined. From this position, the first master node encountered is the location of the key.

In the consistent hash algorithm, if a node fails, the data affected is only the data between this node and the previous node in the ring space (the first node encountered by walking counterclockwise), nothing else is affected. The same goes for adding a node.

When there are too few nodes, consistent hashing easily produces cache hot spots because keys are distributed unevenly across the nodes. To solve this hot-spot problem, consistent hashing introduces the virtual node mechanism: multiple hashes are computed for each node and a virtual node is placed at each resulting position. This spreads the data evenly and balances the load.

![](https://pic4.zhimg.com/80/v2-f0a31b5200e5ddb0735d27cadf670537_720w.jpg)

Hash Slot algorithm of redis cluster

A Redis cluster has a fixed 16384 hash slots. It computes the CRC16 of each key and takes that value modulo 16384 to obtain the hash slot the key belongs to.

Each master in a Redis cluster holds some of the slots; with three masters, for example, each might hold over 5000 hash slots. Hash slots make it easy to add and remove nodes: adding a master moves some hash slots over from the other masters, and removing a master moves its hash slots onto the remaining ones. The cost of moving a hash slot is very low. Clients can also force specified keys into the same hash slot by using the same hash tag.
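A small self-contained sketch of the addressing rule (slot = CRC16(key) mod 16384) and of hash tags. This mirrors the documented behaviour but is an illustration, not Redis' own code; real clients ship their own CRC16.

```java
public class SlotDemo {
    static final int SLOT_COUNT = 16384;

    // Hash slot rule: if the key contains a non-empty {...} section, only that
    // section (the hash tag) is hashed, so related keys can share a slot.
    static int slotFor(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) {
                key = key.substring(open + 1, close);
            }
        }
        return crc16(key.getBytes()) % SLOT_COUNT;
    }

    // CRC16/XMODEM (polynomial 0x1021, initial value 0), the variant Redis Cluster uses
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    public static void main(String[] args) {
        System.out.println(slotFor("user:1:profile"));   // usually a different slot...
        System.out.println(slotFor("user:1:orders"));    // ...from this one
        System.out.println(slotFor("{user:1}:profile")); // same slot as the next line,
        System.out.println(slotFor("{user:1}:orders"));  // because only "user:1" is hashed
    }
}
```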

If any one machine goes down, the other nodes are unaffected, because keys map to hash slots, not to machines.

![](https://pic2.zhimg.com/80/v2-689ae662ea7ae21f0946e122a0c58965_720w.jpg)

High availability and active/standby switchover of redis Cluster

The high-availability principle of Redis Cluster is very similar to that of the sentinel mechanism.

Check whether the node is down

If one node believes another node is down, that is a pfail — a subjective failure. If multiple nodes believe the node is down, it becomes fail — an objective failure. This is almost the same as the sentinel's sdown and odown.

If a node does not return a pong within cluster-node-timeout, it is considered pfail.

If a node thinks another node is pfail, it asks other nodes about that node in gossip ping messages; if more than half of the nodes think the node is pfail, its state becomes fail.

Slave node filtering

If the master node is down, select one of its slave nodes to become the master node.

Check how long each slave node has been disconnected from the master: if the disconnection time exceeds cluster-node-timeout * cluster-slave-validity-factor, that slave is not eligible to be promoted to master.

Slave node election

Each slave node sets an election delay based on its replication offset from the master: the larger the offset (the more data it has replicated), the earlier its election time and the higher its priority in the election.

All master nodes vote in the slave election. If the majority of master nodes (N/2 + 1) vote for a slave node, then the election passes and that slave node becomes the master.

The elected slave node then performs the master/standby switchover and becomes the new master.

Comparison with sentinels

The whole process is very similar to sentinel, so Redis Cluster is effectively a powerful, direct integration of replication and sentinel.

8. What are cache avalanche, cache penetration, and cache breakdown in Redis? What happens when Redis crashes, and how does the system deal with it? How do you handle cache penetration?

Interviewer psychoanalysis

This is actually a must-ask question about caching, because cache avalanche and cache penetration are the two biggest cache problems: they either never occur, or when they do occur they are fatal — so the interviewer will definitely ask you about them.

Analysis of interview questions

Cache avalanche

Take system A, with a daily peak of 5,000 requests per second. Normally the cache absorbs 4,000 of those at peak, but the cache machines unexpectedly go down completely. Now all 5,000 requests per second hit the database; it can't handle the load, raises alarms, and dies. If no special measures are taken to handle the failure, the DBA frantically restarts the database, only for it to be killed immediately by the new traffic.

This is cache avalanche.

![](https://pic4.zhimg.com/80/v2-b6e0b4030ed0d9be7384fde61b27564b_720w.jpg)

About 3 years ago, a well-known Internet company in China lost tens of millions of dollars due to a cache accident, which led to an avalanche and the collapse of all the background systems.

The solution to cache avalanche has a before, during, and after:

  • Before: make Redis highly available — master-slave + sentinel, or Redis Cluster — to avoid a total crash.
  • During: local ehcache caching + Hystrix rate limiting & degradation, to avoid MySQL being killed.
  • After: Redis persistence, so that after a restart it automatically loads data from disk and quickly restores the cached data.

![](https://pic4.zhimg.com/80/v2-3eff5dea634806b09e426ec1584004cf_720w.jpg)

The user sends a request. System A checks the local ehcache first; on a miss it checks Redis. If neither ehcache nor Redis has the data, it queries the database and writes the result back to both ehcache and Redis.

The rate limiting component can be configured with how many requests per second are allowed through. What about the requests that don't get through? Degrade them: return some default value, a hint, or an empty value.

Benefits:

  • The database never dies: the rate limiting component ensures that only a bounded number of requests per second pass through.
  • As long as the database is alive, 2 out of every 5 requests can still be served.
  • As long as 2/5 of requests can be served, the system is not dead: for users it may just mean a few extra clicks before a page finally loads.

The cache to penetrate

For system A, let’s say 5,000 requests per second, and 4,000 of those requests turn out to be malicious attacks by hackers.

Those 4,000 malicious requests look up keys that are not in the cache, so every one of them goes to the database, where the data doesn't exist either.

Here's an example: the database IDs start at 1, and the attacker sends requests whose IDs are all negative. Nothing ever gets cached, so every request queries the database directly. Cache penetration under this kind of malicious attack can kill the database.

![](https://pic2.zhimg.com/80/v2-e46fd72233a7ee55b523b40af27fe349_720w.jpg)

A simple fix: every time system A fails to find a value in the database, it writes a placeholder value for that key into the cache, for example set -999 UNKNOWN, and gives it an expiration time. Then, until that entry expires, subsequent accesses to the same key are served directly from the cache.
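A hedged sketch of this null-value caching idea, assuming the Jedis client; the key format, TTLs, placeholder value, and DAO method are all illustrative:

```java
import redis.clients.jedis.Jedis;

public class ProductDao {
    private static final String NULL_PLACEHOLDER = "UNKNOWN";
    private final Jedis jedis = new Jedis("localhost", 6379);

    public String getById(long id) {
        String cacheKey = "product:" + id;
        String cached = jedis.get(cacheKey);
        if (cached != null) {
            // a cached placeholder means "we already know the DB has nothing"
            return NULL_PLACEHOLDER.equals(cached) ? null : cached;
        }
        String fromDb = queryMysql(id);
        if (fromDb == null) {
            // cache the miss briefly so malicious/invalid ids stop hitting MySQL
            jedis.setex(cacheKey, 60, NULL_PLACEHOLDER);
            return null;
        }
        jedis.setex(cacheKey, 3600, fromDb);
        return fromDb;
    }

    private String queryMysql(long id) {
        return null; // placeholder for the real database lookup
    }
}
```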

Cache breakdown

Cache breakdown refers to a situation where a key is very hot and accessed frequently and is in centralized and high concurrency. When the key fails, a large number of requests will break through the cache and directly request the database, just like cutting a hole in a barrier.

The solutions are also simple: you can set hot data to never expire, or implement a mutex based on Redis or ZooKeeper so that when the key expires, only the first request rebuilds the cache and then releases the lock; the other requests wait and then read the data through the key once the cache is rebuilt.
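A rough sketch of the Redis-based mutex, assuming the Jedis client and its SetParams helper; key names, TTLs, and the retry strategy are illustrative, and a production lock needs more care (safe unlock, bounded retries, and so on):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class HotKeyLoader {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public String get(String key) throws InterruptedException {
        String value = jedis.get(key);
        if (value != null) {
            return value;
        }
        String lockKey = "lock:" + key;
        // only one request acquires the lock and rebuilds the cache
        String ok = jedis.set(lockKey, "1", SetParams.setParams().nx().ex(10));
        if ("OK".equals(ok)) {
            try {
                value = loadFromMysql(key);
                jedis.setex(key, 600, value);
                return value;
            } finally {
                jedis.del(lockKey);
            }
        }
        // the other requests back off briefly, then read the freshly built cache
        Thread.sleep(50);
        return get(key);
    }

    private String loadFromMysql(String key) {
        return "value-from-db"; // placeholder for the real database lookup
    }
}
```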

9. How do you ensure consistency between the cache and the database under double writes?

Interviewer psychoanalysis

As long as you use a cache, you may end up storing data in both the cache and the database and writing to both — and whenever you write to both, you face data consistency problems. So how do you solve them?

Analysis of interview questions

Generally speaking, if the cache is allowed to be occasionally, slightly inconsistent with the database — that is, if your system does not strictly require "cache + database" consistency — it is best not to adopt the scheme described below, in which read and write requests are serialized into an in-memory queue.

Serialization guarantees that no inconsistency will occur, but it also drastically reduces the system's throughput: you would need several times more machines than normal to support the same load.

Cache Aside Pattern

The most classic cache + database read/write pattern is the Cache Aside Pattern:

  • On read: read the cache; if the value is not there, read the database, put the result into the cache, and return the response.
  • On update: update the database first, then delete the cache.
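A compact sketch of the pattern, assuming the Jedis client (key format, TTL, and DAO methods are illustrative): the read path fills the cache on a miss, and the write path updates the database first and then deletes the cache.

```java
import redis.clients.jedis.Jedis;

public class UserInfoService {
    private final Jedis jedis = new Jedis("localhost", 6379);

    // read: cache first, fall back to the database, then populate the cache
    public String getUserInfo(long userId) {
        String key = "user:info:" + userId;
        String cached = jedis.get(key);
        if (cached != null) {
            return cached;
        }
        String fromDb = readFromMysql(userId);
        if (fromDb != null) {
            jedis.setex(key, 1800, fromDb);
        }
        return fromDb;
    }

    // write: update the database first, then delete (not update) the cache
    public void updateUserInfo(long userId, String newInfo) {
        writeToMysql(userId, newInfo);
        jedis.del("user:info:" + userId);
    }

    private String readFromMysql(long userId) { return "{...}"; }  // placeholder
    private void writeToMysql(long userId, String info) { }        // placeholder
}
```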

Why delete the cache instead of update it?

The reason is simple. A lot of times, in a more complex caching scenario, the cache is not just a value pulled straight out of the database.

For example, one field of a table may be updated, and the corresponding cache may need to query the data of the other two tables and perform operations to calculate the latest value of the cache.

Also, updating the cache can be costly. Does that mean that every time you change a database, you have to update the corresponding cache? This may be true in some scenarios, but not in more complex scenarios where cached data is computed. If you frequently modify multiple tables involved in a cache, the cache updates frequently. But the question is, will this cache be accessed frequently?

For example, if a field of a table is modified 20 times, or 100 times, in one minute, then the cache would be updated 20 or 100 times — but that cache may be read only once in that minute, so it is mostly cold data. If instead you just delete the cache, it is recomputed at most once within that minute, and the overhead drops dramatically: the cache is only computed when it is actually needed.

Deleting the cache rather than updating it is really the idea of lazy computation: don't redo a complex calculation every time, whether or not it will be needed — recompute only when it actually has to be used. MyBatis and Hibernate apply the same idea with lazy loading. When you query a department, the department carries a list of employees, but there is no need to fetch the department's 1,000 employees on every department query: 80% of the time you only want the department's own information. Query the department first, and only when you actually need to access the employees do you query the 1,000 employees from the database.

The most elementary cache inconsistency problem and solution

Problem: Update the database before deleting the cache. If the cache deletion fails, it will result in new data in the database and old data in the cache, causing data inconsistencies.

![](https://pic1.zhimg.com/80/v2-f4611651abbee05813a8265e5b4340d4_720w.jpg)

Solution: delete the cache first, then update the database. If the database update fails, the database still holds the old value and the cache is empty, so the data is not inconsistent: a read finds nothing in the cache, reads the old value from the database, and writes it back into the cache.

More complex data inconsistency problem analysis

The data changes: the cache is deleted, and the database is about to be modified but has not been yet. A read request comes in, finds the cache empty, queries the database, gets the old value from before the change, and puts it into the cache. The write operation then finishes modifying the database. Now the database and the cache hold different data…

Why does this problem occur when hundreds of millions of traffic are concurrent?

This problem only appears when a piece of data is read and written concurrently. If your concurrency is very low — especially read concurrency, say 10,000 visits a day — then the inconsistency just described will rarely occur. But if you have hundreds of millions of daily visits and tens of thousands of concurrent reads per second, then as long as there are data update requests every second, the database + cache inconsistency described above can occur.

Solutions are as follows:

When data is updated, the operation is routed to an in-JVM queue chosen by the data's unique identifier. When a read misses the cache, the "re-read the data + refresh the cache" operation is routed by the same unique identifier and sent to the same JVM-internal queue.

Each queue has a single worker thread that consumes the operations one at a time. For a data-change operation, the worker first deletes the cache and then updates the database; if, before the update completes, a read request arrives and finds the cache empty, it can put a "refresh cache" request into the same queue — that request backs up behind the pending update — and then wait for the cache refresh to complete.

One optimization: there is no point queuing several identical cache-refresh requests back to back in the same queue, so they can be filtered — if a cache-refresh request for that data is already in the queue, don't enqueue another; just wait for the one ahead to complete.

Once the worker thread for that queue finishes the database modification of the previous operation, it executes the next operation — the cache refresh — which reads the latest value from the database and writes it into the cache.

If a waiting read request polls and finds the value is now in the cache, it returns directly; if it has waited longer than a certain time, it reads the current (old) value directly from the database.
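A simplified sketch of the routing idea (all names are hypothetical): operations for the same data id always hash to the same single-threaded queue, so the "delete cache + update DB" and "reload DB + refresh cache" steps for that id execute strictly in order.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SerializedCacheUpdater {
    private static final int QUEUE_COUNT = 20;
    private final ExecutorService[] queues = new ExecutorService[QUEUE_COUNT];

    public SerializedCacheUpdater() {
        for (int i = 0; i < QUEUE_COUNT; i++) {
            // one worker thread per queue: its operations run strictly one by one
            queues[i] = Executors.newSingleThreadExecutor();
        }
    }

    private ExecutorService queueFor(String dataId) {
        return queues[Math.abs(dataId.hashCode() % QUEUE_COUNT)];
    }

    // a data change: delete the cache, then update the database
    public Future<?> submitUpdate(String dataId, Runnable deleteCacheThenUpdateDb) {
        return queueFor(dataId).submit(deleteCacheThenUpdateDb);
    }

    // a cache refresh triggered by a read miss: reload from DB, write the cache
    public Future<?> submitRefresh(String dataId, Runnable reloadDbAndFillCache) {
        return queueFor(dataId).submit(reloadDbAndFillCache);
    }
}
```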

In high concurrency scenarios, the following issues should be addressed in this solution:

(1) The read request is blocked for a long time

Because read requests are only very lightly "asynchronized", pay close attention to read timeouts: every read request must return within the timeout window. The biggest risk of this solution is that data may be updated so frequently that large numbers of update operations back up in the queues, read requests then time out en masse, and a lot of traffic goes straight to the database. Be sure to run realistic tests to see how frequently data is actually updated.

On the other hand, one queue may accumulate update operations for several different data items, so test against your own business scenario; you may need to deploy multiple instances of the service, each taking a share of the data-update traffic. If 100 inventory-modification operations for different products are squeezed into one memory queue and each takes 10ms, then the read request for the last product may wait 10 * 100 = 1000ms = 1s before it gets its data, which is a long time for a read to block.

Do some stress testing based on how the system actually operates, and simulate the production environment, to see how many update operations the memory queues may back up at the busiest time and therefore how long the read request behind the last update might hang. If read requests must return within 200ms and your calculation shows that even at the busiest time at most 10 updates are backlogged, so the wait is at most 200ms, then it is fine.

If a memory queue is likely to back up a particularly large number of update operations, add machines: with more instances, each service instance deployed on a machine processes less data, and each memory queue's backlog is smaller.

In fact, based on experience from previous projects, the data write frequency is usually very low, so the backlog in the queues should be very small. For this kind of project — high read concurrency with a read-through cache architecture — write requests are generally few; a write QPS of a few hundred per second is typical.

Let’s actually do a rough calculation.

Say there are 500 write operations per second. Divided into five 200ms time slices, that is 100 writes per 200ms; spread over 20 memory queues, each queue gets roughly 5 writes per 200ms. If performance testing shows each write operation completes in about 20ms, then a read request on any queue hangs for a moment at most and can certainly get its data back within 200ms.

From this rough calculation, the write QPS a single machine can support is easily in the hundreds. If the write QPS grows 10x, scale out the machines 10x, with 20 queues on each machine.

(2) The concurrency of read requests is too high

There is another risk: a sudden flood of read requests all hanging on the service with a delay of tens of milliseconds. Whether the service holds up, and how many machines are needed to absorb the peak in the worst case, has to be tested.

However, not all data is updated at the same moment, and the caches do not all expire at once; each time, only the caches of a few data items may be invalid, the read requests for those items come in, and the resulting concurrency should not be very large.

(3) Request routing for multi-service instance deployment

The service may be deployed as multiple instances, so you must make sure that requests performing data-update operations and cache-refresh operations are routed through the Nginx server to the same service instance.

For example, read and write requests to the same item are routed to the same machine. You can do your own hash routing between services based on a request parameter, you can also use Nginx’s hash routing function, etc.

(4) Routing problems of hot commodities lead to request skew

If the read and write requests for a particular item are extremely hot and all land on the same queue of the same machine, that machine may come under too much pressure. Since the cache is only cleared when the item's data is updated, and only then do concurrent reads and writes collide, whether this matters depends on the business: if the update frequency is not too high the impact is not particularly significant, but the load on some machines may indeed be higher.

10. What are Redis concurrency contention issues? How do you solve them? Do you know the CAS scheme in Redis transactions?

Interviewer psychoanalysis

This is also a very common problem in production: multiple clients concurrently writing the same key — the data that should have arrived first may arrive later, leaving the wrong version of the data; or multiple clients read a key at the same time, modify the value, and write it back, and as long as the order gets mixed up, the data is wrong.

And Redis has its own CAS-like optimistic locking scheme that naturally solves this kind of problem.

Analysis of interview questions

At some moment, multiple system instances are all updating the same key. You can implement distributed locking with ZooKeeper: each system instance acquires the distributed lock through ZooKeeper before operating on the key, ensuring that at any given time only one instance can operate on it and nobody else may read or write it.

![](https://pic4.zhimg.com/80/v2-70fdf9736e2df42e72846d39fa860d6b_720w.jpg)

Every value you write to the cache is first queried from MySQL, and it has already been written to MySQL; when it is written to MySQL, a timestamp must be saved along with it, so that when the value is queried from MySQL, its timestamp comes with it.

Then, before every write to the cache, check whether the timestamp of the value you are about to write is newer than the timestamp of the value currently in the cache. If it is, you may write; otherwise, do not overwrite newer data with older data.
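A hedged sketch combining that timestamp check with Redis' optimistic locking (WATCH / MULTI / EXEC), assuming the Jedis client and a cached value that carries its MySQL write timestamp; the value format is illustrative:

```java
import java.util.List;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

public class TimestampedCacheWriter {
    private final Jedis jedis = new Jedis("localhost", 6379);

    // value format (illustrative): "<timestampMillis>|<payload>"
    public boolean writeIfNewer(String key, long timestampMillis, String payload) {
        jedis.watch(key);                       // CAS: abort if the key changes before EXEC
        String current = jedis.get(key);
        if (current != null) {
            long currentTs = Long.parseLong(current.split("\\|", 2)[0]);
            if (currentTs >= timestampMillis) { // cache already holds newer (or equal) data
                jedis.unwatch();
                return false;
            }
        }
        Transaction tx = jedis.multi();
        tx.set(key, timestampMillis + "|" + payload);
        List<Object> result = tx.exec();        // null/empty means another writer got in first
        return result != null && !result.isEmpty();
    }
}
```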

11. How is Redis deployed in production?

Interviewer psychoanalysis

This checks whether you understand the deployment architecture of your company's Redis production cluster — if you don't, that is negligence on your part. Is it a master-slave architecture? A cluster architecture? Which clustering scheme is used? Is there a high-availability guarantee? Is persistence enabled so data can be recovered? How much memory is allocated to Redis online? What parameters are set? What QPS did your Redis cluster carry under load testing?

Dude, you have to know this stuff, or you really haven’t thought it through.

Analysis of interview questions

Redis cluster has 10 machines, 5 of which deploy the master instance of Redis, and the other 5 deploy the slave instance of Redis. Each master instance has a slave instance. 5 nodes provide read and write services externally, and the peak QPS of each node may reach 50,000 per second. The maximum for five machines is 250,000 read/write requests /s.

What is the configuration of the machine? 32G memory + 8-core CPU + 1T disk, but 10G memory is allocated to the Redis process. In general online production environment, the redis memory should not exceed 10G as far as possible, which may cause problems.

Five machines provide external reading and writing, with a total of 50G of memory.

Because every primary instance has a slave instance, the cluster is highly available: if any primary instance goes down, it automatically fails over — a Redis slave instance is promoted to primary and continues to provide read/write services.

What data is written into memory, and how big is each item? Commodity data, about 10KB per item: 100 items is 1MB, and 100,000 items is 1GB. Around 2 million items are resident in memory, occupying 20GB — less than 50% of the total memory. The current peak is around 3,500 requests per second.

In large companies, there is an infrastructure team responsible for the operation and maintenance of the cache cluster.

Author: Programmer chase wind original link: juejin.cn/post/684490…