Preface:

This article covers good practices for using distributed caching, together with real online examples. These practices were accumulated by the author across several Internet companies, and they can help readers avoid many unnecessary production accidents.

1. Core elements of cache design

When we decide to use caching in an application, a detailed design is usually required, because designing a caching architecture is not as simple as it seems. It involves subtle principles, and improper use can cause serious problems, from production accidents all the way to service avalanches.

1. Capacity planning

Size of cached objects

Number of cached objects

Eviction policy

Cached data structures

Peak reads per second

Peak writes per second

2. Performance optimization

Threading model

Cache warm-up method

Cache fragmentation

The ratio of cold to hot data

3. High availability

Replication model

Failover

Persistence policy

Cache reconstruction

4. Cache monitoring

Cache service monitoring

Cache capacity monitoring

Cache request monitoring

Cache response time monitoring

5. Precautions

Whether cache penetration is likely to occur

Whether there are large objects

Whether to use caching for distributed locking

Whether the cache's server-side scripting (Lua) is used

Whether race conditions are avoided

2. Good practices in cache design

Good Practice 1

Memory is the main resource a caching system consumes on a server. Therefore, before using a cache, you must first evaluate how much data the application needs to cache, including the data structures used, the size and number of cached objects, and the cache expiry times, and then estimate the capacity needed for a future period based on business growth. Apply for and allocate cache resources according to the results of this capacity assessment; otherwise, resources will be wasted or cache space will be insufficient.
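To make this concrete, here is a rough illustration with assumed numbers (not a recommendation): caching 10 million objects with 64-byte keys, 256-byte values, and roughly 100 bytes of per-entry overhead in Redis needs about 10,000,000 × (64 + 256 + 100) bytes ≈ 4.2GB; with RDB backups enabled (see Good Practice 3 below), you would plan for roughly double that.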

Good Practice 2

You are advised to separate the services that use caching, and to physically isolate core services from non-core services by giving them different cache instances. If possible, use an independent instance or cluster for each service to reduce the chance of applications affecting each other. I often hear of online accidents at companies that use a shared cache, where cached data is overwritten and corrupted.

Good Practice 3

Calculate the number of cache instances the application needs based on the memory size a single cache instance provides. Companies generally set up a cache operation and maintenance team, and that team virtualizes the cache resources into multiple cache instances of equal memory size.

For example, if each instance provides 4GB of memory, you apply for as many instances as your capacity assessment requires, and the application must shard its data across them; for details, see section 4.4.3 of Scalable Service Architecture: Frameworks and Middleware. Note that if we use the RDB backup mechanism and an instance uses 4GB of memory, the system should reserve about 8GB for it: an RDB save forks a child process, and because of copy-on-write, pages modified during the save are duplicated, so in the worst case nearly double the memory is needed.

Good Practice 4

Caching is usually used to speed up database reads; the cache is typically accessed first and the database second, so the timeout of cache operations is very important. I once worked at an Internet company where the cache operation timeout was set too long due to an operational error; this dragged down the service's thread pool and eventually led to a service avalanche.
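As a minimal sketch of the client-side fix, assuming the Jedis client for Redis (the host, port, and the 100ms value are illustrative), the operation timeout can be set when the connection is created:

    import redis.clients.jedis.Jedis;

    // Connect with a 100 ms connect/read timeout so a slow or unreachable
    // cache fails fast instead of pinning service threads.
    Jedis jedis = new Jedis("cache-host", 6379, 100);

Case 7 below shows the same problem from the connection-pool side.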

Good Practice 5

All cache instances must be monitored. We need reliable monitoring of slow queries, large objects, and memory usage.
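As one possible probe, assuming the Jedis client (the host name is illustrative), you can poll the slow log and the memory section of INFO:

    import redis.clients.jedis.Jedis;

    public class CacheProbe {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("cache-host", 6379)) {
                // A growing slow log suggests expensive commands such as
                // HGETALL on large keys (see Good Practice 10).
                long slowlogEntries = jedis.slowlogLen();
                // The "memory" section of INFO reports used_memory,
                // fragmentation ratio, and so on.
                String memoryInfo = jedis.info("memory");
                System.out.println("slowlog entries: " + slowlogEntries);
                System.out.println(memoryInfo);
            }
        }
    }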

Good Practice 6

We do not recommend that multiple businesses share a single cache instance, but cost control often makes this the reality. In that case, we need a specification requiring each application to use a unique key prefix, and an isolation design, to avoid caches overwriting each other.
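A sketch of such a specification in code; the application prefix and helper class are illustrative, not a fixed convention:

    // Every key this application writes starts with its unique prefix,
    // so it cannot collide with another application's keys.
    public final class CacheKeys {
        private static final String APP_PREFIX = "orderservice:";  // unique per application

        private CacheKeys() {}

        public static String key(String module, String id) {
            return APP_PREFIX + module + ":" + id;  // e.g. "orderservice:user:42"
        }
    }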

Good Practice 7

Every cache key must have an expiration time, and the expiration times must not be concentrated at one point in time; otherwise the cache will fill up memory, or a cache avalanche will occur. (Case 1 below shows a concrete way to randomize expirations.)

Good Practice 8

Data that is accessed at low frequency should not be stored in the cache. As mentioned earlier, the main purpose of using a cache is to improve read performance.

A friend of mine once designed a scheduled batch system. Because the batch job needed a large data model for its calculations, he stored this model in the local cache of every node and used a message queue to receive update messages that kept the local copies current in real time. However, the model was only used once a month, so using the cache this way was wasteful.

Since it is a batch task, the task should instead be divided and processed in batches, using a divide-and-conquer approach to compute step by step and obtain the final result.

Good Practice 9

Be careful not to make individual cached values too large, especially in Redis: Redis processes commands on a single thread, so if a single cache key holds a very large value, operating on it blocks the processing of other requests.

Good Practice 10

For keys that store a large collection of values, do not use operations that return the whole collection at once, such as HGETALL. Such operations block Redis and affect other applications' access.
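A sketch of the incremental alternative, assuming Jedis 3.x (the key name and page size are illustrative): HSCAN walks the hash a page at a time instead of returning everything in one blocking call:

    import java.util.Map;
    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.ScanParams;
    import redis.clients.jedis.ScanResult;

    public class HashScanExample {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("cache-host", 6379)) {
                String cursor = ScanParams.SCAN_POINTER_START;    // "0"
                ScanParams params = new ScanParams().count(100);  // ~100 fields per call
                do {
                    ScanResult<Map.Entry<String, String>> page =
                            jedis.hscan("big:hash:key", cursor, params);
                    for (Map.Entry<String, String> entry : page.getResult()) {
                        // process one field/value pair at a time
                    }
                    cursor = page.getCursor();                    // back to "0" when done
                } while (!"0".equals(cursor));
            }
        }
    }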

Good Practice 11

Caching is generally used to speed up queries in transactional systems. When a large amount of cached data needs to be updated, especially in batch processing, use batch update mode; but this scenario is rare.

Good Practice 12

If the performance requirements are not extremely high, use a distributed cache whenever possible rather than a local cache. A local cache is replicated across the service's nodes, and at any given moment the copies may be inconsistent. If the cached value represents a switch, and requests in a distributed system can be duplicated, the duplicate requests may land on two different nodes, one where the switch is on and one where it is off; if the request processing is not idempotent, this leads to duplicate processing and, in severe cases, loss of funds.

Good Practice 13

When writing to the cache, make sure the data written is complete and correct. If part of the data is valid and part is invalid, prefer to skip the cache write rather than write partial data into the cache; otherwise null pointers and program exceptions will follow.

Good Practice 14

In general, the read order is cache first, then database; the write order is database first, then cache.
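A minimal cache-aside sketch of this order, assuming the Jedis client; UserDao, the key scheme, and the 600-second TTL are illustrative assumptions:

    import redis.clients.jedis.Jedis;

    public class UserService {
        private final Jedis jedis;      // in production, borrowed from a pool
        private final UserDao userDao;  // hypothetical DAO backed by the database

        public UserService(Jedis jedis, UserDao userDao) {
            this.jedis = jedis;
            this.userDao = userDao;
        }

        // Read: cache first, then database; backfill the cache on a miss.
        public String getUser(String id) {
            String cached = jedis.get("user:" + id);
            if (cached != null) {
                return cached;
            }
            String fromDb = userDao.loadUser(id);
            if (fromDb != null) {
                jedis.setex("user:" + id, 600, fromDb);  // only complete data is written
            }
            return fromDb;
        }

        // Write: database first, then cache; deleting the key (instead of
        // updating it) narrows the window for stale data under racing writes.
        public void updateUser(String id, String data) {
            userDao.saveUser(id, data);
            jedis.del("user:" + id);
        }
    }

    interface UserDao {
        String loadUser(String id);
        void saveUser(String id, String data);
    }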

Good Practice 15

When using local caches (such as Ehcache), it is important to strictly control the number of cached objects and their life cycle. Due to the nature of the JVM, too many cached objects will greatly affect JVM performance and can even cause an out-of-memory error.
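A sketch of capping both dimensions, assuming Ehcache 2.x (the cache name and limits are illustrative):

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.config.CacheConfiguration;

    public class LocalCacheSetup {
        public static void main(String[] args) {
            CacheManager manager = CacheManager.create();
            // Cap the entry count and the lifetime of each entry; both
            // numbers must be sized against the available heap.
            CacheConfiguration config = new CacheConfiguration("users", 10_000) // max 10k entries
                    .timeToLiveSeconds(300)   // hard lifetime per entry
                    .timeToIdleSeconds(120);  // evict entries idle for 2 minutes
            manager.addCache(new Cache(config));
        }
    }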

Good Practice 16

When using a cache, there must be degradation handling, especially on key business links: if the cache has problems or fails, the request must be able to fall back to the source database so that processing can continue.
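A minimal degradation sketch, reusing the hypothetical jedis and userDao fields from the Good Practice 14 example: any cache failure is caught and the request goes back to the source database:

    // Fall back to the database when the cache call fails, instead of
    // letting the failure interrupt the business logic.
    public String getUserWithFallback(String id) {
        try {
            String cached = jedis.get("user:" + id);
            if (cached != null) {
                return cached;
            }
        } catch (Exception e) {
            // Cache is down or slow: record a metric and raise an alarm here,
            // then degrade to the source so processing continues.
        }
        return userDao.loadUser(id);
    }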

3. Online examples of common caching problems

Case 1

Symptom: The database load of an application spikes for a short period.

Cause: A large number of the application's cache keys were given the same fixed expiration time. When they expired together, requests hit the database simultaneously for a period of time, putting heavy pressure on it.

Conclusion: A cache must be deliberately designed before use, with full consideration of how to avoid common problems such as cache penetration, cache avalanche, and cache concurrency. In particular, for caches under high concurrency, set the expiration time of keys randomly, for example 10 seconds + random(2), i.e., a random expiration time between 10 and 12 seconds.
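A sketch of that rule, assuming Jedis (the key and value are illustrative); ThreadLocalRandom spreads the TTL over 10-12 seconds:

    import java.util.concurrent.ThreadLocalRandom;
    import redis.clients.jedis.Jedis;

    public class JitteredTtl {
        // Keys written together no longer expire at the same instant.
        public static void put(Jedis jedis, String key, String value) {
            int ttl = 10 + ThreadLocalRandom.current().nextInt(0, 3);  // 10, 11, or 12 seconds
            jedis.setex(key, ttl, value);
        }
    }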

Case 2

Symptom: Around a system migration, core operations were processed twice, once in each of the two systems.

Cause: During the migration, duplicate traffic reached different nodes. The migration switch was stored in a local cache, and at the moment the switch was flipped the nodes had inconsistent switch states: on some nodes it was on, on others it was off. Duplicate requests handled by different nodes therefore took different paths, one going through the switch-on logic and the other through the switch-off logic.

Summary: Avoid using a local cache to store the migration switch; the switch state should instead be marked on the stateful orders themselves.

Case 3

Symptom: A module was designed to use caching to speed up database reads, but the database load did not decrease significantly.

Cause: The data requested by this module's users did not exist in the database (the requests were for invalid data), so the cache never hit, every request penetrated through to the database, and the volume was large.

Summary: A cache design is necessary before use, with full consideration of how to avoid common problems such as cache penetration, cache avalanche, and cache concurrency. In particular, for caches under high concurrency, invalid keys must also be cached, to prevent malicious or unintentional queries for nonexistent data from hammering the database.
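One common remedy is to cache a sentinel value for keys that are known not to exist, with a short TTL. A sketch extending the Good Practice 14 example (the sentinel string and both TTLs are assumptions):

    private static final String NULL_SENTINEL = "__NULL__";  // illustrative marker

    public String getUser(String id) {
        String cached = jedis.get("user:" + id);
        if (NULL_SENTINEL.equals(cached)) {
            return null;  // known miss: do not touch the database again
        }
        if (cached != null) {
            return cached;
        }
        String fromDb = userDao.loadUser(id);
        if (fromDb == null) {
            jedis.setex("user:" + id, 60, NULL_SENTINEL);  // short TTL for misses
            return null;
        }
        jedis.setex("user:" + id, 600, fromDb);
        return fromDb;
    }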

Case 4

Symptom: The monitoring system raises an alarm because a single hash key in Redis occupies a large amount of space.

Cause: The application used a hash key. The hash key itself has an expiration time, but the individual key-value pairs inside it do not, so they accumulate without bound.

Summary: When designing for Redis, if you have a large number of key-value pairs to store, use the string data type and set an expiration time on each key; do not store an unbounded collection of data inside one hash key. In fact, whether for a cache, an in-memory store, or a database design, whenever you use a collection data structure you should consider setting a maximum size limit to avoid running out of memory; the most common problem is memory overflow caused by collection overflow.
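A sketch of the suggested layout, assuming Jedis (the key scheme and TTL are illustrative): one string key per pair, each with its own expiry, instead of one ever-growing hash:

    import redis.clients.jedis.Jedis;

    public class SessionStore {
        private final Jedis jedis;

        public SessionStore(Jedis jedis) {
            this.jedis = jedis;
        }

        // Instead of hset("sessions", userId, data) on a single unbounded
        // hash, each pair becomes its own key with its own TTL.
        public void save(String userId, String data) {
            jedis.setex("session:" + userId, 1800, data);  // 30-minute TTL
        }
    }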

Case 5

Symptom: A cache outage in one service interrupted the business logic and left data inconsistent.

Cause: Redis performed a master/standby switchover, and for a moment the application could not connect to Redis; the application had no cache degradation in place.

Bottom line: For core business, there must be a degradation plan when using a cache. A common degradation scheme is to reserve enough capacity at the database level: when part of the cache has problems, the application is temporarily allowed to go back to the source database so that the business logic continues rather than being interrupted. This requires strict capacity evaluation; see Chapter 3 of Distributed Service Architecture: Principles, Design and Practice.

Case 6

Symptom: The application throws java.lang.OutOfMemoryError: GC overhead limit exceeded.

Cause: This is a legacy project that uses the Hibernate ORM framework with the Hibernate second-level cache enabled, backed by Ehcache. However, the number of cached objects in Ehcache was never limited; as cached objects accumulated, memory became strained and the JVM performed frequent GC until it gave up.

Conclusion: When using local caches (such as Ehcache, OSCache, or plain application memory), strictly control the number of cached objects and their life cycle.

Case 7

Symptom: A previously healthy application suddenly triggers an alarm for an excessive number of threads, and shortly afterwards runs out of memory.

Cause: The number of cache connections reached the maximum limit, so the application could not connect to the cache, and a large timeout had been configured. As a result, all requests that accessed the cache were stuck waiting for the cache operation to return: the cache was under high load and could not process every request, yet the service threads kept waiting without timing out, so they could neither degrade nor fall back to the database. In BIO mode the service's thread pool fills up, and the caller's thread pool fills up in turn; in NIO mode the load on the service rises, the service responds slowly, and it may even be overwhelmed.

Conclusion: When using remote caches (such as Redis or Memcached), be sure to set a timeout on cache operations. Caches are meant to speed up database reads and should degrade gracefully when they fail, so a short timeout is recommended, ideally less than 100 milliseconds.
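A sketch of these settings with a Jedis connection pool, assuming Jedis 3.x (all numbers are illustrative): cap the pool, fail fast when it is exhausted, and keep the socket timeout short:

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.JedisPool;
    import redis.clients.jedis.JedisPoolConfig;

    public class PoolSetup {
        public static void main(String[] args) {
            JedisPoolConfig config = new JedisPoolConfig();
            config.setMaxTotal(100);      // hard cap on connections
            config.setMaxWaitMillis(50);  // fail fast when the pool is exhausted
            // The last argument is the connect/read timeout in milliseconds.
            JedisPool pool = new JedisPool(config, "cache-host", 6379, 100);
            try (Jedis jedis = pool.getResource()) {
                jedis.get("some:key");    // throws quickly instead of blocking a thread
            }
        }
    }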

Case 8

Symptom: A project used a cache to store business data. After the project went live, errors occurred and the developers were at a loss.

Cause: The developers did not know how to find, troubleshoot, locate, and resolve cache problems.

Summary: A cache design must include a degradation scheme; when problems occur, degrade first. It should also include complete monitoring and alerting to help developers quickly discover cache problems, and then locate and resolve them.

Case 9

Symptom: A project that used a cache passed development testing, but after going into production the service had unexpected problems.

Cause: The application's cache keys conflicted with another application's cache keys, so the values overwrote each other and caused logic errors.

Summary: When using a cache, an isolation design is required: either physical isolation through separate cache instances, or logical isolation by giving each application's cache keys a distinct prefix (see the prefix sketch under Good Practice 6).