1. Introduction

1-1. What is TMC

TMC (“Transparent Multilevel Cache”) is a holistic caching solution for internal applications, provided by the Youzan PaaS team.

TMC adds the following capabilities on top of a general distributed cache solution (such as Codis proxy + Redis, or ZanKV):

  • Application layer hotspot detection
  • Application layer local cache
  • Application layer cache hit statistics

Together, these help the application layer solve hotspot access problems when using the cache.

1-2. Why we built TMC

A large number and wide variety of e-commerce merchants use the Youzan service, and merchants irregularly run activities such as flash sales and product promotions. This produces cache hotspot access in applications along the request chain, such as “marketing activity”, “product detail”, and “transaction order”:

  • Information such as activity time, activity type, and activity products is not known in advance, so cache hotspot access cannot be predicted ahead of time;
  • While a hotspot is occurring, a small number of hot keys at the application layer generate a large volume of cache access requests, which hammer the distributed cache system, consume large amounts of intranet bandwidth, and ultimately harm the stability of the application-layer system.

To address these issues, we needed a solution that automatically detects hotspots and front-loads hotspot cache access requests into local caches at the application layer; this is where TMC comes in.

1-3. Pain points of multi-level caching solutions

Based on the above, we summarized the following pain points that a multi-level caching solution must address:

  • Hotspot detection: How can hot keys be found quickly and accurately?
  • Data consistency: How can consistency be guaranteed between the application-layer local cache and the distributed cache system?
  • Effect verification: How can the application layer view local cache hit ratios and hot key data to verify the effect of multi-level caching?
  • Transparent access: How can the overall solution minimize intrusion into application systems and enable fast, smooth adoption?

Focusing on these pain points, TMC was designed and implemented as an overall solution that supports hotspot detection and local caching, reducing the impact of hotspot access on downstream distributed cache services and protecting the performance and stability of application services.

2. Overall architecture of TMC

The overall architecture of TMC is shown in the figure above and is divided into three layers:

  • Storage layer: provides basic KV data storage; different storage services (Codis / ZanKV / Aerospike) are chosen for different business scenarios;
  • Proxy layer: provides the application layer with unified cache access and a unified communication protocol, and handles routing and forwarding after horizontal sharding of distributed data;
  • Application layer: provides a unified client for application services, with built-in capabilities such as hotspot detection and local caching that are transparent to the business.

This article focuses on the “hotspot detection” and “local cache” functions of the application layer client.

3. TMC local cache

3-1. How is it transparent?

How does TMC reduce intrusion into business application systems and achieve transparent access?

For the company’s Java application services, there are two common ways of using the caching client:

  • Based on the spring.data.redis package, writing business code with RedisTemplate;
  • Based on the youzan.framework.redis package, writing business code with RedisClient.

Either way, the underlying Jedis objects created by JedisPool ultimately carry out the request interaction with the cache server proxy layer.

TMC modified the JedisPool and Jedis classes of the native Jedis package: the “hotspot discovery” + “local cache” initialization logic of the Hermes-SDK package is wired in during JedisPool initialization, and the Jedis client interacts with Hermes-SDK before reaching the cache server proxy layer, achieving transparent access to “hotspot detection” + “local cache”.

Java application services therefore only need to depend on a specific version of the Jedis JAR; no code changes are required to adopt TMC’s “hotspot discovery” + “local cache” features, minimizing intrusion into the application system.
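To make the mechanism concrete, here is a minimal sketch of what such a hook could look like. HermesBootstrap is a hypothetical stand-in; the article does not publish the real initialization API inside the patched Jedis JAR.

```java
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

// Illustrative sketch only: a TMC-style JedisPool that bootstraps the
// hotspot/local-cache machinery at construction time. HermesBootstrap is
// a hypothetical stand-in, not the real Hermes-SDK API.
public class TmcJedisPool extends JedisPool {

    public TmcJedisPool(JedisPoolConfig config, String host, int port) {
        super(config, host, port);
        // "Hotspot discovery" + "local cache" are initialized once here, so
        // application code keeps using the familiar Jedis API unchanged.
        HermesBootstrap.initOnce();
    }

    // Minimal stub so the sketch is self-contained.
    static final class HermesBootstrap {
        private static volatile boolean started;

        static synchronized void initOnce() {
            if (!started) {
                started = true;
                // ... start hot-key listener, local cache, reporting pipeline ...
            }
        }
    }
}
```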

3-2. Overall structure

3-2-1. Module division

The overall structure of TMC local cache is divided into the following modules:

  • Jedis-Client: the direct entry point through which Java applications interact with the cache server; its interface definition is identical to the native Jedis-Client;
  • Hermes-SDK: a self-developed SDK package for “hotspot discovery + local cache”; Jedis-Client interacts with it to integrate those capabilities;
  • Hermes server cluster: receives the cache access data reported by Hermes-SDK, performs hotspot detection, and pushes hot keys to Hermes-SDK for local caching;
  • Cache cluster: consists of the proxy layer and the storage layer, providing a unified distributed cache service entry point for application clients;
  • Basic components: the etcd cluster and the Apollo configuration center, which give TMC “cluster push” and “unified configuration” capabilities.

3-2-2. Basic processes

1) Reading a key’s value

  • When a Java application calls the Jedis-Client interface to read a key’s cached value, Jedis-Client asks Hermes-SDK whether the key is currently a hot key;
  • For a hot key, the value is obtained directly from the local cache of the Hermes-SDK hot module without touching the cache cluster, so the access request is front-loaded at the application layer;
  • For a non-hot key, Hermes-SDK calls back the native Jedis-Client interface through a Callable to fetch the value from the cache cluster;
  • For every key access request made through Jedis-Client, Hermes-SDK asynchronously reports a key access event to the Hermes server cluster via its communication module, so that hotspot detection can run on the reported data. A sketch of this flow follows the list.
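The read path can be sketched as follows. All Hermes-SDK names here (isHotKey, getLocalValue, loadThrough, reportAccessEventAsync) are illustrative assumptions, not the real Hermes-SDK API.

```java
import java.util.concurrent.Callable;

// Minimal sketch of the read path, under the assumption that Hermes-SDK
// exposes something like the interface below; the real API may differ.
public class TmcReadPath {

    interface HermesSdk {
        boolean isHotKey(String key);                        // hot-key check
        String getLocalValue(String key);                    // local hot cache
        String loadThrough(String key, Callable<String> c);  // cluster fallback
        void reportAccessEventAsync(String key);             // async reporting
    }

    private final HermesSdk hermes;

    TmcReadPath(HermesSdk hermes) {
        this.hermes = hermes;
    }

    /** Mirrors what a TMC-modified Jedis get(key) would do. */
    String get(String key) {
        // Every access is reported asynchronously for hotspot detection.
        hermes.reportAccessEventAsync(key);
        if (hermes.isHotKey(key)) {
            // Hot key: served from the application-layer local cache,
            // never touching the cache cluster.
            return hermes.getLocalValue(key);
        }
        // Non-hot key: Hermes-SDK calls back the native Jedis-Client
        // interface (the Callable) to read from the cache cluster.
        return hermes.loadThrough(key, () -> nativeGet(key));
    }

    private String nativeGet(String key) {
        return "value-from-cache-cluster"; // placeholder for the real network call
    }
}
```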

2) Invalidating a key’s value

  • When a Java application calls the Jedis-Client set(), del(), or expire() interfaces, the corresponding key’s value becomes invalid; Jedis-Client synchronously calls the invalid() method of Hermes-SDK to notify it of the “key invalidation” event;
  • For a hot key, the Hermes-SDK hot module first invalidates the key’s locally cached value, achieving strong local consistency; at the same time, the communication module asynchronously pushes the invalid-key event through the etcd cluster to the Hermes-SDK nodes of the other Java application instances;
  • When the communication module of another Hermes-SDK node receives the invalid-key event, it calls the hot module to invalidate the key’s locally cached value, achieving eventual consistency across the cluster (sketched below).
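A sketch of this two-step invalidation follows, assuming illustrative HotModule/CommModule interfaces (the article does not publish the real ones):

```java
// Sketch of the invalidation flow: the local drop is synchronous (strong
// local consistency); the etcd fan-out is asynchronous (eventual cluster
// consistency). Interface names are illustrative.
public class InvalidationSketch {

    interface HotModule  { void invalidateLocal(String key); }
    interface CommModule { void broadcastInvalidAsync(String key); } // via etcd

    private final HotModule hot;
    private final CommModule comm;

    InvalidationSketch(HotModule hot, CommModule comm) {
        this.hot = hot;
        this.comm = comm;
    }

    /** Called synchronously by Jedis-Client after set()/del()/expire(). */
    void invalid(String key) {
        hot.invalidateLocal(key);        // 1. drop our own local copy first
        comm.broadcastInvalidAsync(key); // 2. fan out to peer Hermes-SDK nodes
    }

    /** Called when a peer's invalid-key event arrives from etcd. */
    void onPeerInvalidEvent(String key) {
        hot.invalidateLocal(key);        // eventual consistency across the cluster
    }
}
```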

3) Hotspot discovery

  • The Hermes server cluster continuously collects the key access events reported by Hermes-SDK and periodically (every 3 s) analyzes the cache access data of each service application cluster to detect its hot key list;
  • For each detected hot key list, the Hermes server cluster pushes it through the etcd cluster to the Hermes-SDK communication modules of the corresponding service application cluster, instructing them to cache those hot keys locally.

4) Configuration reads

  • At startup and during operation, Hermes-SDK reads the configuration it cares about (e.g., on/off switches, whitelist configuration, etcd addresses, …) from the Apollo configuration center;
  • At startup and during operation, the Hermes server cluster reads the configuration it cares about (e.g., the service application list, hotspot threshold configuration, etcd addresses, …) from the Apollo configuration center, as sketched below.
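For reference, reading such settings with the open-source Apollo Java client looks roughly like this; the property keys are invented for illustration and are not TMC’s real configuration keys.

```java
import com.ctrip.framework.apollo.Config;
import com.ctrip.framework.apollo.ConfigService;

// Sketch of reading TMC-style settings from Apollo. The Apollo client API
// is the real open-source one; the property keys are hypothetical.
public class TmcConfigReader {

    public static void main(String[] args) {
        Config config = ConfigService.getAppConfig();

        boolean enabled       = config.getBooleanProperty("tmc.enabled", false);
        String  etcdEndpoints = config.getProperty("tmc.etcd.endpoints", "");
        int     hotThreshold  = config.getIntProperty("tmc.hot.threshold", 5000);

        // Settings are re-read during operation as well as at startup:
        config.addChangeListener(event ->
                System.out.println("TMC config changed: " + event.changedKeys()));

        System.out.printf("enabled=%s etcd=%s threshold=%d%n",
                enabled, etcdEndpoints, hotThreshold);
    }
}
```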

3-2-3. Stability

The stability of the TMC local cache rests on the following aspects:

  • Asynchronous data reporting: Hermes-SDK reports key access events asynchronously via rsyslog, so reporting never blocks business threads;
  • Communication-module thread isolation: the Hermes-SDK communication module uses an independent thread pool and bounded queues, isolating the I/O of event reporting and listening from business execution threads, so even unexpected exceptions cannot affect basic business functions;
  • Cache capacity control: the Hermes-SDK hot module caps the size of the local cache so that memory usage never exceeds 64 MB (with LRU eviction), eliminating the risk of JVM heap overflow; a minimal LRU sketch follows the list.
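A minimal sketch of such a bounded LRU map, assuming an entry-count cap: the real hot module bounds memory at 64 MB, so counting entries is only the simplest analogue, and the map needs external synchronization under concurrency.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Entry-count-bounded LRU cache. The real Hermes-SDK bounds bytes (64 MB);
// this sketch bounds entries instead, which is the simplest safe analogue.
public class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    public BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true gives LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once the cap is reached, so
        // the local cache can never grow without bound.
        return size() > maxEntries;
    }
}
```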

3-2-4. Consistency

The consistency of the TMC local cache rests on the following aspects:

  • The Hermes-SDK hot module caches only hot key data; the vast majority of non-hot keys are stored by the cache cluster;
  • When a hot key’s value changes and becomes invalid, Hermes-SDK synchronously invalidates its own local cache, guaranteeing strong local consistency;
  • When a hot key’s value changes and becomes invalid, Hermes-SDK also broadcasts the event through the etcd cluster to asynchronously invalidate the local caches of the other nodes in the service cluster, guaranteeing eventual consistency across the cluster.

4. TMC hotspot discovery

4-1. Overall process

The TMC hotspot discovery process consists of four steps:

  • Data collection: collect the key access events reported by Hermes-SDK;
  • Heat sliding window: maintain a time wheel for each key of each App, recording the key’s access heat over a sliding window relative to the current moment;
  • Heat aggregation: aggregate all of an App’s keys as <key, heat> pairs, summarized in heat order;
  • Hotspot detection: for each App, extract the TopN hot keys from the heat-sorted aggregation result and push them to Hermes-SDK.

4-2. Data collection

Hermes-SDK writes key access events in the protocol format to Kafka via the local rsyslog, and each node of the Hermes server cluster consumes the Kafka messages to receive key access events in real time.

The access event protocol format is as follows:

  • AppName: the service application to which the reporting cluster node belongs;
  • UniqueKey: the key of the key access event;
  • SendTime: the occurrence time of the key access event;
  • Weight: the access weight of the key access event.

Hermes server cluster nodes store the collected key access events in local memory in the data structure Map<appName, Map<uniqueKey, heat>>.
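A sketch of the event and this aggregation structure follows; the field names come from the protocol above, everything else is illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Key access event as described by the protocol fields above.
public class KeyAccessEvent {
    final String appName;   // service application of the reporting node
    final String uniqueKey; // key of the access event
    final long   sendTime;  // occurrence time (epoch millis)
    final int    weight;    // access weight of the event

    KeyAccessEvent(String appName, String uniqueKey, long sendTime, int weight) {
        this.appName = appName;
        this.uniqueKey = uniqueKey;
        this.sendTime = sendTime;
        this.weight = weight;
    }
}

// In-memory Map<appName, Map<uniqueKey, heat>> accumulated from events.
class HeatStore {
    private final Map<String, Map<String, Long>> heat = new ConcurrentHashMap<>();

    void accumulate(KeyAccessEvent e) {
        heat.computeIfAbsent(e.appName, a -> new ConcurrentHashMap<>())
            .merge(e.uniqueKey, (long) e.weight, Long::sum);
    }
}
```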

4-3. Heat sliding window

4-3-1. Sliding time window

The Hermes server cluster node maintains a time wheel for each key of each App:

  • Each time wheel contains 10 time slices, and each time slice records the key’s total access count within its 3-second period;
  • The accumulated counts of the 10 time slices therefore represent the key’s total access count within the 30-second window preceding the current moment. A sketch of such a time wheel follows the list.
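One way to implement such a wheel is shown below: 10 slots of 3 seconds each, where a slot is reset before reuse if it still holds data from a previous revolution. This is a generic sliding-window counter sketch, not the Hermes server’s actual code.

```java
// Per-key time wheel: 10 slices of 3 seconds each, covering a 30-second
// sliding window. Slot layout and stale-slot reset are illustrative.
public class TimeWheel {

    private static final int SLICES = 10;        // 10 slices per wheel
    private static final long SLICE_MS = 3_000;  // each slice covers 3 seconds

    private final long[] counts = new long[SLICES];
    private final long[] sliceEpoch = new long[SLICES]; // which period a slot holds

    /** Record `heat` accesses happening at `nowMillis`. */
    public synchronized void add(long nowMillis, long heat) {
        long epoch = nowMillis / SLICE_MS;       // global 3-second period number
        int idx = (int) (epoch % SLICES);
        if (sliceEpoch[idx] != epoch) {          // slot still holds an old period:
            sliceEpoch[idx] = epoch;             // reset it before reuse
            counts[idx] = 0;
        }
        counts[idx] += heat;
    }

    /** Total heat over the 30-second window ending at `nowMillis`. */
    public synchronized long windowTotal(long nowMillis) {
        long epoch = nowMillis / SLICE_MS;
        long total = 0;
        for (int i = 0; i < SLICES; i++) {
            if (epoch - sliceEpoch[i] < SLICES) { // only count fresh slices
                total += counts[i];
            }
        }
        return total;
    }
}
```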

4-3-2. Mapping tasks

Every 3 seconds, a Hermes server cluster node generates a mapping task for each App, which is executed by the node’s “cache mapping thread pool”. The mapping task does the following:

  • For the current App, take the App’s Map<uniqueKey, heat> out of Map<appName, Map<uniqueKey, heat>>;
  • Traverse that Map<uniqueKey, heat>; for each key, take its heat and store it in the corresponding time slice of the key’s time wheel (see the sketch below).
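A sketch of the 3-second mapping task, with HeatSnapshots and TimeWheelRegistry as illustrative interfaces over the structures from the previous sketches:

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the per-App mapping task running every 3 seconds on the node's
// "cache mapping thread pool". Interface names are illustrative.
public class MappingTaskScheduler {

    interface HeatSnapshots    { Map<String, Long> drain(String appName); }
    interface TimeWheelRegistry { TimeWheel wheelFor(String appName, String key); }

    private final ScheduledExecutorService pool =
            Executors.newScheduledThreadPool(4); // the "cache mapping thread pool"

    void schedule(HeatSnapshots store, TimeWheelRegistry wheels, String appName) {
        pool.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            // Take the App's accumulated Map<uniqueKey, heat> snapshot ...
            Map<String, Long> perKeyHeat = store.drain(appName);
            // ... and write each key's heat into its time wheel slice.
            perKeyHeat.forEach((key, heat) ->
                    wheels.wheelFor(appName, key).add(now, heat));
        }, 0, 3, TimeUnit.SECONDS);
    }
}
```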

4-4. Heat aggregation

After completing the second step, “heat sliding window”, the mapping task continues with “heat aggregation” for the current App:

  • Sum the heat across each key’s time wheel (i.e., the total heat within the 30-second window) to obtain the key’s sliding-window total heat at the moment of detection;
  • Store the <key, sliding-window total heat> pairs as a sorted set in the Redis storage service; this sorted set is the heat aggregation result.

4-5. Hotspot detection

  • Through the preceding steps, a mapping task runs every 3 seconds and produces, for each App, a heat aggregation result for the current moment;
  • The hotspot detection node in the Hermes server cluster periodically reads, from each App’s latest heat aggregation result, the TopN keys whose heat reaches the threshold, thereby obtaining each App’s hot key list (sketched below).
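With the aggregation result kept in a Redis sorted set, both steps map directly onto sorted-set commands. The Jedis calls below are standard; the “tmc:heat:” key prefix and the threshold are illustrative assumptions.

```java
import java.util.Set;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Tuple;

// Heat aggregation + hotspot detection over a Redis sorted set, using the
// standard Jedis API. The "tmc:heat:" key prefix is illustrative.
public class HotspotDetector {

    /** Store a key's 30-second sliding-window total in the App's sorted set. */
    static void aggregate(Jedis redis, String appName, String key, long windowTotal) {
        redis.zadd("tmc:heat:" + appName, windowTotal, key);
    }

    /** Fetch the TopN keys whose heat reaches the hotspot threshold. */
    static Set<Tuple> detectTopN(Jedis redis, String appName, long threshold, int n) {
        // Highest heat first, scores bounded below by the threshold.
        return redis.zrevrangeByScoreWithScores(
                "tmc:heat:" + appName, "+inf", String.valueOf(threshold), 0, n);
    }
}
```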

The overall TMC hotspot discovery process is shown as follows:

4-6. Summary of features

4-6-1. Real-time detection

Hermes-SDK reports key access events in real time via rsyslog + Kafka, and the mapping tasks complete the “heat sliding window” and “heat aggregation” work every 3 seconds. When a hotspot access scenario occurs, the corresponding hot key can be detected within 3 seconds at most.

4-6-2. Accuracy

A key’s heat aggregation result is produced by the time-wheel-based sliding window, which fairly accurately reflects the access distribution at, and shortly before, the current moment.

4-6-3. Scalability

The nodes of the Hermes server cluster are stateless, and the number of nodes can be scaled out horizontally up to the number of Kafka partitions.

Within a single node, the “heat sliding window” + “heat aggregation” work scales across multiple threads with the number of Apps.

5. TMC in practice

5-1. A Youzan merchant’s product marketing campaign

A Youzan merchant ran a campaign for a certain product on the Kuaishou live-streaming platform, which caused the product to be accessed intensively within a short period, creating a hotspot. The actual hotspot access data recorded by TMC during the campaign is shown below.

5-1-1. Cache requests & hit ratio of a core application

  • The blue line in the figure above is the number of times the application cluster called get() to access the cache;
  • The green line is the number of those cache operations that hit the TMC local cache.

  • The figure above shows the local cache hit-ratio curve.

As the charts show, both the number of cache requests and the number of local cache hits rose sharply during the campaign, and the local cache hit ratio reached nearly 80%; that is, 80% of the application cluster’s cache read requests were intercepted by the TMC local cache.

5-1-2. How the hotspot cache accelerates application access

  • The figure above shows the QPS curve of the application interface

  • The figure above shows the RT curve of the application interface

As the charts show, the application interface’s request volume rose sharply during the campaign, yet its RT dropped, thanks to the TMC local cache.

5-2. Results of some TMC-enabled applications during Double Eleven

5-2-1. Core application results in the product domain

5-2-2. Core application results in the activity domain

6. Outlook for TMC

TMC already serves Youzan’s core application modules such as the product center, logistics center, inventory center, marketing activities, user center, and gateway & messaging, and more applications are being onboarded.

While providing the core capabilities of “hotspot detection” + “local cache”, TMC also offers application services flexible configuration options: according to their actual business situation, applications can freely tune dimensions such as the hotspot threshold, the number of hot keys to detect, and hotspot black/white lists in order to achieve better results.

And of course, TMC continues to iterate…