The database suddenly disconnects, a third-party interface stops returning results, or the network jitters during peak hours… When a program fails unexpectedly, an application can simply tell the caller or user, "Sorry, there is a problem with the server," or it can find a better way to preserve the user experience.

I. Background

When users swipe through the Hornet's Nest App, the recommendation system needs to continuously serve content they may be interested in. This is mainly divided into several steps: recalling candidate content computed by various machine learning algorithms according to user features and the business scenario, then ranking that content and returning it to the front end.

The recommendation procedure involves MySQL and Redis queries, REST service calls, data processing, and more, and its latency requirements are strict: the Hornet's Nest recommendation system requires an average processing latency of 10 ms per request, with the 99th-percentile latency kept within 1 s.

When an external or internal system fails, the recommendation system cannot return data to the front end within the time limit, so users cannot load new content, which hurts the user experience.

Therefore, we wanted to design a disaster recovery (DR) cache service that returns cached data to the front end when the application itself or a dependent service times out or otherwise fails, reducing the number of empty results and ensuring the returned data is still as interesting as possible to users.

II. Design and Implementation

Design ideas and technology selection

Recommendation systems are not special here: caching is already used in many systems, from the commonly used boxed integers in the JVM to web users' session state. The purposes vary, some caches exist for efficiency and others as a fallback, and so do the requirements: some demand consistency, some do not. The appropriate caching scheme must be chosen for the business scenario.

Combining the business scenarios and requirements above, we adopted a scheme based on the OHC off-heap cache and Spring Boot to add a local DR cache to the existing recommendation system. The main considerations were the following:

1. Avoid affecting online services and isolate the service logic from the cache logic

To avoid affecting online services, we encapsulated the cache system as a CacheService, configured at the end of the existing process, and exposed read/write APIs for external calls, isolating the business logic from the cache logic.

2. Write cache asynchronously to improve performance

Both reading and writing the cache take time, writing especially so. To improve performance, we made cache writes asynchronous, implemented with the JDK's ThreadPoolExecutor: the main thread only submits tasks to the thread pool, and a worker thread in the pool writes them to the cache.

3. Local cache to improve access speed

In the recommendation system, the content recommended to users runs to thousands of screens, and even the same user may see different content on every refresh, so the cache does not require strong consistency. Therefore we only cache locally rather than in a distributed way, using the open source caching tool OHC to cache the data of successfully processed requests.

4. Back up cache instances to ensure availability

To ensure cache availability, we not only cache in memory but also periodically back the cache up to the file system, so that on startup the application can load it from the file system back into memory. This is implemented with the scheduled tasks and ApplicationRunner facilities provided by Spring Boot.

The overall architecture

We kept the existing logic of the recommendation system and, at the end of the existing process, configured CacheModule and CacheService to take care of all the cache-related logic.

CacheService implements caching and provides a read/write interface. The CacheModule processes the requested data and decides whether to call the CacheService cache.

Module Details

1. CacheModule

After a request is processed, the CacheModule checks whether an exception was thrown and whether the response is empty, and decides whether to read from the cache or submit a cache task.

The workflow for CacheModule is shown here, with the orange part representing the call to CacheService:

  • Submit a cache task. If the request did not throw an exception and the response is not empty, a cache task is submitted to the CacheService. The task's key is the corresponding business scenario, and its value is the content computed in the response. Submission is non-blocking and has little impact on the interface's elapsed time.

  • Read cached data. If the application or a dependency throws an exception, the cache is read from the CacheService using the business scenario as the key and returned to the caller. If the user has already consumed all available data, there is no need to read the cache; instead, the user is informed promptly.
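The decision flow above can be sketched in a few lines of Java. The CacheService interface and the method names here are simplified assumptions for illustration, not the production API:

```java
import java.util.Collections;
import java.util.List;

// Simplified, assumed interface of the cache service described in the text.
interface CacheService {
    void submit(String key, List<String> screen); // non-blocking async write
    List<String> read(String key);                // returns a cached screen, or null
}

class CacheModule {
    private final CacheService cacheService;

    CacheModule(CacheService cacheService) {
        this.cacheService = cacheService;
    }

    // Called at the end of the recommendation pipeline.
    List<String> afterProcess(String scene, List<String> response, Exception error) {
        if (error == null && response != null && !response.isEmpty()) {
            // Success: cache the computed content for this scenario.
            cacheService.submit(scene, response);
            return response;
        }
        // Exception or empty response: fall back to cached data.
        List<String> cached = cacheService.read(scene);
        return cached != null ? cached : Collections.emptyList();
    }
}
```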

2. CacheService

CacheService is built on OHC, an off-heap cache implementation that originated in the Apache Cassandra project and was later split out as a standalone library. In addition, since our whole application is based on Spring Boot, we also use various facilities Spring Boot provides.

As mentioned above, there is no strong consistency requirement for caches, so we use local caches instead of distributed caches and abstract a CacheService class to maintain local caches.

(1) Data format

When the recommendation system returns data, it does so screen by screen, based on the business scenario and user features, and each screen can contain multiple content items. Therefore a key→set data format is adopted: the key identifies the business scenario, and the cached value is a set of screens.
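A minimal sketch of this key→set format, with illustrative names (one scenario key mapping to a set of screens, each screen a list of content IDs):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the key→set cache format: key = business scenario,
// value = set of cached "screens", each screen = one refresh worth
// of content item IDs. All names are illustrative assumptions.
public class KeySetCache {
    private final Map<String, Set<List<String>>> cache = new HashMap<>();

    public void addScreen(String scene, List<String> screen) {
        cache.computeIfAbsent(scene, k -> new HashSet<>()).add(screen);
    }

    public Set<List<String>> screens(String scene) {
        return cache.getOrDefault(scene, Set.of());
    }
}
```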

(2) Storage location

For Java applications, caches can be stored in memory or in hard disk files. Memory space is divided into heap memory and off-heap memory. We compared these approaches:

To ensure fast reads and writes and to keep cache GC from affecting online services, we chose off-heap memory as the cache space. OHC was originally part of the Apache Cassandra project and was later split out as a standalone open source off-heap caching tool. It can manage large amounts of off-heap memory while keeping the overhead of small cache entries low, so we use it as our off-heap cache implementation.
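A rough configuration sketch of constructing an OHC cache; the toy string serializer and the capacity/maxEntrySize values are illustrative assumptions, not the production setup:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.caffinitas.ohc.CacheSerializer;
import org.caffinitas.ohc.OHCache;
import org.caffinitas.ohc.OHCacheBuilder;

public class OhcCacheFactory {

    // Minimal string serializer for illustration; a real system would
    // serialize whole screens of content, e.g. with Kryo.
    static final CacheSerializer<String> STRING_SERIALIZER = new CacheSerializer<String>() {
        public void serialize(String value, ByteBuffer buf) {
            buf.put(value.getBytes(StandardCharsets.UTF_8));
        }
        public String deserialize(ByteBuffer buf) {
            byte[] bytes = new byte[buf.remaining()];
            buf.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
        public int serializedSize(String value) {
            return value.getBytes(StandardCharsets.UTF_8).length;
        }
    };

    public static OHCache<String, String> build() {
        return OHCacheBuilder.<String, String>newBuilder()
                .keySerializer(STRING_SERIALIZER)
                .valueSerializer(STRING_SERIALIZER)
                .capacity(64 * 1024 * 1024)      // total off-heap space (illustrative)
                .maxEntrySize(512 * 1024)        // per-entry size limit (illustrative)
                .build();
    }
}
```

As the pitfalls section notes, capacity and maxEntrySize should be sized from a pre-launch estimate of the cache's footprint, since undersized values cause writes to fail silently.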

(3) File backup

On application restart, the off-heap cache is empty. To populate it as quickly as possible, we use Spring Boot's scheduled tasks to periodically back the off-heap cache up to the file system, and a Spring Boot ApplicationRunner to hook into application startup: once the application has started, the backup files on disk are loaded into off-heap memory, ensuring cached data is available.
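The backup/restore cycle can be sketched with plain JDK serialization standing in for the OHC dump; in the real system a Spring @Scheduled method would trigger backup() and an ApplicationRunner would call restore() after startup. All names here are illustrative assumptions:

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Sketch of the periodic backup and startup restore described above.
public class CacheBackup {

    // Periodically persist the cache contents to the file system.
    public static void backup(Map<String, String> cache, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(new HashMap<>(cache));
        }
    }

    // On startup, load the backup (if any) back into memory.
    @SuppressWarnings("unchecked")
    public static Map<String, String> restore(Path file) throws IOException, ClassNotFoundException {
        if (!Files.exists(file)) {
            return new HashMap<>();           // no backup yet: start empty
        }
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (Map<String, String>) in.readObject();
        }
    }
}
```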

CacheService maintains a task queue that holds cache tasks submitted by CacheModule in a non-blocking manner. CacheService decides whether to execute these cache tasks.

(4) APIs for CacheModule

  • When reading the cache, the key is passed in, and a random element of the corresponding set is returned.

  • When writing to the cache, encapsulate the key and value as a task and submit it to the task queue. The task queue is responsible for writing to the cache asynchronously.
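The random read described in the first bullet might look like this; the names are illustrative assumptions:

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of the read path: pick one random screen from the cached set for
// a scenario, so repeated fallbacks do not always show the same content.
public class RandomReader {

    public static List<String> readRandom(Set<List<String>> screens) {
        if (screens == null || screens.isEmpty()) {
            return List.of();
        }
        int target = ThreadLocalRandom.current().nextInt(screens.size());
        int i = 0;
        for (List<String> screen : screens) {
            if (i++ == target) {
                return screen;
            }
        }
        return List.of(); // unreachable
    }
}
```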

(5) Task queue and asynchronous write

Here we use the JDK's thread pools. When constructing the thread pool, a LinkedBlockingQueue is used as the task queue for fast insertion and removal. Since the application's QPS is below 100, the number of worker threads is fixed at 1; when the queue is full, the DiscardPolicy silently drops new tasks instead of inserting them.
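The thread pool configuration described here can be sketched as follows; the queue capacity is an illustrative assumption:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch of the asynchronous write path: a single worker drains a bounded
// LinkedBlockingQueue, and DiscardPolicy silently drops tasks when the
// queue is full.
public class AsyncCacheWriter {

    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 1,                                    // single worker: QPS < 100
            0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>(1000),         // bounded task queue (assumed size)
            new ThreadPoolExecutor.DiscardPolicy()); // drop tasks when full

    // Non-blocking from the caller's point of view: just enqueue the task.
    public void submitWrite(Runnable writeTask) {
        pool.execute(writeTask);
    }

    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```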

(6) Cache quantity control

If the cache occupies too much memory, the online application will be affected, so a maximum number of cache entries can be configured per business scenario to control cache size. Below the configured limit, processed data is written to the cache; once the limit is reached, a randomly sampled existing entry is overwritten, keeping the cache fresh.
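A minimal sketch of this counting and random-overwrite policy, with illustrative names and a plain List standing in for the off-heap store:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of cache quantity control: each scenario has a configured maximum
// number of entries; below the limit new entries are appended, at the limit
// a randomly sampled existing entry is overwritten.
public class BoundedCache {
    private final int maxEntries;
    private final List<String> entries = new ArrayList<>();

    public BoundedCache(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    public synchronized void put(String entry) {
        if (entries.size() < maxEntries) {
            entries.add(entry);                 // still under the limit
        } else {
            // At capacity: overwrite a random existing entry so the
            // cache keeps absorbing fresh data.
            int victim = ThreadLocalRandom.current().nextInt(entries.size());
            entries.set(victim, entry);
        }
    }

    public synchronized int size() {
        return entries.size();
    }
}
```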

The CacheService design is as follows:

Online performance

To verify the effect of the DR cache, we instrumented cache hits and used Kibana to view the number of cache hits per hour. As shown in the figure, the system experienced some timeouts between 18:00 and 19:00, during which the cache service improved the system's availability.

We also monitored OHC's read and write speeds. Writing to the cache takes milliseconds and is asynchronous; reading takes microseconds. The cache adds essentially no extra latency to the system.

Pitfalls

Before writing to OHC, data must be serialized; we use the open source Kryo as the serialization tool. In earlier work with Kryo we found that deserialization can fail for some classes, such as the inner class java.util.ArrayList$SubList returned by List#subList, which does not implement Serializable. This can be solved by registering serializers manually; the open source kryo-serializers repository on GitHub provides serializers for many such types.
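The subList pitfall can be reproduced without Kryo at all, and sidestepped by a defensive copy; registering a dedicated serializer from kryo-serializers, as described above, is the alternative. Names here are illustrative:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Demonstrates the subList pitfall: ArrayList#subList returns an inner
// view class (java.util.ArrayList$SubList) that does not implement
// Serializable, so generic serialization can choke on it. Copying the
// view into a plain ArrayList before caching is a simple workaround.
public class SubListPitfall {

    public static boolean isSerializable(List<String> list) {
        return list instanceof Serializable;
    }

    // Defensive copy that is always safe to serialize.
    public static List<String> toSerializable(List<String> list) {
        return new ArrayList<>(list);
    }
}
```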

In addition, OHC's capacity and maxEntrySize must be configured according to the application scenario; if they are too small, cache writes will fail. Estimate the space the cache will occupy before going live, and set the size of the whole cache space and of each cache entry accordingly.

III. Optimization Directions

Based on Spring Boot and OHC, we added a local DR cache to the existing recommendation system, which can return cached data when dependent services or the application itself fail unexpectedly.

There are still some deficiencies in the cache system, and we will focus on the following optimization in the near future:

  • When the cache is full, the current implementation randomly overwrites existing entries. A future optimization is to replace the oldest cached entries instead.

  • In some scenarios the cache granularity is not fine enough; for example, recommendations on destination pages currently share a single cache key. In the future, a separate cache could be configured for each destination based on its destination ID.

  • At present the recommendation system still relies on MySQL for some configuration. In the future we will consider caching this configuration in local files as well.

[Reference]

1. Java Caching Benchmarks 2016 – Part 1

2. On Heap vs Off Heap Memory Usage

3. OHC – An off-heap-cache

4. kryo-serializers

5. scheduling-tasks

About the author: Sun Xingbin, back-end R&D engineer on Hornet's Nest's recommendation and search team.

(Original content from the Hornet's Nest technology team. When reproducing, please credit the source and keep the QR code image at the end of the article. Thank you for your cooperation.)