Preface

R2M is the distributed cache system behind JD Finance's large-scale online applications. It currently manages machines with a total memory capacity of more than 60 TB, nearly 600 Redis clusters, and over 9,200 Redis instances.

Its main features include fully web-based visual operation and maintenance (O&M), one-click deployment of cache clusters, unified resource pool management, online capacity expansion and fast data migration, multi-data-center switching and disaster recovery, comprehensive monitoring and alerting, and Redis API compatibility.

This article takes a deep look at R2M's system architecture, resource management, cluster expansion and migration, hot/cold data exchange, and multi-data-center disaster recovery. We hope readers will find it useful.

Advantages for business use and O&M

Extremely simple access

R2M is easy to integrate. Taking the Java client as an example, you only need to add the client JAR to your service and configure the application name (AppName) and ZooKeeper (ZK) address.
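As a minimal sketch of what this integration might look like (the class names R2mClientConfig and R2mClient are hypothetical stand-ins, since the actual client API is internal):

```java
// Hypothetical client bootstrap; the real class names in the R2M client SDK may differ.
public class CacheBootstrap {
    public static void main(String[] args) {
        R2mClientConfig config = new R2mClientConfig();
        config.setAppName("order-service");   // AppName registered on the R2M console
        config.setZkAddress("zk1.example.com:2181,zk2.example.com:2181"); // routing ZK address

        // The client resolves cache node addresses through ZK and manages connections itself.
        R2mClient client = R2mClient.create(config);
        client.set("greeting", "hello r2m");
        System.out.println(client.get("greeting"));
    }
}
```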

Automatic configuration delivery

The R2M client exposes settings such as connection pool size, read/write timeouts, and retry counts. In some scenarios these settings need to be tuned; for example, during the 618 and Double Eleven shopping peaks, some services shorten the client read/write timeout to avoid chain reactions caused by slow requests.

Normally, changing such settings requires redeploying every client of the service, which is time-consuming and risky. To avoid this, R2M provides automatic configuration delivery: once a setting is changed on the management console, it takes effect on all clients immediately, without redeploying the service.
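One common way to implement this kind of push-style delivery is a ZooKeeper data watch. The sketch below illustrates the idea only and is not R2M's actual implementation; the znode path and config format are assumptions:

```java
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only: watch a per-application config znode and re-apply settings on change.
public class ConfigWatcher implements Watcher {
    private final ZooKeeper zk;
    private final String configPath; // e.g. "/r2m/config/order-service" (assumed layout)

    public ConfigWatcher(ZooKeeper zk, String configPath) {
        this.zk = zk;
        this.configPath = configPath;
    }

    public void start() throws Exception {
        byte[] data = zk.getData(configPath, this, null); // registers the watch
        applyConfig(new String(data, StandardCharsets.UTF_8));
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
            try {
                // Re-read the config and re-register the watch (ZK watches are one-shot).
                byte[] data = zk.getData(configPath, this, null);
                applyConfig(new String(data, StandardCharsets.UTF_8));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    private void applyConfig(String raw) {
        // Parse timeouts / pool sizes and update the live connection pool in place.
        System.out.println("Applying new client config: " + raw);
    }
}
```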

Multiple data type API support

Fully compatible with the Redis APIs for each data type, including hashes, sets, sorted sets, lists, publish/subscribe, and so on, covering nearly 100 commands.

Support for hot/cold data exchange

While maintaining high performance and data consistency, R2M dynamically exchanges hot and cold data based on LFU access statistics: cold data is swapped out to SSD and hot data is kept in Redis memory, striking a good balance between performance and cost.

Fully web-based visual O&M

R2M provides visual O&M for all functions, including cluster deployment, capacity expansion, data migration, monitoring, cluster decommissioning, data clearing, faulty node replacement, and machine decommissioning, greatly improving O&M efficiency and service availability.

One-click multi-data-center switching

By modifying Redis Cluster to support multi-data-center disaster recovery and prevent cluster split-brain, and by making every system component capable of cross-data-center switching and election, data center switchover can be done with one click on the web console.

A variety of cluster synchronization and data migration tools

R2M provides a dedicated migration tool component that supports migration from native Redis to R2M, implementing a migration mechanism based on Redis RDB file transfer and incremental synchronization.

You can also specify rules such as migrating only part of the data by prefix matching or inverse matching. For historical reasons, some JD Finance businesses previously used JD Mall's Jimdb, so R2M also supports migration from Jimdb clusters.

The R2M migration tool also supports real-time data synchronization, both between R2M clusters and from Jimdb source clusters.

Proxy for non-Java languages

Our business developers mainly use Java. Non-Java clients, such as Python and C, access the cache through the R2M Proxy component. Various O&M operations are carried out at the Proxy layer and are therefore transparent to the business side. The Proxy also comes with its own high-availability scheme.

System architecture

Component functions

Web console

The visual O&M console of the R2M cache system. All O&M operations are performed on the web console.

Manager

The management component of the entire system. It issues all O&M operations and collects monitoring data. O&M operations include cluster creation, data migration, capacity expansion, and data center switching.

Agent

An Agent component is deployed on each physical machine. It is responsible for deploying and monitoring the Redis instances on that machine and reporting their status.

Cache cluster node

The nodes of each cluster are spread across different machines; together they form a decentralized distributed cluster with no single point of failure.

Client

The client is imported by the business service; for Java, it is pulled in as a JAR package.

Proxy

For non-Java services, the Proxy provides cache access.

One-click deployment of the cache cluster

Deploying a distributed cluster is usually troublesome: it involves multiple machines, multiple nodes, and a series of actions and configurations. For example, deploying a Redis Cluster involves installing and starting the nodes, node handshakes, master/replica assignment, slot allocation, and setting up replication.

Although Redis officially provides scripts for building clusters, machine selection, node installation, and configuration are still not automated, which is neither flexible nor convenient for enterprises that need to deploy and operate Redis Cluster at scale.

Therefore, R2M, implemented in Golang, automates everything from machine recommendation through node deployment, cluster construction, and node configuration with one click on the web console. During this process, the Agent on each machine downloads and installs the RPM packages and starts instances on the specified ports, while the Manager component completes cluster construction and node configuration.

Automatic deployment also enforces some validation conditions and priority rules, mainly the following:

1. Check that the number of master nodes and the master/replica policy meet the minimum requirements for distributed election and high availability: at least three masters, each with at least one replica.

2. Ensure that multiple nodes of the same cluster are not deployed on the same machine, so that a single machine failure cannot take out several nodes at once.

3. Recommend machines based on available memory, preferring those with more free memory.

4. Evenly allocate the 16,384 slots across the master nodes (see the sketch after this list).
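As a rough illustration of rule 4, the sketch below splits the 16,384 Redis Cluster slots into contiguous ranges, one per master. It shows the general approach only; R2M's actual allocator (written in Golang) may differ:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative slot planner: divide slots 0..16383 into contiguous ranges, one per master.
public class SlotPlanner {
    static final int TOTAL_SLOTS = 16384;

    // Returns [start, end] (inclusive) slot ranges, one per master node.
    static List<int[]> allocate(int masterCount) {
        List<int[]> ranges = new ArrayList<>();
        int base = TOTAL_SLOTS / masterCount;
        int remainder = TOTAL_SLOTS % masterCount; // first `remainder` masters get one extra slot
        int start = 0;
        for (int i = 0; i < masterCount; i++) {
            int size = base + (i < remainder ? 1 : 0);
            ranges.add(new int[]{start, start + size - 1});
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (int[] r : allocate(3)) {
            System.out.println(r[0] + "-" + r[1]); // e.g. 0-5461, 5462-10922, 10923-16383
        }
    }
}
```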

Resource planning and overall management

Because there are many data centers, machines, and services, resources must be planned properly in advance for platform-level management. Planning includes: when a service applies for access, it estimates its capacity so that the size and number of cache instances and the reserved memory can be allocated appropriately; for convenient management and statistics, machines also need to be grouped by data center or by business.

Beyond planning in advance, it is also necessary to keep a clear picture of actual resource usage, to avoid both over-use and serious over-allocation.

Serious over-allocation means the actual usage is far below the requested capacity. This happens a lot: capacity is hard to estimate accurately, and business developers often apply for more than they need to stay on the safe side or because they worry that expanding later will be difficult. In such cases, we can shrink the capacity with one click to avoid wasting resources.

Over-use means the actual resource usage of a machine exceeds safe limits, and it must be watched very carefully. Over-using a cache machine running Redis can have serious consequences, such as OOM, data eviction, and sharp performance degradation. To avoid over-use, reasonable resources must be reserved and an appropriate eviction policy chosen, and the platform must provide thorough monitoring and real-time alerting.

R2M provides hierarchical management and monitoring at the data center, group, machine, and instance levels, which makes resource planning easier, balances overall resource utilization, and prevents both resource waste and machine over-use.

Capacity expansion and data migration

When a business applies for a cache, we ask for an estimated capacity and allocate resources accordingly. But estimates are often inaccurate; plans never keep up with change, and new business scenarios, traffic growth, and data growth are always hard to predict.

A good capacity expansion mechanism is therefore particularly important; how well expansion works determines the system's scalability. In R2M, we decouple horizontal expansion into two steps: adding new nodes and migrating data. Adding new nodes is an automated deployment process that shares many steps with cluster creation, so the key question is how to handle data migration.

Data migration mainly has to answer the following questions:

1. How do we ensure that normal reads and writes are not affected, i.e., that the business side is almost unaware of the migration?

2. How do we keep the migration fast?

3. What happens if a client needs to read or write a piece of data that is in the middle of being migrated?

4. How do we ensure that the client never reads dirty data from either the source node or the target node?

5. During migration, should writes go to the source node or the target node?

6. After the migration completes, how is the client told to route reads and writes to the new node?

To solve these problems, we fully absorbed the strengths of Redis Cluster while also addressing some of its shortcomings.

Data Migration Principles

The figure above shows the data migration process of Redis Cluster, which is completed through cooperation among the source node, the target node, and the client.

Redis Cluster divides the data set into 16,384 slots. A slot is the smallest unit of data sharding, so a migration moves at least one slot. Within a slot, the smallest unit of migration is a key, and migrating a single key can be regarded as an atomic operation: reads and writes to that key are briefly blocked on both the source and target nodes, so there is no intermediate state in which the key is partially moved; it is always either on the source node or on the target node.

As shown in the figure above, when slot 9 is being migrated, it is marked as MIGRATING on the source node and IMPORTING on the target node, and its keys are then migrated one by one.

If a client accesses data in slot 9 at this point, the source node serves the read or write directly if the key has not yet been migrated; if the key has already been moved, the source node returns an ASK redirect containing the target node's address, telling the client to retry the request on the target node.

Once all the data in slot 9 has been migrated, a client that still reads or writes slot 9 on the source node receives a MOVED redirect. On receiving MOVED, the client refreshes the cluster state and updates its slot map, and all subsequent access to slot 9 goes to the new node. This completes the migration of one slot.
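A simplified sketch of how a smart client can handle these redirects is shown below. The types are hypothetical stand-ins, and real clients handle more edge cases (connection pooling, cluster-down states, and so on):

```java
// Illustrative redirect-handling loop in a smart client. The types below are
// hypothetical stand-ins for whatever the real client uses internally.
interface Connection {
    Object send(String... commandAndArgs) throws MovedException, AskException;
}

class MovedException extends Exception {
    final String targetAddress;
    MovedException(String targetAddress) { this.targetAddress = targetAddress; }
}

class AskException extends Exception {
    final String targetAddress;
    AskException(String targetAddress) { this.targetAddress = targetAddress; }
}

abstract class SmartClient {
    static final int MAX_REDIRECTS = 5;

    Object execute(String key, String... commandAndArgs) throws Exception {
        Connection conn = connectionForSlot(slotOf(key)); // from the cached slot map
        for (int attempt = 0; attempt < MAX_REDIRECTS; attempt++) {
            try {
                return conn.send(commandAndArgs);
            } catch (MovedException e) {
                // Slot ownership changed permanently: refresh the slot map and retry there.
                refreshSlotMap();
                conn = connectionTo(e.targetAddress);
            } catch (AskException e) {
                // Key is mid-migration: retry once on the target node after ASKING, but do
                // not update the slot map -- the slot still belongs to the source node.
                conn = connectionTo(e.targetAddress);
                conn.send("ASKING");
            }
        }
        throw new IllegalStateException("Too many redirects for key " + key);
    }

    abstract int slotOf(String key);                 // CRC16(key) mod 16384
    abstract Connection connectionForSlot(int slot);
    abstract Connection connectionTo(String address);
    abstract void refreshSlotMap();
}
```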

Multi-node parallel data migration

Redis Cluster shards data by hashing keys with CRC16. When each node holds roughly the same number of slots and there are no huge keys, data is usually distributed very evenly across nodes. However, because migration works key by key, it is slow.
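For reference, a key's slot is CRC16(key) mod 16384, where Redis Cluster uses the CRC16-CCITT (XMODEM) variant. The sketch below shows the calculation, omitting the hash-tag rule that hashes only the part of the key between { and }:

```java
import java.nio.charset.StandardCharsets;

// Slot calculation as used by Redis Cluster: CRC16-CCITT (XMODEM) mod 16384.
public final class ClusterSlot {
    private static int crc16(byte[] bytes) {
        int crc = 0x0000;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    public static int slotOf(String key) {
        // Note: real clients first check for a {hash tag} and hash only that substring.
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
    }

    public static void main(String[] args) {
        System.out.println(slotOf("user:1000")); // maps this key to one of the 16,384 slots
    }
}
```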

When data grows quickly and capacity must be expanded urgently, migrating the master nodes one at a time means some nodes may run out of memory before their turn comes, causing data eviction or OOM.

R2M therefore migrates data from multiple nodes in parallel, which prevents such problems and greatly shortens migration time. Combined with the pipelined migration supported after Redis 3.0.7, it further reduces the number of network round trips and shortens the migration even more.

Controlled data migration

After new nodes are added for horizontal expansion, some data must be migrated to them to rebalance data and load. Depending on network conditions, instance sizes, and machine configuration, the migration usually takes anywhere from several minutes to several hours.

In addition, for slots or keys that are being migrated, Redis Cluster uses ASK or MOVED redirection to tell the client to route the request to the new node, which costs the client an extra request/response round trip.

Since cache read/write timeouts on the client are short (usually 100 to 500 ms), these factors combined can cause a large number of read/write timeouts, which seriously affects online services.

For this reason, we support pausing and resuming migration tasks. If a migration is found to be affecting the service, it can be paused at any time and the remaining data migrated later, outside peak hours, making the whole process flexible and controllable.
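A rough sketch of a pausable, resumable migration task is shown below. It only illustrates the control flow; the actual migration commands and checkpoint persistence in R2M are abstracted away:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative pausable migration task: checkpoint progress per slot so it can resume later.
public class MigrationTask implements Runnable {
    private final List<Integer> slotsToMove;       // slots assigned to this task
    private final AtomicBoolean paused = new AtomicBoolean(false);
    private volatile int nextIndex = 0;            // checkpoint: next slot to migrate

    public MigrationTask(List<Integer> slotsToMove) {
        this.slotsToMove = slotsToMove;
    }

    public void pause()  { paused.set(true); }     // called from the console during peak hours
    public void resume() { paused.set(false); synchronized (this) { notifyAll(); } }

    @Override
    public void run() {
        while (nextIndex < slotsToMove.size()) {
            synchronized (this) {
                while (paused.get()) {
                    try { wait(); } catch (InterruptedException e) { return; }
                }
            }
            migrateSlot(slotsToMove.get(nextIndex)); // moves all keys in the slot
            nextIndex++;                             // persisted in practice so a restart can resume
        }
    }

    private void migrateSlot(int slot) {
        // In practice: mark MIGRATING/IMPORTING, move keys in batches, then reassign the slot.
    }
}
```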

Automatic expansion

R2M also provides automatic capacity expansion: when the machine has enough free memory and the memory used by an instance reaches or exceeds a preset threshold, the instance is expanded automatically.

Automatic expansion can be enabled for important services or services whose data must not be evicted. Of course, it has limits: expansion cannot go on forever, and once an instance reaches a relatively large size, further automatic expansion is refused and memory is reserved for fork operations to avoid OOM.
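The decision logic can be pictured roughly as follows; the thresholds and field names are illustrative assumptions, not R2M's real values:

```java
// Illustrative auto-expansion check; thresholds are made-up examples.
public class AutoExpandPolicy {
    private static final double EXPAND_THRESHOLD = 0.85;        // expand when usage >= 85% of maxmemory
    private static final long   MAX_INSTANCE_BYTES = 20L << 30; // refuse to grow past ~20 GB
    private static final double FORK_RESERVE_RATIO = 0.3;       // keep ~30% of machine memory free for fork/RDB

    public boolean shouldExpand(long usedBytes, long maxMemoryBytes,
                                long machineFreeBytes, long machineTotalBytes) {
        boolean overThreshold  = usedBytes >= maxMemoryBytes * EXPAND_THRESHOLD;
        boolean instanceTooBig = maxMemoryBytes >= MAX_INSTANCE_BYTES;
        boolean enoughHeadroom = machineFreeBytes > machineTotalBytes * FORK_RESERVE_RATIO;
        return overThreshold && !instanceTooBig && enoughHeadroom;
    }
}
```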

Hot/cold data exchange storage

Because there are many large cache clusters online (hundreds of GB or even several TB each), the memory cost of the cache machines is huge; the total memory capacity of online machines has reached about 66 TB.

Our investigation found that the per-unit cost gap between mainstream DDR3 memory and mainstream SATA SSDs is roughly 20x. To optimize the overall infrastructure cost (hardware, rack space, power consumption, and so on), we decided to tier data by access heat and dynamically swap data between RAM and flash, significantly reducing infrastructure cost while keeping a good balance between performance and cost.

The basic idea of R2M's hot/cold exchange storage is as follows: an LFU-style heat statistic based on key access counts identifies hot data, which stays in Redis; data with no or few accesses is moved to SSD; and if a key on SSD becomes hot again, it is loaded back into Redis memory.

The idea is simple, but in practice it is not easy: SATA SSD read/write speed is still far slower than Redis reading and writing memory. To keep this mismatch from dragging down the whole system's response time and throughput, we adopted a multi-process, asynchronous, non-blocking model to preserve the performance of the Redis layer, and we maximize SSD read/write performance through a carefully designed on-disk storage format, lazy deletion of multi-version keys, and multi-threaded SSD reads and writes.

The multi-process asynchronous model mainly has two processes: an SSD read/write process that accesses keys on SSD, and a deeply modified Redis process that reads and writes keys in memory. If the key being accessed turns out to live on SSD, the request is forwarded to the SSD read/write process.
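Conceptually, the dispatch on the Redis side looks something like the sketch below, written in Java purely for illustration; the real logic lives in the modified Redis process and the SSD process, and the SsdProcessClient interface is an assumption:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Conceptual dispatch between the in-memory store and the SSD process.
public class HotColdDispatcher {
    private final Map<String, String> memoryStore = new ConcurrentHashMap<>(); // hot keys in RAM
    private final SsdProcessClient ssdProcess;     // async channel to the SSD read/write process

    public HotColdDispatcher(SsdProcessClient ssdProcess) {
        this.ssdProcess = ssdProcess;
    }

    public CompletableFuture<String> get(String key) {
        String value = memoryStore.get(key);
        if (value != null) {
            // Hot key: served directly from memory without blocking on the SSD process.
            return CompletableFuture.completedFuture(value);
        }
        // Cold key: forward to the SSD process asynchronously; if it becomes hot again,
        // it can be promoted back into memory when the reply arrives.
        return ssdProcess.get(key).thenApply(v -> {
            if (v != null) {
                memoryStore.put(key, v); // simplistic promotion; the real logic is LFU-driven
            }
            return v;
        });
    }

    /** Hypothetical async client for the SSD read/write process. */
    public interface SsdProcessClient {
        CompletableFuture<String> get(String key);
    }
}
```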

The Redis process was originally based on Redis 3.2, but after the Redis 4 RC was released, with support for LFU, PSYNC2, memory defragmentation, and other features, we decided to switch development to Redis 4.

The SSD read/write process was originally developed on top of the open-source SSDB. However, because SSDB's master-slave replication performs poorly, its data storage format is not well designed, and it has many incompatibilities with the Redis API, we essentially rewrote everything except the basic network framework.

In addition, SSDB's default storage engine is LevelDB; we switched it to RocksDB (which itself originated from LevelDB) for its features, project activity, and so on.

We have now completed development internally, carried out comprehensive functional, stability, and performance testing, and will soon put it online. Hot/cold exchange storage involves many topics that space does not allow us to cover here. The project has been named SWAPDB and is open-sourced on GitHub.


Multi-data-center switching and disaster recovery

Data center switching is a top-down process in which all components coordinate, and several factors must be considered. In R2M it mainly involves switching the cache clusters, the routing service (such as ZooKeeper or etcd), and the other components (Manager, web console, and client) across data centers.

Data-layer switching: data center switching for the cache service itself

For multi-data-center support, the key is avoiding split-brain. Let's first look at how split-brain happens.

Normally, the nodes of a cache cluster are deployed in the same data center. Each master is responsible for part of the data; if a master fails, the remaining masters elect one of its replicas as the new master, i.e., automatic failover.

To support data center switchover or multi-data-center disaster recovery for important services, each master gets an additional replica in another data center, so every master has two replicas. If an automatic failover then happens at runtime, a master role may fail over to the other data center, leaving the masters of the same cluster spread across different data centers.

In this case, if the network link between the data centers fails, the masters in data center A and the masters in data center B will each mark the other side as FAIL. Assuming the majority of masters are in data center A, they will elect new masters from nodes in data center A to replace the masters in data center B. The same shards then have masters in both data centers: clients in data center A write those shards to the newly elected masters in A, while clients in data center B keep writing to the masters in B. The result is split-brain, and the data can no longer be merged.

Redis Cluster does have the cluster-require-full-coverage option: when it is enabled, the whole cluster refuses to serve requests as soon as any of the 16,384 slots is left uncovered by a node failure. This hurts availability so much that in practice the option must be turned off. But with it off, the scenario above leads straight to split-brain.

To solve this problem, we added a datacenter flag to Redis's cluster bus messages. When a master receives an automatic failover request, it compares its own datacenter ID with that of the replica nominated for failover: if they are the same, it approves the request; otherwise it rejects it.

This ensures that automatic failover only promotes a replica in the same data center, so masters never end up spread across data centers; manual or forced failover is not subject to this restriction. We have submitted a pull request to Redis for this multi-data-center support, and Antirez has put the feature on the Redis 4.2 roadmap.
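The vote-granting rule can be illustrated roughly as follows (Java for illustration only; the actual change is in Redis's C cluster code, and the other standard failover checks are omitted):

```java
// Conceptual illustration of the datacenter-aware failover vote (not the actual C patch).
public class FailoverVoter {
    private final String myDatacenterId;          // e.g. "dc-a", carried in cluster bus messages

    public FailoverVoter(String myDatacenterId) {
        this.myDatacenterId = myDatacenterId;
    }

    /**
     * Decide whether this master grants its vote to a replica asking for automatic failover.
     * Other standard checks (epoch, previous votes, slot ownership) are omitted here.
     */
    public boolean grantVote(String replicaDatacenterId, boolean manualFailover) {
        if (manualFailover) {
            return true;                          // operator-initiated: cross-DC promotion allowed
        }
        // Automatic failover: only promote a replica located in the same data center,
        // so that masters never drift across data centers on their own.
        return myDatacenterId.equals(replicaDatacenterId);
    }
}
```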

At present, both planned data center switchover and disaster recovery switchover in R2M can be done with one click on the web console. For planned data center maintenance or migration, the master roles can be manually switched over to the replicas in the other data center.

It is worth mentioning that a planned switchover preserves master/replica data consistency before and after the switch. When a data center goes down because of a power outage or similar incident, we use forced failover to batch-promote the replicas in the other data center. Because the failure is sudden, a small amount of data may not yet have been replicated, but the priority is to restore service quickly: availability matters more here than a small amount of inconsistent or lost data.

Data center switching for system components

Data center switching for the cluster routing service (ZK)

The service client obtains cache node addresses through the routing service. In our production environment, ZK is deployed independently in each data center, i.e., the ZK instances in different data centers belong to different ZK clusters, and clients in each data center access their local ZK to obtain cache node addresses.

In R2M, the routing configuration stored in ZK is identical in every data center. We keep the information in the ZK routing nodes as simple as possible: nothing related to cluster state lives in ZK, and the configuration is essentially static. As a result, nothing in ZK needs to change during a data center switchover.

Data center switching on the client

Business services deploy their clients across multiple data centers, and the client itself does not need a data center switchover of its own. However, after the cache cluster switches data centers, the clients must promptly route their requests to the new data center. Notifying every client spread across multiple data centers, and maintaining or establishing a connection to each of them, would be a big hassle.

In R2M, because the client is a smart client, when it detects an exception it re-fetches the cluster state from the surviving nodes, automatically learns about the node role changes, and switches over on its own, so no explicit notification is needed.

Data center switching for the Manager component

The Manager is the management component of the cache clusters, and all O&M operations, including data center switchover, go through it. The Manager itself must therefore span multiple data centers so that a disaster recovery switchover can be performed at any time.

Note that because the ZK clusters in different data centers are independent, the Manager cannot simply rely on ZK for its own cross-data-center switching: the data center hosting the ZK it depends on might be the one that goes down. We therefore implemented cross-data-center leader election for the Manager using the Raft protocol, so the Manager instances elect their own leader and fail over automatically.

Data center switching for the web console

Data center switching for the web console is relatively simple: since the web layer is stateless, Nginx load-balances across instances, and the console remains available as long as a web component is alive in any data center.

Monitoring, alarm, and troubleshooting

1. What should we do when service calls time out or performance metrics jitter?

2. A service call may involve many components, such as message queues, databases, caches, and RPC. How do we confirm that the cache is not the problem?

3. If the cache service is at fault, is it a machine problem, a network problem, a system problem, or misuse by the business?

When such problems occur, faced with a huge number of machines, clusters, and instances, and without a complete monitoring and alerting system or a convenient, easy-to-use visual console, all you can do is flounder: patiently combing through logs instance by instance, checking INFO output, checking the network and the machines, i.e., purely manual operations.

R2M therefore provides monitoring metrics and alerting across many dimensions, giving early warning of anomalies and making faults quick to locate.

Machine and network metrics

For each machine: network QoS (packet loss rate, retransmission count, send/receive errors), inbound/outbound bandwidth, CPU, memory usage, and disk read/write rates.

System parameters

For each node: memory usage, inbound/outbound network traffic, TPS, query hit ratio, number of client TCP connections, number of evicted keys, slow-query command records, and instance runtime logs.

Real-time monitoring and historical statistics charts

Real-time and historical charts visualize these metrics, both live and over time. They not only point you in the right direction quickly when a problem occurs, but also provide very valuable reference data for operations and business development; that data in turn drives optimization of the business systems and better operations.

Historical monitoring statistics are collected and reported by the Agent deployed on each machine. With so many instances and monitoring items, the volume of monitoring data can be very large.

To keep monitoring data from taking up too much database space, historical statistics are stored at minute granularity for the last 12 hours, hour granularity for the last month, and day granularity for the last year. Minute-level data older than 12 hours is automatically merged into hourly data, and hourly data older than one month is merged into daily data; each merge keeps the maximum, minimum, and cumulative average of every metric over the period.
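A minimal sketch of such a roll-up, keeping the maximum, minimum, and average per merged bucket; the data model is an assumption for illustration:

```java
import java.util.List;

// Illustrative roll-up of fine-grained samples into one coarser bucket (e.g. 60 minutes -> 1 hour).
public class MetricRollup {

    /** One stored data point for a metric: max, min and average over its period. */
    public static class Bucket {
        public final double max;
        public final double min;
        public final double avg;

        public Bucket(double max, double min, double avg) {
            this.max = max;
            this.min = min;
            this.avg = avg;
        }
    }

    /** Merge finer-grained buckets into a single coarser bucket, preserving max/min/avg. */
    public static Bucket merge(List<Bucket> fine) {
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;
        double sum = 0;
        for (Bucket b : fine) {
            max = Math.max(max, b.max);
            min = Math.min(min, b.min);
            sum += b.avg;               // average of averages, assuming equal-length sub-periods
        }
        return new Bucket(max, min, sum / fine.size());
    }
}
```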

For real-time monitoring charts, a connection to the corresponding cache instance is established only when a user asks to view it, and the live data is fetched directly from the instance.

Client performance indicators

The TP50, TP90, TP99, and TP999 request latencies of each client are collected so that problematic IPs can be located quickly.
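For reference, a percentile such as TP99 can be computed from a window of request latencies as in the naive, sort-based sketch below; production clients typically use histogram-based estimators instead:

```java
import java.util.Arrays;

// Naive percentile calculation over a window of request latencies (milliseconds).
public class LatencyPercentiles {

    /** Returns the value at the given percentile (e.g. 99.0 for TP99) of the samples. */
    public static long percentile(long[] latenciesMs, double percentile) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(percentile / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(index, sorted.length - 1))];
    }

    public static void main(String[] args) {
        long[] window = {3, 5, 4, 120, 6, 5, 7, 4, 5, 300};
        System.out.println("TP50=" + percentile(window, 50));  // 5
        System.out.println("TP99=" + percentile(window, 99));  // 300
    }
}
```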

Alert items

Capacity alerts, inbound/outbound traffic, TPS, instance blocking, instances going down, number of client connections, and so on.

Conclusion

This has been only a brief introduction to the overall design ideas and features of the R2M cache system; many details and fundamentals, such as hot/cold data exchange, have not been covered in depth. We ran into plenty of challenges while building it and hope to write about them in detail in the future. Finally, I hope more engineers will embrace and contribute to open source, so that we all have more and better wheels to use.