
This article is not limited to Dynamo

  • What is Dynamo
  • Dynamo and MySQL
  • Data sharding
    • The implementation of data sharding
    • Redis cluster data sharding
    • Dynamo data sharding
    • Consistent hashing improvements
  • Data replication
  • Dynamo read and write processes
  • Data consistency and conflict resolution
  • Dynamo cluster member status monitoring
  • Conclusion
  • References

What is Dynamo

Amazon ran into problems during its business growth, chiefly the limited scalability and high availability of relational databases. It therefore set out to build a new database on a key-value storage model and named it Dynamo.

Dynamo’s functional goals differ somewhat from those of a traditional relational database such as MySQL. For example, Amazon’s business scenarios mostly do not require complex queries. They do, however, require tolerance of single-node failures, eventual consistency (i.e., sacrificing strong consistency to guarantee availability first), and strong scalability.

Clearly, Dynamo has several key issues to address in order to achieve these functional goals:

  1. In the CAP tradeoff, Dynamo chooses to sacrifice strong consistency under certain circumstances in order to guarantee availability (a tradeoff made by most emerging databases)
  2. It needs multiple nodes that replicate data asynchronously for backup and redundancy, so that the cluster survives single-node failures and stays highly available
  3. It needs some kind of rebalancing algorithm to handle adaptive cluster management and scaling; Dynamo chooses the consistent hashing algorithm

Dynamo and MySQL

Dynamo is a NoSQL database made up of a cluster of instances (also called nodes; the distinction does not matter here). Each instance consists of three modules: a request coordinator, Gossip-protocol failure detection, and a local persistence engine. The last of these is designed to be pluggable and can sit on different storage engines, such as BDB or MySQL.

Data sharding

The implementation of data sharding

Data sharding is very common: massive data cannot be stored on a single node, so it must be divided according to certain rules and stored separately. In the MySQL world, the practice of splitting databases and tables is essentially a form of data sharding.

The sharding logic can live either on the client side or in a proxy layer, depending on your architecture. Traditional database middleware mostly implements sharding on the client side, rewriting the physical SQL to route to different MySQL databases. In the compute-storage separation architecture advocated by NewSQL databases, however, the sharding logic usually lives in the compute layer, i.e., the proxy layer, where stateless compute nodes forward user requests to the correct storage nodes.

Redis cluster data sharding

Redis Cluster is also a NoSQL database; how does it solve the hash mapping problem? It predefines 16,384 slots (buckets) and allocates them among the nodes. Each key is mapped into one of the 16,384 slots by a fixed rule. When nodes are added or removed, only some slots are reassigned, limiting how much data has to move.
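The slot mapping can be sketched in a few lines of Python. This is not the real Redis implementation; `binascii.crc_hqx` happens to compute the same CRC16 variant (XModem) that Redis uses for keys, and the contiguous range assignment below is a simplification of how operators allocate slots:

```python
import binascii

NUM_SLOTS = 16384

def key_slot(key: bytes) -> int:
    # Redis computes CRC16 of the key (XModem variant) mod 16384.
    # Python's binascii.crc_hqx uses the same polynomial (0x1021, init 0).
    return binascii.crc_hqx(key, 0) % NUM_SLOTS

def assign_slot_ranges(nodes):
    # Simplification: split the 16,384 slots into contiguous ranges,
    # one per node (real clusters can move individual slot ranges).
    per_node = NUM_SLOTS // len(nodes)
    ranges = {}
    for i, node in enumerate(nodes):
        start = i * per_node
        end = NUM_SLOTS - 1 if i == len(nodes) - 1 else start + per_node - 1
        ranges[node] = (start, end)
    return ranges
```

Because a key's slot depends only on the key, adding a node just means handing some slot ranges to it; keys in the untouched slots never move.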

Dynamo data sharding

Dynamo was designed for incremental scaling: adding or removing nodes should move as little data as possible and cause minimal performance jitter in the cluster. For this, Dynamo uses the consistent hashing algorithm. The details of consistent hashing are not covered here, but it is worth explaining why it solves the problems of traditional hashing.

Consider the limitation of a traditional hash: given the total number of nodes H, the node a piece of data maps to is fixed (x mod H). The moment H grows or shrinks, the mapping of all the data changes, and the only remedy is mass data migration. Consistent hashing, by contrast, fixes on a ring the data range each node is responsible for, so each time a node is added or removed, the impact is limited to a small portion of the data.

(Figure: the blue circles A, B, C, and D represent the four physical nodes; the orange circles represent data items, each assigned clockwise to the first node encountered.)
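The clockwise lookup in the figure can be sketched as a minimal consistent-hash ring (using MD5 as the ring hash is an arbitrary choice here; the node and key names are made up):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: a key belongs to the first node
    point found clockwise from the key's own position."""

    def __init__(self, nodes=()):
        self._points = []   # sorted ring positions
        self._owner = {}    # position -> node name
        for n in nodes:
            self.add(n)

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add(self, node: str):
        pos = self._hash(node)
        bisect.insort(self._points, pos)
        self._owner[pos] = node

    def remove(self, node: str):
        pos = self._hash(node)
        self._points.remove(pos)
        del self._owner[pos]

    def lookup(self, key: str) -> str:
        # First ring point strictly after the key's hash, wrapping around.
        i = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[i]]
```

Removing a node only reassigns the keys that node owned; every other key keeps its owner, which is exactly the property that x mod H lacks.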

Consistent hashing improvements

Consistent hashing has a drawback: if each node maps directly onto the ring, the ranges between nodes can be very uneven, leading to unbalanced data distribution and machine load.

The optimization consistent hashing adopts is to introduce virtual nodes, which is really just inserting a decoupling middle layer: virtual nodes are spread evenly around the ring, and each physical node is mapped to a certain number of virtual nodes, meaning that the physical node is responsible for the data ranges of those virtual nodes. This balances the load.
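A sketch of the virtual-node variant, again with MD5 ring positions; 128 virtual nodes per physical node is an illustrative number, not Dynamo's setting:

```python
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class VirtualNodeRing:
    """Consistent-hash ring where each physical node places many
    virtual points on the ring, smoothing out the range sizes."""

    def __init__(self, nodes, vnodes_per_node=128):
        self._points = []
        self._owner = {}
        for node in nodes:
            for i in range(vnodes_per_node):
                pos = _h(f"{node}#vn{i}")   # one ring point per virtual node
                bisect.insort(self._points, pos)
                self._owner[pos] = node     # virtual node -> physical node

    def lookup(self, key: str) -> str:
        i = bisect.bisect(self._points, _h(key)) % len(self._points)
        return self._owner[self._points[i]]
```

With enough virtual points, each physical node's total arc length on the ring evens out, so no node receives a wildly unfair share of the keys.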

Data replication

Data replication is a common method to improve the high availability of databases. It can be divided into synchronous replication, asynchronous replication, and semi-synchronous replication in terms of implementation modes, and one-way replication, two-way replication, and ring replication in terms of application scenarios.

In Dynamo’s design, data is replicated to N hosts for disaster recovery, where N is the number of redundant copies. Recall that every Dynamo node has a Request Coordinator module: when it receives a key K, it replicates it to the N-1 nodes that follow on the ring, ensuring that N copies of K exist. Thus virtually every node in Dynamo stores both the data it owns and the replicas it holds for other nodes.
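The Dynamo paper calls the set of N nodes responsible for a key its preference list. A rough sketch of walking the ring clockwise to build one (the ring construction and names here are illustrative, not Dynamo's actual code):

```python
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes_per_node=8):
    # position -> physical node, with a few virtual points per node
    owner = {_h(f"{n}#vn{i}"): n
             for n in nodes for i in range(vnodes_per_node)}
    return sorted(owner), owner

def preference_list(points, owner, key: str, n: int):
    # Walk clockwise from the key's position and collect the first N
    # *distinct* physical nodes; the first is the owner, the rest replicas.
    result = []
    i = bisect.bisect(points, _h(key))
    while len(result) < n:
        node = owner[points[i % len(points)]]
        if node not in result:
            result.append(node)
        i += 1
    return result
```

Skipping duplicate physical nodes matters once virtual nodes are in play; otherwise two "copies" could land on the same machine.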

Dynamo read and write processes

Dynamo selects one of the nodes holding a copy of the data as the coordinator, responsible for forwarding read and write requests and collecting the responses. Typically, the client finds this node via a data-to-node mapping maintained in memory and sends requests to it directly.

For a write request, the coordinator receives the write, records the updater and a timestamp for the data, and forwards the write to the other copies. Once W copies have been written, success is reported to the client. Reads work similarly: the request is forwarded to all copies, and after the results of R copies arrive, the latest data version is selected.

Obviously, since the coordinator is the single entry point for reads and writes of its keys, the load on that node can spike.

Data consistency and conflict resolution

With N redundant copies, guaranteeing strong consistency would require waiting for all copies to be written before reporting success to the client, which hurts performance and is rarely done in practice. Dynamo instead requires only that a write reach at least W replicas before returning, and that a read collect responses from R replicas. As long as W + R > N, the read and write sets overlap, so the latest value is guaranteed to be among those read.
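A toy simulation of the W/R quorum argument (not Dynamo's code; versions here are plain integers standing in for vector clocks):

```python
import random

def write(replicas, w, key, value, version):
    # The coordinator reports success once W replicas hold the write;
    # we model the lagging N-W replicas by simply not updating them.
    for i in random.sample(range(len(replicas)), w):
        replicas[i][key] = (version, value)

def read(replicas, r, key):
    # Poll R replicas and keep the highest version seen.
    answers = [replicas[i][key]
               for i in random.sample(range(len(replicas)), r)
               if key in replicas[i]]
    return max(answers)  # (version, value) tuples compare by version first
```

Because every write set of size W intersects every read set of size R when W + R > N, at least one polled replica always holds the newest version.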

But that raises a question: of the R values returned, which is the latest? Each piece of data needs version information. To solve this, Dynamo introduced the vector clock: simply put, every write appends an (updater, counter) pair, e.g. <A, 1>, as the version information of the change, and this version travels with the data during subsequent replication.

For example, copy A receives an update request for key K and attaches the version <A, 1> to it. When A later updates K again, the version becomes <A, 2>, so the latter version is newer.

Assuming there is no problem with the network in the cluster, a read of a key value K must be returned to the client with the latest version of the timestamp. But unfortunately, distributed networks are bound to have problems, all kinds of problems.

Suppose the client picks another copy, B, as the coordinator for the second update. B then stores key K with version <B, 1>. When the client next reads K, the coordinator finds it cannot decide which of <A, 1> and <B, 1> is the latest. Much like a Git merge, the conflict can only be kept and returned to the client, to be resolved by the specific business logic.
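The reconciliation rule can be sketched by representing a vector clock as a dict from updater to counter (a simplification of Dynamo's actual implementation):

```python
def dominates(a: dict, b: dict) -> bool:
    # a dominates b if a has seen every update b has, and more besides
    return a != b and all(a.get(node, 0) >= n for node, n in b.items())

def reconcile(v1: dict, v2: dict) -> list:
    if dominates(v1, v2):
        return [v1]           # v1 strictly newer: keep it
    if dominates(v2, v1):
        return [v2]           # v2 strictly newer: keep it
    return [v1, v2]           # concurrent: surface the conflict to the client
```

In the example above, <A, 2> dominates <A, 1>, so the second write cleanly replaces the first; but <A, 1> and <B, 1> dominate neither way, so both survive and the client must merge them.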

Dynamo cluster member status monitoring

Besides data replication, Dynamo also needs to detect node availability in the cluster periodically. Some products rely on an external service for this, such as MySQL's MHA, RocketMQ's NameServer, and TiDB's PD, while others rely on adaptive management between the nodes themselves. Redis Cluster and Dynamo, for example, both adopt the Gossip protocol for exchanging node information within the cluster; they are fully decentralized architectures with no external service.

If you want to know what the Gossip protocol is, it is recommended to learn from the Redis cluster architecture. The cluster implementation using the Gossip protocol is often complex and error-prone. In addition, the Gossip protocol itself is prone to performance jitter due to the large data packets.
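A toy illustration of why gossip converges: each node repeatedly merges its membership view with a random peer's. This is a heavy simplification; real Gossip implementations exchange richer state (heartbeats, epochs, failure suspicions) rather than bare member sets:

```python
import random

def gossip_round(views):
    # views: node -> set of members that node currently knows about.
    # One round: every node does a push-pull exchange with a random peer.
    nodes = list(views)
    for node in nodes:
        peer = random.choice([n for n in nodes if n != node])
        merged = views[node] | views[peer]
        views[node] = merged
        views[peer] = merged
```

Views only ever grow toward the full member set, which is the rumor-spreading behavior that lets the cluster learn topology without any central registry.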

Conclusion

Recently I reread the Dynamo paper, having read Designing Data-Intensive Applications several times before. Many distributed-system design concepts connect up nicely between them, which is very helpful for organizing knowledge. Have a look if you are interested.

References

  • Dynamo: Amazon’s Highly Available Key-value Store
  • Designing Data-Intensive Applications
