Two major problems that distributed systems, especially distributed storage systems, need to solve are data sharding and data redundancy. The following figure illustrates the two concepts and the difference between them:

Image source: http://book.mixu.net/distsys/intro.html

Data sets A and B are shards: the original data is divided into two disjoint subsets that are placed on two different nodes. Data set C is redundancy: the same complete copy is stored on both nodes. Of course, in a real distributed system, sharding and redundancy generally coexist.

This article mainly discusses three problems of data sharding:

  • How to do data sharding, that is, how to map data to nodes;

  • The characteristic value of a shard, that is, which attribute (field) of the data to shard on;

  • How to ensure high performance and high availability of the metadata server, and how to guarantee strong consistency across a group of metadata servers.

A distributed system uses a number of independent computers to solve storage and computing problems that a single node (computer) cannot handle, a classic divide-and-conquer idea. Each node is responsible for only a subset of the original problem (the task the whole system needs to accomplish). But how is the original problem split across multiple nodes? In a distributed storage system, this task splitting is data sharding.

To segment, fragment, shard, or partition a data set means to divide it into independent, disjoint subsets according to certain rules and then distribute those subsets to different nodes.

Note that sharding follows rules. Different distributed applications use different rules, but they all obey the same principle: shard data according to the dominant, most frequently used access pattern.

One, Three data sharding methods

This section describes three sharding methods: hash, consistent hash, and range based. For any approach, consider the following questions:

  • How to divide the original data set?

  • When the scale of the original problem becomes larger, can it be dynamically adapted by adding nodes?

  • When a node fails, can tasks on this node be evenly allocated to other nodes?

  • For modifiable data (such as database records), if the amount of data on a node grows large, can part of it be migrated to less-loaded nodes to achieve dynamic balance, and how?

  • How large is the metadata that must be managed (that is, the mapping between data and physical nodes)? How frequent and how complex are metadata updates?

To analyze different data sharding methods, suppose there are three physical nodes, numbered N0, N1, and N2, with the following records:

R0: {id: 95, name: 'aa', tag: 'older'}

R1: {id: 302, name: 'bb'}

R2: {id: 759, name: 'aa'}

R3: {id: 607, name: 'dd', age: 18}

R4: {id: 904, name: 'ff'}

R5: {id: 246, name: 'gg'}

R6: {id: 148, name: 'ff'}

R7: {id: 533, name: 'kk'}

A hash table is one of the most common data structures: it maps a record (or object) to a slot in the table based on its key value, for fast access.

Most programming languages support hash tables: dict in Python, map in C++, Hashtable in Java, table in Lua, and so on. The simplest hash function is mod N (where N is the size of the table): compute an integer hash of the key, take it mod N, and the remainder is the slot in the table.

The same idea applies to hash-based data sharding: a hash value is computed from a chosen key of the data, and a mapping is established between hash values and the nodes in the system, so that data with different hash values ends up on different nodes.

We select id as the sharding key, so with mod 3 (three nodes) the data each node is responsible for is as follows:
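As a minimal sketch (not from the original article), the mod-3 assignment can be computed as follows; the record list and the node() helper are assumptions for illustration:

```python
# Minimal sketch of hash (mod N) sharding: a record's node is simply id % N.
records = {
    "R0": 95, "R1": 302, "R2": 759, "R3": 607,
    "R4": 904, "R5": 246, "R6": 148, "R7": 533,
}

def node(record_id, n_nodes=3):
    """Map a record id to a node number with the simple mod-N hash."""
    return record_id % n_nodes

for name, rid in records.items():
    print(f"{name} (id={rid}) -> N{node(rid)}")
# N0 holds R2, R5; N1 holds R3, R4, R6; N2 holds R0, R1, R7.
```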

As can be seen, the advantage of hash-based sharding is that the mapping is very simple and the metadata to be managed is very small: only the number of nodes and the hash function (mod N) are needed.

However, the drawback of the hash approach is also obvious: when a node is added or removed, a large amount of data has to be moved. For example, if a node N3 is added here, the hash function changes to mod 4, and the resulting data migration is as follows:
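Continuing the illustrative sketch above, the following snippet counts how many of the example records change nodes when the modulus grows from 3 to 4:

```python
# A record keeps its node only if id % 3 happens to equal id % 4.
ids = [95, 302, 759, 607, 904, 246, 148, 533]
moved = [i for i in ids if i % 3 != i % 4]
print(f"{len(moved)} of {len(ids)} records must migrate")  # here 7 of 8 move
```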

Stated informally, the property we would like here (the monotonicity property that consistent hashing provides) is: if some content has already been hashed into its bucket and a new bucket is added to the system, the content should map either to its original bucket or to the new bucket, and never to another bucket in the old bucket set.

In practice, to reduce the amount of data migrated, the number of nodes can be doubled each time the cluster grows, so that only about 50% of the data needs to move.

Another disadvantage of the hash approach is that it is hard to deal with data imbalance. There are two cases:

  • The characteristic values of the original data are not uniformly distributed, so a large amount of data is concentrated on one physical node.

  • For modifiable records, a single record can grow very large.

In both cases, the load across nodes becomes unbalanced, and this is difficult to resolve with hash-based sharding.

Consistent hashing maps data, by its characteristic value, onto a hash ring that wraps around end to end, and also maps the nodes (by hashing their IP address or machine name) onto the same ring. For each piece of data, the first node found clockwise from the data's position on the ring is its storage node.

Taking the data above as an example again, assume that the range of id is [0, 1000] and the positions of N0, N1, and N2 on the ring are 100, 400, and 800 respectively. The hash ring and the resulting data distribution are as follows:
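To make the clockwise lookup concrete, here is a minimal sketch (not from the original article); for readability the id itself is used as the position on a [0, 1000) ring instead of a real hash value, and the lookup() helper is an assumption for illustration:

```python
import bisect

# Minimal sketch of a consistent hash ring with nodes at 100, 400, and 800.
RING_SIZE = 1000
ring = [(100, "N0"), (400, "N1"), (800, "N2")]   # (position, node), sorted by position
positions = [pos for pos, _ in ring]

def lookup(key):
    """Return the first node clockwise from the key's position on the ring."""
    i = bisect.bisect_left(positions, key % RING_SIZE)
    return ring[i % len(ring)][1]                # wrap around past the last node

for name, rid in {"R0": 95, "R1": 302, "R2": 759, "R3": 607,
                  "R4": 904, "R5": 246, "R6": 148, "R7": 533}.items():
    print(f"{name} (id={rid}) -> {lookup(rid)}")
# R0 -> N0; R1, R5, R6 -> N1; R2, R3, R7 -> N2; R4 wraps around the ring to N0.
```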

Compared with the plain hash method above, consistent hashing additionally has to maintain the positions of the nodes on the ring as metadata, but this is still a very small amount of data.

For example, if a node N3 is added at position 600 on the ring, the range (400, 800] formerly owned by N2 is split: N3 takes (400, 600] and N2 keeps (600, 800]. Therefore only record R7 (id 533) needs to be migrated from N2 to N3.

It is easy to see that with consistent hashing, adding or removing a node only affects the adjacent node on the ring, and no large-scale data migration is needed.

However, with consistent hashing, an added node relieves the pressure of only one existing node, and likewise, when a node fails, all of its load shifts to the next node on the ring. What we would like is that “when one party is in trouble, all parties lend support”: when nodes are added or removed, all existing nodes should share in the adjustment so that a new equilibrium is reached.

Therefore, in practice the concept of virtual nodes is generally introduced: virtual nodes, rather than physical nodes, are mapped onto the hash ring. The number of virtual nodes is much larger than the number of physical nodes, so one physical node is responsible for the actual storage of multiple virtual nodes. During data operations, the hash ring is used to find the corresponding virtual node, and the mapping from virtual nodes to physical nodes then gives the physical node.

With virtual nodes, the metadata to be maintained grows: first, the positions of the virtual nodes on the hash ring, of which there are many; second, the mapping from virtual nodes to physical nodes. But the benefit is obvious: when a physical node fails, multiple virtual nodes on the ring fail with it, and the corresponding load is spread over multiple remaining virtual nodes, in effect over multiple remaining physical nodes. The same applies when a physical node is added.
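A hedged extension of the ring sketch with virtual nodes; the helper names, the MD5-based ring_pos(), and the virtual-node count are assumptions for illustration (real systems typically use hundreds of virtual nodes per physical node):

```python
import bisect
import hashlib

RING_SIZE = 1000
VNODES_PER_NODE = 4

def ring_pos(s):
    """Hash an arbitrary string to a position on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING_SIZE

def build_ring(physical_nodes):
    """Place several virtual nodes per physical node on the ring."""
    ring = []
    for node in physical_nodes:
        for v in range(VNODES_PER_NODE):
            ring.append((ring_pos(f"{node}#vnode{v}"), node))
    return sorted(ring)

def lookup_physical(ring, key):
    """Find the virtual node clockwise from the key, then map it to its physical node."""
    positions = [pos for pos, _ in ring]
    i = bisect.bisect_left(positions, ring_pos(str(key)))
    return ring[i % len(ring)][1]

ring = build_ring(["N0", "N1", "N2"])
for rid in (95, 302, 759, 607, 904, 246, 148, 533):
    print(rid, "->", lookup_physical(ring, rid))
# Removing one physical node deletes several scattered virtual nodes, so its
# load spreads over the remaining physical nodes instead of a single neighbor.
```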

In practice, both Dynamo and Cassandra use consistent hashing, and later versions adopt the virtual node concept. In these systems, data distribution has to be considered together with data replication; once replicas are introduced, the consistent hashing scheme needs corresponding adjustments. Refer to the Cassandra documentation for details.

Range-based sharding, in simple terms, divides the data into different intervals by key value, with each physical node responsible for one or more intervals. This is somewhat like consistent hashing, except that a node's position on the “ring” can change dynamically.

For example, suppose the three nodes' intervals are N0: (0, 200], N1: (200, 500], and N2: (500, 1000]. Then the data distribution is as follows:
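A minimal sketch of the interval lookup (the bounds mirror the example above; the owner() helper is an assumption for illustration):

```python
import bisect

# Range-based sharding: each node owns one key interval.
upper_bounds = [200, 500, 1000]    # (0,200] -> N0, (200,500] -> N1, (500,1000] -> N2
nodes = ["N0", "N1", "N2"]

def owner(key):
    """Return the node whose interval contains the key."""
    return nodes[bisect.bisect_left(upper_bounds, key)]

for rid in (95, 302, 759, 607, 904, 246, 148, 533):
    print(rid, "->", owner(rid))
# 95, 148 -> N0; 246, 302 -> N1; 533, 607, 759, 904 -> N2.
```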

Note that the intervals need not be of equal width, and the amount of data in an interval is not tied to the width of the interval. For example, if some part of the key space is very dense, the corresponding intervals should be relatively narrow; in other words, the amount of data, not the width of the interval, is the sharding criterion.

In real projects, a node is usually responsible for multiple intervals, and each interval is called a chunk or block. Each chunk has a size threshold; when the threshold is reached, the chunk is split into two. The purpose is to re-balance quickly when a node is added.
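As a rough sketch of the split-on-threshold idea (the Chunk class, the threshold value, and the split-at-median rule are illustrative assumptions, not MongoDB's actual implementation):

```python
CHUNK_THRESHOLD = 4          # max records per chunk before it splits

class Chunk:
    def __init__(self, low, high, keys=None):
        self.low, self.high = low, high      # the chunk owns keys in (low, high]
        self.keys = sorted(keys or [])

    def insert(self, key):
        """Insert a key; return a new chunk if the insert triggers a split."""
        self.keys.append(key)
        self.keys.sort()
        if len(self.keys) > CHUNK_THRESHOLD:
            return self.split()
        return None

    def split(self):
        """Split at the median key and return the new upper chunk."""
        mid = self.keys[len(self.keys) // 2]
        upper = Chunk(mid, self.high, [k for k in self.keys if k > mid])
        self.high, self.keys = mid, [k for k in self.keys if k <= mid]
        return upper

chunk = Chunk(0, 1000, [95, 148, 246, 302])
new_chunk = chunk.insert(533)    # exceeds the threshold and triggers a split
print((chunk.low, chunk.high), chunk.keys)              # (0, 246) [95, 148, 246]
print((new_chunk.low, new_chunk.high), new_chunk.keys)  # (246, 1000) [302, 533]
```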

If a node is responsible for only one interval, range-based sharding resembles consistent hashing without virtual nodes; if a node is responsible for multiple intervals, it resembles consistent hashing with virtual nodes.

Metadata management for range-based sharding is relatively complex: the data ranges of every node must be recorded, especially when a single node holds multiple ranges. Furthermore, for modifiable data, whenever a chunk is split, the interval information in the metadata must be updated in step.

Range-based sharding is widely used, for example in MongoDB, PostgreSQL, and HDFS.

Here is a brief summary of the three sharding methods (strictly four, since consistent hashing with and without virtual nodes count as two), mainly with respect to the questions raised above:

The dynamic data balancing above refers to the fourth question raised at the beginning of this section: if the data volume on a node grows large, whether and how part of the data can be migrated to less-loaded nodes.

Two, Selecting the sharding characteristic value

All three methods above shard data based on a key, or characteristic value. This characteristic value goes by different names in different systems.

For example:

  • Shard key in MongoDB:

    https://docs.mongodb.com/manual/core/sharding-shard-key/

  • Partition keys in Oracle:

    https://docs.oracle.com/cd/B28359_01/server.111/b32024/partition.htm

In any case, the choice of the characteristic value is extremely important.

Distributed Systems for Fun and Profit gives a simple rule of thumb for choosing this characteristic value:

Refer to the link: http://book.mixu.net/distsys/intro.html

based on what you think the primary access pattern will be

That is, based on the most common access pattern.

Access here covers creating, deleting, updating, and reading data. For example, above we chose id as the basis for sharding, which implicitly assumes that creates, deletes, updates, and reads are all performed via the id field.

If the application performs most of its data operations through this characteristic value, sharding provides two additional benefits:

  • Improved performance and concurrency, operations are distributed to different shards, independent of each other;

  • Improved system availability, even if some shards are not available, other shards are not affected.

If many operations do not go through the characteristic value, things get awkward. In this article's example, if queries are made by name while the metadata only records how data is located by id, then every shard has to be queried and the results aggregated.
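A minimal sketch of such a scatter-gather query; the in-memory shards dict stands in for real shard servers, and its layout follows the mod-3 distribution used earlier:

```python
# When the query field is not the shard key, every shard must be asked
# and the partial results merged.
shards = {
    "N0": [{"id": 759, "name": "aa"}, {"id": 246, "name": "gg"}],
    "N1": [{"id": 607, "name": "dd"}, {"id": 904, "name": "ff"}, {"id": 148, "name": "ff"}],
    "N2": [{"id": 95, "name": "aa"}, {"id": 302, "name": "bb"}, {"id": 533, "name": "kk"}],
}

def query_by_name(name):
    """Scatter the query to every shard, then gather and merge the results."""
    results = []
    for shard, docs in shards.items():       # one round trip per shard
        results.extend(d for d in docs if d["name"] == name)
    return results

print(query_by_name("ff"))   # touches all three shards to find two matching records
```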

Another problem: if a single field (such as id) is the characteristic value, then all data with the same value of that field will land on the same node, no matter which distribution method is used.

This causes two problems. First, data balance across nodes cannot be achieved. Second, what happens when that data exceeds the storage capacity of a single node? The key point is that adding nodes, the usual remedy in distributed systems, does not help here.

At this point a single field no longer works as the characteristic value; another field may have to be added to form a “composite characteristic value”, similar to a composite index in a database.

For example, if the data is a user's operation log, the id and the timestamp together can be fed into the hash function to compute the characteristic value. But in that case, querying by id alone again requires visiting all nodes.
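A small sketch of sharding on such a composite value (the MD5 hash and the shard_of() helper are assumptions for illustration):

```python
import hashlib

N_NODES = 3

def shard_of(user_id, timestamp):
    """Shard on the composite value (user_id, timestamp)."""
    composite = f"{user_id}:{timestamp}".encode()
    return int(hashlib.md5(composite).hexdigest(), 16) % N_NODES

# One user's log entries spread across nodes instead of piling up on one...
print([shard_of(95, ts) for ts in (1700000000, 1700000060, 1700000120)])
# ...but a query by user_id alone no longer pins down a single shard, so it
# has to be scattered to all nodes.
```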

So there is no optimal design, only the design that best meets the requirements of the application.

MongoDB's shard key is used below as an example to explain why the choice of characteristic value matters and how it affects data operations. With some database experience, the following should be easy to follow even if you have never used MongoDB.

Taking the MongoDB shard key as an example:

In my work, aside from joins and transactions, using MongoDB feels quite similar to using MySQL, especially for basic CRUD operations and indexes. In MongoDB, each data partition is called a shard, the characteristic value of the shard is called the shard key, and each piece of data is called a document.

It is important to choose an appropriate field as the shard key. Why?

As mentioned earlier, if data is accessed by something other than the shard key, the metadata server (or the metadata cache server) cannot know which shard holds the data. The request must therefore be sent to all shards, and the results from all shards aggregated.

In MongoDB, it is mongos (which caches the metadata) that performs this aggregation.

For reads (the R in CRUD: read or retrieve), fetching multiple documents by a non-shard-key field works, just inefficiently. For updates that are meant to modify exactly one document, it would be wrong to apply the update on an arbitrary shard, so MongoDB rejects such requests. The MongoDB (MongoDB 3.0) commands affected include, but are not limited to, the following (a small pymongo sketch follows the list):

  • findAndModify: this command updates only one document, so the query must contain the shard key.

When using findAndModify in a sharded environment, the query must contain the shard key for all operations against the shard cluster for the sharded collections.

  • update: this command has a multi parameter, which is false by default, meaning only one document is updated; in that case the query must contain the shard key.

All update() operations for a sharded collection that specify the multi: false option must include the shard key or the _id field in the query specification.

  • remove: this command has a justOne argument; if true, only one document is removed, and the shard key must likewise be included in the query.
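As a hedged illustration using pymongo rather than the mongo shell (the connection string, database, and collection names are placeholders; it assumes a cluster fronted by mongos with the collection sharded on id, as in this article's example), single-document operations carry the shard key in their filter so mongos can route them to exactly one shard:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")   # placeholder address
coll = client["testdb"]["records"]

# Filters include the shard key ("id"), so each operation targets one shard.
coll.update_one({"id": 533}, {"$set": {"name": "kk2"}})
coll.find_one_and_update({"id": 533}, {"$set": {"tag": "updated"}})
coll.delete_one({"id": 533})

# A single-document update filtered only by a non-shard-key field, e.g.
# coll.update_one({"name": "ff"}, ...), cannot be routed to one shard and is
# rejected on the MongoDB 3.0-era sharded clusters described above.
```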

As with tables in SQL, unique indexes can also be created in MongoDB. However, in a sharded cluster environment, a unique index can only be created on the shard key: if the uniquely indexed field were not the shard key, enforcing uniqueness would require checking, and locking, all shards.

Uneven distribution of data across shards:

Next, consider the problem of data being unevenly distributed across shards. If shard key values are concentrated within a period of time (for example, keys that increase monotonically with time), new data is written to only one shard, and the cluster's write load cannot be balanced.

MongoDB provides us with:

  • Range partition:

    https://docs.mongodb.com/manual/core/ranged-sharding/

  • Hash partition:

    https://docs.mongodb.com/manual/core/hashed-sharding/

These are not the same thing as the sharding methods discussed above (hash and range based); rather, they describe how the shard key is processed. Underneath, MongoDB is always range based, as the documentation states:

MongoDB partitions data in the collection using ranges of shard key values. Each range defines a non-overlapping range of shard key values and is associated with a chunk.

Reference links:

https://docs.mongodb.com/manual/core/sharding-shard-key/

What, then, are range partition and hash partition? A figure in the documentation illustrates the difference nicely:

The figure shows range partitioning on the left and hash partitioning on the right. Range partitioning uses the field's own value to set the partition boundaries, such as x in the figure. Hash partitioning first hashes the field into a larger, more scattered range of values.

The biggest benefit of hash partitioning is that data is written evenly across nodes (evenly at write time, as opposed to being evened out later by MongoDB's balancer). For example, the default _id is an ObjectId, a 12-byte BSON type whose first four bytes are a timestamp. If a large number of documents with ObjectId _ids are created around the same time, they are all routed to the same shard. If _id is instead given a hashed index and used as a hashed shard key, this problem does not arise.
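A small sketch of the difference (keys here are plain timestamps standing in for ObjectIds, whose leading four bytes are a timestamp; the helpers and interval boundaries are illustrative assumptions):

```python
import hashlib

N_SHARDS = 3
keys = [1700000000 + i for i in range(6)]     # keys created close together in time

def range_shard(k, low=1700000000, high=1700000600):
    """Range partition: split [low, high) into N_SHARDS equal intervals."""
    return min((k - low) * N_SHARDS // (high - low), N_SHARDS - 1)

def hash_shard(k):
    """Hash partition: hash the key first, then assign it to a shard."""
    return int(hashlib.md5(str(k).encode()).hexdigest(), 16) % N_SHARDS

print([range_shard(k) for k in keys])   # [0, 0, 0, 0, 0, 0]: one shard takes every write
print([hash_shard(k) for k in keys])    # hashed keys scatter across the shards
```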

Of course, hash partitioning also has a major disadvantage compared with range partitioning: range queries become inefficient. Whether to choose hash or range partitioning therefore depends on the application scenario.

Finally, note that once a shard key has been chosen it cannot be modified. If the application absolutely must change the shard key, the only option is to export the data, create a new database with the new shard key, and import the data again.

Three, The metadata server

All three sharding methods discussed above record some metadata: the mapping between data and nodes, node state, and so on. The server that holds this metadata is called the metadata server (metaserver); the name varies from system to system: master, config server, namenode, and so on.

The metadata server is like the human brain: if one hand stops working the person can still get by, but if the brain stops working the whole person is paralyzed. The metadata server therefore needs both high performance and high availability. For high performance it must be scalable, to cope with the growth of metadata.

High availability requires that the metadata server not be a single point of failure, which means keeping multiple replicas and being able to fail over to one quickly.

With multiple replicas, the question arises: how is data consistency across the replicas guaranteed?

Consistency and availability of multiple replicas are exactly the topics of CAP theory. Two approaches are briefly introduced here:

  • Approach 1: master-slave synchronization. A master server is elected first, and only the master serves external requests. The master persists metadata changes as a log to shared storage (NFS, for example); the slave servers read the log from shared storage and replay it, thereby reaching a state consistent with the master. If the master is detected to have failed (through heartbeats, for example), a new master is elected.

  • Approach 2: use a distributed consensus protocol to keep the replicas consistent, such as the famous Paxos protocol or the Raft protocol that is widely used in engineering practice. With such a protocol, the replicas as a group can serve external requests while strong consistency is guaranteed.

In HDFS, the metadata server is called the NameNode. In HDFS 1.0, the NameNode was a single point of failure: once it went down, the whole system stopped working. HDFS 2.0 solves this single-point problem.

In the figure above, NN stands for NameNode and DN for DataNode (the nodes that actually store data). As the figure shows, two NameNodes back each other up: one is in the Active state (the Active NameNode), the other in the Standby state, and only the Active NameNode provides read and write services.

Data synchronization between the Active NN and the Standby NN is done through shared storage, and the shared storage system underpins NameNode high availability. To guarantee strong metadata consistency, during a failover the new Active NN may start serving only after the metadata has been fully synchronized.

In addition, a ZooKeeper cluster monitors the NameNodes' status in preparation for failover. Under a network partition, ZooKeeper may decide that the original Active NN has failed and elect a new one while the original Active NN is in fact still serving. This leads to the “two masters” phenomenon, also known as split-brain. To solve it, a fencing mechanism is used: the old Active NameNode is isolated so that it can no longer serve external requests.

In MongoDB, the metadata server is called the config server. Since MongoDB 3.2, using three mirrored mongod instances as the config servers is no longer recommended; instead, a replica set serves as the config server. This strengthens the config server's consistency, and the number of mongod instances in the config server can grow from 3 up to the replica set limit (50 nodes), improving reliability.

  • In MongoDB 3.0 and earlier, metadata is read and written as follows:

When writing to the three config servers, a coordinator dispatches the same write commands to the three config servers and collects the results. Differing results indicate an inconsistent writes to the config servers and may require manual intervention.

The official MongoDB documentation doesn’t explain this process in detail, although on StackExchange it has been pointed out that the process is a two-phase commit.

  • MongoDB 3.2 and later use a replica set as the config server, with writeConcern: majority, readConcern: majority, and readPreference: nearest.

Even though the metadata server may consist of a group of physical machines, consistency among the replicas is guaranteed. But if every data request had to go through the metadata server, the pressure on it would be very high. In many scenarios metadata does not change often, so it can be cached on the access nodes; applications then use the cached metadata directly to locate data for reads and writes, relieving the metadata server.

In this setting, the cached metadata must stay consistent with the metadata on the metadata server: it has to be accurate, not stale. A counterexample is a cache such as DNS, where a somewhat stale entry is usually not a problem.

How can strong cache consistency be achieved? An obvious idea is to notify all cache servers (mongos, for example) immediately whenever metadata changes, but the problem is that such communication is delayed and unreliable.

Resolving inconsistencies:

A common idea is the version number. In network communication, for example, where a protocol may evolve, the two parties use a version number to agree on what they are speaking. The same idea applies to cache consistency: each request carries the version number of the cached metadata; after the request is routed to a specific node, that version is compared with the version of the actual data there. If the versions differ, the cached information is stale, and fresh metadata must be pulled from the metadata server and re-cached. MongoDB uses this approach for the mongos cache.
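A minimal sketch of version-number based cache validation (all class and method names are illustrative assumptions; a real implementation would do this over the network):

```python
class MetadataServer:
    def __init__(self):
        self.version = 1
        self.mapping = {"(0,500]": "N0", "(500,1000]": "N1"}

    def change_mapping(self, new_mapping):
        self.mapping = new_mapping
        self.version += 1                          # every change bumps the version

class DataNode:
    def __init__(self, meta):
        self.meta = meta                           # the node knows the current version

    def accepts(self, request_version):
        """Compare the version carried by a request against the real one."""
        return request_version == self.meta.version

class CachingRouter:
    def __init__(self, meta):
        self.meta = meta
        self.cached_version = meta.version
        self.cached_mapping = dict(meta.mapping)

    def route(self, node):
        if not node.accepts(self.cached_version):  # stale cache detected by the node
            self.cached_version = self.meta.version
            self.cached_mapping = dict(self.meta.mapping)
            return "cache was stale: refreshed from metadata server, retry"
        return "cache is current: request served"

meta = MetadataServer()
router, node = CachingRouter(meta), DataNode(meta)
print(router.route(node))                          # cache is current
meta.change_mapping({"(0,300]": "N0", "(300,1000]": "N2"})
print(router.route(node))                          # stale version detected, refreshed
```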

Another solution is the lease mechanism, from the paper “Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency”. Leases are widely used in distributed systems wherever some form of promise needs to be agreed on. Here is a brief description of the mechanism.

Lease mechanism:

The lease mechanism was proposed to solve the cache consistency problem in distributed storage systems. Note that in this section, to simplify the descriptions that follow, the metadata server is called the server and the cache server is called the client.

Key points:

  • Whenever the server hands data to clients for caching, it also issues a lease, which carries an expiry time.

  • The lease is the server's guarantee that the metadata will not change during the lease term.

  • Within the validity period, a client can use the cached metadata freely; once the lease expires, it may no longer use the cache and must ask the server again.

  • If an external request asks to modify the metadata (all metadata modification happens on the server), the server blocks the request until every issued lease has expired, then applies the modification and hands out the new metadata with fresh leases to the clients.

  • If the metadata has not changed, the server still needs to renew a client's lease before it expires (sending only a lease, with no data). A small sketch of this protocol follows the list.
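Assuming reasonably synchronized clocks and purely illustrative class and method names, a minimal sketch of the server blocking metadata changes until every issued lease has expired might look like this:

```python
import time

LEASE_SECONDS = 2.0

class LeaseServer:
    def __init__(self, metadata):
        self.metadata = metadata
        self.latest_expiry = 0.0                   # when the last issued lease runs out

    def fetch(self):
        """Hand out the metadata together with a lease."""
        expiry = time.time() + LEASE_SECONDS
        self.latest_expiry = max(self.latest_expiry, expiry)
        return self.metadata, expiry

    def modify(self, new_metadata):
        """Block until every issued lease has expired, then apply the change."""
        wait = self.latest_expiry - time.time()
        if wait > 0:
            time.sleep(wait + 0.01)                # small margin past the last expiry
        self.metadata = new_metadata

class LeaseClient:
    def __init__(self, server):
        self.server = server
        self.cache, self.expiry = server.fetch()

    def read(self):
        if time.time() >= self.expiry:             # lease expired: cache unusable
            self.cache, self.expiry = self.server.fetch()
        return self.cache

server = LeaseServer({"(0,1000]": "N0"})
client = LeaseClient(server)
print(client.read())                               # served from cache under a valid lease
server.modify({"(0,500]": "N0", "(500,1000]": "N1"})   # blocks until all leases expire
print(client.read())                               # lease expired, so the client re-fetches
```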

The title of the lease paper highlights fault tolerance. Where does that tolerance come from? The key is that once the server has issued the data and its lease, it does not care whether the client actually receives them: as soon as the lease expires, the server is free to modify the metadata. Moreover, because a lease is defined by an absolute expiry time (a timestamp), it does not matter if messages from the server to the client arrive late or are delivered more than once.

It is easy to see that this fault tolerance rests on the server's and the client's clocks being consistent:

  • If the server's clock runs behind the client's, the lease appears to the client to expire almost as soon as it is received, and the lease mechanism simply loses its benefit.

  • If the server's clock runs ahead of the client's, it is dangerous: the client keeps using its cache even though the server already considers the leases expired and has started updating the metadata. In practice, the server therefore usually treats the lease term as slightly longer than what it grants to the client.

  • To keep clocks consistent, the Network Time Protocol (NTP) is used to synchronize them.

A lease is a promise made by its issuer for a certain period of validity, and the promise can cover many things:

  • It can be the cache consistency guarantee discussed above;

  • For permission control, a lease can be issued to only one node at a time, and only the lease holder may modify the data;

  • In a primary-secondary architecture, a lease can be issued to one node, and only the lease holder has the primary role;

  • It can support node health monitoring, such as checking whether the primary is alive in a primary-secondary architecture (to be discussed in a later article).

In engineering practice, the lease mechanism has many applications:

  • A master node issues a lease to the primary replica of a chunk (as in GFS), and only that replica acts as primary for the chunk during the lease term.

  • In Chubby, the Paxos protocol is used to elect the primary node in a distributed fashion, and the secondary nodes then grant the primary a lease. The lease means: “we promise not to elect another primary during the lease period.”

  • In Chubby, the primary node also issues a lease to each client node; the lease is used to track client status, and a client may read and write data on the primary only while it holds a valid lease.

Four, Summary

To close, let's summarize the main points of the article.

This article introduced sharding-related issues in distributed systems, covering three distribution methods: hash, consistent hash, and range based, along with their advantages and disadvantages.

Sharding is done on a characteristic value, which should be chosen based on the application's access patterns; using MongoDB as an example, we showed how the characteristic value (MongoDB's shard key) influences data operations.

Shard information (metadata) needs dedicated servers to store it. The metadata server is the core of a distributed storage system, so its availability and reliability deserve particular attention. To relieve the metadata server, distributed systems cache metadata on other nodes; metadata caching in turn brings the challenge of consistency, for which the lease mechanism was introduced.

References:

1. Liu Jie: Introduction to Distributed System Principle:

http://blog.sciencenet.cn/home.php?mod=attachment&id=31413

2. Distributed Systems for Fun and Profit:

http://book.mixu.net/distsys/

3. Consistent hashing (Wikipedia):

https://en.wikipedia.org/wiki/Consistent_hashing

4. Hadoop NameNode High Availability Implementation Analysis:

https://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop-name-node/

5. Some Thoughts on CAP Theory and MongoDB Consistency and Availability:

http://www.cnblogs.com/xybaby/p/6871764.html

6. Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency:

http://web.eecs.umich.edu/~mosharaf/Readings/Leases.pdf

Author: xybaby

Source: www.cnblogs.com/xybaby/p/7076731.html