Preface

Centralized storage structure

Speaking of distributed storage, let’s take a look at what traditional storage looks like.

Traditional storage is also known as centralized storage. As the name suggests, the whole storage is concentrated in one system. That does not mean a single storage device, however; it is a system made up of multiple devices concentrated together. For example, the EMC storage in the image below requires several cabinets.

This storage system contains many components: besides the core head unit (controller), disk arrays (JBOD), and switches, there are auxiliary devices such as management devices.

The structure contains a head unit, which is the most central part of the storage system. Usually two controllers are installed in the head unit so that the storage system is protected against hardware failure. Its front-end ports provide storage services to servers, and its back-end ports expand the capacity of the storage system: more storage devices can be connected through the back-end ports to form a very large storage resource pool.

In the whole structure, the head unit is the core of the storage system, and the advanced functions of the storage system are implemented in it. The software in the controllers manages the disks, abstracts them into storage resource pools, and carves them into LUNs for servers. The LUNs are the disks you see on the server. Of course, some centralized storage systems are themselves also file servers that provide shared file services. From the above, however, we can see that the most important feature of centralized storage is that there is a unified entry point through which all data must pass, and that entry point is the head unit of the storage system.

This is the most significant difference between centralized storage and distributed storage.

As shown below:

Distributed storage

Distributed storage was first proposed by Google. Its purpose is to solve the problem of large-scale, highly concurrent web access using cheap servers. It adopts a scalable system structure, uses multiple storage servers to share the storage load, and uses location servers to locate stored information. It not only improves the reliability, availability, and access efficiency of the system, but is also easy to expand.

1. The rise of distributed storage

The rise of distributed storage is closely related to the development of the Internet. Internet companies usually use large-scale distributed storage systems because they have huge volumes of data but started with limited capital.

Unlike traditional storage built from high-end servers, high-end storage, and high-end processors, the distributed storage systems of Internet companies consist of large numbers of inexpensive, cost-effective ordinary PC servers connected over the network. There are three main reasons for this:

(1) Internet business grows rapidly and is cost-sensitive, so the storage system cannot rely on the traditional vertical scaling approach, that is, buying a minicomputer and, when that is not enough, a midrange machine or even a mainframe. The distributed systems at the back end of the Internet must support horizontal scaling, that is, adding ordinary PC servers to improve the overall processing capability of the system.

(2) Ordinary PC servers are cost-effective but have relatively high failure rates, so automatic fault tolerance must be implemented at the software level to ensure data consistency.

(3) In addition, with the continuous addition of servers, it is necessary to realize automatic load balancing at the software level, so that the processing capacity of the system can be linearly expanded.

2. The importance of distributed storage

From single machine single user to single machine multi-user, and then to the present network era, the application system has undergone a lot of changes. Distributed systems are still a hot topic of discussion. So, what do distributed systems bring to us, or why do we have distributed systems?

(1) The cost performance of upgrading single machine processing capacity is getting lower and lower;

Companies find it increasingly uneconomical to change hardware and scale vertically to improve performance;

(2) There is a bottleneck in single machine processing capacity;

At a certain point in time, a single processor has its own performance bottleneck, which means that even if you’re willing to pay more for computing power, you can’t buy it.

(3) For stability and availability

With a single-machine system, everything is fine while the machine is working, but if something goes wrong, the system becomes completely unusable. Of course, you can consider disaster recovery and backup schemes, and these schemes will make the system evolve into a distributed system;

(4) The inevitable requirements of cloud storage and big data development

Cloud storage and big data are applications built on distributed storage. Mobile terminals have limited computing power and storage space, and there is a strong need to share resources between multiple devices, which makes cloud storage applications such as web disks and photo albums quickly popular. However, the core of cloud storage is the back-end large-scale distributed storage system. Big data is a step further. It requires not only storing massive amounts of data, but also analyzing and extracting valuable parts of the data through appropriate computing frameworks or tools. Without distributed storage, big data analytics are out of the question. Careful analysis also reveals that distributed storage technology is the holy grail of the Internet’s back-end architecture, and mastering this skill makes it much easier to understand the nature of other technologies later on.

3. Types and comparisons of distributed storage

Distributed storage includes various types: distributed file systems, distributed block storage, and distributed object storage in the traditional sense, as well as distributed databases and distributed caches. But there are no more than three basic architectures:

A. Intermediate control node architecture

The Hadoop Distributed File System (HDFS) is a typical example. In this architecture, one kind of node, the NameNode, stores management data (metadata), while another kind of node, the DataNode, stores the business data and is responsible for managing it. The structure is like a company's hierarchy: the NameNode is the boss, managing only its subordinate managers (the DataNodes), and each DataNode manages the data stored locally on its own node.
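
To make the division of labor concrete, here is a minimal, purely illustrative sketch (not the real HDFS API; all names are invented) of a central metadata node that only remembers which data nodes hold each file's blocks, while the data nodes hold the actual bytes:

```python
# Minimal sketch of a central-metadata architecture (illustrative, not the real HDFS API).
# The NameNode-like object stores only metadata: file name -> list of (block_id, data_node).
# DataNode-like objects store the actual block contents.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    def __init__(self, data_nodes):
        self.data_nodes = data_nodes
        self.metadata = {}        # file name -> list of (block_id, data_node_name)

    def write_file(self, path, data, block_size=4):
        locations = []
        for i in range(0, len(data), block_size):
            block_id = f"{path}#{i // block_size}"
            node = self.data_nodes[i // block_size % len(self.data_nodes)]
            node.write_block(block_id, data[i:i + block_size])
            locations.append((block_id, node.name))
        self.metadata[path] = locations

    def read_file(self, path):
        by_name = {n.name: n for n in self.data_nodes}
        return b"".join(by_name[name].read_block(bid)
                        for bid, name in self.metadata[path])


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
nn = NameNode(nodes)
nn.write_file("/logs/a.txt", b"hello distributed storage")
assert nn.read_file("/logs/a.txt") == b"hello distributed storage"
```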

B. Fully decentralized architecture: computed data placement

Ceph is the typical representative of this architecture. The difference from HDFS is that there is no central node: clients communicate directly with the storage nodes, which avoids the performance bottleneck of a central node.

As shown in the preceding figure, the core components of the Ceph storage system include the MON service, OSD service, and MDS service.

(1) The MON service is used to maintain the hardware logical relationship of the storage system, including online information such as servers and disks. The MON service ensures the availability of its services by clustering.

(2) The OSD service implements disk management and real data read and write. Usually, one disk corresponds to one OSD service.

(3) MDS only tracks the file hierarchy and storage metadata of CephFS. Ceph block devices and RADOS do not require metadata, and therefore do not require the Ceph MDS daemon

(4) RADOS: RADOS is the Ceph storage cluster composed of the services above. All data in Ceph is stored in the form of objects, and the RADOS object store is responsible for storing these objects regardless of their data type. The RADOS layer ensures that data always remains consistent; to do this it performs data replication, fault detection and recovery, and data migration and rebalancing across cluster nodes.

(5) RBD (block device): the RADOS Block Device, which provides reliable, distributed, high-performance block storage disks to clients.

(6) CephFS: Ceph file system provides a POSIX-compatible file system that uses the Ceph storage cluster to store user data

(7) Librados: the librados library provides a convenient way to access the RADOS interface from PHP, Ruby, Java, Python, C++ and other languages.

(8) RADOS GW: RGW provides the object storage service, which allows applications to connect to Ceph object storage. RGW offers RESTful APIs compatible with Amazon S3 and OpenStack Swift.

The client accesses the storage system through the RADOS GW, obtains the storage resource layout information from the MON service, and computes the location of the desired data (including the physical server and disk) based on the layout information and the name of the data to be written. It then communicates directly with the OSD at that location to read or write the data.

C. Fully decentralized architecture: consistent hashing

Swift is the typical representative of this architecture. Different from Ceph's way of computing the data location, this approach obtains the data location through consistent hashing: the devices are arranged into a hash ring, and the hash value computed from the data name is mapped to a position on the ring, thereby locating the data.

Swift uses two mapping relationships. For a file, the hash algorithm (MD5) first finds the corresponding virtual node (a one-to-one mapping); then the ring's virtual-node-to-device mapping (a two-dimensional array) finds the corresponding devices (a many-to-many mapping). These two mappings together determine where a file is stored on the devices.

D. Comparison of distributed storage

So now the question is, if we’re going to choose distributed storage, which one is better? In fact, they each have their own advantages and use scenarios, depending on the specific needs.

(1) HDFS

It is used in big data storage scenarios and is a storage component in the Hadoop big data architecture. At the beginning of HDFS design, it has been clear that its application scenario is big data service. Main application scenarios are as follows:

A. High performance for storing large files, such as files of hundreds of megabytes or several gigabytes. HDFS manages files through metadata, and the metadata for directories and blocks is kept in NameNode memory, so a growing number of files consumes a large amount of NameNode memory. If a large number of small files exist, a lot of memory is occupied and distributed storage performance deteriorates; therefore HDFS is recommended for storing large files.

B. Suitable for write-once, read-many workloads. In big data analytics, data is typically written once and then read and analyzed many times. HDFS offers high data transmission throughput but relatively high read latency, and it is not suitable for frequent data writes.

C. HDFS uses a multi-copy data protection mechanism, so ordinary x86 servers are enough to ensure data reliability; it is not recommended to run HDFS in a virtualized environment.

(2) Ceph

At present, the most widely used open source distributed storage system has been supported by many manufacturers, and the distributed storage of many hyper-converged systems is deeply customized based on Ceph. Ceph has become a “standard” for LINUX and OpenStack to support their respective storage systems. Ceph provides object storage, block device storage, and file system storage services. It is rare for a distributed storage system to support three types of storage services.

A. Ceph does not use HDFS's metadata addressing scheme; instead it uses the CRUSH algorithm, which gives balanced data distribution and high parallelism. In addition, block storage supports strong data consistency, providing an experience similar to traditional centralized storage.

B. Object storage service: Ceph supports the Swift and S3 APIs. Block storage supports thin provisioning, snapshots, and cloning. The file system storage service supports POSIX interfaces and snapshots. However, Ceph's file storage performance is only comparable to other distributed storage systems, deployment is somewhat complex, and performance is also somewhat weaker, so Ceph is generally used for block and object storage.

C. Ceph is a decentralized distributed solution that must be planned and designed in advance and places high demands on the technical team. In particular, during Ceph expansion the performance of the storage system degrades while data is rebalanced.

(3) Swift

Swift mainly targets object storage, similar to the object storage service provided by Ceph, and is mainly used to solve the problem of storing unstructured data. Its main differences from Ceph's object storage service are:

A. When a client accesses the object storage service, Swift requires the client to go through a Swift gateway to obtain the data, whereas Ceph lets the client obtain data directly from the OSDs (object storage devices) running on each storage node, with no single entry point, which makes it more flexible than Swift.

B. In terms of data consistency, Swift provides eventual consistency and is more efficient at processing massive amounts of data, so it mainly targets object storage scenarios with low consistency requirements but high processing-efficiency requirements. Ceph is always strongly consistent across the cluster. A typical scenario: in OpenStack, the object storage service usually uses Swift rather than Ceph.

Analysis of distributed theory

1. Consistency and availability

Because failures can occur, distributed storage systems are designed to store data redundantly in multiple copies, each of which is called a replica. This way, when a node fails, data can be read from the other replicas. It is fair to say that replicas are the only means of fault tolerance in distributed storage systems. And because multiple replicas exist, how to keep them consistent is the theoretical core of the whole distributed system.

The term data consistency comes up constantly in everyday development and in all kinds of articles; we often hear that some data inconsistency caused a loss and had to be fixed quickly. So how many kinds of consistency are there?

A. Time consistency: data of all data components is required to be completely consistent at any time;

B. Transactional consistency: transactional consistency only needs to hold before a transaction starts and after it completes; during the transaction the data may be inconsistent. For example, A transfers 100 yuan to B: A is debited 100 and B is credited 100, and both accounts are guaranteed to be correct before the transaction starts and after it completes; that is transactional consistency. During the transaction, however, there may be a moment when A has been debited 100 yuan but B has not yet been credited, which is inconsistent.

C. There are many different single-machine transactions involved in the application, and the data is completely consistent only before and after all the single-machine transactions are completed.

It is difficult to describe clearly only by these three kinds of consistency in some complex situations. Therefore, we have introduced the consistency model. Here we briefly introduce several common consistency models from strong to weak.

A. Linear consistency

Also known as strong consistency, it can be seen as having only one single processor, or it can be seen as having only one copy of the data, and all operations are atomic.

As shown in the figure above, for events E1 and E2, if the response of event E1 comes before the invocation of event E2, we say that E1 happens before E2.

For the same thread, earlier events must happen before later events. For two events on different threads, however, there is a happens-before relationship between them only if they do not overlap on the timeline. For events that do overlap, such as event2 and event3 in the figure below, there is no happens-before relationship between them, and their order may be arbitrary in any legal execution order we are looking for.

B. Sequential consistency

Sequential consistency is weaker than strict consistency. Writes to variables do not have to be seen instantaneously; however, writes to variables by different processors must be seen in the same order on all processors, where processors can be replaced by different nodes in a distributed system.

Suppose two threads A and B execute concurrently. Thread A consists of three operations whose order in the program is A1->A2->A3; thread B also has three operations whose order in the program is B1->B2->B3. Assume the execution results are as shown in the two figures above under the sequential consistency model.

C. Causal consistency

Causal consistency is a consistency model weaker than sequential consistency: sequential consistency requires that the order of all operations follow the program order of each single processor (node), while causal consistency only requires that causally related operations be sequentially consistent.

Simply put, if someone asks you a question and you give the answer, that is causality; but if the answer appears before the question, causality is violated. For a simple example, suppose node 1 updates data A, and node 2 reads data A and then updates data B. Data B may be computed from data A, so there is a causal relationship between them. If node 3 sees the update to data B before it sees the update to data A, causal consistency is broken.

D. Eventual consistency

In fact, apart from strong consistency, the other models can all be regarded as forms of eventual consistency, and many concrete consistency models have been derived from the different requirements of different applications. The simplest eventual consistency does not care about the order of intermediate changes; it only guarantees consistency at some point in time. How long that takes must be measured per system and per business. Until eventual consistency is reached, reads may return any value, with no guarantee of order.

E. Availability

Availability means “Reads and writes always succeed”, that is, the service is always available and the response time is normal. For an available distributed system, every non-failing node must respond to every request. So, when we measure the availability of a system, we usually measure it by downtime.

Usually, when we describe the availability of a system, we say, for example, that Taobao's system availability reaches five nines, meaning its availability level is 99.999%, that is, the annual downtime does not exceed (1 - 0.99999) * 365 * 24 * 60 = 5.256 minutes, which is a very demanding requirement.
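
As a quick check of the arithmetic, the sketch below (plain Python, nothing beyond the formula in the text) converts an availability level into the maximum downtime allowed per year:

```python
# Downtime allowed per year for a given availability level.
def max_downtime_minutes_per_year(availability):
    return (1 - availability) * 365 * 24 * 60

for nines, availability in [(3, 0.999), (4, 0.9999), (5, 0.99999)]:
    print(f"{nines} nines ({availability:.5f}): "
          f"{max_downtime_minutes_per_year(availability):.2f} min/year")
# 5 nines -> about 5.26 minutes of downtime per year
```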

Good availability means that the system provides good service to users, without failures or access timeouts that hurt the user experience. A distributed system involves many upstream and downstream components, such as load balancers, web servers, application code, and database servers; instability in any node can affect availability.

F. Consistency of distributed systems

In July 2000, Professor Eric Brewer of the University of California, Berkeley proposed the CAP conjecture at the ACM PODC conference. Two years later, Seth Gilbert and Nancy Lynch of MIT proved it theoretically, and CAP became an accepted theorem of distributed computing.

Overview of CAP theory: A distributed system can satisfy at most two of Consistency, Availability and Partition tolerance at the same time.

Note that consistency in CAP is all nodes see the same data at the same time.

Consistency must be seen in two dimensions:

(1) From the client's point of view, with multiple processes accessing data concurrently, the requirement that updated data be visible to all subsequent accesses (as in a non-distributed database) is strong consistency;

(2) From the perspective of the server, how to distribute the updated data to the whole system as soon as possible and reduce the time window to achieve the final consistency is a very important aspect to improve the system availability and user experience.

Refer to the following formula:

N – Number of data copies to be replicated

W – Number of nodes whose data needs to be written during data update

R – The number of nodes that need to be read while reading data

(1) If W+R>N, the written and read node sets overlap, so this is strong consistency. For example, N=2, W=2, R=1 is typical of a relational database with synchronous replication between one primary and one standby: the data read from either the primary or the standby is the same.

(2) If W+R<=N, the consistency is weak. For example, N=2, W=1, R=1 corresponds to asynchronous replication between one primary and one standby relational database: if the data is read from the standby, it may not yet contain the primary's latest updates, so the consistency is weak.
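
A hypothetical sketch of the quorum rule: a write that reaches W replicas and a read that consults R replicas are guaranteed to overlap on at least one replica whenever W + R > N, so the read always sees the latest value. The replica structure and selection policy here are invented for illustration.

```python
# Illustrative quorum read/write over N replicas (not any specific product's API).
import random

class Replica:
    def __init__(self):
        self.version, self.value = 0, None

def write(replicas, w, value, version):
    # A real system would pick W reachable replicas; here we just take the first W.
    for r in replicas[:w]:
        r.version, r.value = version, value

def read(replicas, r_count):
    # Read R replicas (possibly including stale ones) and keep the highest version seen.
    sampled = random.sample(replicas, r_count)
    best = max(sampled, key=lambda r: r.version)
    return best.value

N, W, R = 3, 2, 2          # W + R > N  =>  strong consistency
replicas = [Replica() for _ in range(N)]
write(replicas, W, "v1", version=1)
assert read(replicas, R) == "v1"   # any R-sized read set overlaps the written set
```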

For a distributed system, P is a basic requirement. Among the three CAP properties we can only trade off between C and A, while doing our best to strengthen P.

There are two kinds of systems: CP without A, and AP without C.

HDFS, Ceph, and Swift mentioned above all belong to the CP-without-A category, but they are not completely without A: to achieve a certain level of availability, the number of replicas is generally set to N>=3. Different combinations of N, W, and R strike different balances between availability and consistency to suit different application scenarios.

In real life there are also AP-without-C cases. As shown in the CAP figure, most of them are NoSQL databases such as CouchDB and Cassandra. In what scenarios are they used?

In fact, these are scenarios that do not demand strict correctness, such as flash sales of phones at a certain vendor or buying train tickets on 12306. The page may show stock while you browse, but a few seconds later, when you have picked the item and go to place the order, the system tells you the order failed because the item is sold out. This is essentially guaranteeing A (availability) first and sacrificing some data consistency.

2. Data distribution

Distributed systems differ from traditional single-machine systems in that they can distribute data to multiple nodes and balance the load across them. There are two main ways of distributing data. One is hash distribution, such as consistent hashing; representative systems are Amazon Dynamo and OpenStack Swift. The other is sequential distribution, in which the data in each table is globally ordered by primary key; the representative system is Google Bigtable, which divides a large table into ordered ranges based on the primary key, each ordered range being a sub-table.

A. Hash distribution (Swift)

A hash function with good hashing properties can distribute data fairly evenly across the cluster. Moreover, the metadata needed for hashing is very simple: each node only needs to know the hash function and the number of servers (the modulus) to compute which machine any piece of data belongs to.

However, it is hard to find a hash function with good hashing properties. If the record's primary key is hashed, data belonging to the same user ID may be spread across multiple servers, making it difficult to operate on several of that user's records at once; if the user ID is hashed, "data skew" easily occurs, that is, a few large users hold a large amount of data, and those users are always handled by the same server no matter how large the cluster is.

There are two ways to deal with the problem of large users. One is manual splitting. That is, large users in the offline system are identified (for example, a MapReduce job is run) and the large users are split to multiple servers based on their data volume. This is a hash distribution based on special treatment for these large users;

Another approach is automatic splitting, which means that the data distribution algorithm can be adjusted dynamically to automatically split the data of large users across multiple servers. There are two algorithms involved.

One is the traditional hash algorithm. When accessing data, the hash value is calculated first, and then the metadata server is queried to obtain the server corresponding to the hash value. Under this algorithm, the online and offline servers will lead to a large amount of data migration, which is not suitable for production.

The other is the Distributed Hash Table (DHT) algorithm. Its idea is as follows: each node in the system is assigned a random token, and these tokens form a hash ring. When storing data, the hash value of the primary key is computed, and the data is stored on the node holding the first token encountered clockwise whose value is greater than or equal to that hash. The advantage of consistent hashing is that adding or removing a node only affects its neighbors on the hash ring and does not affect the other nodes.

As shown above, the algorithm divides the disks into fairly even virtual partitions, each of which is a node on the hash ring. The ring ranges from 0 up to a maximum of 2^32 and wraps around end to end. When the hash value of the data (or of the data name) is computed, it inevitably falls into some interval of the ring; moving clockwise from there, a node is always found, and that node is where the data is stored. If there is only one node, then even if no node is found before reaching 2^32, the data wraps around and lands on that single node.
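
A minimal sketch of the token ring described above (Dynamo/DHT style; the node names, token counts, and key names are made up for illustration). Each node gets several tokens on a 0 to 2^32 ring, and a key is stored on the node owning the first token clockwise from the key's hash:

```python
# Illustrative consistent-hash ring with virtual nodes (tokens).
import bisect
import hashlib

RING_SIZE = 2 ** 32

def hash32(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

class HashRing:
    def __init__(self, nodes, vnodes_per_node=32):
        self.tokens = []              # sorted list of (token, node)
        for node in nodes:
            for i in range(vnodes_per_node):
                self.tokens.append((hash32(f"{node}#{i}"), node))
        self.tokens.sort()
        self._token_values = [t for t, _ in self.tokens]

    def locate(self, key):
        # First token clockwise whose value is >= hash(key); wrap around at the end.
        h = hash32(key)
        idx = bisect.bisect_left(self._token_values, h)
        if idx == len(self.tokens):
            idx = 0
        return self.tokens[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.locate("account/container/object-1"))
print(ring.locate("account/container/object-2"))
```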

The whole data-location process uses the consistency algorithm above to route a request to the device that should handle it:

(1) In object storage, the account name/container name/object name together form a location identifier, and a unique integer number can be computed from this identifier;

(2) For the storage devices, Swift builds a virtual partition table whose size is determined when the cluster is created (usually hundreds of thousands of entries); this table is in fact an array;

(3) Using the consistent hashing algorithm, the integer value is mapped to a position in the array, which determines where the data is placed.
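
The three steps above can be sketched roughly as follows. This is a simplified, hypothetical ring, not Swift's actual implementation: the partition count, device names, and device-assignment rule are invented for illustration.

```python
# Simplified Swift-style ring: object name -> integer -> partition -> devices.
import hashlib

PART_POWER = 16                          # 2**16 = 65536 partitions (illustrative)
REPLICAS = 3
DEVICES = [f"dev{i}" for i in range(12)]  # pretend cluster of 12 disks

# Step (2): virtual partition table built when the cluster is created.
# partition_table[replica][partition] = device
partition_table = [
    [DEVICES[(part * REPLICAS + r) % len(DEVICES)] for part in range(2 ** PART_POWER)]
    for r in range(REPLICAS)
]

def locate(account, container, obj):
    # Step (1): the unique location identifier is hashed to an integer.
    name = f"/{account}/{container}/{obj}"
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    # Step (3): the integer is mapped to a position (partition) in the array.
    partition = h % (2 ** PART_POWER)
    return partition, [partition_table[r][partition] for r in range(REPLICAS)]

print(locate("AUTH_demo", "photos", "cat.jpg"))
```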

In principle, consistency algorithm can ensure the balance and monotony of data, avoid the dispersion of data, effectively ensure the consistency of data, so that the load is mapped to a specific cache as far as possible.

When there are too few service nodes, it is easy to cause data skew due to the uneven distribution of nodes. Therefore, in practical applications, the number of virtual nodes is usually set to a value larger than 32, so even data distribution can be achieved even with a few service nodes.

B. Sequential distribution (Bigtable)

Hashing destroys the ordering of the data and supports only random reads, not sequential scans. Some systems compromise at the application layer: for example, Internet applications often partition data by user and distribute it by hashing the user ID, so that the same user's data lands on the same storage node; sequential scans within one user's data are then possible, and operations spanning multiple users are handled by the application layer. In addition, some users' data may be very large; because one user's data is confined to a single storage node, the parallel processing capability of the distributed storage system cannot be exploited.

Sequential distribution is common in the distributed table system (Bigtable). In general, a large table is divided into sequential ranges. Each range is called a sub-table.

As shown in the figure, the primary key of the User table ranges from 1 to 7000. In the distributed storage system, the User table is divided into multiple sub-tables corresponding to the ranges 1 to 1000, 1001 to 2000, and so on up to 6001 to 7000. To support a larger cluster scale, the Meta table splits the original index into two levels: the Meta table records which node each User sub-table resides on, thereby reducing the burden on the Root node.

Sequential distribution resembles a B+ tree data structure: each sub-table is equivalent to a leaf node. As data is inserted and deleted, some sub-tables may become large and others small, so the data is no longer evenly distributed. With sequential distribution, the system design must therefore consider the splitting and merging of sub-tables, which greatly increases system complexity.
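
A small sketch of range-based placement in the spirit of Bigtable: sub-table boundaries are kept sorted, and a primary key is routed to the sub-table whose range contains it. The key ranges and node names are illustrative, not from any real deployment.

```python
# Illustrative sequential (range) distribution: primary key -> sub-table -> node.
import bisect

# Sub-tables cover [1,1000], [1001,2000], ..., [6001,7000]; upper bounds kept sorted.
upper_bounds = [1000, 2000, 3000, 4000, 5000, 6000, 7000]
subtable_nodes = ["node1", "node2", "node3", "node1", "node2", "node3", "node1"]

def locate(primary_key):
    idx = bisect.bisect_left(upper_bounds, primary_key)
    if idx == len(upper_bounds):
        raise KeyError(f"key {primary_key} outside table range")
    return idx, subtable_nodes[idx]

print(locate(999))    # (0, 'node1')  first sub-table
print(locate(6001))   # (6, 'node1')  last sub-table
```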

C. CRUSH distribution

The CRUSH algorithm, whose full name is Controlled, Scalable, Decentralized Placement of Replicated Data, is technically also based on hashing; it simply uses a different mapping from consistent hashing. We use Ceph's data distribution process to illustrate.

Ceph's data distribution process: first compute the hash value of data X and take it modulo the number of PGs to obtain the PG number corresponding to data X. Then the CRUSH algorithm maps that PG to a set of OSD nodes. Finally, data X is stored on the OSDs corresponding to the PG. Note: PG stands for Placement Group.
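
A rough sketch of the two-step placement just described: data name to PG by hashing modulo the PG count, then PG to an ordered set of OSDs. The second mapping here is only a stand-in (a deterministic pseudo-random choice seeded by the PG id), not the real CRUSH algorithm, and the PG count and OSD names are invented.

```python
# Illustrative two-step placement in the spirit of Ceph (NOT the real CRUSH algorithm).
import hashlib
import random

PG_NUM = 128                 # number of placement groups (illustrative)
OSDS = [f"osd.{i}" for i in range(10)]
REPLICAS = 3

def object_to_pg(name):
    # Step 1: hash the object name and take it modulo the PG count.
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % PG_NUM

def pg_to_osds(pg_id):
    # Step 2: map the PG to a set of OSDs. Real Ceph uses CRUSH and the cluster map;
    # here we just make a deterministic pseudo-random choice seeded by the PG id.
    rng = random.Random(pg_id)
    return rng.sample(OSDS, REPLICAS)

name = "rbd_data.abc123.0000000000000001"
pg = object_to_pg(name)
print(f"object {name!r} -> pg {pg} -> osds {pg_to_osds(pg)}")
```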

This process involves two mappings. The first is the mapping from data X to a PG; if PGs were regarded as storage nodes, this would be the same as the traditional hash algorithm. The difference is that PGs are abstract storage nodes that do not increase or decrease as physical nodes join or leave, so the mapping from data to PGs is stable.

By analogy with Dynamo, the PG plays two roles. The first is to partition the data: each PG manages a data interval of the same size, so data can be distributed evenly across PGs. The second is to play the role of Dynamo's token, that is, to determine the partition location. This follows the same principle as fixing the number of partitions in Dynamo and keeping the number of partitions equal to the number of virtual nodes.

Taking Ceph as an example, CRUSH algorithm calculates the location of data storage through two mappings to determine how to store and retrieve data. CRUSH enables Ceph clients to communicate directly with OSDs, rather than through a centralized server or agent.

Through algorithmically determined data storage and retrieval, Ceph avoids single points of failure, performance bottlenecks, and physical limits on its scalability. CRUSH requires a map of the cluster and uses that CRUSH map to pseudo-randomly store and retrieve data on OSDs evenly distributed across the cluster.

3. Replicas

To ensure high reliability and availability of distributed storage systems, multiple copies of data are stored in the system. When a storage node where a copy resides fails, the distributed storage system can automatically switch services to other copies to achieve automatic fault tolerance. Distributed storage systems use replication protocols to synchronize data to multiple storage nodes and ensure data consistency among multiple replicas.

A. Strong synchronous replication

The client sends write requests to the primary replica, and the primary replica copies the writes to the other backup replicas. The common practice is to synchronize the operation log (commit log): the primary replica sends the operation log to the backup replicas, which replay it and notify the primary when they are done. The primary replica then applies the modification on its own machine and, only after all these steps are complete, notifies the client that the write succeeded. The replication protocol shown in the figure below returns success to the client only after primary-backup synchronization succeeds; this is called a strong synchronization protocol.

Assume that the total number of replicas is N and N>2, that is, there is more than one backup replica. When implementing the strong synchronization protocol, the primary replica can send the operation log to all backup replicas concurrently and wait for their replies; as long as at least one backup replica returns success, the client can be told that the operation succeeded. The advantage of strong synchronization is that if the primary replica fails, at least one backup replica holds complete data, and the distributed storage system can automatically switch services to the most up-to-date backup replica without worrying about data loss.
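
A toy sketch of the strong-synchronization rule described above: the primary forwards each operation to all backups and acknowledges the client only after at least one backup has confirmed. The class names and structure are invented for illustration, not taken from any real system.

```python
# Illustrative strong synchronous replication: primary + backups sharing an operation log.
class Node:
    def __init__(self, name):
        self.name = name
        self.log = []
        self.data = {}

    def apply(self, op):
        key, value = op
        self.log.append(op)
        self.data[key] = value
        return True                      # ack

class Primary(Node):
    def __init__(self, name, backups, min_acks=1):
        super().__init__(name)
        self.backups = backups
        self.min_acks = min_acks

    def write(self, key, value):
        op = (key, value)
        acks = sum(1 for b in self.backups if b.apply(op))
        if acks < self.min_acks:
            raise RuntimeError("not enough backup acks, write rejected")
        self.apply(op)                   # apply locally after backups confirm
        return "success"                 # only now is the client told it succeeded

backups = [Node("backup-1"), Node("backup-2")]
primary = Primary("primary", backups)
print(primary.write("photo:42", b"...bytes..."))
assert all(b.data.get("photo:42") for b in backups)
```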

B. Asynchronous replication

The replication mode opposite to strong synchronization is asynchronous replication. In asynchronous mode, the primary replica does not need to wait for responses from the backup replicas; as soon as the local modification succeeds, the client is notified that the write succeeded. The primary replica then pushes the modification to the other replicas through an asynchronous mechanism, such as a separate replication thread. The benefit of asynchronous replication is good availability, but consistency is poor: if the primary replica fails irrecoverably, the most recent updates may be lost.

C. NWR replication

Distributed storage systems may also use a replicated-write protocol based on writing to multiple storage nodes, often called NWR: N is the number of replicas, W is the number of replicas a write operation must reach, and R is the number of replicas a read operation must consult.

In the NWR protocol, replicas are not divided into primary and backup. The client writes data to W replicas and reads from R replicas according to certain policies. As long as W+R>N, a read is guaranteed to see at least one replica containing the latest update. The problem with this protocol is that different replicas may apply operations in different orders, so reads from multiple replicas may see conflicts. This approach is rare in real systems and is not recommended.

4. Distributed protocols

There are many distributed protocols, among which two-phase commit and Paxos are the most representative. The two-phase commit protocol (2PC) or three-phase commit protocol (3PC) is used to guarantee atomicity of operations across multiple nodes, that is, operations across multiple nodes either succeed or fail on all nodes. The Paxos protocol is used to ensure that multiple nodes agree on a vote, such as which node is the primary node.

A. Two-phase commit

The idea of two-phase commit can be summarized as follows: the participants tell the coordinator whether their operation succeeded or failed, and the coordinator then decides, based on the feedback from all participants, whether to commit the operation or abort it.

(1) Request phase (voting):

The coordinator notifies the transaction participants that it intends to commit or abort the transaction, and the voting process begins. During voting, each participant informs the coordinator of its decision: agree (its local execution succeeded) or abort (its local execution failed).

(2) Commit phase (execution):

In this phase, the coordinator makes a decision based on the results of the first phase: commit or abort. If and only if all participants voted to commit does the coordinator notify all participants to commit the transaction; otherwise the coordinator notifies all participants to abort it. After receiving the coordinator's message, each participant performs the corresponding operation.
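
A condensed sketch of the two phases: the coordinator collects votes and commits only if every participant voted yes. The participants and their local success/failure flags are invented for illustration.

```python
# Illustrative two-phase commit: vote phase, then commit/abort phase.
class Participant:
    def __init__(self, name, will_succeed=True):
        self.name = name
        self.will_succeed = will_succeed
        self.state = "init"

    def prepare(self):
        # Phase 1: try the work locally and vote.
        self.state = "prepared" if self.will_succeed else "abort-voted"
        return self.will_succeed

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]       # phase 1 (voting)
    if all(votes):
        for p in participants:
            p.commit()                                 # phase 2 (commit)
        return "committed"
    for p in participants:
        p.abort()                                      # phase 2 (abort)
    return "aborted"

print(two_phase_commit([Participant("db1"), Participant("db2")]))          # committed
print(two_phase_commit([Participant("db1"), Participant("db2", False)]))   # aborted
```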

(3) Problems that two-phase commit cannot solve

A) If a participant is slow to vote, the whole phase is stuck waiting, but this can be solved with a timeout mechanism;

B) When the coordinator fails and a participant fails at the same time, two-phase commit cannot guarantee the integrity of the transaction execution.

Consider that the coordinator goes down after issuing a COMMIT message, and the only participant who received the message goes down at the same time.

So even if the coordinator creates a new coordinator by election agreement, the status of the transaction is uncertain, and no one knows whether the transaction has been committed.

B. Three-phase commit

Three-phase commit has three phases: CanCommit, PreCommit, and DoCommit

(1) The CanCommit phase roughly corresponds to the request phase of two-phase commit, and DoCommit roughly corresponds to its commit phase. The PreCommit preparation phase is a buffer that ensures the states of the participating nodes are consistent before the final commit phase.

(2) Three-phase commit inserts a PreCommit phase between the first and second phases of two-phase commit. This resolves the problem in the original protocol where, after voting, a coordinator crash or error could leave participants in limbo, not knowing whether to commit or abort and therefore potentially blocked for a long time.

(3) Problems that three-phase commit cannot solve

Suppose the coordinator sends a commit request after entering the PreCommit phase, but only one participant receives it and performs the commit, while the other participants, having missed the message because of a network interruption, choose to abort according to 3PC. In this case the system state becomes inconsistent.

C. Paxos

The story of Paxos begins with the Byzantine generals problem. Byzantium was the capital of the Eastern Roman Empire, in what is now Istanbul, Turkey. Because the Byzantine Empire's territory was vast, for defensive purposes each army was stationed far apart, and the generals had to rely on messengers to carry messages between them. In times of war, all the generals of the Byzantine army had to agree on whether they had a chance of winning before attacking the enemy's camp. But the army might contain traitors and enemy spies, traitorous generals who would disrupt or distort the decision-making process. With some members known to be rebelling, how the remaining loyal generals could reach agreement without being influenced by the traitors is the Byzantine generals problem.

We negate the hypothesis and give the definition of non-Byzantine model:

(1) Consistency modules may execute at any speed and are allowed to fail; after a failure they may restart and run again;

(2) Consistency modules communicate by sending messages asynchronously. Communication may take arbitrarily long, messages may be lost in transit, and the same message may be sent repeatedly, with multiple messages arriving in any order. But one thing holds: messages cannot be tampered with.

From this we obtain the two basic phases of Paxos: the Prepare phase and the Accept phase. The logic of these two phases is very complex and is the foundation of trustworthy consensus algorithms. This article does not attempt a deep interpretation; interested readers can refer to the book Blockchain Algorithms.

D. Raft protocol

Raft is short for Replicated And Fault Tolerant, a simplified version of Paxos.

In a distributed system, what is the most correct posture to improve system robustness, availability, and data security? More backups, of course. More backups of services, more backups of data, removing single points, making sure that even if some of the related components fail, the system can still provide health services.

It is good to remove single points and have no fixed authority, but the problem is that it is quite difficult to do so in an environment where information can be lost.

When Paxos came out in 1990, few people understood it. After many rounds of simplification and re-explanation by its author, and after practical re-implementation and re-interpretation by teams such as Google, more than ten years passed before it gradually became a de facto standard and was understood and accepted. Even now, few people can truly understand the extremely abstract Paxos algorithm.

Simplicity is the way! Raft's goal is precisely to build a distributed consensus protocol that is easy to understand and easy to implement, while guaranteeing, on that basis, that the theory is correct.

Reading the Raft protocol from the paper takes some time; this article explains it in a more informal way.

Raft is a leader-based consensus algorithm built around leader election; each node in a cluster has three possible roles:

(1) leader

The gateway for client communication and the initiator of internal data synchronization; a cluster normally has only one leader node.

(2) follower:

Non-leader nodes passively accept data requests from the Leader

(3) candidate:

A temporary role that exists only during the leader election phase. If a node wants to become the leader, it initiates a vote request and becomes a candidate at the same time. If the election succeeds, it becomes the leader; otherwise it reverts to being a follower.

The algorithm consists of two processes: leader election and log replication:

(1) Election process (suppose there are five nodes, S1~S5):

A. In the initial state, all nodes are equal followers, so whom should they follow? Each follower maintains an internal randomized timer while everyone is itching to act;

B. If nothing has contacted a follower before its timer expires, it becomes a candidate and sends a RequestVote to the others; suppose S1 and S3 become candidates;

C. For candidates in the same condition, followers adopt a first-come, first-voted strategy. If more than half of the followers agree that S3 is a suitable leader, then congratulations, a new leader is born;

D. S1, unfortunately, got no votes, so it has to honestly change back to the role of follower;

E. Likewise, if the followers hear nothing from the leader within their timer period, the leader is probably down, as shown in the figure below; all the followers become restless again and a new round of election begins (a minimal sketch of one election round follows).
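
The sketch below is deliberately simplified: one round of randomized-timeout election with no terms and no log comparison, both of which real Raft requires. Node names and timeout ranges are invented for illustration.

```python
# Illustrative Raft-style election round: the node whose timer fires first asks for votes,
# and becomes leader if it gets a majority. Real Raft also tracks terms and log freshness.
import random

class RaftNode:
    def __init__(self, name):
        self.name = name
        self.role = "follower"
        self.timeout = random.uniform(150, 300)   # ms, randomized election timeout

    def request_vote(self, candidate_name):
        # A follower grants its vote to the first candidate that asks in this round.
        return self.role == "follower"

def run_election(nodes):
    candidate = min(nodes, key=lambda n: n.timeout)   # its timer fires first
    candidate.role = "candidate"
    votes = 1 + sum(1 for n in nodes
                    if n is not candidate and n.request_vote(candidate.name))
    candidate.role = "leader" if votes > len(nodes) // 2 else "follower"
    return candidate

cluster = [RaftNode(f"S{i}") for i in range(1, 6)]    # S1..S5
leader = run_election(cluster)
print(f"{leader.name} is now the {leader.role}")
```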

(2) Log replication (suppose there are five nodes, S1~S5):

A. The leader plays the role of the coordinator in a distributed transaction: every data update goes through a two-phase commit. When the leader receives a data operation request, it does not rush to update its local data (the data is persisted on disk); instead it generates the corresponding log entry and broadcasts a log-append request to all followers;

B. Each follower has two choices after receiving the request: one is to obey the leader, write the log entry, and return success; the other, if certain conditions are not met, is to decide that it should not obey the leader and return false;

C. If more than half of the followers have successfully written the log, the leader starts phase two, the commit: the followers again decide whether to write according to their own situation and return the result to the leader, and finally all nodes reach consensus on the data;

D. If in either phase more than half of the followers return false or do not respond at all, the distributed transaction fails. There is no rollback phase, but since the data was never committed on a majority of nodes, it will be overwritten later (see the majority-rule sketch below).
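
A matching sketch of the majority rule from items C and D: an entry counts as committed only when more than half of the cluster (leader included) has written it. The follower behavior is faked with simple functions for illustration.

```python
# Illustrative majority rule for log replication: commit only with more than half the cluster.
def replicate(entry, followers, cluster_size=5):
    acks = 1                                    # the leader has written the entry itself
    acks += sum(1 for follower in followers if follower(entry))
    return acks > cluster_size // 2             # committed only with a majority

ok_follower = lambda entry: True                # writes the log and returns success
down_follower = lambda entry: False             # unreachable or refuses the entry

print(replicate("set x=1", [ok_follower, ok_follower, ok_follower, down_follower]))    # True
print(replicate("set x=2", [down_follower, down_follower, ok_follower, down_follower]))  # False
```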

The Raft protocol guarantees strong leadership: client reads and writes all go through the leader, so consistency is strong. Some readers will ask: then what is the value of being distributed? How do you balance the load? In practice we use a Multi-Raft architecture: different applications (or data shards) elect different leader nodes, which balances the load.

5. Deployment across equipment rooms

In distributed systems, deployment across equipment rooms has always been a big problem: the network latency between equipment rooms is high and unstable. Cross-room issues mainly involve data synchronization and service switching. There are three cross-room deployment schemes: overall cluster switchover, a single cluster spanning equipment rooms, and Paxos-based primary replica election. They are introduced below.

A. Cluster switchover

Overall cluster switchover is the most common solution. As shown in the figure, the system is deployed in two equipment rooms: room 1 and room 2. The two rooms are independent of each other; each has its own master control node, and each master control node has a backup node in the same room. When the master control node fails, the backup node in the same room automatically takes over and continues providing services. In addition, the two rooms hold the same number of replicas; for example, the replicas of data shard A are A11 and A12 in room 1 and A21 and A22 in room 2. At a given moment, room 1 is the primary room and room 2 is the standby room.

Data between the rooms can be synchronized strongly or asynchronously. In asynchronous mode, the data in the standby room always lags behind the primary room, so when a fault occurs in the primary room there are two choices: switch services to the standby room and risk losing data, or stop serving until the primary room recovers. Therefore, if data synchronization is asynchronous, switchover between the primary and standby rooms is manual, and users can choose between losing data and stopping service according to the characteristics of the business.

In strong synchronization mode, data in the secondary equipment room is the same as that in the primary equipment room. When a fault occurs in the primary equipment room, you can manually switch over or automatically switch over. That is, the distributed lock service is used to detect the services in the primary equipment room. When the primary equipment room is faulty, the secondary equipment room is automatically switched over to the primary equipment room.

B. A single cluster spanning equipment rooms

Deploy a single cluster across multiple equipment rooms and allow the primary replicas of different data shards to reside in different rooms, as shown in Figure 3-11. Each data shard has four replicas spread across room 1 and room 2; A1, B1, and C1 are the primary replicas, with A1 and B1 in room 1 and C1 in room 2. The whole cluster has only one master control node, which must communicate with all working nodes in both rooms. When the master control node fails, the distributed lock service detects the failure and switches the backup node in room 2 to be the master control node.

In this deployment mode, the master controller must consider equipment room information when distributing data. That is, try to distribute multiple copies of a data fragment to multiple equipment rooms to prevent services from being affected when a single equipment room fails.

C. Paxos-based primary replica election

If the primary copy is selected using the Paxos protocol, multiple copies of each data shard constitute a Paxos replication group. As shown in the figure, B1, B2, B3, and B4 form a replication group. At a certain point, B1 acts as the primary copy of the replication group. When B1 fails, other copies attempt to switch to the primary copy. In this way, there is no need to maintain the lease between the master controller node and the working node, and the failure of the master controller node does not affect the working node. Its advantage is that it can reduce the dependence on the total control node, but its disadvantage is that the complexity of the project is too high, and it is difficult to simulate all abnormal situations offline.

Distributed file systems

Distributed file systems have two main functions: one is to store Blob data such as documents, images, and videos; The other is as a persistence layer for a distributed table system.

1. Google File System (GFS)

GFS, Bigtable, and MapReduce are known as Google's troika and are the cornerstone of many basic services.

Proposed in 2003, GFS is a distributed file system whose design assumptions differ considerably from many earlier distributed systems. It targets the following scenarios:

(1) Consider component failure as a normal condition, provide fault tolerance mechanism, automatic load balancing, so that distributed file system can run on cheap machines;

(2) For large file storage, the main workload of the system is large-scale streaming reading, and the write operation is mainly written in the way of appending, rarely random write;

(3) write once, read many times, such as web page storage on the Internet

GFS files are divided into fixed-size chunks, each of which is assigned a 64-bit globally unique chunk handle by the master server at creation time. Chunkservers (CS) store chunks on disk as ordinary Linux files. To ensure reliability, each chunk is replicated on multiple machines, three by default.

The master server maintains metadata of the system, including file and chunk namespaces, mapping between files and chunks, and location information of chunk. It is also responsible for the global control of the whole system, such as chunk lease management, garbage collection of garbage chunks, and chunk replication. The master server periodically exchanges information with the CS through heartbeat.

A client is an access interface that GFS provides to an application. It is a set of proprietary interfaces that do not follow POSIX specifications and are provided as library files. When the client accesses GFS, it first accesses the master server node to obtain the CS information that interacts with it, and then accesses these CS directly to complete the data access work.

Note that the client in GFS does not cache file data; it only caches the metadata obtained from the master server, which is determined by GFS's application characteristics. GFS has two major applications: MapReduce and Bigtable. For MapReduce, the GFS client reads and writes data sequentially and does not need to cache file data. Bigtable, as a distributed table system, implements its own caching mechanism internally. Moreover, keeping a client cache consistent with the actual data is an extremely complex problem.

Thus, Hadoop's HDFS is a simplified version of GFS, a product of Cutting's re-implementation of GFS. It is, so to speak, a product of stolen fire.

2. Taobao File System (TFS)

Internet applications often need to store documents, pictures, and videos uploaded by users, such as Facebook albums, Taobao images, and Dropbox documents. Documents, pictures, and videos are commonly called Blob data. The Taobao File System (TFS) is a Blob file system, characterized by data that is written once, read many times, and rarely updated.

TFS is architecturally borrowed from GFS, but very different from GFS.

(1) TFS does not maintain the file directory tree internally, and the flat data organization structure allows the file name to be mapped to the physical address of the file, simplifying the file access process;

(2) Special optimization is made for the random read/write performance of massive small files, which meets Taobao's needs for small-file storage and is widely used in various Taobao applications;

(3) HA architecture and smooth expansion ensure the availability and expansibility of the entire file system.

A TFS cluster consists of two NameServer nodes (one active and one standby) and multiple DataServer nodes. The NameServer monitors DataServer status through heartbeats; the NameServer corresponds to the Master in GFS and the DataServer to the ChunkServer in GFS. The NameServer is divided into an active NameServer and a standby NameServer; only the active NameServer provides services, and when it fails, a heartbeat daemon detects this and switches services to the standby NameServer. Each DataServer runs multiple dsp processes; each dsp corresponds to a mount point, which generally corresponds to an independent disk, so that multiple disks can be managed.

In TFS, a large number of small files (the actual data files) are merged into larger files (an improvement over HDFS) called blocks. Each block has a unique number within the cluster (block ID), and a file can be uniquely identified by <block ID, offset within the block>. The actual block data in TFS is stored on the DataServers; each block is 64MB by default and is stored in three copies, equivalent to chunks in GFS. The application client is the access interface TFS provides to applications; it does not cache file data, only NameServer metadata.

3. Facebook Haystack file system

By 2014, there were more than 400 billion photos on Facebook, with a total size of 30PB, which works out to an average photo size of roughly 100KB. Users add about 1 billion photos per week (a total of about 60TB), an average of roughly 10^9/7/40000 new photos per second (counting 40,000 seconds per day), or about 3,800 write operations per second, while read operations peak at around a million per second.

The early Facebook photo backend was served by NAS-based storage, with photo files mounted over NFS. Later, for performance and cost reasons, Facebook independently developed Haystack to store its photo data.

Similar to TFS, Facebook Haystack's new architecture mainly solves the problem of too many I/O operations for image access. The main idea is to let multiple logical files share the same physical file. The Haystack architecture and the flow of a read request are described below.

The Haystack architecture consists of three parts: Haystack Directory, Haystack Store, and Haystack Cache. The Haystack Store is the physical storage node, which organizes its storage space as physical volumes. Each physical volume is usually large, for example 100GB, so 10TB of data requires only 100 physical volumes. Each physical volume corresponds to one physical file, so the physical-file metadata on each storage node is very small. Physical volumes on multiple storage nodes form a logical volume for backup. The Haystack Directory stores the mapping between logical volumes and physical volumes. Assuming each volume is 100GB, 20PB of data corresponds to 20PB/100GB = 200,000 logical volumes, so the memory used by this mapping is negligible. The Haystack Cache is mainly used to reduce dependence on CDN providers and provides caching for recently added images.

The general flow of a Haystack image read request is as follows: when a user visits a page, the Web Server asks the Haystack Directory to construct a URL of the form http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>, which identifies the CDN, the Cache, and the back-end Haystack Store storage node. The Haystack Directory may omit the CDN part when constructing the URL so that the user requests the Haystack Cache directly without going through the CDN. The Haystack Cache receives two kinds of requests: those from the user's browser and those from the CDN; it caches only the requests that come from the user's browser and whose target Haystack Store node is still writable. Generally, a Haystack Store node becomes read-only some time after reaching its capacity limit, so the images on writable nodes are newly added images and therefore hotspot data.
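
To illustrate the URL format above, here is a toy Directory lookup that maps a photo id to a logical volume and storage machine and builds the URL, optionally skipping the CDN part. All ids, host names, and the mapping itself are invented; this is not Facebook's actual interface.

```python
# Toy Haystack-style Directory: photo id -> (logical volume, machine) -> URL.
DIRECTORY = {
    # photo_id: (logical_volume, machine_id)
    1001: ("lv_17", "store-3"),
    1002: ("lv_17", "store-3"),
    1003: ("lv_42", "store-9"),
}

def build_photo_url(photo_id, cdn="cdn.example.com", cache="cache-1", use_cdn=True):
    logical_volume, machine_id = DIRECTORY[photo_id]
    path = f"{cache}/{machine_id}/{logical_volume},{photo_id}"
    # The Directory may omit the CDN part so the browser hits the Cache directly.
    return f"http://{cdn}/{path}" if use_cdn else f"http://{path}"

print(build_photo_url(1001))                  # via the CDN
print(build_photo_url(1003, use_cdn=False))   # straight to the Haystack Cache
```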

The process of Haystack’s write request (image upload) is as follows: The Web Server first requests Haystack Directory to obtain the ID of the image and the writable logical scroll, and then writes the data to each corresponding physical scroll (usually with 3 backups).

File systems like Facebook’s Haystack and Taobao TFS are commonly referred to as Blob file systems. They both solve the problem of large numbers of small image files, so the architecture is similar, with differences including

(1) The selection of logical scroll size. For example, Haystack selects a logical scroll size of 100GB, while the block size in TFS is generally 64MB;

(2) Haystack uses RAID 6, and the underlying file system uses XFS with better performance. Taobao later eliminates the RAID mechanism, and the file system uses Ext3;

(3) Haystack uses the CDN services of Akamai and Limelight, while Taobao already runs its own CDN; of course, Facebook has also considered building its own CDN.

4. CDN Content distribution network

The full name of CDN is Content Delivery Network. Its goal is to distribute web content to the "edge" of the network closest to users by adding a new layer of network architecture on top of the existing Internet, achieving the following three objectives:

(1) Reduce the access latency caused by geographic distribution, limited bandwidth and server performance; this suits scenarios such as site acceleration, video on demand and live streaming. Users fetch content from a nearby node, which relieves Internet congestion and improves the response speed and success rate of website visits.

(2) Latency control is an important metric of modern information systems. A CDN aims to keep content delivery smooth and consistent in spite of forwarding overhead, transmission distance and link jitter, while consuming as few resources as possible.

(3) A CDN acts as both a safeguard and an accelerator, delivering content to each user faster and more reliably and thus providing a better user experience.

As shown in the following figure, DNS no longer returns the IP address of the origin server but that of an edge node chosen by the CDN's intelligent load-balancing system. The user accesses the edge node with this IP address; the node then resolves the origin server's address through its internal DNS and sends a request to fetch the page the user needs. If the request succeeds, the edge node caches the page, so subsequent requests can be served directly without visiting the origin server every time.

Taobao's CDN is self-developed and supports users' shopping traffic, in particular the massive image requests on the "Double 11" Singles' Day. Images are stored in the back-end TFS cluster, and the CDN caches them on the edge nodes closest to users. The CDN uses two cache levels: L1-Cache and L2-Cache. When a user requests a Taobao image, global load balancing routes the request to an L1-Cache node. If the L1-Cache hits, the image data is returned directly to the user; otherwise the L1 node requests an L2-Cache node and caches the returned image. If the L2-Cache hits, the image is returned to the L1-Cache node; otherwise the request falls back to the origin image server cluster. Each image server is a web server running Nginx, which also caches images locally and only queries the back-end TFS cluster when its local cache misses. The image server cluster and the TFS cluster are deployed in the same data center.
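A minimal sketch of that fall-through lookup (illustrative only; the class names are invented): each level checks its own cache and, on a miss, asks the next level and caches the result, so the chain L1 -> L2 -> image server -> TFS is only walked once per image.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the L1 -> L2 -> origin fall-through described above.
public class TieredImageCacheSketch {

    // The next level in the chain: an L2 cache node, the image servers, or TFS itself.
    public interface ImageSource {
        byte[] fetch(String imageKey);
    }

    public static class CacheNode implements ImageSource {
        private final Map<String, byte[]> localCache = new ConcurrentHashMap<>();
        private final ImageSource nextLevel;

        public CacheNode(ImageSource nextLevel) {
            this.nextLevel = nextLevel;
        }

        @Override
        public byte[] fetch(String imageKey) {
            // Hit: serve from this node's cache.
            byte[] cached = localCache.get(imageKey);
            if (cached != null) return cached;
            // Miss: ask the next level, then cache the result locally.
            byte[] data = nextLevel.fetch(imageKey);
            if (data != null) localCache.put(imageKey, data);
            return data;
        }
    }

    public static void main(String[] args) {
        ImageSource tfs = key -> ("image-bytes-for-" + key).getBytes(); // stand-in for TFS
        CacheNode imageServer = new CacheNode(tfs); // Nginx image server with local cache
        CacheNode l2 = new CacheNode(imageServer);  // L2-Cache node
        CacheNode l1 = new CacheNode(l2);           // L1-Cache node hit first by users
        System.out.println(new String(l1.fetch("photo-123"))); // first call walks the chain
        System.out.println(new String(l1.fetch("photo-123"))); // second call is an L1 hit
    }
}
```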

Figure 4-11 shows the architecture of each CDN node. As can be seen from the figure, load balancing is carried out internally by LVS+Haproxy on each CDN node. LVS is a four-layer load balancing software with good performance. Haproxy is a seven-layer load balancing software that supports more flexible load balancing policies. By combining the two, different image requests can be scheduled to different Squid servers.

The figure above shows the single-node architecture of CDN, which has the following three characteristics

(1) The Squid servers form the cache within a single Taobao CDN node. This is much simpler than a general distributed cache, because the cached data can always be re-fetched from the back end and no persistence needs to be considered.

(2) Hierarchical cache: the Squid servers use mixed SSD+SAS+SATA storage, because the cached data has strong locality. Images migrate as hot spots change: the hottest go to SSD, moderately hot ones to SAS, and cold ones to SATA, combining the performance of SSD with the cost advantage of SAS and SATA disks (see the sketch after this list).

(3) Low-power server customization. The CDN cache service is I/O-intensive rather than CPU-intensive, so custom low-power servers built around Intel Atom CPUs are used, which greatly reduces overall power consumption while preserving service performance.
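A minimal sketch of the tier-selection idea in point (2) above (purely illustrative; the thresholds and names are invented): a storage tier is picked for an image based on how often it has been accessed recently.

```java
// Illustrative tier selection by access frequency; thresholds are made up.
public class StorageTierSketch {
    enum Tier { SSD, SAS, SATA }

    // Decide where an image should live based on its recent access count.
    static Tier tierFor(long recentAccessCount) {
        if (recentAccessCount > 10_000) return Tier.SSD;   // hottest images
        if (recentAccessCount > 100)    return Tier.SAS;   // moderately hot
        return Tier.SATA;                                   // cold images
    }

    public static void main(String[] args) {
        System.out.println(tierFor(50_000)); // SSD
        System.out.println(tierFor(500));    // SAS
        System.out.println(tierFor(3));      // SATA
    }
}
```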

Distributed key-value system

A distributed key-value system stores semi-structured data with simple relationships. The data is encapsulated as objects made up of key-value pairs: the key is the unique identifier; the value is the attribute content and can be of any type, such as text, an image, or even empty; an optional timestamp allows multiple versions of an object to coexist. Because storage is organized purely as key-value pairs, the structure is not fixed and each tuple can have different fields.

A distributed key-value system supports adding, deleting, reading and updating individual key-value pairs. It runs on a cluster of PC servers that can be expanded on demand to handle large-scale data, achieves fault tolerance through data replication, and avoids the complexity and cost of manual data partitioning.

In terms of data structure, a distributed key-value system resembles a traditional hash table, except that the data is distributed across multiple storage nodes in a cluster. The number of replicas is configurable, and the copies of a piece of data are stored on different nodes, so when one node fails the remaining nodes continue to provide service.

1. Amazon Dynamo

Dynamo stores data in a very simple key-value format and does not support complex queries. Dynamo stores data values in their raw form, without parsing the data. Dynamo is used in Amazon’s shopping cart and S3 cloud storage service. The following problems were solved in the implementation process:

Dynamo uses consistent hashing to distribute data across storage nodes. In outline, each node is assigned a random token, and all tokens together form a hash ring. To store an item, the hash of its primary key is computed and the item is placed on the node whose token is the first one encountered clockwise that is greater than or equal to that hash. The beauty of consistent hashing is that adding or removing a node only affects its neighbours on the ring, not the other nodes.

A. Dynamo architecture

Dynamo uses an improved consistent hash algorithm: Each physical node is assigned multiple tokens based on its performance. Each token corresponds to a virtual node. The processing power of each virtual node is basically the same and distributed randomly in the hash space. During storage, data is hashed to the area in charge of a virtual node and then stored to the physical node corresponding to the virtual node.
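A minimal sketch of consistent hashing with virtual nodes (illustrative only, not Dynamo's actual code; the hash function, token count and replica count are arbitrary choices): each physical node gets several tokens on a ring kept in a TreeMap, and a key is stored on the first N distinct physical nodes found clockwise from its hash, a simplified "preference list".

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative consistent-hash ring with virtual nodes, in the spirit of Dynamo.
public class ConsistentHashSketch {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // token -> physical node
    private final int virtualNodesPerNode;

    public ConsistentHashSketch(int virtualNodesPerNode) {
        this.virtualNodesPerNode = virtualNodesPerNode;
    }

    public void addNode(String node) {
        for (int i = 0; i < virtualNodesPerNode; i++) {
            ring.put(hash(node + "#" + i), node); // one token per virtual node
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < virtualNodesPerNode; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    // Walk clockwise from the key's hash and collect the first `replicas`
    // distinct physical nodes.
    public List<String> nodesFor(String key, int replicas) {
        Set<String> chosen = new LinkedHashSet<>();
        if (ring.isEmpty()) return new ArrayList<>();
        long h = hash(key);
        SortedMap<Long, String> tail = ring.tailMap(h);
        for (String node : tail.values()) {
            chosen.add(node);
            if (chosen.size() == replicas) return new ArrayList<>(chosen);
        }
        for (String node : ring.values()) { // wrap around the ring
            chosen.add(node);
            if (chosen.size() == replicas) break;
        }
        return new ArrayList<>(chosen);
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashSketch ring = new ConsistentHashSketch(3);
        ring.addNode("nodeA");
        ring.addNode("nodeB");
        ring.addNode("nodeC");
        System.out.println(ring.nodesFor("user:42", 3)); // 3 distinct nodes for 3 replicas
    }
}
```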

In the following figure, a Dynamo cluster has three nodes and each node is allocated three tokens. When storing data, the hash of the primary key is computed and the data is stored on the node owning the corresponding token. When a node 4 is added, the token allocation changes, which automatically rebalances the load.

To locate the node a piece of data belongs to, every node must maintain some cluster information. In Dynamo, each node keeps the state of the entire cluster and clients cache this information as well, so most requests can be routed to the target node in a single hop.

B. Gossip protocol

Because of hardware failures and operational changes, nodes join and leave the cluster frequently. To keep every node's cached membership up to date, each node in a Dynamo cluster periodically (for example, every 1s) selects another node to communicate with via the Gossip protocol; if the connection succeeds, the two sides exchange their saved cluster information.

The Gossip protocol is used in P2P systems to let autonomous nodes coordinate their view of the whole cluster, such as node status and load. Let's first look at how two nodes A and B exchange their views (a minimal sketch follows the list below).

(1) A tells B the version of all nodes it manages (including nodes in Down state and Up state);

(2) B tells A which of those versions are older than B's own and for which ones B has newer information, then sends A the entries that B has the latest versions of (nodes in the Down state are ignored because their versions have not changed);

(3) A sends B the entries that are stale on B's side, and applies locally the newer entries it received from B;

(4) After B receives the newer entries from A, it updates its stale local copies.
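A minimal sketch of the version exchange above (illustrative only; real Gossip implementations carry richer state such as heartbeat counters and node status): each node keeps a map from node id to the latest version it has seen, and two nodes swap only the entries on which the other side is stale.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative gossip-style exchange of node versions between two peers.
public class GossipExchangeSketch {
    // Each peer's view: node id -> latest version it has seen.
    static void exchange(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> newerOnA = new HashMap<>(); // entries A must teach B
        Map<String, Long> newerOnB = new HashMap<>(); // entries B must teach A
        for (Map.Entry<String, Long> e : a.entrySet()) {
            if (e.getValue() > b.getOrDefault(e.getKey(), -1L)) newerOnA.put(e.getKey(), e.getValue());
        }
        for (Map.Entry<String, Long> e : b.entrySet()) {
            if (e.getValue() > a.getOrDefault(e.getKey(), -1L)) newerOnB.put(e.getKey(), e.getValue());
        }
        // Both sides apply the newer entries they received.
        a.putAll(newerOnB);
        b.putAll(newerOnA);
    }

    public static void main(String[] args) {
        Map<String, Long> nodeA = new HashMap<>(Map.of("n1", 5L, "n2", 3L));
        Map<String, Long> nodeB = new HashMap<>(Map.of("n1", 4L, "n3", 7L));
        exchange(nodeA, nodeB);
        System.out.println(nodeA); // both views now contain n1=5, n2=3, n3=7
        System.out.println(nodeB);
    }
}
```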

Because of the existence of seed nodes, adding new nodes is relatively easy. When a new node joins, it first exchanges cluster information with a seed node to obtain an initial view of the cluster. The other nodes already in the DHT (Distributed Hash Table) ring also periodically exchange cluster information with the seed nodes, and so detect the newly joined node.

The cluster is constantly changing, and machines may go offline at any time. Therefore, each node also needs to periodically exchange cluster information with other nodes through the Gossip protocol. If the status of a node has not been updated for a long time, for example, the time since the last update exceeds a certain threshold, the node is considered offline.

2. Taobao Tair

Tair is a distributed key/value system.

Tair has four storage engines: MDB, RDB, KDB and LDB, based respectively on four open-source key/value stores: Memcached, Redis, Kyoto Cabinet and LevelDB. Tair makes these KV stores easier to use. For example, Redis itself does not provide sharding: with multiple Redis servers you would have to implement sharding yourself, whereas Tair encapsulates this for you.
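A minimal sketch of the client-side sharding that a layer like Tair hides from the developer (illustrative only; this is not Tair's API, and `KvClient` is an invented stand-in for a single-server client): a key is hashed to pick one of several underlying server instances, and callers only ever see one uniform put/get interface.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative client-side sharding across several KV server instances.
public class ShardedKvSketch {
    public interface KvClient {          // invented stand-in for a single-server client
        void put(String key, String value);
        String get(String key);
    }

    private final List<KvClient> shards;

    public ShardedKvSketch(List<KvClient> shards) {
        this.shards = shards;
    }

    // Pick a shard from the key's hash; all callers see one uniform API.
    private KvClient shardFor(String key) {
        int idx = Math.floorMod(key.hashCode(), shards.size());
        return shards.get(idx);
    }

    public void put(String key, String value) { shardFor(key).put(key, value); }
    public String get(String key)             { return shardFor(key).get(key); }

    public static void main(String[] args) {
        List<KvClient> shards = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            Map<String, String> store = new HashMap<>();     // in-memory stand-in server
            shards.add(new KvClient() {
                public void put(String key, String value) { store.put(key, value); }
                public String get(String key) { return store.get(key); }
            });
        }
        ShardedKvSketch kv = new ShardedKvSketch(shards);
        kv.put("user:42", "alice");
        System.out.println(kv.get("user:42")); // routed to the same shard, prints "alice"
    }
}
```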

Tair has the following advantages:

(1) Unified API. No matter what engine is used at the bottom, the API at the top is the same.

(2) Tair encapsulates cluster operations and frees developers from them. Inside Taobao, Tair is typically deployed as dual data centers with dual clusters for fault tolerance, and an invalid server is used to keep the two clusters consistent; all of this is transparent to developers.

A. Tair usage scenarios

(1) Non-persistent (MDB, RDB)

Data can be stored as keys/values

Data loss is acceptable

The speed of access is very high

Individual data sizes are not very large, typically in kilobytes

The amount of data is large and has great potential for growth

Data updates are infrequent

(2) Persistence (KDB, LDB)

Data can be stored as keys/values

Data needs to be persisted

Individual data sizes are not very large, typically in kilobytes

The amount of data is large and has great potential for growth

The data read and write ratio is high

B. Tair architecture

As a distributed system, Tair is composed of a central control node and several service nodes.

A. Config Server function:

(1) Track which nodes in the cluster are alive by maintaining heartbeats with the data servers;

(2) Build the data distribution table in the cluster according to the information of the surviving nodes;

(3) Provide a query service for the data distribution table;

(4) Scheduling data migration and replication between data servers;

B. Data Server function

(1) Provide storage engines;

(2) Accept the put, get, and remove operations of the client.

(3) Perform data migration and replication;

(4) Plug-ins: handle custom logic while serving requests;

(5) Access statistics;

C. Client function

(1) Provide the application-side interface for accessing the Tair cluster;

(2) Update and cache data distribution table and invalid Server address;

(3) Maintain a local cache so that extremely hot data does not overload the Tair cluster;

(4) Flow control;

In the figure below, the client first asks the Config Server for the Data Server where the data resides, and then sends its read and write requests to that Data Server. Tair allows data to be replicated across multiple Data Servers for fault tolerance.

C. Balanced data distribution

Tair distributes data with a consistent-hashing-style scheme: the key space is hashed into Q buckets, and the bucket is the basic unit of load balancing and data migration. The Config Server assigns buckets to Data Servers according to certain policies; because keys are hashed evenly into buckets, balancing the bucket distribution also balances the data distribution.
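A minimal sketch of the bucket-based routing described above (illustrative only; the value of Q and the assignment policy are invented, not Tair's actual ones): keys hash into one of Q buckets, and a table maintained by the config server maps each bucket to a data server.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative bucket-based routing: key -> bucket -> data server.
public class BucketRoutingSketch {
    private static final int Q = 1024;                                // number of buckets (made up)
    private final Map<Integer, String> bucketTable = new HashMap<>(); // bucket -> data server

    // A trivial stand-in for the Config Server's assignment policy: round-robin.
    public BucketRoutingSketch(List<String> dataServers) {
        for (int bucket = 0; bucket < Q; bucket++) {
            bucketTable.put(bucket, dataServers.get(bucket % dataServers.size()));
        }
    }

    public String serverFor(String key) {
        int bucket = Math.floorMod(key.hashCode(), Q);  // key -> bucket
        return bucketTable.get(bucket);                 // bucket -> data server
    }

    public static void main(String[] args) {
        BucketRoutingSketch router = new BucketRoutingSketch(List.of("ds1", "ds2", "ds3"));
        System.out.println(router.serverFor("user:42"));
    }
}
```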

D. Fault tolerance

When a Data Server fails, the Config Server detects it. Each bucket has multiple replicas in Tair. If the lost copy is a backup replica, the Config Server assigns a new Data Server for it, and if the engine is persistent the data is also copied to that new Data Server. If it is the primary replica, the Config Server first promotes a healthy backup replica to primary so that the bucket can keep serving, then selects another Data Server to hold an additional backup, restoring the configured number of replicas.

E. Data migration

When Data Servers are added or removed, the Config Server notices and recalculates a new bucket distribution table, reassigning the buckets previously served by the removed machines to the remaining Data Servers; this is when data migration happens. For example, if in the new table Data Server A is supposed to own bucket 3 but has no data for it yet, that bucket's data is migrated to A. The Config Server also notices when a bucket's replica count has dropped and adds backups of those buckets to lightly loaded Data Servers. When a new Data Server joins the system, the Config Server coordinates the existing Data Servers to migrate some of their buckets to it according to load, and adjusts the routing table once the migration is complete.

Suppose Data Server A is migrating buckets 1, 2 and 3 to Data Server B. Because the client's routing table has not changed yet, requests for buckets 1, 2 and 3 are still routed to A. Assume bucket 1 has not started migrating, bucket 2 is being migrated and bucket 3 has already finished. A request for bucket 1 is simply served by A. A request for bucket 3 is forwarded by A to B, and B's result is returned to the client. A request for bucket 2 is processed on A, and if it is a modification, the change is also recorded in a log; when bucket 2 finishes migrating, these logs are sent to B and replayed there, so that A and B end up consistent and the migration completes. If the migration was caused by A crashing, the client receives an intermediate, temporary routing table that assigns the crashed server's buckets to the Data Servers holding their backups; the service remains available, although the load may be unbalanced until the migration completes and a new balanced state is reached.

3. ETCD

ETCD is a highly available key-value store used for shared configuration and service discovery.

(1) Developed and maintained by CoreOS, inspired by ZooKeeper and Doozer;

(2) It is written in Go and uses the Raft consensus algorithm for log replication to ensure strong consistency.

(3) etcd is widely used in Google's container cluster management system Kubernetes, in Cloud Foundry, and in CoreOS's Fleet.

(4) When the cluster network is unstable or the current leader node fails, etcd can hold a new leader election and restore consistent service for the cluster.

A. Features of ETCD

(1) Simple: a curl-friendly user-facing API (HTTP + JSON).

(2) Secure: optional SSL client certificate authentication.

(3) Fast: each instance supports on the order of 1,000 write operations per second.

(4) Reliable: distribution is properly implemented using the Raft algorithm.

B. Capabilities provided

Etcd mainly provides the following capabilities

(1) Interfaces for storing and retrieving data, with strong consistency across the nodes of an etcd cluster guaranteed by the consensus protocol. Used for storing metadata and shared configuration.

(2) A watch mechanism with which clients can watch a key, or a set of keys, for changes. Used to listen for and push out changes.

(3) Key expiration and renewal: clients keep keys alive by refreshing them periodically (the V2 and V3 implementation mechanisms differ). Used for cluster monitoring and service registration/discovery.

(4) Atomic compare-and-swap (CAS) and compare-and-delete (CAD) operations (via request parameters in V2, and via batch transactions in V3). Used for distributed locks and leader election (see the sketch below).
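A minimal sketch of a CAS update against the etcd V2 HTTP API (assumptions: a local etcd exposing the V2 keys endpoint on 127.0.0.1:2379; the key name `config/mode` and its values are invented for illustration):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of an atomic compare-and-swap against the etcd V2 HTTP API.
public class EtcdCasSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Set the key unconditionally: PUT /v2/keys/config/mode with value=blue
        HttpRequest put = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:2379/v2/keys/config/mode"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .PUT(HttpRequest.BodyPublishers.ofString("value=blue"))
                .build();
        System.out.println(client.send(put, HttpResponse.BodyHandlers.ofString()).body());

        // CAS: only update to "green" if the current value is still "blue".
        HttpRequest cas = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:2379/v2/keys/config/mode?prevValue=blue"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .PUT(HttpRequest.BodyPublishers.ofString("value=green"))
                .build();
        HttpResponse<String> resp = client.send(cas, HttpResponse.BodyHandlers.ofString());
        // A non-2xx status here means someone else changed the value first.
        System.out.println(resp.statusCode() + " " + resp.body());
    }
}
```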

C. ETCD architecture

(1) Etcd V2 storage, Watch and expiration mechanism

Etcd V2 keeps its store purely in memory and does not write data to disk in real time; its persistence mechanism is simple, serializing the whole store into a JSON file. In memory the data is a simple tree structure.

The store has a global currentIndex that is incremented by 1 on every change, and each change event is tagged with the currentIndex at which it happened.

When a client issues a watch with a waitIndex, if waitIndex &lt;= currentIndex the EventHistory table is searched for an event with index &gt;= waitIndex that matches the watched key; if such an event exists it is returned directly. If there is no match in the history table, or the request carries no waitIndex, the watcher is parked in the WatchHub, where each key is associated with a list of watchers. When a change is executed, the resulting event is appended to the EventHistory table and the watchers associated with that key are notified.
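A minimal sketch of that watch dispatch logic (purely illustrative; the class names and the fixed-size history are invented, not etcd's actual code): answer from the event history when possible, otherwise register the watcher under the watched key.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Consumer;

// Illustrative sketch of a V2-style watch: answer from EventHistory when possible,
// otherwise park the watcher in a WatchHub keyed by the watched key.
public class WatchSketch {
    record Event(long index, String key, String value) {}

    private long currentIndex = 0;
    private final Deque<Event> eventHistory = new ArrayDeque<>();            // bounded history
    private final Map<String, List<Consumer<Event>>> watchHub = new HashMap<>();

    // A write bumps currentIndex, records the event and notifies parked watchers.
    public synchronized void put(String key, String value) {
        Event e = new Event(++currentIndex, key, value);
        eventHistory.addLast(e);
        if (eventHistory.size() > 1000) eventHistory.removeFirst();          // drop oldest
        List<Consumer<Event>> watchers = watchHub.remove(key);
        if (watchers != null) watchers.forEach(w -> w.accept(e));
    }

    // Watch: serve from history if an event at or after waitIndex exists, else park.
    public synchronized Optional<Event> watch(String key, long waitIndex, Consumer<Event> callback) {
        if (waitIndex <= currentIndex) {
            for (Event e : eventHistory) {
                if (e.index() >= waitIndex && e.key().equals(key)) return Optional.of(e);
            }
        }
        watchHub.computeIfAbsent(key, k -> new ArrayList<>()).add(callback);
        return Optional.empty();
    }
}
```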

(2) Etcd V3 storage, Watch and expiration mechanism

Etcd V3 implements watch and store separately. Let's first look at the store. The Etcd V3 store is divided into two parts: one is an in-memory index, kvIndex, based on Google's open-source Go btree implementation; the other is the back-end storage.

The backend is designed so that different storage engines can be plugged in; BoltDB is used at present. BoltDB is an embedded, transactional KV store, and etcd's transactions are built on BoltDB transactions. The key etcd stores in BoltDB is the revision, and the value is etcd's own key-value pair; in other words, etcd saves every version in BoltDB, which is how the multi-version (MVCC) mechanism is realized.

4. Product selection comparison (Etcd, ZooKeeper, Consul)

These three products are often compared when making a technology choice.

(1) Etcd and ZooKeeper provide very similar capabilities: both are general-purpose consistent meta-information stores, both provide a watch mechanism for change notification and distribution, and both are used by distributed systems as shared information storage. Their positions in the software ecosystem are almost identical and they can substitute for each other. Apart from differences in implementation details, language and consistency protocol, the biggest difference lies in the surrounding ecosystem. ZooKeeper, written in Java under Apache and exposing an RPC interface, was originally incubated within the Hadoop project and is widely used in distributed systems (Hadoop, Solr, Kafka, Mesos, etc.). Etcd, an open-source product from CoreOS, is newer; it gained a following with its simple REST interface and active community, and is being adopted by newer clusters (such as Kubernetes). Although V3 switched to a binary RPC interface for performance, etcd is still easier to use than ZooKeeper.

(2) Consul's objective is more specific. Etcd and ZooKeeper provide distributed, consistent storage; concrete service scenarios such as service discovery and configuration change have to be implemented by users on top of them. Consul targets service discovery and configuration change directly and also ships with a KV store. In the software ecosystem, the more abstract a component is, the wider its range of application, but it will inevitably fall short of some needs in specific business scenarios.
