
Basic concepts

  • Distributed system: multiple computers working in coordination to solve computing and storage problems that a single computer cannot handle
  • The biggest difference between a single-machine system and a distributed system lies in the scale of the problem, that is, the amount of data that can be computed and stored
  • To solve a single-machine problem with a distributed approach, the first question is how to split the problem across multiple machines so that each machine in the distributed system is responsible for a subset of the problem
  • Since the input of the problem is data, whether for computing or storage, how to partition the input data becomes the basic problem of distributed systems

Hash-based distribution

  • Hash-based distribution: compute a hash from some characteristic value of the data (such as a key or user ID) and map the hash value to a machine, typically by taking the remainder modulo the number of machines (see the sketch after this list)
  • The main disadvantage of hash-based data distribution is poor scalability:
    • Once the cluster needs to be scaled, almost all the data has to be migrated and redistributed
    • In engineering practice, when a system that distributes data by hash is scaled, the cluster size is often doubled
    • The hash is then recomputed for each data item, so that only half of the data on each machine needs to be migrated to one corresponding new machine
    • One way to address the poor scalability of hashing is to treat the mapping between hash values and machines as metadata managed by a dedicated metadata server, instead of simply taking the hash value modulo the number of machines
    • The number of hash buckets is often much larger than the number of machines, so one machine is responsible for several buckets; however, this requires a complex mechanism to maintain a large amount of metadata
  • Another disadvantage of hash-based distribution is that once the data of certain characteristic values is severely uneven, data skew easily occurs: for example, when hashing by user ID, all the data of one extremely heavy user still lands on a single machine
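
A minimal sketch of plain hash-modulo distribution and its migration cost, under assumed names (machine_for, a set of hypothetical user-ID keys) and MD5 as the hash function; it counts how many keys change machines when a 3-machine cluster grows to 4:

```python
import hashlib

def machine_for(key: str, num_machines: int) -> int:
    """Map a key to a machine by hashing it and taking the remainder."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_machines

keys = [f"user-{i}" for i in range(10_000)]

# Count how many keys land on a different machine after growing from 3 to 4 machines.
moved = sum(1 for k in keys if machine_for(k, 3) != machine_for(k, 4))
print(f"{moved / len(keys):.0%} of keys must be migrated")  # roughly 75%
```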

Distribution by data range

  • Distribution by data range: data is divided into different ranges according to its characteristic values, so that each server in the cluster handles data in a different range

  • In engineering practice, to facilitate load balancing operations such as data migration, dynamic interval partitioning is often used to keep the amount of data served in each interval as balanced as possible (see the sketch after this list)
  • When the amount of data in an interval grows too large, the interval is split into two, so that the amount of data in each interval stays below a relatively fixed threshold
  • A dedicated server is required to maintain the data distribution information in memory; this distribution information is a kind of metadata
  • For a very large cluster, the metadata itself may be too large for a single machine to maintain, so multiple machines must serve as metadata servers
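
A minimal sketch of dynamic interval partitioning, under assumed names (insert, SPLIT_THRESHOLD) and a toy threshold; an interval is split at its median key once it holds too many records, and the list of interval bounds is the kind of metadata a dedicated server would maintain:

```python
import bisect

SPLIT_THRESHOLD = 4  # toy value: split an interval once it holds this many records

# Metadata a dedicated server would hold: sorted interval upper bounds plus,
# per interval, the data it currently serves (each interval maps to one server).
bounds = [float("inf")]
intervals = [[]]

def insert(key: int) -> None:
    i = bisect.bisect_left(bounds, key)   # interval i covers keys <= bounds[i]
    intervals[i].append(key)
    if len(intervals[i]) >= SPLIT_THRESHOLD:
        # Split the interval at its median key so each half stays under the threshold.
        data = sorted(intervals[i])
        mid = data[len(data) // 2]
        bounds.insert(i, mid)
        intervals[i:i + 1] = [[k for k in data if k <= mid],
                              [k for k in data if k > mid]]

for k in [5, 90, 17, 42, 8, 61, 3, 77]:
    insert(k)
print(bounds)     # -> [8, 17, 42, inf]
print(intervals)  # -> [[3, 5, 8], [17], [42], [90, 61, 77]]
```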

Distribution by data volume

  • Distribution by data volume is independent of any specific data characteristics; the data is regarded as a sequentially growing file that is divided into chunks of a fixed size, and different chunks are distributed to different servers (see the sketch after this list)
  • Distributing data by volume also requires recording where each data chunk is placed and managing this distribution information as metadata on metadata servers
  • Advantages of distribution by data volume:
    • Because distribution by volume has nothing to do with the data content, it generally does not suffer from data skew; the data is always split evenly and distributed across the cluster
    • When the cluster needs to be rebalanced, only some data chunks have to be migrated
    • Cluster expansion is not constrained either: only some data chunks need to be migrated to the newly added machines to complete the expansion
  • Disadvantages of distribution by data volume:
    • More complex meta-information needs to be managed
    • When the cluster scale is large, the amount of metadata becomes large, requiring a complex mechanism to efficiently manage metadata
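
A minimal sketch of distribution by data volume, with assumed names (chunk_index, place_chunk) and a round-robin placement standing in for a real placement decision; the data is treated as a sequentially growing byte stream cut into fixed-size chunks, and the chunk table is the metadata:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # e.g. 64 MiB per chunk
NUM_SERVERS = 5

def chunk_index(offset: int) -> int:
    """Which chunk a byte offset of the sequentially growing data falls into."""
    return offset // CHUNK_SIZE

def place_chunk(chunk_id: int) -> int:
    # Toy placement: round-robin; a real system records an actual placement decision.
    return chunk_id % NUM_SERVERS

total_size = 1_000_000_000  # 1 GB written so far
# Metadata kept by the metadata server: chunk id -> server.
chunk_table = {c: place_chunk(c) for c in range(chunk_index(total_size - 1) + 1)}
print(len(chunk_table), "chunks;", dict(list(chunk_table.items())[:3]))
```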

Consistent hashing

  • Consistent hashing:
    • A data distribution method widely used in engineering
    • Consistent hashing was originally used as a common data distribution algorithm for distributed hash tables (DHT) in P2P networks
  • The basic form of consistent hashing:
    • Use a hash function to compute the hash value of data or data characteristics
    • Let the output range of the hash function form a closed ring, that is, the maximum output value of the hash function is immediately followed by the minimum value
    • Nodes are randomly distributed over the ring, with each node responsible for processing data in the entire hash range clockwise from itself to the next node

    • The consistent hashing approach requires that the positions of nodes on the hash ring be managed as metadata, which is more complex than direct hash distribution
    • However, a node's position information depends only on the number of machines in the cluster, so the amount of metadata is usually much smaller than when distributing data by range or by volume
  • Improved consistent hashing algorithm: introduce the concept of virtual nodes (both the basic ring and the virtual nodes are sketched after this list)
    • The number of virtual nodes is generally much larger than the number of machines the cluster will ever have
    • The virtual nodes are uniformly distributed on the consistent hash ring, which has the same function as the nodes in the basic consistent hash algorithm
    • Assign several virtual nodes to each node
    • When operating on data, the corresponding virtual node is first located on the ring via the hash value of the data, and then the metadata is consulted to find the real node that owns that virtual node
  • Advantages of using virtual nodes:
    • When a node becomes unavailable, all of its virtual nodes become unavailable with it, so the load of the failed node is shared by several adjacent real nodes
    • When a new node is added, it can be assigned several virtual nodes, so that the new node takes over load from several existing nodes
    • Easy to achieve load balancing during capacity expansion
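
A minimal sketch of a consistent hash ring with virtual nodes, using assumed names (ConsistentHashRing, node_for) and MD5 to place both virtual nodes and keys on a 2^32 ring; following the ownership rule above (a node owns the range from its own position clockwise to the next node), a key is served by the real node behind the nearest virtual node at or before the key's position:

```python
import bisect
import hashlib

def h(s: str) -> int:
    """Hash a string onto the ring [0, 2^32)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes_per_node=100):
        # Spread many virtual nodes of each real node around the ring.
        self._ring = sorted(
            (h(f"{node}#vn{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key: str) -> str:
        # A node owns the range from its own position clockwise to the next node,
        # so a key is served by the nearest virtual node at or before its hash.
        i = bisect.bisect_right(self._positions, h(key)) - 1  # -1 wraps to the last vnode
        return self._ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user-42"))  # the real node behind the owning virtual node
```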

Replicas and data distribution

  • Using replicas is the basic means of achieving fault tolerance and improving availability in distributed systems
  • How replicas are distributed mainly affects the scalability of the system
  • Basic data copy policy:
    • Replicate at the granularity of machines: several machines are replicas of each other
    • The data on replica machines is identical
      • Pros: very simple
      • Cons: low data recovery efficiency and poor scalability
  • A more reasonable approach:
    • Instead of using whole machines as replica units, split the data into more reasonable segments and replicate at the granularity of data segments
    • In practice, the size of each data segment is kept as equal as possible and within a certain limit
    • Data segments are called many different things:
      • segment
      • fragment
      • chunk
      • partition
  • The selection of data segments is directly related to the way data is distributed
  • Hash distribution:
    • Each hash bucket (each remainder value) can serve as a data segment; to control the segment size, the number of buckets is often much larger than the cluster size
    • Once the data is divided into segments, replicas can be managed per data segment, so replicas are no longer rigidly tied to machines, and each machine can hold replicas of many different data segments
  • Once replica placement is decoupled from machines, recovery after data loss is very efficient (as sketched after this list):
    • When data is lost on one machine, the replicas of the lost data segments are spread across all machines in the cluster rather than a few fixed replica machines, so data can be copied and recovered from the whole cluster in parallel, with each source machine contributing only a small amount of resources
  • Replica distribution is machine-independent and also conducive to cluster fault tolerance:
    • When a machine goes down, the pressure is spread across the whole cluster, because the replicas of the data on the downed machine are scattered throughout the cluster
  • The machine-independent distribution of replicas also facilitates cluster scaling:
    • In theory, assuming a cluster of N machines, when a new machine is added, only a fraction 1/N − 1/(N+1) = 1/(N(N+1)) of the total data needs to be migrated from each existing machine to the new machine to restore load balance
    • As with data recovery, migrating data from machines in the cluster is very efficient
  • In engineering, managing replicas at the granularity of individual data segments increases the amount of metadata and the difficulty of replica maintenance; a compromise:
    • Group several data segments into a data segment group and manage replicas at the granularity of segment groups, which keeps the replica granularity within a suitable range
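
A minimal sketch of per-segment replica placement, with assumed names (placement, REPLICAS) and a random placement; it shows how the surviving replicas of a failed machine's segments are spread over many source machines, and evaluates the 1/N − 1/(N+1) migration fraction mentioned above:

```python
import random

NUM_MACHINES = 5
NUM_SEGMENTS = 100
REPLICAS = 3

random.seed(0)
machines = [f"m{i}" for i in range(NUM_MACHINES)]

# Metadata: for each data segment, the machines holding one of its replicas.
placement = {seg: random.sample(machines, REPLICAS) for seg in range(NUM_SEGMENTS)}

# If m0 fails, the surviving replicas of its segments are spread over the rest
# of the cluster, so recovery can read from many source machines in parallel.
lost = [seg for seg, ms in placement.items() if "m0" in ms]
sources = {m for seg in lost for m in placement[seg] if m != "m0"}
print(len(lost), "segments to re-replicate, readable from", sorted(sources))

# Fraction of all data each existing machine sends to a newly added machine.
n = NUM_MACHINES
print(f"migration fraction per machine: {1/n - 1/(n+1):.3f}")  # 1/N - 1/(N+1)
```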

Localized computing

  • The way data is distributed in distributed systems affects the way computing is distributed
  • In a distributed system, the compute node and the storage node for computing data can reside on the same physical machine or on different physical machines
  • If compute nodes and storage nodes are located on different physical machines, the data has to be transmitted over the network, which incurs high overhead; the network bandwidth may even become the bottleneck of the whole system
  • Scheduling computation, as far as possible, onto compute nodes that reside on the same physical machine as the storage nodes holding their input data is called localized computing (see the sketch after this list)
  • Localized computing is an optimization of computing scheduling and embodies important distributed ideas:
    • Moving data is not as good as moving computation, i.e., it is better to move the computation to where the data is than to move the data to the computation
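
A minimal sketch of locality-aware scheduling under assumed names (schedule, block_locations); the scheduler prefers a free machine that already stores the task's input block and only falls back to a remote machine when no local one is free:

```python
def schedule(input_block: str,
             block_locations: dict[str, list[str]],
             free_machines: set[str]) -> str:
    """Pick a machine for a task, preferring data-local placement."""
    for machine in block_locations.get(input_block, []):
        if machine in free_machines:
            return machine                 # local: no data crosses the network
    return next(iter(free_machines))       # remote fallback: the block must be fetched

block_locations = {"block-7": ["m2", "m5"]}                # where each block's replicas live
print(schedule("block-7", block_locations, {"m1", "m5"}))  # -> m5 (local)
print(schedule("block-7", block_locations, {"m1", "m3"}))  # -> m1 or m3 (remote)
```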

Selection of data distribution mode

  • In engineering practice, a data distribution method can be chosen reasonably according to the requirements and the implementation complexity
  • The distribution methods can also be combined flexibly, gaining the advantages of each and achieving a better overall result
  • Data skew problem: distribution by data volume can be introduced on top of hash-based distribution to solve data skew
  • Suppose data is distributed by the hash value of the user ID; when the data volume of one user ID is extremely large, that user's data always lands on a single machine
    • Following the idea of distribution by data volume, the data volume per user is tracked, and a user's data is split into multiple uniform data segments according to a certain threshold; these segments are then distributed across the cluster
    • Since most users' data volume does not exceed the threshold, the metadata only stores the segment distribution information for the users that do exceed it, which keeps the metadata small
  • In this way, the combination of hash-based distribution and distribution by data volume works well (a minimal sketch follows)
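
A minimal sketch of this combination, with assumed names and a toy threshold (machine_for, heavy_user_segments, SEGMENT_SIZE); ordinary users are placed purely by hashing the user ID, while users whose data exceeds the threshold have per-segment placement recorded in metadata:

```python
import hashlib

NUM_MACHINES = 4
SEGMENT_SIZE = 1_000_000     # toy threshold: split a heavy user's data per ~1e6 records
heavy_user_segments = {}     # metadata only for users whose volume exceeds the threshold

def machine_for(user_id: str, record_no: int) -> int:
    segments = heavy_user_segments.get(user_id)
    if segments is None:
        # Ordinary user: plain hash distribution, no metadata needed.
        return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_MACHINES
    # Heavy user: find the segment this record falls into and use the machine
    # recorded for that segment in the metadata.
    return segments[record_no // SEGMENT_SIZE]

# A heavy user whose data was split into three segments placed on three machines.
heavy_user_segments["user-big"] = [0, 2, 3]
print(machine_for("user-small", 10))       # hash-based placement
print(machine_for("user-big", 2_500_000))  # segment 2 -> machine 3
```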