1. Introduction

We often hear about two measures of distributed storage: availability and reliability.

Availability refers to the availability of the system's services. It is generally measured as the fraction of the year during which the service is available, that is, annual uptime divided by total annual time. SLA targets are usually availability targets, and we will not discuss them in detail here.

Reliability refers to the durability of the data itself. We often speak of eleven nines of data reliability, which for object storage means that roughly one object out of every 100 billion stored objects may become unreadable. Data reliability targets of this kind pose a real challenge to distributed storage systems.

This article focuses on a quantitative model of data reliability for distributed systems.

2. Background

The importance of data needs no elaboration: data is the core of a company's vitality and the basis of its survival, so data reliability is the foundation on which everything else rests. Any data loss causes damage that can neither be measured nor compensated for.

As data volumes grow and environments become more complex, the factors that threaten data reliability can broadly be grouped into the following categories:

  • Hardware failures: mainly disk failures, network failures, server failures, and IDC (data center) failures;

  • Software defects: kernel bugs, design bugs in the storage software, and so on;

  • Operations and maintenance faults: human misoperation.

Among these, the most frequent hardware failure is disk failure; for anyone doing distributed storage operations and maintenance, failed disks are an everyday occurrence.

Therefore, we will try to quantify the data reliability of a distributed system in terms of disk failure.

3. Quantification of data reliability

To improve data reliability, replication and erasure coding (EC) are the most commonly used redundancy mechanisms in distributed systems. Taking replication as an example: the more copies there are, the higher the data reliability.

To make a quantitative estimate of the data reliability of a distributed system, we further analyze the factors that influence the reliability of the stored data:

  • N: the total number of disks in the distributed system. This is easy to understand: the number of disks is strongly related to reliability, and the size of N also determines how widely data can be dispersed.

  • R: the number of replicas. More replicas mean higher data reliability, but also higher storage cost.

  • Tr: the recovery time when a disk fails. The shorter the recovery time, the higher the data reliability.

  • AFR: Annualized Failure Rate, the probability that a disk fails within a year. It is mainly determined by disk quality: the better the disk, the lower the AFR and the higher the data reliability.

  • S: the number of copysets, which reflects how widely the redundant data of a single disk is dispersed across the cluster. The more widely the data is dispersed, the more likely it is that the failure of some set of R disks destroys all replicas of some piece of data. So, considering dispersion alone, the less dispersed the better.

Therefore, we can use a formula to express the data reliability of a distributed system throughout the year:
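At a high level, and anticipating the derivation in section 4, one plausible way to sketch it is as one minus the annual data-loss probability, which is a function of the factors above:

\text{Reliability}_{\text{year}} \approx 1 - P_{\text{loss}}(N,\ R,\ \mathrm{AFR},\ t_r,\ S)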

3.1 Annual disk failure rate: AFR

AFR: Annualized Failure Rate, also known as the annual disk failure probability, reflects the probability that a device fails at some point during a full year. Intuitively, the lower the AFR, the higher the system reliability, since AFR is strongly correlated with the data reliability of the system. AFR is usually derived from another disk quality metric, the MTBF (Mean Time Between Failures), which is a manufacturer specification; Seagate, for example, specifies an MTBF of 1.2 million hours. The relationship between AFR and MTBF is as follows:
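One plausible form of this relationship, treating the AFR as the probability that a disk running for the whole year (8,760 hours) fails at least once, is:

\mathrm{AFR} = 1 - e^{-8760/\mathrm{MTBF}} \approx \frac{8760}{\mathrm{MTBF}}

For an MTBF of 1,200,000 hours this gives an AFR of about 8760 / 1,200,000 ≈ 0.73%.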

However, in actual use, the observed MTBF is often lower than the manufacturer's specification. Google calculated the AFR from the disks in their production clusters, as shown below:

(Statistics of disk AFR in 5 years)

(Photo by Oceanbase.org.cn)

3.2 Data replication group: CopySet

CopySet (data replication group): informally, a set of disks that together hold all replicas of some piece of data. If every disk in a copyset is destroyed, that data is lost.

(Schematic diagram of single data random replication grouping)

(Photo from www.dazhuanlan.com)

As shown in the figure above, taking 9 disks as an example, the copysets formed so far are {1,5,6} and {2,6,8}. If no special placement policy is applied, the distribution after more data is written randomly looks like the following:

(Schematic diagram of random distribution of massive data)

(Photo from www.dazhuanlan.com)

Maximum number of copysets: as shown in the figure above, the replicas of 12 pieces of data are scattered randomly across the 9 disks, and any 3 of the disks hold all three replicas of some piece of data. The maximum number of copysets is therefore the number of ways of choosing R disks out of N:
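In combinatorial terms, choosing R = 3 disks out of N = 9:

S_{\max} = \binom{N}{R} = \frac{N!}{R!\,(N-R)!}, \qquad \binom{9}{3} = 84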

With the maximum copyset configuration, once any three disks fail, the probability of losing data is 100%. At the other extreme, the data layout can be made regular: for example, the data on one disk is backed up onto only two fixed other disks, as shown in the figure below. In that case the copysets covered by data are only {1,5,7}, {2,4,9} and {3,6,8}; that is, there are just 3 copysets. It is easy to see that the minimum number of copysets for 9 disks with 3 replicas is 3, i.e., N/R.

(Schematic diagram of disk granularity redundancy Distribution)

Therefore, the copyset count S satisfies the following:
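That is, the copyset count lies between the fully regular layout and the fully random layout:

\frac{N}{R} \le S \le \binom{N}{R}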

Since the number of copysets can be as small as N/R, should we simply configure the system for the minimum? The answer is no. On the one hand, if the copyset count is at its minimum, when a disk fails only the two other disks in its group hold its data, so the recovery time becomes long, and a longer recovery time itself hurts data reliability. On the other hand, if such a copyset is ever hit, the amount of data lost is very large. The copyset count and the recovery time in a distributed system are therefore parameters that trade off the data reliability of the whole system against cluster availability.

[1] Copysets: Reducing the Frequency of Data Loss in Cloud Storage proposes a copyset replication strategy for distributed systems such as object storage and file storage. Another way to tune the number of copysets according to the reliability and availability requirements of the system, while still placing data randomly, is to merge small files into large files: by controlling the size of the large files, we control how many of them fit on each disk. For example, if a large file is 100 GB, an 8 TB disk stores at most 8 TB / 100 GB = 80 files, which means the data on that disk is dispersed across at most 80 other disks. For a cluster with far more than 80 disks, this clearly keeps the dispersion of any single data disk well under control.

Therefore, when the shards on a disk are placed randomly across the cluster, the number of copysets can be quantified by the following formula:
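A plausible form, counting one replication group for every unique block stored in the cluster (each disk holds P × 80% / B blocks, and every R block replicas form one group), is:

S \approx \frac{P \times 80\%}{B} \times \frac{N}{R}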

Here P is the disk capacity, B is the shard (Block) size, N is the number of disks in the system, R is the number of replicas, and 80% is the assumed disk utilization.

3.3 Data Recovery Time: Recovery Time

Understandably, the data recovery time has a large impact on data reliability, so shortening it effectively reduces the risk of data loss. As described above, the recovery time is strongly related to how widely the data on a disk is dispersed, and it is also constrained by service availability, since recovery traffic competes with user traffic for bandwidth.

For example, if the disk bandwidth is 200 MB/s and 20% of it can be used for recovery, the per-disk recovery speed W is 40 MB/s. With disk capacity P, disk utilization 80%, and Block size B, the recovery time can then be estimated as follows:
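One plausible form, assuming the failed disk's blocks are rebuilt in parallel from the disks holding their replicas (at most P × 80% / B of them, and never more than N - 1), is:

t_r \approx \frac{P \times 80\%}{W \times \min\left(N-1,\ \frac{P \times 80\%}{B}\right)}

with the result converted to hours before it is used in the failure probabilities below.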

4. Derivation of the reliability model

4.1 Disk Failure and Poisson Distribution

Poisson distribution: the Poisson distribution is the limiting case of the binomial distribution. Its formula is as follows:

(Poisson distribution formula; image from Zhihu)
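In the notation used below, the probability of observing n events within a period t when the average rate is λ events per hour is:

P(n, t) = \frac{(\lambda t)^n}{n!}\, e^{-\lambda t}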

Here t is the time period in hours, n is the number of failed disks, N is the total number of disks in the cluster, and λ is the average number of disks that fail per hour.

In Section 3.1 we established that the probability of a disk failing within a year is the AFR, so the probability of a disk failing within one hour, denoted FIT (Failures in Time), is:
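Treating FIT as the per-hour failure probability of a single disk, this is simply the AFR spread over the 8,760 hours of a year:

\mathrm{FIT} \approx \frac{\mathrm{AFR}}{24 \times 365} = \frac{\mathrm{AFR}}{8760}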

In a cluster with N disks, the expected number of disks failing per hour is then FIT × N; in other words, this is the average hourly failure count λ. Therefore we get:
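That is, the Poisson rate for the whole cluster is:

\lambda = \mathrm{FIT} \times N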

4.2 Calculation and derivation of the system's annual reliability

According to 4.1, disk failures follow a Poisson distribution, so in a cluster with N disks the probability that exactly n disks fail within t hours is:
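Substituting λ = FIT × N into the Poisson formula above gives:

P(n, t) = \frac{(N \cdot \mathrm{FIT} \cdot t)^n}{n!}\, e^{-N \cdot \mathrm{FIT} \cdot t}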

Next, taking three replicas as an example, we derive a quantitative model for the probability that the cluster loses no data over a whole year. Since this probability is hard to quantify directly, we instead compute the probability that the cluster does lose data during the year and then obtain the no-loss probability from it:

Annual cluster data-loss probability: data can only be lost if a first disk fails within t (one year), the system then enters the data recovery phase, a second disk fails within the recovery time tr (regardless of how much data has already been recovered), and a third disk also fails within tr. Even then, the three failed disks do not necessarily hit one of the copyset replication groups described in section 3.2; only if they do does the cluster actually lose data during the year. The annual data-loss probability of the cluster therefore depends on P1, P2, P3 and the copyset hit probability Pc.

The probability of any disk failure within one year t is:
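Under the Poisson model this is one minus the probability of zero failures within t = 8,760 hours:

P_1 = 1 - P(0, t) = 1 - e^{-N \cdot \mathrm{FIT} \cdot t}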

Once that disk fails, recovery must start immediately. The probability that a second disk fails within the recovery time tr is:
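With N - 1 disks remaining, a plausible form is:

P_2 = 1 - e^{-(N-1) \cdot \mathrm{FIT} \cdot t_r}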

Similarly, the probability that a third disk fails within the recovery time tr is:
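Likewise, with N - 2 disks remaining:

P_3 = 1 - e^{-(N-2) \cdot \mathrm{FIT} \cdot t_r}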

The probability that the three failed disks together hit one of the cluster's copysets is:
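A plausible form is the ratio of the actual copyset count S to the number of possible 3-disk combinations:

P_c = \frac{S}{\binom{N}{3}}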

Therefore, it is not difficult to obtain the probability P of annual cluster data loss:
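Combining the four terms above:

P = P_1 \times P_2 \times P_3 \times P_c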

The probability of no data loss over the whole year, i.e., the annual data reliability, is then 1 - P.

4.3 Calculation and derivation of annual reliability under EC redundancy

Compared with the three-replica scheme, EC redundancy adds parity blocks so that data survives the failure of some blocks: with EC(D, E) coding (D data blocks plus E parity blocks), data is lost only when more than E of the D + E blocks of a stripe are lost. The recovery time tr under EC is necessarily different from that of three-way replication, and the copysets under EC are also different. Following the same reasoning as above, the annual data-loss probability P of an EC cluster can be written as the formula below; here we take E = 4 by default, i.e., data is lost only when more than four blocks of a stripe are lost.
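A plausible generalization of the three-replica formula, requiring a first failure within the year followed by E further failures within the recovery window, all of which together hit an EC copyset, is:

P_{EC} \approx \left(1 - e^{-N \cdot \mathrm{FIT} \cdot t}\right) \times \prod_{i=1}^{E}\left(1 - e^{-(N-i) \cdot \mathrm{FIT} \cdot t_r}\right) \times P_c

where Pc is the probability that the failed disks hit an EC copyset, quantified below.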

Compared with the three-replica mode, a copyset in EC mode must account for the loss of any E + 1 of the D + E blocks in a stripe, so the number of copysets in EC mode is:
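One plausible way to count them, assuming each stripe of D + E blocks is placed on D + E distinct disks and shards are otherwise placed randomly as in section 3.2, is the number of stripes times the number of fatal (E + 1)-disk combinations per stripe:

S_{EC} \approx \frac{N \times P \times 80\%}{B \times (D+E)} \times \binom{D+E}{E+1}, \qquad P_c = \frac{S_{EC}}{\binom{N}{E+1}}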

5. Reliability model estimation

5.1 Influencing factors in the quantitative model

Taking three replicas as an example, the influencing factors in the quantitative formula above for the annual data-loss probability are as follows:

  • N: the number of disks in the cluster;

  • FIT: the per-hour disk failure probability, derived from the AFR;

  • t: the time period, fixed at one year (8,760 hours);

  • Tr: the recovery time, in hours, which depends on the recovery speed W, the amount of data stored on the disk, and the shard size;

  • R: the number of replicas;

  • Z: the total storage capacity of a single disk (denoted P in the formulas above);

  • B: the size of a shard or Block, i.e., the maximum size to which small files are merged into a large file.

5.2 Quantitative reliability calculation

Next, we substitute the influencing factors above into the model, using parameters that reflect the current state of our production clusters, and compute the resulting reliability:

(Table: reliability calculation results for the example cases)
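As a minimal, self-contained sketch of how such cases can be computed from the model in sections 3 and 4 (the helper name annual_loss_probability, the 80% utilization, the 5-minute recovery-time floor, and the example parameters below are illustrative assumptions, not values from the table):

from math import comb, exp

HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def annual_loss_probability(n_disks, afr, disk_capacity_gb, block_gb,
                            recovery_mb_s, replicas=3, utilization=0.8,
                            min_recovery_hours=5 / 60):
    """Annual data-loss probability of an R-replica cluster (model of section 4.2)."""
    fit = afr / HOURS_PER_YEAR                    # per-disk hourly failure probability
    used_gb = disk_capacity_gb * utilization      # data held on one disk
    shards_per_disk = used_gb / block_gb          # blocks stored on one disk
    # Recovery proceeds in parallel from the disks holding replicas of those blocks,
    # with a floor on the recovery time for failure detection and disk kick-out.
    parallelism = min(n_disks - 1, shards_per_disk)
    recovery_hours = max(min_recovery_hours,
                         used_gb * 1024 / (recovery_mb_s * parallelism) / 3600)
    # Copyset count under random shard placement (section 3.2), capped at C(N, R).
    copysets = min(shards_per_disk * n_disks / replicas, comb(n_disks, replicas))
    p_copyset = copysets / comb(n_disks, replicas)
    # First failure within the year, then R - 1 more within the recovery window.
    p_loss = 1 - exp(-n_disks * fit * HOURS_PER_YEAR)
    for i in range(1, replicas):
        p_loss *= 1 - exp(-(n_disks - i) * fit * recovery_hours)
    return p_loss * p_copyset

# Illustrative run: 3,600 disks of 8 TB, 100 GB blocks, 100 MB/s recovery, AFR 0.73%.
p = annual_loss_probability(n_disks=3600, afr=0.0073,
                            disk_capacity_gb=8 * 1024, block_gb=100,
                            recovery_mb_s=100)
print(f"annual loss probability ~ {p:.3e}, durability ~ {1 - p:.13f}")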

Based on the derivation of disk failures and reliability in section 4.2, the calculations for the cases in the table show the following:

In Cases 1, 2, and 3, expanding the cluster from 48 disks to 804 and then to 3,600 raises the reliability from 11 nines to nearly 13 nines, and from 804 to 3,600 disks it holds at 13 nines. Logically, the probability of three disks failing grows as the cluster gets larger, but the recovery speed also grows roughly linearly with the number of disks, so the reliability improves. From 804 to 3,600 disks the reliability no longer improves, because at that scale the recovery speed stops growing linearly with the number of disks: once the disk count is large enough, the factor that determines recovery speed becomes the number of shards on a single disk.

Cases 5 and 6 are easy to understand: when the recovery speed drops from 100 MB/s to 10 MB/s, the reliability falls by more than two orders of magnitude.

Cases 7 and 8 are also easy to understand: as the AFR rises from 0.43% to 1.2% and then to 7%, the reliability falls by about three orders of magnitude.

Cases 9 and 10 are subtler. With 100 disks, increasing the Block size from 80 GB to 100 GB reduces the reliability: the larger Block lowers the copyset count, but it also slows recovery, and the recovery-speed effect outweighs the copyset effect.

Cases 11 and 12 are also harder to interpret. Because we put a floor of 5 minutes on the recovery time (to simulate production, where detecting the failed disk, kicking it out of the cluster, and so on also takes time), and the copyset count in both cases is very large, recovery concurrency is very high but the recovery time is pinned at the 5-minute floor. With the recovery time identical in the two cases, the comparison comes down to the copyset count: Case 12 has fewer copysets than Case 11, so it is less likely that the failed disks hit a copyset, and its reliability is higher.

6. Summary

  • First, the lower the AFR the better. AFR is the single biggest factor determining the probability that disk failures cause data loss in the cluster;

  • Second, recovery speed: without hurting the service availability targets, maximizing the bandwidth used to recover from disk failures is another important way to improve cluster data reliability;

  • Third, if the recovery time is bounded anyway (for example, the system architecture means it takes about 5 minutes from detecting a faulty disk to actually starting data recovery), then the copyset count can be lowered by reasonably reducing how widely each disk's data is dispersed. For systems that place data at shard or Block granularity, this means increasing the Block size to reduce dispersion and thereby improve data reliability.

References

1. zhuanlan.zhihu.com

2. Copysets: Reducing the Frequency of Data Loss in Cloud Storage

3. https://www.dazhuanlan.com

4. oceanbase.org.cn

Gong Bing, VIVO Internet Universal Storage Research and Development Team