This article is published on the NetEase Cloud Community with the authorization of its author, Sun Jianliang.



1. Storage system reliability

In general, we use multi-replica techniques to improve the reliability of storage systems, whether they are structured databases (typically MySQL), document-oriented NoSQL databases (MongoDB), or general-purpose blob storage systems (GFS, Hadoop HDFS).

Data is very nearly the life of an enterprise. How to measure the reliability of cluster data correctly, and how to design the system so that the cluster achieves higher data reliability, are the questions this article tries to answer.

2. Data loss and copysets

“What is the probability of data loss when three disks fail simultaneously in a 999-disk, 3-replica system?” The answer depends heavily on how the storage system is designed. Let’s start with two extreme designs.

Design 1: Partition the 999 disks into 333 fixed groups of 3 disks each, and always write the 3 replicas of a piece of data onto the disks of a single group.

In this design, data is lost only if the three failed disks happen to be exactly one of the 333 groups. The probability of data loss is therefore 333/C(999,3) ≈ 2.01e-06.

Design 2: The 3 replicas of every piece of data are scattered completely at random across the 999 disks. In the extreme case, the replicas of the data held on any single disk end up spread over all 998 other disks in the cluster. In this design, the probability of data loss is C(999,3)/C(999,3) = 1; that is, any three simultaneous disk failures are guaranteed to lose data.

These two extremes show that the probability of data loss is closely related to how widely the data is scattered. For convenience in the rest of the article, we introduce a new concept: the copyset.

Copyset: a group of devices that holds all of the replicas of a piece of data. For example, if a piece of data is written to disks 1, 2 and 3, then {1,2,3} is a copyset (replication group).

In a 9-disk cluster with 3 replicas, the minimum number of copysets is 3, for example copysets = {1,2,3}, {4,5,6}, {7,8,9}; that is, the replicas of a piece of data may only be written to one of these replication groups. Data is lost only when all three disks of {1,2,3}, {4,5,6} or {7,8,9} fail. In general, the minimum number of copysets is N/R.

The maximum number of copysets in a system is C(N,R), where R is the number of replicas and N is the number of disks. The number of copysets reaches this maximum C(N,R) when the disks holding the replicas are chosen completely at random: for any R disks you pick, there is some piece of data whose R replicas sit on exactly those R disks.

In a storage system with N disks and R replicas, the number of copysets S satisfies N/R ≤ S ≤ C(N,R).
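
A minimal Python sketch of the two extreme designs above, using N = 999 disks and R = 3 replicas; `math.comb` computes C(n, k):

```python
from math import comb

N, R = 999, 3                      # disks and replicas in the example above

# Design 1: disks partitioned into fixed groups of R -> S = N / R copysets.
s_min = N // R
# Data is lost only if the 3 failed disks form one of the 333 groups.
p_loss_design1 = s_min / comb(N, R)

# Design 2: replicas may land on any R disks -> S = C(N, R) copysets.
s_max = comb(N, R)
# Any 3 simultaneous failures hit some copyset, so loss is certain.
p_loss_design2 = s_max / comb(N, R)

print(f"copysets: min {s_min}, max {s_max}")
print(f"Design 1 loss probability given 3 failures: {p_loss_design1:.2e}")  # ~2.0e-06
print(f"Design 2 loss probability given 3 failures: {p_loss_design2}")      # 1.0
```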

3. Estimating disk failures and storage system reliability

3.1 Disk failures and the Poisson distribution

Before formally estimating the probability, we need one basic probability distribution: the Poisson distribution. The Poisson distribution describes the probability that a random event occurs a given number of times in a system, for example the number of passengers arriving at a bus stop, or the number of babies born in a hospital within an hour. For more vivid examples, see Ruan Yifeng’s “Poisson and Exponential Distributions: a 10-minute tutorial”.


The Poisson distribution is:

P(N(t) = n) = ((λt)^n * e^(-λt)) / n!

Here P denotes a probability, N(t) is the number of events that have occurred by time t, n is the number of events, and λ is the rate (frequency) at which events occur.

For example, the probability that exactly 10 out of 1000 disks fail within one year is P(N(365) = 10), with time measured in days. λ is then the expected number of failures per day among the 1000 disks. According to Google’s statistics, the annual disk failure rate is about 8%, so λ = 1000 * 8% / 365.
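
A minimal Python sketch of this calculation, translating the Poisson formula directly; the 8% annual failure rate and the day-based time unit follow the example above:

```python
from math import exp, factorial

def poisson_pmf(n: int, lam: float, t: float) -> float:
    """P(N(t) = n) for a Poisson process with rate lam events per unit time."""
    mu = lam * t                        # expected number of events in time t
    return (mu ** n) * exp(-mu) / factorial(n)

# Example from the text: 1000 disks, 8% annual failure rate, time counted in days.
lam = 1000 * 0.08 / 365                 # expected disk failures per day
p = poisson_pmf(10, lam, 365)           # P(exactly 10 failures in one year)
print(f"P(N(365) = 10) = {p:.3e}")      # tiny, since ~80 failures per year are expected
```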

The Poisson distribution above only tells us the probability that a given number of disks fail. How can we use it to estimate the data reliability (i.e., the probability of data loss) of a distributed system?

3.2 Data loss rate estimation in distributed storage systems

3.2.1 Failure rate within time T

To estimate the annual data loss rate of a distributed storage system, first make a simplifying assumption: within the time window T (one year), the system is already full of data and failed disks are never repaired or recovered. Under this assumption we calculate the probability of data loss over the year.

Here we define some symbols:
N: number of disks
T: length of the statistical time window
K: number of failed disks
S: number of copysets (replication groups) in the system
R: number of replicas

How do we compute the probability of data loss within T (1 year)? From the perspective of probability and statistics, we have to account for every possible data loss event within T. In a system of N disks with R replicas, data can only be lost when the number of failed disks K is at least R, i.e., K = R, R+1, R+2, ..., N (K ∈ [R, N]). When such a failure event occurs, data is actually lost only if the failed disks cover a complete replication group.

Given that K disks have failed (equivalent to choosing K disks at random out of N), the probability that they cover at least one replication group is

P = X / C(N,K), where X is the number of K-disk combinations that contain at least one complete replication group.
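
As a concrete illustration of X, here is a brute-force check in Python for the 9-disk cluster from section 2, with copysets {1,2,3}, {4,5,6}, {7,8,9}; the choice of K = 4 failed disks is purely illustrative:

```python
from itertools import combinations
from math import comb

N, K = 9, 4                                   # 9 disks, 4 simultaneous failures
copysets = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}]  # minimal copyset layout from section 2

# X = number of K-disk combinations that contain at least one complete copyset
X = sum(1 for failed in combinations(range(1, N + 1), K)
        if any(cs <= set(failed) for cs in copysets))

P = X / comb(N, K)
print(f"X = {X}, P = X / C({N},{K}) = {P:.4f}")   # X = 18, P ~= 0.1429
```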

Then the probability of data loss caused by K disk failures within time T is

Pa(T,K) = P * P(N(T) = K)

Finally, the probability of data loss in the system within time T is the sum of the probabilities of all these possible data loss events.

Pb(T) = Σ Pa(T,K),  K ∈ [R, N]
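
A rough Python sketch of the whole estimate. It assumes the copysets are the S = N/R disjoint groups of Design 1, so that the hit probability X/C(N,K) can be computed exactly by inclusion-exclusion; the helper names and the cutoff for negligible Poisson terms are illustrative choices, not part of the original derivation:

```python
from math import comb, exp, lgamma, log

def poisson_pmf(n: int, mu: float) -> float:
    """P(N = n) for mean mu, computed in log space to stay stable for large n."""
    return exp(n * log(mu) - mu - lgamma(n + 1))

def p_hit(N: int, R: int, K: int) -> float:
    """P(K random failed disks cover at least one copyset), assuming the copysets
    are the S = N/R disjoint groups of Design 1 (exact inclusion-exclusion)."""
    S = N // R
    total, j = 0.0, 1
    while j <= S and j * R <= K:
        total += (-1) ** (j + 1) * comb(S, j) * comb(N - j * R, K - j * R) / comb(N, K)
        j += 1
    return min(1.0, max(0.0, total))

def pb(N: int = 999, R: int = 3, afr: float = 0.08, t_days: float = 365.0) -> float:
    """Pb(T): sum over K in [R, N] of (X / C(N,K)) * P(N(T) = K)."""
    mu = N * afr * t_days / 365          # expected disk failures in the window
    total = 0.0
    for K in range(R, N + 1):
        p_k = poisson_pmf(K, mu)
        if p_k < 1e-15:                  # skip negligible tail terms
            continue
        total += p_hit(N, R, K) * p_k
    return total

print(f"Pb(1 year, no recovery) ~= {pb():.3e}")
```

With no recovery for an entire year the result is substantial, which is exactly why the recovery window introduced in section 3.2.2 matters.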

3.2.2 Annual failure rate of a distributed system

Above, we assumed that no recovery action is taken for any hardware failure during the whole year, and computed the annual failure rate of the system in that state. In a real large-scale storage system, however, a recovery procedure is normally started as soon as disks fail, and after recovery the system can in principle be regarded as returning to its initial state. Once this factor is added, the reliability calculation becomes more complicated.

In theory, disk failure and recovery in a large-scale storage system form an extremely complex continuous process. Here we simplify the probability model into discrete events within separate unit time windows of length T. As long as the probability of a loss event spanning two adjacent windows is very small, and most disk failures can be recovered within T, each window can be treated as starting again from a fresh state, and the approximation remains reasonably accurate. With T measured in hours, a year can be divided into 365*24/T windows, and the annual failure rate of the system can be interpreted as 1 minus the probability that no data loss occurs in any of these windows.

That is, the probability of data loss in the system as a whole is:

Pc = 1 - (1 - Pb(T)) ** (365*24/T)
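
A short sketch of this annualization step; the window length T and the per-window probability Pb(T) below are placeholder values, not results computed in this article:

```python
T_hours = 24        # length of one window T, in hours (failed disks are assumed
                    # to be recovered within this window) -- illustrative value
pb_T = 1e-9         # hypothetical Pb(T) for a single window of T_hours

windows_per_year = 365 * 24 / T_hours
pc = 1 - (1 - pb_T) ** windows_per_year
print(f"annual data loss probability Pc ~= {pc:.3e}")   # ~3.65e-07 for these inputs
```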

4. References

  • Google’s Disk Failure Experience

  • Building on the distribution

  • Poisson and Exponential Distributions: a 10-minute tutorial

  • Probability theory, binomial distribution and Poisson distribution

  • Disk failures and annual failure rate estimation of storage systems



