
1. Introduction

Two indicators are commonly used to measure the quality of a distributed storage system: availability and reliability.

Availability refers to the availability of the system's service. It is generally measured as the time the service is available during the year divided by the total time in the year. The SLA we usually talk about is an availability indicator, so it will not be discussed further here.

Reliability refers to the reliability of the data itself. We often speak of 11 nines of data reliability; in object storage, this means that roughly one object out of 100 billion stored objects becomes unreadable. The challenge that such a target poses to a distributed storage system is self-evident.

This article focuses on analyzing a quantitative model for the data reliability of distributed systems.

2. Background

The importance of data needs no elaboration. Data can be regarded as the core of an enterprise's vitality and the foundation of its survival. Data reliability is therefore fundamental: any data loss causes losses to the enterprise that are hard to calculate or compensate for.

As the scale of data grows and the environment becomes more complex, the factors that threaten data reliability can be roughly grouped into the following categories:

  • Hardware failures: mainly disk failures, network failures, server failures, and IDC failures;
  • Software defects: kernel bugs, bugs in software design, etc.;
  • Operation and maintenance failures: human misoperation.

Among hardware failures, disk failures are the most frequent; for engineers doing distributed storage operation and maintenance, bad disks are part of everyday life.

Therefore, we will try to quantify the data reliability of a distributed system along the dimension of disk failures.

3. Data reliability quantification

To improve data reliability, replication and EC (erasure coding) redundancy are the most commonly used techniques in distributed systems. Taking replication as an example, the more replicas, the higher the data reliability.

To make a quantitative estimate of the data reliability of a distributed system, further analysis yields the factors that affect the reliability of stored data:

  • N: the total number of disks in the distributed system. Intuitively, the number of disks is strongly related to reliability, and the size of N also strongly affects how widely data is scattered.
  • R: the number of replicas. More replicas mean higher data reliability, but also higher storage cost.
  • T: RecoveryTime, the time needed to recover data after a disk fails. The shorter the recovery time, the higher the data reliability.
  • AFR: Annualized Failure Rate, the annual failure rate of a disk, which reflects disk quality. The better the quality, the lower the AFR and the higher the data reliability.
  • S: the number of CopySets, which reflects how widely the redundant copies of one disk's data are scattered across the cluster. The more scattered the data, the more likely it is that some three failed disks happen to hold all the replicas of some piece of data, which is then lost. So from the perspective of scattering alone, the less scattered the better.

Therefore, we can express the annual data reliability of a distributed system as a function of these factors:
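Schematically, it takes the form below, where $P_{\text{loss}}$ is the annual data-loss probability derived in Section 4:

$$ \text{Reliability} = 1 - P_{\text{loss}}(N,\ R,\ T,\ \text{AFR},\ S) $$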

3.1 Annual Disk Failure Rate: AFR

AFR (Annualized Failure Rate), also called the annual failure probability of a hard disk, reflects the probability that a device fails during a full year. Intuitively, the lower the AFR, the higher the reliability of the system, because AFR is strongly correlated with data reliability. This indicator is usually derived from another disk quality indicator, MTBF (Mean Time Before Failure), which major hard disk manufacturers publish as a factory specification; for example, Seagate hard disks have an MTBF of 1.2 million hours. The AFR is calculated as follows:
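A standard conversion, with MTBF in hours and 8760 hours in a year, is:

$$ \text{AFR} = 1 - e^{-\frac{8760}{\text{MTBF}}} \approx \frac{8760}{\text{MTBF}} $$

For an MTBF of 1.2 million hours this gives an AFR of roughly 0.73%.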

In practice, however, the actual MTBF is often lower than the factory specification. Google calculated the AFR of the hard disks in their online clusters as follows:

(Statistics of hard disk AFR in 5 years)

(Picture from http://oceanbase.org.cn )

3.2 Data copy group: CopySet

CopySet: in plain terms, a set of nodes that together hold all the replicas of a piece of data; if every node in a CopySet is damaged, that data is lost.

(Schematic diagram of random replica placement for a single piece of data)

(Picture from https://www.dazhuanlan.com )

As shown in the figure above, taking 9 disks as an example, the CopySets of these 9 disks are {1,5,6} and {2,6,8}. If no special handling is done, then as the amount of data grows, the random distribution of data looks like the following:

(Schematic diagram of random distribution of massive data)

(Picture from https://www.dazhuanlan.com )

Maximum number of CopySets: as shown in the figure above, the replicas of 12 pieces of data are randomly scattered across 9 disks. Picking almost any 3 disks from the figure yields all three replicas of some piece of data, so the maximum number of CopySets is the number of combinations of R elements chosen from N elements:
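In combinatorial notation:

$$ S_{\max} = \binom{N}{R} = \frac{N!}{R!\,(N-R)!} $$

For this example, N = 9 and R = 3, giving $\binom{9}{3} = 84$ possible CopySets.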

Once any three disks fail under this maximum-CopySet layout, the probability of data loss is 100%. In the other extreme, data is distributed regularly: for example, the data on one disk is only replicated onto two fixed other disks, as shown in the figure below. In that case the only CopySets covered by data are (1,5,7), (2,4,9), and (3,6,8), i.e., the number of CopySets is 3. It is not hard to see that the minimum number of CopySets for 9 disks with 3 replicas is 3, that is, N/R.

(Schematic diagram of disk granularity redundancy distribution)

Therefore, the number of CopySets S conforms to the following:
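Combining the minimum and maximum cases above:

$$ \frac{N}{R} \le S \le \binom{N}{R} $$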

Since the number of CopySets can be as small as N/R, should we simply minimize it? Of course not. On the one hand, if the number of CopySets is minimized, then when a disk fails only the other two disks in its CopySet participate in its recovery, so the recovery time becomes longer, and a longer recovery time also hurts data reliability. On the other hand, once one of these CopySets is hit, the amount of data lost is very large. Therefore the number of CopySets and the recovery time (RecoveryTime) are parameters that trade off the data reliability of the whole system against the availability of the cluster.

Reference [2], Copysets: Reducing the Frequency of Data Loss in Cloud Storage, proposes a copyset replication strategy for distributed systems. In distributed storage systems such as object storage and file storage, another way to balance reliability and availability by adjusting the number of CopySets is to merge small files into large files under random placement. The number of large files on each disk can be controlled by controlling the size of the large files; for example, with 100 GB files, an 8 TB disk stores at most 8 TB / 100 GB = 80 files. In other words, the data of an 8 TB disk is scattered across at most 80 other disks. For a cluster with far more than 80 disks, this clearly keeps the degree to which one disk's data is scattered well under control.

Therefore, when the fragments on the disk are randomly scattered, the number of CopySets can be quantified as the following formula:
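One plausible form, assuming each block's replica set is placed independently at random (and capped at the maximum $\binom{N}{R}$), is:

$$ S \approx \frac{0.8 \times P \times N}{B \times R} $$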

Here P is the capacity of the disk, B is the slice (block) size, N is the number of disks in the system, R is the number of replicas, and 80% is the disk utilization rate.

3.3 Data recovery time: Recovery Time

Data recovery time has a large impact on data reliability; this is easy to understand, and shortening the recovery time effectively reduces the risk of data loss. As discussed above, the recovery time is strongly related to how widely a disk's data is scattered, and it is also constrained by the availability of the service itself.

For example, suppose the disk bandwidth is 200 MB/s and 20% of it is available for recovery, i.e., 40 MB/s; the disk capacity is P, the utilization rate is 80%, and B is the block size. The recovery time can then be calculated as follows:
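A plausible form, consistent with the case analysis in Section 5.2 (where the recovery parallelism is bounded both by the number of other disks and by the number of blocks on a single disk), is:

$$ T_r \approx \frac{0.8 \times P}{W \times \min\!\left(N - 1,\ \dfrac{0.8 \times P}{B}\right)} $$

With P = 8 TB at 80% utilization, B = 100 GB, W = 40 MB/s, and a cluster of well over 65 disks, the parallelism is about 65 and the recovery time works out to roughly 2,600 seconds, i.e., around 40 minutes.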

4. Reliability model derivation

4.1 Disk failure and Poisson distribution

Poisson distribution: the Poisson distribution is the limit of the binomial distribution. Its formula is as follows:
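In standard form, the probability of observing n events in a period of length t, given an average event rate of λ per unit time, is:

$$ P(n, t) = \frac{(\lambda t)^n}{n!}\, e^{-\lambda t} $$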

(Picture from Zhihu)

Here t is the time period (in hours), n is the number of failed disks, N is the number of disks in the whole cluster, and λ is the average number of disk failures per unit time (1 hour).

From Section 3.1 we know that the probability of a disk failing within one year is AFR; the probability of a disk failing within a unit time of 1 hour is then FIT (Failures in Time):
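With 8760 hours in a year, this works out to:

$$ \text{FIT} = \frac{\text{AFR}}{365 \times 24} = \frac{\text{AFR}}{8760} $$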

Then, in a cluster of N disks, the average number of disks that fail within a unit time of 1 hour is FIT × N; in other words, this is the Poisson parameter λ. So we get:
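Using the notation of the Poisson formula above:

$$ \lambda = N \times \text{FIT} $$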

4.2 Calculation and derivation of system reliability throughout the year

From Section 4.1, disk failures follow a Poisson distribution, so the probability that n disks fail within t hours in a cluster of N disks is:
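Substituting λ = N × FIT into the Poisson formula gives:

$$ P(n, t) = \frac{(N \cdot \text{FIT} \cdot t)^n}{n!}\, e^{-N \cdot \text{FIT} \cdot t} $$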

Next, taking 3 replicas as an example, we derive a quantitative model for the probability that the cluster loses no data over a whole year. With 3 replicas, it is not easy to quantify that probability directly, so we instead compute the probability that the cluster does lose data during the year; the probability of no data loss over the year is then its complement:

Probability of data loss in the cluster over the year: data is lost only in the following scenario. A first disk fails at some point within t (1 year) and the system enters the data recovery phase; a second disk fails within the recovery time tr (conservatively, we do not account for how much data has already been recovered); and then a third disk also fails within tr. Even then, these three disks may not hit one of the CopySets introduced in Section 3.2; only if they do has the cluster really lost data within the year. Therefore the probability of data loss in the cluster over the whole year is determined by P1, P2, P3 and the CopySet hit probability Pc.

The probability that any disk fails within one year t is:
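This is one minus the probability that no disk fails during the year (t = 8760 hours):

$$ P_1 = 1 - P(0, t) = 1 - e^{-N \cdot \text{FIT} \cdot t} $$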

After that disk fails, recovery must start immediately. The probability that another disk fails within the recovery time tr is:
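By the same reasoning, applied to the remaining N − 1 disks over the recovery window tr:

$$ P_2 = 1 - e^{-(N-1) \cdot \text{FIT} \cdot t_r} $$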

The probability that a third disk fails within the recovery time tr is:
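Similarly, over the N − 2 disks that remain:

$$ P_3 = 1 - e^{-(N-2) \cdot \text{FIT} \cdot t_r} $$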

The probability that these three failed disks hit one of the cluster's CopySets is:
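That is, the fraction of all 3-disk combinations that form a CopySet:

$$ P_c = \frac{S}{\binom{N}{3}} $$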

Therefore, the probability P of data loss in the cluster over the whole year is:
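Multiplying the probabilities of the three overlapping failures by the CopySet hit probability:

$$ P = P_1 \times P_2 \times P_3 \times P_c $$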

The probability of no data loss in the cluster over the whole year is then 1 − P.

4.3 Calculation and derivation of annual reliability of EC redundancy

Compared with the three-replica mechanism, EC redundancy uses additional parity blocks so that data is not lost when some blocks fail. Suppose data is EC-encoded into (D, E) block groups, i.e., D data blocks plus E parity blocks. When calculating the annual cluster data-loss probability under EC, note that the recovery time tr in EC mode is certainly different from the three-replica case, and the CopySets in EC mode are also different: EC mode can tolerate the loss of E blocks, and data becomes unrecoverable only when the blocks lost from the same group of D + E blocks exceed what the parity can repair. With this, it is not hard to work out the annual data-loss probability P in EC mode. In the formula below, the example takes 4 as the number of lost blocks that causes data loss:
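A plausible form, by direct analogy with the three-replica derivation (four overlapping disk failures within the recovery window that together hit an EC CopySet), is:

$$ P_{EC} = P_1 \times P_2 \times P_3 \times P_4 \times P_c^{EC}, \qquad P_4 = 1 - e^{-(N-3) \cdot \text{FIT} \cdot t_r} $$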

Compared with the three-replica mode, the CopySets in EC mode must account for the loss of any E blocks among the D + E blocks of a group. The number of CopySets in EC mode is:
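One plausible way to write this, assuming blocks of size B are grouped into stripes of D + E blocks spread over distinct disks, is:

$$ S_{EC} \approx \frac{0.8 \times P \times N}{B \times (D+E)} \times \binom{D+E}{E} $$

where the first factor counts the block groups in the cluster and the second counts the combinations of E blocks within a group.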

5. Reliability model estimation

5.1 Quantitative model influencing factors

Taking three replicas as an example, from the quantitative formula above for the probability of cluster-wide data loss, the influencing factors are:

  • N: the number of disks in the cluster;
  • FIT: the per-disk failure rate within 1 hour, derived from AFR;
  • t: fixed at 1 year;
  • tr: the recovery time, in hours, related to the recovery speed W, the amount of data on the disk, and the slice size;
  • R: the number of replicas;
  • Z: the total storage capacity of a disk;
  • B: the size of a slice or block, i.e., the maximum size of the large files into which small files are merged.

5.2 Reliability quantitative calculation

Next, we plug the factors that affect reliability into the model, using values reflecting a production cluster, and compute the reliability:
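To make the estimation concrete, the following is a minimal Python sketch of the three-replica model derived in Section 4.2, combined with the CopySet and recovery-time estimates from Sections 3.2 and 3.3. The function name and the parameter values in the example run are illustrative assumptions, not the exact configurations of the cases discussed below.

```python
import math

def annual_loss_probability(n_disks, afr, disk_tb, block_gb, recovery_mb_s,
                            replicas=3, utilization=0.8):
    """Estimate the annual data-loss probability of an R-replica cluster.

    Follows the derivation in Section 4.2: three overlapping disk failures
    within the recovery window that also hit one of the cluster's CopySets.
    """
    hours_per_year = 8760
    fit = afr / hours_per_year                     # per-disk failure rate per hour

    # Blocks per disk and CopySet count under random block placement (Section 3.2).
    disk_gb = disk_tb * 1024 * utilization
    blocks_per_disk = disk_gb / block_gb
    copysets = min(n_disks * blocks_per_disk / replicas,
                   math.comb(n_disks, replicas))

    # Recovery time in hours: the failed disk is rebuilt in parallel, bounded by
    # the number of blocks on it and by the surviving disks (Section 3.3), with a
    # 5-minute floor to model failure detection and disk kick-out (Section 5.2).
    parallelism = min(n_disks - 1, blocks_per_disk)
    recovery_hours = disk_gb * 1024 / (recovery_mb_s * parallelism) / 3600
    t_r = max(recovery_hours, 5 / 60)

    # Probabilities of the three overlapping failures and of hitting a CopySet.
    p1 = 1 - math.exp(-n_disks * fit * hours_per_year)
    p2 = 1 - math.exp(-(n_disks - 1) * fit * t_r)
    p3 = 1 - math.exp(-(n_disks - 2) * fit * t_r)
    p_hit = copysets / math.comb(n_disks, replicas)

    return p1 * p2 * p3 * p_hit

# Illustrative run: 804 disks of 8 TB, AFR 0.43%, 80 GB blocks, 100 MB/s recovery.
p_loss = annual_loss_probability(804, 0.0043, 8, 80, 100)
print(f"annual loss probability: {p_loss:.3e}")
print(f"nines of reliability:    {-math.log10(p_loss):.1f}")
```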

Combining the derivation of disk failures and reliability in Section 4.2, the calculations for the cases in the table show the following:

Cases 1, 2, and 3 expand the number of disks from 48 to 804 and then to 3600. Reliability rises from 11 nines to nearly 13 nines, and then stays at 13 nines from 804 to 3600 disks. One might expect that as the cluster grows, the probability of three disks failing together increases; but because the recovery speed also grows linearly with the number of disks, reliability keeps improving at first. From 804 to 3600 disks reliability stops improving because the recovery speed no longer grows linearly with the number of disks: once the cluster is large enough, the factor that bounds the recovery speed becomes the number of slices on a single disk.

Cases 5 and 6 are easy to understand: the recovery speed drops from 100 MB/s to 10 MB/s and the reliability drops by more than 2 orders of magnitude.

Cases 7 and 8 are also easy to understand: the AFR rises from 0.43% to 1.2% and then to 7%, and the reliability drops by 3 orders of magnitude.

Cases 9 and 10 are more subtle: with 100 disks, the block size is increased from 80 GB to 100 GB and the reliability decreases. In this case the recovery time becomes longer and the number of CopySets becomes smaller, but the effect of the recovery speed dominates.

Cases 11 and 12 are also more subtle. We impose a floor of 5 minutes on the recovery time (to simulate production, since detecting a bad disk, kicking it out of the system, and so on also take time). The CopySet counts in these two cases are very large, so the recovery concurrency is very high; but with the 5-minute floor the recovery time of the two cases is the same. What differentiates them is therefore the number of CopySets: Case 12 has fewer CopySets than Case 11, so the probability of hitting one is lower and the reliability is higher.

6. Summary

  • First, the lower the AFR the better: AFR is the biggest factor directly determining the probability that disk failures cause data loss in the cluster.
  • Second is the recovery speed: without hurting service availability, maximizing the bandwidth used to recover from disk failures is another important factor in improving the reliability of cluster data.
  • If the recovery time is bounded below, for example because the system architecture requires about 5 minutes from discovering a bad disk, through kicking it out, to the start of data recovery, then the number of CopySets can be reduced by reasonably reducing how widely a disk's data is scattered; for systems that place data at slice or block granularity, this means increasing the block size to reduce data dispersion and thereby improve data reliability.

Reference

1. https://zhuanlan.zhihu.com

2. Copysets: Reducing the Frequency of Data Loss in Cloud Storage

3. https://www.dazhuanlan.com

4. http://oceanbase.org.cn

Author: vivo Internet Universal Storage R&D Team - Gong Bing
