According to the 2021 hard drive reliability report released by cloud storage provider Backblaze, storage hardware alone cannot fully guarantee data reliability; we also need software-level mechanisms to achieve reliable storage. A common design principle for distributed software is to design for failure.
As a widely used distributed file system, HDFS has to solve the problem of data reliability. Before Hadoop 3.0, HDFS could only rely on multi-replica redundancy to reach its reliability targets (roughly eleven nines with three replicas and eight nines with two replicas). Although multi-replica redundancy is simple and reliable, it multiplies storage consumption; as data volumes grow, it brings considerable extra cost. To address the cost of redundant data, HDFS introduced Erasure Coding (EC) in Hadoop 3.0.
This article shares the technical principles of EC and Getui's EC practice, and walks you through Hadoop 3.0.
In-depth interpretation of EC principles
EC technology is widely used in RAID and communications. By encoding data, it can recover the original content even when part of the data is lost.
We can understand the goal of EC this way: for n data blocks of the same size, m additional parity blocks are added, so that the original data can still be recovered even if any m of the n+m blocks (data or parity) are lost.
Taking HDFS's RS-10-4-1024k policy as an example, with only 1.4x storage redundancy it achieves data reliability roughly equivalent to 5 replicas; in other words, it reaches higher data reliability with lower redundancy.
1. EC algorithm
Common EC algorithms include XOR and RS. The following is a brief introduction:
// Simple EC algorithm: XOR
XOR is based on the exclusive-or operation. Bitwise XORing two data blocks produces a third, parity block. If any one of the three blocks is lost, it can be recovered by XORing the other two.
HDFS implements this algorithm through the XOR-2-1-1024k EC policy. Although it reduces redundancy, it can tolerate the loss of only one of the three blocks, which is not reliable enough for many scenarios.
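A minimal sketch of the XOR idea (illustrative only, not the HDFS implementation), using two tiny made-up "data blocks":

```python
# Two made-up data blocks, two bytes each.
d1 = bytes([0b10110010, 0b01100001])
d2 = bytes([0b11001100, 0b00011110])

# Encode: the parity block is the bitwise XOR of the two data blocks.
parity = bytes(a ^ b for a, b in zip(d1, d2))

# Suppose d1 is lost: XORing the two surviving blocks recovers it.
recovered_d1 = bytes(a ^ b for a, b in zip(d2, parity))
assert recovered_d1 == d1
```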
// Improved EC algorithm: RS
Another coding method that reduces redundancy is Reed-Solomon (RS), which takes two parameters and is denoted RS(n, m), where n is the number of data blocks and m is the number of parity blocks. Note that under RS, the number of parity blocks m is also the maximum number of blocks (data or parity) that can be lost.
The RS algorithm multiplies the n data cells by a generator matrix (GT, Generator Matrix) to obtain a codeword consisting of the n data cells and m parity cells. If storage fails, the data can be recovered through the generator matrix as long as any n of the n+m cells are still available.
RS overcomes the limitation of XOR: it uses linear algebra to generate multiple parity cells and can therefore tolerate multiple failures.
The following figure illustrates the encoding and decoding process of the RS algorithm:
Encoding process (left of the figure):
- Form the n data cells into a vector D.
- Generate a transformation matrix B, consisting of an n-order identity matrix stacked on an m×n Vandermonde matrix.
- Multiply B by D to obtain a new matrix (the codeword) with error-correction capability.
Decoding process (right of the figure):
- Take the rows of B corresponding to the surviving cells to form the matrix B'.
- Take the surviving rows of the codeword computed during encoding to form the matrix Survivors.
- Multiply the inverse of B' by the Survivors matrix to recover the original data.
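In matrix notation, the two processes can be summarized (my own restatement of the steps above) as:

```latex
% Encoding: stack an n-order identity on an m x n Vandermonde block to form B,
% then multiply by the data vector D to obtain the codeword C.
B = \begin{pmatrix} I_n \\ V_{m \times n} \end{pmatrix}, \qquad C = B \, D

% Decoding: B' and Survivors keep only the surviving rows of B and C;
% inverting B' recovers the data.
D = (B')^{-1} \cdot \mathrm{Survivors}
```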
To make the encoding and decoding process more concrete, let's take the RS-3-2-1024k policy as an example and walk through encoding and decoding with the EC algorithm.
Suppose there are three pieces of data: d1, d2, and d3. We store two additional parity pieces, so that if any two of the five pieces are lost, the data can still be fully recovered.
First, construct the error correction matrix according to the encoding process:
1. Form the vector D from d1, d2, d3.
2. Generate the transformation matrix B.
3. Compute the codeword matrix B*D.
Suppose d1 and d2 are lost; we recover them by decoding:
1. Take the surviving rows of B to form the matrix B'.
2. Take the surviving rows of the codeword matrix to form the matrix Survivors.
3. Calculate the inverse of B'.
4. Multiply the inverse of B' by the Survivors matrix to recover the original data.
So far, we have achieved the goal: by storing 2 additional blocks alongside the original 3 data blocks, any 2 of these 5 blocks can be lost and still be recovered.
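The short NumPy sketch below reproduces this RS-3-2 walkthrough with made-up values. It uses ordinary real arithmetic over a Vandermonde-style matrix purely for illustration; production RS codecs, including HDFS's, operate over a Galois field such as GF(2^8).

```python
import numpy as np

n, m = 3, 2
D = np.array([3.0, 7.0, 11.0])                      # data cells d1, d2, d3 (made-up values)

# Transformation matrix B: an n-order identity stacked on an m x n Vandermonde block.
B = np.vstack([np.eye(n),
               [[1.0, 1.0, 1.0],                    # x^0, x^1, x^2 for x = 1
                [1.0, 2.0, 4.0]]])                  # x^0, x^1, x^2 for x = 2

codeword = B @ D                                    # 5 cells: d1, d2, d3, p1, p2

# Suppose d1 and d2 are lost; keep only the surviving rows.
survivor_rows = [2, 3, 4]
B_prime = B[survivor_rows]                          # B'
survivors = codeword[survivor_rows]                 # Survivors

# D = (B')^-1 * Survivors recovers the original data cells.
recovered = np.linalg.inv(B_prime) @ survivors
assert np.allclose(recovered, D)
print(recovered)                                    # [ 3.  7. 11.]
```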
Now compare this with three-way replication in terms of reliability. With three replicas, the data (d1, d2, d3) can tolerate any two of the machines storing it going down or losing disks, because at least one copy remains available and can be re-replicated to other nodes to restore the three-replica level. Similarly, under the RS-3-2 policy, any 2 of the machines holding the 5 blocks can go down or lose disks, because the 2 lost blocks can always be recovered from the remaining 3.
It can be seen that three-way replication and the RS-3-2 policy are roughly equivalent in reliability.
In terms of redundancy (redundancy = actual storage space / effective storage space), three-way replication stores 2 extra copies for every data block, giving a redundancy of 3/1 = 3. Under the RS-3-2 policy, only 2 extra blocks are needed for every 3 data blocks to reach the same reliability target, giving a redundancy of 5/3 ≈ 1.67.
In short, RS-3-2 achieves roughly three-replica reliability with only 1.67x redundancy.
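The same redundancy formula is easy to reproduce for other policies; a quick sketch, with RS-6-3 (another built-in HDFS policy) added for comparison:

```python
# Redundancy = actual storage / effective storage.
def redundancy(data_blocks: int, extra_blocks: int) -> float:
    return (data_blocks + extra_blocks) / data_blocks

print(redundancy(1, 2))     # 3.0   -> three replicas: each block stored 3 times
print(redundancy(3, 2))     # ~1.67 -> RS-3-2
print(redundancy(6, 3))     # 1.5   -> RS-6-3
print(redundancy(10, 4))    # 1.4   -> RS-10-4, the 1.4x mentioned earlier
```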
The following figure shows the proportion of effective data and redundant data under different Hadoop policies; as can be seen, the three-replica mode has the highest storage cost:
2. Striped layout
The replication strategy works at block (Block) granularity: data is written contiguously into a block until the block size limit (128 MB by default) is reached, and then the next block is allocated. Taking the most common three-replica setup as an example, each block has three identical copies stored on three DataNodes (DNs).
The HDFS EC policy instead adopts a striped block layout (Striping Block Layout). Striped storage works at block-group granularity: data is written horizontally across the blocks in the group, so the segments stored on the same block are not contiguous in the file. When a block group is full, the next block group is allocated. A short sketch further below makes this offset-to-cell mapping concrete.
The following figure compares the contiguous layout with the block-group layout under the RS(3,2) policy:
Compared with continuous layout, striped layout has the following advantages:
- Supports writing EC data directly, without offline conversion
- More friendly to small files
- Improved I/O parallelism
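Here is a small illustrative sketch of the striped layout; the cell size, block size, and group width are assumptions matching an RS(3,2)-style policy, not values read from any cluster. It maps a logical file offset to its position in a block group:

```python
CELL_SIZE = 1024 * 1024            # 1 MiB cells (the "1024k" in policy names)
DATA_BLOCKS = 3                    # data blocks per block group, as in RS(3,2)
BLOCK_SIZE = 128 * 1024 * 1024     # assumed 128 MB block size

def locate(offset: int):
    """Return (block_group, block_index, cell_index, offset_in_cell) for a file offset."""
    group_capacity = DATA_BLOCKS * BLOCK_SIZE
    block_group, in_group = divmod(offset, group_capacity)
    cell_index, in_stripe = divmod(in_group, DATA_BLOCKS * CELL_SIZE)
    block_index, offset_in_cell = divmod(in_stripe, CELL_SIZE)
    return block_group, block_index, cell_index, offset_in_cell

# The first three 1 MiB cells of a file land on blocks 0, 1, 2 of group 0;
# the fourth cell wraps back to block 0 as its second cell (cell_index 1).
print(locate(0))                  # (0, 0, 0, 0)
print(locate(3 * CELL_SIZE))      # (0, 0, 1, 0)
print(locate(4 * CELL_SIZE + 5))  # (0, 1, 1, 5)
```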
Getui's practice of landing EC on Hadoop 2.x
Getui planned its clusters early on, splitting the Hadoop deployment into a hot cluster for compute-heavy workloads and a cold cluster for storage-heavy workloads. After Hadoop 3.x was released, we upgraded the cold cluster to Hadoop 3.x and tried new features including EC. Considering the compatibility and stability requirements of the compute engines, and to reduce migration cost, we kept the hot cluster on Hadoop 2.7.
Compute Engine Access
Since the hot cluster that carries most computing tasks runs Hadoop 2.x and our internal compute engines do not support Hadoop 3.x, the first problem to solve before bringing EC into production was enabling Hadoop 2.x clients to access EC data stored on Hadoop 3.x. To this end, we customized the hadoop-hdfs module of Hadoop 2.7 and backported the EC functionality from Hadoop 3.x. The core changes include:
- Introducing the EC codec and striping-related functionality
- Adapting the protobuf (PB) protocol
- Reworking the client read path
Resource localization
When deploying the modified code package, we used Hadoop's "resource localization" mechanism to simplify gray release and rollout.
"Resource localization" refers to the process by which the NodeManager downloads the resources a container depends on (jars, dependent jars, or other files) from HDFS before starting the container. With this feature, we can distribute our customized jar to the containers of the corresponding computing tasks, controlling the jar environment at the granularity of individual applications, which makes subsequent testing, gray verification, and rollout very convenient.
SQL access
At present, a large number of tasks at Getui are submitted as SQL. Most SQL tasks are converted into jobs of the corresponding compute engine once submitted to YARN, so for these tasks the approach described above applies directly.
However, some tasks are not submitted through YARN and instead interact with HDFS directly, such as computations on small datasets or SQL statements that merely preview a few rows via LIMIT (for example, select * from table_name limit 3). These require the nodes running them to be able to access EC data on Hadoop 3.x as well.
Taking Hive as an example, the following figure shows the several ways Hive accesses HDFS data in Getui's environment; here, HiveCli and HiveServer2 need to be adapted accordingly:
3. Corruption checks
Among the EC-related issues reported by the community, we found several bugs that can corrupt encoded data, such as HDFS-14768, HDFS-15186, and HDFS-15240. To ensure that the converted data is correct, we perform additional block-level checks on the encoded data.
When designing the verification tool, we needed it to verify not only newly EC-encoded data but also data that had already been converted to EC earlier. Our main idea is therefore to decode using all the parity blocks together with a subset of the data blocks, and then compare the decoded data blocks against the original data blocks.
Let's take RS-3-2-1024k as an example to review the verification process:
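As a rough sketch of this idea (not Getui's actual verification tool), we can reuse the toy real-arithmetic RS(3,2) codec from the walkthrough above: decode from the parity cells plus a subset of the data cells, then compare against the data cells actually stored.

```python
import numpy as np

# Toy RS(3, 2) codec over the reals; real checks operate on HDFS block data over GF(2^8).
n, m = 3, 2
B = np.vstack([np.eye(n), [[1.0, 1.0, 1.0], [1.0, 2.0, 4.0]]])

def verify_block_group(cells: np.ndarray) -> bool:
    """Decode from all m parity cells plus n - m data cells, then compare the
    reconstructed data cells against the data cells actually stored."""
    kept_rows = [2, 3, 4]                           # one data cell + both parity cells
    decoded = np.linalg.inv(B[kept_rows]) @ cells[kept_rows]
    return bool(np.allclose(decoded, cells[:n]))    # a mismatch indicates corruption

good = B @ np.array([3.0, 7.0, 11.0])
bad = good.copy()
bad[1] = 999.0                                      # simulate a corrupted data cell
print(verify_block_group(good), verify_block_group(bad))   # True False
```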
Using this additional verification tooling, we sampled 1 PB of data that had been converted from single-replica storage to EC and verified it. The results showed that faulty files accounted for less than one part per million, both in number and in size, and the reliability met our target requirements.
Looking ahead, we still need to identify low-access data and filter out small files when selecting candidates for subsequent EC encoding. We also plan to explore and design a system that automatically detects when data cools down and converts it from the replication strategy to the EC strategy. In addition, we will continue to investigate the Intel ISA-L acceleration library to improve the computing efficiency of EC encoding and decoding.
Follow the Getui Technology Practice public account to unlock more in-depth content on "big data cost reduction and efficiency improvement".
Easter eggs
The "Big Data Cost Reduction and Efficiency Improvement" series of columns is deeply participated in by the daily interactive big data platform architecture team.
The daily interactive big data platform architecture team is the core team responsible for the research and development of the daily interactive big data platform. Based on the customization and optimization of open source components in the big data ecosystem, it helps to reduce costs and improve the efficiency of various businesses of the company. Distributed storage, distributed storage are welcome. Experts in the field of big data platform architecture such as computing and distributed databases will join and communicate.
- Résumé submission: hrzp@getui.com
- Technical exchange: reach us through the Getui Technology Practice public account, or email tech@getui.com