
This article is relatively long; reading it carefully is recommended, as different readers will take away different insights.

I. Overview

The distributed file system is a foundational application in the distributed-systems field, and the most famous examples are undoubtedly HDFS and GFS. The field has matured, but understanding their design points and ideas will still be valuable when we face similar scenarios and problems in the future.

Moreover, HDFS/GFS is not the only form a distributed file system can take. There are other products with different shapes and their own merits, and understanding them also broadens our horizons.

This article attempts to analyze what problems a distributed file system needs to solve, what kinds of solutions exist, and the rationale behind each choice.

II. The Past

Distributed file systems appeared decades ago, represented by the Network File System (NFS) developed by Sun in 1984. The main problem it solved at the time was making disks available over the network: the disk was detached from the host and became an independent resource.

In this way, you not only get more capacity, you can also switch hosts at any time and gain data sharing, backup, disaster tolerance, and so on, because data is the most important asset on a computer. The data communication path of NFS is as follows:

[Figure: NFS data communication diagram]

The client deployed on the host machine forwards file commands to the remote file server for execution through the TCP/IP protocol, and the whole process is transparent to the host user.

In the Internet age, traffic and data grow rapidly, and the main scenarios a distributed file system has to address have changed. Very large disk space is now required, which cannot be achieved by vertically scaling a single disk system; it must be done in a distributed way. At the same time, under a distributed architecture the hosts are ordinary, not particularly reliable servers, so fault tolerance, high availability, durability, and scalability become features that must be designed in.

III. Requirements for a Distributed File System

For a distributed file system, there are certain characteristics that must be met, otherwise it cannot be competitive. Mainly as follows:

  • It should conform to the POSIX file interface standard, so that the system is easy to use, and there is no need to modify the user's legacy system;
  • It is transparent to users and can be used directly like a local file system;
  • Persistence, to ensure that data will not be lost;
  • Scalability, smooth expansion when data pressure gradually increases;
  • Have a reliable security mechanism to ensure data security;
  • Data consistency: as long as the file content has not changed, any read should return the same content.

In addition, some features are bonus points for a distributed file system, as follows:

  • The larger the support space, the better;
  • The more concurrent access requests supported, the better;
  • The faster the performance, the better;
  • The higher and more reasonable the utilization of hardware resources, the better.

IV. Architecture Model

From the business model and logical architecture, the distributed file system needs these types of components:

  • Storage component: responsible for storing file data; it must ensure file persistence, consistency between copies, data block allocation/merging, and so on;
  • Management component: responsible for meta information, i.e. the metadata of file data, including which server a file is stored on, file size, permissions, and so on. It is also responsible for managing the storage components, including whether the server hosting a storage component is still alive, whether data migration is needed, and so on;
  • Interface component: provides interface services to applications, including SDKs (Java/C/C++, etc.), a CLI command-line terminal, and support for the FUSE mount mechanism.

In terms of deployment architecture, there are two divergent approaches, "centralized" and "decentralized", i.e. whether to use the "management component" as the central control node of the distributed file system. Both routes have excellent products; the differences between them are introduced below.

1. With a central node

Taking GFS as the representative, the central node is responsible for the management and control functions of file location, maintenance of file meta information, fault detection, data migration, etc. The following figure is the architecture diagram of GFS:

[Figure: GFS architecture diagram]

In the figure, the GFS master is the central node of GFS, and the GFS chunkserver is its storage node. The read path works as follows:

  • Client requests "query a certain part of data of a certain file" from the central node;
  • The central node returns the location of the file (which file on which chunkserver) and byte interval information;
  • According to the information returned by the central node, the Client directly sends a data read request to the corresponding chunk server;
  • The chunk server returns the data.

In this scheme, the central node generally does not participate in actual data reads and writes: it returns the file meta information to the Client, and the Client then communicates directly with the data nodes. The main purpose is to reduce the load on the central node so that it does not become a bottleneck. Because a central node is easy to control and very capable, this scheme is widely used in all kinds of storage systems.
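To make this read path concrete, here is a minimal in-process sketch of the interaction described above. All class names, addresses, and the metadata layout are illustrative assumptions, not the real GFS API:

```python
# A minimal sketch of the "central node" read path: the master answers metadata
# questions only, and the bytes flow directly between Client and chunkserver.

class MasterNode:
    """Central node: keeps file -> (chunkserver, chunk, byte range) metadata."""
    def __init__(self):
        # file path -> list of (chunkserver address, chunk id, start, end)
        self.metadata = {
            "/logs/app.log": [("chunkserver-1:7000", "chunk-42", 0, 64 * 2**20)],
        }

    def locate(self, path, offset):
        # Return the chunk location covering the requested offset.
        for server, chunk_id, start, end in self.metadata[path]:
            if start <= offset < end:
                return server, chunk_id, start, end
        raise FileNotFoundError(path)


class ChunkServer:
    """Storage node: serves raw chunk bytes; the master never touches the data."""
    def __init__(self, chunks):
        self.chunks = chunks  # chunk id -> bytes

    def read(self, chunk_id, offset, length):
        return self.chunks[chunk_id][offset:offset + length]


def client_read(master, chunkservers, path, offset, length):
    # 1) Ask the central node only for metadata.
    server, chunk_id, start, _ = master.locate(path, offset)
    # 2) Then talk to the storage node directly for the data.
    return chunkservers[server].read(chunk_id, offset - start, length)


# Example wiring (in-process stand-ins for network calls):
master = MasterNode()
servers = {"chunkserver-1:7000": ChunkServer({"chunk-42": b"x" * 1024})}
print(client_read(master, servers, "/logs/app.log", 0, 10))
```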

2. Without a central node

Ceph is the representative here: every node is autonomous and self-managing, and the entire Ceph cluster contains only one type of node, as shown in the figure below (the red RADOS layer at the bottom is the node type defined by Ceph that holds both metadata and file data).

[Figure: Ceph architecture diagram]

The biggest advantage of decentralization is that it removes the bottleneck of the central node itself, which is why Ceph claims to be able to scale out almost without limit. But if the Client communicates with the servers directly, the Client must know which node in the cluster to contact when operating on a given file. Ceph solves this with a powerful algorithm of its own design: the CRUSH algorithm.

V. Persistence

For a file system, durability is fundamental: once the Client receives a response from the Server saying the save succeeded, the data must not be lost. This is mainly solved through multiple copies, but in a distributed environment, multiple copies raise the following questions.

  • How to ensure that the data of each copy is consistent?
  • How to disperse the copies so that when a disaster occurs, all copies will not be damaged?
  • How to detect damaged or out-of-date copies, and how to deal with it?
  • Which copy should be returned to the Client?

1. How to ensure that the data of each copy is consistent?

Synchronous writing is the most straightforward way to ensure consistent copy data. When the Client writes a file, the Server will wait for all copies to be successfully written, and then return to the Client.

This method is simple and reliable; the only drawback is that performance suffers. Assuming there are 3 copies and each copy takes N seconds to write, the Client may be blocked for 3N seconds. There are several ways to optimize:

  • Parallel writing: one copy acts as the primary copy and sends the data to the other copies in parallel;
  • Chain writing: the copies form a chain; instead of waiting to receive the full content before propagating it, each node works like a stream, forwarding data downstream while still receiving it from upstream (a minimal sketch follows).
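Below is a minimal sketch of the chain-writing idea, assuming an in-process chain of nodes; real systems stream packets over the network, but the forward-while-receiving behaviour is the same:

```python
# Chain (pipelined) writing: each replica persists a packet and immediately
# forwards it downstream, so all copies are written almost concurrently.

class ChainNode:
    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.buffer = bytearray()

    def receive(self, packet: bytes):
        # Persist the packet locally ...
        self.buffer.extend(packet)
        # ... and forward it downstream without waiting for the whole file.
        if self.downstream:
            self.downstream.receive(packet)


# Build a chain of 3 replicas: node1 -> node2 -> node3
node3 = ChainNode("replica-3")
node2 = ChainNode("replica-2", node3)
node1 = ChainNode("replica-1", node2)

for packet in (b"hello ", b"distributed ", b"file system"):
    node1.receive(packet)          # the client streams packets into the head

assert node1.buffer == node2.buffer == node3.buffer
```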

Another approach is the quorum-style W + R > N scheme often discussed alongside CAP. For example, with 3 copies (N = 3), take W = 2 and R = 2: a write is considered successful once 2 copies have been written, and a read also reads 2 copies at the same time. This reduces write cost at the price of some read cost and improves write availability. This approach is rarely used in distributed file systems.
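As a quick sanity check of why W + R > N works, the following toy enumeration (the copy names are arbitrary) shows that with N = 3, W = 2, R = 2 every read quorum must overlap every write quorum, so a read always touches at least one copy holding the latest write:

```python
# Verify the quorum-overlap property for N = 3, W = 2, R = 2.
from itertools import combinations

N, W, R = 3, 2, 2
copies = {"A", "B", "C"}

for write_quorum in combinations(copies, W):
    for read_quorum in combinations(copies, R):
        # Because W + R > N, the intersection can never be empty.
        assert set(write_quorum) & set(read_quorum), "quorums must overlap"

print("every read quorum overlaps every write quorum:", W + R > N)
```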

2. How to disperse the copies so that when a disaster occurs, all copies will not be damaged?

This is mainly to avoid a natural or environmental failure taking out an entire machine room or even a city, so at least one copy should be placed far away. The side effect is that the write performance of that copy may drop somewhat, because it is the farthest from the Client. Therefore, if sufficient network bandwidth cannot be guaranteed physically, the read/write strategy for the copies needs to be reconsidered.

One option is to write only the 2 local copies synchronously and write the remote copy asynchronously. To preserve consistency, reads must then take care to avoid the possibly outdated data on the asynchronously written copy.

3. How to detect damaged or expired copies, and how to deal with them?

If there is a central node, the data nodes communicate with it periodically, reporting information about their own data blocks, and the central node compares this against the information it maintains. If the checksum of a data block is wrong, the block is damaged; if the version of a data block is wrong, the block is out of date.

If there is no central node, take Ceph as an example: it maintains a relatively small monitor cluster within the node cluster, and the data nodes report their status to this monitor cluster, which determines whether a copy is damaged or out of date.

When a damaged or expired copy is found, it is removed from the meta information and a new copy is created. The removed copy is reclaimed later by the garbage-collection mechanism.
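A hedged sketch of the central-node check described above might look like the following; the report fields and decision rules are illustrative assumptions, not any particular system's protocol:

```python
# Detect damaged or expired blocks from the periodic reports sent by data nodes.
import hashlib
from dataclasses import dataclass

@dataclass
class BlockReport:
    block_id: str
    version: int
    checksum: str          # checksum computed by the data node over its copy

def check_block(report: BlockReport, expected: dict) -> str:
    """Compare a reported block against the metadata the central node holds."""
    meta = expected[report.block_id]          # {"version": ..., "checksum": ...}
    if report.checksum != meta["checksum"]:
        return "damaged"                      # bit rot or a partial write
    if report.version < meta["version"]:
        return "expired"                      # the copy missed a later update
    return "healthy"

# Example: the central node believes block-7 is at version 3 with a known checksum.
payload = b"block contents"
expected = {"block-7": {"version": 3, "checksum": hashlib.md5(payload).hexdigest()}}

print(check_block(BlockReport("block-7", 3, hashlib.md5(payload).hexdigest()), expected))  # healthy
print(check_block(BlockReport("block-7", 2, hashlib.md5(payload).hexdigest()), expected))  # expired
```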

4. Which copy should be returned to the Client?

There are many strategies here, such as round-robin, the fastest node, the node with the highest success rate, the node with the most idle CPU, simply fixing the first copy as the primary, or choosing the nearest copy, which shaves some time off the overall operation.

VI. Scalability

1. Scaling of storage nodes

When a new storage node joins the cluster, it actively registers with the central node and reports its own information. When files are subsequently created, or new data blocks are appended to existing files, the central node can assign them to this new node, which is relatively easy. But there are some issues to consider:

  • How to make the load of each storage node relatively balanced?
  • How to ensure that newly added nodes will not collapse due to excessive short-term load pressure?
  • If data migration is required, how to make it transparent to the business layer?

1) How to make the load of each storage node relatively balanced?

First of all, there must be a metric for evaluating a storage node's load. There are many options: you can look at disk space usage alone, or make a composite judgment from disk usage, CPU usage, and network traffic. Generally speaking, disk usage is the core indicator.

Secondly, when allocating new space, priority is given to storage nodes with low resource usage. For existing storage nodes, if a node is overloaded or resource usage is unbalanced, data migration is required.
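A minimal sketch of such load-aware placement is shown below; the weights and node figures are made up for illustration, with disk usage dominating as discussed:

```python
# Pick the least-loaded node for a new data block using a composite load score.

def load_score(node: dict) -> float:
    """Lower is better. Disk usage dominates, as discussed above."""
    return 0.6 * node["disk_used_ratio"] + 0.2 * node["cpu_ratio"] + 0.2 * node["net_ratio"]

def pick_node_for_new_block(nodes: list[dict]) -> dict:
    # Prefer the node with the lowest composite load.
    return min(nodes, key=load_score)

nodes = [
    {"name": "store-1", "disk_used_ratio": 0.82, "cpu_ratio": 0.40, "net_ratio": 0.30},
    {"name": "store-2", "disk_used_ratio": 0.35, "cpu_ratio": 0.55, "net_ratio": 0.20},
    {"name": "store-3", "disk_used_ratio": 0.60, "cpu_ratio": 0.10, "net_ratio": 0.10},
]
print(pick_node_for_new_block(nodes)["name"])   # store-2
```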

2) How to ensure that a newly added node will not collapse under excessive short-term load pressure?

When the system finds that a new storage node has just been added, its resource usage is obviously the lowest, and routing all write traffic to it could overload the new node in the short term. Therefore, resource allocation needs a warm-up period during which write pressure is routed over gradually, until a new equilibrium is reached.
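One simple way to express the warm-up is a write weight that ramps up with the node's age; the linear ramp and the one-hour window below are illustrative assumptions:

```python
# Warm-up weight for a newly added node: its share of new writes ramps up
# linearly over a warm-up window instead of jumping to full weight at once.

def write_weight(age_seconds: float, warmup_seconds: float = 3600.0) -> float:
    """Fraction of the node's 'fair' write share it is allowed to take."""
    if age_seconds >= warmup_seconds:
        return 1.0
    return age_seconds / warmup_seconds   # linear ramp; real systems may use steps

for age in (0, 900, 1800, 3600, 7200):
    print(f"node age {age:>4}s -> weight {write_weight(age):.2f}")
```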

3) If data migration is required, how to make it transparent to the business layer?

With a central node, this is relatively easy: the central node takes care of everything. It decides which storage node is under too much pressure, decides which files to migrate where, updates its own meta information, and handles writes and renames that arrive during the migration. No upper-layer application needs to be involved.

Without a central node, the cost is higher, and this situation must be accounted for in the overall design of the system. Ceph, for example, adopts a two-layer structure of logical location and physical location. The logical layer (pools and placement groups) stays unchanged during migration; only the data blocks at the physical layer move, which merely changes the physical address referenced by the logical layer. From the Client's perspective, the logical location of a block does not change.
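The following sketch captures this two-layer idea in a highly simplified form: objects hash to a stable logical placement group, and only the placement-group-to-node table changes when data moves. This is a simplification for illustration, not the real CRUSH algorithm:

```python
# Logical layer: object -> placement group (stable). Physical layer: PG -> nodes.
import hashlib

PG_COUNT = 8

def pg_of(object_name: str) -> int:
    # Logical location: deterministic and unchanged by migration.
    digest = hashlib.sha1(object_name.encode()).hexdigest()
    return int(digest, 16) % PG_COUNT

# Physical location: which storage nodes currently hold each PG.
pg_to_nodes = {pg: [f"osd-{pg % 3}", f"osd-{(pg + 1) % 3}"] for pg in range(PG_COUNT)}

obj = "photos/cat.jpg"
pg = pg_of(obj)
print(f"{obj} -> pg {pg} -> nodes {pg_to_nodes[pg]}")

# During migration only the PG -> node table is updated; pg_of(obj) stays the same,
# so the client's view of the logical location is unchanged.
pg_to_nodes[pg] = ["osd-3", "osd-4"]
print(f"after migration: {obj} -> pg {pg_of(obj)} -> nodes {pg_to_nodes[pg]}")
```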

2. Scaling of the central node

If there is a central node, its scalability must also be considered. Since the central node is the control center and typically runs in master-slave mode, its scalability is limited and has an upper bound: it cannot exceed the capacity of a single physical machine. We can, however, use various means to push that ceiling as high as possible. Several approaches:

  • Store files as large data blocks: for example, the HDFS data block size is 64 MB and the Ceph data block size is 4 MB, far larger than the 4 KB of a single-machine file system. The point is to drastically reduce the amount of metadata, so that the memory of a single central node can hold the meta information for a large enough amount of disk space (a rough calculation follows after this list);
  • The central node adopts a multi-level structure: the top-level central node stores only directory metadata, which indicates which sub-master node holds the metadata for the files in a given directory; the file's actual storage node is then found through that sub-master node;
  • Central nodes share a storage device: multiple central nodes are deployed, but they share the same external storage/database where the meta information lives, so the central nodes themselves are stateless. In this mode the request-processing capacity of the central layer is greatly enhanced, but performance takes some hit. iRODS uses this approach.
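To get a feel for the first point, here is a rough back-of-the-envelope calculation; the 150 bytes of in-memory metadata per block is an assumed figure for illustration, not a measured one:

```python
# Block size drives the number of blocks, and therefore the metadata footprint.

TOTAL_DATA = 1 * 2**50            # 1 PiB of file data
META_PER_BLOCK = 150              # bytes of metadata per block (assumption)

for label, block_size in [("4 KiB (local fs)", 4 * 2**10),
                          ("4 MiB (ceph-style)", 4 * 2**20),
                          ("64 MiB (HDFS-style)", 64 * 2**20)]:
    blocks = TOTAL_DATA // block_size
    meta_bytes = blocks * META_PER_BLOCK
    print(f"{label:>20}: {blocks:>15,} blocks, ~{meta_bytes / 2**30:,.1f} GiB metadata")
```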

VII. High Availability

1. High availability of the central node

The high availability of the central node must not only ensure the high availability of its own applications, but also ensure the high availability of meta data.

High availability of the meta data is mainly about durability; a backup mechanism is needed to ensure it is not lost. The usual approach is to add a slave node and synchronize the master node's data to it in real time. Shared disks backed by RAID 1 hardware are also used to provide high availability. Obviously, the active/standby approach of adding a slave node is easier to deploy.

The data persistence strategy of meta data has the following methods:

  • Save it directly to a storage engine, generally a database. Saving it directly to disk as files is not impossible, but since the meta information is structured data, doing so amounts to developing a small database yourself, which is complicated;
  • Append change logs to a disk file (similar to MySQL's binlog or Redis's AOF), and rebuild the in-memory state from the log when the system starts serving. On every modification, write the disk log first and then update the in-memory data. This approach is simple and effective.

In-memory serving plus log-file persistence is currently the mainstream approach. First, all lookups are pure in-memory operations, which is very efficient, and the log file is written sequentially; second, it can be deployed standalone without relying on external components.

To keep the log file from growing without bound over time, and so that the system can start and recover quickly, memory snapshots are needed as a complement: periodically dump the in-memory state to disk, and keep only the log entries written after the dump. On recovery, load the latest memory dump first, then find the log entries after the corresponding checkpoint and replay them.
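A minimal sketch of this "in-memory state + append-only log + periodic snapshot" scheme, assuming the meta information is a simple key-value dict and using made-up file names:

```python
# Log-first writes, periodic snapshots, and replay-based recovery.
import json, os

LOG, SNAPSHOT = "meta.log", "meta.snapshot"

def apply(state: dict, entry: dict) -> None:
    state[entry["key"]] = entry["value"]

def write(state: dict, key: str, value) -> None:
    # Write the disk log first, then update the in-memory data, as described above.
    with open(LOG, "a") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")
    apply(state, {"key": key, "value": value})

def snapshot(state: dict) -> None:
    # Dump the whole in-memory state; the old log can then be truncated.
    with open(SNAPSHOT, "w") as f:
        json.dump(state, f)
    open(LOG, "w").close()

def recover() -> dict:
    # Load the latest snapshot, then replay any log entries written after it.
    state = json.load(open(SNAPSHOT)) if os.path.exists(SNAPSHOT) else {}
    if os.path.exists(LOG):
        with open(LOG) as f:
            for line in f:
                apply(state, json.loads(line))
    return state
```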

2. High availability of storage nodes

The previous "Persistence" section already covered this: keeping multiple copies of the data guarantees high availability of the storage nodes without losing data.

VIII. Performance Optimization and Cache Consistency

In recent years, with the development of infrastructure, gigabit and even 10-gigabit bandwidth has become common in LANs. At 10 Gbps, roughly 1250 MB of data can be transferred per second, while the read/write speed of SATA disks has largely hit a bottleneck at around 300-500 MB/s. In other words, for pure reads and writes the network has surpassed the disk and is no longer the bottleneck, which is also why NAS network disks have become popular over the years.

But this does not mean that there is no need to optimize read and write, after all, the speed of network read and write is still much slower than memory read and write. Common optimization methods mainly include:

  • Cache file content in memory;
  • Preload the data block to avoid waiting for the client;
  • Combining read and write requests, i.e. accumulating individual requests and sending them to the server in batches.

While the use of cache improves read and write performance, it also brings data inconsistencies:

  • Lost updates: when multiple clients write to the same file within a short period, the client that wrote first may lose its content, because it can be overwritten by a client that writes later;
  • Data visibility: a client reads from its own cache, so until the cache expires it cannot see updates made by other clients; that is, at the same moment, different clients reading the same file may see inconsistent content.

There are several methods for this type of problem:

  • Files are read-only and never modified: once created, a file can only be read, not modified, so client caches can never become inconsistent;
  • Locking: when using locks, granularity must be considered. Are other Clients allowed to read while one is writing? Are other Clients allowed to write while one is reading? This is a trade-off between performance and consistency. A file system cannot make assumptions about the business, so a single reasonable default is hard to pick; it is best to provide locks of different granularities and let the business side choose. The side effect is that the cost of use for the business side goes up.

IX. Security

Since a distributed file storage system is necessarily a multi-client, multi-tenant product that may store important information, security is an important part of it.

The permission models of mainstream file systems are as follows:

  • DAC: Discretionary Access Control, the familiar Unix-like permission framework. It is a three-level user-group-privilege system, where user is the owner, group covers both the owner's group and everyone else, and privilege includes read, write, and execute. The system takes the owner as its starting point: the owner decides which permissions other users get on which files.
  • MAC: Mandatory Access Control, which classifies resources by confidentiality level, for example "normal", "confidential", and "top secret", while each user has a corresponding clearance level. This kind of permission system originated in security agencies and the military, where it is more common, and permissions are controlled and set by administrators. SELinux in Linux is an implementation of MAC, provided to compensate for the defects and security risks of DAC. For the problems SELinux solves, see "What is SELinux?".
  • RBAC: Role Based Access Control, a role-based permission system: a role is granted certain resource permissions, and users belong to roles. It fits the organizational structure of enterprises very well, and RBAC can also be specialized into a DAC-like or MAC-like permission model.

Distributed file systems on the market make different choices. For example, Ceph provides a permission system similar to, but slightly different from, DAC; Hadoop relies on the operating system's own permission framework, supplemented by Apache Sentry, which provides an RBAC-based permission system.

X. Other Topics

1. Space allocation

There are two schemes: contiguous space and linked-list space. Contiguous space reads and writes fast because access is sequential, but it causes disk fragmentation. Worse, once the large contiguous regions have been allocated and only holes remain, contiguous allocation needs to know the size of the file in advance in order to find a suitably sized hole, and that size is often unknown ahead of time (for example, an editable Word document can grow at any moment).

Linked-list space produces little fragmentation, but reads and writes are very slow, especially random reads: you must walk the list block by block starting from the first block of the file.

The index table was introduced to solve this: the mapping between a file and its data blocks is also saved, stored in an index node (usually called the inode), which the operating system loads into memory so that random lookups of data blocks can be done entirely in memory. This removes the drawback of the on-disk linked list. If the index node is too large to fit in memory, a multi-level index structure can be formed.
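The following is a simplified sketch of an inode-style index with direct and single-indirect block pointers; the block numbers and sizes are illustrative, and a real inode also records permissions, timestamps, and so on:

```python
# Resolve a byte offset to a physical block number entirely in memory.
from dataclasses import dataclass, field

BLOCK_SIZE = 4096

@dataclass
class Inode:
    direct: list[int] = field(default_factory=list)          # block numbers
    indirect: list[list[int]] = field(default_factory=list)  # blocks of block numbers

    def block_for_offset(self, offset: int) -> int:
        index = offset // BLOCK_SIZE
        if index < len(self.direct):
            return self.direct[index]
        index -= len(self.direct)
        per_indirect = BLOCK_SIZE // 4          # 4-byte block numbers per indirect block
        return self.indirect[index // per_indirect][index % per_indirect]

inode = Inode(direct=[10, 11, 12], indirect=[[200 + i for i in range(1024)]])
print(inode.block_for_offset(0))          # 10  (first direct block)
print(inode.block_for_offset(3 * 4096))   # 200 (first block behind the indirect level)
```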

2. File deletion

Real-time deletion or delayed deletion? Real-time deletion frees disk space quickly. Delayed deletion only sets a flag when the delete is executed and removes the data in batches at some later point in time; its advantage is that deleted files remain recoverable for a while, which minimizes accidental deletion, while its disadvantage is that the disk space stays occupied. In a distributed file system, disk space is a relatively abundant resource, so almost all systems use logical deletion so that data can be restored, and then actually reclaim the deleted resources after a period of time (perhaps 2 or 3 days; this parameter is generally configurable).

How are deleted or useless data blocks reclaimed? Start from the file meta information: if a data block appears in the "file to data block" mapping, it is in use; if it does not, the block is invalid. Deleting a file therefore really means deleting its "file to data block" mapping from the meta information (if the file should remain recoverable for a while, the mapping is first moved elsewhere).
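A hedged sketch of this reclamation idea: compare the blocks present on storage nodes against everything still referenced by the "file to data block" mappings (including mappings moved to a trash area), and whatever is unreferenced can be reclaimed. The data structures are illustrative:

```python
# Reclaim blocks that no live or trashed file references.

# Metadata kept by the management component (illustrative).
file_to_blocks = {
    "/a.txt": {"blk-1", "blk-2"},
    "/b.txt": {"blk-3"},
}
# Deleted files whose mappings were moved aside instead of dropped immediately.
trash = {"/old.log": {"blk-9"}}

# Blocks actually present on the storage nodes.
blocks_on_disk = {"blk-1", "blk-2", "blk-3", "blk-7", "blk-9"}

def reclaimable(blocks_on_disk, file_to_blocks, trash) -> set[str]:
    referenced = set().union(*file_to_blocks.values(), *trash.values())
    return blocks_on_disk - referenced       # blocks nothing points to anymore

print(reclaimable(blocks_on_disk, file_to_blocks, trash))   # {'blk-7'}
```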

3. Distributed file system for small files

There are many such scenarios, for example e-commerce sites with vast numbers of product images and avatars, or social networking sites with vast numbers of photos. Their characteristics can be summarized briefly as follows:

  • Each file is not big;
  • The number is extremely large;
  • Read more and write less;
  • Will not be modified.

For this kind of business scenario, the mainstream implementation is still to store data in large blocks; each small file exists only logically, i.e. the file's meta information records which large block it lives in, along with its offset and length within that block, forming a logically independent file. This reuses the strengths and accumulated engineering of large-block systems while also reducing the amount of meta information.
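Here is a minimal sketch of packing small files into one large block and keeping only (block id, offset, length) as each file's meta information; the class and names are illustrative, in the spirit of systems built for this scenario:

```python
# Many small files share one large block; the metadata per file is tiny.

class BlockStore:
    def __init__(self, block_id: str):
        self.block_id = block_id
        self.data = bytearray()
        self.index = {}                       # small-file name -> (offset, length)

    def put(self, name: str, content: bytes) -> tuple[str, int, int]:
        offset = len(self.data)
        self.data.extend(content)
        self.index[name] = (offset, len(content))
        return self.block_id, offset, len(content)   # this tuple is the file's "meta"

    def get(self, name: str) -> bytes:
        offset, length = self.index[name]
        return bytes(self.data[offset:offset + length])

block = BlockStore("big-block-001")
meta = block.put("avatar_123.jpg", b"\x89PNG...tiny image bytes")
print(meta)                                   # ('big-block-001', 0, 23)
print(block.get("avatar_123.jpg")[:4])        # b'\x89PNG'
```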

4. File fingerprinting and de-duplication

A file fingerprint is a unique identifier computed from the file's content by an algorithm: if two files have the same fingerprint, their content is the same. When using a cloud network drive, you sometimes find that an upload finishes instantly; that is the file fingerprint at work. By checking the fingerprint, the cloud provider finds that someone has already uploaded the same file, so there is no need to upload it again, only to add a reference. In a file system, fingerprints can be used for deduplication, and also for checking whether file content is damaged or whether the copies of a file are consistent; it is a basic building block.

There are also many fingerprinting algorithms, such as the familiar MD5 and SHA-256, as well as simhash and minhash, which are aimed specifically at near-duplicate detection of text.
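A small sketch of fingerprint-based deduplication using SHA-256, one of the algorithms just mentioned; the in-memory dict stands in for the real block store:

```python
# Store each unique content blob once; identical uploads only add a reference.
import hashlib

store = {}          # fingerprint -> content
references = {}     # file path -> fingerprint

def upload(path: str, content: bytes) -> str:
    fp = hashlib.sha256(content).hexdigest()
    if fp in store:
        print(f"{path}: identical content already stored, adding a reference only")
    else:
        store[fp] = content                  # only new content costs space
    references[path] = fp
    return fp

upload("/user1/report.pdf", b"same bytes")
upload("/user2/copy-of-report.pdf", b"same bytes")    # second upload is "instant"
print(len(store), "unique blobs,", len(references), "files")
```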

XI. Summary

The subject of distributed file systems is complex, the issues to consider go far beyond those mentioned above, and concrete implementations are even more involved. This article only tries to start from the problems a distributed file system must consider and give a brief analysis of the design choices. If you encounter similar scenarios in the future, you can recall that "such a solution exists" and then study it in depth.

At the same time, there are many forms of distributed file systems on the market. The following is the design comparison of several common distributed file systems by a research team.

[Figure: design comparison of several common distributed file systems]

It is clear from this comparison that there are actually many choices, and the approach in the GFS paper is not necessarily the best; different business scenarios allow for different selection strategies.

Source: https://www.jianshu.com/p/fc0aa34606ce

