
What is a file system?

The file system is a critical component of any computer: it provides consistent ways to access and manage storage devices. File systems differ from one operating system to another, but they share a few traits that have changed little for decades:

  1. Data is stored as files and accessed through APIs such as Open, Read, Write, Seek, and Close (illustrated in the sketch after this list);
  2. Files are organized in tree-structured directories, and an atomic rename operation is provided to move a file or directory to a new location.
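
To make these two points concrete, here is a short Go sketch that exercises exactly these operations through the standard library's os package. It uses a local temporary directory, but a POSIX-compatible distributed file system mounted at some path would be accessed in the same way:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

func main() {
	// A directory in the tree-structured namespace.
	dir, err := os.MkdirTemp("", "fsdemo")
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "data.txt")

	// Open (create), Write, Close: data lives in a file.
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := f.WriteString("hello, file system\n"); err != nil {
		log.Fatal(err)
	}
	f.Close()

	// Open, Seek, Read: random access inside the file.
	f, err = os.Open(path)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := f.Seek(7, 0); err != nil { // 0 == io.SeekStart
		log.Fatal(err)
	}
	buf := make([]byte, 11)
	n, _ := f.Read(buf)
	fmt.Printf("read %q\n", buf[:n])
	f.Close()

	// Atomic rename: move the file to a new location in the tree.
	if err := os.Rename(path, filepath.Join(dir, "renamed.txt")); err != nil {
		log.Fatal(err)
	}
}
```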

The access and management model offered by file systems underpins the vast majority of computer applications, and the Unix idea that "everything is a file" underlines its central position. At the same time, the complexity of file systems has kept their scalability from matching the rapid growth of the Internet, and the greatly simplified object storage model stepped in to fill the gap and grew quickly. Because object storage lacks a tree-structured namespace and atomic rename, it is quite different from a file system, so this article will not discuss it further.

Challenges of stand-alone file systems

Most file systems are stand-alone: they provide access to and management of one or more storage devices within a single operating system. With the rapid development of the Internet, stand-alone file systems face many challenges:

  • Sharing: A stand-alone file system cannot serve applications distributed across multiple machines at the same time. Hence protocols such as NFS, which export a stand-alone file system over the network so that many machines can access it simultaneously.
  • Capacity: A single machine cannot provide enough space, so data ends up scattered across multiple isolated stand-alone file systems.
  • Performance: A single machine cannot meet the very high read/write throughput some applications need, so applications have to split their data logically and read and write multiple file systems at once.
  • Reliability: Reliability is bounded by a single machine; a machine failure can mean data loss.
  • Availability: Availability is bounded by a single operating system; failures or maintenance operations such as restarts cause downtime.

As the Internet grew, these problems became more and more prominent, and a number of distributed file systems emerged to address them.

Below are the basic architectures of a few distributed file systems I am familiar with, along with a comparison of their advantages and limitations.

GlusterFS

GlusterFS is a POSIX-compatible distributed file system developed by Gluster Inc. in the United States and open sourced under the GPL. Its first public release came in 2007, and the company was acquired by Red Hat in 2011.

Its basic idea is to aggregate multiple stand-alone file systems into a single unified namespace, exposed to users through a stateless middleware layer. This middleware is implemented as a stack of translators, each solving one problem, such as data distribution, replication, striping, caching, or locking, and users can combine them flexibly for their specific scenario. For example, a typical distributed volume looks like the following figure:

[Figure: a typical GlusterFS distributed volume]

Server1 and Server2 form a two-replica Volume0, Server3 and Server4 form Volume1, and the two are combined into a distributed volume with larger capacity.
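
To make the translator idea more concrete, below is a minimal conceptual sketch in Go — not GlusterFS code, and all the names are illustrative. A "distribute" step picks one replicated subvolume by hashing the file path, and a "replicate" step then writes the same data to every server in that subvolume; in GlusterFS itself these roles are played by the distribute (DHT) and replicate (AFR) translators stacked in the volume graph.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ReplicaSet models a replicated subvolume: every write goes to all members.
type ReplicaSet struct {
	Name    string
	Servers []string
}

// Write mimics the replicate translator: the same file goes to each server.
func (r ReplicaSet) Write(path string, data []byte) {
	for _, s := range r.Servers {
		fmt.Printf("  write %s (%d bytes) -> %s\n", path, len(data), s)
	}
}

// pickSubvolume mimics the distribute translator: hash the path to choose
// which replicated subvolume owns the file.
func pickSubvolume(path string, subvols []ReplicaSet) ReplicaSet {
	h := fnv.New32a()
	h.Write([]byte(path))
	return subvols[h.Sum32()%uint32(len(subvols))]
}

func main() {
	vol0 := ReplicaSet{Name: "Volume0", Servers: []string{"Server1", "Server2"}}
	vol1 := ReplicaSet{Name: "Volume1", Servers: []string{"Server3", "Server4"}}
	distributed := []ReplicaSet{vol0, vol1}

	for _, f := range []string{"/a/one.txt", "/a/two.txt", "/b/three.txt"} {
		sub := pickSubvolume(f, distributed)
		fmt.Printf("%s -> %s\n", f, sub.Name)
		sub.Write(f, []byte("example data"))
	}
}
```

Because the mapping is a pure function of the path and the static volume layout, no metadata service is needed, which is also why rebalancing and failure recovery are hard: changing the layout changes where files are expected to live.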

Advantages:

  • Data files are ultimately stored on ordinary stand-alone file systems with the same directory structure, so even if GlusterFS itself becomes unavailable, the data is not lost.

  • There is no obvious single point of failure, and the system can scale out linearly.

  • Support for large numbers of small files should be fairly good.

Challenges:

  • The structure is relatively static and hard to adjust, and it requires all storage nodes to have the same configuration. When data or access is unbalanced, space and load cannot be rebalanced. Failure recovery is also weak: for example, if Server1 fails, there is no way to copy the files held on Server2 to the healthy Server3 or Server4 to keep the data fully redundant.

  • Because there is no separate metadata service, every storage node must hold the complete directory structure. Traversing a directory or restructuring the directory tree requires visiting all nodes to get a correct result, which limits the scalability of the whole system: a few dozen nodes are manageable, but hundreds of nodes are hard to operate effectively.

CephFS

CephFS began as Sage Weil's doctoral research, with the goal of distributed metadata management that could support EB-scale data. In 2012 Sage Weil founded InkTank to continue developing CephFS, and the company was acquired by Red Hat in 2014. It was not until 2016 that CephFS shipped a version considered stable for production use (with the metadata service still effectively limited to a single active node); distributed metadata in CephFS remains immature today.

Ceph has a layered architecture. The bottom layer is a distributed object store whose data placement is based on CRUSH (a hashing algorithm), and the upper layer provides three APIs: object storage (RADOSGW), block storage (RBD), and a file system (CephFS), as shown in the figure below:

[Figure: Ceph layered architecture — RADOSGW, RBD, and CephFS on top of the distributed object store]

Meeting the storage needs of several different scenarios (virtual machine images, massive numbers of small files, and general file storage) with a single storage system is still very attractive, but the complexity of the system demands strong operations capability. In practice, only the block storage part is relatively mature and widely used; the object storage and file system parts have been less satisfactory, and in some cases I have heard of, users gave up after running them for a while.

The architecture of CephFS is shown in the figure below:

[Figure: CephFS architecture]

CephFS metadata is served by the MDS (Metadata Server), one or more stateless metadata services that load file system metadata from the underlying OSDs and cache it in memory to speed up access. Because the MDS is stateless, standby nodes can be configured for HA, which is relatively easy. However, a standby node starts with an empty cache and must warm up again, so failover can take a long time.

Because loading metadata from, and writing it back to, the storage layer is slow, the MDS has to be multi-threaded to achieve reasonable throughput. The resulting concurrency among file system operations significantly increases complexity: deadlocks are easy to introduce, and performance degrades when IO is slow. To perform well, the MDS needs enough memory to cache most of the metadata, which in turn limits the capacity it can actually support.

When there are multiple active MDSs, a part of the directory tree (a subtree) can be dynamically assigned to one MDS and handled entirely by it, enabling horizontal scaling. With multiple active instances, however, locking mechanisms are needed to negotiate subtree ownership, and atomic rename across subtrees requires distributed transactions; all of this is very complex to implement. The latest official documentation still does not recommend running multiple active MDSs (using additional MDSs as standbys is fine).

GFS

Google's GFS is a pioneer and a representative example of distributed file systems; it grew out of the earlier BigFiles system. The paper Google published in 2003 laid out its design philosophy and details and had a great influence on the industry; many later distributed file systems were modeled on its design.

As the name suggests, BigFiles/GFS is optimized for large files and is not suitable for workloads where the average file size is below 1 MB. The structure of GFS is shown in the figure below:

[Figure: GFS architecture — a single Master plus ChunkServers storing 64 MB chunks]

GFS has a Master node that manages metadata (all of it loaded into memory, with snapshots and update logs written to disk), while files are split into 64 MB chunks stored on a number of ChunkServers (which use an ordinary stand-alone file system directly). Files are append-only, so there is no need to worry about chunk versions and consistency (the chunk length can serve as the version). Using completely different techniques for metadata and data greatly simplifies the system and gives it ample scalability (if the average file size is above 256 MB, each GB of Master memory can support roughly 1 PB of data). Giving up some POSIX features (random writes, extended attributes, hard links, and so on) further simplifies the system in exchange for better performance, robustness, and scalability.
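
As a rough back-of-the-envelope check on that figure, here is a small sketch assuming roughly 64 bytes of Master metadata per 64 MB chunk, as reported in the GFS paper (the average-file-size condition matters because the Master also keeps per-file namespace metadata, which large files amortize):

```go
package main

import "fmt"

func main() {
	const (
		bytesPerChunkMeta = 64       // ~64 B of Master metadata per chunk (GFS paper)
		chunkSize         = 64 << 20 // 64 MB per chunk
		masterMemory      = 1 << 30  // 1 GB of Master memory
	)
	chunks := masterMemory / bytesPerChunkMeta   // ~16.8 million chunks tracked
	capacity := int64(chunks) * int64(chunkSize) // ~1 PB of data addressed
	fmt.Printf("chunks: %d, capacity: %.1f PB\n", chunks, float64(capacity)/(1<<50))
}
```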

Because of GFS's maturity and stability, Google could build its upper-layer applications (MapReduce, BigTable, and so on) much more easily. Later, Google developed Colossus, a next-generation storage system with far greater scalability: it completely separates metadata from data storage, implements distributed (automatically sharded) metadata, and uses Reed-Solomon coding to reduce the storage footprint and cut costs.

HDFS

Hadoop, which came out of Yahoo, is an open source Java implementation of Google's GFS, MapReduce, and related systems. HDFS essentially copies the design of GFS, so I won't repeat it here. The following figure shows the architecture of HDFS:

[Figure: HDFS architecture]

The reliability and scalability of HDFS are very good: there are many deployments with thousands of nodes and 100 PB of data, and its performance for big data workloads holds up well. Cases of data loss are rare (mostly accidental deletions on clusters where the trash feature was not enabled).

HDFS's HA scheme was added later and turned out so complex that Facebook, one of the first to implement it, relied on manual failover for a long time (at least three years) rather than trusting automatic failover.

Because the NameNode is implemented in Java and depends on a pre-allocated heap, an undersized heap can easily trigger full GC pauses and hurt the performance of the whole system. Some teams have tried rewriting it in C++, but no mature open source alternative has appeared yet.

HDFS also lacks mature non-Java clients, which makes it inconvenient to use outside the big data ecosystem (Hadoop and similar tools), for example in deep learning scenarios.

MooseFS

MooseFS is an open source distributed POSIX file system from Poland. It also follows the GFS architecture and implements most of the POSIX semantics and API. Once mounted through its very mature FUSE client, it can be used like a local file system. The architecture of MooseFS is shown in the figure below:

[Figure: MooseFS architecture]

MooseFS supports snapshots, which makes it convenient for data backup and recovery.

MooseFS is implemented in C. The Master is a single thread driven by asynchronous events, similar to Redis. However, the networking code uses poll instead of the more efficient epoll, so at around 1,000 concurrent connections the CPU consumption becomes very high.

The open source community edition has no HA; it uses a metalogger for asynchronous cold standby. The closed source paid edition provides HA.

To support random writes, chunks in MooseFS are mutable, and a version management mechanism is used to ensure data consistency. This mechanism is fairly complex and prone to odd problems (for example, after a cluster restart, a few chunks may end up with fewer replicas than expected).

JuiceFS

GFS, HDFS, and MooseFS were all designed for the software and hardware environment of self-built data centers, where both data reliability and node availability are achieved by keeping multiple replicas across multiple machines. However, on virtual machines in a public or private cloud, the block device is already a virtual block device designed with three-replica reliability. If the distributed file system keeps another three replicas across machines on top of that, the data is effectively stored nine times, and the cost stays high.

So we adapted the architecture of HDFS and MooseFS for the public cloud and designed JuiceFS. Its architecture is shown in the following figure:

[Figure: JuiceFS architecture — FUSE client, metadata service, and object storage]

JuiceFS uses the object storage that public clouds already provide in place of DataNodes and ChunkServers, resulting in a fully elastic, serverless storage system. Public cloud object storage has already solved the safe and efficient storage of data at scale, so JuiceFS only needs to focus on metadata management, which also greatly reduces the complexity of the metadata service (the masters of GFS and MooseFS must track both metadata and the health of data blocks). We have made many improvements to the metadata part as well, and implemented high availability based on Raft from the very beginning. Even so, running a truly highly available, high-performance metadata service remains challenging to manage and operate, so metadata is offered to users as a service.

Because the POSIX file system API is the most widely used API, we implemented a highly POSIX-compatible client on top of FUSE. Users can mount JuiceFS on Linux or macOS with a command line tool and access it just like a local file system.
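
To illustrate this division of labor, here is a minimal conceptual sketch in Go — not JuiceFS code; the types and names are hypothetical. On a read, the client first asks the metadata service which objects make up the file, then fetches those objects from the object store in the user's own account:

```go
package main

import "fmt"

// MetadataService maps a file path to the object keys holding its data.
// In a real system this is a remote, highly available service.
type MetadataService struct {
	blocks map[string][]string // path -> ordered object keys
}

func (m *MetadataService) Lookup(path string) ([]string, bool) {
	keys, ok := m.blocks[path]
	return keys, ok
}

// ObjectStore stands in for S3-style cloud object storage.
type ObjectStore struct {
	objects map[string][]byte // key -> object content
}

func (o *ObjectStore) Get(key string) []byte { return o.objects[key] }

// readFile shows the read path: metadata lookup first, then object fetches.
func readFile(meta *MetadataService, store *ObjectStore, path string) ([]byte, error) {
	keys, ok := meta.Lookup(path)
	if !ok {
		return nil, fmt.Errorf("no such file: %s", path)
	}
	var data []byte
	for _, k := range keys {
		data = append(data, store.Get(k)...)
	}
	return data, nil
}

func main() {
	meta := &MetadataService{blocks: map[string][]string{
		"/jfs/hello.txt": {"chunk-0001", "chunk-0002"},
	}}
	store := &ObjectStore{objects: map[string][]byte{
		"chunk-0001": []byte("hello, "),
		"chunk-0002": []byte("object storage"),
	}}
	data, err := readFile(meta, store, "/jfs/hello.txt")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
}
```

Only metadata passes through the metadata service; the data objects themselves are stored and served by the cloud provider's object storage, which already handles durability.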

The dashed part on the right side of the figure above handles data storage and access, which involves the user's data privacy. It stays entirely within the customer's own account and network environment and never touches the metadata service. We (Juicedata) have no way to access customer data; we only see metadata, so please do not put sensitive information in file names.

Summary

This article has briefly introduced the architectures of the distributed file systems I know, arranged in the following figure in order of appearance (arrows indicate lineage, pointing from a system to the one it derives from or to its next-generation version):

[Figure: the distributed file systems discussed, in order of appearance, with arrows indicating lineage]

The systems shown in blue in the upper part of the figure are mainly used for big data scenarios and implement only a subset of POSIX, while the green ones below are POSIX-compatible file systems.

Among them, the design of separating metadata from data, represented by GFS, effectively balances system complexity, solves large-scale (mostly large-file) data storage well, and scales better. Colossus and WarmStorage, which add distributed metadata storage on top of this architecture, can scale virtually without limit.

As a latecomer, JuiceFS learned from the way MooseFS implements a distributed POSIX file system and from the idea of completely separating metadata and data seen in systems such as Facebook's WarmStorage, aiming to provide the best distributed storage experience for public and private cloud scenarios. By storing data in object storage, JuiceFS avoids the high cost of double redundancy (block storage replication plus the distributed file system's own multi-machine replication) that arises when running the distributed file systems above in the cloud. JuiceFS also supports all major public clouds, so you do not have to worry about being locked into one cloud provider and can migrate data smoothly between clouds or regions.

Finally, if you have a public cloud account, register with JuiceFS and you can mount a PB-scale file system on your virtual machine, or on your Mac, within five minutes.

Recommended reading:

How to use JuiceFS to speed up AI model training by 7 times

