FastDFS solution for massive small file storage

Author: vivo internet server team - Zhou Changqing

1. Introduction to the principle of FastDFS

FastDFS is an open source, lightweight distributed file system implemented in C.

It supports Linux, FreeBSD, AIX and other Unix systems. It addresses large-capacity file storage and highly concurrent access, balances the load of file access, and is suited to files between 4KB and 500MB. It is especially suitable for online services that use files as the carrier, such as image, video and document services.

2. FastDFS Architecture

FastDFS consists of three parts:

Client
Tracker Server (TrackerServer)
Storage Server (StorageServer)

2.1 Tracker Server

Tracker Server (tracking server) mainly does scheduling work and plays a role in load balancing.

(1) [Service registration] Manages the StorageServer cluster. When a StorageServer starts, it registers itself with the TrackerServer and periodically reports its status, including remaining disk space, file synchronization status, and statistics such as file upload and download counts.

(2) [Service discovery] Before the Client accesses a StorageServer, it must first query the TrackerServer to dynamically obtain the StorageServer's connection information. The actual data is then transferred directly between the Client and an available StorageServer.

(3) [Load Balance]

Store group allocation strategy (store_lookup in tracker.conf):

0: Round robin
1: Specify the group
2: Load balance (select the group (volume) with the largest remaining space to upload to)

Store server allocation strategy (store_server in tracker.conf):

0: Round robin
1: Sort by IP address and select the first server (the one with the smallest IP address)
2: Sort by priority (the upload priority is set by the storage server, and the parameter name is upload_priority)

Store path allocation strategy (store_path in tracker.conf):

0: Round robin, multiple directories store files in turn
2: Select the directory with the largest remaining space to store the file (note: the remaining disk space changes over time, so the directory or disk chosen may also change)
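
The three group allocation strategies above can be sketched roughly as follows. This is a minimal illustration, not FastDFS code: the Group and GroupSelector names and the free_mb field are hypothetical, and the real tracker tracks much more state per group.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str       # hypothetical field: group (volume) name
    free_mb: int    # hypothetical field: remaining space reported by the group


class GroupSelector:
    """Illustrative stand-in for the tracker's group allocation strategies."""

    def __init__(self, groups, strategy=2, fixed_group=None):
        self.groups = list(groups)
        self.strategy = strategy        # 0: round robin, 1: specified group, 2: load balance
        self.fixed_group = fixed_group  # only used by strategy 1
        self._next = 0                  # round-robin cursor

    def pick(self):
        if self.strategy == 0:
            group = self.groups[self._next % len(self.groups)]
            self._next += 1
            return group
        if self.strategy == 1:
            return next(g for g in self.groups if g.name == self.fixed_group)
        if self.strategy == 2:
            # Load balance: the group with the largest remaining space wins.
            return max(self.groups, key=lambda g: g.free_mb)
        raise ValueError(f"unknown strategy: {self.strategy}")
```

The same pattern applies to server and path selection, with the sort key swapped for IP address, priority or remaining directory space.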

2.2 Storage Server

Storage Server (storage server) mainly provides capacity and backup services.

[Group management] Storage is organized by group. Each group contains multiple Storage Servers whose data is backed up with each other, and the capacity of a group is determined by the Storage Server with the smallest capacity in the group. Organizing storage by group makes application isolation, load balancing and a customizable number of replicas easier.

Disadvantages: The group capacity is limited by the storage capacity of a single machine, and data recovery can only rely on other machines in the group to resynchronize.

[Data synchronization] Files are synchronized only between the Storage Servers in a group, using a push model: the source server pushes changes to the target server. The source server reads its binlog file, parses each record, and sends the corresponding operation command to the target server, which then executes the command.

3. Upload and download process

3.1 Analysis of upload process

3.1.1 Select Tracker Server

The trackers in the cluster are all equal, and the client can choose any tracker when uploading files.

3.1.2 Assign Group, Storage Server and storage path (disk or mount point)

When the tracker receives an upload request, it will first assign a group that can be stored to the file, and then assign a Storage Server to the client in the group. Finally, when receiving a file write request from the client, the Storage Server will assign a data storage directory and write.

(For the allocation strategy in this process, please refer to: [Load Balance])

3.1.3 Generate file_id to write and return

Storage generates a file_id to use as the file name. The file_id is base64-encoded and contains the source Storage Server IP, file creation time, file size, file CRC32 checksum and a random number. Each storage directory contains two levels of subdirectories, 256*256 in total.

Storage hashes the file_id twice to route the file to one of these subdirectories.

Finally, the file is stored in that subdirectory with file_id as its file name, and the file path is returned to the client.
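
To make the idea concrete, here is a rough sketch of building such a name and routing it to a two-level subdirectory. It is only an illustration under stated assumptions: the field order, the byte layout, the use of URL-safe base64 and the two hash functions (CRC32 and Adler-32) are all hypothetical choices and do not reproduce the actual FastDFS encoding.

```python
import base64
import os
import socket
import struct
import zlib


def make_file_name(source_ip, create_time, file_size, content, ext):
    """Pack the fields named in the text and base64-encode them (layout is assumed)."""
    ip_int = struct.unpack(">I", socket.inet_aton(source_ip))[0]
    crc = zlib.crc32(content) & 0xFFFFFFFF
    rand = int.from_bytes(os.urandom(2), "big")
    # 4-byte IP + 4-byte time + 8-byte size + 4-byte CRC32 + 2-byte random = 22 bytes
    raw = struct.pack(">IIQIH", ip_int, create_time, file_size, crc, rand)
    stem = base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")
    return f"{stem}.{ext}"


def route_subdirs(file_name):
    """Hash the name twice and map each hash to one of 256 directories (00..FF)."""
    data = file_name.encode("ascii")
    first = zlib.crc32(data) & 0xFF      # assumed hash for the first level
    second = zlib.adler32(data) & 0xFF   # assumed hash for the second level
    return f"{first:02X}", f"{second:02X}"
```

Because the metadata is embedded in the name itself, any server holding the name can recover the source IP and creation time without a lookup, which is what later makes routing by file name possible.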

Final file storage path:

group|disk|subdirectory|filename
group1/M00/00/89/eQ6h3FKJf_PRl8p4AUz4wO8tqaA688.apk

[Group]: The group assigned when the file is uploaded.
[Disk path]: The virtual path configured by the storage server, corresponding to the configuration parameter store_path. For example: M00 corresponds to store_path0, and M01 corresponds to store_path1.
[Two-level directory]: The two-level directory created by the storage server under each virtual disk path, used to store files.
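
The path layout can be taken apart mechanically. The following sketch splits a path into the parts described above. The FilePath type and parse_file_path function are hypothetical names, and reading the two characters after "M" as a hexadecimal index is an assumption.

```python
from typing import NamedTuple


class FilePath(NamedTuple):
    group: str
    store_path_index: int  # M00 -> 0 (store_path0), M01 -> 1 (store_path1)
    dir1: str
    dir2: str
    filename: str


def parse_file_path(path: str) -> FilePath:
    """Split group/disk/subdirectory/filename into its components."""
    group, disk, dir1, dir2, filename = path.split("/", 4)
    if len(disk) != 3 or not disk.startswith("M"):
        raise ValueError(f"unexpected disk path: {disk!r}")
    # Assumption: the index after 'M' is written in hexadecimal.
    return FilePath(group, int(disk[1:], 16), dir1, dir2, filename)


parsed = parse_file_path("group1/M00/00/89/eQ6h3FKJf_PRl8p4AUz4wO8tqaA688.apk")
print(parsed.group, parsed.store_path_index, parsed.dir1, parsed.dir2)
```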

3.2 Download Process Analysis

3.2.1 Parse the path and route

When the tracker receives the download request sent by the client, the tracker parses out the group, size, creation time and other information from the file name, and then selects a storage server according to the group to return.

3.2.2 Check read and return

The client establishes a connection with the Storage Server, which verifies that the file exists and then returns the file data.

Disadvantages: Files are synchronized between the Storage Servers in a group asynchronously. A newly uploaded file may not yet have been synchronized to the Storage Server being accessed, or synchronization may be delayed, so the download returns 404. Introducing fastdfs_nginx_module addresses these synchronization and delay problems.

3.3 Download architecture after the introduction of the fastdfs_nginx_module component

FastDFS Nginx Module function introduction

(1) [Anti-theft chain inspection]

Use the FastDFS nginx extension to dynamically generate tokens and set the http.conf configuration.

Enable anti-leech function

http.default_content_type = application/octet-stream
http.mime_types_filename=mime.types

Enable token anti-leech function
http.anti_steal.check_token=true

Token expiration time (in seconds)
http.anti_steal.token_ttl=900

Secret key used to generate the token
http.anti_steal.secret_key=xxx

The content returned when the token check fails
http.anti_steal.token_check_fail=/etc/fdfs/anti-steal.jpg

[Token generation algorithm]: token = md5(fileid_without_group + privKey + ts), where ts must not exceed the ttl range.

The server automatically verifies validity using the token, ts and the configured secret key. Access links take the form:
http://localhost/G1/M00/00/01/wKgBD01c15nvKU1cAABAOeCdFS466570.jpg?token=b32cd06a53dea4376e43d71cc882f9cb&ts=1297930137
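
The token formula above can be sketched with the standard library. This is a minimal illustration, assuming the three parts are concatenated as UTF-8 text and that ts is a Unix timestamp in seconds. The function names are hypothetical, and the exact byte-level concatenation used by the module may differ.

```python
import hashlib
import hmac
import time


def make_token(file_id_without_group: str, secret_key: str, ts: int) -> str:
    """token = md5(fileid_without_group + privKey + ts), as hex text."""
    data = f"{file_id_without_group}{secret_key}{ts}".encode("utf-8")
    return hashlib.md5(data).hexdigest()


def check_token(file_id_without_group, secret_key, ts, token, ttl=900, now=None):
    """Accept the token only if it matches and ts is still within the ttl."""
    now = int(time.time()) if now is None else now
    if now - ts > ttl:
        return False  # token expired
    expected = make_token(file_id_without_group, secret_key, ts)
    return hmac.compare_digest(expected, token)


# The key "xxx" mirrors the placeholder in the configuration above.
ts = int(time.time())
token = make_token("M00/00/01/wKgBD01c15nvKU1cAABAOeCdFS466570.jpg", "xxx", ts)
print(f"?token={token}&ts={ts}")
```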

(2) [File metadata analysis]

Obtain metadata information based on file_id, including: source storage IP, file path, name, size, etc.

(3) [File access routing]

Because the file_id contains the IP of the source Storage Server the file was uploaded to, when the FastDFS extension component cannot find the file on the local machine (because it has not been synchronized yet or synchronization is delayed), it obtains the file from the source server by redirect or by proxy, according to the source server IP.

redirect mode

With the configuration item response_mode = redirect, the server returns 302 with the redirect URL
http://sourcestorageip:port/filepath?redirect=1

proxy mode

With the configuration item response_mode = proxy, the source storage address is used as the proxy host, and the other parts of the request remain unchanged.
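
The routing decision described in this section can be summarized in a few lines. This is only a sketch of the decision logic: route_request and its return values are hypothetical, and the real module does this inside nginx in C.

```python
def route_request(local_has_file, source_ip, port, file_path, response_mode):
    """Decide how to answer a download request for file_path."""
    if local_has_file:
        return ("serve_local", None)
    # The file is missing locally (not yet synchronized), so fall back to
    # the source storage server whose IP is embedded in the file_id.
    source_url = f"http://{source_ip}:{port}/{file_path}"
    if response_mode == "redirect":
        return ("redirect_302", source_url + "?redirect=1")
    if response_mode == "proxy":
        return ("proxy", source_url)
    raise ValueError(f"unknown response_mode: {response_mode}")
```

Redirect mode costs the client an extra round trip and requires the source server to be reachable from the client, while proxy mode hides the source server behind the machine that received the request.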

4. Synchronization mechanism

4.1 Synchronization Rules

Synchronization occurs only between Storage Servers in the same group.

Only the source data needs to be synchronized, and the backup data does not need to be synchronized again.

When adding a new Storage Server, an existing Storage Server will synchronize all existing data (source data and backup data) to the new server.

4.2 Binlog Replication

FastDFS file synchronization adopts binlog asynchronous replication mode. Storage Server uses binlog files to record file upload, deletion and other operations, and synchronizes files according to Binlog. Only the file ID and operation are recorded in Binlog, and the file content is not recorded. The format of binlog is as follows:

Timestamp | Operation Type | Filename
1490251373 C M02/52/CB/CtAqWVjTbm2AIqTkAAACd_nIZ7M797.jpg

Operation Type (Partial):

C means source creation, c means copy creation
A means source append, a means copy append
D means source deletion, d means copy deletion
...
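
A binlog record in this format is easy to parse. The sketch below assumes the three fields are separated by single spaces, as in the sample line. BinlogRecord, parse_binlog_line and needs_sync are hypothetical names.

```python
from typing import NamedTuple


class BinlogRecord(NamedTuple):
    timestamp: int
    op: str        # e.g. C/c, A/a, D/d
    filename: str


def parse_binlog_line(line: str) -> BinlogRecord:
    """Split 'timestamp op filename' into typed fields."""
    timestamp, op, filename = line.strip().split(" ", 2)
    return BinlogRecord(int(timestamp), op, filename)


def needs_sync(record: BinlogRecord) -> bool:
    """Upper-case ops are source operations; lower-case ops are copies
    that arrived through synchronization and are not pushed again."""
    return record.op.isupper()


record = parse_binlog_line("1490251373 C M02/52/CB/CtAqWVjTbm2AIqTkAAACd_nIZ7M797.jpg")
print(record.op, needs_sync(record))
```

Distinguishing source and copy operations by letter case is what lets each server apply the rule that only source data is synchronized.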

4.3 Synchronization process

After a new Storage Server is added, other Storage Server servers in the group will start synchronization threads and initiate full and incremental synchronization operations to the newly added servers under the coordination of the tracker.

(1) After Storage C is started, it reports the group, ip, port, version number, number of storage directories, number of subdirectories, startup time, whether the synchronization of old data is completed, and the current status to the tracker.

(2) After the tracker receives the request to join Storage C, it updates the local storage list, returns it to C, and synchronizes it to A and B in a timely manner.

(3) Storage C sends a synchronization request to the tracker and, after the response, changes its state to WAIT_SYNC.

(4) During their heartbeat cycle, Storage A and B find C, which they do not yet know locally, in the new storage list synchronized to them. They start a synchronization thread and first send a synchronization request to the tracker (TRACKER_PROTO_CMD_STORAGE_SYNC_SRC_REQ). The tracker returns the synchronization source IP and the synchronization timestamp to A and B. If the source IP matches a server's own local IP, that server marks itself as the synchronization source for old data (full synchronization source). Otherwise it marks itself as an incremental synchronization source (which synchronizes only when node C is in the ACTIVE state). The tracker makes this choice, so A and B cannot both act as the full synchronization source and push old data to C at the same time.

(5) The synchronization source (assume storage A) records the synchronization information of the target machine in a file with the .mark suffix, and reports storage C's status change to SYNCING.

(6) The source reads binlog.index in the data/sync directory to get the binlog file ID, then reads the binlog file (binlog.000) line by line and parses it (see the binlog format above). It sends the data to storage C, which receives and saves it.

(7) During data synchronization, the state of storage C changes OFFLINE->ONLINE->ACTIVE. ACTIVE is the final state, indicating that storage C is providing service to the outside world.
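
The role decision in step (4) can be sketched as below. The function names and the string values are hypothetical. The rule that the full source pushes while the target is in WAIT_SYNC or SYNCING is an assumption made for the sketch, while the rule that incremental sources push only when the target is ACTIVE comes from the text.

```python
def sync_role(local_ip: str, source_ip_from_tracker: str) -> str:
    """The server chosen by the tracker does the full (old data) sync;
    every other server in the group is an incremental source."""
    return "full" if local_ip == source_ip_from_tracker else "incremental"


def should_push(role: str, target_status: str) -> bool:
    """Decide whether a source may push to the new server right now."""
    if role == "full":
        # Assumption: old data is pushed while the target is still syncing.
        return target_status in ("WAIT_SYNC", "SYNCING")
    # From the text: incremental sources wait until the target is ACTIVE.
    return target_status == "ACTIVE"


# Example: the tracker picked storage A (10.0.0.1) as the full sync source.
print(sync_role("10.0.0.1", "10.0.0.1"), sync_role("10.0.0.2", "10.0.0.1"))
```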

5. File storage

5.1 LOSF problem

Problems faced when storing lots of small files (LOSF):

The number of inodes in the local file system is limited, so the number of small files that can be stored is limited.
The directory hierarchy and the number of files per directory make file access expensive (many IO operations).
Storing, backing up and restoring small files is inefficient.

To address these problems, FastDFS provides a file merging solution. By default FastDFS creates large files of 64MB, each of which can hold many small files. The space occupied by one small file in the large file is called a slot. The minimum slot size is 256 bytes and the maximum is 16MB: a file smaller than 256 bytes still occupies 256 bytes, and files over 16MB are stored separately.
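
The slot rule can be written as a small function. This is a sketch under the assumption that sizes between the minimum and maximum are rounded up to a multiple of 256 bytes. The constant and function names are hypothetical, and any per-file header the real trunk format may add is ignored.

```python
MIN_SLOT = 256                 # minimum slot size in bytes
MAX_SLOT = 16 * 1024 * 1024    # files larger than this are stored separately


def slot_size(file_size: int):
    """Return the slot size for a file, or None if it is not merged."""
    if file_size > MAX_SLOT:
        return None                      # stored as a separate file
    if file_size <= MIN_SLOT:
        return MIN_SLOT                  # small files still occupy 256 bytes
    # Round up to the next multiple of the minimum slot size (assumed).
    return -(-file_size // MIN_SLOT) * MIN_SLOT


for size in (100, 257, 4096, MAX_SLOT + 1):
    print(size, slot_size(size))
```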

5.2 Storage method

(1) [Default storage method] Merge is not enabled, and the file_id generated by FastDFS corresponds to the file actually stored on the disk.

(2) [Merged storage method] The files corresponding to multiple file_ids are stored in one large file (trunk file). Trunk file name format: /fastdfs/data/00/000001, where the file name increases from 1. The generated file_id is longer, because 16 bytes of extra content are added to save the offset and other information.

The fields are as follows:

[file_size]: The space occupied in the large file (note that it is aligned to the minimum slot size of 256 bytes)
[mtime]: File modification time
[crc32]: The CRC32 checksum of the file content
[formatted_ext_name]: File extension
[alloc_size]: The allocated size, equal to size
[id]: Large file ID, such as 000001
[offset]: The offset of the file content in the trunk file
[size]: File size

5.3 Storage space management

(1) [Trunk Server] is selected by the tracker leader in a group of Storage Servers, and notified to all Storage Servers in the group, responsible for allocating space for all upload operations in the group.

(2) [Idle balanced tree] The trunk server builds an idle (free-space) balanced tree for each store_path, and free blocks of the same size are kept in a linked list. On each upload request, it searches the balanced tree by the size of the uploaded file for a free block whose size is greater than or close to the file size, splits off the excess as a new free block, and inserts that block back into the balanced tree. If no suitable block is found, a new trunk file is created and added to the balanced tree. The allocation process is therefore a process of maintaining the idle balanced tree.

(3) [Trunk Binlog] After the merged storage is enabled, the Trunk Server will have an additional TrunkBinlog to synchronize. The TrunkBinlog records all the free block operations allocated and reclaimed by the TrunkServer, and is synchronized by the Trunk Server to other storage servers in the same group.

The TrunkBinlog format is as follows:

timestamp | operation type | store_path_index | sub_path_high | sub_path_low | file.id | offset | size
1410750754 A 0 0 0 1 0 67108864

The meaning of each field is as follows:

[file.id]: TrunkFile file name, such as 000001
[offset]: The offset in the TrunkFile file
[size]: The size occupied, aligned to the slot size
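
The free-block allocation described in (2) of this section can be sketched as a best-fit search with splitting. A sorted Python list with bisect stands in for the balanced tree here, which is a swapped-in simplification. The FreeSpace class is hypothetical, and reclaiming blocks on delete is left out.

```python
import bisect

TRUNK_SIZE = 64 * 1024 * 1024  # default trunk file size from the text


class FreeSpace:
    """Free blocks kept sorted by size: (size, trunk_id, offset)."""

    def __init__(self):
        self.blocks = []
        self.next_trunk_id = 1  # trunk file names increase from 1

    def allocate(self, size: int):
        if size > TRUNK_SIZE:
            raise ValueError("size exceeds trunk file size")
        # Smallest free block that is at least `size` (best fit).
        i = bisect.bisect_left(self.blocks, (size, 0, 0))
        if i == len(self.blocks):
            # Nothing fits: create a new trunk file as one big free block.
            block = (TRUNK_SIZE, self.next_trunk_id, 0)
            self.next_trunk_id += 1
        else:
            block = self.blocks.pop(i)
        block_size, trunk_id, offset = block
        if block_size > size:
            # Split off the excess and put it back as a new free block.
            bisect.insort(self.blocks, (block_size - size, trunk_id, offset + size))
        return trunk_id, offset


space = FreeSpace()
print(space.allocate(256), space.allocate(512), space.blocks)
```

A real balanced tree gives the same lookup cost with cheaper inserts and removals than a list, which matters once a store path holds many free blocks.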

6. File deduplication

FastDFS cannot deduplicate files by itself, so FastDHT must be introduced to do it. FastDHT is an efficient distributed hash system for key-value pairs. It uses Berkeley DB for persistence at the bottom layer and binlog replication for synchronization. In the FastDFS deduplication scenario, the file content is hashed, and the hash is used to judge whether two files are identical.

After a file is uploaded successfully, if you check the corresponding storage path on the Storage Server, you will find that a soft link is returned. Each later upload of the same content also returns a soft link pointing to the file uploaded the first time, which ensures that only one copy of the file is saved.

(Note: FastDFS will not return the index of the original file, all returned are soft links, when all soft links are deleted, the original file will also be deleted from FastDFS).
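
The behavior described above can be modeled in memory. In this sketch a plain dictionary stands in for FastDHT, SHA-1 stands in for whatever content hash is actually used, and the link and file identifiers are invented for illustration.

```python
import hashlib


class DedupStore:
    """Toy model: one stored copy per content hash, soft links for every upload."""

    def __init__(self):
        self.index = {}   # content hash -> original file id (stand-in for FastDHT)
        self.files = {}   # original file id -> content
        self.links = {}   # soft link id -> original file id
        self._seq = 0

    def upload(self, content: bytes) -> str:
        digest = hashlib.sha1(content).hexdigest()
        if digest not in self.index:
            original = f"file-{digest[:8]}"
            self.index[digest] = original
            self.files[original] = content
        self._seq += 1
        link = f"link-{self._seq}"
        self.links[link] = self.index[digest]
        return link  # callers only ever see soft links

    def delete(self, link: str) -> None:
        original = self.links.pop(link)
        if original not in self.links.values():
            # Last soft link is gone, so the original is removed as well.
            content = self.files.pop(original)
            del self.index[hashlib.sha1(content).hexdigest()]
```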

7. Summary

FastDFS is really just a system for managing files (an application-level file system), for example uploaded files and pictures. It is not a system-level file system like the disk file systems NTFS or FAT.