
For a file system, read and write efficiency has a decisive impact on overall system performance. In this article we walk through how JuiceFS processes read and write requests, to give you a better understanding of JuiceFS's characteristics.

Writing process

To improve read and write efficiency, JuiceFS splits large files at multiple levels (see "How JuiceFS Stores Files"). When handling a write request, JuiceFS first writes the data into the client's memory buffer, where it is managed in the form of Chunks and Slices. A Chunk is a contiguous logical unit of 64 MiB, divided by offset within the file, and different Chunks are completely isolated from one another. Each Chunk is further divided into Slices according to the actual pattern of the application's write requests; when a new write is contiguous with or overlaps an existing Slice, it is applied to that Slice directly, otherwise a new Slice is created.
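
As a quick, purely illustrative sketch of the offset arithmetic described above (made-up numbers, not JuiceFS source code): a write starting at file offset 200 MiB lands in Chunk 3, 8 MiB from that Chunk's start.

    offset=$((200 * 1024 * 1024))        # write starts at 200 MiB (hypothetical)
    chunk_size=$((64 * 1024 * 1024))     # default Chunk size
    chunk_index=$((offset / chunk_size))            # -> 3
    offset_in_chunk=$((offset % chunk_size))        # -> 8 MiB into Chunk 3
    echo "chunk=$chunk_index, offset in chunk=$((offset_in_chunk / 1024 / 1024)) MiB"

Whether this write extends an existing Slice or creates a new one then depends on whether the range is contiguous with or overlaps a Slice already in that Chunk, as described above.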

A Slice is the logical unit that initiates data persistence. On flush, its data is first split into one or more consecutive Blocks (4 MiB by default) and uploaded to the object storage, with each Block corresponding to one object; the metadata is then updated once to record the new Slice information. Obviously, for sequential application writes only one continuously growing Slice is needed, with a single flush at the end; this maximizes the write performance of the object storage.

Take a simple JuiceFS benchmark as an example: in its first stage, a 1 GiB file is written sequentially with 1 MiB IOs. The flow of data through each component is shown in the following figure:

Note: The compression and encryption shown in the figure are not enabled by default. To enable them, add the --compress value or --encrypt-rsa-key value option.
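
For example, a file system with both features enabled might be created like this (a hedged sketch: redis://localhost/1, myjfs and my-private.pem are placeholders, and these options are normally passed when formatting the file system; check the syntax of your JuiceFS version):

    $ juicefs format --compress lz4 --encrypt-rsa-key my-private.pem redis://localhost/1 myjfs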

Running the juicefs stats command during the test shows the relevant information more intuitively:
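
For reference, the test and the live metrics can be reproduced with something like the following, assuming the file system is mounted at the hypothetical mount point /mnt/jfs:

    $ juicefs bench /mnt/jfs     # per the article: stage 1 writes a 1 GiB file with 1 MiB IOs
    $ juicefs stats /mnt/jfs     # run in a second terminal to watch the metrics in real time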

Stage 1 in the figure above:

  • The average IO size written to the object storage is object.put / object.put_c = 4 MiB, equal to the default Block size
  • The ratio of metadata transactions to object storage writes is roughly meta.txn : object.put_c ~= 1 : 16, corresponding to the 1 metadata modification and 16 object storage uploads needed for each Slice flush. It also shows that the amount of data written per flush is 4 MiB * 16 = 64 MiB, the default Chunk size
  • The average request size at the FUSE layer is about fuse.write / fuse.ops ~= 128 KiB, consistent with its default request size limit

Compared with sequential writes, random writes inside large files are much more complicated: there may be multiple discontinuous Slices within each Chunk, which makes it hard to reach the 4 MiB object size on the one hand and requires multiple metadata updates on the other. Moreover, when too many Slices have been written into a single Chunk, a Compaction is triggered to merge and clean them up, which further increases the load on the system. As a result, JuiceFS shows a noticeably larger performance degradation in such scenarios than with sequential writes.

Small files are usually uploaded to the object storage when the file is closed, and the corresponding IO size is generally the file size. This can also be seen in the third stage of the metrics chart above (creating 128 KiB small files):

  • The size of the object storage PUT is 128 KiB
  • The number of metadata transactions is roughly twice the PUT count, corresponding to one Create and one Write for each file

It is worth mentioning that when uploading objects smaller than one Block, JuiceFS also tries to write them into the local cache (--cache-dir, which can be memory or disk) to speed up possible later reads. As the metrics chart shows, the blockcache sees the same write bandwidth while small files are being created, and most reads hit the cache when the files are read back (stage 4), which makes small-file reads look particularly fast.
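
The cache location mentioned here is chosen at mount time; a minimal sketch (the path and the metadata URL are placeholders):

    $ juicefs mount --cache-dir /var/jfsCache redis://localhost/1 /mnt/jfs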

Since a write request only needs to be written into the client's memory buffer before returning, JuiceFS Write latency is usually very low (tens of microseconds). The actual upload to the object storage is triggered automatically by internal conditions (a single Slice getting too large, too many Slices, data sitting in the buffer for too long, etc.) or actively by the application (closing the file, calling fsync, etc.). Data in the buffer can only be released after it has been persisted, so when write concurrency is high or the object storage is not fast enough, the buffer may fill up and block writes.

Specifically, the buffer size is controlled by --buffer-size and defaults to 300 MiB; its real-time usage can be seen in the usage.buf column of the metrics chart. When usage exceeds this threshold, the JuiceFS client adds roughly 10 ms of wait time to each Write to slow writes down; when usage exceeds twice the threshold, new writes are suspended entirely until the buffer is released. Therefore, when Write latency rises and the buffer stays above the threshold for a long time, it is usually worth trying a larger --buffer-size. In addition, increasing --max-uploads (the maximum number of concurrent uploads to the object storage, 20 by default) can raise the write bandwidth to the object storage and thus speed up the release of the buffer.
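
For example, to double the buffer and raise the upload concurrency relative to the defaults mentioned above (the metadata URL and mount point are placeholders; --buffer-size is in MiB):

    $ juicefs mount --buffer-size 600 --max-uploads 40 redis://localhost/1 /mnt/jfs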

Writeback mode

When the requirements for data consistency and reliability are not high, you can also add --writeback at mount time to further improve performance. With write-back mode on, a Slice flush only needs to write the data to the local staging directory (shared with the cache) before returning, and the data is then uploaded to the object storage asynchronously by background threads. Note that JuiceFS's write-back mode is different from the commonly understood "write to memory first": data still has to be written to the local cache directory (so the exact behavior depends on the hardware backing the cache directory and the local file system). Viewed another way, the local directory acts as a caching layer in front of the object storage.

With write-back mode enabled, the size check on uploaded objects is also skipped by default, and JuiceFS aggressively tries to keep as much data as possible in the cache directory. This is especially useful in scenarios that generate a large number of intermediate files (such as software compilation).

In addition, JuiceFS v0.17 adds an --upload-delay parameter to postpone uploading data to the object storage, caching it locally in an even more aggressive way. If the application deletes the data within the delay window, it never needs to be uploaded at all, which improves performance and saves cost. Meanwhile, compared with a local disk, JuiceFS provides a back-end guarantee: when the cache directory runs out of capacity, data is still uploaded automatically, so the application never sees an error. This is very effective for workloads with temporary storage needs, such as Spark shuffle.
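
A combined example of the two options discussed in this section (the one-hour delay, metadata URL and mount point are illustrative; --upload-delay is assumed to accept a duration suffix such as s, m or h):

    $ juicefs mount --writeback --upload-delay 1h redis://localhost/1 /mnt/jfs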

Reading process

When handling read requests, JuiceFS generally reads from the object storage with 4 MiB Block alignment, which provides a degree of read-ahead. The data that is read is also written into the local cache directory for later use (which is why blockcache shows high write bandwidth in the second stage of the metrics chart). Obviously, during sequential reads the data fetched in advance is accessed by subsequent requests, the cache hit rate is very high, and the read performance of the object storage can be fully utilized. The flow of data through each component is shown in the following figure:

Note: After a read object reaches the JuiceFS client it is first decrypted and then decompressed, the reverse of the write path. If the corresponding feature is not enabled, that step is simply skipped.
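
To reproduce the sequential-read stage yourself, any large sequential reader will do; for example (the file path is a placeholder):

    $ dd if=/mnt/jfs/bigfile of=/dev/null bs=1M   # watch the blockcache and object read metrics in juicefs stats while this runs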

When doing small random reads within large files, this strategy is not very efficient: read amplification and the frequent writing and eviction of the local cache actually reduce overall resource utilization. Unfortunately, it is hard for any general-purpose caching strategy to pay off in such scenarios. Two directions are worth considering: one is to increase the total cache capacity as much as possible, so that nearly all of the required data can be cached; the other is to disable the cache entirely (--cache-size 0) and make the most of the object storage's raw read performance.
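
The latter direction corresponds to a mount like this (metadata URL and mount point are placeholders):

    $ juicefs mount --cache-size 0 redis://localhost/1 /mnt/jfs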

Reading small files is much simpler: usually the whole file is read in a single request. Since small files are cached directly when they are written, an access pattern like that of juicefs bench (reading shortly after writing) will basically always hit the local cache directory, and the performance is very impressive.

Summary

This article briefly explained how JuiceFS processes read and write requests. Because large files and small files have different characteristics, JuiceFS applies different read and write strategies depending on file size, which greatly improves overall performance and usability and better meets users' needs across different scenarios.

Recommended reading: How to use Fluid + JuiceFS in a Kubernetes cluster

If this article was helpful to you, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)

