About the Authors:
Dongdong Lv, architect of the Yunzhisheng supercomputing platform, is responsible for the architecture design and feature development of the large-scale distributed machine learning platform, as well as application-level optimization of deep learning algorithms and AI model acceleration. His research areas include high-performance computing, distributed file storage, and distributed caching.
Weiwei Zhu, a full-stack engineer at Juicedata, is responsible for developing and maintaining the JuiceFS CSI Driver and for JuiceFS development in the cloud-native field.
The Atlas team at Yunzhisheng began evaluating JuiceFS storage in early 2021 and had already accumulated plenty of Fluid experience beforehand. Recently, the Yunzhisheng and Juicedata teams jointly developed the Fluid JuiceFS acceleration engine, which lets users make better use of JuiceFS's cache management capabilities in Kubernetes environments. This article explains how to use Fluid + JuiceFS in a Kubernetes cluster.
Background introduction
Introduction to Fluid
CNCF Fluid is an open-source, Kubernetes-native distributed dataset orchestration and acceleration engine. It mainly serves data-intensive applications in cloud-native scenarios, such as big data and AI workloads. For more information, refer to the Fluid documentation.
Fluid does not accelerate and manage the whole storage system; rather, it accelerates and manages the datasets used by applications, providing a more cloud-native way to manage data. Through its cache acceleration engine, data from the underlying storage system is cached in the memory or on the disks of compute nodes. This addresses the limited data transfer bandwidth of compute-storage-separated architectures, as well as the bandwidth and IOPS limits of the underlying storage, which otherwise cause problems such as low I/O efficiency. Fluid also provides cache-aware data scheduling: the cache is exposed as a Kubernetes extended resource, so Kubernetes can take it into account in its scheduling policy when placing tasks.
Fluid has two important concepts: Dataset and Runtime
- Dataset: a dataset is a logically related set of data with the same file characteristics that will be used by the same compute engine.
- Runtime: the execution engine interface that implements dataset security, version management, and data acceleration, and defines a series of lifecycle methods.
Fluid's Runtime defines a standardized interface, so the cache runtime engine can plug in a variety of cache engines and give users more flexible choices. Users can make full use of a cache engine to accelerate the corresponding applications according to their scenarios and needs.
Introduction to JuiceFS
JuiceFS is a high-performance open-source distributed file system designed for cloud environments. It is fully compatible with the POSIX, HDFS, and S3 interfaces, and is suitable for scenarios such as big data, AI model training, Kubernetes shared storage, and massive data archiving.
When data is stored with JuiceFS, the data itself is persisted in object storage (for example, Amazon S3), while the corresponding metadata can be persisted in a variety of database engines such as Redis, MySQL, or TiKV, depending on the scenario. The JuiceFS client has data caching capabilities: when data is read through the client, it is intelligently cached in the local cache path configured by the application (memory or disk), and metadata is also cached in the local memory of the client node.
For AI model training scenarios, computations after the first epoch can fetch training data directly from the cache, which greatly improves training efficiency. JuiceFS also supports read-ahead and concurrent reads, which in AI training scenarios helps keep up the generation rate of each mini-batch by preparing data in advance. Data warm-up can pull data from the public cloud down to local nodes ahead of time, so that once GPU resources are allocated, warmed-up data is already available for computation, saving valuable GPU time.
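To make the relationship between object storage, metadata engine, and local cache more concrete, here is a minimal sketch of creating and mounting a JuiceFS file system by hand with the juicefs client, outside of Fluid. The Redis address, MinIO bucket, volume name, mount point, and cache settings below are placeholder values for illustration only:

# format a new JuiceFS volume: data goes to MinIO, metadata goes to Redis
$ juicefs format \
    --storage minio \
    --bucket http://<minio-ip>:9000/<bucket> \
    --access-key minioadmin \
    --secret-key minioadmin \
    redis://<redis-ip>:6379/1 jfs-demo

# mount it with a 40GiB local disk cache (cache-size is in MiB)
$ juicefs mount -d \
    --cache-dir /var/jfsCache \
    --cache-size 40960 \
    redis://<redis-ip>:6379/1 /mnt/jfs

In the rest of this article, the same pieces (Redis, MinIO, cache path, and cache size) are configured declaratively through Fluid instead of on the command line.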
Why use JuiceFSRuntime
As the underlying infrastructure, the Yunzhisheng Atlas supercomputing platform supports the company's model training and inference services across various AI fields. Yunzhisheng started building its industry-leading GPU/CPU heterogeneous Atlas computing platform and distributed file storage system early on, and this computing cluster provides AI workloads with high-performance computing and access to massive data storage. The Atlas team began evaluating JuiceFS storage in early 2021 and ran a series of POC tests; its data reliability and fit with our business scenarios met our needs.
In the training scenario, we make full use of the caching capabilities of the JuiceFS client to accelerate data for AI model training, but some problems have been discovered during use:
- Training Pods are mounted via hostPath, so the JuiceFS client has to be mounted on every compute node. Mounting requires an administrator, and the mount parameters are fixed and not flexible enough.
- Users cannot manage the client cache on compute nodes; the cache cannot be manually cleaned or resized.
- Cached datasets are not Kubernetes custom resources, so they cannot be taken into account by Kubernetes scheduling.
Since we had already accumulated some experience with Fluid in our production environment, we designed and developed JuiceFSRuntime together with the Juicedata team, combining Fluid's data orchestration and management capabilities with JuiceFS's caching capabilities.
What is Fluid + JuiceFS (JuiceFSRuntime)
JuiceFSRuntime is a Fluid Runtime customized for JuiceFS, in which you can specify the JuiceFS worker and FUSE images as well as the corresponding cache parameters. Like other Fluid Runtimes, it is built with a CRD. The JuiceFSRuntime controller watches JuiceFSRuntime resources and manages the cache Pods.
JuiceFSRuntime supports data affinity scheduling (nodeAffinity) to select suitable cache nodes, supports lazy startup of the FUSE Pod, and lets users access data through a POSIX interface. Currently only one mount point is supported.
Its architecture is shown in the figure above: a JuiceFSRuntime consists of FUSE Pods and worker Pods. The worker Pod mainly handles cache management, such as cleaning up the cache when the Runtime exits; the FUSE Pod is mainly responsible for setting the JuiceFS client's parameters and mounting it.
How to use JuiceFSRuntime
Let's take a look at how to use JuiceFSRuntime for cache acceleration.
Preliminary preparation
To use JuiceFSRuntime, you first need to prepare a metadata engine and object storage.
Build a metadata engine
You can easily purchase managed Redis databases in various configurations from cloud providers. For evaluation and testing, you can use Docker to quickly run a Redis instance on a server:
$ sudo docker run -d --name redis \
-v redis-data:/data \
-p 6379:6379 \
--restart unless-stopped \
redis redis-server --appendonly yes
Prepare object storage
Like Redis, object storage services are available on almost all public cloud platforms, and JuiceFS supports the object storage of nearly all mainstream platforms, so you can deploy according to your own circumstances.
Here is a MinIO instance run with Docker for evaluation and testing:
$ sudo docker run -d --name minio \
-p 9000:9000 \
-p 9900:9900 \
-v $PWD/minio-data:/data \
--restart unless-stopped \
minio/minio server /data --console-address ":9900"
The initial Access Key and Secret Key of object storage are both minioadmin.
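JuiceFS also needs a bucket in the object storage. Here is a minimal sketch of creating one with the MinIO client (mc), assuming a recent mc release is installed; the alias name local is illustrative, and the bucket name should match the <bucket> value used in the Dataset below:

# point mc at the local MinIO instance using the initial credentials
$ mc alias set local http://127.0.0.1:9000 minioadmin minioadmin
# create a bucket for JuiceFS data
$ mc mb local/<bucket>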
Download and install Fluid
Install Fluid by following the Fluid installation documentation, making sure to enable runtime.juicefs.enable in the chart's values.yaml so that the JuiceFSRuntime controller is deployed.
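For reference, a minimal sketch of the install with Helm, assuming the Fluid Helm chart has been downloaded locally as fluid.tgz (the chart path and release name here are illustrative):

$ kubectl create ns fluid-system
$ helm install fluid fluid.tgz --set runtime.juicefs.enable=true -n fluid-system

After installation, ensure that the Fluid cluster is running normally: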
kubectl get po -n fluid-system
NAME READY STATUS RESTARTS AGE
csi-nodeplugin-fluid-ctc4l 2/2 Running 0 113s
csi-nodeplugin-fluid-k7cqt 2/2 Running 0 113s
csi-nodeplugin-fluid-x9dfd 2/2 Running 0 113s
dataset-controller-57ddd56b54-9vd86 1/1 Running 0 113s
fluid-webhook-84467465f8-t65mr 1/1 Running 0 113s
juicefsruntime-controller-56df96b75f-qzq8x 1/1 Running 0 113s
Ensure that the juicefsruntime-controller, dataset-controller, and fluid-webhook Pods, as well as several csi-nodeplugin Pods, are running normally.
Create Dataset
Before using JuiceFS, you need to provide the parameters of the metadata service (such as Redis) and the object storage service (such as MinIO), and create the corresponding Secret:
# metaurl: the Redis address, where $IP is the IP of the node running Redis
# access-key / secret-key: the Access Key / Secret Key of the object storage
kubectl create secret generic jfs-secret \
  --from-literal=metaurl=redis://$IP:6379/1 \
  --from-literal=access-key=minioadmin \
  --from-literal=secret-key=minioadmin
Create Dataset yaml file
cat<<EOF >dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: jfsdemo
spec:
  mounts:
    - name: minio
      mountPoint: "juicefs:///demo"
      options:
        bucket: "<bucket>"
        storage: "minio"
      encryptOptions:
        - name: metaurl
          valueFrom:
            secretKeyRef:
              name: jfs-secret
              key: metaurl
        - name: access-key
          valueFrom:
            secretKeyRef:
              name: jfs-secret
              key: access-key
        - name: secret-key
          valueFrom:
            secretKeyRef:
              name: jfs-secret
              key: secret-key
EOF
Since JuiceFS uses a local cache, the corresponding Dataset supports only one mount, and JuiceFS has no UFS (under file system). You can specify a subdirectory to mount in mountPoint ("juicefs:///" is the root path); it will be mounted into the container as the root directory.
Create a Dataset and view the status of the Dataset
$ kubectl create -f dataset.yaml
dataset.data.fluid.io/jfsdemo created
$ kubectl get dataset jfsdemo
NAME      UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE      AGE
jfsdemo                                                                  NotBound   44s
As shown above, the value of the phase attribute in status is NotBound, which means that the Dataset resource object is not currently bound to any JuiceFSRuntime resource object. Next, we will create a JuiceFSRuntime resource object.
Create JuiceFSRuntime
Create the yaml file of JuiceFSRuntime
$ cat<<EOF >runtime.yaml
apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: jfsdemo
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /cache
        quota: 40960   # the minimum unit of quota in JuiceFS is MiB, so this is 40GiB
        low: "0.1"
EOF
Create and view JuiceFSRuntime
$ kubectl create -f runtime.yaml
juicefsruntime.data.fluid.io/jfsdemo created
$ kubectl get juicefsruntime
NAME WORKER PHASE FUSE PHASE AGE
jfsdemo Ready Ready 72s
View the status of the JuiceFS component Pods:
$ kubectl get po | grep jfs
jfsdemo-worker-mjplw 1/1 Running 0 4m2s
JuiceFSRuntime has no master component, and the FUSE component uses lazy startup: the FUSE Pod is created only when an application Pod actually uses the dataset.
Create a cache acceleration job
Create an application that needs acceleration; the Pod uses the Dataset created above by referencing the PVC with the same name:
$ cat<<EOF >sample.yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: demo
  volumes:
    - name: demo
      persistentVolumeClaim:
        claimName: jfsdemo
EOF
Create Pod
$ kubectl create -f sample.yaml
pod/demo-app created
View pod status
$ kubectl get po |grep demo
demo-app 1/1 Running 0 31s
jfsdemo-fuse-fx7np 1/1 Running 0 31s
jfsdemo-worker-mjplw 1/1 Running 0 10m
You can see that the pod has been created successfully, and the Fuse component of JuiceFS has also started successfully.
Run df in the Pod to check whether the cache directory is mounted:
$ kubectl exec -it demo-app -- df -h
Filesystem Size Used Avail Use% Mounted on
overlay 20G 14G 5.9G 71% /
tmpfs 64M 0 64M 0% /dev
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
JuiceFS:minio 1.0P 7.9M 1.0P 1% /data
You can see that the cache directory has been successfully mounted at this time.
Next, let's test the write function in the demo-app pod:
$ kubectl exec -it demo-app -- bash
[root@demo-app /]# df
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 20751360 14585944 6165416 71% /
tmpfs 65536 0 65536 0% /dev
tmpfs 3995028 0 3995028 0% /sys/fs/cgroup
JuiceFS:minio 1099511627776 8000 1099511619776 1% /data
/dev/sda2 20751360 14585944 6165416 71% /etc/hosts
shm 65536 0 65536 0% /dev/shm
tmpfs 3995028 12 3995016 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 3995028 0 3995028 0% /proc/acpi
tmpfs 3995028 0 3995028 0% /proc/scsi
tmpfs 3995028 0 3995028 0% /sys/firmware
[root@demo-app /]#
[root@demo-app /]# cd /data
[root@demo-app data]# echo "hello fluid" > hello.txt
[root@demo-app data]# cat hello.txt
hello fluid
Finally, let's take a look at the cache function: write a 1GiB file to /data in the Pod demo-app, and then cp it out twice:
$ kubectl exec -it demo-app -- bash
root@demo-app:~# dd if=/dev/zero of=/data/test.txt count=1024 bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.55431 s, 164 MB/s
root@demo-app:~# time cp /data/test.txt ./test.txt
real 0m5.014s
user 0m0.003s
sys 0m0.702s
root@demo-app:~# time cp /data/test.txt ./test.txt
real 0m0.602s
user 0m0.004s
sys 0m0.584s
From the results, the first cp took about 5s because it had to populate the cache; the second cp took only 0.6s because the cache already existed. JuiceFS's caching means that once a file has been accessed, it is cached in the local cache path, and all subsequent accesses read the data directly from the local JuiceFS cache.
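To double-check that the data really landed in the local cache directory (the /cache path configured in the tieredstore earlier), one way to peek is to look under that path inside the JuiceFS Pods; depending on how the Runtime wires the cache directory, the cached blocks show up on the FUSE Pod and/or the worker Pod (Pod names taken from the output above):

# inspect how much data has been cached under the tieredstore path
$ kubectl exec -it jfsdemo-fuse-fx7np -- du -sh /cache
$ kubectl exec -it jfsdemo-worker-mjplw -- du -sh /cache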
Follow-up planning
At present JuiceFSRuntime does not yet support many functions, and we will keep improving it, for example running the FUSE Pod in non-root mode and supporting the DataLoad data warm-up feature.
Recommended reading:
Zhihu x JuiceFS: Use JuiceFS to accelerate Flink container startup
If this article was helpful to you, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)