About the Authors:
Dongdong Lv, architect of the Yunzhisheng supercomputing platform, is responsible for the architecture design and feature development of the large-scale distributed machine learning platform, as well as application-level optimization of deep learning algorithms and AI model acceleration. His research areas include high-performance computing, distributed file storage, and distributed caching.
Weiwei Zhu, a full-stack engineer at Juicedata, is responsible for developing and maintaining the JuiceFS CSI Driver and for JuiceFS development in the cloud-native field.
The Atlas team at Yunzhisheng began evaluating JuiceFS storage in early 2021 and had already accumulated plenty of Fluid experience beforehand. Recently, the Yunzhisheng and Juicedata teams jointly developed the Fluid JuiceFS acceleration engine, which lets users make better use of JuiceFS's cache management capabilities in Kubernetes environments. This article explains how to use Fluid + JuiceFS in a Kubernetes cluster.
Background introduction
Introduction to Fluid
CNCF Fluid is an open-source, Kubernetes-native distributed dataset orchestration and acceleration engine. It mainly serves data-intensive applications in cloud-native scenarios, such as big data and AI workloads. For more information, refer to the Fluid documentation.
Fluid does not accelerate and manage the whole storage system; rather, it accelerates and manages the datasets used by applications, providing a more cloud-native way to manage data. Through its cache acceleration engine, data from the underlying storage system is cached in the memory or on the disks of compute nodes. This addresses the limited data transfer bandwidth of compute-storage-separated architectures, as well as the bandwidth and IOPS limits of the underlying storage, which otherwise cause problems such as low I/O efficiency. Fluid also provides cache-aware data scheduling: the cache is exposed as a Kubernetes extended resource, so Kubernetes can take it into account in its scheduling policy when placing tasks.
Fluid has two important concepts: Dataset and Runtime
- Dataset: a dataset is a logically related set of data with the same file characteristics that will be used by the same compute engine.
- Runtime: the execution engine interface that implements dataset security, version management, and data acceleration, and defines a series of lifecycle methods.
Fluid's Runtime defines a standardized interface, so the cache runtime engine can plug in a variety of cache engines and give users more flexible choices. Users can make full use of a cache engine to accelerate the corresponding applications according to their scenarios and needs.
Introduction to JuiceFS
JuiceFS is a high-performance open-source distributed file system designed for cloud environments. It is fully compatible with the POSIX, HDFS, and S3 interfaces, and is suitable for scenarios such as big data, AI model training, Kubernetes shared storage, and massive data archiving.
When data is stored with JuiceFS, the data itself is persisted in object storage (for example, Amazon S3), while the corresponding metadata can be persisted in a variety of database engines such as Redis, MySQL, or TiKV, depending on the scenario. The JuiceFS client has data caching capabilities: when data is read through the client, it is intelligently cached in the local cache path configured by the application (memory or disk), and metadata is also cached in the local memory of the client node.
For AI model training scenarios, computations after the first epoch can fetch training data directly from the cache, which greatly improves training efficiency. JuiceFS also supports read-ahead and concurrent reads, which in AI training scenarios helps keep up the generation rate of each mini-batch by preparing data in advance. Data warm-up can pull data from the public cloud down to local nodes ahead of time, so that once GPU resources are allocated, warmed-up data is already available for computation, saving valuable GPU time.
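To make the relationship between object storage, metadata engine, and local cache more concrete, here is a minimal sketch of creating and mounting a JuiceFS file system by hand with the juicefs client, outside of Fluid. The Redis address, MinIO bucket, volume name, mount point, and cache settings below are placeholder values for illustration only:

# format a new JuiceFS volume: data goes to MinIO, metadata goes to Redis
$ juicefs format \
    --storage minio \
    --bucket http://<minio-ip>:9000/<bucket> \
    --access-key minioadmin \
    --secret-key minioadmin \
    redis://<redis-ip>:6379/1 jfs-demo

# mount it with a 40GiB local disk cache (cache-size is in MiB)
$ juicefs mount -d \
    --cache-dir /var/jfsCache \
    --cache-size 40960 \
    redis://<redis-ip>:6379/1 /mnt/jfs

In the rest of this article, the same pieces (Redis, MinIO, cache path, and cache size) are configured declaratively through Fluid instead of on the command line.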
Why use JuiceFSRuntime
As the underlying infrastructure, the Yunzhisheng Atlas supercomputing platform supports the company's model training and inference services across various AI fields. Yunzhisheng started building its industry-leading GPU/CPU heterogeneous Atlas computing platform and distributed file storage system early on, and this computing cluster provides AI workloads with high-performance computing and access to massive data storage. The Atlas team began evaluating JuiceFS storage in early 2021 and ran a series of POC tests; its data reliability and fit with our business scenarios met our needs.
In the training scenario, we make full use of the caching capabilities of the JuiceFS client to accelerate data for AI model training, but some problems have been discovered during use:
- Training Pods are mounted via hostPath, so the JuiceFS client has to be mounted on every compute node. Mounting requires an administrator, and the mount parameters are fixed and not flexible enough.
- Users cannot manage the client cache on compute nodes; the cache cannot be manually cleaned or resized.
- Cached datasets are not Kubernetes custom resources, so they cannot be taken into account by Kubernetes scheduling.
Since we had already accumulated some experience with Fluid in our production environment, we designed and developed JuiceFSRuntime together with the Juicedata team, combining Fluid's data orchestration and management capabilities with JuiceFS's caching capabilities.
What is Fluid + JuiceFS (JuiceFSRuntime)
JuiceFSRuntime is a Fluid Runtime customized for JuiceFS, in which you can specify the JuiceFS worker and FUSE images as well as the corresponding cache parameters. Like other Fluid Runtimes, it is built with a CRD. The JuiceFSRuntime controller watches JuiceFSRuntime resources and manages the cache Pods.
JuiceFSRuntime supports data affinity scheduling (nodeAffinity) to select suitable cache nodes, supports lazy startup of the FUSE Pod, and lets users access data through a POSIX interface. Currently only one mount point is supported.
Its architecture is shown in the figure above: a JuiceFSRuntime consists of FUSE Pods and worker Pods. The worker Pod mainly handles cache management, such as cleaning up the cache when the Runtime exits; the FUSE Pod is mainly responsible for setting the JuiceFS client's parameters and mounting it.
How to use JuiceFSRuntime
Let's take a look at how to use JuiceFSRuntime for cache acceleration.
Preliminary preparation
To use JuiceFSRuntime, you first need to prepare a metadata engine and object storage.
Build a metadata engine
You can easily purchase managed Redis databases in various configurations from cloud providers. For evaluation and testing, you can use Docker to quickly run a Redis instance on a server:
$ sudo docker run -d --name redis \
-v redis-data:/data \
-p 6379:6379 \
--restart unless-stopped \
redis redis-server --appendonly yes
Prepare object storage
Like Redis, object storage services are available on almost all public cloud platforms, and JuiceFS supports the object storage of nearly all mainstream platforms, so you can deploy according to your own circumstances.
Here is a MinIO instance run with Docker for evaluation and testing:
$ sudo docker run -d --name minio \
-p 9000:9000 \
-p 9900:9900 \
-v $PWD/minio-data:/data \
--restart unless-stopped \
minio/minio server /data --console-address ":9900"
The initial Access Key and Secret Key of object storage are both minioadmin.
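JuiceFS also needs a bucket in the object storage. Here is a minimal sketch of creating one with the MinIO client (mc), assuming a recent mc release is installed; the alias name local is illustrative, and the bucket name should match the <bucket> value used in the Dataset below:

# point mc at the local MinIO instance using the initial credentials
$ mc alias set local http://127.0.0.1:9000 minioadmin minioadmin
# create a bucket for JuiceFS data
$ mc mb local/<bucket>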
Download and install Fluid
Install Fluid by following the Fluid installation documentation, making sure to enable runtime.juicefs.enable in the chart's values.yaml so that the JuiceFSRuntime controller is deployed.
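For reference, a minimal sketch of the install with Helm, assuming the Fluid Helm chart has been downloaded locally as fluid.tgz (the chart path and release name here are illustrative):

$ kubectl create ns fluid-system
$ helm install fluid fluid.tgz --set runtime.juicefs.enable=true -n fluid-system

After installation, ensure that the Fluid cluster is running normally: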
kubectl get po -n fluid-system
NAME READY STATUS RESTARTS AGE
csi-nodeplugin-fluid-ctc4l 2/2 Running 0 113s
csi-nodeplugin-fluid-k7cqt 2/2 Running 0 113s
csi-nodeplugin-fluid-x9dfd 2/2 Running 0 113s
dataset-controller-57ddd56b54-9vd86 1/1 Running 0 113s
fluid-webhook-84467465f8-t65mr 1/1 Running 0 113s
juicefsruntime-controller-56df96b75f-qzq8x 1/1 Running 0 113s
Ensure that the juicefsruntime-controller, dataset-controller, and fluid-webhook Pods, as well as several csi-nodeplugin Pods, are running normally.
Create Dataset
Before using JuiceFS, you need to provide the parameters of the metadata service (such as Redis) and the object storage service (such as MinIO), and create the corresponding Secret:
# metaurl: the Redis address, where $IP is the IP of the node running Redis
# access-key / secret-key: the Access Key / Secret Key of the object storage
kubectl create secret generic jfs-secret \
  --from-literal=metaurl=redis://$IP:6379/1 \
  --from-literal=access-key=minioadmin \
  --from-literal=secret-key=minioadmin
Create Dataset yaml file
cat<<EOF >dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: jfsdemo
spec:
  mounts:
    - name: minio
      mountPoint: "juicefs:///demo"
      options:
        bucket: "<bucket>"
        storage: "minio"
      encryptOptions:
        - name: metaurl
          valueFrom:
            secretKeyRef:
              name: jfs-secret
              key: metaurl
        - name: access-key
          valueFrom:
            secretKeyRef:
              name: jfs-secret
              key: access-key
        - name: secret-key
          valueFrom:
            secretKeyRef:
              name: jfs-secret
              key: secret-key
EOF
Since JuiceFS uses a local cache, the corresponding Dataset supports only one mount, and JuiceFS has no UFS (under file system). You can specify a subdirectory to mount in mountPoint ("juicefs:///" is the root path); it will be mounted into the container as the root directory.
Create a Dataset and view the status of the Dataset
$ kubectl create -f dataset.yaml
dataset.data.fluid.io/jfsdemo created
$ kubectl get dataset jfsdemo
NAME      UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE      AGE
jfsdemo                                                                  NotBound   44s
As shown above, the value of the phase attribute in status is NotBound, which means that the Dataset resource object is not currently bound to any JuiceFSRuntime resource object. Next, we will create a JuiceFSRuntime resource object.
Create JuiceFSRuntime
Create the yaml file of JuiceFSRuntime
$ cat<<EOF >runtime.yaml
apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: jfsdemo
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /cache
        quota: 40960   # the minimum unit of quota in JuiceFS is MiB, so this is 40GiB
        low: "0.1"
EOF
Create and view JuiceFSRuntime
$ kubectl create -f runtime.yaml
juicefsruntime.data.fluid.io/jfsdemo created
$ kubectl get juicefsruntime
NAME WORKER PHASE FUSE PHASE AGE
jfsdemo Ready Ready 72s
View the status of the JuiceFS component Pods:
$ kubectl get po | grep jfs
jfsdemo-worker-mjplw 1/1 Running 0 4m2s
JuiceFSRuntime has no master component, and the FUSE component uses lazy startup: the FUSE Pod is created only when an application Pod actually uses the dataset.
Create a cache acceleration job
Create an application that needs acceleration; the Pod uses the Dataset created above by referencing the PVC with the same name:
$ cat<<EOF >sample.yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: demo
  volumes:
    - name: demo
      persistentVolumeClaim:
        claimName: jfsdemo
EOF
Create Pod
$ kubectl create -f sample.yaml
pod/demo-app created
View pod status
$ kubectl get po |grep demo
demo-app 1/1 Running 0 31s
jfsdemo-fuse-fx7np 1/1 Running 0 31s
jfsdemo-worker-mjplw 1/1 Running 0 10m
You can see that the pod has been created successfully, and the Fuse component of JuiceFS has also started successfully.
Run df in the Pod to check whether the cache directory is mounted:
$ kubectl exec -it demo-app -- df -h
Filesystem Size Used Avail Use% Mounted on
overlay 20G 14G 5.9G 71% /
tmpfs 64M 0 64M 0% /dev
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
JuiceFS:minio 1.0P 7.9M 1.0P 1% /data
You can see that the cache directory has been successfully mounted at this time.
Next, let's test the write function in the demo-app pod:
$ kubectl exec -it demo-app -- bash
[root@demo-app /]# df
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 20751360 14585944 6165416 71% /
tmpfs 65536 0 65536 0% /dev
tmpfs 3995028 0 3995028 0% /sys/fs/cgroup
JuiceFS:minio 1099511627776 8000 1099511619776 1% /data
/dev/sda2 20751360 14585944 6165416 71% /etc/hosts
shm 65536 0 65536 0% /dev/shm
tmpfs 3995028 12 3995016 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 3995028 0 3995028 0% /proc/acpi
tmpfs 3995028 0 3995028 0% /proc/scsi
tmpfs 3995028 0 3995028 0% /sys/firmware
[root@demo-app /]#
[root@demo-app /]# cd /data
[root@demo-app data]# echo "hello fluid" > hello.txt
[root@demo-app data]# cat hello.txt
hello fluid
Finally, let's take a look at the cache function: write a 1GiB file to /data in the Pod demo-app, and then cp it out twice:
$ kubectl exec -it demo-app -- bash
root@demo-app:~# dd if=/dev/zero of=/data/test.txt count=1024 bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.55431 s, 164 MB/s
root@demo-app:~# time cp /data/test.txt ./test.txt
real 0m5.014s
user 0m0.003s
sys 0m0.702s
root@demo-app:~# time cp /data/test.txt ./test.txt
real 0m0.602s
user 0m0.004s
sys 0m0.584s
From the results, the first cp took about 5s because it had to populate the cache; the second cp took only 0.6s because the cache already existed. JuiceFS's caching means that once a file has been accessed, it is cached in the local cache path, and all subsequent accesses read the data directly from the local JuiceFS cache.
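To double-check that the data really landed in the local cache directory (the /cache path configured in the tieredstore earlier), one way to peek is to look under that path inside the JuiceFS Pods; depending on how the Runtime wires the cache directory, the cached blocks show up on the FUSE Pod and/or the worker Pod (Pod names taken from the output above):

# inspect how much data has been cached under the tieredstore path
$ kubectl exec -it jfsdemo-fuse-fx7np -- du -sh /cache
$ kubectl exec -it jfsdemo-worker-mjplw -- du -sh /cache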
Follow-up planning
At present JuiceFSRuntime does not yet support many functions, and we will keep improving it, for example running the FUSE Pod in non-root mode and supporting the DataLoad data warm-up feature.
Recommended reading:
Zhihu x JuiceFS: Use JuiceFS to accelerate Flink container startup
If this article was helpful to you, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)