The author of this article, Hu Mengyu, is a big data architecture development engineer at Zhihu, mainly responsible for the secondary development of Zhihu's internal big data components and for building the data platform.

Background

Thanks to its reliability and ease of use, Flink has become one of the most popular stream processing frameworks and occupies a dominant position in stream computing. Zhihu introduced Flink as early as 2018, and it has since grown into one of our most important internal components, accumulating more than 4,000 real-time Flink tasks that process petabytes of data every day.

There are many ways to deploy Flink. Classified by resource scheduler, they can be roughly divided into standalone, Flink on YARN, Flink on Kubernetes, and so on. The deployment method Zhihu currently uses internally is Flink's official native Kubernetes mode. Once Kubernetes is involved, the question of container images comes up, and because Flink tasks vary so widely, how to build Flink images is itself a headache.

Flink images and dependency handling

Flink tasks can be roughly divided into two categories. The first is Flink SQL tasks, whose dependencies are roughly as follows:

  1. Official connector JAR packages, such as flink-hive-connector, flink-jdbc-connector, flink-kafka-connector, etc.;
  2. Unofficial or internally implemented connector JAR packages;
  3. The user's UDF JAR packages; for some complex computation logic, users may implement their own UDFs.

The second category is Flink JAR tasks. In addition to the three dependencies above, they also rely on the Flink JAR package written by the user.
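To make these dependency categories concrete, here is a hypothetical minimal sketch of a Flink SQL task, written in PyFlink for brevity (our production tasks are not necessarily written this way). It uses the Kafka connector (category 1) and registers a user-defined function (category 3); the table names, topic, and broker address are made up, and the Kafka connector JAR is assumed to be on the classpath.

```python
from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.udf import udf

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# The user's own logic -- in the Java world this would ship as a separate UDF JAR.
@udf(result_type=DataTypes.STRING())
def normalize(action: str) -> str:
    return action.strip().lower() if action else action

t_env.create_temporary_function("normalize", normalize)

# 'connector' = 'kafka' is what requires the flink-kafka-connector JAR at runtime.
t_env.execute_sql("""
    CREATE TABLE events (
        user_id STRING,
        action  STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'events',
        'properties.bootstrap.servers' = 'kafka:9092',
        'properties.group.id' = 'demo',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# A print sink keeps the example self-contained.
t_env.execute_sql("""
    CREATE TABLE action_counts (
        action STRING,
        cnt    BIGINT
    ) WITH ('connector' = 'print')
""")

t_env.execute_sql("""
    INSERT INTO action_counts
    SELECT normalize(action), COUNT(*) FROM events GROUP BY normalize(action)
""").wait()
```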

Obviously, the dependencies of each Flink task are different, and it is impossible for us to build a separate image for every task. Our current approach is as follows:

  1. Classify dependencies into stable and unstable ones;
  2. Stable dependencies include the components themselves (such as Flink and the JDK) and official connector packages. These are very stable and only change in two cases: Flink version upgrades and bug fixes. We therefore bake them into the image at build time;
  3. Unstable dependencies include third-party connectors and the user's own JAR packages. Third-party connectors are not officially maintained by Flink, so the probability that they need a fix is relatively higher; the user's own JAR package differs from task to task and is frequently changed and resubmitted. We inject such unstable dependencies dynamically: the dependencies are stored in a distributed file system, and when the container starts, a pre command downloads them into the container, as sketched below.
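The pre command itself can stay very simple. The following is a minimal sketch of the idea in Python; the FLINK_DEP_URLS environment variable and the /opt/flink/usrlib target directory are illustrative assumptions, not our actual implementation.

```python
import os
import urllib.request

# Hypothetical: the platform passes the task's unstable dependencies as a
# comma-separated list of URLs, and Flink loads user JARs from this directory.
DEP_DIR = "/opt/flink/usrlib"
dep_urls = os.environ.get("FLINK_DEP_URLS", "").split(",")

os.makedirs(DEP_DIR, exist_ok=True)
for url in filter(None, dep_urls):
    target = os.path.join(DEP_DIR, os.path.basename(url))
    print(f"downloading {url} -> {target}")
    # Single-threaded download, executed before the Flink process starts.
    urllib.request.urlretrieve(url, target)
```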

After this processing, the Flink image gains a certain ability to load dependencies dynamically. The startup process of a Flink job is roughly as follows:

File system selection

Pain points of storing dependencies on HDFS

We had always used HDFS as the file system for storing Flink's dependencies, but we ran into the following pain points:

  1. The NameNode comes under heavy pressure during peak task periods. When a container downloads dependencies, it gets stuck requesting file metadata from the NameNode. For some small batch tasks, the job itself may only need to run for ten-odd seconds, yet downloading the dependencies can take several minutes because the NameNode is overloaded;
  2. We currently deploy Flink clusters in multiple data centers, but HDFS has only one large cluster in the offline data center. Files therefore have to be pulled across data centers, consuming dedicated-line bandwidth;
  3. Some special Flink tasks do not rely on HDFS at all: they neither use checkpoints nor read or write HDFS. Yet because the Flink container pulls its dependencies from HDFS, these tasks still cannot get away from HDFS.

Pain points of using object storage

Later, we replaced HDFS with object storage, which solved some of HDFS's pain points, but we soon discovered a new problem: single-threaded download from object storage is slow. There are generally the following options for accelerating object storage downloads:

  1. Use multi-threaded, segmented (ranged) download (see the sketch after this list). However, the container's pre command is really only suitable for executing relatively simple shell commands; adopting segmented download would require a fairly large modification to this part, which is a considerable pain point;
  2. Add a proxy layer in front of the object storage for caching. The acceleration is handled by the proxy, and the client can keep reading in a single thread. The downside is that we would have to maintain an additional proxy component for object storage and guarantee its stability as well.
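For reference, here is a rough sketch of what option 1 would look like: a multi-threaded ranged download against an S3-compatible endpoint. The endpoint, bucket, key, and chunk size are hypothetical, and getting this right inside a pre command is exactly the extra complexity mentioned above.

```python
import concurrent.futures

import boto3

# Credentials are assumed to come from the environment; endpoint/bucket/key are made up.
s3 = boto3.client("s3", endpoint_url="http://object-store:9000")
BUCKET, KEY = "flink-deps", "jobs/example/udf.jar"
CHUNK = 8 * 1024 * 1024  # 8 MiB per range request

# Split the object into byte ranges based on its total size.
size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
ranges = [(start, min(start + CHUNK, size) - 1) for start in range(0, size, CHUNK)]

def fetch(rng):
    start, end = rng
    body = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")["Body"]
    return start, body.read()

# Download the ranges in parallel and write each chunk at its offset.
with open("/opt/flink/usrlib/udf.jar", "wb") as f, \
        concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for start, data in pool.map(fetch, ranges):
        f.seek(start)
        f.write(data)
```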

Trying JuiceFS

It just so happened that the company was running an internal POC of JuiceFS, so a ready-made object storage proxy layer was available. We ran a series of tests on it and found that JuiceFS fully meets the needs of our scenario. The following points pleasantly surprised us:

  1. JuiceFS comes with an S3 gateway that is fully compatible with the S3 object storage protocol, which allowed us to go online quickly without any changes; the S3 gateway itself is stateless, which makes it very convenient to scale;
  2. JuiceFS has a built-in cache acceleration capability. Our tests showed that after proxying object storage through JuiceFS, single-threaded file reads are 4 times faster than before;
  3. JuiceFS can be mounted as a local file system, so later we can try mounting it directly into the container directory;
  4. JuiceFS lets metadata and data be deployed separately. For data we keep the original object storage, whose durability the cloud vendor guarantees at eleven nines; for metadata we chose a distributed KV store, TiKV, because our colleagues in the online architecture group have rich experience developing and operating TiKV, so the SLA can be well guaranteed. Set up this way, JuiceFS has strong availability and scalability.

Bringing JuiceFS online

The rollout of JuiceFS was divided into the following stages:

  1. Data migration: we needed to synchronize the data originally stored on HDFS and in object storage to JuiceFS. Because JuiceFS provides data synchronization tools and Flink's dependencies are not particularly large, we finished this part of the work quickly;
  2. Change the address the Flink image pulls dependencies from. Because JuiceFS is compatible with the object storage protocol, on the platform side we only had to change the original object storage endpoint to the address of the JuiceFS S3 gateway, as sketched below.
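A minimal sketch of step 2, assuming the dependencies are fetched with an S3 client such as boto3; the endpoint, bucket, key, and credentials below are placeholders. The point is that only the endpoint changes, and nothing else in the dependency-download path has to be touched.

```python
import boto3

s3 = boto3.client(
    "s3",
    # Was the raw object-storage endpoint; now points at the JuiceFS S3 gateway.
    endpoint_url="http://juicefs-s3-gateway:9000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# The same download call as before -- the client code is otherwise unchanged.
s3.download_file("flink-deps", "jobs/example/udf.jar", "/opt/flink/usrlib/udf.jar")
```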

After JuiceFS went online, the flow of starting a Flink task is roughly as follows:

Compared with HDFS, we now get predictable container startup times, and the speed of downloading dependencies into the container is no longer affected by business peaks; compared with raw object storage, dependency download speed is about 4 times higher.

Outlook

It took less than half a month from the start of our investigation of JuiceFS to going live. The main reason was that JuiceFS's documentation is very complete, which saved us many detours; in addition, the JuiceFS community answered our questions promptly, so the rollout went very smoothly.

The benefits of this initial trial of JuiceFS are quite obvious. In the future, we will consider applying JuiceFS to data lake scenarios and algorithm model loading scenarios, to make our data usage more flexible and efficient.

Recommended reading
JuiceFS CSI Driver Best Practice

Project address: GitHub ( https://github.com/juicedata/juicefs ). If it helps you, welcome to follow us! (0ᴗ0✿)

