
About the Authors

  • Wang Zhenhua, Big Data Director at Qutoutiao.
  • Wang Haisheng, big data engineer at Qutoutiao, with 10 years of experience in the Internet industry. He has worked on big data development at eBay, Vipshop, and other companies, and has extensive experience putting big data systems into production.
  • Gao Changjian, solution architect at Juicedata, with ten years of experience in the Internet industry. He has served as an architect on multiple teams at Zhihu, Jike, and Xiaohongshu, focusing on distributed systems, big data, and AI.

Background

The Qutoutiao big data platform currently runs an HDFS cluster of nearly 1,000 nodes, which stores the hot data of the past few months; roughly 100 TB of new data is added every day. Both daily ETL jobs and ad-hoc queries depend on this cluster, so its load keeps climbing. Ad-hoc queries are a particular problem: Qutoutiao's business model requires frequent queries against the latest data, and the large number of ad-hoc requests every day further increases the pressure on the HDFS cluster and degrades ad-hoc query performance, with a pronounced long-tail effect. The sustained high load also affects the stability of many business components, for example Flink checkpoint failures and Spark executor losses.

Therefore, a solution is needed that lets ad-hoc queries rely on the HDFS cluster as little as possible. On the one hand this reduces the overall pressure on the HDFS cluster and keeps daily ETL jobs stable; on the other hand it reduces the fluctuation in ad-hoc query latency and mitigates the long-tail effect.

Design

Qutoutiao's ad-hoc queries mainly rely on the Presto compute engine. The JuiceFS Hadoop SDK integrates into Presto seamlessly, without any code changes. In a way that is transparent to the business, it analyzes each query and automatically copies frequently read data from HDFS to JuiceFS; subsequent ad-hoc queries can then read the cached data directly from JuiceFS instead of requesting HDFS, which reduces the pressure on the HDFS cluster.
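As a rough illustration of what the integration looks like on the Hadoop side, below is a minimal Java sketch, assuming the JuiceFS Hadoop SDK jar is on the classpath. The property and class names (fs.jfs.impl, juicefs.meta, and so on) follow the public JuiceFS Hadoop SDK documentation, while the metadata address, volume name, and paths are placeholders; in a real Presto deployment these properties would normally live in the core-site.xml referenced by the Hive catalog rather than be set in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class JuiceFsSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the JuiceFS Hadoop SDK for the jfs:// scheme
        // (class and property names follow the public JuiceFS Hadoop SDK docs).
        conf.set("fs.jfs.impl", "io.juicefs.JuiceFileSystem");
        conf.set("fs.AbstractFileSystem.jfs.impl", "io.juicefs.JuiceFS");
        // Metadata service address -- placeholder (community-edition style URL).
        conf.set("juicefs.meta", "redis://meta-host:6379/1");
        // Local disk cache used on each Presto worker.
        conf.set("juicefs.cache-dir", "/data/jfs-cache");
        conf.set("juicefs.cache-size", "102400"); // MiB

        // Simple smoke test: read a file through the jfs:// scheme.
        FileSystem fs = FileSystem.get(URI.create("jfs://myjfs/"), conf);
        try (FSDataInputStream in = fs.open(new Path("jfs://myjfs/tmp/hello.txt"))) {
            byte[] buf = new byte[128];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, Math.max(n, 0)));
        }
    }
}
```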

In addition, because the Presto cluster is deployed on Kubernetes and needs to scale elastically, the cached data has to be persisted. Running a separate HDFS cluster or another caching system for this would be costly, which makes OSS the most suitable choice.

The overall design is shown in the figure below. The green parts represent JuiceFS components, which consist of two pieces: the JuiceFS metadata service ("JuiceFS Cluster" in the figure) and the JuiceFS Hadoop SDK (the component attached to each Presto worker in the figure).

The JuiceFS metadata service manages the metadata of all files in the file system, such as file names, the directory structure, file sizes, and modification times. It is a distributed cluster based on the Raft consensus protocol, which guarantees strong consistency of the metadata while keeping the cluster highly available.

The JuiceFS Hadoop SDK (hereinafter "the SDK") is a client library that integrates seamlessly with all components of the Hadoop ecosystem; in this solution it is integrated into the Presto workers. The SDK supports several usage modes: it can replace HDFS entirely and make JuiceFS the underlying storage of the big data platform, or it can act as a cache layer in front of HDFS. This solution uses the latter mode. The SDK can transparently cache HDFS data into JuiceFS without any changes to the Hive Metastore, and an ad-hoc query that hits the cache no longer needs to request HDFS.

The SDK also guarantees consistency between HDFS and JuiceFS: when data in HDFS changes, the cached copy on JuiceFS is updated accordingly, with no impact on the business. This is achieved by comparing the modification time (mtime) of files in HDFS and in JuiceFS. Because JuiceFS implements full file system semantics, its files carry an mtime attribute, and comparing mtimes is enough to keep the cached data consistent.
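The consistency check can be pictured with the sketch below. This is not the SDK's internal code, just a minimal Java illustration of the mtime comparison described above, written against the standard Hadoop FileSystem API; the NameNode address, volume name, and paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

import java.net.URI;

/**
 * Illustration of mtime-based cache validation: serve the JuiceFS copy only
 * if its mtime (and size) match the HDFS original, otherwise refresh the copy.
 */
public class MtimeCacheCheck {
    public static Path resolveReadPath(Configuration conf, Path hdfsPath, Path jfsPath) throws Exception {
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        FileSystem jfs = FileSystem.get(URI.create("jfs://myjfs/"), conf);

        FileStatus src = hdfs.getFileStatus(hdfsPath);
        if (jfs.exists(jfsPath)) {
            FileStatus cached = jfs.getFileStatus(jfsPath);
            // Matching mtime and size -> the cached copy is still consistent.
            if (cached.getModificationTime() == src.getModificationTime()
                    && cached.getLen() == src.getLen()) {
                return jfsPath;
            }
        }
        // Cache miss or stale copy: re-copy from HDFS, then record the
        // original mtime on the cached file so the next check succeeds.
        FileUtil.copy(hdfs, hdfsPath, jfs, jfsPath, false /* deleteSource */, true /* overwrite */, conf);
        jfs.setTimes(jfsPath, src.getModificationTime(), -1 /* keep atime */);
        return jfsPath;
    }
}
```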

To prevent the cache from taking up too much space, the cached data needs to be cleaned up periodically. JuiceFS supports removing data that has not been accessed for N days, based on the file access time (atime); atime is used so that frequently accessed data is not deleted by mistake. Note that many file systems do not update atime in real time for performance reasons; in HDFS, for example, dfs.namenode.accesstime.precision controls this, and by default atime is updated at most once per hour. There are also rules governing when the cache is populated: based on a file's atime, mtime, and size, the SDK decides whether it is worth caching, to avoid caching unnecessary data.
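The cleanup policy can be illustrated with a similar sketch. Again, this is not JuiceFS's own implementation (JuiceFS performs the cleanup itself); it only shows the atime-based rule, with a hypothetical cache root and volume name.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

import java.net.URI;

/**
 * Sketch of an atime-based cleanup policy: delete cached files whose last
 * access time is older than N days.
 */
public class AtimeCleanup {
    public static void cleanup(Configuration conf, Path cacheRoot, int days) throws Exception {
        FileSystem jfs = FileSystem.get(URI.create("jfs://myjfs/"), conf);
        long cutoff = System.currentTimeMillis() - days * 24L * 3600L * 1000L;

        RemoteIterator<LocatedFileStatus> it = jfs.listFiles(cacheRoot, true /* recursive */);
        while (it.hasNext()) {
            LocatedFileStatus f = it.next();
            // Caveat: atime may lag behind real accesses; HDFS, for example,
            // only updates it once per dfs.namenode.accesstime.precision
            // (one hour by default).
            if (f.getAccessTime() > 0 && f.getAccessTime() < cutoff) {
                jfs.delete(f.getPath(), false /* files only, non-recursive */);
            }
        }
    }
}
```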

Test Plan

To verify the overall effect of the solution, including stability, performance, and HDFS cluster load, we split the test into several stages. Each stage collected and verified different metrics, and where possible the data from different stages was compared horizontally.

Test Results

HDFS cluster load

The test was split into a stage with JuiceFS enabled and a stage with it disabled. For the enabled stage we randomly selected 10 HDFS DataNodes and computed the average daily disk read I/O throughput of each DataNode, which came to about 3.5 TB. For the disabled stage we measured the same 10 nodes, and the average was about 4.8 TB. Using JuiceFS therefore reduces the load on the HDFS cluster by roughly 26%, as shown in the figure below.
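With the rounded per-node averages quoted above, the reduction works out roughly as

\[ \frac{4.8\,\text{TB} - 3.5\,\text{TB}}{4.8\,\text{TB}} \approx 27\% \]

the reported figure of about 26% presumably comes from the unrounded daily values.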

The reduction in HDFS cluster load can also be seen from another angle. During these two stages we measured the total read and write I/O on JuiceFS. JuiceFS read I/O represents I/O that the HDFS cluster no longer has to serve: without JuiceFS, those requests would go directly to HDFS. JuiceFS write I/O represents the amount of data copied from HDFS, and those requests do add pressure on HDFS. Ideally, the total read I/O should be as large as possible and the total write I/O as small as possible. The figure below shows the read and write totals over several days: read I/O is consistently more than 10 times write I/O, which means the JuiceFS cache hit rate exceeds 90%, i.e. more than 90% of ad-hoc queries do not need to touch HDFS.
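The hit-rate claim follows from the I/O totals: treating write I/O (data copied from HDFS, i.e. cache misses) as misses and read I/O as hits, a rough calculation is

\[ \text{hit rate} \approx \frac{R}{R + W} \ge \frac{10\,W}{10\,W + W} \approx 91\% > 90\% \]

where R is the total JuiceFS read I/O and W the total write I/O.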

Average query time

During one stage, query traffic was split 50/50 between a cluster integrated with JuiceFS and one without, and the average query time was computed for each. As the figure below shows, using JuiceFS reduces the average query time by about 13%.

Test summary

In a completely transparent way, without any change to the business configuration, the JuiceFS solution greatly reduces the load on the HDFS cluster: more than 90% of Presto queries no longer need to request HDFS, and the average Presto query time drops by 13%, exceeding the initial test targets. The long-standing stability problems of the big data components have also been resolved.

It is also worth noting that the whole testing process went very smoothly. JuiceFS completed functional and performance verification in the test environment in just a few days and soon moved on to a grayscale test in the production environment. In production, JuiceFS has been running very stably, withstanding the pressure of full production traffic, and the few problems encountered along the way were fixed quickly.

Future outlook

Looking ahead, there are more things worth trying and optimizing:

  • Further improve the hit rate of the JuiceFS cache to reduce the load on the HDFS cluster.
  • Increase the local cache disk space on the Presto workers to improve the local cache hit rate and further mitigate the long-tail problem.
  • Connect the Spark cluster to JuiceFS to cover more ad-hoc query scenarios.
  • Smoothly migrate HDFS to JuiceFS, fully separating storage and compute, to reduce operation and maintenance costs and improve resource utilization.

Project address: GitHub (https://github.com/juicedata/juicefs)
If JuiceFS helps you, please give it a star (0ᴗ0✿) to encourage us!

