Dasouche has built a relatively complete Internet collaborative ecosystem for the automotive industry. This ecosystem not only covers 90% of the country's large and medium-sized used-car dealers, the 9,000+ 4S stores and the 70,000+ secondary new-car dealer networks that Dasouche has digitized, but also includes Dasouche subsidiaries with strong industry-chain service capabilities such as Cheyipai, Chexing168, Car Butler and Brexit; OEMs such as Great Wall Motors, Changan Automobile and Infiniti that have reached in-depth strategic cooperation with Dasouche on new retail solutions; and upstream and downstream partners such as CNPC Kunlun Hospitality. Based on this ecological layout, Dasouche has digitized every link in the automobile circulation chain, thereby empowering the entire industry.
Big data is no stranger to any company today: the storage component HDFS, the resource manager YARN, offline computing engines such as Hive, Spark and Spark SQL, the columnar database HBase, and real-time computing engines such as Spark Streaming and Flink. These components are relatively easy to maintain when the cluster is stable, but when a company is growing fast, rapid growth of cluster capacity is inevitable. As the designer of a big data platform, you have to weigh the cluster's cost against its benefit and make trade-offs.
Current state of the big data clusters
Dasouche's big data clusters are currently divided into an offline computing cluster and a real-time computing cluster. Offline computing is based on Hive and Spark, and real-time computing is based on Flink. The two clusters are managed with two different distributions, HDP and CDH: HDP was chosen for offline computing in the early days, and CDH was later chosen for real-time computing for the convenience of multi-cluster management. Since the offline computing engines differ between the two distributions, migration would run into compatibility issues, so the two clusters have always coexisted, with resources completely isolated between them.
Cluster maintenance pain points
- Data keeps growing, and expanding the cluster is costly, time-consuming and labor-intensive
From the beginning of 2018 to June 2019, the offline cluster kept growing from the initial dozens of nodes to hundreds of nodes, the data volume grew from tens of TiB by more than tenfold, and new data kept arriving at the TiB level every day. To keep expenses down, the cluster was expanded roughly once a month, a constant race against the data growth rate. The fixed monthly routine almost became: absorb a barrage of disk alerts, expand capacity, balance data, and rebalance again. Some extreme situations were even more complicated, for example when Alibaba Cloud had no instances of the required type left in an availability zone and new nodes had to be created in another zone, which also involved changing network segments.
- Storage and computing resources do not grow in step
Analysis of the offline cluster showed that hot data accounted for only about 20% of the total (one way to estimate such a ratio is sketched after this list). As the cluster keeps being expanded for storage, computing resources become relatively redundant, incurring unnecessary cost. In addition, every data-balancing run eats into node network bandwidth and slows down the reads and writes of running tasks.
- Cross-cluster data synchronization
To reduce interference between real-time and offline tasks, to make resource control easier, and to get the most value out of the chosen cloud resources, the real-time and offline clusters are physically isolated from each other. This brings its own difficulty: data cannot be synchronized between the two clusters in real time, so some requirements cannot be met.
- NameNode memory keeps growing and restarts take too long
On the file storage side, the excessive number of files keeps driving up the memory managed by the NameNode, and restarts take so long that data synchronization is inevitably affected. Moreover, the data life cycle is not strictly controlled at the data warehouse level, so resource usage keeps growing, which also hurts any analysis of overall resource usage in the cluster.
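As a side note on the hot-data figure above, the following is one possible way to estimate such a ratio by scanning file modification times. It is only a sketch: it assumes pyarrow with libhdfs is available, and the NameNode address, the warehouse path and the 30-day "hot" window are placeholder assumptions, not Dasouche's actual setup.

```python
# Illustrative sketch: estimate the share of "hot" data by modification time.
# NameNode address, warehouse path and the 30-day window are placeholders.
import time

from pyarrow import fs

hdfs = fs.HadoopFileSystem("namenode-host", 8020)  # placeholder NameNode
infos = hdfs.get_file_info(fs.FileSelector("/user/hive/warehouse", recursive=True))

cutoff_ns = time.time_ns() - 30 * 24 * 3600 * 10**9  # files touched in the last 30 days
total_bytes = hot_bytes = 0
for info in infos:
    if info.type != fs.FileType.File:
        continue
    total_bytes += info.size
    if info.mtime_ns and info.mtime_ns >= cutoff_ns:
        hot_bytes += info.size

print(f"hot data ratio: {hot_bytes / total_bytes:.1%}")
```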
Choosing JuiceFS
Given the problems above, choosing a new product as the underlying storage became imperative. As the cornerstone of big data, the storage has to meet the following requirements:
- Compatible with Hadoop framework protocol
- Compatible with clusters of multiple versions
- High throughput, low latency
- Support deep compression to reduce resource usage
At first we tried Alibaba Cloud OSS as cold-backup storage, but during testing we found that, without metadata management, data maintenance was very limited. Later we came across JuiceFS, which satisfies all of the requirements above, so we ran some performance tests against it (all based on business logic extracted from real scenarios).
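For the first requirement, JuiceFS exposes itself to Hive and Spark through a Hadoop-compatible Java SDK. Below is a minimal sketch of wiring it into a Spark job; the property names follow the open-source JuiceFS Hadoop SDK documentation and may differ by version, while the metadata URL, volume name and database location are placeholders (the juicefs-hadoop jar also has to be on the classpath).

```python
# Minimal sketch: point Spark at JuiceFS via its Hadoop-compatible SDK.
# The metadata URL, volume name and database location are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("juicefs-poc")
    .enableHiveSupport()
    .config("spark.hadoop.fs.jfs.impl", "io.juicefs.JuiceFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.jfs.impl", "io.juicefs.JuiceFS")
    .config("spark.hadoop.juicefs.meta", "redis://meta-host:6379/1")  # placeholder metadata engine
    .getOrCreate()
)

# Tables created under a jfs:// location live on JuiceFS instead of HDFS.
spark.sql("CREATE DATABASE IF NOT EXISTS poc LOCATION 'jfs://myjfs/warehouse/poc.db'")
```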
Real-world performance test
All of the tests below use real business data; the data size is varied through different WHERE conditions, and only the performance of the two file systems is compared (a sketch of the test harness follows the list):
- SELECT + INSERT operation
Select data of different magnitudes from a table of about 30 million rows and insert it into another table with the same structure, then compare the elapsed time on HDFS and JuiceFS side by side.
- SELECT + COUNT operation
Select and count data of different magnitudes from a table of about 30 million rows, then compare the elapsed time on HDFS and JuiceFS side by side.
- SELECT + ORDER BY operation
Sort the data in a table of about 30 million rows, then compare the elapsed time on HDFS and JuiceFS side by side.
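The harness below sketches how such side-by-side comparisons can be scripted. The database, table and column names (poc.src_hdfs, poc.src_jfs, dt, created_at) are hypothetical stand-ins for identically structured tables of about 30 million rows stored on HDFS and JuiceFS respectively.

```python
# Illustrative benchmark harness; table and column names are hypothetical.
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-vs-jfs").enableHiveSupport().getOrCreate()

QUERIES = {
    "select_insert":  "INSERT INTO {t}_copy SELECT * FROM {t} WHERE dt >= '{dt}'",
    "select_count":   "SELECT COUNT(*) FROM {t} WHERE dt >= '{dt}'",
    "select_orderby": "SELECT * FROM {t} WHERE dt >= '{dt}' ORDER BY created_at",
}

def run(table: str, dt: str) -> None:
    for name, template in QUERIES.items():
        start = time.time()
        df = spark.sql(template.format(t=table, dt=dt))
        df.foreach(lambda _: None)  # force full execution without pulling rows to the driver
        print(f"{table} | {name} | dt >= {dt}: {time.time() - start:.1f}s")

# Vary the WHERE condition to test different data sizes on both file systems.
for dt in ("2019-06-01", "2019-04-01", "2019-01-01"):
    run("poc.src_hdfs", dt)
    run("poc.src_jfs", dt)
```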
To sum up, for queries that insert data, JuiceFS's elapsed time is mostly stable and lower than HDFS overall; for plain SELECT queries the two perform similarly in most cases, with JuiceFS faster in a few; and for the sorting operations the performance is comparable.
Cost Control
We compared the cost of running on JuiceFS versus HDFS (the HDFS cluster reserves 20% of its storage capacity as headroom). With the same amount of data (JuiceFS additionally applies deep compression, with a compression ratio of roughly 3:1) and equivalent computing resources, using JuiceFS saves at least 18% per month compared with deploying HDFS on cloud hosts.
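To make the storage-side arithmetic behind that comparison concrete, a small sketch follows. The data volume is a placeholder, not a real figure, and the equivalent computing resources on both sides are left out of the calculation.

```python
# Illustrative only: the data volume is a placeholder, not a real figure.
RAW_TIB = 300                         # logical data volume in TiB (placeholder)

hdfs_provisioned = RAW_TIB * 3 * 1.2  # 3x replication plus ~20% free-space headroom on cloud disks
jfs_stored = RAW_TIB / 3              # one logical copy on object storage after ~3:1 deep compression

print(f"HDFS disk to provision:        {hdfs_provisioned:.0f} TiB")
print(f"JuiceFS object storage billed: {jfs_stored:.0f} TiB")
```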
Taken together, the performance and cost of JuiceFS satisfy the company's requirements.
Future outlook
Separation of storage and compute
With JuiceFS introduced into the big data clusters, storage and compute are effectively separated, and it becomes possible to scale the clusters' computing resources flexibly. In the early-morning hours when business traffic is low, the computing resources of business machines can even be scheduled into the big data cluster.
The current overall big data cluster architecture is shown below:
In the future, the target architecture below can combine compute-storage separation with a dynamic-scaling design:
Integrate with Kubernetes and apply for resources on demand, saving money and reducing maintenance overhead.
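As a rough sketch of what "apply for resources on demand" could look like, the snippet below runs a Spark batch job against a Kubernetes master with dynamic allocation enabled. This is not a description of Dasouche's actual setup: the API server address, namespace and container image are placeholders.

```python
# Rough sketch: Spark on Kubernetes with dynamic allocation (Spark 3.x).
# API server, namespace and image are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("on-demand-batch")
    .master("k8s://https://kube-apiserver:6443")                         # placeholder API server
    .config("spark.kubernetes.namespace", "bigdata")                     # placeholder namespace
    .config("spark.kubernetes.container.image", "registry.example.com/spark:3.1")
    .config("spark.executor.instances", "4")                             # starting point; grows and shrinks with load
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")   # needed on K8s, which has no external shuffle service
    .getOrCreate()
)
```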
Recommended reading:
JuiceFS CSI Driver Best Practices
Project address: GitHub (https://github.com/juicedata/juicefs). If this article helps you, please follow us (0ᴗ0✿)