Introduction: this article takes DLA (Data Lake Analytics) as an example. DLA is committed to helping customers build a low-cost, easy-to-use, and flexible data platform that saves at least 50% of the cost of traditional Hadoop. DLA Meta provides a unified view over 15+ data sources on the cloud (OSS, HDFS, DB, DW), introduces multi-tenancy and metadata discovery, pursues zero marginal cost, and is offered for free. DLA Lakehouse is implemented based on Apache Hudi; its main goal is to provide an efficient lakehouse and to support incremental writing of CDC and message data, and productization in this area is being accelerated. DLA Serverless Presto is developed based on Apache PrestoDB and is mainly used for federated interactive queries and lightweight ETL.
Background
The data lake is currently a popular solution both in China and abroad. Market research from MarketsandMarkets (https://www.marketsandmarkets.com/Market-Reports/data-lakes-market-213787749.html) estimates that the data lake market will grow from US$7.9 billion in 2019 to US$20.1 billion in 2024. Some companies have already built their own cloud-native data lake solutions to effectively solve business pain points; many more are building or planning to build their own data lakes. A report released by Gartner in 2020 (https://www.gartner.com/smarterwithgartner/the-best-ways-to-organize-your-data-structures/) shows that 39% of users are already using a data lake and 34% plan to adopt one within a year. With the maturity of cloud-native storage technologies such as object storage, people tend to first store structured, semi-structured, image, and video data in object storage; when they need to analyze this data, they choose, for example, Hadoop or Alibaba Cloud's cloud-native Data Lake Analytics service DLA for data processing. Compared with deployments on HDFS, object storage has certain disadvantages for analysis performance, and the industry has done extensive exploration and practice to address them.
1. Challenges of analysis based on object storage
1. What is a data lake
According to Wikipedia, a data lake is a system or repository that stores data in its natural/raw format, usually as object blobs or files. It holds copies of the raw data produced by source systems as well as transformed data generated for various tasks, including structured data from relational databases (rows and columns), semi-structured data (such as CSV, logs, XML, JSON), and unstructured data (such as emails, documents, PDFs, images, audio, and video).
From the above, it can be concluded that the data lake has the following characteristics:
- Data sources: raw data and transformed data
- Data types: structured data, semi-structured data, unstructured data, binary data
- Data lake storage: a scalable service for storing massive amounts of data
2. Data lake analysis solution architecture
It mainly includes five modules:
- Data sources: the raw data storage module, including structured data (databases, etc.), semi-structured data (files, logs, etc.), and unstructured data (audio and video, etc.);
- Data integration: to bring data into the data lake for unified storage and management, data integration currently takes three main forms: external table association, ETL, and asynchronous metadata construction;
- Data lake storage: data lake storage in the industry currently includes object storage and self-built HDFS. With the evolution of cloud native, object storage has been greatly optimized in terms of scalability, cost, and freedom from operation and maintenance, so customers now increasingly choose cloud-native object storage as the data lake storage base instead of self-built HDFS;
- Metadata management: acts as the bus connecting data integration, storage, and the analysis engines;
- Data analysis engines: there is currently a rich set of analysis engines, such as Spark, Hadoop, and Presto.
3. Challenges faced by analysis on object storage
To ensure high scalability, object storage, unlike HDFS, uses a flat namespace for metadata management: no directory structure is maintained, so the metadata service can scale horizontally and does not have the single-point bottleneck of HDFS's NameNode. At the same time, compared with HDFS, object storage is free of operation and maintenance, is stored and read on demand, and enables a complete storage-compute separation architecture. But for analysis and computation it also brings some problems:
- Slow listing: why is listing objects by directory on object storage so much slower than on HDFS?
- Too many requests: why do object storage request fees during analysis even exceed the computation cost?
- Slow rename: why do Spark and Hadoop jobs that write data get stuck in the commit phase?
- Slow reads: why is analyzing 1 TB of data so much slower than on a self-built HDFS cluster?
- ...
4. The current status of object storage analysis optimization in the industry
The above are typical problems that people encounter when building data lake analysis solutions based on object storage. To solve these problems, we need to understand how the architecture of object storage differs from traditional HDFS and make targeted optimizations. At present, the industry has done a lot of exploration and practice:
- JuiceFS: maintains an independent metadata service and uses object storage as the storage medium. The independent metadata service provides efficient file management semantics such as list and rename. However, an additional service has to be deployed, and all reads of object storage during analysis depend on this service;
- Hadoop: because Hadoop and Spark write data through the two-phase OutputCommitter commit protocol, in OutputCommitter V1 both commitTask and commitJob perform a rename, i.e. two renames in total. A rename on object storage copies the object, which is very costly, so OutputCommitter V2 was proposed: that algorithm only renames once, but dirty data is left behind if the commitJob process is interrupted (see the configuration sketch after this list);
- Alluxio: deploys an independent cache service that caches remote object storage files locally, so analysis and computation read local data and are accelerated;
- Hudi: the currently emerging Hudi, Delta Lake, and Iceberg store the file metadata of a dataset independently to avoid list operations, while providing database-like ACID guarantees and read-write isolation;
- Alibaba Cloud cloud-native Data Lake Analytics service DLA: the DLA service has made many optimizations for reading and writing the object storage OSS, including rename optimization, InputStream optimization, and Data Cache.
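As a concrete illustration of the OutputCommitter V1/V2 trade-off mentioned above, the minimal sketch below shows how the algorithm version is selected through the standard Hadoop configuration property; the property name is the real Hadoop one, everything else is illustrative:

```java
import org.apache.hadoop.conf.Configuration;

public class CommitterVersionExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // V1: commitTask renames task output into a job-level temporary directory,
        //     and commitJob renames it again into the final path (two renames).
        // V2: commitTask renames task output directly into the final path, so commitJob
        //     is cheap, but an interrupted job may leave dirty files behind.
        conf.setInt("mapreduce.fileoutputcommitter.algorithm.version", 2);
        System.out.println("algorithm version = "
                + conf.getInt("mapreduce.fileoutputcommitter.algorithm.version", 1));
    }
}
```

In Spark, the same property can be passed as `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version`.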
2. DLA architecture optimization for the object storage OSS
Because object storage has the above problems in analysis scenarios, DLA built a unified DLA FS layer to solve the problems of slow metadata access, rename, and reads on object storage. DLA FS supports ETL reads and writes from DLA Serverless Spark, interactive data queries from DLA Serverless Presto, and efficient reads of the data that Lakehouse builds into the warehouse. The architecture optimization for the object storage OSS is divided into four layers:
- Data lake storage OSS: stores structured, semi-structured, and unstructured data, as well as the HUDI format produced when DLA Lakehouse builds the warehouse in the lake;
- DLA FS: uniformly solves the analysis optimization problems for the object storage OSS, including rename optimization, Read Buffer, Data Cache, and File List optimization;
- Analysis workloads: DLA Serverless Spark mainly reads data from OSS, processes it, and writes it back to OSS, while DLA Serverless Presto mainly performs interactive queries on the data on OSS;
- Business scenarios: the DLA dual engines, Spark and Presto, can support multiple business scenarios.
3. Analysis of DLA FS optimization techniques for the object storage OSS
The following introduces the main DLA FS optimization techniques for the object storage OSS:
1. Rename optimization
The OutputCommitter interface is used in the Hadoop ecosystem to ensure data consistency in the writing process. Its principle is similar to the two-phase commit protocol.
Open-source Hadoop provides a Hadoop FileSystem implementation to read and write OSS files. The OutputCommitter implementation it uses by default is FileOutputCommitter. To ensure data consistency and prevent users from seeing intermediate results, tasks first write their output to a temporary working directory; only when all tasks have confirmed their output does the driver rename the temporary working directory to the final data path.
Unlike HDFS, where rename is only a metadata operation on the NameNode, rename on OSS is a copy & delete operation and is therefore expensive. If the DLA analysis engines kept using the open-source Hadoop FileOutputCommitter, performance would be very poor. To solve this problem, we decided to use the OSS Multipart Upload feature in DLA FS to optimize write performance.
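To make the cost of rename concrete, here is a minimal sketch, using the Aliyun OSS Java SDK, of what a "rename" on OSS boils down to; the endpoint, credentials, bucket, and object keys are placeholders for illustration:

```java
import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;

public class OssRenameSketch {
    public static void main(String[] args) {
        OSS ossClient = new OSSClientBuilder().build(
                "https://oss-cn-hangzhou.aliyuncs.com", "<accessKeyId>", "<accessKeySecret>");
        String bucket = "my-bucket";
        try {
            // OSS has no native rename: "renaming" an object means copying the whole object
            // to the new key and then deleting the source key. For large analytics output,
            // this copy is far more expensive than HDFS's NameNode-only metadata update.
            ossClient.copyObject(bucket, "tmp/part-00000", bucket, "final/part-00000");
            ossClient.deleteObject(bucket, "tmp/part-00000");
        } finally {
            ossClient.shutdown();
        }
    }
}
```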
3.1 DLA FS supports Multipart Upload mode to write OSS objects
Alibaba Cloud OSS supports Multipart Upload. The principle is to split a file into multiple pieces of data, upload them concurrently, and then, at a time of the user's choosing, call the Multipart Upload completion interface to merge these pieces into the original file; this improves the throughput of writing files to OSS. Since Multipart Upload controls when a file becomes visible to users, we can use it to replace the rename operation and optimize the write performance of DLA FS in OutputCommitter scenarios.
The overall algorithm flow of the OutputCommitter implemented on top of Multipart Upload is as follows:
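The original figure is not reproduced here; instead, the following minimal sketch, based on the Aliyun OSS Java SDK, illustrates the idea of writing directly to the final path with Multipart Upload and only making the object visible at commit time. It is a simplified illustration, not DLA FS's actual implementation; the bucket, key, and data are placeholders:

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;
import com.aliyun.oss.model.CompleteMultipartUploadRequest;
import com.aliyun.oss.model.InitiateMultipartUploadRequest;
import com.aliyun.oss.model.PartETag;
import com.aliyun.oss.model.UploadPartRequest;

public class MultipartCommitSketch {
    public static void main(String[] args) throws Exception {
        OSS ossClient = new OSSClientBuilder().build(
                "https://oss-cn-hangzhou.aliyuncs.com", "<accessKeyId>", "<accessKeySecret>");
        String bucket = "my-bucket";
        String key = "warehouse/orders/part-00000";   // final path, written to directly
        try {
            // 1. setupTask: initiate a multipart upload for the final object path.
            //    The object is NOT visible to readers yet.
            String uploadId = ossClient.initiateMultipartUpload(
                    new InitiateMultipartUploadRequest(bucket, key)).getUploadId();

            // 2. The task writes its data as parts (a single tiny part here for brevity).
            byte[] data = "hello, lakehouse".getBytes("UTF-8");
            UploadPartRequest part = new UploadPartRequest();
            part.setBucketName(bucket);
            part.setKey(key);
            part.setUploadId(uploadId);
            part.setPartNumber(1);
            part.setPartSize(data.length);
            part.setInputStream(new ByteArrayInputStream(data));
            List<PartETag> etags = new ArrayList<>();
            etags.add(ossClient.uploadPart(part).getPartETag());

            // 3. commitJob: the driver calls completeMultipartUpload, which makes the object
            //    visible at its final path -- no rename (copy & delete) is needed. On abort,
            //    abortMultipartUpload(bucket, key, uploadId) would discard the parts instead.
            ossClient.completeMultipartUpload(
                    new CompleteMultipartUploadRequest(bucket, key, uploadId, etags));
        } finally {
            ossClient.shutdown();
        }
    }
}
```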
Using OSS Multipart Upload has the following advantages:
- Writing files no longer requires multiple copies. As can be seen above, the originally expensive rename operation is no longer needed, and writing files does not require copy & delete. In addition, compared with rename, OSS's completeMultipartUpload interface is a very lightweight operation.
- Data inconsistency is less likely. When multiple files are written at once, completeMultipartUpload is still not an atomic operation, but compared with the original rename, which copies data, its time window is much shorter and the probability of data inconsistency is much smaller, which satisfies most scenarios.
- Fewer file metadata operations. According to our statistics, the metadata operations per file can be reduced from 13 to 6 in Algorithm 1, and from 8 to 4 in Algorithm 2.
The interfaces that control user visibility in OSS Multipart Upload are completeMultipartUpload and abortMultipartUpload. Their semantics are similar to commit/abort, but the standard Hadoop FileSystem interface does not provide commit/abort semantics.
To solve this problem, we introduce the Semi-Transaction layer in DLA FS.
3.2 DLA FS introduces Semi-Transaction layer
As mentioned earlier, OutputCommitter is similar to a two-phase commit protocol, so we can abstract this process as a distributed transaction: the Driver opens a global transaction, each Executor opens its own local transaction, and when the Driver receives the completion information of all local transactions, it commits the global transaction.
Based on this abstraction, we introduced a Semi-Transaction layer (we did not implement all transaction semantics), which defines interfaces such as Transaction. Under this abstraction, we encapsulate the consistency guarantee mechanism adapted to the OSS Multipart Upload feature. We also implemented OSSTransactionalOutputCommitter, which implements the OutputCommitter interface; upper-layer computing engines such as Spark interact with the Semi-Transaction layer of DLA FS through it. The structure is as follows:
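The structure figure is not reproduced here; the interface-level sketch below outlines what the Semi-Transaction abstractions might look like. The method names are illustrative only, since the actual DLA FS interface definitions are not public:

```java
import java.io.Serializable;

// Illustrative Semi-Transaction abstractions (not the real DLA FS code).
interface Transaction extends Serializable {
    void commit() throws Exception;
    void abort() throws Exception;
}

/** Opened once by the Driver; owns a hidden working directory on OSS for job metadata. */
interface GlobalTransaction extends Transaction {
    /** Serialized to Executors so each of them can open its own LocalTransaction. */
    LocalTransaction newLocalTransaction(String taskAttemptId);
}

/** Opened per task on an Executor; tracks the files the task writes. */
interface LocalTransaction extends Transaction {
    /** Record the metadata (target path, uploadId, part ETags) of a file being written. */
    void trackFile(String path, String uploadId);
}
```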
The following uses DLA Serverless Spark to illustrate the general flow of DLA FS's OSSTransactionalOutputCommitter (a simplified sketch follows the list):
- setupJob. The Driver opens a GlobalTransaction. When the GlobalTransaction is initialized, a hidden working directory belonging to this GlobalTransaction is created on OSS to store the file metadata of this job.
- setupTask. The Executor generates a LocalTransaction from the GlobalTransaction serialized by the Driver, and starts monitoring the completion status of its files.
- The Executor writes files. The files are tracked by the LocalTransaction, and their metadata is stored in local RocksDB. Since remote calls to OSS are time-consuming, we keep the metadata in local RocksDB and defer the remote calls to the subsequent commit, reducing their cost.
- commitTask. When the Executor calls the LocalTransaction's commit operation, the LocalTransaction uploads the metadata of this task to the corresponding working directory on OSS and stops monitoring file completion status.
- commitJob. The Driver calls the GlobalTransaction's commit operation. The global transaction reads the list of files to be committed from all the metadata in the working directory and calls the OSS completeMultipartUpload interface to make all the files visible to users.
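The five steps above can be condensed into the sketch below. It collapses the Driver and Executor roles into one method for readability and replaces the local RocksDB metadata store with an in-memory map; it only illustrates the flow, not the real implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CommitterFlowSketch {
    public static void main(String[] args) {
        // setupJob: the Driver "opens" a GlobalTransaction, i.e. creates a hidden working
        // directory on OSS (e.g. a job-scoped prefix) that will hold the job's file metadata.
        String jobWorkingDir = ".dla_transaction_job-0001/";

        // setupTask + write: each Executor tracks the files its task writes.
        // pendingFiles maps the final object key to the multipart uploadId of that file.
        Map<String, String> pendingFiles = new HashMap<>();
        pendingFiles.put("warehouse/orders/part-00000", "uploadId-aaa");
        pendingFiles.put("warehouse/orders/part-00001", "uploadId-bbb");

        // commitTask: the Executor uploads its metadata (the map above) as a small manifest
        // file under jobWorkingDir and stops tracking the task's files.
        List<String> committedTaskManifests = new ArrayList<>();
        committedTaskManifests.add(jobWorkingDir + "task-0000.manifest");

        // commitJob: the Driver reads every manifest under jobWorkingDir and calls
        // completeMultipartUpload for each pending file, making them all visible to users.
        for (Map.Entry<String, String> file : pendingFiles.entrySet()) {
            System.out.printf("completeMultipartUpload(key=%s, uploadId=%s)%n",
                    file.getKey(), file.getValue());
        }
    }
}
```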
Introducing the Semi-Transaction layer in DLA FS has two advantages:
- It does not depend on the interface of any computing engine, so it can easily be ported to other computing engines later; through adaptation, Presto or other engines can use the implementation it provides.
- More implementations can be added under the Transaction semantics. For example, for partition-merging scenarios, MVCC capabilities can be added so that data can be merged without affecting online use of the data.
2. InputStream optimization
Users reported that OSS request fees were high, sometimes even exceeding the DLA fee (OSS request fee = number of requests × unit price per 10,000 requests ÷ 10,000). Investigation found that this was because the open-source OSSFileSystem performs read-ahead in units of 512 KB while reading data. For example, if a user sequentially reads a 1 MB file, two OSS calls are issued: the first request reads the first 512 KB, and the second reads the following 512 KB. This implementation causes more requests when reading large files. In addition, because the pre-read data is cached in memory, reading many files at the same time also puts pressure on memory.
Therefore, in the DLA FS implementation we removed the read-ahead operation. When the user calls Hadoop's read, the bottom layer requests OSS for the whole range from the current position to the end of the file, then reads the data the user needs from the stream returned by OSS and returns it. In this way, if the user reads sequentially, the next read call naturally reads from the same stream without issuing a new call; even a large file read sequentially requires only one call to OSS.
In addition, for small seek operations, the DLA FS implementation reads the data to be skipped from the stream and discards it, so no new call is needed; only a large jump closes the current stream and issues a new call (because read-and-discard over a large jump would increase seek latency). This implementation ensures that DLA FS's optimization also reduces the number of calls for file formats such as ORC and Parquet.
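A minimal sketch of this read path follows. It is not DLA FS's actual class; the skip threshold and the class structure are assumptions, and only the Aliyun OSS SDK calls (a ranged GetObjectRequest, getObject, getObjectContent) are real APIs:

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

import com.aliyun.oss.OSS;
import com.aliyun.oss.model.GetObjectRequest;

public class LazyOssInputStream {
    private static final long SEEK_SKIP_THRESHOLD = 1 << 20; // 1 MB, an assumed tuning value

    private final OSS oss;
    private final String bucket;
    private final String key;
    private final long contentLength;
    private InputStream wrapped;   // underlying stream returned by OSS
    private long pos;              // next byte the caller wants to read
    private long streamPos;        // position the underlying stream currently points at

    public LazyOssInputStream(OSS oss, String bucket, String key, long contentLength) {
        this.oss = oss;
        this.bucket = bucket;
        this.key = key;
        this.contentLength = contentLength;
    }

    public void seek(long newPos) {
        this.pos = newPos;  // the real work happens lazily in read()
    }

    public int read(byte[] buf, int off, int len) throws IOException {
        if (pos >= contentLength) {
            return -1;
        }
        if (wrapped == null || pos < streamPos || pos - streamPos > SEEK_SKIP_THRESHOLD) {
            // Backward seek or a large forward jump: close the current stream and open a new
            // ranged request from the current position to the end of the object (one OSS call).
            if (wrapped != null) {
                wrapped.close();
            }
            GetObjectRequest req = new GetObjectRequest(bucket, key);
            req.setRange(pos, contentLength - 1);
            wrapped = oss.getObject(req).getObjectContent();
            streamPos = pos;
        } else if (pos > streamPos) {
            // Small forward seek: read and discard the skipped bytes from the open stream.
            long toSkip = pos - streamPos;
            while (toSkip > 0) {
                long skipped = wrapped.skip(toSkip);
                if (skipped <= 0) {
                    throw new EOFException("unexpected end of stream while skipping");
                }
                toSkip -= skipped;
            }
            streamPos = pos;
        }
        int n = wrapped.read(buf, off, len);
        if (n > 0) {
            pos += n;
            streamPos += n;
        }
        return n;
    }
}
```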
3. Data Cache acceleration
With the storage-compute separation architecture of the object storage OSS, reading data from remote storage over the network is still a costly operation and often leads to performance loss. Cloud-native data lake analysis DLA FS therefore introduces a local caching mechanism that caches hot data on local disks, shortening the distance between data and computation, reducing the latency and IO limitations caused by remote reads, and achieving lower query latency and higher throughput.
3.1 Local Cache architecture
We encapsulate the cache processing logic in DLA FS. If the data to be read exists in the cache, it is returned directly from the local cache without pulling data from OSS; if it is not in the cache, it is read directly from OSS and cached to the local disk asynchronously.
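Below is a minimal sketch of this cache read path under the assumption of a simple file-per-object cache layout. The real DLA FS cache additionally handles eviction, concurrency, and block-level caching, and would not issue a second OSS request to fill the cache as this simplified version does:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.aliyun.oss.OSS;

public class LocalDiskCacheSketch {
    private final OSS oss;
    private final String bucket;
    private final Path cacheDir;
    private final ExecutorService cacheWriter = Executors.newSingleThreadExecutor();

    public LocalDiskCacheSketch(OSS oss, String bucket, String cacheDir) {
        this.oss = oss;
        this.bucket = bucket;
        this.cacheDir = Paths.get(cacheDir);
    }

    public InputStream open(String key) throws IOException {
        Path cached = cacheDir.resolve(key.replace('/', '_'));
        if (Files.exists(cached)) {
            // Cache hit: serve the data from local disk, no OSS request needed.
            return Files.newInputStream(cached);
        }
        // Cache miss: return the OSS stream to the caller and populate the cache
        // asynchronously so the caller is not blocked by the local write.
        cacheWriter.submit(() -> {
            try (InputStream in = oss.getObject(bucket, key).getObjectContent()) {
                Files.createDirectories(cacheDir);
                Files.copy(in, cached, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                // A failed cache fill is not fatal; the next read simply misses again.
            }
        });
        return oss.getObject(bucket, key).getObjectContent();
    }
}
```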
3.2 Data Cache hit rate improvement strategy
Here we use DLA Serverless Presto to illustrate how to improve the hit rate of DLA FS's local cache. Presto's default split scheduling strategy is NO_PREFERENCE; under this strategy the main consideration is worker load, so which worker a split is assigned to is largely random. In DLA Presto we use the SOFT_AFFINITY scheduling strategy: when submitting a Hive split, its hash value is calculated, and the same split is submitted to the same worker as much as possible to improve the cache hit rate.
With the SOFT_AFFINITY strategy, splits are scheduled as follows (a sketch follows the list):
- Determine the preferred worker and candidate worker for the split based on the split's hash value.
- If the preferred worker is idle, submit to the preferred worker.
- If the preferred worker is busy, submit to the candidate worker.
- If the candidate worker is also busy, submit to the least busy worker.
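The selection logic can be sketched as follows. This is an illustration of the strategy described above, not Presto's actual scheduler code; the busy threshold and the "next worker" candidate rule are assumptions:

```java
import java.util.List;

public class SoftAffinitySketch {
    /** Pick a worker for a split: preferred -> candidate -> least busy. */
    public static String pickWorker(String splitPath, List<String> workers,
                                    int[] pendingSplits, int busyThreshold) {
        int n = workers.size();
        int preferred = Math.floorMod(splitPath.hashCode(), n);
        int candidate = Math.floorMod(splitPath.hashCode() + 1, n); // simple fallback choice
        if (pendingSplits[preferred] < busyThreshold) {
            return workers.get(preferred);      // preferred worker is idle enough
        }
        if (pendingSplits[candidate] < busyThreshold) {
            return workers.get(candidate);      // fall back to the candidate worker
        }
        int leastBusy = 0;                      // both busy: pick the least busy worker
        for (int i = 1; i < n; i++) {
            if (pendingSplits[i] < pendingSplits[leastBusy]) {
                leastBusy = i;
            }
        }
        return workers.get(leastBusy);
    }

    public static void main(String[] args) {
        List<String> workers = List.of("worker-1", "worker-2", "worker-3");
        int[] pending = {5, 1, 9};
        System.out.println(pickWorker("oss://bucket/table/part-00000", workers, pending, 8));
    }
}
```

Because the same split keeps hashing to the same preferred worker across queries, the data it reads tends to stay in that worker's local cache, which is what raises the hit rate.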
4. The value brought by DLA FS
1. The effect of rename optimization in ETL write scenarios
In the process of using DLA, customers usually use DLA Serverless Spark to do large-scale data ETL. We use the orders table of the TPC-H 100G data set for a write test and create a new orders_test table partitioned by the o_ordermonth field. In Spark we execute the SQL: "insert overwrite table `tpc_h_test`.`orders_test` select * from `tpc_h_test`.`orders`". Using the same resource configuration, one run uses open-source Spark and the other uses DLA Serverless Spark, and we compare their results.
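For reference, the write test can be expressed with Spark's Java API roughly as below (a sketch only; the session and cluster configuration are omitted, and the table names follow the test setup above):

```java
import org.apache.spark.sql.SparkSession;

public class RenameOptimizationWriteTest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("dla-fs-rename-write-test")
                .enableHiveSupport()
                .getOrCreate();

        // orders_test is partitioned by o_ordermonth, as in the test setup above.
        spark.sql("INSERT OVERWRITE TABLE `tpc_h_test`.`orders_test` "
                + "SELECT * FROM `tpc_h_test`.`orders`");

        spark.stop();
    }
}
```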
The following can be drawn from the figure:
- This optimization greatly improves both Algorithm 1 and Algorithm 2.
- Both algorithms benefit after this feature is enabled, but the improvement for Algorithm 1 is more obvious. This is because Algorithm 1 needs to perform rename twice, and one of the renames is performed at a single point on the driver, whereas in Algorithm 2 each executor performs its rename in a distributed manner and only one rename is needed.
- With the current amount of data, the gap between Algorithm 1 and Algorithm 2 is no longer so obvious after this feature is turned on. Neither needs to perform rename; the difference is only whether completeMultipartUpload is executed at a single point on the driver (in our adaptation of Algorithm 2, completeMultipartUpload is executed during commitTask), and with larger amounts of data this may still have a relatively large impact.
2. The effect of InputStream optimization in interactive scenarios
DLA customers use DLA Serverless Presto to analyze data in multiple formats, such as Text, ORC, and Parquet. The following compares the number of access requests made by DLA FS and the community OSSFileSystem when analyzing 1 GB of data in Text and ORC formats.
Comparison of the number of requests for 1GB Text file analysis
- The number of calls for the Text format is reduced to about 1/10 of the open-source implementation;
- The number of calls for the ORC format is reduced to about 1/3 of the open-source implementation;
- On average, OSS call costs are reduced by 60% to 90%.
3. The effect of Data Cache in interactive scenarios
We conducted a performance comparison between community PrestoDB and DLA. For the community version, we chose PrestoDB 0.228 and added support for OSS data sources by copying the jar package and modifying the configuration. We compared the DLA Presto CU edition against the community-edition cluster, both with 512 cores and 2048 GB of memory.
For the test queries, we chose the TPC-H 1 TB data set. Since most TPC-H queries are not IO-intensive, we only selected queries that meet the following two criteria for comparison:
- The query includes a scan of lineitem, the largest table, so that the amount of scanned data is large enough for IO to potentially become a bottleneck.
- The query does not involve joins of multiple tables, so that there is not a large amount of data involved in computation and computation does not become a bottleneck before IO does.
Based on these two criteria, we selected Q1 and Q6, which query the single table lineitem, and Q4, Q12, Q14, Q15, Q17, Q19, and Q20, which join lineitem with one other table.
It can be seen that cache acceleration has an obvious effect in these queries.
5. Cloud-native data lake best practices
Best practice, taking DLA as an example. DLA is committed to helping customers build a low-cost, easy-to-use, and flexible data platform that saves at least 50% of the cost of traditional Hadoop. DLA Meta supports a unified view of 15+ data sources on the cloud (OSS, HDFS, DB, DW), introduces multi-tenancy and metadata discovery, pursues zero marginal cost, and is provided for free. DLA Lakehouse is implemented based on Apache Hudi; its main goal is to provide an efficient lakehouse that supports incremental writing of CDC and message data, and productization in this area is being accelerated. DLA Serverless Presto is developed based on Apache PrestoDB and is mainly used for federated interactive queries and lightweight ETL. DLA supports Spark mainly for large-scale ETL on the lake, and also supports stream computing and machine learning; it offers a 300% cost-performance improvement over traditional self-built Spark, and migrating ECS self-built Spark or Hive batch processing to DLA Spark can save 50% of the cost. The DLA-based integrated data processing solution can support business scenarios such as BI reports, large data screens, data mining, machine learning, IoT analysis, and data science.