Market shifts, policy changes, technological innovation... factors of every kind confront us with challenges that we must keep exploring and overcoming.
This year, NetEase Shufan will continue to launch new columns such as "Financial Experts Talk", "Technical Experts Talk", and "Product Experts Talk", bringing together digital transformation experts from Shufan and its partners. Focusing on big data, cloud native, artificial intelligence, and other areas of technological innovation, these columns deliver in-depth technical interpretations and industry application stories, providing a valuable reference for enterprises pursuing digital transformation.
Today, You Xiduo, a big data offline technical expert at NetEase Shufan, shares a mature technical solution for standardizing enterprise-level offline data warehouses, optimizing storage, and improving performance. It has been practiced and verified inside NetEase and is offered here as a technical reference.
1. Pain points of Spark-based enterprise-level offline data warehouses
The tasks of an enterprise-level data warehouse are mostly ETL jobs: data is read from multiple tables and, after a series of SQL operators, written out as a new table. Since Spark 3 already provides solid performance on the compute side, the remaining pain points are concentrated in the write path. Hive and Spark 2 have many problems when writing, such as small files and file skew, unsatisfactory data compression ratios, and dynamic partition writes that are hard to optimize. Below we analyze the current state of each of these problems and present new solutions.
Small files & file skew
The traditional solution is to append DISTRIBUTE BY $columns to the SQL, which essentially adds an extra Shuffle to repartition the data. The quality of the output files depends heavily on this shuffle field, yet in most scenarios data skew is unavoidable, so some compute partitions end up processing far more data than others. This not only causes file skew but also drags down the completion time of the whole task.
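For concreteness, here is a minimal sketch of this traditional approach; the table and column names are hypothetical.

```sql
-- Minimal sketch of the traditional fix (hypothetical names). The extra shuffle
-- introduced by DISTRIBUTE BY decides how many output files are written and how
-- evenly rows are spread across them -- and therefore how skewed the files are.
INSERT OVERWRITE TABLE dw.target_table
SELECT user_id, event_type, amount
FROM dw.source_table
DISTRIBUTE BY user_id;
```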
Students with some understanding of the execution engine may resort to a rather hacky optimization such as DISTRIBUTE BY rand() * files. However, whether it is the data inconsistency caused by rand() that we have reproduced internally, or the issue reported by the Spark community (Repartition + Stage retries could lead to incorrect data), both are enough to prove that this is a flawed approach that can produce inconsistent data, and it should be avoided.
In addition, some experienced students adjust skewed data by taking a modulus, for example DISTRIBUTE BY $column % 100, $column (see the sketch after this list). This works, but it has several flaws:
1) There is an upper limit to the optimization: it is hard to determine the optimal modulus through tuning and debugging, so you can only settle for a reasonably acceptable result.
2) The optimization cost is high: you need to know the data distribution of the field very well, and only after repeated debugging and verification can you find a reasonable value.
3) The maintenance cost is high: a month later the data may have changed, and the previously tuned modulus is no longer reasonable.
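A sketch of the modulo workaround, with hypothetical table and column names and the modulus 100 mentioned above:

```sql
-- Modulo workaround (hypothetical names). When `id` is roughly uniformly distributed,
-- `id % 100` spreads the rows over about 100 evenly sized buckets, but the modulus
-- has to be found by trial and error and re-tuned whenever the data changes.
INSERT OVERWRITE TABLE dw.target_table
SELECT id, event_type, amount
FROM dw.source_table
DISTRIBUTE BY id % 100, id;
```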
Data compression ratio is not ideal
The traditional solution is to append SORT BY $column to the SQL, which essentially adds an in-partition sort before writing to improve compression; or to combine it with a shuffle as DISTRIBUTE BY $columns SORT BY $columns, so that the same data falls into one partition and is then partially sorted, further improving the compression ratio. Two problems remain. First, this does not get around small files and file skew, which we will not repeat here. Second, traditional dictionary (lexicographic) sorting cannot preserve the clustering of data across multiple dimensions; "multi-dimensional" in a warehouse scenario simply means multiple fields. A good clustering of the data improves the data-skipping ratio of the files at query time. Most of our tasks today only consider the performance of the task itself; we also need to start paying attention to the query performance of downstream tasks so as to form a virtuous cycle.
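A minimal sketch of this classic DISTRIBUTE BY + SORT BY pattern, again with hypothetical names:

```sql
-- Cluster similar rows into the same reduce partition, then sort within the partition
-- so the columnar encoder sees long runs of similar values and compresses better.
INSERT OVERWRITE TABLE dw.target_table
SELECT user_id, event_type, amount
FROM dw.source_table
DISTRIBUTE BY user_id
SORT BY user_id, event_type;
```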
Dynamic partition write scenarios are difficult to optimize
Dynamic partitioning usually appears in jobs that write very large tables, often more than 1 TB of data per day. From a business perspective this is reasonable: once the table is partitioned, downstream queries are flexible and efficient. But optimizing the dynamic-partition job itself is very troublesome: it suffers from small files, a low compression ratio, and a huge data volume all at once. On closer inspection, small files and compression ratio are actually mutually exclusive in the dynamic partition scenario. If minimizing the number of files takes priority, we should use the partition field as the shuffle and sort field so that rows of the same partition land in one compute partition; but the compression ratio then depends on the other data fields and ends up low. If the compression ratio takes priority, we should use the data fields as the shuffle and sort fields; but rows of the same partition then land in different compute partitions, producing a large number of small files.
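The two mutually exclusive choices can be illustrated with a hypothetical daily-partitioned table (all names are illustrative; whether the dynamic-partition setting is required depends on your Hive/Spark setup):

```sql
SET hive.exec.dynamic.partition.mode=nonstrict;  -- often required when all partition values are dynamic

-- Option A: fewest files. All rows of one dt land in one reduce partition, but the
-- compression ratio then depends on whatever order the other columns happen to arrive in.
INSERT OVERWRITE TABLE dw.big_table PARTITION (dt)
SELECT user_id, event_type, amount, dt
FROM dw.source_table
DISTRIBUTE BY dt
SORT BY dt;

-- Option B: best compression. Clustering and sorting by data columns helps the encoder,
-- but each reduce partition now spans many dt values and writes many small files.
INSERT OVERWRITE TABLE dw.big_table PARTITION (dt)
SELECT user_id, event_type, amount, dt
FROM dw.source_table
DISTRIBUTE BY user_id
SORT BY user_id, event_type;
```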
Faced with this series of problems, we propose the following solutions based on Spark 3 + Z-Order, which have achieved very good results in our production environment.
2. Rebalance + Z-Order
2.1 Solution introduction
Z-Order is a technique that maps multidimensional data onto one dimension and is widely used in spatio-temporal indexing and image processing. The Z curve can fill a space of any dimensionality with an infinitely long one-dimensional curve. For a row in the database, we can treat the multiple fields to be sorted as the dimensions of the data; the Z curve maps these multidimensional values to a one-dimensional z-value according to certain rules, and the data can then be sorted by that z-value.
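To make the mapping concrete, here is a toy z-value built in Spark SQL for two hypothetical non-negative columns that fit in 4 bits each; real implementations handle arbitrary widths, signedness, and nulls.

```sql
-- Toy z-value for columns x and y of a hypothetical table t. The bits are interleaved
-- as x3 y3 x2 y2 x1 y1 x0 y0, so rows that are close in both x and y also end up
-- close in z_value, preserving the multidimensional clustering in one sort key.
SELECT x, y,
       shiftleft(shiftright(x, 3) & 1, 7) |
       shiftleft(shiftright(y, 3) & 1, 6) |
       shiftleft(shiftright(x, 2) & 1, 5) |
       shiftleft(shiftright(y, 2) & 1, 4) |
       shiftleft(shiftright(x, 1) & 1, 3) |
       shiftleft(shiftright(y, 1) & 1, 2) |
       shiftleft(x & 1, 1) |
       (y & 1) AS z_value
FROM t
ORDER BY z_value;
```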
On top of the classic DISTRIBUTE BY + SORT BY usage, we propose a new-generation optimization: REBALANCE + Z-Order. REBALANCE solves both small files and file skew while satisfying the DISTRIBUTE BY semantics as far as possible. We say "as far as possible" because file skew is essentially caused by skewed compute partitions, so when we split a skewed partition into several pieces we also break the strict DISTRIBUTE BY semantics; of course, this does not affect the correctness of the data and causes no other problems. Sorting based on the Z-Order algorithm replaces the default dictionary sort, preserving the clustering of multidimensional data, which improves the compression ratio and speeds up the queries of downstream tasks.
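The rebalance exchange and the Z-Order sort are injected at the engine level; the sketch below only approximates the idea with the open-source REBALANCE hint available in Spark 3.2+ and a shortened version of the toy z-value expression from the previous sketch, using hypothetical names.

```sql
SET spark.sql.adaptive.enabled=true;  -- AQE must be on so REBALANCE can split/merge partitions

-- Approximate, hint-based sketch; not the exact engine-level rewrite described above.
INSERT OVERWRITE TABLE dw.target_table
SELECT /*+ REBALANCE(dim_a, dim_b) */ dim_a, dim_b, amount
FROM dw.source_table
SORT BY shiftleft(shiftright(dim_a, 1) & 1, 3) | shiftleft(shiftright(dim_b, 1) & 1, 2) |
        shiftleft(dim_a & 1, 1) | (dim_b & 1);  -- toy 2-bit z-value, same idea as above
```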
(Fig. 1 Rebalance + Z-Order)
The figure above shows how Rebalance + Z-Order works, involving both the upstream task that writes the table and the downstream tasks that read it. First, Rebalance exists in the form of a Shuffle: in the shuffle-read phase, partitions are split and merged so that every reduce partition handles roughly the same amount of data. Z-Order-based data-skipping optimization depends strongly on the file format. Mainstream columnar formats such as Parquet and ORC record statistics while writing data; for example, Parquet records per-field min/max values at Row Group granularity by default. When the file is queried, the pushed-down predicate is compared against these statistics; if it cannot be satisfied, the Row Group, or even the whole file, is skipped, avoiding the transfer of useless data. This is the data-skipping process.
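As a simple illustration of data skipping (hypothetical table and predicate):

```sql
-- Parquet filter pushdown is enabled by default; with it, Row Groups whose min/max
-- statistics cannot satisfy the predicate are skipped without being read. Z-Ordered
-- files keep those min/max ranges narrow, so more Row Groups get skipped.
SET spark.sql.parquet.filterPushdown=true;

SELECT count(*)
FROM dw.target_table
WHERE user_id BETWEEN 1000 AND 2000   -- hypothetical predicate columns
  AND event_type = 'click';
```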
2.2 Case study
When applying this to a concrete task, you can first upgrade from Spark 2 to Spark 3 and then apply the Z-Order optimization.
- Spark2 -> Spark3
In practice, because a Shuffle is introduced, the task gains one more stage, yet the execution time is greatly shortened. The original job suffered data expansion and severe skew in its last stage, so a single compute partition had to process an enormous amount of data. After Rebalance, the extra stage breaks up the bloated data and removes the skew, yielding roughly a 4x performance improvement. However, a new problem appears: the data compression ratio drops. Although the expanded and skewed compute partitions ran slowly, they happened to compress better.

- Spark3 + Z-Order

To address the compression ratio we added the Z-Order optimization. The compression ratio increases 12x, which is also nearly 25% better than the original Spark 2 task. And because IO decreases, the extra Z-Order sort does not slow the computation down. Task performance, small files, and the compression ratio are thus managed at the same time.

3. Two-Phase Rebalance + Z-Order
3.1 Solution introduction
As mentioned earlier, small files and the compression ratio are mutually exclusive in dynamic partition scenarios, but compared with optimizing at the business level, there is still plenty of room to improve both pain points at the engine level. We propose Two-Phase Rebalance + Z-Order, which gives priority to the compression ratio while reducing small files as much as possible.
(Figure 2 Two-Phase Rebalance)
As shown in the figure above, the whole process consists of two Rebalance stages plus Z-Order. The first-stage Rebalance uses the dynamic partition fields, minimizing the number of files, although the compression ratio is still low at this point. The second-stage Rebalance uses the dynamic partition fields plus the Z-Order fields, ensuring the best compression ratio of the output, and finally an in-partition Z-Order sort is performed. Some readers may ask: why does the second-stage Rebalance not produce small files? Because AQE shuffle read preserves map-side ordering when splitting reduce partitions, i.e. the map outputs pulled by one reduce partition are contiguous; and after the first-stage rebalance, contiguous map outputs share the same partition value, so the generation of small files is largely avoided.
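The two rebalances are injected by the engine and cannot be faithfully reproduced with plain SQL hints (adjacent rebalances may even be collapsed by the optimizer), so the sketch below is purely illustrative of the shuffle keys involved in each phase and of the final in-partition Z-Order sort; all names are hypothetical.

```sql
SET spark.sql.adaptive.enabled=true;
SET hive.exec.dynamic.partition.mode=nonstrict;  -- often needed when all partition values are dynamic

INSERT OVERWRITE TABLE dw.big_table PARTITION (dt)
SELECT /*+ REBALANCE(dt, dim_a, dim_b) */ dim_a, dim_b, amount, dt
FROM dw.source_table
-- phase 1 (engine-injected): rebalance on dt only              -> fewest files per partition
-- phase 2 (approximated by the hint): rebalance on dt + Z-Order columns -> best compression;
--   AQE shuffle read keeps map outputs contiguous, so this phase does not re-create small files
SORT BY dt,
        shiftleft(shiftright(dim_a, 1) & 1, 3) | shiftleft(shiftright(dim_b, 1) & 1, 2) |
        shiftleft(dim_a & 1, 1) | (dim_b & 1);  -- toy z-value over (dim_a, dim_b)
```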
3.2 Case study

(Figure 3 comparison chart)
The figure above shows the effect of switching a task from manual optimization to Z-Order optimization. The manual optimization used the aforementioned DISTRIBUTE BY + SORT BY combined with a modulus; the task before any manual optimization was even worse. After Two-Phase Rebalance + Z-Order, the compression ratio improves by nearly 13% over the manual optimization and nearly 8x over the original task, while the number of files drops by nearly 3x compared with the manual optimization and nearly 14x compared with the original task. At the same time, the task's compute performance improves by nearly 15%.

4. Two-Phase Rebalance + Z-Order + Zstd
4.1 Solution introduction
Two-Phase Rebalance + Z-Order meets the optimization requirements, but since it performs one more Shuffle than the manual optimization, the amount of shuffle data during the task increases. This is temporary data that is cleaned up automatically when the task ends, but if local disk headroom is tight, it can still run out of space. We therefore introduced Zstd, a codec with a higher compression ratio, to reduce the amount of data in the shuffle while keeping the impact on task performance as small as possible.
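For reference, the relevant knobs are standard Spark configurations; the shuffle codec is a core (non-SQL) setting, so it has to be supplied when the application is launched rather than changed per query.

```sql
-- Show whether a shuffle codec has been set explicitly (unset means Spark's default, lz4):
SET spark.io.compression.codec;

-- To switch to zstd, supply it at application launch, e.g.:
--   spark-submit --conf spark.io.compression.codec=zstd \
--                --conf spark.io.compression.zstd.level=1 \
--                --conf spark.io.compression.zstd.bufferSize=32k ...
-- (the zstd level/buffer knobs and their defaults should be checked against your Spark version)
```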
4.2 Case study
In a concrete case, the two compression algorithms differ noticeably: compared with the default Lz4, Zstd saves nearly 60% of the shuffle data, and our tests show no obvious performance impact, because the large reduction in IO compensates for the extra CPU time. Our team will continue to promote and optimize Zstd going forward.
Summary
This article introduced our Spark 3 + Z-Order optimization scheme for enterprise-level offline data warehouse tasks, which initially resolves the current pain points encountered during the Spark migration and in historical usage. Along the way we also gained some insight: no technical solution can solve every problem perfectly, and we must still try to find the point where a compromise is needed; before reaching that point, the room for optimization is huge.
About the author: You Xiduo, big data offline technical expert at NetEase Shufan, Apache Kyuubi PMC member, Apache Spark contributor.