
1 What is data skew

Data skew means that, within a big data computing job, a single processing task (usually a JVM process) is assigned far more work than the others, causing that task to run for a very long time or even fail, which in turn makes the whole job slow or makes it fail. In terms of external symptoms: in a HiveSQL job, the map or reduce progress sits at 99% for hours; in SparkSQL, some stage shows only 1 or 2 running tasks for a long time. In short, if the job keeps printing progress information but the content does not change for a long time, there is a high probability that the data is skewed. There is one special case worth noting: sometimes the SparkSQL task information also shows 1 or 2 running tasks, but the progress information no longer refreshes and the job appears suspended for a long time. This is usually file operations in the final stage, not skewed data (although it usually means there is a bad small-file problem).

Breaking it down a little further, skew can be divided into the following four categories:

  1. Read skew. A map (HiveSQL) or task (SparkSQL) cannot finish the data-reading phase for a long time. This is usually because a file block is too large or the data in the block is abnormal. This scenario occurs relatively rarely.
  2. Compute skew. When a sort (such as a window function or a non-broadcast join) or an aggregation is required, processing all the rows of the same key (usually a combination of one or more fields or expressions) takes too long. This is the most common case, and also the most complicated.
  3. Write skew. An operation has to output a very large amount of data, say hundreds of millions or even billions of rows. It mainly occurs when data expands after a join, and in operations that can only be performed by a single task (such as limit).
  4. File-operation skew. After the data has been written to a temporary folder, renaming and moving it is extremely time-consuming because of the sheer volume. This usually happens when dynamic partitioning produces small files. At present this no longer occurs in China and India because we merge small files by default; Singapore still has it (we are pushing for a solution).

2 Why is there data skew

Big data computing relies on a variety of distributed systems. All computing tasks and data are distributed, according to certain rules, across the available machines and nodes of the cluster for execution; in the end, the results may need to be gathered on a few nodes for a final aggregation, and the data is written to a distributed storage system such as HDFS/S3 for permanent storage. This process is designed to handle most situations, but not all of them. It has the following characteristics:

  1. The distribution of business data cannot be predicted. For example, without actually computing, the system cannot know in advance whether the values of a field in a table are roughly uniformly distributed.
  2. The size of computation results cannot be predicted. For example, without actually computing, the number of rows a join between two tables will output for particular keys (one join field or a combination of several) cannot be predicted and handled in a targeted way; likewise, the number of rows produced by splitting or exploding the value of a field is unpredictable, so no targeted measures can be taken in advance.
  3. Certain operations can only be performed by a single node. Most operations that need to maintain global state, such as sorting, limit, count distinct, and global aggregation, are generally arranged onto a single node.

These three characteristics mean that a single node may end up processing a huge amount of data, which produces the so-called skew problem. Of course, these difficulties are not insoluble. Over time, more and more targeted optimizations have emerged, and perhaps in the near future business colleagues will no longer be troubled by skew problems.

3 Solution cases

Since SparkSQL will be the main engine promoted within OPPO going forward, the following cases are presented mainly from a SparkSQL perspective.

3.1 Data expansion when joining a fact table to a fact table

Recently, two business colleagues raised a troublesome problem: a fact table joined to another fact table, where several keys each produce billions of output rows. The data expansion is severe, skewing both the computation and the writing of the result.

Take the following scenario as an example. We counted the distribution of the skewed key values in the two tables (table a and table b). As you can see, looking only at option_id=7, the join produces 46839 * 130836 = 6,128,227,404 rows, about 6.1 billion; for option_id=2 it produces 71080 * 125541 = 8,923,454,280 rows, about 8.9 billion. The skew is severe.

Joining a fact table to a fact table is occasionally encountered in non-report computing tasks. The skew we usually solve is slow because too much data has to be processed while computing the result, whereas the number of output rows is often not large and writing them out poses no difficulty. Here, however, the expansion is inherent in the join and the output itself is enormous, so methods such as filtering out abnormal data or broadcast joins have little effect.

The essence of the problem is that one task is executed by at most one process, and all rows of the same key must be processed within the same task. Since this mechanism cannot be changed, the only option is to find a way to reduce the number of rows a single task has to output.

So how do we reduce the number of rows a single task has to process without affecting the final result?

There are many suggestions online, mostly variations of handling the skewed keys separately: break the keys up by adding prefixes or suffixes, then merge the results at the end. This approach is cumbersome and not elegant. We wanted something friendlier to business colleagues and more elegant in code.

Finally, after going through the available built-in functions, I found the aggregate functions collect_set/collect_list, which can gather rows together to reduce the row count without losing the data relationship. For example, two rows like the following (field names and values are made up for illustration):
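    -- hypothetical sample rows
    option_id    request_id
    7            id_001
    7            id_002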

can be collapsed into a single row:
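    -- the same rows after collect_list (still hypothetical)
    option_id    request_id_list
    7            ["id_001", "id_002"]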

Finally, we can expand one row back into multiple rows with explode + lateral view, restoring the detailed result the user ultimately expects.

The core of the method above is to turn the originally skewed operation (joining rows that share the same key) into an operation whose pieces no longer depend on each other (expanding one row into many rows).

The final code is along the following lines (a sketch of the approach with hypothetical table and field names):
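    -- Sketch only: table and field names are hypothetical.
    -- Step 1: salt table b and collapse the rows of each (key, salt) pair into one
    --         array with collect_list, so each key yields at most N collapsed rows.
    -- Step 2: join the much smaller collapsed result to table a.
    -- Step 3: explode the arrays back into detail rows downstream.
    SELECT /*+ REPARTITION(1000) */                          -- split the join output into 1000 partitions
           a.option_id,
           a.request_id,
           b_value
    FROM   table_a a
    JOIN   (
             SELECT option_id,
                    collect_list(b_value) AS b_value_list    -- collapse rows sharing the same key
             FROM   (SELECT option_id,
                            b_value,
                            ceil(rand() * 10) AS salt        -- N = 10 here, purely illustrative
                     FROM   table_b) t
             GROUP  BY option_id, salt
           ) b
           ON a.option_id = b.option_id
    LATERAL VIEW explode(b.b_value_list) v AS b_value;       -- expand back to detail rows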

Note the following points in the above code:

  • The hint (repartition(1000)) is there because a single row produced by collect_list carries a lot of data, which expands many times during the one-row-to-many-rows step, so the amount of data handled by each task must be kept very small to guarantee fast processing. The hint tells the system to split the result of the previous stage's join into 1000 partitions and hand them to downstream processing;
  • ceil(rand()*N) in the group by clause splits one key into at most N rows, which caps the number of rows that key can generate in the final join;
  • The spark.sql.files.maxPartitionBytes parameter controls how much data a single task reads, further splitting the data a single task has to process; in fact, if the files from the first point are small enough, this parameter can be omitted. A minimal example of setting it follows this list.
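For the third point, a minimal example of setting the parameter (the value here is purely illustrative):

    -- Cap the amount of file data a single task reads, e.g. 32 MB
    SET spark.sql.files.maxPartitionBytes=33554432;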

After verification, the task finished in 20 minutes and produced nearly 80 billion rows, including 19 keys with more than one billion rows each.

3.2 Avoid sorting

Anyone with a basic knowledge of algorithms knows that sorting is a very expensive operation. The best time complexity of the sorting algorithms in large-scale use today is O(n log n), i.e. the cost grows faster than linearly with the data volume, so sorting large data volumes usually means a huge amount of time. Yet sorting is common in big data SQL, and it leads to skew; once a sort is required, no tuning parameter will help, and generally the only option is to rewrite the code. Fortunately, in the vast majority of big data scenarios sorting is unnecessary; in many cases business colleagues simply do not realize how expensive sorting is at this scale and write it anyway. Here are two cases of rewriting code to avoid sorting.

1) Replace sorting with the max function.

Recently a colleague from the business side raised a request: they needed to display samples of the tracking (buried-point) data of a certain business, fishing out roughly 10,000 rows from about 120 billion rows of data. The SQL was very simple, roughly the following (field names are hypothetical):
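    -- Sketch with made-up names; xxxx stands in for the content field mentioned in the text.
    SELECT *
    FROM (
        SELECT t.*,
               row_number() OVER (PARTITION BY dim1, dim2, dim3
                                  ORDER BY size(xxxx) DESC) AS rn   -- a full sort inside every key
        FROM   event_table t
    ) s
    WHERE rn = 1;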

To explain the intent of the SQL: for each dimension combination in the reported data, take one sample row with rich content, so the rows are ordered descending by the size of a certain field and the first row of the result is taken.

Of course this SQL failed to run. I counted the field combination in partition by (hereafter referred to as the key): the largest key has 13.7 billion rows, and at least 10 keys have more than 2 billion rows each. No matter how much executor memory was added, it could not run successfully.

The essence of the problem is an unnecessary sort over big data (big data architectures have no particularly efficient way to handle sorting). The optimization idea is therefore to find a way to remove this unnecessary sort.

Since the user only needs the first row after sorting, this is essentially just taking the maximum value for each key. Take out that maximum, then join it back to the source table, and you can retrieve the row corresponding to the maximum value.

There is one precondition. If we want to avoid sorting when the data is joined back to the source table in the second step, there is only one way: a broadcast join (with a sort-merge join, the sort step is still unavoidable). This requires the small table (key-max) to be small enough. Usually this condition holds, because if it did not, the key space would be very large and sparse and there would be no skew dilemma in the first place. As explained at the beginning, the number of distinct keys is under 10,000, which is perfectly fine to broadcast.

The final code is roughly as follows (a sketch with hypothetical names, combining max, a left semi join, and a broadcast of the small key-max result):
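    -- Sketch, hypothetical names. Step 1: compute the per-key maximum (a tiny result,
    -- well under the broadcast threshold). Step 2: semi-join it back to the source table
    -- to keep only the rows that hit the maximum. Step 3: row_number to keep one row.
    WITH key_max AS (
        SELECT dim1, dim2, dim3,
               max(size(xxxx)) AS max_size                   -- later replaced by max(userid), see below
        FROM   event_table
        GROUP  BY dim1, dim2, dim3
    )
    SELECT *
    FROM (
        SELECT t.*,
               row_number() OVER (PARTITION BY t.dim1, t.dim2, t.dim3
                                  ORDER BY t.dim1) AS rn     -- rows tie on the max, any one will do
        FROM   event_table t
        LEFT SEMI JOIN key_max m                             -- a mapjoin/broadcast hint may be needed, see part 3) below
               ON  t.dim1 = m.dim1
               AND t.dim2 = m.dim2
               AND t.dim3 = m.dim3
               AND size(t.xxxx) = m.max_size
    ) s
    WHERE rn = 1;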

Note two caveats about the above SQL:

  • We used a semi join, which is relatively rare in everyday code. It means the left table is matched against the right table: once a left-table row's join key is found in the right table, the left row is kept and the right table is not searched any further. This has two consequences: 1) it is faster; 2) no data from the right table appears in the result. It is equivalent to select * from left_table where key in (select key from right_table). However, the in usage was long unsupported in big data engines (it is now partially supported), hence this syntax, which is generally considered more efficient.
  • Because many rows may tie on the maximum value, we apply row_number once more over the final result and take one of them. Since the value of size(xxxx) is the same for all of these rows, any one of them meets the business requirement.

In general, the above SQL works well. This time, however, something unexpected happened: after the above operations we still got more than 80 billion rows, because rows with max(size(xxxx)) = size(xxxx) account for the vast majority, so matching back could not effectively narrow down the result. We had to find a field that effectively distinguishes each row, one whose values are very sparse. In the end userid turned out to work well, so max(size(xxxx)) was replaced with max(userid) and the task ran quickly. Since this does not affect the principle of the optimization, we will not describe that part in detail.

2) Replace sorting with a quantile function.

A user-profile task had been running very slowly and the business colleagues turned to us for help. We found that the slow code was essentially the following (sketched with hypothetical table and field names):
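    -- Sketch of the original logic: one global sort over all rows, then grading by rank.
    SELECT imei,
           score,
           CASE
               WHEN rn <= total * 0.2 THEN 1
               WHEN rn <= total * 0.4 THEN 2
               WHEN rn <= total * 0.6 THEN 3
               WHEN rn <= total * 0.8 THEN 4
               ELSE 5
           END AS grade
    FROM (
        SELECT imei,
               score,
               row_number() OVER (ORDER BY score DESC) AS rn,   -- global sort: no partition, a single task
               count(1)     OVER ()                    AS total -- total row count, used for the percentage cuts
        FROM   user_score
    ) t;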

Problem: the code above performs a global sort and then uses each row's sequence number to classify and label it. It barely worked while the sorted data stayed below 550 million rows; once the data volume passed 550 million rows on a certain day, it could no longer finish, even with the reducer memory raised to 10G.

New idea: although some parameters could perhaps be tuned, I felt that was not the right direction, so I stopped that line of research and instead set out to eliminate the global sort. While talking with a senior colleague, it suddenly dawned on me that since the business goal is grading, no exact sequence number is actually needed, so in principle the complete sort can be skipped. The quantile function came to mind naturally, and with it a new solution: a quantile function computes the value a row must be greater than or equal to in order to sit at a given position in the overall ordering. Please search for the details.

The changed code looks roughly like this (again a sketch with hypothetical names; percentile_approx is used here as the quantile function, though the actual function may differ):
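    -- Sketch: compute the four cut points once, cross join the one-row result back,
    -- and grade each row by comparing with the cut points; no global sort needed.
    SELECT s.imei,
           s.score,
           CASE
               WHEN s.score >= q.p8 THEN 1
               WHEN s.score >= q.p6 THEN 2
               WHEN s.score >= q.p4 THEN 3
               WHEN s.score >= q.p2 THEN 4
               ELSE 5
           END AS grade
    FROM   user_score s
    CROSS JOIN (
        SELECT percentile_approx(score, 0.2) AS p2,   -- value at the 20th percentile
               percentile_approx(score, 0.4) AS p4,
               percentile_approx(score, 0.6) AS p6,
               percentile_approx(score, 0.8) AS p8
        FROM   user_score
    ) q;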

Note a small trick in the code above: the subquery returns a single row, and joining it in via a Cartesian product effectively introduces the 4 variables p2 through p8, which is quite practical.

Effect: comparing the results of the old and new algorithms, the difference is very small and within the expected range.

Comparing task execution times, there is roughly an 87% reduction.

The essence of this case is recognizing that the computationally expensive global sequence number is completely unnecessary. There are still many similar situations in our business code; they remain within what the business can tolerate, but the room for optimization is very large. We hope dedicated projects can be carried out in the future to save computing time and resources.

3) Avoid sorting entirely through broadcast joins.

SparkSQL currently handles joins in two main ways:

a) Broadcast join. Small tables (controlled by the parameter spark.sql.autoBroadcastJoinThreshold, currently 20M by default for us) are joined by broadcasting: all the data of the small table is shipped into the memory of every node, and the join is completed quickly through direct in-memory operations. The biggest advantage of this method is that the main table's data does not need to be shuffled, at the cost of higher memory usage per task. Three points deserve particular attention:

  • At present our SparkSQL optimizer cannot accurately judge whether a subquery result (which can also be regarded as a small table) is suitable for broadcasting; a fix is still on the roadmap;
  • The left table cannot be broadcast, regardless of its size;
  • In some cases an OOM appears with a message like: Kryo serialization failed: Buffer overflow, together with the advice "To avoid this, increase spark.kryoserializer.buffer.max value". In practice that setting does not help. The real reason is that although the table is smaller than 32M on disk, it is highly compressed and decompresses to tens of millions of rows, causing the node to OOM. In this case the only remedy is to disable the broadcast join manually, as shown right after this list.
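One way to disable it for a session (a standard Spark setting; -1 turns automatic broadcast joins off):

    -- Disable automatic broadcast joins when the "small" table explodes after decompression
    SET spark.sql.autoBroadcastJoinThreshold=-1;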

b) Sort-merge join. The two tables are first sorted by the join field and then matched and joined on that basis. Since the data is sorted, a single pass of matching completes the join, which is fast. The drawbacks are that the join keys must be sorted, and that every occurrence of a given key together with its data must be sent to the same executor, causing a large amount of shuffle; moreover, if an executor has to process a key with a huge number of rows, it usually takes a long time and a lot of disk IO.

From the description above it is clear that with a broadcast join the engine does not need to sort at all, so there is naturally no skew caused by sorting. This is a huge efficiency improvement; the cost is increased memory usage, which is generally considered well worth it.

If the engine does not recognize the opportunity on its own, we can influence the execution plan by indicating it explicitly.

For example, to change the execution plan so that the subquery result s is broadcast, just add a mapjoin (or broadcast) hint, as in this sketch (table names are hypothetical):
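    -- Sketch: force the subquery aliased s to be broadcast (table names are hypothetical).
    SELECT /*+ MAPJOIN(s) */
           t.key,
           t.value,
           s.cnt
    FROM   big_table t
    JOIN   (SELECT key,
                   count(1) AS cnt
            FROM   other_table
            GROUP  BY key) s
           ON t.key = s.key;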

Judging from the actual results, the broadcast join more than doubled the speed.

3.3 Avoiding write skew

This section is brief. In a dynamic partition scenario it is often hard to predict how much data each partition will finally output, yet the number of tasks writing to each final partition is fixed. Taking a country partition as an example: if the Indonesia partition outputs 1 billion rows while Singapore outputs only 1 million, then with only 2 write tasks allocated, a single task for the Indonesia partition has to write hundreds of millions of rows and will be very slow. If instead 100 write tasks are configured, that suits the Indonesia partition, but the Singapore partition will generate 100 small files, which hurts subsequent file operations and the reads of downstream tasks. After some practice we found a better way: identify the skewed partition key and let the engine distribute different amounts of data to different partitions with a distribute by + case when expression. The code looks roughly like this (with region as the dynamic partition field; table, field, and region values in the sketch are hypothetical):
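    -- Sketch: give the heavy Indonesia partition many write tasks while keeping each
    -- small partition on a single task, via distribute by + case when.
    INSERT OVERWRITE TABLE target_table PARTITION (region)
    SELECT col1,
           col2,
           region
    FROM   source_table
    DISTRIBUTE BY CASE WHEN region = 'ID' THEN cast(rand() * 100 AS int)  -- spread the big partition over ~100 tasks
                       ELSE 0                                             -- small partitions: one group each
                  END,
                  region;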

At present this approach is still needed in overseas tasks. In the future, as we push forward automatic small-file merging on AWS, this should no longer be a worry.

3.4 Invalid value filtering

This method has been discussed a great deal online, so only a brief description is given here.

While optimizing the dwd_ocloud_dau_info_d task of the strategic ecology department, we found that its running time kept increasing, at one point reaching 7 hours, and by August 1st it could no longer finish at all, always failing with OOM (out of memory), even after raising the executor memory to 10G. Careful diagnosis showed that the task was slow in a window-function stage, roughly the following (a sketch; field names other than guid are hypothetical):
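    -- Sketch of the skewed stage: a window partitioned by guid, so the huge number of
    -- rows sharing the same (null) guid all land in a single task.
    SELECT guid,
           event_time,
           row_number() OVER (PARTITION BY guid
                              ORDER BY event_time DESC) AS rn
    FROM   dau_detail
    WHERE  dayno = '${dayno}';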

Preliminary statistics on the guid key showed that null values account for hundreds of millions of rows, and the count keeps growing. The check can be done with a simple per-key count, for example:
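    -- Sketch: count rows per guid to locate the skewed values.
    SELECT guid,
           count(1) AS cnt
    FROM   dau_detail
    WHERE  dayno = '${dayno}'
    GROUP  BY guid
    ORDER  BY cnt DESC
    LIMIT  100;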

This also explains the ever-growing runtime: the memory overhead and duration of the sort grew along with it. After checking with the business colleagues, it was confirmed that the null values are meaningless, so they were excluded:
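    -- Sketch: drop the meaningless null/empty guid before the window runs.
    SELECT guid,
           event_time,
           row_number() OVER (PARTITION BY guid
                              ORDER BY event_time DESC) AS rn
    FROM   dau_detail
    WHERE  dayno = '${dayno}'
      AND  guid IS NOT NULL
      AND  guid <> '';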

Running again with the default parameters, the task finished within 30 minutes. Time consumption dropped by about 90%, a clear improvement.

In this example the skewed value happened to be invalid and could simply be filtered out, which was lucky. You may ask: what if the skewed value is valuable? Generally speaking, you take the skewed values out separately, compute them with a dedicated set of targeted logic, and then union all the result back with the result computed from the non-skewed data. A rough sketch of this pattern follows (hypothetical names; the targeted logic shown, a salted pre-aggregation, is just one possibility):
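    -- Sketch: compute the non-skewed keys normally, handle the hot key with dedicated
    -- logic (here a salted pre-aggregation), then union the two results back together.
    SELECT guid,
           count(1) AS pv
    FROM   event_log
    WHERE  guid <> 'hot_value'                     -- the ordinary, non-skewed keys
    GROUP  BY guid
    UNION ALL
    SELECT guid,
           sum(cnt) AS pv                          -- merge the partial counts of the hot key
    FROM (
        SELECT guid,
               salt,
               count(1) AS cnt
        FROM (
            SELECT guid,
                   ceil(rand() * 100) AS salt      -- spread the hot key over ~100 groups first
            FROM   event_log
            WHERE  guid = 'hot_value'
        ) a
        GROUP  BY guid, salt
    ) b
    GROUP  BY guid;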

4 Conclusion

Data skew handling is essentially limited to the cases classified above, and I believe everyone can master it with a little study. In the future we plan to develop diagnosis and optimization tools that focus on helping you identify skewed nodes and propose code-level optimization suggestions. Stay tuned!

About the author

Luckyfish, OPPO Big Data Service Quality Leader

Mainly responsible for big data platform support, maintenance, and service quality assurance. Previously worked at JD.com. Has rich experience in big data task development and performance optimization, with a strong interest in, and experience with, product experience and cost optimization.


