
Preface

Spark is currently the mainstream big data computing engine. Its capabilities cover offline batch processing, SQL processing, streaming/real-time computing, machine learning, graph computing, and other types of computation in the big data field, so its range of applications and its prospects are very broad. As an in-memory computing framework, Spark is fast and can meet diverse data computing and processing needs such as UDFs, joins between large and small tables, and multiple outputs.

As a professional data intelligence service provider in China, GeTui introduced Spark as early as version 1.3 and built a data warehouse based on Spark for offline and real-time computation over large-scale data. Because Spark's optimization focus before version 2.x was on the computing engine, with no major improvements or upgrades to metadata management, GeTui still uses Hive for metadata management, adopting a big data architecture of Hive metadata management + Spark computing engine to support its own big data business. GeTui also applies Spark widely to report analysis, machine learning, and other scenarios, providing industry customers and government departments with services such as real-time population insights and group portrait construction.

▲In an actual GeTui business scenario, SparkSQL and HiveSQL were each used to compute about 3T of data. The figure above shows the running speeds. The data shows that, with the queue resources capped (120G memory, <50 cores), SparkSQL 2.3 computes 5 to 10 times faster than Hive 1.2.

For enterprises, efficiency and cost are always issues they must pay attention to when processing and computing massive data. How can the advantages of Spark be brought into full play to genuinely reduce costs and increase efficiency in big data operations? GeTui has summarized the Spark performance tuning tips accumulated over the years and shares them here.

Spark performance tuning-basics

As we all know, correct parameter configuration can greatly improve the efficiency of Spark. Therefore, for Spark users who do not understand the underlying principles, we provide a parameter configuration template that can be copied directly, to help data developers and analysts use Spark more efficiently for offline batch processing and SQL report analysis.

The recommended parameter configuration template is as follows:

Spark-submit submission method script

/xxx/spark23/xxx/spark-submit --master yarn-cluster  \
--name ${mainClassName} \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.yarn.maxAppAttempts=2 \
--conf spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC \
--driver-memory 2g \
--conf spark.sql.shuffle.partitions=1000 \
--conf hive.metastore.schema.verification=false \
--conf spark.sql.catalogImplementation=hive \
--conf spark.sql.warehouse.dir=${warehouse} \
--conf spark.sql.hive.manageFilesourcePartitions=false \
--conf hive.metastore.try.direct.sql=true \
--conf spark.executor.memoryOverhead=512M \
--conf spark.yarn.executor.memoryOverhead=512 \
--executor-cores 2 \
--executor-memory 4g \
--num-executors 50 \
--class <startup class> \
${jarPath} \
-M ${mainClassName} 

spark-sql submission method script

option=/xxx/spark23/xxx/spark-sql
export SPARK_MAJOR_VERSION=2
${option} --master yarn-client \
--driver-memory 1G \
--executor-memory 4G \
--executor-cores 2 \
--num-executors 50 \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
--conf spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER \
--conf spark.sql.auto.repartition=true \
--conf spark.sql.autoBroadcastJoinThreshold=104857600 \
--conf "spark.sql.hive.metastore.try.direct.sql=true" \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=1 \
--conf spark.dynamicAllocation.maxExecutors=200 \
--conf spark.dynamicAllocation.executorIdleTimeout=10m \
--conf spark.port.maxRetries=300 \
--conf spark.executor.memoryOverhead=512M \
--conf spark.yarn.executor.memoryOverhead=512 \
--conf spark.sql.shuffle.partitions=10000 \
--conf spark.sql.adaptive.enabled=true \
--conf spark.sql.adaptive.shuffle.targetPostShuffleInputSize=134217728 \
--conf spark.sql.parquet.compression.codec=gzip \
--conf spark.sql.orc.compression.codec=zlib \
--conf spark.ui.showConsoleProgress=true \
-f pro.sql

pro.sql is the business logic script.

Spark performance tuning-advanced

For readers willing to understand the underlying principles of Spark, this article sorts out the interaction diagrams of the three common task submission modes, standalone, Yarn-client, and Yarn-cluster, to help readers grasp Spark's core technical principles more intuitively and lay the groundwork for the advanced content that follows.

standalone


1) The application is submitted via spark-submit, and a DriverActor process is constructed through reflection;

2) The Driver process executes the user application, constructing a SparkConf and then a SparkContext;

3) While the SparkContext is being initialized, it constructs the DAGScheduler and TaskScheduler, and Jetty starts the web UI;

4) The TaskScheduler uses SparkDeploySchedulerBackend to communicate with the Master and request registration of the Application;

5) After receiving the request, the Master registers the Application, applies its resource scheduling algorithm, and notifies the Workers to start Executors;

6) The Workers start Executors for the application; once started, each Executor registers back with the TaskScheduler;

7) After all Executors have registered back with the TaskScheduler, the Driver finishes initializing the SparkContext;

8) The Driver continues executing the user application; every time an action is executed, a job is created;

9) The job is submitted to the DAGScheduler, which divides it into multiple stages (stage division algorithm) and creates a TaskSet for each stage;

10) The TaskScheduler submits each task in the TaskSet to an Executor for execution (task allocation algorithm);

11) Each time an Executor receives a task, it wraps the task in a TaskRunner and takes a thread from the Executor's thread pool to run it. (TaskRunner: copies the user code/operators/functions, deserializes them, and then executes the task.)

Yarn-client

1) The client sends a request to the ResourceManager (RM) to start the ApplicationMaster (AM);

2) The RM allocates a container on a NodeManager (NM) and starts the AM, which in this mode is actually an ExecutorLauncher;

3) The AM applies to the RM for containers;

4) The RM allocates containers to the AM;

5) The AM requests the NMs to start the corresponding Executors;

6) After the Executors start, they register back with the Driver process;

7) The subsequent stage division and TaskSet submission are similar to standalone mode.

Yarn-cluster

1) The client sends a request to the ResourceManager (RM) to start the ApplicationMaster (AM);

2) The RM allocates a container on a NodeManager (NM) and starts the AM;

3) The AM applies to the RM for containers;

4) The RM allocates containers to the AM;

5) The AM requests the NMs to start the corresponding Executors;

6) After the Executors start, they register back with the AM;

7) The subsequent stage division and TaskSet submission are similar to standalone mode.

After understanding the underlying interactions of the three common task submission modes above, this article will share GeTui's advanced approach to Spark performance tuning from three aspects: storage format, data skew, and parameter configuration.

Storage format (file format, compression algorithm)

As we all know, different SQL engines optimize differently for different storage formats: Hive leans toward ORC, while Spark leans toward Parquet. At the same time, in big data workloads, point lookups, wide-table queries, and large-table joins are relatively frequent, which means the file format should preferably be columnar and splittable. We therefore recommend columnar file formats based on Parquet and ORC, and compression algorithms based on gzip, snappy, and zlib. In terms of combinations, we recommend parquet+gzip and orc+zlib: these combinations provide both columnar storage and splittability, and compared with txt+gz, which is row-based and not splittable, they are much better suited to the big data scenarios above.

Taking about 500G of online data as an example, we ran performance tests of different storage file format and algorithm combinations under different cluster environments and SQL engines. The test data shows that, under the same resource conditions, the parquet+gz storage format is at least 60% faster than text+gz for multi-value queries and multi-table joins.

Combined with the test results, we sorted out the storage formats recommended under different cluster environments and SQL engines, as shown in the following table:

At the same time, we also tested the storage consumption of parquet+gz and orc+zlib. Taking a single historical partition of one table as an example, parquet+gz and orc+zlib save 26% and 49% of storage space respectively compared with txt+gz.

The complete test results are as follows:

It can be seen that parquet+gz and orc+zlib are indeed effective in reducing costs and improving efficiency. So, how to use these two storage formats? Proceed as follows:

➤Enable the compression codec for the specified file format in Hive and Spark

spark:
set spark.sql.parquet.compression.codec=gzip;
set spark.sql.orc.compression.codec=zlib;

hive:
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

➤Specify the file format when creating the table

parquet file format (SerDe, input/output classes)
CREATE EXTERNAL TABLE `test`(rand_num double)
PARTITIONED BY (`day` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
;


orc file format (SerDe, input/output classes)
CREATE EXTERNAL TABLE `test`(rand_num double)
PARTITIONED BY (`day` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
;

➤Online tuning

ALTER TABLE db1.table1_std SET TBLPROPERTIES ('parquet.compression'='gzip');
ALTER TABLE db2.table2_std SET TBLPROPERTIES ('orc.compression'='ZLIB');

➤Create a table via CTAS

create table tablename stored as parquet as select ……;
create table tablename stored as orc TBLPROPERTIES ('orc.compress'='ZLIB')  as select ……;

Data skew

Data skew is divided into map-side skew and reduce-side skew. This article focuses on reduce-side skew; common SQL operations such as group by and join are the hardest hit areas. When data skew occurs, the typical symptoms are: some tasks are significantly slower than others in the same batch, the data volume of some tasks is significantly larger than that of other tasks, some tasks OOM, and spark shuffle files are lost. In the example in the figure below, the duration column and the shuffleReadSize/Records column clearly show that the amount of data processed by some tasks increases significantly and they take much longer, indicating data skew:
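Before choosing a fix, it usually helps to confirm which keys are actually skewed. Below is a minimal diagnostic sketch in Spark SQL; the table name t and column join_key are hypothetical placeholders for your own data:

-- Hypothetical diagnostic query: count records per key and inspect the heaviest ones.
SELECT join_key, COUNT(*) AS cnt
FROM t
GROUP BY join_key
ORDER BY cnt DESC
LIMIT 20;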

How to solve data skew?

We have summarized 7 data skew solutions that can help you solve common data skew problems:

Solution 1: Use Hive ETL to preprocess data

That is, move the skew problem upstream in the data lineage, so that downstream users do not need to deal with it.

⁕This solution is suitable for downstream interactive businesses, such as second-level/minute-level data retrieval queries.
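As a hedged illustration, suppose a raw detail table ods.events (a hypothetical name) is skewed on user_id and downstream queries only need per-user counts. A daily Hive ETL job could pre-aggregate the data once, so downstream jobs no longer shuffle the skewed detail records:

-- Hypothetical ETL step: aggregate the skewed detail data once, upstream of all consumers.
INSERT OVERWRITE TABLE dws.user_event_stats PARTITION (day='${day}')
SELECT user_id, COUNT(*) AS event_cnt
FROM ods.events
WHERE day='${day}'
GROUP BY user_id;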

Solution 2: Filter out the few keys that cause skew

That is, remove the heavily skewed keys. This approach is generally used together with percentiles. For example, if 99.99% of ids each have no more than 100 records, then ids with more than 100 records can be considered for removal.

⁕This solution is quite practical in statistical scenarios; in detail-level scenarios, you need to check whether the filtered big keys are ones the business actually focuses on.
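A minimal sketch, assuming a hypothetical table t where the skewed value (say 'big_id') has already been identified and confirmed to be irrelevant to the business:

-- Hypothetical example: drop the known skewed key before the heavy aggregation or join.
SELECT id, COUNT(*) AS cnt
FROM t
WHERE id NOT IN ('big_id')
GROUP BY id;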

Solution 3: Improve the parallelism of shuffle operations

That is, dynamically adjust the spark.sql.shuffle.partitions parameter to increase the number of partitions written by shuffle write tasks, so that keys are distributed more evenly. In Spark SQL 2.3 this value defaults to 200. Developers can add the following parameters to the startup script to adjust it dynamically:

--conf spark.sql.shuffle.partitions=10000
--conf spark.sql.adaptive.enabled=true
--conf spark.sql.adaptive.shuffle.targetPostShuffleInputSize=134217728

⁕This scheme is very simple, but it only helps when keys are evenly distributed. For example, suppose there are originally 10 keys with 50 records each and only 1 partition, so a single task has to process 500 records. By increasing the number of partitions, each task processes 50 records and 10 tasks run in parallel, taking only 1/10 of the original time. However, this solution does little for big keys: if a big key has millions of records, they will still all be allocated to one task.

Solution 4: Convert reduce join to map join

This refers to performing the join on the map side, without a shuffle. Taking Spark as an example, the data of a small RDD can be sent to every Worker node (NM in Yarn mode) as a broadcast variable, and the join is performed on each Worker node.

⁕This solution is suitable for joining a small table with a large table (more than 100G of data). The default threshold for the small table here is 10M; small tables below the threshold can be distributed to the worker nodes. The upper limit can be adjusted, but it must be less than the memory allocated to the container.
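A minimal Spark SQL sketch, assuming hypothetical tables big and small joined on id: either raise the auto-broadcast threshold (as in the submission template above) or hint the join directly, so the small table is broadcast to the executors instead of being shuffled.

-- Raise the auto-broadcast threshold to 100MB (in bytes), or hint the join explicitly.
SET spark.sql.autoBroadcastJoinThreshold=104857600;

SELECT /*+ BROADCAST(s) */ b.id, b.value, s.name
FROM big b
JOIN small s
  ON b.id = s.id;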

Solution 5: Sample the skewed key and split the join operation

The following figure shows an example: table A joins table B; table A has a big key and table B does not; the big key's id is 1, with 3 records.

How is the split join performed?

First, split out id 1 from table A and table B separately; the remaining A' and B', which no longer contain the big key, are joined first at normal, non-skewed speed;

Then add a random prefix to the big-key rows of table A, expand the corresponding rows of table B by N times, join them separately, and remove the random prefix after the join;

Finally, union the two results; a SQL sketch follows below.

⁕The essence of this solution is to reduce the risk of a single task processing too much skewed data; it is suitable for cases with only a few big keys.
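A hedged Spark SQL sketch of the split, assuming hypothetical tables a and b joined on id, with 1 as the only big key and an expansion factor of N=3:

-- Part 1: join everything except the big key; this part is not skewed.
SELECT a.id, a.col_a, b.col_b
FROM a JOIN b ON a.id = b.id
WHERE a.id <> 1

UNION ALL

-- Part 2: salt the big key on the A side with a random prefix in [1, N],
-- expand the B side N times, then join on the salted key.
SELECT cast(split(t1.salted_id, '_')[1] AS int) AS id, t1.col_a, t2.col_b
FROM (
  SELECT concat(cast(floor(rand() * 3) + 1 AS int), '_', id) AS salted_id, col_a
  FROM a WHERE id = 1
) t1
JOIN (
  SELECT concat(n, '_', id) AS salted_id, col_b
  FROM b LATERAL VIEW explode(array(1, 2, 3)) tmp AS n
  WHERE id = 1
) t2
ON t1.salted_id = t2.salted_id;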

Solution 6: Join using a random prefix and an expanded RDD

For example, table A joins table B; take the case where table A has big keys and table B does not:

Add a random prefix in [1, n] to every record in table A, expand table B by N times, and join;

After the join is completed, remove the random prefix (see the sketch below).

⁕This scheme is suitable for cases with many big keys, but it also increases resource consumption.
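The full-table variant looks much like part 2 of the previous sketch, except that the salt is applied to every record of table a and every record of table b is expanded, again assuming hypothetical tables a and b, join key id, and N=3:

SELECT cast(split(t1.salted_id, '_')[1] AS int) AS id, t1.col_a, t2.col_b
FROM (
  -- Salt every key in A with a random prefix in [1, N].
  SELECT concat(cast(floor(rand() * 3) + 1 AS int), '_', id) AS salted_id, col_a
  FROM a
) t1
JOIN (
  -- Expand every row of B N times, once per possible prefix.
  SELECT concat(n, '_', id) AS salted_id, col_b
  FROM b LATERAL VIEW explode(array(1, 2, 3)) tmp AS n
) t2
ON t1.salted_id = t2.salted_id;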

Solution 7: Use a map-side combiner

That is, perform a combiner operation on the map side to reduce the amount of data pulled during shuffle.

⁕This scheme is suitable for scenarios such as accumulation and summation.

In actual scenarios, it is recommended that developers analyze the specific situation in detail; for complex problems, the above methods can also be used in combination.

Spark parameter configuration

For situations without data skew, we have summarized a parameter configuration reference table to help you optimize Spark performance. These settings are suitable for insights and applications over roughly 2T of data and basically meet the tuning needs of most scenarios.

Summary

Currently, Spark has evolved to 3.x, with the latest version being Spark 3.1.2, released on Jun 01, 2021. Many new features of Spark 3.x, such as dynamic partition pruning, major improvements to the Pandas API, and enhanced nested column pruning and pushdown, provide good ideas for further reducing costs and increasing efficiency. In the future, GeTui will continue to follow the evolution of Spark, and keep practicing and sharing.

