By Cai, Senior Engineer at Aurora (Jiguang)

Preface

Spark was first deployed on the Jiguang big data platform in 2018 and, after several iterations, has gradually become the core engine for offline computing. At present, 20,000+ Spark tasks run on the Jiguang big data platform every day, and an average of 42,000 Spark SQL queries are executed daily. This article introduces some of the practical experience the Jiguang data platform has accumulated in using Spark SQL, covering the following aspects:

Application Practice of Spark Extension
Reconstruction and optimization of Spark Bucket Table
Practical plan for migrating from Hive to Spark SQL

1. Spark Extension application practice

Spark Extension, introduced in SPARK-18127, provides extension points for Spark Catalyst. With it, Spark users can plug custom implementations into the various stages of SQL processing, which is both powerful and efficient.

1.1 Lineage analysis

At Jiguang we have a self-built metadata management platform, and the relevant metadata is collected from the various data components. The analysis and collection of Spark SQL lineage is implemented through a custom Spark Extension.

Spark Catalyst processes SQL in multiple steps: parser, analyzer, optimizer, and planner, and steps such as the analyzer and optimizer are themselves divided into multiple phases. To obtain the most useful lineage information, we chose the final planner stage as the entry point. For this purpose, we implemented a planner strategy that analyzes the physical execution plan of Spark SQL, extracts metadata such as the tables being read and written, and stores it in the metadata management platform.
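A minimal sketch of this kind of extension, assuming a hypothetical LineageReporter as the sink to the metadata platform; the strategy inspects the plan being planned, collects table identifiers, and returns Nil so it never interferes with the actual planning:

import org.apache.spark.sql.{SparkSessionExtensions, Strategy}
import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.LogicalRelation

// Stand-in for the client that writes lineage to the metadata platform (hypothetical).
object LineageReporter {
  def report(tables: Seq[String]): Unit =
    if (tables.nonEmpty) println(s"lineage tables: ${tables.mkString(", ")}")
}

// Collects the tables referenced by the plan and reports them as lineage.
// Write targets (e.g. InsertInto commands) can be matched in the same way.
object LineageStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = {
    val tables = plan.collect {
      case r: HiveTableRelation => r.tableMeta.identifier.unquotedString
      case r: LogicalRelation   => r.catalogTable.map(_.identifier.unquotedString).getOrElse("")
    }.filter(_.nonEmpty)
    LineageReporter.report(tables)
    Nil // let the built-in strategies do the real planning
  }
}

// Registered on the cluster via spark.sql.extensions=com.example.LineageExtension
class LineageExtension extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit =
    ext.injectPlannerStrategy(_ => LineageStrategy)
}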

1.2 Permission verification

For data security, Aurora chose Ranger as its permission management component. In actual use, however, we found that the current community version of Ranger mainly provides access plugins for HDFS, HBase, Hive, and Yarn; the functionality required for Spark has to be implemented by ourselves. For this problem we again turned to Spark Extension for the secondary development of permissions. In the implementation, we drew on the implementation principle of the Ranger Hive plugin to realize permission verification when Spark SQL accesses Hive.
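A minimal sketch of the idea, assuming a hypothetical RangerAuthorizer that wraps the Ranger Hive plugin; a check rule runs after analysis and can veto the query by throwing an exception:

import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Stand-in for a wrapper around the Ranger Hive plugin (hypothetical).
object RangerAuthorizer {
  def allowed(user: String, db: String, table: String): Boolean = true // real Ranger check goes here
}

class AuthExtension extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit =
    ext.injectCheckRule { session =>
      (plan: LogicalPlan) => {
        val user = session.sparkContext.sparkUser
        plan.foreach {
          case r: HiveTableRelation =>
            val id = r.tableMeta.identifier
            if (!RangerAuthorizer.allowed(user, id.database.getOrElse("default"), id.table)) {
              throw new SecurityException(s"$user has no permission on ${id.unquotedString}")
            }
          case _ => // other plan nodes are not permission-relevant in this sketch
        }
      }
    }
}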

1.3 Parameter control

As more and more business users run Spark SQL on the data platform, we found that their familiarity with Spark varies widely, as does their understanding of Spark configuration parameters. To ensure the overall stability of the cluster, we intercept the Spark tasks submitted by business users, extract the configuration parameters they set, block the unreasonable ones, and give risk tips, effectively guiding users to go online with reasonable settings.
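A minimal sketch of the interception logic, with hypothetical limits and helper names; the real rules on our platform are richer, but the shape is the same: extract the submitted parameters, drop the unreasonable ones, and surface risk tips:

import scala.util.Try

// Hypothetical guard applied to the parameter map extracted from a submitted task.
object ConfGuard {
  private val maxExecutorMemoryGb  = 16   // illustrative upper bounds
  private val maxExecutorInstances = 200

  def check(conf: Map[String, String]): Map[String, String] = {
    val warnings = scala.collection.mutable.Buffer[String]()
    val cleaned = conf.filter {
      case ("spark.executor.memory", v) if gb(v) > maxExecutorMemoryGb =>
        warnings += s"spark.executor.memory=$v exceeds ${maxExecutorMemoryGb}g and was removed"
        false
      case ("spark.executor.instances", v) if Try(v.toInt).getOrElse(0) > maxExecutorInstances =>
        warnings += s"spark.executor.instances=$v exceeds $maxExecutorInstances and was removed"
        false
      case _ => true
    }
    warnings.foreach(w => println(s"[RISK] $w")) // surfaced to the user as a risk tip
    cleaned
  }

  // Very rough parser: only recognises values like "16g"; anything else passes through.
  private def gb(v: String): Int =
    if (v.toLowerCase.endsWith("g")) Try(v.dropRight(1).trim.toInt).getOrElse(0) else 0
}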

2. Improvement and optimization of Spark Bucket Table

In our Spark practice we also pay close attention to excellent solutions from other companies in the industry. In 2020, drawing on ByteDance's optimization ideas for the Spark Bucket Table, we completed the following optimization items:

Compatibility between Spark Bucket Table and Hive Bucket Table
Support for Bucket Join when the bucket numbers of the two tables are integer multiples of each other
Support for Bucket Join when the Join fields are a superset of the Bucket fields

The above three optimizations enrich the applicable scenarios of Bucket Join, allowing more Join and Aggregate operations to avoid shuffles and effectively improving the execution efficiency of Spark SQL. Once the optimizations were completed, how to better carry out the business transformation and promotion became our next concern.

By analyzing the platform's historical SQL execution records, we found that joins between user IDs and device IDs are a very high-frequency operation. On this basis, we analyzed the metadata collected by the SQL lineage analysis described earlier, sorted out each table's high-frequency Join and Aggregate fields, and identified the most suitable Bucket Column for each table. With the support of this metadata, we promoted and carried out the Bucket Table transformation.
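For reference, this is roughly what a bucketed table on such a high-frequency join key looks like with stock Spark SQL (table, column, and bucket-number choices here are illustrative, not our production DDL); the three optimizations listed above extend the cases in which such tables can join without a shuffle:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS dw")

// Both sides bucketed on the join key with the same bucket number.
spark.sql("""
  CREATE TABLE IF NOT EXISTS dw.device_profile (
    device_id STRING, user_id STRING, tags STRING
  ) USING parquet
  CLUSTERED BY (device_id) INTO 512 BUCKETS
""")
spark.sql("""
  CREATE TABLE IF NOT EXISTS dw.device_events (
    device_id STRING, event_time BIGINT, event STRING
  ) USING parquet
  CLUSTERED BY (device_id) INTO 512 BUCKETS
""")

// With bucketing enabled, this join can be planned without shuffling either side
// on device_id; check the physical plan for the absence of an Exchange.
spark.conf.set("spark.sql.sources.bucketing.enabled", "true")
spark.sql("""
  SELECT a.device_id, a.user_id, b.event_time
  FROM dw.device_profile a JOIN dw.device_events b
    ON a.device_id = b.device_id
""").explain()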

3. Migrating from Hive to Spark SQL

With the rapid development of the company's business, the number of SQL tasks submitted on the data platform keeps growing, which poses new challenges in terms of task execution time and computing-resource consumption. For these reasons we set the goal of migrating Hive tasks to Spark SQL, and summarized the following requirements:

How to better locate which Hive tasks can be migrated and which cannot
How to let business departments migrate from Hive to Spark SQL without being aware of the change
How to conduct a comparative analysis to confirm the operation effect before and after the task migration

3.1 Implementation of Hive Migration Analysis Program

When migrating a business department's jobs, we first need to know which jobs belong to that department. Since Azkaban records executor information when it executes a specific job, we can infer from the executor which jobs are involved. The analysis program uses some of the metadata system's tables together with Azkaban's database tables to collect how many Hive jobs the department to be migrated has, how many SQL statements each Hive job contains, and what the SQL syntax pass rate is. Of course, the specific execution times and other information in Azkaban also need to be checked during the migration, which helps us roughly judge resource consumption when fine-tuning parameters.

Directly checking online whether a given SQL statement conforms to Spark semantics requires the relevant read and write permissions, and it is not safe to open those permissions to the analysis program. So the idea is to use the table schema information stored in the metadata system together with the SQL collected from business jobs on Azkaban: as long as we have all the table information a SQL statement needs, we can rebuild the table structures locally and analyze whether the SQL conforms to Spark semantics (of course the online environment differs from the local one, for example in available functions, but in most cases this is not a problem). A sketch of this check follows Figure 3-1-1.

Figure 3-1-1: SQL pass rate obtained for one data department by the analysis program
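A minimal sketch of the local check, under the assumption that the table DDL has already been reconstructed from the metadata system; syntax is checked with the SQL parser and semantics by running only the analyzer against the locally rebuilt schemas, so nothing is executed:

import org.apache.spark.sql.{AnalysisException, SparkSession}
import org.apache.spark.sql.catalyst.parser.ParseException

// Local session with its own catalog; no access to the online cluster is needed.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("hive-migration-analyzer")
  .getOrCreate()

// ddls: CREATE TABLE statements rebuilt from the metadata system;
// sqls: the SQL statements collected from the department's Azkaban jobs.
def sparkPassRate(ddls: Seq[String], sqls: Seq[String]): Double = {
  ddls.foreach(spark.sql) // rebuild the table structures locally

  def passes(sql: String): Boolean =
    try {
      val parsed = spark.sessionState.sqlParser.parsePlan(sql) // syntax check
      spark.sessionState.executePlan(parsed).analyzed          // analysis (semantic) check only
      true
    } catch {
      case _: ParseException | _: AnalysisException => false
    }

  sqls.count(passes).toDouble / math.max(sqls.size, 1)
}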

3.2 Transparent switching of the SQL execution engine

At present, the main way business users access Hive is to connect to HiveServer2 through Beeline. Since Livy also provides a thriftserver module, Beeline can connect directly to Livy as well. The migration strategy is therefore to first send SQL that conforms to Spark syntax to Livy for execution, and to switch to Hive if that execution fails.

Beeline can obtain the user's SQL. When Beeline starts, a Livy session is created through the Thrift interface, and the user's SQL is sent to Livy for execution. During execution, progress and other information can be obtained by querying Livy. One job corresponds to one session, and each start of Beeline corresponds to one session; when the job finishes or Beeline is closed, the Livy session is closed. (If Spark cannot execute the SQL successfully, it falls back to the original Hive logic.)


Figure 3-2-1

With the above switching idea, we began to modify the design of the Beeline program.

The important classes of Beeline are shown in Figure 3-2-2. The Beeline class is the startup class: it obtains the user's command-line input and calls the Commands class to execute it, and Commands is responsible for calling the JDBC interface to execute the SQL and obtain the results. The one-way calling process is shown in Figure 3-2-3.


Figure 3-2-2

Figure 3-2-3

As can be seen from Figure 3-2-2 and Figure 3-2-3, all operations go through the DatabaseConnection object, which is held by the DatabaseConnections object. By adapting a switching strategy around the DatabaseConnections object, we can therefore switch between multiple computing engines, that is, obtain a different connection, without modifying other code (a sketch of the fallback follows Figure 3-2-4).


Figure 3-2-4
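A minimal sketch of the fallback at the JDBC level; Livy's thriftserver speaks the HiveServer2 protocol, so the same Hive JDBC driver can connect to both. The URLs and ports here are illustrative, and in the real implementation this logic sits behind the DatabaseConnections adaptation rather than in standalone code:

import java.sql.{Connection, DriverManager}

Class.forName("org.apache.hive.jdbc.HiveDriver")

// Illustrative endpoints: Livy thriftserver first, HiveServer2 as the fallback.
val livyUrl = "jdbc:hive2://livy-host:10090/default"
val hiveUrl = "jdbc:hive2://hiveserver2-host:10000/default"

def runWithFallback(sql: String, user: String): Unit = {
  def runOn(url: String): Unit = {
    val conn: Connection = DriverManager.getConnection(url, user, "")
    try {
      val stmt = conn.createStatement()
      try stmt.execute(sql) finally stmt.close()
    } finally conn.close()
  }
  try runOn(livyUrl) // SQL that conforms to Spark syntax goes to Livy first
  catch {
    case e: Exception =>
      println(s"Spark execution failed (${e.getMessage}), falling back to Hive")
      runOn(hiveUrl) // fall back to the original Hive logic
  }
}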

3.3 Task migration blacklist

As mentioned earlier, even when a Hive task passes the SQL analysis program and the migration program submits it to Spark through Livy, execution may still fail; in that case we fall back to Hive to keep the task stable. However, SQL fails for many reasons, and some SQL really does run better on Hive. Executing it with Spark SQL first and then again with Hive every time would hurt the efficiency of the task. For this reason we developed a blacklist function for the migration program, to ensure that each SQL statement finds the execution engine that really suits it. Considering that Beeline is a lightweight client, the recognition is done on the livy-server side: we developed an HBO-like function that adds such abnormal SQL to the blacklist to save execution time for migrated tasks.

Goal: abnormal SQL identification based on HBE (History-Based Execution)

With the above goal, we mainly used the following methods to identify blacklisted SQL and decide whether to switch engines (a sketch follows the list):

SQL recognition is limited to the same appName (restricting the recognition scope to avoid misidentification)
A post-order traversal of the SQL abstract syntax tree is obtained and an MD5 value is generated from it as the unique identifier of the SQL
SQL whose execution has failed more than N times is written into the blacklist
On the next execution, the structural tree characteristics of the SQL are compared against the blacklist according to these rules
SQL in the blacklist is not switched to Spark SQL
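A minimal sketch of the fingerprint and blacklist bookkeeping, kept in memory for illustration (the real version lives on the livy-server side and persists its state); N and the helper names are illustrative:

import java.security.MessageDigest

import scala.collection.mutable

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

object SqlBlacklist {
  private val maxFailures = 3                        // the N in the rule above (illustrative)
  private val failures  = mutable.Map[String, Int]() // fingerprint -> failure count
  private val blacklist = mutable.Set[String]()

  // Post-order traversal of the logical AST, keeping only node names so that
  // changes to literals or formatting do not change the fingerprint.
  private def postOrder(node: LogicalPlan): Seq[String] =
    node.children.flatMap(postOrder) :+ node.nodeName

  def fingerprint(spark: SparkSession, appName: String, sql: String): String = {
    val plan  = spark.sessionState.sqlParser.parsePlan(sql)
    val bytes = (appName +: postOrder(plan)).mkString("|").getBytes("UTF-8")
    MessageDigest.getInstance("MD5").digest(bytes).map("%02x".format(_)).mkString
  }

  def recordFailure(fp: String): Unit = {
    val n = failures.getOrElse(fp, 0) + 1
    failures(fp) = n
    if (n > maxFailures) blacklist += fp
  }

  def shouldTrySpark(fp: String): Boolean = !blacklist.contains(fp)
}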

3.4 Migration results

After this year's migration work, the number of HSQL tasks dropped by more than 50% at its peak (it has since rebounded with this year's business growth).


4. Application of Spark 3.X AQE

The default Spark version used by Jiguang has been upgraded from 2.X to 3.X, and the AQE (Adaptive Query Execution) feature of Spark 3.X also helps us use Spark better.

Configuration optimizations from our practice (a code sketch of setting these parameters follows the list):

Spark 3.0.0 parameters

Dynamically merge shuffle partitions

spark.sql.adaptive.coalescePartitions.enabled true

spark.sql.adaptive.coalescePartitions.minPartitionNum 1

spark.sql.adaptive.coalescePartitions.initialPartitionNum 500

spark.sql.adaptive.advisoryPartitionSizeInBytes 128MB

Dynamically optimize data skew. Considering our actual data characteristics, we set skewedPartitionFactor to 1: with this setting, a shuffle partition is treated as skewed as long as it is larger than the median partition size and also exceeds the threshold below.

spark.sql.adaptive.skewJoin.enabled true

spark.sql.adaptive.skewJoin.skewedPartitionFactor 1

spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes 512MB
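Equivalently, the same settings can be applied when building the session in code; note that AQE itself must also be switched on via spark.sql.adaptive.enabled, which is off by default in Spark 3.0:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-tuned-job")
  .config("spark.sql.adaptive.enabled", "true")
  // dynamically coalesce shuffle partitions
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")
  .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "500")
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")
  // dynamically split skewed shuffle partitions
  .config("spark.sql.adaptive.skewJoin.enabled", "true")
  .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "1")
  .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "512MB")
  .getOrCreate()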

5. Follow-up planning

For the Spark tasks currently running online, we are developing a full-link Spark monitoring platform. As part of our big data operation and maintenance platform, it will collect and monitor the running status of online Spark tasks. We hope that through this platform we can promptly locate Spark tasks that waste resources, write large numbers of small files, or run slowly, and then carry out targeted optimization, so that the data platform runs more efficiently.

