This article was written by Zhang Ying and Duan Xuehao from JD's Algorithm Service Department and proofread by Li Rui, Apache Hive PMC member and Alibaba technical expert. The main content includes:
- Background
- Flink SQL optimization
- Summary
1. Background
At present, the data processing pipeline of JD Search is shown in the figure above. As you can see, real-time and offline processing are separated: offline data is mostly processed with Hive/Spark, while real-time data is mostly processed with Flink/Storm.
This leads to the following situation: within a single business engine, users have to maintain two environments and two codebases, many common features cannot be reused, and data quality and consistency are hard to guarantee. Moreover, because the underlying data models of the streaming and batch sides are inconsistent, a lot of glue logic is needed; even just to keep the data consistent, large amounts of data comparison (such as year-over-year and month-over-month checks) and secondary processing are required, which is extremely inefficient and very error-prone.
Flink SQL, which supports unified batch and stream processing, can largely resolve this pain point, so we decided to introduce Flink to solve the problem.
For most jobs, and Flink jobs in particular, execution efficiency has always been the key concern of job optimization. With a daily data increment at the PB level at JD, job optimization is especially important.
Anyone who has written SQL jobs knows that in some cases a Flink SQL job will invoke the same UDF repeatedly, which is very unfriendly to resource-intensive tasks. In addition, execution efficiency is largely determined by factors such as shuffle, join, and the failover strategy. On top of that, debugging Flink jobs is very cumbersome, especially for companies where online machines are isolated from the development environment.
To address this, we implemented embedded Derby as Hive's metastore database (allowEmbedded). For task recovery, batch jobs have no checkpoint mechanism for failover, but Flink's region failover strategy allows batch jobs to recover quickly. This article also covers related optimizations such as object reuse.
2. Flink SQL optimization
1. UDF reuse
In Flink SQL tasks, the following situation can occur: if the same UDF appears in both the LogicalProject and the WHERE condition, the UDF is invoked multiple times (see https://issues.apache.org/jira/browse/FLINK-20887). If the UDF is CPU- or memory-intensive, this redundant computation severely hurts performance. We therefore want to cache the UDF result and reuse it on the next call. When designing the cache, the following points need to be considered (important: make sure the LogicalProject and the WHERE condition are chained into the same subtask):
- A TaskManager may run multiple subtasks, so the cache should be either thread level (THREAD LOCAL) or TaskManager level;
- To guard against situations where the cache-clearing logic fails, the cache must be cleared in the close method;
- To prevent unbounded memory growth, the chosen cache should allow its size to be actively controlled; as for the expiration time, it is recommended to make it configurable, but it should preferably not be shorter than the interval between successive UDF calls;
- As mentioned above, a TaskManager may run multiple subtasks, which effectively makes the TaskManager a multi-threaded environment. The cache must therefore be thread-safe, and whether additional locking is needed can then be decided based on the business.
Based on the above considerations, we use Guava Cache to cache UDF results and read directly from the cache on subsequent calls, which reduces the task's overhead as much as possible. Below is a simple example (it sets a maximum size and an expiration time, but adds no extra locking):
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.flink.table.functions.ScalarFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class RandomFunction extends ScalarFunction {

    private static final Logger profileLog = LoggerFactory.getLogger(RandomFunction.class);
    private static final AtomicInteger atomicInteger = new AtomicInteger();

    // Bounded, expiring cache of UDF results keyed by the input value.
    private static final Cache<String, Integer> cache = CacheBuilder.newBuilder()
            .maximumSize(2)
            .expireAfterWrite(3, TimeUnit.SECONDS)
            .build();

    public int eval(String pvid) {
        // Log the invocation count to observe how often the UDF is actually called.
        profileLog.error("RandomFunction invoked:" + atomicInteger.incrementAndGet());
        Integer result = cache.getIfPresent(pvid);
        if (null == result) {
            int tmp = (int) (Math.random() * 1000);
            cache.put(pvid, tmp);
            return tmp;
        }
        return result;
    }

    @Override
    public void close() throws Exception {
        super.close();
        // Clear the cache when the function is closed.
        cache.cleanUp();
    }
}
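For reference, here is a minimal usage sketch (the table and column names are illustrative, assuming a TableEnvironment named tableEnv): the UDF appears in both the projection and the WHERE clause, which is exactly the pattern that would otherwise trigger the duplicate invocation described above.
// Illustrative registration and use; without the cache, RandomFunction(pvid) may be evaluated more than once per row.
tableEnv.createTemporarySystemFunction("RandomFunction", RandomFunction.class);
tableEnv.executeSql(
        "SELECT RandomFunction(pvid) AS rnd "
                + "FROM db1.search_realtime_table_dump_p13 "
                + "WHERE RandomFunction(pvid) > 100");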
2. Unit Testing
You may be wondering why unit testing is listed under optimization. As everyone knows, debugging Flink jobs is very cumbersome, especially for companies where online machines are isolated. JD's local environment cannot reach the task servers, so in the early stages of debugging we spent a lot of time uploading jar packages and inspecting logs.
To reduce debugging time and increase developer efficiency, we implemented embedded Derby as Hive's metastore database (allowEmbedded), which can be regarded as an optimization of development time. The approach is as follows:
First, create the HiveConf:
public static HiveConf createHiveConf() {
    // HIVE_SITE_XML, HIVE_WAREHOUSE_URI_FORMAT and TEMPORARY_FOLDER (a JUnit TemporaryFolder)
    // are constants/fields defined in the test class.
    ClassLoader classLoader = HiveOperatorTest.class.getClassLoader();
    HiveConf.setHiveSiteLocation(classLoader.getResource(HIVE_SITE_XML));
    try {
        TEMPORARY_FOLDER.create();
        String warehouseDir = TEMPORARY_FOLDER.newFolder().getAbsolutePath() + "/metastore_db";
        String warehouseUri = String.format(HIVE_WAREHOUSE_URI_FORMAT, warehouseDir);

        HiveConf hiveConf = new HiveConf();
        hiveConf.setVar(
                HiveConf.ConfVars.METASTOREWAREHOUSE,
                TEMPORARY_FOLDER.newFolder("hive_warehouse").getAbsolutePath());
        // Point the metastore at an embedded Derby database.
        hiveConf.setVar(HiveConf.ConfVars.METASTORECONNECTURLKEY, warehouseUri);
        hiveConf.set("datanucleus.connectionPoolingType", "None");
        hiveConf.set("hive.metastore.schema.verification", "false");
        hiveConf.set("datanucleus.schema.autoCreateTables", "true");
        return hiveConf;
    } catch (IOException e) {
        throw new CatalogException("Failed to create test HiveConf to HiveCatalog.", e);
    }
}
Next, create the HiveCatalog (using reflection to invoke the constructor that allows an embedded metastore):
public static void createCatalog() throws Exception {
    Class<HiveCatalog> clazz = HiveCatalog.class;
    // The constructor with the allowEmbedded flag is not public, so it is invoked via reflection.
    Constructor<HiveCatalog> c1 = clazz.getDeclaredConstructor(
            String.class, String.class, HiveConf.class, String.class, boolean.class);
    c1.setAccessible(true);
    hiveCatalog = c1.newInstance("test-catalog", null, createHiveConf(), "2.3.4", true);
    hiveCatalog.open();
}
Create the TableEnvironment (same as the official documentation):
EnvironmentSettings settings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build();
TableEnvironment tableEnv = TableEnvironment.create(settings);
// Apply the parallelism setting to the table environment's configuration.
tableEnv.getConfig().getConfiguration().setInteger("table.exec.resource.default-parallelism", 1);
tableEnv.registerCatalog(hiveCatalog.getName(), hiveCatalog);
tableEnv.useCatalog(hiveCatalog.getName());
Finally, close the HiveCatalog:
public static void closeCatalog() {
    if (hiveCatalog != null) {
        hiveCatalog.close();
    }
}
In addition, building a suitable data set is also very important for unit testing. We implemented CollectionTableFactory so that we can build suitable data sets ourselves. Usage is as follows:
CollectionTableFactory.reset();
CollectionTableFactory.initData(Arrays.asList(Row.of("this is a test"), Row.of("zhangying480"), Row.of("just for test"), Row.of("a test case")));
StringBuilder sbFilesSource = new StringBuilder();
sbFilesSource.append("CREATE temporary TABLE db1.`search_realtime_table_dump_p13`(" + " `pvid` string) with ('connector.type'='COLLECTION','is-bounded' = 'true')");
tableEnv.executeSql(sbFilesSource.toString());
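Putting the pieces together, the helpers above can live in a single JUnit 4 test class. The following is a minimal sketch under that assumption (the class name, imports, and the tested SQL are illustrative, not the exact production test; CollectionTableFactory is the in-house factory mentioned above):
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;
import org.apache.flink.types.Row;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

import java.util.Arrays;

public class HiveOperatorTest {

    private static HiveCatalog hiveCatalog;
    private static TableEnvironment tableEnv;

    // ... createHiveConf(), createCatalog() and closeCatalog() from above go here ...

    @BeforeClass
    public static void setUp() throws Exception {
        createCatalog(); // embedded Derby-backed HiveCatalog
        EnvironmentSettings settings =
                EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build();
        tableEnv = TableEnvironment.create(settings);
        tableEnv.registerCatalog(hiveCatalog.getName(), hiveCatalog);
        tableEnv.useCatalog(hiveCatalog.getName());
        tableEnv.executeSql("CREATE DATABASE IF NOT EXISTS db1");
    }

    @AfterClass
    public static void tearDown() {
        closeCatalog();
    }

    @Test
    public void testQueryOnCollectionSource() {
        // Build an in-memory source with CollectionTableFactory, as described above.
        CollectionTableFactory.reset();
        CollectionTableFactory.initData(Arrays.asList(Row.of("zhangying480"), Row.of("just for test")));
        tableEnv.executeSql("CREATE temporary TABLE db1.`search_realtime_table_dump_p13`("
                + " `pvid` string) with ('connector.type'='COLLECTION','is-bounded' = 'true')");
        // Execute the SQL under test and print the result locally.
        tableEnv.executeSql("SELECT pvid FROM db1.search_realtime_table_dump_p13")
                .collect()
                .forEachRemaining(System.out::println);
    }
}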
3. Choice of join method
Traditional offline batch SQL (SQL over bounded data sets) has three basic join implementations: Nested-loop Join, Sort-Merge Join, and Hash Join.
| | Efficiency | Space | Remarks |
| --- | --- | --- | --- |
| Nested-loop Join | Poor | Large footprint | |
| Sort-Merge Join | Incurs sort/merge overhead | Small footprint | An optimization for ordered data sets |
| Hash Join | High | Large footprint | Suited for joins with a small table |
Nested-loop Join is the simplest and most direct approach: both data sets are loaded into memory, and nested loops compare the elements of the two data sets one by one against the join condition. Nested-loop Join has the worst time and space efficiency. It can be disabled with table.exec.disabled-operators: NestedLoopJoin.
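As a sketch, the same option can also be set programmatically on the assumed tableEnv from the earlier sections:
// Illustrative: disable the nested-loop join operator for this table environment.
tableEnv.getConfig().getConfiguration()
        .setString("table.exec.disabled-operators", "NestedLoopJoin");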
The following two figures show the effect before and after disabling it (if disabling does not seem to take effect, first check whether the join is an equi-join):
- Sort-Merge Join consists of two phases, Sort and Merge: the two data sets are first sorted separately, and then the two sorted data sets are traversed and matched, similar to the merge step of merge sort. (Sort-Merge Join requires both data sets to be sorted, but if the two inputs are already ordered it can serve as an optimization.)
Hash Join also consists of two phases: first one data set is converted into a hash table, and then the other data set is traversed and its elements are matched against the hash table.
- The first phase and the first data set are called the build phase and the build table, respectively;
- The second phase and the second data set are called the probe phase and the probe table, respectively.
Hash Join is efficient but memory-hungry. It is usually used as an optimization when one of the joined tables is a small table that fits in memory (spilling to disk is still allowed).
Note: Sort-Merge Join and Hash Join only apply to equi-joins (the join condition uses equality as the comparison operator).
Flink further refines these join strategies into:
| | Characteristics | When to use |
| --- | --- | --- |
| Repartition-Repartition strategy | Both data sets are partitioned and shuffled; if the data sets are large, this is very costly | The two data sets are of similar size |
| Broadcast-Forward strategy | All data of the small table is sent to the machines holding the large table's data | The two data sets differ greatly in size |
- Repartition-Repartition strategy: the two data sets of the join are partitioned on their keys with the same partitioning function, and the data is sent over the network;
- Broadcast-Forward strategy: the large data set is left untouched, while the other, much smaller data set is copied in full to the machines in the cluster that hold part of the large data set.
As we all know, shuffle in batch jobs is very time-consuming.
- If the two data sets differ greatly in size, the Broadcast-Forward strategy is recommended;
- If the two data sets are of similar size, the Repartition-Repartition strategy is recommended.
You can use table.optimizer.join.broadcast-threshold to set the table size threshold below which a table is broadcast. Setting it to -1 disables broadcast.
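For example, a programmatic sketch (the 10 MB threshold is purely illustrative):
// Illustrative: tables smaller than roughly 10 MB are broadcast; use -1 to disable broadcast joins.
tableEnv.getConfig().getConfiguration()
        .setLong("table.optimizer.join.broadcast-threshold", 10 * 1024 * 1024L);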
The following picture shows the effect before and after disabling:
4. Multiple input
In Flink SQL tasks, reducing shuffle can effectively improve throughput. In real business scenarios we often encounter situations where the upstream output already satisfies the required data distribution (for example, several consecutive join operators that share the same keys). In this case Flink's forward shuffle is redundant, and we would like to chain these operators together. Flink 1.12 introduces the multiple input feature, which can eliminate most unnecessary forward shuffles and chain the source operators together.
table.optimizer.multiple-input-enabled:true
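Equivalently, as a programmatic sketch on the assumed tableEnv:
// Illustrative: enable the multiple-input feature introduced in Flink 1.12.
tableEnv.getConfig().getConfiguration()
        .setBoolean("table.optimizer.multiple-input-enabled", true);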
The following figure shows the topology with and without multiple input enabled (operator chaining is already enabled):
5. Object reuse
Data transfer between upstream and downstream operators goes through serialization/deserialization/copying, which significantly affects the performance of a Flink SQL program. Performance can be improved by enabling object reuse. However, this is dangerous in the DataStream API, because the following may happen: modifying an object in the downstream operator accidentally affects the object held by the upstream operator.
Flink's Table/SQL API, however, is completely safe in this respect, and object reuse can be enabled as follows:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().enableObjectReuse();
Or by setting pipeline.object-reuse: true.
Why does enabling object reuse bring such a big performance improvement? In the Blink planner, data exchange between two operators of the same task ultimately calls BinaryString#copy. Looking at the implementation, BinaryString#copy needs to copy the bytes of the underlying MemorySegment, so avoiding this copy by enabling object reuse effectively improves efficiency.
The following figure shows the flame graph when object reuse is not enabled:
6. Failover strategy for SQL tasks
In batch mode, checkpoints and related features are unavailable, so the checkpoint-based failover strategy of real-time jobs cannot be applied to batch jobs. However, batch tasks can exchange data via blocking shuffle: when a task fails for whatever reason, all the data it needs is retained in the blocking shuffle, so only that task and the downstream tasks connected to it via pipeline shuffle need to be restarted:
jobmanager.execution.failover-strategy: region (operators that have already finished can be restored directly)
table.exec.shuffle-mode: ALL_EDGES_BLOCKING (shuffle strategy)
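A minimal sketch of applying these two settings (the failover strategy is a cluster-level option that normally goes into flink-conf.yaml; the shuffle mode can be set on the assumed tableEnv from the earlier sections):
// Cluster-level, normally configured in flink-conf.yaml:
// jobmanager.execution.failover-strategy: region

// Table-level: make every edge a blocking shuffle so failed regions can be restarted in isolation.
tableEnv.getConfig().getConfiguration()
        .setString("table.exec.shuffle-mode", "ALL_EDGES_BLOCKING");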
7. Shuffle
Shuffle in Flink is divided into pipeline shuffle and blocking shuffle.
- Pipeline shuffle performs well, but it requires more resources and is less fault-tolerant (the operator is assigned to the preceding region; for batch tasks, if this operator fails, recovery restarts from the preceding region).
- Blocking shuffle is the traditional batch shuffle, which writes data to disk. It is fault-tolerant but generates a lot of disk and network IO (if you want to play it safe, blocking shuffle is recommended). Blocking shuffle is further divided into hash shuffle and sort shuffle:
- If your disks are SSDs and the parallelism is not too high, you can choose hash shuffle. This shuffle method produces many files and random reads, which has a larger impact on disk IO;
- If your disks are SATA and the parallelism is relatively high, you can choose sort-merge shuffle. This shuffle produces fewer files, reads sequentially, and does not cause heavy disk IO, but it adds the overhead of sorting and merging.
The corresponding control parameters are:
table.exec.shuffle-mode: this option accepts several values; the default is ALL_EDGES_BLOCKING, which means all edges use blocking shuffle. You can also try POINTWISE_EDGES_PIPELINED, which means forward and rescale edges automatically use pipeline mode.
taskmanager.network.sort-shuffle.min-parallelism: setting this parameter lower than your parallelism enables sort-merge shuffle; this setting also depends on other factors, so configure it according to the official documentation.
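Putting the two knobs together in a sketch (values are illustrative; the sort-shuffle threshold is a cluster option that normally lives in flink-conf.yaml rather than in code):
// Table-level: let forward/rescale edges run pipelined while other edges stay blocking.
tableEnv.getConfig().getConfiguration()
        .setString("table.exec.shuffle-mode", "POINTWISE_EDGES_PIPELINED");

// Cluster-level, normally configured in flink-conf.yaml:
// taskmanager.network.sort-shuffle.min-parallelism: 1   (enables sort-merge shuffle for all parallelisms)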
3. Summary
This article introduced the optimizations JD has made for Flink SQL tasks, focusing on the choice of shuffle and join strategies, object reuse, UDF reuse, and more. In addition, we would like to thank Fu Haitao and all our colleagues in JD's Real-Time Computing R&D Department for their support and help.