An introduction to one of the few EB-scale distributed data platforms in the industry. MaxCompute supports the execution of tens of millions of distributed jobs every day. These jobs have varied characteristics, ranging from ultra-large jobs with hundreds of thousands of computing nodes to small and medium-sized distributed jobs. Different users have different expectations for jobs of different scales and characteristics in terms of running time, resource utilization efficiency, and data throughput. As one of the core technologies of the MaxCompute execution engine, DAG not only provides a unified dynamic execution framework at the bottom layer, but also implements a hybrid online/offline execution mode (Bubble Execution), achieving a balance between extreme performance and efficient resource utilization.
As one of the few EB-scale distributed data platforms in the industry, the MaxCompute system supports the execution of tens of millions of distributed jobs every day. At this magnitude, the job characteristics the platform must support are inevitably diverse: there are ultra-large jobs with hundreds of thousands of computing nodes, unique to the "Ali-scale" big data ecosystem, as well as small and medium-sized distributed jobs. At the same time, different users have different expectations for jobs of different scales and characteristics in terms of running time, resource utilization efficiency, and data throughput.
Fig.1 MaxCompute online data analysis
For jobs of different scales, the MaxCompute platform currently provides two different operating modes, summarized and compared below:
Fig.2 Offline (batch) mode vs. integrated scheduling quasi-real-time (smode) mode
As can be seen from the figure above, offline jobs and integrated-scheduling quasi-real-time jobs differ significantly in scheduling method, data transmission, and resource usage. The two modes represent two extremes: one optimizes throughput and resource utilization for scenarios that process massive amounts of data; the other reduces execution latency for medium and small amounts of data by pulling up all computing nodes in advance (further accelerated by direct data transmission and other means). These differences ultimately show up in execution time, job resource utilization, and other metrics. Naturally, a mode whose primary optimization goal is high throughput and one whose goal is low latency will exhibit very different performance profiles. Taking the 1TB TPCH standard benchmark as an example and comparing the two dimensions of execution time (performance) and resource consumption, the quasi-real-time mode (SMODE) has a very obvious advantage in performance, but that advantage is not cost-free. In this specific TPCH scenario, the integrated execution of SMODE achieves a 2.3X performance improvement at the cost of 3.2X the system resources (cpu * time).
Fig.3 Performance/resource consumption comparison: offline (batch) mode vs. integrated scheduling quasi-real-time (smode) mode
This observation is not unexpected; to some extent it is by design. Take the DAG generated by a typical SQL query in the figure below: all computing nodes are pulled up at job submission. This scheduling method allows data to be pipelined (when needed), which may speed up data processing, but not every execution plan can pipeline data between all of its upstream and downstream computing nodes. In fact, for many jobs, every computing node other than the root of the DAG (the M node in the figure below) wastes resources to some degree by idling while it waits for input.
Fig.4 In the quasi-real-time (smode) mode of integrated scheduling, possible inefficient use of resources
The resource inefficiency caused by this idling is especially obvious when the data processing flow contains a barrier operator that cannot be pipelined, and when the DAG is deep. Of course, for scenarios where the goal is to minimize job running time, spending more resources to obtain the best possible performance is reasonable. In fact, in some business-critical online service systems, average single-digit CPU utilization is not uncommon, the price paid to guarantee that the service can always respond quickly and absorb peak loads. But for a distributed system at the scale of a computing platform, can we achieve a better balance between extreme performance and efficient resource utilization?
The answer is yes. This is the hybrid computing mode we introduce here: Bubble Execution.
1. Overview of Bubble Execution
The core architectural idea of the DAG framework is a clean separation between the logical and physical layers of the execution plan. A physical execution graph is materialized by assigning physical characteristics (such as data transmission medium, scheduling timing, and resource characteristics) to the nodes and edges of the logical graph. Comparing the batch mode and smode described in Fig.2, DAG already provides both the unified offline mode and the quasi-real-time integrated execution mode on top of one flexible scheduling and execution framework. As shown in the figure below, by adjusting the physical characteristics of computing nodes and data connection edges, not only can the two existing computing modes be expressed cleanly, but a more generalized extension also opens up a new hybrid operating mode: Bubble Execution.
Fig.5 Multiple calculation modes on the DAG framework
Intuitively, if we regard a Bubble as one large scheduling unit, resources for everything inside the Bubble are requested together, and data between upstream and downstream nodes inside it is transmitted directly through the network or memory. In contrast, data on connection edges between Bubbles is transmitted by landing on disk. Offline and quasi-real-time job execution can then be seen as the two extreme cases of Bubble execution: offline mode is the special case where every stage is its own single-node bubble, while the quasi-real-time framework plans all computing nodes of the job into one big Bubble and represents the opposite extreme of integrated scheduled execution. By unifying the two computing modes onto one set of scheduling and execution infrastructure, DAG AM makes it possible to combine the advantages of both, laying the foundation for the introduction of Bubble Execution.
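To make this spectrum concrete, here is a minimal Python sketch that treats a scheduling unit as a set of vertices plus two physical properties. All type and function names are hypothetical illustrations, not MaxCompute identifiers:

```python
from dataclasses import dataclass
from enum import Enum

class ResourceType(Enum):
    COLD = "on-demand"    # requested from the Resource Manager at run time
    WARM = "pre-warmed"   # resident, pre-pulled worker pool

class TransportType(Enum):
    DISK = "disk"                 # data lands on disk between nodes
    DIRECT = "network-or-memory"  # pipelined between live workers

@dataclass
class ScheduleUnit:
    """A set of DAG vertices whose resources are requested together."""
    vertices: list
    resources: ResourceType
    internal_transport: TransportType

def batch_plan(stages):
    # Offline (batch) mode: every stage is its own single-stage unit.
    return [ScheduleUnit([s], ResourceType.COLD, TransportType.DISK)
            for s in stages]

def smode_plan(stages):
    # Quasi-real-time mode: the whole DAG is one big unit.
    return [ScheduleUnit(list(stages), ResourceType.WARM, TransportType.DIRECT)]

# Bubble Execution picks any point in between: several multi-stage units
# (bubbles), each gang-scheduled with direct transfer inside the unit and
# disk transfer on the edges between units.
```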
Bubble Execution uses flexible, adaptive subgraph (Bubble) cutting to provide a more fine-grained, more general scheduling and execution method between the two existing extremes, trading off job performance against resource utilization. After analyzing information such as input data volume, operator characteristics, and job scale, DAG's Bubble execution mode can split an offline job into multiple Bubbles, making full use of network/memory direct connections and computing-node warm-up inside each Bubble to improve performance. Under this segmentation, the computing nodes of a DAG may all be cut into one Bubble, cut into different Bubbles according to their position in the DAG, or left out of any Bubble entirely (still running in traditional offline mode). This highly flexible hybrid operating mode adapts to the characteristics of the wide variety of jobs online, which is of great significance in actual production:
- Bubble mode makes it possible to accelerate more jobs: integrated scheduling has a one-size-fits-all admission condition based on overall job scale (the online default is 2000 nodes). This is partly to share limited resources fairly, and partly to control the cost of node failures. However, for medium and large jobs whose overall scale exceeds this admission limit, internal subgraphs may still be of a suitable scale and could be accelerated by data pipelining and similar methods. In addition, some computing nodes cannot execute on the pre-warmed quasi-real-time resource pool because of their own characteristics (for example, user logic such as UDFs that requires a security sandbox). Under the current black-and-white model, a job containing even one such computing node cannot execute in accelerated mode at all. Bubble mode solves these problems much better.
- Bubble mode enables the two online resource pools to interconnect: the current offline resources (cold) and the quasi-real-time resource pool (warm) are two kinds of online resources with different characteristics, completely isolated and managed separately. This separation can waste resources. For example, while a large-scale job waits for offline resources (because the quasi-real-time resource pool is entirely unavailable to it), the quasi-real-time pool may sit idle, and vice versa. Bubble mode can mix different resources within a job, so that the two pools supplement each other, cutting peaks and filling valleys.
- Bubble mode can improve overall resource utilization: from the perspective of resource utilization, for medium-sized jobs that meet the admission conditions of quasi-real-time mode, the integrated scheduling of quasi-real-time mode improves running speed but objectively causes a certain degree of resource idling and waste (especially when the DAG is deep and the computing logic contains barriers). In such cases, disassembling the integrated model into multiple Bubbles according to node count, computation barriers, and other conditions can effectively reduce the nodes' idle consumption, and with reasonable splitting conditions, the performance loss can be kept low as well.
- Bubble mode can effectively reduce the cost of single-node failures: in integrated quasi-real-time execution, because of its pipelined data transmission, the fault-tolerance granularity of the job is tightly coupled to its scheduling granularity: both are all-in-one. In other words, as soon as one node fails, the entire job must be rerun. Since the probability of a node failure during execution grows with job scale, such failover granularity inevitably limits the maximum job scale that can be supported. Bubble mode provides a better balance: the failure of a single computing node affects, at most, only the nodes within the same Bubble. Bubble mode also handles the various failover scenarios with fine granularity, as we describe below.
We can evaluate the effect of the Bubble execution mode visually using the standard TPCH-1TB benchmark. With the upper computing engine (MaxCompute optimizer and runtime, etc.) unchanged and the bubble size capped at 500 (the specific bubble segmentation rules are described below), we compare the Bubble execution mode, the standard offline mode, and the quasi-real-time mode on performance (latency) and resource consumption (cpu * time):
Fig.6.a Performance (Latency) comparison: Bubble mode vs offline (batch) mode vs integrated scheduling quasi-real-time (smode) mode
In terms of running time, Bubble mode is clearly far better than offline mode (an overall 2X performance improvement), and compared with the quasi-real-time integrated scheduling mode, Bubble's execution performance is not significantly worse. Of course, for queries that can make very effective use of data pipelining (such as Q5 and Q8), quasi-real-time jobs still hold an advantage. However, SMODE's advantage in execution time does not come for free. When resource consumption is also considered, the figure below shows that the performance gains of quasi-real-time jobs rest on resource consumption far greater than Bubble mode's. And while Bubble's performance is far better than offline mode's, its resource consumption is roughly the same.
Fig.6.b Resource consumption (cpu * time) comparison: Bubble mode vs offline (batch) mode vs integrated scheduling quasi-real-time (smode) mode
Taken together, Bubble Execution can well combine the advantages of batch mode and quasi real-time mode:
- In terms of execution time, for every query in the TPCH test set, Bubble mode runs faster than batch mode; the total time for the 22 queries drops by nearly 2X, close to the time consumption of service mode;
- In terms of resource consumption, Bubble mode is basically on par with batch mode, and significantly lower than service mode, an overall reduction of 2.6X.
Fig.6.c Overall comparison between Bubble mode and offline/quasi real-time mode
It is worth noting that in the TPCH benchmark comparison above, we simplified the bubble segmentation condition: we applied only the overall limit of 500, without fully considering barriers and other conditions. If bubble segmentation were tuned further, for example by keeping nodes whose data can be effectively pipelined inside the same bubble whenever possible, the execution performance and resource utilization of jobs could improve further. This is something we pay close attention to when rolling the mode out in the actual production system. See Section 3 for the actual online results.
Having covered the overall design ideas and architecture of the Bubble execution mode, let's dive into the implementation details of Bubble mode and the concrete work required to bring this new hybrid execution mode online.
2. Segmentation and execution of Bubble
Jobs using Bubble Execution (hereafter, Bubble jobs) use a DAG master (a.k.a. Application Master, AM) to manage the entire DAG life cycle, just like traditional offline jobs. The AM is responsible for reasonable bubble segmentation of the DAG, as well as the corresponding resource application and scheduling. Overall, computing nodes inside a Bubble follow the computation-acceleration principles: pre-pulled computing nodes started simultaneously, and pipelined acceleration of data transmission via memory/network direct transfer. Computing nodes outside any bubble execute in the classic offline mode, and data on connection edges that are not inside a bubble (including edges crossing a bubble boundary) is transmitted by landing on disk.
Fig.7 Hybrid Bubble execution mode
Bubble segmentation determines the execution time and resource utilization of the job; it must weigh the concurrency scale of the computing nodes, the attributes of the operators inside each node, and other information. After segmentation, bubble execution concerns how the execution of nodes is organically combined with the pipeline/barrier shuffle of data. These two aspects are described separately below.
2.1 Bubble segmentation principle
The core idea of Bubble Execution is to split an offline job into multiple Bubbles for execution. To cut bubbles that are conducive to efficient overall execution of the job, several factors need to be weighed together:
- The operator characteristics inside a computing node: for a scheduling mode that pulls up all computing nodes of a bubble simultaneously, whether data can be effectively pipelined between the upstream and downstream nodes inside the bubble largely determines whether downstream nodes in the bubble will waste resources by idling. Therefore, when a node contains an operator with barrier characteristics that may block data pipelining, the segmentation logic considers not cutting that node and its downstream into the same bubble.
- The number of computing nodes in a single Bubble: as discussed earlier, when an integrated resource application covers too many computing nodes, resources may be impossible to obtain, and even when they can be obtained, the failure cost may be uncontrollable. Limiting the size of a Bubble avoids the negative effects of an excessively large integrated execution.
- The direction of iteration when aggregating computing nodes into Bubbles: given the bubble size limit, cutting bubbles top-down versus bottom-up may lead to different results. For most online jobs, the processed data volume narrows as the job progresses and the corresponding DAG is shaped like an inverted triangle, so the bottom-up algorithm is used by default: iteration starts from the vertices farthest from the root.
Among the above factors, the barrier attribute of an operator is given by the upper computing engine (e.g., the MaxCompute optimizer). Generally speaking, operators that rely on global sort operations (such as MergeJoin, SortAggregate, etc.) are considered to cause a data barrier, while operators with hash-based characteristics are friendlier to pipelining. For the number of computing nodes allowed in a single Bubble, based on our analysis of the characteristics of online quasi-real-time jobs and actual grayscale experiments with Bubble jobs, the default upper limit chosen is 500. This is a reasonable value in most scenarios: it ensures the full set of resources can be obtained relatively quickly, and because the amount of data processed is roughly positively correlated with the DoP, a bubble of this size generally does not run into memory-overrun problems. Of course, these parameters and configurations can be fine-tuned at the job level through configuration, and the Bubble execution framework will also provide the ability to adjust them dynamically while the job runs.
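As a rough illustration of how such barrier hints and the size cap might be encoded (a hypothetical sketch: the authoritative barrier marking comes from the optimizer, and the operator list here is only an illustrative subset drawn from the text):

```python
# Illustrative subset only; the real barrier attribute is supplied by the
# upper engine (the MaxCompute optimizer), not by a static list.
BARRIER_OPERATORS = {"MergeJoin", "SortAggregate"}   # rely on global sort

def is_barrier_vertex(vertex_operators):
    """A vertex containing any barrier operator may block pipelining, so
    segmentation avoids cutting it and its downstream into one bubble
    (its output edges are initialized as sequential; see below)."""
    return any(op in BARRIER_OPERATORS for op in vertex_operators)

DEFAULT_BUBBLE_TASK_LIMIT = 500   # online default; tunable per job
```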
In the DAG system, one of the physical attributes of a connection edge is whether its upstream and downstream nodes have a strict run-order dependency. In traditional offline mode, upstream and downstream run one after another; an edge with this sequential attribute we call a sequential edge. Upstream and downstream nodes inside a bubble are scheduled to run simultaneously; an edge connecting such nodes we call a concurrent edge. Note that in the bubble scenario this concurrent/sequential property usually coincides with the physical property describing the data transmission method (network/memory direct transfer vs. data landing on disk), but the two remain separate physical attributes: when necessary, data on a concurrent edge can, for example, still be transferred by landing on disk.
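The following sketch models these two edge attributes as independent properties; the class and field names are illustrative, under the assumption stated above that timing and transport can diverge:

```python
from dataclasses import dataclass
from enum import Enum

class EdgeTiming(Enum):
    SEQUENTIAL = 1   # downstream scheduled only after upstream (partially) completes
    CONCURRENT = 2   # upstream and downstream are pulled up together

class EdgeTransport(Enum):
    DISK = 1         # data persisted between nodes
    DIRECT = 2       # network/memory direct transfer

@dataclass
class EdgeProperties:
    timing: EdgeTiming
    transport: EdgeTransport
    # Fraction of upstream tasks that must finish before the downstream of
    # a sequential edge is scheduled (default: all of them).
    min_fraction: float = 1.0

# Inside a bubble the two attributes usually coincide (CONCURRENT + DIRECT),
# but they stay independent: a concurrent edge can still fall back to DISK
# transport when necessary.
```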
Based on this layered abstraction, the Bubble segmentation algorithm is essentially a process of trying to aggregate the vertices of the DAG, restoring concurrent edges that do not meet the bubble admission conditions back into sequential edges; in the end, each subgraph connected by concurrent edges is a bubble. Here we use a concrete example to show how the Bubble segmentation algorithm works. Suppose we have the DAG shown in the figure below, where circles represent vertices and the number in each circle is the actual concurrency of the corresponding computing node. V1 and V3 are marked as barrier vertices at job submission because they contain barrier operators. The lines between circles are the upstream-downstream connection edges: orange lines are the (initial) concurrent edges and black lines are sequential edges. The output edges of a barrier vertex are all initialized as sequential edges, and all other edges default to concurrent edges.
Fig.8 Sample DAG diagram (initial state)
Starting from this initial DAG, following the overall principles introduced above plus some implementation details described at the end of this chapter, multiple rounds of the algorithm iterate on the initial state shown above and finally produce the Bubble segmentation result below: two Bubbles, Bubble#0 [V2, V4, V7, V8] and Bubble#1 [V6, V10], with the remaining nodes judged to run in offline mode.
Fig.9 Sample DAG graph Bubble segmentation result
In the segmentation process of the figure above, vertices are traversed bottom-up, adhering to the following principles:

- If the current vertex cannot be added to a bubble, all of its input edges are restored to sequential edges (e.g., V9 in the DAG);
- If the current vertex can be added to a bubble, a breadth-first traversal aggregates the bubble: first the vertices connected by input edges are retrieved, then the vertices connected by output edges. For any vertex that cannot join, its edge is restored to a sequential edge (for example, when traversing V2 in the DAG and reaching the output vertex V5, edge restoration is triggered because the total task count would exceed 500).
For any vertex, it can be added to the bubble only when the following conditions are met:
- There is no sequential edge connecting the vertex to the current bubble;
- There is no circular dependency between the vertex and the current bubble, namely:
- Case#1: no downstream vertex of this vertex is a vertex upstream of the current bubble;
- Case#2: no upstream vertex of this vertex is a vertex downstream of the current bubble;
- Case#3: no bubble downstream of this vertex sits upstream of the current bubble;
- Case#4: no bubble upstream of this vertex sits downstream of the current bubble;

Note: upstream/downstream here refer not only to the direct successors/predecessors of the current vertex, but also to its indirect successors/predecessors.
Fig.10 There may be several scenarios where there may be circular dependencies in the segmentation Bubble process
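Putting the traversal principles and admission conditions together, here is a self-contained Python sketch of the segmentation under the stated rules. The Vertex/Edge classes and all names are hypothetical; for brevity the cycle check covers only the vertex-level cases (#1/#2), with the bubble-level cases (#3/#4) noted in a comment:

```python
from collections import deque
from dataclasses import dataclass, field

SEQUENTIAL, CONCURRENT = "sequential", "concurrent"
SIZE_LIMIT = 500  # the online default discussed above

@dataclass(eq=False)
class Vertex:
    name: str
    concurrency: int = 1
    in_edges: list = field(default_factory=list)
    out_edges: list = field(default_factory=list)

@dataclass
class Edge:
    src: Vertex
    dst: Vertex
    timing: str = CONCURRENT  # barrier outputs start SEQUENTIAL (Fig.8)

def connect(src, dst, timing=CONCURRENT):
    e = Edge(src, dst, timing)
    src.out_edges.append(e)
    dst.in_edges.append(e)

def reaches(src, dst):
    """True if dst is a transitive successor of src."""
    seen, stack = set(), [src]
    while stack:
        n = stack.pop()
        if n is dst:
            return True
        if id(n) in seen:
            continue
        seen.add(id(n))
        stack.extend(e.dst for e in n.out_edges)
    return False

def creates_cycle(v, bubble):
    """Vertex-level cycle check (Cases #1/#2): adding v must not let a path
    leave the group and re-enter it. The production algorithm also rejects
    bubble-level cycles against other bubbles (Cases #3/#4)."""
    group = bubble | {v}
    return any(e.dst not in group and any(reaches(e.dst, w) for w in group)
               for u in group for e in u.out_edges)

def split_bubbles(vertices_bottom_up, limit=SIZE_LIMIT):
    """Bottom-up BFS aggregation; the input lists vertices ordered
    farthest-from-root first, with edge timings pre-initialized."""
    bubbles, assigned = [], set()

    def can_join(v, bubble, size):
        if size + v.concurrency > limit or v in assigned:
            return False
        for e in v.in_edges + v.out_edges:  # no sequential edge into bubble
            other = e.src if e.dst is v else e.dst
            if other in bubble and e.timing == SEQUENTIAL:
                return False
        return not creates_cycle(v, bubble)

    for root in vertices_bottom_up:
        if root in assigned:
            continue
        bubble, size, queue = set(), 0, deque([root])
        while queue:
            v = queue.popleft()
            if v in bubble:
                continue
            if not can_join(v, bubble, size):
                # Restore the offending concurrent edges to sequential.
                for e in v.in_edges + v.out_edges:
                    other = e.src if e.dst is v else e.dst
                    if other in bubble and e.timing == CONCURRENT:
                        e.timing = SEQUENTIAL
                continue
            bubble.add(v)
            size += v.concurrency
            # Retrieve input-edge neighbours first, then output-edge ones.
            queue.extend(e.src for e in v.in_edges if e.timing == CONCURRENT)
            queue.extend(e.dst for e in v.out_edges if e.timing == CONCURRENT)
        if len(bubble) > 1:              # lone vertices stay offline
            assigned |= bubble
            bubbles.append(bubble)
    return bubbles
```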
The actual online bubble segmentation also takes into account information such as actual resources and expected running time: whether the planned memory of a computing node exceeds a certain value, whether the computing node contains UDF operators, whether a computing node's execution time estimated from historical information (HBO) is too long, and so on; we won't go into the details here.
2.2 Bubble scheduling and execution
2.2.1 Bubble scheduling
To accelerate computation, the computing nodes inside a Bubble come, by default, from the resident pre-warmed resource pool, the same pool used by the quasi-real-time execution framework. At the same time, we keep this pluggable: when necessary, Bubble computing nodes can instead apply to the Resource Manager on the spot (switchable through configuration).
In terms of scheduling timing, the scheduling strategy of a Bubble's internal nodes is determined by the characteristics of their input edges, which fall into the following cases:
- A bubble root vertex with no input edges (such as V2 in Fig.9): scheduled as soon as the job starts running.
- A bubble root vertex whose inputs are all sequential edges (such as V6 in Fig.9): scheduled once the completed fraction of its upstream nodes reaches the configured min fraction (the default is 100%, i.e., all upstream nodes have completed).
- A vertex inside the Bubble (i.e., all of its input edges are concurrent edges, such as V4, V8, V10 in Fig.9): because it is connected entirely through concurrent edges, it is naturally triggered and scheduled simultaneously with its upstream.
- A bubble root vertex with mixed inputs on the bubble boundary (for example, V7 in Fig.9): this case needs special handling. Although V7 and V4 are linked through a concurrent edge, V7's scheduling is also controlled by V3 through a sequential edge, so V7 actually has to wait for V3 to reach its min fraction before it can be scheduled. For this kind of scenario, the min fraction of V3 can be configured smaller (even 0) to trigger earlier; in addition, we also provide progressive scheduling inside the Bubble, which helps in this scenario as well.
For example, Bubble#1 in Fig.7 has only one external dependency, a sequential edge. Once V2 completes, the scheduling of V6 + V10 (connected via a concurrent edge) is triggered as a whole, and the entire Bubble#1 runs.
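The four cases above reduce to a simple classification over a vertex's input-edge timings. The sketch below reuses the Vertex/Edge model from Section 2.1; the returned labels are purely illustrative, not framework constants:

```python
def scheduling_trigger(vertex):
    """Classify a bubble vertex's scheduling trigger from its input edges
    (reusing SEQUENTIAL/CONCURRENT and Vertex/Edge from the sketch above)."""
    timings = {e.timing for e in vertex.in_edges}
    if not timings:
        return "on-job-start"        # e.g. V2 in Fig.9
    if timings == {SEQUENTIAL}:
        return "on-min-fraction"     # e.g. V6 in Fig.9: gated on upstream
    if timings == {CONCURRENT}:
        return "with-bubble"         # e.g. V4/V8/V10 in Fig.9
    # Mixed inputs on the bubble boundary (e.g. V7 in Fig.9): tied to the
    # bubble by its concurrent edge, but its actual start is still gated by
    # the sequential edge's min fraction (configurable, even down to 0).
    return "on-min-fraction-then-with-bubble"
```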
Once a Bubble is triggered for scheduling, it applies for resources directly from the SMODE Admin, by default using the integrated Gang-Scheduling (GS) resource application mode. In this mode, the entire Bubble constructs a single request and sends it to the Admin; when the Admin has enough resources to satisfy the request, it sends a scheduling result, including the pre-pulled worker information, back to the AM of the Bubble job.
Fig.11 Resource interaction between Bubble and Admin
To handle both strained-resource scenarios and dynamic adjustment within a Bubble, Bubble also supports the Progressive resource application mode. This mode allows each vertex in the Bubble to apply for resources and be scheduled independently: whenever the Admin has incremental resources to schedule against such an application, it sends the result to the AM, continuing until the corresponding vertex's request is fully satisfied. We will not expand on the distinctive uses of this mode here.
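The shape of the two application modes might be sketched as follows; the Admin methods here are hypothetical placeholders, not the real SMODE Admin API:

```python
from dataclasses import dataclass

@dataclass
class BubbleRequest:
    bubble_id: int
    demand: dict  # vertex name -> worker count, e.g. {"V6": 10, "V10": 1}

def gang_schedule(admin, req: BubbleRequest):
    """Gang-Scheduling (the default): one all-or-nothing request for the
    whole bubble; the Admin replies only once it can satisfy everything,
    returning the pre-pulled worker assignments in a single result."""
    return admin.schedule_gang(req)          # hypothetical Admin call

def progressive_schedule(admin, req: BubbleRequest):
    """Progressive mode: each vertex applies independently, and the Admin
    streams back incremental grants until every vertex's demand is met."""
    return [admin.schedule_incremental(BubbleRequest(req.bubble_id, {v: n}))
            for v, n in req.demand.items()]  # hypothetical Admin call
```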
With the upgraded quasi-real-time execution framework, resource management (Admin) and multi-DAG job management logic (MultiJobManager) in the SMODE service have been decoupled. The resource application logic of bubble mode therefore only needs to interact with the Admin, without any impact on the DAG execution management of normal quasi-real-time jobs. In addition, to support online grayscale hot upgrades, each resident computing node in the resource pool managed by the Admin runs in Agent + multi-Labor mode. When scheduling specific resources, the Admin matches the worker version against the AM version and schedules a labor that meets the conditions to the Bubble job.
2.2.2 Bubble Data Shuffle
For sequential edges crossing the Bubble boundary, data is transmitted the same way as in ordinary offline jobs: by landing on disk. Here we mainly discuss data transmission within a Bubble. By the bubble segmentation principles described earlier, the inside of a bubble usually has good data-pipelining characteristics and the data volume is not large, so for data on the concurrent edges inside a bubble, the fastest methods, network or memory direct transfer, are used for shuffle.
Of these, network shuffle works the same way as in classic quasi-real-time jobs: a TCP connection is established between the upstream and downstream nodes, and data is sent over the direct network link. This push-based network transmission requires upstream and downstream to be pulled up at the same time, and since the dependencies chain together, the push model relies strongly on Gang-Scheduling; it also constrains the bubble's flexibility in fault tolerance and long-tail avoidance.
To better solve these problems, Bubble mode explores a memory shuffle mode. In this mode, the upstream node writes data directly into the memory of the cluster's ShuffleAgent (SA), while the downstream node reads the data from the SA. The fault tolerance and scaling of memory shuffle, including asynchronously landing part of the data on disk to keep availability high when memory is insufficient, are provided independently by the ShuffleService. This mode supports both Gang-Scheduling and Progressive scheduling, and it is also highly extensible: for example, SA locality scheduling enables more local data reads, and lineage-based instance-level retry enables fault tolerance at a finer granularity, and so on.
Fig.12 Network Shuffle VS Memory Shuffle
Given the many extensibility advantages of memory shuffle, it is the default shuffle method chosen for online Bubble jobs, while network direct transfer remains an alternative that can be configured for ultra-small-scale jobs where the fault-tolerance cost is low.
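To illustrate the pull-based decoupling that memory shuffle buys over the push-based TCP shuffle, here is a deliberately simplified, toy stand-in for an SA; the real ShuffleService's interfaces, asynchronous spilling, and fault tolerance are far more sophisticated:

```python
import threading

class ShuffleAgentSketch:
    """Toy stand-in for the cluster ShuffleAgent (SA). Upstream tasks push
    partitions into SA memory and downstream tasks pull them from the SA,
    so the two sides need not be alive simultaneously (unlike the
    push-based TCP network shuffle). Here the spill is synchronous and
    'disk' is just a second dict."""

    def __init__(self, memory_budget: int):
        self._budget = memory_budget
        self._used = 0
        self._mem = {}    # (producer, partition) -> bytes held in memory
        self._disk = {}   # (producer, partition) -> bytes spilled to "disk"
        self._lock = threading.Lock()

    def write(self, producer: str, partition: int, data: bytes) -> None:
        with self._lock:
            if self._used + len(data) <= self._budget:
                self._mem[(producer, partition)] = data
                self._used += len(data)
            else:                      # memory is short: spill instead
                self._disk[(producer, partition)] = data

    def read(self, producer: str, partition: int) -> bytes:
        with self._lock:
            key = (producer, partition)
            return self._mem.get(key) or self._disk.get(key, b"")
```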
2.3 Fault-Tolerance
As a brand-new hybrid execution mode, Bubble Execution explores a range of fine-grained balances between offline jobs and integrally scheduled quasi-real-time jobs. On complex online clusters, all kinds of failures are inevitable at runtime. In the new bubble mode, to minimize the impact of failures and strike the best balance between reliability and performance, the failure-handling strategies are correspondingly more diverse.
For different kinds of abnormalities, we designed targeted fault-tolerance strategies at granularities ranging from fine to coarse, to handle the various abnormal scenarios that can arise during execution, such as: failure to apply for resources from the Admin, task execution failure inside a bubble (bubble-rerun), fallback when a bubble fails too many times (bubble-renew), AM failover during execution, and so on.
2.3.1 Bubble Rerun
Currently, the default retry strategy adopted by a Bubble when an internal computing node fails is to rerun the whole bubble. That is, when any node's current execution attempt fails, the entire bubble is rerun immediately: the in-flight attempts of the same version are cancelled and their resources returned, and then the bubble is re-triggered. This ensures that the retry attempt versions of all computing nodes in the bubble stay consistent.
Many scenarios can trigger a bubble rerun; the most common are as follows:
- Instance Failed: execution of a computing node failed, usually triggered by a runtime error in the upper-layer engine (such as a thrown retryable-exception).
- Resource Revoked: in the online production environment, many scenarios cause a resource node to restart, e.g., the machine OOMs or the machine is blacklisted. After the worker is killed, the restarted worker reconnects to the admin with its initial startup parameters. The admin then wraps the worker-restart message as a Resource Revoked notification and sends it to the corresponding AM, triggering a bubble rerun.
- Admin Failover: because the computing resources used by a Bubble job come from the SMODE admin resource pool, when the Admin itself fails over it does not know which AM each node is currently assigned to, and cannot send the restart messages to the AMs. The current approach is for each AM to subscribe to the nuwa file corresponding to the admin; this file is updated after the admin restarts, and when the AM senses the update it triggers the corresponding taskAttempt failures and reruns the bubble.
- Input Read Error: failing to read upstream data during a computing node's execution is a very common error. For a bubble, this error actually has three distinct flavors:
- InputReadError inside the Bubble: the shuffle data source is also inside the bubble; when the bubble reruns, the corresponding upstream task reruns with it, so no special treatment is needed.
- InputReadError at the Bubble boundary: the shuffle data source was produced by a task in an upstream offline vertex (or another bubble). The InputReadError triggers a rerun of that upstream task, and the rerun of the current bubble is delayed until the new version of all the upstream lineage data is ready, at which point its scheduling is triggered.
- InputReadError downstream of the Bubble: if a task downstream of the bubble hits an InputReadError, this event triggers a rerun of a task inside the bubble; because the memory shuffle data that task depends on has already been released, the entire bubble is rerun.
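The trigger scenarios above all funnel into the same whole-bubble rerun path. A hedged sketch of that policy (with hypothetical object and method names) might look like this:

```python
def on_task_attempt_failed(bubble, task, cause):
    """Sketch of the bubble-rerun policy; names are hypothetical. Every
    failure maps to a whole-bubble rerun, because for a gang-scheduled
    bubble the fault-tolerance granularity is the bubble itself."""
    if cause == "INPUT_READ_ERROR" and task.upstream_bubble is not bubble:
        # Boundary case: the data source lives in an upstream offline
        # vertex or another bubble. Rerun the upstream task first and
        # delay this bubble until the new lineage data is all ready.
        task.upstream_task.trigger_rerun()
        bubble.defer_rerun_until_inputs_ready()
    # Cancel every running attempt of the current version and return the
    # workers, so all tasks in the bubble retry at a consistent version.
    for attempt in bubble.running_attempts():
        attempt.cancel()
    if bubble.rerun_count >= bubble.max_reruns:
        bubble.renew()     # fall back to offline execution (see 2.3.2)
    else:
        bubble.rerun()
```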
2.3.2 Bubble Renew
When Admin resources are tight, a Bubble's resource application may time out while waiting, and in some abnormal situations, such as the online-job service restarting just as the bubble applies for resources, the application may fail outright. In these cases, all vertices in the bubble fall back to pure offline execution. Bubble renew is also triggered when the number of bubble reruns exceeds the upper limit. When a bubble renew occurs, all of its internal edges are restored to sequential edges, all vertices are reinitialized, and the internal state transitions of these vertices are driven again in a purely offline manner by replaying all the internal scheduling state-machine trigger events. This guarantees that every vertex in the bubble executes in classic offline mode after the rollback, which effectively ensures the job can still terminate normally.
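A sketch of the renew fallback, again with hypothetical method names, following the steps just described:

```python
def renew(bubble):
    """Sketch of bubble renew: roll every vertex back to classic offline
    execution so the job can still terminate normally."""
    for edge in bubble.internal_edges():
        edge.timing = "sequential"   # concurrent edges become sequential
        edge.transport = "disk"      # shuffle data now lands on disk
    for vertex in bubble.vertices():
        vertex.reinitialize()        # reset the vertex state machines
    # Replay the recorded scheduling state-machine events so each vertex
    # walks through its transitions again, this time in offline mode.
    for event in bubble.recorded_scheduling_events():
        bubble.dispatch(event)
```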
Fig. 13 Bubble Renew
2.3.3 Bubble AM Failover
For normal offline jobs, the DAG framework persists the internal scheduling events of each computing node so that incremental failover at computing-node granularity is possible. But for a bubble job, if an AM failover and restart happens while a bubble is executing, the bubble recovered by replaying the persisted events may be restored to an intermediate running state. Because the bubble's internal shuffle data may have lived in memory and been lost, the unfinished computing nodes restored to that intermediate state would immediately fail, unable to read their upstream shuffle data.
Essentially, in the gang-scheduled bubble scenario, the bubble as a whole is the minimum failover granularity, so once an AM failover occurs, recovery must also happen at bubble granularity. Therefore, all scheduling events related to a bubble are treated as a unit during execution, with a bubbleStartedEvent and a bubbleFinishedEvent flushed out when the bubble starts and ends. On recovery after failover, all events of a bubble are considered as a whole: only a closing bubbleFinishedEvent means the bubble can be considered completely finished; otherwise the entire bubble is rerun.
For example, in the case below, the DAG contains two Bubbles (Bubble#0: {V1, V2}, Bubble#1: {V3, V4}). When the AM restart occurs, Bubble#0 has already TERMINATED and written its BubbleFinishedEvent, while in Bubble#1, V3 has terminated but V4 is still running, so Bubble#1 has not reached a final state. After the AM recovers, V1 and V2 are restored to the Terminated state, and Bubble#1 is executed from the beginning.
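The recovery decision reduces to pairing started/finished events per bubble; here is a minimal sketch, assuming (hypothetically) that each replayed event exposes a kind and a bubble id:

```python
def bubbles_to_rerun(event_log):
    """Sketch of bubble-granularity recovery after an AM failover: a bubble
    counts as finished only if its bubbleFinishedEvent was flushed; any
    bubble that started but never finished loses all intermediate progress
    (its in-memory shuffle data did not survive) and reruns from scratch."""
    started, finished = set(), set()
    for event in event_log:
        if event.kind == "bubbleStartedEvent":
            started.add(event.bubble_id)
        elif event.kind == "bubbleFinishedEvent":
            finished.add(event.bubble_id)
    return started - finished   # e.g. {1} in Fig 14: Bubble#1 restarts
```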
Fig 14. AM Failover with Bubbles
3. Online effect
Bubble mode is now fully online in the public cloud: 34% of SQL jobs execute with Bubbles, averaging 176K Bubbles per day.
Comparing queries with the same signature before and after Bubble execution was enabled, we found that with overall resource consumption basically unchanged, job execution performance improved by 34% and the amount of data processed per second increased by 54%.
Fig 15. Comparison of execution performance/resource consumption
Beyond the overall comparison, we also analyzed VIP users specifically. After a user project turned on the Bubble switch (the red-marked point in the figure below is when Bubble was enabled), the average execution performance of its jobs improved very significantly.
Fig 16. Comparison of average execution time of VIP users after opening Bubble