This article was first published at https://www.yuque.com/17sing
| Version | Date | Notes |
| --- | --- | --- |
| 1.0 | 2022.2.2 | Article first published |
| 1.1 | 2022.2.14 | Updated section 3.4, enhanced the comments section |
| 1.2 | 2022.2.27 | Updated section 3.6, removed descriptions that do not apply to version 1.14 |
This article is based on Flink 1.14 code for analysis.
Flink has been running in our production environment for a while now. Shortly after we went to production, I happened to investigate a Checkpoint timeout caused by data skew, and at the time I gained only a rough understanding of the mechanism involved. Since I have recently been reading the Flink source code, this is a good opportunity to figure it out properly.
First, we need to distinguish two notions of Exactly-Once:
- Exactly Once: inside the computing engine, data is neither lost nor duplicated. In Flink this is achieved by enabling checkpoints, which perform barrier alignment.
- End-to-End Exactly Once: data is neither lost nor duplicated across the entire pipeline, from reading the source, through engine processing, to writing to external storage. This requires a replayable data source and a sink that supports either transactional recovery and rollback, or idempotent writes.
1. Why does data skew cause Checkpoint timeout
During a Checkpoint, operators perform barrier alignment (why alignment is necessary is discussed later). The following figure illustrates the alignment process:
When barriers are emitted on two input edges and barrier1 reaches the operator before barrier2, the operator buffers the elements arriving on barrier1's edge and does not emit them until barrier2 arrives and the checkpoint is taken.
After aligning the barriers, each operator stores its state asynchronously and then forwards the barrier downstream. As each operator completes its part of the Checkpoint, it notifies the CheckpointCoordinator; once the CheckpointCoordinator learns that every operator has completed, the Checkpoint as a whole is considered complete.
In our application, a map operator receives a disproportionately large amount of data, so its barrier is forwarded late, and eventually the entire Checkpoint times out.
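The alignment behaviour described above can be sketched as a small, self-contained model. All class and method names below are invented for illustration; this is not Flink's actual alignment code, just the buffering idea:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Simplified model of exactly-once barrier alignment across input channels.
// Records from a channel whose barrier has already arrived are buffered, not
// processed, until the barrier from every other channel has arrived as well.
class BarrierAligner {
    private final int numChannels;
    private final boolean[] barrierReceived;
    private final Queue<String> buffered = new ArrayDeque<>();
    private final List<String> processed = new ArrayList<>();
    private int barriersSeen = 0;

    BarrierAligner(int numChannels) {
        this.numChannels = numChannels;
        this.barrierReceived = new boolean[numChannels];
    }

    void onRecord(int channel, String record) {
        if (barrierReceived[channel]) {
            buffered.add(record);   // channel is "blocked": buffer its input
        } else {
            processed.add(record);  // channel still open: process immediately
        }
    }

    /** Returns true when barriers from all channels have arrived (snapshot point). */
    boolean onBarrier(int channel) {
        if (!barrierReceived[channel]) {
            barrierReceived[channel] = true;
            barriersSeen++;
        }
        if (barriersSeen == numChannels) {
            // Alignment complete: the snapshot would be taken here, then the
            // buffered records are released for processing.
            processed.addAll(buffered);
            buffered.clear();
            return true;
        }
        return false;
    }

    List<String> processed() { return processed; }
    int bufferedCount() { return buffered.size(); }
}
```

If one channel is skewed and its barrier is late, records from the fast channel pile up in the buffer and alignment cannot finish, which is exactly the timeout described above.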
2. Principle of Checkpoint
For the underlying theory, see the Flink team's paper: Lightweight Asynchronous Snapshots for Distributed Dataflows. In short, the fault-tolerance approach of early stream-processing systems was to periodically take a snapshot of the global state, which has two drawbacks:
- Blocking computation: snapshots are taken synchronously, blocking processing.
- The snapshot includes each operator's unprocessed and in-flight records, so it becomes particularly large.
Flink instead builds on the Chandy-Lamport algorithm, which takes snapshots asynchronously but requires a replayable data source and still records in-flight channel data. Flink's variant, for acyclic graphs, avoids storing that in-flight data altogether.
In Flink's directed acyclic graph, barrier markers are injected periodically to tell downstream operators to start taking a snapshot. The algorithm rests on the following premises:
- Network transmission is reliable and preserves FIFO order per channel.
- Tasks can perform block and unblock operations on their input channels: a blocked channel buffers everything it receives from upstream and forwards it only after the unblock signal arrives.
- Source nodes are abstracted as tasks with nil input channels (following the terminology of the paper).
3. Implementation of Checkpoint
In Flink, doing Checkpoint roughly consists of the following steps:
- Feasibility check
- The JobMaster notifies the Task to trigger a checkpoint
- TaskExecutor performs checkpointing
- JobMaster confirms checkpoint
Next, let's follow the source code to see the specific implementation inside.
3.1 Feasibility check
- Make sure the job is neither shut down nor not yet started.
- Generate a new checkpoint ID and create a PendingCheckpoint; once all tasks complete the Checkpoint, it is converted into a CompletedCheckpoint. At the same time, a timeout watcher is registered: if the Checkpoint times out, it is aborted.
- Trigger the MasterHooks. Some external systems need to run extension logic before a checkpoint is triggered; such a notification mechanism can be implemented via a MasterHook.
- Repeat the checks of step 1, and if everything is fine, notify the SourceStreamTasks to start triggering the checkpoint.
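The second step above, a pending checkpoint plus its timeout watcher, can be sketched as follows. This is a toy model under invented names, not Flink's actual PendingCheckpoint class:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Simplified model of a pending checkpoint: it completes once every task has
// acknowledged, and a scheduled canceller aborts it if the timeout fires first.
class SimplePendingCheckpoint {
    enum State { PENDING, COMPLETED, ABORTED }

    private final Set<String> notYetAcked;
    private State state = State.PENDING;

    SimplePendingCheckpoint(Set<String> tasks) {
        this.notYetAcked = new HashSet<>(tasks);
    }

    /** Returns false if the checkpoint is already completed or aborted. */
    synchronized boolean acknowledge(String task) {
        if (state != State.PENDING) return false;
        notYetAcked.remove(task);
        if (notYetAcked.isEmpty()) {
            state = State.COMPLETED;  // would become a CompletedCheckpoint
        }
        return true;
    }

    synchronized void abortIfPending() {
        if (state == State.PENDING) state = State.ABORTED;
    }

    synchronized State state() { return state; }

    /** Registers the timeout canceller, as the coordinator does on trigger. */
    static void registerTimeout(SimplePendingCheckpoint cp,
                                ScheduledExecutorService timer,
                                long timeoutMillis) {
        timer.schedule(cp::abortIfPending, timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```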
3.2 JobMaster notifies Task to trigger checkpoint
In CheckpointPlanCalculator#triggerCheckpointRequest, the triggerTasks method calls Execution#triggerCheckpoint. An Execution corresponds to a Task instance, so the JobMaster can find its TaskManagerGateway through the Slot reference it holds and send a remote request to trigger the Checkpoint.
3.3 TaskManager performs checkpoints
TaskManager is represented in the code by TaskExecutor. When the JobMaster sends the remote trigger request, it is handled by TaskExecutor#triggerCheckpoint, which then calls Task#triggerCheckpointBarrier to:
- Do some checks, such as whether the Task is in the Running state
- Trigger the checkpoint: call CheckpointableTask#triggerCheckpointAsync.
- Perform the checkpoint: taking the StreamTask implementation as an example, if the upstream has already finished, the downstream checkpoint is triggered by a CheckpointBarrier; if the task is not yet finished, StreamTask#triggerCheckpointAsyncInMailbox is called. Either way, execution ends up in SubtaskCheckpointCoordinator#checkpointState, which triggers the Checkpoint.
- Save the operator snapshot: call OperatorChain#broadcastEvent and save the OperatorState and KeyedState; SubtaskCheckpointCoordinatorImpl#finishAndReportAsync then asynchronously reports that the current snapshot is complete.
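The snapshot in the last step has two phases: a short synchronous part on the task thread that captures an immutable copy of the state, and an asynchronous part that persists the copy and reports completion, so that record processing can resume quickly. A minimal sketch of this split, with invented names and a plain map standing in for operator state:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

// Simplified model of the two snapshot phases: the synchronous phase copies the
// state while the task thread is paused; the asynchronous phase "uploads" the
// copy and produces the ack that is reported to the coordinator.
class AsyncSnapshotSketch {
    static CompletableFuture<String> checkpointState(
            Map<String, Long> operatorState,   // live, mutable state
            long checkpointId,
            ExecutorService asyncExecutor) {
        // Synchronous phase: snapshot the state on the task thread.
        Map<String, Long> snapshot = new HashMap<>(operatorState);
        // Asynchronous phase: persist the copy and report completion.
        return CompletableFuture.supplyAsync(
                () -> "ack(checkpoint=" + checkpointId + ", entries=" + snapshot.size() + ")",
                asyncExecutor);
    }
}
```

Because the copy is taken synchronously, mutations to the live state after the snapshot point do not leak into the checkpoint.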
3.4 JobMaster confirms checkpoint
```
|-- RpcCheckpointResponder
    \-- acknowledgeCheckpoint
|-- JobMaster
    \-- acknowledgeCheckpoint
|-- SchedulerBase
    \-- acknowledgeCheckpoint
|-- ExecutionGraphHandler
    \-- acknowledgeCheckpoint
|-- CheckpointCoordinator
    \-- receiveAcknowledgeMessage
```
In 3.1 we mentioned the PendingCheckpoint. It maintains state to track that every task and every master hook has acknowledged. Once confirmation is complete, the CheckpointCoordinator notifies everyone that the Checkpoint has completed.
```
|-- CheckpointCoordinator
    \-- receiveAcknowledgeMessage
        \-- sendAcknowledgeMessages
            // Notifies downstream that the Checkpoint has completed. If the Sink
            // implements TwoPhaseCommitSinkFunction, it commits here; if the commit
            // fails for some reason, a FlinkRuntimeException is thrown, the failed
            // CheckpointId stays in pendingCommitTransactions, and the commit is
            // re-executed when the checkpoint is restored.
```
3.5 Checkpoint Recovery
This part of the code is relatively straightforward; interested readers can follow it along the call stack below.
```
|-- Task
    \-- run
        \-- doRun
|-- StreamTask
    \-- invoke
        \-- restoreInternal
            \-- restoreGates
|-- OperatorChain
    \-- initializeStateAndOpenOperators
|-- StreamOperator
    \-- initializeState
|-- StreamOperatorStateHandler
    \-- initializeOperatorState
|-- AbstractStreamOperator
    \-- initializeState
|-- StreamOperatorStateHandler
    \-- initializeOperatorState
|-- CheckpointedStreamOperator
    \-- initializeState  // calls user code
```
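The restore path boils down to: before an operator opens, initializeState hands it the state captured at the latest completed checkpoint, or empty state on a fresh start. A minimal model of that idea, with invented names and a map standing in for a state backend:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of state restore on recovery: each operator's initializeState
// receives the snapshot taken at the latest completed checkpoint, or starts
// empty if no snapshot exists for it.
class RestoreSketch {
    private final Map<String, Map<String, Long>> latestSnapshot; // operatorId -> state

    RestoreSketch(Map<String, Map<String, Long>> latestSnapshot) {
        this.latestSnapshot = latestSnapshot;
    }

    /** Called per operator during restore, before the operator is opened. */
    Map<String, Long> initializeState(String operatorId) {
        return new HashMap<>(latestSnapshot.getOrDefault(operatorId, new HashMap<>()));
    }
}
```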
3.6 End to End Exactly Once
Implementing end-to-end exactly-once is genuinely harder: consider a single Source feeding N Sinks. Flink therefore provides interfaces to support end-to-end exactly-once guarantees:
- TwoPhaseCommitSinkFunction: a sink that wants exactly-once semantics must implement this interface.
- CheckpointedFunction: hooks invoked when a Checkpoint is taken.
- CheckpointListener: as the name suggests, implementers of this interface are notified when a Checkpoint completes or fails.
At present, only Kafka provides exactly-once as both Source and Sink: upstream it supports replay from an offset, and downstream it supports rollback or idempotent writes. Interested readers can look at the relevant implementations of these interfaces.
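The TwoPhaseCommitSinkFunction lifecycle can be sketched with a small model. This follows the commit protocol described above, but the class here is a self-contained toy, not Flink's actual implementation: records go into the current transaction; when the checkpoint is taken the transaction is pre-committed and parked under its checkpoint id; when the coordinator confirms the checkpoint, all pending transactions up to that id are committed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a two-phase-commit sink: invoke() writes into the open
// transaction, snapshotState() pre-commits it, notifyCheckpointComplete()
// commits every pending transaction whose checkpoint has been confirmed.
class TwoPhaseCommitSketch {
    private List<String> currentTxn = new ArrayList<>();
    private final Map<Long, List<String>> pendingCommitTransactions = new LinkedHashMap<>();
    private final List<String> committed = new ArrayList<>();

    void invoke(String record) { currentTxn.add(record); }

    /** Called when the checkpoint barrier arrives (pre-commit / flush). */
    void snapshotState(long checkpointId) {
        pendingCommitTransactions.put(checkpointId, currentTxn);
        currentTxn = new ArrayList<>();   // begin the transaction for the next epoch
    }

    /** Called when the coordinator reports the checkpoint complete. */
    void notifyCheckpointComplete(long checkpointId) {
        pendingCommitTransactions.entrySet().removeIf(e -> {
            if (e.getKey() <= checkpointId) {
                committed.addAll(e.getValue()); // commit to the external system
                return true;
            }
            return false;
        });
    }

    List<String> committed() { return committed; }
    int pendingCount() { return pendingCommitTransactions.size(); }
}
```

Note how a transaction that has been pre-committed but not yet confirmed stays in pendingCommitTransactions, which is what allows a failed commit to be retried after recovery.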
This article approached the principle and implementation of Checkpoint from the perspective of a concrete problem and traced the relevant source code. The main code path is actually fairly clear, but it involves a large number of classes; attentive readers may have noticed that this is the single responsibility principle at work. The implementation of TwoPhaseCommitSinkFunction is also a typical example of the template method design pattern.