This article was first published at: https://www.yuque.com/17sing
| Version | Date | Remark |
| --- | --- | --- |
| 1.0 | 2022.2.2 | Article first published |
| 1.1 | 2022.2.14 | Updated section 3.4; enhanced the comments section |
| 1.2 | 2022.2.27 | Updated section 3.6; removed descriptions that do not apply to version 1.14 |
| 1.3 | 2022.3.8 | Fixed typos |
This article is based on the Flink 1.14 source code.

0. Preface

Flink has been in production with us for a while now. When we first went to production, I happened to investigate a Checkpoint timeout caused by data skew, and at the time I gained only a rough understanding of the mechanism involved. Recently, while reading the Flink source code, I decided to take the opportunity to figure it out properly.

First, we need to distinguish between two kinds of Exactly-Once:

  • Exactly Once: within the computing engine, data is neither lost nor duplicated. In essence, this is achieved by enabling Flink checkpoints with barrier alignment (a minimal configuration sketch follows this list).
  • End to End Exactly Once: data is neither lost nor duplicated anywhere along the path from reading the source, through engine processing, to writing to external storage. This requires a replayable data source and a sink that supports transactional recovery and rollback, or idempotent writes.
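
As a concrete reference, here is a minimal sketch of enabling engine-internal Exactly Once in the DataStream API. The classes and methods are standard Flink 1.14 APIs, but the interval and timeout values are illustrative, not from the original article:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Engine-internal Exactly Once: checkpoint every 60s with barrier alignment.
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        // Abort a checkpoint that does not finish within 10 minutes, the kind of
        // timeout that data skew caused in section 1.
        env.getCheckpointConfig().setCheckpointTimeout(600_000L);

        // ... build the job topology and call env.execute() ...
    }
}
```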

1. Why does data skew cause Checkpoint timeouts

When a checkpoint is taken, operators perform barrier alignment (why alignment is necessary is discussed later). Take a two-input operator as an example:

When barriers are emitted on both input edges and barrier1 reaches the operator before barrier2, the operator buffers the elements arriving on the already-aligned edge and does not emit them downstream until barrier2 arrives and the checkpoint is taken.

Once an operator has aligned the barriers, it snapshots its state asynchronously and then forwards the barrier downstream. Each operator notifies the CheckpointCoordinator when its part of the checkpoint is done; once the CheckpointCoordinator learns that all operators have finished, it considers the checkpoint complete.
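
To make the alignment concrete, below is a hypothetical, heavily simplified sketch of this logic for a two-input operator. It is not Flink's actual implementation (which lives in the CheckpointBarrierHandler family); all names are illustrative:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of barrier alignment for a two-input operator.
class AlignmentSketch {
    private final int numChannels = 2;
    private final Set<Integer> channelsWithBarrier = new HashSet<>();
    private final List<Object> buffered = new ArrayList<>();

    void onBarrier(int channel) {
        channelsWithBarrier.add(channel);
        if (channelsWithBarrier.size() == numChannels) { // aligned
            snapshotStateAsync();            // asynchronous state snapshot
            emitBarrierDownstream();         // forward the barrier
            buffered.forEach(this::process); // release buffered elements
            buffered.clear();
            channelsWithBarrier.clear();
        }
    }

    void onElement(int channel, Object element) {
        if (channelsWithBarrier.contains(channel)) {
            buffered.add(element); // barrier already seen on this edge: buffer
        } else {
            process(element);      // barrier not yet seen: process normally
        }
    }

    void snapshotStateAsync() { /* ... */ }
    void emitBarrierDownstream() { /* ... */ }
    void process(Object element) { /* ... */ }
}
```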

In our application, one map operator received a disproportionately large share of the data, so its barrier was held up and could not be forwarded, and eventually the entire checkpoint timed out.

2. Principle of Checkpoint

For the underlying theory, refer to the Flink team's paper: Lightweight Asynchronous Snapshots for Distributed Dataflows. In short, early stream-computing systems achieved fault tolerance by periodically taking snapshots of the global state, which has two drawbacks:

  • It blocks computation: snapshots are taken synchronously.
  • Records that are still in flight or currently being processed are included in the snapshot, so the snapshot becomes particularly large.

Flink instead builds on the Chandy-Lamport algorithm, which takes snapshots asynchronously and requires the data source to be replayable, but still records in-flight channel data. Flink's refinement avoids storing channel data for the acyclic parts of the graph.

In Flink's directed acyclic graph, barriers are periodically injected into the stream to tell downstream operators to start taking a snapshot. The algorithm rests on the following premises:

  • Network transport is reliable and preserves FIFO order. Channels can be blocked and unblocked: a blocked channel buffers all data received from its upstream and delivers it only after it is unblocked (a minimal model follows this list).
  • A task can perform the following operations on its channels: block, unblock, send messages, broadcast messages.
  • A Source node is modeled as having a Nil input channel.
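
Below is a hypothetical model of the block/unblock channel semantics from the first premise. The class and method names are illustrative and do not correspond to Flink classes:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of the block/unblock channel operations.
class BlockableChannel {
    private boolean blocked = false;
    private final Queue<Object> buffer = new ArrayDeque<>();

    void block() {
        blocked = true;
    }

    void unblock() {
        blocked = false;
        while (!buffer.isEmpty()) {
            deliver(buffer.poll()); // flush everything buffered while blocked
        }
    }

    void receive(Object message) {
        if (blocked) {
            buffer.add(message);    // a blocked channel buffers upstream data
        } else {
            deliver(message);
        }
    }

    void deliver(Object message) { /* hand the message to the operator */ }
}
```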

3. Implementation of Checkpoint

In Flink, taking a checkpoint roughly consists of the following steps:

  1. Feasibility check
  2. The JobMaster notifies the Task to trigger a checkpoint
  3. TaskExecutor performs checkpointing
  4. JobMaster confirms checkpoint

Next, let's follow the source code to see the specific implementation inside.

3.1 Feasibility check

Reference code: CheckpointCoordinator#startTriggeringCheckpoint.

  1. Make sure the job is neither shut down nor not yet started (see CheckpointPlanCalculator#calculateCheckpointPlan).
  2. Generate a new checkpoint id and create a PendingCheckpoint; when all tasks have completed the checkpoint, it is converted into a CompletedCheckpoint. A timeout canceller is also registered, and if the checkpoint times out, it is aborted (see CheckpointCoordinator#createPendingCheckpoint).
  3. Trigger the MasterHooks. Some external systems need to run extra logic before a checkpoint is triggered, and they can be notified by implementing a MasterHook (see CheckpointCoordinator#snapshotMasterState).
  4. Repeat the check from step 1; if everything is fine, notify the SourceStreamTasks to start triggering the checkpoint (see CheckpointCoordinator#triggerCheckpointRequest). A condensed paraphrase of these four steps follows this list.
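
The method names below match the Flink 1.14 sources cited above, but the class, types, and bodies are stand-ins; the real method is asynchronous and handles many failure cases:

```java
// Condensed paraphrase of CheckpointCoordinator#startTriggeringCheckpoint (Flink 1.14).
class CoordinatorSketch {
    void startTriggeringCheckpoint() {
        Object plan = calculateCheckpointPlan();        // 1. job neither shut down nor not started
        Object pending = createPendingCheckpoint(plan); // 2. new checkpoint id + timeout canceller
        snapshotMasterState(pending);                   // 3. run the registered MasterHooks
        triggerCheckpointRequest(pending);              // 4. re-check, then trigger the source tasks
    }

    Object calculateCheckpointPlan() { return new Object(); }
    Object createPendingCheckpoint(Object plan) { return new Object(); }
    void snapshotMasterState(Object pending) { /* ... */ }
    void triggerCheckpointRequest(Object pending) { /* ... */ }
}
```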

3.2 JobMaster notifies Task to trigger checkpoint

In CheckpointCoordinator#triggerCheckpointRequest, triggerTasks is called, which in turn invokes Execution#triggerCheckpoint. An Execution corresponds to a Task instance, so the JobMaster can find the TaskManagerGateway through the slot reference held inside and send a remote request to trigger the checkpoint.

3.3 TaskManager performs checkpoints

In the code, the TaskManager appears as TaskExecutor. When the JobMaster sends the remote trigger request to the TaskExecutor, it is handled by TaskExecutor#triggerCheckpoint, which then calls Task#triggerCheckpointBarrier to do the following (a condensed paraphrase follows the list):

  1. Run some checks, such as whether the Task is in the Running state.
  2. Trigger the checkpoint: call CheckpointableTask#triggerCheckpointAsync.
  3. Perform the checkpoint in CheckpointableTask#triggerCheckpointAsync. Taking the StreamTask implementation as an example: if the upstream has already finished, the downstream checkpoint is triggered by the CheckpointBarrier itself; if the task has not finished, StreamTask#triggerCheckpointAsyncInMailbox is called. Both paths eventually reach SubtaskCheckpointCoordinator#checkpointState to trigger the checkpoint.
  4. Operators save their snapshots: OperatorChain#broadcastEvent is called, and OperatorState and KeyedState are saved.
  5. Call SubtaskCheckpointCoordinatorImpl#finishAndReportAsync to asynchronously report that the current snapshot has completed.
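
The method names below follow the classes cited in the list; the class itself and its plumbing are stand-ins (the real signatures carry CheckpointMetaData and CheckpointOptions and return futures):

```java
// Condensed paraphrase of the task-side checkpoint path (Flink 1.14).
class TaskSideSketch {
    void triggerCheckpointBarrier() {
        checkTaskIsRunning();     // 1. sanity checks
        triggerCheckpointAsync(); // 2. CheckpointableTask#triggerCheckpointAsync
    }

    void triggerCheckpointAsync() {
        // 3. StreamTask routes running tasks to triggerCheckpointAsyncInMailbox;
        //    finished upstreams are handled via the CheckpointBarrier itself.
        checkpointState(); // both paths end in SubtaskCheckpointCoordinator#checkpointState
    }

    void checkpointState() {
        broadcastBarrierDownstream();    // 4. OperatorChain#broadcastEvent
        snapshotOperatorAndKeyedState(); // 4. save OperatorState and KeyedState
        finishAndReportAsync();          // 5. ack the JobMaster asynchronously
    }

    void checkTaskIsRunning() { /* ... */ }
    void broadcastBarrierDownstream() { /* ... */ }
    void snapshotOperatorAndKeyedState() { /* ... */ }
    void finishAndReportAsync() { /* ... */ }
}
```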

3.4 JobMaster confirms checkpoint

|-- RpcCheckpointResponder
  \-- acknowledgeCheckpoint
|-- JobMaster
  \-- acknowledgeCheckpoint
|-- SchedulerBase
  \-- acknowledgeCheckpoint
|-- ExecutionGraphHandler
  \-- acknowledgeCheckpoint
|-- CheckpointCoordinator
  \-- receiveAcknowledgeMessage

In section 3.1 we mentioned the PendingCheckpoint. It maintains bookkeeping state to ensure that all tasks and all master hooks have acknowledged. Once confirmation is complete, the CheckpointCoordinator announces that the checkpoint has finished (a hypothetical sketch of this bookkeeping follows the call stack below).

|-- CheckpointCoordinator
  \-- receiveAcknowledgeMessage
  \-- sendAcknowledgeMessages  // Notifies downstream that the checkpoint has completed. If a sink implements TwoPhaseCommitSinkFunction, this is where it commits; if the commit fails for some reason, a FlinkRuntimeException is thrown, the failed checkpoint id remains in pendingCommitTransactions, and the commit is retried when the checkpoint is restored.
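
A hypothetical sketch of the acknowledgement bookkeeping described above; the real PendingCheckpoint also tracks operator states, master states, and metrics:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of PendingCheckpoint's acknowledgement tracking.
class PendingCheckpointSketch {
    private final Set<String> notYetAckedTasks;
    private final Set<String> notYetAckedMasterHooks;

    PendingCheckpointSketch(Set<String> tasks, Set<String> masterHooks) {
        this.notYetAckedTasks = new HashSet<>(tasks);
        this.notYetAckedMasterHooks = new HashSet<>(masterHooks);
    }

    synchronized void acknowledgeTask(String taskId) {
        notYetAckedTasks.remove(taskId);
    }

    synchronized void acknowledgeMasterHook(String hookId) {
        notYetAckedMasterHooks.remove(hookId);
    }

    synchronized boolean isFullyAcknowledged() {
        // Only when every task and master hook has acked can this pending
        // checkpoint be converted into a CompletedCheckpoint.
        return notYetAckedTasks.isEmpty() && notYetAckedMasterHooks.isEmpty();
    }
}
```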

3.5 Checkpoint Recovery

This part of the code is relatively straightforward; interested readers can follow it along the call stack below.

|-- Task
  \-- run
  \-- doRun
|-- StreamTask
  \-- invoke
  \-- restoreInternal
  \-- restoreGates
|-- OperatorChain
  \-- initializeStateAndOpenOperators
|-- StreamOperator
  \-- initializeState
|-- StreamOperatorStateHandler
  \-- initializeOperatorState
|-- AbstractStreamOperator
  \-- initializeState
|-- StreamOperatorStateHandler
  \-- initializeOperatorState
|-- CheckpointedStreamOperator
  \-- initializeState # invokes user code
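
As an example of the user code invoked at the bottom of this stack, here is a minimal buffering sink implementing CheckpointedFunction, closely following the example in the Flink documentation; the class name BufferingSink and the state name are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

// A buffering sink whose initializeState is the "user code" invoked
// at the bottom of the restore call stack above.
public class BufferingSink implements SinkFunction<String>, CheckpointedFunction {

    private transient ListState<String> checkpointedState;
    private final List<String> buffer = new ArrayList<>();

    @Override
    public void invoke(String value, Context context) {
        buffer.add(value);
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        checkpointedState.clear();
        for (String element : buffer) {
            checkpointedState.add(element); // persist the buffer as operator state
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ListStateDescriptor<String> descriptor =
                new ListStateDescriptor<>("buffered-elements", String.class);
        checkpointedState = context.getOperatorStateStore().getListState(descriptor);

        if (context.isRestored()) { // recovering from a checkpoint
            for (String element : checkpointedState.get()) {
                buffer.add(element);
            }
        }
    }
}
```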

3.6 End to End Exactly Once

Implementing end-to-end exactly-once is considerably harder; consider a topology going from one source to N sinks. Flink therefore provides interfaces to help guarantee end-to-end exactly-once semantics (a skeleton example follows this list):

  • TwoPhaseCommitSinkFunction: a sink that wants exactly-once semantics extends this class.
  • CheckpointedFunction: hooks that are invoked when a checkpoint is taken and when state is initialized or restored.
  • CheckpointListener: as the name suggests, implementors are notified when a checkpoint completes or fails.
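
Below is a skeleton of a custom transactional sink. The external system and the MyTxn transaction handle are placeholders; only the overridden hooks are the real TwoPhaseCommitSinkFunction API:

```java
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

// Skeleton of a transactional sink built on TwoPhaseCommitSinkFunction.
public class TransactionalSink
        extends TwoPhaseCommitSinkFunction<String, TransactionalSink.MyTxn, Void> {

    public TransactionalSink() {
        super(new KryoSerializer<>(MyTxn.class, new ExecutionConfig()), VoidSerializer.INSTANCE);
    }

    @Override
    protected MyTxn beginTransaction() {
        return new MyTxn(); // open a transaction in the external system
    }

    @Override
    protected void invoke(MyTxn txn, String value, Context context) {
        txn.write(value); // write inside the open transaction
    }

    @Override
    protected void preCommit(MyTxn txn) {
        txn.flush(); // phase 1: runs when the checkpoint barrier reaches the sink
    }

    @Override
    protected void commit(MyTxn txn) {
        txn.commit(); // phase 2: runs when the checkpoint is confirmed complete
    }

    @Override
    protected void abort(MyTxn txn) {
        txn.rollback(); // runs when the checkpoint is aborted
    }

    // Placeholder transaction handle for an imaginary external system.
    public static class MyTxn {
        void write(String value) { /* ... */ }
        void flush() { /* ... */ }
        void commit() { /* ... */ }
        void rollback() { /* ... */ }
    }
}
```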

At present, Kafka is the only connector that implements exactly-once for both source and sink: upstream it supports resuming reads from a recorded offset, and downstream it supports transaction rollback or idempotent writes. Interested readers can study the connector's implementation of these interfaces.
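
For reference, a minimal sketch of wiring up the exactly-once Kafka sink in Flink 1.14; the broker address, topic name, and timeout value are illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ExactlyOnceKafkaSink {

    public static FlinkKafkaProducer<String> create() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Must not exceed the broker-side transaction.max.timeout.ms.
        props.setProperty("transaction.timeout.ms", "600000");

        KafkaSerializationSchema<String> schema = new KafkaSerializationSchema<String>() {
            @Override
            public ProducerRecord<byte[], byte[]> serialize(String element, Long timestamp) {
                return new ProducerRecord<>("output-topic",
                        element.getBytes(StandardCharsets.UTF_8));
            }
        };

        // EXACTLY_ONCE makes the sink take part in the two-phase commit:
        // preCommit flushes on the barrier, commit runs on checkpoint completion.
        return new FlinkKafkaProducer<>(
                "output-topic", schema, props, FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
    }
}
```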

4. Summary

This article approached the principle and implementation of Checkpoint from the angle of a concrete problem and briefly traced the relevant source code. The main code path is actually quite clear, but it touches a large number of classes; attentive readers will have noticed that this is the single responsibility principle at work. The implementation of TwoPhaseCommitSinkFunction is likewise a typical example of the template method design pattern.


泊浮目