This article was first published on: https://www.yuque.com/17sing

Version | Date | Remark
---|---|---
1.0 | 2022.3.8 | Article first published

The analysis in this article is based on the Flink 1.14 codebase.
0. Preface
A while ago, I ran into a strange phenomenon in production: a full synchronization job could not run normally, and the logs were full of errors like java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id container xxxx(HOSTNAME:PORT) timed out.
The scenario: a full extraction from Oracle to Hive, with the data flowing through Kafka. The data volume is at the TB level, and one partition is created per day based on a time field. The job that reported the error reads the Kafka data and writes it into Hive using the Table API.
1. Troubleshooting approach
By the time this problem reached me, colleagues had already done a round of investigation. Online search results typically say that YARN is under too much pressure or that the network was briefly unstable, and suggest increasing heartbeat.timeout to alleviate the problem; the problem remained after the adjustment.
Another common explanation is frequent GC, with the recommendation to adjust memory settings. The adjustment did have some effect (the problem appeared later), which clearly suggested the issue was related to the code.
Because data had been synchronized without problems in the previous version, I started looking for recent code changes, but after several passes I found nothing suspicious. My scalp started to tingle a little. So I asked the on-site colleagues to switch back to the previous version and rerun the full synchronization, and the same phenomenon occurred.
At this point I began to suspect characteristics of the production environment, such as the data itself, but the on-site colleagues told me there was nothing special about the data. So I asked for a heap dump, loaded it into an analysis tool, and found a large number of org.apache.flink.streaming.api.functions.sink.filesystem.Bucket objects.
So I looked at the definition of the Bucket class:
/**
* A bucket is the directory organization of the output of the {@link StreamingFileSink}.
*
* <p>For each incoming element in the {@code StreamingFileSink}, the user-specified {@link
* BucketAssigner} is queried to see in which bucket this element should be written to.
*/
@Internal
public class Bucket<IN, BucketID> {
Well, well. One object per directory. At that moment I started to doubt the claim that "there was nothing special about the data", but to get hard evidence I still followed the code (a simplified sketch of the final step follows the call chain below):
|-- HiveTableSink
\-- createStreamSink
|-- StreamingFileSink
\-- initializeState
|-- StreamingFileSinkHelper
\-- constructor
|-- HadoopPathBasedBulkFormatBuilder
\-- createBuckets
|-- Buckets
\-- onElement
\-- getOrCreateBucketForBucketId
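At the end of this chain, getOrCreateBucketForBucketId creates a new Bucket for every bucket id (that is, every output directory) it has not seen before and keeps it in an in-memory map. Below is a simplified sketch of that behavior under my own naming (BucketsSketch is not a Flink class):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified sketch, not the verbatim Flink source: every previously unseen bucket id
// (i.e. every new output directory) creates a new per-directory object that stays in
// the activeBuckets map until it is explicitly removed.
class BucketsSketch<IN, BucketID, BUCKET> {

    private final Map<BucketID, BUCKET> activeBuckets = new HashMap<>();
    private final Function<IN, BucketID> bucketAssigner;    // decides the target directory
    private final Function<BucketID, BUCKET> bucketFactory; // creates the per-directory object

    BucketsSketch(Function<IN, BucketID> bucketAssigner, Function<BucketID, BUCKET> bucketFactory) {
        this.bucketAssigner = bucketAssigner;
        this.bucketFactory = bucketFactory;
    }

    BUCKET onElement(IN element) {
        BucketID bucketId = bucketAssigner.apply(element);
        // getOrCreateBucketForBucketId: the map only grows here; nothing in this path shrinks it
        return activeBuckets.computeIfAbsent(bucketId, bucketFactory);
    }

    int activeBucketCount() {
        return activeBuckets.size();
    }
}
```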
After going through the code I had a rough idea of what was happening. I asked the on-site colleagues whether the time span of the synchronized data was particularly large; they confirmed that it covered more than 3 years. I therefore suggested reducing the time span or coarsening the partition granularity. In the end the problem was solved by splitting the full synchronization into several batches.
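For illustration, here is a minimal sketch of a date-based BucketAssigner (DailyBucketAssigner and its extractTimestamp helper are hypothetical, not code from the production job). Every distinct day in the data maps to its own bucket id, so a time field spanning more than 3 years of daily partitions produces over a thousand distinct bucket ids, each backed by its own Bucket object in the sink:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

// Hypothetical assigner: one bucket id (= one directory, = one Bucket object) per calendar day.
public class DailyBucketAssigner<T> implements BucketAssigner<T, String> {

    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneId.of("UTC"));

    @Override
    public String getBucketId(T element, Context context) {
        // extractTimestamp is a placeholder for reading the event's time field;
        // with a 3+ year span this yields 1000+ distinct bucket ids.
        return "dt=" + DAY.format(Instant.ofEpochMilli(extractTimestamp(element)));
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }

    private long extractTimestamp(T element) {
        return System.currentTimeMillis(); // placeholder implementation
    }
}
```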
2. Curiosity after solving the problem
If every directory produces a Bucket, then a long-running streaming job would hit the same problem sooner or later. Such an obvious issue must have been considered by the community long ago, and curiosity drove me to find the answer, until I saw this code:
public void commitUpToCheckpoint(final long checkpointId) throws IOException {
    final Iterator<Map.Entry<BucketID, Bucket<IN, BucketID>>> activeBucketIt =
            activeBuckets.entrySet().iterator();

    LOG.info(
            "Subtask {} received completion notification for checkpoint with id={}.",
            subtaskIndex,
            checkpointId);

    while (activeBucketIt.hasNext()) {
        final Bucket<IN, BucketID> bucket = activeBucketIt.next().getValue();
        bucket.onSuccessfulCompletionOfCheckpoint(checkpointId);

        if (!bucket.isActive()) {
            // We've dealt with all the pending files and the writer for this bucket is not
            // currently open.
            // Therefore this bucket is currently inactive and we can remove it from our state.
            activeBucketIt.remove();
            notifyBucketInactive(bucket);
        }
    }
}
When committing after a checkpoint completes, the sink decides whether to remove the in-memory data structure for each bucket based on whether that bucket is still active.
So what does it mean to be active? The code is very short:
boolean isActive() {
    return inProgressPart != null
            || !pendingFileRecoverablesForCurrentCheckpoint.isEmpty()
            || !pendingFileRecoverablesPerCheckpoint.isEmpty();
}
The next step is to clarify when each of these three conditions is triggered.
2.1 inProgressPart == null
This field is of type InProgressFileWriter. It becomes null once the currently open part file is closed, and when that happens is determined by the rolling policy of the filesystem sink.
/**
 * The policy based on which a {@code Bucket} in the {@code Filesystem Sink} rolls its currently
 * open part file and opens a new one.
 */
@PublicEvolving
public interface RollingPolicy<IN, BucketID> extends Serializable {

    /**
     * Determines if the in-progress part file for a bucket should roll on every checkpoint.
     *
     * @param partFileState the state of the currently open part file of the bucket.
     * @return {@code True} if the part file should roll, {@link false} otherwise.
     */
    boolean shouldRollOnCheckpoint(final PartFileInfo<BucketID> partFileState) throws IOException;

    /**
     * Determines if the in-progress part file for a bucket should roll based on its current state,
     * e.g. its size.
     *
     * @param element the element being processed.
     * @param partFileState the state of the currently open part file of the bucket.
     * @return {@code True} if the part file should roll, {@link false} otherwise.
     */
    boolean shouldRollOnEvent(final PartFileInfo<BucketID> partFileState, IN element)
            throws IOException;

    /**
     * Determines if the in-progress part file for a bucket should roll based on a time condition.
     *
     * @param partFileState the state of the currently open part file of the bucket.
     * @param currentTime the current processing time.
     * @return {@code True} if the part file should roll, {@link false} otherwise.
     */
    boolean shouldRollOnProcessingTime(
            final PartFileInfo<BucketID> partFileState, final long currentTime) throws IOException;
}
These three methods decide, under different circumstances, whether the currently open part file should be closed (a configuration sketch follows this list):
- shouldRollOnCheckpoint: checked before each checkpoint is taken.
- shouldRollOnEvent: checked for each incoming element based on the current state, for example whether the size of the current part file exceeds the limit.
- shouldRollOnProcessingTime: checked periodically to decide whether the currently open file has been open for too long.
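As a concrete example, here is a minimal sketch of wiring these conditions up through DefaultRollingPolicy, following the pattern shown in the Flink documentation; it assumes a row-format StreamingFileSink (bulk formats roll on checkpoints by default), and the thresholds are illustrative:

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

public class RollingPolicyExample {

    // Builds a row-format sink whose rolling behavior maps onto the three methods above.
    public static StreamingFileSink<String> buildSink(String outputPath) {
        return StreamingFileSink
                .forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8"))
                .withRollingPolicy(
                        DefaultRollingPolicy.builder()
                                // shouldRollOnProcessingTime: roll after the file has been open for 15 min
                                .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
                                // shouldRollOnProcessingTime: roll after 5 min without new records
                                .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
                                // shouldRollOnEvent / shouldRollOnCheckpoint: roll once the part file exceeds 128 MB
                                .withMaxPartSize(128 * 1024 * 1024)
                                .build())
                .build();
    }
}
```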
2.2 pendingFileRecoverablesForCurrentCheckpoint isNotEmpty
Elements are added to this list when the RollingPolicy triggers a roll (the closed part file becomes a pending file for the current checkpoint), so there is not much more to explain here.
2.3 pendingFileRecoverablesPerCheckpoint isNotEmpty
This builds on pendingFileRecoverablesForCurrentCheckpoint: when a checkpoint is taken, the pending files collected so far are moved into a map keyed by checkpoint id, i.e. a mapping from CheckpointId to List<InProgressFileWriter.PendingFileRecoverable>.
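Putting 2.2 and 2.3 together, here is a simplified, illustrative sketch (my own naming, not the actual Flink code) of how the two collections evolve around a checkpoint:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative sketch only, not the actual Flink code.
class PendingFilesSketch<PendingFile> {

    // files closed (rolled) since the last checkpoint
    private final List<PendingFile> pendingForCurrentCheckpoint = new ArrayList<>();

    // checkpoint id -> files waiting for that checkpoint to complete
    private final NavigableMap<Long, List<PendingFile>> pendingPerCheckpoint = new TreeMap<>();

    void onFileRolled(PendingFile pendingFile) {
        pendingForCurrentCheckpoint.add(pendingFile);
    }

    void snapshotState(long checkpointId) {
        // move everything rolled so far under the checkpoint that is being taken
        if (!pendingForCurrentCheckpoint.isEmpty()) {
            pendingPerCheckpoint.put(checkpointId, new ArrayList<>(pendingForCurrentCheckpoint));
            pendingForCurrentCheckpoint.clear();
        }
    }

    void onCheckpointComplete(long checkpointId) {
        // commit and forget everything up to (and including) this checkpoint
        pendingPerCheckpoint.headMap(checkpointId, true).clear();
    }

    boolean isActive(boolean hasInProgressPart) {
        return hasInProgressPart
                || !pendingForCurrentCheckpoint.isEmpty()
                || !pendingPerCheckpoint.isEmpty();
    }
}
```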
2.4 Inactive Bucket
Putting the conditions together: a Bucket whose in-progress part file has been closed and whose pending files have all been committed by completed checkpoints is an inactive Bucket. The check is performed mainly at two points:
- when a task restores and reads its previous state from the StateBackend;
- after a checkpoint completes.
When a Bucket becomes inactive, an inactive notification is sent to inform downstream operators that the data of that partition has been committed and is now readable. See the issue: Partition commit is delayed when records keep coming.
3. Clean Architecture in FileSystemConnector
After understanding the points above, I noticed the proposal FLIP-115: Filesystem connector in Table. Following it, I skimmed the relevant source code and found that the implementation also reflects a clean architecture.
Having analyzed the source code above, let's now look at the abstractions, responsibilities, and layering:
|-- HiveTableSink  # Table-level API; the external entry point that users can call directly
|-- StreamingFileSink  # Streaming-level API; can also be used externally, sits below the Table API
|-- StreamingFileSinkHelper  # Integrates the TimeService logic so Buckets can be closed periodically, and dispatches incoming data to the Buckets; also used by AbstractStreamingWriter, whose comments suggest reusing it in a RichSinkFunction or StreamOperator
|-- BucketsBuilder  # The concrete class in this scenario is HadoopPathBasedBulkFormatBuilder, which determines the concrete Buckets implementation and the concrete BucketWriter implementation
|-- Buckets  # Manages the lifecycle of Buckets; its key member objects are:
  |-- BucketWriter  # Corresponds to the concrete FileSystem implementation and the Format being written
  |-- RollingPolicy  # The rolling policy discussed earlier; not covered further here
  |-- BucketAssigner  # Decides which Bucket each element is written to, e.g. by key or by date
  |-- BucketFactory  # Responsible for creating each Bucket
Because responsibilities are divided at a fine granularity, the data-flow logic is decoupled from the concrete external implementations. A few examples:
- If we want to write into Hive from our own DSL, we only need to write a TableSink similar to HiveTableSink.
- If a data warehouse (or data lake) keeps adding support for its own underlying file systems, then once the first version of the code is built, it only needs to implement the corresponding BucketWriter and FileSystem later on.
- If a data warehouse (or data lake) keeps adding Formats of its own, then once the first version of the code is built, it only needs to implement the corresponding BucketWriter later on (a small format sketch follows this list).
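As a small illustration of this decoupling, here is a hypothetical row Format plugged into the sink through the public Encoder interface (JsonLinesEncoder and toJson are placeholders); adding it does not touch the Bucket lifecycle logic at all:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.serialization.Encoder;

// Hypothetical example: a new row Format for the filesystem sink, implemented by
// writing only the format-encoding piece; the Bucket lifecycle logic is untouched.
public class JsonLinesEncoder<T> implements Encoder<T> {

    @Override
    public void encode(T element, OutputStream stream) throws IOException {
        // toJson is a placeholder for whatever serialization the warehouse uses.
        stream.write(toJson(element).getBytes(StandardCharsets.UTF_8));
        stream.write('\n');
    }

    private String toJson(T element) {
        return String.valueOf(element); // placeholder implementation
    }
}
```

Such an encoder would then be passed to StreamingFileSink.forRowFormat(...) in place of, say, SimpleStringEncoder, with no changes to the bucketing, rolling, or checkpointing code.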
With this design, the core logic rarely needs to change, the parts that change frequently are isolated, and the quality of the whole module is easier to guarantee.