This article is shared by Zheng Zhisheng, head of the Bilibili big data real-time platform. It explains how the trillion-scale transmission and distribution architecture was implemented and how a complete set of real-time pre-processing pipelines for AI was built on Flink. The sharing covers four aspects:
1. The past and present of real-time computing at Bilibili
2. The Flink on YARN incremental pipeline solution
3. Some engineering practices in the direction of Flink and AI
4. Future development and thinking
1. The past and present of real-time computing at Bilibili
1. Scenario coverage of the ecosystem
When we talk about real-time computing, the key word is data timeliness. Looking first at the scenario coverage of the whole big data ecosystem: in the early days of big data, the core scenario was day-level offline computing. Most data had day-level freshness, and the emphasis was on balancing time and cost.
As data applications, data analysis, and data warehouses became widespread and mature, more and more people raised higher requirements for data timeliness. For example, when data is used for real-time recommendation, its freshness determines its value. In this context, real-time computing scenarios were born.
In practice, however, we also ran into many scenarios that do not actually demand very high real-time freshness. Between millisecond- or second-level real-time scenarios and day-level offline scenarios, new scenarios inevitably appear in which data is computed incrementally at minute granularity. Offline computing emphasizes cost; real-time computing emphasizes the value of timeliness; incremental computing balances cost against overall value and time.
2. Data timeliness at Bilibili
How is Bilibili's workload divided across these three dimensions? Currently, about 75% of the data is supported by offline computing, another 20% of scenarios by real-time computing, and 5% by incremental computing.
- Real-time computing mainly serves real-time machine learning, real-time recommendation, advertising and search, data applications, real-time channel analysis and delivery, reports, OLAP, monitoring, and so on;
- Offline computing covers the widest range of data and is mainly centered on the data warehouse;
- Incremental computing covers some scenarios newly launched this year, such as incremental Upsert of binlog data.
3. Poor timeliness of ETL
Regarding timeliness, we hit many pain points in the early days, concentrated in three areas:
- First, the transmission pipeline lacked computing power. In the early design, data was simply dropped into the ODS layer day by day, and the DW layer only scanned the previous day's ODS data after midnight, which meant the data could not be cleaned in advance;
- Second, the resource-heavy jobs were concentrated in the early morning, putting very high pressure on resource scheduling;
- Third, the gap between real-time and offline was hard to bridge: for most data, a purely real-time pipeline costs too much, while a purely offline pipeline is too slow. On top of that, the latency of importing MySQL data was insufficient. Bilibili's danmaku (bullet comment) data, for example, is extremely large, and synchronizing such business tables often took more than ten hours and was very unstable.
4. Complexity of real-time AI engineering
Beyond timeliness, we also ran into more complicated problems in real-time AI engineering early on:
- First, the efficiency of feature computation: the same feature computed in real time also needs to be backfilled over offline data, so the computation logic gets developed twice;
- Second, the real-time link is long. A complete real-time recommendation link covers more than a dozen jobs, N real-time and M offline, and when troubleshooting is needed, the operation, maintenance, and control cost of the whole link is very high;
- Third, as more AI and algorithm engineers joined, experiment iteration was difficult to scale horizontally.
5. Ecosystem practice around Flink
Against these key pain points, we focused our practice on the Flink ecosystem, covering the real-time data warehouse, the incremental ETL pipeline, and machine-learning scenarios for AI. This sharing concentrates on the incremental pipeline and on the combination of Flink and AI. The figure below shows the overall scale: transmission and computation currently involve 30,000+ compute cores, 1,000+ jobs, and more than 100 users, at a trillion-message scale.
2. The Flink on YARN incremental pipeline solution
1. Early architecture
Let's look at the early architecture of the pipeline. As the figure below shows, data was consumed from Kafka by Flume and dropped onto HDFS. Flume used its transaction mechanism to guarantee consistency from Source to Channel to Sink. After the data landed on HDFS, the downstream scheduler scanned the directory for tmp files to determine whether the data was ready, and then scheduled and pulled up the downstream offline ETL jobs.
2. Pain points
We ran into many pain points at this stage.
The first key one is data quality.
- We first used MemoryChannel, which could lose data. We then tried FileChannel, but its performance could not meet the requirements. Moreover, when HDFS was unstable, Flume's transaction mechanism rolled data back into the Channel, which led to repeated data duplication; when HDFS was extremely unstable, the duplication rate could reach the percent level;
- LZO row storage: the early transmission format was delimiter-separated records, whose schema constraint is weak and which does not support nested formats.
- The second point is data timeliness: minute-level queries could not be provided, because Flume has no checkpoint-based file rolling mechanism like Flink's and mostly relies on an idle mechanism to close files;
- The third point is downstream ETL linkage. As mentioned above, readiness was mostly detected by scanning the tmp directory, so the scheduler made a large number of Hadoop list API calls against the NameNode, putting heavy pressure on it.
3. Stability-related pain points
There were also many stability problems:
- First, Flume is stateless: after a node crashed or restarted, tmp files could not be closed properly;
- Second, the early deployment did not run on the big data environment but on physical machines, which made resource scaling hard to control and kept costs relatively high;
- Third, there were communication problems between Flume and HDFS. For example, when writes to HDFS were blocked, the blockage on one node back-pressured into the Channel, the Source then stopped consuming from Kafka and stopped advancing its offsets, which to some extent triggered Kafka rebalances and ultimately stopped the global offsets from moving forward, causing data to pile up.
4. DAG view of the trillion-scale incremental pipeline
Given these pain points, the core of the solution was to build a trillion-scale incremental pipeline on Flink. The figure below is the DAG view at runtime.
First, under the Flink architecture, the Kafka source eliminates the rebalance avalanche: even if some parallel instances in the DAG are blocked writing to HDFS, this no longer blocks all Kafka partitions globally. In addition, the essence of the whole framework is to implement extensible nodes through Transform modules:
- The first layer is the Parser node, which handles decompression, deserialization, and other parsing operations;
- The second layer introduces customized ETL modules provided to users, so that data can be cleaned in a customized way inside the pipeline;
- The third layer is the Exporter module, which supports exporting data to different storage media: data written to HDFS is exported as Parquet, and data written to Kafka as pb. A ConfigBroadcast module is also introduced on the DAG to solve the real-time update and hot loading of pipeline metadata. The whole link checkpoints every minute and appends the actual incremental data, so minute-level queries can be provided. A rough sketch of such a topology follows below.
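To make the topology concrete, here is a minimal sketch of how such a pipeline could be wired up with Flink's DataStream API. It is an illustration under assumptions, not the actual implementation: ParserFunction, UserEtlBroadcastFunction, ParquetExporterSink, the topic names, and kafkaProps are all hypothetical stand-ins for the modules described above.

```java
// Minimal topology sketch; imports from the Flink DataStream/Kafka connector API omitted.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000);  // checkpoint every minute -> minute-level visibility of appended data

// Raw (compressed, serialized) records from the ODS Kafka topic.
DataStream<byte[]> raw = env.addSource(new FlinkKafkaConsumer<>(
        "ods_topic",
        new AbstractDeserializationSchema<byte[]>() {
            @Override
            public byte[] deserialize(byte[] message) { return message; }  // pass raw bytes through
        },
        kafkaProps));

// Pipeline metadata (schemas, routing, user ETL config) broadcast to every subtask for hot reload.
MapStateDescriptor<String, String> configDescriptor =
        new MapStateDescriptor<>("pipeline-config", String.class, String.class);
BroadcastStream<String> configBroadcast = env
        .addSource(new FlinkKafkaConsumer<>("pipeline_config_topic", new SimpleStringSchema(), kafkaProps))
        .broadcast(configDescriptor);

raw.map(new ParserFunction())                    // layer 1: decompress + deserialize
   .connect(configBroadcast)
   .process(new UserEtlBroadcastFunction())      // layer 2: user-defined cleaning, config hot-loaded
   .addSink(new ParquetExporterSink());          // layer 3: export as Parquet to HDFS (or pb to Kafka)

env.execute("incremental-pipeline");
```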
5. Overall view of the trillion-scale incremental pipeline
In the overall Flink on YARN architecture, the pipeline view is divided into units per BU. Each Kafka topic represents the distribution of one type of data terminal, and a Flink job is dedicated to the write processing of each terminal type. The view also shows that, for binlog data, the pipeline is assembled end to end and operated as multiple nodes.
6. Technical Highlights
Next, let's look at the core technical highlights of the architecture. The first three are real-time functional features; the last three are mainly optimizations at the non-functional level.
- Data model: format convergence is achieved mainly through Parquet and the mapping from Protobuf to Parquet;
- Partition notification: one pipeline actually handles multiple streams, so the core is a partition-ready notification mechanism for multi-stream data;
- CDC pipeline: binlog plus HUDI is used to solve the Upsert problem;
- Small files: file merging is solved at runtime through the DAG topology;
- HDFS communication: optimization of a number of key issues at the trillion scale;
- Finally, some optimizations for partition fault tolerance.
6.1 Data Model
Early business development assembled each data record by concatenating strings. Later this was reorganized through model definition and management: at the platform entrance, users register each stream and each table together with its schema, the schema generates a Protobuf file, and users can download the model file corresponding to the Protobuf on the platform. In this way, client-side development is constrained by the strong schema of pb.
Looking at the runtime process: the Kafka source consumes every incoming RawEvent record, and a RawEvent contains a PBEvent object, where a PBEvent is a Protobuf record. Data flows from the Source to the Parser module, which parses it into PBEvents. The schema models entered by users on the platform are stored on the OSS object store, and the Exporter module dynamically loads model changes. The pb file is then used to reflectively generate the concrete event object, which is finally mapped into the Parquet format. A lot of cache and reflection optimization was done here, which improved the dynamic pb parsing performance by a factor of six. Finally, the data lands on HDFS in Parquet format.
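As an illustration of the cache-plus-reflection idea, here is a minimal sketch of dynamic Protobuf parsing with a per-stream descriptor cache. CachedPbParser and SchemaLoader are assumed names, and the mapping of the resulting DynamicMessage into Parquet is omitted.

```java
import com.google.protobuf.Descriptors.Descriptor;
import com.google.protobuf.DynamicMessage;
import java.util.concurrent.ConcurrentHashMap;

public class CachedPbParser {

    // streamId -> compiled descriptor; resolved once per stream instead of per record.
    private static final ConcurrentHashMap<String, Descriptor> DESCRIPTOR_CACHE =
            new ConcurrentHashMap<>();

    private final SchemaLoader schemaLoader;   // hypothetical: resolves the schema stored on OSS

    public CachedPbParser(SchemaLoader schemaLoader) {
        this.schemaLoader = schemaLoader;
    }

    public DynamicMessage parse(String streamId, byte[] pbEventPayload) throws Exception {
        Descriptor descriptor =
                DESCRIPTOR_CACHE.computeIfAbsent(streamId, schemaLoader::loadDescriptor);
        // DynamicMessage avoids per-schema code generation; the parsed message is then
        // mapped field by field into a Parquet record (mapping omitted here).
        return DynamicMessage.parseFrom(descriptor, pbEventPayload);
    }

    /** Hypothetical interface: turns a stream's registered schema into a Protobuf Descriptor. */
    public interface SchemaLoader {
        Descriptor loadDescriptor(String streamId);
    }
}
```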
6.2 Partition notification optimization
As mentioned earlier, one pipeline handles hundreds of streams. In the early Flume architecture, each Flume node could hardly even sense its own processing progress, let alone the global progress. With Flink, this can be solved through the watermark mechanism.
The Source generates a watermark based on the event time in each message; the watermark is passed through every processing layer down to the Sink, and finally a Committer module aggregates the watermark progress of all streams in a single-threaded manner. When it finds that the global watermark has advanced into the next hour's partition, it sends a message to the Hive Metastore, or writes one to Kafka, notifying that the previous hour's partition data is ready, so that downstream scheduling can pull up jobs faster in a message-driven way.
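A minimal sketch of the committer's decision logic, assuming hourly partitions. HourlyPartitionCommitter and PartitionNotifier are illustrative names; the real module runs as a single-parallelism operator fed by the aggregated watermark.

```java
/** Illustrative callback: an update to the Hive Metastore or a message published to Kafka. */
interface PartitionNotifier {
    void partitionReady(long partitionHourStartMs);
}

public class HourlyPartitionCommitter {

    private static final long HOUR_MS = 3_600_000L;

    private final PartitionNotifier notifier;
    private long lastNotifiedHourStart = Long.MIN_VALUE;

    public HourlyPartitionCommitter(PartitionNotifier notifier) {
        this.notifier = notifier;
    }

    /** Called whenever the globally aggregated watermark advances. */
    public void onGlobalWatermark(long watermarkMs) {
        long currentHourStart = watermarkMs - (watermarkMs % HOUR_MS);
        long completedHourStart = currentHourStart - HOUR_MS;   // the hour the watermark has left behind
        if (completedHourStart > lastNotifiedHourStart) {
            // The global watermark entered the next hour, so the previous hour's partition is ready.
            notifier.partitionReady(completedHourStart);
            lastNotifiedHourStart = completedHourStart;
        }
    }
}
```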
6.3 CDC pipeline optimization
The right side of the figure below shows the complete link of the CDC pipeline. To achieve a complete mapping of MySQL data into Hive, both streaming and batch processing have to be solved.
First, all MySQL data is synchronized to HDFS once through DataX. A Spark job then initializes the data into the initial HUDI snapshot. MySQL binlog data is pulled into a Kafka topic through Canal, and a Flink job merges the incremental data with the initial snapshot, performing incremental updates and finally forming the HUDI table.
The whole link has to ensure that data is neither lost nor duplicated. The key point is that Canal writes to Kafka with the transaction mechanism enabled, which guarantees that data landing on the Kafka topic is transmitted without loss or duplication. Above the transport layer, duplication or loss can still occur, so a globally unique id plus a millisecond timestamp are used: in the streaming job, data is deduplicated by the global id and ordered by the millisecond timestamp, which guarantees that updates reach HUDI in order, roughly as sketched below.
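Below is a minimal sketch of that deduplication/ordering step as a Flink KeyedProcessFunction, applied after keying the stream by the global id. The record shape (globalId, tsMs, payload) is an assumption, and the real job may organize this differently.

```java
// Sketch only; imports from the Flink DataStream API omitted.
public class BinlogDedupFunction extends KeyedProcessFunction<
        String, BinlogDedupFunction.BinlogRecord, BinlogDedupFunction.BinlogRecord> {

    private transient ValueState<Long> lastTsMs;

    @Override
    public void open(Configuration parameters) {
        lastTsMs = getRuntimeContext().getState(
                new ValueStateDescriptor<>("last-ts-ms", Long.class));
    }

    @Override
    public void processElement(BinlogRecord record, Context ctx, Collector<BinlogRecord> out)
            throws Exception {
        // Keyed by the globally unique row id; only a strictly newer millisecond timestamp is
        // forwarded, so duplicates and stale retransmissions are dropped before the HUDI upsert.
        Long seen = lastTsMs.value();
        if (seen == null || record.tsMs > seen) {
            lastTsMs.update(record.tsMs);
            out.collect(record);
        }
    }

    /** Assumed shape of a record on the Kafka binlog topic. */
    public static class BinlogRecord {
        public String globalId;   // globally unique row id, used as the key
        public long tsMs;         // millisecond timestamp, used for ordering
        public byte[] payload;    // the binlog change itself
    }
}
```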
In addition, a Trace system built on ClickHouse is used to store the counts of records entering and leaving each node, so that the data can be compared precisely.
6.4 Stability: merging small files
As mentioned earlier, after the migration to Flink we checkpoint every minute, and the number of files ballooned badly. So a merge operator was introduced into the DAG to merge files. The merge is horizontal, by parallelism: one writer corresponds to one merge instance. In this way, the files accumulated across five-minute checkpoint intervals (twelve per hour) are merged, which keeps the number of files within a reasonable range. A rough sketch of the idea follows.
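The sketch below assumes upstream writers emit the paths of the small files they have closed. SmallFileMergeFunction and FileMerger are hypothetical names; a production version would also keep pendingFiles in checkpointed state so it survives restarts.

```java
// Sketch only; imports from the Flink DataStream API omitted.
public class SmallFileMergeFunction extends ProcessFunction<String, Void>
        implements CheckpointListener {

    private final FileMerger merger;     // hypothetical helper doing the actual Parquet rewrite
    private final int filesPerMerge;     // e.g. 12 small files compacted into one
    private final List<String> pendingFiles = new ArrayList<>();

    public SmallFileMergeFunction(FileMerger merger, int filesPerMerge) {
        this.merger = merger;
        this.filesPerMerge = filesPerMerge;
    }

    @Override
    public void processElement(String closedFilePath, Context ctx, Collector<Void> out) {
        // Horizontal merge: one merge instance handles the files of one upstream writer.
        pendingFiles.add(closedFilePath);
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) throws Exception {
        // Compact only files already covered by a completed checkpoint, so a failure
        // never merges data that could still be rolled back.
        if (pendingFiles.size() >= filesPerMerge) {
            merger.merge(new ArrayList<>(pendingFiles));
            pendingFiles.clear();
        }
    }

    /** Hypothetical merge action, e.g. rewriting N small Parquet files into one larger file. */
    public interface FileMerger extends java.io.Serializable {
        void merge(List<String> smallFilePaths) throws Exception;
    }
}
```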
6.5 HDFS communication
In actual operation, we frequently saw serious backlog in the jobs, and analysis showed it was related to HDFS communication.
Sorting it out, HDFS communication involves four key stages: initializing state, Invoke, Snapshot, and Notify Checkpoint Complete.
The core problem occurs mainly in the Invoke stage: when a file reaches its rolling condition, flush and close are triggered, and the close call's actual communication with the NameNode is often blocked.
The Snapshot stage has a similar problem: once the hundreds of streams in one pipeline trigger a snapshot, the serial execution of flush and close is also very slow.
The core optimization focuses on three aspects:
- First, reduce file cutting, i.e. the frequency of close. In the Snapshot stage, files are no longer closed but kept open and appended to; the trade-off is that the initializing-state stage then needs to truncate the file for recovery.
- Second, asynchronous close: the close action no longer blocks the processing of the whole link. The files being closed during Invoke and Snapshot are tracked in state, and the files are recovered when state is initialized. A sketch is given after this list.
- Third, for multiple streams, Snapshot is processed in parallel. With checkpoints every five minutes, multiple streams correspond to multiple buckets that used to be processed serially in a loop; handling them with multiple threads reduces checkpoint timeouts.
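As an illustration of the asynchronous-close idea, here is a minimal, Flink-independent sketch. AsyncCloseManager is an assumed name; in the real job the pending paths would be written into operator state at snapshot time and recovered (for example via lease recovery or truncation) when the state is initialized.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncCloseManager implements Closeable {

    private final ExecutorService closer = Executors.newSingleThreadExecutor();
    // Paths whose close has been requested but not yet confirmed; these are what the
    // real job would persist into operator state at snapshot time.
    private final Set<String> pendingClose = ConcurrentHashMap.newKeySet();

    /** Hand a rolled file to the background closer so the main processing path is never blocked. */
    public void closeAsync(String path, OutputStream stream) {
        pendingClose.add(path);
        closer.submit(() -> {
            try {
                stream.close();                 // the slow NameNode round trip happens off the hot path
                pendingClose.remove(path);
            } catch (IOException e) {
                // Left in pendingClose: recovery will truncate/recover this file from state.
            }
        });
    }

    /** Snapshot hook: the still-pending paths go into operator state. */
    public Set<String> snapshotPendingCloses() {
        return new HashSet<>(pendingClose);
    }

    @Override
    public void close() {
        closer.shutdown();
    }
}
```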
6.6 Partition fault-tolerance optimizations
With multiple streams in one pipeline, the data of some streams is not continuous in every hour.
This leads to partitions whose watermark cannot advance normally, i.e. the empty-partition problem. We therefore introduced a PartitionRecover module into the pipeline, which advances the partition notification based on the watermark. If a stream's watermark has not been updated within an idle timeout, the Recover module acts as a fallback for the partition: at the end of each partition, plus a delay, it scans the watermarks of all streams and fills in the missing partition notifications.
During transmission, a restart of the Flink job leaves behind a wave of zombie files. We use the commit node in the DAG to clean up and delete zombie files before the partition notification, achieving zombie-file cleanup across the whole link. These are some of the optimizations at the non-functional level.
3. Some engineering practices in the direction of Flink and AI
1. Architecture evolution timeline
The figure below shows the complete timeline of the AI direction within the real-time architecture.
- As early as 2018, experiment development by algorithm engineers was workshop-style: each engineer developed their experimental projects in whatever language they were familiar with, such as Python, PHP, or C++. Maintenance cost was very high and failures were frequent;
- In the first half of 2019, engineering support for the algorithms was provided mainly through Flink's jar-package mode; in that period the support was mostly about stability and generality;
- In the second half of 2019, the self-developed BSQL greatly lowered the threshold of model training and solved real-time label and instance generation, improving the efficiency of experiment iteration;
- In the first half of 2020, the work revolved around feature computation, unified stream-batch computation, and improving the efficiency of feature engineering;
- In the second half of 2020, the focus shifted to the experiment workflow itself and the introduction of AIFlow to conveniently orchestrate stream-batch DAGs.
2. Review of AI Engineering Architecture
Looking back at the AI engineering, its early architecture diagram reflects the AI architecture view at the beginning of 2019: the model-training link was pulled up by many computing nodes composed of single tasks in a mix of languages. After the 2019 iteration, the whole near-line training was completely switched to the BSQL mode for development and iteration.
3. Current pain points
At the end of 2019, we ran into some new problems, concentrated in two dimensions: functional and non-functional.
At the functional level:
- First, the whole link from label generation to instance stream generation, to model training, to online prediction, and even to the real experiment results, is very long and complicated;
- Second, integrating real-time features, offline features, and stream-batch computation involves a great many jobs, and the whole link is very complicated. Features have to be computed both for experiments and online, and inconsistent results cause problems in the final effect. Moreover, it is hard to find out where a feature lives, and there is no way to trace it back.
At the non-functional level:
- Algorithm engineers often do not know what a checkpoint is, whether it should be enabled, or how to configure it;
- Online problems are not easy to troubleshoot, because the whole link is very long;
- A complete experiment consumes a lot of resources, but the algorithm side does not know what these resources are or how much is needed.
These problems cause a lot of confusion for the algorithm engineers.
4. Boiling down the pain points
In the final analysis, the pain points concentrate in three areas:
- The first is consistency. From data preprocessing to model training to prediction, the links are disjointed, with inconsistent data as well as inconsistent computation logic;
- Second, experiment iteration is very slow. For a complete experiment link, an algorithm engineer needs to master a lot of things, and the materials behind experiments cannot be shared; for example, some features have to be redeveloped for every experiment;
- Third, the cost of operation, maintenance, and control is relatively high.
A complete experiment link consists of a real-time project plus an offline project, and online problems are difficult to troubleshoot.
5. The prototype of real-time AI engineering
With these pain points in mind, the main focus in 2020 was on building the prototype of real-time AI engineering, breaking through in the following three aspects.
- The first is BSQL capability: the hope is that the algorithm side can develop in a SQL-oriented way to reduce engineering investment;
- The second is feature engineering: solving the core problems of feature computation in order to support the feature requirements;
- The third is experiment collaboration. The ultimate purpose of the algorithm side is experimentation; the hope is to build end-to-end experiment collaboration and eventually achieve "one-click experiments" for the algorithm side.
6. Feature engineering: difficulties
We ran into several difficulties in feature engineering.
- The first is real-time feature computation: since its results are consumed by the online prediction service, its requirements for latency and stability are very high;
- The second is keeping real-time and offline computation logic consistent. We often encounter a real-time feature that needs to be backfilled over the offline data of the past 30 to 60 days, and the question is how the real-time feature's computation logic can be reused in the offline feature computation;
- The third is that stream-batch integration of offline features is hard to achieve: the computation logic of real-time features often involves streaming concepts such as windows and timers, which offline features do not have.
7. Real-time features
Let's look at how we handle real-time features. The right side of the figure shows some of the most typical scenarios, for example counting in real time how many times a user has played videos related to each uploader (UP 主) in the last 1 minute, 6 hours, 12 hours, and 24 hours. For such a scenario there are two key points:
First, it needs a sliding window to compute the user's history. In addition, during the sliding computation, it needs to join the uploader's basic-information dimension table to obtain the uploader's videos and count their plays. Boiling it down, we hit two major pain points.
- First, with Flink's native sliding window, a minute-level slide produces a very large number of windows and a heavy performance loss;
- and such fine-grained windows also lead to too many timers and poor cleanup efficiency.
- Second, the dimension-table query needs to look up multiple keys against HBase and get the multiple corresponding values, which requires support for concurrent array lookups.
Under these two pain points, the sliding window was converted into Group By plus an aggregate UDF, in which the window data for one hour, six hours, twelve hours, and twenty-four hours is stored in RocksDB. With the UDF mode, the trigger mechanism achieves record-level triggering on top of Group By, and both the semantics and the timeliness improve greatly. Inside the AGG UDF, RocksDB holds the state and the data life cycle is maintained within the UDF. In addition, SQL was extended to implement array-level dimension-table queries. The end result is that, through this super-large-window mode, real-time features can support all kinds of computing scenarios. A simplified sketch of such an aggregate follows.
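Here is a simplified sketch of such an aggregate as a Flink Table API AggregateFunction. MultiWindowCount and its hour-bucket accumulator are assumptions; it approximates the sliding horizons at hour granularity and keeps the accumulator in Flink state (which is RocksDB with the RocksDB state backend), rather than reproducing the production UDF's custom RocksDB state management.

```java
// Sketch only; imports from the Flink Table API omitted.
public class MultiWindowCount extends AggregateFunction<long[], MultiWindowCount.Acc> {

    /** Accumulator: per-hour play counts, stored in Flink state. */
    public static class Acc {
        public Map<Long, Long> hourlyCounts = new HashMap<>();   // hour bucket -> play count
        public long latestEventTime;
    }

    @Override
    public Acc createAccumulator() {
        return new Acc();
    }

    /** Called per record, so the result is record-level triggered rather than window triggered. */
    public void accumulate(Acc acc, long eventTimeMs) {
        long hour = eventTimeMs / 3_600_000L;
        acc.hourlyCounts.merge(hour, 1L, Long::sum);
        acc.latestEventTime = Math.max(acc.latestEventTime, eventTimeMs);
        // Expire buckets older than the largest horizon (24h) to bound the state size.
        acc.hourlyCounts.keySet().removeIf(h -> h < hour - 24);
    }

    @Override
    public long[] getValue(Acc acc) {
        long nowHour = acc.latestEventTime / 3_600_000L;
        long[] horizons = {1, 6, 12, 24};                        // in hours
        long[] counts = new long[horizons.length];
        for (int i = 0; i < horizons.length; i++) {
            final long h = horizons[i];
            counts[i] = acc.hourlyCounts.entrySet().stream()
                    .filter(e -> e.getKey() > nowHour - h)       // hour-granularity approximation
                    .mapToLong(Map.Entry::getValue)
                    .sum();
        }
        return counts;                                           // counts over the last 1h, 6h, 12h, 24h
    }
}
```

A query shaped like `SELECT user_id, up_id, multi_window_count(event_time) FROM plays GROUP BY user_id, up_id` (illustrative names) would then update the four counts on every incoming record.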
8. Features: offline
Next, let's look at the offline side. The upper part of the left view is the complete real-time feature computation link. If the same SQL is to be reused in offline computation, the corresponding computation IO must also be reusable: in streaming mode the data input is Kafka, while offline it has to be HDFS; in streaming, dimension data is served by KV engines such as KFC or AVBase, while offline it has to be served by the Hive engine. Boiling it down, there are three problems to solve:
- First, simulate streaming consumption so that HDFS data can be consumed in the offline scenario;
- Second, solve ordered partitioned consumption of HDFS data, similar to Kafka's partitioned consumption;
- Third, simulate KV-engine dimension-table consumption and implement dimension-table lookups on top of Hive. There is one more issue to solve here: every record pulled from HDFS has to be looked up against a corresponding snapshot of the Hive table, which is equivalent to each piece of data carrying a timestamp, and the dimension partition matching that timestamp has to be consumed.
9. Optimization
9.1 Offline: ordered partitions
The ordered-partition scheme is mainly based on changes made upstream, before the data lands on HDFS. Before data falls onto HDFS, it passes through the transmission pipeline consuming Kafka. After the Flink job pulls data from Kafka, it extracts the watermark from the event time; each Kafka source parallelism reports its watermark to the GlobalWatermark module in the JobManager, and GlobalAgg aggregates the watermark progress of each parallelism to compute the global watermark progress. Based on this global progress, it determines which parallelisms' watermarks are advancing too fast and sends control information back to the Kafka sources through GlobalAgg; when some source parallelism is too fast, its partition advancement is slowed down. In this way, the event times of the records received by the HDFS sink module within the same time slice are basically ordered, and when they land on HDFS the file name identifies the corresponding partition and time-slice range. The result is an ordered data-partition layout in the HDFS partition directories. A minimal sketch of the alignment decision follows.
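The sketch below shows only the alignment decision, independent of Flink internals. WatermarkAligner is an assumed name; in the real pipeline the reports and control messages travel between the source subtasks and the JobManager-side GlobalWatermark/GlobalAgg modules.

```java
import java.util.HashMap;
import java.util.Map;

public class WatermarkAligner {

    private final long maxAheadMs;                          // allowed skew, e.g. a few minutes
    private final Map<Integer, Long> localWatermarks = new HashMap<>();

    public WatermarkAligner(long maxAheadMs) {
        this.maxAheadMs = maxAheadMs;
    }

    /** Called with each source subtask's reported watermark; returns pause decisions per subtask. */
    public synchronized Map<Integer, Boolean> report(int subtask, long watermarkMs) {
        localWatermarks.put(subtask, watermarkMs);
        long globalMin = localWatermarks.values().stream()
                .mapToLong(Long::longValue).min().orElse(Long.MIN_VALUE);

        Map<Integer, Boolean> shouldPause = new HashMap<>();
        for (Map.Entry<Integer, Long> e : localWatermarks.entrySet()) {
            // A subtask running more than maxAheadMs ahead of the slowest one stops pulling
            // its Kafka partitions until the others catch up, keeping the records within one
            // time slice roughly ordered by event time at the HDFS sink.
            shouldPause.put(e.getKey(), e.getValue() > globalMin + maxAheadMs);
        }
        return shouldPause;
    }
}
```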
9.2 Offline: incremental partition consumption
Once the data on HDFS is incrementally ordered, HDFSStreamingSource was implemented. It runs Fetchers over the partitions' files, with a Fetcher thread per file; each Fetcher thread keeps an offset cursor for the progress within its file and updates the cursor into State following the checkpoint process.
In this way, file consumption advances in order. When replaying historical data, the offline job has to stop at some point, so a partition end flag is introduced in the FileFetcher module: when a thread senses the end of its partition, the end states are aggregated by the cancellationManager and further reported to the JobManager to update the global partition progress. When all global partitions have reached the end cursor, the whole Flink job is cancelled and shut down. A simplified sketch of the source follows.
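Below is a simplified single-threaded sketch of the idea behind HDFSStreamingSource. FileCursor and PartitionFileReader are hypothetical placeholders for the real directory listing and record reading, and the production version runs one Fetcher thread per file.

```java
// Sketch only; imports from the Flink DataStream API omitted.
public class HdfsStreamingSource extends RichSourceFunction<String>
        implements CheckpointedFunction {

    private volatile boolean running = true;
    private transient ListState<FileCursor> cursorState;
    private FileCursor cursor = new FileCursor();

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        PartitionFileReader reader = new PartitionFileReader();   // hypothetical HDFS reader
        while (running) {
            synchronized (ctx.getCheckpointLock()) {              // cursor and emit advance together
                if (reader.allPartitionsFinished(cursor)) {
                    // Backfill case: every partition reached its end flag, so the job can be cancelled.
                    return;
                }
                String record = reader.next(cursor);              // advances file + offset in the cursor
                if (record != null) {
                    ctx.collect(record);
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        cursorState.clear();
        cursorState.add(cursor);            // checkpoint the (partition, file, offset) progress
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        cursorState = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("file-cursor", FileCursor.class));
        for (FileCursor restored : cursorState.get()) {
            cursor = restored;              // resume from the last checkpointed cursor
        }
    }

    /** Hypothetical cursor: which partition and file we are in, and the offset inside that file. */
    public static class FileCursor implements java.io.Serializable {
        public String partition = "";
        public String file = "";
        public long offset;
    }

    /** Hypothetical helper that lists ordered partition files and reads records sequentially. */
    static class PartitionFileReader {
        boolean allPartitionsFinished(FileCursor c) { return false; }
        String next(FileCursor c) { return null; }
    }
}
```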
9.3 Offline: snapshot dimension tables
As mentioned, the offline data lives in Hive, and the HDFS-backed Hive table has a very large number of fields, while the offline feature actually needs only a few of them. So the first step is offline field pruning in Hive, cleaning an ODS table into a DW table. The DW table is then consumed by a Flink job that contains a reload scheduler: based on the partition that the watermark has currently advanced to, it periodically pulls the table information of the corresponding Hive partition, downloads the data from that Hive directory on HDFS, and reloads it into a local RocksDB file. RocksDB is ultimately the component that serves the dimension-table KV queries.
The component contains multiple RocksDB build processes, driven mainly by the event time of the data flow: when the event-time advancement reaches the end of the hourly partition, the next hour's RocksDB partition is actively reloaded and built in lazy-loading mode, and reads are then switched over to the new RocksDB. The idea is sketched below.
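A minimal sketch of the hourly snapshot switch. SnapshotDimTable and PartitionLoader are assumed names, and a plain Map stands in for the local RocksDB instance that serves the KV lookups in production.

```java
import java.util.Collections;
import java.util.Map;

public class SnapshotDimTable {

    private static final long HOUR_MS = 3_600_000L;

    private final PartitionLoader loader;                          // hypothetical: reads the pruned DW partition
    private long loadedHourStart = Long.MIN_VALUE;
    private Map<String, String> current = Collections.emptyMap();  // stand-in for the local RocksDB store

    public SnapshotDimTable(PartitionLoader loader) {
        this.loader = loader;
    }

    /** Look up a key against the snapshot that matches the record's event time. */
    public String lookup(String key, long eventTimeMs) {
        long hourStart = eventTimeMs - (eventTimeMs % HOUR_MS);
        if (hourStart > loadedHourStart) {
            // Lazy reload: build the next hour's snapshot only once event time reaches it,
            // then switch all reads over to the newly built store.
            current = loader.load(hourStart);
            loadedHourStart = hourStart;
        }
        return current.get(key);
    }

    /** Hypothetical loader: downloads one hourly Hive partition and bulk-builds the local store. */
    public interface PartitionLoader {
        Map<String, String> load(long partitionHourStartMs);
    }
}
```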
10. Stream-batch integration for experiments
Based on the three optimizations above, namely ordered incremental partitions, Kafka-like partitioned Fetcher consumption, and dimension-table snapshots, real-time features and offline features finally share one set of SQL, and stream and batch feature computation are connected. Next, looking at a complete stream-batch-integrated experiment link: as the figure shows, the top granularity is the complete offline computation process and the second is the near-line process. The computation semantics used in the offline process are exactly the same as the real-time consumption semantics in the near-line process, and both use Flink to provide the SQL computation.
Looking at the near-line part: Label join uses a Kafka click stream and impression stream, while the offline computation link uses an HDFS click directory and an HDFS impression directory. Feature data processing is the same: the real-time side uses Kafka playback data and some HBase manuscript (video) data, while the offline side uses Hive manuscript data and Hive playback data. Besides the stream-batch integration of offline and near-line, the real-time metrics produced by the near-line part are aggregated in an OLAP engine, and Superset provides real-time metric visualization. As can be seen from the figure, the complete stream-batch-integrated computation link contains very many, very complex computation nodes.
11. Experiment collaboration: challenges
The challenge of the next stage is experiment collaboration. The figure below is a simplified abstraction of the whole preceding link. As can be seen, within the three dashed boxes there is one offline link plus two real-time links; together, these three complete links constitute one stream-batch job flow, which is the most basic unit of a workflow. What is needed is a complete abstraction of the workflow, including a stream-batch event-driven mechanism. For algorithms in the AI field, the hope is to define the complete flow in Python. In addition, the inputs, outputs, and the computation itself should be templated, so that the whole experiment can be cloned easily.
12. Introducing AIFlow
In the second half of the year, we cooperated with the community on the workflow layer and introduced the AIFlow solution.
On the right is the DAG view of the complete AIFlow link. There is no restriction on the node types: a node can be a streaming node or an offline node. In addition, the edges between nodes support both data-driven and event-driven communication. The main benefit of introducing AIFlow is that it provides Python-based semantics to conveniently define a complete AIFlow workflow, as well as scheduling of the workflow's progress.
On the node edges, compared with some native workflow solutions in the industry, AIFlow also supports an event-driven mechanism. Its advantage is that an event message can be sent between two Flink jobs based on the watermark-driven partition progress of the data being processed in Flink, pulling up the next offline or real-time job.
It also supports peripheral services, including a message module for notifications, a metadata service, and a model-center service for the AI field.
13. Defining a Flow in Python
Let's look at how AIFlow ultimately defines a workflow in Python. The view on the right is the workflow definition of a complete online project. First comes the definition of the Spark job, in which dependencies are configured to describe the downstream dependency: it emits an event-driven message to pull up the following Flink streaming job, and the streaming job can in turn pull up the following Spark job in the same message-driven way. The semantics of the definition are very simple: only four steps are needed, configuring each node's config information, defining each node's operation behavior and its dependencies, and then running the topology view of the whole flow.
14. Event-driven stream-batch scheduling
Next, let's look at the driving mechanism of the complete stream-batch scheduling. The right side of the figure below is the complete driving view of three job nodes. The first goes from Source to SQL to Sink; the yellow box introduced here is an extended supervisor that collects global watermark progress. When the streaming job finds that the watermark can advance into the next hour's partition, it sends a message to the NotifyService. After the NotifyService receives this message, it forwards it to the next job, which introduces a flow operator into its Flink DAG; until that operator receives the message sent by the previous job, it blocks the job's processing. Once the message arrives, meaning the upstream partition for the previous hour has actually completed, the next flow node can be driven to run. Similarly, the next workflow node introduces a GlobalWatermark Collector module to collect its processing progress; when its previous hour's partition completes, it also sends a message to the NotifyService, and the NotifyService drives a call to the AIScheduler module, which pulls up the Spark offline job to finish the offline computation. As can be seen, the whole link supports four scenarios: batch-to-batch, batch-to-stream, stream-to-stream, and stream-to-batch.
15. The prototype of the full real-time AI link
Based on the flow definition and stream-batch scheduling above, the prototype of the full real-time AI link was initially built in 2020, with experimentation at its core. Algorithm engineers can develop nodes based on SQL, define a complete DAG workflow in Python, and get monitoring, alerting, and operation and maintenance in an integrated way.
It also supports the end-to-end connection from offline to real-time, from data processing to model training, and from model training to experiment effects. On the right is the link of the near-line experiment; below it is the service that feeds the material data produced by the experiment link into online prediction and training. The whole thing is packaged in three layers:
- basic platform functions, including experiment management, model management, feature management, and so on;
- the underlying AIFlow services;
- and platform-level metadata services.
4. Future development and thinking
In the coming year, we will focus on two directions.
- The first is the data lake: we will focus on incremental computing scenarios from the ODS to the DW layer and on breakthroughs in some scenarios from the DW to the ADS layer, with the core being the combination of Flink with Iceberg and HUDI as the landing of this direction.
- The second is the real-time AI platform: we will go further and provide an experiment-oriented real-time AI collaboration platform, with the core goal of an efficient engineering platform that streamlines and simplifies the work of algorithm engineers.