Introduction to "Real-time data warehouse introductory training camp" is composed of Alibaba Cloud researcher Wang Feng, Alibaba Cloud senior product expert Liu Yiming and other real-time computing Flink version and Hologres technology/product front-line experts joined forces to build the training camp course System, carefully polished the content of the course, and directly hit the pain points encountered by the current students. Analyze the architecture, scenarios, and practical applications of real-time data warehouses from the shallower to the deeper, and 7 high-quality courses will help you grow from a small white to a big cow in 5 days!

This article is compiled from the live broadcast "Analysis of Real-time Recommendation System Architecture Based on Apache Flink + Hologres" by Qin Jiangjie.
Video link: https://developer.aliyun.com/learning/course/807/detail/13888

Abstract: This article is compiled from Qin Jiangjie's talk in the real-time data warehouse online course.
Brief content:
1. Principle of real-time recommendation system
2. Real-time recommendation system architecture
3. Key technologies of real-time recommendation system based on Apache Flink + Hologres

Principles of Real-time Recommendation System

(1) Static recommendation system

Before introducing the real-time recommendation system, let's take a look at what the static recommendation system looks like.

[Figure 1]

Above is the architecture diagram of a very classic static recommendation system. On the front end there are many user-facing applications whose users generate a large number of behavior logs. These logs are put into a message queue and go through ETL; an offline system then does feature generation and model training, and finally the model and features are pushed to the online system. The online service can then call the online inference service to obtain recommendation results.
This is the operating process of a very classic static recommendation system. Let's take a concrete example to see how it works.

[Figure 2]

As shown in the figure above, the online user behavior logs may record users' page views and ad clicks, and the purpose of the recommendation system is to recommend ads to users. The following user behaviors can be seen in the logs:

User 1 and User 2 both viewed PageID 100 and some other pages, and after viewing PageID 100, User 1 clicked ad 2002. A series of behaviors like this can be summarized from the user logs through ETL and then sent to model training. During training, we use some features; in this case, we can see that both User 1 and User 2 are male users in China, which could be a user-dimension feature.

So the result we see from the logs is that a user clicked ad 2002 after viewing PageID 100, and both users are male users in China. Therefore, the model may learn that when a Chinese male user comes to PageID 100, he should be shown ad 2002, and this behavior is trained into the model. At this point, the users' offline features are pushed to the feature store, and the model is pushed online.

Suppose there is a user with ID 4 who happens to be a male user in China; this feature is pushed into the feature store, and the model is also pushed online. When user 4 visits PageID 100, the inference service first looks up the features of user 4, and based on the fact that he is a Chinese male user, the trained model recommends ad 2002 to him. This is the basic working principle of a static recommendation system.
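
To make this flow concrete, here is a minimal sketch in plain Python of how a static inference service could serve a request; the feature-store contents and the rule standing in for the trained model are purely illustrative.

```python
# Minimal sketch of static inference: all names and values are illustrative.

# Offline-generated user features, pushed to the feature store (here just a dict).
feature_store = {
    "user_4": {"country": "CN", "gender": "male"},
}

# A toy stand-in for the trained model: (page, country, gender) -> ad id.
model_rules = {
    ("page_100", "CN", "male"): "ad_2002",
}

def recommend(user_id, page_id):
    """Look up the user's static features, then apply the trained model."""
    features = feature_store.get(user_id, {})
    return model_rules.get((page_id, features.get("country"), features.get("gender")))

print(recommend("user_4", "page_100"))  # -> ad_2002
```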

Now, if some changes occur, can the static recommendation system continue to work well?

Suppose the features and model are trained on User 1 and User 2 today, and User 4's behavior is observed the next day. According to the model, User 4 is a Chinese male user just like User 1 and User 2, so he is assumed to have the same behavior, and what is pushed to him should follow the behavior of Chinese male users. But at this point we find that User 4's behavior is actually more similar to User 3's, not to that of User 1 and User 2.

In this case, since the model and features are static, the model must be retrained before User 4 can be recommended content closer to User 3's behavior, which delays the prediction effect.

So in actual operation, you can see that the static recommendation model has some problems:

  • Models and features are generated statically;
  • Taking a classification model as an example, users are classified by similarity, assuming that users in the same category have similar interests and behaviors.

    1. For example, male users in China have similar behaviors.
    2. Once a user is classified into a certain category, he remains in that category until he is reclassified by a new round of model training.

In this case, it is more difficult to make a good recommendation. The reasons are:

  • User behavior is very diverse and cannot be classified into a fixed category
    1) Purchase health care products for parents in the morning, book hotels for business trips at noon, and buy clothes for family members in the evening...
    2) The static system cannot accurately place users in the correct category at the moment.
  • The behavior of a certain category of users is similar, but the behavior itself may change
    1) Assume that users "follow the trend", but the "popularity" may change;
    2) The “popularity” seen by historical data may not accurately reflect the real situation online.

(2) Adding real-time feature engineering to the recommendation system

To solve the above problems, dynamic features can be added. So what are dynamic features? Let's take an example.

[Figure 3]

As shown in the figure above, let's take the change of current trends as an example of a dynamic feature. The previous model's recommendation was: if a Chinese male user visits PageID 100, recommend ad 2002 to him, which is a fixed behavior.

Let's make some changes on that basis by sampling a real-time feature: the ads that Chinese male users have recently clicked the most when visiting PageID 100. This feature cannot be computed offline, because it comes from real-time online user behavior.

So what can be done once this behavior data is available? When a Chinese male user visits PageID 100, we no longer simply show ad 2002, but instead the ads that Chinese male users have clicked the most when visiting PageID 100 during the recent period.

In this case, if the ads most clicked recently by Chinese male users visiting PageID 100 are 2001 and 2002, then when a user comes and we see that he is a Chinese male user, we may recommend ad 2001 instead of ad 2002.

The above is an example of a changing trend.

In the same way, because the system can sample the user's real-time features, it can better judge the user's intent at the moment. For example, it can see which pages and products the user has viewed in the last minute, judge in real time what the user is thinking right now, and recommend an ad that better matches his current intent.

Is there any problem left with such a recommendation system? Let's look at another example.

As mentioned earlier, User 1 and User 2 are both Chinese male users. Previously it was assumed that their behaviors were similar, and this was confirmed by historical data. But what might happen when we actually observe their behavior online?

It may happen that the behaviors of User 1 and User 2 diverge. There may be many reasons for the divergence, but the system does not know why. At this point, the things recommended to User 1 and User 2 may need to be completely different. So what causes the divergence?

[Figure 4]

For example, suppose User 1 is in Shanghai and User 2 is in Beijing. One day Beijing experiences a sharp drop in temperature, so User 2 in Beijing starts searching for long trousers, while Shanghai is still very hot that day, so User 1 in Shanghai still searches for summer clothes. At this moment, within the group of Chinese male users, the search behaviors of User 1 in Shanghai and User 2 in Beijing have diverged, and different ads should be recommended to them, but the static model cannot do this well.

The reason is that the model is statically trained; if it is a classification model, the categories it can produce are fixed. To generate a new category, the model must be retrained. Since model training is performed offline, the retrained model may not be updated until the next day, which hurts the recommendation effect.

  • By adding dynamic features:
    1) Track the behavior of a class of users in real time to fit the overall trend;
    2) Track each user's behavior in real time to understand the user's intent at the moment and place the user into a more appropriate category.
  • However, when the classification scheme of the model itself needs to change, the most suitable category may simply not exist, and the model must be retrained to add new categories.

Example: new products are launched frequently, the business grows rapidly, and the distribution of user behavior changes quickly.
When you encounter the above problems, dynamic model updates need to be taken into consideration. How is a dynamic model update done? The principle is the same.

[Figure 5]

As shown in the figure above, in addition to ETL-ing the user's real-time behavior logs to offline storage for feature generation, the behavior logs also need to be exported online for feature generation and sample stitching, followed by online model training.

The model training here is usually streaming training: incremental training is done on top of a base model so that the model better fits the changes in user behavior at the moment. Through training on such real-time samples, the model can generate new categories and learn that the behavior of users in Shanghai and Beijing may differ. Therefore, when a user visits PageID 100, the model may recommend ad 2002 to users in Shanghai and ad 2011 to users in Beijing.
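
As an illustration of the incremental-training idea only (not the Alink/TensorFlow pipeline described later in this article), here is a minimal sketch using scikit-learn's partial_fit to keep updating a model on streaming mini-batches; the data and feature names are made up.

```python
# Illustrative only: incremental (streaming-style) training, with scikit-learn's
# SGDClassifier.partial_fit standing in for online training on a base model.
import numpy as np
from sklearn.linear_model import SGDClassifier

classes = np.array([0, 1])                      # 0 = no click, 1 = click
model = SGDClassifier(loss="log_loss")          # "log_loss" requires scikit-learn >= 1.1

def consume_mini_batches():
    """Hypothetical stream of (features, label) mini-batches stitched in real time."""
    rng = np.random.default_rng(0)
    for _ in range(10):
        X = rng.normal(size=(32, 4))            # e.g. [is_male, is_shanghai, on_page_100, recent_clicks]
        y = (X[:, 1] + rng.normal(scale=0.1, size=32) > 0).astype(int)
        yield X, y

for X_batch, y_batch in consume_mini_batches():
    # Incremental update: the model keeps fitting the current behavior distribution.
    model.partial_fit(X_batch, y_batch, classes=classes)
```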

Under such circumstances, when User 4 comes back, the system checks whether he is a user in Shanghai or in Beijing; if he is a user in Shanghai, it will still recommend ad 2002.

Features of the recommendation system with real-time model training:

  • On the basis of dynamic features, train the model in real time to make the model as close as possible to the distribution of user behavior at this moment;
  • Alleviate the degradation of the model.

Real-time recommendation system architecture

The above examples explain the principle of the real-time recommendation system and why it can do better than a typical offline recommendation system. So, how do we build such a usable real-time recommendation system with Flink, Hologres, and some other systems/projects?

(1) Architecture of classic offline recommendation system

First look at the architecture of the classic offline recommendation system mentioned above, as shown below.

[Figure 6]

This architecture is actually the same as the architecture mentioned before, with only some details added.

First, a message queue is used to collect real-time user behavior. The behavior in this message queue is imported into offline storage to keep historical user behavior. Static feature calculation is then done every day, and the results are finally written into the feature store for use by the online inference service.

At the same time, the system also performs offline sample stitching. The stitched samples are stored in the sample store for offline model training, which generates a new model every day; after validation, the model is used by the inference service. This model is updated T+1.

The above is the architecture of a classic offline recommendation system. To evolve it into a real-time recommendation system, three main things need to be done:

  • Feature calculation
    From static T+1 feature calculation to real-time feature calculation.
  • Sample generation
    From offline T+1 sample generation to real-time sample generation.
  • Model training
    From offline training with T+1 updates to incremental training with real-time updates.

(2) Online machine learning process of Alibaba's search and recommendation

Alibaba's search and recommendation business has put such a real-time recommendation system online. Its overall process is similar to the offline recommendation system; the main difference is that the entire process is real-time.

[Figure 7]

[Figure 8]

As shown above, this system has three main characteristics:
Timeliness: During the big promotion, the whole process will be updated in real time.
Flexibility: Adjust features and models at any time according to needs.
Reliability: The system is stable, highly available, and the online effect is guaranteed.
Users can update models and features in a very timely manner, adjust features and models at any time during the big promotion, and the results are also very good.

(3) Real-time recommendation system architecture

What should the architecture of the real-time recommendation system look like?

[Figure 9]

As shown in the figure above, compared to the classic offline recommendation system, the real-time recommendation architecture has some changes. First, the data produced by the message queue, in addition to entering offline storage to keep historical behavior, is also read out in two more copies: one is used for real-time feature calculation, whose results are also written into the feature store; the other is used for real-time sample stitching, where it is joined (a dual-stream join) with the features used by the online inference service so that real-time samples can be obtained.

In this case, the samples kept in the sample store can be used for offline model training as well as real-time model training.

Whether the model training is offline or real-time, the generated models are placed in the model store, validated, and finally brought online.

Offline model training runs at the day level, while real-time model training may run at the minute, hour, or even second level. Offline training generates a daily Base Model, on top of which real-time training performs incremental model updates.

One thing that needs to be mentioned in the whole architecture is that while the inference service uses the features fetched from the feature store for inference, it also needs to send the features used for that inference, together with the Request ID, to the message queue. In this way, when real-time sample stitching produces a positive sample, for example a user is shown an ad and then clicks it, we know which features were used to recommend that ad at that moment. So this feature information has to be retained by the inference service and sent on to the real-time sample stitching so that good samples can be produced.
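
A minimal sketch of this feature logging, assuming a Kafka topic named inference-feature-log and the kafka-python client; all field names are illustrative:

```python
# Sketch: after serving a request, the inference service logs the features it used,
# keyed by Request ID, so the sample-stitching job can join them back later.
# Topic and field names are assumptions; requires the kafka-python package.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def log_inference(request_id, user_id, features, recommended_ad):
    producer.send("inference-feature-log", {
        "request_id": request_id,   # join key for later sample stitching
        "user_id": user_id,
        "features": features,       # exactly the features used for this inference
        "ad_id": recommended_ad,
        "ts": int(time.time() * 1000),
    })

log_inference("req-0001", "user_4", {"country": "CN", "gender": "male"}, "ad_2002")
producer.flush()
```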

It can be seen from this architecture that, compared to the classic offline recommendation system, the parts in the green boxes are all real-time: some are newly added, and some turn originally offline parts into real-time ones. For example, real-time feature calculation is newly added; real-time sample stitching turns the original offline sample stitching into a real-time process; real-time model training is newly added; and model validation likewise changes from offline model validation to real-time model validation.

(4) Real-time recommendation scheme based on Flink + Hologres

To implement the real-time recommendation system architecture just described, what systems will be used?

[Figure 10]

As shown in the figure above, Kafka is used as the message queue, and HDFS is assumed for offline storage. Both real-time and offline feature calculation can now be done with Flink, and Flink's stream-batch unification ensures that the real-time and offline feature calculations produce consistent results.
The role of Hologres here is feature storage. One advantage of Hologres as a feature store is that it provides very efficient point lookups. Another is that real-time feature calculation often produces some inaccurate features that need to be corrected later, and the Flink plus Hologres mechanism supports this kind of feature correction well.
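
Because Hologres is PostgreSQL-compatible, the point lookup on the inference side can be sketched with an ordinary PostgreSQL client; the connection details, table, and column names below are assumptions:

```python
# Sketch: primary-key point lookup of real-time features from Hologres.
# Hologres speaks the PostgreSQL protocol, so psycopg2 works; the endpoint,
# credentials, table, and column names here are illustrative.
import psycopg2

conn = psycopg2.connect(
    host="your-hologres-endpoint", port=80,
    dbname="recsys", user="access_id", password="access_key",
)

def get_user_features(user_id):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT country, gender, recent_click_cnt "
            "FROM user_realtime_features WHERE user_id = %s",  # user_id is the primary key
            (user_id,),
        )
        row = cur.fetchone()
    if row is None:
        return None
    return {"country": row[0], "gender": row[1], "recent_click_cnt": row[2]}
```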

Similarly, on the inference service side, the features used for inference are retained and fed into the subsequent sample stitching; the message queue here is again Kafka. Sample stitching is done with Flink: this dual-stream join is a very classic Flink application scenario. After the samples are stitched and the features are attached, the resulting samples are also written into Hologres for sample storage.

With the samples stored in Hologres, they can be used for real-time model training by reading Hologres' Binlog, or for offline model training via Hologres batch Scan.

Whether the model training is online or offline, you can use Flink or FlinkML (i.e. Alink) for traditional machine learning, or TensorFlow for deep learning model training. The resulting model can still be stored in HDFS, then validated with Flink or TensorFlow, and finally served by the online inference service.

For online inference, many users have their own inference engine, which can be used directly; Flink and TensorFlow can also be used if preferred.

(5) Real-time feature calculation and inference (Flink + Hologres)

[Figure 11]

First, let's look at the process of real-time feature calculation and inference, as shown in the figure above.

As mentioned earlier, real-time user behavior is collected and sent to Flink for real-time feature calculation, and the results are stored in Hologres for use by the online inference service.

The real-time features here may include:

  • User's browsing history in the last 5 minutes
    1) Products, articles, videos
    2) Length of stay
    3) Favorites, add-to-cart, inquiries, comments
  • The 50 most clicked products in each category in the last 10 minutes
  • The most viewed articles, videos, and products in the last 30 minutes
  • The 100 most searched words in the last 30 minutes

For search and recommendation businesses, such real-time features can help achieve better recommendation results. A minimal sketch of computing one such sliding-window feature is shown below.
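
Below is a minimal PyFlink SQL sketch of computing one such sliding-window feature (click counts per product per category over the last 10 minutes, sliding every minute). The Kafka topic, field names, and window sizes are assumptions, and the sink is 'print' just to keep the example self-contained; in practice it would be a Hologres table.

```python
# Sketch: computing a sliding-window real-time feature with Flink SQL (PyFlink).
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE user_clicks (
        user_id    STRING,
        category   STRING,
        product_id STRING,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'user-behavior',
        'properties.bootstrap.servers' = 'kafka:9092',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

t_env.execute_sql("""
    CREATE TABLE product_click_features (
        category   STRING,
        product_id STRING,
        click_cnt  BIGINT,
        window_end TIMESTAMP(3)
    ) WITH ('connector' = 'print')
""")

-- = Click count per product per category over the last 10 minutes, every minute =
t_env.execute_sql("""
    INSERT INTO product_click_features
    SELECT category, product_id, COUNT(*) AS click_cnt,
           HOP_END(event_time, INTERVAL '1' MINUTE, INTERVAL '10' MINUTE) AS window_end
    FROM user_clicks
    GROUP BY category, product_id,
             HOP(event_time, INTERVAL '1' MINUTE, INTERVAL '10' MINUTE)
""").wait()
```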

(6) Real-time sample stitching (Flink + Hologres)

Next we will look at the part of real-time sample stitching, as shown in the figure below.

[Figure 12]

Real-time user behavior is collected and sent to Flink for sample stitching. The sample stitching here consists of two parts. The first part is to determine whether a sample is positive or negative, which is done by analyzing the real-time user behavior logs. There is an impression stream and a click stream: if joining the impression stream with the click stream shows that a displayed item was clicked by the user, it is a positive sample; if a displayed item was not clicked, it is a negative sample. That is how positive and negative samples are determined.

Judging positive and negative samples alone is obviously not enough, because the features are also needed for training. These features come from the inference service: when displaying an item, the inference service used certain features to decide whether the user would be interested in it. These features are written to Kafka and passed to Flink, and during sample stitching they are joined back by Request ID so that the features used for the recommendation at that time are attached; a complete sample is then generated and written to Hologres.

Here, Flink's multi-stream join capability is used for sample stitching, and at the same time multi-stream synchronization, positive/negative sample tagging, and sample correction are also handled.
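
A minimal sketch of the dual-stream join on Request ID follows, again with assumed topic and field names; the join of the inference-time features (also keyed by Request ID) follows the same pattern and is omitted here.

```python
# Sketch: real-time sample stitching as a dual-stream join on Request ID.
# An impression with no click within the interval becomes a negative sample;
# one with a click becomes a positive sample. Topic/field names are assumptions.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

for name, topic in [("impressions", "ad-impressions"), ("clicks", "ad-clicks")]:
    t_env.execute_sql(f"""
        CREATE TABLE {name} (
            request_id STRING,
            ad_id      STRING,
            event_time TIMESTAMP(3),
            WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
        ) WITH (
            'connector' = 'kafka',
            'topic' = '{topic}',
            'properties.bootstrap.servers' = 'kafka:9092',
            'scan.startup.mode' = 'latest-offset',
            'format' = 'json'
        )
    """)

t_env.execute_sql("""
    CREATE TABLE samples (
        request_id STRING,
        ad_id      STRING,
        label      INT
    ) WITH ('connector' = 'print')
""")

# Interval join: wait up to 10 minutes after the impression for a click.
t_env.execute_sql("""
    INSERT INTO samples
    SELECT i.request_id, i.ad_id,
           CASE WHEN c.request_id IS NULL THEN 0 ELSE 1 END AS label
    FROM impressions i
    LEFT JOIN clicks c
      ON i.request_id = c.request_id
     AND c.event_time BETWEEN i.event_time AND i.event_time + INTERVAL '10' MINUTE
""").wait()
```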

(7) Real-time model training / deep learning (PAI-Alink / Tensorflow)

After the samples are generated, the next step is real-time model training or deep learning.

[Figure 13]

As shown in the figure above, the samples just mentioned are stored in Hologres, and they can be used for two purposes: online model training and offline model training.

Online model training and offline model training use Hologres' Binlog and batch Scan capabilities respectively. In terms of performance, this is not much different from scanning a typical message queue or file system.

For a deep model, TensorFlow can be used for training; for a traditional machine learning model, Alink or FlinkML can be used. The trained model is stored in HDFS, and then validated with Flink or TensorFlow.

The above process covers the techniques that can be used to actually build real-time model training and deep model training.

(8) Alink–Flink ML (machine learning algorithm based on Flink)

Here is a brief introduction to Alink. Alink is a machine learning algorithm library based on Flink. It is open source and is being contributed to the Apache Flink community.

[Figure 14]
[Figure 15]

As shown in the figure above, Alink (Flink ML) has two characteristics compared to Spark ML:

  1. Spark ML provides only batch algorithms, while Alink provides stream-batch unified algorithms;
  2. In terms of batch algorithms, Alink is on par with Spark ML.

(9) Offline feature backfill (Backfill)

After introducing the training part, let's look at offline feature backfill. This process addresses the question: after the real-time features are online, what should be done when new features need to be launched?

[Figure 16]

As shown in the figure above, this is generally done in two steps. The first step is to add the new features to the real-time system, so that from a certain moment onward, the features stored in Hologres all include the new features. What about the historical data? At this point a feature backfill is needed: run a batch job over the historical behavior data stored in HDFS to fill in the historical values of the new features.

So the offline feature backfill in this architecture diagram is also done by Flink's offline feature calculation: the historical behavior data is read from HDFS, the offline features are computed, and the features for the historical data are backfilled.
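
A minimal sketch of a backfill job with PyFlink in batch mode follows, reading assumed historical files from HDFS and recomputing the new feature; paths, formats, and field names are illustrative, and the sink is 'print' instead of the real Hologres feature table.

```python
# Sketch: offline feature backfill with Flink in batch mode, reading historical
# behavior logs from HDFS and recomputing the newly added feature for past data.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

t_env.execute_sql("""
    CREATE TABLE history_behavior (
        user_id    STRING,
        category   STRING,
        product_id STRING,
        event_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'filesystem',
        'path' = 'hdfs:///warehouse/user_behavior/',
        'format' = 'csv'
    )
""")

t_env.execute_sql("""
    CREATE TABLE backfilled_features (
        user_id   STRING,
        category  STRING,
        click_cnt BIGINT
    ) WITH ('connector' = 'print')
""")

# Recompute the new feature over the historical data and backfill it.
t_env.execute_sql("""
    INSERT INTO backfilled_features
    SELECT user_id, category, COUNT(*) AS click_cnt
    FROM history_behavior
    GROUP BY user_id, category
""").wait()
```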

Key technologies of real-time recommendation system based on Apache Flink + Hologres

Many key technologies are used in the architecture above. Next, I will mainly talk about two of them.

(1) Features and samples that can be retracted and corrected

[Figure 17]

The first point is features and samples that can be retracted and corrected, as shown in the figure above.

In the lower shaded area of the figure, through the cooperation of Flink and Hologres, some samples and features can be retracted and corrected.
Why is correction of features and samples needed?

  • Real-time logs arrive out of order
    For example, a user click event arrives late because of system delay, producing a False Negative sample.
  • Traditionally, samples are recomputed through offline jobs
    The entire offline sample computation has to be re-run.
  • Update through the Apache Flink + Hologres retraction mechanism
    Only the features and samples that need correction are updated.

Real-time logs may be out of order: some streams arrive earlier and some later. In this case, when doing a multi-stream join, some False Negative samples may be produced because of system delays and late arrivals.

For example, when joining the impression stream with the click stream, you might at first think that the user did not click on a certain ad, but later find that the user did click and the event simply arrived late. In this case the downstream is first told there was no click, which is a False Negative; when the click is later discovered, the False Negative needs to be corrected. The previous sample must then be retracted or updated to tell the downstream that it is not a negative sample but a positive one.

Based on the above, we need retraction capability across the entire pipeline: previous errors must be propagated downstream step by step and corrected. Such a mechanism can be built through the cooperation of Apache Flink + Hologres.

Why is this worth doing?

In the past, when such False Negative samples were produced, they were usually corrected by recomputing samples through offline jobs. The cost of this approach is that the entire offline sample computation may need to be re-run, even though the ultimate goal is only to correct a small portion of all samples, so the cost is relatively high.

Through the mechanism implemented with Apache Flink + Hologres, the False Negative samples can be updated in a point-wise manner instead of re-running the whole sample computation. In this way, the cost of correcting features and samples becomes much smaller.
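
A minimal sketch of the idea, assuming the same impression/click topics as before: a regular (non-windowed) left join produces an updating changelog, and writing it into a primary-keyed upsert sink (Hologres in practice; upsert-kafka here as a stand-in) overwrites only the affected sample when the late click arrives.

```python
# Sketch: correcting a False Negative with Flink's changelog plus an upsert sink.
# When a late click arrives, Flink retracts the earlier row (label 0) and emits a
# corrected row (label 1); the primary-keyed sink upserts only that sample.
# Topic and field names are assumptions.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE impressions (request_id STRING, ad_id STRING) WITH (
        'connector' = 'kafka', 'topic' = 'ad-impressions',
        'properties.bootstrap.servers' = 'kafka:9092',
        'scan.startup.mode' = 'latest-offset', 'format' = 'json')
""")
t_env.execute_sql("""
    CREATE TABLE clicks (request_id STRING) WITH (
        'connector' = 'kafka', 'topic' = 'ad-clicks',
        'properties.bootstrap.servers' = 'kafka:9092',
        'scan.startup.mode' = 'latest-offset', 'format' = 'json')
""")
t_env.execute_sql("""
    CREATE TABLE corrected_samples (
        request_id STRING, ad_id STRING, label INT,
        PRIMARY KEY (request_id) NOT ENFORCED
    ) WITH (
        'connector' = 'upsert-kafka', 'topic' = 'corrected-samples',
        'properties.bootstrap.servers' = 'kafka:9092',
        'key.format' = 'json', 'value.format' = 'json')
""")

# If a click for a request_id shows up later, the row for that request_id is
# upserted from label 0 to label 1, instead of recomputing all samples offline.
t_env.execute_sql("""
    INSERT INTO corrected_samples
    SELECT i.request_id, i.ad_id,
           CASE WHEN c.request_id IS NULL THEN 0 ELSE 1 END AS label
    FROM impressions i LEFT JOIN clicks c ON i.request_id = c.request_id
""").wait()
```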

(2) Event-based stream-batch mixed workflow

Another key technology in this architecture is the event-based stream-batch mixed workflow. What does that mean?

[Figure 18]

Looking at this diagram, beyond the systems just shown, there is also a very complex workflow, because the different systems have dependency and scheduling relationships among them, sometimes data dependencies and sometimes control dependencies.

For example, some offline static feature calculations may be run periodically or on a schedule. They may be doing feature backfill, or correcting problems caused by real-time features; they may run periodically by default, or be triggered manually. Sometimes, after offline model training produces a model, the online model validation action needs to be triggered; similarly, after online model training produces a model, validation also needs to be triggered.

It is also possible that sample stitching reaches a certain point in time. For example, after sample stitching is completed at 10 am, we want to tell the model training that all samples before 10 am have been stitched, and we want to run a batch offline training task using the data up to 10 am this morning. This is a stream task triggering a batch task. The model produced by that batch training then needs to go into the online model validation process, which is a batch task triggering a stream task; and the model produced by online model training also needs to go through online model validation, which is a stream task triggering a stream task.

So in this process there are many interactions between different tasks, which makes a rather complex workflow. It contains both batch tasks and stream tasks, so it is a mixed stream-batch workflow.

(3) Flink AI Flow

How is such a stream-batch mixed workflow implemented?

Flink AI Flow is used: a top-level workflow abstraction for big data plus AI.

[Figure 19]

As shown in the figure above, a workflow can usually be divided into two steps: Workflow definition and Workflow execution.

The Workflow definition defines Nodes and Relations, that is, the nodes and the relationships between them. In Flink AI Flow, a node is defined as a Logical Processing Unit, and the relations between nodes are defined as event-driven conditions. Under this abstraction, scheduling at the Workflow execution level is event-based.

Strictly speaking, there are many events in a system; combinations of these events may satisfy certain conditions, and when a condition is met, some action occurs.

For example, a workflow may contain a task A that listens to various events in the system. When event 1 occurs, then event 2, then event 3, and the events have occurred in that sequence, the action of starting task A needs to be performed. The condition here is that events 1, 2, and 3 occur in order.

Through this abstraction, traditional workflows can be well integrated with workflows that contain streaming jobs. Traditional workflows schedule based on changes in job status: usually a job finishes, and then the scheduler decides which job to run next. The problem with this approach is that a streaming job never finishes, so such a workflow cannot work properly.

Event-based scheduling solves this problem well. Workflow scheduling no longer relies on changes in job status, but on events. In this way, even a streaming job can emit events and tell the scheduler to do other things; the sketch below illustrates the idea.
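
The following is a hypothetical, simplified sketch of this event-based scheduling idea, not the Flink AI Flow API: a rule fires its action once the events it depends on have been observed in the required order, regardless of whether the jobs emitting them ever finish.

```python
# Hypothetical sketch of event-based scheduling (not the Flink AI Flow API).
from dataclasses import dataclass, field

@dataclass
class Rule:
    required_events: list      # e.g. ["event_1", "event_2", "event_3"], in order
    action: callable           # e.g. submit task A
    seen: list = field(default_factory=list)

    def on_event(self, event):
        expected = self.required_events[len(self.seen)]
        if event == expected:
            self.seen.append(event)
        if self.seen == self.required_events:
            self.action()
            self.seen = []     # re-arm, so a streaming job can trigger it repeatedly

start_task_a = Rule(["event_1", "event_2", "event_3"], lambda: print("start task A"))

# Events emitted by running (possibly never-ending streaming) jobs:
for ev in ["event_1", "event_2", "event_3"]:
    start_task_a.on_event(ev)
```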

Some supporting services are needed to complete the entire scheduling semantics. They include:

  • Metadata Service
  • Notification Service
  • Model Center

Let's take a look at the content of these support services respectively.

(4) Metadata Service

[Figure 20]

The Metadata Service manages data sets. In the workflow, we hope users do not have to tediously locate their data sets themselves; the service manages them so that users can simply refer to a data set by name when they want to use it.

The Metadata Service also manages projects (a Project here refers to a Project in Flink AI Flow). A Project can contain multiple workflows, and the main purpose of managing Projects is to ensure that workflows can be reproduced.

The Metadata Service also manages workflows and jobs; each workflow may involve many jobs. In addition, it manages model lineage, so it is possible to know which job in which workflow generated a given model version. Finally, it also allows users to define custom entities.

(5) Notification Service

The second service is the Notification Service, which provides events with primary keys and event listeners.

[Figure 21]

For example, as shown in the figure above, a client wants to listen for an event whose key is the model. When that key is updated, the listener receives a callback telling it that an event has been updated: the primary key of the event is the model, the value is the URI of the model, and the version number is 1.

One use of this is model validation: a validation job can listen on the Notification Service, and when a new model is generated, it gets notified and then validates that model.

(6) Model Center

The Model Center manages multiple versions of models, records parameters, tracks model metrics, manages the model life cycle, and also provides some model visualization.

[Figure 22]

Let's take an example to illustrate how Flink AI Flow uses a complete workflow to describe the complex workflow in the real-time recommendation system.

[Figure 23]

As shown above, suppose there is a DAG that contains three tasks: model training, model validation, and online inference.

First, after the model training job is submitted, the Scheduler updates the job's status in the Metadata Service to "submitted". Assuming the environment is a K8s cluster, the training job is then submitted to Kubernetes to run.

Once the training job is running, its status can be updated through the job status listener. If this job is a streaming training job, it will generate a model after running for a while, and this model is registered in the Model Center. After registration, the Model Center sends an event indicating that a new model version has been registered. This event goes to the Scheduler, which listens for such events.

When the Scheduler receives this event, it checks whether any conditions are now satisfied and what actions need to be taken. Since a new model has been generated, it needs to be validated; this condition is met, so a job needs to be started: the model validation job.

After the model validation job starts, it fetches the latest generated model version from the Model Center and validates it. Suppose the validation passes (this validation is a batch job); it tells the Model Center that the model is Validated. The Model Center then sends a Model Version Validated event to the Scheduler. On seeing that the model has been validated, the Scheduler triggers the online inference service, which after launching pulls the just-validated model from the Model Center to serve inference.

Suppose the inference service is also a streaming job and keeps running. After some time, the online streaming training job generates a new model, and the same path is walked again: a new model version is registered, the Scheduler hears the event and pulls up the validation job (Job 2) again, the job validates the new model, and if validation passes again, a new Model Version Validated event is sent by the Model Center to the Scheduler. This time the Scheduler sees that the inference job is already running and may simply do nothing.

The inference job itself is also listening for the Model Version Validated event. When it receives this event, it reloads the latest validated model from the Model Center.

This example explains why a stream-batch hybrid scheduler and workflow are needed to chain together all the jobs in the end-to-end real-time recommendation system architecture.

Flink AI Flow is available on GitHub as an open-source project in the Flink ecosystem; interested readers can check it out via the link below.

https://github.com/alibaba/flink-ai-extended/tree/master/flink-ai-flow
