This article is compiled from "Recommendation System Architecture Design and Technology Evolution Based on Real-time Deep Learning", shared by Qin Jiangjie and Liu Tongxuan at the big data and AI integration platform sub-forum of the Alibaba Cloud Developers Conference on May 29. The talk covered the following:
- The principle of a real-time recommendation system and what a real-time recommendation system is
- The overall system architecture and how to implement it on Alibaba Cloud
- A detailed introduction to real-time deep learning
GitHub address
https://github.com/apache/flink
Everyone is welcome to like Flink and star the project~
1. The principle of real-time recommendation system
Before introducing the principle of a real-time recommendation system, let's first look at a traditional, classic static recommendation system.
User behavior logs land in a message queue, and ETL jobs then perform feature generation and model training — all of this happens offline. The offline model and feature updates are pushed to online systems such as the feature store and the online inference service, which in turn serve online search and recommendation applications. The recommendation system itself is a service, and the applications it serves on the front end may include search recommendation, advertisement recommendation, and so on. So how does this static system work? Let's look at the following example.
1. Static recommendation system
Take the current user behavior log and pour it into the offline system for feature generation and model training. The log shows that user 1 and user 2 both browsed page#200 and some other pages, and that user 1 browsed page#100 and clicked ads#2002. ETL takes this log offline and feeds it into feature generation and model training. In the generated features and model, we see that user 1 and user 2 are both Chinese male users, so "Chinese male" becomes a feature of these two users. What the model ends up learning is: when a Chinese male user browses page#100, push ads#2002 to him. The logic is to group the behaviors of similar users together, on the assumption that such users behave similarly.
The user features are pushed to the feature store and the model to the online service. Now if a user 4 appears, the online inference service looks up this user's features in the feature store and may find that he is also a Chinese male user. Since the model has learned that Chinese male users visiting page#100 should be shown ads#2002, it recommends ads#2002 to user 4. That is the basic workflow of a static recommendation system.
But this system has problems. For example, after the first day's model training is complete, it may turn out that on the second day user 4's behavior is actually more similar to user 3 than to user 1 and user 2. However, the earlier training run concluded that Chinese male users visiting page#100 should be shown ads#2002, and this recommendation is made by default. Only after the next round of model training can the system discover that user 4 is more similar to user 3, so new recommendations arrive with a delay. This is because both the model and the features are static.
In a static recommendation system, both features and model are generated statically. Take a classification model as an example: users are classified by similarity, on the assumption that similar users have similar behaviors and interests. Once a user is assigned to a category, he stays in that category until the model is retrained.
2. Problems of static recommendation systems
- First, user behavior is highly diverse; there is no way to describe a user with a single static profile.
- Second, the behavior of a class of users may be similar, but the behavior itself changes over time. For example, "Chinese male users visiting page#100 should be shown ads#2002" was yesterday's behavior; on the next day, it may turn out that not every Chinese male user who sees page#100 still clicks ads#2002.
3. Solution
3.1 Flexible recommendation after adding real-time feature engineering
Add real-time feature engineering to the recommendation system: read a copy of the messages in the message queue and generate near-line features from them. For example, track in real time the 10 ads most clicked by Chinese male users visiting page#100 in the last 10 minutes or half hour. This is not information derived from yesterday's historical data but from today's real-time user behavior — a real-time feature.
With this real-time feature, the trend-following problem above can be solved. Likewise, if features are collected from a user's own behavior over the last 3 or 5 minutes, the user's intent at that moment can be tracked more accurately, and more accurate recommendations can be made.
So with real-time features added, the recommendation system can recommend accurately. In the previous example, if user 4 visits page#100 now, the newly learned rule is: among Chinese male users who recently visited page#100, ads#2001 was the most clicked. The system will therefore recommend ads#2001 directly instead of pushing ads#2002 based on yesterday's information.
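The real-time feature just described — the most-clicked ads for a user segment over the last few minutes — can be sketched as a simple sliding window. This is only an illustrative sketch (the class name, the 10-minute window, and the event format are assumptions); in production this would be a Flink windowed aggregation.

```python
import time
from collections import Counter, deque

class SlidingTopAds:
    """Illustrative sketch: top-N ads clicked within a sliding
    time window (e.g. 10 minutes) for one user segment and page.
    In production this would be a Flink windowed aggregation."""

    def __init__(self, window_seconds=600):
        self.window_seconds = window_seconds
        self.events = deque()          # (timestamp, ad_id)
        self.counts = Counter()

    def click(self, ad_id, ts=None):
        ts = time.time() if ts is None else ts
        self.events.append((ts, ad_id))
        self.counts[ad_id] += 1
        self._evict(ts)

    def _evict(self, now):
        # Drop clicks that fell out of the window.
        while self.events and self.events[0][0] < now - self.window_seconds:
            _, old_ad = self.events.popleft()
            self.counts[old_ad] -= 1
            if self.counts[old_ad] == 0:
                del self.counts[old_ad]

    def top(self, n=10):
        return [ad for ad, _ in self.counts.most_common(n)]

# Yesterday's model preferred ads#2002, but the last few minutes
# of clicks point to ads#2001.
feature = SlidingTopAds(window_seconds=600)
feature.click("ads#2002", ts=0)          # an old click
for t in range(1000, 1005):
    feature.click("ads#2001", ts=t)      # recent clicks push out the old one
print(feature.top(1))                    # ['ads#2001']
```

The key point is that the feature reflects the last few minutes of behavior, not yesterday's batch aggregate.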
3.2 Limitations of the real-time feature recommendation system
User 1 and user 2 previously behaved very similarly, and real-time features let us track their current intent. But suppose that while sharing the same features, their behaviors diverge: users the model considers to be of the same type start behaving differently and effectively become two types of users. With a static model, even with real-time features added, this new user type cannot be discovered; the model must be retrained before a new category can emerge.
After adding real-time feature engineering, the recommendation system can track the behavior of a whole class of users and fit broad shifts in it, and it can track an individual user's behavior in real time to understand his intent at that moment. However, when the model's own way of classifying users changes, there is no way to find the most suitable category without retraining the model — and this situation arises often.
For example, when many new products are launched, the business is growing rapidly and many new users appear every day, or the distribution of user behavior shifts quickly. In such cases, even with a real-time feature system, the model itself gradually degrades: the model trained yesterday and put online today may no longer work well.
3.3 Solution
Two new parts are added to the recommendation system: near-line training and near-line sample generation.
Assume user 1 and user 2 are users in Shanghai and Beijing respectively. The previous model does not know there is any difference between Shanghai users and Beijing users — it sees them all as Chinese male users. After real-time model training is added, the model gradually learns that Beijing users and Shanghai users behave differently, and once it has learned this, the recommendations improve accordingly.
For another example, suppose there is sudden heavy rain in Beijing today, or the weather in Shanghai is extremely hot; the behavior of users in the two cities will then differ. When another user 4 arrives, the model can distinguish whether he is a Shanghai user or a Beijing user. If he is a Shanghai user, it can recommend content that matches Shanghai users; if not, it continues to recommend other content.
The main purpose of adding real-time model training is, on top of the dynamic features, to let the model fit the distribution of user behavior at this very moment as closely as possible, and at the same time to mitigate model degradation.
2. Alibaba real-time recommendation program
First, let's look at the benefits this plan has brought internally at Alibaba:
- The first is timeliness. Ali's big promotions have become routine, and the timeliness of the entire model during a big promotion has been greatly improved;
- The second is flexibility. Features and models can be adjusted at any time according to needs;
- The third is reliability. When first running an entire recommendation system in real time, people feel uneasy: without the large-scale overnight offline computation and verification, pushing models straight online seems not reliable enough. In fact, a complete set of procedures guarantees the stability and reliability of this process.
As the diagram shows, the pipeline of this recommendation model — from features to samples to model and then to online prediction — is no different from the offline one. The main difference is that the entire pipeline runs in real time and serves online search and recommendation applications.
1. How to implement
We evolve it from the classic offline architecture.
First, user behavior flows from the message queue into offline storage, which holds all historical user behavior. On top of this offline storage, static feature computation produces samples, which are saved in the sample store for offline model training. The offline model is then published to the model center for validation, and finally the validated model is pushed to the inference service to serve the online business. This is the complete offline system.
We will carry out real-time transformation through three things:
- The first is feature calculation;
- The second is sample generation;
- The third is model training.
Compared to before, the message queue no longer feeds only offline storage; it is split into two additional links:
- The first link performs real-time feature computation — for example, which ads Chinese male users clicked when visiting page#100 in the last few minutes. These are behavioral features users have generated in the recent period, computed in real time.
- The other link feeds real-time sample splicing, which means no manual labeling is needed, because the user tells us the label. For example, we make a recommendation: if the user clicks, it is a positive sample; if the user has not clicked after a period of time, we treat it as a negative sample. The user labels the data for us, so samples are easy to obtain. These samples are then placed in the sample store, as before. The difference is that this sample store now serves not only offline model training but also real-time model training.
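The labeling rule in the second link — an impression clicked within some window becomes a positive sample, otherwise a negative one — can be sketched as follows. The 30-minute timeout and the event shapes are assumptions for illustration; in Flink this would be an interval join between the impression and click streams.

```python
def label_samples(impressions, clicks, timeout=1800):
    """Sketch: join an impression stream with a click stream.
    An impression clicked within `timeout` seconds becomes a
    positive sample (label 1), otherwise a negative one (label 0)."""
    click_times = {}
    for c in clicks:
        key = (c["user"], c["item"])
        click_times.setdefault(key, []).append(c["ts"])

    samples = []
    for imp in impressions:
        key = (imp["user"], imp["item"])
        clicked = any(imp["ts"] <= t <= imp["ts"] + timeout
                      for t in click_times.get(key, []))
        samples.append({**imp, "label": 1 if clicked else 0})
    return samples

impressions = [
    {"user": 1, "item": "ads#2001", "ts": 100},
    {"user": 2, "item": "ads#2002", "ts": 100},
]
clicks = [{"user": 1, "item": "ads#2001", "ts": 130}]
print([s["label"] for s in label_samples(impressions, clicks)])  # [1, 0]
```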
Offline model training is usually day-level (T+1); it produces a base model that is handed to real-time model training for incremental training. Incremental training may output a model every 10 or 15 minutes, which is then sent to the model store for validation and finally put online.
The green parts in the architecture diagram are all real-time. Some of these parts are newly added, and some are changed from offline to real-time.
2. Alibaba Cloud enterprise-level real-time recommendation solution
How do we build the Alibaba Cloud enterprise-level real-time recommendation solution with Alibaba Cloud products?
The message queue uses DataHub; real-time features and samples use the real-time computing Flink version; offline feature storage and static feature computation use MaxCompute; the feature store and sample center use MaxCompute interactive analytics (Hologres); all message queue parts are DataHub; model training, model storage and validation, and the online inference service are all in PAI.
2.1 Real-time feature calculation and inference
For features and inference, user logs are collected in real time and imported into the real-time computing Flink version for real-time feature computation. The results are sent to Hologres, whose streaming ingestion makes it suitable as the feature center. PAI can then directly query these user features in Hologres via point lookups.
The features computed in the Flink version depend on the business: for example, the user's browsing records in the last 5 minutes, covering products, articles, videos, and so on. They may also include the 50 products with the highest click-through rate per category in the last 10 minutes, the most-viewed articles, videos, and products in the last 30 minutes, or the 100 most-searched terms in the last 30 minutes. Scenarios such as search and recommendation span ads, videos, text, news, etc. These data form the real-time feature computation and inference link; on top of this link, static feature backfill is sometimes required.
2.2 Static feature backfill
For example, suppose a new feature launched on the real-time link requires the user's behavior over the last 30 days. We cannot wait 30 days for it to accumulate, so we go to the offline data and fill in the last 30 days of this feature — this is feature backfill. MaxCompute computes the backfill and writes it to Hologres as well. This is the new-feature scenario.
There are other scenarios too, such as computing purely static features, or error correction: an online feature may have had a bug and been computed incorrectly, but the raw data has landed offline, so the same backfill process is used to correct the feature from offline data.
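The backfill idea can be sketched in a few lines. Plain dicts stand in for MaxCompute (offline logs) and Hologres (the feature store); the feature name and field layout are assumptions for illustration.

```python
from collections import defaultdict

def backfill_30d_views(offline_logs, feature_store, days=30, today=30):
    """Sketch of feature backfill: a newly launched feature
    ("views in the last 30 days") is computed from offline logs
    (standing in for MaxCompute) and written into the feature
    store (standing in for Hologres), instead of waiting 30 days
    for the real-time link to accumulate it."""
    counts = defaultdict(int)
    for log in offline_logs:
        if today - days <= log["day"] <= today:
            counts[log["user"]] += 1
    for user, n in counts.items():
        feature_store[user] = {"views_30d": n}
    return feature_store

logs = [{"user": 1, "day": 5}, {"user": 1, "day": 29}, {"user": 2, "day": 29}]
store = {}
backfill_30d_views(logs, store)
print(store[1]["views_30d"])  # 2
```

The same computation, pointed at corrected offline data, serves the error-correction scenario described above.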
2.3 Real-time sample stitching
Real-time sample splicing essentially serves the recommendation scene: after the impression and click streams come back, each sample becomes a positive or a negative sample. But the label alone is not enough for training — features are also needed, and they come from DataHub. With real-time features, the features of a sample change constantly.
For example, suppose a product recommendation is made at 10:00 am, and the user's real-time features are his browsing records from 9:55 to 10:00. By the time the sample flows back, it may be 10:15. If it is a positive sample — the user bought the recommended product — we can no longer see the real-time features from the moment of the recommendation.
By then the features have become the user's browsing history from 10:10 to 10:15. But the prediction was not made from that window, so we must save the features that were actually used when the recommendation was made and attach them when the sample is generated. That is DataHub's role here.
When using ES for real-time recommendation, the features used for the recommendation at the time must be saved and used to generate the sample. Once samples are generated, they can be stored in Hologres and MaxCompute, and real-time samples can be placed in DataHub.
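The feature-snapshot idea above can be sketched as follows. The class and field names are illustrative; the point is only that the label joins against the features snapshotted at serving time, not the features current at label time.

```python
class FeatureSnapshotJoin:
    """Sketch: at serving time the features actually used for the
    recommendation are saved keyed by request id (the role DataHub
    plays); when the delayed label arrives, the sample is assembled
    from the snapshotted features, not from the drifted ones."""

    def __init__(self):
        self.snapshots = {}   # request_id -> features used at serving time

    def on_serve(self, request_id, features):
        self.snapshots[request_id] = dict(features)

    def on_label(self, request_id, label):
        # Join the label back to the features seen at 10:00,
        # not the drifted features of 10:15.
        features = self.snapshots.pop(request_id)
        return {"features": features, "label": label}

join = FeatureSnapshotJoin()
join.on_serve("req-1", {"recent_views": ["page#100"]})
sample = join.on_label("req-1", label=1)
print(sample["features"]["recent_views"])  # ['page#100']
```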
2.4 Real-time deep learning and Flink AI Flow
In this part there is day-level offline training, minute-level online real-time training, and in more extreme cases even second-level training. Whichever cadence produces the model, it is finally sent for model validation and put online.
This is actually a very complicated workflow. Static feature computation is periodic, or may be triggered manually — when a backfill is needed, a process is started by hand. The diagram shows batch training: once a batch model is trained, it must go online through real-time model validation, which may be a stream job — so there is a batch-triggers-stream step. A model also comes out of a stream job, a long-running job that produces a model every 5 minutes, and each of these models must likewise be sent to model validation — a stream-triggers-stream step.
Another example is real-time sample splicing. As everyone knows, Flink has the concept of a watermark: when all data before a certain moment has been collected, a batch training run can be triggered from the streaming job. Traditional workflow scheduling cannot express this, because it is based on job status changes.
That is, if a job finishes without errors, the data it produced is ready, and downstream jobs that depend on that data can run — one job finishes, the next one starts. But as soon as the workflow contains a stream job, the whole workflow stalls, because a stream job never finishes.
In real-time computation, the data keeps changing but may become ready at any moment — the data up to a certain point can be complete even though the job never ends. So we introduced a workflow system, which we call Flink AI Flow, to solve the coordination among the jobs in the figure.
In Flink AI Flow, a node is a logical processing unit, and the relationship between nodes is no longer "previous job, next job" but event-driven: conditions triggered by events.
At the workflow execution level, the scheduler no longer schedules based on changes in job status but on events. For example, when the watermark of a stream job passes a point in time, all data before that point is complete, and a batch job can be triggered to run — without the stream job having to finish.
For each job, the scheduler starts or stops it when the received events satisfy a condition. For example, when a model is generated, that event satisfies a condition asking the scheduler to launch a model-validation job — an event produces a condition, and the scheduler performs the corresponding action.
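The event-driven scheduling just described could be sketched like this. All names, the event shapes, and the condition/action API are assumptions; this only illustrates the idea of jobs triggered by conditions over events rather than by upstream jobs finishing.

```python
class EventScheduler:
    """Sketch of event-based scheduling: instead of running a job
    when an upstream job *finishes*, a job is started (or stopped)
    when a condition over received events becomes true."""

    def __init__(self):
        self.rules = []   # (condition, action) pairs
        self.log = []

    def on(self, condition, action):
        self.rules.append((condition, action))

    def emit(self, event):
        # Every event is matched against every rule's condition.
        for condition, action in self.rules:
            if condition(event):
                self.log.append(action(event))

sched = EventScheduler()
# Watermark of the stream job passes day end -> all data for the
# day is complete -> trigger batch training (stream job keeps running).
sched.on(lambda e: e["type"] == "watermark" and e["ts"] >= 86400,
         lambda e: "start batch training")
# A new model version appears -> launch a model-validation job.
sched.on(lambda e: e["type"] == "model_generated",
         lambda e: f"validate {e['version']}")

sched.emit({"type": "watermark", "ts": 86500})
sched.emit({"type": "model_generated", "version": "v42"})
print(sched.log)  # ['start batch training', 'validate v42']
```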
Besides the scheduling service, Flink AI Flow provides three additional support services to satisfy the full AI workflow semantics: the metadata service, the notification service, and the model center.
- The metadata service manages data sets and the states in the entire workflow;
- The notification service supports the semantics of event-based scheduling;
- The model center manages the model's life cycle.
3. Real-time deep learning training PAI-ODL
After Flink generates real-time samples, there are two streams in the ODL system.
- The first is the real-time stream: generated real-time samples are sent to a streaming data source such as Kafka. The samples in Kafka flow in two directions — one to online training, the other to online evaluation.
- The second is the offline training data stream: offline T+1 training reads it from the data warehouse.
For online training, users can configure the frequency of model generation — for example, producing an updated online model every 30 seconds or 1 minute. This suits real-time recommendation scenarios, especially those with high timeliness requirements.
ODL lets users set metrics that automatically determine whether a generated model is deployed. When the evaluation side meets these metric requirements, the model goes online automatically. Because models are generated so frequently, manual intervention is unrealistic; the user sets the metrics, and the system automatically pushes the model online once they are met.
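The automatic deployment gate amounts to a threshold check over user-configured metrics. A minimal sketch, with the metric names ("auc") and threshold values assumed for illustration:

```python
def should_deploy(metrics, thresholds):
    """Sketch of automatic model validation: with a model produced
    every 30s-1min, no human can review each one, so the user
    configures metric floors and only models meeting all of them
    are deployed."""
    return all(metrics.get(name, 0.0) >= floor
               for name, floor in thresholds.items())

thresholds = {"auc": 0.72}
print(should_deploy({"auc": 0.75}, thresholds))  # True  -> deploy
print(should_deploy({"auc": 0.70}, thresholds))  # False -> hold back
```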
On the offline stream there is a line called model calibration: the T+1 model produced by offline training calibrates the model used for online training.
PAI-ODL technical point analysis
1. Very large sparse model training
Training very large sparse models is a common requirement in sparse scenarios such as recommendation, search, and ads. A typical, traditional deep learning engine such as TensorFlow natively implements variables with a fixed size (fix-size), which causes some common problems in sparse scenarios.
Take the static shape: in typical scenes, such as a mobile app, new users join every day, and new products, news, and videos appear every day. A fixed-size shape cannot express this growth in a sparse scene, and it limits long-term incremental training of the model: if a model trains incrementally for one or two years, the size set at the start is very likely to fall far short of business needs, causing serious feature conflicts that hurt model quality.
Conversely, if the static shape is set very large but utilization is low, memory is wasted and IO becomes partly useless, including wasted disk when exporting the full model.
PAI-ODL is based on the PAI-TF engine, and PAI-TF provides the embedding variable feature, which makes features dynamically elastic: each new feature gets a new slot. It also supports feature elimination — for example, when a product is taken off the shelves, the corresponding feature is deleted.
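The embedding-variable idea, as opposed to a fixed-size variable, can be sketched with a growable table. This is not PAI-TF's API — just an illustration of dynamic slot allocation and feature elimination, with all names assumed.

```python
import random

class DynamicEmbeddingTable:
    """Sketch of the embedding-variable idea: instead of a
    fixed-size variable (where ids hashed into a static shape
    may collide), the table grows a fresh slot for every new
    feature id and supports eliminating retired features."""

    def __init__(self, dim=4, seed=0):
        self.dim = dim
        self.table = {}
        self.rng = random.Random(seed)

    def lookup(self, feature_id):
        # New feature (new user / product / video): allocate a slot.
        if feature_id not in self.table:
            self.table[feature_id] = [self.rng.uniform(-0.1, 0.1)
                                      for _ in range(self.dim)]
        return self.table[feature_id]

    def evict(self, feature_id):
        # Feature elimination, e.g. a product taken off the shelves.
        self.table.pop(feature_id, None)

emb = DynamicEmbeddingTable(dim=4)
emb.lookup("product:123")
emb.lookup("product:456")
emb.evict("product:123")
print(sorted(emb.table))  # ['product:456']
```

Because the table only holds features that actually occurred, the fixed-size waste and feature-conflict problems above do not arise.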
An incremental model records only the sparse features that changed within the interval — say, one minute — together with the full dense parameters.
Based on incremental model export, models can be updated quickly in the ODL scenario: the incremental model is very small, so models can be pushed online frequently.
2. Support hot update of models in seconds
Among the users we work with, three points usually matter:
- The first is model quality: will the effect be good after going online?
- The second is cost: how much will it cost?
- The third is performance: can it meet the RT (response time) requirements?
The embedding store's multi-level hybrid storage lets users configure different storage tiers, reducing cost as much as possible while still meeting performance requirements.
The embedding scene has its own characteristics: features show a clear hot/cold split. Some products or videos are inherently popular, or become popular through user clicks, while unpopular ones get no clicks at all. This clear separation of hot and cold follows the 80/20 rule.
EmbeddingStore stores hot features in DRAM and cold features on PMEM or SSD.
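The hot/cold tiering can be sketched as follows. Both tiers are plain dicts here, and the promotion policy (promote after a fixed access-count threshold) is an assumption made for illustration, not EmbeddingStore's actual policy.

```python
class TieredEmbeddingStore:
    """Sketch of multi-level hybrid storage: hot features live in
    the fast tier (DRAM), cold ones in the slow tier (PMEM/SSD),
    matching the hot/cold skew of clicks. A feature is promoted
    once its access count crosses a threshold."""

    def __init__(self, hot_threshold=3):
        self.dram, self.ssd = {}, {}
        self.hits = {}
        self.hot_threshold = hot_threshold

    def put(self, key, vec):
        self.ssd[key] = vec          # everything starts cold

    def get(self, key):
        self.hits[key] = self.hits.get(key, 0) + 1
        if key in self.dram:
            return self.dram[key]
        vec = self.ssd[key]
        if self.hits[key] >= self.hot_threshold:
            self.dram[key] = self.ssd.pop(key)   # promote hot feature
        return vec

store = TieredEmbeddingStore(hot_threshold=3)
store.put("video:hot", [0.1])
store.put("video:cold", [0.2])
for _ in range(3):
    store.get("video:hot")
print("video:hot" in store.dram, "video:cold" in store.ssd)  # True True
```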
3. Super-sparse model prediction
In addition, EmbeddingStore supports a distributed storage service. During serving, each node would normally need to load the full model; with EmbeddingStore's distributed service, loading the full model on every serving node can be avoided.
EmbeddingStore lets users configure a distributed, independently isolated embedding store service; each serving node queries the EmbeddingStore service for sparse features.
The design of EmbeddingStore fully considers the data format and access pattern of sparse features. A simple example: a sparse feature is a key-value pair with an int64 key and a float-array value. Both serving and training access it in large batches, and access during the inference phase is read-only and lock-free. These characteristics motivated us to design a sparse feature store specific to embedding scenarios.
4. Real-time training model correction
Why does PAI-ODL use offline-trained models to calibrate online training?
In real-time training, problems such as inaccurate labels and skewed sample distributions inevitably arise, so the day-level model is automatically used to calibrate online training and improve model stability. The model calibration provided by PAI-ODL requires no manual work: once users have set the configuration for their business, the online training side calibrates against the new full base model automatically every day. When offline training produces a base model, online training automatically finds it, jumps to the corresponding position in the sample stream, and resumes training from the latest base model and the new online samples.
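The daily calibration step can be sketched as: reset the online model to the newest T+1 base model, then rewind the sample stream to that model's data cut-off so online training replays from there. All field names are illustrative stand-ins.

```python
def calibrate(online_state, base_models, samples):
    """Sketch of daily model calibration: when offline T+1 training
    publishes a new base model, online training restarts from it
    and rewinds the sample stream to the base model's cut-off
    point, discarding drift accumulated since then."""
    base = max(base_models, key=lambda m: m["day"])   # newest T+1 model
    online_state["weights"] = dict(base["weights"])   # reset to base
    # Jump back to the samples produced after the base model's data cut-off.
    online_state["replay"] = [s for s in samples if s["ts"] > base["cutoff_ts"]]
    return online_state

base_models = [{"day": 1, "cutoff_ts": 100, "weights": {"w": 0.5}},
               {"day": 2, "cutoff_ts": 200, "weights": {"w": 0.8}}]
samples = [{"ts": 150}, {"ts": 250}]
state = calibrate({"weights": {"w": 0.3}}, base_models, samples)
print(state["weights"]["w"], len(state["replay"]))  # 0.8 1
```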
5. Model rollback and sample playback
Although there is anomalous sample detection and handling, it is still unavoidable that an online model update occasionally degrades the online effect.
When the user receives an alarm that online metrics have dropped, the system must provide the ability to roll back the model.
However, in online training, several model iterations may have passed between the problem occurring and human intervention, so the rollback includes:
1) The online serving model rolls back to the last model before the problem;
2) Online training also jumps back to the last model before the problem;
3) The sample stream jumps back to that point in time so training can restart.
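The three rollback steps can be sketched together. The data structures are illustrative stand-ins; the point is that serving, training, and the sample stream all rewind to the same last-good model.

```python
def rollback(serving, training, sample_stream, model_history, bad_ts):
    """Sketch of the three rollback steps: (1) serving goes back to
    the last model produced before the problem, (2) online training
    restarts from that model, (3) the sample stream rewinds to the
    same point so training replays from there."""
    # Last good model strictly before the problem timestamp.
    good = max((m for m in model_history if m["ts"] < bad_ts),
               key=lambda m: m["ts"])
    serving["model"] = good["version"]          # step 1
    training["base"] = good["version"]          # step 2
    sample_stream["offset"] = good["ts"]        # step 3
    return serving, training, sample_stream

history = [{"ts": 10, "version": "v1"}, {"ts": 20, "version": "v2"},
           {"ts": 30, "version": "v3"}]
serving, training, stream = rollback({}, {}, {}, history, bad_ts=25)
print(serving["model"], stream["offset"])  # v2 20
```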
Copyright Statement: content of this article is contributed spontaneously by Alibaba Cloud real-name registered users, and the copyright belongs to the original author. The Alibaba Cloud Developer Community does not own its copyright and does not assume corresponding legal responsibilities. For specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and the "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find suspected plagiarism in this community, fill in the infringement complaint form to report it. Once verified, the community will immediately delete the suspected infringing content.