Author: Polk

1. Live broadcast business background

1.1 Business Background

Live broadcast recommendation is embedded in many places in the Cloud Music app, including the live module on the song playback page (the largest scene), the live cards mixed into the comment page, and the six-card live module on the Cloud Music homepage. As shown in the figures below, Figure 1.1 is the live module on the playback page, Figure 1.2 is the six-card live module on the Cloud Music homepage, and Figure 1.3 is the live module on the song comment page.

Live broadcasts in different recommendation positions carry different content missions. On the homepage, we lean toward letting new users discover live streaming and letting existing users quickly enter a live room to watch; on the song playback page, we lean toward recommending live content related to the song the user is listening to; on the song comment page, live streaming is more like another form of content, a supplement to social content.

At the same time, the live streaming service is a derivative business of Cloud Music, and its recommendation differs from music recommendation, mainly in unclear user intent, sparse user behavior within live streaming, high real-time requirements, and the need for debiasing.

This article focuses on the real-time recommendation system for live streaming. Starting from the algorithmic characteristics of the scene, it introduces the optimization goals of live recommendation and explains, from multiple perspectives, why live recommendation needs to be real-time. It then outlines the basic components and solutions for recommender-system real-timeness, which consists of feature real-timeness, model real-timeness, and system real-timeness. It also describes in detail how we built the entire real-time system as the Cloud Music live scene evolved, and how we solved the difficulties and challenges the real-time work brought. Finally, it presents the online results and analysis in a large-scale scenario.

1.2 Features of Scene Algorithm

On most platforms, the number of anchors broadcasting at any given moment is limited, so recall is not the main concern; in essence, live recommendation is a process of fine ranking (or coarse ranking plus fine ranking) over the anchors who are currently live.

1.2.1 Multi-objective

In a recommendation scenario, users generally exhibit more than one behavior, and different behaviors occur in a sequence with dependencies between them. Live recommendation is exactly such a correlated multi-objective scene: however we model it, it is a multi-objective process. For example, we want a high click-through rate, a long viewing time, and a high conversion rate, and there are also goals such as fan intimacy.
Figure 3 shows the entire multi-objective learning process for live recommendation.

When a user browses the live module on the homepage, they first click, then enter the live room to watch; once the watch exceeds 30s, a target task is generated: an effective view. The user may then like the anchor, interact with them, follow them, or even send gifts. Each of these behaviors can become a target in a multi-objective model, and a model whose targets have such dependencies is a correlation-type multi-objective model. The process can thus be transformed into a multi-objective relationship graph, and each task and goal in it is what our model needs to learn.

1.2.2 Real-time

Traditional personalized recommendation updates a user's results once a day, whereas real-time recommendation adjusts them based on the user's behavior in the last few seconds. A real-time recommendation system lets the user's current interests feed back immediately into the recommendation results, giving a what-you-see-is-what-you-get experience that firmly captures the user's interest and keeps them immersed. Live recommendation has a very high demand for real-timeness, which can be illustrated from several perspectives: the recommended item, data indicators, and the business environment.

The recommended-item perspective

Live recommendation ranks the anchors who are online in real time, which is only a small fraction of all anchors. As the distribution of anchors going live on our platform shows (Figure 4), the platform has anchors on the order of one million, while only on the order of ten thousand go live on any given day. The anchors online in the morning, afternoon, and evening are also different; generally there are more anchors in the evening, and the better-known ones tend to broadcast then.

The item recommended in live streaming is the anchor, a "live" good, which differs from the songs in music recommendation or the images, text, and videos in feed recommendation. An anchor is a constantly changing item: as the picture below shows, every time a user enters the live room, the content and the anchor's state are different. The anchor may be in a PK, performing, or chatting, whereas a song or video is a completely static item that plays from the beginning at every impression. Live recommendation therefore recommends not just an item but also a status.

Data indicators

An anchor's real-time efficiency also reflects this. The efficiency of a given anchor varies drastically across a single day, with multiple peaks and troughs, which may relate to the anchor's performance state or the live content at the time. A recommendation system, however, fits historical data to predict future trends; if it cannot obtain the anchor's real-time performance quickly enough, it is very likely to misfit the anchor's future trend, possibly even predicting the exact opposite.

The business environment

The Cloud Music live business is closely coupled with other businesses; a change in one place ripples through everything. This is because live modules are not recommended at fixed positions. For example, the homepage live module is affected by the music style recommendation service, which can cause it to appear in the third, sixth, or seventh position; the live module on the song playback page is affected by the song itself, because a live room's efficiency differs from song to song. Once another business changes, the data distribution of the live business is therefore very likely to shift dramatically.

So whether we look at the recommended item, the anchor's data indicators, or the overall environment, live recommendation urgently needs a real-time recommendation system.

2. The real-time nature of the recommendation system

The real-timeness of a recommendation system consists of feature real-timeness, model real-timeness, and system real-timeness: real-time features are needed to capture the data distribution in real time; a real-time model is needed to fit that distribution in real time; and finally a real-time system is needed to serve the latest model and data.

System real-timeness aside, algorithm engineers naturally care most about feature and model real-timeness. Specifically:

  1. The faster the recommendation system updates, the better it reflects the user's recent habits, and the more time-sensitive the recommendations it can make.
  2. The faster the recommendation system updates, the easier it is for the model to discover the latest popular data patterns and respond to new trends.

2.1 Feature real-time

A recommendation system relies on powerful data-processing capabilities:

  • The system collects the input features required by the model in "real time", so the recommender can use the latest features for prediction and recommendation.
  • A user's behavior at time T-1 (playing a song, watching an anchor, tipping/paying) needs to be fed back into the training data at time T for the model to learn from.

The figure shows a common implementation of real-time features, consisting of a log system, offline profiles, and real-time profiles. Storm, Flink, and Kafka handle the processing and transport of real-time data, which is stored in HBase and Redis and finally landed to HDFS. The intermediate real-time sample processing step uses a snapshot system to solve sample-crossing and consistency problems.
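The snapshot idea can be sketched as follows. This is a minimal in-memory illustration, not the actual system: the class and field names (`FeatureSnapshotStore`, `user_watch_cnt`, and so on) are hypothetical. Features are frozen at prediction time and joined with the label when it arrives later, so a training sample never sees "future" feature values.

```python
class FeatureSnapshotStore:
    """Toy snapshot store: features are frozen at serving time and joined
    with labels when they arrive later, preventing feature crossing."""

    def __init__(self):
        self._snapshots = {}  # request_id -> feature dict at serving time

    def record(self, request_id, features):
        # Serving path: persist the exact features used for this prediction.
        self._snapshots[request_id] = dict(features)

    def join_label(self, request_id, label):
        # Called when the user's feedback (click / watch) arrives.
        feats = self._snapshots.pop(request_id, None)
        if feats is None:
            return None  # snapshot expired or missing; drop the sample
        return {"features": feats, "label": label}

# Serving time: the user's real-time watch count is 3.
store = FeatureSnapshotStore()
store.record("req-1", {"user_watch_cnt": 3, "anchor_online_min": 12})

# By the time the label arrives the live profile may have moved on, but the
# training sample still carries the values seen at prediction time.
sample = store.join_label("req-1", label=1)
print(sample["features"]["user_watch_cnt"])  # 3, not any updated value
```

In the real pipeline this role is played by the snapshot system sitting between the log stream and sample landing; the principle is the same.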

However, no matter how real-time the features are, their scope of influence is limited to the current user. To quickly grasp global data changes and newly emerging data patterns at the system level, the real-timeness of the model itself must be strengthened.

2.2 Model real-time

Compared with feature real-timeness, model real-timeness operates at a more global level. Real-time features try to describe an individual with more accurate signals so that the system can return results better suited to that person, whereas a real-time model captures new data patterns at the global level faster, discovering new trends and correlations.

The most important way to strengthen model real-timeness is to change how the model is trained. Ranked by increasing real-timeness, the options are full update, incremental update, and online learning.
Different update methods bring different trade-offs. With a full update, the model retrains on all samples within a time window, and the new model then replaces the old one. This requires a large sample size and a long training time, and the data it learns from is the most delayed, but sample accuracy is the highest.
Online learning updates fastest; it is an extreme form of incremental update in which the model updates on every new sample as it arrives. But it leads to a serious problem: poor model sparsity, admitting too many fragmented, unimportant features.
Incremental update is the trade-off between the two: it reduces the long training time and data delay of full updates while avoiding the training instability of per-sample updates, so it is also the approach we mainly use.
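The difference between a full update and an incremental update can be sketched with a toy logistic-regression loop. This is plain SGD on synthetic data, purely illustrative and unrelated to the production model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_update(w, samples, lr=0.1, epochs=1):
    """Plain logistic-regression SGD; the weight list `w` is updated in place."""
    for _ in range(epochs):
        for x, y in samples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            g = p - y  # gradient of log loss w.r.t. the logit
            for i, xi in enumerate(x):
                w[i] -= lr * g * xi
    return w

# Full update: retrain from scratch on the whole window of samples.
history = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 50
w_full = sgd_update([0.0, 0.0], history, epochs=5)

# Incremental update: warm-start from the current weights and only consume
# the newly arrived mini-batch, which is far cheaper and far fresher.
new_batch = [([1.0, 1.0], 1)] * 10
w_incr = sgd_update(list(w_full), new_batch)

print(w_incr != w_full)  # the fresh batch alone shifted the weights
```

The warm-started pass touches only the new samples, which is why incremental updates can run every few minutes while a full retrain runs daily.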

3. Real-time evolution of live broadcast fine-arrangement model

The real-time work of Cloud Music live streaming has always walked on two legs: real-time features and a real-time model. At first we mainly improved the system's real-time capability by continuously adding real-time features across dimensions to reflect current changes in the anchor, the user, and the context, so that the model could track the present and predict future trends. While improving feature real-timeness, we also kept upgrading and iterating the model structure. The business initially used feature engineering plus a simple single-model logistic regression, whose core lay in real-time data and the construction of real-time cross features. We have since iterated to an ESMM+DFM+DMR model: ESMM joint training addresses the SSB problem and mitigates the data sparsity of the samples, while the DMR structure captures users' long- and short-term interests. But a problem remained at this stage: the features were fast enough and the model complex enough, yet the model itself did not update fast enough to capture new global data patterns and trends.

4. Real-time incremental model construction

Through exploration and practice, the Cloud Music live algorithm team has developed a relatively mature and effective training framework for real-time incremental models.

  • The left part of the framework is real-time incremental learning, the right part offline learning. Offline training is retained: it trains on the last 7 days of data and is mainly used for warm restarts of the incremental model. Incremental learning consumes Kafka's real-time data stream and processes it through ETL for real-time model training.
  • Real-time sample cumulative attribution: real-time samples are processed and landed to HDFS. During stream processing, the samples are cumulatively attributed against the day's historical samples to guarantee label accuracy. Whether cumulative attribution is needed depends on the scenario: scenes with heavy repeated exposure, such as the homepage, need it; scenes with little repeated exposure do not.
  • Real-time model training: the training data of the incremental task is the day's cumulatively attributed samples, which keep growing with the Kafka stream. Each training run warm-starts from the latest offline model. Again, whether samples need cumulative attribution depends on the landing scenario.
  • Model synchronization: a model file is exported every 15 minutes and synchronized to the serving engine; the synchronization frequency is adjustable.

4.1 Offline model

The incremental model iterates further on top of the offline model: without changing the original network structure, it incrementally updates the model parameters to quickly capture system-level global data changes and newly emerging data patterns.

Our existing offline main model is a deep-interest ESMM-DFM model. It borrows the Wide&Deep idea, adds a feature-crossing module and a user-interest module under the ESMM framework, and finally uses a ResNet-DNN to speed up model convergence.

  • ESMM joint training solves the SSB problem and mitigates the data sparsity of the samples
  • DFM replaces DNN to increase interaction across feature fields
  • U2I and I2I networks are modeled explicitly to strengthen the interest connection between users and anchors
  • Output layer: ResNet-DNN replaces a plain sigmoid output to speed up model convergence
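The ESMM part of this design can be illustrated with a toy loss function. This is a simplified sketch of the published ESMM formulation, not our production code: pCTCVR is factored as pCTR × pCVR, and both the CTR head and the CTCVR head are trained over the full exposure space, which is what removes the sample selection bias of training CVR on clicked samples only.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for a single prediction, clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def esmm_loss(p_ctr, p_cvr, clicked, converted):
    """ESMM supervises pCTR and pCTCVR = pCTR * pCVR over all impressions;
    pCVR is only learned implicitly through the product."""
    p_ctcvr = p_ctr * p_cvr
    return bce(p_ctr, clicked) + bce(p_ctcvr, converted)

# An exposed-but-not-clicked impression still contributes to both losses:
loss = esmm_loss(p_ctr=0.2, p_cvr=0.5, clicked=0, converted=0)
print(round(loss, 3))  # -ln(0.8) - ln(0.9) ≈ 0.329
```

Because unclicked impressions enter the CTCVR loss with label 0, the CVR tower is trained on the same distribution it is served on.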

4.2 Sample Attribution

Most conversion-related positive samples arrive with an obvious delay, so the real-time data distribution is not the true distribution: when the real-time samples are landed, the positive samples have not yet arrived. The delay comes from the natural sequencing of user behavior reporting, which makes labels lag: an exposure log is generated first, then the click, then the view, and the order is irreversible. This raises the question of how to assign the attributed sample label when exposure and click data for the same user on the same page arrive one after the other.

There are two common sample-attribution methods in the industry: the negative-sample cache attribution method proposed by Facebook, and the sample-correction method proposed by Twitter.

In the negative-sample cache method, shown in the figure, negative samples are cached first while waiting for a potential positive sample. If the positive sample arrives in time, only the positive sample is kept and the model is updated once. The cache's time window is generally determined by the conversion time; covering more than 95% of conversions is enough. This approach delays samples somewhat, but label accuracy is relatively high.
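A minimal in-memory sketch of this caching idea follows; the key structure and the window value are illustrative, not taken from any real implementation:

```python
class NegativeSampleCache:
    """Toy Facebook-style attribution: hold each exposure (negative) for a
    time window; if a click with the same key arrives in time, emit only
    the positive sample."""

    def __init__(self, window_sec=300):
        self.window = window_sec
        self.pending = {}  # key -> exposure timestamp

    def on_exposure(self, key, ts):
        self.pending[key] = ts

    def on_click(self, key, ts, emit):
        if key in self.pending and ts - self.pending.pop(key) <= self.window:
            emit((key, 1))  # positive replaces the cached negative

    def flush(self, now, emit):
        # Windows that expired without a click are emitted as negatives.
        for key, ts in list(self.pending.items()):
            if now - ts > self.window:
                emit((key, 0))
                del self.pending[key]

out = []
cache = NegativeSampleCache(window_sec=300)
cache.on_exposure(("u1", "anchor9"), ts=0)
cache.on_exposure(("u2", "anchor9"), ts=0)
cache.on_click(("u1", "anchor9"), ts=30, emit=out.append)
cache.flush(now=400, emit=out.append)
print(sorted(out))  # one positive for u1, one expired negative for u2
```

The emitted pairs feed straight into training, so the window length trades sample freshness against label accuracy, exactly the tension discussed above.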

Twitter's approach keeps both samples and updates the model on each, which is the most real-time but depends heavily on the sample-correction strategy.

Under our existing engineering framework we leaned toward a Facebook-style negative-sample cache strategy, but direct migration ran into problems: when we tried to land it on the homepage live module, the overall sample label join rate was only 70%.

This stems from a peculiarity of the Cloud Music homepage: users leave the homepage and later return without the feed being re-pulled, so a user's clicks and views can far exceed their logged exposures, and the cache cannot wait that long. If the real-time samples were landed directly, the same features would correspond to both positive and negative labels, injecting too much noise into model training.

We therefore designed cumulative sample attribution: the sample cache stays unchanged, and an added offline process attributes the real-time samples against the day's historical samples, guaranteeing that all samples up to the current moment are accurate. This sacrifices a little time in exchange for high sample accuracy, ultimately raising it from 70% to 97%.
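The merge step can be sketched as a toy function; the impression-key structure here is hypothetical, and the real process runs over HDFS-landed samples rather than an in-memory list:

```python
def cumulative_attribution(day_samples):
    """Toy day-level cumulative attribution: for each impression key seen
    so far today, merge all records and keep the strongest label, so a late
    click on a stale (non-re-pulled) feed cannot leave the same features
    with contradictory labels."""
    merged = {}
    for key, label in day_samples:
        merged[key] = max(merged.get(key, 0), label)
    return merged

# The user saw the anchor once (logged as negative), left the app, came
# back without the feed being re-pulled, and then clicked:
stream = [
    (("u1", "anchor9", "homepage"), 0),
    (("u1", "anchor9", "homepage"), 1),  # late positive for the same exposure
    (("u2", "anchor3", "homepage"), 0),
]
merged = cumulative_attribution(stream)
print(merged)
```

Re-running the merge as the day's sample set grows is what makes the labels "accurate up to the current moment".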

4.3 Model Warm Restart

As the technical architecture diagram of real-time incremental learning above shows, the offline training process on the right is not discarded, because incremental learning relies on the offline model for warm restarts. There are three main reasons for warm-restarting from the offline model:

(1) To prevent the model from being biased by local patterns; warm-starting from the offline model corrects this.

(2) If model training is long-running, the probability of OOV in the model's vocabulary increases, and new IDs cannot be quickly added to the model's dictionary nor old IDs quickly retired. Warm-restarting from the offline model keeps the vocabulary synchronized and prevents OOV.

(3) It depends on the scene: as described in 4.2, the homepage live scene suffers heavily from repeated exposure without the feed being re-pulled, so real-time samples need further cumulative attribution. The samples obtained this way are the day's cumulative samples up to the current moment, so each model update must run gradient descent on top of the daily-updated offline model.

We therefore designed a day-level warm restart from the offline model plus a 15-minute sample-level restart, which solves all three problems above: the OOV caused by long-running training, the local-pattern bias, and the accuracy of the labels entering model training.

4.4 Feature Access Scheme

As the cornerstone of machine learning, the quality of samples and features directly determines the upper bound of model performance, so guaranteeing it is crucial to the effectiveness of real-time incremental learning. Section 4.2 above focused on the accuracy of the samples entering training; this section starts from concrete cases and introduces our solutions from the perspectives of distribution-biased features and low-frequency features.

Case 1: time-biased features, such as week and hour. In incremental samples they concentrate on one or two values, completely unlike their distribution in offline training samples.
Case 2: low-frequency, unconfident features. For example, an anchor id that appears in only 2 exposures, 1 click, and 1 conversion; fed into the model as an id feature, it statistically has a 50% click-through rate and a 100% conversion rate.

We therefore designed a set of feature-admission schemes to alleviate this, covering feature freezing, hard feature admission, and dynamic-L1-regularized soft admission. The specific scheme is shown in the figure.

4.4.1 Feature Freezing

Machine learning rests on the i.i.d. assumption: the data distribution at training time must match the distribution at prediction time. In offline batch samples, time features such as week and hour are fully shuffled, so models trained on them still satisfy this assumption. In real-time streaming training, however, samples arrive in time order and cannot be fully shuffled: the model keeps training on the feature value of the current hour, while prediction may switch to the next hour's value, hurting generalization. Time-biased features therefore do not participate in the model's parameter updates; they act as frozen nodes, preventing them from dragging the model into a local optimum.
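Freezing at update time can be sketched as follows, assuming a toy dict-of-weights model (in a real deep-learning framework this would be the equivalent of stopping gradients on those embedding fields):

```python
def apply_gradients(weights, grads, frozen, lr=0.1):
    """Skip parameter updates for frozen feature fields (e.g. `week`,
    `hour`), so streaming arrival order cannot drag them to a local
    optimum; all other fields get a plain SGD step."""
    return {
        name: w if name in frozen else w - lr * grads.get(name, 0.0)
        for name, w in weights.items()
    }

weights = {"hour": 0.5, "week": -0.2, "anchor_id_42": 0.1}
grads = {"hour": 1.0, "week": 1.0, "anchor_id_42": 1.0}
updated = apply_gradients(weights, grads, frozen={"hour", "week"})
print(updated)  # only anchor_id_42 moved; hour and week are untouched
```

The frozen values still contribute to the forward pass; only their updates are suppressed during streaming training.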

4.4.2 "Hard Admission"

Since incremental samples are far fewer than offline training samples, the frequency-filtering thresholds used on the full data do not necessarily apply to incremental features. For example, a feature's frequency filter might be set to 1000 on the full data, but it should be smaller for increments: full training uses samples accumulated over a day, while incremental training uses samples from the last 15 minutes or 1 hour. A feature filtered out for appearing fewer than 1000 times loses those occurrences for good; even if its frequency climbs past 1000 within the next hour, the filtered occurrences in already-trained samples cannot be recovered.

For such low-frequency features we use two methods. The first is a hard-admission strategy: we set two frequency thresholds, a larger full threshold used during full updates and a smaller incremental threshold used during incremental updates. When building incremental samples, the counts of features previously filtered by the full threshold are retained; if such a feature reappears in the next increment, its frequency is taken as the full count plus the current incremental count. A feature enters model training only once its frequency reaches the admission threshold. With this approach, the online click-through rate increased by 0.94%, and the effective viewing rate also improved.
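A toy sketch of the two-threshold gate with carried-over counts follows; the class name and threshold values are illustrative, not the production settings:

```python
class HardAdmission:
    """Frequency gate with two thresholds: a large one for full (batch)
    training and a smaller one for incremental batches. Counts of features
    rejected in the full pass are retained, so full + incremental
    occurrences accumulate toward admission."""

    def __init__(self, full_threshold=1000, incr_threshold=20):
        self.full_t = full_threshold
        self.incr_t = incr_threshold
        self.counts = {}  # feature -> accumulated frequency

    def observe_full(self, feature, freq):
        self.counts[feature] = self.counts.get(feature, 0) + freq
        return self.counts[feature] >= self.full_t

    def observe_incremental(self, feature, freq):
        self.counts[feature] = self.counts.get(feature, 0) + freq
        return self.counts[feature] >= self.incr_t

gate = HardAdmission(full_threshold=1000, incr_threshold=20)
admitted_full = gate.observe_full("anchor_77", 15)        # 15 < 1000: rejected
admitted_incr = gate.observe_incremental("anchor_77", 8)  # 15 + 8 >= 20: admitted
print(admitted_full, admitted_incr)
```

The key point is that the rejected full-pass count (15) is not thrown away; it combines with the increment to cross the smaller threshold.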

Advantage: it solves the problem of unconfident weights learned from features whose frequency in the sample is too low.

Disadvantage: it still filters out some low-frequency features and loses part of the effective information; the first n occurrences of a feature are ignored until the admission threshold is reached.

4.4.3 "Soft Admission"

As noted in the disadvantages of the hard-admission scheme above, hard admission filters out some low-frequency features and loses part of the effective information. Ignoring a feature's first n occurrences until a threshold is reached also breaks sample integrity: if the full-data frequency is 99, the incremental frequency is 1, and the threshold is 100, the feature's first 99 occurrences are discarded and only a single occurrence is trained on, making model training unstable. We need a smoother approach.

For soft admission, there are two common approaches in the industry: feature-frequency estimation based on the Poisson distribution, and a dynamic L1 regularization scheme.

Poisson-distribution-based frequency estimation

Features after an offline shuffle are uniformly distributed, but in an online data stream, a feature's arrivals into the training system can be regarded as a Poisson process, following the Poisson distribution:

$$ P(N(t)=n)=\frac{(\lambda t)^n e^{-\lambda t}}{n!}$$

Here n is the number of occurrences so far, t is the current number of steps, λ is the occurrence rate per unit time (the main parameter of the Poisson distribution), T is the total number of training steps, and \( \bar{N} \) is the feature's admission threshold (the minimum number of occurrences within time T).

Taking the first t steps as the unit of time, the occurrences observed so far give \( \lambda = \frac{n}{t} \). By the Poisson distribution we can then compute the probability that the feature occurs at least \( \bar{N}-n \) more times in the remaining \( \frac{T-t}{t} \) units of time:

$$P_{\lambda_{i}}\left(N\left(\tfrac{T-t}{t}\right) \ge \bar{N}-n\right)$$

Each time the feature appears, Bernoulli sampling is done with probability \( P_{\lambda_{i}} \); the probability that the feature enters the system at step t is then:

$$ P = \prod_{i=1}^{t-1} (1-P_{\lambda_{i}})\, P_{\lambda_{t}}$$

Simulation on real online data shows this can approach the effect of offline frequency filtering, with λ computed dynamically as each feature arrives. Its defects: when t is small, the variance of the occurrence count within t is large, so features may be admitted or discarded somewhat arbitrarily; the total number of future training steps T is unknown in online learning; and frequency filtering is separate from the optimizer, so no optimizer statistics are available.
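The admission probability can be sketched as follows, under one natural reading of the formulas above: treating λ = n/t as a per-step rate, so the expected number of remaining occurrences is λ(T−t). The function name and parameter values are illustrative.

```python
import math

def poisson_admission_prob(n, t, T, N_min):
    """P(feature reaches the admission threshold N_min by step T), assuming
    its arrivals are a Poisson process with rate lam = n / t estimated from
    the n occurrences seen in the first t steps."""
    lam = n / t
    mean_remaining = lam * (T - t)  # expected further occurrences
    need = max(N_min - n, 0)
    if need == 0:
        return 1.0  # threshold already met
    # Survival function: P(X >= need) = 1 - sum_{k < need} e^-mu mu^k / k!
    cdf = sum(math.exp(-mean_remaining) * mean_remaining**k / math.factorial(k)
              for k in range(need))
    return 1.0 - cdf

# A feature seen 5 times in the first 1000 steps, threshold 10, horizon 1200:
p = poisson_admission_prob(n=5, t=1000, T=1200, N_min=10)
print(0.0 < p < 1.0)  # used as the keep-probability in Bernoulli sampling
```

Features with little remaining headroom to reach the threshold get a small keep-probability and are mostly discarded, approximating offline frequency filtering.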

Dynamic L1 Regularization Scheme

Regularization implements the structural-risk-minimization strategy by adding a regularization (penalty) term to the empirical risk. The term is generally a monotonically increasing function of model complexity: the more complex the model, the larger the penalty. The L1 norm, the sum of the absolute values of a vector's elements, is also called lasso regularization; used as a regularizer, it makes the model parameters θ sparse, i.e. drives as many elements of the parameter vector to zero as possible. In a classical FTRL implementation the L1 coefficient is the same for every feature. A large enough L1 can filter out extremely low-frequency features, but the constraint is so strong that some effective features are lassoed away too, hurting model performance.

Following the dynamic L1 regularization scheme described in Ant's article on real-time streaming training, we make the L1 coefficient depend on feature frequency, so that features of different frequencies experience different lasso strength.

Feature frequency relates to the confidence of MLE-based parameter estimation: the fewer the occurrences, the lower the confidence. Adding a prior distribution (the regularization term) on top of pure frequency statistics means that the lower the confidence of those statistics, the more we lean on the prior, i.e. the larger the regularization coefficient. The empirical formula is:

$$ L_{1}(w_{i}) \leftarrow L_{1}(w_{i}) \cdot \left(1+ \frac{ C \cdot \max(N - freq(fea_{id}),\,0) }{N} \right)$$
Here C is the penalty multiplier and N the feature's minimum threshold, both hyperparameters; freq(feaid) is the feature's current frequency (occurrences in the current increment plus the total historical count).
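The formula itself is a one-liner; a sketch with illustrative hyperparameter values (C, N, and the base coefficient here are made up for the example):

```python
def dynamic_l1(base_l1, C, N, freq):
    """Per-feature L1 coefficient from the empirical formula above:
    low-frequency features get a stronger lasso; features at or above
    the threshold N keep the base coefficient."""
    return base_l1 * (1 + C * max(N - freq, 0) / N)

base = 0.01
rare = dynamic_l1(base, C=10, N=100, freq=2)      # heavily penalized
frequent = dynamic_l1(base, C=10, N=100, freq=100)  # base L1 unchanged
print(rare > frequent)
```

A rare feature must earn a large enough gradient to survive its inflated lasso, while established features are regularized normally, which is exactly the "soft" replacement for a hard frequency cutoff.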

We tried the dynamic L1 scheme in the online environment. On top of the original hard-admission scheme, the online ABTest showed a relative click-through-rate gain of 1.2% and a relative effective-viewing-rate gain of 1.1%.

5. Online effects

Beyond the gains from individual modules, launching the full real-time incremental scheme also achieved gratifying results. Combining the sample-attribution processing, offline-model warm restart, and feature-admission schemes above, in the homepage live module recommendation scenario we achieved a 24-day average relative conversion-rate gain of +5.219% and a 24-day average relative click-through-rate gain of +6.575%.

We also tested different model update frequencies, as shown in the figure below: ABTest group t1 is the offline daily-updated model, with the model file replaced once a day; group t2 updates every 2 hours, training incrementally every two hours; group t8 updates every 15 minutes, training incrementally every 15 minutes. Across many tests we found that the faster the model updates, the better the effect and the better the stability.

The comparison of online ranking results before and after the experiment is shown below. Some anchors that could not enter the top 3 under the base model are discovered and ranked near the front by the real-time incremental model. This confirms what we said in Section 2.2: a real-time model captures new global data patterns faster and discovers new trends and correlations.

6. Summary and Outlook

The live recommendation business has scene characteristics of its own, different from other businesses: it recommends not only an Item but also a Status, so it needs faster, higher, stronger recommendation algorithms to support the business. This is our first article on real-time incremental learning for the Cloud Music live business, sharing our experience implementing it and solving the problems a real-time model brings. Next, we will continue toward faster, higher, stronger recommendation algorithms, keep exploring user growth, user payment, and anchor growth, and strive to deliver a better user experience and better online results in service of the business.


This article is published by the NetEase Cloud Music technical team; any form of reprinting without authorization is prohibited. We recruit for all kinds of technical positions year-round. If you are considering a change and happen to love Cloud Music, join us: staff.musicrecruit@service.netease.com
