Introduction | The recall module faces a recommendation pool of hundreds of millions of items, so the candidate set is very large. Because the downstream ranking modules act as a safety net, recall does not need to be highly accurate, but it must guarantee coverage and low latency. Today this is mostly implemented as multi-channel recall: the channels can be computed in parallel, and each channel's strengths complement the others'. Recall channels fall into two main types: non-personalized and personalized.

1. The overall architecture of the recommendation algorithm

(1) The significance of the recommendation algorithm

With the rapid development of the Internet over the past decade or so, both user scale and content scale have grown explosively. Products with more than 100 million daily active users are no longer news, and with the popularity of UGC production, platforms with content libraries in the billions are common. How to let a massive user base find what they like within massive content, and how to get massive content accurately consumed by massive users, has always been a core problem for every company. Against this background, search systems and recommender systems emerged. A search system mainly solves the problem of users actively looking for content of interest, i.e. active consumption; a recommender system mainly solves the problem of pushing content to suitable users, i.e. passive consumption. Together they attract both users and content, serving as the intermediary that matches the two. The recommender system occupies a core position in every company, and its significance is mainly as follows:

  • On the user side, timely and accurately push personalized content of interest to users, and constantly discover and cultivate users' potential interests, meet user consumption needs, and improve user experience, thereby enhancing user activity and retention.
  • On the content side, as a traffic distribution platform, it has positive feedback and stimulation capabilities for producers (such as UGC authors, e-commerce sellers, etc.). By supporting potential small and medium producers, it can promote the prosperity and development of the overall content ecology.
  • On the platform side, the recommender system is crucial to the flow and efficiency of content distribution. Improving user experience improves retention and thus daily active users. Improving user conversion and traffic efficiency improves core indicators such as order volume on e-commerce platforms and per-capita time spent on content platforms. Increasing the depth of user consumption grows the platform's overall traffic, laying the foundation for commercialization goals (such as advertising) and improving core indicators such as ARPU (average revenue per user). The recommender system is closely tied to many of a company's core metrics and exerts strong pull on them, which makes it highly significant.

(2) Basic module of recommendation algorithm

At present, given compute and storage constraints, fully end-to-end recommendation over the whole corpus is infeasible. Generally, a recommender system is divided into the following main modules:

  • Recommendation pool: Generally, rules are used to select some items from the overall material library (which may contain billions or even tens of billions of items) into the recommendation pool, which is then refreshed periodically by replacement rules. For example, an e-commerce platform may build its pool based on 30-day sales and the item's price tier within its category, and a short-video platform may build its pool based on publication time and 7-day play counts. The recommendation pool is generally built offline on a schedule.
  • Recall: Selects thousands of items from the recommendation pool and passes them to the downstream ranking modules. Since the candidate set is very large and the output generally must be produced online, the recall module must be lightweight, fast, and low-latency. With the ranking modules downstream as a safety net, recall does not need to be highly accurate, but it must not miss relevant items (especially the recall module in a search system). The multi-channel recall paradigm is now standard, divided into non-personalized and personalized recall; personalized recall in turn includes content-based, behavior-based, and feature-based methods.
  • Coarse ranking (pre-ranking): Takes the recall results and selects thousands of items to send to fine ranking. It can be understood as a filtering round before fine ranking that reduces fine ranking's load. Sitting between recall and fine ranking, it must balance accuracy against low latency, so its models cannot be overly complex either.
  • Fine ranking: Takes the coarse-ranking results and scores and sorts the candidate set. It must maximize scoring accuracy within the allowed latency budget. It is the pivotal module of the whole system, and also the most complex and most heavily researched. Building a fine-ranking system generally involves three parts: samples, features, and models.
  • Re-ranking: Takes the fine-ranking results and fine-tunes them based on operational strategy, diversity, context, and so on. For example, boosting beauty products during the March 8 Women's Day promotion, and scattering items of the same category, same cover image, or same seller to protect the user experience. Re-ranking contains many rules, but there are also many model-based schemes to improve its effect.
  • Blending: When multiple lines of business want exposure in the same feed, their results must be blended. For example, inserting ads into the recommendation feed, or inserting image-text posts and banners into a video feed. Blending can be implemented with rule-based strategies (such as fixed ad slots) or with reinforcement learning.
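As a rough sketch of the funnel described above (all stage names and candidate sizes here are hypothetical placeholders, not anyone's production numbers), each stage takes the upstream candidates and returns a smaller, ordered subset:

```python
# Hypothetical sketch of the recommendation funnel: pool -> recall -> coarse
# ranking -> fine ranking -> re-ranking. Each stage shrinks the candidate set.
from typing import Callable, List

def funnel(pool: List[int], stages: List[Callable[[List[int]], List[int]]]) -> List[int]:
    candidates = pool
    for stage in stages:
        candidates = stage(candidates)
    return candidates

# Toy stages: in practice recall is multi-channel and the ranking stages are
# learned models; here each stage is just a truncation for illustration.
recall = lambda c: c[:5000]                       # thousands out of the pool
coarse = lambda c: c[:500]                        # lightweight model filter
fine   = lambda c: sorted(c, reverse=True)[:50]   # heavy model scoring
rerank = lambda c: c[:10]                         # diversity / business rules

feed = funnel(list(range(100_000)), [recall, coarse, fine, rerank])
```

The point of the structure is that each stage only needs to be as accurate as its candidate volume allows: cheap and broad upstream, expensive and precise downstream.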

A recommender system contains many modules, and papers on it emerge in an endless stream, so it is fairly complex. The most important thing in mastering recommendation algorithms is to sort out the overall architecture and the big picture: know what each module does, what its limitations and open problems are, and which methods can optimize it, and then use that big picture to connect and integrate the modules, so as not to get lost in a single detail. When reading a paper, first understand what problem it is designed to solve and what prior solutions existed, then understand how it solves the problem and what improvements it offers over other approaches. This article mainly lays out the big picture of recommendation algorithm architecture, to help readers grasp the whole and serve as an outline.

2. Recall

(1) Multiple recall

The recall module faces a recommendation pool of hundreds of millions of items, so the candidate set is very large. Because the downstream ranking modules act as a safety net, recall does not need to be highly accurate, but it must guarantee coverage and low latency. Today this is mostly implemented as multi-channel recall: the channels can be computed in parallel, and each channel's strengths complement the others'. Recall channels fall into two main types: non-personalized and personalized.

Non-personalized recall

Non-personalized recall is independent of the user and can be built offline. It mainly includes:

  • Popular recall: For example, short videos with high view counts (VV) over the past 7 days. The score can be smoothed with CTR and time decay, and items with suspiciously low per-capita watch time (possible click fraud) can be filtered out. Items with many likes or favorites can also be selected. This part can be implemented mostly with rules. Since popular items easily trigger the Matthew effect, if popular recall takes too large a share of the merged channels, some suppression should be considered.
  • High-efficiency recall: For example, short videos with high CTR, high completion rate, and high per-capita watch time. Such items are highly efficient, but they may be newly published, with little historical VV, and likes take time to accumulate, so they may not make it into the popular recall.
  • Operational-strategy recall: For example, operations-curated category lists, film lists, the latest items on the shelves, and so on.
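The popular-recall scoring described above can be sketched as smoothed CTR times a time decay. All parameters here (prior CTR, prior strength, half-life) are hypothetical illustrations, not values from the article:

```python
import math

def popularity_score(clicks, views, age_days,
                     prior_ctr=0.05, prior_strength=100, half_life=7.0):
    """Smoothed CTR with exponential time decay (illustrative parameters)."""
    # Bayesian-style smoothing: items with few views are pulled toward the
    # prior CTR, so a 1-click / 1-view item does not get CTR = 1.0.
    smoothed_ctr = (clicks + prior_ctr * prior_strength) / (views + prior_strength)
    # Exponential decay: the score halves every `half_life` days.
    decay = math.exp(-math.log(2) * age_days / half_life)
    return smoothed_ctr * decay
```

A week-old video scores half of an otherwise identical fresh one, which keeps the popular channel from freezing on stale head items.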

Personalized recall

Personalized recall depends on the user, delivering a different feed to each person. By construction method, it includes:

  • Content-based: Items with similar content can be selected via user tags, such as the favorite directors, actors, and categories filled in at registration, or via the user's historical behavior as a trigger. Methods include:

Tag recall: such as actors, directors, item tags, categories, etc.

Knowledge Graph.

Multi-modality: for example, items with similar title semantics, similar cover images, or similar video-understanding embeddings.

Generally the inverted index is built offline; online, a user tag or an item from the user's history is used as the trigger to fetch the corresponding candidates. Building an inverted index from content does not require items to have rich behavior, so it is friendly to cold-start items.
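The offline-index / online-trigger pattern can be sketched in a few lines (tag names and the fetch limit are made up for illustration):

```python
from collections import defaultdict

# Offline: build a tag -> items inverted index from item metadata.
def build_inverted_index(items):
    index = defaultdict(list)
    for item_id, tags in items.items():
        for tag in tags:
            index[tag].append(item_id)
    return index

# Online: user tags or history items act as triggers; union the posting lists.
def retrieve(index, triggers, limit=100):
    seen, out = set(), []
    for t in triggers:
        for item in index.get(t, []):
            if item not in seen:
                seen.add(item)
                out.append(item)
                if len(out) >= limit:
                    return out
    return out

idx = build_inverted_index({1: ["comedy", "nolan"], 2: ["comedy"], 3: ["drama"]})
```

Because the index is keyed on content rather than behavior, a brand-new item appears in the candidates as soon as its tags are ingested.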

  • Behavior-based: Mainly userCF and itemCF, both of which measure similarity through behavior and therefore require users or items to have fairly rich behavior. userCF first finds users whose behavior is similar to the target user's and selects items from their behavior sequences as candidates. itemCF finds, for each item, other items with similar behavior and builds an inverted index from them.

When should each be used? The main differences, in my view, are:

userCF requires richer user behavior, while itemCF requires richer item behavior. So for scenarios with strong item freshness requirements, such as news, where many items are cold-start, userCF can be considered.

Generally the number of users far exceeds the number of items in the recommendation pool, i.e. there are far more user vectors than item vectors, so userCF's storage and vector-retrieval pressure is greater than itemCF's. The user vector is also much sparser than the item vector, so its similarity computation is less accurate than itemCF's.
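As a toy illustration of the itemCF side (the co-occurrence counting and cosine-style normalization are a standard textbook formulation, not this article's production code):

```python
import math
from collections import defaultdict

def item_cf(user_histories):
    """Build item -> similar-items lists from user behavior (toy itemCF)."""
    co = defaultdict(lambda: defaultdict(int))  # co-occurrence counts
    freq = defaultdict(int)                     # per-item behavior counts
    for items in user_histories.values():
        for i in items:
            freq[i] += 1
            for j in items:
                if i != j:
                    co[i][j] += 1              # i and j consumed by same user
    sims = {}
    for i, row in co.items():
        # Cosine-style normalization penalizes globally popular items,
        # which otherwise co-occur with everything (Harry Potter problem).
        sims[i] = sorted(
            ((j, c / math.sqrt(freq[i] * freq[j])) for j, c in row.items()),
            key=lambda x: -x[1])
    return sims

sims = item_cf({"u1": [1, 2], "u2": [1, 2], "u3": [1, 3]})
```

The `sims` output is exactly the i2i inverted index described above: offline it is materialized, online a history item triggers a lookup.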

What are the disadvantages of collaborative filtering?

Most users interact with only a small number of items, so the user-item behavior matrix is very sparse, and some users have no behavior at all, which hurts the accuracy of vector-similarity computation.

The numbers of users and items are both huge, so storing the behavior matrix is expensive.

The sparse matrix brings another problem: popular head items easily end up similar to most other items, producing the Harry Potter problem and an extremely severe Matthew effect.

Matrix factorization (MF) emerged to solve these problems. It decomposes the user-item behavior matrix into a user matrix and an item matrix, converting an M×N matrix into an M×K matrix and a K×N matrix. Each row of the user matrix is a K-dimensional user vector, and each column of the item matrix is a K-dimensional item vector. Unlike the vectors in CF, which are generated directly from behavior and have a fairly clear meaning, the vectors in MF are called user latent vectors and item latent vectors. MF solves the problem of overly sparse CF vectors, and since K is much smaller than M and N, the high-dimensional sparse vectors become low-dimensional and dense, greatly reducing storage pressure.

How is MF implemented? It can be computed via SVD or via gradient descent. Since SVD has significant limitations (classic SVD requires a dense matrix and is computationally expensive), gradient-descent methods dominate, which is why MF is also called model-based CF.
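A minimal gradient-descent MF sketch, in the spirit described above (the learning rate, regularization, and dimensions are arbitrary toy values):

```python
import random

def mf_sgd(ratings, n_users, n_items, k=8, lr=0.01, reg=0.01, epochs=100):
    """Factorize a sparse behavior matrix R ~ P @ Q^T by SGD (toy sketch).

    ratings: list of (user, item, value) for OBSERVED entries only,
    which is what makes this tractable on a huge sparse matrix.
    """
    random.seed(0)
    P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)  # step toward lower error
                Q[i][f] += lr * (err * pu - reg * qi)  # with L2 shrinkage
    return P, Q
```

After training, each row of `P` is a K-dimensional dense user latent vector and each row of `Q` an item latent vector, replacing the sparse M×N behavior matrix.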

MF is still essentially behavior-based: it does not make use of the various features of users and items, such as user gender and age, and thus loses a great deal of information. LR and FM emerged to address this.

  • Feature-based: Uses features such as the user's age, gender, device, location, and behavior sequence, and the item's listing time, video duration, historical statistics, and so on. Because they exploit the most information, feature-based recall methods generally perform best and are friendly to cold start; they have been the research focus in recent years. They mainly divide into:
  • Linear models: such as FM, FFM, etc., not expanded on here.
  • Deep models: DNN-based DSSM twin towers (EBR, MOBIUS); YouTubeDNN (also called DeepMatch); user-sequence-based MIND, SDM, CMDM, BERT4Rec; GNN-based Node2Vec, EGES, PinSage, etc. This is a big topic and will not be expanded item by item.
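To make the twin-tower structure concrete, here is a forward-pass-only sketch (dimensions, weights, and inputs are random placeholders; a real system trains the towers with a sampled-softmax or contrastive loss, which is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(x, w1, w2):
    """One tower: a tiny MLP mapping raw features to a normalized embedding."""
    h = np.maximum(x @ w1, 0.0)             # ReLU hidden layer
    e = h @ w2
    return e / (np.linalg.norm(e) + 1e-8)   # L2-normalize -> cosine scoring

# Hypothetical sizes: 16-d raw features, 32-d hidden, 8-d embedding.
wu1, wu2 = rng.normal(size=(16, 32)), rng.normal(size=(32, 8))
wi1, wi2 = rng.normal(size=(16, 32)), rng.normal(size=(32, 8))

user_emb = tower(rng.normal(size=16), wu1, wu2)   # user tower
item_emb = tower(rng.normal(size=16), wi1, wi2)   # item tower
score = float(user_emb @ item_emb)                # relevance = cosine sim
```

The key design point is that the two towers never interact until the final dot product, so item embeddings can be precomputed offline and served through ANN retrieval.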

There are two ways to use it online:

  • Vector retrieval: Use the generated user embedding to find similar item embeddings via nearest-neighbor search, and thereby the corresponding items. Retrieval methods include hash bucketing (LSH), HNSW, and others.
  • i2i inverted index: Use item embeddings to find other items similar to each item and build an i2i index offline. Online, items in the user's historical behavior serve as triggers to look up candidates in the inverted index.
  • Social network: Find other people through the social graph (friends' likes, follow relationships, address-book contacts, etc.) and recall the items they engaged with. The assumption is that items your friends like, you will probably like too.
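The vector-retrieval step above can be illustrated with the exact top-k baseline that ANN methods such as LSH and HNSW approximate (sizes here are arbitrary; real systems use an ANN library because exact search over hundreds of millions of items is too slow):

```python
import numpy as np

def topk_items(user_emb, item_embs, k=5):
    """Exact inner-product top-k: the baseline that ANN search approximates."""
    scores = item_embs @ user_emb            # one score per item
    idx = np.argpartition(-scores, k)[:k]    # top-k indices, unordered
    return idx[np.argsort(-scores[idx])]     # order them by score, descending

rng = np.random.default_rng(1)
items = rng.normal(size=(10_000, 32)).astype(np.float32)  # item embedding table
user = rng.normal(size=32).astype(np.float32)             # online user embedding
top = topk_items(user, items, k=10)
```

ANN structures trade a small amount of recall accuracy for a large drop in latency relative to this brute-force scan.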

(2) Recall optimization

Those are the main channels of multi-channel recall. What are the main problems in recall? In my view, chiefly:

Negative-sample construction: "Recall is the art of samples; ranking is the art of features." This saying is quite apt. Positive samples for recall can be taken from exposed-and-clicked items, but how should negatives be chosen? Exposed-but-unclicked samples? Definitely not, for two reasons:

Exposed-but-unclicked items have already won the competition through the existing recall, coarse-ranking, and fine-ranking modules, which indicates their quality and relevance are still decent; they are clearly unsuitable as recall negatives.

The SSB problem: recall serves the entire recommendation pool, but exposed items are only a small subset of it. Building negatives from them alone leads to severe SSB (sample selection bias), making the model deviate badly from the serving distribution.

Given this, we can randomly sample items from the recommendation pool as negatives, but another problem arises: compared with positives, randomly sampled items are generally too easy to distinguish, so hard negatives are needed to push the recall model further. Constructing hard negatives is one of the more active directions in current recall research, mainly including:

  1. With the help of the model: For example, Facebook's EBR takes items whose recall score in the previous model version sits in the middle range, roughly ranks 101 to 500. Such items are neither at the top nor at the very tail, retain some relevance to the positives, and are therefore hard to distinguish. EBR, Mobius, PinSage, and others share this idea. The difficulty is defining exactly which items are somewhat similar but not too similar, which may require several attempts.
  2. Business rules: For example, to select items with rules of the same category and the same price range, you can refer to the practice of Airbnb's paper.
  3. in-batch: The positive samples of other users in the batch are used as negative samples of this user.
  4. Active learning: The recall results are manually reviewed, and the bad cases are used as negative samples.
    Generally hard negatives and easy negatives are mixed at some ratio, such as 1:100, as the recall negatives; easy negatives still make up the vast majority.
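The mixing scheme just described can be sketched as follows (the 1:100 ratio comes from the text; the function name, pool structure, and sampling details are illustrative assumptions):

```python
import random

def sample_negatives(pool, hard_candidates, positives, n_total=101, n_hard=1):
    """Mix hard and easy negatives (e.g. 1 hard : 100 easy) for one positive."""
    random.seed(42)  # deterministic for the example
    banned = set(positives)
    # Hard negatives: e.g. mid-ranked items from a previous recall model,
    # or same-category / same-price-band items (rule-based).
    hard = random.sample([i for i in hard_candidates if i not in banned], n_hard)
    banned |= set(hard)
    # Easy negatives: uniform over the WHOLE pool, so the negative
    # distribution matches serving and mitigates SSB.
    easy = []
    while len(easy) < n_total - n_hard:
        c = random.choice(pool)
        if c not in banned and c not in easy:
            easy.append(c)
    return hard + easy

negs = sample_negatives(list(range(1000)), list(range(100, 120)), {5})
```

Sampling the easy negatives from the full recommendation pool, not just exposed items, is exactly what keeps the training distribution close to the retrieval distribution.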
  • SSB problem: Recall serves the entire recommendation pool, whose item count is huge, so some amount of negative sampling is necessary, and with it comes a fairly serious sample selection bias. The selected negatives should therefore represent the entire recommendation pool as well as possible, to improve the model's generalization. The core issue is still negative sampling, especially hard negatives. Alibaba's ESAM tries transfer learning: through a loss regularizer, a model trained on exposed samples can be applied to unexposed items, mitigating SSB. Most other methods start from negative sampling itself, combining easy negatives and hard negatives; EBR, Airbnb Embedding, Mobius, PinSage, and others all include hard-negative optimizations.
  • Target inconsistency: Recall still optimizes for similarity, whether based on content, behavior, or features, but fine ranking and the final business metrics care about conversion, and similarity does not guarantee good conversion. In the extreme, recalling only short videos similar to the user's recent plays clearly does not yield high overall conversion. Baidu's Mobius introduces CPM into the recall phase, using the business objective as a truncation step after vector retrieval, which filters out items with high relevance but low conversion. Alibaba's TDM restructures ANN retrieval with a max-heap tree, greatly reducing retrieval computation so that complex models such as DIN can be used, aligning recall with ranking in structure (though the samples still differ greatly); this can be seen as an optimization for this class of problem.
  • Competition problem: The channels of multi-channel recall are ultimately merged and deduplicated; if the overlap between channels is too high, the extra channels are pointless, and a new channel in particular must provide good incremental gain over the historical ones, so overlap and competition issues are unavoidable. Meanwhile, a channel's candidates may fail to surface in fine ranking: items that are rarely recalled have few exposure samples, score low in fine ranking, and thus may never be shown. This push-and-pull between recall and fine ranking also needs to be alleviated through full-link optimization.

About the Author

Xie Yangyi

Applied algorithm researcher at Tencent.

