Image credit: https://unsplash.com/photos/Q1p7bh3SHj8
Author: Zuo Fengyi
1. Project Background
Building on its core music business, Cloud Music has developed many innovative businesses. Whether music, live streaming, or podcasts, every business faces two cold-start problems:
- How to distribute content to new (or inactive) users
- How to distribute new (or unpopular) content to users
Each business has its own solutions for cold start. One common method is to introduce the user's behavior data from other scenarios; for example, for live-streaming cold start, we introduce the user's music preferences.
As the business continues to develop, more and more scenarios want a solution that is effective, cheap to develop, and broadly applicable, suitable for solving cold start in the early and middle stages of Cloud Music's various businesses.
We investigated various cold-start solutions, such as cross-domain recommendation, stratifying new and old users, meta-learning, and relationship-chain transfer. In the end, we found that transfer based on the relationship chain is the best fit for a general-purpose foundation, and many Internet products (such as WeChat Channels and Pinduoduo) also use it as an important cold-start method.
But Cloud Music has no explicit friend relationship chain. What should we do? The implicit relationship chain was born in this context.
So how do we build a user's implicit relationship chain?
Cloud Music takes pan-music content as its core and connects people through content, and users exhibit behavioral preferences over that content. We can therefore combine users' behavior data across Cloud Music's core business scenarios (for user privacy and data security, the relevant data has been desensitized; this will not be restated for the user data mentioned later in this article) to learn a vector representation of each user, and then use vector similarity to describe the strength of the implicit relationship between users. This yields a representation of user preferences while further protecting user privacy.
Based on the above background, the main goal of this project is to learn user vector representations by integrating users' full-ecology behavior data within Cloud Music, and thereby build the users' implicit relationship chain. On top of the implicit relationship chain, we implement different downstream recommendation tasks, such as user cold start, content cold start, similar recall, and seed-user lookalike.
2. Project Challenges
The main challenges encountered by the project are:
First, the large scale of data
The large scale of data is mainly manifested in three aspects: many businesses, many scenarios, and many types of user interaction behaviors.
In response, we thoroughly investigated each business's data. The total volume of user behavior data is very large; given model-training efficiency and the timeliness of user interests, we do not need the full data. We therefore process the business data in two ways:
- Use each business's core data. For example, in the music business, we use the user's desensitized music preference behavior
- Desensitize, clean, restrict, and filter the data, which effectively reduces the data volume
Through the above processing, we obtained relatively high-quality user behavior data.
Second, how to model
The users' implicit relationship chain is not a specific business scenario: it lacks clear labels and a clear optimization objective, so we cannot directly model it with the ranking models used in the business. In addition, user behavior is highly heterogeneous, and the graph formed by users and content (songs, live streams, podcasts, etc.) is heterogeneous, containing not only heterogeneous nodes (users and various kinds of content) but also heterogeneous edges (multiple interactions between users and content, such as clicks and plays).
We will focus on how to model this in Part 3.
Third, how to evaluate
As mentioned earlier, we cannot directly use a ranking model, which means we cannot directly use ranking metrics such as AUC. After analysis and research, we combined the following evaluation methods:
- Qualitative assessment
  - Case analysis. This method judges whether the relationship chain is reliable by analyzing the overlap of two users' behavior sequences, e.g., the size of their intersection or the Jaccard coefficient (a minimal sketch follows this list). It is intuitive and somewhat interpretable, but has a real problem: if the model scores two users' similarity at 0.8, must their behavior sequences overlap heavily? Not necessarily. For example, user A has liked songs S1 and S2 while user B has liked songs S3 and S4, and these songs are closely related (they often co-occur in other users' behavior sequences, come from the same artist, etc.); although the two users' behaviors do not overlap, we can still consider their interests similar.
  - Visual analysis. After obtaining the users' vector representations, we can visualize the embeddings with tools such as TensorFlow's Embedding Projector, and observe whether users with the same label cluster together and whether users with little label overlap are separated.
- Quantitative assessment
  - Online experiments. Recall online via u2u, u2u2i, and similar methods, and evaluate by online revenue. This is direct and highly credible, but experiments are costly; an offline evaluation should come first to decide whether an online experiment is warranted.
  - Offline evaluation. Feed the user vector as a feature into a ranking model and evaluate the gain in offline AUC or loss; or generate a u2i recommendation list offline and evaluate its precision and recall. Both offline methods are cheaper than online experiments; although they do not evaluate u2u directly, they are feasible proxies.
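As a small illustration of the case-analysis metric above, a minimal Jaccard coefficient over two users' behavior item sets:

```python
def jaccard(a, b):
    """Jaccard coefficient between two users' sets of interacted items."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Example: two users sharing two of four distinct liked songs.
print(jaccard(["S1", "S2", "S3"], ["S2", "S3", "S4"]))  # 0.5
```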
Fourth, how to provide external services
The users' implicit relationship chain is a piece of infrastructure. To let business scenarios access it quickly, we need to expose it as an online service, which raises a question: how do we perform millisecond-level similarity retrieval over billion-scale user vectors?
Drawing on industry vector retrieval engine frameworks (such as Faiss, Milvus, and Proxima), Cloud Music developed its own engine, Nsearch, which achieves high-performance similarity retrieval over large-scale vectors and supports high concurrency, low latency, dynamic extension of complex algorithms, incremental import, and more.
In addition, to support the customized needs of different business scenarios, we designed a service architecture that provides a unified interface for businesses and supports rapid access and iteration.
We will focus on external services in Part 4.
3. Technology Evolution
To build users' implicit relationship chains, we first need to generate vector representations of users. To this end, we researched and implemented a variety of algorithms, put them into practice in multiple business scenarios, and made many useful attempts.
As shown in the figure, we divide the research process into five stages according to technical depth and modeling method:
The first stage is the exploration stage, in which we investigated the SimHash algorithm. SimHash is originally an algorithm for computing similarity between texts; in our case, a user's behavior sequence is treated as a piece of text.
The second stage is the initial stage, in which we investigated the item2vec model. item2vec is derived from word2vec; its basic principle is to maximize the co-occurrence probability of the context words that appear near a center word.
The third stage is the optimization stage, in which we made some improvements to item2vec, such as replacing global negative sampling with constrained negative sampling and using the user_id as a global context. These optimizations further strengthen the representational ability of the vectors.
The fourth stage is the development stage, in which we investigated the heterogeneous graph model metapath2vec, which can model multiple kinds of entity relationships with strong generalization.
The fifth stage is the innovation stage. The original metapath2vec does not incorporate side information and does not generalize strongly enough, so we are making improvements and optimizations.
Below, we introduce SimHash, Item2vec, and MetaPath2vec.
SimHash
The SimHash algorithm computes the similarity between texts. It is the algorithm Google uses for massive-scale text deduplication, and it is also a locality-sensitive hashing algorithm: if two strings are similar to some degree, that similarity is still preserved after hashing.
The basic principle of SimHash is to map the original text to an n-bit binary string and then measure the similarity of the original content by comparing the Hamming distance of the binary strings. The basic steps are:
- Keyword extraction. Tokenize the document, remove stop words, and assign each word a weight (such as its occurrence count or TF-IDF value).
- Hashing. Map each word to a hash value (a binary string) through a hash algorithm.
- Weighting. Weight each word's binary string by the word's weight: positions holding a 1 contribute +weight, and positions holding a 0 contribute -weight.
- Merging. Accumulate the weighted value sequences of all words, position by position, into a single value sequence.
- Dimensionality reduction. Convert the merged value sequence into a 01 string, which is the final SimHash value. The rule: a position whose value is greater than 0 becomes 1; otherwise it becomes 0.
The figure shows a simple worked example.
After SimHash processing, a text string becomes a string of 0s and 1s, and the Hamming distance of two such strings judges the similarity of the two texts. Two texts are generally considered similar when their Hamming distance is less than 3.
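To make the steps concrete, here is a minimal sketch of the algorithm, assuming token weights are already computed and using Python's hashlib in place of a production hash function:

```python
import hashlib

def simhash(weighted_tokens, n_bits=64):
    """Compute an n-bit SimHash from {token: weight} pairs."""
    acc = [0.0] * n_bits
    for token, weight in weighted_tokens.items():
        # Hash each token to an n-bit value.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:n_bits // 8], "big")
        for i in range(n_bits):
            # Weighting: bit 1 contributes +weight, bit 0 contributes -weight.
            acc[i] += weight if (h >> i) & 1 else -weight
    # Dimensionality reduction: positive positions become 1, the rest 0.
    return sum(1 << i for i, v in enumerate(acc) if v > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Two users' id sequences, weighted by behavior strength (toy data).
u1 = simhash({"song_1": 3.0, "song_2": 1.0, "podcast_9": 0.5})
u2 = simhash({"song_1": 3.0, "song_2": 1.0, "podcast_7": 0.5})
print(hamming_distance(u1, u2))  # a small distance suggests similar users
```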
So how do we use SimHash to build the users' implicit relationship chain? We aggregate each user's behavior sequences from the various business scenarios through some rules to obtain one id sequence per user, treat each id as a word, assign each id a weight, and then run the SimHash process above to get each user's SimHash signature (a string of 0s and 1s).
It looks like the problem is solved. In practice, however, we need to retrieve which users are similar to a given user, and comparing against the entire database is very inefficient. Since we generate a 64-bit 01 string for each user, the problem of building the implicit relationship chain can be abstracted as:
Given 1 billion distinct 64-bit 01 strings and a query 64-bit 01 string, quickly find all strings among the billion whose Hamming distance to the query is at most 3 (i.e., reasonably similar).
How do we retrieve efficiently? We can divide the 64-bit 01 string into four 16-bit segments. By the pigeonhole principle, if two strings are similar (Hamming distance within 3), at least one pair of corresponding segments must be equal.
Based on this idea, we can design the retrieval process as follows:
Storage: traverse all SimHash fingerprints, and for each fingerprint:
1) Split the 64-bit SimHash fingerprint into four 16-bit segments
2) Store each segment in a KV database or inverted index, e.g., segment 1 in table 1 and segment 2 in table 2, where the key is the 16-bit 01 string and the value is the set of fingerprints sharing that key
Retrieval: for the SimHash fingerprint to be retrieved:
1) Split the SimHash fingerprint into four 16-bit segments
2) For each segment, run an equality query against the corresponding table; by the pigeonhole principle above, the results are the candidate (suspected similar) fingerprints
3) Compute the Hamming distance between the query fingerprint and each candidate to decide whether they are actually similar
The overall process can be represented by the following diagram.
Suppose the database holds 2^30 (about 1 billion) fingerprints and that they are evenly distributed. Each 16-bit segment (2^16 possible values) then returns at most 2^30 / 2^16 = 16,384 candidates, and the four segments yield at most 4 * 16,384 = 65,536 candidates in total. So where a full scan needed about 1 billion comparisons, the result can now be obtained with at most 65,536 comparisons, greatly improving retrieval efficiency.
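A minimal in-memory sketch of this segment index (in production the four tables would live in a KV store or inverted index; the names here are illustrative):

```python
from collections import defaultdict

SEGMENTS, SEG_BITS = 4, 16

def split(fp):
    """Split a 64-bit fingerprint into four 16-bit segments."""
    return [(fp >> (i * SEG_BITS)) & 0xFFFF for i in range(SEGMENTS)]

# Storage: one table per segment position, keyed by segment value.
tables = [defaultdict(set) for _ in range(SEGMENTS)]

def store(fp):
    for i, seg in enumerate(split(fp)):
        tables[i][seg].add(fp)

def query(fp, max_dist=3):
    # Pigeonhole: any fingerprint within distance 3 shares >= 1 segment.
    candidates = set()
    for i, seg in enumerate(split(fp)):
        candidates |= tables[i].get(seg, set())
    # Verify each candidate with the exact Hamming distance.
    return [c for c in candidates if bin(fp ^ c).count("1") <= max_dist]
```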
SimHash was the first algorithm we investigated. It is simple, fast, and needs no training, but it has obvious shortcomings:
- SimHash results carry a certain randomness: two random sequences have some probability of landing close in Hamming distance. Reducing that probability requires ensembling, which is relatively costly.
- SimHash represents user similarity at a coarse granularity (it computes a Hamming distance, which is an integer).
- SimHash cannot learn the contextual relationships in user behavior sequences, and cannot model i2i, u2i, etc.
Item2Vec
In 2016, Microsoft published the paper Item2Vec: Neural Item Embedding for Collaborative Filtering. Inspired by NLP's use of embedding algorithms to learn latent word representations, the authors adapted Google's word2vec to i2i similarity computation in recommendation scenarios. The main idea: treat each item as a word in word2vec, treat a user's behavior sequence as a sentence, take co-occurring items as positive samples, and draw negative samples according to the items' frequency distribution.
This paper is a very practical application of word2vec to recommendation. The item2vec method is simple and easy to use, and it greatly broadens word2vec's range of application: from NLP straight to recommendation, advertising, search, and any other field that produces sequences.
Item2Vec's negative sampling has a problem: the negative samples are too easy. For example, if a user listens to a Cantonese song and we sample negatives globally, the negative may be an English song, which weakens the representational ability of the vectors. Instead, we can perform constrained negative sampling, which improves representational power, as shown in the figure and sketched below:
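A minimal sketch of the idea: restrict negatives to the positive item's bucket (e.g., language or genre). The bucketing here is a hypothetical stand-in for whatever constraint a business chooses:

```python
import random
from collections import defaultdict

# Hypothetical buckets: group candidate items by an attribute such as
# language or genre, so sampled negatives are "hard" for the model.
items_by_bucket = defaultdict(list)

def register_item(item_id, bucket):
    items_by_bucket[bucket].append(item_id)

def sample_negatives(pos_item, bucket_of, k=5):
    """Draw k negatives from the same bucket as the positive item."""
    pool = [i for i in items_by_bucket[bucket_of[pos_item]] if i != pos_item]
    return random.choices(pool, k=k) if pool else []

# Usage: register_item("song_1", "cantonese"); a Cantonese positive then
# draws Cantonese negatives instead of arbitrary global ones.
```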
Note that Item2Vec produces item vectors, so how do we get user vectors? After generating the item vectors, we can mean-pool the vectors of the items a user has historically interacted with to obtain the user's vector representation. With user vectors in hand, we can use a high-dimensional vector retrieval engine (such as Cloud Music's self-developed Nsearch, or Facebook's Faiss) to quickly retrieve similar vectors.
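A sketch of this pipeline, assuming Faiss as the retrieval engine and cosine similarity via inner product over normalized vectors (the data below is random placeholder data):

```python
import numpy as np
import faiss  # pip install faiss-cpu

def user_vector(history, item_vecs):
    """Mean-pool the vectors of the items a user interacted with."""
    v = np.stack([item_vecs[i] for i in history]).mean(axis=0)
    return v / np.linalg.norm(v)  # normalize so inner product = cosine

# Build an index over all user vectors (placeholder data).
dim = 64
user_ids = [f"user_{i}" for i in range(1000)]
matrix = np.random.rand(len(user_ids), dim).astype("float32")
faiss.normalize_L2(matrix)

index = faiss.IndexFlatIP(dim)  # exact inner-product search
index.add(matrix)

# u2u recall: the top-k users most similar to the first user.
scores, idx = index.search(matrix[:1], 10)
print([user_ids[i] for i in idx[0]])
```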
Besides obtaining user vectors indirectly as above, we can also borrow from Doc2Vec: when constructing training sequences from user-item interaction histories, we can add the user id as a global context to training, learning user and item vectors simultaneously.
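One way to realize this is gensim's Doc2Vec, with the user id as the document tag so it acts as a global context over the whole behavior sequence (a sketch; the hyperparameters are placeholders):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each user's behavior sequence is a "document" tagged with the user id,
# so the user id is learned jointly with the item vectors.
corpus = [
    TaggedDocument(words=["song_1", "song_2", "podcast_9"], tags=["user_1"]),
    TaggedDocument(words=["song_1", "song_3"], tags=["user_2"]),
]

model = Doc2Vec(corpus, vector_size=64, window=5, min_count=1, epochs=20)
user_vec = model.dv["user_1"]   # user embedding (global context)
item_vec = model.wv["song_1"]   # item embedding
print(model.dv.most_similar("user_1", topn=1))
```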
MetaPath2Vec
Item2Vec is a method for homogeneous networks; in practice we modeled each business separately and fused the results afterward. MetaPath2Vec, by contrast, is a model proposed by Yuxiao Dong et al. in 2017 for learning node representations of a Heterogeneous Information Network (HIN).
In the paper, a heterogeneous network is defined as a graph G = (V, E, T), where V is the node set, E the edge set, and T the set of node and edge types. Each node v and edge e has a corresponding mapping function, f(v): V -> T_V and g(e): E -> T_E, where T_V and T_E are the node and edge type sets and |T_V| + |T_E| > 2, i.e., the graph has more than two node and edge types in total.
MetaPath2Vec learns low-dimensional representations of graph nodes spanning different node types and different edge types. Its core idea has two parts:
- Meta-path-based random walks
- Heterogeneous Skip-Gram
Below, we illustrate with a user-song-artist network. As shown in the figure, the network has three node types, distinguished by color.
First, what is a meta-path? A meta-path is a composite path formed from selected node types in the graph, carrying some business meaning. For example, "user-song-user" means two users have both interacted with a certain song. We usually design meta-paths symmetrically so that random walks can repeat them in a loop; for example, walking by the meta-path "user-song-user" can sample the sequence "user1-song1-user2-song2-user3".
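A minimal sketch of such a meta-path-guided random walk over an adjacency-list graph (graph and node_type are hypothetical inputs; a symmetric meta-path repeats with period len(path) - 1):

```python
import random

def metapath_walk(graph, node_type, start, metapath, walk_length):
    """Random walk guided by a symmetric meta-path such as
    ["user", "song", "user"], repeating as user-song-user-song-...
    `graph[v]` is v's neighbor list; `node_type[v]` is v's type."""
    cycle = metapath[:-1]  # a symmetric path repeats with this period
    walk = [start]
    while len(walk) < walk_length:
        next_type = cycle[len(walk) % len(cycle)]
        # Only step to neighbors whose type matches the meta-path.
        candidates = [n for n in graph[walk[-1]] if node_type[n] == next_type]
        if not candidates:
            break
        walk.append(random.choice(candidates))
    return walk

# Walks like ["user1", "song1", "user2", "song2", "user3"] are then fed
# to the heterogeneous Skip-Gram as training sequences.
```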
Compared with ordinary random walks, meta-path-based random walks have the following advantages:
- They avoid drifting toward node types that occur with high frequency
- They avoid drifting toward overly concentrated nodes (i.e., nodes with high degree)
- They capture the connections between different node types, fusing the semantics of different node types into the heterogeneous Skip-Gram model
Next, we look at the heterogeneous Skip-Gram, whose optimization objective is to maximize the co-occurrence probability of a node and its heterogeneous context.
The objective function of the heterogeneous Skip-Gram is shown below. The main difference from the ordinary Skip-Gram model is the extra summation over node types, which models the relationship between a node and its heterogeneous neighbors.
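For reference, the objective as given in the metapath2vec paper, where N_t(v) is the set of v's neighbors of node type t and X_v is node v's embedding:

$$\arg\max_{\theta} \sum_{v \in V} \sum_{t \in T_V} \sum_{c_t \in N_t(v)} \log p(c_t \mid v; \theta), \qquad p(c_t \mid v; \theta) = \frac{e^{X_{c_t} \cdot X_v}}{\sum_{u \in V} e^{X_u \cdot X_v}}$$

The metapath2vec++ variant further normalizes the softmax over only the nodes of type t, making negative sampling type-aware as well.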
4. Service Deployment
With user vector representations in hand, we can build the implicit relationship chain service. The overall architecture, shown in the figure, consists of (bottom to top) the algorithm layer, the vector engine layer, and the service layer.
At the algorithm layer, we learn user vector representations through models such as SimHash, item2vec, and metapath2vec. This process also produces vector representations of the content, i.e., content embeddings.
At the vector engine layer, we import the user and content embeddings into the vector engine to build indexes. The vector engine here is Cloud Music's self-developed Nsearch, which meets query requirements such as high-dimensional vector similarity retrieval, high concurrency, and low latency.
At the service layer, we adopt rtrs, a service framework developed by Cloud Music that covers engineering requirements such as dynamic publishing, multi-level caching, and asynchronous processing. When a request arrives, the framework parses the protocol parameters, the recall module loads the corresponding business configuration from the configuration center, and the scenario's business recall is executed.
Through this framework, we can support a variety of implicit relationship chain recall methods, including u2u, u2i, and i2i, and each recall method can be customized to the needs of different business scenarios.
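As a toy illustration of that request flow (parse parameters, load the scenario's config, dispatch to a recall method); all names are hypothetical and do not reflect the real rtrs internals:

```python
RECALL_FNS = {}

def register(name):
    """Decorator registering a recall method under a config name."""
    def deco(fn):
        RECALL_FNS[name] = fn
        return fn
    return deco

@register("u2u")
def recall_u2u(index, user_vec, cfg):
    # Vector-engine lookup: top-k users most similar to the query user.
    _, idx = index.search(user_vec, cfg["topk"])
    return idx[0].tolist()

def handle_request(params, config_center, index):
    # 1) parse protocol parameters, 2) load the scenario's business
    # config, 3) dispatch to the configured recall method.
    cfg = config_center[params["scene"]]
    return RECALL_FNS[cfg["method"]](index, params["user_vec"], cfg)
```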
5. Project Achievements
The implicit relationship chain comes from the business and should ultimately return to the business, serving it and creating value for it.
First, the implicit relationship chain was built from scratch, establishing the ability to serve implicit relationships. In the process, besides the implicit relationships between users, we also built implicit relationship chains between content items and between artists. The figure shows an artist's implicit relationship chain.
Second, the implicit relationship chain currently provides implicit relationship services for 13 application scenarios across 9 businesses, including music, podcasts, and the feed, with good results. Notably, in the strangers' "Listen Together" scenario, the effect improved significantly (per-capita connection duration +9.4%).
Third, the implicit relationship service currently handles a peak QPS above 5,000 with an average latency of 10 milliseconds.
6. Summary and Outlook
The implicit relationship chain is an infrastructure project. Although it is not a business in itself, its goal is the same: to create value for the business. We have achieved results in some business scenarios, but many areas still need improvement:
First, the implicit relationship chain is based on neural network models, whose black-box nature makes them hard to interpret. This keeps the implicit relationship chain out of businesses that require explicit relationships, such as supplying recommendation reasons in user recommendation scenarios. For this, we will introduce a graph database to assist the modeling.
Second, the data value of the implicit relationship chain has not been fully tapped, e.g., KOL mining and community detection.
Third, the model's generalization ability is not strong enough; more side information needs to be added.
Fourth, the model is not robust enough: it is easily biased toward active users and popular content, leaving inactive users and long-tail content under-learned. For this, we will introduce contrastive learning into the modeling.
References
- Charikar, M. S. (2002). "Similarity estimation techniques from rounding algorithms." Proceedings of the 34th Annual ACM Symposium on Theory of Computing, p. 380. doi:10.1145/509907.509965.
- Manku, G. S.; Jain, A.; Das Sarma, A. (2007). "Detecting near-duplicates for web crawling." Proceedings of the 16th International Conference on World Wide Web, p. 141. doi:10.1145/1242572.1242592.
- Barkan, O.; Koenigstein, N. (2016). "Item2vec: Neural item embedding for collaborative filtering." 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1-6.
- Dong, Y.; Chawla, N. V.; Swami, A. (2017). "metapath2vec: Scalable representation learning for heterogeneous networks." Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 135-144.
- Kipf, T. N.; Welling, M. (2017). "Semi-Supervised Classification with Graph Convolutional Networks." ICLR.
- He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. (2020). "LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation." SIGIR.
This article is published by the NetEase Cloud Music technical team. Any form of reproduction without authorization is prohibited. We recruit for technical positions of all kinds year-round; if you are ready for a change and happen to love Cloud Music, join us: staff.musicrecruit@service.netease.com.