Recommendation system study notes (part two)

User portrait

What exactly is a user portrait? It is a vectorized representation of user information, better known as a user profile. A user portrait is not the goal of a recommendation system, but a by-product of a key stage in building one. Constructing a user portrait takes the following two steps.

1 Structuring text

The text we get is usually written in natural language; in jargon, it is "unstructured". A computer, however, can only index, retrieve, vectorize, and compute over structured data. Analyzing text therefore means turning unstructured data into structured data, much like digitizing an analog signal, so that it can be passed to the computer for further computation.
From an article's text, mature NLP algorithms can extract the following kinds of information.

Keyword extraction: the most basic source of tags, which also provides base data for other text analysis. TF-IDF and TextRank are commonly used.
Content classification: assign text to categories in a classification system, using the category to express coarse-grained structured information. FastText is a common tool.
Topic model: learn topic vectors from a large body of existing text, then predict the probability distribution of a new text over those topics. This is very practical, and is in effect a form of clustering. A topic vector is not a tag, but it is also a common component of user portraits. Commonly used open source LDA training tools include Gensim and PLDA.
Embedding: embedded representations can be learned at every level, from words to whole documents. An embedding digs out the semantic information beneath the literal text and expresses it in a limited number of dimensions.
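
As an illustration of the keyword extraction item, here is a minimal TF-IDF sketch in plain Python. The function name tfidf_keywords, the whitespace tokenization, and the three toy documents are assumptions made for this example; a real system would use a proper tokenizer and a mature library.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=2):
    """Return the top_k highest TF-IDF terms for each tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results

# Toy corpus (hypothetical), tokenized by whitespace.
docs = [
    "the movie was a great space movie".split(),
    "the book was a long history book".split(),
    "the game was a fast racing game".split(),
]
keywords = tfidf_keywords(docs, top_k=1)
```

Words that appear in every document ("the", "was", "a") get an IDF of zero, so the term that is both frequent in a document and rare elsewhere becomes its keyword.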

2 Label selection

Once the article text is structured, we have tags (keywords, categories, and so on), topics, and embedding vectors for words. The second step is to pass this structured item information on to the user. A simple, crude way is to directly accumulate the tags of the items the user has acted on.
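
The "accumulate the tags" approach can be sketched as follows. The names build_user_profile and item_tags, and the sample items, are hypothetical; the sketch assumes each item already carries a list of tags from the structuring step.

```python
from collections import Counter

def build_user_profile(acted_item_ids, item_tags):
    """Sum up the tags of every item the user has acted on."""
    profile = Counter()
    for item_id in acted_item_ids:
        # Items without tags contribute nothing to the profile.
        profile.update(item_tags.get(item_id, []))
    return profile

# Hypothetical tagged items.
item_tags = {
    "a1": ["sci-fi", "space"],
    "a2": ["sci-fi", "robots"],
    "a3": ["cooking"],
}
profile = build_user_profile(["a1", "a2"], item_tags)
```

The resulting counter is a sparse vector: only the tags the user has actually touched appear as keys, and the counts serve as weights.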

Content recommendation algorithm

For a content-based recommendation system, the simplest recommendation algorithm is of course to compute similarity. The user portrait is expressed as a sparse vector, and each item has a corresponding sparse vector. Compute the cosine similarity between the two, then rank the candidate items by that similarity.

Cosine similarity
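$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|} = \frac{\sum_{k} u_k v_k}{\sqrt{\sum_{k} u_k^2} \, \sqrt{\sum_{k} v_k^2}}$$
Here u is the user's sparse vector and v is the item's sparse vector.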
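
A minimal sketch of ranking items by cosine similarity, with sparse vectors stored as dictionaries that map a tag to its weight. The function names and the sample data are assumptions for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {key: weight} dicts."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An all-zero vector is treated as similar to nothing.
    return dot / (norm_a * norm_b)

def rank_items(user_vec, item_vecs):
    """Return (item_id, similarity) pairs, most similar first."""
    scored = [(item_id, cosine_similarity(user_vec, vec))
              for item_id, vec in item_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical user portrait and candidate items.
user = {"sci-fi": 2.0, "space": 1.0}
items = {
    "a3": {"cooking": 1.0},
    "a4": {"sci-fi": 1.0, "space": 1.0},
    "a5": {"sci-fi": 1.0, "robots": 1.0},
}
ranking = rank_items(user, items)
```

Only the keys shared by both vectors contribute to the dot product, which is what makes the sparse representation cheap to compare.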

User-based collaborative filtering

The idea behind it

Have you ever had this experience? You meet someone and discover that the books and movies he likes are basically the ones you like too. From then on, you keep wanting to ask him: what else can you recommend? What books have you read recently, and what movies have you watched? This feeling is natural and direct, and it is the idea behind user-based collaborative filtering. In detail: first, based on historical consumption behavior, find a group of users whose tastes are similar to yours; then recommend to you the items those users have consumed that you have not seen yet.

Processing steps

1 Prepare the user vector

In theory, a vector can be obtained for every user. Why only "in theory"? Because the vector depends on the user having behavioral data in our product; without it, there is no vector. This vector has three characteristics:

The dimension of the vector is the number of items
The vector is sparse, meaning that not every dimension has a value. The reason is simple: no user has consumed every item.
The value in each dimension can be a simple boolean, 0 or 1: 1 means browsed, 0 means not.
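
The three characteristics above can be sketched like this. The event log format (user id, item id) and the function names are assumptions; storing each boolean vector as a set of item ids is one possible sparse representation, not the only one.

```python
from collections import defaultdict

def build_user_vectors(events):
    """Turn (user_id, item_id) browse events into one sparse boolean vector per user."""
    vectors = defaultdict(set)
    for user_id, item_id in events:
        # A set records "browsed" (1); absent items are implicitly 0.
        vectors[user_id].add(item_id)
    return dict(vectors)

def to_dense(item_set, all_items):
    """Expand a sparse vector to a 0/1 list whose length is the number of items."""
    return [1 if item in item_set else 0 for item in all_items]

# Hypothetical browse log; the repeated event does not change the boolean value.
events = [("u1", "i1"), ("u1", "i2"), ("u2", "i2"), ("u2", "i3"), ("u1", "i1")]
vectors = build_user_vectors(events)
all_items = sorted({item for _, item in events})
dense_u1 = to_dense(vectors["u1"], all_items)
```

A user with no events never appears in the result, which matches the "in theory" caveat: no behavioral data, no vector.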

2 Use each user's vector to compute pairwise similarity between users, then set a similarity threshold or a maximum count and keep the most similar users for each user.

Here, too, we use cosine similarity.
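
Step 2 could look roughly like the sketch below, assuming boolean user vectors stored as sets of item ids. For 0/1 vectors the cosine reduces to the size of the intersection divided by the square root of the product of the set sizes. The names top_neighbors, k, and min_sim are hypothetical, and the brute-force pairwise loop is only practical for small user counts.

```python
import math

def cosine_sets(a, b):
    """Cosine similarity of two boolean vectors stored as sets of item ids."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def top_neighbors(user_items, k=2, min_sim=0.0):
    """For each user, keep at most k other users whose similarity exceeds min_sim."""
    neighbors = {}
    for u, items_u in user_items.items():
        sims = [(v, cosine_sets(items_u, items_v))
                for v, items_v in user_items.items() if v != u]
        # Apply the similarity threshold, then the maximum count.
        sims = [(v, s) for v, s in sims if s > min_sim]
        sims.sort(key=lambda pair: (-pair[1], pair[0]))
        neighbors[u] = sims[:k]
    return neighbors

# Hypothetical browse history.
user_items = {
    "u1": {"i1", "i2", "i3"},
    "u2": {"i1", "i2"},
    "u3": {"i3", "i4"},
    "u4": {"i5"},
}
neighbors = top_neighbors(user_items, k=2)
```

The example applies both the threshold and the maximum count; either one alone also satisfies the step as described.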

3 Generate recommendation results for each user.

Aggregate the items viewed by the users most similar to him, remove the items the user has already viewed, and sort what remains; that sorted list is the recommendation result. The aggregation can be expressed with a formula.
$$P_{u,i} = \frac{\sum_{j=1}^{n} \mathrm{sim}(u, j) \cdot R_{j,i}}{\sum_{j=1}^{n} \mathrm{sim}(u, j)}$$
The left side of the equal sign is the matching score between item i and user u. The right side is how that score is computed. The denominator is the sum of the similarities sim(u, j) of the n users most similar to user u. The numerator is those n users' attitudes R_{j,i} towards item i, weighted by similarity and summed. The simplest attitude is 0 or 1: 1 means like, 0 means not. If it is a rating, it can be a value from 0 to 5. The whole formula is a similarity-weighted average of the attitudes of similar users.
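
The weighted average described above can be sketched as follows. The function name recommend, the ratings dictionary layout, and the sample similarity values are assumptions; a neighbor who has not interacted with an item is treated as having attitude 0 towards it.

```python
def recommend(user, neighbors, ratings, top_n=3):
    """Score unseen items by the similarity-weighted average of neighbors' attitudes."""
    seen = set(ratings.get(user, {}))
    # Denominator: sum of similarities over all n neighbors.
    sim_total = sum(sim for _, sim in neighbors)
    if sim_total == 0:
        return []
    weighted = {}
    for neighbor, sim in neighbors:
        for item, attitude in ratings.get(neighbor, {}).items():
            if item not in seen:  # Drop items the user has already viewed.
                weighted[item] = weighted.get(item, 0.0) + sim * attitude
    scores = {item: total / sim_total for item, total in weighted.items()}
    # Highest score first; ties are broken by item id.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))[:top_n]

# Hypothetical 0/1 attitudes and precomputed neighbor similarities for u1.
ratings = {
    "u1": {"i1": 1, "i2": 1},
    "u2": {"i1": 1, "i2": 1, "i3": 1},
    "u3": {"i2": 1, "i4": 1},
}
neighbors_of_u1 = [("u2", 0.8), ("u3", 0.4)]
recs = recommend("u1", neighbors_of_u1, ratings)
```

In this example i3 outranks i4 because it is liked by the more similar neighbor; with ratings from 0 to 5 in place of 0/1 the same code yields a predicted rating.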

These notes are organized from the "36-type recommendation system" series by Xing Wu Dao.