Background
Amazon Personalize is a machine learning service that enables developers to build applications with the same machine learning (ML) technology used by Amazon.com to provide real-time personalized recommendations, without requiring ML expertise. Amazon Personalize handles data inspection, feature engineering, hyperparameter selection, model training, model optimization, and model deployment, and provides model evaluation metrics along with personalized models for end users. However, its built-in metrics do not cover the beyond-accuracy qualities we focus on in recommender systems, namely diversity, novelty, and serendipity. In this blog we provide a way to evaluate Amazon Personalize on diversity, novelty, and serendipity, use the MovieLens data (by GroupLens) to demonstrate the process step by step, and provide a complete example, amazon-personalize-evaluation, in the official aws-samples GitHub organization.
Amazon Personalize
Amazon Personalize is a machine learning service that makes it easy for developers to deliver personalized recommendations to the customers using their applications. Based on the same technology used at Amazon.com, Amazon Personalize allows developers to easily build sophisticated personalization features into their applications. For your users, you can generate personalized product and content recommendations and customized search results, or use the recommendation results to determine promotional items.
Using Amazon Personalize, you can personalize a user's home page with product recommendations based on their shopping history, recommend similar items on product detail pages to help users easily find what they need, help users quickly find relevant new products, deals, and promotions, and re-rank relevant product recommendations to drive tangible business results. You can improve marketing communications by personalizing push notifications and marketing emails with product recommendations, or go even further by combining Amazon Personalize with business logic to create high-quality shopping cart upsell and cross-sell recommendations.
Model training and evaluation
1. Test data set
GroupLens Research collected the ratings datasets and provides them through the MovieLens website https://movielens.org/ . The datasets were collected over different time periods, depending on the size of the dataset. This blog uses the MovieLens 100K movie ratings dataset to evaluate the performance of the Amazon Personalize model. The dataset contains 100,000 ratings from 1,000 users on 1,700 movies, released in April 1998.
2. Create personalized solutions and campaigns
When the test dataset is ready, we can refer to the example amazon-personalize-evaluation https://github.com/aws-samples/amazon-personalize-evaluation to interact with the API through the Amazon Personalize SDK. We first create a dataset group and upload user data, item data, and interaction data to Amazon S3. Once these datasets are ready, Amazon Personalize can create the solutions and campaigns we interact with. You can refer to 02-personalize_ranking_movielens.ipynb https://github.com/aws-samples/amazon-personalize-evaluation/blob/main/02-personalize_ranking_movielens.ipynb for details.
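For orientation, a minimal sketch of that flow with the Python SDK (boto3) follows; the resource names below are placeholders, and the schema creation, S3 uploads, and dataset import jobs shown in the notebooks are elided here:

```python
import boto3

personalize = boto3.client("personalize")

# Create a dataset group to hold the user, item, and interaction datasets.
dsg = personalize.create_dataset_group(name="movielens-evaluation")
dataset_group_arn = dsg["datasetGroupArn"]

# After the CSVs are uploaded to S3 and imported into datasets (schemas and
# import jobs elided here), train a solution; the user-personalization
# recipe is shown purely as an example.
solution = personalize.create_solution(
    name="movielens-solution",
    datasetGroupArn=dataset_group_arn,
    recipeArn="arn:aws:personalize:::recipe/aws-user-personalization",
)
solution_version = personalize.create_solution_version(
    solutionArn=solution["solutionArn"],
)

# Each call above is asynchronous; wait until the solution version is ACTIVE
# before deploying it as a campaign for real-time recommendations.
campaign = personalize.create_campaign(
    name="movielens-campaign",
    solutionVersionArn=solution_version["solutionVersionArn"],
    minProvisionedTPS=1,
)
```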
At the same time, we conduct the experiments 03-personalize_ranking_movielens-userpca-exp.ipynb https://github.com/aws-samples/amazon-personalize-evaluation/blob/main/03-personalize_ranking_movielens-userpca-exp.ipynb and 04-personalize_ranking_movielens-no-user-baseline.ipynb https://github.com/aws-samples/amazon-personalize-evaluation/blob/main/04-personalize_ranking_movielens-no-user-baseline.ipynb, the latter serving as the baseline for performance comparison.
3. Model evaluation
Amazon Personalize handles data inspection, feature engineering, hyperparameter selection, model training, model optimization, and model deployment, and provides four model evaluation metrics. In addition, using the user data and item data, we evaluate three additional metrics: diversity, novelty, and serendipity. As a running example, suppose we have a total of 26 movies named A through Z; we explain and calculate each evaluation metric against the results recommended by the model:
Coverage
Coverage tells you the ratio of the number of distinct items appearing in Amazon Personalize's recommendation results to the total number of items in the dataset. If we want Amazon Personalize to recommend more diverse products, we can prefer models with higher coverage. For example: if only A, B, C, D, and E appear in the recommendation results out of the 26 movies, the coverage is 5/26.
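A minimal sketch of this calculation in Python, using the toy A–Z catalog from the example:

```python
def coverage(recommended_items, catalog_items):
    """Distinct recommended items divided by the catalog size."""
    return len(set(recommended_items)) / len(set(catalog_items))

catalog = [chr(c) for c in range(ord("A"), ord("Z") + 1)]  # 26 movies, A..Z
print(coverage(["A", "B", "C", "D", "E"], catalog))        # 5/26 ≈ 0.192
```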
Precision at K
Precision at K tells you the proportion of the model's top K recommendations (K = 5, 10, or 25) that are relevant. Amazon Personalize calculates this metric as the number of relevant items among the top K recommendations divided by K. For example: A, B, C, D, E appear in the recommendation results, where B and D are relevant, so the precision is 2/5.
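The same example as a small Python function:

```python
def precision_at_k(recommended, relevant, k):
    """Relevant items among the top-k recommendations, divided by k."""
    return sum(1 for item in recommended[:k] if item in relevant) / k

print(precision_at_k(["A", "B", "C", "D", "E"], {"B", "D"}, k=5))  # 2/5 = 0.4
```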
Mean reciprocal rank
The mean reciprocal rank evaluates how highly the model ranks the first relevant recommendation. This metric is useful if you are mainly interested in the single top-ranked recommendation. For example: A, B, C, D, E appear in the recommendation results, where B and D are relevant; the first relevant item is the second recommendation, so the reciprocal rank is 1/2. The mean reciprocal rank is the average of these reciprocal ranks over all recommendation lists.
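A sketch of the reciprocal rank for the example above:

```python
def reciprocal_rank(recommended, relevant):
    """1 / position of the first relevant recommendation (0 if none)."""
    for position, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

# B is the first relevant item and sits at position 2 -> 1/2; MRR averages
# this value over all users' recommendation lists.
print(reciprocal_rank(["A", "B", "C", "D", "E"], {"B", "D"}))  # 0.5
```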
Normalized discounted cumulative gain (NDCG) at K (5/10/25)
Normalized discounted cumulative gain expresses how highly the model ranks relevant recommendations, where K is the sample size of 5, 10, or 25 recommendations. This metric rewards relevant items that appear near the top of the list, since the top of the list usually gets more attention. Amazon Personalize uses a weighting factor of 1/log(1+position), where the top of the list is position 1. For example: A, B, C, D, E appear in the recommendation results, where B and D are relevant; the weight of B is 1/log(1+2) and the weight of D is 1/log(1+4). The ideal result would place the two relevant items in the first two positions, giving 1/log(1+1) + 1/log(1+2). The sum of the weights of the actual recommendation results divided by the sum for the ideal result is the normalized discounted cumulative gain.
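A worked version of this example in Python (the logarithm base cancels in the ratio, so base 2 is used here):

```python
from math import log2

def ndcg_at_k(recommended, relevant, k):
    """NDCG with the weighting factor 1/log(1 + position)."""
    dcg = sum(1 / log2(1 + pos)
              for pos, item in enumerate(recommended[:k], start=1)
              if item in relevant)
    # Ideal ordering packs all relevant items at the top of the list.
    ideal = sum(1 / log2(1 + pos)
                for pos in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# B at position 2 and D at position 4, against the ideal positions 1 and 2.
print(ndcg_at_k(["A", "B", "C", "D", "E"], {"B", "D"}, k=5))  # ≈ 0.651
```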
Diversity
Diversity measures how varied the recommended content generated by the model is. When the diversity within the list is higher, users are often more satisfied with the recommendations. For example: A, B, C, D, E appear in the recommendation results; if the five movies come from different directors, the list performs better on diversity.
The diversity formula mainly reflects the degree of dissimilarity of the items in the recommendation results. Referring to the formula defined in the Diversity, Serendipity, Novelty, and Coverage paper on Semantic Scholar, it can be written as:

$$\mathrm{diversity}(R) = \frac{\sum_{i \in R} \sum_{j \in R \setminus \{i\}} \mathrm{dist}(i, j)}{|R| \cdot (|R| - 1)}$$

In the formula, R is the recommendation list, |R| is its length, and dist(i, j) is a distance function defined by the recommender system developer to describe the dissimilarity of two items. Diversity sums the distances of all pairs of items in the recommendation results and normalizes by the length of the list.
For movie recommendation, the custom distance function could, for example, sum the differences in genre, cast, and director between two movies; how to define the distance function in a given field depends on the domain logic and user perception. For the MovieLens recommender, we implement dist(i, j) as the gap between movie genres. MovieLens has 19 movie genres, and a movie can belong to multiple genres; after one-hot encoding the genres, we use the Manhattan distance to calculate the distance between two movies.
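A sketch of this diversity calculation, assuming hypothetical one-hot genre vectors keyed by movie id:

```python
from itertools import combinations
import numpy as np

def diversity(recommended, genre_vectors):
    """Average pairwise Manhattan distance between the one-hot genre
    vectors of the items in the recommendation list."""
    pairs = list(combinations(recommended, 2))
    return sum(np.abs(genre_vectors[i] - genre_vectors[j]).sum()
               for i, j in pairs) / len(pairs)

# Hypothetical 19-dimensional one-hot genre vectors keyed by movie id.
rng = np.random.default_rng(0)
genres = {m: rng.integers(0, 2, size=19) for m in "ABCDE"}
print(diversity(list("ABCDE"), genres))
```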
Novelty
Novelty assesses the extent to which the recommended content is new and unknown to the user. For example, if A, B, C, D, E appear in the recommendation results and all of them are popular movies, the recommendations may well be accurate, but because these are movies the customer would obviously watch anyway, users may find them monotonous. We therefore hope the recommendation results can include some less popular movies, giving customers a sense of freshness and a chance to explore more possibilities.
The novelty formula mainly reflects the unpopularity of the items in the recommendation results. Following the definition in the Diversity, Serendipity, Novelty, and Coverage paper on Semantic Scholar, a popularity-based form can be written as:

$$\mathrm{novelty}(R) = \frac{\sum_{i \in R} -\log_2 p(i)}{|R|}$$

In the formula, R is the recommendation list and p(.) is a popularity function defined by the developer. Under the popularity distribution, items in the long-tail position contribute more to the novelty score. For the MovieLens dataset, we define p(i) from the count of rating events greater than 3 for movie i.
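A sketch of a popularity-based novelty calculation; treating p(i) as the popularity count normalized by the number of users is an assumption of this sketch, and the toy counts are hypothetical:

```python
from math import log2

def novelty(recommended, popularity, num_users):
    """Average self-information -log2(p(i)) of the recommended items, where
    p(i) is the item's popularity count normalized by the user count."""
    return sum(-log2(popularity[i] / num_users)
               for i in recommended) / len(recommended)

# popularity[i]: number of rating events above 3 for movie i (toy values).
pop = {"A": 800, "B": 20, "C": 150, "D": 5, "E": 60}
print(novelty(["A", "B", "C", "D", "E"], pop, num_users=1000))
```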
Serendipity
Serendipity assesses how pleasantly surprising the recommendation results are to the user. We usually want the recommendations to be relevant, but at the same time serendipitous. For example, a user who usually watches sci-fi movies receives recommendations A, B, C, D, and E, where A through D are sci-fi movies and E is a film by the director of a well-known sci-fi movie. E may bring a new discovery to a user who is used to watching sci-fi, and such a recommendation pleasantly surprises the user.
Serendipity is meant to reflect, among the recommended items that are unfamiliar to the user, the proportion that receives positive feedback such as clicks or plays. Referring to the Diversity, Serendipity, Novelty, and Coverage paper on Semantic Scholar, serendipity is defined as:

$$\mathrm{serendipity}(R) = \frac{|R_{unexp} \cap R_{useful}|}{|R|}$$

R is the recommendation list, and R_unexp is the set of recommended items the user is not familiar with; in the example above, the user has watched sci-fi movies in the past, so the romance movies that appear in the results are unfamiliar to them. R_useful is the set of recommended items with positive user feedback, such as clicks and movie views. The numerator takes the intersection of the two sets, that is, the number of unfamiliar recommendations the user likes; dividing by the length of the recommendation list gives the serendipity. Recommender system developers must adjust the definitions of R_unexp and R_useful according to the scenario and domain knowledge.
In this example, we define user familiarity with an item as follows: for a user a, sum the genre vectors of the movies M_a the user has rated; the genres whose totals are higher than the average across all genre dimensions are defined as the genres C_a that user a likes. Any movie whose genre set C_m intersects C_a is defined as a familiar movie. A recommendation with positive feedback is one that the user ultimately rates three or higher.
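A sketch of this serendipity definition with toy data; the genre sets and ratings below are hypothetical:

```python
import numpy as np

def liked_genres(rated_genre_vectors):
    """C_a: genre dimensions whose summed counts exceed the mean total."""
    totals = np.sum(rated_genre_vectors, axis=0)
    return set(np.where(totals > totals.mean())[0])

def serendipity(recommended, movie_genres, liked, positively_rated):
    """|R_unexp ∩ R_useful| / |R|, with familiarity as genre overlap."""
    r_unexp = {m for m in recommended if not (movie_genres[m] & liked)}
    r_useful = set(recommended) & positively_rated  # later rated 3 or more
    return len(r_unexp & r_useful) / len(recommended)

# Toy data: genre-index sets per movie; the user's liked genres are {0, 1}.
movie_genres = {"A": {0}, "B": {0, 1}, "C": {1}, "D": {0}, "E": {5}}
print(serendipity(list("ABCDE"), movie_genres, liked={0, 1},
                  positively_rated={"B", "E"}))  # only E qualifies -> 1/5
```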
Experimental results
First, we use the schema provided for MovieLens to create the Amazon Personalize dataset group. The result appears in the Baseline column of Table 1.
Diversity, Serendipity, Novelty, and Coverage https://www.semanticscholar.org/paper/Diversity%2C-Serendipity%2C-Novelty%2C-and-Coverage-Kaminskas-Bridge/0a2a1bfeea7a572a78cd12a79f3b00911aa9bba4 mentions two ways to regulate diversity, novelty, and serendipity: one is to control the balance between content-based recommendation and collaborative recommendation, and the other is to control the degree of influence of popular items. Our experiments use both strategies.
1. User data enhancement
We add the genre information of the movies a user has reviewed to the user data. For a given user a, we sum the genre vectors of the movies M_a that the user rated higher than three. In the MovieLens example, each user then has an at most 19-dimensional vector describing their favorite movie genres, and the influence of movie content is strengthened through this genre information.
Incidentally, since Amazon Personalize limits the user dataset to five metadata fields, adding the user's demographic information on top of this vector would far exceed what Amazon Personalize allows. In this case, you can use a dimensionality reduction technique such as PCA to compress the user information into the number of fields Amazon Personalize accepts.
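A sketch of such a compression with scikit-learn PCA, using a hypothetical user-by-genre matrix in place of the notebook's preprocessing:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the preprocessing: one row per user, 19 columns
# of summed genre counts for movies rated above three.
rng = np.random.default_rng(0)
user_genre_matrix = rng.integers(0, 50, size=(1000, 19)).astype(float)

# Compress the 19 genre dimensions into a few components so the user
# dataset stays within Personalize's metadata field limit.
pca = PCA(n_components=4)
user_components = pca.fit_transform(user_genre_matrix)
print(user_components.shape)  # (1000, 4) -> four metadata columns
```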
The experimental results after user data enhancement are shown in the Enhanced User Profile column of Table 1. We can observe that the accuracy-related metrics decreased while diversity and novelty increased. Serendipity shows a downward trend, because the recommendation results move closer to the user's preferred genres.
2. Reduce the influence of popular items
Recommender systems built on user behavior tend to recommend popular items, which indirectly hurts diversity, novelty, and serendipity. In this experiment we define the popularity of a movie i, pop(i), as the total number of users who rated the movie higher than 3, and normalize user u's rating S_{u,i} of movie i by pop(i) to obtain S'_{u,i}.
In Amazon Personalize, we set the event_value in the user interaction data to S'_{u,i}. Amazon Personalize uses the presence or absence of user events, rather than the event_value, as the optimization target; but the event_value is still an input to the model, so the recommendation results can be influenced through it. The results are shown in the Normalized by Popularity column of Table 1. Diversity, novelty, and serendipity all improve over the baseline version, but the accuracy evaluation also drops.
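A sketch of this normalization with pandas, assuming the MovieLens u.data file and the S'_{u,i} definition above; the fallback for movies never rated above 3 is an assumption of this sketch:

```python
import pandas as pd

# MovieLens 100K interactions (u.data is tab-separated).
interactions = pd.read_csv(
    "u.data", sep="\t",
    names=["USER_ID", "ITEM_ID", "RATING", "TIMESTAMP"],
)

# pop(i): number of users who rated movie i above 3.
pop = (interactions[interactions["RATING"] > 3]
       .groupby("ITEM_ID")["USER_ID"].nunique())

# S'_{u,i} = S_{u,i} / pop(i); movies never rated above 3 fall back to a
# popularity of 1 here, which is an assumption of this sketch.
interactions["EVENT_VALUE"] = (
    interactions["RATING"] / interactions["ITEM_ID"].map(pop).fillna(1)
)
interactions["EVENT_TYPE"] = "rating"
interactions.to_csv("interactions.csv", index=False)
```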
In offline evaluation, increasing diversity, novelty, and serendipity may be accompanied by a decrease in accuracy. One possible reason is that user behavior in historical data is concentrated on popular items. In that case, the drop in accuracy is not necessarily a bad thing for the recommender system, and the choice of algorithm should take the application scenario into account.
Table 1 Comparison of experimental results
Summary
When using Amazon Personalize in practice, in addition to the common metrics such as coverage, precision, mean reciprocal rank, and normalized discounted cumulative gain, we also need to go beyond accuracy and evaluate the model on diversity, novelty, and serendipity to assess whether its recommendation results surprise or delight users. In this blog, we used the MovieLens experimental data and the formulas defined in the paper to quantify these metrics, allowing us to evaluate and report how the Amazon Personalize model performs on these beyond-accuracy indicators. If you want to learn more about the evaluation and backtesting method, along with a complete example implementing the formulas in an Amazon SageMaker notebook, refer to amazon-personalize-evaluation https://github.com/aws-samples/amazon-personalize-evaluation in the official aws-samples GitHub organization.
Authors of this article
Wang Xiangrui
Amazon Cloud Technology Solution Architect
Focuses on consulting on and designing cloud computing solution architectures based on Amazon Cloud Technology, and on researching modern solution building and microservice architecture design on the cloud, especially in big data and machine learning. He has expanded the depth and breadth of his technical skills by passing all 12 Amazon Cloud Technology certifications.
Yi-An Chen
Machine Learning Expert Solutions Architect
Focuses on machine learning and data. He received his Ph.D. in Information Management from National Taiwan University in 2011, majoring in information mining, and then worked for Yahoo Taiwan and Keke Technology. He has more than ten years of experience developing machine learning products and AI applications, with successful cases in e-commerce, multimedia and music recommender systems, multimedia content analysis, and natural language processing, and has co-published many papers with the academic community. He is committed to working with customers to solve the challenges encountered when adopting machine learning technology and to creating successful machine learning applications.