A practical data analysis case: using pandas on restaurant rating data

Introduction

To get more practice applying pandas to real data analysis, this article walks through how to use pandas to analyze rating data for American restaurants.

Introduction to Restaurant Rating Data

The data comes from the UCI ML Repository. It contains more than 1,000 records, each with 5 attributes:

userID: User ID

placeID: restaurant ID

rating: overall rating

food_rating: food rating

service_rating: service rating

We use pandas to read the data:

import pandas as pd

path = '../data/restaurant_rating_final.csv'
df = pd.read_csv(path)
df

	userID	placeID	rating	food_rating	service_rating
0	U1077	135085	2	2	2
1	U1077	135038	2	2	1
2	U1077	132825	2	2	2
3	U1077	135060	1	2	2
4	U1068	135104	1	1	2
...	...	...	...	...	...
1156	U1043	132630	1	1	1
1157	U1011	132715	1	1	0
1158	U1068	132733	1	1	0
1159	U1068	132594	1	1	1
1160	U1068	132660	0	0	0

1161 rows × 5 columns
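Before aggregating, it can help to check the column types and the range of rating values. The sketch below rebuilds the first five rows shown above by hand instead of reading the CSV, so it runs without the data file. The names `sample`, `n_users`, `n_places` and `rating_counts` are introduced here for illustration only.

```python
import pandas as pd

# First five rows of the table above, rebuilt by hand for illustration.
sample = pd.DataFrame({
    "userID": ["U1077", "U1077", "U1077", "U1077", "U1068"],
    "placeID": [135085, 135038, 132825, 135060, 135104],
    "rating": [2, 2, 2, 1, 1],
    "food_rating": [2, 2, 2, 2, 1],
    "service_rating": [2, 1, 2, 2, 2],
})

# Column types: userID holds strings, the other columns hold integers.
print(sample.dtypes)

# Number of distinct users and restaurants in the sample.
n_users = sample["userID"].nunique()
n_places = sample["placeID"].nunique()
print(n_users, n_places)

# Distribution of the overall rating values, ordered by rating.
rating_counts = sample["rating"].value_counts().sort_index()
print(rating_counts)
```

The same three checks can be run on the full `df` once it has been loaded.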

Analyze scoring data

Suppose we care about the overall rating and the food rating of each restaurant. We can start by computing the average of both ratings per restaurant with the pivot_table method:

mean_ratings = df.pivot_table(values=['rating','food_rating'], index='placeID',
                                 aggfunc='mean')
mean_ratings[:5]

	food_rating	rating
placeID
132560	1.00	0.50
132561	1.00	0.75
132564	1.25	1.25
132572	1.00	1.00
132583	1.00	1.00

Next, count how many ratings each placeID received:

ratings_by_place = df.groupby('placeID').size()
ratings_by_place[:10]

placeID
132560     4
132561     4
132564     4
132572    15
132583     4
132584     6
132594     5
132608     6
132609     5
132613     6
dtype: int64
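One detail worth knowing: `size()` counts rows per group, while `count()` counts only non-missing values in a column. The two agree when there are no missing ratings. The sketch below uses a made-up frame with one missing rating to show the difference. Whether the restaurant data contains missing values is not stated here, so treat this as an illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing rating for restaurant 1.
toy = pd.DataFrame({
    "placeID": [1, 1, 1, 2, 2],
    "rating": [2, np.nan, 1, 0, 2],
})

# size() counts every row in the group, missing values included.
rows_per_place = toy.groupby("placeID").size()

# count() counts only the rows where rating is present.
ratings_per_place = toy.groupby("placeID")["rating"].count()

print(rows_per_place)
print(ratings_per_place)
```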

If a restaurant has too few votes, its average is not reliable. Let's keep only the restaurants with at least 4 votes:

active_place = ratings_by_place.index[ratings_by_place >= 4]
active_place

Int64Index([132560, 132561, 132564, 132572, 132583, 132584, 132594, 132608,
            132609, 132613,
            ...
            135080, 135081, 135082, 135085, 135086, 135088, 135104, 135106,
            135108, 135109],
           dtype='int64', name='placeID', length=124)
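The filter works by indexing the index with a boolean mask. The choice of comparison operator matters: `>= 4` keeps restaurants with exactly 4 votes, while `> 4` drops them. A minimal sketch on made-up vote counts (the IDs and counts below are hypothetical):

```python
import pandas as pd

# Hypothetical vote counts per restaurant.
votes = pd.Series(
    [4, 15, 3, 6],
    index=pd.Index([901, 902, 903, 904], name="placeID"),
)

# The comparison yields a boolean Series aligned with the index.
mask = votes >= 4
print(mask)

# Indexing the index with the mask keeps only the True positions.
at_least_4 = votes.index[votes >= 4]
more_than_4 = votes.index[votes > 4]

print(list(at_least_4))
print(list(more_than_4))
```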

Select the average rating data for these restaurants:

mean_ratings = mean_ratings.loc[active_place]
mean_ratings

	food_rating	rating
placeID
132560	1.000000	0.500000
132561	1.000000	0.750000
132564	1.250000	1.250000
132572	1.000000	1.000000
132583	1.000000	1.000000
...	...	...
135088	1.166667	1.000000
135104	1.428571	0.857143
135106	1.200000	1.200000
135108	1.181818	1.181818
135109	1.250000	1.000000

124 rows × 2 columns

Sort the ratings and choose the 10 with the highest ratings:

top_ratings = mean_ratings.sort_values(by='rating', ascending=False)
top_ratings[:10]

	food_rating	rating
placeID
132955	1.800000	2.000000
135034	2.000000	2.000000
134986	2.000000	2.000000
132922	1.500000	1.833333
132755	2.000000	1.800000
135074	1.750000	1.750000
135013	2.000000	1.750000
134976	1.750000	1.750000
135055	1.714286	1.714286
135075	1.692308	1.692308
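When only the top few rows are needed, `DataFrame.nlargest` is an alternative to sorting the whole frame and slicing. The sketch below compares the two on a made-up frame; `toy_means` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical per-restaurant averages.
toy_means = pd.DataFrame(
    {"food_rating": [1.0, 2.0, 1.5, 1.75], "rating": [0.5, 2.0, 1.8, 1.75]},
    index=pd.Index([101, 102, 103, 104], name="placeID"),
)

# Approach used in the article: sort everything, then slice.
top2_sorted = toy_means.sort_values(by="rating", ascending=False)[:2]

# Alternative: ask directly for the 2 rows with the largest rating.
top2_nlargest = toy_means.nlargest(2, "rating")

print(top2_sorted)
print(top2_nlargest)
```

The toy ratings are all distinct on purpose. When several restaurants share the same average, the order among the tied rows may differ between the two approaches.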

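We can also calculate the difference between the average overall rating and the average food rating and store it in a new column, diff. Sorting by diff in ascending order puts the restaurants whose food rating most exceeds their overall rating first: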

mean_ratings['diff'] = mean_ratings['rating'] - mean_ratings['food_rating']

sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]

	food_rating	rating	diff
placeID
132667	2.000000	1.250000	-0.750000
132594	1.200000	0.600000	-0.600000
132858	1.400000	0.800000	-0.600000
135104	1.428571	0.857143	-0.571429
132560	1.000000	0.500000	-0.500000
135027	1.375000	0.875000	-0.500000
132740	1.250000	0.750000	-0.500000
134992	1.500000	1.000000	-0.500000
132706	1.250000	0.750000	-0.500000
132870	1.000000	0.600000	-0.400000

Reverse the sorted order to get the 10 restaurants whose overall rating most exceeds their food rating:

sorted_by_diff[::-1][:10]

	food_rating	rating	diff
placeID
134987	0.500000	1.000000	0.500000
132937	1.000000	1.500000	0.500000
135066	1.000000	1.500000	0.500000
132851	1.000000	1.428571	0.428571
135049	0.600000	1.000000	0.400000
132922	1.500000	1.833333	0.333333
135030	1.333333	1.583333	0.250000
135063	1.000000	1.250000	0.250000
132626	1.000000	1.250000	0.250000
135000	1.000000	1.250000	0.250000
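Both ends of the diff ranking can also be pulled out directly with `nsmallest` and `nlargest`. To rank by the size of the gap regardless of its sign, one option is to rank on the absolute value. The sketch below uses made-up averages; `toy_means` and the `abs_diff` column are hypothetical names, not part of the analysis above.

```python
import pandas as pd

# Hypothetical per-restaurant averages.
toy_means = pd.DataFrame(
    {"food_rating": [2.0, 1.0, 1.0, 1.5], "rating": [1.25, 1.5, 1.0, 1.5]},
    index=pd.Index([201, 202, 203, 204], name="placeID"),
)

# Overall rating minus food rating, as in the article.
toy_means["diff"] = toy_means["rating"] - toy_means["food_rating"]

# Most negative diff: food is rated well above the overall experience.
lowest = toy_means.nsmallest(1, "diff")

# Most positive diff: overall experience is rated well above the food.
highest = toy_means.nlargest(1, "diff")

# Largest gap in either direction.
toy_means["abs_diff"] = toy_means["diff"].abs()
biggest_gap = toy_means.nlargest(1, "abs_diff")

print(lowest)
print(highest)
print(biggest_gap)
```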

Calculate the standard deviation of the ratings for each restaurant and pick the 10 largest:

# Standard deviation of rating grouped by placeID
rating_std_by_place = df.groupby('placeID')['rating'].std()
# Filter down to active_place
rating_std_by_place = rating_std_by_place.loc[active_place]
# Order Series by value in descending order
rating_std_by_place.sort_values(ascending=False)[:10]

placeID
134987    1.154701
135049    1.000000
134983    1.000000
135053    0.991031
135027    0.991031
132847    0.983192
132767    0.983192
132884    0.983192
135082    0.971825
132706    0.957427
Name: rating, dtype: float64
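By default, pandas `std()` computes the sample standard deviation (ddof=1), which divides by n - 1. The top value of 1.154701 is what four ratings of 0, 2, 0, 2 would produce, for example. The actual ratings of that restaurant are not shown here, so the numbers below are a hypothetical illustration.

```python
import pandas as pd

# Hypothetical ratings for one restaurant: two 0s and two 2s.
ratings = pd.Series([0, 2, 0, 2])

# Default: sample standard deviation, dividing by n - 1.
sample_std = ratings.std()

# Population standard deviation, dividing by n.
population_std = ratings.std(ddof=0)

# With a single rating the sample standard deviation is undefined (NaN),
# which is another reason to filter out restaurants with very few votes.
single_std = pd.Series([2]).std()

print(round(sample_std, 6), population_std, single_std)
```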

This article has been included in http://www.flydean.com/02-pandas-restaurant/


flydean
