Introduction
To practice applying pandas to real data analysis, this article walks through an analysis of rating data for American restaurants.
Introduction to Restaurant Rating Data
The data comes from the UCI ML Repository. It contains more than 1,000 rows with 5 attributes:
userID: user ID
placeID: restaurant ID
rating: overall rating
food_rating: food rating
service_rating: service rating
We use pandas to read the data:
import pandas as pd
path = '../data/restaurant_rating_final.csv'
df = pd.read_csv(path)
df
| | userID | placeID | rating | food_rating | service_rating |
|---|---|---|---|---|---|
0 | U1077 | 135085 | 2 | 2 | 2 |
1 | U1077 | 135038 | 2 | 2 | 1 |
2 | U1077 | 132825 | 2 | 2 | 2 |
3 | U1077 | 135060 | 1 | 2 | 2 |
4 | U1068 | 135104 | 1 | 1 | 2 |
... | ... | ... | ... | ... | ... |
1156 | U1043 | 132630 | 1 | 1 | 1 |
1157 | U1011 | 132715 | 1 | 1 | 0 |
1158 | U1068 | 132733 | 1 | 1 | 0 |
1159 | U1068 | 132594 | 1 | 1 | 1 |
1160 | U1068 | 132660 | 0 | 0 | 0 |
1161 rows × 5 columns
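Before aggregating, it can help to run a few quick sanity checks on the loaded frame. The sketch below is only an illustration: it assumes nothing about the full CSV and instead rebuilds the first five rows shown above into a small frame named `sample` (a hypothetical name), then checks its shape, the number of distinct users, the distribution of `rating`, and whether any values are missing.

```python
import pandas as pd

# First five rows copied from the table above; the full CSV is not reproduced here.
sample = pd.DataFrame({
    "userID": ["U1077", "U1077", "U1077", "U1077", "U1068"],
    "placeID": [135085, 135038, 132825, 135060, 135104],
    "rating": [2, 2, 2, 1, 1],
    "food_rating": [2, 2, 2, 2, 1],
    "service_rating": [2, 1, 2, 2, 2],
})

n_rows, n_cols = sample.shape                      # number of rows and columns
n_users = sample["userID"].nunique()               # distinct users in the sample
rating_counts = sample["rating"].value_counts().sort_index()  # how often each rating occurs
missing = int(sample.isna().sum().sum())           # total count of missing cells

print(n_rows, n_cols, n_users, missing)
print(rating_counts)
```

The same calls can be applied to `df` once the real file has been read.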
Analyzing the Rating Data
To compare the overall rating and the food rating across restaurants, we first compute the average of each per restaurant. Here we use the pivot_table method:
mean_ratings = df.pivot_table(values=['rating','food_rating'], index='placeID',
aggfunc='mean')
mean_ratings[:5]
placeID | food_rating | rating |
---|---|---|
132560 | 1.00 | 0.50 |
132561 | 1.00 | 0.75 |
132564 | 1.25 | 1.25 |
132572 | 1.00 | 1.00 |
132583 | 1.00 | 1.00 |
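For a per-group mean like this one, `pivot_table` and `groupby(...).mean()` should give the same numbers. The sketch below illustrates that on a small made-up frame called `toy` (hypothetical data, not taken from the restaurant CSV).

```python
import pandas as pd

# Hypothetical toy data: two places with a handful of ratings each.
toy = pd.DataFrame({
    "placeID": [1, 1, 2, 2, 2],
    "rating": [2, 0, 1, 1, 2],
    "food_rating": [2, 1, 0, 1, 2],
})

# Mean of both rating columns per place, computed two ways.
via_pivot = toy.pivot_table(values=["rating", "food_rating"],
                            index="placeID", aggfunc="mean")
via_groupby = toy.groupby("placeID")[["food_rating", "rating"]].mean()

print(via_pivot)
print(via_groupby)
```

Which form to use is mostly a matter of taste; `pivot_table` becomes more convenient once a `columns` argument is needed as well.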
Next, count how many ratings each placeID received:
ratings_by_place = df.groupby('placeID').size()
ratings_by_place[:10]
placeID
132560 4
132561 4
132564 4
132572 15
132583 4
132584 6
132594 5
132608 6
132609 5
132613 6
dtype: int64
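Note that `size()` counts rows per group, whereas `count()` counts only non-missing values in a column. The two agree when there are no missing ratings, which may or may not hold for a given file. A minimal sketch on made-up data (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one missing rating for place 1.
toy = pd.DataFrame({
    "placeID": [1, 1, 2],
    "rating": [2.0, None, 1.0],
})

rows_per_place = toy.groupby("placeID").size()               # counts every row
ratings_per_place = toy.groupby("placeID")["rating"].count() # skips missing values
also_rows = toy["placeID"].value_counts().sort_index()       # same counts as size()

print(rows_per_place)
print(ratings_per_place)
```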
If a restaurant has too few ratings, its average is not reliable. Let's keep only the restaurants with at least 4 ratings:
active_place = ratings_by_place.index[ratings_by_place >= 4]
active_place
Int64Index([132560, 132561, 132564, 132572, 132583, 132584, 132594, 132608,
132609, 132613,
...
135080, 135081, 135082, 135085, 135086, 135088, 135104, 135106,
135108, 135109],
dtype='int64', name='placeID', length=124)
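The choice between `>=` and `>` matters at the threshold itself. The following sketch, on a made-up frame `toy` (hypothetical data), shows how the boolean mask on the counts selects index labels and how the two comparisons differ for a place with exactly 4 ratings.

```python
import pandas as pd

# Hypothetical toy data: place 10 has 3 ratings, place 20 has 4, place 30 has 5.
toy = pd.DataFrame({
    "placeID": [10] * 3 + [20] * 4 + [30] * 5,
    "rating": [1, 2, 0, 2, 2, 1, 1, 0, 1, 2, 2, 1],
})

counts = toy.groupby("placeID").size()

# Boolean mask over the counts, applied to the index of place IDs.
at_least_4 = counts.index[counts >= 4]   # keeps places with exactly 4 ratings
more_than_4 = counts.index[counts > 4]   # drops them

print(list(at_least_4))
print(list(more_than_4))
```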
Select the average rating data for these restaurants:
mean_ratings = mean_ratings.loc[active_place]
mean_ratings
placeID | food_rating | rating |
---|---|---|
132560 | 1.000000 | 0.500000 |
132561 | 1.000000 | 0.750000 |
132564 | 1.250000 | 1.250000 |
132572 | 1.000000 | 1.000000 |
132583 | 1.000000 | 1.000000 |
... | ... | ... |
135088 | 1.166667 | 1.000000 |
135104 | 1.428571 | 0.857143 |
135106 | 1.200000 | 1.200000 |
135108 | 1.181818 | 1.181818 |
135109 | 1.250000 | 1.000000 |
124 rows × 2 columns
Sort by average overall rating in descending order and take the top 10:
top_ratings = mean_ratings.sort_values(by='rating', ascending=False)
top_ratings[:10]
placeID | food_rating | rating |
---|---|---|
132955 | 1.800000 | 2.000000 |
135034 | 2.000000 | 2.000000 |
134986 | 2.000000 | 2.000000 |
132922 | 1.500000 | 1.833333 |
132755 | 2.000000 | 1.800000 |
135074 | 1.750000 | 1.750000 |
135013 | 2.000000 | 1.750000 |
134976 | 1.750000 | 1.750000 |
135055 | 1.714286 | 1.714286 |
135075 | 1.692308 | 1.692308 |
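As an alternative to sorting the whole frame and slicing, `DataFrame.nlargest` returns the top rows directly. The sketch below uses a made-up frame `toy_means` (hypothetical values with no ties) to show that both approaches pick the same rows. With ties, the order among equal values may differ between the two.

```python
import pandas as pd

# Hypothetical per-place averages, indexed by placeID.
toy_means = pd.DataFrame(
    {"food_rating": [1.0, 2.0, 1.5, 1.2], "rating": [0.5, 2.0, 1.8, 1.0]},
    index=pd.Index([101, 102, 103, 104], name="placeID"),
)

# Two ways of taking the 2 places with the highest average rating.
top2_sorted = toy_means.sort_values(by="rating", ascending=False).head(2)
top2_nlargest = toy_means.nlargest(2, "rating")

print(top2_sorted)
print(top2_nlargest)
```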
We can also compute the difference between the average overall rating and the average food rating, store it in a new column named diff, and sort by it in ascending order:
mean_ratings['diff'] = mean_ratings['rating'] - mean_ratings['food_rating']
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]
placeID | food_rating | rating | diff |
---|---|---|---|
132667 | 2.000000 | 1.250000 | -0.750000 |
132594 | 1.200000 | 0.600000 | -0.600000 |
132858 | 1.400000 | 0.800000 | -0.600000 |
135104 | 1.428571 | 0.857143 | -0.571429 |
132560 | 1.000000 | 0.500000 | -0.500000 |
135027 | 1.375000 | 0.875000 | -0.500000 |
132740 | 1.250000 | 0.750000 | -0.500000 |
134992 | 1.500000 | 1.000000 | -0.500000 |
132706 | 1.250000 | 0.750000 | -0.500000 |
132870 | 1.000000 | 0.600000 | -0.400000 |
Reverse the order to see the 10 restaurants whose overall rating most exceeds their food rating:
sorted_by_diff[::-1][:10]
placeID | food_rating | rating | diff |
---|---|---|---|
134987 | 0.500000 | 1.000000 | 0.500000 |
132937 | 1.000000 | 1.500000 | 0.500000 |
135066 | 1.000000 | 1.500000 | 0.500000 |
132851 | 1.000000 | 1.428571 | 0.428571 |
135049 | 0.600000 | 1.000000 | 0.400000 |
132922 | 1.500000 | 1.833333 | 0.333333 |
135030 | 1.333333 | 1.583333 | 0.250000 |
135063 | 1.000000 | 1.250000 | 0.250000 |
132626 | 1.000000 | 1.250000 | 0.250000 |
135000 | 1.000000 | 1.250000 | 0.250000 |
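Reversing an ascending sort shows the largest positive differences. If the goal is instead the largest gap in either direction, sorting by the absolute value is one option. A small sketch on a made-up frame `toy` (hypothetical values, no ties):

```python
import pandas as pd

# Hypothetical per-place averages, indexed by placeID.
toy = pd.DataFrame(
    {"food_rating": [2.0, 1.0, 1.0], "rating": [1.25, 1.5, 1.0]},
    index=pd.Index([1, 2, 3], name="placeID"),
)
toy["diff"] = toy["rating"] - toy["food_rating"]

by_reverse = toy.sort_values(by="diff")[::-1]            # ascending sort, then reversed
by_desc = toy.sort_values(by="diff", ascending=False)    # descending sort directly
abs_gap = toy["diff"].abs().sort_values(ascending=False) # largest gap in either direction

print(by_desc)
print(abs_gap)
```

Without ties, the reversed and the descending sort list the rows in the same order; the absolute-value ranking can differ from both.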
Calculate the standard deviation of each restaurant's overall rating and pick the 10 largest:
# Standard deviation of rating grouped by placeID
rating_std_by_place = df.groupby('placeID')['rating'].std()
# Filter down to active_place
rating_std_by_place = rating_std_by_place.loc[active_place]
# Order Series by value in descending order
rating_std_by_place.sort_values(ascending=False)[:10]
placeID
134987 1.154701
135049 1.000000
134983 1.000000
135053 0.991031
135027 0.991031
132847 0.983192
132767 0.983192
132884 0.983192
135082 0.971825
132706 0.957427
Name: rating, dtype: float64
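When reading these numbers, keep in mind that the pandas `std()` defaults to the sample standard deviation (`ddof=1`), so a group with a single rating yields NaN. The sketch below illustrates this on a made-up frame `toy` (hypothetical data): four ratings of 0, 0, 2, 2 give sqrt(4/3), roughly 1.1547, under `ddof=1` and exactly 1.0 under `ddof=0`.

```python
import pandas as pd

# Hypothetical toy data: place 1 has four ratings, place 2 has only one.
toy = pd.DataFrame({
    "placeID": [1, 1, 1, 1, 2],
    "rating": [0, 0, 2, 2, 1],
})

grouped = toy.groupby("placeID")["rating"]
std_sample = grouped.std()        # default ddof=1 (sample standard deviation)
std_pop = grouped.std(ddof=0)     # population standard deviation

print(std_sample)
print(std_pop)
```

This is one more reason to filter out places with very few ratings before ranking by standard deviation.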
This article has been included in http://www.flydean.com/02-pandas-restaurant/