Introduction

To get more practice applying pandas to real data analysis, this article shows how to use pandas to analyze restaurant rating data.

Introduction to Restaurant Rating Data

The data comes from the UCI Machine Learning Repository. It contains 1,161 records with 5 attributes:

userID: User ID

placeID: restaurant ID

rating: overall rating

food_rating: food rating

service_rating: service rating

We use pandas to read the data:

import pandas as pd

path = '../data/restaurant_rating_final.csv'
df = pd.read_csv(path)
df
     userID  placeID  rating  food_rating  service_rating
0     U1077   135085       2            2               2
1     U1077   135038       2            2               1
2     U1077   132825       2            2               2
3     U1077   135060       1            2               2
4     U1068   135104       1            1               2
...     ...      ...     ...          ...             ...
1156  U1043   132630       1            1               1
1157  U1011   132715       1            1               0
1158  U1068   132733       1            1               0
1159  U1068   132594       1            1               1
1160  U1068   132660       0            0               0

1161 rows × 5 columns
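To try the loading step without the original file, here is a minimal sketch that feeds the first five rows shown above to read_csv as inline text. The variable names csv_text and sample are illustrative only, and the comma-separated layout of the file is an assumption.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: the first five rows shown above,
# assuming the file is comma-separated with a header row.
csv_text = """userID,placeID,rating,food_rating,service_rating
U1077,135085,2,2,2
U1077,135038,2,2,1
U1077,132825,2,2,2
U1077,135060,1,2,2
U1068,135104,1,1,2
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Quick sanity checks: shape, column types, and the range of each rating.
print(sample.shape)
print(sample.dtypes)
print(sample[['rating', 'food_rating', 'service_rating']].agg(['min', 'max']))
```

Checking the minimum and maximum of each rating column is a quick way to confirm the rating scale before averaging anything.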

Analyzing the Rating Data

If we care about the overall rating and the food rating of each restaurant, we can start by looking at the average of each per restaurant. Here we use the pivot_table method:

mean_ratings = df.pivot_table(values=['rating','food_rating'], index='placeID',
                                 aggfunc='mean')
mean_ratings[:5]
         food_rating  rating
placeID
132560          1.00    0.50
132561          1.00    0.75
132564          1.25    1.25
132572          1.00    1.00
132583          1.00    1.00
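For a per-restaurant mean like this, pivot_table does the same job as a groupby followed by mean. The sketch below checks that on a small made-up frame (toy and its values are hypothetical, not taken from the data set).

```python
import pandas as pd

# Hypothetical toy data, not taken from the real file.
toy = pd.DataFrame({
    'placeID': [1, 1, 2, 2, 2],
    'rating': [2, 0, 1, 1, 2],
    'food_rating': [2, 1, 0, 1, 2],
})

# Mean of both rating columns per restaurant, via pivot_table ...
by_pivot = toy.pivot_table(values=['rating', 'food_rating'],
                           index='placeID', aggfunc='mean')

# ... and via groupby. Columns are listed alphabetically to match
# the column order pivot_table produces.
by_group = toy.groupby('placeID')[['food_rating', 'rating']].mean()

print(by_pivot)
print(by_group)
```

Both results are indexed by placeID and hold the same means, so which form to use is mostly a matter of taste.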

Next, count how many ratings each placeID received:

ratings_by_place = df.groupby('placeID').size()
ratings_by_place[:10]
placeID
132560     4
132561     4
132564     4
132572    15
132583     4
132584     6
132594     5
132608     6
132609     5
132613     6
dtype: int64
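One detail worth knowing: size counts rows, while count counts non-null values. The two only differ when ratings are missing. The sketch below uses a made-up frame with one missing rating to show the difference; whether the real file contains missing values is not stated here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing rating.
toy = pd.DataFrame({
    'placeID': [1, 1, 1, 2, 2],
    'rating': [2, np.nan, 1, 0, 2],
})

# size() counts every row in the group, including rows with NaN.
rows_per_place = toy.groupby('placeID').size()

# count() counts only the non-null ratings in the group.
rated_per_place = toy.groupby('placeID')['rating'].count()

print(rows_per_place)
print(rated_per_place)
```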

If a restaurant has too few votes, its average is not reliable. Let's keep only the restaurants with at least 4 votes:

active_place = ratings_by_place.index[ratings_by_place >= 4]
active_place
Int64Index([132560, 132561, 132564, 132572, 132583, 132584, 132594, 132608,
            132609, 132613,
            ...
            135080, 135081, 135082, 135085, 135086, 135088, 135104, 135106,
            135108, 135109],
           dtype='int64', name='placeID', length=124)
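The comparison operator decides whether restaurants with exactly 4 votes are kept. As a small sketch on made-up vote counts (the IDs 101 to 104 and the name min_votes are hypothetical):

```python
import pandas as pd

# Hypothetical vote counts per restaurant.
votes = pd.Series([4, 15, 3, 6],
                  index=pd.Index([101, 102, 103, 104], name='placeID'))

min_votes = 4

# ">=" keeps restaurants with exactly min_votes votes ...
at_least = votes.index[votes >= min_votes]

# ... while ">" drops them.
more_than = votes.index[votes > min_votes]

print(list(at_least))
print(list(more_than))
```

Keeping the threshold in a named variable makes it easy to rerun the analysis with a stricter cutoff.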

Select the average rating data for these restaurants:

mean_ratings = mean_ratings.loc[active_place]
mean_ratings
         food_rating    rating
placeID
132560      1.000000  0.500000
132561      1.000000  0.750000
132564      1.250000  1.250000
132572      1.000000  1.000000
132583      1.000000  1.000000
...              ...       ...
135088      1.166667  1.000000
135104      1.428571  0.857143
135106      1.200000  1.200000
135108      1.181818  1.181818
135109      1.250000  1.000000

124 rows × 2 columns

Sort by overall rating in descending order and take the top 10:

top_ratings = mean_ratings.sort_values(by='rating', ascending=False)
top_ratings[:10]
         food_rating    rating
placeID
132955      1.800000  2.000000
135034      2.000000  2.000000
134986      2.000000  2.000000
132922      1.500000  1.833333
132755      2.000000  1.800000
135074      1.750000  1.750000
135013      2.000000  1.750000
134976      1.750000  1.750000
135055      1.714286  1.714286
135075      1.692308  1.692308
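When only the top few rows are needed, DataFrame.nlargest is an alternative to sorting and slicing. A minimal sketch on made-up averages (the frame means and its values are hypothetical):

```python
import pandas as pd

# Hypothetical per-restaurant averages with no tied ratings.
means = pd.DataFrame(
    {'food_rating': [1.2, 1.8, 2.0, 1.0],
     'rating': [1.0, 2.0, 1.5, 0.5]},
    index=pd.Index([101, 102, 103, 104], name='placeID'),
)

# Sort everything, then slice off the top two ...
top2_sorted = means.sort_values(by='rating', ascending=False)[:2]

# ... or ask for the two largest directly.
top2_nlargest = means.nlargest(2, 'rating')

print(top2_sorted)
print(top2_nlargest)
```

With tied ratings, as in the real output above, the order among the tied rows may differ between the two approaches.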

We can also calculate the difference between the average overall rating and the average food rating, save it in a new column named diff, and sort by it:

mean_ratings['diff'] = mean_ratings['rating'] - mean_ratings['food_rating']

sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]
         food_rating    rating      diff
placeID
132667      2.000000  1.250000 -0.750000
132594      1.200000  0.600000 -0.600000
132858      1.400000  0.800000 -0.600000
135104      1.428571  0.857143 -0.571429
132560      1.000000  0.500000 -0.500000
135027      1.375000  0.875000 -0.500000
132740      1.250000  0.750000 -0.500000
134992      1.500000  1.000000 -0.500000
132706      1.250000  0.750000 -0.500000
132870      1.000000  0.600000 -0.400000
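To make the arithmetic concrete, the sketch below rebuilds three rows of the output above by hand (the frame name rows is illustrative) and recomputes diff. A negative diff means the food was rated higher than the restaurant overall.

```python
import pandas as pd

# Three rows copied from the output above, entered in shuffled order.
rows = pd.DataFrame(
    {'food_rating': [1.0, 2.0, 1.2],
     'rating': [0.5, 1.25, 0.6]},
    index=pd.Index([132560, 132667, 132594], name='placeID'),
)

# Column arithmetic is element-wise and aligned on the index.
rows['diff'] = rows['rating'] - rows['food_rating']

# Ascending sort puts the most negative difference first.
print(rows.sort_values(by='diff'))
```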

Reverse the sorted result to get the 10 restaurants whose overall rating exceeds their food rating by the largest margin:

sorted_by_diff[::-1][:10]
         food_rating    rating      diff
placeID
134987      0.500000  1.000000  0.500000
132937      1.000000  1.500000  0.500000
135066      1.000000  1.500000  0.500000
132851      1.000000  1.428571  0.428571
135049      0.600000  1.000000  0.400000
132922      1.500000  1.833333  0.333333
135030      1.333333  1.583333  0.250000
135063      1.000000  1.250000  0.250000
132626      1.000000  1.250000  0.250000
135000      1.000000  1.250000  0.250000
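Reversing an ascending sort with [::-1] gives the same ranking as sorting with ascending=False, at least when no values are tied. A small sketch on made-up differences (the frame gaps is hypothetical):

```python
import pandas as pd

# Hypothetical differences with no tied values.
gaps = pd.DataFrame(
    {'diff': [-0.75, 0.5, 0.25, -0.1]},
    index=pd.Index([101, 102, 103, 104], name='placeID'),
)

# Reverse an ascending sort ...
reversed_sort = gaps.sort_values(by='diff')[::-1]

# ... or sort in descending order directly.
descending_sort = gaps.sort_values(by='diff', ascending=False)

print(list(reversed_sort.index))
print(list(descending_sort.index))
```

When several rows share the same diff, the order among those tied rows can differ between the two forms.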

Calculate the standard deviation of the ratings for each restaurant and pick the 10 largest:

# Standard deviation of rating grouped by placeID
rating_std_by_place = df.groupby('placeID')['rating'].std()
# Filter down to active_place
rating_std_by_place = rating_std_by_place.loc[active_place]
# Order Series by value in descending order
rating_std_by_place.sort_values(ascending=False)[:10]
placeID
134987    1.154701
135049    1.000000
134983    1.000000
135053    0.991031
135027    0.991031
132847    0.983192
132767    0.983192
132884    0.983192
135082    0.971825
132706    0.957427
Name: rating, dtype: float64
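Note that pandas computes the sample standard deviation by default (ddof=1), which divides by n - 1 rather than n. As a sketch, four hypothetical ratings of 0, 2, 0, 2 give a value that matches the top row above; these are not claimed to be the actual ratings of that restaurant.

```python
import pandas as pd

# Four hypothetical ratings that split evenly between 0 and 2.
ratings = pd.Series([0, 2, 0, 2])

# Default: sample standard deviation, dividing by n - 1.
sample_std = ratings.std()

# Population standard deviation, dividing by n.
population_std = ratings.std(ddof=0)

print(round(sample_std, 6))
print(population_std)
```

A high standard deviation on a 0 to 2 scale means voters disagreed sharply, so these are the restaurants with the most divided opinions.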

This article has been included in http://www.flydean.com/02-pandas-restaurant/


flydean