Big Data and Cloud Computing Study Notes: Data Analysis (Part 2)

For hands-on practice, I recommend this introductory course on GitHub: PythonData

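Group the data by name and compute the sum, mean, and standard deviation

# Aggregate the numeric columns only; the string column gender cannot be averaged
print(baby_names.groupby('name')[['year', 'frequency']].agg(['sum', 'mean', 'std']))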
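
As a minimal sketch of what a multi-function aggregation returns, the example below runs one on a small made-up frame. The toy data and its values are assumptions for illustration, not the real baby_names data. Each name becomes one row and each aggregation becomes one column.

```python
import pandas as pd

# Hypothetical toy data; the real baby_names frame is much larger
toy = pd.DataFrame({
    "name": ["Mary", "Mary", "John", "John"],
    "year": [1880, 1881, 1880, 1881],
    "frequency": [70, 60, 90, 80],
})

# Select the numeric column, then apply several aggregations at once
summary = toy.groupby("name")["frequency"].agg(["sum", "mean", "std"])
print(summary)
```

Passing the aggregations as strings keeps the call independent of NumPy and yields the column labels sum, mean, and std.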

Which names occur most frequently?

# Total frequency of each name across all years
print(baby_names.groupby('name').agg({'frequency': 'sum'}))

# James, John, Robert, Michael, Mary... are all familiar names
print(baby_names.groupby('name').agg({'frequency': 'sum'}).sort_values(by='frequency', ascending=False))
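
To pull out only the top few names instead of sorting the whole table, Series.nlargest is one option. The sketch below uses made-up toy data (the names and counts are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy data with the same column names as baby_names
toy = pd.DataFrame({
    "name": ["Mary", "Mary", "John", "John", "Anna", "Anna"],
    "year": [1880, 1881, 1880, 1881, 1880, 1881],
    "frequency": [70, 60, 90, 80, 20, 25],
})

# Sum the frequency per name, then keep the two largest totals
top = toy.groupby("name")["frequency"].sum().nlargest(2)
print(top)
```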

How many boys and girls were born each year?

pandas.DataFrame.pivot_table

# Use pivot_table to total the births per year, with one column per gender
freq_by_gender_year = baby_names.pivot_table(index='year', columns='gender',
                                             values='frequency', aggfunc='sum')
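
For intuition, a pivot_table with a sum aggregation gives the same numbers as a groupby on the index and column keys followed by unstack. A small sketch on made-up toy data (all values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data with the same column names as baby_names
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "Mary", "Anna", "John"],
    "gender": ["F", "F", "M", "F", "F", "M"],
    "year": [1880, 1880, 1880, 1881, 1881, 1881],
    "frequency": [70, 20, 90, 60, 25, 80],
})

# Rows are years, columns are genders, cells are summed frequencies
by_year = toy.pivot_table(index="year", columns="gender",
                          values="frequency", aggfunc="sum")
print(by_year)

# Long-hand equivalent: group by both keys, then move gender into the columns
same = toy.groupby(["year", "gender"])["frequency"].sum().unstack("gender")
print(same)
```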

# Use tail() to view the number of births in the most recent years
print(freq_by_gender_year.tail())

# A single command produces a high-quality plot
freq_by_gender_year.plot(title='Frequency by year and gender')
plt.show()

Naming trend analysis

# Add a 'ranked' column: each name's rank by frequency within its year and gender
baby_names['ranked'] = baby_names.groupby(['year', 'gender'])['frequency'].rank(ascending=False)
print(baby_names.head(10))
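
One detail worth knowing about rank: by default, tied values receive the average of their ranks, so two names tied for first place would each get 1.5 and a filter on rank 1 would find neither. The sketch below, on made-up toy data (an assumption for illustration), compares the default with method='min':

```python
import pandas as pd

# Hypothetical toy data; Mary and Anna are deliberately tied
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "Emma", "John", "Will"],
    "gender": ["F", "F", "F", "M", "M"],
    "year": [1880, 1880, 1880, 1880, 1880],
    "frequency": [70, 70, 20, 90, 80],
})

# Default tie handling: tied rows share the average of their positions
toy["ranked_avg"] = toy.groupby(["year", "gender"])["frequency"].rank(ascending=False)

# method="min": tied rows all receive the best (lowest) position
toy["ranked_min"] = toy.groupby(["year", "gender"])["frequency"].rank(
    ascending=False, method="min")
print(toy)
```

Whether ties actually occur at the top of the real data is not something this sketch checks; it only shows how the two settings differ.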

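# Compute each name's share (in percent) of total births for its year and gender
# transform('sum') broadcasts every group's total back onto its rows
group_total = baby_names.groupby(['year', 'gender'])['frequency'].transform('sum')
baby_names['pct'] = baby_names['frequency'] / group_total * 100
# Inspect the newly added percentage column (pct)
print(baby_names.head())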
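
A quick sanity check for the pct column is that it sums to 100 within every (year, gender) group. The sketch below shows the check on made-up toy data (the values are assumptions), deriving pct with transform('sum'):

```python
import pandas as pd

# Hypothetical toy data with the same column names as baby_names
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "Mary", "Anna", "John"],
    "gender": ["F", "F", "M", "F", "F", "M"],
    "year": [1880, 1880, 1880, 1881, 1881, 1881],
    "frequency": [70, 20, 90, 60, 25, 80],
})

# Total births per (year, gender), aligned row by row with the original frame
totals = toy.groupby(["year", "gender"])["frequency"].transform("sum")
toy["pct"] = toy["frequency"] / totals * 100

# Every group should add up to 100 percent
check = toy.groupby(["year", "gender"])["pct"].sum()
print(check)
```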

Trend in the share held by each year's most popular name

# Script setup; baby_names is the DataFrame used in the steps above
import numpy as np
import matplotlib.pyplot as plt

###
# Naming trend analysis
###
# Add a 'ranked' column: each name's rank by frequency within its year and gender
baby_names['ranked'] = baby_names.groupby(['year', 'gender'])['frequency'].rank(ascending=False)

# Compute each name's share (in percent) of total births for its year and gender;
# transform('sum') broadcasts every group's total back onto its rows
group_total = baby_names.groupby(['year', 'gender'])['frequency'].transform('sum')
baby_names['pct'] = baby_names['frequency'] / group_total * 100

####
# Trend in the share held by each year's most popular name
####

# Split the data into girls and boys
dff = baby_names[baby_names.gender == 'F']
dfm = baby_names[baby_names.gender == 'M']
# Keep the top-ranked name for each year
rank1m = dfm[dfm.ranked == 1]
rank1f = dff[dff.ranked == 1]

# Line plot of the top boys' name share, with a light fill under the curve
plt.plot(rank1m.year, rank1m.pct, color="blue", linewidth=2, label='Boys')
plt.fill_between(rank1m.year, rank1m.pct, color="blue", alpha=0.1, interpolate=True)
plt.xlim(1880, 2012)
plt.ylim(0, 9)
# One tick every 10 years
plt.xticks(np.arange(1880, 2012, 10), rotation=70)
plt.title("Popularity of #1 boys' name by year", size=18, color="blue")
plt.xlabel('Year', size=15)
plt.ylabel('% of male births', size=15)
plt.show()
plt.close()

# Same plot for the top girls' name
plt.plot(rank1f.year, rank1f.pct, color="red", linewidth=2, label='Girls')
plt.fill_between(rank1f.year, rank1f.pct, color="red", alpha=0.1, interpolate=True)
plt.xlim(1880, 2012)
plt.ylim(0, 9)
# One tick every 10 years
plt.xticks(np.arange(1880, 2012, 10), rotation=70)
plt.title("Popularity of #1 girls' name by year", size=18, color="red")
plt.xlabel('Year', size=15)
plt.ylabel('% of female births', size=15)
plt.show()
plt.close()
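
The boys' and girls' plots differ only in the data, the color, and the labels, so the shared steps could be wrapped in one function. The following is a sketch only: plot_top_share is a hypothetical helper name, the toy frame holds made-up numbers, and the Agg backend is selected just so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_top_share(rank1, color, title, ylabel, ax=None):
    """Plot the yearly percentage share of the top-ranked name (hypothetical helper)."""
    if ax is None:
        _, ax = plt.subplots()
    ax.plot(rank1["year"], rank1["pct"], color=color, linewidth=2)
    ax.fill_between(rank1["year"], rank1["pct"], color=color, alpha=0.1)
    ax.set_xlim(1880, 2012)
    ax.set_ylim(0, 9)
    ax.set_xticks(np.arange(1880, 2012, 10))  # one tick every 10 years
    ax.tick_params(axis="x", labelrotation=70)
    ax.set_title(title, size=18, color=color)
    ax.set_xlabel("Year", size=15)
    ax.set_ylabel(ylabel, size=15)
    return ax


# Made-up values standing in for a frame of top-ranked names
toy_rank1 = pd.DataFrame({"year": [1880, 1950, 2012], "pct": [8.0, 5.0, 1.0]})
ax = plot_top_share(toy_rank1, "blue",
                    "Popularity of #1 boys' name by year", "% of male births")
```

With the real data, rank1m and rank1f would be passed in place of the toy frame, followed by plt.show().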

References

pandas tutorial: multiple aggregations on groups with agg
pandas 0.21.1 documentation

白鲸鱼
