Is there a way to do this? I can't seem to find an easy way to plot a CDF from a pandas Series.

Original post by wolfsatthedoor; translation licensed under CC BY-SA 4.0

python pandas series cdf

2 Answers

Community wiki

Posted on
2023-01-03

✓ Accepted

If you are also interested in the values, not just the plot:

import numpy as np  # needed for the continuous examples further down
import pandas as pd

# If you are in Jupyter
%matplotlib inline

This will always work (discrete and continuous distributions)

 # Define your series
s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
df = pd.DataFrame(s)

 # Get the frequency, PDF and CDF for each value in the series

# Frequency
stats_df = df \
.groupby('value') \
['value'] \
.agg('count') \
.pipe(pd.DataFrame) \
.rename(columns = {'value': 'frequency'})

# PDF
stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])

# CDF
stats_df['cdf'] = stats_df['pdf'].cumsum()
stats_df = stats_df.reset_index()
stats_df

# Plot the discrete Probability Mass Function and CDF.
# Technically, the 'pdf' label in the legend and in the table should be 'pmf'
# (Probability Mass Function), since the distribution is discrete.

# If you don't have too many values (usually the discrete case)
stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)
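
The same table can be built more compactly. The following is a minimal sketch, assuming `Series.value_counts()` is an acceptable substitute for the `groupby()` + `agg('count')` chain above; it is meant to produce the same `frequency`, `pdf` and `cdf` columns, not to replace the approach shown.

```python
import pandas as pd

s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name="value")

# Count occurrences of each distinct value, ordered by the value itself
frequency = s.value_counts().sort_index()

stats_df = pd.DataFrame({
    "value": frequency.index,
    "frequency": frequency.to_numpy(),
})

# Relative frequency (a PMF here, since the data are discrete)
stats_df["pdf"] = stats_df["frequency"] / stats_df["frequency"].sum()

# Running total of the relative frequencies
stats_df["cdf"] = stats_df["pdf"].cumsum()

print(stats_df)
```

The resulting `stats_df` can be passed to the same `plot.bar` call as above.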

Alternative example, for a sample drawn from a continuous distribution or for data with many distinct values:

 # Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')

 # ... all the same calculation stuff to get the frequency, PDF, CDF

 # Plot
stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
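
For the continuous case, the calculation can be sketched end to end as follows. This assumes `value_counts(normalize=True)` is used to get the relative frequencies in one step, and it uses a seeded NumPy generator only so the sample is reproducible; both choices are illustrative.

```python
import numpy as np
import pandas as pd

# Seeded generator, chosen only so the example is reproducible
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=10, scale=0.1, size=1000), name="value")

# Relative frequency of each distinct value, ordered by value
pdf = s.value_counts(normalize=True).sort_index()

# Cumulative sum of the relative frequencies gives the empirical CDF
cdf = pdf.cumsum()

stats_df = pd.DataFrame({
    "value": pdf.index,
    "pdf": pdf.to_numpy(),
    "cdf": cdf.to_numpy(),
})
```

With (almost) every value unique, the `pdf` column is flat at 1/n and only the `cdf` curve is informative in the line plot.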

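Only works for continuous distributions

Note that if it is reasonable to assume each value occurs only once in the sample (typical for continuous distributions), then groupby() + agg('count') is unnecessary, since every count is 1.

In that case, a percent rank can be used to get the CDF directly.

Use your best judgment when taking this kind of shortcut! :)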

 # Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
df = pd.DataFrame(s)

# Get to the CDF directly (rank the 'value' column explicitly,
# so this keeps working if df gains more columns)
df['cdf'] = df['value'].rank(method = 'average', pct = True)

 # Sort and plot
df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)
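
To see why the shortcut needs judgment, the sketch below applies it to the discrete series from the first example, where values repeat. It compares the percent rank against a reference CDF computed from relative frequencies. The column names are illustrative, and `method='max'` is shown only as a point of comparison, not as part of the approach above.

```python
import pandas as pd

s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name="value")

# Reference: empirical CDF per distinct value, P(X <= x)
ecdf = s.value_counts(normalize=True).sort_index().cumsum()

comparison = pd.DataFrame({
    "value": s,
    "ecdf": s.map(ecdf),                                   # look up the CDF for each observation
    "rank_average": s.rank(method="average", pct=True),    # ties share their mean rank
    "rank_max": s.rank(method="max", pct=True),            # ties share their highest rank
})

print(comparison.sort_values("value").drop_duplicates())
```

With ties, `method='average'` places the value 5 at 5/11 while the CDF there is 7/11; `method='max'` matches the CDF. When every value is unique the two methods agree.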

Original post by Raphvanns; translation licensed under CC BY-SA 4.0

Community wiki

Posted on
2023-01-03

I believe the functionality you're looking for is in the hist method of a Series object, which wraps matplotlib's hist() function.

Here's the relevant documentation:

In [10]: import matplotlib.pyplot as plt

In [11]: plt.hist?
...
Plot a histogram.

Compute and draw the histogram of *x*. The return value is a
tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
[*patches0*, *patches1*,...]) if the input contains multiple
data.
...
cumulative : boolean, optional, default : False
    If `True`, then a histogram is computed where each bin gives the
    counts in that bin plus all bins for smaller values. The last bin
    gives the total number of datapoints.  If `normed` is also `True`
    then the histogram is normalized such that the last bin equals 1.
    If `cumulative` evaluates to less than 0 (e.g., -1), the direction
    of accumulation is reversed.  In this case, if `normed` is also
    `True`, then the histogram is normalized such that the first bin
    equals 1.

...

For example:

In [12]: import pandas as pd

In [13]: import numpy as np

In [14]: ser = pd.Series(np.random.normal(size=1000))

In [15]: ser.hist(cumulative=True, density=1, bins=100)
Out[15]: <matplotlib.axes.AxesSubplot at 0x11469a590>

In [16]: plt.show()
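
A cumulative histogram depends on the choice of `bins`. As a bin-free alternative, the sketch below computes the empirical CDF directly by sorting the sample; the seeded generator and the name `ecdf` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator, chosen only so the example is reproducible
rng = np.random.default_rng(0)
ser = pd.Series(rng.normal(size=1000))

# Sort the observations; the i-th smallest value has CDF i / n
x = np.sort(ser.to_numpy())
y = np.arange(1, len(x) + 1) / len(x)

# Index by the sorted values so the Series plots value against CDF
ecdf = pd.Series(y, index=x, name="cdf")
```

Calling `ecdf.plot(drawstyle="steps-post", grid=True)` should then draw the step curve.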

Original post by Dan Frank; translation licensed under CC BY-SA 4.0