Sampling n=2000 from a Dask DataFrame of length 18000 raises the error "Cannot take a larger sample than population when 'replace=False'"

Posted on
2023-01-10


I have a Dask DataFrame created from a CSV file. len(daskdf) returns 18000, but when I run ddSample = daskdf.sample(2000) I get this error:

ValueError: Cannot take a larger sample than population when 'replace=False'

Can I sample without replacement when the DataFrame is larger than the sample size?

Original post by mobcdi; translation licensed under CC BY-SA 4.0

python dask


2 answers

Community wiki

Posted on
2023-01-10

✓ Accepted

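The sample method only supports the frac= keyword argument. See the API documentation.

The error you are seeing comes from pandas, not Dask: a fraction greater than 1 cannot be drawn without replacement, as this reproduction shows.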

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1]})
In [3]: df.sample(frac=2000, replace=False)
ValueError: Cannot take a larger sample than population when 'replace=False'

Solution 1

As the pandas error suggests, consider sampling with replacement:

In [4]: df.sample(frac=2, replace=True)
Out[4]:
   x
0  1
0  1

In [5]: import dask.dataframe as dd
In [6]: ddf = dd.from_pandas(df, npartitions=1)
In [7]: ddf.sample(frac=2, replace=True).compute()
Out[7]:
   x
0  1
0  1
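
Sampling with replacement avoids the error, but it also changes what the sample contains. The following pandas-only sketch (the ten-row frame and the names without and with_repl are made up for illustration) contrasts the two modes: without replacement every sampled row is distinct, while with replacement the same row can appear more than once.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# Without replacement: every sampled row is distinct, so frac must be <= 1.
without = df.sample(frac=0.5, replace=False, random_state=0)

# With replacement: frac may exceed 1, but rows are repeated.
with_repl = df.sample(frac=2, replace=True, random_state=0)

print(len(without), without.index.is_unique)      # 5 True
print(len(with_repl), with_repl.index.is_unique)  # 20 False
```

If duplicate rows are not acceptable, as in the original question, keeping replace=False with a fraction of at most 1 is the safer choice.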

Solution 2

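This may help someone.

I found this somewhere and cannot remember where.

It returns the result without raising the error. (This is for pandas; I don't know about Dask.)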

import pandas as pd

df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
                   'b': [1,1,1,2,2,3,3]})

# Fixed sample size: raises an error if any group has fewer rows than n
df.groupby('b').apply(pd.DataFrame.sample, n=1)

# Capped with min(): no error; returns up to 3 rows per group
df.groupby(['b'], as_index=False, group_keys=False
          ).apply(
            lambda x: x.sample(min(3, len(x)))
        )
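
The same capping idea can be written as a small helper that avoids groupby().apply. This is only a sketch: sample_up_to and its parameters are hypothetical names, not part of pandas, and it assumes the frame fits in memory.

```python
import pandas as pd

def sample_up_to(df, by, n, random_state=None):
    """Sample at most n rows per group, without replacement."""
    parts = [
        # min() keeps the request within the size of each group.
        group.sample(n=min(n, len(group)), random_state=random_state)
        for _, group in df.groupby(by)
    ]
    return pd.concat(parts)

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 2, 2, 3, 3]})

capped = sample_up_to(df, 'b', 3, random_state=0)
print(capped.groupby('b').size().tolist())  # [3, 2, 2]
```

Groups with fewer than n rows are returned whole, so no group triggers the "larger sample than population" error.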

Original post by MRocklin; translation licensed under CC BY-SA 4.0

Community wiki

Posted on
2023-01-10

I found this somewhere and cannot remember where.

It returns the result without raising the error. (This is for pandas; I don't know about Dask.)

import pandas as pd

df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
                   'b': [1,1,1,2,2,3,3]})

# Fixed sample size: raises an error if any group has fewer rows than n
df.groupby('b').apply(pd.DataFrame.sample, n=1)

# Capped with min(): no error; returns up to 3 rows per group
df.groupby(['b'], as_index=False, group_keys=False
          ).apply(
            lambda x: x.sample(min(3, len(x)))
        )

Original post by ihightower; translation licensed under CC BY-SA 4.0
