Split a large pandas DataFrame


I have a large DataFrame with 423244 rows and want to split it into 4 parts. I tried the code below, but it raised an error: ValueError: array split does not result in an equal division

for item in np.split(df, 4):
    print(item)

How can I split this DataFrame into 4 groups?

Originally posted by Nilani Algiriyage; translation licensed under CC BY-SA 4.0

2 answers

Use np.array_split:

 Docstring:
Split an array into multiple sub-arrays.

Please refer to the ``split`` documentation.  The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.
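The difference the docstring describes can be seen on a plain NumPy array. This is a minimal sketch; the 10-element array and the variable names are illustrative and not taken from the question.

```python
import numpy as np

values = np.arange(10)

# np.split insists on equal-sized sections, so 10 elements into 4 fails.
try:
    np.split(values, 4)
    split_error = None
except ValueError as exc:
    split_error = str(exc)
print(split_error)

# np.array_split takes the same arguments but lets section sizes differ by one.
pieces = np.array_split(values, 4)
print([piece.tolist() for piece in pieces])
# [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```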

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ...:                          'foo', 'bar', 'foo', 'foo'],
   ...:                    'B': ['one', 'one', 'two', 'three',
   ...:                          'two', 'two', 'one', 'three'],
   ...:                    'C': np.random.randn(8),
   ...:                    'D': np.random.randn(8)})

In [4]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468

In [5]: np.array_split(df, 3)
Out[5]:
[     A    B         C         D
0  foo  one -0.174067 -0.608579
1  bar  one -0.860386 -1.210518
2  foo  two  0.614102  1.689837,
      A      B         C         D
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861,
      A      B         C         D
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468]
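
Some recent pandas releases are reported to emit a deprecation warning when a DataFrame is passed directly to np.array_split. One way around that, sketched here under that assumption, is to split the row positions and select each group with iloc. The helper name split_into_n and the 10-row frame are hypothetical.

```python
import numpy as np
import pandas as pd


def split_into_n(df, n):
    """Split df into n consecutive parts whose lengths differ by at most one row."""
    # Split the row positions rather than the DataFrame itself.
    position_groups = np.array_split(np.arange(len(df)), n)
    return [df.iloc[positions] for positions in position_groups]


# Hypothetical 10-row frame standing in for the real data.
df = pd.DataFrame({"A": range(10), "B": list("abcdefghij")})
parts = split_into_n(df, 4)
print([len(part) for part in parts])  # [3, 3, 2, 2]
```

Each part keeps its original index labels, so pd.concat(parts) gives back the original frame.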

Originally posted by root; translation licensed under CC BY-SA 4.0

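I wanted to do the same thing. I first ran into problems with the split function, and then with installing pandas 0.15.2, so I went back to my old version and wrote a small function that works well. I hope this helps!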

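# input   - df: a DataFrame, chunk_size: the maximum number of rows per chunk
# output  - a list of DataFrames
# purpose - splits the DataFrame into smaller chunks
def split_dataframe(df, chunk_size=10000):
    chunks = []
    # Ceiling division, so an exact multiple of chunk_size adds no empty chunk
    num_chunks = -(-len(df) // chunk_size)
    for i in range(num_chunks):
        # iloc slices by row position regardless of the index labels
        chunks.append(df.iloc[i * chunk_size:(i + 1) * chunk_size])
    return chunks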
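
The same fixed-size chunking can also be written as a generator that steps through the start offsets, which avoids computing the number of chunks up front. This is a minimal sketch; iter_chunks and the 25-row frame are hypothetical names used only for illustration.

```python
import pandas as pd


def iter_chunks(df, chunk_size=10000):
    """Yield consecutive slices of df with at most chunk_size rows each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Step through the start offsets; the last slice may be shorter.
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


df = pd.DataFrame({"x": range(25)})
sizes = [len(chunk) for chunk in iter_chunks(df, chunk_size=10)]
print(sizes)  # [10, 10, 5]
```

Because range stops before len(df), a frame whose length is an exact multiple of chunk_size produces no empty trailing chunk.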

Originally posted by elixir; translation licensed under CC BY-SA 4.0
