How do I randomly split a DataFrame into several smaller DataFrames?


I can't work out how to randomly split the DataFrame df into several smaller group DataFrames.

 df
  movie_id  1   2   4   5   6   7   8   9   10  11  12  borda
0   1       5   4   0   4   4   0   0   0   4   0   0   21
1   2       3   0   0   3   0   0   0   0   0   0   0   6
2   3       4   0   0   0   0   0   0   0   0   0   0   4
3   4       3   0   0   0   0   5   0   0   4   0   5   17
4   5       3   0   0   0   0   0   0   0   0   0   0   3
5   6       5   0   0   0   0   0   0   5   0   0   0   10
6   7       4   0   0   0   2   5   3   4   4   0   0   22
7   8       1   0   0   0   4   5   0   0   0   4   0   14
8   9       5   0   0   0   4   5   0   0   4   5   0   23
9   10      3   2   0   0   0   4   0   0   0   0   0   9
10  11      2   0   4   0   0   3   3   0   4   2   0   18
11  12      5   0   0   0   4   5   0   0   5   2   0   21
12  13      5   4   0   0   2   0   0   0   3   0   0   14
13  14      5   4   0   0   5   0   0   0   0   0   0   14
14  15      5   0   0   0   3   0   0   0   0   5   5   18
15  16      5   0   0   0   0   0   0   0   4   0   0   9
16  17      3   0   0   4   0   0   0   0   0   0   0   7
17  18      4   0   0   0   0   0   0   0   0   0   0   4
18  19      5   3   0   0   4   0   0   0   0   0   0   12
19  20      4   0   0   0   0   0   0   0   0   0   0   4
20  21      1   0   0   3   3   0   0   0   0   0   0   7
21  22      4   0   0   0   3   5   5   0   5   4   0   26
22  23      4   0   0   0   4   3   0   0   5   0   0   16
23  24      3   0   0   4   0   0   0   0   0   3   0   10

I tried sample and arange, but the results were not what I wanted.

ran1 = df.sample(frac=0.2, replace=False, random_state=1)
ran2 = df.sample(frac=0.2, replace=False, random_state=1)
ran3 = df.sample(frac=0.2, replace=False, random_state=1)
ran4 = df.sample(frac=0.2, replace=False, random_state=1)
ran5 = df.sample(frac=0.2, replace=False, random_state=1)

print(ran1, '\n')
print(ran2, '\n')
print(ran3, '\n')
print(ran4, '\n')
print(ran5, '\n')

The result is 5 identical DataFrames:

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
13    14     5  4  0  0  5  0  0  0   0   0   0     14
18    19     5  3  0  0  4  0  0  0   0   0   0     12
3     4      3  0  0  0  0  5  0  0   4   0   5     17
14    15     5  0  0  0  3  0  0  0   0   5   5     18
20    21     1  0  0  3  3  0  0  0   0   0   0      7

I also tried:

g = df.groupby(['movie_id'])
h = np.arange(g.ngroups)
np.random.shuffle(h)

df[g.ngroup().isin(h[:6])]

Output:

     movie_id    1   2   4   5   6   7   8   9   10  11  12  borda
4      5        3   0   0   0   0   0   0   0   0   0   0   3
6      7        4   0   0   0   2   5   3   4   4   0   0   22
7      8        1   0   0   0   4   5   0   0   0   4   0   14
16     17       3   0   0   4   0   0   0   0   0   0   0   7
17     18       4   0   0   0   0   0   0   0   0   0   0   4
18     19       5   3   0   0   4   0   0   0   0   0   0   12

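But this still yields only one smaller group; the remaining rows of df are not assigned to any group.

I want the smaller groups to be evenly sized, specified as a percentage, and the whole of df should be divided among them.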

Originally posted by Jerry Chen; translation licensed under CC BY-SA 4.0.

2 answers

Use np.array_split:

shuffled = df.sample(frac=1)
result = np.array_split(shuffled, 5)

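df.sample(frac=1) shuffles the rows of df. np.array_split then splits the shuffled frame into parts of nearly equal size: with 24 rows and 5 parts, four parts get 5 rows and the last gets 4.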

This gives you:

for part in result:
    print(part, '\n')

     movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
5          6  5  0  0  0  0  0  0  5   0   0   0     10
4          5  3  0  0  0  0  0  0  0   0   0   0      3
7          8  1  0  0  0  4  5  0  0   0   4   0     14
16        17  3  0  0  4  0  0  0  0   0   0   0      7
22        23  4  0  0  0  4  3  0  0   5   0   0     16

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
13        14  5  4  0  0  5  0  0  0   0   0   0     14
14        15  5  0  0  0  3  0  0  0   0   5   5     18
21        22  4  0  0  0  3  5  5  0   5   4   0     26
1          2  3  0  0  3  0  0  0  0   0   0   0      6
20        21  1  0  0  3  3  0  0  0   0   0   0      7

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
10        11  2  0  4  0  0  3  3  0   4   2   0     18
9         10  3  2  0  0  0  4  0  0   0   0   0      9
11        12  5  0  0  0  4  5  0  0   5   2   0     21
8          9  5  0  0  0  4  5  0  0   4   5   0     23
12        13  5  4  0  0  2  0  0  0   3   0   0     14

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
18        19  5  3  0  0  4  0  0  0   0   0   0     12
3          4  3  0  0  0  0  5  0  0   4   0   5     17
0          1  5  4  0  4  4  0  0  0   4   0   0     21
23        24  3  0  0  4  0  0  0  0   0   3   0     10
6          7  4  0  0  0  2  5  3  4   4   0   0     22

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
17        18  4  0  0  0  0  0  0  0   0   0   0      4
2          3  4  0  0  0  0  0  0  0   0   0   0      4
15        16  5  0  0  0  0  0  0  0   4   0   0      9
19        20  4  0  0  0  0  0  0  0   0   0   0      4
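
For reference, the five samples in the question are identical because every call passes random_state=1, so each call draws the same rows. Separate calls to sample would also overlap even with different seeds, since each one draws from the full df. Below is a minimal, reproducible sketch of the shuffle-then-split idea. The helper name split_random, the seed argument, and the stand-in data are assumptions for illustration, not part of the original answer. It splits shuffled row positions rather than the DataFrame itself, which sidesteps passing a DataFrame directly to np.array_split (some recent pandas versions appear to warn about that).

```python
import numpy as np
import pandas as pd


def split_random(df, n_parts, seed=None):
    """Shuffle the rows of df and split them into n_parts disjoint DataFrames."""
    rng = np.random.default_rng(seed)
    # Random permutation of the row positions 0..len(df)-1
    positions = rng.permutation(len(df))
    # array_split tolerates uneven division: part sizes differ by at most one row
    return [df.iloc[chunk] for chunk in np.array_split(positions, n_parts)]


# Stand-in data with the same number of rows as the question's df (hypothetical values)
df = pd.DataFrame({"movie_id": np.arange(1, 25), "borda": np.arange(24) % 7})
parts = split_random(df, 5, seed=0)
sizes = [len(p) for p in parts]
print(sizes)  # [5, 5, 5, 5, 4]
```

Every row lands in exactly one part, and passing the same seed reproduces the same split.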

Originally posted by Dawei; translation licensed under CC BY-SA 4.0.

A simple demo:

import numpy as np
import pandas as pd

df = pd.DataFrame({"movie_id": np.arange(1, 25),
                   "borda": np.random.randint(1, 25, size=(24,))})
n_split = 5
# the indices used to select parts from dataframe
ixs = np.arange(df.shape[0])
np.random.shuffle(ixs)
# np.split cannot work when there is no equal division
# so we need to find out the split points ourself
# we need (n_split-1) split points
split_points = [i*df.shape[0]//n_split for i in range(1, n_split)]
# use these indices to select the part we want
for ix in np.split(ixs, split_points):
    print(df.iloc[ix])

Result:

     borda  movie_id
8       3         9
10      2        11
22     14        23
7      14         8

    borda  movie_id
0      16         1
20      4        21
17     15        18
15      1        16
6       6         7

    borda  movie_id
9       9        10
19      4        20
5       1         6
16     23        17
21     20        22

    borda  movie_id
11     24        12
23      5        24
1      22         2
12      7        13
18     15        19

    borda  movie_id
3      11         4
14     10        15
2       6         3
4       7         5
13     21        14
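
Since the question asks for groups sized by percentage, the split-point approach above can be extended to unequal fractions. This is only a sketch under stated assumptions: the function name split_by_fractions, the seed argument, and the rounding rule for split points are illustrative choices, not something from the original answer.

```python
import numpy as np
import pandas as pd


def split_by_fractions(df, fractions, seed=None):
    """Randomly split df into disjoint parts sized roughly by the given fractions."""
    fractions = np.asarray(fractions, dtype=float)
    if not np.isclose(fractions.sum(), 1.0):
        raise ValueError("fractions must sum to 1")
    rng = np.random.default_rng(seed)
    # Shuffled row positions, as in the demo above
    ixs = rng.permutation(len(df))
    # Cumulative fractions give the split points; the final one (== len(df)) is dropped
    split_points = np.round(np.cumsum(fractions)[:-1] * len(df)).astype(int)
    return [df.iloc[ix] for ix in np.split(ixs, split_points)]


# Stand-in data with 24 rows (hypothetical)
df = pd.DataFrame({"movie_id": np.arange(1, 25)})
parts = split_by_fractions(df, [0.5, 0.25, 0.25], seed=1)
print([len(p) for p in parts])  # [12, 6, 6]
```

When a fraction does not divide the row count evenly, the rounding decides which parts receive the extra rows, so the sizes are only approximately proportional.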

Originally posted by keineahnung2345; translation licensed under CC BY-SA 4.0.
