Pandas advanced tutorial: GroupBy usage

Introduction

The DF data type in pandas can be groupby operated like a database table. Generally speaking, the groupby operation can be divided into three parts: split data, apply transformation and merge data.

This article will explain in detail the groupby operation in Pandas.

Split data

The purpose of dividing data is to divide DF into groups. In order to perform groupby operations, you need to specify the corresponding label when creating the DF:

df = pd.DataFrame(
   ...:     {
   ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
   ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
   ...:         "C": np.random.randn(8),
   ...:         "D": np.random.randn(8),
   ...:     }
   ...: )
   ...:

df
Out[61]: 
     A      B         C         D
0  foo    one -0.490565 -0.233106
1  bar    one  0.430089  1.040789
2  foo    two  0.653449 -1.155530
3  bar  three -0.610380 -0.447735
4  foo    two -0.934961  0.256358
5  bar    two -0.256263 -0.661954
6  foo    one -1.132186 -0.304330
7  foo  three  2.129757  0.445744

By default, the axis of groupby is the x axis. There can be one group or multiple groups:

In [8]: grouped = df.groupby("A")

In [9]: grouped = df.groupby(["A", "B"])

Multi index

In the 0.24 version, if we have multiple indexes, we can select a specific index to group:

In [10]: df2 = df.set_index(["A", "B"])

In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))

In [12]: grouped.sum()
Out[12]: 
            C         D
A                      
bar -1.591710 -1.739537
foo -0.752861 -1.402938

get_group

get_group can get the data after grouping:

In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

In [25]: df3.groupby(["X"]).get_group("A")
Out[25]: 
   X  Y
0  A  1
2  A  3

In [26]: df3.groupby(["X"]).get_group("B")
Out[26]: 
   X  Y
1  B  4
3  B  2

dropna

By default, NaN data will be excluded from groupby. NaN data can be allowed by setting dropna=False:

In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]

In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

In [29]: df_dropna
Out[29]: 
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2

# Default ``dropna`` is set to True, which will exclude NaNs in keys
In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]: 
     a  c
b        
1.0  2  3
2.0  2  5

# In order to allow NaN in keys, set ``dropna`` to False
In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]: 
     a  c
b        
1.0  2  3
2.0  2  5
NaN  1  4

groups attribute

The groupby object has a groups attribute, which is a key-value dictionary, the key is the data used for classification, and the value is the value corresponding to the classification.

In [34]: grouped = df.groupby(["A", "B"])

In [35]: grouped.groups
Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}

In [36]: len(grouped)
Out[36]: 6

index level

For multi-level index objects, groupby can specify the index level of the group:

In [40]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]
   ....: 

In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [42]: s = pd.Series(np.random.randn(8), index=index)

In [43]: s
Out[43]: 
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64

Group first level:

In [44]: grouped = s.groupby(level=0)

In [45]: grouped.sum()
Out[45]: 
first
bar   -0.962232
baz    1.237723
foo    0.785980
qux    1.911055
dtype: float64

Group second level:

In [46]: s.groupby(level="second").sum()
Out[46]: 
second
one    0.980950
two    1.991575
dtype: float64

group traversal

After getting the group object, we can traverse the group through the for statement:

In [62]: grouped = df.groupby('A')

In [63]: for name, group in grouped:
   ....:     print(name)
   ....:     print(group)
   ....: 
bar
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526
foo
     A      B         C         D
0  foo    one -0.575247  1.346061
2  foo    two -1.143704  1.627081
4  foo    two  1.193555 -0.441652
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

If it is a multi-field group, the name of the group is a tuple:

In [64]: for name, group in df.groupby(['A', 'B']):
   ....:     print(name)
   ....:     print(group)
   ....: 
('bar', 'one')
     A    B         C         D
1  bar  one  0.254161  1.511763
('bar', 'three')
     A      B         C         D
3  bar  three  0.215897 -0.990582
('bar', 'two')
     A    B         C         D
5  bar  two -0.077118  1.211526
('foo', 'one')
     A    B         C         D
0  foo  one -0.575247  1.346061
6  foo  one -0.408530  0.268520
('foo', 'three')
     A      B         C        D
7  foo  three -0.862495  0.02458
('foo', 'two')
     A    B         C         D
2  foo  two -1.143704  1.627081
4  foo  two  1.193555 -0.441652

Aggregation operation

After grouping, you can perform aggregation operations:

In [67]: grouped = df.groupby("A")

In [68]: grouped.aggregate(np.sum)
Out[68]: 
            C         D
A                      
bar  0.392940  1.732707
foo -1.796421  2.824590

In [69]: grouped = df.groupby(["A", "B"])

In [70]: grouped.aggregate(np.sum)
Out[70]: 
                  C         D
A   B                        
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429

For multi-index data, the default return value is also multi-index. If you want to use a new index, you can add as_index = False:

In [71]: grouped = df.groupby(["A", "B"], as_index=False)

In [72]: grouped.aggregate(np.sum)
Out[72]: 
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

In [73]: df.groupby("A", as_index=False).sum()
Out[73]: 
     A         C         D
0  bar  0.392940  1.732707
1  foo -1.796421  2.824590

The above effect is equivalent to reset_index

In [74]: df.groupby(["A", "B"]).sum().reset_index()

grouped.size() calculates the size of the group:

In [75]: grouped.size()
Out[75]: 
     A      B  size
0  bar    one     1
1  bar  three     1
2  bar    two     1
3  foo    one     2
4  foo  three     1
5  foo    two     2

grouped.describe() describes the information of the group:

In [76]: grouped.describe()
Out[76]: 
      C                                                    ...         D                                                  
  count      mean       std       min       25%       50%  ...       std       min       25%       50%       75%       max
0   1.0  0.254161       NaN  0.254161  0.254161  0.254161  ...       NaN  1.511763  1.511763  1.511763  1.511763  1.511763
1   1.0  0.215897       NaN  0.215897  0.215897  0.215897  ...       NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
2   1.0 -0.077118       NaN -0.077118 -0.077118 -0.077118  ...       NaN  1.211526  1.211526  1.211526  1.211526  1.211526
3   2.0 -0.491888  0.117887 -0.575247 -0.533567 -0.491888  ...  0.761937  0.268520  0.537905  0.807291  1.076676  1.346061
4   1.0 -0.862495       NaN -0.862495 -0.862495 -0.862495  ...       NaN  0.024580  0.024580  0.024580  0.024580  0.024580
5   2.0  0.024925  1.652692 -1.143704 -0.559389  0.024925  ...  1.462816 -0.441652  0.075531  0.592714  1.109898  1.627081

[6 rows x 16 columns]

General polymerization method

The following are general aggregation methods:

function	description
`mean()`	average value
`sum()`	Sum
`size()`	Calculate size
`count()`	group statistics
`std()`	Standard deviation
`var()`	variance
`sem()`	Standard error of the mean
`describe()`	Statistics description
`first()`	The first group value
`last()`	The last group value
`nth()`	Nth group value
`min()`	Minimum
`max()`	Max

Use multiple aggregation methods at the same time

You can specify multiple aggregation methods at the same time:

In [81]: grouped = df.groupby("A")

In [82]: grouped["C"].agg([np.sum, np.mean, np.std])
Out[82]: 
          sum      mean       std
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

Can be renamed:

In [84]: (
   ....:     grouped["C"]
   ....:     .agg([np.sum, np.mean, np.std])
   ....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
   ....: )
   ....: 
Out[84]: 
          foo       bar       baz
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

NamedAgg

NamedAgg can define the aggregation more precisely. It contains two customized fields, column and aggfunc.

In [88]: animals = pd.DataFrame(
   ....:     {
   ....:         "kind": ["cat", "dog", "cat", "dog"],
   ....:         "height": [9.1, 6.0, 9.5, 34.0],
   ....:         "weight": [7.9, 7.5, 9.9, 198.0],
   ....:     }
   ....: )
   ....: 

In [89]: animals
Out[89]: 
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [90]: animals.groupby("kind").agg(
   ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
   ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
   ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
   ....: )
   ....: 
Out[90]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

Or use a tuple directly:

In [91]: animals.groupby("kind").agg(
   ....:     min_height=("height", "min"),
   ....:     max_height=("height", "max"),
   ....:     average_weight=("weight", np.mean),
   ....: )
   ....: 
Out[91]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

Different columns specify different aggregation methods

By passing a dictionary to the agg method, you can specify different columns to use different aggregations:

In [95]: grouped.agg({"C": "sum", "D": "std"})
Out[95]: 
            C         D
A                      
bar  0.392940  1.366330
foo -1.796421  0.884785

Conversion operation

Conversion is the operation of converting an object into an object of the same size. In the process of data analysis, data conversion operations are often required.

You can access lambda operations:

In [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

Fill in the na value:

In [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))

Filter operation

The filter method can filter the data we don't need through lambda expressions:

In [136]: sf = pd.Series([1, 1, 2, 3, 3, 3])

In [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[137]: 
3    3
4    3
5    3
dtype: int64

Apply operation

Some data may not be suitable for aggregation or conversion operations. Pandas provides a apply method for more flexible conversion operations.

In [156]: df
Out[156]: 
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

In [157]: grouped = df.groupby("A")

# could also just call .describe()
In [158]: grouped["C"].apply(lambda x: x.describe())
Out[158]: 
A         
bar  count    3.000000
     mean     0.130980
     std      0.181231
     min     -0.077118
     25%      0.069390
                ...   
foo  min     -1.143704
     25%     -0.862495
     50%     -0.575247
     75%     -0.408530
     max      1.193555
Name: C, Length: 16, dtype: float64

You can add functions:

In [159]: grouped = df.groupby('A')['C']

In [160]: def f(group):
   .....:     return pd.DataFrame({'original': group,
   .....:                          'demeaned': group - group.mean()})
   .....: 

In [161]: grouped.apply(f)
Out[161]: 
   original  demeaned
0 -0.575247 -0.215962
1  0.254161  0.123181
2 -1.143704 -0.784420
3  0.215897  0.084917
4  1.193555  1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211

This article has been included in http://www.flydean.com/11-python-pandas-groupby/
The most popular interpretation, the most profound dry goods, the most concise tutorial, and many tips you don't know are waiting for you to discover!

Pandas advanced tutorial: GroupBy usage

Introduction

Split data

Multi index

get_group

dropna

groups attribute

index level

group traversal

Aggregation operation

General polymerization method

Use multiple aggregation methods at the same time

NamedAgg

Different columns specify different aggregation methods

Conversion operation

Filter operation

Apply operation

flydean

引用和评论

在stable diffussion中完美修复AI图片

Anaconda安装教程以及Anaconda和pip配置国内镜像

科学计算编程涉及到的技术栈简介

使用 chardet 判断文件编码需要注意的坑——过大的文件会导致高耗时

Python3 格式化时间（qbit）

manus 的替代品有哪些？使用LLM大模型技术做手机/网页/浏览器自动化操作技术汇总

怎么判断自己下载的 trae 是国际版还是国内版？