pandas

2. pandas

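2.1 Constructing and loading data

2.1.1 Constructing a DataFrame

Empty DataFrame
Use the columns and index parameters to set the column names and the row index of the DataFrame.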

import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['c1', 'c2'], index=[1, 2])
# output:
    c1    c2
1    NaN    NaN
2    NaN    NaN

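DataFrame from a dict
Build a DataFrame from a dict: each key becomes a column name and each value supplies that column's data. The index=[] parameter sets the index of the df.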

dict_v = {'c1': ['a', 'b', 'c'],
              'c2': [1, 2, 3]}
df1 = pd.DataFrame(dict_v)
# output:
c1    c2
0    a    1
1    b    2
2    c    3
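As a small illustration of the index parameter mentioned above, the sketch below rebuilds the same dict with row labels and then adds a column by assignment. The labels r1..r3 and the column c3 are made up for this example.

```python
import pandas as pd

dict_v = {'c1': ['a', 'b', 'c'], 'c2': [1, 2, 3]}

# Hypothetical row labels; the list length must match the number of rows.
df_indexed = pd.DataFrame(dict_v, index=['r1', 'r2', 'r3'])

# Add a new column by assigning to a column name that does not exist yet.
df_indexed['c3'] = df_indexed['c2'] * 10
print(df_indexed)
```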

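2.1.2 describe()

Summarizes the DataFrame as a whole: row count, mean, standard deviation, quantiles, minimum, maximum, and so on.

Parameters

Parameter	Meaning	Accepted values	Default
percentiles	Quantiles to include in the output	list of numbers between 0 and 1, e.g. percentiles=[0,0.2,0.4,0.5,0.6,0.8]	[.25, .5, .75]
include	Whitelist of data types to include in the output	'all', np.number, object, 'category'	None
exclude	Blacklist of data types to leave out of the output	np.number, object, 'category'	None

Output

Statistic	Meaning
count	Number of non-null values
unique	Number of distinct values (only for string/object or categorical columns)
top	Most frequent value (only for string/object or categorical columns)
freq	Number of occurrences of the most frequent value (only for string/object or categorical columns)
mean, std	Mean and standard deviation (numeric columns only)
min, 25%, 50%, 75%, max	Minimum, quantiles, and maximum (numeric columns only)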

# df_bak2
asset_name    asset_id    bond_period    rate    buying_time    amount
0    Govtbond    1    5    0.0315    2019-12-31    30.0
1    Govtbond    2    10    0.0355    2018-06-30    30.0
2    Govtbond    3    15    0.0410    2015-09-30    50.0
3    Finbond    4    5    0.0330    2019-12-31    18.0
4    Finbond    5    10    0.0365    2018-06-30    21.0
5    CorpbondAAA    6    3    0.0380    2018-06-30    30.0
6    CorpbondAAA    7    5    0.0400    2020-06-30    8.0
7    CorpbondAAA    8    10    0.0490    2020-06-30    19.0
8    CorpbondAA    9    1    0.0480    2020-06-30    19.0
9    CorpbondAA    10    5    0.0510    2020-06-30    12.0

df_bak2.describe(include='all')
# output:
asset_name    asset_id    bond_period    rate    buying_time    amount
count    10    10.00000    10.000000    10.00000    10    10.000000
unique    4    NaN    NaN    NaN    4    NaN
top    CorpbondAAA    NaN    NaN    NaN    2020-06-30    NaN
freq    3    NaN    NaN    NaN    4    NaN
mean    NaN    5.50000    6.900000    0.04035    NaN    23.700000
std    NaN    3.02765    4.201851    0.00686    NaN    11.916841
min    NaN    1.00000    1.000000    0.03150    NaN    8.000000
25%    NaN    3.25000    5.000000    0.03575    NaN    18.250000
50%    NaN    5.50000    5.000000    0.03900    NaN    20.000000
75%    NaN    7.75000    10.000000    0.04625    NaN    30.000000
max    NaN    10.00000    15.000000    0.05100    NaN    50.000000

df_bak2.describe(include=[np.number])
# output:
asset_id    bond_period    rate    amount
count    10.00000    10.000000    10.00000    10.000000
mean    5.50000    6.900000    0.04035    23.700000
std    3.02765    4.201851    0.00686    11.916841
min    1.00000    1.000000    0.03150    8.000000
25%    3.25000    5.000000    0.03575    18.250000
50%    5.50000    5.000000    0.03900    20.000000
75%    7.75000    10.000000    0.04625    30.000000
max    10.00000    15.000000    0.05100    50.000000

df_bak2.describe(include=[object])
# output:
    asset_name    buying_time
count    10    10
unique    4    4
top    CorpbondAAA    2020-06-30
freq    3    4
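The percentiles parameter is not exercised in the calls above. A minimal sketch, using a small made-up frame rather than df_bak2, of requesting the 20% and 80% quantiles:

```python
import pandas as pd

# Hypothetical sample data, only for illustrating percentiles.
s_df = pd.DataFrame({'rate': [0.0315, 0.0355, 0.041, 0.033, 0.0365],
                     'amount': [30.0, 30.0, 50.0, 18.0, 21.0]})

# Ask for the 20% and 80% quantiles instead of the default quartiles.
summary = s_df.describe(percentiles=[0.2, 0.8])
print(summary)
```

The requested quantiles show up as rows labelled '20%' and '80%' in the result.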

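2.2 File operations

2.2.1 to_excel()

Saves the DataFrame to an Excel file; the sheet_name parameter selects the target sheet.

2.2.2 to_csv()

Saves the DataFrame to a CSV file.

2.2.3 read_csv()

Reads the data of a CSV file into a DataFrame. header=None indicates that the file has no header row; names=[] sets the column names of the resulting df.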

df = pd.read_csv(filepath_or_buffer=file_name, header=None, names=['c1', 'c2', 'c3', 'c4'])
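To show how to_csv() and read_csv() fit together, here is a minimal round trip. It writes to an in-memory io.StringIO buffer instead of a real file so that it runs anywhere; the frame and column names are made up for the example.

```python
import io

import pandas as pd

df_out = pd.DataFrame({'c1': ['a', 'b'], 'c2': [1, 2]})

# index=False and header=False write only the data rows.
buffer = io.StringIO()
df_out.to_csv(buffer, index=False, header=False)

# The file has no header row, so pass header=None and supply the names.
buffer.seek(0)
df_in = pd.read_csv(buffer, header=None, names=['c1', 'c2'])
print(df_in)
```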

2.3 Detecting and filtering data

Detection and filtering covers missing values and outliers.
Example df:

data_dict = {'asset_name': ['Govtbond','Govtbond','Govtbond','Finbond','Finbond','CorpbondAAA','CorpbondAAA','CorpbondAAA','CorpbondAA','CorpbondAA'],
'asset_id': [1,2,3,4,5,6,7,8,9,10],
'bond_period': [5,10,15,5,10,3,5,10,1,5],
'rate':[0.0315,0.0355,0.041,0.033,0.0365,0.038, 0.04, 0.049,0.048, 0.051]}
df_base=pd.DataFrame(data_dict)
time_amount = pd.DataFrame(data={'asset_name': ['Govtbond','Govtbond','Govtbond','Finbond','Finbond','CorpbondAAA','CorpbondAAA', 'CorpbondAAA','CorpbondAA','CorpbondAA'],
'asset_id': [1,2,3,4,5,6,7,8,9,10],
'buying_time':['2019-12-31', '2018-06-30', '2015-09-30','2019-12-31', '2018-06-30', np.nan,'2020-06-30','2020-06-30',np.nan, np.nan],
'amount': [30, np.nan, 50, 18, 21, 30, 8, 19, np.nan,12]})
print(f"df_base=\n {df_base},\n time_amount=\n {time_amount}")
df = df_base.merge(time_amount, how='left', on=['asset_name', 'asset_id'])

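2.3.1 dropna()

Drops the rows or columns of a DataFrame that contain NaN, according to the given conditions.

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Parameters

Parameter	Meaning	Accepted values	Default
axis	Which axis to drop along: 0 drops rows, 1 drops columns	0 or 'index', 1 or 'columns'	0
how	Drop rule. 'any': drop the row (or column) if it contains any NaN; 'all': drop it only if all of its values are NaN	'any', 'all'	'any'
thresh	Keep a row (or column) only if it has at least this many non-NaN values	int	optional
subset	Labels to check for NaN; when dropping rows, these are column names	column label or list of labels	optional
inplace	Whether to modify the original DataFrame. True: modify in place and return None; False: leave it unchanged and return a new df	bool	False

Original df: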
    asset_name    asset_id    bond_period    rate    buying_time    amount
0    Govtbond    1    5    0.0315    2019-12-31    30.0
1    Govtbond    2    10    0.0355    2018-06-30    NaN
2    Govtbond    3    15    0.0410    2015-09-30    50.0
3    Finbond    4    5    0.0330    2019-12-31    18.0
4    Finbond    5    10    0.0365    2018-06-30    21.0
5    CorpbondAAA    6    3    0.0380    NaN    30.0
6    CorpbondAAA    7    5    0.0400    2020-06-30    8.0
7    CorpbondAAA    8    10    0.0490    2020-06-30    19.0
8    CorpbondAA    9    1    0.0480    NaN    NaN
9    CorpbondAA    10    5    0.0510    NaN    12.0

# thresh parameter
df_bak=df.copy()
df_bak.dropna(thresh=5) # keep the rows that have at least 5 non-NaN values
# output:
# the row with index 8 is dropped
asset_name    asset_id    bond_period    rate    buying_time    amount
0    Govtbond    1    5    0.0315    2019-12-31    30.0
1    Govtbond    2    10    0.0355    2018-06-30    NaN
2    Govtbond    3    15    0.0410    2015-09-30    50.0
3    Finbond    4    5    0.0330    2019-12-31    18.0
4    Finbond    5    10    0.0365    2018-06-30    21.0
5    CorpbondAAA    6    3    0.0380    NaN    30.0
6    CorpbondAAA    7    5    0.0400    2020-06-30    8.0
7    CorpbondAAA    8    10    0.0490    2020-06-30    19.0
9    CorpbondAA    10    5    0.0510    NaN    12.0

# subset parameter
df_bak=df.copy()
df_bak.dropna(axis=0, subset=['amount']) # drop the rows whose 'amount' value is NaN
# output:
# the rows with NaN in the amount column are dropped
asset_name    asset_id    bond_period    rate    buying_time    amount
0    Govtbond    1    5    0.0315    2019-12-31    30.0
2    Govtbond    3    15    0.0410    2015-09-30    50.0
3    Finbond    4    5    0.0330    2019-12-31    18.0
4    Finbond    5    10    0.0365    2018-06-30    21.0
5    CorpbondAAA    6    3    0.0380    NaN    30.0
6    CorpbondAAA    7    5    0.0400    2020-06-30    8.0
7    CorpbondAAA    8    10    0.0490    2020-06-30    19.0
9    CorpbondAA    10    5    0.0510    NaN    12.0

2.3.2 notnull()

Checks every element of the DataFrame and returns a boolean DataFrame: True where the value is not NaN, False where it is NaN.
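A minimal sketch of notnull(), both on a whole frame and as a row filter. The frame df_nn is a made-up cut-down version of the example data.

```python
import numpy as np
import pandas as pd

df_nn = pd.DataFrame({'buying_time': ['2019-12-31', np.nan, '2020-06-30'],
                      'amount': [30.0, np.nan, 12.0]})

# Element-wise check: True where a value is present.
mask = df_nn.notnull()

# A column's mask can be used to keep only the rows that have a value.
rows_with_amount = df_nn[df_nn['amount'].notnull()]
print(mask)
print(rows_with_amount)
```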

2.4 Filling missing data

2.4.1 fillna()

Fills the missing values of a DataFrame with the specified values.

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

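Parameters

Parameter	Meaning	Accepted values	Default
value	Value used for filling; a non-scalar is matched by key (column name or index label)	scalar, dict, Series, or DataFrame	no default
method	Fill method. pad/ffill: propagate the last valid value forward over the NaN values that follow it; backfill/bfill: use the next valid value to fill backward. Deprecated since pandas 2.1 in favor of ffill() and bfill()	'backfill', 'bfill', 'pad', 'ffill', None	None
axis	Axis along which to fill. 0: down each column (along the index); 1: across each row (along the columns)	0 or 'index', 1 or 'columns'	None (treated as 0)
limit	With a method, the maximum number of consecutive NaN values to forward/backward fill; a longer run of NaN is only partially filled. Without a method, the maximum number of entries along the entire axis where NaN is filled	int	None
downcast	Mapping of column to dtype to downcast to where possible, or the string 'infer'	dict or 'infer'	None
inplace	Whether to modify the original DataFrame. True: modify in place and return None; False: leave it unchanged and return a new df	bool	False

# axis parameter
df_bak=df.copy()
df_bak.fillna(method='ffill',axis=1) # fill forward across each row (left to right)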
# output:
asset_name    asset_id    bond_period    rate    buying_time    amount
0    Govtbond    1    5    0.0315    2019-12-31    30
1    Govtbond    2    10    0.0355    2018-06-30    2018-06-30
2    Govtbond    3    15    0.041    2015-09-30    50
3    Finbond    4    5    0.033    2019-12-31    18
4    Finbond    5    10    0.0365    2018-06-30    21
5    CorpbondAAA    6    3    0.038    0.038    30
6    CorpbondAAA    7    5    0.04    2020-06-30    8
7    CorpbondAAA    8    10    0.049    2020-06-30    19
8    CorpbondAA    9    1    0.048    0.048    0.048
9    CorpbondAA    10    5    0.051    0.051    12

# limit parameter
df_bak=df.copy()
df_bak.fillna(method='ffill',axis=1, limit=1) # fill forward across each row; in a run of consecutive NaN values only the first `limit` are filled
# output:
asset_name    asset_id    bond_period    rate    buying_time    amount
0    Govtbond    1    5    0.0315    2019-12-31    30
1    Govtbond    2    10    0.0355    2018-06-30    2018-06-30
2    Govtbond    3    15    0.041    2015-09-30    50
3    Finbond    4    5    0.033    2019-12-31    18
4    Finbond    5    10    0.0365    2018-06-30    21
5    CorpbondAAA    6    3    0.038    0.038    30
6    CorpbondAAA    7    5    0.04    2020-06-30    8
7    CorpbondAAA    8    10    0.049    2020-06-30    19
8    CorpbondAA    9    1    0.048    0.048    NaN
9    CorpbondAA    10    5    0.051    0.051    12
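Because the method argument of fillna() is deprecated in recent pandas releases, the same forward fill can be written with ffill(). The sketch below uses a made-up single-column frame and also shows filling with a per-column dict.

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({'amount': [30.0, np.nan, np.nan, 18.0]})

# Forward fill down the column, filling at most one NaN per gap.
filled = gaps.ffill(limit=1)

# Fill by column name with a dict of replacement values.
by_col = gaps.fillna({'amount': 0.0})
print(filled)
print(by_col)
```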

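2.4.2 How to fill in a DataFrame with missing rows?

The DataFrame has gaps, but the dates of the missing records do not appear in the DataFrame at all, and no DataFrame with the complete set of dates is available.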
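One possible approach, sketched here under the assumption that the data is daily and has a date column: build the full calendar with pd.date_range, reindex onto it so the missing dates appear as NaN rows, then fill them. The column names and the choice of forward fill are assumptions for illustration, not something the notes prescribe.

```python
import pandas as pd

# Hypothetical daily data in which 2020-01-03 is missing entirely.
df_dates = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-04']),
    'amount': [30.0, 18.0, 21.0],
})

# Build the complete daily range from the first to the last observed date.
full_range = pd.date_range(df_dates['date'].min(), df_dates['date'].max(), freq='D')

# Reindexing inserts a NaN row for every date that was missing.
df_full = df_dates.set_index('date').reindex(full_range)

# Fill the inserted rows, here with the previous day's value.
df_filled = df_full.ffill()
print(df_filled)
```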

2.5 Removing duplicates

asset_df = pd.DataFrame(data={'asset_name':['Govtbond','Govtbond','Govtbond','Finbond','Finbond','Finbond','CorpbondAAA','CorpbondAAA','CorpbondAAA','CorpbondAAA','CorpbondAA','CorpbondAA'],'asset_id': [1,2,3,4,5,5,6,7,8,8,9,10]})
asset_df
# output:
asset_name    asset_id
0    Govtbond    1
1    Govtbond    2
2    Govtbond    3
3    Finbond    4
4    Finbond    5
5    Finbond    5
6    CorpbondAAA    6
7    CorpbondAAA    7
8    CorpbondAAA    8
9    CorpbondAAA    8
10    CorpbondAA    9
11    CorpbondAA    10

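2.5.1 duplicated()

DataFrame.duplicated() returns a boolean Series indicating whether each row is a duplicate (a row that has already appeared earlier).

Parameters

Parameter	Meaning	Accepted values	Default
subset	Columns to consider when identifying duplicates	column label or list of labels	optional; all columns when omitted
keep	How duplicates are marked. 'first': mark duplicates as True except for the first occurrence; 'last': mark duplicates as True except for the last occurrence; False: mark all duplicates as True	'first', 'last', False	'first'

# keep parameter
asset_df.duplicated(keep=False) # mark every row of a duplicate group as True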
# output:
0     False
1     False
2     False
3     False
4      True
5      True
6     False
7     False
8      True
9      True
10    False
11    False
dtype: bool

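2.5.2 drop_duplicates()

Drops duplicate rows (by default all columns are compared and the first row of each duplicate group is kept). The ignore_index parameter was added in pandas 1.0.0.

Parameters

Parameter	Meaning	Accepted values	Default
subset	Columns to consider when identifying duplicates	column label or list of labels	optional; all columns when omitted
keep	Which duplicates to keep. 'first': keep the first occurrence and drop the rest; 'last': keep the last occurrence and drop the rest; False: drop all duplicates	'first', 'last', False	'first'
inplace	Whether to modify the original DataFrame. True: modify in place and return None; False: leave it unchanged and return a new df	bool	False
ignore_index	Whether to discard the old index and relabel the result 0, 1, ..., n-1	bool	False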

asset_df.drop_duplicates()
# output:
asset_name    asset_id
0    Govtbond    1
1    Govtbond    2
2    Govtbond    3
3    Finbond    4
4    Finbond    5
6    CorpbondAAA    6
7    CorpbondAAA    7
8    CorpbondAAA    8
10    CorpbondAA    9
11    CorpbondAA    10
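The subset, keep and ignore_index parameters are not shown above. A minimal sketch on a shortened, made-up version of asset_df:

```python
import pandas as pd

assets = pd.DataFrame({'asset_name': ['Finbond', 'Finbond', 'Finbond', 'CorpbondAAA'],
                       'asset_id': [4, 5, 5, 6]})

# Default: a row is a duplicate only if all columns match an earlier row.
flags = assets.duplicated()

# Compare on asset_name only, keep the last row of each group,
# and renumber the result from 0.
last_per_name = assets.drop_duplicates(subset=['asset_name'], keep='last',
                                       ignore_index=True)
print(flags)
print(last_per_name)
```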

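2.6 Modifying data

2.6.1 upper() & lower()

Converts the strings in a Series (a DataFrame column) or an Index to upper case (str.upper()) or lower case (str.lower()).

asset_df['asset_name'] = asset_df['asset_name'].str.upper() # convert every value in the asset_name column to upper case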
asset_df
# output:
asset_name    asset_id
0    GOVTBOND    1
1    GOVTBOND    2
2    GOVTBOND    3
3    FINBOND    4
4    FINBOND    5
5    FINBOND    5
6    CORPBONDAAA    6
7    CORPBONDAAA    7
8    CORPBONDAAA    8
9    CORPBONDAAA    8
10    CORPBONDAA    9
11    CORPBONDAA    10

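2.6.2 Series.map()

Series.map() accepts a function or a dict-like object that holds a mapping. It can be used to modify a subset of a DataFrame by converting the values of a given column to other values. Values with no matching key in the dict become NaN, so the example assumes asset_df still holds the original asset names.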

map_dict = {
  'Govtbond':'bond1',
  'Finbond':'bond2',
  'CorpbondAAA':'bond3',
  'CorpbondAA':'bond4'
}
asset_df['asset_name'] = asset_df['asset_name'].map(map_dict)
asset_df
# output:
    asset_name    asset_id
0    bond1    1
1    bond1    2
2    bond1    3
3    bond2    4
4    bond2    5
5    bond2    5
6    bond3    6
7    bond3    7
8    bond3    8
9    bond3    8
10    bond4    9
11    bond4    10

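2.6.3 Combining map() with lambda

Combining map() with an anonymous lambda function performs the conversion in one step. Unlike passing the dict directly, map_dict[x] raises a KeyError for a value that is not a key of map_dict, so the example again assumes the original asset names.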

asset_df['asset_name'] = asset_df['asset_name'].map(lambda x: map_dict[x])
asset_df
# output:
    asset_name    asset_id
0    bond1    1
1    bond1    2
2    bond1    3
3    bond2    4
4    bond2    5
5    bond2    5
6    bond3    6
7    bond3    7
8    bond3    8
9    bond3    8
10    bond4    9
11    bond4    10

2.6.4 replace()

replace() replaces elements of a DataFrame with other values.

DataFrame.replace(to_replace=None, value=NoDefault.no_default, inplace=False, limit=None, regex=False, method=NoDefault.no_default)

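Parameters

Parameter	Meaning	Accepted values	Default
to_replace	How to find the values that will be replaced	str, regex, list, dict, Series, int, float, or None	None
value	Value that replaces every match of to_replace	scalar, dict, list, str, regex	None
inplace	Whether to modify the original DataFrame. True: modify in place and return None; False: leave it unchanged and return a new df	bool	False
limit	With a method, the maximum number of values to forward/backward fill; if there are more matches than this, only part of them are filled	int	None
regex	Whether to interpret to_replace and/or value as regular expressions. If True, to_replace must be a string	bool or the same types as to_replace	False
method	Method used when to_replace is a scalar, list, or tuple and value is None. pad/ffill: fill with the preceding value; backfill/bfill: fill with the following value	'backfill', 'bfill', 'pad', 'ffill', None	None

Forms of the to_replace argument

df.replace(a, A, inplace=True) -- a and A are both scalars
Replaces every element equal to a, in all columns, with A.
df.replace(to_replace=a, value=A, inplace=True) -- a is a list, A is a scalar
Replaces every element, in all columns, that appears in list a with A.

asset_df_bak3=asset_df.copy()
asset_df_bak3.replace(to_replace=[0,1,2,3,10], value=0.1, inplace=True) # replace 0, 1, 2, 3 and 10 in df with 0.1
asset_df_bak3
# output: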
asset_name    asset_id
0    Govtbond    0.1
1    Govtbond    0.1
2    Govtbond    0.1
3    Finbond    4.0
4    Finbond    5.0
5    Finbond    5.0
6    CorpbondAAA    6.0
7    CorpbondAAA    7.0
8    CorpbondAAA    8.0
9    CorpbondAAA    8.0
10    CorpbondAA    9.0
11    CorpbondAA    0.1

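df.replace(to_replace=a, value=A, inplace=True) -- a and A are both lists of the same length
Replaces every element, in all columns, that appears in list a with the element at the same position in list A.
df.replace(to_replace=a, inplace=True) -- a is a dict, no value argument
Replaces every element, in all columns, that equals a key of dict a with the corresponding dict value. For example, to_replace={2:0.1,9:0.2} replaces 2 with 0.1 and 9 with 0.2.
df.replace(to_replace=a,value=A, inplace=True) -- a is a dict and value is given (the dict keys are column names)
For each column named by a key of dict a, replaces the elements equal to that key's dict value with A (this form cannot replace elements of different columns with different values).

# Replace 'Govtbond' in the asset_name column and 5 in the asset_id column of asset_df_bak3 with 0.55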
asset_df_bak3.replace(to_replace={'asset_name':'Govtbond','asset_id':5},value= 0.55,inplace=True)
asset_df_bak3
# output:
asset_name    asset_id
0    0.55    1.00
1    0.55    2.00
2    0.55    3.00
3    Finbond    4.00
4    Finbond    0.55
5    Finbond    0.55
6    CorpbondAAA    6.00
7    CorpbondAAA    7.00
8    CorpbondAAA    8.00
9    CorpbondAAA    8.00
10    CorpbondAA    9.00
11    CorpbondAA    10.00

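df.replace(to_replace=a, inplace=True) -- a is a dict whose keys are column names and whose values are themselves dicts.
(This form can replace elements of different columns with different values.)

# Replace 'Finbond' in the asset_name column of asset_df_bak3 with 'bond1', and 5 in the asset_id column with 0.55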
asset_df_bak3.replace(to_replace={'asset_name':{'Finbond':'bond1'},'asset_id':{5:0.55}},inplace=True)
asset_df_bak3
# output:
    asset_name    asset_id
0    Govtbond    1.00
1    Govtbond    2.00
2    Govtbond    3.00
3    bond1    4.00
4    bond1    0.55
5    bond1    0.55
6    CorpbondAAA    6.00
7    CorpbondAAA    7.00
8    CorpbondAAA    8.00
9    CorpbondAAA    8.00
10    CorpbondAA    9.00
11    CorpbondAA    10.00

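df.replace(to_replace=a, regex=True, inplace=True) -- replaces the elements of df that match a regular expression.

# Replace NaN and positive/negative infinity in the DataFrame with 0
df.replace([np.nan, np.inf, -np.inf], 0, inplace=True)
# Replace empty strings in the DataFrame with NaN
df=df.replace('', np.nan)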
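The regex form is described above without an example. A minimal sketch, with a made-up frame and a made-up replacement string, that rewrites the 'Corpbond' prefix:

```python
import pandas as pd

names = pd.DataFrame({'asset_name': ['CorpbondAAA', 'CorpbondAA', 'Govtbond']})

# With regex=True, to_replace is treated as a regular expression and only
# the matched part of each string is replaced.
renamed = names.replace(to_replace=r'^Corpbond', value='Corp_', regex=True)
print(renamed)
```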

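2.6.5 rename()

rename() returns a copy of the DataFrame with its index labels and/or column labels renamed.

Parameters

Parameter	Meaning	Accepted values	Default
mapper	Mapping that defines the renaming	dict-like or function	None
index	Alternative to specifying the axis (mapper, axis=0 is equivalent to index=mapper)	dict-like or function	None
axis	Axis that mapper applies to. 0: rename index labels; 1: rename column labels	0 or 'index', 1 or 'columns'	0
columns	Alternative to specifying the axis (mapper, axis=1 is equivalent to columns=mapper)	dict-like or function	None
inplace	Whether to modify the original DataFrame. True: modify in place and return None; False: leave it unchanged and return a new df	bool	False
level	With a MultiIndex, rename only the labels of the specified level	int or level name	None
errors	'raise': raise a KeyError when the mapping contains labels that are not present in the axis; 'ignore': skip such labels silently	{'ignore', 'raise'}	'ignore'

Renaming a column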

df = df.rename(columns={'c1': 'cc1'})
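A slightly fuller sketch of rename(), covering a dict mapping, a function mapping, and the errors parameter. The frame df_r is made up for the example.

```python
import pandas as pd

df_r = pd.DataFrame({'c1': [1, 2], 'c2': [3, 4]})

# Rename one column and one index label with dict mappings.
renamed = df_r.rename(columns={'c1': 'cc1'}, index={0: 'first'})

# A function is applied to every label of the chosen axis.
upper = df_r.rename(columns=str.upper)

# With errors='raise', a label missing from the axis raises KeyError.
try:
    df_r.rename(columns={'missing': 'x'}, errors='raise')
    raised = False
except KeyError:
    raised = True
print(renamed, upper, raised, sep='\n')
```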

2.6.6 set_index()

Sets the index of a DataFrame; one or more columns can be used as the index.

Using a single column as the index

df_bak.set_index('asset_name')
# output:
    asset_id    bond_period    rate    buying_time    amount
asset_name                    
Govtbond    1    5    0.0315    2019-12-31    30.0
Govtbond    2    10    0.0355    2018-06-30    30.0
Govtbond    3    15    0.0410    2015-09-30    50.0
Finbond    4    5    0.0330    2019-12-31    18.0
Finbond    5    10    0.0365    2018-06-30    21.0
CorpbondAAA    6    3    0.0380    2018-06-30    30.0
CorpbondAAA    7    5    0.0400    2020-06-30    8.0
CorpbondAAA    8    10    0.0490    2020-06-30    19.0
CorpbondAA    9    1    0.0480    2020-06-30    19.0
CorpbondAA    10    5    0.0510    2020-06-30    12.0

Using multiple columns as the index

df_bak.set_index(['asset_name', 'asset_id'])
# output:
                bond_period    rate    buying_time    amount
asset_name    asset_id                
Govtbond    1    5    0.0315    2019-12-31    30.0
            2    10    0.0355    2018-06-30    30.0
            3    15    0.0410    2015-09-30    50.0
Finbond    4    5    0.0330    2019-12-31    18.0
            5    10    0.0365    2018-06-30    21.0
CorpbondAAA    6    3    0.0380    2018-06-30    30.0
            7    5    0.0400    2020-06-30    8.0
            8    10    0.0490    2020-06-30    19.0
CorpbondAA    9    1    0.0480    2020-06-30    19.0
            10    5    0.0510    2020-06-30    12.0

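2.6.7 reset_index()

Resets the index to the default integer index 0, 1, 2, ...; by default the old index is inserted back as a column.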
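A minimal sketch of reset_index() as the inverse of set_index(), on a made-up two-row frame. drop=True discards the old index instead of turning it back into a column.

```python
import pandas as pd

df_idx = pd.DataFrame({'asset_name': ['Govtbond', 'Finbond'],
                       'asset_id': [1, 4],
                       'rate': [0.0315, 0.033]}).set_index('asset_name')

# The old index comes back as an ordinary column.
restored = df_idx.reset_index()

# drop=True throws the old index away.
dropped = df_idx.reset_index(drop=True)
print(restored)
print(dropped)
```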

2.6.8 cut()

cut() sorts the values of a one-dimensional array into bins (intervals).

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)

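Parameters

Parameter	Meaning	Accepted values	Default
x	Array to bin (must be one-dimensional)	array-like	no default
bins	Binning rule. int: number of equal-width bins over the range of x; sequence of scalars: the bin edges; IntervalIndex: the exact bins to use	int, sequence of scalars, IntervalIndex	no default
right	Whether the bins include their rightmost edge; ignored when bins is an IntervalIndex	bool	True
labels	Labels for the returned bins	array or False	None
retbins	Whether to also return the bin edges; useful when bins is given as a scalar	bool	False
precision	Precision used to store and display the bin labels	int	3
include_lowest	Whether the first interval is left-inclusive	bool	False
duplicates	If the bin edges are not unique, raise a ValueError or drop the non-unique edges	'raise', 'drop'	'raise'

# 1. Specify the bin edges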
pd.cut(df_bak3['bond_period'], bins=[0,8,20])
# output:
""
0     (0, 8]
1    (8, 20]
2    (8, 20]
3     (0, 8]
4    (8, 20]
5     (0, 8]
6     (0, 8]
7    (8, 20]
8     (0, 8]
9     (0, 8]
Name: bond_period, dtype: category
Categories (2, interval[int64]): [(0, 8] < (8, 20]]
""

# 2. Specify the number of bins and also return the bin edges
pd.cut(df_bak3['bond_period'], bins=3, retbins=True)
# output:
""
(0     (0.986, 5.667]
 1    (5.667, 10.333]
 2     (10.333, 15.0]
 3     (0.986, 5.667]
 4    (5.667, 10.333]
 5     (0.986, 5.667]
 6     (0.986, 5.667]
 7    (5.667, 10.333]
 8     (0.986, 5.667]
 9     (0.986, 5.667]
 Name: bond_period, dtype: category
 Categories (3, interval[float64]): [(0.986, 5.667] < (5.667, 10.333] < (10.333, 15.0]],
 array([ 0.986     ,  5.66666667, 10.33333333, 15.        ]))
""

# 3. Specify the number of bins and the labels, and also return the bin edges
pd.cut(df_bak3['bond_period'], bins=3, labels=['H', 'M', 'L'], retbins=True)
# output:
""
(0    H
 1    M
 2    L
 3    H
 4    M
 5    H
 6    H
 7    M
 8    H
 9    H
 Name: bond_period, dtype: category
 Categories (3, object): [H < M < L],
 array([ 0.986     ,  5.66666667, 10.33333333, 15.        ]))
""

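2.6.9 assign()

df.assign() adds new columns to a DataFrame or overwrites existing ones, and returns a new DataFrame. Each keyword argument name is a column name: if the column exists its values are updated, otherwise a new column is added.

# Create a new column 'year'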
df_bak3.assign(year=lambda x: x['buying_time'].str[:4])
# output:
asset_name    asset_id    bond_period    rate    buying_time    amount    new_amount    year
0    Govtbond    1    5    0.0315    2019-12-31    30.0    3    2019
1    Govtbond    2    10    0.0355    2018-06-30    30.0    5    2018
2    Govtbond    3    15    0.0410    2015-09-30    50.0    6    2015
3    Finbond    4    5    0.0330    2019-12-31    18.0    8    2019
4    Finbond    5    10    0.0365    2018-06-30    21.0    1    2018
5    CorpbondAAA    6    3    0.0380    2018-06-30    30.0    7    2018
6    CorpbondAAA    7    5    0.0400    2020-06-30    8.0    9    2020
7    CorpbondAAA    8    10    0.0490    2020-06-30    19.0    19    2020
8    CorpbondAA    9    1    0.0480    2020-06-30    19.0    10    2020
9    CorpbondAA    10    5    0.0510    2020-06-30    12.0    15    2020

# No new column; overwrite 'buying_time'
df_bak3.assign(buying_time=lambda x: x['buying_time'].str[:4])
# output:
asset_name    asset_id    bond_period    rate    buying_time    amount    new_amount
0    Govtbond    1    5    0.0315    2019    30.0    3
1    Govtbond    2    10    0.0355    2018    30.0    5
2    Govtbond    3    15    0.0410    2015    50.0    6
3    Finbond    4    5    0.0330    2019    18.0    8
4    Finbond    5    10    0.0365    2018    21.0    1
5    CorpbondAAA    6    3    0.0380    2018    30.0    7
6    CorpbondAAA    7    5    0.0400    2020    8.0    9
7    CorpbondAAA    8    10    0.0490    2020    19.0    19
8    CorpbondAA    9    1    0.0480    2020    19.0    10
9    CorpbondAA    10    5    0.0510    2020    12.0    15

2.6.10 pd.to_numeric()

Converts a DataFrame column to a numeric dtype (int64 or float64, depending on the data).
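A minimal sketch of pd.to_numeric() on made-up string data. errors='coerce' is used here so that unparseable entries become NaN instead of raising.

```python
import pandas as pd

raw = pd.Series(['1', '2.5', 'n/a'])

# Strings that cannot be parsed become NaN, so the result is float64.
converted = pd.to_numeric(raw, errors='coerce')

# When every value is a whole number the result is an integer dtype.
whole = pd.to_numeric(pd.Series(['1', '2']))
print(converted.dtype, whole.dtype)
```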

2.6.11 dt.date

Converts the values of a datetime column of a DataFrame to dates (datetime.date objects).
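The .dt accessor only exists on datetime-like data, so a string column has to go through pd.to_datetime() first. A minimal sketch with made-up values in the style of buying_time:

```python
import datetime

import pandas as pd

buying_time = pd.Series(['2019-12-31', '2018-06-30'])

# Parse the strings into datetime64 values first.
as_datetime = pd.to_datetime(buying_time)

# .dt.date yields plain datetime.date objects (object dtype).
as_date = as_datetime.dt.date
print(as_date)
```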

2.6.12 df.columns.astype()

Converts the column names of a DataFrame to strings.

df.columns = df.columns.astype(str)

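2.7 Reshaping data

2.7.1 pivot()

Reshapes a DataFrame by index or column values. The columns that are not used as the new index or as the new columns supply the values of the new DataFrame; positions without a value are NaN.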

DataFrame.pivot(index=None, columns=None, values=None)

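Parameters

Parameter	Meaning	Accepted values	Default
index	Column(s) used as the new index	str or object or a list of str	optional
columns	Column(s) used as the new columns	str or object or a list of str	optional
values	Column(s) used to fill the values of the new frame. If not specified, all remaining columns are used and the result has hierarchically indexed columns	str, object or a list of the previous	optional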
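A minimal sketch of pivot() turning long data into a wide table. The frame long_df and its values are made up; each (year, asset) pair appears exactly once, which pivot() requires.

```python
import pandas as pd

long_df = pd.DataFrame({'year': [2020, 2020, 2021, 2021],
                        'asset': ['asset1', 'asset2', 'asset1', 'asset2'],
                        'rate': [0.19, 0.20, 0.10, 0.29]})

# One row per year, one column per asset, cells taken from rate.
wide = long_df.pivot(index='year', columns='asset', values='rate')
print(wide)
```

If an index/column pair occurred more than once, pivot() would raise a ValueError; that case calls for pivot_table() with an aggregation.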

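2.7.2 pivot_table()

Reshapes a DataFrame by index or column values, like pivot(), and additionally aggregates the values that fall into the same cell: the aggfunc parameter selects the aggregation (sum, mean, and so on). Positions without a value are NaN.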

DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)

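Parameters

Parameter	Meaning	Accepted values	Default
values	Column(s) to aggregate. If not specified, all remaining columns are used and the result has hierarchically indexed columns	str, object or a list of the previous	optional
index	Column(s) used as the new index	str or object or a list of str	optional
columns	Column(s) used as the new columns	str or object or a list of str	optional
aggfunc	Function(s) applied when aggregating the data	function, list of functions, dict	'mean'
fill_value	Value used to replace missing values in the result	scalar	None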

df_bak3[['asset_name', 'asset_id', 'bond_period', 'rate']].pivot_table(index=['asset_name', 'asset_id'], columns=['bond_period'])
# output:
            rate
bond_period    1    3    5    10    15
asset_name    asset_id                    
CorpbondAA    9    0.048    NaN    NaN    NaN    NaN
            10    NaN    NaN    0.0510    NaN    NaN
CorpbondAAA    6    NaN    0.038    NaN    NaN    NaN
            7    NaN    NaN    0.0400    NaN    NaN
            8    NaN    NaN    NaN    0.0490    NaN
Finbond        4    NaN    NaN    0.0330    NaN    NaN
            5    NaN    NaN    NaN    0.0365    NaN
Govtbond    1    NaN    NaN    0.0315    NaN    NaN
            2    NaN    NaN    NaN    0.0355    NaN
            3    NaN    NaN    NaN    NaN    0.041
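The call above relies on the default aggregation. A minimal sketch of aggfunc and fill_value, on a made-up frame in which two rows share the same cell:

```python
import pandas as pd

bonds = pd.DataFrame({'asset_name': ['Govtbond', 'Govtbond', 'Finbond', 'Finbond'],
                      'bond_period': [5, 5, 5, 10],
                      'amount': [30.0, 20.0, 18.0, 21.0]})

# Sum the amounts per (asset_name, bond_period); empty cells become 0.
total = bonds.pivot_table(index='asset_name', columns='bond_period',
                          values='amount', aggfunc='sum', fill_value=0)
print(total)
```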

2.7.3 unstack()
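The notes leave this subsection empty. As a rough sketch of what unstack() does, assuming a Series with a two-level index built from made-up data: the innermost index level is moved to the columns, and combinations that do not exist become NaN.

```python
import pandas as pd

stacked = pd.DataFrame({'asset_name': ['Govtbond', 'Govtbond', 'Finbond'],
                        'bond_period': [5, 10, 5],
                        'rate': [0.0315, 0.0355, 0.033]}
                       ).set_index(['asset_name', 'bond_period'])['rate']

# Move the innermost index level (bond_period) to the columns.
wide = stacked.unstack()
print(wide)
```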

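2.8 Selecting data

2.8.1 Basic selection

# Select a subset of the columns
df_bak2[['asset_name', 'asset_id','bond_period']]
# Select all rows where amount is greater than 20
df_bak2[df_bak2['amount']>20]
df_bak2[df_bak2.amount > 20]
# Keep every value greater than 20; all other positions become NaN
# (the comparison needs numeric data, so select the numeric columns first)
df_num = df_bak2.select_dtypes('number')
df_num[df_num>20]
# Keep the rows in which at least one value is greater than 20
df_bak2[(df_num>20).any(axis=1)]
# Get the index labels of the rows that satisfy a condition
list(df_bak2[df_bak2.amount > 20].index)
# Combine multiple conditions with &
df_bak2[(df_bak2.amount > 20) & (df_bak2.bond_period==10)]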

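2.8.2 sample()

Random sampling is done with sample().

Parameters

Parameter	Meaning	Accepted values	Default
n	Number of items to sample; cannot be used together with frac	int	1 when frac is None
frac	Fraction of items to sample; cannot be used together with n	float	None
replace	Whether the same row may be sampled more than once	bool	False
weights	Sampling weights (see below)	str or ndarray-like	None
random_state	Random seed; makes the sampling reproducible	int	None
axis	Sample rows or columns	0 or 'index', 1 or 'columns'	None
ignore_index	Whether to discard the original index and relabel the result 0, 1, 2, ... (pandas 1.3.0 and later)	bool	False

weights
When weights is a str, it must be the name of a column of the DataFrame (row sampling). pandas uses the values of that column as the sampling weight of each row. If the values do not sum to 1, they are normalized to sum to 1. A missing value in the column gives that row a weight of 0, so the row is never drawn.
When weights is a Series, its length may differ from the number of rows or columns. Taking row sampling as an example, pandas first aligns the Series with the DataFrame on the index, like a left join of the DataFrame with the Series; rows whose index label has no match in the Series get a weight of 0.

# Sample with probabilities proportional to the values of the amount column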
df_bak2.sample(n=5,weights='amount')
# output:
asset_name    asset_id    bond_period    rate    buying_time    amount
3    Finbond    4    5    0.0330    2019-12-31    18.0
1    Govtbond    2    10    0.0355    2018-06-30    30.0
2    Govtbond    3    15    0.0410    2015-09-30    50.0
7    CorpbondAAA    8    10    0.0490    2020-06-30    19.0
6    CorpbondAAA    7    5    0.0400    2020-06-30    8.0
# Sample according to the weights in weight_. The number of positive weights must be at least the sample size n, otherwise an exception is raised.
weight_ = pd.Series([0.1, 0.1,0.3,0.2], index=[0,4,6,7])
df_bak2.sample(n=3,weights=weight_)
# output:
asset_name    asset_id    bond_period    rate    buying_time    amount
6    CorpbondAAA    7    5    0.0400    2020-06-30    8.0
7    CorpbondAAA    8    10    0.0490    2020-06-30    19.0
0    Govtbond    1    5    0.0315    2019-12-31    30.0
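The random_state and frac parameters from the table are not demonstrated above. A minimal sketch on a made-up ten-row frame; the seed 42 is arbitrary.

```python
import pandas as pd

pool = pd.DataFrame({'asset_id': range(1, 11)})

# The same seed gives the same sample on every call.
first = pool.sample(n=3, random_state=42)
second = pool.sample(n=3, random_state=42)

# frac draws a share of the rows instead of a fixed count.
half = pool.sample(frac=0.5, random_state=0)
print(first.equals(second), len(half))
```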

2.8.3 take()

Selects data from a DataFrame by integer position. numpy.random.permutation also makes it easy to sample the rows or columns of a DataFrame: generate a randomly permuted list of positions with permutation, then pass it to take.

Parameters

Parameter	Meaning	Accepted values	Default
indices	Positions of the data to take	array-like	no default
axis	Take rows or columns	0 or 'index', 1 or 'columns'	0
is_copy	Whether to return a copy (deprecated since pandas 1.0.0; take always returns a copy)	bool	True

# Take the 4 columns at positions 0, 1, 2, 3
df_bak2.take(indices=[0,1,2,3],axis=1)
# output:
asset_name    asset_id    bond_period    rate
0    Govtbond    1    5    0.0315
1    Govtbond    2    10    0.0355
2    Govtbond    3    15    0.0410
3    Finbond    4    5    0.0330
4    Finbond    5    10    0.0365
5    CorpbondAAA    6    3    0.0380
6    CorpbondAAA    7    5    0.0400
7    CorpbondAAA    8    10    0.0490
8    CorpbondAA    9    1    0.0480
9    CorpbondAA    10    5    0.0510

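2.8.4 iloc[]

Selects by row or column position. The indexer can be a single integer, a list of integers, a slice, or a list of booleans. A single integer returns a Series; a list (even with a single element) returns a DataFrame.
':' selects all rows (or columns); for example, df.iloc[:,[0,1,2]] selects all rows of columns 0, 1 and 2.

# df_bak2: column 'asset_name' set as the index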
        asset_id    bond_period    rate    buying_time    amount
asset_name                    
Govtbond    1    5    0.0315    2019-12-31    30.0
Govtbond    2    10    0.0355    2018-06-30    30.0
Govtbond    3    15    0.0410    2015-09-30    50.0
Finbond    4    5    0.0330    2019-12-31    18.0
Finbond    5    10    0.0365    2018-06-30    21.0
CorpbondAAA    6    3    0.0380    2018-06-30    30.0
CorpbondAAA    7    5    0.0400    2020-06-30    8.0
CorpbondAAA    8    10    0.0490    2020-06-30    19.0
CorpbondAA    9    1    0.0480    2020-06-30    19.0
CorpbondAA    10    5    0.0510    2020-06-30    12.0

# 1. Select by integer lists: rows 0, 1, 3 and columns 0, 1, 4
df_bak2.iloc[[0,1,3],[0,1,4]]
# output:
        asset_id    bond_period    amount
asset_name            
Govtbond    1    5    30.0
Govtbond    2    10    30.0
Finbond        4    5    18.0

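# 2. Select with boolean lists. Each list must be as long as the axis it indexes
#    (recent pandas versions raise an IndexError for a shorter list)
df_bak2.iloc[[True,True,False,True]+[False]*6,[True,True,False,False,True]]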
# output:
        asset_id    bond_period    amount
asset_name            
Govtbond    1    5    30.0
Govtbond    2    10    30.0
Finbond        4    5    18.0

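2.8.5 loc[]

Selects by row or column label. The indexer can be a single label, a list of labels, a slice of labels, or a list of booleans. A single label returns a Series; a list (even with a single element) returns a DataFrame.

# 1. Select by index labels and column names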
df_bak2.loc[['Govtbond'],['asset_id', 'bond_period','amount']]
# output:
        asset_id    bond_period    amount
asset_name            
Govtbond    1    5    30.0
Govtbond    2    10    30.0
Govtbond    3    15    50.0

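# 2. Select with boolean lists. Each list must be as long as the axis it indexes
#    (recent pandas versions raise an IndexError for a shorter list)
df_bak2.loc[[True,True,False,True]+[False]*6,[True,True,False,False,True]]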
# output:
        asset_id    bond_period    amount
asset_name            
Govtbond    1    5    30.0
Govtbond    2    10    30.0
Finbond        4    5    18.0

# 3. Select by value
df_bak2.loc[df_bak2['amount']>=30]
# output:
        asset_id    bond_period    rate    buying_time    amount
asset_name                    
Govtbond    1    5    0.0315    2019-12-31    30.0
Govtbond    2    10    0.0355    2018-06-30    30.0
Govtbond    3    15    0.0410    2015-09-30    50.0
CorpbondAAA    6    3    0.0380    2018-06-30    30.0

# 4. Modify the values at the selected positions through loc
df_bak2.loc[[True,True,False,True]+[False]*6,[True,True,False,False,True]]=[0,10,100]
df_bak2
# output:
    asset_id    bond_period    rate    buying_time    amount
asset_name                    
Govtbond    0    10    0.0315    2019-12-31    100.0
Govtbond    0    10    0.0355    2018-06-30    100.0
Govtbond    3    15    0.0410    2015-09-30    50.0
Finbond    0    10    0.0330    2019-12-31    100.0
Finbond    5    10    0.0365    2018-06-30    21.0
CorpbondAAA    6    3    0.0380    2018-06-30    30.0
CorpbondAAA    7    5    0.0400    2020-06-30    8.0
CorpbondAAA    8    10    0.0490    2020-06-30    19.0
CorpbondAA    9    1    0.0480    2020-06-30    19.0
CorpbondAA    10    5    0.0510    2020-06-30    12.0

# 5. With a MultiIndex, pass a tuple to select by labels from different levels
# df:
                max_speed    shield
cobra    mark i    12    2
        mark ii    0    4
sidewinder    mark i    10    20
        mark ii    1    4
viper    mark ii    7    1
        mark iii    16    36
df.loc[('cobra', 'mark ii')] # returns a Series
df.loc[[('cobra', 'mark ii')]] # returns a DataFrame
# All rows from ('cobra', 'mark i') through the end of 'viper'
df.loc[('cobra', 'mark i'):'viper']
# All rows from ('cobra', 'mark ii') through ('viper', 'mark ii')
df.loc[('cobra', 'mark ii'):('viper', 'mark ii')]
# output:
                max_speed    shield
cobra        mark ii    0    4
sidewinder    mark i    10    20
            mark ii    1    4
viper        mark ii    7    1

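2.8.6 iloc vs. loc

Differences:
iloc selects by the position of rows or columns; loc selects by their labels.
Both accept slices: an iloc slice is positional and excludes its end point, while a loc slice uses labels and includes both end points.
loc accepts a boolean Series such as df['amount']>20 directly; iloc only accepts a boolean list or array, so a boolean Series has to be converted first (for example with .to_numpy()).
Both can be used on the left-hand side of an assignment to modify the selected elements.
In the author's tests, selecting through the index was somewhat faster than filtering on column values with loc (roughly 20%), but setting the column as the index takes extra time, so set_index plus the lookup was slower overall than filtering directly with loc.
Similarities:
Both can select with boolean lists.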
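The differences listed above can be checked on a small frame. This is a sketch with made-up data and labels, not a benchmark:

```python
import pandas as pd

df_cmp = pd.DataFrame({'amount': [30.0, 18.0, 21.0, 8.0]},
                      index=['a', 'b', 'c', 'd'])

# Positional slice: the end position is excluded.
by_position = df_cmp.iloc[0:2]

# Label slice: both end labels are included.
by_label = df_cmp.loc['a':'c']

# iloc can also be used for assignment.
df_cmp.iloc[0, 0] = 100.0

# iloc needs a plain boolean array, so convert the boolean Series first.
big = df_cmp.iloc[(df_cmp['amount'] > 20).to_numpy()]
print(by_position, by_label, big, sep='\n')
```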

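2.8.7 iterrows()

iterrows() iterates over the rows of a DataFrame and yields two items per row: the index label and the row data. The row data is a Series, so individual values can be read by column label or by position.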

# df_bak3
    asset_name    asset_id    bond_period    rate    buying_time    amount
0    Govtbond    1    5    0.0315    2019-12-31    30.0
1    Govtbond    2    10    0.0355    2018-06-30    30.0
2    Govtbond    3    15    0.0410    2015-09-30    50.0
3    Finbond    4    5    0.0330    2019-12-31    18.0
4    Finbond    5    10    0.0365    2018-06-30    21.0
5    CorpbondAAA    6    3    0.0380    2018-06-30    30.0
6    CorpbondAAA    7    5    0.0400    2020-06-30    8.0
7    CorpbondAAA    8    10    0.0490    2020-06-30    19.0
8    CorpbondAA    9    1    0.0480    2020-06-30    19.0
9    CorpbondAA    10    5    0.0510    2020-06-30    12.0

for index_, row in df_bak3.iterrows():
    print(index_,row)

2.8.8 itertuples()

itertuples() yields one namedtuple per row, named Pandas by default, with the row's index value as the first element.

for nametuple in df_bak3.itertuples():
    print(nametuple)

# output:
Pandas(Index=0, asset_name='Govtbond', asset_id=1, bond_period=5, rate=0.0315, buying_time='2019-12-31', amount=30.0)
...
Pandas(Index=9, asset_name='CorpbondAA', asset_id=10, bond_period=5, rate=0.051, buying_time='2020-06-30', amount=12.0)

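2.8.9 Series.str

pandas' vectorized string methods, available through the .str accessor of a Series, can be used to search, slice, and otherwise process a string column of a DataFrame.

Methods

Method	Meaning
cat	Element-wise string concatenation, with an optional separator
contains	Whether each string contains the given substring
endswith, startswith	Whether each string ends (starts) with the given substring
get	Get the element at the given position of each string
isalnum	Whether all characters in each string are alphanumeric
isalpha	Whether all characters in each string are alphabetic
isdecimal	Whether all characters in each string are decimal digits ('1.5' and '3,000' give False; '⅕' gives False)
isdigit	Whether all characters in each string are digits ('1.5' and '3,000' give False; '⅕' gives False)
isnumeric	Whether all characters in each string are numeric ('1.5' and '3,000' give False; '⅕' gives True)
islower	Whether all cased characters are lower case
isupper	Whether all cased characters are upper case
join	Join the strings held in each element of the Series with the given separator
replace	replace('a', 'b') replaces substring a with substring b
split	Split each string on the given separator
strip (rstrip, lstrip)	Remove whitespace from both sides (right side only, left side only)
upper, lower	Convert to upper (lower) case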

Series.str.split()

Splits a string column of a DataFrame into several columns on a separator.
For example, split the value_type column of df2 on "|" into two columns named 'asset' and 'value_type'.

df2[['asset', 'value_type']] = df2['value_type'].str.split('|', expand=True)
# df2 before the split:
    year value_type rate
    0 2020 asset1|cii 0.19
    1 2020 asset1|nii 0.10
    2 2020 asset2|cii 0.20
    3 2020 asset2|nii 0.29

With expand=True the pieces are spread over separate columns; df2 after the split:

year value_type rate asset
0 2020 cii 0.19 asset1
1 2020 nii 0.10 asset1
2 2020 cii 0.20 asset2
3 2020 nii 0.29 asset2

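In the pandas version used for these notes, if the split produces more columns than there are target columns, the pieces are stored from left to right until the target columns are used up and the remaining pieces are discarded; newer pandas versions may raise a ValueError for this assignment instead.
For example: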

df2['value_type'] = df2['value_type'].str.split('|', expand=True)
# output:
    year value_type rate
    0 2020 asset1 0.19
    1 2020 asset1 0.10
    2 2020 asset2 0.20
    3 2020 asset2 0.29

With expand=False the pieces are not spread over columns; they are kept in a single column as lists:

 year value_type rate
0 2020 [asset1, cii] 0.19
1 2020 [asset1, nii] 0.10
2 2020 [asset2, cii] 0.20
3 2020 [asset2, nii] 0.29

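2.9 Combining data

2.9.1 concat()

Concatenates pandas objects along a particular axis, with optional set logic (union or intersection) along the other axes. With axis=1, concat aligns the objects on their index, which covers some of the use cases of merge.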

pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

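Parameters

Parameter	Meaning	Allowed values	Default
objs	Objects to concatenate	a sequence or mapping of Series or DataFrame objects	No default; required
axis	Axis to concatenate along	{0/'index', 1/'columns'}	0
join	How to handle the labels on the other axis: union (outer) or intersection (inner)	{'inner', 'outer'}	'outer'
ignore_index	Whether to discard the existing index on the concatenation axis and generate a new one (0..n-1)	bool	False
keys	Labels used to build an outermost index level that identifies each input object	sequence	None
levels	Specific levels (unique values) to use when building a MultiIndex	list of sequences	None
names	Names of the levels in the resulting hierarchical index	list	None
verify_integrity	Whether to check that the new concatenation axis contains no duplicates	bool	False
sort	Whether to sort the non-concatenation axis when it is not already aligned and join='outer'. Has no effect with join='inner', which already preserves the order of the non-concatenation axis. (Default is False since pandas 1.0.0)	bool	False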
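
The effect of join, ignore_index, and keys can be seen in a small sketch. The two frames below (left, right) and their values are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"asset_id": [1, 2], "rate": [0.0315, 0.0355]})
right = pd.DataFrame({"asset_id": [3, 4], "amount": [50.0, 18.0]})

# Stack rows; columns are unioned (join='outer'), missing cells become NaN
stacked = pd.concat([left, right], ignore_index=True)

# Keep only the columns shared by both frames
shared = pd.concat([left, right], join="inner", ignore_index=True)

# keys adds an outer index level recording which input each row came from
labelled = pd.concat([left, right], keys=["left", "right"])

print(stacked, shared, labelled, sep="\n\n")
```

With keys set, labelled.loc['right'] returns just the rows that came from the second frame.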

2.9.2 merge()

Merges DataFrame or named Series objects with a database-style join. A named Series is treated as a DataFrame with a single column.

pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

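Parameters

Parameter	Meaning	Allowed values	Default
left	Left dataframe to merge	dataframe	No default; required
right	Right dataframe to merge	dataframe	No default; required
how	Type of join: left, right, outer, inner, or cross	{'left', 'right', 'outer', 'inner', 'cross'}; 'cross' requires pandas 1.2.0 or later	'inner'
on	Column or index level names to join on. They must be present in both dataframes. If on is None and the merge is not on indexes, it defaults to the intersection of the columns of the two dataframes	label or list	None
left_on	Column or index level names to join on in the left dataframe. Can also be an array, or a list of arrays, of the same length as the left dataframe	label or list	None
right_on	Column or index level names to join on in the right dataframe	label or list	None
left_index	Whether to use the index of the left dataframe as the join key	bool	False
right_index	Whether to use the index of the right dataframe as the join key	bool	False
sort	Whether to sort the join keys lexicographically in the result. If False, the order of the join keys depends on the join type (the how keyword)	bool	False
suffixes	A sequence of length 2 whose elements are optional strings, appended to overlapping column names from the left and right side respectively. For example, if both dataframes have a 'value' column and suffixes=('_left', '_right'), the merged columns are 'value_left' and 'value_right'	list-like	('_x', '_y')
copy	Whether to copy the data	bool	True
indicator	If True, adds a column named '_merge' to the output with information on the source of each row. Passing a string gives the column a different name.	bool or str	False
validate	If specified, checks that the merge is of the given type: 'one_to_one', 'one_to_many', 'many_to_one', or 'many_to_many'	str	None (optional)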
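
The suffixes, indicator, and validate parameters are easiest to understand together. Below is a minimal sketch on two made-up frames (bonds and quotes are hypothetical names) that share the key column asset_id and an overlapping column value.

```python
import pandas as pd

bonds = pd.DataFrame({"asset_id": [1, 2, 3], "value": [30.0, 30.0, 50.0]})
quotes = pd.DataFrame({"asset_id": [2, 3, 4], "value": [0.0355, 0.0410, 0.0330]})

merged = pd.merge(
    bonds, quotes,
    how="outer", on="asset_id",
    suffixes=("_left", "_right"),  # disambiguate the overlapping 'value' column
    indicator=True,                # adds the _merge column
    validate="one_to_one",         # raises if asset_id repeats on either side
)
print(merged)
```

The _merge column takes the values left_only, right_only, or both, which makes it easy to spot keys that exist on one side only.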

2.10 Computing on data

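import random
df_bak3 = df.copy()
# Forward-fill missing values (fillna(method='ffill') is deprecated in recent pandas)
df_bak3.ffill(inplace=True)
# 10 distinct random integers from 1..19, so new_amount differs between runs
df_bak3['new_amount'] = random.sample(range(1, 20), 10)
# df_bak3: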
    asset_name    asset_id    bond_period    rate    buying_time    amount    new_amount
0    Govtbond    1    5    0.0315    2019-12-31    30.0    10
1    Govtbond    2    10    0.0355    2018-06-30    30.0    12
2    Govtbond    3    15    0.0410    2015-09-30    50.0    1
3    Finbond    4    5    0.0330    2019-12-31    18.0    2
4    Finbond    5    10    0.0365    2018-06-30    21.0    13
5    CorpbondAAA    6    3    0.0380    2018-06-30    30.0    15
6    CorpbondAAA    7    5    0.0400    2020-06-30    8.0    4
7    CorpbondAAA    8    10    0.0490    2020-06-30    19.0    16
8    CorpbondAA    9    1    0.0480    2020-06-30    19.0    14
9    CorpbondAA    10    5    0.0510    2020-06-30    12.0    17

2.10.1 apply()

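apply() applies a function along an axis of the DataFrame.

DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)

Parameters

Parameter	Meaning	Allowed values	Default
func	Function to apply to each column or row	function	No default; required
axis	Axis to apply along: 0 passes each column to func, 1 passes each row	0 or 'index', 1 or 'columns'	0
raw	Whether each row or column is passed to func as a NumPy ndarray (True) instead of a Series (False)	bool	False
result_type	Format of the result (only used when axis=1)	'expand', 'reduce', 'broadcast', None	None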

# 1. Apply a function to a single column
df_bak3['bond_period'] = df_bak3['bond_period'].apply(lambda x: x*12)
df_bak3
# output:
    asset_name    asset_id    bond_period    rate    buying_time    amount    new_amount
0    Govtbond    1    60    0.0315    2019-12-31    30.0    10
1    Govtbond    2    120    0.0355    2018-06-30    30.0    12
2    Govtbond    3    180    0.0410    2015-09-30    50.0    1
3    Finbond    4    60    0.0330    2019-12-31    18.0    2
4    Finbond    5    120    0.0365    2018-06-30    21.0    13
5    CorpbondAAA    6    36    0.0380    2018-06-30    30.0    15
6    CorpbondAAA    7    60    0.0400    2020-06-30    8.0    4
7    CorpbondAAA    8    120    0.0490    2020-06-30    19.0    16
8    CorpbondAA    9    12    0.0480    2020-06-30    19.0    14
9    CorpbondAA    10    60    0.0510    2020-06-30    12.0    17

# 2. Sum two columns row by row
df_bak3['amount'] = df_bak3.apply(lambda x: x['amount'] + x['new_amount'], axis=1)
df_bak3
# output:
asset_name    asset_id    bond_period    rate    buying_time    amount    new_amount
0    Govtbond    1    60    0.0315    2019-12-31    40.0    10
1    Govtbond    2    120    0.0355    2018-06-30    42.0    12
2    Govtbond    3    180    0.0410    2015-09-30    51.0    1
3    Finbond    4    60    0.0330    2019-12-31    20.0    2
4    Finbond    5    120    0.0365    2018-06-30    34.0    13
5    CorpbondAAA    6    36    0.0380    2018-06-30    45.0    15
6    CorpbondAAA    7    60    0.0400    2020-06-30    12.0    4
7    CorpbondAAA    8    120    0.0490    2020-06-30    35.0    16
8    CorpbondAA    9    12    0.0480    2020-06-30    33.0    14
9    CorpbondAA    10    60    0.0510    2020-06-30    29.0    17

# 3. When the function returns a list instead of a single value and result_type=None
df_bak3['new_amount'] = df_bak3.apply(lambda x: [1,3], axis=1)
df_bak3
# output:
asset_name    asset_id    bond_period    rate    buying_time    amount    new_amount
0    Govtbond    1    5    0.0315    2019-12-31    30.0    [1, 3]
1    Govtbond    2    10    0.0355    2018-06-30    30.0    [1, 3]
2    Govtbond    3    15    0.0410    2015-09-30    50.0    [1, 3]
3    Finbond    4    5    0.0330    2019-12-31    18.0    [1, 3]
4    Finbond    5    10    0.0365    2018-06-30    21.0    [1, 3]
5    CorpbondAAA    6    3    0.0380    2018-06-30    30.0    [1, 3]
6    CorpbondAAA    7    5    0.0400    2020-06-30    8.0    [1, 3]
7    CorpbondAAA    8    10    0.0490    2020-06-30    19.0    [1, 3]
8    CorpbondAA    9    1    0.0480    2020-06-30    19.0    [1, 3]
9    CorpbondAA    10    5    0.0510    2020-06-30    12.0    [1, 3]

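# 4. When the function returns a list and result_type='expand': the result has two columns, but it is assigned to a single column, so only the first value is kept (recent pandas versions may raise a ValueError for this assignment instead).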
df_bak3['new_amount'] = df_bak3.apply(lambda x: [1,3], axis=1, result_type='expand')
df_bak3
# output:
asset_name    asset_id    bond_period    rate    buying_time    amount    new_amount
0    Govtbond    1    5    0.0315    2019-12-31    30.0    1
1    Govtbond    2    10    0.0355    2018-06-30    30.0    1
2    Govtbond    3    15    0.0410    2015-09-30    50.0    1
3    Finbond    4    5    0.0330    2019-12-31    18.0    1
4    Finbond    5    10    0.0365    2018-06-30    21.0    1
5    CorpbondAAA    6    3    0.0380    2018-06-30    30.0    1
6    CorpbondAAA    7    5    0.0400    2020-06-30    8.0    1
7    CorpbondAAA    8    10    0.0490    2020-06-30    19.0    1
8    CorpbondAA    9    1    0.0480    2020-06-30    19.0    1
9    CorpbondAA    10    5    0.0510    2020-06-30    12.0    1
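
To keep every value that result_type='expand' produces, the result can be assigned to as many columns as the function returns. A minimal sketch, assuming a small made-up frame; interest_and_total is a hypothetical helper written for this example.

```python
import pandas as pd

df = pd.DataFrame({"amount": [30.0, 18.0, 8.0], "rate": [0.0315, 0.0330, 0.0400]})

def interest_and_total(row):
    # Hypothetical helper: one period of interest, and amount plus interest
    interest = row["amount"] * row["rate"]
    return [interest, row["amount"] + interest]

# result_type='expand' turns each returned list into columns 0 and 1,
# which are assigned by position to the two target columns
df[["interest", "total"]] = df.apply(interest_and_total, axis=1, result_type="expand")
print(df)
```

For simple arithmetic like this, plain column operations (df['amount'] * df['rate']) are usually faster than apply with axis=1; apply is shown here only to illustrate result_type.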

2.10.2 applymap()

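applymap() applies a function to every element, whereas apply() passes a whole row or column to the function. (Since pandas 2.1, applymap() is deprecated in favor of DataFrame.map().)

# Call apply() and applymap() on the same dataframe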
df_bak3.apply(lambda x: [1,3])
# output:
asset_name     [1, 3]
asset_id       [1, 3]
bond_period    [1, 3]
rate           [1, 3]
buying_time    [1, 3]
amount         [1, 3]
new_amount     [1, 3]
dtype: object

df_bak3.apply(lambda x: [1,3], axis=1)
# output:
0    [1, 3]
1    [1, 3]
2    [1, 3]
3    [1, 3]
4    [1, 3]
5    [1, 3]
6    [1, 3]
7    [1, 3]
8    [1, 3]
9    [1, 3]
dtype: object

df_bak3.applymap(lambda x: [1,3])
# output:
    asset_name    asset_id    bond_period    rate    buying_time    amount    new_amount
0    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]
1    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]
2    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]
3    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]
4    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]
5    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]
6    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]
7    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]
8    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]
9    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]    [1, 3]
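
A more typical use of element-wise application is formatting every cell. The sketch below uses a made-up two-column frame and picks DataFrame.map when it exists (pandas 2.1 or later) and applymap otherwise; the name elementwise is introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({"rate": [0.0315, 0.0355], "amount": [30.0, 18.0]})

# DataFrame.map is the newer name; fall back to applymap on older versions
elementwise = df.map if hasattr(df, "map") else df.applymap
formatted = elementwise(lambda v: f"{v:.2f}")   # called once per cell

# apply receives a whole column (a Series) per call instead of a single cell
col_max = df.apply(lambda col: col.max())

print(formatted, col_max, sep="\n\n")
```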

2.10.3 shift()

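Shifts the data of a dataframe by a number of periods.

DataFrame.shift(periods=1, freq=None, axis=0, fill_value=NoDefault.no_default)
# periods: number of rows to shift by; a negative value shifts up, a positive value shifts down.
# freq: offset from the tseries module or a time rule (usable when the index is datetime-like); the index is shifted instead of the data.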

df_bak3.shift(periods=2)
# output:
asset_name    asset_id    bond_period    rate    buying_time    amount    new_amount
0    NaN    NaN    NaN    NaN    NaN    NaN    NaN
1    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2    Govtbond    1.0    5.0    0.0315    2019-12-31    30.0    15.0
3    Govtbond    2.0    10.0    0.0355    2018-06-30    30.0    4.0
4    Govtbond    3.0    15.0    0.0410    2015-09-30    50.0    14.0
5    Finbond    4.0    5.0    0.0330    2019-12-31    18.0    12.0
6    Finbond    5.0    10.0    0.0365    2018-06-30    21.0    10.0
7    CorpbondAAA    6.0    3.0    0.0380    2018-06-30    30.0    1.0
8    CorpbondAAA    7.0    5.0    0.0400    2020-06-30    8.0    11.0
9    CorpbondAAA    8.0    10.0    0.0490    2020-06-30    19.0    5.0
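
A common use of shift() is comparing each row with the previous one. The following sketch uses a made-up daily price series (the frame prices and its column names are assumptions for illustration) and also shows fill_value and freq.

```python
import pandas as pd

prices = pd.DataFrame(
    {"price": [100.0, 102.0, 101.0, 105.0]},
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

prices["prev"] = prices["price"].shift(1)             # value from the previous row
prices["change"] = prices["price"] - prices["prev"]   # first row is NaN
prices["next_filled"] = prices["price"].shift(-1, fill_value=0.0)  # shift up, fill the gap

# With freq, the index moves instead of the data, so no NaN is introduced
moved = prices[["price"]].shift(periods=1, freq="D")

print(prices, moved, sep="\n\n")
```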

2.10.4 df.groupby()

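groupby() splits the data into groups and then runs computations on each group.

Parameters

Parameter	Meaning	Allowed values	Default
by	Determines the groups	mapping, function, label, or list of labels	None; one of by or level must be given
axis	Axis to group along (deprecated since pandas 2.1)	{0 or 'index', 1 or 'columns'}	0
level	If the axis is a MultiIndex (hierarchical), group by one or more specific levels	int, level name, or sequence of such	None
as_index	For aggregated output, whether the group labels become the index of the result (False keeps them as ordinary columns)	bool	True
sort	Whether to sort the group keys	bool	True
group_keys	When calling apply(), whether to add the group keys to the index to identify the pieces	bool	True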

## grouped is a GroupBy object; nothing has been computed yet
grouped = df_bak3.groupby('asset_name')
grouped
# output:
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000002A080D98FD0>

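1) Group a dataframe by column names or index

The grouped data can then be aggregated, for example with the mean or the variance.

## Group by one column, then call sum(), mean(), std(), etc.
## numeric_only=True skips the string column buying_time, which recent pandas versions would otherwise reject
df_bak3.groupby('asset_name').mean(numeric_only=True)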
# output:
        asset_id    bond_period    rate    amount    new_amount
asset_name                    
CorpbondAA    9.5    3.0    0.049500    15.500000    12.500000
CorpbondAAA    7.0    6.0    0.042333    19.000000    11.666667
Finbond        4.5    7.5    0.034750    19.500000    4.500000
Govtbond    2.0    10.0    0.036000    36.666667    4.666667

## Pass several keys: group by multiple levels, then compute sums, means, standard deviations, etc.
df_bak3.groupby(['asset_name', 'buying_time'], sort=False).sum()
# output:
        asset_id    bond_period    rate    amount    new_amount
asset_name    buying_time                    
Govtbond    2019-12-31    1    5    0.0315    30.0    3
            2018-06-30    2    10    0.0355    30.0    5
            2015-09-30    3    15    0.0410    50.0    6
Finbond        2019-12-31    4    5    0.0330    18.0    8
            2018-06-30    5    10    0.0365    21.0    1
CorpbondAAA    2018-06-30    6    3    0.0380    30.0    7
            2020-06-30    15    15    0.0890    27.0    28
CorpbondAA    2020-06-30    19    6    0.0990    31.0    25

## Use size() to see the number of rows in each group.
df_bak3.groupby(['asset_name', 'buying_time'], sort=False).size()
asset_name   buying_time
Govtbond     2019-12-31     1
             2018-06-30     1
             2015-09-30     1
Finbond      2019-12-31     1
             2018-06-30     1
CorpbondAAA  2018-06-30     1
             2020-06-30     2
CorpbondAA   2020-06-30     2
dtype: int64

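2) GroupBy objects support iteration

Iterating yields 2-tuples of (group name, data chunk); each chunk keeps the index of the original dataframe.

## Single key (pass a label rather than a one-element list, so the key is a scalar instead of a tuple)
for key1, sub_df in df_bak3.groupby('asset_name', sort=False):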
    print(key1)
    print(sub_df)
# output:
Govtbond
  asset_name  asset_id  bond_period    rate buying_time  amount  new_amount
0   Govtbond         1            5  0.0315  2019-12-31    30.0           3
1   Govtbond         2           10  0.0355  2018-06-30    30.0           5
2   Govtbond         3           15  0.0410  2015-09-30    50.0           6
Finbond
  asset_name  asset_id  bond_period    rate buying_time  amount  new_amount
3    Finbond         4            5  0.0330  2019-12-31    18.0           8
4    Finbond         5           10  0.0365  2018-06-30    21.0           1
CorpbondAAA
    asset_name  asset_id  bond_period   rate buying_time  amount  new_amount
5  CorpbondAAA         6            3  0.038  2018-06-30    30.0           7
6  CorpbondAAA         7            5  0.040  2020-06-30     8.0           9
7  CorpbondAAA         8           10  0.049  2020-06-30    19.0          19
CorpbondAA
   asset_name  asset_id  bond_period   rate buying_time  amount  new_amount
8  CorpbondAA         9            1  0.048  2020-06-30    19.0          10
9  CorpbondAA        10            5  0.051  2020-06-30    12.0          15

## With multiple keys, the first element of each tuple is itself a tuple of key values
for (key1, key2), sub_df in df_bak3.groupby(['asset_name', 'buying_time'], sort=False):
    print(key1, key2)
    print(sub_df)

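3) groupby groups along axis=0 by default; it can also group along the other axis

## Group the columns by dtype. 'asset_name' is a column label, not a row label, so it cannot be the key when axis=1.
## Note: axis=1 is deprecated since pandas 2.1.
for dtype, sub_df in df_bak3.groupby(df_bak3.dtypes, axis=1):
    print(dtype, list(sub_df.columns))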

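4) The grouping information can be a dictionary

Map the dataframe's columns through the dictionary's key-value pairs; several columns can map to the same value. Passing the dictionary then groups the columns by the mapped values. Columns missing from the dictionary are dropped from the result.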

mapping={'asset_id':'asset_id','amount': 'amount',
         'new_amount': 'amount', 'buying_time':'buying_time',
        'bond_period':'bond_period', 'rate':'rate'}
df_bak3.groupby(mapping, axis=1).sum()
# output:
amount    asset_id    bond_period    buying_time    rate
0    33.0    1    5    2019-12-31    0.0315
1    35.0    2    10    2018-06-30    0.0355
2    56.0    3    15    2015-09-30    0.0410
3    26.0    4    5    2019-12-31    0.0330
4    22.0    5    10    2018-06-30    0.0365
5    37.0    6    3    2018-06-30    0.0380
6    17.0    7    5    2020-06-30    0.0400
7    38.0    8    10    2020-06-30    0.0490
8    29.0    9    1    2020-06-30    0.0480
9    27.0    10    5    2020-06-30    0.0510

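5) Aggregation with groupby()

After grouping a dataframe, one or more of the remaining columns can be aggregated: mean (mean()), minimum (min()), maximum (max()), sum (sum()), median (median()), variance (var(ddof=0) for the population, var(ddof=1) for a sample), standard deviation (std(ddof=0), std(ddof=1)), and so on. Several call forms are available.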

import pandas as pd
real_constrains = dict(
 name=['a', 'b', 'c', 'd', 'a','c'],
 ret=[0.3,0.12,0.13, 0.21, 6,0.01],
 duration = [0.5,1,6,2,5,10]
 )
constrains_df = pd.DataFrame(real_constrains)

res1 = constrains_df.groupby('name')['ret'].std(ddof=0)
# To apply a different aggregation to each column, pass a dict to agg()
res2 = constrains_df.groupby('name').agg({'ret':'max','duration':'min'})
res3 = constrains_df.groupby('name')[['ret','duration']].min()
res4 = constrains_df.groupby('name').std(ddof=0)['ret']
print(res1,res2,res3,res4)

# output:
name
a    2.85
b    0.00
c    0.06
d    0.00
Name: ret, dtype: float64        ret  duration
name                
a     6.00       0.5
b     0.12       1.0
c     0.13       6.0
d     0.21       2.0        ret  duration
name                
a     0.30       0.5
b     0.12       1.0
c     0.01       6.0
d     0.21       2.0 name
a    2.85
b    0.00
c    0.06
d    0.00
Name: ret, dtype: float64
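
Besides the dict form, agg() also accepts named aggregations, which let the output columns be renamed in the same call. A minimal sketch on the same constrains_df data; the output names max_ret, min_duration, and n_rows are chosen here for illustration.

```python
import pandas as pd

constrains_df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "a", "c"],
    "ret": [0.3, 0.12, 0.13, 0.21, 6, 0.01],
    "duration": [0.5, 1, 6, 2, 5, 10],
})

# new_column=(source_column, aggregation); as_index=False keeps 'name' as a column
summary = constrains_df.groupby("name", as_index=False).agg(
    max_ret=("ret", "max"),
    min_duration=("duration", "min"),
    n_rows=("ret", "count"),
)
print(summary)
```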

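6) transform with groupby

An aggregation returns one row per group, whereas transform returns a result aligned with the original rows, so it can be stored in a new column of the original dataframe with the group's value repeated on each row.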

import pandas as pd
real_constrains = dict(
 name=['a', 'b', 'c', 'd', 'a','c'],
 ret=[0.3,0.12,0.13, 0.21, 6,0.01],
 duration = [0.5,1,6,2,5,10]
 )
constrains_df = pd.DataFrame(real_constrains)

# The aggregated result is indexed by name (a, b, c, d), so it does not align with the row index 0..5 and the new column is all NaN
constrains_df['ave_ret_agg'] = constrains_df.groupby('name')['ret'].mean()
print('Result with aggregation:\n', constrains_df)
# transform returns one value per original row, so the assignment aligns
constrains_df['ave_ret_tra'] = constrains_df.groupby('name')['ret'].transform('mean')
print('Result with transform:\n', constrains_df)

# output:
Result with aggregation:
   name   ret  duration  ave_ret_agg
0    a  0.30       0.5          NaN
1    b  0.12       1.0          NaN
2    c  0.13       6.0          NaN
3    d  0.21       2.0          NaN
4    a  6.00       5.0          NaN
5    c  0.01      10.0          NaN
Result with transform:
   name   ret  duration  ave_ret_agg  ave_ret_tra
0    a  0.30       0.5          NaN         3.15
1    b  0.12       1.0          NaN         0.12
2    c  0.13       6.0          NaN         0.07
3    d  0.21       2.0          NaN         0.21
4    a  6.00       5.0          NaN         3.15
5    c  0.01      10.0          NaN         0.07
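
Two ways of bringing a per-group statistic back onto the original rows are sketched below on the same data: transform, and mapping an aggregated result through the key column. The column names share_of_group and ave_ret_map are made up for this example.

```python
import pandas as pd

constrains_df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "a", "c"],
    "ret": [0.3, 0.12, 0.13, 0.21, 6, 0.01],
})

by_name = constrains_df.groupby("name")["ret"]

# Option 1: transform keeps the original row index, so it can be used in arithmetic
constrains_df["share_of_group"] = constrains_df["ret"] / by_name.transform("sum")

# Option 2: aggregate first, then map the per-group result back through the key column
constrains_df["ave_ret_map"] = constrains_df["name"].map(by_name.mean())

print(constrains_df)
```

The second option is what makes an aggregated result usable as a new column: map() looks up each row's name in the index of the aggregated series.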

2.10.5 df.rolling(n)

The rolling-window function rolling() computes statistics over a sliding window, such as the mean(), sum(), or var() of 10 adjacent rows.

DataFrame.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None, method='single')

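Parameters

Parameter	Meaning	Allowed values	Default
window	Size of the moving window	int, offset (only valid for datetime-like indexes), or BaseIndexer subclass	No default; required
min_periods	Minimum number of observations in the window required to produce a value; otherwise the result is np.nan	int	None: equals the window size when window is an integer, and 1 when window is an offset
center	Whether to set the window label at the center of the window instead of its right edge	bool	False
win_type	Window type (weighting function); None weights all points equally	str	None
on	For a DataFrame, a column label or index level on which to calculate the rolling window, rather than the DataFrame's index	str	Optional
axis	Whether to roll over rows or columns	0 or 'index', 1 or 'columns'	0
closed (pandas version 1.3.0)	'right': the first point in the window is excluded from the calculation. 'left': the last point is excluded. 'both': no points are excluded. 'neither': both the first and the last point are excluded.	str	None

Examples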

tmp_df = pd.DataFrame({'B': [0, 1, 2, np.nan, np.nan,4], 'C': [0.2, 0.5, 0.6, 0.8,np.nan, np.nan]})
tmp_df
# output:
B    C
0    0.0    0.2
1    1.0    0.5
2    2.0    0.6
3    NaN    0.8
4    NaN    NaN
5    4.0    NaN

## 1. Without min_periods, all 3 values in the window must be non-NaN to produce a result
tmp_df.rolling(window=3).sum()
# output:
B    C
0    NaN    NaN
1    NaN    NaN
2    3.0    1.3
3    NaN    1.9
4    NaN    NaN
5    NaN    NaN

## 2. With min_periods=2, only 2 of the 3 values in the window need to be non-NaN
tmp_df.rolling(window=3, min_periods=2).sum()
# output:
B    C
0    NaN    NaN
1    1.0    0.7
2    3.0    1.3
3    3.0    1.9
4    NaN    1.4
5    NaN    NaN

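## 3. Fill the missing values, then compare the two settings of center. With center=True the result is aligned with the index of the middle row of the window; with center=False it is aligned with the index of the last row of the window.
tmp_df.ffill(inplace=True)  # same effect as the deprecated fillna(method='ffill')
tmp_df
# output: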
    B    C
0    0.0    0.2
1    1.0    0.5
2    2.0    0.6
3    2.0    0.8
4    2.0    0.8
5    4.0    0.8
tmp_df.rolling(window=3, center=False).sum()
# output:
B    C
0    NaN    NaN
1    NaN    NaN
2    3.0    1.3
3    5.0    1.9
4    6.0    2.2
5    8.0    2.4
tmp_df.rolling(window=3, center=True).sum()
# output:
B    C
0    NaN    NaN
1    3.0    1.3
2    5.0    1.9
3    6.0    2.2
4    8.0    2.4
5    NaN    NaN
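
The window can also be a time offset when the index is datetime-like. The sketch below contrasts a fixed 2-row window with a 2-day window on a made-up series with a gap in its dates; the names s, by_rows, and by_time are assumptions for illustration.

```python
import pandas as pd

s = pd.Series(
    [1.0, 2.0, 4.0, 8.0],
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04", "2020-01-05"]),
)

# Fixed window: always the last 2 rows, whatever their dates
by_rows = s.rolling(window=2).sum()

# Offset window: all rows within the 2 days ending at each timestamp;
# min_periods defaults to 1 here, so the first row already has a value
by_time = s.rolling(window="2D").sum()

print(by_rows, by_time, sep="\n\n")
```

On 2020-01-04 the two results differ: the row-based window still includes the value from 2020-01-02, while the 2-day window no longer does.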

2.10.6 df.sort_values()

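Sorts a dataframe by the values along the given axis.

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)

Parameters

Parameter	Meaning	Allowed values	Default
by	Name or list of names to sort by	str or list of str	No default; required
axis	Axis to sort along	{0 or 'index', 1 or 'columns'}	0
ascending	Sort ascending (ascending=True) or descending (ascending=False); a list gives one setting per key in by	bool or list of bool	True
kind	Sorting algorithm	{'quicksort', 'mergesort', 'heapsort', 'stable'}	'quicksort'

Examples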

df_bak3.sort_values(by=['asset_name', 'bond_period'], ascending=True)
# output:
    asset_name    asset_id    bond_period    rate    buying_time    amount    new_amount
8    CorpbondAA    9    1    0.0480    2020-06-30    19.0    10
9    CorpbondAA    10    5    0.0510    2020-06-30    12.0    15
5    CorpbondAAA    6    3    0.0380    2018-06-30    30.0    7
6    CorpbondAAA    7    5    0.0400    2020-06-30    8.0    9
7    CorpbondAAA    8    10    0.0490    2020-06-30    19.0    19
3    Finbond    4    5    0.0330    2019-12-31    18.0    8
4    Finbond    5    10    0.0365    2018-06-30    21.0    1
0    Govtbond    1    5    0.0315    2019-12-31    30.0    3
1    Govtbond    2    10    0.0355    2018-06-30    30.0    5
2    Govtbond    3    15    0.0410    2015-09-30    50.0    6
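
The remaining parameters from the signature, such as a per-key ascending list, na_position, and ignore_index, can be combined as in the following sketch. The small frame is made up for illustration and contains one missing bond_period.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "asset_name": ["Govtbond", "Finbond", "Govtbond", "Finbond"],
    "bond_period": [5, 10, 15, np.nan],
})

# asset_name ascending, bond_period descending within each name; NaN placed first
ordered = df.sort_values(
    by=["asset_name", "bond_period"],
    ascending=[True, False],
    na_position="first",
    ignore_index=True,   # renumber the rows 0..n-1
)
print(ordered)
```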

2.10.7 df.cumprod()

Computes the cumulative product of a dataframe along the given axis.

real_constrains = dict(
            ret=[0.3,0.12,0.13, 0.21, 6,0.01],
            duration = [0.1,1,6,2,5,10]
        )
df = pd.DataFrame(real_constrains)
df['temp_cumu_price'] = df['duration']+ 1
df['cumu_price2'] = df['temp_cumu_price'].cumprod(axis=0)
# output:
ret  duration  temp_cumu_price  cumu_price2
0  0.30       0.1              1.1          1.1
1  0.12       1.0              2.0          2.2
2  0.13       6.0              7.0         15.4
3  0.21       2.0              3.0         46.2
4  6.00       5.0              6.0        277.2
5  0.01      10.0             11.0       3049.2

np.prod() can produce the same result, but it needs a loop and is much slower than df.cumprod().

import time
import numpy as np
import pandas as pd

real_constrains = dict(
            ret=np.random.normal(1.151, 0.05, 1000),
            duration = np.random.normal(5, 1, 1000)
        )
df = pd.DataFrame(real_constrains)
df['cumu_price'] = np.nan
# Loop version: recompute the product of a growing prefix for every row
st_prod = time.time()
for idx in range(0, len(df)):
        # .loc avoids chained assignment, which may not write back to df
        df.loc[idx, 'cumu_price'] = np.prod(df['duration'][0:idx + 1] + 1)
et_prod = time.time()
print('all time of prod =', et_prod-st_prod)
# Vectorized version: a single cumprod() call
st_cumprod = time.time()
df['temp_cumu_price'] = df['duration']+ 1
df['cumu_price2'] = df['temp_cumu_price'].cumprod(axis=0)
et_cumprod = time.time()
print('all time of cumprod =', et_cumprod-st_cumprod)
# output:
all time of prod = 0.30219101905822754
all time of cumprod = 0.0009975433349609375
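
A typical application of cumprod() is compounding periodic returns into a cumulative return. This is a minimal sketch with made-up return figures; the names returns, growth, and cumulative_return are assumptions for illustration.

```python
import numpy as np
import pandas as pd

returns = pd.Series([0.10, -0.05, 0.20])   # hypothetical periodic returns

growth = (1 + returns).cumprod()           # value of 1 unit after each period
cumulative_return = growth - 1

# The loop-free result matches np.prod applied to each growing prefix
expected = [np.prod(1 + returns.iloc[: i + 1]) for i in range(len(returns))]

print(growth.tolist())
print(cumulative_return.iloc[-1])
```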

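2.10.8 sum() and cumsum()

sum() totals the given column along the given axis and returns a single value per group. cumsum() returns the running total for every row, so within a group only the last row carries the full group total.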

real_constrains = dict(
    duration = [1,1,6,2,5,10],
            ret=[0.12,0.12,0.13, 0.21, 6,0.01]            
        )

df = pd.DataFrame(real_constrains)
print('Original DF:\n', df)
df['cum_ret'] = df.groupby(['duration'])['ret'].cumsum()
print('After cumsum:\n', df)
# output, original DF:
    duration   ret
0         1  0.12
1         1  0.12
2         6  0.13
3         2  0.21
4         5  6.00
5        10  0.01

# output, after cumsum:
    duration   ret  cum_ret
0         1  0.12     0.12
1         1  0.12     0.24
2         6  0.13     0.13
3         2  0.21     0.21
4         5  6.00     6.00
5        10  0.01     0.01
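
The contrast between sum() and cumsum() on grouped data can be made explicit by computing both, together with transform('sum'), on the same frame. A minimal sketch; the column name group_total is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "duration": [1, 1, 6, 2, 5, 10],
    "ret": [0.12, 0.12, 0.13, 0.21, 6, 0.01],
})
grouped = df.groupby("duration")["ret"]

totals = grouped.sum()                         # one row per group
df["cum_ret"] = grouped.cumsum()               # running total, one value per row
df["group_total"] = grouped.transform("sum")   # group total repeated on every row

print(totals, df, sep="\n\n")
```

For duration 1, cum_ret reads 0.12 and then 0.24, while group_total is 0.24 on both rows.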
