Introduction
pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool built on the Python programming language. It provides fast, flexible data structures together with data cleaning and analysis tools, and it is often used alongside other libraries, such as the numerical computing tools NumPy and SciPy, the analysis libraries statsmodels and scikit-learn, and the data visualization library matplotlib.
pandas is built on top of NumPy arrays. Although pandas reuses many NumPy idioms, the most important difference is that pandas is designed for working with tabular and mixed-type data, whereas NumPy is better suited to homogeneous numerical array data.
This article is a concise tutorial on Pandas.
Object creation
Because pandas is built on top of NumPy arrays, we need to import both pandas and NumPy:
In [1]: import numpy as np
In [2]: import pandas as pd
The two main data structures in Pandas are Series and DataFrame.
A Series is very similar to a one-dimensional array: it holds a sequence of values of a single NumPy data type, together with an index that labels the data.
Let's look at an example of Series:
In [3]: pd.Series([1, 3, 5, 6, 8])
Out[3]:
0 1
1 3
2 5
3 6
4 8
dtype: int64
The left column is the index and the right column is the values. Because we did not specify an index when creating the Series, a default integer index from 0 to n-1 is used.
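You can also pass an explicit index when creating a Series; a minimal sketch with made-up labels:
s = pd.Series([1, 3, 5, 6, 8], index=['a', 'b', 'c', 'd', 'e'])
# the values are now labeled 'a' through 'e' instead of 0 to 4,
# and can be accessed by label, e.g. s['c'] returns 5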
When a Series is created, you can also pass in np.nan to indicate a null value:
In [4]: pd.Series([1, 3, 5, np.nan, 6, 8])
Out[4]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
A DataFrame is a tabular data structure containing an ordered set of columns, and each column can hold a different value type (numeric, string, boolean, etc.).
A DataFrame has both a row index and a column index. It can be thought of as a dictionary of Series that all share the same index.
Look at an example of creating a DataFrame:
In [5]: dates = pd.date_range('20201201', periods=6)
In [6]: dates
Out[6]:
DatetimeIndex(['2020-12-01', '2020-12-02', '2020-12-03', '2020-12-04',
'2020-12-05', '2020-12-06'],
dtype='datetime64[ns]', freq='D')
Above we created a date index (a DatetimeIndex).
Then use this index to create a DataFrame:
In [7]: pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
Out[7]:
A B C D
2020-12-01 1.536312 -0.318095 -0.737956 0.143352
2020-12-02 1.325221 0.065641 -2.763370 -0.130511
2020-12-03 -1.143560 -0.805807 0.174722 0.427027
2020-12-04 -0.724206 0.050155 -0.648675 -0.645166
2020-12-05 0.182411 0.956385 0.349465 -0.484040
2020-12-06 1.857108 1.245928 -0.767316 -1.890586
The DataFrame above takes three arguments: the first is the table data, the second is the index (which can be regarded as the row labels), and the third is the column labels.
You can also directly pass in a dictionary to create a DataFrame:
In [9]: pd.DataFrame({'A': 1.,
...: 'B': pd.Timestamp('20201202'),
...: 'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...: 'D': np.array([3] * 4, dtype='int32'),
...: 'E': pd.Categorical(["test", "train", "test", "train"]),
...: 'F': 'foo'})
...:
Out[9]:
A B C D E F
0 1.0 2020-12-02 1.0 3 test foo
1 1.0 2020-12-02 1.0 3 train foo
2 1.0 2020-12-02 1.0 3 test foo
3 1.0 2020-12-02 1.0 3 train foo
In the above DataFrame, each column has a different data type.
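To confirm this, you can inspect the column types with the dtypes attribute; a minimal sketch (the variable name mixed is just for illustration):
mixed = pd.DataFrame({'A': 1.,
                      'B': pd.Timestamp('20201202'),
                      'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                      'D': np.array([3] * 4, dtype='int32'),
                      'E': pd.Categorical(["test", "train", "test", "train"]),
                      'F': 'foo'})
mixed.dtypes  # A float64, B datetime64[ns], C float32, D int32, E category, F object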
To better understand the relationship between a DataFrame and a Series: a DataFrame is like a table in Excel, with row headers and column headers, and each column of the DataFrame can be regarded as a Series.
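A minimal sketch of that idea, assuming the date-indexed DataFrame created above has been assigned to the variable df:
col = df['A']                # selecting one column gives a Series
type(col)                    # pandas.core.series.Series
col.index.equals(df.index)   # True: the column shares the DataFrame's row index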
View data
After creating the Series and DataFrame, we can view their data.
A Series exposes its index and value information through the index and values attributes:
In [10]: data1 = pd.Series([1, 3, 5, np.nan, 6, 8])
In [12]: data1.index
Out[12]: RangeIndex(start=0, stop=6, step=1)
In [14]: data1.values
Out[14]: array([ 1., 3., 5., nan, 6., 8.])
A DataFrame can be regarded as a collection of Series, so it has even more attributes. The following examples continue to use the date-indexed DataFrame df (since the data is random, the values differ from the earlier output):
In [16]: df.head()
Out[16]:
A B C D
2020-12-01 0.446248 -0.060549 -0.445665 -1.392502
2020-12-02 -1.119749 -1.659776 -0.618656 1.971599
2020-12-03 0.610846 0.216937 0.821258 0.805818
2020-12-04 0.490105 0.732421 0.547129 -0.443274
2020-12-05 -0.475531 -0.853141 0.160017 0.986973
In [17]: df.tail(3)
Out[17]:
A B C D
2020-12-04 0.490105 0.732421 0.547129 -0.443274
2020-12-05 -0.475531 -0.853141 0.160017 0.986973
2020-12-06 0.288091 -2.164323 0.193989 -0.197923
head() and tail() return the first and last few rows of the DataFrame, respectively.
Likewise, a DataFrame has index and values attributes:
In [19]: df.index
Out[19]:
DatetimeIndex(['2020-12-01', '2020-12-02', '2020-12-03', '2020-12-04',
'2020-12-05', '2020-12-06'],
dtype='datetime64[ns]', freq='D')
In [20]: df.values
Out[20]:
array([[ 0.44624818, -0.0605494 , -0.44566462, -1.39250227],
[-1.11974917, -1.65977552, -0.61865617, 1.97159943],
[ 0.61084596, 0.2169369 , 0.82125808, 0.80581847],
[ 0.49010504, 0.73242082, 0.54712889, -0.44327351],
[-0.47553134, -0.85314134, 0.16001748, 0.98697257],
[ 0.28809148, -2.16432292, 0.19398863, -0.19792266]])
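Besides index and values, a DataFrame also exposes its column labels and per-column data types; a short sketch:
df.columns  # Index(['A', 'B', 'C', 'D'], dtype='object')
df.dtypes   # every column is float64 here, since the data came from np.random.randn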
The describe method produces summary statistics of the data:
In [26]: df.describe()
Out[26]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.040002 -0.631405 0.109679 0.288449
std 0.687872 1.128019 0.556099 1.198847
min -1.119749 -2.164323 -0.618656 -1.392502
25% -0.284626 -1.458117 -0.294244 -0.381936
50% 0.367170 -0.456845 0.177003 0.303948
75% 0.479141 0.147565 0.458844 0.941684
max 0.610846 0.732421 0.821258 1.971599
You can also transpose the DataFrame:
In [27]: df.T
Out[27]:
2020-12-01 2020-12-02 2020-12-03 2020-12-04 2020-12-05 2020-12-06
A 0.446248 -1.119749 0.610846 0.490105 -0.475531 0.288091
B -0.060549 -1.659776 0.216937 0.732421 -0.853141 -2.164323
C -0.445665 -0.618656 0.821258 0.547129 0.160017 0.193989
D -1.392502 1.971599 0.805818 -0.443274 0.986973 -0.197923
You can sort by index labels or by values:
In [28]: df.sort_index(axis=1, ascending=False)
Out[28]:
D C B A
2020-12-01 -1.392502 -0.445665 -0.060549 0.446248
2020-12-02 1.971599 -0.618656 -1.659776 -1.119749
2020-12-03 0.805818 0.821258 0.216937 0.610846
2020-12-04 -0.443274 0.547129 0.732421 0.490105
2020-12-05 0.986973 0.160017 -0.853141 -0.475531
2020-12-06 -0.197923 0.193989 -2.164323 0.288091
In [29]: df.sort_values(by='B')
Out[29]:
A B C D
2020-12-06 0.288091 -2.164323 0.193989 -0.197923
2020-12-02 -1.119749 -1.659776 -0.618656 1.971599
2020-12-05 -0.475531 -0.853141 0.160017 0.986973
2020-12-01 0.446248 -0.060549 -0.445665 -1.392502
2020-12-03 0.610846 0.216937 0.821258 0.805818
2020-12-04 0.490105 0.732421 0.547129 -0.443274
Select data
Using a column name, you can select the Series representing that column:
In [30]: df['A']
Out[30]:
2020-12-01 0.446248
2020-12-02 -1.119749
2020-12-03 0.610846
2020-12-04 0.490105
2020-12-05 -0.475531
2020-12-06 0.288091
Freq: D, Name: A, dtype: float64
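When a column name is a valid Python identifier, the same Series can also be selected via attribute access:
df.A  # equivalent to df['A']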
Rows can be selected by slicing:
In [31]: df[0:3]
Out[31]:
A B C D
2020-12-01 0.446248 -0.060549 -0.445665 -1.392502
2020-12-02 -1.119749 -1.659776 -0.618656 1.971599
2020-12-03 0.610846 0.216937 0.821258 0.805818
Or slice with index labels:
In [32]: df['20201202':'20201204']
Out[32]:
A B C D
2020-12-02 -1.119749 -1.659776 -0.618656 1.971599
2020-12-03 0.610846 0.216937 0.821258 0.805818
2020-12-04 0.490105 0.732421 0.547129 -0.443274
loc and iloc
Use loc to select data using axis labels.
In [33]: df.loc[:, ['A', 'B']]
Out[33]:
A B
2020-12-01 0.446248 -0.060549
2020-12-02 -1.119749 -1.659776
2020-12-03 0.610846 0.216937
2020-12-04 0.490105 0.732421
2020-12-05 -0.475531 -0.853141
2020-12-06 0.288091 -2.164323
The part before the comma selects rows, and the part after it selects columns.
You can also pass a slice of index labels:
In [34]: df.loc['20201202':'20201204', ['A', 'B']]
Out[34]:
A B
2020-12-02 -1.119749 -1.659776
2020-12-03 0.610846 0.216937
2020-12-04 0.490105 0.732421
If a single index label is passed instead of a slice, the dimensionality of the result is reduced:
In [35]: df.loc['20201202', ['A', 'B']]
Out[35]:
A -1.119749
B -1.659776
Name: 2020-12-02 00:00:00, dtype: float64
If both the row and the column are single labels, the corresponding scalar value is returned directly:
In [37]: df.loc['20201202', 'A']
Out[37]: -1.1197491665145112
iloc selects data by integer position; for example, select the row at position 3 (the fourth row):
In [42]: df.iloc[3]
Out[42]:
A 0.490105
B 0.732421
C 0.547129
D -0.443274
Name: 2020-12-04 00:00:00, dtype: float64
It is actually equivalent to df.loc['2020-12-04']:
In [41]: df.loc['2020-12-04']
Out[41]:
A 0.490105
B 0.732421
C 0.547129
D -0.443274
Name: 2020-12-04 00:00:00, dtype: float64
You can also pass in slices:
In [43]: df.iloc[3:5, 0:2]
Out[43]:
A B
2020-12-04 0.490105 0.732421
2020-12-05 -0.475531 -0.853141
You can also pass in lists of positions:
In [44]: df.iloc[[1, 2, 4], [0, 2]]
Out[44]:
A C
2020-12-02 -1.119749 -0.618656
2020-12-03 0.610846 0.821258
2020-12-05 -0.475531 0.160017
Take the value of a specific cell:
In [45]: df.iloc[1, 1]
Out[45]: -1.6597755161871708
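For single-cell access by position, pandas also provides iat, a faster scalar accessor; a minimal sketch:
df.iat[1, 1]  # same cell as df.iloc[1, 1], but optimized for single-value access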
Boolean indexing
A DataFrame can also be indexed by boolean values. The following selects all rows where column A is greater than 0:
In [46]: df[df['A'] > 0]
Out[46]:
A B C D
2020-12-01 0.446248 -0.060549 -0.445665 -1.392502
2020-12-03 0.610846 0.216937 0.821258 0.805818
2020-12-04 0.490105 0.732421 0.547129 -0.443274
2020-12-06 0.288091 -2.164323 0.193989 -0.197923
Or select the values greater than 0 across the entire DataFrame (positions that do not match become NaN):
In [47]: df[df > 0]
Out[47]:
A B C D
2020-12-01 0.446248 NaN NaN NaN
2020-12-02 NaN NaN NaN 1.971599
2020-12-03 0.610846 0.216937 0.821258 0.805818
2020-12-04 0.490105 0.732421 0.547129 NaN
2020-12-05 NaN NaN 0.160017 0.986973
2020-12-06 0.288091 NaN 0.193989 NaN
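Boolean conditions can be combined with & (and), | (or) and ~ (not), with each condition wrapped in parentheses; a minimal sketch:
df[(df['A'] > 0) & (df['B'] < 0)]  # rows where A is positive and B is negative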
You can add a column to the DataFrame:
In [48]: df['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
In [49]: df
Out[49]:
A B C D E
2020-12-01 0.446248 -0.060549 -0.445665 -1.392502 one
2020-12-02 -1.119749 -1.659776 -0.618656 1.971599 one
2020-12-03 0.610846 0.216937 0.821258 0.805818 two
2020-12-04 0.490105 0.732421 0.547129 -0.443274 three
2020-12-05 -0.475531 -0.853141 0.160017 0.986973 four
2020-12-06 0.288091 -2.164323 0.193989 -0.197923 three
Use isin() to filter on membership in a set of values:
In [50]: df[df['E'].isin(['two', 'four'])]
Out[50]:
A B C D E
2020-12-03 0.610846 0.216937 0.821258 0.805818 two
2020-12-05 -0.475531 -0.853141 0.160017 0.986973 four
Dealing with missing data
Our df now has five columns, A through E. If we reindex it with an additional column F, the new column is initialized to NaN (in the examples below the reindexed result is assigned to df1):
In [55]: df.reindex(columns=list(df.columns) + ['F'])
Out[55]:
A B C D E F
2020-12-01 0.446248 -0.060549 -0.445665 -1.392502 one NaN
2020-12-02 -1.119749 -1.659776 -0.618656 1.971599 one NaN
2020-12-03 0.610846 0.216937 0.821258 0.805818 two NaN
2020-12-04 0.490105 0.732421 0.547129 -0.443274 three NaN
2020-12-05 -0.475531 -0.853141 0.160017 0.986973 four NaN
2020-12-06 0.288091 -2.164323 0.193989 -0.197923 three NaN
We assign a value to the first two rows of column F:
In [74]: df1.iloc[0:2, 5] = 1
In [75]: df1
Out[75]:
A B C D E F
2020-12-01 0.446248 -0.060549 -0.445665 -1.392502 one 1.0
2020-12-02 -1.119749 -1.659776 -0.618656 1.971599 one 1.0
2020-12-03 0.610846 0.216937 0.821258 0.805818 two NaN
2020-12-04 0.490105 0.732421 0.547129 -0.443274 three NaN
2020-12-05 -0.475531 -0.853141 0.160017 0.986973 four NaN
2020-12-06 0.288091 -2.164323 0.193989 -0.197923 three NaN
You can drop all rows that contain NaN:
In [76]: df1.dropna(how='any')
Out[76]:
A B C D E F
2020-12-01 0.446248 -0.060549 -0.445665 -1.392502 one 1.0
2020-12-02 -1.119749 -1.659776 -0.618656 1.971599 one 1.0
You can fill NaN with a given value:
In [77]: df1.fillna(value=5)
Out[77]:
A B C D E F
2020-12-01 0.446248 -0.060549 -0.445665 -1.392502 one 1.0
2020-12-02 -1.119749 -1.659776 -0.618656 1.971599 one 1.0
2020-12-03 0.610846 0.216937 0.821258 0.805818 two 5.0
2020-12-04 0.490105 0.732421 0.547129 -0.443274 three 5.0
2020-12-05 -0.475531 -0.853141 0.160017 0.986973 four 5.0
2020-12-06 0.288091 -2.164323 0.193989 -0.197923 three 5.0
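fillna also accepts a dictionary mapping column names to fill values, so different columns can use different defaults; a short sketch:
df1.fillna(value={'F': 0})  # fill NaN only in column F with 0, leaving other columns untouched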
You can check which values are missing:
In [78]: pd.isna(df1)
Out[78]:
A B C D E F
2020-12-01 False False False False False False
2020-12-02 False False False False False False
2020-12-03 False False False False False True
2020-12-04 False False False False False True
2020-12-05 False False False False False True
2020-12-06 False False False False False True
Merge
Multiple DataFrames can be combined with concat. First, create a DataFrame:
In [79]: df = pd.DataFrame(np.random.randn(10, 4))
In [80]: df
Out[80]:
0 1 2 3
0 1.089041 2.010142 -0.532527 0.991669
1 1.303678 -0.614206 -1.358952 0.006290
2 -2.663938 0.600209 -0.008845 -0.036900
3 0.863718 -0.450501 1.325427 0.417345
4 0.789239 -0.492630 0.873732 0.375941
5 0.327177 0.010719 -0.085967 -0.591267
6 -0.014350 1.372144 -0.688845 0.422701
7 -3.355685 0.044306 -0.979253 -2.184240
8 -0.051961 0.649734 1.156918 -0.233725
9 -0.692530 0.057805 -0.030565 0.209416
Then split the DataFrame into three pieces:
In [81]: pieces = [df[:3], df[3:7], df[7:]]
Finally, use concat to combine them:
In [82]: pd.concat(pieces)
Out[82]:
0 1 2 3
0 1.089041 2.010142 -0.532527 0.991669
1 1.303678 -0.614206 -1.358952 0.006290
2 -2.663938 0.600209 -0.008845 -0.036900
3 0.863718 -0.450501 1.325427 0.417345
4 0.789239 -0.492630 0.873732 0.375941
5 0.327177 0.010719 -0.085967 -0.591267
6 -0.014350 1.372144 -0.688845 0.422701
7 -3.355685 0.044306 -0.979253 -2.184240
8 -0.051961 0.649734 1.156918 -0.233725
9 -0.692530 0.057805 -0.030565 0.209416
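If the original row labels are not needed, concat can renumber the result; a minimal sketch:
pd.concat(pieces, ignore_index=True)  # assigns a fresh 0..n-1 index instead of keeping the labels of each piece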
You can also use merge to perform SQL-style joins:
In [83]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [84]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [85]: left
Out[85]:
key lval
0 foo 1
1 foo 2
In [86]: right
Out[86]:
key rval
0 foo 4
1 foo 5
In [87]: pd.merge(left, right, on='key')
Out[87]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
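merge also supports the usual SQL join types via the how parameter; a short sketch with made-up keys for illustration:
left2 = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right2 = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})
pd.merge(left2, right2, on='key', how='inner')  # keeps only 'foo', the key present in both
pd.merge(left2, right2, on='key', how='outer')  # keeps 'foo', 'bar' and 'baz', with NaN where there is no match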
Grouping
Start with the merged result above, assigned to df2:
In [99]: df2
Out[99]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
We can group by key and sum each group:
In [98]: df2.groupby('key').sum()
Out[98]:
lval rval
key
foo 6 18
Grouping can also be done on multiple columns:
In [100]: df2.groupby(['key','lval']).sum()
Out[100]:
rval
key lval
foo 1 9
2 9
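Instead of a single sum, groupby can apply several aggregations at once through agg; a minimal sketch:
df2.groupby('key').agg(['sum', 'mean'])  # one sum and one mean column for each numeric column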
This article is also published at http://www.flydean.com/01-python-pandas-overview/