1

Introduction

pandas is a fast, powerful, flexible and easy-to-use open source data analysis and processing tool based on the Python programming language. It contains data cleaning and analysis tools.

Work faster and simpler data structure and operation tools. Pandas is often used in conjunction with other tools, such as numerical calculation tools NumPy and SciPy, analysis libraries statsmodels and scikit-learn, and data visualization library matplotlib.

pandas is based on NumPy arrays. Although pandas uses a large number of NumPy coding grids, the most important difference is that pandas is specifically designed for processing tables and mixed data. NumPy is more suitable for processing unified numeric array data.

This article is a concise tutorial on Pandas.

Object creation

Because Pandas is built based on NumPy arrays, we need to quote Pandas and NumPy at the same time:

In [1]: import numpy as np

In [2]: import pandas as pd

The two main data structures in Pandas are Series and DataFrame.

Series is very similar to a one-dimensional array. It is composed of various data types of NumPy and also contains indexes related to this group of data.

Let's look at an example of Series:

In [3]: pd.Series([1, 3, 5, 6, 8])
Out[3]:
0    1
1    3
2    5
3    6
4    8
dtype: int64

The one on the left is the index, and the one on the right is the value. Because we did not specify the index when creating the Series, the index starts from 0 and ends at n-1.

When a Series is created, you can also pass in np.nan to indicate a null value:

In [4]: pd.Series([1, 3, 5, np.nan, 6, 8])
Out[4]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

DataFrame is a tabular data structure, which contains a set of ordered columns, each column can be a different value type (number, string, boolean, etc.).

DataFrame has both index and column index. It can be regarded as a dictionary composed of Series (using the same index in total).

Look at an example of creating a DataFrame:

In [5]: dates = pd.date_range('20201201', periods=6)

In [6]: dates
Out[6]:
DatetimeIndex(['2020-12-01', '2020-12-02', '2020-12-03', '2020-12-04',
               '2020-12-05', '2020-12-06'],
              dtype='datetime64[ns]', freq='D')

Above we created an index list.

Then use this index to create a DataFrame:

In [7]:  pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
Out[7]:
                   A         B         C         D
2020-12-01  1.536312 -0.318095 -0.737956  0.143352
2020-12-02  1.325221  0.065641 -2.763370 -0.130511
2020-12-03 -1.143560 -0.805807  0.174722  0.427027
2020-12-04 -0.724206  0.050155 -0.648675 -0.645166
2020-12-05  0.182411  0.956385  0.349465 -0.484040
2020-12-06  1.857108  1.245928 -0.767316 -1.890586

The above DataFrame receives three parameters. The first parameter is the table data of the DataFrame, the second parameter is the value of index, which can also be regarded as the row name, and the third parameter is the column name.

You can also directly pass in a dictionary to create a DataFrame:

In [9]: pd.DataFrame({'A': 1.,
   ...:                         'B': pd.Timestamp('20201202'),
   ...:                         'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   ...:                         'D': np.array([3] * 4, dtype='int32'),
   ...:                         'E': pd.Categorical(["test", "train", "test", "train"]),
   ...:                         'F': 'foo'})
   ...:
Out[9]:
     A          B    C  D      E    F
0  1.0 2020-12-02  1.0  3   test  foo
1  1.0 2020-12-02  1.0  3  train  foo
2  1.0 2020-12-02  1.0  3   test  foo
3  1.0 2020-12-02  1.0  3  train  foo

In the above DataFrame, each column has a different data type.

We use a picture to better understand DataFrame and Series:

It is like a table in Excel, with row headers and column headers.

Each column in the DataFrame can be regarded as a Series:

View data

After creating the Series and DataFrame, we can view their data.

Series can get its index and value information through index and values:

In [10]: data1 = pd.Series([1, 3, 5, np.nan, 6, 8])

In [12]: data1.index
Out[12]: RangeIndex(start=0, stop=6, step=1)

In [14]: data1.values
Out[14]: array([ 1.,  3.,  5., nan,  6.,  8.])

DataFrame can be regarded as a collection of Series, so DataFrame has more attributes:

In [16]: df.head()
Out[16]:
                   A         B         C         D
2020-12-01  0.446248 -0.060549 -0.445665 -1.392502
2020-12-02 -1.119749 -1.659776 -0.618656  1.971599
2020-12-03  0.610846  0.216937  0.821258  0.805818
2020-12-04  0.490105  0.732421  0.547129 -0.443274
2020-12-05 -0.475531 -0.853141  0.160017  0.986973

In [17]: df.tail(3)
Out[17]:
                   A         B         C         D
2020-12-04  0.490105  0.732421  0.547129 -0.443274
2020-12-05 -0.475531 -0.853141  0.160017  0.986973
2020-12-06  0.288091 -2.164323  0.193989 -0.197923

Head and tail get the first few rows and the last few rows of the DataFrame respectively.

The same DataFrame also has index and columns:

In [19]: df.index
Out[19]:
DatetimeIndex(['2020-12-01', '2020-12-02', '2020-12-03', '2020-12-04',
               '2020-12-05', '2020-12-06'],
              dtype='datetime64[ns]', freq='D')

In [20]: df.values
Out[20]:
array([[ 0.44624818, -0.0605494 , -0.44566462, -1.39250227],
       [-1.11974917, -1.65977552, -0.61865617,  1.97159943],
       [ 0.61084596,  0.2169369 ,  0.82125808,  0.80581847],
       [ 0.49010504,  0.73242082,  0.54712889, -0.44327351],
       [-0.47553134, -0.85314134,  0.16001748,  0.98697257],
       [ 0.28809148, -2.16432292,  0.19398863, -0.19792266]])

The describe method can perform statistics on data:

In [26]: df.describe()
Out[26]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.040002 -0.631405  0.109679  0.288449
std    0.687872  1.128019  0.556099  1.198847
min   -1.119749 -2.164323 -0.618656 -1.392502
25%   -0.284626 -1.458117 -0.294244 -0.381936
50%    0.367170 -0.456845  0.177003  0.303948
75%    0.479141  0.147565  0.458844  0.941684
max    0.610846  0.732421  0.821258  1.971599

You can also transpose the DataFrame:

In [27]: df.T
Out[27]:
   2020-12-01  2020-12-02  2020-12-03  2020-12-04  2020-12-05  2020-12-06
A    0.446248   -1.119749    0.610846    0.490105   -0.475531    0.288091
B   -0.060549   -1.659776    0.216937    0.732421   -0.853141   -2.164323
C   -0.445665   -0.618656    0.821258    0.547129    0.160017    0.193989
D   -1.392502    1.971599    0.805818   -0.443274    0.986973   -0.197923

You can sort by row and by column:

In [28]: df.sort_index(axis=1, ascending=False)
Out[28]:
                   D         C         B         A
2020-12-01 -1.392502 -0.445665 -0.060549  0.446248
2020-12-02  1.971599 -0.618656 -1.659776 -1.119749
2020-12-03  0.805818  0.821258  0.216937  0.610846
2020-12-04 -0.443274  0.547129  0.732421  0.490105
2020-12-05  0.986973  0.160017 -0.853141 -0.475531
2020-12-06 -0.197923  0.193989 -2.164323  0.288091

In [29]: df.sort_values(by='B')
Out[29]:
                   A         B         C         D
2020-12-06  0.288091 -2.164323  0.193989 -0.197923
2020-12-02 -1.119749 -1.659776 -0.618656  1.971599
2020-12-05 -0.475531 -0.853141  0.160017  0.986973
2020-12-01  0.446248 -0.060549 -0.445665 -1.392502
2020-12-03  0.610846  0.216937  0.821258  0.805818
2020-12-04  0.490105  0.732421  0.547129 -0.443274

Select data

Through the column name of the DataFrame, you can select the Series representing the column:

In [30]: df['A']
Out[30]:
2020-12-01    0.446248
2020-12-02   -1.119749
2020-12-03    0.610846
2020-12-04    0.490105
2020-12-05   -0.475531
2020-12-06    0.288091
Freq: D, Name: A, dtype: float64

Rows can be selected by slicing:

In [31]: df[0:3]
Out[31]:
                   A         B         C         D
2020-12-01  0.446248 -0.060549 -0.445665 -1.392502
2020-12-02 -1.119749 -1.659776 -0.618656  1.971599
2020-12-03  0.610846  0.216937  0.821258  0.805818

Or like this:

In [32]: df['20201202':'20201204']
Out[32]:
                   A         B         C         D
2020-12-02 -1.119749 -1.659776 -0.618656  1.971599
2020-12-03  0.610846  0.216937  0.821258  0.805818
2020-12-04  0.490105  0.732421  0.547129 -0.443274

loc and iloc

Use loc to select data using axis labels.

In [33]: df.loc[:, ['A', 'B']]
Out[33]:
                   A         B
2020-12-01  0.446248 -0.060549
2020-12-02 -1.119749 -1.659776
2020-12-03  0.610846  0.216937
2020-12-04  0.490105  0.732421
2020-12-05 -0.475531 -0.853141
2020-12-06  0.288091 -2.164323

The front is the row selection, and the back is the column selection.

You can also specify the name of the index:

In [34]: df.loc['20201202':'20201204', ['A', 'B']]
Out[34]:
                   A         B
2020-12-02 -1.119749 -1.659776
2020-12-03  0.610846  0.216937
2020-12-04  0.490105  0.732421

If the name of the index is not a slice, it will reduce the dimensionality of the data:

In [35]: df.loc['20201202', ['A', 'B']]
Out[35]:
A   -1.119749
B   -1.659776
Name: 2020-12-02 00:00:00, dtype: float64

If the following column is a constant, return the corresponding value directly:

In [37]: df.loc['20201202', 'A']
Out[37]: -1.1197491665145112

iloc selects data based on the value, for example, we select the third row:

In [42]: df.iloc[3]
Out[42]:
A    0.490105
B    0.732421
C    0.547129
D   -0.443274
Name: 2020-12-04 00:00:00, dtype: float64

It is actually equivalent to df.loc['2020-12-04']:

In [41]: df.loc['2020-12-04']
Out[41]:
A    0.490105
B    0.732421
C    0.547129
D   -0.443274
Name: 2020-12-04 00:00:00, dtype: float64

You can also pass in slices:

In [43]: df.iloc[3:5, 0:2]
Out[43]:
                   A         B
2020-12-04  0.490105  0.732421
2020-12-05 -0.475531 -0.853141

You can pass in list:

In [44]: df.iloc[[1, 2, 4], [0, 2]]
Out[44]:
                   A         C
2020-12-02 -1.119749 -0.618656
2020-12-03  0.610846  0.821258
2020-12-05 -0.475531  0.160017

Take the value of a specific grid:

In [45]: df.iloc[1, 1]
Out[45]: -1.6597755161871708

Boolean index

DataFrame can also be indexed by boolean values. The following is to find out all the elements in column A that are greater than 0:

In [46]: df[df['A'] > 0]
Out[46]:
                   A         B         C         D
2020-12-01  0.446248 -0.060549 -0.445665 -1.392502
2020-12-03  0.610846  0.216937  0.821258  0.805818
2020-12-04  0.490105  0.732421  0.547129 -0.443274
2020-12-06  0.288091 -2.164323  0.193989 -0.197923

Or find out the value greater than 0 in the entire DF:

In [47]: df[df > 0]
Out[47]:
                   A         B         C         D
2020-12-01  0.446248       NaN       NaN       NaN
2020-12-02       NaN       NaN       NaN  1.971599
2020-12-03  0.610846  0.216937  0.821258  0.805818
2020-12-04  0.490105  0.732421  0.547129       NaN
2020-12-05       NaN       NaN  0.160017  0.986973
2020-12-06  0.288091       NaN  0.193989       NaN

You can add a column to DF:

In [48]: df['E'] = ['one', 'one', 'two', 'three', 'four', 'three']

In [49]: df
Out[49]:
                   A         B         C         D      E
2020-12-01  0.446248 -0.060549 -0.445665 -1.392502    one
2020-12-02 -1.119749 -1.659776 -0.618656  1.971599    one
2020-12-03  0.610846  0.216937  0.821258  0.805818    two
2020-12-04  0.490105  0.732421  0.547129 -0.443274  three
2020-12-05 -0.475531 -0.853141  0.160017  0.986973   four
2020-12-06  0.288091 -2.164323  0.193989 -0.197923  three

Use isin() to judge the range value:

In [50]: df[df['E'].isin(['two', 'four'])]
Out[50]:
                   A         B         C         D     E
2020-12-03  0.610846  0.216937  0.821258  0.805818   two
2020-12-05 -0.475531 -0.853141  0.160017  0.986973  four

Dealing with missing data

Now our df has 5 columns of a, b, c, d, and e. If we add another column of f to him, then the initial value of f will be NaN:

In [55]: df.reindex(columns=list(df.columns) + ['F'])
Out[55]:
                   A         B         C         D      E   F
2020-12-01  0.446248 -0.060549 -0.445665 -1.392502    one NaN
2020-12-02 -1.119749 -1.659776 -0.618656  1.971599    one NaN
2020-12-03  0.610846  0.216937  0.821258  0.805818    two NaN
2020-12-04  0.490105  0.732421  0.547129 -0.443274  three NaN
2020-12-05 -0.475531 -0.853141  0.160017  0.986973   four NaN
2020-12-06  0.288091 -2.164323  0.193989 -0.197923  three NaN

We assign values to the first two Fs:

In [74]: df1.iloc[0:2,5]=1

In [75]: df1
Out[75]:
                   A         B         C         D      E    F
2020-12-01  0.446248 -0.060549 -0.445665 -1.392502    one  1.0
2020-12-02 -1.119749 -1.659776 -0.618656  1.971599    one  1.0
2020-12-03  0.610846  0.216937  0.821258  0.805818    two  NaN
2020-12-04  0.490105  0.732421  0.547129 -0.443274  three  NaN
2020-12-05 -0.475531 -0.853141  0.160017  0.986973   four  NaN
2020-12-06  0.288091 -2.164323  0.193989 -0.197923  three  NaN

You can drop all rows that are NaN:

In [76]: df1.dropna(how='any')
Out[76]:
                   A         B         C         D    E    F
2020-12-01  0.446248 -0.060549 -0.445665 -1.392502  one  1.0
2020-12-02 -1.119749 -1.659776 -0.618656  1.971599  one  1.0

You can fill in the value of NaN:

In [77]: df1.fillna(value=5)
Out[77]:
                   A         B         C         D      E    F
2020-12-01  0.446248 -0.060549 -0.445665 -1.392502    one  1.0
2020-12-02 -1.119749 -1.659776 -0.618656  1.971599    one  1.0
2020-12-03  0.610846  0.216937  0.821258  0.805818    two  5.0
2020-12-04  0.490105  0.732421  0.547129 -0.443274  three  5.0
2020-12-05 -0.475531 -0.853141  0.160017  0.986973   four  5.0
2020-12-06  0.288091 -2.164323  0.193989 -0.197923  three  5.0

You can judge the value:

In [78]:  pd.isna(df1)
Out[78]:
                A      B      C      D      E      F
2020-12-01  False  False  False  False  False  False
2020-12-02  False  False  False  False  False  False
2020-12-03  False  False  False  False  False   True
2020-12-04  False  False  False  False  False   True
2020-12-05  False  False  False  False  False   True
2020-12-06  False  False  False  False  False   True

merge

DF can use Concat to merge multiple df, we first create a df:

In [79]: df = pd.DataFrame(np.random.randn(10, 4))

In [80]: df
Out[80]:
          0         1         2         3
0  1.089041  2.010142 -0.532527  0.991669
1  1.303678 -0.614206 -1.358952  0.006290
2 -2.663938  0.600209 -0.008845 -0.036900
3  0.863718 -0.450501  1.325427  0.417345
4  0.789239 -0.492630  0.873732  0.375941
5  0.327177  0.010719 -0.085967 -0.591267
6 -0.014350  1.372144 -0.688845  0.422701
7 -3.355685  0.044306 -0.979253 -2.184240
8 -0.051961  0.649734  1.156918 -0.233725
9 -0.692530  0.057805 -0.030565  0.209416

Then split the DF into three parts:

In [81]: pieces = [df[:3], df[3:7], df[7:]]

Finally, use concat to combine them:

In [82]: pd.concat(pieces)
Out[82]:
          0         1         2         3
0  1.089041  2.010142 -0.532527  0.991669
1  1.303678 -0.614206 -1.358952  0.006290
2 -2.663938  0.600209 -0.008845 -0.036900
3  0.863718 -0.450501  1.325427  0.417345
4  0.789239 -0.492630  0.873732  0.375941
5  0.327177  0.010719 -0.085967 -0.591267
6 -0.014350  1.372144 -0.688845  0.422701
7 -3.355685  0.044306 -0.979253 -2.184240
8 -0.051961  0.649734  1.156918 -0.233725
9 -0.692530  0.057805 -0.030565  0.209416

You can also use join to perform SQL-like merging:

In [83]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

In [84]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [85]: left
Out[85]:
   key  lval
0  foo     1
1  foo     2

In [86]: right
Out[86]:
   key  rval
0  foo     4
1  foo     5

In [87]: pd.merge(left, right, on='key')
Out[87]:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

Grouping

First look at the DF above:

In [99]: df2
Out[99]:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

We can group according to the key to sum:

In [98]: df2.groupby('key').sum()
Out[98]:
     lval  rval
key
foo     6    18

Group can also be carried out in multiple columns:

In [100]: df2.groupby(['key','lval']).sum()
Out[100]:
          rval
key lval
foo 1        9
    2        9

This article has been included in http://www.flydean.com/01-python-pandas-overview/

The most popular interpretation, the most profound dry goods, the most concise tutorial, and many tips you don't know are waiting for you to discover!

Welcome to pay attention to my official account: "Program those things", know technology, know you better!


flydean
890 声望433 粉丝

欢迎访问我的个人网站:www.flydean.com