Pandas: Pandas Advanced Tutorial Takes Titanic Real Data as an Example

Introduction

Today we will explain the advanced tutorial of Pandas, including reading and writing files, selecting subsets, and graphical representation.

Read and write files

A key step of data processing is to read the file for analysis, and then write the analysis and processing results to the file again.

Pandas supports reading and writing of multiple file formats:

In [108]: pd.read_
 read_clipboard() read_excel()     read_fwf()       read_hdf()       read_json        read_parquet     read_sas         read_sql_query   read_stata
 read_csv         read_feather()   read_gbq()       read_html        read_msgpack     read_pickle      read_sql         read_sql_table   read_table

Next, we will take Titanic.csv provided by Pandas official website as an example to explain the use of Pandas.

Titanic.csv provides information about more than 800 passengers on the Titanic, which is a matrix of 891 rows x 12 columns.

We use Pandas to read this csv:

In [5]: titanic=pd.read_csv("titanic.csv")

The read_csv method will convert the csv file into pandas DataFrame .

By default, we directly use the DF variable, which will display the first 5 rows and the last 5 rows of data by default:

In [3]: titanic
Out[3]: 
     PassengerId  Survived  Pclass                                               Name     Sex  ...  Parch            Ticket     Fare Cabin  Embarked
0              1         0       3                            Braund, Mr. Owen Harris    male  ...      0         A/5 21171   7.2500   NaN         S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  ...      0          PC 17599  71.2833   C85         C
2              3         1       3                             Heikkinen, Miss. Laina  female  ...      0  STON/O2. 3101282   7.9250   NaN         S
3              4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  ...      0            113803  53.1000  C123         S
4              5         0       3                           Allen, Mr. William Henry    male  ...      0            373450   8.0500   NaN         S
..           ...       ...     ...                                                ...     ...  ...    ...               ...      ...   ...       ...
886          887         0       2                              Montvila, Rev. Juozas    male  ...      0            211536  13.0000   NaN         S
887          888         1       1                       Graham, Miss. Margaret Edith  female  ...      0            112053  30.0000   B42         S
888          889         0       3           Johnston, Miss. Catherine Helen "Carrie"  female  ...      2        W./C. 6607  23.4500   NaN         S
889          890         1       1                              Behr, Mr. Karl Howell    male  ...      0            111369  30.0000  C148         C
890          891         0       3                                Dooley, Mr. Patrick    male  ...      0            370376   7.7500   NaN         Q

[891 rows x 12 columns]

You can use head(n) and tail(n) to specify a specific number of rows:

In [4]: titanic.head(8)
Out[4]: 
   PassengerId  Survived  Pclass                                               Name     Sex  ...  Parch            Ticket     Fare Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  ...      0         A/5 21171   7.2500   NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  ...      0          PC 17599  71.2833   C85         C
2            3         1       3                             Heikkinen, Miss. Laina  female  ...      0  STON/O2. 3101282   7.9250   NaN         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  ...      0            113803  53.1000  C123         S
4            5         0       3                           Allen, Mr. William Henry    male  ...      0            373450   8.0500   NaN         S
5            6         0       3                                   Moran, Mr. James    male  ...      0            330877   8.4583   NaN         Q
6            7         0       1                            McCarthy, Mr. Timothy J    male  ...      0             17463  51.8625   E46         S
7            8         0       3                     Palsson, Master. Gosta Leonard    male  ...      1            349909  21.0750   NaN         S

[8 rows x 12 columns]

Use dtypes to view the data type of each column:

In [5]: titanic.dtypes
Out[5]: 
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Use to_excel to convert DF to excel file, use read_excel to read excel file again:

In [11]: titanic.to_excel('titanic.xlsx', sheet_name='passengers', index=False)

In [12]: titanic = pd.read_excel('titanic.xlsx', sheet_name='passengers')

Use info() to make a preliminary statistics on DF:

In [14]: titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

DF choice

Select column data

DF's head or tail method can only display all column data, the following method can select specific column data.

In [15]: ages = titanic["Age"]

In [16]: ages.head()
Out[16]:
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

Each column is a Series:

In [6]: type(titanic["Age"])
Out[6]: pandas.core.series.Series

In [7]: titanic["Age"].shape
Out[7]: (891,)

You can also select multiple:

In [8]: age_sex = titanic[["Age", "Sex"]]

In [9]: age_sex.head()
Out[9]: 
    Age     Sex
0  22.0    male
1  38.0  female
2  26.0  female
3  35.0  female
4  35.0    male

If you select multiple columns, the returned result is a DF type:

In [10]: type(titanic[["Age", "Sex"]])
Out[10]: pandas.core.frame.DataFrame

In [11]: titanic[["Age", "Sex"]].shape
Out[11]: (891, 2)

Select row data

Above we talked about how to select column data, let's take a look at how to select row data:

Select customers older than 35 years old:

In [12]: above_35 = titanic[titanic["Age"] > 35]

In [13]: above_35.head()
Out[13]: 
    PassengerId  Survived  Pclass                                               Name     Sex  ...  Parch    Ticket     Fare Cabin  Embarked
1             2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  ...      0  PC 17599  71.2833   C85         C
6             7         0       1                            McCarthy, Mr. Timothy J    male  ...      0     17463  51.8625   E46         S
11           12         1       1                           Bonnell, Miss. Elizabeth  female  ...      0    113783  26.5500  C103         S
13           14         0       3                        Andersson, Mr. Anders Johan    male  ...      5    347082  31.2750   NaN         S
15           16         1       2                   Hewlett, Mrs. (Mary D Kingcome)   female  ...      0    248706  16.0000   NaN         S

[5 rows x 12 columns]

Use isin to select all customers with Pclass in 2 and 3:

In [16]: class_23 = titanic[titanic["Pclass"].isin([2, 3])]
In [17]: class_23.head()
Out[17]: 
   PassengerId  Survived  Pclass                            Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3         Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
2            3         1       3          Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
4            5         0       3        Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
5            6         0       3                Moran, Mr. James    male   NaN      0      0            330877   8.4583   NaN        Q
7            8         0       3  Palsson, Master. Gosta Leonard    male   2.0      3      1            349909  21.0750   NaN        S

The above isin is equal to:

In [18]: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]

Filter Age is not empty:

In [20]: age_no_na = titanic[titanic["Age"].notna()]

In [21]: age_no_na.head()
Out[21]: 
   PassengerId  Survived  Pclass                                               Name     Sex  ...  Parch            Ticket     Fare Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  ...      0         A/5 21171   7.2500   NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  ...      0          PC 17599  71.2833   C85         C
2            3         1       3                             Heikkinen, Miss. Laina  female  ...      0  STON/O2. 3101282   7.9250   NaN         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  ...      0            113803  53.1000  C123         S
4            5         0       3                           Allen, Mr. William Henry    male  ...      0            373450   8.0500   NaN         S

[5 rows x 12 columns]

Select rows and columns at the same time

We can select rows and columns at the same time.

You can use loc and iloc to select rows and columns. The difference between the two is that loc uses names for selection, and iloc uses numbers for selection.

Select the passenger name of age>35:

In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

In [24]: adult_names.head()
Out[24]: 
1     Cumings, Mrs. John Bradley (Florence Briggs Th...
6                               McCarthy, Mr. Timothy J
11                             Bonnell, Miss. Elizabeth
13                          Andersson, Mr. Anders Johan
15                     Hewlett, Mrs. (Mary D Kingcome) 
Name: Name, dtype: object

The first value in loc indicates row selection, and the second value indicates column selection.

Use iloc to select:

In [25]: titanic.iloc[9:25, 2:5]
Out[25]: 
    Pclass                                 Name     Sex
9        2  Nasser, Mrs. Nicholas (Adele Achem)  female
10       3      Sandstrom, Miss. Marguerite Rut  female
11       1             Bonnell, Miss. Elizabeth  female
12       3       Saundercock, Mr. William Henry    male
13       3          Andersson, Mr. Anders Johan    male
..     ...                                  ...     ...
20       2                 Fynney, Mr. Joseph J    male
21       2                Beesley, Mr. Lawrence    male
22       3          McGowan, Miss. Anna "Annie"  female
23       1         Sloper, Mr. William Thompson    male
24       3        Palsson, Miss. Torborg Danira  female

[16 rows x 3 columns]

Use plots to draw

How to convert DF into a diversified graphic display?

To use matplotlib for drawing on the command line, you need to start the QT environment of ipython:

ipython qtconsole --pylab=inline

Use plot directly to show the passenger information we read above:

import matplotlib.pyplot as plt

import pandas as pd

titanic = pd.read_excel('titanic.xlsx', sheet_name='passengers')

titanic.plot()

The abscissa is the index in DF, and the column coordinate is the name of each column. Note that the above column only shows numeric types.

We only show age information:

titanic['Age'].plot()

The default is a histogram, we can convert the form of the graph, such as a dot graph:

titanic.plot.scatter(x="PassengerId",y="Age", alpha=0.5)

Choose PassengerId in the data as the x-axis and age as the y-axis:

In addition to scatter plots, many other images are also supported:

[method_name for method_name in dir(titanic.plot) if not method_name.startswith("_")]
Out[11]: 
['area',
 'bar',
 'barh',
 'box',
 'density',
 'hexbin',
 'hist',
 'kde',
 'line',
 'pie',
 'scatter']

Look at another box diagram:

titanic['Age'].plot.box()

It can be seen that the age of passengers is mostly concentrated between 20-40 years old.

You can also plot the selected multiple columns separately:

titanic.plot.area(figsize=(12, 4), subplots=True)

Specify specific columns:

titanic[['Age','Pclass']].plot.area(figsize=(12, 4), subplots=True)

You can also draw the picture first, and then fill it:

fig, axs = plt.subplots(figsize=(12, 4));

First draw an empty picture, and then fill it:

titanic['Age'].plot.area(ax=axs);

axs.set_ylabel("Age");

fig

Use existing columns to create new columns

Sometimes, we need to transform existing columns to get new columns. For example, if we want to add an Age2 column whose value is Age column+10, we can do this:

titanic["Age2"]=titanic["Age"]+10;

titanic[["Age","Age2"]].head()
Out[34]: 
    Age  Age2
0  22.0  32.0
1  38.0  48.0
2  26.0  36.0
3  35.0  45.0
4  35.0  45.0

You can also rename the column:

titanic_renamed = titanic.rename(
   ...:     columns={"Age": "Age2",
   ...:              "Pclass": "Pclas2"})

Convert column names to lowercase:

titanic_renamed = titanic_renamed.rename(columns=str.lower)

Make statistics

Let's count the average age of passengers:

titanic["Age"].mean()
Out[35]: 29.69911764705882

Choose median:

titanic[["Age", "Fare"]].median()
Out[36]: 
Age     28.0000
Fare    14.4542
dtype: float64

More information:

titanic[["Age", "Fare"]].describe()
Out[37]: 
              Age        Fare
count  714.000000  891.000000
mean    29.699118   32.204208
std     14.526497   49.693429
min      0.420000    0.000000
25%     20.125000    7.910400
50%     28.000000   14.454200
75%     38.000000   31.000000
max     80.000000  512.329200

Use agg to specify a specific aggregation method:

titanic.agg({'Age': ['min', 'max', 'median', 'skew'],'Fare': ['min', 'max', 'median', 'mean']})
Out[38]: 
              Age        Fare
max     80.000000  512.329200
mean          NaN   32.204208
median  28.000000   14.454200
min      0.420000    0.000000
skew     0.389108         NaN

You can use groupby:

titanic[["Sex", "Age"]].groupby("Sex").mean()
Out[39]: 
              Age
Sex              
female  27.915709
male    30.726645

All columns in groupby:

titanic.groupby("Sex").mean()
Out[40]: 
        PassengerId  Survived    Pclass        Age     SibSp     Parch  
Sex                                                                      
female   431.028662  0.742038  2.159236  27.915709  0.694268  0.649682   
male     454.147314  0.188908  2.389948  30.726645  0.429809  0.235702

You can also select specific columns after groupby:

titanic.groupby("Sex")["Age"].mean()
Out[41]: 
Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

Count can be classified:

titanic["Pclass"].value_counts()
Out[42]: 
3    491
1    216
2    184
Name: Pclass, dtype: int64

The above is equivalent to:

titanic.groupby("Pclass")["Pclass"].count()

DF reorganization

You can sort according to a certain column:

titanic.sort_values(by="Age").head()
Out[43]: 
     PassengerId  Survived  Pclass                             Name     Sex  \
803          804         1       3  Thomas, Master. Assad Alexander    male   
755          756         1       2        Hamalainen, Master. Viljo    male   
644          645         1       3           Baclini, Miss. Eugenie  female   
469          470         1       3    Baclini, Miss. Helene Barbara  female   
78            79         1       2    Caldwell, Master. Alden Gates    male

Sort by multiple columns:

titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head()
Out[44]: 
     PassengerId  Survived  Pclass                       Name     Sex   Age  \
851          852         0       3        Svensson, Mr. Johan    male  74.0   
116          117         0       3       Connors, Mr. Patrick    male  70.5   
280          281         0       3           Duane, Mr. Frank    male  65.0   
483          484         1       3     Turkula, Mrs. (Hedwig)  female  63.0   
326          327         0       3  Nysveen, Mr. Johan Hansen    male  61.0

Select specific rows and columns of data. In the following example, we will select part of the data whose gender is female:

female=titanic[titanic['Sex']=='female']

female_subset=female[["Age","Pclass","PassengerId","Survived"]].sort_values(["Pclass"]).groupby(["Pclass"]).head(2)

female_subset
Out[58]: 
      Age  Pclass  PassengerId  Survived
1    38.0       1            2         1
356  22.0       1          357         1
726  30.0       2          727         1
443  28.0       2          444         1
855  18.0       3          856         1
654  18.0       3          655         0

Use pivot to convert the axis:

female_subset.pivot(columns="Pclass", values="Age")
Out[62]: 
Pclass     1     2     3
1       38.0   NaN   NaN
356     22.0   NaN   NaN
443      NaN  28.0   NaN
654      NaN   NaN  18.0
726      NaN  30.0   NaN
855      NaN   NaN  18.0

female_subset.pivot(columns="Pclass", values="Age").plot()

This article has been included in http://www.flydean.com/02-python-pandas-advanced/
The most popular interpretation, the most profound dry goods, the most concise tutorial, and many tips you don't know are waiting for you to discover!
Welcome to pay attention to my official account: "Program those things", know technology, know you better!

Pandas: Pandas Advanced Tutorial Takes Titanic Real Data as an Example

Introduction

Read and write files

DF choice

Select column data

Select row data

Select rows and columns at the same time

Use plots to draw

Use existing columns to create new columns

Make statistics

DF reorganization

flydean

引用和评论

在stable diffussion中完美修复AI图片

Anaconda安装教程以及Anaconda和pip配置国内镜像

如何减少跨团队交付摩擦？——基于 DevOps 与敏捷的最佳实践

pip安装报错：No such file or directory 'conda-forge' 没有那个文件或目录

科学计算编程涉及到的技术栈简介

Python 描述符

使用 chardet 判断文件编码需要注意的坑——过大的文件会导致高耗时