Introduction
Today we will explain the advanced tutorial of Pandas, including reading and writing files, selecting subsets, and graphical representation.
Read and write files
A key step of data processing is to read the file for analysis, and then write the analysis and processing results to the file again.
Pandas supports reading and writing of multiple file formats:
In [108]: pd.read_
read_clipboard() read_excel() read_fwf() read_hdf() read_json read_parquet read_sas read_sql_query read_stata
read_csv read_feather() read_gbq() read_html read_msgpack read_pickle read_sql read_sql_table read_table
Next, we will take Titanic.csv provided by Pandas official website as an example to explain the use of Pandas.
Titanic.csv provides information about more than 800 passengers on the Titanic, which is a matrix of 891 rows x 12 columns.
We use Pandas to read this csv:
In [5]: titanic=pd.read_csv("titanic.csv")
The read_csv method will convert the csv file into pandas DataFrame
.
By default, we directly use the DF variable, which will display the first 5 rows and the last 5 rows of data by default:
In [3]: titanic
Out[3]:
PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male ... 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female ... 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female ... 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female ... 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male ... 0 373450 8.0500 NaN S
.. ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male ... 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female ... 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female ... 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male ... 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male ... 0 370376 7.7500 NaN Q
[891 rows x 12 columns]
You can use head(n) and tail(n) to specify a specific number of rows:
In [4]: titanic.head(8)
Out[4]:
PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male ... 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female ... 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female ... 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female ... 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male ... 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male ... 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male ... 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male ... 1 349909 21.0750 NaN S
[8 rows x 12 columns]
Use dtypes to view the data type of each column:
In [5]: titanic.dtypes
Out[5]:
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
Use to_excel to convert DF to excel file, use read_excel to read excel file again:
In [11]: titanic.to_excel('titanic.xlsx', sheet_name='passengers', index=False)
In [12]: titanic = pd.read_excel('titanic.xlsx', sheet_name='passengers')
Use info() to make a preliminary statistics on DF:
In [14]: titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
DF choice
Select column data
DF's head or tail method can only display all column data, the following method can select specific column data.
In [15]: ages = titanic["Age"]
In [16]: ages.head()
Out[16]:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64
Each column is a Series:
In [6]: type(titanic["Age"])
Out[6]: pandas.core.series.Series
In [7]: titanic["Age"].shape
Out[7]: (891,)
You can also select multiple:
In [8]: age_sex = titanic[["Age", "Sex"]]
In [9]: age_sex.head()
Out[9]:
Age Sex
0 22.0 male
1 38.0 female
2 26.0 female
3 35.0 female
4 35.0 male
If you select multiple columns, the returned result is a DF type:
In [10]: type(titanic[["Age", "Sex"]])
Out[10]: pandas.core.frame.DataFrame
In [11]: titanic[["Age", "Sex"]].shape
Out[11]: (891, 2)
Select row data
Above we talked about how to select column data, let's take a look at how to select row data:
Select customers older than 35 years old:
In [12]: above_35 = titanic[titanic["Age"] > 35]
In [13]: above_35.head()
Out[13]:
PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female ... 0 PC 17599 71.2833 C85 C
6 7 0 1 McCarthy, Mr. Timothy J male ... 0 17463 51.8625 E46 S
11 12 1 1 Bonnell, Miss. Elizabeth female ... 0 113783 26.5500 C103 S
13 14 0 3 Andersson, Mr. Anders Johan male ... 5 347082 31.2750 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female ... 0 248706 16.0000 NaN S
[5 rows x 12 columns]
Use isin to select all customers with Pclass in 2 and 3:
In [16]: class_23 = titanic[titanic["Pclass"].isin([2, 3])]
In [17]: class_23.head()
Out[17]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
The above isin is equal to:
In [18]: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
Filter Age is not empty:
In [20]: age_no_na = titanic[titanic["Age"].notna()]
In [21]: age_no_na.head()
Out[21]:
PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male ... 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female ... 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female ... 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female ... 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male ... 0 373450 8.0500 NaN S
[5 rows x 12 columns]
Select rows and columns at the same time
We can select rows and columns at the same time.
You can use loc and iloc to select rows and columns. The difference between the two is that loc uses names for selection, and iloc uses numbers for selection.
Select the passenger name of age>35:
In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
In [24]: adult_names.head()
Out[24]:
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
6 McCarthy, Mr. Timothy J
11 Bonnell, Miss. Elizabeth
13 Andersson, Mr. Anders Johan
15 Hewlett, Mrs. (Mary D Kingcome)
Name: Name, dtype: object
The first value in loc indicates row selection, and the second value indicates column selection.
Use iloc to select:
In [25]: titanic.iloc[9:25, 2:5]
Out[25]:
Pclass Name Sex
9 2 Nasser, Mrs. Nicholas (Adele Achem) female
10 3 Sandstrom, Miss. Marguerite Rut female
11 1 Bonnell, Miss. Elizabeth female
12 3 Saundercock, Mr. William Henry male
13 3 Andersson, Mr. Anders Johan male
.. ... ... ...
20 2 Fynney, Mr. Joseph J male
21 2 Beesley, Mr. Lawrence male
22 3 McGowan, Miss. Anna "Annie" female
23 1 Sloper, Mr. William Thompson male
24 3 Palsson, Miss. Torborg Danira female
[16 rows x 3 columns]
Use plots to draw
How to convert DF into a diversified graphic display?
To use matplotlib for drawing on the command line, you need to start the QT environment of ipython:
ipython qtconsole --pylab=inline
Use plot directly to show the passenger information we read above:
import matplotlib.pyplot as plt
import pandas as pd
titanic = pd.read_excel('titanic.xlsx', sheet_name='passengers')
titanic.plot()
The abscissa is the index in DF, and the column coordinate is the name of each column. Note that the above column only shows numeric types.
We only show age information:
titanic['Age'].plot()
The default is a histogram, we can convert the form of the graph, such as a dot graph:
titanic.plot.scatter(x="PassengerId",y="Age", alpha=0.5)
Choose PassengerId in the data as the x-axis and age as the y-axis:
In addition to scatter plots, many other images are also supported:
[method_name for method_name in dir(titanic.plot) if not method_name.startswith("_")]
Out[11]:
['area',
'bar',
'barh',
'box',
'density',
'hexbin',
'hist',
'kde',
'line',
'pie',
'scatter']
Look at another box diagram:
titanic['Age'].plot.box()
It can be seen that the age of passengers is mostly concentrated between 20-40 years old.
You can also plot the selected multiple columns separately:
titanic.plot.area(figsize=(12, 4), subplots=True)
Specify specific columns:
titanic[['Age','Pclass']].plot.area(figsize=(12, 4), subplots=True)
You can also draw the picture first, and then fill it:
fig, axs = plt.subplots(figsize=(12, 4));
First draw an empty picture, and then fill it:
titanic['Age'].plot.area(ax=axs);
axs.set_ylabel("Age");
fig
Use existing columns to create new columns
Sometimes, we need to transform existing columns to get new columns. For example, if we want to add an Age2 column whose value is Age column+10, we can do this:
titanic["Age2"]=titanic["Age"]+10;
titanic[["Age","Age2"]].head()
Out[34]:
Age Age2
0 22.0 32.0
1 38.0 48.0
2 26.0 36.0
3 35.0 45.0
4 35.0 45.0
You can also rename the column:
titanic_renamed = titanic.rename(
...: columns={"Age": "Age2",
...: "Pclass": "Pclas2"})
Convert column names to lowercase:
titanic_renamed = titanic_renamed.rename(columns=str.lower)
Make statistics
Let's count the average age of passengers:
titanic["Age"].mean()
Out[35]: 29.69911764705882
Choose median:
titanic[["Age", "Fare"]].median()
Out[36]:
Age 28.0000
Fare 14.4542
dtype: float64
More information:
titanic[["Age", "Fare"]].describe()
Out[37]:
Age Fare
count 714.000000 891.000000
mean 29.699118 32.204208
std 14.526497 49.693429
min 0.420000 0.000000
25% 20.125000 7.910400
50% 28.000000 14.454200
75% 38.000000 31.000000
max 80.000000 512.329200
Use agg to specify a specific aggregation method:
titanic.agg({'Age': ['min', 'max', 'median', 'skew'],'Fare': ['min', 'max', 'median', 'mean']})
Out[38]:
Age Fare
max 80.000000 512.329200
mean NaN 32.204208
median 28.000000 14.454200
min 0.420000 0.000000
skew 0.389108 NaN
You can use groupby:
titanic[["Sex", "Age"]].groupby("Sex").mean()
Out[39]:
Age
Sex
female 27.915709
male 30.726645
All columns in groupby:
titanic.groupby("Sex").mean()
Out[40]:
PassengerId Survived Pclass Age SibSp Parch
Sex
female 431.028662 0.742038 2.159236 27.915709 0.694268 0.649682
male 454.147314 0.188908 2.389948 30.726645 0.429809 0.235702
You can also select specific columns after groupby:
titanic.groupby("Sex")["Age"].mean()
Out[41]:
Sex
female 27.915709
male 30.726645
Name: Age, dtype: float64
Count can be classified:
titanic["Pclass"].value_counts()
Out[42]:
3 491
1 216
2 184
Name: Pclass, dtype: int64
The above is equivalent to:
titanic.groupby("Pclass")["Pclass"].count()
DF reorganization
You can sort according to a certain column:
titanic.sort_values(by="Age").head()
Out[43]:
PassengerId Survived Pclass Name Sex \
803 804 1 3 Thomas, Master. Assad Alexander male
755 756 1 2 Hamalainen, Master. Viljo male
644 645 1 3 Baclini, Miss. Eugenie female
469 470 1 3 Baclini, Miss. Helene Barbara female
78 79 1 2 Caldwell, Master. Alden Gates male
Sort by multiple columns:
titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head()
Out[44]:
PassengerId Survived Pclass Name Sex Age \
851 852 0 3 Svensson, Mr. Johan male 74.0
116 117 0 3 Connors, Mr. Patrick male 70.5
280 281 0 3 Duane, Mr. Frank male 65.0
483 484 1 3 Turkula, Mrs. (Hedwig) female 63.0
326 327 0 3 Nysveen, Mr. Johan Hansen male 61.0
Select specific rows and columns of data. In the following example, we will select part of the data whose gender is female:
female=titanic[titanic['Sex']=='female']
female_subset=female[["Age","Pclass","PassengerId","Survived"]].sort_values(["Pclass"]).groupby(["Pclass"]).head(2)
female_subset
Out[58]:
Age Pclass PassengerId Survived
1 38.0 1 2 1
356 22.0 1 357 1
726 30.0 2 727 1
443 28.0 2 444 1
855 18.0 3 856 1
654 18.0 3 655 0
Use pivot to convert the axis:
female_subset.pivot(columns="Pclass", values="Age")
Out[62]:
Pclass 1 2 3
1 38.0 NaN NaN
356 22.0 NaN NaN
443 NaN 28.0 NaN
654 NaN NaN 18.0
726 NaN 30.0 NaN
855 NaN NaN 18.0
female_subset.pivot(columns="Pclass", values="Age").plot()
This article has been included in http://www.flydean.com/02-python-pandas-advanced/
The most popular interpretation, the most profound dry goods, the most concise tutorial, and many tips you don't know are waiting for you to discover!
Welcome to pay attention to my official account: "Program those things", know technology, know you better!
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。