Introduction

On April 15, 1912, the unsinkable Titanic sank due to a collision with an iceberg. Unfortunately, 1,502 of the 2,224 passengers died because there was not enough rescue equipment. Accidents have happened, but can we find some data patterns in the historical data from the Titanic? Today, this article will lead you to use pandas flexibly for data analysis.

Titanic passenger data

We downloaded part of the passenger data of the Titanic from the kaggle official website, which mainly includes the following fields:

variable namemeaningvalue
survivalsurvive0 = No, 1 = Yes
pclassclass of ticket1 = 1st, 2 = 2nd, 3 = 3rd
sexgender
Ageage
sibspSpouse Information
parchparent or child information
ticketTicket code
fareBoat fare
cabincabin number
embarkedlogin portC = Cherbourg, Q = Queenstown, S = Southampton

The downloaded file is a csv file. Next, let's take a look at how to use pandas to perform data analysis on it.

Analyze data with pandas

import dependency package

This article mainly uses pandas and matplotlib, so you need to make the following general settings first:

from numpy.random import randn
import numpy as np
np.random.seed(123)
import os
import matplotlib.pyplot as plt
import pandas as pd
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4)
pd.options.display.max_rows = 20

Read and analyze data

pandas provides a read_csv method that can easily read a csv data and convert it to a DataFrame:

path = '../data/titanic.csv'
df = pd.read_csv(path)
df

Let's take a look at the data read in:

PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
08923Kelly, Mr. Jamesmale34.5003309117.8292NaNQ
18933Wilkes, Mrs. James (Ellen Needs)female47.0103632727.0000NaNS
28942Myles, Mr. Thomas Francismale62.0002402769.6875NaNQ
38953Wirz, Mr. Albertmale27.0003151548.6625NaNS
48963Hirvonen, Mrs. Alexander (Helga E Lindqvist)female22.011310129812.2875NaNS
58973Svensson, Mr. Johan Cervinmale14.00075389.2250NaNS
68983Connolly, Miss. Katefemale30.0003309727.6292NaNQ
78992Caldwell, Mr. Albert Francismale26.01124873829.0000NaNS
89003Abrahim, Mrs. Joseph (Sophie Halaut Easu)female18.00026577.2292NaNC
99013Davies, Mr. John Samuelmale21.020A/4 4887124.1500NaNS
....................................
40813003Riordan, Miss. Johanna Hannah""femaleNaN003349157.7208NaNQ
40913013Peacock, Miss. Treasteallfemale3.011SOTON/O.Q. 310131513.7750NaNS
41013023Naughton, Miss. HannahfemaleNaN003652377.7500NaNQ
41113031Minahan, Mrs. William Edward (Lillian E Thorpe)female37.0101992890.0000C78Q
41213043Henriksson, Miss. Jenny Lovisafemale28.0003470867.7750NaNS
41313053Spector, Mr. WoolfmaleNaN00A.5. 32368.0500NaNS
41413061Oliva y Ocana, Dona. Ferminafemale39.000PC 17758108.9000C105C
41513073Saether, Mr. Simon Sivertsenmale38.500SOTON/O.Q. 31012627.2500NaNS
41613083Ware, Mr. FrederickmaleNaN003593098.0500NaNS
41713093Peter, Master. Michael JmaleNaN11266822.3583NaNC

418 rows × 11 columns

Call the describe method of df to view basic statistics:

PassengerIdPclassAgeSibSpParchFare
count418.000000418.000000332.000000418.000000418.000000417.000000
mean1100.5000002.26555030.2725900.4473680.39234435.627188
std120.8104580.84183814.1812090.8967600.98142955.907576
min892.0000001.0000000.1700000.0000000.0000000.000000
25%996.2500001.00000021.0000000.0000000.0000007.895800
50%1100.5000003.00000027.0000000.0000000.00000014.454200
75%1204.7500003.00000039.0000001.0000000.00000031.500000
max1309.0000003.00000076.0000008.0000009.000000512.329200

If you want to see the port where the passenger is logged in, you can choose:

df['Embarked'][:10]
0    Q
1    S
2    Q
3    S
4    S
5    S
6    Q
7    S
8    C
9    S
Name: Embarked, dtype: object

It can be counted using value_counts:

embark_counts=df['Embarked'].value_counts()
embark_counts[:10]
S    270
C    102
Q     46
Name: Embarked, dtype: int64

From the results, it can be seen that there are 270 passengers who log in from port S, 102 passengers who log in from port C, and 46 passengers who log in from port Q.

Similarly, we can count the age information:

age_counts=df['Age'].value_counts()
age_counts.head(10)

The ages of the top 10 are as follows:

24.0    17
21.0    17
22.0    16
30.0    15
18.0    13
27.0    12
26.0    12
25.0    11
23.0    11
29.0    10
Name: Age, dtype: int64

Calculate the average age:

df['Age'].mean()
30.272590361445783

In fact some data has no age, we can use the average to fill it:

clean_age1 = df['Age'].fillna(df['Age'].mean())
clean_age1.value_counts()

It can be seen that the average is 30.27 and the number is 86.

30.27259    86
24.00000    17
21.00000    17
22.00000    16
30.00000    15
18.00000    13
26.00000    12
27.00000    12
25.00000    11
23.00000    11
            ..
36.50000     1
40.50000     1
11.50000     1
34.00000     1
15.00000     1
7.00000      1
60.50000     1
26.50000     1
76.00000     1
34.50000     1
Name: Age, Length: 80, dtype: int64

It's probably not a good idea to use the mean for age, and another way is to discard the mean:

clean_age2=df['Age'].dropna()
clean_age2
age_counts = clean_age2.value_counts()
ageset=age_counts.head(10)
ageset
24.0    17
21.0    17
22.0    16
30.0    15
18.0    13
27.0    12
26.0    12
25.0    11
23.0    11
29.0    10
Name: Age, dtype: int64

Graphical representation and matrix transformation

Graphics are very helpful for data analysis. We use a bar chart for the top 10 ages obtained above:

import seaborn as sns
sns.barplot(x=ageset.index, y=ageset.values)

Next, let's do a complex matrix transformation. Let's filter out the data where age and sex are both empty:

cframe=df[df.Age.notnull() & df.Sex.notnull()]
cframe
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
08923Kelly, Mr. Jamesmale34.5003309117.8292NaNQ
18933Wilkes, Mrs. James (Ellen Needs)female47.0103632727.0000NaNS
28942Myles, Mr. Thomas Francismale62.0002402769.6875NaNQ
38953Wirz, Mr. Albertmale27.0003151548.6625NaNS
48963Hirvonen, Mrs. Alexander (Helga E Lindqvist)female22.011310129812.2875NaNS
58973Svensson, Mr. Johan Cervinmale14.00075389.2250NaNS
68983Connolly, Miss. Katefemale30.0003309727.6292NaNQ
78992Caldwell, Mr. Albert Francismale26.01124873829.0000NaNS
89003Abrahim, Mrs. Joseph (Sophie Halaut Easu)female18.00026577.2292NaNC
99013Davies, Mr. John Samuelmale21.020A/4 4887124.1500NaNS
....................................
40312951Carrau, Mr. Jose Pedromale17.00011305947.1000NaNS
40412961Frauenthal, Mr. Isaac Geraldmale43.0101776527.7208D40C
40512972Nourney, Mr. Alfred (Baron von Drachstedt")"male20.000SC/PARIS 216613.8625D38C
40612982Ware, Mr. William Jefferymale23.0102866610.5000NaNS
40712991Widener, Mr. George Duntonmale50.011113503211.5000C80C
40913013Peacock, Miss. Treasteallfemale3.011SOTON/O.Q. 310131513.7750NaNS
41113031Minahan, Mrs. William Edward (Lillian E Thorpe)female37.0101992890.0000C78Q
41213043Henriksson, Miss. Jenny Lovisafemale28.0003470867.7750NaNS
41413061Oliva y Ocana, Dona. Ferminafemale39.000PC 17758108.9000C105C
41513073Saether, Mr. Simon Sivertsenmale38.500SOTON/O.Q. 31012627.2500NaNS

332 rows × 11 columns

Next use groupby to group age and sex:

by_sex_age = cframe.groupby(['Age', 'Sex'])
by_sex_age.size()
Age    Sex   
0.17   female    1
0.33   male      1
0.75   male      1
0.83   male      1
0.92   female    1
1.00   female    3
2.00   female    1
       male      1
3.00   female    1
5.00   male      1
                ..
60.00  female    3
60.50  male      1
61.00  male      2
62.00  male      1
63.00  female    1
       male      1
64.00  female    2
       male      1
67.00  male      1
76.00  female    1
Length: 115, dtype: int64

Use unstack to turn Sex's column data into rows:

Sexfemalemale
Age
0.171.00.0
0.330.01.0
0.750.01.0
0.830.01.0
0.921.00.0
1.003.00.0
2.001.01.0
3.001.00.0
5.000.01.0
6.000.03.0
.........
58.001.00.0
59.001.00.0
60.003.00.0
60.500.01.0
61.000.02.0
62.000.01.0
63.001.01.0
64.002.01.0
67.000.01.0
76.001.00.0

79 rows × 2 columns

We add up the number of people with the same age, and then use argsort to sort to get the sorted index:

indexer = agg_counts.sum(1).argsort()
indexer.tail(10)
Age
58.0    37
59.0    31
60.0    29
60.5    32
61.0    34
62.0    22
63.0    38
64.0    27
67.0    26
76.0    30
dtype: int64

Take the last 10 from agg_counts, which is the largest 10:

count_subset = agg_counts.take(indexer.tail(10))
count_subset=count_subset.tail(10)
count_subset
Sexfemalemale
Age
29.05.05.0
25.01.010.0
23.05.06.0
26.04.08.0
27.04.08.0
18.07.06.0
30.06.09.0
22.010.06.0
21.03.014.0
24.05.012.0

The above operation can be simplified to the following code:

agg_counts.sum(1).nlargest(10)
Age
21.0    17.0
24.0    17.0
22.0    16.0
30.0    15.0
18.0    13.0
26.0    12.0
27.0    12.0
23.0    11.0
25.0    11.0
29.0    10.0
dtype: float64

Perform a stack operation on count_subset to facilitate subsequent drawing:

stack_subset = count_subset.stack()
stack_subset
Age   Sex   
29.0  female     5.0
      male       5.0
25.0  female     1.0
      male      10.0
23.0  female     5.0
      male       6.0
26.0  female     4.0
      male       8.0
27.0  female     4.0
      male       8.0
18.0  female     7.0
      male       6.0
30.0  female     6.0
      male       9.0
22.0  female    10.0
      male       6.0
21.0  female     3.0
      male      14.0
24.0  female     5.0
      male      12.0
dtype: float64
stack_subset.name = 'total'
stack_subset = stack_subset.reset_index()
stack_subset
AgeSextotal
029.0female5.0
129.0male5.0
225.0female1.0
325.0male10.0
423.0female5.0
523.0male6.0
626.0female4.0
726.0male8.0
827.0female4.0
927.0male8.0
1018.0female7.0
1118.0male6.0
1230.0female6.0
1330.0male9.0
1422.0female10.0
1522.0male6.0
1621.0female3.0
1721.0male14.0
1824.0female5.0
1924.0male12.0

The drawing is as follows:

sns.barplot(x='total', y='Age', hue='Sex',  data=stack_subset)

For examples in this article, please refer to: https://github.com/ddean2009/learn-ai/

This article has been included in http://www.flydean.com/01-pandas-titanic/

The most popular interpretation, the most profound dry goods, the most concise tutorial, many you do not know

Welcome to pay attention to my official account: "Program those things", understand technology, understand you better!


flydean
890 声望433 粉丝

欢迎访问我的个人网站:www.flydean.com