Introduction
This article introduces the two basic pandas data structures, Series and DataFrame, and explains in detail how to create and index them.
To use pandas, import the following libraries:
In [1]: import numpy as np
In [2]: import pandas as pd
Series
A Series is a one-dimensional labeled array. Its axis labels are collectively called the index. A Series is created as follows:
>>> s = pd.Series(data, index=index)
Here data can be a Python dict, a NumPy ndarray, or a scalar value.
index is a list of axis labels. When data is an ndarray, index must have the same length as data. Next, let's look at each way of creating a Series.
Create from ndarray
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s
Out[67]:
a -1.300797
b -2.044172
c -1.170739
d -0.445290
e 1.208784
dtype: float64
Use the index attribute to get the index of a Series:
s.index
Out[68]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Create from dict
d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)
Out[70]:
b 1
a 0
c 2
dtype: int64
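When a dict is passed together with an explicit index, the values are looked up by label. The following is a minimal sketch of that behavior; the name s_from_dict is only illustrative, and the label 'd' is deliberately missing from the dict:

```python
import numpy as np
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

# Values are pulled out of the dict by label, in the order given by index.
# 'd' is not a key of the dict, so its value becomes NaN and the
# dtype is upcast from int64 to float64.
s_from_dict = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(s_from_dict)
```

NaN (not a number) is the standard missing-data marker in pandas.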
Create from scalar
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[71]:
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64
Series and ndarray
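A Series behaves much like an ndarray: it supports positional indexing, slicing, boolean masks, and selection by a list of positions. Note that recent pandas versions deprecate integer positional access such as s[0] on a label-indexed Series in favor of s.iloc[0]: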
s[0]
Out[72]: -1.3007972194268396
s[:3]
Out[73]:
a -1.300797
b -2.044172
c -1.170739
dtype: float64
s[s > s.median()]
Out[74]:
d -0.445290
e 1.208784
dtype: float64
s[[4, 3, 1]]
Out[75]:
e 1.208784
d -0.445290
b -2.044172
dtype: float64
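For explicit positional access, iloc can be used, and to_numpy returns the underlying values as an ndarray. A small sketch, using a hypothetical Series s_demo with fixed values rather than the random s above:

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0],
                   index=['a', 'b', 'c', 'd', 'e'])

first = s_demo.iloc[0]           # value at position 0, like ndarray[0]
picked = s_demo.iloc[[4, 3, 1]]  # positions 4, 3, 1 -> labels e, d, b
arr = s_demo.to_numpy()          # the values as a plain ndarray, labels dropped

print(first)
print(picked)
print(arr)
```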
Series and dict
If you access a Series by label, it behaves much like a dict: you can get and set values by label:
s['a']
Out[80]: -1.3007972194268396
s['e'] = 12.
s
Out[82]:
a -1.300797
b -2.044172
c -1.170739
d -0.445290
e 12.000000
dtype: float64
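The dict analogy extends to membership tests and missing keys. The sketch below uses a hypothetical Series s_demo; a missing label raises KeyError, while get returns a default instead:

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

has_a = 'a' in s_demo  # membership is tested against the labels
has_f = 'f' in s_demo

# get returns the given default instead of raising for a missing label
missing = s_demo.get('f', np.nan)

# Plain label access raises KeyError for a missing label, like a dict
try:
    s_demo['f']
except KeyError:
    raised = True
else:
    raised = False

print(has_a, has_f, missing, raised)
```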
Vectorized operations and label alignment
A Series supports vectorized operations, so there is usually no need to loop over its values one by one:
s + s
Out[83]:
a -2.601594
b -4.088344
c -2.341477
d -0.890581
e 24.000000
dtype: float64
s * 2
Out[84]:
a -2.601594
b -4.088344
c -2.341477
d -0.890581
e 24.000000
dtype: float64
np.exp(s)
Out[85]:
a 0.272315
b 0.129487
c 0.310138
d 0.640638
e 162754.791419
dtype: float64
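Unlike an ndarray, operations between two Series align on labels rather than on positions. A minimal sketch with a hypothetical Series s_demo: labels that appear in only one operand produce NaN in the result.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

left = s_demo.iloc[1:]    # labels b, c, d
right = s_demo.iloc[:-1]  # labels a, b, c

# The result covers the union of the labels. 'a' and 'd' each exist in
# only one operand, so their sums are NaN.
total = left + right
print(total)
```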
Name attribute
Series also has a name attribute, which we can set when it is created:
s = pd.Series(np.random.randn(5), name='something')
s
Out[88]:
0 0.192272
1 0.110410
2 1.442358
3 -0.375792
4 1.228111
Name: something, dtype: float64
A Series also has a rename method, which returns a renamed copy:
s2 = s.rename("different")
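The sketch below, with hypothetical names s_named and s_renamed, illustrates that rename called with a scalar returns a new Series and leaves the name of the original untouched:

```python
import pandas as pd

s_named = pd.Series([1, 2, 3], name='something')
s_renamed = s_named.rename('different')

print(s_named.name)    # the original keeps its name
print(s_renamed.name)  # the new object carries the new name
```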
DataFrame
A DataFrame is a two-dimensional labeled data structure whose columns can be of different types. You can think of a DataFrame as a spreadsheet or as a dict of Series. A DataFrame can be created from the following types of data:
- A dict of one-dimensional ndarrays, lists, dicts, or Series
- A structured ndarray
- A two-dimensional numpy.ndarray
- Another DataFrame
Create from Series
You can create a DataFrame from a dictionary composed of Series:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df
Out[92]:
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
Pass index to select rows and put them in a given order:
pd.DataFrame(d, index=['d', 'b', 'a'])
Out[93]:
one two
d NaN 4.0
b 2.0 2.0
a 1.0 1.0
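Pass columns to select and order the columns. A column that does not exist in the dict, such as 'three' here, is filled with NaN: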
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[94]:
two three
d 4.0 NaN
b 2.0 NaN
a 1.0 NaN
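The row and column labels of a DataFrame are available through its index and columns attributes. A short sketch, rebuilding the dict of Series from above under the hypothetical name df_demo:

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df_demo = pd.DataFrame(d)

# The row index is the union of the indexes of the individual Series
row_labels = list(df_demo.index)
col_labels = list(df_demo.columns)

print(row_labels)
print(col_labels)
```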
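Create from a dict of ndarrays or lists
The ndarrays or lists must all have the same length. If index is passed, it must have that length too; otherwise the index defaults to 0 through n-1: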
d = {'one': [1., 2., 3., 4.],'two': [4., 3., 2., 1.]}
pd.DataFrame(d)
Out[96]:
one two
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
Out[97]:
one two
a 1.0 4.0
b 2.0 3.0
c 3.0 2.0
d 4.0 1.0
Create from structured array
You can create a DataFrame from a structured array. The fields of the array become the columns:
In [47]: data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
In [48]: data[:] = [(1, 2., 'Hello'), (2, 3., "World")]
In [49]: pd.DataFrame(data)
Out[49]:
A B C
0 1 2.0 b'Hello'
1 2 3.0 b'World'
In [50]: pd.DataFrame(data, index=['first', 'second'])
Out[50]:
A B C
first 1 2.0 b'Hello'
second 2 3.0 b'World'
In [51]: pd.DataFrame(data, columns=['C', 'A', 'B'])
Out[51]:
C A B
0 b'Hello' 1 2.0
1 b'World' 2 3.0
Create from a list of dictionaries
In [52]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
In [53]: pd.DataFrame(data2)
Out[53]:
a b c
0 1 2 NaN
1 5 10 20.0
In [54]: pd.DataFrame(data2, index=['first', 'second'])
Out[54]:
a b c
first 1 2 NaN
second 5 10 20.0
In [55]: pd.DataFrame(data2, columns=['a', 'b'])
Out[55]:
a b
0 1 2
1 5 10
Create from a dict of tuples
Passing a dict whose keys are tuples creates a DataFrame with multi-level (MultiIndex) row and column labels:
In [56]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
....: ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
....: ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
....: ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
....: ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
....:
Out[56]:
a b
b a c a b
A B 1.0 4.0 5.0 8.0 10.0
C 2.0 3.0 6.0 7.0 NaN
D NaN NaN NaN NaN 9.0
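With tuple keys, the columns and rows are both MultiIndex objects, so a top-level label selects a whole group of columns. The following is a sketch on a smaller, hypothetical frame df_multi built the same way:

```python
import pandas as pd

df_multi = pd.DataFrame({
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
    ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
})

# Tuple keys produce multi-level labels on the columns
is_multi = isinstance(df_multi.columns, pd.MultiIndex)

# Selecting the top-level label 'a' returns the columns grouped under it
sub = df_multi['a']

# A full (row, column) pair of tuples addresses a single cell
value = df_multi.loc[('A', 'B'), ('a', 'b')]

print(sub)
print(value)
```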
Column selection, adding and deleting
A DataFrame can be treated like a dict of Series: columns are read, added, and deleted with dict-like syntax:
In [64]: df['one']
Out[64]:
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
In [65]: df['three'] = df['one'] * df['two']
In [66]: df['flag'] = df['one'] > 2
In [67]: df
Out[67]:
one two three flag
a 1.0 1.0 1.0 False
b 2.0 2.0 4.0 False
c 3.0 3.0 9.0 True
d NaN 4.0 NaN False
You can delete a column with del, or remove it and get it back as a Series with pop:
In [68]: del df['two']
In [69]: three = df.pop('three')
In [70]: df
Out[70]:
one flag
a 1.0 False
b 2.0 False
c 3.0 True
d NaN False
If you assign a scalar value, it is broadcast to fill the entire column:
In [71]: df['foo'] = 'bar'
In [72]: df
Out[72]:
one flag foo
a 1.0 False bar
b 2.0 False bar
c 3.0 True bar
d NaN False bar
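By default, a new column is appended after the last column of the DataFrame. Use insert to place a column at a specific position. The example first adds a one_trunc column that holds only the first two values of one, then inserts bar at position 1:
In [73]: df['one_trunc'] = df['one'][:2]
In [75]: df.insert(1, 'bar', df['one'])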
In [76]: df
Out[76]:
one bar flag foo one_trunc
a 1.0 1.0 False bar 1.0
b 2.0 2.0 False bar 2.0
c 3.0 3.0 True bar NaN
d NaN NaN False bar NaN
Use assign to derive new columns from existing columns:
In [77]: iris = pd.read_csv('data/iris.data')
In [78]: iris.head()
Out[78]:
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
In [79]: (iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength'])
....: .head())
....:
Out[79]:
SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio
0 5.1 3.5 1.4 0.2 Iris-setosa 0.686275
1 4.9 3.0 1.4 0.2 Iris-setosa 0.612245
2 4.7 3.2 1.3 0.2 Iris-setosa 0.680851
3 4.6 3.1 1.5 0.2 Iris-setosa 0.673913
4 5.0 3.6 1.4 0.2 Iris-setosa 0.720000
Note that assign returns a new DataFrame; the original DataFrame remains unchanged.
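assign also accepts a function that is evaluated on the DataFrame being assigned to, which helps in method chains where no variable refers to the intermediate frame. A minimal sketch; iris_demo is a hypothetical two-row stand-in so the example does not depend on the data/iris.data file:

```python
import pandas as pd

# Hypothetical stand-in for the iris data, values taken from the first two rows
iris_demo = pd.DataFrame({'SepalLength': [5.1, 4.9],
                          'SepalWidth': [3.5, 3.0]})

# The lambda receives the DataFrame and returns the new column
with_ratio = iris_demo.assign(
    sepal_ratio=lambda x: x['SepalWidth'] / x['SepalLength']
)

print(with_ratio)
print(list(iris_demo.columns))  # the original frame has no sepal_ratio column
```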
The following table summarizes indexing and selection in a DataFrame:
Operation | Syntax | Result |
---|---|---|
Select column | df[col] | Series |
Select row by label | df.loc[label] | Series |
Select row by integer location | df.iloc[loc] | Series |
Slice rows | df[5:10] | DataFrame |
Select rows by boolean vector | df[bool_vec] | DataFrame |
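Each row of the table can be tried on a small frame. The sketch below uses a hypothetical DataFrame df_demo and prints the type returned by each operation:

```python
import pandas as pd

df_demo = pd.DataFrame({'one': [1.0, 2.0, 3.0, 4.0],
                        'two': [4.0, 3.0, 2.0, 1.0]},
                       index=['a', 'b', 'c', 'd'])

col = df_demo['one']                  # select column -> Series
row_by_label = df_demo.loc['b']       # select row by label -> Series
row_by_pos = df_demo.iloc[1]          # select row by integer location -> Series
row_slice = df_demo[1:3]              # slice rows -> DataFrame
masked = df_demo[df_demo['one'] > 2]  # boolean vector -> DataFrame

for result in (col, row_by_label, row_by_pos, row_slice, masked):
    print(type(result).__name__)
```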
This article is also published at http://www.flydean.com/03-python-pandas-data-structures/