Pandas: an in-depth look at its data structures


Introduction

This article explains the two basic data structures in Pandas, Series and DataFrame, and covers in detail how to create and index them.

To use Pandas, import the following libraries:

In [1]: import numpy as np

In [2]: import pandas as pd

Series

A Series is a one-dimensional labeled array. Its axis labels are collectively called the index. A Series is created as follows:

>>> s = pd.Series(data, index=index)

Here data can be a Python dict, a NumPy ndarray, or a scalar value.

index is a list of axis labels. The following sections show the different ways to create a Series.

Create from ndarray

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

s
Out[67]: 
a   -1.300797
b   -2.044172
c   -1.170739
d   -0.445290
e    1.208784
dtype: float64

Use the index attribute to get the labels:

s.index
Out[68]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Create from dict

d = {'b': 1, 'a': 0, 'c': 2}

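pd.Series(d)
Out[70]: 
b    1
a    0
c    2
dtype: int64

Note: with pandas 0.23 or later on Python 3.6 or later, the Series keeps the dict's insertion order. Older versions sorted the keys, giving a, b, c.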
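
When a dict is passed together with an explicit index, values are looked up by key in the order given. The following is a minimal sketch; the extra label 'd' and the variable name s_from_dict are illustrative assumptions, not part of the original example:

```python
import numpy as np
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

# Values are pulled out by key, in the order of the index argument.
# 'd' is not a key of the dict, so its value becomes NaN, and the
# presence of NaN turns the dtype into float64.
s_from_dict = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(s_from_dict)
```

NaN (not a number) is the standard missing-data marker in Pandas.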

Create from scalar

pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[71]: 
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

Series and ndarray

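A Series behaves much like an ndarray: it supports positional indexing, slicing, boolean masks, and selection by a list of positions. Note that in recent pandas versions, treating integer keys in s[...] as positions on a labeled Series is deprecated; s.iloc is the explicit way to index by position.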

s[0]
Out[72]: -1.3007972194268396

s[:3]
Out[73]: 
a   -1.300797
b   -2.044172
c   -1.170739
dtype: float64

s[s > s.median()]
Out[74]: 
d   -0.445290
e    1.208784
dtype: float64

s[[4, 3, 1]]
Out[75]: 
e    1.208784
d   -0.445290
b   -2.044172
dtype: float64
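
The same selections can be written with explicit positional indexing through iloc. This is a minimal sketch that uses fixed values instead of random ones (an assumption, so that the results are reproducible):

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0],
              index=['a', 'b', 'c', 'd', 'e'])

first = s.iloc[0]             # single element by position
head = s.iloc[:3]             # positional slice, keeps labels a, b, c
picked = s.iloc[[4, 3, 1]]    # list of positions, in the requested order
above_median = s[s > s.median()]  # boolean mask works with plain []

print(first)
print(head)
print(picked)
print(above_median)
```

Using iloc makes the intent unambiguous when the index itself could contain integers.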

Series and dict

When accessed by label, a Series behaves much like a dict:

s['a']
Out[80]: -1.3007972194268396

s['e'] = 12.

s
Out[82]: 
a    -1.300797
b    -2.044172
c    -1.170739
d    -0.445290
e    12.000000
dtype: float64
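
The dict-like behavior also covers membership tests and safe lookups. A minimal sketch, with illustrative values and the missing label 'z' chosen as an assumption:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

has_a = 'a' in s            # membership checks the labels, not the values
has_z = 'z' in s

missing = s.get('z')        # like dict.get: returns None when absent
fallback = s.get('z', 0.0)  # or the supplied default

# Plain [] access with an unknown label raises KeyError, as with a dict.
try:
    s['z']
    raised = False
except KeyError:
    raised = True

print(has_a, has_z, missing, fallback, raised)
```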

Vectorization operations and label alignment

Series support vectorized operations, so there is no need to loop over the values one by one:

s + s
Out[83]: 
a    -2.601594
b    -4.088344
c    -2.341477
d    -0.890581
e    24.000000
dtype: float64

s * 2
Out[84]: 
a    -2.601594
b    -4.088344
c    -2.341477
d    -0.890581
e    24.000000
dtype: float64

np.exp(s)
Out[85]: 
a         0.272315
b         0.129487
c         0.310138
d         0.640638
e    162754.791419
dtype: float64
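
Operations between two Series align on labels rather than on position. A minimal sketch of label alignment, using fixed illustrative values (an assumption):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=list('abcde'))

left = s.iloc[1:]    # labels b, c, d, e
right = s.iloc[:-1]  # labels a, b, c, d

# The result is indexed by the union of both label sets.
# Labels present on only one side ('a' and 'e') produce NaN.
total = left + right
print(total)
```

Because values are matched by label, the two operands do not need to have the same length or order.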

Name attribute

Series also has a name attribute, which we can set when it is created:

s = pd.Series(np.random.randn(5), name='something')

s
Out[88]: 
0    0.192272
1    0.110410
2    1.442358
3   -0.375792
4    1.228111
Name: something, dtype: float64

Series also has a rename method, which returns a renamed copy:

s2 = s.rename("different")
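
A minimal sketch showing that rename leaves the original Series untouched (the values here are illustrative assumptions):

```python
import pandas as pd

s = pd.Series([1, 2, 3], name='something')

# rename with a scalar changes the name and returns a new Series object.
s2 = s.rename('different')

print(s.name)   # something
print(s2.name)  # different
```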

DataFrame

A DataFrame is a two-dimensional labeled data structure whose columns are Series, possibly of different types. You can think of it as a spreadsheet or an Excel table. A DataFrame can be created from the following types of data:

Dict of one-dimensional ndarrays, lists, dicts, or Series
Structured or record ndarray
Two-dimensional numpy.ndarray
Another DataFrame
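
The two-dimensional ndarray and DataFrame inputs from the list above can be sketched as follows. The array contents and the row and column labels are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# A 3x2 array: rows [0, 1], [2, 3], [4, 5].
arr = np.arange(6).reshape(3, 2)

# Each array row becomes a DataFrame row; labels are optional.
df_arr = pd.DataFrame(arr, index=['x', 'y', 'z'], columns=['one', 'two'])

# An existing DataFrame can also be passed to the constructor.
df_again = pd.DataFrame(df_arr)

print(df_arr)
```

Without index and columns, both axes default to integer labels starting at 0.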

Create from Series

You can create a DataFrame from a dictionary composed of Series:

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

df
Out[92]: 
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

Pass index to select and reorder rows:

pd.DataFrame(d, index=['d', 'b', 'a'])
Out[93]: 
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

Pass columns to select and reorder columns. A column that is not in the dict (here 'three') is filled with NaN:

pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[94]: 
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
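
The row and column labels of the result are available through the index and columns attributes. A minimal sketch that rebuilds the dict of Series used above (the variable name sub is an illustrative assumption):

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

# Without an explicit index, the row labels are the union of the
# indexes of all the Series.
df = pd.DataFrame(d)
print(df.index)
print(df.columns)

# index and columns select and order the labels to keep.
sub = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
print(sub)
```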

Create from ndarrays and lists

d = {'one': [1., 2., 3., 4.],'two': [4., 3., 2., 1.]}

pd.DataFrame(d)
Out[96]: 
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
Out[97]: 
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0

Create from structured array

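You can create a DataFrame from a structured array. The 'S10' field below is a 10-byte string; the older 'a10' spelling of this type code is no longer accepted as of NumPy 2.0:

In [47]: data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])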

In [48]: data[:] = [(1, 2., 'Hello'), (2, 3., "World")]

In [49]: pd.DataFrame(data)
Out[49]: 
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'

In [50]: pd.DataFrame(data, index=['first', 'second'])
Out[50]: 
        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'

In [51]: pd.DataFrame(data, columns=['C', 'A', 'B'])
Out[51]: 
          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0

Create from a list of dictionaries

In [52]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [53]: pd.DataFrame(data2)
Out[53]: 
   a   b     c
0  1   2   NaN
1  5  10  20.0

In [54]: pd.DataFrame(data2, index=['first', 'second'])
Out[54]: 
        a   b     c
first   1   2   NaN
second  5  10  20.0

In [55]: pd.DataFrame(data2, columns=['a', 'b'])
Out[55]: 
   a   b
0  1   2
1  5  10

Create from a dict of tuples

More complex DataFrames with multi-level (MultiIndex) rows and columns can be created from a dict whose keys are tuples:

In [56]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
   ....:               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
   ....:               ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
   ....:               ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
   ....:               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
   ....: 
Out[56]: 
       a              b      
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0
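
Once the tuple keys have become multi-level labels, a full tuple or just the outer level can be used to select data. A minimal sketch on a smaller dict (the data and the name df_multi are illustrative assumptions):

```python
import pandas as pd

data = {
    ('a', 'a'): {('A', 'B'): 4, ('A', 'C'): 3},
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('b', 'a'): {('A', 'B'): 8, ('A', 'C'): 7},
}
df_multi = pd.DataFrame(data)

# Both axes are MultiIndex objects built from the tuple keys.
print(type(df_multi.columns).__name__)
print(type(df_multi.index).__name__)

# Selecting the outer column level 'a' returns the sub-frame below it.
under_a = df_multi['a']

# A full (row tuple, column tuple) pair addresses a single cell.
cell = df_multi.loc[('A', 'B'), ('a', 'b')]
print(under_a)
print(cell)
```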

Column selection, adding and deleting

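A DataFrame can be treated like a dict of Series that share the same index, so columns are read, set, and deleted with dict-like syntax: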

In [64]: df['one']
Out[64]: 
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [65]: df['three'] = df['one'] * df['two']

In [66]: df['flag'] = df['one'] > 2

In [67]: df
Out[67]: 
   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False

You can delete a column with del, or remove it and get it back as a Series with pop:

In [68]: del df['two']

In [69]: three = df.pop('three')

In [70]: df
Out[70]: 
   one   flag
a  1.0  False
b  2.0  False
c  3.0   True
d  NaN  False
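
The difference between del and pop can be sketched on a small frame (the column values are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2.], 'two': [3., 4.], 'three': [5., 6.]})

# del removes the column in place and returns nothing.
del df['two']

# pop also removes the column in place, but hands it back as a Series.
three = df.pop('three')

print(list(df.columns))
print(three)
```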

If you assign a scalar, it is broadcast to fill the entire column:

In [71]: df['foo'] = 'bar'

In [72]: df
Out[72]: 
   one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  NaN  False  bar

By default, a new column is appended as the last column of the DataFrame. Use insert to place it at a specific position:

In [75]: df.insert(1, 'bar', df['one'])

In [76]: df
Out[76]: 
   one  bar   flag  foo
a  1.0  1.0  False  bar
b  2.0  2.0  False  bar
c  3.0  3.0   True  bar
d  NaN  NaN  False  bar
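
A minimal sketch contrasting plain assignment with insert, on illustrative data (an assumption). It also shows that insert refuses a column name that already exists:

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2.], 'flag': [False, True]})

# Plain assignment appends the new column at the end.
df['foo'] = 'bar'

# insert(loc, column, value) places the column at position loc.
df.insert(1, 'bar', df['one'])
print(list(df.columns))

# Inserting a name that already exists raises ValueError by default.
try:
    df.insert(0, 'bar', 0)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)
```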

Use assign to derive new columns from existing columns:

In [77]: iris = pd.read_csv('data/iris.data')

In [78]: iris.head()
Out[78]: 
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

In [79]: (iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength'])
   ....:      .head())
   ....: 
Out[79]: 
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

Note that assign returns a new DataFrame; the original DataFrame is left unchanged.
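
assign also accepts a function of the DataFrame instead of a precomputed column. A minimal sketch on a small inline table, since the iris file may not be available locally; the values and the name measurements are illustrative assumptions:

```python
import pandas as pd

measurements = pd.DataFrame({'SepalLength': [5.0, 4.0],
                             'SepalWidth': [2.5, 3.0]})

# The callable receives the DataFrame that assign is called on,
# which is convenient in the middle of a method chain.
with_ratio = measurements.assign(
    sepal_ratio=lambda x: x['SepalWidth'] / x['SepalLength'])

print(with_ratio)
print(list(measurements.columns))  # the original has no sepal_ratio column
```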

The following table summarizes indexing and selection on a DataFrame:

Operation	Syntax	Result
Select column	`df[col]`	Series
Select row by label	`df.loc[label]`	Series
Select row by integer location	`df.iloc[loc]`	Series
Slice rows	`df[5:10]`	DataFrame
Select rows by boolean vector	`df[bool_vec]`	DataFrame
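
Each row of the table can be tried on a small frame. A minimal sketch with illustrative data (an assumption):

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]},
                  index=['a', 'b', 'c', 'd'])

col = df['one']               # select column -> Series
row_by_label = df.loc['b']    # select row by label -> Series
row_by_pos = df.iloc[2]       # select row by integer location -> Series
row_slice = df[1:3]           # slice rows -> DataFrame
masked = df[df['one'] > 2]    # boolean vector -> DataFrame

for result in (col, row_by_label, row_by_pos, row_slice, masked):
    print(type(result).__name__)
```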

This article has been included in http://www.flydean.com/03-python-pandas-data-structures/

