Pandas advanced tutorial: sparse data structure

Introduction

If there are many NaN values in the data, it will waste space to store them. In order to solve this problem, Pandas introduced a structure called Sparse data to effectively store these NaN values.

Spare data example

We create an array, then set most of its data to NaN, and then use this array to create a SparseArray:

In [1]: arr = np.random.randn(10)

In [2]: arr[2:-2] = np.nan

In [3]: ts = pd.Series(pd.arrays.SparseArray(arr))

In [4]: ts
Out[4]: 
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: Sparse[float64, nan]

The dtype type here is Sparse[float64, nan], which means that nan in the array is not actually stored, only non-nan data is stored, and the type of these data is float64.

SparseArray

arrays.SparseArray is a ExtensionArray used to store sparse array types.

In [13]: arr = np.random.randn(10)

In [14]: arr[2:5] = np.nan

In [15]: arr[7:8] = np.nan

In [16]: sparr = pd.arrays.SparseArray(arr)

In [17]: sparr
Out[17]: 
[-1.9556635297215477, -1.6588664275960427, nan, nan, nan, 1.1589328886422277, 0.14529711373305043, nan, 0.6060271905134522, 1.3342113401317768]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)

Use numpy.asarray() to convert it to an ordinary array:

In [18]: np.asarray(sparr)
Out[18]: 
array([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
           nan,  0.606 ,  1.3342])

SparseDtype

SparseDtype represents the Spare type. It contains two kinds of information, the first is the data type of non-NaN value, and the second is the constant value when filling, such as nan:

In [19]: sparr.dtype
Out[19]: Sparse[float64, nan]

A SparseDtype can be constructed as follows:

In [20]: pd.SparseDtype(np.dtype('datetime64[ns]'))
Out[20]: Sparse[datetime64[ns], NaT]

You can specify the filled value:

In [21]: pd.SparseDtype(np.dtype('datetime64[ns]'),
   ....:                fill_value=pd.Timestamp('2017-01-01'))
   ....: 
Out[21]: Sparse[datetime64[ns], Timestamp('2017-01-01 00:00:00')]

Sparse attributes

Sparse can be accessed through .sparse:

In [23]: s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")

In [24]: s.sparse.density
Out[24]: 0.5

In [25]: s.sparse.fill_value
Out[25]: 0

Sparse calculation

The calculation function of np can be used directly in SparseArray and will return a SparseArray.

In [26]: arr = pd.arrays.SparseArray([1., np.nan, np.nan, -2., np.nan])

In [27]: np.abs(arr)
Out[27]: 
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)

SparseSeries and SparseDataFrame

SparseSeries and SparseDataFrame were deleted in version 1.0.0. What replaces them is the more powerful SparseArray.

Look at the difference in the use of the two:

# Previous way
>>> pd.SparseDataFrame({"A": [0, 1]})

# New way
In [31]: pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})
Out[31]: 
   A
0  0
1  1

If it is a sparse matrix in SciPy, you can use DataFrame.sparse.from_spmatrix():

# Previous way
>>> from scipy import sparse
>>> mat = sparse.eye(3)
>>> df = pd.SparseDataFrame(mat, columns=['A', 'B', 'C'])

# New way
In [32]: from scipy import sparse

In [33]: mat = sparse.eye(3)

In [34]: df = pd.DataFrame.sparse.from_spmatrix(mat, columns=['A', 'B', 'C'])

In [35]: df.dtypes
Out[35]: 
A    Sparse[float64, 0]
B    Sparse[float64, 0]
C    Sparse[float64, 0]
dtype: object

This article has been included in http://www.flydean.com/13-python-pandas-sparse-data/
The most popular interpretation, the most profound dry goods, the most concise tutorial, and many tips you don't know are waiting for you to discover!

Pandas advanced tutorial: sparse data structure

Introduction

Spare data example

SparseArray

SparseDtype

Sparse attributes

Sparse calculation

SparseSeries and SparseDataFrame

flydean

引用和评论

在stable diffussion中完美修复AI图片

Anaconda安装教程以及Anaconda和pip配置国内镜像

科学计算编程涉及到的技术栈简介

Python3 格式化时间（qbit）

manus 的替代品有哪些？使用LLM大模型技术做手机/网页/浏览器自动化操作技术汇总

怎么判断自己下载的 trae 是国际版还是国内版？

如何系统地入门学习stm32？