# Introduction

If a dataset contains many NaN values, storing every one of them wastes space. To address this, pandas provides sparse data structures that store such data efficiently by omitting the NaN values.

# Sparse data example

We create an array, set most of its values to NaN, and then use it to build a SparseArray:

``````In : import numpy as np

In : import pandas as pd

In : arr = np.random.randn(10)

In : arr[2:-2] = np.nan

In : ts = pd.Series(pd.arrays.SparseArray(arr))

In : ts
Out:
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: Sparse[float64, nan]``````

The dtype here is Sparse[float64, nan]. This means the NaN values in the array are not actually stored; only the non-NaN values are stored, and they are of type float64.
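To make the space saving concrete, the following sketch compares the memory footprint of a dense Series with its sparse counterpart. The array length of 10,000 and the variable names are illustrative choices, not part of the example above.

```python
import numpy as np
import pandas as pd

# Build a mostly-NaN array: only the first and last 10 values are kept.
arr = np.random.randn(10_000)
arr[10:-10] = np.nan

dense = pd.Series(arr)
sparse = pd.Series(pd.arrays.SparseArray(arr))

# index=False excludes the index so that only the values are measured.
dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)

print(f"dense:  {dense_bytes} bytes")
print(f"sparse: {sparse_bytes} bytes")
```

The sparse version only has to hold the 20 non-NaN values and their integer positions, so its footprint should be a small fraction of the dense one.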

# SparseArray

`arrays.SparseArray` is an `ExtensionArray` for storing sparse arrays.

``````In : arr = np.random.randn(10)

In : arr[2:5] = np.nan

In : arr[7:8] = np.nan

In : sparr = pd.arrays.SparseArray(arr)

In : sparr
Out:
[-1.9556635297215477, -1.6588664275960427, nan, nan, nan, 1.1589328886422277, 0.14529711373305043, nan, 0.6060271905134522, 1.3342113401317768]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)``````

Use numpy.asarray() to convert it to an ordinary (dense) NumPy array:

``````In : np.asarray(sparr)
Out:
array([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
           nan,  0.606 ,  1.3342])``````
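As a minimal sketch of what a SparseArray actually stores, its `sp_values`, `sp_index` and `fill_value` attributes can be inspected before converting back to a dense array. The fixed input values here are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([1.5, np.nan, np.nan, -0.5, np.nan])
sparr = pd.arrays.SparseArray(values)

# Only the non-fill values are physically stored...
print(sparr.sp_values)
# ...together with the positions they occupy in the full array.
print(sparr.sp_index)
# Every other position is assumed to hold the fill value.
print(sparr.fill_value)

# Converting back yields an ordinary NumPy array of the original length.
dense = np.asarray(sparr)
print(dense)
```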

# SparseDtype

SparseDtype describes a sparse type. It holds two pieces of information: the dtype of the stored (non-fill) values, and the scalar fill value that is not stored, such as nan:

``````In : sparr.dtype
Out: Sparse[float64, nan]``````

A SparseDtype can be constructed as follows:

``````In : pd.SparseDtype(np.dtype('datetime64[ns]'))
Out: Sparse[datetime64[ns], NaT]``````

You can also specify the fill value:

``````In : pd.SparseDtype(np.dtype('datetime64[ns]'),
....:                fill_value=pd.Timestamp('2017-01-01'))
....:
Out: Sparse[datetime64[ns], Timestamp('2017-01-01 00:00:00')]``````
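A SparseDtype can also be passed to `astype` to convert existing dense data. The sketch below assumes a float Series in which 0 is the value worth omitting; the data is made up for illustration.

```python
import numpy as np
import pandas as pd

dense = pd.Series([0.0, 0.0, 1.5, 0.0])

# Store float64 values and treat 0 as the fill value that is not stored.
dtype = pd.SparseDtype(np.float64, fill_value=0)
sparse = dense.astype(dtype)

print(sparse.dtype)             # the sparse dtype of the converted Series
print(sparse.dtype.subtype)     # dtype of the stored values
print(sparse.dtype.fill_value)  # value omitted from storage
```

Choosing the most frequent value as the fill value is what keeps the stored portion small; here only a single value needs to be kept.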

# Sparse attributes

Sparse-specific attributes of a Series are available through the `.sparse` accessor:

``````In : s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")

In : s.sparse.density
Out: 0.5

In : s.sparse.fill_value
Out: 0``````
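Beyond `density` and `fill_value`, the accessor also exposes, for instance, `npoints`, `sp_values` and `to_dense()`. A short sketch using the same Series:

```python
import pandas as pd

s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")

print(s.sparse.npoints)    # number of stored (non-fill) points
print(s.sparse.sp_values)  # the stored values themselves

# to_dense() returns an ordinary Series with the fill values materialised.
dense = s.sparse.to_dense()
print(dense.tolist())
```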

# Sparse calculation

NumPy ufuncs can be applied directly to a SparseArray, and they return a SparseArray.

``````In : arr = pd.arrays.SparseArray([1., np.nan, np.nan, -2., np.nan])

In : np.abs(arr)
Out:
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)``````
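One subtlety is worth a sketch: the ufunc is applied to the fill value as well, so that the dense result stays correct. The fill value of -1 below is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Use -1 as the fill value instead of NaN.
arr = pd.arrays.SparseArray([1.0, -1, -1, -2.0, -1], fill_value=-1)

result = np.abs(arr)

print(type(result).__name__)  # the result is still a SparseArray
print(result.fill_value)      # abs() was applied to the fill value too
print(np.asarray(result))     # dense view of the result
```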

# SparseSeries and SparseDataFrame

SparseSeries and SparseDataFrame were removed in pandas 1.0.0. They are replaced by ordinary Series and DataFrame objects that hold sparse values in a SparseArray.

Compare the old and new usage:

``````# Previous way
>>> pd.SparseDataFrame({"A": [0, 1]})``````
``````# New way
In : pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})
Out:
   A
0  0
1  1``````
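An existing dense DataFrame can be converted in a similar way with `astype`. The following is a sketch with made-up data, assuming 0 is the value to leave unstored:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 0, 0], "B": [0, 0, 0, 2]})

# Convert every column to a sparse dtype with 0 as the fill value.
sdf = df.astype(pd.SparseDtype("int64", fill_value=0))

print(sdf.dtypes)
print(sdf.sparse.density)  # fraction of values that are actually stored

# The .sparse accessor converts back to a regular DataFrame.
roundtrip = sdf.sparse.to_dense()
print(roundtrip)
```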

If you have a SciPy sparse matrix, use DataFrame.sparse.from_spmatrix():

``````# Previous way
>>> from scipy import sparse
>>> mat = sparse.eye(3)
>>> df = pd.SparseDataFrame(mat, columns=['A', 'B', 'C'])``````
``````# New way
In : from scipy import sparse

In : mat = sparse.eye(3)

In : df = pd.DataFrame.sparse.from_spmatrix(mat, columns=['A', 'B', 'C'])

In : df.dtypes
Out:
A    Sparse[float64, 0]
B    Sparse[float64, 0]
C    Sparse[float64, 0]
dtype: object``````
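The conversion also works in the opposite direction through `DataFrame.sparse.to_coo()`. A minimal round-trip sketch, assuming SciPy is installed:

```python
import numpy as np
import pandas as pd
from scipy import sparse

mat = sparse.eye(3)
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=["A", "B", "C"])

# to_coo() builds a SciPy COO sparse matrix from the sparse columns.
coo = df.sparse.to_coo()

print(type(coo).__name__)
print(coo.toarray())
```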
