Pandas advanced tutorial: sparse data structures



When a dataset contains many NaN values, storing each of them explicitly wastes memory. To address this, pandas provides sparse data structures, which omit every value equal to a chosen fill value (such as NaN) and store only the remaining values.

Sparse data example

We create an array, set most of its values to NaN, and then build a SparseArray from it (the examples assume that numpy is imported as np and pandas as pd):

In [1]: arr = np.random.randn(10)

In [2]: arr[2:-2] = np.nan

In [3]: ts = pd.Series(pd.arrays.SparseArray(arr))

In [4]: ts
Out[4]:
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: Sparse[float64, nan]

The dtype here is Sparse[float64, nan]. The stored values have type float64 and NaN is the fill value, so the NaN entries are not actually stored; only the non-NaN data is.
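To make the space saving concrete, the following sketch compares the bytes used by a dense NumPy array with those used by the equivalent SparseArray. The array size (10,000 values, 100 of them non-NaN) and the variable names are assumptions chosen for illustration, not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical sizes for illustration: 10,000 values, only 100 are not NaN.
rng = np.random.default_rng(0)
dense = np.full(10_000, np.nan)
dense[:100] = rng.standard_normal(100)

sparse = pd.arrays.SparseArray(dense)  # NaN is the default fill value for floats

dense_bytes = dense.nbytes    # every element is stored, NaN or not
sparse_bytes = sparse.nbytes  # stored values plus the integer index of their positions

print(dense_bytes, sparse_bytes, sparse.density)
```

The saving grows with the share of fill values. For data that is mostly non-NaN, the extra position index can make the sparse form larger than the dense one.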


arrays.SparseArray is an ExtensionArray for storing an array of sparse values.

In [13]: arr = np.random.randn(10)

In [14]: arr[2:5] = np.nan

In [15]: arr[7:8] = np.nan

In [16]: sparr = pd.arrays.SparseArray(arr)

In [17]: sparr
Out[17]:
[-1.9556635297215477, -1.6588664275960427, nan, nan, nan, 1.1589328886422277, 0.14529711373305043, nan, 0.6060271905134522, 1.3342113401317768]
Fill: nan
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)

Use numpy.asarray() to convert it to an ordinary (dense) NumPy array:

In [18]: np.asarray(sparr)
Out[18]:
array([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
           nan,  0.606 ,  1.3342])
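As a small sketch of the relationship between the stored values and the dense form, the example below inspects sp_values and converts back to dense data in three ways. The input values are made up for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([1.5, np.nan, np.nan, -0.5, np.nan])
sparr = pd.arrays.SparseArray(arr)

stored = sparr.sp_values      # only the non-fill values are kept here
dense_a = np.asarray(sparr)   # dense ndarray, NaN restored at the fill positions
dense_b = sparr.to_dense()    # same result through the SparseArray method

s = pd.Series(sparr)
dense_series = s.sparse.to_dense()  # Series with a plain float64 dtype
```

All three conversions materialize every element, so they give up the memory saving; they are mainly useful when handing data to code that does not understand sparse arrays.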


SparseDtype describes the sparse type. It holds two pieces of information: the dtype of the stored (non-fill) values, and the fill value, which is the constant that is not stored, such as nan:

In [19]: sparr.dtype
Out[19]: Sparse[float64, nan]

A SparseDtype can be constructed by passing a dtype; a default fill value is then chosen for that dtype (NaT for datetimes):

In [20]: pd.SparseDtype(np.dtype('datetime64[ns]'))
Out[20]: Sparse[datetime64[ns], NaT]

You can also specify the fill value explicitly:

In [21]: pd.SparseDtype(np.dtype('datetime64[ns]'),
   ....:                fill_value=pd.Timestamp('2017-01-01'))
Out[21]: Sparse[datetime64[ns], Timestamp('2017-01-01 00:00:00')]
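The same constructor works for numeric data. The sketch below builds a SparseDtype with 0.0 as the fill value and uses it to convert an ordinary Series with astype; the sample values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# A dtype that stores float64 values and treats 0.0 as the omitted fill value.
dtype = pd.SparseDtype(np.float64, fill_value=0.0)

# Converting a dense Series: zeros are dropped, only 1.5 is physically stored.
s = pd.Series([0.0, 0.0, 1.5, 0.0]).astype(dtype)

print(dtype.subtype)     # dtype of the stored values
print(dtype.fill_value)  # value that is not stored
print(s.dtype)
```

Choosing the fill value to match the most common value in the data is what makes the representation compact.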

Sparse attributes

Sparse-specific attributes are accessed through the .sparse accessor:

In [23]: s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")

In [24]: s.sparse.density
Out[24]: 0.5

In [25]: s.sparse.fill_value
Out[25]: 0
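For the same Series, the sketch below shows how density relates to the number of stored points: it is the count of non-fill values divided by the total length. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")

n_stored = s.sparse.npoints          # values that differ from the fill value
stored_values = s.sparse.sp_values   # NumPy array holding just those values
density = n_stored / len(s)          # matches s.sparse.density

print(n_stored, list(stored_values), density)
```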

Sparse calculation

NumPy ufuncs can be applied directly to a SparseArray, and they return a SparseArray:

In [26]: arr = pd.arrays.SparseArray([1., np.nan, np.nan, -2., np.nan])

In [27]: np.abs(arr)
Out[27]:
[1.0, nan, nan, 2.0, nan]
Fill: nan
Indices: array([0, 3], dtype=int32)
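One detail worth noting is that a ufunc is applied to the fill value as well as to the stored values, which is what keeps the dense result correct. The sketch below uses a fill value of -1.0, chosen purely for illustration, to make this visible.

```python
import numpy as np
import pandas as pd

# -1.0 is the fill value, so only 1.0 and -2.0 are physically stored.
arr = pd.arrays.SparseArray([1.0, -1.0, -1.0, -2.0, -1.0], fill_value=-1.0)

result = np.abs(arr)

print(type(result).__name__)  # still a SparseArray
print(result.fill_value)      # abs() was applied to the fill value too
print(result.to_dense())
```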

SparseSeries and SparseDataFrame

SparseSeries and SparseDataFrame were removed in pandas 1.0.0. They are replaced by ordinary Series and DataFrame objects that hold sparse values in a SparseArray.

Compare the old and new usage:

# Previous way
>>> pd.SparseDataFrame({"A": [0, 1]})
# New way
In [31]: pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})
Out[31]:
   A
0  0
1  1
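An existing dense DataFrame can also be converted without building each SparseArray by hand. The sketch below, with made-up column data, uses astype with a SparseDtype and then converts back through the .sparse accessor.

```python
import pandas as pd

dense = pd.DataFrame({"A": [1, 0, 0, 1], "B": [0, 0, 0, 2]}, dtype="int64")

# Every column becomes Sparse[int64, 0]; zeros are no longer stored.
sparse_df = dense.astype(pd.SparseDtype("int64", fill_value=0))

density = sparse_df.sparse.density        # fraction of stored (non-fill) cells
round_trip = sparse_df.sparse.to_dense()  # back to ordinary int64 columns

print(sparse_df.dtypes)
print(density)
```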

If the data is a SciPy sparse matrix, use DataFrame.sparse.from_spmatrix():

# Previous way
>>> from scipy import sparse
>>> mat = sparse.eye(3)
>>> df = pd.SparseDataFrame(mat, columns=['A', 'B', 'C'])
# New way
In [32]: from scipy import sparse

In [33]: mat = sparse.eye(3)

In [34]: df = pd.DataFrame.sparse.from_spmatrix(mat, columns=['A', 'B', 'C'])

In [35]: df.dtypes
Out[35]:
A    Sparse[float64, 0]
B    Sparse[float64, 0]
C    Sparse[float64, 0]
dtype: object
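The conversion also works in the other direction. As a sketch, assuming SciPy is installed and using a small made-up matrix, a sparse DataFrame can be turned back into a SciPy COO matrix with DataFrame.sparse.to_coo():

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical 2x3 matrix with two non-zero entries.
mat = sparse.csr_matrix(np.array([[1.0, 0.0, 0.0],
                                  [0.0, 0.0, 2.0]]))

df = pd.DataFrame.sparse.from_spmatrix(mat, columns=["A", "B", "C"])

coo = df.sparse.to_coo()   # SciPy sparse matrix in COO format
restored = coo.toarray()   # dense ndarray, used here only to compare

print(df.dtypes)
print(restored)
```

to_coo() relies on the columns using 0 as their fill value, which is what from_spmatrix() produces.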
