Introduction
If there are many NaN values in the data, it will waste space to store them. In order to solve this problem, Pandas introduced a structure called Sparse data to effectively store these NaN values.
Spare data example
We create an array, then set most of its data to NaN, and then use this array to create a SparseArray:
In [1]: arr = np.random.randn(10)
In [2]: arr[2:-2] = np.nan
In [3]: ts = pd.Series(pd.arrays.SparseArray(arr))
In [4]: ts
Out[4]:
0 0.469112
1 -0.282863
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 -0.861849
9 -2.104569
dtype: Sparse[float64, nan]
The dtype type here is Sparse[float64, nan], which means that nan in the array is not actually stored, only non-nan data is stored, and the type of these data is float64.
SparseArray
arrays.SparseArray
is a ExtensionArray
used to store sparse array types.
In [13]: arr = np.random.randn(10)
In [14]: arr[2:5] = np.nan
In [15]: arr[7:8] = np.nan
In [16]: sparr = pd.arrays.SparseArray(arr)
In [17]: sparr
Out[17]:
[-1.9556635297215477, -1.6588664275960427, nan, nan, nan, 1.1589328886422277, 0.14529711373305043, nan, 0.6060271905134522, 1.3342113401317768]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)
Use numpy.asarray() to convert it to an ordinary array:
In [18]: np.asarray(sparr)
Out[18]:
array([-1.9557, -1.6589, nan, nan, nan, 1.1589, 0.1453,
nan, 0.606 , 1.3342])
SparseDtype
SparseDtype represents the Spare type. It contains two kinds of information, the first is the data type of non-NaN value, and the second is the constant value when filling, such as nan:
In [19]: sparr.dtype
Out[19]: Sparse[float64, nan]
A SparseDtype can be constructed as follows:
In [20]: pd.SparseDtype(np.dtype('datetime64[ns]'))
Out[20]: Sparse[datetime64[ns], NaT]
You can specify the filled value:
In [21]: pd.SparseDtype(np.dtype('datetime64[ns]'),
....: fill_value=pd.Timestamp('2017-01-01'))
....:
Out[21]: Sparse[datetime64[ns], Timestamp('2017-01-01 00:00:00')]
Sparse attributes
Sparse can be accessed through .sparse:
In [23]: s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")
In [24]: s.sparse.density
Out[24]: 0.5
In [25]: s.sparse.fill_value
Out[25]: 0
Sparse calculation
The calculation function of np can be used directly in SparseArray and will return a SparseArray.
In [26]: arr = pd.arrays.SparseArray([1., np.nan, np.nan, -2., np.nan])
In [27]: np.abs(arr)
Out[27]:
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)
SparseSeries and SparseDataFrame
SparseSeries and SparseDataFrame were deleted in version 1.0.0. What replaces them is the more powerful SparseArray.
Look at the difference in the use of the two:
# Previous way
>>> pd.SparseDataFrame({"A": [0, 1]})
# New way
In [31]: pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})
Out[31]:
A
0 0
1 1
If it is a sparse matrix in SciPy, you can use DataFrame.sparse.from_spmatrix():
# Previous way
>>> from scipy import sparse
>>> mat = sparse.eye(3)
>>> df = pd.SparseDataFrame(mat, columns=['A', 'B', 'C'])
# New way
In [32]: from scipy import sparse
In [33]: mat = sparse.eye(3)
In [34]: df = pd.DataFrame.sparse.from_spmatrix(mat, columns=['A', 'B', 'C'])
In [35]: df.dtypes
Out[35]:
A Sparse[float64, 0]
B Sparse[float64, 0]
C Sparse[float64, 0]
dtype: object
This article has been included in http://www.flydean.com/13-python-pandas-sparse-data/
The most popular interpretation, the most profound dry goods, the most concise tutorial, and many tips you don't know are waiting for you to discover!
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。