Introduction
If your data contains many NaN values, storing every one of them wastes space. To address this, pandas provides sparse data structures that store NaN (or any other single repeated value) efficiently.
Sparse data example
We create an array, set most of its values to NaN, and then use it to build a SparseArray (the examples assume `import numpy as np` and `import pandas as pd`):
In [1]: arr = np.random.randn(10)
In [2]: arr[2:-2] = np.nan
In [3]: ts = pd.Series(pd.arrays.SparseArray(arr))
In [4]: ts
Out[4]:
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: Sparse[float64, nan]
The dtype here is Sparse[float64, nan]. This means the NaN values in the array are not actually stored; only the non-NaN values are stored, and they are of type float64.
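To make the space saving concrete, here is a minimal sketch that compares the memory footprint of a dense and a sparse Series. The array size and the 1% ratio of real values are assumptions chosen for illustration; exact byte counts may vary with the pandas version.

```python
import numpy as np
import pandas as pd

# Illustrative data: 10,000 floats, of which only every 100th is a real value.
values = np.full(10_000, np.nan)
values[::100] = 1.0

dense = pd.Series(values)
sparse = pd.Series(pd.arrays.SparseArray(values))

# memory_usage reports bytes; index=False counts only the values themselves.
dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)

print(f"dense:  {dense_bytes} bytes")
print(f"sparse: {sparse_bytes} bytes")
```

The sparse Series only has to hold the 100 real values and their positions, so it should come out far smaller than the dense one.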
SparseArray
arrays.SparseArray is an ExtensionArray for storing an array of sparse values.
In [13]: arr = np.random.randn(10)
In [14]: arr[2:5] = np.nan
In [15]: arr[7:8] = np.nan
In [16]: sparr = pd.arrays.SparseArray(arr)
In [17]: sparr
Out[17]:
[-1.9556635297215477, -1.6588664275960427, nan, nan, nan, 1.1589328886422277, 0.14529711373305043, nan, 0.6060271905134522, 1.3342113401317768]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)
A SparseArray can be converted to a regular (dense) array with numpy.asarray():
In [18]: np.asarray(sparr)
Out[18]:
array([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
           nan,  0.606 ,  1.3342])
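The following sketch shows what a SparseArray physically holds and confirms that the dense conversion restores the missing entries. The input values are made up for illustration, and reading positions through `sp_index.indices` assumes the default integer index kind.

```python
import numpy as np
import pandas as pd

arr = np.array([1.5, np.nan, np.nan, -0.5, np.nan])
sparr = pd.arrays.SparseArray(arr)

# Only the non-fill values are stored, together with their positions.
stored_values = sparr.sp_values            # the two non-NaN values
stored_positions = sparr.sp_index.indices  # positions 0 and 3

# Converting back gives an ordinary NumPy array with the NaNs restored.
dense = np.asarray(sparr)

print(stored_values, stored_positions)
print(dense)
```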
SparseDtype
SparseDtype describes the sparse type. It holds two pieces of information: the dtype of the stored (non-fill) values, and the scalar fill value that is not stored, such as NaN:
In [19]: sparr.dtype
Out[19]: Sparse[float64, nan]
You can construct a SparseDtype as follows:
In [20]: pd.SparseDtype(np.dtype('datetime64[ns]'))
Out[20]: Sparse[datetime64[ns], NaT]
You can also specify the fill value explicitly:
In [21]: pd.SparseDtype(np.dtype('datetime64[ns]'),
   ....:                fill_value=pd.Timestamp('2017-01-01'))
   ....:
Out[21]: Sparse[datetime64[ns], Timestamp('2017-01-01 00:00:00')]
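As a small sketch of how the two parts of a SparseDtype behave, the example below inspects the default fill value for float and integer subtypes and then applies a dtype with an explicit fill value of 0.0. The sample Series is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Without an explicit fill_value, the default depends on the subtype:
# NaN for floats, 0 for integers.
float_dtype = pd.SparseDtype(np.float64)
int_dtype = pd.SparseDtype(np.int64)

# An explicit fill value: zeros, rather than NaNs, are left unstored.
zero_filled = pd.SparseDtype(np.float64, fill_value=0.0)

s = pd.Series([0.0, 0.0, 3.0, 0.0]).astype(zero_filled)

print(float_dtype.subtype, float_dtype.fill_value)
print(int_dtype.subtype, int_dtype.fill_value)
print(s.dtype, s.array.sp_values)
```

Choosing the fill value to match the most common value in the data is what keeps the stored portion small.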
Sparse properties
Sparse-specific properties are available through the .sparse accessor:
In [23]: s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")
In [24]: s.sparse.density
Out[24]: 0.5
In [25]: s.sparse.fill_value
Out[25]: 0
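For reference, here is a short sketch of the other properties the accessor exposes on the same Series. The density is the fraction of values that are actually stored, that is, the number of stored points divided by the length.

```python
import pandas as pd

s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")

n_stored = s.sparse.npoints    # number of values that differ from the fill value
density = s.sparse.density     # npoints / len(s)
fill = s.sparse.fill_value     # the value that is not stored
stored = s.sparse.sp_values    # NumPy array holding only the stored values

# to_dense() returns an ordinary Series with the fill values materialized.
dense = s.sparse.to_dense()

print(n_stored, density, fill, stored)
print(dense)
```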
Sparse calculation
NumPy ufuncs can be applied directly to a SparseArray, and they return a SparseArray:
In [26]: arr = pd.arrays.SparseArray([1., np.nan, np.nan, -2., np.nan])
In [27]: np.abs(arr)
Out[27]:
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)
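When the fill value is not NaN, the ufunc is applied to the fill value as well, so the result stays consistent with the dense computation. The sketch below uses -1.0 as an illustrative fill value.

```python
import numpy as np
import pandas as pd

# -1.0 is the fill value here, so only 1.0 and -2.0 are stored.
arr = pd.arrays.SparseArray([1.0, -1.0, -1.0, -2.0, -1.0], fill_value=-1.0)

result = np.abs(arr)

# The ufunc is applied to the fill value as well as to the stored values.
print(result.fill_value)
print(np.asarray(result))
```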
SparseSeries and SparseDataFrame
SparseSeries and SparseDataFrame were removed in version 1.0.0. In their place, use a regular Series or DataFrame whose values are held in the more powerful SparseArray.
Look at the differences in how they are used:
# Previous way
>>> pd.SparseDataFrame({"A": [0, 1]})
# New way
In [31]: pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})
Out[31]:
   A
0  0
1  1
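An existing dense DataFrame can also be converted after the fact. The following is a minimal sketch, assuming integer columns in which 0 is the value worth leaving unstored; the column names and data are made up for illustration.

```python
import pandas as pd

dense_df = pd.DataFrame({"A": [0, 1, 0, 0], "B": [0, 0, 0, 2]}, dtype="int64")

# Convert every column to a sparse dtype whose fill value is 0.
sparse_df = dense_df.astype(pd.SparseDtype("int64", fill_value=0))

# The DataFrame-level .sparse accessor reports the overall density
# and converts back to an ordinary DataFrame.
density = sparse_df.sparse.density
round_trip = sparse_df.sparse.to_dense()

print(sparse_df.dtypes)
print(density)
```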
If your data is a SciPy sparse matrix, you can use DataFrame.sparse.from_spmatrix():
# Previous way
>>> from scipy import sparse
>>> mat = sparse.eye(3)
>>> df = pd.SparseDataFrame(mat, columns=['A', 'B', 'C'])
# New way
In [32]: from scipy import sparse
In [33]: mat = sparse.eye(3)
In [34]: df = pd.DataFrame.sparse.from_spmatrix(mat, columns=['A', 'B', 'C'])
In [35]: df.dtypes
Out[35]:
A    Sparse[float64, 0]
B    Sparse[float64, 0]
C    Sparse[float64, 0]
dtype: object
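The conversion also works in the other direction. As a sketch, assuming SciPy is installed, a sparse DataFrame can be turned back into a SciPy COO matrix with the DataFrame-level accessor:

```python
import pandas as pd
from scipy import sparse

mat = sparse.eye(3, format="csr")
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=["A", "B", "C"])

# to_coo() requires every column to be sparse and returns a COO matrix.
coo = df.sparse.to_coo()

print(coo.shape)
print(coo.toarray())
```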
This article is available at www.flydean.com/13-python-p…