
Introduction

If your data contains many NaN values, storing every one of them wastes space. To address this, pandas provides sparse data structures that store such values efficiently.

Sparse data example

We create an array, set most of its values to NaN, and then build a SparseArray from it:

In [1]: arr = np.random.randn(10)

In [2]: arr[2:-2] = np.nan

In [3]: ts = pd.Series(pd.arrays.SparseArray(arr))

In [4]: ts
Out[4]: 
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: Sparse[float64, nan]

The dtype here is Sparse[float64, nan]. This means the NaN values in the array are not actually stored; only the non-NaN values are stored, and those values are of type float64.
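To make the space saving concrete, the following sketch compares the memory used by a dense Series and its sparse counterpart. The data (10,000 floats, of which only every 100th is not NaN) and the variable names are made up for illustration; the exact byte counts may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Illustrative data: 10,000 floats, only every 100th one is not NaN.
values = np.full(10_000, np.nan)
values[::100] = 1.0

dense = pd.Series(values)
sparse = dense.astype(pd.SparseDtype("float64", np.nan))

# Memory used by the values alone (the index is excluded).
dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)

print(f"dense:   {dense_bytes} bytes")
print(f"sparse:  {sparse_bytes} bytes")
print(f"density: {sparse.sparse.density}")
```

The sparse version only needs room for the stored values and their positions, so the saving grows as the share of NaN values grows.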

SparseArray

arrays.SparseArray is an ExtensionArray for storing an array of sparse values.

In [13]: arr = np.random.randn(10)

In [14]: arr[2:5] = np.nan

In [15]: arr[7:8] = np.nan

In [16]: sparr = pd.arrays.SparseArray(arr)

In [17]: sparr
Out[17]: 
[-1.9556635297215477, -1.6588664275960427, nan, nan, nan, 1.1589328886422277, 0.14529711373305043, nan, 0.6060271905134522, 1.3342113401317768]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)

A SparseArray can be converted to a regular NumPy array with numpy.asarray():

In [18]: np.asarray(sparr)
Out[18]: 
array([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
           nan,  0.606 ,  1.3342])
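As a small sketch of what is actually kept in memory, the snippet below builds a SparseArray from fixed (not random) values and inspects the stored part before converting back. The input values are invented for the example; `sp_values` and `npoints` are attributes of SparseArray.

```python
import numpy as np
import pandas as pd

# Fixed input so the result is reproducible.
data = np.array([1.5, np.nan, np.nan, -2.0, np.nan])
sparr = pd.arrays.SparseArray(data)

# Only the non-NaN values are physically stored.
stored = sparr.sp_values
print(stored)          # the stored values
print(sparr.npoints)   # how many values are stored

# Converting back yields a regular ndarray with the NaN values restored.
dense_again = np.asarray(sparr)
print(dense_again)
```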

SparseDtype

SparseDtype describes a sparse type. It holds two pieces of information: the dtype of the stored (non-fill) values, and the fill value, a constant such as NaN that is not stored:

In [19]: sparr.dtype
Out[19]: Sparse[float64, nan]

You can construct a SparseDtype as follows:

In [20]: pd.SparseDtype(np.dtype('datetime64[ns]'))
Out[20]: Sparse[datetime64[ns], NaT]

You can also specify the fill value:

In [21]: pd.SparseDtype(np.dtype('datetime64[ns]'),
   ....:                fill_value=pd.Timestamp('2017-01-01'))
   ....: 
Out[21]: Sparse[datetime64[ns], Timestamp('2017-01-01 00:00:00')]
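The two parts of a SparseDtype can be read back through its `subtype` and `fill_value` attributes. The sketch below, with variable names chosen for illustration, shows the default fill value picked for a few subtypes and an explicitly chosen one.

```python
import numpy as np
import pandas as pd

# When fill_value is omitted, pandas picks a default based on the subtype.
float_dtype = pd.SparseDtype("float64")        # default fill value: NaN
int_dtype = pd.SparseDtype("int64")            # default fill value: 0
time_dtype = pd.SparseDtype("datetime64[ns]")  # default fill value: NaT

# An explicit fill value overrides the default.
custom = pd.SparseDtype("float64", fill_value=0.0)

for dtype in (float_dtype, int_dtype, time_dtype, custom):
    print(dtype, "| subtype:", dtype.subtype, "| fill_value:", dtype.fill_value)
```

Choosing the fill value to match the most common value in the data (for example 0 for count data) is what makes the sparse representation compact.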

Sparse properties

Sparse-specific properties are available through the .sparse accessor:

In [23]: s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")

In [24]: s.sparse.density
Out[24]: 0.5

In [25]: s.sparse.fill_value
Out[25]: 0
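The accessor only exists on Series that hold sparse data. The following sketch, using an illustrative Series, reads a few more accessor attributes and shows what happens when the accessor is used on a dense Series.

```python
import pandas as pd

# A sparse integer Series whose fill value is 0.
s = pd.Series([0, 0, 1, 2], dtype=pd.SparseDtype("int64", fill_value=0))

print(s.sparse.npoints)    # number of stored (non-fill) values
print(s.sparse.sp_values)  # the stored values themselves

# Convert back to an ordinary dense Series.
dense = s.sparse.to_dense()
print(dense.dtype)

# On a dense Series, the .sparse accessor raises AttributeError.
try:
    pd.Series([0, 0, 1, 2]).sparse
    accessor_failed = False
except AttributeError as exc:
    accessor_failed = True
    print("dense Series:", exc)
```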

Sparse calculation

NumPy ufuncs can be applied directly to a SparseArray, and they return a SparseArray.

In [26]: arr = pd.arrays.SparseArray([1., np.nan, np.nan, -2., np.nan])

In [27]: np.abs(arr)
Out[27]: 
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)
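A ufunc is applied to the fill value as well as to the stored values, so the result can have a different fill value. A minimal sketch with invented input values:

```python
import numpy as np
import pandas as pd

# Here the fill value is -1 rather than NaN.
arr = pd.arrays.SparseArray([1.0, -1.0, -1.0, -2.0, -1.0], fill_value=-1.0)

result = np.abs(arr)

print(type(result).__name__)  # still a SparseArray
print(result.fill_value)      # abs() was applied to the fill value too
print(np.asarray(result))     # dense view of the result
```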

SparseSeries and SparseDataFrame

SparseSeries and SparseDataFrame were removed in pandas 1.0.0. In their place, use a regular Series or DataFrame that holds the more powerful SparseArray.

Look at the differences in how they are used:

# Previous way
>>> pd.SparseDataFrame({"A": [0, 1]})
# New way
In [31]: pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})
Out[31]: 
   A
0  0
1  1
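An existing dense DataFrame can also be converted with astype, and a DataFrame whose columns are all sparse has its own .sparse accessor. The sketch below uses a made-up two-column frame.

```python
import pandas as pd

# An ordinary dense DataFrame that is mostly zeros.
df = pd.DataFrame({"A": [0, 1, 0, 0], "B": [0, 0, 0, 0]})

# Convert every column to a sparse dtype with 0 as the fill value.
sdf = df.astype(pd.SparseDtype("int64", fill_value=0))
print(sdf.dtypes)

# Ratio of stored values to total values, averaged over the columns.
print(sdf.sparse.density)

# Convert back to a dense DataFrame.
round_trip = sdf.sparse.to_dense()
print(round_trip.dtypes)
```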

If you start from a SciPy sparse matrix, use DataFrame.sparse.from_spmatrix():

# Previous way
>>> from scipy import sparse
>>> mat = sparse.eye(3)
>>> df = pd.SparseDataFrame(mat, columns=['A', 'B', 'C'])
# New way
In [32]: from scipy import sparse

In [33]: mat = sparse.eye(3)

In [34]: df = pd.DataFrame.sparse.from_spmatrix(mat, columns=['A', 'B', 'C'])

In [35]: df.dtypes
Out[35]: 
A    Sparse[float64, 0]
B    Sparse[float64, 0]
C    Sparse[float64, 0]
dtype: object
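The conversion also works in the other direction. As a sketch that assumes SciPy is installed, DataFrame.sparse.to_coo() turns a sparse DataFrame back into a SciPy sparse matrix:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# SciPy sparse identity matrix -> sparse DataFrame.
mat = sparse.eye(3)
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=["A", "B", "C"])

# Sparse DataFrame -> SciPy sparse matrix in COO format.
coo = df.sparse.to_coo()

print(type(coo).__name__)
print(coo.toarray())
```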

This article is available at www.flydean.com/13-python-p…
