Preface

Dask and cuDF can accelerate pandas-style workloads, but not everyone has a GPU, and many of us still work in plain pandas. There, the apply function is the usual tool for row-wise problems, and it is very slow. In this article we show how to speed up an apply-based computation by a factor of several hundred.

Experimental comparison

01 Apply (Baseline)

Taking apply as the baseline, the original apply call takes 18.4 s to handle the following problem.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 11, size=(1000000, 5)), columns=('a','b','c','d','e'))

def func(a, b, c, d, e):
    if e == 10:
        return c*d
    elif (e < 10) and (e >= 5):
        return c+d
    elif e < 5:
        return a+b

%%time
df['new'] = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
CPU times: user 17.9 s, sys: 301 ms, total: 18.2 s
Wall time: 18.4 s
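
The %%time line above is an IPython cell magic, so it only works inside a notebook. As a minimal sketch of how the same baseline could be timed from a plain Python script, the snippet below uses time.perf_counter. The reduced row count (n_rows) and the seeded generator are assumptions made so that the sketch finishes quickly; they are not part of the original experiment, and the timing will differ from the 18.4 s quoted above.

```python
import time

import numpy as np
import pandas as pd


def func(a, b, c, d, e):
    # Same three-way rule as in the article.
    if e == 10:
        return c * d
    elif (e < 10) and (e >= 5):
        return c + d
    elif e < 5:
        return a + b


# Assumption: a smaller frame than the article's 1,000,000 rows, to keep the run short.
n_rows = 10_000
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 11, size=(n_rows, 5)), columns=list("abcde"))

# Time the row-wise apply with a wall-clock timer instead of %%time.
start = time.perf_counter()
df["new"] = df.apply(lambda x: func(x["a"], x["b"], x["c"], x["d"], x["e"]), axis=1)
elapsed = time.perf_counter() - start

print(f"apply over {n_rows} rows took {elapsed:.3f} s")
```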

02 Swifter speedup

Because each row is processed independently, the work can be parallelized. With swifter, the same operation takes 7.67 s on my machine.

%%time
# !pip install swifter
import swifter
df['new'] = df.swifter.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
HBox(children=(HTML(value='Dask Apply'), FloatProgress(value=0.0, max=16.0), HTML(value='')))
CPU times: user 329 ms, sys: 240 ms, total: 569 ms
Wall time: 7.67 s

03 Vectorization

The fastest way to use pandas and NumPy is to vectorize the function. If our operation can be vectorized directly, we should avoid using:

  • for loops;
  • list comprehensions;
  • apply and similar operations.

Rewriting the problem above in vectorized form, as follows, reduces the time to 421 ms.

%%time
df['new'] = df['c'] * df['d'] # default case e == 10
mask = df['e'] < 10
df.loc[mask,'new'] = df['c'] + df['d']
mask = df['e'] < 5
df.loc[mask,'new'] = df['a'] + df['b']
CPU times: user 134 ms, sys: 149 ms, total: 283 ms
Wall time: 421 ms
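
The masked assignments rely on ordering: each later assignment overwrites part of the earlier one, so the default case has to come first and the narrowest mask last. A small sanity check along the following lines can confirm that the vectorized form reproduces the row-wise result. The sample size and the random seed are assumptions chosen only to keep the slow apply path short.

```python
import numpy as np
import pandas as pd


def func(a, b, c, d, e):
    # Row-wise rule from the article.
    if e == 10:
        return c * d
    elif (e < 10) and (e >= 5):
        return c + d
    elif e < 5:
        return a + b


# Assumption: a small sample so that the reference apply finishes quickly.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.integers(0, 11, size=(2_000, 5)), columns=list("abcde"))

# Reference result computed row by row.
expected = df.apply(lambda x: func(x["a"], x["b"], x["c"], x["d"], x["e"]), axis=1)

# Vectorized result: default first, then e < 10, then e < 5.
df["new"] = df["c"] * df["d"]
mask = df["e"] < 10
df.loc[mask, "new"] = df["c"] + df["d"]
mask = df["e"] < 5
df.loc[mask, "new"] = df["a"] + df["b"]

matches = bool((df["new"] == expected).all())
print("vectorized result matches apply:", matches)
```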

04 Type conversion + vectorization

We first convert columns a to d to int16 and then run the same vectorized operation; the time drops to 116 ms.

for col in ('a','b','c','d'):
    df[col] = df[col].astype(np.int16)

%%time
df['new'] = df['c'] * df['d'] # default case e == 10
mask = df['e'] < 10
df.loc[mask,'new'] = df['c'] + df['d']
mask = df['e'] < 5
df.loc[mask,'new'] = df['a'] + df['b']
CPU times: user 71.3 ms, sys: 42.5 ms, total: 114 ms
Wall time: 116 ms
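
The article does not say why the smaller type helps. One plausible contributor is that int16 columns hold a quarter of the bytes of int64 columns, so less memory has to be moved. The sketch below measures that footprint and checks that the products still fit in int16, since narrowing a type risks overflow. The frame size and the explicit int64 starting dtype are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: start from explicit int64 so the comparison is the same on every platform.
rng = np.random.default_rng(1)
df = pd.DataFrame(
    rng.integers(0, 11, size=(100_000, 5), dtype=np.int64),
    columns=list("abcde"),
)

cols = ["a", "b", "c", "d"]  # the columns the article converts
bytes_int64 = int(df[cols].memory_usage(index=False).sum())

for col in cols:
    df[col] = df[col].astype(np.int16)
bytes_int16 = int(df[cols].memory_usage(index=False).sum())

# Values lie in 0..10, so the largest product is at most 100, far below the int16 limit.
largest_product = int((df["c"] * df["d"]).max())
int16_max = int(np.iinfo(np.int16).max)

print(f"int64: {bytes_int64} bytes, int16: {bytes_int16} bytes")
print(f"largest product {largest_product} <= int16 max {int16_max}")
```

For data with a wider value range, the same check would show whether int16 is still safe or whether a larger type is needed.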

05 Converting to .values

Whenever a column can be converted to .values, convert it.

  • .values returns the underlying NumPy array, so the vectorized operations run faster.

As a result, the time for the operation above drops to 74.9 ms.

%%time
df['new'] = df['c'].values * df['d'].values # default case e == 10
mask = df['e'].values < 10
df.loc[mask,'new'] = df['c'] + df['d']
mask = df['e'].values < 5
df.loc[mask,'new'] = df['a'] + df['b']
CPU times: user 64.5 ms, sys: 12.5 ms, total: 77 ms
Wall time: 74.9 ms
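
The code above still goes through df.loc for the two masked assignments. As a sketch of how the same idea could be pushed further, the whole rule can be evaluated on raw arrays and written back once. This variant is an illustration, not the author's measured method: it uses Series.to_numpy(), which the pandas documentation recommends over .values, and a nested np.where. No timing is claimed for it.

```python
import numpy as np
import pandas as pd

# Assumption: a modest sample size, enough to exercise every branch.
rng = np.random.default_rng(7)
df = pd.DataFrame(rng.integers(0, 11, size=(10_000, 5)), columns=list("abcde"))

# For plain integer columns, .values and .to_numpy() both give a NumPy ndarray.
is_ndarray = isinstance(df["c"].values, np.ndarray)
a, b, c, d, e = (df[col].to_numpy() for col in "abcde")

# Same three-way rule, evaluated entirely on arrays.
df["new"] = np.where(e == 10, c * d, np.where(e >= 5, c + d, a + b))

# Cross-check against the masked-assignment form used in the article.
ref = df[list("abcde")].copy()
ref["new"] = ref["c"] * ref["d"]
ref.loc[ref["e"] < 10, "new"] = ref["c"] + ref["d"]
ref.loc[ref["e"] < 5, "new"] = ref["a"] + ref["b"]

same = bool(np.array_equal(df["new"].to_numpy(), ref["new"].to_numpy()))
print("array version matches masked assignment:", same)
```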

Experimental summary

Using the tricks above, we sped up the simple apply function hundreds of times. Specifically:

  • Apply: 18.4 s
  • Apply + Swifter: 7.67 s
  • Pandas vectorization: 421 ms
  • Pandas vectorization + data types: 116 ms
  • Pandas vectorization + values + data types: 74.9 ms
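
To make the size of each step concrete, the speedup over the baseline can be computed directly from the timings listed above. The dictionary below only restates those figures in seconds; the variable names are illustrative.

```python
# Wall times from the summary, converted to seconds.
timings_s = {
    "Apply": 18.4,
    "Apply + Swifter": 7.67,
    "Pandas vectorization": 0.421,
    "Pandas vectorization + data types": 0.116,
    "Pandas vectorization + values + data types": 0.0749,
}

baseline = timings_s["Apply"]

# Speedup of each variant relative to plain apply.
speedups = {name: baseline / seconds for name, seconds in timings_s.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

With these numbers, the final variant comes out at roughly 246 times faster than plain apply.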