Source: Python Data Science

Akik Tongo takes off

Vaex is an alternative to pandas. By using memory mapping, Vaex can be hundreds of times faster than pandas, but its limited functionality means it cannot yet challenge pandas.

1. A review of ways to speed up pandas

There are two ways to speed pandas up:

1. Vectorization

Vectorization is the best approach. By vectorization we mean a calculation that uses NumPy to operate on entire arrays at once rather than on individual elements. For example, here are two arrays:

import numpy as np

array_1 = np.array([1, 2, 3, 4, 5])
array_2 = np.array([6, 7, 8, 9, 10])

We want to create a new array that is the element-wise sum of the two arrays. The result should be:

result = [7, 9, 11, 13, 15]

Of course, we could also sum these arrays element by element with a Python for loop, but that is very slow. NumPy instead lets us operate directly on whole arrays, which is much faster, especially for large arrays.

result = array_1 + array_2
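To make the comparison concrete, here is a minimal sketch that computes the same sum both ways and then times them on larger arrays. The helper name `add_with_loop` and the array size used for timing are assumptions made for illustration. Actual timings depend on your machine, so none are quoted here.

```python
import timeit

import numpy as np


def add_with_loop(a, b):
    """Element-by-element sum using a plain Python for loop (hypothetical helper)."""
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out


array_1 = np.array([1, 2, 3, 4, 5])
array_2 = np.array([6, 7, 8, 9, 10])

loop_result = add_with_loop(array_1, array_2)
vectorized_result = array_1 + array_2  # one NumPy operation on whole arrays

# Both approaches produce the same values.
print(np.array_equal(loop_result, vectorized_result))

# Time both on larger arrays (size chosen arbitrarily for the demo).
big_1 = np.arange(100_000)
big_2 = np.arange(100_000)
loop_time = timeit.timeit(lambda: add_with_loop(big_1, big_2), number=3)
vec_time = timeit.timeit(lambda: big_1 + big_2, number=3)
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```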

2. Parallelization

The second is parallelization.

In pandas, you can process a DataFrame by using the apply function.

Because apply calls the function on the rows of the DataFrame one after another, in a single process, it can be very slow.

It is much faster if we take advantage of multiple processors: split the DataFrame into parts, send each part to a separate processor, and finally reassemble the results into a single DataFrame.
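The split, process, and reassemble idea can be sketched with the standard library's `concurrent.futures`. This is only an illustration under stated assumptions, not how Swifter does it internally: `row_score`, `apply_to_part`, and `parallel_apply` are hypothetical names, and the column names `a` and `b` are invented for the example.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd


def row_score(row):
    # Hypothetical per-row function, standing in for real work.
    return row["a"] + row["b"]


def apply_to_part(part):
    # Ordinary row-wise pandas apply on one part of the DataFrame.
    return part.apply(row_score, axis=1)


def parallel_apply(df, executor, n_parts=4):
    """Split df into n_parts, apply in the executor's workers, reassemble."""
    bounds = np.linspace(0, len(df), n_parts + 1, dtype=int)
    parts = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
    results = executor.map(apply_to_part, parts)  # preserves the order of parts
    return pd.concat(list(results))


if __name__ == "__main__":
    df = pd.DataFrame({"a": range(1000), "b": range(1000)})
    with ProcessPoolExecutor(max_workers=4) as pool:
        result = parallel_apply(df, pool)
    # Check that the parallel result matches plain apply.
    print(result.equals(df.apply(row_score, axis=1)))
```

The `if __name__ == "__main__":` guard matters when worker processes are started by re-importing the script. For a data set this small, the cost of starting processes and copying data may well outweigh any gain.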

How can we combine these two approaches to speed up apply?

Swifter does this automatically: with just one line of code, it makes apply run as fast as possible.

2. Introducing Swifter

Swifter works as follows:

1. First, it checks whether the function passed to apply can be vectorized. If so, it automatically uses the vectorized calculation, which is the most efficient option.

2. If vectorization is not possible, it checks whether it makes more sense to use Dask for parallel processing or to fall back on ordinary pandas apply, which uses only a single core.

Parallelization is not always worthwhile: on small data sets, the overhead of parallel processing outweighs the gain, so the choice also depends on the size of the data set.
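The decision steps above can be sketched roughly as follows. This is a simplified illustration, not Swifter's actual implementation: the function `choose_strategy`, the sample size of 100 rows, and the fixed row threshold `parallel_min_rows` are all assumptions made for the example.

```python
import pandas as pd


def choose_strategy(series, func, parallel_min_rows=100_000):
    """Return "vectorized", "parallel", or "apply" (hypothetical helper)."""
    sample = series.head(100)
    try:
        # Step 1: call func once on the whole sample and compare with the
        # element-by-element result. If they agree, func is vectorizable.
        whole = func(sample)
        elementwise = sample.apply(func)
        if isinstance(whole, pd.Series) and whole.equals(elementwise):
            return "vectorized"
    except Exception:
        # func cannot handle a whole Series (for example, it uses `if x > 5`).
        pass
    # Step 2: parallelize only when the data is large enough to repay the
    # overhead; the threshold here is an arbitrary placeholder.
    if len(series) >= parallel_min_rows:
        return "parallel"
    return "apply"


numbers = pd.Series(range(10))
print(choose_strategy(numbers, lambda x: x * 2))
print(choose_strategy(numbers, lambda x: x if x > 5 else 0))
```

The fixed threshold keeps the sketch short. Choosing between parallel processing and plain apply by timing both on a sample would be one alternative.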

Let’s look at a picture.

The figure shows that vectorization is almost always the fastest option, regardless of the size of the data. When vectorization is not possible, ordinary pandas apply is faster for small data sets, and parallel processing becomes faster once the data set size exceeds a certain threshold (the point where the red and blue lines cross).

This is amazing. Swifter automatically selects the best method for us.

3. How to use Swifter?

Swifter is very simple to use.

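import pandas as pd
import swifter  # registers the .swifter accessor on pandas objects

# Example data for illustration only; substitute your own DataFrame
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

df.swifter.apply(lambda x: x.sum() - x.min())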

All we need to do is import Swifter and call it in a single line of code. Give it a try!
