Write high-performance Pandas code

Python is one of the most commonly used languages for scientific computing, a field built on heavy data computation. If the code runs too slowly, the trial-and-error workflow that scientific computing depends on takes too long. So I’ve always wondered: how slow is Python, really, that people consider it slow before they even start using it? And how fast can it be made to crunch gigabytes of data?

Pandas is the most commonly used framework for processing scientific-computing data. Experimenting step by step, I found that performance depends heavily on how the code is written. Let’s compare the cost of several approaches to traversing a data set.

The data is from a random data set found on the Internet:

import pandas as pd
import numpy as np
 
data = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/EuStockMarkets.csv")
data.head()

   Unnamed: 0      DAX     SMI     CAC    FTSE
0           1  1628.75  1678.1  1772.8  2443.6
1           2  1613.63  1688.5  1750.5  2460.2
2           3  1606.51  1678.6  1718.0  2448.2
3           4  1621.04  1684.1  1708.1  2470.4
4           5  1618.16  1686.6  1723.1  2484.7
data.describe()

        Unnamed: 0          DAX          SMI          CAC         FTSE
count  1860.000000  1860.000000  1860.000000  1860.000000  1860.000000
mean    930.500000  2530.656882  3376.223710  2227.828495  3565.643172
std     537.080069  1084.792740  1663.026465   580.314198   976.715540
min       1.000000  1402.340000  1587.400000  1611.000000  2281.000000
25%     465.750000  1744.102500  2165.625000  1875.150000  2843.150000
50%     930.500000  2140.565000  2796.350000  1992.300000  3246.600000
75%    1395.250000  2722.367500  3812.425000  2274.350000  3993.575000
max    1860.000000  6186.090000  8412.000000  4388.500000  6179.000000

I’m going to use the DAX column for the test, and the function I’ll apply is sin(DAX) * 1.1. It has no special meaning; it just burns CPU time.

We’ll test the following approaches:

  1. Naive for loop
  2. The iterrows method
  3. The apply method
  4. The vectorized method

To time each approach consistently, every measurement uses IPython’s %%timeit magic.
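For reference, a rough equivalent outside IPython with the standard timeit module might look like this (a sketch; it times one of the computations benchmarked later):

import timeit

# Run the statement 1,000 times and report the average time per run.
total = timeit.timeit(lambda: np.sin(data['DAX']) * 1.1, number=1000)
print(f"{total / 1000 * 1e6:.1f} µs per loop")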

Let’s look at the first one.

Naive for loop

%%timeit

# Plain Python loop: index into the DataFrame row by row with iloc.
result = [np.sin(data.iloc[i]['DAX']) * 1.1 for i in range(len(data))]

data['target'] = result

422 ms ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With a regular for loop, 1,860 iterations take 400+ ms. The exact figure will vary from machine to machine, but it’s the baseline that everything below is measured against.

This is very slow, and this style of iteration is strongly discouraged, so let’s improve it.

The iterrows method

Pandas provides the iterrows method, which behaves like enumerate: each iteration yields the index and the row object for the current position.

iterrows is more efficient than the naive loop while staying just as easy to use. Pandas also provides an itertuples method, which offers higher performance at some cost in convenience: each row comes back as a lightweight namedtuple rather than a full Series. Columns whose names are not valid Python identifiers get renamed, so values are often read by position; in the code below, row[2] is the DAX column (position 0 is the index, position 1 the unnamed column).

%%timeit

result = [np.sin(row['DAX']) * 1.1 for index, row in data.iterrows()]

data['target'] = result

103 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit

# row[2] is the DAX column: position 0 is the index, position 1 the unnamed column.
result = [np.sin(row[2]) * 1.1 for row in data.itertuples()]

data['target'] = result

9.67 ms ± 649 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Besides iterrows and itertuples, there is another approach.

The apply method

This method traverses the DataFrame by invoking a callback on each slice, with the axis parameter choosing which axis to traverse (rows or columns).
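As a quick illustration of the axis parameter (not part of the benchmark; col_means and row_sums are just hypothetical names):

# axis=0 (the default) passes each column to the callback;
# axis=1 passes each row instead.
col_means = data.apply(np.mean, axis=0)  # one result per column
row_sums = data.apply(np.sum, axis=1)    # one result per row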

The apply method performs better than iterrows but worse than itertuples, and it is just as convenient as iterrows.

%%timeit
 
data['target'] = data.apply(lambda row: np.sin(row['DAX']) * 1.1, axis=1)

60.7 ms ± 3.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

With itertuples, we had already gotten down to about 9 ms, roughly ten times faster than iterrows. Can we do better still?

The vectorized method

If you do scientific computing, you know about vectors. The data in a DataFrame is stored column-wise as vectors, and nearly every operation on it, including filtering, is vectorized.
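For example, filtering builds a boolean vector in a single vectorized comparison and uses it as a mask (high_dax is just a hypothetical name):

# The comparison runs as one vectorized operation and yields a
# boolean Series, which then selects the matching rows.
high_dax = data[data['DAX'] > 2000]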

So is it faster to compute on a whole column as a vector?

%%timeit

data['target'] = np.sin(data['DAX']) * 1.1

427 µs ± 36.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As you can see, this is both simpler and more efficient: only 400+ µs, about 1,000 times faster than the original 400 ms, and already well under a millisecond.

Thinking

If you think about it, you can see why each step improved performance.

The naive for loop carries far too much information per row, most of it unnecessary, and it runs in Python’s slow interpreter loop. Both problems can be attacked. First, hand the looping over to the library: with apply, Pandas drives the iteration and only calls back into your function, which already gives a solid speedup. Then, reduce the information carried per row: itertuples hands back a plain immutable tuple instead of a full Series, which improves performance considerably.
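You can see the difference in what each method hands back per iteration (a minimal sketch; the exact printed class names may vary by Pandas version):

# Compare the per-row object produced by each iteration method.
index, row = next(data.iterrows())
print(type(row))   # a full pandas Series, constructed for every row
tup = next(data.itertuples())
print(type(tup))   # a lightweight namedtuple (named 'Pandas' by default)
print(tup.DAX)     # valid-identifier columns are also accessible by name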

However, a tuple is still not the DataFrame’s native structure, and converting every row into one costs time; skipping that conversion entirely should improve performance further.

And indeed, operating on the column as a vector improves performance by roughly another 20 times.

Even a DataFrame column carries some extra information, such as the DataFrame’s index. So dropping to the lowest-level unit of computation can improve performance yet again.

Pandas uses NumPy arrays as its underlying data storage, so extracting the raw NumPy array from the column avoids that overhead and improves performance once more.

%%timeit
 
data['target'] = np.sin(data['DAX'].values) * 1.1

25 µs ± 6.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That is another large jump, from 427 µs down to 25 µs, leaving us more than 10,000 times faster than the original 400+ ms.
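Putting the measurements side by side:

Method             Time per loop   Speedup vs. naive loop
Naive for loop     422 ms          1x
iterrows           103 ms          ~4x
apply              60.7 ms         ~7x
itertuples         9.67 ms         ~44x
Vectorized column  427 µs          ~990x
Raw NumPy array    25 µs           ~17,000x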

Conclusion

Many system designs worry about performance from the very beginning, which I think is completely unnecessary.

The hard part of system design is finding the best, or at least a better, way to implement the required functionality; performance is rarely the system’s bottleneck. Python lends itself to trial and error and fast iteration, and it offers many ways to optimize performance. If those aren’t enough, you can use Cython or rewrite parts of the code in C.

Of course, performance has always been Python’s Achilles heel; it was labeled slow from the start. Still, don’t worry about it too much. There’s nothing wrong with getting something working in a few dozen lines of Python first and only then rewriting it as thousands of lines of C.