Write high-performance Pandas code

Python is one of the most widely used languages for scientific computing, which means crunching large amounts of data. If the code is too slow, the trial-and-error workflow that scientific computing depends on takes too long. So I have always wondered: how slow is Python really, that people hesitate to start using it? And how fast can it get when you need to process gigabytes of data?

Pandas is the most commonly used library for processing data in scientific computing. Experimenting step by step, I found that its speed depends heavily on how the code is written. Let's compare the cost of several ways of traversing a data set.

The data is a data set picked more or less at random from the Internet (EuStockMarkets, from the Rdatasets collection):

import pandas as pd
import numpy as np
 
data = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/EuStockMarkets.csv")
data.head()

	Unnamed: 0	DAX	SMI	CAC	FTSE
0	1	1628.75	1678.1	1772.8	2443.6
1	2	1613.63	1688.5	1750.5	2460.2
2	3	1606.51	1678.6	1718.0	2448.2
3	4	1621.04	1684.1	1708.1	2470.4
4	5	1618.16	1686.6	1723.1	2484.7

data.describe()

	Unnamed: 0	DAX	SMI	CAC	FTSE
count	1860.000000	1860.000000	1860.000000	1860.000000	1860.000000
mean	930.500000	2530.656882	3376.223710	2227.828495	3565.643172
std	537.080069	1084.792740	1663.026465	580.314198	976.715540
min	1.000000	1402.340000	1587.400000	1611.000000	2281.000000
25%	465.750000	1744.102500	2165.625000	1875.150000	2843.150000
50%	930.500000	2140.565000	2796.350000	1992.300000	3246.600000
75%	1395.250000	2722.367500	3812.425000	2274.350000	3993.575000
max	1860.000000	6186.090000	8412.000000	4388.500000	6179.000000

I'll use the DAX column for the test. The function applied to it is sin(DAX) * 1.1; it has no special meaning and simply gives the computer some work to do.

We'll test the following approaches:

Naive for loop
Iterrows method loop
The apply method
Vector method

Each approach is timed in turn, and all timings use the same tool: timeit (the %%timeit cell magic in Jupyter/IPython).
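Outside a notebook, the same kind of measurement can be taken with the standard-library timeit module. The following is a minimal sketch, not the setup used for the numbers in this post: it builds a synthetic DataFrame (the column name DAX is reused, but the values are made up), and the helper name time_call is hypothetical.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data set: 1,860 made-up prices.
rng = np.random.default_rng(0)
data = pd.DataFrame({"DAX": rng.uniform(1400.0, 6200.0, size=1860)})


def time_call(func, repeat=3, number=3):
    """Return the best average time per call, in seconds (hypothetical helper)."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number


def vectorized():
    # One call over the whole column.
    return np.sin(data["DAX"]) * 1.1


def naive_loop():
    # One positional row lookup per iteration.
    return [np.sin(data.iloc[i]["DAX"]) * 1.1 for i in range(len(data))]


t_vec = time_call(vectorized)
t_loop = time_call(naive_loop)
print(f"vectorized: {t_vec * 1e3:.3f} ms, naive loop: {t_loop * 1e3:.3f} ms")
```

Taking the minimum over several repeats reduces the influence of other processes on the machine; the absolute numbers will differ from the %%timeit output shown in this post.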

Let's look at the first one.

Naive for loop

%%timeit
 
result = [np.sin(data.iloc[i]['DAX']) * 1.1 for i in range(len(data))]
 
data['target'] = result

422 ms ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With a plain for loop, 1,860 iterations take more than 400 ms. The absolute number will vary from computer to computer, but it serves as the baseline for the comparisons that follow.

This is very slow, and very much what people expect of Python. Writing code this way is not recommended, so let's improve it.

Iterrows method loop

Pandas provides the iterrows method, which behaves like enumerate: on each iteration it yields the index and the row (as a Series) at the current position.

iterrows is considerably faster than the naive for loop and just as easy to use. Pandas also provides the itertuples method, which trades a little convenience for much higher performance. As the name suggests, itertuples yields tuples (namedtuples, to be precise), so values cannot be looked up with row['DAX']; they are read by position, as in the code below, or by attribute when the column name is a valid Python identifier.

%%timeit

result = [np.sin(row['DAX']) * 1.1 for index, row in data.iterrows()]

data['target'] = result

103 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
 
# Position 0 is the index, 1 is 'Unnamed: 0', 2 is DAX.
result = [np.sin(row[2]) * 1.1 for row in data.itertuples()]
 
data['target'] = result

9.67 ms ± 649 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
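itertuples yields namedtuples, so a column whose name is a valid Python identifier can also be read as an attribute instead of by position. A small sketch on made-up data (the values are illustrative, not the full data set):

```python
import numpy as np
import pandas as pd

# Made-up miniature frame with the same kind of columns as the real data.
data = pd.DataFrame({"DAX": [1628.75, 1613.63, 1606.51],
                     "SMI": [1678.1, 1688.5, 1678.6]})

# Default: each row is a namedtuple whose first field is the index,
# so DAX sits at position 1 in this two-column frame.
by_position = [np.sin(row[1]) * 1.1 for row in data.itertuples()]

# The same value, read by attribute name.
by_name = [np.sin(row.DAX) * 1.1 for row in data.itertuples()]

# index=False drops the index field, so DAX moves to position 0.
no_index = [np.sin(row[0]) * 1.1 for row in data.itertuples(index=False)]
```

Attribute access keeps the code readable without giving up the speed of tuples. Column names that are not valid identifiers, such as 'Unnamed: 0', are replaced by positional names.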

Besides iterrows and itertuples, there is another way to go through the rows.

The apply method

This method walks through the DataFrame by calling a callback function on each row or column; the axis parameter selects which one (axis=1 passes each row, axis=0 each column).

The apply method performs better than iterrows but worse than itertuples, and it is as convenient as iterrows.

%%timeit
 
data['target'] = data.apply(lambda row: np.sin(row['DAX']) * 1.1, axis=1)

60.7 ms ± 3.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
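When the callback only needs one column, apply can also be called on that column instead of on the whole DataFrame, so the function receives plain values rather than a row object. This is a variation that the timings in this post do not cover; the sketch below uses made-up data and only checks that both forms give the same result.

```python
import numpy as np
import pandas as pd

# Made-up miniature frame; the values are illustrative only.
data = pd.DataFrame({"DAX": [1628.75, 1613.63, 1606.51],
                     "SMI": [1678.1, 1688.5, 1678.6]})

# Row-wise apply: the lambda receives each row as a Series.
row_wise = data.apply(lambda row: np.sin(row["DAX"]) * 1.1, axis=1)

# Column apply: the lambda receives one scalar value at a time.
col_wise = data["DAX"].apply(lambda x: np.sin(x) * 1.1)
```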

With itertuples the time is down to about 9 ms, roughly 10 times faster than iterrows and more than 40 times faster than the naive loop. Can it be improved further?

Vector method

If you do scientific computing, you will be familiar with vectors. The data in a DataFrame is essentially all vectors: each column is one, and so are the boolean masks used for filtering.

So is it faster to compute on a whole DataFrame column at once?

%%timeit

data['target'] = np.sin(data['DAX']) * 1.1

427 µs ± 36.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As you can see, this is both simpler and faster: it takes only 400+ µs, about 1,000 times faster than the original 400+ ms, and already well under a millisecond.
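Before trusting a speed-up, it is worth checking that the fast version computes the same values as the slow one. A minimal sketch on synthetic data (the values are random; only the column name matches the data set):

```python
import numpy as np
import pandas as pd

# Synthetic data: 200 made-up prices in a plausible range.
rng = np.random.default_rng(42)
data = pd.DataFrame({"DAX": rng.uniform(1400.0, 6200.0, size=200)})

# Slow reference: one Python-level iteration per row.
looped = [np.sin(row.DAX) * 1.1 for row in data.itertuples()]

# Fast version: one call over the whole column.
vectorized = np.sin(data["DAX"]) * 1.1

# Both approaches should agree up to floating-point tolerance.
same = np.allclose(looped, vectorized.to_numpy())
print(same)
```

np.allclose is used rather than exact equality because different code paths may round the last bits differently.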

Thinking

If you think it through, you can see why each step improved performance.

The naive for loop carries a lot of information along on every iteration, most of it unnecessary, and it runs as a slow Python-level loop. Both can be improved. First, the loop itself: handing it over to apply moves the iteration inside pandas, which nearly halved the time compared with iterrows (from about 103 ms to about 61 ms). Second, the amount of information carried: yielding plain immutable tuples instead of a Series per row, as itertuples does, improves performance considerably.

However, a tuple is not a DataFrame structure, and converting DataFrame data to tuples row by row costs time, so avoiding the conversion altogether should improve performance further.

Using the column vectors directly does exactly that, and improves performance by roughly another 20 times.

A DataFrame column (a Series) still carries some extra information, such as the index. So dropping down to the lowest-level unit of computation can improve performance again.

Pandas stores its data in NumPy arrays, so we can extract the underlying NumPy array from the column and compute on that.

%%timeit
 
# .values returns the underlying NumPy array, without the index.
data['target'] = np.sin(data['DAX'].values) * 1.1

25 µs ± 6.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That is nearly twice as fast again, and about 2,000 times faster than where we started.
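Newer pandas versions also offer Series.to_numpy(), which the pandas documentation recommends over .values for getting the underlying array. The sketch below uses made-up data and shows that both give the same NumPy array for a float column; it makes no claim about their relative speed.

```python
import numpy as np
import pandas as pd

# Made-up miniature frame; the values are illustrative only.
data = pd.DataFrame({"DAX": [1628.75, 1613.63, 1606.51]})

via_values = data["DAX"].values        # attribute used in the timing above
via_to_numpy = data["DAX"].to_numpy()  # method form

# Assigning a bare array back works as long as its length matches the frame.
data["target"] = np.sin(via_to_numpy) * 1.1
```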

Conclusion

Many people start worrying about performance at the very beginning of designing a system, which I think is completely unnecessary.

The hard part of system design is finding the best, or at least a better, way to implement the functionality; at that stage performance is usually not the bottleneck. Python suits trial and error, iterates quickly, and offers many ways to optimize performance. If those are not enough, you can use Cython or write parts of your code in C.

Of course, performance has always been Python's Achilles heel, and it has been labeled slow from the start. Even so, don't worry about it too much. There is nothing wrong with first getting something done in a few dozen lines of Python and only later rewriting it as thousands of lines of C.
