Python ships with parallel-processing tools in its standard library. This article shows how to dramatically speed up data preprocessing with just three lines of code.

By default, a Python program runs as a single process on a single CPU core. Most machines today have at least a dual-core processor, and many have far more cores, so without optimization half or more of your computing power sits idle during data preprocessing.
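
If you're curious how many cores your own machine exposes to Python, a one-line check (just a sketch; nothing here is specific to the scripts below) is:

import os

### Number of CPU cores visible to Python
print(os.cpu_count())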

Fortunately, the standard library has built-in features that let us take full advantage of all CPU cores. Using Python’s concurrent.futures module, it takes only three lines of code to convert an ordinary program into one that spreads its work across a multi-core processor.

The standard method

Let’s take a simple example: an image dataset sitting in a single folder; in this case we’ll use 1,000 images. We want to resize every image to 600×600 pixels before passing it to a deep neural network. Here’s the kind of standard Python code you’ll often see on GitHub:

import glob
import cv2


### Loop through all jpg files in the current folder
### Resize each one to size 600x600
for image_filename in glob.glob("*.jpg"):
    ### Read in the image data
    img = cv2.imread(image_filename)

    ### Resize the image
    img = cv2.resize(img, (600, 600))

The above program follows a simple pattern you’ll often see in data-processing scripts:

  • 1. Start with a list of files (or other data) that you want to process.

  • 2. Loop over the list with a for loop, running the preprocessing on one item per iteration.

Let’s test the program on a folder of 1,000 JPEGs and see how long it takes to run:

time python standard_res_conversion.py

On my 6-core Core i7-8700K CPU, the running time was 7.9864 seconds. For a high-end CPU, that is disappointingly slow. Let’s see what we can do.
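
If you’d rather time the loop from inside Python than with the shell’s time command, a minimal sketch (the printed value will of course vary from machine to machine) looks like this:

import time

start = time.perf_counter()

### ... run the resize loop from above here ...

elapsed = time.perf_counter() - start
print(f"Elapsed: {elapsed:.4f} seconds")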

A faster way

To understand parallelization, imagine we need to repeat the same task many times, such as driving 1,000 nails into a board. If driving one nail takes a second, one person needs 1,000 seconds to finish; a team of four finishes in 250 seconds.

In our example of 1,000 images, Python can do something similar (a hand-rolled sketch of this split follows the list):

  • Split the JPEG file list into 4 groups;

  • Run four separate instances of the Python interpreter;

  • Let each instance of Python process one of the four data groups;

  • The final result list is obtained by combining the results of the four processes.
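
To make the idea concrete, here is a rough sketch of how you could do that split by hand with the standard multiprocessing module (the chunk count of 4 and the helper name resize_chunk are purely illustrative):

import glob
import multiprocessing
import cv2


def resize_chunk(filenames):
    ### Each worker process resizes its own share of the files
    for image_filename in filenames:
        img = cv2.imread(image_filename)
        img = cv2.resize(img, (600, 600))


if __name__ == "__main__":
    image_files = glob.glob("*.jpg")

    ### Split the file list into 4 roughly equal groups
    chunks = [image_files[i::4] for i in range(4)]

    ### Start one worker per group and wait for all of them to finish
    with multiprocessing.Pool(processes=4) as pool:
        pool.map(resize_chunk, chunks)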

The nice part is that concurrent.futures does all of that hard work for us. We just tell it which function to run (and, optionally, how many worker processes to use), and it takes care of splitting the work, starting the processes, and gathering the results. Only three lines of the original script need to change. Example:

import glob
import cv2
import concurrent.futures


def load_and_resize(image_filename):
    ### Read in the image data
    img = cv2.imread(image_filename)

    ### Resize the image
    img = cv2.resize(img, (600, 600))


### Create a pool of processes. By default, one is created for each CPU in your machine.
with concurrent.futures.ProcessPoolExecutor() as executor:
    ### Get a list of files to process
    image_files = glob.glob("*.jpg")

    ### Process the list of files, splitting the work across the process pool to use all CPUs
    executor.map(load_and_resize, image_files)

Let’s look at one line from the code above:

with concurrent.futures.ProcessPoolExecutor() as executor:

This line creates the pool of worker processes; by default, one is started per CPU core, so the more cores you have, the more Python processes run in parallel. My machine has six cores. The actual processing happens in this line:

executor.map(load_and_resize, image_files)

executor.map() takes the function you want to run and a list whose elements are the individual inputs to that function. Since we have six cores, six items from the list are processed at the same time!
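
One detail worth knowing: executor.map() also hands back whatever the function returns, in the same order as the input list. If you wanted to keep the resized images rather than discard them (assuming they all fit in memory, which they may not for a large dataset), a sketch would be:

import glob
import concurrent.futures
import cv2


def load_and_resize(image_filename):
    img = cv2.imread(image_filename)
    return cv2.resize(img, (600, 600))


with concurrent.futures.ProcessPoolExecutor() as executor:
    image_files = glob.glob("*.jpg")
    ### The results come back in the same order as image_files
    resized_images = list(executor.map(load_and_resize, image_files))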

If we time our new program the same way:

time python fast_res_conversion.py

The running time drops to 1.14265 seconds, almost a seven-fold speedup!
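
One platform caveat: on systems that start worker processes with the "spawn" method (Windows, and macOS on recent Python versions), the code that creates the ProcessPoolExecutor should sit under an if __name__ == "__main__": guard so the workers don’t re-execute it when they import the module. A sketch of the same script with that guard:

import glob
import concurrent.futures
import cv2


def load_and_resize(image_filename):
    img = cv2.imread(image_filename)
    img = cv2.resize(img, (600, 600))


if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as executor:
        image_files = glob.glob("*.jpg")
        executor.map(load_and_resize, image_files)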

Note: there is some overhead in spawning Python processes and shuttling data between them, so the speedup will not always be this large. Overall, though, the gain is substantial.
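
When each individual task is tiny, that overhead can dominate. One knob worth knowing about is the chunksize argument of executor.map(), which ships the inputs to the workers in batches instead of one at a time; the value 16 below is only an illustration, not a tuned number:

### Send file names to the workers in batches of 16 to reduce inter-process overhead
executor.map(load_and_resize, image_files, chunksize=16)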

Is it always that fast?

If you have a list of data and need to perform a similar operation on each item, a Python process pool is a good choice. But it isn’t always the best solution: the items handed to the pool are not processed in any predictable order, so if processing one item depends on the result of another, this method may not be for you.
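
To be precise, executor.map() does return its results in the order of the input list; it is the order in which the tasks execute that is unpredictable. If you’d rather handle each result as soon as it finishes, whatever the order, the usual pattern is submit() plus as_completed() (a sketch, assuming load_and_resize returns something):

with concurrent.futures.ProcessPoolExecutor() as executor:
    futures = [executor.submit(load_and_resize, f) for f in image_files]
    for future in concurrent.futures.as_completed(futures):
        result = future.result()  ### results arrive in completion order, not input order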

The data you pass to the workers must also be of a type that Python knows how to “pickle” (serialize). Fortunately, the picklable types cover most common data. The following list is from the official Python documentation:

  • None, True, and False

  • Integers, floating-point numbers, complex numbers

  • Strings, bytes, bytearrays

  • Tuples, lists, sets, and dictionaries containing only picklable objects

  • Functions defined at the top level of a module (using def, not lambda)

  • Built-in functions defined at the top level of a module

  • Classes defined at the top level of a module

  • Instances of such classes whose __dict__ or the result of calling __getstate__() is picklable (see the “Pickling Class Instances” section of the pickle documentation)
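
A quick way to check whether an object can be handed to a worker process is to try pickling it yourself; the lambda below is just an illustration of something that fails:

import pickle


def top_level_function(x):
    return x * 2


### A function defined at the top level of a module pickles fine
pickle.dumps(top_level_function)

### A lambda does not: pickling it raises an error
try:
    pickle.dumps(lambda x: x * 2)
except Exception as exc:
    print(f"Cannot pickle lambda: {exc}")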