The Jupyter Notebook is a powerful tool for data analysis in Python. With small datasets, it makes reading and writing plain-text data such as TXT or CSV effortless.

However, when a dataset is large in size or dimensionality, saving it and loading it back into memory becomes slow, and every time you restart the Jupyter Notebook you have to wait for the data to reload. At that point CSV, and plain-text formats in general, lose their appeal.

In this article, we compare the I/O speed, memory consumption, and disk space of several formats supported by pandas in order to find a suitable format for our data.

Format specification

First, a brief description of the data formats compared in this article (the corresponding pandas calls are sketched after the list):

  • CSV: The most common data format
  • Pickle: Used to serialize and deserialize Python object structures
  • MessagePack: Similar to JSON, but smaller and faster
  • HDF5: A common cross-platform data storage file
  • Feather: A fast, lightweight storage framework
  • Parquet: Column storage format for Apache Hadoop
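Each of these formats maps to a pandas writer/reader pair. A minimal sketch (note that MessagePack support, to_msgpack/read_msgpack, was removed in pandas 1.0, so it only works on older versions):

```python
import pandas as pd

df = pd.DataFrame({'n0': [0.1, 0.2], 'c0': ['a', 'b']})

# CSV
df.to_csv('data.csv', index=False)
df_csv = pd.read_csv('data.csv')

# Pickle
df.to_pickle('data.pkl')
df_pkl = pd.read_pickle('data.pkl')

# HDF5 (requires the PyTables package)
df.to_hdf('data.h5', key='df')
df_hdf = pd.read_hdf('data.h5', 'df')

# Feather and Parquet (require pyarrow or an equivalent engine)
df.to_feather('data.feather')
df_feather = pd.read_feather('data.feather')
df.to_parquet('data.parquet')
df_parquet = pd.read_parquet('data.parquet')

# MessagePack: df.to_msgpack()/pd.read_msgpack() existed in older
# pandas versions but were removed in pandas 1.0
```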

Metrics

To find a suitable storage format, this article compares the formats on the following metrics.

  • size_mb: size of the file with the serialized data frame, in MB
  • save_time: time required to save the data frame to disk
  • load_time: time required to load the previously dumped data frame back into memory
  • save_ram_delta_mb: maximum memory consumption increase while saving the data frame
  • load_ram_delta_mb: maximum memory consumption increase while loading the data frame

Note that the last two metrics become important when we use an efficiently compressed binary data format, such as Parquet. They help us estimate the amount of RAM required to load serialized data, as well as the data size itself. We’ll discuss this in more detail in the next section.
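One way to capture these timing and RAM metrics is to compare the process's resident set size before and after each operation. This is a sketch using psutil, not necessarily the exact harness behind the numbers below:

```python
import os
import time

import psutil


def measure(func, *args, **kwargs):
    """Run func, returning (elapsed seconds, net RSS increase in MB).

    Note: this records the net increase, not the true peak; a tool such as
    memory_profiler's memory_usage(..., max_usage=True) can track the peak.
    """
    process = psutil.Process(os.getpid())
    ram_before = process.memory_info().rss
    start = time.time()
    result = func(*args, **kwargs)  # keep the result alive while we measure
    elapsed = time.time() - start
    ram_delta_mb = (process.memory_info().rss - ram_before) / 1024 ** 2
    del result
    return elapsed, ram_delta_mb
```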

Comparison

Now, to compare the six data formats described above, we use our own generated datasets for better control over the serialized data structure and properties.

Below is the code for generating the test data: we randomly generate datasets with numeric and categorical features. Numeric features are drawn from the standard normal distribution. Categorical features are generated as random uuid4 strings drawn from a category set of cardinality C, where 2 <= C <= max_cat_size.

```python
import numpy as np
import pandas as pd
from uuid import uuid4


def generate_dataset(n_rows, num_count, cat_count, max_nan=0.1, max_cat_size=100):
    dataset, types = {}, {}

    def generate_categories():
        # a random vocabulary of 2..max_cat_size uuid4 strings
        category_size = np.random.randint(2, max_cat_size)
        return [str(uuid4()) for _ in range(category_size)]

    # numeric features drawn from the standard normal distribution,
    # with a random fraction (up to max_nan) replaced by NaN
    for col in range(num_count):
        name = f'n{col}'
        values = np.random.normal(0, 1, n_rows)
        nan_cnt = np.random.randint(1, int(max_nan * n_rows))
        index = np.random.choice(n_rows, nan_cnt, replace=False)
        values[index] = np.nan
        dataset[name] = values
        types[name] = 'float32'

    # categorical features sampled from a random uuid4 vocabulary
    for col in range(cat_count):
        name = f'c{col}'
        cats = generate_categories()
        values = np.array(np.random.choice(cats, n_rows, replace=True), dtype=object)
        nan_cnt = np.random.randint(1, int(max_nan * n_rows))
        index = np.random.choice(n_rows, nan_cnt, replace=False)
        values[index] = np.nan
        dataset[name] = values
        types[name] = 'object'

    return pd.DataFrame(dataset), types
```
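For example, one test frame with a million rows could be generated like this (the column counts here are illustrative; the benchmark's exact counts are not specified):

```python
# illustrative call: 1,000,000 rows, 15 numeric and 15 categorical columns
df, types = generate_dataset(n_rows=1_000_000, num_count=15, cat_count=15)
print(df.shape)  # (1000000, 30)
```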

Now let’s benchmark saving and loading performance. For CSV, five randomly generated datasets with one million observations each are dumped to disk and read back into memory, and the metrics are averaged. Each binary format is tested against 20 randomly generated datasets with the same number of rows.
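A simplified sketch of such a benchmark loop, reusing the measure helper from above (the FORMATS table and helper names are illustrative assumptions, not the article's actual harness):

```python
import os

import pandas as pd

# writer/reader pair per format
FORMATS = {
    'csv':     (lambda df, p: df.to_csv(p, index=False), pd.read_csv),
    'pickle':  (pd.DataFrame.to_pickle, pd.read_pickle),
    'feather': (pd.DataFrame.to_feather, pd.read_feather),
    'parquet': (pd.DataFrame.to_parquet, pd.read_parquet),
    'hdf':     (lambda df, p: df.to_hdf(p, key='df'),
                lambda p: pd.read_hdf(p, 'df')),
}


def benchmark(df, fmt, path):
    save, load = FORMATS[fmt]
    save_time, save_ram_delta_mb = measure(save, df, path)
    size_mb = os.path.getsize(path) / 1024 ** 2
    load_time, load_ram_delta_mb = measure(load, path)
    return dict(fmt=fmt, size_mb=size_mb,
                save_time=save_time, load_time=load_time,
                save_ram_delta_mb=save_ram_delta_mb,
                load_ram_delta_mb=load_ram_delta_mb)
```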

We compare two approaches:

  • 1. Keep the generated categorical variables as strings
  • 2. Convert them to the pandas.Categorical data type before performing any I/O

1. Strings as categorical features

The figure below shows the average I/O time for each data format. An interesting finding here is that HDF loads even more slowly than CSV, while the other binary formats perform noticeably better, with Feather and Parquet doing exceptionally well.

![Average I/O time per data format](https://p1-tt.byteimg.com/origin/pgc-image/f9f1d62b314346cda01342117c562543?from=pc)

What about memory consumption while saving data and reading it back from disk? The next figure shows that HDF again performs poorly. Unsurprisingly, CSV needs little extra memory to save/load plain-text strings, while Feather and Parquet stay very close to each other.

![Memory consumption while saving and loading, per format](https://p1-tt.byteimg.com/origin/pgc-image/658831266b8a45849538e5ebea2a3cca?from=pc)

Finally, let’s look at the file size comparison. This time Parquet shows very good results, which is understandable given that the format was developed for the efficient storage of large amounts of data.

![File size per data format](https://p3-tt.byteimg.com/origin/pgc-image/332aacc1678f4d0289505d4dfa76ff0f?from=pc)

2. Converted categorical features

In the previous section we made no attempt to store the categorical features efficiently, using plain strings instead. Next, we repeat the comparison using the dedicated pandas.Categorical type.
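The conversion itself is a one-liner per column; a sketch using the types dict returned by generate_dataset:

```python
# cast every object (string) column to the pandas Categorical dtype
for col, dtype in types.items():
    if dtype == 'object':
        df[col] = df[col].astype('category')
```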

![I/O time per format with pandas.Categorical features](https://p1-tt.byteimg.com/origin/pgc-image/14657c70375d49af84fc5027a199019e?from=pc)

As the figure above shows, all the binary formats now clearly outperform plain-text CSV, far exceeding its efficiency, so we remove CSV from the chart to see the differences between the binary formats more clearly.

![I/O time of the binary formats, CSV excluded](https://p1-tt.byteimg.com/origin/pgc-image/9f6d5957e40f47798b48347513146eb8?from=pc)

As you can see, Feather and Pickle have the fastest I/O speeds. Next, let’s compare memory consumption during data loading; the bar chart below illustrates the point about Parquet we raised earlier.

![Memory consumption during loading, per format](https://p6-tt.byteimg.com/origin/pgc-image/4cf1944e43f246a194130a728eeec23a?from=pc)

Why is Parquet so memory-hungry? Precisely because it takes up so little space on disk, extra resources are needed to decompress the data back into a data frame. A file that occupies only a modest amount of space on persistent storage may therefore still be impossible to load into memory.
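You can see this effect directly by comparing a Parquet file's size on disk with the in-memory footprint of the data frame it deserializes to (a sketch; 'data.parquet' is the hypothetical file from the earlier snippets):

```python
import os

import pandas as pd

df = pd.read_parquet('data.parquet')
disk_mb = os.path.getsize('data.parquet') / 1024 ** 2
ram_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
print(f'on disk: {disk_mb:.1f} MB, in RAM: {ram_mb:.1f} MB')
```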

Finally, let’s look at the file sizes for the different formats. All formats show good results, except that HDF still requires far more space than the others.

![File size per format with pandas.Categorical features](https://p1-tt.byteimg.com/origin/pgc-image/fdc6c4034f0546bc8a2f119cee74f2fa?from=pc)

Conclusion

As our test results show, the Feather format looks like an ideal choice for passing data between Jupyter sessions: it offers high I/O speed, takes up little space on disk, and requires no decompression when loaded back into RAM.

Of course, this comparison does not mean you should use Feather in every case. For example, it is not intended for long-term file storage. Nor does the comparison cover every possible scenario in which other formats might work best, so choose according to your specific situation!