This is the third day of my participation in the First Challenge 2022. For more details, see: First Challenge 2022.

Preface

Pandas is inefficient at handling large datasets, so this post introduces Polars as a faster alternative.

Let's get started.

The development tools

Python version: 3.6.4

Related modules:

pandas module;

polars module;

And some modules that come with Python.

Environment set up

Install Python, add it to the environment variables, and use pip to install the required modules.

Pandas can create data, read files in a variety of formats (text, CSV, JSON), slice and dice data, and combine multiple data sources.
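As a quick illustration of those basics, here is a minimal sketch using made-up in-memory data (the file-reading functions `pd.read_csv` / `pd.read_json` work analogously on real files):

```python
import pandas as pd

# Create a DataFrame from a dict
df = pd.DataFrame({"name": ["ann", "bob", "cat"], "n": [3, 7, 1]})

# Slice: boolean filtering and column selection
busy = df[df["n"] > 5]
names = df["name"]

# Combine two sources vertically
extra = pd.DataFrame({"name": ["eve"], "n": [9]})
combined = pd.concat([df, extra], ignore_index=True)
print(combined)
```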

Pandas does have some drawbacks, though, such as its lack of multiprocessing support and its slow handling of large datasets.

Let me introduce you to a new Python library called Polars.

Polars uses syntax similar to Pandas, but handles data much faster than Pandas.

Polars is a library written in Rust, and its memory model is based on Apache Arrow.

Polars has two APIs: the Eager API and the Lazy API.

The Eager API is used the same way as Pandas.

The Lazy API, like Spark, first converts a query into a logical plan, then reorganizes and optimizes the plan to reduce execution time and memory usage.

Install Polars with pip, using the Baidu PyPI mirror.

# installation polars
pip install polars -i https://mirror.baidu.com/pypi/simple/

After the installation succeeds, let's test how Pandas and Polars handle the data.

The test analyzes the user names of registered users on a website; the CSV file contains about 26 million user names.

import pandas as pd

df = pd.read_csv('users.csv')
print(df)

The data is as follows:

A self-created CSV file was also used for the data integration test:

import pandas as pd

df = pd.read_csv('fake_user.csv')
print(df)

The result:

Now compare how long the two libraries take to sort the data:

import timeit
import pandas as pd

start = timeit.default_timer()

df = pd.read_csv('users.csv')
df.sort_values('n', ascending=False)
stop = timeit.default_timer()

print('Time: ', stop - start)

-------------------------
Time:  27.555776743218303

As you can see, reading and sorting the data with Pandas took about 28 seconds.

import timeit
import polars as pl

start = timeit.default_timer()

df = pl.read_csv('users.csv')
df.sort(by_column='n', reverse=True)  # newer Polars versions use df.sort('n', descending=True)
stop = timeit.default_timer()

print('Time: ', stop - start)

-----------------------
Time:  9.924110282212496

Polars took only about 10 seconds, which makes it roughly 2.8 times faster than Pandas here.

Next, let's try data integration: a vertical concatenation.

import timeit
import pandas as pd

start = timeit.default_timer()

df_users = pd.read_csv('users.csv')
df_fake = pd.read_csv('fake_user.csv')
df_users.append(df_fake, ignore_index=True)  # removed in pandas 2.0; use pd.concat([df_users, df_fake], ignore_index=True)
stop = timeit.default_timer()

print('Time: ', stop - start)

------------------------
Time:  15.556222308427095

Pandas takes about 15.5 seconds.

import timeit
import polars as pl

start = timeit.default_timer()

df_users = pl.read_csv('users.csv')
df_fake = pl.read_csv('fake_user.csv')
df_users.vstack(df_fake)
stop = timeit.default_timer()

print('Time: ', stop - start)

-----------------------
Time:  3.475433263927698

Polars took only about 3.5 seconds, roughly 4.5 times faster than Pandas.

The data used this time can be obtained from the home page

Conclusion

Pandas is more than a decade old and has developed a mature ecosystem that many other data analysis libraries build on.

Polars is a relatively new library and still has rough edges.

If your dataset is too large for Pandas but too small for Spark, Polars is an option worth considering.