This article is from Heart of the Machine: https://towardsdatascience.com/be-a-more-efficient-data-scientist-today-master-pandas-with-this-guide-ea362d27386. If there is any infringement, it can be deleted.

Python is open source, which is great, but there are some inherent problems with open source: many packages are doing (or trying to do) the same thing. If you’re new to Python, it’s hard to know which package is the best for a particular task, and you need someone with experience to tell you. One package that is absolutely required for data science is pandas.

The most interesting thing about Pandas is how many packages are hidden inside it. It is a core package that incorporates features from many other packages. This is great because you only need Pandas to get the job done.

Pandas is the Python equivalent of Excel: it works with tables (called DataFrames) and can perform various transformations on data, but it does a lot more.

If you already know how to use Python, skip to paragraph 3.

Let’s get started:

import pandas as pd

Don’t ask why it’s “pd” and not “p”; that’s just the convention. Just use it 🙂

Pandas’ most basic functions

Read the data

data = pd.read_csv('my_file.csv')
data = pd.read_csv('my_file.csv', sep=';', encoding='latin-1', nrows=1000, skiprows=[2,5])

sep is the separator (delimiter). If you’re working with French data, the CSV separator in Excel is “;”, so you need to specify it explicitly. The encoding is set to latin-1 so that French characters are read correctly. nrows=1000 reads only the first 1000 rows. skiprows=[2,5] skips lines 2 and 5 of the file while reading (line numbers are counted from 0).
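Here is a minimal sketch of how these options interact. It reads a small in-memory file rather than my_file.csv; the sample contents and the name raw are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents: semicolon-separated and Latin-1 encoded.
raw = "name;city\nAndré;Paris\nZoé;Lyon\nLéa;Nice\nJoël;Lille\n".encode("latin-1")

# skiprows=[2] drops file line 2 (counted from 0, the header is line 0): the "Zoé" row.
# nrows=2 then keeps only the first two remaining data rows.
data = pd.read_csv(io.BytesIO(raw), sep=";", encoding="latin-1", nrows=2, skiprows=[2])

print(data)
```

Because skiprows counts the header as line 0, skipping line 2 removes the second data row, not the first.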

  • Most commonly used functions: read_csv, read_excel

  • Some other nice features: read_clipboard, read_sql

Write the data

data.to_csv('my_new_file.csv', index=False)

index=False writes the data as it is, without the row index. If you leave it out, you get an extra first column containing the index: 0, 1, 2, … all the way to the last row.
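To see the effect of the index option, here is a small sketch that compares the CSV text with and without the index. It uses index=False, the usual spelling of this option, and a made-up two-row table.

```python
import pandas as pd

# Hypothetical two-row table for illustration.
data = pd.DataFrame({"column_1": ["french", "english"]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = data.to_csv().splitlines()
without_index = data.to_csv(index=False).splitlines()

print(with_index)     # the first column holds the row index
print(without_index)  # data only
```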

I usually don’t use other functions like .to_excel, .to_json, .to_pickle, etc., because .to_csv works just fine, and CSV is the most common way to save tables.

Check the data

data.shape

Gives the number of rows and the number of columns, as (#rows, #columns)

data.describe()

Computes basic statistics
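As a quick illustration of these two checks, here is a sketch on a made-up three-row table (the column names are hypothetical).

```python
import pandas as pd

# Hypothetical data: one text column and one numeric column.
data = pd.DataFrame({"name": ["Ann", "Bob", "Cleo"], "age": [20, 30, 40]})

print(data.shape)        # (number of rows, number of columns)

# By default, describe() summarizes only the numeric columns:
# count, mean, std, min, quartiles and max.
stats = data.describe()
print(stats)
```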

View the data

data.head(3)

Prints the first 3 rows of the data. Similarly, .tail() returns the last rows of the data.

data.loc[8]

Prints the row with index label 8

data.loc[8, 'column_1']

Prints the value in row 8 of the column named column_1

data.loc[range(4,6)]

A subset of the data from row 4 up to, but not including, row 6 (the range is closed on the left and open on the right)
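The sketch below puts the three .loc forms side by side on a made-up ten-row table. It also shows a label slice, which is not in the examples above, because it behaves differently from range: a .loc slice includes its end label.

```python
import pandas as pd

# Hypothetical data with the default index labels 0..9.
data = pd.DataFrame({"column_1": list("abcdefghij")})

single_row = data.loc[8]                 # the row labelled 8, as a Series
single_value = data.loc[8, "column_1"]   # one cell
by_range = data.loc[range(4, 6)]         # labels 4 and 5 (range excludes 6)
by_slice = data.loc[4:6]                 # labels 4, 5 and 6 (end label included)

print(single_value)
print(by_range)
print(by_slice)
```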

Basic functions in Pandas

Logical operations

data[data['column_1']=='french']
data[(data['column_1']=='french') & (data['year_born']==1990)]
data[(data['column_1']=='french') & (data['year_born']==1990) & ~(data['city']=='London')]

Take a subset of the data using logical operations. To use & (AND), ~ (NOT) and | (OR), you must put “(” and “)” before and after each logical condition.
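Here is a small runnable sketch of the combined filter, on made-up rows. The parentheses matter because & and ~ bind more tightly than ==.

```python
import pandas as pd

# Hypothetical rows for illustration.
data = pd.DataFrame({
    "column_1": ["french", "french", "english", "french"],
    "year_born": [1990, 1990, 1990, 1985],
    "city": ["Paris", "London", "London", "Lyon"],
})

# Each comparison is wrapped in parentheses before being combined.
mask = (
    (data["column_1"] == "french")
    & (data["year_born"] == 1990)
    & ~(data["city"] == "London")
)
subset = data[mask]

print(subset)
```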

data[data['column_1'].isin(['french', 'english'])]

Instead of chaining several ORs on the same column, you can use the .isin() function.
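A quick sketch, on made-up values, showing that .isin() selects the same rows as the chained OR version:

```python
import pandas as pd

# Hypothetical values for illustration.
data = pd.DataFrame({"column_1": ["french", "english", "german", "french"]})

# Two conditions on the same column joined with | ...
either = data[(data["column_1"] == "french") | (data["column_1"] == "english")]

# ... and the shorter .isin() form.
same = data[data["column_1"].isin(["french", "english"])]

print(same)
```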

Basic drawing

The Matplotlib package makes this possible. As we said in the introduction, it can be used directly in Pandas.

data['column_numerical'].plot()

Example of output from .plot()

data['column_numerical'].hist()

Plots the data distribution (histogram)

An example of .hist() output

%matplotlib inline

If you’re using Jupyter, don’t forget to add the line above before plotting.

Update the data

data.loc[8, 'column_1'] = 'English'

Replaces the value in row 8 of column_1 with 'English'

data.loc[data['column_1']=='french', 'column_1'] = 'French'

Changes the values in multiple rows with a single line of code
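As a minimal sketch of the conditional update, on a made-up three-row column:

```python
import pandas as pd

# Hypothetical values for illustration.
data = pd.DataFrame({"column_1": ["french", "english", "french"]})

# The boolean mask picks the rows, 'column_1' picks the column to overwrite.
data.loc[data["column_1"] == "french", "column_1"] = "French"

print(data)
```

Every row matching the condition is updated at once; the other rows are left untouched.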

OK, now you can do the things that are easy to do in Excel. Let’s dive into some of the amazing things that Excel can’t do.

Intermediate functions

Count the number of occurrences

data['column_1'].value_counts()

Example output from the .value_counts() function
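For a rough idea of what .value_counts() returns, here is a sketch on made-up values. The result is a Series indexed by the distinct values, sorted from most to least frequent.

```python
import pandas as pd

# Hypothetical values for illustration.
data = pd.DataFrame(
    {"column_1": ["french", "english", "french", "french", "english", "german"]}
)

counts = data["column_1"].value_counts()

print(counts)
```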

Operate on all rows, columns, or all data

data['column_1'].map(len)

The len() function is applied to each element in the column “column_1”

The.map() operation applies a function to each element in a column

data['column_1'].map(len).map(lambda x: x/100).plot()

A nice feature of Pandas is method chaining (tomaugspurger.github.io/method-chai…). It lets you perform several operations (here .map() and .plot()) in one line.
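Here is a sketch of the same chain without the final .plot(), so the intermediate result can be inspected. The sample values are made up; wrapping the chain in parentheses is just one way to split it over several lines.

```python
import pandas as pd

# Hypothetical values for illustration.
data = pd.DataFrame({"column_1": ["french", "english"]})

scaled_lengths = (
    data["column_1"]
    .map(len)                  # length of each string
    .map(lambda x: x / 100)    # rescale each length
)

print(scaled_lengths.tolist())
```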

data.apply(sum)

.apply() applies a function to each column.

.applymap() applies a function to every cell in the table (DataFrame).
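The sketch below contrasts the three levels on a made-up 2×2 table. One caveat, stated as an assumption about your installed version: pandas 2.1 and later offer DataFrame.map as the newer name for .applymap(), so the snippet picks whichever is available.

```python
import pandas as pd

# Hypothetical 2x2 table for illustration.
data = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

column_sums = data.apply(sum)                 # one result per column
doubled_a = data["a"].map(lambda x: x * 2)    # each element of one column

# Assumption: use DataFrame.map where it exists (pandas 2.1+),
# otherwise fall back to applymap on older versions.
elementwise = data.map if hasattr(pd.DataFrame, "map") else data.applymap
squared = elementwise(lambda x: x ** 2)       # every cell of the table

print(column_sums)
print(doubled_a)
print(squared)
```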

tqdm, the one and only

When working with large datasets, Pandas can take some time to run .map(), .apply(), .applymap(), and so on. tqdm is a package that helps predict when these operations will finish (yes, I lied when I said we would only use Pandas).

from tqdm import tqdm_notebook
tqdm_notebook().pandas()

Set up tqdm with Pandas

data['column_1'].progress_map(lambda x: x.count('e'))

Replace .map() with .progress_map(); the same goes for .apply() and .applymap().

Progress bar in Jupyter using tqdm and Pandas

Correlation and scatter matrix

data.corr()
data.corr().applymap(lambda x: int(x*100)/100)

.corr() gives the correlation matrix
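Here is a small sketch of a correlation matrix on made-up columns. Note that it swaps in .round(2) for the int(x*100)/100 trick: that trick truncates to two decimals, while .round(2) rounds.

```python
import pandas as pd

# Hypothetical data: y rises with x, z falls as x rises.
data = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})

corr = data.corr()        # pairwise correlation between numeric columns
rounded = corr.round(2)   # rounded to two decimals for display

print(rounded)
```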

pd.plotting.scatter_matrix(data, figsize=(12,8))

Example of a scatter matrix. It plots every pairwise combination of columns in the same figure.

Advanced operations in Pandas

The SQL join

Joining tables in Pandas is very, very simple

data.merge(other_data, on=['column_1', 'column_2', 'column_3'])

Joining on three columns takes only one line of code
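As a rough sketch of what the join does, here are two made-up tables merged on two key columns. By default .merge() performs an inner join, so only key combinations present in both tables are kept.

```python
import pandas as pd

# Hypothetical tables for illustration.
data = pd.DataFrame({
    "column_1": ["a", "b", "c"],
    "column_2": [1, 2, 3],
    "left_value": [10, 20, 30],
})
other_data = pd.DataFrame({
    "column_1": ["a", "b", "d"],
    "column_2": [1, 2, 4],
    "right_value": [100, 200, 400],
})

# Rows ("c", 3) and ("d", 4) have no match on the other side and are dropped.
merged = data.merge(other_data, on=["column_1", "column_2"])

print(merged)
```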

Grouping

It’s not that easy at the beginning: you need to master the syntax first, and then you’ll find yourself using this feature all the time.

data.groupby('column_1')['column_2'].apply(sum).reset_index()

Group by one column and select another column to perform a function. .reset_index() reconstructs the data into a table.
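A minimal sketch of the grouping pattern on made-up rows:

```python
import pandas as pd

# Hypothetical rows for illustration.
data = pd.DataFrame({
    "column_1": ["a", "b", "a", "b"],
    "column_2": [1, 2, 3, 4],
})

# Group by column_1, sum column_2 within each group,
# then turn the group labels back into a regular column.
totals = data.groupby("column_1")["column_2"].apply(sum).reset_index()

print(totals)
```

For a plain sum, the built-in .sum() on the grouped column is a common alternative to .apply(sum).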

As explained earlier, to optimize your code, chain your functions in one line.

Iterating over rows

dictionary = {}

for i, row in data.iterrows():
    dictionary[row['column_1']] = row['column_2']

.iterrows() loops over two variables together: the row index and the row data (i and row above)
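Here is a runnable sketch of the loop on a made-up table. It also shows a shortcut that is not part of the loop above: building the same dictionary from the two columns with zip.

```python
import pandas as pd

# Hypothetical rows for illustration.
data = pd.DataFrame({"column_1": ["a", "b"], "column_2": [1, 2]})

dictionary = {}
for i, row in data.iterrows():
    # i is the index label, row is a Series holding that row's values.
    dictionary[row["column_1"]] = row["column_2"]

# Same mapping without an explicit row loop.
shortcut = dict(zip(data["column_1"], data["column_2"]))

print(dictionary)
```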

All in all, Pandas is one of the reasons Python is such a great programming language

I could have shown more interesting Pandas features, but what has been written is enough to understand why data scientists need Pandas. To summarize, Pandas has the following advantages:

  • Easy to use: all the complex, abstract computation is hidden behind the scenes;

  • Intuitive;

  • Fast, very fast if not the fastest.

It helps data scientists read and understand data quickly, which makes them more efficient.