Author: Lakshay Arora | Compiled by: Flin | Source: analyticsvidhya

Introduction

When was the last time you learned a new Python technique? As data scientists, we’re used to using familiar libraries and calling the same functions every time. It’s time to break the old rules!

Python is not limited to Pandas, NumPy, and SciKit-learn (although they are absolutely essential in data science)! There are a number of Python tricks you can use to improve your code, speed up your data science tasks, and make writing code more efficient.

More importantly, learning new things we can do in Python is really fun! I like to play with different packages and functions. Every once in a while, a new twist catches my eye and I incorporate it into my routine.

So I decided to put together my favorite Python tips in one place — this article! This list ranges from speeding up basic data science tasks, such as preprocessing, to getting R and Python code in the same Jupyter Notebook. There are plenty of learning tasks waiting for us, so let’s get started!

New to the world of Python and data science? Here is a comprehensive course to help you get started on both at the same time:

  • Applied Machine Learning — from beginner to professional
    • Courses.analyticsvidhya.com/courses/app…

1. Zip: Merge multiple lists in Python

Often we end up writing complex for loops to combine multiple lists element by element. Sound familiar? Then you’ll like the zip function. The purpose of zip is to “create an iterator that aggregates elements from each of the iterables.”

Let’s take a simple example to see how to use the zip function and combine multiple lists:

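As a quick sketch (the names and marks here are made up for the example):

```python
# two related lists we want to pair up
names = ['Ankit', 'Aishwarya', 'Shrivastava']
marks = [98, 95, 91]

# zip pairs the i-th element of each list into a tuple
result = list(zip(names, marks))
print(result)  # [('Ankit', 98), ('Aishwarya', 95), ('Shrivastava', 91)]

# a common follow-up: build a dictionary from the pairs
lookup = dict(zip(names, marks))
```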

See how easy it is to merge multiple lists?

2. Gmplot: Plot your dataset’s GPS coordinates on Google Maps

I love using Google Maps data. Think about it, it’s one of the richest data applications out there. That’s why I decided to start with this Python trick.

Scatter plots are great when we want to see the relationship between two variables. But if variables were latitude and longitude coordinates of a location, would you use them? Probably not. It is better to plot these points on a real map so that we can easily see and solve a particular problem (such as optimizing the route).

Gmplot provides an amazing interface that generates HTML and JavaScript to render all the data we want on top of Google Maps. Let’s look at an example of how to use GMplot.

Install gmplot

!pip3 install gmplot

Plot location coordinates on Google Maps

You can download the dataset of this code here.

  • Drive.google.com/file/d/1VS2…

Let’s import the library and read the data:

import pandas as pd
import gmplot

data = pd.read_csv('3D_spatial_network.csv')
data.head()

# latitude and longitude lists 
latitude_list = data['LATITUDE'] 
longitude_list = data['LONGITUDE'] 

# center coordinates of the map (latitude, longitude, zoom level) 
gmap = gmplot.GoogleMapPlotter(56.730876, 9.349849, 9)

# plot the coordinates on the Google map 
gmap.scatter(latitude_list, longitude_list, '#FF0000', size=40, marker=True) 

# overlay a heatmap of the same points 
gmap.heatmap(latitude_list, longitude_list) 

# the following line creates the HTML file; view it in your web browser 
gmap.draw("mymap.html")

The code above will generate an HTML file, and you can see latitude and longitude coordinates plotted on Google Maps. Heat maps show areas with high density points in red. Pretty cool, huh?

3. Category_encoders: Encode categorical variables using 15 different encoding schemes

One of the biggest obstacles we face early in data science — what should we do with categorical variables? Our machines can process numbers in the blink of an eye, but handling categories is an entirely different problem.

Some machine learning algorithms can handle categorical variables themselves. But most require us to convert them to numeric variables, and for that, category_encoders is an amazing library that provides 15 different encoding schemes.

Let’s see how we can leverage this library.

Install category_encoders

!pip3 install category-encoders

Convert categorical data to numeric data


import pandas as pd 
import category_encoders as ce 

# create a DataFrame 
data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Female'],
    'class': ['A', 'B', 'C', 'D', 'A'],
    'city': ['Delhi', 'Gurugram', 'Delhi', 'Delhi', 'Gurugram']
}) 

data.head()

# One Hot Encoding 
# create an object of the One Hot Encoder 
ce_OHE = ce.OneHotEncoder(cols=['gender', 'city']) 

# transform the data 
data = ce_OHE.fit_transform(data) 
data.head()

Category_encoders supports about 15 different encoding methods, such as:

  • Hashing encoding
  • LeaveOneOut encoding
  • Ordinal encoding
  • Binary encoding
  • Target encoding

All encoders are fully compatible with sklearn transformers, so you can easily use them in your existing scripts. Additionally, category_encoders supports NumPy arrays and Pandas dataframes. You can read more about category_encoders here.

  • Github.com/scikit-lear…

4. Progress_apply: Monitor the time you spend on data science tasks

How much time do you usually spend cleaning and preprocessing data? It’s true that data scientists typically spend 60-70% of their time cleaning up data. It’s important for us to track that, isn’t it?

We don’t want to spend days cleaning up the data and ignoring other data science steps. This is where the Progress_apply function makes our work easier. Let me demonstrate how it works.

Let’s calculate the distance from a point to a particular point and see how far we have progressed to complete this task. You can download the dataset here.

  • Drive.google.com/file/d/1VS2…

import pandas as pd
from tqdm._tqdm_notebook import tqdm_notebook
from pysal.lib.cg import harcdist

tqdm_notebook.pandas()

data = pd.read_csv('3D_spatial_network.csv')
data.head()

# calculate the distance of each data point from 
# (Latitude, Longitude) = (58.4442, 9.3722) 

def calculate_distance(x): 
   return harcdist((x['LATITUDE'], x['LONGITUDE']), (58.4442, 9.3722)) 
   
data['DISTANCE'] = data.progress_apply(calculate_distance, axis=1)

You’ll see how easy it is to track the progress of our code. Simple, efficient.

5. Pandas_profiling: Generate a detailed report on your dataset

We spend a lot of time trying to understand the data we get. That’s fair enough — we don’t want to jump into model building without first understanding the data we’re working with. This is an essential step in any data science project.

Pandas_profiling is a Python package that reduces the amount of work required to perform the initial data analysis step. The package generates a detailed report on our data with just one line of code!

import pandas as pd 
import pandas_profiling 

# read the dataset 
data = pd.read_csv('add-your-data-here') 
pandas_profiling.ProfileReport(data)

With just one line of code, we get a detailed report of the dataset, including:

  • Warnings, for example: Item_Identifier has a high cardinality: 1559 distinct values
  • Frequency counts for all categorical variables
  • Quantiles and descriptive statistics for numeric variables
  • Histograms

6. Grouper: Grouping time series data

Who is not familiar with Pandas now? It is one of the most popular Python libraries and is widely used for data manipulation and analysis. We know Pandas has an amazing ability to manipulate and summarize data.

I was recently working on a time series problem and noticed that Pandas has a Grouper function I had never used before. I became curious about what it does.

It turns out that this Grouper function is very useful for time series data analysis. Let’s try it and see how it works. You can download the dataset for this code here.

  • Drive.google.com/file/d/1UXH…

import pandas as pd 

data = pd.read_excel('sales-data.xlsx') 
data.head()

Now, the first step in processing any time series data is to convert the date column to datetime format:

data['date'] = pd.to_datetime(data['date'])

Suppose our goal is to look at each customer’s monthly sales. Most of us would try to write something complicated here. But this is where Pandas comes to the rescue.

data.set_index('date').groupby('name')["ext price"].resample("M").sum()

Instead, we can use a simpler approach with the groupby syntax, without having to reindex. We just add some extra information to the call, telling Pandas how to group the data in the date column. It looks cleaner and works exactly the same way:

data.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum()
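To see exactly what Grouper is doing, here is a self-contained sketch on a small made-up sales table (customer names, dates, and prices are invented for the illustration):

```python
import pandas as pd

# hypothetical sales data
sales = pd.DataFrame({
    'name': ['Barton LLC', 'Barton LLC', 'Kulas Inc', 'Kulas Inc'],
    'date': pd.to_datetime(['2020-01-05', '2020-01-20',
                            '2020-01-10', '2020-02-15']),
    'ext price': [100.0, 50.0, 200.0, 75.0],
})

# monthly sales per customer: pd.Grouper buckets the date column by month
monthly = sales.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum()
```

Each group is labeled with the month-end date, so Barton LLC’s two January orders are summed under 2020-01-31.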

7. Unstack: Convert indexes to Dataframe columns

We just saw how Grouper helps group time series data. Now, here’s the challenge — what if we want the name column (the index in the example above) to become a dataframe column?

This is where the unstack function becomes critical. Let’s apply the unstack function to the above code example and see the results.

data.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum().unstack()
Copy the code

Very useful! Note: If the index is not MultiIndex, the output will be Series.
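A minimal sketch of the same idea on made-up data (the names, months, and sales figures are invented for the example):

```python
import pandas as pd

# hypothetical sales per customer per month
df = pd.DataFrame({
    'name': ['Alice', 'Alice', 'Bob', 'Bob'],
    'month': ['Feb', 'Jan', 'Feb', 'Jan'],
    'sales': [20, 10, 40, 30],
})

# groupby on two keys produces a Series with a MultiIndex
grouped = df.groupby(['name', 'month'])['sales'].sum()

# unstack pivots the innermost index level (month) into columns
table = grouped.unstack()
```

After `unstack`, `name` is the row index and each month becomes its own column, which is usually the shape you want for reporting or plotting.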

8. %matplotlib notebook: Interactive plotting in Jupyter Notebook

I’m a big fan of the Matplotlib library. It is the most common visual library we use in Jupyter Notebook to generate various graphics.

To view these plots, we usually add the line %matplotlib inline when importing the matplotlib library. This works fine, but it renders static images in the Jupyter Notebook.

Just replace %matplotlib inline with %matplotlib notebook and you’ll see the magic. You’ll get resizable and zoomable plots in your Notebook!

%matplotlib notebook
import matplotlib.pyplot as plt

# scatter plot of some data
# try this on your dataset
plt.scatter(data['quantity'], data['unit price'])

By changing just one word, we get an interactive plot that we can resize and zoom in on.

9. %%time: Check the running time of a particular Python code block

There are many ways to solve a problem. As data scientists, we know this very well. Computing costs is critical in the industry, especially for small and medium-sized organizations. You may want to choose the best way to get the task done in the shortest amount of time.

In fact, it is very easy to check the running time of a particular code block in Jupyter Notebook.

Simply add the %%time magic command at the top of a cell to check its running time:

%%time 
def myfunction(): 
    for i in range(1, 100000, 1): 
        i = i + 1

myfunction()

In this case, we have CPU time and Wall time. CPU time is the total execution or running time that the CPU is dedicated to a process. Wall time is the time that the clock has elapsed between the start of the process and the “now.”
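The difference between the two is easy to see with the standard library’s timers: a sleeping process accumulates wall time but almost no CPU time. A minimal sketch:

```python
import time

start_wall = time.perf_counter()   # wall-clock timer
start_cpu = time.process_time()    # CPU-time timer

time.sleep(0.2)  # sleeping uses wall time but almost no CPU time

wall_elapsed = time.perf_counter() - start_wall
cpu_elapsed = time.process_time() - start_cpu
```

Here `wall_elapsed` will be at least 0.2 seconds, while `cpu_elapsed` stays close to zero, which is exactly the gap %%time reports between Wall time and CPU time.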

10. Rpy2: R and Python in the same Jupyter Notebook!

R and Python are two of the best and most popular open source programming languages in the data science world. While R is primarily used for statistical analysis, Python provides a simple interface to turn mathematical solutions into code.

This is good news, we can use both of them in a Jupyter Notebook! We can take advantage of both ecosystems, and to do so, we just need to install RPY2.

So let’s set the R vs. Python debate aside for now and draw a ggplot2 chart in our Jupyter Notebook.

!pip3 install rpy2

We can use both languages at the same time and even pass variables between them.

%load_ext rpy2.ipython
%R require(ggplot2)

import pandas as pd

df = pd.DataFrame({
    'Class': ['A', 'A', 'A', 'V', 'V', 'A', 'A', 'A'],
    'X': [4, 3, 5, 2, 1, 7, 7, 5],
    'Y': [0, 4, 3, 6, 7, 10, 11, 9],
    'Z': [1, 2, 3, 1, 2, 3, 1, 2]
})

%%R -i df
ggplot(data = df) + geom_point(aes(x = X, y = Y, color = Class, size = Z))

Here, we created a dataframe df in Python and used it to create a scatter plot with R’s ggplot2 library (the geom_point function).

End Notes

This is my indispensable collection of Python tips. I enjoy using these packages and features for everyday tasks. To be honest, my productivity has increased, which makes working in Python more fun than ever.

Beyond these, are there any Python tricks you’d like to share with me? Let me know in the comment section below and we can exchange ideas!

Also, if you’re new to Python and new to data science, then you really should check out our comprehensive and best-selling courses:

  • Applied machine learning — from beginner to professional
    • Courses.analyticsvidhya.com/courses/app…

The original link: www.analyticsvidhya.com/blog/2019/0…
