Editor's note: Python has become the language of choice for data scientists, thanks to its mature ecosystem and strong scientific computing libraries. If you are starting to learn Python with the goal of analyzing data, NumPy, SciPy, and Pandas are the tools you will need to get ahead. For those with a mathematics background in particular, Pandas is a natural entry point for data analysis.

This article describes 12 Pandas techniques for data analysis. To show how effective they are, we work through them on a real dataset.

Dataset: the examples are based on a loan prediction problem. Download the data from datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction (registration required) and follow along.

Ready? Let's start!

First, we import the modules and load the dataset into the Python environment:


     
    import pandas as pd
    import numpy as np
    data = pd.read_csv("train.csv", index_col="Loan_ID")


1. Boolean Indexing

What do you do when you want to filter the values of one column based on conditions on another? For instance, suppose we want a list of all women who are not graduates and whose loans were approved. Boolean indexing provides exactly this functionality. Here is what we need to do:


     
    data.loc[(data["Gender"]=="Female") & (data["Education"]=="Not Graduate") & (data["Loan_Status"]=="Y"), ["Gender","Education","Loan_Status"]]


2. Apply Function

The apply function is one of the workhorses for manipulating data and creating new variables. It applies a function to each row/column of a DataFrame and returns the corresponding values; the function can be built in or user defined. Here we define one that counts the missing values in each row/column:


     
    #Create a new function:
    def num_missing(x):
        return sum(x.isnull())

    #Applying per column:
    print("Missing values per column:")
    print(data.apply(num_missing, axis=0))  #axis=0 applies the function to each column

    #Applying per row:
    print("\nMissing values per row:")
    print(data.apply(num_missing, axis=1).head())  #axis=1 applies the function to each row


We get the result we expected. Note that head() is applied only to the second output, because it contains one value per row of the dataset and would otherwise be far too long to print.

3. Imputing missing values

fillna() replaces missing values in one step, for example by filling them with the mean/mode/median of the target column. In this example, let's fill the missing values in the Gender, Married, and Self_Employed columns with their modes:


     
    #First we import a function to determine the mode
    from scipy.stats import mode
    mode(data['Gender'])


Output:


     
    ModeResult(mode=array(['Male'], dtype=object), count=array([489]))


This gives us both the mode and the number of times it occurs. Keep in mind that the mode can be an array, because the data may contain several equally frequent values. By default we take the first one:


     
    mode(data['Gender']).mode[0]


Output: 'Male'
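Incidentally, recent versions of pandas can compute the mode directly on a Series, with no need for scipy; a minimal sketch:

    #Pandas-native alternative (assumes a reasonably recent pandas version):
    data['Gender'].mode()[0]  #Series.mode() returns all modal values; take the first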

Now we can update the missing values and test our knowledge of the Apply function:


     
    #Impute the values:
    data['Gender'].fillna(mode(data['Gender']).mode[0], inplace=True)
    data['Married'].fillna(mode(data['Married']).mode[0], inplace=True)
    data['Self_Employed'].fillna(mode(data['Self_Employed']).mode[0], inplace=True)

    #Now check the missing values again to confirm:
    print(data.apply(num_missing, axis=0))


The output confirms that the missing values have been filled, but this is only the most basic approach. In real work you will need more sophisticated methods, such as imputing with group-wise means/modes/medians or modeling the missing values.
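For instance, a group-wise imputation can be sketched in a single line with groupby and transform. This is only an illustration (we deliberately do not assign the result, since sections 4 and 5 below impute LoanAmount with a pivot table instead), and the choice of Gender as the grouping column is ours:

    #Illustrative sketch: group-wise median of LoanAmount by Gender, aligned to the original rows
    data['LoanAmount'].fillna(data.groupby('Gender')['LoanAmount'].transform('median'))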

4. Pivot Table

Pandas can create MS Excel-style pivot tables. In this example, the key column is LoanAmount, which has missing values. To get its group-wise means, we use the Gender, Married, and Self_Employed columns:


     
    #Determine pivot table
    impute_grps = data.pivot_table(values=["LoanAmount"], index=["Gender","Married","Self_Employed"], aggfunc=np.mean)
    print(impute_grps)


5. Multi-Indexing

If you look closely at the pivot table output above, you will notice that each index is composed of three values. This is called a MultiIndex, and it allows operations to be carried out quickly.

Let's continue with this example and assume that we have the group values from the pivot table but still haven't imputed the missing LoanAmount values. This is where all the previous techniques come together:


     
    #iterate only through rows with missing LoanAmount
    for i, row in data.loc[data['LoanAmount'].isnull(), :].iterrows():
        ind = tuple([row['Gender'], row['Married'], row['Self_Employed']])
        data.loc[i, 'LoanAmount'] = impute_grps.loc[ind].values[0]

    #Now check the missing values again to confirm:
    print(data.apply(num_missing, axis=0))


Note:

  • A MultiIndex requires a tuple to define a group of index values in a loc statement, which is why we build one inside the loop.

  • The .values[0] suffix is needed because, by default, the lookup returns a Series whose index does not match that of the DataFrame, so a direct assignment would raise an error; the small sketch below makes this concrete.
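Here is that sketch; the index key ('Female', 'No', 'No') is just a hypothetical example group:

    #Hypothetical lookup on the MultiIndexed pivot table:
    grp_val = impute_grps.loc[('Female', 'No', 'No')]  #returns a one-entry Series indexed by 'LoanAmount'
    print(grp_val)
    print(grp_val.values[0])  #the plain scalar we actually want to assign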

6. Crosstab

This function gives a quick initial "overview" of the data; in plain terms, it lets us check basic hypotheses, such as whether, in our loan example, Credit_History affects whether a loan is approved. This can be tested with a crosstab, as shown below:


     
    pd.crosstab(data["Credit_History"], data["Loan_Status"], margins=True)


If the absolute numbers aren't intuitive enough, we can convert them to percentages using the apply function:


     
    def percConvert(ser):
        return ser / float(ser.iloc[-1])  #divide each value by the row total (the "All" margin)

    pd.crosstab(data["Credit_History"], data["Loan_Status"], margins=True).apply(percConvert, axis=1)


Clearly, applicants with a credit history (Credit_History = 1) have a far better chance of getting a loan, roughly 80% of the time, compared with a meagre 9% for those without one.

Is that all there is to it? No, there is an interesting story here. Since applicants with a credit history are known to be more likely to get a loan, what if we build a trivial model that predicts Loan_Status = Y for everyone with a credit history and N for everyone else? Across the 614 records it would be right 82 + 378 = 460 times, about 75% accuracy!

You might wonder why we are suddenly talking about statistical models. The point is simply that beating this baseline by even a tenth of a percent is already a real achievement.

Note: 75% is an approximate value; the exact number varies between the training and test sets. I hope this gives an intuitive feel for why a 0.05% gain in accuracy in a competition like Kaggle can move you up 500+ places on the leaderboard.
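As a sanity check, this baseline can be computed directly; a minimal sketch (treating a missing Credit_History as 0 and mapping 1/0 to 'Y'/'N' are our own illustrative choices):

    #Illustrative baseline: predict Y if Credit_History == 1, else N
    baseline_pred = data['Credit_History'].fillna(0).map({1: 'Y', 0: 'N'})
    print((baseline_pred == data['Loan_Status']).mean())  #fraction of correct predictions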

7. Merging DataFrames

Merging DataFrames becomes important when we need to combine information from different sources. With housing prices (and property speculation) such a hot topic, let's make up a small table of hypothetical average property rates per unit area for the property areas in our dataset:


     
    #Hypothetical average property rates per unit area (illustrative values)
    prop_rates = pd.DataFrame([1000, 5000, 12000], index=['Rural', 'Semiurban', 'Urban'], columns=['rates'])
    prop_rates


Now we can merge this information with the original DataFrame:


     
    data_merged = data.merge(right=prop_rates, how='inner', left_on='Property_Area', right_index=True, sort=False)
    data_merged.pivot_table(values='Credit_History', index=['Property_Area','rates'], aggfunc=len)


Well, at these rates we clearly can't afford the urban properties, but the merge worked. Note that the 'values' argument is irrelevant here, because we are simply counting rows.
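For a pure row count like this, an equivalent sketch uses groupby with size, which avoids the need for a placeholder 'values' column:

    #Equivalent count of rows per (Property_Area, rates) group:
    data_merged.groupby(['Property_Area', 'rates']).size()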

8. DataFrame sorting

Pandas makes it easy to sort on multiple columns, as shown below:


     
    data_sorted = data.sort_values(['ApplicantIncome','CoapplicantIncome'], ascending=False)
    data_sorted[['ApplicantIncome','CoapplicantIncome']].head(10)


Note: Pandas' old sort() method is no longer available; sorting is now done with sort_values().
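As a small aside, sort_values also accepts a list for ascending, so each column can be given its own direction; a quick sketch:

    #Sort ApplicantIncome descending and CoapplicantIncome ascending (illustrative):
    data.sort_values(['ApplicantIncome', 'CoapplicantIncome'], ascending=[False, True]).head()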

9. Plot (Boxplot and histogram)

Many people don't realize that boxplots and histograms can be drawn directly from Pandas in a single line, with no need to call matplotlib separately. For example, to compare the distribution of ApplicantIncome across Loan_Status:


     
    import matplotlib.pyplot as plt
    %matplotlib inline
    data.boxplot(column="ApplicantIncome", by="Loan_Status")



     
    data.hist(column="ApplicantIncome", by="Loan_Status", bins=30)


The two plots show that income plays a smaller role in the loan process than we might expect: there is no significant difference in income between applicants who were rejected and those who received loans.

10. Cut function for binning

Sometimes aggregated data is more meaningful than raw values. Take self-driving cars, which have been in the news for accidents lately: if we want to use the data they capture to reconstruct road traffic conditions, grouping the day into key periods such as "morning", "afternoon", "evening", and "late night" carries more information than a continuous stream of readings or an even split into 24 hours. Modeling with such binned data gives more intuitive results and helps avoid overfitting.

Here we define a simple function that makes binning efficient:


     
    #Binning:
    def binning(col, cut_points, labels=None):
        #Define min and max values:
        minval = col.min()
        maxval = col.max()
        #create list by adding min and max to cut_points
        break_points = [minval] + cut_points + [maxval]
        #if no labels provided, use default labels 0 ... (n-1)
        if not labels:
            labels = range(len(cut_points)+1)
        #Binning using cut function of pandas
        colBin = pd.cut(col, bins=break_points, labels=labels, include_lowest=True)
        return colBin

    #Binning LoanAmount:
    cut_points = [90, 140, 190]
    labels = ["low", "medium", "high", "very high"]
    data["LoanAmount_Bin"] = binning(data["LoanAmount"], cut_points, labels)
    print(pd.value_counts(data["LoanAmount_Bin"], sort=False))


11. Encoding nominal data

Sometimes we need to recode nominal data, and this can happen for several reasons:

  • Some algorithms (such as logistic regression) require all inputs to be numeric, so nominal variables are usually recoded as 0, 1, ..., (n-1).

  • A single category may be recorded in several ways; for example, "temperature" might be stored as "High", "Medium", "Low", "H", "low", where "High" and "H" mean the same thing and "low" and "Low" mean the same thing, but Python treats them as different values.

  • Some categories have very low frequencies, and it is usually best to merge them.

To handle this, we define a simple function that takes a "dictionary" of recodings as input and uses Pandas' replace function to apply them:


     
    #Define a generic function using Pandas replace function
    def coding(col, codeDict):
        colCoded = pd.Series(col, copy=True)
        for key, value in codeDict.items():
            colCoded.replace(key, value, inplace=True)
        return colCoded

    #Coding Loan_Status as Y=1, N=0:
    print('Before Coding:')
    print(pd.value_counts(data["Loan_Status"]))
    data["Loan_Status_Coded"] = coding(data["Loan_Status"], {'N':0, 'Y':1})
    print('\nAfter Coding:')
    print(pd.value_counts(data["Loan_Status_Coded"]))


12. Iterate over rows of the dataframe

Iterating over rows is not a frequently used operation, but when you do need it you don't want to be stuck; sometimes the only practical option is to go through all the rows with a for loop. Here are two situations where this approach is needed:

  • A nominal variable encoded with numbers gets treated as numeric.

  • A numeric variable gets treated as categorical because some of its rows contain character values (due to a data error).

In these cases we need to define the column types manually. It is a bit of a hassle, so let's first check the current data types:


     
    #Check current type:
    data.dtypes


From the output we can see that Credit_History is a nominal variable, yet it shows up as float64. A good way to deal with such issues is to keep a CSV file that lists each column name and its type. We can then write a small routine that reads this file and assigns the proper data type to each column.
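The contents of datatypes.csv are not shown here; a partial, hypothetical version with a 'feature' and a 'type' column (matching the code below) could be created like this:

    #Sketch of a hypothetical, partial datatypes.csv with 'feature' and 'type' columns:
    import pandas as pd
    colTypes_example = pd.DataFrame({
        'feature': ['Credit_History', 'LoanAmount', 'ApplicantIncome'],
        'type': ['categorical', 'continuous', 'continuous']
    })
    colTypes_example.to_csv('datatypes.csv', index=False)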


     
    #Load the file:
    colTypes = pd.read_csv('datatypes.csv')
    print(colTypes)


After loading the file, we can iterate over its rows and assign each column named in the 'feature' column the data type given in the 'type' column.


     
    #Iterate through each row and assign variable type.
    #Note: astype is used to assign types
    for i, row in colTypes.iterrows():  #i: dataframe index; row: each row in series format
        if row['type'] == "categorical":
            data[row['feature']] = data[row['feature']].astype(object)  #plain object instead of the removed np.object
        elif row['type'] == "continuous":
            data[row['feature']] = data[row['feature']].astype(float)   #plain float instead of the removed np.float
    print(data.dtypes)


Hope this article is useful to you!

Original article: www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/