This article was first published on Jizhi

This article is long and packed with practical content. We recommend bookmarking it as a long-term study guide.


Some time ago, we shared a guide on how to become a data analytics expert with Python.

But that guide only covered how to learn and which resources to use in general terms. Today, we'll go into more detail about how to become a data analyst from scratch with Python, using a lending dataset as an example to walk through the basics and tools needed for data analysis in Python. We believe it can help you quickly become a data analyst with Python.

Table of contents

1. Basic Python knowledge for data analysis

  • Why learn Python for data analysis?
  • How do I install Python?
  • Do some simple projects in Python

2. Python libraries and data structures

  • Python data structures
  • Python iteration and conditional construction
  • Python library for data analysis

3. Exploratory data analysis using Python and Pandas

  • Introduction to Pandas Series and DataFrames
  • Complete the data analysis project with loan data set as an example

4. Reprocess the data using Python’s Pandas

  • How to impute missing values
  • How to handle extreme values

5. Create a prediction model in Python and make predictions based on the data

  • Logistic regression model
  • Decision tree model
  • Random forest model

Note: The original author of this article is data scientist Kunal Jain. He graduated from the Indian Institute of Technology, one of India's top engineering schools, has worked in the field of data science for more than 10 years, and has served as a senior data analyst for several large multinational companies in the UK and India.

A few years ago, after working with SAS for five years, I decided to step out of my comfort zone and become a data scientist. After searching for learning tools that could help me get there, I finally settled on Python.

I’ve always loved coding, it’s a passion of mine.

I spent about a week learning the basics of Python, and then not only did I learn the language in depth myself, but I helped others learn Python as well. Python was originally a general-purpose language, but over the years and with strong community support, Python has spawned a number of libraries for data analysis and predictive modeling.

Since many people don't know how to use Python for data science and resources are scarce, I decided to write this tutorial to help more people learn Python faster. In it, I will also show you how to do data analysis in Python, and eventually you should become proficient enough to apply it to your own problems.

1. Basic knowledge of Python for data analysis

Why learn Python for data analysis?

In recent years, Python has become the language of choice for data analysis. I compared it to SAS and R, and here are some of the benefits of learning Python for data analysis:

  • Open source – Free to install and use
  • Strong online community support
  • Very easy to learn (very quick to learn without any programming background)
  • Can become a common language for data science and the deployment of data analysis-based products

Of course, there are some disadvantages:

It is an interpreted language rather than a compiled language, so it takes more CPU time. But considering how much time we save learning programming (because it’s so easy to learn), it’s not a big disadvantage.

How do I install Python?

There are two ways to install Python:

1. You can download Python directly from the project website (https://www.python.org/download/releases/2.7/) and then install the components and libraries you need individually.

2. You can also download and install a distribution that comes with pre-installed libraries. I suggest downloading Anaconda; another option is Enthought Canopy Express. The second method gives a hassle-free installation, so I recommend it for beginners. Its downside is that you have to wait for the entire distribution to download and install, even if you are only interested in the latest version of a single library.

Selecting a development environment

Once Python is installed, you also have a variety of options when choosing your development environment. Here are three of the most common options:

  • Based on terminal and Shell
  • IDLE (default environment)
  • IPython Notebook — comparable to R Markdown notebooks

While you should choose the right environment for your needs, I personally prefer iPython Notebook. It provides a lot of nice functionality for writing documentation as you write code, as well as the option to run code in blocks (rather than line by line).

We will use the iPython environment throughout this tutorial.

Warm-up: Run your first Python project

You can start by writing a simple calculator in Python:
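
The original example here is a screenshot, so below is a minimal sketch of the kind of expressions you might type into a notebook cell to use Python as a calculator (the numbers are just illustrative):

# Using a notebook cell as a simple calculator
print(2 + 3)      # addition -> 5
print(7 * 6)      # multiplication -> 42
print(2 ** 10)    # exponentiation -> 1024
print(10 / 4.0)   # division -> 2.5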

Some things to watch out for:

  • Depending on the operating system you use, you can start IPython Notebook by typing "ipython notebook" in your terminal / cmd.
  • You can rename a notebook by simply clicking on its name (for example, the default name Untitled0).
  • You can execute a cell's code by pressing "Shift + Enter", or "Alt + Enter" if you also want to insert a new cell below it.

Before we dive into the problem, let's take a step back and cover the basics of Python. Data structures, iteration, and conditional constructs are key to any programming language. In Python, these include lists, strings, tuples, dictionaries, for loops, while loops, if-else constructs, and so on. Let's look at some of them.

2.Python libraries and data structures

Python data structures

Here are some data structures used in Python. You should be familiar with them so that you can use them correctly later.

Lists – Lists are one of the most common data structures in Python. A list is defined simply by writing comma-separated values inside square brackets. Lists may contain items of different types, although usually the items are all of the same type. Python lists are mutable: individual elements of a list can be changed.

Here is an example of defining a list in Python and getting the list:
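
The original example is a screenshot; here is a minimal sketch (with made-up values) of defining a list and accessing its elements:

squares_list = [0, 1, 4, 9, 16, 25]   # define a list
print(squares_list[0])                # first element -> 0
print(squares_list[2:4])              # a slice: elements at index 2 and 3 -> [4, 9]
print(squares_list[-1])               # last element -> 25
squares_list[0] = 100                 # lists are mutable, so this works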

Strings – Strings are defined using single ('), double (") or triple (''') quotes. Strings enclosed in triple quotes can span multiple lines and are frequently used in docstrings, Python's way of documenting functions. The backslash (\) is used as an escape character. Note that Python strings are immutable, so you cannot change part of a string in place.
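
A minimal sketch of these string rules (the example text is my own illustration):

greeting = 'Hello'                # single quotes
name = "Python"                   # double quotes
note = """This string
spans multiple lines."""          # triple quotes
print(greeting + ', ' + name)     # concatenation -> Hello, Python
print(greeting[0])                # indexing -> H
# greeting[0] = 'J'               # would raise a TypeError: strings are immutable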

Tuples — A tuple is represented by a number of values separated by commas. Tuples are immutable, and their output is wrapped in parentheses so that nested tuples are read correctly. Although tuples themselves are immutable, they can hold mutable objects if needed. Because of their immutability, tuples are faster to process than lists, so if your collection of values will not change, use a tuple instead of a list.
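
A minimal sketch of tuple behaviour (the values are just illustrative):

point = (3, 4)                 # a simple tuple
nested = (1, (2, 3), [4, 5])   # tuples can be nested and can hold mutable objects
print(point[0])                # indexing works like lists -> 3
# point[0] = 7                 # would raise a TypeError: tuples are immutable
nested[2].append(6)            # but a list stored inside a tuple can still change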

Dictionaries – A dictionary is a mutable container that stores key: value pairs and can hold objects of any type. Each key is separated from its value by a colon (:), the pairs are separated by commas, and the whole dictionary is enclosed in curly braces {}. A pair of empty curly braces creates an empty dictionary: {}.
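
A minimal sketch of creating and using a dictionary (the contents are my own example):

extensions = {}                    # an empty dictionary
extensions['python'] = '.py'       # add key: value pairs
extensions['markdown'] = '.md'
print(extensions['python'])        # look up a value by its key -> .py
print(extensions.keys())           # all keys in the dictionary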

Python iterates and conditional constructs

Python, like most languages, has a for loop, which is the most widely used method of Python iteration. The syntax is relatively simple:

for i in [Python Iterable]:
  expression(i)

The “Python iterable” here can be a list, tuple, or other high-level data structure, as we’ll cover in the next section. Let’s start with a simple example, determining the factorial of a number.

N = 5      # for example, compute the factorial of 5
fact = 1
for i in range(1, N+1):
  fact *= i

As for conditional statements, they are used to execute snippets of code based on certain conditions. The most commonly used construction is if-else, which has the following syntax:

if [condition]:
  __execution if true__
else:
  __execution if false__

For example, if we wanted to print whether the number N is even or odd, we could write the code like this:

if N%2 == 0:
  print 'Even'
else:
  print 'Odd'

Now that you’re familiar with the basics of Python, let’s take it a step further. What to do if you have some of these tasks:

  • Multiply two matrices
  • Solve the roots of a quadratic equation of one variable
  • Draw bar charts and histograms
  • Creating statistical models
  • Access to web pages

If you’re trying to write code from scratch, the task is a nightmare, and you won’t be able to do it in two days. But don’t worry too much. Fortunately, there are many predefined Python libraries that we can import directly into our code to make our task easier. For example, in our factorial example above, we can do it in one step:

math.factorial(N)

Of course we need to import the Math library to solve this problem. We’ll cover some libraries in the next section.

Python libraries for data analysis

In this section, we’ll look at some useful Python libraries. The obvious first step is to learn how to import them into our environment in several ways:

import math as m

from math import *

In the first approach, we define an alias m for the math library; we can then use functions from the math library (like factorial) by referencing them through the alias, e.g. m.factorial().

In the second approach, we import the entire namespace of math, which means we can call factorial() directly without referring to math.

Tip: Google recommends using the first method of importing libraries, because that way you know where functions come from.

Here’s a list of libraries that you’ll need for any data analysis you do, so make sure you’re familiar with them:

  • Numpy: The fundamental numerical extension library for Python. It supports large multi-dimensional arrays and matrices and provides a large collection of mathematical functions for operating on them.
  • SciPy: Built on top of NumPy, SciPy provides a set of tools for scientific computation in Python, such as algorithms for numerical computation and functions that make it easy to manipulate data.
  • Matplotlib: A Python 2D drawing library that generates publish-quality graphics in a variety of hard copy formats and cross-platform interactive environments. With Matplotlib, developers can generate plots, histograms, power spectra, bar charts, error charts, scatter charts, and more with just a few lines of code.
  • Pandas: A numpy-based tool created to solve data analysis tasks. Pandas incorporates a large number of libraries and some standard data models to provide the tools needed to efficiently manipulate large data sets. Pandas provides a large number of functions and methods that allow us to work with data quickly and easily.
  • Scikit Learn: a machine learning library developed in Python, which contains a large number of machine learning algorithms, data sets and is a convenient tool for data mining.
  • Statsmodels: A Python module containing statistical models, statistical tests, and statistical data mining. A corresponding statistical result is generated for each model. Statistical results are compared with existing statistical packages to ensure their correctness.
  • Seaborn: A Python library for statistical data visualization. With it, you can draw attractive and informative statistical graphics in Python, making it a handy tool for data analysis and mining.
  • Bokeh: An interactive visualization Python package that enables the creation of interactive graphics, data panels, and data applications in a browser.
  • Blaze: Extends the capabilities of Numpy and Pandas to distributed and streaming datasets. It can be used to access data from a variety of sources, including Bcolz, MongoDB, SQLAlchemy, Apache Spark, and PyTables. Used together with Bokeh, Blaze becomes a very powerful tool for visualizing large amounts of data.
  • Scrapy: A fast, high-level web crawling and scraping framework for Python, used to crawl websites and extract structured data from their pages. It is versatile and can be used for data mining, monitoring, and automated testing.
  • SymPy: The Python library for symbolic computing, which includes integral, differential equation and other mathematical operations, provides strong mathematical support for Python.
  • Requests: An HTTP library written in Python, built on top of urllib and released under the Apache2 license. It is more convenient than urllib, saves a lot of work, and fully covers the needs of HTTP testing.

There are a few other libraries you might use:

  • os for operating system and file operations
  • networkx and igraph for graph-based data manipulation
  • re (regular expressions) for finding patterns in text data
  • BeautifulSoup for parsing web pages and extracting data from them

You don’t need to be familiar with all of these libraries. However, you must be familiar with data mining and analysis tools like Numpy and Pandas.

Now that we’ve covered the basics of Python and some libraries, let’s take a closer look at how to solve problems in Python. That’s right, create prediction models in Python! We’ll use some powerful libraries along the way, as well as higher-level Python data structures. There are three main stages:

  • Data exploration – Exploring more information about the data in our hands
  • Data reprocessing – Cleaning data, processing data, making it more suitable for statistical modeling
  • Predictive modeling — Running algorithms to get interesting and meaningful results from data

3. Use Python and Pandas for exploratory analysis of data

To further mine our data, we’ll use Pandas.

Pandas is one of the most useful data analysis libraries in Python. We will use Pandas to read the data set from Analytics Vidhya, perform exploratory analysis on it, and build a simple classification algorithm to solve the problem.

Before we load the data, let’s understand the two key data structures in Pandas, Series and Dataframes.

Introduction to Series and Dataframes

A Series can be thought of as a one-dimensional labelled (indexed) array; you can access individual elements of a Series through these labels.

A DataFrame is much like an Excel workbook: it has column names that refer to columns and row numbers that can be used to access rows. The essential difference is that, in a DataFrame, the column names and row numbers are called the column index and row index.

Series and Dataframe form the core data schema for the Python library Pandas. Pandas first reads the dataset in Dataframe format, and then can easily apply a variety of operations, such as grouping, aggregating, and so on.
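
As a minimal sketch (with made-up values) of how a Series and a DataFrame relate:

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])   # a labelled 1-D array
print(s['b'])                                        # access an element by its label -> 20

df_demo = pd.DataFrame({'income': [2500, 4000],      # each column of a DataFrame is a Series
                        'approved': ['Y', 'N']},
                       index=['row1', 'row2'])
print(df_demo['income'])                             # access a column by its name
print(df_demo.loc['row1'])                           # access a row by its row index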

If you are not very familiar with the Pandas, you can view the article 10 minutes Pandas introductory tutorial: https://jizhi.im/blog/post/10min2pandas01

Practice dataset – the loan prediction problem

We download the data from here. The dataset describes each applicant through variables such as Gender, Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History, Property_Area, and the target variable Loan_Status.

Let's start exploring the data.

To start the iPython interface in Pylab mode, type the following on your Terminal or Windows command line:

ipython notebook --pylab=inline

This opens the IPython Notebook in the Pylab environment, which has some useful libraries built in. Moreover, you can plot data inline, which makes it ideal for interactive data analysis. Check that the environment has loaded correctly by typing the following code:

plot(arange(5))

I currently work on Linux and have stored the dataset in the following location: /home/kunal/Downloads/Loan_Prediction/train.csv

Import libraries and datasets:

The following libraries will be used in this tutorial:

  • Numpy
  • Matplotlib
  • Pandas

Note that because you are using the Pylab environment, you do not need to import Matplotlib and Numpy. I’ve kept them in the code in case you need to use this code in another context.

After importing the library, we use the function read_csv() to read the data set. The code so far looks like this:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("/home/kunal/Downloads/Loan_Prediction/train.csv") #Reading the dataset in a dataframe using Pandas

Rapid data exploration

After reading the data set, we can view the top few rows with the head() function.

df.head(10)

This should print 10 rows of the dataset; you can view more rows by passing a larger number. Next, look at a summary of the numeric fields by using the describe() function.

df.describe()

The describe() function shows the count, mean, standard deviation, minimum, maximum, and quartiles of each numeric field in its output.

By looking at the output of the describe() function, we can find the following information:

LoanAmount has 22 missing values

Loan_Amount_Term has 14 missing values

Credit_History has 50 missing values

We can also see that 84% of customers have credit_history.

How did we work that out?

The mean of the Credit_History field is 0.84. Remember that Credit_History takes the value 1 for applicants who have a credit history and 0 otherwise, so its mean is simply the proportion of applicants with a credit history. The ApplicantIncome distribution seems in line with expectations, and so does CoapplicantIncome.

Note that we can also see some bias in the data by comparing the mean to the median.

For non-numeric values (such as Property_Area, Credit_History, etc.), we can look at frequency distributions to see if they make sense. You can print the frequency table by running the following command:

df['Property_Area'].value_counts()

Similarly, we can look at the unique values of Credit_History.
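
The original shows only the output; a minimal sketch of the command itself:

df['Credit_History'].value_counts()   # frequency of each unique value of Credit_History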

Note that df['column_name'] is the basic indexing technique for getting a specific column of the dataframe; it can also accept a list of columns.

Distribution analysis:

Now that we understand the basic data characteristics, let’s move on to the distribution of multiple variables. Let’s start with the numeric variables, Applicantlncome and LoanAmount.

We’ll start by drawing the Applicantlncome histogram with the following command:

df['ApplicantIncome'].hist(bins=50)

Here we observe a small number of extremes, which is why we need 50 bins to map the distribution clearly.

Next, we look at a box plot to understand the distribution of the data. A box plot can be drawn with the following command:

df.boxplot(column='ApplicantIncome')

This confirms that there are many outliers in the data, which also reflects large income disparities in society, perhaps due to different education levels. Let's segregate applicants by Education:

df.boxplot(column='ApplicantIncome', by = 'Education')

We can see that there is no significant difference between the average income of the college-educated and the uneducated. But the highly educated group has more high earners, some of them so high that they are outliers in the data.

Now let's look at the histogram and box plot of LoanAmount, using the following commands:

df['LoanAmount'].hist(bins=50)

df.boxplot(column='LoanAmount')

Again, we found some extreme values. Obviously, Applicantlncome and LoanAmount need some data reprocessing. LoanAmount contains missing values and quite a few extreme values, while Applicantlncome has fewer extreme values and requires a deeper understanding. We will tackle this task in the next section.

Categorical variable analysis

Now that we understand the distributions of ApplicantIncome and LoanAmount, let's explore the categorical variables in more detail. We will use Excel-style pivot tables and cross-tabulations; for example, let's look at the odds of getting a loan based on credit history, obtained with a pivot table:

Note: Here the loan status is coded as 1 for "yes" and 0 for "no", so the mean is the probability of getting a loan.

temp1 = df['Credit_History'].value_counts(ascending=True)
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'], aggfunc=lambda x: x.map({'Y': 1, 'N': 0}).mean())
print 'Frequency Table for Credit History:'
print temp1

print '\nProbability of getting loan for each Credit History class:'
print temp2

Now let's look at the steps needed to generate the same insight in Python, just like the pivot_table we would build in Excel. We can plot it as a bar chart using the Matplotlib library, with the following code:

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")
temp1.plot(kind='bar')

ax2 = fig.add_subplot(122)
temp2.plot(kind='bar')
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title("Probability of getting loan by credit history")

The chart shows that applicants with a valid credit history are eight times more likely to get a loan. Similar charts can be drawn for Married, self-employed, Property_Area, etc. Alternatively, we can superimpose the two graphs:

temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
temp3.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)

Add gender to the mix:
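
The chart for this step came from a screenshot; a minimal sketch of how you might add Gender as a second grouping variable, following the same crosstab approach as above (temp4 is just an illustrative name):

temp4 = pd.crosstab([df['Credit_History'], df['Gender']], df['Loan_Status'])
temp4.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)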

In case you didn’t see it very clearly, we actually created two basic classification algorithms here. One is based on credit history and the other on two categorical variables (including gender).

We just showed you how to do exploratory data analysis with Python and Pandas. We then explore the Applicantlncome and LoanStatus variables further, perform data reprocessing, and create a data set to apply a variety of modeling techniques. It is highly recommended that you apply another data set and similar questions to practice with the tutorial we just showed you.

4. Use Pandas and Python for data reprocessing

At this stage, be sure to study hard and get ready to start practicing.

Data munging – a recap of why we need it

As we explored the data, we found that there were some problems in the data set that needed to be solved to build a well-performing model later. The task of tackling problematic data is called “data munging”. Here’s what we’ve noticed:

There are missing values in some variables, and we should wisely estimate these values based on the number of missing values and the expected importance of the variable.

When we look at the distribution of the data, we see that ApplicantIncome and LoanAmount seem to contain extreme values on both sides. While these extremes make sense, we must deal with these outliers appropriately.

In addition to these problems with numeric fields, we should also look at the non-numeric fields, such as Gender, Property_Area, Married, Education, and Dependents, to see whether they contain useful information. If you are not familiar with Pandas, check out this introduction to some useful Pandas techniques: click here

Check for missing values in the data set

Let's look at missing values in all the variables, because most models can't handle missing data, and even for those that can, imputing the missing values usually helps. So let's check for nulls in the dataset.

 df.apply(lambda x: sum(x.isnull()),axis=0) 

This command tells us the number of missing values in each column, since isnull() returns 1 where a value is null.

Although the counts of missing values are not very high, many variables have them, so we should estimate and fill them in. For more detailed methods of missing value imputation, check out this article: click here

Note: Remember that missing values are not always NaN. For instance, if Loan_Amount_Term is 0, does that make sense? Presumably you would treat it as a missing value, and you would be right, so we should also check for unrealistic values.

How do I fill in the missing value of LoanAmount?

There are many ways to fill in missing values. The simplest way to do this is to replace them with average values. This can be done with the following code:

 df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)

Another approach is to create a supervised learning model that predicts the amount of LoanAmount based on other variables, and then combines other variables to predict missing values.

Since our focus here is on the data munging steps, I'll take an approach that lies somewhere between these two extremes. A key assumption is that whether an applicant is educated or self-employed can be combined to give a good estimate of the loan amount.

First, let's look at a box plot to see whether a trend exists:
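
The plot itself is a screenshot in the original; a minimal sketch of how it could be produced, assuming we group LoanAmount by both features:

df.boxplot(column='LoanAmount', by=['Education', 'Self_Employed'])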

We see some variation in the median loan amount for each group of applicants, which can be used to infer missing values. But first, we must make sure that there are no missing values in both Self_Employed and Education characteristics.

As we said earlier, Self_Employed has some missing values, so let’s take a look at the frequency table:
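
The original shows only the resulting table; a minimal sketch of the command:

df['Self_Employed'].value_counts()   # frequency of each category, e.g. 'No' vs 'Yes'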

Since ~86% of the values are "No", it is safe to impute the missing values as "No" as well, which can be done with the following code:

 df['Self_Employed'].fillna('No',inplace=True)

Now we will create a PivotTable to get the median of all unique values under the Self_Employed and Education characteristics. Next, we define a function that returns the values of these cells and fills the missing values of LoanAmount with the function:

table = df.pivot_table(values='LoanAmount', index='Self_Employed', columns='Education', aggfunc=np.median)
# Define a function that returns the corresponding value of this pivot_table
def fage(x):
    return table.loc[x['Self_Employed'], x['Education']]
# Replace missing values
df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)

This is how we infer missing values in LoanAmount.

How do we handle extreme values in the distributions of LoanAmount and ApplicantIncome?

Let’s look at LoanAmount first. Since extreme values are possible in practice, meaning that some people may take out very high loans for specific needs, we don’t treat them as outliers, but try logarithmic transformations to offset their effects:

df['LoanAmount_log'] = np.log(df['LoanAmount'])
df['LoanAmount_log'].hist(bins=20)

Let’s look at the histogram again:

The distribution of the data now looks more like normal, and the influence of extremes has been greatly diminished.

Now look at Applicantlncome. Our intuition is that some applicants will have low incomes but will make good co-applicants, meaning two or more people applying for a loan together. So it’s a good idea to combine co-applicants’ incomes into total income, and then do a logarithmic conversion.

df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df['TotalIncome_log'] = np.log(df['TotalIncome'])
df['TotalIncome_log'].hist(bins=20)

Now we see that the distribution is much better than before. Try imputing the missing values of Gender, Married, Dependents, Loan_Amount_Term and Credit_History yourself. I also encourage you to think about other information that could be mined from the data; for example, we could create a LoanAmount/TotalIncome column, since it gives an idea of how well placed the applicant is to repay the loan.
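
As a sketch of those exercises (the mode-based imputation and the ratio column below are my own illustration, not the author's code):

# One simple strategy: fill the remaining gaps with each column's most frequent value
for col in ['Gender', 'Married', 'Dependents', 'Loan_Amount_Term', 'Credit_History']:
    df[col].fillna(df[col].mode()[0], inplace=True)

# A derived feature: what fraction of total income the requested loan represents
df['LoanAmount_by_TotalIncome'] = df['LoanAmount'] / df['TotalIncome']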

Let’s look at how to create a prediction model.

5. Create prediction models in Python

After some effort to preprocess the data, we can now use Python to build a prediction model on our dataset. In Python, scikit-learn (sklearn) is the most commonly used library for building prediction models. If you're not familiar with sklearn, we suggest checking out this introductory tutorial:

Introductory tutorial

Since Sklearn requires all inputs to be numeric, we should convert all classification variables to numeric variables. This can be done with the following code:

from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i])
df.dtypes

Next, we import the required modules. We then define a general classification function that takes the model as input to determine accuracy and cross-validation rates. Since this article is just a primer, I won’t go into the code here, but you can refer to this article for information on how to optimize model performance with cross-validation.

#Import models from scikit learn module:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold   #For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

# Generic function for making a classification model and assessing its performance:
def classification_model(model, data, predictors, outcome):
  # Fit the model:
  model.fit(data[predictors],data[outcome])

  # Make predictions on the training set:
  predictions = model.predict(data[predictors])

  # Print accuracy
  accuracy = metrics.accuracy_score(predictions,data[outcome])
  print "Accuracy : %s" % "{0:.3%}".format(accuracy)

  # Perform k-fold cross-validation with 5 folds
  kf = KFold(data.shape[0], n_folds=5)
  error = []
  for train, test in kf:
    # Filter the training data
    train_predictors = (data[predictors].iloc[train,:])

    # The target values used to train the algorithm
    train_target = data[outcome].iloc[train]

    # Train the algorithm with the target values and predictors
    model.fit(train_predictors, train_target)

    # Record the error from each cross-validation run
    error.append(model.score(data[predictors].iloc[test,:], data[outcome].iloc[test]))

  print "Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error))

  # Fit the model again so that it can be referred to outside the function
  model.fit(data[predictors],data[outcome])

Logistic regression

We create our first logistic regression model. One way to do this is to input all the variables into the model, but this may lead to overfitting and the model does not generalize well.

We can make some intuitive assumptions to solve the problem. Applicants are more likely to get a loan if:

  • The applicant has a credit history (we explained this earlier)
  • The applicant has a high income, or as a co-applicant has a high income
  • The applicant has a high level of education
  • The applicant owns property in an urban area with high growth prospects

We create our first prediction model based on “Credit_History”.

outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, df,predictor_var,outcome_var)

Accuracy: 80.945% Cross validation score: 80.946%

#We can try different combination of variables:
predictor_var = ['Credit_History','Education','Married','Self_Employed','Property_Area']
classification_model(model, df,predictor_var,outcome_var)

Accuracy: 80.945% Cross validation score: 80.946%

Normally we would expect accuracy to increase when we add variables, but this is a more challenging case: the accuracy and cross-validation score are barely affected by the less important variables because Credit_History dominates the model. We now have two options:

  • Feature engineering: Mining new information and trying to predict. I’ll leave that to you.
  • Choose better modeling methods.

Decision trees

Decision trees are another way to create prediction models and are known to be more accurate than logistic regression models.

model = DecisionTreeClassifier()
predictor_var = ['Credit_History','Gender','Married','Education']
classification_model(model, df,predictor_var,outcome_var)

Accuracy: 81.930% Cross validation score: 76.656%

Here the model based on categorical variables has no added impact, because Credit_History dominates over them. Let's try some numerical variables instead.

# We can try different combinations of variables
predictor_var = ['Credit_History','Loan_Amount_Term','LoanAmount_log']
classification_model(model, df,predictor_var,outcome_var)

Accuracy: 92.345% Cross validation score: 71.009%

Here we observe that although accuracy went up as we added variables, the cross-validation score went down. This is because the model is overfitting the data. Let's try an even more sophisticated algorithm and see whether it helps.

Random forests

Random forest is another algorithm for solving classification problems. For more on random forests, see this article by Jizhi.

One of the nice things about a random forest is that we can make it process all the features, and it will return a feature importance matrix that can be used to select features.

model = RandomForestClassifier(n_estimators=100)
predictor_var = ['Gender','Married','Dependents','Education','Self_Employed','Loan_Amount_Term','Credit_History','Property_Area','LoanAmount_log','TotalIncome_log']
classification_model(model, df,predictor_var,outcome_var)

Accuracy: 100.00% Cross validation score: 78.179%

Here we see that the accuracy of the model on the training set reaches 100%. This extreme case of over-fitting can be solved by the following two methods:

  • Reduce the number of predictors
  • Tune the model parameters

Let’s try it both ways. First let’s look at the feature importance matrix and pick the most important feature.

featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print featimp

We created the model using the five variables with the highest importance. Similarly, we will slightly adjust the parameters of the random forest model:

model = RandomForestClassifier(n_estimators=25, min_samples_split=25, max_depth=7, max_features=1)
predictor_var = ['TotalIncome_log','LoanAmount_log','Credit_History','Dependents','Property_Area']
classification_model(model, df,predictor_var,outcome_var)

Accuracy: 82.899% Cross validation score: 81.461%

Note that although accuracy dropped, the cross-validation score improved, showing that the model now generalizes well. Remember that random forest results are not exactly reproducible: because of the randomness involved, they will differ slightly from run to run, but the output remains roughly stable.

You may also have noticed that, even after tuning the basic parameters of the random forest model, our cross-validation score is only slightly better than that of the first logistic regression model. This exercise gives us some interesting lessons:

  • Using more complex models does not guarantee better results
  • Avoid very complex modeling techniques without understanding the concepts. Hasty use of the model will result in overfitting.
  • Feature engineering is an important tool for solving problems: everyone can use an Xgboost model, but the real skill lies in enhancing your features to better suit the model.

Conclusion

Hopefully this article helps you get started and grow effectively as a data analyst working in data science. By studying the material above, you will not only understand basic data analysis methods but also learn how to apply some of the more sophisticated techniques.

Python is a really great tool and has become an increasingly popular programming language in data science because it is easy to learn and has a lot of library support.

So learn Python, and you’ll eventually be able to handle any data science project with ease, including reading, analyzing, and visualizing data and making predictions based on it.
