This article is the third in the series “Building a Data Science Portfolio”. It is approximately 25,000 characters long and takes about 37 minutes to read.

Written by Vik Paruchuri, translated by tang xiaoting, proofread by EarlGrey, produced by PythonTG translation group/programming school

If you like this series and want to keep up with the latest articles, you can subscribe.

Data science companies are increasingly looking for a portfolio in hiring because it’s one of the best ways to measure actual ability. The good news is, you’re in complete control of your portfolio. With some effort, you can build a high-quality portfolio that will impress employers.

The first step in building a high-quality portfolio is to figure out which skills it should showcase. The skills companies want data scientists to have (and therefore want a portfolio to demonstrate) include:

  • Ability to communicate
  • Ability to collaborate with others
  • Technical ability
  • Ability to reason about data
  • Initiative

A good portfolio typically consists of multiple projects, each demonstrating one or two of the skills above. This is the third article in the series on building a strong data science portfolio. It describes how to build the second project in your portfolio: a complete, end-to-end machine learning project. By the end, you will have a project that demonstrates both your ability to reason about data and your technical ability. Here’s the finished project if you want to take a look at the whole thing.

A complete project

As a data scientist, you’re sometimes asked to analyze a data set and tell a story with it. Good communication and clear thinking are very important in those cases, and a tool like Jupyter Notebook, which we used in an earlier article, works well for this. The expected deliverable is a presentation or a document summarizing your findings.

At other times, however, you’ll be asked to work on projects that have business value. A business-value project directly affects the company’s day-to-day operations and is used repeatedly, with tasks like “design an algorithm that predicts user churn” or “create a model that automatically tags articles.” In such cases, the ability to tell a story matters less than technical ability. You need to be able to analyze a data set, understand it, and then write scripts that process that data. It’s also a common requirement that these scripts run quickly and use as few system resources (such as memory) as possible. These scripts are usually run many times, so the final deliverable is the scripts themselves rather than a report. The deliverables are often integrated into business processes, and may even be exposed directly to users.

Creating a complete project like this requires you to:

  • Understand the overall project context
  • Explore the data and identify its nuances
  • Create a well-structured project that is easy to integrate into business processes
  • Write high-performance code that runs fast and uses minimal system resources
  • Write good installation and usage documentation so others can use the code

To build this kind of project efficiently, we need to work with many files. We highly recommend using a text editor like Atom, or an IDE like PyCharm. These tools let you jump between files and edit files of different types, such as Markdown files, Python files, and CSV files. Giving your project a good structure also makes it easy to version and to upload to collaboration tools like GitHub.

In this article, we will use libraries such as Pandas and scikit-learn. We’ll make heavy use of Pandas DataFrames, which make it easy to read and manipulate tabular data in Python.

Find a good data set

Finding a good data set for a complete project is difficult. The data set needs to be large enough that memory and performance constraints come into play, and it needs to have business value. The college data set used earlier in this series, for example, contains data on admission criteria, graduation rates, and graduates’ future earnings at American universities. That makes it a good data set for telling a story, but on closer inspection it doesn’t have enough nuance to build a complete project around.

For example, you can tell someone that their potential future earnings are higher if they attend certain (good) colleges, but that only takes a quick lookup and comparison, and there isn’t enough room to show off technical skill. You can also show that graduates of colleges with higher admission standards tend to earn higher salaries, but again that is more storytelling than business value.

Memory and performance constraints come into play when you have more than a gigabyte of data, or when you need to predict something nuanced about the data, because you have to run algorithms over the whole data set.

A good data set lets you write a series of scripts that transform the data and answer dynamic questions. Stock prices are a good example: you can use the data to predict the next day’s price movements and feed new data to the algorithm when the market closes. This could help you execute trades and even turn a profit. That isn’t telling a story; it’s generating value directly.

Here are some places to find good data sets:

  • /r/datasets – a subreddit with hundreds of interesting data sets
  • Google Public Datasets – public data sets available on Google BigQuery
  • Awesome Datasets – a list of data sets hosted on GitHub

As you look through these data sets, think about what questions someone might ask if they had them, and then think about whether those questions are one-off (“What was the correlation between the S&P 500 and house prices?”) or ongoing (“Can you predict stock prices?”). The key is to find ongoing questions that require running the code many times with different inputs.

For this article, we chose loan data from Fannie Mae. Fannie Mae is a U.S. government-sponsored enterprise that buys home mortgages from lenders. After buying the mortgages, it packages them into mortgage-backed securities (MBS) and resells them. This helps lenders make more loans and creates more liquidity in the market, which in theory should lead to more home ownership and better loan terms. From the borrower’s point of view, however, little changes.

Fannie Mae discloses two kinds of data: data on the mortgages it acquires, and data on how those mortgages perform over time. In the best case, a borrower takes out a loan and keeps paying it back until the loan is repaid. If a borrower misses several payments, however, the loan can go into foreclosure, in which case the bank takes ownership of the house because the mortgage wasn’t repaid. Fannie Mae records which loans haven’t been paid and which have gone into foreclosure. The data is released quarterly with a one-year lag; as of this writing, the most recent data set is from Q1 2015.

Fannie Mae publishes acquisition data when it buys a mortgage, which contains a lot of information about the borrower, including credit score, and about the loan and the home. It then publishes quarterly performance data covering the borrower’s payments and the status of the loan. A single loan can have many rows in the performance data. One way to think about it is that the acquisition data says that Fannie Mae now controls the loan, and the performance data is a series of status updates on that loan. Some of those status updates may indicate that the loan was foreclosed on in a certain quarter.

(Image: a borrower goes into foreclosure and the house is sold.)

Choose an angle for analysis

For the Fannie Mae data set, there are several possible angles. We could:

  • Try to predict the sale price of a foreclosed home
  • Predict a borrower’s repayment history
  • Calculate the score of a mortgage at the time of acquisition

The important thing is to stick to a single angle. Trying to focus on too many things at once makes it hard to build an effective project. It is also important to choose an angle with enough nuance. Here are some angles without much nuance:

  • Which banks sold Fannie Mae the loans that were foreclosed on most often
  • Trends in borrowers’ credit scores
  • Which types of homes are foreclosed on most often
  • The relationship between loan amounts and foreclosure sale prices

All of these angles are interesting, and would be great if we were focused on storytelling, but they aren’t a great fit for a business-value project.

With the Fannie Mae data set, we’ll try to predict whether a loan will be foreclosed on, using only the data available at acquisition time. In effect, we’ll give each loan a “score” that indicates whether Fannie Mae should buy it. This makes a solid foundation and a great portfolio piece.

Understand the data

Let’s start with a quick look at the raw data file. Here are the first few lines of acquisition data for the first quarter of 2012:

100000853384|R|OTHER|4.625|280000|360|02/2012|04/2012|31|31|1|23|801|N|C|SF|1|I|CA|945||FRM|
100003735682|R|SUNTRUST MORTGAGE INC.|3.99|466000|360|01/2012|03/2012|80|80|2|30|794|N|P|SF|1|P|MD|208||FRM|788
100006367485|C|PHH MORTGAGE CORPORATION|4|229000|360|02/2012|04/2012|67|67|2|36|802|N|R|SF|1|P|CA|959||FRM|794

Here are the first few lines of performance data for the first quarter of 2012:

100000853384|03/01/2012|OTHER|4.625||0|360|359|03/2042|41860|0|N|||||||||||||||
100000853384|04/01/2012||4.625||1|359|358|03/2042|41860|0|N|||||||||||||||
100000853384|05/01/2012||4.625||2|358|357|03/2042|41860|0|N|||||||||||||||

It’s worth taking the time to understand the data before writing any code. This is especially true for business-value projects: because we won’t be exploring the data interactively, it can be hard to catch certain nuances unless we find them up front. In this case, the first step is to go to Fannie Mae’s website and read about the data set:

  • Overview
  • Glossary of useful terms
  • FAQs
  • Columns in the acquisition and performance files
  • Sample acquisition data file
  • Sample performance data file

After reading this material, we know some key information that is useful:

  • From 2000 to the present, there is an acquisition file and a performance file for each quarter. The data lags by a year, so the most recent data is from 2015.
  • The files are plain text, with | as the delimiter.
  • The files have no header row, but we have a list of all the column names.
  • All together, the files contain data on 22 million loans.
  • Because the performance files track loans over time, loans acquired earlier have more performance data (for example, a loan acquired in 2014 won’t have much performance history yet).

This information can save us a lot of time when designing project structures and processing data.

Design the project structure

It is important to structure the project before you start downloading and exploring the data. When building a complete project, our main goals are to:

  • Produce a working solution
  • Have the solution run fast and use minimal resources
  • Make it easy for others to extend the project
  • Make the code easy for others to understand
  • Write as little code as possible

To achieve these goals, we need to design the structure of the project well. A well-structured project follows a few conventions:

  • Data files are separated from source code
  • Raw data is separated from generated data
  • There is a README.md file that explains how to install and use the project
  • There is a requirements.txt file listing all the modules the project needs
  • There is a settings.py file containing any settings shared by the other files
    • For example, if several Python scripts read the same file, it is better for them all to import settings and get the path from that one place
  • There is a .gitignore file to keep very large or private files out of Git
  • The task is broken into steps that live in separate files and can be run individually
    • For example, one file reads the data, one builds features, and one makes predictions
  • Intermediate values are stored; for example, one script may output a file that the next script reads
    • This lets us change one step of the data processing flow without recomputing everything

The file structure of the project is as follows:

loan-prediction
├── data
├── processed
├── .gitignore
├── README.md
├── requirements.txt
├── settings.py

Create the initial files

First, create the loan-prediction folder. Inside it, create data and processed folders. The first stores the raw data, and the second stores all intermediate values.

Next, create the .gitignore file. A .gitignore file tells Git to ignore certain files so they aren’t pushed to GitHub. The .DS_Store files OS X creates in every folder are a good example of files to ignore. To get started with the .gitignore file, see here. We also need to ignore the data files, because they are very large and Fannie Mae’s terms don’t allow us to redistribute them, so add these two lines at the end of the .gitignore file:

data
processed

Here is an example.gitignore file for this project.

Next, create README.md, which helps people understand the project. The .md extension means the file is in Markdown format. Markdown lets you write plain text and add nicer formatting where you want it. Here is a Markdown guide. If you upload a file called README.md to GitHub, GitHub renders it and shows it to visitors as the project’s home page. Here’s an example.

For now, just put a short description in README.md:

Loan Prediction
-----------------------

Predict whether or not loans acquired by Fannie Mae will go into foreclosure.  Fannie Mae acquires loans from other lenders as a way of inducing them to lend more.  Fannie Mae releases data on the loans it has acquired and their performance afterwards [here](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html).

Now, create the requirements.txt file, which helps other people install our project’s dependencies. We don’t know exactly which libraries we’ll need yet, but these are a good starting point:

pandas
matplotlib
scikit-learn
numpy
ipython
scipy

These are some of the most commonly used Python libraries for data analysis, and we will likely use them in this project. Here is a sample requirements.txt file for this project.

After creating requirements.txt, install the modules. This article uses Python 3. If you don’t have Python installed yet, consider using Anaconda, a Python distribution that includes all of the modules listed above.

Finally, create a blank settings.py file, since the project doesn’t have any settings yet.

Get the data

Once the framework for the whole project is in place, we can download the raw data.

Fannie Mae places some restrictions on downloading the data, so you have to sign up for an account first. The download page is here. After registering, you can download the loan data freely. The files come as zip archives and are quite large once decompressed.

For this article, we’ll download all the data from Q1 2012 through Q1 2015. Then unzip the files and delete the original .zip files. Afterwards, the loan-prediction folder should be structured like this:

loan-prediction
├── data
│   ├── Acquisition_2012Q1.txt
│   ├── Acquisition_2012Q2.txt
│   ├── Performance_2012Q1.txt
│   ├── Performance_2012Q2.txt
│   └── ...
├── processed
├── .gitignore
├── README.md
├── requirements.txt
├── settings.py

After downloading the data, you can use shell commands like head and tail to look at the first and last lines of each file. Are there any columns that aren’t needed? It helps to consult the PDF that describes the column names while looking at the data.
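If you’d rather stay in Python, a quick sketch along these lines prints the first few lines of each raw file (it assumes you run it from the loan-prediction folder, with the layout shown above):

import os

# Print the first three lines of each raw file in the data folder,
# a simple Python alternative to the shell's head command.
for name in sorted(os.listdir("data")):
    path = os.path.join("data", name)
    with open(path) as f:
        print(name)
        for _ in range(3):
            print(f.readline().rstrip())
        print()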

Read the data

There are two problems that make working with the data directly difficult:

  • The acquisition and performance data are scattered across many files
  • None of the files have header rows

Before we can process the data, we need to combine all the acquisition data into one file and all the performance data into another. Each file needs to contain only the columns we care about, plus a proper header row. One wrinkle is that the performance data is very large, so we should trim unneeded columns where we can.

The first step is to add some variables to settings.py, including the paths to the raw and intermediate data. We’ll also add some other settings that will be useful later:

DATA_DIR = "data" PROCESSED_DIR = "processed" MINIMUM_TRACKING_QUARTERS = 4 TARGET = "foreclosure_status" NON_PREDICTORS  = [TARGET, "id"] CV_FOLDS = 3Copy the code

Putting the paths in settings.py keeps them all in one place and makes future changes easy. When many files use the same values, it is much easier to define them in one place than to change each file separately. Here is a sample settings.py file for this project.

The second step is to create a file called assemble.py that combines the scattered data into two files. Running python assemble.py will produce two data files in the processed folder.

Now let’s write assemble.py. First, we define the headers for each file type. To do this, we consult the PDF that explains the column names, then create a list of column names for the acquisition files and another for the performance files:

HEADERS = {
    "Acquisition": [
        "id",
        "channel",
        "seller",
        "interest_rate",
        "balance",
        "loan_term",
        "origination_date",
        "first_payment_date",
        "ltv",
        "cltv",
        "borrower_count",
        "dti",
        "borrower_credit_score",
        "first_time_homebuyer",
        "loan_purpose",
        "property_type",
        "unit_count",
        "occupancy_status",
        "property_state",
        "zip",
        "insurance_percentage",
        "product_type",
        "co_borrower_credit_score"
    ],
    "Performance": [
        "id",
        "reporting_period",
        "servicer_name",
        "interest_rate",
        "balance",
        "loan_age",
        "months_to_maturity",
        "maturity_date",
        "msa",
        "delinquency_status",
        "modification_flag",
        "zero_balance_code",
        "zero_balance_date",
        "last_paid_installment_date",
        "foreclosure_date",
        "disposition_date",
        "foreclosure_costs",
        "property_repair_costs",
        "recovery_costs",
        "misc_costs",
        "tax_costs",
        "sale_proceeds",
        "credit_enhancement_proceeds",
        "repurchase_proceeds",
        "other_foreclosure_proceeds",
        "non_interest_bearing_balance",
        "principal_forgiveness_balance"
    ]
}

The next step is to define which columns to keep. Since all we care about from the performance data is whether a loan was ever foreclosed on, we can discard most of its columns (doing so doesn’t affect the foreclosure information). We keep all of the acquisition columns, though, because we want as much information as possible about each loan (after all, we’re predicting foreclosure at the moment the loan is acquired). Discarding columns saves disk space and memory, and also speeds up the code.

SELECT = {
    "Acquisition": HEADERS["Acquisition"],
    "Performance": [
        "id",
        "foreclosure_date"
    ]
}

Next, write a function to concatenate all the data sets. The following code will:

  • Import the required libraries, including settings
  • Define a concatenate function, which will:
    • Get the names of all the files in the data directory
    • Loop over each file
      • If the file doesn’t start with the expected prefix, skip it
      • Read the file into a DataFrame with the Pandas read_csv function
        • Set the delimiter to | so the data is read correctly
        • The data has no header row, so set header to None
        • Use the values in the HEADERS dictionary as the column names of the DataFrame
        • Keep only the columns listed in SELECT
      • Concatenate all the DataFrames together
      • Write the combined DataFrame out to a file
import os
import settings
import pandas as pd

def concatenate(prefix="Acquisition"):
    files = os.listdir(settings.DATA_DIR)
    full = []
    for f in files:
        if not f.startswith(prefix):
            continue

        data = pd.read_csv(os.path.join(settings.DATA_DIR, f), sep="|", header=None, names=HEADERS[prefix], index_col=False)
        data = data[SELECT[prefix]]
        full.append(data)

    full = pd.concat(full, axis=0)

    full.to_csv(os.path.join(settings.PROCESSED_DIR, "{}.txt".format(prefix)), sep="|", header=SELECT[prefix], index=False)

Calling this function with the arguments “Acquisition” and “Performance” concatenates all the acquisition files and all the performance files, respectively. The following code will:

  • Run only when the script is executed from the command line with python assemble.py
  • Concatenate all the files and write the results to two files:
    • processed/Acquisition.txt
    • processed/Performance.txt
if __name__ == "__main__":
    concatenate("Acquisition")
    concatenate("Performance")

We now have a modular assemble.py file that is easy to run and easy to extend. By breaking the big problem into small pieces like this, we make the project much easier to build. Instead of doing everything in one script, we keep the steps in separate files and define the data that passes between them. This is usually a good idea in large projects, because changing one file then won’t have unexpected side effects.

Once the assemble.py script is finished, run python assemble.py. You can find the complete script here.

This creates two files in the processed directory:

loan-prediction
├── data
│   ├── Acquisition_2012Q1.txt
│   ├── Acquisition_2012Q2.txt
│   ├── Performance_2012Q1.txt
│   ├── Performance_2012Q2.txt
│   └── ...
├── processed
│   ├── Acquisition.txt
│   └── Performance.txt
├── .gitignore
├── assemble.py
├── README.md
├── requirements.txt
├── settings.py

Compute values from the performance data

The next step is to compute some values from processed/Performance.txt. We want to predict whether a property will be foreclosed on. To figure that out, we just need to check whether a loan ever has a foreclosure_date in the performance data. If foreclosure_date is always blank, the property was never foreclosed on. We also want to avoid loans with little history in the performance data, which we can do by counting how many performance rows each loan has accumulated.

One way to think about the relationship between the acquisition data and the performance data is this: each row in the acquisition data corresponds to multiple rows in the performance data.

In the performance data, foreclosure_date appears in the quarter when a foreclosure happens, and is blank before then. Some loans are never foreclosed on, so foreclosure_date is blank in all of their performance rows.

We need to compute foreclosure_status, a Boolean indicating whether a given loan id was ever foreclosed on, and performance_count, the number of rows each id has in the performance data.

There are several ways to compute performance_count:

  • Read in all the performance data, then use Pandas’ groupby method to find the number of rows for each loan id, and whether foreclosure_date is ever non-null for that id (a rough sketch of this approach appears after the list)
    • The advantage is that the syntax is simple
    • The downside is that reading all 129,236,094 rows takes a lot of memory and is extremely slow
  • Read in all the performance data, then use apply on the acquisition DataFrame to find the counts for each id
    • The advantage is that it’s conceptually simple
    • The downside is, again, that reading all 129,236,094 rows takes a lot of memory and is extremely slow
  • Iterate over each line of the performance data and keep a dictionary of the counts
    • The advantage is that we don’t need to read all the data into memory at once, so it’s fast and memory-efficient
    • The downside is that it takes a bit longer to conceptualize and implement, and we have to parse each line manually
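For comparison, here is a rough sketch of the first (groupby) approach. It is not what we’ll use, because it reads the entire Performance.txt file into memory at once:

import os
import settings
import pandas as pd

# Sketch of the groupby approach: simple to express, but it loads roughly
# 129 million performance rows into memory at the same time.
performance = pd.read_csv(os.path.join(settings.PROCESSED_DIR, "Performance.txt"), sep="|")
grouped = performance.groupby("id")["foreclosure_date"]
summary = pd.DataFrame({
    "performance_count": grouped.size(),
    "foreclosure_status": grouped.apply(lambda dates: dates.notnull().any()),
})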

Loading all the data at once takes a lot of memory, so we use the third approach. All we need to do is iterate over each row of the performance data while keeping a dictionary of counts per loan id. In that dictionary, we track how many times each id appears in the performance data and whether foreclosure_date is ever non-blank. That gives us both foreclosure_status and performance_count.

Create a new file annotate.py and add the code for the calculation. In the following code, we will:

  • Import the required libraries
  • Define a function called count_performance_rows
    • Open processed/Performance.txt. This doesn’t read the whole file into memory; it just opens a file handle so we can read the contents line by line
    • Loop over each line in the file
      • Split the line on the | delimiter
      • Check whether loan_id is in the counts dictionary
        • If not, add it to counts
      • Add 1 to performance_count for that loan_id
      • If date is not empty, we know the loan was foreclosed on, so set foreclosure_status accordingly
import os
import settings
import pandas as pd

def count_performance_rows():
    counts = {}
    with open(os.path.join(settings.PROCESSED_DIR, "Performance.txt"), 'r') as f:
        for i, line in enumerate(f):
            if i == 0:
                # Skip header row
                continue
            loan_id, date = line.split("|")
            loan_id = int(loan_id)
            if loan_id not in counts:
                counts[loan_id] = {
                    "foreclosure_status": False,
                    "performance_count": 0
                }
            counts[loan_id]["performance_count"] += 1
            if len(date.strip()) > 0:
                counts[loan_id]["foreclosure_status"] = True
    return counts

Get the values

Once we’ve built counts, we can use a function that takes a loan_id and a key and looks up the corresponding value:

def get_performance_summary_value(loan_id, key, counts):
    value = counts.get(loan_id, {
        "foreclosure_status": False,
        "performance_count": 0
    })
    return value[key]

The function above pulls values out of the counts dictionary and lets us assign a foreclosure_status and a performance_count to each row of the acquisition data. The dictionary’s get method returns a default value when a key isn’t found, so loans missing from the performance data still get sensible defaults.

Annotate the data

Now that we’ve added some functions to annotate.py, we can start on the most valuable part: converting the acquisition data into a training set that a machine learning algorithm can use. We need to do the following:

  • Convert all columns to numeric values
  • Fill in any missing values
  • Add a performance_count and a foreclosure_status to each row
  • Remove rows with very little performance history (a low performance_count)

Several of the columns are text, which machine learning algorithms can’t use directly. However, they are really categorical variables, with category codes such as R and S, and we can convert them to numbers by assigning a number to each category. Once converted, they can be used for machine learning.

Some columns also contain dates (first_payment_date and origination_date). We can split each of these into two numeric columns, a month and a year.
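Here is a tiny illustration of both conversions on made-up values (the real versions operate on the acquisition DataFrame in the function below):

import pandas as pd

# Toy example: encode a text category column as numbers, and split an "MM/YYYY"
# date column into separate month and year columns.
toy = pd.DataFrame({
    "channel": ["R", "C", "R"],
    "origination_date": ["02/2012", "01/2012", "04/2013"],
})
toy["channel"] = toy["channel"].astype('category').cat.codes  # C becomes 0, R becomes 1
toy["origination_month"] = pd.to_numeric(toy["origination_date"].str.split('/').str.get(0))
toy["origination_year"] = pd.to_numeric(toy["origination_date"].str.split('/').str.get(1))
del toy["origination_date"]
print(toy)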

In the following code, we convert the acquisition data by defining a function that will:

  • Look up values in the counts dictionary and create a foreclosure_status column in the acquisition data
  • Look up values in the counts dictionary and create a performance_count column in the acquisition data
  • Convert the following columns from text to numbers:
    • channel
    • seller
    • first_time_homebuyer
    • loan_purpose
    • property_type
    • occupancy_status
    • property_state
    • product_type
  • Split first_payment_date and origination_date into two columns each:
    • Split on the / character
    • Assign the first part to a month column
    • Assign the second part to a year column
    • Delete the original column
    • In the end, we have first_payment_month, first_payment_year, origination_month, and origination_year
  • Replace all missing values in the acquisition data with -1
  • Keep only rows whose performance_count is greater than MINIMUM_TRACKING_QUARTERS
def annotate(acquisition, counts):
    acquisition["foreclosure_status"] = acquisition["id"].apply(lambda x: get_performance_summary_value(x, "foreclosure_status", counts))
    acquisition["performance_count"] = acquisition["id"].apply(lambda x: get_performance_summary_value(x, "performance_count", counts))
    for column in [
        "channel",
        "seller",
        "first_time_homebuyer",
        "loan_purpose",
        "property_type",
        "occupancy_status",
        "property_state",
        "product_type"
    ]:
        acquisition[column] = acquisition[column].astype('category').cat.codes

    for start in ["first_payment", "origination"]:
        column = "{}_date".format(start)
        acquisition["{}_year".format(start)] = pd.to_numeric(acquisition[column].str.split('/').str.get(1))
        acquisition["{}_month".format(start)] = pd.to_numeric(acquisition[column].str.split('/').str.get(0))
        del acquisition[column]

    acquisition = acquisition.fillna(-1)
    acquisition = acquisition[acquisition["performance_count"] > settings.MINIMUM_TRACKING_QUARTERS]
    return acquisition

Stitch everything together

We’re almost ready to stitch everything together; we just need to add a little more code to annotate.py first. In the following code, we:

  • Define a function to read the acquisition data
  • Define a function to write the processed data to processed/train.csv
  • If the script is run from the command line with python annotate.py:
    • Read the acquisition data
    • Compute the counts from the performance data and assign the result to counts
    • Annotate the acquisition DataFrame
    • Write the annotated DataFrame to train.csv
def read():
    acquisition = pd.read_csv(os.path.join(settings.PROCESSED_DIR, "Acquisition.txt"), sep="|")
    return acquisition

def write(acquisition):
    acquisition.to_csv(os.path.join(settings.PROCESSED_DIR, "train.csv"), index=False)

if __name__ == "__main__":
    acquisition = read()
    counts = count_performance_rows()
    acquisition = annotate(acquisition, counts)
    write(acquisition)

Once the file is written, remember to run it with python annotate.py, which generates the train.csv file. The complete annotate.py file is available here.

The folder should now look like this:

loan-prediction
├── data
│   ├── Acquisition_2012Q1.txt
│   ├── Acquisition_2012Q2.txt
│   ├── Performance_2012Q1.txt
│   ├── Performance_2012Q2.txt
│   └── ...
├── processed
│   ├── Acquisition.txt
│   ├── Performance.txt
│   └── train.csv
├── .gitignore
├── annotate.py
├── assemble.py
├── README.md
├── requirements.txt
├── settings.py

Find an error metric

Now that we’ve generated the training data, only the last step remains: generating predictions. We need to choose an error metric and decide how to evaluate the model. In this data, far more loans are never foreclosed on than are foreclosed on, so typical accuracy measures don’t apply well.

If we read the training data and count the values in the foreclosure_status column, we see:

import os
import pandas as pd
import settings

train = pd.read_csv(os.path.join(settings.PROCESSED_DIR, "train.csv"))
train["foreclosure_status"].value_counts()

False    4635982
True        1585
Name: foreclosure_status, dtype: int64

Because so few loans are foreclosed on, just checking what percentage of labels are predicted correctly would give very high accuracy even for a model that always predicts False. Instead, we want a metric that accounts for this imbalance. We don’t want too many false positives, where we predict a loan will be foreclosed on but it isn’t, or too many false negatives, where we predict a loan won’t be foreclosed on but it is. Of the two, false negatives are more costly for Fannie Mae, because it buys those loans and then can’t recoup its investment.

We’ll define the false negative rate as the number of loans the model predicted would not be foreclosed on but actually were, divided by the total number of loans that actually were foreclosed on. This is the percentage of actual foreclosures that the model misses. Here’s a diagram:

In the diagram above, 1 loan is predicted not to be foreclosed on but actually is. Dividing that by the 2 loans that actually are foreclosed on gives a false negative rate of 50%. We’ll use this as our error metric so we can evaluate the model’s performance effectively.
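As a small sanity check, here is the same 50% example computed directly on toy values:

# Toy example: 2 loans actually go into foreclosure, and the model misses 1 of
# them, so the false negative rate is 1 / 2 = 0.5.
actual = [True, True, False, False, False]
predicted = [True, False, False, True, False]

false_negatives = sum(1 for a, p in zip(actual, predicted) if a and not p)
actual_foreclosures = sum(actual)
print(false_negatives / actual_foreclosures)  # 0.5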

Set up the machine learning classifier

We’ll use cross-validation to make predictions. With cross-validation, we split the data into 3 groups and then:

  • Train the model on groups 1 and 2, then make predictions on group 3
  • Train the model on groups 1 and 3, then make predictions on group 2
  • Train the model on groups 2 and 3, then make predictions on group 1

Splitting the data into groups this way means we never train a model on the same data we make predictions on, which avoids overfitting. If we overfit, we would get a misleadingly low false negative rate, which would make the model hard to improve or to use in the real world.

Scikit-learn has a function called cross_val_predict that makes cross-validation easy.

We also need to pick an algorithm to make the predictions. We need a classifier that does binary classification, because the target variable foreclosure_status has only two values, True and False.

We’ll use logistic regression, because it performs well on binary classification, runs extremely fast, and uses very little memory. This is due to how the algorithm works: it doesn’t build a bunch of decision trees like a random forest, or perform resource-intensive transformations like a support vector machine, so it involves relatively few matrix operations.

We can use the logistic regression classifier implemented in scikit-learn. The one thing to watch out for is the weight given to each class. If we weight the classes equally, the algorithm will predict False for every row, because that minimizes the error. However, we care much more about the loans that are foreclosed on than the ones that aren’t. So we pass balanced to the class_weight keyword argument of the LogisticRegression class, which weights the classes in inverse proportion to how often they occur in the data. This keeps the algorithm from predicting False for every row.

Make predictions

With the preparatory work done, we can start making predictions. Create a new file called predict.py that uses the train.csv file we created earlier. The following code will:

  • Import the required libraries
  • Create a cross_validate function that will:
    • Create a logistic regression classifier with the right keyword arguments
    • Create a list of columns to train on, excluding the id and foreclosure_status columns
    • Run cross-validation on the train DataFrame
    • Return the predictions
import os
import settings
import pandas as pd
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

def cross_validate(train):
    clf = LogisticRegression(random_state=1, class_weight="balanced")

    predictors = train.columns.tolist()
    predictors = [p for p in predictors if p not in settings.NON_PREDICTORS]

    predictions = cross_validation.cross_val_predict(clf, train[predictors], train[settings.TARGET], cv=settings.CV_FOLDS)
    return predictions

Compute the prediction error

Now we just have to write some functions to calculate the error. The following code will:

  • Create a compute_error function that will:
    • Use scikit-learn to compute a simple accuracy score (the percentage of predictions that match the actual foreclosure_status values)
  • Create a compute_false_negatives function that will:
    • Combine the targets and predictions into a DataFrame
    • Compute the false negative rate
  • Create a compute_false_positives function that will:
    • Combine the targets and predictions into a DataFrame
    • Compute the false positive rate
      • Find the number of loans the model predicted would be foreclosed on that actually weren’t
      • Divide that by the total number of loans that weren’t foreclosed on
def compute_error(target, predictions):
    return metrics.accuracy_score(target, predictions)

def compute_false_negatives(target, predictions):
    df = pd.DataFrame({"target": target, "predictions": predictions})
    return df[(df["target"] == 1) & (df["predictions"] == 0)].shape[0] / (df[(df["target"] == 1)].shape[0] + 1)

def compute_false_positives(target, predictions):
    df = pd.DataFrame({"target": target, "predictions": predictions})
    return df[(df["target"] == 0) & (df["predictions"] == 1)].shape[0] / (df[(df["target"] == 0)].shape[0] + 1)

Put all the functions together

Now, put all of the above functions inside predict.py. The following code will:

  • Read the data set
  • Compute the cross-validated predictions
  • Compute the three error metrics described above
  • Print the error metrics
def read():
    train = pd.read_csv(os.path.join(settings.PROCESSED_DIR, "train.csv"))
    return train

if __name__ == "__main__":
    train = read()
    predictions = cross_validate(train)
    error = compute_error(train[settings.TARGET], predictions)
    fn = compute_false_negatives(train[settings.TARGET], predictions)
    fp = compute_false_positives(train[settings.TARGET], predictions)
    print("Accuracy Score: {}".format(error))
    print("False Negatives: {}".format(fn))
    print("False Positives: {}".format(fp))

Once this code is added, you can run python predict.py to generate the predictions. The false negative rate turns out to be .26, which means that of the foreclosed loans, we mispredict about 26% of them. It’s a good start, but there is plenty of room for improvement.

The full predict.py file is here.

The file tree should now look like this:

loan-prediction
├── data
│   ├── Acquisition_2012Q1.txt
│   ├── Acquisition_2012Q2.txt
│   ├── Performance_2012Q1.txt
│   ├── Performance_2012Q2.txt
│   └── ...
├── processed
│   ├── Acquisition.txt
│   ├── Performance.txt
│   └── train.csv
├── .gitignore
├── annotate.py
├── assemble.py
├── predict.py
├── README.md
├── requirements.txt
├── settings.py

Write the README

Now that the project is complete, we just need to write the README.md file to summarize it and explain to others what we did and how to replicate it. A typical README.md includes the following sections (a rough skeleton follows the list):

  • An overview of the project and its goals
  • How to download the required data or materials
  • Installation instructions
    • How to install the required modules
  • Usage instructions
    • How to run the project
    • What results you should see at each step
  • How to contribute
    • How to extend the project
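As a rough sketch (the section names are only suggestions, and the linked example may differ), the finished README.md could be organized like this:

Loan Prediction
-----------------------

Predict whether or not loans acquired by Fannie Mae will go into foreclosure.

Installation
-----------------------

* Clone this repository.
* Install the requirements with `pip install -r requirements.txt`.
* Create an account with Fannie Mae and download the loan data into the `data` folder.

Usage
-----------------------

* Run `python assemble.py` to combine the raw files into `processed/Acquisition.txt` and `processed/Performance.txt`.
* Run `python annotate.py` to generate the training data in `processed/train.csv`.
* Run `python predict.py` to train the model and print the accuracy and error rates.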

Here is an example README.md for this project.

Next steps

Congratulations, you’ve finished a complete machine learning project! You can find the full example project here. When you’re done with your own project, be sure to upload it to GitHub so others can see it as part of your portfolio.

There is still plenty of room to dig further into the data. Roughly speaking, the ideas fall into three categories: extending the project to improve accuracy, predicting other columns, and exploring the data further. Here are some ideas for reference:

  • Generate more features in annotate.py
  • Swap in a different algorithm in predict.py
  • Use more of the data from Fannie Mae
  • Add a way to make predictions on future data. The code we wrote will still work if we add more data, so we can add more past or future data
  • Try to predict whether a bank should have issued the loan in the first place (and whether Fannie Mae should have acquired it)
    • Remove any columns the bank wouldn’t have had access to at the time it issued the loan
      • Some columns are known when Fannie Mae buys the loan, but weren’t available earlier
    • Make predictions
  • Explore whether you can predict columns other than foreclosure_status
    • Can you predict how much a property will sell for when it is sold?
  • Explore the nuances of how the performance data is updated
    • Can you predict how many times a borrower will be late on payments?
    • Can you map out a typical loan lifecycle?
  • Break the data down by state or zip code
    • Do you see any interesting patterns?

If you’ve created something interesting, let us know in the comments section!

Other translations in this series:

  1. Building a Data Science Portfolio: Telling a Story with Data
  2. Building a Data Science Portfolio: Building a Data Science Blog