The road of machine learning is long, and many students find themselves swept up in a flood of material. How do you get started more effectively? The author of this article uses Python as the tool for machine learning and walks through the Titanic project from the Kaggle competition in detail.

From freeCodeCamp, by Tirmidzi Faizal Aflahi. Compiled by Heart of the Machine, with participation from Li Shimeng and Du Wei.

With the rise of machine learning in industry, tools that let you iterate through the process quickly have become critical. Python, a rising star of machine learning technology, is often the first choice to get you there. A guide to implementing machine learning in Python is therefore essential.

Introduction to machine learning in Python

So why Python? In my experience, Python is one of the easiest programming languages to learn. The whole process needs to be iterated quickly, and data scientists don’t need a deep understanding of the language, because they can pick it up very fast.

How easy is it?

for anything in the_list:
    print(anything)

It’s that easy. Python syntax is closely related to English (or human language, not machine language) syntax. There are no stupid braces in Python syntax. I have a colleague who works on Quality Assurance and, while not a software engineer, can write production-grade Python code in a day. (Really!)

I’ll introduce several Python libraries below. As data analysts and data scientists, we can build on their work to get things done. These incredible libraries are essential tools for machine learning in Python.

NumPy

This is a very well-known data analysis library. NumPy can help you with everything from calculating the median of a data distribution to manipulating multidimensional arrays.
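Here is a quick, minimal sketch of both of those tasks (the numbers are made up for illustration):

import numpy as np

ages = np.array([22, 38, 26, 35, 35])   # a small made-up sample
print(np.median(ages))                   # median of the distribution: 35.0

matrix = np.arange(12).reshape(3, 4)     # a 3x4 multidimensional array
print(matrix.mean(axis=0))               # column-wise means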

Pandas

Pandas is used to process CSV files and other tabular data, giving you tables and statistics to look at.
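For instance, a minimal sketch (the file name is just a placeholder; the real Titanic file is loaded later in this tutorial):

import pandas as pd

df = pd.read_csv("some_file.csv")   # load a CSV file into a DataFrame
print(df.head())                    # the first few rows, as a table
print(df.describe())                # summary statistics for the numeric columns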

Matplotlib

After storing the data in a Pandas DataFrame, you may need to do some visualization to understand more about the data. After all, a picture is worth a thousand words.
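A minimal sketch, assuming a DataFrame df with a numeric 'Age' column:

import matplotlib.pyplot as plt

df["Age"].hist(bins=20)   # histogram of a numeric column
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()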

Seaborn

This is another visualization tool, but this one focuses more on visualization of statistical results, such as histograms, pie charts, graphs, or correlation tables.

Scikit-Learn

This is the ultimate tool for machine learning in Python. Machine learning in Python pretty much revolves around scikit-learn. Everything you need, from algorithms to improvements, is here.

TensorFlow and PyTorch

I won’t say much about these two tools. But if you’re interested in deep learning, they’re worth your time. (I’ll write another tutorial on deep learning next time, so stay tuned!)

Python machine learning project

Of course, reading and studying alone won’t get you there; you need hands-on practice. As I said on my blog, there is no point in learning these tools if you don’t dig into the data. Here is a place where you can easily find Python machine learning projects.

Thedatamage.com/

Kaggle is a platform where you can dig into data directly. You can work through projects on the platform until you are genuinely good at machine learning. You might be more interested in something else: a machine learning competition run by Kaggle that offers $100,000 in prize money. You might want to take your chances, haha.

Kaggle: www.kaggle.com/

But the most important thing isn’t the money: this is where you can actually find machine learning projects in Python. There are many projects you can try to complete. If you are a beginner, though, you might want to enter this competition.

We will use an example project in a later tutorial:

Titanic: Machine learning from disaster (www.kaggle.com/c/titanic)

This is the famous Titanic. The disaster took place in 1912, involving 2,224 passengers and crew and killing 1,502 of them. This Kaggle competition (or rather, tutorial) provides real data from the disaster. Your task is to explore the data and predict who survived and who didn’t.

Machine learning tutorial in Python

Before we can delve into the Titanic data, we need to install some necessary tools.

The first, of course, is Python itself. If this is your first time, install it from the official website. You will need version 3.6 or later to keep up with the latest versions of the libraries.

Python’s official website: www.python.org/downloads/

All of the libraries can then be installed with pip, Python’s package manager. The Python distribution you just downloaded installs pip automatically.

Any other tools you need can also be installed with pip. Open a terminal, command prompt, or PowerShell and run the following commands:

pip install numpy
pip install pandas
pip install matplotlib
pip install seaborn
pip install scikit-learn
pip install jupyter

Everything seems to be working fine. But wait, what is Jupyter? Jupyter stands for Julia, Python and R, so it is actually Jupytr. But the word looked so strange that they changed it to Jupyter. This is a well-known notebook where you can write interactive Python code.

Just type jupyter notebook in the terminal and a browser page like the one below will open:

You can write code in the green rectangles and interactively write and evaluate Python code.

Now you have all the tools installed. Let’s get started!

Data exploration

Exploring the data is the first step. You will need to download the data from Kaggle’s Titanic page and then put the downloaded data into the folder where you started Jupyter notebook.

The data download address: www.kaggle.com/c/titanic/d…

Then import the necessary libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

Loading data:

train_df = pd.read_csv("train.csv")
train_df.head()

The output is as follows:

So that’s our data. It has the following columns:

  • PassengerId, the passenger’s identifier;

  • Survived, whether he or she survived;

  • Pclass, ticket class: 1 is first class, 2 is second class, 3 is third class;

  • Name, the passenger’s name;

  • Sex, the passenger’s sex;

  • Age, the passenger’s age;

  • SibSp, Siblings or Spouses, the number of siblings and spouses on board;

  • Parch, Parents or Children, the number of parents and children on board;

  • Ticket, the ticket number;

  • Cabin, the cabin number (often NaN, i.e. missing);

  • Embarked, the port of embarkation: S is Southampton, Q is Queenstown, and C is Cherbourg.

When exploring data, you often run into the problem of missing values. Let’s take a look:

def missingdata(data):
    total = data.isnull().sum().sort_values(ascending=False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending=False)
    ms = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    ms = ms[ms["Percent"] > 0]
    f, ax = plt.subplots(figsize=(8, 6))
    plt.xticks(rotation='90')
    fig = sns.barplot(x=ms.index, y=ms["Percent"], color="green", alpha=0.8)
    plt.xlabel('Features', fontsize=15)
    plt.ylabel('Percent of missing values', fontsize=15)
    plt.title('Percent missing data by feature', fontsize=15)
    return ms

missingdata(train_df)

We see results like this:

The cabin number, age, and port of embarkation all have missing values, and the cabin number column is missing a great deal of data. We need to deal with them, a step known as data cleaning.

Data cleaning

We spend 90% of our time on this. We do a lot of data cleaning in every machine learning project. Once the data is clean, we can easily move on to the next step with nothing to worry about.

The most common technique in data cleaning is filling in missing data. You can fill missing values with the mode, the mean, or the median. There is no absolute rule for choosing among them; you can try each and see how it performs. As a rule of thumb, however, only the mode can be used for categorical data, while the median or mean can be used for continuous data. So we fill the embarkation data with the mode and the age data with the median.

train_df['Embarked'].fillna(train_df['Embarked'].mode()[0], inplace=True)
train_df['Age'].fillna(train_df['Age'].median(), inplace=True)

The next important operation is dropping data, especially columns with a large amount of missing data. We handle the cabin number column as follows:

drop_column = ['Cabin']
train_df.drop(drop_column, axis=1, inplace=True)

Now examine the cleaned data.

print('check the nan value in train data')
print(train_df.isnull().sum())

Perfect! There’s no missing data! This means that the data has been cleaned up.

Feature engineering

Now that the data is clean, we can move on to feature engineering.

Feature engineering is basically the technique of deriving new features from the data you already have. There are several ways to do it, and in many cases it comes down to common sense.

Take the embarkation data: it is filled with Q, S, or C. The machine learning libraries can’t handle these values, because they can only handle numbers. So you deal with it using something called One Hot Vectorization, which turns one column into three: Embarked_Q, Embarked_S, and Embarked_C, each filled with 0 or 1 to indicate whether the passenger embarked at that port.
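A minimal sketch of what that looks like with pandas’ get_dummies (the same function the full feature-engineering code below relies on); the tiny DataFrame here is purely illustrative:

import pandas as pd

sample = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})   # an illustrative column, not the real data
print(pd.get_dummies(sample, columns=["Embarked"], dtype=int))
# Each row now has Embarked_C, Embarked_Q, and Embarked_S columns filled with 0 or 1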

Take SibSp and Parch again. Neither column is interesting on its own, but you might wonder how many family members a given passenger had on board. Having a large family might increase your chances of survival, since family members can help each other. On the other hand, passengers who boarded alone might have struggled to survive.

So you can create a new column for the number of family members: FamilySize = SibSp + Parch + 1 (the passenger themselves).

The last example is binning. Because it is hard to tell apart values that are very close together, binning creates ranges of values and groups several values into one. For example, is there a significant difference between a five-year-old and a six-year-old passenger? Or between a 45-year-old and a 46-year-old?

That is why we create bins. For age, we could create four groups: children (0 to 14), teenagers (14 to 20), adults (20 to 40), and elders (40 and up).
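A minimal sketch of that age binning with pandas (the same pd.cut call appears in the full code below); the ages are made up:

import pandas as pd

ages = pd.Series([5, 16, 30, 70])
age_bins = pd.cut(ages, bins=[0, 14, 20, 40, 120],
                  labels=['Children', 'Teenage', 'Adult', 'Elder'])
print(age_bins)   # 5 -> Children, 16 -> Teenage, 30 -> Adult, 70 -> Elder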

The code is as follows:

all_data = [train_df]   # wrap the DataFrame in a list so the loops below iterate over DataFrames

# Create a new feature FamilySize
for dataset in all_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

import re

# Define function to extract titles from passenger names
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Create a new feature Title, containing the titles of passenger names
for dataset in all_data:
    dataset['Title'] = dataset['Name'].apply(get_title)

# Group all non-common titles into one single grouping "Rare"
for dataset in all_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                                 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

# Bin Age and Fare into ranges
for dataset in all_data:
    dataset['Age_bin'] = pd.cut(dataset['Age'], bins=[0, 14, 20, 40, 120],
                                labels=['Children', 'Teenage', 'Adult', 'Elder'])

for dataset in all_data:
    dataset['Fare_bin'] = pd.cut(dataset['Fare'], bins=[0, 7.91, 14.45, 31, 120],
                                 labels=['Low_fare', 'median_fare', 'Average_fare', 'high_fare'])

# Drop the columns we no longer need
traindf = train_df
for dataset in [traindf]:
    drop_column = ['Age', 'Fare', 'Name', 'Ticket']
    dataset.drop(drop_column, axis=1, inplace=True)

drop_column = ['PassengerId']
traindf.drop(drop_column, axis=1, inplace=True)

# One Hot Vectorization of the categorical features
traindf = pd.get_dummies(traindf, columns=["Sex", "Title", "Age_bin", "Embarked", "Fare_bin"],
                         prefix=["Sex", "Title", "Age_type", "Em_type", "Fare_type"])

You have now created all the features. Now let’s look at the correlations between them:

sns.heatmap(traindf.corr(), annot=True, cmap='RdYlGn', linewidths=0.2)  # data.corr() --> correlation matrix
fig = plt.gcf()
fig.set_size_inches(20, 12)
plt.show()

A correlation value close to 1 means a high positive correlation and -1 means a high negative correlation. For example, being male and being female are negatively correlated, since each passenger is identified as one sex or the other. Beyond that, you can see that no two features are highly correlated except the ones created through feature engineering. This suggests we did the right thing.

What if some features were highly correlated? We could delete one of the columns, since the redundant column would give the model no new information; the two would be essentially the same.
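That is not the case here, but as a sketch, this is one way you could check for such pairs and drop a redundant column (the 0.9 threshold and the column name 'Some_feature' are illustrative):

corr = traindf.corr().abs()
high_pairs = corr.where((corr > 0.9) & (corr < 1.0)).stack()   # off-diagonal correlations above the threshold
print(high_pairs)                                              # highly correlated feature pairs, if any
# traindf = traindf.drop('Some_feature', axis=1)               # drop one column of such a pair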

Machine learning in Python

We have now reached the climax of this tutorial, machine learning modeling.

from sklearn.model_selection import train_test_split   # for splitting the data
from sklearn.metrics import accuracy_score              # for accuracy score
from sklearn.model_selection import KFold               # for K-fold cross validation
from sklearn.model_selection import cross_val_score     # score evaluation
from sklearn.model_selection import cross_val_predict   # prediction
from sklearn.metrics import confusion_matrix            # for confusion matrix

all_features = traindf.drop("Survived", axis=1)
Targeted_feature = traindf["Survived"]

X_train, X_test, y_train, y_test = train_test_split(all_features, Targeted_feature,
                                                    test_size=0.3, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

The scikit-learn library has a variety of algorithms for you to choose from:

  • Logistic regression

  • Random forests

  • Support vector machine

  • K nearest neighbor

  • Naive Bayes

  • Decision tree

  • AdaBoost

  • LDA

  • Gradient boosting

You may feel overwhelmed and want to figure out what is what. Don’t worry, just treat each one as a black box: choose the one that performs best. (I’ll write a full article later on how to choose between these algorithms.)

Take my favorite random forest algorithm:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(criterion='gini', n_estimators=700,
                               min_samples_split=10, min_samples_leaf=1,
                               max_features='auto', oob_score=True,
                               random_state=1, n_jobs=-1)
model.fit(X_train, y_train)
prediction_rm = model.predict(X_test)

print('--------------The Accuracy of the model----------------------------')
print('The accuracy of the Random Forest Classifier is',
      round(accuracy_score(prediction_rm, y_test)*100, 2))

kfold = KFold(n_splits=10, shuffle=True, random_state=22)  # k=10, split the data into 10 equal parts
result_rm = cross_val_score(model, all_features, Targeted_feature, cv=10, scoring='accuracy')
print('The cross validated Score for Random Forest Classifier is:',
      round(result_rm.mean()*100, 2))

y_pred = cross_val_predict(model, all_features, Targeted_feature, cv=10)
sns.heatmap(confusion_matrix(Targeted_feature, y_pred), annot=True, fmt='3.0f', cmap="summer")
plt.title('Confusion_matrix', y=1.05, size=15)

Wow! The accuracy rate is 83%. That’s a pretty good result for a first try.

The cross validated score comes from K-fold validation. With K=10, the data is split into 10 folds, the model is scored on each of them, and the average of all the scores is taken as the final score.

Fine-tuning

You have now completed the steps for implementing machine learning in Python. But there is one more step that can get you better results: fine-tuning. Fine-tuning means finding the best parameters for the machine learning algorithm. Take the random forest code above as an example:

model = RandomForestClassifier(criterion='gini', n_estimators=700,
                               min_samples_split=10, min_samples_leaf=1,
                               max_features='auto', oob_score=True,
                               random_state=1, n_jobs=-1)

You need to set quite a few parameters. These, by the way, are the values used above; you can change them as needed. But of course, trying them out by hand takes a lot of time.

Don’t worry — there’s a tool called Grid Search that automatically finds the best parameters. Sounds good, right?

from sklearn.model_selection import GridSearchCV

# Tuning the model: search over the number of trees
n_estim = range(100, 1000, 100)

## Search grid for optimal parameters
param_grid = {"n_estimators": n_estim}

model_rf = GridSearchCV(model, param_grid=param_grid, cv=5,
                        scoring="accuracy", n_jobs=4, verbose=1)
model_rf.fit(X_train, y_train)

# Best score
print(model_rf.best_score_)

# Best estimator
model_rf.best_estimator_

Well, you can try it out for yourself and enjoy machine learning.

Conclusion

How’s that? Machine learning doesn’t seem difficult, does it? Machine learning in Python is simple. Everything is ready for you. You can do amazing things and make people happy.

The original link: medium.freecodecamp.org/how-to-get-…