Public account: You and the Cabin | Author: Peter | Editor: Peter

Hello, I’m Peter

The Titanic dataset is a classic data-mining dataset. This article walks through a case study from the top of the Kaggle leaderboard. Original notebook address:

www.kaggle.com/startupsci/…

Ranking

Take a look at how this case ranks:

There is little difference between first and second place, but the second-place notebook has far more comments than the first, so this time let's study the second-place approach.

Having read through the source notebook as a whole: the early field processing is very careful and comprehensive, while the modeling stage is somewhat more superficial.

Data exploration

Import libraries

Three types of libraries are required across the whole workflow:

  • Data processing
  • Visualization
  • Modeling
# Data processing
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# model
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
Copy the code

Import data

Check the size of the data after importing it
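
The loading code appears as a screenshot in the original; a minimal sketch, assuming the standard Kaggle files train.csv and test.csv:

# Read the data files (file names assumed)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
combine = [train, test]  # so both sets can be processed in one loop

train.shape, test.shape
# ((891, 12), (418, 11))
Copy the code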

Field information

View all fields:

train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
Copy the code

Here’s what the fields mean:

  • PassengerId: passenger ID
  • Survived: survived or not (0 = no, 1 = yes)
  • Pclass: cabin class (1 = first class, 2 = second class, 3 = third class)
  • Name: name
  • Sex: gender
  • Age: age
  • SibSp: number of siblings/spouses aboard
  • Parch: number of parents/children aboard
  • Ticket: ticket number
  • Fare: ticket fare
  • Cabin: cabin number
  • Embarked: port of embarkation

Field classification

There are two types of data in this case:

  • Categorical: Survived, Sex, and Embarked; Ordinal: Pclass
  • Continuous: Age, Fare; Discrete: SibSp, Parch

Missing values

Check the missing values of the training set and test set:

You can also use the info function to query basic information about the data:
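
Both checks were shown as screenshots; a sketch of the standard calls:

# Missing values in each field
train.isnull().sum()
test.isnull().sum()

# Basic information: field types and non-null counts
train.info()
Copy the code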

Data assumptions

Based on the basic information of the data and common sense, the author lays out several hypotheses, and the following directions for data processing and analysis:

Fields to drop

  • This project mainly investigates the relationship between the other fields and the Survived field
  • Focus on the fields Age and Embarked
  • Consider dropping Ticket, Cabin, and PassengerId, which contribute little to the analysis

Modify and add fields

  • Add FamilySize: based on Parch (number of parents/children aboard) and SibSp (number of siblings/spouses aboard)
  • Extract Title from the Name field as a new feature
  • Turn the Age field into an ordered categorical feature
  • Create a feature based on Fare ranges

Guesses

  • Women are more likely to survive
  • Children (Age below some threshold) are more likely to survive
  • Passengers in a higher cabin class (Pclass = 1) are more likely to survive

Statistical analysis

The categorical field Sex, the ordinal field Pclass, and the discrete fields SibSp and Parch are analyzed to verify our guesses.
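
The pivot tables behind the four conclusions below appeared as images in the original; a sketch of the groupby pattern, one line per field:

# Mean survival rate per value of each field
train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
train[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
train[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
train[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Copy the code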

1. Cabin class (1 = first class, 2 = second class, 3 = third class)

Conclusion: first-class passengers are more likely to survive

2. Gender

Conclusion: women are more likely to survive

3. Number of siblings/spouses

Conclusion: Passengers with fewer siblings or spouses are more likely to survive

4. Number of parents/children

Conclusion: passengers travelling with parents or children were more likely to survive, peaking at Parch = 3

Visual analysis

Age and Survival

g = sns.FacetGrid(train, col="Survived")
g.map(plt.hist, 'Age', bins=20)

plt.show()
Copy the code

  1. Of those who did not survive, most were aged 15-25 (left panel)
  2. Survivor ages range up to 80, and children under 4 had a high survival rate (right panel)
  3. Most passengers were between 15 and 35 years old (both panels)

Cabin class and survival

grid = sns.FacetGrid(
    train,
    col="Survived",
    row="Pclass",
    size=2.2,
    aspect=1.6
    )

grid.map(plt.hist,"Age",alpha=0.5,bins=20)
grid.add_legend()
plt.show()
Copy the code

  • Pclass = 3 had the most passengers, but many of them did not survive
  • Pclass = 1 passengers had the highest survival rate

The relationship between port of embarkation, sex, and survival

grid = sns.FacetGrid(train,
                     row="Embarked",
                     size=2.2,
                     aspect=1.6)
grid.map(sns.pointplot,
         "Pclass"."Survived"."Sex",
         palette="deep")

grid.add_legend()

plt.show()
Copy the code

  1. Women survived at a higher rate than men
  2. With Embarked = C, men were more likely to survive
  3. For Pclass = 3, men who embarked at C had a better survival rate than those who embarked at Q

Fare, port of embarkation, and survival

grid = sns.FacetGrid(train, 
                     row='Embarked', 
                     col='Survived', 
                     size=2.2, aspect=1.6)

grid.map(sns.barplot, 
         'Sex', 'Fare', 
         alpha=.5, ci=None)

grid.add_legend()

plt.show()
Copy the code

  • The higher the fare, the better the survival (the two right panels)
  • Survival rate is related to the port of embarkation; the panels for Embarked = C are the closest in value

The analysis above relied on simple statistics and visualization; the process below applies various machine-learning models, after a good deal of preprocessing and feature-engineering work.

Delete invalid fields

Ticket and Cabin are of little use for our analysis, so we can drop them directly; a sketch of this step follows:
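
This step appeared as a screenshot; a sketch using the same drop pattern as elsewhere in the notebook:

train = train.drop(['Ticket', 'Cabin'], axis=1)
test = test.drop(['Ticket', 'Cabin'], axis=1)
combine = [train, test]
Copy the code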

Generate new features

New features are generated mainly by finding relationships among existing attributes, or by transforming an existing attribute.

Field Name processing

From the Name field, extract the title (e.g. Lady, Dr, Miss) and check whether the title is related to survival:

# Extract the title with a regular expression
for dataset in combine:
    dataset["Title"] = dataset.Name.str.extract('([A-Za-z]+)\.', expand=False)

# Count men and women under each Title
train.groupby(["Sex", "Title"]).size().reset_index()
Copy the code

The same count in crosstab form:

# crosstab form
pd.crosstab(train['Title'], train['Sex'])
Copy the code

The extracted titles are then consolidated: common titles are kept, and uncommon ones are grouped under Rare:

for dataset in combine:
    dataset["Title"] = dataset["Title"].replace(['Lady', 'Countess', 'Capt', 'Col',
        'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')


# Mean survival rate per Title
train[["Title", "Survived"]].groupby("Title", as_index=False).mean()
Copy the code

The title itself is text, which is useless for later modeling, so we convert it directly to a numeric type:

title_mapping = {
    "Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5
}

for dataset in combine:
    # Map existing titles to their numeric codes
    dataset['Title'] = dataset['Title'].map(title_mapping)
    # Titles not in the mapping become 0
    dataset['Title'] = dataset['Title'].fillna(0)
    
train.head()
Copy the code

We also need to delete some fields:

train = train.drop(['Name', 'PassengerId'], axis=1)
test = test.drop(['Name'], axis=1)

combine = [train, test]
train.shape, test.shape

# ((891, 9), (418, 9))
Copy the code

Field Sex

Map the gender values Male and Female to numbers: 0 = male, 1 = female

for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map({'female': 1, 'male': 0}).astype(int)
Copy the code

The relationship between sex, age and survival:

grid = sns.FacetGrid(
    train,
    row='Pclass',
    col='Sex',
    size=2.2, 
    aspect=1.6)

grid.map(plt.hist, 
         'Age', 
         alpha=.5, 
         bins=20)

grid.add_legend()

plt.show()
Copy the code

Field Age

1. First, handle the missing values in this field.

We observe that the Age field has missing values; they are filled using the 6 combinations of Sex (0, 1) and Pclass (1, 2, 3):

Specific filling process:

guess_ages = np.zeros((2, 3))

for dataset in combine:
    # First pass: compute the median age for each Sex/Pclass combination
    for i in range(0, 2):
        for j in range(0, 3):
            # Non-missing ages under this combination
            guess_df = dataset[(dataset["Sex"] == i) & (dataset["Pclass"] == j + 1)]["Age"].dropna()
            age_guess = guess_df.median()  # the median
            guess_ages[i, j] = int(age_guess / 0.5 + 0.5) * 0.5  # round to the nearest 0.5
    # Second pass: fill the missing ages with the guessed values
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j + 1), "Age"] = guess_ages[i, j]
    dataset["Age"] = dataset["Age"].astype(int)

# No missing values remain after filling
train.isnull().sum()
Copy the code

2. Bin the ages into bands (a sketch of this step follows)
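
The binning appeared as an image in the original; a sketch, assuming pd.cut into 5 equal-width bands (this creates the AgeBand field that is deleted below):

# Cut ages into 5 equal-width bands and check survival per band
train['AgeBand'] = pd.cut(train['Age'], 5)
train[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
Copy the code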

3. Convert to numeric categories

  • Replace ages of 16 and under with 0
  • Replace ages from 16 to 32 with 1, and so on…
for dataset in combine:
    dataset.loc[dataset["Age"] <= 16, "Age"] = 0
    dataset.loc[(dataset["Age"] > 16) & (dataset["Age"] <= 32), "Age"] = 1
    dataset.loc[(dataset["Age"] > 32) & (dataset["Age"] <= 48), "Age"] = 2
    dataset.loc[(dataset["Age"] > 48) & (dataset["Age"] <= 64), "Age"] = 3
    dataset.loc[(dataset["Age"] > 64), "Age"] = 4

# Delete the AgeBand helper field
train = train.drop(["AgeBand"], axis=1)
combine = [train, test]
Copy the code

Derived fields

Generate new fields from existing ones:

Generate a new field 1

First, generate a FamilySize field from the Parch and SibSp fields:

for dataset in combine:
    dataset["FamilySize"] = dataset["SibSp"] + dataset["Parch"] + 1

    
# Mean survival rate per FamilySize
train[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Copy the code

Next, derive an IsAlone flag from FamilySize: if FamilySize is 1, the passenger is travelling alone and IsAlone is 1; otherwise it is 0.
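
This step was shown as an image; a minimal sketch following that logic:

for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

# Mean survival rate for alone vs. accompanied passengers
train[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
Copy the code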

Then delete Parch, SibSp, and FamilySize, keeping only IsAlone:

# Delete Parch, SibSp, and FamilySize, keeping only IsAlone
train = train.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test = test.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
combine = [train, test]

train.head()
Copy the code

Generate a new field 2

The second new field is the product of Age and Pclass:
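
Shown as an image in the original; a sketch, assuming the field name Age*Class:

for dataset in combine:
    dataset['Age*Class'] = dataset['Age'] * dataset['Pclass']

train.loc[:, ['Age*Class', 'Age', 'Pclass']].head()
Copy the code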

Field Embarked

The Embarked field takes three values: S, Q, and C. First we fill in the missing values.

Check this field for missing values:

Treatment: find the mode (the most frequent value), fill the missing entries with it, then look at the mean survival for each value:
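
A sketch of this step, shown as an image in the original:

# Most frequent port of embarkation
freq_port = train.Embarked.dropna().mode()[0]

for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)

# Mean survival rate per port
train[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean()
Copy the code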

Convert text type to numeric type:
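
Also shown as an image; a sketch using a simple mapping:

for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
Copy the code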

Fare field processing

The Fare field has no missing values in the training set, but there is one in the test set:

Fill with the median value:
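
A sketch of the fill, using the test set's median fare:

test['Fare'].fillna(test['Fare'].dropna().median(), inplace=True)
Copy the code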

Binning operation:

# Bin Fare into a FareBand field
train['FareBand'] = pd.qcut(train['Fare'], 4)  # split into 4 quantile groups

# Mean survival rate per band
train[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
Copy the code

Convert each segment into numeric data:

# 4 segments
for dataset in combine:
    dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

# Delete the FareBand helper field
train = train.drop(['FareBand'], axis=1)
combine = [train, test]
    
test.head()
Copy the code

This gives us the fields and data that will eventually be used for modeling:

Modeling

Below is the concrete modeling process. First, we split the data sets:

# training set
X_train = train.drop("Survived", axis=1)
Y_train = train["Survived"]

# test set
X_test  = test.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
Copy the code

Specific process for each model:

  1. Instantiate the model object
  2. Fit it on the training set
  3. Predict on the test set
  4. Compute the accuracy

Model 1: Logistic regression

# model instantiation
logreg = LogisticRegression()
# Fitting process
logreg.fit(X_train, Y_train)

# Test set prediction
Y_pred = logreg.predict(X_test)
# Accuracy solution
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log

# the results
81.37
Copy the code

The coefficients obtained by the logistic regression model are:

# Logistic regression features and coefficients


coeff_df = pd.DataFrame(train.columns[1:])  # remove the Survived column
coeff_df.columns = ["Features"]

coeff_df["Correlation"] = pd.Series(logreg.coef_[0])

# From high to low
coeff_df.sort_values(by='Correlation', ascending=False)
Copy the code

The bottom line: gender really is an important factor in survival

Model 2: Support vector machine SVM
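
The original shows this model as a screenshot; a sketch following the same four steps:

svc = SVC()
svc.fit(X_train, Y_train)

Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
Copy the code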

Model 3: KNN
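
Likewise a screenshot in the original; a sketch (n_neighbors=3 assumed):

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)

Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
Copy the code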

Model 4: Naive Bayes
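
A sketch of the Gaussian naive Bayes step:

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)

Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
Copy the code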

Model 5: Perceptron
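
A sketch of the perceptron step:

perceptron = Perceptron()
perceptron.fit(X_train, Y_train)

Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
Copy the code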

Model 6: Linear support vector classification

linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)

Y_pred = linear_svc.predict(X_test)

acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc
# the results
79.46
Copy the code

Model 7: Stochastic gradient descent
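
A sketch of the SGD classifier step:

sgd = SGDClassifier()
sgd.fit(X_train, Y_train)

Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
Copy the code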

Model 8: Decision tree
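
A sketch of the decision tree step:

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)

Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
Copy the code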

Model 9: Random forest
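
A sketch of the random forest step (n_estimators=100 assumed):

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
Copy the code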

Model comparison

Compare the results (accuracy) of the 9 models above:

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes', 'Perceptron',
              'Stochastic Gradient Descent', 'Linear SVC', 'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})

models.sort_values(by='Score', ascending=False)
Copy the code

The comparison shows that the decision tree and random forest perform best on this dataset, followed by KNN (k-nearest neighbors).