Data exploration

Import libraries

Three kinds of libraries are needed for the whole process:

  • Data processing libraries
  • Visualization libraries
  • Modeling libraries

# Data processing
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# model
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
Import data

Check the size of the data after importing it
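A minimal sketch of loading the data and checking its size. The in-memory CSV below is a hypothetical three-row stand-in so that the snippet runs on its own; with the real files you would pass a path (assumed here to be something like train.csv) to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the training file: three rows, a few real columns
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,26
"""

# With the real data this would be pd.read_csv(<path to the training file>)
train = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple
print(train.shape)
```

The same call on the test file gives the size of the test set.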

Field information

View all fields:


Here’s what the fields mean:

  • PassengerId: passenger ID
  • Survived: 0 = no, 1 = yes
  • Pclass: 1 = first class, 2 = second class, 3 = third class
  • Name: name
  • Sex: gender
  • Age: age
  • SibSp: number of siblings/spouses on board
  • Parch: number of parents/children on board
  • Ticket: ticket number
  • Fare: ticket fare
  • Cabin: cabin number
  • Embarked: port of embarkation

Field classification

The fields fall into two types, categorical and numerical:

  • Categorical: Survived, Sex, and Embarked. Ordinal: Pclass
  • Continuous: Age, Fare. Discrete: SibSp, Parch

Missing values

Check the missing values of the training set and test set:

You can also use the info function to query basic information about the data:
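As a rough sketch of both checks, the snippet below counts missing values per column and then calls info(). The small frame is hypothetical toy data, not the real training set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps
train = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values in each column
missing = train.isnull().sum()
print(missing)

# info() prints the dtype and the non-null count of each column
train.info()
```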

Assumptions about the data

Based on the basic information about the data and on common sense, the author puts forward some hypotheses and sets the direction for the data processing and analysis that follows:

Delete fields

  • This project mainly investigates the relationship between the other fields and the Survived field
  • Focus on the fields Age and Embarked
  • Drop the fields that contribute little to the analysis: Ticket, Cabin, PassengerId, and Name

Modify and add fields

  • Add a Family field based on Parch (number of parents/children on board) and SibSp (number of siblings/spouses on board)
  • Extract Title from the Name field as the new feature
  • Turn the Age field into an ordered classification feature
  • Create a feature based on the Fare range


Hypotheses

  • Women are more likely to survive
  • Children (Age < ?) are more likely to survive
  • Passengers in a higher cabin class (Pclass=1) are more likely to survive

Statistical analysis

Sex, the ordinal variable Pclass, and the discrete variables SibSp and Parch are analyzed to verify our conjectures:
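One way to run this kind of check is to group by the field and take the mean of the 0/1 Survived flag, which gives the survival rate per group. The sketch below does this for Pclass on hypothetical toy data; the same pattern would apply to Sex, SibSp, and Parch.

```python
import pandas as pd

# Hypothetical toy data, not the real training set
train = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 1],
})

# The mean of a 0/1 flag per group is the survival rate of that group
rate = (
    train[["Pclass", "Survived"]]
    .groupby(["Pclass"], as_index=False)
    .mean()
    .sort_values(by="Survived", ascending=False)
)
print(rate)
```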

1. Cabin class (1- first class, 2- second class, 3- third class)

Conclusion: People in first class are more likely to survive

2. Sex

Conclusion: Women are more likely to survive

3. Number of siblings/spouses

Conclusion: Passengers with fewer siblings or spouses are more likely to survive

4. Number of parents/children

Conclusion: Passengers with three parents/children on board (Parch=3) are the most likely to survive

Visual analysis

Age and Survival

g = sns.FacetGrid(train, col="Survived")
g.map(plt.hist, 'Age', bins=20)

  1. Of those who did not survive, most were aged 15-25 (left panel)
  2. Survivors' ages range up to 80, and children under the age of four have a high survival rate (right panel)
  3. Most of the passengers are between 15 and 35 years old (both panels)

Cabin class and survival

grid = sns.FacetGrid(
  • Class 3 has the most passengers, but many of them did not survive
  • Class 1 has the most survivors

Port of embarkation, sex, and survival

grid = sns.FacetGrid(train,

  1. Women survived at a higher rate than men
  2. For Embarked=C, males are more likely to survive
  3. For Pclass=3, males who embarked at C have a better survival rate than those who embarked at Q

Fare, cabin and survival

grid = sns.FacetGrid(train, 
                     size=2.2, aspect=1.6), 
         alpha=. 5, ci=None)

  • Passengers who paid higher fares survived at a higher rate (the two panels on the right)
  • The survival rate is related to the port of embarkation; Embarked=C is the closest in value

The above analysis is based on simple statistics and visualization; what follows uses various machine learning models. Before modeling, a fair amount of preprocessing and feature engineering is needed.

Delete invalid fields

Ticket and Cabin are almost useless for our analysis, so we can simply delete them:
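A minimal sketch of the deletion, assuming train and test are pandas DataFrames that both contain the two columns. The frames below are hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy frames with the two columns to be removed
train = pd.DataFrame({
    "Survived": [0, 1],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Cabin": [None, "C85"],
})
test = pd.DataFrame({
    "PassengerId": [892],
    "Ticket": ["330911"],
    "Cabin": [None],
})

# Drop the same columns from both sets so they stay aligned
train = train.drop(["Ticket", "Cabin"], axis=1)
test = test.drop(["Ticket", "Cabin"], axis=1)

# Keep both frames in one list so later steps can loop over them
combine = [train, test]
```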

Generate new features

New features are generated mainly by finding relationships among the existing feature attributes, or by transforming an existing attribute.

Name field processing

Extract the title (such as Lady, Dr, or Miss) from the Name field and check whether the title is related to survival

# Extract the title with a regular expression
for dataset in combine:
    dataset["Title"] = dataset.Name.str.extract(r'([A-Za-z]+)\.', expand=False)

Count the number of men and women under each Title using crosstab:

# crosstab form
pd.crosstab(train['Title'], train['Sex'])
The extracted titles are then consolidated: rare ones are grouped under Rare, and variants are merged into the common titles:

for dataset in combine:
    dataset["Title"] = dataset["Title"].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

# Mean survival rate for each Title
train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

The title itself is text and cannot be used directly for modeling, so we convert it into a numeric type:

title_mapping = {

for dataset in combine:
    # Map the titles that exist in the mapping
    dataset['Title'] = dataset['Title'].map(title_mapping)
    # If there is no match, fill with 0
    dataset['Title'] = dataset['Title'].fillna(0)
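To illustrate how map and fillna work together here, the sketch below uses a hypothetical mapping (the actual codes used by the author are not shown) on a small Series. Titles that are absent from the mapping become NaN after map and are then set to 0.

```python
import pandas as pd

# Hypothetical mapping, chosen only for illustration
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

titles = pd.Series(["Mr", "Miss", "Rare", "Unknown"])

# map() yields NaN for "Unknown"; fillna(0) turns that NaN into 0
codes = titles.map(title_mapping).fillna(0).astype(int)
print(codes.tolist())
```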
We also need to delete some fields:

train = train.drop(['Name', 'PassengerId'], axis=1)
test = test.drop(['Name'], axis=1)

combine = [train, test]
train.shape, test.shape

# ((891, 9), (418, 9))
Sex field processing

Convert Sex to numbers: male becomes 0 and female becomes 1

for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map({'female': 1, 'male': 0}).astype(int)
The relationship between sex, age and survival:

grid = sns.FacetGrid(
         alpha=. 5, 

Age field processing

1. First, handle the missing values of the field.

The Age field has missing values. We fill them using the median age of each of the 6 combinations of Sex (0, 1) and Pclass (1, 2, 3). Missing values:

Specific filling process:

guess_ages = np.zeros((2, 3))

for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            # Select the non-missing Age values for this Sex/Pclass combination
            guess_df = dataset[(dataset["Sex"] == i) & (dataset["Pclass"] == j+1)]["Age"].dropna()
            age_guess = guess_df.median()  # the median
            # Round the median to the nearest 0.5
            guess_ages[i,j] = int(age_guess / 0.5 + 0.5) * 0.5
    for i in range(0, 2):
        for j in range(0, 3):
            # Fill the missing Age values of this combination with its guess
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1), "Age"] = guess_ages[i,j]
    dataset["Age"] = dataset["Age"].astype(int)
There are no missing values after filling.

2. Bin the ages
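The binning code is not shown; one plausible way to do it, assuming five equal-width bands, is pd.cut. The sketch below runs on a hypothetical Series whose values span 0 to 80, which produces band edges at 16, 32, 48, and 64.

```python
import pandas as pd

# Hypothetical ages spanning 0 to 80
ages = pd.Series([0, 15, 22, 35, 47, 58, 80])

# Cut the observed range into 5 equal-width intervals
age_band = pd.cut(ages, 5)

# Number of passengers in each band, in band order
print(age_band.value_counts(sort=False))
```

In the walkthrough the result would be stored in a column such as AgeBand, which is the field deleted again once Age has been converted to numeric codes.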

3. Convert the bands to numeric codes

  • Ages of 16 or less are replaced by 0
  • Ages above 16 and up to 32 are replaced by 1, and so on
for dataset in combine:
    dataset.loc[dataset["Age"] <= 16, "Age"] = 0
    dataset.loc[(dataset["Age"] > 16) & (dataset["Age"] <= 32), "Age"] = 1
    dataset.loc[(dataset["Age"] > 32) & (dataset["Age"] <= 48), "Age"] = 2
    dataset.loc[(dataset["Age"] > 48) & (dataset["Age"] <= 64), "Age"] = 3
    dataset.loc[(dataset["Age"] > 64), "Age"] = 4
# Delete the AgeBand field
train = train.drop(["AgeBand"], axis=1)
combine = [train, test]
Field processing

Generate a new field from an existing field:

Generate a new field 1

Start by generating a FamilySize field based on the Parch and SibSp fields

for dataset in combine:
    dataset["FamilySize"] = dataset["SibSp"] + dataset["Parch"] + 1

# Mean survival rate for each FamilySize
train[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Next, derive an IsAlone field from FamilySize: if FamilySize is 1, the passenger is travelling alone and IsAlone is 1; otherwise it is 0
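A minimal sketch of this flag on hypothetical toy data. The column name IsAlone follows the description above; the default-then-overwrite approach is just one way to write it.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Default to 0, then flag the passengers whose family size is 1
df["IsAlone"] = 0
df.loc[df["FamilySize"] == 1, "IsAlone"] = 1
print(df)
```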

Delete Parch, SibSp, and FamilySize:

# Delete Parch, SibSp, and FamilySize, leaving only IsAlone

train = train.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test = test.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
combine = [train, test]

Generate a new field 2

The new field 2 is the product of Age and Pclass:
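As a sketch, the product can be computed column-wise. The column name Age*Class is an assumption made for illustration, and the frame is hypothetical toy data in which Age has already been converted to band codes.

```python
import pandas as pd

# Hypothetical toy data: Age holds band codes 0-4, Pclass holds 1-3
df = pd.DataFrame({"Age": [0, 1, 2, 4], "Pclass": [3, 1, 2, 3]})

# "Age*Class" is a hypothetical name for the new field
df["Age*Class"] = df["Age"] * df["Pclass"]
print(df)
```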

Encoding the Embarked field

The Embarked field takes the values S, Q, and C. First we fill in the missing values

Check this field for missing values:

Treatment: find the most frequent value (the mode), use it to fill in the missing values, then look at the mean survival rate for each value

Convert text type to numeric type:
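The two Embarked steps can be sketched together as follows. The data is a hypothetical toy column, and the numeric codes in the mapping are an assumption, since the author's actual mapping is not shown.

```python
import pandas as pd

# Hypothetical toy column with one missing value
df = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q"]})

# Most frequent port among the non-missing values
freq_port = df["Embarked"].dropna().mode()[0]
df["Embarked"] = df["Embarked"].fillna(freq_port)

# Hypothetical numeric codes for the three ports
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2}).astype(int)
print(df)
```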

Fare field processing

This field has no missing values in the training set, but there is one in the test set:

Fill with the median value:
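A minimal sketch of the median fill, shown on a hypothetical toy column with one gap:

```python
import numpy as np
import pandas as pd

# Hypothetical toy test set with one missing fare
test = pd.DataFrame({"Fare": [7.25, np.nan, 31.0, 14.5]})

# Median of the fares that are present
median_fare = test["Fare"].dropna().median()
test["Fare"] = test["Fare"].fillna(median_fare)
print(test)
```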

Binning:

# Bin Fare into a new FareBand field
train['FareBand'] = pd.qcut(train['Fare'], 4)  # Divide into 4 groups

# Mean of survival
train[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
Convert each segment into numeric data:

# 4 segments
for dataset in combine:
    dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

train = train.drop(['FareBand'], axis=1)
combine = [train, test]
This gives us the fields and data that will eventually be used for modeling:


The following is the modeling process itself. We first split the data:

# training set
X_train = train.drop("Survived", axis=1)
Y_train = train["Survived"]

# test set
X_test  = test.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
Specific process for each model:

  1. Instantiate the model
  2. Fit it on the training set
  3. Make predictions for the test set
  4. Calculate the accuracy
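The four steps above can be sketched as one small helper. The function name fit_and_score is hypothetical, and the random arrays merely stand in for X_train, Y_train, and X_test. As in the rest of this walkthrough, the accuracy is computed on the training set.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_and_score(model, X_train, Y_train, X_test):
    """Fit the model, predict the test set, and return
    (predictions, training accuracy in percent)."""
    model.fit(X_train, Y_train)                               # step 2
    Y_pred = model.predict(X_test)                            # step 3
    acc = round(model.score(X_train, Y_train) * 100, 2)       # step 4
    return Y_pred, acc


# Hypothetical stand-in data: 3 features, label depends on the first one
rng = np.random.default_rng(0)
X_train = rng.random((50, 3))
Y_train = (X_train[:, 0] > 0.5).astype(int)
X_test = rng.random((10, 3))

# Step 1: instantiate the model, then run the remaining steps
Y_pred, acc_decision_tree = fit_and_score(
    DecisionTreeClassifier(random_state=0), X_train, Y_train, X_test
)
print(acc_decision_tree)
```

Any of the classifiers imported at the top could be passed in place of the decision tree, since they all share the fit, predict, and score methods.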

Model 1: Logistic regression

# model instantiation
logreg = LogisticRegression()
# Fitting process
logreg.fit(X_train, Y_train)

# Test set prediction
Y_pred = logreg.predict(X_test)
# Accuracy on the training set
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)

# the results
acc_log

The coefficients obtained by the logistic regression model are:

# Logistic regression features and coefficients

coeff_df = pd.DataFrame(train.columns[1:])  # Remove the Survived feature
coeff_df.columns = ["Features"]

coeff_df["Correlation"] = pd.Series(logreg.coef_[0])

# From high to low
coeff_df.sort_values(by='Correlation', ascending=False)

Conclusion: Sex is indeed an important factor in survival

Model 2: Support vector machine SVM

Model 3: KNN

Model 4: Naive Bayes

Model 5: Perceptron

Model 6: Linear support vector classification

linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)

Y_pred = linear_svc.predict(X_test)

acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
# the results
acc_linear_svc

Model 7: Stochastic gradient descent

Model 8: Decision tree

Model 9: Random forest

Model comparison

Compare the results (accuracy) of the 9 models above:

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes', 'Perceptron',
              'Stochastic Gradient Descent', 'Linear SVC',
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log,
              acc_random_forest, acc_gaussian, acc_perceptron,
              acc_sgd, acc_linear_svc, acc_decision_tree]})

models.sort_values(by='Score', ascending=False)

The comparison shows that the decision tree and the random forest achieve the highest accuracy on the training set, followed by KNN (k-nearest neighbors).