A record of a Kaggle Titanic entry that reached the top 10%.

Problem analysis

Looking at the given dataset, this is a binary classification problem: predict whether each passenger survived.

Preparatory work

I downloaded the dataset into a dataset folder in the working directory and did the feature engineering and training locally.

Open Jupyter Notebook (recommended) or any IDE you are familiar with. The first step is to import the commonly used packages:

# Basic packages
import numpy as np
import os

# Plotting-related packages
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
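# Note: Imputer was removed in scikit-learn 0.22; modern code should use
# sklearn.impute.SimpleImputer instead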
from sklearn.preprocessing import Imputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
from sklearn.preprocessing import LabelBinarizer
# measure
from sklearn.metrics import accuracy_score

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

DATASET_DIR = "./dataset"

The last line sets the directory where the dataset is stored.

The CategoricalEncoder class below was slated for a later version of scikit-learn (it was eventually split into the modern OneHotEncoder and OrdinalEncoder); it performs the one-hot encoding of categorical data that is required here. Its input is a 2D array of enumerated category values.

# from sklearn.preprocessing import CategoricalEncoder  # not available yet, so we define the class ourselves

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array

from scipy import sparse

# Backport of CategoricalEncoder from the scikit-learn development branch
class CategoricalEncoder(BaseEstimator, TransformerMixin):
    """Encode categorical features as a numeric array.
    The input to this transformer should be a matrix of integers or strings,
    denoting the values taken on by categorical (discrete) features.
    The features can be encoded using a one-hot aka one-of-K scheme
    (``encoding='onehot'``, the default) or converted to ordinal integers
    (``encoding='ordinal'``).
    This encoding is needed for feeding categorical data to many scikit-learn
    estimators, notably linear models and SVMs with the standard kernels.
    Read more in the :ref:`User Guide <preprocessing_categorical_features>`.
    Parameters
    ----------
    encoding : str, 'onehot', 'onehot-dense' or 'ordinal'
        The type of encoding to use (default is 'onehot'):
        - 'onehot': encode the features using a one-hot aka one-of-K scheme
          (or also called 'dummy' encoding). This creates a binary column for
          each category and returns a sparse matrix.
        - 'onehot-dense': the same as 'onehot' but returns a dense array
          instead of a sparse matrix.
        - 'ordinal': encode the features as ordinal integers. This results in
          a single column of integers (0 to n_categories - 1) per feature.
    categories : 'auto' or a list of lists/arrays of values.
        Categories (unique values) per feature:
        - 'auto' : Determine categories automatically from the training data.
        - list : ``categories[i]`` holds the categories expected in the ith
          column. The passed categories are sorted before encoding the data
          (used categories can be found in the ``categories_`` attribute).
    dtype : number type, default np.float64
        Desired dtype of output.
    handle_unknown : 'error' (default) or 'ignore'
        Whether to raise an error or ignore if an unknown categorical feature is
        present during transform (default is to raise). When this parameter
        is set to 'ignore' and an unknown category is encountered during
        transform, the resulting one-hot encoded columns for this feature
        will be all zeros.
        Ignoring unknown categories is not supported for
        ``encoding='ordinal'``.
    Attributes
    ----------
    categories_ : list of arrays
        The categories of each feature determined during fitting. When
        categories were specified manually, this holds the sorted categories
        (in order corresponding with output of `transform`).
    Examples
    --------
    Given a dataset with three features and two samples, we let the encoder
    find the maximum value per feature and transform the data to a binary
    one-hot encoding.
    >>> from sklearn.preprocessing import CategoricalEncoder
    >>> enc = CategoricalEncoder(handle_unknown='ignore')
    >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
    ... # doctest: +ELLIPSIS
    CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
              encoding='onehot', handle_unknown='ignore')
    >>> enc.transform([[0, 1, 1], [1, 0, 4]]).toarray()
    array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.],
           [ 0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.]])
    See also
    --------
    sklearn.preprocessing.OneHotEncoder : performs a one-hot encoding of
      integer ordinal features. The ``OneHotEncoder`` assumes that input
      features take on values in the range ``[0, max(feature)]`` instead of
      using the unique values.
    sklearn.feature_extraction.DictVectorizer : performs a one-hot encoding of
      dictionary items (also handles string-valued features).
    sklearn.feature_extraction.FeatureHasher : performs an approximate one-hot
      encoding of dictionary items or strings.
    """

    def __init__(self, encoding='onehot', categories='auto', dtype=np.float64,
                 handle_unknown='error'):
        self.encoding = encoding
        self.categories = categories
        self.dtype = dtype
        self.handle_unknown = handle_unknown

    def fit(self, X, y=None):
        """Fit the CategoricalEncoder to X.
        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            The data to determine the categories of each feature.
        Returns
        -------
        self
        """

        if self.encoding not in ['onehot', 'onehot-dense', 'ordinal']:
            template = ("encoding should be either 'onehot', 'onehot-dense' "
                        "or 'ordinal', got %s")
            raise ValueError(template % self.encoding)

        if self.handle_unknown not in ['error', 'ignore']:
            template = ("handle_unknown should be either 'error' or "
                        "'ignore', got %s")
            raise ValueError(template % self.handle_unknown)

        if self.encoding == 'ordinal' and self.handle_unknown == 'ignore':
            raise ValueError("handle_unknown='ignore' is not supported for"
                             " encoding='ordinal'")

        X = check_array(X, dtype=np.object, accept_sparse='csc', copy=True)
        n_samples, n_features = X.shape

        self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]

        for i in range(n_features):
            le = self._label_encoders_[i]
            Xi = X[:, i]
            if self.categories == 'auto':
                le.fit(Xi)
            else:
                valid_mask = np.in1d(Xi, self.categories[i])
                if not np.all(valid_mask):
                    if self.handle_unknown == 'error':
                        diff = np.unique(Xi[~valid_mask])
                        msg = ("Found unknown categories {0} in column {1}"
                               " during fit".format(diff, i))
                        raise ValueError(msg)
                le.classes_ = np.array(np.sort(self.categories[i]))

        self.categories_ = [le.classes_ for le in self._label_encoders_]

        return self

    def transform(self, X):
        """Transform X using one-hot encoding. Parameters ---------- X : array-like, shape [n_samples, n_features] The data to encode. Returns ------- X_out : sparse matrix or a 2-d array Transformed input. """
        X = check_array(X, accept_sparse='csc', dtype=np.object, copy=True)
        n_samples, n_features = X.shape
        X_int = np.zeros_like(X, dtype=np.int)
        X_mask = np.ones_like(X, dtype=np.bool)

        for i in range(n_features):
            valid_mask = np.in1d(X[:, i], self.categories_[i])

            if not np.all(valid_mask):
                if self.handle_unknown == 'error':
                    diff = np.unique(X[~valid_mask, i])
                    msg = ("Found unknown categories {0} in column {1}"
                           " during transform".format(diff, i))
                    raise ValueError(msg)
                else:
                    # Set the problematic rows to an acceptable value and
                    # continue. The rows are marked in `X_mask` and will be
                    # removed later.
                    X_mask[:, i] = valid_mask
                    X[:, i][~valid_mask] = self.categories_[i][0]
            X_int[:, i] = self._label_encoders_[i].transform(X[:, i])

        if self.encoding == 'ordinal':
            return X_int.astype(self.dtype, copy=False)

        mask = X_mask.ravel()
        n_values = [cats.shape[0] for cats in self.categories_]
        n_values = np.array([0] + n_values)
        indices = np.cumsum(n_values)

        column_indices = (X_int + indices[:-1]).ravel()[mask]
        row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
                                n_features)[mask]
        data = np.ones(n_samples * n_features)[mask]

        out = sparse.csc_matrix((data, (row_indices, column_indices)),
                                shape=(n_samples, indices[-1]),
                                dtype=self.dtype).tocsr()
        if self.encoding == 'onehot-dense':
            return out.toarray()
        else:
            return out

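Before moving on, here is a quick sanity check of the backported class on toy data (the example values are my own illustration, not part of the original write-up):

# Toy example: one categorical feature with three distinct values
cat_encoder = CategoricalEncoder(encoding="onehot-dense")
toy = np.array([["S"], ["C"], ["Q"], ["S"]])
print(cat_encoder.fit_transform(toy))
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
print(cat_encoder.categories_)  # [array(['C', 'Q', 'S'], dtype=object)]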

Load the data

Download the dataset provided by the competition, save it to the dataset directory, and load it:

dataset_train_tf = pd.read_csv(os.path.join(DATASET_DIR, "train.csv"))
dataset_test_tf = pd.read_csv(os.path.join(DATASET_DIR, "test.csv"))
combine = [dataset_train_tf, dataset_test_tf]

View the first five rows of the training data:

dataset_train_tf.head()

The output (table omitted here) shows the first five rows, with the columns PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.

You can see that it contains a mix of numerical features and categorical features. Refer to the Kaggle competition page for the meaning of each feature.

Let’s take a look at its data distribution:
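
The summary table is not reproduced here; it was most likely produced with describe():

dataset_train_tf.describe()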

This summary holds a lot of important information:

  1. Close to 38 percent survived (the mean of the Survived column, which is coded 0/1).
  2. SibSp is the number of siblings/spouses aboard and Parch the number of parents/children aboard; together they count the relatives traveling with a passenger. The 75th percentile of Parch is 0, so at least 75% of passengers traveled without parents or children.
  3. Nearly 30 percent of passengers had siblings and/or a spouse aboard.
  4. Fares vary widely, with almost no passengers (<1%) paying as much as $512.
  5. Very few passengers were 65 to 80 years old.

The distribution of the numerical features is shown above; next, check the categorical features:
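
Again the table is omitted; a plausible way to produce it:

dataset_train_tf.describe(include=["O"])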

Some speculation

  1. Are women more likely to survive?
  2. Are children more likely to survive?
  3. How does ticket class relate to survival?

Let’s look at the data types and the missing values:
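
The listing is not reproduced in the original; it was presumably generated with info():

dataset_train_tf.info()
print("_" * 40)
dataset_test_tf.info()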

The three steps above (head, describe, info) are basically obligatory whenever you receive a new dataset. Next, a simple analysis:

Simple analysis

Group the data by different attributes with groupby to view each attribute’s relationship with survival.

dataset_train_tf[["Pclass", "Survived"]].groupby(["Pclass"], as_index=False).mean().sort_values(by="Survived", ascending=False)

The results are as follows:

   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363

The chances of survival clearly differ by ticket class.

dataset_train_tf[["Sex", "Survived"]].groupby(["Sex"], as_index=False).mean().sort_values(by="Survived", ascending=False)
      Sex  Survived
0  female  0.742038
1    male  0.188908
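
The same pattern works for any other feature; for example (my own illustration, not in the original analysis):

dataset_train_tf[["SibSp", "Survived"]].groupby(["SibSp"], as_index=False).mean().sort_values(by="Survived", ascending=False)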

And so on; I won’t go into more analysis here.


For the visualization side, see Kaggle’s introductory tutorials and other Titanic articles.

The corr() function can be used to compute the correlation between the different features.

# Correlations between the features
corr_matrix = dataset_train_tf.corr()
corr_matrix["Survived"].sort_values()  # values close to 1 or -1 are the most informative

The output is as follows:

Pclass        -0.338481
Age           -0.077221
SibSp         -0.035322
PassengerId   -0.005007
Parch          0.081629
Fare           0.257307
Survived       1.000000
Name: Survived, dtype: float64

The next part covers filling in missing data, using Pipeline, encoding the categorical features, and training and evaluating the model.
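
As a taste of what’s coming, a minimal numeric pipeline built from the imports above might look like the sketch below (the column selection is my assumption, not necessarily the final choice):

# Illustrative numeric preprocessing: median-impute, then standardize
num_pipeline = Pipeline([
    ("imputer", Imputer(strategy="median")),  # SimpleImputer in modern scikit-learn
    ("scaler", StandardScaler()),
])
X_num = num_pipeline.fit_transform(dataset_train_tf[["Age", "SibSp", "Parch", "Fare"]])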
