Background

About Kaggle

  • www.kaggle.com/
  • It's a great source of real application scenarios and data, and a great place to meet fellow data mining enthusiasts!

Kaggle is an online platform for learning, sharing and competing with data, similar to KDD-CUP (the international knowledge discovery and data mining competition). Companies or researchers can publish the problem background, data, evaluation metrics and other information on Kaggle, and seek solutions from data scientists and enthusiasts around the world in the form of a competition. Users can download and analyze the data, apply statistics, machine learning and data mining knowledge to build models, produce predictions and submit them; top-ranked entries may even win prizes!

About the Titanic disaster

www.kaggle.com/c/titanic

  • Problem background page

  • Download the Data page

  • Background to the Titanic Problem
  • This is the story of Jack and Rose ("you jump, I jump") from the movie Titanic. The ship sank after hitting an iceberg, the passengers panicked, and the first officer ordered "women and children first". So whether a passenger was rescued was not a coin toss; it depended in part on their background. What, beyond "women and children first", is worth digging into here is exactly what we'll address later in feature engineering.
  • The training and test data contain personal information and survival outcomes for some of the passengers. We try to use these data to build a suitable model for prediction.
  • This is a binary classification problem (survived or not), and this article tackles it with logistic regression.
  • A note
  • "No algorithm is inherently good or bad, and there is no absolutely high-performing machine learning algorithm; there are only algorithms that fit a particular scenario, dataset and set of features better." Since XGBoost, random forests and SVC are beyond our scope at this learning stage, the only algorithm used in this article is logistic regression.

Getting the data

Under Data, we can see the official train.csv and test.csv files, which are the training and test data respectively. We can use virtualenv to create an "isolated" Python application environment (a virtual environment); that way we don't need to worry about the versions of the system's existing libraries, and pip manages everything we need.

import pandas as pd
import numpy as np
from pandas import Series, DataFrame

# Read the training set into a DataFrame
data_train = pd.read_csv("./train.csv")
data_train

pandas is a Python data processing package; it reads the CSV file into DataFrame format. In a Jupyter Notebook, we can see what our data looks like:

We can think of it as an Excel spreadsheet with 12 columns and 891 rows (representing the 891 passengers in train.csv). The Survived field indicates whether the passenger was saved (1 means saved, 0 means not), and the rest is personal information:

  • PassengerId => ID of the passenger
  • Pclass => class of the passenger's cabin (1st, 2nd or 3rd class)
  • Name => name
  • Sex => gender
  • Age => age
  • SibSp => number of siblings/spouses aboard
  • Parch => number of parents/children aboard
  • Ticket => ticket number
  • Fare => ticket fare
  • Cabin => cabin number
  • Embarked => port of embarkation
# Overview of column types and non-null counts
data_train.info()

# Statistical summary of the numeric columns
data_train.describe()

Preliminary data analysis

Each passenger has about 12 attributes. The two operations above alone don't give us enough understanding of the data to generate ideas, and we don't yet know which attributes are useful for the model and which are not. So we arrive at the most critical step: feature engineering. We know the final output is Survived, so now we need to find the intrinsic relationship between Survived and each attribute. Remember, the first officer said "women and children first".

# Assumes data1 is a working copy of the data, data1_x lists the feature
# column names, Target = ['Survived'], and a 'Title' column has already
# been extracted from Name.
for x in data1_x:
    if data1[x].dtype != 'float64':
        print('Survival Correlation by:', x)
        print(data1[[x, Target[0]]].groupby(x, as_index=False).mean())
        print('-' * 10, '\n')

print(pd.crosstab(data1['Title'], data1[Target[0]]))

import matplotlib.pyplot as plt

fig = plt.figure()
fig.set(alpha=0.2)  # set figure transparency
# Count survivors and non-survivors within each passenger class
Survived_0 = data_train.Pclass[data_train.Survived == 0].value_counts()
Survived_1 = data_train.Pclass[data_train.Survived == 1].value_counts()
df = pd.DataFrame({'Survived': Survived_1, 'unSurvived': Survived_0})
df.plot(kind='bar', stacked=True)
plt.title("Survival status by passenger class")
plt.xlabel("passenger class")
plt.ylabel("number of passengers")
plt.show()

# Kernel density estimate of age within each passenger class
data_train.Age[data_train.Pclass == 1].plot(kind='kde')
data_train.Age[data_train.Pclass == 2].plot(kind='kde')
data_train.Age[data_train.Pclass == 3].plot(kind='kde')
plt.xlabel("age")
plt.ylabel("density")
plt.title("Age distribution by passenger class")
plt.legend(('1st class', '2nd class', '3rd class'), loc='best')
plt.show()

Simple data preprocessing

We've looked through the data and have a rough idea of which attributes interest us. Now we need to process this data a bit to get it ready for machine learning modeling. Let's start with the most obviously problematic attributes, Cabin and Age: both are missing from some passenger records, and missing data will significantly affect the next steps.

For Cabin, we simply record whether the value is present: Cabin = 'Yes' if the passenger has cabin information, and Cabin = 'No' otherwise, as in the sketch below.
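
A minimal sketch of this transformation (the helper name set_Cabin_type is our own):

def set_Cabin_type(df):
    # Mark the presence/absence of cabin information instead of the raw value
    df.loc[df.Cabin.notnull(), 'Cabin'] = "Yes"
    df.loc[df.Cabin.isnull(), 'Cabin'] = "No"
    return df

data_train = set_Cabin_type(data_train)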

As for Age:

There are several common ways to deal with missing values:

  • If the samples with missing values make up a very high proportion of the total, we may simply discard the attribute; adding it as a feature might actually introduce noise and hurt the final result.
  • If the samples with missing values are a moderate proportion and the attribute is not continuous (e.g., a categorical attribute), we can add NaN to the attribute as a new category.
  • If the samples with missing values are a moderate proportion and the attribute is a continuous-valued feature, we sometimes choose a step size (for age here, say, one bin every two or three years), discretize the attribute, and then add NaN as one of the categories.
  • In some cases the number of missing values is not large, and we can also try to fit the missing values from the existing data, as sketched after this list.
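
For the last approach, here is a rough sketch of fitting the missing Age values with a RandomForestRegressor (which the modeling comment further down alludes to; the helper name set_missing_ages and the choice of feature columns are our own assumptions):

from sklearn.ensemble import RandomForestRegressor

def set_missing_ages(df):
    # Use the available numeric columns to predict the missing ages
    age_df = df[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
    known_age = age_df[age_df.Age.notnull()].values
    unknown_age = age_df[age_df.Age.isnull()].values
    y = known_age[:, 0]   # target: the known ages
    X = known_age[:, 1:]  # features: the other attributes
    rfr = RandomForestRegressor(random_state=0, n_estimators=200, n_jobs=-1)
    rfr.fit(X, y)
    # Predict and fill in the ages that are missing
    df.loc[df.Age.isnull(), 'Age'] = rfr.predict(unknown_age[:, 1:])
    return df

data_train = set_missing_ages(data_train)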

With that handled, and since the model needs numeric inputs, we take the categorical attributes we need (Cabin, Embarked, Sex and Pclass) and convert them into numeric features. We can do this with pandas' get_dummies and concatenate the new columns onto the original data_train:

# Cabin (now Yes/No) is dummied too, so the later Cabin_.* filter picks it up
dummies_Cabin = pd.get_dummies(data_train['Cabin'], prefix='Cabin')
dummies_Embarked = pd.get_dummies(data_train['Embarked'], prefix='Embarked')
dummies_Sex = pd.get_dummies(data_train['Sex'], prefix='Sex')
dummies_Pclass = pd.get_dummies(data_train['Pclass'], prefix='Pclass')

df = pd.concat([data_train, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
# Drop the original categorical columns now that we have dummy variables
df.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)
df

Age and Fare vary over much larger numeric ranges than the dummy features, which can slow down or destabilize the convergence of gradient-based training, so we standardize them:

import sklearn.preprocessing as preprocessing

scaler = preprocessing.StandardScaler()
# StandardScaler expects a 2-D array, hence the double brackets
df['Age_scaled'] = scaler.fit_transform(df[['Age']]).ravel()
df['Fare_scaled'] = scaler.fit_transform(df[['Fare']]).ravel()
df

Well, with that, the basic data preprocessing is about done.

Modeling

We extract the feature fields we need, convert them into NumPy format, and fit a LogisticRegression model from scikit-learn.

from sklearn import linear_model

# Use a regex to retrieve the desired columns
train_df = df.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
train_np = train_df.values

# y is the survival outcome
y = train_np[:, 0]
# X holds the feature values
X = train_np[:, 1:]

# Fit a LogisticRegression model (the liblinear solver supports the L1 penalty)
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6, solver='liblinear')
clf.fit(X, y)
clf

OK! We now have a model. Next we simply run test.csv through the same preprocessing and feed it to the model to get predictions.
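
A minimal sketch of that step, assuming data_test has already received the same Cabin/Age/dummies/scaling treatment as the training set (the output file name is our own choice):

test_df = data_test.filter(regex='Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
predictions = clf.predict(test_df.values)
result = pd.DataFrame({
    'PassengerId': data_test['PassengerId'].values,
    'Survived': predictions.astype(np.int32)
})
result.to_csv("./logistic_regression_predictions.csv", index=False)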

System optimization

Wait, did you think that was the end? It's just the beginning. We've only built a baseline model; everything is still rudimentary, and we need to optimize.

However, in the current situation, don't rush into that; our baseline system is still a bit rough, so let's first explore and mine the data further. One quick way to see where the model can improve is to inspect the learned coefficients, as sketched below.
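
A small sketch of that coefficient inspection; each feature name is paired with its learned weight (positive weights push the prediction toward survival, negative ones away from it):

pd.DataFrame({
    'columns': list(train_df.columns)[1:],
    'coef': list(clf.coef_.T)
})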

  • First, the Name and Ticket attributes were dropped entirely (well, because almost every record has a unique value for them, and we haven't found a straightforward way to use them).
  • Second, the age fitting itself is not necessarily reliable; we can't really produce a very good fit and predict unknown ages from the other attributes. Moreover, everyday experience suggests children and the elderly were more likely to be given priority. With age as a continuous value and a single fixed coefficient, survival would have to be positively or negatively correlated with age across the whole range, which can't capture the fact that both ends of the age range received extra care. So discretizing age into intervals and treating it as a categorical attribute may be more appropriate, as sketched below. (Check out the Kernels on Kaggle for more ideas.)
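
A minimal sketch of such a discretization; the bin edges here are illustrative, not tuned:

# Bin Age into coarse categories and dummy-encode the result
df['Age_bucket'] = pd.cut(df['Age'], bins=[0, 12, 18, 60, 100],
                          labels=['child', 'teen', 'adult', 'senior'])
dummies_Age = pd.get_dummies(df['Age_bucket'], prefix='Age_bucket')
df = pd.concat([df, dummies_Age], axis=1)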

Text / joeCDC

Math enthusiast

Audio / fluorspar

This article is published with the author's authorization; the copyright belongs to the Chuangyu front-end team. Please indicate the source when reposting. Link to this article: knownsec-fed.com/2018-12-04-…

For more sharing from the front line of KnownsecFED development, search for our WeChat official account KnownsecFED. Feel free to leave comments; we will reply as best we can.

Thank you for reading.