Processing categorical features: encoding and dummy variables

In machine learning, most algorithms, such as logistic regression, support vector machines (SVM), and k-nearest neighbors, can only process numerical data; they cannot handle text. In sklearn, apart from the algorithms dedicated to text processing, fit in every other algorithm requires the input to be a numeric array or matrix and cannot accept literal data (in fact, a hand-written decision tree or naive Bayes could handle text directly, but sklearn requires numeric input). In reality, however, many labels and features are not collected as numbers. For example, the value of degree can be ["primary school", "junior high school", "senior high school", "university"], and the payment method may be ["Alipay", "Cash", "WeChat"], and so on. In such cases, to make the data usable by the algorithms and the library, we must encode it, that is, convert the literal data into numbers.
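As a quick illustration, here is a minimal sketch with a made-up DataFrame using the example values above; the column names and values are purely hypothetical, but they show the kind of text data that has to be encoded before it can be passed to an estimator's fit:

```python
# A minimal, hypothetical example of text columns that need encoding
import pandas as pd

df = pd.DataFrame({
    "degree":  ["primary school", "junior high school", "senior high school", "university"],
    "payment": ["Alipay", "Cash", "WeChat", "Alipay"],
})

print(df.dtypes)   # both columns are dtype object (text), so fit() cannot use them as-is
```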

preprocessing.LabelEncoder: dedicated to labels; converts categorical labels into numeric categories

```python
from sklearn.preprocessing import LabelEncoder

y = data.iloc[:, -1]            # the input is the label, not the feature matrix, so one dimension is allowed

le = LabelEncoder()             # instantiate
le = le.fit(y)                  # import data
label = le.transform(y)         # the transform interface fetches the result

le.classes_                     # the attribute classes_ lists the categories found in the label
label                           # view the encoded result

le.fit_transform(y)             # fit_transform can also do it in one step
le.inverse_transform(label)     # inverse_transform reverses the encoding

data.iloc[:, -1] = label        # make the label equal to the result of our operation
data.head()

# Without the step-by-step demonstration, this can be written as:
from sklearn.preprocessing import LabelEncoder
data.iloc[:, -1] = LabelEncoder().fit_transform(data.iloc[:, -1])
```

preprocessing.OrdinalEncoder: dedicated to features; converts categorical features into numeric categories

```python
from sklearn.preprocessing import OrdinalEncoder

# the interface categories_ corresponds to LabelEncoder's classes_ and does exactly the same job
data_ = data.copy()
data_.head()

OrdinalEncoder().fit(data_.iloc[:, 1:-1]).categories_

data_.iloc[:, 1:-1] = OrdinalEncoder().fit_transform(data_.iloc[:, 1:-1])
data_.head()
```

preprocessing.OneHotEncoder: one-hot encoding, creates dummy variables

The categorical variables Sex and Embarked have just been converted into numeric categories using OrdinalEncoder. In the cabin-door column Embarked, we used [0, 1, 2] to represent the three different doors, but is this conversion correct? Consider three different kinds of categorical data:

  • 1) Cabin door (S, C, Q)
  • The three values S, C and Q are independent of one another and have no connection at all; they express the concept S≠C≠Q. This is a nominal variable.
  • 2) Education (primary school, junior high school, senior high school)
  • The three values are not completely independent: there is clearly an inherent order, senior high school > junior high school > primary school, so education has higher and lower levels. But the values cannot be calculated with: we cannot say that primary school + some amount = junior high school. This is an ordinal variable.
  • 3) Weight (>45kg, >90kg, >135kg)
  • The values are related to one another and can be calculated with one another, for example 135kg - 45kg = 90kg, so categories can be converted into each other through arithmetic. This is an interval variable.
  • During feature encoding, however, all three kinds of categorical data are converted into [0, 1, 2]. To the algorithm these three numbers are continuous and computable: they are unequal to each other, they have an order, and they can be added and multiplied. The algorithm will therefore mistake nominal features such as the cabin door, and ordinal features such as education, for interval features such as weight. In other words, when we convert categories into numbers we ignore the mathematical properties that numbers inherently carry, and so pass inaccurate information to the algorithm, which affects our modelling.
  • OrdinalEncoder can be used to handle ordinal variables, but for nominal variables we can only use dummy variables, so that the algorithm receives the most accurate information, as in the sketch below:
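A minimal sketch of how this could be done with OneHotEncoder, assuming the same data DataFrame as in the earlier steps, where the columns from position 1 up to (but not including) the last are Sex and Embarked; the column positions and the newdata name are assumptions for illustration, while the encoder calls are standard sklearn:

```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

X = data.iloc[:, 1:-1]                           # assumed to be the Sex and Embarked columns

enc = OneHotEncoder(categories='auto').fit(X)    # instantiate and fit
result = enc.transform(X).toarray()              # transform returns a sparse matrix; densify it to view

enc.get_feature_names_out()                      # names of the dummy columns (sklearn >= 1.0)
pd.DataFrame(enc.inverse_transform(result))      # the encoding can still be reversed

# attach the dummy variables and drop the original text columns
newdata = pd.concat([data, pd.DataFrame(result)], axis=1)
newdata.drop(["Sex", "Embarked"], axis=1, inplace=True)
newdata.head()
```

Each original category becomes its own 0/1 column, so no spurious order or distance is implied between S, C and Q, which is exactly the information a nominal variable should convey to the algorithm.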