Original reference: tecdat.cn/?p=4491

Original source: Tuoduan Data Tribe official account

 

Feature engineering matters enormously to how a model performs; even a simple model with great features can outperform a complicated algorithm with poor ones. In fact, feature engineering has been called the most important factor in determining the success or failure of a predictive model. Feature engineering really boils down to the human element in machine learning: with intuition and creativity, your knowledge of the data can make all the difference.

So what is feature engineering? It can mean many things for different problems, but in the Titanic competition it means chopping up, and combining, the attributes our good people at Kaggle gave us in order to squeeze a little more value out of them. In general, engineered variables are easier for machine learning algorithms to digest and make rules from than the raw variables they were derived from.

The prime suspects for wringing out more machine-learning magic are the three text fields we never sent to the decision tree last time. Ticket number, cabin, and name are unique to each passenger, but it may be possible to extract parts of these text strings to build new predictive attributes. Let's start with the name field. If we look at the first passenger's name, we see the following:

> train$Name[1]

[1] Braund, Mr. Owen Harris

891 Levels: Abbing, Mr. Anthony Abbott, Mr. Rossmore Edward ... Zimmerman, Mr. Leo

Where previously we only accessed groups of passengers through subsetting, here we access an individual by using row number 1 as the index. Okay, it's almost certain that no one else on board had that name, but what might they share with others? Well, there were surely plenty of other gentlemen aboard. Perhaps passengers' titles can give us a little more insight.

If we scroll through the dataset we see many more titles, including Miss, Mrs, Master, and even a Countess! The title "Master" is a bit outdated now, but back in those days it was reserved for unmarried boys. Moreover, nobility such as our Countess might have acted differently to the lowly proletariat. There seem to be a fair few possible patterns here that dig deeper than the combinations of age, sex and so forth that we looked at before.

To extract these titles into new variables, we need to perform the same operations on both the training and test sets, so that the features are available both for growing our decision tree and for making predictions on the unseen test data. An easy way to apply the same procedure to both datasets at once is to merge them. In R we can use rbind, which stands for row bind, as long as the two data frames have the same columns. Since the test set obviously lacks the Survived column, let's create one full of missing values (NAs) and then bind the two datasets' rows together:

> test$Survived <- NA

> combi <- rbind(train, test)

Now we have a new data frame called "combi" with all the same rows as the original two datasets, stacked in the order we specified: train first, then test.

If you look back at our output for Owen, his name is still encoded as a factor. As we mentioned earlier in this tutorial series, strings are automatically imported as factors in R, even when that doesn't make sense. So we need to cast this column back to a text string. To do this we use as.character. Let's do that, and then look at Owen again:

> combi$Name <- as.character(combi$Name)

> combi$Name[1]

[1] "Braund, Mr. Owen Harris"

To break apart the string, we need some hooks to tell the program what to look for. Conveniently, we see a comma right after the person's last name, and a full stop after their title. We can easily split the original name on these two symbols with the function strsplit, which stands for string split. Let's try it on Mr. Braund:

> strsplit(combi$Name[1], split='[,.]')

[[1]]

[1] "Braund" " Mr" " Owen Harris"

Okay. Here we sent strsplit the cell of interest and gave it some symbols to choose from when splitting the string: either a comma or a period. Those symbols inside the square brackets form a regular expression, and while this one is about as simple as they get, I definitely recommend getting comfortable with regular expressions if you're going to work with a lot of text!

We see that the title has been broken out on its own, though there is a strange space before it begins because the comma occurred at the end of the surname. But how do we get to that title piece and clear out the rest of the stuff we don't want? The [[1]] printed before the text segments is an index into a new type of container. Let's try to dig into it by appending those double square brackets to the original command:

> strsplit(combi$Name[1], split='[,.]')[[1]]

[1] "Braund" " Mr" " Owen Harris"

strsplit uses this doubly nested structure because it can never be certain that a given regular expression will produce the same number of pieces from every string. If a name contained more commas or periods, more segments would be created, so the results are hidden a layer deeper to preserve the rectangular type of container we're used to, like spreadsheets or, now, data frames! Let's dive through the indexing mess and extract the title. It's the second item in the nested list, so let's dig into index number 2 of this new container:

> strsplit(combi$Name[1], split='[,.]')[[1]][2]

[1] " Mr"

Since we had to dig into the container to get to the title, simply trying combi$Title <- strsplit(combi$Name, split='[,.]')[[1]][2] over the entire Name vector would leave every row with the same "Mr", so we need to work a bit harder. Unsurprisingly, R's apply suite of functions exists to apply a function to a large number of cells in a data frame or vector:

> combi$Title <- sapply(combi$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})

R's apply functions all work in slightly different ways, but sapply works well here. We feed sapply our vector of names and the function we just devised. It runs through the rows of the name vector and sends each name to the function. The results of all those string splits are combined into a vector as sapply's output, which we then store in a new column of the original data frame called Title.
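To see what sapply is doing, here is the same pattern applied to just two names from the dataset (a quick sanity check, not part of the original post):

> sapply(c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"), FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})

Braund, Mr. Owen Harris  Heikkinen, Miss. Laina 
                  " Mr"                 " Miss" 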

Finally, we may want to strip those spaces from the beginning of the titles. Here we can simply replace the first occurrence of a space with nothing. We can use sub for this:

> combi$Title <- sub(' ', '', combi$Title)

Okay, we now have a nice new Title column; let's take a look at it:

> table(combi$Title)

        Capt          Col          Don         Dona           Dr     Jonkheer         Lady 
           1            4            1            1            8            1            1 
       Major       Master         Miss         Mlle          Mme           Mr          Mrs 
           2           61          260            2            1          757          197 
          Ms          Rev          Sir the Countess 
           2            8            1            1 

Well, there are some very rare titles in here that won't give our model much to work with, so let's combine some of the most unusual ones into broader categories. We'll start with the two French ladies' titles, Madame and Mademoiselle:

> combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'

What did we do here? The %in% operator checks whether a value is part of the vector we're comparing it to. So here we combined the two titles "Mme" and "Mlle" into a temporary vector using the c() function, checked whether any existing title in the whole Title column matched either of them, and then replaced every match with "Mlle".
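A quick toy illustration of %in%, just to make the mechanics concrete (not from the original post):

> c('Mr', 'Mme', 'Miss') %in% c('Mme', 'Mlle')
[1] FALSE  TRUE FALSE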

We're always on the lookout for redundancy, and for our set of titles here, the very rich end of the spectrum looks like a good target. For the men we have a handful of rarely blessed titles, each held by only one or two passengers: Captain, Don, Major and Sir. All of these are either military titles or belonged to rich fellows born with vast tracts of land.
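The post doesn't show the matching code for combining the men's titles; a minimal sketch following the same %in% pattern used above, collapsing them all into 'Sir' (the choice of 'Sir' as the target label is an assumption):

> combi$Title[combi$Title %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'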

For the ladies, we have Dona, Lady, Jonkheer (* see comments below), and of course our Countess. All of these people were wealthy and probably behaved somewhat similarly due to their noble birth. Let's combine this group too, reducing the number of factor levels to something a decision tree might find meaningful:

> combi$Title[combi$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'

Our final step is to change the variable type back to a factor, as that's essentially what these new categories are:

> combi$Title <- factor(combi$Title)

Okay. We're now done with the passenger titles. What else can we dream up? Well, there are the two variables SibSp and Parch, which indicate how many family members each passenger is travelling with. It seems reasonable to think that a large family might have trouble keeping track of little Johnny as everyone scrambles off the sinking ship, so let's combine the two variables into a new one, FamilySize:

> combi$FamilySize <- combi$SibSp + combi$Parch + 1

Very simple! We just add the number of siblings, spouses, parents and children the passenger has with them, plus one for the passenger's own existence, and we have a new variable indicating the size of the family they're travelling with.

More? Okay, so far we've only thought about large families having a lifeboat problem in general, but maybe specific families had more trouble than others? We could try extracting passengers' surnames and grouping on them to find families, but a common surname like Johnson might pull unrelated people into the same group. In fact, there were three Johnsons in one family of size 3, and another three probably unrelated Johnsons all travelling solo.

Combining the surname with the family size should solve this problem: no two Johnson families should share the same FamilySize value on such a small ship. Let's first extract the passengers' surnames. This is a very simple change to the title-extraction code we ran earlier; now we just want the first piece of the strsplit output:

> combi$Surname <- sapply(combi$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})

We then want to attach the FamilySize variable to the front of it, but string operations need strings. So let's temporarily convert the FamilySize variable to a string and combine it with the Surname to get our new FamilyID variable:

> combi$FamilyID <- paste(as.character(combi$FamilySize), combi$Surname, sep="")

We used the function paste to bring the two strings together, and told it to separate them with nothing via the sep argument. The result was stored in a new column called FamilyID. But those three solo Johnsons would all still share the same family ID. Given our initial assumption that only large families might struggle to stick together in the panic, let's knock out any family of size two or less and call it a "Small" family. That also solves the Johnson problem.

> combi$FamilyID[combi$FamilySize <= 2] <- 'Small'

Let's see how well we did at identifying these family groups:

> table(combi$FamilyID)

   11Sage   3Abbott 3Appleton 3Beckwith   3Boulos 
       11         3         1         2         3 
  3Bourke    3Brown 3Caldwell  3Christy  3Collyer 
        3         4         3         2         3 
 3Compton  3Cornell   3Coutts   3Crosby   3Danbom 
        3         1         3         3         3 
...

Hmm, a few seem to have slipped through the cracks here. Many FamilyIDs have only one or two members, even though we only wanted groups of three or more. Perhaps some families used different surnames, but either way, these one- or two-person groups are exactly what our three-person cut-off was meant to avoid. Let's start cleaning them up:

> famIDs <- data.frame(table(combi$FamilyID))

Now we've stored that table as a data frame. Yes, you can store most tables in a data frame if you want to, so let's take a look at it by clicking on it in the explorer:
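If you'd rather inspect it from the console, head works too. Note that data.frame gives a table's output two default column names, Var1 (here, the FamilyID string) and Freq (the count); we'll rely on Var1 in a moment:

> head(famIDs)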

Here again we see all those naughty families that don't fit our hypothesis, so let's subset this data frame down to only the unexpectedly small FamilyID groups:

> famIDs <- famIDs[famIDs$Freq <= 2,]

We then need to overwrite any family IDs in our dataset that belong to these incorrectly identified groups, and finally convert the column to a factor:
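The code block for this step appears to be missing from the post; a minimal sketch of what it would look like, relying on the default Var1 column name mentioned above:

> combi$FamilyID[combi$FamilyID %in% famIDs$Var1] <- 'Small'
> combi$FamilyID <- factor(combi$FamilyID)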

We are now ready to split the test and training sets back into their original states, taking our new engineered variables with them. The nicest part of what we just did comes down to how R handles factors. Behind the scenes, factors are basically stored as integers, but masked with their text names for us to look at. If you create factors on separate test and training sets, there is no guarantee that the same categories exist in both.

Take the "3Johnson" family we discussed earlier: perhaps no one from that family exists in the test set, and we know from the training data that all three of them survived. If we had built our factors in isolation, the test set would have no "3Johnson" factor level. This would upset any machine learning model, because the factor levels of the training set used to build the model would be inconsistent with those of the test set it is asked to predict. In fact, R will throw an error at you if you try.

Because we built the factors on a single combined data frame and only split it apart after building them, R will give all the factor levels to both new data frames, even if a given level has no observations in one of them. The level is still there, just with no actual occurrences in that set. Neat trick, right? I can assure you that manually updating factor levels is a pain.
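A tiny illustration of this behaviour (toy data, not from the original post): subsetting a factor keeps all the original levels by default, even those with no remaining observations:

> f <- factor(c('a', 'b', 'c'))
> levels(f[1:2])
[1] "a" "b" "c"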

So let's split them back apart and make some predictions with our fancy new engineered variables:
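The splitting code itself doesn't appear in the post; a sketch assuming the standard Kaggle Titanic sizes of 891 training rows and 418 test rows:

> train <- combi[1:891,]
> test <- combi[892:1309,]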

Here we used yet another subsetting method in R; there are many ways to slice and dice data depending on what you need. We isolated specific row ranges of the combined dataset based on the sizes of the original train and test sets. Leaving the spot after the comma blank indicates that we want all the columns of that subset, which we then stored in the named data frames. This gives us back the original numbers of rows, along with all the new variables and consistent factor levels.

Time to make our predictions! We have a bunch of new variables, so let's send them into a new decision tree. The default complexity worked fine last time, so let's just grow a tree with the vanilla controls and see what it can do:
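The tree-growing code is not shown in the post; a sketch assuming, as in the earlier parts of this series, an rpart classification tree with default controls and the new variables added to the formula:

> library(rpart)
> fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID, data=train, method="class")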

Interestingly, our new variables have practically taken over the tree. This reveals another disadvantage of decision trees that I didn't mention last time: they are biased towards favouring factors with many levels. Look at how prominent our 61-level FamilyID factor is here; a decision node can cherry-pick which families go to which side of a split, slicing the data into whatever combination best purifies the nodes below.

But beyond that, you should already know how to create a submission from a decision tree, so let's see how this one performs!
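A sketch of the submission step, following the pattern from the earlier parts of this series (the output file name here is just an example):

> Prediction <- predict(fit, test, type = "class")
> submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
> write.csv(submit, file = "featureengineering.csv", row.names = FALSE)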

Great! We've nearly halved our leaderboard rank, all by squeezing more value out of what we already had. And this is just a taste of what you can find in this dataset.

Keep trying to create more engineered variables! As before, I also strongly encourage you to play around with the complexity parameters and perhaps try pruning some deeper trees, to see whether it helps or hurts your position. You might even consider excluding some variables from the tree to see how that changes things.

In most cases, though, the greedy nature of decision trees means the title or gender variable will dominate the first split. The bias towards many-level factors won't go away either, and overfitting is difficult to gauge without actually making submissions, but good judgment will help.

If you have any questions, please leave a message below!