Background

Supervised machine learning problems fall mainly into classification and regression. The previous article introduced classification through examples with decision trees and random forests. This time, let's forecast housing prices and practice building a regression model in R.

The data set

The competition used in this article is at www.kaggle.com/c/house-pri…

The competition provides 80 features for nearly 1,500 homes that have been sold and asks us to predict the sale price from those features. The data set has a large number of feature fields. Besides basic information such as location, area, and number of floors, it includes features such as the basement, the distance from the street, and the exterior wall material, which buyers in China hardly consider at all. In China, where housing prices are so high, location and area are basically enough to estimate the price.

Getting familiar with the data

Before building the model, let's get familiar with the data: which values are missing and how the variables are distributed.

First download the training data and test data, put them in the directory D:/RData/House/, and then merge the training data and test data. SalePrice is the housing price field to be predicted this time.

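# stringsAsFactors = TRUE keeps text columns as factors (no longer the default since R 4.0)
train <- read.csv("D:/RData/House/train.csv", stringsAsFactors = TRUE)
test <- read.csv("D:/RData/House/test.csv", stringsAsFactors = TRUE)
# The test data has no SalePrice column, so add it as NA before merging
test$SalePrice <- NA
all <- rbind(train, test)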

Let's start by looking at the variables. There are many of them; the variable description file explains what each one means.

str(all)

Results:

'data.frame':   2919 obs. of  81 variables:
 $ Id   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
 $ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
 $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
 $ LotArea  : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
 $ Street   : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alley: Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
 $ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
 $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Utilities: Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
 $ LotConfig: Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
 $ LandSlope: Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
 $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
 $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
 $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
 $ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
 $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
 $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
 $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
 $ YearBuilt: int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
 $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
 $ RoofStyle: Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
 $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
 $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
 $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
 $ ExterQual: Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
 $ ExterCond: Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
 $ BsmtQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
 $ BsmtCond : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
 $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
 $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
 $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
 $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
 $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
 $ BsmtUnfSF: int  150 284 434 540 490 64 317 216 952 140 ...
 $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
 $ Heating  : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ HeatingQC: Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
 $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
 $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
 $ X1stFlrSF: int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
 $ X2ndFlrSF: int  854 0 866 756 1053 566 0 983 752 0 ...
 $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GrLivArea: int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
 $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
 $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
 $ FullBath : int  2 2 2 1 2 1 2 2 2 1 ...
 $ HalfBath : int  1 0 1 0 1 1 0 1 0 0 ...
 $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
 $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
 $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
 $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
 $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
 $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
 $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
 $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
 $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
 $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
 $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
 $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
 $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
 $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
 $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
 $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
 $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
 $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
 $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolArea : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolQC   : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
 $ Fence: Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
 $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
 $ MiscVal  : int  0 0 0 0 0 700 0 350 0 0 ...
 $ MoSold   : int  2 5 9 2 12 10 8 11 4 1 ...
 $ YrSold   : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
 $ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
 $ SalePrice: int  208500 181500 223500 140000 250000 143000 307000 200000 1299

The variables fall into two main types: numeric (integer) and factor.

res <- sapply(all, class)
table(res)

Results:

 factor integer
 43      38

In total, the data set has 81 variables and 2,919 records: 43 factor variables and 38 numeric variables.

Feature processing

It can be seen from the above variable values that there are many variables in the data set with missing values, so the first step is to deal with missing values.

First, sort the variables by their number of missing values.

# Count the missing values in each variable
res <- sapply(all, function(x) sum(is.na(x)))
# Sort in decreasing order and show only variables with missing values
miss <- sort(res, decreasing = TRUE)
miss[miss > 0]

Results (only variables with missing values are listed; the comments were added by hand):

# variable     missing  proportion  meaning
PoolQC           2909     100%   # pool quality
MiscFeature      2814      96%   # special facilities
Alley            2721      93%   # alley near the house
Fence            2348      80%   # fence of the house
FireplaceQu      1420      49%   # quality of fireplaces
LotFrontage       486      17%   # distance between the house and the street
GarageYrBlt       159       5%   # garage
GarageFinish      159       5%
GarageQual        159       5%
GarageCond        159       5%
GarageType        157       5%
BsmtCond           82       3%   # basement
BsmtExposure       82       3%
BsmtQual           81       3%
BsmtFinType2       80       3%
BsmtFinType1       79       3%
MasVnrType         24       1%   # exterior decoration
MasVnrArea         23       1%
MSZoning            4       0%   # other
Utilities           2       0%
BsmtFullBath        2       0%
BsmtHalfBath        2       0%
Functional          2       0%
Exterior1st         1       0%
Exterior2nd         1       0%
BsmtFinSF1          1       0%
BsmtFinSF2          1       0%
BsmtUnfSF           1       0%
TotalBsmtSF         1       0%
Electrical          1       0%
KitchenQual         1       0%
GarageCars          1       0%
GarageArea          1       0%
SaleType            1       0%
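The text speaks of the proportion of missing values, while the snippet above returns raw counts. A minimal sketch of how both could be put side by side follows; the toy data frame `df` and the name `miss_tbl` are illustrative assumptions, not part of the original analysis.

```r
# Toy data frame with some missing values (hypothetical values)
df <- data.frame(
  PoolQC      = factor(c(NA, NA, NA, "Ex")),
  LotFrontage = c(65, NA, 68, 60),
  LotArea     = c(8450, 9600, 11250, 9550)
)

# Number of missing values per column
n_miss <- sapply(df, function(x) sum(is.na(x)))

# One row per variable: count and percentage of rows that are missing
miss_tbl <- data.frame(
  variable   = names(n_miss),
  missing    = as.integer(n_miss),
  proportion = round(100 * unname(n_miss) / nrow(df)),
  row.names  = NULL
)

# Sort by count, largest first, and keep only variables with gaps
miss_tbl <- miss_tbl[order(-miss_tbl$missing), ]
miss_tbl[miss_tbl$missing > 0, ]
```

On the merged data set the same steps would be applied to `all` instead of `df`.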

Next, view a summary of every variable that has missing values.

summary(all[, names(miss)[miss > 0]])

Results:

PoolQC      MiscFeature Alley       Fence        SalePrice        FireplaceQu
Ex  :   4   Gar2:   5   Grvl: 120   GdPrv: 118   Min.   : 34900   Ex  :  43
Fa  :   2   Othr:   4   Pave:  78   GdWo : 112   1st Qu.:129975   Fa  :  74
Gd  :   4   Shed:  95   NA's:2721   MnPrv: 329   Median :163000   Gd  : 744
NA's:2909   TenC:   1               MnWw :  12   Mean   :180921   Po  :  46
            NA's:2814               NA's :2348   3rd Qu.:214000   TA  : 592
                                                 Max.   :755000   NA's:1420
                                                 NA's   :1459

LotFrontage      GarageYrBlt    GarageFinish  GarageQual  GarageCond
Min.   : 21.00   Min.   :1895   Fin : 719     Ex  :   3   Ex  :   3
1st Qu.: 59.00   1st Qu.:1960   RFn : 811     Fa  : 124   Fa  :  74
Median : 68.00   Median :1979   Unf :1230     Gd  :  24   Gd  :  15
Mean   : 69.31   Mean   :1978   NA's: 159     Po  :   5   Po  :  14
3rd Qu.: 80.00   3rd Qu.:2002                 TA  :2604   TA  :2654
Max.   :313.00   Max.   :2207                 NA's: 159   NA's: 159
NA's   :486      NA's   :159

GarageType     BsmtCond    BsmtExposure  BsmtQual    BsmtFinType2  BsmtFinType1
2Types :  23   Fa  : 104   Av  : 418     Ex  : 258   ALQ :  52     ALQ : 429
Attchd :1723   Gd  : 122   Gd  : 276     Fa  :  88   BLQ :  68     BLQ : 269
Basment:  36   Po  :   5   Mn  : 239     Gd  :1209   GLQ :  34     GLQ : 849
BuiltIn: 186   TA  :2606   No  :1904     TA  :1283   LwQ :  87     LwQ : 154
CarPort:  15   NA's:  82   NA's:  82     NA's:  81   Rec : 105     Rec : 288
Detchd : 779                                         Unf :2493     Unf : 851
NA's   : 157                                         NA's:  80     NA's:  79

MasVnrType     MasVnrArea       MSZoning       Utilities     BsmtFullBath
BrkCmn :  25   Min.   :   0.0   C (all):  25   AllPub:2916   Min.   :0.0000
BrkFace: 879   1st Qu.:   0.0   FV     : 139   NoSeWa:   1   1st Qu.:0.0000
None   :1742   Median :   0.0   RH     :  26   NA's  :   2   Median :0.0000
Stone  : 249   Mean   : 102.2   RL     :2265                 Mean   :0.4299
NA's   :  24   3rd Qu.: 164.0   RM     : 460                 3rd Qu.:1.0000
               Max.   :1600.0   NA's   :   4                 Max.   :3.0000
               NA's   :23                                    NA's   :2

BsmtHalfBath      Functional     Exterior1st    Exterior2nd    BsmtFinSF1
Min.   :0.00000   Typ    :2717   VinylSd:1025   VinylSd:1014   Min.   :   0.0
1st Qu.:0.00000   Min2   :  70   MetalSd: 450   MetalSd: 447   1st Qu.:   0.0
Median :0.00000   Min1   :  65   HdBoard: 442   HdBoard: 406   Median : 368.5
Mean   :0.06136   Mod    :  35   Wd Sdng: 411   Wd Sdng: 391   Mean   : 441.4
3rd Qu.:0.00000   Maj1   :  19   Plywood: 221   Plywood: 270   3rd Qu.: 733.0
Max.   :2.00000   (Other):  11   (Other): 369   (Other): 390   Max.   :5644.0
NA's   :2         NA's   :   2   NA's   :   1   NA's   :   1   NA's   :1

BsmtFinSF2        BsmtUnfSF        TotalBsmtSF      Electrical   KitchenQual
Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   FuseA: 188   Ex  : 205
1st Qu.:   0.00   1st Qu.: 220.0   1st Qu.: 793.0   FuseF:  50   Fa  :  70
Median :   0.00   Median : 467.0   Median : 989.5   FuseP:   8   Gd  :1151
Mean   :  49.58   Mean   : 560.8   Mean   :1051.8   Mix  :   1   TA  :1492
3rd Qu.:   0.00   3rd Qu.: 805.5   3rd Qu.:1302.0   SBrkr:2671   NA's:   1
Max.   :1526.00   Max.   :2336.0   Max.   :6110.0   NA's :   1
NA's   :1         NA's   :1        NA's   :1

GarageCars      GarageArea       SaleType
Min.   :0.000   Min.   :   0.0   WD     :2525
1st Qu.:1.000   1st Qu.: 320.0   New    : 239
Median :2.000   Median : 480.0   COD    :  87
Mean   :1.767   Mean   : 472.9   ConLD  :  26
3rd Qu.:2.000   3rd Qu.: 576.0   CWD    :  12
Max.   :5.000   Max.   :1488.0   (Other):  29
NA's   :1       NA's   :1        NA's   :   1

Many variables have missing data. They can be handled in the following ways:

Remove variables with a large number of missing values directly from the data set

PoolQC, MiscFeature, Alley, Fence, and FireplaceQu have many missing values because most houses have no swimming pool, special facilities, side alley, fence, or fireplace. Because so many values are missing, we remove these variables directly.

# Drop the variables with too many missing values
Drop <- names(all) %in% c("PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu")
all <- all[!Drop]
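The variables to drop are picked by hand here. As a rough alternative sketch, the same decision could be driven by a cutoff on the share of missing values; the cutoff of 0.4 and the toy data frame below are assumptions for illustration only.

```r
# Toy data frame (hypothetical values)
df <- data.frame(
  PoolQC  = factor(c(NA, NA, NA, NA, "Ex")),
  Fence   = factor(c(NA, NA, NA, "GdPrv", "MnPrv")),
  LotArea = c(8450, 9600, 11250, 9550, 14260)
)

cutoff <- 0.4                       # assumed threshold, not from the article
miss_share <- colMeans(is.na(df))   # share of missing rows per column

# Keep only the columns at or below the cutoff
df_kept <- df[, miss_share <= cutoff, drop = FALSE]
names(df_kept)
```

On the merged data set, SalePrice would have to be excluded from such a rule, because it is NA for every test row by construction.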

Treat NA as a new factor level

By checking the variable description file, it can be seen that the five garage-related variables GarageType, GarageYrBlt, GarageFinish, GarageQual and GarageCond are also missing because the house has no garage.

Similarly, BsmtExposure, BsmtFinType2, BsmtQual, BsmtCond and BsmtFinType1 are all about the basement, which is missing because the house has no basement.

The number of missing values in this group is small, so each missing value is simply replaced with a new level, None.

# Fill with None
Garage <- c("GarageType", "GarageQual", "GarageCond", "GarageFinish")
Bsmt <- c("BsmtExposure", "BsmtFinType2", "BsmtQual", "BsmtCond", "BsmtFinType1")
for (x in c(Garage, Bsmt)) {
  # Register "None" as a level first, then assign it to the missing entries
  all[[x]] <- factor(all[[x]], levels = c(levels(all[[x]]), "None"))
  all[[x]][is.na(all[[x]])] <- "None"
}
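The loop above extends the levels before assigning, and the reason may not be obvious: a factor only accepts values that are already among its levels. The sketch below shows the same pattern on a toy factor (the values are hypothetical), using `levels<-` as an equivalent way of registering the new level.

```r
# Toy factor with missing entries (hypothetical values)
x <- factor(c("Attchd", NA, "Detchd", NA))

# Assigning a value that is not an existing level would produce NA
# (with a warning), so "None" has to be added to the levels first.
levels(x) <- c(levels(x), "None")
x[is.na(x)] <- "None"

table(x)
```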

GarageYrBlt is the year the garage was built; we replace its missing values with the year the house was built.

all$GarageYrBlt[is.na(all$GarageYrBlt)] <- all$YearBuilt[is.na(all$GarageYrBlt)]

Fill in the missing values manually

For the remaining variables, we check the data one by one and handle them as follows.

Variable LotFrontage: distance from the house to the street

This is a numeric variable, so we fill it with the median.

all$LotFrontage[is.na(all$LotFrontage)] <- median(all$LotFrontage, na.rm = TRUE)
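The summary above shows a LotFrontage maximum of 313 against a median of 68, which is one reason the median is a reasonable fill value: it is not pulled by a few extreme lots. A small sketch of the same fill, wrapped in a helper; the name `fill_median` is hypothetical, and the toy vector reuses the first ten LotFrontage values from the str() output.

```r
# Hypothetical helper: replace NA in a numeric vector with the median
fill_median <- function(v) {
  stopifnot(is.numeric(v))
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}

# First ten LotFrontage values from the str() output
lot_frontage <- c(65, 80, 68, 60, 84, 85, 75, NA, 51, 50)
lot_filled <- fill_median(lot_frontage)
lot_filled
```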

Variable MasVnrType: exterior wall decoration material

This variable should have little effect on the price; NA in MasVnrType is replaced with the level None, which already exists for this variable.

# Fill with None
all[["MasVnrType"]][is.na(all[["MasVnrType"]])] <- "None"

Variable MasVnrArea: area of the exterior wall decoration material

These missing values correspond to the None value of MasVnrType, so NA should be replaced with 0.

# Fill with 0
all[["MasVnrArea"]][is.na(all[["MasVnrArea"]])] <- 0

The variable Utilities has no discriminating power (almost every house is AllPub), so it is discarded

# Drop Utilities
all$Utilities <- NULL

The variables BsmtFullBath, BsmtHalfBath, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, GarageCars, and GarageArea are missing because the house has no corresponding facility. They are all numeric, so they can be filled with 0.

# The quantity is missing because the facility does not exist, so fill with 0
Param0 <- c("BsmtFullBath", "BsmtHalfBath", "BsmtFinSF1", "BsmtFinSF2",
            "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "GarageArea")
for (x in Param0) all[[x]][is.na(all[[x]])] <- 0

The variables MSZoning, Functional, Exterior1st, Exterior2nd, KitchenQual, Electrical, and SaleType are factors with only a few missing values, so we fill each with its most frequent level.

# Fill with the most frequent level
Req <- c("MSZoning", "Functional", "Exterior1st", "Exterior2nd",
         "KitchenQual", "Electrical", "SaleType")
for (x in Req) all[[x]][is.na(all[[x]])] <- levels(all[[x]])[which.max(table(all[[x]]))]
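The one-line loop body is dense, so here is the same idea as a small helper applied to a toy factor. The name `fill_mode` and the sample values are assumptions for illustration; the logic follows the loop above.

```r
# Hypothetical helper: replace NA in a factor with its most frequent level
fill_mode <- function(f) {
  stopifnot(is.factor(f))
  # table() ignores NA by default, so the mode is taken over observed values
  top <- names(which.max(table(f)))
  f[is.na(f)] <- top
  f
}

zoning <- factor(c("RL", "RM", NA, "RL", "FV"))
zoning_filled <- fill_mode(zoning)
table(zoning_filled)
```

If two levels tie for the highest count, `which.max()` picks the first one in level order.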

Generating the training set

After filling in the missing values, 75 variables remain and none of the features has missing data; only SalePrice is still NA for the test rows. We split the data set back into a training set and a test set according to whether SalePrice is NA, ready for model training.

train <- all[!is.na(all$SalePrice), ]
test <- all[is.na(all$SalePrice), ]
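Before training, it can be worth confirming that the split did what was intended. A minimal sketch on a toy merged data frame (all values hypothetical) checks that no feature still contains NA and that the two parts add up to the whole.

```r
# Toy merged data: three training rows, two test rows (hypothetical values)
all_toy <- data.frame(
  Id        = 1:5,
  LotArea   = c(8450, 9600, 11250, 11622, 14267),
  SalePrice = c(208500, 181500, 223500, NA, NA)
)

is_test   <- is.na(all_toy$SalePrice)
train_toy <- all_toy[!is_test, ]
test_toy  <- all_toy[is_test, ]

# Every column except the target should be free of NA at this point
predictors <- setdiff(names(all_toy), "SalePrice")
stopifnot(!any(is.na(all_toy[, predictors])))

# The two parts should account for every row
stopifnot(nrow(train_toy) + nrow(test_toy) == nrow(all_toy))
c(train = nrow(train_toy), test = nrow(test_toy))
```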

The regression model

The main problem of linear regression is the choice of independent variables. Selecting characteristic variables that are highly relevant to the final predicted response variables is the first step to model success. There are many methods for variable selection, but the most critical and straightforward one is manual selection by the analyst based on the business scenario. We first try this variable selection approach as a first step in our model.
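To make the idea of manual selection concrete, the sketch below fits a linear model on three hand-picked variables. The choice of GrLivArea, OverallQual, and YearBuilt is an assumption made for illustration, not necessarily the selection used in this series, and the data is synthetic so that the block runs without the Kaggle files.

```r
set.seed(1)
n <- 200

# Synthetic stand-in for the training set (values are made up)
toy <- data.frame(
  GrLivArea   = round(runif(n, 600, 3000)),
  OverallQual = sample(1:10, n, replace = TRUE),
  YearBuilt   = sample(1900:2010, n, replace = TRUE)
)
toy$SalePrice <- 20000 + 60 * toy$GrLivArea + 15000 * toy$OverallQual +
  rnorm(n, sd = 20000)

# Linear regression on the manually selected variables
fit <- lm(SalePrice ~ GrLivArea + OverallQual + YearBuilt, data = toy)
summary(fit)$coefficients

# Predict prices for a few rows, as would be done for the test set
pred <- predict(fit, newdata = toy[1:3, ])
pred
```

With the real data, `toy` would be replaced by `train` and `newdata` by `test`.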

House Prices: Advanced Regression Techniques (2)