preface

This is Lin Xiaobian’s new series ~ With the winter vacation coming up, Xiaobian has long wanted to systematically study how R is applied to machine learning, mainly from the perspective of algorithms and R packages, and to share these learning notes. Criticism, corrections, and discussion are all welcome. The main reference book is “Machine Learning with R, the tidyverse, and mlr”, which revolves around two very important R packages, mlr and tidyverse; interested readers can install them first:

install.packages("mlr", dependencies = TRUE)
install.packages("tidyverse")

Among them, mlr contains a surprisingly large number of machine learning algorithms and greatly simplifies all of our machine learning tasks. The tidyverse is a “collection of R packages designed for data science”, created to make data science tasks in R simpler, more human-friendly, and more reproducible.

This issue starts with the commonly used k-nearest neighbors algorithm!

1. Introduction to the k-nearest neighbors algorithm

The k-nearest neighbors (KNN) algorithm is a theoretically mature classification algorithm and one of the simplest machine learning algorithms. The idea: in a feature space, a sample is assigned to a category if most of the k samples nearest to it (nearest, that is, in the feature space) belong to that category.

In other words, given a training data set, find the k instances in it that are closest to a newly input sample; if most of these k instances belong to some class, the new sample is assigned to that class as well.
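To make the idea concrete, here is a toy, hand-rolled illustration in base R (the data points and labels are made up purely for demonstration):

train_x <- matrix(c(1, 1,  1, 2,  4, 4,  5, 5), ncol = 2, byrow = TRUE)  # 4 training points
train_y <- factor(c("A", "A", "B", "B"))                                 # their labels
new_x   <- c(1.5, 1.5)                                                   # a new, unlabeled sample
d <- sqrt(rowSums((train_x - matrix(new_x, 4, 2, byrow = TRUE))^2))      # Euclidean distances
nearest <- order(d)[1:3]   # indices of the k = 3 nearest training points
train_y[nearest]           # two "A"s and one "B", so the new sample is predicted as "A"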

2. Basic elements of KNN algorithm

In the KNN algorithm, the selected neighboring instances are all correctly labeled objects. The algorithm determines the category of a sample to be classified using only the categories of its nearest one or several instances. The classifier needs no training on the training set, so the training cost is essentially zero; classifying a new sample, however, has time complexity O(n), since distances to all n training instances must be computed. The choice of k, the distance metric, and the classification decision rule are the three basic elements of the algorithm:

2.1 Selection of K value

It is easy to see that the choice of k has a significant impact on the algorithm's results. A small k means that only training instances very close to the sample being classified influence the prediction, but overfitting occurs easily. A large k means that training instances far from the sample also influence the prediction, which can introduce prediction errors. In practical applications, k is usually chosen fairly small (typically less than 20), and in practice cross-validation is often used to select the optimal k.
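As a sketch of how cross-validation can guide the choice of k, the knn.cv() function from the class package performs leave-one-out cross-validation (this uses the built-in iris data purely for illustration; it is not the mlr workflow used later in this post):

library(class)
accs <- sapply(1:20, function(k) {
  preds <- knn.cv(train = iris[, 1:4], cl = iris$Species, k = k)  # leave-one-out predictions
  mean(preds == iris$Species)                                     # accuracy for this k
})
which.max(accs)  # the candidate k with the highest cross-validated accuracy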

2.2 Distance Measurement

Distance metrics include the Euclidean distance, the Minkowski distance, and the Mahalanobis distance, among others. Since all norms on R^n are equivalent, there is no need to agonize over which one to choose. Before measuring distances, the values of each attribute should be normalized, which helps prevent attributes with a larger initial range from outweighing those with a smaller initial range.
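A quick sketch of why normalization matters, using base R's dist() and scale() (the toy income/age numbers are invented for illustration):

x <- data.frame(income = c(30000, 32000, 90000), age = c(25, 60, 27))
dist(x)         # Euclidean distances are dominated almost entirely by income
dist(scale(x))  # after standardizing each column, both attributes contribute comparably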

2.3 Classification decision rules

The classification decision rule in this algorithm is usually majority voting: the majority class among the k nearest training instances of the input sample decides the classification of the sample to be classified.
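In code, majority voting is just counting labels and taking the most frequent one; a minimal standalone sketch:

neighbor_labels <- factor(c("A", "A", "B"))  # labels of the k = 3 nearest neighbors
votes <- table(neighbor_labels)              # votes received by each class
names(votes)[which.max(votes)]               # the winning class: "A"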

3. Application Examples

This post first introduces how to use the KNN algorithm in the mlr package, taking the diabetes data set in the mclust package as an example.

3.1 Loading Data

library(mclust)
library(tibble)  # part of the tidyverse; stores and displays data in a sensible way
data(diabetes, package = "mclust")   # load the data
diabetesTib <- as_tibble(diabetes)   # convert to tibble form
diabetesTib
# A tibble: 145 x 4
   class  glucose insulin  sspg
   <fct>    <dbl>   <dbl> <dbl>
 1 Normal      80     356   124
 2 Normal      97     289   117
 3 Normal     105     319   143
 4 Normal      90     356   199
 5 Normal      90     323   240
 6 Normal      86     381   157
 7 Normal     100     350   221
 8 Normal      85     301   186
 9 Normal      97     379   142
10 Normal      97     296   131
# ... with 135 more rows

This data set has 145 instances and 4 variables. The class factor shows that 76 cases are non-diabetic (Normal), 36 cases are Chemical, and 33 cases are Overt. The other three variables are measurements from glucose tolerance tests (glucose and insulin) and the steady-state plasma glucose level (sspg).
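A quick check of these counts (summary() on a factor returns the number of cases per level):

summary(diabetesTib$class)  # should show 76 Normal, 36 Chemical and 33 Overt cases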

Note: The tibble package introduces a new data structure. For more information about this package and the new data structure, see chapter 2 of the reference book or the package's official help pages.

3.2 Diagram analysis

To understand the relationships between these variables, plot them using the ggplot2 package commonly used in R.

library(ggplot2)
ggplot(diabetesTib, aes(glucose, insulin, col = class)) +
  geom_point() + theme_bw()
# change the variables to sspg and insulin
ggplot(diabetesTib, aes(sspg, insulin, col = class)) +
  geom_point() + theme_bw()
# change the variables to sspg and glucose
ggplot(diabetesTib, aes(sspg, glucose, col = class)) +
  geom_point() + theme_bw()

The figures show that the three classes differ in these continuous variables. Next, a KNN classifier will be constructed and used to predict the diabetes status of future patients.

3.3 Use mlr to train a KNN model

There are three main stages in building a machine learning model with this package:

  • Define the task. The task consists of the data and what we want to do with it. In this case, the data is diabetesTib, and we want to classify the data with the variable class as the target variable.

  • Define the learner. The learner is simply the name of the algorithm you plan to use, along with any other parameters the algorithm accepts.

  • Train the model. This stage hands the task to the learner, and the learner produces a model that you can use to make predictions on future data.

3.3.1 Defining the task

The parts needed to define a task are:

  • Data containing predictive variables (we want these variables to contain the information needed to make predictions/solve problems).

  • The target variable that you want to predict.

That is:

Since a classification model is to be built, the makeClassifTask() function is used to define a classification task; makeRegrTask() and makeClusterTask() are used when building regression and clustering models, respectively.

library(mlr)
diabetesTask <- makeClassifTask(data = diabetesTib, target = "class")
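For reference, the analogous constructors for the other task types look like this (a sketch only; myData and the target name "price" are hypothetical):

# regrTask    <- makeRegrTask(data = myData, target = "price")  # regression
# clusterTask <- makeClusterTask(data = myData)                 # clustering has no target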

3.3.2 Defining the learner

The required parts to define learner are:

  • The class of algorithm used: "classif." for classification; "regr." for regression; "cluster." for clustering; "surv." and "multilabel." for survival prediction and multi-label classification (the learners available in each class can be listed, as shown after this list).

  • The algorithm used.

  • Other options used to control the algorithm.
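The full set of algorithms available for each class can be browsed with mlr's listLearners() function, for example (the output is long and trimmed here):

listLearners("classif")$class  # names of all classification learners mlr knows about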

That is:

Use the makeLearner() function to define the learner. The first argument to makeLearner() is the algorithm used to train the model; here we use the KNN algorithm, so the argument is specified as "classif.knn". The second argument, par.vals, specifies parameter values, in this case the number of nearest neighbors k that we want the algorithm to use.

knn <- makeLearner("classif.knn", par.vals = list("k" = 2))
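To see which hyperparameters a learner accepts before setting par.vals, mlr's getParamSet() function can be queried with the learner name:

getParamSet("classif.knn")  # lists the tunable parameters, including k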

3.3.3 Training model

The ingredients required to train the model are the task and the learner defined above; combining the two is the whole training process.

That is:

This is done through the train() function, which takes the learner as its first argument and the task as its second.

knnModel <- train(knn, diabetesTask)

3.4 Prediction and evaluation model

Now that we have a model, we can pass the data back through it to see how it performs. The predict() function takes unlabeled data and passes it through the model to obtain the predicted classes. The first argument of the function is the model, and the data passed to it is given by the second argument, newdata.

knnPred <- predict(knnModel, newdata = diabetesTib)

These predictions can then be passed as the first argument to the performance() function. This function compares the classes predicted by the model with the true classes and returns a performance metric of how well the predicted values match the true ones.

performance(knnPred, measures = list(mmce, acc))
     mmce       acc 
0.0137931 0.9862069

The performance metrics specified here are mmce (mean misclassification error) and acc (accuracy). mmce is the proportion of instances classified into a category other than their true one; acc, by contrast, is the proportion of instances correctly classified by the model.

It can be seen that 98.62% of the cases are correctly classified by the model.
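Since mmce is 0.0138 on 145 cases, exactly 2 instances were misclassified. mlr's calculateConfusionMatrix() function shows where they fall by cross-tabulating true against predicted classes:

calculateConfusionMatrix(knnPred)  # rows: true classes; columns: predicted classes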

Does that mean our model will perform well on new, unseen patients? We actually don't know. Evaluating a model using predictions on the very data that was used to train it says little about how the model will behave when making predictions on completely unseen data. It is therefore not sound to evaluate model performance this way.
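As a small preview of the fix (discussed properly next issue), mlr can estimate out-of-sample performance by holding out part of the data with makeResampleDesc() and resample(); a minimal sketch, with an arbitrary 2/3 split:

holdout <- makeResampleDesc("Holdout", split = 2/3)  # train on 2/3, test on the held-out 1/3
resample(learner = knn, task = diabetesTask, resampling = holdout,
         measures = list(mmce, acc))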

Xiaobian has something to say

In the next issue, we will continue with cross-validation, how to choose the parameter k to optimize the model, and how to use the knn or kknn functions in R to implement k-nearest neighbor classification and weighted k-nearest neighbor classification.