Preface

Google Chairman Eric Schmidt has said that while Google’s driverless cars and robots get a lot of press, the company’s real future lies in machine learning, a technology that makes computers smarter and more personalized.

We are probably living in the most defining period of human history: computing has moved from large mainframes to personal computers and now to the cloud. But what makes this period defining is not what has happened, but what is coming in the years ahead.

What excites people like me about this period is the democratization of tools and techniques that followed the boom in computing. Today, as a data scientist, I can build data-crunching machines with complex algorithms for a few dollars an hour. But getting here wasn’t easy: I had my share of dark days and nights.

Who can benefit the most from this guide?

This guide is intended to simplify the learning journey for aspiring data scientists and machine learning enthusiasts. This guide will help you get your hands dirty with machine learning problems and gain insights from doing things. What I provide is a high-level understanding of several machine learning algorithms, as well as the R and Python code that runs them. That should be enough for you to try it out for yourself.



I purposely skipped the statistics behind these techniques, because you don’t need to understand them to get started. If you want to understand these algorithms at the statistical level, you should look elsewhere. But if you want to prepare yourself before starting a machine learning project, you’ll like this article.

Broadly speaking, there are three kinds of machine learning algorithms:

1. Supervised learning

How it works: This algorithm has a target or outcome variable (the dependent variable) that is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps the inputs to the desired outputs. The training process continues until the model achieves the desired accuracy on the training data. Examples of supervised learning include regression, decision tree, random forest, the K-nearest neighbor algorithm, logistic regression, etc.

2. Unsupervised learning

How it works: In this algorithm, there is no target or outcome variable to predict or estimate. The algorithm is used to cluster a population into different groups, and it is widely used to segment customers into groups for specific interventions. Examples of unsupervised learning are association rule algorithms (such as Apriori) and the k-means algorithm.

3. Reinforcement learning

How it works: This algorithm trains the machine to make decisions. The machine is placed in an environment where it trains itself continually through trial and error, learning from past experience and trying to capture the best possible knowledge to make accurate business decisions. An example of reinforcement learning is the Markov decision process.

List of common machine learning algorithms

Here is a list of commonly used machine learning algorithms. These algorithms can be used for almost any data problem:

  1. Linear regression
  2. Logistic regression
  3. Decision tree
  4. SVM
  5. Naive Bayes
  6. K-nearest neighbor algorithm (KNN)
  7. K-means algorithm
  8. Random forest algorithm
  9. Dimensionality reduction algorithms
  10. Gradient Boosting and AdaBoost algorithms

1. Linear regression

Linear regression is usually used to estimate real values (housing prices, number of calls, total sales, etc.) based on continuous variables. We establish the relationship between the independent and dependent variables by fitting a best-fit line. This best-fit line is called the regression line and is represented by the linear equation Y = a*X + b.

The best way to understand linear regression is to go back to childhood. What do you think a fifth grader would do if asked to arrange his classmates from lightest to heaviest without asking their weights? He or she would likely look at people’s height and build and rank them by combining these visible parameters. This is linear regression in real life: the child has figured out that height and build relate to weight through a relationship that looks a lot like the equation above.

In this equation:

  • Y: dependent variable
  • a: slope
  • X: independent variable
  • b: intercept

The coefficients a and b are obtained by the least squares method, i.e., by minimizing the sum of squared distances between the data points and the regression line.

See the following example. Here we have found the best-fit line y = 0.2811x + 13.9. Given a person’s height, we can use this equation to find their weight.
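As a quick illustration of the least-squares idea (not part of the original article; the height and weight numbers below are invented), the coefficients a and b can be recovered with a one-line fit in Python:

# Illustrative only: fit weight (kg) against height (cm) by least squares.
# The data points are invented for this sketch; only the method matters.
import numpy as np

height = np.array([150, 155, 160, 165, 170, 175, 180])   # cm (hypothetical)
weight = np.array([56, 57, 58, 59, 61, 63, 64])          # kg (hypothetical)

# np.polyfit(x, y, 1) returns [slope, intercept] of the least-squares line
a, b = np.polyfit(height, weight, 1)
print(f"weight = {a:.4f} * height + {b:.2f}")

# Predict the weight of a person 172 cm tall
print(a * 172 + b)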



The two main types of linear regression are simple linear regression and multiple linear regression. Simple linear regression has only one independent variable, while multiple linear regression, as the name suggests, has several. When looking for the best-fit line, you can also fit a polynomial or curvilinear regression.
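The snippets below cover the straight-line case. As a minimal sketch of the polynomial (curvilinear) case, assuming scikit-learn’s PolynomialFeatures and a small invented data set:

# Sketch: polynomial (curvilinear) regression via feature expansion.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data invented for the sketch: y is roughly quadratic in x
x = np.arange(1, 11).reshape(-1, 1)
y = 2 + 0.5 * x.ravel() ** 2

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)                 # fits y = b0 + b1*x + b2*x^2
print(poly_model.predict([[12]]))    # extrapolate to x = 12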

  • Python code

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric and numpy arrays
x_train = input_variables_values_training_datasets
y_train = target_variables_values_training_datasets
x_test = input_variables_values_test_datasets
# Create linear regression object
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
#Equation coefficient and Intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
#Predict Output
predicted = linear.predict(x_test)

  • R code

#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train, y_train)
# Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
summary(linear)
#Predict Output
predicted <- predict(linear, x_test)


2. Logistic regression

Don’t be fooled by the name! This is a classification algorithm rather than a regression algorithm. It estimates discrete values (for example, binary values 0 or 1, yes or no, true or false) from a given set of independent variables. In simple terms, it estimates the probability of an event occurring by fitting the data to a logit function, which is why it is also known as logit regression. Because it estimates probabilities, its output values lie between 0 and 1 (as expected).

Let’s understand this algorithm again with a simple example.

Suppose your friend asks you to solve a puzzle. There are only two outcomes: either you solve it or you don’t. Now imagine you are given a wide range of puzzles to figure out which subjects you are good at. The outcome of this study would be something like this: if the problem is a tenth-grade trigonometry problem, you have a 70 percent chance of solving it; if it is a fifth-grade history question, you get it right only 30 percent of the time. That is what logistic regression gives you.

Mathematically, the log odds of the outcome are modeled as a linear combination of the predictor variables:

odds = p / (1 - p) = probability of event occurrence / probability of no event occurrence
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 ... + bk*Xk

In this formula, p is the probability of the characteristic of interest. The parameters are chosen to maximize the likelihood of observing the sample, rather than to minimize the sum of squared errors (as ordinary regression does).

Now you might ask: why take the logarithm of the odds? In short, this is one of the best mathematical ways to replicate a step function. I could go into more detail, but that would defeat the purpose of this guide.
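To make the logit concrete, here is a small illustrative sketch (the coefficients b0 and b1 are invented, not fitted) showing how a linear combination of predictors is mapped back to a probability through the inverse of the logit, the sigmoid:

import numpy as np

def sigmoid(z):
    # inverse of the logit: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients: b0 (intercept) and b1 for a single predictor x1
b0, b1 = -4.0, 0.05
x1 = 100.0                      # e.g. minutes spent practising trigonometry

z = b0 + b1 * x1                # the log odds (logit)
p = sigmoid(z)                  # estimated probability of solving the puzzle
print(z, p, p / (1 - p))        # logit, probability, odds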




  • Python code

#Import Library
from sklearn.linear_model import LogisticRegression
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create logistic regression object
model = LogisticRegression()
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Equation coefficient and Intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
#Predict Output
predicted = model.predict(x_test)


  • R code

x <- cbind(x_train, y_train)
# Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x, family = 'binomial')
summary(logistic)
#Predict Output
predicted <- predict(logistic, x_test)

One step further:

You can try more ways to improve the model:

  1. Add interaction terms
  2. Remove features
  3. Use regularization techniques
  4. Use a nonlinear model

3. Decision tree

This is one of my favorite and most frequently used algorithms. It is a supervised learning algorithm that is usually used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets, based on the most significant attributes or independent variables, so as to make the groups as distinct as possible. To learn more, read: Simplifying Decision Trees.



(Credit: Statsexchange)

As you can see in the image above, the population is divided into four different groups based on multiple attributes, to identify whether they will play or not. To split the population into distinct groups, it uses a number of techniques such as Gini, information gain, chi-square, and entropy.
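As a small illustration of how such a splitting criterion is scored (not from the original article; the "play"/"no" counts are invented), the Gini measure of a candidate split can be computed like this:

from collections import Counter

def gini(labels):
    # Gini impurity: 1 - sum(p_k^2) over the classes present in this group
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# Hypothetical split of 14 "play"/"no" records into two groups by one attribute
left  = ["play"] * 6 + ["no"] * 1
right = ["play"] * 3 + ["no"] * 4

n = len(left) + len(right)
weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
print(gini(left), gini(right), weighted)   # lower weighted impurity = better split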

The best way to understand how decision trees work is to play Jezzball, a classic Microsoft game (see the picture below). The goal of the game is to wall off as much ball-free space as possible by building walls in a room where the balls keep moving.

So every time you split the room with a wall, you are trying to create two different populations within the same room. Decision trees work in a very similar way, dividing the population into groups that are as different from each other as possible.

For more information, see: the decision tree algorithm simplified (http://www.analyticsvidhya.com/blog/2015/01/decision-tree-simplified/)

  • Python code
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create tree object
model = tree.DecisionTreeClassifier(criterion='gini')  # for classification; the criterion can be 'gini' or 'entropy' (information gain), and 'gini' is the default
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted = model.predict(x_test)
  • R code

library(rpart)
x <- cbind(x_train, y_train)
# grow tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)

4. Support vector machines


This is a classification method. In this algorithm, we plot each data point in n-dimensional space (where n is the total number of features you have), with the value of each feature being the value of a particular coordinate.

For example, if we had only two features, height and hair length, we would plot these two variables in two dimensions, with each point having two coordinates (these coordinates are called support vectors).



Now, we’ll find a line that separates the two different sets of data. The line is chosen so that its distance to the nearest point in each of the two groups is as large as possible.



In the example above, the black line splits the data into two groups, and its distance to the closest point in each group (points A and B in the figure) is the largest possible; this line is our classifier. Then, whichever side of the line the test data falls on, that is the class we assign to it.

For more, see: Support Vector Machine (SVM) Simplified (http://www.analyticsvidhya.com/blog/2014/10/support-vector-machine-simplified/)
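The placeholder code later in this section shows the general fitting pattern; the toy sketch below (with made-up 2D points, assuming scikit-learn) also reads back the fitted support vectors that pin down the dividing line:

from sklearn import svm

# Two made-up groups in 2D (e.g. height, hair length), labeled 0 and 1
X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

clf = svm.SVC(kernel="linear")    # linear kernel: the separator is a straight line
clf.fit(X, y)

print(clf.support_vectors_)        # the points closest to the dividing line
print(clf.predict([[3, 2], [7, 5]]))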

Think of this algorithm as playing JezzBall in an n-dimensional space, with a few tweaks to the game:

  • You can draw lines or planes at any angle, rather than only horizontally or vertically.
  • The objective of the game is to segregate balls of different colors into different spaces.
  • The balls do not move.
  • Python code

#Import Library
from sklearn import svm
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create SVM classification object
model = svm.SVC()  # there are various options associated with it; this is simple for classification
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted = model.predict(x_test)


  • R code

library(e1071)
x <- cbind(x_train, y_train)
# Fitting model
fit <- svm(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)

5. Naive Bayes


Naive Bayes is a classification technique based on Bayes’ theorem, with an assumption of independence between the predictors. In simpler terms, a naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit that is round, red, and about 3 inches in diameter might be an apple. Even if these features depend on each other, or on the presence of other features, a naive Bayes classifier would treat all of them as independently contributing to the probability that the fruit is an apple.

Naive Bayes models are easy to build and particularly useful for very large data sets. Despite its simplicity, naive Bayes is known to outperform even highly sophisticated classification methods.

Bayes’ theorem provides a way to calculate the posterior probability P(c|x) from P(c), P(x), and P(x|c). Look at the equation below:



Here,

  1. P(c|x) is the posterior probability of the class (target) given the predictor (attribute)
  2. P(c) is the prior probability of the class
  3. P(x|c) is the likelihood, i.e., the probability of the predictor given the class
  4. P(x) is the prior probability of the predictor

Example: Let’s use an example to understand this concept. Below, I have a weather training set and the corresponding target variable “Play”. Now, we need to categorize the participants who will “play” and “not play” based on weather conditions. Let’s do the following steps.

Step 1: Transform the data set into a frequency table.

Step 2: Create a likelihood table by finding the probabilities, for example "the probability of Overcast is 0.29" and "the probability of playing is 0.64".



Step 3: Now, use the naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the predicted outcome.

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve this using the method discussed above: P(Play | Sunny) = P(Sunny | Play) * P(Play) / P(Sunny)

We have P(Sunny | Play) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Play) = 9/14 = 0.64.

Now, P(Play | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability.
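As a quick check, the same arithmetic can be reproduced in a few lines of Python (the probabilities are the ones read off the frequency table above):

# Bayes' theorem applied to the worked example above
p_sunny_given_play = 3 / 9     # P(Sunny | Play) = 0.33
p_play             = 9 / 14    # P(Play)         = 0.64
p_sunny            = 5 / 14    # P(Sunny)        = 0.36

p_play_given_sunny = p_sunny_given_play * p_play / p_sunny
print(round(p_play_given_sunny, 2))   # 0.6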

Naive Bayes uses a similar approach, using different attributes to predict probabilities of different classes. This algorithm is commonly used for text classification and problems involving more than one class.


  • Python code

#Import Library
from sklearn.naive_bayes import GaussianNB
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create Naive Bayes object (there are other distributions for multinomial classes, like Bernoulli Naive Bayes)
model = GaussianNB()
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)


  • R code

library(e1071)
x <- cbind(x_train, y_train)
# Fitting model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)

6. KNN (K-Nearest Neighbor Algorithm)


The algorithm can be used for both classification and regression problems. However, in industry it is more commonly used for classification. K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k nearest neighbors: according to a distance function, a new case is assigned to the class most common among its k neighbors.

These distance functions can be Euclidean, Manhattan, Minkowski, or Hamming distance. The first three are used for continuous variables, and the fourth (Hamming) for categorical variables. If K = 1, the new case is simply assigned to the class of its nearest neighbor. At times, choosing a value for K is a challenge when modeling with KNN.

More information: Introduction to the K-Nearest Neighbors Algorithm (Simplified Version)
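As an aside (not part of the original snippets), all four distance functions mentioned above are available in scipy; a minimal sketch on made-up vectors:

from scipy.spatial import distance

a, b = [1, 2, 3], [4, 6, 3]

print(distance.euclidean(a, b))        # sqrt(3^2 + 4^2 + 0^2) = 5.0
print(distance.cityblock(a, b))        # Manhattan: |3| + |4| + 0 = 7
print(distance.minkowski(a, b, p=3))   # Minkowski distance with p = 3
print(distance.hamming([1, 0, 1], [1, 1, 0]))   # fraction of differing positions = 2/3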



We can easily apply KNN in real life. If you want to get to know a complete stranger, you might want to reach out to his close friends or circle for information.

Things to consider before choosing to use KNN:

  1. KNN is computationally expensive.
  2. Variables should be normalized, otherwise variables with a larger range can bias the result.
  3. Data should be pre-processed (e.g., outlier and noise removal) before using KNN.


  • Python code

#Import Library
from sklearn.neighbors import KNeighborsClassifier
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6)  # default value for n_neighbors is 5
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)


  • R code

library(class)
# class::knn classifies the test cases directly from the training data; there is no separate fit/predict step
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)

7. K-means algorithm

K-means is a kind of unsupervised learning algorithm that solves clustering problems. Its procedure is a simple and easy way to group a data set into a certain number of clusters (assume K clusters). Data points within a cluster are homogeneous, and heterogeneous with respect to other clusters.

Remember finding shapes in ink stains? The K-means algorithm is similar to this activity in a way. Look at the shapes and stretch your imagination to find out how many clusters or populations there are.



  • How k-means algorithm forms clusters:

  1. The K-means algorithm picks K points, called centroids, one for each cluster.
  2. Each data point forms a cluster with the closest centroid, giving K clusters.
  3. The centroid of each cluster is recomputed from its current members, giving new centroids.
  4. With the new centroids, steps 2 and 3 are repeated: each data point is assigned to its nearest centroid and K clusters are formed again. This continues until the data converge, that is, until the centroids stop changing.
  • How to determine K value:

In K-means, each cluster has its own centroid. The sum of squared distances between the centroid and the data points within a cluster forms the within-cluster sum of squares. Adding up the within-cluster sums of squares of all clusters gives the total within-cluster sum of squares for the cluster solution.

We know that this value keeps decreasing as the number of clusters increases. However, if you plot the result, you will see that the sum of squared distances decreases sharply up to some value of K, and much more slowly after that. That is where we can find the optimal number of clusters.
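A minimal sketch of this "elbow" idea, assuming scikit-learn’s KMeans and its inertia_ attribute (the within-cluster sum of squares), on made-up data with three obvious groups:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Made-up data: three well-separated blobs in 2D
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # inertia_ drops sharply until k = 3, then levels off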




  • Python code

#Import Library
from sklearn.cluster import KMeans
#Assumed you have, X (attributes) for training data set and x_test(attributes) of test_dataset
# Create KMeans object
k_means = KMeans(n_clusters=3, random_state=0)
# Train the model using the training sets and check score
k_means.fit(X)
#Predict Output
predicted = k_means.predict(x_test)


  • R code

library(cluster)
fit <- kmeans(X, 3)  # 3 cluster solution

8. Random forest

Random Forest is a trademarked term for an ensemble of decision trees. In the random forest algorithm, we have a collection of decision trees (hence the name "forest"). To classify a new object based on its attributes, each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).

Each tree is grown like this:

  1. If the number of cases in the training set is N, a sample of N cases is taken at random with replacement. This sample serves as the training set for growing the tree.
  2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.
  3. Each tree is grown to the largest extent possible; there is no pruning. (See the sketch after this list for how these settings look in practice.)
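These three steps map directly onto the main parameters of a typical implementation. A sketch, assuming scikit-learn’s RandomForestClassifier and its built-in iris data set (parameter values chosen only for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)   # small built-in dataset, used only for illustration

model = RandomForestClassifier(
    n_estimators=500,      # number of trees in the forest
    bootstrap=True,        # step 1: each tree sees a bootstrap sample of the N cases
    max_features="sqrt",   # step 2: only m << M features are considered at each split
    max_depth=None,        # step 3: trees are grown fully, with no pruning
    random_state=0,
)
model.fit(X, y)
print(model.score(X, y))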

To learn more about this algorithm and compare decision trees and optimization model parameters, I recommend you read the following article:

  1. Introduction to Random Forest – Simplified version
  2. Compare CART model with random Forest (I)
  3. Comparison between random Forest and CART Model (II)
  4. Adjust your random forest model parameters


  • Python code

#Import Library
from sklearn.ensemble import RandomForestClassifier
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create Random Forest object
model = RandomForestClassifier()
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)

  • R code

library(randomForest)
x <- cbind(x_train, y_train)
# Fitting model
fit <- randomForest(y_train ~ ., data = x, ntree = 500)
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)

9. Dimensionality reduction algorithm

In the last four to five years, data capture has grown exponentially at every possible stage. Companies, government agencies, and research organizations are not only tapping new sources, they are also capturing data in ever greater detail.

For example, e-commerce companies capture more and more details about their customers: personal information, web browsing history, likes and dislikes, purchase history, feedback, and much more, giving them more personalized attention than the grocery store clerk around the corner.

As data scientists, the data we are given contains many features. That sounds like good material for building robust models, but there is a challenge: how do you identify the most significant variables out of 1,000 or 2,000? In such cases, dimensionality reduction algorithms help us, along with other algorithms such as decision trees, random forests, PCA, and factor analysis, to find the important variables based on the correlation matrix, the proportion of missing values, and other factors.
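As one small, illustrative example of the "proportion of missing values" idea (pandas assumed; the data frame is invented), a feature that is mostly missing can be screened out like this:

import numpy as np
import pandas as pd

# Invented data: three candidate features, one of them mostly missing
df = pd.DataFrame({
    "age":        [25, 32, 47, 51, np.nan, 38],
    "income":     [40, 55, np.nan, np.nan, np.nan, np.nan],
    "page_views": [3, 7, 2, 9, 4, 6],
})

missing_ratio = df.isnull().mean()     # share of missing values per column
print(missing_ratio)

# Drop features whose missing ratio exceeds a chosen threshold (here 50%)
kept = df.loc[:, missing_ratio < 0.5]
print(kept.columns.tolist())           # ['age', 'page_views']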

To learn more about this algorithm, read the Beginner’s Guide to Dimensionality Reduction Algorithms.


  • Python code

#Import Library
from sklearn import decomposition
#Assumed you have training and test data set as train and test
# Create PCA object
pca = decomposition.PCA(n_components=k)  # default value of k = min(n_sample, n_features)
# For Factor analysis
#fa = decomposition.FactorAnalysis()
# Reduce the dimension of the training dataset using PCA
train_reduced = pca.fit_transform(train)
# Reduce the dimension of the test dataset
test_reduced = pca.transform(test)
#For more detail on this, please refer to this link.

  • R code
library(stats)
pca <- princomp(train, cor = TRUE)
train_reduced <- predict(pca, train)
test_reduced <- predict(pca, test)

10. Gradient Boosting and AdaBoost algorithms

GBM and AdaBoost, two boosting algorithms, are used when we need to handle a lot of data and make predictions with high predictive power. Boosting is an ensemble learning technique that combines the predictions of several base estimators in order to improve robustness over a single estimator. These boosting algorithms regularly do well in data science competitions such as Kaggle, AV Hackathon, and CrowdAnalytix.

More: Learn more about Gradient Boosting and AdaBoost

  • Python code
#Import Library
from sklearn.ensemble import GradientBoostingClassifier
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create Gradient Boosting Classifier object
model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)
  • R code

library(caret)
x <- cbind(x_train, y_train)
# Fitting model
fitControl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
fit <- train(y_train ~ ., data = x, method = "gbm", trControl = fitControl, verbose = FALSE)
predicted <- predict(fit, x_test, type = "prob")[,2]
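The snippets above use Gradient Boosting; AdaBoost, the other algorithm mentioned, follows the same pattern. A minimal sketch, assuming scikit-learn’s AdaBoostClassifier and the same X, y, and x_test placeholders used throughout:

#Import Library
from sklearn.ensemble import AdaBoostClassifier
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create AdaBoost object (n_estimators and learning_rate chosen only for illustration)
model = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=0)
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted = model.predict(x_test)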

GradientBoostingClassifier and random forest are two different kinds of tree ensembles (boosting and bagging, respectively), and people often ask about the difference between the two algorithms.

Conclusion

By now I’m sure you have a general understanding of the machine learning algorithms in common use. My sole purpose in writing this article and providing the Python and R code is to get you started right away. If you want to master machine learning, start now: take up problems, develop an intuitive understanding of the process, apply the code, and have fun!



The original post was published on March 8, 2018

This article is from the cloud community partner “Datapai THU”.