[Editor’s note] Machine learning is one of the most advanced areas of artificial intelligence today, and more and more beginners are entering the field. In this article, machine learning and NLP expert Raul Garreta, co-founder and CEO of MonkeyLearn, outlines important concepts, applications, and challenges in machine learning for beginners.

Machine learning is a branch of artificial intelligence that builds algorithms that let computers learn from data and perform tasks without being explicitly programmed.

Is that clear? We can make machines learn to do things! The first time I heard that, it got me really excited. It means we can program computers to learn on their own!

The ability to learn is one of the most important aspects of intelligence, so bringing that ability to machines should be a big step toward making computers smarter. In fact, machine learning is the most advanced area of AI today; it’s a hot topic right now, and machine learning is likely to keep producing smarter machines.

This article provides a brief introduction to machine learning for beginners. I will outline important concepts, applications, and challenges in using machine learning. The goal is not to give a formal, detailed description of machine learning, but to introduce some initial concepts that will let the reader continue exploring the subject.

Machine learning applications

Well, machine learning is not all it’s cracked up to be, and it has its limits. We cannot build intelligent machines like Data from Star Trek or HAL 9000 from 2001: A Space Odyssey. However, there are plenty of real-world applications where machine learning works wonders. Here are some of the most common categories of practical machine learning applications:

Image processing

Image processing problems basically analyze images to extract data or to perform some transformation. Some examples:

 

  1. Image tagging, as on Facebook, where an algorithm automatically detects your face or your friends’ faces in photos. The underlying machine learning algorithm learns from the photos you tag manually.
  2. Optical character recognition (OCR), where an algorithm learns to convert handwritten or scanned text into a digital version. The algorithm has to learn to map images of handwritten characters to the corresponding digitized letters.
  3. Self-driving cars, where image processing is part of what lets a car drive itself. A machine learning algorithm learns, from each frame captured by the camera, where the edge of the road is, whether there is a stop sign, and whether a car is approaching.

 

Text analysis

Text analysis extracts or classifies information from text, such as tweets, emails, chats, or documents. Some popular examples:

 

  1. Spam filtering, one of the best known and most widely used text classification applications. Spam filters learn to classify messages as spam or not based on their content and subject.
  2. Sentiment analysis, another text classification application, where the system learns to classify an opinion as positive, neutral, or negative based on the emotions the author expresses.
  3. Information extraction, where the system learns to extract specific pieces of information from text, such as addresses, entities, keywords, and so on.

 

Data mining

Data mining is used to discover patterns or make predictions from data. This definition is a bit generic, but you can think of it as mining useful information from a large set of database tables. Each row would be a training instance and each column a feature. We might be interested in predicting a new column from the remaining columns, or in discovering patterns that group rows. For example:

 

  1. Anomaly detection: detecting outliers. For example, in credit card fraud detection, you can flag purchases that deviate from a user’s usual shopping patterns.
  2. Association rules: for example, in a supermarket or an e-commerce site, you can discover customers’ buying habits by looking at which products are bought together. This information can be used for marketing purposes.
  3. Grouping: for example, on a SaaS platform, users can be grouped by their behavior and profile.
  4. Prediction: predicting one variable (a column in the database) from the remaining variables. For example, you can learn to predict the credit score of new customers from the profiles and credit scores of existing customers.

 

Video games and robots

Video games and robotics are a huge area where machine learning is applied. Typically we have an agent (a game character or a robot) that must act according to its environment (the virtual environment of a video game, or the real environment of a robot). Machine learning enables the agent to perform tasks, such as moving through an environment while avoiding obstacles or enemies. One of the most popular machine learning techniques in these situations is reinforcement learning, where the agent learns from a reinforcement signal from the environment (negative when it hits an obstacle, positive when it reaches the goal).
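
To make the idea concrete, here is a minimal tabular Q-learning sketch in Python. The toy one-dimensional world, the rewards, and the hyperparameters are all invented for illustration; they are not taken from any particular game or robot.

    import random

    # Toy world: five states on a line. The agent starts in the middle (state 2);
    # state 0 holds an obstacle (negative reward) and state 4 is the goal (positive).
    N_STATES = 5
    ACTIONS = [-1, +1]               # move left or move right
    REWARDS = {0: -1.0, 4: +1.0}     # the reinforcement signal from the environment

    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2   # learning rate, discount, exploration
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

    for episode in range(500):
        state = 2
        while state != 4:  # an episode ends when the agent reaches the goal
            # Epsilon-greedy: usually take the best known action, sometimes explore.
            if random.random() < EPSILON:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state = min(max(state + action, 0), N_STATES - 1)
            reward = REWARDS.get(next_state, 0.0)
            # Q-learning update: move Q toward reward + discounted best future value.
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
            state = next_state

    # After training, the learned policy should point right (toward the goal) everywhere.
    print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})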

OK, I now know what machine learning is, but how does it work?

One of the first machine learning books I read, about 10 years ago, was Machine Learning by Tom Mitchell. It was written in 1997, but the overall concepts are still useful today.

In that book, I liked the formal definition of machine learning as follows:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

For example, a computer program can learn to play chess (task T) by watching previous chess games or by playing against a tutor (experience E). Its performance P can be measured by the percentage of games it wins against human players.

Let’s use some more examples:

Case 1: An image is fed to the system, and the system must determine whether Barack Obama’s face appears in it (broadly, Facebook-style automatic image tagging).

Case 2: A tweet is fed to the system, and the system determines whether it expresses positive or negative sentiment.

Case 3: Some information about a person is fed to the system, and the system computes the probability that the person will repay a credit card loan.

In case 1, the system’s task is to detect when Barack Obama’s face appears in an image. Its experience can be a set of images in which he does or does not appear. Its performance can be measured by the percentage of times it correctly identifies Obama’s face.

In case 2, the system’s task is to perform sentiment analysis on tweets. Its experience can be a set of tweets with their corresponding sentiments. Its performance can be measured by the proportion of new tweets it classifies correctly.

In case 3, the system’s task is credit scoring. It can take a set of user profiles with their corresponding credit scores as experience. The squared error (the squared difference between predicted and expected scores) can be used as the performance measure.

In order for the algorithm to learn to convert the input into the desired output, you must provide training instances or training samples, which Mitchell defines as experience E. A training set is a collection of instances that serve as samples from which the machine learning algorithm learns and performs the desired task. Easy to understand, right? It’s like showing a kid how to throw a ball. You throw the ball a few times to teach him how to do it, and then by watching those examples, he starts to learn how to throw the ball himself.

Each training instance is usually represented as a fixed set of attributes or features. Features are the way each instance is represented. For example, in case 1 an image can be represented by the gray level of each of its pixels; in case 2 a tweet can be represented by the words that appear in it; in case 3 a credit record can be represented by the person’s age, salary, occupation, and so on.
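
To make the case 2 representation concrete, here is a small sketch, assuming the scikit-learn library is available (the tweets are made up), that turns tweets into vectors of word counts:

    from sklearn.feature_extraction.text import CountVectorizer

    # Invented example tweets; each one becomes a training instance.
    tweets = [
        "I love this new phone",
        "this phone is terrible",
        "battery life is great",
    ]

    # Bag of words: each feature is a word, each value is how often it appears.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(tweets)

    print(vectorizer.get_feature_names_out())  # the features (the vocabulary)
    print(X.toarray())                         # one feature vector per tweet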

Computing and selecting reasonable features to represent an instance is one of the most important tasks in using machine learning, as we will discuss later in this article.

Types of machine learning algorithms

In this section we will discuss two main categories of machine learning algorithms: supervised and unsupervised. The main differences between the two types of algorithms lie in the training samples we provide to the algorithm, the way the algorithm uses the samples, and the class of problems they solve.

Supervised learning

In supervised learning, machine learning algorithms can be viewed as the process of converting a particular input into a desired output.

The algorithm must learn how to convert every possible input into the correct/expected output, so each training sample consists of a specific input and the expected output for it.

In the case of an artificial chess player, the input could be a specific board position and the output the best move in that position.

Depending on the type of output, supervised learning can be divided into two subcategories:

Classification

When the output values belong to a discrete, finite set, we have a classification problem. Case 2 can be viewed as a classification problem: the output is one of a finite set of labels, positive, negative, or neutral. A training sample would then pair a tweet with its label, for example (“I love this new phone”, positive).

 

Regression

When the output is a continuous value, such as a probability, we have a regression problem. Case 3 is a regression problem because the result is a number between 0 and 1 representing the probability that a person will repay a loan. A training sample would then pair a customer profile (age, salary, occupation, and so on) with the expected score.

 

Supervised learning is the most popular category of machine learning algorithms. The downside of this approach is that for every training sample we need to provide the correct output, which in many cases is costly. For example, in the sentiment analysis case, if we need 10,000 training samples (tweets), we have to tag each tweet with the correct sentiment (positive, negative, or neutral). That requires a group of people to read and tag every tweet (time-consuming and boring work). Gathering correctly labeled training data is often the most common bottleneck in machine learning.
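
As a hedged illustration of the whole supervised workflow, here is a minimal sketch, again assuming scikit-learn; the six labeled tweets are invented stand-ins for a real (much larger) training set:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Each training sample is an input (a tweet) with its expected output (a label).
    train_tweets = [
        "I love this movie", "what a great day",
        "this is awful", "I hate waiting in line",
        "the meeting is at noon", "just landed in Chicago",
    ]
    train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

    # Pipeline: extract bag-of-words features, then fit a naive Bayes classifier.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_tweets, train_labels)

    print(model.predict(["I love sunny days"]))  # expected: ['positive']

With only six samples such a model would generalize poorly, of course; the point of the sketch is the shape of supervised learning: inputs paired with expected outputs.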

Unsupervised learning

The second type of machine learning algorithm is called unsupervised learning. Here, only the inputs are fed to the algorithm; there are no corresponding expected outputs. A typical use case is discovering hidden structure or relationships among the training samples. A classic example is clustering algorithms, which learn to find similar instances and group them together (into clusters). Say we have a news story and want to recommend similar ones: clustering algorithms such as k-means learn those groups from the input data alone.
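
A minimal clustering sketch along those lines, assuming scikit-learn and using invented story snippets, might look like this:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Invented story snippets; only inputs are given, no labels.
    stories = [
        "stock markets rallied on strong earnings",
        "the central bank raised interest rates",
        "the home team won the championship game",
        "star striker scores twice in the final",
    ]

    # Represent each story as a TF-IDF vector, then group the stories into 2 clusters.
    X = TfidfVectorizer().fit_transform(stories)
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    print(kmeans.labels_)  # e.g. [0 0 1 1]: finance stories vs. sports stories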

Machine learning models

Okay, now for the math and the logic. To convert inputs into the desired outputs we can use different models. Machine learning is not a single algorithm; you’ve probably heard of support vector machines, naive Bayes, decision trees, or deep learning. Those are different machine learning algorithms that all try to solve the same problem: learning to convert inputs into the correct outputs.

These different machine learning algorithms use different paradigms or techniques to carry out the learning process and to represent what they have learned.

Before we go through each of these algorithms, one common principle to understand is that machine learning algorithms try to generalize. That is, they try to explain things with the simplest possible hypothesis, a principle known as Occam’s razor. All machine learning algorithms, regardless of the paradigm they use, try to build the simplest hypothesis (the one that makes the fewest assumptions) that accounts for the majority of the training instances.

There are many machine learning algorithms out there, but let’s briefly introduce three popular ones:

Support vector machines: this model constructs a hyperplane in a high-dimensional space that separates instances of different classes, maximizing the distance to the nearest instances of each class (the margin). The concept is straightforward, but the resulting models can be very complex and powerful. In fact, in some domains support vector machines are among the best machine learning algorithms you can currently use.

Probabilistic models: these models typically try to model a probability distribution over the problem in order to predict the correct answer. Probably the most popular is the naive Bayes classifier, which uses Bayes’ theorem together with an assumption of independence between features to build a classifier. One advantage of these models is that they are both simple and powerful, and they return not only the predicted value but also how certain the model is about that prediction, which is very useful.
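
As a small illustration of that last point, the following sketch (scikit-learn again, with invented data) shows a naive Bayes classifier returning a probability for each class rather than just a label:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Invented training data, in the spirit of the earlier sentiment examples.
    texts = ["great product", "love it", "terrible quality", "waste of money"]
    labels = ["positive", "positive", "negative", "negative"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, labels)

    # Besides the predicted class, naive Bayes reports how certain it is about
    # each possible class, as a probability (class order given by model.classes_).
    print(model.classes_)
    print(model.predict_proba(["great value for money"]))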

Deep learning: a newer trend in machine learning based on the well-known artificial neural network model. Neural networks take a connectionist approach: they try to mimic (in a very simplified way) how the brain works. Basically, they consist of a large set of interconnected neurons (the basic processing units) organized in layers. Deep learning, in short, uses more and deeper layers, building new structures with higher-level abstractions that not only improve learning but also automatically construct representations of the most important features.

Important aspects of machine learning

Machine learning sounds like a wonderful concept, and it is, but some parts of the process are not automatic at all. In fact, when designing a solution, much of the work has to be done manually, and that work is a vital part of getting good results. Some of those aspects are:

What kind of machine learning algorithm should I use?

Supervised or unsupervised?

Do you have labeled data, that is, inputs with their corresponding outputs? If so, you can use a supervised learning algorithm. If not, an unsupervised algorithm may solve the problem.

Classification, regression or clustering?

That depends on the problem you are trying to solve. If you want to tag data with discrete options, classification is probably the right choice. If instead you want a numeric output, such as a score, regression is your best bet. And if you want to recommend similar products on an e-commerce site based on what a user is currently viewing, clustering is the way to go.

Deep learning, SVM, naive Bayes, decision trees… which is best?

My answer is: there is no single best. Clearly, deep learning and support vector machines have proven to be among the most powerful and flexible algorithms across applications. But keep in mind that, depending on the particular application, some algorithms will work better than others. Analyze their respective strengths and use them accordingly!

Feature engineering

Feature engineering is the process of extracting and selecting the most important features to represent the training samples that a machine learning algorithm will process. This process is one of the most important aspects of machine learning (and it is sometimes underappreciated).

Note: if you don’t provide quality features to the algorithm, the results will be bad even if you use the best machine learning algorithm for the situation. It’s like trying to learn to read in the dark: no matter how smart you are, you can’t do it.

Feature extraction

To feed data into a machine learning algorithm, you usually need to convert the raw data into something the algorithm can “understand.” This process is called feature extraction. Usually we convert the raw data into feature vectors.

In case 1, how do we feed a machine learning algorithm an image?

A direct way is to convert the image into a vector in which each component is the gray value of one pixel. Each component, or feature, is then a value from 0 to 255, where 0 is black, 255 is white, and 1 to 254 are shades of gray.
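
A minimal sketch of that conversion, assuming the Pillow and NumPy libraries are installed and using a hypothetical local file face.jpg:

    import numpy as np
    from PIL import Image

    # Load an image (hypothetical file) and convert it to grayscale ("L" mode).
    image = Image.open("face.jpg").convert("L")

    # Each pixel becomes one feature: an integer from 0 (black) to 255 (white).
    features = np.asarray(image).flatten()

    print(features.shape)  # e.g. (65536,) for a 256x256 image
    print(features[:10])   # the first ten gray-level features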

This approach may work, but the algorithm may do better if we provide higher-level features, such as:

 

  • Does the image contain a face?
  • What is the skin color?
  • What color are the eyes?
  • Is there facial hair?

 

These higher-level features give the algorithm more knowledge than the raw gray value of each pixel (and they can themselves be computed with other machine learning algorithms). By providing higher-level features, we help the algorithm get better information for deciding whether my face, or someone else’s, is in an image.

If we implement better feature extraction:

 

  • Our algorithm is more likely to learn and get the desired result.
  • We may not need that many training examples.
  • In this way, we can significantly reduce the time required to train the model.

 

Feature selection

Sometimes the features we choose to feed into the algorithm are not very useful. For example, when tagging tweets with sentiment, we might include the length of the tweet, its creation date, and so on as features; these may or may not be useful, and there are automatic ways to find out. Intuitively, feature selection algorithms score each feature and return the most important ones according to that score.
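
Here is a sketch of that scoring idea using scikit-learn’s SelectKBest on synthetic data; the dataset shape and the choice of the ANOVA F-test as the scoring function are illustrative assumptions:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    # Synthetic data: 20 features, but only 4 of them actually carry information.
    X, y = make_classification(n_samples=200, n_features=20, n_informative=4,
                               random_state=0)

    # Score every feature against the labels and keep only the 4 best ones.
    selector = SelectKBest(score_func=f_classif, k=4)
    X_selected = selector.fit_transform(X, y)

    print(selector.scores_.round(1))            # a score for each feature
    print(selector.get_support(indices=True))   # indices of the selected features
    print(X_selected.shape)                     # (200, 4)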

Another important point is to avoid using huge feature sets. You might be tempted to add every possible feature to the model and let the algorithm learn. But that’s not a good idea: as we add more features to represent instances, the dimensionality of the space increases and the data becomes sparser. Intuitively, since there are more combinations of feature values, we need many more instances to cover them representatively. This is known as the curse of dimensionality: as model complexity grows, the number of training samples needed grows exponentially. And trust me, that is a problem.

Training samples

You have to feed training samples to the machine learning algorithm. Depending on the problem you are trying to solve, we may use hundreds, thousands, millions, or even billions of them. It is also vital to maintain the quality of the samples; if you feed the algorithm wrong samples, you are unlikely to get good results.

Collecting large amounts of good-quality data to train machine learning algorithms is often a labor-intensive endeavor. Unless you already have labeled data, you will need to tag it yourself or hire people to do it. Several crowdsourcing platforms try to solve this problem, and there are tools that make the job easier. You can also make tagging more efficient by using your own machine learning models as tagging assistants.

The general rule for training samples is that the better the training data you collect, the better the results you are likely to get.

Test samples and performance metrics

After we train a machine learning model, we need to test its performance. This is very important, otherwise you don’t know if your model has learned anything!

The concept is simple: we use a test set, a collection of instances not included in the training set. Basically, we feed each test sample to the model and check whether it produces the expected result. In the case of supervised classification, we input each test instance and check whether the model’s output matches the expected label. If the model answers correctly on 95% of the test samples, we say its accuracy is 95%.

Keep in mind that the training and test sets must never overlap; this is the only way to test the model’s ability to generalize and predict. A model may achieve high accuracy on its training data and still perform poorly on a separate test set. This is called overfitting: the algorithm fits the training samples too closely and ends up with poor predictive power. The usual ways to avoid overfitting are to use simpler models with fewer features and to use larger, more representative training sets.

Accuracy is the most basic metric, but you should also look at other metrics, such as precision and recall, which tell you how well the algorithm performs on each class (when doing supervised classification). A confusion matrix is a good tool for seeing where a classification algorithm confuses one class with another.
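
Putting the last few points together, here is a sketch, assuming scikit-learn and using one of its built-in datasets in place of real project data, that holds out a test set and reports accuracy, precision and recall per class, and a confusion matrix:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

    # A built-in dataset stands in for our own labeled data.
    X, y = load_iris(return_X_y=True)

    # Hold out a test set that never overlaps with the training set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = GaussianNB().fit(X_train, y_train)
    predictions = model.predict(X_test)

    print(accuracy_score(y_test, predictions))         # overall accuracy
    print(classification_report(y_test, predictions))  # precision/recall per class
    print(confusion_matrix(y_test, predictions))       # where predictions get confused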

For regression and clustering problems, there are other metrics for measuring the performance of the algorithm.

Performance

In practice, if you are going to implement a solution, you must build one that is robust and performs well. In machine learning applications this can be a complex task. First, you need to choose a machine learning framework, which is not easy because not all programming languages have strong tools for this. Python with scikit-learn is a good example of a language and framework for building powerful machine learning applications.

Once you’ve chosen your framework, think about performance. Depending on the amount of data, the complexity, and the algorithm, training can consume a lot of computing time and memory. You will probably need to run many training rounds until you get good results, and you will often retrain the model with new instances to improve its accuracy.

To train many models while in production and get results quickly, we usually use machines with large amounts of memory and multi-core processors to train models in parallel.

These are mostly practical issues that are important to consider if you want to deploy machine learning solutions to real-world applications.

Conclusion

So, this has been a brief overview of what machine learning is. There are many practical applications, algorithms, and concepts that this article does not cover; we leave those for the reader to explore.

Machine learning is powerful, but applying it is hard, and the difficulties described in this article when training models are just the tip of the iceberg.

A background in computer science, and especially in machine learning, is usually needed to get good results, and the many difficulties along the way can discourage a person before he or she gets on the right track.

That’s why we created MonkeyLearn: to democratize machine learning for text analysis, avoid reinventing the wheel, and let every software developer or entrepreneur get practical results quickly. These are the main focuses of our work: abstracting end users away from all these problems, from machine learning complexity to practical scalability, to make machine learning plug and play.

Original article: A Gentle Guide to Machine Learning.