
✨ blogger introduction

🌊 Author’s home page: Suzhou Program Dabai

🌊 Author profile: 🏆 CSDN quality creator in the artificial intelligence domain 🥇, co-founder of Suzhou Kaijie Intelligent Technology Co., Ltd., currently cooperating with Foxconn, Goer and other new-energy companies

💬 If this article helps you, welcome to subscribe to the C#, Halcon, Python+OpenCV, Vue, big-company interview and other columns

💅 Any questions are welcome by private message; I will reply when I see them 💅 Follow Suzhou Program Dabai for fan benefits

🕴️ Preface

Machine learning is something everyone talks about, but few people, teachers aside, know exactly what it is. If you read online articles about machine learning, you’re likely to encounter one of two things: a heavyweight academic trilogy filled with theorems (I couldn’t get through even half of them), or hype about artificial intelligence, the magic of data science, and the future of work.

I decided to write a long-gestating article that provides a brief introduction to machine learning for those who want to understand it. No advanced principles; plain language about real-world problems and practical solutions. It doesn’t matter whether you’re a programmer or a manager.

Let’s get started!

😀 Why do we want machines to learn?

Billy wants to buy a car and is trying to figure out how much he needs to save each month to afford it. After browsing dozens of online ads, he learned that new cars cost around $20,000, a one-year-old used car $19,000, a two-year-old car $18,000, and so on.

As a smart analyst, Billy found a pattern: the price of a car depends on the age of the car, and the price drops by $1,000 for each additional year, but not less than $10,000.

In machine learning terms, Billy invented “regression” — predicting a value (price) based on known historical data. When people try to estimate a fair price for a used iPhone on eBay or calculate the number of ribs needed for a barbecue, they have been using something like Billy’s method – 200g per person? 500?
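Billy’s rule is easy to sketch in a few lines of Python (a toy illustration using the numbers from the ads above):

```python
def estimate_price(age_years):
    """Billy's hand-made 'regression': the price drops $1,000
    per year of age, but never falls below $10,000."""
    return max(20_000 - 1_000 * age_years, 10_000)

print(estimate_price(0))   # new car: 20000
print(estimate_price(2))   # 2-year-old car: 18000
print(estimate_price(15))  # the $10,000 floor kicks in: 10000
```

A real regression algorithm goes the other way: it starts from the price table and discovers the slope and the floor on its own.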

Yes, it would be nice if there were a simple formula to solve all the world’s problems — especially for barbecue parties — but unfortunately, it’s not possible.

Going back to buying cars: the problem is that, in addition to age, cars differ in production date, dozens of options, technical condition, seasonal swings in demand, and who knows what other hidden factors. An ordinary person like Billy can’t take all of this into account when estimating the price, and neither can I.

People are lazy and stupid; we need robots to do the math for them. So let’s take the computational approach: feed the machine some data and ask it to find all the underlying patterns related to price.

Finally, it worked. The most exciting part is that the machine handles all these dependencies far better than a real person could in their head.

And so machine learning was born.

😃 3 components of machine learning

All the hype associated with artificial intelligence (AI) aside, the only goal of machine learning is to predict results from incoming data. Period. Every machine learning task can be represented this way; otherwise it isn’t a machine learning problem to begin with.

The more diverse the sample, the easier it is to find relevant patterns and predict outcomes. Therefore, we need three parts to train the machine:

  • Data:

1. Want to detect spam? Get samples of spam. Want to predict stocks? Find historical price data. Want to learn user preferences? Analyze their Facebook activity (no, Mark, stop collecting it, that’s enough). The more diverse the data, the better the results. At a bare minimum, the machine will want a few hundred thousand rows of data to chew on.

2. There are two main ways to obtain data: manually or automatically. Manually collected data has fewer errors but takes more time, and often more money. Automated methods are cheaper: you gather everything you can find and hope it’s good enough.

3. Smart guys like Google use their own users to annotate data for free. Remember how reCAPTCHA forces you to “pick all the landmarks”? That’s exactly how they get their data labeled, and you work for free! Well done. If I were them, I’d show those verification images more often, but wait…

Good data sets are really hard to come by, and they’re so important that some companies might even open up their algorithms but rarely publish data sets.

  • Features:

1. Call them “parameters” or “variables”, such as miles driven, user gender, stock price, word frequency in documents, etc. In other words, these are all factors for the machine to consider.

2. If the data is stored in a table, the features are simply the column names; that case is easy. But what about 100 GB of cat pictures? We can’t treat every pixel as a feature. That’s why selecting the right features usually takes more time than any other step in machine learning, and feature selection is also a major source of error. Human subjectivity leads people to pick features they like or that feel “more important”, and that’s something to avoid.

  • Algorithm:

1. The obvious part. Any problem can be solved in many different ways, and the method you choose affects the accuracy, performance, and size of the final model. One caveat: if the data quality is poor, even the best algorithm won’t help. This is known as garbage in, garbage out (GIGO). So before pouring effort into squeezing out accuracy, get more data.

😄 Learning vs. intelligence

I once read an article titled “Will Neural Networks Replace Machine Learning?” on a popular media site. These media people somehow inflate technologies like linear regression into “artificial intelligence”, stopping just short of “Skynet”. The figure below shows the relationship between several easily confused concepts.

  • “Artificial intelligence” is the name of an entire discipline, similar to “biology” or “chemistry.”

  • “Machine learning” is an important part of “artificial intelligence,” but not the only part.

  • Neural networks are a popular branch of machine learning, but there are other branches of the machine learning family.

  • “Deep learning” is a modern approach to building, training and using neural networks. Essentially, it’s a new architecture. In current practice, no one distinguishes between deep learning and “normal networks,” and the libraries that need to be called to use them are the same. To avoid looking like a fool, it’s best to talk about specific types of networks and avoid buzzwords.

The general principle is to compare things on the same level. That’s why “neural networks will replace machine learning” sounds like “the wheel will replace the car.” Dear press, this is going to take a toll on your reputation.

| The machine can | The machine can’t |
| --- | --- |
| Predict | Create something new |
| Memorize | Get smart fast |
| Copy | Go beyond its task |
| Choose the best item | Exterminate all mankind |

😁 Map of the world of machine learning

If you’re too lazy to read long paragraphs, this chart will help you get some idea. In the world of machine learning, there is never a single way to solve a problem — it’s important to remember that — because you will always find several algorithms that can be used to solve a problem, and you need to choose the one that works best. Of course, all problems can be handled by “neural networks”, but who pays for the hardware behind the computing power?

Let’s start with some basic overview. There are four main directions for machine learning.

😉 Classic machine learning algorithms

Classical machine learning algorithms have their roots in the pure statistics of the 1950s, when statisticians solved formal math problems: finding patterns in numbers, estimating distances between points, calculating vector directions.

Today, half the Internet runs on these algorithms. When you see a list of articles to “read next”, or your bank blocks your card at some out-of-the-way gas station, it’s likely the work of one of these little guys.

Big tech companies are big fans of neural networks. It’s easy to see why. A 2% increase in accuracy means $2 billion in revenue for these big companies. But when you’re small, it’s not that important. I heard of a team that spent a year developing a new recommendation algorithm for their e-commerce site, only to find out later that 99% of the traffic on the site came from search engines — their algorithm was useless, since most users wouldn’t even open the home page.

Despite the widespread use of classical algorithms, the principles are simple enough that you can easily explain them to a toddler. They’re like basic arithmetic — we use them every day without even thinking about them.

🍇 Supervised learning

Classical machine learning is usually divided into two categories: supervised learning and unsupervised learning.

In “supervised learning”, a “supervisor” or “teacher” provides the machine with all the answers to help it learn, such as whether it is a cat or a dog in the picture. The “teacher” has finished dividing the dataset — labeled “cat” or “dog” — and the machine uses this sample data to learn to distinguish between cats and dogs one by one.

Unsupervised learning means the machine alone is tasked with distinguishing which is which among a bunch of pictures of animals. The data is not pre-marked, there is no “teacher”, and the machine has to find all possible patterns on its own. More on that later.

Obviously, machines learn faster when a “teacher” is present, so supervised learning is more common in real life. Supervised learning falls into two categories:

  • Classification: predicting the category an object belongs to;

  • Regression: predicting a specific point on the number line.

💌 Classification

“Categorizes objects based on a previously known attribute: socks by color, documents by language, music by style.” Classification algorithms are often used for:

  • Spam filtering;

  • Language detection;

  • Finding similar documents;

  • Sentiment analysis;

  • Recognizing handwritten letters and numbers;

  • Fraud detection.

Commonly used algorithms:

  • Naive Bayes

  • Decision Tree

  • Logistic Regression

  • K-nearest Neighbours

  • Support Vector Machine

Machine learning largely solves classification problems. The machine here is like a baby learning to sort toys: this is a robot, this is a car, this is a robot-car… Oh wait, error! Error!

Classification tasks need a “teacher”: the data must be labeled so the machine can learn to categorize from those labels. And everything gets categorized: users by interest, articles by language and topic (important for search engines), music by genre (Spotify playlists), and your email is no exception.

The Naive Bayes algorithm is widely used in spam filtering. The machine counts how often words like “Viagra” appear in spam versus normal email, multiplies the probabilities using Bayes’ equation, sums up the results and, ha, the machine is a spam filter. What spammers learned to do about Bayes filters is stuff lots of “good” words at the end of an email, a trick cynically known as “Bayesian poisoning”. Naive Bayes went down in history as the most elegant and first practical spam-filtering algorithm, but other algorithms handle the task now too.
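The counting-and-multiplying idea fits in a few lines. Below is a toy sketch with a made-up four-email corpus; it works in log space and uses add-one smoothing so unseen words don’t zero everything out. A real filter would train on thousands of messages:

```python
from collections import Counter
import math

# Tiny made-up training corpus (real filters use huge labeled sets).
spam = ["viagra cheap viagra deal", "cheap deal win money"]
ham  = ["meeting tomorrow at noon", "project deal review tomorrow"]

def train(docs):
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

spam_counts, spam_total = train(spam)
ham_counts, ham_total = train(ham)
vocab = set(spam_counts) | set(ham_counts)

def log_prob(words, counts, total):
    # add-one smoothing: every word gets a small nonzero probability
    return sum(math.log((counts[w] + 1) / (total + len(vocab)))
               for w in words)

def is_spam(text):
    words = text.split()
    return log_prob(words, spam_counts, spam_total) > \
           log_prob(words, ham_counts, ham_total)

print(is_spam("cheap viagra win"))  # True
print(is_spam("meeting tomorrow"))  # False
```

Note the “naive” part: word probabilities are multiplied as if words were independent, which is false for real language but works surprisingly well.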

Here’s another example of a classification algorithm:

If you ask a bank for a loan, how does the bank know whether you’ll pay it back? It can’t know for sure. But the bank has files on past borrowers, with data such as age, education, occupation, salary and, most importantly, whether they repaid.

Using this data, we can train machines to find patterns and come up with answers. Finding the answer is not the problem. The problem is that banks cannot blindly trust the machine to give them the answer. What if the system breaks down, is hacked or has just been patched by a drunk graduate?

To tackle this, we use decision trees, which automatically split the data into yes/no questions such as “Does the borrower earn more than $128.12?”. The threshold sounds a little inhuman, but the machine generates such questions precisely so that each split partitions the data optimally.

That’s how a tree is built. The higher the branch (the closer to the root), the broader the question. Any analyst can accept this and explain the decision afterwards, even without knowing how the algorithm works (typical analyst!).
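To make the yes/no questions concrete, here is a hand-written toy tree for the loan example. The features, thresholds, and branch order are invented for illustration; a real algorithm such as CART or C4.5 would learn them from the borrower files:

```python
# A hand-written toy decision tree for the loan example. Everything
# here (features, the $128.12 split, branch order) is made up for
# illustration; CART/C4.5 learn the questions from the data.

def will_repay(borrower):
    if borrower["income"] > 128.12:          # root: the broadest question
        if borrower["age"] > 25:
            return True
        return borrower["education"] == "degree"
    return borrower["has_collateral"]        # low earners: check collateral

print(will_repay({"income": 150.0, "age": 30,
                  "education": "none", "has_collateral": False}))  # True
```

Each `if` is one node of the tree; following the branches for a new borrower is exactly how the bank gets an explainable answer.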

Decision trees are widely used in high-responsibility scenarios: diagnostics, medicine, and finance.

The two best known decision tree algorithms are CART and C4.5.

Today, pure decision tree algorithms are rarely used. However, they are the building blocks of large-scale systems, and when integrated, decision trees perform even better than neural networks. We’ll talk about that later.

When you do a Google search, it’s a bunch of awkward “trees” that help you find the answer. Search engines like these algorithms because they run fast.

By rights, support vector machines (SVM) should be the most popular classification method. It can be used to classify anything that exists: plants in pictures by shape, documents by category, etc.

The idea behind SVM is simple: it tries to draw two lines between the data points, maximizing the gap between them as much as possible.
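In one dimension this “widest street” intuition becomes very concrete: the best threshold sits halfway between the closest points of the two classes, and those closest points are the support vectors. A toy sketch with made-up numbers (a real SVM solves this in many dimensions with a quadratic program):

```python
# Maximum-margin separation in 1-D: the boundary sits midway between
# the closest "negative" and the closest "positive" point. Those two
# points are the support vectors; everything farther away is ignored.

neg = [1.0, 2.0, 2.5]   # class -1 (made-up data)
pos = [4.5, 5.0, 6.0]   # class +1 (made-up data)

support_neg = max(neg)              # closest negative to the boundary
support_pos = min(pos)              # closest positive to the boundary
threshold = (support_neg + support_pos) / 2
margin = support_pos - support_neg  # width of the "street"

print(threshold, margin)  # 3.5 2.0
```

Moving any non-support point around doesn’t change the boundary at all, which is why the method is named after the support vectors.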

Besides assigning categories, classifiers can flag anomalies: anything that doesn’t fit any known category gets marked as suspicious. This approach is now used in medical MRI, where computers flag all suspicious areas or deviations within the scan. Stock markets use it to detect unusual trader behavior and find insiders. When we teach a computer to recognize what is right, we automatically teach it to recognize what is wrong.

A rule of thumb: the more complex the data, the more complex the algorithm. For text, numbers, and tables, I’d choose the classical methods; the models are smaller, train faster, and their workflow is easier to follow. For pictures, video, and other complicated big data, I’d definitely look at neural networks.

Just five years ago you could still find SVM-based face classifiers. Today it’s easier to pick from hundreds of pre-trained neural networks. Nothing has changed for spam filters, though: they are still written with SVM, and there’s no reason to change that. Even my website uses SVM-based filtering for comment spam.

⌛ Regression

“Draw a line through these dots. Hmm, that’s machine learning.” Regression algorithms are currently used for:

  • Stock price forecasting;

  • Supply and sales analysis;

  • Medical diagnosis;

  • Calculating time-series correlations.

Common regression algorithms are:

  • Linear Regression

  • Polynomial Regression

A “regression” algorithm is essentially a “classification” algorithm, except that it predicts a number rather than a category: the price of a car from its mileage, the traffic at different times of day, how supply will change as the company grows, and so on. When a task involves something changing over time, regression is the natural choice.

Regression algorithms are favored by people in finance and analytics; regression is even a built-in feature of Excel. And the whole thing runs smoothly: the machine simply tries to draw the line that represents the average correlation. Unlike a person with a pen and a whiteboard, though, the machine does this with mathematical precision, by calculating the average distance from the line to every dot. If the line is straight, it’s “linear regression”; if it’s curved, it’s “polynomial regression”. These are the two main types of regression; the others are rarer. Don’t be fooled by the black sheep of the family, logistic regression: it’s a classification algorithm, not regression.
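For a single feature, that “mathematically precise” line has a closed-form solution (ordinary least squares). A sketch on Billy’s car numbers from earlier:

```python
# Ordinary least squares for one feature: the slope is the covariance
# of x and y divided by the variance of x, and the intercept follows
# from forcing the line through the mean point.

ages =   [0, 1, 2, 3, 4]                       # car age, years
prices = [20000, 19000, 18000, 17000, 16000]   # Billy's ad data

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(prices) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, prices)) \
        / sum((x - mean_x) ** 2 for x in ages)
intercept = mean_y - slope * mean_x

print(slope, intercept)  # -1000.0 20000.0
```

The machine recovers Billy’s “drops $1,000 per year from $20,000” rule, except it derived it from the dots instead of being told.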

That said, it’s fine to mix up “regression” and “classification”. Some classifiers, with their parameters adjusted, turn into regressors: instead of just naming an object’s class, they report how close the object is to that class, and that’s a regression problem.

🍈 Unsupervised learning

Unsupervised learning came along a little later, in the 1990s. These algorithms are used less often, sometimes simply because there is no alternative.

Annotated data is a luxury. If I wanted to build, say, a “bus classifier”, would I have to go out on the street and take millions of photos of damn buses, and then label them all? No way; that would take my whole life, and I still have a lot of games to play on Steam.

In this case, put a little faith in capitalism. Thanks to crowdsourcing, there are millions of people willing to do cheap tasks for you. Mechanical Turk[2], for example, is a crowd of people ready to complete your task for $0.05. That’s how things usually get done.

Or you can try unsupervised learning, though I don’t recall any best practices for it. Unsupervised learning is usually used for exploratory data analysis, not as a primary algorithm. A specially trained person with an Oxford degree feeds the machine a pile of junk and watches: are there any clusters? No. Any visible connections? No. Fine, so you still want to work in data science, right?

🔮 Clustering

“The machine chooses the best way to divide things up based on some unknown features.” Clustering algorithms are currently used for:

  • Market segmentation (customer type, loyalty)

  • Merge adjacent points on the map

  • Image compression

  • Analyze and annotate new data

  • Detect abnormal behavior

Common algorithms:

  • K-means clustering

  • Mean-Shift

  • DBSCAN

Clustering is classification without predefined labels: the machine sorts things even when you can’t name the categories, like sorting socks when you can’t remember all the colors. Clustering algorithms try to find similar objects (by certain features) and group them into clusters; objects sharing many features end up in the same cluster. Some algorithms even let you set the exact number of clusters you want.

Here’s a good example of clustering — a marker on an online map. As you search for nearby vegetarian restaurants, the clustering engine groups them into bubbles with numbers. Otherwise, the browser freezes as it tries to map all 300 vegetarian restaurants in this trendy city.

Apple Photos and Google Photos use more complex clustering. Create an album of your friends by searching for faces in photos. The app doesn’t know how many friends you have or what they look like, but it can still find common facial features. This is typical clustering.

Another common application is image compression. When saving an image as PNG, you can reduce the palette to, say, 32 colors. The clustering algorithm then finds all the “reddish” pixels, computes an “average red”, and assigns it to all of them. Fewer colors, smaller file: a bargain!

But colors like blue-green cause trouble. Is this green or blue? This is where the K-means algorithm comes in.

First, 32 color points are chosen at random as “cluster centers”, and every remaining point is labeled with its nearest center. This gives us 32 “clusters” of color points. We then move each center to the centroid of its cluster and repeat the two steps until the centers stop moving.

Done. We end up with 32 stable clusters.
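The two steps above (assign each point to its nearest center, then move each center to the mean of its cluster) can be sketched in plain Python. This toy version works on scalar values, think pixel intensities, and uses a deterministic initialization instead of the random one described above:

```python
def kmeans_1d(points, k, iters=100):
    """Plain K-means on scalar values (think pixel intensities).
    Deterministic init: k evenly spaced points from the sorted data."""
    pts = sorted(points)
    centers = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # step 1: assign each point to its nearest center
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # step 2: move each center to the mean of its cluster
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # centers stopped moving: done
            break
        centers = new_centers
    return centers

pixels = [10, 11, 12, 90, 92, 95, 198, 200, 205]
print(kmeans_1d(pixels, 3))  # three stable centers near 11, 92, 201
```

For real image compression the same loop runs on 3-D (R, G, B) points with k = 32, but the two alternating steps are identical.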

To give you a real life example:

Finding cluster centers is handy, but real clusters aren’t always round. Suppose you’re a geologist looking for similar minerals on a map: clusters can have strange shapes, even nested ones, and you don’t even know how many there will be. 10? 100?

The K-means algorithm is useless here, but DBSCAN can help. Think of the data points as people in a square: find any three people standing close together and ask them to hold hands. Then tell them to grab the hand of any neighbor they can reach (without moving), and repeat until no new neighbors join. That’s the first cluster; repeat the process until everyone is assigned.

The whole process looks pretty cool.
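The hand-holding procedure translates almost directly into code. A compact sketch with made-up 2-D points, using “at least min_pts neighbors within eps” as the density test:

```python
import math

def dbscan(points, eps=1.5, min_pts=3):
    """A compact DBSCAN sketch: grow clusters outward from dense
    points (those with at least min_pts neighbors within eps)."""
    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)       # None = unassigned / noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:        # not dense enough to start a cluster
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:                    # "grab the hand of the next neighbor"
            j = queue.pop()
            if labels[j] is None:
                labels[j] = cluster
                more = neighbors(j)
                if len(more) >= min_pts:
                    queue.extend(more)  # dense point: keep expanding
    return labels

pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (50, 50)]
print(dbscan(pts))  # [1, 1, 1, 2, 2, 2, None]
```

Note there’s no k to choose: the number of clusters emerges from the data, and the lonely point at (50, 50) is left as noise (`None`), which is exactly what makes DBSCAN handy for odd-shaped clusters and outliers.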

Like classification algorithms, clustering can be used to detect anomalies. Is there any abnormal operation after the user logs in? Tell the machine to temporarily disable his account, and then create a work order for tech support to check what’s going on. Maybe it’s a robot. We don’t even have to know what “normal behavior” looks like, just feed the user’s behavior data to the model and let the machine decide if the other person is a “typical” user.

This approach is not as effective as the classification algorithm, but it is still worth a try.

🪁 Dimensionality Reduction

“Assembles specific features into higher-level ones.” Dimensionality-reduction algorithms are currently used for:

  • Recommendation system
  • Beautiful visualization
  • Topic modeling and finding similar documents
  • Fake image detection
  • Risk management

Commonly used “dimensionality reduction” algorithms:

  • Principal Component Analysis (PCA)
  • Singular Value Decomposition (SVD)
  • Latent Dirichlet Allocation (LDA)
  • Latent Semantic Analysis (LSA, pLSA, GLSA)
  • t-SNE (for visualization)

In the early days, these methods were used by “hardcore” data scientists determined to find “something interesting” in a mass of numbers. When Excel charts didn’t work, they forced the machine to do the pattern-finding work. So they invented dimensionality reduction or feature learning.

Abstract concepts are more convenient for people than a bunch of fragmented features. For example, we combine a dog with triangular ears, a long nose and a large tail into the abstract concept of a sheepdog. We do lose some information compared to specific sheepdogs, but the new abstractions are more useful for scenarios that need to be named and explained. As a bonus, these “abstract” models learn faster, train with fewer features, and reduce overfitting.

These algorithms shine in “topic modeling”: abstracting meaning from specific word combinations. That’s what latent semantic analysis (LSA) does, based on how often particular words appear under a topic: science articles contain more tech-related words, politicians’ names show up mostly in political news, and so on.

We could cluster documents directly on all their words, but we would lose all the important latent connections (for example, “battery” and “accumulator” mean the same thing in different articles). LSA handles this well, which is why it’s called “latent semantic”.

So we need to connect words and documents into a single feature that preserves these latent links, and it turns out singular value decomposition (SVD) does exactly that. Useful topic clusters then stand out clearly among the clustered phrases.
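Real LSA computes a low-rank SVD of the term-document matrix, which calls for a linear-algebra library. But the latent link it recovers can be shown with a stdlib-only toy: in the made-up corpus below, “battery” and “accumulator” never co-occur (first-order similarity is zero), yet their co-occurrence profiles are identical (second-order similarity is one), and that shared profile is precisely the kind of structure SVD compresses into a “topic”:

```python
from collections import Counter
import math

# Toy corpus: "battery" and "accumulator" never appear together,
# yet they share context words - the latent link LSA recovers via SVD.
docs = [
    "battery charge power",
    "accumulator charge power",
    "parliament vote law",
]

vocab = sorted({w for d in docs for w in d.split()})

def doc_vector(word):
    # first-order profile: in which documents does the word occur?
    return [d.split().count(word) for d in docs]

def context_vector(word):
    # second-order profile: which words does it co-occur with?
    c = Counter()
    for d in docs:
        words = d.split()
        if word in words:
            c.update(w for w in words if w != word)
    return [c[w] for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(round(cosine(doc_vector("battery"),
                   doc_vector("accumulator")), 6))      # 0.0
print(round(cosine(context_vector("battery"),
                   context_vector("accumulator")), 6))  # 1.0
```

SVD does this for every word at once: it folds all such shared profiles into a handful of latent dimensions instead of comparing pairs by hand.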

Recommender systems and collaborative filtering are another area where dimensionality reduction is used frequently. If you use it to extract information from user ratings, you get a great system for recommending movies, music, games, or whatever you want.

It’s almost impossible to fully understand the abstractions the machine builds, but you can watch for correlations: some dimensions correlate with age (kids play Minecraft and watch cartoons more), others with movie style or user preferences.

Machines can pick up these high-level concepts without even understanding them, based on information such as user ratings. Nice work, Mr. Computer. Now we can write a paper on “Why the bearded lumberjack likes my pony”.

💎 Association Rule Learning

“Looks for patterns in streams of orders.” Association rules are currently used for:

  • Forecasting sales and promotions
  • Analyzing items bought together
  • Planning product placement
  • Analyzing web-browsing patterns

Commonly used algorithms:

  • Apriori
  • Euclat
  • FP-growth

Algorithms for analyzing shopping carts, automated marketing strategies, and other event-related tasks are all here. If you want to see patterns in a sequence of items, try them out.

For example, a customer goes to the cash register with a six-pack of beer. Should we put peanuts on the way to the checkout? How often do people buy both beer and peanuts? Yes, association rules probably apply to beer and peanuts, but what other sequences can we use them to predict? Can you make a small change in the layout of the product and get a big increase in profits?
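The beer-and-peanuts question boils down to two numbers that algorithms like Apriori are built on: support (how often the pair appears at all) and confidence (how often beer buyers also take peanuts). A toy sketch over a made-up transaction log:

```python
# Toy transaction log (made up). Support and confidence are the two
# measures association-rule miners like Apriori start from.
transactions = [
    {"beer", "peanuts", "diapers"},
    {"beer", "peanuts"},
    {"beer", "bread"},
    {"milk", "bread"},
    {"beer", "peanuts", "bread"},
]

def support(itemset):
    # fraction of all baskets that contain the whole itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # of the baskets with the antecedent, how many also have the consequent?
    return support(antecedent | consequent) / support(antecedent)

print(support({"beer", "peanuts"}))                 # 0.6
print(round(confidence({"beer"}, {"peanuts"}), 2))  # 0.75
```

Apriori’s trick is simply to avoid checking every possible itemset: any set containing an unpopular subset can be skipped, since its support can only be lower.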

The same idea applies to e-commerce, where the task is more interesting — what will the customer buy next?

For some reason, association rule learning is rarely mentioned in discussions of machine learning. The classic methods work by a head-on search through all purchased goods, using trees or sets. The algorithms can only search for patterns; they cannot generalize them or reproduce them on new examples.

In the real world, every major retailer has its own proprietary solution, so nothing revolutionary for you here. The most advanced technology on this page is the recommender system. Though I may simply be unaware of a breakthrough in this area; if you have something to share, let me know in the comments.