Definition of Big Data
Big data refers to data collections that cannot be captured, managed, and processed by conventional software tools within an acceptable time frame. It is a massive, fast-growing, and diversified information asset that requires new processing modes to provide stronger decision-making power, insight, discovery, and process-optimization abilities. Big data is a general concept that has yet to be fully explored and precisely defined.
The core of big data is making use of the value of data, and machine learning is a key technology for doing so. For big data, machine learning is indispensable. Conversely, for machine learning, the more data available, the more likely the model's accuracy is to improve. Meanwhile, the computation time of complex machine learning algorithms urgently demands key technologies such as distributed computing and in-memory computing. Therefore, machine learning cannot thrive without the help of big data. Big data and machine learning are mutually reinforcing and interdependent.
Machine learning is closely linked to big data. However, it must be clearly recognized that big data is not the same as machine learning, and machine learning is not the same as big data. Big data includes distributed computing, in-memory databases, multidimensional analysis, and many other technologies. In terms of analysis methods alone, big data includes the following four:
1. Big data, small analysis: the OLAP idea from the data warehouse field, i.e., multidimensional analysis.
2. Big data, big analysis: this represents data mining and machine learning analysis.
3. Streaming analysis: this mainly refers to event-driven architectures.
4. Query analysis: the classic representative is the NoSQL database.
In other words, machine learning is just one type of big data analysis. Some of the results of machine learning seem almost magical, and in some cases they are the best illustration of the value of big data. However, this does not mean that machine learning is the only analysis method under big data.
Broadly speaking, machine learning is a method of empowering a machine to perform functions that direct programming cannot achieve. In a practical sense, machine learning is an approach that uses data to train a model and then uses that model to make predictions.
First, we store historical data in the computer. We then process these data with a machine learning algorithm, a process called “training”; the result of this processing can be used to make predictions on new data and is commonly called a “model”. Making predictions on new data is called “prediction”. “Training” and “prediction” are the two core processes of machine learning, and the “model” is the intermediate output of the process: “training” produces the “model”, and the “model” guides “prediction”.
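As a minimal sketch of this train → model → predict loop (the data are made up, and scikit-learn is assumed as the toolkit purely for illustration):

```python
# A minimal sketch of the train -> model -> predict workflow,
# using scikit-learn and made-up "historical data".
from sklearn.linear_model import LogisticRegression

# "Historical data": feature vectors and their known labels.
X_train = [[25, 1.2], [50, 4.5], [33, 2.1], [61, 5.0]]
y_train = [0, 1, 0, 1]

model = LogisticRegression()   # choose a learning algorithm
model.fit(X_train, y_train)    # "training" produces the "model"

X_new = [[40, 3.0]]            # new, unseen data
print(model.predict(X_new))    # "prediction" uses the model
```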
Human beings accumulate a great deal of history and experience in the course of growing up and living. Periodically, they “generalize” these experiences and acquire “rules” of life. When encountering unknown problems or needing to “speculate” about the future, people use these “rules” to make inferences about the unknown and the future, so as to guide their life and work.
The “training” and “prediction” processes of machine learning correspond to the “induction” and “speculation” processes of human beings. Through this correspondence, we can see that the idea of machine learning is not complicated: it is simply a simulation of how humans learn and grow. Because machine learning is not based on results derived from programmed rules, its conclusions are not causal logic but correlations drawn through inductive reasoning.
This also explains why human beings study history: history is a summary of past human experience. There is a saying that goes, “history is often different, but it is always surprisingly similar”. Through the study of history, we can summarize the laws of life and statehood and use them to guide our next steps, which is of great value. Some contemporary people ignore this original value of history and use it merely as a means of publicizing achievements, which is a misuse of history's real value.
Machine learning has deep connections with pattern recognition, statistical learning, data mining, computer vision, speech recognition, natural language processing and other fields.
In terms of scope, machine learning is similar to pattern recognition, statistical learning, and data mining. At the same time, the combination of machine learning with processing techniques from other fields has formed interdisciplinary subjects such as computer vision, speech recognition, and natural language processing. Therefore, when people talk about data mining in general, it can be treated as equivalent to machine learning. Likewise, what we call machine learning applications should be universal: not limited to structured data, but also covering images, audio, and so on.
Pattern recognition
Pattern recognition = machine learning. The main difference between the two is that the former is a concept that grew out of industry, while the latter comes mainly from computer science. In his famous book Pattern Recognition and Machine Learning, Christopher M. Bishop opens by noting that pattern recognition comes from industry while machine learning comes from computer science, yet their activities can be viewed as two facets of the same field, and both have developed significantly over the past decade.
Data Mining
Data mining = machine learning + database. The concept of data mining has become all too familiar in recent years, almost to the point of hype. Data mining is often touted as extracting gold from data, turning discarded data into value. But while you may dig up gold, you may also dig up rocks. The point is that data mining is a way of thinking: we should try to mine knowledge out of the data, but not every dataset contains gold, so don't mythologize it. A system cannot be made invincible by bolting on a single data-mining module (as IBM likes to boast). Rather, it takes a person with a data-mining mindset and a deep understanding of the data to derive patterns that guide business improvement. Most algorithms in data mining are machine learning algorithms optimized for databases.
Statistical learning
Statistical learning ≈ machine learning. Statistical learning is a discipline that overlaps heavily with machine learning, because most methods in machine learning come from statistics; it could even be argued that the development of statistics propelled the flourishing of machine learning. For example, the well-known support vector machine algorithm is derived from statistics. But the two differ in emphasis: statistical learners focus on the development and optimization of statistical models and lean toward mathematics, while machine learning researchers pay more attention to solving problems and lean toward practice, so they focus on the efficiency and accuracy of algorithms as executed on a computer.
Computer Vision
Computer vision = image processing + machine learning. Image processing techniques turn images into inputs suitable for machine learning models, which are responsible for identifying relevant patterns in them. Applications of computer vision are numerous, such as Baidu's image recognition, handwritten character recognition, license plate recognition, and so on. The field has very promising applications and is also a hot research direction. With the development of deep learning, a new branch of machine learning, the performance of computer image recognition has been greatly boosted, so the future prospects of computer vision are inestimable.
Speech Recognition
Speech recognition = speech processing + machine learning. Speech recognition combines audio processing technology with machine learning. Speech recognition technology is usually not used in isolation but in combination with natural language processing. A current application is Siri, Apple's voice assistant.
Natural Language Processing
Natural language processing = text processing + machine learning. Natural language processing (NLP) is the field that enables machines to understand human language. NLP makes heavy use of techniques related to compiler theory, such as lexical analysis and grammar analysis; at the level of understanding, it uses semantic understanding, machine learning, and other technologies. As the only symbolic system created by human beings, natural language has always been a research focus of machine learning. According to Yu Kai, a machine learning expert at Baidu, “Listening and seeing are things that cats and dogs can do, but only language is unique to humans.” How to use machine learning for deep understanding of natural language has always been a focus of industry and academia.
Machine learning methods
1. Regression algorithm
In most machine learning courses, regression is the first algorithm introduced, for two reasons. First, regression algorithms are relatively simple and allow a smooth transition from statistics to machine learning. Second, regression algorithms are the foundation of several powerful algorithms discussed later; without understanding regression, those algorithms cannot be learned. Regression has two important subclasses: linear regression and logistic regression.
Linear regression fits the straight-line functions we see all the time. How do we fit the line that best matches all of our data? The least squares method is generally used. Its idea is to assume that the fitted line represents the true values of the data, while the observed data are values containing error. To minimize the effect of these errors, we solve for the line that minimizes the sum of the squared errors. The least squares method thus transforms the fitting problem into the problem of finding a function's extremum. In mathematics, we usually find extrema by setting the derivative to 0, but this approach is not well suited to computers: it may have no closed-form solution, or it may require too much computation.
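As a small illustration of the least squares idea (the observations are toy data assumed for illustration; numpy's lstsq solves the minimization directly):

```python
import numpy as np

# Toy observations, assumed for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x

# Least squares: choose a, b to minimize sum((a*x + b - y)^2).
# np.linalg.lstsq solves this directly from the design matrix.
A = np.column_stack([x, np.ones_like(x)])
(a, b), residuals, *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"fitted line: y = {a:.2f} x + {b:.2f}")
```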
There is a special discipline in computer science called numerical computation, dedicated to improving the accuracy and efficiency of computers in performing all kinds of calculations. The famous “gradient descent” and “Newton's method”, for example, are classic numerical algorithms that are also well suited to finding the extrema of functions. Gradient descent is one of the simplest and most effective methods for solving regression models. Strictly speaking, since both the neural networks and the recommendation algorithms discussed below contain a linear regression component, gradient descent is also applied in their implementations.
Logistic regression is an algorithm very similar to linear regression, but essentially the types of problems they deal with differ. Linear regression handles numerical problems, where the predicted result is a number, such as a house price. Logistic regression is a classification algorithm: it predicts discrete categories, such as whether an email is spam, or whether a user will click on an ad.
In terms of implementation, logistic regression simply applies a Sigmoid function to the result of the linear regression, converting the numerical result into a probability between 0 and 1. (The shape of the Sigmoid function may not be intuitive; you just need to know that the larger the input, the closer the output is to 1, and the smaller the input, the closer the output is to 0.) We can then make predictions based on this probability: for example, if the probability is greater than 0.5, the email is classified as spam, or the tumor as malignant. Intuitively, logistic regression draws a classification line, as shown below.
Suppose we have data from a group of tumor patients, where the tumors are either benign (blue dots) or malignant (red dots). The red and blue colors are the “labels” of the data. Each data point also includes two “features”: the patient's age and the size of the tumor. Mapping these two features and the labels onto a two-dimensional plane forms the data in the figure above.
When a new data point arrives, the green dot, can we tell whether the tumor is malignant or benign? Based on the red and blue dots we have trained a logistic regression model, which is the classification line in the figure. Since the green dot appears on the left side of the classification line, we know its label should be red, meaning the tumor is malignant.
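A minimal sketch of this prediction step, with made-up coefficients standing in for a trained model (the feature values and weights are assumptions, not real medical data):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1):
    # large z -> near 1, small z -> near 0.
    return 1.0 / (1.0 + np.exp(-z))

# Assumed, made-up weights for the two features (age, tumor size),
# as if produced by training; real values would come from the data.
w = np.array([0.04, 1.3])
b = -4.0

def predict_malignant(age, size):
    p = sigmoid(w @ np.array([age, size]) + b)  # linear score -> probability
    return p, p > 0.5                           # threshold at 0.5

print(predict_malignant(55, 3.2))  # the "green dot"
```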
The classification line drawn by the logistic regression algorithm is basically linear (there are also logistic regression variants that draw nonlinear classification lines, but such models are very inefficient on large amounts of data). This means that when the boundary between two categories is not linear, the expressive power of logistic regression is insufficient. The following two algorithms are among the most powerful and important in machine learning, and both can fit nonlinear classification lines.
2. Neural networks
Neural network (also known as artificial neural network, ANN) algorithms were very popular in machine learning in the 1980s but declined in the mid-1990s. Now, riding the momentum of “deep learning”, neural networks are back as one of the most powerful machine learning algorithms.
The birth of neural networks originates from the study of how the brain works. Early biologists used neural networks to model the brain. When machine learning scholars used neural networks in experiments, they found that the results on visual and speech recognition were quite good. After the birth of the BP algorithm (a numerical algorithm that accelerates the training of neural networks), the development of neural networks entered a boom. One of the inventors of the BP algorithm is Geoffrey Hinton (middle in Figure 1), the machine learning pioneer introduced earlier.
Specifically, what is the learning mechanism of a neural network? Simply put, it is decomposition and integration. In the famous Hubel-Wiesel experiment, researchers probed the visual analysis mechanism of cats in just this way.
For example, a square is broken down into four polylines that go to the next level of visual processing, with four neurons each handling one polyline. Each polyline is further decomposed into two straight lines, and each straight line into black and white faces. Thus a complex image becomes a large number of details entering the neurons; the neurons process them and then integrate the results, finally concluding that you are looking at a square. This is how visual recognition works in the brain, and it is also how neural networks work.
Let's look at the logical architecture of a simple neural network. The network has an input layer, hidden layers, and an output layer. The input layer receives the signal, the hidden layers decompose and process the data, and the final results are integrated in the output layer. Each circle in a layer represents a processing unit, which can be regarded as a simulation of a neuron. Several processing units form a layer, and several layers form a network: a “neural network”.
In a neural network, each processing unit is in fact a logistic regression model: it receives the input from the layer above and transmits its prediction as output to the next layer. By repeating this process, a neural network can perform very complex nonlinear classification.
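A minimal numpy sketch of such a forward pass, with random placeholder weights where training would normally supply the values (the layer sizes are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass of a tiny input(2) -> hidden(3) -> output(1) network.
# Weights are random placeholders; training would set them.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

x = np.array([0.5, -1.2])    # input layer receives the signal
h = sigmoid(W1 @ x + b1)     # each hidden unit = one logistic regression
y = sigmoid(W2 @ h + b2)     # output layer integrates the results
print(y)
```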
The following figure illustrates a well-known application of neural networks in image recognition: a program called LeNet, a neural network with multiple hidden layers. LeNet can recognize a variety of handwritten digits with high accuracy and good robustness.
The input image is displayed in the lower-right square, and the computer's output appears after the word “answer” in red above the square. The three vertical image columns on the left show the outputs of the three hidden layers in the network. It can be seen that as the layers deepen, the details being processed become lower-level; for example, layer 3 basically handles details at the level of lines. LeNet was invented by machine learning guru Yann LeCun (Figure 1, right).
In the 1990s, the development of neural networks entered a bottleneck period. The main reason was that, despite the acceleration provided by the BP algorithm, training a neural network remained very difficult. Therefore, in the late 1990s, the support vector machine (SVM) algorithm took over neural networks' position.
3. SVM (Support Vector Machine)
The support vector machine (SVM) algorithm is a classic algorithm born in the field of statistical learning that went on to shine in the field of machine learning.
In a sense, the support vector machine algorithm is an enhancement of logistic regression: by imposing stricter optimization conditions, SVM can obtain a better classification boundary than logistic regression. Without some kind of kernel technique, however, the SVM algorithm is at best a better linear classifier.
However, combined with a Gaussian kernel, support vector machines can express very complex classification boundaries and achieve good classification results. A “kernel” is in fact a special kind of function, whose most typical characteristic is that it can map a lower-dimensional space to a higher-dimensional space.
How do we draw a circular classification boundary in two dimensions? This is difficult in two dimensions, but a kernel can map the two-dimensional space to three dimensions, where a linear plane achieves the same effect. In other words, a nonlinear classification boundary in a two-dimensional plane can be equivalent to a linear classification boundary in three-dimensional space: we achieve a nonlinear partition in the plane through a simple linear partition in the higher-dimensional space.
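A tiny numpy illustration of this idea, using an explicit feature map (x, y) → (x, y, x² + y²) rather than the Gaussian kernel itself (the points are made up for illustration):

```python
import numpy as np

# Points inside a circle of radius 1 (class 0) vs outside (class 1)
# are not linearly separable in 2-D.
pts = np.array([[0.2, 0.1], [-0.3, 0.4], [1.5, 0.2], [-1.2, -1.1]])
labels = np.array([0, 0, 1, 1])

# A simple explicit map to 3-D: (x, y) -> (x, y, x^2 + y^2).
# In the new space, the plane z = 1 separates the classes linearly.
z = pts[:, 0]**2 + pts[:, 1]**2
mapped = np.column_stack([pts, z])
print(mapped[:, 2] > 1.0)   # matches the labels: a linear cut in 3-D
```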
Support vector machines are machine learning algorithms with a strong mathematical flavor (in contrast to neural networks, which have a biological flavor). A core step of the algorithm is a proof that mapping data from a lower dimension to a higher dimension does not increase the computational complexity. Thus the support vector machine algorithm maintains computational efficiency while obtaining very good classification results. For this reason, in the late 1990s support vector machines occupied the core position in machine learning, largely replacing neural networks. Only now, with the resurgence of neural networks through deep learning, is the balance shifting again.
4. Clustering algorithm
A distinctive feature of the previous algorithms is that the training data contain labels, and the trained model can predict labels for unknown data. In the following algorithms, the training data contain no labels, and the purpose of the algorithm is to infer labels for the data through training. Such algorithms have a general name: unsupervised algorithms (the earlier algorithms with labeled data are supervised algorithms). The most typical unsupervised algorithm is the clustering algorithm.
Consider two-dimensional data, that is, data with two features. Suppose we want a clustering algorithm to give different kinds of data different labels; how does it work? Simply put, a clustering algorithm computes distances within the population and divides the data into multiple groups according to those distances.
The most typical representative of clustering algorithms is the k-means algorithm.
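A bare-bones k-means sketch in numpy, on assumed toy two-dimensional data, to show the alternation between assigning points to the nearest center and moving each center:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """A bare-bones k-means sketch: alternate assignment and update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

X = np.array([[1, 1], [1.2, 0.9], [5, 5], [5.1, 4.8]])
print(kmeans(X, k=2))
```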
5. Dimensionality reduction algorithm
The dimensionality reduction algorithm is another kind of unsupervised learning algorithm, whose main feature is to reduce data from a high dimension to a low dimension. Here, the dimension refers to the number of features of the data. For example, housing data might include length, width, area, and number of rooms, i.e., data of dimension 4. As you can see, the length and width overlap with the information in the area, since area = length × width. A dimensionality reduction algorithm can remove this redundant information and reduce the features to two, area and number of rooms, turning 4-dimensional data into 2-dimensional data. Reducing data from a higher dimension to a lower one not only facilitates presentation but also accelerates computation.
In the example just given, the reduced dimensions are at a level visible to the naked eye, and the compression loses no information (because the information is redundant). If the redundancy is not visible to the naked eye, or there are no redundant features, a dimensionality reduction algorithm still works, but some information will be lost. However, it can be proved mathematically that dimensionality reduction preserves the data's information to the maximum extent possible in the lower dimension. Therefore, using dimensionality reduction algorithms still brings many benefits.
The main function of dimensionality reduction algorithms is to compress data and improve the efficiency of other machine learning algorithms: data with thousands of features can be compressed down to a handful. Another benefit is data visualization: for example, 5-dimensional data compressed to 2 dimensions can be viewed on a two-dimensional plane. The main representative of dimensionality reduction algorithms is PCA (principal component analysis).
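A short sketch of the housing example above, with synthetic data and scikit-learn's PCA assumed as the tool; because area = length × width is (nearly linearly) redundant, two components retain almost all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed 4-D housing data: length, width, area, rooms.
rng = np.random.default_rng(1)
length = rng.uniform(8, 15, 50)
width = rng.uniform(5, 10, 50)
rooms = rng.integers(2, 6, 50)
X = np.column_stack([length, width, length * width, rooms])

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)   # 4-D -> 2-D
print(X2.shape, pca.explained_variance_ratio_.sum())
```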
6. Recommendation algorithm
Recommendation algorithms are very popular in industry today and are widely used in e-commerce, for example at Amazon, Tmall, and Jingdong. Their main feature is automatically recommending to users the things they are most interested in, so as to increase purchase rates and revenue. There are two main categories of recommendation algorithms:
One kind is recommendation based on item content: items whose content is similar to what the user has bought are recommended to the user. The premise is that each item has several tags, so items similar to the user's purchases can be found. The benefit of such recommendations is their high relevance; the drawback is that every item must be tagged, which requires considerable work.
The other kind is recommendation based on user similarity: things bought by other users with the same interests as the target user are recommended to the target user. For example, if user A has bought items B and C, and analysis finds that user D, whose tastes are similar to A's, has bought item E, then item E is recommended to user A.
Both types of recommendation have their own advantages and disadvantages, and in typical e-commerce applications they are mixed. The most famous recommendation algorithm is the collaborative filtering algorithm.
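A minimal sketch of user-similarity recommendation in the spirit of collaborative filtering, on a made-up rating matrix (encoding “not bought” as 0 is an assumption for illustration):

```python
import numpy as np

# Toy user-item rating matrix (rows: users, cols: items B, C, E).
# Zeros mean "not yet bought/rated" -- an assumed encoding.
R = np.array([
    [5, 4, 0],   # user A: bought B and C, E unknown
    [4, 5, 5],   # user D: similar taste, also bought E
    [1, 0, 2],   # an unrelated user
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Score A's missing item E by similarity-weighted ratings of the others.
sims = np.array([cosine(R[0], R[i]) for i in (1, 2)])
pred = sims @ R[1:, 2] / sims.sum()
print(f"predicted interest of A in item E: {pred:.2f}")
```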
7. Gradient descent
Gradient descent is an optimization algorithm, also known as the steepest descent method, one of the simplest and oldest methods for solving unconstrained optimization problems. Although it is rarely used directly in practice today, many effective algorithms are improvements and modifications of it. The steepest descent method uses the negative gradient direction as the search direction; the closer it gets to the target value, the smaller the steps and the slower the progress. Picture the function as a mountain: standing on a hillside and looking around, we ask in which direction one small step downhill descends the fastest. Of course, there are many ways to solve such problems, and gradient descent is just one of them; another is the Normal Equation.
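A minimal sketch of the descent loop on a toy convex function (the function, start point, and learning rate are all assumptions for illustration):

```python
# Gradient descent on a simple convex "mountain": f(x) = (x - 3)^2.
# Step in the negative gradient direction; steps shrink as the
# slope flattens near the minimum, as described above.
def grad(x):
    return 2 * (x - 3)   # f'(x)

x, lr = 10.0, 0.1        # start point and learning rate
for _ in range(50):
    x -= lr * grad(x)
print(x)                 # converges toward 3
```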
8. Newton's method
Newton's method is a nonlinear least squares optimization method. It uses the Taylor expansion of the objective function to transform, at each iteration, the least squares problem of a nonlinear function into the least squares problem of a linear function. The disadvantage of Newton's method is that if the initial point is too far from the minimum, the iteration step may be too large, and the function value at the next iterate may not be smaller than at the previous one. Guided by the second derivative, Newton's method uses the convexity of the function to search directly for the extreme point. In other words, when choosing a direction, it considers not only whether the current slope is large enough, but also whether the slope will grow after taking a step.
In terms of convergence rate, gradient descent converges linearly, while Newton's method converges superlinearly (at least second order). When the objective function is convex, the solution found by gradient descent is the global optimum; in general, however, the solution is not guaranteed to be globally optimal. When the objective function is not convex, it can sometimes be approximately transformed into a convex function, or intelligent optimization algorithms such as simulated annealing can be used, which jump out of local extrema with a certain probability, although these algorithms do not guarantee finding the minimum either.
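For comparison, the same toy function minimized with Newton's update x ← x − f′(x)/f″(x); for a quadratic it reaches the minimum in a single step, illustrating the faster convergence mentioned above:

```python
# Newton's method on f(x) = (x - 3)^2, using the second derivative.
def f1(x): return 2 * (x - 3)   # f'
def f2(x): return 2.0           # f''

x = 10.0
for _ in range(5):
    x -= f1(x) / f2(x)          # Newton step
print(x)                        # 3.0, reached in one step here
```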
9. BP algorithm
The BP algorithm is a learning procedure composed of two processes: forward propagation of the signal and backward propagation of the error. In forward propagation, the input sample enters at the input layer and, after being processed layer by layer, reaches the output layer. If the actual output of the output layer does not match the desired output (the teacher signal), the algorithm enters the error back-propagation stage. Error back-propagation passes the output error backward through the hidden layers to the input layer, layer by layer in some form, apportioning the error to all the units of each layer; the error signal obtained for each unit serves as the basis for correcting that unit's weights. This cycle of forward signal propagation and backward error propagation, with its repeated weight adjustments, is the learning and training process of the network. It continues until the network's output error is reduced to an acceptable level, or a predetermined number of learning iterations is reached.
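A compact backpropagation sketch on the classic XOR problem, with sigmoid units and squared error; the architecture, learning rate, and iteration count are assumptions, and a real implementation would be more careful:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 1.0

for _ in range(10000):
    # Forward propagation: input -> hidden -> output.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Error back-propagation: apportion the output error to each
    # unit, layer by layer, via the chain rule.
    d_out = (out - y) * out * (1 - out)
    d_h = d_out @ W2.T * h * (1 - h)
    # Weight correction based on each unit's error signal.
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

# Typically approaches [0, 1, 1, 0]; convergence can depend on the seed.
print(out.round(2).ravel())
```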
10. SMO algorithm
The SMO algorithm is an efficient algorithm developed to solve the Lagrange dual of the SVM problem, which is a quadratic programming problem. The computational cost of traditional quadratic programming algorithms is proportional to the size of the training set, while SMO optimizes the solution process for this particular quadratic programming problem based on the characteristics of the problem itself (the KKT conditions). In the dual problem, we ultimately solve only for the vector of Lagrange multipliers α. The basic idea of the algorithm is to select a single pair (αi, αj) at a time, fix the other components of the α vector, optimize that pair, and repeat until convergence.
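A sketch in the spirit of the simplified SMO variant often used in teaching (linear kernel, random choice of the second multiplier; the full SMO chooses the pair with heuristics). The data and parameters are assumptions for illustration:

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=5, seed=0):
    """Simplified SMO sketch with a linear kernel; y must be +/-1."""
    rng = np.random.default_rng(seed)
    n = len(y)
    K = X @ X.T                      # linear kernel matrix
    alpha, b, passes = np.zeros(n), 0.0, 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            Ei = (alpha * y) @ K[:, i] + b - y[i]
            # Optimize only if alpha_i violates the KKT conditions.
            if (y[i]*Ei < -tol and alpha[i] < C) or (y[i]*Ei > tol and alpha[i] > 0):
                j = rng.choice([k for k in range(n) if k != i])
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                if y[i] != y[j]:
                    L, H = max(0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                eta = 2*K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                # Analytic update of alpha_j, clipped to the box [L, H].
                alpha[j] = np.clip(aj_old - y[j]*(Ei - Ej)/eta, L, H)
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] += y[i]*y[j]*(aj_old - alpha[j])
                b1 = b - Ei - y[i]*(alpha[i]-ai_old)*K[i,i] - y[j]*(alpha[j]-aj_old)*K[i,j]
                b2 = b - Ej - y[i]*(alpha[i]-ai_old)*K[i,j] - y[j]*(alpha[j]-aj_old)*K[j,j]
                b = b1 if 0 < alpha[i] < C else b2 if 0 < alpha[j] < C else (b1 + b2)/2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.0], [-1.5, -2.5]])
y = np.array([1, 1, -1, -1])
alpha, b = simplified_smo(X, y)
w = (alpha * y) @ X   # recover the linear boundary w.x + b = 0
print(w, b)
```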
In addition to the algorithms above, machine learning includes others such as Gaussian discriminant analysis, naive Bayes, and decision trees. The 10 algorithms listed above, however, are among the most widely used and influential. One feature of machine learning is precisely this abundance of algorithms, each flourishing in its own setting.
To summarize: according to whether the training data have labels, the algorithms above can be divided into supervised learning algorithms and unsupervised learning algorithms. The recommendation algorithm is special: it belongs to a separate category, neither supervised nor unsupervised learning.
Supervised learning algorithm:
Linear regression, logistic regression, neural network, SVM
Unsupervised learning algorithm:
Clustering algorithm, dimensionality reduction algorithm
Special algorithm:
Recommendation algorithm
Beyond these, the names of several other algorithms crop up frequently in the machine learning field. But they are not machine learning algorithms per se; they exist to solve subproblems within the algorithms above. You can think of them as sub-algorithms used to improve the training process. Among them: the gradient descent method, used mainly in linear regression, logistic regression, neural networks, and recommendation algorithms; Newton's method, used mainly in linear regression; the BP algorithm, used mainly in neural networks; and the SMO algorithm, used mainly in SVM.
The combination of machine learning and big data has generated tremendous value. Based on advances in machine learning, data can be “predicted”. For human beings, the more experience accumulated, and the broader that experience, the more accurate the judgment of the future. It is often said, for example, that “experienced” people have an advantage over “inexperienced” people at work, because the rules they have summarized are more accurate. In the field of machine learning, a famous experiment effectively confirmed the corresponding theory: the more data a machine learning model has, the better its predictions become.
Successful machine learning applications are not the ones with the best algorithm but the ones with the most data!
In the era of big data, several advantages make machine learning more widely applicable. For example, with the growth of the Internet of Things and mobile devices, we have more and more data, including unstructured data such as pictures, text, and video, giving machine learning models ever more to learn from. At the same time, MapReduce-style distributed computing in big data technology makes machine learning faster and more convenient to use. These advantages allow machine learning to realize its full potential in the era of big data.