Source | altman superman’s blog

http://blog.csdn.net/djy1992/article/details/75257551

A Brief History of Machine Learning, by Wang Chunping, Director of Data R&D at the PPDAI AI Center

1. Introduction to machine learning

Thank you very much! First of all, it is my great honor to come to Societe Generale and give this informal talk. I heard from friends in Industrial Group that this is the first time machine learning has been the topic here, so I will approach it from a fairly broad perspective and weave in my own experience of working and studying in this field. The follow-up “boundless” forum will run a series of special topics, so if you are particularly interested in one area, or feel I have missed something, we can talk more after the meeting.

This is the overall framework of today’s talk: first a short introduction, then a tour that strings together some of the classic models and tries to find what they have in common, followed by some common practical issues in applying machine learning, and finally my personal views on where artificial intelligence is heading.

In the introduction, what I actually want to do is place the term “machine learning” in a larger context: when we use data to drive a business, whether that is data mining in finance, stock selection, or anything else, what position does machine learning occupy? The business can really be anything; what interests you today is probably finance. From big data, once the hot concept, to artificial intelligence, and on to the business we actually care about, the three are tightly connected; you could say big data and artificial intelligence became hot one after the other.

Our view is that you first need to accumulate a lot of data in the course of running the business, because only when you have enough data can you do something with it. Artificial intelligence today can be seen mostly as a class of tools that takes big data as raw material and feeds back into the business problems we care about, solving practical problems. Artificial intelligence is a fairly broad concept; it has become popular again in the last couple of years largely because of the craze set off by AlphaGo. AlphaGo has more to do with deep learning, but artificial intelligence in the broad sense covers anything a machine does in place of a human: it can include simple rule engines you write yourself, alerting and monitoring, and it also includes machine learning, the topic we focus on today. One branch of machine learning is deep learning, which is the hot area right now.

In my personal understanding, the difference between machine learning and artificial intelligence in the broad sense is that machine learning must learn something from a large amount of data and then apply it, rather than simply encoding the experience of experts. That is its position within the business and within AI; next we can look at where it sits in the overall process. When we want to solve a business problem with a data-driven method, we usually go through a few big steps. First we need to pin down the problem itself. For example, I might face a financial problem: I need to predict whether a stock will go up or down tomorrow. That is a fairly concrete problem, but often things are not so clear-cut, so you first have to look at the type of question: what to predict, what to focus on, what the samples are, and finally what form you want the output to take. This is where business and technology meet, and it requires both groups of people to discuss together. Then come the more scientific steps: you really need to understand what the problem is, model it, and learn rules or models that can be applied to the business. At the very end there is also an engineering part: taking what was learned and actually deploying it online. If my system is a high-frequency trading system, I apply the trading logic I have learned. If it is PPDAI, what I have learned is a model that evaluates a borrower’s credit, so I need to apply it in the loan process; for example, there is a step in our app where the applicant must be scored.

What we are going to talk about today is closer to the research side, which is mostly about modeling. Within machine learning, methods can be divided into three broad categories according to whether a “teacher” is present during learning. In the first, it is as if I am listening to a teacher in class: the teacher tells me this is an apple, that is an iPhone, that is an orange, and so on. Every sample that appears has a clear label; this is called Supervised Learning. If there is no teacher at all and we have to explore the data on our own, it is called Unsupervised Learning. There is also Reinforcement Learning, which is not often used in everyday data analysis but comes up frequently in robotics or in systems like AlphaGo. It resembles how a child learns: no one lays out the rules at the start; instead, doing the right thing in practice is rewarded and doing the wrong thing is punished, and through this feedback from the environment the learner continuously improves its understanding of it.

Of these three kinds of problems, the one we use most often and feel most comfortable with is the first: supervised learning. Its advantage is that the problem is clearly defined and there is an explicit target to predict. Say you need to know whether a person is a prospective user, or a user with good credit, or whether a stock will go down or up tomorrow: these targets are clearly labeled, and this is called Classification. If there is an explicit prediction target but the target is a continuous number, such as the stock price you want to predict, it is a Regression problem.

As for unsupervised learning, it is called unsupervised because the data itself is not well understood and in some cases labels are hard to obtain. The problem itself is not so clearly defined; I might get the data and do some analysis just to see what it looks like. It is somewhat all-encompassing, unlike supervised learning where there is a clear direction. Unsupervised learning can be divided into several categories according to the purpose of the analysis. If I want to know, say, how many groups a population falls into, without knowing in advance what the grouping is or what the “right” answer looks like, that is clustering, or segmentation: cutting the data into sets. Dimensionality reduction is also important, because in this era of data explosion many people have too much data rather than too little; there may be thousands or even a hundred thousand dimensions, especially with complex data types such as images, speech, or network access logs. If you fed all of that directly into a model, you would need an enormous amount of data to fit anything reliable, so dimensionality reduction is a big part of machine learning. Then there is topic modeling for text: an article can be reduced to a topic, or a combination of topics. Another task: I have a pile of photos or videos and I want to know which of them contain abnormal behavior, but I do not know in advance what “abnormal” looks like, and having people look through everything is uneconomical. So anomaly detection is another big category, used a lot in security.

We will not go into the details of these methods here, but we can discuss them in future posts if you are interested. Just to recap the various methods and branches: the first question is whether the problem is supervised or unsupervised. Below is a simple illustration. On the left is a classification problem with labeled examples; on the right is a pile of data where I have no idea how many categories there should be, so I simply divide it up a little and see whether each group turns out to be distinctive. That is the difference, and the intuitive feel, of supervised versus unsupervised learning.

Here is a slightly different way of categorizing. Whether a problem is supervised or unsupervised is determined when you define the problem to be solved; it is not something we get to choose. Whether to use a discriminative model or a generative model, however, is mostly a choice about how to build the model. This distinction applies to the supervised setting, so you can think about both. Given a learning goal, if I only want to know the conditional probability, that is, given X, what happens to Y, that is a discriminative model. Sometimes I also care about the distribution of X itself, so I model the joint distribution of X and Y; that is a generative model. Generally speaking, modeling the conditional probability directly with a discriminative model works better, especially when the training set and the prediction set come from the same distribution. But in some special cases, for example when some variables in X are missing, having the joint distribution, and hence a model of the distribution of X itself, lets you generate new data, impute missing values, and so on, which has its benefits. The generative model has some other nice properties too, which we can expand on later if you are interested. That is the warm-up: a small overview of machine learning.
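
To make the distinction concrete, here is a minimal sketch (not from the talk; the synthetic dataset and the particular models are illustrative assumptions): logistic regression models the conditional probability P(Y | X) directly, while Gaussian naive Bayes models P(X | Y) and P(Y) and applies Bayes’ rule.

```python
# Discriminative vs. generative, side by side (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression   # discriminative: models P(Y | X)
from sklearn.naive_bayes import GaussianNB            # generative: models P(X | Y) and P(Y)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

disc = LogisticRegression(max_iter=1000).fit(X_train, y_train)
gen = GaussianNB().fit(X_train, y_train)

print("discriminative accuracy:", disc.score(X_test, y_test))
print("generative accuracy:    ", gen.score(X_test, y_test))
```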

2. Classical models of machine learning

Let’s talk about model frameworks and classical machine learning models.

By “model framework” I do not mean the scenario where you simply call a library in practice; rather, if I really wanted to understand a model, what would be the best way to compare different models? So let us talk about what a machine learning model involves.

First we need to make some assumptions about the model: should we frame it as a probability problem or as an optimization problem? These are the model assumptions.

Once the model assumptions are in place, the next job is to figure out how to solve for the parameters or results, using the known data to train what needs to be trained. This is where many of the terms people often hear come in, for example the EM algorithm, Bayesian inference, Gibbs sampling, and so on. We call this model inference.

Once you have learned the model, the next step is model application. The whole purpose of machine learning is to apply the model to new samples, so how it applies and adapts to new samples is of special concern. Here we ask whether the model itself is inductive or transductive. Inductive means you derive a functional expression from the data you already have, and any new sample can simply be plugged into it; transductive means the model only produces results for the specific data it was given. This is a point to consider when choosing or setting up a model. So there are roughly these three steps.

Now we can look at some fairly simple but classic algorithms. If you want an unsupervised model, and specifically clustering, K-means is unavoidable. Its basic idea is very simple: I have a pile of data to divide into, say, three groups. I first set three points at random and then iterate until the sum of squared distances from each point to its assigned center is minimized, that is, until I have found the three most representative centers. It is typical unsupervised learning: it makes no assumptions about the distribution of the data; whatever the distribution is, it just runs. The optimization goal is to make the three centers as representative as possible, so it turns the problem into an optimization problem. When new data arrives, you can also plug it into the distance formula to find which center best represents the new point.
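
A bare-bones sketch of that idea in NumPy (illustrative only; the toy data and the lack of empty-cluster handling are my assumptions, and in practice you would use sklearn.cluster.KMeans):

```python
import numpy as np

def kmeans(X, k=3, n_iter=100, seed=0):
    """Random init, then alternate nearest-center assignment and mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # k random points as initial centers
    for _ in range(n_iter):
        # assign each point to its nearest center
        labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
        # move each center to the mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# toy data: three blobs; a new point is assigned with the same nearest-center rule
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in ([0, 0], [5, 5], [0, 5])])
labels, centers = kmeans(X, k=3)
print(np.round(centers, 2))
```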

Linear regression is supervised learning with a clear target. In the general case Y is some function of X; in linear regression f(X) is assumed to be a simple linear combination, and that is the model assumption. You can make different assumptions here; for example, I can assume the function is observed with noise and then further assume a distribution for that noise, say Gaussian. How to solve the problem is then the second step, the model inference work we just described. The most common way to solve linear regression is to write the objective function as the sum of squared errors. The nice thing is that you get a closed-form solution you can write down directly, without iterating. Later, when we talk about some common routines, we will see that linear regression can also be varied in ways that give it better properties.
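
A minimal sketch of that closed-form solution (the simulated data and coefficients are illustrative assumptions): solve the normal equations directly instead of iterating.

```python
# Ordinary least squares via the normal equations: beta = (X'X)^{-1} X'y
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([1.5, -2.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=200)   # linear signal plus Gaussian noise

X1 = np.column_stack([np.ones(len(X)), X])            # add an intercept column
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(np.round(beta_hat, 3))                          # approximately [0, 1.5, -2.0, 0.5]
```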

Another classic is logistic regression, for classification. It is also supervised learning, and its model assumption is probabilistic, so it is not quite the same as linear regression: the output is a probability, and that probability is what we use. For example, one thing we care about is whether a borrower will repay on time, and the model scores the probability that this person repays on time. Once I have the functional form of the model, the next thing is to solve it, because there is an undetermined coefficient, beta. For a probability model, the most common approach is to take the log of the likelihood, which ultimately turns the probability model into the problem of optimizing a cost function. In fact you will find that the routines are basically all like this: if you want a point estimate, that is, to estimate the optimal value of the parameter, the routine is to assume the functional form, write down its loss function, and turn it into an optimization problem.
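
A minimal sketch of that routine (the simulated data, learning rate, and iteration count are illustrative assumptions; library solvers are far more robust): minimize the negative log-likelihood of the logistic model by plain gradient descent, then read the output as a probability.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_beta = np.array([2.0, -1.0])
p = 1 / (1 + np.exp(-(X @ true_beta)))
y = rng.binomial(1, p)                       # 1 = "repays on time", 0 = "does not"

beta = np.zeros(2)
lr = 0.1
for _ in range(2000):
    p_hat = 1 / (1 + np.exp(-(X @ beta)))
    grad = X.T @ (p_hat - y) / len(y)        # gradient of the average negative log-likelihood
    beta -= lr * grad
print(np.round(beta, 2))                     # close to [2.0, -1.0]

# the model's output for a new sample is a probability
print(1 / (1 + np.exp(-(np.array([1.0, 0.5]) @ beta))))
```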

Then there is the Support Vector Machine, also for classification, but unlike logistic regression it makes no probabilistic assumptions. It is an optimization problem in its own right: find a line in the plane that separates the two classes such that the margin, the distance from the line to the closest points, is as large as possible. That is the case where the two classes can be separated perfectly; in practice a hard separation may overfit. It is still an optimization problem, and the objective is to maximize the margin.
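
A small sketch with scikit-learn (the blob data is an illustrative assumption): a linear SVM finds the maximum-margin separator, and the C parameter softens the margin when the classes are not perfectly separable.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("support vectors:", len(clf.support_vectors_))  # the closest points that define the margin
print("training accuracy:", clf.score(X, y))
```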

Neural networks are supervised learning and can be used for regression or classification. The idea is that I have many inputs and, say, one output: these are my independent variables X and dependent variable Y (see figure), but I assume there are hidden layers in between. Here there is one layer; if there are many layers, that is the deep learning architecture that is popular now. Each layer applies a function to a linear combination of the previous layer’s outputs, layer upon layer, and in the end we compare the predicted Y with the observed Y, write the error as a loss function, and minimize it, turning it into an optimization problem. Why is everyone so keen on turning things into optimization problems? Because if the problem is convex, good solvers already exist, as most of you probably know; the formerly troublesome part basically reduces to the routine above.
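
A minimal sketch of a feed-forward network (the dataset and layer sizes are illustrative assumptions): each hidden layer applies a nonlinearity to a linear combination of the previous layer, and training minimizes a loss on Y.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# One hidden layer here; adding more entries to hidden_layer_sizes gives the
# deeper architectures mentioned above.
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0).fit(X, y)
print("training accuracy:", net.score(X, y))
```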

There is another kind of model that does not quite follow the same pattern: Bayesian thinking. Everything above is basically point estimation: I have an unknown parameter and I estimate its single best value. Bayes is a little different. I start with a belief; say I want to estimate some coefficient and at first I know nothing, so it could be 0.1, it could be 0.9, and everything in between is about equally likely. As I see data, my belief concentrates more and more around one value. After a million data points I am fairly sure where the value lies, yet what I have is still a distribution, just a very narrow one. The Bayesian view is that nothing is certain; everything has a probability and a distribution, but the more data you see, the sharper that distribution becomes. So in the Bayesian learning framework there is no point estimate: no matter how much I learn, what I carry forward about the variable is a distribution, and the final output is a distribution.
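
A minimal sketch of that narrowing (the Beta-Bernoulli model, flat prior, and sample sizes are illustrative assumptions): the output is a whole posterior distribution over the unknown rate, and its spread shrinks as more data is observed.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.7
a, b = 1.0, 1.0                      # flat Beta prior: 0.1 and 0.9 start out equally plausible

for n in (10, 1000, 1_000_000):
    data = rng.binomial(1, true_rate, size=n)
    post_a, post_b = a + data.sum(), b + n - data.sum()     # conjugate Beta update
    mean = post_a / (post_a + post_b)
    sd = np.sqrt(post_a * post_b / ((post_a + post_b) ** 2 * (post_a + post_b + 1)))
    print(f"n={n:>9}: posterior mean {mean:.3f}, sd {sd:.5f}")
```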

3. Common issues in machine learning

Here is a brief introduction to some common issues in machine learning; they apply to most models and you are likely to run into them. The most common is overfitting. Overfitting means I put too much trust in the data I happen to see, treating it as all there is, and I may use a particularly complex model to fit it. When the model is complex enough and the amount of data is relatively small, you can find a function that fits the training data perfectly, yet it is completely useless on new data, because it has learned the noise perfectly.
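
A small sketch of that effect (the sine-plus-noise data and polynomial degrees are illustrative assumptions): a very flexible model can fit a handful of noisy points almost perfectly, yet do badly on fresh data from the same source.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)
x_train = rng.uniform(-1, 1, size=10)
x_test = rng.uniform(-1, 1, size=100)
y_train = f(x_train) + rng.normal(scale=0.2, size=x_train.size)
y_test = f(x_test) + rng.normal(scale=0.2, size=x_test.size)

for degree in (2, 9):                          # degree 9 can pass through all 10 training points
    coefs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```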

The noise is just a random term with no signal in it, so a common way to deal with overfitting is to add a regularization term as a constraint. Take the linear regression we just mentioned: without any penalty term it has a direct closed-form solution, which is very simple. But there is a problem: the coefficients can become very large, and very large coefficients cause trouble, for example huge positive and negative terms canceling each other out. So in general you want to add penalty terms that keep the coefficients from growing too large. There are many ways to penalize; the popular ones are to add an L1 norm, an L2 norm, or a combination of the two. Which penalty to choose depends on the characteristics of the data and the real requirements of the model.
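
A small sketch of the effect of those penalties (the simulated data and alpha values are illustrative assumptions): L2 (Ridge) shrinks coefficients, and L1 (Lasso) can push irrelevant ones all the way to zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] * 3 + rng.normal(scale=0.5, size=50)   # only the first feature matters

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
```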

Another common method is cross-validation. Many models have parameters that need to be chosen, and setting them by gut feeling is not very sound. There are several better options, but the one that is always feasible and most general is cross-validation: divide the training data into groups, use each group in turn as the validation set and the remaining groups as the training set. If there are five groups, there are five validation sets and five sets of results, and you keep tuning the parameter. In the example, as you vary the parameter M, the error keeps shrinking on the training set, while on the validation set it first shrinks and then grows; the point near which the validation error turns upward is generally a good choice.
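
A minimal sketch of 5-fold cross-validation for picking a parameter (the dataset is illustrative, and tree depth stands in for the “M” mentioned above as an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for depth in (1, 2, 4, 8, 16):
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth, random_state=0),
                             X, y, cv=5)          # 5 folds: each serves once as the validation set
    print(f"max_depth={depth:>2}: mean accuracy {scores.mean():.3f}")
```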

Another very popular idea is that if we have some weak classifiers or weak models, we can combine them to solve problems. This includes Bagging, Boosting, Stacking, and other approaches, collectively known as ensemble methods.

Let us start with Bagging. The diagram looks complicated, but put simply: I may only have a thousand records, so it is hard to assess the model properly, and the model can be strongly affected by which data it happens to see; I do not have a hundred thousand or a million records, so I cannot draw truly different samples. Instead, from that one thousand I repeatedly draw a different subset each time. All of these subsets come from the same place; I train a model on each and then combine them, and that is basically it. The subsets are generated randomly, with no special care taken. The popular random forest is built on this idea.
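
A small sketch of the idea (the dataset and number of estimators are illustrative assumptions): bagging trains the same base model on different bootstrap samples of the same data and combines them; a random forest adds random feature selection on top.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0)

print("bagged trees: ", cross_val_score(bag, X, y, cv=5).mean())
print("random forest:", cross_val_score(rf, X, y, cv=5).mean())
```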

Boosting sounds similar to Bagging, but the major difference is this: Bagging splits the data randomly, trains the models independently, and then combines them, whereas Boosting trains a weak classifier, then increases the weight of the samples that were classified badly and fits again. This is repeated many, many times, and the results are finally combined.
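
A minimal sketch of that re-weighting scheme (the dataset and number of rounds are illustrative assumptions), using AdaBoost as one concrete boosting method:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each round fits a weak learner, raises the weight of the samples the previous
# round got wrong, and the final prediction combines all rounds.
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
print("boosted weak learners:", cross_val_score(boost, X, y, cv=5).mean())
```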

Then there is Stacking, which is even more dizzying. Simply put, suppose I have several base learners, such as gradient boosting, random forest, and other classifiers. I train them on the training set and obtain each model’s predictions. Then, on top of those predictions, I train another model, which is effectively a model layered on top of the models: its inputs are the outputs of the models below. This amounts to learning how much weight each base model deserves, rather than simply averaging their outputs; better models get higher weight, so this approach generally improves performance a great deal, especially when the base models differ substantially from one another. If the base models are all the same, any combination is just the same average. These are some simple routines; there are many more elaborate ones, but these are the most general.
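
A small sketch of stacking (the dataset and choice of base learners and meta-learner are illustrative assumptions): a second-level model takes the base models’ predictions as input, effectively learning how much weight each deserves.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(),   # the model on top of the models
    cv=5,
)
print("stacked model:", cross_val_score(stack, X, y, cv=5).mean())
```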

4. Future trends of machine learning

Finally, I will briefly share my personal understanding of where machine learning is heading. When I was preparing the slides I searched online and found that the answers vary widely. Some are academic and talk about research trends; some are more sociological, and the trends they describe are more about the impact on every aspect of future society. I would like to talk about my own experience as someone working in this field.

One trend is that methods are becoming more universal. Machine learning people used to be specialized: text people did text, image people did image, audio people did audio, structured-data people did structured data, and each area had its own very specific feature extraction and prediction methods for its data source. Now that deep learning has taken off and become better understood, many different data sources can be handled with very similar algorithmic frameworks, so the methods are becoming more and more generic.

Another trend is tooling. The growing availability of open-source tools has made machine learning a much cheaper business. In the past you might have had to write all the code yourself, or even, at the very beginning, derive the model assumptions I just described and work out the corresponding optimization. Later you only needed to write a little code to switch between models. Now there are tools like TPOT, and some are even fully graphical: you can drag and drop to train a model, or do nothing at all and have the tool automatically run through the models and tune all the parameters. Overall, the barrier to entry keeps getting lower.

The other trend is that, as computing has become cheaper and cheaper, modeling is getting more brute-force. In the past, building variables and tuning parameters relied more on expert experience. In the future, even people who do not know much can, given enough time and powerful enough machines, produce reasonably good models by brute-force search. These trends certainly do not mean people are no longer needed to make model choices or variable choices; rather, many people will be able to use machine learning tools to solve many everyday problems.

I will stop here today. If you are interested in any of these topics, you are welcome to leave a comment on this article and discuss. Thank you!

End