
A Few Useful Things to Know about Machine Learning, by Professor Pedro Domingos of the University of Washington, summarizes 12 valuable lessons for machine learning researchers and practitioners, including pitfalls to avoid, key questions to focus on, and answers to frequently asked questions. Hopefully these lessons will be of some help to machine learning enthusiasts.

1. “Representation + evaluation + optimization” makes up the core of machine learning

A machine learning algorithm consists of three components (a short sketch after the three items below ties them together):

Representation: A classifier must be expressed in some formal language that the computer can handle. Equivalently, choosing a representation for a learner amounts to choosing the set of classifiers it can possibly learn. This set is called the learner’s “hypothesis space”. If a classifier is not in the hypothesis space, it cannot be learned. A related question is how to represent the input, that is, what features to use.

Evaluation: An evaluation function is needed to distinguish good classifiers from bad ones. The evaluation function used internally by the algorithm may differ from the external one we want the classifier to optimize, both for ease of optimization and because of the issues discussed in the next section.

Optimization: We need a method to search for the highest-scoring classifier. The choice of optimization technique is key to the efficiency of the learner, and it also helps determine the classifier produced when the evaluation function has more than one optimum. New learners often start out with an off-the-shelf optimizer, which is later replaced by a custom one.
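As a rough illustration (not from the original paper), the sketch below makes the three components concrete for a linear classifier: a sigmoid over a weighted sum as the representation, log-loss as the evaluation function, and plain gradient descent as the optimizer. The toy data and hyperparameters are invented for illustration.

```python
import numpy as np

# Toy data: two Gaussian blobs (invented for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(+1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Representation: a linear classifier, sigmoid(w . x + b).
w, b = np.zeros(2), 0.0

def predict_proba(X):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Evaluation: log-loss, the objective the optimizer works on.
def log_loss(p, y):
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

# Optimization: plain gradient descent on the log-loss.
lr = 0.1
for _ in range(500):
    p = predict_proba(X)
    w -= lr * (X.T @ (p - y) / len(y))
    b -= lr * np.mean(p - y)

print("training log-loss:", round(log_loss(predict_proba(X), y), 3))
print("training accuracy:", np.mean((predict_proba(X) > 0.5) == y))
```

Swapping any one of the three parts (say, a decision tree representation, hinge loss, or a different optimizer) gives a different learner built from the same template.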

2. “Generalization ability” is critical, and validation on “test data” is essential

The main goal of machine learning is to generalize beyond the examples in the training set, because no matter how much data you have, it is very unlikely that you will see exactly the same examples again at test time. Doing well on the training set is easy. The most common mistake machine learning beginners make is to test on the training data and get the illusion of success; when the chosen classifier is then tried on new data, it is often no better than random guessing. So, if you hire someone to build a classifier, be sure to keep some of the data to yourself and test the classifier they give you on it. Conversely, if someone hires you to build a classifier, set some of the data aside from the start and only use it for the final test of your classifier.
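A minimal sketch of the holdout discipline described above, assuming scikit-learn and a synthetic dataset standing in for real data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real problem (invented for illustration).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Keep a test set aside and never touch it during model development.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))  # typically near 1.0
print("test accuracy: ", clf.score(X_test, y_test))    # the number that actually matters
```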

3. Data alone is not enough. Knowledge works better

Another consequence of making generalization the goal is that data alone is not enough, no matter how much of it you have; every learner must embody some knowledge or assumptions beyond the data in order to generalize. This may sound like depressing news: how can we then hope to learn anything? Luckily, the functions we want to learn in the real world are not drawn uniformly from the set of all mathematically possible functions. In fact, very general assumptions — such as smoothness, similar examples having similar classes, limited dependencies, or limited complexity — are often enough to do very well, and this is a large part of why machine learning has been so successful. Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction is a more powerful lever than deduction, requiring much less input knowledge to produce useful results, but it still needs more than zero input knowledge to work. And, as with any lever, the more you put in, the more you can get out.

In retrospect, the need for knowledge in learning should not be surprising. Machine learning is not magic; it cannot get something from nothing. Like all engineering, programming takes a lot of work: we have to build everything from scratch. Learning is more like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops; learners combine knowledge with data to grow programs.

4. “Overfitting” gives the illusion of learning

If the knowledge and data we have are not sufficient to completely determine the correct classifier, we run the risk of just “hallucinating” a classifier (or parts of one) that is not grounded in reality and simply encodes random quirks in the data. This problem is called overfitting, and it is one of the thorniest problems in machine learning. If your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on the test data, when it could have output one that is 75% accurate on both, it has overfit.
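The effect is easy to reproduce. In the hypothetical sketch below, both the features and the labels are pure noise, yet an unconstrained decision tree still reaches near-perfect training accuracy while doing no better than chance on held-out data (scikit-learn is assumed):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Features and labels are pure noise, so there is nothing real to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = rng.integers(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("train accuracy:", clf.score(X_train, y_train))  # ~1.0: the tree memorizes the noise
print("test accuracy: ", clf.score(X_test, y_test))    # ~0.5: no better than guessing
```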

In machine learning, everyone knows about overfitting. But it comes in many forms, and people often do not recognize it right away. One way to understand overfitting is to decompose generalization error into bias and variance. Bias is a learner’s tendency to consistently learn the same wrong thing; variance is its tendency to learn random signals regardless of the real signal. Linear learners have high bias, because when the boundary between two classes is not a hyperplane the learner cannot capture it. Decision trees do not have this problem because they can represent any Boolean function, but on the other hand they can have high variance: decision trees learned on different training sets are often very different, when in fact they should be the same.

Cross-validation can help combat overfitting, for example by using it to choose the best size of decision tree to learn. But it is no panacea, since if we use it to make too many parameter choices it can itself start to overfit.
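For example, here is one possible way to let cross-validation pick the depth of a decision tree, using scikit-learn’s GridSearchCV on synthetic data (the candidate depths are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Use 5-fold cross-validation to pick the tree depth, instead of trusting training error.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5, 8, 12, None]},
    cv=5,
)
search.fit(X, y)
print("best depth:", search.best_params_, "cv accuracy:", round(search.best_score_, 3))
```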

Besides cross-validation, there are many ways to fight overfitting. The most popular is to add a regularization term to the evaluation function. This can, for example, penalize classifiers with more structure, favoring classifiers with simpler parameter structures and less room to overfit. Another option is to perform a statistical significance test such as chi-square before adding new structure, to decide whether the class distribution really differs with and without it. These techniques are particularly useful when data is very scarce. Nevertheless, you should be skeptical of claims that any technique solves the overfitting problem completely. It is easy to avoid overfitting (variance) by falling into the opposite error of underfitting (bias); simultaneously avoiding both requires learning a perfect classifier, and no single technique always does best without prior knowledge (there is no such thing as a free lunch).
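As a hedged illustration of a regularization term, the sketch below varies the strength of the L2 penalty in scikit-learn’s logistic regression (smaller C means a stronger penalty) and reports both cross-validated accuracy and the size of the learned weights; the dataset and the specific C values are invented for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

# Smaller C = stronger L2 penalty on the weights (the regularization term in the objective).
for C in (100.0, 1.0, 0.01):
    clf = LogisticRegression(C=C, max_iter=5000)
    score = cross_val_score(clf, X, y, cv=5).mean()
    clf.fit(X, y)
    print(f"C={C:>6}: cv accuracy={score:.3f}, weight norm={np.linalg.norm(clf.coef_):.2f}")
```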

5. After overfitting, the biggest problem in machine learning is the “curse of dimensionality”

After overfitting, the biggest problem in machine learning is the curse of dimensionality. The term was coined by Bellman in 1961 and refers to the fact that many algorithms that work fine in low dimensions become intractable when the inputs are high-dimensional. But in machine learning it refers to much more: as the dimensionality (number of features) of the examples grows, generalizing correctly becomes harder and harder, because a fixed-size training set covers a dwindling fraction of the input space.

The more general problem with high dimensions is that our intuitions, which come from a three-dimensional world, usually do not apply there. In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean but in an increasingly distant “shell” around it; in addition, most of the volume of a high-dimensional distribution lies near its surface rather than in its interior. If a constant number of examples is distributed uniformly in a high-dimensional hypercube, then beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor. And if we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube lies outside the hypersphere. This is bad news for machine learning, where shapes of one type are often approximated by shapes of another, and the approximation breaks down in high dimensions.
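One of these effects can be checked numerically. The sketch below (an illustration, not from the paper) draws points uniformly in the unit hypercube and shows that, as the dimension grows, the nearest and farthest neighbors of a reference point end up at almost the same distance:

```python
import numpy as np

# For points drawn uniformly in [0, 1]^d, the relative gap between the nearest and
# farthest neighbor distances shrinks as the dimension d grows.
rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from one reference point
    print(f"d={d:>4}: nearest/farthest distance ratio = {dists.min() / dists.max():.3f}")
```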

It is easy to build a classifier in two or three dimensions; we can find reasonable boundaries between examples of different classes just by visual inspection. In high dimensions, however, it is hard to understand the structure of the data’s distribution, which in turn makes it hard to design a good classifier. In short, one might think that gathering more features can never hurt, since at worst they simply provide no new information about the class. But in fact the curse of dimensionality may outweigh the benefits of adding features.

6. “Theoretical guarantees” are not the same as “practical performance”

Machine learning papers are full of theoretical guarantees. The most common type is a bound on the number of training examples needed to ensure good generalization. First of all, it is remarkable that anything at all is provable. Induction is traditionally contrasted with deduction: in deduction you can guarantee that the conclusions are correct, while in induction all bets are off — or so the conventional wisdom held for centuries. A major development of recent decades has been the realization that we can in fact have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees.

We have to be careful about what such a bound means. It does not say that, if your learner returns a hypothesis consistent with a particular training set, that hypothesis probably generalizes well. What it says is that, given a sufficiently large training set, your learner will with high probability either return a hypothesis that generalizes well or fail to find a consistent one. Nor does the bound tell us how to choose a good hypothesis space. It only tells us that, if the hypothesis space contains a good classifier, the probability of the learner outputting a bad one decreases as the training set grows. If we shrink the hypothesis space, the bound improves, but the chance that the space contains a good classifier shrinks too.

Another common type of theoretical guarantee is asymptotic: given infinite data, the learner is guaranteed to output the correct classifier. This sounds reassuring, but it would be rash to prefer one learner over another just because of its asymptotic behavior. In practice, we are rarely anywhere near the asymptotic regime, and because of the bias-variance tradeoff discussed above, if learner A is better than learner B given infinite data, B is often better than A given finite data.

The main role of theoretical guarantees in machine learning is not as a criterion for practical decisions, but as a source of understanding and a driving force for algorithm design. In that role they are very useful; indeed, the close interplay of theory and practice is one of the main reasons machine learning has made such rapid progress over the years. But note: learning is a complex phenomenon, and just because a learner has a theoretical justification and works in practice does not mean the former is the reason for the latter.

7. Feature engineering is the key to machine learning

Finally, some machine learning projects succeed spectacularly while others fail. What makes the difference? The most important factor is the features used. Learning is easy if you have many independent features that each correlate well with the class. Conversely, if the class is an extremely complex function of the features, your model may not be able to learn it. Often, the raw data is not in a form that is amenable to learning, but you can construct features from it that are. This is typically the most important and often the most interesting part of a machine learning project, where intuition, creativity, “black art” and technique matter equally.
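As a small, hypothetical example of feature construction, the raw column names and transformations below are invented; the point is only that derived features such as hour-of-day or a log-scaled amount are often far more learnable than the raw values:

```python
import numpy as np
import pandas as pd

# Hypothetical raw log data: a timestamp and an amount per transaction.
raw = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-04 09:15", "2021-01-09 23:40", "2021-01-10 14:05"]),
    "amount": [12.5, 89.0, 7.2],
})

# Constructed features a learner can actually use: the raw timestamp itself is rarely
# informative, but hour-of-day and weekend-ness often are.
features = pd.DataFrame({
    "hour": raw["timestamp"].dt.hour,
    "is_weekend": raw["timestamp"].dt.dayofweek >= 5,
    "log_amount": np.log1p(raw["amount"]),
})
print(features)
```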

Beginners are often surprised by how little of the time in a machine learning project is spent actually doing machine learning. It is less surprising once you account for the time spent gathering, integrating, cleaning, and preprocessing data, and the trial and error that goes into feature design. Moreover, machine learning is not a one-shot process of building a data set and running a learner once; it is an iterative process of running the learner, analyzing the results, and modifying the data set and/or the learner. Learning is often the quickest part of this, but that is because we have already mastered it so well! Feature engineering is difficult because it is domain-specific, whereas learners can be largely general-purpose. However, there is no sharp line between the two, which is often why models that incorporate domain knowledge perform best.

8. Remember: The amount of data is more important than the algorithm

In most of computer science, the two main limited resources are time and memory. In machine learning there is a third: training data. Which one is the bottleneck has changed over time. In the 1980s it tended to be data; today it is often time. Enormous amounts of data are available, but there is not enough time to process it all, so it goes unused. This leads to a paradox: even though in principle more data means that more complex classifiers can be learned, in practice we tend to use simpler classifiers, because complex ones take too long to train. Part of the answer is to come up with ways to learn complex classifiers quickly, and there has indeed been remarkable progress in this direction.

Part of the reason using a cleverer algorithm has a smaller payoff than you might expect is that, to a first approximation, they all do the same thing. This is surprising when you consider representations as different as, say, sets of rules and neural networks. But in fact propositional rules can easily be encoded as neural networks, and similar relationships hold between other representations. In essence, all learners work by grouping nearby examples into the same class; the key difference lies in the meaning of “nearby”. With non-uniformly distributed data, learners can produce widely different boundaries while still making the same predictions in the regions that matter (those with many training examples, and therefore also where most test examples are likely to appear). This also helps explain why powerful models can be unstable and yet still accurate.

In general, it pays to try the simplest learners first (for example, naive Bayes before logistic regression, k-nearest neighbors before support vector machines). More sophisticated learners are tempting, but they are usually harder to use, because there are more knobs you need to turn to get good results, and because their internals are more opaque.
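One possible way to act on this advice is to benchmark a few simple learners before reaching for more complex ones; the sketch below uses scikit-learn with synthetic data, so the numbers themselves mean nothing:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Try the simple learners first; only move to fancier ones if they clearly lose.
for name, clf in [
    ("naive Bayes", GaussianNB()),
    ("k-nearest neighbors", KNeighborsClassifier()),
    ("logistic regression", LogisticRegression(max_iter=5000)),
    ("SVM (RBF kernel)", SVC()),
]:
    print(f"{name:<22}: {cross_val_score(clf, X, y, cv=5).mean():.3f}")
```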

Learners can be divided into two major types: those whose representation has a fixed size, such as linear classifiers, and those whose representation can grow with the data, such as decision trees. Fixed-size learners can only take advantage of so much data. Variable-size learners can in principle learn any function given enough data, but in practice they often do not, because of limitations of the algorithm or computational cost. Also, because of the curse of dimensionality, no existing amount of data may be enough. For these reasons, clever algorithms, those that make the most of the data and computing resources available, often pay off in the end, provided you are willing to put in the effort. There is no sharp line between designing learners and learning classifiers; rather, any given piece of knowledge can either be encoded in the learner or learned from the data. So learner design is often an important part of a machine learning project, and its designers need the relevant expertise.

9. A “single model” rarely achieves the optimum; “model ensembles” are the way forward

In the early days of machine learning, everyone had a favorite learner, together with some a priori reasons to believe in its superiority. Researchers developed large numbers of variations of it and selected the best one. Then systematic empirical comparisons showed that the best learner varies from application to application, and systems containing many different learners began to appear. Efforts then went into trying many variations of many learners and still selecting just the best one. But researchers noticed that combining many variations, instead of selecting the single best one found, gives better results (often much better) with little extra effort.

Model ensembles are now standard practice. The simplest of these techniques is bagging, where we generate random variations of the training set by resampling, learn a classifier on each, and combine the results by voting. This works because it greatly reduces variance while only slightly increasing bias. In boosting, training examples carry weights, and these are varied so that each new classifier focuses on the examples the previous ones tended to get wrong. In stacking, the outputs of the individual classifiers become the inputs of a “higher-level” learner that figures out how best to combine them.
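A minimal bagging sketch, assuming scikit-learn: fifty decision trees trained on bootstrap resamples and combined by voting are compared against a single tree on the same synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
# Bagging: each tree sees a bootstrap resample of the training data; predictions are voted.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean().round(3))
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean().round(3))
```

The bagged ensemble typically scores higher because averaging over resampled trees reduces the variance that makes a single tree unstable.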

There are many other techniques, which I will not list here, but the general trend is toward larger and larger ensembles. In the Netflix Prize, teams from all over the world competed to build the best video recommender system. As the competition progressed, teams found that they obtained the best results by combining their models with those of other teams, which encouraged teams to merge. Both the winner and the runner-up were ensembles of over 100 smaller models, and combining the two ensembles improved performance further. Doubtless we will see even larger ensembles in the future.

10. “Simple” does not mean “accurate”

Occam’s razor famously states that entities should not be multiplied beyond necessity. In machine learning, this is often taken to mean that, given two classifiers with the same training error, the simpler of the two will likely have the lower test error. Claims to this effect appear throughout the literature, but in fact there are many counterexamples that refute it, and the “no free lunch” theorems call its truth into question.

We already saw a counterexample above: model ensembles. The generalization error of a boosted ensemble continues to improve as classifiers are added, even after the training error has reached zero. So, contrary to intuition, there is no necessary connection between the number of parameters of a model and its tendency to overfit.

A more sophisticated view equates model complexity with the size of the hypothesis space, on the basis that smaller spaces allow hypotheses to be represented by shorter codes. Bounds like the ones in the section on theoretical guarantees above might then be read as saying that shorter hypothesis codes generalize better. This can be further refined by assigning shorter codes to the hypotheses in the space for which we have some a priori preference. But viewing this as proof of a tradeoff between accuracy and simplicity is circular reasoning: we made the hypotheses we prefer simpler by design, and if they are accurate it is because our preferences are accurate, not because the hypotheses are “simple” in the representation we chose.

11. “Representable” does not mean “learnable”

Essentially all representations used in variable-size learners have associated theorems of the form “every function can be represented, or approximated arbitrarily closely, using this representation.” Reassured by this, fans of a representation often proceed to ignore all others. However, just because a function can be represented does not mean it can be learned. For example, a standard decision tree learner cannot learn trees with more leaves than there are training examples. In continuous spaces, representing even very simple functions with a fixed set of primitives often requires an infinite number of components.

Further, if the evaluation function has many local optima in the hypothesis space (which is common), the learner may not find the best function even though it is representable. Given finite data, time, and memory, standard learners can learn only a tiny subset of all possible functions, and this subset varies with the representation chosen. So the key question is not whether a function is representable but whether it is learnable, and it pays to try different learners (and perhaps combine them).

12. “Correlation” is not “causation”

The point that correlation does not imply causation is made so often that it is perhaps not worth belaboring. However, learners of the kind we have been discussing can only learn correlations, yet their results are often treated as representing causal relations. Is that wrong? If so, why do people do it?

More often than not, the goal of learning predictive models is to use them as guides to action. If we find that people who buy diapers also tend to buy beer, then placing beer next to the diapers might boost sales, but short of actually running the experiment it is hard to tell. Machine learning is usually applied to observational data, where the predictive variables are not under the learner’s control, as opposed to experimental data, where they are. Some learning algorithms can potentially extract causal information from observational data, but their applicability is rather limited. On the other hand, correlation is a sign of a potential causal connection, and we can use it as a guide to further investigation.
