This series reviews the course “Machine Learning Foundations” by Professor Hsuan-Tien Lin of the Department of Computer Science and Information Engineering, National Taiwan University. The focus is on organizing the material rather than taking detailed notes, so some details are left out.

The course consists of 16 lectures divided into 4 parts:

  1. When Can Machines Learn?
  2. Why Can Machines Learn?
  3. How Can Machines Learn?
  4. How Can Machines Learn Better?

This article is part 4, corresponding to lectures 13-16 of the original course.

Main contents of this part:

  • The overfitting problem: the relationship between overfitting, noise, and target-function complexity;
  • Regularization, and regularization from the viewpoint of VC theory;
  • Validation: leave-one-out cross-validation and V-fold cross-validation;
  • Three learning principles: Occam’s razor, sampling bias, and data snooping.

1. Overfitting problem

1.1 Occurrence of overfitting

Suppose 5 samples are generated from a degree-2 polynomial with very little noise. These 5 samples can be fitted perfectly by a degree-4 polynomial:

This makes $E_\text{in}=0$, but $E_\text{out}$ will be very large.

If $E_\text{in}$ is small while $E_\text{out}$ is large, we have bad generalization. If $E_\text{in}$ keeps decreasing while $E_\text{out}$ increases during training, we call it overfitting.
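
A minimal sketch of this effect (the degree-2 target coefficients and the noise level below are made-up illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# 5 noisy samples from a made-up degree-2 target f(x) = 1 + 2x - 3x^2
x = rng.uniform(-1, 1, size=5)
y = 1 + 2 * x - 3 * x**2 + rng.normal(scale=0.05, size=5)

# A degree-4 polynomial has 5 coefficients, so it interpolates
# the 5 points exactly: E_in is (numerically) 0
g4 = np.polynomial.Polynomial.fit(x, y, deg=4)

# On fresh data from the same target, the error is noticeably larger
x_new = rng.uniform(-1, 1, size=10000)
y_new = 1 + 2 * x_new - 3 * x_new**2
print("E_in :", np.mean((g4(x) - y) ** 2))          # ~ 0
print("E_out:", np.mean((g4(x_new) - y_new) ** 2))  # much larger
```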

Both noise and data size affect overfitting. Consider the following two data sets:

  • Data generated from a degree-10 polynomial, with some noise;
  • Data generated from a degree-50 polynomial, with no noise.

The two data sets are plotted below:

If we fit both data sets with a degree-2 and a degree-10 polynomial respectively, does overfitting occur in going from $g_2 \in \mathcal{H}_2$ to $g_{10} \in \mathcal{H}_{10}$?

The fitting results are as follows:

Comparing the fits, overfitting occurs on both data sets!

Looking at the learning curves: as $N\to \infty$, $\mathcal{H}_{10}$ clearly has the smaller $\overline{E_\text{out}}$, but its generalization error is large when $N$ is small. The gray area is where overfitting occurs.

In fact, for the data generated from the noiseless degree-50 polynomial, the “complexity of the target function” itself acts like noise.

Now let’s do a more detailed experiment. Generate $N$ data points using

$$y = f(x) + \epsilon \sim \text{Gaussian}\left(\sum_{q=0}^{Q_f} \alpha_q x^q,\ \sigma^2 \right)$$

where $\epsilon$ is independent and identically distributed Gaussian noise with noise level $\sigma^2$, and $f(x)$ is distributed uniformly with respect to its complexity level $Q_f$. In other words, the target function has two parameters: $Q_f$ and $\sigma^2$.

Fix $Q_f=20$ (varying $N$ and $\sigma^2$) and $\sigma^2=0.1$ (varying $N$ and $Q_f$) in turn, and use $E_\text{out}(g_{10}) - E_\text{out}(g_{2})$ to measure the degree of overfitting. The results are as follows:

The redder a region, the more severe the overfitting.
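
A rough sketch of one cell of this experiment; the uniform distribution over target coefficients here is a simplification of the course’s normalized target setup:

```python
import numpy as np

rng = np.random.default_rng(1)

def overfit_measure(N, Qf=20, sigma2=0.1, trials=100):
    """Average E_out(g10) - E_out(g2) over random targets; positive
    values mean the degree-10 fit overfits relative to the degree-2 fit."""
    diffs = []
    for _ in range(trials):
        coef = rng.uniform(-1, 1, size=Qf + 1)   # simplified target draw
        x = rng.uniform(-1, 1, size=N)
        y = (np.polynomial.polynomial.polyval(x, coef)
             + rng.normal(scale=np.sqrt(sigma2), size=N))

        g2 = np.polynomial.Polynomial.fit(x, y, deg=2)
        g10 = np.polynomial.Polynomial.fit(x, y, deg=10)

        # estimate E_out on a large fresh sample against the noiseless f
        x_out = rng.uniform(-1, 1, size=2000)
        f_out = np.polynomial.polynomial.polyval(x_out, coef)
        diffs.append(np.mean((g10(x_out) - f_out) ** 2)
                     - np.mean((g2(x_out) - f_out) ** 2))
    return np.mean(diffs)

print(overfit_measure(N=15))   # small N: clearly positive (overfitting)
print(overfit_measure(N=120))  # larger N: shrinks toward (or below) 0
```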

The added Gaussian noise of level $\sigma^2$ is called stochastic noise; the degree $Q_f$ of the target function has a similar effect, so it is called deterministic noise.

If $f\notin \mathcal{H}$, there must be parts of $f$ that cannot be captured by $\mathcal{H}$. The difference between the best $h^*\in\mathcal{H}$ and $f$ is the deterministic noise, which behaves much like random noise (similar to the output of a pseudo-random number generator). It differs from stochastic noise in that it depends on $\mathcal{H}$, and its value is fixed for each $\mathbf{x}$:

1.2 Dealing with overfitting

Generally speaking, there are the following ways to deal with overfitting:

  • Start from a simple model;
  • Data cleaning: correct erroneous data (e.g., wrong labels);
  • Data pruning: remove outliers;
  • Data hinting: when the sample size is insufficient, apply simple transformations to existing samples to enlarge the data set. For example, in digit classification the images can be slightly rotated or shifted without changing their labels (see the sketch after this list);
  • Regularization, see the next section;
  • Validation, see Section 3.
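
For data hinting, a minimal sketch, assuming 2-D digit images; the one-pixel shifts are an illustrative choice:

```python
import numpy as np

def hint_by_shifting(images, labels):
    """Enlarge a digit data set with one-pixel shifts; the label of a
    slightly shifted digit image stays the same. (Illustrative choice
    of transformations; small rotations could be added the same way.)"""
    out_x, out_y = [], []
    for img, lab in zip(images, labels):
        for dx, dy in [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]:
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            out_x.append(shifted)
            out_y.append(lab)
    return np.stack(out_x), np.array(out_y)
```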

2. Regularization

2.1 Regularization

The idea of regularization is like “stepping back” from $\mathcal{H}_{10}$ to $\mathcal{H}_{2}$. The name comes from early work on function approximation, where many problems were ill-posed: too many functions satisfied the problem, so restrictions had to be imposed. In a sense, overfitting in machine learning is also a problem of “too many candidate solutions”.

The general form of a hypothesis in $\mathcal{H}_{10}$ is


$$w_0+w_1 x+w_2 x^2+w_3 x^3+\cdots+w_{10} x^{10}$$

and the general form of a hypothesis in $\mathcal{H}_{2}$ is


$$w_0+w_1 x+w_2 x^2$$

In fact, as long as we impose the constraint $w_3 = w_4 = \cdots = w_{10} = 0$, we get $\mathcal{H}_{10} = \mathcal{H}_{2}$. Adding this constraint to $\mathcal{H}_{10}$ is just doing machine learning with $\mathcal{H}_2$.

$\mathcal{H}_2$ has limited flexibility, while $\mathcal{H}_{10}$ is dangerous; is there a compromise hypothesis set? We can relax the constraint to $\sum\limits_{q=0}^{10} \mathbf{1}_{[w_q \ne 0]} \le 3$. Denote the hypothesis set under this constraint by $\mathcal{H}_2'$; then $\mathcal{H}_{2}\subset \mathcal{H}_{2}' \subset \mathcal{H}_{10}$, i.e., it is more flexible than $\mathcal{H}_{2}$ but not as dangerous as $\mathcal{H}_{10}$.

Under $\mathcal{H}_{2}'$, the problem becomes


$$\min\limits_{\mathbf{w}\in \mathbb{R}^{10+1}} E_\text{in}(\mathbf{w})\quad \text{s.t. } \sum\limits_{q=0}^{10}\mathbf{1}_{[w_q\ne 0]}\le 3$$

This is an NP-hard problem and very hard to solve. Let’s relax it to


$$\min\limits_{\mathbf{w}\in \mathbb{R}^{10+1}} E_\text{in}(\mathbf{w})\quad \text{s.t. } \sum\limits_{q=0}^{10}w^2_q \le C$$

Denote this hypothesis set by $\mathcal{H}(C)$. It overlaps with $\mathcal{H}_2'$, and it is nested softly and smoothly in $C$:


$$\mathcal{H}(0) \subset \mathcal{H}(1) \subset \cdots \subset \mathcal{H}(\infty) =\mathcal{H}_{10}$$

The optimal solution found under $\mathcal{H}(C)$ is denoted $\mathbf{w}_\text{REG}$.

Without regularization, gradient descent updates the parameters in the direction $-\nabla E_\text{in}(\mathbf{w})$. Once the constraint $\mathbf{w}^T \mathbf{w}\le C$ is added, the updates must stay within this constraint, as shown below:

On the boundary $\mathbf{w}^T \mathbf{w}= C$, the normal vector is $\mathbf{w}$. As long as $-\nabla E_\text{in}(\mathbf{w})$ is not parallel to $\mathbf{w}$, it has a component along the boundary, so $E_\text{in}(\mathbf{w})$ can still be decreased without violating the constraint. Therefore, at the optimal solution we must have


$$-\nabla E_\text{in}(\mathbf{w}_\text{REG}) \propto \mathbf{w}_\text{REG}$$

Thus the problem is transformed into solving


$$\nabla E_\text{in}(\mathbf{w}_\text{REG}) +\dfrac{2 \lambda}{N} \mathbf{w}_\text{REG}=0$$

where $\lambda$ is the introduced Lagrange multiplier. Suppose $\lambda>0$ is given; for linear regression, expanding the gradient turns this into


$$\dfrac{2}{N}(X^T X\mathbf{w}_\text{REG}-X^T \mathbf{y})+\dfrac{2 \lambda}{N} \mathbf{w}_\text{REG}=0$$

which can be solved directly:


$$\mathbf{w}_\text{REG}\leftarrow (X^T X+\lambda I)^{-1} X^T\mathbf{y}$$

As long as $\lambda>0$, $X^T X+\lambda I$ is positive definite and therefore invertible.

In statistics, this is often called ridge regression.
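A minimal sketch of this closed form, assuming a data matrix `X` with one row per sample and the constant feature already included:

```python
import numpy as np

def ridge_solve(X, y, lam):
    """w_REG = (X^T X + lambda I)^{-1} X^T y. Uses np.linalg.solve
    instead of an explicit matrix inverse for numerical stability."""
    d = X.shape[1]
    # lam > 0 makes X^T X + lam * I positive definite, hence solvable
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```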

Looked at yet another way, solving


$$\nabla E_\text{in}(\mathbf{w}_\text{REG}) +\dfrac{2 \lambda}{N} \mathbf{w}_\text{REG}=0$$

is equivalent to solving


$$\min\limits_{\mathbf{w}} E_\text{in}(\mathbf{w})+\dfrac{\lambda}{N}\mathbf{w}^T\mathbf{w}$$

Here $\mathbf{w}^T\mathbf{w}$ is called the regularizer, and the whole expression $E_\text{in}(\mathbf{w})+\dfrac{\lambda}{N}\mathbf{w}^T\mathbf{w}$ is called the augmented error $E_\text{aug}(\mathbf{w})$.

Thus, a constrained minimization problem with a given $C$ is transformed into an unconstrained minimization problem with a given $\lambda$.

The term $+\dfrac{\lambda}{N}\mathbf{w}^T\mathbf{w}$ is called weight-decay regularization: the larger $\lambda$ is, the shorter $\mathbf{w}$ is forced to be, which is equivalent to making $C$ a little smaller.
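
To see where the name comes from, note that $\nabla E_\text{aug}(\mathbf{w}) = \nabla E_\text{in}(\mathbf{w}) + \frac{2\lambda}{N}\mathbf{w}$, so each gradient-descent step (with learning rate $\eta$) first multiplies $\mathbf{w}$ by $(1 - 2\eta\lambda/N)$, i.e., the weights literally decay. A sketch, with `grad_ein` standing in for whatever computes $\nabla E_\text{in}$:

```python
import numpy as np

def gd_weight_decay(grad_ein, w0, lam, N, eta=0.1, steps=1000):
    """Gradient descent on E_aug = E_in + (lambda/N) w^T w.
    Each step shrinks ("decays") w, then follows -grad E_in."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(steps):
        w = (1 - 2 * eta * lam / N) * w - eta * grad_ein(w)
    return w
```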

One small detail: if $\Phi(\mathbf{x})=(1,x,x^2,\ldots,x^Q)$ and $x_n\in[-1,+1]$, then $x^q_n$ is very small for large $q$, so a very large $w_q$ is needed for that feature to have any effect. Regularization will then “over-punish” exactly those high-degree coefficients that need to be large. The remedy is to use an orthonormal basis of polynomials, known as Legendre polynomials, and do the feature transform with them instead: $(1, L_1(x), L_2(x), \ldots, L_Q(x))$. The first five Legendre polynomials are shown below:
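
As a small illustration using NumPy’s Legendre utilities, the transform $(1, L_1(x), \ldots, L_Q(x))$ can be computed as:

```python
import numpy as np
from numpy.polynomial import legendre

x = np.linspace(-1, 1, 50)
Q = 5
# Column q of Z is L_q(x): the pseudo-Vandermonde matrix of the
# Legendre basis, i.e. the transform (1, L_1(x), ..., L_Q(x))
Z = legendre.legvander(x, Q)   # shape (50, Q + 1); column 0 is constant 1
```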

2.2 Regularization and VC theory

Although minimizing the augmented error corresponds to the constrained minimization problem, it does not literally restrict $\mathbf{w}$ to $\mathcal{H}(C)$. So in what sense does regularization still work?

The augmented error can be viewed in another way:


$$E_\text{aug}(\mathbf{w})=E_\text{in}(\mathbf{w})+\dfrac{\lambda}{N}\mathbf{w}^T\mathbf{w}$$

Writing $\mathbf{w}^T\mathbf{w}$ as $\Omega(\mathbf{w})$, it measures the complexity of a single hypothesis $\mathbf{w}$. Meanwhile, in the VC bound


$$E_\text{out}(\mathbf{w})\le E_\text{in}(\mathbf{w})+\Omega(\mathcal{H})$$

$\Omega(\mathcal{H})$ measures the complexity of the whole $\mathcal{H}$. If $\dfrac{\lambda}{N}\Omega(\mathbf{w})$ is somehow related to $\Omega(\mathcal{H})$, then $E_\text{aug}$ can serve directly as a proxy for $E_\text{out}$, without going through $E_\text{in}$ as the intermediary, while we still enjoy the high flexibility of the whole $\mathcal{H}$.

From yet another angle: the original $\mathcal{H}$ has $d_\text{VC}(\mathcal{H}) = \tilde{d} + 1$, whereas now we effectively consider only the hypotheses in $\mathcal{H}(C)$, so the VC dimension becomes $d_\text{VC}(\mathcal{H}(C))$. One can define an “effective VC dimension” $d_\text{EFF}(\mathcal{H},\mathcal{A})$: as long as the algorithm $\mathcal{A}$ performs regularization, the effective VC dimension is smaller.

2.3 More general regularizers

Are there more general regularizers $\Omega(\mathbf{w})$? How should one be chosen? Some suggestions:

  • Target-dependent: if some property of the target function is known in advance, encode it. For example, if the target is known to be close to an even function, choose $\sum \mathbf{1}_{[q \text{ is odd}]} w^2_q$;
  • Plausible: prefer regularizers that point toward smoother or simpler hypotheses, such as the sparsity-inducing L1 regularizer $\sum \vert w_q \vert$;
  • Friendly: easy to optimize, such as the L2 regularizer $\sum w_q^2$;
  • Even a badly chosen regularizer does little harm, since it can be tuned away with $\lambda$; at worst, $\lambda=0$ recovers the unregularized problem.

The L1 regularizer is illustrated below:

It is convex but not differentiable everywhere, and adding it tends to produce sparse solutions, so L1 is useful when a sparse solution is needed in practice.

How should $\lambda$ be chosen? As an example, the optimal $\lambda$ selected according to $E_\text{out}$ under different noise levels is shown below (the best $\lambda$ is in bold):

As can be seen from the figure, the more noise there is, the more regularization is needed.

But in general the noise is unknown, so how do we choose an appropriate $\lambda$?

3. Validation

3.1 Validation set

How to choose $\lambda$? We cannot know $E_\text{out}$, and selecting directly by $E_\text{in}$ is not viable either. It would be nice to have a test set that has never been used, so that the selection could be based on it:


$$m^*=\mathop{\arg\min}\limits_{1\le m\le M} \left( E_m=E_\text{test}(\mathcal{A}_m(\mathcal{D})) \right)$$

And there is a generalization guarantee:


$$E_\text{out}(g_{m^*})\le E_\text{test}(g_{m^*})+O\left(\sqrt{\dfrac{\log M}{N_\text{test}}}\right)$$

But where would a real test set come from? We can only set aside part of the data as a validation set $\mathcal{D}_\text{val}\subset \mathcal{D}$, and of course it too must never have been used by any $\mathcal{A}_m$ before.

$\mathcal{D}_\text{val}$ is split off as follows:

The $g^-_m$ obtained from the training set also comes with a generalization guarantee:


$$E_\text{out}(g_m^-)\le E_\text{val}(g_m^-)+O\left(\sqrt{\dfrac{\log M}{K}}\right)$$

The general process of validation is as follows:

As can be seen, after the validation set is used to select the best model $g^-_{m^*}$, all of the data should be used to retrain the best model $g_{m^*}$. Generally speaking, the $g_{m^*}$ obtained from this retraining has lower $E_\text{out}$ thanks to the larger amount of training data, as shown in the figure below:

The bottom dotted curve in the figure is $E_\text{out}$. As can be seen, $K$ can be neither too large nor too small. If $K$ is too small, then although $g_m^- \approx g_m$, $E_\text{val}$ and $E_\text{out}$ may differ greatly; if $K$ is too large, then although $E_\text{val}\approx E_\text{out}$, $g_m^-$ becomes much worse than $g_m$.

What we really want to do is


$$E_\text{out}(g)\approx E_\text{out}(g^-)\approx E_\text{val}(g^-)$$

The first approximation requires $K$ to be small, while the second requires $K$ to be large, so an appropriate $K$ must be chosen; by rule of thumb, $K=\dfrac{N}{5}$.
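
Putting this section together, a sketch of the whole procedure; the `fit`/`error` interface on the candidate models is an assumption for illustration:

```python
import numpy as np

def select_and_retrain(models, X, y, seed=0):
    """Hold out K = N/5 points for validation, pick the model with the
    smallest E_val, then retrain the winner on all N points."""
    N = len(y)
    K = N // 5                                   # rule of thumb
    idx = np.random.default_rng(seed).permutation(N)
    val, tr = idx[:K], idx[K:]

    e_val = []
    for m in models:
        m.fit(X[tr], y[tr])                      # g_m^- on N - K points
        e_val.append(m.error(X[val], y[val]))    # E_val(g_m^-)

    best = models[int(np.argmin(e_val))]
    best.fit(X, y)                               # g_{m*} on all N points
    return best
```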

3.2 Leave-One-Out Cross-Validation (LOOCV)

If $K=1$, i.e., only one of the $N$ samples is left out as the validation set, write


$$E_\text{val}^{(n)}(g_n^-)=\text{err}(g_n^-(\mathbf{x}_n),y_n)=e_n$$

But a single $e_n$ does not give accurate information, so we average over all possible $E_\text{val}^{(n)}(g_n^-)$. This is leave-one-out cross-validation:


$$E_\text{loocv}(\mathcal{H},\mathcal{A})=\dfrac{1}{N}\sum\limits_{n=1}^{N} e_n=\dfrac{1}{N} \sum\limits_{n=1}^{N} \text{err}(g_n^- (\mathbf{x}_n),y_n)$$

We hope that $E_\text{loocv}(\mathcal{H},\mathcal{A})\approx E_\text{out}(g)$. It can be shown that:


$$\begin{aligned} \mathop{\mathcal{E}}\limits_{\mathcal{D}} E_\text{loocv}(\mathcal{H},\mathcal{A}) &= \mathop{\mathcal{E}}\limits_{\mathcal{D}}\dfrac{1}{N}\sum\limits_{n=1}^{N} e_n =\dfrac{1}{N} \sum\limits_{n=1}^{N} \mathop{\mathcal{E}}\limits_{\mathcal{D}} e_n\\ &=\dfrac{1}{N} \sum\limits_{n=1}^{N} \mathop{\mathcal{E}}\limits_{\mathcal{D}_n} \mathop{\mathcal{E}}\limits_{(\mathbf{x}_n,y_n)} \text{err}(g_n^-(\mathbf{x}_n),y_n)\\ &=\dfrac{1}{N} \sum\limits_{n=1}^{N} \mathop{\mathcal{E}}\limits_{\mathcal{D}_n} E_\text{out}(g_n^-)\\ &=\dfrac{1}{N} \sum\limits_{n=1}^{N} \overline{E_\text{out}}(N-1)\\ &= \overline{E_\text{out}}(N-1) \end{aligned}$$

Since the expectation of $E_\text{loocv}(\mathcal{H},\mathcal{A})$ tells us about the expectation of $E_\text{out}(g^-)$, it is also called an almost unbiased estimate of $E_\text{out}(g)$.
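
A direct sketch of the LOOCV definition, under the same hypothetical `fit`/`error` model interface as before:

```python
import numpy as np

def loocv_error(make_model, X, y):
    """E_loocv: for each n, train on the other N-1 points, measure the
    error e_n on the held-out point, then average the e_n."""
    N = len(y)
    errs = []
    for n in range(N):
        mask = np.arange(N) != n
        g_minus = make_model()
        g_minus.fit(X[mask], y[mask])
        errs.append(g_minus.error(X[n:n + 1], y[n:n + 1]))  # e_n
    return float(np.mean(errs))
```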

Take handwritten digit recognition (classifying whether a digit is a 1 or not) as an example. The two basic features are symmetry and average intensity, and feature transforms are applied to increase the number of features. Using $E_\text{in}$ versus $E_\text{loocv}$ to select the parameter (the number of features after the transform) gives the following results:

How $E_\text{out}$, $E_\text{in}$, and $E_\text{loocv}$ change with the number of features is shown in the figure:

3.3 V-Fold Cross-Validation

If there are 1000 data points, one run of leave-one-out cross-validation requires computing $e_n$ 1000 times, each time training on 999 samples. Except for a few algorithms (such as linear regression, which has an analytic solution), this is very time-consuming in most cases. On the other hand, because $E_\text{loocv}$ averages over single points, its value jumps around and is not stable. Therefore, LOOCV is not used very often in practice.

What is used more commonly in practice is $V$-fold cross-validation: $\mathcal{D}$ is randomly divided into $V$ equal parts; each part is used for validation in turn, while the remaining $V-1$ parts are used for training. In practice, $V=10$ is typical, as follows:

We can then compute


$$E_\text{cv}(\mathcal{H}, \mathcal{A})=\dfrac{1}{V}\sum\limits_{v=1}^{V} E_\text{val}^{(v)}(g_v^-)$$

Then use it to select parameters:


$$m^*=\mathop{\arg\min}\limits_{1\le m\le M} \left( E_m=E_\text{cv}(\mathcal{H}_m, \mathcal{A}_m) \right)$$
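
A sketch of $E_\text{cv}$ under the same hypothetical model interface:

```python
import numpy as np

def vfold_cv_error(make_model, X, y, V=10, seed=0):
    """E_cv: split D into V parts, validate on each part in turn while
    training on the other V-1, and average the V validation errors."""
    N = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(N), V)
    errs = []
    for v in range(V):
        val = folds[v]
        tr = np.concatenate([folds[u] for u in range(V) if u != v])
        g_minus = make_model()
        g_minus.fit(X[tr], y[tr])
        errs.append(g_minus.error(X[val], y[val]))
    return float(np.mean(errs))

# m* = the candidate (H_m, A_m) with the smallest vfold_cv_error
```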

It is worth noting that since the validation process itself makes choices, its results are still somewhat more optimistic than the final test results. Therefore, what matters in the end is the test result, not the best validation result found.

4. Three principles of learning

Here are three principles of learning.

4.1 Occam’s Razor

The first is Occam’s Razor.

An explanation of the data should be made as simple as possible, but no simpler.

–Albert Einstein (?)

Einstein is said to have said this, though there is no evidence for it. The idea dates back to William of Occam:

entia non sunt multiplicanda praeter necessitatem (entities must not be multiplied beyond necessity)

–William of Occam (1287-1347)

In machine learning, this means that the simplest models that fit the data tend to make the most sense.

What is a simple model? For a single hypothesis $h$, we want $\Omega(h)$ to be small, i.e., few parameters; for a model (hypothesis set) $\mathcal{H}$, we want $\Omega(\mathcal{H})$ to be small, i.e., it should not contain many possible hypotheses. The two are related: for example, if $\vert \mathcal{H} \vert$ has size $2^\ell$, then any single $h$ in it can be described with just $\ell$ bits, so a small $\Omega(\mathcal{H})$ implies a small $\Omega(h)$.

Philosophically speaking, the simpler the model, the less likely it is to “fit” a data set at all; so if a good fit does occur, there are more likely to be real patterns in the data.

4.2 Sampling bias

The second is to pay attention to Sampling Bias.

If the sampling process is biased, machine learning will produce a biased result.

When we discussed the VC dimension, one of the prerequisites was that the training data and the test data come from the same distribution. When this cannot be guaranteed, the rule of thumb is to make the test environment and the training environment match as closely as possible.

4.3 Data Snooping

Thirdly, pay attention to Data Snooping.

If you look at the data first, notice that it seems to fit a certain model well, and then choose that model, that is dangerous: your brain has quietly added complexity to the model.

Any process that uses the data is, in effect, indirectly snooping on it. Any decision made after seeing how the data behaves introduces “brain” complexity.

For example, when scaling features, you must not compute the scaling from the training set and the test set together; the scaling can only be computed from the training set.
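
A minimal sketch of the right way round:

```python
import numpy as np

def fit_scaler(X_train):
    """Compute scaling statistics from the training set ONLY."""
    return X_train.mean(axis=0), X_train.std(axis=0)

def apply_scaler(X, mu, sd):
    """Reuse the training-set statistics on any later data (e.g. the
    test set). Computing mu/sd on train and test together would leak
    test-set information into learning -- that is data snooping."""
    return (X - mu) / sd
```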

In fact, something similar happens at the frontier of machine learning research. For example, the first paper finds that $\mathcal{H}_1$ performs well on $\mathcal{D}$; the second paper proposes $\mathcal{H}_2$ and shows that it beats $\mathcal{H}_1$ on $\mathcal{D}$ (it would not have been published otherwise); and so on for the third... If you view all the papers together as one final paper, the true VC dimension is $d_\text{VC}(\cup_m \mathcal{H}_m)$, which is very large, so generalization is very poor. This is because at every step, the authors snooped on the data through the previous literature.

So when doing machine learning, you have to handle the data carefully and avoid using the data itself to make decisions: it is better to add domain knowledge to the model in advance than to add features after peeking at the data. And always stay skeptical, whether in practice, when reading papers, or toward your own results.