We can solve the curve fitting problem by choosing the value of $w$ for which $E(w)$ is as small as possible. Since the error function is a quadratic function of the coefficients $w$, its derivatives with respect to the coefficients are linear in the elements of $w$, and so the minimization of the error function has a unique solution, denoted $w^*$, which can be found in closed form. The resulting polynomial is given by the function $y(x, w^*)$.
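Since the argument turns on this closed-form solution, a minimal NumPy sketch may help make it concrete. The helper names `fit_polynomial` and `y`, the design-matrix construction, and the data arrays are illustrative assumptions, not part of the original text.

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Minimize the sum-of-squares error E(w) in closed form.

    Builds the design matrix with entries x_n^j for j = 0..M and
    solves the resulting linear least-squares problem for the unique
    minimizer w*. (Illustrative helper, not from the original text.)
    """
    A = np.vander(x, M + 1, increasing=True)  # columns: 1, x, x^2, ..., x^M
    w_star, *_ = np.linalg.lstsq(A, t, rcond=None)
    return w_star

def y(x, w):
    """Evaluate the fitted polynomial y(x, w) = sum_j w_j x^j."""
    return np.polynomial.polynomial.polyval(x, w)
```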

There remains the problem of choosing the order $M$ of the polynomial, which, as we will see, turns out to be an example of an important concept called model comparison, or model selection. In Figure 1.4, we show four examples of the results of fitting polynomials of order $M = 0, 1, 3, 9$ to the data set shown in Figure 1.2.

We note that the constant ($M = 0$) and first-order ($M = 1$) polynomials give rather poor fits to the data and consequently rather poor representations of the function $\sin(2\pi x)$. The third-order ($M = 3$) polynomial seems to give the best fit to $\sin(2\pi x)$ of the examples shown in Figure 1.4. When we use a much higher-order polynomial ($M = 9$), we obtain an excellent fit to the training data. In fact, the polynomial passes exactly through every data point, so that $E(w^*) = 0$. However, the fitted curve oscillates wildly and gives a very poor representation of the function $\sin(2\pi x)$. This latter behavior is known as overfitting.
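To reproduce this behavior qualitatively, one might generate synthetic data as in Figure 1.2 and fit polynomials of increasing order. The following sketch continues the one above, reusing the hypothetical `fit_polynomial` helper; the training-set size and noise level are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10  # assumed training-set size
x_train = np.linspace(0.0, 1.0, N)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=N)

for M in (0, 1, 3, 9):
    w_star = fit_polynomial(x_train, t_train, M)
    A = np.vander(x_train, M + 1, increasing=True)
    E = 0.5 * np.sum((A @ w_star - t_train) ** 2)  # sum-of-squares error
    print(f"M={M}: E(w*) = {E:.4f}")  # E(w*) drops to ~0 at M = 9
```

With $M = 9$ there are ten coefficients for ten data points, so the polynomial can interpolate every point exactly, which is why the training error vanishes even as the fit becomes wildly oscillatory between the points.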

As we mentioned earlier, our goal is to achieve good generalization by making accurate predictions for new data. We can obtain some quantitative insight into the dependence of generalization performance on $M$ by considering a separate test set consisting of 100 data points, generated using exactly the same procedure as the training set but with new choices of the random noise values included in the target values. For each choice of $M$, we can evaluate the residual value of $E(w^*)$ given by (1.2) for the training data, as well as $E(w^*)$ for the test data. It is sometimes more convenient to use the root-mean-square (RMS) error defined by


$$E_{RMS} = \sqrt{2E(w^*)/N} \tag{1.3}$$

Here, the division by $N$ allows us to compare data sets of different sizes on an equal footing, and the square root ensures that $E_{RMS}$ is measured on the same scale (and in the same units) as the target variable $t$. Figure 1.5 shows graphs of the root-mean-square error on the training and test sets for various values of $M$. The test set error is a measure of how well we can predict the value of $t$ for new observations of $x$. We note from Figure 1.5 that small values of $M$ give relatively large test set errors, which can be attributed to the fact that the corresponding polynomials are rather inflexible and cannot capture the oscillations of the function $\sin(2\pi x)$. Values of $M$ in the range $3 \leq M \leq 8$ give small test set errors, and these also give reasonable representations of the generating function $\sin(2\pi x)$, as can be seen from Figure 1.4 for the case $M = 3$.
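A sketch of how such a train/test comparison might be computed, continuing the previous sketches (it reuses `fit_polynomial`, `rng`, `x_train`, and `t_train`; the test-set generation details are assumptions matching the text's description of 100 points from the same procedure):

```python
import numpy as np

def rms_error(x, t, w):
    """Root-mean-square error E_RMS = sqrt(2 E(w) / N) from (1.3)."""
    M = len(w) - 1
    A = np.vander(x, M + 1, increasing=True)
    E = 0.5 * np.sum((A @ w - t) ** 2)  # sum-of-squares error (1.2)
    return np.sqrt(2 * E / len(x))

# Independent test set: 100 points generated by the same procedure,
# but with fresh random noise added to the target values.
x_test = rng.uniform(0.0, 1.0, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=100)

for M in range(10):
    w_star = fit_polynomial(x_train, t_train, M)
    print(f"M={M}: train RMS = {rms_error(x_train, t_train, w_star):.3f}, "
          f"test RMS = {rms_error(x_test, t_test, w_star):.3f}")
```

Under these assumptions, the training RMS error decreases monotonically with $M$, while the test RMS error reaches a minimum at intermediate $M$ and grows sharply near $M = 9$, mirroring the behavior plotted in Figure 1.5.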

Figure 1.4 Plots of polynomials of various orders $M$, shown as red curves, fitted to the data set of Figure 1.2.

Figure 1.5 Graphs of the root-mean-square error, defined by (1.3), evaluated on the training set and on an independent test set for various values of $M$.