1. General regression problem

Econometrics textbooks normally start with linear regression, but let's look at the more general regression problem first.

Let’s define what regression is:

Definition 1 (Regression Function): $\mathbb{E}(y|\mathbf{x})$ is called the regression function of $y$ on $\mathbf{x}$.

Next, define a metric that measures how well a prediction performs:

Definition 2 (Mean Squared Error, MSE): Suppose $y$ is predicted by $g(\mathbf{x})$. The mean squared error of $g(\mathbf{x})$ is $\text{MSE}(g) = \mathbb{E}[y - g(\mathbf{x})]^2$.

What is the best form of the prediction function? The following theorem shows that the best prediction function is exactly the regression function, i.e. the conditional expectation.

Theorem 1 (MSE Optimal Solution): $\mathbb{E}(y|\mathbf{x})$ is the solution of the following optimization problem:

$$\mathbb{E}(y|\mathbf{x}) = \arg\min_{g\in\mathbb{F}} \text{MSE}(g) = \arg\min_{g\in\mathbb{F}} \mathbb{E}[y-g(\mathbf{x})]^2$$

where $\mathbb{F}$ is the space of all measurable and square-integrable functions:

$$\mathbb{F} = \left\{ g: \mathbb{R}^{k+1} \to \mathbb{R} \;\Big|\; \int g^2(\mathbf{x}) f_X(\mathbf{x})\, d\mathbf{x} < \infty \right\}$$

Solving this minimization problem directly is complicated, since it requires the calculus of variations. It is easier to prove the theorem by construction, decomposing $\text{MSE}(g)$ directly. Let $g_0(\mathbf{x}) \equiv \mathbb{E}(y|\mathbf{x})$; then


$$\begin{aligned} \text{MSE}(g) &= \mathbb{E}[y-g_0(\mathbf{x})+g_0(\mathbf{x})-g(\mathbf{x})]^2\\ &= \mathbb{E}[y-g_0(\mathbf{x})]^2+\mathbb{E}[g_0(\mathbf{x})-g(\mathbf{x})]^2+2\,\mathbb{E}\left[\left(y-g_0(\mathbf{x})\right)\left(g_0(\mathbf{x})-g(\mathbf{x})\right)\right]\\ &= \mathbb{E}[y-g_0(\mathbf{x})]^2+\mathbb{E}[g_0(\mathbf{x})-g(\mathbf{x})]^2 \end{aligned}$$

The cross term vanishes by the law of iterated expectations, since $\mathbb{E}[y - g_0(\mathbf{x}) \mid \mathbf{x}] = 0$. The first term does not depend on $g$, so $\text{MSE}(g)$ is minimized if and only if the second term is $0$, i.e. $g(\mathbf{x}) = g_0(\mathbf{x})$.
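Theorem 1 can be illustrated with a small simulation (a minimal sketch; the DGP $y = \sin(x) + \text{noise}$ and the alternative predictors are arbitrary choices, not from the notes): the conditional mean achieves a lower MSE than other predictors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Illustrative DGP (assumed for this sketch): y = sin(x) + noise,
# so the regression function is g0(x) = E(y|x) = sin(x).
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + rng.normal(scale=0.5, size=n)

def mse(pred):
    """Sample analogue of MSE(g) = E[y - g(x)]^2."""
    return np.mean((y - pred) ** 2)

print("MSE of g0(x) = sin(x):        ", mse(np.sin(x)))              # ~ 0.25, the noise variance
print("MSE of an arbitrary line 0.3x:", mse(0.3 * x))                # larger
print("MSE of the constant mean of y:", mse(np.full(n, y.mean())))   # larger still
```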

Let’s look at another theorem about the perturbation term in regression:

Theorem 2 (Regression Identity): Given $\mathbb{E}(y|\mathbf{x})$, we can always write $y = \mathbb{E}(y|\mathbf{x}) + \varepsilon$, where $\varepsilon$ is called the regression disturbance and satisfies $\mathbb{E}(\varepsilon|\mathbf{x}) = 0$. This holds by construction: define $\varepsilon \equiv y - \mathbb{E}(y|\mathbf{x})$; then $\mathbb{E}(\varepsilon|\mathbf{x}) = \mathbb{E}(y|\mathbf{x}) - \mathbb{E}(y|\mathbf{x}) = 0$.

The next question is: how do we model the optimal solution $g_0(\mathbf{x})$? The simplest way to approximate it is to use a linear function.

2. Linear regression

First, the concept of an affine function is introduced:

Definition 3 (Affine Functions): Write $\mathbf{x} = (1, x_1, \ldots, x_k)'$ and $\beta = (\beta_0, \beta_1, \ldots, \beta_k)'$. The set of affine functions is defined as

$$\mathbb{A} = \left\{ g: \mathbb{R}^{k+1} \to \mathbb{R} \;\Big|\; g(\mathbf{x}) = \mathbf{x}'\beta \right\}$$

After restricting the candidate functions $g(\mathbf{x})$ from the set of all measurable and square-integrable functions to the set of affine functions, the problem becomes finding the optimal parameter $\beta^*$ that minimizes the MSE; $\beta^*$ is called the optimal least squares approximation coefficient.

Theorem 3 (Best Linear Least Squares Prediction): Suppose $\mathbb{E}(y^2) < \infty$ and the matrix $\mathbb{E}(\mathbf{x}\mathbf{x}')$ is nonsingular. Then the optimization problem

$$\min_{g\in\mathbb{A}} \mathbb{E}[y-g(\mathbf{x})]^2 = \min_{\beta\in\mathbb{R}^{k+1}} \mathbb{E}(y-\mathbf{x}'\beta)^2$$

has a solution, and the best linear least squares predictor is $g^*(\mathbf{x}) = \mathbf{x}'\beta^*$ with

$$\beta^* = [\mathbb{E}(\mathbf{x}\mathbf{x}')]^{-1}\,\mathbb{E}(\mathbf{x}y)$$

The proof is straightforward: solve the first-order condition $\left.\dfrac{d\,\mathbb{E}(y-\mathbf{x}'\beta)^2}{d\beta}\right|_{\beta=\beta^*} = 0$. The second-order condition holds because the Hessian matrix $\dfrac{d^2\,\mathbb{E}(y-\mathbf{x}'\beta)^2}{d\beta\, d\beta'} = 2\,\mathbb{E}(\mathbf{x}\mathbf{x}')$ is positive definite when $\mathbb{E}(\mathbf{x}\mathbf{x}')$ is nonsingular.
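Writing out the quadratic expansion makes the first-order condition explicit:

$$\begin{aligned} \mathbb{E}(y-\mathbf{x}'\beta)^2 &= \mathbb{E}(y^2) - 2\beta'\,\mathbb{E}(\mathbf{x}y) + \beta'\,\mathbb{E}(\mathbf{x}\mathbf{x}')\,\beta \\ \frac{d\,\mathbb{E}(y-\mathbf{x}'\beta)^2}{d\beta} &= -2\,\mathbb{E}(\mathbf{x}y) + 2\,\mathbb{E}(\mathbf{x}\mathbf{x}')\,\beta = 0 \\ \Rightarrow\quad \beta^* &= [\mathbb{E}(\mathbf{x}\mathbf{x}')]^{-1}\,\mathbb{E}(\mathbf{x}y) \end{aligned}$$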

The linear regression model is formally defined below:

Definition 4 (Linear Regression Model): $y = \mathbf{x}'\beta + u$, $\beta \in \mathbb{R}^{k+1}$, where $u$ is the regression model error.

So, what is the relationship between linear regression models and optimal linear least squares predictions?

Theorem 4: Suppose the conditions of Theorem 3 hold and $y = \mathbf{x}'\beta + u$. Let $\beta^* = [\mathbb{E}(\mathbf{x}\mathbf{x}')]^{-1}\,\mathbb{E}(\mathbf{x}y)$ be the optimal linear least squares approximation coefficient. Then $\beta = \beta^*$ if and only if $\mathbb{E}(\mathbf{x}u) = 0$.

The proof of this theorem is simple; it only requires showing necessity and sufficiency separately.

The theorem implies that the parameter vector of the linear regression model equals the optimal linear least squares approximation coefficient $\beta^*$ as long as the orthogonality condition $\mathbb{E}(\mathbf{x}u) = 0$ is satisfied.
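Theorem 4 can be checked numerically (a minimal sketch; the DGP, sample size, and the alternative coefficient vector below are arbitrary illustrative choices): the residual implied by $\beta^*$ is orthogonal to $\mathbf{x}$ in sample, while the residual implied by another $\beta$ is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed DGP for this sketch: y linear in x = (1, x1)' plus noise.
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
y = 2.0 - 1.0 * x1 + rng.normal(size=n)

# beta* from the sample analogues of E(xx') and E(xy)
beta_star = np.linalg.solve(X.T @ X / n, X.T @ y / n)

# beta = beta*  =>  sample analogue of E(xu) is ~ 0
u_star = y - X @ beta_star
print("E(xu) at beta*:", X.T @ u_star / n)

# beta != beta*  =>  E(xu) != 0
beta_other = beta_star + np.array([0.0, 0.3])
u_other = y - X @ beta_other
print("E(xu) at another beta:", X.T @ u_other / n)   # second component ~ -0.3 * E(x1^2) != 0
```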

3. Correct model specification

When is a linear regression model correctly specified for the conditional mean?

Definition 5 (Correct Model Specification in Conditional Mean): The linear regression model $y = \mathbf{x}'\beta + u$, $\beta \in \mathbb{R}^{k+1}$, is correctly specified for the conditional mean $\mathbb{E}(y|\mathbf{x})$ if there exists some parameter $\beta^o \in \mathbb{R}^{k+1}$ such that $\mathbb{E}(y|\mathbf{x}) = \mathbf{x}'\beta^o$. Conversely, if $\mathbb{E}(y|\mathbf{x}) \neq \mathbf{x}'\beta$ for every $\beta \in \mathbb{R}^{k+1}$, the linear regression model is misspecified for $\mathbb{E}(y|\mathbf{x})$.

From this definition, correct specification of the linear regression model means there exists a parameter $\beta^o$ such that $\mathbb{E}(u|\mathbf{x}) = 0$. In other words, a necessary and sufficient condition for correct specification is $\mathbb{E}(u|\mathbf{x}) = 0$, where $u = y - \mathbf{x}'\beta^o$.

The following theorem describes the relationship between the regression model error term $u$ and the true regression disturbance $\varepsilon$ when the conditional mean model is correctly specified:

Theorem 5: If the linear regression model $y = \mathbf{x}'\beta + u$ is correctly specified for the conditional mean $\mathbb{E}(y|\mathbf{x})$, then (1) there exist a parameter $\beta^o$ and a random variable $\varepsilon$ such that $y = \mathbf{x}'\beta^o + \varepsilon$ with $\mathbb{E}(\varepsilon|\mathbf{x}) = 0$; and (2) $\beta^* = \beta^o$.

Part (1) follows directly from Definition 5. For (2), the condition $\mathbb{E}(\varepsilon|\mathbf{x}) = 0$ in (1) implies $\mathbb{E}(\mathbf{x}\varepsilon) = \mathbb{E}[\mathbf{x}\,\mathbb{E}(\varepsilon|\mathbf{x})] = 0$ by the law of iterated expectations, and Theorem 4 then gives $\beta^o = \beta^*$.

To make this concrete, here is an example of correct versus incorrect model specification:

Suppose the data generating process (DGP) is $y = 1 + \dfrac{1}{2}x_1 + \dfrac{1}{4}(x_1^2 - 1) + \varepsilon$, where $x_1$ and $\varepsilon$ are independent $\mathcal{N}(0,1)$ random variables. Now approximate this DGP with the linear regression model $y = \mathbf{x}'\beta + u$, where $\mathbf{x} = (1, x_1)'$.

After some calculation, the optimal linear least squares approximation coefficient is $\beta^* = (1, \dfrac{1}{2})'$, so $g^*(\mathbf{x}) = 1 + \dfrac{1}{2}x_1$, which contains no nonlinear part. If we set $\beta = \beta^*$ in the regression model, then by Theorem 4 we have $\mathbb{E}(\mathbf{x}u) = 0$; however, $\mathbb{E}(u|\mathbf{x}) = \dfrac{1}{4}(x_1^2 - 1) \neq 0$, so the model is not correctly specified.
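This example can be verified by simulation (a minimal sketch; the sample size and seed are arbitrary): the sample $\beta^*$ is close to $(1, \frac{1}{2})'$, the residuals are orthogonal to $\mathbf{x}$, yet their conditional mean given $x_1$ is clearly nonzero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# DGP from the example: y = 1 + x1/2 + (x1^2 - 1)/4 + eps, with x1, eps ~ N(0,1) independent.
x1 = rng.normal(size=n)
eps = rng.normal(size=n)
y = 1 + 0.5 * x1 + 0.25 * (x1 ** 2 - 1) + eps

# Linear model with x = (1, x1)': beta* = [E(xx')]^{-1} E(xy) via sample moments.
X = np.column_stack([np.ones(n), x1])
beta_star = np.linalg.solve(X.T @ X / n, X.T @ y / n)
print("beta*:", beta_star)                    # approximately (1, 0.5)

# The model error u = y - x'beta* is orthogonal to x ...
u = y - X @ beta_star
print("sample E(xu):", X.T @ u / n)           # ~ 0

# ... but its conditional mean given x1 is not zero: E(u|x1) = (x1^2 - 1)/4.
mask = np.abs(x1) > 2
print("mean of u where |x1| > 2:", u[mask].mean())   # clearly positive, so E(u|x1) != 0
```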

What happens when the model is misspecified? In this example, the true expected marginal effect is $\dfrac{d\,\mathbb{E}(y|\mathbf{x})}{dx_1} = \dfrac{1}{2} + \dfrac{1}{2}x_1$, which does not equal $\beta^*_1 = \dfrac{1}{2}$. In other words, misspecification means the optimal linear least squares approximation does not recover the true expected marginal effect.

References

  • Hong Yongmiao, Advanced Econometrics, 2011.