Original link: http://tecdat.cn/?p=23170

Introduction

This article addresses two questions: how should you add dummy variables to a regression, and how should you interpret the results?

These questions are easier to work through with an example.

Data

Suppose we want to study how salary is determined by education, experience, and whether someone holds a managerial position. Assume that:

  1. Everyone starts at $40,000 a year.
  2. True knowledge comes from practice. Each additional year of experience raises salary by $5,000.
  3. The more you learn, the more you earn. A high school, college, or PhD education adds $0, $10,000, and $20,000 to the annual salary, respectively.
  4. Anyone can steer when the sea is calm. A managerial position pays $20,000 more.
  5. Some are born to be great leaders. Those with only a high school education who hold a managerial position receive an extra $30,000.
  6. A random factor also affects salary, with mean 0 and standard deviation $5,000.

A sample of the data and a summary follow.
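A minimal sketch of how such a data set could be simulated in base R under the six rules above. The sample size, the seed, the 0 to 10 year range for experience, the equal probabilities for each education level and for managerial status, and the column names `edu`, `exp`, `mngt`, and `salary` are all assumptions made for illustration; the original does not state them.

```r
# Simulate salaries from the six stated rules.
# Assumed (not from the original): n, seed, experience range, equal group odds.
set.seed(42)
n <- 1000

d <- data.frame(
  # High school is listed first so that it becomes the baseline level.
  edu  = factor(sample(c("high school", "college", "PhD"), n, replace = TRUE),
                levels = c("high school", "college", "PhD")),
  exp  = sample(0:10, n, replace = TRUE),  # years of experience
  mngt = sample(0:1, n, replace = TRUE)    # 1 = managerial position
)

d$salary <- 40000 +                                   # rule 1: base salary
  5000  * d$exp +                                     # rule 2: experience
  10000 * (d$edu == "college") +                      # rule 3: college
  20000 * (d$edu == "PhD") +                          # rule 3: PhD
  20000 * d$mngt +                                    # rule 4: manager
  30000 * (d$mngt == 1 & d$edu == "high school") +    # rule 5: born leader
  rnorm(n, mean = 0, sd = 5000)                       # rule 6: noise

head(d)      # first few rows
summary(d)   # per-column summary
```

Making high school the first factor level matters later: R uses the first level as the reference category when it builds the dummy variables.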

Plot the data

Salary by education level, for people with and without managerial positions:

ggplot(d, aes(x = edu, y = salary)) + geom_jitter(alpha = 0.25, color = colpla[4]) + facet_wrap(~ mngt) + geom_boxplot(color = colpla[2])

Salary against experience by education level, for people with and without managerial positions:

ggplot(d, aes(x = exp, y = salary, color = edu)) + stat_smooth(method = "lm") + facet_wrap(~ mngt)

Regression analysis

Ignoring the interaction between education and management

We first regress salary on education, experience, and managerial position only. The result:

Although these coefficients are statistically significant, they do not make sense. How can a college degree cut your salary by 5,105 compared to high school?
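A sketch of how this misleading fit can arise, using simulated data rather than the original data set. The simulation settings and column names (`edu`, `exp`, `mngt`, `salary`) are assumptions; only the six salary rules come from the text. With roughly balanced groups, leaving out the interaction pushes the high school managers' extra $30,000 into the high school baseline, which drags the college coefficient below zero.

```r
# Assumed simulation settings (n, seed, ranges); salary rules are from the text.
set.seed(42)
n <- 1000
d <- data.frame(
  edu  = factor(sample(c("high school", "college", "PhD"), n, replace = TRUE),
                levels = c("high school", "college", "PhD")),
  exp  = sample(0:10, n, replace = TRUE),
  mngt = sample(0:1, n, replace = TRUE)
)
d$salary <- 40000 + 5000 * d$exp + 10000 * (d$edu == "college") +
  20000 * (d$edu == "PhD") + 20000 * d$mngt +
  30000 * (d$mngt == 1 & d$edu == "high school") + rnorm(n, sd = 5000)

# Main effects only: no education-by-management interaction.
fit_additive <- lm(salary ~ edu + exp + mngt, data = d)
summary(fit_additive)

# The college dummy comes out negative even though college adds
# 10,000 in the data-generating rules.
coef(fit_additive)["educollege"]
```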

The correct model should include interaction terms between education and managerial position.

Adding the interaction between education and management

Now, let’s add the interaction between education and management and see what happens.
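A minimal sketch of the interaction model on simulated data. As before, the simulation settings and column names are assumptions. In an R formula, `edu * mngt` expands to the two main effects plus their interaction `edu:mngt`.

```r
# Assumed simulation settings (n, seed, ranges); salary rules are from the text.
set.seed(42)
n <- 1000
d <- data.frame(
  edu  = factor(sample(c("high school", "college", "PhD"), n, replace = TRUE),
                levels = c("high school", "college", "PhD")),
  exp  = sample(0:10, n, replace = TRUE),
  mngt = sample(0:1, n, replace = TRUE)
)
d$salary <- 40000 + 5000 * d$exp + 10000 * (d$edu == "college") +
  20000 * (d$edu == "PhD") + 20000 * d$mngt +
  30000 * (d$mngt == 1 & d$edu == "high school") + rnorm(n, sd = 5000)

# edu * mngt is shorthand for edu + mngt + edu:mngt.
fit_inter <- lm(salary ~ edu * mngt + exp, data = d)
summary(fit_inter)

# Coefficients rounded to whole units for easier reading.
round(coef(fit_inter))
```

Because high school is the reference level, the `mngt` coefficient is the managerial premium for high school graduates, and the `educollege:mngt` and `eduPhD:mngt` terms measure how that premium changes for the other two groups.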

Interpretation of the results

The results now make sense.

  • The intercept of $40,137 (close to $40,000) is the base salary everyone starts with.
  • High school is the baseline education level. Relative to high school, a college education raises the average salary by $9,833 (nearly $10,000), and a PhD raises it by $19,895 (nearly $20,000).
  • Each additional year of work experience raises salary by $4,983 (nearly $5,000).
  • High school graduates in managerial positions earn a premium of $49,695 (nearly $50,000). These people are born leaders.
  • For college graduates, the managerial premium is 29,965.51 lower than for high school graduates: 49,735.74 - 29,965.51 = 19,770.23, nearly 20,000.
  • For PhD graduates, the managerial premium falls by 29,501 relative to high school graduates, to 19,952.87 (nearly 20,000). Put differently, a managerial position carries a base premium of $20,000 regardless of education level; a high school graduate receives another $30,000 on top of that, for a total premium of $50,000.
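The arithmetic in the last three bullets can be sketched in R: the managerial premium for each education level is the `mngt` coefficient plus the matching interaction coefficient. The data here are simulated under assumed settings (sample size, seed, column names), so the numbers will be close to, but not identical with, those quoted above.

```r
# Assumed simulation settings (n, seed, ranges); salary rules are from the text.
set.seed(42)
n <- 1000
d <- data.frame(
  edu  = factor(sample(c("high school", "college", "PhD"), n, replace = TRUE),
                levels = c("high school", "college", "PhD")),
  exp  = sample(0:10, n, replace = TRUE),
  mngt = sample(0:1, n, replace = TRUE)
)
d$salary <- 40000 + 5000 * d$exp + 10000 * (d$edu == "college") +
  20000 * (d$edu == "PhD") + 20000 * d$mngt +
  30000 * (d$mngt == 1 & d$edu == "high school") + rnorm(n, sd = 5000)

b <- coef(lm(salary ~ edu * mngt + exp, data = d))

# Premium for holding a managerial position, by education level.
premium <- c(
  "high school" = unname(b["mngt"]),                         # baseline group
  "college"     = unname(b["mngt"] + b["educollege:mngt"]),  # baseline + shift
  "PhD"         = unname(b["mngt"] + b["eduPhD:mngt"])
)
round(premium)
```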

Test whether the assumptions of the model are violated

For the model to be valid, several assumptions must hold.

  • The errors should follow a normal distribution

The normal Q-Q plot of the residuals is approximately linear, so this assumption is satisfied.
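A sketch of this check in base R, on simulated data with assumed settings and column names. Besides drawing the plot, it computes the correlation between the theoretical and sample quantiles as a rough numeric summary of how straight the Q-Q line is; that summary is an addition for illustration and is not part of the original analysis.

```r
# Assumed simulation settings (n, seed, ranges); salary rules are from the text.
set.seed(42)
n <- 1000
d <- data.frame(
  edu  = factor(sample(c("high school", "college", "PhD"), n, replace = TRUE),
                levels = c("high school", "college", "PhD")),
  exp  = sample(0:10, n, replace = TRUE),
  mngt = sample(0:1, n, replace = TRUE)
)
d$salary <- 40000 + 5000 * d$exp + 10000 * (d$edu == "college") +
  20000 * (d$edu == "PhD") + 20000 * d$mngt +
  30000 * (d$mngt == 1 & d$edu == "high school") + rnorm(n, sd = 5000)

res <- resid(lm(salary ~ edu * mngt + exp, data = d))

# Q-Q plot of the residuals with a reference line.
qq <- qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)

# Values very close to 1 indicate a nearly straight Q-Q plot.
qq_cor <- cor(qq$x, qq$y)
qq_cor
```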

  • No autocorrelation

The Durbin-Watson statistic is 1.8878, close to 2, so this assumption is also satisfied.
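The Durbin-Watson statistic can be computed directly from the residuals as the sum of squared successive differences divided by the sum of squared residuals. The sketch below does this in base R on simulated data (assumed settings and column names); packages such as lmtest (`dwtest`) additionally report a p-value.

```r
# Assumed simulation settings (n, seed, ranges); salary rules are from the text.
set.seed(42)
n <- 1000
d <- data.frame(
  edu  = factor(sample(c("high school", "college", "PhD"), n, replace = TRUE),
                levels = c("high school", "college", "PhD")),
  exp  = sample(0:10, n, replace = TRUE),
  mngt = sample(0:1, n, replace = TRUE)
)
d$salary <- 40000 + 5000 * d$exp + 10000 * (d$edu == "college") +
  20000 * (d$edu == "PhD") + 20000 * d$mngt +
  30000 * (d$mngt == 1 & d$edu == "high school") + rnorm(n, sd = 5000)

e <- resid(lm(salary ~ edu * mngt + exp, data = d))

# DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); values near 2 suggest
# no first-order autocorrelation in the row order of the data.
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

The statistic ranges from 0 to 4. Values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation.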

  • No multicollinearity

The VIF values of the predictors edu, exp, and mngt are all below 5, so this assumption is satisfied.
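A sketch of a VIF calculation in base R, on simulated data with assumed settings and column names. It relies on the identity that the VIFs are the diagonal of the inverse of the predictors' correlation matrix, and it works on the individual dummy columns of the main-effects design matrix. This is a simplification: `car::vif` reports a generalized VIF for a multi-level factor such as `edu`, so its output would be laid out differently.

```r
# Assumed simulation settings (n, seed, ranges); the outcome is not needed here.
set.seed(42)
n <- 1000
d <- data.frame(
  edu  = factor(sample(c("high school", "college", "PhD"), n, replace = TRUE),
                levels = c("high school", "college", "PhD")),
  exp  = sample(0:10, n, replace = TRUE),
  mngt = sample(0:1, n, replace = TRUE)
)

# Main-effects design matrix without the intercept column.
X <- model.matrix(~ edu + exp + mngt, data = d)[, -1]

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal element
# of the inverse correlation matrix of the predictors.
vif <- diag(solve(cor(X)))
round(vif, 2)
```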

Regression on subsets of the data

You can get the same result by fitting the model on subsets of the data. Instead of using dummy variables for education, split the data by education level and fit a regression model on each subset.

Using only the high school graduates gives a result like this.

sub <- d %>% filter(edu == "high school")

Using only the college graduates gives this result.

Using only the PhD graduates gives this result.
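A sketch of the subset approach on simulated data (assumed settings and column names), using base R's `split` and `lapply` rather than the dplyr `filter` call shown earlier. Within each education level, salary is regressed on experience and managerial position only.

```r
# Assumed simulation settings (n, seed, ranges); salary rules are from the text.
set.seed(42)
n <- 1000
d <- data.frame(
  edu  = factor(sample(c("high school", "college", "PhD"), n, replace = TRUE),
                levels = c("high school", "college", "PhD")),
  exp  = sample(0:10, n, replace = TRUE),
  mngt = sample(0:1, n, replace = TRUE)
)
d$salary <- 40000 + 5000 * d$exp + 10000 * (d$edu == "college") +
  20000 * (d$edu == "PhD") + 20000 * d$mngt +
  30000 * (d$mngt == 1 & d$edu == "high school") + rnorm(n, sd = 5000)

# One regression per education level; edu is constant within each subset,
# so it drops out of the formula.
fits  <- lapply(split(d, d$edu), function(s) lm(salary ~ exp + mngt, data = s))

# Rows: (Intercept), exp, mngt. Columns: education levels.
coefs <- sapply(fits, coef)
round(coefs)
```

Each column's intercept plays the role of the base salary plus the education effect, and each column's `mngt` coefficient is the managerial premium for that group, which is what the interaction model recovers in a single fit.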

