Handout 11

Stepwise Techniques and Using R for Stepwise Regression*


Introduction

The dataset (liver.dat) we will be using in this discussion is available here. The data is from a study of a surgical unit and contains data on the survival time of 54 patients. The variables in the dataset are:

Interest lies predicting survival time (by using log(Survival time)) from the four explanatory variables.

Some Philosphy

When doing multiple regression, there are many models that we may fit. How does one go about choosing a best model? What criteria do we use to define best?

Deciding on a model is a multi-faceted problem. A few points to keep in mind.

  1. We want simple (parsimonious) models because they are easier to interpret, explain and understand.
  2. We want models that fit the data.
  3. We want models that make sense in terms of the science that they represent.
  4. No model is correct, but some are useful.

R2 is an obvious choice. We want models with a high R2 becuase R2 is the proportion of variability in Y that is explained by the regression. The problem with R2 is that adding an extra variable to a model will always increase R2, even if the variable is completely unrelated to Y . (The increase may be slight).

Can we improve on R2? Adjusted R2 is a better choice.

R2 = 1 - SSE / SSTO

R2adj = 1 -MSE / MSTOT

R2adj adjusts R2 for the number of predictors that the model uses. The penalty is not very big if (a) the sample size is big and/or (b) the number of parameters
in the model is not too big. Now, we can have R2adj go down when we compare a model with more parameters to a comparable one with fewer predictors.
This Criteria for model selection is popular because it is easy and relatively easy to explain (as it is close to R2.)

SSE Criteria. Since SSE measures the sum of squared residuals it makes sense to make SSE as small as possible. The problem with SSE is that adding an extra variable to a model will always decrease SSE, even if the variable is completely unrelated to Y . (The decrease may be slight). Adding enough predictors to the model, will always make SSE = 0.

MSE Criteria. Since MSE measures the variance of the residuals in a model, it makes sense to make MSE as small as possible. Minimizing MSE is better than minimizing SSE, but there are still better alternatives. Like R2adj, this criteria also has a built in penalty for too many parameters, but the penalty is very weak.

Akaike�s Information Criteria (AIC). AIC is another criteria for model selection. Smaller is better. Typically, one model will have smallest AIC. AIC has a penalty for many parameters. Can be used with models that are more complex than regression. Well used in the statistics community because of mathematical properties. AIC is calculated as:

AIC = n log(RSS / n) + 2p

where: n is the number of observations, RSS is the residual SS, and p is the number of parameters fitted by the model

Bayesian Information Criteria (BIC). BIC is another criteria for model selection. Smaller is better. Typically, one model will have smallest BIC. BIC has a penalty for many parameters. Can be used with models that are more complex than regression. Very similar to AIC, but it has a larger penalty when n is large. BIC is calculated as:

BIC = n log(RSS / n) + log(n) * p

where: n is the number of observations, RSS is the residual SS, and p is the number of parameters fitted by the model

Analyzing The Data

For the liver data example, we will use the R step method that relys on the AIC criteria for model selection. The code used is summarized below and is followed by a discussion of the output.

R Code

  1. # get the original dataset
  2. liver = read.table("c:/rdata/liver.dat", header=T)
  3. # take survival time (STIME) and find the log base 10
  4. LSTIME = log10(liver[,5])
  5. # create the data matrix by removing STIME and appending the new column LSTIME
  6. liver_new = cbind(liver[,1:4],LSTIME)
  7. # find the pairwise correlations
  8. round(cor(liver_new, use="pairwise"),2)
  9. # do some cross plots
  10. pairs(liver_new)
  11. # do a stepwise forward selection
  12. # fit only the intercept term
  13. lm1 = lm(LSTIME~1,data=liver_new)
  14. summary(lm1)
  15. # step forward and add terms
  16. forward = step(lm1, scope =~ BCSCORE + PINDEX + EFTEST + LFTEST, direction=c("forward"))
  17. # get a summary of the final model
  18. summary(forward)
  19. # fit all the independent variables in the model
  20. lm2 = lm(LSTIME~.,data=liver_new)
  21. # get a summary
  22. summary(lm2)
  23. # step backward and remove items from the model with all independent variables
  24. backward = step(lm2, direction=c("backward"))
  25. summary(backward)
  26. # Compute all the single terms in the 'scope' argument that can be
  27. # added to or dropped from the model, fit those models and compute a
  28. # table of the changes in fit.
  29. drop1var = drop1(lm2, test="F")
  30. drop1var

Annotation

Line Output Discussion
2

none

just read in the dataset
4

none

create a variable that is the log10 of STIME
6

none

create a new data set (liver_new) that has the original 4 independent variables and the new dependent vatiable LSTIME
8

here

generate correlation coefficients for all the variables
10

here

cross plots of all variables
13

none

this fits a model to that data that only has the intercept term.
14

here

note only intercept fitted to data (LSTIME~1)
16

here

This statement starts with the intercept only model and adds additional variables using the AIC criteria. Note that the variables are added in this order: LFTEST ... EFTEST ... PINDEX ... BCSCORE The expression "scope =~ BCSCORE + PINDEX + EFTEST + LFTEST" specifies which variables may enter the model.
18

here

The final model with all the variables entered. Note that the variable LFTEST does not reject the hypothesis Ho: BLFTEST = 0
20

none

We're creating a model with all (4) dependent variables in the model. The notation ~. says to enter all variables into the model.
22

here

The model with all variables. Actually the same analysis as 18.
24

none

Start with the complete model (lm2) and step backwards eliminating model terms.
25

here

The summary of the final model. Note that LFTEST is not included.
29

none

The drop1 method eliminates variables one at a time from a starting model (in this case lm2). It lets us judge the influence of individual deletions from the model.
30

here

The results from dropping one variable at a time is given.

Conclusion

Based on the forward and backward selection methods, the final model selected to represent the data is:

lm(formula = LSTIME ~ BCSCORE + PINDEX + EFTEST, data = liver_new)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.102047 -0.016203 -0.002600  0.011850  0.138353 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.4836226  0.0426339   11.34 1.95e-15 ***
BCSCORE     0.0692287  0.0040784   16.97  < 2e-16 ***
PINDEX      0.0092946  0.0003826   24.30  < 2e-16 ***
EFTEST      0.0095233  0.0003064   31.08  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.04688 on 50 degrees of freedom
Multiple R-Squared: 0.9723,     Adjusted R-squared: 0.9707 
F-statistic: 585.9 on 3 and 50 DF,  p-value: < 2.2e-16 


Sources include here , here , here , here , here and here.