The dataset (liver.dat) we will be using in this discussion is available here. The data is from a study of a surgical unit and contains data on the survival time of 54 patients. The variables in the dataset are:
Interest lies predicting survival time (by using log(Survival time)) from the four explanatory variables.
When doing multiple regression, there are many models that we may fit. How does one go about choosing a best model? What criteria do we use to define best?
Deciding on a model is a multi-faceted problem. A few points to keep in mind.
R2 is an obvious choice. We want models with a high R2 becuase R2 is the proportion of variability in Y that is explained by the regression. The problem with R2 is that adding an extra variable to a model will always increase R2, even if the variable is completely unrelated to Y . (The increase may be slight).
Can we improve on R2? Adjusted R2 is a better choice.
R2 = 1 - SSE / SSTO
R2adj = 1 -MSE / MSTOT
R2adj adjusts R2 for the number of predictors that the model uses. The penalty is not very big if
(a) the sample size is big and/or (b) the number of parameters
in the model is not too big. Now, we can have R2adj go down when we compare a model with more parameters
to a comparable one with fewer predictors.
This Criteria for model selection is popular because it is easy and relatively easy to explain (as it is close
to R2.)
SSE Criteria. Since SSE measures the sum of squared residuals it makes sense to make SSE as small as possible. The problem with SSE is that adding an extra variable to a model will always decrease SSE, even if the variable is completely unrelated to Y . (The decrease may be slight). Adding enough predictors to the model, will always make SSE = 0.
MSE Criteria. Since MSE measures the variance of the residuals in a model, it makes sense to make MSE as small as possible. Minimizing MSE is better than minimizing SSE, but there are still better alternatives. Like R2adj, this criteria also has a built in penalty for too many parameters, but the penalty is very weak.
Akaike�s Information Criteria (AIC). AIC is another criteria for model selection. Smaller is better. Typically, one model will have smallest AIC. AIC has a penalty for many parameters. Can be used with models that are more complex than regression. Well used in the statistics community because of mathematical properties. AIC is calculated as:
AIC = n log(RSS / n) + 2p
where: n is the number of observations, RSS is the residual SS, and p is the number of parameters fitted by the model
Bayesian Information Criteria (BIC). BIC is another criteria for model selection. Smaller is better. Typically, one model will have smallest BIC. BIC has a penalty for many parameters. Can be used with models that are more complex than regression. Very similar to AIC, but it has a larger penalty when n is large. BIC is calculated as:
BIC = n log(RSS / n) + log(n) * p
where: n is the number of observations, RSS is the residual SS, and p is the number of parameters fitted by the model
For the liver data example, we will use the R step method that relys on the AIC criteria for model selection. The code used is summarized below and is followed by a discussion of the output.
| Line | Output | Discussion |
| 2 |
none |
just read in the dataset |
| 4 |
none |
create a variable that is the log10 of STIME |
| 6 |
none |
create a new data set (liver_new) that has the original 4 independent variables and the new dependent vatiable LSTIME |
| 8 | generate correlation coefficients for all the variables | |
| 10 | cross plots of all variables | |
| 13 |
none |
this fits a model to that data that only has the intercept term. |
| 14 | note only intercept fitted to data (LSTIME~1) | |
| 16 | This statement starts with the intercept only model and adds additional variables using the AIC criteria. Note that the variables are added in this order: LFTEST ... EFTEST ... PINDEX ... BCSCORE The expression "scope =~ BCSCORE + PINDEX + EFTEST + LFTEST" specifies which variables may enter the model. | |
| 18 | The final model with all the variables entered. Note that the variable LFTEST does not reject the hypothesis Ho: BLFTEST = 0 | |
| 20 |
none |
We're creating a model with all (4) dependent variables in the model. The notation ~. says to enter all variables into the model. |
| 22 | The model with all variables. Actually the same analysis as 18. | |
| 24 |
none |
Start with the complete model (lm2) and step backwards eliminating model terms. |
| 25 | The summary of the final model. Note that LFTEST is not included. | |
| 29 |
none |
The drop1 method eliminates variables one at a time from a starting model (in this case lm2). It lets us judge the influence of individual deletions from the model. |
| 30 | The results from dropping one variable at a time is given. |
Based on the forward and backward selection methods, the final model selected to represent the data is:
lm(formula = LSTIME ~ BCSCORE + PINDEX + EFTEST, data = liver_new)
Residuals:
Min 1Q Median 3Q Max
-0.102047 -0.016203 -0.002600 0.011850 0.138353
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4836226 0.0426339 11.34 1.95e-15 ***
BCSCORE 0.0692287 0.0040784 16.97 < 2e-16 ***
PINDEX 0.0092946 0.0003826 24.30 < 2e-16 ***
EFTEST 0.0095233 0.0003064 31.08 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.04688 on 50 degrees of freedom
Multiple R-Squared: 0.9723, Adjusted R-squared: 0.9707
F-statistic: 585.9 on 3 and 50 DF, p-value: < 2.2e-16