Automatic Effect Selection in Regression Models
The GLMSELECT procedure performs effect selection in the framework of general linear models. A variety of model selection methods are available, including the LASSO method of Tibshirani (1996) and the related LAR method of Efron et al. (2004). The procedure offers extensive capabilities for customizing the selection with a wide variety of selection and stopping criteria, from traditional and computationally efficient significance-level-based criteria to more computationally intensive validation-based criteria. The procedure also provides graphical summaries of the selection search.
The GLMSELECT procedure compares most closely to REG and GLM. The REG procedure supports a variety of model-selection methods but does not support a CLASS statement. The GLM procedure supports a CLASS statement but does not include effect selection methods. The GLMSELECT procedure fills this gap. GLMSELECT focuses on the standard independently and identically distributed general linear model for univariate responses and offers great flexibility for and insight into the model selection algorithm. GLMSELECT provides results (displayed tables, output data sets, and macro variables) that make it easy to take the selected model and explore it in more detail in a subsequent procedure such as REG or GLM.
Example: Prices of round cut diamonds
Description: A dataset containing the prices and other attributes of almost 54,000 diamonds.
price: Selling price in dollars.
carat: Weight of the diamond.
cut: Quality of the cut (Fair, Good, Very Good, Premium, Ideal).
color: Diamond colour, from J (worst) to D (best).
clarity: A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)).
x: Length in mm.
y: Width in mm.
z: Depth in mm.
depth: Total depth percentage = z / mean(x, y) = 2 * z / (x + y).
table: Width of top of diamond relative to widest point.
Source: https://vincentarelbundock.github.io/Rdatasets/doc/ggplot2/diamonds.html
Task: Develop a model to estimate price given diamond parameters.
Let's run our regression model including all the variables and let the procedure select the best ones.
PROC GLMSELECT DATA = tutorial.diamonds;
CLASS cut color clarity;
MODEL price = carat cut color clarity depth table x y z / SELECTION=STEPWISE(CHOOSE=AIC SLE=0.05);
RUN;
The SELECTION= option specifies the effect selection algorithm. The available methods include:
- FORWARD: The forward selection technique begins with just the intercept and then sequentially adds the effect that most improves the fit. The process terminates when no significant improvement can be obtained by adding any effect.
- BACKWARD: The backward elimination technique starts from the full model including all independent effects. Then effects are deleted one by one until a stopping condition is satisfied. At each step, the effect showing the smallest contribution to the model is deleted.
- STEPWISE: The stepwise method is a modification of the forward selection technique that differs in that effects already in the model do not necessarily stay there.
- LAR: Least angle regression was introduced by Efron et al. (2004). Not only does this algorithm provide a selection method in its own right, but with one additional modification it can be used to efficiently produce LASSO solutions. Just like the forward selection method, the LAR algorithm produces a sequence of regression models where one parameter is added at each step, terminating at the full least squares solution when all parameters have entered the model.
- LASSO: LASSO (least absolute shrinkage and selection operator) selection arises from a constrained form of ordinary least squares regression where the sum of the absolute values of the regression coefficients is constrained to be smaller than a specified parameter.
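Switching between these methods only requires changing the SELECTION= value. As a minimal sketch (not run for this tutorial, so no output is shown), the call below uses LASSO together with the validation-based criteria mentioned in the introduction. The 30% validation fraction and the SEED=1 value are arbitrary choices made for illustration.

```sas
/* Sketch: LASSO selection with a validation-based CHOOSE= criterion.
   The validation fraction and seed are illustrative assumptions. */
PROC GLMSELECT DATA = tutorial.diamonds SEED = 1;
  /* Hold out a random 30% of the rows as validation data */
  PARTITION FRACTION(VALIDATE = 0.3);
  CLASS cut color clarity;
  /* STOP=NONE lets the LASSO path run to the end; CHOOSE=VALIDATE then
     picks the step with the smallest validation error */
  MODEL price = carat cut color clarity depth table x y z
        / SELECTION=LASSO(STOP=NONE CHOOSE=VALIDATE);
RUN;
```

Fixing the seed makes the random partition reproducible between runs.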
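You use the CHOOSE= option to specify the criterion for selecting one model from the sequence of models produced. If you do not specify a CHOOSE= criterion, then the model at the final step is the selected model. The SELECT= suboption determines which effect enters or leaves at each step. It is not specified here, so the procedure uses SBC, which the output reports as both the select and the stop criterion. Selection therefore terminates at a local minimum of SBC, and the SLE=0.05 entry significance level has no effect on this run, because it applies only to significance-level-based selection (SELECT=SL). The selected model is the first one in the sequence with the minimal value of Akaike’s information criterion (AIC).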
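If the intent is for significance levels to drive entry and removal, the request has to include SELECT=SL. The variant below is a sketch of that call. The SLS=0.05 stay level is an assumption added for illustration and does not appear in the original call.

```sas
/* Sketch: stepwise selection driven by significance levels.
   SELECT=SL makes SLE= (entry) and SLS= (stay) take effect;
   CHOOSE=AIC still picks the final model from the sequence. */
PROC GLMSELECT DATA = tutorial.diamonds;
  CLASS cut color clarity;
  MODEL price = carat cut color clarity depth table x y z
        / SELECTION=STEPWISE(SELECT=SL SLE=0.05 SLS=0.05 CHOOSE=AIC);
RUN;
```

This variant was not run for this tutorial. The output that follows comes from the original call, which selects on SBC.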
Here's the output:
Data Set | TUTORIAL.DIAMONDS |
---|---|
Dependent Variable | price |
Selection Method | Stepwise |
Select Criterion | SBC |
Stop Criterion | SBC |
Choose Criterion | AIC |
Effect Hierarchy Enforced | None |
Dimensions | |
---|---|
Number of Effects | 10 |
Number of Parameters | 27 |
Stepwise Selection Summary
Step | Effect Entered | Effect Removed | Number Effects In | Number Parms In | AIC | SBC |
---|---|---|---|---|---|---|
0 | Intercept | | 1 | 1 | 948419.888 | 894486.784 |
1 | carat | | 2 | 2 | 846331.443 | 792407.235 |
2 | clarity | | 3 | 9 | 826940.500 | 773078.560 |
3 | color | | 4 | 15 | 816135.366 | 762326.801 |
4 | x | | 5 | 16 | 814103.107 | 760303.437 |
5 | cut | | 6 | 20 | 812636.431 | 758872.343 |
6 | depth | | 7 | 21 | 812447.395 | 758692.203 |
7 | table | | 8 | 22 | 812366.819* | 758620.523* |
Selection stopped at a local minimum of the SBC criterion.
Stop Details
Candidate For | Effect | Candidate SBC | | Compare SBC |
---|---|---|---|---|
Entry | z | 758629.333 | > | 758620.523 |
Removal | table | 758692.203 | > | 758620.523 |
Effects: | Intercept carat cut color clarity depth table x |
---|---|
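The selected effects can be passed to a follow-up procedure without retyping them. As a sketch, and assuming the GLMSELECT step above has just run in the same SAS session, the macro variable `_GLSIND` should hold the list of selected effects. Because the selected model contains CLASS effects, PROC GLM is used here rather than PROC REG.

```sas
/* Sketch: refit the selected model in PROC GLM for closer inspection.
   Assumes PROC GLMSELECT has just run in this session, so that the
   macro variable _GLSIND contains the selected effects. */
%PUT Selected effects: &_GLSIND;

PROC GLM DATA = tutorial.diamonds;
  CLASS cut color clarity;
  /* SOLUTION prints the parameter estimates */
  MODEL price = &_GLSIND / SOLUTION;
RUN;
QUIT;
```

The %PUT line writes the list to the log, which makes it easy to confirm that it matches the effects shown above before refitting.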
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value |
---|---|---|---|---|
Model | 21 | 7.896133E11 | 37600633865 | 29441.7 |
Error | 53918 | 68859824359 | 1277121 | |
Corrected Total | 53939 | 8.584731E11 | | |
Root MSE | 1130.09790 |
---|---|
Dependent Mean | 3932.79972 |
R-Square | 0.9198 |
Adj R-Sq | 0.9198 |
AIC | 812367 |
AICC | 812367 |
SBC | 758621 |
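As a rough cross-check, the AIC and SBC values can be recomputed from numbers already printed in the output. The sketch below assumes the definitions AIC = n ln(SSE/n) + 2p + n + 2 and SBC = n ln(SSE/n) + p ln(n), where n is the number of observations, p the number of parameters in the model, and SSE the error sum of squares. These formulas are an assumption stated here, not something shown in the output, so check them against the documentation for your SAS release.

```sas
/* Sketch: recompute AIC and SBC for the selected model by hand.
   All inputs are taken from the output tables above; the two
   formulas are assumed definitions, see the note before this block. */
DATA _NULL_;
  n   = 53940;         /* observations: Corrected Total DF (53939) + 1 */
  p   = 22;            /* parameters in the selected model (step 7)    */
  sse = 68859824359;   /* Error sum of squares from the ANOVA table    */

  base = n * LOG(sse / n);       /* LOG is the natural logarithm in SAS */
  aic  = base + 2 * p + n + 2;   /* assumed AIC definition */
  sbc  = base + p * LOG(n);      /* assumed SBC definition */

  PUT aic= sbc=;                 /* results are written to the SAS log */
RUN;
```

Under these assumptions the two values come out at roughly 812367 and 758621, in line with the fit statistics reported above.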