Automatic Effect Selection in Regression Models

The GLMSELECT procedure performs effect selection in the framework of general linear models. A variety of model selection methods are available, including the LASSO method of Tibshirani (1996) and the related LAR method of Efron et al. (2004). The procedure offers extensive capabilities for customizing the selection with a wide variety of selection and stopping criteria, from traditional and computationally efficient significance-level-based criteria to more computationally intensive validation-based criteria. The procedure also provides graphical summaries of the selection search.

The GLMSELECT procedure compares most closely to REG and GLM. The REG procedure supports a variety of model-selection methods but does not support a CLASS statement. The GLM procedure supports a CLASS statement but does not include effect selection methods. The GLMSELECT procedure fills this gap. GLMSELECT focuses on the standard independently and identically distributed general linear model for univariate responses and offers great flexibility for and insight into the model selection algorithm. GLMSELECT provides results (displayed tables, output data sets, and macro variables) that make it easy to take the selected model and explore it in more detail in a subsequent procedure such as REG or GLM.
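
As a rough sketch of that hand-off, the statements below run a selection and then refit the chosen effects in PROC GLM. The sketch assumes the tutorial.diamonds data set from the example that follows, and it relies on the _GLSIND macro variable, which GLMSELECT sets to the list of selected effects. The SOLUTION option and the %PUT message are illustrative choices, not part of the original example.

```sas
/* Step 1: select effects (same data and candidate effects as the example below) */
PROC GLMSELECT DATA = tutorial.diamonds;
    CLASS cut color clarity;
    MODEL price = carat cut color clarity depth table x y z
          / SELECTION=STEPWISE(CHOOSE=AIC);
RUN;

/* Step 2: _GLSIND now holds the selected effects; write the list to the log */
%PUT Selected effects: &_GLSIND;

/* Step 3: refit the selected model in PROC GLM for further analysis */
PROC GLM DATA = tutorial.diamonds;
    CLASS cut color clarity;
    MODEL price = &_GLSIND / SOLUTION;
RUN;
QUIT;
```

Referring to the macro variable instead of retyping the effect list keeps the second step in sync if the selection settings change.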

Example: Prices of round cut diamonds

Description: A dataset containing the prices and other attributes of almost 54,000 diamonds.

price: Selling price in dollars.
carat: Weight of the diamond.
cut: Quality of the cut (Fair, Good, Very Good, Premium, Ideal).
color: Diamond colour, from J (worst) to D (best).
clarity: A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)).
x: Length in mm.
y: Width in mm.
z: Depth in mm.
depth: Total depth percentage = z / mean(x, y) = 2 * z / (x + y).
table: Width of top of diamond relative to widest point.

Source: https://vincentarelbundock.github.io/Rdatasets/doc/ggplot2/diamonds.html

Task: Develop a model to estimate price given diamond parameters.

Let's run our regression model including all the variables and let the procedure select the best ones.

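/* Stepwise selection over all candidate effects; keep the step with the lowest AIC */
PROC GLMSELECT DATA = tutorial.diamonds;
    CLASS cut color clarity;    /* categorical predictors */
    MODEL price = carat cut color clarity depth table x y z
          / SELECTION=STEPWISE(CHOOSE=AIC SLE=0.05);
RUN;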

The SELECTION= option specifies the effect selection method. The available methods include:

  • FORWARD: The forward selection technique begins with just the intercept and then sequentially adds the effect that most improves the fit. The process terminates when no significant improvement can be obtained by adding any effect.
  • BACKWARD: The backward elimination technique starts from the full model including all independent effects. Then effects are deleted one by one until a stopping condition is satisfied. At each step, the effect showing the smallest contribution to the model is deleted.
  • STEPWISE: The stepwise method is a modification of the forward selection technique that differs in that effects already in the model do not necessarily stay there.
  • LAR: Least angle regression was introduced by Efron et al. (2004). Not only does this algorithm provide a selection method in its own right, but with one additional modification it can be used to efficiently produce LASSO solutions. Just like the forward selection method, the LAR algorithm produces a sequence of regression models where one parameter is added at each step, terminating at the full least squares solution when all parameters have entered the model.
  • LASSO: LASSO (least absolute shrinkage and selection operator) selection arises from a constrained form of ordinary least squares regression where the sum of the absolute values of the regression coefficients is constrained to be smaller than a specified parameter.
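
The methods above map onto the SELECTION= option roughly as sketched below. These calls are illustrative variations on the diamonds example rather than recommended settings; the STEPS=20 cap and the use of Mallows' C(p) as the CHOOSE= criterion are arbitrary assumptions.

```sas
/* Backward elimination, using the procedure's default criteria */
PROC GLMSELECT DATA = tutorial.diamonds;
    CLASS cut color clarity;
    MODEL price = carat cut color clarity depth table x y z
          / SELECTION=BACKWARD;
RUN;

/* LASSO: run at most 20 steps, then keep the step with the smallest C(p) */
PROC GLMSELECT DATA = tutorial.diamonds;
    CLASS cut color clarity;
    MODEL price = carat cut color clarity depth table x y z
          / SELECTION=LASSO(CHOOSE=CP STEPS=20);
RUN;
```

Only the SELECTION= specification changes between the calls; the CLASS and MODEL effect lists stay the same.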

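You use the CHOOSE= option to specify the criterion for selecting one model from the sequence of models produced. If you do not specify a CHOOSE= criterion, then the model at the final step is the selected model. In this particular case, no SELECT= criterion is specified, so the procedure uses SBC both to decide which effect enters or leaves at each step and to stop the search, as the Select Criterion and Stop Criterion lines of the output show. The SLE=0.05 significance level applies only when the significance level is the selection or stopping criterion (SELECT=SL or STOP=SL), so it has no effect on this run. Selection terminates at the step where SBC reaches a local minimum, and the selected model is the first one with the minimal value of Akaike’s information criterion.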
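
For comparison, a search driven by significance levels can be requested explicitly. The following sketch is a variation on the example, not the call that produced the output below: adding SELECT=SL makes the significance level the criterion for entry and removal, so SLE= takes effect. The SLS=0.05 stay level is an assumed value chosen for illustration.

```sas
PROC GLMSELECT DATA = tutorial.diamonds;
    CLASS cut color clarity;
    MODEL price = carat cut color clarity depth table x y z
          / SELECTION=STEPWISE(SELECT=SL SLE=0.05 SLS=0.05 CHOOSE=AIC);
RUN;
```

With this specification the search adds and removes effects by significance level, while CHOOSE=AIC still picks the final model from the resulting sequence.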

Here's the output:

Data Set                     TUTORIAL.DIAMONDS
Dependent Variable           price
Selection Method             Stepwise
Select Criterion             SBC
Stop Criterion               SBC
Choose Criterion             AIC
Effect Hierarchy Enforced    None

Dimensions

Number of Effects            10
Number of Parameters         27

Stepwise Selection Summary

Step  Effect     Effect     Number       Number      AIC            SBC
      Entered    Removed    Effects In   Parms In
0     Intercept             1            1           948419.888     894486.784
1     carat                 2            2           846331.443     792407.235
2     clarity               3            9           826940.500     773078.560
3     color                 4            15          816135.366     762326.801
4     x                     5            16          814103.107     760303.437
5     cut                   6            20          812636.431     758872.343
6     depth                 7            21          812447.395     758692.203
7     table                 8            22          812366.819*    758620.523*

Selection stopped at a local minimum of the SBC criterion.

Stop Details

Candidate For   Effect   Candidate SBC        Compare SBC
Entry           z        758629.333      >    758620.523
Removal         table    758692.203      >    758620.523

Effects: Intercept carat cut color clarity depth table x

Analysis of Variance

Source             DF       Sum of Squares   Mean Square    F Value
Model              21       7.896133E11      37600633865    29441.7
Error              53918    68859824359      1277121
Corrected Total    53939    8.584731E11

Root MSE           1130.09790
Dependent Mean     3932.79972
R-Square           0.9198
Adj R-Sq           0.9198
AIC                812367
AICC               812367
SBC                758621
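
The AIC and SBC values reported for the selected model can be checked by hand. The DATA step below is a sketch that assumes the least squares forms of the criteria given in the GLMSELECT documentation, AIC = n*log(SSE/n) + 2p + n + 2 and SBC = n*log(SSE/n) + p*log(n). It takes n = 53940 observations (corrected total DF plus one), p = 22 parameters, and the error sum of squares from the Analysis of Variance table; the variable names are chosen here for illustration.

```sas
DATA _NULL_;
    n   = 53940;          /* observations: corrected total DF (53939) + 1 */
    p   = 22;             /* parameters in the selected model, intercept included */
    sse = 68859824359;    /* error sum of squares from the ANOVA table */
    fit = n * LOG(sse / n);           /* term shared by both criteria */
    aic = fit + 2*p + n + 2;
    sbc = fit + p * LOG(n);
    PUT aic= 12.3 sbc= 12.3;          /* written to the SAS log */
RUN;
```

Both values should land close to the 812366.819 and 758620.523 shown at step 7; small differences can arise because the printed sum of squares is rounded. The extra n + 2 term in this form of AIC also explains why AIC exceeds SBC at every step of the summary table.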