Purpose:

I wrote the following SAS code to implement a stepwise regression algorithm in SAS. The macro automates backward elimination variable selection for PROC GENMOD. Note that the GENMOD procedure in SAS versions prior to 9.4 does not come with model selection options.

Introduction:

Users of SAS 9.2 and earlier versions may find that some "powerful" options are available in certain SAS procedures but not in others. For example, model selection options are available in PROC REG, LOGISTIC, PHREG, etc., but not in PROC GENMOD, CATMOD, MIXED, etc. This backward elimination macro could be used with the procedures GENMOD, CATMOD, MIXED, GLIMMIX, etc.

Illustration:

The following SAS statements simulate 5000 observations from an underlying Tweedie generalized linear model (GLM), exploiting the Tweedie distribution's connection with the compound Poisson distribution. A natural logarithm link function is assumed for modeling the response variable (yTweedie). There are five categorical variables (C1–C5), each with four numerical levels, and two continuous variables (D1 and D2). By design, two of the categorical variables, C3 and C4, and one of the two continuous variables, D2, have no effect on the response. The dispersion parameter is set to 0.5, and the power parameter is set to 1.5.

%let nObs = 5000;
%let nClass = 5;
%let nLevs = 4;
%let seed = 1234;

data tmp1;
   array c{&nClass};

   keep c1-c&nClass yTweedie d1 d2;

   /* Tweedie parms */
   phi=0.5;
   p=1.5;

   do i=1 to &nObs;

      do j=1 to &nClass;
         c{j} = int(ranuni(1)*&nLevs);
      end;

      d1 = ranuni(&seed);
      d2 = ranuni(&seed);

      xBeta = 0.5*((c2<2) - 2*(c1=1) + 0.5*c&nClass + 0.05*d1);
      mu = exp(xBeta);

      /* Poisson distributions parms */
      lambda = mu**(2-p)/(phi*(2-p));
      /* Gamma distribution parms */
      alpha = (2-p)/(p-1);
      gamma = phi*(p-1)*(mu**(p-1));

      rpoi = ranpoi(&seed,lambda);
      if rpoi=0 then yTweedie=0;
      else do;
         yTweedie=0;
         do j=1 to rpoi;
            yTweedie = yTweedie + rangam(&seed,alpha);
         end;
         yTweedie = yTweedie * gamma;
      end;
      output;
   end;
run;

NOTE: The data set WORK.TMP1 has 5000 observations and 8 variables.
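
As a quick sanity check on the simulation, the sketch below verifies that the Poisson and gamma parameters used in the data step reproduce the Tweedie mean (mu) and variance (phi*mu**p). It is only an illustration: the data set name tweedie_check and the value mu_demo = 2 are my own assumptions and are not part of the original example.

```sas
/* Sketch: compound Poisson-gamma parameters versus Tweedie moments.  */
/* tweedie_check and mu_demo are hypothetical names for illustration. */
data tweedie_check;
   phi = 0.5;        /* dispersion, as in the simulation       */
   p   = 1.5;        /* power parameter, as in the simulation  */
   mu_demo = 2;      /* assumed mean, chosen arbitrarily       */

   lambda = mu_demo**(2-p) / (phi*(2-p));   /* Poisson rate */
   alpha  = (2-p) / (p-1);                  /* gamma shape  */
   gamma  = phi*(p-1)*mu_demo**(p-1);       /* gamma scale  */

   /* Moments of a Poisson sum of gamma variables */
   implied_mean = lambda*alpha*gamma;
   implied_var  = lambda*alpha*(1+alpha)*gamma**2;

   /* Tweedie variance the simulation is aiming for */
   target_var = phi*mu_demo**p;

   put implied_mean= implied_var= target_var=;
run;
```

If the mapping is right, implied_mean equals mu_demo and implied_var equals target_var, up to floating-point rounding.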

The following code generates a basic exploratory data analysis of the dependent and independent variables: summary statistics and a histogram for yTweedie and for the continuous variables d1 and d2, and a frequency table with a bar chart for each categorical variable c1-c5:

/* EDA */

%let var_char = yTweedie c1 c2 c3 c4 c5 d1 d2;

%put &var_char;


data var_char;
    set tmp1
    (keep= &var_char);     
run;

proc contents data = var_char varnum nodetails noprint 
out=var_char_names (keep=name);
run;

data var_char_names;
    set var_char_names;
    j = _n_;
run;

* Determine the number of observations;
data _NULL_;
    if 0 then set var_char_names nobs=n;
    call symputx('nrows',n);
    stop;
run;

%put &nrows;

%macro do_eda_uni;
%do obs = 1 %to &nrows;

data _null_;
    set var_char_names;
    if j = &obs  then call symputx("var", name); /* name is character; symputx strips blanks */
run;
%if (%upcase(&var)=YTWEEDIE) or (%upcase(&var)=D1) or (%upcase(&var)=D2)   %then %do; 

    ods graphics on;
        proc means data=tmp1 fw=12 printalltypes chartype
            qmethod=os maxdec=2

            mean 
            min 
            max 
            mode 
            range 
            n 
            nmiss   
            p1 
            p5 
            median 
            p95 
            p99 ;
            var &var;
        run;

        title "histograms";
        proc univariate data=tmp1   noprint;
            var &var;
            histogram ;
        run; 
    ods graphics off;
    %end;

    %else %do;
    ods graphics on;
        proc freq data=tmp1
        order=internal;
        tables &var /  scores=table plots(only)=freq;
        run;
    ods graphics off;
    %end;

%end;
%mend do_eda_uni;

%do_eda_uni;
The FREQ Procedure

Table c1: One-Way Frequencies

c1    Frequency    Percent    Cumulative Frequency    Cumulative Percent
0          1218      24.36                    1218                 24.36
1          1233      24.66                    2451                 49.02
2          1263      25.26                    3714                 74.28
3          1286      25.72                    5000                100.00

[Figure: Bar Chart of Frequencies for c1]

The FREQ Procedure

Table c2: One-Way Frequencies

c2    Frequency    Percent    Cumulative Frequency    Cumulative Percent
0          1247      24.94                    1247                 24.94
1          1222      24.44                    2469                 49.38
2          1262      25.24                    3731                 74.62
3          1269      25.38                    5000                100.00

[Figure: Bar Chart of Frequencies for c2]

The FREQ Procedure

Table c3: One-Way Frequencies

c3    Frequency    Percent    Cumulative Frequency    Cumulative Percent
0          1209      24.18                    1209                 24.18
1          1340      26.80                    2549                 50.98
2          1254      25.08                    3803                 76.06
3          1197      23.94                    5000                100.00

[Figure: Bar Chart of Frequencies for c3]

The FREQ Procedure

Table c4: One-Way Frequencies

c4    Frequency    Percent    Cumulative Frequency    Cumulative Percent
0          1210      24.20                    1210                 24.20
1          1263      25.26                    2473                 49.46
2          1292      25.84                    3765                 75.30
3          1235      24.70                    5000                100.00

[Figure: Bar Chart of Frequencies for c4]

The FREQ Procedure

Table c5: One-Way Frequencies

c5    Frequency    Percent    Cumulative Frequency    Cumulative Percent
0          1249      24.98                    1249                 24.98
1          1208      24.16                    2457                 49.14
2          1267      25.34                    3724                 74.48
3          1276      25.52                    5000                100.00

[Figure: Bar Chart of Frequencies for c5]

The MEANS Procedure

Analysis Variable: d1

Mean  Minimum  Maximum  Mode  Range      N  N Miss  1st Pctl  5th Pctl  Median  95th Pctl  99th Pctl
0.51     0.00     1.00     .   1.00   5000       0      0.01      0.06    0.52       0.95       0.99

The UNIVARIATE Procedure

[Figure: Histogram for d1]

The MEANS Procedure

Analysis Variable: d2

Mean  Minimum  Maximum  Mode  Range      N  N Miss  1st Pctl  5th Pctl  Median  95th Pctl  99th Pctl
0.50     0.00     1.00     .   1.00   5000       0      0.01      0.05    0.49       0.95       0.99

The UNIVARIATE Procedure

[Figure: Histogram for d2]

The MEANS Procedure

Analysis Variable: yTweedie

Mean  Minimum  Maximum  Mode  Range      N  N Miss  1st Pctl  5th Pctl  Median  95th Pctl  99th Pctl
1.72     0.00    12.78  0.00  12.78   5000       0      0.00      0.16    1.37       4.54       6.40

The UNIVARIATE Procedure

[Figure: Histogram for yTweedie]

The next lines contain the two SAS macros that implement backward elimination for a model with a Tweedie error function.

The first macro %MdStmt is a stand-alone macro. The main macro, %MdSelect, consists of multiple calls to the macro %MdStmt.

/* Variable Selection Macro: Backwards elimination */
%let p=1.5;
options mlogic; /* optional: traces macro execution in the log */
%macro MdStmt(
        resvar = /*response variable */
       ,expvar = /*list of explanatory variables, separated by ' ' */
       ,clsvar = /*classification variables in the CLASS statement separated by ' ' */
       ,p = /*power parameter of the Tweedie variance function */
       );

        /* Save the Type 3 table as PVAL and rename SOURCE to PARM */
        ods output Type3=pval(rename=(source=parm));
        proc genmod data=tmp1 NAMELEN=50; 
            /* Tweedie unit deviance; the ELSE branch covers _resp_ = 0 */
            if _resp_ > 0 then 
            d = 2*(_resp_*(_resp_**(1-&p)-_mean_**(1-&p))/
            (1-&p)-(_resp_**(2-&p)-_mean_**(2-&p))/(2-&p)); 
            else d = 2* _mean_**(2-&p)/(2-&p);
            /* User-defined variance function mu**p and deviance */
            variance var = _mean_**&p;
            deviance dev = d;
            class &clsvar;  
            model &resvar =  &expvar /link=log type3 scale=pearson;                 
            *scwgt expos;
            title "&resvar = &expvar";  
        run;
        ods output close;
 %mend MdStmt;
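
To make the programming statements inside PROC GENMOD easier to follow, here is a minimal sketch that evaluates the same Tweedie unit deviance in an ordinary data step. The data set name dev_demo and the three (y, mu) pairs are made-up values for illustration only.

```sas
/* Sketch: Tweedie unit deviance for 1 < p < 2, outside PROC GENMOD. */
/* dev_demo and the (y, mu) pairs are hypothetical.                  */
data dev_demo;
   p = 1.5;
   input y mu;
   if y > 0 then
      d = 2*( y*(y**(1-p) - mu**(1-p))/(1-p)
              - (y**(2-p) - mu**(2-p))/(2-p) );
   else d = 2*mu**(2-p)/(2-p);   /* form used when the response is zero */
   datalines;
0 1
1 1
4 1
;
run;

proc print data=dev_demo noobs;
run;
```

With p = 1.5 the three rows give d = 4, 0 and 4: the deviance is zero when the response equals the mean and grows as the response moves away from it in either direction.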


There are five macro parameters in the macro %MdSelect: &VAR, &INTVAR, &CATVAR, &SLSTAY and &POWER:

  • &VAR is the response variable, which is passed to &RESVAR when calling the macro %MdStmt;
  • &INTVAR lists all the potential explanatory variables; it is passed to &EXPVAR in %MdStmt only for the first call;
  • &CATVAR contains all the categorical explanatory variables, which are passed to &CLSVAR in %MdStmt;
  • &SLSTAY is the criterion for removing a variable: the loop stops once the largest p-value is at or below this level;
  • &POWER is the power parameter of the Tweedie distribution.

%macro MdSelect(
       var= /*response variable */
       ,intvar= /*initial explanatory variables for full model */
       ,catvar= /*categorical explanatory variables */
       ,slstay= /*criterion for removing variable */
       ,power=
       );
    %let var=%upcase(&var);
    %let intvar=%upcase(&intvar);
    %let catvar=%upcase(&catvar);
    %let power =&power; 
%*-------------------------------------------------------------------------*;
%* Create empty dataset "step" with only one column "parm". It will be *;
%* merged with "pval" from PROC GENMOD by "parm" *;
%*-------------------------------------------------------------------------*;
 proc sql;
    create table step_&var (parm char(50)); /* length matches NAMELEN=50 in MdStmt */
 quit;
%*------------------------------------------------------------------------------*;
%* %do %until performs multivariate backward model selection: *;
%* In each iteration: *;
%* 1. Fit the Tweedie model with PROC GENMOD *;
%* 2. Update the dataset "step_&var" *;
%* 3. Create &pmax as the maximum p-value, and &varlist as the list of *;
%* variables without the one with the max p-value *;
%* 4. Check whether the max p-value <= &SLSTAY *;
%* 5. If NO, then eliminate the variable with max p-value, repeat step 1 to 4.*;
%* If YES, the loop stops *;
%*------------------------------------------------------------------------------*;
 %let i=1;
 %do %until (%sysevalf(&pmax <= &slstay));

    %if &i = 1 %then
        %MdStmt(resvar=&var ,expvar=&intvar, clsvar=&catvar, p=&power); %*initial model;
    %else %do;
        %MdStmt(resvar=&var ,expvar=&varlist, clsvar=&catvar, p=&power); %*reduced model;
    %end;
    proc sort data=step_&var; by parm;
    proc sort data=pval; by parm;
    data step_&var;
        merge step_&var pval;
        by parm;
        p&i=put(ProbChiSq, pvalue6.3);
        drop ProbChiSq ChiSq DF;
    run;
    proc sql noprint;
        select max(ProbChiSq) into :pmax
        from pval;  
        select distinct parm into :varlist separated by ' '
        from pval
        having ProbChiSq^=max(ProbChiSq);
    quit;

    %let i=%eval(&i+1);

 %end;
 proc print data=step_&var;
    title "&var: model selection process";
 run;
%mend MdSelect; 

%MdSelect(var=yTweedie, intvar=c1 c2 c3 c4 c5 d1 d2, catvar=c1 c2 c3 c4 c5, slstay=0.05, power=1.5);
YTWEEDIE = C1 C2 C3 C4 C5 D1 D2

The GENMOD Procedure

Model Information
Data Set              WORK.TMP1
Distribution          User
Link Function         Log
Dependent Variable    yTweedie

Number of Observations Read    5000
Number of Observations Used    5000

Class Level Information
Class    Levels    Values
c1            4    0 1 2 3
c2            4    0 1 2 3
c3            4    0 1 2 3
c4            4    0 1 2 3
c5            4    0 1 2 3

Criteria For Assessing Goodness Of Fit
Criterion                   DF       Value  Value/DF
Deviance                  4982   2730.8750    0.5481
Scaled Deviance           4982   5581.9454    1.1204
Pearson Chi-Square        4982   2437.3615    0.4892
Scaled Pearson X2         4982   4982.0000    1.0000
Log Likelihood                  -2790.9727
Full Log Likelihood             -2790.9727
AIC (smaller is better)          5617.9454
AICC (smaller is better)         5618.0827
BIC (smaller is better)          5735.2549

Convergence Status: Algorithm converged.

Analysis Of Maximum Likelihood Parameter Estimates
Parameter  Level  DF  Estimate  Std Error  Wald 95% Confidence Limits  Wald Chi-Square  Pr > ChiSq
Intercept          1    0.7181     0.0405        0.6386        0.7975           313.63      <.0001
c1         0       1   -0.0347     0.0237       -0.0811        0.0116             2.15      0.1422
c1         1       1   -1.0170     0.0271       -1.0701       -0.9638          1405.32      <.0001
c1         2       1   -0.0091     0.0234       -0.0549        0.0367             0.15      0.6956
c1         3       0    0.0000     0.0000        0.0000        0.0000                .           .
c2         0       1    0.4966     0.0249        0.4478        0.5454           397.70      <.0001
c2         1       1    0.5139     0.0251        0.4648        0.5630           420.61      <.0001
c2         2       1   -0.0098     0.0264       -0.0615        0.0420             0.14      0.7118
c2         3       0    0.0000     0.0000        0.0000        0.0000                .           .
c3         0       1    0.0118     0.0255       -0.0382        0.0617             0.21      0.6439
c3         1       1    0.0154     0.0248       -0.0332        0.0640             0.38      0.5351
c3         2       1    0.0498     0.0251        0.0007        0.0990             3.95      0.0469
c3         3       0    0.0000     0.0000        0.0000        0.0000                .           .
c4         0       1    0.0060     0.0252       -0.0434        0.0553             0.06      0.8132
c4         1       1    0.0064     0.0248       -0.0423        0.0551             0.07      0.7977
c4         2       1   -0.0092     0.0248       -0.0578        0.0395             0.14      0.7113
c4         3       0    0.0000     0.0000        0.0000        0.0000                .           .
c5         0       1   -0.7479     0.0252       -0.7973       -0.6986           882.99      <.0001
c5         1       1   -0.4761     0.0245       -0.5240       -0.4281           378.73      <.0001
c5         2       1   -0.2523     0.0234       -0.2981       -0.2065           116.67      <.0001
c5         3       0    0.0000     0.0000        0.0000        0.0000                .           .
d1                 1    0.0618     0.0308        0.0013        0.1222             4.01      0.0452
d2                 1    0.0150     0.0303       -0.0445        0.0745             0.24      0.6212
Scale              0    0.6995     0.0000        0.6995        0.6995

Note: The scale parameter was estimated by the square root of Pearson's Chi-Square/DOF.

LR Statistics For Type 3 Analysis (Scaled)
Source  Num DF  Den DF  F Value  Pr > F  Chi-Square  Pr > ChiSq
c1           3    4982   607.21  <.0001     1821.63      <.0001
c2           3    4982   276.99  <.0001      830.98      <.0001
c3           3    4982     1.48  0.2168        4.45      0.2167
c4           3    4982     0.17  0.9148        0.52      0.9148
c5           3    4982   324.79  <.0001      974.37      <.0001
d1           1    4982     4.01  0.0452        4.01      0.0452
d2           1    4982     0.24  0.6212        0.24      0.6212

YTWEEDIE = c1 c2 c3 c5 d1 d2

The GENMOD Procedure

Model Information
Data Set              WORK.TMP1
Distribution          User
Link Function         Log
Dependent Variable    yTweedie

Number of Observations Read    5000
Number of Observations Used    5000

Class Level Information
Class    Levels    Values
c1            4    0 1 2 3
c2            4    0 1 2 3
c3            4    0 1 2 3
c4            4    0 1 2 3
c5            4    0 1 2 3

Criteria For Assessing Goodness Of Fit
Criterion                   DF       Value  Value/DF
Deviance                  4985   2731.1286    0.5479
Scaled Deviance           4985   5584.6187    1.1203
Pearson Chi-Square        4985   2437.8882    0.4890
Scaled Pearson X2         4985   4985.0000    1.0000
Log Likelihood                  -2792.3094
Full Log Likelihood             -2792.3094
AIC (smaller is better)          5614.6187
AICC (smaller is better)         5614.7150
BIC (smaller is better)          5712.3766

Convergence Status: Algorithm converged.

Analysis Of Maximum Likelihood Parameter Estimates
Parameter  Level  DF  Estimate  Std Error  Wald 95% Confidence Limits  Wald Chi-Square  Pr > ChiSq
Intercept          1    0.7189     0.0378        0.6448        0.7931           361.45      <.0001
c1         0       1   -0.0348     0.0236       -0.0811        0.0116             2.16      0.1412
c1         1       1   -1.0168     0.0271       -1.0700       -0.9637          1405.64      <.0001
c1         2       1   -0.0091     0.0234       -0.0549        0.0367             0.15      0.6959
c1         3       0    0.0000     0.0000        0.0000        0.0000                .           .
c2         0       1    0.4965     0.0249        0.4477        0.5453           397.93      <.0001
c2         1       1    0.5139     0.0250        0.4648        0.5630           421.34      <.0001
c2         2       1   -0.0100     0.0264       -0.0617        0.0417             0.14      0.7049
c2         3       0    0.0000     0.0000        0.0000        0.0000                .           .
c3         0       1    0.0120     0.0255       -0.0379        0.0619             0.22      0.6363
c3         1       1    0.0154     0.0248       -0.0331        0.0640             0.39      0.5335
c3         2       1    0.0500     0.0251        0.0009        0.0991             3.98      0.0461
c3         3       0    0.0000     0.0000        0.0000        0.0000                .           .
c5         0       1   -0.7483     0.0252       -0.7976       -0.6990           884.42      <.0001
c5         1       1   -0.4764     0.0245       -0.5243       -0.4285           379.62      <.0001
c5         2       1   -0.2523     0.0234       -0.2981       -0.2065           116.72      <.0001
c5         3       0    0.0000     0.0000        0.0000        0.0000                .           .
d1                 1    0.0619     0.0308        0.0015        0.1223             4.04      0.0445
d2                 1    0.0148     0.0303       -0.0446        0.0742             0.24      0.6257
Scale              0    0.6993     0.0000        0.6993        0.6993

Note: The scale parameter was estimated by the square root of Pearson's Chi-Square/DOF.

LR Statistics For Type 3 Analysis (Scaled)
Source  Num DF  Den DF  F Value  Pr > F  Chi-Square  Pr > ChiSq
c1           3    4985   607.35  <.0001     1822.04      <.0001
c2           3    4985   277.51  <.0001      832.53      <.0001
c3           3    4985     1.49  0.2151        4.47      0.2149
c5           3    4985   325.45  <.0001      976.35      <.0001
d1           1    4985     4.04  0.0446        4.04      0.0445
d2           1    4985     0.24  0.6257        0.24      0.6257

YTWEEDIE = c1 c2 c3 c5 d1

The GENMOD Procedure

Model Information
Data Set              WORK.TMP1
Distribution          User
Link Function         Log
Dependent Variable    yTweedie

Number of Observations Read    5000
Number of Observations Used    5000

Class Level Information
Class    Levels    Values
c1            4    0 1 2 3
c2            4    0 1 2 3
c3            4    0 1 2 3
c4            4    0 1 2 3
c5            4    0 1 2 3

Criteria For Assessing Goodness Of Fit
Criterion                   DF       Value  Value/DF
Deviance                  4986   2731.2449    0.5478
Scaled Deviance           4986   5584.3940    1.1200
Pearson Chi-Square        4986   2438.5792    0.4891
Scaled Pearson X2         4986   4986.0000    1.0000
Log Likelihood                  -2792.1970
Full Log Likelihood             -2792.1970
AIC (smaller is better)          5612.3940
AICC (smaller is better)         5612.4783
BIC (smaller is better)          5703.6347

Convergence Status: Algorithm converged.

Analysis Of Maximum Likelihood Parameter Estimates
Parameter  Level  DF  Estimate  Std Error  Wald 95% Confidence Limits  Wald Chi-Square  Pr > ChiSq
Intercept          1    0.7264     0.0346        0.6587        0.7942           441.89      <.0001
c1         0       1   -0.0352     0.0236       -0.0815        0.0111             2.22      0.1363
c1         1       1   -1.0172     0.0271       -1.0703       -0.9641          1407.64      <.0001
c1         2       1   -0.0091     0.0234       -0.0549        0.0366             0.15      0.6954
c1         3       0    0.0000     0.0000        0.0000        0.0000                .           .
c2         0       1    0.4963     0.0249        0.4475        0.5451           397.71      <.0001
c2         1       1    0.5137     0.0250        0.4646        0.5628           421.08      <.0001
c2         2       1   -0.0101     0.0264       -0.0618        0.0416             0.15      0.7020
c2         3       0    0.0000     0.0000        0.0000        0.0000                .           .
c3         0       1    0.0125     0.0254       -0.0374        0.0624             0.24      0.6234
c3         1       1    0.0155     0.0248       -0.0330        0.0641             0.39      0.5306
c3         2       1    0.0502     0.0251        0.0011        0.0993             4.01      0.0451
c3         3       0    0.0000     0.0000        0.0000        0.0000                .           .
c5         0       1   -0.7483     0.0252       -0.7977       -0.6990           884.56      <.0001
c5         1       1   -0.4767     0.0244       -0.5246       -0.4288           380.16      <.0001
c5         2       1   -0.2525     0.0234       -0.2983       -0.2068           116.92      <.0001
c5         3       0    0.0000     0.0000        0.0000        0.0000                .           .
d1                 1    0.0621     0.0308        0.0018        0.1225             4.07      0.0437
Scale              0    0.6993     0.0000        0.6993        0.6993

Note: The scale parameter was estimated by the square root of Pearson's Chi-Square/DOF.

LR Statistics For Type 3 Analysis (Scaled)
Source  Num DF  Den DF  F Value  Pr > F  Chi-Square  Pr > ChiSq
c1           3    4986   607.82  <.0001     1823.47      <.0001
c2           3    4986   277.41  <.0001      832.22      <.0001
c3           3    4986     1.50  0.2132        4.49      0.2131
c5           3    4986   325.55  <.0001      976.64      <.0001
d1           1    4986     4.07  0.0437        4.07      0.0437

YTWEEDIE = c1 c2 c5 d1

The GENMOD Procedure

Model Information
Data Set              WORK.TMP1
Distribution          User
Link Function         Log
Dependent Variable    yTweedie

Number of Observations Read    5000
Number of Observations Used    5000

Class Level Information
Class    Levels    Values
c1            4    0 1 2 3
c2            4    0 1 2 3
c3            4    0 1 2 3
c4            4    0 1 2 3
c5            4    0 1 2 3

Criteria For Assessing Goodness Of Fit
Criterion                   DF       Value  Value/DF
Deviance                  4989   2733.4414    0.5479
Scaled Deviance           4989   5582.3056    1.1189
Pearson Chi-Square        4989   2442.9224    0.4897
Scaled Pearson X2         4989   4989.0000    1.0000
Log Likelihood                  -2791.1528
Full Log Likelihood             -2791.1528
AIC (smaller is better)          5604.3056
AICC (smaller is better)         5604.3585
BIC (smaller is better)          5675.9947

Convergence Status: Algorithm converged.

Analysis Of Maximum Likelihood Parameter Estimates
Parameter  Level  DF  Estimate  Std Error  Wald 95% Confidence Limits  Wald Chi-Square  Pr > ChiSq
Intercept          1    0.7460     0.0309        0.6855        0.8066           583.67      <.0001
c1         0       1   -0.0346     0.0236       -0.0809        0.0116             2.15      0.1424
c1         1       1   -1.0170     0.0271       -1.0702       -0.9639          1406.61      <.0001
c1         2       1   -0.0092     0.0234       -0.0550        0.0366             0.15      0.6940
c1         3       0    0.0000     0.0000        0.0000        0.0000                .           .
c2         0       1    0.4973     0.0249        0.4485        0.5461           399.04      <.0001
c2         1       1    0.5144     0.0250        0.4653        0.5635           421.94      <.0001
c2         2       1   -0.0097     0.0264       -0.0615        0.0421             0.13      0.7136
c2         3       0    0.0000     0.0000        0.0000        0.0000                .           .
c5         0       1   -0.7486     0.0252       -0.7980       -0.6993           884.84      <.0001
c5         1       1   -0.4775     0.0245       -0.5254       -0.4295           381.13      <.0001
c5         2       1   -0.2528     0.0234       -0.2986       -0.2070           117.12      <.0001
c5         3       0    0.0000     0.0000        0.0000        0.0000                .           .
d1                 1    0.0620     0.0308        0.0016        0.1224             4.05      0.0442
Scale              0    0.6998     0.0000        0.6998        0.6998

Note: The scale parameter was estimated by the square root of Pearson's Chi-Square/DOF.

LR Statistics For Type 3 Analysis (Scaled)
Source  Num DF  Den DF  F Value  Pr > F  Chi-Square  Pr > ChiSq
c1           3    4989   607.22  <.0001     1821.67      <.0001
c2           3    4989   277.87  <.0001      833.62      <.0001
c5           3    4989   325.70  <.0001      977.11      <.0001
d1           1    4989     4.05  0.0442        4.05      0.0442

YTWEEDIE: model selection process

Data Set WORK.STEP_YTWEEDIE

Obs  parm  NumDF  DenDF  FValue   ProbF  Method     p1     p2     p3     p4
  1  c1        3   4989  607.22  <.0001  LR      <.001  <.001  <.001  <.001
  2  c2        3   4989  277.87  <.0001  LR      <.001  <.001  <.001  <.001
  3  c3        3   4986    1.50  0.2132  LR      0.217  0.215  0.213      .
  4  c4        3   4982    0.17  0.9148  LR      0.915      .      .      .
  5  c5        3   4989  325.70  <.0001  LR      <.001  <.001  <.001  <.001
  6  d1        1   4989    4.05  0.0442  LR      0.045  0.045  0.044  0.044
  7  d2        1   4985    0.24  0.6257  LR      0.621  0.626      .      .

Executing the two macros above creates two outputs:

  • A summary table of the model selection process
  • The whole model selection process step by step

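The summary table of the model selection process is the last table above. Columns p1 to p4 hold the Type 3 p-values at each step, and a missing value means the variable was no longer in the model. C4 (p = 0.915) is dropped after the first fit, so it is absent from the second step. D2 (p = 0.626) is absent from the third step, and C3 (p = 0.213) from the fourth. In the fourth step every remaining p-value is at or below 0.05, so the algorithm arrives at the final main-effects model with C1, C2, C5 and D1. The NumDF, DenDF, FValue and ProbF columns come from the last fit in which each variable appeared.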
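
The stopping rule behind this table is a comparison between the largest p-value and &SLSTAY. As far as I understand the macro language, a bare comparison in %IF or %UNTIL goes through %EVAL, which only does integer arithmetic and compares values with a decimal point as character strings, so %SYSEVALF is the safer choice for p-values. The sketch below is only an illustration; the macro name stop_now and the third p-value are my own assumptions.

```sas
/* Sketch: floating-point comparison of a p-value with SLSTAY.       */
/* stop_now is a hypothetical helper, not part of the macros above.  */
%macro stop_now(pmax=, slstay=0.05);
   %sysevalf(&pmax <= &slstay)
%mend stop_now;

/* Each call resolves to 1 (stop) or 0 (drop a variable and refit) */
%put step 1 (pmax=0.9148): %stop_now(pmax=0.9148);
%put step 4 (pmax=0.0442): %stop_now(pmax=0.0442);
%put tiny p (pmax=3.2E-8): %stop_now(pmax=3.2E-8);
```

The first two calls use the largest p-values from the first and fourth fits above. The third shows that a value in scientific notation is still compared numerically.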

Conclusion:

The lines above show how the variable selection algorithm eliminates the variables (C3, C4 and D2) that are not associated with the dependent variable yTweedie. Remember that the illustrative dataset was artificially created with this aim. In this example, therefore, the macro recovers the expected model.

The SAS macros %MdStmt and %MdSelect:

  • Perform a backward elimination variable selection process
  • Show the selected model in the last step of the elimination process, followed by a summary table of the process
  • Need around 15 minutes to get results with a dataset of one million observations and around 13 variables
  • Base the elimination criterion on the p-values of the Type 3 analysis
  • With small changes, can be used with a GENMOD procedure under Gamma, Inverse Gaussian, Log-Normal, Binomial, Gaussian, Poisson, Negative Binomial, Zero Inflated Poisson and Zero Inflated Negative Binomial error functions
  • Could be useful as a template to create Forward and Stepwise variable selection processes
  • Have one drawback: if the full model with all potential main factors does not converge, the backward elimination process does not work. That is one of the reasons why a forward option is interesting
  • Use the same model specification as the Tweedie macro used in the NAR project
  • Only admit main factors, so it is not possible to include interactions in the MODEL statement of the GENMOD procedure. To include an interaction, a new variable that encodes the interaction must be created
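
The last point can be sketched as follows. This is a minimal example under my own assumptions: the data set tmp1_demo, its three rows and the variable name c1_c2 are hypothetical, and the %MdSelect call is shown only as a comment because %MdStmt reads the data set tmp1.

```sas
/* Sketch: encode the c1 by c2 interaction as one class variable.   */
/* tmp1_demo, c1_c2 and the data rows are made up for illustration. */
data tmp1_demo;
   input c1 c2 yTweedie;
   length c1_c2 $ 12;
   c1_c2 = catx('_', c1, c2);   /* one level per (c1, c2) pair */
   datalines;
0 1 0.8
1 3 0.0
2 2 1.9
;
run;

/* After adding c1_c2 to tmp1 in the same way, it could be passed
   like any other class variable, for example:
   %MdSelect(var=yTweedie, intvar=c1_c2 c3 c4 c5 d1 d2,
             catvar=c1_c2 c3 c4 c5, slstay=0.05, power=1.5);       */
```

Note that the combined variable stands in for C1, C2 and their interaction together, so the selection process keeps or drops all three as a single term.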

References:

A detailed explanation of the algorithm and the code appears here:

Jing Su and Wei (Lisa) Lin, "Using Macro and ODS to Overcome Limitations of SAS® Procedures", Merck & Co., Inc., North Wales, PA

The dataset for the example comes from here:

http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_genmod_examples12.htm

I made some changes in order to get coherent results.