Purpose:
I wrote the following SAS code to automate variable selection in SAS. The macro performs an automated backward elimination variable selection process for PROC GENMOD. Note that the GENMOD procedure in SAS versions prior to 9.4 does not come with model selection options.
Introduction:
Users of SAS 9.2 and earlier versions may find that some "powerful" options are available in certain SAS procedures but not in others. For example, model selection options are available in PROC REG, LOGISTIC, PHREG, etc., but not in PROC GENMOD, CATMOD, MIXED, etc. This backward selection macro could be used with the procedures GENMOD, CATMOD, MIXED, GLIMMIX, etc.
Illustration:
The following SAS statements simulate 5000 observations from an underlying Tweedie generalized linear model (GLM), exploiting its connection with the compound Poisson distribution. A natural logarithm link function is assumed for modeling the response variable (yTweedie). There are five categorical variables (C1–C5), each with four numerical levels, and two continuous variables (D1 and D2). By design, two of the categorical variables, C3 and C4, and one of the two continuous variables, D2, have no effect on the response. The dispersion parameter is set to 0.5, and the power parameter is set to 1.5.
%let nObs   = 5000;
%let nClass = 5;
%let nLevs  = 4;
%let seed   = 1234;

data tmp1;
   array c{&nClass};
   keep c1-c&nClass yTweedie d1 d2;

   /* Tweedie parms */
   phi = 0.5;
   p   = 1.5;

   do i = 1 to &nObs;
      /* Categorical predictors with levels 0 to &nLevs-1 */
      do j = 1 to &nClass;
         c{j} = int(ranuni(1)*&nLevs);
      end;

      /* Continuous predictors, uniform on (0,1) */
      d1 = ranuni(&seed);
      d2 = ranuni(&seed);

      /* Linear predictor: c3, c4 and d2 do not enter */
      xBeta = 0.5*((c2<2) - 2*(c1=1) + 0.5*c&nClass + 0.05*d1);
      mu    = exp(xBeta);

      /* Poisson distributions parms */
      lambda = mu**(2-p)/(phi*(2-p));
      /* Gamma distribution parms */
      alpha = (2-p)/(p-1);
      gamma = phi*(p-1)*(mu**(p-1));

      /* Response: sum of rpoi gamma draws, exactly 0 when rpoi=0 */
      rpoi = ranpoi(&seed,lambda);
      if rpoi=0 then yTweedie=0;
      else do;
         yTweedie=0;
         do j=1 to rpoi;
            yTweedie = yTweedie + rangam(&seed,alpha);
         end;
         yTweedie = yTweedie * gamma;
      end;
      output;
   end;
run;
NOTE: The data set WORK.TMP1 has 5000 observations and 8 variables.
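The compound Poisson-gamma construction used in the DATA step can be checked outside SAS. The following is a minimal sketch in Python (chosen only so it can be run without SAS), assuming a fixed mean mu = 1 rather than the regression mean from the simulation. The function names are illustrative and not part of the original code. It uses the same formulas for lambda, alpha and the gamma scale, and the sample mean of the draws should land close to mu.

```python
import math
import random


def tweedie_cp_params(mu, phi, p):
    """Poisson rate, gamma shape and gamma scale for a Tweedie with 1 < p < 2."""
    lam = mu ** (2 - p) / (phi * (2 - p))
    alpha = (2 - p) / (p - 1)
    scale = phi * (p - 1) * mu ** (p - 1)
    return lam, alpha, scale


def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count


def tweedie_draw(rng, mu, phi, p):
    """Sum of a Poisson number of gamma draws; exactly 0 when the count is 0."""
    lam, alpha, scale = tweedie_cp_params(mu, phi, p)
    n = poisson_draw(rng, lam)
    return sum(rng.gammavariate(alpha, scale) for _ in range(n))


rng = random.Random(1234)
draws = [tweedie_draw(rng, mu=1.0, phi=0.5, p=1.5) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
print(round(sample_mean, 2))
```

With mu = 1, phi = 0.5 and p = 1.5 the parameters are lambda = 4, alpha = 1 and scale = 0.25, so the theoretical mean lambda * alpha * scale equals mu, and a share of the draws is exactly zero, as in the yTweedie summary statistics.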
The following code generates a basic exploratory data analysis for the dependent and independent variables. It produces summary statistics and a histogram for each continuous variable (yTweedie, d1 and d2), and a frequency table and frequency plot for each of the categorical variables c1-c5:
/* EDA */
%let var_char = yTweedie c1 c2 c3 c4 c5 d1 d2;
%put &var_char;

data var_char;
   set tmp1 (keep=&var_char);
run;

/* One row per variable name */
proc contents data=var_char varnum nodetails noprint
              out=var_char_names (keep=name);
run;

data var_char_names;
   set var_char_names;
   j = _n_;
run;

* Determine the number of variables to loop over;
data _NULL_;
   if 0 then set var_char_names nobs=n;
   call symputx('nrows',n);
   stop;
run;
%put &nrows;

%macro do_eda_uni;
   %do obs = 1 %to &nrows;
      /* Put the name of the &obs-th variable into &var */
      data _null_;
         set var_char_names;
         if j = &obs then call symputx("var", name);
      run;

      %if (%upcase(&var)=YTWEEDIE) or (%upcase(&var)=D1) or (%upcase(&var)=D2) %then %do;
         /* Continuous variables: summary statistics and histogram */
         ods graphics on;
         proc means data=tmp1 fw=12 printalltypes chartype
                    qmethod=os maxdec=2
                    mean min max mode range n nmiss
                    p1 p5 median p95 p99;
            var &var;
         run;
         title "histograms";
         proc univariate data=tmp1 noprint;
            var &var;
            histogram;
         run;
         ods graphics off;
      %end;
      %else %do;
         /* Categorical variables: frequency table and plot */
         ods graphics on;
         proc freq data=tmp1 order=internal;
            tables &var / scores=table plots(only)=freq;
         run;
         ods graphics off;
      %end;
   %end;
%mend do_eda_uni;
%do_eda_uni;
SAS Output
The FREQ Procedure
Table c1: One-Way Frequencies
c1 | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
0 | 1218 | 24.36 | 1218 | 24.36 |
1 | 1233 | 24.66 | 2451 | 49.02 |
2 | 1263 | 25.26 | 3714 | 74.28 |
3 | 1286 | 25.72 | 5000 | 100.00 |
Frequency plot for c1
Table c2: One-Way Frequencies
c2 | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
0 | 1247 | 24.94 | 1247 | 24.94 |
1 | 1222 | 24.44 | 2469 | 49.38 |
2 | 1262 | 25.24 | 3731 | 74.62 |
3 | 1269 | 25.38 | 5000 | 100.00 |
Frequency plot for c2
Table c3: One-Way Frequencies
c3 | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
0 | 1209 | 24.18 | 1209 | 24.18 |
1 | 1340 | 26.80 | 2549 | 50.98 |
2 | 1254 | 25.08 | 3803 | 76.06 |
3 | 1197 | 23.94 | 5000 | 100.00 |
Frequency plot for c3
Table c4: One-Way Frequencies
c4 | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
0 | 1210 | 24.20 | 1210 | 24.20 |
1 | 1263 | 25.26 | 2473 | 49.46 |
2 | 1292 | 25.84 | 3765 | 75.30 |
3 | 1235 | 24.70 | 5000 | 100.00 |
Frequency plot for c4
Table c5: One-Way Frequencies
c5 | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
0 | 1249 | 24.98 | 1249 | 24.98 |
1 | 1208 | 24.16 | 2457 | 49.14 |
2 | 1267 | 25.34 | 3724 | 74.48 |
3 | 1276 | 25.52 | 5000 | 100.00 |
Frequency plot for c5
The MEANS Procedure
Summary statistics, Analysis Variable: d1
Mean | Minimum | Maximum | Mode | Range | N | N Miss | 1st Pctl | 5th Pctl | Median | 95th Pctl | 99th Pctl |
---|---|---|---|---|---|---|---|---|---|---|---|
0.51 | 0.00 | 1.00 | . | 1.00 | 5000 | 0 | 0.01 | 0.06 | 0.52 | 0.95 | 0.99 |
The UNIVARIATE Procedure
Histogram for d1
The MEANS Procedure
Summary statistics, Analysis Variable: d2
Mean | Minimum | Maximum | Mode | Range | N | N Miss | 1st Pctl | 5th Pctl | Median | 95th Pctl | 99th Pctl |
---|---|---|---|---|---|---|---|---|---|---|---|
0.50 | 0.00 | 1.00 | . | 1.00 | 5000 | 0 | 0.01 | 0.05 | 0.49 | 0.95 | 0.99 |
The UNIVARIATE Procedure
Histogram for d2
The MEANS Procedure
Summary statistics, Analysis Variable: yTweedie
Mean | Minimum | Maximum | Mode | Range | N | N Miss | 1st Pctl | 5th Pctl | Median | 95th Pctl | 99th Pctl |
---|---|---|---|---|---|---|---|---|---|---|---|
1.72 | 0.00 | 12.78 | 0.00 | 12.78 | 5000 | 0 | 0.00 | 0.16 | 1.37 | 4.54 | 6.40 |
The UNIVARIATE Procedure
Histogram for yTweedie
The next lines contain the two SAS macros for the backward elimination selection process using a Tweedie error function. The first macro, %MdStmt, is a stand-alone macro. The main macro, %MdSelect, consists of multiple calls to the macro %MdStmt.
/* Variable Selection Macro: Backwards elimination */
%let p=1.5;
options mlogic;

%macro MdStmt(
    resvar =   /* response variable */
   ,expvar =   /* list of explanatory variables, separated by ' ' */
   ,clsvar =   /* classification variables in the CLASS statement separated by ' ' */
   ,p      =   /* power parameter of the Tweedie distribution */
   );
   /* Save the Type 3 tests in the data set PVAL, renaming SOURCE to PARM */
   ods output Type3=pval(rename=(source=parm));
   proc genmod data=tmp1 NAMELEN=50;
      /* Tweedie unit deviance, with a separate branch for zero responses */
      if _resp_ > 0 then
         d = 2*(_resp_*(_resp_**(1-&p)-_mean_**(1-&p))/
             (1-&p)-(_resp_**(2-&p)-_mean_**(2-&p))/(2-&p));
      else d = 2* _mean_**(2-&p)/(2-&p);
      /* User-defined variance function and deviance */
      variance var = _mean_**&p;
      deviance dev = d;
      class &clsvar;
      model &resvar = &expvar /link=log type3 scale=pearson;
      *scwgt expos;
      title "&resvar = &expvar";
   run;
   ods output close;
%mend MdStmt;
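The programming statements inside PROC GENMOD define the Tweedie variance function and unit deviance by hand. As a rough cross-check of that formula, here is the same expression as a small Python sketch (Python only so it can be run without SAS). The function names are illustrative, and 1 < p < 2 is assumed, as in the example.

```python
def tweedie_variance(mu, p):
    """Variance function V(mu) = mu**p."""
    return mu ** p


def tweedie_unit_deviance(y, mu, p):
    """Unit deviance for 1 < p < 2, with the same two branches as the GENMOD code."""
    if y > 0:
        return 2 * (
            y * (y ** (1 - p) - mu ** (1 - p)) / (1 - p)
            - (y ** (2 - p) - mu ** (2 - p)) / (2 - p)
        )
    # y == 0: only the mean term remains
    return 2 * mu ** (2 - p) / (2 - p)


# The deviance is zero when the fitted mean equals the response
print(tweedie_unit_deviance(2.5, 2.5, 1.5))
# A zero response still contributes a positive deviance
print(tweedie_unit_deviance(0.0, 1.0, 1.5))
```

The separate branch for a zero response matters because y**(1-p) is undefined at y = 0 when p > 1, which is why the SAS code tests `_resp_ > 0` first.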
There are five macro parameters in the macro %MdSelect: &VAR, &INTVAR, &CATVAR, &SLSTAY and &POWER:
- &VAR is the response variable, which is passed into &RESVAR when calling the macro %MdStmt;
- &INTVAR includes all the potential explanatory variables, which are passed into &EXPVAR in %MdStmt for the first call only;
- &CATVAR contains all the categorical explanatory variables, which are passed into &CLSVAR in %MdStmt;
- &SLSTAY is the criterion for removing a variable;
- and &POWER is the power parameter of the Tweedie distribution.
%macro MdSelect(
    var=      /* response variable */
   ,intvar=   /* initial explanatory variables for full model */
   ,catvar=   /* categorical explanatory variables */
   ,slstay=   /* criterion for removing variable */
   ,power=    /* power parameter of the Tweedie distribution */
   );
   %local i pmax varlist;
   %let var=%upcase(&var);
   %let intvar=%upcase(&intvar);
   %let catvar=%upcase(&catvar);
   %let power =&power;

   %*-------------------------------------------------------------------------*;
   %* Create empty dataset "step_&var" with only one column "parm". It will   *;
   %* be merged with "pval" from PROC GENMOD by "parm". The length matches    *;
   %* NAMELEN=50 in %MdStmt so that long variable names are not truncated     *;
   %*-------------------------------------------------------------------------*;
   proc sql;
      create table step_&var (parm char(50));
   quit;

   %*------------------------------------------------------------------------------*;
   %* %do %until performs multivariate backward model selection:                   *;
   %* In each iteration:                                                           *;
   %* 1. Fit the GENMOD model through %MdStmt                                      *;
   %* 2. Update the dataset "step_&var"                                            *;
   %* 3. Create &pmax as the maximum p-value, and &varlist as the list of          *;
   %*    variables without the one with the max p-value                            *;
   %* 4. Check whether the max p-value <= &SLSTAY                                  *;
   %* 5. If NO, then eliminate the variable with max p-value, repeat step 1 to 4.  *;
   %*    If YES, the loop stops                                                    *;
   %* %sysevalf forces a numeric comparison. A bare comparison of decimal values   *;
   %* is evaluated as text and can fail for p-values such as 1E-10                 *;
   %*------------------------------------------------------------------------------*;
   %let i=1;
   %do %until (%sysevalf(&pmax <= &slstay));
      %if &i = 1 %then %do;
         %MdStmt(resvar=&var, expvar=&intvar, clsvar=&catvar, p=&power) %*initial model;
      %end;
      %else %do;
         %MdStmt(resvar=&var, expvar=&varlist, clsvar=&catvar, p=&power) %*reduced model;
      %end;

      proc sort data=step_&var; by parm; run;
      proc sort data=pval; by parm; run;

      /* Append the p-values of this step as column p&i */
      data step_&var;
         merge step_&var pval;
         by parm;
         p&i=put(ProbChiSq, pvalue6.3);
         drop ProbChiSq ChiSq DF;
      run;

      proc sql noprint;
         select max(ProbChiSq) into :pmax
            from pval;
         select distinct parm into :varlist separated by ' '
            from pval
            having ProbChiSq^=max(ProbChiSq);
      quit;
      %let i=%eval(&i+1);
   %end;

   proc print data=step_&var;
      title "&var: model selection process";
   run;
%mend MdSelect;

%MdSelect(var=yTweedie, intvar=c1 c2 c3 c4 c5 d1 d2, catvar=c1 c2 c3 c4 c5, slstay=0.05, power=1.5);
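The control flow of %MdSelect can be summarized independently of SAS. The sketch below, in Python for easy experimentation, is only an outline of the loop. The callback `type3_pvalues` is a hypothetical stand-in for one %MdStmt call, and the stub used here keeps the p-values of the first fit fixed instead of refitting the model (values written as 0.0001 stand for "<.0001" in the first Type 3 table of the output).

```python
def backward_eliminate(candidates, type3_pvalues, slstay=0.05):
    """Drop the term with the largest p-value until all are <= slstay.

    type3_pvalues(terms) must return {term: p_value} for the current model.
    """
    terms = list(candidates)
    history = []
    while terms:
        pvals = type3_pvalues(terms)
        history.append(dict(pvals))
        worst = max(pvals, key=pvals.get)
        # Numeric comparison against the stay threshold
        if pvals[worst] <= slstay:
            break
        terms.remove(worst)
    return terms, history


# Stub in place of a model fit: p-values of the first fit, held fixed
FIRST_FIT = {"c1": 0.0001, "c2": 0.0001, "c3": 0.2167, "c4": 0.9148,
             "c5": 0.0001, "d1": 0.0452, "d2": 0.6212}


def fixed_pvalues(terms):
    return {term: FIRST_FIT[term] for term in terms}


selected, history = backward_eliminate(list(FIRST_FIT), fixed_pvalues)
print(selected)      # terms kept in the final model
print(len(history))  # number of fits performed
```

In the real macro the p-values change slightly after each refit, so the stub only mimics the order of removal, not the exact values.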
SAS Output
The GENMOD Procedure
Model Information
Model Information | |
---|---|
Data Set | WORK.TMP1 |
Distribution | User |
Link Function | Log |
Dependent Variable | yTweedie |
Number of Observations
Number of Observations Read | 5000 |
---|---|
Number of Observations Used | 5000 |
Class Level Information
Class Level Information | ||
---|---|---|
Class | Levels | Values |
c1 | 4 | 0 1 2 3 |
c2 | 4 | 0 1 2 3 |
c3 | 4 | 0 1 2 3 |
c4 | 4 | 0 1 2 3 |
c5 | 4 | 0 1 2 3 |
Criteria For Assessing Goodness Of Fit
Criteria For Assessing Goodness Of Fit | |||
---|---|---|---|
Criterion | DF | Value | Value/DF |
Deviance | 4982 | 2730.8750 | 0.5481 |
Scaled Deviance | 4982 | 5581.9454 | 1.1204 |
Pearson Chi-Square | 4982 | 2437.3615 | 0.4892 |
Scaled Pearson X2 | 4982 | 4982.0000 | 1.0000 |
Log Likelihood | -2790.9727 | ||
Full Log Likelihood | -2790.9727 | ||
AIC (smaller is better) | 5617.9454 | ||
AICC (smaller is better) | 5618.0827 | ||
BIC (smaller is better) | 5735.2549 |
Convergence Status
Algorithm converged. |
Analysis Of Parameter Estimates
Analysis Of Maximum Likelihood Parameter Estimates | ||||||||
---|---|---|---|---|---|---|---|---|
Parameter | | DF | Estimate | Standard Error | Wald 95% Confidence Limits (lower) | Wald 95% Confidence Limits (upper) | Wald Chi-Square | Pr > ChiSq |
Intercept | 1 | 0.7181 | 0.0405 | 0.6386 | 0.7975 | 313.63 | <.0001 | |
c1 | 0 | 1 | -0.0347 | 0.0237 | -0.0811 | 0.0116 | 2.15 | 0.1422 |
c1 | 1 | 1 | -1.0170 | 0.0271 | -1.0701 | -0.9638 | 1405.32 | <.0001 |
c1 | 2 | 1 | -0.0091 | 0.0234 | -0.0549 | 0.0367 | 0.15 | 0.6956 |
c1 | 3 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
c2 | 0 | 1 | 0.4966 | 0.0249 | 0.4478 | 0.5454 | 397.70 | <.0001 |
c2 | 1 | 1 | 0.5139 | 0.0251 | 0.4648 | 0.5630 | 420.61 | <.0001 |
c2 | 2 | 1 | -0.0098 | 0.0264 | -0.0615 | 0.0420 | 0.14 | 0.7118 |
c2 | 3 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
c3 | 0 | 1 | 0.0118 | 0.0255 | -0.0382 | 0.0617 | 0.21 | 0.6439 |
c3 | 1 | 1 | 0.0154 | 0.0248 | -0.0332 | 0.0640 | 0.38 | 0.5351 |
c3 | 2 | 1 | 0.0498 | 0.0251 | 0.0007 | 0.0990 | 3.95 | 0.0469 |
c3 | 3 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
c4 | 0 | 1 | 0.0060 | 0.0252 | -0.0434 | 0.0553 | 0.06 | 0.8132 |
c4 | 1 | 1 | 0.0064 | 0.0248 | -0.0423 | 0.0551 | 0.07 | 0.7977 |
c4 | 2 | 1 | -0.0092 | 0.0248 | -0.0578 | 0.0395 | 0.14 | 0.7113 |
c4 | 3 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
c5 | 0 | 1 | -0.7479 | 0.0252 | -0.7973 | -0.6986 | 882.99 | <.0001 |
c5 | 1 | 1 | -0.4761 | 0.0245 | -0.5240 | -0.4281 | 378.73 | <.0001 |
c5 | 2 | 1 | -0.2523 | 0.0234 | -0.2981 | -0.2065 | 116.67 | <.0001 |
c5 | 3 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
d1 | 1 | 0.0618 | 0.0308 | 0.0013 | 0.1222 | 4.01 | 0.0452 | |
d2 | 1 | 0.0150 | 0.0303 | -0.0445 | 0.0745 | 0.24 | 0.6212 | |
Scale | 0 | 0.6995 | 0.0000 | 0.6995 | 0.6995 |
The scale parameter was estimated by the square root of Pearson's Chi-Square/DOF.
LR Statistics For Type 3 Analysis - Scaled
LR Statistics For Type 3 Analysis | ||||||
---|---|---|---|---|---|---|
Source | Num DF | Den DF | F Value | Pr > F | Chi-Square | Pr > ChiSq |
c1 | 3 | 4982 | 607.21 | <.0001 | 1821.63 | <.0001 |
c2 | 3 | 4982 | 276.99 | <.0001 | 830.98 | <.0001 |
c3 | 3 | 4982 | 1.48 | 0.2168 | 4.45 | 0.2167 |
c4 | 3 | 4982 | 0.17 | 0.9148 | 0.52 | 0.9148 |
c5 | 3 | 4982 | 324.79 | <.0001 | 974.37 | <.0001 |
d1 | 1 | 4982 | 4.01 | 0.0452 | 4.01 | 0.0452 |
d2 | 1 | 4982 | 0.24 | 0.6212 | 0.24 | 0.6212 |
The GENMOD Procedure
Model Information
Model Information | |
---|---|
Data Set | WORK.TMP1 |
Distribution | User |
Link Function | Log |
Dependent Variable | yTweedie |
Number of Observations
Number of Observations Read | 5000 |
---|---|
Number of Observations Used | 5000 |
Class Level Information
Class Level Information | ||
---|---|---|
Class | Levels | Values |
c1 | 4 | 0 1 2 3 |
c2 | 4 | 0 1 2 3 |
c3 | 4 | 0 1 2 3 |
c4 | 4 | 0 1 2 3 |
c5 | 4 | 0 1 2 3 |
Criteria For Assessing Goodness Of Fit
Criteria For Assessing Goodness Of Fit | |||
---|---|---|---|
Criterion | DF | Value | Value/DF |
Deviance | 4985 | 2731.1286 | 0.5479 |
Scaled Deviance | 4985 | 5584.6187 | 1.1203 |
Pearson Chi-Square | 4985 | 2437.8882 | 0.4890 |
Scaled Pearson X2 | 4985 | 4985.0000 | 1.0000 |
Log Likelihood | -2792.3094 | ||
Full Log Likelihood | -2792.3094 | ||
AIC (smaller is better) | 5614.6187 | ||
AICC (smaller is better) | 5614.7150 | ||
BIC (smaller is better) | 5712.3766 |
Convergence Status
Algorithm converged. |
Analysis Of Parameter Estimates
Analysis Of Maximum Likelihood Parameter Estimates | ||||||||
---|---|---|---|---|---|---|---|---|
Parameter | | DF | Estimate | Standard Error | Wald 95% Confidence Limits (lower) | Wald 95% Confidence Limits (upper) | Wald Chi-Square | Pr > ChiSq |
Intercept | 1 | 0.7189 | 0.0378 | 0.6448 | 0.7931 | 361.45 | <.0001 | |
c1 | 0 | 1 | -0.0348 | 0.0236 | -0.0811 | 0.0116 | 2.16 | 0.1412 |
c1 | 1 | 1 | -1.0168 | 0.0271 | -1.0700 | -0.9637 | 1405.64 | <.0001 |
c1 | 2 | 1 | -0.0091 | 0.0234 | -0.0549 | 0.0367 | 0.15 | 0.6959 |
c1 | 3 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
c2 | 0 | 1 | 0.4965 | 0.0249 | 0.4477 | 0.5453 | 397.93 | <.0001 |
c2 | 1 | 1 | 0.5139 | 0.0250 | 0.4648 | 0.5630 | 421.34 | <.0001 |
c2 | 2 | 1 | -0.0100 | 0.0264 | -0.0617 | 0.0417 | 0.14 | 0.7049 |
c2 | 3 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
c3 | 0 | 1 | 0.0120 | 0.0255 | -0.0379 | 0.0619 | 0.22 | 0.6363 |
c3 | 1 | 1 | 0.0154 | 0.0248 | -0.0331 | 0.0640 | 0.39 | 0.5335 |
c3 | 2 | 1 | 0.0500 | 0.0251 | 0.0009 | 0.0991 | 3.98 | 0.0461 |
c3 | 3 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
c5 | 0 | 1 | -0.7483 | 0.0252 | -0.7976 | -0.6990 | 884.42 | <.0001 |
c5 | 1 | 1 | -0.4764 | 0.0245 | -0.5243 | -0.4285 | 379.62 | <.0001 |
c5 | 2 | 1 | -0.2523 | 0.0234 | -0.2981 | -0.2065 | 116.72 | <.0001 |
c5 | 3 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
d1 | 1 | 0.0619 | 0.0308 | 0.0015 | 0.1223 | 4.04 | 0.0445 | |
d2 | 1 | 0.0148 | 0.0303 | -0.0446 | 0.0742 | 0.24 | 0.6257 | |
Scale | 0 | 0.6993 | 0.0000 | 0.6993 | 0.6993 |
The scale parameter was estimated by the square root of Pearson's Chi-Square/DOF.
LR Statistics For Type 3 Analysis - Scaled
LR Statistics For Type 3 Analysis | ||||||
---|---|---|---|---|---|---|
Source | Num DF | Den DF | F Value | Pr > F | Chi-Square | Pr > ChiSq |
c1 | 3 | 4985 | 607.35 | <.0001 | 1822.04 | <.0001 |
c2 | 3 | 4985 | 277.51 | <.0001 | 832.53 | <.0001 |
c3 | 3 | 4985 | 1.49 | 0.2151 | 4.47 | 0.2149 |
c5 | 3 | 4985 | 325.45 | <.0001 | 976.35 | <.0001 |
d1 | 1 | 4985 | 4.04 | 0.0446 | 4.04 | 0.0445 |
d2 | 1 | 4985 | 0.24 | 0.6257 | 0.24 | 0.6257 |
The GENMOD Procedure
Model Information
Model Information | |
---|---|
Data Set | WORK.TMP1 |
Distribution | User |
Link Function | Log |
Dependent Variable | yTweedie |
Number of Observations
Number of Observations Read | 5000 |
---|---|
Number of Observations Used | 5000 |
Class Level Information
Class Level Information | ||
---|---|---|
Class | Levels | Values |
c1 | 4 | 0 1 2 3 |
c2 | 4 | 0 1 2 3 |
c3 | 4 | 0 1 2 3 |
c4 | 4 | 0 1 2 3 |
c5 | 4 | 0 1 2 3 |
Criteria For Assessing Goodness Of Fit
Criteria For Assessing Goodness Of Fit | |||
---|---|---|---|
Criterion | DF | Value | Value/DF |
Deviance | 4986 | 2731.2449 | 0.5478 |
Scaled Deviance | 4986 | 5584.3940 | 1.1200 |
Pearson Chi-Square | 4986 | 2438.5792 | 0.4891 |
Scaled Pearson X2 | 4986 | 4986.0000 | 1.0000 |
Log Likelihood | -2792.1970 | ||
Full Log Likelihood | -2792.1970 | ||
AIC (smaller is better) | 5612.3940 | ||
AICC (smaller is better) | 5612.4783 | ||
BIC (smaller is better) | 5703.6347 |
Convergence Status
Algorithm converged. |
Analysis Of Parameter Estimates
Analysis Of Maximum Likelihood Parameter Estimates | ||||||||
---|---|---|---|---|---|---|---|---|
Parameter | | DF | Estimate | Standard Error | Wald 95% Confidence Limits (lower) | Wald 95% Confidence Limits (upper) | Wald Chi-Square | Pr > ChiSq |
Intercept | 1 | 0.7264 | 0.0346 | 0.6587 | 0.7942 | 441.89 | <.0001 | |
c1 | 0 | 1 | -0.0352 | 0.0236 | -0.0815 | 0.0111 | 2.22 | 0.1363 |
c1 | 1 | 1 | -1.0172 | 0.0271 | -1.0703 | -0.9641 | 1407.64 | <.0001 |
c1 | 2 | 1 | -0.0091 | 0.0234 | -0.0549 | 0.0366 | 0.15 | 0.6954 |
c1 | 3 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
c2 | 0 | 1 | 0.4963 | 0.0249 | 0.4475 | 0.5451 | 397.71 | <.0001 |
c2 | 1 | 1 | 0.5137 | 0.0250 | 0.4646 | 0.5628 | 421.08 | <.0001 |
c2 | 2 | 1 | -0.0101 | 0.0264 | -0.0618 | 0.0416 | 0.15 | 0.7020 |
c2 | 3 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
c3 | 0 | 1 | 0.0125 | 0.0254 | -0.0374 | 0.0624 | 0.24 | 0.6234 |
c3 | 1 | 1 | 0.0155 | 0.0248 | -0.0330 | 0.0641 | 0.39 | 0.5306 |
c3 | 2 | 1 | 0.0502 | 0.0251 | 0.0011 | 0.0993 | 4.01 | 0.0451 |
c3 | 3 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
c5 | 0 | 1 | -0.7483 | 0.0252 | -0.7977 | -0.6990 | 884.56 | <.0001 |
c5 | 1 | 1 | -0.4767 | 0.0244 | -0.5246 | -0.4288 | 380.16 | <.0001 |
c5 | 2 | 1 | -0.2525 | 0.0234 | -0.2983 | -0.2068 | 116.92 | <.0001 |
c5 | 3 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
d1 | 1 | 0.0621 | 0.0308 | 0.0018 | 0.1225 | 4.07 | 0.0437 | |
Scale | 0 | 0.6993 | 0.0000 | 0.6993 | 0.6993 |
The scale parameter was estimated by the square root of Pearson's Chi-Square/DOF.
LR Statistics For Type 3 Analysis - Scaled
LR Statistics For Type 3 Analysis | ||||||
---|---|---|---|---|---|---|
Source | Num DF | Den DF | F Value | Pr > F | Chi-Square | Pr > ChiSq |
c1 | 3 | 4986 | 607.82 | <.0001 | 1823.47 | <.0001 |
c2 | 3 | 4986 | 277.41 | <.0001 | 832.22 | <.0001 |
c3 | 3 | 4986 | 1.50 | 0.2132 | 4.49 | 0.2131 |
c5 | 3 | 4986 | 325.55 | <.0001 | 976.64 | <.0001 |
d1 | 1 | 4986 | 4.07 | 0.0437 | 4.07 | 0.0437 |
The GENMOD Procedure
Model Information
Model Information | |
---|---|
Data Set | WORK.TMP1 |
Distribution | User |
Link Function | Log |
Dependent Variable | yTweedie |
Number of Observations
Number of Observations Read | 5000 |
---|---|
Number of Observations Used | 5000 |
Class Level Information
Class Level Information | ||
---|---|---|
Class | Levels | Values |
c1 | 4 | 0 1 2 3 |
c2 | 4 | 0 1 2 3 |
c3 | 4 | 0 1 2 3 |
c4 | 4 | 0 1 2 3 |
c5 | 4 | 0 1 2 3 |
Criteria For Assessing Goodness Of Fit
Criteria For Assessing Goodness Of Fit | |||
---|---|---|---|
Criterion | DF | Value | Value/DF |
Deviance | 4989 | 2733.4414 | 0.5479 |
Scaled Deviance | 4989 | 5582.3056 | 1.1189 |
Pearson Chi-Square | 4989 | 2442.9224 | 0.4897 |
Scaled Pearson X2 | 4989 | 4989.0000 | 1.0000 |
Log Likelihood | -2791.1528 | ||
Full Log Likelihood | -2791.1528 | ||
AIC (smaller is better) | 5604.3056 | ||
AICC (smaller is better) | 5604.3585 | ||
BIC (smaller is better) | 5675.9947 |
Convergence Status
Algorithm converged. |
Analysis Of Parameter Estimates
Analysis Of Maximum Likelihood Parameter Estimates | ||||||||
---|---|---|---|---|---|---|---|---|
Parameter | | DF | Estimate | Standard Error | Wald 95% Confidence Limits (lower) | Wald 95% Confidence Limits (upper) | Wald Chi-Square | Pr > ChiSq |
Intercept | 1 | 0.7460 | 0.0309 | 0.6855 | 0.8066 | 583.67 | <.0001 | |
c1 | 0 | 1 | -0.0346 | 0.0236 | -0.0809 | 0.0116 | 2.15 | 0.1424 |
c1 | 1 | 1 | -1.0170 | 0.0271 | -1.0702 | -0.9639 | 1406.61 | <.0001 |
c1 | 2 | 1 | -0.0092 | 0.0234 | -0.0550 | 0.0366 | 0.15 | 0.6940 |
c1 | 3 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
c2 | 0 | 1 | 0.4973 | 0.0249 | 0.4485 | 0.5461 | 399.04 | <.0001 |
c2 | 1 | 1 | 0.5144 | 0.0250 | 0.4653 | 0.5635 | 421.94 | <.0001 |
c2 | 2 | 1 | -0.0097 | 0.0264 | -0.0615 | 0.0421 | 0.13 | 0.7136 |
c2 | 3 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
c5 | 0 | 1 | -0.7486 | 0.0252 | -0.7980 | -0.6993 | 884.84 | <.0001 |
c5 | 1 | 1 | -0.4775 | 0.0245 | -0.5254 | -0.4295 | 381.13 | <.0001 |
c5 | 2 | 1 | -0.2528 | 0.0234 | -0.2986 | -0.2070 | 117.12 | <.0001 |
c5 | 3 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
d1 | 1 | 0.0620 | 0.0308 | 0.0016 | 0.1224 | 4.05 | 0.0442 | |
Scale | 0 | 0.6998 | 0.0000 | 0.6998 | 0.6998 |
The scale parameter was estimated by the square root of Pearson's Chi-Square/DOF.
LR Statistics For Type 3 Analysis - Scaled
LR Statistics For Type 3 Analysis | ||||||
---|---|---|---|---|---|---|
Source | Num DF | Den DF | F Value | Pr > F | Chi-Square | Pr > ChiSq |
c1 | 3 | 4989 | 607.22 | <.0001 | 1821.67 | <.0001 |
c2 | 3 | 4989 | 277.87 | <.0001 | 833.62 | <.0001 |
c5 | 3 | 4989 | 325.70 | <.0001 | 977.11 | <.0001 |
d1 | 1 | 4989 | 4.05 | 0.0442 | 4.05 | 0.0442 |
The PRINT Procedure
Data Set WORK.STEP_YTWEEDIE
Obs | parm | NumDF | DenDF | FValue | ProbF | Method | p1 | p2 | p3 | p4 |
---|---|---|---|---|---|---|---|---|---|---|
1 | c1 | 3 | 4989 | 607.22 | <.0001 | LR | <.001 | <.001 | <.001 | <.001 |
2 | c2 | 3 | 4989 | 277.87 | <.0001 | LR | <.001 | <.001 | <.001 | <.001 |
3 | c3 | 3 | 4986 | 1.50 | 0.2132 | LR | 0.217 | 0.215 | 0.213 | . |
4 | c4 | 3 | 4982 | 0.17 | 0.9148 | LR | 0.915 | . | . | . |
5 | c5 | 3 | 4989 | 325.70 | <.0001 | LR | <.001 | <.001 | <.001 | <.001 |
6 | d1 | 1 | 4989 | 4.05 | 0.0442 | LR | 0.045 | 0.045 | 0.044 | 0.044 |
7 | d2 | 1 | 4985 | 0.24 | 0.6257 | LR | 0.621 | 0.626 | . | . |
Running the two macros above produces two outputs:
- A summary table of the model selection process
- The whole model selection process step by step
The summary table of the model selection process is the last table above. Columns p1 to p4 hold the Type 3 p-values from each successive fit. The variable C4 has the largest p-value in the first fit (0.915), so it is eliminated and is absent from the second fit. The variable D2 (0.626) is eliminated before the third fit, and the variable C3 (0.213) before the fourth. In the fourth fit the largest p-value is 0.044 (D1), which is below the 0.05 threshold, so the algorithm stops and arrives at the final main-effects model with C1, C2, C5 and D1.
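The elimination order can be read mechanically from the summary table: a variable that has a p-value in columns p1 to pk and a missing value (".") afterwards was removed after the k-th fit. The following Python sketch applies that rule to the p1-p4 columns copied from the table above. The dictionary layout is just an illustration, not something the macro produces.

```python
# p1..p4 columns of WORK.STEP_YTWEEDIE, copied from the PROC PRINT output
summary = {
    "c1": ["<.001", "<.001", "<.001", "<.001"],
    "c2": ["<.001", "<.001", "<.001", "<.001"],
    "c3": ["0.217", "0.215", "0.213", "."],
    "c4": ["0.915", ".", ".", "."],
    "c5": ["<.001", "<.001", "<.001", "<.001"],
    "d1": ["0.045", "0.045", "0.044", "0.044"],
    "d2": ["0.621", "0.626", ".", "."],
}
n_fits = 4

# Number of fits in which each variable was still in the model
last_fit = {parm: sum(value != "." for value in cols)
            for parm, cols in summary.items()}

# Variables removed, ordered by the fit after which they were dropped
removed = sorted((fit, parm) for parm, fit in last_fit.items() if fit < n_fits)
final_model = [parm for parm, fit in last_fit.items() if fit == n_fits]

print(removed)
print(final_model)
```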
Conclusion:
The output above shows how the variable selection algorithm eliminates the variables (C3, C4 and D2) that are not associated with the dependent variable yTweedie. Recall that the illustrative dataset was artificially created with this aim, so the macro recovers the intended model in this example.
The SAS macros %MdStmt and %MdSelect:
- Perform a backward elimination variable selection process
- The last step in the elimination process shows the selected model and a summary table of the elimination process
- The macro needs around 15 minutes to get results with a dataset of one million observations and around 13 variables
- The elimination criterion is based on the p-values of the Type 3 analysis
- With small changes the macro is useful with a GENMOD procedure under Gamma, Inverse Gaussian, Log-Normal, Binomial, Gaussian, Poisson, Negative Binomial, Zero-Inflated Poisson and Zero-Inflated Negative Binomial error functions
- This macro could be useful as a template to create forward and stepwise variable selection processes
- One drawback of the backward elimination process is that if the full model with all potential main factors does not converge, the macro does not work. That is one reason why a forward option is interesting
- The specification of the model is the same as that of the Tweedie macro used in the NAR project
- This macro only admits main factors, so it is not possible to include interactions in the MODEL statement of the GENMOD procedure. To include an interaction, it is necessary to create a new variable containing the interaction
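As an illustration of the last point, an interaction between two categorical variables can be encoded as a single new categorical variable whose levels are the combinations of the original levels. The Python sketch below shows the idea for C1 and C2. The names `interaction_level` and `c1_c2` are hypothetical, and in SAS the equivalent step would be done in a DATA step before calling the macro.

```python
from itertools import product


def interaction_level(*levels):
    """Combine the levels of several categorical variables into one label."""
    return "_".join(str(level) for level in levels)


# All combinations of the four levels of c1 and c2 used in the example
rows = [{"c1": a, "c2": b} for a, b in product(range(4), repeat=2)]
for row in rows:
    row["c1_c2"] = interaction_level(row["c1"], row["c2"])

print(rows[0])
print(len({row["c1_c2"] for row in rows}))
```

A combined variable of this kind spans the main effects of C1 and C2 as well as their interaction, so entering it alongside C1 and C2 makes some parameters redundant.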
References:
A detailed explanation of the algorithm and the code appears here:
The dataset for the example comes from here:
http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_genmod_examples12.htm
I made some changes in order to get coherent results.