Summary
The purpose of this notebook is illustrate how apply bootstrap a confidence interval for the slope of the MATH
independent variable in a multivariable regression environment using the GENMOD
procedure in SAS.
This technique is useful in an inference problem where the interpretation of the Beta parameter is critical for the business problem and we are not really sure if our model follows all the GLM assumptions.
This technique only is practicable with models that fit very fast.
Introduction
Apply bootstrapping to check the inference (p-value) of the coefficient estimates of the one or all explanatory variables. In this expample only the MATH
Beta parameter. The aim of this consists in determine if the beta estimator of the one or all explanatory variables are statistically significant. Therefore, if the assumptions of our GLM model are correct the p-value we get in the output of the GLM model should be correct and our work finish here. But, if one or several assumptions of the GLM are not correct the inference is not valid. The bootstrap approach does not rely on any of these assumptions, and so it is likely giving a more accurate estimate of the coefficient estimates (and of the standard errors). I think it worth checking if the bootstrap confidence interval are the same of the GLM Wald confidence interval.
Let's build up a SAS datasets to illustrate the bootstrapping technique:
options nosource;
504 ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
505
506 options nosource;
data HSB2 ;
infile datalines dsd
;
input
id
female
race
ses
schtyp
prog
read
write
math
science
socst
;
datalines;
70,0,4,1,1,1,57,52,41,47,57
121,1,4,2,1,3,68,59,53,63,61
86,0,4,3,1,1,44,33,54,58,31
141,0,4,3,1,3,63,44,47,53,56
172,0,4,2,1,2,47,52,57,53,61
113,0,4,2,1,2,44,52,51,63,61
50,0,3,2,1,1,50,59,42,53,61
11,0,1,2,1,2,34,46,45,39,36
84,0,4,2,1,1,63,57,54,58,51
48,0,3,2,1,2,57,55,52,50,51
75,0,4,2,1,3,60,46,51,53,61
60,0,4,2,1,2,57,65,51,63,61
95,0,4,3,1,2,73,60,71,61,71
104,0,4,3,1,2,54,63,57,55,46
38,0,3,1,1,2,45,57,50,31,56
115,0,4,1,1,1,42,49,43,50,56
76,0,4,3,1,2,47,52,51,50,56
195,0,4,2,2,1,57,57,60,58,56
114,0,4,3,1,2,68,65,62,55,61
85,0,4,2,1,1,55,39,57,53,46
167,0,4,2,1,1,63,49,35,66,41
143,0,4,2,1,3,63,63,75,72,66
41,0,3,2,1,2,50,40,45,55,56
20,0,1,3,1,2,60,52,57,61,61
12,0,1,2,1,3,37,44,45,39,46
53,0,3,2,1,3,34,37,46,39,31
154,0,4,3,1,2,65,65,66,61,66
178,0,4,2,2,3,47,57,57,58,46
196,0,4,3,2,2,44,38,49,39,46
29,0,2,1,1,1,52,44,49,55,41
126,0,4,2,1,1,42,31,57,47,51
103,0,4,3,1,2,76,52,64,64,61
192,0,4,3,2,2,65,67,63,66,71
150,0,4,2,1,3,42,41,57,72,31
199,0,4,3,2,2,52,59,50,61,61
144,0,4,3,1,1,60,65,58,61,66
200,0,4,2,2,2,68,54,75,66,66
80,0,4,3,1,2,65,62,68,66,66
16,0,1,1,1,3,47,31,44,36,36
153,0,4,2,1,3,39,31,40,39,51
176,0,4,2,2,2,47,47,41,42,51
177,0,4,2,2,2,55,59,62,58,51
168,0,4,2,1,2,52,54,57,55,51
40,0,3,1,1,1,42,41,43,50,41
62,0,4,3,1,1,65,65,48,63,66
169,0,4,1,1,1,55,59,63,69,46
49,0,3,3,1,3,50,40,39,49,47
136,0,4,2,1,2,65,59,70,63,51
189,0,4,2,2,2,47,59,63,53,46
7,0,1,2,1,2,57,54,59,47,51
27,0,2,2,1,2,53,61,61,57,56
128,0,4,3,1,2,39,33,38,47,41
21,0,1,2,1,1,44,44,61,50,46
183,0,4,2,2,2,63,59,49,55,71
132,0,4,2,1,2,73,62,73,69,66
15,0,1,3,1,3,39,39,44,26,42
67,0,4,1,1,3,37,37,42,33,32
22,0,1,2,1,3,42,39,39,56,46
185,0,4,2,2,2,63,57,55,58,41
9,0,1,2,1,3,48,49,52,44,51
181,0,4,2,2,2,50,46,45,58,61
170,0,4,3,1,2,47,62,61,69,66
134,0,4,1,1,1,44,44,39,34,46
108,0,4,2,1,1,34,33,41,36,36
197,0,4,3,2,2,50,42,50,36,61
140,0,4,2,1,3,44,41,40,50,26
171,0,4,2,1,2,60,54,60,55,66
107,0,4,1,1,3,47,39,47,42,26
81,0,4,1,1,2,63,43,59,65,44
18,0,1,2,1,3,50,33,49,44,36
155,0,4,2,1,1,44,44,46,39,51
97,0,4,3,1,2,60,54,58,58,61
68,0,4,2,1,2,73,67,71,63,66
157,0,4,2,1,1,68,59,58,74,66
56,0,4,2,1,3,55,45,46,58,51
5,0,1,1,1,2,47,40,43,45,31
159,0,4,3,1,2,55,61,54,49,61
123,0,4,3,1,1,68,59,56,63,66
164,0,4,2,1,3,31,36,46,39,46
14,0,1,3,1,2,47,41,54,42,56
127,0,4,3,1,2,63,59,57,55,56
165,0,4,1,1,3,36,49,54,61,36
174,0,4,2,2,2,68,59,71,66,56
3,0,1,1,1,2,63,65,48,63,56
58,0,4,2,1,3,55,41,40,44,41
146,0,4,3,1,2,55,62,64,63,66
102,0,4,3,1,2,52,41,51,53,56
117,0,4,3,1,3,34,49,39,42,56
133,0,4,2,1,3,50,31,40,34,31
94,0,4,3,1,2,55,49,61,61,56
24,0,2,2,1,2,52,62,66,47,46
149,0,4,1,1,1,63,49,49,66,46
82,1,4,3,1,2,68,62,65,69,61
8,1,1,1,1,2,39,44,52,44,48
129,1,4,1,1,1,44,44,46,47,51
173,1,4,1,1,1,50,62,61,63,51
57,1,4,2,1,2,71,65,72,66,56
100,1,4,3,1,2,63,65,71,69,71
1,1,1,1,1,3,34,44,40,39,41
194,1,4,3,2,2,63,63,69,61,61
88,1,4,3,1,2,68,60,64,69,66
99,1,4,3,1,1,47,59,56,66,61
47,1,3,1,1,2,47,46,49,33,41
120,1,4,3,1,2,63,52,54,50,51
166,1,4,2,1,2,52,59,53,61,51
65,1,4,2,1,2,55,54,66,42,56
101,1,4,3,1,2,60,62,67,50,56
89,1,4,1,1,3,35,35,40,51,33
54,1,3,1,2,1,47,54,46,50,56
180,1,4,3,2,2,71,65,69,58,71
162,1,4,2,1,3,57,52,40,61,56
4,1,1,1,1,2,44,50,41,39,51
131,1,4,3,1,2,65,59,57,46,66
125,1,4,1,1,2,68,65,58,59,56
34,1,1,3,2,2,73,61,57,55,66
106,1,4,2,1,3,36,44,37,42,41
130,1,4,3,1,1,43,54,55,55,46
93,1,4,3,1,2,73,67,62,58,66
163,1,4,1,1,2,52,57,64,58,56
37,1,3,1,1,3,41,47,40,39,51
35,1,1,1,2,1,60,54,50,50,51
87,1,4,2,1,1,50,52,46,50,56
73,1,4,2,1,2,50,52,53,39,56
151,1,4,2,1,3,47,46,52,48,46
44,1,3,1,1,3,47,62,45,34,46
152,1,4,3,1,2,55,57,56,58,61
105,1,4,2,1,2,50,41,45,44,56
28,1,2,2,1,1,39,53,54,50,41
91,1,4,3,1,3,50,49,56,47,46
45,1,3,1,1,3,34,35,41,29,26
116,1,4,2,1,2,57,59,54,50,56
33,1,2,1,1,2,57,65,72,54,56
66,1,4,2,1,3,68,62,56,50,51
72,1,4,2,1,3,42,54,47,47,46
77,1,4,1,1,2,61,59,49,44,66
61,1,4,3,1,2,76,63,60,67,66
190,1,4,2,2,2,47,59,54,58,46
42,1,3,2,1,3,46,52,55,44,56
2,1,1,2,1,3,39,41,33,42,41
55,1,3,2,2,2,52,49,49,44,61
19,1,1,1,1,1,28,46,43,44,51
90,1,4,3,1,2,42,54,50,50,52
142,1,4,2,1,3,47,42,52,39,51
17,1,1,2,1,2,47,57,48,44,41
122,1,4,2,1,2,52,59,58,53,66
191,1,4,3,2,2,47,52,43,48,61
83,1,4,2,1,3,50,62,41,55,31
182,1,4,2,2,2,44,52,43,44,51
6,1,1,1,1,2,47,41,46,40,41
46,1,3,1,1,2,45,55,44,34,41
43,1,3,1,1,2,47,37,43,42,46
96,1,4,3,1,2,65,54,61,58,56
138,1,4,2,1,3,43,57,40,50,51
10,1,1,2,1,1,47,54,49,53,61
71,1,4,2,1,1,57,62,56,58,66
139,1,4,2,1,2,68,59,61,55,71
110,1,4,2,1,3,52,55,50,54,61
148,1,4,2,1,3,42,57,51,47,61
109,1,4,2,1,1,42,39,42,42,41
39,1,3,3,1,2,66,67,67,61,66
147,1,4,1,1,2,47,62,53,53,61
74,1,4,2,1,2,57,50,50,51,58
198,1,4,3,2,2,47,61,51,63,31
161,1,4,1,1,2,57,62,72,61,61
112,1,4,2,1,2,52,59,48,55,61
69,1,4,1,1,3,44,44,40,40,31
156,1,4,2,1,2,50,59,53,61,61
111,1,4,1,1,1,39,54,39,47,36
186,1,4,2,2,2,57,62,63,55,41
98,1,4,1,1,3,57,60,51,53,37
119,1,4,1,1,1,42,57,45,50,43
13,1,1,2,1,3,47,46,39,47,61
51,1,3,3,1,1,42,36,42,31,39
26,1,2,3,1,2,60,59,62,61,51
36,1,3,1,1,1,44,49,44,35,51
135,1,4,1,1,2,63,60,65,54,66
59,1,4,2,1,2,65,67,63,55,71
78,1,4,2,1,2,39,54,54,53,41
64,1,4,3,1,3,50,52,45,58,36
63,1,4,1,1,1,52,65,60,56,51
79,1,4,2,1,2,60,62,49,50,51
193,1,4,2,2,2,44,49,48,39,51
92,1,4,3,1,1,52,67,57,63,61
160,1,4,2,1,2,55,65,55,50,61
32,1,2,3,1,3,50,67,66,66,56
23,1,2,1,1,2,65,65,64,58,71
158,1,4,2,1,1,52,54,55,53,51
25,1,2,2,1,1,47,44,42,42,36
188,1,4,3,2,2,63,62,56,55,61
52,1,3,1,1,2,50,46,53,53,66
124,1,4,1,1,3,42,54,41,42,41
175,1,4,3,2,1,36,57,42,50,41
;
NOTE: Writing HTML5 Body file: STDOUT
NOTE: The data set WORK.HSB2 has 192 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.00 seconds
We want to check if the coefficient estimate of the MATH
explanatory variable really doesn’t contains the 0 value; that is if the MATH
coefficient is statistically significant (p-value less than .05).
How can I bootstrap estimates in SAS?
Bootstrapping allows for estimation of statistics through the repeated resampling of data. In this page, we will demonstrate several methods of bootstrapping a confidence interval about the slope of the MATH
explanatory variable in SAS. We will be using the hsb2
dataset that
can be found here. We will begin by running an OLS regression, predicting read
with female
, math
, write
, and ses
, and saving the slope of math
value in a dataset called t0
. The estimated MATH
paramter value in this regression is 0.4333.
proc genmod data = HSB2 ;
model read = female math write ses;
ods output parameterestimates = t0;
run;
SAS Output
The GENMOD Procedure
The GENMOD Procedure
Model Information
Model Information | |
---|---|
Data Set | WORK.HSB2 |
Distribution | Normal |
Link Function | Identity |
Dependent Variable | read |
Number of Observations
Number of Observations Read | 192 |
---|---|
Number of Observations Used | 192 |
Criteria For Assessing Goodness Of Fit
Criteria For Assessing Goodness Of Fit | |||
---|---|---|---|
Criterion | DF | Value | Value/DF |
Deviance | 187 | 9749.9643 | 52.1388 |
Scaled Deviance | 187 | 192.0000 | 1.0267 |
Pearson Chi-Square | 187 | 9749.9643 | 52.1388 |
Scaled Pearson X2 | 187 | 192.0000 | 1.0267 |
Log Likelihood | -649.4785 | ||
Full Log Likelihood | -649.4785 | ||
AIC (smaller is better) | 1310.9569 | ||
AICC (smaller is better) | 1311.4110 | ||
BIC (smaller is better) | 1330.5019 |
Convergence Status
Algorithm converged. |
Analysis Of Parameter Estimates
Analysis Of Maximum Likelihood Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error |
Wald 95% Confidence Limits | Wald Chi-Square | Pr > ChiSq | |
Intercept | 1 | 6.3343 | 3.2981 | -0.1299 | 12.7985 | 3.69 | 0.0548 |
female | 1 | -2.4704 | 1.1133 | -4.6525 | -0.2883 | 4.92 | 0.0265 |
math | 1 | 0.4333 | 0.0734 | 0.2894 | 0.5773 | 34.82 | <.0001 |
write | 1 | 0.4106 | 0.0744 | 0.2647 | 0.5564 | 30.43 | <.0001 |
ses | 1 | 1.3787 | 0.7423 | -0.0761 | 2.8336 | 3.45 | 0.0633 |
Scale | 1 | 7.1261 | 0.3637 | 6.4478 | 7.8757 |
The scale parameter was estimated by maximum likelihood.
Store the estimated MATH
parameter:
data _null_;
set t0;
if parameter = "math" then call symput('est_bar', estimate);
run;
%put &est_bar;
NOTE: Writing HTML5 Body file: STDOUT
NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
738:56
NOTE: There were 6 observations read from the data set WORK.T0.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
0.4333340545
To bootstrap a confidence interval about this MATH
beta value, we will first need to resample. This step involves sampling with replacement from our original dataset
to generate a new dataset the same size as our original dataset. For each of these
samples, we will be running the same regression as above and saving the MATH
beta parameter
value. proc surveyselect
allows us to do this resampling in one step.
Before carrying out this step, let's outline the assumptions we are making about our data when we use this method. We are assuming that the observations in our dataset are independent. We are also assuming that the statistic we are estimating is asymptotically normally distributed.
We indicate an output dataset, a seed, a sampling method, and the number of
replicates. The sampling method indicated, urs
, is unrestricted random sampling,
or sampling with replacement. The samprate indicates how large each sample should
be relative to the input dataset. A samprate
of one means that the sampled datasets
should be of the same size as the input dataset. So in this example, we will generate
500 datasets of 200, so our output dataset bootsample
will have 100,000 observations.
%let rep = 100000;
proc surveyselect data= HSB2 out=bootsample
seed = 1347 method = urs
samprate = 1 outhits rep = &rep;
run;
SAS Output
The SURVEYSELECT Procedure
The SURVEYSELECT Procedure
Sample Selection Method
Selection Method | Unrestricted Random Sampling |
---|
Sample Selection Summary
Input Data Set | HSB2 |
---|---|
Random Number Seed | 1347 |
Sampling Rate | 1 |
Sample Size | 192 |
Expected Number of Hits | 1 |
Sampling Weight | 1 |
Number of Replicates | 100000 |
Total Sample Size | 19200000 |
Output Data Set | BOOTSAMPLE |
With this dataset, we will now run our regression model, specifying by replicate so
that the model will be run separately for each of the 100000 sample datasets. After that,
we use a data step to convert the MATH
beta paramter values to numeric.
ods select none;
ods output parameterestimates = tdata (where = (parameter = "math"));
proc genmod data = bootsample;
by replicate;
model read = female math write ses;
ods output parameterestimates = t (where = (parameter = "math"));
run;
quit;
SAS Output
NOTE: The ods select none
suppresses displayed output in SAS GENMOD
. The noprint
options deosn't work in this procedure. More about the use of the ODS
system in this example see this
ods select all;
NOTE: Writing HTML5 Body file: STDOUT
The following histogram shows the distributions of the bootstrapped MATH
beta estimate parameter for the 100000 samples follows a Normal distribution:
proc univariate data=tdata noprint;
histogram Estimate / kernel (noprint) normal (noprint);
run;
SAS Output
The UNIVARIATE Procedure
The UNIVARIATE Procedure
Estimate
Histogram 1
Panel 1
Method 1: Normal Distribution Confidence Interval
We will first create a confidence interval using the normal distribution theory. This assumes that the Beta values follow a t distribution, so we can generate a 95% confidence interval by about the mean of the Betea values based on quantiles from a t-distribution with 99000 degrees of freedom. We find the critical t values for our confidence interval and multiply these by the standard deviation of the Beta values that arose in our 100000 replications. Our confidence interval using this method is symmetric about the Beta value we saw in our original regression. We can see that the 95% confidence interval using this method is (0.286893, 0.579775). We have also calculated the bias in our original value of Beta as the difference between that value and the mean of the 100000 Beatas in our bootstrap sample.
%let alphalev = .05;
ods listing;
proc sql;
select &est_bar as estimate,
mean(estimate) - &est_bar as bias,
std(estimate) as std_err,
&est_bar - tinv(1-&alphalev/2, &rep-1)*std(estimate) as lb,
&est_bar + tinv(1-&alphalev/2, &rep-1)*std(estimate) as hb
from t;
quit;
SAS Output
The SQL Procedure
Query Results
estimate | bias | std_err | lb | hb |
---|---|---|---|---|
0.433334 | 0.000637 | 0.074715 | 0.286893 | 0.579775 |
Method 2: Percentile Confidence Interval
Another way to generate a bootstrap 95% confidence interval from the sample of 100000
Betas for MATH
values is to look at the 2.5th and 97.5th percentiles in this distribution.
This approach to the confidence interval has some advantages over the normal approximation
used above. This interval is not symmetric about the original estimate of the Beta
and this method is unaffected by monotonic transformations on the estimated statistic.
The first advantage is relevant because our original estimate is subject to bias.
The second advantage is less relevant in this example than in an instance where the
estimate might be subject to a transformation. The bootstrap estimates that form the
bounds of the interval can be transformed in the same way to create the bootstrap interval
of the transformed estimate.
We can easily generate a percentile confidence interval in SAS using PROC UNIVARIATE
after
creating some macro variables for the percentiles of interest and using them in the output
statement. We can see that the confidence interval from this method is (0.28637, 0.57973).
Since we have put the information of interest into a new dataset, pmethod, we have omitted
the standard output from the proc univariate.
%let alphalev = .05;
%let a1 = %sysevalf(&alphalev/2*100);
%let a2 = %sysevalf((1 - &alphalev/2)*100);
* creating confidence interval, percentile method;
proc univariate data = t alpha = .05 noprint;
var estimate;
output out=pmethod mean = estimate_hat pctlpts=&a1 &a2 pctlpre = p pctlname = _lb _ub ;
run;
data t2;
set pmethod;
bias = estimate_hat - &est_bar;
estimate = &est_bar;
run;
ods listing;
proc print data = t2;
var estimate bias p_lb p_ub;
run;
SAS Output
The PRINT Procedure
Data Set WORK.T2
Obs | estimate | bias | p_lb | p_ub |
---|---|---|---|---|
1 | 0.43333 | .000636566 | 0.28637 | 0.57973 |
Method 3: Bias-Corrected Confidence Interval
We can also correct for bias in calculating our confidence interval. We have calculated bias in the previous method as the difference between the Beta we observed in our initial regression and the mean of the 100000 Beta values from the bootstrap samples. The Beta estimate from the initial regression is assumed to be an unbiased estimate of the true parameter. If we wish to correct for the bias in calculating our confidence interval, we can go through the steps below. These are described by Cameron and Trivedi in Microeconomics Using Stata.
We first calculate the proportion of the bootstrap Beta that are less than our original value. We will adjust the percentiles used to define our confidence interval based on how this proportion differs from 0.5. We then find the probit of this proportion (z0) and the proportion associated with our alpha level (zalpha). Next, we calculate the two percentiles that will be used to find our confidence interval, p1 and p2, from these values. We then calculate our interval with proc univariate. From this method, our interval is (0.27954, 0.57413).
%let alphalev = .05;
%let alpha1 = %sysevalf(1 - &alphalev/2);
%put &alpha1;
proc sql;
select sum(estimate<=&est_bar)/count(estimate) into :z0bar
from t;
quit;
data _null_;
z0 = probit(&z0bar);
zalpha = probit(&alpha1);
p1 = put(probnorm(2*z0 - zalpha)*100, 3.0);
p2 = put(probnorm(2*z0 + zalpha)*100, 3.0);
output;
call symput('a1', p1);
call symput('a2', p2);
run;
SAS Output
The SQL Procedure
Query Results
0.4952 |
Creating confidence interval, bias-corrected method
proc univariate data = t alpha = .05 noprint;
var estimate;
output out=pmethod mean = estimate_hat pctlpts=&a1 &a2 pctlpre = p pctlname = _lb _ub ;
run;
data t2;
set pmethod;
bias = estimate_hat - &est_bar;
estimate = &est_bar;
run;
ods listing;
proc print data = t2;
var estimate bias p_lb p_ub;
run;
SAS Output
The PRINT Procedure
Data Set WORK.T2
Obs | estimate | bias | p_lb | p_ub |
---|---|---|---|---|
1 | 0.43333 | .000636566 | 0.27954 | 0.57413 |
Conclusion
Because the bootstrapping confident interval doesn’t contains the zero and is very similar to the confident interval of the GENMOD
we conclude that the estimated slope of 0.4333 for the MATH
explanatory variable is statistically significant and that our assumption in the GLM are correct.
The SAS code can be useful also to bootstrap the standard errors. Another good example consists in check the FEMALE variable that has a p-value close to 0.05.
Reference:
The code is based in the article titled "How can I bootstrap estimates in SAS?" where I changed the R-squared statistic by the slope of the MATH
explanatory variable. There are also some other slightly differences.