This program generates a correlation (association) matrix using the Cramer's V statistic for character variables.
Illustration:
- Dataset used in this code: claim_history.sas7bdat or claim_history.csv
- The claim_history.sas7bdat dataset comes from the help library of SAS Enterprise-Miner version 7.1
- Data description: observations: 10302, Variables: 27
Note: he NOSOURCE
system option as follows to suppress the listing of the SAS statements to the log. Refernce
options nosource;
582 ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
583
584 options nosource;
libname mydata "/folders/myfolders/";
NOTE: Writing HTML5 Body file: STDOUT
NOTE: Libref MYDATA was successfully assigned as follows:
Engine: V9
Physical Name: /folders/myfolders
It is possible to use the CONTENTS
procedure to find the character variables:
proc contents data=mydata.claim_history varnum;
run;
SAS Output
The CONTENTS Procedure
The CONTENTS Procedure
MYDATA.CLAIM_HISTORY
Attributes
Data Set Name | MYDATA.CLAIM_HISTORY | Observations | 10302 |
---|---|---|---|
Member Type | DATA | Variables | 27 |
Engine | V9 | Indexes | 0 |
Created | 06/02/2011 04:36:18 | Observation Length | 224 |
Last Modified | 06/02/2011 04:36:18 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | WINDOWS_64 | ||
Encoding | us-ascii ASCII (ANSI) |
Engine/Host Information
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 16384 |
Number of Data Set Pages | 144 |
First Data Page | 1 |
Max Obs per Page | 72 |
Obs in First Data Page | 54 |
Number of Data Set Repairs | 0 |
Filename | /folders/myfolders/claim_history.sas7bdat |
Release Created | 9.0301M0 |
Host Created | X64_PRO |
Inode Number | 27 |
Access Permission | rwxrwx--- |
Owner Name | root |
File Size | 2MB |
File Size (bytes) | 2360320 |
Varnum
Variables in Creation Order | ||||||
---|---|---|---|---|---|---|
# | Variable | Type | Len | Format | Informat | Label |
1 | ID | Char | 10 | Identification# | ||
2 | KIDSDRIV | Num | 8 | 4. | #Driving Children | |
3 | BIRTH | Num | 8 | DATE7. | Date of Birth | |
4 | AGE | Num | 8 | 4. | Age | |
5 | HOMEKIDS | Num | 8 | 4. | #Children @Home | |
6 | YOJ | Num | 8 | 4. | Years on Job | |
7 | INCOME | Num | 8 | DOLLAR10. | Income | |
8 | PARENT1 | Char | 3 | Single Parent | ||
9 | HOME_VAL | Num | 8 | DOLLAR10. | Home Value | |
10 | MSTATUS | Char | 5 | Marital Status | ||
11 | GENDER | Char | 3 | $3. | $3. | Driver Gender |
12 | EDUCATION | Char | 13 | Max Education Level | ||
13 | OCCUPATION | Char | 13 | $13. | $13. | Driver Occupation |
14 | TRAVTIME | Num | 8 | 4. | Distance to Work | |
15 | CAR_USE | Char | 10 | Vehicle Use | ||
16 | BLUEBOOK | Num | 8 | DOLLAR10. | Value of Vehicle | |
17 | TIF | Num | 8 | Time in Force | ||
18 | CAR_TYPE | Char | 11 | Type of Car | ||
19 | RED_CAR | Char | 3 | A Red Car | ||
20 | OLDCLAIM | Num | 8 | DOLLAR12. | Total Claims(Past 5 Years) | |
21 | CLM_FREQ | Num | 8 | #Claims(Past 5 Years) | ||
22 | REVOKED | Char | 3 | License Revoked (Past 7 Years) | ||
23 | MVR_PTS | Num | 8 | 5. | Motor Vehicle Record Points | |
24 | CLM_AMT | Num | 8 | DOLLAR12. | Claim Amount | |
25 | CAR_AGE | Num | 8 | 4. | Vehicle Age | |
26 | CLAIM_FLAG | Num | 8 | Claim Indicator | ||
27 | URBANICITY | Char | 21 | Home/Work Area |
So, the charecter explanatory variables in this dataset are:
PARENT1
MSTATUS
GENDER
EDUCATION
OCCUPATION
CAR_USE
CAR_TYPE
RED_CAR
RED_CAR
URBANICITY
An association matrix it is very useful in order to figure out potential stron association between all pairs of categorical variables.
Let's select only the character explanatory variables in our dataset:
/*********************************************************************************/
/*********************************************************************************/
/* CORRELATION MATRIX FOR CATEGORICAL EXPLANATORY VARIABLES*/
/*********************************************************************************/
/*********************************************************************************/
data work.data_train;
set mydata.CLAIM_HISTORY (keep=
PARENT1
MSTATUS
GENDER
EDUCATION
OCCUPATION
CAR_USE
CAR_TYPE
RED_CAR
RED_CAR
URBANICITY);
run;
NOTE: Writing HTML5 Body file: STDOUT
NOTE: Data file MYDATA.CLAIM_HISTORY.DATA is in a format that is native to another host, or the file encoding does not match the
session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce
performance.
NOTE: There were 10302 observations read from the data set MYDATA.CLAIM_HISTORY.
NOTE: The data set WORK.DATA_TRAIN has 10302 observations and 9 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.02 seconds
Now we can use the SAS macro to create the association matrix. This macro has several parameters: the dataset name and the input character variables:
************************** MACRO PARAMETERS *******************************;
*Put the name of the modeling dataset:;
%let dataset = WORK.data_train;
*Put here the name of all the potential explanatory variables, the
categorical ones:;
*Put the name of the categorical explanatory variables;
%let varstoconsider =
PARENT1
MSTATUS
GENDER
EDUCATION
OCCUPATION
CAR_USE
CAR_TYPE
RED_CAR
RED_CAR
URBANICITY;
proc contents data=&dataset.(keep = &varstoconsider.)
nodetails varnum out=work.var_desc (keep=name type) noprint;
proc sort data = work.var_desc out = work.var_desc; by type;
run;
data test (keep=name);
length name $80.;
set work.var_desc;
name = upcase(name);
if type = 2 then output;
run;
NOTE: Writing HTML5 Body file: STDOUT
NOTE: The data set WORK.VAR_DESC has 9 observations and 2 variables.
NOTE: PROCEDURE CONTENTS used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
NOTE: There were 9 observations read from the data set WORK.VAR_DESC.
NOTE: The data set WORK.VAR_DESC has 9 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
NOTE: There were 9 observations read from the data set WORK.VAR_DESC.
NOTE: The data set WORK.TEST has 9 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
Now we can run the macro code:
/****************************** MACRO STARTS HERE ***********************************/
%macro corr_matrix (table = );
%macro setup(dsout=_NULL_,dsin=&syslast,var=,ks=);
%local dsout dsin var totvar ks;
proc freq data=&dsin noprint;
tables &var / out=__count;
data _NULL_;
set __count end=last;
retain len oldlen 0;
nk + 1;
len = length(&var);
if ( len > oldlen ) then
oldlen = len;
if count > 1 then
do;
put "WARNING: The variable %upcase(&var)" /
"WARNING: has " count " occurrences of the value "
var;
end;
if last then
do;
if &ks > nk then
do;
put "ERROR: K greater than the Number of values (N)";
put "ERROR: Value of K: &ks Value of N: " nk;
put "ERROR: EXITING Macro!!!!!!!!!!!!!!!!!!";
end;
call symput("totvar",trim(left(put(nk,5.))) );
call symput("length",put(oldlen,3.) );
end;
run;
%if &ks > &totvar %then
%goto exit;
data __d1;
set __count end=last;
array _vals {&totvar} $ &length;
retain _vals1-_vals&totvar;
keep _vals1-_vals&totvar;
_vals{_N_} = &var;
if last then
output;
%comb(k=&ks,n=&totvar,len=&length)
run;
%exit:
%mend setup;
%macro comb(dsout2=&dsout,dsin2=__d1,k=,n=,len=);
%local dsin2 dsout2 numvar i start stop n k len;
%let start = 1;
%let stop = %eval( &n - &k);
data &dsout2;
set &dsin2;
array _vals {&n} $;
array combo {&k} $ &len;
keep j combo1 - combo&k;
retain j 1;
%do i = 1 %to &k;
%let stop = %eval(&stop + 1);
do var&i = &start to &stop;
%let start = %str(%(var&i + 1 %));
%end;
%do i = 1 %to &k;
combo{&i} = _vals{var&i};
%end;
output;
j + 1;
%do i = 1 %to &k;
end;
%end;
run;
%mend comb;
%setup(var=name,ks=2,dsout=combin);
*proc print data=combin ;
* Determine the number of observations;
data _NULL_;
if 0 then
set combin nobs=n;
call symputx('nrows',n);
stop;
run;
%macro do_corr;
%do obs = 1 %to &nrows;
data _null_;
set combin;
if j = &obs then
call symputx("combo1", PUT(combo1, 35.));
if j = &obs then
call symputx("combo2", PUT(combo2, 35.));
run;
%put &combo1;
%put &combo2;
proc freq data = &table noprint;
tables &combo1*&combo2 / chisq;
output out=results_&obs chisq;
run;
data results_&obs (keep= combo1 combo2 _CRAMV_);
length combo1 combo2 $30.;
set results_&obs;
combo1 = "&combo1";
combo2 = "&combo2";
run;
proc append base=association_matrix data=results_&obs force;
run;
%end;
%mend do_corr;
%do_corr;
proc sort data=association_matrix;
by descending _CRAMV_;
run;
proc sort data=association_matrix out=association_matrix_sort;
by combo2 combo1;
run;
/* reshape the output to the wide format */
proc transpose data=association_matrix_sort out=fin (drop=_name_ _label_) let;
by combo2;
id combo1;
var _CRAMV_;
run;
proc format;
value backcolor
.7 - high = "red"
low - -.7 = "red";
run;
proc print data=association_matrix noobs;
title "Sorted Associations - Cramer's V";
var _all_ / style(data) = [background=backcolor. ];
format _CRAMV_ percent.;
run;
title "Expenditures Summary Using Traffic Lighting";
proc print data=fin;
title "Association Matrix - Cramer's V";
var _all_ / style(data) = [background=backcolor. ];
format _numeric_ percent.;
run;
proc datasets lib=work nolist;
delete __count __d1 combin association_matrix
test association_matrix_sort fin results_: ;
run;
%mend corr_matrix;
/****************************** MACRO ENDS HERE ***********************************/
%corr_matrix(table = work.data_train);
/*proc datasets lib=work nolist;*/
/* delete VAR_DESC;*/
/*run;*/
ods graphics on;
SAS Output
The PRINT Procedure
Data Set WORK.ASSOCIATION_MATRIX
combo1 | combo2 | _CRAMV_ |
---|---|---|
CAR_TYPE | GENDER | 72% |
CAR_USE | OCCUPATION | 57% |
EDUCATION | OCCUPATION | 56% |
CAR_TYPE | CAR_USE | 54% |
CAR_TYPE | RED_CAR | 49% |
MSTATUS | PARENT1 | 47% |
OCCUPATION | URBANICITY | 31% |
CAR_USE | GENDER | 28% |
GENDER | OCCUPATION | 25% |
EDUCATION | URBANICITY | 23% |
CAR_USE | EDUCATION | 22% |
OCCUPATION | RED_CAR | 17% |
CAR_TYPE | OCCUPATION | 14% |
CAR_TYPE | EDUCATION | 10% |
OCCUPATION | PARENT1 | 10% |
EDUCATION | PARENT1 | 9% |
CAR_TYPE | URBANICITY | 8% |
GENDER | PARENT1 | 7% |
CAR_TYPE | PARENT1 | 6% |
EDUCATION | GENDER | 5% |
EDUCATION | MSTATUS | 5% |
GENDER | URBANICITY | 5% |
MSTATUS | OCCUPATION | 4% |
EDUCATION | RED_CAR | 4% |
PARENT1 | URBANICITY | 2% |
CAR_TYPE | MSTATUS | 2% |
MSTATUS | RED_CAR | 1% |
GENDER | MSTATUS | ( 0%) |
CAR_USE | PARENT1 | ( 0%) |
MSTATUS | URBANICITY | ( 1%) |
CAR_USE | MSTATUS | ( 1%) |
CAR_USE | URBANICITY | ( 2%) |
RED_CAR | URBANICITY | ( 4%) |
PARENT1 | RED_CAR | ( 4%) |
CAR_USE | RED_CAR | ( 19%) |
GENDER | RED_CAR | ( 66%) |
The PRINT Procedure
Data Set WORK.FIN
Obs | combo2 | CAR_TYPE | CAR_USE | EDUCATION | GENDER | MSTATUS | OCCUPATION | PARENT1 | RED_CAR |
---|---|---|---|---|---|---|---|---|---|
1 | CAR_USE | 54% | . | . | . | . | . | . | . |
2 | EDUCATION | 10% | 22% | . | . | . | . | . | . |
3 | GENDER | 72% | 28% | 5% | . | . | . | . | . |
4 | MSTATUS | 2% | ( 1%) | 5% | ( 0%) | . | . | . | . |
5 | OCCUPATION | 14% | 57% | 56% | 25% | 4% | . | . | . |
6 | PARENT1 | 6% | ( 0%) | 9% | 7% | 47% | 10% | . | . |
7 | RED_CAR | 49% | ( 19%) | 4% | ( 66%) | 1% | 17% | ( 4%) | . |
8 | URBANICITY | 8% | ( 2%) | 23% | 5% | ( 1%) | 31% | 2% | ( 4%) |
We obtain two outputs from the macro execution. The frist table is a list of all possible pairs of categorical variables and their associated Cramer's V statistic. The second table is a matrix that contains the same statistics. For those Cramver's V greater or equal to 70% (abs. value) the cell is highlighted in red.
Therefore, there is an strong positive associateship between the GENDER
and the CAR_TYPE
categorical variables. That means that some car types are used mainly by males and other car types are used mainly by females. Something that is plausible. These two varibles are quite associated, therefore, there are redundant information in our dataset. Both variables contains the same information. Therefore, we should select one of them, the most correlated with the target variable.
Let's analyze these two variables using a cross-table:
proc freq data = mydata.claim_history;
tables gender * car_type;
run;
SAS Output
The FREQ Procedure
The FREQ Procedure
Table GENDER * CAR_TYPE
Cross-Tabular Freq Table
|
|
Take a look to the Col Pct
. Now it is clear that the Panel Truck
is used mainly by men (95.31%) and the Sports Car
and the z_SUV
are used mainly by women.