This program generates a correlation (association) matrix using the Cramer's V statistic for character variables.

Illustration:

  • Dataset used in this code: claim_history.sas7bdat or claim_history.csv
  • The claim_history.sas7bdat dataset comes from the help library of SAS Enterprise-Miner version 7.1
  • Data description: observations: 10302, Variables: 27

Note: he NOSOURCE system option as follows to suppress the listing of the SAS statements to the log. Refernce

options nosource;

582  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
583
584 options nosource;

libname mydata "/folders/myfolders/";

NOTE: Writing HTML5 Body file: STDOUT
NOTE: Libref MYDATA was successfully assigned as follows:
Engine: V9
Physical Name: /folders/myfolders

It is possible to use the CONTENTS procedure to find the character variables:

proc contents data=mydata.claim_history varnum;
run;
SAS Output

SAS Output

The SAS System

The CONTENTS Procedure

The CONTENTS Procedure

MYDATA.CLAIM_HISTORY

Attributes

Data Set Name MYDATA.CLAIM_HISTORY Observations 10302
Member Type DATA Variables 27
Engine V9 Indexes 0
Created 06/02/2011 04:36:18 Observation Length 224
Last Modified 06/02/2011 04:36:18 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation WINDOWS_64    
Encoding us-ascii ASCII (ANSI)    

Engine/Host Information

Engine/Host Dependent Information
Data Set Page Size 16384
Number of Data Set Pages 144
First Data Page 1
Max Obs per Page 72
Obs in First Data Page 54
Number of Data Set Repairs 0
Filename /folders/myfolders/claim_history.sas7bdat
Release Created 9.0301M0
Host Created X64_PRO
Inode Number 27
Access Permission rwxrwx---
Owner Name root
File Size 2MB
File Size (bytes) 2360320

Varnum

Variables in Creation Order
# Variable Type Len Format Informat Label
1 ID Char 10     Identification#
2 KIDSDRIV Num 8 4.   #Driving Children
3 BIRTH Num 8 DATE7.   Date of Birth
4 AGE Num 8 4.   Age
5 HOMEKIDS Num 8 4.   #Children @Home
6 YOJ Num 8 4.   Years on Job
7 INCOME Num 8 DOLLAR10.   Income
8 PARENT1 Char 3     Single Parent
9 HOME_VAL Num 8 DOLLAR10.   Home Value
10 MSTATUS Char 5     Marital Status
11 GENDER Char 3 $3. $3. Driver Gender
12 EDUCATION Char 13     Max Education Level
13 OCCUPATION Char 13 $13. $13. Driver Occupation
14 TRAVTIME Num 8 4.   Distance to Work
15 CAR_USE Char 10     Vehicle Use
16 BLUEBOOK Num 8 DOLLAR10.   Value of Vehicle
17 TIF Num 8     Time in Force
18 CAR_TYPE Char 11     Type of Car
19 RED_CAR Char 3     A Red Car
20 OLDCLAIM Num 8 DOLLAR12.   Total Claims(Past 5 Years)
21 CLM_FREQ Num 8     #Claims(Past 5 Years)
22 REVOKED Char 3     License Revoked (Past 7 Years)
23 MVR_PTS Num 8 5.   Motor Vehicle Record Points
24 CLM_AMT Num 8 DOLLAR12.   Claim Amount
25 CAR_AGE Num 8 4.   Vehicle Age
26 CLAIM_FLAG Num 8     Claim Indicator
27 URBANICITY Char 21     Home/Work Area

So, the charecter explanatory variables in this dataset are:

PARENT1 MSTATUS GENDER EDUCATION OCCUPATION CAR_USE CAR_TYPE RED_CAR RED_CAR URBANICITY

An association matrix it is very useful in order to figure out potential stron association between all pairs of categorical variables.

Let's select only the character explanatory variables in our dataset:

/*********************************************************************************/
/*********************************************************************************/
/* CORRELATION MATRIX FOR CATEGORICAL EXPLANATORY VARIABLES*/
/*********************************************************************************/
/*********************************************************************************/

data work.data_train;
    set mydata.CLAIM_HISTORY (keep= 
 PARENT1
 MSTATUS
 GENDER
 EDUCATION
 OCCUPATION
 CAR_USE
 CAR_TYPE
 RED_CAR
 RED_CAR
 URBANICITY);
run;

NOTE: Writing HTML5 Body file: STDOUT
NOTE: Data file MYDATA.CLAIM_HISTORY.DATA is in a format that is native to another host, or the file encoding does not match the
session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce
performance.
NOTE: There were 10302 observations read from the data set MYDATA.CLAIM_HISTORY.
NOTE: The data set WORK.DATA_TRAIN has 10302 observations and 9 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.02 seconds

Now we can use the SAS macro to create the association matrix. This macro has several parameters: the dataset name and the input character variables:

************************** MACRO PARAMETERS *******************************;
*Put the name of the modeling dataset:;
%let dataset = WORK.data_train;
*Put here the name of all the potential explanatory variables, the 
categorical ones:;
*Put the name of the categorical explanatory variables;
%let varstoconsider = 
PARENT1
MSTATUS
GENDER
EDUCATION
OCCUPATION
CAR_USE
CAR_TYPE
RED_CAR
RED_CAR
URBANICITY;

proc contents data=&dataset.(keep = &varstoconsider.) 
    nodetails varnum out=work.var_desc (keep=name type) noprint;
proc sort data = work.var_desc out = work.var_desc; by type;
run;

data test (keep=name);
    length name $80.;
    set work.var_desc;
    name = upcase(name);
    if type = 2 then output;
run;

NOTE: Writing HTML5 Body file: STDOUT
NOTE: The data set WORK.VAR_DESC has 9 observations and 2 variables.
NOTE: PROCEDURE CONTENTS used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds

NOTE: There were 9 observations read from the data set WORK.VAR_DESC.
NOTE: The data set WORK.VAR_DESC has 9 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

NOTE: There were 9 observations read from the data set WORK.VAR_DESC.
NOTE: The data set WORK.TEST has 9 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

Now we can run the macro code:

/****************************** MACRO STARTS HERE ***********************************/
%macro corr_matrix (table = );

%macro setup(dsout=_NULL_,dsin=&syslast,var=,ks=);
    %local dsout dsin var totvar ks;

    proc freq data=&dsin noprint;
        tables &var / out=__count;

data _NULL_;
    set __count end=last;
    retain len oldlen 0;
    nk + 1;
    len = length(&var);

    if ( len > oldlen ) then
        oldlen = len;

    if count > 1 then
        do;
            put "WARNING: The variable %upcase(&var)" /
                "WARNING: has " count " occurrences of the value "
                var;
        end;

    if last then
        do;
            if &ks > nk then
                do;
                    put "ERROR: K greater than the Number of values (N)";
                    put "ERROR: Value of K: &ks Value of N: " nk;
                    put "ERROR: EXITING Macro!!!!!!!!!!!!!!!!!!";
                end;

            call symput("totvar",trim(left(put(nk,5.))) );
            call symput("length",put(oldlen,3.) );
        end;
run;

%if &ks > &totvar %then
    %goto exit;

data __d1;
    set __count end=last;
    array _vals {&totvar} $ &length;
    retain _vals1-_vals&totvar;
    keep _vals1-_vals&totvar;
    _vals{_N_} = &var;

    if last then
        output;

    %comb(k=&ks,n=&totvar,len=&length)
    run;
%exit:
%mend setup;

%macro comb(dsout2=&dsout,dsin2=__d1,k=,n=,len=);
    %local dsin2 dsout2 numvar i start stop n k len;
    %let start = 1;
    %let stop = %eval( &n - &k);

data &dsout2;
    set &dsin2;
    array _vals {&n} $;
    array combo {&k} $ &len;
    keep j combo1 - combo&k;
    retain j 1;

    %do i = 1 %to &k;
        %let stop = %eval(&stop + 1);

        do var&i = &start to &stop;
            %let start = %str(%(var&i + 1 %));
    %end;

    %do i = 1 %to &k;
        combo{&i} = _vals{var&i};
    %end;

    output;
    j + 1;

    %do i = 1 %to &k;
        end;
    %end;
run;

%mend comb;

%setup(var=name,ks=2,dsout=combin);
 *proc print data=combin ;

* Determine the number of observations;
data _NULL_;
    if 0 then
        set combin nobs=n;
    call symputx('nrows',n);
    stop;
run;


%macro do_corr;
    %do obs = 1 %to &nrows;

data _null_;
    set combin;

    if j = &obs then
        call symputx("combo1", PUT(combo1, 35.));

    if j = &obs then
        call symputx("combo2", PUT(combo2, 35.));
run;

%put &combo1;
%put &combo2;

proc freq data = &table noprint;
    tables &combo1*&combo2 / chisq;
    output out=results_&obs chisq;
run;

data results_&obs (keep= combo1 combo2 _CRAMV_);
    length combo1 combo2 $30.;
    set results_&obs;
    combo1 = "&combo1";
    combo2 = "&combo2";
run;

proc append base=association_matrix data=results_&obs force;
run;

    %end;
%mend do_corr;

%do_corr;

proc sort data=association_matrix;
    by descending _CRAMV_;
run;

proc sort data=association_matrix out=association_matrix_sort;
by combo2 combo1;
run;

/* reshape the output to the wide format */
proc transpose data=association_matrix_sort out=fin (drop=_name_ _label_) let;
    by combo2;
    id combo1;
    var _CRAMV_;
run;

proc format;
    value backcolor 
    .7 - high    = "red"
    low - -.7    = "red"; 
run;

proc print data=association_matrix noobs;
title "Sorted Associations - Cramer's V";
    var _all_ / style(data) = [background=backcolor. ];
    format _CRAMV_ percent.;
run;

title "Expenditures Summary Using Traffic Lighting";
proc print data=fin;
    title "Association Matrix - Cramer's V";
    var _all_ / style(data) = [background=backcolor. ];
    format _numeric_ percent.;
run;

proc datasets lib=work nolist;
    delete __count __d1 combin association_matrix
           test association_matrix_sort fin results_: ;
run;

%mend corr_matrix;
/****************************** MACRO ENDS HERE ***********************************/

%corr_matrix(table = work.data_train);

/*proc datasets lib=work nolist;*/
/*   delete VAR_DESC;*/
/*run;*/
ods graphics on;
SAS Output

SAS Output

Sorted Associations - Cramer's V

The PRINT Procedure

Data Set WORK.ASSOCIATION_MATRIX

combo1 combo2 _CRAMV_
CAR_TYPE GENDER 72%
CAR_USE OCCUPATION 57%
EDUCATION OCCUPATION 56%
CAR_TYPE CAR_USE 54%
CAR_TYPE RED_CAR 49%
MSTATUS PARENT1 47%
OCCUPATION URBANICITY 31%
CAR_USE GENDER 28%
GENDER OCCUPATION 25%
EDUCATION URBANICITY 23%
CAR_USE EDUCATION 22%
OCCUPATION RED_CAR 17%
CAR_TYPE OCCUPATION 14%
CAR_TYPE EDUCATION 10%
OCCUPATION PARENT1 10%
EDUCATION PARENT1 9%
CAR_TYPE URBANICITY 8%
GENDER PARENT1 7%
CAR_TYPE PARENT1 6%
EDUCATION GENDER 5%
EDUCATION MSTATUS 5%
GENDER URBANICITY 5%
MSTATUS OCCUPATION 4%
EDUCATION RED_CAR 4%
PARENT1 URBANICITY 2%
CAR_TYPE MSTATUS 2%
MSTATUS RED_CAR 1%
GENDER MSTATUS ( 0%)
CAR_USE PARENT1 ( 0%)
MSTATUS URBANICITY ( 1%)
CAR_USE MSTATUS ( 1%)
CAR_USE URBANICITY ( 2%)
RED_CAR URBANICITY ( 4%)
PARENT1 RED_CAR ( 4%)
CAR_USE RED_CAR ( 19%)
GENDER RED_CAR ( 66%)

Association Matrix - Cramer's V

The PRINT Procedure

Data Set WORK.FIN

Obs combo2 CAR_TYPE CAR_USE EDUCATION GENDER MSTATUS OCCUPATION PARENT1 RED_CAR
1 CAR_USE 54% . . . . . . .
2 EDUCATION 10% 22% . . . . . .
3 GENDER 72% 28% 5% . . . . .
4 MSTATUS 2% ( 1%) 5% ( 0%) . . . .
5 OCCUPATION 14% 57% 56% 25% 4% . . .
6 PARENT1 6% ( 0%) 9% 7% 47% 10% . .
7 RED_CAR 49% ( 19%) 4% ( 66%) 1% 17% ( 4%) .
8 URBANICITY 8% ( 2%) 23% 5% ( 1%) 31% 2% ( 4%)

We obtain two outputs from the macro execution. The frist table is a list of all possible pairs of categorical variables and their associated Cramer's V statistic. The second table is a matrix that contains the same statistics. For those Cramver's V greater or equal to 70% (abs. value) the cell is highlighted in red.

Therefore, there is an strong positive associateship between the GENDER and the CAR_TYPE categorical variables. That means that some car types are used mainly by males and other car types are used mainly by females. Something that is plausible. These two varibles are quite associated, therefore, there are redundant information in our dataset. Both variables contains the same information. Therefore, we should select one of them, the most correlated with the target variable.

Let's analyze these two variables using a cross-table:

proc freq data = mydata.claim_history;
    tables gender * car_type;
run;
SAS Output

SAS Output

Association Matrix - Cramer's V

The FREQ Procedure

The FREQ Procedure

Table GENDER * CAR_TYPE

Cross-Tabular Freq Table

Frequency
Percent
Row Pct
Col Pct
Table of GENDER by CAR_TYPE
GENDER(Driver Gender) CAR_TYPE(Type of Car)
Minivan Panel Truck Pickup Sports Car Van z_SUV Total
M
1780
17.28
37.42
66.07
813
7.89
17.09
95.31
1231
11.95
25.88
69.47
30
0.29
0.63
2.54
804
7.80
16.90
87.30
99
0.96
2.08
3.43
4757
46.18
 
 
z_F
914
8.87
16.48
33.93
40
0.39
0.72
4.69
541
5.25
9.76
30.53
1149
11.15
20.72
97.46
117
1.14
2.11
12.70
2784
27.02
50.21
96.57
5545
53.82
 
 
Total
2694
26.15
853
8.28
1772
17.20
1179
11.44
921
8.94
2883
27.98
10302
100.00

Take a look to the Col Pct. Now it is clear that the Panel Truck is used mainly by men (95.31%) and the Sports Car and the z_SUV are used mainly by women.