This program generates a correlation (association) matrix using the Cramer's V statistic for character variables.

Illustration:

Dataset used in this code: claim_history.sas7bdat or claim_history.csv
The claim_history.sas7bdat dataset comes from the help library of SAS Enterprise-Miner version 7.1
Data description: observations: 10302, Variables: 27

Note: he NOSOURCE system option as follows to suppress the listing of the SAS statements to the log. Refernce

options nosource;

582  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
583  
584  options nosource;

libname mydata "/folders/myfolders/";

NOTE: Writing HTML5 Body file: STDOUT
NOTE: Libref MYDATA was successfully assigned as follows: 
      Engine:        V9 
      Physical Name: /folders/myfolders

It is possible to use the CONTENTS procedure to find the character variables:

proc contents data=mydata.claim_history varnum;
run;

SAS Output

The SAS System

The CONTENTS Procedure

MYDATA.CLAIM_HISTORY

Attributes

Data Set Name	MYDATA.CLAIM_HISTORY	Observations	10302
Member Type	DATA	Variables	27
Engine	V9	Indexes	0
Created	06/02/2011 04:36:18	Observation Length	224
Last Modified	06/02/2011 04:36:18	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	WINDOWS_64
Encoding	us-ascii ASCII (ANSI)

Engine/Host Information

Engine/Host Dependent Information
Data Set Page Size	16384
Number of Data Set Pages	144
First Data Page	1
Max Obs per Page	72
Obs in First Data Page	54
Number of Data Set Repairs	0
Filename	/folders/myfolders/claim_history.sas7bdat
Release Created	9.0301M0
Host Created	X64_PRO
Inode Number	27
Access Permission	rwxrwx---
Owner Name	root
File Size	2MB
File Size (bytes)	2360320

Varnum

Variables in Creation Order
#	Variable	Type	Len	Format	Informat	Label
1	ID	Char	10			Identification#
2	KIDSDRIV	Num	8	4.		#Driving Children
3	BIRTH	Num	8	DATE7.		Date of Birth
4	AGE	Num	8	4.		Age
5	HOMEKIDS	Num	8	4.		#Children @Home
6	YOJ	Num	8	4.		Years on Job
7	INCOME	Num	8	DOLLAR10.		Income
8	PARENT1	Char	3			Single Parent
9	HOME_VAL	Num	8	DOLLAR10.		Home Value
10	MSTATUS	Char	5			Marital Status
11	GENDER	Char	3	$3.	$3.	Driver Gender
12	EDUCATION	Char	13			Max Education Level
13	OCCUPATION	Char	13	$13.	$13.	Driver Occupation
14	TRAVTIME	Num	8	4.		Distance to Work
15	CAR_USE	Char	10			Vehicle Use
16	BLUEBOOK	Num	8	DOLLAR10.		Value of Vehicle
17	TIF	Num	8			Time in Force
18	CAR_TYPE	Char	11			Type of Car
19	RED_CAR	Char	3			A Red Car
20	OLDCLAIM	Num	8	DOLLAR12.		Total Claims(Past 5 Years)
21	CLM_FREQ	Num	8			#Claims(Past 5 Years)
22	REVOKED	Char	3			License Revoked (Past 7 Years)
23	MVR_PTS	Num	8	5.		Motor Vehicle Record Points
24	CLM_AMT	Num	8	DOLLAR12.		Claim Amount
25	CAR_AGE	Num	8	4.		Vehicle Age
26	CLAIM_FLAG	Num	8			Claim Indicator
27	URBANICITY	Char	21			Home/Work Area

So, the charecter explanatory variables in this dataset are:

PARENT1 MSTATUS GENDER EDUCATION OCCUPATION CAR_USE CAR_TYPE RED_CAR RED_CAR URBANICITY

An association matrix it is very useful in order to figure out potential stron association between all pairs of categorical variables.

Let's select only the character explanatory variables in our dataset:

/*********************************************************************************/
/*********************************************************************************/
/* CORRELATION MATRIX FOR CATEGORICAL EXPLANATORY VARIABLES*/
/*********************************************************************************/
/*********************************************************************************/

data work.data_train;
    set mydata.CLAIM_HISTORY (keep= 
 PARENT1
 MSTATUS
 GENDER
 EDUCATION
 OCCUPATION
 CAR_USE
 CAR_TYPE
 RED_CAR
 RED_CAR
 URBANICITY);
run;

NOTE: Writing HTML5 Body file: STDOUT
NOTE: Data file MYDATA.CLAIM_HISTORY.DATA is in a format that is native to another host, or the file encoding does not match the 
      session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce 
      performance.
NOTE: There were 10302 observations read from the data set MYDATA.CLAIM_HISTORY.
NOTE: The data set WORK.DATA_TRAIN has 10302 observations and 9 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.02 seconds

Now we can use the SAS macro to create the association matrix. This macro has several parameters: the dataset name and the input character variables:

************************** MACRO PARAMETERS *******************************;
*Put the name of the modeling dataset:;
%let dataset = WORK.data_train;
*Put here the name of all the potential explanatory variables, the 
categorical ones:;
*Put the name of the categorical explanatory variables;
%let varstoconsider = 
PARENT1
MSTATUS
GENDER
EDUCATION
OCCUPATION
CAR_USE
CAR_TYPE
RED_CAR
RED_CAR
URBANICITY;

proc contents data=&dataset.(keep = &varstoconsider.) 
    nodetails varnum out=work.var_desc (keep=name type) noprint;
proc sort data = work.var_desc out = work.var_desc; by type;
run;

data test (keep=name);
    length name $80.;
    set work.var_desc;
    name = upcase(name);
    if type = 2 then output;
run;

NOTE: Writing HTML5 Body file: STDOUT
NOTE: The data set WORK.VAR_DESC has 9 observations and 2 variables.
NOTE: PROCEDURE CONTENTS used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds
      
NOTE: There were 9 observations read from the data set WORK.VAR_DESC.
NOTE: The data set WORK.VAR_DESC has 9 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      
NOTE: There were 9 observations read from the data set WORK.VAR_DESC.
NOTE: The data set WORK.TEST has 9 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Now we can run the macro code:

/****************************** MACRO STARTS HERE ***********************************/
%macro corr_matrix (table = );

%macro setup(dsout=_NULL_,dsin=&syslast,var=,ks=);
    %local dsout dsin var totvar ks;

    proc freq data=&dsin noprint;
        tables &var / out=__count;

data _NULL_;
    set __count end=last;
    retain len oldlen 0;
    nk + 1;
    len = length(&var);

    if ( len > oldlen ) then
        oldlen = len;

    if count > 1 then
        do;
            put "WARNING: The variable %upcase(&var)" /
                "WARNING: has " count " occurrences of the value "
                var;
        end;

    if last then
        do;
            if &ks > nk then
                do;
                    put "ERROR: K greater than the Number of values (N)";
                    put "ERROR: Value of K: &ks Value of N: " nk;
                    put "ERROR: EXITING Macro!!!!!!!!!!!!!!!!!!";
                end;

            call symput("totvar",trim(left(put(nk,5.))) );
            call symput("length",put(oldlen,3.) );
        end;
run;

%if &ks > &totvar %then
    %goto exit;

data __d1;
    set __count end=last;
    array _vals {&totvar} $ &length;
    retain _vals1-_vals&totvar;
    keep _vals1-_vals&totvar;
    _vals{_N_} = &var;

    if last then
        output;

    %comb(k=&ks,n=&totvar,len=&length)
    run;
%exit:
%mend setup;

%macro comb(dsout2=&dsout,dsin2=__d1,k=,n=,len=);
    %local dsin2 dsout2 numvar i start stop n k len;
    %let start = 1;
    %let stop = %eval( &n - &k);

data &dsout2;
    set &dsin2;
    array _vals {&n} $;
    array combo {&k} $ &len;
    keep j combo1 - combo&k;
    retain j 1;

    %do i = 1 %to &k;
        %let stop = %eval(&stop + 1);

        do var&i = &start to &stop;
            %let start = %str(%(var&i + 1 %));
    %end;

    %do i = 1 %to &k;
        combo{&i} = _vals{var&i};
    %end;

    output;
    j + 1;

    %do i = 1 %to &k;
        end;
    %end;
run;

%mend comb;

%setup(var=name,ks=2,dsout=combin);
 *proc print data=combin ;

* Determine the number of observations;
data _NULL_;
    if 0 then
        set combin nobs=n;
    call symputx('nrows',n);
    stop;
run;


%macro do_corr;
    %do obs = 1 %to &nrows;

data _null_;
    set combin;

    if j = &obs then
        call symputx("combo1", PUT(combo1, 35.));

    if j = &obs then
        call symputx("combo2", PUT(combo2, 35.));
run;

%put &combo1;
%put &combo2;

proc freq data = &table noprint;
    tables &combo1*&combo2 / chisq;
    output out=results_&obs chisq;
run;

data results_&obs (keep= combo1 combo2 _CRAMV_);
    length combo1 combo2 $30.;
    set results_&obs;
    combo1 = "&combo1";
    combo2 = "&combo2";
run;

proc append base=association_matrix data=results_&obs force;
run;

    %end;
%mend do_corr;

%do_corr;

proc sort data=association_matrix;
    by descending _CRAMV_;
run;

proc sort data=association_matrix out=association_matrix_sort;
by combo2 combo1;
run;

/* reshape the output to the wide format */
proc transpose data=association_matrix_sort out=fin (drop=_name_ _label_) let;
    by combo2;
    id combo1;
    var _CRAMV_;
run;

proc format;
    value backcolor 
    .7 - high    = "red"
    low - -.7    = "red"; 
run;

proc print data=association_matrix noobs;
title "Sorted Associations - Cramer's V";
    var _all_ / style(data) = [background=backcolor. ];
    format _CRAMV_ percent.;
run;

title "Expenditures Summary Using Traffic Lighting";
proc print data=fin;
    title "Association Matrix - Cramer's V";
    var _all_ / style(data) = [background=backcolor. ];
    format _numeric_ percent.;
run;

proc datasets lib=work nolist;
    delete __count __d1 combin association_matrix
           test association_matrix_sort fin results_: ;
run;

%mend corr_matrix;
/****************************** MACRO ENDS HERE ***********************************/

%corr_matrix(table = work.data_train);

/*proc datasets lib=work nolist;*/
/*   delete VAR_DESC;*/
/*run;*/
ods graphics on;

SAS Output

Sorted Associations - Cramer's V

The PRINT Procedure

Data Set WORK.ASSOCIATION_MATRIX

combo1	combo2	_CRAMV_
CAR_TYPE	GENDER	72%
CAR_USE	OCCUPATION	57%
EDUCATION	OCCUPATION	56%
CAR_TYPE	CAR_USE	54%
CAR_TYPE	RED_CAR	49%
MSTATUS	PARENT1	47%
OCCUPATION	URBANICITY	31%
CAR_USE	GENDER	28%
GENDER	OCCUPATION	25%
EDUCATION	URBANICITY	23%
CAR_USE	EDUCATION	22%
OCCUPATION	RED_CAR	17%
CAR_TYPE	OCCUPATION	14%
CAR_TYPE	EDUCATION	10%
OCCUPATION	PARENT1	10%
EDUCATION	PARENT1	9%
CAR_TYPE	URBANICITY	8%
GENDER	PARENT1	7%
CAR_TYPE	PARENT1	6%
EDUCATION	GENDER	5%
EDUCATION	MSTATUS	5%
GENDER	URBANICITY	5%
MSTATUS	OCCUPATION	4%
EDUCATION	RED_CAR	4%
PARENT1	URBANICITY	2%
CAR_TYPE	MSTATUS	2%
MSTATUS	RED_CAR	1%
GENDER	MSTATUS	( 0%)
CAR_USE	PARENT1	( 0%)
MSTATUS	URBANICITY	( 1%)
CAR_USE	MSTATUS	( 1%)
CAR_USE	URBANICITY	( 2%)
RED_CAR	URBANICITY	( 4%)
PARENT1	RED_CAR	( 4%)
CAR_USE	RED_CAR	( 19%)
GENDER	RED_CAR	( 66%)

Association Matrix - Cramer's V

The PRINT Procedure

Data Set WORK.FIN

Obs	combo2	CAR_TYPE	CAR_USE	EDUCATION	GENDER	MSTATUS	OCCUPATION	PARENT1	RED_CAR
1	CAR_USE	54%	.	.	.	.	.	.	.
2	EDUCATION	10%	22%	.	.	.	.	.	.
3	GENDER	72%	28%	5%	.	.	.	.	.
4	MSTATUS	2%	( 1%)	5%	( 0%)	.	.	.	.
5	OCCUPATION	14%	57%	56%	25%	4%	.	.	.
6	PARENT1	6%	( 0%)	9%	7%	47%	10%	.	.
7	RED_CAR	49%	( 19%)	4%	( 66%)	1%	17%	( 4%)	.
8	URBANICITY	8%	( 2%)	23%	5%	( 1%)	31%	2%	( 4%)

We obtain two outputs from the macro execution. The frist table is a list of all possible pairs of categorical variables and their associated Cramer's V statistic. The second table is a matrix that contains the same statistics. For those Cramver's V greater or equal to 70% (abs. value) the cell is highlighted in red.

Therefore, there is an strong positive associateship between the GENDER and the CAR_TYPE categorical variables. That means that some car types are used mainly by males and other car types are used mainly by females. Something that is plausible. These two varibles are quite associated, therefore, there are redundant information in our dataset. Both variables contains the same information. Therefore, we should select one of them, the most correlated with the target variable.

Let's analyze these two variables using a cross-table:

proc freq data = mydata.claim_history;
    tables gender * car_type;
run;

SAS Output

Association Matrix - Cramer's V

The FREQ Procedure

Table GENDER * CAR_TYPE

Cross-Tabular Freq Table

Frequency Percent Row Pct Col Pct

Table of GENDER by CAR_TYPE
GENDER(Driver Gender)	CAR_TYPE(Type of Car)
GENDER(Driver Gender)	Minivan	Panel Truck	Pickup	Sports Car	Van	z_SUV	Total
M	1780 17.28 37.42 66.07	813 7.89 17.09 95.31	1231 11.95 25.88 69.47	30 0.29 0.63 2.54	804 7.80 16.90 87.30	99 0.96 2.08 3.43	4757 46.18
z_F	914 8.87 16.48 33.93	40 0.39 0.72 4.69	541 5.25 9.76 30.53	1149 11.15 20.72 97.46	117 1.14 2.11 12.70	2784 27.02 50.21 96.57	5545 53.82
Total	2694 26.15	853 8.28	1772 17.20	1179 11.44	921 8.94	2883 27.98	10302 100.00

Take a look to the Col Pct. Now it is clear that the Panel Truck is used mainly by men (95.31%) and the Sports Car and the z_SUV are used mainly by women.