Pure premium regression model using XGBoost and random forest

Introduction

The most common types of ratemaking models in the insurance industry are frequency, severity, and pure premium models. Frequency models predict how often claims are made, and severity models predict claim amounts. The term pure premium is unique to insurance; it is the portion of the company’s expected cost that is “purely” attributed to loss. Pure premium does not include the general expense of doing business, such as overhead and commissions. Traditionally, the Tweedie distribution is used to model pure premium in a generalized linear model framework (GLM).

XGBoost has become incredibly popular on Kaggle in the last year for any problems dealing with structured data and accuracy predictions. Random forest was also incredibly popular on Kaggle few years ago. Let's play with both models using insurance data to model the pure premium.

Dataset description

Dataset used in this code: claim_history.csv
The claim_history.csv dataset comes from the help library of SAS Enterprise-Miner version 7.1
Data description:
- observations: 10302
- Variables: 27

Variable description:

#	Variable	Label
1	ID	Identification#
2	KIDSDRIV	#Driving Children
3	BIRTH	Date of Birth
4	AGE	Age
5	HOMEKIDS	#Children @Home
6	YOJ	Years on Job
7	INCOME	Income
8	PARENT1	Single Parent
9	HOME_VAL	Home Value
10	MSTATUS	Marital Status
11	GENDER	Driver Gender
12	EDUCATION	Max Education Level
13	OCCUPATION	Driver Occupation
14	TRAVTIME	Distance to Work
15	CAR_USE	Vehicle Use
16	BLUEBOOK	Value of Vehicle
17	TIF	Time in Force
18	CAR_TYPE	Type of Car
19	RED_CAR	A Red Car
20	OLDCLAIM	Total Claims(Past 5 Years)
21	CLM_FREQ	#Claims(Past 5 Years)
22	REVOKED	License Revoked (Past 7 Years)
23	MVR_PTS	Motor Vehicle Record Points
24	CLM_AMT	Claim Amount
25	CAR_AGE	Vehicle Age
26	CLAIM_FLAG	Claim Indicator
27	URBANICITY	Home/Work Area

Getting and Preprocessing the Data

Load packages and data
Drop unuseful features
Treatment of missing observations
Transform some features from continuous to categorical
Transform some categorical features into dummy variables
Clean some new features names
Scaling target variables
Splitting the data into training, testing, and validation sets

Load packages and data

First load the needed packages:

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import scipy as scipy
import pylab as pylab
import seaborn as sns
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
%matplotlib inline

Next, load the CSV file as a Pandas data frame:

claims = pd.read_csv('data/claim_history.csv')

Let's take a look to the data:

claims.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10302 entries, 0 to 10301
Data columns (total 27 columns):
ID            10302 non-null int64
KIDSDRIV      10302 non-null int64
BIRTH         10302 non-null object
AGE           10295 non-null float64
HOMEKIDS      10302 non-null int64
YOJ           9754 non-null float64
INCOME        9732 non-null object
PARENT1       10302 non-null object
HOME_VAL      9727 non-null object
MSTATUS       10302 non-null object
GENDER        10302 non-null object
EDUCATION     10302 non-null object
OCCUPATION    9637 non-null object
TRAVTIME      10302 non-null int64
CAR_USE       10302 non-null object
BLUEBOOK      10302 non-null object
TIF           10302 non-null int64
CAR_TYPE      10302 non-null object
RED_CAR       10302 non-null object
OLDCLAIM      10302 non-null object
CLM_FREQ      10302 non-null int64
REVOKED       10302 non-null object
MVR_PTS       10302 non-null int64
CLM_AMT       10302 non-null object
CAR_AGE       9663 non-null float64
CLAIM_FLAG    10302 non-null int64
URBANICITY    10302 non-null object
dtypes: float64(3), int64(8), object(16)
memory usage: 2.1+ MB

There are 27 columns. The target variable is CLM_AMT. Let's go deeper into the data:

claims.head()

	ID	BIRTH	AGE	HOMEKIDS	YOJ	INCOME	PARENT1	HOME_VAL	MSTATUS	...	CAR_TYPE	RED_CAR	OLDCLAIM	CLM_FREQ	REVOKED	MVR_PTS	CLM_AMT	CAR_AGE	URBANICITY
0	63581743	16MAR39	60.0	0	11.0	$67,349	No	$0	z_No	...	Minivan	yes	$4,461	2	No	3	$0	18.0	Highly Urban/ Urban
1	132761049	21JAN56	43.0	0	11.0	$91,449	No	$257,252	z_No	...	Minivan	yes	$0	0	No	0	$0	1.0	Highly Urban/ Urban
2	921317019	18NOV51	48.0	0	11.0	$52,881	No	$0	z_No	...	Van	yes	$0	0	No	2	$0	10.0	Highly Urban/ Urban
3	727598473	05MAR64	35.0	1	10.0	$16,039	No	$124,191	Yes	...	z_SUV	no	$38,690	2	No	3	$0	10.0	Highly Urban/ Urban
4	450221861	05JUN48	51.0	0	14.0	NaN	No	$306,251	Yes	...	Minivan	yes	$0	0	No	0	$0	6.0	Highly Urban/ Urban

5 rows × 27 columns

Convert some features from currency to float

Now it is clear that some interval features such INCOME or HOME_VAL have a currency format. Therefore, we need to convert currency to float:

# From currency to float
for field in ['INCOME', 'HOME_VAL', 'BLUEBOOK', 'OLDCLAIM', 'CLM_AMT']:
    claims.loc[:,field] = claims[field] = (claims[field].replace( '[\$,)]','', regex=True )
               .replace( '[(]','-',   regex=True ).astype(float))

Drop unuseful features

There are some variables that are unuseful in this context, we can drop them:

claims = claims.drop(['ID','BIRTH','CLAIM_FLAG'], axis=1)

Treatment of missing observations

Let's take a look to the missing observations:

claims.isnull().sum()

KIDSDRIV        0
AGE             7
HOMEKIDS        0
YOJ           548
INCOME        570
PARENT1         0
HOME_VAL      575
MSTATUS         0
GENDER          0
EDUCATION       0
OCCUPATION    665
TRAVTIME        0
CAR_USE         0
BLUEBOOK        0
TIF             0
CAR_TYPE        0
RED_CAR         0
OLDCLAIM        0
CLM_FREQ        0
REVOKED         0
MVR_PTS         0
CLM_AMT         0
CAR_AGE       639
URBANICITY      0
dtype: int64

There are some variables with few missing observations. The features YOJ,INCOME, HOME_VAL and CAR_AGE are interval features, it is easy to replace the missed observations with the mean of the variable:

# Inputation for interval features
for field in ['AGE','YOJ','INCOME','HOME_VAL','CAR_AGE']:
    claims.loc[:,field] = claims.fillna(claims.mean()[field])

the OCCUPATION column is categorical. In order to fill NaNs with the most frequent value from the categorical OCCUPATION column:

claims = claims.fillna(claims['OCCUPATION'].value_counts().index[0])

Transform some features from continuous to categorical

The following features KIDSDRIV, HOMEKIDS and CLM_FREQ are continuous in our dataset. We need to convert this variables into a string type because has no sense to treat these variables as continuous:

claims[['KIDSDRIV','HOMEKIDS','CLM_FREQ']] = claims[['KIDSDRIV','HOMEKIDS','CLM_FREQ']].astype(str)

Transform some categorical features into dummy variables

Here we have some categorical variables like KIDSDRIV or CAR_TYPE. To include these in our model, we'll need to make dummy variables. This is simple to do with Pandas thanks to get_dummies().

dummy_fields = ['KIDSDRIV','HOMEKIDS','PARENT1','MSTATUS','GENDER','EDUCATION',\
                'OCCUPATION','CAR_USE','CAR_TYPE','RED_CAR','CLM_FREQ','REVOKED',\
                'URBANICITY']
for each in dummy_fields:
    dummies = pd.get_dummies(claims[each], prefix=each, drop_first=False)
    claims = pd.concat([claims, dummies], axis=1)

fields_to_drop = ['KIDSDRIV','HOMEKIDS','PARENT1','MSTATUS','GENDER','EDUCATION',\
                  'OCCUPATION','CAR_USE','CAR_TYPE','RED_CAR','CLM_FREQ','REVOKED',\
                  'URBANICITY']
claims = claims.drop(fields_to_drop, axis=1)

Let's check how many variables we have now:

claims.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10302 entries, 0 to 10301
Data columns (total 61 columns):
AGE                                 10302 non-null float64
YOJ                                 10302 non-null float64
INCOME                              10302 non-null float64
HOME_VAL                            10302 non-null float64
TRAVTIME                            10302 non-null int64
BLUEBOOK                            10302 non-null float64
TIF                                 10302 non-null int64
OLDCLAIM                            10302 non-null float64
MVR_PTS                             10302 non-null int64
CLM_AMT                             10302 non-null float64
CAR_AGE                             10302 non-null float64
KIDSDRIV_0                          10302 non-null uint8
KIDSDRIV_1                          10302 non-null uint8
KIDSDRIV_2                          10302 non-null uint8
KIDSDRIV_3                          10302 non-null uint8
KIDSDRIV_4                          10302 non-null uint8
HOMEKIDS_0                          10302 non-null uint8
HOMEKIDS_1                          10302 non-null uint8
HOMEKIDS_2                          10302 non-null uint8
HOMEKIDS_3                          10302 non-null uint8
HOMEKIDS_4                          10302 non-null uint8
HOMEKIDS_5                          10302 non-null uint8
PARENT1_No                          10302 non-null uint8
PARENT1_Yes                         10302 non-null uint8
MSTATUS_Yes                         10302 non-null uint8
MSTATUS_z_No                        10302 non-null uint8
GENDER_M                            10302 non-null uint8
GENDER_z_F                          10302 non-null uint8
EDUCATION_<High School              10302 non-null uint8
EDUCATION_Bachelors                 10302 non-null uint8
EDUCATION_Masters                   10302 non-null uint8
EDUCATION_PhD                       10302 non-null uint8
EDUCATION_z_High School             10302 non-null uint8
OCCUPATION_Clerical                 10302 non-null uint8
OCCUPATION_Doctor                   10302 non-null uint8
OCCUPATION_Home Maker               10302 non-null uint8
OCCUPATION_Lawyer                   10302 non-null uint8
OCCUPATION_Manager                  10302 non-null uint8
OCCUPATION_Professional             10302 non-null uint8
OCCUPATION_Student                  10302 non-null uint8
OCCUPATION_z_Blue Collar            10302 non-null uint8
CAR_USE_Commercial                  10302 non-null uint8
CAR_USE_Private                     10302 non-null uint8
CAR_TYPE_Minivan                    10302 non-null uint8
CAR_TYPE_Panel Truck                10302 non-null uint8
CAR_TYPE_Pickup                     10302 non-null uint8
CAR_TYPE_Sports Car                 10302 non-null uint8
CAR_TYPE_Van                        10302 non-null uint8
CAR_TYPE_z_SUV                      10302 non-null uint8
RED_CAR_no                          10302 non-null uint8
RED_CAR_yes                         10302 non-null uint8
CLM_FREQ_0                          10302 non-null uint8
CLM_FREQ_1                          10302 non-null uint8
CLM_FREQ_2                          10302 non-null uint8
CLM_FREQ_3                          10302 non-null uint8
CLM_FREQ_4                          10302 non-null uint8
CLM_FREQ_5                          10302 non-null uint8
REVOKED_No                          10302 non-null uint8
REVOKED_Yes                         10302 non-null uint8
URBANICITY_Highly Urban/ Urban      10302 non-null uint8
URBANICITY_z_Highly Rural/ Rural    10302 non-null uint8
dtypes: float64(8), int64(3), uint8(50)
memory usage: 1.4 MB

Now we have 61 variables in our dataset.

Clean some new features names

The XGBoost and the random forest algorithms don't admit special symbols in the feature names such as <> or / or ,. Therefore, we need to rename those feature names:

claims = claims.rename(columns={'EDUCATION_<High School': 'EDUCATION_lt_High_School', 
                        'OCCUPATION_z_Blue Collar': 'OCCUPATION_z_Blue_Collar',
                        'CAR_TYPE_Panel Truck': 'CAR_TYPE_Panel_Truck',
                        'CAR_TYPE_Sports Car': 'CAR_TYPE_Sports_Car',
                        'URBANICITY_Highly Urban/ Urban': 'URBANICITY_Urban',
                        'URBANICITY_z_Highly Rural/ Rural': 'URBANICITY_Rural'})

plot_clm_amt = claims['CLM_AMT']

Scaling target variables

To make training the network easier, we'll standardize each of the continuous variables. That is, we'll shift and scale the variables such that they have zero mean and a standard deviation of 1.

The scaling factors are saved so we can go backwards when we use the network for predictions.

quant_features = ['AGE', 'YOJ', 'INCOME', 'HOME_VAL', 
                  'BLUEBOOK', 'OLDCLAIM', 'CLM_AMT', 'CAR_AGE']
# Store scalings in a dictionary so we can convert back later
scaled_features = {}
for each in quant_features:
    mean, std = claims[each].mean(), claims[each].std()
    scaled_features[each] = [mean, std]
    claims.loc[:, each] = (claims[each] - mean)/std

Splitting the data into training, testing, and validation sets

We'll save the data for the last approximately 21 days to use as a test set after we've trained the network. We'll use this set to make predictions and compare them with the actual number of riders.

# Split off random 50% of the data for training, 25% for testing and 25% for validation
np.random.seed(21)
sample = np.random.choice(claims.index, size=int(len(claims)*0.5), replace=False)
data, train = claims.ix[sample], claims.drop(sample)

sample = np.random.choice(data.index, size=int(len(data)*0.5), replace=False)
val, test = data.ix[sample], data.drop(sample)

# Split into features and targets
features, targets = train.drop('CLM_AMT', axis=1), train['CLM_AMT']
features_val, targets_val = val.drop('CLM_AMT', axis=1), val['CLM_AMT']
features_test, targets_test = test.drop('CLM_AMT', axis=1), test['CLM_AMT']

With this we've finished the data preparation tasks, we are ready to use XGBoost and random forest.

Fitting the model

Let's fit the XGBoost and the random forest to the data:

# fit XGBoost model to training data
xgb_model = XGBRegressor()
xgb_model.fit(features, targets)

XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0,
       learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='reg:linear', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)

# fit random forest to training data
rf_model = RandomForestRegressor(200)
rf_model.fit(features, targets)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=200, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

Make predictions for the test data

We are now ready to use the trained model to make predictions. We can make predictions using the fit model on the test dataset. To make predictions we use the scikit-learn function model.predict().

# make predictions for test data using XGBoost
y_pred_xgb = xgb_model.predict(features_test)
#predictions = [round(value ,2) for value in y_pred]

# make predictions for test data using random forest
y_pred_rf = rf_model.predict(features_test)
#predictions = [round(value ,2) for value in y_pred]

Evaluate model performance

Now that we have used the fit model to make predictions on new data, we can evaluate the performance of the predictions by comparing them to the expected values. For this we will use the built in mean_absolute_error function in scikit-learn.

# evaluate predictions using XGBoost
score = mean_absolute_error(targets_test, y_pred_xgb)
print("XGBoost Score: %.4f" % score)

XGBoost Score: 0.4036

# evaluate predictions using random forest
score = mean_absolute_error(targets_test, y_pred_rf)
print("Random forest Score: %.4f" % score)

Random forest Score: 0.4297

It is clear that the XGBoost fits the data better than the random forest. The absolute mean error is lower for the XGBoost. Let's go deeper and create a lift chart to compare the model performance from another point of view.

Create a lift chart to compere model performance

A classical chart used to compere regressions models (also classification) is a Lift chart. Let's create both lift charts.

def gini(actual,pred,weight=None):
    pdf= pd.DataFrame(scipy.vstack([actual,pred]).T,columns=['Actual','Predicted'],)
    pdf= pdf.sort_values('Predicted')
    if weight is None:
        pdf['Weight'] = 1.0

    pdf['CummulativeWeight'] = np.cumsum(pdf['Weight'])
    pdf['CummulativeWeightedActual'] = np.cumsum(pdf['Actual']*pdf['Weight'])
    TotalWeight = sum(pdf['Weight'])
    Numerator = sum(pdf['CummulativeWeightedActual']*pdf['Weight'])
    Denominator = sum(pdf['Actual']*pdf['Weight']*TotalWeight)
    Gini = 1.0 - 2.0 * Numerator/Denominator
    return Gini

gini(targets_test,y_pred_xgb,weight=None)

-13.489679386152188

gini(targets_test,y_pred_rf,weight=None)

-13.082823959818159

def mylift(actual,pred,weight=None,n=10,xlab='Predicted Decile',MyTitle='Model Performance Lift Chart'):
    pdf= pd.DataFrame(scipy.vstack([actual,pred]).T,columns=['Actual','Predicted'],)
    pdf= pdf.sort(columns='Predicted')
    if weight is None:
        pdf['Weight'] = 1.0

    pdf['CummulativeWeight'] = np.cumsum(pdf['Weight'])
    pdf['CummulativeWeightedActual'] = np.cumsum(pdf['Actual']*pdf['Weight'])
    TotalWeight = sum(pdf['Weight'])
    Numerator = sum(pdf['CummulativeWeightedActual']*pdf['Weight'])
    Denominator = sum(pdf['Actual']*pdf['Weight']*TotalWeight)
    Gini = 1.0 - 2.0 * Numerator/Denominator
    NormalizedGini = Gini/ gini(pdf['Actual'],pdf['Actual'])
    GiniTitle = 'Normalized Gini = '+ str(round(NormalizedGini,4))

    pdf['PredictedDecile'] = np.round(pdf['CummulativeWeight']*n /TotalWeight + 0.5,decimals=0)
    pdf['PredictedDecile'][pdf['PredictedDecile'] < 1.0] = 1.0
    pdf['PredictedDecile'][pdf['PredictedDecile'] > n] = n 

    pdf['WeightedPrediction'] = pdf['Predicted']*pdf['Weight']
    pdf['WeightedActual'] = pdf['Actual']*pdf['Weight']
    lift_df = pdf.groupby('PredictedDecile').agg({'WeightedPrediction': np.sum,'Weight':np.sum,'WeightedActual':np.sum,'PredictedDecile':np.size})
    nms = lift_df.columns.values
    nms[1] = 'Count'
    lift_df.columns = nms
    lift_df['AveragePrediction'] = lift_df['WeightedPrediction']/lift_df['Weight']
    lift_df['AverageActual'] = lift_df['WeightedActual']/lift_df['Weight']
    lift_df['AverageError'] = lift_df['AverageActual']/lift_df['AveragePrediction']

    d = pd.DataFrame(lift_df.index)
    p = lift_df['AveragePrediction']
    a = lift_df['AverageActual']
    pylab.plot(d,p,label='Predicted',color='blue',marker='o')
    pylab.plot(d,a,label='Actual',color='red',marker='d')
    pylab.legend(['Predicted','Actual'], loc=2)
    pylab.title(MyTitle +'\n'+GiniTitle)
    pylab.xlabel(xlab)
    pylab.ylabel('Actual vs. Predicted')
    pylab.grid()
    pylab.show()

mylift(targets_test,y_pred_xgb,weight=None,n=10,xlab='Predicted Decile',\
       MyTitle='Model Performance Lift Chart XGBoost')

png

mylift(targets_test,y_pred_rf,weight=None,n=10,xlab='Predicted Decile',\
       MyTitle='Model Performance Lift Chart Random Forest')

png

Although the random forest fits the data better than the XGBoost for the first nine percentiles, the XGBoost fits much better the tenth percentile. This is interesting because fitting a risk premium model is quite difficult because the great majority of the policies has no claim and therefore, the loss is zero as you can see in this image:

HTML5 Icon

The policies with positive claims are mainly in the tenth decile. The random forest performs really bad for the policies with positive claims.

Here we can see the distribution of the risk premium in our dataset:

sns.distplot(plot_clm_amt)

<matplotlib.axes._subplots.AxesSubplot at 0x1216f0990>

png