CREDIT RISK ANALYZER

1. Introduction:

Context

Before a bank takes on the risk of issuing a credit card, it needs to protect its interests. Banks mine the records of their previous card holders to understand their behaviour, because predicting whether a person the bank does not know personally will default is a complex problem. Along with their own records, banks also use CIBIL (credit bureau) data. From all of this data, banks want to learn a pattern that tells them who is likely to default and who is not. Our task is to use this dataset to build a decision tree model that, given the recorded parameters for a new applicant, predicts whether he or she is likely to be a defaulter.

Content

The dataset has 50636 observations and 13 columns: 12 features plus the defaulter indicator. The features are:

age and gender are the age and gender of the card holder.
education is the highest educational qualification the card holder has acquired.
occupation can be Professional, Salaried, Student or Business.
organization_type can be Tier 1, Tier 2, Tier 3 or None.
seniority denotes the career level the card holder is at.
annual_income is the gross annual income of the card holder.
disposable_income is annual income minus recurring expenses.
house_type is Owned, Rented, Family or Company provided.
vehicle_type is Four Wheeler, Two Wheeler or None.
marital_status is the marital status of the card holder.
no_card is the number of other credit cards the card holder already holds.
At the end of each row, default indicates whether the card holder was a defaulter: 1 if so, 0 otherwise.

2. Import Libraries:

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import metrics
%matplotlib inline

3. Load Dataset:

In [2]:
df = pd.read_csv('credit_data.csv')

4. Exploratory Data Analysis:

In [3]:
df.head()
Out[3]:
age gender education occupation organization_type seniority annual_income disposable_income house_type vehicle_type marital_status no_card default
0 19 Male Graduate Professional None None 186319 21625 Family None Married 0 1
1 18 Male Under Graduate Professional None None 277022 20442 Rented None Married 0 1
2 29 Male Under Graduate Salaried None Entry 348676 24404 Rented None Married 1 1
3 18 Male Graduate Student None None 165041 2533 Rented None Married 0 1
4 26 Male Post Graduate Salaried None Mid-level 1 348745 19321 Rented None Married 1 1
In [4]:
df.columns
Out[4]:
Index(['age', 'gender', 'education', 'occupation', 'organization_type',
       'seniority', 'annual_income', 'disposable_income', 'house_type',
       'vehicle_type', 'marital_status', 'no_card', 'default'],
      dtype='object')
In [5]:
df.shape
Out[5]:
(50636, 13)
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50636 entries, 0 to 50635
Data columns (total 13 columns):
age                  50636 non-null int64
gender               50636 non-null object
education            50636 non-null object
occupation           50636 non-null object
organization_type    50636 non-null object
seniority            50636 non-null object
annual_income        50636 non-null int64
disposable_income    50636 non-null int64
house_type           50636 non-null object
vehicle_type         50636 non-null object
marital_status       50636 non-null object
no_card              50636 non-null int64
default              50636 non-null int64
dtypes: int64(5), object(8)
memory usage: 5.0+ MB
In [7]:
df.isnull().sum()
Out[7]:
age                  0
gender               0
education            0
occupation           0
organization_type    0
seniority            0
annual_income        0
disposable_income    0
house_type           0
vehicle_type         0
marital_status       0
no_card              0
default              0
dtype: int64
In [8]:
df.describe()
Out[8]:
age annual_income disposable_income no_card default
count 50636.000000 50636.000000 50636.000000 50636.000000 50636.000000
mean 29.527411 277243.989889 18325.788569 0.509815 0.158425
std 8.816532 153838.973755 12677.864844 0.669883 0.365142
min 18.000000 50000.000000 1000.000000 0.000000 0.000000
25% 25.000000 154052.250000 8317.750000 0.000000 0.000000
50% 27.000000 258860.500000 15770.000000 0.000000 0.000000
75% 30.000000 385071.500000 24135.000000 1.000000 0.000000
max 64.000000 999844.000000 49999.000000 2.000000 1.000000
In [9]:
df['default'].value_counts()
Out[9]:
0    42614
1     8022
Name: default, dtype: int64
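Only about 15.8% of card holders are defaulters, so the classes are imbalanced; the rate follows directly from the counts above (a one-liner to confirm it):

print("Default rate: {:.1%}".format(df['default'].mean()))  # 8022 / 50636 ≈ 15.8%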
In [10]:
obj_df = df.select_dtypes(include=['object']).copy()
In [11]:
obj_df.head()
Out[11]:
gender education occupation organization_type seniority house_type vehicle_type marital_status
0 Male Graduate Professional None None Family None Married
1 Male Under Graduate Professional None None Rented None Married
2 Male Under Graduate Salaried None Entry Rented None Married
3 Male Graduate Student None None Rented None Married
4 Male Post Graduate Salaried None Mid-level 1 Rented None Married
In [12]:
# Looking at the number of unique values in each categorical column
print(obj_df.nunique())
gender               2
education            4
occupation           4
organization_type    4
seniority            6
house_type           4
vehicle_type         3
marital_status       3
dtype: int64
In [13]:
print("Gender : ",obj_df.gender.unique())
print("Education : ",obj_df.education.unique())
print("Occupation : ",obj_df.occupation.unique())
print("Organization Type : ",obj_df.organization_type.unique())
print("Seniority : ",obj_df.seniority.unique())
print("House Type : ",obj_df.house_type.unique())
print("Vehicle Type : ",obj_df.vehicle_type.unique())
print("Marital Status : ",obj_df.marital_status.unique())
Gender :  ['Male' 'Female']
Education :  ['Graduate' 'Under Graduate' 'Post Graduate' 'Other']
Occupation :  ['Professional' 'Salaried' 'Student' 'Business']
Organization Type :  ['None' 'Tier 3' 'Tier 2' 'Tier 1']
Seniority :  ['None' 'Entry' 'Mid-level 1' 'Junior' 'Mid-level 2' 'Senior']
House Type :  ['Family' 'Rented' 'Company provided' 'Owned']
Vehicle Type :  ['None' 'Two Wheeler' 'Four Wheeler']
Marital Status :  ['Married' 'Single' 'Other']

5. Data Visualization:

In [14]:
def plot_bar_graph(column_name):
    counts = column_name.value_counts()
    sns.set(style="darkgrid")
    sns.barplot(x=counts.index, y=counts.values, alpha=0.9)
    plt.title('Frequency Distribution of {} Levels using Bar Plot'.format(column_name.name))
    plt.ylabel('Number of Occurrences', fontsize=12)
    plt.xlabel('{}'.format(column_name.name), fontsize=12)
    plt.show()
In [15]:
def plot_pie_graph(column_name):
    labels = column_name.astype('category').cat.categories.tolist()
    counts = column_name.value_counts()
    sizes = [counts[var_cat] for var_cat in labels]
    fig1, ax1 = plt.subplots()
    ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True)  # autopct shows the percentage on each wedge
    ax1.axis('equal')
    plt.title('Frequency Distribution of {} Levels using Pie Chart'.format(column_name.name))
    plt.show()
In [16]:
for col in obj_df.columns:
    plot_bar_graph(obj_df[col])
    plot_pie_graph(obj_df[col])
In [17]:
sns.histplot(df.age, kde=True, color="r")  # histplot supersedes the deprecated distplot
plt.show()
In [18]:
sns.histplot(df.annual_income, kde=True, color="g")
plt.show()
In [19]:
sns.histplot(df.disposable_income, kde=True, color="b")
plt.show()

6. Data Preprocessing:

Converting Categorical Data to Numerical Data:

In [20]:
def convert_cat_to_num(columns):
    # pd.factorize replaces each category with an integer code,
    # assigned in order of first appearance in the column
    for col in columns:
        df[col] = pd.factorize(df[col])[0]
convert_cat_to_num(df.select_dtypes(include=['object']).columns)
df.head(10)
Out[20]:
age gender education occupation organization_type seniority annual_income disposable_income house_type vehicle_type marital_status no_card default
0 19 0 0 0 0 0 186319 21625 0 0 0 0 1
1 18 0 1 0 0 0 277022 20442 1 0 0 0 1
2 29 0 1 1 0 1 348676 24404 1 0 0 1 1
3 18 0 0 2 0 0 165041 2533 1 0 0 0 1
4 26 0 2 1 0 2 348745 19321 1 0 0 1 1
5 26 1 3 2 0 0 404972 22861 0 0 1 0 1
6 28 0 1 2 0 0 231185 20464 0 0 0 0 1
7 24 1 1 1 0 1 102554 42159 0 0 0 1 1
8 26 1 1 1 0 3 226786 19817 0 0 1 0 1
9 26 0 0 1 0 2 250424 5271 0 1 0 1 1
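As a standalone illustration of what pd.factorize does (a minimal sketch, independent of this dataset):

import pandas as pd

s = pd.Series(['Graduate', 'Under Graduate', 'Graduate', 'Post Graduate'])
codes, uniques = pd.factorize(s)
print(codes)    # [0 1 0 2] -- one integer code per value, in order of first appearance
print(uniques)  # Index(['Graduate', 'Under Graduate', 'Post Graduate'], dtype='object')

Keep in mind these codes are nominal labels: the tree will treat them as ordered numbers when choosing split thresholds, which is usually tolerable for decision trees but worth remembering for genuinely ordered columns such as seniority.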
In [21]:
featurecolumns = df.columns.difference(['default'])
featurecolumns
Out[21]:
Index(['age', 'annual_income', 'disposable_income', 'education', 'gender',
       'house_type', 'marital_status', 'no_card', 'occupation',
       'organization_type', 'seniority', 'vehicle_type'],
      dtype='object')

Checking Data Correlation:

Note that after factorization the categorical columns carry arbitrary integer codes, so their Pearson correlations should be read only as a rough screen.

In [22]:
plt.figure(figsize=(14,12))
sns.heatmap(df.corr(),annot=True,fmt="0.2f",cmap="coolwarm")
plt.show()

7. Splitting the Data:

In [23]:
train_X, test_X, train_y, test_y = train_test_split(df[featurecolumns], df['default'], test_size=0.2, random_state=43)
In [24]:
print (train_X.shape, train_y.shape)
print (test_X.shape, test_y.shape)
(40508, 12) (40508,)
(10128, 12) (10128,)
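Since only ~16% of the observations are defaulters, it may be worth stratifying the split so both subsets keep the same class ratio. The run above does not stratify; a sketch using the standard stratify argument:

train_X, test_X, train_y, test_y = train_test_split(
    df[featurecolumns], df['default'],
    test_size=0.2, random_state=43,
    stratify=df['default'])  # preserve the ~16% defaulter share in train and test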

8. Model Building and Diagnostics:

1. Decision Tree with Entropy Criterion:

Both trees below are lightly pre-pruned: min_samples_split=10 means a node must hold at least 10 samples to be split, and min_samples_leaf=10 means every leaf must keep at least 10.

In [25]:
dtree = DecisionTreeClassifier(criterion='entropy', random_state=0,
                               min_samples_leaf=10, min_samples_split=10)
In [26]:
dtree.fit(train_X,train_y)
Out[26]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=10, min_samples_split=10,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=0, splitter='best')
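Before scoring, it can be useful to check how large the grown tree is and how much the pre-pruning constrained it (get_depth and get_n_leaves are standard DecisionTreeClassifier methods):

# Size of the grown tree; useful for judging how much the pre-pruning constrained it
print("Depth:", dtree.get_depth(), "Leaves:", dtree.get_n_leaves())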
In [27]:
y_pred_entropy=dtree.predict(test_X)
In [28]:
Score_entropy=accuracy_score(test_y,y_pred_entropy)
print("Accuracy: %0.2f" % (round(Score_entropy*100,2)))
Accuracy: 84.37
In [29]:
cm_dtclass = metrics.confusion_matrix(test_y,y_pred_entropy,labels = [1,0])
cm_dtclass
Out[29]:
array([[ 509, 1134],
       [ 449, 8036]], dtype=int64)
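Reading the matrix with labels=[1, 0]: of the 1643 actual defaulters, only 509 are caught and 1134 are missed, so the 84% accuracy is driven largely by the majority class. Per-class precision and recall make this explicit (a sketch using scikit-learn's classification_report):

from sklearn.metrics import classification_report

# Row "1" is the defaulter class; its recall is the share of actual defaulters caught
print(classification_report(test_y, y_pred_entropy, digits=3))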
In [30]:
from sklearn.metrics import roc_curve, auc
def plot_roc_curve(fper, tper):
    # labelled ROC curve plus the diagonal chance line
    plt.plot(fper, tper, label='Decision Tree (area = %0.2f)' % auc(fper, tper))
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()
probs = dtree.predict_proba(test_X)[:, 1]  # class-1 probabilities; hard 0/1 predictions give only a single ROC point
fper, tper, thresholds = roc_curve(test_y, probs)
plot_roc_curve(fper, tper)
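For a single scalar summary without the plot, roc_auc_score computes the same area directly:

from sklearn.metrics import roc_auc_score

# AUC from the predicted defaulter probabilities (matches the legend of the plot above)
print("AUC: %0.3f" % roc_auc_score(test_y, dtree.predict_proba(test_X)[:, 1]))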

2. Decision Tree with Gini Criterion:

In [31]:
dtree_gini = DecisionTreeClassifier(criterion='gini', random_state=0,
                                    min_samples_leaf=10, min_samples_split=10)
In [32]:
dtree_gini.fit(train_X, train_y)
Out[32]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=10, min_samples_split=10,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=0, splitter='best')
In [33]:
y_pred_gini = dtree_gini.predict(test_X)
In [34]:
Score_gini=accuracy_score(test_y,y_pred_gini)
print("Accuracy: %0.2f" % (round(Score_gini*100,2)))
Accuracy: 84.24
In [35]:
cm_dtclass2 = metrics.confusion_matrix(test_y,y_pred_gini,labels = [1,0])
cm_dtclass2
Out[35]:
array([[ 488, 1155],
       [ 441, 8044]], dtype=int64)
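It can also be instructive to see which features drive the splits; impurity-based importances come for free with a fitted tree (a minimal sketch):

# Importances aligned with the training columns, largest first
importances = pd.Series(dtree_gini.feature_importances_, index=featurecolumns).sort_values(ascending=False)
print(importances)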
In [36]:
probs = dtree_gini.predict_proba(test_X)[:, 1]  # defaulter-class probabilities
fper, tper, thresholds = roc_curve(test_y, probs)
plot_roc_curve(fper, tper)

9. Cross Validation with Stratified K-Fold:

In [37]:
headers = list(df.columns.values)
In [38]:
x = df[headers[:-1]]
y = df[headers[-1:]].values.ravel()
In [39]:
skf = StratifiedKFold(n_splits=10)
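Note that StratifiedKFold only balances the class ratio per fold; without shuffling, each fold is built from contiguous blocks of rows, and the per-fold results below degrade sharply in the last three folds, suggesting the rows are ordered. A sketch of the usual remedy (shuffle and random_state are standard StratifiedKFold arguments, not used in the run below):

# Shuffle rows within each class before assigning folds
skf_shuffled = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)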
In [44]:
def SKFold(x, y, skf, model):
    scores = []

    for train_index, test_index in skf.split(x, y):
        # ".loc" indexing works here because df carries a default RangeIndex
        x_train, x_test = x.loc[train_index], x.loc[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # fit the classifier on this fold's training split
        model.fit(x_train, y_train)

        # predict on the held-out fold
        predicted_y = model.predict(x_test)

        # score the fold and print its confusion matrix
        accuracy = metrics.accuracy_score(y_test, predicted_y)
        scores.append(accuracy)
        print("Accuracy for {}: {} ".format(model.criterion, accuracy * 100))
        print(metrics.confusion_matrix(y_test, predicted_y, labels=[1, 0]))
    print("\n")
    print("Max Accuracy for {}: {} ".format(model.criterion, np.max(scores) * 100))
    print("Min Accuracy for {}: {} ".format(model.criterion, np.min(scores) * 100))
    print("Mean Accuracy for {}: {} ".format(model.criterion, np.mean(scores) * 100))
    print("\n")
In [45]:
SKFold(x, y, skf, dtree)
SKFold(x, y, skf, dtree_gini)
Accuracy for entropy: 86.00197433366239 
[[ 315  488]
 [ 221 4041]]
Accuracy for entropy: 86.10069101678184 
[[ 337  466]
 [ 238 4024]]
Accuracy for entropy: 85.68325434439178 
[[ 322  480]
 [ 245 4017]]
Accuracy for entropy: 85.48578199052133 
[[ 318  484]
 [ 251 4011]]
Accuracy for entropy: 85.85818684574363 
[[ 324  478]
 [ 238 4023]]
Accuracy for entropy: 85.20639936796366 
[[ 284  518]
 [ 231 4030]]
Accuracy for entropy: 85.66067548884061 
[[ 316  486]
 [ 240 4021]]
Accuracy for entropy: 81.19691882283232 
[[  93  709]
 [ 243 4018]]
Accuracy for entropy: 80.50562907367174 
[[  50  752]
 [ 235 4026]]
Accuracy for entropy: 80.22911317400751 
[[  40  762]
 [ 239 4022]]


Max Accuracy for entropy: 86.10069101678184 
Min Accuracy for entropy: 80.22911317400751 
Mean Accuracy for entropy: 84.19286244584168 


Accuracy for gini: 86.14017769002962 
[[ 312  491]
 [ 211 4051]]
Accuracy for gini: 86.21915103652518 
[[ 339  464]
 [ 234 4028]]
Accuracy for gini: 85.18957345971565 
[[ 311  491]
 [ 259 4003]]
Accuracy for gini: 85.6437598736177 
[[ 313  489]
 [ 238 4024]]
Accuracy for gini: 86.09520047402725 
[[ 322  480]
 [ 224 4037]]
Accuracy for gini: 85.5816709460794 
[[ 286  516]
 [ 214 4047]]
Accuracy for gini: 85.73968003160182 
[[ 296  506]
 [ 216 4045]]
Accuracy for gini: 80.76239383764566 
[[  94  708]
 [ 266 3995]]
Accuracy for gini: 80.46612680229113 
[[  50  752]
 [ 237 4024]]
Accuracy for gini: 80.24886430969781 
[[  41  761]
 [ 239 4022]]


Max Accuracy for gini: 86.21915103652518 
Min Accuracy for gini: 80.24886430969781 
Mean Accuracy for gini: 84.20865984612313 
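
For reference, scikit-learn can do the same fold bookkeeping in one call; cross_val_score returns one accuracy per fold (a minimal sketch):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(dtree, x, y, cv=skf, scoring='accuracy')
print("Mean accuracy: {:.2f}%".format(scores.mean() * 100))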


10. Results:

Decision Tree with Entropy Criterion (hold-out test set): 84.37%
Decision Tree with Gini Criterion (hold-out test set): 84.24%
Decision Tree with Entropy Criterion, Stratified K-Fold (best fold): 86.10% (mean across folds: 84.19%)
Decision Tree with Gini Criterion, Stratified K-Fold (best fold): 86.22% (mean across folds: 84.21%)