CREDIT RISK ANALYZER

1. Introduction:

Context

Before a bank takes on the risk of issuing a credit card, it needs to protect its interests. Banks mine the records of their previous card holders to understand their behaviour, because predicting whether a person the bank does not know personally will default is a complex problem. Along with their own records, banks also use CIBIL (credit bureau) data. From all of this data, banks want to learn a pattern that tells them who is likely to default and who is not. Our task is to use this dataset to build a decision tree model that, given the recorded parameters for a new applicant, predicts whether he or she is likely to be a defaulter.

Content

The dataset has 50636 observations and 13 columns: 12 features plus the defaulter indicator. The features are:

age and gender are the age and gender of the card holder.
education is the highest educational qualification the card holder has acquired.
occupation can be Professional, Salaried, Student or Business.
organization_type can be Tier 1, Tier 2, Tier 3 or None.
seniority denotes the career level the card holder is at.
annual_income is the gross annual income of the card holder.
disposable_income is annual income minus recurring expenses.
house_type is Owned, Rented, Family or Company provided.
vehicle_type is Four Wheeler, Two Wheeler or None.
marital_status is the marital status of the card holder.
no_card is the number of other credit cards the card holder already holds.
At the end of each row, default indicates whether the card holder was a defaulter: 1 if so, 0 otherwise.

2. Import Libraries:

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import metrics
%matplotlib inline

3. Load Dataset:

In [2]:
df = pd.read_csv('credit_data.csv')

4. Exploratory Data Analysis:

In [3]:
df.head()
Out[3]:
age gender education occupation organization_type seniority annual_income disposable_income house_type vehicle_type marital_status no_card default
0 19 Male Graduate Professional None None 186319 21625 Family None Married 0 1
1 18 Male Under Graduate Professional None None 277022 20442 Rented None Married 0 1
2 29 Male Under Graduate Salaried None Entry 348676 24404 Rented None Married 1 1
3 18 Male Graduate Student None None 165041 2533 Rented None Married 0 1
4 26 Male Post Graduate Salaried None Mid-level 1 348745 19321 Rented None Married 1 1
In [4]:
df.columns
Out[4]:
Index(['age', 'gender', 'education', 'occupation', 'organization_type',
       'seniority', 'annual_income', 'disposable_income', 'house_type',
       'vehicle_type', 'marital_status', 'no_card', 'default'],
      dtype='object')
In [5]:
df.shape
Out[5]:
(50636, 13)
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50636 entries, 0 to 50635
Data columns (total 13 columns):
age                  50636 non-null int64
gender               50636 non-null object
education            50636 non-null object
occupation           50636 non-null object
organization_type    50636 non-null object
seniority            50636 non-null object
annual_income        50636 non-null int64
disposable_income    50636 non-null int64
house_type           50636 non-null object
vehicle_type         50636 non-null object
marital_status       50636 non-null object
no_card              50636 non-null int64
default              50636 non-null int64
dtypes: int64(5), object(8)
memory usage: 5.0+ MB
In [7]:
df.isnull().sum()
Out[7]:
age                  0
gender               0
education            0
occupation           0
organization_type    0
seniority            0
annual_income        0
disposable_income    0
house_type           0
vehicle_type         0
marital_status       0
no_card              0
default              0
dtype: int64
In [8]:
df.describe()
Out[8]:
age annual_income disposable_income no_card default
count 50636.000000 50636.000000 50636.000000 50636.000000 50636.000000
mean 29.527411 277243.989889 18325.788569 0.509815 0.158425
std 8.816532 153838.973755 12677.864844 0.669883 0.365142
min 18.000000 50000.000000 1000.000000 0.000000 0.000000
25% 25.000000 154052.250000 8317.750000 0.000000 0.000000
50% 27.000000 258860.500000 15770.000000 0.000000 0.000000
75% 30.000000 385071.500000 24135.000000 1.000000 0.000000
max 64.000000 999844.000000 49999.000000 2.000000 1.000000
In [9]:
df['default'].value_counts()
Out[9]:
0    42614
1     8022
Name: default, dtype: int64
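Only about 15.8% of card holders are defaulters, so the classes are imbalanced; the rate follows directly from the counts above (a one-liner to confirm it):

print("Default rate: {:.1%}".format(df['default'].mean()))  # 8022 / 50636 ≈ 15.8%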
In [10]:
obj_df = df.select_dtypes(include=['object']).copy()
In [11]:
obj_df.head()
Out[11]:
gender education occupation organization_type seniority house_type vehicle_type marital_status
0 Male Graduate Professional None None Family None Married
1 Male Under Graduate Professional None None Rented None Married
2 Male Under Graduate Salaried None Entry Rented None Married
3 Male Graduate Student None None Rented None Married
4 Male Post Graduate Salaried None Mid-level 1 Rented None Married
In [12]:
# Looking at the number of unique values in each categorical column
print(obj_df.nunique())
gender               2
education            4
occupation           4
organization_type    4
seniority            6
house_type           4
vehicle_type         3
marital_status       3
dtype: int64
In [13]:
print("Gender : ",obj_df.gender.unique())
print("Education : ",obj_df.education.unique())
print("Occupation : ",obj_df.occupation.unique())
print("Organization Type : ",obj_df.organization_type.unique())
print("Seniority : ",obj_df.seniority.unique())
print("House Type : ",obj_df.house_type.unique())
print("Vehicle Type : ",obj_df.vehicle_type.unique())
print("Marital Status : ",obj_df.marital_status.unique())
Gender :  ['Male' 'Female']
Education :  ['Graduate' 'Under Graduate' 'Post Graduate' 'Other']
Occupation :  ['Professional' 'Salaried' 'Student' 'Business']
Organization Type :  ['None' 'Tier 3' 'Tier 2' 'Tier 1']
Seniority :  ['None' 'Entry' 'Mid-level 1' 'Junior' 'Mid-level 2' 'Senior']
House Type :  ['Family' 'Rented' 'Company provided' 'Owned']
Vehicle Type :  ['None' 'Two Wheeler' 'Four Wheeler']
Marital Status :  ['Married' 'Single' 'Other']

5. Data Visualization:

In [14]:
def plot_bar_graph(column_name):
    counts = column_name.value_counts()
    sns.set(style="darkgrid")
    sns.barplot(x=counts.index, y=counts.values, alpha=0.9)
    plt.title('Frequency Distribution of {} Levels using Bar Plot'.format(column_name.name))
    plt.ylabel('Number of Occurrences', fontsize=12)
    plt.xlabel('{}'.format(column_name.name), fontsize=12)
    plt.show()
In [15]:
def plot_pie_graph(column_name):
    labels = column_name.astype('category').cat.categories.tolist()
    counts = column_name.value_counts()
    sizes = [counts[var_cat] for var_cat in labels]
    fig1, ax1 = plt.subplots()
    ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True)  # autopct shows the percentage on each wedge
    ax1.axis('equal')
    plt.title('Frequency Distribution of {} Levels using Pie Chart'.format(column_name.name))
    plt.show()
In [16]:
for col in obj_df.columns:
    plot_bar_graph(obj_df[col])
    plot_pie_graph(obj_df[col])
In [17]:
sns.histplot(df.age, kde=True, color="r")  # histplot supersedes the deprecated distplot
plt.show()
In [18]:
sns.histplot(df.annual_income, kde=True, color="g")
plt.show()
In [19]:
sns.histplot(df.disposable_income, kde=True, color="b")
plt.show()

6. Data Preprocessing:

Converting Categorical Data to Numerical Data:

In [20]:
def convert_cat_to_num(columns):
    # pd.factorize replaces each category with an integer code,
    # assigned in order of first appearance in the column
    for col in columns:
        df[col] = pd.factorize(df[col])[0]
convert_cat_to_num(df.select_dtypes(include=['object']).columns)
df.head(10)
Out[20]:
age gender education occupation organization_type seniority annual_income disposable_income house_type vehicle_type marital_status no_card default
0 19 0 0 0 0 0 186319 21625 0 0 0 0 1
1 18 0 1 0 0 0 277022 20442 1 0 0 0 1
2 29 0 1 1 0 1 348676 24404 1 0 0 1 1
3 18 0 0 2 0 0 165041 2533 1 0 0 0 1
4 26 0 2 1 0 2 348745 19321 1 0 0 1 1
5 26 1 3 2 0 0 404972 22861 0 0 1 0 1
6 28 0 1 2 0 0 231185 20464 0 0 0 0 1
7 24 1 1 1 0 1 102554 42159 0 0 0 1 1
8 26 1 1 1 0 3 226786 19817 0 0 1 0 1
9 26 0 0 1 0 2 250424 5271 0 1 0 1 1
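As a standalone illustration of what pd.factorize does (a minimal sketch, independent of this dataset):

import pandas as pd

s = pd.Series(['Graduate', 'Under Graduate', 'Graduate', 'Post Graduate'])
codes, uniques = pd.factorize(s)
print(codes)    # [0 1 0 2] -- one integer code per value, in order of first appearance
print(uniques)  # Index(['Graduate', 'Under Graduate', 'Post Graduate'], dtype='object')

Keep in mind these codes are nominal labels: the tree will treat them as ordered numbers when choosing split thresholds, which is usually tolerable for decision trees but worth remembering for genuinely ordered columns such as seniority.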
In [21]:
featurecolumns = df.columns.difference(['default'])
featurecolumns
Out[21]:
Index(['age', 'annual_income', 'disposable_income', 'education', 'gender',
       'house_type', 'marital_status', 'no_card', 'occupation',
       'organization_type', 'seniority', 'vehicle_type'],
      dtype='object')

Checking Data Correlation:

Note that after factorization the categorical columns carry arbitrary integer codes, so their Pearson correlations should be read only as a rough screen.

In [22]:
plt.figure(figsize=(14,12))
sns.heatmap(df.corr(),annot=True,fmt="0.2f",cmap="coolwarm")
plt.show()

7. Splitting the Data:

In [23]:
train_X, test_X, train_y, test_y = train_test_split(df[featurecolumns], df['default'], test_size=0.2, random_state=43)
In [24]:
print (train_X.shape, train_y.shape)
print (test_X.shape, test_y.shape)
(40508, 12) (40508,)
(10128, 12) (10128,)
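Since only ~16% of the observations are defaulters, it may be worth stratifying the split so both subsets keep the same class ratio. The run above does not stratify; a sketch using the standard stratify argument:

train_X, test_X, train_y, test_y = train_test_split(
    df[featurecolumns], df['default'],
    test_size=0.2, random_state=43,
    stratify=df['default'])  # preserve the ~16% defaulter share in train and test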

8. Model Building and Diagnostics:

1. Decision Tree with Entropy Criterion:

Both trees below are lightly pre-pruned: min_samples_split=10 means a node must hold at least 10 samples to be split, and min_samples_leaf=10 means every leaf must keep at least 10.

In [25]:
dtree = DecisionTreeClassifier(criterion='entropy', random_state=0,
                               min_samples_leaf=10, min_samples_split=10)
In [26]:
dtree.fit(train_X,train_y)
Out[26]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=10, min_samples_split=10,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=0, splitter='best')
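Before scoring, it can be useful to check how large the grown tree is and how much the pre-pruning constrained it (get_depth and get_n_leaves are standard DecisionTreeClassifier methods):

# Size of the grown tree; useful for judging how much the pre-pruning constrained it
print("Depth:", dtree.get_depth(), "Leaves:", dtree.get_n_leaves())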
In [27]:
y_pred_entropy=dtree.predict(test_X)
In [28]:
Score_entropy=accuracy_score(test_y,y_pred_entropy)
print("Accuracy: %0.2f" % (round(Score_entropy*100,2)))
Accuracy: 84.37
In [29]:
cm_dtclass = metrics.confusion_matrix(test_y,y_pred_entropy,labels = [1,0])
cm_dtclass
Out[29]:
array([[ 509, 1134],
       [ 449, 8036]], dtype=int64)
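Reading the matrix with labels=[1, 0]: of the 1643 actual defaulters, only 509 are caught and 1134 are missed, so the 84% accuracy is driven largely by the majority class. Per-class precision and recall make this explicit (a sketch using scikit-learn's classification_report):

from sklearn.metrics import classification_report

# Row "1" is the defaulter class; its recall is the share of actual defaulters caught
print(classification_report(test_y, y_pred_entropy, digits=3))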
In [30]:
from sklearn.metrics import roc_curve, auc
def plot_roc_curve(fper, tper):
    # labelled ROC curve plus the diagonal chance line
    plt.plot(fper, tper, label='Decision Tree (area = %0.2f)' % auc(fper, tper))
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()
probs = dtree.predict_proba(test_X)[:, 1]  # class-1 probabilities; hard 0/1 predictions give only a single ROC point
fper, tper, thresholds = roc_curve(test_y, probs)
plot_roc_curve(fper, tper)
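For a single scalar summary without the plot, roc_auc_score computes the same area directly:

from sklearn.metrics import roc_auc_score

# AUC from the predicted defaulter probabilities (matches the legend of the plot above)
print("AUC: %0.3f" % roc_auc_score(test_y, dtree.predict_proba(test_X)[:, 1]))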

2. Decision Tree with Gini Criterion:

In [31]:
dtree_gini = DecisionTreeClassifier(criterion='gini', random_state=0,
                                    min_samples_leaf=10, min_samples_split=10)
In [32]:
dtree_gini.fit(train_X, train_y)
Out[32]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=10, min_samples_split=10,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=0, splitter='best')
In [33]:
y_pred_gini = dtree_gini.predict(test_X)
In [34]:
Score_gini=accuracy_score(test_y,y_pred_gini)
print("Accuracy: %0.2f" % (round(Score_gini*100,2)))
Accuracy: 84.24
In [35]:
cm_dtclass2 = metrics.confusion_matrix(test_y,y_pred_gini,labels = [1,0])
cm_dtclass2
Out[35]:
array([[ 488, 1155],
       [ 441, 8044]], dtype=int64)
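It can also be instructive to see which features drive the splits; impurity-based importances come for free with a fitted tree (a minimal sketch):

# Importances aligned with the training columns, largest first
importances = pd.Series(dtree_gini.feature_importances_, index=featurecolumns).sort_values(ascending=False)
print(importances)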
In [36]:
probs = dtree_gini.predict_proba(test_X)[:, 1]  # defaulter-class probabilities
fper, tper, thresholds = roc_curve(test_y, probs)
plot_roc_curve(fper, tper)

9. Cross Validation with Stratified K-Fold:

In [37]:
headers = list(df.columns.values)
In [38]:
x = df[headers[:-1]]
y = df[headers[-1:]].values.ravel()
In [39]:
skf = StratifiedKFold(n_splits=10)
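Note that StratifiedKFold only balances the class ratio per fold; without shuffling, each fold is built from contiguous blocks of rows, and the per-fold results below degrade sharply in the last three folds, suggesting the rows are ordered. A sketch of the usual remedy (shuffle and random_state are standard StratifiedKFold arguments, not used in the run below):

# Shuffle rows within each class before assigning folds
skf_shuffled = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)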
In [44]:
def SKFold(x, y, skf, model):
    scores = []

    for train_index, test_index in skf.split(x, y):
        # ".loc" indexing works here because df carries a default RangeIndex
        x_train, x_test = x.loc[train_index], x.loc[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # fit the classifier on this fold's training split
        model.fit(x_train, y_train)

        # predict on the held-out fold
        predicted_y = model.predict(x_test)

        # score the fold and print its confusion matrix
        accuracy = metrics.accuracy_score(y_test, predicted_y)
        scores.append(accuracy)
        print("Accuracy for {}: {} ".format(model.criterion, accuracy * 100))
        print(metrics.confusion_matrix(y_test, predicted_y, labels=[1, 0]))
    print("\n")
    print("Max Accuracy for {}: {} ".format(model.criterion, np.max(scores) * 100))
    print("Min Accuracy for {}: {} ".format(model.criterion, np.min(scores) * 100))
    print("Mean Accuracy for {}: {} ".format(model.criterion, np.mean(scores) * 100))
    print("\n")
In [45]:
SKFold(x, y, skf, dtree)
SKFold(x, y, skf, dtree_gini)
Accuracy for entropy: 86.00197433366239 
[[ 315  488]
 [ 221 4041]]
Accuracy for entropy: 86.10069101678184 
[[ 337  466]
 [ 238 4024]]
Accuracy for entropy: 85.68325434439178 
[[ 322  480]
 [ 245 4017]]
Accuracy for entropy: 85.48578199052133 
[[ 318  484]
 [ 251 4011]]
Accuracy for entropy: 85.85818684574363 
[[ 324  478]
 [ 238 4023]]
Accuracy for entropy: 85.20639936796366 
[[ 284  518]
 [ 231 4030]]
Accuracy for entropy: 85.66067548884061 
[[ 316  486]
 [ 240 4021]]
Accuracy for entropy: 81.19691882283232 
[[  93  709]
 [ 243 4018]]
Accuracy for entropy: 80.50562907367174 
[[  50  752]
 [ 235 4026]]
Accuracy for entropy: 80.22911317400751 
[[  40  762]
 [ 239 4022]]


Max Accuracy for entropy: 86.10069101678184 
Min Accuracy for entropy: 80.22911317400751 
Mean Accuracy for entropy: 84.19286244584168 


Accuracy for gini: 86.14017769002962 
[[ 312  491]
 [ 211 4051]]
Accuracy for gini: 86.21915103652518 
[[ 339  464]
 [ 234 4028]]
Accuracy for gini: 85.18957345971565 
[[ 311  491]
 [ 259 4003]]
Accuracy for gini: 85.6437598736177 
[[ 313  489]
 [ 238 4024]]
Accuracy for gini: 86.09520047402725 
[[ 322  480]
 [ 224 4037]]
Accuracy for gini: 85.5816709460794 
[[ 286  516]
 [ 214 4047]]
Accuracy for gini: 85.73968003160182 
[[ 296  506]
 [ 216 4045]]
Accuracy for gini: 80.76239383764566 
[[  94  708]
 [ 266 3995]]
Accuracy for gini: 80.46612680229113 
[[  50  752]
 [ 237 4024]]
Accuracy for gini: 80.24886430969781 
[[  41  761]
 [ 239 4022]]


Max Accuracy for gini: 86.21915103652518 
Min Accuracy for gini: 80.24886430969781 
Mean Accuracy for gini: 84.20865984612313 
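
For reference, scikit-learn can do the same fold bookkeeping in one call; cross_val_score returns one accuracy per fold (a minimal sketch):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(dtree, x, y, cv=skf, scoring='accuracy')
print("Mean accuracy: {:.2f}%".format(scores.mean() * 100))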


10. Results:

Decision Tree with Entropy Criterion (hold-out test set): 84.37%
Decision Tree with Gini Criterion (hold-out test set): 84.24%
Decision Tree with Entropy Criterion, Stratified K-Fold (best fold): 86.10% (mean across folds: 84.19%)
Decision Tree with Gini Criterion, Stratified K-Fold (best fold): 86.22% (mean across folds: 84.21%)