Credit scoring is a common application of AI and machine learning in the banking industry. If you have applied for a loan in the past few years, your profile almost certainly went through some kind of credit scoring algorithm before the bank gave you a response!
The credit scoring project in this post is based on a publicly available dataset that you can find at this link. Data is one of the two most important factors in machine learning (the modeling technique being the other). Having a clean and usable dataset is key to designing a program that performs well in prediction or categorization. This project is therefore divided into two parts:
- Preprocessing the input data and preparing it for training the AI model
- Training and testing the model and evaluating its performance
I have written up the whole process of cleaning and preparing the data in a Jupyter notebook that you can find here. All the details and data quality checks are explained in that notebook. The code follows below, without the comments.
# www.soligale.com
# # Credit scoring with machine learning
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv("credit_scoring_train.csv")
print("\nRows and columns:\n",df.shape)
print("\nColumn titles:\n",df.columns)
print("\nDuplicates:\n", df.duplicated().value_counts())
df.columns = [x.lower() for x in df.columns]
df.drop(['id', 'customer_id', 'month', 'name','ssn', 'changed_credit_limit'], axis=1, inplace = True)
print(df.shape)
print(df.columns)
print(df.isnull().sum())
df.drop(['monthly_inhand_salary', 'type_of_loan', 'credit_history_age'], axis=1, inplace = True)
df['num_of_delayed_payment'] = df['num_of_delayed_payment'].replace(np.nan, 0)
print(df['num_of_delayed_payment'].value_counts().sort_values())
df['num_of_delayed_payment'] = [float(str(x).replace('_','')) for x in df['num_of_delayed_payment']]
# Use .loc instead of chained indexing so the assignment reliably modifies df
df.loc[df['num_of_delayed_payment'] < 0, 'num_of_delayed_payment'] = 0
print(df['num_of_delayed_payment'].value_counts().sort_index(),"\n")
print(df['num_of_delayed_payment'].value_counts().sort_values())
print(df['num_credit_inquiries'].value_counts().sort_index(),"\n")
print(df['num_credit_inquiries'].value_counts().sort_values(),"\n\n")
print("Number of null values initially: ", df.isnull()['num_credit_inquiries'].sum(),"\n")
df['num_credit_inquiries'] = df['num_credit_inquiries'].replace(np.nan, df['num_credit_inquiries'].mean())
print("Null values after correction: ", df.isnull()['num_credit_inquiries'].sum())
df['amount_invested_monthly'] = df['amount_invested_monthly'].replace(np.nan,0)
print("+ Before correction:\n", df['amount_invested_monthly'].value_counts().sort_values(),"\n\n")
df['amount_invested_monthly'] = [float(str(x).replace('_','')) for x in df['amount_invested_monthly']]
print("+ After correction:\n", df['amount_invested_monthly'].value_counts().sort_index(),"\n")
print(df['amount_invested_monthly'].value_counts().sort_values())
df['monthly_balance'] = df['monthly_balance'].replace(np.nan,0)
print("+ Before correction:\n", df['monthly_balance'].value_counts().sort_values(),"\n\n")
df['monthly_balance'] = [float(str(x).replace('__-333333333333333333333333333__','0')) for x in df['monthly_balance']]
print("+ After correction:\n", df['monthly_balance'].value_counts().sort_index(),"\n")
print(df['monthly_balance'].value_counts().sort_values())
df['amount_invested_monthly'] = df['amount_invested_monthly'].replace(10000,0)
print(df.isnull().sum(), "\n\n")
print(df.select_dtypes(include='object'))
df['age'] = [float(str(x).replace('_','')) for x in df['age']]
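# Treat ages outside 0..99 as invalid and replace them with the mean of the valid ages
# (negative ages are excluded from the mean so they do not skew it)
valid_age = df['age'].between(0, 99)
mean_age = df.loc[valid_age, 'age'].mean()
df.loc[~valid_age, 'age'] = mean_age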
df['annual_income'] = [float(str(x).replace('_','')) for x in df['annual_income']]
df['num_of_loan'] = [float(str(x).replace('_','')) for x in df['num_of_loan']]
df['outstanding_debt'] = [float(str(x).replace('_','')) for x in df['outstanding_debt']]
for col in df.select_dtypes(include='object'):
    print(df[col].value_counts(), "\n\n")
df['occupation'] = [x.replace('_______','Unemployed') for x in df['occupation']]
df['credit_mix'] = [x.replace('_','Unknown') for x in df['credit_mix']]
df['payment_of_min_amount'] = [x.replace('NM','No') for x in df['payment_of_min_amount']]
df['payment_behaviour'] = [x.replace('!@9#%8','Unknown') for x in df['payment_behaviour']]
# ## Data is ready!
df.to_csv('train_clean_data.csv')
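As a possible tidy-up, the repeated underscore-stripping lines above could be folded into a single helper. The sketch below is not part of the original notebook: `to_clean_float` is a hypothetical name, the sample values are made up for illustration, and unlike the `float(...)` list comprehensions it turns unparseable entries into NaN instead of raising an error.

```python
import numpy as np
import pandas as pd


def to_clean_float(series: pd.Series) -> pd.Series:
    """Strip stray underscores and convert to float; unparseable entries become NaN."""
    stripped = series.astype(str).str.replace('_', '', regex=False)
    return pd.to_numeric(stripped, errors='coerce').astype(float)


# Tiny made-up sample, only to show the behaviour (not taken from the dataset).
sample = pd.DataFrame({
    'age': ['23', '34_', '-500', '45'],
    'annual_income': ['19114.12', '34847.84_', '143162.64', 'n/a'],
})

# Apply the same cleaning to every column that should be numeric.
for col in ['age', 'annual_income']:
    sample[col] = to_clean_float(sample[col])

print(sample.dtypes)
print(sample)
```

With a helper like this, each of the numeric columns could be cleaned in one loop, and any values that fail to parse show up as nulls in `df.isnull().sum()` so they can be handled explicitly.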