Credit scoring with machine learning (01)

Credit scoring is a common application of AI and machine learning in the banking industry. If you have applied for a loan in the past few years, your profile almost certainly went through some kind of credit scoring algorithm before the bank gave you a response!

The credit scoring project in this post is based on a publicly available dataset that you can find at this link. Data is one of the two most important factors in machine learning (the modeling technique being the other). A clean and usable dataset is key to designing a program that performs well at prediction or categorization. This project is therefore divided into two parts:

  1. Preprocessing the input data and preparing it for training the AI model
  2. Training and testing the model and evaluating its performance

I have written up the whole process of cleaning and preparing the data in a Jupyter notebook that you can find here. All the details and data quality checks are explained in that notebook. The code follows, without the notebook's explanatory comments.

# www.soligale.com


# # Credit scoring with machine learning

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')


df = pd.read_csv("credit_scoring_train.csv")
print("\nRows and columns:\n",df.shape)
print("\nColumn titles:\n",df.columns)
print("\nDuplicates:\n", df.duplicated().value_counts())


df.columns = [x.lower() for x in df.columns]
df.drop(['id', 'customer_id', 'month', 'name','ssn', 'changed_credit_limit'], axis=1, inplace = True)
print(df.shape)
print(df.columns)
print(df.isnull().sum())


df.drop(['monthly_inhand_salary', 'type_of_loan', 'credit_history_age'], axis=1, inplace = True)


df['num_of_delayed_payment'] = df['num_of_delayed_payment'].replace(np.nan, 0)
print(df['num_of_delayed_payment'].value_counts().sort_values())
df['num_of_delayed_payment'] = [float(str(x).replace('_','')) for x in df['num_of_delayed_payment']]
# Negative counts are invalid; set them to zero (.loc avoids chained assignment)
df.loc[df['num_of_delayed_payment'] < 0, 'num_of_delayed_payment'] = 0
print(df['num_of_delayed_payment'].value_counts().sort_index(),"\n")
print(df['num_of_delayed_payment'].value_counts().sort_values())


print(df['num_credit_inquiries'].value_counts().sort_index(),"\n")
print(df['num_credit_inquiries'].value_counts().sort_values(),"\n\n")
print("Number of null values initially: ", df.isnull()['num_credit_inquiries'].sum(),"\n")
df['num_credit_inquiries'] = df['num_credit_inquiries'].replace(np.nan, df['num_credit_inquiries'].mean())
print("Null values after correction: ", df.isnull()['num_credit_inquiries'].sum())


df['amount_invested_monthly'] = df['amount_invested_monthly'].replace(np.nan,0)
print("+ Before correction:\n", df['amount_invested_monthly'].value_counts().sort_values(),"\n\n")
df['amount_invested_monthly'] = [float(str(x).replace('_','')) for x in df['amount_invested_monthly']]
print("+ After correction:\n", df['amount_invested_monthly'].value_counts().sort_index(),"\n")
print(df['amount_invested_monthly'].value_counts().sort_values())


df['monthly_balance'] = df['monthly_balance'].replace(np.nan,0)
print("+ Before correction:\n", df['monthly_balance'].value_counts().sort_values(),"\n\n")
df['monthly_balance'] = [float(str(x).replace('__-333333333333333333333333333__','0')) for x in df['monthly_balance']]
print("+ After correction:\n", df['monthly_balance'].value_counts().sort_index(),"\n")
print(df['monthly_balance'].value_counts().sort_values())


df['amount_invested_monthly'] = df['amount_invested_monthly'].replace(10000,0)


print(df.isnull().sum(), "\n\n")



print(df.select_dtypes(include='object').head())

df['age'] = [float(str(x).replace('_','')) for x in df['age']]
# Treat ages outside 0-99 as invalid and replace them with the mean of the valid ages
valid_age = df['age'].between(0, 99)
mean_age = df.loc[valid_age, 'age'].mean()
df.loc[~valid_age, 'age'] = mean_age

df['annual_income'] = [float(str(x).replace('_','')) for x in df['annual_income']]

df['num_of_loan'] = [float(str(x).replace('_','')) for x in df['num_of_loan']]

df['outstanding_debt'] = [float(str(x).replace('_','')) for x in df['outstanding_debt']]



for col in df.select_dtypes(include='object'):
    print(df[col].value_counts(), "\n\n")

df['occupation'] = [x.replace('_______','Unemployed') for x in df['occupation']]
df['credit_mix'] = [x.replace('_','Unknown') for x in df['credit_mix']]
df['payment_of_min_amount'] = [x.replace('NM','No') for x in df['payment_of_min_amount']]
df['payment_behaviour'] = [x.replace('!@9#%8','Unknown') for x in df['payment_behaviour']]


# ## Data is ready!
# index=False keeps the row index from being written as an extra column
df.to_csv('train_clean_data.csv', index=False)
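
Several columns above are cleaned with the same pattern: strip the stray underscores, convert to float, then patch out-of-range values. As a minimal sketch, that pattern could be folded into one helper. The function name `clean_numeric` and the toy values below are hypothetical and not part of the notebook. The sketch also swaps the list comprehensions for `pd.to_numeric` with `errors='coerce'`, so anything that still cannot be parsed becomes NaN instead of raising an error.

```python
import numpy as np
import pandas as pd


def clean_numeric(series, fill_value=None):
    """Strip stray underscores and convert a column to float.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Cast to string so str.replace also works on mixed-type columns.
    stripped = series.astype(str).str.replace('_', '', regex=False)
    # Anything that still cannot be parsed becomes NaN.
    numeric = pd.to_numeric(stripped, errors='coerce')
    if fill_value is not None:
        numeric = numeric.fillna(fill_value)
    return numeric


# Made-up rows that mimic the placeholders seen in the dataset.
sample = pd.DataFrame({
    'age': ['23', '28_', '-500', '7580'],
    'num_of_delayed_payment': ['7', '4_', np.nan, '-1'],
})

sample['age'] = clean_numeric(sample['age'])
sample['num_of_delayed_payment'] = clean_numeric(
    sample['num_of_delayed_payment'], fill_value=0
)

# Replace implausible ages with the mean of the plausible ones.
valid_age = sample['age'].between(0, 99)
sample.loc[~valid_age, 'age'] = sample.loc[valid_age, 'age'].mean()

# Negative delay counts are treated as zero, as in the code above.
sample['num_of_delayed_payment'] = sample['num_of_delayed_payment'].clip(lower=0)

print(sample)
```

On the real data, the same helper could be applied to `annual_income`, `num_of_loan`, `outstanding_debt` and `amount_invested_monthly`, assuming those columns only contain digits, signs, decimal points and underscores.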
