In [369]:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
```

Checking shape, column titles and duplicate records:

There are 100,000 records, 28 columns and no duplicates.

In [370]:

```
df = pd.read_csv("train.csv")
print("\nRows and columns:\n",df.shape)
print("\nColumn titles:\n",df.columns)
print("\nDuplicates:\n", df.duplicated().value_counts())
```

In order to make data manipulation easier, let's

- change all column titles to lowercase, and
- remove columns that are not useful or not clear: 'id', 'customer_id', 'month', 'name', 'ssn' and 'changed_credit_limit'.

In [371]:

```
df.columns = [x.lower() for x in df.columns]
df.drop(['id', 'customer_id', 'month', 'name','ssn', 'changed_credit_limit'], axis=1, inplace = True)
print(df.shape)
print(df.columns)
```

Let's check how clean the data is. First we count the null entries per column. The result shows quite a lot of missing values!

In [372]:

```
df.isnull().sum()
```

Out[372]:

```
age                             0
occupation                      0
annual_income                   0
monthly_inhand_salary       15002
num_bank_accounts               0
num_credit_card                 0
interest_rate                   0
num_of_loan                     0
type_of_loan                11408
delay_from_due_date             0
num_of_delayed_payment       7002
num_credit_inquiries         1965
credit_mix                      0
outstanding_debt                0
credit_utilization_ratio        0
credit_history_age           9030
payment_of_min_amount           0
total_emi_per_month             0
amount_invested_monthly      4479
payment_behaviour               0
monthly_balance              1200
credit_score                    0
dtype: int64
```

We have at least seven columns that need taking care of. For our short tutorial, let's do the following:

- **'monthly_inhand_salary':** remove it, since 'annual_income' already indicates income;
- **'type_of_loan':** remove this one too, on the assumption that the probability of a payback default depends mostly on the economic situation and personal profile of the person, not on the type of loan;
- **'num_of_delayed_payment':** this seems like an important factor; we will inspect the data and decide how to handle it;
- **'num_credit_inquiries':** the amount of missing data is very small, so let's fill the gaps with the column mean;
- **'credit_history_age':** remove this one, too;
- **'amount_invested_monthly':** this seems like an important factor; we will inspect the data and decide how to handle it;
- **'monthly_balance':** this seems like an important factor; we will inspect the data and decide how to handle it.

Now, first, let's remove the columns we decided to drop:

In [373]:

```
df.drop(['monthly_inhand_salary', 'type_of_loan', 'credit_history_age'], axis=1, inplace = True)
```

We replace the null values with zeros, as it seems likely that these customers simply had no delayed payments, so no values were entered.

As the next step, we use the .value_counts().sort_values() method, which usually exposes the kinds of anomalies and problems in the data. Here we see that:

- some values contain the underscore character, which we will remove;
- there are negative values, which we will set to zero.

In the end we see that the values look good.

In [378]:

```
df['num_of_delayed_payment'] = df['num_of_delayed_payment'].replace(np.nan, 0)
print(df['num_of_delayed_payment'].value_counts().sort_values())
```

In [379]:

```
df['num_of_delayed_payment'] = [float(str(x).replace('_','')) for x in df['num_of_delayed_payment']]
df.loc[df['num_of_delayed_payment'] < 0, 'num_of_delayed_payment'] = 0  # .loc avoids chained assignment
print(df['num_of_delayed_payment'].value_counts().sort_index(),"\n")
print(df['num_of_delayed_payment'].value_counts().sort_values())
```

This column seems pretty clean except for the missing entries, which we will fill with the column mean as a simple solution.

In [406]:

```
print(df['num_credit_inquiries'].value_counts().sort_index(),"\n")
print(df['num_credit_inquiries'].value_counts().sort_values(),"\n\n")
print("Number of null values initially: ", df.isnull()['num_credit_inquiries'].sum(),"\n")
df['num_credit_inquiries'] = df['num_credit_inquiries'].replace(np.nan, df['num_credit_inquiries'].mean())
print("Null values after correction: ", df.isnull()['num_credit_inquiries'].sum())
```

Here we see that .value_counts().sort_index() leads to an error comparing **str** with **int**, which means some of the values are strings rather than numbers. The outcome of .value_counts().sort_values() reveals the culprit. After removing the underscore characters, the .sort_index() method runs without errors, since now all the records are numeric.

- note that the outcome of the .value_counts() method is a Series whose index holds all the unique values found in that column; the value of each entry shows how many times the corresponding unique value appeared in the original data. Therefore, when calling sort_index() on this Series, we get an error if some of the original values in that column were of a different data type.
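To make this failure mode concrete, here is a minimal reproduction on a toy Series (the values are made up for illustration, mimicking the underscore anomaly in this dataset):

```python
import pandas as pd

# A column with mixed types: ints plus one underscore-padded string.
s = pd.Series([3, 5, 3, "7_", 5, 3])

counts = s.value_counts()  # a Series: unique values -> frequencies
print(type(counts))

# Sorting the index fails while str and int values are mixed:
try:
    counts.sort_index()
except TypeError as e:
    print("TypeError:", e)

# After stripping underscores and casting, sort_index works:
cleaned = s.astype(str).str.replace("_", "", regex=False).astype(float)
print(cleaned.value_counts().sort_index())
```

The same pattern (inspect with .value_counts().sort_values(), then strip the offending characters) is what the cells above apply to the real columns.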

In [407]:

```
df['amount_invested_monthly'] = df['amount_invested_monthly'].replace(np.nan,0)
# print(df['amount_invested_monthly'].value_counts().sort_index())
print("+ Before correction:\n", df['amount_invested_monthly'].value_counts().sort_values(),"\n\n")
df['amount_invested_monthly'] = [float(str(x).replace('_','')) for x in df['amount_invested_monthly']]
print("+ After correction:\n", df['amount_invested_monthly'].value_counts().sort_index(),"\n")
print(df['amount_invested_monthly'].value_counts().sort_values())
```

Same problem as the previous column: .value_counts().sort_index() leads to an error comparing str with int. The outcome of .value_counts().sort_values() reveals the culprit: '**-333333333333333333333333333**', repeated nine times!

After replacing those records, we see that the .sort_index() method runs without errors, since now all the records are numeric values.

**A Remark:** Something in the data does not add up here. If the monthly balance values reach a maximum of about 1602 USD, then the monthly investment of 10000 USD that we saw in the previous column's analysis was most likely an invalid value and should be set to zero. Let's do that.

In [408]:

```
df['monthly_balance'] = df['monthly_balance'].replace(np.nan,0)
# print(df['monthly_balance'].value_counts().sort_index())  # would raise before cleaning
print("+ Before correction:\n", df['monthly_balance'].value_counts().sort_values(),"\n\n")
df['monthly_balance'] = [float(str(x).replace('__-333333333333333333333333333__','0')) for x in df['monthly_balance']]
print("+ After correction:\n", df['monthly_balance'].value_counts().sort_index(),"\n")
print(df['monthly_balance'].value_counts().sort_values())
df['amount_invested_monthly'] = df['amount_invested_monthly'].replace(10000,0)
```

Now let's check again how clean the data has become: the null values are gone. Additionally, the unwanted characters are removed, but perhaps not from all columns. Let's check which columns that are supposed to contain only numbers still hold strings: **'age', 'annual_income', 'num_of_loan', and 'outstanding_debt'.**

I will correct them one by one in the following.

In [456]:

```
print(df.isnull().sum(), "\n\n")
df.select_dtypes(include=('object'))
```

```
age                         0
occupation                  0
annual_income               0
num_bank_accounts           0
num_credit_card             0
interest_rate               0
num_of_loan                 0
delay_from_due_date         0
num_of_delayed_payment      0
num_credit_inquiries        0
credit_mix                  0
outstanding_debt            0
credit_utilization_ratio    0
payment_of_min_amount       0
total_emi_per_month         0
amount_invested_monthly     0
payment_behaviour           0
monthly_balance             0
credit_score                0
dtype: int64
```

Out[456]:

| | occupation | credit_mix | payment_of_min_amount | payment_behaviour | credit_score |
|---|---|---|---|---|---|
| 0 | Scientist | Unknown | No | High_spent_Small_value_payments | Good |
| 1 | Scientist | Good | No | Low_spent_Large_value_payments | Good |
| 2 | Scientist | Good | No | Low_spent_Medium_value_payments | Good |
| 3 | Scientist | Good | No | Low_spent_Small_value_payments | Good |
| 4 | Scientist | Good | No | High_spent_Medium_value_payments | Good |
| ... | ... | ... | ... | ... | ... |
| 99995 | Mechanic | Unknown | No | High_spent_Large_value_payments | Poor |
| 99996 | Mechanic | Unknown | No | High_spent_Medium_value_payments | Poor |
| 99997 | Mechanic | Good | No | High_spent_Large_value_payments | Poor |
| 99998 | Mechanic | Good | No | Low_spent_Large_value_payments | Standard |
| 99999 | Mechanic | Good | No | Unknown | Poor |

100000 rows × 5 columns

In [440]:

```
# removing characters
df['age'] = [float(str(x).replace('_','')) for x in df['age']]
# replacing unacceptable values in the column with the average
mean_age = df['age'][df['age'] < 99].mean()
df.loc[df['age'] > 99, 'age'] = mean_age  # .loc avoids chained assignment
df.loc[df['age'] < 0, 'age'] = mean_age
```

In [441]:

```
# removing characters
df['annual_income'] = [float(str(x).replace('_','')) for x in df['annual_income']]
```

In [442]:

```
df['num_of_loan'] = [float(str(x).replace('_','')) for x in df['num_of_loan']]
```

In [443]:

```
df['outstanding_debt'] = [float(str(x).replace('_','')) for x in df['outstanding_debt']]
```

Now that all numeric columns are cleaned to some degree, we can quickly check the non-numeric ones. Some of them need corrections, which I will make in the following code blocks based on my judgement.

In [451]:

```
[print(df[col].value_counts(),"\n\n") for col in df.select_dtypes(include=('object'))]
```

```
occupation
Unemployed       7062
Lawyer           6575
Architect        6355
Engineer         6350
Scientist        6299
Mechanic         6291
Accountant       6271
Developer        6235
Media_Manager    6232
Teacher          6215
Entrepreneur     6174
Doctor           6087
Journalist       6085
Manager          5973
Musician         5911
Writer           5885
Name: count, dtype: int64

credit_mix
Standard    36479
Good        24337
_           20195
Bad         18989
Name: count, dtype: int64

payment_of_min_amount
Yes    52326
No     35667
NM     12007
Name: count, dtype: int64

payment_behaviour
Low_spent_Small_value_payments      25513
High_spent_Medium_value_payments    17540
Low_spent_Medium_value_payments     13861
High_spent_Large_value_payments     13721
High_spent_Small_value_payments     11340
Low_spent_Large_value_payments      10425
!@9#%8                               7600
Name: count, dtype: int64

credit_score
Standard    53174
Poor        28998
Good        17828
Name: count, dtype: int64
```

Out[451]:

[None, None, None, None, None]

In [455]:

```
# assuming '_______' means no job was entered, hence unemployed
df['occupation'] = [x.replace('_______','Unemployed') for x in df['occupation']]
# just making it clearer
df['credit_mix'] = [x.replace('_','Unknown') for x in df['credit_mix']]
# assuming 'NM' was a typo for 'No'
df['payment_of_min_amount'] = [x.replace('NM','No') for x in df['payment_of_min_amount']]
# what can I say? !@9#%8
df['payment_behaviour'] = [x.replace('!@9#%8','Unknown') for x in df['payment_behaviour']]
```
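As a design note, the list comprehensions above could also use pandas' built-in `Series.replace`, which (in its non-regex form) matches whole values rather than substrings, so legitimate underscores inside names like 'Low_spent_Small_value_payments' are left alone. A minimal sketch on toy data drawn from this dataset's values:

```python
import pandas as pd

# Toy frame mimicking the categorical columns above.
df = pd.DataFrame({
    "credit_mix": ["Good", "_", "Bad"],
    "payment_behaviour": ["Low_spent_Small_value_payments", "!@9#%8",
                          "High_spent_Large_value_payments"],
})

# Whole-value match: only cells equal to '_' or '!@9#%8' are touched.
df["credit_mix"] = df["credit_mix"].replace("_", "Unknown")
df["payment_behaviour"] = df["payment_behaviour"].replace("!@9#%8", "Unknown")
print(df)
```

This stays vectorized and avoids any risk of mangling category names that happen to contain the placeholder character.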

Now hopefully the data is ready to be used for some machine learning. I will save it under a new name.

The first 18 columns are the features, and the last one, 'credit_score', is the target for our model.

In [459]:

```
df.to_csv('train_clean_data.csv', index=False)  # index=False avoids writing the row index as an extra column
```
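To illustrate the intended next step, here is a minimal, hypothetical sketch of splitting the cleaned frame into features and target. A tiny stand-in frame is used instead of reading 'train_clean_data.csv', and `pd.get_dummies` stands in for whatever encoding the chosen model will actually need:

```python
import pandas as pd

# Tiny stand-in for the cleaned frame
# (in practice: df = pd.read_csv('train_clean_data.csv')).
df = pd.DataFrame({
    "age": [23.0, 35.0, 41.0],
    "annual_income": [19000.0, 54000.0, 87000.0],
    "credit_mix": ["Good", "Standard", "Bad"],
    "credit_score": ["Good", "Standard", "Poor"],
})

# 'credit_score' is the target; the rest are features.
y = df["credit_score"]
X = df.drop("credit_score", axis=1)

# Most models need numeric input, so one-hot encode the categorical columns.
X = pd.get_dummies(X, columns=["credit_mix"])
print(X.columns.tolist())
print(X.shape)
```

From here, X and y could go straight into a train/test split and a classifier of choice.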