Examining Factors Affecting Loan Default: A Case Study for a Bank in Europe
In today’s economy, lending is an integral part of banking services. Banks face a constant challenge in determining the risk associated with lending money to their clients. Credit risk analysis is crucial for banks to determine whether a borrower is likely to repay their loan or default. Defaulted loans can lead to substantial losses for banks, especially when large sums of money are involved.
In this post, I will present a brief summary of a recent data analysis project I worked on for credit risk analysis for a fresh, up-and-coming bank in Europe. The bank’s main goal was to establish a solid foundation for its loan approval process by identifying the factors that affect loan defaults and building a predictive model to mitigate the risk of losses. We were given a public dataset containing 12 variables and 32,581 loans to work with, which provided a wealth of data to draw insights from.
Through our analysis, we aimed to identify general factors that contribute to loan defaults, which would enable the bank to make informed lending decisions and minimize the risk of loss. By building a predictive model, the bank can better understand the characteristics of clients who are likely to default on their loans and make more accurate decisions on loan approvals.
The report will be structured as follows:
- Univariate Data Exploration
  - Overview of the dataset
  - Summary statistics and visualizations
- Bivariate Data Analysis
  - Identifying patterns and relationships between variables
  - Statistical tests and visualizations
- Modeling
  - Training and evaluating several classification models
  - Cross-validation and hyperparameter tuning
  - Performance evaluation and interpretation
- Conclusion
  - Summary of key findings and insights
  - Implications for the bank’s loan approval process
  - Limitations and potential avenues for future research
Univariate Data Exploration
We began our analysis by exploring the dataset provided to us, which consisted of 12 variables and 32,581 loans. Our aim was to understand the characteristics of the data and identify any patterns or trends that could inform our analysis.
Numerical Variables:
To gain a better understanding of the dataset, we first calculated summary statistics for each numerical variable, including measures of central tendency and variability. The table below summarizes these statistics:
Variable | Minimum | Maximum | Mean | Standard Deviation |
---|---|---|---|---|
Age (years) | 20 | 144 | 27.73 | 6.35 |
Income | $4,000 | $6,000,000 | $66,074 | $61,983 |
Employment Length (years) | 0 | 123 | 4.79 | 4.14 |
Loan Amount | $500 | $35,000 | $9,589 | $6,322 |
Interest Rate | 5.42% | 23.22% | 11.01% | 3.24% |
Percent Income | 0% | 83% | 17.02% | 10.68% |
Credit History Length (years) | 2 | 30 | 5.80 | 4.05 |
From the table, we can see that the age of the customers ranges from 20 to 144, with a mean age of 27.73 and a standard deviation of 6.35. The income of the customers ranges from $4,000 to $6,000,000, with a mean income of $66,074 and a standard deviation of $61,983. The loan amount ranges from $500 to $35,000, with a mean loan amount of $9,589 and a standard deviation of $6,322. The interest rate ranges from 5.42% to 23.22%, with a mean interest rate of 11.01% and a standard deviation of 3.24%. The percent income ranges from 0% to 83%, with a mean percent income of 17.02% and a standard deviation of 10.68%. The credit history length ranges from 2 years to 30 years, with a mean credit history length of 5.80 years and a standard deviation of 4.05 years.
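A summary table like the one above can be produced in a few lines. The following is a minimal sketch, assuming the data sits in a pandas DataFrame; the column names and the five sample rows are made up for illustration and are not the bank's actual schema.

```python
import pandas as pd

# Tiny made-up sample; the column names are hypothetical.
loans = pd.DataFrame({
    "age": [22, 25, 31, 40, 144],
    "income": [30_000, 45_000, 66_000, 120_000, 6_000_000],
    "loan_amount": [5_000, 8_000, 10_000, 20_000, 35_000],
})

# Minimum, maximum, mean and standard deviation for each numerical column,
# transposed so that every variable becomes one row of the table.
summary = loans.agg(["min", "max", "mean", "std"]).T
print(summary)
```

Implausible extremes, such as an age of 144, show up immediately in the maximum column of such a table.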
We also used histograms to visualize the distribution of some of these variables, looking for any outliers or unusual patterns:
From the histograms, we can see that the age of the customers is roughly normally distributed, with most customers being between 20 and 50 years old. The income of the customers is right-skewed, with most customers having an income below $100,000.
We also found that the age variable had a few extreme, implausible values, with some customers recorded as over 100 years old. Income showed a similar problem: a long right tail, with a few customers reporting very high incomes. We decided to keep only observations with an age of at most 90, an income below $1M, and an employment length under 50 years. After removing these extreme outliers, 31,670 observations remained.
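The filtering rule described above can be sketched as a boolean mask. This is only an illustration under stated assumptions: the column names and sample rows are hypothetical, and treating the age limit as inclusive (age of 90 is kept) is an assumption.

```python
import pandas as pd

# Made-up rows; the column names are hypothetical.
loans = pd.DataFrame({
    "age": [23, 35, 144, 41, 28],
    "income": [40_000, 6_000_000, 55_000, 72_000, 31_000],
    "employment_length": [2, 10, 4, 123, 0],
})

# Keep a row only if it passes all three plausibility checks.
mask = (
    (loans["age"] <= 90)                    # assumed inclusive limit
    & (loans["income"] < 1_000_000)
    & (loans["employment_length"] < 50)
)
cleaned = loans[mask].reset_index(drop=True)
print(f"kept {len(cleaned)} of {len(loans)} rows")
```

One side effect worth knowing: in pandas, a comparison against a missing value evaluates to false, so a filter written this way also drops rows where any of the three columns is missing.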
Categorical Variables:
For the categorical variables, we can use bar plots and frequency tables to summarize their distribution. The Home Ownership, Loan Intent, Loan Grade, Historical Default, and Loan Status variables are all categorical variables. Here are the brief descriptions for each one of them.
Home Ownership:
Looking at the home ownership variable, we found that the majority of borrowers (50.74%) are renting their homes, while 41.32% have a mortgage and 7.61% own their homes. A small percentage of borrowers (0.34%) fall under the “Other” category. This information can be useful in understanding the background of borrowers and their financial situations.
Loan Intent:
Analyzing the loan intent variable, we found that the most frequent loan purpose is education (19.85%), followed by medical expenses (18.61%) and venture (17.52%). Personal loans account for 16.94% of the dataset, while debt consolidation and home improvement loans represent 15.99% and 11.08% respectively.
Loan Grade:
In terms of loan grade, we observed that the majority of loans fell into grades A and B, with 32.7% and 32.2% frequency, respectively. Grade C made up the next largest proportion at 19.9%, followed by grade D at 11.2%. Grades E, F, and G each made up less than 4% of the loans, with grade G being the least frequent at only 0.2%. This information could be useful in determining which loan grades are associated with higher levels of risk and which may be more likely to be repaid on time.
Historical Default:
The variable “historical default” has two categories, “Y” for loans that have a historical default and “N” for loans that have no historical default. The dataset has 31,670 entries, out of which 5,627 (17.77%) have a historical default, while 26,043 (82.23%) do not have a historical default.
Loan Status:
In terms of the target variable, loan status, we found that 78.45% of loans in the dataset had no default, while 21.55% of loans did default. This indicates that while the majority of loans were successfully repaid, a significant portion did not meet the required payments. The loan status variable will be used as the target variable for our predictive model to determine if a loan will default or not based on the other variables in the dataset.
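Percentage breakdowns like the ones reported for the categorical variables come from a normalized frequency table. A minimal sketch, assuming pandas and using a made-up series of category labels (the label spellings are hypothetical):

```python
import pandas as pd

# Eight made-up borrowers; the label spellings are hypothetical.
home_ownership = pd.Series(
    ["RENT", "RENT", "MORTGAGE", "OWN", "RENT", "MORTGAGE", "RENT", "OTHER"],
    name="home_ownership",
)

# Share of each category as a percentage of all rows.
freq = (home_ownership.value_counts(normalize=True) * 100).round(2)
print(freq)
```

The same call applied to the target column gives the class balance, which matters later when judging model accuracy against a majority-class baseline.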
To further explore the relationship between loan status and other variables, we created bar plots for the categorical variables and histograms for the numerical variables. The bar plot for home ownership matched the frequency table above: most borrowers rent their homes (50.74%), followed by those with a mortgage (41.32%). Only a small percentage of borrowers own their homes (7.61%) or have other forms of home ownership (0.34%).
Overall, these summary statistics and visualizations provide a comprehensive overview of the dataset and suggest some potential relationships between variables and loan status. In the following sections, we will explore these relationships in more detail and develop a predictive model for loan default.
Bivariate Data Analysis
In this section, we conducted a comprehensive analysis of the relationship between the loan status and other variables in the dataset. To achieve this, we used various analytical methods, including summary statistics, bar charts, boxplots, and hypothesis testing. The summary statistics were used to compare the numerical variables between the default and non-default groups. The bar charts and boxplots were created to visually represent the difference between the two groups. Finally, hypothesis testing was performed to statistically determine the significance of the observed differences. This analysis gave us insights into the factors that are most strongly associated with loan default and allowed us to provide initial recommendations to better manage the lending risk.
Categorical Variables
We started by plotting some side-by-side barplots to show if there are any differences in the default percentage for any specific class in the categorical variables we have. Here are the findings:
Home Ownership: Borrowers who own their home are the least likely to default: 93.07% did not default and 6.93% did. Borrowers who rent have the highest default rate, 31.08%, compared with 21.55% in the full dataset.
Loan Intent: For loan intent, we do not have the same big difference as for the home ownership. The default percentages range from 14.67% for venture, to 28.38% for debt consolidation.
Loan Grade: The default rate clearly increases as we go from grade A to grade G, from 9.56% for grade A to 98.44% for grade G (this grade has only 64 loans, and only 1 of them did not default). This increase was expected, as the grade system was made to rank the loans from the ones with the lowest risk (A) to the ones with the highest risk (G). This variable will be omitted in the modeling section, as the bank prefers to use only personal information to predict the status.
Historical Default: Among loans with no historical default, 81.91% did not default and the remaining 18.09% did. Among loans with a historical default, only 62.43% did not default, which implies a default rate of 37.57%, roughly twice that of the first group.
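The per-class default percentages above are group means of a 0/1 target. A minimal sketch, assuming pandas, a hypothetical `default` column coded as 1 for default and 0 otherwise, and made-up rows:

```python
import pandas as pd

# Made-up rows; column names and the 0/1 coding are assumptions.
loans = pd.DataFrame({
    "home_ownership": ["RENT", "RENT", "RENT", "OWN", "OWN", "MORTGAGE", "MORTGAGE", "RENT"],
    "default": [1, 0, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column within each group is that group's default rate.
default_rate = loans.groupby("home_ownership")["default"].mean().mul(100).round(2)
overall_rate = loans["default"].mean() * 100

print(default_rate)
print(f"overall default rate: {overall_rate:.2f}%")
```

Comparing each group's rate with the overall rate is what makes a class stand out, as with renters in the findings above.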
Numerical Variables
We also looked at the relationship between the loan status and each of the numerical variables in the dataset. Here are the findings:
Age vs. Loan Status: The mean age for both default and no default loan statuses is around 27 years old, 27.47 for default group, and 27.79 for no default group. The Mann-Whitney U test showed that this difference between the two groups is significant (p < 0.0001), even if it looks very small.
Income vs. Loan Status: The mean income for borrowers with a no default loan status ($70,565) is higher than the mean income for borrowers with a default loan status ($49,963). The Mann-Whitney U test showed a significant difference between the two groups (p < 0.0001).
Employment Length vs. Loan Status: The mean employment length for borrowers with a no default loan status (4.96 years) is also higher than the mean employment length for borrowers with a default loan status (4.12 years). The Mann-Whitney U test showed a significant difference as well between the two groups (p < 0.0001).
Loan Amount vs. Loan Status: The mean loan amount for borrowers with a default loan status ($7,156) is higher than the mean loan amount for borrowers with a no default loan status ($6,037). The Mann-Whitney U test showed that this difference between the two groups is also significant (p < 0.0001).
Interest Rate vs. Loan Status: The mean interest rate for borrowers with a default loan status (13.12%) is higher than the mean interest rate for borrowers with a no default loan status (10.46%). The Mann-Whitney U test showed a significant difference between the two groups (p < 0.0001).
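For readers who want to reproduce this kind of comparison, here is a minimal sketch of a two-sided Mann-Whitney U test with SciPy. The two samples are synthetic log-normal incomes generated only for illustration; they are not the project's data, and the variable names are hypothetical.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic, right-skewed "incomes" for two groups (illustration only).
rng = np.random.default_rng(0)
income_no_default = rng.lognormal(mean=11.0, sigma=0.5, size=500)
income_default = rng.lognormal(mean=10.7, sigma=0.5, size=200)

# Rank-based test that does not assume normally distributed data.
stat, p_value = mannwhitneyu(
    income_no_default, income_default, alternative="two-sided"
)
print(f"U = {stat:.0f}, p = {p_value:.3g}")
```

With tens of thousands of observations, even a very small shift (such as the age difference above) can come out as statistically significant, so it helps to read the p-value alongside the size of the difference itself.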
Modeling
In this section, we will describe the modeling process and the results obtained from different machine learning models.
We used stratified 10-fold cross-validation to train and compare the models. The models were trained on the transformed training set, which contains 22,806 samples (about 70% of the data), and evaluated on the transformed test set, which contains 9,775 samples.
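The evaluation scheme described above (a held-out test set plus stratified 10-fold cross-validation on the training part) can be sketched with scikit-learn. This is an illustration under assumptions, not the project's actual pipeline: the data is synthetic, logistic regression stands in for whichever model is being compared, and the seeds are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic imbalanced data standing in for the loan dataset.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.78, 0.22], random_state=0
)

# 70/30 split that preserves the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Stratified 10-fold cross-validation on the training part only.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")

# Final check on the untouched test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"CV accuracy: {cv_scores.mean():.3f}, test accuracy: {test_accuracy:.3f}")
```

Stratification keeps the share of defaults roughly equal in every fold, which matters when one class is about four times as common as the other.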
Data Preprocessing
The original data contained 32,581 samples and 11 features. We transformed the data and added 8 additional features, resulting in a transformed dataset with 32,581 samples and 19 features.
Of the 11 original features, one was ordinal, seven were numeric, and three were categorical. The transformed dataset had 12.1% missing values, which were imputed using simple imputation techniques. We imputed missing numeric values using the mean and missing categorical values using the mode.
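The imputation rule described above (mean for numeric columns, mode for categorical ones) can be sketched directly in pandas. The column names and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps; the column names are hypothetical.
loans = pd.DataFrame({
    "interest_rate": [10.0, np.nan, 14.0, 12.0],
    "employment_length": [2.0, 4.0, np.nan, 6.0],
    "loan_intent": ["EDUCATION", "MEDICAL", np.nan, "EDUCATION"],
})
numeric_cols = ["interest_rate", "employment_length"]
categorical_cols = ["loan_intent"]

imputed = loans.copy()

# Numeric columns: replace missing values with the column mean.
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].mean())

# Categorical columns: replace missing values with the most frequent label.
for col in categorical_cols:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```

In a train/test setting, the means and modes would normally be computed on the training set only and then reused on the test set, so that no information leaks from the test data.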
Model Evaluation
We evaluated 16 different machine learning models and recorded their performance metrics, including accuracy, area under the ROC curve (AUC), recall, precision, and F1-score.
The table below summarizes the results obtained from the different models.
ID | Model | Accuracy | AUC | Recall | Prec. | F1 |
---|---|---|---|---|---|---|
catboost | CatBoost Classifier | 0.9269 | 0.938 | 0.99 | 0.922 | 0.955 |
lightgbm | Light Gradient Boosting Machine | 0.9265 | 0.938 | 0.991 | 0.921 | 0.955 |
xgboost | Extreme Gradient Boosting | 0.9259 | 0.942 | 0.985 | 0.925 | 0.954 |
rf | Random Forest Classifier | 0.9194 | 0.922 | 0.987 | 0.917 | 0.95 |
gbc | Gradient Boosting Classifier | 0.916 | 0.92 | 0.987 | 0.913 | 0.948 |
et | Extra Trees Classifier | 0.903 | 0.901 | 0.979 | 0.905 | 0.94 |
ada | Ada Boost Classifier | 0.8764 | 0.888 | 0.949 | 0.898 | 0.923 |
dt | Decision Tree Classifier | 0.872 | 0.821 | 0.912 | 0.924 | 0.918 |
lda | Linear Discriminant Analysis | 0.8456 | 0.848 | 0.95 | 0.866 | 0.906 |
ridge | Ridge Classifier | 0.8426 | 0 | 0.969 | 0.85 | 0.906 |
knn | K Neighbors Classifier | 0.8339 | 0.807 | 0.927 | 0.869 | 0.897 |
nb | Naive Bayes | 0.819 | 0.801 | 0.916 | 0.861 | 0.888 |
lr | Logistic Regression | 0.8064 | 0.762 | 0.983 | 0.81 | 0.888 |
dummy | Dummy Classifier | 0.7819 | 0.5 | 1 | 0.782 | 0.878 |
qda | Quadratic Discriminant Analysis | 0.6066 | 0.528 | 0.67 | 0.83 | 0.651 |
svm | SVM – Linear Kernel | 0.5573 | 0 | 0.552 | 0.587 | 0.549 |
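The results indicate that CatBoost achieves the highest accuracy (0.9269), with LightGBM a close second (0.9265); both reach an AUC of about 0.938, slightly below XGBoost’s 0.942. Recall is about 0.99 for both models, meaning that they correctly identify almost all of the true positives. Judging by the Dummy Classifier row (a recall of 1 with a precision of 0.782), the positive class here appears to be the majority class, that is, loans that did not default. Precision is lower, at around 0.92, so roughly 8% of the loans predicted as positive belong to the other class. The F1 scores of the two models are the same to three decimals (0.955), which indicates a similar balance between precision and recall.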
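As a quick sanity check, the F1 column can be recomputed from the precision and recall columns, since F1 is their harmonic mean. The sketch below uses only the rounded values printed in the table, so it can only be expected to agree to about three decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall values taken from the results table.
catboost_f1 = f1_score(precision=0.922, recall=0.990)
lightgbm_f1 = f1_score(precision=0.921, recall=0.991)
dummy_f1 = f1_score(precision=0.782, recall=1.0)

print(round(catboost_f1, 3), round(lightgbm_f1, 3), round(dummy_f1, 3))
```

All three values agree with the F1 column, which also shows that a trivial classifier can reach a high F1 (0.878) when the positive class is the majority, so F1 alone says little here.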
The next best models are XGBoost and Random Forest, which achieve an accuracy of 0.9259 and 0.9194, respectively. The AUC scores are also quite high, at 0.9418 and 0.9218. The recall for both models is high, above 0.98, but the precision is lower, around 0.92. The F1 score is also high, with values around 0.95.
The Gradient Boosting Classifier and Extra Trees Classifier have lower accuracy (0.9160 and 0.9030, respectively) and lower AUC (0.920 and 0.901). However, they still have high recall, above 0.97, and their F1 scores are also relatively high (0.948 and 0.940).
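The remaining models perform worse in terms of accuracy, AUC, and F1 score, with the linear-kernel SVM achieving the lowest accuracy (0.5573). The Dummy Classifier reaches an accuracy of 0.7819 with an AUC of 0.5; its recall of 1 suggests that it simply predicts the majority class for every loan, so it serves as the baseline that any useful model has to beat. Quadratic Discriminant Analysis and the linear SVM fall below that baseline.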
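To see why a majority-class baseline lands near 0.78 accuracy with an AUC of 0.5, here is a minimal sketch using scikit-learn's `DummyClassifier`. The labels are synthetic (78 of one class, 22 of the other) and the coding of 1 as "no default" is an assumption made for the illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels: 1 = no default (majority), 0 = default. Coding is assumed.
y = np.array([1] * 78 + [0] * 22)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# "prior" always predicts the most frequent class and outputs class priors.
baseline = DummyClassifier(strategy="prior").fit(X, y)
pred = baseline.predict(X)
proba = baseline.predict_proba(X)[:, 1]

baseline_accuracy = accuracy_score(y, pred)
baseline_auc = roc_auc_score(y, proba)
print(baseline_accuracy, baseline_auc)
```

Accuracy equals the majority share, and the AUC is 0.5 because the model gives every loan the same score and therefore cannot rank defaults above non-defaults.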
Overall, the CatBoost and LightGBM models perform best on this dataset, achieving high accuracy, AUC, recall, and F1 scores. It is worth noting that the preprocessing step used simple imputation and did not perform any feature engineering beyond the transformations described above, which suggests that there may be further opportunities to improve the performance of the models by exploring these techniques.
The feature importance analysis for the CatBoost Classifier revealed that the “percent income” feature had the highest influence on loan status, followed closely by “income” and “interest rate”. This suggests that the loan applicant’s financial standing, including their current income and the interest rate they are offered, are highly indicative of whether they will default on the loan or not. On the other hand, features such as “historical default” and “credit history length” were found to be less important, indicating that past loan repayment behavior may not be as strong a predictor of future loan default as other factors. Additionally, certain classes of loan intent were found to be less important, suggesting that the type of loan being applied for may not have as significant an impact on loan default as other factors. Overall, understanding the feature importance can help identify which factors to prioritize when making loan decisions, ultimately improving the accuracy and reliability of the lending process.
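Feature rankings of this kind can be approximated without CatBoost. The sketch below swaps in a different technique, permutation importance on a scikit-learn random forest, and runs it on synthetic data in which a hypothetical `percent_income` feature fully determines the target. It illustrates the idea only and does not reproduce the project's importances.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: the target depends on percent_income, not on the noise column.
rng = np.random.default_rng(0)
n = 600
percent_income = rng.uniform(0.0, 0.6, n)
noise = rng.normal(size=n)
X = np.column_stack([percent_income, noise])
y = (percent_income > 0.3).astype(int)
feature_names = ["percent_income", "noise"]  # hypothetical names

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle one column at a time and measure how much the score drops.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

Importance scores describe what a particular model relies on, and correlated features can share or mask each other's importance, so a low score for a feature such as historical default does not by itself show that the feature is uninformative.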
Conclusion
Throughout this project, we conducted a comprehensive analysis of a dataset containing information about loan applicants and whether or not they defaulted on their loan. We began by performing exploratory data analysis on each variable in the dataset, using a combination of histograms, box plots, and scatter plots to gain insights into the distributions and relationships among the variables.
From there, we conducted bivariate data analysis, examining how various features were related to the loan status of each applicant. We found that certain variables, such as percent income, income, interest rate, and home ownership, had a strong influence on the target variable. We also identified that some loan intent classes and credit history length were less important in determining loan status.
Finally, we constructed and evaluated several machine learning models to predict loan status based on the available features. We found that the CatBoost classifier provided the best performance with an accuracy of 0.9269 and AUC of 0.9378. We also identified the important features for the CatBoost model, which could be useful for future modeling or business decisions.
Overall, this project provided a comprehensive analysis of a loan applicant dataset, including data exploration, feature selection, and modeling. It is important to note that this post only provides a brief summary of the analysis conducted, as the full project involved many additional analyses and insights.