Bank Marketing
All the files of this project are saved in a GitHub repository.
Bank Marketing Dataset
The Bank Marketing dataset contains the direct marketing campaigns of a Portuguese banking institution. The original dataset can be found on Kaggle.
The dataset consists of:
- Train Set: 36,168 observations with 16 features and the target `y`.
- Test Set: 9,043 observations with 16 features. The `y` column will be added to the Test Set, filled with NAs, to ease the pre-processing stage.

This project aims to predict whether a customer will subscribe to a bank term deposit, based on their features and the call history of previous marketing campaigns.
Packages
This analysis requires these R packages:
- Data Manipulation: `data.table`, `dplyr`, `tibble`, `tidyr`
- Plotting: `corrplot`, `GGally`, `ggmap`, `ggplot2`, `grid`, `gridExtra`, `ggthemes`, `tufte`
- Machine Learning: `AUC`, `caret`, `caretEnsemble`, `flexclust`, `glmnet`, `MLmetrics`, `pROC`, `ranger`, `xgboost`
- Multithreading: `doParallel`, `factoextra`, `foreach`, `parallel`
- Reporting: `kableExtra`, `knitr`, `RColorBrewer`, `rsconnect`, `shiny`, `shinydashboard`, and `beepr`.
These packages are installed and loaded if necessary by the main script.
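The main script follows the usual install-if-missing pattern; a minimal sketch (the package vector below is truncated for brevity):

```r
# Install any missing packages, then load everything (illustrative subset of the full list)
pkgs <- c("data.table", "dplyr", "ggplot2", "caret", "ranger", "xgboost", "glmnet", "caretEnsemble")
missing <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))
```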
Data Loading
The data is fairly clean: the variables are a combination of integers and factors, with no missing values.
## [1] "0 columns of the Train Set have NAs."
## [1] "0 columns of the Test Set have NAs."
As this analysis is a classification task, the target `y` has to be set as a factor. The structures of the datasets after this initial preparation are:
## Structure of the Train Set:
## 'data.frame': 36168 obs. of 17 variables:
## $ age : int 50 47 56 36 41 32 26 60 39 55 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 3 10 4 2 5 9 9 2 8 1 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 2 2 2 3 3 2 1 1 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 1 1 1 3 2 1 2 2 ...
## $ default : num 1 0 0 0 0 0 0 0 0 0 ...
## $ balance : int 537 -938 605 4608 362 0 782 193 2140 873 ...
## $ housing : num 1 1 0 1 1 0 0 1 1 1 ...
## $ loan : num 0 0 0 0 0 0 0 0 0 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 1 1 1 1 1 2 1 3 ...
## $ day : int 20 28 19 14 12 4 29 12 16 3 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 7 9 2 9 9 4 5 9 1 7 ...
## $ duration : int 11 176 207 284 217 233 297 89 539 131 ...
## $ campaign : int 15 2 6 7 3 3 1 2 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 276 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 2 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 1 4 4 4 4 ...
## $ y : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
## Structure of the Test Set:
## 'data.frame': 9043 obs. of 17 variables:
## $ age : int 58 43 51 56 32 54 58 54 32 38 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 6 5 2 6 7 2 5 5 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 2 2 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 1 3 1 2 3 2 3 3 ...
## $ default : num 0 0 0 0 0 0 0 0 0 0 ...
## $ balance : int 2143 593 229 779 23 529 -364 1291 0 424 ...
## $ housing : num 1 1 1 1 1 1 1 1 1 1 ...
## $ loan : num 0 0 0 0 1 0 0 0 0 0 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 55 353 164 160 1492 355 266 179 104 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "No","Yes": NA NA NA NA NA NA NA NA NA NA ...
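A minimal loading sketch, assuming the raw files are CSVs named `train.csv` and `test.csv` (the actual file names and the raw coding of `y` may differ in the repository):

```r
# Read the raw data; character columns become factors
train <- read.csv("train.csv", stringsAsFactors = TRUE)
test  <- read.csv("test.csv",  stringsAsFactors = TRUE)

# The Test Set has no target: add an empty `y` column so both sets share the same structure
test$y <- NA

# Classification task, so the target must be a factor (assuming raw values "no"/"yes")
train$y <- factor(train$y, labels = c("No", "Yes"))
test$y  <- factor(test$y, levels = c("No", "Yes"))

# Sanity check: no missing values apart from the artificial `y` of the Test Set
sum(colSums(is.na(train)) > 0)
sum(colSums(is.na(test[, names(test) != "y"])) > 0)
```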
Exploratory Data Analysis
The target of this analysis is the variable `y`. This boolean indicates whether the customer has acquired a bank term deposit. With 88.4% of the customers not having subscribed to this product, our Train Set is clearly imbalanced. We might want to try rebalancing our dataset later in this analysis, to ensure our model performs properly on unseen data.
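The class balance can be checked directly on the target (using the object names from the loading sketch above):

```r
# Share of customers who did not / did subscribe in the Train Set
round(prop.table(table(train$y)) * 100, 1)
##   No  Yes
## 88.4 11.6
```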
The features of the dataset provide different types of information about the customers.

- Variables giving personal information about the customers:
  - `age` of the customer: Customers are between 18 and 95 years old, with a mean of 41 and a median of 39. The inter-quartile range is between 33 and 48. We can also notice the presence of some outliers.
  - `job` category of the customer: There are 12 categories of jobs, with more than half belonging to `blue-collar`, `management` and `technician`, followed by admin and services. Retired candidates form 5% of the dataset, self-employed and entrepreneurs around 6%, and unemployed, housemaids and students each around 2%. Candidates with unknown jobs form less than 1%.
  - `marital` status of the customer: 60% are married, 28% are single, the others are divorced.
  - `education` level of the customer: 51% of the customers went to secondary school, 29% to tertiary school, 15% to primary school. The education of the other customers remains unknown.
- Variables related to the financial status of the customers:
  - `default` history: This boolean indicates whether the customer has already defaulted. Only 1.8% of the customers have defaulted.
  - `balance` of the customer's account: The average yearly balance of the customer in euros. The variable ranges from -8,019 to 98,417 with a mean of 1,360 and a median of 448. The data is highly right-skewed.
  - `housing` loan: This boolean indicates whether the customer has a housing loan. 56% of the customers have one.
  - `loan`: This boolean indicates whether the customer has a personal loan. 16% of the customers have one.
- Variables related to campaign interactions with the customer:
  - `contact` mode: How the customer was contacted, with 65% on their mobile phone and 6% on a landline.
  - `day`: The day of the month on which the customer was contacted.
  - `month`: The month in which the customer was contacted. May is the peak month with 31% of the calls, followed by June, July, and August.
  - `duration` of the call: Duration of the last phone call in seconds. The average call lasts around 4 minutes; the longest call lasts 1.4 hours. Using `duration` would make our model non-deployable, since the duration is only known after the call has taken place, but we decided to use it for the sake of the project. If we were to ignore it, we would instead try to develop a model that predicts the duration of the call from the other data.
  - `campaign`: Number of times the customer was contacted during this campaign. Customers can have been contacted up to 63 times. Around 66% were contacted twice or less.
  - `pdays`: Number of days that passed since the customer was last contacted in a previous campaign. `-1` means the client was not previously contacted and this is their first campaign. Around 82% of the candidates are new campaign clients. The average time elapsed is 40 days.
  - `previous` contacts: Number of contacts performed before this campaign. The majority of the customers were never contacted; the others were contacted once on average, with a maximum of 58 times.
  - `poutcome`: This categorical variable indicates the outcome of a previous campaign, whether it was a success or a failure. About 3% of the customers answered positively to previous campaigns.
A quick look at the Test Set shows that the variables follow very similar distributions.
Analysis Method
This flowchart describes the method we used for this analysis.
Data Preparation
As the Test Set doesn't contain the feature `y`, it is necessary to randomly split the Train Set in two, with an 80/20 ratio:
- Train Set A, which will be used to train our models.
- Train Set B, which will be used to test our models and validate their performance.
These datasets are rescaled to the range `0` to `1`, using the `preProcess` function of the `caret` package. This transformation should improve the performance of linear models.

Categorical variables are dummified, using the `dummyVars` function of the `caret` package.
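A minimal sketch of this preparation with `caret` (object names are illustrative, not necessarily those of the actual script):

```r
library(caret)

set.seed(1)
# 80/20 split of the labelled data into Train Set A (modelling) and Train Set B (validation)
idx     <- createDataPartition(train$y, p = 0.8, list = FALSE)
train_a <- train[idx, ]
train_b <- train[-idx, ]

# Rescale numeric features to [0, 1]; the transformation is learned on Train Set A only
rng     <- preProcess(train_a, method = "range")
train_a <- predict(rng, train_a)
train_b <- predict(rng, train_b)

# Dummify the categorical variables (the target is excluded from the encoding)
dmy       <- dummyVars(y ~ ., data = train_a)
train_a_x <- data.frame(predict(dmy, newdata = train_a))
train_b_x <- data.frame(predict(dmy, newdata = train_b))
```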
Cross-Validation Strategy
To validate the stability of our models, we will apply a 10-fold cross-validation, repeated 3 times.
(Note: for the stacking step explained later in the report, we use another, more extensive cross-validation approach.)
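With `caret`, this resampling scheme can be declared once and reused across models; a minimal sketch (the `classProbs`/`summaryFunction` settings are assumptions, needed for ROC-based metrics):

```r
# 10-fold cross-validation, repeated 3 times, shared by the baseline and feature-engineering models
ctrl <- trainControl(method          = "repeatedcv",
                     number          = 10,
                     repeats         = 3,
                     classProbs      = TRUE,
                     summaryFunction = twoClassSummary)
```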
Baseline
For our baseline, we used 3 algorithms: Logistic Regression, XGBoost and Ranger (Random Forest).
While the Accuracies of the resulting models are similar, there is a big difference in terms of Sensitivity. Ranger has the best performance so far.
Model | Accuracy | Sensitivity | Precision | Specificity | F1 Score | AUC | Coefficients | Train Time (min) |
---|---|---|---|---|---|---|---|---|
Ranger baseline | 0.9738698 | 0.8319428 | 0.9356568 | 0.9924930 | 0.8807571 | 0.9569605 | 48 | 75.3 |
XGBoost baseline | 0.9252039 | 0.5589988 | 0.7328125 | 0.9732562 | 0.6342123 | 0.8383462 | 48 | 25.3 |
Logistic Reg. baseline | 0.9033596 | 0.3408820 | 0.6620370 | 0.9771661 | 0.4500393 | 0.7903627 | 49 | 0.5 |
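A sketch of how the three baseline models could be fitted with `caret` under the shared resampling scheme (formula interface and object names are illustrative):

```r
set.seed(1)
fit_glm    <- train(y ~ ., data = train_a, method = "glm", family = "binomial",
                    trControl = ctrl, metric = "ROC")
fit_xgb    <- train(y ~ ., data = train_a, method = "xgbTree", trControl = ctrl, metric = "ROC")
fit_ranger <- train(y ~ ., data = train_a, method = "ranger",  trControl = ctrl, metric = "ROC")

# Validation of a model on Train Set B
pred_b <- predict(fit_ranger, newdata = train_b)
confusionMatrix(pred_b, train_b$y, positive = "Yes")
```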
Feature Engineering
A. Clusters
The first step of the Feature Engineering process is to create a new feature based on a clustering method.

First of all, we select the variables describing the clients: `age`, `job`, `marital`, `education`, `default`, `balance`, `housing`, and `loan`. These will be the clustering variables. Since we are using a K-means algorithm, we first define the optimal number of clusters.
With K-means, 9 clusters have been created and added to the dataframe.
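A hedged sketch of this step (the exact variable encoding and the `nstart`/`k.max` settings are assumptions); `flexclust::kcca` is used so the clusters learned on Train Set A can also be assigned to Train Set B:

```r
library(factoextra)
library(flexclust)

# Client-profile columns of the dummified data (illustrative subset; the dummified
# job/marital/education columns would be included as well)
clust_vars <- c("age", "balance", "default", "housing", "loan")

# Inspect the within-cluster sum of squares to choose the number of clusters
fviz_nbclust(train_a_x[, clust_vars], kmeans, method = "wss", k.max = 15)

# Fit K-means on Train Set A with the retained number of clusters; flexclust::kcca
# supports predict(), so the same clusters can be assigned to Train Set B
set.seed(1)
kcc <- kcca(as.matrix(train_a_x[, clust_vars]), k = 9, family = kccaFamily("kmeans"))
train_a$cluster <- factor(clusters(kcc))
train_b$cluster <- factor(predict(kcc, newdata = as.matrix(train_b_x[, clust_vars])))
```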
With this new feature, we retrain the 3 models that we used as our baseline and compare the results with the previous ones.
Model | Accuracy | Sensitivity | Precision | Specificity | F1 Score | AUC | Coefficients | Train Time (min) |
---|---|---|---|---|---|---|---|---|
Ranger baseline | 0.9738698 | 0.8319428 | 0.9356568 | 0.9924930 | 0.8807571 | 0.9569605 | 48 | 75.3 |
Ranger FE1 Clustering | 0.9738698 | 0.8307509 | 0.9368280 | 0.9926494 | 0.8806064 | 0.9574724 | 57 | 97.6 |
XGBoost FE1 Clustering | 0.9270012 | 0.5697259 | 0.7410853 | 0.9738818 | 0.6442049 | 0.8431443 | 57 | 28.6 |
XGBoost baseline | 0.9252039 | 0.5589988 | 0.7328125 | 0.9732562 | 0.6342123 | 0.8383462 | 48 | 25.3 |
Logistic Reg. FE1 Clustering | 0.9029448 | 0.3444577 | 0.6553288 | 0.9762277 | 0.4515625 | 0.7871756 | 58 | 0.4 |
Logistic Reg. baseline | 0.9033596 | 0.3408820 | 0.6620370 | 0.9771661 | 0.4500393 | 0.7903627 | 49 | 0.5 |
Clustering slightly improves the results of Logistic Regression and XGBoost, but not those of Ranger.
B. Binning
The next Feature Engineering step is binning some of the numerical variables (`age`, `balance`, `duration` and `campaign`) according to their quantiles. Quantile binning aims to assign the same number of observations to each bin. In the following steps we try binning with various numbers of quantiles:
- 3 bins (dividing the data in 3 quantiles of approx 33% each)
- 4 bins (quartiles of 25%)
- 5 bins (quantiles of 20%)
- 10 bins (quantiles of 10%)
The clusters, having been computed on Train Set A, can then be associated with the points of Train Set B.
After the binning, we one-hot-encode the generated categorical variables.
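A minimal sketch of the quantile binning, assuming the bin boundaries are learned on Train Set A and then applied to Train Set B (the helper `quantile_bin` is illustrative):

```r
# Cut a numeric variable into bins holding roughly the same number of observations;
# the boundaries are computed from the reference vector (Train Set A) only
quantile_bin <- function(x, x_ref, n_bins) {
  breaks <- unique(quantile(x_ref, probs = seq(0, 1, length.out = n_bins + 1), na.rm = TRUE))
  cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

for (v in c("age", "balance", "duration", "campaign")) {
  for (n in c(3, 4, 5, 10)) {
    col <- paste0(v, "_bin_", n)
    train_a[[col]] <- factor(quantile_bin(train_a[[v]], train_a[[v]], n))
    train_b[[col]] <- factor(quantile_bin(train_b[[v]], train_a[[v]], n))
  }
}
# The new factor columns are then one-hot-encoded with dummyVars(), as before
```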
Comparing the results of the 3 algorithms with the best models we had so far, adding binning to Ranger significantly improves its performance.
Model | Accuracy | Sensitivity | Precision | Specificity | F1 Score | AUC | Coefficients | Train Time (min) |
---|---|---|---|---|---|---|---|---|
Ranger FE2 Binning | 0.9776027 | 0.8617402 | 0.9401821 | 0.9928058 | 0.8992537 | 0.9611183 | 145 | 266.8 |
Ranger baseline | 0.9738698 | 0.8319428 | 0.9356568 | 0.9924930 | 0.8807571 | 0.9569605 | 48 | 75.3 |
Ranger FE1 Clustering | 0.9738698 | 0.8307509 | 0.9368280 | 0.9926494 | 0.8806064 | 0.9574724 | 57 | 97.6 |
XGBoost FE2 Binning | 0.9232684 | 0.5137068 | 0.7456747 | 0.9770097 | 0.6083275 | 0.8421837 | 145 | 67.8 |
Logistic Reg. FE2 Binning | 0.9028066 | 0.3551847 | 0.6478261 | 0.9746637 | 0.4588145 | 0.7839751 | 146 | 2.0 |
Feature Selection with Lasso and RFE
After completing the necessary pre-processing, the feature engineering and our initial models, it is time to experiment with feature selection methodologies. We follow two methods:
- Feature Selection using Lasso Logistic Regression.
- Feature Selection with Recursive Feature Elimination using Random Forest functions.
Going deeper into the process, we first take the variables selected by the Lasso regression, which also helps deal with multicollinearity. In particular, we started with 145 variables and the algorithm ended up selecting 83 of them as important; the remaining 62 features were rejected by Lasso.
## [1] "The Lasso Regression selected 83 variables, and rejected 62 variables."
Rejected Features | ||||
---|---|---|---|---|
age_bin_10.2 | age_bin_10.3 | age_bin_10.5 | age_bin_10.6 | age_bin_10.7 |
age_bin_3.1 | age_bin_3.3 | age_bin_4.2 | age_bin_4.4 | age_bin_5.1 |
age_bin_5.2 | age_bin_5.3 | age_bin_5.4 | age_bin_5.5 | balance_bin_10.1 |
balance_bin_10.10 | balance_bin_10.3 | balance_bin_10.5 | balance_bin_10.8 | balance_bin_10.9 |
balance_bin_3.1 | balance_bin_3.3 | balance_bin_4.2 | balance_bin_4.3 | balance_bin_5.2 |
balance_bin_5.3 | campaign_bin_10.10 | campaign_bin_10.2 | campaign_bin_10.3 | campaign_bin_10.4 |
campaign_bin_10.5 | campaign_bin_10.6 | campaign_bin_10.8 | campaign_bin_3.2 | campaign_bin_4.2 |
campaign_bin_5.2 | campaign_bin_5.3 | campaign_bin_5.4 | cluster.2 | cluster.3 |
cluster.4 | cluster.6 | cluster.7 | contact.telephone | duration_bin_10.2 |
duration_bin_10.3 | duration_bin_10.4 | duration_bin_10.5 | duration_bin_10.6 | duration_bin_10.8 |
duration_bin_10.9 | duration_bin_3.2 | duration_bin_5.3 | education.secondary | job.management |
job.services | job.technician | marital.divorced | marital.single | month.may |
pdays | poutcome.failure |
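A hedged sketch of the Lasso selection step with `glmnet` (choosing `lambda.1se` is an assumption; the actual script may use a different penalty):

```r
library(glmnet)

# Dummified feature matrix and target of Train Set A
x <- as.matrix(train_a_x)
y <- train_a$y

# Lasso logistic regression with cross-validated penalty; variables with a
# non-zero coefficient are kept
set.seed(1)
cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1)

coefs    <- coef(cv_lasso, s = "lambda.1se")
selected <- setdiff(rownames(coefs)[as.vector(coefs != 0)], "(Intercept)")
length(selected)   # 83 variables in our run
```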
As a follow-up step, we apply RFE, using Random Forest functions and cross-validation, to the variables kept by the Lasso regression, in order to obtain another subset of important variables, this time based on a tree method (Random Forest). We ended up with 18 variables; the justification for this number is the plot below, produced by the RFE procedure.
Selected Features | ||||
---|---|---|---|---|
age | balance | contact.cellular | contact.unknown | day |
duration | duration_bin_10.10 | housing | month.apr | month.aug |
month.dec | month.jul | month.mar | month.oct | month.sep |
poutcome.success | poutcome.unknown | previous |
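A sketch of the RFE step with `caret` (the `sizes` vector and fold count are assumptions):

```r
# Recursive Feature Elimination with Random Forest importance, cross-validated,
# run on the variables kept by the Lasso step
rfe_ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

set.seed(1)
rfe_fit <- rfe(x          = train_a_x[, selected],
               y          = train_a$y,
               sizes      = c(5, 10, 15, 18, 20, 30, 50),
               rfeControl = rfe_ctrl)

predictors(rfe_fit)                 # the retained subset (18 variables in our run)
plot(rfe_fit, type = c("g", "o"))   # performance vs. number of variables
```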
Tuning
Now that we have the most efficient set of variables according to our Feature Selection approach, we retrain on this set of variables, using once more our 3 main algorithms: Logistic Regression, Random Forest and XGBoost.
At this stage, in order to improve our results, we apply to each algorithm an extensive grid search over the parameters that can be tuned and might affect performance.
XGBoost Grid Search
Ranger Grid Search
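A hedged sketch of the grid search with `caret` (the grid values below are illustrative, not the exact grids we ran; `train_a_sel` denotes the data restricted to the 18 RFE variables):

```r
# Data restricted to the 18 variables retained by RFE, plus the target
train_a_sel <- cbind(train_a_x[, predictors(rfe_fit)], y = train_a$y)

# Illustrative XGBoost grid (the real grid may differ)
xgb_grid <- expand.grid(nrounds          = c(100, 300, 500),
                        max_depth        = c(3, 6, 9),
                        eta              = c(0.01, 0.05, 0.1),
                        gamma            = c(0, 1),
                        colsample_bytree = c(0.8, 1),
                        min_child_weight = c(1, 3, 5),
                        subsample        = c(0.8, 1))

set.seed(1)
fit_xgb_tuned <- train(y ~ ., data = train_a_sel, method = "xgbTree",
                       trControl = ctrl, tuneGrid = xgb_grid, metric = "ROC")

# Equivalent grid for ranger: mtry, splitrule and min.node.size
rf_grid <- expand.grid(mtry = c(2, 4, 6, 8), splitrule = "gini", min.node.size = c(1, 5, 9))
fit_rf_tuned <- train(y ~ ., data = train_a_sel, method = "ranger",
                      trControl = ctrl, tuneGrid = rf_grid, metric = "ROC")
```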
The optimal parameters for XGBoost and Random Forest are:
- XGBoost
- max_depth = 6
- gamma = 0
- eta = 0.05
- colsample_bytree = 1
- min_child_weight = 3
- subsample = 1
- Random Forest
- mtry = 4
- splitrule = gini
- min.node.size = 9
Unfortunately, reducing the number of variables significantly degrades the performance of our models.
Model | Accuracy | Sensitivity | Precision | Specificity | F1 Score | AUC | Coefficients | Train Time (min) |
---|---|---|---|---|---|---|---|---|
Ranger FE2 Binning | 0.9776027 | 0.8617402 | 0.9401821 | 0.9928058 | 0.8992537 | 0.9611183 | 145 | 266.8 |
Ranger baseline | 0.9738698 | 0.8319428 | 0.9356568 | 0.9924930 | 0.8807571 | 0.9569605 | 48 | 75.3 |
Ranger FE1 Clustering | 0.9738698 | 0.8307509 | 0.9368280 | 0.9926494 | 0.8806064 | 0.9574724 | 57 | 97.6 |
XGBoost Tuning | 0.9065395 | 0.4386174 | 0.6422339 | 0.9679387 | 0.5212465 | 0.7857566 | 18 | 199.8 |
Ranger Tuning | 0.9054334 | 0.3969011 | 0.6516634 | 0.9721614 | 0.4933333 | 0.7881941 | 18 | 62.1 |
Stacking
At this stage, we have optimized our variable set and we have tuned our algorithms through a Grid Search. There is another option that we want to try: stacking models.
As an initial step to decide whether stacking makes sense, we gather all our previous predictions and plot a correlation matrix. If the predictions are uncorrelated, or at least weakly correlated, the models that generated them are capturing different aspects of the validation set, so it makes sense to combine them through a stacking approach. Based on the matrix below, the models that seem best to combine are those trained after the binning feature-engineering step, as the correlation between their predictions is around 50%.
However, in order to have a more complete modelling approach, we want to stack all the possible combinations and compare their performance. In more detail, we follow the process below (a sketch of the stacking setup follows the list):
- We create the following stacking categories:
- Baseline modelling
- Clustering modelling (FE1)
- Binning modelling (FE2)
- RFE - Tuning modelling
- For each of the categories we stack the corresponding models with 3
different algorithms:
- Logistic Regression
- XGBoost
- Random Forest
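A minimal sketch of one stacking category with `caretEnsemble` (the fold setup and meta-learner settings are assumptions; the actual script repeats this for each category and each of the three meta-learners):

```r
library(caretEnsemble)

# Base learners trained with shared resampling indices, so their out-of-fold
# predictions line up for the meta-learner
stack_ctrl <- trainControl(method          = "cv",
                           number          = 10,
                           savePredictions = "final",
                           classProbs      = TRUE,
                           summaryFunction = twoClassSummary,
                           index           = createFolds(train_a$y, k = 10, returnTrain = TRUE))

base_models <- caretList(y ~ ., data = train_a,
                         trControl  = stack_ctrl,
                         metric     = "ROC",
                         methodList = c("glm", "ranger", "xgbTree"))

# Correlation between the base models' resampled results, to judge whether stacking is worthwhile
modelCor(resamples(base_models))

# Meta-learner on top of the base models (glm here; rf and xgbTree are tried the same way)
stack_glm <- caretStack(base_models, method = "glm", metric = "ROC",
                        trControl = trainControl(method = "cv", number = 10,
                                                 classProbs = TRUE,
                                                 summaryFunction = twoClassSummary))
```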
Final Model
Based on the table shown below, in which we gathered all the necessary metrics to compare our models, we decided to go with the Ranger (Random Forest) algorithm trained on the set of variables that includes the binning step of the Feature Engineering process; this is the model we use for our final submission.
The metric that was required is `Sensitivity`, and it was the main driver of our decision. As a closing observation, our models are very good at predicting the No value of the target variable, as we manage to achieve a high Precision, but they have a harder time (compared to the No's) on the Yes value: the best model has a `Recall` of only 86%. We can see this in the various confusion matrices, in which the True Negative count is very high, which supports the above observation.
Model | Accuracy | Sensitivity | Precision | Specificity | F1 Score | AUC | Coefficients | Train Time (min) |
---|---|---|---|---|---|---|---|---|
Ranger FE2 Binning | 0.9776027 | 0.8617402 | 0.9401821 | 0.9928058 | 0.8992537 | 0.9611183 | 145 | 266.8 |
Ranger baseline | 0.9738698 | 0.8319428 | 0.9356568 | 0.9924930 | 0.8807571 | 0.9569605 | 48 | 75.3 |
Ranger FE1 Clustering | 0.9738698 | 0.8307509 | 0.9368280 | 0.9926494 | 0.8806064 | 0.9574724 | 57 | 97.6 |
XGBoost FE1 Clustering | 0.9270012 | 0.5697259 | 0.7410853 | 0.9738818 | 0.6442049 | 0.8431443 | 57 | 28.6 |
XGBoost baseline | 0.9252039 | 0.5589988 | 0.7328125 | 0.9732562 | 0.6342123 | 0.8383462 | 48 | 25.3 |
Stacking xgb clustering | 0.9113784 | 0.5220501 | 0.6460177 | 0.9624648 | 0.5774555 | 0.7924215 | 3 | 95.9 |
XGBoost FE2 Binning | 0.9232684 | 0.5137068 | 0.7456747 | 0.9770097 | 0.6083275 | 0.8421837 | 145 | 67.8 |
Stacking xgb baseline | 0.9105489 | 0.4958284 | 0.6500000 | 0.9649672 | 0.5625423 | 0.7929205 | 3 | 98.5 |
XGBoost RFE | 0.9164938 | 0.4910608 | 0.6994907 | 0.9723178 | 0.5770308 | 0.8176111 | 18 | 12.2 |
Stacking glm baseline | 0.9123462 | 0.4851013 | 0.6683087 | 0.9684079 | 0.5621547 | 0.8015457 | 4 | 98.5 |
Stacking rf baseline | 0.9025301 | 0.4755662 | 0.6009036 | 0.9585549 | 0.5309381 | 0.7669612 | 3 | 98.5 |
Stacking xgb binning | 0.9109636 | 0.4755662 | 0.6616915 | 0.9680951 | 0.5533981 | 0.7976633 | 3 | 80.5 |
Stacking glm clustering | 0.9116549 | 0.4719905 | 0.6689189 | 0.9693463 | 0.5534591 | 0.8011060 | 4 | 95.9 |
Stacking rf binning | 0.9026683 | 0.4660310 | 0.6043277 | 0.9599625 | 0.5262450 | 0.7681523 | 3 | 80.5 |
Stacking xgb Tuning | 0.9057099 | 0.4636472 | 0.6264090 | 0.9637160 | 0.5328767 | 0.7791755 | 3 | 72.2 |
Stacking rf clustering | 0.8981059 | 0.4588796 | 0.5763473 | 0.9557398 | 0.5109489 | 0.7535963 | 3 | 95.9 |
Stacking glm binning | 0.9077838 | 0.4505364 | 0.6472603 | 0.9677823 | 0.5312720 | 0.7889633 | 4 | 80.5 |
Stacking rf Tuning | 0.9008710 | 0.4469607 | 0.5971338 | 0.9604317 | 0.5112474 | 0.7634420 | 3 | 72.2 |
XGBoost Tuning | 0.9065395 | 0.4386174 | 0.6422339 | 0.9679387 | 0.5212465 | 0.7857566 | 18 | 199.8 |
Stacking glm Tuning | 0.9050187 | 0.4219309 | 0.6366906 | 0.9684079 | 0.5075269 | 0.7820266 | 4 | 72.2 |
Ranger Tuning | 0.9054334 | 0.3969011 | 0.6516634 | 0.9721614 | 0.4933333 | 0.7881941 | 18 | 62.1 |
Logistic Reg. FE2 Binning | 0.9028066 | 0.3551847 | 0.6478261 | 0.9746637 | 0.4588145 | 0.7839751 | 146 | 2.0 |
Logistic Reg. FE1 Clustering | 0.9029448 | 0.3444577 | 0.6553288 | 0.9762277 | 0.4515625 | 0.7871756 | 58 | 0.4 |
Logistic Reg. RFE | 0.9032213 | 0.3444577 | 0.6583144 | 0.9765405 | 0.4522692 | 0.7886803 | 19 | 0.3 |
Logistic Reg. baseline | 0.9033596 | 0.3408820 | 0.6620370 | 0.9771661 | 0.4500393 | 0.7903627 | 49 | 0.5 |
The overall objective of the model is to identify customers who are responsive to a campaign and will eventually purchase the product. Though the final model does feature high sensitivity (True Positive rate), it also has relatively low specificity (True Negative rate). This implies a lot of customers targeted by the campaign may ultimately end up rejecting the offer. Ultimately, the trade-off stands between the cost of targeting the client with the campaign, and the increased revenue from capturing the client.
In this particular case, one can assume the cost of running the campaign is only a small fraction of the Customer Lifetime Value. Therefore, it makes sense to provide an aggressive rather than a conservative model, since the cost of the campaign may only involve customer service labor at relatively low wages. In other settings, where the cost of a false positive is higher relative to the benefit of a true positive, a more conservative option should be adopted.
- Among those that we don't call, nearly all would not have been responsive anyway.
- But among those that we do call, we still reach a lot of No's.
This table lists the Sensitivities we can achieve with each model by tuning the threshold `t`, highlighting the Sensitivities above `0.9`.
Model | t_0.05 | t_0.10 | t_0.15 | t_0.20 | t_0.25 | t_0.30 | t_0.35 | t_0.40 | t_0.45 | t_0.50 | t_0.55 | t_0.60 | t_0.65 | t_0.70 | t_0.75 | t_0.80 | t_0.85 | t_0.90 | t_0.95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
glm_baseline | 0.9631 | 0.8856 | 0.7771 | 0.6961 | 0.6198 | 0.5614 | 0.5042 | 0.4327 | 0.3886 | 0.3409 | 0.3063 | 0.2718 | 0.2384 | 0.2002 | 0.1609 | 0.1275 | 0.0894 | 0.0644 | 0.0334 |
glm_clustering | 0.9642 | 0.8868 | 0.7747 | 0.6937 | 0.6198 | 0.5566 | 0.503 | 0.4327 | 0.3862 | 0.3445 | 0.3087 | 0.2718 | 0.2396 | 0.205 | 0.1633 | 0.1275 | 0.0918 | 0.0656 | 0.0346 |
glm_binning | 0.9631 | 0.8987 | 0.8117 | 0.7342 | 0.6579 | 0.5781 | 0.5197 | 0.472 | 0.4136 | 0.3552 | 0.3063 | 0.2479 | 0.2014 | 0.1681 | 0.1383 | 0.1049 | 0.0727 | 0.0441 | 0.0215 |
glm_RFE | 0.9666 | 0.8462 | 0.7342 | 0.6603 | 0.5924 | 0.5352 | 0.4768 | 0.4327 | 0.3897 | 0.3445 | 0.2908 | 0.2503 | 0.2074 | 0.174 | 0.1466 | 0.118 | 0.0834 | 0.0584 | 0.0215 |
ranger_baseline | 0.9952 | 0.9893 | 0.9785 | 0.9714 | 0.9642 | 0.9535 | 0.9261 | 0.8975 | 0.8665 | 0.8319 | 0.77 | 0.7092 | 0.6055 | 0.5066 | 0.3874 | 0.236 | 0.1299 | 0.0346 | 0.0036 |
ranger_clustering | 0.9952 | 0.9881 | 0.9797 | 0.9738 | 0.9666 | 0.9559 | 0.9344 | 0.9058 | 0.8737 | 0.8308 | 0.7843 | 0.7199 | 0.621 | 0.5137 | 0.3874 | 0.2348 | 0.1144 | 0.0334 | 0.0048 |
ranger_binning | 0.9964 | 0.9857 | 0.9797 | 0.9702 | 0.9583 | 0.9464 | 0.9333 | 0.9201 | 0.8927 | 0.8617 | 0.8141 | 0.758 | 0.6687 | 0.5507 | 0.4041 | 0.2646 | 0.1216 | 0.0322 | 0.0012 |
ranger_tuned | 0.9666 | 0.9225 | 0.8701 | 0.8188 | 0.7688 | 0.7092 | 0.6448 | 0.5745 | 0.4958 | 0.3969 | 0.2968 | 0.2288 | 0.1728 | 0.112 | 0.056 | 0.0334 | 0.0095 | 0.0036 | 0 |
xgbTree_baseline | 0.9821 | 0.9428 | 0.9046 | 0.8522 | 0.8153 | 0.7688 | 0.7187 | 0.6627 | 0.6186 | 0.559 | 0.4851 | 0.4219 | 0.36 | 0.2861 | 0.2169 | 0.1645 | 0.1156 | 0.0572 | 0.0191 |
xgbTree_clustering | 0.9762 | 0.9416 | 0.8963 | 0.8665 | 0.8212 | 0.7831 | 0.7366 | 0.6865 | 0.6234 | 0.5697 | 0.5137 | 0.4446 | 0.3719 | 0.3027 | 0.2431 | 0.1883 | 0.1323 | 0.0763 | 0.0203 |
xgbTree_binning | 0.9774 | 0.938 | 0.8927 | 0.8546 | 0.8129 | 0.7533 | 0.6877 | 0.6341 | 0.5757 | 0.5137 | 0.4553 | 0.3909 | 0.317 | 0.2515 | 0.1895 | 0.1359 | 0.0834 | 0.0417 | 0.0119 |
xgbTree_tuned | 0.9583 | 0.9011 | 0.8427 | 0.7962 | 0.7366 | 0.6758 | 0.621 | 0.5614 | 0.5006 | 0.4386 | 0.3492 | 0.2872 | 0.2086 | 0.1549 | 0.118 | 0.0799 | 0.0358 | 0.0167 | 0.0012 |
xgbTree_RFE | 0.9714 | 0.9321 | 0.8665 | 0.8153 | 0.7557 | 0.7139 | 0.6675 | 0.6007 | 0.5435 | 0.4911 | 0.4184 | 0.3719 | 0.3063 | 0.2253 | 0.1657 | 0.1144 | 0.0667 | 0.0346 | 0.0072 |
glm_stack_baseline | 0.9404 | 0.8772 | 0.8141 | 0.7557 | 0.7092 | 0.6532 | 0.615 | 0.565 | 0.5256 | 0.4851 | 0.4279 | 0.3766 | 0.3206 | 0.2622 | 0.211 | 0.1633 | 0.1359 | 0.1097 | 0.0393 |
glm_stack_clustering | 0.9464 | 0.876 | 0.8117 | 0.7628 | 0.7044 | 0.646 | 0.6079 | 0.5638 | 0.5173 | 0.472 | 0.4124 | 0.3528 | 0.3027 | 0.2598 | 0.2038 | 0.1597 | 0.1323 | 0.1037 | 0.0501 |
glm_stack_binning | 0.9416 | 0.8605 | 0.789 | 0.7282 | 0.683 | 0.6257 | 0.5828 | 0.5411 | 0.4982 | 0.4505 | 0.4064 | 0.3528 | 0.3004 | 0.242 | 0.1895 | 0.1442 | 0.1097 | 0.0513 | 0.006 |
glm_stack_tuned | 0.9321 | 0.8319 | 0.7771 | 0.7092 | 0.6567 | 0.6114 | 0.5709 | 0.5292 | 0.4863 | 0.4219 | 0.3874 | 0.329 | 0.2765 | 0.23 | 0.1907 | 0.1418 | 0.0942 | 0.0405 | 0.0072 |
rf_stack_baseline | 0.9535 | 0.9237 | 0.8927 | 0.8427 | 0.8033 | 0.758 | 0.7044 | 0.6269 | 0.559 | 0.4756 | 0.3993 | 0.3159 | 0.2467 | 0.1943 | 0.1383 | 0.0906 | 0.0381 | 0.0226 | 0.0048 |
rf_stack_clustering | 0.9607 | 0.9297 | 0.8951 | 0.857 | 0.8057 | 0.7652 | 0.7008 | 0.621 | 0.5387 | 0.4589 | 0.3969 | 0.3099 | 0.2467 | 0.1669 | 0.1132 | 0.0751 | 0.0465 | 0.0191 | 0.0024 |
rf_stack_binning | 0.9428 | 0.9106 | 0.8653 | 0.8272 | 0.7747 | 0.7235 | 0.6722 | 0.6198 | 0.5423 | 0.466 | 0.3933 | 0.3194 | 0.2503 | 0.1824 | 0.1192 | 0.0703 | 0.0405 | 0.0143 | 0.0072 |
rf_stack_tuned | 0.938 | 0.8987 | 0.8677 | 0.8176 | 0.7783 | 0.7187 | 0.6615 | 0.5959 | 0.5185 | 0.447 | 0.3766 | 0.2992 | 0.2241 | 0.1609 | 0.1108 | 0.0608 | 0.0286 | 0.0155 | 0.0012 |
xgbTree_stack_baseline | 0.9631 | 0.9404 | 0.9082 | 0.8915 | 0.8474 | 0.7926 | 0.7676 | 0.6734 | 0.6341 | 0.4958 | 0.3862 | 0.292 | 0.1979 | 0.143 | 0.1216 | 0.0596 | 0.0083 | 0 | 0 |
xgbTree_stack_clustering | 0.9666 | 0.9392 | 0.9106 | 0.882 | 0.8439 | 0.7914 | 0.7437 | 0.6734 | 0.584 | 0.5221 | 0.4124 | 0.3051 | 0.1943 | 0.1311 | 0.093 | 0.056 | 0.0024 | 0 | 0 |
xgbTree_stack_binning | 0.9726 | 0.9142 | 0.8987 | 0.8379 | 0.8045 | 0.7747 | 0.7032 | 0.6794 | 0.5948 | 0.4756 | 0.4398 | 0.2563 | 0.1812 | 0.1168 | 0.0667 | 0 | 0 | 0 | 0 |
xgbTree_stack_tuned | 0.9607 | 0.9213 | 0.8856 | 0.8236 | 0.7914 | 0.7533 | 0.6841 | 0.6186 | 0.5375 | 0.4636 | 0.3528 | 0.2336 | 0.1824 | 0.0954 | 0.0632 | 0.0262 | 0.0131 | 0 | 0 |
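A sketch of how such a threshold sweep could be computed for one model's class probabilities on Train Set B (object names follow the earlier sketches):

```r
# Class probabilities of one model on Train Set B, swept over thresholds t
prob_yes   <- predict(fit_ranger, newdata = train_b, type = "prob")[, "Yes"]
thresholds <- seq(0.05, 0.95, by = 0.05)

sens_by_t <- sapply(thresholds, function(t) {
  pred <- factor(ifelse(prob_yes >= t, "Yes", "No"), levels = c("No", "Yes"))
  caret::sensitivity(pred, train_b$y, positive = "Yes")
})
names(sens_by_t) <- paste0("t_", format(thresholds, nsmall = 2))
round(sens_by_t, 4)
```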
Bonus
To display the model results, we also set up a dashboard using `shinydashboard`.
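A minimal, illustrative skeleton of such a dashboard (not the deployed app):

```r
library(shiny)
library(shinydashboard)

# Placeholder results table; the deployed app reads the full metrics gathered above
results <- data.frame(Model = "Ranger FE2 Binning", Sensitivity = 0.8617, AUC = 0.9611)

ui <- dashboardPage(
  dashboardHeader(title = "Bank Marketing"),
  dashboardSidebar(sidebarMenu(menuItem("Model comparison", tabName = "models"))),
  dashboardBody(tabItems(tabItem(tabName = "models", tableOutput("metrics"))))
)

server <- function(input, output) {
  output$metrics <- renderTable(results)
}

shinyApp(ui, server)
```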