Human Resources Analytics
All the files of this project are saved in a GitHub repository.
Introduction
Objectives
This case study aims to model the probability of attrition for each employee in the HR Analytics Dataset, available on Kaggle. Its conclusions should help management understand which factors drive employees to leave the company and which changes should be made to prevent their departure.
Libraries
This project uses a set of libraries for data manipulation, plotting and modelling.
# Loading Libraries
import pandas as pd #Data Manipulation
import numpy as np #Data Manipulation
import matplotlib.pyplot as plt #Plotting
import seaborn as sns #Plotting
sns.set(style='white')
from sklearn import preprocessing #Preprocessing
from scipy.stats import skew, boxcox_normmax #Preprocessing
from scipy.special import boxcox1p #Preprocessing
from sklearn.model_selection import train_test_split #Train/Test Split
from sklearn.linear_model import LogisticRegression #Model
from sklearn.metrics import classification_report #Metrics
from sklearn.metrics import confusion_matrix #Metrics
from sklearn.metrics import accuracy_score #Metrics
from sklearn.metrics import roc_auc_score, roc_curve #ROC
from sklearn import model_selection #Cross Validation
from sklearn.feature_selection import RFE, RFECV #Feature Selection
Data Loading
The dataset is stored in the GitHub repository as a CSV file, turnover.csv. The file is loaded directly from the repository.
# Reading Dataset from GitHub repository
hr = pd.read_csv('https://raw.githubusercontent.com/ashomah/HR-Analytics/master/assets/data/turnover.csv')
hr.head()
satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | sales | salary | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
Data Preparation
Variable Types and Definitions
The first stage of this analysis is to describe the dataset, understand the meaning of each variable, and perform the necessary adjustments to ensure that the data is processed correctly during the Machine Learning process.
# Shape of the data frame
print('Rows:', hr.shape[0], '| Columns:', hr.shape[1])
Rows: 14999 | Columns: 10
# Describe each variable
def df_desc(df):
    desc = pd.DataFrame({'dtype': df.dtypes,
                         'NAs': df.isna().sum(),
                         'Numerical': (df.dtypes != 'object') &
                                      (df.apply(lambda column: column == 0).sum() +
                                       df.apply(lambda column: column == 1).sum() != len(df)),
                         'Boolean': df.apply(lambda column: column == 0).sum() +
                                    df.apply(lambda column: column == 1).sum() == len(df),
                         'Categorical': df.dtypes == 'object',
                         })
    return desc
df_desc(hr)
dtype | NAs | Numerical | Boolean | Categorical | |
---|---|---|---|---|---|
satisfaction_level | float64 | 0 | True | False | False |
last_evaluation | float64 | 0 | True | False | False |
number_project | int64 | 0 | True | False | False |
average_montly_hours | int64 | 0 | True | False | False |
time_spend_company | int64 | 0 | True | False | False |
Work_accident | int64 | 0 | False | True | False |
left | int64 | 0 | False | True | False |
promotion_last_5years | int64 | 0 | False | True | False |
sales | object | 0 | False | False | True |
salary | object | 0 | False | False | True |
The dataset consists of 14,999 rows and 10 columns. Each row represents an employee, and each column contains one employee attribute. None of these attributes contains NAs. Two (2) of these attributes contain decimal numbers, three (3) contain integers, three (3) contain booleans, and two (2) contain categorical values.
# Summarize numerical variables
hr.describe()
satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | |
---|---|---|---|---|---|---|---|---|
count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 |
mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.144610 | 0.238083 | 0.021268 |
std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 |
min | 0.090000 | 0.360000 | 2.000000 | 96.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.440000 | 0.560000 | 3.000000 | 156.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.640000 | 0.720000 | 4.000000 | 200.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.820000 | 0.870000 | 5.000000 | 245.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 1.000000 | 1.000000 | 7.000000 | 310.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 |
# List values of categorical variables
categories = {'sales': hr['sales'].unique().tolist(),
              'salary': hr['salary'].unique().tolist()}
for i in sorted(categories.keys()):
    print(i + ":")
    print(categories[i])
    if i != sorted(categories.keys())[-1]:
        print("\n")
salary:
['low', 'medium', 'high']
sales:
['sales', 'accounting', 'hr', 'technical', 'support', 'management', 'IT', 'product_mng', 'marketing', 'RandD']
The variable sales seems to represent the company departments. Thus, it will be renamed department.
# Rename variable sales
hr = hr.rename(index=str, columns={'sales':'department'})
The dataset contains 10 variables with no NAs:
- satisfaction_level: numerical, decimal values between 0 and 1. Employee satisfaction level, from 0 to 1.
- last_evaluation: numerical, decimal values between 0 and 1. Employee last evaluation score, from 0 to 1.
- number_project: numerical, integer values between 2 and 7. Number of projects handled by the employee.
- average_montly_hours: numerical, integer values between 96 and 310. Average monthly hours worked by the employee.
- time_spend_company: numerical, integer values between 2 and 10. Number of years spent in the company by the employee.
- Work_accident: encoded categorical, boolean. Flag indicating if the employee had a work accident.
- left: encoded categorical, boolean. Flag indicating if the employee has left the company. This is the target variable of the study, the one to be modelled.
- promotion_last_5years: encoded categorical, boolean. Flag indicating if the employee has been promoted within the past 5 years.
- department: categorical, 10 values. Department of the employee: Sales, Accounting, HR, Technical, Support, Management, IT, Product Management, Marketing, R&D.
- salary: categorical, 3 values. Salary level of the employee: Low, Medium, High.
Exploratory Data Analysis
Target Proportion
The objective of this study is to build a model to predict the value of the variable left, based on the other variables available.
# Count occurrences of each value in left
hr['left'].value_counts()
0 11428
1 3571
Name: left, dtype: int64
23.8% of the employees listed in the dataset have left the company.
The dataset is not balanced, which might introduce some bias in the predictive model. It would be interesting to run two (2) analyses, one with the imbalanced dataset and one with a dataset balanced using the Synthetic Minority Oversampling Technique (SMOTE).
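Such a balancing step would only be applied to the training set, after encoding and splitting. A minimal sketch, assuming the third-party imbalanced-learn package is installed (X_train and y_train are defined in the Training/Test Split section below):
# Sketch: oversampling the minority class with SMOTE (assumes imbalanced-learn)
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=806)
X_train_bal, y_train_bal = sm.fit_resample(X_train, y_train.values.ravel())
# X_train_bal/y_train_bal would then replace X_train/y_train when fitting the model.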
A closer look at the means of the variables highlights the differences between the employees who left the company and those who stayed.
# Get the mean of each variable for the different values of left
hr.groupby('left').mean()
satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | promotion_last_5years | |
---|---|---|---|---|---|---|---|
left | |||||||
0 | 0.666810 | 0.715473 | 3.786664 | 199.060203 | 3.380032 | 0.175009 | 0.026251 |
1 | 0.440098 | 0.718113 | 3.855503 | 207.419210 | 3.876505 | 0.047326 | 0.005321 |
Employees who left the company have:
- a lower satisfaction level: 0.44 vs 0.67.
- higher average monthly working hours: 207 vs 199.
- a lower work accident ratio: 0.05 vs 0.18.
- a lower promotion rate: 0.01 vs 0.03.
Correlation Analysis
A correlation analysis helps identify relationships between the dataset variables. A plot of their distributions, highlighting the value of the target variable, might also reveal some patterns.
# Correlation Matrix
plt.figure(figsize=(12,8))
sns.heatmap(hr.corr(), cmap='RdBu', annot=True)
plt.tight_layout()
# Pair Plot
plot = sns.PairGrid(hr, hue='left', palette=('steelblue', 'crimson'))
plot = plot.map_diag(plt.hist)
plot = plot.map_offdiag(plt.scatter)
plot.add_legend()
plt.tight_layout()
No strong correlation appears in the dataset. However:
- number_project and average_montly_hours have a moderate positive correlation (0.42).
- left and satisfaction_level have a moderate negative correlation (-0.39).
- last_evaluation and number_project have a moderate positive correlation (0.35).
- last_evaluation and average_montly_hours have a moderate positive correlation (0.34).
Turnover by Salary Levels
# Salary Levels proportions and turnover rates
print('Salary Levels proportions')
print(hr['salary'].value_counts()/len(hr)*100)
print('\n')
print('Turnover Rate by Salary level')
print(hr.groupby('salary')['left'].mean())
Salary Levels proportions
low 48.776585
medium 42.976198
high 8.247216
Name: salary, dtype: float64
Turnover Rate by Salary level
salary
high 0.066289
low 0.296884
medium 0.204313
Name: left, dtype: float64
The salary level seems to have a strong impact on employee turnover: employees with higher salaries tend to stay in the company (7% turnover), whereas employees with lower salaries tend to leave (30% turnover).
Turnover by Departments
# Departments proportions
hr['department'].value_counts()/len(hr)*100
sales 27.601840
technical 18.134542
support 14.860991
IT 8.180545
product_mng 6.013734
marketing 5.720381
RandD 5.247016
accounting 5.113674
hr 4.926995
management 4.200280
Name: department, dtype: float64
# Turnover Rate by Department
hr.groupby('department')['left'].mean().sort_values(ascending=False).plot(kind='bar', color='steelblue')
plt.title('Departure Ratio by Department')
plt.xlabel('')
plt.tight_layout()
Some observations can be inferred:
- Departure rate differs depending on the department, but no clear outlier is detected.
- HR has the highest turnover rate.
- R&D and Management have a significantly lower turnover rate.
Turnover by Satisfaction Level
# Distribution Plot
plt.figure(figsize=(15,5))
sns.distplot(hr.satisfaction_level,
bins = 20,
color = 'steelblue').axes.set_xlim(min(hr.satisfaction_level),max(hr.satisfaction_level))
plt.tight_layout()
# Bar Plot with left values
plt.figure(figsize=(15,5))
sns.countplot(hr['satisfaction_level'],
hue = hr['left'],
palette = ('steelblue', 'crimson'))
plt.tight_layout()
The Satisfaction Level shows 3 interesting areas:
- Employees leave the company below 0.12.
- There is a high rate of departure between 0.36 and 0.46.
- Turnover rate is higher between 0.72 and 0.92.
Employees with a very low satisfaction level obviously leave the company. The risky zone is when employees rate their satisfaction just below 0.5. Employees also tend to leave the company when they become only moderately satisfied.
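These zones can be sanity-checked by cutting satisfaction_level at the approximate boundaries read off the plots and computing the turnover rate per bin; the bin edges below are assumptions based on the charts, and the same check works for the other numerical variables.
# Turnover rate per satisfaction zone (approximate boundaries from the plots)
zones = pd.cut(hr['satisfaction_level'], bins=[0, 0.12, 0.36, 0.46, 0.72, 0.92, 1.0])
print(hr.groupby(zones)['left'].mean().round(2))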
Turnover by Last Evaluation
# Distribution Plot
plt.figure(figsize=(15,5))
sns.distplot(hr.last_evaluation,
bins = 20,
color = 'steelblue').axes.set_xlim(min(hr.last_evaluation),max(hr.last_evaluation))
plt.tight_layout()
# Bar Plot with left values
plt.figure(figsize=(15,5))
sns.countplot(hr['last_evaluation'],
hue = hr['left'],
palette = ('steelblue', 'crimson'))
plt.tight_layout()
The Last Evaluation shows 2 interesting areas:
- Turnover rate is higher between 0.45 and 0.57.
- Turnover rate is higher above 0.77.
Employees with low evaluation scores tend to leave the company. A large number of good employees also leave, perhaps to get a better opportunity. Interestingly, the ones with very low scores seem to stay.
Turnover by Number of Projects
# Distribution Plot
plt.figure(figsize=(15,5))
sns.distplot(hr.number_project,
bins = 20,
color = 'steelblue').axes.set_xlim(min(hr.number_project),max(hr.number_project))
plt.tight_layout()
# Bar Plot with left values
plt.figure(figsize=(15,5))
sns.countplot(hr['number_project'],
hue = hr['left'],
palette = ('steelblue', 'crimson'))
plt.tight_layout()
The main observation regarding the number of projects is that employees with only 2, or with more than 5, projects have a higher probability of leaving the company.
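This can be quantified directly with a group-by on the raw data:
# Turnover rate by number of projects
print(hr.groupby('number_project')['left'].mean().round(2))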
Turnover by Average Monthly Hours
# Distribution Plot
plt.figure(figsize=(15,5))
sns.distplot(hr.average_montly_hours,
bins = 20,
color = 'steelblue').axes.set_xlim(min(hr.average_montly_hours),max(hr.average_montly_hours))
plt.tight_layout()
# Bar Plot with left values
plt.figure(figsize=(15,5))
sns.countplot(hr['average_montly_hours'],
hue = hr['left'],
palette = ('steelblue', 'crimson'))
plt.tight_layout()
The Average Monthly Hours shows 5 interesting areas:
- Turnover rate is 0% below 125 hours.
- Turnover rate is high between 126 and 161 hours.
- Turnover rate is moderate between 217 and 274 hours.
- Turnover rate is around 50% between 275 and 287 hours.
- Turnover rate is 100% above 288 hours.
Employees with very low monthly hours (below 125) tend to stay in the company, whereas employees working too many hours (above 275) have a high probability of leaving. A ‘safe’ range is between 161 and 217 hours, which seems to be ideal for keeping employees in the company.
Turnover by Time Spent in the Company
# Bar Plot with left values
plt.figure(figsize=(15,5))
sns.countplot(hr['time_spend_company'],
hue = hr['left'],
palette = ('steelblue', 'crimson'))
plt.tight_layout()
It seems that employees with 3 to 6 years of service are the most likely to leave the company.
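As above, a group-by quantifies this observation:
# Turnover rate by years spent in the company
print(hr.groupby('time_spend_company')['left'].mean().round(2))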
Turnover by Work Accident
# Bar Plot with left values
plt.figure(figsize=(15,5))
sns.countplot(hr['Work_accident'],
hue = hr['left'],
palette = ('steelblue', 'crimson'))
plt.tight_layout()
Employees with a work accident tend to stay in the company.
Turnover by Promotion within the past 5 years
# Bar Plot with left values
plt.figure(figsize=(15,5))
sns.countplot(hr['promotion_last_5years'],
hue = hr['left'],
palette = ('steelblue', 'crimson'))
plt.tight_layout()
print('Turnover Rate if Promotion:', round(len(hr[(hr['promotion_last_5years']==1)&(hr['left']==1)])/len(hr[(hr['promotion_last_5years']==1)])*100,2),'%')
print('Turnover Rate if No Promotion:', round(len(hr[(hr['promotion_last_5years']==0)&(hr['left']==1)])/len(hr[(hr['promotion_last_5years']==0)])*100,2),'%')
Turnover Rate if Promotion: 5.96 %
Turnover Rate if No Promotion: 24.2 %
It appears that employees who received a promotion within the past 5 years are less likely to leave the company.
Number of Projects vs Average Monthly Hours
# Bar Plot with left values
plt.figure(figsize=(15,5))
sns.barplot(x=hr.average_montly_hours,
y=hr.number_project,
hue=hr.left,
palette = ('steelblue', 'crimson'))
plt.tight_layout()
# Scatter Plot with left values
plt.figure(figsize=(15,5))
sns.scatterplot(x=hr.average_montly_hours,
y=hr.number_project,
hue=hr.left,
palette = ('steelblue', 'crimson'))
plt.tight_layout()
It appears that:
- employees with more than 4 projects and working more than 217 hours tend to leave the company.
- employees with less than 3 projects and working less than 161 hours tend to leave the company.
Either a high or a low workload seems to push employees out.
Number of Projects vs Last Evaluation
# Bar Plot with left values
plt.figure(figsize=(15,5))
sns.barplot(x=hr.last_evaluation,
y=hr.number_project,
hue=hr.left,
palette = ('steelblue', 'crimson'))
plt.tight_layout()
# Scatter Plot with left values
plt.figure(figsize=(15,5))
sns.scatterplot(x=hr.last_evaluation,
y=hr.number_project,
hue=hr.left,
palette = ('steelblue', 'crimson'))
plt.tight_layout()
Employees with more than 4 projects seem to have higher evaluations, yet still leave the company. Employees with 2 projects and a low evaluation also leave the company.
Last Evaluation vs Average Monthly Hours
# Bar Plot with left values
plt.figure(figsize=(15,5))
sns.barplot(x=hr.average_montly_hours,
y=hr.last_evaluation,
hue=hr.left,
palette = ('steelblue', 'crimson'))
plt.tight_layout()
# Scatter Plot with left values
plt.figure(figsize=(15,5))
sns.scatterplot(x=hr.average_montly_hours,
y=hr.last_evaluation,
hue=hr.left,
palette = ('steelblue', 'crimson'))
plt.tight_layout()
Employees with high evaluation and working more than 217 hours tend to leave the company. Employees with evaluation around 0.5 and working between 125 and 161 hours tend to leave the company.
Last Evaluation vs Satisfaction Level
# Bar Plot with left values
plt.figure(figsize=(15,5))
sns.barplot(x=hr.satisfaction_level,
y=hr.last_evaluation,
hue=hr.left,
palette = ('steelblue', 'crimson'))
plt.tight_layout()
# Scatter Plot with left values
plt.figure(figsize=(15,5))
sns.scatterplot(x=hr.satisfaction_level,
y=hr.last_evaluation,
hue=hr.left,
palette = ('steelblue', 'crimson'))
plt.tight_layout()
Employees with satisfaction level below 0.11 tend to leave the company. Employees with satisfaction level between 0.35 and 0.46 and with last evaluation between 0.44 and 0.57 tend to leave the company. Employees with satisfaction level between 0.71 and 0.92 and with last evaluation between 0.76 and 1 tend to leave the company.
Encoding Categorical Variables
The variable salary
will be encoded using ordinal encoding and department
will be encoded using one-hot encoding.
# Encoding the variable salary
salary_dict = {'low':0,'medium':1,'high':2}
hr['salary_num'] = hr.salary.map(salary_dict)
hr.drop('salary', inplace=True, axis=1)
hr = hr.rename(index=str, columns={'salary_num':'salary'})
hr.head()
satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | department | salary | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | 0 |
1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | 1 |
2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | 1 |
3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | 0 |
4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | 0 |
def numerical_features(df):
    return df._get_numeric_data().columns

def categorical_features(df):
    numerical_columns = numerical_features(df)
    return list(set(df.columns) - set(numerical_columns))

def onehot_encode(df):
    new_df = df.get(numerical_features(df)).copy()
    for categorical_column in categorical_features(df):
        new_df = pd.concat([new_df,
                            pd.get_dummies(df[categorical_column],
                                           prefix=categorical_column)],
                           axis=1)
    return new_df
hr_encoded = onehot_encode(hr)
hr_encoded.head()
satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
df_desc(hr_encoded)
dtype | NAs | Numerical | Boolean | Categorical | |
---|---|---|---|---|---|
satisfaction_level | float64 | 0 | True | False | False |
last_evaluation | float64 | 0 | True | False | False |
number_project | int64 | 0 | True | False | False |
average_montly_hours | int64 | 0 | True | False | False |
time_spend_company | int64 | 0 | True | False | False |
Work_accident | int64 | 0 | False | True | False |
left | int64 | 0 | False | True | False |
promotion_last_5years | int64 | 0 | False | True | False |
salary | int64 | 0 | True | False | False |
department_IT | uint8 | 0 | False | True | False |
department_RandD | uint8 | 0 | False | True | False |
department_accounting | uint8 | 0 | False | True | False |
department_hr | uint8 | 0 | False | True | False |
department_management | uint8 | 0 | False | True | False |
department_marketing | uint8 | 0 | False | True | False |
department_product_mng | uint8 | 0 | False | True | False |
department_sales | uint8 | 0 | False | True | False |
department_support | uint8 | 0 | False | True | False |
department_technical | uint8 | 0 | False | True | False |
Scaling and Skewness
Numerical variables average_montly_hours, last_evaluation and satisfaction_level are scaled to remove any influence of their difference in value ranges on the model.
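For reference, the MinMaxScaler used below rescales each column to the [0, 1] range via x_scaled = (x - min) / (max - min), computed per column.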
hr_encoded[['satisfaction_level',
'last_evaluation',
'average_montly_hours'
]].hist(bins = 20, figsize = (15,10), color = 'steelblue')
plt.tight_layout()
hr_encoded[['satisfaction_level',
'last_evaluation',
'average_montly_hours'
]].describe()
satisfaction_level | last_evaluation | average_montly_hours | |
---|---|---|---|
count | 14999.000000 | 14999.000000 | 14999.000000 |
mean | 0.612834 | 0.716102 | 201.050337 |
std | 0.248631 | 0.171169 | 49.943099 |
min | 0.090000 | 0.360000 | 96.000000 |
25% | 0.440000 | 0.560000 | 156.000000 |
50% | 0.640000 | 0.720000 | 200.000000 |
75% | 0.820000 | 0.870000 | 245.000000 |
max | 1.000000 | 1.000000 | 310.000000 |
scaler = preprocessing.MinMaxScaler()
hr_scaled_part = scaler.fit_transform(hr_encoded[['satisfaction_level',
'last_evaluation',
'average_montly_hours']])
hr_scaled_part = pd.DataFrame(hr_scaled_part, columns=list(['satisfaction_level',
'last_evaluation',
'average_montly_hours']))
hr_scaled_part[['satisfaction_level',
'last_evaluation',
'average_montly_hours']].hist(bins = 20, figsize = (15,10), color = 'steelblue')
plt.tight_layout()
hr_scaled_part.describe()
satisfaction_level | last_evaluation | average_montly_hours | |
---|---|---|---|
count | 14999.000000 | 14999.000000 | 14999.000000 |
mean | 0.574542 | 0.556409 | 0.490889 |
std | 0.273220 | 0.267452 | 0.233379 |
min | 0.000000 | 0.000000 | 0.000000 |
25% | 0.384615 | 0.312500 | 0.280374 |
50% | 0.604396 | 0.562500 | 0.485981 |
75% | 0.802198 | 0.796875 | 0.696262 |
max | 1.000000 | 1.000000 | 1.000000 |
The skewness of the scaled variables is then fixed.
def feature_skewness(df):
    numeric_dtypes = ['int16', 'int32', 'int64',
                      'float16', 'float32', 'float64']
    numeric_features = []
    for i in df.columns:
        if df[i].dtype in numeric_dtypes:
            numeric_features.append(i)
    feature_skew = df[numeric_features].apply(
        lambda x: skew(x)).sort_values(ascending=False)
    return feature_skew, numeric_features

def fix_skewness(df):
    feature_skew, numeric_features = feature_skewness(df)
    high_skew = feature_skew[feature_skew > 0.5]
    for i in high_skew.index:
        df[i] = boxcox1p(df[i], boxcox_normmax(df[i] + 1))
    return df
hr_skewed_part = fix_skewness(hr_scaled_part)
hr_skewed_part.hist(bins = 20, figsize = (15,10), color = 'steelblue')
plt.tight_layout()
hr_skewed_part.describe()
satisfaction_level | last_evaluation | average_montly_hours | |
---|---|---|---|
count | 14999.000000 | 14999.000000 | 14999.000000 |
mean | 0.574542 | 0.556409 | 0.490889 |
std | 0.273220 | 0.267452 | 0.233379 |
min | 0.000000 | 0.000000 | 0.000000 |
25% | 0.384615 | 0.312500 | 0.280374 |
50% | 0.604396 | 0.562500 | 0.485981 |
75% | 0.802198 | 0.796875 | 0.696262 |
max | 1.000000 | 1.000000 | 1.000000 |
The resulting values are no different from the initial ones, showing that the data wasn't significantly skewed in the first place.
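A quick check confirms that no variable exceeds the 0.5 skewness threshold used by fix_skewness, which is why no transformation was applied:
# Recompute the skewness of each scaled variable (none exceeds the 0.5 threshold)
print(hr_skewed_part.apply(skew).round(3))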
hr_simple = hr_encoded.copy()
hr_simple.drop(['satisfaction_level',
'last_evaluation',
'average_montly_hours'], inplace=True, axis=1)
hr_ready = pd.DataFrame()
hr_simple.reset_index(drop=True, inplace=True)
hr_skewed_part.reset_index(drop=True, inplace=True)
hr_ready = pd.concat([hr_skewed_part,hr_simple], axis=1, sort=False, ignore_index=False)
# hr_ready['number_project'] = hr_ready['number_project'].astype('category').cat.codes
# hr_ready['time_spend_company'] = hr_ready['time_spend_company'].astype('category').cat.codes
hr_ready.head()
satisfaction_level | last_evaluation | average_montly_hours | number_project | time_spend_company | Work_accident | left | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.318681 | 0.265625 | 0.285047 | 2 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 0.780220 | 0.781250 | 0.775701 | 5 | 6 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 0.021978 | 0.812500 | 0.822430 | 7 | 4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 0.692308 | 0.796875 | 0.593458 | 5 | 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0.307692 | 0.250000 | 0.294393 | 2 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
df_desc(hr_ready)
dtype | NAs | Numerical | Boolean | Categorical | |
---|---|---|---|---|---|
satisfaction_level | float64 | 0 | True | False | False |
last_evaluation | float64 | 0 | True | False | False |
average_montly_hours | float64 | 0 | True | False | False |
number_project | int64 | 0 | True | False | False |
time_spend_company | int64 | 0 | True | False | False |
Work_accident | int64 | 0 | False | True | False |
left | int64 | 0 | False | True | False |
promotion_last_5years | int64 | 0 | False | True | False |
salary | int64 | 0 | True | False | False |
department_IT | uint8 | 0 | False | True | False |
department_RandD | uint8 | 0 | False | True | False |
department_accounting | uint8 | 0 | False | True | False |
department_hr | uint8 | 0 | False | True | False |
department_management | uint8 | 0 | False | True | False |
department_marketing | uint8 | 0 | False | True | False |
department_product_mng | uint8 | 0 | False | True | False |
department_sales | uint8 | 0 | False | True | False |
department_support | uint8 | 0 | False | True | False |
department_technical | uint8 | 0 | False | True | False |
hr_ready.describe()
satisfaction_level | last_evaluation | average_montly_hours | number_project | time_spend_company | Work_accident | left | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 |
mean | 0.574542 | 0.556409 | 0.490889 | 3.803054 | 3.498233 | 0.144610 | 0.238083 | 0.021268 | 0.594706 | 0.081805 | 0.052470 | 0.051137 | 0.049270 | 0.042003 | 0.057204 | 0.060137 | 0.276018 | 0.148610 | 0.181345 |
std | 0.273220 | 0.267452 | 0.233379 | 1.232592 | 1.460136 | 0.351719 | 0.425924 | 0.144281 | 0.637183 | 0.274077 | 0.222981 | 0.220284 | 0.216438 | 0.200602 | 0.232239 | 0.237749 | 0.447041 | 0.355715 | 0.385317 |
min | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.384615 | 0.312500 | 0.280374 | 3.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.604396 | 0.562500 | 0.485981 | 4.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.802198 | 0.796875 | 0.696262 | 5.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
max | 1.000000 | 1.000000 | 1.000000 | 7.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
hr_ready.hist(bins = 20, figsize = (15,10), color = 'steelblue')
plt.tight_layout()
The dataset is now ready to go through the baseline and feature engineering phases.
Training/Test Split
The model target left is defined, taking all other variables as features. The dataset is split into a training set and a test set, using a random split with a 70/30 ratio.
target = 'left'
split_ratio = 0.3
seed = 806
def split_dataset(df, target, split_ratio=0.3, seed=806):
    features = list(df)
    features.remove(target)
    X = df[features]
    y = df[[target]]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split_ratio, random_state=seed)
    return X, y, X_train, X_test, y_train, y_test
X, y, X_train, X_test, y_train, y_test = split_dataset(hr_ready, target, split_ratio, seed)
print('Features:',X.shape[0], 'items | ', X.shape[1],'columns')
print('Target:',y.shape[0], 'items | ', y.shape[1],'columns')
print('Features Train:',X_train.shape[0], 'items | ', X_train.shape[1],'columns')
print('Features Test:',X_test.shape[0], 'items | ', X_test.shape[1],'columns')
print('Target Train:',y_train.shape[0], 'items | ', y_train.shape[1],'columns')
print('Target Test:',y_test.shape[0], 'items | ', y_test.shape[1],'columns')
Features: 14999 items | 18 columns
Target: 14999 items | 1 columns
Features Train: 10499 items | 18 columns
Features Test: 4500 items | 18 columns
Target Train: 10499 items | 1 columns
Target Test: 4500 items | 1 columns
Baseline
A logistic regression algorithm will be used to develop this classification model.
lr = LogisticRegression(solver='lbfgs', max_iter = 300)
def lr_run(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train.values.ravel())
    y_pred = model.predict(X_test)
    acc_test = model.score(X_test, y_test)
    coefficients = pd.concat([pd.DataFrame(X_train.columns, columns=['Feature']),
                              pd.DataFrame(np.transpose(model.coef_), columns=['Coef.'])],
                             axis=1)
    coefficients.loc[-1] = ['intercept.', model.intercept_[0]]
    coefficients.index = coefficients.index + 1
    coefficients = coefficients.sort_index()
    print('Accuracy on test: {:.3f}'.format(acc_test))
    print()
    print(classification_report(y_test, y_pred))
    print('Confusion Matrix:')
    print(confusion_matrix(y_test, y_pred))
    print()
    print(coefficients)
lr_run(lr, X_train, y_train, X_test, y_test)
Accuracy on test: 0.797
precision recall f1-score support
0 0.82 0.94 0.88 3435
1 0.63 0.34 0.44 1065
micro avg 0.80 0.80 0.80 4500
macro avg 0.73 0.64 0.66 4500
weighted avg 0.78 0.80 0.77 4500
Confusion Matrix:
[[3220 215]
[ 700 365]]
Feature Coef.
0 intercept. 0.652320
1 satisfaction_level -3.616897
2 last_evaluation 0.440219
3 average_montly_hours 0.910047
4 number_project -0.285360
5 time_spend_company 0.245415
6 Work_accident -1.394756
7 promotion_last_5years -1.189347
8 salary -0.695794
9 department_IT -0.065202
10 department_RandD -0.474089
11 department_accounting 0.069995
12 department_hr 0.336695
13 department_management -0.352861
14 department_marketing 0.062124
15 department_product_mng 0.040313
16 department_sales 0.019114
17 department_support 0.230860
18 department_technical 0.147269
The ROC curve can then be plotted for the model.
def plot_roc(model, X_test, y_test):
    logit_roc_auc = roc_auc_score(y_test, model.predict(X_test))
    fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    plt.figure()
    plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0.0, 1.05])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC curve')
    plt.legend(loc="lower right")
    plt.show()
plot_roc(lr, X_test, y_test)
Feature Engineering
Cross Validation Strategy
The model is cross-validated using 10-fold cross-validation, returning the average accuracy.
Example based on the baseline:
def cv_acc(model, X_train, y_train, n_splits, seed):
    kfold = model_selection.KFold(n_splits=n_splits, random_state=seed)
    results = model_selection.cross_val_score(model, X_train, y_train.values.ravel(),
                                              cv=kfold, scoring='accuracy')
    print("10-fold cross validation average accuracy: %.3f" % results.mean())
    print()
    for i in range(len(results)):
        print('Iteration', '{:>2}'.format(i+1), '| Accuracy: {:.2f}'.format(results[i]))
cv_acc(lr, X_train, y_train, 10, seed)
10-fold cross validation average accuracy: 0.789
Iteration 1 | Accuracy: 0.79
Iteration 2 | Accuracy: 0.77
Iteration 3 | Accuracy: 0.78
Iteration 4 | Accuracy: 0.80
Iteration 5 | Accuracy: 0.81
Iteration 6 | Accuracy: 0.79
Iteration 7 | Accuracy: 0.79
Iteration 8 | Accuracy: 0.80
Iteration 9 | Accuracy: 0.79
Iteration 10 | Accuracy: 0.77
Feature Construction
The dataset is copied to add or modify features.
hr_fe = hr_ready.copy()
Bin Satisfaction Level
Based on the EDA, we can bin the Satisfaction Level into 6 bins.
bins = [-1, 0.03, 0.29, 0.41, 0.69, 0.92, 1]
labels=['(0.00, 0.11]','(0.11, 0.35]','(0.35, 0.46]','(0.46, 0.71]','(0.71, 0.92]','(0.92, 1.00]']
hr_fe['satisfaction_level_bin'] = pd.cut(hr_fe.satisfaction_level, bins, labels=labels)
hr_fe.satisfaction_level_bin.value_counts()
(0.71, 0.92] 4765
(0.46, 0.71] 4689
(0.35, 0.46] 2012
(0.92, 1.00] 1362
(0.11, 0.35] 1283
(0.00, 0.11] 888
Name: satisfaction_level_bin, dtype: int64
plt.figure(figsize=(15,5))
sns.countplot(x=hr_fe.satisfaction_level,
hue=hr_fe.satisfaction_level_bin,
palette = sns.color_palette("hls", 6),
dodge = False)
plt.tight_layout()
hr_fe_1 = hr_fe.copy()
hr_fe_1 = onehot_encode(hr_fe_1)
hr_fe_1.drop('satisfaction_level', inplace=True, axis=1)
X_fe_1, y_fe_1, X_fe_1_train, X_fe_1_test, y_fe_1_train, y_fe_1_test = split_dataset(hr_fe_1, target, split_ratio, seed)
cv_acc(lr, X_fe_1_train, y_fe_1_train, 10, seed)
print()
lr_run(lr, X_fe_1_train, y_fe_1_train, X_fe_1_test, y_fe_1_test)
10-fold cross validation average accuracy: 0.916
Iteration 1 | Accuracy: 0.92
Iteration 2 | Accuracy: 0.92
Iteration 3 | Accuracy: 0.90
Iteration 4 | Accuracy: 0.91
Iteration 5 | Accuracy: 0.93
Iteration 6 | Accuracy: 0.92
Iteration 7 | Accuracy: 0.92
Iteration 8 | Accuracy: 0.92
Iteration 9 | Accuracy: 0.91
Iteration 10 | Accuracy: 0.91
Accuracy on test: 0.914
precision recall f1-score support
0 0.94 0.95 0.94 3435
1 0.83 0.79 0.81 1065
micro avg 0.91 0.91 0.91 4500
macro avg 0.89 0.87 0.88 4500
weighted avg 0.91 0.91 0.91 4500
Confusion Matrix:
[[3266 169]
[ 220 845]]
Feature Coef.
0 intercept. -4.095534
1 last_evaluation 1.885761
2 average_montly_hours 1.871660
3 number_project -0.118954
4 time_spend_company 0.433360
5 Work_accident -1.199810
6 promotion_last_5years -1.053322
7 salary -0.727225
8 department_IT -0.042518
9 department_RandD -0.274695
10 department_accounting 0.042351
11 department_hr 0.587357
12 department_management -0.686777
13 department_marketing 0.032783
14 department_product_mng -0.083776
15 department_sales -0.012227
16 department_support 0.255890
17 department_technical 0.198702
18 satisfaction_level_bin_(0.00, 0.11] 5.196334
19 satisfaction_level_bin_(0.11, 0.35] -1.585870
20 satisfaction_level_bin_(0.35, 0.46] 3.741138
21 satisfaction_level_bin_(0.46, 0.71] -2.639350
22 satisfaction_level_bin_(0.71, 0.92] -0.409764
23 satisfaction_level_bin_(0.92, 1.00] -4.285400
Bin Last Evaluation
Based on the EDA, we can bin the Last Evaluation into 4 bins.
bins = [-1, 0.14, 0.34, 0.64, 1]
labels=['(0.00, 0.44]','(0.44, 0.57]','(0.57, 0.76]','(0.76, 1.00]']
hr_fe['last_evaluation_bin'] = pd.cut(hr_fe.last_evaluation, bins, labels=labels)
hr_fe_1['last_evaluation_bin'] = pd.cut(hr_fe_1.last_evaluation, bins, labels=labels)
hr_fe_1.last_evaluation_bin.value_counts()
(0.76, 1.00] 6458
(0.57, 0.76] 4279
(0.44, 0.57] 3817
(0.00, 0.44] 445
Name: last_evaluation_bin, dtype: int64
plt.figure(figsize=(15,5))
sns.countplot(x=hr_fe_1.last_evaluation,
hue=hr_fe_1.last_evaluation_bin,
palette = sns.color_palette("hls", 6),
dodge = False)
plt.tight_layout()
hr_fe_2 = hr_fe_1.copy()
hr_fe_2 = onehot_encode(hr_fe_2)
hr_fe_2.drop('last_evaluation', inplace=True, axis=1)
X_fe_2, y_fe_2, X_fe_2_train, X_fe_2_test, y_fe_2_train, y_fe_2_test = split_dataset(hr_fe_2, target, split_ratio, seed)
cv_acc(lr, X_fe_2_train, y_fe_2_train, 10, seed)
print()
lr_run(lr, X_fe_2_train, y_fe_2_train, X_fe_2_test, y_fe_2_test)
10-fold cross validation average accuracy: 0.935
Iteration 1 | Accuracy: 0.93
Iteration 2 | Accuracy: 0.93
Iteration 3 | Accuracy: 0.93
Iteration 4 | Accuracy: 0.93
Iteration 5 | Accuracy: 0.94
Iteration 6 | Accuracy: 0.93
Iteration 7 | Accuracy: 0.95
Iteration 8 | Accuracy: 0.94
Iteration 9 | Accuracy: 0.93
Iteration 10 | Accuracy: 0.93
Accuracy on test: 0.936
precision recall f1-score support
0 0.95 0.97 0.96 3435
1 0.88 0.84 0.86 1065
micro avg 0.94 0.94 0.94 4500
macro avg 0.92 0.90 0.91 4500
weighted avg 0.94 0.94 0.94 4500
Confusion Matrix:
[[3315 120]
[ 167 898]]
Feature Coef.
0 intercept. -5.603085
1 average_montly_hours 2.193703
2 number_project 0.058753
3 time_spend_company 0.462998
4 Work_accident -1.172361
5 promotion_last_5years -0.951366
6 salary -0.723623
7 department_IT -0.095618
8 department_RandD -0.213647
9 department_accounting 0.034969
10 department_hr 0.620110
11 department_management -0.743974
12 department_marketing 0.043686
13 department_product_mng -0.108800
14 department_sales -0.000440
15 department_support 0.229240
16 department_technical 0.235322
17 satisfaction_level_bin_(0.00, 0.11] 4.810074
18 satisfaction_level_bin_(0.11, 0.35] -1.521279
19 satisfaction_level_bin_(0.35, 0.46] 3.612606
20 satisfaction_level_bin_(0.46, 0.71] -2.507489
21 satisfaction_level_bin_(0.71, 0.92] -0.262796
22 satisfaction_level_bin_(0.92, 1.00] -4.130269
23 last_evaluation_bin_(0.00, 0.44] -3.358944
24 last_evaluation_bin_(0.44, 0.57] 2.066166
25 last_evaluation_bin_(0.57, 0.76] -0.739115
26 last_evaluation_bin_(0.76, 1.00] 2.032740
Bin Average Monthly Hours
Based on the EDA, we can bin the Average Monthly Hours into 7 bins.
bins = [-1, 0.14, 0.165, 0.304, 0.565, 0.840, 0.897, 1]
labels=['(0, 125]','(125, 131]','(131, 161]','(161, 216]','(216, 274]','(274, 287]','(287, 310]']
hr_fe['average_montly_hours_bin'] = pd.cut(hr_fe.average_montly_hours, bins, labels=labels)
hr_fe_2['average_montly_hours_bin'] = pd.cut(hr_fe_2.average_montly_hours, bins, labels=labels)
hr_fe_2.average_montly_hours_bin.value_counts()
(216, 274] 5573
(161, 216] 4290
(131, 161] 3588
(0, 125] 486
(274, 287] 379
(125, 131] 353
(287, 310] 330
Name: average_montly_hours_bin, dtype: int64
plt.figure(figsize=(15,5))
sns.countplot(x=hr_fe_2.average_montly_hours,
hue=hr_fe_2.average_montly_hours_bin,
palette = sns.color_palette("hls", 7),
dodge = False)
plt.tight_layout()
hr_fe_3 = hr_fe_2.copy()
hr_fe_3 = onehot_encode(hr_fe_3)
hr_fe_3.drop('average_montly_hours', inplace=True, axis=1)
X_fe_3, y_fe_3, X_fe_3_train, X_fe_3_test, y_fe_3_train, y_fe_3_test = split_dataset(hr_fe_3, target, split_ratio, seed)
cv_acc(lr, X_fe_3_train, y_fe_3_train, 10, seed)
print()
lr_run(lr, X_fe_3_train, y_fe_3_train, X_fe_3_test, y_fe_3_test)
10-fold cross validation average accuracy: 0.944
Iteration 1 | Accuracy: 0.95
Iteration 2 | Accuracy: 0.94
Iteration 3 | Accuracy: 0.94
Iteration 4 | Accuracy: 0.94
Iteration 5 | Accuracy: 0.95
Iteration 6 | Accuracy: 0.94
Iteration 7 | Accuracy: 0.95
Iteration 8 | Accuracy: 0.95
Iteration 9 | Accuracy: 0.94
Iteration 10 | Accuracy: 0.93
Accuracy on test: 0.945
precision recall f1-score support
0 0.96 0.97 0.96 3435
1 0.91 0.86 0.88 1065
micro avg 0.95 0.95 0.95 4500
macro avg 0.93 0.92 0.92 4500
weighted avg 0.94 0.95 0.94 4500
Confusion Matrix:
[[3340 95]
[ 151 914]]
Feature Coef.
0 intercept. -4.893750
1 number_project 0.162189
2 time_spend_company 0.452624
3 Work_accident -1.155125
4 promotion_last_5years -0.830508
5 salary -0.709974
6 department_IT -0.047511
7 department_RandD -0.287313
8 department_accounting 0.011035
9 department_hr 0.541995
10 department_management -0.624920
11 department_marketing -0.042389
12 department_product_mng -0.115029
13 department_sales 0.027964
14 department_support 0.267117
15 department_technical 0.281319
16 satisfaction_level_bin_(0.00, 0.11] 4.671246
17 satisfaction_level_bin_(0.11, 0.35] -1.420167
18 satisfaction_level_bin_(0.35, 0.46] 3.396279
19 satisfaction_level_bin_(0.46, 0.71] -2.383964
20 satisfaction_level_bin_(0.71, 0.92] -0.187715
21 satisfaction_level_bin_(0.92, 1.00] -4.063411
22 last_evaluation_bin_(0.00, 0.44] -3.199925
23 last_evaluation_bin_(0.44, 0.57] 1.857071
24 last_evaluation_bin_(0.57, 0.76] -0.570796
25 last_evaluation_bin_(0.76, 1.00] 1.925918
26 average_montly_hours_bin_(0, 125] -4.209333
27 average_montly_hours_bin_(125, 131] 0.993610
28 average_montly_hours_bin_(131, 161] 0.341974
29 average_montly_hours_bin_(161, 216] -2.012571
30 average_montly_hours_bin_(216, 274] 0.640337
31 average_montly_hours_bin_(274, 287] -0.078632
32 average_montly_hours_bin_(287, 310] 4.336883
Categorize Number of Projects
Based on the EDA, the Number of Projects can be categorized into 4 categories.
categ = {2:'too low', 3:'normal', 4:'normal', 5:'normal', 6:'too high', 7:'extreme'}
hr_fe['number_project_cat'] = hr_fe.number_project.map(categ)
hr_fe_3['number_project_cat'] = hr_fe_3.number_project.map(categ)
hr_fe_3.number_project_cat.value_counts()
normal 11181
too low 2388
too high 1174
extreme 256
Name: number_project_cat, dtype: int64
plt.figure(figsize=(15,5))
sns.countplot(x=hr_fe_3.number_project,
hue=hr_fe_3.number_project_cat,
palette = sns.color_palette("hls", 6),
dodge = False)
plt.tight_layout()
hr_fe_4 = hr_fe_3.copy()
hr_fe_4 = onehot_encode(hr_fe_4)
hr_fe_4.drop('number_project', inplace=True, axis=1)
X_fe_4, y_fe_4, X_fe_4_train, X_fe_4_test, y_fe_4_train, y_fe_4_test = split_dataset(hr_fe_4, target, split_ratio, seed)
cv_acc(lr, X_fe_4_train, y_fe_4_train, 10, seed)
print()
lr_run(lr, X_fe_4_train, y_fe_4_train, X_fe_4_test, y_fe_4_test)
10-fold cross validation average accuracy: 0.946
Iteration 1 | Accuracy: 0.94
Iteration 2 | Accuracy: 0.94
Iteration 3 | Accuracy: 0.94
Iteration 4 | Accuracy: 0.95
Iteration 5 | Accuracy: 0.96
Iteration 6 | Accuracy: 0.94
Iteration 7 | Accuracy: 0.96
Iteration 8 | Accuracy: 0.96
Iteration 9 | Accuracy: 0.94
Iteration 10 | Accuracy: 0.94
Accuracy on test: 0.950
precision recall f1-score support
0 0.96 0.97 0.97 3435
1 0.90 0.88 0.89 1065
micro avg 0.95 0.95 0.95 4500
macro avg 0.93 0.93 0.93 4500
weighted avg 0.95 0.95 0.95 4500
Confusion Matrix:
[[3333 102]
[ 125 940]]
Feature Coef.
0 intercept. -2.841608
1 time_spend_company 0.507726
2 Work_accident -1.202049
3 promotion_last_5years -0.838310
4 salary -0.709500
5 department_IT -0.031415
6 department_RandD -0.186888
7 department_accounting 0.006734
8 department_hr 0.623972
9 department_management -0.707609
10 department_marketing -0.097518
11 department_product_mng -0.202634
12 department_sales 0.000202
13 department_support 0.327073
14 department_technical 0.270137
15 satisfaction_level_bin_(0.00, 0.11] 4.831770
16 satisfaction_level_bin_(0.11, 0.35] -1.270925
17 satisfaction_level_bin_(0.35, 0.46] 2.425227
18 satisfaction_level_bin_(0.46, 0.71] -2.250052
19 satisfaction_level_bin_(0.71, 0.92] 0.187860
20 satisfaction_level_bin_(0.92, 1.00] -3.921827
21 last_evaluation_bin_(0.00, 0.44] -2.975239
22 last_evaluation_bin_(0.44, 0.57] 1.473132
23 last_evaluation_bin_(0.57, 0.76] -0.497719
24 last_evaluation_bin_(0.76, 1.00] 2.001879
25 average_montly_hours_bin_(0, 125] -4.037480
26 average_montly_hours_bin_(125, 131] 0.708746
27 average_montly_hours_bin_(131, 161] 0.080656
28 average_montly_hours_bin_(161, 216] -1.803412
29 average_montly_hours_bin_(216, 274] 0.736018
30 average_montly_hours_bin_(274, 287] -0.077538
31 average_montly_hours_bin_(287, 310] 4.395065
32 number_project_cat_extreme 3.873512
33 number_project_cat_normal -2.153648
34 number_project_cat_too high -1.859092
35 number_project_cat_too low 0.141282
Categorize Time Spent in Company
Based on the EDA, the Time Spent in Company can be categorized into 4 categories, related to the rate of departure.
categ = {2:'low departure', 3:'high departure', 4:'high departure', 5:'very high departure', 6:'high departure', 7:'no departure', 8:'no departure', 10:'no departure'}
hr_fe['time_spend_company_cat'] = hr_fe.time_spend_company.map(categ)
hr_fe_4['time_spend_company_cat'] = hr_fe_4.time_spend_company.map(categ)
hr_fe_4.time_spend_company_cat.value_counts()
high departure 9718
low departure 3244
very high departure 1473
no departure 564
Name: time_spend_company_cat, dtype: int64
plt.figure(figsize=(15,5))
sns.countplot(x=hr_fe_4.time_spend_company,
hue=hr_fe_4.time_spend_company_cat,
palette = sns.color_palette("hls", 7),
dodge = False)
plt.tight_layout()
hr_fe_5 = hr_fe_4.copy()
hr_fe_5 = onehot_encode(hr_fe_5)
hr_fe_5.drop('time_spend_company', inplace=True, axis=1)
X_fe_5, y_fe_5, X_fe_5_train, X_fe_5_test, y_fe_5_train, y_fe_5_test = split_dataset(hr_fe_5, target, split_ratio, seed)
cv_acc(lr, X_fe_5_train, y_fe_5_train, 10, seed)
print()
lr_run(lr, X_fe_5_train, y_fe_5_train, X_fe_5_test, y_fe_5_test)
10-fold cross validation average accuracy: 0.956
Iteration 1 | Accuracy: 0.95
Iteration 2 | Accuracy: 0.94
Iteration 3 | Accuracy: 0.95
Iteration 4 | Accuracy: 0.96
Iteration 5 | Accuracy: 0.96
Iteration 6 | Accuracy: 0.96
Iteration 7 | Accuracy: 0.96
Iteration 8 | Accuracy: 0.96
Iteration 9 | Accuracy: 0.96
Iteration 10 | Accuracy: 0.95
Accuracy on test: 0.956
precision recall f1-score support
0 0.96 0.98 0.97 3435
1 0.93 0.88 0.91 1065
micro avg 0.96 0.96 0.96 4500
macro avg 0.95 0.93 0.94 4500
weighted avg 0.96 0.96 0.96 4500
Confusion Matrix:
[[3362 73]
[ 124 941]]
Feature Coef.
0 intercept. -1.288513
1 Work_accident -1.210856
2 promotion_last_5years -0.454837
3 salary -0.672500
4 department_IT -0.235474
5 department_RandD -0.395298
6 department_accounting -0.029671
7 department_hr 0.510471
8 department_management -0.297698
9 department_marketing 0.143294
10 department_product_mng -0.227719
11 department_sales 0.001829
12 department_support 0.350340
13 department_technical 0.179556
14 satisfaction_level_bin_(0.00, 0.11] 5.056556
15 satisfaction_level_bin_(0.11, 0.35] -1.622557
16 satisfaction_level_bin_(0.35, 0.46] 2.196762
17 satisfaction_level_bin_(0.46, 0.71] -1.869052
18 satisfaction_level_bin_(0.71, 0.92] 0.005086
19 satisfaction_level_bin_(0.92, 1.00] -3.767165
20 last_evaluation_bin_(0.00, 0.44] -2.659269
21 last_evaluation_bin_(0.44, 0.57] 1.342359
22 last_evaluation_bin_(0.57, 0.76] -0.295577
23 last_evaluation_bin_(0.76, 1.00] 1.612117
24 average_montly_hours_bin_(0, 125] -4.064773
25 average_montly_hours_bin_(125, 131] 0.755401
26 average_montly_hours_bin_(131, 161] 0.213408
27 average_montly_hours_bin_(161, 216] -1.742236
28 average_montly_hours_bin_(216, 274] 0.596288
29 average_montly_hours_bin_(274, 287] 0.086931
30 average_montly_hours_bin_(287, 310] 4.154612
31 number_project_cat_extreme 3.500939
32 number_project_cat_normal -1.998452
33 number_project_cat_too high -1.610600
34 number_project_cat_too low 0.107742
35 time_spend_company_cat_high departure 0.357731
36 time_spend_company_cat_low departure -1.265409
37 time_spend_company_cat_no departure -2.014069
38 time_spend_company_cat_very high departure 2.921377
Cluster by Number of Projects and Average Monthly Hours
Based on the EDA, the employees can be clustered by Workload, based on the Number of Projects and Average Monthly Hours, into 5 categories.
def workload_cluster(row):
    if row['average_montly_hours_bin'] == '(0, 125]':
        return 'very low'
    if (row['number_project'] <= 2) and (row['average_montly_hours_bin'] in ['(125, 131]', '(131, 161]']):
        return 'low'
    if (row['number_project'] >= 4) and (row['average_montly_hours_bin'] in ['(216, 274]', '(274, 287]']):
        return 'high'
    if row['average_montly_hours_bin'] in ['(287, 310]']:
        return 'extreme'
    return 'normal'
hr_fe['workload'] = hr_fe.apply(lambda row: workload_cluster(row), axis=1)
hr_fe.workload.value_counts()
normal 8265
high 4209
low 1709
very low 486
extreme 330
Name: workload, dtype: int64
plt.figure(figsize=(15,5))
sns.scatterplot(x=hr_fe.average_montly_hours,
y=hr_fe.number_project,
hue=hr_fe.workload,
palette = sns.color_palette("hls", 5))
plt.tight_layout()
hr_fe_6 = hr_fe.copy()
hr_fe_6 = onehot_encode(hr_fe_6)
hr_fe_6.drop('satisfaction_level', inplace=True, axis=1)
hr_fe_6.drop('last_evaluation', inplace=True, axis=1)
hr_fe_6.drop('average_montly_hours', inplace=True, axis=1)
hr_fe_6.drop('number_project', inplace=True, axis=1)
hr_fe_6.drop('time_spend_company', inplace=True, axis=1)
X_fe_6, y_fe_6, X_fe_6_train, X_fe_6_test, y_fe_6_train, y_fe_6_test = split_dataset(hr_fe_6, target, split_ratio, seed)
cv_acc(lr, X_fe_6_train, y_fe_6_train, 10, seed)
print()
lr_run(lr, X_fe_6_train, y_fe_6_train, X_fe_6_test, y_fe_6_test)
10-fold cross validation average accuracy: 0.958
Iteration 1 | Accuracy: 0.95
Iteration 2 | Accuracy: 0.94
Iteration 3 | Accuracy: 0.95
Iteration 4 | Accuracy: 0.96
Iteration 5 | Accuracy: 0.97
Iteration 6 | Accuracy: 0.96
Iteration 7 | Accuracy: 0.96
Iteration 8 | Accuracy: 0.97
Iteration 9 | Accuracy: 0.96
Iteration 10 | Accuracy: 0.95
Accuracy on test: 0.959
precision recall f1-score support
0 0.96 0.98 0.97 3435
1 0.94 0.88 0.91 1065
micro avg 0.96 0.96 0.96 4500
macro avg 0.95 0.93 0.94 4500
weighted avg 0.96 0.96 0.96 4500
Confusion Matrix:
[[3377 58]
[ 125 940]]
Feature Coef.
0 intercept. -0.766901
1 Work_accident -1.173201
2 promotion_last_5years -0.439302
3 salary -0.662271
4 department_IT -0.297131
5 department_RandD -0.447797
6 department_accounting 0.000741
7 department_hr 0.458777
8 department_management -0.164455
9 department_marketing 0.048457
10 department_product_mng -0.187570
11 department_sales 0.034650
12 department_support 0.347782
13 department_technical 0.205697
14 workload_extreme 2.350234
15 workload_high 0.104323
16 workload_low 1.471144
17 workload_normal -1.650481
18 workload_very low -2.276069
19 time_spend_company_cat_high departure 0.289188
20 time_spend_company_cat_low departure -1.110152
21 time_spend_company_cat_no departure -1.839473
22 time_spend_company_cat_very high departure 2.659588
23 average_montly_hours_bin_(0, 125] -2.276069
24 average_montly_hours_bin_(125, 131] 0.579145
25 average_montly_hours_bin_(131, 161] 0.135179
26 average_montly_hours_bin_(161, 216] -0.624238
27 average_montly_hours_bin_(216, 274] -0.014268
28 average_montly_hours_bin_(274, 287] -0.150833
29 average_montly_hours_bin_(287, 310] 2.350234
30 last_evaluation_bin_(0.00, 0.44] -2.541263
31 last_evaluation_bin_(0.44, 0.57] 1.163123
32 last_evaluation_bin_(0.57, 0.76] -0.196012
33 last_evaluation_bin_(0.76, 1.00] 1.573302
34 number_project_cat_extreme 3.487829
35 number_project_cat_normal -1.632124
36 number_project_cat_too high -1.306443
37 number_project_cat_too low -0.550112
38 satisfaction_level_bin_(0.00, 0.11] 4.765909
39 satisfaction_level_bin_(0.11, 0.35] -1.400822
40 satisfaction_level_bin_(0.35, 0.46] 1.637670
41 satisfaction_level_bin_(0.46, 0.71] -1.633800
42 satisfaction_level_bin_(0.71, 0.92] 0.169115
43 satisfaction_level_bin_(0.92, 1.00] -3.538921
Cluster by Number of Projects and Last Evaluation
Based on the EDA, the employees can be clustered by Project Performance, based on the Number of Projects and Last Evaluation, into 4 categories.
def project_performance_cluster(row):
    if row['last_evaluation_bin'] == '(0.00, 0.44]':
        return 'very low'
    if (row['number_project'] <= 2) and (row['last_evaluation_bin'] in ['(0.44, 0.57]']):
        return 'low'
    if (row['number_project'] >= 4) and (row['last_evaluation_bin'] in ['(0.76, 1.00]']):
        return 'high'
    return 'normal'
hr_fe['project_performance'] = hr_fe.apply(lambda row: project_performance_cluster(row), axis=1)
hr_fe.project_performance.value_counts()
normal 8245
high 4589
low 1720
very low 445
Name: project_performance, dtype: int64
plt.figure(figsize=(15,5))
sns.scatterplot(x=hr_fe.last_evaluation,
y=hr_fe.number_project,
hue=hr_fe.project_performance,
palette = sns.color_palette("hls", 4))
plt.tight_layout()
hr_fe_7 = hr_fe.copy()
hr_fe_7 = onehot_encode(hr_fe_7)
hr_fe_7.drop('satisfaction_level', inplace=True, axis=1)
hr_fe_7.drop('last_evaluation', inplace=True, axis=1)
hr_fe_7.drop('average_montly_hours', inplace=True, axis=1)
hr_fe_7.drop('number_project', inplace=True, axis=1)
hr_fe_7.drop('time_spend_company', inplace=True, axis=1)
X_fe_7, y_fe_7, X_fe_7_train, X_fe_7_test, y_fe_7_train, y_fe_7_test = split_dataset(hr_fe_7, target, split_ratio, seed)
cv_acc(lr, X_fe_7_train, y_fe_7_train, 10, seed)
print()
lr_run(lr, X_fe_7_train, y_fe_7_train, X_fe_7_test, y_fe_7_test)
10-fold cross validation average accuracy: 0.960
Iteration 1 | Accuracy: 0.96
Iteration 2 | Accuracy: 0.95
Iteration 3 | Accuracy: 0.96
Iteration 4 | Accuracy: 0.96
Iteration 5 | Accuracy: 0.97
Iteration 6 | Accuracy: 0.96
Iteration 7 | Accuracy: 0.96
Iteration 8 | Accuracy: 0.96
Iteration 9 | Accuracy: 0.96
Iteration 10 | Accuracy: 0.95
Accuracy on test: 0.958
precision recall f1-score support
0 0.96 0.98 0.97 3435
1 0.93 0.88 0.91 1065
micro avg 0.96 0.96 0.96 4500
macro avg 0.95 0.93 0.94 4500
weighted avg 0.96 0.96 0.96 4500
Confusion Matrix:
[[3368 67]
[ 123 942]]
Feature Coef.
0 intercept. -0.304227
1 Work_accident -1.223252
2 promotion_last_5years -0.510657
3 salary -0.639244
4 department_IT -0.308566
5 department_RandD -0.427170
6 department_accounting -0.113093
7 department_hr 0.396920
8 department_management -0.146822
9 department_marketing 0.112515
10 department_product_mng -0.140261
11 department_sales 0.023193
12 department_support 0.369251
13 department_technical 0.234417
14 workload_extreme 2.405614
15 workload_high -0.219293
16 workload_low 1.266443
17 workload_normal -1.301441
18 workload_very low -2.150941
19 time_spend_company_cat_high departure 0.299676
20 time_spend_company_cat_low departure -1.065227
21 time_spend_company_cat_no departure -1.923445
22 time_spend_company_cat_very high departure 2.689379
23 average_montly_hours_bin_(0, 125] -2.150941
24 average_montly_hours_bin_(125, 131] 0.310555
25 average_montly_hours_bin_(131, 161] -0.065836
26 average_montly_hours_bin_(161, 216] -0.782745
27 average_montly_hours_bin_(216, 274] 0.117264
28 average_montly_hours_bin_(274, 287] 0.166471
29 average_montly_hours_bin_(287, 310] 2.405614
30 last_evaluation_bin_(0.00, 0.44] -1.472612
31 last_evaluation_bin_(0.44, 0.57] 0.498097
32 last_evaluation_bin_(0.57, 0.76] 0.165295
33 last_evaluation_bin_(0.76, 1.00] 0.809603
34 project_performance_high 0.246351
35 project_performance_low 2.100090
36 project_performance_normal -0.873446
37 project_performance_very low -1.472612
38 number_project_cat_extreme 3.644086
39 number_project_cat_normal -1.391861
40 number_project_cat_too high -1.002857
41 number_project_cat_too low -1.248984
42 satisfaction_level_bin_(0.00, 0.11] 4.679780
43 satisfaction_level_bin_(0.11, 0.35] -1.331063
44 satisfaction_level_bin_(0.35, 0.46] 1.205874
45 satisfaction_level_bin_(0.46, 0.71] -1.514709
46 satisfaction_level_bin_(0.71, 0.92] 0.241973
47 satisfaction_level_bin_(0.92, 1.00] -3.281472
Cluster by Last Evaluation and Average Monthly Hours
Based on the EDA, the employees can be clustered by Efficiency, based on the Last Evaluation and the Average Monthly Hours, into 4 categories.
def efficiency_cluster(row):
    if row['last_evaluation_bin'] == '(0.00, 0.44]':
        return 'very low'
    if row['average_montly_hours_bin'] in ['(0, 125]']:
        return 'very low'
    if (row['last_evaluation_bin'] in ['(0.44, 0.57]']) and (row['average_montly_hours_bin'] in ['(125, 131]', '(131, 161]']):
        return 'low'
    if (row['last_evaluation_bin'] in ['(0.76, 1.00]']) and (row['average_montly_hours_bin'] in ['(216, 274]', '(274, 287]', '(287, 310]']):
        return 'high'
    return 'normal'
hr_fe['efficiency'] = hr_fe.apply(lambda row: efficiency_cluster(row), axis=1)
hr_fe.efficiency.value_counts()
normal 8436
high 3719
low 1994
very low 850
Name: efficiency, dtype: int64
plt.figure(figsize=(15,5))
sns.scatterplot(x=hr_fe.average_montly_hours,
y=hr_fe.last_evaluation,
hue=hr_fe.efficiency,
palette = sns.color_palette("hls", 4))
plt.tight_layout()
hr_fe_8 = hr_fe.copy()
hr_fe_8 = onehot_encode(hr_fe_8)
# Drop the original continuous variables, now replaced by their binned versions
hr_fe_8.drop(['satisfaction_level', 'last_evaluation', 'average_montly_hours',
              'number_project', 'time_spend_company'], axis=1, inplace=True)
X_fe_8, y_fe_8, X_fe_8_train, X_fe_8_test, y_fe_8_train, y_fe_8_test = split_dataset(hr_fe_8, target, split_ratio, seed)
cv_acc(lr, X_fe_8_train, y_fe_8_train, 10, seed)
print()
lr_run(lr, X_fe_8_train, y_fe_8_train, X_fe_8_test, y_fe_8_test)
10-fold cross validation average accuracy: 0.960
Iteration 1 | Accuracy: 0.96
Iteration 2 | Accuracy: 0.95
Iteration 3 | Accuracy: 0.96
Iteration 4 | Accuracy: 0.96
Iteration 5 | Accuracy: 0.97
Iteration 6 | Accuracy: 0.96
Iteration 7 | Accuracy: 0.96
Iteration 8 | Accuracy: 0.96
Iteration 9 | Accuracy: 0.96
Iteration 10 | Accuracy: 0.95
Accuracy on test: 0.960
precision recall f1-score support
0 0.96 0.98 0.97 3435
1 0.94 0.88 0.91 1065
micro avg 0.96 0.96 0.96 4500
macro avg 0.95 0.93 0.94 4500
weighted avg 0.96 0.96 0.96 4500
Confusion Matrix:
[[3377 58]
[ 124 941]]
Feature Coef.
0 intercept. 0.110311
1 Work_accident -1.234954
2 promotion_last_5years -0.581323
3 salary -0.653274
4 department_IT -0.319980
5 department_RandD -0.444509
6 department_accounting -0.118532
7 department_hr 0.420489
8 department_management -0.156571
9 department_marketing 0.097993
10 department_product_mng -0.141090
11 department_sales 0.034649
12 department_support 0.373496
13 department_technical 0.253988
14 workload_extreme 2.378101
15 workload_high -0.310825
16 workload_low 0.498770
17 workload_normal -1.224138
18 workload_very low -1.341976
19 time_spend_company_cat_high departure 0.291586
20 time_spend_company_cat_low departure -1.079268
21 time_spend_company_cat_no departure -1.877955
22 time_spend_company_cat_very high departure 2.665570
23 average_montly_hours_bin_(0, 125] -1.341976
24 average_montly_hours_bin_(125, 131] 0.122886
25 average_montly_hours_bin_(131, 161] -0.304675
26 average_montly_hours_bin_(161, 216] -0.571684
27 average_montly_hours_bin_(216, 274] -0.235870
28 average_montly_hours_bin_(274, 287] -0.046850
29 average_montly_hours_bin_(287, 310] 2.378101
30 last_evaluation_bin_(0.00, 0.44] -0.789941
31 last_evaluation_bin_(0.44, 0.57] 0.075586
32 last_evaluation_bin_(0.57, 0.76] 0.324766
33 last_evaluation_bin_(0.76, 1.00] 0.389521
34 project_performance_high 0.179345
35 project_performance_low 1.434834
36 project_performance_normal -0.824305
37 project_performance_very low -0.789941
38 number_project_cat_extreme 3.511605
39 number_project_cat_normal -1.541114
40 number_project_cat_too high -1.025555
41 number_project_cat_too low -0.945003
42 satisfaction_level_bin_(0.00, 0.11] 4.547035
43 satisfaction_level_bin_(0.11, 0.35] -1.297088
44 satisfaction_level_bin_(0.35, 0.46] 1.129828
45 satisfaction_level_bin_(0.46, 0.71] -1.465960
46 satisfaction_level_bin_(0.71, 0.92] 0.271559
47 satisfaction_level_bin_(0.92, 1.00] -3.185440
48 efficiency_high 0.730403
49 efficiency_low 1.642109
50 efficiency_normal -0.284602
51 efficiency_very low -2.087977
Cluster by Last Evaluation and Satisfaction Level
Based on the EDA, the employees can be clustered into 7 Attitude categories, using the Last Evaluation and the Satisfaction Level.
# Cluster employees into 7 attitude categories from their evaluation and satisfaction bins
def attitude_cluster(row):
    if row['last_evaluation_bin'] == '(0.00, 0.44]':
        return 'low performance'
    if row['satisfaction_level_bin'] in ['(0.92, 1.00]']:
        return 'very happy'
    if (row['last_evaluation_bin'] in ['(0.76, 1.00]']) and (row['satisfaction_level_bin'] in ['(0.71, 0.92]']):
        return 'happy and high performance'
    if (row['last_evaluation_bin'] in ['(0.44, 0.57]']) and (row['satisfaction_level_bin'] in ['(0.35, 0.46]']):
        return 'unhappy and low performance'
    if row['satisfaction_level_bin'] in ['(0.00, 0.11]']:
        return 'very unhappy'
    if row['satisfaction_level_bin'] in ['(0.11, 0.35]', '(0.35, 0.46]']:
        return 'unhappy'
    return 'normal'

hr_fe['attitude'] = hr_fe.apply(attitude_cluster, axis=1)
hr_fe.attitude.value_counts()
normal 6668
happy and high performance 2553
unhappy and low performance 1635
unhappy 1474
very happy 1336
very unhappy 888
low performance 445
Name: attitude, dtype: int64
plt.figure(figsize=(15,5))
sns.scatterplot(x=hr_fe.satisfaction_level,
y=hr_fe.last_evaluation,
hue=hr_fe.attitude,
palette = sns.color_palette("hls", 7))
plt.tight_layout()
hr_fe_9 = hr_fe.copy()
hr_fe_9 = onehot_encode(hr_fe_9)
# Drop the original continuous variables, now replaced by their binned versions
hr_fe_9.drop(['satisfaction_level', 'last_evaluation', 'average_montly_hours',
              'number_project', 'time_spend_company'], axis=1, inplace=True)
X_fe_9, y_fe_9, X_fe_9_train, X_fe_9_test, y_fe_9_train, y_fe_9_test = split_dataset(hr_fe_9, target, split_ratio, seed)
cv_acc(lr, X_fe_9_train, y_fe_9_train, 10, seed)
print()
lr_run(lr, X_fe_9_train, y_fe_9_train, X_fe_9_test, y_fe_9_test)
10-fold cross validation average accuracy: 0.964
Iteration 1 | Accuracy: 0.96
Iteration 2 | Accuracy: 0.95
Iteration 3 | Accuracy: 0.96
Iteration 4 | Accuracy: 0.97
Iteration 5 | Accuracy: 0.97
Iteration 6 | Accuracy: 0.97
Iteration 7 | Accuracy: 0.97
Iteration 8 | Accuracy: 0.96
Iteration 9 | Accuracy: 0.96
Iteration 10 | Accuracy: 0.96
Accuracy on test: 0.964
precision recall f1-score support
0 0.97 0.98 0.98 3435
1 0.94 0.90 0.92 1065
micro avg 0.96 0.96 0.96 4500
macro avg 0.96 0.94 0.95 4500
weighted avg 0.96 0.96 0.96 4500
Confusion Matrix:
[[3379 56]
[ 108 957]]
Feature Coef.
0 intercept. 0.155602
1 Work_accident -1.143174
2 promotion_last_5years -0.597843
3 salary -0.652169
4 department_IT -0.355823
5 department_RandD -0.441450
6 department_accounting -0.095917
7 department_hr 0.447624
8 department_management -0.163427
9 department_marketing 0.093569
10 department_product_mng -0.171977
11 department_sales 0.034884
12 department_support 0.366134
13 department_technical 0.288470
14 workload_extreme 2.282292
15 workload_high -0.227062
16 workload_low 0.383963
17 workload_normal -1.056297
18 workload_very low -1.380807
19 time_spend_company_cat_high departure 0.258945
20 time_spend_company_cat_low departure -1.051914
21 time_spend_company_cat_no departure -1.793281
22 time_spend_company_cat_very high departure 2.588338
23 average_montly_hours_bin_(0, 125] -1.380807
24 average_montly_hours_bin_(125, 131] 0.087262
25 average_montly_hours_bin_(131, 161] -0.283742
26 average_montly_hours_bin_(161, 216] -0.697097
27 average_montly_hours_bin_(216, 274] -0.167288
28 average_montly_hours_bin_(274, 287] 0.161469
29 average_montly_hours_bin_(287, 310] 2.282292
30 last_evaluation_bin_(0.00, 0.44] -0.693405
31 last_evaluation_bin_(0.44, 0.57] 0.121621
32 last_evaluation_bin_(0.57, 0.76] 0.748939
33 last_evaluation_bin_(0.76, 1.00] -0.175065
34 attitude_happy and high performance 0.807235
35 attitude_low performance -0.693405
36 attitude_normal -1.154903
37 attitude_unhappy -1.048762
38 attitude_unhappy and low performance 1.378160
39 attitude_very happy -1.938233
40 attitude_very unhappy 2.651998
41 project_performance_high 0.419734
42 project_performance_low 0.855019
43 project_performance_normal -0.579259
44 project_performance_very low -0.693405
45 number_project_cat_extreme 3.202322
46 number_project_cat_normal -1.490338
47 number_project_cat_too high -0.978540
48 number_project_cat_too low -0.731354
49 satisfaction_level_bin_(0.00, 0.11] 2.651998
50 satisfaction_level_bin_(0.11, 0.35] -0.522950
51 satisfaction_level_bin_(0.35, 0.46] 0.552467
52 satisfaction_level_bin_(0.46, 0.71] -0.530850
53 satisfaction_level_bin_(0.71, 0.92] -0.200445
54 satisfaction_level_bin_(0.92, 1.00] -1.948131
55 efficiency_high 0.631356
56 efficiency_low 1.559049
57 efficiency_normal -0.140877
58 efficiency_very low -2.047439
Removing Unbinned Variables and Encoding New Features
The original continuous variables, which have now been replaced by their binned versions, are removed from the dataset, and the new features are one-hot encoded.
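For reference, a minimal sketch of what the onehot_encode helper used throughout this notebook is assumed to do; the actual definition appears earlier in the document.
# Hypothetical reconstruction of the onehot_encode helper (an assumption):
# one-hot encode every remaining object or category column via pd.get_dummies.
def onehot_encode(df):
    cat_cols = df.select_dtypes(include=['object', 'category']).columns
    return pd.get_dummies(df, columns=list(cat_cols))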
hr_fe_encoded = onehot_encode(hr_fe)
# Drop the original continuous variables, now replaced by their binned versions
hr_fe_encoded.drop(['satisfaction_level', 'last_evaluation', 'average_montly_hours',
                    'number_project', 'time_spend_company'], axis=1, inplace=True)
df_desc(hr_fe_encoded)
dtype | NAs | Numerical | Boolean | Categorical | |
---|---|---|---|---|---|
Work_accident | int64 | 0 | False | True | False |
left | int64 | 0 | False | True | False |
promotion_last_5years | int64 | 0 | False | True | False |
salary | int64 | 0 | True | False | False |
department_IT | uint8 | 0 | False | True | False |
department_RandD | uint8 | 0 | False | True | False |
department_accounting | uint8 | 0 | False | True | False |
department_hr | uint8 | 0 | False | True | False |
department_management | uint8 | 0 | False | True | False |
department_marketing | uint8 | 0 | False | True | False |
department_product_mng | uint8 | 0 | False | True | False |
department_sales | uint8 | 0 | False | True | False |
department_support | uint8 | 0 | False | True | False |
department_technical | uint8 | 0 | False | True | False |
workload_extreme | uint8 | 0 | False | True | False |
workload_high | uint8 | 0 | False | True | False |
workload_low | uint8 | 0 | False | True | False |
workload_normal | uint8 | 0 | False | True | False |
workload_very low | uint8 | 0 | False | True | False |
time_spend_company_cat_high departure | uint8 | 0 | False | True | False |
time_spend_company_cat_low departure | uint8 | 0 | False | True | False |
time_spend_company_cat_no departure | uint8 | 0 | False | True | False |
time_spend_company_cat_very high departure | uint8 | 0 | False | True | False |
average_montly_hours_bin_(0, 125] | uint8 | 0 | False | True | False |
average_montly_hours_bin_(125, 131] | uint8 | 0 | False | True | False |
average_montly_hours_bin_(131, 161] | uint8 | 0 | False | True | False |
average_montly_hours_bin_(161, 216] | uint8 | 0 | False | True | False |
average_montly_hours_bin_(216, 274] | uint8 | 0 | False | True | False |
average_montly_hours_bin_(274, 287] | uint8 | 0 | False | True | False |
average_montly_hours_bin_(287, 310] | uint8 | 0 | False | True | False |
last_evaluation_bin_(0.00, 0.44] | uint8 | 0 | False | True | False |
last_evaluation_bin_(0.44, 0.57] | uint8 | 0 | False | True | False |
last_evaluation_bin_(0.57, 0.76] | uint8 | 0 | False | True | False |
last_evaluation_bin_(0.76, 1.00] | uint8 | 0 | False | True | False |
attitude_happy and high performance | uint8 | 0 | False | True | False |
attitude_low performance | uint8 | 0 | False | True | False |
attitude_normal | uint8 | 0 | False | True | False |
attitude_unhappy | uint8 | 0 | False | True | False |
attitude_unhappy and low performance | uint8 | 0 | False | True | False |
attitude_very happy | uint8 | 0 | False | True | False |
attitude_very unhappy | uint8 | 0 | False | True | False |
project_performance_high | uint8 | 0 | False | True | False |
project_performance_low | uint8 | 0 | False | True | False |
project_performance_normal | uint8 | 0 | False | True | False |
project_performance_very low | uint8 | 0 | False | True | False |
number_project_cat_extreme | uint8 | 0 | False | True | False |
number_project_cat_normal | uint8 | 0 | False | True | False |
number_project_cat_too high | uint8 | 0 | False | True | False |
number_project_cat_too low | uint8 | 0 | False | True | False |
satisfaction_level_bin_(0.00, 0.11] | uint8 | 0 | False | True | False |
satisfaction_level_bin_(0.11, 0.35] | uint8 | 0 | False | True | False |
satisfaction_level_bin_(0.35, 0.46] | uint8 | 0 | False | True | False |
satisfaction_level_bin_(0.46, 0.71] | uint8 | 0 | False | True | False |
satisfaction_level_bin_(0.71, 0.92] | uint8 | 0 | False | True | False |
satisfaction_level_bin_(0.92, 1.00] | uint8 | 0 | False | True | False |
efficiency_high | uint8 | 0 | False | True | False |
efficiency_low | uint8 | 0 | False | True | False |
efficiency_normal | uint8 | 0 | False | True | False |
efficiency_very low | uint8 | 0 | False | True | False |
Feature Selection
The dataset resulting from the Feature Engineering phase contains 58 features, with a model reaching an accuracy of 0.964. The Feature Selection phase aims to reduce the number of variables the model relies on.
X_fe_encoded, y_fe_encoded, X_fe_encoded_train, X_fe_encoded_test, y_fe_encoded_train, y_fe_encoded_test = split_dataset(hr_fe_encoded, target, split_ratio, seed)
cv_acc(lr, X_fe_encoded_train, y_fe_encoded_train, 10, seed)
print()
lr_run(lr, X_fe_encoded_train, y_fe_encoded_train, X_fe_encoded_test, y_fe_encoded_test)
10-fold cross validation average accuracy: 0.964
Iteration 1 | Accuracy: 0.96
Iteration 2 | Accuracy: 0.95
Iteration 3 | Accuracy: 0.96
Iteration 4 | Accuracy: 0.97
Iteration 5 | Accuracy: 0.97
Iteration 6 | Accuracy: 0.97
Iteration 7 | Accuracy: 0.97
Iteration 8 | Accuracy: 0.96
Iteration 9 | Accuracy: 0.96
Iteration 10 | Accuracy: 0.96
Accuracy on test: 0.964
precision recall f1-score support
0 0.97 0.98 0.98 3435
1 0.94 0.90 0.92 1065
micro avg 0.96 0.96 0.96 4500
macro avg 0.96 0.94 0.95 4500
weighted avg 0.96 0.96 0.96 4500
Confusion Matrix:
[[3379 56]
[ 108 957]]
Feature Coef.
0 intercept. 0.155602
1 Work_accident -1.143174
2 promotion_last_5years -0.597843
3 salary -0.652169
4 department_IT -0.355823
5 department_RandD -0.441450
6 department_accounting -0.095917
7 department_hr 0.447624
8 department_management -0.163427
9 department_marketing 0.093569
10 department_product_mng -0.171977
11 department_sales 0.034884
12 department_support 0.366134
13 department_technical 0.288470
14 workload_extreme 2.282292
15 workload_high -0.227062
16 workload_low 0.383963
17 workload_normal -1.056297
18 workload_very low -1.380807
19 time_spend_company_cat_high departure 0.258945
20 time_spend_company_cat_low departure -1.051914
21 time_spend_company_cat_no departure -1.793281
22 time_spend_company_cat_very high departure 2.588338
23 average_montly_hours_bin_(0, 125] -1.380807
24 average_montly_hours_bin_(125, 131] 0.087262
25 average_montly_hours_bin_(131, 161] -0.283742
26 average_montly_hours_bin_(161, 216] -0.697097
27 average_montly_hours_bin_(216, 274] -0.167288
28 average_montly_hours_bin_(274, 287] 0.161469
29 average_montly_hours_bin_(287, 310] 2.282292
30 last_evaluation_bin_(0.00, 0.44] -0.693405
31 last_evaluation_bin_(0.44, 0.57] 0.121621
32 last_evaluation_bin_(0.57, 0.76] 0.748939
33 last_evaluation_bin_(0.76, 1.00] -0.175065
34 attitude_happy and high performance 0.807235
35 attitude_low performance -0.693405
36 attitude_normal -1.154903
37 attitude_unhappy -1.048762
38 attitude_unhappy and low performance 1.378160
39 attitude_very happy -1.938233
40 attitude_very unhappy 2.651998
41 project_performance_high 0.419734
42 project_performance_low 0.855019
43 project_performance_normal -0.579259
44 project_performance_very low -0.693405
45 number_project_cat_extreme 3.202322
46 number_project_cat_normal -1.490338
47 number_project_cat_too high -0.978540
48 number_project_cat_too low -0.731354
49 satisfaction_level_bin_(0.00, 0.11] 2.651998
50 satisfaction_level_bin_(0.11, 0.35] -0.522950
51 satisfaction_level_bin_(0.35, 0.46] 0.552467
52 satisfaction_level_bin_(0.46, 0.71] -0.530850
53 satisfaction_level_bin_(0.71, 0.92] -0.200445
54 satisfaction_level_bin_(0.92, 1.00] -1.948131
55 efficiency_high 0.631356
56 efficiency_low 1.559049
57 efficiency_normal -0.140877
58 efficiency_very low -2.047439
plot_roc(lr, X_fe_encoded_test, y_fe_encoded_test)
The Recursive Feature Elimination (RFE) method is used to select the most relevant features for the model: it repeatedly fits the estimator and prunes the weakest feature until only the requested number remains.
# Evaluate RFE for every possible number of features, keeping the test accuracy
accuracies = pd.DataFrame(columns=['features', 'accuracy', 'cols'])
print('Iterations:')
for i in range(1, len(X_fe_encoded.columns) + 1):
    logreg = LogisticRegression(solver='lbfgs', max_iter=250)
    rfe = RFE(logreg, n_features_to_select=i)
    rfe = rfe.fit(X_fe_encoded, y_fe_encoded.values.ravel())
    cols_rfe = list(X_fe_encoded.loc[:, rfe.support_])  # names of the i selected features
    X_rfe_sel = X_fe_encoded_train[cols_rfe]
    X_rfe_test_sel = X_fe_encoded_test[cols_rfe]
    logreg.fit(X_rfe_sel, y_fe_encoded_train.values.ravel())
    acc_test = logreg.score(X_rfe_test_sel, y_fe_encoded_test)
    accuracies.loc[i] = [i, acc_test, cols_rfe]
    print(i, end=' ')
Iterations:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
# Line plot: test accuracy against the number of selected features
plt.figure(figsize=(15,5))
sns.lineplot(x=accuracies['features'],
             y=accuracies['accuracy'],
             color='steelblue')
plt.tight_layout()
accuracies.nlargest(10, 'accuracy')
features | accuracy | cols | |
---|---|---|---|
14 | 14 | 0.967111 | [workload_extreme, workload_normal, time_spend... |
18 | 18 | 0.966889 | [workload_extreme, workload_normal, time_spend... |
19 | 19 | 0.966667 | [workload_extreme, workload_normal, workload_v... |
20 | 20 | 0.966667 | [Work_accident, workload_extreme, workload_nor... |
15 | 15 | 0.966444 | [workload_extreme, workload_normal, time_spend... |
16 | 16 | 0.966444 | [workload_extreme, workload_normal, time_spend... |
17 | 17 | 0.966444 | [workload_extreme, workload_normal, time_spend... |
22 | 22 | 0.965556 | [Work_accident, workload_extreme, workload_nor... |
21 | 21 | 0.965333 | [Work_accident, workload_extreme, workload_nor... |
29 | 29 | 0.964889 | [Work_accident, promotion_last_5years, workloa... |
The best model is found with 14 features, for an accuracy of 0.967.
# Suppress convergence warnings during the repeated fits
from sklearn.exceptions import ConvergenceWarning
import warnings
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)

# Re-run RFE with the best number of features found above
features_rfe = list(hr_fe_encoded)
features_rfe.remove(target)
X_rfe = hr_fe_encoded.loc[:, features_rfe]
y_rfe = hr_fe_encoded.loc[:, target]
logreg = LogisticRegression(solver='lbfgs', max_iter=250)
best_n = accuracies.nlargest(1, 'accuracy').features.values.ravel()[0]
rfe = RFE(logreg, n_features_to_select=best_n)
rfe = rfe.fit(X_rfe, y_rfe)
print(sum(rfe.support_), 'selected features:')
for i in list(X_rfe.loc[:, rfe.support_]):
    print(i)
14 selected features:
workload_extreme
workload_normal
time_spend_company_cat_no departure
time_spend_company_cat_very high departure
average_montly_hours_bin_(287, 310]
attitude_normal
attitude_unhappy
attitude_very happy
attitude_very unhappy
number_project_cat_extreme
satisfaction_level_bin_(0.00, 0.11]
satisfaction_level_bin_(0.92, 1.00]
efficiency_low
efficiency_very low
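As a hedged alternative, not part of the original run: RFECV, imported at the top of this notebook, could pick the number of features automatically by cross-validating each subset size instead of scoring them manually as above.
# Sketch: let RFECV choose the feature count via 10-fold cross validation.
logreg = LogisticRegression(solver='lbfgs', max_iter=250)
rfecv = RFECV(estimator=logreg, step=1, cv=10, scoring='accuracy')
rfecv = rfecv.fit(X_rfe, y_rfe.values.ravel())
print('Optimal number of features:', rfecv.n_features_)
print('Selected features:', list(X_rfe.loc[:, rfecv.support_]))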
Final Metric
Initial Dataset
A final model is tested with the 14 selected features.
cols = list(X_rfe.loc[:, rfe.support_]) + [target]
hr_sel = hr_fe_encoded[cols]
X_sel, y_sel, X_sel_train, X_sel_test, y_sel_train, y_sel_test = split_dataset(hr_sel, target, split_ratio, seed)
cv_acc(lr, X_sel_train, y_sel_train, 10, seed)
print()
lr_run(lr, X_sel_train, y_sel_train, X_sel_test, y_sel_test)
10-fold cross validation average accuracy: 0.964
Iteration 1 | Accuracy: 0.96
Iteration 2 | Accuracy: 0.95
Iteration 3 | Accuracy: 0.96
Iteration 4 | Accuracy: 0.97
Iteration 5 | Accuracy: 0.97
Iteration 6 | Accuracy: 0.96
Iteration 7 | Accuracy: 0.98
Iteration 8 | Accuracy: 0.97
Iteration 9 | Accuracy: 0.96
Iteration 10 | Accuracy: 0.95
Accuracy on test: 0.967
precision recall f1-score support
0 0.96 1.00 0.98 3435
1 0.98 0.88 0.93 1065
micro avg 0.97 0.97 0.97 4500
macro avg 0.97 0.94 0.95 4500
weighted avg 0.97 0.97 0.97 4500
Confusion Matrix:
[[3419 16]
[ 132 933]]
Feature Coef.
0 intercept. -0.375952
1 workload_extreme 2.245120
2 workload_normal -2.183617
3 time_spend_company_cat_no departure -2.016273
4 time_spend_company_cat_very high departure 2.403233
5 average_montly_hours_bin_(287, 310] 2.245120
6 attitude_normal -2.936478
7 attitude_unhappy -2.295986
8 attitude_very happy -2.423485
9 attitude_very unhappy 2.582233
10 number_project_cat_extreme 3.987229
11 satisfaction_level_bin_(0.00, 0.11] 2.582233
12 satisfaction_level_bin_(0.92, 1.00] -2.448987
13 efficiency_low 3.295257
14 efficiency_very low -4.182167
plot_roc(lr, X_sel_test, y_sel_test)
The model reaches an accuracy of 0.967. The recall for employees who left the company now reaches 88%, which will allow the management to better identify which employees have a high probability of leaving.
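If recall on leavers matters more than overall accuracy, the decision threshold of the fitted model can also be tuned. A minimal sketch, assuming lr is still fitted on X_sel_train from the run above:
# Lower thresholds flag more employees as likely leavers: recall rises while
# precision falls. Pick the trade-off matching the cost of each type of error.
from sklearn.metrics import precision_score, recall_score
probs = lr.predict_proba(X_sel_test)[:, 1]  # predicted probability of leaving
for threshold in [0.3, 0.4, 0.5]:
    preds = (probs >= threshold).astype(int)
    print('Threshold %.1f | Precision: %.2f | Recall: %.2f'
          % (threshold, precision_score(y_sel_test, preds), recall_score(y_sel_test, preds)))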
Over-Sampling with SMOTE
To ensure the model is not biased by the imbalanced proportions of the variable left, the dataset is enriched with synthetic samples using the Synthetic Minority Oversampling Technique (SMOTE). Only the train set is over-sampled, so that it is not polluted by the test set.
# Install the imbalanced-learn package with this command:
# conda install -c conda-forge imbalanced-learn
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=0)  # renamed from `os`, which shadowed the os module
X_smote, y_smote, X_smote_train, X_smote_test, y_smote_train, y_smote_test = split_dataset(hr_fe_encoded, target, split_ratio, seed)
columns = X_smote_train.columns
# fit_sample generates synthetic minority samples from the train set only
os_data_X, os_data_y = smote.fit_sample(X_smote_train, y_smote_train.values.ravel())
os_data_X = pd.DataFrame(data=os_data_X, columns=columns)
os_data_y = pd.DataFrame(data=os_data_y, columns=['left'])
# Check the class balance of the over-sampled data
print("Length of oversampled data is ", len(os_data_X))
print("Number of 'stayed' in oversampled data", len(os_data_y[os_data_y['left'] == 0]))
print("Number of 'left'", len(os_data_y[os_data_y['left'] == 1]))
print("Proportion of 'stayed' data in oversampled data is ", len(os_data_y[os_data_y['left'] == 0])/len(os_data_X))
print("Proportion of 'left' data in oversampled data is ", len(os_data_y[os_data_y['left'] == 1])/len(os_data_X))
Length of oversampled data is 15986
Number of 'stayed' in oversampled data 7993
Number of 'left' 7993
Proportion of 'stayed' data in oversampled data is 0.5
Proportion of 'left' data in oversampled data is 0.5
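One caveat worth noting: cross-validating directly on the over-sampled set lets synthetic points leak between folds, which can flatter the scores. A hedged sketch of a leak-free alternative, applying SMOTE inside each fold with imblearn's Pipeline:
# Sketch (an assumption, not the original workflow): SMOTE is re-fitted on
# each training fold only, so validation folds contain no synthetic samples.
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import cross_val_score
pipe = ImbPipeline([('smote', SMOTE(random_state=0)),
                    ('lr', LogisticRegression(solver='lbfgs', max_iter=250))])
scores = cross_val_score(pipe, X_smote_train, y_smote_train.values.ravel(), cv=10)
print('CV accuracy with fold-wise SMOTE: %.3f' % scores.mean())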
cv_acc(lr, os_data_X, os_data_y, 10, seed)
print()
lr_run(lr, os_data_X, os_data_y, X_smote_test, y_smote_test)
10-fold cross validation average accuracy: 0.963
Iteration 1 | Accuracy: 0.96
Iteration 2 | Accuracy: 0.95
Iteration 3 | Accuracy: 0.96
Iteration 4 | Accuracy: 0.95
Iteration 5 | Accuracy: 0.96
Iteration 6 | Accuracy: 0.95
Iteration 7 | Accuracy: 0.97
Iteration 8 | Accuracy: 0.97
Iteration 9 | Accuracy: 0.97
Iteration 10 | Accuracy: 0.98
Accuracy on test: 0.957
precision recall f1-score support
0 0.98 0.97 0.97 3435
1 0.90 0.92 0.91 1065
micro avg 0.96 0.96 0.96 4500
macro avg 0.94 0.95 0.94 4500
weighted avg 0.96 0.96 0.96 4500
Confusion Matrix:
[[3321 114]
[ 81 984]]
Feature Coef.
0 intercept. 17.979973
1 Work_accident -1.539814
2 promotion_last_5years -0.873542
3 salary -0.822126
4 department_IT -3.552008
5 department_RandD -3.639017
6 department_accounting -3.180176
7 department_hr -2.710086
8 department_management -3.366177
9 department_marketing -3.216680
10 department_product_mng -3.248802
11 department_sales -3.247247
12 department_support -2.799619
13 department_technical -2.786227
14 workload_extreme 1.227670
15 workload_high -1.370041
16 workload_low -0.331572
17 workload_normal -2.301065
18 workload_very low -2.294460
19 time_spend_company_cat_high departure -2.357160
20 time_spend_company_cat_low departure -3.895673
21 time_spend_company_cat_no departure -4.218755
22 time_spend_company_cat_very high departure 0.009826
23 average_montly_hours_bin_(0, 125] -2.294460
24 average_montly_hours_bin_(125, 131] -1.624161
25 average_montly_hours_bin_(131, 161] -2.149702
26 average_montly_hours_bin_(161, 216] -2.380130
27 average_montly_hours_bin_(216, 274] -2.063447
28 average_montly_hours_bin_(274, 287] -1.517476
29 average_montly_hours_bin_(287, 310] 1.227670
30 last_evaluation_bin_(0.00, 0.44] -1.594999
31 last_evaluation_bin_(0.44, 0.57] -2.717905
32 last_evaluation_bin_(0.57, 0.76] -2.229557
33 last_evaluation_bin_(0.76, 1.00] -3.532066
34 attitude_happy and high performance 0.667297
35 attitude_low performance -1.594999
36 attitude_normal -1.273042
37 attitude_unhappy -1.750187
38 attitude_unhappy and low performance 0.763155
39 attitude_very happy -3.146609
40 attitude_very unhappy 1.794301
41 project_performance_high -0.362548
42 project_performance_low 0.353313
43 project_performance_normal -1.388324
44 project_performance_very low -1.594999
45 number_project_cat_extreme 1.971823
46 number_project_cat_normal -3.373701
47 number_project_cat_too high -2.785609
48 number_project_cat_too low -3.104655
49 satisfaction_level_bin_(0.00, 0.11] 1.794301
50 satisfaction_level_bin_(0.11, 0.35] -2.184787
51 satisfaction_level_bin_(0.35, 0.46] -1.079961
52 satisfaction_level_bin_(0.46, 0.71] -2.742574
53 satisfaction_level_bin_(0.71, 0.92] -2.136022
54 satisfaction_level_bin_(0.92, 1.00] -3.187208
55 efficiency_high 0.772209
56 efficiency_low 1.066479
57 efficiency_normal -0.387084
58 efficiency_very low -3.708358
The accuracy is consistent with that obtained on the initial dataset. The RFE algorithm is again used to find the most relevant features.
# Evaluate RFE on the over-sampled train set for every number of features
accuracies_smote = pd.DataFrame(columns=['features', 'accuracy', 'cols'])
print('Iterations:')
for i in range(1, len(os_data_X.columns) + 1):
    logreg = LogisticRegression(solver='lbfgs', max_iter=250)
    rfe_smote = RFE(logreg, n_features_to_select=i)
    rfe_smote = rfe_smote.fit(os_data_X, os_data_y.values.ravel())
    cols_rfe_smote = list(os_data_X.loc[:, rfe_smote.support_])  # the i selected feature names
    os_data_X_sel = os_data_X[cols_rfe_smote]
    X_smote_test_sel = X_smote_test[cols_rfe_smote]
    logreg.fit(os_data_X_sel, os_data_y.values.ravel())
    acc_test = logreg.score(X_smote_test_sel, y_smote_test)  # accuracy on the untouched test set
    accuracies_smote.loc[i] = [i, acc_test, cols_rfe_smote]
    print(i, end=' ')
Iterations:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
# Line plot: test accuracy against the number of selected features
plt.figure(figsize=(15,5))
sns.lineplot(x=accuracies_smote['features'],
             y=accuracies_smote['accuracy'],
             color='steelblue')
plt.tight_layout()
accuracies_smote.nlargest(10, 'accuracy')
features | accuracy | cols | |
---|---|---|---|
50 | 50 | 0.957111 | [Work_accident, promotion_last_5years, salary,... |
51 | 51 | 0.956889 | [Work_accident, promotion_last_5years, salary,... |
52 | 52 | 0.956889 | [Work_accident, promotion_last_5years, salary,... |
53 | 53 | 0.956889 | [Work_accident, promotion_last_5years, salary,... |
54 | 54 | 0.956889 | [Work_accident, promotion_last_5years, salary,... |
55 | 55 | 0.956889 | [Work_accident, promotion_last_5years, salary,... |
56 | 56 | 0.956667 | [Work_accident, promotion_last_5years, salary,... |
57 | 57 | 0.956667 | [Work_accident, promotion_last_5years, salary,... |
58 | 58 | 0.956667 | [Work_accident, promotion_last_5years, salary,... |
49 | 49 | 0.953333 | [Work_accident, promotion_last_5years, departm... |
The best model is found with 50 features, for an accuracy of 0.957.
# Re-run RFE on the over-sampled data with the best number of features found above
logreg = LogisticRegression(solver='lbfgs', max_iter=250)
best_n_smote = accuracies_smote.nlargest(1, 'accuracy').features.values.ravel()[0]
rfe_smote = RFE(logreg, n_features_to_select=best_n_smote)
rfe_smote = rfe_smote.fit(os_data_X, os_data_y.values.ravel())
print(sum(rfe_smote.support_), 'selected features:')
for i in list(os_data_X.loc[:, rfe_smote.support_]):
    print(i)
50 selected features:
Work_accident
promotion_last_5years
salary
department_IT
department_RandD
department_accounting
department_hr
department_management
department_marketing
department_product_mng
department_sales
department_support
department_technical
workload_extreme
workload_high
workload_normal
workload_very low
time_spend_company_cat_high departure
time_spend_company_cat_low departure
time_spend_company_cat_no departure
average_montly_hours_bin_(0, 125]
average_montly_hours_bin_(125, 131]
average_montly_hours_bin_(131, 161]
average_montly_hours_bin_(161, 216]
average_montly_hours_bin_(216, 274]
average_montly_hours_bin_(274, 287]
average_montly_hours_bin_(287, 310]
last_evaluation_bin_(0.00, 0.44]
last_evaluation_bin_(0.44, 0.57]
last_evaluation_bin_(0.57, 0.76]
last_evaluation_bin_(0.76, 1.00]
attitude_low performance
attitude_normal
attitude_unhappy
attitude_very happy
attitude_very unhappy
project_performance_normal
project_performance_very low
number_project_cat_extreme
number_project_cat_normal
number_project_cat_too high
number_project_cat_too low
satisfaction_level_bin_(0.00, 0.11]
satisfaction_level_bin_(0.11, 0.35]
satisfaction_level_bin_(0.46, 0.71]
satisfaction_level_bin_(0.71, 0.92]
satisfaction_level_bin_(0.92, 1.00]
efficiency_high
efficiency_low
efficiency_very low
Far more columns are selected than with the initial dataset. The model is nevertheless built to check its metrics.
cols_smote = list(os_data_X.loc[:, rfe_smote.support_])
os_data_X_sel = os_data_X[cols_smote]
X_smote_test_sel = X_smote_test[cols_smote]
cv_acc(lr, os_data_X_sel, os_data_y, 10, seed)
print()
lr_run(lr, os_data_X_sel, os_data_y, X_smote_test_sel, y_smote_test)
10-fold cross validation average accuracy: 0.962
Iteration 1 | Accuracy: 0.96
Iteration 2 | Accuracy: 0.95
Iteration 3 | Accuracy: 0.96
Iteration 4 | Accuracy: 0.95
Iteration 5 | Accuracy: 0.96
Iteration 6 | Accuracy: 0.95
Iteration 7 | Accuracy: 0.97
Iteration 8 | Accuracy: 0.97
Iteration 9 | Accuracy: 0.97
Iteration 10 | Accuracy: 0.97
Accuracy on test: 0.957
precision recall f1-score support
0 0.98 0.97 0.97 3435
1 0.90 0.92 0.91 1065
micro avg 0.96 0.96 0.96 4500
macro avg 0.94 0.95 0.94 4500
weighted avg 0.96 0.96 0.96 4500
Confusion Matrix:
[[3322 113]
[ 80 985]]
Feature Coef.
0 intercept. 17.376456
1 Work_accident -1.540492
2 promotion_last_5years -0.877758
3 salary -0.827350
4 department_IT -3.555270
5 department_RandD -3.633989
6 department_accounting -3.178568
7 department_hr -2.706043
8 department_management -3.374921
9 department_marketing -3.230675
10 department_product_mng -3.256803
11 department_sales -3.244566
12 department_support -2.799412
13 department_technical -2.786458
14 workload_extreme 1.262675
15 workload_high -1.314907
16 workload_normal -2.114032
17 workload_very low -2.160452
18 time_spend_company_cat_high departure -2.366082
19 time_spend_company_cat_low departure -3.898360
20 time_spend_company_cat_no departure -4.220065
21 average_montly_hours_bin_(0, 125] -2.160452
22 average_montly_hours_bin_(125, 131] -1.748431
23 average_montly_hours_bin_(131, 161] -2.290330
24 average_montly_hours_bin_(161, 216] -2.490469
25 average_montly_hours_bin_(216, 274] -2.054918
26 average_montly_hours_bin_(274, 287] -1.536961
27 average_montly_hours_bin_(287, 310] 1.262675
28 last_evaluation_bin_(0.00, 0.44] -1.708380
29 last_evaluation_bin_(0.44, 0.57] -2.715554
30 last_evaluation_bin_(0.57, 0.76] -2.290356
31 last_evaluation_bin_(0.76, 1.00] -3.664226
32 attitude_low performance -1.708380
33 attitude_normal -1.835470
34 attitude_unhappy -2.648314
35 attitude_very happy -3.123197
36 attitude_very unhappy 1.817110
37 project_performance_normal -1.234341
38 project_performance_very low -1.708380
39 number_project_cat_extreme 1.958954
40 number_project_cat_normal -3.429596
41 number_project_cat_too high -2.845605
42 number_project_cat_too low -3.041480
43 satisfaction_level_bin_(0.00, 0.11] 1.817110
44 satisfaction_level_bin_(0.11, 0.35] -1.231504
45 satisfaction_level_bin_(0.46, 0.71] -2.114606
46 satisfaction_level_bin_(0.71, 0.92] -1.447229
47 satisfaction_level_bin_(0.92, 1.00] -3.145082
48 efficiency_high 1.124227
49 efficiency_low 1.463810
50 efficiency_very low -3.650808
plot_roc(lr, X_smote_test_sel, y_smote_test)
The model run on the over-sampled dataset has an accuracy very close to that of the model run on the original dataset. We can conclude that the imbalanced proportions of the target in our dataset did not introduce bias into our model.
However, the fact that more variables are necessary to achieve an equivalent accuracy suggests that the feature selection might have been biased by our feature construction. The binned features were built to fit this specific data, which allowed us to remove many features from the initial model. But this technique makes the model over-fit the data, reducing its chances of achieving the same accuracy on a new dataset. This might explain why Recursive Feature Elimination does not select the same features:
list_features = pd.DataFrame({'Initial':sorted(list(accuracies.loc[accuracies.features == 14]['cols'])[0]),
'SMOTE':sorted(list(accuracies_smote.loc[accuracies_smote.features == 14]['cols'])[0])})
list_features
Initial | SMOTE | |
---|---|---|
0 | attitude_normal | attitude_normal |
1 | attitude_unhappy | attitude_very happy |
2 | attitude_very happy | average_montly_hours_bin_(287, 310] |
3 | attitude_very unhappy | department_IT |
4 | average_montly_hours_bin_(287, 310] | department_RandD |
5 | efficiency_low | department_accounting |
6 | efficiency_very low | department_management |
7 | number_project_cat_extreme | department_marketing |
8 | satisfaction_level_bin_(0.00, 0.11] | department_sales |
9 | satisfaction_level_bin_(0.92, 1.00] | department_technical |
10 | time_spend_company_cat_no departure | efficiency_very low |
11 | time_spend_company_cat_very high departure | satisfaction_level_bin_(0.00, 0.11] |
12 | workload_extreme | satisfaction_level_bin_(0.92, 1.00] |
13 | workload_normal | time_spend_company_cat_no departure |
However, the model can be tested using the exact same selection of columns as the one produced by the initial RFE.
cols_1 = cols.copy()
cols_1.remove('left')
print(cols_1)
os_data_X_sel_1 = os_data_X[cols_1]
X_smote_test_sel_1 = X_smote_test[cols_1]
['workload_extreme', 'workload_normal', 'time_spend_company_cat_no departure', 'time_spend_company_cat_very high departure', 'average_montly_hours_bin_(287, 310]', 'attitude_normal', 'attitude_unhappy', 'attitude_very happy', 'attitude_very unhappy', 'number_project_cat_extreme', 'satisfaction_level_bin_(0.00, 0.11]', 'satisfaction_level_bin_(0.92, 1.00]', 'efficiency_low', 'efficiency_very low']
cv_acc(lr, os_data_X_sel_1, os_data_y, 10, seed)
print()
lr_run(lr, os_data_X_sel_1, os_data_y, X_smote_test_sel_1, y_smote_test)
10-fold cross validation average accuracy: 0.937
Iteration 1 | Accuracy: 0.95
Iteration 2 | Accuracy: 0.93
Iteration 3 | Accuracy: 0.95
Iteration 4 | Accuracy: 0.93
Iteration 5 | Accuracy: 0.94
Iteration 6 | Accuracy: 0.94
Iteration 7 | Accuracy: 0.94
Iteration 8 | Accuracy: 0.93
Iteration 9 | Accuracy: 0.93
Iteration 10 | Accuracy: 0.93
Accuracy on test: 0.938
precision recall f1-score support
0 0.98 0.94 0.96 3435
1 0.83 0.94 0.88 1065
micro avg 0.94 0.94 0.94 4500
macro avg 0.90 0.94 0.92 4500
weighted avg 0.94 0.94 0.94 4500
Confusion Matrix:
[[3224 211]
[ 69 996]]
Feature Coef.
0 intercept. 0.698509
1 workload_extreme 1.946942
2 workload_normal -1.989443
3 time_spend_company_cat_no departure -2.883204
4 time_spend_company_cat_very high departure 2.039475
5 average_montly_hours_bin_(287, 310] 1.946942
6 attitude_normal -2.701934
7 attitude_unhappy -1.732064
8 attitude_very happy -2.828182
9 attitude_very unhappy 2.587483
10 number_project_cat_extreme 3.936500
11 satisfaction_level_bin_(0.00, 0.11] 2.587483
12 satisfaction_level_bin_(0.92, 1.00] -2.878639
13 efficiency_low 2.863472
14 efficiency_very low -4.951222
The resulting accuracy is still high, which confirms that the initial model was not biased by the imbalance of the dataset.
The high accuracy is nonetheless driven by the binned features tailored to this dataset. While they work very well on this data, they might not on another dataset. The features should instead be built using a standard binning approach, which would not fit this data as closely but would be adaptable to any dataset; see the sketch below. That solution would be recommended if the model had to run in production.
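A minimal sketch of such a standard binning approach (an illustration, not the original implementation), using quantile-based bins that adapt to whatever data they are fitted on:
# Quantile bins depend only on each column's distribution, not on attrition
# patterns hand-read from this dataset, so they transfer better to new data.
hr_std = hr.copy()
for col in ['satisfaction_level', 'last_evaluation', 'average_montly_hours']:
    hr_std[col + '_qbin'] = pd.qcut(hr_std[col], q=4, labels=['q1', 'q2', 'q3', 'q4'])
hr_std[['satisfaction_level', 'satisfaction_level_qbin']].head()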