Employee Churn Prediction

THE PROBLEM

Employee turnover (also known as "employee churn") is a costly problem for companies. A study by the Center for American Progress found that companies typically pay about one-fifth of an employee’s salary to replace that employee, and the cost can increase significantly when executives or the highest-paid employees are replaced. The cost comes from the time spent finding and interviewing a replacement, sign-on bonuses, and the loss of productivity for several months while the new employee gets accustomed to the role.

APPROACH

1. Importing essential libraries:
- pandas for importing the data and manipulating it (e.g. deleting columns).
- NumPy for mathematical calculations on arrays.
- plotly, matplotlib, and seaborn for plotting graphs, including interactive ones.
- imblearn, whose over_sampling module provides the SMOTE class, which is useful for balancing a highly imbalanced dataset.
- scikit-learn (sklearn), the most important machine learning package here, which helps with evaluating results, splitting the dataset, tuning the hyper-parameters, preprocessing the data, etc.
2. The data is loaded from a CSV file using the pandas package. It has 27 columns, of which 8 are of data type object and the rest are integers. Findings from the Exploratory Data Analysis:
- Many histograms are tail-heavy; several distributions are right-skewed (e.g. MonthlyIncome, DistanceFromHome, YearsAtCompany). Data transformation methods may be required to approach a normal distribution before fitting a model to the data.
- Age distribution is a slightly right-skewed normal distribution with the bulk of the staff between 25 and 45 years old.
- EmployeeCount and StandardHours are constant values for all employees. They're likely to be redundant features.
- Employee Number is likely to be a unique identifier for employees given the feature's quasi-uniform distribution.
- Laboratory Technician has the highest pay, followed by Sales Executive and Research Scientist.
- Research Director has the lowest pay rate, followed by Manager and Healthcare Representative.
3. Data Pre-processing and Wrangling:
Age is converted into binary form, where employees above 39 get the value 0 and all others 1. For the PastEmployee tag, Yes = 1 and No = 0 are assigned. The Gender and Overtime classes are also converted using apply() with a user-defined function. The pandas get_dummies() function is used to convert the other categorical variables into dummy/indicator variables. The data is divided into X (independent variables) and Y (target variable). Standardization is performed so that each feature has mean 0 and variance 1. The data is then split into a train set and a test set in a 70:30 ratio. To balance the imbalanced data, the SMOTE technique is applied.
4. Hyper-parameter tuning: RandomizedSearchCV and GridSearchCV are used to find the best parameters for the models.
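
The pre-processing described in step 3 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy data frame, its values, and the helper smote_like are hypothetical. Two deliberate differences from the description are assumptions of this sketch: the scaler is fitted on the training split only, and the oversampling is a simplified NumPy illustration of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours) rather than imblearn's SMOTE class, which would normally be applied with fit_resample.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; only the column names echo the text.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Age": rng.integers(18, 60, n),
    "MonthlyIncome": rng.integers(1000, 20000, n),
    "Overtime": rng.choice(["Yes", "No"], n),
    "Department": rng.choice(["Sales", "Research & Development", "Human Resources"], n),
    "PastEmployee": rng.choice(["Yes", "No"], n, p=[0.16, 0.84]),
})

# Binary encodings: Age above 39 -> 0, otherwise 1; Yes -> 1, No -> 0.
df["Age"] = (df["Age"] <= 39).astype(int)
df["PastEmployee"] = df["PastEmployee"].map({"Yes": 1, "No": 0})
df["Overtime"] = df["Overtime"].apply(lambda v: 1 if v == "Yes" else 0)

# Dummy/indicator variables for the remaining categorical column.
df = pd.get_dummies(df, columns=["Department"], dtype=float)

X = df.drop(columns="PastEmployee")
y = df["PastEmployee"]

# 70:30 split, stratified so both sets keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Standardize to mean 0 and variance 1 using training statistics only.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)


def smote_like(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE-style oversampling (illustrative, not imblearn's code)."""
    gen = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Pairwise distances between minority samples; ignore self-distances.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = gen.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, gen.integers(0, k, size=n_new)]
    gap = gen.random((n_new, 1))
    # Each synthetic point lies on the segment between a sample and a neighbour.
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


y_arr = y_train.to_numpy()
n_new = int((y_arr == 0).sum() - (y_arr == 1).sum())
X_bal = np.vstack([X_train_s, smote_like(X_train_s[y_arr == 1], n_new)])
y_bal = np.concatenate([y_arr, np.ones(n_new, dtype=int)])
print(X_bal.shape, np.bincount(y_bal))
```

Only the training split is oversampled, so the test set keeps the original class imbalance and the evaluation reflects real-world proportions.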

Model               | Tuning Time | Best Score
--------------------|-------------|-----------
XGBoost             | 113.663 s   | 0.9836
AdaBoost            | 4.662 s     | 0.8731
LightGBM            | 17.968 s    | 0.8649
Logistic Regression | 51.291 s    | 0.8735
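
As a rough sketch of how the two search strategies from step 4 can be combined, the example below runs a randomized search over a wide range and then a small grid around the best value found. The synthetic data, the choice of logistic regression, the parameter range, and the roc_auc scoring are all assumptions made for illustration; the metric behind the "Best Score" values above is not stated.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic, imbalanced stand-in for the pre-processed training data.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.84, 0.16], random_state=0
)

# Coarse pass: sample the regularization strength C on a log scale.
random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10,
    scoring="roc_auc",
    cv=5,
    random_state=0,
)
random_search.fit(X, y)

# Fine pass: exhaustive grid around the best value from the coarse pass.
best_c = random_search.best_params_["C"]
grid_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [best_c / 2, best_c, best_c * 2]},
    scoring="roc_auc",
    cv=5,
)
grid_search.fit(X, y)

print("best C:", grid_search.best_params_["C"])
print("best cross-validated score:", round(grid_search.best_score_, 4))
```

The best score reported by either search is a cross-validated score on the training data, so it can be considerably higher than the test-set metrics.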

EVALUATION

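Model               | True Positive | True Negative | False Positive | False Negative | Accuracy | Precision | F1     | ROC AUC
--------------------|---------------|---------------|----------------|----------------|----------|-----------|--------|--------
XGBoost             | 27            | 360           | 14             | 40             | 0.8775   | 0.6585    | 0.5000 | 0.6827
AdaBoost            | 27            | 360           | 14             | 40             | 0.8775   | 0.6585    | 0.5000 | 0.6827
LightGBM            | 30            | 343           | 31             | 37             | 0.8458   | 0.4918    | 0.4687 | 0.6824
Logistic Regression | 50            | 285           | 89             | 17             | 0.7596   | 0.3597    | 0.4854 | 0.7541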
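
The metrics can be recomputed from the confusion-matrix counts as a sanity check. The sketch below uses only the counts from the table; the function name is illustrative. Recomputing reproduces the tabulated accuracy and F1 (up to truncation in the last digit), and shows that the values 0.6585, 0.4918, and 0.3597 equal TP / (TP + FP), i.e. precision. The reported AUC values also appear consistent with (recall + specificity) / 2, which is what an ROC AUC reduces to when it is computed from hard class predictions rather than predicted probabilities.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Derive standard classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# (TP, TN, FP, FN) as listed in the evaluation table.
counts = {
    "XGBoost": (27, 360, 14, 40),
    "AdaBoost": (27, 360, 14, 40),
    "LightGBM": (30, 343, 31, 37),
    "Logistic Regression": (50, 285, 89, 17),
}

results = {name: metrics_from_counts(*c) for name, c in counts.items()}
for name, m in results.items():
    print(name, {key: f"{value:.4f}" for key, value in m.items()})
```

Read this way, logistic regression finds the most leavers (50 of 67) at the cost of many false alarms, while XGBoost and AdaBoost raise fewer false alarms but miss 40 of the 67 leavers.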

IMPORTANT FEATURES

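Model    | First                      | Second                        | Third                              | Least
---------|----------------------------|-------------------------------|------------------------------------|------------------------------------
XGBoost  | Overtime (0.255203)        | JobLevel (0.087756)           | JobRole_Sales Executive (0.071910) | JobRole_Research Director (0.00000)
AdaBoost | WorkLifeBalance (0.0875)   | TrainingTimeLastYear (0.0875) | TrainingTimeLastYear (0.0750)      | Department_Human Resources (0.0000)
LightGBM | JobSatisfaction (576)      | StockOptionLevel (546)        | NumCompaniesWorked (508)           | PerformanceRating (0)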
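
A ranking like the one above can be read from a fitted model's feature_importances_ attribute. The sketch below uses scikit-learn's AdaBoostClassifier on synthetic data with hypothetical feature names, so its numbers do not relate to the table. Note that the scales in the table differ between libraries: the XGBoost and AdaBoost values are fractions, whereas the LightGBM values look like raw split counts (its default importance type), so the columns are not directly comparable across models.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data; the feature names are hypothetical.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
names = [f"feature_{i}" for i in range(X.shape[1])]

model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its feature name and sort from most to least important.
ranking = pd.Series(model.feature_importances_, index=names).sort_values(ascending=False)

print("Top three:")
print(ranking.head(3))
print("Least important:")
print(ranking.tail(1))
```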


Source Code

[Figure: Job With Highest Pay]
[Figure: ROC Graph]
[Figure: LightGBM Important Features]
[Figure: Class Histogram]
[Figure: Heat-Map]