During the era of IoT and Big Data as an emerging technology, it has major impact on all the business sectors. This technology results in generating massive amount of electronic data which contains valuable information. To extract and analyse this information at greater scale the only choice is Cloud computing and Machine Learning technology. The goal is to identify best process for Fraud Prediction and Sales prediction. Ten Machine Learning models are implemented for solving this business question. RandomizedSearchCV is implemented for hyper parameter tuning and time required by model for training and prediction is also evaluated. Even SMOTE is applied for imbalanced dataset. Classification models are validated based on Recall, F1, Confusion Matric ROC-AUC values. Regression Models are evaluated based on MAE and MSE score. Research is conducted using Amazon Web Services and python for predictive analytics by assimilating Machine Learning and Cloud Computing technologies.
- Perform EDA (Exploratory Data Analysis) for generating preliminary insights from the
dataset.
- Setting up the hyper-parameters for essential Machine Learning model using
RandomizedSearchCV.
- Feature engineering to obtain the impact of important features in the dataset.
- Evaluate and validate results of ML models using metrics like Accuracy, F1, Recall Score
and RMSE, MAE.
- Setting up the services of AWS which will make the implementation of predictive
analysis easier and efficient.
- Setting up AWS S3 permission,SageMaker,EC2 instance and installation of essential packages for performing
cloud integration task.
1. Data Collection
2. Data Wrangaling
3. Handeling Imbalanced Data - SMOTE (Synthetic Minority
Oversampling Technique)
4. Setting up AWS: S3,EC2,SageMaker,Jupyter Notebook Configuration
5. Evaluation Metrics (Confusion Matrix, Mean Squared Error)
6. ML Algorithms: Decision Tree, Light GBM, LDA, Random Forest, Ridge, LASSO, Linear Reg., XGBoost, AdaBoost
• Before applying SMOTE there were 123540 class with the
genuine transaction and 2823 with the fraudulent transaction which was later balanced
equally for value of 123540 observation for each class.
• Random Forest is the model which has the best Accuracy, Recall
and F1 score for tuned and default parameters. It consumes 0.296 s for prediction when
parameters are tuned and 0.406 s when parameters are set for default making the most timeconsuming model for prediction.
• In Sales prediction Linear regression is the model with least error rate based on RMSE and
MAE values followed by Ridge Regression. The worst score is for both the tree-based
regressor model where GBT model has MAE=1.8 highest among all the models and XGBoost
having RMSE=4.7 making it the least preferable model for the sales prediction task.
• Total loss of revenue is -3883547.345768667. It also indicated that this loss was due to the large
frequency of fraudulent transaction.
Model | Accuracy(%) | recall(%) | F1(%) | ROC_AUC(%) | Training Time(s) | Tuning time(s) |
---|---|---|---|---|---|---|
Decision Tree | 88.8396484 | 16.20481080 | 27.5994250 | 90.8604781 | 0.308 | 17.4 |
Random Forest | 98.3325947 | 65.96958174 | 60.5848974 | 77.6681912 | 65.774 | 1302 |
LDA | 84.4338577 | 12.81414830 | 22.7172717 | 92.0346958 | 0.615 | 5.4 |
Light GBM | 91.8771696 | 20.51138484 | 33.3181749 | 90.3260558 | 1.448 | 99.5 |
AdaBoost | 93.2306669 | 21.90321833 | 34.0410219 | 84.9888818 | 180.171 | 2352 |
Model | Mean Absolute Error | Root Mean Squared Error | Traing time(s) |
---|---|---|---|
LASSO | 0.08339548200648535 | 0.11536865510851727 | 0.04 |
Ridge | 0.0010036470043190342 | 0.001882263872915812 | 0.032 |
XGBoost | 1.1313262058563323 | 4.793739223668036 | 19.919 |
Linear Regression | 0.0005448947680783427 | 0.0014938985645114012 | 0.08 |
Gradient Boosted Tree | 1.8431233686568962 | 3.392379767811959 | 39.152 |
Payment Type v/s Region
Highest Fraud Region
Balanced Data-SMOTE
Significant Features
ROC Plot
MAE-RMSE Plot