Tirth Pipalia

Assimilation of Machine Learning and Cloud Computing

Abstract

During the era of IoT and Big Data as an emerging technology, it has major impact on all the business sectors. This technology results in generating massive amount of electronic data which contains valuable information. To extract and analyse this information at greater scale the only choice is Cloud computing and Machine Learning technology. The goal is to identify best process for Fraud Prediction and Sales prediction. Ten Machine Learning models are implemented for solving this business question. RandomizedSearchCV is implemented for hyper parameter tuning and time required by model for training and prediction is also evaluated. Even SMOTE is applied for imbalanced dataset. Classification models are validated based on Recall, F1, Confusion Matric ROC-AUC values. Regression Models are evaluated based on MAE and MSE score. Research is conducted using Amazon Web Services and python for predictive analytics by assimilating Machine Learning and Cloud Computing technologies.

Objective

- Perform EDA (Exploratory Data Analysis) for generating preliminary insights from the dataset.
- Setting up the hyper-parameters for essential Machine Learning model using RandomizedSearchCV.
- Feature engineering to obtain the impact of important features in the dataset.
- Evaluate and validate results of ML models using metrics like Accuracy, F1, Recall Score and RMSE, MAE.
- Setting up the services of AWS which will make the implementation of predictive analysis easier and efficient.
- Setting up AWS S3 permission,SageMaker,EC2 instance and installation of essential packages for performing cloud integration task.

Methodology

1. Data Collection
2. Data Wrangaling
3. Handeling Imbalanced Data - SMOTE (Synthetic Minority Oversampling Technique)
4. Setting up AWS: S3,EC2,SageMaker,Jupyter Notebook Configuration
5. Evaluation Metrics (Confusion Matrix, Mean Squared Error)
6. ML Algorithms: Decision Tree, Light GBM, LDA, Random Forest, Ridge, LASSO, Linear Reg., XGBoost, AdaBoost

Analysis & Findings

• Before applying SMOTE there were 123540 class with the genuine transaction and 2823 with the fraudulent transaction which was later balanced equally for value of 123540 observation for each class.

• Random Forest is the model which has the best Accuracy, Recall and F1 score for tuned and default parameters. It consumes 0.296 s for prediction when parameters are tuned and 0.406 s when parameters are set for default making the most timeconsuming model for prediction.

• In Sales prediction Linear regression is the model with least error rate based on RMSE and MAE values followed by Ridge Regression. The worst score is for both the tree-based regressor model where GBT model has MAE=1.8 highest among all the models and XGBoost having RMSE=4.7 making it the least preferable model for the sales prediction task.

• Total loss of revenue is -3883547.345768667. It also indicated that this loss was due to the large frequency of fraudulent transaction.

Result

Model	Accuracy(%)	recall(%)	F1(%)	ROC_AUC(%)	Training Time(s)	Tuning time(s)
Decision Tree	88.8396484	16.20481080	27.5994250	90.8604781	0.308	17.4
Random Forest	98.3325947	65.96958174	60.5848974	77.6681912	65.774	1302
LDA	84.4338577	12.81414830	22.7172717	92.0346958	0.615	5.4
Light GBM	91.8771696	20.51138484	33.3181749	90.3260558	1.448	99.5
AdaBoost	93.2306669	21.90321833	34.0410219	84.9888818	180.171	2352

Model	Mean Absolute Error	Root Mean Squared Error	Traing time(s)
LASSO	0.08339548200648535	0.11536865510851727	0.04
Ridge	0.0010036470043190342	0.001882263872915812	0.032
XGBoost	1.1313262058563323	4.793739223668036	19.919
Linear Regression	0.0005448947680783427	0.0014938985645114012	0.08
Gradient Boosted Tree	1.8431233686568962	3.392379767811959	39.152

View Presentation Source Code Link to Project Report

Payment Type v/s Region

Highest Fraud Region

Balanced Data-SMOTE

Significant Features

ROC Plot

MAE-RMSE Plot