Predicting Fraud in Mobile Money Transactions using Machine Learning: The Effects of Sampling Techniques on the Imbalanced Dataset

Mobile money fraud is on the rise in developing countries. We propose a solution to this problem based on machine learning. Labeled data from financial transactions, including mobile money transactions, are however skewed towards legitimate transactions. Machine learning models built on such skewed datasets are unreliable because the prediction algorithms will be biased towards the legitimate transactions. We investigate the performance of different sampling and weighting techniques, including Adaptive Synthetic Sampling (ADASYN) and Synthetic Minority Oversampling Technique (SMOTE). We select Logistic Regression for the experiments due to its simplicity and relatively low computational needs. The performance is evaluated with different metrics. Manually tuning the weights of the classes achieved the best results in our experiments.


Introduction
The use of mobile devices has become a fundamental part of our daily lives. The way we conduct our daily activities has become heavily dependent on mobile devices. One significant aspect of our interactions with mobile devices that cannot be overemphasized is the way we conduct financial transactions. Financial technology, often referred to as Fintech, is the use of innovations and technology that compete with the conventional way of undertaking financial transactions. Having access to conventional financial services, or being financially included, gives individuals opportunities and capabilities to plan, save, and stabilize their financial lives [1]. In most parts of the developing world, access to formal financial services is virtually impossible because the infrastructure and services needed for formal financial inclusion are nonexistent. Where this financial infrastructure exists, customers often have to travel long distances to access these services, resulting in additional cost to already impoverished individuals. The implication of financial exclusion is that individuals with no access to conventional financial services tend to be poor, and this is clearly evident in most developing countries.
Mobile Money Transactions (MMTs) are financial services offered by telecommunication companies, often referred to as Mobile Network Operators (MNOs), that enable the transfer of funds (cash). These transfers of funds, otherwise known as mobile money (MM), are offered between service subscribers (customers) and MNOs through the use of telecommunication channels [2]. According to Demirgüç-Kunt et al. [3], a third of all account holders, which is 12% of the adult population, reported having a mobile money account in sub-Saharan Africa. This comes as a relief since it provides financial inclusion for millions of people in developing economies.
MMTs are fundamentally deployed using short message services (SMS) and Unstructured Supplementary Service Data (USSD) codes, which makes it very easy for the service to be deployed in rural areas with limited access to the internet. It also enables customers to use feature phones, which are less expensive than their smartphone counterparts. However, Mobile Money Services can also be deployed on smartphones using specialized mobile applications.
With their humble inception as M-Pesa in Kenya, MMTs have made huge inroads into making people in developing countries financially included. For example, in Ghana, Cote D'Ivoire, Benin, and Senegal, 54% of the combined adult population use MMTs on a regular basis [4]. The value of MMTs is estimated to be $129.29 billion by 2021 across the globe according to Deloitte, as cited by [5]. These tremendous gains made by MMTs are on the verge of being eroded as fraudsters have been perpetrating fraud on the accounts of legitimate users. According to Busuulwa and Laryea, cited by [5], in 2015 fraudulent transactions stood at 53% of all mobile money transactions in Uganda, 42% in Tanzania, 12% in Kenya, and 23% in Ghana. This may be partly due to inadequate formal education: as part of their studies, the researchers observed the willingness of MMT account holders to release their secret codes and other sensitive information to third parties when seeking help to undertake basic transactions.
Traditionally, there have been many approaches to dealing with fraud in financial transactions. These include rule-based methods, data mining, and other statistical methods. These methods, however, are gradually becoming unreliable as the patterns and modes of operation of criminals grow more sophisticated by the day.
The use of machine learning algorithms in predicting fraudulent transactions has been on the rise. These algorithms, whether supervised or unsupervised, such as K-nearest neighbor, Naïve Bayes, logistic regression, and support vector machines (SVMs), are trained with data and then used to classify financial transactions as legitimate or fraudulent. Deep learning methods such as artificial neural networks (ANNs) and Convolutional Neural Networks (CNNs) have also been employed to detect anomalies in financial transactions. These deep learning and machine learning algorithms have shown high levels of accuracy in their predictions.
Given any dataset on financial transactions such as MMTs, the fraudulent transactions (positive class) constitute a very small percentage of the dataset compared to the legitimate ones (negative class). This makes the datasets highly imbalanced [6], and predictions from machine learning algorithms trained on such data are skewed towards the legitimate transactions, so that they can be misleading. We select Logistic Regression as the machine learning algorithm for this work as it has proven its potency [7] in a multitude of fields for classification and prediction. It has been used in medicine [8,9,10], engineering [11,12], sports [13], finance [14,15,16], and computer science [17], among others. It is in this context that this paper explores the effects of different undersampling, weighting, and oversampling techniques for equalizing the imbalanced dataset. These techniques attempt to eliminate the problem of machine learning models whose results are lopsided towards the majority class. Different undersampling and oversampling techniques are applied to evaluate the effects the imbalanced dataset has on predicting fraud in mobile money transactions. To the best of our knowledge, none has been proposed. Our main contributions in this paper are threefold:
- A proposed weighting technique to eliminate the bias that imbalanced datasets introduce into machine learning algorithms.
- A fraud prediction model based on the proposed technique above to predict fraud in mobile money transactions as well as other financial transactions with imbalanced datasets.
- An in-depth evaluation of the performance of our proposed model as well as that of the other analyzed models.
The remainder of this paper is structured as follows. In section II, we undertake a review of related works in the field of machine and deep learning. Section III gives a brief insight into machine learning and describes the foundations of our chosen machine learning algorithm, logistic regression. Section IV describes our dataset, our methods, and the experimental setup. In section V, we evaluate the performance of our models and discuss the results. We conclude the paper in section VI.

Related works
Most studies in the field of finance on fraud prediction and detection using artificial intelligence, data mining, and other statistical methods have focused on credit card fraud and other fraud related to traditional banking activities. An example is the work of [18]. In their account of financial fraud, the itemized list included only bank, corporate, and financial fraud. Banking fraud was decomposed further into credit card, money laundering, and mortgage fraud without a mention of MMTs, perhaps due to their limited prominence in the developed world.
In the remainder of this section, we briefly review related literature on data imbalance, supervised machine learning algorithms for classification and the evaluation metrics used.
Data imbalance. This situation arises when the dataset being used to train a machine learning algorithm for classification or other purposes is unevenly distributed between the positive and negative classes. According to [19], the percentage of fraud in audited financial reports in the United States of America (positive class) was 0.6%, compared to 99.4% which constitutes legitimate transactions (negative class). Models developed with such imbalanced data often result in misclassification. To correct this anomaly, researchers may reduce the size of the negative class so that it is on par with that of the positive class. This is known as undersampling. Another method, known as oversampling, is to increase the size of the positive class with synthetic data using methods such as Synthetic Minority Oversampling Technique (SMOTE) [20] and Adaptive Synthetic Sampling (ADASYN) [21,22].
Supervised classification machine learning algorithms. We discuss recent works with classification algorithms such as Logistic Regression, Decision Trees, Naïve Bayes, K-nearest neighbors (KNN), and other related algorithms.
Ref. [23] carried out a comparative analysis of the performance of Naïve Bayes, K-nearest neighbor, and logistic regression models in the binary classification of imbalanced credit card fraud data. Their work analyzed the performance of these algorithms in classifying the ULB dataset and proposed the use of other sampling techniques for the imbalanced data, having observed that the nature of the dataset used had a serious impact on the results obtained.
The work of [24] looked at different learning algorithms with the Université Libre de Bruxelles, Brussels, Belgium (ULB) dataset, using SMOTE as the oversampling technique. They concluded by reporting on the performance of the classifiers based on the confusion matrix, recall, accuracy, and precision.
In the article [25], "Horse Race Analysis in Credit Card Fraud", the researchers considered Deep Learning, Logistic Regression, and Gradient Boosted Tree. By examining the Area Under the Curve (AUC) of the Receiver Operating Characteristics (ROC), they found that deep learning methods had the strongest predictive power. The work, however, used undersampling, which has the potential of discarding a large chunk of relevant information about the dataset. Several works in the field of fraud detection have used artificial neural networks [26,27]. Neural networks, however, require large computational power as well as a huge dataset [28].
Evaluation metrics for machine learning algorithms. Different evaluation metrics may be used for evaluating different algorithms, depending on their circumstances. The following are considered: Classification Accuracy, which is the ratio of the number of correct predictions to the total number of input samples; the Confusion Matrix, which gives a detailed description of the performance of models; AUCROC [29,30,31,32,33], which is the probability that a machine learning algorithm will rank a randomly chosen positive example higher than a randomly chosen negative one; F1-Score [30]; and Root Mean Squared Error. The F1-Score is commonly used to evaluate predictions since it represents both the precision and the recall [30].
An analysis of the reviewed literature creates the impression that the majority of the researchers did little or nothing to address the problem of data imbalance when developing their classifier models. Again, depending on the environment, different evaluation metrics may be used in evaluating the performance and appropriateness of a model. For example, using just the accuracy of a classifier as the performance criterion on an imbalanced dataset might give a wrong impression of its performance: the accuracy might be very high while the true positive rate (TPR) and true negative rate (TNR) are very low, and the false positive rate (FPR) and false negative rate (FNR) are very high, which are indications of a bad classifier. This leaves a sense of vagueness in the reported performance of the reviewed models, as the imbalance problem was not properly handled and appropriate evaluation metrics were not used. This paper investigates the performance of logistic regression in light of the reviewed work by experimenting with different undersampling, weighting, and oversampling techniques to classify and predict fraud in Mobile Money Transactions.

Machine learning
Artificial Intelligence (AI) is a field of computer science which emphasizes the creation of intelligent machines that work and behave like humans. Machine Learning (ML) is a subfield of AI that has been defined differently by different schools of thought. The classical definition, attributed to Arthur Samuel, a pioneer in the field of AI, and drawn from his paper [34], is "a field of study that gives computers the ability to learn without being explicitly programmed". Tom Mitchell [35] also defined a well-posed learning problem as: "a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E". Others have defined ML as the science of designing and using complex algorithms that are able to iterate over large datasets and analyze hidden patterns in them. This process enables machines to respond to situations for which they have not been explicitly programmed.
ML is commonly divided into three categories: supervised learning, which uses labeled data for training and testing, with LR, Neural Networks, and Support Vector Machines as examples; unsupervised learning, which uses unlabeled data, with self-organizing maps and one-class support vector machines as examples; and reinforcement learning, which uses software agents that interact with the environment and learn from it through rewards and punishments.

Logistic regression (LR)
The mathematical foundation of LR is the sigmoid function shown in Fig. 1. In LR we aim for an output that can be interpreted as the probability of the positive class. Given a training set, we fit the parameters θ; substituting them into equation (2), the probability output is given by equation (3). From this probability the cost function can be deduced. The parameter θ in this paper was fitted by minimizing equation (7) using the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm, a parameter estimation algorithm in machine learning with a low computational cost per iteration [36,37].
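For reference, the standard textbook formulation of logistic regression that this subsection relies on can be sketched as follows. The notation (hypothesis h_θ, m training examples, labels y^(i) in {0, 1}) is an assumption based on the common presentation; it is not necessarily the exact form or numbering of the equations referred to in the text.

```latex
% Sigmoid (logistic) function
g(z) = \frac{1}{1 + e^{-z}}

% Hypothesis: the sigmoid applied to a linear combination of the features
h_\theta(x) = g(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}}

% Probabilistic interpretation of the output
P(y = 1 \mid x; \theta) = h_\theta(x), \qquad
P(y = 0 \mid x; \theta) = 1 - h_\theta(x)

% Cross-entropy cost over m training examples, minimized with respect to \theta
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
  \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right)
       + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right]
```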

Experimental setup and methods
In this section we introduce the dataset for this paper, explore the dataset to select the best features for our model construction, and describe our methods. We describe the process of setting up our classifiers and the process involved in obtaining our results. The experiments were carried out on a computer running Microsoft Windows 10 Home edition with an Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz and 8GB of RAM.

The dataset
This paper used data from Kaggle [38], originally sourced from a mobile money service provider in an African country. It consists of ten (10) columns with their descriptions given below:
1. type is made up of CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
2. amount is the amount of the transaction in local currency.
3. nameOrig is the customer who started the transaction.
4. oldbalanceOrg is the initial balance before the transaction.
5. newbalanceOrig is the customer's balance after the transaction.
6. nameDest is the recipient ID of the transaction.
7. oldbalanceDest is the initial recipient balance before the transaction.
8. newbalanceDest is the recipient's balance after the transaction.

Data exploration, feature engineering, and selection
The dataset was explored with visualization tools from matplotlib.pyplot and seaborn. The heatmap is reported in Fig. 2. Further analysis showed no significant influence of certain independent variables on the dependent variable in the dataset. These irrelevant variables were subsequently removed from the dataset to enable an efficient model generation.
The dataset was checked further to ascertain the level of independence between the predictors using Spearman's rank correlation coefficient.
The relevant features selected from the previous analysis were further analyzed statistically to determine their suitability for the model development. The results are presented in Table 1.
The Logit Regression Results showed that two of the features extracted for the model development produced p-values of 0.540 and 0.708, which were above the acceptable value of 0.05; they were subsequently dropped. This selection process supports the appropriateness of the selected features for the model development.
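The independence check between predictors can be illustrated with a small, dependency-free sketch of Spearman's rank correlation coefficient. The helper names and the toy values below are hypothetical and are not taken from the dataset; the code only shows how the coefficient is obtained by ranking both variables and correlating the ranks.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values (hypothetical, for illustration only).
amount = [10.0, 250.0, 40.0, 9000.0, 120.0]
old_balance = [15.0, 300.0, 90.0, 9500.0, 50.0]
rho = spearman(amount, old_balance)
print(round(rho, 3))
```

A coefficient close to +1 or -1 between two predictors suggests that they carry largely redundant information, which is one reason to drop one of them before fitting the model.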

Methods
The methodology was implemented in Python. The approach is illustrated in the flow chart in Fig. 3. We began by collecting the dataset. The data was preprocessed to extract the relevant features needed for the model development. The model was then constructed based on the selected features. After the model construction, different evaluation metrics were used to determine the suitability and relevance of the model to the problem at hand. Where appropriate, the model was accepted; otherwise the parameters were tuned and the model reconstructed until an adequate one was found.

Model construction
In order to develop a good model, we analyzed the target variable, isFraud, in the paysim1 dataset to determine its distribution. The results showed 8213 samples for the positive class (1) and 6354407 for the negative class (0), representing 0.1290% and 99.8709% respectively. This is a clear indication of high imbalance in the target variable. The imbalance is also manifested in the training dataset (75% of the dataset), which showed 6186 samples for the positive class (1) and 4765779 for the negative class (0), representing 0.1296% and 99.8703% respectively. From these values, it is evident that a "normal" Logistic Regression model will be biased towards the negative class. However, for the sake of analysis, this paper looked at the result of building a "normal" Logistic Regression, which does not take into account the imbalanced nature of the dataset, alongside the other methods of dealing with data imbalance in machine learning, to enable a better analysis of the results.
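The class shares quoted above follow directly from the class counts. The short sketch below reproduces the arithmetic and, as an illustration of why accuracy alone is misleading here, adds a hypothetical baseline that labels every transaction as legitimate; the function name is an assumption made for this example.

```python
def class_shares(n_positive, n_negative):
    """Return the percentage share of the positive and negative classes."""
    total = n_positive + n_negative
    return 100.0 * n_positive / total, 100.0 * n_negative / total


# Class counts reported for the full dataset.
fraud_pct, legit_pct = class_shares(8213, 6354407)

# Hypothetical baseline: predict "legitimate" for every transaction.
# Its accuracy equals the share of the negative class, yet it detects no fraud.
baseline_accuracy_pct = legit_pct
baseline_recall_pct = 0.0

print(f"fraud share: {fraud_pct:.4f}%")
print(f"baseline accuracy: {baseline_accuracy_pct:.4f}%, recall: {baseline_recall_pct}%")
```

Such a baseline would score above 99.8% accuracy while missing every fraudulent transaction, which is why the later sections rely on recall, precision, and the F1-score rather than accuracy alone.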

"Normal" logistic regression
In building the "normal" Logistic Regression, we used 75% of the dataset for the training phase and held out the remaining 25% for testing. No resampling was performed on the training dataset, in order to see the effect of the imbalance on the results. The model produced a score of 99.8747% and a wrong classification score of 0.075%. The rate at which the model was able to detect fraudulent transactions (TPR) was 81.3517% of all the actual fraudulent transactions, and its ability to detect fraud-free transactions (TNR) was 99.8983% of the actual fraud-free transactions. The rate at which fraud-free transactions were classified as fraud (FPR) was 79.6743% of all the actual transactions, and the rate at which fraudulent transactions were classified as legitimate ones (FNR) was 0.0237% of all the actual fraud-free transactions. Other classification reports, confusion matrix, root mean square error (RMSE), and AUCROC values are reported in Tables 2, 3, 4, and 5 respectively. A plot of the receiver operating characteristics curve is also presented in Fig. 6.

Undersampling
For undersampling, we aimed at removing the model's tilt towards the negative class by equalizing the sizes of the majority and minority classes in the training dataset. In this method, we reduced the size of the majority class from 4765779 to match the 6186 samples of the minority class in the training dataset. We achieved this by randomly removing majority class indices until the majority class had the same size as the minority class. This method produced an accuracy of 89.5057% and a wrong classification score of roughly 10%. Other classification reports, confusion matrix, RMSE, and AUCROC values are reported in Tables 2, 3, 4, and 5 respectively. A plot of the receiver operating characteristics curve is also presented in Fig. 6.
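Random undersampling as described above can be sketched in a few lines. This is a minimal illustration on toy labels, assuming that class 1 is the minority class and that it is smaller than class 0; the function name and the fixed seed are choices made for this example only.

```python
import random


def random_undersample(labels, seed=0):
    """Return indices that keep every minority sample (label 1) and an
    equally sized random subset of the majority samples (label 0)."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == 1]
    majority_idx = [i for i, y in enumerate(labels) if y == 0]
    # Draw, without replacement, as many majority indices as there are minority ones.
    kept_majority = rng.sample(majority_idx, len(minority_idx))
    selected = minority_idx + kept_majority
    rng.shuffle(selected)  # avoid an ordered block of one class
    return selected


# Toy labels: 990 legitimate and 10 fraudulent transactions.
labels = [0] * 990 + [1] * 10
idx = random_undersample(labels)
balanced = [labels[i] for i in idx]
print(len(balanced), sum(balanced))
```

The sketch makes the cost of the method visible: 980 of the 990 majority samples in the toy example are discarded, which mirrors the loss of information discussed in the results.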

Logistic regression with weight
In this method of the model development, we imposed weights on the class errors which were proportional to the class imbalance. This was achieved by setting the class_weight hyperparameter of the scikit-learn logistic regression classifier to "balanced", which assigns weights to the classes in an effort to balance the influence both classes have on the classifier. This produced a model score of 96.2240% and a wrong classification score of 3.7759%. Other classification reports, confusion matrix, RMSE, and AUCROC values are reported in Tables 2, 3, 4, and 5 respectively. A plot of the receiver operating characteristics curve is also presented in Fig. 6.
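To make the effect of the "balanced" setting concrete, the sketch below applies the heuristic that the scikit-learn documentation gives for this mode, n_samples / (n_classes * class_count), to the training class counts reported in this paper. The helper function is hypothetical and written in plain Python so that it does not depend on scikit-learn itself.

```python
def balanced_class_weights(class_counts):
    """Weights inversely proportional to class frequency:
    n_samples / (n_classes * count_of_class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Training class counts reported in the paper (0 = legitimate, 1 = fraud).
weights = balanced_class_weights({0: 4765779, 1: 6186})
print({label: round(w, 4) for label, w in weights.items()})
```

Under this heuristic each class contributes the same total weight to the loss, so an error on a fraudulent transaction costs several hundred times more than an error on a legitimate one.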

Synthetic minority oversampling technique (SMOTE)
In this model, we employed the oversampling technique SMOTE. SMOTE increases the size of the minority class by introducing new synthetic instances in the neighborhood of existing minority class samples [20]. The method attempts to match the sizes of the majority and the minority classes. It yielded 4765779 samples for the positive class (1) and 4765779 for the negative class (0). This model produced a score of 86.2210% and a wrong classification score of 13.7789%. The rate at which the model was able to detect fraudulent transactions (TPR) was 98.0266% of all the actual fraudulent transactions, and its ability to detect fraud-free transactions (TNR) was 86.2060% of the actual fraud-free transactions. The rate at which fraud-free transactions were classified as fraud (FPR) was 10810.80% of all the actual fraudulent transactions, and the rate at which fraudulent transactions were classified as legitimate ones (FNR) was 0.0025% of all the actual fraud-free transactions. Other classification reports, confusion matrix, RMSE, and AUCROC values are reported in Tables 2, 3, 4, and 5 respectively. A plot of the receiver operating characteristics curve is also presented in Fig. 6.
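The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions (Euclidean distance, brute-force neighbour search, a hypothetical function name), not the implementation used in the experiments.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a random base sample and one of its neighbours for each new point.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority samples on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=6, k=2)
print(synthetic.shape)
```

Because every synthetic point lies on a line segment between two existing minority samples, the new data stay inside the region already occupied by the minority class rather than duplicating existing rows.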

Using smote re-sampling for best parameters (SMOTE RS)
This approach is similar to the SMOTE method described above and produced 785568 for the positive class (1) and 7855688 for the negative class (0). However, it further employed GridSearchCV [39,40,41] to tune the hyperparameters of the logistic regression algorithm. This method searches for the best combination of hyperparameter values from a specified grid of possible values. Pipeline was also used to help automate the learning workflow. Pipeline works by chaining a sequence of data transformations together with a final model. These two tools aided in obtaining the best SMOTE ratio for optimizing the algorithm. The method obtained 0.01 as the best SMOTE ratio for the model, with the plot of the mean test score against weight reported in Fig. 4. The model produced a score of 98.4941% and a wrong classification score of 1.5061%. The rate at which the model was able to detect fraudulent transactions (TPR) was 90.3798% of all the actual fraudulent transactions, and its ability to detect fraud-free transactions (TNR) was 98.5044% of the actual fraud-free transactions. The rate at which fraud-free transactions were classified as fraud (FPR) was 1172.0769% of all the actual fraudulent transactions, and the rate at which fraudulent transactions were classified as legitimate ones (FNR) was 0.0122% of all the actual fraud-free transactions. Other classification reports, confusion matrix, RMSE, and AUCROC values are reported in Tables 2, 3, 4, and 5 respectively. A plot of the receiver operating characteristics curve is also presented in Fig. 6.

Manual weights tuning (MWT)
Under this approach, we aimed at achieving a trade-off for the harmonic mean by manually tuning the class weights for the false positives and the false negatives. The class sizes used were those of the training dataset (75% of the original dataset): 4765779 for the negative class (0) and 6186 for the positive class (1). We achieved this by setting twenty-five (25) evenly spaced weight points between 0.01 and 1.0 and searching over them using GridSearchCV with 5-fold cross-validation. A weight of 0.8350 was obtained as the best parameter for the negative class and 0.1649 for the positive class. These weights were then fitted into our logistic regression model. The plot of the mean test score against weight is reported in Fig. 5. Other classification reports, confusion matrix, RMSE, and AUCROC values are reported in Tables 2, 3, 4, and 5 respectively. A plot of the receiver operating characteristics curve is also presented in Fig. 6.
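A weight search of this kind can be sketched with scikit-learn as follows. Several details are assumptions made for illustration: the data are synthetic stand-ins generated with make_classification, the F1-score is used as the selection criterion, and each grid point w is mapped to the weights {0: w, 1: 1 - w}, a convention chosen because the two reported weights sum to approximately one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (hypothetical; not the mobile money dataset).
X, y = make_classification(n_samples=3000, n_features=6,
                           weights=[0.95, 0.05], random_state=0)

# Twenty-five evenly spaced weight points between 0.01 and 1.0.
# Note: w = 1.0 gives the positive class zero weight; it is kept only
# to mirror the stated range.
weights = np.linspace(0.01, 1.0, 25)
param_grid = {"class_weight": [{0: w, 1: 1.0 - w} for w in weights]}

search = GridSearchCV(
    LogisticRegression(solver="lbfgs", max_iter=1000),
    param_grid,
    scoring="f1",                    # assumed selection criterion
    cv=StratifiedKFold(n_splits=5),  # 5-fold cross-validation
)
search.fit(X, y)

best = search.best_params_["class_weight"]
print(best, round(search.best_score_, 4))
```

Under this assumed grid, the 21st point is 0.835, which agrees with the best negative-class weight reported above; the best weight found on the synthetic data will generally differ.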

Adaptive synthetic sampling (ADASYN)
ADASYN is one of the methods for dealing with the problem of class imbalance [21,42]. ADASYN generates additional synthetic samples for the minority class, using a density distribution to automatically determine the number of artificial samples to be generated for each minority sample [22]. The algorithm produced a score of 85.2557% and a wrong classification score of 14.7442%. The rate at which the model was able to detect fraudulent transactions (TPR) was 98.8653% of all the actual fraudulent transactions, and its ability to detect fraud-free transactions (TNR) was 85.2383% of the actual fraud-free transactions. The rate at which fraud-free transactions were classified as fraud (FPR) was 11569.1662% of all the actual fraudulent transactions, and the rate at which fraudulent transactions were classified as legitimate ones (FNR) was 0.0014% of all the actual fraud-free transactions. Other classification reports, confusion matrix, RMSE, and AUCROC values are reported in Tables 2, 3, 4, and 5 respectively. A plot of the receiver operating characteristics curve is also presented in Fig. 6.
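The density-based allocation that distinguishes ADASYN from SMOTE can be sketched as follows: each minority sample receives a share of the synthetic samples proportional to the fraction of majority samples among its k nearest neighbours. The function name, the toy data, and the brute-force neighbour search are assumptions for illustration; the sketch covers only the allocation step, not the generation of the samples themselves.

```python
import numpy as np


def adasyn_allocation(X, y, beta=1.0, k=5):
    """Return how many synthetic samples to generate for each minority
    point (label 1), following ADASYN's density-based allocation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority = np.flatnonzero(y == 1)
    n_min, n_maj = len(minority), int(np.sum(y == 0))
    total_new = int((n_maj - n_min) * beta)  # total synthetic samples wanted
    ratios = np.empty(n_min)
    for pos, i in enumerate(minority):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf  # ignore the point itself
        nearest = np.argsort(dist)[:k]
        ratios[pos] = np.mean(y[nearest] == 0)  # share of majority neighbours
    if ratios.sum() == 0:
        return np.zeros(n_min, dtype=int)
    density = ratios / ratios.sum()  # normalise to a distribution
    # Rounding means the counts may not sum exactly to total_new.
    return np.rint(density * total_new).astype(int)


# Toy 1-D data: one minority point sits among majority points,
# three others form an isolated minority cluster.
X = [[0], [1], [2], [3], [4], [5], [4.5], [20], [21], [22]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
counts = adasyn_allocation(X, y, beta=1.0, k=2)
print(counts.tolist())
```

In the toy example all synthetic samples are allocated to the minority point surrounded by majority points, which illustrates how ADASYN concentrates on the samples that are hardest to learn.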

Performance evaluation, results and discussion
In order to obtain a clear analysis of our experimental results, we explore a variety of metrics. The following metrics were used in the evaluation of the models: Accuracy, Recall, Precision, F1-Score, AUCROC, and RMSE. For a classifier, the entries of the confusion matrix are classified as True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
Accuracy is the ratio of the correctly predicted samples to the total number of samples. It is given by equation (8) [43]:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$
Recall can be defined as the ratio of true positives to the sum of true positives and false negatives. It is given by equation (9) [43]:
$$\text{Recall} = \frac{TP}{TP + FN} \tag{9}$$
Precision is defined as the ratio of correctly predicted positive observations to the total predicted positive observations. It is given by equation (10) [43]:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{10}$$
The F1 Score is defined as the harmonic mean of Precision and Recall. The formula is given by equation (11) [43]:
$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{11}$$
AUCROC is defined as the area under the curve of the plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) [29,44]. The values range between 0 and 1. A value approaching 1 indicates a better model, a value of 0.5 corresponds to random guessing, and a value approaching 0 indicates a bad model. It is given by equation (12) [45]:
$$\text{AUC} = \int_{0}^{1} TPR \, d(FPR) \tag{12}$$
Root mean squared error is the square root of the average of the squared differences between the observed class and the predicted class. It is given by equation (13):
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2} \tag{13}$$
where $y_i$ is the observed class, $\hat{y}_i$ is the predicted class, and $n$ is the number of samples.
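The metrics defined above can be computed directly from the four confusion matrix counts. The sketch below uses hypothetical counts chosen for easy arithmetic; the function name is an assumption. For 0/1 class labels, each misclassified sample contributes a squared difference of 1, so the RMSE reduces to the square root of the error rate.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision, F1 and RMSE from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # With 0/1 labels every error contributes a squared difference of 1.
    rmse = math.sqrt((fp + fn) / total)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1, "rmse": rmse}


# Hypothetical confusion matrix counts (not results from the paper).
m = classification_metrics(tp=80, tn=900, fp=10, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 0.98 while the recall is only about 0.89, a small-scale version of the gap between accuracy and class-specific metrics that motivates the evaluation in this paper.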

Results and discussion
For a model to be considered adequate for classification and used subsequently for prediction, one of the key indicators is the evaluation of the TPR, TNR, FPR, and FNR. The TPR and TNR should be as high as possible, while the FPR and FNR need to be as low as possible. Figure 7 is a bar chart of the classifiers and their respective TPR values. We then proceeded to analyze our models based on the F1 Score. Four out of seven models achieved F1 Score values below 6%, which are too low to be considered for inclusion in our model development. NLR, Undersampling, and MWT are therefore the models left for consideration. Undersampling achieved the highest score of 90.20%, followed by MWT with 79.75% and NLR with 62.33%.
Our next evaluation metric was the AUCROC value. The AUC values produced by all seven (7) models exceeded 0.9, which makes them all very good models by this measure.
We now consider the RMSE values, which ranged from 0.0213 for MWT to 0.3839 for ADASYN.
In these experiments two models came out on top: MWT and Undersampling. Manual Weights Tuning achieved a model score of 99.9559%, an F1-score of 79.75%, the lowest RMSE, and an AUC value of 0.9627. Undersampling recorded good results, having the best F1-score of 90.20% and an accuracy of 89.5057%. Undersampling, however, discarded a large chunk of the dataset, utilizing only 4080 samples for testing compared to 1590655 for the other models, making it inappropriate for our model. We therefore consider MWT the best model in our experiments.
In comparison with other works, MWT achieved a superior performance compared with [46], which obtained 92.74% using C4.5, and [47], which obtained 97% to 98% using case-based reasoning (CBR). MWT also performed better than a similar work by [23], which experimented with a hybrid technique for undersampling and oversampling, achieving 97.92% for Naive Bayes, 97.69% for k-nearest neighbor, and 54.86% for LR. The other models performed poorly, with each obtaining harmonic means of less than 10%. Even though other metrics were used, the F1-score was one of the key metrics, since it is normally used to evaluate prediction (classification) algorithms because of its ability to balance recall and precision [48].

Conclusion
In this article, we looked at different approaches to classifying and predicting fraud cases in MMTs, with keen interest in the associated class imbalance problem. We have shown the effects different resampling techniques have on our prediction (classification) results. We further examined this with different evaluation metrics. Our best model in these experiments was the manual tuning of the class weights for the false positives and the false negatives, which was aimed at achieving a trade-off for the F1-score. We also demonstrated the practicality of our work using logistic regression.