A Comparative Analysis of Machine Learning Algorithms to Build a Predictive Model for Detecting Diabetes Complications

Diabetes complications have a significant impact on patients’ quality of life. The objective of this study was to predict which patients were more likely to be in a complicated health condition at the time of admission to allow for the early introduction of medical interventions. The data were 644 electronic health records from Alsukari Hospital collected from January 2018 to April 2019. We used the following machine learning methods: logistic regression, random forest, and k-nearest neighbor (KNN). The logistic regression algorithm performed better than the other algorithms achieving an accuracy of 81%, recall of 81%, and F1 score of 75%. Also, attributes such as infection years, swelling, diabetic ketoacidosis, and diabetic septic foot were significant in predicting diabetes complications. This model can be useful for the identification of patients requiring additional care to limit the complications and help practitioners in making decisions on whether the patient should be hospitalized or sent home. Furthermore, we used the sequential feature selection (SFS) algorithm which reduced the features to six, which is fewer than any model built before to predict diabetes complications. The primary goal of this study was achieved. The model had fewer attributes which means we have a simple and understandable model in addition to, it has a better performance. bolezni.


Introduction
Diabetes is a chronic disease and considered a serious global health challenge [1]. Diabetes Complications happen when diabetes is uncontrolled. That leads to serious health problems; the patient could suffer from a diabetic coma or die from a heart attack or stroke. The number of people with diabetes has risen to 422 million in 2014 which was about 17% of the world population [2]. More than 68% of diabetes-related deaths are caused by diabetes complications [3]. About 25% of people with diabetes are undiagnosed in the United States [4]. To address diabetic complications, doctors need to be allowed to identify and monitor patients at risk of complications [5]; [6]. Early discovery can prevent or delay diabetes-related complications and allow for effective intervention at both the individual and population levels, which are desperately needed to slow the diabetes epidemic [7].
The objective of this study was to predict which patients were more likely to be in a complicated health condition at the time of admission, which helps for the early introduction of medical interventions. We used the following machine learning methods: logistic regression, random forest, and KNN. The dataset was collected from Alsukari Hospital for building the machine learning model. The final model used the most six significant attributes and 644 records. This model is useful for predicting the likelihood that the patient should be hospitalized or sent home.
The rest of this paper is organized into Sections detailed as follows: section 2 literature review, data collection in section 3, feature selection in section 4, sequential feature selection section 5, model design in section 6, evaluation metrics section 7, results in section 8, section 9 conclusion, study limitations in section 10, last future work section 11.

Literature review
Various traditional methods, based on the physical and chemical tests are available for diagnosing diabetes. However, methods based on data mining techniques can be effectively implemented [8]. The authors agreed on the importance of developing machine learning algorithms to learn patterns and decision rules from data [9]. Although, many studies were conducted to assess the main causes of diabetes mellitus. But, a few were directed to discover the clinical risk factors [10]. A machine learning model was built to predict wound complications and mortality [11].
Electronic health records were used for many studies related to diabetes [12]; [13]. A method that enables risk assessment from electronic health records(EHR) on a large population, they also added administrative claims and pharmacy records [14]. Another study proposed a model that predicts the severity as a ratio interpreted as the impact of diabetes on different organs of the human body, the algorithm estimated the severity on different parts of the body like the heart and kidney [15]. A rapid model for glucose identification and prediction based on the idea of model migration [16]. Despite the data was collected from different sources and places but, some attributes were used in many studies such as age, gender, body mass index(BMI), glucose, blood pressure, time of diagnosis, and smoking [9]; [17]; [18]; [11]. Table 1 above is the review of the popular algorithms used in diabetes studies related to machine learning from 20015 to 2019. And the figure below demonstrates the trend.
According to the figure above, logistic regression (LR), decision tree (DT), and supervised machine learning (SVM) are the most popular algorithms used for diabetes studies among the research community with about 17%. Followed by, the artificial neural network (ANN) with 14%. According to, literature review researchers designed about 12% out of all used algorithms as new methods for diabetes problems.
We built a machine learning model to predict diabetes complications, using six attributes. This model introduced new attributes such as diabetic ketoacidosis, swelling, infection years, and diabetic septic foot were found to be significant. These attributes were not included in the previously mentioned studies [30]; [13]; [31]; [9]. Also, in this paper, we investigated five performance metrics such as F1 and recall. A logistic regression model was used to assess the factors associated with glycemic control. The model indicated that patients older than 65 years old were more likely to have complications compared to the younger[32]. Demographic and treatment data were collected and logistic regression was used to predict complications [33]. Another study was conducted to predict 30-day complication rate using random forest and logistic regression, the analysis showed that age is the most significant attribute in predicting complications [34]. Random forest and simple logistic regression methods showed the best performance compared to the evaluated algorithms [35]. Diabetes complications prediction model was based on similarity measure. first, they assessed the similarity between textual medical records after data cleaning, then topic mining is conducted, and last building the model [36].

Data collection
The dataset was collected from Alsukari Hospital. Ethical approval to use the data for research was obtained both from the Ministry of Health (MOH) and the hospital. The dataset contained 29 attributes and 644 records of diagnosed diabetes patients who were admitted to the hospital in the period from January 2018 to April 2019.

Feature selection
Classification problems usually have a big number of features in the dataset, but not all of them are significant for classification. Irrelevant features may reduce the performance and even complicate the model. Feature selection aims to select a small number of relevant features to get similar or better classification performance than using a larger number of features [39].
Thus, it is usual to apply a preprocessing step to remove irrelevant features and reduce the dimensionality of the data [40]. The selection of the features can lead to an improvement of the learning algorithm, either in terms of learning speed, generalization, or simplicity of the model. Furthermore, there are other advantages associated with a reduced feature: low cost, clear model, and a better understanding of the domain knowledge [41].

Filter method
Filter based feature methods evaluate features as an individual assessment. Therefore, these methods first assign a score value for each feature using one of the statistical criteria, and then, all features are sorted according to the scores. Then, they select top-n features with the highest score as the final step [42]. Filter feature selection algorithms are useful due to their simplicity and fast speed. A common filter is to use mutual information to evaluate the relationships between each feature and the class variable [43].

Wrapper method
Feature selection algorithms are divided into two methods. Firstly, it depends on the outcome of the selection algorithm: whether it returns a subset of significant features or an ordered ranking of all the features, recognized as feature ranking. Secondly, feature selection methods are divided into three approaches based on the relationship between a feature selection algorithm and the learning method, which is used for building the model: filters, which rely on overall characteristics of the dataset and are independent of the learning algorithm; wrappers, which use the prediction of a classifier to estimate subsets of features; and embedded methods, which perform the selection in the process of training and are specific to the given learning algorithm.
Wrapper techniques depend on the classification algorithm, which is used to evaluate the subsets of features, but they are more expensive when it comes to computation [44]. Despite this weakness, they often provide better outcomes; wrappers are used widely in many applications [45]. Especially, in healthcare where we care about the accuracy more than the performance of the algorithm in many situations and allow for implementation in real-time systems when we have fewer attributes [46].

Hybrid method
Recently, researchers are concentrating on developing novel hybrid feature selection methods as they speed up the removal of irrelevant features and give greater classification accuracy compared to other methods [47]. Though various techniques were developed for selecting the perfect subset of features, these methods faced some problems such as instability, high processing time, and selecting a semi-optimal solution as a final result. In other words, they have not been able to fully extract the effective features. Hybrid methods were introduced as a solution to overcome the weaknesses of using a single algorithm [48].

Embedded method
Embedded feature selection is related to classification algorithms. This relation in embedded methods is stronger than that in wrapper methods. Embedded methods are a sort of combination of filter and wrapper methods [49]. By, embedding feature selection into the model learning. They return both the learned model and selected features and are frequently used for classification [50]. Inserting the feature selection step into the training process can improve the performance of the model.

Ensemble method
Ensemble learning is an effective method for machine learning. The objective is to attain better learning accuracy by combining different learning models [41]. Ensemble methods are better than using a single machine learning model. Recently, the development of ensemble feature selection is increasingly getting attention [51].  Combined feature selection aims to find multiple optimal features. The advantage of integration technology would produce a stable and efficient method; especially, with high-dimensional data [52].

Sequential feature selection (SFS)
The sequential feature selection(SFS) algorithm begins with a blank set and increases one feature for the first step which gives the best value for the model. On the second step onwards the remaining features are added separately to the existing subset and the new subset is assessed. The new feature is permanently added to the subset if it gives the maximum classification accuracy. The process is repeated until the required number of features are added [53]. SFS method has the advantage of improving the prediction performance of the classifier by excluding any characteristic that reduces the performance [54].

Model design
Working with methods for reasoning under uncertainty is now one of the most interesting areas of machine learning [20]; [21]. Machine learning has been used for several decades to tackle a broad range of problems in many fields of applications [57]. The machine learning model was built on the historical data using different algorithms for each model; we evaluated the results, and assessed the model accuracy on the testing data. Six attributes were selected out of 29. The dataset was divided into two parts: training and testing set consisting of 70 and 30 percent respectively. The features were selected to be used in the final model based on the training set which achieved the highest accuracy of 86%, with the six attributes as shown in table 2. Logistic regression, random forest, and KNN were selected because of their simplicity and good predictive capability. However, machine learning models are more accurate than normal statistical methods [58].

Logistic regression
Logistic regression models are a sort of widespread linear model that is used for datasets where the dependent variable is categorical [59]. The logistic model is used to estimate the probability of the response variable based on one or more predictor variables. Logistic regression is used when the dependent variable is categorical and generates output in terms of probabilities [60]. Logistic regression is an effective prediction algorithm. Its applications are efficient when the dependent variable of a dataset is binary [61].

Random forest
Random Forest was proposed by Dr. Breiman in 2001. It is typically used in classification [62]. It is an algorithm based on statistical learning theory, which uses a bootstrap randomized re-sampling method to extract multiple versions of the sample sets from the original training datasets. Then it builds a decision tree model for each sample set, and finally combines all the results of the decision trees to predict via a voting mechanism [63].
Suppose we have the dataset D = {(x1, y1) … (xn, yn)} and the aim is to find the function f : X →Y where X is the inputs and Y is the produced outputs. Furthermore, let M be the number of inputs. Random forest randomly selects n observations from D with replacement to a bootstrap sample. Each tree is grown using a subset of m features from the overall M features. For regression, it is recommended to set the subset of features at M=3. Then at each node, m features are nominated at random and the best performing split among the M features is selected according to the impurity measure (Gini impurity). The trees are grown to a maximum depth without pruning[64].

K-nearest neighbor (KNN)
It is a classification algorithm that classifies data based on similarity measure or distance measure [18]. This algorithm can be used in both classification and regression problems [21], [65]. KNN classifies an instance by finding its nearest neighbors [66]. The KNN classifier applies the Euclidean distance or cosine similarity for differentiating the training tuple and test tuples. The same Euclidean distance between tuples Xi and Xt (t = 1,2,3 …n) can be explained as [67]: where yi, n and s be the tuple constant. It can be written as: This equation is prepared based on KNN algorithms, every neighboring point that is closest to the test tuple, which is encapsulated and on the nearby space to the test tuple [68].

Evaluation metrics
The three algorithms, Logistic Regression, Random Forest Classifier, and KNN were compared in terms of accuracy, recall, specificity, precision, and F1scores as demonstrated below. The level of efficiency of the classification model is measured with the number of correct and incorrect classifications in each potential value of the variables being classified. From the outcomes gained. The following equations are used to measure the Accuracy, Sensitivity, and Specificity, Precision, and F1 score [69].
Recall or also known as sensitivity refers to the percentage of total relevant results correctly classified by your algorithm.
Specificity is defined as the proportion of actual negatives, which got predicted as the negative (or true negative).
The precision of all the records we predicted positive.
F1-score, which is simply the harmonic mean of precision and recall.
From table 3 above the logistic regression achieved the highest accuracy of 81%. According to the medical objective, the model was designed to be more sensitive in predicting true positive which was calculated by recall and F1 of 81 and 79 percent respectively. As this model is used in healthcare our interest in predicting the positive class, which is more important. It is acceptable for the model to fall into a false positive error (type 1 error). But it is very costly for the model to commit the (false negative error) type 2 error, it means that the patient might be in complicated health status and needs special and immediate medical care and intervention, but the model tells us that the patient will not be in a complicated situation. And that could result in more health complications and even the patient's life.

Results
New attributes were found to be significant in predicting diabetes complications such as infection years, swelling, diabetic ketoacidosis, diabetic septic foot, which were found to be vital in predicting diabetes complications.    But, blood pressure, body temperature, cholesterol, protein level, oxygen level, and gender were not significant in predicting diabetes complications because they were excluded by the feature selection algorithm. We compared the following three algorithms: logistic regression, random forest classifier, and KNN to find the best algorithm for predicting diabetes complications. Accuracy, recall, specificity, precision, and F1; used as performance metrics. These metrics were used for selecting the best model. The logistic regression algorithm achieved the highest recall score of 81%, followed by KNN and random forest of 62% and 57% respectively, as shown in Table 3 above. This means that the model is more sensitive in predicting the positive class. Also, the model achieved the highest F1score of 75% as shown in Table 3 above. The model was designed to be more sensitive in predicting true positive class which means the diabetes status is complicated. It was calculated by the recall score of 81%. As this model is used in healthcare our interest in predicting the positive class, which is significant.

Conclusion
In this paper, we created a dataset from Alsukari Hospital for building our machine learning model. The best model was built using six out of 29 attributes. Three algorithms were compared in selecting the best model as follows: Logistic regression, random forest, and KNN. Furthermore, new attributes were investigated and included in the model. Finally, the best accuracy was obtained using logistic regression. The overall accuracy does not guarantee that the model will perform better and serve the specific domain interest. According to the medical objective, the recall score was used besides general accuracy. The higher recall score indicates that the model is more sensitive in predicting positive cases or medically patients with diabetes complications.

Study limitations
The accuracy is not very high as we are working in the medical field higher accuracy is needed. Second, the size of the dataset is small; machine learning models need more data for producing stable and well-trained models.

Future work
There are several future research directions. Firstly, for predicting diabetes complications more features could be included which were not included in this study. Secondly, more work should be directed toward identifying the risk factors associated with diabetes complications. Last we are interested in adopting this model to other chronic diseases.

Acknowledgement
The authors would like to thank the Ministry of Health (MOH) and Alsukari Hospital for giving us the ethical approval to use the data for this work. Also, we would like to take this opportunity to thank the reviewers for their constructive comments towards improving our manuscript.

Conflict of interest
The authors declare no conflict of interest.