Machine Learning for Dengue Outbreak Prediction: A Performance Evaluation of Different Prominent Classifiers

The number of dengue patients is increasing rapidly, and according to World Health Organization (WHO) records dengue has now been recorded on every continent. According to WHO reports, the number of dengue cases reported each year rose from 0.4 to 1.3 million over the period 1996 to 2005, and then reached 2.2 to 3.2 million over the period 2010 to 2015. It is therefore essential to have a system that can quickly and reliably detect the prevalence of dengue outbreak in a large number of samples. In this work, the capability of seven prominent machine learning methods was assessed for the forecast of dengue outbreak. These methods are evaluated using eight performance parameters. The LogitBoost ensemble model achieved the highest classification accuracy, 92%, with sensitivity and specificity of 90% and 94% respectively.


Introduction
Dengue fever is the best-known arboviral disease, transmitted by female mosquitoes (Aedes aegypti) in tropical and subtropical regions throughout the world [7]. The Spanish word dengue is said to derive from dinga. Dengue fever is also known as break-bone fever, break heart fever, and dandy fever. It is caused by four related viruses known as DEN-1 to DEN-4; a fifth, DEN-5, was reported in 2013 [13,3]. Dengue fever (DF), Dengue Hemorrhagic Fever (DHF), and Dengue Shock Syndrome (DSS) are the broad stages of dengue infection, from mild to severe respectively [8,16].
According to WHO reports, the number of dengue cases reported each year rose from 0.4 to 1.3 million over the period 1996 to 2005, and then reached 2.2 to 3.2 million over the period 2010 to 2015. Dengue is among the most notable viral diseases in humans. Over 33% of the world's population is at risk, including numerous urban communities in India. Forecasting dengue outbreaks can therefore protect lives by alerting people to seek appropriate treatment and care. Forecasting outbreaks of transmissible diseases such as dengue is challenging, and several prediction techniques are still in their early stages [10]. An eco-bio-social framework for dengue vector breeding has been proposed in [2]. The researchers studied six different Asian regions and concluded that vector breeding and adult Aedes aegypti populations are determined by a complex interaction of factors. Souza et al. (2007) [19] studied the influence of dengue on liver function and found that liver damage is more frequent in women; liver function tests that quantify the level of damage are therefore important.
Machine learning is a state-of-the-art technology that enables machines to optimize a performance criterion using example data or past observations, without being explicitly programmed. A machine learning model extracts valuable information from a normalized dataset. In this work, the capability of several prominent machine learning methods was assessed for the forecast of dengue outbreak. Seven machine learning algorithms were used: LogitBoost, Logistic regression, Decision tree, Naive Bayes, Artificial neural network, Sequential minimal optimization, and k-nearest neighbor. The ROC curve is also used for performance measurement. Table 4 compares the accuracy, sensitivity, and specificity of the prominent classifiers with two ensemble models, namely Random forest [5] and LogitBoost. An additional review of related literature can be found in [10], which covers about thirty publications from 1995 to 2013.

Methods & material
Data mining is the analysis of large existing databases to extract knowledge, with the end goal of predicting unknown information about a novel example from observed examples. The data mining phases are as follows:
▪ Phase 1: Problem identification
▪ Phase 2: Formulation of the hypothesis
▪ Phase 3: Data collection
▪ Phase 4: Data preprocessing (scaling, encoding, feature selection, and outlier detection or removal)
▪ Phase 5: Model estimation
▪ Phase 6: Model interpretation and drawing of conclusions
In this experiment, we use a dengue disease dataset in CSV format for prediction with the WEKA data mining tool. The dataset consists of 75 samples: 36 without dengue disease (negative) and 39 with dengue disease (positive) [12,17,20]. The dataset was collected from the test reports of different discharged patients. Missing values were then handled during data preprocessing using the ReplaceMissingValues technique under the filter option of the WEKA tool. In this experiment, 8 distinct clinical attributes were taken into account for the prediction of dengue disease (

Machine learning algorithms

K-nearest neighbour (kNN)
The k-nearest neighbour classifier is an instance-based learning approach, a form of lazy learning. Instance-based learning is also known as memory-based learning. It compares a new problem instance with the instances seen during training, which are stored in memory. It works best on large datasets with few features, provides a local approximation, and requires little training time. The k-NN method can be applied to both classification and regression. In both cases, the input consists of the k nearest training instances in feature space. The output depends on whether k-NN is used for classification or regression [10].
In k-NN classification, the output is a class membership: an object is classified by a majority vote of its neighbours. In k-NN regression, by contrast, the output is a property value for the object, computed as the mean of the values of its k nearest neighbours.
For continuous-valued target functions, the k-NN model computes the average value of the k nearest neighbours. k-NN is robust to noisy data because it averages over the k nearest neighbours. The distance between neighbours can, however, be dominated by irrelevant features, which leads to the curse of dimensionality. To overcome this, the axes can be stretched (weighted) or the least relevant features eliminated.
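The majority vote and the neighbour average described above can be illustrated with a minimal sketch. The Euclidean distance, the function names, and the toy samples are assumptions made for illustration; they are not the WEKA implementation or the dengue dataset used in this study.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to the query."""
    return sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]

def knn_classify(train, query, k=3):
    """Classification: majority vote among the k nearest neighbours."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Regression: mean target value of the k nearest neighbours."""
    neighbours = k_nearest(train, query, k)
    return sum(value for _, value in neighbours) / len(neighbours)

# Hypothetical two-feature samples (not taken from the study's data).
labelled = [((1.0, 1.0), "negative"), ((1.2, 0.8), "negative"),
            ((4.0, 4.2), "positive"), ((4.1, 3.9), "positive"),
            ((3.8, 4.0), "positive")]
predicted_class = knn_classify(labelled, (3.9, 4.1), k=3)

# Hypothetical one-feature samples with numeric targets.
valued = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
predicted_value = knn_regress(valued, (2.1,), k=3)

print(predicted_class)  # positive
print(predicted_value)  # 20.0
```

Because every prediction scans the stored training instances, the cost is paid at query time rather than at training time, which is what "lazy learning" refers to.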

Support vector machine (SVM)
The Support Vector Machine, also known as the Support Vector Network, was introduced by Vladimir Vapnik and is used for both classification and prediction. SVM is a machine learning method for binary classification problems, although multi-class implementations exist. Input vectors are mapped to a high-dimensional feature space, in which a linear decision surface is constructed; special properties of this surface ensure the high generalization ability of the learning machine [6].
SVM is based on statistical learning theory. There are infinitely many hyperplanes that separate the two classes, and SVM searches for the best one, that is, the one that minimizes the classification error on unseen data. SVM looks for the hyperplane with the largest margin, i.e. the maximum marginal hyperplane (MMH).
The idea behind SVM, which has been widely applied in biology, was first implemented for the restricted case in which the training data can be separated without error, and was later extended to non-separable training data. SVM is a deterministic approach with good generalization properties. It has a strong mathematical foundation and uses kernels for complex learning.
Sequential minimal optimization (SMO) is a method for solving the quadratic programming problem that arises when training a support vector machine [12,18].
A separating hyperplane can be written as:
$$H:\; W \cdot X + b = 0$$
where H is the hyperplane, W the weight vector, X the input vector, and b the bias.
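As a small illustration of how a separating hyperplane is used once W and b are known, the sketch below classifies a point by the sign of W · X + b. The weight vector, bias, and test points are hypothetical values chosen for the example, and the margin formula 2/||W|| assumes the usual scaling in which the closest training points satisfy |W · X + b| = 1.

```python
import math

def decision_value(w, x, b):
    """Signed score W . X + b; its sign tells the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, x, b):
    """Return +1 for points on or above the hyperplane, -1 otherwise."""
    return 1 if decision_value(w, x, b) >= 0 else -1

def margin_width(w):
    """Margin 2 / ||W||, valid under the canonical scaling assumption."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane x1 + x2 - 3 = 0 (not learned from the study's data).
w, b = (1.0, 1.0), -3.0

print(predict(w, (3.0, 2.0), b))   # 3 + 2 - 3 = 2, so +1
print(predict(w, (0.5, 1.0), b))   # 0.5 + 1 - 3 = -1.5, so -1
print(round(margin_width(w), 4))
```

Training an SVM amounts to choosing W and b so that this margin is as large as possible; the sketch only shows the prediction step.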

Artificial neural network (ANN)
An artificial neural network is a powerful processing machine, implemented either as an algorithm or as a hardware device, that can acquire experiential knowledge, represented through the strengths of the connections between its units, and make that knowledge available for use.
The weighted sum of the products $x_j w_{kj}$ (for $j = 0$ to $m$) is usually denoted $net_k$:
$$net_k = \sum_{j=0}^{m} x_j w_{kj}$$
Finally, an artificial neuron computes the output $y_k$ as a certain function of the $net_k$ value:
$$y_k = f(net_k)$$
where $x$ and $y$ are the input and output signals respectively, $w_{kj}$ is the synaptic weight of synapse $j$ of neuron $k$, and $f$ is the activation function [10].
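The computation performed by a single artificial neuron can be sketched in a few lines. The sigmoid activation, the input values, and the weights below are assumptions chosen for illustration; the source does not state which activation function was used in the experiments.

```python
import math

def sigmoid(z):
    """Logistic activation, one common choice for f."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w_k, activation=sigmoid):
    """Compute y_k = f(net_k), where net_k is the weighted sum of inputs."""
    net_k = sum(x_j * w_kj for x_j, w_kj in zip(x, w_k))
    return activation(net_k)

# Hypothetical inputs; x[0] = 1.0 plays the role of the bias input.
x = [1.0, 0.5, -1.0]
w_k = [0.2, 0.4, 0.4]

# net_k = 0.2 + 0.2 - 0.4 = 0, and the sigmoid of 0 is 0.5.
y_k = neuron_output(x, w_k)
print(y_k)
```

A full network chains many such neurons in layers, and training adjusts the weights so that the outputs match the known class labels.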

Naive Bayes classifier
Bayesian learning refers to a family of methods rooted in probability and statistics. Bayes' theorem describes the probability of an event on the basis of conditions that may be related to the event. The Naive Bayes classifier has performance comparable with that of selected neural network and classification tree classifiers.
Each training sample can incrementally increase or decrease the probability that a hypothesis is correct, which means that prior knowledge can be combined with observed data. Even where Bayesian methods are computationally intractable, they provide a standard of optimal decision making. Naive Bayes classifiers are applied to determine the most appropriate class for a sample from the combined evidence of its individual attributes [18].
Bayes' theorem is stated mathematically as:
$$P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)}$$
Here X and Y are events, and P(X) and P(Y) are the probabilities of X and Y without regard to each other. P(X|Y) is the conditional probability of observing X given that Y is true. P(Y|X) is the probability of observing Y given that X is true.
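A short worked example may make the theorem concrete. Here X stands for "the patient has the disease" and Y for "a symptom is observed"; all probabilities are hypothetical numbers invented for the illustration and are not estimates from the dengue dataset.

```python
def evidence(p_y_given_x, p_x, p_y_given_not_x):
    """P(Y) by the law of total probability over X and not-X."""
    return p_y_given_x * p_x + p_y_given_not_x * (1.0 - p_x)

def bayes_posterior(p_y_given_x, p_x, p_y):
    """P(X|Y) = P(Y|X) * P(X) / P(Y)."""
    if p_y <= 0.0:
        raise ValueError("P(Y) must be positive")
    return p_y_given_x * p_x / p_y

# Hypothetical values, for illustration only.
p_x = 0.10              # prior probability of the disease
p_y_given_x = 0.90      # symptom probability when the disease is present
p_y_given_not_x = 0.20  # symptom probability when the disease is absent

p_y = evidence(p_y_given_x, p_x, p_y_given_not_x)   # 0.09 + 0.18 = 0.27
posterior = bayes_posterior(p_y_given_x, p_x, p_y)  # 0.09 / 0.27, about 0.333
print(round(p_y, 3), round(posterior, 3))
```

Observing the symptom raises the probability of disease from the prior 0.10 to roughly one third, which is the sense in which each observation increments or decrements the probability of a hypothesis.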

Decision tree
The decision tree is a hierarchical prediction approach that represents the observed attributes in its branches and the target value at its leaves. The predictions can be discrete values, giving a classification tree, or continuous values, giving a regression tree. Prominent algorithms developed for decision tree prediction models include ID3, C4.5, CART, CHAID, and MARS. The J48 decision tree [11] is a popular Java implementation of the C4.5 algorithm in the WEKA tool and is applied as one of the experiments in this research.
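The attribute selection measure based on information gain is defined as:
$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
The entropy, or expected information required to classify the objects in all sub-trees, is calculated as:
$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$
The information gained by branching on A is:
$$Gain(A) = I(p, n) - E(A)$$
where A and I denote the attribute and the information measure respectively; p and n are the numbers of elements of class P and class N respectively; and $p_i$ and $n_i$ are the corresponding counts in the i-th of the v subsets produced by splitting on A.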
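The information gain calculation can be sketched directly from these definitions. The class counts below are hypothetical (14 samples split three ways by an imaginary attribute) and the function names are illustrative; J48 itself uses the gain ratio refinement of C4.5, which this sketch does not implement.

```python
import math

def info(p, n):
    """I(p, n): information needed to classify a set with p and n members."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:  # treat 0 * log2(0) as 0
            fraction = count / total
            result -= fraction * math.log2(fraction)
    return result

def expected_info(partitions):
    """E(A): weighted information of the subsets produced by attribute A."""
    total = sum(p_i + n_i for p_i, n_i in partitions)
    return sum((p_i + n_i) / total * info(p_i, n_i) for p_i, n_i in partitions)

def gain(partitions):
    """Gain(A) = I(p, n) - E(A)."""
    p = sum(p_i for p_i, _ in partitions)
    n = sum(n_i for _, n_i in partitions)
    return info(p, n) - expected_info(partitions)

# Hypothetical split: each tuple is (p_i, n_i) for one attribute value.
partitions = [(2, 3), (4, 0), (3, 2)]

print(round(info(9, 5), 3))              # 0.94
print(round(expected_info(partitions), 3))  # 0.694
print(round(gain(partitions), 3))        # 0.247
```

At each node the tree builder evaluates this gain for every candidate attribute and branches on the one with the largest value.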

Logistic regression classifier
Logistic regression is a regression technique in which the dependent variable is categorical. It is a way of predicting a dichotomous outcome. Logistic regression can be binomial, ordinal, or multinomial. In the multinomial case, the outcome can take more than two possible values.
Univariate logistic regression was applied to the continuous covariates. Logistic regression yields the odds ratios of interest, but the fitted model is not easy to use as a diagnostic tool because a computer is required to compute the dengue fever prediction. Consequently, we refitted the two selected logistic regression models, substituting the continuous attributes with binary counterparts [4].
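To illustrate how a logistic regression model turns attributes into a probability, and how a continuous attribute can be replaced by a binary counterpart, a minimal sketch follows. The intercept, coefficient, cutoff, and attribute value are hypothetical; they are not the fitted values from this study or from [4].

```python
import math

def predict_probability(intercept, coefficients, x):
    """Logistic model: P(positive) = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(c * x_i for c, x_i in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))

def odds_ratio(coefficient):
    """Odds ratio associated with a one-unit increase in an attribute."""
    return math.exp(coefficient)

def binarize(value, cutoff):
    """Binary counterpart of a continuous attribute: 1 if at or above cutoff."""
    return 1 if value >= cutoff else 0

# Hypothetical model with one binarized attribute (illustration only).
intercept = -1.0
coefficients = [math.log(3.0)]      # corresponds to an odds ratio of 3
feature = binarize(39.2, cutoff=38.0)

probability = predict_probability(intercept, coefficients, [feature])
print(feature, round(odds_ratio(coefficients[0]), 2), round(probability, 3))
```

With binary attributes, each coefficient maps to a single odds ratio that can be read off a table, which is easier to apply at the bedside than a formula over continuous values.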

LogitBoost: an ensemble classifier
Various applications of data mining have demonstrated the validity of the No-Free-Lunch theorem [22]. According to this theorem, no single learning model can be the best and most appropriate across every application domain. Ensemble learning is a promising strategy that combines weak learners into a powerful model in order to improve prediction [15].
An ensemble model combines several models to improve on the accuracy of its individual members. It is a combination of k learned models (M1, M2, M3, ..., Mk) with the aim of creating an improved model M* [10], as shown in Figure 3.
In this research, the LogitBoost algorithm was applied as an ensemble classifier for the prediction of dengue outbreak. LogitBoost follows the boosting approach to ensembles. Boosting is a powerful learning approach that is applied to both classification and regression analysis. It first builds a weak classifier; the training instances are given starting weights, most often identical. During each iteration, the instances are assigned new weights so that the next classifier concentrates on the instances that the newly learned classifier did not classify correctly: the weights of instances misclassified by the weak learner are increased, and the weights of instances classified correctly are decreased. The final classification model is built from a weighted vote of the weak classifiers produced across the iterations. In this comparative analysis, we found that LogitBoost performs better than the other prominent classifiers. The LogitBoost ensemble model achieved the highest classification accuracy, 92%, with sensitivity and specificity of 90% and 94% respectively.
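The reweighting loop described above can be sketched as follows. Note that this is an AdaBoost-style illustration with one-dimensional decision stumps, chosen because it matches the weight-update description most directly; it is not WEKA's LogitBoost, which instead fits an additive logistic regression model. The toy data, the function names, and the number of rounds are assumptions.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak learner: predict `polarity` when x >= threshold, else its opposite."""
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for w, x, y in zip(weights, xs, ys)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def boost(xs, ys, rounds=3):
    """Train an ensemble of weighted stumps by iterative reweighting."""
    n = len(xs)
    weights = [1.0 / n] * n              # identical starting weights
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1.0 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1.0 - error) / error)  # vote of this stump
        ensemble.append((alpha, threshold, polarity))
        # Misclassified instances gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """Final model: sign of the weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can separate (labels are +1 / -1).
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = boost(xs, ys, rounds=3)
predictions = [ensemble_predict(ensemble, x) for x in xs]
print(predictions == ys)
```

On this toy set the best single stump misclassifies three of the ten points, while the weighted vote of three stumps reproduces every training label, which illustrates how weak learners combine into a stronger model.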

Classification performance metrics
In this research, seven supervised machine learning approaches were applied to classify dengue disease samples. The performance of the classification techniques was estimated by tenfold cross-validation. Eight quality parameters were taken into account in assessing the classification models. Samples without dengue outbreak were treated as the negative class, and samples with dengue outbreak as the positive class. The basic terminology of the confusion matrix is described here. Negative predictive value: the proportion of correctly predicted negative samples to the total predicted negative samples.

Rate of Misclassification:
The proportion of incorrectly classified samples to the total number of samples. It can also be defined as the proportion of gross error (Type I error and Type II error) to the total number of samples.
▪ RMC = 1 - CA
▪ Also known as "Error Rate"
$$RMC = \frac{FP + FN}{TP + TN + FP + FN}$$
❖ F1 Score: It is the harmonic mean (a weighted average) of the recall and precision.
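The eight performance parameters can all be derived from the four cells of a confusion matrix, as the sketch below shows. The counts used here (35 true positives, 4 false negatives, 34 true negatives, 2 false positives) are hypothetical: they were chosen only because they agree with the 39/36 class split and with the rounded percentages reported for LogitBoost, and they are not taken from the study's tables.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive the eight evaluation measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)          # recall, true positive rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "negative_predictive_value": tn / (tn + fn),
        "false_positive_rate": fp / (fp + tn),
        "misclassification_rate": (fp + fn) / total,   # equals 1 - accuracy
        "f1_score": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Hypothetical counts consistent with 39 positive and 36 negative samples.
metrics = classification_metrics(tp=35, tn=34, fp=2, fn=4)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting several measures together matters on a small dataset such as this one, since accuracy alone cannot show whether errors fall mainly on the positive or the negative class.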

Results and discussion
The performance of dengue outbreak prediction by the seven machine learning algorithms is evaluated on the eight performance parameters described in the methods and materials section.
A total of 75 samples were taken into account, with 36 negative cases and 39 positive cases of dengue outbreak. The dataset was divided into ten folds; during cross-validation each fold was used once for testing while the remaining folds were used for training. The confusion matrix of the prediction results is tabulated in Table 2 for LogitBoost, and those of the other classifiers (Logistic regression, Decision tree, Naive Bayes, Artificial neural network, Sequential minimal optimization, and k-nearest neighbor) are shown in Figure 4. Figure 4 depicts the predictions of these machine learning models. The results show that LogitBoost predicts the highest number of true positives (records predicted as positive that do have dengue outbreak) and also the highest number of true negatives (records predicted as negative that do not have dengue outbreak) (Table 2; Figure 4).
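The tenfold splitting of the 75 samples can be sketched as below. This is a plain shuffled split written for illustration; the function name and the random seed are assumptions, and WEKA's own cross-validation procedure may differ in detail (for example by stratifying the folds by class).

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 75 samples, as in the dataset described above.
splits = list(k_fold_indices(75, k=10))

for train, test in splits:
    print(len(train), len(test))
```

Every sample is used for testing exactly once, so the confusion matrix accumulated over the ten folds covers all 75 samples.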
The Decision tree confusion matrix shows the second-highest number of true positives, and Logistic regression predicts the second-highest number of true negatives (Figure 4).
The Logistic regression confusion matrix shows the third-highest number of true positives, and SMO the third-highest number of true negatives (Figure 4).
The Naive Bayes and ANN confusion matrices show that both have the fourth-highest numbers of true positives and true negatives (Figure 4).
The SMO confusion matrix shows the fifth-highest number of true positives, and Decision tree the fifth-highest number of true negatives (Figure 4).
The k-NN confusion matrix shows the worst performance, with the lowest numbers of true positives and true negatives (Figure 4).
Table 3 shows that LogitBoost outperformed all other machine learning methods with the highest classification accuracy, 92%, while the second-highest accuracy, 85%, was achieved by Logistic regression. In addition, LogitBoost achieved the highest sensitivity, 90%, and Decision tree the second highest, 87%. LogitBoost also attains the highest specificity (94%) and precision (95%), which indicates that the LogitBoost ensemble model is the most appropriate for predicting patients with dengue outbreak (positive class). LogitBoost has the highest negative predictive value, 89%, and also beats all other methods on the F1 score, with 92%. LogitBoost also achieves the lowest FP rate, 6%, and the lowest rate of misclassification, 8%.

ROC curve for performance evaluation
The Receiver Operating Characteristic (ROC) curve is a widely used graphical representation that summarizes the performance of a classification model over all possible thresholds. The ROC curve is generated by plotting the false positive rate (FPR) on the x-axis against the true positive rate (TPR) on the y-axis. ROC analysis is insensitive to the class distribution, which matters when the proportion of instances in the two classes changes during training. The area under the ROC curve should be close to 1 for the best classifier. Figure 5 shows that LogitBoost outperforms all other methods in predicting negative dengue outbreak cases, and Figure 6 shows that LogitBoost outperforms the other methods in predicting positive dengue outbreak cases.
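How an ROC curve and its area are obtained from classifier scores can be shown with a small sketch. The scores and labels below are hypothetical, and the trapezoidal rule is one common way of computing the area; the ROC areas reported in this study come from WEKA, not from this code.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold.

    labels use 1 for the positive class and 0 for the negative class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process tied scores together so they share one ROC point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical classifier scores for six samples (illustration only).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(round(auc, 3))  # 0.889
```

An area of 1 means every positive sample is scored above every negative one, while 0.5 corresponds to random ranking.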

Limitation and future work
In this experimental work, we used 8 clinical parameters with 75 samples (36 dengue negative and 39 dengue positive) and performed data mining classification tasks. We then applied seven prominent algorithms, among which LogitBoost (an ensemble model) performed better than the others. According to the No-Free-Lunch theorem [22], no single learning algorithm can be the best and most appropriate across every application domain. An ensemble model may increase computing cost and processing time, but new technologies such as cloud computing services and distributed computing continue to emerge and reduce the computing cost and processing time.
In the future, larger datasets with more relevant clinical parameters can be used for experiments and for improving model accuracy, as mentioned in the data classification section [10].

Figure: Classification outputs of machine learning algorithms

Conclusion
The number of dengue patients is increasing rapidly, and according to World Health Organization (WHO) records dengue has now been recorded on every continent. Dengue outbreak prediction may save lives and can be valuable for diagnosis. This work presents a workflow based on machine learning techniques for forecasting negative or positive cases of dengue outbreak.
The prime focus of the research is the prediction of dengue outbreak using the WEKA tool. In this research article, seven prominent machine learning techniques were applied and eight parameters were used for performance evaluation.
We conclude that the LogitBoost ensemble model is the best-performing classifier: it reached a classification accuracy of 92%, with sensitivity and specificity of 90% and 94% respectively and an ROC area of 0.967, and had the lowest error rate.
We have compared the accuracy of our analysis with other published results in Table 4. Our comparative analysis using the LogitBoost ensemble model, together with the Random forest classifier result reported by Fathima et al. (2015) [5], indicates that ensemble models perform better than individual classifiers (Table 4).
In future, we intend to improve the model accuracy using more relevant and sensitive clinical features on a much larger dataset, and we are also interested in developing a web-based tool that helps doctors make more accurate decisions about dengue outbreak.