An Illustration of Rheumatoid Arthritis Disease Using Decision Tree Algorithm



Introduction
Rheumatoid Arthritis (RA) is a rheumatic disease. The word 'rheumatoid' derives from 'rheumatism', which relates to a musculoskeletal illness; 'arthr' means 'joints', and 'itis' denotes 'inflammation'. It is an inflammatory disorder that mostly impairs the joints, as well as other organs like the skin and lungs. Well-defined and reliable estimation of RA symptoms prevents lasting destruction of the patient's joints and bones if the disease is treated early; otherwise it affects the patient's quality of life. A research gap has been found in the application of data mining to Rheumatoid Arthritis [1,2].
A dataset is an indispensable component in the discussion of a classification algorithm. The dataset features or attributes are qualitative (nominal) or quantitative (numeric). Many researchers have applied various datasets [3][4][5][6] to different classification algorithms and have reported different results based on them. The dataset is used as a training set, from which the decision tree is built.
'Playing tennis' is the dataset most often used in decision tree illustrations [7][8][9][10]. The next most frequently used dataset in decision tree examples is student performance [11]. Similarly, datasets such as 'a dog represents a risk for citizens' [7,12], 'reservoir inflow forecasting' [13], 'PEP (Portfolio Evaluation Plan)' [14], 'rainfall forecasting' [15], and 'college scholarship evaluation' [16] are illustrations of classification algorithms that are rarely handled by research authors. Only a few authors have examined and published medical datasets for the decision tree illustration.
The medical dataset created for this study is named the 'RA dataset.' The RA dataset was obtained from the 2010 ACR/EULAR (American College of Rheumatology / European League Against Rheumatism) classification criteria for RA, a new approach formed jointly by two active groups of the ACR and the EULAR [17]. It contains qualitative attributes in a binary category (yes / no). This dataset aims to diagnose whether or not a patient has Rheumatoid Arthritis.
Most RA patients experience severe pain in the joints of the hands, legs, hip, spine, and shoulder. It would be beneficial for medical practitioners to predict the prominent features responsible for RA disease. The feasible attributes to identify RA patients are displayed in Figure 3. Among these feasible attributes, the optimal attributes for the RA patient are predicted in this study using the RA dataset.
Information gain is computed to find the dominant attributes in the dataset and to build the decision tree with the Iterative Dichotomiser 3 (ID3) algorithm. C4.5 is another algorithm, which constructs the decision tree by calculating the gain ratio. Decision tree algorithms such as ID3 and C4.5 (a modified version of ID3) are popular classifiers that can be used efficiently for RA prediction from an RA dataset. Only a few authors have illustrated the decision tree with medical datasets [18,19]. Although many authors have described and compared decision tree algorithms, some have limited their papers by omitting the resulting decision tree.

Related study
Data mining is a broadly applied method for building models from massive databases in order to learn, analyze, and obtain information [20][21][22][23]. The decision tree algorithm falls under supervised learning. It is one of the most familiar data mining techniques and is frequently used to build classification models. Decision trees are used to solve both regression and classification problems. Every classification model works with a classifier, a supervised learner that automatically performs the learning process on the training dataset in order to predict its target attribute. Data mining techniques are widely used for classification and prediction in the healthcare domain, where they can help doctors identify complex diseases precisely and design a more reliable Decision Support System (DSS) [24].
The Electronic Health Records (EHRs) of RA patients have been studied for early prediction and diagnosis of RA. Moreover, comparative studies of several machine learning algorithms identify which algorithm suits the prediction of RA best [25,26]. Rheumatoid factor (RF), anti-cyclic citrullinated peptide (Anti-CCP), swollen joint count (SJC), and erythrocyte sedimentation rate (ESR) are four essential judging factors for rheumatoid arthritis [27]. Once a patient is diagnosed with RA, the probability of heart failure is higher than for a non-RA patient [28][29][30][31]. Medical and biological research are ever-growing fields in which large volumes of biological data are collected, classified, estimated, predicted, associated, clustered, and finally visualized through reports and patterns using data mining techniques [32,33].
The application of data mining is under continuous development. The ID3 algorithm has difficulty handling multi-valued attributes and has high computational complexity. A novel approach has been introduced to split attributes in the ID3 algorithm [34]. In the field of bioinformatics, data mining faces challenges related to sequencing technologies and data analysis skills. A literature review of data mining methods, together with the analysis tools suitable for research tasks, set out the merits and demerits of data mining in bioinformatics [35].
In a simulation analysis, the classification accuracy of the ID3 decision tree was 6-7 percent higher than that of other classifiers. The author proposed an optimized ID3 algorithm that constructs a tree with a minimum number of nodes so as to improve efficiency and reduce the error rate [36]. An analysis of different clinical and laboratory data using the Gaussian mixture model displayed results with various distributions. The patient global assessment (PGA) and health assessment questionnaire (HAQ) collected three months after RA diagnosis, together with SJC and tender joint count (TJC), are considered to be the functional attributes for RA diagnosis [37]. Women are affected by arthritis at a higher rate than men [38]. When RA prediction and diagnosis are developed with a machine learning approach, it is essential to identify the key features for RA prediction [39,40].
An earlier study used the decision tree technique to investigate how rheumatologists select a second-line DMARD (Disease-Modifying Antirheumatic Drug), a choice that depends on disease severity, to treat RA patients after the failure of Methotrexate [41]. Until a few years ago, the immune suppression effects of DMARDs were systemic and led to various side effects. Medical experts improved the autoimmune response produced by RA by customizing a good care plan and predicting the prognosis of the disease [42]. A recent study supported clinical RA treatment with a decision support system whose predictive model helps medical staff make suitable decisions in the early stage of RA [43].
Specific proteomic biomarkers have been identified for RA diagnosis using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF-MS) combined with weak cationic exchange (WCX) magnetic beads. The classification tree model has been considered an innovative diagnostic tool for RA [44]. The combination of proteomic fingerprint technology and magnetic beads obtained efficient biomarkers and discovered diagnostic patterns for RA. The biomarker C-C motif chemokine 24 (CCL24) has been considered a significant diagnostic indicator for RA [45].
The author states that anti-citrullinated protein antibodies (ACPAs) are specific for RA, and that RF is observed in healthy and elderly people and in those with other autoimmune diseases, which indicates an immune response in RA development. The shared epitope alleles reside in the major histocompatibility complex (MHC) class II region and are involved in the genetic risk for RA development. ACPAs are a spectrum of autoantibodies that target post-translational modifications (PTMs) [46].
The authors declare that in the future machine learning (ML) will support rheumatologists in analyzing and predicting the development of the disease and in discovering significant disease agents. Furthermore, the authors affirm that ML will propose treatments and evaluate their predicted outcomes. In the future, shared decision-making will combine the patient's viewpoint, the rheumatologist's suggestion, and machine-learned evidence [47].
The general methodologies applied to examine the intensity of RA are clinical, laboratory, and physical examinations. The authors proposed a hybrid optimization strategy called rheumatoid arthritis disease using weighted decision tree approach (REACT), which combines the features of ID3 and particle swarm optimization (PSO) for feature selection and classification of RA to improve the efficiency and reliability of RA diagnosis [48].
It is necessary to develop therapies for each stage of RA based on the pathological mechanisms that drive the deterioration of the disease in individuals. Several modern pharmacologic therapies play a vital role in disease relief without joint deformity. The RA pathogenesis, disease-modifying drugs, and views on next-generation therapeutics for RA have been discussed in this review [49]. Although joint involvement, serology, levels of acute-phase reactants, and the duration of symptoms are the primary classification criteria for RA, the diagnosis requires well-trained specialists who can distinguish early symptoms of RA from other pathology [50].
The paper [51] developed a model for flare prediction in RA patients in sustained remission with reduced intake of biological disease-modifying anti-rheumatic drugs (bDMARDs). The model used nested cross-validation and hyperparameter optimization for model selection among machine learning algorithms such as Logistic Regression, k-Nearest Neighbors, Naïve Bayes, and Random Forest. Dose reduction was selected as the predominant flare predictor attribute.
A new method [52] aims to support treatment selection in RA patients using the GUIDE (Generalized, Unbiased, Interaction Detection and Estimation) decision tree, which matches predefined rules to predict treatment response to sarilumab and adalimumab. The result identified the presence of Anti-CCP together with C-reactive protein (CRP) above a threshold of 12.3 mg/l as a biomarker pattern that predicts response to sarilumab.
The paper [54] presents a review summarizing treatments for RA. The objective was to highlight polypeptides, small intermediate or end products of metabolism, and epigenetic regulators as new targets for treating RA. Prominent molecular targets for drug design were identified, which could alleviate early RA and address the nonresponses, partial responses, and severe adverse effects associated with modern DMARDs.
An algorithm pipeline development and validation study was conducted in [55] using EHRs to identify patients with RA. Records of patients' first visits were taken as input from the EHRs, and Natural Language Processing (NLP) was applied to randomly selected EHRs. Moreover, six machine learning methods were trained and evaluated with 10-fold cross-validation to identify patients with rheumatoid arthritis from format-free text fields of EHRs.
In [56], the dataset was taken from The Korean College of Rheumatology Biology (KOBIO) Registry and covered nearly 1204 RA patients treated with biologic disease-modifying anti-rheumatic drugs (bDMARDs). To predict remission, machine learning techniques including Lasso, Ridge, SVM, Random Forest, and XGBoost were used, and explainable artificial intelligence (XAI) was used to identify the essential clinical features correlated with remission. The accuracy and the area under the receiver operating characteristic curve (AUROC) were analysed for prediction.
Treatment guidelines for RA patients are given in [57]; many references and research works associated with vaccination were collected from the literature reviews conducted for the ACR guidelines on RA. These studies offer recommendations to assist clinician and patient decision-making and to relieve anxiety about RA. In this study, we analyze the RA dataset using the decision tree model and predict the efficient features that diagnose the disease.

About decision tree
A decision tree is a tree-structured classifier with decision (internal) nodes, branches, and leaf nodes. Each internal node denotes a test of an attribute. Each branch corresponds to an attribute value. Each leaf node predicts the target classification. To classify an example using the decision tree, begin at the root node, follow the decision branches corresponding to the example's attribute values, and finally reach a leaf node that predicts the target class. Each path from the root to a leaf corresponds to a conjunction of attribute tests, and the tree as a whole represents the disjunction of these conjunctions [58]. The dominant attribute is the best attribute classifier in the training set. An internal node represents the dominant attribute used to build the decision tree. The dominant attribute is the attribute with the highest information gain (ID3) or gain ratio (C4.5), which are discussed in sections 4.2 and 4.3.
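The traversal and the "disjunction of conjunctions" reading described above can be sketched in a few lines of Python. This is only an illustrative sketch: the nested-dictionary representation, the function names `classify` and `rules`, and the toy tree itself are assumptions made for this example. The toy tree borrows attribute names from the RA dataset but is not the tree of Figure 5.

```python
# Hypothetical toy tree, NOT the tree from Figure 5: the attribute names are
# borrowed from the RA dataset, but the structure and leaves are invented.
# An internal node is {"attribute": name, "branches": {value: subtree}};
# a leaf is simply a class-label string.
toy_tree = {
    "attribute": ">10 joints",
    "branches": {
        "yes": "RA",
        "no": {
            "attribute": "serology ++",
            "branches": {"yes": "RA", "no": "NO RA"},
        },
    },
}


def classify(tree, example):
    """Walk from the root to a leaf, following the example's attribute values."""
    node = tree
    while isinstance(node, dict):
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node


def rules(tree, conditions=()):
    """Yield (conjunction of attribute tests, class) for each root-to-leaf path."""
    if not isinstance(tree, dict):
        yield list(conditions), tree
        return
    for value, subtree in tree["branches"].items():
        test = (tree["attribute"], value)
        yield from rules(subtree, conditions + (test,))


patient = {">10 joints": "no", "serology ++": "yes"}
print(classify(toy_tree, patient))

# Each printed rule is one conjunction; the set of rules is their disjunction.
for tests, label in rules(toy_tree):
    body = " AND ".join(f"{a} = {v}" for a, v in tests)
    print(f"IF {body} THEN {label}")
```

Each rule printed by `rules` is one path of the tree, so the same helper also gives a simple way to read classification rules off a finished tree.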

ID3
A set of training examples is processed to learn and construct the decision tree. With the learned classifier, the decision tree then classifies new examples. The algorithm follows a basic top-down greedy approach. The fundamental algorithm for building the decision tree is the ID3 algorithm, developed by Quinlan in 1973 based on the Concept Learning System (CLS) algorithm. ID3 finds the dominant attribute that classifies the training examples by applying a greedy search and never backtracks [58], [59] (p.55).
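The top-down greedy construction can be outlined as a short recursive function. The following is a minimal sketch of the textbook ID3 recursion for categorical attributes, not the implementation used in this study; the function names, the dictionary-based tree representation, and the six toy rows (whose labels are invented) are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Expected reduction in entropy from partitioning rows on attribute."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder


def id3(rows, attributes, target):
    """Greedy, top-down tree construction without backtracking."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:          # pure node: make a leaf
        return labels[0]
    if not attributes:                 # nothing left to test: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        branches[value] = id3(subset, remaining, target)
    return {"attribute": best, "branches": branches}


# Invented toy rows (not taken from Table 1), used only to exercise the code.
toy_rows = [
    {">10 joints": "yes", "serology ++": "no", "class": "RA"},
    {">10 joints": "yes", "serology ++": "yes", "class": "RA"},
    {">10 joints": "no", "serology ++": "yes", "class": "RA"},
    {">10 joints": "no", "serology ++": "no", "class": "NO RA"},
    {">10 joints": "no", "serology ++": "no", "class": "NO RA"},
    {">10 joints": "no", "serology ++": "yes", "class": "RA"},
]
toy_tree = id3(toy_rows, [">10 joints", "serology ++"], "class")
print(toy_tree)
```

Once an attribute has been chosen for a node it is never reconsidered, which is the "never backtracks" property noted above.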

C4.5
ID3 cannot handle practical issues such as attributes with missing values in the training dataset and attributes with continuous values. Further problems are that a small data sample leads to overfitting, that testing one feature at a time to select the attribute for a decision node is time-consuming, and that the algorithm is sensitive to attributes with a large number of values. The C4.5 algorithm, introduced by Ross Quinlan, overcomes these practical issues when creating the decision tree. C4.5 is a continuation of Quinlan's earlier ID3 algorithm [59] (p.55).

Metrics of ID3 and C4.5
Decision tree metrics are a set of measurements, derived quantitatively from the dataset, that support the construction of a decision tree.

Information Gain
Let S be the sample of training examples, with A1, A2, ..., An as the non-target attributes. The information gain of every feature in the dataset is calculated using the formula shown in Equation 2. The attribute with the highest information gain is the best classifier, because information gain is the expected reduction in entropy obtained by partitioning the records of the dataset on that attribute. Information gain thus measures how effectively an attribute classifies the training examples according to their target classification [59] (p.57-58). WA(A) is the weighted sum of the information content of each subset of the examples partitioned by the possible values of the attribute. It measures the total disorder or inhomogeneity of the leaf nodes. The attribute A with the minimum WA(A), or equivalently the maximum Gain(S, A), is the best attribute at a node [58]. Information gain is precisely the measure ID3 uses to select the best attribute at each step in growing the tree. The calculation of information gain is briefly described in Section 7.
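The relationship between entropy, WA(A), and information gain can be checked numerically. The sketch below works directly on class counts; the function names and the two candidate splits are made up for illustration (ten examples, five positive and five negative, matching the sample size used in Figure 1) and do not come from the RA dataset.

```python
import math


def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def weighted_entropy(partitions):
    """WA(A): size-weighted entropy of the subsets produced by attribute A."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)


def information_gain(parent_counts, partitions):
    """Gain(S, A) = Entropy(S) - WA(A)."""
    return entropy(parent_counts) - weighted_entropy(partitions)


# Made-up example: S has 5 positive and 5 negative examples.
parent = [5, 5]
split_a = [[4, 1], [1, 4]]   # hypothetical attribute A: fairly pure subsets
split_b = [[3, 2], [2, 3]]   # hypothetical attribute B: nearly mixed subsets

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(f"WA(A) = {weighted_entropy(split_a):.3f}, Gain(S, A) = {gain_a:.3f}")
print(f"WA(B) = {weighted_entropy(split_b):.3f}, Gain(S, B) = {gain_b:.3f}")
```

Under these assumed counts, attribute A has the smaller WA and therefore the larger gain, so ID3 would select it at this node.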

Gain ratio
The gain ratio is the ratio between the information gain and the split information. Split information is the entropy of S with respect to all possible values of attribute A, rather than with respect to the target attribute [59] (p.73-74). The gain ratio therefore scales the expected decrease in entropy by the split information. Quinlan introduced it to overcome the bias towards multi-valued features by taking the number of branches into account when choosing an attribute [60][61][62].
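A small numerical sketch may help show why the gain ratio penalizes multi-valued attributes. The function names and both example attributes are assumptions invented for this illustration: eight examples (four per class), split once by a binary attribute and once by an identifier-like attribute that takes a different value for every example.

```python
import math


def entropy(counts):
    """Entropy in bits of a distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def split_information(partitions):
    """Entropy of S with respect to the values of the attribute (subset sizes)."""
    return entropy([sum(p) for p in partitions])


def information_gain(parent_counts, partitions):
    total = sum(parent_counts)
    remainder = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - remainder


def gain_ratio(parent_counts, partitions):
    """Information gain divided by split information.

    Note: split information is zero when the attribute has a single value,
    so a real implementation would need to guard against division by zero.
    """
    return information_gain(parent_counts, partitions) / split_information(partitions)


parent = [4, 4]
binary_attr = [[4, 0], [0, 4]]                       # two pure branches
id_like_attr = [[1, 0]] * 4 + [[0, 1]] * 4           # eight singleton branches

for name, parts in (("binary", binary_attr), ("id-like", id_like_attr)):
    print(name,
          f"gain={information_gain(parent, parts):.3f}",
          f"split_info={split_information(parts):.3f}",
          f"gain_ratio={gain_ratio(parent, parts):.3f}")
```

Both attributes reach the same information gain, but the identifier-like attribute has a larger split information and hence a smaller gain ratio, which is the bias correction described above.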

Work flow model for proposed illustration
The proposed illustration workflow model starts from the tree algorithm for RA [17], which is converted to a relational database as shown in Table 1. The resulting RA dataset is processed with the ID3 and C4.5 decision tree classifiers to obtain a decision tree and classification rules. The RA dataset contains all feasible features necessary to identify RA patients, whereas the final decision tree retains only the optimal features needed to predict RA patients.

About dataset
As mentioned in the workflow model (Figure 2), the RA tree structure (Figure 3) is converted to the RA dataset (Table 1) by following each path from the root node to a leaf node. The root node and the intermediate nodes are drawn as rectangles, whereas the leaf nodes are drawn as circles (Figure 3). Each path corresponds to one row of the RA dataset. There are 60 paths in Figure 3, so the RA dataset consists of 60 rows. The root node in Figure 3 is '>10 joints (at least one small joint)', and the leaf nodes in Figure 3 are 'RA' and 'crossed RA' (not RA).
The attributes used to diagnose RA are a mix of phenotype and genotype features. The four phenotype features are '>10 joints (at least one small joint)', '4-10 small joints', '1-3 small joints', and '2-10 large (no small) joints'. The three genotype features are 'Serology +' (low positive RF or low positive ACPA), 'Serology ++' (high positive RF or high positive ACPA), and 'APR (Acute phase reactants) Abnormal' (abnormal C-reactive protein (CRP) or abnormal ESR). The last attribute is 'Duration of symptoms >=6 weeks'. In Table 1, each feature name is followed by a score value used to classify RA patients. A record whose cumulative attribute score is less than 6 out of 10 is not classifiable as RA. The status of such records is yet to be evaluated, and the criteria might be fulfilled later [17].

Illustration of RA dataset with ID3 and C4.5 classifiers
The RA (Rheumatoid Arthritis) dataset contains qualitative, binary, asymmetric attributes. Binary data has two states, such as 'yes or no', 'affected or unaffected', or 'true or false'. Asymmetric means that the two binary values are not equally important. Both the predictor (non-target attribute) and response (target attribute) variables in the RA dataset are binary and categorical. The two values of the response variable, 'ra' and 'no ra', denote a diagnosis of rheumatoid arthritis and not rheumatoid arthritis, respectively.

Step-by-step illustration of ID3/C4.5 algorithm using RA dataset
Step 1: Find the entropy of the current RA dataset, S. The RA dataset has two classes, 'ra' and 'no ra', with counts of 26 and 34; the dataset contains 60 instances in total. The 'ra' target value indicates a patient diagnosed with rheumatoid arthritis. By Equation 1, $\mathrm{Entropy}(S) = -\frac{26}{60}\log_2\frac{26}{60} - \frac{34}{60}\log_2\frac{34}{60} \approx 0.987$.
Step 2: Calculate the information gain of each attribute using Equation 2.
Step 3: Pick the feature with the highest information gain. The attribute '>10 joints' has the highest information gain, as shown in Table 2, so '>10 joints' is the best classifier and becomes the root node, as shown in Figure 4.
Now the decision tree node (root node) is the '>10 joints' attribute, which has the maximum information gain (shown as Info Gain in Table 2). Since the attributes of the RA dataset are categorical rather than continuous, the decision trees built by the ID3 and C4.5 algorithms are the same. The gain ratio measure is nevertheless needed to construct the decision tree with the C4.5 algorithm, so the split information is calculated for each attribute using Equation 4.

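$$\mathrm{SplitInformation}(S, \text{'>10 joints'}) = -\frac{16}{60}\log_2\frac{16}{60} - \frac{44}{60}\log_2\frac{44}{60} \approx 0.837$$
where the 'yes' and 'no' branches of '>10 joints' contain 16 and 44 of the 60 examples, respectively.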
Step 4: Each branch from the attribute '>10 joints' partitions the set S into the subsets corresponding to the attribute values 'yes' and 'no'. From the root node '>10 joints', the 'yes' branch yields a subset with 14 'RA' and 2 'NO RA' examples. Although the tree could be grown further from the 'yes' branch, we stop there and assign the target class RA, to avoid overfitting the decision tree. This approach stops growing the tree early, before it reaches the point where it classifies the training data perfectly [59] (p.68). Now recurse (from step 2 to step 3) on the remaining subset (the 'no' branch from the root node '>10 joints', which has 12 'RA' and 32 'NO RA' examples and is marked '?' in Figure 4) until the ID3 algorithm satisfies the stopping criteria [63], or stop earlier by following the first class of approaches to avoiding overfitting [59] (p.68).
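The class counts quoted in Steps 1 and 4 (26 'RA' and 34 'NO RA' overall; 14 and 2 on the 'yes' branch; 12 and 32 on the 'no' branch) are enough to recompute the root-node metrics for '>10 joints'. The following sketch does so; the variable and function names are chosen for this illustration, and the printed values are recomputed from those counts rather than copied from Table 2.

```python
import math


def entropy(counts):
    """Entropy in bits of a distribution given as a sequence of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Counts quoted in the text, as (RA, NO RA).
whole = (26, 34)        # the full RA dataset S
yes_branch = (14, 2)    # '>10 joints' = yes
no_branch = (12, 32)    # '>10 joints' = no

n = sum(whole)
h_s = entropy(whole)
remainder = sum(sum(b) / n * entropy(b) for b in (yes_branch, no_branch))
gain = h_s - remainder
split_info = entropy([sum(yes_branch), sum(no_branch)])
ratio = gain / split_info

# Early stopping on the 'yes' branch: label the leaf with the majority class.
majority = "RA" if yes_branch[0] >= yes_branch[1] else "NO RA"

print(f"Entropy(S)        = {h_s:.3f}")
print(f"Information gain  = {gain:.3f}")
print(f"Split information = {split_info:.3f}")
print(f"Gain ratio        = {ratio:.3f}")
print(f"'yes' branch leaf = {majority}")
```

The same few lines can be reused at each level of the recursion by substituting the class counts of the current subset and of the candidate attribute's branches.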
Step 5: The classification rule is generated from the decision tree.

Top-down generalization approach for the decision tree
Figure 5 illustrates the decision tree built from the RA dataset (Table 1) by applying the ID3 algorithm [58], [59].

Illustration analysis report
The RA dataset consists of all feasible features of an RA patient. The predicted optimal features for RA are obtained using the ID3 and C4.5 classifiers. In Figure 5, the first predictor variable, '>10 joints', is obtained at level 1; the second, '4-10 small joints', at level 2; the third and fourth, 'serology ++' and '1-3 small joints', at level 3; and the fifth, 'serology +', at level 4. Therefore, the five optimal features (predictor variables) '>10 joints', '4-10 small joints', 'serology ++', '1-3 small joints', and 'serology +' play a vital role in predicting RA patients. The accuracy is 90% (54/60) for both the ID3 and C4.5 decision trees. The performance is identical for ID3 and C4.5 because the RA dataset contains only categorical data. As at the first level (Table 2), at every remaining level of the decision tree the selected attribute has both the highest information gain and the highest gain ratio, as displayed in Figure 6.

Conclusion
The tree-structured data is converted to a relational database (the RA dataset) to identify all feasible features for RA. The RA dataset is then fed into the decision tree algorithms to obtain the optimal features for RA. We have thus used a medical dataset to illustrate the decision tree approach and derived a decision tree and classification rules as the output from the RA dataset. In summary, the ID3 and C4.5 decision tree algorithms construct the same decision tree, with a classifier accuracy of 90%, for the RA dataset derived from the tree flowchart for diagnosing definite Rheumatoid Arthritis given in the 2010 RA classification criteria. The ID3 and C4.5 classifiers perform equally on the RA dataset.

Figure 1: Entropy function relative to binary classification. S is the sample of training examples (size = 10). In S, the proportion of positive examples is denoted 'p' and the proportion of negative examples 'n'. Entropy(S) is zero if all the training examples are positive (10+, 0-) or all are negative (0+, 10-). If positive and negative examples are equal in number (5+, 5-), the impurity of S is maximal, i.e., the entropy is one, as shown in Figure 1. Entropy therefore measures the impurity of dataset S. Entropy(S) is the expected number of bits needed to encode the class (true or false, + or -, yes or no, low or medium or high) of a randomly drawn member of S. In information theory, an optimal-length code assigns $-\log_2 p$ bits to a message having probability p [58]. So the expected number of bits needed to encode the class (yes or no, true or false, + or -) of a random member of S is $-p \log_2 p - n \log_2 n$.
Entropy characterizes the impurity of a collection and measures the information content of the sample of training examples. If the target feature takes m unique values, the entropy of S with respect to this m-wise classification is
$$\mathrm{Entropy}(S) = -\sum_{i=1}^{m} p_i \log_2 p_i \qquad (1)$$
where $p_i$ is the proportion of examples in S that belong to class i. The information gain of an attribute A is
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \mathrm{WA}(A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \qquad (2)$$
where $\mathrm{Values}(A)$ is the set of possible values of A and $S_v$ is the subset of S for which attribute A has the value v.
In Figure 3, the root node and the intermediate nodes indicate the features/attributes, and the leaf nodes, 'RA' and 'crossed RA' (not RA), give the class label of the RA dataset. The aim of this dataset is to classify patients by diagnosis of Rheumatoid Arthritis or not Rheumatoid Arthritis. The source of the dataset is the tree flowchart for classifying definite Rheumatoid Arthritis (RA) given in the 2010 RA classification criteria. Two active groups of the ACR and the EULAR joined together to form a new approach for the 2010 ACR/EULAR classification criteria of RA [17]. Number of instances (rows) of the RA dataset: 60. Number of features (columns) of the RA dataset: 9. Number of classes (unique values of the target feature): 2. Number of missing values: 0.

Figure 2: Work flow Model for Proposed Illustration.
* The attribute with the highest information gain, as defined in Equation 2, is the best attribute.

Table 2: A sample of information gain and gain ratio for the RA dataset.