Predicting the Causal Effect Relationship Between COPD and Cardio Vascular Diseases


Introduction
Deaths due to chronic obstructive pulmonary disease (COPD) and cardiovascular diseases are rising rapidly worldwide. Experts have identified a causal relationship between these two diseases, in which the presence of one can lead to the onset of the other. Patients with COPD typically complain of severe difficulty in breathing or of arrhythmia (irregular heartbeat). The presence of COPD makes most patients vulnerable to cardiovascular diseases and to death from them. Conversely, patients suffering from cardiovascular disease have also been observed to complain of COPD. This paper identifies the factors to be considered in patients with COPD for determining whether they are also suffering from cardiovascular disease.
COPD is characterized by obstruction of the air passages that persists over a significant amount of time. It typically refers to bronchial asthma, bronchitis, and emphysema. Bronchial asthma is an allergic reaction affecting the respiratory tract, caused by significant amounts of histamine in the blood. Bronchitis is inflammation of the bronchi caused by infection. Emphysema is the inflation of the bronchioles or alveolar sacs in the lungs. The common factors affecting the patients with COPD and Heart Disease have been shown in our below mentioned in

Symptoms of COPD
The common symptoms include coughing, wheezing, difficulty in breathing (mainly during exhalation), excess mucus discharge, fatigue, pressure in the chest, anxiety, and loss of muscle tissue and weight.

Complications of COPD
People with COPD are more prone to serious and potentially fatal conditions such as pneumonia, pneumothorax (lung collapse), osteoporosis, edema, enlargement of the liver, cor pulmonale (enlargement of the right side of the heart), diabetes, sleep apnea (repeated stopping and starting of breathing during sleep), stroke, high blood pressure, arrhythmia (irregular heartbeat), and heart failure. There is evidence that patients with COPD [24] have a higher risk of myocardial infarction (MI), and that the risk worsens when the condition is not properly managed in hospital and adequate treatment is not given. Studies have revealed that factors like smoking contribute to around 3.8% to 16% of patients suffering from COPD and congestive cardiac failure [7,25]. Cardiovascular disease and COPD share comorbidities and risk factors [17] such as diabetes, smoking, hypertension, atrial fibrillation, congestive heart failure, and several others.
There is a need to study the factors that correlate COPD with cardiovascular diseases [13,9], so that the underlying factors can be identified and patients can be treated in time to avoid premature deaths. In most patients COPD leads to cardiovascular death, and it remains a challenge to identify its occurrence and treat the patient in time.
In this paper, we have used real patient data from the Srirama Chandra Bhanja Medical College and Hospital, Cuttack, Odisha, India. The data was collected for 200 patients by the Regional Medical Research Center (RMRC), Bhubaneswar. RMRC is a medical research institute that collects patient data for research and the study of ailments, and it is the Government's authorized body for carrying out research on medical data made available by different organizations. The data set used in our study was contributed by RMRC for research purposes. RMRC collected the data through questionnaires and from the patients' test reports, with the consent of the patients. The raw data was pre-processed to drop the patients' identification labels, and missing values were imputed. The data was then split to train the classifiers. The experiment uses a heat map to identify the most important factors affecting the patients with COPD. Our paper identifies age, cor pulmonale, and smoking as the major contributing factors most strongly associated with COPD in patients. The other factors include the systolic and diastolic pressure of the patients. Figure 2 shows the pipeline of the work carried out in this paper.

Literature survey
Various supervised ML classifiers have long been used for the prediction of health conditions, and earlier work has studied the correlation between COPD and related risk factors. We have attempted to consolidate the most relevant past work in this field.
Rabe et al. [21] establish the strong connection between COPD and cardiovascular diseases. Whenever a patient suffers from either or both of the diseases, there are pathophysiological changes in the body, which include inflammation of the lungs and heart. The focus is on suggesting treatments for patients with COPD who have these underlying heart diseases. The critical factors identified as coexisting with COPD and CVD are smoking, age, diabetes, and a sedentary lifestyle. The paper suggests treating patients with cardiovascular ailments using beta-blockers that do not interfere much with COPD drugs.
Cazzola et al. [2] suggest effective ways to manage COPD and CVD, two interrelated deadly diseases that coexist in patients. The prime objective is to reduce the symptoms of COPD by first treating the swollen lungs and breathlessness of the patients. Drugs that reduce the urge to smoke, bronchodilators, and inhaled corticosteroids are used in the majority of cases to treat COPD. Angiotensin-converting enzyme inhibitors, angiotensin II type 1 receptor blockers, statins, antiplatelet drugs, and β-adrenoceptor blockers have proved beneficial in treating patients with CVD, which has effectively reduced COPD deaths and hospitalization in these patients.
Holm et al. [10] focus on a genetic deficiency, alpha-1 antitrypsin deficiency (AATD), and its impact on COPD with advancing age. The effect of age on the psychological and clinical condition of patients whose COPD arises from AATD was studied. 468 individuals with the genetic deficiency, aged 32-84 years, were considered for the study. The individuals with severe AATD were found to be at greater risk of suffering from COPD. AATD is the genetic cause for the onset of COPD, and it also aggravates the effects of smoking. The patients were studied for two years, and it was observed that younger individuals were prone to anxiety, depression, and health issues regardless of their relationship status.
Fukuchi [8] focuses on advancing age as a factor in patients affected by COPD. Three models were studied in which animals were prematurely aged. Their lungs did not show the pathological changes seen in naturally aged lungs and remained consistent in their function even after premature aging. The author argues that the abnormal functioning of the lungs caused by enlargement of the air spaces is not related to increasing age, and that the apparent relationship between age and COPD is misleading. Further investigation is suggested into whether accelerated aging of the lungs is a contributing factor in COPD, even though it is not the direct cause.
The increased rate of smoking has also greatly affected patients and has been identified as the major cause of COPD. Laniado-Laborín [12] establishes that smoking is the most important causal factor for patients suffering from COPD. Reducing smoking has been identified as a successful treatment for COPD patients. Both pharmacological and behavioral therapies have been suggested for stopping the progression of COPD by controlling patients' smoking habits. However, studies have also suggested that the proportion of patients who were still refraining from smoking after one year of follow-up was very low. Studies have also revealed that pharmacological therapies are more effective than placebo, and that 25-30% of people abstained from smoking after taking these therapies. Still, smoking cessation remains a major challenge worldwide, and patients continue to suffer from COPD due to their smoking habits. Khan et al. [11] describe the after-effects of smoking and how it creates abnormalities in lung function. Smoking is responsible for the thickening of the airways and the dilation of the air spaces, with abnormal distension of the alveoli. Almost all cigarette smokers are observed to have inflammation in their lungs. In summary, tobacco inhalation, active or passive, results in abnormal inflammation, tissue-damaging oxidants, a reduced level of antioxidants (for self-protection), and induced cell death.
Shujaat et al. [23] have studied the association of cor pulmonale with dysfunction of the lungs. The earlier view was that right ventricular dysfunction, with enlargement of the tissue, results from left ventricular dysfunction. This work adds that dysfunction of the right ventricle also has an underlying cause in the malfunction of the lungs and their size. They identify pulmonary hypertension as the underlying cause of right ventricular dysfunction resulting in heart failure. Cor pulmonale with pulmonary hypertension occurs for various reasons. To reduce cardiac arrests in patients suffering from COPD, several treatments have been suggested: inhalation of nitric oxide, diuretics to remove excess water from the lungs and heart, a pulmonary vasodilator such as sildenafil, reduction in hematocrit, and surgery to reduce lung size.
Quint [20] focuses on the fact that patients suffering from COPD have a higher risk of cardiovascular disease. The author identifies delayed recognition of the disease, late reperfusion treatment for STEMI (ST-elevation myocardial infarction), and use of angiography after in-STEMI as causes of the mortality gap for COPD patients suffering from myocardial infarction. Troponin has been identified as a direct indicator in patients suffering from COPD. The results revealed that the higher the level of troponin in cardiac patients, the longer they stay in hospital and the lower their chance of survival.
The link between COPD and various associated diseases has also been studied previously. Feary et al. [6] summarize the diseases associated with inflammation in the lungs, of which cardiovascular diseases and diabetes mellitus are the most common. The study quantifies the effect of cardiovascular diseases in patients suffering from COPD. The patient data was analyzed using multivariable logistic regression and Cox regression. After accounting for factors such as smoking and age strata, cardiovascular disease was found to be more prevalent in the younger age groups of COPD patients. In most cases, patients suffering from COPD were found to be already affected by myocardial infarction and diabetes.
Roever et al. [22] discuss the factors responsible for COPD and cardiovascular diseases. The authors identify factors such as a sedentary lifestyle, systemic inflammation, and improper function of skeletal muscle as responsible for cardiovascular diseases, which in turn become a major reason for patients suffering from COPD. Diabetes, hypertension, a metabolic syndrome that arises due to smoking, atrial fibrillation, vitamin D deficiency, and congestive cardiac failure are among the factors affecting patients with heart disease. These patients are found to be vulnerable to COPD and in most cases are affected by it, which is the ultimate factor causing death. COPD patients mostly die of strokes, and cardiovascular diseases have been identified as the major cause affecting these patients.
Esteban et al. [5] describe the factors responsible for the worsening of COPD. Using machine learning, they designed an early prediction system that warns about chronic obstructive pulmonary disease at three levels: red, yellow, and green. The model used random forests to predict the condition of COPD in patients. The attributes of the data set were obtained from the patients' daily activities and questionnaires. The model was trained and tested using 10-fold cross-validation to obtain a more reliable estimate of its performance. It achieved an area under the ROC curve (AUC) of 0.87 for forecasting whether a patient's COPD will worsen within the next three days.
Peng et al. [19] developed a model for predicting acute illness in patients affected by chronic obstructive pulmonary disease. The 28 most important features were selected from watches, sphygmomanometers, thermometers, and routine clinical tests. They identified 410 records from the hospital database for the study, and the train:test ratio was varied from 90:10 to 50:50. The best results were obtained with an 80:20 split. The model used C4.5 and C5.0 decision trees for classification of the disease. ID3, CART, and C5.0 classifiers were compared to find the best model. C5.0 with an 80:20 train:test split gave 80.3% accuracy.
Xie et al. [26] tried to relate physiological homeostasis to the onset of COPD exacerbation. A regression model is built to study the patterns of variables extracted from patients, and these patterns are evaluated longitudinally over all records to classify the level of risk to which the patient is subject. The data set was obtained from a hospital in Sydney using the TeleMedCare Health Monitor and included the attributes weight, diastolic and systolic blood pressure (DBP, SBP), heart rate (HR), SpO2, and temperature.

Naïve Bayes classification
Naive Bayes classifiers are a family of classification algorithms based on Bayes' Theorem. This classification technique assumes independence among predictors. In layman's terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Bayes' Theorem [14] is based on probability theory:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
where P(A|B) is how often A happens given that B happens, P(B|A) is how often B happens given that A happens, P(A) is how likely A is on its own, and P(B) is how likely B is on its own.
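As an illustration of how Bayes' Theorem combines with the independence assumption, the following minimal sketch computes class posteriors from per-feature likelihoods. The feature names and all probabilities are hypothetical values chosen for illustration; they are not taken from the study data, and this is not the implementation used in the experiments.

```python
from math import prod


def bayes_posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood_b_given_a * prior_a / evidence_b


def naive_bayes_posteriors(priors, likelihoods, observed):
    """Normalized class posteriors under the naive independence assumption.

    priors:      {class: P(class)}
    likelihoods: {class: {feature: P(feature present | class)}}
    observed:    features present in the sample
    """
    # Independence lets us multiply the per-feature likelihoods.
    scores = {
        c: priors[c] * prod(likelihoods[c][f] for f in observed)
        for c in priors
    }
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical probabilities, for illustration only.
priors = {"copd": 0.3, "no_copd": 0.7}
likelihoods = {
    "copd": {"smoker": 0.8, "age_over_60": 0.7},
    "no_copd": {"smoker": 0.3, "age_over_60": 0.4},
}
posteriors = naive_bayes_posteriors(priors, likelihoods, ["smoker", "age_over_60"])
print(posteriors)
```

With these made-up numbers the "copd" class receives about two thirds of the posterior mass, because both observed features are more likely under that class.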

Support vector machine (SVM)
SVM is a set of supervised learning methods used for classification and regression. An SVM model represents the data as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. This is done by identifying a hyperplane [27] that separates the classes with the maximum margin between them, chosen so that the points of each category fall, as far as possible, on their own side of the plane. SVM thus determines the best-fitting separating plane.
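The role of the hyperplane and the margin can be made concrete with a small sketch. The code below does not train an SVM; it assumes a hand-picked linear hyperplane w·x + b = 0 and a few hypothetical two-dimensional points, and only shows how a point is assigned to a side and how wide the gap between the two margin boundaries is.

```python
from math import sqrt


def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Class +1 on or above the hyperplane, class -1 below it."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Distance between the boundaries w.x + b = +1 and w.x + b = -1."""
    return 2.0 / sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane and labelled points, for illustration only.
w, b = [1.0, 1.0], -3.0
points = [([1.0, 1.0], -1), ([2.0, 2.0], 1), ([0.0, 2.0], -1), ([3.0, 1.0], 1)]

for x, label in points:
    print(x, "true:", label, "predicted:", predict(w, b, x))
print("margin width:", margin_width(w))
```

A trained SVM would search for the w and b that make this margin as wide as possible while keeping the classes on their own sides.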

Decision trees
A decision tree is a tree-structured tool whose nodes help take decisions depending upon the inputs of each node. It gives a pictorial representation of the consequences of a given condition, and it is used in both classification and regression. Here the nodes represent the data [18] and not the decisions. A threshold value has to be given, after which the algorithm terminates. A node may then be left with points that could not be cleanly separated into one class; the degree of this class mixing is measured by the Gini impurity.
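To make the notion of Gini impurity concrete, the sketch below computes it for a set of labels and searches for the single threshold on one feature that minimizes the weighted impurity of the two child nodes. The ages and labels are hypothetical and chosen only for illustration; a full tree learner would apply this search recursively.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2); 0 means the node contains a single class."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    n = len(values)
    best = (None, gini_impurity(labels))  # no split at all is the baseline
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical ages and COPD labels, for illustration only.
ages = [35, 42, 50, 58, 63, 70]
has_copd = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(ages, has_copd)
print("root impurity:", gini_impurity(has_copd))
print("best split: age <=", threshold, "weighted impurity:", impurity)
```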

KNN classification
The K Nearest Neighbor algorithm is a non-parametric method used for classification and regression [1]. This classification technique calculates the distance of a point with coordinates (x, y) from its neighbors. For example, when there are two sets of points in the space and a new point has to be placed, the question is which set it should be assigned to so that it is classified correctly. The Euclidean distance from the new point to its neighbors is calculated, and the point is finally assigned to the group to which its closest neighbors belong.
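The procedure described above can be sketched in a few lines: compute the Euclidean distance from the new point to every labelled point, keep the k closest, and take a majority vote. The points, the group names "A" and "B", and the choice k = 3 are assumptions made for this illustration.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Two hypothetical groups of points in the plane.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (2, 2)))
print(knn_predict(train_points, train_labels, (6, 5)))
```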

Logistic regression
Logistic regression is a statistical tool used for making decisions when the outcome of the testing condition is binary. The linear model uses the equation
$$y = b_0 + b_1 x$$
whereas logistic regression uses the equation
$$p = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$$
In logistic regression, the constant ($b_0$) moves the curve left and right and the slope ($b_1$) defines the steepness of the curve. By a simple transformation, the logistic regression equation can be written in terms of an odds ratio:
$$\frac{p}{1 - p} = e^{b_0 + b_1 x}$$
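The relationship between the logistic curve and the odds can be checked numerically. The sketch below assumes the usual single-predictor form p = 1 / (1 + exp(-(b0 + b1·x))); the coefficient values and the reading of x as an age are hypothetical and serve only as an illustration.

```python
from math import exp, log


def logistic(x, b0, b1):
    """Probability of the positive class for predictor value x."""
    return 1.0 / (1.0 + exp(-(b0 + b1 * x)))


def odds(p):
    """Odds p / (1 - p) corresponding to probability p."""
    return p / (1.0 - p)


# Hypothetical coefficients: the curve crosses p = 0.5 where b0 + b1 * x = 0.
b0, b1 = -6.0, 0.1

p_60 = logistic(60, b0, b1)
p_70 = logistic(70, b0, b1)
p_71 = logistic(71, b0, b1)

print("p at x=60:", p_60)
print("log-odds at x=70:", log(odds(p_70)))            # equals b0 + b1 * 70
print("odds ratio per unit of x:", odds(p_71) / odds(p_70))  # equals exp(b1)
```

The log of the odds is linear in x, which is why each unit increase in x multiplies the odds by the constant factor exp(b1).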

Random forest
The random forest classifier [15] creates a set of decision trees instead of a single one, each generated from a randomly selected subset of the training data. It then combines (averages) the results of the different decision trees to determine the class to which an object belongs. It is an ensemble method [4,3] for classification as well as regression that operates by constructing an assembly of decision trees at training time and choosing the class that their predictions support most strongly.
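The two ingredients named above, random subsets of the training data and aggregation over many trees, can be sketched with very small trees. The code below is a simplified stand-in, not a full random forest: each "tree" is a one-level decision stump fitted on a bootstrap sample and a randomly chosen feature, and the final class is a majority vote. The rows (age, pack-years of smoking) and labels are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Pick the threshold on one feature with the fewest training errors."""
    best = None
    for t in sorted({r[feature] for r in rows}):
        for above in (0, 1):  # class predicted when the value exceeds t
            errors = sum(
                (above if r[feature] > t else 1 - above) != y
                for r, y in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, t, above)
    _, t, above = best
    return feature, t, above


def stump_predict(stump, row):
    feature, t, above = stump
    return above if row[feature] > t else 1 - above


def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feature = rng.randrange(n_features)         # random feature choice
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest


def forest_predict(forest, row):
    """Majority vote over the individual stumps."""
    votes = Counter(stump_predict(s, row) for s in forest)
    return votes.most_common(1)[0][0]


# Hypothetical rows: (age, pack-years of smoking); label 1 = COPD.
rows = [(40, 0), (45, 2), (50, 5), (52, 3), (65, 30), (70, 25), (72, 40), (68, 35)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(rows, labels)
print(forest_predict(forest, (35, 0)), forest_predict(forest, (80, 50)))
```

Because each stump sees a different resample, individual stumps can disagree near the class boundary; the vote smooths out those differences.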

Experiment
The following methodology has been adopted for the collection, study, and experimentation of the data contributed by RMRC.
1. Carrying out of the survey of the 200 patients who volunteered to share their health cards and reports to carry out the research work.
2. Taking medical experts' opinions to identify which factors are related to the COPD occurrence.
3. Pre-processing of the raw data (imputing the missing values and dropping the features which are not directly relevant to the outcome).
4. Finding the correlation between all the risk factors and the presence or absence of COPD in the patient samples using a heat map. In this process, we can find the most relevant features associated with the outcome.
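The pre-processing and correlation steps can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the column names and values are hypothetical, mean imputation is assumed because the imputation method is not specified, and the Pearson coefficient is assumed as the correlation measure behind the heat map. The resulting coefficients are the numbers a heat map would display as colors.

```python
from statistics import mean


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical raw records, stored column-wise.
raw = {
    "patient_id": [101, 102, 103, 104, 105, 106],  # identification label
    "age": [45, 60, None, 70, 52, 66],             # one missing value
    "smoker": [0, 1, 1, 1, 0, 1],
    "copd": [0, 1, 1, 1, 0, 1],                    # outcome
}

# Drop the identification label, impute missing values, keep the outcome apart.
features = {
    name: impute_mean(column)
    for name, column in raw.items()
    if name not in ("patient_id", "copd")
}
target = raw["copd"]

# Correlation of every risk factor with the outcome.
correlations = {name: pearson(column, target) for name, column in features.items()}
for name, value in correlations.items():
    print(f"{name}: {value:+.2f}")
```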

Results
Our work has been implemented in Python to study the behavior of the supervised algorithms. The original data set was collected from SCB Medical College through a survey conducted by RMRC. The data set is novel and has been used for study only. The raw data was collected for 200 patients of the Government medical college to identify the critical connection between COPD and heart disease. The results of the six classification algorithms are reported in Table 2, with their confusion matrices and accuracy in predicting the test samples. The important features [16] were identified using the heat map in Figure 4, and the factors most responsible for causing COPD were determined to be cor pulmonale, age, and smoking. These factors were then studied with several algorithms to analyze their performance in terms of accuracy, precision, and recall.
The performance of the classifiers has been studied by plotting the ROC curves for all methods, as shown in Figure 3. The best curve is given by the random forest classifier, which has an area under the curve (AUC) of 0.87. The number of true positives obtained for random forest is higher than for any other classifier used. The logistic regression and SVM classifiers have also given good results, with an AUC of 0.77, as can be seen from the ROC curves.
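For reference, the area under an ROC curve can be computed in two equivalent ways, sketched below: as the fraction of positive/negative pairs in which the positive sample receives the higher score, and as the trapezoidal area under the (false positive rate, true positive rate) points. The labels and scores are hypothetical and unrelated to the reported results.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0  # ties count as half a win
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def roc_points(labels, scores):
    """(false positive rate, true positive rate) for each score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical true labels and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = roc_auc(labels, scores)
print("AUC (pairwise):", auc)
print("AUC (trapezoid):", trapezoid_area(roc_points(labels, scores)))
```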
From the data collected, the best results were produced by the random forest classifier, with an accuracy of 87.5%, a precision of 95.23%, and a recall of 90.90%.
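Accuracy, precision, and recall all follow from the four cells of a confusion matrix. The sketch below shows the arithmetic. The counts are hypothetical: they were chosen only because they yield ratios of the same size as those reported above, and they are not the confusion matrix of the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # correct predictions over all samples
        "precision": tp / (tp + fp),     # predicted positives that are correct
        "recall": tp / (tp + fn),        # actual positives that are found
    }


# Hypothetical counts, for illustration only (not the study's confusion matrix).
metrics = classification_metrics(tp=20, fp=1, fn=2, tn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```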

Discussion
Various works have been carried out to identify the symptoms and factors affecting the health of COPD patients. A strong connection has been found in patients with COPD who are also affected by cardiovascular disease [21,2,23]. Our paper also stresses that these two diseases are interrelated and that the onset of one causes the other to affect the patient soon afterwards.
The factors affecting the condition of COPD have been identified as smoking, cor pulmonale, age, and diabetes [23,20,6,22], along with other lifestyle-related factors. Our work establishes similar facts: the important factors obtained from the heat map are smoking, cor pulmonale, and age. To some extent, literacy status and lifestyle also played a minor role.
The papers that studied the performance of models for predicting COPD reported an accuracy of 79-80% [5,19,26], with a maximum AUC of 0.87. Our model has given better results than these models, with the highest accuracy of 87.5% and an AUC of 0.87. This is an improvement of roughly 7 percentage points in the prediction of COPD patients using our suggested model.

Conclusion
Random forests, being an ensemble classifier, have given the best results for predicting the condition of a patient affected by COPD. The most important factors identified as causal are age, smoking, and cor pulmonale. In addition, systolic and diastolic pressure have been identified as having an impact on the underlying disease. Cor pulmonale is an abnormal enlargement of the heart due to infection in the lungs or blood vessels. These facts indicate that COPD is related to heart disease and worsens with the systolic and diastolic pressure of the patient, which are themselves affected when a person suffers from heart disease. With our data set, we have identified that these two diseases coexist and are interrelated. Our work has been done with a novel data set of 200 patients. It illustrates that classification methods can be used to find the relationship between various diseases and their occurrences. The study can be extended to a larger number of patients, so that the model behaves consistently on new experimental case studies. The classification methods can be combined with optimization techniques for better prediction accuracy. Our model can be used by medical experts to detect heart disease well in advance, and the symptoms of COPD can be interpreted by the model to predict occurrences of fatal diseases.