Photo by Nhia Moua on Unsplash

Can we predict deadly chronic disease early using data science?

Chronic Obstructive Pulmonary Disease (COPD) is currently one of the major causes of death worldwide. It also poses a global public health challenge due to its high prevalence, bringing mortality, disability, and socioeconomic burden to high-income and low-income countries alike. As infectious diseases are brought under control, the rates of chronic diseases such as cardiovascular diseases have increased, particularly in regions with high levels of urbanization and industrialization (such as China, India, Mexico, and Brazil). Early diagnosis and prevention of non-communicable diseases is thus important. In this article, we will focus on COPD and take a data-driven approach to predicting COPD development.

The widely regarded gold standard for COPD diagnosis is spirometry: a patient is considered to have COPD if the forced expiratory volume in one second (FEV1), usually expressed as a fraction of the forced vital capacity (FVC), falls below a threshold value. While the most important risk factor for COPD is smoking, other risk factors also play an important role, such as occupational exposure to fumes and dust, air pollution, genetic factors, and lifestyle habits (e.g. physical activity and diet). In fact, many people with COPD have never smoked in their lives. Although lung function is the gold standard for a clinical diagnosis of COPD, it alone is not sufficient for early detection.
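The spirometry criterion described above can be written as a one-line rule. The sketch below assumes the commonly used fixed-ratio cut-off of FEV1/FVC below 0.70; the article itself only speaks of "a certain threshold", so the default value and the function name `is_obstructed` are assumptions made for illustration.

```python
def is_obstructed(fev1_litres, fvc_litres, threshold=0.70):
    """Return True if the FEV1/FVC ratio falls below the threshold.

    threshold=0.70 is an assumed default (a widely used fixed ratio),
    not a value taken from the study discussed in this article.
    """
    if fvc_litres <= 0:
        raise ValueError("FVC must be positive")
    return fev1_litres / fvc_litres < threshold


# Hypothetical spirometry readings, in litres.
print(is_obstructed(1.8, 3.2))  # ratio 0.5625
print(is_obstructed(3.2, 4.0))  # ratio 0.8
```

A rule like this only flags airflow limitation that is already measurable, which is why a risk model built on other features is attractive for early detection.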

Meanwhile, the possibility of developing a risk model that takes into account a combination of the known risk factors and genetic polymorphisms for diagnosing and predicting diseases has been met with some success. Here, we use data consisting of DNA polymorphisms and clinical features from a clinical study with 441 COPD patients and 192 healthy controls in Chinese hospitals to develop a predictive model for COPD development.

Part I: What does the data suggest about the risk factors for COPD?

Comparing COPD patients and healthy controls in the dataset, one can see marked differences in age, gender, and smoking status. BMI levels, however, are similar in both groups.

Figure 1: Proportion of smokers, BMI, age, and proportion of males among COPD patients (orange) and healthy controls (blue).

The results suggest that COPD patients are more likely to be men, to be older, and to have smoked at some point in their lives. A more interesting question is whether these risk factors can be combined with genetic polymorphisms to build risk models for COPD.

Part II: Which machine learning model predicts the test data best?

We evaluate six prediction models (kNN, Linear Regression, Support Vector Machine, decision tree, XGBoost, and a feed-forward neural network) using the area under the ROC curve (AUC) as the performance metric. Based on the AUC score, the three highest-performing models are kNN, Linear Regression, and XGBoost, although all six models show good performance on the test dataset.

The area under the receiver operating characteristic curve is highest for XGBoost, but all models are significantly better than random chance (dashed line).
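To make the AUC comparison concrete, here is a minimal sketch of how the metric can be computed and used to rank models. It uses the pairwise definition of AUC (the probability that a randomly chosen patient is scored higher than a randomly chosen control), which is the same quantity that library routines such as scikit-learn's `roc_auc_score` report. The labels, scores, and model names below are made up for illustration and are not the study data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive case is scored
    higher than a random negative case (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels (1 = COPD, 0 = control) and made-up scores, NOT study data.
toy_labels = [1, 1, 1, 0, 0, 1, 0, 0]
toy_scores = {
    "model_a": [0.9, 0.8, 0.4, 0.5, 0.3, 0.7, 0.2, 0.1],
    "model_b": [0.6, 0.4, 0.5, 0.7, 0.3, 0.8, 0.2, 0.1],
    "constant": [0.5] * 8,  # an uninformative scorer lands on AUC = 0.5
}

auc_by_model = {name: roc_auc(toy_labels, s) for name, s in toy_scores.items()}
ranking = sorted(auc_by_model, key=auc_by_model.get, reverse=True)

for name in ranking:
    print(f"{name}: AUC = {auc_by_model[name]:.4f}")
```

The constant scorer corresponds to the dashed diagonal in the ROC plot: a model is only useful to the extent that its AUC rises above 0.5.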

Furthermore, all models also show decent overall predictive power measured using various metrics:

Comparison of the six machine learning models evaluated on the test set. KNN = k-nearest neighbour, LR = Linear Regression, SVM = support vector machine, DT = decision tree, MLP = multi-layer perceptron.
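The comparison above does not spell out how such metrics are derived, so the following is a sketch under assumptions: it computes accuracy, sensitivity, specificity, precision, and F1 from the four confusion-matrix counts. The choice of these five metrics, the function name, and the toy predictions are all illustrative and not taken from the study.

```python
def classification_metrics(y_true, y_pred):
    """Compute common binary-classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # patients caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed patients

    def ratio(num, den):
        # Avoid division by zero when a class is absent.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Toy ground truth and predictions (1 = COPD, 0 = control), NOT study data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Unlike AUC, these metrics depend on the decision threshold used to turn scores into 0/1 predictions, so reporting both gives a fuller picture of a model.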

Because XGBoost is the best-performing model based on AUC, it is used to analyze feature importance in the next part.

Part III: Which are the most important features that play a major role in the predictive model?

The XGBoost model is used to analyze the importance of features. The feature score rankings (labeled "F score" in the plot) were measured with the total_gain metric in the plot_importance function from Python's xgboost library.

The relative importance of clinical features (location, age, BMI, sex, smoking status) and DNA polymorphisms (rs***) in the XGBoost model.

Note that the absolute values of the F score mean little on their own. What matters is the F score of one feature relative to the others.
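One way to read the scores in relative terms is to rescale them so that they sum to one. The sketch below assumes the raw scores are available as a dictionary of feature name to total gain, which is the shape returned by xgboost's `Booster.get_score(importance_type="total_gain")`. The numbers and the `snp_example` name are invented for illustration and are not the values from the plot.

```python
def relative_importance(raw_scores):
    """Rescale raw importance scores to shares that sum to 1,
    sorted from most to least important."""
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")
    shares = {name: score / total for name, score in raw_scores.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Made-up total_gain values, NOT the scores from the trained model.
toy_total_gain = {
    "age": 25.0,
    "location": 40.0,
    "snp_example": 4.0,  # hypothetical polymorphism feature
    "BMI": 15.0,
    "smoking": 6.0,
    "sex": 10.0,
}

for feature, share in relative_importance(toy_total_gain):
    print(f"{feature}: {share:.0%} of total gain")
```

Working with shares makes statements such as "feature A contributes several times more gain than feature B" easy to check, without attaching meaning to the raw magnitudes.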

That said, there are two major takeaways from this result:

  1. The five clinical features (location, age, BMI, sex, and smoking status) are generally more important than genetic polymorphisms.
  2. Location plays the most important role in predicting COPD, which is not surprising because air quality can differ by a large margin between Chinese cities.

Interestingly, smoking, the classical culprit for COPD in developed countries, is not as big a risk factor here according to the model. Its effect is probably overshadowed by the high levels of ambient air pollution in some parts of China. Note that we can only report associations, NOT causation, since we have no data from a randomized controlled trial (a.k.a. randomized experiment).


In this article, we looked at how data science can be used to recognize COPD based on clinical features and DNA data. There is no cure for COPD and it is progressive, so early detection is of paramount importance for any treatment to control its severity and reduce its impact on daily life.

The findings here are based on just one study with a rather limited dataset that specifically focuses on the Chinese population, so further studies would be necessary to validate these findings. Therefore, the problem remains:

We need more DATA and SCIENTISTS to help with early diagnosis for effective treatment of chronic diseases!

The analysis code and more details can be found in the project GitHub repo. The data and original study can be found in this BMC Medicine article.


