The past 5 years have seen an explosion of interest in the use of artificial intelligence (AI) and machine learning techniques in medicine. This has been driven by the development of deep neural networks (DNNs)—complex networks residing in silico but loosely modelled on the human brain—that can process complex input data such as a chest radiograph image and output a classification such as ‘normal’ or ‘abnormal’. DNNs are ‘trained’ using large banks of images or other input data that have been assigned the correct labels. DNNs have shown the potential to equal or even surpass the accuracy of human experts in pattern recognition tasks such as interpreting medical images or biosignals. Within respiratory medicine, the main applications of AI and machine learning thus far have been the interpretation of thoracic imaging, lung pathology slides and physiological data such as pulmonary function tests. This article surveys progress in this area over the past 5 years, as well as highlighting the current limitations of AI and machine learning and the potential for future developments.
- imaging/CT MRI etc
- lung physiology
‘Artificial intelligence (AI)’ may be defined as ‘the theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages’.1 Machine learning is a subfield of AI in which statistical models are used to learn patterns from data in order to accomplish a specific task. Machine learning techniques range from simple linear models such as logistic regression and naïve Bayes classifiers to complex neural network models with many thousands of parameters.
The explosion of interest in medical applications of AI during the past 5 years may be attributed to the confluence of two key factors:
Deep neural networks (DNNs)
Artificial neural networks (ANNs) are loosely modelled on the human brain and consist of multiple layers of ‘neurons’ that successively process input data until the output layer is reached. DNNs are a recently developed variant of ANNs that have a large number of intermediate layers (often greater than 10) and process input data in a hierarchical manner, with the first few layers responding to simple low-level features (such as straight lines) and successive layers responding to more abstract high-level features (such as the shape of specific objects).2 DNNs are typically used to classify the input data into a number of categories. For instance, a chest radiograph image may be classified as ‘normal’ or ‘abnormal’. DNNs have been accompanied by a paradigm shift in AI. In the early years of AI research, the goal was to encode the knowledge of human experts into rule-based ‘expert systems’ that would be explicitly programmed to look for certain hand-crafted features in the data. However, DNNs are developed using large training datasets and learn in an autonomous manner the features which most discriminate between categories. DNNs therefore have the potential to surpass human experts in classification tasks. Although their accuracy is impressive, a drawback of DNNs is their lack of interpretability. The features that are used to distinguish between data categories are not readily translated into verbal or visual ‘rules’ that a human can understand.
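The layer-by-layer processing described above can be illustrated with a minimal sketch of a forward pass through a very small network. The weights, the two-class output and the function names are hypothetical and chosen only for illustration; a real DNN would have many more layers, and its weights would be learned from training data rather than written by hand.

```python
import math

def relu(v):
    # Rectified linear activation: negative values are set to zero.
    return [max(0.0, x) for x in v]

def dense(v, weights, biases):
    # One row of weights per output neuron.
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, biases)]

def softmax(v):
    # Convert raw output scores into class probabilities that sum to 1.
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, layers):
    # Pass the input through each hidden layer, then the output layer.
    *hidden, (w_out, b_out) = layers
    for w, b in hidden:
        x = relu(dense(x, w, b))
    return softmax(dense(x, w_out, b_out))

# Hypothetical hand-written weights (not learned from data).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer: 2 inputs -> 2 neurons
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),   # output layer: 2 neurons -> 2 classes
]
probs = forward([2.0, 1.0], layers)  # probabilities for two hypothetical classes
```

The class with the highest probability would be taken as the network's classification of the input.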
Big data and faster computation
The success of DNNs has been dependent on the availability of large training datasets that have been correctly labelled with the categories to be distinguished, as well as faster computation to train the DNNs within a reasonable timeframe. For example, a system that can distinguish handwritten numerals requires thousands of examples of each numeral to achieve reasonable accuracy. The widespread use of Electronic Health Records (EHR) and Picture Archiving and Communication Systems has resulted in the availability of large training datasets for healthcare applications, subject to appropriate ethical and information governance safeguards.
DNNs have shown equivalent diagnostic accuracy to expert dermatologists at distinguishing between the macroscopic appearance of malignant and benign skin lesions3 and to expert pathologists at detecting breast cancer nodal metastases on histological slides.4 DNNs have been successfully applied to retinal images for the detection of diabetic retinopathy and other retinal pathologies,5–8 CT images for the detection of acute intracranial events,9 ECGs for the diagnosis of arrhythmias10 and cardiac contractile dysfunction,11 identification of the facial phenotypes of genetic disorders12 and interpretation of screening mammography.13 Significant progress has also been made in the analysis of EHRs for medical diagnosis14 and predicting future events such as acute kidney injury.15 DNNs have shown an ability to detect subtle features that are undetectable by human observers even in hindsight. For instance, Attia et al 16 developed a DNN that accurately predicted the presence of atrial fibrillation occurring previously or in the near future using ECGs recorded during sinus rhythm.
This review will focus on the advances that have been made in AI and machine learning as applied to respiratory medicine in the past 5 years. The inputs that have been subjected to machine learning techniques may be broadly categorised into:
Thoracic imaging.
Histopathology or cytology.
Physiological measurements and biosignals.
The following search was performed on the PubMed database on 19 March 2020:
(‘artificial intelligence’[All Fields] OR ‘machine learning’[All Fields] OR ‘deep learning’[All Fields] OR ‘neural network’[All Fields]) AND (‘chest’[All Fields] OR ‘lung’[All Fields] OR ‘pulmonary’[All Fields] OR ‘respiratory’[All Fields] OR ‘thorax’[All Fields] OR ‘thoracic’[All Fields] OR ‘pneumonia’[All Fields] OR ‘pneumonitis’[All Fields] OR ‘bronchiectasis’[All Fields] OR ‘bronchiolitis’[All Fields] OR ‘cystic fibrosis’[All Fields] OR ‘tuberculosis’[All Fields] OR ‘mycobacteria’[All Fields] OR ‘asthma’[All Fields] OR ‘copd’[All Fields] OR ‘pleural’[All Fields] OR ‘sarcoidosis’[All Fields] OR ‘sleep’[All Fields] OR ‘ventilation’[All Fields]).
A total of 4610 results were returned. All abstracts were reviewed by the first author (SG), and full-text articles were retrieved for papers that described a clinically relevant advance in the field. These papers were reviewed in detail, and examples representing the state of the art in each subfield were selected for inclusion in this narrative review. It was observed that there has been an exponential increase in published papers on this topic starting from 2016 (figure 1).
The application of DNNs to chest radiographs and CT scans has resulted in a step change in diagnostic accuracy compared with qualitative semantic features such as tumour spiculation and quantitative features such as shape and texture derived using image analysis software (often referred to as radiomics). The advantage of DNNs is that they derive features directly from the data, resulting in greater accuracy than hand-crafted qualitative or quantitative analyses but with the disadvantage of reduced interpretability. However, some progress has been made towards correlating ‘deep features’ derived from DNNs with semantic features that are detectable by human radiologists.17
A number of algorithms have been developed for automated reporting or triage of plain chest radiographs, in many cases exceeding the accuracy of expert thoracic radiologists. Annarumma et al 18 trained a DNN to triage chest radiographs as ‘normal’, ‘non-urgent’, ‘urgent’ and ‘critical’ using a training dataset of 329 698 images. The AI system detected normal radiographs with a sensitivity of 71%, specificity of 95% and a positive predictive value of 73% in the test dataset. In a simulated radiology reporting pipeline in which the AI was used to prioritise urgent and critical radiographs for reporting by a radiologist, there was an approximately fourfold reduction in the delay to report radiographs with critical findings and a twofold reduction in the delay to report urgent findings. A similar chest radiograph triage system using a binary classification of ‘normal’ or ‘abnormal’ was developed by Yates et al,19 with a final model accuracy of 94.6% in the test dataset. These findings suggest a potentially important role for AI in prioritising cases for review by a radiologist, in order to expedite the reporting of cases with critical abnormalities. This could be particularly relevant in resource-poor settings in which there is a shortage of trained radiologists. AI assessment of chest imaging may also have prognostic significance: Lu et al 20 developed a DNN that accurately predicted all-cause mortality over a follow-up period of 12 years based on a single plain chest radiograph, even after adjusting for radiologists’ diagnostic findings and standard risk factors for mortality.
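For readers less familiar with the sensitivity, specificity and positive predictive value quoted for chest radiograph triage, the following minimal sketch shows how these measures are derived from the four cells of a confusion matrix. The counts are hypothetical and are not taken from any of the studies cited.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, PPV) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # proportion of true cases that are detected
    specificity = tn / (tn + fp)  # proportion of non-cases correctly excluded
    ppv = tp / (tp + fp)          # proportion of positive calls that are correct
    return sensitivity, specificity, ppv

# Hypothetical counts for illustration only.
sens, spec, ppv = diagnostic_metrics(tp=80, fp=10, tn=90, fn=20)
```

Unlike sensitivity and specificity, the positive predictive value depends on how common the target finding is in the population tested, so it may not transfer between settings with different prevalence.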
DNNs have been trained to recognise specific pathologies on chest radiographs including tuberculosis,21–24 malignant pulmonary nodules,25 congestive cardiac failure26 and pneumothorax.27 Hwang et al 28 developed a DNN that could recognise lung cancer, tuberculosis, pneumonia and pneumothorax on chest radiographs as well as providing visual localisation of the abnormality. In a head-to-head comparison using the same test dataset, the DNN achieved an area under the receiver operating characteristic curve (AUC; a measure of the accuracy of a diagnostic test) of 0.983, exceeding that of thoracic radiologists (0.932), general radiologists (0.896) and non-radiology physicians (0.614). The same research group subsequently tested this DNN algorithm in an emergency department setting and found that it improved the sensitivity of radiology residents in the detection of clinically significant abnormalities when used as a second reader.29 However, the DNN was not trained to interpret radiographs with multiple pathologies, nor to interpret images in the context of background clinical information. Therefore, while current DNNs cannot replace radiologist reporting of chest radiographs, they may act as a competent second reader to reduce perceptual errors. Prospective studies incorporating DNNs as a second reader in routine clinical practice are warranted to determine whether they can reduce the rate of reporting errors.
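The area under the receiver operating characteristic curve (AUC) can be read as the probability that a randomly chosen abnormal case receives a higher model score than a randomly chosen normal case, with ties counted as half. The sketch below computes it directly from that definition. The scores are hypothetical and the pairwise loop is written for clarity rather than efficiency.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0   # abnormal case scored higher: correct ranking
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical model outputs for illustration only.
abnormal = [0.9, 0.8, 0.6, 0.4]
normal = [0.7, 0.3, 0.2, 0.1]
example_auc = auc(abnormal, normal)  # 14 of 16 pairs ranked correctly -> 0.875
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.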
Evidence that lung cancer screening can reduce mortality is steadily accumulating, and a recent European Union position statement has concluded that implementation of low-dose CT screening should start throughout Europe as soon as possible.30 An important limiting factor in this implementation is the availability of radiologists to report the large volume of screening CT scans. There has therefore been substantial interest in developing AI systems that can detect and accurately diagnose malignant pulmonary nodules on CT imaging.31–35 Ardila et al 31 trained a DNN to predict the risk of lung cancer based on current and previous chest CT scans using cases from the National Lung Screening Trial. The DNN achieved an AUC of 0.944 for predicting biopsy-proven cancer in the test dataset. The accuracy of the DNN was higher than that of six board-certified radiologists when only the current CT scan was available and was equivalent to the radiologists when both current and previous CT scans were available for review. Similarly, Baldwin et al developed a DNN to predict malignancy in incidentally detected pulmonary nodules measuring 5–15 mm and achieved an AUC of 0.896 in the test dataset, which was significantly higher than that of the Brock model currently recommended in UK guidelines.32 In order to improve the interpretability and clinical acceptability of DNN predictions, Shen et al 33 merged deep learning techniques with more traditional semantic features such as nodule calcification and margin definition. Incorporating semantic features into DNN predictions did not significantly affect model accuracy but may have improved interpretability of the model output.
A number of investigators have taken a hybrid approach, in which radiomic features are entered into machine learning models in order to derive the best combination of features to optimise classification accuracy.34 35 Delzell et al 34 measured 416 quantitative imaging biomarkers in CT scans of pulmonary nodules from 200 patients and entered these radiomic features into a variety of machine learning models. The best performing models were elastic net and support vector machine, which achieved an AUC of 0.72 for distinguishing benign from malignant nodules.
Beyond lung cancer diagnosis, there is evidence that DNNs can be used to predict prognosis and tumour type based on CT images. Hosny et al 36 trained a DNN to predict survival based on CT appearances in patients with non-small cell lung cancer undergoing surgery or radiotherapy. The DNN distinguished between early (<2 years) and late (≥2 years) mortality with an AUC of 0.71 and 0.70 in patients undergoing surgery and radiotherapy, respectively. Wang et al 37 found that a DNN could predict epidermal growth factor receptor mutation status in patients with lung adenocarcinoma based on CT images, with an AUC of 0.81 in the validation dataset. The accuracy of the DNN significantly exceeded that of predictive models using clinical features alone, semantic features or radiomics features.
Machine learning techniques have also been used for the diagnosis of interstitial lung disease.38 39 Walsh et al 38 trained a DNN using a total of 420 096 montages each consisting of four transverse CT images. These were derived from full high-resolution CT scans of 210 patients with usual interstitial pneumonia (UIP), 392 with possible UIP and 327 whose scans were considered inconsistent with UIP. The reference standard for each CT scan was determined by an experienced thoracic radiologist with a specialist interest in interstitial lung disease. In a test set of 68 093 montages derived from 139 separate patients, the algorithm achieved an accuracy of 76.4%. A second test set consisted of 150 four-slice montages from CT scans that had been previously evaluated by 91 thoracic radiologists, with the reference standard being the majority opinion of the radiologists. The algorithm achieved an accuracy of 73.3% in this test set that was comparable with the median accuracy of the individual radiologists (70.7%). Moreover, in a Cox regression analysis, an algorithm diagnosis of UIP was associated with a HR for mortality of 2.88, compared with a diagnosis of ‘not UIP’, with the equivalent HR for a majority radiologist opinion diagnosis of UIP being 2.74.
González et al 40 trained a DNN using four-slice CT montages from 7983 smokers who took part in the COPDGene study and found that the algorithm could accurately diagnose COPD, with an AUC of 0.856. A subsequent study using the same dataset found that the staging of emphysema from ‘absent’ to ‘advanced destructive’ using a DNN was highly predictive of survival and lung function.41 DNNs have also been developed to diagnose and evaluate the burden of thrombus in acute pulmonary embolism. The algorithm developed by Liu et al 42 achieved an AUC of 0.926 for the diagnosis of pulmonary embolism, and the clot burden measured by the DNN correlated significantly with manual (Qanadli and Mastora) scores and with measures of right ventricular function.
Histopathology and cytology
Deep learning techniques have been successfully applied to digital histology images, particularly in the field of lung cancer. Coudray et al 43 found that a DNN could distinguish between adenocarcinoma and squamous cell carcinoma of the lung with an accuracy comparable to that of expert pathologists (AUC of 0.97). This significantly exceeded the performance of traditional image-processing techniques with hand-crafted features, which achieved an AUC of approximately 0.75 for the same task.44 Moreover, the DNN could also predict the presence or absence of six common gene mutations of therapeutic significance (STK11, EGFR, FAT1, SETBP1, KRAS and TP53) with AUC values ranging from 0.73 to 0.86. Similarly, Sha et al 45 trained a DNN to predict programmed death-ligand 1 status in non-small cell lung cancer based on morphological appearances on standard H&E stained tumour sections, with an AUC of 0.80. DNNs have also been trained to accurately differentiate between lung adenocarcinoma growth patterns (acinar, micropapillary, solid, lepidic and cribriform),46 47 as well as to detect lung cancer metastases in lymph node slides.48 Courtiol et al 49 trained a DNN (MesoNet) to accurately predict overall survival of patients with malignant mesothelioma based on whole slide digitised images. The predictions made by MesoNet cut across traditional histological boundaries (such as epithelioid, sarcomatoid and biphasic) and moreover identified the specific regions within the slides that most contributed to patient outcome prediction.
Kim et al 50 used machine learning methods (support vector machines and penalised logistic regression) to develop classifiers for interstitial lung diseases based on high-dimensional transcriptional data from surgical lung biopsies. In a subsequent prospective study, the investigators found that the molecular classifier they developed could accurately distinguish between UIP and non-UIP in less invasive transbronchial biopsy samples, suggesting that the technique could avoid the need for surgical biopsy in some cases.51
Xiong et al 52 trained a DNN to recognise acid fast-stained Mycobacterium tuberculosis bacilli on digital cytology slides. The small size of the bacilli (20×4 pixels) and the loss of resolution when scanning the digital images resulted in some technical challenges. Although good sensitivity of 98% was achieved following modifications to the algorithm, there were a number of false positive results due to contaminant bacilli and slide artefacts, resulting in a specificity of 84%.
Physiological measurements and biosignals
Interpretation of pulmonary function tests including spirometry, body plethysmography and measurement of diffusing capacity has traditionally been considered an important aspect of the expertise of respiratory physicians. Topalovic et al 53 developed a random forest machine learning model using 1430 historical cases that could accurately differentiate between eight categories of respiratory disease. In a head-to-head comparison using 50 test cases, the model displayed an accuracy of 82% and outperformed 120 European pulmonologists by a wide margin.
Machine learning approaches have also been applied to the forced oscillation technique (FOT), which measures respiratory impedance non-invasively using sound waves, with minimal effort from the subject.54 Amaral et al 55 applied a variety of machine learning models (including K nearest neighbour, decision trees, ANNs and support vector machines) to FOT measurements to detect COPD (AUC >0.95), to discriminate between different Global Initiative for Chronic Obstructive Lung Disease stages of airflow obstruction56 and to identify early smoking-induced changes in healthy subjects,57 as well as to identify airflow obstruction in patients with asthma.58
Breath analysis offers excellent potential to phenotype respiratory disorders because exhaled breath contains a mixture of gases and traces of many volatile organic compounds (VOCs) that emanate from the respiratory tract itself. Several techniques exist to measure VOCs in the breath, such as gas chromatography–mass spectrometry, electronic nose and chemical sensors, each of which requires advanced pattern recognition methods to identify abnormal signatures in measured VOCs.59 Machine learning methods such as decision trees and support vector machines have been applied to VOC data to discriminate between patients with COPD and healthy individuals60 and to detect lung cancer.61 Brinkman et al 62 used an electronic nose to classify inflammatory asthma phenotypes using K-means and Ward clustering. These unsupervised learning techniques do not rely on prelabelling but instead group the cases into novel categories or clusters based on similarity of exhaled breath metabolites.
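To illustrate how an unsupervised method groups cases without labels, the following is a minimal sketch of K-means clustering. It is a generic textbook version, not the pipeline used in the cited study. The two-dimensional feature values and the choice of starting centroids are hypothetical.

```python
import math

def kmeans(points, centroids, n_iter=20):
    """Basic K-means: alternate between assigning points and updating centroids."""
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    labels = [min(range(len(centroids)),
                  key=lambda i: math.dist(p, centroids[i]))
              for p in points]
    return centroids, labels

# Hypothetical two-feature measurements from six subjects.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
          (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
```

The number of clusters must be chosen in advance, and the clusters found have no clinical meaning until they are compared against independent patient characteristics.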
Computerised lung sound analysis involves discriminating between normal and adventitious lung sounds obtained during auscultation. Although machine learning has become a standard method to classify adventitious sounds, these sound events are intermittent and highly variable from one person to another, which presents a challenge in generalising these algorithms to a wider population.63 Machine learning approaches (including ANNs and support vector machines) have been applied to classify adventitious sounds associated with asthma,64 COPD65 and interstitial lung disease66 and to detect common respiratory disorders in children using cough sounds.67 Bardou et al 68 found that DNNs outperformed traditional machine learning techniques in the classification of lung sounds into seven categories (normal, coarse crackle, fine crackle, monophonic wheeze, polyphonic wheeze, squawk and stridor).
Analysis of biosignals using machine learning may permit a superior understanding of the dynamics of physiological regulation in health and disease. Examples of biosignal monitoring in the respiratory sphere include polysomnography, which is used to diagnose obstructive sleep apnoea and other sleep disorders. Nikkonen et al 69 developed an ANN that accurately determined the oxygen desaturation index (ODI) and apnoea–hypopnoea index (AHI) using only the oxygen saturation signal as input. The median absolute error was 0.78 events/hour for AHI and 0.68 events/hour for ODI, using manual scoring of events as the gold standard. Allocca et al 70 developed an automated sleep-stage classification programme that achieved high accuracy against a gold standard of manual visual scoring in human, rodent and pigeon polysomnography data. Mousavi et al 71 developed a DNN to annotate various sleep stages using an openly accessible electroencephalogram (EEG) dataset, achieving an accuracy of 84%. Automated monitoring of biosignals has also been proposed as a solution to patient-ventilator asynchrony, which is a mismatch between ventilator delivery and patient demand. Gholami et al 72 developed a random forest machine learning model to detect cycling asynchrony based on waveform analysis with positive and negative predictive values above 90%.
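Indices such as the AHI and ODI are expressed as events per hour, and agreement between automated and manual scoring can be summarised as a median absolute error. The sketch below shows both calculations. The event counts and index values are hypothetical and do not reproduce data from the cited studies.

```python
from statistics import median

def events_per_hour(n_events, hours):
    """Index such as AHI or ODI: number of scored events per hour."""
    return n_events / hours

def median_absolute_error(predicted, reference):
    """Median of the absolute differences between paired values."""
    return median(abs(p - r) for p, r in zip(predicted, reference))

# Hypothetical example: 40 events scored over 8 hours.
ahi = events_per_hour(40, 8)

# Hypothetical AHI values (events/hour) for three recordings.
manual_ahi = [5.0, 12.0, 30.0]      # reference standard: manual scoring
automated_ahi = [5.5, 11.0, 32.0]   # output of an automated method
mae = median_absolute_error(automated_ahi, manual_ahi)
```

The median is less affected than the mean by a small number of recordings with large scoring discrepancies.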
As the use of smartphones, sensors and wearables proliferates, telemedicine may become an important tool for the self-management of respiratory disorders. By monitoring clinical outcomes at an individual level, such technologies facilitate preventive and pre-emptive care while providing medical expertise remotely. Machine learning offers a powerful solution to analyse patterns associated with respiratory outcomes in data collected by telemonitoring devices.73 Machine learning methods such as naïve Bayes classifiers and support vector machines have been applied to home peak expiratory flow measurements and symptom scores to predict exacerbations a week early in adults74 and children75 with asthma. Similar studies have also been published to predict exacerbations in patients with COPD.76 77
Conclusion and future developments
Research into AI in medicine has accelerated markedly since 2015, with the field of respiratory medicine being well represented. DNNs are emerging as a key tool to develop imaging biomarkers for diagnosis, prognosis and prediction of response to therapy. Figure 2 summarises the process by which machine learning models may be developed and incorporated into routine clinical practice in the near future. There remains an enormous potential for DNNs to embrace domains outside of imaging such as pulmonary function tests and physiological biosignals. However, a major limitation for such computational approaches is a shortage of sufficiently large medical training datasets. Overcoming this will require large-scale collaborations such as the recently formed Open-Source Imaging Consortium (https://www.osicild.org/), a collaboration between academia and industry to develop imaging biomarkers for idiopathic pulmonary fibrosis and other interstitial lung diseases using AI.
Large clinical databases from multicentre randomised controlled trials are another underexplored domain. Applying DNNs to these detailed datasets, potentially including merged data from multiple similar trials, carries the potential to predict treatment effects for individual patients, ushering in a new era of personalised medicine. The benefits of sharing and reuse of individual participant data from clinical trials are increasingly recognised but will require a robust internationally recognised ethical and legal framework to gain wider adoption and acceptance.78 Similarly, data collected during the course of routine clinical practice has great potential for training AI algorithms for patient benefit. However, clear legal and ethical guidelines are needed to maximise the benefit of such datasets while preserving the confidentiality of individual patients.79
Natural language processing (NLP) is still at an early stage of development but in future may be deployed to extract clinical insights from the vast pool of unstructured EHR data80 or to extract relationships between concepts from the rapidly expanding body of medical research.81 NLP may also be used to accelerate the development of AI algorithms for interpreting radiological or histological images, by automatically converting free-text radiology or pathology reports into a structured format suitable for training DNNs, potentially obviating the need for manual labelling of cases.82–84
While the advances made over the past 5 years have been impressive, a number of challenges must still be overcome before AI can be widely adopted into routine clinical practice.79 85 86 These include intrinsic problems with the machine learning algorithms themselves, logistical difficulties and social or cultural barriers. It is known that DNNs have the potential to misclassify examples that have been subtly altered, even by the addition of a few extra pixels.87 In a clinical setting, this could manifest as a lack of generalisability; for instance, a DNN model trained on imaging data from the latest scanner at an advanced care facility may not function properly at a hospital that has older machines. Similarly, machine learning models are prone to perpetuating biases that may exist in the training dataset, as well as spurious associations in which confounding factors are used as predictors.86 A related problem with DNNs and other complex machine learning algorithms is their lack of interpretability, which may be defined as an ability to provide reasons for their output. DNNs often act like ‘black boxes’ with the reasons for their output remaining opaque, even to their developers. In the medical sphere, interpretability is critical for gaining trust, particularly if important management decisions are being made based on the evaluation of a DNN. Fortunately, progress is now being made towards more interpretable AI. Several techniques have been developed that can generate explanations by estimating how the input features or different regions within an input image contributed to the output.85 These techniques should allow a closer inspection of DNN outputs by clinicians so that decisions based on faulty or biased explanations can be over-ruled.
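One simple way of estimating how different image regions contribute to a model output is occlusion sensitivity: each patch of the image is masked in turn and the resulting fall in the model score is recorded. The text above does not name specific techniques, so this choice is an assumption made for illustration. The sketch uses a toy scoring function as a hypothetical stand-in for a trained DNN.

```python
def occlusion_map(image, model, patch=2, baseline=0.0):
    """Record the drop in model score when each patch is replaced by `baseline`."""
    base_score = model(image)
    rows, cols = len(image), len(image[0])
    heat = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            occluded = [row[:] for row in image]  # copy so the input is unchanged
            cells = [(r, c)
                     for r in range(r0, min(r0 + patch, rows))
                     for c in range(c0, min(c0 + patch, cols))]
            for r, c in cells:
                occluded[r][c] = baseline
            drop = base_score - model(occluded)   # large drop = influential region
            for r, c in cells:
                heat[r][c] = drop
    return heat

def toy_model(image):
    # Hypothetical stand-in for a DNN: responds only to the top-left 2x2 region.
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0

image = [[1.0] * 4 for _ in range(4)]  # hypothetical 4x4 'image'
heat = occlusion_map(image, toy_model)
```

In this toy case only the top-left patch receives a non-zero value, which matches the region the scoring function depends on. With a real DNN, such a map could be overlaid on the image for inspection by a clinician.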
In conclusion, AI and machine learning have the power to transform many aspects of respiratory medicine. The emergence of DNNs developed using big training datasets has resulted in a number of novel applications, particularly in the field of thoracic imaging. However, DNN models still suffer from problems with interpretability, generalisability and potential bias. Rigorous validation strategies combined with the development of new standards for reporting machine learning models are required to address these issues before AI can take its place in the clinic.88
Contributors SG conceived the idea for the manuscript, undertook literature search, co-wrote the first draft, and prepared the final draft and figures; WJ, ND and MT undertook literature search, co-wrote the first draft and critically appraised the final draft.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests SG has received speaker’s fees from Teva and consultancy fees from Anaxsys and 3M. WJ has received grants from AstraZeneca, Chiesi and GSK. WJ and MT are co-founders of ArtiQ, a spinoff company of KU Leuven. ND has no competing interests to declare.
Patient consent for publication Not required.
Provenance and peer review Not commissioned; externally peer reviewed.