Original research
Software using artificial intelligence for nodule and cancer detection in CT lung cancer screening: systematic review of test accuracy studies
  1. Julia Geppert1,
  2. Asra Asgharzadeh2,3,
  3. Anna Brown3,
  4. Chris Stinton1,
  5. Emma J Helm4,
  6. Surangi Jayakody3,
  7. Daniel Todkill3,
  8. Daniel Gallacher3,
  9. Hesam Ghiasvand3,5,
  10. Mubarak Patel3,
  11. Peter Auguste3,
  12. Alexander Tsertsvadze3,
  13. Yen-Fu Chen3,
  14. Amy Grove3,
  15. Bethany Shinkins1,
  16. Aileen Clarke3,
  17. Sian Taylor-Phillips6
  1. 1Warwick Screening & Warwick Evidence, Warwick Medical School, University of Warwick, Coventry, UK
  2. 2Population Health Science, University of Bristol, Bristol, UK
  3. 3Warwick Evidence, Warwick Medical School, University of Warwick, Coventry, UK
  4. 4Department of Radiology, University Hospitals Coventry and Warwickshire NHS Trust, Coventry, UK
  5. 5Research Centre for Healthcare and Communities, Coventry University, Coventry, UK
  6. 6Warwick Screening, Warwick Medical School, University of Warwick, Coventry, UK
  1. Correspondence to Dr Yen-Fu Chen; Y-F.Chen{at}warwick.ac.uk

Abstract

Objectives To examine the accuracy and impact of artificial intelligence (AI) software assistance in lung cancer screening using CT.

Methods A systematic review of CE-marked, AI-based software for automated detection and analysis of nodules in CT lung cancer screening was conducted. Multiple databases including Medline, Embase and Cochrane CENTRAL were searched from 2012 to March 2023. Primary research reporting test accuracy or impact on reading time or clinical management was included. QUADAS-2 and QUADAS-C were used to assess risk of bias. We undertook narrative synthesis.

Results Eleven studies evaluating six different AI-based software and reporting on 19 770 patients were eligible. All were at high risk of bias with multiple applicability concerns. Compared with unaided reading, AI-assisted reading was faster and generally improved sensitivity (+5% to +20% for detecting/categorising actionable nodules; +3% to +15% for detecting/categorising malignant nodules), with lower specificity (−7% to −3% for correctly detecting/categorising people without actionable nodules; −8% to −6% for correctly detecting/categorising people without malignant nodules). AI assistance tended to increase the proportion of nodules allocated to higher risk categories. Assuming 0.5% cancer prevalence, these results would translate into an additional 150–750 cancers detected per million people attending screening but would lead to an additional 59 700 to 79 600 people attending screening without cancer receiving unnecessary CT surveillance.

Conclusions AI assistance in lung cancer screening may improve sensitivity but increases the number of false-positive results and unnecessary surveillance. Future research needs to increase the specificity of AI-assisted reading and minimise risk of bias and applicability concerns through improved study design.

PROSPERO registration number CRD42021298449.

  • Imaging/CT MRI etc
  • Lung Cancer
  • Non-Small Cell Lung Cancer
  • Clinical Epidemiology

Data availability statement

All data relevant to the study are included in the article or uploaded as supplementary information.

This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 Unported (CC BY 4.0) license, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See: https://creativecommons.org/licenses/by/4.0/.

WHAT IS ALREADY KNOWN ON THIS TOPIC

  • Artificial intelligence (AI)-based software is increasingly used to assist the detection and measurement of pulmonary nodules as part of lung cancer screening, but its impact on test accuracy and clinical management has not been comprehensively critiqued and summarised.

WHAT THIS STUDY ADDS

  • AI assistance in lung cancer screening tends to increase sensitivity (detecting more cancers) but at the cost of reduced specificity (resulting in significant additional surveillance of nodules, which would never develop into cancer).

  • Evidence was mostly from retrospective studies conducted in research settings with high risk of bias and applicability concerns.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

  • Adoption of AI software and further research should focus on improving the specificity of AI assistance and prospective collection of evidence from in-practice settings using robust study design.

Introduction

Early detection, assessment and monitoring of pulmonary nodules, together with timely intervention, are key to reducing lung cancer morbidity and mortality. Lung cancer screening programmes have been established in several countries including the USA, Croatia, Czech Republic and Taiwan following growing evidence demonstrating survival benefits.1 2 In September 2022, the UK National Screening Committee recommended targeted lung cancer screening using low-dose CT for people aged 55–74 identified as being at high risk of lung cancer.3

Recommendations for nodule management differ across guidelines internationally,4 but most rely on measuring the diameter or the volume of the nodule to help determine next steps. Many individuals with nodules are placed under regular CT surveillance to assess whether the nodule is growing. Obtaining an accurate manual measurement of nodules can be challenging; nodules present in a wide range of different shapes and sizes. There is evidence of substantial inter-reader and intra-reader variability, and that variability increases the more complex the nodule morphology is.5 In the recently published Dutch–Belgian lung cancer screening trial (NELSON), 9.2% of the CT scans were indeterminate (ie, showed either a solid nodule with a volume of 50–500 mm³, pleural-based solid nodules with a minimal diameter of 5–10 mm or a solid nodule with a non-solid component with a mean diameter of ≥8 mm).6 All these individuals required a repeat CT scan in 3 months to calculate volume-doubling time. As the proportion of people with nodules detected on CT scans is high, the accurate measurement and appropriate management of nodules have significant implications for radiologist time and potential patient anxiety.

Computer-aided detection (CAD) systems for assisting radiologists in reading CT scans, which rely on predefined rules, thresholds and patterns, have been available for many years. They were used in the NELSON trial,6 the UKLS trial,7 the Multicentric Italian Lung Detection trial8 and the ongoing Yorkshire Lung Screening Trial.9 Different types of software using modern forms of artificial intelligence (AI) capable of automatically detecting and measuring pulmonary nodules have become available and could potentially reduce the screening workload and reading time for radiologists. These operate differently to traditional CAD systems; they do not rely on predefined rules and instead learn task-relevant features and generate algorithms from raw input data.

We aimed to examine the accuracy of CE-marked (compliant with relevant European Union regulations), AI-based software used for automated detection and analysis of pulmonary nodules in chest CT scans as part of lung cancer screening. As secondary outcomes, we analysed reading time and the information provided on the impact of AI assistance on Lung CT Screening Reporting & Data System (Lung-RADS) categorisation.

Methods

Protocol and registration

This systematic review is an update of part of a diagnostic technology assessment for the National Institute for Health and Care Excellence.10 The protocol for the original systematic review was registered with PROSPERO. This paper is reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) extension for diagnostic test accuracy studies.11

Data sources

We conducted literature searches on 17–19 January 2022 and updated these on 6 March 2023. The search strategy was based on three themes: lung cancer/nodules, AI and computed tomography/mass screening/early detection of cancer. Databases searched were MEDLINE, Embase, Cochrane Database of Systematic Reviews, Cochrane CENTRAL, Health Technology Assessment (HTA) database (CRD), International HTA database (INAHTA), Science Citation Index Expanded (Web of Science) and Conference Proceedings—Science (Web of Science). Endnote V.20 was used to identify and remove duplicate results.

We searched or reviewed websites of selected conference proceedings, health technology assessment organisations, device manufacturers and devices@FDA between 24 January and 16 February 2022. Forward citation tracking from key publications of included studies was also undertaken in May 2022, using Science Citation Index (Web of Science) and Google Scholar. Details of the search strategies are provided in online supplemental material 1. Reference lists of included studies and recent, relevant systematic reviews identified via the database searches were checked.

Study selection

Two reviewers independently reviewed titles and abstracts of all retrieved records and all potentially eligible full-text publications against inclusion criteria. Disagreements were resolved by consensus or discussion with a third reviewer. Studies were eligible for inclusion if they reported test accuracy of AI-based software for automated detection and analysis of lung nodules from CT images performed for lung cancer screening or secondary outcomes relating to the impact on clinical management and practical implications. We included all AI-based software which had (or was anticipated to have) an appropriate regulatory approval (CE mark) across the UK and the EU by December 2021 and was near-market—that is, with anticipated availability for commercial use by 2023. The reference standard for lung nodule presence/absence was experienced radiologist reading. Lung cancer presence was confirmed by histological analysis of lung biopsy or health record review; lung cancer absence was confirmed by CT surveillance (imaging follow-up) without significant nodule growth or follow-up without lung cancer diagnosis. Eligible outcomes included test accuracy for nodule detection and/or risk categorisation based on size (any nodules, actionable nodules and malignant nodules, respectively), impact on clinical management and practical implications. Eligible study designs were test accuracy studies, randomised controlled trials, cohort studies, historically controlled trials, before–after studies and retrospective multireader multicase (MRMC) studies. We included peer-reviewed papers; conference abstracts and manufacturer data were only included if they were related to an eligible peer-reviewed full-text paper and reported additional outcome data.

We excluded studies that used PET-CT scan images or lung phantom images, or in which fewer than 90% of the images were CT images taken for lung cancer screening. We also excluded studies that used traditional CAD systems without deep learning or that reported no relevant test accuracy or clinical management outcomes, as well as non-human studies. Letters, editorials and communications were excluded unless they reported outcome data not reported elsewhere, in which case they were handled in the same way as conference abstracts. We excluded articles not available in English or published before 2012.

Data extraction and quality assessment

Detailed information related to study design, sampling of patients or CT scan images, AI-based software, reference standard and test accuracy outcomes was collected from each included study. Data allowing construction of 2×2 tables were extracted where possible, to calculate sensitivity and specificity. The unit of analyses (per person or per nodule) and features of detected/missed nodules were noted. Comparative data on the potential or actual impact of AI assistance on clinical management (eg, risk categorisation of lung nodules according to clinical guidelines based on measured nodule sizes) and time required by readers to interpret and report findings of the CT scans were also collected.
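
As a minimal sketch of the calculation described above, sensitivity and specificity can be derived from the four cells of a 2×2 table. The function name and the counts in the usage lines are hypothetical and for illustration only; they are not data from any included study, and this is not the review's own analysis code.

```python
from typing import NamedTuple


class Accuracy(NamedTuple):
    sensitivity: float
    specificity: float


def accuracy_from_2x2(tp: int, fp: int, fn: int, tn: int) -> Accuracy:
    """Sensitivity and specificity from the cells of a 2x2 table.

    tp/fn: reference-standard positives classified positive/negative.
    fp/tn: reference-standard negatives classified positive/negative.
    """
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one positive and one negative case")
    return Accuracy(tp / (tp + fn), tn / (tn + fp))


# Hypothetical counts, for illustration only (not study data).
example = accuracy_from_2x2(tp=80, fp=30, fn=20, tn=170)
print(f"sensitivity={example.sensitivity:.2f}, specificity={example.specificity:.2f}")
```

The same function applies whether the unit of analysis is the person or the nodule, provided all four cells are counted in the same unit.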

One reviewer extracted data into a predesigned electronic data collection form (online supplemental material 2). Data extraction sheets were checked by a second reviewer. Any disagreements were resolved through discussion, with the inclusion of a third reviewer when required. Study quality was assessed independently by two reviewers using QUADAS-212 combined with the QUADAS-C tool for comparative studies,13 tailored to the review question (online supplemental material 3). Assessment of applicability was based on a UK/EU frame of reference. Disagreements were resolved through consensus, with the inclusion of a third reviewer if required.

Data analysis

We focused on comparisons between trained human readers (radiologists or other trained healthcare professionals) assisted by AI-based software and those undertaking unassisted reading of CT scan images as this reflects current use of the technology in clinical practice. Supplementary evidence from other comparisons (ie, performance of stand-alone software vs unassisted reading) or non-comparative test accuracy studies (ie, AI-assisted reading or stand-alone software vs reference standard) was also reported where available. We calculated sensitivities and specificities and presented them in paired forest plots for the detection of any nodules, actionable nodules and malignant nodules. Where data allowed, we plotted our findings in receiver operating characteristic (ROC) space. Given the substantial heterogeneity in study populations, technologies, reader specialty and experiences, reference standards, test accuracy outcomes used and other study design features, no meta-analysis was carried out and findings are summarised narratively. Secondary outcomes such as reading time and impact on Lung-RADS ratings were summarised narratively.
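
To illustrate what plotting in ROC space involves: each sensitivity/specificity pair becomes one point, with the false-positive rate (1 − specificity) on the horizontal axis and sensitivity on the vertical axis. The sketch below is an assumption about how such points could be derived, not the authors' analysis code, and the function name is hypothetical. The input values are the concurrent-AI and second-read-AI estimates (sensitivity 79% vs 80%, specificity 81% vs 82%) reported for one study17 in the Results.

```python
def roc_point(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Map a sensitivity/specificity pair to (false-positive rate, sensitivity)."""
    for value in (sensitivity, specificity):
        if not 0.0 <= value <= 1.0:
            raise ValueError("proportions must lie between 0 and 1")
    # Rounding only tidies floating-point noise in 1 - specificity.
    return (round(1.0 - specificity, 4), sensitivity)


concurrent_ai = roc_point(sensitivity=0.79, specificity=0.81)
second_read_ai = roc_point(sensitivity=0.80, specificity=0.82)
print(concurrent_ai, second_read_ai)
```

Points closer to the top-left corner of ROC space combine higher sensitivity with a lower false-positive rate, which is why paired estimates from the same study are usefully joined by a line.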

Results

Study selection

We retrieved 6330 unique results in January 2022, of which 4886 were published since 2012. Nine records were judged to be relevant,14–22 and two records were identified from other sources.23 24 Update searches in March 2023 yielded an additional 1687 results; only one was identified as potentially eligible,25 but it was subsequently excluded. Eleven studies were, therefore, included (see online supplemental material 4 for full PRISMA flow diagram). Reasons for exclusions at full-text level are listed in online supplemental material 5.

Study characteristics

Characteristics of included studies are presented in table 1.14–24 They comprised 19 770 screened participants. There is potential for overlap as some studies may have sampled the same patients while using the same databases. Two studies used data from the Korean Lung Cancer Screening Project15 16 and four studies used US National Lung Screening Trial (NLST) data.18–20 22 Three studies were conducted in the USA.14 18 20 Two studies reported data from the same screening programme in South Korea.15 16 One study was conducted in each of the UK,23 Taiwan17 and China.21 Two studies conducted in the Netherlands and Denmark22 and in South Korea,19 respectively, utilised CT scan images from the US NLST. The remaining reader study was conducted in the Netherlands using ultra-low-dose CT images from Russia.24 Eight studies adopted an MRMC design.17–24 Two of these used unaided reading originally carried out as part of clinical practice for the comparators.21 23 Four studies sampled consecutive patients,15 16 21 23 and six used nodule-enriched samples,17–20 22 24 while the remaining study adopted random sampling.14

Table 1

Characteristics of included studies

Six different AI-based software programs were used in the studies: AI-Rad Companion (Siemens Healthineers),14 AVIEW Lungscreen (Coreline Soft),15 16 24 ClearRead (Riverain Technologies),17 18 20 InferRead CT Lung (Infervision),21 VUNO Med LungCT AI (VUNO)19 and Veolity (MeVis).22 23

Risk of bias and applicability

The evidence is of low quality. There were problems in most studies in almost all domains in terms of risk of bias and applicability, given the design and operationalisation of the studies and our UK/EU frame of reference (table 2 and online supplemental material 6). Risk of bias according to QUADAS-C was considered ‘high’ in three or more domains in five of the eight comparative studies.17 18 20 23 24 These issues included no consecutive or random sampling, test set laboratory studies in which radiologist behaviour is known to differ from clinical practice,26 unpaired design (before/after study or different radiologists with and without AI) and/or suboptimal or biased reference standard.

Table 2

Limitations of the included studies

Test accuracy

AI-assisted reading versus unaided reading

Eight studies reported on AI-assisted reading, where AI-based software was used concurrently (seven studies15 18–21 23 24) or in addition sequentially (also referred to as ‘second-read AI’)17 to re-interpret images.

One study (described later) compared AI-assisted radiographers (without prior experience in thoracic CT reporting) with unaided, experienced radiologists.23 Across all remaining seven studies, the addition of concurrent AI to trained radiologists increased sensitivity and decreased specificity compared with unaided, trained radiologists. Two studies reported detection of actionable nodules (range: +5% to +13% for sensitivity; −3% to −6% for specificity)18 20 and one reported detection of malignant nodules (+15% for sensitivity, −6% for specificity).18 Two studies reported detection of lung cancer through Lung-RADS category ≥3 (range: +3% to +7% for sensitivity; −8% to −6% for specificity),15 19 see figure 1 and online supplemental material 7. Concurrent AI assistance also increased sensitivity (+20%) and decreased specificity (−7%) in nodule measurement and categorisation using a volume cut-off of 100 mm³.24 For detection of nodules of any size, including nodules too small to be considered clinically actionable, radiologists’ sensitivity was increased with concurrent AI use (range: +16% to +56%), with an unclear impact on specificity (range: −3% to +4%).17 21 One of these studies17 evaluated both concurrent AI and second-read AI and found very similar sensitivity (79% vs 80%) and specificity (81% vs 82%), see online supplemental material 7 and 8.

Figure 1

Accuracy of readers (nodule detection; nodule categorisation based on volume measurement; or nodule detection plus risk categorisation and recall decision for lung cancer diagnosis) both with and without concurrent AI use (seven studies with comparative data). Estimates connected with a line are from the same study. 1 Zhang et al21; 2 Hsu et al17; 3 Lo et al18; 4 Singh et al20; 5 Lancaster et al24; 6 Hwang et al15; 7 Park et al.19 *Data from Hall et al23 are not presented as the study compared AI-assisted reading by radiographers against unaided radiologists, which differed in nature from the other studies. AI, artificial intelligence; Lung-RADS, Lung CT Screening Reporting & Data System.

For illustrative purposes (ie, the examples given here are plausible but hypothetical, given that test accuracy often changes as the screened population and disease prevalence vary, and the data were based on individual studies that used different AI software), if the changes in sensitivity and specificity for the detection of malignant nodules with concurrent AI assistance were in the range of those observed in the large screening programme reported by Hwang et al15 or in the MRMC study by Lo et al,18 and if the prevalence of lung cancer among the screening population was similar to that observed in the NELSON trial (ie, 0.5%),6 AI assistance would allow an additional 150–750 people attending screening with cancers to be detected but an additional 59 700 to 79 600 people attending screening without cancer would be placed on CT surveillance and/or further investigations per million people screened (equivalent to a reduction in positive predictive value of screening from 5% to 3%15 or from 3% to 2%,18 respectively; online supplemental material 9).
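
The arithmetic behind these per-million figures can be sketched as follows, assuming the changes in sensitivity and specificity are applied as absolute percentage-point differences to people with and without cancer, respectively. The function name is hypothetical, and pairing the extremes of the reported ranges (+3% to +15% for sensitivity, −6% to −8% for specificity) is a simplification that does not attribute particular values to particular studies.

```python
def per_million_impact(prevalence: float, sens_gain: float, spec_loss: float,
                       screened: int = 1_000_000) -> tuple[int, int]:
    """Extra cancers detected and extra false positives per `screened` people.

    sens_gain and spec_loss are absolute changes expressed as proportions
    (for example 0.03 for a 3 percentage point change).
    """
    with_cancer = screened * prevalence
    without_cancer = screened - with_cancer
    extra_detected = round(with_cancer * sens_gain)
    extra_false_positives = round(without_cancer * spec_loss)
    return extra_detected, extra_false_positives


PREVALENCE = 0.005  # 0.5%, as observed in the NELSON trial

# Extremes of the reported ranges (an illustrative pairing, see lead-in).
low_gain, low_loss = per_million_impact(PREVALENCE, sens_gain=0.03, spec_loss=0.06)
high_gain, high_loss = per_million_impact(PREVALENCE, sens_gain=0.15, spec_loss=0.08)
print(low_gain, high_gain)   # 150 750
print(low_loss, high_loss)   # 59700 79600
```

Because only 5000 of every million people screened have cancer at this prevalence, even a small loss of specificity among the remaining 995 000 produces far more false positives than the sensitivity gain produces additional detections.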

Impact on Lung-RADS categorisation

Three MRMC studies provided comparative data on the impact of AI assistance on Lung-RADS categorisation of nodules.19 20 22 The proportion of actionable nodules identified (Lung-RADS categories 3–4) was higher when images were assessed with AI assistance in all three studies (66% vs 53%,22 34.2% vs 28.5%,19 55% vs 50%20). However, no reference standards were used, so it is not possible to know whether the additional actionable nodules were malignant.

Impact on CT scan reading time

Three comparative MRMC studies reported on the impact of AI assistance on reading times.18 22 23 Reading times were significantly faster with AI assistance compared with unaided readers: median 86 (IQR 51–141) seconds vs 160 (IQR 96–245) seconds (p<0.001)22 and mean 98.0 seconds vs 132.3 seconds per case (p<0.01)18 for radiologists, and median 3 (IQR 2–5) and 5 (IQR 4–8) min for radiographers using AI in a laboratory (ie, non-clinical) setting vs 10 (IQR 5–15) min for radiologists (unassisted reading in clinical practice).23

Other methods of using AI (stand-alone AI and supporting less experienced staff)

Studies have also investigated other ways of using AI (comparing stand-alone AI with no human input with unaided radiologists, or using AI to support less trained staff) or have reported non-comparative evidence (eg, AI-assisted reading or unaided reading compared with a reference standard). These are presented in online supplemental material 8.

Across studies and outcomes, stand-alone AI was associated with the highest sensitivity (range 58%–100%) but lowest specificity (62%–82%) when compared with AI-assisted radiologist reading (sensitivity 71%–99%, specificity 74%–97%) and/or unaided radiologist reading (sensitivity 43%–94%, specificity 63%–97%) (online supplemental material 8).18–20 24

One study investigated whether AI assistance would support radiographers to match the accuracy of radiologists.23 Experienced radiologists were more sensitive (91% vs 71%) and specific (97% vs 92%) for detecting and categorising actionable nodules than AI-assisted reading by radiographers (without prior experience in thoracic CT reporting) (online supplemental material 8). Furthermore, decisions of experienced, unaided radiologists (made during clinical practice) were consistent with British Thoracic Society guidance 71.6% of the time, while the decisions of two radiographers with AI assistance in a laboratory setting were consistent with the guidance 39.7% and 60.7% of the time, respectively.

Discussion

Summary of clinical context

Targeted lung cancer screening programmes are being set up in many countries due to strong randomised controlled trial (RCT) evidence that screening leads to a reduction in lung cancer-specific mortality. This will, however, place enormous pressure on already over-stretched healthcare systems, particularly in terms of scanner capacity and radiologist time. Different types of software using AI-derived algorithms have become available and could potentially reduce the screening workload and reading time for radiologists. Such software, however, also has the potential to cause patient harm or create further workload for radiologists, and evidence is required to determine its performance in a screening context. Here, we have reported the results of a systematic review, synthesising the available evidence on the accuracy of such software and its impact on reading time and clinical management.

Statement of principal findings

Our searches yielded 6573 publications, from which 11 heterogeneous studies, reporting on nearly 20 000 patients from six different countries and using six different AI-based software systems, were included. All 11 studies were at high risk of bias with multiple applicability concerns. We used a narrative approach to summarise our results, finding that AI-assisted reading was faster and generally improved sensitivity (range: +5% to +20% for detecting/categorising actionable nodules; +3% to +15% for detecting/categorising malignant nodules), with lower specificity (range: −7% to −3% for correctly detecting/categorising people without actionable nodules; −8% to −6% for correctly detecting/categorising people without malignant nodules) compared with unaided reading. AI assistance tended to increase the proportion of nodules allocated to higher risk categories. If these findings were replicated in a population of a million people attending screening, the impact of AI would be an extra 150–750 cancers detected at the cost of 59 700–79 600 people receiving unnecessary surveillance, reducing positive predictive value.

Strengths and limitations

Our searches were extensive but limited by date (January 2012–March 2023). The 2012 cut-off was introduced after discussion with experts who considered that our definition of AI would not include systems introduced or tested prior to that date. Our searches are also limited to studies published in the English language although this is unlikely to have biased our findings.27 28 We aimed to include all AI-based software, which had (or was anticipated to have) appropriate regulatory marking (CE mark) across the UK and the EU, with anticipated availability for commercial use by 2023. However, our searches were inclusive, and we were unlikely to have omitted significant studies from our research because of this inclusion criterion.

QUADAS-2 was used independently by two reviewers12 combined with the QUADAS-C tool for comparative studies,13 which we tailored to the review question to assess risk of bias and applicability. Almost all the studies fell short in key elements of quality, including patient selection, definition of reference standard, index test and flow and timing. The studies we identified were extremely heterogeneous, using six different AI-based software systems and coming from at least six different countries, where the epidemiology of lung cancer, training of radiologists and experience of use of CT screening for lung cancer differ substantially. Therefore, we undertook a narrative review and plotted our findings in ROC space; had it been possible, meta-analysis would have allowed more precise estimates of the accuracy of the addition of AI-based software to CT lung cancer screening. We acknowledge that the potential benefit of AI assistance (150–750 additional lung cancers detected in a screened population of a million people) will depend on the prevalence of lung cancer in the cohort and as such is not generalisable to other populations at higher or lower risk. In addition, software derived from AI potentially allows continuous improvement of performance through learning from expanding sources of data. Although the software products evaluated in our review did not involve learning from data in real time, companies may refine their software by retraining their AI models with new datasets and then update the AI-derived algorithms used in the software periodically. Published evaluations on the performance of AI-based software in screening are, therefore, only a snapshot and could be outdated by the time they are published, and our findings might not completely reflect systems that are currently available.

The AI software that we evaluated only processed and utilised data from CT scan images to enhance nodule segmentation, detection and measurement that underpin current practice based on contemporary guidelines. Use of AI software to combine and interrogate additional morphological data from scan images (radiomics) along with a wide range of demographic, histological, proteomic and genomic data for prediction of nodules that are malignant is an area of very active research. These advances could fundamentally change clinical practice in the future. Nevertheless, it is crucial that any claims of improvement in risk stratification and cancer detection with AI software are supported by robust evidence generated from studies with strong designs that address the risk of bias and applicability concerns that we have highlighted.

Strengths and weaknesses versus other studies

We identified 12 previous systematic reviews on the accuracy of AI for lung nodule/cancer detection and/or malignancy risk prediction in medical images. Nine of these were non-comparative and focused on stand-alone AI performance of algorithms that were not commercially available, so were not informative for our review question (references are reported in online supplemental material 10). One rapid review29 was comparative but focused on the accuracy of AI-based software for the classification of lung nodules into benign or malignant, a software function that was not included in our review.

Two reviews30 31 did cover our question but were broader and did not separately report on the screening population or on commercially available software. Li et al31 evaluated the impact of AI on physicians’ performance in detecting various thoracic pathologies on CT and chest X-ray. The review by Ewals et al30 was more relevant but covered not only the screening population but also oncologic, symptomatic or mixed populations, as well as software that was not commercially available. Of our 11 included papers, only one20 was identified in the review by Li et al31 and three17 18 21 in the review by Ewals et al.30 Despite the broader population in the review by Ewals et al, they found a similar pattern of increased sensitivity and reduced specificity with AI use. However, Li et al found that, across all pathologies and both image types, both sensitivity and specificity generally improved when using AI-based devices. In concordance with our review, a faster reading time was reported with concurrent AI use in both previous reviews.30 31

Conclusions and implications for clinicians and policymakers

Our systematic review demonstrates that, when used in population-based lung cancer screening programmes, assistance of AI-based software can increase sensitivity, but at the expense of a reduction in specificity, that is, an increase in false-positive findings. The lung checks in the NHS England Targeted Lung Health Checks programme are already supported by AI,32 and removing AI-based software from existing screening programmes is not a practical policy option. However, the limited available evidence suggests that there is significant scope for improvement in the AI-based software, particularly in specificity. This is particularly important to consider as the screening programme is rolled out in the UK, given the potential increase in false-positive findings and the resulting additional workload for radiologists and anxiety for patients. Furthermore, care must be taken that AI-based software does not contribute to changing disease definitions or referral thresholds as the limited evidence base suggests its measurements and categorisations are more cautious and biased towards greater referral. Finally, more research is needed particularly in clinical settings and around the impact of AI assistance on medical staff with less training. Prospective, comparative, test accuracy studies that measure accuracy of the whole testing pathway with AI assistance integrated in clinical practice and compare it with the accuracy of the pathway without AI assistance are needed.

Ethics statements

Patient consent for publication

Ethics approval

Not applicable.

Acknowledgments

We thank Pearl Pawson, Eileen Taylor and Sarah Abrahamson for their managerial and administrative support.

Footnotes

  • JG and AA are joint first authors.

  • X @Asra_aaa, @GhiasvandHesam, @yfchen12, @amylougrove

  • Contributors JG, AA, CS and Y-FC undertook the review with assistance from SJ and HG. AB devised the search strategy and undertook the searches in discussion with the other authors. EJH provided clinical advice. DG and MP provided statistical advice. AA, JG, CS, DT, PA, AT, AG, BS, AC, ST-P and Y-FC contributed to the conception of the work and interpretation of the findings. Y-FC, AC, ST-P, BS, JG and CS drafted the manuscript. All authors critically revised the manuscript and approved the final version. Y-FC takes responsibility for the integrity and accuracy of the data analysis. Y-FC acts as guarantor. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

  • Funding This review was funded by the UK National Institute for Health and Care Research (NIHR) Evidence Synthesis Programme (NIHR135325). The funder had no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; and in the decision to submit the article for publication. DT and AG are partly funded by the NIHR West Midlands Applied Research Collaboration. STP is funded by the NIHR through a research professorship (NIHR302434). AG is supported by a NIHR Fellowship (NIHR300060).

  • Disclaimer The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR, or the Department of Health and Social Care.

  • Competing interests All authors have completed the ICMJE uniform disclosure. All authors involved in Warwick Evidence are wholly or partly funded by the NIHR. STP and AG are funded by the NIHR on personal fellowships. STP serves as Chair of the UK National Screening Committee Research and Methodology group, but this work is independent research not associated with that role.

  • Provenance and peer review Not commissioned; externally peer-reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.