Original article
Using socio-demographic and early clinical features in general practice to identify people with lung cancer earlier
  1. Barbara Iyen-Omofoman1,
  2. Laila J Tata1,
  3. David R Baldwin2,
  4. Chris JP Smith1,
  5. Richard B Hubbard1,2,3
  1. 1Department of Epidemiology and Public Health, University of Nottingham, Nottingham, UK
  2. 2Nottingham University Hospitals NHS Trust, Nottingham, UK
  3. 3Respiratory Biomedical Research Unit, University of Nottingham, Nottingham, UK
  1. Correspondence to Dr Barbara Iyen-Omofoman, Department of Epidemiology and Public Health, University of Nottingham, Clinical Sciences Building, City Hospital, Nottingham NG5 1PB, UK; barboiyen{at}


Introduction In the UK, most people with lung cancer are diagnosed at a late stage when curative treatment is not possible. To aid earlier detection, the socio-demographic and early clinical features predictive of lung cancer need to be identified.

Methods We studied 12 074 cases of lung cancer and 120 731 controls in a large general practice database. Logistic regression analyses were used to identify the socio-demographic and clinical features associated with cancer up to 2 years before diagnosis. A risk prediction model was developed using variables that were independently associated with lung cancer up to 4 months before diagnosis. The model performance was assessed in an independent dataset of 1 826 293 patients from the same database. Discrimination was assessed by means of a receiver operating characteristic (ROC) curve.

Results Clinical and socio-demographic features that were independently associated with lung cancer were patients’ age, sex, socioeconomic status and smoking history. From 4 to 12 months before diagnosis, the frequency of consultations and symptom records of cough, haemoptysis, dyspnoea, weight loss, lower respiratory tract infections, non-specific chest infections, chest pain, hoarseness, upper respiratory tract infections and chronic obstructive pulmonary disease were also independently predictive of lung cancer. On validation, the model performed well with an area under the ROC curve of 0.88.

Conclusions This new model performed substantially better than the current National Institute for Health and Clinical Excellence referral guidelines and all comparable models. It has the potential to predict lung cancer cases sufficiently early to make detection at a curable stage more likely by allowing general practitioners to better risk stratify their patients. A clinical trial is needed to quantify the absolute benefits to patients and the cost effectiveness of this model in practice.

Key messages

What is the key question?

  • Can the early records of patients with lung cancer in general practice be used to develop a predictive model that will aid earlier identification of patients with lung cancer?

What is the bottom line?

  • A model developed using a combination of patients’ socio-demographic and clinical features was found to be predictive of lung cancer 4–12 months before diagnosis and outperformed the current National Institute for Health and Clinical Excellence referral guidelines and all comparable models.

Why read on?

  • The model developed and validated in this study is the first risk-prediction model for lung cancer that combines patients’ baseline characteristics with early clinical features while excluding records made in the months before diagnosis, when general practitioners had already initiated investigations for suspected lung cancer. Application of this model in practice should lead to earlier identification and an improved prognosis for patients with lung cancer in general practice.


Lung cancer is the most common cancer and the leading cause of cancer deaths worldwide.1 Survival from lung cancer is known to vary across Europe2 and for patients in the UK, survival is lower than other comparable countries.3–5 Delays in diagnosis are thought to contribute to this problem.3 ,4 Since curative treatments for lung cancer are only available for the minority of people with cancers diagnosed in the early stages,6 any change that results in earlier diagnosis is a priority. The National Awareness and Early Diagnosis Initiative established in 2008 has set up programmes to increase public awareness of symptoms of lung cancer.7 There are currently no widely available screening tests for lung cancer, although several randomised controlled trials are ongoing,8–11 one of which has shown a 20% reduction in mortality. There are also no clinical predictive models currently available that have been demonstrated to detect lung cancer at a stage early enough to improve clinical outcomes.

Most patients with lung cancer experience at least one symptom before diagnosis.12 In a study of 22 people with recently diagnosed lung cancer, symptoms were recalled starting between 4 months and 2 years before diagnosis.13 In the UK, the general practitioner (GP) acts as the gatekeeper to specialised healthcare and most people present with symptoms to their GP before the diagnosis of lung cancer is made.13 ,14 A case–control study of 247 cases of lung cancer and 1235 controls from 21 practices in Exeter, UK showed that haemoptysis, dyspnoea, abnormal spirometry and smoking were independently associated with lung cancer up to 180 days before diagnosis.14 The National Institute for Health and Clinical Excellence (NICE) referral guidelines15 have provided a big step forward in aiding earlier diagnosis of lung cancer by facilitating urgent referral of suspected lung cancer cases; however, the guidelines were not designed to specifically improve patient ascertainment and were not based on a strong evidence base.13 ,16 ,17 Because many lung cancer symptoms are non-specific, GPs need help to estimate the risk of lung cancer by taking into account a combination of socio-demographic features and clinical symptoms.

Although several risk prediction models have been developed to estimate the risk of lung cancer,18–21 only one algorithm has been developed using a combination of patients’ baseline risk factors and symptoms in primary care.22 However, this model did not exclude symptoms in the period preceding lung cancer diagnosis when patients would likely be undergoing investigations for suspected cancer and it may therefore be limited in its ability to identify patients with lung cancer early enough to result in an improvement in outcome.

The aim of this study is to develop and validate a lung cancer risk prediction model that could be used to aid earlier diagnosis in general practice by identifying the socio-demographic factors and the pattern and frequency of symptoms and clinical investigations prior to diagnosis.


The general practice data used in this study were from The Health Improvement Network23 (THIN), a large nationally representative database of general practice records in the UK. Over 95% of the UK population is registered with a GP and general practices in THIN are broadly representative of general practices across the UK in terms of the patient demographics, geographical distribution and practice size.24 THIN has a high level of completeness of lung cancer data and the characteristics of patients with lung cancer in THIN are representative of the UK lung cancer population.25 At the time of this study, THIN had data from 446 UK general practices with a total of 8.2 million patients. To derive the lung cancer risk-prediction model, we identified all incident cases of lung cancer diagnosed between 1 January 2000 and 28 July 2009 (Read code list available). Patients who had less than 1 year of active records prior to their first diagnosis of lung cancer were removed to exclude prevalent cases. Since lung cancer is rare in patients younger than 40 years of age, these patients were also excluded. For each case, 10 randomly selected controls were identified. Controls were registered in the same general practice as the case, with at least 1 year of active data, and they were aged 40 years or older at the time of lung cancer diagnosis in their practice-matched case.
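The control selection described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' extraction code: the record fields (`practice_id`, `lung_cancer`, `years_of_records`, `age_at_index`) and the function name are hypothetical, and `age_at_index` is assumed to already hold each candidate's age at the case's diagnosis date.

```python
import random

def select_controls(case, candidates, n_controls=10, rng=None):
    """Randomly pick up to n_controls eligible controls for one case.

    All field names are hypothetical; they are not taken from THIN.
    """
    rng = rng or random.Random()
    eligible = [
        p for p in candidates
        if p["practice_id"] == case["practice_id"]  # same general practice
        and not p["lung_cancer"]                    # no lung cancer record
        and p["years_of_records"] >= 1              # at least 1 year of data
        and p["age_at_index"] >= 40                 # aged 40+ at case diagnosis
    ]
    # A case may have fewer eligible controls than requested.
    return rng.sample(eligible, min(n_controls, len(eligible)))
```

Returning fewer than ten controls when fewer are eligible mirrors the situation reported in the results, where one case had only a single eligible control.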

The variables analysed were 5-year age band, sex, socioeconomic status (Townsend deprivation quintiles) and smoking history. Smoking records made within 6 months preceding lung cancer diagnosis were excluded to account for a possible change in cigarette consumption in the months leading up to diagnosis. Patients were categorised as current smokers, ex smokers or non-smokers. Based on the highest ever recorded number of cigarettes smoked daily, the smoking records of current or ex smokers were further categorised as trivial (less than one cigarette daily), light (1–9 cigarettes daily), moderate (10–19 cigarettes daily), heavy (20–39 cigarettes daily) or very heavy (more than 40 cigarettes daily). Smokers who had no records of daily cigarette consumption were recorded as such and patients who had no recorded smoking information were coded in a separate category.
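As a minimal sketch, the smoking intensity bands above can be written as a lookup on the highest recorded daily consumption. The function name and category labels are illustrative. The text leaves exactly 40 cigarettes daily unassigned (heavy is 20–39, very heavy is more than 40), so placing 40 in the very heavy band is an assumption made here.

```python
def smoking_intensity(max_cigarettes_per_day):
    """Map the highest recorded daily consumption to an intensity band.

    Returns "quantity not recorded" when a smoker has no consumption record.
    """
    if max_cigarettes_per_day is None:
        return "quantity not recorded"
    if max_cigarettes_per_day < 1:
        return "trivial"
    if max_cigarettes_per_day < 10:
        return "light"
    if max_cigarettes_per_day < 20:
        return "moderate"
    if max_cigarettes_per_day < 40:
        return "heavy"
    # Assumption: exactly 40 per day is treated as very heavy.
    return "very heavy"
```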

Symptoms that were analysed in cases and controls were those detailed in the NICE guidelines15 (box 1). In addition, we assessed the six most common symptoms and diagnoses recorded in the records of patients with lung cancer prior to their diagnoses. These were upper and lower respiratory tract infections (URTI and LRTI), non-specific chest infections, constipation, depressive disorders and chronic obstructive pulmonary disease (COPD). Records of chest x-rays, blood tests and number of general practice consultations for symptoms other than those already assessed were also identified.

Box 1

Clinical features for which urgent referral for a chest x-ray should be offered for suspected lung cancer15

  • Haemoptysis

  • Any of the following unexplained or persistent symptoms or signs:

    •  Cough

    •  Chest/shoulder pain

    •  Dyspnoea

    •  Weight loss

    •  Hoarseness

    •  Finger clubbing*

    •  Features suggestive of metastasis from lung cancer*

    •  Cervical/supraclavicular lymphadenopathy*

  • *Clinical features not analysed in study.

All symptoms, diagnoses and investigations over the 2-year period before lung cancer diagnosis (or matched date) were identified. Since a chest x-ray is the initial investigation for suspected lung cancer,15 we examined the timing of chest x-rays prior to lung cancer diagnosis and found a steep increase in the chest x-ray frequency in cases (but not controls) within the 4 months prior to diagnosis; so all symptoms, blood tests and other general practice consultations recorded within this period were excluded.

To determine the independent early predictors for lung cancer, univariate logistic regression models were first used to calculate ORs. These analyses for symptoms, diagnoses, blood tests and GP consultations were done separately for records made in the 4–12 month and 13–24 month periods prior to diagnosis. Multivariate modelling was done using only variables that were associated with lung cancer in univariate analyses, using a statistical significance cut-off level of p<0.05. Variables that were not statistically significant in the multivariate analysis were removed from the model and those that previously showed no association with lung cancer in the univariate model were rechecked for significance in the final model. In developing the risk probabilities for lung cancer, we weighted each variable according to the strength of its association in the multivariate logistic regression model. Then, applying the method used to develop the Thoracic Surgery Scoring System (Thoracoscore),26 we used the β-coefficient values (log OR) from the multivariate model to compute aggregate scores for individual patients.
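For a single binary predictor, the OR from a univariate logistic regression equals the cross-product ratio of the 2×2 table, and its Wald confidence interval uses the standard error of the log OR. The sketch below shows that calculation in Python. The analyses in this study were run in Stata, the function name is illustrative, and the counts in the usage comment are hypothetical, not study data.

```python
import math

def odds_ratio_with_ci(exposed_cases, unexposed_cases,
                       exposed_controls, unexposed_controls, z=1.96):
    """Crude odds ratio and Wald 95% CI from a 2x2 table (all cells > 0)."""
    odds_ratio = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls)
    # Standard error of log(OR): square root of the summed reciprocal counts.
    se_log_or = math.sqrt(1 / exposed_cases + 1 / unexposed_cases
                          + 1 / exposed_controls + 1 / unexposed_controls)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts: 20 of 100 cases and 10 of 100 controls had the symptom.
# odds_ratio_with_ci(20, 80, 10, 90)
```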

Validation of the model was carried out on a cohort of all THIN patients who were 39 years or older and free from lung cancer on 29 July 2009. Eligibility in this cohort was limited to patients who had at least 1 year of general practice follow-up. Each person was given a lung cancer risk probability score on the basis of their records. The actual number of incident lung cancer cases within the year after 29 July 2009 was identified and then the performance of the model was assessed by comparing the sensitivity and specificity at different cut-offs. Additionally, a comparison of the sensitivity and specificity of this model with those of the NICE guideline symptoms was made. The discriminatory power of the model was assessed by means of a receiver operating characteristic (ROC) curve and an area under the curve (AUC) calculation.
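The sensitivity and specificity at a given cut-off can be computed as in the sketch below. It assumes that a patient is flagged when their score is at or above the cut-off and that both outcome groups are non-empty. The function name and the example data in the usage comment are illustrative only.

```python
def sensitivity_specificity(scores, outcomes, cutoff):
    """Sensitivity and specificity when flagging scores >= cutoff.

    scores   : predicted risk for each patient
    outcomes : 1 if lung cancer was diagnosed during follow-up, else 0
    """
    tp = fn = fp = tn = 0
    for score, outcome in zip(scores, outcomes):
        flagged = score >= cutoff
        if outcome:
            if flagged:
                tp += 1
            else:
                fn += 1
        else:
            if flagged:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Made-up example: five patients, two of whom developed lung cancer.
# sensitivity_specificity([0.9, 0.8, 0.3, 0.2, 0.1], [1, 0, 1, 0, 0], 0.5)
```

Repeating the calculation over a range of cut-offs gives the points from which a ROC curve is drawn.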

All analyses were performed using Stata release SE11 and the study protocol was approved in 2009 by the Cegedim Strategic Data Medical Research Scientific Review Committee.


We identified 12 135 incident cases of lung cancer. After excluding 59 patients who were under 40 years old at the time of diagnosis (0.49%), 12 076 cases remained. Of these, 12 073 were matched with 10 controls each, one had only one eligible control, and two had no eligible controls and were excluded, giving a total of 12 074 cases and 120 731 controls. The average follow-up time prior to diagnosis was similar in the cases and controls: 9.5 years (IQR 5.5–13.5 years) and 9.1 years (IQR 5.2–13.2 years) respectively. Compared with controls, people with lung cancer were more likely to be older men, to live in households located in more deprived areas and to be current or ex smokers (table 1).

Table 1

Social, demographic and lifestyle characteristics of lung cancer cases and controls

A plot of the chest x-ray frequency among cases leading up to lung cancer diagnoses showed a stable pattern up to the fourth month preceding diagnosis. However, after this, there was a steep increase, implying that investigations for lung cancer were initiated by GPs (figure 1).

Figure 1

Frequency distribution of chest x-rays among cases in general practice, 12 months prior to lung cancer diagnoses. The plot for frequency of chest x-rays in controls is not shown but the pattern was consistent over the 12-month period and overall only 4% of controls had chest x-rays performed within the 12 months.

Analysis of the symptoms, diagnoses, blood tests and other GP consultations in the 4–12 month and 13–24 month periods preceding lung cancer diagnosis (table 2) showed greater ORs for lung cancer with all the symptoms recorded in the 4–12 month period than in the 13–24 month period. Furthermore, graphically, the increase in symptom presentations in cases occurred in the year before diagnosis (plot not shown), so the remaining analyses focused only on the 4–12 month period.

Table 2

Symptoms, blood investigations and number of general practice consultations recorded among cases and controls in the 4–12 and 13–24 month periods prior to lung cancer diagnosis

The symptoms with the highest frequency among cases were cough, non-specific chest infections, dyspnoea, chest pain and COPD. Although haemoptysis records were made for only 2% of cases in the 4–12 months before diagnosis, the OR for lung cancer among people who had haemoptysis in this period was 20.15 (95% CI 16.24 to 25.01). Compared with controls, people with lung cancer consulted their GPs for other symptoms more often before diagnosis. Using fewer than 10 consultations as a reference value, the OR for cases to consult their GPs 21 times or more was 4.45 (95% CI 4.24 to 4.68) in the 4–12 months before diagnosis. There were also more blood investigations among cases than controls, with an increase in the number of normal and abnormal test results.

Development of the lung cancer risk model

Our model was developed using the independent predictors of lung cancer in the 4–12 month period before diagnosis (table 3). Variables that were independently associated with lung cancer and included in the final model were age, sex, Townsend deprivation quintiles, smoking (status and highest daily cigarette consumption), number of other GP consultations, and symptom presentations of cough, haemoptysis, dyspnoea, weight loss, LRTI, non-specific chest infections, COPD, chest/shoulder pain, voice hoarseness and URTI. Constipation, depression and blood tests were not independently associated with lung cancer. The odds of lung cancer increased with increasing age, male sex, greater socioeconomic deprivation and higher daily cigarette consumption. The association with daily cigarette consumption was stronger among current smokers than ex smokers. Haemoptysis and weight loss were relatively uncommon symptoms among lung cancer cases but they were associated with the greatest risk of lung cancer.

Table 3

Multivariate model of factors associated with lung cancer 4–12 months before diagnosis

Using β-coefficient values derived from multivariate logistic regression (shown in table 3), aggregate risk probabilities were computed for individual patients in the dataset using the equation: [equation image not reproduced]
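The equation itself is not reproduced in this copy. The sketch below therefore assumes the usual inverse-logit form for turning logistic regression coefficients into a probability: the sum of the β-coefficients for the features a patient has, plus an intercept, passed through the logistic function. The function name, feature names and coefficient values are hypothetical and are not the values in table 3.

```python
import math

def risk_probability(intercept, coefficients, patient_features):
    """Inverse-logit of the intercept plus the betas of the features present.

    coefficients     : mapping of feature name -> beta (log OR)
    patient_features : mapping of feature name -> True/False
    """
    linear_predictor = intercept + sum(
        beta for name, beta in coefficients.items()
        if patient_features.get(name, False)
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Hypothetical coefficients, for illustration only.
example_betas = {"haemoptysis": 2.0, "cough": 0.5, "current_smoker": 1.5}
example_patient = {"cough": True, "current_smoker": True}
example_risk = risk_probability(-6.0, example_betas, example_patient)
```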

The validation cohort comprised 1 826 293 patients in THIN who had no history of lung cancer up to the 29 July 2009 and with at least 1 year of follow-up data before and after 29 July 2009. There were 939 299 women (51.4%) and 886 994 men (48.6%). A total of 1728 incident cases of lung cancer (0.09% of the cohort) were identified during the 1-year of follow-up from 29 July 2009.

Risk probability scores were computed for all individuals in the dataset using the β-coefficient values derived from the logistic regression model. The number of patients identified by the score at different cut-off values, and the sensitivity and specificity of the risk model at the cut-off values are shown in table 4.

Table 4

Performance of the risk model at different cut-off values in the validation population (n=1 826 293)

Table 5 shows, for different symptoms in the NICE guidelines, the number of patients in the validation cohort who will require a chest x-ray, the number of true positives identified and the sensitivity and specificity of the guideline symptoms in predicting lung cancer risk. Using haemoptysis alone as a trigger for chest x-rays, only 24 cases of lung cancer in the cohort population can be detected. Using the most commonly reported symptom, cough, as a trigger for investigations, 175 290 patients are identified to be at risk of lung cancer and 413 of these will be diagnosed with lung cancer. Therefore, to identify a number of true positives comparable to that of the lung cancer risk model, the NICE symptoms require more patients to undergo chest x-rays. For example, at a cut-off to identify 610 cases of lung cancer in the validation cohort, the risk model identified 72 883 patients at high risk of lung cancer for whom chest x-ray investigations are indicated (119 chest x-rays per identified case), yet using a weighted combination of all the NICE symptoms, a total of 305 137 patients will have to undergo chest x-ray investigations to identify 724 cases of lung cancer (421 chest x-rays per identified case).
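The chest x-rays per identified case quoted above follow directly from the counts in the text, as the short calculation below shows. The variable and function names are illustrative. Note that the two strategies are compared at similar, not identical, numbers of detected cases (610 and 724).

```python
def xrays_per_case(patients_flagged, true_positives):
    """Number of chest x-rays needed per lung cancer case identified."""
    return patients_flagged / true_positives

# Counts taken from the text above.
MODEL_XRAYS_PER_CASE = xrays_per_case(72_883, 610)   # risk model
NICE_XRAYS_PER_CASE = xrays_per_case(305_137, 724)   # weighted NICE symptoms
COUGH_XRAYS_PER_CASE = xrays_per_case(175_290, 413)  # cough alone

# Fraction of the NICE workload that the risk model needs per detected case.
MODEL_TO_NICE_RATIO = MODEL_XRAYS_PER_CASE / NICE_XRAYS_PER_CASE

print(round(MODEL_XRAYS_PER_CASE), round(NICE_XRAYS_PER_CASE),
      round(MODEL_TO_NICE_RATIO, 2))
```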

Table 5

Sensitivity and specificity of NICE guideline symptoms alone in validation population (n=1 826 293)

The ROC curve obtained from the application of the risk model in the validation cohort is shown in figure 2. The AUC is 0.88. Using a weighted combination of the NICE guideline symptoms alone to identify patients at high risk of lung cancer, the area under the ROC curve was 0.64.

Figure 2

Receiver operating characteristic (ROC) curve for the lung cancer risk prediction model. The area under the curve is 0.88. The diagonal line represents the discrimination expected by chance alone.
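The area under a ROC curve can be read as the probability that a randomly chosen case receives a higher score than a randomly chosen non-case, with ties counted as one half. The sketch below computes the AUC from that definition. It compares every case with every non-case, so it suits small illustrative datasets rather than a cohort of this size. The function name and the example data are hypothetical.

```python
def auc_by_pairs(scores, outcomes):
    """AUC as the share of case/non-case pairs ranked correctly (ties = 0.5)."""
    case_scores = [s for s, y in zip(scores, outcomes) if y]
    noncase_scores = [s for s, y in zip(scores, outcomes) if not y]
    concordant = 0.0
    for c in case_scores:
        for n in noncase_scores:
            if c > n:
                concordant += 1.0
            elif c == n:
                concordant += 0.5
    return concordant / (len(case_scores) * len(noncase_scores))

# Made-up example: two cases and three non-cases.
# auc_by_pairs([0.9, 0.8, 0.3, 0.2, 0.1], [1, 0, 1, 0, 0])
```

A value of 0.5 corresponds to the diagonal chance line in the figure, and 1.0 to perfect separation of cases from non-cases.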


We used a combination of patients’ socio-demographic and clinical records in general practice to develop a lung cancer risk prediction model which can be used by GPs to aid earlier identification of patients at high risk of lung cancer. On validating this model in an independent dataset, it performed well and showed good discrimination, with an area under the ROC curve of 0.88.

The lung cancer risk prediction model was developed using the THIN database, which has previously been validated against other UK national lung cancer databases.25 By incorporating information that is routinely collected and therefore readily available to GPs, application of the risk model from this study could allow an easy and practical means of identifying general practice patients who are at risk of lung cancer, at no extra cost to GPs. We excluded records made in the 4 months prior to diagnosis of lung cancer to avoid symptoms, diagnoses and investigations attributable to lung cancer rather than predictive of it, and we focused on the 4–12 month period because symptom records by cases increased in the 12 months before diagnosis. This ensured that our model would predict lung cancer at an earlier stage.

Some relevant information is not reliably recorded in THIN (occupational exposure to carcinogens such as asbestos) and so could not be included in the model. Although inclusion of these variables may improve the performance of the model, the validation analyses using the currently available variables have shown good discrimination and the model performed substantially better than the current NICE guidelines15 when validated in an independent dataset. With further improvements in general practice data recording, a review of this model will be warranted to reflect more accurate lung cancer prediction. Another limitation in this study was the unavailability of information on cigarette pack-years for defining patients’ lifetime cigarette exposure. As a proxy, we categorised patients’ exposure to cigarette smoke using the highest recorded quantity of cigarettes smoked daily, which allowed us to classify patients’ worst possible estimate of daily consumption. The results from analyses using these categories fit broadly with findings from the literature. Nevertheless, these pragmatic categorisations are using the information that would be available to GPs in standard practice for assessing their patients’ risk.

Validation of the risk model in an independent cohort showed that a considerable number of patients need to undergo chest x-ray investigations to diagnose lung cancer cases. This is unsurprising considering that lung cancer was rare in our validation cohort and was only diagnosed in 1728 patients (0.09% of the population). Positive predictive values are not good measures of model accuracy, particularly with rare outcomes, as they are usually low even with good sensitivity and specificity.27 A similar finding was shown in the randomised Danish lung cancer screening trial in which 980 CT scans were done to identify 69 lung cancer cases.28 However, the model compared quite favourably with the NICE guideline symptoms, requiring only about a quarter as many chest x-rays per lung cancer detected as even a weighted combination of the NICE guideline symptoms.
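The effect of a rare outcome on the positive predictive value can be illustrated with Bayes' rule. The sketch below combines the prevalence in the validation cohort (1728 of 1 826 293) with the sensitivity and false positive rate quoted elsewhere in this article for one cut-off (79.6% and 21.2%). It is an illustrative calculation, not a figure reported by the study, and the names are hypothetical.

```python
def positive_predictive_value(sensitivity, false_positive_rate, prevalence):
    """PPV from sensitivity, false positive rate and prevalence (Bayes' rule)."""
    true_positive_share = sensitivity * prevalence
    false_positive_share = false_positive_rate * (1.0 - prevalence)
    return true_positive_share / (true_positive_share + false_positive_share)

COHORT_PREVALENCE = 1728 / 1_826_293  # about 0.09%

# Even with 79.6% sensitivity, most flagged patients will not have lung cancer.
ILLUSTRATIVE_PPV = positive_predictive_value(0.796, 0.212, COHORT_PREVALENCE)
print(f"{ILLUSTRATIVE_PPV:.2%}")
```

Under these inputs the PPV is well below 1%, which is why the comparison with the NICE symptoms is made in terms of chest x-rays per detected case rather than PPV alone.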

A number of models including the Bach,18 Spitz20 and the Liverpool Lung Project (LLP)21 models have been developed to predict the risk of lung cancer using patients’ baseline risk factors. The Bach model was developed to determine variation in lung cancer risk among current or former smokers aged between 55 and 74 years who were enrolled in a clinical trial of lung cancer prevention.18 Since this model was developed using only data (age, sex, asbestos exposure and smoking history) from individuals with a smoking history, it is only applicable to smokers—a subset of individuals at risk of lung cancer. The expanded Spitz model was developed using information from 725 newly diagnosed cases of lung cancer and 615 healthy controls, on age, smoking history, family history of cancer, occupational exposure to carcinogens, previous history of respiratory disease and biomarker assays. This model is limited in that the biomarker assays included in the model derivation are select markers of host DNA repair capacity which require technical expertise and are not readily available in general practice.

A study that compared the discriminatory power of the Spitz, LLP and Bach models found an AUC statistic of 0.69 in the Spitz and LLP models and an AUC of 0.66 for the Bach model.29 These are substantially lower than the AUC statistic value of 0.88 in our study. Compared with the Bach and Spitz models, the LLP model has been found to have a much higher rate of false positives and therefore falsely flags more low-risk individuals than the other two models.29 The LLP model is currently being used to select individuals who have a 5% risk of developing lung cancer over 5 years for inclusion in the UK lung screen trial of low-dose CT screening for lung cancer.10 However, at a cut-off to capture 62% of cases of lung cancer, the LLP model falsely identifies 30% of non-lung cancer controls. It therefore does not perform as well as our risk model, which identifies 79.6% of lung cancer cases at a false positive rate of 21.2%. It should be noted, though, that the LLP model applies to asymptomatic patients.

Only one other model used patient records from a large primary care database to predict the risk of lung cancer.22 In developing this model, patient records in the database were examined up to a certain time point to establish baseline risk, after which incident diagnoses of lung cancer over the subsequent 2 years were predicted. In the validation study of this model, it appeared to have a good discriminatory power with ROC values of 0.92 for men and women and at a threshold to identify the top 0.5% of patients at risk of lung cancer, the positive predictive value of the model was 1.3% (77 patients identified to be at risk of lung cancer for one true case). However, all GP records of patients recorded in the period leading up to lung cancer diagnosis were included in the algorithm development so it is likely that many symptoms and smoking records included were those after the point at which clinical lung cancer investigations were already underway and a diagnosis of lung cancer was actively being sought by the GPs. Our study has shown that in the 4-month period leading up to lung cancer diagnosis, the majority of patients with lung cancer start undergoing investigations in general practice. Therefore, it follows that the model developed by Hippisley-Cox and Coupland22 will be predicting lung cancer in patients who are already being investigated in general practice and hence it is of limited value in diagnosing lung cancer at an earlier stage.

In conclusion, a combination of patients’ socioeconomic characteristics, smoking status and early-stage symptoms appears to aid earlier identification of patients who are at an increased risk of lung cancer and who will benefit from further investigations such as chest x-rays. The weighting and inclusion of socio-demographic variables (age, sex, socioeconomic status and smoking history) and of other clinical diagnoses (URTI, LRTI, non-specific chest infections and COPD), together with the frequency of general practice consultations, make our model a huge improvement on the NICE list15 of symptoms. Evidence from past research has shown that a delay of 18–131 days (median of 54 days) between diagnosis and curative treatment for lung cancer was associated with an increase in cross-sectional tumour size and an increased risk of the cancer becoming incurable.30 The outcomes of lung cancer are likely to be better in patients who are referred and diagnosed earlier because they may have earlier-stage disease and better performance status. A clinical trial, perhaps in conjunction with a screening trial, is needed to fully quantify the benefit of the model in practice.

Clinical implications

There are several potential ways of applying this model clinically. Our primary aim is to develop the algorithm into a programme which could be incorporated into GP software and used by GPs to provide a rational estimate of patients’ lung cancer risk during consultation. For example, if a patient presents with symptoms such as cough, chest pain and a history of weight loss, the GP, with the aid of this algorithm, can calculate an estimate of the patient's risk of developing lung cancer, taking account of the patient's background risk factors in addition to the current presenting symptoms and other clinical data within a preceding time frame. By incorporating the model into GP computer software, these risk assessments would not need to be directly calculated by GPs. Similar methods are already being used for the calculation of cardiovascular disease risk. The benefit of this approach, as opposed to GPs working out the lung cancer risk for individual patients themselves, is that rather than basing a risk estimate only on information collected by the GP during a single consultation, the system takes account of all previously recorded data for patients, including records entered during consultations with other GPs in the same practice.

Another potential means of applying this model is by making the algorithm widely available to the general population to enable individuals to estimate their own risk of developing lung cancer. This could ultimately encourage earlier symptom presentation to general practice by high-risk patients who, following an assessment of their lung cancer risk, recognise the need for further investigation.


  • Contributors RH, LT and BI-O conceived the idea for and designed this study. DB provided advice on the study and extensively edited this paper. CS extracted the THIN data and ensured its accuracy. BI-O performed the statistical analysis and wrote the first draft of the manuscript. All authors critically revised and approved the final manuscript.

  • Funding This piece of research was funded by a PhD studentship from the Economic and Social Research Council, held by Barbara Iyen-Omofoman.

  • Competing interests None.

  • Provenance and peer review Not commissioned; externally peer reviewed.
