How to handle big data for disease stratification in respiratory medicine?
Krasimira Tsaneva-Atanasova (1), Chris Scotton (2)
  1. Mathematics and Statistics and Living Systems Institute, University of Exeter, Exeter, UK
  2. Medical School, University of Exeter, Exeter, UK
Correspondence to Professor Krasimira Tsaneva-Atanasova, Mathematics and Statistics and Living Systems Institute, University of Exeter, Exeter, Devon, UK; k.tsaneva-atanasova{at}


Increasingly complex datasets of biomedical measurements offer an opportunity to discover patient endotypes: subtypes of a disease marked by distinct pathomechanisms, which can have major implications for prognosis and clinical management. Such datasets often include imaging, genomics, transcriptomics, proteomics, microbiota composition, allergen and environmental exposures, and immunological data, as well as patient outcomes and routinely collected clinical parameters. Given the sheer volume of data, interpretation is extremely challenging.

Recently, topological data analysis (TDA) has rapidly gained popularity for analysing such datasets (see Skaf and Laubenbacher [1] for a review). Respiratory medicine is no exception, as topology offers a suite of techniques and tools that can be applied to diverse data. These enable a holistic approach that robustly identifies multidimensional properties and relationships within a multimodal dataset by using the full range of available clinical and pathobiological data simultaneously. Topological methods also lend themselves naturally to visualisation, which makes them useful for applications that require user interpretation and understanding.

TDA offers a less biased and more rigorous approach to analysing complex datasets, since it neither depends on prior hypotheses nor focuses on pairwise relationships within the data. This contrasts with other established analytical methods, such as supervised clustering and classical association analyses. The Mapper algorithm [2] is a popular TDA technique that converts a complex, high-dimensional dataset into a simpler network representation embedded in fewer dimensions. To achieve this, common techniques such as principal components analysis (PCA), t-distributed stochastic neighbour embedding (t-SNE) and uniform manifold approximation and projection (UMAP) can be employed to reduce the dimensionality of the data. UMAP in particular has gained prominence in light of the plethora of single-cell RNA-seq data currently in circulation.
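As a rough illustration of how a Mapper-style network can be built, the sketch below follows the construction as it is commonly described: project the data to a low-dimensional "lens", cover the lens with overlapping intervals, cluster the points that fall in each interval, and link clusters that share points. Everything specific here is an assumption made for illustration and is not taken from this editorial: the lens is the first principal component (computed by power iteration so the example needs only the standard library), the clustering is simple single linkage with a distance threshold, and the function names, parameter values and toy data are hypothetical.

```python
import math
from itertools import combinations


def first_pc_lens(points, iters=50):
    """Lens values: projection of each point onto the first principal
    component, estimated by power iteration on the centred data."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centred = [[p[j] - mean[j] for j in range(d)] for p in points]
    v = [1.0] * d  # fixed start vector keeps the sketch deterministic
    for _ in range(iters):
        scores = [sum(r[j] * v[j] for j in range(d)) for r in centred]
        w = [sum(s * r[j] for s, r in zip(scores, centred)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(r[j] * v[j] for j in range(d)) for r in centred]


def interval_cover(values, n_intervals, overlap):
    """Equal-length intervals spanning the lens range; consecutive
    intervals share the given fraction of their length."""
    lo, hi = min(values), max(values)
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + k * step, lo + k * step + length) for k in range(n_intervals)]


def single_linkage(indices, points, eps):
    """Group indices into connected components of the graph that joins
    any two points lying within distance eps of each other."""
    clusters, unvisited = [], set(indices)
    while unvisited:
        stack = [unvisited.pop()]
        component = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in unvisited
                    if math.dist(points[i], points[j]) <= eps}
            unvisited -= near
            component |= near
            stack.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper_graph(points, n_intervals=4, overlap=0.3, eps=0.5):
    """Return (nodes, edges): nodes are clusters of point indices, and an
    edge joins two nodes whenever they have at least one point in common."""
    lens = first_pc_lens(points)
    nodes = []
    for a, b in interval_cover(lens, n_intervals, overlap):
        members = [i for i, v in enumerate(lens) if a - 1e-9 <= v <= b + 1e-9]
        nodes.extend(single_linkage(members, points, eps))
    edges = [(s, t) for s, t in combinations(range(len(nodes)), 2)
             if nodes[s] & nodes[t]]
    return nodes, edges


# Hypothetical toy data: one elongated group and one compact, distant group.
long_group = [(0.1 * k, 0.05 * k, 0.0) for k in range(21)]
compact_group = [(5.0 + 0.1 * k, 2.5 + 0.05 * k, 0.0) for k in range(10)]
points = long_group + compact_group

nodes, edges = mapper_graph(points)
for idx, node in enumerate(nodes):
    print(idx, sorted(node))
print("edges:", edges)
```

On this toy data the elongated group straddles two overlapping intervals, so it appears as two linked nodes, while the distant compact group forms a separate, unconnected node. In practice the choice of lens, cover resolution, overlap and clustering method all shape the resulting network, so they would need to be tuned and reported for any real analysis.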

Specifically, the Mapper algorithm starts by applying a projection (eg, UMAP) to the dataset and using …


  • Twitter @KrasiTsaneva, @isteefer

  • Contributors KT-A and CS wrote the editorial.

  • Funding KT-A gratefully acknowledges the financial support of the EPSRC via grant EP/T017856/1. CS is supported by MRC grants MR/V002538/1 and MR/W014491/1.

  • Competing interests None declared.

  • Provenance and peer review Commissioned; internally peer reviewed.