S95 Identifying new hereditary haemorrhagic telangiectasia genes by applying a machine learning approach to screen whole genome sequencing data
  1. S Xiao1,
  2. D Brown2,
  3. IG Mollet3,
  4. FS Govani1,
  5. D Patel1,
  6. L Game1,
  7. HHT/PAVM GeCIP2,
  8. Genomics England Research Consortium2,
  9. CL Shovlin1
  1. 1Imperial College, London, UK
  2. 2Genomics England, London, UK
  3. 3Universidade Nova de Lisboa, Lisbon, Portugal

Abstract

Introduction and objectives Hereditary haemorrhagic telangiectasia (HHT) is a rare, autosomal dominantly inherited disease that causes pulmonary arteriovenous malformations and pulmonary hypertension. Four disease-causing genes have been identified: ENG, ACVRL1, SMAD4 and GDF2. Here, we demonstrate an unbiased screening method using whole genome sequencing (WGS) to identify novel genes that may cause HHT.

Methods Through the UK 100,000 Genomes Project Data Release 6.0, WGS data were available for 160 HHT participants from 126 families, following Illumina pipeline alignments and variant calling. For the current project, customised scripts were written in Python to extract all variants from the HHT patients’ variant call format files (VCFs, currently covering single nucleotide variants and small indels). The variants were then prioritised by characteristics such as allele frequency, deleteriousness, gene location and gene expression profiles, using both stepwise filtering and machine learning feature selection algorithms including LASSO and SVM-RFE.
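The extraction step described in the Methods can be illustrated with a minimal sketch. This is not the authors' script: it assumes plain, uncompressed VCF text with the eight standard fixed columns, and the names `parse_info` and `iter_variants` are hypothetical helpers introduced here for illustration only.

```python
from typing import Dict, Iterable, Iterator


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column ('K1=V1;K2=V2;FLAG') into a dict."""
    fields: Dict[str, str] = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue  # '.' means the INFO column is empty
        key, _, value = item.partition("=")
        fields[key] = value  # flags without '=' map to an empty string
    return fields


def iter_variants(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per alternate allele from VCF text lines.

    Assumption: each data line has at least the eight fixed VCF columns
    (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), tab-separated.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip meta-information and header lines
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # malformed line; ignored in this sketch
        chrom, pos, _id, ref, alts, _qual, _filter, info = cols[:8]
        info_fields = parse_info(info)
        for alt in alts.split(","):  # multi-allelic sites list several ALTs
            yield {
                "chrom": chrom,
                "pos": int(pos),
                "ref": ref,
                "alt": alt,
                # True for single nucleotide variants; False covers
                # small indels and any other length-changing allele
                "is_snv": len(ref) == 1 and len(alt) == 1,
                # the same INFO dict is shared by all ALTs of a site
                "info": info_fields,
            }
```

Calling `list(iter_variants(open(path)))` on a hypothetical file path would give a flat list of per-allele records that later prioritisation steps can annotate and filter.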

Results A mean of 4,813,192 variants (range 4,726,104 to 5,362,271) were found in each HHT patient. Stepwise filters removed an average of 3,663,003 variants that exceeded an allele frequency of 0.02% in the 1000 Genomes Project database, and a further 690 synonymous variants that do not change the encoded amino acid sequence. Excluding variants present in HHT patients in whom a likely pathogenic variant had already been identified through the Genomic Medicine Centres left a residual 501,702 variants. Subsequent stages required novel machine learning algorithms focusing on three criteria: endothelial cell-expressed variants (defined as variants in one of the 11,488 genes with alignments in our RNASeq experiments in primary normal human microvascular endothelial cells); in-house RNASeq changes following BMP9 or TGF-β1 stimulation; and absence or very low frequency in non-HHT participants in the 100,000 Genomes Project. Selected variants are being prioritised based on expert input from the HHT PAVM GeCIP Pathway Analyses Subgroup, drawing on its knowledge of gene coding and untranslated regulatory regions, and of detailed functional pathways.
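The stepwise part of this prioritisation can be sketched as a chain of keep/discard rules that logs how many variants survive each step. The sketch below is illustrative rather than the authors' pipeline: only the 0.02% allele frequency threshold comes from the Results, while the record keys (`pop_af`, `consequence`, `gene`, `control_carriers`), the function names and the `max_control_carriers` cut-off are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Variant = Dict[str, object]
Step = Tuple[str, Callable[[Variant], bool]]

MAX_POP_AF = 0.0002  # 0.02%, the population threshold quoted in the Results


def build_steps(expressed_genes: Set[str],
                max_control_carriers: int = 0) -> List[Step]:
    """Return ordered (name, keep-predicate) pairs.

    Assumed record keys: 'pop_af' (1000 Genomes allele frequency),
    'consequence', 'gene', and 'control_carriers' (count of non-HHT
    participants carrying the variant).
    """
    return [
        ("rare_in_population", lambda v: v["pop_af"] <= MAX_POP_AF),
        ("not_synonymous",
         lambda v: v["consequence"] != "synonymous_variant"),
        ("endothelial_expressed", lambda v: v["gene"] in expressed_genes),
        ("rare_in_controls",
         lambda v: v["control_carriers"] <= max_control_carriers),
    ]


def stepwise_filter(variants: Iterable[Variant],
                    steps: List[Step]):
    """Apply each step in order; return survivors and a per-step count log."""
    remaining = list(variants)
    log = [("input", len(remaining))]
    for name, keep in steps:
        remaining = [v for v in remaining if keep(v)]
        log.append((name, len(remaining)))
    return remaining, log
```

Keeping each rule as a named predicate makes the per-step counts easy to report, in the same spirit as the variant totals given above. The machine learning stages (LASSO, SVM-RFE) are not covered by this sketch.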

Conclusions We have already identified multiple genes with putative damaging variants in patients with unexplained HHT, and will next focus on variants in genes expressed by other cell types. Similar approaches could also be implemented in other rare diseases.
