Exome-wide analysis of rare coding variation identifies novel associations with COPD and airflow limitation in MOCS3, IFIT3 and SERPINA12

Background Several regions of the genome have shown to be associated with COPD in genome-wide association studies of common variants. Objective To determine rare and potentially functional single nucleotide polymorphisms (SNPs) associated with the risk of COPD and severity of airflow limitation. Methods 3226 current or former smokers of European ancestry with lung function measures indicative of Global Initiative for Chronic Obstructive Lung Disease (GOLD) 2 COPD or worse were genotyped using an exome array. An analysis of risk of COPD was carried out using ever smoking controls (n=4784). Associations with %predicted FEV1 were tested in cases. We followed-up signals of interest (p<10−5) in independent samples from a subset of the UK Biobank population and also undertook a more powerful discovery study by meta-analysing the exome array data and UK Biobank data for variants represented on both arrays. Results Among the associated variants were two in regions previously unreported for COPD; a low frequency non-synonymous SNP in MOCS3 (rs7269297, pdiscovery=3.08×10−6, preplication=0.019) and a rare SNP in IFIT3, which emerged in the meta-analysis (rs140549288, pmeta=8.56×10−6). In the meta-analysis of % predicted FEV1 in cases, the strongest association was shown for a splice variant in a previously unreported region, SERPINA12 (rs140198372, pmeta=5.72×10−6). We also confirmed previously reported associations with COPD risk at MMP12, HHIP, GPR126 and CHRNA5. No associations in novel regions reached a stringent exome-wide significance threshold (p<3.7×10−7). Conclusions This study identified several associations with the risk of COPD and severity of airflow limitation, including novel regions MOCS3, IFIT3 and SERPINA12, which warrant further study.


Supplementary Tables
Supplementary Table 1    A. rs7269297 in MOCS3¸ associated with COPD risk.
B. rs140549288 in IFIT3¸ associated with COPD risk.
C. rs140198372 in MOCS3¸ associated with percent predicated FEV1 in COPD cases.

Case Collections
Gedling: The Gedling cohort is a general population sample of adults aged 18 to 70 years, recruited in Nottingham in 1991 (n=2,633) in a cross-sectional study of the relationship between diet, asthma, and COPD. Subjects were followed-up in 2000 (n=1346), where they had blood samples taken for DNA extraction (Source Biosciences, UK), completed questionnaires on respiratory symptoms, smoking, and other variables, and had pre-bronchodilator FEV1 and FVC measurements taken using a calibrated dry bellows spirometer (Vitalograph, Buckingham, UK), recording the best of three satisfactory attempts (1, 2).   Genotype calling and Quality Control (QC) of exome array genotype data Genotypes were called using Illumina's Gencall algorithm in Genomestudio (12), then SNPs and samples with >90% missing data were removed. Subsequently, samples with a call rate < 98%, heterozygosity rate>3 standard deviations (SD) from mean (calculated separately using SNPs with MAF>=1% and SNPs with MAF<1%), gender mismatches, and duplicates were excluded. Additionally, samples with an excess of singleton SNPs (those who were the only individual to have the minor allele for >50 SNPs), and samples whose mean probe intensity across all autosomal SNPs was outlying were also excluded. Ancestry principal components analysis (PCA) was carried out using EIGENSTRAT (13), with a subset of Linkage disequilibrium (LD) pruned HapMap 3 CEU SNPs with MAF>1% and call rate>99%; any individuals >4SD from the sample mean for either of the first two principal components were excluded. Additionally, SNPs were excluded if they showed differential rates of missingness in cases and GS:SFHS controls (P<10 -4 ).

EU COPD Gene
Following these exclusions, missing genotypes were recalled using zCall (14) and a second stage of QC was carried out. SNPs with call rate<99% or which deviated from Hardy Weinberg Equilibrium (P<10 -4 ) were excluded, along with samples with call rate < 99%, and heterozygosity outliers (>3SD from mean). To eliminate variants that were subject to genotyping batch effects, we tested for associations between genotype and sample collection, separately in cases and controls; any SNP showing association (P<10 -5 ) was excluded.

Discovery exome gene-based analysis
Variants were annotated to genes using ANNOVAR (15)

Custom content single variant analyses
Single variant associations with the custom content SNPs and both COPD risk and severity of airflow limitation were undertaken equivalently to the exome analyses. The custom content analysis included only SNPs with a priori evidence of association with lung function, thus we used a threshold of P<10 -3 for selecting SNPs for replication analyses in UK BiLEVE. We set a Bonferroni corrected significance level for replication, for the number of SNPs in novel regions taken forward to replication (P<0.0125 for analysis of COPD risk; P<0.010 for analysis of airflow limitation severity).

Replication and meta-analysis with UK BiLEVE data
The 4231 cases with airflow limitation indicative of COPD and 8979 controls from UK BiLEVE contributing to the meta-analysis, were selected based on their % predicted FEV1 values, calculated using reference equations derived using healthy never smokers in the whole of UK Biobank. For association testing, percent predicted FEV1 was recalculated using the NHANES III spirometric reference equations (21) for consistency with the exome discovery analyses. All selected COPD cases met GOLD 2 criteria (FEV1/FVC<0.7 and % predicted FEV1<80%) under both reference equations.
SNP associations with risk of COPD were carried out using a logistic regression model, implemented in Plink v1.07 (22) adjusting for age, sex and pack-years and assuming an additive genetic model. In the analysis of severity of airflow limitation, associations with untransformed percent predicted FEV1, in cases were tested using a linear regression model, with adjustment for pack-years.
The genomic inflation factor (λ) was calculated for the exome array and UK BiLEVE analyses and where λ>1, genomic control was applied, adjusting the standard errors of effect estimates accordingly. All SNPs were oriented to the same strand, with consistent coded alleles. Effect estimates and standard errors were combined across the two analyses using an inverse-varianceweighting meta-analysis. λ was calculated for the pooled effect estimates and genomic control was applied again where λ>1. Meta-analysis statistics and figures were produced using R version 3.1.1.
For the replication of associations identified in the discovery exome analyses, a look-up of the UK BiLEVE and meta-analysis results was undertaken. Replication of associations identified through the custom content analyses was undertaken in the same way, where SNPs were genotyped in the UK BiLEVE samples. Where a SNP was not directly genotyped, additional analyses was carried out using imputed data: UK BiLEVE samples were imputed to a combined 1000G (18) and UK10K Project (23) reference panel. Following imputation, SNPs were excluded if they had imputation INFO score ≤0.5 or minor allele count (MAC)<3. Associations were carried out for relevant SNPs using SNPTEST v2.5b4 (24).

Supplementary Results
Discovery exome gene-based analysis In the gene-based analyses of COPD risk, PRICKLE1 was the only gene to reach the P<10 -5 significance level (P=1.968×10 -6 , unadjusted analysis). The SKAT-O test included three SNPs within this gene, however "drop-one" analyses showed the signal to be entirely driven by rs3827522 (MAF=0.4%), the SNP identified in the single variant analysis (Supplementary Table 5). The analyses of severity of airflow limitation identified no associated genes with P<10 -5 .

20
Supplementary Custom content single variant association analysis After exclusions, 3226 case samples, 3262 controls and 2489 SNPs were included in the custom content analyses of COPD risk and severity of airflow limitation. The strongest signal identified in these analyses was for an association with COPD risk at the previously reported 15q25 region (rs2036527 near CHRNA5, Supplementary Figure 9 and Supplementary Table 6). There were two additional SNPs within the 15q25 region showing association with COPD risk; conditional analyses confirmed that these did not represent an independent signal to the sentinel SNP (Supplementary Table 7). Although an additional four SNPs showed association (P<10 -3 ) with COPD risk and 5 SNPs showed association (P<10 -3 ) with severity of airflow obstruction in cases (Supplementary Figure 10 and Supplementary Table 8), none were further replicated in the UK BiLEVE study.

Supplementary Figure 10: A) Custom content analysis of severity of airway limitation, with pack-years adjustment (SNPs with P<10 -3 highlighted). B) Custom content analysis of severity of airway limitation, without pack-years adjustment ( SNPs with P<10 -3 highlighted).
Supplementary