Genetic association studies in Thorax
- Division of Therapeutics and Molecular Medicine, Queen’s Medical Centre, University Hospital, Nottingham, UK
- Correspondence to:
Professor I P Hall
Division of Therapeutics and Molecular Medicine, D Floor, South Block, Queen’s Medical Centre, University Hospital, Nottingham NG7 2UH, UK;
- genetic association studies
- respiratory disease
- single nucleotide polymorphisms
- multiple testing
A guide to assessing the validity of genetic association studies in respiratory disease
Increasing knowledge regarding the extent of genetic variation in the human genome has led to an explosion of interest in performing genetic association studies in complex diseases. Well designed studies have the potential to provide functionally relevant data on the pathophysiology of disease initiation and severity.1 Unfortunately, this field has acquired a bad reputation over recent years because of problems with poor design and variable replication of findings.2,3 Because such studies are relatively easy to undertake when one has access to a population of patients with disease, the number of such studies has increased markedly: Thorax now receives, on average, eight each month. As there are a number of common flaws present in many of these studies, we felt it would be helpful to publish some broad guidance on the subject. While Thorax will always be keen to receive high quality manuscripts dealing with genetic studies in respiratory disease, in the future it is unlikely that submitted manuscripts will be sent out for further review if they do not conform to the guidance contained within this editorial.
STUDY POPULATION SIZE
The majority of submitted genetic association studies use a case-control design, so this is the focus of this editorial. The limiting factor in recruitment is usually the number of cases available to study. There are some advantages in increasing the number of controls (that is, having more than one matched control for each case): in practice 2:1 matching of controls to cases often provides the most efficient design for relatively common diseases. For any given genetic association study an initial power calculation should be undertaken to determine the power of the study to detect effects. For most of the genetic factors contributing to common complex diseases, published relative risks have been no higher than 2. The size of population required to detect a relative risk of this magnitude will depend upon the allele frequency of the polymorphisms under consideration (table 1). Programs for estimating required sample size are readily available: for example, one can be downloaded from http://hydra.usc.edu/gxe/ (reference 4) and another is available online at http://Statgen.iop.kcl.ac.uk/gpc (reference 5).
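The dependence of sample size on allele frequency can be illustrated with a rough calculation. The sketch below is not the method used by the programs cited above and will not reproduce table 1: it assumes a simple two-sided allelic test, treats the relative risk as a per-allele odds ratio, assumes Hardy-Weinberg proportions so that each subject contributes two independent alleles, and uses the normal approximation for comparing two proportions. The function and parameter names are illustrative only.

```python
from math import ceil, sqrt
from statistics import NormalDist


def case_allele_frequency(control_freq: float, odds_ratio: float) -> float:
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    return odds_ratio * control_freq / (1 + control_freq * (odds_ratio - 1))


def cases_needed(control_freq: float, odds_ratio: float,
                 controls_per_case: float = 1.0,
                 alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of cases for a two-sided allelic test.

    Assumption: normal approximation for two proportions, two alleles per
    subject, odds_ratio different from 1.
    """
    p0 = control_freq
    p1 = case_allele_frequency(p0, odds_ratio)
    k = controls_per_case
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Pooled allele frequency under the null, weighted by group size.
    pooled = (p1 + k * p0) / (1 + k)
    null_sd = sqrt((1 + 1 / k) * pooled * (1 - pooled))
    alt_sd = sqrt(p1 * (1 - p1) + p0 * (1 - p0) / k)
    case_alleles = ((z_alpha * null_sd + z_power * alt_sd) / (p1 - p0)) ** 2
    return ceil(case_alleles / 2)  # two alleles per case


if __name__ == "__main__":
    # Columns: allele frequency, cases with 1:1 controls, cases with 2:1.
    for freq in (0.05, 0.10, 0.30):
        print(freq,
              cases_needed(freq, odds_ratio=2.0),
              cases_needed(freq, odds_ratio=2.0, controls_per_case=2))
```

Figures from a per-allele test at a nominal significance level are optimistic. A stricter `alpha` (to allow for multiple comparisons), a smaller effect size, or a genotype-based model will all increase the number of cases required, so a dedicated power program remains the appropriate tool for study design.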
For the majority of genes of interest, population sizes of several hundred will be required to ensure adequate power. For many common respiratory diseases such as asthma and chronic obstructive pulmonary disease (COPD), populations of an adequate size are already available in many centres. The investigation of gene–environment or gene–gene interactions greatly increases the sample size requirement (and may necessitate collaboration between several research groups), but is to be encouraged as this has the potential for greater insight into disease.6 For common diseases, therefore, it is unlikely that studies involving small numbers of subjects (for example, 150 asthmatics and 150 controls) will be adequately powered to estimate reliably the population contribution of genetic variants. In general, studies which the journal would wish to publish will either have large sample sizes or demonstrate replication in two independent populations. This approach may not be practical for very rare conditions and, where a strong case can be made, smaller studies which provide preliminary insight into novel mechanisms of disease would still be of interest.
SNP, HAPLOTYPES OR FUNCTIONALLY RELEVANT POLYMORPHISMS?
The public domain databases contain more than six million single nucleotide polymorphisms (SNPs) and it is likely that there are over 10 million SNPs with allele frequencies greater than 1%. The number of polymorphisms involved in disease aetiology and modulation is therefore massively outweighed by those not involved, which means that association studies using randomly selected SNPs have very high false positive rates. This is compounded by publication bias: negative association studies are more difficult to publish and may not even be written up by investigators. To raise the prior probability of true association, very careful consideration should be given in the initial design with regard to the polymorphisms chosen. Where functional information regarding a polymorphism in a given candidate gene is available, this may help to prioritise selection. If no functional information is available regarding the gene of interest, then the choice is either to undertake functional studies on the polymorphic variants within that gene or to select tag SNPs from which information can be inferred for other variants in that gene.7 The alternative is to use combinations of SNPs (or other polymorphisms) across that genetic region (haplotypes): the disadvantage with this approach is that there are likely to be many different haplotypes and hence the population study size will need to be increased accordingly. Many studies submitted to Thorax examine only a single SNP in the gene of interest: unless there are good supporting functional data on the chosen polymorphism, ideally in the same study population, this approach is unlikely to be very informative. 
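The effect of prior probability on the credibility of a "significant" result can be made concrete with Bayes' rule. The sketch below is illustrative only: the prior probabilities, power and significance level are assumed values chosen for the example, not estimates taken from this editorial, and the function name is hypothetical.

```python
def probability_true_association(prior: float, power: float = 0.80,
                                 alpha: float = 0.05) -> float:
    """Probability that a nominally significant association is real.

    prior: assumed probability, before the study, that the variant is
           truly associated with the phenotype.
    """
    true_positives = power * prior          # real effects that reach p < alpha
    false_positives = alpha * (1 - prior)   # null variants that reach p < alpha
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # A randomly chosen SNP versus a variant with supporting functional data
    # (both priors are assumptions for illustration).
    for prior in (0.001, 0.1):
        print(prior, round(probability_true_association(prior), 3))
```

Under these assumptions a significant result for a randomly selected SNP is far more likely to be a false positive than a true one, whereas the same p value for a well-chosen variant is considerably more persuasive.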
In general, investigators interested in pursuing genetic studies for their favoured candidate gene should look carefully at the polymorphic variation at that genetic locus, study the haplotype structure and linkage disequilibrium profile around the region (increasing information will be available in the public domain as a result of the HapMap Project), and evaluate functional effects of known polymorphisms at this locus.
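For readers less familiar with linkage disequilibrium, the two measures most often reported for a pair of biallelic SNPs (D' and r²) can be computed directly from haplotype frequencies. The sketch below assumes the four haplotype frequencies are already known; in practice they are usually estimated from unphased genotype data, which is not shown here. The function and argument names are illustrative.

```python
def ld_measures(f_AB: float, f_Ab: float, f_aB: float, f_ab: float):
    """Return (|D'|, r squared) for two biallelic SNPs.

    Arguments are the frequencies of the four haplotypes formed by
    alleles A/a at the first SNP and B/b at the second.
    """
    if abs(f_AB + f_Ab + f_aB + f_ab - 1.0) > 1e-6:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A = f_AB + f_Ab
    p_B = f_AB + f_aB
    variance_product = p_A * (1 - p_A) * p_B * (1 - p_B)
    if variance_product == 0:
        raise ValueError("both SNPs must be polymorphic")
    # D: departure of the AB haplotype frequency from independence.
    d = f_AB - p_A * p_B
    r_squared = d * d / variance_product
    # D' rescales D by the largest value it could take given the
    # allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return abs(d) / d_max, r_squared


if __name__ == "__main__":
    # D' can equal 1 while r squared is well below 1, so one SNP may be
    # a poor proxy for the other despite "complete" disequilibrium.
    print(ld_measures(0.5, 0.0, 0.25, 0.25))
```

The distinction matters when selecting tag SNPs: a high r² indicates that one variant predicts the other well, whereas a high D' alone does not.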
MULTIPLE TESTING
The issues raised in the preceding section will help prioritise genetic variants worthy of evaluation but it is likely that there will be more than one genetic factor involved in the primary analysis. Similarly, there may be more than one phenotype of interest: for example, a study might consider the presence of asthma as the major phenotype, but might also look at bronchial hyperresponsiveness, IgE, asthma severity, or response to medication as additional phenotypes. Each additional genetic factor and each additional phenotype to be studied adds to the number of comparisons made in the analysis and gives rise to additional problems of multiple testing. Although new methods are increasingly applied,8 there is no simple answer to this issue. We hope the following advice may help.
Before commencing the analysis it is critical to determine the primary end point for the study. Sub-analyses can also be reported but it should be made explicitly clear that positive associations have come from secondary analyses when presenting data. Some studies require a more complicated analytical approach—for example, studies on genetic factors influencing the development of COPD in smokers need to allow for known confounding effects on lung function such as age, sex, height, and duration of smoking exposure. This is usually done using regression analysis although alternative approaches (such as those using recursive partitioning) may also be of value in this setting, especially for the investigation of epistasis.9
One particular concern for genetic association studies is the repeated use of the same population for different association studies. It may not be apparent to readers of a manuscript that the population has been used for previous analyses. This issue should at least be acknowledged by authors submitting papers describing sequential studies in a given population. Ideally, where multiple candidate genes are to be assessed in the population, it is preferable to report data on all the genes of interest in a single comprehensive manuscript rather than in multiple smaller papers.
The use of an independent replication sample greatly increases the confidence that an observed association is true. One potential approach is to use one sample for hypothesis generation (that is, accepting “significant” p values without correction) and then to seek replication, in a second population, of only those variables that were associated in the first.
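The two-stage approach can be sketched as follows. The first function simply counts the comparisons implied by a study of several variants and phenotypes and returns a Bonferroni threshold, which is one conventional (and conservative) correction rather than a method prescribed here. The second applies the discovery and replication logic; applying a Bonferroni correction at the replication stage is an assumption of this sketch. The variant names and p values are invented for illustration.

```python
def bonferroni_threshold(n_variants: int, n_phenotypes: int = 1,
                         alpha: float = 0.05) -> float:
    """Per-test significance level when every variant is tested
    against every phenotype."""
    return alpha / (n_variants * n_phenotypes)


def two_stage_screen(discovery_p: dict, replication_p: dict,
                     alpha: float = 0.05) -> list:
    """Return variants that pass both stages.

    Stage 1: nominal p < alpha in the discovery sample, no correction.
    Stage 2: only those variants are tested in the second sample, with
             alpha divided by the number carried forward (an assumption).
    """
    carried_forward = [v for v, p in discovery_p.items() if p < alpha]
    if not carried_forward:
        return []
    threshold = alpha / len(carried_forward)
    # A variant with no replication result is treated as not replicated.
    return [v for v in carried_forward
            if replication_p.get(v, 1.0) < threshold]


if __name__ == "__main__":
    discovery = {"snp_a": 0.01, "snp_b": 0.20, "snp_c": 0.04}
    replication = {"snp_a": 0.004, "snp_b": 0.001, "snp_c": 0.03}
    print(bonferroni_threshold(n_variants=5, n_phenotypes=4))
    print(two_stage_screen(discovery, replication))
```

Note that a variant which fails the discovery stage is not reconsidered, however small its p value in the second sample: the replication sample is used to confirm hypotheses, not to generate new ones.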
The above sections have dealt with major issues concerning study size and the selection of genetic variants for study. A further common flaw is the inappropriate selection of study populations. Failure to match the control and study populations for ethnic or geographical origin may lead to spurious results because of population stratification. Increasingly it has been recognised that even apparently homogeneous populations may show sub-stratification.
One should aim to match controls and cases for every characteristic bar the outcome under study. However, bias from the recruitment locations of cases and controls is commonly seen: controls are often attending hospital for another reason, or may be blood donors or younger healthy volunteers. These types of control group may not, by their nature, be representative of the population at large.
With this in mind, it is reassuring to see study populations typed for unlinked markers to identify and address stratification. An alternative approach which is feasible for some conditions is to use family based association approaches. Investigators are directed elsewhere for a fuller discussion of population stratification (for example, Cardon and Palmer10), but should at least reassure themselves that the control and study populations are drawn from the same general pool.
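One published way of using unlinked markers is genomic control, in which the association statistics for markers not expected to be related to disease are used to estimate how far test statistics are inflated by stratification. The sketch below shows the idea only; it is not a method prescribed in this editorial, it assumes 1 degree of freedom chi-square statistics, and the function names are illustrative.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom
# (about 0.455), obtained as the square of the 75th percentile of a
# standard normal variable.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def inflation_factor(null_marker_chi2) -> float:
    """Estimate the inflation of test statistics from unlinked markers.

    Values near 1 suggest little stratification; the estimate is not
    allowed to fall below 1.
    """
    return max(1.0, median(null_marker_chi2) / CHI2_1DF_MEDIAN)


def adjust_statistic(candidate_chi2: float, null_marker_chi2) -> float:
    """Deflate a candidate variant's chi-square by the inflation factor."""
    return candidate_chi2 / inflation_factor(null_marker_chi2)
```

A reasonably large panel of unlinked markers is needed for the median to be estimated with any precision, and family based designs avoid the problem by construction rather than by correction.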
The above issues are some of the most important factors to be taken into consideration when assessing the validity of genetic association studies. Despite the recent criticism of this kind of study, good examples have the potential to provide novel insight into mechanisms of disease. Authors intending to submit a genetic association study to Thorax should consider whether or not their study has addressed the following specific questions:
- Is the study size adequate to provide a reasonable estimate of the population contribution of the genetic variation under consideration?
- Is the control population appropriately selected?
- Is the choice of polymorphism(s) studied at a given genetic locus logical?
- Has linkage disequilibrium at the relevant genetic locus been considered?
- Are phenotypes well documented?
- Have issues of multiple testing been addressed?
- Have findings been replicated in a second sample or are there functional data to support findings in the main study population?
- Does the genetic association study advance our understanding of the mechanisms underlying the disease of interest or its treatment?
The authors declare no conflict of interest with the material presented in this editorial.