The study of human genetics has recently undergone a dramatic transition with the completion of both the sequencing of the human genome and the mapping of human haplotypes of the most common form of genetic variation, the single nucleotide polymorphism (SNP)1,2,3. In concert with this rapid expansion of detailed genomic information, cost-effective genotyping technologies have been developed that can assay hundreds of thousands of SNPs simultaneously. Together, these advances have allowed a systematic, even 'agnostic', approach to genome-wide interrogation, thereby relaxing the requirement for strong prior hypotheses.

So far, comprehensive reviews of the published literature, most of which reports work based on the candidate-gene approach, have demonstrated a plethora of questionable genotype–phenotype associations, replication of which has often failed in independent studies4,5,6,7. As the transition to genome-wide association studies occurs, the challenge will be to separate true associations from the blizzard of false positives by attempting to replicate positive findings in subsequent studies. The purpose of a replication study is to evaluate a positive finding from a previous study and so lend credibility to the initial finding. Replication is essential for establishing the credibility of a genotype–phenotype association, whether derived from candidate-gene or genome-wide association studies. However, there is a lack of agreement about what constitutes a finding deserving of replication, what constitutes an adequate replication study and what constitutes a replication or refutation.

Investigators and journal editors have offered guidelines for how to address this problem8,9,10,11,12, but these initial efforts have been hampered by limited experience and conflicting empirical data. However, as evidence has accumulated, several instructive examples have emerged of genotype–phenotype associations being reproduced reliably in follow-up studies. These include peroxisome proliferator-activated receptor-γ (PPARG)13 and the transcription factor TCF7L2 (refs 14–19), related to diabetes; nucleotide-binding oligomerization domain containing 2 (NOD2) and Crohn's disease20,21,22; complement factor H (CFH) and age-related macular degeneration23,24,25,26; and chromosome region 8q24 and prostate cancer risk27,28,29,30,31.

Many instances have arisen in which initial findings have not been reproduced in follow-up studies because of issues in either the initial study or the attempted replication4,5,6,32,33. Small sample size is a frequent problem and can result in insufficient power to detect minor contributions of one or more alleles. Similarly, small sample sizes can provide imprecise or incorrect estimates of the magnitude of the observed effects. Poor study design — particularly a lack of comparability between cases and controls — can increase the risk of biases because there can be heterogeneity in exposure to environmental challenges and population stratification. The latter arises when investigators fail to account for case–control differences in the genetic structure of the underlying population. Heterogeneity in classification of outcomes across studies can undermine the opportunity to compare among them. Similarly, data 'dredging' can be a major problem, especially when criteria for defining phenotypes are altered to achieve statistical significance worthy of publication.

Another challenge arises when follow-up studies analyse different variants. An example is the reported association between DTNBP1 and schizophrenia, initially identified in Irish pedigrees34 and 'confirmed' in independent European studies35. Unfortunately, different risk alleles and haplotypes were reported in each study, making comparison difficult36,37,38,39. Although it is plausible that more than one variant could contribute to schizophrenia risk at the DTNBP1 locus, it is difficult to draw this conclusion from the literature because follow-up studies have not consistently analysed the same markers or those in perfect linkage disequilibrium (r² = 1.0). Other recent examples for which initial reports of association have been inconsistently replicated include insulin-induced gene 2 (INSIG2) and obesity40,41,42,43,44, and cyclic-AMP-specific phosphodiesterase (PDE4D) and stroke45,46. These have been accompanied by controversies about what actually constitutes replication.
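To make the linkage disequilibrium criterion concrete, the following minimal sketch computes r² between two biallelic SNPs from counts of the four possible haplotypes. The function name and the assumption that phased haplotype counts are available are illustrative only; in practice, haplotype frequencies usually have to be estimated from unphased genotypes.

```python
def ld_r_squared(n_ab, n_aB, n_Ab, n_AB):
    """Return r^2 between two biallelic SNPs from phased haplotype counts.

    Arguments are hypothetical counts of haplotypes carrying alleles
    a/A at the first SNP and b/B at the second SNP.
    """
    total = n_ab + n_aB + n_Ab + n_AB
    p_AB = n_AB / total              # frequency of the A-B haplotype
    p_A = (n_Ab + n_AB) / total      # frequency of allele A at SNP 1
    p_B = (n_aB + n_AB) / total      # frequency of allele B at SNP 2
    d = p_AB - p_A * p_B             # disequilibrium coefficient D
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom
```

Under this definition, two markers that always occur together on the same haplotypes give r² = 1 and can stand in for one another in a replication study, whereas markers with a low r² test different hypotheses.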

This paper presents the conclusions of a working group on the replication of genotype–phenotype associations — whether identified in genome-wide or candidate-gene studies — convened by the National Cancer Institute and the National Human Genome Research Institute. The group was composed of experts from diverse disciplines, including biostatistics, clinical medicine, epidemiology, genetics and scientific publishing. The purpose was to review the current state of the field and propose best practices for the design, conduct and publication of replication studies that aim to follow up notable findings, particularly in genome-wide association studies. The group addressed three topics. First, assessment of the validity and limitations of any single genetic association study. Second, criteria for establishing replication in genetic association studies. Third, points to consider for publication of high-quality genotype–phenotype association reports (Box 1).

Initial association studies

The initial study of any association represents an important discovery tool. In the near future, it is unlikely that a single study will unequivocally establish a valid genotype–phenotype association and not require replication. A number of points relating to the study design and reporting should be considered in determining whether a finding in an initial genome-wide or candidate-gene study merits follow-up replication studies (Box 2). Attempts to replicate a reported association are often complicated by lack of methodological detail in the initial report or lack of methodological rigour in the original study.

Because of the enormous number of genotype–phenotype associations tested in each genome-wide study, spurious associations will substantially outnumber true ones unless rigorous statistical thresholds are applied. Although no universal threshold can be specified for statistical significance in all circumstances, smaller P-values generally provide greater support for a true association. Extremely small P-values should be interpreted carefully, however, until completion of replication studies, because many can be due to inappropriate reliance on asymptotic distributions of test statistics, or to technical artefact or genotype errors that are distributed differently between cases and controls. Cluster plots for highly significant markers should be examined carefully. It may be desirable to include confirmatory data from a second genotyping technology in the initial report to verify genotype accuracy. Cases and controls should be drawn from populations that are generally comparable both in terms of genetic background and environmental exposures47, and should be analysed for confounding population stratification. This may require genotyping of ancestry informative markers (AIMs), which should be strongly encouraged as genotype costs fall and AIMs become increasingly well-characterized within marker sets. Family-based studies are robust to population stratification when analysed with appropriate methods, such as transmission disequilibrium methods48. They may be particularly valuable in the initial study if there is evidence for ethnic differences in the genetic effect of a trait, although at the cost of increased genotyping. Cautious interpretation is required either if significance is observed only for unusual or highly specific phenotypes (especially if they represent a small proportion of the study sample) or if significance depends on a particular analytical method that is not publicly available for confirmation.
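As an illustration of the transmission disequilibrium approach mentioned above, the sketch below computes the classic McNemar-type statistic for a single biallelic marker from the number of times heterozygous parents transmitted, or did not transmit, a given allele to affected offspring. The function name and inputs are hypothetical, and the P-value relies on the asymptotic chi-square distribution with one degree of freedom, which is exactly the kind of approximation that should be treated with caution for extremely small P-values.

```python
import math


def tdt_statistic(transmitted, untransmitted):
    """Transmission disequilibrium test for one biallelic marker.

    transmitted / untransmitted: counts of heterozygous parents who did or
    did not pass the allele of interest to an affected child.
    Returns (chi-square statistic, asymptotic P-value with 1 d.f.).
    """
    n = transmitted + untransmitted
    if n == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (transmitted - untransmitted) ** 2 / n
    # Survival function of a 1-d.f. chi-square variable.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Because each transmitted allele is compared with the untransmitted allele of the same parent, differences in ancestry between families do not generate spurious signals in this statistic.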

Approaches for dealing with multiple comparisons are beyond the scope of this report, but more robust methods are clearly needed49. Permutation testing is an effective strategy to address the problem of multiple comparisons, especially if a large number of phenotypes are being analysed. Many methods for addressing the problem of multiple comparisons invoke a conservative approach, namely a standard Bonferroni correction, which assumes the independence of all tests performed. In many association studies, markers are not independent because they are in linkage disequilibrium, and so a standard Bonferroni correction is overly conservative. Lowering the threshold for calling a finding of particular variants — such as non-synonymous coding SNPs — positive in the analysis scheme (weighting) has merit but must be declared before initiation of the analysis and not once the analysis has begun49,50. The number of variants for which there is either credible laboratory evidence or a validated in silico prediction a priori is quite small. However, the temptation to create a credible biological hypothesis post hoc can be quite strong.
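The contrast between a Bonferroni correction and permutation testing can be sketched as follows. This is only a minimal illustration under stated assumptions: the test statistic (absolute case–control difference in mean allele count), the function names and the data layout are hypothetical choices, not a method prescribed by this report. The permutation procedure shown is the single-step maximum-statistic approach, which keeps the correlation between markers intact because case–control labels are shuffled across whole individuals.

```python
import random


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold assuming independent tests."""
    return alpha / n_tests


def allele_count_diff(genotypes, labels):
    """Absolute case-control difference in mean allele count for one SNP."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def max_stat_permutation(snps, labels, n_perm=1000, seed=0):
    """Family-wise adjusted P-values from the maximum statistic over all SNPs.

    snps: list of per-SNP genotype lists (0, 1 or 2 copies of an allele),
    each ordered by individual; labels: 1 for case, 0 for control.
    """
    rng = random.Random(seed)
    observed = [allele_count_diff(s, labels) for s in snps]
    exceed = [0] * len(snps)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype link only
        max_null = max(allele_count_diff(s, shuffled) for s in snps)
        for i, obs in enumerate(observed):
            if max_null >= obs:
                exceed[i] += 1
    # Add one to numerator and denominator so no P-value is exactly zero.
    return [(e + 1) / (n_perm + 1) for e in exceed]
```

When markers are in strong linkage disequilibrium, the null distribution of the maximum statistic reflects the smaller effective number of tests, so the adjusted P-values are less conservative than a Bonferroni correction over the nominal number of markers.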

At present, many studies are barely powered to identify, much less to establish, associations of common alleles of weak effect in complex diseases51,52. Recently, appreciation of this crucial issue has led to larger, more definitive studies, such as the Cancer Genetic Markers of Susceptibility (CGEMS) project and the Wellcome Trust Case Control Consortium (WTCCC). An estimated large effect (that is, with an odds ratio greater than 2) in a well-powered study can lend credence to an association, because unknown confounding factors are less likely to produce large effects53. Unfortunately, many risk variants contribute less than this. Small studies are prone to large variation in risk estimates, of which only selected strong positives are initially detected and reported. Furthermore, the estimate of the effect declines as replication studies are pursued, a phenomenon known as 'winner's curse'54,55.
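The 'winner's curse' can be illustrated with a small simulation. The sketch below, whose parameter values and function names are assumptions chosen purely for illustration, generates many underpowered allelic case–control studies of a variant with a modest true odds ratio and compares the average estimate across all studies with the average among only those studies that reached nominal significance.

```python
import math
import random


def simulate_log_or(rng, n_alleles, p_control, odds_ratio):
    """Simulate one allelic case-control study; return (log OR, z-score)."""
    odds_case = odds_ratio * p_control / (1 - p_control)
    p_case = odds_case / (1 + odds_case)
    a = sum(rng.random() < p_case for _ in range(n_alleles))     # risk alleles in cases
    c = sum(rng.random() < p_control for _ in range(n_alleles))  # risk alleles in controls
    b, d = n_alleles - a, n_alleles - c
    # Add 0.5 to every cell so the estimate is defined even with a zero count.
    a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    log_or = math.log(a * d / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, log_or / se


def winners_curse(n_studies=500, n_alleles=400, p_control=0.3,
                  odds_ratio=1.2, z_threshold=1.96, seed=1):
    """Return (mean log OR over all studies, mean over 'winners', number of winners)."""
    rng = random.Random(seed)
    results = [simulate_log_or(rng, n_alleles, p_control, odds_ratio)
               for _ in range(n_studies)]
    all_estimates = [est for est, _ in results]
    winners = [est for est, z in results if z > z_threshold]
    mean_all = sum(all_estimates) / len(all_estimates)
    mean_winners = sum(winners) / len(winners) if winners else float("nan")
    return mean_all, mean_winners, len(winners)
```

In such a simulation the average over all studies stays close to the true log odds ratio, whereas the average among the significant studies is inflated, because a small study can cross the significance threshold only when its estimate happens to be large.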

Consortial studies composed of multiple independent studies combined into a pooled analysis can be viewed as a practical approach that overcomes many of the disadvantages of a disconnected set of underpowered studies. In addition, consortia may meet the need for rapid replication by achieving sufficiently large sample size40,56. Collaborations among multiple independent studies can offer important advantages over a single large study, particularly regarding the generalizability of findings observed in multiple studies that typically have greater diversity of populations and/or exposures.
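One simple way to combine evidence across the studies in a consortium is a fixed-effect, inverse-variance weighted average of the per-study log odds ratios. Note that this is a summary-level meta-analysis rather than the pooled analysis of individual-level data described above, and it is offered only as a sketch; the function name and the input format are hypothetical.

```python
import math


def fixed_effect_pool(estimates):
    """Combine per-study (log odds ratio, standard error) pairs.

    Each study is weighted by the inverse of its variance, so larger,
    more precise studies contribute more to the combined estimate.
    Returns (pooled log odds ratio, pooled standard error).
    """
    weights = [1 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    return pooled, pooled_se
```

The pooled standard error is smaller than that of any contributing study, which is the sense in which combining studies recovers the power that individual underpowered studies lack. A fixed-effect model assumes a common underlying effect, an assumption that may not hold when the contributing populations or exposures differ.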

As far as possible, similarly rigorous criteria should be considered for evaluation of genotype–phenotype association studies with limited or no availability of subjects for replication, such as studies of rare diseases or severe toxicity due to therapy or environmental exposures. In these circumstances, additional information gathered from laboratory techniques, bioinformatic tools and a priori biological insight should be used to provide plausibility for interpreting genetic association findings. The expectation for demonstrated replication might be relaxed if it is unethical to attempt replication — such as in studies that link genetic variation with adverse effects of therapy or environmental exposure (for example, benzene or cigarette smoke). Similarly, the public health impact of a finding may lessen the stringency of expectation for replication before initial publication — for example, in an urgent situation in which effective intervention is available and can be readily implemented.

Genotype–phenotype associations that have been replicated widely have often used clearly defined phenotypes classified by standard and widely-accepted criteria, such as diabetes and age-related macular degeneration57,58. Use of accepted criteria should reduce misclassification rates59. Some association studies have reported intermediate phenotypes (known as endophenotypes) but have provided little detail on the actual measure or its reliability60. In the absence of standard criteria, sufficient detail should be provided for both the definition of the phenotypes investigated and assessment of their validity and comparability across studies.

Replication of initial studies

To establish a positive replication of a genotype–phenotype association, many of the same considerations important for genome-wide association or candidate-gene studies should be fulfilled (Box 3). In replication studies, every effort should be made to analyse phenotypes comparable to those reported in the initial study. In the first attempt to replicate a finding, comparable populations should be analysed not only for the main effect but also to guard against confounding population stratification, either in the initial or replication studies61,62. Because many initial studies and replication studies have been reported in populations of European descent, the challenge remains to extend the studies to other populations. It has already been shown that many variants that have a significant association with disease in several studies in one population may not necessarily have the same association in another (such as TCF7L2 in West Africa and East Asia18,63,64; in this case, it has provided an opportunity to refine the signal to a restricted region). In some circumstances, it might be impossible to conduct follow-up studies because of the uniqueness of a study population or the lack of availability of additional subjects for replication. If replication is not an option, interpretation of association findings could be supplemented by biological insights derived from the laboratory.

Replication of an association in populations of different ancestry from that of the initial report would not generally be expected, because genomic variation is greater when compared across populations, and so should increase confidence in the finding when it is observed. By contrast, failure to replicate in a population different from that of the initial report does not necessarily invalidate the original finding. In some cases, the differences in linkage disequilibrium relationships across populations can be used to narrow the region of interest for later genetic and possible functional analysis. Owing to their robustness to population stratification, as noted above, family-based studies can also serve as valuable replication studies for notable findings48.

Reports of attempts at replication should distinguish between tests of the same SNP as in the original study, SNPs in strong linkage disequilibrium with the reported SNP, and other SNPs that were genotyped to search for additional variants associated with disease in the region (Fig. 1). In some circumstances, the initial study might have identified a marker that is not in strong linkage disequilibrium with the causal variant, which could lead to a false refutation in a different population, whereas testing additional SNPs in the region might reveal another association worthy of follow-up. For clarity, if new, previously untested SNPs are included, they should be clearly identified and the rationale for their inclusion explicitly stated. If differences in linkage disequilibrium patterns across populations are used to invoke an association at a new marker but not at the originally tested marker, the different linkage disequilibrium patterns should be empirically demonstrated in the appropriate populations and shown to be a plausible and consistent explanation for both the new and original results. Otherwise, the new association cannot be considered a replication.

Figure 1: Linkage disequilibrium across the region containing SNPs associated with breast cancer in FGFR2.

Black diamonds represent four single nucleotide polymorphisms (SNPs; rs11200014, rs2981579, rs1219648 and rs2420946) for which associations with breast cancer were replicated in multiple studies73,74. Estimates of the square of the correlation coefficient (r²) were calculated for each pairwise comparison of SNPs in the initial genome-wide association study across the FGFR2 region73. The log10 r² values are colour-coded.

Publication of associations

The evaluation of a publication addressing one or more genotype–phenotype associations is a daunting task in the age of large, dense datasets. To this end, published genome-wide association reports should include detailed descriptions of design, genotyping and statistical methods, and results, even if available only through online supplements, or perhaps in a separate journal. A checklist of key possible issues is provided in Box 1 — this could be used as a guide for authors, editors, reviewers and the general readership.

It is a challenge to make the case for the importance of the replication finding(s) without exaggerating the significance of the observation. Remarks about possible follow-up of genetic markers and corroborative studies to investigate plausibility should be brief and well referenced. Authors should practise sound judgement and temper enthusiasm based on prior publications (especially from the same investigative group), particularly if the replication study results differ from those of the initial study. Disclosure of known previous attempts to replicate the reported findings, whether positive or negative, by the authors or others is important for interpreting the replication study.

Although it is desirable for the initial report of a genotype–phenotype association to include adequately powered replication studies, requiring replication with every initial study may not be necessary, as long as the preliminary nature of a study without replication is emphasized. Such studies can still provide valuable information if the entire set of results is made available, and releasing such results before replication would be of value to the field. However, there is substantial added value in presenting robust findings based on an initial scan together with follow-up replication, and an appropriate balance is needed that facilitates rapid publication of valid findings and encourages collaboration19,65. If replication studies are included, each should be described or referenced in the same detail as the initial study and should include the results for all SNPs tested at each stage. As noted above, replication studies should preferably investigate the same or a very similar phenotype.

In many cases, the follow-up study will fail to replicate the initial results. Such findings are valuable for distinguishing false-positives from the true-positive signals that should be pursued for putative causal variants. The preference for publishing positive findings, even if derived from suboptimal studies, presents a formidable barrier to the dissemination of well-conducted negative studies. Failure to disseminate results from well-conducted negative studies withholds essential pieces of evidence for investigators who may be deciding whether to launch a follow-up study to replicate or to extend the original study. Thus, high-quality instances of 'meaningful negativity' are useful and should be reported succinctly in the literature. Criteria for a meaningful negative replication study are the same as those for a positive study (Box 3), with the added requirements that the same trait should be studied in a population of comparable underlying structure with sufficient power to measure the appropriate effect size and yield a negative result.

Negative studies are difficult to publish but they are crucial for separating true-positive from false-positive findings. Journals are strongly encouraged to publish high-quality negative studies refuting earlier positive reports of genotype–phenotype associations. The journal in which the initial scan is published is encouraged to solicit and publish well-conducted follow-up studies within a specified time frame, perhaps between 3 and 9 months of the initial report. A case in point is the recent collection of reports published by The American Journal of Human Genetics66,67,68,69,70,71 that failed to replicate the initial findings of a genome-wide association study on Parkinson's disease. A handful of journals — such as Cancer Epidemiology, Biomarkers and Prevention and the new PLoS series72 — currently feature well-conducted negative reports, and such efforts are to be lauded. The value of a well-executed negative study cannot be overemphasized; more venues are needed to capture these valuable results.

Although there are challenges to making data on individual research participants available to other investigators, every effort should be made to provide researchers with an opportunity to reproduce the reported results and to investigate new hypotheses and methods. To facilitate this research in genome-wide association studies, a public data archive known as the Database of Genotypes and Phenotypes, or dbGaP (http://view.ncbi.nlm.nih.gov/dbgap), has been established at the National Library of Medicine's National Center for Biotechnology Information and will be used by many National Institutes of Health (NIH)-supported studies. dbGaP will provide study documentation and aggregated genotype and phenotype data through its website with no account or authorization required. Access to individual, de-identified genotype and phenotype data will require an authorization and approval process that is currently under development. Whether through dbGaP or other venues, genotype summaries of computed analyses should be published online unless there are strong reasons not to do so, such as data derived from special populations (that is, isolated populations or minority communities) or other groups that will not permit such sharing. There are substantial informatic challenges for data presentation and data archiving, especially on public and journal websites. Best practices for retrieval and analysis of such data continue to evolve.

Conclusion

The history of genotype–phenotype association studies has focused on initial discoveries as opposed to careful replication. Earlier attention to the appropriate design of subsequent replication studies might have helped limit the plethora of false-positive results. Determination of valid genotype–phenotype associations presents a series of challenges that will require a logical strategy for conducting well-designed studies, based on excellent quality control practices interwoven with sound analytical methods and judicious interpretation. Other than the obvious differences in the drawbacks involved in multiple comparisons, standards for assessing the validity of the initial findings of a genotype–phenotype association should not differ substantially between the candidate-gene approach and genome-wide association studies. As experience accumulates, we can look forward to methodological advances that will facilitate our interpretation of studies, such as continued improvement of proposed methods for lowering the threshold for positive findings, adjustments for population structure, and exploitation of linkage disequilibrium structure in a candidate region.

The best practices suggested here for reporting initial and replication studies are based on sufficient disclosure of study methods to permit independent confirmation of study findings. Often a sequence of studies will be required to establish a valid genotype–phenotype association, perhaps involving several rounds of replication studies. And, of course, the conclusive demonstration of a replicated association represents only the beginning of the process towards finding the causal genetic variant(s). Labour-intensive and costly investigation will subsequently be required to sequence the candidate interval in depth, genotype all the common and perhaps uncommon variants that are markers for the outcomes of interest in multiple population samples, understand their functional consequences, examine their potential interactions with other genes or environmental factors, and devise strategies for preventative or therapeutic interventions. None of these steps should proceed far, however, without conclusive replication of findings from an initial genotype–phenotype association study.