For genome-wide association studies, data is power.
The more data you have, the more statistical power you wield to find genetic associations. But are there ways to get more from the data you already have?
In the May issue of GENETICS, Kaufman and Rosset describe a testing framework that substantially boosts the power of genome-wide association studies (GWAS), without the need to collect more samples.
“We think practically everyone who’s ever done a case-control GWAS could benefit from reanalyzing their data in this way,” says author Saharon Rosset, an associate professor of statistics at Tel Aviv University.
The major innovation of the approach is tapping an abundant source of existing data: independent genetic studies of the same population, including other GWAS. Such studies don’t provide data on which individuals have the disease or trait of interest, but they do tell you more about the general population that was sampled. That extra information translates to more statistical power.
One reason the boost is sorely needed for GWAS of complex disease is the infamous problem of ‘missing heritability’: genetic associations found by GWAS never account for all the known heritability. A lack of power could be part of the problem, and certainly makes the mismatch worse. “If we have too little power, even strong associations are difficult to detect,” Rosset says.
More power is also crucial for detecting genetic interactions, where the effect of one genetic variant depends on other variants. Such interactions are expected to be widespread, but finding them is a challenge because the overwhelming number of tests creates a huge multiple comparisons burden. Even limiting the search to pairwise interactions between a modest 300,000 SNPs requires about 45 billion tests.
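The arithmetic behind that figure is easy to check. The short Python sketch below counts the unordered SNP pairs and, purely as an illustration, applies a Bonferroni correction at a family-wise error rate of 0.05. Both the 0.05 level and the choice of Bonferroni are assumptions made here, not details taken from the study.

```python
from math import comb

n_snps = 300_000

# Number of unordered SNP pairs: n * (n - 1) / 2
n_pairs = comb(n_snps, 2)

# Illustrative Bonferroni threshold (assumed family-wise error rate of 0.05)
alpha = 0.05
per_test_threshold = alpha / n_pairs

print(f"{n_pairs:,} pairwise tests")
print(f"per-test p-value threshold: {per_test_threshold:.2e}")
```

The count comes to just under 45 billion pairs, so under this assumed correction each individual test would need a p-value on the order of one in a trillion to stand out.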
Rosset and graduate student Shachar Kaufman decided to approach the problem by making use of other genomic studies that sampled the same population. The extra data is intuitively useful, Rosset explains. “Let’s say we know the exact genotype distribution of the population. For a real genetic association, we would expect the controls to look similar to the general population and the cases to look different. But if it’s the other way round — if the cases look very similar to the population and the controls look different — then the statistical association is more likely to be a coincidence,” he says.
Kaufman and Rosset developed a maximum likelihood formulation that incorporates both population and case-control data into their testing framework. In simulations, they show that using population data in this way consistently outperforms standard approaches, including the intuitive approach of simply adding the population samples as additional controls.
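To make the idea concrete, here is a toy likelihood ratio test in Python. It is a sketch under stated assumptions and not the authors' formulation: it works on allele counts at a single SNP, treats the population sample as a mixture of affected and unaffected people with a known disease prevalence (a hypothetical 1% here), and maximizes the likelihood with a crude grid search. All function names and counts are invented for illustration.

```python
from math import erfc, log, sqrt


def binom_loglik(successes, trials, freq):
    """Binomial log-likelihood, dropping the constant combinatorial term."""
    return successes * log(freq) + (trials - successes) * log(1.0 - freq)


def loglik(p_case, p_ctrl, data, prevalence):
    """Joint log-likelihood of case, control and population allele counts."""
    # Assumption: the population is a mixture of affected and unaffected people.
    p_pop = prevalence * p_case + (1.0 - prevalence) * p_ctrl
    return (binom_loglik(*data["cases"], p_case)
            + binom_loglik(*data["controls"], p_ctrl)
            + binom_loglik(*data["population"], p_pop))


def lrt_statistic(data, prevalence, steps=400):
    """Likelihood ratio statistic for 'cases and controls share one frequency'."""
    # Null: a single frequency; its MLE is the pooled allele frequency.
    total_hits = sum(hits for hits, _ in data.values())
    total_alleles = sum(n for _, n in data.values())
    p_null = total_hits / total_alleles
    ll_null = loglik(p_null, p_null, data, prevalence)

    # Alternative: separate frequencies, maximized by a crude grid search.
    grid = [i / steps for i in range(1, steps)]
    ll_alt = max(loglik(pc, pk, data, prevalence) for pc in grid for pk in grid)

    # The alternative nests the null, so clamp small negatives caused by the grid.
    return max(0.0, 2.0 * (ll_alt - ll_null))


def chi2_1df_pvalue(stat):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return erfc(sqrt(stat / 2.0))


# Hypothetical (minor-allele count, total alleles) for each sample.
real_signal = {"cases": (700, 2000), "controls": (600, 2000),
               "population": (3000, 10000)}   # controls match the population
coincidence = {"cases": (600, 2000), "controls": (700, 2000),
               "population": (3000, 10000)}   # cases match the population

for name, data in [("real_signal", real_signal), ("coincidence", coincidence)]:
    stat = lrt_statistic(data, prevalence=0.01)
    print(name, round(stat, 2), chi2_1df_pvalue(stat))
```

A test that looks only at cases versus controls sees the same 700-versus-600 split in both scenarios and cannot tell them apart. With the population sample in the likelihood, the scenario in which the controls resemble the population yields a far larger statistic than the one in which the cases do, which is the intuition Rosset describes.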
Their model can be readily combined with other methods for boosting performance. For example, the authors also increase power by assuming Hardy-Weinberg equilibrium of genotypes and linkage equilibrium between loci. With these improvements, incorporating population samples of realistic size typically leads to a substantial increase in power, bringing it close to what would be possible if the genotype of every individual in the population were known.
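A small Python sketch shows what these two assumptions mean for a pair of loci. It is an illustration only, with hypothetical allele frequencies and function names, and it does not reproduce the authors' implementation. Under Hardy-Weinberg equilibrium the three genotype frequencies at a locus follow from a single allele frequency, and under linkage equilibrium the joint genotype table for two loci is the product of the per-locus tables.

```python
def hwe_genotype_freqs(p):
    """Genotype frequencies (AA, Aa, aa) under Hardy-Weinberg equilibrium."""
    q = 1.0 - p
    return (p * p, 2.0 * p * q, q * q)


def two_locus_freqs(p_a, p_b):
    """3x3 joint genotype table for two loci, assuming HWE at each locus
    and linkage equilibrium between them (joint = product of marginals)."""
    locus_a = hwe_genotype_freqs(p_a)
    locus_b = hwe_genotype_freqs(p_b)
    return [[fa * fb for fb in locus_b] for fa in locus_a]


# Hypothetical allele frequencies of 0.3 and 0.2 at the two loci.
table = two_locus_freqs(0.3, 0.2)
for row in table:
    print(["%.4f" % cell for cell in row])
```

An unconstrained 3x3 genotype table has eight free parameters, while the table built here is fixed by just two allele frequencies. Having fewer quantities to estimate from the same data is one plausible reason such assumptions can add power when they hold.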
To see how the tests fared with real data, the authors used them to look for genetic interactions in a dataset that had already been intensively studied, the Wellcome Trust Case Control Consortium study. With the help of the new approach, they were able to identify several promising new candidates for pairs of loci that affect bipolar disorder, coronary artery disease, Crohn’s disease, and rheumatoid arthritis.
Although the authors haven’t yet confirmed the candidate pairs by replication in a different dataset, these candidate interactions would not have been detected at all using standard approaches. With the help of a little extra power, many interactions currently hidden in GWAS data could be brought to light.
Kaufman, S., and S. Rosset. GENETICS, May 2014, 197: 337-349. doi:10.1534/genetics.114.162511