ASHG Meeting Report: A guide to the Exome Aggregation Consortium data

With genomic data from hundreds of thousands of people accumulating, geneticists are now able to mine these data for very rare but very informative genetic variants, including loss-of-function alleles. For example, across the enormous “reference set” of human exomes announced at the 2014 American Society of Human Genetics meeting, there is on average a variant every six bases. In the first of our reports from the ASHG meeting, Exome Aggregation Consortium (ExAC) lead analyst Monkol Lek (Massachusetts General Hospital/Broad Institute) has written a practical guide for geneticists looking to explore their favorite genes in the publicly available exome data. Thanks to Monkol and Daniel MacArthur! If you’d like to write a guest post for Genes to Genomes, contact editor Cristy Gelling: cgelling@thegsajournals.org

We live in an amazing time to do human genetics. Over the last five years, thanks to impressive advances in DNA sequencing technology, the research community has collected sequencing data on genetic variation from over 200,000 samples. This provides us, for the first time, with the ability to study genetic variants at very low frequencies in the general population. However, in order to perform this research it’s critical that these genetic data be brought together and analyzed in the same way to ensure that the genetic changes that we find are real, and not the artifacts of differences in sequencing technology or analytical pipelines.

This goal is what drives the Exome Aggregation Consortium (ExAC), an international coalition of investigators with a focus on data from exome sequencing — an approach that allows us to focus variant discovery on the regions of the genome that encode proteins, known collectively as the exome. To date the Consortium has accumulated and jointly analyzed exome data from nearly 92,000 individuals, and has prepared a publicly accessible data set spanning 61,486 of these individuals for use as a global “reference set”. While the individuals in the reference set aren’t necessarily healthy — many have adult-onset diseases such as type 2 diabetes and schizophrenia — we have removed individuals with severe pediatric diseases, making this (we believe) a reasonable comparison data set for childhood-onset Mendelian diseases.

On October 20th at the American Society of Human Genetics (ASHG) conference we announced release 0.1 of the ExAC data set in two forms, as a browser and a downloadable raw data file. This was not just a massive data release but also a massive collaborative effort, which is detailed here. Four weeks after the release, the ExAC browser has received over 120,000 page views from over 17,000 unique users, and the raw data has been downloaded by over 150 organizations. The annotation tools ANNOVAR and ATAV have provided updates that have incorporated the ExAC data and the developers of Combined Annotation Dependent Depletion (CADD) have provided corresponding CADD scores. The commercial tools from GoldenHelix and GeneTalk have also incorporated the ExAC data. As the lead analyst on the project for over 2 years, I’ve been thrilled with the response it has received and the kind words and valuable feedback from the research community.

This practical guide, which uses two example genes (FBN1 and MECP2), is aimed at general users and shows how they can access information using the ExAC browser.

FBN1 Example

John Belmont commented on Nature News and Twitter that, within the ExAC data set, the FBN1 gene associated with Marfan syndrome has 11 subjects with loss-of-function (LoF) mutations. If these are true disease-causing variants, then this fits roughly with the 1 in 5000 incidence of the disease.
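The "fits roughly" claim can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation: it assumes the ExAC individuals behave like a random population sample and that every case would show up as an FBN1 LoF carrier, neither of which is strictly true.

```python
# Back-of-the-envelope check: how many carriers would we expect in ExAC
# if the disease occurs in roughly 1 in 5000 people?
N_INDIVIDUALS = 61_486    # individuals in the public ExAC reference set (from the text)
INCIDENCE = 1 / 5000      # approximate incidence quoted in the text
OBSERVED_CARRIERS = 11    # subjects with FBN1 LoF mutations (from the text)


def expected_carriers(n_individuals: int, incidence: float) -> float:
    """Expected number of affected individuals under a simple incidence model."""
    return n_individuals * incidence


expected = expected_carriers(N_INDIVIDUALS, INCIDENCE)
print(f"expected ~{expected:.1f} carriers, observed {OBSERVED_CARRIERS}")
```

Roughly 12 expected against 11 observed is the sense in which the numbers agree.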

The FBN1 LoF variants can be directly viewed on the ExAC browser by searching FBN1 or clicking on this link and then clicking on the LoF button. Things to note on the FBN1 gene summary page:

The coverage plot (affectionately called the Guilin plot) is of the canonical transcript, using the Ensembl definition. This may not necessarily correspond to the clinically relevant transcript.
The genomic coordinates use GRCh37 and NOT the recently released GRCh38.
Variant sites with multiple alleles are represented on separate rows.
The functional annotation and corresponding protein consequence are taken from the most severe impact amongst the transcripts, and may not apply to all transcripts. Because they are summarized across multiple transcripts, the amino acid positions can sometimes appear out of order.
Allele Number is the number of chromosomes, so is twice the number of individuals (maximum 2*61,486). Due to the nature of exome capture and quality thresholds applied, this will not always be at the maximum.
All variant data displayed in the table can be downloaded as a CSV text file and opened in Excel to restore the columns and rows.
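To make the Allele Number point concrete, here is a minimal sketch of how the maximum allele number and an allele frequency relate. The function names and the example site are my own for illustration, and dividing allele count by allele number is the standard convention rather than a description of the browser's internals.

```python
N_INDIVIDUALS = 61_486  # individuals in the public ExAC reference set (from the text)


def max_allele_number(n_individuals: int) -> int:
    """Two chromosomes per individual at an autosomal site."""
    return 2 * n_individuals


def allele_frequency(allele_count: int, allele_number: int) -> float:
    """Frequency of an allele among the chromosomes that were actually called."""
    if allele_number <= 0:
        raise ValueError("allele_number must be positive")
    return allele_count / allele_number


max_an = max_allele_number(N_INDIVIDUALS)

# Hypothetical site (made-up numbers): 3 copies of the allele, but only
# 110,000 chromosomes passed the capture and quality thresholds.
af = allele_frequency(3, 110_000)
print(max_an, f"{af:.2e}")
```

Using the called allele number as the denominator, rather than the maximum, avoids understating the frequency at poorly covered sites.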

Amongst the 10 Loss of Function variants in FBN1

The variant page for the stop-gained variant 15-48719948-G-C is a good example for highlighting important features:

The histogram of Depth and Genotype Quality (GQ) is for individuals with the allele. Click on the “full site metrics” check box to display the histogram for all individuals with genotype calls (including those that are homozygous reference).
The stop-gain variant does not affect all transcripts. The variant results in a missense change in the canonical transcript (ENST00000316623). In fact the canonical transcript has only 8/10 LoF mutations.

One of the upcoming features we are developing for the ExAC browser is the ability to view the sequencing reads from the reconstructed BAMs produced by the Genome Analysis Toolkit (GATK) HaplotypeCaller using the --bamOutput option. For the splice acceptor variant 15-48760301-T-C, this is particularly useful because it shows not only the reads/bases supporting the SNP call but also the reference sequence context and whether the acceptor site is canonical (i.e. ends in [T/C]AG).

Note: FBN1 is on the reverse strand
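Because variant IDs such as 15-48760301-T-C give alleles on the forward strand of the reference genome (the usual VCF convention), it helps to flip them onto the transcript strand when reasoning about a reverse-strand gene. The helper names below are hypothetical, and this is just a sketch of the reverse-complement step.

```python
from typing import Tuple

# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_transcript_strand(ref: str, alt: str) -> Tuple[str, str]:
    """Convert forward-strand alleles to the strand of a reverse-strand gene."""
    return reverse_complement(ref), reverse_complement(alt)


# Splice acceptor variant 15-48760301-T-C, as seen from the FBN1 transcript.
ref_tx, alt_tx = to_transcript_strand("T", "C")
print(f"{ref_tx}>{alt_tx}")
```

On the transcript strand the change reads A>G, which is what you would expect if the A of the AG acceptor dinucleotide is the base being altered.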

MECP2 Example

Loss of Function and ClinVar variants

Variants of the X-linked gene MECP2 can cause the neurodevelopmental disorder Rett syndrome, which affects mainly females. MECP2 LoF variants can be viewed by either following this link or searching MECP2 then clicking on the LoF button. In MECP2 there are 6 LoF variants. The stop-gained variant X-153296689-G-A has an allele count of 68, with 20 homozygous individuals. Currently the ExAC data set is not sex aware and does not differentiate between hemizygous males and homozygous females. An upcoming feature is to calculate these numbers correctly for variants on the X chromosome. The sex of each individual in ExAC was determined by heterozygosity on the X chromosome and normalized chromosome Y coverage.

Differentiating males and females from exome sequencing data, using chrX heterozygosity (X axis) and coverage on the Y chromosome (Y axis). Males form a cluster on the left, females on the bottom right. A small number of unassigned individuals are also visible, some of whom are probable Klinefelter cases.
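The clustering in that plot can be sketched as a simple two-threshold rule. The cut-off values and the function name below are made up purely for illustration and are NOT the values used for ExAC. All the sketch takes from the text is that males show low chrX heterozygosity with chrY coverage and females show the opposite, with everyone else left unassigned.

```python
# Illustrative cut-offs only: hypothetical values, not those used by ExAC.
HET_CUTOFF = 0.2     # chrX heterozygosity below this looks male-like
Y_COV_CUTOFF = 0.1   # normalized chrY coverage above this looks male-like


def infer_sex(chrx_het: float, norm_chry_cov: float) -> str:
    """Assign 'male', 'female' or 'unassigned' from two summary metrics."""
    low_het = chrx_het < HET_CUTOFF
    high_y = norm_chry_cov > Y_COV_CUTOFF
    if low_het and high_y:
        return "male"
    if not low_het and not high_y:
        return "female"
    # Conflicting signals, e.g. high chrX heterozygosity AND chrY coverage.
    return "unassigned"


for het, ycov in [(0.05, 0.9), (0.6, 0.01), (0.6, 0.9)]:
    print(het, ycov, infer_sex(het, ycov))
```

Leaving conflicting samples unassigned rather than forcing a call is what allows unusual cases, such as the probable Klinefelter individuals, to stand out.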

Now for the stop-gained variant of interest, all 20 of the homozygous individuals are actually hemizygous males. Similar to the FBN1 example, the stop-gained annotation only affects 1/3 transcripts while the other two (including the canonical) have a missense (p.Thr197Met, p.Thr209Met) annotation. According to ClinVar, this variant is a missense variant and classified as benign. The LoF variants X-153296104-TCAGG-T and X-153296112-AGGTGGGG-A with homozygous individuals are also due to hemizygous males.

The variant X-153295997-C-T is an example of a pathogenic ClinVar variant in MECP2 that is claimed to be associated with neonatal severe encephalopathy in males. The 4 homozygous individuals in ExAC are actually 4 hemizygous males. It was later argued to be a rare variant rather than pathogenic but still remains classed as pathogenic in ClinVar!

Finally, consider a pathogenic ClinVar variant that is not found in the ExAC browser, at genomic coordinates X-153296806. The coverage data, which are also provided for download, show that this site has adequate coverage for variants to be detected.

tabix -h Panel.chrX.coverage.txt.gz X:153296806-153296806

Fraction of samples at various coverage.

Chr	Pos	Mean	Median	1x	5x	10x	20x	30x	50x	>=100x
X	153296806	72.81	72.00	1.0000	1.0000	0.9995	0.9918	0.9670	0.8025	0.3005
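If you want to apply the same check to many sites, the tabix output above is easy to parse. This is a minimal sketch: the function names are hypothetical, and the "at least 90% of samples at 20x" rule is an arbitrary illustrative threshold, not an ExAC recommendation.

```python
# Header and row copied from the tabix output above (tab-separated).
HEADER = "Chr\tPos\tMean\tMedian\t1x\t5x\t10x\t20x\t30x\t50x\t>=100x"
ROW = "X\t153296806\t72.81\t72.00\t1.0000\t1.0000\t0.9995\t0.9918\t0.9670\t0.8025\t0.3005"


def parse_coverage_row(header: str, row: str) -> dict:
    """Turn one line of the coverage file into a dict keyed by column name."""
    keys = header.split("\t")
    values = row.split("\t")
    if len(keys) != len(values):
        raise ValueError("header and row have different numbers of columns")
    record = {}
    for key, value in zip(keys, values):
        if key == "Chr":
            record[key] = value
        elif key == "Pos":
            record[key] = int(value)
        else:
            record[key] = float(value)
    return record


def is_well_covered(record: dict, depth_column: str = "20x",
                    min_fraction: float = 0.9) -> bool:
    """True if enough samples reach the chosen depth (illustrative threshold)."""
    return record[depth_column] >= min_fraction


record = parse_coverage_row(HEADER, ROW)
print(record["Pos"], record["20x"], is_well_covered(record))
```

For this site over 99% of samples reach 20x, so the absence of the variant is unlikely to be a coverage artifact.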

Looking more deeply at the insertion/deletions (indels) that result in frameshifts

Another advantage of having the ability to view sequencing reads is that users can now look at the reliability of the more difficult indels and other complex variant calls. The Exome Variant Server (EVS) is a fantastic resource for the research community but did not have features for researchers to scrutinize indel variant calls. This was particularly concerning for researchers when publishing on novel disease genes. In the case of a recently published paper on LMOD3, for instance, the presence of homozygous frameshift indels in EVS greatly concerned our collaborators; it was only through careful scrutiny of the raw data for these variants that we were able to reassure them that these were genotyping errors.

Firstly, let’s take a look at the 28 bp deletion X-153296090-CGGAGCTCTCGGGCTCAGGTGGAGGTGGG-C in MECP2 which results in a frameshift variant.

Now let’s see the reads for a 1 bp insertion X-153296070-A-AG from a heterozygous female.
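The variant IDs used throughout this post (chromosome-position-ref-alt) already tell you the size and type of an indel. Here is a small sketch that parses them. The function names are my own, and the frameshift call is based purely on length, so it only makes sense for indels that fall entirely within coding sequence.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_variant_id(variant_id: str) -> Variant:
    """Split an ID like 'X-153296070-A-AG' into its four fields."""
    chrom, pos, ref, alt = variant_id.split("-")
    return Variant(chrom, int(pos), ref, alt)


def describe(variant: Variant) -> str:
    """Describe a variant by length: substitution, insertion or deletion."""
    diff = len(variant.alt) - len(variant.ref)
    if diff == 0:
        return "substitution"
    kind = "insertion" if diff > 0 else "deletion"
    size = abs(diff)
    # Length not divisible by 3 shifts the reading frame (coding indels only).
    frame = "frameshift" if size % 3 else "in-frame"
    return f"{size} bp {kind} ({frame})"


for vid in ["X-153296090-CGGAGCTCTCGGGCTCAGGTGGAGGTGGG-C",
            "X-153296070-A-AG",
            "X-153296689-G-A"]:
    print(vid, "->", describe(parse_variant_id(vid)))
```

The first ID has a 29-base reference allele and a single-base alternate allele, which is where the 28 bp deletion comes from.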

Both of these frameshift mutations appear real and may cause intellectual disability, so why do they exist in a data set of individuals without severe diseases? I propose three possible reasons:

There is an obvious drop in coverage where all the LoFs in MECP2 have accumulated. This may indicate a region that is difficult to capture or sequence, and perhaps one in which variants are also challenging to detect.
The shorter protein coding transcript ENST00000407218 avoids all but one LoF mutation (X-153296689-G-A) and may rescue some function lost in the larger isoforms.
Lastly, the LoF mutations are towards the end of the gene and may result in a milder phenotype.

Investigating which of these possibilities may be contributing will require further detailed analysis. We welcome comments from MECP2 researchers regarding the LoF mutations in ExAC.

Tri- and quad-allelic SNPs

I’ll end on an interesting point that arises from larger and larger data sets. The assumption that common variants remain bi-allelic is no longer valid, as with each new individual added there is a possibility of finding a new allele at a site where a bi-allelic variant is present. For example, the variant site rs2063690 is now a quad-allelic SNP – in other words, every possible base is present at this site in at least one individual in our data set! Furthermore, the figures below show three individuals who are heterozygous for the reference and each of the alternate alleles, while the last individual is heterozygous for two alternate alleles.

Heterozygous G/C (ref/alt)

Heterozygous G/A (ref/alt)

Heterozygous G/T (ref/alt)

Heterozygous C/A (alt/alt)

There is increasing urgency for the development of tools that deal appropriately with these multiallelic sites — approximately 7% of ExAC sites are now multi-allelic, and that fraction will grow as our sample size increases. That high rate of multiallelism shouldn’t be surprising, by the way; the ExAC dataset now (staggeringly) contains one variant every six bases on average, so it’s not a shock to see many cases where variant locations overlap.
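As a small illustration of what handling multi-allelic sites involves, the sketch below splits one site into a row per alternate allele (as the browser does) and labels the four genotypes from the figures above. The function names are hypothetical, and the chromosome and position are placeholders, since only the alleles of the quad-allelic site are given in this post.

```python
from typing import List


def split_multiallelic(chrom: str, pos: int, ref: str, alts: List[str]) -> List[str]:
    """Represent a multi-allelic site as one variant ID per alternate allele."""
    return [f"{chrom}-{pos}-{ref}-{alt}" for alt in alts]


def classify_genotype(allele_a: str, allele_b: str, ref: str) -> str:
    """Label a diploid genotype relative to the reference allele."""
    if allele_a == allele_b:
        return "homozygous ref" if allele_a == ref else "homozygous alt"
    if ref in (allele_a, allele_b):
        return "heterozygous ref/alt"
    return "heterozygous alt/alt"


# Placeholder coordinates (NOT the real location of the SNP); the alleles
# G (ref) and C, A, T (alt) are taken from the figure captions.
rows = split_multiallelic("chrN", 0, "G", ["C", "A", "T"])
print(rows)

for a, b in [("G", "C"), ("G", "A"), ("G", "T"), ("C", "A")]:
    print(f"{a}/{b}:", classify_genotype(a, b, ref="G"))
```

The last genotype, C/A, is the case that trips up tools assuming at most one alternate allele per site, because neither allele matches the reference.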

Final thoughts

We’ve been gratified to see the rapid and positive response of the community to the ExAC data set. We still have plenty of work to do, though – and we’d love to get your feedback. If you have issues with the data set or the website, please drop us an email. For website bugs or feature requests you can also lodge a Github issue.

Many thanks to Daniel MacArthur for comments and feedback, and for writing the introduction and final thoughts!

Guest posts are contributed by members of our community. The views expressed in guest posts are those of the author(s) and are not necessarily endorsed by the Genetics Society of America. If you'd like to write a guest post, e-mail jtreboschi@genetics-gsa.org.
