MIA: Jerome Kelleher, Simulating, storing & processing genetic variation data w/millions of samples

Published on Apr 26, 2017

April 19, 2017

(Production note: There were technical issues with the audio in the early portion of this talk. Rather than removing the talk from YouTube, we have opted to leave the video up in its entirety. The audio issue clears up at about the 28 minute mark.)

Jerome Kelleher
Wellcome Trust Centre for Human Genetics, Oxford

Simulating, storing and processing genetic variation data with millions of samples

Abstract: Coalescent theory has played a key role in modern population genetics and is fundamental to our understanding of genetic variation. While simulation has been essential to coalescent theory from its beginnings, simulating realistic population-scale genome-wide data sets under the exact model was, until recently, considered infeasible. Even under an approximate model, simulating more than a few tens of thousands of samples was very time-consuming and could take several weeks to complete a single replicate. However, by encoding simulated genealogies using a new data structure (called a tree sequence), we can now simulate entire chromosomes for millions of samples under the exact coalescent model in a few hours. We discuss some applications that these simulations have made possible, including a study of biases in human GWAS and the systematic benchmarking of variant processing tools at scale. The tree sequence data structure is also an extremely concise way of representing genetic variation data, and we show how variant data for millions of simulated human samples can be stored in only a few gigabytes. Moreover, we show that this very high level of compression does not incur a decompression cost. Because the information is represented in terms of the underlying genealogies, operations such as computing allele frequencies on sample subsets or measuring linkage disequilibrium can be made very efficient. Finally, we discuss ongoing work on inferring tree sequences from observed data and present some preliminary results.
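
To illustrate why a genealogical encoding can make allele-frequency queries on sample subsets cheap, here is a minimal Python sketch. It is not the speaker's implementation and does not use any tree sequence library: it assumes a single toy tree (no recombination), and the names `parent`, `mutation_nodes`, `subset_counts` and `allele_frequencies` are hypothetical. The idea shown is that each mutation is stored once, on a tree node, so its frequency in a subset is simply the number of subset samples descending from that node.

```python
# Toy genealogy: parent[u] is the parent of node u, and -1 marks the root.
# Nodes 0-3 are samples; nodes 4, 5 and 6 are ancestors (made-up example data).
parent = [4, 4, 5, 5, 6, 6, -1]

# Each mutation is recorded once, on the node above which it arose (hypothetical sites).
mutation_nodes = {"site_A": 4, "site_B": 5, "site_C": 0}


def subset_counts(parent, subset):
    """For every node, count how many samples in `subset` descend from it."""
    counts = [0] * len(parent)
    for sample in subset:
        u = sample
        while u != -1:  # walk from the sample up to the root
            counts[u] += 1
            u = parent[u]
    return counts


def allele_frequencies(parent, mutation_nodes, subset):
    """Frequency of each derived allele among the samples in `subset`."""
    counts = subset_counts(parent, subset)
    n = len(subset)
    return {site: counts[node] / n for site, node in mutation_nodes.items()}


# Frequencies within the subset {0, 1, 2}, without expanding a genotype matrix.
freqs = allele_frequencies(parent, mutation_nodes, [0, 1, 2])
print(freqs)
```

A real tree sequence holds many such trees along the genome and shares structure between neighbouring trees; this sketch only shows the per-tree counting step.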

For more information visit: http://www.broadinstitute.org/mia

Copyright Broad Institute, 2017. All rights reserved.
