Well, good afternoon, everyone. I think we're gonna go ahead and start, although we are north of 300 people joining and the numbers keep going up, but let's get on with this exciting seminar series. Let me start by welcoming all of you and thanking you for joining us, either live or, for those of you who will be watching this after the fact, on video, since these will be video archived. And welcome to this new NHGRI seminar series entitled Bold Predictions for Human Genomics by 2030. By the way, I should start by pointing out that I was just in conversation with today's speakers, who both live in Seattle. And we concluded that there's actually very little good to say about the pandemic, but the one thing the pandemic has forced us to do is to come up with creative ways of doing seminar series and other things using formats like Zoom. And that's actually a good thing for today because the truth of the matter is, had this been a traditional live seminar, I doubt it would be happening. We're getting snow and ice in the Washington DC area. And I have a feeling that our seminar speakers would not have made it into the Bethesda area today reliably, and I bet you we would not be having the seminar right now. So one advantage of Zoom is that they are comfortably in Seattle and are able to do this rain or snow or whatever. And so we will proceed. And for those who are local and watching this, what a great way to spend the next hour and a half on this yucky, icy day here in the Washington DC area. I wanted to spend a few minutes, since this is the first of a 10 part seminar series, to set a context so that in subsequent sessions, you will better appreciate why we're having this series and where it really derives from. And the origins of this, of course, relate to the new NHGRI strategic vision, which we published in October of 2020 and you can see on this slide. 
This strategic vision document, published in Nature, really encapsulated the feedback and the ideas and the creative energies we heard through a two and a half year strategic planning process that preceded the publication in October. And if you wanna learn more about it, our website shown here captures many aspects of that strategic planning process, which had many, many events, 50 in total, and lots and lots of input from the scientific community. For those who have seen or wanna go look at our strategic vision, we found that what we heard over two and a half years could be synthesized into a series of elements and those elements could be categorized into one of four major areas. And we describe each of these as you read through the paper: guiding principles and values, a robust foundation for doing genomics, breaking down barriers that impede progress, and pursuing audacious research projects in genomics. But unlike the previous two strategic visions published by the Institute since the end of the Human Genome Project, we decided to add something at the end of this paper. And it was a series of bold predictions. This is very much capturing the spirit of audacity, if you will, and is actually also aligned with something that the NIH has been doing more and more as they've been building up their broader NIH strategic ideas, trying to throw out some bold predictions of what might be possible in the coming years as we march and make progress in biomedical research. And so in the end, we actually came up with 10 bold predictions that we said just very well might come true, or maybe we should be reaching for them over the next decade, since this strategic vision was supposed to be about the next decade. And in particular, it's important to recognize sort of the framing for these bold predictions. 
And I'll just read, and you should just read along: some of the most impressive genomics achievements, when viewed in retrospect, could hardly have been imagined 10 years ago. But here are 10 bold predictions for human genomics that might come true by 2030. Now, although most of them are unlikely to be fully attained, achieving one or more of these would require individuals to strive for something that currently seems out of reach. These predictions were crafted to be both inspirational and aspirational in nature, provoking discussion about what might be possible at the forefront of genomics in the coming decade. Reaching way ahead, really lunging for what might be possible, was very much the essence of what we put together in these 10 bold predictions. And we really didn't do much other than just list them. I will tell you that when our strategic vision was published, the bold predictions captured a lot of attention, actually even more attention than many other parts of the strategic vision initially. I was invited to write a commentary for Scientific American online and that's shown here. They specifically said, write about the predictions, since that seems to be the most fun thing to really spur interest in the larger document and really does stir up people's imagination. Similarly, and in a very fun way, when our institute was thinking about how we should promote our new strategic vision and really get people excited about all the things that might be possible in the coming decade, we decided to put together a promotional video in-house and have that video specifically be organized around these 10 bold predictions. And so since this is the introduction of this 10 part seminar series, I thought I would share with you this promotional video. It's also available on YouTube. You can come to our website and find it as well. But if you have not seen it yet, 
I think you'll enjoy it and you will recognize the narrator of it, who's a very famous scientific author. So let me show you this video.
Genomics, which is the study of all DNA in an organism, called the genome, is changing the way we practice medicine, how we manage and treat some previously incurable diseases, and our relationship with society at large. Since the 1950s, double helix and DNA have become household words, standing in for the historic discovery of the four chemicals that give us life, commonly abbreviated as A, C, G, and T. Genomics is at an age of innovation and invention like no other, all thanks to how strings of those four letters encode the information for creating and operating something as complicated as a human body. Springing forward from the success of the Human Genome Project, a massive reduction in the cost of sequencing a human genome, and the dawn of the CRISPR gene editing era, genomics is leading into an even more audacious future, a future that will define our collective lives for decades to come. But what could this future look like? Here is a summary of 10 bold and fantastical predictions made by experts at the National Human Genome Research Institute and the genomics community for the next decade of human genomics. While most are unlikely to be fully attained by 2030, achieving even one or more of these by that time would have lasting impact for science, medicine, and society. Prediction one, sequencing and analyzing a complete human genome will become commonplace for any research laboratory. Prediction two, the role of every gene in the human genome will be known. Prediction three, environmental influences on our genome will be routinely used for making predictions about health and disease. Prediction four, genomics will no longer use social constructs such as race in human research studies. 
Prediction five, student projects involving the study of millions of people's genome sequences will be regularly featured at school science fairs. Prediction six, genomic testing will become commonplace in medicine just like a standard blood test is used now. Prediction seven, it will be readily known if a given letter difference in a person's DNA is clinically important. Prediction eight, a person's complete genome sequence will be available on their smartphone in a user-friendly form. Prediction nine, advances from human genomics will benefit all of society and not just a few. Prediction 10, genomic discoveries and technologies will help cure more genetic diseases than ever before. How do we get there? The National Human Genome Research Institute is committed to leading research to make such predictions a reality. Learn more about the institute's 2020 strategic vision for the future of human genomics by visiting genome.gov/2020SV. Dive deeply into a field that continues to transform medicine and will hopefully continue to bring curiosity, purpose and compassion for people everywhere.
So when we thought about how to extend the reach of our strategic vision and maybe couple to it a seminar series to sort of unpack some of the things, the idea bubbled up actually from my senior advisor Chris Gunter, whose brainchild this seminar series really was, that maybe we should link one session in a seminar series to each of the 10 bold predictions and then invite in speakers to talk about it. And that's exactly what we did. So over the next 10 or 11 months or so we're gonna have 10 sessions, each specifically focused on one of the 10 bold predictions. We actually have the speakers lined up for the first five. By April, we're gonna come up with the speakers for the second set of five. Each one of these 10 sessions is gonna have a very similar format. We're gonna always have paired speakers. It's always gonna be two of them. 
In general, it's not perfect, but in general, we tried to get somebody who was a little further along in their career and somebody who was a little earlier in their career. We're not gonna say who's who, but it's pretty obvious when you see our two speakers today. No, but you'll see in many cases we tried to do that to get the perspectives of those who have been in the field for a long time and those who may be a little newer to the field of genomics. They're gonna each be giving roughly 30 minute talks, and then in each case there'll be an NHGRI moderator. I'm the moderator for the first session, but you'll see nine other senior leaders from NHGRI serve as moderators in each of the next sessions. They'll have some discussion with the speakers and then there'll be an opportunity for Q and A with the remote audience. So that is the plan for the next 10 sessions starting today. And today we're gonna focus on the first of these bold predictions. We're gonna specifically focus on the idea that generating and analyzing a complete human genome sequence will be routine for any research laboratory, becoming as straightforward as carrying out a DNA purification. And we're delighted to be joined by two great colleagues and good friends of NHGRI, Evan Eichler and Karen Miga. Let me briefly tell you about each and then Karen's gonna give the first of the two talks. Karen is a terrific investigator who did her graduate work with Hunt Willard, got her PhD at Duke, ended up doing a postdoctoral fellowship at UC Santa Cruz with another NHGRI colleague, David Haussler, and is now a faculty member at UC Santa Cruz. And she has been involved in a number of really very, very important and cutting-edge research projects and programs related to genome sequencing. And she's gonna tell you about some of them. 
Evan Eichler did his graduate work with David Nelson at Baylor College of Medicine, did a brief postdoc at Livermore National Laboratory with Harvey Mohrenweiser, and then took a faculty position at the University of Washington, where he has been remarkably successful. He is now a Howard Hughes Medical Institute Investigator and a member of the National Academy. Evan and I have been friends and colleagues for more decades than either of us wish to admit to, but he is truly a star in genomics and has made major contributions in genomics starting during the Genome Project and certainly ever since the Genome Project, and also in the area very relevant for this first bold prediction. So we're gonna hear from each of them, then I'm gonna have a conversation with them, then we're gonna take some questions from the virtual audience. And with that, I will stop sharing my screen and encourage Karen to share hers. Come on.
Thank you for that introduction. Hopefully everyone can hear me. My name is Karen Miga. I am from the University of California in Santa Cruz and today I'm thrilled to be able to talk about reaching complete human genomes routinely by 2030. So for the past two decades we've actually seen more than a million fold reduction in the cost of DNA sequencing. And we can all appreciate that sequencing cost really influences the scale and the scope of almost everything we do in genomics and epigenetics. And so this particular bold goal is quite important and will have long range effects. However, I wanna emphasize that this goal isn't focused merely on sequencing costs but on more of a holistic picture from the sample to analysis. And in the past, this has always been something that we've taken into account for the cost of sequencing, but here we need to look at three different components. One, we need to think about this as a fast process, similar to a DNA purification. We need to make it easy. 
No longer should this be something that takes a specialist to sequence or to assemble. And also we need to make it inexpensive. And in doing so we'll make it more accessible and more equitable to all research and clinical labs. So when we think about 2021, where are we now? We've already championed that we've made huge leaps and gains since the initial Human Genome Project. Right now we're sitting proudly approaching the goal of $1,000 per genome, and this may actually fall in the near future. But we've also made a lot of technology and innovation gains in the bookends of this, so in the DNA extraction prep as well as the assembly, leading to more open software, cloud-based resources and even establishing new methods for data storage and sharing. This entire process, however, is not something that I would say is similar to a DNA purification protocol. Still, even though it takes a lot of time, expense and expertise at the moment, we've been really successful in generating, I'm gonna say, maybe at the low end of hundreds of thousands of human genomes. However, the really important word that I want everyone to focus on in this bold prediction is not necessarily routine but the word complete. So you see, right now in 2021, we've become really good at doing routine generating and analyzing, but all of the genomes that we've been generating are in fact incomplete human genomes. And the bold goal here is actually one of our future milestones, in that it's never been accomplished before. And that's reaching a complete, phased, diploid human genome. Now, many of you may think, hey, we've already reached a finished genome. We celebrated this back in 2003. And indeed 99% of the euchromatic regions had been finished and released, and we've been making wonderful strides and progress on these particular regions. However, the highly repetitive heterochromatic regions were intentionally left off of our maps. 
And these present very large gaps in every human reference that comes off of an Illumina sequencer or even a long-read sequencer, in the sense that these gaps seem to span regions that we know are centromeres, telomeres, low copy repeats or segmentally duplicated genes. And they also seem to exclude these tandem repeat copies that we know involve very important gene families as well as things like our rDNAs. This is really striking when you start looking at things like acrocentric short arms, where the entire short arm is missing from many, if not all, of our reference maps. And this can represent 25% of a given chromosome. So essentially our understanding of the human genome, the sequence, the epigenetics, the regulation and even the evolution, has been shaped by the sequence we can see. And we've been largely ignoring the regions we can't. And the regions that we have been ignoring are fundamental to cell viability and life. Centromeric sequences, for example, are known to be responsible for proper chromosome segregation, and these regions have a genomic complexity that you could imagine would influence aneuploidy, either in cancer or in early development. The rRNA genes are important for ribosomes. And this is critical to the foundation of life and bringing together peptides to make proteins. When we start thinking about genome and spatial organization, these are very special places in our genome where they're spatially distinct and have perhaps different regulation that could affect cell biology. And the large, unresolved intra- and interchromosomal segmental duplications, which Evan Eichler will talk about more in depth, can introduce genome instability and can cause copy number variation with regards to genes that could be important for clinical research as well as evolution. So I'm trying to use this to say, look, these are interesting and important biological parts of our genome. And we're not ignoring them because they're not interesting. 
We're ignoring them because we haven't had them as part of our maps before. In fact, these have always been incomplete. So today I want to celebrate our efforts to complete the first human reference. And this is largely credited to an open, community-based consortium effort led by Adam Phillippy at NIH, Evan Eichler and myself, in which we're trying to generate the first complete assembly of the human genome. The cell line that we're working with is actually quite important. It's an effectively haploid genome. It's derived from something called a complete hydatidiform mole. This is an early developmental mistake. You can think of it that way, where the maternal genome, which I'm trying to show here in red, is lost. And then the male genome is duplicated. So you can think about a normal genome as being two chromosomes, as I'm showing you here in red and blue, where you have the repeat sequences, which could vary in copy number. And trying to phase and assemble is actually quite complicated. But when you lose the maternal genome, when you have a duplicated paternal genome, this then becomes a much easier problem to solve because you can begin to focus on one repetitive region instead of phasing two. Now, our community is very open. We share data early and as soon as it becomes quality assured. So I just wanted to put this up at the very beginning of my talk to ensure that all the listeners would know that they could access our data and visit our GitHub page for any more information. So for repeat assembly, we've known for many, many years that it's very difficult to actually traverse some of these longer repeat sequences with short reads. Here, I'm just trying to show you one example of a 200,000 base pair tandem repeat. And you can imagine that many of these copies of the repeats are near identical, if not exactly identical. So if you have a short read that doesn't span a unique marker, it has the opportunity to confidently and exactly match multiple places in this array. 
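The ambiguity just described can be made concrete with a toy example. The following is a minimal sketch, not anything from the consortium's pipeline: the sequences, the 12-base repeat unit and the helper name `find_all` are all invented for illustration, and exact string matching stands in for a real read aligner. It shows a short read from inside a tandem array matching every copy, while a read long enough to reach both unique flanks has a single placement.

```python
def find_all(reference: str, read: str) -> list[int]:
    """Return every 0-based position where `read` matches `reference` exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits


# Toy tandem array (hypothetical): 5 identical copies of a 12-base unit between
# unique flanks. The array in the talk is ~200,000 bases; this is scaled down.
UNIT = "ACGTTGCAAGCT"
LEFT_FLANK = "GGGGATCCTA"
RIGHT_FLANK = "TTAGGCCCCC"
reference = LEFT_FLANK + UNIT * 5 + RIGHT_FLANK

# A short read that lies entirely inside one repeat unit (no unique marker).
short_read = UNIT[2:10]
# A long read that spans the whole array and touches both unique flanks.
long_read = LEFT_FLANK[-4:] + UNIT * 5 + RIGHT_FLANK[:4]

short_hits = find_all(reference, short_read)
long_hits = find_all(reference, long_read)

print("short read placements:", short_hits)  # one hit per repeat copy
print("long read placements:", long_hits)    # a single, unambiguous hit
```

In this toy the short read has five equally good placements and the long read has one, which is the same reason a read larger than the array can be assembled correctly.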
As a result, attempts to assemble using short reads often lead to misassemblies and collapse. So most of our algorithms in the past have ignored or filtered out these types of events. However, we're in a very exciting time right now because long read data are now in vogue and becoming accessible to researchers. One in particular is from Oxford Nanopore Technologies. This is something that produces ultra long reads. Here, what I'm trying to share with you is that the DNA sequence itself is fed through a protein pore. And as it travels through the pore, or the channel, you can actually see the signal file, and that can be read for either DNA or RNA. The cool thing here is that there's no size limit. So larger DNA can be placed in and it's really whatever can get to the pore. And so we're routinely seeing hundreds-of-kilobase size reads now, and we can reach reads that go greater than a megabase. And we're constantly seeing gains here with single read accuracy. Even today, I was seeing on Twitter that they've been announcing this new Q20 with experimental models. So this is a really fast moving and exciting space. And what's really cool when you look at the large tandem repeat example I was giving you is that if you have a read that is larger than the tandem repeat array itself, you can now begin to span these repetitive regions completely and get correct assemblies. Another really cool technology that came up for our consortium is the arrival of the high fidelity, or circular consensus, data from PacBio. These are not as long of reads. These can be 20 kb. However, the quality of these reads is Sanger quality in the sense that you have Q30. And the way you can achieve this type of quality is that you have a polymerase that goes round and round and you generate many, many of these more noisy subreads. But when you take a consensus, there's an incredible power here in getting high, high quality reads. 
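To make the Q20 and Q30 figures and the consensus idea concrete, here is a small sketch. The Phred relation Q = -10 log10(p) is the standard definition of a quality value. The consensus function is a deliberately simplified stand-in: it assumes the subreads are already aligned, equal in length, and contain only substitution errors, whereas real circular consensus calling has to handle insertions and deletions and uses a statistical model. The toy sequences and function names are invented for illustration.

```python
import math
from collections import Counter


def phred_to_error(q: float) -> float:
    """Convert a Phred quality value to a per-base error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Convert a per-base error probability to a Phred quality value."""
    return -10 * math.log10(p)


def majority_consensus(subreads: list[str]) -> str:
    """Column-wise majority vote over pre-aligned, equal-length subreads.

    Simplifying assumption: substitutions only, no indels, no ties.
    """
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*subreads))


# Q20 is 1 error in 100 bases; Q30 is 1 error in 1,000 bases.
q20_error = phred_to_error(20)
q30_error = phred_to_error(30)

# Hypothetical noisy passes over the same molecule, each with at most one error
# and with the errors falling at different positions.
truth = "ACGTACGTAC"
subreads = [
    "ACGTACGTAC",
    "ACCTACGTAC",  # error at position 2
    "ACGTACGAAC",  # error at position 7
    "TCGTACGTAC",  # error at position 0
    "ACGTACGTAC",
]
consensus = majority_consensus(subreads)
print(q20_error, q30_error, consensus == truth)
```

Because the errors in the individual passes rarely coincide, the vote recovers the underlying sequence even though three of the five subreads are wrong somewhere.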
And the size of the reads, 20 kb, is actually sufficient to begin to anchor across many of these smaller unique markers, or micro heterogeneity, that we know exist within some of the tandem arrays I'm showing you here in purple. And once you begin to utilize this type of micro heterogeneity, you can begin to traverse, using overlap assembly, across these more repetitive regions. So I really want to credit some of our team members from Adam Phillippy's group, in particular Sergey Nurk, who went through and actually did some experiments to show that HiFi base-level quality could be improved even further. And that's by taking the original CCS read and going through various stages of homopolymer compression and overlap error correction, and also looking at the indels and tandem repeats. And in doing so, you reach this almost perfect 20 kb read. Now, when you start to look at these satellite DNA repeats that I was sharing with you, which are in centromeric regions and can be millions of bases in length, you can begin to utilize these near perfect reads to do assembly. And here I'm trying to show you the self similarity of a centromeric repeat. You can see that many of the copies are near identical to one another. But as you start moving the read accuracy closer and closer to perfect, you can make it all the way across. And this to me was really groundbreaking, when I received news that we were now doing automated assemblies in HiCanu across centromeric regions that could be greater than five megabases in length, and regions that were previously thought to be shared between multiple different chromosomes. So now we're aiming for generating telomere-to-telomere complete chromosome assemblies across an entire genome. And telomere-to-telomere means you have a single contig of high quality, finished quality, from one telomere to the other. 
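The homopolymer compression step mentioned above can be sketched in a few lines. This shows only the general idea, collapsing every run of identical bases to a single base so that run-length errors no longer create mismatches between overlapping reads. It is not a description of how the consortium's assembler implements the step, and the example reads are invented.

```python
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse each run of identical bases to one base, e.g. 'AAACC' -> 'AC'."""
    return "".join(base for base, _run in groupby(seq))


# Two hypothetical reads of the same locus that disagree only in the lengths
# of their homopolymer runs (GGG vs GG, TTTT vs TTT).
read_a = "ACGGGTTTTAC"
read_b = "ACGGTTTAC"

compressed_a = homopolymer_compress(read_a)
compressed_b = homopolymer_compress(read_b)
print(compressed_a, compressed_b, compressed_a == compressed_b)
```

After compression the two reads are identical, so an overlap between them can be detected as an exact match; the true run lengths have to be restored in a later step.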
And we're using mainly two different sequencing technologies: HiFi, to construct a string graph using these long, perfect overlaps, and nanopore, to help with the harder tangles to resolve. And a lot of this innovation took place over a workshop during the summer. And I want to once again credit Sergey Nurk for really driving a lot of this. So from our custom assembly pipeline for T2T reconstruction, what I'm showing you here are all the different chromosomes. I hope you can appreciate that many, if not all, of the different chromosomes are shown independently as single linear events. In particular, many of our T2T chromosomes fell out from our first T2T reconstruction. Chromosome 11 is a nice example of that. However, we still have events of these types of linked connections. One would be on chromosome nine, which I'm trying to highlight, which is a large human satellite domain where we found a tandem duplication. You can see the links as well between the acrocentric chromosomes where the rDNAs are. Adam Phillippy is actually leading this particular effort, where we're trying to resolve some of the acrocentric short arms now. But I wanted to use this slide to show you that the actual arms themselves are well resolved. What we're seeing is that the assemblies all the way to the telomere and to the centromere seem to have separated quite well. It's just these recent recombinations with the distal junctions that need to be resolved. And we have these chromosome specific rDNA clusters. So our team is very active in this space. And we're hoping to close these five remaining gaps, which are held only for the rDNA, in the near future. So as I started off my talk by saying, hey guys, we have 25% of the chromosome missing for chromosome 21, it's amazing to me that I can now look and see with high resolution every single one of the acrocentric short arms. This is representing 90 megabases of sequence. 
We now have new classification of satellite DNAs, rDNAs and segmental duplications, with hundreds of new candidate genes that are emerging from these particular assemblies. Now, it's been my life's work focusing on centromeric satellites. So it's been a hard year for everyone with the pandemic, but this is perhaps one of the best years for me and my research team, because we were finally seeing these higher resolution maps of human pericentromeric and centromeric regions. And this is incredibly exciting, not only for the satellite DNA, which I'm incredibly excited about when thinking about genome function and biology, but also for the sequences that live with them. In this particular case I'm showing you chromosome one, where we've introduced 20 megabases of sequence that were outside of hg38, and where we now have our gene annotation team flagging multiple regions that contain new genes that had not been described before, using comparative annotation tools and PRO-seq to show signs of where the polymerase is going through. We've also organized a team that is digging into these new assemblies, doing repeat annotation and discovery, and we're finding new repeat elements that have never been described before. So how do we begin to use these maps to study things like epigenetic regulation and function? How do we start digging into the genome biology? So this is an incredibly fun collaboration that we've made with Winston Timp, and in particular, I want to credit Ariel Gershman, his student, who's been leading a lot of these analyses. And essentially when you're using nanopore sequencing, you get a direct read of the chemical structure as it goes through the pore, with high sensitivity. And when you have these super long reads, once again, I should say ultra long reads, on the span of over a hundred kb, you're now able to look at these epigenetic profiles over the repeats. 
And so here I'm trying to bring your attention to one tandem repeat, DXZ4. And this was known to be an area where we had a TAD boundary, and we're able to see for the first time this type of oscillation between, I hope your eyes can appreciate, the red versus blue, where the red is hypermethylated and blue is hypomethylated, until we see it kind of drop off into a hypomethylated region. And this seems to have a correlation with how we're envisioning the genome organization, with the repeats going through an inverted and more divergent stage for the repeat array as well. Really striking for me was that when we started looking at centromeric regions, we were also flagging dips of hypomethylation in the centromeric regions themselves. On the X chromosome, we published that we found one that was roughly 60 kb in length. And as you go through every single human centromeric region, we find that this is quite consistent. There's a consistent dip over all of the different arrays. This has been a lot of fun. So we've been moving into how to map short read data, teaming up with Gina Caldas from UC Berkeley, in Abby Dernburg's group. She's been optimizing a CUT&RUN experiment that can work for satellite DNA. And then also to credit the work from Glennis Logsdon, who's in Evan Eichler's group and has been leading some of the CENP-A analyses to show that these sites of hypomethylation are actually coincident with some of our centromere proteins and sites of kinetochore assembly. So this is actually the genomic definition of a human centromere. So, so far I've told you more of the highlights. We've released the complete sequence of a human genome. You can access it on our GitHub page. We have information there as well. Just to kind of give you the overview, we've estimated the size to be 3.057 gigabases. And we're looking at about three to 6% of new sequence compared to GRCh38. 
However, I want to kind of pivot my talk by emphasizing that we're not at the finish line. There are reasons to celebrate. There are reasons to be excited, but we're not there yet. So far, what I've been talking about is a focus on a haploid, what we're calling T2T, genome. However, there are fundamental technological barriers that stand in the way of getting us closer to this bold prediction. One is that we have to break through the first barrier to reach a diploid T2T genome. This is going through the challenges of trying to phase and do assembly of these repeats across the maternal and paternal chromosomes. And then not only that, once we do it for a diploid, we have to get past the second technological barrier, which is that we have to move this into production mode. How can we begin doing this for hundreds of individuals, if not thousands, if not making it more routine? And so just to credit once again, this has been a merger of interest between the Telomere-to-Telomere Consortium and the Human Pangenome Reference Consortium, which aims to generate 350 telomere-to-telomere assemblies in the next four years. So how will complete telomere-to-telomere assemblies revolutionize genetics and genomics? I think that there are three main points I want you to take away from my talk. One is healthcare. As I mentioned at the beginning of my talk, we haven't been able to see these regions for decades, but guess what? We're all going to be able to see them now. And once we have access to comprehensive variant screens, this will be something that will be more attractive, I imagine, once we begin to track different variants that could be involved with clinically important associations. Also, having these types of data will likely expand our understanding of human sequence variation and evolution. 
And stay tuned for the next bold prediction, two, because this is going to hopefully introduce a lot of new non-coding regions for us to begin to study for function, how the genome works, and understanding cell biology at a much deeper level. So I'm coming back to this bold prediction now that I've told you a little bit about where we are and our near-term expectations to reach this complete production of T2T diploid genomes. Right now with the HPRC, the Human Pangenome Reference Consortium, this will require a multi-platform approach, where we have to have HiFi and ultra-long reads and other orthogonal data like Hi-C to begin to put these particular genomes together. It also requires a multicenter approach, with lots of experts and lots of expertise. However, we have to look ahead for this bold prediction, and we want to move away from this model, where it's going to be very expensive and very expertise-rich, to ensure equity and accessibility. And so what we want to think about is what needs to change, what needs to happen, in order to reach this type of milestone. And there are some very obvious concepts here that I think everyone in the genomics community would probably nod along with. And that's the idea that there needs to be kind of a balance, an optimum, of finding coverage, sequence accuracy and read length at the right place. And once you start to hit that sweet spot, assembly should be quite easy. But I would argue that having longer and longer reads may actually take off some of the analysis component as well. In other words, you could have perfect reads that are shorter and bigger reads that are perfect, but it's probably going to take you less time to put together those bigger puzzle pieces. And so this is kind of the exciting game changer here. I'm trying to show you along the length of the chromosome. You could once again have those perfect puzzle pieces that I'm showing below. 
However, if we can reach QV30 or QV20 with reads that are capable of spanning huge multi-megabase parts of the genome, as well as maybe even read lengths that can span an entire chromosome, this becomes a very straightforward exercise. Also we need to think about bold predictions in terms of not only the sequencing, but more holistically, as I mentioned at the beginning of my talk. So we need to move this process in-house. We need to move away from sequencing in a queue. In other words, instead of this idea of moving it into a sequencing core or moving it into a company, having it to where it's on someone's bench or in someone's clinical research lab, where you have a sample, you can move this particular sample into some type of DNA prep and then move directly into analysis. And so there's a lot of interest here. Perhaps I've added another check mark, maybe for having a mobile device or a smaller device, which is currently being supported by Oxford Nanopore, to credit that particular sequencing technology. But I expect over the next 10 years, this may become something where we see lots of development and new sequencing innovation and technologies popping up, to where there's this opportunity to do things outside of a core. And I wanna kind of end my talk with this idea of a bolder prediction. This is the idea that we are not one genome. We do have of course a germline genome, but we're all collections of genomes. Our genomes can range in the amount of variation that we accumulate as we develop, as we age, as we're exposed to DNA damaging agents. I mean, we're very comfortable thinking about this when we start thinking about cancer, but it's likely that our bodies are constantly undergoing this type of variation even in otherwise healthy adults. And the repeat regions in particular, notably perhaps tandem repeat expansions, may actually be sensitive to some of these types of events.
Therefore over time, I would say we need to move away from this idea of having one complete genome per individual, and we need to think about complete single-cell sequencing so that we can begin to describe data as being a cell and a population of single cells from an individual. And this idea, I think, of moving into a kind of single-cell complete genomics is very bold and will take a huge technology push to reach, because right now, we're moving in that direction, but we don't yet have a pathway to where we can get this reproducible production of long-read sequencing. And it might be that we have a parallel track here to where we're no longer depending on sequencing per se. I've been very excited about seeing some of these more novel technologies where you have in situ sequencing, where once you know all the sequences in a comprehensive way for human genomes, you can begin to think about a model where you use FISH-based approaches to begin to study not only where the sequence lives, but also its proximity and spatial organization, which can tie us into all the other cool questions about regulation and spatial organization. So with that, I'm gonna go ahead and end my talk. I'd like to make sure folks realize that all the work that I shared is a huge collaborative effort, and it's really a celebration of one of the legacies of the Human Genome Project, which has been collaborative and open science. So thank you. Karen, that was terrific. But like I said, we're gonna wait and talk to you when the three of us can chat. So we will transition to Evan. If you go ahead and share your screen. Not to make you nervous, Evan, but there's over 550 people actively listening in right now and that number did not dip at all during Karen's talk. So I hope they stick around for yours. You've got some competition here. I'm happy to. Can you hear me, Eric? Yeah, I can.
All right, so I'm gonna pick up, I think, in a way where Karen has left off. I'm gonna come back to this question of completely phased and assembled human genomes and the bold prediction, which is to make this process of sequencing them routine. I'll put a slightly different emphasis, I think, on some of the dynamics that we're seeing in the genome and new insights that we've gained in terms of mutational mechanism. So what has been the challenge? Back in the day when we first assembled the human genome, I was interested in looking at the pattern of segmental duplications. So the pattern that's shown here with the blue lines represents these very large, highly identical sequences that exist in our genome. The purple represents the regions that Karen talked about, which were the centromeres and acrocentric regions. The other kind of devil in the detail with respect to finishing, and why we didn't finish the genome in 2000 or 2003, was the fact that these duplicated regions of our genome are essentially themselves highly copy number polymorphic in our species. And so one of the things we realized right off the bat is that the reason we had so many gaps in these particular areas of the genome was the fact that about 50% of all normal copy number variation in the human genome actually maps to that 5% of the genetic code that we have difficulty assembling. And the reason it's difficult is not just because it's large repeats but that it's in fact variable between humans. And so that took another decade or two to actually prove without a shadow of a doubt, but the really frustrating part was we weren't able to sequence resolve those regions. And so while 95% of the genome could be characterized, excluding centromeres and segmental duplications, where we have huge amounts of human genetic variation, these weren't resolved. And this is just to show you the interchromosomal pattern.
So the duplications are not only shared within a chromosome, they're shared among chromosomes. So the big achievement, I would say, with the telomere-to-telomere effort has been this focus on looking at essentially a haploid source and converting the human genome from one that's almost complete, like we see in the case of GRCh38, where every transition from a dark to a black here represents a gap, to one over the last, I would say, nine to 10 months that looks something like this, where essentially the human genome is almost there. It's almost complete. And I say almost because, as Karen pointed out, there are still these little nagging gaps located in the acrocentric chromosomes, which we estimate are about 15 megabases, and Adam Phillippy and others are working really hard to finish off those last bits and incorporate them into really a truly complete genome. So when we go back to the question, and again, I'm going to go back to the segmental duplication pattern here. This is the current view of the pattern of these large, highly identical segmental duplications. So instead of that representing 4% to 5% of the genome, we now know it represents 7%, and what's interesting to me at least is that the genome, now that we have our first reference, is actually more complicated than we thought before. There are in fact many more duplications, particularly in regions such as the acrocentrics, which actually contribute to duplications both within and outside of the acrocentric regions. The really exciting piece for me is that this is the part I've been waiting for for the last 25 years, having our first reference in hand. So we can really systematically interrogate the structure, the variation, the evolution and the function of these regions. So I want to focus on one. So this is not a segmental duplication. This is a centromere. This is a complete sequence and assembly of chromosome 8, work from Glennis Logsdon in my group with the Telomere-to-Telomere Consortium.
What I'm showing you here is a complete sequence. We now have a complete sequence across chromosome 8, but in particular I'm just zooming in on the centromeric satellite repeats. And so what you can see with this heat map, what we've essentially done is taken the genome, because you know the satellite sequences within the centromere are made up of repetitive sequences. We've actually broken it up into five-kilobase windows and compared every five-kilobase window against every other across this entire two-megabase region. And so any area where you see bright red, that means that the sequences are virtually identical. Anything that you see in cooler colors, I'd say down here in the blue and the purple, is more divergent with respect to one another. So here's the exciting point. When you look at this chromosome 8 centromere you can clearly see there are certain areas such as here that are highly identical and areas that are much more divergent. But the neat thing is that if you actually look at these divergent sequences, what we have often referred to as monomeric satellites, you can see that they themselves are more identical to one another than to anywhere else in these satellite repeat sequences, suggesting that this is probably the ancestral or vestigial centromere that existed in some ancestral primate lineage. So this pattern of layers continues out. So if you look at this little patch right here and right over here, you can see once again high sequence identity but, once again, a layer. And if you think about it, what we're actually looking at is evolution in action, where the youngest activity or the most recent activity in terms of unequal crossing over is actually pushing out centromere satellites in layers toward the periphery. So this idea of both a mirror symmetry, which doesn't exist for all centromeres, as Karen would have guessed, but exists for many.
And this layered organization is one that was predicted in the 1970s by George Smith, and it's one that we can now test, assess variation on, and develop evolutionary models for. Why? Because we have an assembly for the first time of these regions. So here's another area, this is another view of the acrocentrics. So each line here represents an acrocentric. Now just looking at the bar colors, you can see these are the gaps I've highlighted that are still work in progress. This is where the rDNA sequences live, but you can see that we know that the acrocentrics are composed of satellites, as Karen's already mentioned. Here are the centromeres for each of these. In addition, however, they are loaded with segmental duplications. In fact, segmental duplications are the most abundant sequence in the acrocentrics. And some of these segmental duplications carry genes. Some are shared between specific subsets of acrocentrics while others are shared even with chromosomes outside of the acrocentrics. So here's actually some experimental data that can show you this. This is work from Mario Ventura and his student Ludovica Mercuri, who basically took some of those probes, because we now have a reference acrocentric genome, and started testing them against different metaphases from the original source, CHM13, and from different human samples. And we're seeing patterns that we hadn't fully appreciated: certain probes of certain regions are fixed among all the acrocentrics. But others, as shown here on the right, are actually variable both between humans and in where they map. So here's, for example, one that maps to almost all chromosome 13s, but it's variable, what we call heteromorphic, between different humans on 14. And chromosome 18 is not an acrocentric, but clearly it's also present on about half the chromosomes that we surveyed.
Why I'm showing you this is that we now have the ability to look at both small-scale and large-scale variation, and because we have a reference for the first time, I would argue, over these regions, we can ask biological, evolutionary and structural questions. So as Karen suggested, the next big challenge is moving away from hydatidiform moles, which are haploid source material, and being able to really routinely produce fully phased and assembled genomes from both diverse human samples as well as clinical samples. And I would argue that this year, or the end of last, has been spectacular in terms of this development. You've heard about HiFi, or high-fidelity PacBio sequencing, but the idea of actually being able to generate this sequence data from any source material, any human, whether it's blood DNA or a cell line, and then combining it with what we call linking or phasing information, data from Hi-C, Strand-seq or Oxford Nanopore Technologies, I think is going to revolutionize the way we think about human genetic variation. And the reason for that, and there are two papers I'll point you to, one from our group working with Tobias Marschall in Germany, this one in Nature Biotech, and this other one from Shilpa Garg from George Church's group. This is a very simple concept: if we actually have high-fidelity, long-read sequencing, what we can do is build what we typically build, a squashed assembly of any human genome. And when we do this, we can actually distinguish with high accuracy all the SNPs, or most of the SNPs, that exist in that genome. So shown here are two different colors, for the father in green and the mother in red or pink. But then once we've created this catalog of single nucleotide variants, we use the Hi-C data, the Strand-seq data or the ONT data. And we use that to basically create a SNP-phased haplotype for each of the two complements of this genome.
Once we have a highly accurate, physically phased haplotype (we don't even need parental information in principle), we can then go back to those high-fidelity reads and assign them to either the paternal or maternal haplotype. And in so doing, do a sequence assembly of both complements of that given genome. So if you think about it, this is the two-for-one special of human genomics. What we can essentially do is phase and assemble both the maternal and the paternal complement based on having phasing information combined with high-fidelity sequencing data. So shown here is our first attempt, on a child. This is actually a child from the 1000 Genomes Project, a Puerto Rican child, where we were able to essentially reconstruct not a three-gigabase genome, but a six-gigabase genome, with both the paternal haplotype, shown here in blue, and the maternal haplotype almost completely resolved. And so just so you guys get an understanding of this, anywhere where you see essentially, for example, continuous blue or continuous pink means we perfectly phased that portion of the genome and assembled it into contigs that were greater than 25 megabases in size. Anywhere we see a blue juxtaposed with a pink means we made a mistake, there's a phasing error. But here's the important point. It's not completely solved and there are obviously still gaps in some of the more complex regions, but 96% of the human genome can now be phased and assembled into contigs that are almost as good as the human reference genome. That's, to me, revolutionary. So our lab and many other labs, we're interested in all forms of variation, not just the SNPs, not just the indels, but also structural variation events. And so what we can do is, once we have a phased assembled genome, we can now take the phased assembled genome and actually compare it back to the reference and completely define the break points of structural changes that exist in both the paternal and maternal haplotype.
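To make the read-assignment step concrete, here is a minimal sketch in Python of how reads might be sorted into two haplotype bins once a set of phased SNPs is in hand. This is only an illustration of the idea described in the talk, not the pipeline used in the papers mentioned: the function names, the `min_margin` parameter, the simple majority vote and the toy data are all assumptions made for this example.

```python
from collections import Counter


def assign_read(read_alleles, phased_snps, min_margin=1):
    """Assign one read to 'H1', 'H2', or 'unassigned' (hypothetical helper).

    read_alleles: dict mapping position -> base observed in the read
    phased_snps:  dict mapping position -> (base on haplotype 1, base on haplotype 2)
    min_margin:   minimum vote difference needed to make a call (assumed parameter)
    """
    votes = Counter()
    for pos, base in read_alleles.items():
        if pos not in phased_snps:
            continue  # position is not a phased heterozygous SNP, so it is uninformative
        hap1_base, hap2_base = phased_snps[pos]
        if base == hap1_base:
            votes["H1"] += 1
        elif base == hap2_base:
            votes["H2"] += 1
    if abs(votes["H1"] - votes["H2"]) < min_margin:
        return "unassigned"  # no informative SNPs, or a tie
    return "H1" if votes["H1"] > votes["H2"] else "H2"


def partition_reads(reads, phased_snps):
    """Split reads into per-haplotype bins that could be assembled separately."""
    bins = {"H1": [], "H2": [], "unassigned": []}
    for name, alleles in reads.items():
        bins[assign_read(alleles, phased_snps)].append(name)
    return bins


# Toy data: three phased heterozygous SNPs and four reads.
phased_snps = {100: ("A", "G"), 250: ("C", "T"), 900: ("G", "A")}
reads = {
    "read1": {100: "A", 250: "C"},  # matches haplotype 1 at both SNPs
    "read2": {250: "T", 900: "A"},  # matches haplotype 2 at both SNPs
    "read3": {500: "C"},            # covers no phased SNP
    "read4": {100: "A", 250: "T"},  # conflicting evidence, one vote each
}
bins = partition_reads(reads, phased_snps)
print(bins)
```

In this toy case read1 lands in the first bin, read2 in the second, and read3 and read4 stay unassigned. Each bin would then go to the assembler on its own, which is the two-for-one idea: one set of reads, two assembled haplotypes.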
So this individual has about 25,000 structural variation events that are fully resolved. And the little heat map shows you that they're non-randomly distributed, particularly near the ends of the chromosomes. Something that we figured out over the last couple of years was that in fact structural variants are biased toward the last five megabases of chromosomes, up to about four- or fivefold compared to anywhere else in the human genome. Of course, on top of that, we also get the SNPs and the indels. But the one point I want to make is that when we sequence these genomes, as Karen says, with short-read data, they're largely incomplete. When we do the same analysis with short-read data, we capture only about 25% of these 25,000 structural variants. So we also get the SNPs. So this is showing you the SNP pattern, 4.1 million fully phased SNPs, showing pockets of increased diversity over HLA or MHC regions. But the really important point is that the sequence-resolved structural variants, the indels and the SNPs are all fully phased into one context. So having done one genome, the Human Genome Structural Variation Consortium, whose focus is on complete understanding of normal structural variation in the human genome, started a pilot project to tackle 32 human genomes. These were selected from the 1000 Genomes Project wherever possible to have one representative, the so-called index genome, from the 1000 Genomes Project, and basically apply this phased-assembly long-read approach to characterize structural variation. I'm highlighting here the two lead authors of this, who were Peter Audano from my lab and Peter Ebert from Tobias Marschall's group. So I'm not gonna go into a lot of detail, but what we were able to do from this analysis on these 32 1000 Genomes samples was build 64 phased and assembled human genomes. And we focused specifically on structural variation discovery.
What I'm showing you in this plot here is essentially each of those phased assembled genomes and how much additional structural variation was identified by the N plus one phased assembly. So in total 107,000 structural variants, and what's indicated here in the dark blue are ones that are predicted to be polymorphic. And obviously you can see some interesting patterns here. You can see that as you get more and more genomes, the amount of increase is essentially more modest, except when I break it out specifically, Africans versus non-Africans: Africans actually yield twice as much novel variation. And our data suggests that we have not even begun to aptly sample the diversity that's represented in Africa. So this is a nice example to suggest we do more African-based sequencing, since this is the origin of the human species, in order to actually better understand variation. But the most important piece is that once we've sequence-resolved the structural variation, what we can now do, because we know the break points, we know the SNPs, we know the indels that flank it, is go back to Illumina data and genotype it. And so this tool that was developed, PanGenie, by Tobias Marschall essentially allows us to go to Illumina data. So this is Illumina data generated from the 1000 Genomes Project, and we genotype the variants that we've discovered. And our data suggests that we can probably genotype, once sequence-resolved, more than 50% of the genetic variation that's in structural variant sequences. So what do you do when you can do that? Well, you can ask interesting questions like, are there in fact specific structural variants that are really stratified between human populations? Things that are at high frequency in one population but low in everybody else. That's essentially what this max PBS on the Y axis is doing. It's showing you variants, structural variants.
And we specifically picked ones that were near or within genes and that are at high frequency in certain populations, but low frequency in others. And the X axis just simply indicates the size of the event. Some of you pop gen people might recognize this gene. This is the very famous human genetics gene called lactase, which is in fact enriched in European populations for particular SNPs that are associated with continual expression of lactase, allowing us to digest milk or dairy products past two years of life. So what we found in the lactase locus by doing this, the 64 human genomes, was essentially that we sequence-resolved, in addition to the SNPs that are often associated with lactase persistence, a four-kilobase insertion precisely in the first intron of the lactase gene, very close to the promoter, that is loaded with transcription factor binding sites, if you just go in here, and which is deleted in most Europeans and in fact all Europeans that show lactase persistence. So this is the hypothesis. The hypothesis is that maybe the SNPs in conjunction with this large chunk of sequence that has been lost in the European ancestors, and maybe the loss of temporal regulatory elements, is responsible for lactase persistence in the human species. So I'm not restricted to studying humans, and many of you who know me know that I love my apes, the non-human apes. And in particular, I've been excited by the prospects of doing telomere-to-telomere assemblies of essentially the great apes and beyond. And I'm really excited we've started this work, and this is some very early work on this, but we started to build our first assemblies of chimps, gorillas and orangutans using the same basic technologies. And we are discovering, much to our pleasure I guess, very large genetic differences between our species. So you often hear people talk about our genomes being 99% identical, or 98%.
Here's an example of a region on chromosome 17q in which there are 500 kilobases of additional sequence that have been added to the human lineage, shown here along the top, compared to the chimpanzee. Moreover, this additional sequence, which is blown up below, is packed with genes. In fact, a gene family known as TBC1D3, which has expanded in some humans up to about 12 copies compared to chimpanzee, where there's a single copy. In addition to these genes there are also chemokine ligand genes, which have actually been studied over the last two decades, that are enriched in this specific region of the genome. What's exciting about this is that work, not from our lab but from a different group based out of China, showed that in fact TBC1D3 actually promotes the generation of basal neural progenitors. So these are the progenitors that will give rise to neurons. So an increase in the number of cell divisions, which results in increased cortical folding in mice. And if you know your human and your chimp brains, one of the big surprises or big interesting aspects about humans that may make us unique as a species is this expansion of the frontal cortex. So here we have a genomic change associated with a functional gene which is implicated in essentially the expansion of the frontal cortex, which we now can fully realize, but it's been sitting in the genome and we just haven't been able to see it. To make matters a little bit more complicated, this is in fact a view of variation within humans. So the only thing I want you to focus on are these little red ticks that are shown here. These represent one of those two clusters, and what I'm showing you in this is essentially variation among different human haplotypes. So you can see that humans are ranging all the way, for this particular haplotype, from seven to about two copies. So humans are highly variable in terms of their copy number and we don't know what the implications of this are.
One thing that we're excited by, and something that we're working on with the HPRC as well as the HGSVC, is how to represent this type of variation, where humans vary from 200 to 600 kilobases over a gene-rich region of the genome. And work from Heng Li, for example, has developed graph-based approaches to actually visualize this. And so what I'm showing you in this kind of string diagram is that all the regions that are conserved among humans are indicated in red and all the variation is represented as these bubbles of different colors corresponding to structural changes that we can now fully sequence resolve. So in addition to finding differences that make us uniquely human compared to other non-human primates, I'm excited by the prospects of actually going back and figuring out the big differences that distinguish us from our archaic hominins. This is one that we published a few years back. It was based on using long-read sequencing to fully assemble a 105,000 base pair duplication event which contains about a half a dozen genes, including BOLA2, which are expanded in all humans but not seen in either Neanderthals or Denisovans. So something that happened in the last 200,000 years at the root of our species. We have shown, and actually another postdoc who went on to be an independent faculty member has shown, that this expansion of BOLA2 actually changes an individual cell's ability to metabolize iron. The significance of this is unknown, but we do know that this expansion also came at a cost, in that it exposes our genome to recurrent microdeletions associated with autism. More recently, that exact same region of the genome that had this expansion in the Homo sapiens ancestor, we recently discovered, is a site of an incredible copy number variant that's variable between human populations. So this is actually a little plot showing copy number for different human populations, including the two archaics on the top.
And one thing that should stand out is this Melanesian population, which has increases in copy of three and four of a segment that's greater than a hundred kilobases. So we worked with some of our colleagues from Papua New Guinea and we actually sampled all of the islands that we could get our hands on in terms of sampling how frequent this event was. And it turns out that this particular duplication is found in 80% of all Melanesians and probably all Oceanic populations that are native to those areas, but 0% outside of those areas. What's really cool is that with the long-read sequencing and assembly-based approaches we could fully sequence resolve that polymorphism. What we determined was that it was actually 383 kilobases in size. So a massive difference. And it actually encoded two to three novel, quote unquote novel, genes, because they have duplicates elsewhere in other human genomes but these are specific to Oceanic and Melanesian peoples. One of these genes is nuclear pore interacting protein B16. We found a new copy of it. And we show that the locus itself shows evidence of classic positive selection in terms of amino acid replacement changes as well as selective sweeps, suggesting that it may be conferring an advantage to people living in Melanesia; we just don't know what it is. But here's the point: by actually having complete phased assemblies we can now discover these, develop hypotheses and test them going forward in terms of the functional significance. So this is our model for this particular event. The event itself arose in an archaic species related to Neanderthals, called the Denisovans. And that particular event is probably about 300,000 years old, but as some of you may know, those archaic hominins actually interbred with ancestral human populations. And in so doing they introduced this 383-kilobase polymorphism back into the human lineage. Then we believe as a result of selection it rose to high frequency over the last 40,000 years in Melanesia.
So this is where it gets fun. And Eric asked us to devote the last few minutes to the short and long term. So short term in my book will be the next 10 years. So that's 2030. Long term is beyond 2030. So the first thing I'm gonna say is I believe that we will have a nearly complete understanding of human polymorphic genetic diversity. And some of you may be thinking, well, we already knew that from the SNPs, wasn't that solved like 10 years ago? When I say complete I mean for structural variants, duplications, centromeres, everything. No base left essentially as a gap. And I think this is doable, and the work from the Human Pangenome Reference Consortium, which both Karen and I are part of, and the telomere-to-telomere assembly of those 350 is a starting point. I would predict in 10 years there will be thousands of human genomes that will be individual references, from efforts in Asia, efforts in America, efforts in Europe and Africa. And I think a fun thing will be to not only understand essentially variation in contemporary populations but also go back and reconstruct what existed in the archaic species and what really happened in the trajectory leading to essentially the evolution of the species. The example I presented on 16p11.2 is one of about two dozen that we have characterized over the last few years. So I wanna extend the tree back, and this actually comes from the strategic planning, believe it or not, Eric, from David Haussler, at I think one of the first meetings we had, and it's one that we haven't achieved. So I'm bringing that back up because I think one of the things that's possible now is, if we can build telomere-to-telomere assemblies, why would we stop at 350 humans? We have already begun the work on sequencing the apes. One could argue that we need index species including biomedically relevant organisms, but I think we should be bolder than that and we should basically do essentially all of them.
All 148 species and subspecies, and not just one, but I think we should do two, three, four or five of each of these. If we do this, we will be able to reconstruct with the finest resolution the evolutionary history of every base of the human genome. And some of you will say, well, who cares? But if we do this, we'll be able to develop more realistic estimates of evolutionary process, fitness, drift and selection. And most importantly, be able to predict pathogenic events much more accurately in our children who are born with genetic disease. So as an example of new evolutionary insights and new mutation insights, when we started to phase and assemble orangutans, gorillas, bonobos and chimps compared to humans, we found areas of our genome, which are indicated here, where essentially the area of the genome has been toggling back and forth from a direct to an inverted status for tens of millions of evolutionary years. And the reason for that is because segmental duplications have been residing at the boundaries of these for 10, 20 million years. And so they're like a switch flipping on and off in terms of orientation. And they continue to do so in the human population. The really exciting thing too, for the regions that Karen and Glennis study, is to go back into those regions and actually start to look at parent-child trio information. So imagine, right? People have talked for the last five years about de novo mutations and getting an accurate estimate on them. But they haven't looked at the most mutable regions of the genome. And now that we can, we can actually see variation in both mom and dad but also in terms of children. So here's that region again, where we've sequenced and assembled, I think it's the first diploid assembly of, the chromosome 8 centromere from the father and the mother of that Puerto Rican child. And you can actually see variation. So this region, for example, is bigger in the mother versus essentially the father.
This region, on the other hand, is bigger in the mother compared to essentially the father. And so once we fully phase and assemble these complex regions within a family, we can go back and look at mutational patterns. So this kind of dashed line here represents what we would normally expect in comparing two alleles in humans. And this is the type of diversity that we're seeing across the centromere. We estimate a minimum of a four- to fivefold increase in the rate of mutation over these regions. So with respect to disease, I think the handwriting should be clear: long-read phased genomes are the future of clinical diagnosis. This is the 10-year-plus part of the vision. So as an example, this is a disorder that was in large part characterized here by a former postdoc, Heather Mefford, called Baratela-Scott syndrome. It's due to a triplet repeat expansion in the 5' UTR of a gene called XYLT1. So this is a typical family. Actually both alleles, both maternal and paternal, have this expansion. And the way it was identified, and it's still diagnosed, is by Southern blots. But imagine, as I'm showing here to the right, we can just go in and sequence and assemble and phase. Here's the mom with two alleles, 237 and 674 base pairs. And here's the child with hyperexpanded CGG repeats on both the paternal and the maternal haplotype. This can be done for all the triplet repeat diseases. It can be done for all the undiscovered ones that are coming forward, by one-stop shopping with phased genome assembly of clinical material. And in fact, even for some of the most complex regions. So this is actually a child that has multiple rearrangements on essentially multiple chromosomes. Typically the way we would do this is by whole genome shotgun sequencing and measuring read depth, trying to identify break points.
And this is the type of output that we would see, whether it's array CGH or whether it's essentially a read depth profile. But with phased genome assembly, and this was actually done with Oxford Nanopore Technology using a method called Read Until, we get complete breakpoint resolution, which is indicated here by the subway map that Danny produced, showing all the breakpoints that exist in this child's chromosomes. And they in fact match perfectly with essentially the karyotype. However, additional breakpoints involving additional genes were discovered with essentially base pair resolution. This is the future, I believe, of how we will diagnose children with essentially genetic defects. So in summary, I guess coming beyond 2030, I probably don't need to tell you that I believe this is a revolutionary time for our field. Much like PCR and Illumina sequencing were, I think phased genome assemblies are going to revolutionize human genetics for really one simple reason. We are now discovering variants in an entirely new way. We're not discovering variants by aligning reads to a reference, but we're building references from our humans and actually characterizing variation by direct genome-to-genome comparison. And I'm going to predict that when we do this and we do it well, humans and non-humans are going to be much more genetically diverse than we appreciate. And we are going to understand the differences that make us uniquely human as well as the population differences that help us to adapt to new environments and change our susceptibility to disease. And why I believe that is because these types of large events that have largely been cryptic to us are now becoming clear, and the functional effect of those large events is going to be much greater than what we've seen for SNPs. I'm going to predict that all unsolved Mendelian disease will be solved. And this will lead to not only improved diagnosis but to therapies.
And I'm also going to predict that both rare and complex structural variants will be recognized as playing a larger role not just in rare disease, but in common disease. And I believe this is part of the solution to the missing heritability question, which has plagued human genomics probably for at least the last decade and a half. And finally, much like what Karen suggested, I'm going to say that a phased human genome assembly will become part of our medical record, and not just one, but really sampling our genome at different time points. And I think it'll actually happen at multiple levels for different tissues. Maybe some of it will be single cell. And I think it'll be coupled with full-length transcript sequencing and epigenomic profiling. And I think the real power will be actually seeing these mutational events arise, pre-cancer somatic mutations, and actually being able to do what we should be doing in this country, which is preventative medicine as opposed to reactionary medicine after the fact. So I think this is going to transform our field, and really at the most fundamental level in terms of actually understanding mutational process and what makes us human and what makes us different as humans. So these are all the folks. I highlight a lot of people that let me present unpublished data. I think they need a special acknowledgement on this. And I also want to acknowledge my lab, in particular these folks who have continued to kind of push science forward despite the fact that we have to meet by Zoom and suffer the indignity of not being able to meet in person. All right, thank you.
Thank you, Evan. That was terrific. I just want to hug Karen. I hope we'll bring Karen back on now. Maybe Evan, you can stop sharing your screen. We'll bring Karen back. There she is. And I encourage the audience to please submit your questions in the Q&A box. And Chris Gunter is going to be looking at those questions and bringing them to us in a minute.
But right now we're going to just have a bit of a conversation with Evan and Karen and I, and Karen and Evan can also ask questions of each other. I will kick this off. When we wrote this bold prediction, obviously it had to be very succinct, but you could unpack it in a lot of different ways. What exactly do you mean by routine? What exactly do you mean by straightforward, et cetera, et cetera. And the other aspect of this, of course: we said any research laboratory, but as both of you illustrated, and in particular Evan at the end really illustrated, there will be evolutionary biologists who may have one set of needs, medical geneticists who have another set of needs. And of course, fundamental biologists using sequencing capabilities for all sorts of applications. One just could imagine there's going to be a diversity of needs, and what is routine for one set of expertise areas may be different than another. And of course you folks are so far ahead, the rest of the world is going to have to catch up with all of these subtleties. I guess as you've thought about this, do you think we will get to the point where it'll be one set of methodologies used for all different applications and everything will just be so easy? Or do you think that what might be routine for medical geneticists will have to be a little bit different than what's routine for an evolutionary biologist? And that is both at the experimental bench level, but also at the computational level. What do you think that's going to look like? A black box, where everybody uses the same one, or multiple different variable boxes and approaches? I don't want to go first. I'm happy to take a guess. This is all speculation, I think. I like the comparison that Evan made at the end of his talk to this becoming more PCR based.
And in that case, I do like the idea of this not being too diversified, because I like the idea of it being more accessible and more equitable across all research labs. However, I still think that there's room for a lot of different innovation to carve out different areas of this. You can imagine that if you needed to survey genome sequences in the ocean or in the middle of the rainforest, you're using some particular type of innovation, versus if you're in a clinical lab and you're wanting to look at spatial organization and methylation profiling and things like that. So I think there absolutely will be this type of Swiss Army knife approach, where the technology will start to move into what is needed for those particular specialized and focused questions. I would probably, I mean, I agree there's always going to be specialized applications, but I'm thinking about, yes, my clinical colleagues and my evolutionary biologists and my pop gen people. I think there will be one size fits all for most people, and the way I think about it now, it'll change probably a year from now. I've seen such power in combining the ultra-long reads, which Karen talked about, with the HiFi or high-fidelity PacBio reads, where the ultra-long provides you the scaffolding and the HiFi provides you the accuracy, that I don't think it's that much of a stretch to think that it could be as routine as PCR to assemble and phase a genome. Now, there's obviously lots of room to go. I mean, it's not going to happen overnight, but the potential of those two technologies I think is still untapped. And what's missing, and it is funny, I think Maynard Olson would get a kick out of this: going back to what we really need, we need some of these methodologies that were popular in the 1990s, right? To actually streamline high molecular weight DNA preparation. The higher the molecular weight of the DNA, the more of it that can be contiguously sequenced, right?
And so I think if we can, this might go back to even your postdoc, Eric. Oh my God, agarose plugs and DNA, oh, I'm so good at that. I'm not good at many other things. You're going to have a job after this, thankfully. I was worried about you, right? But the bottom line is that I think that this type of combination right now, and to be fair, I have no stock in either of the companies. But in terms of what I've seen in terms of combining these, I think it's just absolutely amazing. And I think there are obviously still algorithmic challenges, but with people like Adam and Tobias Marschall and Adam Phillippy, these are being taken out, right? Pretty quickly. And I think what's going to be the limitation is, you know, how when you're in the field in the rainforest you're going to get high molecular weight DNA from a tree or an insect that lives out there. That's a challenge, I agree. But I think there is going to be a method that's going to be pretty standard and useful for most of the community. I'm really hopeful for microfluidics. I grew up also doing pulsed-field gels for Southern blots, and so I know the history of that badge of honor. However, I was really excited that Bionano was showing some of their automated microfluidic processing of large DNA molecules. And so it could be that this is a bioengineering project that's probably going on now. And I could easily imagine that if you can have a small device with a microfluidic system attached to it where you can deliver high molecular weight DNA, and it's stable and not sensitive to temperature fluctuations, you've now created a black box or an opportunity to do long-range sequencing wherever you want to be. So I actually want to hear from the questions then. I want to tell one thing to each of you before I ask Chris Gunter to start asking questions that came in.
So Karen, thank you for upping the ante by putting on an even bolder prediction, that we were going to get, you know, a complete sequence from individual cells. I like that, I like your audacity. And Evan, I did want to let you know that we did not lose the thread of this idea of getting the evolutionary history of every base of the human genome. I will tell you, it was the 11th bold prediction. It literally was a bold prediction that was drafted, and at the end of the day it just didn't make the cut, but it actually is found in the paper. I don't know if you saw it or not, but that concept is actually incorporated at the end of the paragraph where we talk about how an aspirational goal should be understanding the evolutionary history of every base of the human genome. So it is in the strategic vision, we haven't lost it. It just was number 11 and we decided to have 10 bold predictions. Chris, why don't you give some of the questions that are coming in from the audience? Yeah, absolutely. So we did actually get a question from Tom at NIH about high molecular weight DNA, which is great, but you have already covered that. So instead I'm going to roll a second question from Tom in with one from Matt, who asked about cost. So Karen, in the beginning you said that things need to be fast and easy and inexpensive. That was part of the bold prediction. And so they are both asking about the affordability of several parts of this: the price for sequencing a genome, the price for storage of data, and then the computational capacity to be able to process it. So can you comment on those? Yes, but in a speculative way once again. I would say that where we are right now, where Evan was crediting a lot of these long-read technologies, both HiFi and ultra-long, having these types of dependencies on orthogonal methods makes it quite expensive, in the sense of trying to bring this anywhere near a reasonable price tag compared to what we're getting now for something like a high-coverage Illumina run.
I've been seeing a lot of promises and deliveries on trying to get mid-length reads and high throughput on the PromethION from ONT, and I'm seeing a lot of growth, like Evan said, headroom here for HiFi. So I think both of these sequencing technologies are really stepping up their game, and over the next 10 years we're gonna see the price fall, hopefully once again, to where it will be competitive with what we're seeing now for short reads, and that will be transformative because it would kind of change things. I think that the question, however, was smart in the sense that they not only focused on the sequencing technology, but the data storage. That's huge, that's a huge cost, and it's not even just storing the data for the short term. It's going into where you can constantly access it and do these types of comparisons, like a pangenome project where you wanna grow from hundreds of genomes to millions of genomes, where you have to store all that information and all of the sequence that went into it, and that's not cheap. And so I think that with the bold prediction that we've been talking about, you have to be careful about where the analysis ends and where storage and maintenance picks up, and that's a pretty hefty price tag that I think we've always talked about, and it's not going away unless you put things on Glacier or start thinking about some of these really fun technologies, of how we are gonna just jump into a new technology for DNA storage or storage in a different way than what we're doing now. Yep, Evan or Eric, you wanna add on to that? I mean, I would just say that currently to do a HiFi genome with some ancillary data, say Hi-C or Strand-seq, is about 10 grand. So the actual HiFi data is still the bulk of it by far, and this is at our cost center. So I'm not speaking for anybody else, or what kind of deal anybody else can get.
But what I have noticed from the companies pretty much uniformly, and I think most people have seen this as well, is that the prices have been dropping as the throughput and the production have increased. So not unlike what we saw with the early times of next-generation Illumina sequencing, where we had these drops. We haven't had 10-fold drops, like orders of magnitude, but we've had two-fold drops, and I think that will continue on. So I don't think it's that unreasonable, certainly before 2030, that we're gonna be down to a fairly reasonable number in terms of generating the phased assemblies, at least the majority of the phased assembly. The storage cost issues, yeah, are significant. And I guess the only model, even though I'm not a huge fan of it because we do most of our sequence processing locally, is that I think most of it will have to move to the cloud at some point, so that people can access one set of data and people aren't wasting resources on redundant analyses that have already been done. And I don't know how NIH or anybody will pay for that model, because that's not necessarily a cheap model either. But yeah, in terms of the computational time for generating a phased assembly, I mean, it's not finished, but generating a 20 to 30 megabase phased assembly with a good compute farm can be done overnight. So it's not like this is taking weeks or months to do. Finishing, on the other hand, telomere to telomere. Yeah, that's another story. That's another leap, exactly. And I think building on that, we got another question from Jerry, which is: how critical of a role do you expect artificial intelligence to play in the ongoing analysis of genomes? And will that be part of the future of routine analysis for a typical laboratory? Yeah, since we don't really use AI, I'd be hard pressed to really comment on that. I think it's hard to predict what AI will even be 10 years from now. Well, I do imagine that if you imagine, for example, that we can do single cells.
And so the medical record of Eric Green has a snapshot at day 272 of your life. And then a whole bunch of other views of your genome in different cell types and different organs. I do imagine that we're gonna need some higher-level processing, right, than essentially just running assembly algorithms and finding variants. Obviously machine learning is gonna play a role in predicting regions. But in terms of artificial intelligence, I can imagine there's room for that in terms of actually processing such large numbers of genomes. But I couldn't provide specifics. Yeah, and we're speculating, that's what we're doing here. So I think that leads directly to another question, which I think proves, Evan, that your PhD advisor always stalks you, everyone. So it's from one David Nelson, who says. But there are many David Nelsons, so maybe a different one. I'm gonna throw up on Taylor. It's a different one. Love the emphasis on somatic variation. Single-cell seems challenging. It's a bold prediction for sure. We've pushed genome sequencing over targeted studies. Might there be a strategy to survey specific loci? Oh, I mean, that already exists, right? So surveying specific loci. I apologize that I didn't have time and we couldn't go into it. But there's a Read Until-based methodology developed by Oxford Nanopore Technology, where you can actually go and select your targets, not by any CRISPR design, but by actually, you know, in real time, computationally, in silico, reversing the pores so you only get the molecules that you're interested in. It's tremendous, right? This is what my clinical fellow is using to target unsolved Mendelians. He's just walking through the critical regions and then finding the second allele in unsolved cases. So yeah, I think that's actually a great point. I think I could imagine that even for markers, right?
If you think of certain areas that, you know, matter when they're somatically unstable, and I can imagine David's thinking about triplet repeat expansions, this would be something where you could identify it and maybe even get an assessment of when and where those mosaic events are occurring during development, right? Which I think would be really interesting. Yeah, I wanna just comment there as well, that one of the really cool things about the Nanopore platform, especially with this type of selective sequencing, is you capture all the heterogeneity, because each DNA sequence is coming from an individual cell. However, you don't have that connectivity with the rest of the genome, of where the other fragments are coming from. And so there might be a nice opportunity here to start thinking about some of the tricks we've learned for single-cell cloning to tag or mark individual single cells, and start to think about that in that context as well. But if you're just looking at one particular site of the genome, it's not just sequence, you can look at methylation heterogeneity as well. So let me ask a biology question, which is not something that we really covered in our strategic vision paper, which is from Irwin here at NIH. How do you think complete sequencing will affect the understanding, and I guess has affected the understanding, of origins of replication? Well, yeah, I think one way to think about this, right? A lot of the way that replication timing, at least recently, has been characterized is by applying some of these seq-based methodologies, right, to actually see early and late replicating regions of the genome. And I think one of the problems has been the fact that the genome has been incomplete. And some of the earliest and latest replicating regions could in fact have unusual, you know, genetic compositions, genetic architectures.
And so I think having a complete assembly, right, you know, of the genome that you're studying replication timing on is a pretty important place to start, right, in terms of identifying late and early origins. And obviously being able to look at that, I think actually having those genomes generated, you know, as the reference for what you're studying in your lab, is incredibly important for these types of questions, as opposed to always mapping everything against generic GRCh38, right, or 37. And I just want to add to that, this is a very exciting time because the regions that we're including are heterochromatic and have a deep literature of being late replicating. And there are expectations here for a certain type of cellular and genomic behavior. But it'd be very interesting to see if that's true, kind of in a very comprehensive way, or if we see some variation in late replicating versus early replicating and find new models where maybe we were being too generalized before. And I think that in these particular regions of satellite DNAs, and even in the larger segmental duplications, we have a failure with a lot of our epigenomic assays with short-read mapping; we can't do it precisely. And so what we need to do is develop new long-read methods to study these types of epigenomic signatures over the repeats. This is something where, when we saw the hypomethylated dip in the satellites, we could not find that using mapped short-read data. So moving into this new world with long-read data, and this is once again, a parapellate of gain. Once again, with these technologies moving to generate complete genomes, you're gonna see offshoots and new advantages and new programs and epigenetic tests popping up that also benefit from these types of innovations. Yeah. Chris, is there one more question, and then I'm gonna have a question for each of them. I don't know if there was one more quick running one.
So while we're speculating, we got a question from Ben who asks, are there regulatory or legal barriers that would need to be addressed to leverage the progress and possibilities you've described today? Oh, yes. Short answer: yes. I mean, this is actually a big issue, right? Karen can talk more to this, but one of the big issues that we're facing, right? As scientists, we wanna sample all the diversity in humankind, right? And that includes sampling every population, understanding the changes that have occurred in those lineages at the level of the population. Cause we believe that it will fundamentally improve associations and healthcare for all those individual populations. But there are populations, as we all are well aware, that actually have been ill served, right? By geneticists, by essentially comprehensive sweeps of trying to understand genetic variation. And I think one of the big challenges that we're facing, specifically this year and last year in particular, is how to pursue the scientific goals, but at the same time be very sensitive to the cultural issues, to the population issues. And if populations and cultures decide that they don't want their genomes to be sequenced and referenced, right? I mean, I think we have to, obviously, as people, scientists, whatever, recognize this and basically back off. And so a big part of what we're doing, and Karen has actually been leading this, is basically, how do we do this in the most ethical way in terms of actually moving into new populations and characterizing that variation? So as an example, the Melanesian story that I told you: when we put together that paper, we had worked with our colleagues, some of them from Melanesia, to do this. But before we released that paper, we actually submitted the paper to the Melanesian Research Council on the advice of one of our colleagues.
To actually have them review everything that we were writing, how we were saying it, and whether in fact it was appropriate and whether they had any issues with it. And to be fair, they looked it all over, they sent us some comments, and it felt like the right way to do it. But obviously there's much more that needs to be done, right, when dealing with populations like the San populations of South Africa or the Native American populations here in the States. Karen, you wanna add to that? Of course. I mean, I think Evan gave a really nice overview. I just wanna add to this with a different kind of layer, in the sense that this technology that we're talking about, making sure it's cheap and accessible and equitable, really sits at the interface with society in a different way than we've experienced with the genomic community before. The way we engage with police forces, forensics, politics, even to the point of one of the bold statements, of how we're gonna bring this into science fairs and teaching. Over the next 10 years, we're gonna get a wake-up call and we're gonna face a public that perhaps, as Evan said, may have mistrust of genetic research. So there's a huge amount of ethical and legal implications that need to be taken seriously and head on, and any project that is surveying and engaging with participants needs to view their project as just that: serving participants, for the greater good of what those communities can gain from being part of that particular project. And we need to onboard specialists with ethics and policy and legal insight so they can provide the necessary oversight moving forward, to ensure there aren't any missteps in these processes that are gonna be incredibly important. They've always been important. They've always been part of the Human Genome Project.
They've always been part of kind of these policy teams, but right now I think we, especially for this bold one, I mean, we're trying to be disruptive, and this goes outside of the people who specialize in complete genomics. This is soon going to be something where we need to think very carefully about how we move forward with ethics and legal and policy questions in the future. Just to add, I mean, the best way to do this is always to engage the communities in the assembly and sequencing of their own samples, including their own analysis. And that obviously is challenging at several levels, but this is something I think that we've gotten from the bioethicists, that we need to not just onboard the ethicists, but actually the communities who will be affected by this. I think that's the right way. It's a slow process for sure. Makes it slower. But I think it is the right way ultimately to do it. And just to add one more point, I think we have to think about every genomic aspect of this as being more global, and that we are contributing to this on that platform, engaging with other contributors. And I think that's where Evan was making the point as well, that there will be other experts in other countries, and for us to standardize and begin working together, I think that's going to be just transformative over the next two years. Well, I think your comments at the end were fantastic, both of you, on that last question. So in the interest of time, I'm gonna skip the question I was gonna ask. But what I wanna do in wrapping up is to start by thanking Karen, thanking Evan. You exceeded our expectations. We knew this would be a great way to start the series. Fantastic talks, great discussion. You are such leaders in this field and it's just wonderful to watch all of your many contributions. So keep going, we are very proud of you. We look forward to continued interactions.
I wanna thank Chris Gunter for helping with the questions and helping with organizing this entire series. I also wanna mention Susan Vasquez, who is my special assistant, who also is helping with many of the micro details of this series, and Gerald Osonomy, who is really helpful in our IT to create this infrastructure, and the rest of our communications team, who are helpful for getting the videos made and distributed to the community. We have over 500 people following this. I'm sure many more will view this after the fact as well. This is just the first of 10. I invite you to number two, which will be the same time on March 8th. And you can see it right here, March 8th. You can come hear Nancy Cox and Neville Sanjana of the New York Genome Center. And Carolyn Hutter will be the NHGRI moderator, and then you can again sign up, or we'll get you emails, so you can follow all 10 of these. So Karen, Evan, take care, enjoy the rest of your day, since you have three more hours of the day than I do. It's been great being here today. It's great seeing you and we'll see everybody on March 8th. Take care, bye-bye. Nice seeing you, nice talking with you. Bye, thanks so much.