So, this is going to be a very well-organized moderated session, since I just met my co-moderator here. For those of you who don't know me, I'm Eleanor Carlson. I'm piloting this 200 Mammals project, which after Harris' talk feels incredibly small. I'm thinking I may need to scale up by several orders of magnitude. Do you want to say something?

Sure. My name is Sophie Salama. I'm at UC Santa Cruz and work on a variety of comparative genomics projects there. I'm really excited to be here.

Oh, yeah, and I'm at the UMass Medical School and the Broad Institute; I forgot that part. Our first speaker here is Eric Jarvis, who is a professor at the Rockefeller University. He got his start, I believe, in the neurobiology of vocal learning. At some point he got frustrated with the quality of his bird genomes. And then, and you'll have to tell me how this happened at some point, he somehow ended up being the fearless leader of the Vertebrate Genomes Project. I don't know all the details, but it's an incredible project that's really trying to get high-contiguity, high-quality genomes for a bunch of vertebrate species. And it's going to empower a lot of science that's coming up in the next few years. So, Eric, we'll let you go.

Well, I'm just going to add that Eric is a tremendous dancer, and so I think we should have an interpretive chromosome movement at some point. I'm going to Paris up the stage with that animation.

All right, well, thank you for that introduction, and yes, I'm going to get into the neurobiology of dance at some point in the near future, and the genomics of it. So is this going to project up for you? Okay, so how did I actually, oh, thank you, how did I actually get involved in helping to lead the Vertebrate Genomes Project, a project of the G10K mission? Steve O'Brien over there asked me to. And he made it a hard offer to refuse. Okay, good.
While we're getting that set up, I'll go ahead and get started and tell you some of the background story here. Yes, my area of research is understanding brain function, and particularly how the brain controls vocal learning as a model for spoken language. There are certain animals out there that can actually imitate sounds like we do, like songbirds and parrots. And so, as many people here in the room know, and as I've talked about over the years, I have always been interested in the genetics of that trait, the genetics of language, and in doing it from a comparative approach, you know, comparing genomes across many species: those who have it and those who don't, like humans and chimps. And when you start comparing these genomes and you're coming up with interesting candidate genes and the students in your lab start to study them, you begin to discover that the assemblies aren't quite accurate. And since they're not accurate, the students spend six months, sometimes a year, trying to clone the accurate structures of the genes. And then imagine it's not just one species, but five or six or 10 species you're comparing. So you've got to PCR out this gene in 10 species, and then you discover, wait a minute, your trait's controlled by 50 genes. So you've got to PCR out 50 genes from 10 species to actually get the right candidate genes. It's frustrating. And that's how I got involved in trying to deal with and help foster high-quality genome assemblies. Okay. So, although I'm focused on the brain and on species that can learn how to imitate sounds, which is a small piece of that 1.5 million species that Harris talked about, and thank you, Harris, for the shout-out, what we're learning in terms of trying to generate high-quality assemblies is useful for all of biology. Eleanor asked that I talk about something that's relevant to all of biology, and even though we'll talk about vertebrates, DNA is DNA to a certain degree.
So to cut right to the chase, in case I don't have time to cover everything, I have to define upfront what this particular session is about. What is a reference genome? What I and other people consider a reference genome, at least an aspired reference genome that we may not be able to achieve at this point in time, is a genome assembly that is complete, zero gaps, all right, that is accurate in all its nucleotide calls, base calls, and also in genome structure, and that is representative of the species, which means more than one individual sequenced. All right. So how can we get there? I'm going to tell you a little about the trajectory that I've been taking. I was first involved in a large-scale genome project called the Avian Phylogenomics Project, in which, in collaboration with BGI and many others in G10K, we produced genomes of around 48 bird species, mostly with short reads. That led to a series of publications, many up here, that helped advance biology, including my favorite questions on vocal learning, or the family tree for birds. But here is where, after we had these genomes, my students started to study some of these genes functionally, and we had lots of problems. Around this time, Steve asked me to help with the leadership of the G10K project, which led to this Vertebrate Genomes Project, in collaboration with the B10K bird group, the Bat1K group and the Earth BioGenome group, whose mission it is to produce high-quality genome assemblies of all the different vertebrate species, all 66,000, in different phases, like phase one for all orders, and then families, and so on.
And here, to say this verbatim, the goal of the Vertebrate Genomes Project is to generate at least one high-quality, error-free, near-gapless, chromosomal-level, haplotype-phased, and annotated reference genome for all extant vertebrate species, and to utilize those genomes to address fundamental questions in biology, disease, and conservation, so that my students don't have to suffer with poorly assembled genes. And neither does the rest of the community. We decided to do this not just as, as some people say, stamp-collecting genomes, but with a particular question in mind for phase one, all orders. We're using this family tree of birds that we generated at genome scale, and other trees generated with nuclear DNA for mammals. We found that what most people consider orders of animals have some common ancestor that dates back to the time of the dinosaur extinction, the last mass extinction. And so we used that criterion to select species, and we go from 150 so-called orders to 260 by using this criterion. We believe that once we have all these 260 species sequenced at high-quality reference level, we will be able to use them to learn something about the last mass extinction, who survived, who didn't, and what kind of genomes they had, using ancestral inference, to help inform us about what's going on with the current mass extinction induced by humans, the sixth mass extinction. And to cut to the chase, because we've done a lot of work on the 14 species you heard Harris talk about, and we have many more that we're about to announce in a few weeks: the two biggest take-home lessons learned in trying to create these high-quality reference genomes are, first, that read lengths need to be longer than the actual repeats in the genome to get a good quality assembly, particularly within the same haplotype. And the other is that the two haplotypes, the maternal and paternal chromosomes, are just one giant repeat.
And that causes a big problem, bigger than many of us realize, for generating these high-quality assemblies. And now to show you some of the data: we started out with two different bird species, some of my favorites, vocal learners. This is what happens when you ask someone to help lead this effort: you take vocal learning species, a hummingbird and a zebra finch. For the zebra finch, fortunately, there was a prior Sanger-based reference. And we convinced ourselves and a lot of companies to apply their favorite technologies to one or both of these individuals here. And so you have one animal with all these different technologies applied to it for the actual sequencing or scaffolding approaches: short reads versus long reads, long-range information here. Many of you know about these. And we did a lot of different assembly comparisons with different algorithms. The first lesson is up in this quadrant here. This is the NG50, or really the contig N50 value, continuous sequence without gaps. And the first lesson you learn is that no matter what method we tried, what algorithm we tried, long reads always gave you a lot more contiguity than the short reads here. And we really beat this to death, because a lot of people, a lot of companies, were promising that the short reads could get us there. They just never have. And anybody who can figure that out is going to get a Nobel Prize, I think. The second lesson we learned here is about all the scaffolding tools that try to link these contigs together into chromosomes. They do a lot, but long reads versus short reads doesn't make a difference there. What really makes a difference is the range of the scaffold links, with Hi-C, 3D chromosome interaction maps, giving you the longest scaffolds, ones that are chromosomal length, here matching the sizes of what we see in the hummingbird karyotype. The second take-home lesson I told you about already, about the phasing.
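For readers who want the contiguity statistic from this part of the talk made concrete, here is a minimal sketch of how N50 and NG50 are commonly computed from a list of contig or scaffold lengths. The function names and the toy lengths are illustrative assumptions, not taken from the VGP pipeline.

```python
def n50(lengths):
    """Length L such that pieces of length >= L hold at least half of the
    total assembled bases (illustrative helper, not a VGP tool)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def ng50(lengths, genome_size):
    """Same idea as N50, but measured against an estimated genome size
    rather than the assembly's own total length."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= genome_size:
            return length
    return 0  # the assembly covers less than half of the genome


# Toy example with made-up contig lengths (total 300 bases).
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))        # 70: 80 + 70 reaches half of 300
print(ng50(contigs, 400))  # 50: 80 + 70 + 50 reaches half of 400
```

NG50 is the fairer number when comparing assemblies of the same genome, because an assembly cannot raise it simply by leaving sequence out.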
Here, in work I did with Jonas Korlach, we found that whether it's long-read assemblies or short reads, Sanger-based, or whatever, if you don't phase your haplotypes, you get errors. In this case, this is an interesting gene called DUSP1. It's regulated by singing behavior, the white signal here being the mRNA product, in the song learning nuclei of all vocal learning species that we've looked at, but not by movement behavior in the surrounding motor pathway. So there's something that mutated in the regulatory region of this gene that allowed it to be regulated in speech-like areas of these birds, which we don't see in a chicken, and which we don't think we'll see in a monkey. And so we have been trying to take the Sanger-based or other assemblies and study this region, and we find there's a bunch of repeat sequences in the promoter region that we think are responsible for this specialized regulation in these speech-like areas. And we've really had a hard time assembling or putting this together from any of the assemblies of those 48 species. We found, once you start to phase the haplotypes in the assembly, that these repeats were accidentally strung together as one haplotype, where they really belong to different types of repeats in each haplotype. The assemblers just had a hard time distinguishing repeats that are actual real repeats from repeats that are actually divergent haplotypes. And once we did that, we were getting the accurate regulatory region structure for the specialized regulation of this gene. I like to look at this as a puzzle. Here is a puzzle. You break down the genome into many pieces. If you have short pieces, it's hard to fill in some of these gaps, or repeats like one wing versus the other wing. You can't decide if this should go with the left wing or the right wing, all right? With long reads, this gets easier. But it's not just long reads. You need to do that twice for diploid genomes.
And once you do that, then you start to get more accurate assemblies. So these two lessons have been key among what we've learned in the last four years. In that time period, as you heard in the previous talk, the VGP group came up with a set of metrics that tried to define what we need in order to do the biology that we want to do, so my students don't have to clone the genes over again. Using this metric equation here, you need a contig N50 that's 1 million base pairs or bigger, a scaffold N50 that is 10 million base pairs or bigger, at least two pieces of evidence to identify whether your gene structures are correct in chromosomes, and a QV value of 40 or greater for the base calls, meaning no more than one error in every 10,000 base pairs. We have not put a phasing metric in that equation yet. But with these 14 genomes that we sequenced in the past year, we're learning about more metrics that we need to quantify to get these genomes to be high quality. This is a table that we're putting together for an assembly paper that Arang Rhie and others are preparing within the VGP G10K assembly group. We're coming up with six quality control categories: continuity; correctness in the base calls, that is, the accuracy of the base calls; correctness in the actual organization, in the structure; phasing metrics; functional completeness, like with BUSCO gene scores, which are known functionally relevant genes across vertebrates or all organisms; and chromosomes, whether scaffolds are assigned to autosomes or sex chromosomes or mitochondrial genomes. It's a long table. I don't have much time to talk. But we might even be going beyond the metric that we call the VGP metric, which lots of people are using, to 14 quality metrics and four different genome assembly quality levels, from draft to reference to high-quality reference to the perfect genome, which I started out with in the first slide. We're not there yet.
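The arithmetic behind "QV 40 means no more than one error in every 10,000 base pairs" follows from the standard Phred-style definition, QV = -10 log10(error rate). Below is a minimal sketch of that conversion together with a check of the minimum thresholds quoted in the talk. The function names, the `evidence_lines` parameter and the example numbers are hypothetical, and this is not the VGP's own evaluation code.

```python
import math


def qv_from_errors(errors, bases):
    """Phred-scaled consensus quality: QV = -10 * log10(errors / bases)."""
    if errors == 0:
        return math.inf  # no observed errors
    return -10 * math.log10(errors / bases)


def error_rate_from_qv(qv):
    """Inverse of the formula above: expected errors per base."""
    return 10 ** (-qv / 10)


def meets_quoted_minimums(contig_n50, scaffold_n50, evidence_lines, qv):
    """Check the four thresholds as stated in the talk.

    evidence_lines is a hypothetical count of independent pieces of
    evidence supporting the chromosome-level structure.
    """
    return (
        contig_n50 >= 1_000_000         # contig N50 of at least 1 Mb
        and scaffold_n50 >= 10_000_000  # scaffold N50 of at least 10 Mb
        and evidence_lines >= 2         # at least two pieces of evidence
        and qv >= 40                    # at most ~1 error per 10,000 bases
    )


# One error in 10,000 bases corresponds to QV 40.
print(round(qv_from_errors(1, 10_000), 6))
# Made-up assembly statistics, for illustration only.
print(meets_quoted_minimums(4_500_000, 70_000_000, 2, 42.0))
```

Because the scale is logarithmic, each additional 10 QV points means ten times fewer errors: QV 50 would be one error in 100,000 bases.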
The only place where we think we have perfect genomes is something like mitochondria or bacteria. You can actually sequence long reads through a single mitochondrial sequence now; that's what we're finding in these vertebrate sequences. But to get to the near perfect, or close to it, and I won't even say near perfect, the best we can do right now is not to take one single technology, but to combine multiple technologies together. In this case, long reads to get your initial contigs, then 10x linked reads to scaffold them together into initial scaffolds, followed by longer-range Bionano optical maps to go further, and then finally Hi-C to get arm-to-arm chromosomal-length scaffolds. And once you get that, you can use these Hi-C maps, shown here. Some of you have seen these Juicer plots, before curation or after manual curation, where one box here represents the Hi-C reads mapped to one scaffold. And if you don't see any other scaffolds mapping to your scaffold here, in the right plot here, is it your right? Yes, OK. What that means is that this is an arm-to-arm chromosome with no other scaffolds matching to it. Even though there are gaps in there, it's an assembled, complete scaffold that we think represents a chromosome. And this is how we're now defining chromosomes. There's some debate about whether we should call these chromosomes or not, but they're chromosomal-level scaffolds that we can actually identify in this way, similarly to what FISH karyotype mapping does. And so with that, we've noticed with the Hi-C mapping for the zebra finch genome that what the FISH mapping said was chromosome 1 and 1B, the Hi-C mapping is saying actually belong together as one chromosome. And instead of having 35 chromosomes, we narrowed it down to 33 chromosomes in the zebra finch. So here is the Sanger-based reference that most of the scientific community uses: 35 chromosomes. In the new VGP assembly, 33.
The gaps per chromosome ranged from 33 to 7,000 gaps per chromosome. We're now down to between one gap, across the centromere, and 25 gaps per chromosome. Unassigned scaffolds, that is, scaffolds not assigned to chromosomes: the previous assembly has 35,000 of them. We now have 101 of them, some of which we think are actually artificial haplotype duplications. The gaps in the unassigned scaffolds: 22,000 of them before, and we have 115 left, and so on. So we're not at that perfect reference that I told you about in the beginning, but we're getting there. And what's preventing us from getting there? There are two things, and here's one of them. Here is what we call the primary contig or haplotype in black and the alternative haplotype in blue. And then this line here, this gray line, is the level of heterozygosity, the divergence between the two haplotypes. We find that the black and the blue get closer to the true value, in this blue line here, the higher the heterozygosity. In this case, having higher heterozygosity is good, because it means you can figure out who's mom and who's dad. They're different, OK? If mom and dad are so similar, it's really hard to figure out who's who in the child. But sometimes this heterozygosity causes the assemblers to get confused, where the alternative haplotype isn't really separated out: mom and dad are basically strung together where they should be separated. That's basically what happens here. And so we have to retool the assemblers not only to handle haplotypes, but to try to figure out when the divergence between two sequences is a haplotype divergence and when it's a real repeat divergence. OK, so by the way, 99% of these haplotype duplications in an assembly have a gap between them. And roughly half, up to half, of all gaps in the prior assemblies, we're discovering, are a result of these haplotype issues.
And then the repeats themselves, not the haplotype repeats, but the real repeats within a haplotype: the bigger the number of repetitive elements in a genome, in the light blue here, the harder it is to get the contiguity we need for these assemblies, in black here. And then there are some of the oddball genomes. Marmoset genomes have chimeric DNA. Or the platypus has 10 sex chromosomes, as opposed to two. And that's crazy. The lamprey genome, for those who don't know, undergoes somatic recombination. So it cuts up the genome and pastes things back together, so different cells have different genomes in them. I mean, that's crazy. And so these are hard problems, which I think we can crack. I would say the best assembly approach we're coming out with now, coming from Adam Phillippy's group with Arang Rhie and Sergey Koren, is what we call a trio approach, where you take mom and dad, here in the case of the zebra finch, and you sequence short reads for them, shown above, and you then use those short reads to separate out the long reads of the child, and you then assemble those two haplotypes as two independent genomes. And there you're actually reducing this haplotype repeat issue by quite a bit and improving the assembly quality. And here are some examples on a zebra finch and a zebrafish. And here are some BUSCO gene scores. We're getting 93% of BUSCO genes identified on the non-trio long-read assembly, with 5% gene duplication. When we actually do a trio-based assembly, that 5% gene duplication goes down to 1.4%, meaning those were artificial haplotype duplications. Many people are publishing papers on gene family expansions that should not be there. And the zebrafish genomes are highly repetitive: 20% BUSCO gene duplication, down to 3.5% with a trio-based assembly, in collaboration with the Sanger team. And so we're now also looking into false gene losses with a group in Korea. I don't have lots of results to show you today.
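The trio idea described in this passage, using the parents' short reads to separate the child's long reads before assembly, can be sketched in a few lines. This is a toy illustration under strong simplifying assumptions (exact k-mer matching, no reverse complements, no sequencing errors, tiny made-up sequences); all function names are hypothetical, and real trio-binning tools are considerably more careful.

```python
def kmers(seq, k):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """k-mers seen only in the mother's reads and only in the father's reads."""
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    return mat - pat, pat - mat


def bin_read(read, mat_only, pat_only, k):
    """Assign one of the child's long reads to a parental haplotype by
    counting which parent's private k-mers it carries."""
    read_kmers = kmers(read, k)
    m = len(read_kmers & mat_only)
    p = len(read_kmers & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


# Toy data: the parents differ only at the end of the sequence.
mat_only, pat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTACGTCC"], k=4)
for read in ["TACGTAAG", "TACGTCCA", "ACGTACG"]:
    print(read, bin_read(read, mat_only, pat_only, k=4))
```

Each bin of reads would then be assembled separately, so the assembler never has to decide whether two similar sequences are a repeat or the two haplotypes. Reads from regions where the parents are identical stay unassigned, which mirrors the point made earlier that higher heterozygosity makes the haplotypes easier to tell apart.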
I'm just showing you that there are eight types of false gene losses that we're finding in the old assembly versus the new, like the gene being totally not there, or split between two exons, or with false coding sequence in the middle, or stop codons, and so forth. And we're seeing these corrected as well. Not that the new genome assemblies are really good, the best in the world, but they're corrected. What about your transcriptome? If you're doing RNA-seq, epigenetic studies on histone acetylation, transcriptome sequencing, well, mapping back to the zebrafish genome, we're getting half of our reads mapping back to well-annotated, well-structurally-organized genes. Here in green is everything that's unassigned. With the new assemblies, this is what it looks like: almost 90% of our transcriptome data mapping back to the assemblies now. And so to end off here: what is a reference genome? Our aspired reference is, yes, complete, no gaps. For some of these bird genomes, we're down to one gap. That's great. Accurate base calls: these long reads have some errors in them, and we're figuring out ways to fix them. And the structure is now more accurate. And what about being representative of the species? Well, it is still expensive to generate these high-quality genomes, but at least with the vertebrate genomes model approach, we're generating haplotype assemblies. So we're at least getting two individuals represented, in these two green lines here. And for the zebrafish, we've already seen, between haplotypes, big giant insertions in one haplotype versus the other, and big inversions in one haplotype versus the other, not just SNPs. And so we already have two different representatives here that are representing the species. And this VGP effort, in collaboration with the EBP and others, is a group of over 200 participants now. Here I'm just giving credit to the assembly group, but I don't have time to go through all the names.
I'll leave this slide up here, but a number of them are sitting in this room. And I want to thank Steve and Harris and others for inviting me into the G10K group and to actually come here and talk. Where's Taylorin? Yes, OK, and thank you also for all the advice you've given us on comparative genomics. And thank you for inviting me here. I'll stop.

All right, thank you, Eric. Because we're behind schedule, I think we're going to table questions, and we have the discussion section.