So, I'm happy to introduce Robin Buell. I know what everybody's thinking: finally, a talk that's not about things with backbones. She's from Michigan State University. I took a look at her bio, and as far as I can tell, she did all of the plants. She's involved in every single plant genome project. Probably not, but that's what it looked like. She also runs a bunch of the plant genomic resources out there, including some additional databases, and then the much more memorably named SpudDB, which I had no idea even existed. So, thank you.
Okay. Yeah, so I think this might be the only plant talk, so I hope I can do the whole kingdom justice. A lot of the things I'll talk about have already been spoken about by Eric and Harris, and I think you'll see them echoed in some challenges we have in plants.
So a little bit about the first plant genome. This is Arabidopsis thaliana. It was the first plant genome done, five years after the first microbial whole genome assembly. So to me, this is a big achievement. That's when I started in genomics, just down the street. It's really 157 megabases, but the assembly is really only 119, and we only have the five centromere gaps plus a few other gaps. So I think we're almost perfect. Not quite.
Okay. So it's a model species for plants. That's why there was so much effort in it. It's a homozygous diploid with very limited repetitive sequences, which is one reason it was chosen. It's excellent quality. I'd have to go to the definitions, but I think it's just one tier below perfect, because this was old-fashioned BAC-by-BAC Sanger sequencing and essentially manually curated. A person actually looked at every base that had a problem. And there was a dedicated annotation project, so every gene model was examined by a person. And then there was a whole host of functional genomics support resources developed. There's a whole set of T-DNA insertion lines. These are tagged lines.
There was a dedicated full-length cDNA project, so people could then get open reading frames for functional genomics work, and then a stock center as well. And then there were also the species-specific databases, first TAIR and then Araport. And so that really took off in terms of advancing plant biology.
So now, 20 years after Arabidopsis, which was succeeded by rice in 2002, we have hundreds of plant genomes, but most of these are draft quality. Some are sort of below draft quality. All the major crops have a genome assembly, and a limited number of ecological models and evolutionary taxa have at least one reference genome sequence. Some species have hundreds, if not thousands, of re-sequencing datasets to look at. Something like maize has lots of it, as well as something like tomato.
But the problem is that there's a lot of complexity. So compared with animals, especially mammals, plants can be quite challenging to work with. The size is just all over the place, from 54 megabases for the carnivorous bladderwort to 149 gigabases in this flowering plant. There's a high degree of repetitive sequences. So there's transposable elements, and these are really rather recent in some species, so they're very difficult to get through, even with a long-read-based assembly. And then there's large centromeric satellite arrays. And so I highlighted here maize, which is the number one crop in the United States. It's 2.5 gigabases and greater than 85 percent transposable elements, with genes interspersed between these large transposable element blocks. And then bread wheat is a hexaploid. It is inbred, so that makes it a little easier. It's over 90 percent transposable elements, but it's 17 gigabases.
So going to what Eric was talking about, there's also a high degree of heterozygosity. So in potato, I work on that. It's great. There's one SNP per every 37 nucleotides.
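
To put the one-SNP-per-37-nucleotides figure in perspective, here is a minimal Python sketch of the arithmetic. It assumes, purely for illustration, that SNPs are spread evenly along the genome, which is not how potato actually behaves. The read lengths of 150 and 10,000 nucleotides and the names `snp_rate` and `expected_snps` are assumptions for the example, not something from the talk.

```python
# Illustrative arithmetic only. Assumes SNPs are evenly spaced along the
# genome, which is a simplification and not a claim made in the talk.
SNP_SPACING_NT = 37  # one SNP per 37 nucleotides, as quoted for potato


def snp_rate(spacing_nt: int) -> float:
    """Per-nucleotide SNP rate implied by an average SNP spacing."""
    return 1.0 / spacing_nt


def expected_snps(read_length_nt: int, spacing_nt: int = SNP_SPACING_NT) -> float:
    """Expected number of SNPs spanned by a read of the given length."""
    return read_length_nt * snp_rate(spacing_nt)


if __name__ == "__main__":
    print(f"Implied SNP rate: {snp_rate(SNP_SPACING_NT):.2%}")
    # Hypothetical short-read and long-read lengths, chosen for illustration.
    for length in (150, 10_000):
        print(f"{length} nt read: ~{expected_snps(length):.1f} SNPs expected")
```

On average that works out to a SNP rate of roughly 2.7 percent, so even a short read would be expected to span several distinguishing variants. That average is what makes haplotype assembly look feasible on paper.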
So you would actually think you could assemble the two haplotypes if you were working with a diploid, essentially as if it's an allopolyploid. But the problem is there'll be large, highly heterozygous regions, and then regions where they're nearly identical to each other, and the assembler just can't get through that. And so this leads us to the underlying problem in plant genomes, which is that there's a large amount of genome duplication: allopolyploidy, autopolyploidy, and paleopolyploidy. And so this figure here is actually taken from the Amborella genome project that Pam Soltis was involved in. Essentially, throughout the evolution, there's been at least one whole genome duplication, if not multiple whole genome duplications. And that really complicates the genome assembly process. And so this just highlights here, I think this is the pointer here, there, okay, that if you look at some of these species, there's been multiple duplications when you trace it all the way back through to the ancestral angiosperm.
But we also have high rates of gene duplication, and I like to use this example. So this is eucalyptol. It comes from eucalyptus species, and it's used in mouthwashes and cough suppressants and cosmetics. And what's shown here is a phylogenetic tree of terpene synthases. That's what's responsible for eucalyptol. They also produce a lot of volatiles that are used as perfumes and fragrances. And what's shown here are different species, including Chlamydomonas, which is an alga, moss, which is Physcomitrella, and all the way up here to eucalyptus. And so Chlamydomonas, an alga, doesn't have a terpene synthase. Moss only has two, but eucalyptus has 113. And that's in this sort of lime-green color right here. And when you look at this phylogenetic tree, you can see that there's this entire branch over here of the terpene synthase family, subfamily A1.
And you can see that there's just major duplications, and these are nearly identical, and they're also found in tandem. And this is just one example where you can have high rates of gene duplication within a species, and these are actually very hard to resolve.
Okay, so another issue is that a lot of times we couldn't actually even sequence, say, the representative organism. So in the early days of genomics, what happened is that people were really clever and they knew something about the biology, and they could manipulate it. So in plants, in some species, you can inbreed it. You can also make what's called a doubled haploid, and you can reduce the heterozygosity. You can also sequence the progenitor species to bypass some issues with polyploidy, and you can also sequence haploid tissue. And that's what's shown right here: when grape was first sequenced, they actually had an inbred grape line that was sequenced. Banana is a triploid. That's the commercial banana, but they sequenced a doubled monoploid. And loblolly pine, which I still believe today is the largest plant genome that's been sequenced, is very heterozygous, and they actually sequenced haploid tissue from the seed. And then potato is a tetraploid, and what was actually sequenced was a doubled monoploid to get past these barriers.
So the question would be, and Eric raised this, is there a representative organism that you could do? And mostly because of funding and effort, typically a single reference genome is done, and that's followed by Illumina re-sequencing that will reveal the variants. And so a few communities, not many, and you notice all of these are crops here, except for Arabidopsis, have multiple de novo assemblies of multiple genotypes. This is typically done when there's large communities, like for maize, soybean, Arabidopsis, and tomato. They're able to generate multiple de novo assemblies.
These are also, I'll tell you right now, all homozygous inbred diploids, so they're much easier to do. And so the question to ask is, is there value in getting multiple ones, other than just to do it and see what kind of variation there is? And my answer would be, most likely there is value. The point is that we have structural variation in plants, just like in other organisms, and this has been studied in several different plant species. The take-home point is that most of these structural variants tend to involve genes that are lowly expressed. They tend to be shorter than other genes. A large number of these are actually on the way to being a pseudogene, but others have actually been associated with adaptive responses, such as biotic or abiotic stress, or responses to the environment.
And I always like to show these examples, because when people have actually done functional genomics with some of these structural variants, they find that these genes are actually pretty important to biology. So in wheat, one of the structural variants controls flowering time. A really, really critical example is in soybean. There's been copy number variation of three genes, and when you get up to, like, three sets of those copies, you can actually be resistant to the soybean cyst nematode. Submergence tolerance in rice is controlled by a copy number variant. Herbicide resistance in Palmer amaranth arises because you've had large gene duplication of the EPSP synthase. And that's, of course, a major agricultural problem, that we have Roundup Ready weeds across the Midwest.
But I like to use this example especially, because this is an NIH conference: poppy actually is, I think, a really interesting example of how structural variants can be very important. And in this case, it's secondary metabolism.
So in poppy, there's one cultivar that produces noscapine, and that's an anti-tumor alkaloid produced by this one cultivar called HN, for high noscapine variety. And it's not produced in two other poppy varieties. Ian Graham's lab did differential transcript analysis and found that there was actually a cluster. It turns out it's a 440 kilobase cluster right here, where there's 10 genes, and they're unique to the HN poppy cultivar. They're arranged in this cluster, and this is what's responsible for noscapine biosynthesis. So if you hadn't sequenced this specific cultivar, you would never have found the genes responsible for production of this alkaloid. And when they looked at this and hypothesized how it actually came to arise, there was gene duplication and neofunctionalization that occurred at this cluster locus. But in other cases, there was a gene duplication that seemed to then be relocated to this region. So in this case, you have secondary metabolism, you're getting de novo cluster formation, and you're producing a compound that, of course, is important for human health.
So another thing about these structural variations is that the numbers you find can be very dynamic. So the question is, does tomato have a lot? Does maize have a lot? And these numbers are actually quite dynamic. This is because different numbers of genotypes are studied when people do these kinds of structural variation experiments, and also different types of germplasm are used. So you can look at different cultivars, as in the poppy example. You can look at landraces. And you can also look at wild species and make these comparisons. And then, of course, we know the methods employed will give you different numbers. Some of these are actually artifacts of the computational methods, how genomics and bioinformatics was done at the time.
So there's missing sequences in some of our reference genomes. There's annotation errors. I mean, we could talk all day about annotation errors. I thought it was only plant people that had to amplify the gene and sequence it before you do the experiment. And then the question is, what's an allele, right? So how do we call something the same and something that's actually different? So all of those things can determine how many structural variants you find.
But others are because of biology, and I like to emphasize this. This is a figure my graduate student came up with. What I'm showing here are five different plant species. Arabidopsis has a small genome and is an inbred diploid. Rice is a little bit bigger, about 400 megabases, and again it's an inbred diploid. Cucumber, I think, is about 300 megabases, and it's also inbred. Maize, which we know is 2.5 gigabases, is a natural outcrosser. And then potato is a polyploid and clonally propagated. And what's shown in the two different shades of green is what percentage of the genome was found to be variable in terms of structural variation, and what percentage of genes. And what you can see is for Arabidopsis, rice, and cucumber, even though they looked at 80, 50, and 115 different accessions from different genetic groups, either subpopulations or wild species, they really didn't find very much. But as soon as somebody looked at maize, where they looked at 103 different accessions that included elite cultivars, landraces, and wild species, there's actually an awful lot of structural variation. And in potato, this is work that we did. We only looked at 12 samples. These were landraces. And you can see we have an awful lot of structural variation here. And the point is that, depending on the reproductive mode, whether it's selfing or outcrossing or vegetatively propagated, or whether it's a polyploid, a species can actually tolerate more structural variation.
The transposable element composition is playing a big role. And all of these are impacting not just structural variation, but also genetic load. And so I think you could hypothesize that if you looked at an outcrossing polyploid, you're going to find even more.
So the other thing we want to ask ourselves is, are we sampling? And I think Harris brought that up. We're not quite sampling everything we could. So there's some major questions in plant biology, and I pulled this one out just as an example of how comparative genomics could actually answer them. In plants, we have different kinds of photosynthesis. We have what's called C3 and C4. It involves changes in the anatomy of the leaf. And it turns out that this has evolved many times across the history of angiosperms. This is in monocots right here, where you have examples of C3 and C4. And then over here in the dicots, you see that it's evolved many times, where you have C3 and C4, and you also have these intermediate species right here. So being able to do large-scale comparative analyses with sister species across the entire phylogeny would really help to understand how this evolved. What are the key mechanisms that are involved in this major change in photosynthesis, where you're concentrating CO2 very efficiently in the bundle sheath cells, and also changing the leaf anatomy as well?
So in terms of what we've actually sampled, I'm really pleased I came up with the same number. Because I went to NCBI, and I didn't believe any of the numbers when it said genome projects. I looked at some of those, and I was like, I think this is just re-sequencing data. So this is just an example of the phylogeny. Angiosperms, the flowering plants, are what we emphasize a lot, but there's definitely lower land plants right here that are important. And there's 350,000 species of angiosperms, about 400,000 species total in the plant kingdom if we're counting algae.
I suspect we have only about 400 genomes, and maybe 12 or so of those are, not perfect, but the next level below perfect. A lot of them are draft, or even, I don't know what the standard is below draft. A lot of them are really not very good. So I think it would be very helpful to get those definitions unified. So there's clearly a lot there, and there's certainly been a lot of focus on certain taxa over others.
So it's definitely getting easier. I don't want to make it sound like it's terrible. It's a lot easier to do this now compared to what was done with Arabidopsis, or even the rice genome. The logistics are easier. There's lots more trained people doing this now. I mean, I look at all these young people that got hired, and how much they know, and how much energy they have to do these things. It's just fabulous. Also, this came out a long time ago, but sequencing technology is democratized. You can do it on your lab bench with your laptop. And it's really, really changed how things have been done, and how they will be done. And it's definitely a lot cheaper now.
So here, I'd like to use this example. So this was on the cover of Nature Genetics in 2011. It was strawberry, and they actually had dipped it in chocolate, because the chocolate genome came out too. This is woodland strawberry. They did a diploid, because at the time, you couldn't do an octoploid. And it was really great quality at that time. The scaffold N50 was 1.3 megabases. It was absolutely fabulous. And now this just got published this year. This is octoploid strawberry. It's not just that it's bigger. It's an octoploid, and this is cultivated strawberry right here. This is 805 megabases, and it's chromosome scale. And so clearly, this is going to really facilitate work on strawberry. There's four species that went into making the octoploid strawberry.
So what are some challenges? First of all, it's the size. Imagine the cost.
If you're trying to do something like the bread wheat, that's just ridiculous. The repetitive sequences are a huge problem. So is the polyploidy, both current polyploids and also paleopolyploids, and the heterozygosity. And also, I haven't talked about it very much, but there's a lot of what I call genome heterogeneity, where you can have like 400 kb that are just missing in one of your haplotypes. And I think that's a big problem for the genome assemblers. I think they're mostly all designed for either human or mammalian or bacterial genomes, and so when we go to use them on a plant genome, we find that they sort of choke or they have problems. And I think this really is a big challenge for people to take on so that we can do this.
Single-cell omics methods are very challenging because we have things called plant cell walls, and we have to make protoplasts first. So that's, of course, problematic. And then also, the cell size is variable, and so that's a problem as well.
So the funding sources clearly emphasize crop species and their relatives. I mean, that's just the way it is. And that leaves most of the phylogeny unsampled. And then, and I really liked Eric's presentation on this, there's really a lack of community standards for the assembly, the annotation, and the germplasm. And so what we have is poor-quality genomes and a lack of documentation and provenance. And these really limit how we can advance the research community, because once the genome's done, you're not going to ever get any money to clean it up. And I started in Arabidopsis, where every night something got sent to NCBI. That was just automatic. And really, there's been a big lack of enforcement for data sharing.
And so we were asked to provide one thing about our vision.
So my vision, and I'm only going to go until 2025, is that we have accurate, inexpensive sequencing platforms, and that's going to allow us to get near-identical repetitive sequences resolved. We'll be able to identify similar alleles, homologs, and homeologs, and complete centromeres. And that we'll have a genome assembly algorithm that can accurately assemble a plant genome in just a few days on even a large compute cluster, and it's haplotype resolved in spite of these homologous or homeologous chromosomes and these repetitive sequences. And that there's a set of community standards that are actually enforced for the metrics and for the annotation. And then here, that there's actually a voucher specimen, a seed, or a clone, so that I could go and get that same genotype and do work. And that was all I had. Thank you.
Well, thank you, Robin. You very well represented the plant community, I think. So we are going to take a 15-minute break, so please come back to the room at 11:10. And I have one thing to say. For those of you in the back, we are aware that you cannot see. We are going to rearrange seats during the break to try and give some more space where people can see. And we do apologize. Everybody wanted to come, so it's wonderful to have you all here, but we also want you to be able to get something out of the meeting. So we are going to be working on that. All right. And the other thing I was supposed to say is coffee is here and bathrooms are this way. Yeah.
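
The talk quotes a scaffold N50 of 1.3 megabases for the woodland strawberry and asks for enforced assembly metrics. As a minimal sketch of how that one metric is commonly computed, the Python below uses the usual definition: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the assembly. The function name `n50` and the example lengths are hypothetical and are not taken from any genome mentioned in the talk.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    account for at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one sequence length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    # Not reached for non-empty input; kept so every path returns an int.
    return ordered[-1]


if __name__ == "__main__":
    # Hypothetical scaffold lengths in kilobases, for illustration only.
    example = [80, 70, 50, 40, 30, 20, 10]
    print(n50(example))
```

N50 on its own says nothing about correctness or completeness, which is one reason a shared set of standards would need more than a single number.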