Some of the nitty-gritty of actually handling a metagenome, so total community DNA sequence, moving beyond single markers: the assembly, the binning, and the extraction of genomes from those metagenomes so you can look at a specific organism in more detail. So let's go through some of the background. The techniques I'm going to talk about today, and that you're going to use in the tutorial following this, are genome-resolved metagenomics, and I'm going to quickly go through exactly what that means so that we're all on the same page.

With genome-resolved metagenomics, we start with a total mixed microbial community. In this case, this is a scanning electron micrograph of the microbiome on your tongue. We all just had lunch, so maybe that's not too gross. I work with landfills, so this is what some of my microbial communities look like. This is landfill leachate. It smells worse than it looks. In general, all you need is a mixed microbial community, a total sample from your environment of choice. The basic protocol is to extract the DNA from all of the genomes of all of the organisms present and sequence it. Typically this is next-generation sequencing, Illumina HiSeq or NextSeq, so you're getting short reads, lots of them. Then, for genome-resolved metagenomics, you do require an assembly step. You've heard earlier in the workshop about ways of working with total metagenomic reads to annotate functions in the whole community, but if you're going to bin genomes, you need to assemble your metagenome. So you assemble those reads, annotate them at that stage or later, and then bin the fragments. Binning is the process by which you identify which scaffolds within your total assembly came from the same original microbial population, and sequester them based on a number of different factors that we're going to talk about. Depending on the complexity of your community, the amount of sequencing that you have, and the amount of time you want to put into this (and that last factor, time, is really the major component), you can get all the way to a closed and curated genome.

In general, with binning you're usually able to generate reasonably good quality draft bins for 15 to 20 percent of the total community you've surveyed. By reasonably good quality I mean at least 70 percent complete and less than 10 percent contaminated. For my purposes, that's a reasonable threshold at which you can start looking meaningfully at the metabolism of that organism. Whereas if you have a genome that's really clean and it's from the organism you're really interested in, but it's 23 percent complete, there is only so much you can do with it, because for any missing pathway you'll never know whether it's genuinely absent from the genome or you just missed it in your sequencing. So we tend to work with that threshold. Community complexity drives how effective this process is. Human microbiome samples tend to be very low complexity, falling somewhere in the tens to hundreds of more abundant organisms; you'll have rare taxa for sure, but the less complex your community, the more effective the process is for generating genome bins.
When you get into soils and sediments and, it turns out, landfill leachate, which are highly complex, it can take a lot of sequencing to have the depth required to assemble reasonable portions of these systems and then bin them. But depth of sequencing is something we're getting pretty good at. From the time I inherited this slide, about a decade ago, to now, the picture is really encouraging: we're able to generate draft genomes of reasonable quality even from the most complex of environments, and the less complex environments are now ones we can almost fully genomically resolve. You can sequence them deeply enough to identify even differences between strains. The tutorial we're going to do after this actually works with a preterm infant gut, so a very simple community, and looks at binning almost identical strains that showed differential abundances across this baby's timeline. So this is not straightforward, but it is getting very reliable, in that you can get good quality genomes from just about any dataset.

I'm going to talk you through the steps, starting from assembly and some of the terminology around it. This may be review for some of you, but it might not be for others, so I'm going to go through it. When you start with your sequence reads, to assemble them means to align them so that they overlap each other, which leads to one consensus contiguous sequence of DNA that those reads identify. There are different algorithms for this, and they're all pretty good. That contiguous sequence is called a contig. If you sequenced using paired reads, where the two ends of each input DNA fragment were sequenced, then your read pairs can be used to build scaffolds. What I mean by that is: say your sequencing library has 500 base pair inserts, and your reads, the forward read and the reverse read, are 150 base pairs long. You know they came from the exact same piece of DNA. If contig A contains the forward read from a specific fragment in a particular orientation, and contig B contains the reverse read in its orientation, but you don't actually have any assembled reads covering the space in between, then the scaffolding step of the assembly process says: we know these two contigs belong next to each other, they order and orient this way, and we usually have much more than one read pair's worth of evidence that they should indeed face each other and be connected. And because we know the insert size, we also know how much sequence is missing in between. That's often something like 10 or 15 nucleotides, because you have more and more reads overlapping and they just don't quite span the gap, through the vagaries of sequencing. So a scaffold takes these two contigs, each supported by contiguous overlapping sequence data to a certain depth, and connects them as a single unit for you to work with, putting Ns in for those 10, 15, or couple hundred nucleotides that we know are missing but whose sequence we don't know. So if you've ever run into contig versus scaffold, that's what's going on.
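Just to make the read-pair arithmetic concrete, here is a minimal Python sketch of how a scaffolder might estimate the gap it fills with Ns, given the library insert size and where the two reads of a pair land on their contigs. This is not any particular assembler's code; the function, its parameters, and the numbers are all illustrative.

```python
# Minimal sketch of gap estimation from paired-end evidence.
# All names and values are illustrative, not from real data.

def estimate_gap(insert_size, fwd_read_start_on_A, contig_A_len, rev_read_end_on_B):
    """
    insert_size         : library fragment length (e.g. 500 bp)
    fwd_read_start_on_A : 0-based start of the forward read on contig A
    contig_A_len        : length of contig A
    rev_read_end_on_B   : 0-based end of the reverse read on contig B
    The fragment runs from the forward read start (on contig A) to the
    reverse read end (on contig B); whatever part of the fragment is not
    covered by either contig is the unsequenced gap, written as Ns.
    """
    bases_on_A = contig_A_len - fwd_read_start_on_A  # tail of contig A inside the fragment
    bases_on_B = rev_read_end_on_B                   # head of contig B inside the fragment
    return insert_size - bases_on_A - bases_on_B

# One read pair: the forward read starts 165 bp into a 500 bp contig A,
# the reverse read ends 150 bp into contig B.
gap = estimate_gap(insert_size=500, fwd_read_start_on_A=165,
                   contig_A_len=500, rev_read_end_on_B=150)
print(f"estimated gap: {gap} bp")  # -> 15 bp of Ns between the contigs

# A real scaffolder averages this estimate over many supporting read
# pairs before joining the two contigs with that many Ns.
```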
Scaffolds are useful because they connect marker genes along larger fragments, and they can be helpful for binning: the longer the fragments you're working with, the more robustly they tend to bin, because they carry more information and give you a better average for any of the metrics you're using (we'll talk about those in a minute). That's your standard assembly protocol. But assembly is a little bit up for grabs right now; it's in a really exciting phase because of long read sequencers and their impact on how assembly works. There are a couple of different ways long read sequencers can be used. One is de novo assembly, pictured here, where all the little red Xs on these long reads are errors. All of the long read sequencing platforms right now have a much higher error rate than the short read, next-generation sequencers; I think for PacBio it's sitting at about 85% accuracy, which is a lot of errors. But if you have enough coverage and the errors are randomly distributed enough, you can identify the true answer for each of these pieces and get a good consensus sequence out. Alternatively, you can have relatively low long read coverage and use it for scaffolding. Scaffolding is just ordering and orienting contigs, and if you have long reads that span across breaks between contigs (here, these are three different pieces of consensus DNA), they can help order and orient them and find overlaps the algorithms weren't finding.

For metagenomes, I would argue that long read sequencing hasn't been heavily applied yet. The main reason is that in a metagenome you don't have a clonal population of organisms, so you don't have one right answer; you have strain variants within the populations you've sequenced. Right now the fear is that you will incorporate sequencing error as strain variation, or remove strain variation as sequencing error, inappropriately. So I haven't seen a lot of adoption of long read sequencing except for relatively simple communities. In systems like soils and sediments, and for me landfills, we don't know the right answer well enough to distinguish a meaningful change in the DNA sequence from a sequencing error. The other major advantage of long reads is that they can span repeats. With short reads it can be really difficult: some reads are completely encapsulated within a repeat, here and here as well, so which position do they belong in? How do you appropriately score coverage across these regions, and how do you identify exactly where they should place and where the repeat should read through to? Long reads can span across that region and really clarify that this repeat is anchored within these sequences, and we have good assembly up to that point. So they can be really, really powerful, and I personally am really excited about what we might be able to do in metagenomics with long reads. I'm waiting for the bioinformaticians to come up with algorithms to use them robustly; I just haven't seen that yet, so I haven't moved forward. One thing to be aware of with assemblers (and the rest of what I'm going to talk about is short read data) is that most short read assemblers have a specific window of depth of coverage for which they're best tuned.
They iterate along k-mer frequencies to identify which pieces should overlap. Here, this is a coverage of about three or four: anywhere I draw a line down this region, there are three or four reads supporting that sequence. Most assemblers have a preferred depth of coverage of roughly 15 to 25x, maybe 10 to 30x for the good ones, even if they're designed for metagenomes. Within your metagenome, you may have organisms that are 200 times more abundant than others, or one swath of organisms at one abundance and another set at a different abundance. So when you're assembling, one thing to take into account is whether your assembler is designed to handle the actual community profile you have, which you might know from 16S rRNA gene amplicon data, or, if there have been earlier studies at that site, you might have a sense that the community is dominated by one organism, or that it's 10 abundant organisms and 2,000 very rare ones, or that everything is at less than 1% abundance and the whole thing is just a long tail of rare organisms. One way around this quirk of assemblers is to subset your data so that you hit the assembler's preferred window, iterating through your different abundance levels. If one organism is very dominant, 2% of your sequence data might be enough to bring that genome into roughly a 20x coverage range.

When you do this kind of thing and assemble less of your data (a 5% subset, a 20% subset, a third of your data), you get less out: the total amount of data assembled and the total length of the assembly is smaller, which makes sense. But, and this illustrates my point, who you get from that assembly can vary wildly. This is a dataset I worked with, a sediment community that did have two or three organisms at really high abundance compared to the others. When I assembled only 5% of my data (and this was one Illumina HiSeq lane, so a lot of data for this sample), you can see that this Thaumarchaeon and this candidate phylum organism made up the vast majority of the assembly, which in total length was way less than 1% of what the assembly of all the reads looked like. There was much less information and many fewer organisms represented, because anything at low abundance had maybe a few reads make it through, but not enough to assemble. Interestingly, I then removed those reads, subsampled again, removed those reads, subsampled again, and built up to the total dataset, and the Thaumarchaea here are actually different organisms. If I just took the entire dataset and assembled it, I couldn't detect the five most abundant organisms at all, because they were at such high coverage that the assembler didn't handle them well and they ended up fragmenting. In one of the five cases it was because there was another strain breaking the assembly; when you removed the lower abundance strain, it became a cleaner assembly. For the others, I don't know why they weren't assembling well, but I've seen this over and over again. So it's something to consider: if you know your system is dominated by a high abundance organism that you're interested in and you're not getting a good assembly from your metagenomic data, consider assembling less of your data. It's a bit counterintuitive, but it can work really well.
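If you want to try that subsetting trick, here is a hedged sketch of the arithmetic plus a simple way to randomly subsample a FASTQ in Python. The coverage numbers and file names are made up, and in practice you would likely use an existing tool for the subsampling step; this only shows the logic.

```python
# Sketch (illustrative numbers, not from the lecture's dataset): pick a
# subsample fraction that brings a dominant organism into the assembler's
# preferred ~20x coverage window, then randomly subsample a FASTQ to match.
import random

def target_fraction(current_coverage, target_coverage=20):
    """If the dominant genome sits at ~400x in the full dataset,
    keeping ~5% of the reads brings it to ~20x."""
    return min(1.0, target_coverage / current_coverage)

def subsample_fastq(in_path, out_path, fraction, seed=42):
    """Stream a FASTQ and keep each read with probability `fraction`.
    For paired files, run with the same seed on both files (reads are
    in the same order) so the kept pairs stay in sync."""
    rng = random.Random(seed)
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # 4 lines per FASTQ read
            if not record[0]:
                break
            if rng.random() < fraction:
                fout.writelines(record)

frac = target_fraction(current_coverage=400, target_coverage=20)
print(f"subsample fraction: {frac:.2%}")  # -> 5.00%
# subsample_fastq("reads_R1.fastq", "reads_R1.sub.fastq", frac)  # hypothetical files
```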
I knew that these were different organisms because I had binned my data. So let's talk through some of the options you have for binning metagenomic data. Right now, most binning algorithms take into account some or all of these factors: nucleotide composition; the phylogenetic affiliation of the genes, the taxonomic signature that comes with a genome; read depth, or coverage, which is a proxy for the abundance of the organism in your original dataset; and coverage patterns across samples.

Nucleotide composition essentially gives you a fingerprint of what your organism's genome looks like. It relies on the fact that different microbes have different codon usage; some will have tRNAs for some, but not all, of the lysine codons, for example, and you'll see that signature reflected in their genome. It's most common to use tetranucleotide composition, a sliding window of four nucleotides. So this would score one for ACGG, this would score one for CGGC, and every time you find ACGG again, as you slide along one nucleotide at a time, you add it to your frequency count and build up a frequency table. That gives you a lot of different dimensions to look at; it's a fairly information-rich way of looking at things. Every genome will have a signature, though not a perfect one: not every scaffold from a genome will have exactly the same signature, but scaffolds from the same genome will share more in common in this frequency table with each other than with unrelated genomes. The earliest binning worked solely from this, and it does work, but on its own it's a little limited compared to what we can do now.
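Here is a minimal sketch of the tetranucleotide fingerprint just described: slide a 4 bp window along a scaffold one base at a time, count each of the 256 possible 4-mers, and normalise. The toy sequence is only there to show the counting.

```python
# Minimal sketch of a tetranucleotide-frequency fingerprint: slide a
# 4 bp window one base at a time, count each 4-mer (256 possible),
# then normalise so scaffolds of different lengths are comparable.
from collections import Counter
from itertools import product

ALL_TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 columns

def tetranucleotide_freqs(seq):
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in ALL_TETRAMERS) or 1
    return {k: counts[k] / total for k in ALL_TETRAMERS}

freqs = tetranucleotide_freqs("ACGGCATTACGGCA")   # toy scaffold
print(freqs["ACGG"], freqs["CGGC"])
# Scaffolds from the same genome share a similar 256-dimensional profile,
# which is one of the signals binning algorithms cluster on.
```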
Another factor to take into account, which can be very powerful, is the phylogenetic affiliation of the scaffold, or of the genes in your metagenome. The idea is that you take your scaffold, which maybe has 10 or 100 genes on it, and ask what the best taxonomic match is for each of those genes. You can get cases where, on a scaffold, all of the genes have a very strong Deltaproteobacterial signature, and three of them even share a genus-level assignment; that's a genome with close relatives in the reference databases, and you'll see relatively clear identification. We would expect other scaffolds from this genome to have a very similar signature. While this can be hard to automate for binning, it's very useful for going back to curate bins and making sure there's a consistent signal. Scaffold two here, though, has a different best-hit phylum for each gene, and gene C has no hit in the reference database at all. This is the kind of signature you see when there isn't a good reference genome in your database. If your organism comes from a candidate phylum with no sequenced genome, or is relatively novel within its clade, you can get scrambled hits like this. Interestingly, this looks messy, but it's also a signature you can leverage. There was a paper a couple of years ago looking at a non-photosynthetic sister group to the Cyanobacteria, and for a really long time we were calling those genomes the "cyano-Firmicutes", because 30% of their genes hit best to the Cyanobacteria, 30% hit to the Firmicutes, and the other 30% didn't hit anything at all. Every scaffold had that same signature. So even though we couldn't tell from that signature who those organisms were, we could say that this scaffold looks the same as that scaffold. You can still back-check from the signature even if you don't have a particularly precise hit.

Coverage is another important piece for binning: every scaffold from an organism should be present at approximately the same depth of coverage, because the genome fragments were present at the same abundance in your original sample. (Question from the audience: how do you know whether that's an error rather than a real signal?) Yes, I was going to come to this later. The concern that's frequently levied against assembling metagenomes is that you'll end up with chimeric contigs, where the first half of the contig is a Firmicutes piece and the second half is archaeal, and they really never existed together in nature. In my experience, having quite deeply curated these things, that is exceedingly rare. What more frequently happens, and what the algorithms are designed to do, is that in cases of uncertainty they break. So you're much more likely to have a fragmented scaffold that should perhaps have been stitched together than a scaffold that's been stitched together as a Frankenstein's monster. When you see this kind of mixed signature consistently across tens or hundreds of genes, that is not what you'd expect if a rogue contig had been chimerically joined. The only time I've seen one genuinely chimeric contig in my experience with metagenomes, it really did go from a very strong Firmicutes signature straight into Archaea, and the coverage dropped by half at the same point that the annotation shifted; you look at that and it's very clearly an error. But if the coverage is consistent and the nucleotide frequency is consistent and you suddenly have a piece with a different taxonomic identification, my instinct would be to ask whether the genes on that piece are associated with mobile elements or genomic islands, things that may have come in and ameliorated toward the host's genome signature while still being more closely related taxonomically to a different group. Because you will see lateral gene transfer as a true event. I don't think you necessarily have to be more careful inferring that from a metagenome assembly, but you do have to look at it and make sure it makes sense.

Okay, so coverage as a proxy for abundance. In your community DNA, all the pieces of a genome came from the same number of copies of that genome in your original DNA, so they should all have approximately the same coverage. There's a range, because sequencing is not perfectly even, but it gives you a sense. If something is at 200x and something else is at 20x, you can tease those apart quite easily. In a more complex system, in sediment, where most organisms are present at less than 1% abundance, you don't have those really big differences in abundance; you have a gradual shift, so it can be harder to draw lines with coverage as to where your bins lie.
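As a rough illustration of using coverage this way, here is a sketch that computes mean coverage per scaffold from a `samtools depth`-style table (scaffold, position, depth) and applies a crude front-half versus back-half check for the kind of mid-scaffold coverage drop described above. The input file name and the ratio thresholds are placeholders, not recommendations.

```python
# Sketch: per-scaffold mean coverage from a three-column depth table
# (scaffold, position, depth), plus a crude check for the chimera
# signature where coverage shifts sharply halfway along a scaffold.
from collections import defaultdict

def read_depth_table(path):
    depths = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            scaffold, _pos, depth = line.split()
            depths[scaffold].append(int(depth))
    return depths

def mean_coverage(depths):
    return {s: sum(d) / len(d) for s, d in depths.items()}

def halves_ratio(depth_list):
    """Mean coverage of the first half divided by the second half;
    a value near 2 (or 0.5) alongside a shift in annotation is suspicious."""
    mid = len(depth_list) // 2
    first = sum(depth_list[:mid]) / max(mid, 1)
    second = sum(depth_list[mid:]) / max(len(depth_list) - mid, 1)
    return first / second if second else float("inf")

depths = read_depth_table("mapping.depth.txt")   # hypothetical file
for scaffold, mean_cov in mean_coverage(depths).items():
    ratio = halves_ratio(depths[scaffold])
    flag = "  <-- check for chimera" if ratio > 1.8 or ratio < 0.55 else ""
    print(f"{scaffold}\t{mean_cov:.1f}x\tfront/back ratio {ratio:.2f}{flag}")
```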
But with the continued increase in the availability and relative cheapness of sequencing (even though library prep is still expensive), what we now usually have access to is multiple metagenomes from a single site, whether that's a transect across a geographic system or a clinical trial sampled before, during, and after a treatment. Often you will now be taking multiple samples from the same source, be that a human or animal, a landfill, et cetera. What happens is that as an organism responds to a treatment, positively or negatively, or shifts with pH across a site, so too does the coverage of its scaffolds across those different metagenomes. In this case, these are two different scaffolds in blue, and you can see that each of them shows a gradual increase in abundance and then a drop in the last sample; this could be a time series, a depth series, a treatment series. Contrast that with scaffold three, which was at relatively high abundance and then showed a plateauing decline, and with scaffold four, where whatever organism it belonged to showed a strong peak at the second time point and then immediately dropped out of contention. My initial example of 200x versus 20x is blurrier than this: here we're pretty sure that this scaffold comes from a completely different organism responding in a completely different way to the treatment than these two. From this alone, we would probably say these two belong together to the exclusion of the other two. What most often happens is that several of these signals (nucleotide frequency, coverage, taxonomic identification, and coverage across a series) are used together. They all go into a data matrix that a clustering algorithm works over to identify the clusters that belong together to the exclusion of everything else, assign those scaffolds to a bin, and let you pull them out for analysis as a single organism or population instead of the total microbial community.

Okay, so how do you actually bin something? We have all this information, and making that data matrix is not hard. Getting coverage for all your scaffolds from a metagenome assembly is simply mapping the reads back and generating coverage statistics. Looking at your annotations is something you'd probably be planning to do anyway; you have your annotation table of best hits. For nucleotide frequencies there are lots of one-liner scripts that will get you there, and building those tables to the specifications of your binning tool of interest is the usual bioinformatics game of getting file formats right. All of these are things you can generate from a metagenome assembly with existing tools. Originally you were then kind of on your own: you had to figure out how to do the clustering, how to pull things apart. In the last six years there's been a pretty strong explosion of options for binning, algorithms that take slightly different information into account or use slightly different clustering methods.
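To show the shape of the data that binning algorithms work on, here is a toy sketch that stacks tetranucleotide features and coverage across several samples into one matrix and hands it to an off-the-shelf clustering algorithm. Real binners are far more sophisticated; every number below is invented purely for illustration.

```python
# Toy example of the clustering idea behind binning: combine each
# scaffold's composition profile and its coverage across samples into
# one feature matrix, then let a generic clusterer propose bins.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 6 scaffolds, 3 composition features (real tables have 256 tetramers),
# and coverage in 4 samples (e.g. a time series).
tetra = np.array([
    [0.10, 0.02, 0.05],   # scaffold 1
    [0.11, 0.02, 0.04],   # scaffold 2 (looks like scaffold 1)
    [0.03, 0.08, 0.01],   # scaffold 3
    [0.03, 0.09, 0.01],   # scaffold 4 (looks like scaffold 3)
    [0.06, 0.05, 0.07],   # scaffold 5
    [0.06, 0.05, 0.08],   # scaffold 6 (looks like scaffold 5)
])
coverage = np.array([
    [ 5, 12, 30,  8],     # rises then drops
    [ 6, 13, 28,  7],
    [40, 35, 20, 15],     # plateauing decline
    [38, 36, 22, 14],
    [ 2, 90,  3,  1],     # sharp peak in sample 2
    [ 3, 85,  2,  1],
])

features = StandardScaler().fit_transform(np.hstack([tetra, coverage]))
bins = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
for i, b in enumerate(bins, start=1):
    print(f"scaffold {i} -> bin {b}")
```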
We're going to work with both GroopM and CONCOCT within the visualization environment of anvi'o in the tutorial. What anvi'o will actually allow you to do is look at the exact same dataset binned with GroopM, one algorithm, and binned with CONCOCT, another algorithm, see the different decisions they've made, and actually explore what those differences look like: is this bin better quality or worse? Do I agree with this one more or that one more? But at the end of the day, a lot of binning is still very subjective. There are about eight popular binning algorithms; they all treat the data slightly differently and come up with slightly different answers, and the standard has been to go in and manually refine things, which is not reproducible.

So I'm really excited that this paper is out now; it was published on the 20th of May this year, so really recently. It's a tool called DAS Tool (Christian Sieber is German, and "das Tool" was a running joke). What it does is take a meta perspective on binning: you bin your dataset with two or four or seven different binning algorithms, just running their automated pipelines under their defaults to get their bins, and then you feed all of those bin sets into this program. So for instance, if this is the input assembly, bin set one has these three bins and bin set two has two bins, from different binning algorithms. DAS Tool compares and contrasts them to find what's a contaminant within different bins and what's a consensus agreement, and then runs its own scoring strategy on each bin: have we made this bin more or less complete? Have we made the total dataset more or less complete from a genome perspective? Have we made the bins more accurate? It uses single copy genes for that marker-based "are these good genomes or not" assessment, looks at the overlaps between the different binning algorithms, and iterates through to find the non-redundant bin set that best represents the community. If you're really interested in binning, I encourage you to read the paper. I think it's going to be a really nice addition to binning, albeit an additional step, to take away some of the manual refinement that happens. I worked with Christian in my postdoc; he gave us datasets, had us manually refine them, and then tested our results against his tool. We're people who are, you know, kind of binning nerds, so we were happy to do it, and none of us scored better against the known answer than DAS Tool did. It performed better than highly trained humans, and that is really rare for an algorithm, so I'm really pleased that it's available.
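The single-copy marker gene idea behind these completeness and contamination scores can be sketched in a few lines. This is the general concept only, not DAS Tool's actual scoring function, and the marker set and bin contents are toy examples.

```python
# Sketch of the single-copy-gene scoring idea behind bin quality metrics
# (the general concept, not DAS Tool's exact scoring strategy).
from collections import Counter

EXPECTED_MARKERS = {"rpoB", "gyrA", "recA", "rpsC", "rplB"}  # tiny toy set;
# real tools use tens of lineage-appropriate single-copy genes.

def score_bin(markers_found):
    counts = Counter(markers_found)
    present = set(counts) & EXPECTED_MARKERS
    completeness = len(present) / len(EXPECTED_MARKERS)
    # markers appearing more than once suggest scaffolds from another genome
    duplicated = sum(1 for m in present if counts[m] > 1)
    redundancy = duplicated / len(EXPECTED_MARKERS)
    return completeness, redundancy

bin_a = ["rpoB", "gyrA", "recA", "rpsC"]                   # clean, fairly complete
bin_b = ["rpoB", "rpoB", "gyrA", "gyrA", "recA", "rplB"]   # duplicated markers
for name, markers in [("bin_a", bin_a), ("bin_b", bin_b)]:
    c, r = score_bin(markers)
    print(f"{name}: completeness {c:.0%}, redundancy {r:.0%}")
```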
Okay, so the ultimate goal with genome-resolved metagenomics is that you go from your community in question to draft or decent quality genomes. That allows you to look not only at who these organisms are, in whatever robust way you wish, whether with 16S ribosomal RNA if you have it (it often doesn't assemble well in metagenomes, so it can be lost at the assembly stage) or with different marker genes, but it also lets you directly connect that identity with the functions of specific organisms and populations within a community. So not only can you look at what the capacity of the community is, you can identify handoff points between different organisms, fluxes within the system, and pressures that particular organisms may be more or less susceptible to.

When we're studying microbial communities, the really big questions I always go in with are: who's there? Where do they fall within our understanding of microbial diversity? What are they doing? And which of those functions are valuable to the site in question, or important for geochemical cycling and the planet in general? I'd like to take the rest of my time to talk about that half of the equation: it's all very well and good to have a metagenome or these bins, but how do you go from a binned genome to what they're doing, what they're contributing to the environment, what their metabolism is, what dependencies they have, whether they have the function you're interested in? In essence, how do you go from the standard outputs of annotation pipelines, which are either giant protein translation files or tables of gene coordinates with their respective best hits, to something you can meaningfully interpret? That might be a cell diagram like this, where maybe you're tracking the location of all the different heme molecules across an organism's metabolism, or a description of a novel radiation within a known group, with the specific carbon flux and dominant pathways that allow you to infer that this is an anaerobe and that it is in fact using the Wood-Ljungdahl pathway, which is unexpected in that group. How do you get to that? Or how do you get to looking at global nitrogen cycling within a community, and piecing it together with omics data? I'll come back to this figure, because it has some proteomic data overlaid. Unfortunately, I would love to say here are the programs you should use, and they will generate a cell diagram for you, but this one was built in Illustrator and this one was built in PowerPoint; they were both built manually. I was actually involved in a grant to write a program that would let you take your table of interest, identify features, and build a cell diagram. My colleague Cindy built these very detailed, careful models of protein structure, imported them into PowerPoint, and arranged them on the membrane in their appropriate places. So these are incredibly manual and require a really uncomfortable level of metabolic knowledge for pathways that you're not necessarily going to see on a daily or even yearly basis. They take a ton of work. There are tools that make this easier, but our grant to build that program was not funded, so it is not in the pipeline, unfortunately; we're going to try again. There are, however, some databases that have come up as ways of identifying meaningful pathways, to help orient you within this ocean of genomic data for each of your genomes. Each one is its own little ocean. There is the KEGG automatic annotation server (KAAS), which will take your genome and your genes of interest, identify the KOs where it can, and highlight them on the KEGG maps. That lets you click through different metabolisms and see very clearly: okay, this genome has nearly all of the genes for glycolysis, so check, glycolysis most likely. This genome does not have any genes associated with the electron transport chain except the ATPase, so it's likely an anaerobe; let's go look at hydrogenases. Yes, it has hydrogenases. This is likely an anaerobic organism.
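That chain of reasoning (glycolysis present, no terminal oxidases, hydrogenases present, therefore probably an anaerobe) can be written down as a crude rule of thumb. In this sketch the gene lists are stand-in labels rather than curated marker sets, so treat it as a way to organise the logic, not as a real classifier.

```python
# Toy sketch of the reasoning described above: check which marker gene
# sets a genome's annotations cover and apply a rule of thumb about
# aerobic vs anaerobic lifestyle. Gene lists are illustrative stand-ins.
GLYCOLYSIS = {"pgi", "pfkA", "fba", "gapA", "pgk", "eno", "pyk"}
ETC_TERMINAL_OXIDASES = {"coxA", "coxB", "cydA", "cydB"}
HYDROGENASES = {"hyaA", "hybC", "hndD"}

def fraction_present(annotated_genes, marker_set):
    return len(annotated_genes & marker_set) / len(marker_set)

def quick_lifestyle_guess(annotated_genes):
    glyc = fraction_present(annotated_genes, GLYCOLYSIS)
    oxid = fraction_present(annotated_genes, ETC_TERMINAL_OXIDASES)
    h2 = fraction_present(annotated_genes, HYDROGENASES)
    print(f"glycolysis {glyc:.0%}, terminal oxidases {oxid:.0%}, hydrogenases {h2:.0%}")
    if glyc > 0.7 and oxid == 0 and h2 > 0:
        return "likely anaerobe (worth a closer look)"
    return "no call from this crude check"

genome = {"pgi", "pfkA", "fba", "gapA", "pgk", "eno", "pyk", "hyaA", "atpA"}
print(quick_lifestyle_guess(genome))
```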
And then you end up down an endless rabbit hole: it has terpenoid synthesis. What's terpenoid synthesis? I'm going to look it up. And an hour later: everyone has terpenoid synthesis, why am I searching this? It just goes from there; you end up down various rabbit holes forever until you reach a sort of Zen-like understanding of your genome. I don't necessarily recommend this approach, but it's the only approach I know of so far, except that there are some new tools that are working really nicely to help you orient around what's important and what isn't. KEGG also has a program called MAPLE, the Metabolic And Physiological Potential Evaluator, which is a bit of a stretch for an acronym. What it allows you to do is take your genome bins, partial or complete (there are different toggles for that), and map them against all of the KEGG modules, transporter pathways, and network pathways. What you can additionally do, in the same analysis, is upload the most closely related genomes from a reference database, or pull from the ones already uploaded; a bunch of them are there already, but some aren't. What that means, for instance (and I think this is going to come out really small, but I'll describe it as best I can) is that you get this table, again a big ugly table. What we have here are five bins from the Bacteroidetes, from a leachate metagenome from the Riverton waste site in Jamaica. They're at different levels of completion: this one not very complete, these two very nicely complete. And off to one side (I couldn't screenshot the whole thing) I have a whole bunch of reference genomes from within the Bacteroidetes. Highlighted here is a pathway that is essentially universally present, or at least partially present, within the Bacteroidetes but is canonically absent from most of my genomes, and a pathway that is not necessarily expected to be in this group. Oh, I don't know why that's green. Okay, this one. No, I don't have the right view, that's okay. My point is, what this gives you: if, instead of my original "terpenoid synthesis is cool" example, I looked at this table, I would be able to see at a single glance that everyone has it, that's fine, we'll leave it. It lets you see what distinguishes your organisms, what they have that is expected and what they have that is not expected, and it allows you to place them a little more precisely. (Question from the audience: so is denitrification one of those?) Denitrification, yeah. Basically the mode in the Bacteroidetes is that they never, ever have it, and then this relatively incomplete genome has 75% of that module. So yes, exactly, that would be an unexpected feature for this group of organisms. This tool is only going to be as good as your references for the comparative aspect, although it'll be great for giving you an overall snapshot of what your genome, organism, and population might be up to. But if you're out in candidate phylum land and there aren't any good, relevant genomes, your best match is not going to tell you a lot about what you actually expect to see.
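Reading a MAPLE-style table programmatically amounts to comparing module completeness between your bins and the reference genomes and flagging the big disagreements. Here is a small pandas sketch of that comparison; the module names and percentages are entirely made up.

```python
# Sketch: a table of module completeness (%) for your bins and for
# reference genomes, flagging modules where a bin differs strongly
# from the reference consensus. All values are invented.
import pandas as pd

completeness = pd.DataFrame(
    {
        "bin_1": {"glycolysis": 95, "denitrification": 75, "module_X": 10},
        "bin_2": {"glycolysis": 90, "denitrification": 0,  "module_X": 5},
        "ref_A": {"glycolysis": 100, "denitrification": 0, "module_X": 90},
        "ref_B": {"glycolysis": 100, "denitrification": 0, "module_X": 85},
    }
)

bins = completeness[["bin_1", "bin_2"]]
refs = completeness[["ref_A", "ref_B"]]
diff = bins.sub(refs.mean(axis=1), axis=0)   # bin minus reference average

# Modules a bin has that the references essentially never do (or vice
# versa) are the "unexpected for this group" features worth chasing up.
print(diff[(diff.abs() > 50).any(axis=1)])
```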
Okay, so once you have a protein of interest, say I decided denitrification must be interesting, how much do I believe the annotations? Do I believe that this specific gene is involved in this specific function? There are a number of different ways to check, and this is us narrowing down: we've had global genome properties, global metabolism, pathway of interest; now, do I believe this annotation? I recommend that for anything you think is a keystone or linchpin for what this microbe is doing in the environment, you double check your annotations, because annotations are hugely suspect. A lot of the reference databases being used to annotate things are themselves based on homology, so you're looking at homology to something that is itself only homologous to something else, and you're now too many steps away. A quick and dirty check is to run BLASTP on the web server, not because the hits themselves have any meaning whatsoever, but because it gives you a hit to a conserved domain; when you open up that view, you get the E-value of the match to the conserved domain. If there is a crystal structure related to your gene, you can also model your protein against the structure of what you think its function is, to see whether it has a conserved active site, a conserved fold, things like that. That only works where a crystal structure exists, which is few and far between, but it can be very helpful for confirming some of the more important geochemical genes. And if you're trying to see a bit more of what your total genome has available to it, you can also work with iPath (I think it's iPath 3 now, it might be iPath 2), which is a way of visualizing the total metabolism encoded by an organism. What I like it for is less the total metabolism view. I took every protein from a methanogen and uploaded it, and unless you really, really know the microbial metabolism map, that is not a very helpful view. But what you can also do is overlay, on the pathways present within a genome, the pathways you've identified in other omics tools. I think John is going to talk about metatranscriptomics, but if you had transcriptomic or proteomic data from this organism, you can overlay it, and now we're seeing only the active pathways in the genome: everything in green is something that showed up in the transcriptome or the proteome, whichever data you have. So you can look at the differential activity within an organism. It's an interactive website, so you can hover over pathways and get some information about what they are. It's perhaps not useful for a publication-style summary, but very useful for your initial genome navel-gazing identification of things. As I said I would, I've come back to this figure, because we have the ability to illustrate not just your metagenomic work but your other omics work, in terms of what specific organisms are contributing to a specific pathway. This is, again, the nitrogen cycle within a subsurface environment. Anything shown as a dashed arrow is something we predicted from the metagenomic data, and anything shown as a solid arrow also had protein data backing up that it was expressed in the system. So for anything that's a solid arrow, we predict it's present and it's being expressed, a dual level of confirmation that these are active processes in that system.
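For the overlay idea, here is a tiny sketch of preparing a selection list: take the KO identifiers encoded in a genome and colour the ones also detected in a metatranscriptome or metaproteome. The KO numbers are placeholders, and the exact selection syntax iPath expects should be checked against its documentation.

```python
# Sketch of preparing an overlay like the one described above: mark the
# subset of a genome's KOs that were also detected as expressed, and
# write a simple colour-coded selection list. KO numbers are placeholders.
genome_kos = {"K00001", "K00134", "K00927", "K01689", "K02111"}
expressed_kos = {"K00134", "K00927", "K02111"}   # seen in the transcriptome/proteome

with open("ipath_selection.txt", "w") as out:
    for ko in sorted(genome_kos):
        colour = "#00aa00" if ko in expressed_kos else "#bbbbbb"  # green = expressed
        out.write(f"{ko} {colour}\n")

print(f"{len(expressed_kos)}/{len(genome_kos)} encoded functions also detected as expressed")
```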
For me, in the complex systems I work with, with metatranscriptomics and metaproteomics I typically cannot get enough coverage across the community to look at statistical changes in abundance or expression levels; typically I'm looking at presence and absence. I'd love for that to change, but the depth of sequencing required is usually a little bit out of reach. And really, these meta-omics approaches are most effective when they're paired with a high quality metagenome, because then every hit from that sequencing can be directly mapped back to the gene in question, and when you have novel genes or novel organisms that don't map to a reference database, you're not losing that data. So if you can pair metagenomics with your metaproteomics or metatranscriptomics work, it can greatly strengthen your results. With RNA, the main disadvantage is that ribosomal RNA ends up being the bulk of your sequence.

Okay, that was mostly the overview I wanted to give you. I've highlighted a bunch of tools; all of them are in your slide deck except MAPLE, which I added late, so take note of MAPLE if you wish. The tutorial you have is from an infant gut metagenome dataset, tracking a single baby over a time series. Some of the dirty work of generating the profile database and the contigs database has already been done. We've installed anvi'o on the student instance of the server, and the dataset I asked you to download is also there, because people were having trouble installing anvi'o, so that should circumvent any problems. But if you want anvi'o running on your own machine and can't get it to work, let me know and I can try to help. The tutorial comes directly from the person who designed anvi'o; it's a really nicely annotated tutorial and it covers a dataset I really like. I put the paper that these data come from up on the Google Classroom chat, so if you're interested in the actual study and what was reported from those data, you have that published work. I hope the tutorial is relatively straightforward as you go. It is longer than you're likely to get through, and it covers a lot of different things. The key parts I want you to play with are the actual genome binning, where you interactively click around to increase or decrease the quality of your bins, through to the comparison of the different binning tools, and adding taxonomy to start identifying those organisms. Later in the tutorial they look at single nucleotide variants and very specific strain differences between genomes; carry on with that if you're interested, but I think you probably won't get to it in the time allocated. I'm here if you have questions. Do you have questions on this?

(Question from the audience about the issues with RNA mentioned earlier.) Oh, so if you extract total RNA from a microbial community, usually about 97 to 99% of it is ribosomal RNA rather than messenger RNA. If you're interested in which genes are being expressed, other than the ribosome, that's a really, really tiny proportion of your actual data. There are ways to subtract out ribosomal RNA; the best I've seen get your messenger RNA up to maybe 5% of your total sequence, which is a pretty significant increase, but it still means you're doing a lot of sequencing of just 16S and 18S and 5S and 23S, et cetera. And there are pros and cons to subtraction.
The pro is that you get more messenger RNA for your sequencing dollar. The con is that you remove any ability to look at the microbial community composition from your RNA data. If you don't subtract any sequence, you actually have a really nice profile of the active organisms in your system, because they were actively expressing their ribosomal RNAs. If you subtract out using probes that match specific sequences, you irrevocably alter that community composition; you can no longer speak to it from your data, so you really have converted that 95% of your data into junk. So it depends what your real question is, how complex you think your dataset is, and how deep your sequencing can be, because if you can sequence deeply enough, that 1 to 3% messenger RNA might be all you need. And there will be more about metatranscriptomics tomorrow; I didn't want to step on that topic too much, so consider this a prelude. With protein sequencing, just to go a bit deeper on that one, peptides that don't have an exact match in a reference database map incredibly poorly. So with proteomics you really do need quite a high quality paired metagenome for most environments, with the possible exception of gut systems, where the organisms present don't shift that much between samples and have been very heavily sequenced, so the reference databases are quite good. But the minute you're in another environment, that's not going to hold; it's a waste of money.