Okay, so we're going to jump right into our first official module, where we're going to talk about 16S sequencing. It was a great introduction this morning, and I just want to encourage you: this is a workshop for everyone here, so if you have questions I really encourage you to ask them. I think that's one of the nice opportunities of a workshop like this; there's a lot of content on the web, but you don't get a chance to really ask questions, maybe about whether something should be done or not done, or more specific questions. So feel free to interrupt me at any time; I'm happy to try to answer questions, and there are other people in the room that can answer questions too. Even the trainees here will sometimes know some of the content. You'll also find, as we go through the rest of the workshop, that it's not a certain thing that you must do this; there will be a lot of "well, you could do it this way or that way." That's sometimes unsatisfying, but just knowing the different options out there and understanding what's not a hard and fast rule and what is flexible is part of it, so sometimes a nice discussion can start from the trainees as well. And I recognize too that people are coming here with a lot of different experience: some of you have probably never logged into a cloud instance, many of you have probably never worked on the command line, and that's okay, we're here for you. There are other people that are a bit more advanced; some people have done microbiome analysis in the past, and some people are still trying to figure out what 16S versus metagenomics even means. That's completely understood, you're in the right place. The focus of this workshop is basically to make sure that everyone has a pretty good foundation of where we're starting from.
Okay, so with that we're going to get started. In this module I'm going to talk for an hour or so, maybe less with questions, I'm hoping. We're going to talk about 16S, which is kind of the basics, and I think it's a pretty good place to start because the bioinformatic steps are well defined, whereas you'll see tomorrow when we dive into metagenomics there are a lot more options out there for different types of analysis. I am from Dalhousie University. My name is Morgan Langille, I guess I should have said that first, and I am an associate professor there. Before we get into it I thought I'd run through a few things about my own research. One is that out of my lab I run the Integrated Microbiome Resource, or the IMR. It's basically a sequencing and bioinformatics service core run out of my research lab that I started back in 2015, pretty much when I started my position. We've done a lot of microbiome sequencing; we actually just passed our 1,000th sequencing run last weekend, which was pretty exciting. A lot of that is 16S sequencing, but we also do metagenomic sequencing, and we also do PacBio sequencing, which I'm going to talk about a little bit today in this lecture. We've had a lot of samples from Dalhousie locally and from across Canada; I know there are lots of people that do sequencing here at Calgary, I'm not trying to convert everyone, I'm just trying to give you a flavour of what I do, and we do a lot of samples from across the world. I also do some bioinformatic tool development; research in my lab focuses on particular research projects, which I'll talk about in the next slide, and on development and testing of different bioinformatic methods. We're going to use Microbiome Helper a bit, and I'll explain what that is: it's basically a resource with different types of tutorials and standard operating procedures for bioinformatics.
So I was a developer of the first PICRUSt version, and a PhD student in my lab developed PICRUSt2. I'm not going to talk a lot about PICRUSt in this workshop; we'll briefly mention it tomorrow, but if you have questions about it I'd be happy to try to answer them in the break times. A couple of newer tools developed in the lab are POMS, which is a differential abundance method, and JARRVIS, which is a visualization method that we may or may not use in the workshop tomorrow. From a research perspective I've covered quite a few different things; we tend to jump around. In the past I've worked on pediatric inflammatory bowel disease and a bit on exercise, but right now the major focus in the lab is oral cancer, and for that we're looking at saliva, blood, and tumor samples. For mental health we're mostly looking at connections between the oral microbiome and mental health in adolescents. And we have an ongoing project on frailty, for which we have saliva and stool. I'm just listing the sample types there so you get a sense of the different types of tissues we do microbiome profiling on. I'm happy to chat about any of that in the off times, but let's get right into why you're here. Okay, so the learning objectives for this lecture: to help you understand what amplicon sequencing is, that there are different amplicon targets besides just 16S and why you might use those, and we'll talk about variable regions, which is always a nice little debate. Then we'll get right into the major bioinformatic steps of a 16S analysis pipeline, so that when you get through that pipeline you really understand the outputs you're going to use in future downstream analysis. Okay. Great.
So, generally, when we talk about methods for studying the microbiome, we talk about 16S, we talk about metagenomics, and we talk about metatranscriptomics and maybe even metabolomics. In terms of sequencing, we can group everything that's just a bit beyond 16S sequencing into what we call amplicon-based sequencing. Some people call this marker-based sequencing; it gets lots of different names, but amplicon is just a fancy term for amplifying up a piece of DNA through PCR (or RNA, but in most of these cases we're talking about DNA). If we're thinking about profiling a microbial community, we're hopefully going to have a target that's a universal gene, or barcode. This is actually quite a bit different from what you might have heard about barcoding from a eukaryotic standpoint; you may have seen those stories about the fish you eat not being the fish you bought in the market. Those use a barcode specific to a taxon, whereas in this case we're looking for a particular gene that's found in all the microbes we're interested in, in the hopes of sequencing them all and then categorizing them. So 16S, or amplicon sequencing in general, is very good for telling you the types of microbes in that community, who is there; not so much on the functional side, but we can talk about that a bit later. And it's important to remember that, based on the amplicon we're choosing and the variable regions, we're restricting our field of view to that actual target. Something like metagenomics, which we're going to talk about tomorrow, is blanket: we're sequencing everything, and that's going to get everything in the sample, whereas with amplicons you restrict your view to those targets. So the most common one we're going to talk about, obviously, is the 16S ribosomal RNA gene.
Ribosomal RNA genes are just what they sound like: they're present in all living organisms, they're part of the ribosome, and they're RNA genes, so they don't form a protein product; they're important for the ribosome's structure. Because they don't code for protein but instead form a nice secondary structure as part of the ribosome, they have these regions that are conserved and others that are more variable, and that makes them a very good marker for different types of microbiome profiling. It's very nice because it's universal, meaning it's found in bacteria, archaea, and eukaryotes; the 16S gene is actually what was used to define the three major domains of life. I'm not going to talk about 18S, but the fact is that the 16S and 18S genes are homologs: we just call it 18S in eukaryotes, but it's essentially the same gene as in bacteria, where we call it something different. This is always super confusing, but it's why we can make a nice phylogenetic tree like that. So overall it's a great marker; it's been used for a long time. There's a lot of history here, and even though there may actually be other, better candidates nowadays, the number of studies that have already profiled 16S genes is so massive that we have really good sampling depth on a lot of different 16S sequences. So by far the most common marker we hear about is 16S, and that's going to target primarily bacteria and archaea; we'll talk about that a bit more in a second. There are other marker genes out there you might come across. For bacteria, the other big candidate is cpn60, chaperonin 60. I see it pop up every once in a while, especially from particular labs, but you will see it as a marker. It is beneficial in a couple of cases; one is that it's only a single copy, which is really nice.
We'll talk about copy number later. I see cpn60 a lot in vaginal microbiome studies; it seems to have a bit better resolution there, but it's another target you'll see used. If you're interested in eukaryotes rather than bacteria or archaea, you'll often see 18S, which I just told you is homologous to 16S, but you'll also see ITS, the internal transcribed spacer. I'm not going to talk about it a lot today, but just so you know, there's an ITS1 and an ITS2, and those amplicons are really good for resolving fungal communities. So if someone's really interested in fungal communities, maybe some protists, they'll often target ITS and not just 18S. That covers our three domains of life, and then, I don't know if there's anyone interested in viruses; poor viruses, they always get left out. Anyone interested in the viral microbiome? One, I see two maybe, right. For the viral microbiome, sometimes people use a couple of specific markers, but nowadays I would say I don't really see that used at all. For those applications you're usually going to go to metagenomic sequencing: you're going to try to purify for those viruses and then go down that road, unless there's a particular class of viruses that you have a target for; there's no universal gene found in all viruses that we can easily sequence. Okay, so let's focus in a bit on 16S, because that's what we're going to talk about, though I'm happy for people to interrupt me with "hey, what about 18S or ITS, what's the difference here?" I'm happy to talk about those things. But if we talk about 16S a little bit more, we also get into these variable regions. As I mentioned before, the variable regions come up; this is just a plot showing conservation across the gene, which you can see on the right.
So basically this shows conservation across the gene; the gene is about 1,400 base pairs long, and you get conserved areas, which are shown here, and then more variable regions down here. This is great from a sequencing and PCR standpoint: the conserved regions are great for designing your primers, precisely because they're conserved, and then the variable regions give you the resolution across different types of taxa. Because of this, and because most of our sequencing up until just a few years ago was mostly short-read sequencing, 454 back in the day but primarily Illumina now, the reads tend to be short: we're talking 150 to 300 base pairs on the forward and the reverse. Since you only have 300 and 300 max, you're not going to cover the whole 1,400 base pair gene; it's not possible. So you have to pick a variable region; you have to pick what part of the gene you're going to sequence. This has been up for debate for a long time, and people have their favourite variable regions. Anything around the V4 region is very popular, V5 as well, but people used to do V1-V3, and people do V6-V8; I'm not going to get into that debate. But I am going to talk about the fact that different variable regions will give you slightly different views, and the primers that you use will also have certain biases. This is actually from the IMR website; we put it up because I thought it was quite useful. I know there's a lot of information here, but what we're showing is different variable region targets on the left, and then the percentage of genomes within different taxonomic groups that are theoretically amplified by that primer pair.
And what we see is that we do a lot of V4-V5 sequencing right now; we call that universal because in theory we get pretty good representation across archaea, bacteria, cyanobacteria, and eukaryotes, and then you also pick up mitochondrial and chloroplast DNA. Now, just a little aside, since we're covering all the bases here: mitochondria and chloroplasts have a distant 16S homolog. So even though you're not really trying to amplify, say, mitochondria or chloroplasts, which are found in plants, sometimes you'll actually amplify up that DNA, especially if your sample is really enriched with human or host cells, or plant material; you'll sometimes see that signal come through. If your sample is really rich in bacteria and archaea you don't have to worry about it, because it's a very minor contributor, but you will see it pop up. So that's V4-V5; we do a lot of that sequencing, and we reference the source for those primers. Sometimes, if you're interested specifically in archaea, we have a V6-V8 archaea-specific primer pair that mostly targets archaea, as well as a V6-V8 bacterial pair, with the benefit that it doesn't do as much amplification of the off-target, non-bacterial taxonomic groups, and you can see here that we have other options as well. The reason I'm mentioning this is that, if you haven't done 16S sequencing before, you will have to think about what variable region you're going to use. And before the sequencing is even done, you should think: what off-targets might I get in my sample? What other things might I sequence that maybe I don't want in there, that I should try to filter out, or at least be aware of when I'm doing the processing?
And then, besides having slightly different targets, there's also this effect of biases: certain primer pairs will amplify certain taxonomic groups a bit better than others. This is just an example from our paper back in 2017, when we first started the IMR, comparing V4-V5 versus V6-V8, using the standard 20-taxon HMP mock community in the middle; those taxa are all in the same amounts. What you can see is that, overall, the taxonomy looks pretty good, but there are biases in both directions. With V6-V8 we see pretty low levels of Helicobacter compared to V4-V5; that's the light purple one there. Whereas down at the bottom, for Propionibacterium, we see basically none of it in the V4-V5 but really good representation in the V6-V8. So the take-home message is that there are going to be trade-offs across your variable regions. The other big thing is that if you're comparing a data set done with one particular primer set and amplification protocol to a data set done with another, they're not going to agree; they're just fundamentally not going to agree. It basically means that if you ever want to really compare the two, you're going to have to think hard about what you're comparing. You're not going to be able to throw them onto, say, the same nice plot and see a clean biological signal; the technical bias is huge. Okay. Yes, question. Yeah. No, it's one PCR product spanning the V4 and the V5 region. That's a good question. And for most of those primer pairs, actually for all of them, it's all one PCR product.
Yeah, there have been a couple of groups that do this, where they sequence different variable regions and then try to combine the data afterwards; it gets really messy and complicated, and I don't know if I'd really suggest it; there are probably better options out there, maybe this one, actually. So, because Illumina has short reads, it can only span certain variable regions. Of course you've heard quite a lot about long reads, coming either from Nanopore or PacBio (Pacific Biosciences, as it's called); those are the two leaders in really long-read technology. I'll speak to PacBio at least; I'm not as familiar with Nanopore, so if anyone has done Nanopore and 16S I'd love to hear about it. For PacBio, the raw accuracy in general is not great, but PacBio has always had this way to do circular consensus sequencing: basically, they have a PCR product in the middle, and they add what they call SMRTbell adapters on the ends. For that one PCR product, you then sequence it over and over; each pass only gives you about 85 to 90% accuracy, but because you read through the same molecule multiple times, you actually generate very high accuracy, as long as you've gone around it enough times. They call these HiFi reads, if you're into the techie lingo, and it means we can do full-length 16S sequencing, or full-length 18S sequencing, or full-length ITS sequencing that spans all those variable regions. The biggest benefit is that PacBio pricing has come down, so it's still a little bit higher than Illumina, but it's pretty close now. And it's probably going to take another plummet soon, because they've announced another new sequencing machine. Just to give you a rough idea, we're talking 16S at maybe $20 a sample to sequence, depending on where you're sequencing it.
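To get a feel for why repeated passes over the same molecule give "HiFi" accuracy, here is a toy calculation. This is a deliberately simplified model (independent per-base errors, simple majority vote over an odd number of passes), not PacBio's actual consensus algorithm, but it shows how 85-90% per-pass accuracy compounds into a very accurate consensus:

```python
from math import comb

def consensus_accuracy(per_pass_acc: float, passes: int) -> float:
    """Probability that a per-base majority vote over `passes` independent
    reads of the same molecule calls the base correctly.
    Toy model: errors are independent and `passes` is odd."""
    assert passes % 2 == 1, "use an odd number of passes for a clean majority"
    # sum binomial probabilities of getting more than half the passes right
    return sum(
        comb(passes, k) * per_pass_acc**k * (1 - per_pass_acc)**(passes - k)
        for k in range(passes // 2 + 1, passes + 1)
    )

# with ~88% raw accuracy, accuracy climbs quickly with the number of passes
for n in (1, 5, 11, 21):
    print(f"{n:>2} passes: {consensus_accuracy(0.88, n):.6f}")
```

With 11 passes the toy consensus is already above 99.9% per base, which is the qualitative point: enough trips around the circle turn a noisy read into a highly accurate one.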
Yeah, for us, for PacBio sequencing, I think we're charging around $40. So it's about double, but you do get less depth right now; that's just to give you an idea of what we're talking about. Okay. So that's sequencing. Let's get into the actual bioinformatics. Any questions about sequencing PCR products? Good. Great. Anyone doing non-16S sequencing? Anyone doing ITS or 18S sequencing? One, two, three, sure, four or five. Right. Okay, good. Great. The good news is, even if you're doing something other than 16S, the basic bioinformatic steps are going to be very similar, with slight tweaks. For platforms doing the bioinformatics, there are really two big competitors here, I would say: QIIME 2 and mothur. They've been around since the start of, I would say, microbiome analysis. I would say QIIME wins out on the popularity contest, but mothur has not gone anywhere; it's still there in the background and people still love it. I don't really have a big preference here; we tend to use QIIME in my lab, that's it, I'm not going to play favourites. We are going to use a QIIME-flavoured workflow for the lab, and we're also going to describe the major steps in a QIIME-flavoured way; that being said, the steps are very similar in mothur. Okay, so if we break this down into different steps: what we're going to start with, for this talk and the lab, is your raw sequence data coming off a sequencer; we're going to assume it's Illumina for right now. Then we're going to do a bunch of stuff to that: we're going to demultiplex it, we're going to stitch reads together, we're going to quality filter; I'll talk through all these steps in a second. Then we're going to go into OTU or ASV picking.
And that's a really important step, which we'll talk about in more detail. Once we do that, we're going to get into how we assign taxonomy to those sequences and how we build an ASV table, and then, on the other side, how you use those reads to also build a phylogenetic tree. This ASV table and your phylogenetic tree, along with your metadata table, are the three major outputs that you're going to use for a lot of downstream analysis. You might do that within QIIME, but you can also do it in Excel, or pull it into R; basically, you're going to get to the point where you have processed data. Then comes a lot of what I would say is where the real bioinformatics starts: playing with the data, trying to visualize it, trying to make sense of it. There are a lot of steps, and definitely lots of variation, but the hope is that you can get through them fairly straightforwardly. Okay. So, some of the real basics here. Often one of the first steps will be demultiplexing; a lot of times sequencing machines will actually do this for you nowadays, so you might already have demultiplexed data, but it literally just means that you have one file coming off the sequencer that contains all of your sequence data across multiple samples. Because we run sequencers that can generate a lot of data, we don't run one sample per sequencing run anymore; we don't need all of that data for one sample, so we multiplex multiple samples onto a single sequencing run. For instance, for the MiSeq, we tend to multiplex about 380 samples onto a single run. You could do more or fewer; if you did fewer, you'd just get more sequences per sample.
You multiplex those samples by incorporating some sort of DNA barcode into the actual amplicon product: you have your amplicon product, you add a barcode in one of several ways, and then when you sequence it, you have the sequence plus a barcode. Afterwards, once you've done the sequencing, you just use that barcode to say, okay, this sequence was from this sample, that sequence was from that sample, and you split it all out; fairly straightforward. And because it's fairly straightforward, most sequencers will do this for you: you upload the barcodes to the sequencer, it does the demultiplexing, and it gives you a FASTQ file for every sample separately. So that's demultiplexing, not really complicated, but it is an essential step. Then you move on to quality filtering. We've talked about quality filtering quite a bit in previous workshops, and I'm not going to go into as much detail, but it is very important; it's also quite variable depending on your type of data. Obviously quality filtering matters: we don't want "junk in, junk out," or whatever terminology you prefer for that. You want to remove sequences that you think are not very good, or trim the parts of sequences that are not very good. For certain types of sequencing, like Illumina, which is probably the biggest name in town, the classic pattern is that you get fairly high quality across your sequencing read and then the quality drops off towards the end of the read. That's what this is showing: this axis is the quality of your read, and this axis is the base position along the read; you'll see these plots quite a bit from different tools for viewing quality. The quality score is encoded in those FASTQ files; the Q in FASTQ stands for quality, so you get the read plus a quality score for every single base in that read.
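The demultiplexing step above can be sketched in a few lines. This is a minimal illustration, assuming exact-match inline barcodes at the 5' end of each read; the barcode sequences and sample names here are made up, and real instrument software handles index reads, mismatches, and dual indices:

```python
# Hypothetical barcode-to-sample map (real runs would have hundreds)
barcodes = {"ACGT": "sampleA", "TTAG": "sampleB"}

def demultiplex(reads):
    """reads: iterable of (read_id, sequence) tuples.
    Returns {sample_name: [(read_id, barcode_trimmed_sequence), ...]}."""
    bins = {sample: [] for sample in barcodes.values()}
    bins["unassigned"] = []
    blen = 4  # all barcodes in this toy example are 4 bp
    for read_id, seq in reads:
        sample = barcodes.get(seq[:blen])
        if sample is None:
            bins["unassigned"].append((read_id, seq))  # barcode not recognized
        else:
            bins[sample].append((read_id, seq[blen:]))  # trim barcode off
    return bins

reads = [("r1", "ACGTGGGCC"), ("r2", "TTAGAAAAT"), ("r3", "CCCCGGGG")]
print(demultiplex(reads))
```

In practice you never write this yourself; the sequencer (or a tool like QIIME 2's demux plugin) does it, but this is all "split one file into one file per sample using the barcode" amounts to.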
Using that information, you can then process those files in different ways to filter the data. You could imagine, okay, maybe I trim these reads back, meaning I just get rid of the low-quality end of every read. What we also see is that some reads will be low quality across the whole thing, or will contain ambiguous bases, and the easiest option, because sequencing is fairly cheap, is to just say we don't want this read downstream at all. For filtering there are lots of different criteria: we could require a minimum base quality, say a Q of over 30, maybe across the whole read or on average; a minimum percentage of high-quality bases; a maximum number of ambiguous bases (usually we just say if there's one or more Ns, we don't want the read); or a minimum read length. For tools, we're going to use Cutadapt in the lab today, but Trimmomatic is out there, and Sickle is another one. You can do your own thing if you like; there are tons of options, but some sort of filtering is a good idea for sure. Okay. The other thing we'll often do now in the lab is remove any reads that don't contain our primer, because sometimes you'll get off-target sequences in there. The way to check for that is to say, hey, we know what primer should be in our sequencing read; we'll check for that primer, and if it's there we'll keep the read, and if it's not, we're not going to waste our time trying to figure out what it is or why it's there, and we'll remove it. Tools like Cutadapt can do this: here's my primer pair, check for it, keep the read, and a lot of tools will also trim that primer off the read for downstream analysis. Anybody using filtering tools besides the ones I've listed? A favourite?
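To make the criteria above concrete, here is a small sketch of what a per-read quality filter does. It decodes a FASTQ quality string (Phred+33, the standard Illumina encoding) and applies the checks from the lecture; the threshold values are illustrative defaults, not recommendations, and the primer check is a plain substring match rather than the error-tolerant matching a tool like Cutadapt performs:

```python
def phred(qual_string):
    """Decode a FASTQ quality string (Phred+33) into per-base scores."""
    return [ord(c) - 33 for c in qual_string]

def keep_read(seq, qual, min_mean_q=30, max_n=0, min_len=100, primer=None):
    """Return True if the read passes all filters (thresholds are toy defaults)."""
    scores = phred(qual)
    if len(seq) < min_len:
        return False                      # too short
    if seq.count("N") > max_n:
        return False                      # too many ambiguous bases
    if sum(scores) / len(scores) < min_mean_q:
        return False                      # mean quality too low
    if primer is not None and primer not in seq:
        return False                      # expected primer missing: off-target
    return True

# 'I' encodes Q40 and '#' encodes Q2 in Phred+33
print(keep_read("ACGT" * 30, "I" * 120))  # high-quality read
print(keep_read("ACGT" * 30, "#" * 120))  # low-quality read
```

Real tools add sliding-window trimming, paired-read handling, and speed, but every filter ultimately reduces to checks like these on the sequence and its quality string.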
Yes, what are your favourites? Yep. So, yeah, you can do this within QIIME, for sure, and some of the filtering will also happen when we get into denoising, which I think is in the next slide or the one after. Some of this filtering happens at that stage: you're going to lose reads because, depending on the denoising method you use, it will remove reads as well. Yes, and you'll see the flow, but usually we want to filter out anything low quality before getting into denoising, or correcting for sequencing error, after that; and that can happen within QIIME or outside QIIME. Were there other questions or comments about tools that you use for quality filtering? No. Yep. Yep. So they all work sort of the same, in that they all do some sort of quality filtering, but most of them have different defaults, and they'll have different options for how you remove reads. Trimmomatic, by the sounds of it, is really good for trimming reads as opposed to removing them. Cutadapt we tend to use a lot because it's really nice for feeding in these primer pairs, which I don't think Trimmomatic has an option for. Sickle, I'm not sure of its benefit versus the others, but all the different tools just have slightly different ways of doing it. It's not that one is really better than another, and I'll say this a lot: people come at these problems from really different angles, and people tend to have their favourites that they stick with. Yeah. Okay. So once some filtering has happened, the next step we're often going to do, often, not always, but often, is to actually join or stitch our reads together.
Again, this is a bit unique to Illumina; it doesn't apply to something like PacBio sequencing. Illumina produces a forward read, which is shown as this arrow on top, and it produces a reverse read. Since we're doing amplicon sequencing of a set size, we know how well that forward and reverse read should overlap. In an ideal world, we get a read pair like this, because we've designed our amplicons so that the reads overlap, and that overlap is used to join the forward and the reverse read together into a single, longer read. These two reads belong together because we know they're a forward-reverse pair; we know that right off the sequencer, so there's no guesswork. If you're doing metagenomic sequencing, your fragment length is variable, it's not a set size, and the metagenomic fragment will often be larger than our amplicon; in that case your forward and reverse reads don't overlap, so you can't really stitch them together. In other cases the fragment could be really short and the reads actually run past each other, which is what this option is showing. With metagenomic data, as we'll talk about tomorrow, you often won't do this stitching step, because the forward and reverse reads don't overlap; you either treat your forward reads and your reverse reads as two different, independent views on the sample, or you just combine them together, and we can talk more about that tomorrow. For amplicon data, though, we'll often try to stitch or join them. You can also just use the forward read only, or the reverse read only; people sometimes do that, but the default for us is to try to join them. For tools, we used to use PEAR a lot in the lab back in the day, but now we just run VSEARCH within QIIME 2, which will do this stitching for you, and you'll see that in the lab; it's pretty straightforward. Yep, question. That's right.
That's right. Yeah. When we first started, we'd see a lot of sequencing, and still do to this day, where people would just use the forward read by itself, either because they used only 150 base pair reads instead of 300s, or because certain people really like to see complete overlap between their forward and reverse reads for quality. By design, at least at the IMR, we design those primers to give amplicons of about 450 base pairs to get this overlap. Yeah, that's a good question. That's a really good question. So with ITS, yes, you don't get this nice even length; it varies across different species, where the ITS can be shorter or longer. The way we usually handle that is to try to join the reads and see how much we lose. If you're getting pretty good stitching, it means there hasn't been too much length variation outside what your reads can span, so we'll join them and look at what percentage of our reads are kept afterwards. If you keep, say, 89%, okay, we're going to stitch. But if half your reads don't stitch and join, then you're going to have to use the forward read by itself, and maybe combine things at some point downstream. Okay. And so we get to the fun topic of denoising. Let's just see how we're doing for time; very good. Okay, so denoising: what is this? This is where we get to this idea of OTUs and ASVs, and my hope here is to explain it so you understand the benefits of both. When we sequence these different types of amplicons, you can imagine this little cartoon: we have our two imaginary organisms, okay, red and yellow, blue and gray. Because of sequencing error, we're actually going to generate reads that don't perfectly map to either one; you can imagine a read that's one base pair off, or two base pairs off.
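The read-stitching idea from the last couple of slides can be sketched very simply. This toy version finds an exact-match 3' overlap between the forward read and the reverse-complemented reverse read; real mergers like VSEARCH or PEAR allow mismatches in the overlap and score it using the base qualities, so treat this purely as an illustration of the geometry:

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence (ACGT alphabet only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def merge_pair(fwd, rev, min_overlap=10):
    """Stitch a forward read and a reverse read (as it comes off the
    sequencer) into one longer read via an exact 3' overlap.
    Returns the merged sequence, or None if no sufficient overlap."""
    rc = revcomp(rev)  # put the reverse read on the same strand as the forward
    # try the longest possible overlap first, shrink until min_overlap
    for olen in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        if fwd[-olen:] == rc[:olen]:
            return fwd + rc[olen:]
    return None  # reads don't overlap: pair can't be stitched

fwd = "AAAACCCCGGGGTTTTACGTACGT"          # covers the 5' end of the amplicon
rev = revcomp("GGGGTTTTACGTACGTAAAA")     # reverse read covers the 3' end
print(merge_pair(fwd, rev))
```

This also makes the ITS point concrete: if the true amplicon is longer than the two reads can span, no overlap exists, the function returns None, and that pair is "lost" at the stitching step.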
We're going to get reads that are close to the real sequence, but not exact, and that's okay. It becomes problematic if you also have another organism that is itself only two or three base pairs away. Either way, because of sequencing errors, mismatches or bases that are called wrong, you're going to get reads that actually belong to one organism but don't map to it perfectly. It's just a product of the sequencing itself: your reads won't match at 100% identity when you look at them. One way to handle that historically, and I almost didn't put this in because we don't hear as much about OTUs anymore, but it's still good to understand where the field came from, is to simply collapse and cluster those sequences at some percent identity. That gives you operational taxonomic units, OTUs. Historically you would have seen people cluster any sequences at 97% identity or greater; you could use different OTU cutoffs, but 97% was typical. Why 97%? Because someone decided it roughly equaled a species, so an OTU and a species, you can kind of interchange those two words in your head: members of a species would usually share 97% identity or higher. So that's one approach, and OTUs were used in the field for a long time. But you can see the trade-off when we do that: all the sequences here within 97% get called one thing, when we actually know it's two things, so you're losing that resolution, that 3%. Now we have two things at the OTU level when really there are four, but it's beneficial because we don't have the sequencing error problem.
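As a caricature of that OTU-picking step, here is a greedy 97% clustering sketch. The function names are mine, and real de novo clustering (e.g. in VSEARCH) uses proper alignment and abundance-sorted input; this toy version only compares equal-length sequences position by position.

```python
def identity(a, b):
    """Crude per-position identity for two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold=0.97):
    """Greedy de novo OTU picking sketch: take sequences in order,
    assign each to the first existing seed within `threshold` identity,
    otherwise start a new OTU with this sequence as the seed."""
    seeds, otus = [], {}
    for seq in seqs:
        for seed in seeds:
            if len(seq) == len(seed) and identity(seq, seed) >= threshold:
                otus[seed].append(seq)
                break
        else:
            seeds.append(seq)
            otus[seq] = [seq]
    return otus
```

Two sequences 2% apart end up in one OTU; 5% apart, and you get a second OTU, which is the resolution loss described above.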
The other approach, instead of simply clustering, is to try to model the error from the sequencing and correct for it, such that we recover the original, real biological sequence. In a perfect world, ASVs produced by denoising your amplicon sequencing error would differentiate this group from this group. Now, there are different approaches to this denoising. One way to think of it is as a sort of popularity contest: most of your reads are going to support one sequence, and then you can say, well, if a read is one base pair away from that abundant sequence, maybe we're going to fold it in. We also know from Illumina data, the quality scores give this away, that we expect more errors toward the end of the read, so you can model those errors and use that to collapse reads together. The short answer to all this is: should you be using ASVs or OTUs? I would say, across the board, probably ASVs. The big question after that is which flavor of ASVs. Yes, sure, I'll just repeat the question. The question is whether ASVs sometimes get it wrong and split things that aren't really different. The answer is yes, and we still got it wrong with OTUs a lot of the time too. You'll see that different approaches do a better or worse job of collapsing reads down to the original sequence. It's an approximation, and especially as you get to more rare taxa, it becomes harder to say whether something is a real sequence or a sequencing error. Both happen. Yeah. I'm not going to talk about this next slide, I think I'll skip over it, but there are different approaches to OTU picking.
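Going back to the popularity-contest intuition above, here is a toy stand-in for denoising. It assumes a simple made-up rule, absorb a variant into a neighbor within one mismatch that is at least ten times more abundant; the actual DADA2/Deblur/UNOISE error models use quality scores and proper statistics.

```python
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def denoise(reads, max_dist=1, min_fold=10):
    """Toy denoiser: walk sequences from most to least abundant; a rare
    variant is absorbed into an already-accepted sequence that is within
    `max_dist` mismatches and at least `min_fold` times more abundant,
    otherwise it is kept as its own ASV."""
    counts = Counter(reads)
    ordered = sorted(counts, key=counts.get, reverse=True)
    asvs = Counter()
    for seq in ordered:
        for parent in asvs:
            if (len(seq) == len(parent) and hamming(seq, parent) <= max_dist
                    and asvs[parent] >= min_fold * counts[seq]):
                asvs[parent] += counts[seq]
                break
        else:
            asvs[seq] += counts[seq]
    return asvs
```

Note how the rare-taxon ambiguity shows up directly: a genuinely rare organism one mismatch away from an abundant one is indistinguishable from an error under this rule.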
There was de novo clustering, which is what I just described: cluster everything against itself; closed-reference, where you map things to a database of sequences; and open-reference, which is a hybrid of the two. I don't think we need to go into it, but if anyone has questions I'd be happy to answer them. Much more popular now is ASVs. ASVs were also sometimes called sub-OTUs back when this came out, and there were other terms circulating for a while, but I think we've all settled on ASVs now. The big methods out there are probably DADA2 and Deblur, the two we're going to talk about; UNOISE, made by Robert Edgar, still exists out there as well. So in my lab, back in 2018, we did what we tend to do a lot: okay, we have to figure out which one's best, and then some poor sap gets the job, and I look at Robin, no, she's not a poor sap, she's really good, a student or trainee takes it on and says, let's compare the tools, then we'll know which one's best and pick it from then on. In most of those cases there's never a clear best tool; you're just left with, oh, they're all different in different ways, which is hard. What I'm showing you here is a mock community with an expected composition, comparing DADA2 versus Deblur, and then an OTU result from a standard OTU clustering method. What do you see? Some slight differences, but on a fairly simple mock community they all look pretty good; you can definitely spot some biases across them. The other thing we look at a lot is total ASV or OTU counts. With OTUs you'd think you'd generate fewer, but you actually create quite a few spurious OTUs; DADA2 and Deblur generate fewer, Deblur collapsing a bit more aggressively than DADA2.
That all being said, these tools all have different parameters and options, and depending on how you run them you will get different answers. As for tools we've settled on in the lab: we're going to run Deblur, I think, in the tutorial. Deblur tends to be faster, and faster is not always better, but we will be running Deblur. Often, though, we will use DADA2 depending on the situation. What situation? Probably when Deblur just doesn't seem to be doing a great job; sometimes it over-filters the data and you lose a lot of reads, and if we're not satisfied with that we'll go to DADA2 and run it instead. We find that if you have the time and patience, DADA2 is usually the better approach, but DADA2 also usually requires some training on your data ahead of time, so a little bit more expertise is required, I guess. Whereas Deblur is quick and dirty and for most applications gets the job done. I'm not here to say which one's better; both are well supported. If you're moving to something like PacBio long reads, I think DADA2 is the only one that will handle those right now; I don't think Deblur handles long reads. Okay, any questions about denoising? There's one question over here, sure. No, go for it. So the question was: for these different softwares, the trends we're seeing are at the species or genus level; do we see similar trends at higher taxonomic levels? Not as much, for sure. Genus is definitely the standard level, and a lot of that heterogeneity goes away above it. One thing that was really troublesome, and you'll see it if you dig through that paper, is that it's actually really hard to make sense of even a synthetic community: are you really sure there isn't something else in there at a low level?
You know, you sometimes pick up these things where you think, wow, that's really getting sequenced a lot, maybe it's genuinely there. So even if they say there are 20 things in the mock community, there could be a few extras, so having a gold standard is actually kind of difficult. Okay, let's move on. Oh, sorry, question there. So the question, if I have it right, is about rare species: can you resolve the issues that are reported when using ASVs? That's a loaded question. For us, those rare taxa often come down to sequencing depth, and I don't know that there's a single answer. In our lab we're not usually that interested in rare taxa, so we're kind of ruthless with the different types of filtering, and we'll often just remove that long tail, because across mock communities you see that these approaches tend to over-inflate the number of things reported; they're clearly not doing a perfect job of collapsing those sequences. So it depends on what you're doing downstream. If you're looking at something like a PCoA, which we'll see later today, those analyses aren't influenced much by this rare tail. But if you start reporting that really rare taxa are different between groups, then maybe you want to go back and check how many reads you actually have for that rare taxon. I'm not confident making claims from three reads versus ten reads, but at a thousand reads it looks pretty good. So I guess it depends on your threshold a bit. I don't know if I answered that completely. Okay, so let's move on.
So now we're at the exciting point; this is why I think most people come into the field: I have a sequence read, I want to know what it is. We had to do all that other stuff first, filtering, denoising, pairing. Now we have a sequence which we think represents the real biological signal, and we have a count for that sequence too: for each sequence we can say it's there 10 times in this sample, 20 times in that sample. We know how many times we've seen it in each sample. So what do we do with it now? Well, we can do different things. One is you can actually do analysis without any sort of annotation: we don't have to assign taxonomy, we can literally just call those things ASV 1, ASV 2, ASV 3, and then you can do a lot of the analysis we're going to talk about later today on that table without any labels, without saying what anything is. So you'll often see labels like OTU 1, 2, 3, 4, 5, or ASV 1, 2, 3, 4, 5. You'll also sometimes see a really long string of characters that looks like gobbledygook as a label. That's what we call an MD5 sum label, and it's fairly technical, but it just means that if you put a sequence, a DNA sequence or anything, through an MD5 checksum, it deterministically spits out a unique character string, and it will always generate the same one for the same input. Some of the bioinformatic tools will generate these MD5 sums as identifiers. Just think of it as a unique, reproducible, shorter identifier for that DNA string; in your head you could replace it with 1, 2, 3, 4, 5.
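Those MD5 labels are easy to demystify in a couple of lines of Python; feature IDs in tools like QIIME 2 are typically the MD5 digest of the ASV sequence itself, so the same sequence always gets the same label. The function name here is my own.

```python
import hashlib

def md5_label(seq):
    """Label an ASV by the MD5 digest of its DNA sequence: same
    sequence in, same 32-character hex label out, on any machine."""
    return hashlib.md5(seq.encode("ascii")).hexdigest()
```

This is why two labs analyzing the same data independently can still match up their feature tables by ID.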
It just means that if you did an analysis and someone else did an analysis and you called things by different labels, the MD5 sum would actually give you both the same identifier. I only mention it because it's super confusing; people ask, what's this weird label, it looks like a random string of characters. Okay, with that all said, of course we usually do want to assign taxonomy. Why? One, it's easier to communicate: we can talk about things; we can't just say, hey, I got this ASV with sequence ATGCTGACG and so on. Two, taxa imply function: if you're talking about cyanobacteria versus, say, Prochlorococcus versus Vibrio, all those names mean something, there's context behind them. That leads into PICRUSt, if you're interested in that, later. And the other big thing taxonomy does is let us group things quite nicely. Taxonomy has structure, so we can collapse species into genera, we can collapse genera into families, and so on; it's easily collapsible to different levels, which lets us understand how differences behave at different taxonomic resolutions. One thing I don't think I mention later is that for 16S at least, we often don't get down to a species-level assignment. We get to the family level, we get to the genus level; sometimes we get to species, maybe, depending on your data set, 25-40% of the time, but a lot of the time you're not going to get down to that level.
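On the collapsing point above, rolling an ASV table up to a higher rank is mechanically simple. This sketch assumes a made-up dict-of-dicts table layout and a list-of-ranks taxonomy per ASV; it mirrors what `qiime taxa collapse` does conceptually, not its actual API.

```python
from collections import defaultdict

def collapse(table, taxonomy, level):
    """Collapse an ASV count table to a taxonomic rank.
    `table` maps asv -> {sample: count}; `taxonomy` maps asv -> list of
    ranks (domain first); `level` is the 0-based rank index to keep."""
    out = defaultdict(lambda: defaultdict(int))
    for asv, per_sample in table.items():
        label = ";".join(taxonomy[asv][: level + 1])
        for sample, count in per_sample.items():
            out[label][sample] += count
    return {k: dict(v) for k, v in out.items()}
```

Collapse the same table at phylum versus class and you can watch two ASVs merge or stay separate depending on the rank you pick.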
So what do you do? Well, you can fall back to your ASVs: an ASV analysis without taxonomy is basically counting everything at 100% identity, so you can think of an ASV almost like a species-level analysis without the taxonomic labels, and that way you don't lose things in your analysis. It just goes to show that if you say, okay, I'm going to test for differences at the species level, that test will only be based on whatever could actually be assigned a species-level name. Okay, so taxonomy assignment is also a big can of worms, though maybe not as complicated as denoising. Essentially there are different approaches, and the two important choices are your database and your method for assigning your sequences against that database. The big method in town is naive Bayes, used by QIIME 2; it's very popular, and I'd guess something like 95% of people use naive Bayes for the assignment. You could also BLAST your sequences, and there are other tools out there like the RDP classifier and RTAX. But typically what you'll see is a naive Bayes approach. The bigger choice, more so than the method, is probably the database you map against. You could use RDP, and why am I drawing a blank, RDP stands for the Ribosomal Database Project. So RDP, Greengenes, and Silva are probably the big three. In all those cases they're going to give you ranked taxonomy names, but they're not going to agree with each other; it's always fun, taxonomy is just awful. So which one should you use? You really don't see people classify against RDP much anymore; it was popular back in the day. The two that have been competing over the years are Silva versus Greengenes.
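Stepping back to the naive Bayes method mentioned above, here is a bare-bones k-mer version for intuition only. The real RDP and QIIME 2 feature classifiers use 8-mer words, bootstrap confidence estimates, and more careful smoothing; everything here (4-mers, the class name, the Laplace constant) is a toy assumption.

```python
import math
from collections import Counter, defaultdict

def kmers(seq, k=4):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

class NaiveBayesTaxonomy:
    """Sketch of k-mer naive Bayes taxonomy classification: score each
    reference taxon by the summed log-probability of the query's k-mers
    under that taxon's k-mer frequencies, with Laplace smoothing."""

    def fit(self, refs):  # refs: list of (sequence, taxon) pairs
        self.kmer_counts = defaultdict(Counter)
        self.totals = Counter()
        for seq, taxon in refs:
            words = kmers(seq)
            self.kmer_counts[taxon].update(words)
            self.totals[taxon] += len(words)
        return self

    def classify(self, seq):
        best, best_lp = None, -math.inf
        for taxon in self.kmer_counts:
            lp = sum(
                math.log((self.kmer_counts[taxon][w] + 1)
                         / (self.totals[taxon] + 256))  # 256 = 4^4 possible 4-mers
                for w in kmers(seq))
            if lp > best_lp:
                best, best_lp = taxon, lp
        return best
```

The `fit` step is the "training" discussed later: it is tied to whatever reference database you hand it, which is why trained classifiers are distributed per database.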
Now, Greengenes was really popular when it was newly minted back in 2012; it was the default in QIIME for many years, and then it wasn't updated for quite a while, and I would say everyone switched to Silva in the past five or so years as the main database. Using Silva is still a really good, safe choice, and you can do that within QIIME 2. Then Greengenes2 just came out in a preprint recently, so they've done some updates, and in the coming years you'll probably see some people preferring Greengenes2 versus Silva, and people will have heated arguments about which one is better. You do have to decide which to use, but I'd feel safe using Silva and probably safe using Greengenes2; the preference can be up to you. Yes, questions, first there and then behind. That's my next slide, maybe, but briefly: across the board I think Silva is pretty safe no matter what you're doing, so most people use Silva. There is some preference by field: when Greengenes came out in 2012 you'd see more human microbiome people using it, and the environmental people shifted toward Silva because they thought it had better representation. But now, yeah, sorry, that's on my next slide. So, GTDB, I almost put it in here. GTDB, the Genome Taxonomy Database, we'll talk about tomorrow with metagenomics; it's very nice, I love it, and it's very much starting to take over metagenomic taxonomic assignment, which we'll see tomorrow. The biggest restriction with GTDB for 16S is that it's based on genomes, so it doesn't include all those sequences that made it into Silva or Greengenes for which there are no genomes.
So Greengenes and Silva are much more comprehensive compared to, say, GTDB, which is based primarily on genomes. That being said, one big thing, and I haven't dug into the Greengenes2 preprint fully, is that it supposedly plays really nicely with GTDB. There has been this big problem where you would do 16S sequencing and metagenomic sequencing, and not only are the technologies really different and hard to compare, but the taxonomies would never mesh: you'd use a taxonomy based on genomes for one, and a taxonomy based on 16S sequences from Silva for the other, and they just don't give the same things the same labels, which makes comparing them really problematic. The preprint addresses that and says Greengenes2 uses the same taxonomy as GTDB, so you can actually say the thing you're calling taxon A really is taxon A in GTDB. If so, that's a really big advantage going forward. Yeah. Okay, I won't go deep into this idea of trained versus untrained classifiers. As I mentioned back here, the naive Bayes method is technically a machine learning approach to annotation, and it requires training on a particular database: you take a database, you train a classifier, in our case naive Bayes, and then you use that trained classifier. These come pre-made for something like QIIME 2, so you can point directly at them or download them. Where was I, yes, so you train on the database. You can also get into whether you train on specific variable regions versus the whole gene; that became a big thing. Robin, do you remember, do we just train on the whole gene now? Yeah.
Yeah, so to reiterate, you can train on particular variable regions, but then you need a different trained classifier for each one, and the benefits weren't always worth the difficulties associated with that. Yeah, question. Oh my goodness, I'm going to get cut off, okay, sorry. So an untrained approach would be something like BLAST, where essentially it's a direct sequence-to-sequence comparison. Then, specialist databases, which I just want to mention briefly: there are specialist databases out there. We do some oral microbiome work, and we often get reviewers saying you should use the HOMD, the Human Oral Microbiome Database. Why? Because they've curated a list of taxa they think are relevant to the oral microbiome, and they show you can get better species-level assignments. That being said, if your database is overly focused and not comprehensive, you can get false positives: a read looks like some taxon in the database, and because the database doesn't know about all the other taxa in the world, you're overconfident and call it that thing when it's really not. There's a paper showing essentially that you can get these false positives if you restrict your database too much. For most applications, go big, like Silva or Greengenes, although the Human Oral Microbiome Database, which we do use in the lab, tends to be pretty good too. Okay, where are we? Oh yeah. Can I go over by maybe five minutes extra? Yes. That's not bad; we've historically gone way over, so that's fine. Okay, so now we start talking about outputs. You've finally gotten through a lot of the work of assigning taxonomy to your reads, your sequences. You're going to get some sort of table, an OTU or ASV table, whatever you want to call it.
You can think of it just as it's shown here: samples as columns, ASVs as rows. That table can be represented digitally in lots of different formats: it could be a TSV or CSV that you can load in Excel, easy peasy; it could be a QZA, which we'll talk about more in a second with QIIME 2; there's also this file format called BIOM. So many different file formats just to represent a table, it's gross, but the reality is it's a table of OTUs or ASVs by their counts per sample. You'll also have a file with the sequence for every one of those OTUs or ASVs, and that's important: those are your denoised reads. And then you'll have a metadata file with information about every one of those samples. Before you get there, though: more filtering. You filtered based on quality, you got your table, and now there's sometimes this other filtering that happens. I won't go on too much about it, but there are a few topics of interest here. One is this idea of bleed-through ASVs. Sometimes, essentially on MiSeqs, you see sequences that shouldn't really be in that sample but somehow got there, either from the previous sequencing run or from another barcode. You'll see those at really super low levels, so we tend to just remove really rare things. Now, that's problematic if you're really interested in those rare things; this is where I say we're a bit cutthroat. We remove anything below 0.1% of the mean sample depth, so if you're only present at bleed-through levels, you're going to go bye-bye. I'm hoping this won't be a worry in the future; it seems primarily a MiSeq issue, and the NextSeqs don't show it.
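That 0.1%-of-mean-depth rule can be sketched as follows. The function name and the dict-of-dicts table layout are made up; this mirrors the rule just described, not any specific tool's implementation.

```python
def filter_bleed_through(table, frac=0.001):
    """Zero out per-sample counts below `frac` (default 0.1%) of the
    mean sample depth, as a crude guard against bleed-through.
    `table` maps asv -> {sample: count}."""
    samples = {s for counts in table.values() for s in counts}
    depth = {s: sum(table[a].get(s, 0) for a in table) for s in samples}
    cutoff = frac * (sum(depth.values()) / len(depth))
    return {a: {s: c for s, c in counts.items() if c >= cutoff}
            for a, counts in table.items()}
```

Anything seen once or twice in a deeply sequenced run disappears, which is exactly the trade-off flagged above if you care about rare taxa.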
And if we move to more amplicon sequencing on the newer NextSeqs we won't have this issue, but for now it is an issue. Yeah, quick question. Yeah, I think that applies across a lot of what we do: if you're going to take home some dramatic message, especially about rare things, go back in and confirm it, either with further PCRs or possibly qPCR. Sometimes we'll also remove contaminant ASVs. This is where, say, maybe you don't want mitochondrial and chloroplast DNA in your sample. It may be of interest to you, but most of the time it's probably not. Although, I've been thinking about this idea of chloroplast DNA as a measure of how many veggies you eat, which I think would be a super cool experiment, to find out if it correlates, but that's a whole other side project and I have no idea if it would work. Anyway, usually we remove those. Often we'll remove other ASVs depending on different criteria. The biggest one is when a sample just doesn't sequence well: at some point we decide that a sample didn't sequence up to a certain standard and we remove that sample. There's no golden cutoff; usually I get really scared once you get below 1,000 reads. It's usually not an issue, but because sequencing is never perfectly normalized, you're going to get certain samples that don't sequence as well as others, and you have to make a decision. You don't want a sample in there with 200 sequences being compared against a sample with 20,000 sequences; that's just too big a difference. So you make a cutoff and say this sample didn't sequence well enough; you sequence it again, or maybe it's problematic, and you remove it.
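The sample-depth cutoff just described can be sketched like this; the 1,000-read threshold echoes the rule of thumb above, and the table layout and function name are illustrative only.

```python
def drop_low_depth_samples(table, min_depth=1000):
    """Remove whole samples whose total read count is below `min_depth`.
    `table` maps asv -> {sample: count}; samples that sequenced poorly
    are dropped from every ASV's counts."""
    depth = {}
    for counts in table.values():
        for sample, c in counts.items():
            depth[sample] = depth.get(sample, 0) + c
    keep = {s for s, d in depth.items() if d >= min_depth}
    return {asv: {s: c for s, c in counts.items() if s in keep}
            for asv, counts in table.items()}
```

Whether 1,000 is the right number is a judgment call per study; the point is comparing a 200-read sample against a 20,000-read sample is not meaningful.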
And then often we'll do prevalence filtering, which removes certain ASVs based on how many samples they're present in. This leads a bit into statistics, but essentially, if an ASV is only found in, say, 5% of your samples, it's hard to do statistics on it unless you have a larger sample size, and keeping everything means running a lot of statistical tests, which reduces your power later on. So we'll consider removing those. Again, this is good for some people and not for others, and there's no standard here. Yeah, I'm going to leave that one to Henna later on; she's going to talk all about normalization and rarefaction versus other approaches. Okay, good, that's the filtering. Then, and this will be really quick: the other big piece you'll see later is that a lot of our diversity metrics require a phylogenetic tree for your sequences, and there are a couple of different ways to build the phylogenetic tree that goes along with your ASVs. This has changed a bit over time. Who's heard of UniFrac? Right, that's a phylogenetic metric we'll probably talk about later, and that phylogeny is needed: you have to map every one of your ASV sequences onto a tree to use a phylogenetic measure like that. So, different approaches for that. The two big ones are, first, de novo: you take all of your sequences, align them, build a new tree, bam, you've got a tree. Easy. The downside is that the tree may not be super robust, because it's based on a very short stretch of sequence. The other big approach, which I think we'll see more and more of, and oh, this slide says sOTUs, that's the same thing as ASVs, is essentially inserting those fragments into a reference tree.
So the reference tree is shown in black, and you insert your short sequences into that reference tree; the overall tree is conserved, but your reads get placed into it. You'll see that placement approach quite a bit. You can do either within QIIME 2: de novo or a placement approach. And again, depending on your amplicon, maybe there's no reference tree out there and you have to go de novo. De novo will still give you a tree, but I think it's pretty widely accepted that this insertion approach is the better approach overall. Okay. So with that, I'm not going to talk about biases any more, because I think I've covered them a lot; there are a lot of biases in all of this and they'll come up a few more times over the next couple of days. Hopefully I've covered the ones within marker genes. Just remember that anything we do, bioinformatics included, creates different biases, and those are going to change what you see in your output feature tables. Since we're over time, one quick fun fact: the 16S gene, we think of it as universal, sure, great, and in most bacterial genomes it's present in a single copy, as in it's in the genome once. But there are bacteria, and many archaea out there, with multiple copies of the 16S gene in the genome. That creates a bias, unfortunately: you can imagine that if an organism has two copies of that gene in its genome, when you sequence it, it's going to look like there's twice as much of it; three copies, and your chance of sequencing it is three times higher. That's problematic, and in the field it's well recognized and mostly, across the board, just ignored, which is fun. There are tools out there to try to correct for that.
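The arithmetic of such a correction is simple: divide each taxon's counts by its copy number, then renormalize. This is the general idea behind tools like CopyRighter, not their actual implementation, and the copy numbers in the example are made up.

```python
def correct_copy_number(counts, copy_number):
    """Weight each taxon's read count by 1 / its 16S copy number, then
    renormalize to relative abundances. Taxa missing from the copy
    number table are assumed to have a single copy."""
    weighted = {taxon: n / copy_number.get(taxon, 1)
                for taxon, n in counts.items()}
    total = sum(weighted.values())
    return {taxon: w / total for taxon, w in weighted.items()}
```

A taxon with two copies that produced twice as many reads comes out equal to a single-copy taxon, which is exactly the bias being corrected; the hard part, as the next slide notes, is knowing the copy numbers in the first place.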
So PICRUSt will actually do that as one of its steps, there was a tool specifically for it called CopyRighter, and PAPRICA can do it as well. But we don't see most people doing this correction, and I don't know why; you just don't do it. It's really weird: we all know about it, it's a bias, we just ignore it. There was a paper from Laura Parfrey's group, which was nice, that went into this and showed that all of these methods are maybe not doing a great job, and that when you try to correct you may be doing just as much harm as leaving it alone. So it's good to recognize that different organisms have different numbers of copies, but you don't see correcting for it as a standard bioinformatic step in most pipelines. Yeah, question at the back. Yeah, no, copy number is fairly well conserved; that's why something like PICRUSt can actually do an already pretty good job of predicting it. The really high copy numbers tend to be in archaea, which tend not to be the focus of a lot of microbiome studies, especially human microbiome; Haloarchaea are really well known for having something like seven or eight copies, whereas in bacteria you don't see that as much, more like one or two copies. It's sort of well known, but you'll still sequence a new genome and go, oh, surprise, it's got multiple copies. Okay, so with that, that's pretty much it. QIIME 2 is what we're going to move to in the tutorial, and QIIME 2 is a really well known tool. Within the tutorial you're going to produce different types of files, these QZA files and QZV files. QZA files are just a really nice way to package up a whole bunch of files and directories so you can pass them around, and they provide some provenance, which just means they track all the steps that were run before.
If you're interested, a QZA is actually just a zipped file; you can unzip it and see all the stuff inside if you're curious. A QZV is a nice way to take a result and put it into QIIME 2 View, which makes a nice pretty graph on their website, so you'll see these two file types specifically coming out of QIIME 2. And then just a quick message about Microbiome Helper. QIIME 2 wraps a bunch of different tools, which sounds great, and there's a lot of documentation out there to teach yourself. With Microbiome Helper we started putting a lot of material on a wiki many years ago, where for our lab we record, you know, these are the commands you run, with the different options you should be aware of. We find it pretty straightforward, and it's where we host a lot of our other tutorials, but we also have our standard operating procedures, so if you're ever curious about what we're doing in our lab, and I'm not saying our lab is always right, it's not a one-stop shop, you can see the major steps we take when we analyze a typical data set coming out of our lab.
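Since a .qza really is just a zip archive, you can peek inside one with the standard library. The archive built below is a fake stand-in loosely modeled on QIIME 2 artifacts (a data/ directory plus a provenance/ directory under a UUID); every path in it is illustrative, not a real artifact layout guarantee.

```python
import io
import zipfile

def make_fake_qza():
    """Build a toy zip in memory with a made-up artifact-like layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("abc123/data/feature-table.biom", "not a real table")
        zf.writestr("abc123/provenance/action.yaml", "action: demo")
    return buf.getvalue()

def peek_archive(qza_bytes):
    """List the files inside a .qza-style zip archive."""
    with zipfile.ZipFile(io.BytesIO(qza_bytes)) as zf:
        return sorted(zf.namelist())
```

On a real artifact you'd simply do `unzip -l something.qza` on the command line; the provenance directory is where the record of prior steps lives.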