Okay, so the learning objectives for this module: understand and perform marker gene-based microbiome analysis. The analysis will focus mainly on the 16S ribosomal RNA marker gene, and we will use 16S to profile and compare microbial communities, or microbiomes. Then we'll learn a little bit about selecting suitable parameters for marker gene analysis, and at the end there will be some discussion of the advantages and disadvantages of marker gene-based analysis versus metagenomic analysis.

Okay, you've seen this already. The reason a lot of people choose the ribosomal RNA genes as their marker genes is that they are truly universal phylogenetic markers: they're present in all living organisms, and because they play a critical role in protein translation they are functionally conserved, so their rate of mutation is considerably slower than, say, genes involved in virulence, which are much more fast-evolving. The ribosomal RNAs behave like a molecular clock, and as a result they're useful for phylogenetic analysis; the tree of life has been built on this particular molecule. It's also the most common marker gene by far, meaning that if you want to compare your dataset to a reference dataset or to another study, you're more likely to be able to do so if you have 16S sequences rather than other types of markers, which we'll talk about as well.

Ribosomal RNAs have been called a lens into life because they help us place organisms into the universal phylogenetic tree, and we can therefore use this tool to understand the composition of a microbial community. In other words, we can profile the microbial community using this marker gene; profiling a single community this way gives you what's called the alpha diversity of that community. It's also a tool that can be used to relate one microbial community to another, and when you compare communities you do so with beta diversity analysis. Beyond taxonomic profiling, the tool can also be useful for inferring properties of the host or the environment: we know that certain types of organisms are associated with certain types of hosts and certain types of environments, so you don't necessarily have to get functional data to do some functional interpretation — Morgan will cover a bit of that using the tool he developed, PICRUSt, as an example. Moreover, community profiles can be associated with certain host phenotypes or certain environmental features: for example, there are bacterial community profiles associated with obesity and profiles associated with people of normal or lean weight, and there are also profiles associated with different diseases. The more you characterize these communities, the better you can infer functions using just the marker genes as a proxy in your analysis.

So, briefly, here's a list of other marker genes that have also been used. For eukaryotic organisms the most common marker is the 18S ribosomal RNA gene, the equivalent of the 16S, and the most common database — the most popular one, anyway — is provided by Silva, at the link here. Another common marker is the ITS, the internal transcribed spacer, which is the spacer between the rRNA genes; as I mentioned earlier, it mutates faster and therefore can potentially have better resolution for closely related
organisms, and it's commonly used for fungal community profiling. I think Greg or someone was asking about an ITS database — there is one that has been incorporated into mothur that you can use for your analysis; it's called the UNITE ITS database. For bacterial organisms, besides 16S, the chaperonin protein cpn60 has also been used, as have ITS and recA genes. The key is to identify a gene that's universally present in all the organisms you're interested in. The other key consideration is that there should be a good reference database, because the first step in your interpretation is usually annotating your community using what's available in the reference. If you don't have a reference database to compare to, you can't really do taxonomic analysis; you can only generate OTUs and then potentially compare OTUs, and we'll talk a little about that later. The popularity of these secondary markers is nonetheless much lower than that of the 16S. For viruses it's really quite tricky because there's no truly universal gene in viruses, but gp23 has been used for T4-like bacteriophages, and RdRp has been used for RNA viruses. Viral marker genes are typically family-specific rather than pan-viral.

A few considerations. One is that the marker really should have sufficient resolution to differentiate the communities, or the groups of organisms, that you want to study. For example, 16S can differentiate most genera and, in some cases, species, but it doesn't really have strain-level resolution. hsp65, heat shock protein 65, is a faster-evolving protein found in Mycobacterium, so if you want to compare different strains or species of Mycobacterium you can use hsp65. So, depending on the resolution you need, you might have to pick the marker gene with the right resolution. As mentioned, a reference database is needed for taxonomic assignment and for binning, so the availability of a good reference database for the samples you want to analyze needs to be taken into consideration when you design your experiment. And again, if you want to compare across different studies, you should also check what markers the studies you want to compare to used, what experimental protocols they used, and what
analysis pipelines they used, and then try to match them as closely as possible in order to be able to compare results across multiple studies. So the upfront design of your study is really quite important: rather than just doing the sequencing and then trying to figure out how to best analyze your results afterwards, it's much better to design your experiment around the outcome you want to achieve before sequencing.

Okay, so I'm going to go over the steps involved in going from sample to sequences. The reason, again, is to highlight some caveats in experimental design; a good design will facilitate the downstream analysis. DNA extraction: there are many kits available from different vendors, but it has been clearly demonstrated that there is kit-specific bias — different kits can preferentially select for different groups of organisms — and the HMP and other major projects have tried to standardize their extraction protocols. Here's an example from the Earth Microbiome Project giving their recommended protocol. Again, though, if your project is, say, trying to isolate DNA from a tricky niche with very little DNA, then you might have to tweak your extraction protocol; you can't always just match someone else's protocol verbatim, and sometimes that's simply not possible. DNA extraction can also be done after you fractionate the samples — separating the sample by size fractionation to isolate eukaryotic cells from bacterial cells or from viruses. We have carried out that type of fractionation for our watershed metagenomics project, but one of the lessons we learned — and Mike, who has been doing the bulk of the analysis, can definitely provide more information — is that when you do fractionation, the results will definitely not be the same as if you take the whole community and directly amplify specific genes or specific domains of organisms. So again, the experimental design can affect your downstream analysis and may potentially bias your interpretation.

The other issue to consider is that you're dealing with fairly minute amounts of DNA, and moreover you're dealing with mixed populations, so sometimes you don't really know what to expect, and contamination can easily creep in without you knowing it. It's usually recommended that you include an extraction negative control — using a clean water sample, or an alternative suitable negative control — to make sure that whatever you find in the negative control is not counted as part of your sample. This is a good way to detect reagent contamination, and potentially water contamination as well. Bioinformatically, you can then subtract out taxa or OTUs found in the negative controls from your experimental samples before carrying out your analysis, and this is especially critical if your amount of starting material is really low, since that's when contamination is most likely to show up.
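To make that subtraction step concrete, here is a minimal sketch, assuming the OTU counts have already been summarized per sample; the sample names, counts, and threshold are made up for illustration, and this is not how any particular pipeline implements it:

```python
# Minimal sketch of negative-control subtraction. Assumes OTU counts are
# already summarized per sample as {otu_id: read_count} dictionaries; the
# names, counts, and threshold below are illustrative only.

def remove_contaminant_otus(sample_counts, control_counts, min_control_reads=1):
    """Drop any OTU that shows up in the extraction negative control."""
    contaminants = {otu for otu, n in control_counts.items() if n >= min_control_reads}
    return {otu: n for otu, n in sample_counts.items() if otu not in contaminants}

serum_sample = {"OTU_1": 532, "OTU_2": 7, "OTU_3": 41}
extraction_blank = {"OTU_2": 5}   # likely reagent or water contamination

print(remove_contaminant_otus(serum_sample, extraction_blank))
# {'OTU_1': 532, 'OTU_3': 41}
```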
So for one study we looked at human spinal fluid samples, and we also looked at human serum samples, and the amount of microbial or viral DNA in those samples is extremely low; without a negative control you would actually be amplifying DNA from the reagents, from the water, and so on. What we've done is usually just to subtract out the taxa or OTUs found in common between the negative controls and the samples. Sometimes that's not necessarily the best way to do it, but it's certainly the safest. I don't know if there are other comments on this, from anyone really — it's not a straightforward issue, especially if the sample has very low DNA. I've never seen a blank sample put on a run give zero sequences; there's always going to be something, and if nothing else, carry-over contamination from the sequencer itself will show up. So, in any case, negative controls are definitely recommended. That was for marker genes, but host and environmental contamination can also creep in when you do metagenomic shotgun sequencing, and as a result, metagenomic analysis usually includes a step where you subtract out so-called host or environmental contaminants before carrying out your analysis. I won't go into that in more detail here because it's less of an issue for 16S.

Marker gene analysis typically starts with target amplification using PCR: your primers anneal to a specific region of the genome and create sequencing templates that are then sequenced. In this case I'm just showing the V4 region of 16S, and also showing that when you do the amplification you can typically attach the sequencing adapters and the barcode directly to your PCR primer and generate your sequencing substrate in a single PCR reaction rather than in multiple steps — I'll talk a bit more about that later. PCR primers are designed to amplify specific regions of a gene or a genome. In a dirty sample there may be inhibitors that interfere with your PCR; in one of our studies on soil samples, for example, we had to go through a lot of sample cleanup to minimize PCR inhibitors. In complex samples you can also sometimes get non-specific amplification, especially if the bacterial load in the sample is low — the primers can anneal to other, non-16S templates and amplify those — so you sometimes want to check your PCR product on a gel before sequencing. Size selection may be necessary to clean up some of the PCR products. We have had some messy samples, and we found that running the sample on a gel and excising the band you want to sequence, the old-fashioned way, really helped clean up the samples. There are actually platforms you can buy now to automate the gel size-extraction process — there's a local company called Coastal Genomics, and their platform is called Ranger,
I think, and it can do the gel size extraction for you on a robot. One experience we've had: if you have a sample with a lot of human DNA, or a lot of host DNA, but very little microbial DNA — in other words very clean from the host's perspective — then you sometimes get non-specific amplification that isn't 16S.

Okay, this came up a bit in the introduction, but here is the graph I mentioned, which shows the mean frequency of the most common residue at each position. Actually, let me take that back: the V regions, the hypervariable regions, are down here, and the pink regions are the PCR primer regions. You can see that in the V regions the dominant base typically occurs at a low frequency, whereas in the conserved regions the dominant base occurs at a much higher frequency. The conserved regions are where you want to design your primers, and they should flank the hypervariable regions you want to sequence. You can also see that, for example, V1, V3, and V6 are more variable than V4 or V5 — that's what I meant earlier when I said that the different hypervariable regions can have different phylogenetic resolution and can therefore give you different taxonomic results.

I'm sure you're all aware that the MiSeq, or any of the next-gen sequencing platforms, can produce a lot of reads, and it's usually overkill to put just one sample on a single run. So you typically multiplex multiple samples — sometimes thousands of samples — into a single run. But in order to do that, you have to be able to demultiplex later on, that is, to disambiguate reads from one sample versus reads from another. This is achieved by using unique barcodes that are incorporated into the amplicons you sequence: as shown in this graph, a unique barcode is put into your PCR primer and becomes part of the sequencing substrate, and bioinformatic tools are then used to separate the reads based on the barcode associated with each read. Effectively, each sample gets one barcode.

Okay, the other caveat is that the way these next-gen sequencing machines are designed, they're highly parallel, but that also means that if you put too many sequences of the same type into a single run, the machine will have trouble reading the signals. An analogy is that if you take a picture of a really bright region of the sun, you get a washed-out, white picture. It's the same in sequencing: the sequencer goes through cycles, and if in a given cycle all the spots contain an A, then all the spots light up during that cycle and you essentially get a bright field on the sensor — the same washed-out effect as photographing the sun. Is it sort of clear why, if you put too many sequences of the same type on a sequencing run, the results won't be very good? You're not shaking your head... okay, so maybe that wasn't clear. And yes, this is for the MiSeq.
So, are people familiar with the MiSeq platform? Okay. On the MiSeq you basically have a chip, and on the chip there are spots, each with a template you're trying to sequence. In each cycle a base is incorporated into the DNA chain being synthesized, and as the incorporation occurs a light signal is given off. So if you have a homogeneous sample, then in one cycle all the spots may be saturated, and in the next cycle none of the spots light up, and the machine essentially gets confused — it's like taking a picture of the really bright sun or of really dark spots; you have low contrast. That's the analogy for why you should not put sequences of the same type all into a single run. Basically, when you have a homogeneous library — say you're only sequencing 16S — you need to spike in more non-16S template, and PhiX is usually used as the control library on the MiSeq platform. I can't remember the recommended ratio, but if you're doing metagenomic sequencing I think you only have to spike in about 1% PhiX, versus about 5% for 16S. So you are losing some spots when you're doing 16S. The alternative is to pool different marker genes into the same run, which also diversifies the pool of sequences you're sampling; it's good to have a diverse sequence set when you're doing a single MiSeq run.

And yes, that's a good point — it was for calibration. In the camera analogy — well, it actually is a camera in the system — the first few cycles determine the capture threshold, like setting the aperture for how bright the sample is going to be; at least it used to work that way. Because the 16S primer region is very conserved, it becomes a really bad region to use for determining that capture threshold. I think that problem has been partially solved bioinformatically, so it's less of an issue now, but it's still recommended that you diversify the samples you put in, and the sequencing facility will occasionally ask you to do so. Of course, your sequencing library prep has to be compatible between the 16S and whatever additional sequence you want to put on the same run, so that's again something to plan ahead for when designing your experiment.

When you say samples, do you mean — are they all 16S, or...? I would say most people would not run a single sample on a HiSeq lane, because that's usually overkill. The MiSeq actually has only a single lane per run. Most of the time people mix multiple samples into a single HiSeq lane rather than running one lane per sample; for metagenomics, on the other hand, you could run one lane per sample if you need to, but my recommendation would still be to mix your samples, because then you're less likely to have lane-specific bias or other kinds of bias. It depends on the coverage you want — I don't know if you have an SOP that recommends a certain number — but I think the studies so far suggest that a million reads is probably the upper limit you'd want for 16S, because beyond that the really rare organisms in your community are still not going to be covered, given
the dynamic range between abundant and less abundant organisms, which can easily be thousands, tens of thousands, or millions of fold — you would just be sequencing the abundant community over and over again. So for cost-effectiveness people usually aim for a few tens of thousands to a hundred thousand 16S reads per sample. And there's the famous quote from Rob Knight that it only takes a few hundred reads to separate your elbow from your ass: if your samples are very distinct, you don't need a lot of reads to differentiate or classify them, but if your samples are likely to be more similar to each other, you might want to increase your coverage.

You mean how many reads you need for shotgun sequencing? That's a wide range, depending on the complexity of your community. As I mentioned in the introductory lecture, for the acid mine drainage community, where there are only a handful of organisms, they generated about 100,000 Sanger sequences to be able to reconstitute the community; but if you have a really diverse community, like a soil community, you can generate 10 million reads and still not be able to assemble the contigs. So it's a huge range, depending on the diversity of your community.

Okay, a few quick comments about one-step versus two-step amplification. I already mentioned that if you combine the PCR primer for your marker gene, your barcode, and the Illumina sequencing adapter into a single PCR primer construct, you can generate the sequencing substrate in one single PCR amplification step. In other cases people prefer to do two-step amplification; there, the barcode and adapters are separate from the amplicon primer and are annealed to the amplicons later. The differences are these. The one-step approach requires long primers, and the barcode can in some cases interfere with the amplification primers; Rob Knight's group has published a set of barcodes that they tested informatically — though not experimentally — to minimize interference with 16S sequences, and they later used those in their experiments to validate that they give less bias, but overall, because of the longer primers, interference is more likely. The two-step approach means you can first test your target primers independently, making sure they work well with your target genes or target communities, before you add the barcodes and the adapters. The one-step approach is also not really suited to degenerate primers, because you would have to design one long primer for each degenerate variant, which is difficult to do; with the two-step approach you can use degenerate primers, or even random primers — random primers would be the metagenomics case — and then anneal the barcodes and adapters after the PCR amplification. The one-step approach is suitable when you're working with one amplicon type from many samples:
it's simple to generate a set of primers that already have the barcodes embedded in them if you have just one amplicon type. But if you have many different amplicon types and you want to reuse your barcodes and adapters, then you have to design the PCR primers separately, and for each of your targets you can anneal the same barcodes and the same adapters; and bioinformatically, even if two different marker genes carry the same barcode, you can still separate them afterwards if necessary, because they have different sequences. The one-step approach is, of course, a rapid protocol with little loss of biological material — it's much more efficient — so if you have few samples, or little material to begin with, the one-step amplification can save precious sample. The two-step amplification is a longer protocol, and in the process you can lose some biomaterial. As I mentioned, the barcoded primers for 16S are available from Rob Knight's group, and you can order them from any of the primer companies. Okay, and Morgan is advertising his service — you can talk to him if you have studies that need someone to generate the sequences for you. The two-step approach is more flexible because you can just buy commercial barcodes from Illumina, Bioo, or New England Biolabs — they all sell versions of barcodes compatible with the MiSeq.

Okay, so that's the wet-lab, experimental component. I'm going to go more into the analysis part now, talking specifically about QIIME and comparing it to mothur. These are the two dominant marker gene analysis platforms, and as we talked about earlier, they are somewhat in competition with each other; publications from the different camps typically highlight the weaknesses of the other software, so it's actually instructive to compare the publications from the two groups — you can learn a lot about the weaknesses of each platform. At a very high level, here's the difference between QIIME and mothur. QIIME is a Python interface that glues together many different programs, so there are a lot of dependencies: you have to install a lot of different programs when you install QIIME. It has done a fairly good job of streamlining the installation, but some QIIME scripts require specific versions of a program, and if a script doesn't get the version it's expecting it will throw an error. So the upfront work to set up QIIME before you can use it is definitely longer than for mothur. QIIME does provide a virtual machine that you can just download and launch on your computer, but the virtual machine is severely limited by the computer you run it on — say your desktop or laptop — and some really large datasets potentially need a powerful server-grade machine with a lot of RAM, a lot of disk space, and a lot of CPUs for the analysis. So using a virtual machine to do metagenomics or 16S analysis usually scales poorly; it's good for learning, but not so good for actually processing your own samples. mothur, on the other hand, is developed essentially by two people, Pat Schloss and
his programmer, and they've been doing that for almost ten years now. It's a single program that you download, with minimal external dependencies — around five additional programs you have to download separately, I think — and it tries to reimplement a lot of the popular algorithms directly inside mothur. So you would still be doing analyses similar to QIIME, but instead of calling an external program you'd be using the mothur implementation of a specific algorithm. Sometimes the reimplementation is actually more efficient than the original, sometimes it's not — it's a bit hit and miss. As I alluded to, it's much easier to install — really just a single download and you can start using it. However, it's designed to work on a single server rather than a cluster, so the server usually has to be fairly powerful. In the early days of running mothur — back in 2008 or 2009, when I was using it — we had machines with 512 gigabytes of memory; most laptops have only a few gigabytes, just for comparison. And we routinely crashed a machine with 512 gigabytes of memory running mothur, because it was really quite a resource hog. They have since developed much better algorithms that minimize the memory usage and make things more efficient, but it's still usually recommended, if you have a large dataset — say hundreds of samples, each with a few hundred thousand sequences — that you run it on a machine with a few hundred gigabytes of memory to have a successful run. QIIME, on the other hand, scales a little better, and its algorithms typically handle memory better, so it's typically more scalable: if you have a large dataset
to analyze, QIIME seems to work better than mothur. Of course, because QIIME is a wrapper around a diverse set of software, it has a much steeper learning curve but a much more flexible workflow — you can certainly modify the scripts that come with QIIME to do custom analyses. mothur, on the other hand, works best if you just stay within mothur and run the commands it provides, rather than using external programs and bringing the data back into mothur — although I'm sure people do that too. One nice thing about mothur is that it keeps track of the output files generated in each previous step, so if you stay within mothur the workflow is much easier to keep track of; if you have to bring your data in and out of mothur, that automatic tracking of output files doesn't work and you have to track all the output files manually — and these analyses typically generate hundreds of output files, so it can become a bit of a nightmare keeping track of the output.

There's a recent publication showing that if you use the default workflows for quality control, trimming, and so on, mothur and QIIME actually give quite different results, and the trimming results can impact your downstream analysis and interpretation. I don't really know what the good answer is there; again, maybe do it in reference to the dataset you want to compare to, and keep it as consistent as possible across your datasets. And I don't think that would be an issue for publication — both are very well accepted in the community, so one or the other would be fine for publication purposes. You're not likely to get rejected unless you did something wrong in your analysis. I haven't seen Rob Knight in a bad mood — have you? He's a fellow from New Zealand, so I don't think I've ever seen him upset.

Okay, so this is the overall bioinformatics workflow, and we'll be going over these steps, highlighting each of them for you. I've numbered each of the major steps and also put the numbers at the top of the slides, so you can refer back and forth between which slide refers to which step. You start, of course, with your sequence data in FASTQ format, and you also need the metadata about your experiment, which will be used downstream when you interpret your results; for the upstream processing you're mainly just using the sequence data. There's a pre-processing step that removes the primers, demultiplexes, does quality filtering, and, if necessary, decontaminates your samples. After you clean up your samples — giving what you'd usually call clean reads — they go into a clustering algorithm for OTU picking, which tries to reduce the number of raw sequences you have to analyze: you're removing redundancy in your dataset and also generating the OTUs in the process. That's followed by two major branches of analysis: one is taxonomic analysis, or phylotyping, and the other is phylogenetic analysis.
The phylotyping branch works by taking your sequences and doing taxonomic assignment based on a similarity match to a reference database. For the novel sequences that don't get a taxonomic assignment, you can keep them as generic labels — OTU 1, OTU 2, and so on. From there you build what's called an OTU table, which I'll talk a bit more about; that's effectively the starting material for your downstream analysis. To do the phylogenetic analysis, you first have to align your sequences in order to build a phylogenetic tree: there's a sequence alignment step that takes your sequences and aligns them to a template so that all your sequences are in the same alignment space, and the phylogenetic distances can then be established from the alignment, which gives you a phylogenetic tree. Collectively, the OTU table and the phylogenetic tree form your processed data, and these are typically in standard formats that can then be used for downstream statistical analysis, network analysis, visualization, and so on.

Okay, I'll quickly run through each of these steps. The first step after receiving your samples is usually sample demultiplexing: the reads need to be linked back to the samples they came from, using the unique barcodes you introduced during the sequencing phase. Demultiplexing is essentially the bioinformatic step that takes the reads and puts the ones with the same barcode into the same bin, and in that process it also removes the primer sequences. On the Illumina platform the barcode read is actually separate from your sequence read, but on the 454 platform the barcode is usually part of your sequence read, so if you're running 454 the demultiplexing workflow has an extra step to remove the barcodes from your sequences. QIIME has a suite of scripts to help you prepare your sequence files for analysis — I provide a link there that you can refer to.

The quality filtering step is actually very important: numerous studies have shown that the quality filtering and pre-processing can affect your downstream analysis, and this is one of the points of contention between the mothur camp and the QIIME camp — what's the best way of filtering your data. I think the mothur camp favors more stringent, heavier filtering, whereas QIIME, as you can see here, has a much more relaxed view of filtering; but again, it really can affect your analysis results. I don't know if anyone has comments on what the community consensus is right now, but my personal view is that there isn't a real consensus on how best to quality filter. For publication purposes, if you follow the QIIME recommendations you most likely won't get a rejection, and if you follow the mothur recommended filtering steps then, again, you shelter yourself from criticism. So QIIME filters on four different parameters.
It looks at the maximum number of consecutive low-quality bases. These low-quality bases usually occur at the end of a sequencing read, so effectively the three-prime end of the read gets trimmed off if the quality drops too much towards the end. You can also define what constitutes low quality: this is where QIIME uses a very relaxed quality score of 3, whereas programs like mothur usually use a quality score of 20 — these are Phred scores, for people familiar with sequencing, and that's a huge difference. Then there's the minimum length of consecutive high-quality bases — in other words, how much of the sequence, as a percentage, has to be kept after quality trimming. Sometimes the trimming cuts the sequence too short, and this threshold determines when to discard it; the default is 75%, so if your read length is 100, then after trimming your read must be at least 75 base pairs. Again, that's the QIIME recommendation. The last parameter is the maximum number of ambiguous bases, and both mothur and QIIME seem to agree that you shouldn't have any ambiguous bases in your reads.

Yes, this is based on the Phred score, the quality score. The Phred score expresses the probability of the base call being incorrect: the higher the Phred score, the lower the probability that the base is wrong. A Phred score of 20 means a one-in-a-hundred chance that the base is incorrect; a score of 30 means one in a thousand. So it's essentially the inverse of the error probability, on a log scale. There are other quality-filtering tools available — some are built into mothur and QIIME and you can use those — and each can have its own flavor of trimming and filtering. Again, I don't think there's a really good consensus on the best way, but there are numerous accepted quality-trimming SOPs out there that you can follow.

That depends on the sequencing. The majority of your reads will not be trimmed length-wise at all: if you're using Illumina and you start with 250-base reads, the majority of the time your reads keep their full length, but some reads will be trimmed because of low quality at the end. The minimum read length matters too: say you have a dataset where the majority of reads are 250 bases and a minority are really short, say only 50 bases — if you analyze your data using an alignment, then only the window covered by those 50 bases is meaningful for the analysis, and the rest of the sequence is not being used for your phylogenetic analysis. That's why the minimum length of consecutive high-quality bases is usually set to quite a high percentage rather than allowing something much lower. I'll skip this slide, but it just shows that QIIME has a defined workflow for which parameters are applied first: of the four parameters I showed, it lays out which ones are used and what the algorithm is for the trimming and quality control.
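Just to make those four parameters concrete, here is a minimal sketch of this style of read trimming — it is not the actual QIIME or mothur implementation; the function names are invented and the defaults simply mirror the values mentioned above (Phred 3, a 75% minimum length fraction, zero ambiguous bases):

```python
# Minimal sketch of quality trimming driven by the four parameters discussed
# above. Not the actual QIIME or mothur code; names and defaults are illustrative.

def phred_error_probability(q):
    """Phred score Q -> probability the base call is wrong (Q20 = 1/100)."""
    return 10 ** (-q / 10)

def quality_trim(seq, quals, min_qual=3, max_bad_run=3,
                 min_length_fraction=0.75, max_ambiguous=0):
    """Trim the 3' end at the first long run of low-quality bases, then apply
    the minimum-length and ambiguous-base filters. Returns None to discard."""
    bad_run, cut = 0, len(seq)
    for i, q in enumerate(quals):
        bad_run = bad_run + 1 if q < min_qual else 0
        if bad_run > max_bad_run:             # too many consecutive low-quality bases
            cut = i - max_bad_run             # cut at the start of the bad run
            break
    trimmed = seq[:cut]
    if len(trimmed) < min_length_fraction * len(seq):
        return None                           # trimmed read too short, discard
    if trimmed.count("N") > max_ambiguous:
        return None                           # ambiguous bases not allowed
    return trimmed
```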
I'll skip the decontamination step because it's less relevant to 16S analysis. Okay, so OTU picking. As mentioned in the introduction, OTUs are really formed arbitrarily, based on sequence identity and the cutoff you use. One caveat to keep in mind is that the 97% sequence similarity that's established as the threshold for species-level differentiation was actually established over the entire 16S gene, and as we've seen, the hypervariable regions are on average more variable than the 16S gene as a whole. So keep in mind that the traditional definition of a species, when you're only looking at a hypervariable region, is probably not 97% but something a bit lower; still, most analyses use 97% as the starting point. There are three different clustering approaches — de novo clustering, closed-reference, and open-reference — and there are links here to more detailed recommendations on how to pick OTUs. I'm going to go through each of these clustering approaches; I'm also mindful of the time, so let's pick up with OTU picking.

As mentioned, there are three main ways of picking OTUs, that is, of doing the clustering: de novo clustering, closed-reference, and open-reference. Clustering is arguably one of the most important steps in your marker gene analysis, so I'll try to summarize the pros and cons of each approach, starting with de novo clustering. This is just grouping sequences based on sequence identity alone: there is no external reference you're comparing to; you simply compare the sequences within your dataset and group them by identity. The naive way of doing this is a pairwise comparison of all your sequences, and then, based on the distances between sequences, you cluster, starting with the ones most similar to each other and gradually expanding out — that's hierarchical clustering. This process requires a lot of disk space and memory, and it's time-consuming. Its advantage is that if you have no reference database available, or if your community has a lot of novel or poorly characterized species, this approach does not rely on a reference database for the clustering. Based on the mothur camp's analysis, I should say, average-linkage clustering — in other words, grouping based on the average distance between the members of clusters — is the most robust to changes in the input data and to changes in the algorithm parameters, and they argue it generates OTUs that most closely represent the actual distances between sequences. That kind of makes sense, because its starting point is just pairwise sequence comparison, and grouping based on that distance.
But because doing pairwise comparisons is time-consuming, the mothur group also recommends that you first group sequences into broad taxonomic groups — for example at the class or family level — and then cluster within each class or each family. This reduces the number of pairwise comparisons you need: in other words, you only compare within a family. To put it plainly, the pairwise distance matrix is not very scalable: if you have a thousand sequences, it requires on the order of a thousand times a thousand comparisons, which is a million comparisons; if you have a million sequences, that's on the order of 10^12 comparisons. So it's not scalable. There are greedy algorithms developed to handle this situation, and a greedy algorithm simply means first come, first served: take the first encountered sequence — say it's this one in the middle — and group everything that's close to it distance-wise into the same cluster, rather than considering all sequences in a pairwise fashion before grouping. Effectively, only a subset of sequences, rather than all sequences, are considered, and that subset is typically called the centroids, or seeds. As you can imagine, picking the right centroids is very important, and it's been clearly shown that the input order of the sequences affects centroid picking, which in turn affects the clustering you get. So if you permute the sequence order, you can actually get different clusters — and that's not a desirable trait; you obviously want your OTUs to be as stable as possible.
One way to address the issue is to pre-sort the sequences based on some trait. A common one is to pre-sort by sequence length: typically the longer sequences are of higher quality and carry more information than the shorter ones. But a better way is probably to sort by the abundance of the sequence: if a sequence is present many, many times, it's likely to be a correct sequence, and it's likely to come from a dominant organism. Using those as the starting points stabilizes the clustering — you favor the abundant organisms in your clustering. And I'm not trying to say don't use this: the greedy algorithms have been shown to perform quite well, almost as well as the naive approach. Maybe a better way to answer the question is that the de novo clustering approach is not biased by a reference: if your reference database is not very representative of your community, you may not want to cluster based on that reference, because it might bias the OTUs that form. Does anyone want to chime in on this? I can point you to a paper where Pat Schloss, the developer of mothur, demonstrates why he favors de novo clustering and why he thinks that approach most accurately represents the structure of the community compared to the other approaches. It's more time-consuming, but it's the least biased approach. And by pre-sorting your input you reduce the issue with the stability of your clustering, and sorting by abundance is also a biologically reasonable thing to do. Some tools pre-sort for you automatically; in QIIME, UCLUST, I think, does not pre-sort, so you have to sort the sequences yourself first. So do pay attention to whether the clustering algorithm pre-sorts for you somehow, or sort the sequences yourself, to minimize fluctuation in your cluster membership.
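To make the greedy, abundance-sorted idea concrete, here is a toy sketch — this is not UCLUST's or mothur's implementation, and the identity() function is a crude stand-in for a real pairwise alignment identity:

```python
# Toy sketch of abundance-sorted greedy (centroid-based) OTU clustering.
# Not UCLUST or mothur; identity() is a crude stand-in for alignment identity.

from collections import Counter

def identity(a, b):
    """Fraction of matching positions over the shorter sequence (toy metric)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a[:n], b[:n])) / n

def greedy_otu_clustering(reads, threshold=0.97):
    """Dereplicate, sort unique sequences by abundance, then assign each
    sequence to the first centroid it matches at >= threshold identity."""
    abundance = Counter(reads)                     # dereplicate
    centroids, clusters = [], {}
    for seq, _count in abundance.most_common():    # most abundant first
        for c in centroids:
            if identity(seq, c) >= threshold:
                clusters[c].append(seq)
                break
        else:                                      # no centroid is close enough
            centroids.append(seq)
            clusters[seq] = [seq]
    return clusters

reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC"]
print(greedy_otu_clustering(reads, threshold=0.9))
```

Note that permuting the input order of the reads changes nothing here, precisely because the unique sequences are sorted by abundance before any centroid is chosen — that is the stabilizing effect being described above.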
Okay, so closed-reference clustering. It's the opposite of de novo clustering: you match the sequences in your dataset to an existing database of reference sequences, and the unmatched sequences are simply discarded. Typically you have a similarity threshold for the matching, so sequences that are considered novel are discarded. It's fast and it can be parallelized across multiple machines, so it's very scalable. It's suitable if you have a comprehensive reference database — in other words, if you don't really care about the novel sequences in your dataset, and in some situations that may be the case. The other advantage is that it allows you to do taxonomic comparisons across different datasets and different markers. Imagine you have generated a dataset of, let's say, the V3 region of 16S, and someone else has the V6 region of 16S. If you both do closed-reference clustering, then both sets of data are matched to the reference, and you can then compare the two datasets based on the taxonomic assignment of the best match in the reference set. The downside, of course, is that novel organisms are missed — discarded — so it's definitely not recommended if you're dealing with environmental samples, which are typically much less well characterized than, say, the human microbiome.

Okay, so open-reference clustering theoretically has the best of both worlds, although what's been shown is that with this approach the reference database you use can still bias your clustering. You essentially first do closed-reference clustering, and then, instead of discarding the unmatched sequences, you do de novo clustering on just the unmatched sequences, which cuts down the amount of computational resources needed for the de novo step. It's suitable if a mixture of novel and known sequences is to be expected in your dataset. This is the QIIME-recommended approach, and in one of the studies we point to in the tutorial, they show, based on their analysis, that the newer open-reference-based approach worked the best with their simulated and mock community datasets.

The key take-home message for marker gene analysis, I think, is actually this slide: once you do the clustering into OTUs, you have to pick a representative sequence for each OTU, and this means that all your downstream analysis effectively treats all the members of a given OTU as one single organism — they're all represented by that representative sequence. So if your OTU picking is poorly done, your representative sequence will not be very representative of the underlying members of that OTU. There are several ways to pick representative sequences. The most common is probably based on abundance, assuming that the most common member of the OTU can be used as the representative for the entire OTU. Some favor the centroid approach — in other words, the sequence that sits at the center of the cluster, roughly equidistant from all the other members, is taken as the most representative. And some use the length of the sequence, again arguing that longer reads have more information, so why not use the longest one as the representative?
If you do closed-reference OTU picking, then typically you take the existing reference sequence as the representative sequence of the OTU, the argument being that the reference sequence is the one that has already been characterized the most. Not recommended is picking randomly — that usually doesn't end well, but some people do it — and also not recommended is picking the first sequence in the OTU's list of members; again, that's fairly arbitrary, but it's been done before as well.

Okay, a quick word on chimeras. Part of the cleanup is to remove chimeric sequences, and a common way to detect chimeric sequences is basically by looking at alignments: if one end of your query sequence aligns best to one reference template, and the other end aligns best to another, unrelated reference organism, that's a sign that your sequence is potentially chimeric. Typically you also want to take the abundance of that particular sequence into account: if it's a really abundant sequence, it may be a novel sequence rather than a chimeric read. So the signature of a chimeric sequence is that it's low abundance and that it matches two different, unrelated organisms.

Okay, since OTUs don't have names, and we typically like to refer to things by name, we usually assign a taxonomic name to each OTU. For closed-reference clustering you usually just transfer the taxonomic name of the reference sequence as the name for that particular OTU; for the open-reference and de novo clustering approaches, it's done by transferring the annotations from the reference data to your OTUs, and this is done through a similarity search process. Because there are different ways to do the matching, it's important to report in your publication how the matching was done and which taxonomic database you used. Here's a graphic to show what I just said: imagine you have different OTUs without names; you specify the matching algorithm — basically a similarity search algorithm — to match the OTUs to entries found in the database. You can see that the process is algorithm-dependent but also database-dependent — that's what I meant about reporting both the algorithm and the database when you do taxonomic assignment — and at the end of the assignment your OTUs are given taxonomic names. Well, in mothur, yes, you can try different matching algorithms and then consolidate the results, but against the same database; I don't know if you can do it against multiple databases and consolidate. Does anyone know a tool that can essentially take the consensus of the matching algorithms and the consensus of the databases? You'd still have to take the results and potentially consolidate them yourself — you could parse the results, list the hits, and take the most popular annotation as the right one, a sort of cumulative annotation for your entry. That would maybe be a good tool to develop.
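If you were to hack that consolidation together yourself, a simple majority vote over the assignments is the obvious starting point. The sketch below is purely hypothetical — it is not an existing tool — and it glosses over the rank and naming inconsistencies between databases discussed next:

```python
# Hypothetical sketch of consolidating taxonomic assignments from several
# classifiers or databases by majority vote. Not an existing tool; it ignores
# the naming and rank differences between databases discussed below.

from collections import Counter

def consensus_assignment(assignments, min_fraction=0.5):
    """assignments: taxonomy labels for one OTU, one per method/database.
    Returns the most common label if it wins at least min_fraction of the
    votes, otherwise 'unresolved'."""
    top, votes = Counter(assignments).most_common(1)[0]
    return top if votes / len(assignments) >= min_fraction else "unresolved"

otu_7 = ["Bacteroides", "Bacteroides", "Prevotella"]  # e.g. from RDP, Greengenes, Silva
print(consensus_assignment(otu_7))                     # Bacteroides
```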
But the thing is, different databases do have slightly different names for organisms, so consolidating them wouldn't be straightforward. Here are some of the differences. RDP is the most similar to NCBI taxonomy — it actually incorporates the NCBI taxonomy — and it also has a built-in tool called the RDP classifier that allows you to do rapid classification based on k-mer counts. Greengenes is the one favored by QIIME, as I mentioned before, and Silva is the one favored by mothur. Besides the differences I mentioned in how they generate their alignment templates, there are also minor naming and taxonomy differences, and differences in how they rank things: as you know, a proper taxonomic name has what are called ranks, and these databases don't necessarily have the same ranks for a given organism. For example — and I'm just speaking theoretically — Bacteroides in Greengenes could have all the ranks going from the broadest to the narrowest, but in Silva some of the ranks might be missing, or might be named differently, or might be grouped differently. So it's not always straightforward to compare across databases; that's a harder question than comparing across different matching algorithms, which has been done already.

And yes, you can have your own custom database — you can certainly use your own annotated set of sequences as a database. That's actually an approach some people take, and John, the instructor for the third day, can say more about it: it's a potentially useful approach when you're trying to characterize a community that doesn't have good references. You take the most abundant organisms in that community as the references and build, as well as you can, your own target taxonomy database based on those organisms, and then do your clustering and OTU calling against that. That potentially gives you more custom OTU-picking results to work with.

Okay, so the taxonomy summary is very often shown as a bar graph that, for each sample, shows the relative abundance of each taxon in that sample. The same color represents the same taxon across samples, and you can see that this first sample has slightly more yellow — whatever the yellow organism is — compared to some of the others, and these ones have more of the orange taxon.
So of course same color represent the same samples and you can see that this first one has slightly more yellow Whatever the yellow organism is compared to some of the others and these ones have more orange Taxon okay, so From the OTU assignment The OTU is the results is usually summarized in a sample by observation matrix called the OTU table and it's simply a table that has samples on the y-axis and OTU's on the x-axis so it shows you how many How many Reads in each of the OTU's and each of the samples And sometimes you so this table can again get quite big because you can have thousands of samples and thousands of OTU's so in order to Condense the storage requirement instead of storing OTU table as a Two-dimensional table Usually it's encoded in this standard format called a biome format Yeah Okay, so so the short answer to that is you could compare abundance but best to do so within your own Data set rather than comparing relative abundance cost multiple data sets It's not It's it's it's it's less straightforward when you're trying to compare across multiple Data sets based on relative abundance and in those cases maybe the presence and absence of OTU's will be It will be a coarser comparison, but it will be a more it will be a It would be less subject to potential experimental Biases in how your samples collected how your samples process and so on so But if you're all the samples are processed the same way and collected collect the same way process the same way Then the relative abundance is meaningful in that sense. So yeah Should yeah, but you would I will get into that a little bit later, but you would typically Verify or stop sample your samples when you compare so so you try to read try to have all the same number of reads for each of your Samples before you do the comparison then in that case relative abundance is also Meaningful in that sense So Now a day people tend to throw away the singletons and and treat them as Potential sequencing errors and a lot of the algorithms including cluster algorithms were In mother it's called pre-clustering. We'll also assume that if singleton is highly similar to an OTU it would collapse that singleton into an OTU and very Singletons are very different from anything else likely to be experimental artifacts and I think that the marker gene based analysis is very good for doing a Get the community profile of abandoned organisms in the community But it doesn't really do very well when if you're interested in is in the really rare organisms in the community Right, so but so but if you're if you're interested in rare organisms in a community Then a targeted PCR direct target approach is more suited then Then microbiome approach Yeah, yeah, so mother and China both support bio Okay So the few words about a bio format. 
(And yes, both mothur and QIIME support the BIOM format.) So, a few words about the BIOM format. It's more efficient at handling large amounts of data because, instead of storing everything in a full matrix, it takes advantage of the fact that some OTUs are not present, or at least not detected, in a given sample, and it's wasteful to store that information explicitly: in a BIOM file, if sample one has no entry for OTU 3, you can infer that it was not detected rather than explicitly storing a zero. That approach basically reduces the amount of data that needs to be stored. It also keeps the metadata together with the observational data: in a plain table the metadata can only be encoded in the headers, but in a BIOM file you can explicitly attach metadata to your samples in the same file, so you don't have to worry about losing the metadata during file transfers and so on. It's well supported by multiple software packages; the common ones include QIIME, MG-RAST (which we covered), PICRUSt (which Morgan will talk about), mothur, and so on, and there are tools that deal specifically with converting BIOM files to other common formats such as tab-delimited files. So it's a community standard now.
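To illustrate the sparse-storage idea (this is just the principle, not the actual BIOM file layout, which is a defined JSON/HDF5 specification), here is a minimal sketch that turns a dense table of counts into a list of (sample, OTU, count) triples, keeping only the non-zero entries:

```python
# Dense OTU table: rows are samples, columns are OTUs; many entries are zero.
samples = ["sample1", "sample2", "sample3"]
otus = ["OTU1", "OTU2", "OTU3", "OTU4"]
dense = [
    [10, 0, 0, 3],
    [0, 0, 25, 0],
    [7, 1, 0, 0],
]

# Sparse representation: keep only the non-zero entries as (sample, OTU, count).
sparse = [
    (samples[i], otus[j], count)
    for i, row in enumerate(dense)
    for j, count in enumerate(row)
    if count > 0
]

print(sparse)
# Any (sample, OTU) pair that does not appear in the list is implicitly zero.
```

The more zeros the table contains, the more this kind of encoding saves.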
Okay, so the other component of marker gene analysis is sequence alignment. Sequences must be aligned in order to generate phylogenetic trees, and the phylogenetic trees can be used for diversity analysis. When you generate OTUs there is no implicit distance between them: the difference between OTU 1 and OTU 2, or between OTU 1 and OTU 3, is not stated anywhere, so essentially each OTU is treated as a single organism, all being equal. The tree-based approach, on the other hand, gives you distance information. The traditional alignment approaches are not very scalable to the amount of data we now have, so newer tools such as PyNAST, which align against the templates available from the Greengenes and Silva databases, are used to generate alignments faster and fairly accurately. Once you have the alignment you can generate phylogenetic trees; again, because of the large amount of data, some of the older methods don't scale, so FastTree seems to be the common tool people use to generate maximum-likelihood trees consisting of thousands of nodes, and it's the one used by QIIME by default.

Okay, so I mentioned this already: as we multiplex samples, the sequencing depth can vary from sample to sample, and this in turn affects the richness and diversity measures in your downstream analysis. You can imagine that if you sample a particular environment more deeply, you're going to see more organisms, and more rare organisms, in that sample than if you sequenced it to a shallower depth. So the common practice is to rarefy your samples to the same sampling depth, in other words the same number of reads in each sample. Some alternative approaches have been proposed: instead of truncating your samples to a fixed number of reads, you might be able to control for this by applying a variance-stabilizing transformation. The basic principle is that as your sample size gets bigger the variance also gets bigger, so you try to bring each sample to the same level of variance; even though the samples have different numbers of reads, they are roughly scaled back to the same variance. But that approach doesn't seem to be favored any more, given that sequencing is cheaper nowadays and you can potentially just generate more reads for an under-sampled or under-sequenced sample. (Answering questions:) Yes, you should rarefy your data set before downstream analysis. And yes, even then you won't get exactly the same number of reads in every sample, so you still have to rarefy; it just means you don't have to down-sample as much.
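As a minimal sketch of what rarefying actually does, here is one way to subsample the reads of a single sample down to a fixed depth without replacement, using numpy (the counts and the target depth are made-up numbers; real pipelines such as those in QIIME or mothur handle this step for you):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical per-OTU read counts for one sample (1,200 reads in total).
counts = np.array([600, 400, 150, 50])
depth = 500  # rarefy this sample down to 500 reads

# Expand the counts into one entry per read, labelled by OTU index,
# then draw `depth` reads without replacement and re-tally them per OTU.
reads = np.repeat(np.arange(counts.size), counts)
subsample = rng.choice(reads, size=depth, replace=False)
rarefied = np.bincount(subsample, minlength=counts.size)

print(rarefied, rarefied.sum())  # per-OTU counts after rarefying; sums to 500
```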
Okay, a few quick words on the diversity measures. Alpha diversity measures the diversity of organisms within one single sample. Richness simply means the number of species, taxa, or OTUs observed (or estimated) in a given sample; evenness, on the other hand, describes the relative abundance of each taxon within that one sample. There are different diversity measures, but all of them effectively try to take both evenness, in other words the relative abundances of the different taxa in your sample, and richness, the number of taxa found in your sample, into account. The most common ones are listed here, including Shannon entropy and phylogenetic distance; we'll use some of these in the tutorial, and if you're interested we can show you exactly how they're calculated.

This figure is just to show that looking at the alpha diversity of the same samples using different OTU-picking methods can give you slightly different results: not surprisingly, the open-reference and de novo approaches, which do not discard novel sequences, give a higher number of OTUs than the closed-reference clustering approach. They are only slightly different, and there aren't a lot of OTUs in this particular example, roughly 20 versus 30 to 40. (And yes, I should say the y-axis ends at 30 on this plot and at 40 on the other two; not the best figures, I have to admit.)

Beta diversity, on the other hand, measures the difference in diversity across samples or across environments, and again there is a list of common beta diversity measures you can use. There's UniFrac, which takes a phylogenetic tree and calculates the fraction of branch length that is unique to one sample versus another, so it's a measure that considers phylogenetic distance. Bray-Curtis, on the other hand, takes the OTU table and simply looks at the OTU abundances, that is, the relative abundances of the OTUs in your samples. The Jaccard measure also takes the OTU table but looks only at the presence and absence of OTUs. There are other beta diversity measures, based on both OTU tables and phylogenetic trees, available in QIIME and mothur.
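For those who want to see how some of these measures are calculated, here is a minimal sketch of Shannon entropy (an alpha diversity measure) and of Bray-Curtis and Jaccard (two of the beta diversity measures just mentioned), computed from made-up count vectors for two samples; UniFrac is omitted because it also requires a phylogenetic tree:

```python
import numpy as np

# Hypothetical read counts per OTU for two samples.
sample1 = np.array([50, 30, 20, 0])
sample2 = np.array([10, 0, 60, 30])

def shannon(counts):
    """Shannon entropy (alpha diversity) of a single sample."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: based on OTU abundances."""
    return np.abs(a - b).sum() / (a + b).sum()

def jaccard(a, b):
    """Jaccard distance: based only on presence/absence of OTUs."""
    present_a, present_b = a > 0, b > 0
    shared = np.logical_and(present_a, present_b).sum()
    union = np.logical_or(present_a, present_b).sum()
    return 1 - shared / union

print(shannon(sample1), shannon(sample2))   # within-sample diversity
print(bray_curtis(sample1, sample2))        # abundance-based distance
print(jaccard(sample1, sample2))            # presence/absence-based distance
```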
So here's a quick summary of how beta diversity analysis is done. First you have a table of the pairwise distances between your samples; each sample is of course identical to itself, and you can see, for example, that some pairs of samples, say samples 1 and 3, are more similar to each other than other pairs. Once you have that information, you can transform it into a lower-dimensional display for visualization: instead of summarizing the results in a pairwise fashion, you can use several approaches to show which samples are more similar to each other in two-dimensional or three-dimensional space.

Okay, so with UniFrac, as I mentioned, all the OTUs are mapped onto a single phylogenetic tree, and the branch length unique to a single sample is used as a measure of how different the samples are from each other. Abundances are either ignored, in the unweighted version, or used to weight the branches, in the weighted version. What's been shown is that weighted UniFrac is sensitive to certain experimental biases but more robust when you have a lot of rare organisms in your samples, whereas unweighted UniFrac effectively treats the rare organisms the same as the abundant ones. So if you have two samples that share their most abundant species but whose rare species are completely different, in the unweighted UniFrac calculation the abundant and rare species carry equal weight, and the samples will show up as very different because the rare species dominate the calculation; weighted UniFrac discounts the rare species, so the shared abundant species count for more and the samples show up as more similar. In this illustration, each leaf node is a taxon, an OTU: this particular taxon is found in the green sample, that one in the green sample, and these three in the red sample, and the same here. If you add up the unique branches, the tree where the taxa from the two samples are more distantly related to each other has more unique branch length than the one where the taxa are more similar; does that make sense? The branch lengths give you an approximation of how similar the two communities are: these two don't share any taxa, so they are maximally different communities with no overlap at all, whereas these are interlaced, so they're more similar.

Okay, so instead of going into the details of principal coordinates analysis, I just want to give you an analogy for what it's trying to do. It takes a higher-dimensional object and projects it into a lower dimension; in this example it goes from a three-dimensional object to a two-dimensional projection, and you can think of the problem as: how much information is retained after the projection? You can see on the right-hand side that the shadow, even though it's in two dimensions, can still tell you the object is a chair, whereas in this case the shadow has very little information; it could be anything, a chair or a stool, so this is a poor projection and that is a much better one. The aim of principal coordinates analysis is to give you the most informative projection in a lower dimension, to help separate out the samples based on the most salient features; in the jargon, it tries to account for the most variation when it reduces the dimensionality. This is a plot generated using QIIME, and I think many of you have probably seen similar plots: the color shows, for example, the different sampling sites, each dot is a sample, and you can see three distinct clusters that separate out by sampling site; the dots are colored based on the metadata you provide.

(Answering questions:) The methods are mathematically different, but it's hard to say how those mathematical differences translate into biological significance, so basically pick the one that fits your data the best. Both will give you numbers that indicate how well the lower-dimensional representation captures the higher-dimensional space, and within one method, if you have a set of projections, you want to choose the one with the lowest stress. I haven't seen a good analysis of why to use one over the other; I guess you can do both, and when they don't agree you probably want to make sure that what's in your data is actually being represented in these plots, since it might be an indication that there's variability in your data that neither is capturing. We can talk about that offline. There are also different assumptions: if you read about the methods they will tell you what the mathematical assumptions are, but translating those into biological assumptions is not always straightforward, and a lot of these methods were developed and evaluated on simulated data sets. And some of these methods start from a correlation or covariance matrix rather than a distance matrix; PCA, for example, works from correlations, whereas PCoA starts from a distance matrix.
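For the curious, here is a minimal sketch of classical PCoA (metric multidimensional scaling) done directly in numpy from a pairwise distance matrix; it follows the textbook double-centring-plus-eigendecomposition recipe and is not the exact code QIIME runs, and the example distance matrix is made up:

```python
import numpy as np

def pcoa(dist, n_components=2):
    """Classical principal coordinates analysis on a symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distance matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    # Eigendecomposition; keep the axes with the largest positive eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    positive = eigvals > 0
    coords = eigvecs[:, positive] * np.sqrt(eigvals[positive])
    explained = eigvals[positive] / eigvals[positive].sum()
    return coords[:, :n_components], explained[:n_components]

# Example: a small symmetric distance matrix for four samples (made-up numbers;
# in practice this would be, say, a Bray-Curtis or UniFrac distance matrix).
d = np.array([[0.0, 0.2, 0.7, 0.8],
              [0.2, 0.0, 0.6, 0.7],
              [0.7, 0.6, 0.0, 0.3],
              [0.8, 0.7, 0.3, 0.0]])
coords, explained = pcoa(d)
print(coords)     # 2-D coordinates, one row per sample
print(explained)  # fraction of the variation captured by each axis
```

The coordinates are what you would plot and color by your sample metadata, as in the QIIME plot described above.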
Okay, so another way to represent the results is to show them as a tree, but this time a sample tree built by hierarchical clustering rather than a phylogeny. Each leaf is a single sample, not a taxon any more, and of course the samples that sit closer together in the tree have more similar microbiome profiles. The downside of this type of visualization is that it forces your samples into a bifurcating tree, so what do you call a cluster, and where do you cut the tree off? At this branch length, this is a cluster, but then these two are no longer part of a cluster; where do you draw the line and call something a cluster? This type of approach may be meaningful if you have only a few sample types and what you're hoping to see is, say, two giant clusters, one representing each of your sample types; but when you have a lot of different sample types, this kind of visualization can actually confuse the results, and in that case something like principal coordinates analysis is more reasonable.

Okay, so this is just comparing marker genes versus metagenomics, and I think most of these points are pretty self-evident by now. Marker gene analysis is less expensive and computationally easier to deal with; it mainly provides taxonomic profiling rather than functional profiling. For 16S we have a pretty good handle on microbial diversity at the phylum level, so at least you can fit your data into some sort of taxonomic classification scheme, and because you amplify a specific target region, the data are relatively free of host DNA contamination. The shotgun metagenomics approach is still much more expensive and usually requires a lot more computational resources, but it directly provides both taxonomic and functional profiling. However, because our sequence databases are not nearly as comprehensive, there will be many more unassigned gene fragments that simply have no representation in an annotated database; you may get a fragment that you just don't know what it is and can't really analyze properly, so there's a lot more wasted data, and depending on the sampling approach it's also more prone to host and environmental contamination. It was also shown quite early on in metagenomic and microbiome comparisons that the bacterial phyla can vary greatly across samples while the corresponding functional classifications are much more stable, so some interpret that as meaning that comparing at the functional level may be more meaningful than comparing at the bacterial taxon level; in other words, different taxa in different samples might be fulfilling the same function in that ecosystem. I'll skip PICRUSt because you're going to get a whole lecture on that.