All right. So I work at the International Microbiome Center at the University of Calgary; some of you I have worked with. I'm a lead data scientist, or lead bioinformatician. We help people who cannot analyze their data analyze their data, and that can be microbiome data, because we work at the microbiome center, but we also do RNA-seq analyses, sometimes single cell, rarely metabolomics, all sorts of data, and sometimes just visualization for people. So my talk might not take a whole hour, but stop me if you have questions. There's a lot that I'm trying to explain, and I'm trying my best to make it easy to understand, but if you don't understand something, please just raise your hand and I'll try my best to answer. Disclaimer: I'm not a statistician, so don't ask me too many stats questions, but I've been using stats for a while, and as a bioinformatician I at least know when to use a test and when not to.

What I'm trying to make you aware of is some statistical understanding of 16S data: what it looks like, how you would calculate the different diversities, alpha and beta, what kinds of statistical tests you can do if you want to compare two groups or more, what differential abundance analysis is, and a question that has caused a lot of debate and controversy, to rarefy or not to rarefy. Most of what I'm talking about is how I feel about the field. It could be totally different from what other bioinformaticians, or people you're going to work with, feel. So just take it as my opinion; it comes from the literature, what I've learned and what I've understood.

I just wanted to start with some definitions. For me, a person who has not really spent time in the wet lab, a biosample can be anything: a soil sample, an animal sample, a water sample, anything that has a community in it. Morgan and Robin already talked about ASVs and OTUs. ASV stands for amplicon sequence variant, and OTU is an operational taxonomic unit. In very simple terms, it's just what you get out of doing the wet lab work that you did today: you ended up with a table with abundances and a sequence. That sequence is region-specific, so whether it's V4, V3, V6, whatever region you do, it's a sequence of a certain length, and you're going to use this sequence to assign taxonomy. Whatever level you end up at, sometimes you'll get species level, sometimes you'll end up with genus or even family. When we talk about diversity, that just refers to the variety and the abundance of species that you have in an environment or an ecosystem. Distance and dissimilarity matrices we're going to talk about in detail, but these are interchangeable terms that we use when we talk about beta diversity. The reason we call something a dissimilarity matrix versus a distance is the kind of calculation it does: a distance is non-negative and satisfies certain other conditions, whereas a dissimilarity doesn't follow all of those norms. When we talk about sample depth, read depth, or the number of reads assigned, it's all the same thing. And sub-sampling is when you have, say, a bag of red and blue balls and you pick 10 of them out of 100 or so.
So you're just sub-sampling a data set: you're picking things out of it, a subset. And yes, NGS stands for next-generation sequencing.

So for me, when we talk about microbiome data, the question is: what kind of data is it? Is it count data? Because we're counting species, right? That's what we're doing when you have this ASV table: you have abundances, abundance means how many reads, and a read is basically a count of how many reads were assigned to an ASV. So when you think of microbiome data, you can say, okay, I think this is count data, and I'll go into detail on that in the next slide. Or you can say, I think it's compositional, because what I'm looking at is the number of species inside a sample and everything is proportional and relative. You can go with either one. The field is split: some people call this count data and develop methods with that in mind; others say no, this is compositional.

Okay. So what you have done today is end up with a table, or a matrix, like this: you have samples and you have ASVs, or taxa, whatever you want to call them. The nature of this data is that it has a lot of zeros in it. It's not necessary that an ASV is present in all of your samples. Even if they're all fecal samples from people living in the same house, they might have different ASVs in them and you might end up with zeros. This kind of matrix or table is what we call a sparse matrix, or sparse data. You'll hear these terms again and again when you look at different methods and get deep into analysis: this data is sparse. It just means it has a lot of zeros in it.

If you take this table, pick one ASV or taxon, and just look at the distribution, this is what I got when I made a histogram: you can see that the peak, the highest frequency, is at zero. When this happens, people start saying, okay, microbiome data is count data, and hey, this looks like a zero-inflated distribution. Then they go with that: because it's count data and the distribution has a lot of zeros, it might be zero-inflated, or there are a few other names; they'll say, I think this is a negative binomial distribution, and so on, and then they go and develop methods based on that.

So the question I want to ask, and stop at, is this: we have a biosample, and I chose a cat in protest of everybody using mice in the past. When you amplify the 16S region, sequence it, and end up with this matrix, is it capturing everything? Is the table capturing exactly what is in the sample, in terms of the counts and the species that I see here? And the answer is no. That's because when you take a sample and put it on a sequencer, the sequencer becomes a limiting factor: whether you're running a MiSeq or even a HiSeq, there's a limit on the total number of reads that can be assigned to the samples in a run, right?
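To make the sparsity point concrete, here's a minimal R sketch with made-up data (not the workshop data set); the matrix, its dimensions, and the negative binomial simulation are all assumptions just for illustration.

```r
# A minimal sketch of checking sparsity and looking at the count
# distribution for one ASV, using a toy ASV table.
set.seed(1)

# Hypothetical ASV table: rows = samples, columns = ASVs, values = read counts.
asv_tab <- matrix(rnbinom(20 * 50, mu = 5, size = 0.3),
                  nrow = 20, ncol = 50,
                  dimnames = list(paste0("S", 1:20), paste0("ASV", 1:50)))

# How sparse is the table? (fraction of cells that are zero)
mean(asv_tab == 0)

# Histogram of counts for a single ASV: the tallest bar is usually at zero.
hist(asv_tab[, "ASV1"], breaks = 30,
     main = "Counts for one ASV across samples", xlab = "Read count")
```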
So if it's, let's say, a few million total reads, say one million, you're limiting yourself: the total number of reads you have is one million, and those reads are then divided up among your samples. This is what we call fixed-size subsampling; it fixes the number of reads a sample can get. Here is an example to take this a little further. If I was looking at a biosample and it had certain species in it, these are the counts of the species if everything were perfect. Now I'm sequencing it: let's say I had n = 20 organisms here, and the sequencer is telling me I can only assign 10 reads. So I subsample. This is just an average, so there's a standard deviation here, but imagine that because these red guys dominate, I end up with five of these, three of these, and two of these. But this is not exactly what happens when we sequence, because we have certain biases. These biases come from, for example, amplification: in 16S we amplify our reads, there are random variations, there are sometimes preferences, and you might not end up with this, you end up with something like that. So there are factors we have to take into account.

I put this slide in the wrong order, but if you have different samples on a run, you will sometimes end up with very different read counts per sample: two times, three times, and I have even seen ten times differences in the reads assigned to samples. In this example, all of these are fecal samples, so it's not like I was comparing saliva samples to fecal samples. If I was doing that, then sure, it makes sense that some samples would get more reads because fecal samples are richer and the saliva samples might not. But here we had all fecal samples, we had tried to load the same amount of DNA, and that's another factor the wet lab people know more about than me: you try to load the same amount, but it doesn't end up exactly the same, and you end up with this kind of situation.

So now I'm coming back to compositional data and why people think this is not count data but compositional data. If we had these three samples and we sequenced them, we know there is some bias, and for many reasons we don't end up exactly with the counts that are present in the community or the sample. What we end up with is proportions. Something that had a count of 100 here might have a count of 50 there, so the counts don't match, but the proportions do. And this is a good example where the counts are different in these two samples, but if I look at proportions, they're exactly the same for sample two and sample three. That's why a lot of people think, and the field is moving towards this, that we should treat all sequencing data, not just microbiome but RNA-seq and the rest, as compositional data, where the information is relative or proportional but not really absolute counts. Okay. And then, how does that affect the choices you're going to make when you're doing your analysis?
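Here's a tiny R sketch of that sample two versus sample three point; the taxon names and counts are made up, but the arithmetic is exactly the comparison described above.

```r
# Two hypothetical samples with different total read counts (sequencing depth)
# can carry the same relative information.
s2 <- c(taxonA = 100, taxonB = 50, taxonC = 50)    # 200 reads total
s3 <- c(taxonA = 500, taxonB = 250, taxonC = 250)  # 1000 reads total

prop2 <- s2 / sum(s2)
prop3 <- s3 / sum(s3)

rbind(prop2, prop3)       # the proportions are identical
all.equal(prop2, prop3)   # TRUE: only the relative composition was observed
```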
There are certain normalizations or transformations that people use, so you have to understand, if you're using method X, what kind of normalization or transformation you have to apply to your data before you can use it. For example, there are some standard approaches, and most of these are based on the assumption that we have count data. Then there are compositional approaches, and a lot of them are becoming more popular now. It's not that these are new; it's just that people are realizing, okay, let's think about this as compositional data and use them. So you'll see papers now talking about CLR, the centered log-ratio, or Aitchison distances when we talk about beta diversity. Some things remain the same no matter which one you choose, but it's very important to understand the method you're using and the assumptions it makes: is it a standard, count-data approach, or a compositional approach? Okay. Any questions?

Okay. So when we talk about diversity, it's all about figuring out either who is in your sample or, if you're comparing two groups or treatments, how those two treatments are similar, or, usually, how they are different, because we all want to find differences. I did a treatment; is it different or not? And most people are not happy if they don't see a difference, because they have spent a lot of time in the lab doing these experiments. While there are more than two ways of looking at ecological or microbiome data, the most popular ones are alpha diversity and beta diversity. If you look at any microbiome paper, they'll be in there somewhere; if not in the main paper, then in the supplement, but people will talk about alpha diversity and beta diversity.

Alpha diversity is when you focus on a single sample and ask: what is going on in this sample? How many species do I have, and what are their abundances? With beta diversity, we're saying: okay, I know what's in one sample, but now I want to look at how this sample is different from, or similar to, a second sample. What is the relationship between them?

For alpha diversity, there are two words that you will hear again and again: richness and evenness. Richness is just the number of species or ASVs found in a sample. When you did the lab today, you probably saw that different samples had different feature counts, as we're calling them; different numbers of ASVs were found in them. Evenness is about the abundances. Again, in that OTU or ASV table you were looking at, the abundances of the different ASVs varied: some were very abundant, probably in all the samples, and others were not. Or if there were two groups, like in that data set, if you looked at the metadata file, there are wild blueberries and, I guess, managed and forest ones, it's possible that some ASVs have higher abundance in one group than in the other. So there are different alpha diversity measures, called indices, and they try to combine richness and evenness together.
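As a rough illustration of the compositional approach mentioned above, here's a hand-rolled R sketch of a CLR transformation followed by Euclidean distances on the CLR values (the Aitchison distance); the toy matrix, the pseudocount of 0.5, and the `clr()` helper are all my own assumptions, not a particular package's implementation.

```r
# A minimal compositional-style sketch: add a small pseudocount (zeros break
# the log), apply the centered log-ratio (CLR), then take Euclidean distances.
asv_tab <- matrix(c(10, 0, 30, 60,
                     5, 5, 40, 50,
                    80, 0, 10, 10), nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("S", 1:3), paste0("ASV", 1:4)))

clr <- function(x, pseudo = 0.5) {
  x <- x + pseudo      # handle zeros before taking logs
  lx <- log(x)
  lx - rowMeans(lx)    # subtract each sample's mean log value
}

asv_clr <- clr(asv_tab)
aitchison_d <- dist(asv_clr)   # Euclidean distance on CLR-transformed data
aitchison_d
```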
So depending on what you're doing, you can also look at just the observed species: what is the total number of species in my sample? The most popular indices that you will see in most papers are the Shannon and Simpson indices. I can't pronounce this one, so I'm not going to try, but that one is more of an evenness measure. So depending on your question, what you're trying to find out, you can pick these, and you can try multiple things. If I'm looking at a data set, I'm going to try different ones: how many species were found, is there a big difference in just the number of species, how does the Shannon index, which takes both richness and evenness into account, look, how does the Simpson index look? I look at different things. It all depends on what you're doing, what questions you're asking, and what you're comparing in the end.

This is just an example where I have two groups and I'm trying to see how these different measures or indices look. Each point is one sample, and they come from two different groups. So in one group, this sample has fewer than 120 species, while another sample in the same group has 60 species. There is variation, and that's understandable. Looking at the different indices, you'll find that some might be significant, others might not be, or none of them are, or whatever. So just look at different ones and try to understand what is going on in your data set based on the formulas or definitions of each of these. And you're using QIIME; QIIME has a nice forum you can go to where people explain all the alpha diversity measures, I think 12, 15, or more, that you can calculate and look at.

So when we start comparing things, you can't just look at them and say, hey, they look different. Now you need statistics. For alpha diversity, it depends on what kind of diversity measure you're using, whether it's Shannon, Simpson, or observed species, and based on that you have to check whether you can use an ANOVA or not. The ANOVA makes certain assumptions, like normality, that the data is normally distributed, and that the variances of the two groups are similar. If those hold, you can use an ANOVA. Otherwise, you can go to a Kruskal-Wallis test, because it doesn't make those assumptions; it's more robust, if I can use that word, than the ANOVA. And if you are comparing more than two groups and your ANOVA or Kruskal-Wallis is significant, then you can go and do pairwise comparisons between each pair of groups and figure out whether all of them are significant, or only one comparison, or whether all the group combinations or just a few of them are significant. So these are the tests you can use.

Okay, so now about beta diversity. Now we're talking about comparing samples. Alpha was all about each sample on its own, and now we're trying to see how two samples, either in the same condition or in different conditions, are related. This is where the dissimilarity or distance metrics come in. I'm sure most of you have heard of Euclidean distance, yeah, okay. So all of these are kind of like that.
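Here's a minimal sketch of the alpha diversity and testing steps described above, assuming the vegan R package and a toy count matrix with samples in rows (the simulated data and two-group split are placeholders, not the workshop data).

```r
# Alpha diversity (richness, Shannon, Simpson) and a non-parametric group test.
library(vegan)

set.seed(2)
asv_tab <- matrix(rpois(10 * 30, lambda = 8), nrow = 10,
                  dimnames = list(paste0("S", 1:10), paste0("ASV", 1:30)))
group <- factor(rep(c("groupA", "groupB"), each = 5))

observed <- specnumber(asv_tab)                # richness: ASVs per sample
shannon  <- diversity(asv_tab, index = "shannon")
simpson  <- diversity(asv_tab, index = "simpson")

# Kruskal-Wallis comparison of Shannon diversity between the two groups.
kruskal.test(shannon ~ group)

# With more than two groups, pairwise follow-up tests could look like:
# pairwise.wilcox.test(shannon, group, p.adjust.method = "BH")
```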
They're just calculating, and I'll show this in the next slide, how two samples, or however many samples you have, are dissimilar, or what the distance between them is. There are several. Bray-Curtis, which you've probably heard of the most, is the one that looks at presence and absence of species and also at abundance. So it looks at the abundances of species and tries to figure out the distance between two samples, sorry, the dissimilarity. You have others like Jaccard, which looks only at presence and absence; it doesn't care what the abundance is, all it wants to see is how many species were present in one sample and the other. Those don't take any phylogenetic tree into account. Then there are two more, I'll call them distances or dissimilarities: weighted UniFrac, which takes abundance into account, and unweighted UniFrac, which does not.

So how do you make a choice? Again, it depends on what you're trying to do. Is the abundance of a species important to you, or is just presence or absence important? For example, we were doing one study, this was for Kathy McCoy's lab, for the germ-free facility, where we didn't care about the abundances; we just wanted to see whether the species found were similar in one group or the other, not their abundances. So we chose a measure that only cared about presence or absence. But most of the time, in most cases, you want to use a metric that takes both of these into account. People like weighted UniFrac because it also takes the phylogenetic tree and those distances into account.

Okay. I usually don't like putting formulas in talks; that scares people, it scares me. But I thought this one was important, because I keep talking about dissimilarity and you need to see what it is, and then we'll do a simple example so you understand what this value is that you get in the end. This is the Bray-Curtis example. What Bray-Curtis is doing is saying: if I'm comparing sample one and sample two, for each of the taxa or species, I first take the smaller of the two values and sum those up; twice that sum is the numerator. The denominator is the total number of reads assigned in the two samples added together, and the dissimilarity is one minus that ratio. So if you get a value closer to one, it means the samples are further apart, and the smaller the value, the closer the samples will show up when we make a visualization of this. And this is how you calculate this value for all of your samples: I have 11 samples here and here, so the diagonal is zero, because it's the same sample compared to itself, and you end up with this kind of distance matrix or dissimilarity matrix, with the values calculated like this for Bray-Curtis.
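To make the calculation above concrete, here's a small worked example in R with two made-up samples, checked against vegan's implementation.

```r
# Bray-Curtis by hand and with vegan::vegdist (rows are samples).
library(vegan)

s1 <- c(5, 0, 10, 20)
s2 <- c(2, 8, 10,  5)

# Twice the sum of the smaller value per taxon, over the combined totals.
bc <- 1 - 2 * sum(pmin(s1, s2)) / (sum(s1) + sum(s2))
bc

vegdist(rbind(s1, s2), method = "bray")   # should match bc
```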
When we talk about weighted UniFrac, it's taking the abundances and the phylogenetic tree into account, doing a similar calculation between two samples, and coming up with something like that. Now the question is, how do we look at this? Even with these 11 samples, it's going to be really hard. Yes? Not yet, we haven't gotten there; we'll get there in two slides. Okay. Yeah. So these two, right, it subtracts from one here, so these two samples are not that similar: the value is closer to one. And if you get closer to zero, what am I saying, yes, the smaller the value, the smaller the distance between two samples, so the more similar they are. So now think about taking this and trying to figure out how things are related.

Yes? Sorry. The question is, if we've been talking about compositional data, why are we using raw abundances for the diversity analysis? Good question. Exactly: a lot of these diversity measures use count data; they're not from the era where we talk about compositional data. What people do is apply data transformations before calculating beta diversity. You can use the CLR, the centered log-ratio transformation, or others, like the ALR, before doing this. I'm just using raw abundances and simple numbers here to make it easy to follow. When I'm doing beta diversity myself, I would try a few different transformations. It also depends on the data you have; it's not a one-size-fits-all kind of thing, it's very data-specific. While I might not like a certain transformation as a default, I might choose to use it for a specific problem. There are no favorites; all methods have their strengths and weaknesses, and you pick them based on that. But really good question.

So when I think about figuring out how these samples are related, sometimes people make a heat map. You could make a heat map, but it doesn't really tell me much, other than maybe that these samples here, because the color is light, are more closely related to each other than to these samples over there. If I was looking at S5, for example, I know that its distance to S7 is smaller than its distance to S11, but it doesn't give me a visual of my samples where I can say, okay, do they separate or not? You can also, and I've seen this in some papers, this plot is from the HMP2 paper, use the Bray-Curtis distance on different conditions. This is group one and group two, from the same data I was showing you for alpha diversity, and you can make these kinds of density plots of Bray-Curtis values. In that paper they were trying to define which samples were dysbiotic, so they picked a cutoff and said, okay, any sample whose distance from, I guess, the control samples is more than a certain value is a dysbiotic sample, and then they went ahead and did some analysis on those. Most of the time we don't do this. We like to see something like this instead, right?
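For completeness, a minimal sketch of the heat-map view of a dissimilarity matrix mentioned above, using an assumed toy count table with 11 samples (base R's `heatmap()` is just one way to draw it).

```r
# Heat map of a Bray-Curtis dissimilarity matrix for 11 toy samples.
library(vegan)

set.seed(3)
asv_tab <- matrix(rpois(11 * 40, lambda = 6), nrow = 11,
                  dimnames = list(paste0("S", 1:11), paste0("ASV", 1:40)))

bray <- vegdist(asv_tab, method = "bray")

# Lighter cells = smaller dissimilarity = more similar samples.
heatmap(as.matrix(bray), symm = TRUE,
        main = "Bray-Curtis dissimilarity between samples")
```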
Think of this as an ordination; you've probably seen PCA plots, principal component analysis plots. What we like to see is something like this, where there's a nice separation: group A is here, group B is here, and you're done, voila. But there's a lot of math behind it, so do not ask me the details. There are different methods you can use: principal component analysis, principal coordinate analysis, NMDS, correspondence analysis, all of these can be used. Again, remember there are transformations you have to do before you can use PCA on this data, for example, so read about what each method needs and what its assumptions are, and then use it.

I'm showing you an example of a PCoA here. Basically, you take that matrix that we had and use it as input, and this turns it into components, or coordinates, or axes, whatever you want to call them; it gives you a bunch of axes. The idea behind the whole thing, in very simple words, as I understand it, is that there's variation in your data, and you're using these ordination methods to find components; what we call component one, or axis one, is the one that tries to capture the most of the variation in your data. So if I take the axis one values, where every sample gets a value, and plot just that one axis, then for this data it looks something like this. This is a really nice data set which separates already if I use just one axis. Usually we plot axis one against axis two, or sometimes two and three if we don't find anything in the first two. These percentages are the percent of variation each axis explains; there were, I think, 16 or so axes, but I'm just showing a few to explain this. If there is a separation in your data, it's usually visible in the first three axes; you don't usually have to go to the fifth or sixth, because most of the variation is explained by the first ones.

Okay, one point I quickly want to make here: sometimes you might see a separation even if you have not used the right transformations, or your library sizes are very different, or something else is going on. The job of this visualization is just to spread out your data along the biggest difference it can see, whatever explains the biggest difference. So make sure your transformations are right and make sense given the library sizes you have. If you have, okay, you won't have one million, but say some samples with 300,000 reads and others with 3,000, then sometimes the PCoA might pick up exactly that. So make sure you go back and check, and usually I would label these points by different metadata variables. Here it's just group one and group two, but there are different things you can color and label them by to highlight the differences in your data. You have a question? Okay, all right. So again, even with beta diversity, yes, it's beautiful in this case, in this rare case, where you see this nice separation, but a lot of the time that's not what you're going to see.
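Here's a minimal sketch of a PCoA in R, assuming the toy `asv_tab` and `bray` objects from the heat-map sketch above and a made-up two-group variable; `cmdscale()` (classical multidimensional scaling) is one common way to compute a PCoA.

```r
# PCoA on a Bray-Curtis matrix, plotting axis 1 vs axis 2 with % variation.
library(vegan)

group <- factor(rep(c("group1", "group2"), length.out = nrow(asv_tab)))

pcoa <- cmdscale(bray, k = 3, eig = TRUE)          # classical MDS = PCoA
var_explained <- round(100 * pcoa$eig / sum(pmax(pcoa$eig, 0)), 1)

plot(pcoa$points[, 1], pcoa$points[, 2],
     col = as.integer(group), pch = 19,
     xlab = paste0("Axis 1 (", var_explained[1], "%)"),
     ylab = paste0("Axis 2 (", var_explained[2], "%)"))
legend("topright", legend = levels(group), col = 1:2, pch = 19)
```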
You're going to see groups that overlap a little bit. And then, if you want to publish this, you have to show some kind of statistics. There are a few you can choose from, but the one you'll hear about and see in every paper is PERMANOVA, a permutational multivariate analysis of variance. It's kind of like an ANOVA if you think about it, but it uses permutations. What it does is calculate a pseudo-F statistic, which is basically a ratio of the between-group variation to the within-group variation. Then it permutes your data: it checks whether, if I randomly switch the group labels, these groups still look different or not. It keeps doing this, and by default it's usually 999 permutations, but you can change that to more, usually not less; people add more just to be more confident in the result.

Some concerns about PERMANOVA: if you're just comparing one term, like treatment A versus B, one variable, that's fine. But the moment you have more, like treatment plus time plus something else, PERMANOVA is sensitive to the order in which the terms are put into the model. If you switch the order, our beloved p-values might change. That's one thing to watch with PERMANOVA. Another thing, and I didn't want to go into too much detail but I put a note here: if you run a PERMANOVA and get a significant result that the groups are different, it's recommended to do a second test that checks the dispersion of your groups. PERMANOVA is not that sensitive if the dispersions, the variances of your two groups, are not similar; it can still be okay, but it's recommended to check the dispersion of your groups. If the dispersions are similar, you can believe the p-value. If they're not, then you have to go back to the PCoA plots and ask, okay, how different is this dispersion really, and are the groups separating well anyway? It's a decision you have to make, but it's always good to do this beta-dispersion test.

So next I wanted to talk about differential abundance analysis. Yeah. Yes, so when PERMANOVA shows you something really significant, like 0.001, sometimes what it's actually picking up is that the dispersion, the variance, of the two groups is very different. So it's always good to check, to do what I think is called a beta-dispersion test, and then you can find out whether that is significant or not. There's a really nice set of videos by Pat Schloss, the guy behind mothur, on YouTube, where he discusses what to do if you run a PERMANOVA and then this dispersion test also comes back significant: looking at the PCoA plot, you can decide whether to reject the result or still accept it based on how well the groups separate. I'll put that in the Slack; he has a really nice set of videos.
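A minimal sketch of those two tests, assuming vegan and the toy `bray` distance and `group` factor from the sketches above.

```r
# PERMANOVA followed by the beta-dispersion check.
library(vegan)

meta <- data.frame(group = group)

# PERMANOVA: permutes the labels (999 permutations by default).
adonis2(bray ~ group, data = meta, permutations = 999)

# Dispersion test: are the within-group spreads similar?
bd <- betadisper(bray, meta$group)
permutest(bd)
```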
The only problem, not really a problem, is that he uses R for all his analyses, but he brings up some really good points and discusses things well. So far we've been talking about looking at these groups or samples based on species and abundances overall, but we haven't talked about, yes? Yeah, so this is where you would use, I mean, when do you use them? That's a good question, because we use ANOSIM sometimes, but we always do a PERMANOVA. Maybe Morgan or Robin have a better answer for you, but does that answer it? We're all not statisticians. I think tomorrow, or on Friday, there will be a statistician here, so you can ask them.

So yes, I wanted to talk a little bit about differential abundance analysis, but before that I wanted to show you the taxa bar plots, because I haven't shown any yet and I've just been talking about the table and the table and the table. Again, in every microbiome paper or presentation you'll see a plot like this. This is relative abundance, so we're showing proportions here. I think John talked in the morning about proportions versus absolute values, how to deal with that and what is preferred these days. But this is usually how we still show our taxa. Usually it's shown at phylum level, just because there are fewer phyla. You can make these plots at genus level too, but then you have a lot of genera to show. With 16S, and this is usually for gut samples, you do get some species-level assignments, maybe 10 to 15, at most I think 20 percent. You'll have more information at genus level, and as you go higher you get more, because you collapse: you collapse different species into a genus and you start getting more information there. So yeah, these are the plots we show.

But now if you wanted, and this is at phylum level, to see differences between the groups in specific species or genera, that's where you would use differential abundance analysis. It's not about the overall picture anymore; it becomes more specific, looking at particular species or taxa. So yeah, for differential abundance analysis, again, the question becomes which assumption you want to go with: is it compositional data or is it count data? Based on that, you can find different methods. Some of the methods mentioned here, like edgeR and DESeq2, actually come from the RNA-seq field, and they're count-based; they use their own transformations internally. Some of the ones down here are based on the assumption that the data is compositional, and ALDEx2 is one of those. Whatever you use, you'll end up with something like this. Morgan was talking about these long names, the MD5 hashes, that we end up with. The numbers here are for ASVs: because we have so many NAs, no information at the species level, we usually put an ASV number here, so that if you have two genera here and NAs there, we know that ASV 47 is different from ASV 192, for example. What this shows, depending on the method you use, and this figure is from DESeq2 results showing log fold change, is some kind of ratio or change between the two groups. It tells you, okay, some of my ASVs are more abundant in this group and some are more abundant in that group.
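Here's a minimal sketch of a relative-abundance taxa bar plot in R with ggplot2; the long-format data frame and the three phylum names are made up purely for illustration.

```r
# Stacked bar plot of relative abundance per sample at phylum level.
library(ggplot2)

df <- data.frame(
  sample = rep(paste0("S", 1:4), each = 3),
  phylum = rep(c("Firmicutes", "Bacteroidota", "Proteobacteria"), times = 4),
  count  = c(60, 30, 10,  40, 50, 10,  20, 70, 10,  55, 35, 10)
)

ggplot(df, aes(x = sample, y = count, fill = phylum)) +
  geom_col(position = "fill") +   # "fill" converts counts to proportions
  labs(y = "Relative abundance", x = NULL)
```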
So that's what you're showing. With this, you can then get more specific: if you're working on a specific disease, IBD or something else, you can say in your paper, okay, we saw these species to be different, or, if you did this at genus level, which a lot of people do, this genus was significantly higher in this group compared to that group. I just wanted to show you this, from Morgan's lab. This is the paper where they compared most of the methods available for differential abundance analysis across different data sets. Each row is a data set and each column is a method. I think the take-home message is that different methods will find different numbers of significant ASVs. A lot of them, again, use different transformations or normalizations in the background, and some methods are prone to giving you more false positives, while others are very, very stringent and might not give you anything at all.

The authors' recommendation is that if you are comparing two different data sets, use the same method. You can't have a situation where, say, you're doing a study on IBD, somebody in your lab five years ago did an IBD study with DESeq2, and now you come along and use ANCOM-BC and then wonder why the results don't match. The first step in a comparison is to use the same method, or the same set of methods. If you're comparing different studies, use the same tool. If you have one study, you can use several different tools and then look at a consensus of those, to see which ASVs keep popping up, and even present it as a consensus: with this method we found this, and with that method we found that. Except for, I think, edgeR here and LEfSe, which would give you a lot more false positives, so avoid those if you can. And ALDEx2, in my opinion, is very, very stringent, so a lot of the time you don't see anything. I don't know if that's a good thing or a bad thing; you don't want false positives, right? So again, use a few methods and then see what works.

Okay, so we talked briefly about this variation in read depth. This goes back to the discussion we started with about count data, about looking at absolute values versus relative values, and in the end it comes down to the question of rarefying or not rarefying. I'm going to try to explain that as best as I can. When we sequence things, different samples end up with different numbers of reads, and the number of ASVs you get out of them also varies. But usually, if you give a sample more reads, up to a certain point it will keep yielding more species: your depth increases and you start picking up even the rare species. For example, in this figure, sample C has a lot more species, even though it still has slightly fewer reads than D. For all of these samples, the library size is just the number of reads assigned to a sample.
So as you give them more reads, you find more species, but at a certain point you've exhausted the sample: the gain in the number of species levels off as the library size keeps increasing, and you've captured whatever was in the sample. The way these rarefaction plots work is that you take a sample, you know what its maximum read count is, and then you sub-sample it. You sub-sample and see, okay, if I give it, for example, this is 5,000, so about a thousand reads, how many species are found; if I give it 1,000 reads, how many species are found. That's how these rarefaction curves are made. Then, based on these, if you're thinking of this as count data, what people try to do is find a cutoff for the whole data set and say, okay, I am going to rarefy, or sub-sample, everything at 11 or 12,000 reads. That way I keep almost all of my samples, their curves have mostly plateaued, and I'll capture most of the species. This is done, again, because of the assumptions, this issue with count data: some of the methods, like the Bray-Curtis one of you asked about, and these dissimilarity matrices in general, are sensitive to differences in read counts, because they do their calculations on the raw values you give them. That's why some people recommend sub-sampling everything and going with that.

If you assume your data is compositional and you use those methods and those transformations, then in my opinion you don't have to sub-sample, unless you have really big differences in library sizes. If one sample has 1,000 reads and another has, I don't know, 500,000, even then you should be cautious and check it. What I do is, even if I know there are big differences, I do my transformations and then I make beta diversity plots with rarefying and without rarefying, and I look at whether there is a difference, because rarefying can introduce its own artifacts too. When you should definitely not rarefy is when you're looking at rare taxa: you might miss them if you sub-sample. That doesn't usually happen unless something is really rare and you're sub-sampling down to a really low value. And it's been shown in one of the papers that you also decrease your statistical power if you start sub-sampling the data. So, in my opinion, I rarely rarefy data. If I am in doubt, I make two or three different beta diversity plots with different dissimilarities, rarefied and unrarefied, and then I compare them to see whether it makes sense or not. The only case where I would even suggest you check it is when there are really big differences in your sampling depth.

Okay, so all of this that I have shown you today can be done either in QIIME 2, which you're familiar with. I learned it last week for the first time, and it's actually pretty user-friendly in the sense that you can do a lot by running commands: you can quickly do your analyses, get some plots out, and get enough information to understand your data. I am an R user myself, but that means you need to make an effort to learn some programming.
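A minimal sketch of the rarefaction curves and the rarefying step described above, assuming vegan and a toy count matrix with samples in rows.

```r
# Rarefaction curves and rarefying every sample to a common depth.
library(vegan)

set.seed(4)
asv_tab <- matrix(rnbinom(6 * 200, mu = 20, size = 0.5), nrow = 6,
                  dimnames = list(paste0("S", 1:6), paste0("ASV", 1:200)))

rowSums(asv_tab)                      # library size per sample

# Species found vs. number of reads sub-sampled, one curve per sample.
rarecurve(asv_tab, step = 100, label = TRUE)

# Rarefy (sub-sample without replacement) everything to the smallest depth.
depth <- min(rowSums(asv_tab))
asv_rare <- rrarefy(asv_tab, sample = depth)
rowSums(asv_rare)                     # all samples now at the same depth
```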
The advantage of R is that when you have to make plots, when you have to make something for a publication or a presentation, you have more choices. You can use ggplot2, you can do many different things, and you can make it your own, with your own color palette, for example, and other touches. So those are the options. Another option, if you just don't like programming and you also don't like the terminal: if you have an ASV table and your taxonomy, you could try MicrobiomeAnalyst. It can make alpha diversity plots and beta diversity plots for you, and it's all web-based. This is just for people who don't have time and don't want to invest time in programming. You could try that out too.