Well, thanks, everyone. It's my pleasure to introduce Samuel Gamboa, and, in fact, to welcome you to the Metagenomics and Cancer short talk series of the Bioconductor 2022 conference. Samuel Gamboa is working with Levi Waldron's lab at CUNY, and he'll be speaking about a resource of microbiome benchmark data sets with biological ground truth. Thank you, Samuel.

Hi, thank you, Vincent. Right. Well, I would like to start by speaking a little bit about differential abundance analysis, which is the major focus of this work. It's a statistical approach aimed mainly at identifying microbes that are associated with a given condition or a given body site, sometimes for the purpose of identifying biomarkers or signatures. This analysis has some challenges, and these challenges usually come from the type of data itself, such as the compositionality and sparsity of this type of data. So an array of methods has been used and implemented to identify differentially abundant features or taxa. They go from classical statistical tests like the Wilcoxon test and the t-test to compositional methods, which are based on a centered log-ratio transformation. Other methods implement their own normalization and transformation methods, such as metagenomeSeq, and there are some other methods borrowed from the fields of RNA-seq and single-cell RNA-seq. Well, the bottom line here is that there is no consensus method or approach, so pretty much there has to be some kind of benchmarking before using any of these methods.

Another limitation that exists right now for the development of these differential abundance methods is the limited options for benchmark data sets. Commonly, the data sets that are used come from simulations, which often do not reflect the complexity of biological data sets.
Some biological data sets are used, but some of them lack a biological ground truth, so the researchers cannot really know if their results are correct or not, because they just come from statistical analysis. And on the other hand, some other data sets are not readily available or reusable. So to tackle this, we are creating a collection of data sets which have been described in the literature and are available somehow. We are standardizing all of these in a single resource. There are actually three data sets, and the first one that you see there is presented in several versions. As you can see, they come from different body sites, gingival plaque, stool, and vagina, and they have different contrasts. This is important because there is assumed to be some differential abundance in these data sets, and they all have a biological ground truth that has been very well characterized in the literature, not only through sequencing methods, but through other approaches as well.

For example, in the gingival data set, it's well known that aerobic taxa are enriched in supragingival plaque, as opposed to subgingival plaque, which is enriched in anaerobic taxa. For the stool samples, the real ground truth in this data set is that it has some spiked-in bacteria. These are exogenous, and they have been added in a fixed amount, which is known, so this can be used to recalibrate the whole data set and get accurate counts. And the last one comes from bacterial vaginosis, where it's well known that there is a decrease in Lactobacillus during bacterial vaginosis and an increase in other taxa. So, well, this collection of data sets is available in Zenodo right now, at least the first three data sets that you see there.
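The spike-in recalibration idea described for the stool data set can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the procedure used by the authors of that data set: the function name `recalibrate_by_spike_in`, the toy counts, and the choice to scale every sample so that its spike-in count equals the mean spike-in count across samples are all hypothetical.

```python
def recalibrate_by_spike_in(counts, spike_taxon):
    """Rescale each sample so the spike-in taxon has the same count everywhere.

    counts: dict mapping sample name -> dict mapping taxon name -> raw count.
    spike_taxon: name of the exogenous taxon added in a fixed, known amount.
    Assumption (hypothetical): the common target is the mean spike-in count.
    """
    spike_counts = {sample: taxa[spike_taxon] for sample, taxa in counts.items()}
    if any(c <= 0 for c in spike_counts.values()):
        raise ValueError("every sample needs a positive spike-in count")

    # Common reference value that every sample's spike-in is scaled to.
    reference = sum(spike_counts.values()) / len(spike_counts)

    recalibrated = {}
    for sample, taxa in counts.items():
        factor = reference / spike_counts[sample]
        recalibrated[sample] = {taxon: c * factor for taxon, c in taxa.items()}
    return recalibrated


# Toy counts (invented for illustration): the same amount of spike-in was
# added to both samples, but it was sequenced at different depths.
toy_counts = {
    "s1": {"spike": 100, "taxon_a": 400, "taxon_b": 500},
    "s2": {"spike": 50, "taxon_a": 450, "taxon_b": 500},
}
calibrated = recalibrate_by_spike_in(toy_counts, "spike")
for sample, taxa in calibrated.items():
    print(sample, taxa)
```

After rescaling, the spike-in count is identical in every sample, so the remaining differences between samples in the other taxa can be read as differences relative to that fixed reference rather than as differences in sequencing depth.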
We're also developing a package that is to be submitted to Bioconductor 3.16, in which the data sets are delivered as TreeSummarizedExperiment objects, where all the components of the data sets are available, such as the sample metadata, the abundance matrix, the taxonomy annotation, and the phylogenetic tree in some cases. This increases the FAIRness of the data sets, that is, how findable, accessible, interoperable, and reusable they are.

So I would like to go through a couple of cases of how we are using these data sets, just as examples, and the first that I have to mention is the bacterial vaginosis data set. Well, this data set has some experimental evidence, which is based on an annotation of taxa that have been isolated from bacterial vaginosis samples, so we know that there should be differential abundance. We have sample metadata that is also backed up by an experimental approach, in this case based on Nugent scores, which divide the samples into bacterial vaginosis and healthy. And all the taxa are annotated, so we have a reference. So we're using these for an enrichment analysis.

In this plot, we are comparing several differential abundance methods and normalization methods. We classify them into classical, compositional, metagenomics, RNA-seq, and single-cell RNA-seq, which is what I presented in the first slide. On the left side, you can see the expected biological ground truth, right? We would expect the methods to detect bacteria associated with bacterial vaginosis as up and bacteria associated with a healthy vagina as down. We can see in the plot, just by looking at the number of features, that most of the methods did find the taxa that we expected. But as you can see, some of the compositional methods also detected the bacteria that should be up in the other direction. This is preliminary data, so I still have to do some digging there.
But this is an example of how we think that this data set could be useful to compare the outputs of the different differential abundance methods.

The other data set that I would like to talk about a little bit is the Stämmler 2016 data set. This comes, as I mentioned, from stool samples before and after allogeneic stem cell transplantation. There are three bacteria that were added exogenously, Salinibacter, Alicyclobacillus, and Rhizobium, which are totally exogenous to the stool samples, so we can be pretty much sure that they're not there natively. And in the count data, well, we know how much of those bacteria there is, so it can be used to recalibrate the counts. We use this to compare the coefficient of variation of compositional and non-compositional data transformations, with the compositional ones based on the log-ratio transformation. As you can see in the first panel of the plot, for the recalibrated counts, in this case of Salinibacter, the variation is zero, which is expected because it's the same amount in all samples. So this would be the biological ground truth. We compare the performance of different normalization methods, and we see that, for example, even a method that is based on the log ratio, which is considered appropriate for compositional data, didn't perform much better than the simpler approach, which is relative abundance. So we can see there's still a lot of variation here. And we see this also in the abundance of the other taxa. So this is another example of how to use one of these data sets for benchmarking and comparing results.

So, summary and perspectives. Well, we are expecting that this resource and this package will provide data sets that are suitable for benchmarking differential abundance methods and for evaluating the correctness of discoveries, because we already know what to expect from them. And we expect to add more of these data sets from different organisms and habitats, so we can have an array of data sets there.
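The coefficient-of-variation comparison described for the spike-in data set can be illustrated with a small Python sketch. It is a sketch under stated assumptions, not the analysis shown in the talk: the toy counts are invented, the helper names are hypothetical, and the coefficient of variation is taken here as the sample standard deviation divided by the absolute value of the mean, so that it stays defined for centered log-ratio (CLR) values, which can be negative.

```python
import math
import statistics


def relative_abundance(sample):
    """Divide each count by the sample total (total-sum scaling)."""
    total = sum(sample.values())
    return {taxon: c / total for taxon, c in sample.items()}


def clr(sample):
    """Centered log-ratio: log count minus the mean log count of the sample.

    Assumes strictly positive counts; real data would need a pseudocount.
    """
    logs = {taxon: math.log(c) for taxon, c in sample.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {taxon: value - mean_log for taxon, value in logs.items()}


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean (assumption)."""
    return statistics.stdev(values) / abs(statistics.mean(values))


# Toy recalibrated counts (invented): the spike-in is constant by construction.
samples = {
    "s1": {"spike": 75, "taxon_a": 300, "taxon_b": 375},
    "s2": {"spike": 75, "taxon_a": 675, "taxon_b": 750},
    "s3": {"spike": 75, "taxon_a": 150, "taxon_b": 75},
}

cv_recalibrated = coefficient_of_variation(
    [s["spike"] for s in samples.values()]
)
cv_relative = coefficient_of_variation(
    [relative_abundance(s)["spike"] for s in samples.values()]
)
cv_clr = coefficient_of_variation(
    [clr(s)["spike"] for s in samples.values()]
)

print("CV of spike-in, recalibrated counts:", cv_recalibrated)
print("CV of spike-in, relative abundance: ", cv_relative)
print("CV of spike-in, CLR:                ", cv_clr)
```

With a constant spike-in, the recalibrated counts give a coefficient of variation of zero, while both transformations introduce variation that depends on the rest of each sample. How large that variation is for each method on real data is exactly what such a benchmark is meant to measure, and the toy numbers here say nothing about that.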
And we'll increase the findability. These are in Zenodo already, and they will be submitted to Bioconductor. Well, thank you all for listening. Thanks to my co-authors, all the Waldron lab, and the research foundation at CUNY for funding this. Thank you.

Great. Thanks, Samuel. We have a minute for one or two questions, if there are any. Yes. I think if you go to the microphone, then others will be able to hear you, and you can come to take the response.

Thank you for your talk. I just had a question. It looks like your samples were just 16S. Are you looking at using metagenomic shotgun sequencing as well?

Yes. Well, thank you for asking. Right now, all of them are 16S, except for one. The first is gingival. We have some metagenomic data there. Yes, we do look to expand this, so we're open to proposals of data sets to include. And definitely including metagenomic data is in sight, and we would like to add those.

And just a follow-up question. I'm doing some similar type of, there's room for lack of function because of the way that bacteria share genes like that, looking at disease to try to make a whole disease target.

Sorry. Yeah, sorry. Repeat the question?

Is this OK? Can you hear me fine this close? My question was, are you interested in taking this further into some type of functional gene annotation, not just a taxonomic basis for interrelatedness between species?

Well, yes. Thank you. Yes, we do intend to. The laboratory is working with a lot of annotations. I think the next talk is going to be related to that. We also have other packages, and we look into ontologies as well. Yes, it would be nice to annotate the taxa based on functional features, not just the taxonomy. That's definitely something that we would like to include in future data sets, and we look forward to that.

Awesome. Thank you very much.

No, thank you.

All right. Thanks again, Samuel.