Dr. Willis is an Assistant Professor and Principal Investigator in the Department of Biostatistics at the University of Washington. She develops tools for microbiome and biodiversity analysis. She is interested in reproducible research and meaningful data analysis, and collaborates with scientists who share those values. She has been a recipient of NIH Outstanding Investigator and University of Washington Outstanding Mentor awards. Please check out her Twitter and her website. And with that, Dr. Willis, please take it away. Excellent. Well, thank you so much for the very kind introduction. I was just saying to folks who were here a bit earlier that, as was just mentioned, my home is Seattle. I love living in Seattle, and I'm very sorry I can't be there in person today; I had a prior commitment that has me traveling on the East Coast. So feel free to connect with me in any of the ways shown here, through Twitter or email. It's been terrific to see the programming that's been happening at Bioconductor 2022. This is actually my first time presenting at a Bioconductor conference. I'm a passionate advocate for the open source community, but since this is my first time presenting here, I thought I might start off by introducing myself and some of the work that I do. So I'm a statistician; my training is in statistics, and I really believe that statistics exists to serve the sciences. To that end, I've collaborated with folks on a number of different projects. I'm really passionate about biodiversity, so I've got some projects here, for example on an Arctic woolly mammoth, reconstructing its lifetime movements from isotope data, as well as on very hard-to-detect members of the oral microbiome. To support some of this work, I develop methods for the analysis of biodiversity data, and you can take a look at some of those methods here; these are mostly new methods for the analysis of microbiome data, and I'll talk more about that in a second. I think one of the most important things statisticians and data analysts can do to support modern science is to bridge disciplinary divides, so I take this outreach and interdisciplinary communication work really seriously. Some of the tools that have allowed me to better collaborate and communicate with folks in biological fields include perspective pieces that try to frame, for microbial ecologists, some of the questions that statisticians think about. I've also got a couple of open source software projects that I'm really excited about, listed at the bottom. What I'm going to talk about today is more recent methodological work, which is of course accompanied by open source software. In particular, I'm going to focus on a preprint we put up a couple of months ago called "Modeling complex measurement error in microbiome experiments," so feel free to check out that paper on arXiv if you would like more details. I'm happy to answer questions, which it sounds like we're doing at the end. The story I'm going to tell today starts with some background for folks who have relatively little familiarity with microbiome data specifically, but it really begins with the advent of high-throughput sequencing.
As was mentioned many times yesterday, high-throughput sequencing is this incredible technology that allows us to analyze genomic data sets at unprecedented scale and at unprecedentedly low cost. The advent of high-throughput sequencing has really revolutionized the way we study biology, as well as the way we study medicine. One thing that distinguishes high-throughput sequencing from classical approaches to studying ecological communities is that it is an indirect way of obtaining information about the genomic source, or the community, under study. If I think back to how ecological surveys were done in the 50s, 60s, and 70s, folks with a lot of sunscreen and bug nets were sitting there counting frogs on their clipboards, counting the different fishes in the net or the different spiders in their traps. High-throughput sequencing is totally unlike that: there are all of these different steps in the laboratory preparation procedure, which I'll talk more about in a second. Before I do that, I'll give a little introduction to microbiomes, communities of microscopic organisms, which are going to be the focus of this talk today and indeed the rest of my current work. You may or may not know that you are not wholly comprised of your eukaryotic self: you actually carry approximately 10 times more bacterial cells on your body than human cells. If we think about what that bacterial complement is able to help you with, it adds an enormous amount of function to what you can do. For example, we've got about 22,000 genes in the human genome, but the combined complement of genes from the microbes that you walk around with is maybe on the order of eight to ten million. So what do these bacteria do for you? They help fight off viruses. They help you break down your food, metabolize, and absorb nutrients. And maybe one of my favorite facts is that eukaryotic cells are descended from bacterial and archaeal cells, so on my good days I like to think of them as my distant cousins as well. We often think about pathogens, but our commensal organisms help us a great deal with living and existing. Okay. A common data structure that we see in the analysis of microbiome data is count data (it could also be coverage data; that distinction isn't so important here): we're often thinking about matrices of samples by taxa, where the taxonomic groups could be species or strains. Throughout this talk I'm going to let W_ij be the number of times a given strain j is observed in sample i. What we see in this example data set from Brooks and co-authors is that in sample one we observed Lactobacillus crispatus 19 times and Lactobacillus iners about four times. In contrast, Lactobacillus crispatus was observed zero times in the second sample. Okay. An assumption that underpins a lot of the statistical work on modeling microbiome data concerns the true relative abundances of strains, species, or taxa in a sample. I'm going to let p_ij denote the true relative abundance of taxon j in sample i. A common assumption underpinning a lot of methods, including corncob and DESeq and many more that we can chat about in detail in question time, is that the expected number of counts we observe in a sample for the j-th strain is proportional to this true relative abundance, with some sample-specific scaling factor that I'm going to call C_i here.
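Written out in the notation just introduced, that working assumption is:

```latex
% Working assumption underpinning many existing methods:
% expected counts are proportional to true relative abundances,
% up to a sample-specific sequencing intensity.
\[
  \mathbb{E}[W_{ij}] \;=\; C_i \, p_{ij},
  \qquad \sum_{j} p_{ij} = 1,
\]
% where W_ij is the observed count of taxon j in sample i, p_ij its true
% relative abundance, and C_i a sample-specific scaling factor (read depth).
```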
And so if this is a core assumption of many methods for estimating abundance in microbial communities and for looking at differential abundance across sample types, I think it's really important to investigate this assumption and ask: do we have good evidence that this working assumption holds? How can we do that? One really interesting data structure that exists in microbiome sequencing, and that I think is fairly unique to microbiome sequencing compared to other high-throughput sequencing technologies such as RNA-seq, is validation data in the form of mock communities. Mock communities are artificially constructed communities of known composition. They've been used a great deal in the bioinformatics and computational biology literature, but not so much in the statistical literature, and that's where we come in. Mock communities have a true, known composition. What I'm showing here are four samples and seven species: the first sample is comprised 100% of Streptococcus agalactiae, and none of the other six strains are present in that sample. In contrast, the second sample is comprised in equal, one-to-one parts of Atopobium vaginae and Prevotella bivia, and the third and fourth samples have one-to-one-to-one ratios of three taxa each. Okay, so this is a great tool we can use to investigate some of the assumptions that underpin a lot of the methods used in microbiome analysis. And it may or may not surprise you that the data table I showed earlier actually originated from this mock community. I'll let you look at that for a second, and then I'll pull out a couple of really interesting pieces. One thing to notice is that we are observing strains in samples from which they should be absent: as I said earlier, we observed Lactobacillus crispatus 19 times in the first sample, in which we know it not to be present. Another really interesting feature is that while we have known equal ratios, say in this third sample of Lactobacillus crispatus, Lactobacillus iners, and Prevotella bivia, we observe many more reads from Lactobacillus iners than from Lactobacillus crispatus: about two and a half times more, maybe around 11,000 Lactobacillus iners reads in this third sample compared to about 4,800 Lactobacillus crispatus reads, in samples in which they should be present in the same abundance. I'll make those two observations again, because they're going to be pretty important for the model I talk about today: firstly, despite equal mixing fractions, some taxa are observed many more times than others; and secondly, we observe taxa in samples in which they are supposedly absent. Okay. What I'm going to focus on for the rest of this talk is using this control data to propose, justify, and validate a model that perhaps better reflects the underlying biology of these communities than the working assumption I introduced earlier, where counts were proportional to true relative abundances. This is work with David Clausen, who is a fifth-year PhD student in the Department of Biostatistics at the University of Washington and has just been an absolute joy to work with. David and I worked on the model I'll describe next; first, though, here is a tiny numerical illustration of the kind of check that mock communities make possible.
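This is a minimal sketch with illustrative counts, loosely matching the numbers quoted above (the second sample's counts are made up); it is not the original Brooks et al. dataset.

```r
# Compare observed read counts for two taxa that were mixed in equal
# proportions in mock samples. If detection were equally efficient,
# these ratios should hover around 1.
mock <- data.frame(
  sample      = c("s3", "s4"),
  L_iners     = c(11000, 9000),   # illustrative counts
  L_crispatus = c(4800,  4100)
)
with(mock, L_iners / L_crispatus)
#> roughly 2.3 and 2.2: L. iners is consistently over-detected
#> relative to L. crispatus despite equal mixing fractions
```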
So we're going to decompose the expected number of counts in a sample attributable to a taxon into two pieces. The first piece is the contribution from what should be present in the specimen. The second piece is the contribution from what we can broadly think of as contamination: things that shouldn't actually be picked up in our sample, because we know they shouldn't be there. I'll dive into the contribution-from-the-specimen piece first. What I'm showing here is more mock community data, also from the Brooks and co-authors paper I mentioned earlier. Notice what these five communities have in common. In the top panel I've got true relative abundances, and we see that the first, second, and third samples are comprised of one-to-one-to-one mixtures of three different taxa, denoted by different colors. All five of the samples I've pulled up here have both this green taxon and this blue taxon present, and what we observe is that we consistently see a lot more observations due to the green taxon than due to the blue taxon. So let's think about the ratio of the number of observations attributable to each of these taxa. I would claim that we consistently observe about 18 times more observations due to Lactobacillus iners, the green taxon, than due to Streptococcus agalactiae. For folks who've looked at a lot of high-throughput sequencing data, I would say this is a remarkably consistent signal compared to a lot of what we see in this field. So is this some fluke? Am I cherry-picking the two taxa whose ratio I'm taking? Well, let's take a look. Here's a plot with pairs of taxa on the x-axis. As I showed, we observe approximately 18 copies of Lactobacillus iners for each copy of Streptococcus agalactiae in communities in which they're present in equal abundances. And if we look at all the other pairs of taxa, we also see pretty consistent over-detection, in particular over-detection of one taxon relative to another in a pretty constant multiplicative proportion. At the far left-hand side of this plot, we observe approximately 30 times more Lactobacillus iners than Gardnerella vaginalis. And this is true regardless of the number of species present in each of these mixtures. So this is, I think, a pretty strong signal and a pretty good indication of what sort of model we should be considering in this scenario. I'm going to claim that everything I've shown up until now provides pretty compelling evidence against the model where expected counts are proportional to the true relative abundances, and much stronger support for a model where expected counts are proportional to the product of the true relative abundances and strain-specific detection effects, or what's commonly known in the microbiome field as detection efficiencies. I'm going to let the term e_j denote efficiencies here. For estimation purposes later, it'll be convenient to work with these on the log scale, so we've got an exponentiated form of this model as well. And of course we're going to hang on to our sample-specific sequencing intensities, C_i, in addition.
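In symbols, the specimen contribution of the model just described looks like this (writing the log-efficiencies as β_j, consistent with the betas mentioned later in the talk):

```latex
% Detection-efficiency model for the specimen contribution:
\[
  \mathbb{E}[W_{ij}] \;\propto\; C_i \, p_{ij} \, e_j
  \;=\; C_i \, p_{ij} \exp(\beta_j),
\]
% where e_j > 0 is the taxon-specific detection efficiency and
% beta_j = log(e_j) is the log-scale version used for estimation.
% The contamination contribution is added in a later step.
```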
Okay, so it's one thing to see this model and say, yes, that all seems fine and good, but what are the real implications of failing to consider taxon-specific efficiencies when we go to do modeling? The model I'm claiming is that, if we go from counts to relative proportions, we can consider observed relative proportions as proportional to the true proportions of taxa in the environment multiplied by taxon-specific efficiencies. So what's the impact of failing to account for these efficiencies? Let's consider a toy example in a noiseless setting, where we're looking at a treatment sample and a control sample. This could be population-level data; it's not so important here. It would be very natural to look at this data as it appears on our desk and conclude that the relative abundance of the orange taxon, which let's say is a pathogen, decreased in the treatment sample compared to the control sample, because we observe a decrease from 42% to 24% in the observed relative proportions. Perniciously, what could have generated this exact observed data is the opposite: in fact, an increase in the relative abundance of this orange pathogen from 20% to 33% going from control to treatment, a change in the opposite direction to what we observe in our data. So what could have driven this effect? The following taxon-specific detection efficiencies could have given rise to it. We had one taxon, the green one, that was extremely easy to detect: observed 18 times more easily than the blue taxon and three times more easily than the orange taxon, so efficiencies in the ratio 18 to 6 to 1. What's happening here is that because the green taxon is also changing in relative abundance, going from 5% to 33%, and because it's very easy to detect, it's eating up a lot of the relative abundance pie that's available and edging out the orange taxon, obscuring which direction the orange taxon is moving; a numerical version of this example appears just after this paragraph. So, really critically, what this example shows is that it's not just the relative abundance of one taxon that matters, but the relative abundances of all taxa, together with their detection efficiencies. This is essentially why the detection-efficiency problem poses such a big challenge to analyzing this data and drawing scientific conclusions from it. Okay, so I'll wrap up this subsection on the contribution from the specimen. For what we know to be present in these communities, we're saying that this piece, in expectation, is the product of the true relative abundances; the sample-specific intensities, reflecting that some samples have many more observations attributed to them than others (for example, the first row has about 51,000 observations in total and the second row about 21,000); and, in addition, a term that accounts for the different degrees to which taxa are detected, the fact that, as we noted earlier, Lactobacillus iners is observed about two and a half times more easily than Lactobacillus crispatus. So that's our contribution-from-the-specimen piece.
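Here is the treatment-versus-control toy example worked through numerically. The numbers follow those quoted above; the blue taxon's true proportions are filled in so that each sample sums to one.

```r
# How taxon-specific detection efficiencies can flip the apparent direction
# of a change. The blue taxon's true proportions are inferred so that each
# column sums to one; all other numbers are from the toy example above.
true_props <- cbind(control   = c(orange = 0.20, green = 0.05, blue = 0.75),
                    treatment = c(orange = 0.33, green = 0.33, blue = 0.34))
efficiency <- c(orange = 6, green = 18, blue = 1)

# Observed proportions are proportional to true proportions times efficiencies
observed <- apply(true_props * efficiency, 2, function(x) x / sum(x))
round(observed["orange", ], 2)
#> control ~0.42, treatment ~0.24: an apparent decrease, even though the
#> orange taxon's true relative abundance increased from 20% to 33%
```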
Let's move on to the reason we see non-zero numbers of observations from taxa that shouldn't be present, that should be observed zero times. The model I'm going to consider here is pretty similar to the one I introduced earlier. We're going to think of contamination profiles: potentially K-tilde sources of contamination that could be contributing reads. Each of these contaminant relative abundance profiles contributes with some intensity, and samples that are sequenced more deeply in total also blow up the amount of contamination, in absolute terms, that we observe in them. So this is a piece that gives us some explanation for why we observe non-zero counts of taxa that should be absent. Okay, so let's put those pieces together; a schematic version of the combined mean model appears after this paragraph. There's a bunch of notation here that I've thrown at you relatively quickly, but essentially we've got one piece due to what should be there and one piece due to what shouldn't be there. And I'm a statistician, right? So I like to think very generally in terms of experimental design and the possible ways folks could be running these experiments. Just so you know that this is a fairly general model, I'm going to introduce a couple of different design matrices: one that links samples to specimens, for example if folks have technical replicates they want to sequence; another that links the detection effects to different batches or different sample preparation methods, which is this X matrix; and then different ways that samples can share contamination profiles, which comes in through this Z-tilde matrix. So that's a bunch of notation; feel free to take a look at the paper if you want to dive into more detail. How are we going to estimate parameters in this model? We propose using likelihood-based tools: we model our counts, conditional on our parameters, as Poisson distributed with means as given previously. I'll talk in a second about whether the Poisson assumption is something we're really going to lean on; spoiler alert, we're not. But critically, in order to fit this model we need at least some parameters to be known. For example, we're imagining this being most useful in a setting where folks are investigating samples of known composition as control data alongside samples of unknown composition. Or maybe we know that we're going to sequence the same samples in multiple different ways, and we set one of those as the baseline and estimate detection effects of one protocol relative to another. One of the really nice features of this estimation approach is that it results in estimators for our proportions, as well as for our detection effects, that are consistent under very general conditions. In particular, we don't need to assume that these counts are actually Poisson distributed, as long as we've correctly specified our mean model. So this is consistency in a very general sense that's not tethered to distributional assumptions about W_ij; really, we're thinking of this likelihood-based approach as giving us estimating equations. One thing worth noting is that the Poisson distribution makes strong assumptions about the relationship between the mean and variance of observations.
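For intuition, here is a schematic of the combined mean model, suppressing the design matrices described above. The symbols for the contamination intensities and profiles (δ̃ and p̃) are placeholders chosen for this sketch; the preprint's notation is more general and may differ in detail.

```latex
% Schematic form of the full mean model: specimen contribution plus
% contamination contribution (design matrices for replicates, batches,
% and shared contamination profiles are suppressed).
\[
  \mathbb{E}[W_{ij} \mid \text{parameters}]
  \;=\;
  \underbrace{C_i \, p_{ij} \, e_j}_{\text{what should be in the specimen}}
  \;+\;
  \underbrace{C_i \sum_{k=1}^{\tilde K} \tilde{\delta}_{ik} \, \tilde{p}_{kj}}_{\text{contamination from } \tilde{K} \text{ profiles}},
\]
% where each contamination profile k has relative abundances p~_kj and
% intensity delta~_ik, and deeper sequencing (larger C_i) scales up the
% absolute amount of contamination observed.
```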
Because the Poisson makes that strong mean-variance assumption, one tool we have to improve statistical efficiency is maximum weighted likelihood, where we estimate a variance function, for example from squared residuals, and then impose structural constraints, for instance fitting that variance model with isotonic regression, as a reweighting approach. There are some nice details in the paper if you'd like to look at that more closely. Estimation in this setting is really interesting because some of the parameters we're trying to estimate do not fall in the interior of the space in which they live. We've got these relative abundances that not only lie in the simplex, so a given set of proportions has to add up to one, but for which we also want to allow maximum likelihood estimates to fall on the boundary. Specifically, we want to permit estimated true relative abundances to be exactly zero: we don't want to assume that every strain, every species, occurs in every sample; that's not biologically plausible. So how are we going to do that? We break our estimation algorithm into two pieces: a first set of steps, which is a barrier method, followed by a profiling step. The idea is that we've got a really unstable estimation problem, so we first enforce that our estimated relative abundances fall in the interior of the simplex, the interior of the space of vectors with entries between zero and one that add up to one; then, in a final step, we allow those estimates to move onto the boundary if that increases the likelihood, and that's done via profiling. Essentially, because we want to very carefully control how the likelihood increases and how our estimates change along the steps of this algorithm, the barrier-method step turns this quite unstable constrained optimization problem into a sequence of unconstrained optimization problems. We maximize not our likelihood, but our likelihood plus another term controlled by a hyperparameter T. We start with a small T and solve that problem; we use the solution to initialize the next step of the algorithm, where we increase T, for example ten-fold, so we're maximizing a function that's more like our likelihood; and we repeat until T is really large. At any individual step of this optimization we use regularized Fisher scoring, a second-order approximation of our likelihood. The interesting thing is that at each step of this algorithm we only enforce that our relative abundances lie strictly between zero and one. We can enforce the simplex constraint, that they add up to one while being strictly greater than zero, by reparameterizing the relative abundances using log-ratios; this is just a nice reparameterization trick to stay in the interior of the simplex, and there's a small sketch of it after this paragraph. Then comes the profiling step: once we've got estimates that approximately maximize something really similar to our likelihood, but not quite, we want to allow the estimated relative abundances to fall on the boundary of the simplex, and we do that with a constrained Newton's method within an augmented Lagrangian.
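As an aside, here is a minimal sketch of that log-ratio reparameterization on a toy problem: three made-up counts, no efficiencies or contamination, and plain optim rather than the tinyvamp implementation.

```r
# Log-ratio (softmax-style) reparameterization: unconstrained parameters theta
# map to relative abundances that are strictly inside the simplex.
softmax <- function(theta) {
  z <- exp(c(theta, 0))          # fix the last log-ratio at 0 for identifiability
  z / sum(z)
}

w     <- c(4800, 11000, 5200)    # toy counts for three taxa in one sample
depth <- sum(w)                  # plug-in sample-specific intensity

# Poisson negative log-likelihood under E[W_j] = depth * p_j
negloglik <- function(theta) {
  mu <- depth * softmax(theta)
  -sum(dpois(w, lambda = mu, log = TRUE))
}

fit <- optim(par = c(0, 0), fn = negloglik, method = "BFGS")
round(softmax(fit$par), 3)       # estimates are strictly in (0, 1) and sum to 1
```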
Coming back to the profiling step: you'll notice that this is not your typical Lagrangian. We've got two pieces that enforce the constraint, and we solve this problem using non-negative least squares. Essentially, we no longer have the log-ratio parameterization, and that's one piece of what allows estimates to sit on the boundary of the simplex. And of course we only take this update step if it increases the likelihood, so we have an additional check to make sure we're increasing the likelihood at this step. Happy to take more questions on this; I'm very excited about it. The final piece I'll introduce, before talking about a couple of examples, is some really interesting considerations regarding statistical inference, or hypothesis testing, in this setting. As I've said, one of the challenges for estimation is that our parameters are not guaranteed to lie in the interior of their space, and so a lot of our regularity conditions for asymptotic analysis break down here. You may remember from your first inference class that we typically assume parameters fall in the interior of the space in which they live. So deriving, or even describing, the asymptotic distribution of things like test statistics or estimators is not trivial here. A lot of digging into what possibilities exist landed us on a really nice bootstrap alternative. We're not doing just your regular multinomial bootstrap; we consider a modification that still gives correct asymptotic distributions for estimators when parameters potentially fall on the boundary of their space. What we've got here is a result for a subsampled bootstrap, the reweighted empirical distribution shown here: a weighted empirical distribution in which the weights on the observations we draw come from a Dirichlet distribution with parameters m over n, and we need both m and n to become large, with m growing slowly relative to n. A generic sketch of this reweighting idea appears just after this paragraph. There are some really interesting considerations here if you're curious about these questions, and I think this is a really nice example of theory and practice merging in interesting ways. Feel free to check out those details if you're interested, and I'm happy to take questions during question time. Okay, so now I'll show you how this tool and this method can be used in a couple of different ways for the analysis of microbiome data. A common situation is that folks are interested in choosing between different possible experiments they could run. This is data from Costea and co-authors, published in 2017, where the authors set out to investigate which of three different ways of running their sequencing was the best one. What they did is take 10 samples, I believe human stool, and into each of these samples they mixed a synthetic community of 10 different bacteria; then they split the samples up, sequenced every sample according to each of the three protocols, and, as a sort of validation approach, also collected flow cytometry data.
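Here is a generic sketch of that Dirichlet-weighted reweighting idea, with made-up data and a simple stand-in statistic; it illustrates only how the weights are drawn, not the full inference procedure from the paper.

```r
# Dirichlet-weighted, subsample-style bootstrap: instead of multinomial
# resampling, each observation gets a Dirichlet(m/n, ..., m/n) weight,
# with m much smaller than n. The weighted mean stands in for the
# model's estimators here.
set.seed(1)
n <- 200
x <- rexp(n)                       # placeholder data
m <- floor(n^0.7)                  # tuning parameter with m growing slowly in n

one_draw <- function() {
  g <- rgamma(n, shape = m / n)    # Dirichlet weights via normalized gammas
  w <- g / sum(g)
  sum(w * x)                       # weighted version of the estimator
}
boot_dist <- replicate(2000, one_draw())
quantile(boot_dist, c(0.025, 0.975))   # spread of the bootstrap distribution
```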
So, back to the Costea data: one of the really interesting things about the analysis these authors did is that they could look at their flow cytometry relative abundance profiles and their sequencing profiles, see that they don't quite match, and ask which protocol looks to be the best. That's the old framing, in terms of counts proportional to true relative abundances, and it was really hard to do much more with it. So let me show you how we can do a slightly more nuanced analysis with this new tool. What I'm showing here, for each of the three protocols, are relative abundances measured by the validation technology, flow cytometry, against the estimated relative abundances; each of the 10 specimens is a different color, and across the x-axis we essentially have different strains. What we observe is that some taxa are consistently observed far less than they should be, and very few of these lines lie on the x-equals-y line, so we're really struggling to see which taxa, if any, are estimated correctly with respect to their relative abundances. If we fit a model that fails to account for detection effects across taxa, which would explain why we consistently under-detect certain taxa in all of our samples, we end up with a model like this: essentially correcting only for contamination in samples, and we still see deviation from the x-equals-y line. If we fit the full model, the one I've introduced already, I think we see something quite stark, which is that protocol W, while still biased in some sense towards detecting some taxa over others, has much lower variance than protocols H and Q. So I think this is a nice reframing: not 'how can we choose a protocol that's accurate on average?', but rather an acknowledgment that we're never going to get accuracy for free. We can correct for this bias in relative abundance estimates, so the challenge becomes minimizing variance: we're prioritizing precision, and really both precision and accuracy, because we can choose an experimental protocol that's low variance now that we have a tool to correct for bias. Another really important way this method can be used is to remove contamination from samples. The idea is that a lot of the tools that currently exist in the microbiome literature for removing contamination label taxa as contaminant or non-contaminant, and don't have a framing like 'maybe it's a contaminant in this sample but not in that one'. What we advocate for here is an experimental tool called a dilution series: instead of just processing their sample, biologists process their sample as well as that same sample diluted down with, say, water in a one-to-three ratio, then maybe again in a one-to-nine ratio, and so on. What we can see from this graphic is that if contamination is constant across all samples, the proportion of contaminants is greater in the more diluted samples; the short sketch after this paragraph shows the arithmetic behind that. Okay, so let's take a look at what this looks like. This is data from Karstens and co-authors. I made the claim that we have contamination in our samples, and one piece of evidence for this is that in a community in which we know there to be only eight strains present, we observed 248 strains in total.
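A minimal sketch of why dilution series expose contamination, with purely illustrative numbers: the specimen's signal is diluted away while contamination introduced during processing is held constant.

```r
# If specimen DNA is diluted but processing contamination stays roughly
# constant, the contaminant share of reads grows with dilution.
specimen_signal <- 1000          # arbitrary units of specimen-derived template
contam_signal   <- 10            # contamination, assumed constant per sample
dilution_factor <- 3^(0:5)       # undiluted, 1:3, 1:9, ..., 1:243

contam_prop <- contam_signal /
  (contam_signal + specimen_signal / dilution_factor)
round(contam_prop, 3)
#> climbs from about 1% in the undiluted sample toward a majority of reads
#> in the most diluted samples
```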
Coming back to the data: that's 240 contaminants in this dataset, across, I think, nine samples or so. What Karstens and co-authors were looking at was eight bacterial strains mixed in equal ratios, 12.5% each, with a dilution series of each of these samples. What the model we've introduced can now account for is not only the detection effects, meaning we're not expecting to see 12.5% of observed reads for each strain, but also contamination being greater in the more diluted samples. If we take a look on the left here, we've got observed relative abundances of each of these eight taxa, along with 'other', which is the contaminants; on the right-hand side we've got what's known to be present in these communities. If we dilute this sample down three-to-one and sequence it again, you can see we observe slightly different proportions, and a slightly increased relative abundance of 'other' over here. As we continue to dilute, again one-to-three, so a one-to-nine ratio of the original, we see more and more contamination. By five rounds of one-to-three dilution, we're observing about 25% of our reads coming from contaminants, and at eight rounds of three-to-one dilution, almost 80% of our sequences come from contaminants. So that's what the full dataset looks like. We've only got nine samples here, so we're going to do what we can: three-fold cross-validation. If we fit the model I described earlier, we see much improved estimates of relative composition; we're getting much closer to 12.5% for each taxon. Here we're learning on six samples and then predicting on three, and these folds are constructed so that we're not sharing information across folds; similarly for the second fold, and so on for the third. So I would say that we're learning about as much as we can from a relatively small dataset, and that a dilution series is a pretty low-difficulty, low-expense technique that folks can use in the wet lab and then combine with these analytical tools to reduce the impact of contaminants on their analysis. One thing I was particularly excited about in this example is that if we look at 95% confidence intervals for the true relative abundances, then for 238 of the 240 contaminants the confidence interval for their relative abundance in the community covered zero. So our in-practice empirical coverage is pretty good; I'm pretty pleased with that. There's a lot of ongoing work that I'm really excited to do in this area. A big piece of it is experimental design: how do we use these techniques and insights to help biologists better spend their money and better allocate their effort, for example when choosing between replicates and dilution series? How do we best encourage folks to collect the right type of control data? There are also all of these interesting questions at the interface of biology and engineering: how conserved is the detection of a specific strain across the phylogeny? If we're comparing two different species in the same genus, can we learn about the detectability of one species from the other?
I alluded earlier to some of the mathematical challenges regarding estimation, and I said somewhat vaguely that we need some things to be known, whether our p's or our betas; this is essentially getting at identifiability, so a more general understanding of identifiability, on the algebra side, is I think an important consideration. And then the holy grail of this research area is enabling biologists to do meta-analysis: combining data sets that have been constructed maybe on different continents, definitely in different labs, probably in different batches, in a way that improves the estimation of treatment effects or disease effects, as well as the power to look for differentially abundant taxa across different sample types. For folks who are not working in the microbiome area, I thought I'd give a couple of high-level takeaways from this quite specific example from microbial ecology. The main one is about model validation. Statisticians, computational biologists, and biologists have for many years used models that weren't validated, built on the assumption that the expected counts attributable to a species are proportional to its true relative abundance. And through that example I showed earlier, where a true increase in a taxon manifested as an observed decrease, we saw that this is a model that can really mislead us, that can take us in the opposite direction from the biology going on under the hood. So if you're in another field, I'd love you to think about what creative ways and what tools are available to you to do model validation in your setting. In our setting, we used control data, mock communities, that had previously gone largely unused by the statistical community but turned out to be really useful for evaluating model specification. Alongside all of this work, as I've said before, I think the only way to make useful tools is to distribute them as open source software, so all of this methodology is implemented in an R package known as tinyvamp. Something I'm really excited to chat about with folks is how you find the right balance of tutorials and vignettes when you're developing a method that's very flexible. That's something we're really grappling with at the moment: given that there are so many ways to use this method, what could we do to better support our users? This research thread has come together over many years. The paper I've mostly talked about today is work with David Clausen, a student I've been so lucky to work with. Some of the earlier ideas were developed in collaboration with Mike McLaren and Ben Callahan, at NC State for Ben and now MIT for Mike, as well as with Jim Hughes and Brian Williamson at the University of Washington. And I'm really grateful to the National Institute of General Medical Sciences for funding this work. So that was all I had prepared for you today. I would be really excited to take questions. Thank you so much for being here. All right, thank you. This was great.
Let's see if we have some questions from the live audience. Can you hear me okay? I can. I'd just like to say thank you for your talk. I'm currently working at OHSU on a metagenomic shotgun sequencing microbial analysis of esophageal cancer progression. Lisa Karstens, who you referenced, is on our project right now, so this is a bit of a small-world moment. I've been personally working on detection of viral genomes, double-stranded DNA viruses (not RNA-stage viruses) that integrate into the human genome, and it's been difficult to normalize the counts because of all the things you talked about, the high-complexity regions and detection and so on. So I'll probably be sending you a bit of a long email after this, but I just wanted to... That's wonderful, yeah, bring that up. Yeah, there's a lot of interplay in what drives disease in terms of the human microbiome, but this has been just a really wonderful talk, so thank you very much. Thank you so much, I really appreciate that, and I look forward to your email. I actually saw some data quite recently from colleagues at Seattle Children's who were also looking at viral mock communities, and this question of how single-stranded versus double-stranded, DNA versus RNA viruses impact detection efficiencies is something I know very little about at this point. I'm hoping this is an emerging research area and I really look forward to seeing what you find. Thanks so much for introducing yourself. Yes, thank you very much. Hello. So I have two questions for you; I really enjoyed your talk. My first question is about the model that you proposed and fit to data sets on the order of ten species, if I understood correctly. One of the things I wondered is that it seems to have a lot of parameters, and microbiome data is oftentimes very sparse. So I'm wondering whether it's even possible to fit such a model to real microbiome data, where you don't have the same organisms across many of the samples you've sequenced; you're trying to fit a model, if I understood correctly, with many parameters to sparse data. Yeah, fantastic question, thanks so much for asking it. If you look at the dimension of our data and the dimension of our parameter space, I would say that in general we're not going to be able to estimate individual relative abundances for every taxon in every sample. This gets at the point about identifiability: what sort of replicate data is needed, how even seeing the same specimen multiple times gives us more information, and how we can better leverage what we have. So that's one piece of my response to your excellent question. The other thing I think is important, for the purposes of comparing across samples or sample types, is this zeros problem, where W_ij equals zero. Given that we're trying to estimate relative abundances, failing to observe a taxon in many samples sounds like really good information to me that the true underlying relative abundance is also zero. So this doesn't worry me for the purposes of this calibration model as much as it does for differential abundance comparisons, for example. And to speak to your question about the dimension of our parameter space: you said that there are 10 species in this Karstens data.
Actually, there were 248 species in this dataset, and the model could be fit relatively quickly; I just collapsed a bunch of them into 'other' for the purposes of visualization. So I've been encouraged by how much we can learn from these datasets, even though of course we do have parameter-space-dimension versus data-dimension considerations, absolutely. But I think it mostly impacts the asymptotics; that's my long answer to the points you bring up, and I'm happy to chat more and go into more detail. Okay, thank you for that. So my second question: one source of contamination of microbiome samples, in my experience, is barcode crosstalk. On Illumina, when you multiplex, you get this problem where some samples bleed into other samples because of demultiplexing error. It seemed to me that if you were to just dilute the sample, the barcode crosstalk would remain at a constant abundance; it would not appear to increase in abundance. So I was wondering whether your model fits barcode crosstalk at all as one possible source of contamination, and how that would be dealt with. Yeah, fantastic, thanks so much for bringing that up. This question of barcodes that don't fully anneal and therefore hop from one sample to another, index hopping is one term this phenomenon has been given, is something I've considered in a couple of settings. I've looked at it from a modeling perspective, where we explicitly try to model hops from one read to another, and that becomes computationally intractable very quickly. One of the nice things about this model, or a place where I'm optimistic, is that if you know the true composition of some samples and you also have barcode hopping, that piece is going to be estimated as part of your background contamination profile. Essentially, you're going to have small quantities of every one of your taxa in your contaminant profile, even if those contaminants are arising during the experimental sample preparation process. And so the fact that 238 out of 240 taxa had confidence intervals for their true relative abundance that covered zero gives me a lot of hope that we can still estimate true zero abundances even in the presence of crosstalk, or this index-hopping phenomenon you describe. That's a bit more detail than I went into today, but I'm happy to chat more about this offline. I would say that this method does handle it, but somewhat indirectly; 'maybe better than nothing' is where I would end that. Thanks for that great question. All right, we have a few questions from the virtual audience. The first one is from Mike, and his question is: will meta-analysis inevitably require original count data from all studies, or is there a summary-statistics solution that will be good enough? Great question, Mike, that is really interesting. I think it depends, in particular, on what we're estimating. Are we estimating, say, an average treatment effect across different cohorts? Maybe in that case, doing meta-analysis through these tools could be enabled by the use of summary statistics if count data isn't available, for example. I would have to think about that more, though; that's my off-the-cuff response to that question.
I guess on the other side, if not everyone is publishing their count data, one of the nice things about this Poisson approach giving us estimating equations is that we don't actually have to have count data. We could be looking at coverage data; I guess we could even be looking at proportion data, though in that case we're losing information about differential precision across samples, and maybe some ability to estimate the mean-variance relationship across the breadth of our counts. But I'll have to think more about this question of meta-analysis and summary statistics, because I admit it's not something I've thought about in much detail; thanks so much for that very interesting thought. All right, another question, from Alex Eamons: 'This was fantastic.' It was indeed. 'How did this modeling compare across sample types? Does it work equally well for less diverse sample types, like the human gut, versus more diverse sample types, like soils?' Yeah, fantastic question, thanks so much, Alex, and I appreciate the kind words as well. There are a few different things that leap to mind in response. One is that there are, of course, computational challenges with looking at more diverse environments: greater diversity increases the dimension of your data and the dimension of your parameter space, so computation will inevitably take longer if you're doing calibration with respect to, for example, high-diversity soil environments compared to low-diversity environments such as the human vaginal microbiome. The other thing you bring up is the comparison across sample types. An interesting question, and something we're looking at that is not well documented at this point, is the impact of the extracellular matrix on these detection efficiencies. Are efficiencies essentially the same if we're looking at saline environments, say marine samples, compared to freshwater environments? Or does the amount of, say, carbohydrate in stool impact detection effects for the same taxa? My current thinking is that, from comparisons within relatively simple sample types, efficiencies are pretty consistent; this is data from that paper with McLaren and co-authors. I think we have less information about comparisons across very different sample types, and so these large-scale extrapolations across very heterogeneous data types are a challenge on the biological and model-building side more so than on the statistical side. But I'm looking to see, as more data emerges, what we can say about that. I hope that addresses your question in a couple of different ways. All right. Then another question from Mike: are the models and inference methods relevant for questions about single-cell composition? For example, cell types may have efficiency differences that may lead to over-detection. I didn't fully catch that question, but if I understand, it sounds like Mike is curious about the generalizability of this model to single-cell data, for example, and to other genomic data types. I would say that, having spent many years working with microbial communities, I can attest to the strength of this model in that scenario.
And I'm curious to hear from you and from other folks who have expertise in both single-cell and bulk RNA sequencing whether a model such as this has been validated in that setting. My impression is that, because we're looking at the scale of individual transcripts, it seems much harder to develop the same kind of control data as what we're leveraging in this scenario. But again, my expertise in these other genomics areas is pretty limited, so I would defer to other folks; I have been asking for years whether this has been carefully validated, whether folks have really thought about this in other settings, and I'm not yet convinced I have complete information on that. So I'm happy to continue that conversation with you as well. All right, it looks like we've answered all the submitted questions, so I'll pose one from the floor. Oh, okay, please. Thank you, wonderful presentation. Could you bring up that slide of the dilution data? I think it's the third-to-last, perhaps. Yeah, let me see what I can do. Is this the slide you're after? Yeah, that's right. I thought there was incredible fidelity in the right-hand panel, so I'm just trying to get a handle on what's going on on the left-hand side; are these other analyses being used? Let me see if I understand what you're asking. My understanding is that on the right-hand side, this is your model. That's right. And you're estimating the proportions extremely accurately for all of these. And on the left-hand side, something else has happened; what is going on over there? Yeah, thanks so much for asking me to clarify. What we've shown on the left-hand side is essentially the raw data transformed onto the proportion scale. If you look at a single one of these dilution sets, the sum across all eight taxa, plus an unclassified group, plus 'other', gives a total proportion of one: these are the observed W_ij divided by the sum over taxa of W_ij. So this is essentially what we would be estimating under the model where expected counts are proportional to true relative abundances: plug-in relative abundances from the data, shown on the left-hand side. And then I think of the right-hand side as relative abundances corrected for both contamination and detection effects. Does that answer your question? I think so. What were the confidence interval widths like on the right-hand side? Great question, yeah. As I mentioned, I believe we're learning from, I can't remember, I think it must be three samples, and trying to predict the composition of the sample here. Trying to pretend that n is large when n is three sounds unreliable to me, so we didn't compute confidence intervals for this; I'm not even sure that confidence interval construction through an m-out-of-n bootstrap approach is possible with three samples. So this is not a scenario where I'd encourage inference or hypothesis testing; it's more of a model-fit illustration. But yeah, thanks for your question. If we were looking at more than nine samples, undiluted and then up to eight three-fold dilutions, we could definitely compute those interval estimates, but in this case that's not possible, unfortunately. All right, any other questions? If not, let me say, for our virtual audience, there is a reaction button, so please use those reactions. And for our live audience, let's thank our speaker. Thank you.
Thank you so much for coming, and enjoy the rest of your conference.