Hi, I'm Anita Kozyrskyj. I'm a professor in pediatrics at the University of Alberta, a co-lead of the human cohort platform of IMPACTT, and a microbiome epidemiologist. I was an epidemiologist first, and then 10 or 12 years ago I obtained funding from CIHR to embark on my microbiome research career. I'm telling you this because what I will be talking about, designing a human microbiome study, has an epidemiologic perspective. But I will tone down the lingo when I explain things like study bias.
By the end of the hour, hopefully you'll understand something about human microbiome study design and how to diminish biased results, and be able to make an informed decision about the timing of microbiota biospecimen collection. Dr. Sycuro gave you a really state-of-the-art, up-to-date presentation on how to handle and process a sample. I will focus on timing and its relationship to biology. And at the end, everyone's favorite: how to perform a sample size calculation for a microbiome study, which increasingly will be a requirement for publication.
All right. So, one slide: important decisions in designing a study, any study. I'm going to go through the steps and then go directly to the microbiome issues. What is your research question? A hypothesis usually has two parts, an X and a Y, and how they're related. What study design are you going to select to test your question? What is your primary outcome? What kind of variable is it? Is it a yes/no variable? Is it a continuous variable with some units of measurement? If you want to know about microbiome influences on this variable, what is a meaningful difference in this primary outcome, which usually is a health outcome? What's your primary exposure? Is it just the microbiome composition? Or is it diversity? Or is it some sort of cluster?
We'll talk a little bit about that in the sample size calculation section. Of course, you want to know how many subjects you need to recruit on the basis of your sample size calculation. Realistically, can you recruit them? You've got groups: do you want an equal number of people in your groups? And if you're taking more than one sample, what is the collection interval? So these are some major decisions to make when you're designing a study.
All along, you're thinking that because this is going to be a human microbiome study, it's going to be an observational study. It's usually not going to be a randomized controlled design. And with an observational study, there are sources of bias. What I mean by bias is that your finding differs from the true finding in one direction: the association is higher, the association is lower, or there's no association. And this is not because of your research question. It's because of some other factor in your study. In a nutshell, there are three main types of bias in an epidemiologic study: selection bias, measurement bias and confounding bias. The goal is to minimize these three types of bias to improve the internal validity of your study.
I'm going to ignore measurement bias, because Dr. Sycuro has covered some of those issues very nicely: any change in how your sample is processed, sequenced, et cetera, is measurement bias. What I want to focus on is confounding and selection bias. They both result when there is an uneven distribution of an environmental influence or some other factor that is related to both your exposure of interest and your disease outcome, and that causes a false association between microbiota and disease phenotype, so a false association for the question you are trying to answer.
I'm not going to spend too much time on these terms, but there is one important difference. For confounding bias, there are several options at the analysis stage: restriction, stratification, or multivariable regression can take care of that nasty confounding bias. But once you have selection bias, it cannot be adjusted for. That's because in one of your comparison groups the environmental factor that is causing the bias just happens not to be present, so there is no way you can adjust for it at the analysis stage.
These are words, so next I want to show you what I mean practically. I'm using a very old example from a published paper that tested the influence of infant antibiotic use on gut microbiota. The study was still using culture methods, and I modified the diagram to illustrate my point. You have group two, infants that received antibiotics 1, 2, 3, 4 or 5, different kinds of antibiotics; it doesn't matter which kind. And you have group one, which did not receive antibiotics. Just for your information, the group one infants were delivered vaginally or by cesarean section. Your microbiota measure is the count of Bifidobacteria. Just looking at the figure, you notice that the counts are lower for the infants that received antibiotics and are, on average, higher in those that didn't. So your conclusion could be that antibiotics reduce counts of Bifidobacteria. That's a reasonable conclusion, even biologically. But you notice that C-section, that's the CS, also produces lower counts of Bifidobacteria. So in group two, is the lower count of Bifidobacteria due to the antibiotics, or is it due to the fact that many of these infants were delivered by cesarean section? Cesarean section delivery is a confounding factor in this particular instance.
So in your study, you then determine what proportion of infants in group one and group two were delivered by cesarean section, and I'm going to give you a few scenarios. In the first scenario you learn that the distribution is equal: 50% cesarean births in group one and 50% cesarean births in group two. This is an even distribution of the confounding factor. It should not produce bias, because bias arises when the distribution is uneven.
Then we go to another scenario, a likely one. In group one only 20% of the infants were delivered by cesarean, and maybe that's why their counts of Bifidobacteria are higher overall. But in group two, your group of interest in terms of antibiotic treatment, 80% were delivered by cesarean. A big imbalance. What you can do at your analysis stage is some modeling, restriction and so on to account for this confounding bias. You have a solution to your problem.
But what happens in scenario three? In scenario three, all the infants in group one were delivered vaginally and all the infants in group two were delivered by cesarean. So you have no infants delivered vaginally in group two, and no infants delivered by cesarean in group one. The problem is that you cannot do anything with it. You can never know whether your result, reduced counts of Bifidobacteria with antibiotic treatment, is due to the antibiotic or due to the cesarean section delivery. Avoid this scenario when you're planning your subject recruitment and your sample collection. So hopefully that was a good illustration of the difference between confounding and selection bias. Now, the bottom line: know what the main influencers are.
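To make the three scenarios concrete, here is a minimal Python sketch that tabulates a yes/no confounder by comparison group and reports which scenario you are in. The function names, the labels returned, the infant counts and the 10-percentage-point tolerance for calling a distribution even are all assumptions made for illustration; none of them come from the paper discussed in the talk.

```python
from collections import Counter


def confounder_scenario(records, tolerance=0.10):
    """Classify how a yes/no confounder is spread across two comparison groups.

    records   -- iterable of (group_label, has_confounder) pairs, for example
                 ("antibiotics", True) for a treated infant born by cesarean
    tolerance -- hypothetical cut-off (difference in proportions) below which
                 the distribution is called even
    """
    counts = Counter(records)
    groups = sorted({group for group, _ in counts})
    if len(groups) != 2:
        raise ValueError("expected exactly two comparison groups")

    # Proportion of each group that carries the confounder.
    proportions = []
    for group in groups:
        with_factor = counts[(group, True)]
        without_factor = counts[(group, False)]
        proportions.append(with_factor / (with_factor + without_factor))

    # Scenario 3: the factor is entirely absent from (or universal in) a group,
    # so there is nothing to stratify or adjust on within that group.
    if any(p in (0.0, 1.0) for p in proportions):
        return "not adjustable"
    # Scenario 1: roughly the same share in both groups.
    if abs(proportions[0] - proportions[1]) <= tolerance:
        return "even"
    # Scenario 2: uneven, but every cell is populated, so restriction,
    # stratification or regression can still be used.
    return "uneven but adjustable"


def make_group(label, n_cesarean, n_vaginal):
    """Build hypothetical records for one group (True = cesarean delivery)."""
    return [(label, True)] * n_cesarean + [(label, False)] * n_vaginal


scenario_1 = make_group("no antibiotics", 5, 5) + make_group("antibiotics", 5, 5)
scenario_2 = make_group("no antibiotics", 2, 8) + make_group("antibiotics", 8, 2)
scenario_3 = make_group("no antibiotics", 0, 10) + make_group("antibiotics", 10, 0)

for name, data in [("1", scenario_1), ("2", scenario_2), ("3", scenario_3)]:
    print("Scenario", name, "->", confounder_scenario(data))
```

Running a check like this on the recruitment log while enrolment is still open, rather than at the analysis stage, is the point: scenario three can only be fixed by recruiting differently.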
I'm giving the example of infant gut microbiota, where we know it's birth mode, breastfeeding status and antibiotic exposure. Plan your recruitment accordingly to make sure that you have even representation on the basis of these confounding factors.
Let's talk about the three main types of study design, again using the example of an infant microbiota study: the cohort, the case-control and the nested case-control design. I'm going to go through each with a visual representation to help you understand.
In the cohort study, you recruit your infants, who are exposed or not exposed to some factor you're interested in, you collect fecal samples from them, and then you monitor them forward in time to see who does and does not develop the disease. The microbiota assessment occurs before disease onset. It's the ideal study design in terms of the exposure-outcome association, and you have the luxury of collecting all your samples and doing all of your analysis at once.
In a case-control design, the opposite happens. Here you start off with cases and controls: infants, or children because now they're older, that do and do not have the disease. Then you go back in time, maybe to samples obtained in a previous study, and you evaluate what their gut microbiota composition was like. The big problem with a case-control design is usually the source of the study subjects. They may be from different populations or different studies, and this really increases the likelihood of confounding. It's a huge risk, and basically, if you can avoid a case-control study, avoid it.
But we have a hybrid, and this hybrid is quite useful. It's called a nested case-control study, and you start off with a cohort study. You're starting off with your group of infants, some of whom have some sort of exposure and some not, and you're collecting fecal samples. As you're waiting for the disease to develop, or to assess them at a certain point, you're impatient, or you have limited resources and can only sequence a part of them. So it's like starting off with a cohort design, then interrupting it and implementing a case-control design. You're not analyzing it at full capacity as you would with the cohort study; you're truncating it, and you end up with cases and controls, but from your original study.
What is the issue with the cohort study versus the nested case-control design? Both of them meet the temporality criterion: you start off with your infants, you collect their fecal samples and profile their gut microbiota, and then you monitor them for the disease, which is the outcome. So X occurs before Y. In the nested case-control you still have that temporality in place, but now you've truncated the natural development of the disease, and you're opening yourself up to some of the biases of the case-control design. So one thing to pay attention to, if you pursue a nested case-control design: when you're selecting your cases and controls, do pay attention to the distribution of confounding factors. Check that you have sufficient sample size for infants delivered vaginally in both the cases and the controls, for example, or check that you have a good distribution of breastfed versus non-breastfed infants in the cases and controls.
If you pay attention to this detail, you can proceed to analyze your nested case-control design without problem. If you ignore it and end up with the extreme example of selection bias I gave you, then your nested case-control results will also be biased.
Selection bias occurs at other steps in a cohort study, or any epidemiologic or microbiota study, not just at the selection stage and not just at the monitoring stage, when children in one group or the other are lost to follow-up. You can also meet up with selection bias due to sampling issues, and this is actually a common problem. I think Dr. Sycuro alluded to this: it's not that easy to get a fecal sample from a young breastfed infant. It's a little liquidy, so your sample may be too small, or you may not be able to obtain it at all. As a result, you have a study where you're actually interested in the microbiota of breastfed infants, but you may end up with a predominance of formula-fed infants. This is something to pay attention to, because you may have designed the study correctly, but when it comes to the analysis stage you look in your freezer and say, oh, I don't have enough sample, or it's already been allocated, and so I'm in trouble. Some cohorts have looked into this. If you're going to collect a sample from a breastfed infant, you may need to collect a piece of the diaper. Well, maybe that diaper should be standardized, so that when you're processing it the interference is the same for all samples. So this is a bias related to sample processing that you can anticipate in advance, and you can think about how you're going to deal with it.
This is the last slide of the study design section, and it has to do with temporality. A cohort study does have temporality: your fecal sample collection occurs before you measure your outcome. But that doesn't always guarantee temporality, so I've given you an example of how our group approached this issue, with data from the CHILD Cohort Study. We had three-month fecal samples, collected before skin prick testing was done at one year to determine food sensitization status. Great, the temporality criterion is met. But some infants, by the time you skin prick tested them, may have already developed a food intolerance that caused a change in feeding method, and that altered the three-month gut microbiota. So in some infants, not many, reverse causation has already happened. What do we do in this case? You can, like we did, conduct a sensitivity analysis. You do your analysis with all the infants, then you remove those infants and repeat it, and you see if you get the same result. Just a little helpful solution to this problem.
Our next goal for this afternoon has to do with decision making regarding the timing of your microbiota biospecimen, and again I'm going to focus on the gut microbiota of the infant and the mother. Before I do, I want to mention a paper that is in press, with very useful results on the various influencers of gut microbiota composition in infants. In particular, this Finnish group tested the DNA extraction batch effect, and found that, out of the percent variation explained by all the risk factors, batch effect explained about 6%.
So that's really good to know. We talk about this in more of a hypothetical way, and it's good that groups are now starting to document it. The second most important influencer was birth mode. So here you have a confounding factor coming up as another factor that causes bias. I really recommend you take a look at this paper, which was just published by Jokela.
Now I go on to talk about timing a little bit, in terms of biology. Say you're interested in the maternal gut microbiota during the pre- and postnatal period. A couple of things to be aware of. If this is your question of interest you're going to be reading the literature, so this is just an FYI on some things we know already. During pregnancy the gut microbiota changes from the first to the third trimester. For some taxa this is really a change in preparation for birth. So you have to think about that when you're sampling: do you want to sample once that change has occurred? If you sample too early, you're going to miss the change. If you sample in the first trimester, you're going to get a gut microbiota that is more similar to preconception. Do you want that, or do you want an assessment of the changed microbiota? It depends on your research question. Then postnatally, it takes some time, one to two months, for the changes that occurred in the third trimester to revert back to the preconception state. So again, if you're sampling postnatally, do you want a sample at a time when the microbiota is still more like the third trimester, or do you want to sample afterwards?
In other studies, they were not able to get a sample in the third trimester, so they took a postnatal sample within that time window, and as a result they could say it represented the third trimester. So, some useful things to know. We're starting to see these types of publications.
We know a lot more about changes in the infant gut microbiota. We know that after birth the gut microbiota changes towards the end of infancy to become more adult-like, so some taxa that are more common initially decline over time and others increase. You have to be mindful of when these changes occur and decide: do you want an age-specific cross-sectional sample, or do you want to assess the change over time? Also, there are times during infancy when the influence of an environmental factor is stronger. For example, cesarean section effects are stronger in early infancy than in later infancy. So if that's your question, you want to capture it and take an early-infancy sample. But if your question is not related to cesarean section and it could be a confounding factor, then again you have to pay attention to balance, in terms of having a balance of infants delivered by cesarean and vaginally. The gut microbiota also differs depending on whether you're breastfed or formula-fed, so it's the same issue: is it your primary research question? If it's not, you want to adjust for it, so you seek representation in terms of sample size. And if you want that sample from peak breastfeeding, you need to know the breastfeeding patterns in your population. Is this a study in the United States, where they have shorter maternity leave and so breastfeeding is normally shorter than in Canada, or in a Scandinavian country?
So again, you just have to be mindful of the practices around childbirth and postnatally that would influence the composition of your sample.
Finally, meconium. Be mindful that you may obtain some meconium samples that do not have any detectable microbes in them. Meconium takes up to a day or so to be expelled, and during that time there may be some environmental uptake of microbes, so you might actually find something in your sample, if you're interested in that question. But if it's expelled early, you may not get anything in your sample. And this varies from population to population.
I'm going to pause here, before I finish with the sample size calculation section, and ask if you have any questions. Who here is involved in an infant gut microbiota study? Okay, a few of you are. Anyone in a pregnancy microbiota study?
So now we're on to our third objective: how to perform a sample size calculation. I do want to clarify something before I start. As I mentioned, sample size calculation is going to be a requirement for publication down the road, so it would be good to become familiar with doing one, or at least to have a general sense of it if you ask someone else to do it. When you're thinking about sample size, there are two X-Y questions in terms of the microbiome. Either the microbiome is the exposure and you're predicting a health outcome or a biologic outcome, so the microbiome is the X and the health outcome is the Y. Or the microbiome is the Y, and some environmental factor, intervention or risk factor is your X. In the first scenario, the standard sample size calculations apply. I'm focusing on the scenario where the microbiome is the outcome, the Y. The same sample size calculations are used, but you have to think about your outcome measure, your microbiota measure.
As you know, there are many microbiota measures, and for each one there is a different way of doing the sample size calculation, with some tweaks and assumptions. So I'll be discussing the second scenario. Recognizing that there is a gap in the literature on sample size calculation, the human cohort platform teamed up with the analytical platform of the IMPACTT network to write a paper on sample size calculations for microbiome studies, which came out last year. I will just be highlighting what's in that paper. We tried to make it very accessible and easy to use, and even started with a decision-making tree. We also published a second paper specifically on sample size calculation when beta diversity is your outcome measure.
This is our decision tree, which hopefully will be helpful. Let's start off with a very broad question: is there a difference in your microbial community, yes or no? That's measured by beta diversity, or you can use a collective taxon abundance measure. Drill down further and you have your total microbial diversity, your alpha diversity. You have your specific taxa in terms of relative abundance. You can do something creative and identify clusters based on taxon abundance. And you can just ask the question, is this taxon present, yes or no, like C. difficile in the sample, yes or no. Those are the various scenarios, and for each scenario I'll talk about the sample size calculation. Let's start off with beta diversity.
All of you have seen these really pretty pictures. These are PCoA plots, the cluster plots that visually show that there are differences, and this one shows three-group differences according to breastfeeding. If the clusters are quite separated, you say, oh well, there's no overlap, they're quite different. When there's significant overlap, you have the opportunity to test for statistically significant differences with a PERMANOVA test. But how does one do a sample size calculation to arrive at a beta diversity test in this PCoA plot that is meaningful? That is the question.
If the beta diversity measure, in this case UniFrac, was actually published in a paper, in the supplementary information or in the main part of the paper, then you have some data on which to base your sample size calculation. In this particular instance, yes: this Madan paper in 2016 actually did report, in a figure, the mean UniFrac distances for these different feeding modes. With this information, you have a mean and a standard deviation, as I've reported here in the table. We've got the three groups, and say you're interested in comparing two of them, like exclusively breastfed versus formula-fed. You have the mean, you have the standard deviation, and you can use the very regular, traditional equation for mean differences to calculate the sample size. I've included a whole bunch of math there; this is your basic standard equation to test for differences in means. As you see, I've circled where the data comes from. You have two means. You have the standard deviations; you could pick one, and you can pick the most conservative one. Then you calculate your delta, which is the difference between the means you're interested in testing, and you plop it into your equation. It's there in the denominator, and you throw in some other things.
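The equation on the slide is not captured in the recording. The standard two-group formula for a difference in means, which matches the description here (delta in the denominator) and the Z values of 1.96 and 0.84 quoted later in the talk, is n per group = 2 (z_alpha + z_beta)^2 sd^2 / delta^2. The sketch below assumes equal group sizes, a common standard deviation, a two-sided alpha of 0.05 and 80% power. The function name and all input values are hypothetical; they are not the figures from the published paper.

```python
import math
from statistics import NormalDist


def n_per_group_means(delta, sd, alpha=0.05, power=0.80):
    """Sample size per group for a two-sided test of a difference in two means.

    n = 2 * (z_alpha + z_beta)**2 * sd**2 / delta**2, rounded up.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)


# Hypothetical mean UniFrac distances for two feeding groups (not published values).
mean_breastfed = 0.60
mean_formula = 0.68
sd_conservative = 0.12  # the larger of the two group SDs, also hypothetical

delta = abs(mean_formula - mean_breastfed)
print(n_per_group_means(delta, sd_conservative))  # 36 per group
```

Picking the larger standard deviation, as suggested in the talk, pushes the required n up, so the estimate errs on the safe side.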
So we have a standard equation, we have found the source of information, and you can use this for your sample size calculation. Well, that's great. However, in most instances, these beta diversity measures like UniFrac are not actually published in a paper. You have a nice little diagram, but you do not have the actual data behind that diagram on which to base your sample size. We identified this as a problem, and so our group, as I mentioned, did a simulation study based on American Gut data to provide a range of beta diversity measures that people can use for their sample size calculations. You have those slides, so you can look at that later.
There is something else you can do with existing literature if you want to determine sample size for beta diversity. I'm not going to go into the math or the algebra, but one can use the R squared. If you know anything about regression modeling, R squared is the percent variation explained, and R squared is reported in the tabular results of a PERMANOVA, which is very useful. You can take that R squared, take the square root of it, and throw it into this equation. This will give you an estimate of sample size for beta diversity, for example from UniFrac R squared results, and you can have them for the crude model or for an adjusted model, so you may have some options at your disposal. Another useful little way to estimate some information that's not easily found.
Next we go on to alpha diversity as the outcome, and this is actually the most straightforward of all the scenarios. This example is from the Casals-Pascual paper, 2020. Many times alpha diversity is normally distributed, so there's a mean and a standard deviation.
This was a study on Crohn's disease. Their Faith's PD, which is an alpha diversity measure, had a mean of 13.5, and there's a standard deviation. You can plug these into the same equation, shown on the screen, on the basis that you want to detect a minimally clinically meaningful difference of 1.5 units for this phenotype, and then you can calculate the n per group, or the n for two groups. So that's pretty straightforward.
The next scenario is relative abundance. If you've analyzed relative abundance data, you will appreciate that it is quite skewed; it's not normally distributed. Usually one has to use non-parametric statistics such as the median and the interquartile range. In the example I show here, which is also in the paper, you can take the median and interquartile range and convert them into a mean and standard deviation, and I've shown you on the screen how to do that. If you have a large n, you can make the assumption that the median is close to the mean, and you can calculate the standard deviation as I've shown here. Then you have a mean and standard deviation for each group. Once you have that, you use the same equation I've been showing you, because you can calculate the triangle, the delta, which is the difference you're interested in, from the assumed means and the standard deviation you calculated. Again, the same Z values, 1.96 and 0.84, are plugged into the equation, and you end up needing about 20 in each group. So that's one way, if your sample size is fairly decent, to take median information and convert it into a mean and standard deviation.
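The conversion shown on the slide is not captured in the recording. One widely used approximation, assumed here, treats the median as the mean and takes the standard deviation as the interquartile range divided by 1.35, which holds for a roughly normal distribution and a reasonably large n. The sketch below chains that conversion into the standard two-group means formula, n per group = 2 (z_alpha + z_beta)^2 sd^2 / delta^2. Function names and all abundance values are hypothetical, so the result will not match the "about 20" from the paper's example.

```python
import math
from statistics import NormalDist


def sd_from_iqr(q1, q3):
    """Approximate SD from the quartiles, assuming a roughly normal shape.

    For a normal distribution the IQR spans about 1.35 standard deviations.
    """
    return (q3 - q1) / 1.35


def n_per_group_means(delta, sd, alpha=0.05, power=0.80):
    """Sample size per group for a two-sided test of a difference in two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
    z_beta = NormalDist().inv_cdf(power)           # about 0.84
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)


# Hypothetical relative abundance (%) of one taxon in two groups.
group_a = {"median": 12.0, "q1": 6.0, "q3": 19.5}
group_b = {"median": 20.0, "q1": 14.0, "q3": 27.5}

# Use the larger (more conservative) of the two approximate SDs.
sd = max(sd_from_iqr(g["q1"], g["q3"]) for g in (group_a, group_b))
# Medians stand in for means under the large-n assumption.
delta = abs(group_b["median"] - group_a["median"])

print(round(sd, 2), n_per_group_means(delta, sd))  # 10.0 25
```

Because relative abundance is skewed, this shortcut is only as good as the assumption that the median sits close to the mean; with small or heavily skewed samples it is worth checking against a transformation or a simulation.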
If you convert relative abundance into a categorical variable, you have another way to perform the sample size calculation. You can have abundance higher or lower than the median, or abundance in the upper centile range, and in that case you have two groups. Then you can determine the odds of this outcome. There's a sample size equation there, which looks a bit terrifying, that you can use to calculate sample size on the basis of an odds ratio when relative abundance has been converted into a two-category variable.
Coming to the end, there are two remaining scenarios. One is cluster membership: you have this cluster or you do not. So it's two-category or multiple-category. Once you have a yes/no category, you can use a specific equation, similar to the one I've been showing you, only this is a test for a difference in proportions, because once you have a yes/no, you know what proportion of infants have this cluster versus another cluster. You just plug these percentages into the equation and determine your required sample size. Pretty straightforward, and the same Z values, 1.96 and 0.84, are used in this equation. And finally, colonization status is similar. It's a two-category yes/no, C. difficile present or absent, so you have proportions, and again you can use the same equation I just showed you to determine the n for one of the groups.
And that brings me, oh no, it doesn't bring me to the end. I wanted to say a few words about mediation analysis. I'm not going to show you how to do a mediation analysis, because that's another lecture on its own, but I want to alert you to the fact that it is becoming quite common in the microbiome literature, to show that a microbiota profile characteristic is in the pathway between an exposure and an outcome. This is one of our papers, on maternal pre-pregnancy overweight and its relationship to child overweight at one and three years of age, with a microbiota sample taken at three months. Through this mediation analysis you can determine whether Lachnospiraceae, the amount of it, was in the pathway, or was potentially causal, between a mom being overweight and her child being overweight. This is a sequential mediation. You can have just one variable, the microbiota on its own; here we coupled it with birth mode. And there's another example, of a simple mediation, for questions related to cesarean section birth and child atopic sensitization, with a bunch of microbiota measures as mediators. If you take it to the next step, you can report the proportion of the pathway that is due to a particular group or pattern of microbiota, or to the metabolites they produce. This is highly advanced mediation; I'm just letting you know it's possible. And if we want to understand the intricacies of microbiota and how they function together with the host to produce some of these phenotypes, we need to be using these kinds of approaches.
The reason I'm bringing this up is, of course, sample size calculation for mediation analysis, which actually is pretty straightforward. Your X is your exposure, your Y is your outcome, and your M is your mediator: the triangle. The sample size calculation is based on the alpha and the beta, which are just the association between X and M and the association between M and Y. In order to proceed with the mediation analysis, those conditions need to be met. That's after you conduct your study, but of course you want to be able to calculate the sample size required to conduct a mediation test.
Based on the literature, you can state the X-M association, and Table 3 of the Fritz and MacKinnon paper tells you whether it's low, medium or high, and likewise whether the M-Y association is low, medium or high. On that basis, this nice little table lets you just look up the sample size required for your mediation. So, there you go.
The examples I gave you are in the Ferdous publication from our IMPACTT network. We're very proud of that publication; it took quite a bit of time to put together. We recognized that it was addressing a gap in the literature, to help microbiome researchers with this particular aspect of research, and even statisticians. Of course statisticians can easily do a sample size calculation, but you're the expert on which microbiome measure you want to use and what form it's going to take. That will help the statistician pick the right sample size calculation. Or you can show them this paper and say, I want that one.
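Going back to the two yes/no scenarios, cluster membership and colonization status: the difference-in-proportions equation on the slide is not captured in the recording, so what follows is one common textbook form (unpooled variance, equal group sizes, two-sided alpha of 0.05, 80% power), n per group = (z_alpha + z_beta)^2 [p1(1 - p1) + p2(1 - p2)] / (p1 - p2)^2. It may differ in detail from the version in the paper. The function name and the colonization proportions are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Sample size per group for a two-sided test of a difference in two proportions.

    Uses the unpooled-variance form with equal group sizes.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
    z_beta = NormalDist().inv_cdf(power)           # about 0.84
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


# Hypothetical example: 30% of infants colonized in one group versus 50% in the other.
print(n_per_group_proportions(0.30, 0.50))  # 91 per group
```

The same function covers cluster membership once each infant is coded as in or out of the cluster of interest; with more than two clusters, each cluster-versus-rest comparison needs its own calculation.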