So, good afternoon. This is the third day, and I suspect many of you feel that your brains are rapidly filling up with thoughts and ideas, and you're desperately requiring a decompression phase, and the last thing you want to see on the last afternoon is detailed compositional statistics. But here we are. Before we get started with that, though, I want to ask if there are any concepts that have come up over the last couple of days that you really, really, really want to ask a question about and haven't had the chance to yet. It could be anything. I can't promise I can answer it, but if you have questions about OTUs, or things that aren't OTUs, or stuff that John said this morning, or whatever. Yeah, Fiona. Just as a last sort of stalling tactic, how many of you are coming out of this more excited about metagenomics, even before my lecture? Excellent. Okay, good, good, because, you know, we try to strike the right balance of enthusiasm and skepticism, and at one point I had a very smart student in my class say, "Is bioinformatics good for anything?" So I'm glad that we're not in that space, and hopefully you've been able to become as excited about this stuff as we are. So let's get moving. We're gonna talk about fancy statistical stuff. So, what I'm hoping we can get through in the next two hours, I suppose, between the lecture and the workshop. That's what it's called. I'm gonna blame any pauses or misstatements on the goat curry. Learning objectives. So Will did a brief introduction to some statistical tools. Actually, not to shortchange John, I believe he talked about some this morning as well. Briefly. Briefly. Okay. You will be able to, or else. What's that? Oh, you've set me up. Excellent. So: formulate important questions about metagenomic data. Well, hopefully you're on that path. There were lots of very interesting topics that came up as we went around the room on the first morning. Understand why compositionality matters. 
Choose an appropriate test given the types of questions asked. I want to emphasize here that I can give you the 20,000-foot view of this, but there's an obscene number of different statistical tests. There's more than you can shake a stick at. And anybody who's taken a biostatistics or multivariate statistics course, or even just tried to look at different distance measures for, you know, microbial community data, will appreciate that we can't run through everything in a realistic amount of time. But just some of the principles, right? It's really the principles that matter, and then in two years, once everything that's being used now has been replaced by other stuff, you'll be able to say, okay, well, I understand what people are trying to do at least, even if you're not reading all the formulas in the supplementary material. And interpret the results. Okay, so we're going to take a little bit of a look in the tutorial at interpreting some results. And then, in very general terms, how can machine learning methods be applied to metagenomic data? How many of you have used machine learning methods before? A few of you. What's your favorite machine learning method? What have you used? Lasso regression. Nice, that's a good one. That's like the "I don't want to go full-on random forest or SVM, I'm gonna do a little bit of machine learning." Yeah, yeah, cool, that's a good one. Okay, so again, not a super detailed introduction to machine learning methods, just enough to say: here's what they are, let's sort of naively use one, now I've made you dangerous, go do some analysis. 
And there's this: I have 65 slides and I go on tangents, so, you know, I'm happy to answer any questions about stuff we get through or do not get through in this lecture, and hopefully, at the very least, if we skip over some of the more tedious stuff, or less tedious stuff, then you'll at least have the reference to look at and say, okay, well, this is what he was gonna talk about if he had better time management. So, one of the things, and this is sort of interesting historically. There's a bit of a story. The Human Microbiome Project started in 2007, 2008, something like that, and there's this great story of a conference call, which is how all great scientific stories start. It had about 200 people on it, and, you know, all of the principals of the HMP, Curtis Huttenhower and so on, and Rob Knight was on the call, of course. Rob Knight, architect of QIIME and many other things. And so, I'm told (I wasn't on this call), at some point Rob said, "So what's the hypothesis?" And there was apparently a minute of silence. None of the other 199 people had an answer to that. However, the key here is that yes, there is some exploratory analysis to be done in metagenomes, right? Part of this is that we're still at a phase where it's like, what the heck is there? We needed a reference for a bunch of healthy humans from St. 
Louis and Boulder, Colorado, in order to compare all of our other subjects against, right? Without some sort of reference that's just collected, as Ernest Rutherford would say, stamp collecting, we wouldn't really have anything to go on. So, you know, if you're collecting data without necessarily a hypothesis in mind, that's okay. That's what exploratory data analysis is for. But if you have a hypothesis, or a hypothesis emerges from your analysis, then you can say, well, what exactly are the predictions of the hypothesis, now that I've seen the results, and how do we properly test that? Now, I'm not going to get into the details of, you know, Karl Popper and statistics and things like that, but there's basically hypothesis testing and exploratory analysis. And so microbial communities are awesome because they break statistics. It's kind of like the story of a hundred years ago, when a lot of statistical methods were developed because the ecologists were like, look, it's messy. So now we're looking at the same thing for microbial communities. Things like the distribution of abundances across all of your samples: it's typically pretty wacky, right? It could be gamma distributed, negative binomial, whatever, and there's a ton of zeros. You look at OTU number 657 and it's like 0 0 0 0 0 0 2 0 0 0 0, and that makes statistics harder. So that's something to worry about. This is the biggie, and we're going to spend a bit of time on this. John apparently talked about this a little bit this morning: the whole point that we're looking at proportions rather than counts, which causes pain. And the big problem is that sometimes people don't realize that it's causing pain, and they just proceed without feeling the pain, and that's not necessarily a good thing. My favorite part of this whole story, my favorite part of metagenomics and microbial community analysis, is the hierarchies, right? Which might be a 
weird thing to say, but, you know, the key to it, and you'll see this in the machine learning example that I'm going to show you later on, is that we can think about things in a hierarchical context. It's like, what's the species? It's E. coli. Well, within E. coli we have a number of strains and isolates and subdivisions and blah, blah, blah. So if you're trying to say, well, what really matters here, what makes the difference between, you know, a healthy tuna and a sick tuna, well, maybe it's not the named species E. coli. Maybe you have to dig further down, looking at 16S or metagenomic functions or even just some sort of phylogenetic representation, in order to get the right view of this. This is exciting stuff, and it's true: phylogenetic, functional, taxonomic, this is where some of the really, really, really interesting questions lie. Which is not to say that you need to explicitly address them. I mean, there's only a certain number of things you can do, but it's certainly fun to think about, even if you're not going to spend a huge amount of time on it. And so, just to give a quick overview, we want to talk about, you know, statistical methods: parametric methods, using inference of distributions, assumptions of distributions, to look for significant differences. We can also think in terms of how we're trying to infer relationships between things, and a classic distinction in statistical learning, machine learning, whatever you want to call it, is between unsupervised and supervised methods. Unsupervised methods are really cool because you're basically just saying, "Data, you know, what looks like what?" Essentially you don't have any preconceived notion. In fact, it's rather a lie. Well, it's not a lie, but, you know, here's a bunch of dots, right? The dots are red and green, and that was a terrible choice of colors, for which I apologize. The dots are red and green, but the key here is that the inference of whatever structure that is was 
done without reference to the redness and the greenness. It's just like, dear method, here's a bunch of points, make them into clusters or higher-order manifolds or whatever, and once you've got that, then you can overlay your points and say, well, what came out of that? Principal coordinates analysis is a great example of that. You don't go to PCoA... well, actually, normally you don't go to... this is going on the web, isn't it? Okay, well, whatever, I'll say it anyway. Normally with PCoA or PCA or non-metric multidimensional scaling, you're not saying, "Fit me a beautiful ordination plot that maximizes the separation between my groups." That's the opposite of what you're supposed to do, except if you're publishing a very prominent paper and use certain methods that are very similar to PCoA but do really try very hard to separate the groups from one another. And that's all I'm gonna say about that, especially in a recording context. You can ask me about it later. The other component to it is the supervised learning. That's where you have some notion of, you know, this is this and that is that, and you're asking the question, whether it's a complex classifier, a simple one, or whatever: what are the rules that I can find in order to differentiate the tacos from the moons? What are the properties that are really important there? Except that instead of an obvious problem like tacos and moons, you've got really subtle patterns in hundreds of ASVs or thousands of OTUs. And so the classifier has to kind of go, let's try this, nah, let's try this, that's a bit better, until it finds a satisfying result or you just give up. So that's the classes: two or more discrete classes. Contrast that with regression, like lasso regression, for example, or, not binomial, whatever it's called, regression. Sorry, logistic regression, thank you, yes, that's it. So basically you have a model and you're trying to come up with 
quality quantitative predictions, right? So that's the contrast: classification, regression. There are also semi-supervised methods, which involve both inference of distributions and inference of relationships, in addition to certain classification-based approaches. So those exist. I don't know if they really come into play in metagenomics or not, and no one else is volunteering, so I don't know. Anyway, you can see that it's sort of a trade-off. And so I just wanted to do a brief introduction to compositionality: what is it, and what sort of problems does it lead to? I thought we'd step away from bacteria for a moment and really talk about cattle. So here's, you know, traditional ecology of multicellular, overly complex organisms, where you've got two different samples, two different populations in fact. This population is ten cows, three of which are red and seven of which are white, and over here we have 30 cows, three of which are red and 27 of which are white. So obviously the number of red cows is the same in the two samples, but the proportion is quite different: three tenths versus one tenth. So then you say to yourself, okay, well, what really matters here? Is it the fact that I have at least three red cows in these samples? Maybe there's a threshold number of red cows and there's like an ecosystem shift, and it doesn't matter how many white cows there are. Or maybe it's the proportion: a greater proportion of red cows leads to some desirable or undesirable ecological outcome. So: same count, different proportion. The obvious contrast to that is we have three red cows on the left and nine red cows on the right, and so now we have the same proportion but different counts. Okay, so obviously we're looking at a different question. But what happens if, let's say, you can only count to ten, or in the case of DNA sequencing, you can only count to 50 million or whatever, depending on the device you're using, 
right? Then you are subsampling, and instead of getting an enumeration of the population, you are getting a sample. Once you've done that, you have no freaking clue, unless you've used some independent method like cell sorting or microscopy or something like that, that in fact there are three times as many cows in the population on the right as there are on the left. And guess what: the number of cows can actually matter a great deal. You don't know if it's actually the same number of cows or just the same proportion of cows, and if it's different proportions of cows, you don't know if it's the same number of cows. These are not knowable things, and this really, really, really messes with your statistics. The additional thing is that different subsamples will have different associated variances, so the impact of subsampling can be important, and one of the keys to a lot of methods is actually trying to model the variance of your subsamples. You either have some sort of simulation approach, or sampling approach, like ALDEx. Is it ALDEx? No, I'm thinking of... yeah, ALDEx. Okay, ALDEx2. Or you do something else to try and capture this element of variation, which can be quite important. It's like you're doing an ANOVA or a t-test: the difference between means is six, is it statistically significant? Well, if the variance of each sample is two, then yes, it is. If the variance of each sample is a hundred and sixty-five thousand, then no, it isn't. These are things you need to consider here as well. So, sequencing. Unless there are fewer nucleotides in our sample than we can assess in a sequencing run, so there's like three E. 
coli in our sample, except that we do amplification anyway, we're always going to be limited by our sequencing capacity. So that's an issue. And so what does this mean? Well, this is a figure from a really phenomenal paper by Greg Gloor at Western University, who's developed or contributed to a lot of these methods to try and deal with compositionality. The table is basically split in two. Well, there are labels there, so split in three, but two important parts. Essentially you go through this list of things that may in some cases be very familiar. Rarefaction, which was quite widely used for a while, is bad for many reasons, and so DESeq addresses some of those. Then your distance measures, the same ones that pretty much everybody still uses: you know, Bray-Curtis, UniFrac, Jensen-Shannon, Jaccard distance, PhyloSor. What else? Does someone else have a favorite distance method that's not listed here? What's that? Euclidean distance. Okay, that's another one, yeah, you can use Euclidean. They all suffer from certain limitations. Ordination: PCoA. Obviously, you know, the important differences in a sample may not depend on the relative abundance of the things that are different. So it's like, you're 40% E. coli, who cares? What really matters is the 2% of Akkermansia here and the 3% over there. PCoA is gonna be like, I don't care about that, or it's gonna bury it down in like your fifth or sixth component with an eigenvalue of who cares. So that's really important. Multivariate comparisons: that's kind of your standard ANOVAs or analysis of similarity or PERMANOVA. Correlations: we all know and love Pearson and Spearman, right? Do I want to do parametric or non-parametric, whatever. And then these differential abundance methods that people want to use to say, you know, do I have different amounts of this over here and that over there. And so, on the other hand (this is the slide between the one I just had and the other one), one thing I 
want to emphasize before we start talking about compositionality, and it's an interesting one, it applies to pretty much everything in science and in life in general, is: here is the best way to do it, here is what everybody does, and here, somewhere in between, is what you're capable of if you know what you're doing. I want to emphasize that you'll see the impact of using methods that do not account for compositionality. In some cases it matters very little; in some cases it matters a lot. What I want to say about this is: do not dismiss a result simply because they used a relatively naive approach. And, spoiler alert, the first part of the tutorial is going to be using an ANOVA that does not account for compositionality. I'm trying to communicate the intuition to you, and then you can go out and use other methods that are based on log-ratio transforms, for example. So again, the proof of the pudding is in the eating, and you'll see some examples where in some cases it makes very little difference and in other cases it makes all the difference in the world. Depending on which of those elements in the table we're talking about, this awareness has arisen, or has been translated, to different degrees. So if we want to talk about, for example, rarefaction: well, rarefaction is, you know, the longest four-letter word in metagenomics. Don't do it. Distances: I mean, there are a couple of compositionally aware distance methods that have just come out in the last year or two, but everybody and their freakin' dog still uses Bray-Curtis and weighted UniFrac and unweighted UniFrac and blah, blah, blah. And that's, you know, still okay. And then correlation analysis: in part for historical reasons, the earlier publication of methods that take compositionality into account, those are generally more sophisticated and kind of more with it in terms of the acknowledgement and the application. Statisticians have known about 
compositionality for decades, right? Some of the methods that have been adopted into this are from like 1980, 1981, so it's nothing new. But as with any new field, there's sort of the first methods that come out, and it's like, that's wrong, and so people come up with other methods, and it's like, well, that's wrong. So we're in that phase of, I would say, rapid progress. John, would you say rapid progress? No? Okay, all right, we're all learning here. Change. Rapid change, there we go. And, you know, there have been some papers comparing standard approaches versus compositionally aware approaches, and the impact really depends on the structure of the data. There are results that are just obvious, right? This is 80% Akkermansia and this is 2% Akkermansia, it's two sets of samples. Is that significant? Yes. Did you use a compositionally aware method, or did you use a standard method? The standard one. Does it matter? Not so much. Your effect size might change a bit, your p-value might change a bit, but from a biological context, you know, you're good. So I'm not trying to say just use the standard approaches. I'm not giving you an excuse. I'm just saying that there is still some validity to think about when you use them, and the best practices are evolving. What this comes down to is what your reviewers say. You submit it and they're like, "Why did you use ANOVA? Reject." So, depending on who you get, they will apply different levels of stringency to the analysis. Okay, so we have on the left all of these standard approaches, and on the right we have some compositionally aware equivalents. Will pointed this out in a slide, and I think this is coming up in a couple of slides: when you talk about compositional approaches, it's a bit ambiguous, because a compositional approach could be something that's either aware or not aware of composition. In this case, the compositional approach means: hey, look at 
composition, let's do something about it. I just wanted to make that unambiguous in case I accidentally, you know, strayed off course with it. I'm not going to talk about all of these things, but I wanted to talk about... and I assume that this is just a typo in the paper, that PERMANOVA is the same as PERMANOVA, because they're both, I mean, they're permutation based, I think they both kind of address this issue. Actually, they're not both the same thing. Correlation methods: we've got Pearson- and Spearman-based, and we've got SparCC, SpiecEasi and various Greek letters. There's actually a really good paper, which I cite in a few slides, that compares like a zillion of these things and really illustrates the ability to detect different types of ecological associations on real and simulated data. And then, just a little bit on... laser pointer. What's the point? The laser pointer point, right. Differential abundance. Some of you have probably used LEfSe, which came out of Curtis Huttenhower's lab, and lefse is a Norwegian pastry, which I didn't know. And on the other side, ALDEx2, which is a more compositionally aware method for inferring differential abundance. Okay, any questions? So we pause to take a sort of mental break. Any questions so far? I assume... I don't actually know what the rho is referring to there, right? I'm like, okay, SparCC and SpiecEasi, we're good, but, you know. And again, like I said, it's a subset of methods that they tested out in this 2016... who was the first author? It was like a zillion authors, and it was all the creators of the different methods. So that's where I would look for clarity on that. SparCC is nice and straightforward in some ways, and I'm gonna kind of touch on it, hopefully at an appropriate level of resolution. Simple things first. Okay, questions you can ask of your data. Easy questions to ask of your data: are there two categories, or more than two categories? Depending on that, you choose either the left column or 
the right column. And then you ask yourself the question, well, okay, what is the key to a parametric method? I'm asking you. What's that? Parameters. Distributional parameters, yes, yeah, exactly. A parametric method will look at your data, try to infer a particular probability distribution with certain parameters to it, and then use that mapping to a probability distribution to actually test whether the differences between whatever are significant. A non-parametric test or permutation-based test doesn't do that. They just work with the data themselves and muck around with the data to try and see if we can get some result that's impressive relative to no signal. So it's kind of like, here's my data, and here's the difference in my data. If I shake up the data such that it's completely randomized, do I still see a difference that's as good as that or better? That's the notion, that's the idea behind the p-value. Another way of thinking about it is simply: are differences between groups, you know, healthy pigs, sick pigs, significantly greater than differences within groups? So there's a whole range of methods there, and there's PERMANOVA. And so, ANOVA, to start with something very basic, very simple. It's parametric, which is both a good thing and a bad thing. Why is it a good thing? I fly out tonight, so I have a bunch of Diet Coke left, and this is the cherry flavor, which is quite good. I know some people are really put off by aspartame, but if you answer this question I'll give you a can. So why are parametric methods good? Nobody likes aspartame? Somebody? Anybody? Fine, fine. By fitting your data to a statistical distribution, you potentially gain more power to infer statistically significant effects, because you're not so much limited by the number of samples you have. You've mapped them into some sort of space where you effectively have an infinite number of samples. Why is it a bad thing? Okay, why are parametric methods potentially bad? Oh 
dear God, absolutely, that's absolutely correct. So, I mean, feel free to re-gift this. You are a hundred percent right, and this is why when you do a regression, when you do an ANOVA... exactly, yeah. So depending on the method you use, there are assumptions, because you're fitting to a distribution. You look at the data and it's like, does it make sense to fit to that distribution? That's where you get things like the assumption of normality in a t-test. You're fitting a normal distribution: do the data seem to be normally distributed? Often what happens in bioinformatics is like, I have 16 million observations, and the t-test is robust with respect to violations of assumptions if I have a lot of data, so forget it, I don't need to check. So that's kind of okay, but the testing of these assumptions is often neglected. And the key to ANOVA is simply: is the sum of squared differences between groups significantly larger than the sum of squared differences within groups? And the key with ANOVA... I mean, a t-test is two samples, right? Is there a significant difference or not? If there is, it's pretty obvious which two groups are significantly different from each other: the ones you just tested. ANOVA, though, can tell you if a difference exists but not where. ANOVA just looks at all the groups and says there's something here. The whole point, the reason you can't just do a whole bunch of pairwise t-tests, is the non-independence of those tests. You do ANOVA, you say, is there a difference or not? If there's no difference, you're like, I guess we're done here. If there is a difference, you have to actually go on and do post hoc tests, such as Tukey's test, in order to say which groups are in fact different from one another. So that's a very important point, and this is something you will see when we use STAMP. And then there are tons of variants, it's ridiculous. I saw one this morning that I'd never heard of before, but I 
can't remember the name of it. MANOVA is used for multivariate responses, when you have multiple output variables that may in fact have some correlation structure. Kruskal-Wallis is the non-parametric answer to ANOVA. How am I doing for time? What time is it? Okay, 20. Okay. With Kruskal-Wallis, you don't use the actual data points, you rank them, and so you get away from this whole, excuse me, distributional thing. Then it's like, is the centroid of ranks over here different from the centroid of ranks over there? That part is similar to ANOVA. But again, if you get a significant result from Kruskal-Wallis, you need to run post hoc tests to say, well, these ones are actually different from each other. PERMANOVA is really cool. It's sort of the complete abandonment of any notion of fitting things to anything at all. Permutation tests, which many of you are probably familiar with, work as I just said before: take your data, which is like category A, category B, category C, and test for an effect size. It's like, oh, the difference between, let's keep it simple, two samples, the difference between means between sample A and sample B is 12. 12, great. Is that significant? Is it important? How do we assess that? We can use other methods, but the other thing you can do is take your data points and randomly shuffle them between categories. When you do that, you should have no real differences at all between the two categories, because you just mucked up whatever pattern you had. So you do this once and you calculate the same difference and you get six. Well, six is less than 12, that's great, it's promising, but obviously you need to do it more than once. So you do it a hundred times or a thousand times, and it's like six, five, two, three, four, four, five, you know, and so 12 is really outside of that distribution. So then 12 is impressive. But if you do these reshufflings a bunch of times and you get like 10 and 15 and six and 53, then 12 is 
not looking so impressive, and so that's where you get the p-value from a permutation test. One of the really cool things about PERMANOVA is that you are not constrained to look at the difference between means or medians. You can have any sort of measure you want that you can calculate from your data, and you just permute, permute, permute, calculate the same measure, and you get your distribution, get your p-value. So that's a really nice thing about it, and as a consequence it can accommodate both, you know, fun statistics more interesting than the mean and also fancy experimental designs. Okay, so that's useful. All right, so, differential abundance. The key question here is what features, and features can be amplicon sequence variants and OTUs and species, whatever that means, and pathways and functions and metabolic networks and, what else, what else can we do, phylogenetic clades, are different between two or more samples. This is where we really want to get into the gory details of what, if anything, really differentiates these samples. And this is where it comes back to: you may have a hypothesis, in which case you say, I'm going to test these things specifically, carbohydrate-active enzymes, the abundance of Akkermansia, whatever. Or it's a fishing expedition. You're like, well, there's 360 ASVs and let's look at all of them. Which is fine, there's no problem with that. It's useful for identifying good guys or bad guys, key functional genes or biomarkers. Now, good guys and bad guys is a value judgment, right? Also, they're gender neutral. But the point is that even seeing something that's overrepresented in one set relative to another, if you've got your healthy individuals and your individuals afflicted with malaise, boredom, whatever, it doesn't mean that those are causal. You make your inferences, but, I mean, bioinformatics is all about generating hypotheses. You say, oh, I see this is a difference, and maybe I know something functionally about these things, 
therefore my hypothesis is that this is actually important. To test that hypothesis rigorously, you need to do experiments, which we don't do, but at least we can sort through the actual data and say, well, here are some possibilities, wave our hands a bit about how this pathway seems to matter, whatever. Yeah, and if you're not familiar with them, sometimes they're called pathobionts and sometimes they're called synergens, you should really read about them, because they're neat. These are things that kind of hang around normally. They're part of the normal microflora, you know, normal, whatever that means, but in the presence of bad influences, like, let's say, certain strains of Pseudomonas, they jump on the wagon of being pathogens. And so this whole thing about good guys or bad guys can also be conditional on the context. So this is a neat method that came out a few years ago from the lab of Curtis Huttenhower. How many of you have used LEfSe before? Yeah. So it's slightly complicated, but it generates really interesting visualizations. Unfortunately it does fall into that standard hierarchy, but nonetheless it does other things really well. So it's kind of a, you know, what should I be doing here? At least it's worth considering. LEfSe does something important: it wants to find groupings, and we'll see more about groupings in a moment, that distinguish two categories of samples based on effect size, because it's not all about the p-values. You can get a p-value of 10 to the minus 9 for a bacterium that's present at 0.1 percent in this type of sample and 0.1001 percent in the other type of sample. Yay, that's really impressive, thanks. So LEfSe really places an emphasis... there's statistical testing along the way, but it really looks at the effect size and presents the effect size rather than obsessing over the p-value. The thing that's really good 
about LEfSe is that it explicitly considers hierarchical relationships in the data, specifically phylogenetic groupings. So how does LEfSe work? Well, it's kind of complicated, but let's see if we can work our way through this. I've broken it down into three steps. What we have here are a number of different samples, so the columns are samples, okay: poo samples or lung samples or soil samples. These are grouped into classes, right, so class one is samples of a particular type, and then you can have subclasses as well. And then you have features, right, so these could be different types of functions. And so the first step is to do this non-parametric Kruskal-Wallis test for each of these different features. Okay, well, here's where we look right at the top. Again, I don't know why people default to red and green, it's so weird, but in any case the point of the top one is that there are no strong tendencies, at least visibly, right? So there will probably be a difference between means between the two, but it's not going to be particularly impressive, because we have red on both sides and we've got green on both sides. Okay, and so what you see in this column here are the different conceptual p-values coming out of each of these features, right? So the top one's 0.13 and the second one's 0.01, and intuitively, you know, you see reds on the left and greens on the right, okay, I get it, and so on and so forth. The fourth one down is 0.00 because we've got bright green on the left and bright red on the right, yes? And so that's okay, and so your first filter is these p-values from the Kruskal-Wallis. Okay, the second step is to do a Wilcoxon rank-sum test looking at differences between different subsets and features, and now we could be thinking about different types of things, like phylogenetic clades. And so we run the Wilcoxon test in these subclasses for the ones that were significant in the first step, and now we get a second set of results, and some of them are
interesting, some of them are not. Okay, this then gets translated into an effect size using something called linear discriminant analysis. The point of LDA is to say: we have a distribution of frequencies or abundances or whatever, can we actually differentiate them using a simple function? And the extent to which you can do this is mapped into this range, essentially looking at your two original classes, right? So if you can differentiate them really well, well, maybe it's higher in class one, in which case it's colored green, and maybe it's a lot higher in class two, in which case it's colored red. And so we're looking at significant differences and remapping them to the subclasses so that we can consider things like nested phylogenetic clades, which gives you something like this. And so we have a phylogenetic tree here, and it's collapsed, and I think it's collapsed maybe for the sake of tractability, so that if there are 200 leaves you don't necessarily want to test all 198 internal nodes on a rooted phylogenetic tree. So you can collapse things where you're maybe warranted to do so, and then you have these tests applied to each of these different clades. And so now here we are: you're not just saying OTUs, right, it's not OTUs, you're not looking at ASVs, you're not just using species. You're actually looking at many possible nested features and trying to say which ones are most impressive. And so looking at this you start to get a sense of where these differences exist, looking at the different taxonomic groups. So Bifidobacterium is much more highly abundant in one type of mouse. And it was an interesting test: it's basically looking at mutants that have different degrees of susceptibility to colon cancer. The point is that there are two groups, and in one group there's a much higher abundance of certain types of Firmicutes, and you can see with color intensity that certain specific clades are really, really
differentially abundant, which leads to the question: why? But that's a whole other analysis. Maybe you have PICRUSt, you run PICRUSt, you're like, all these functions are different, right? But at least you get this nice visualization and a sense of which clades are different, and looking at named species, you can get that as well. But LEfSe doesn't handle compositional data appropriately; it's based on proportions. And so the key to a lot of this, and again, John, I don't know if you talked about this at all: log-ratio transforms. Okay. What's that? Well, you can have it if you want, I can take a break and kind of digest the lamb curry a bit more. Goat curry, sorry. And so there are a couple of things to think about. Proportions, as I said, don't really tell you much about quantities, right? You don't have that information. So proportional differences across samples may not actually be that informative. But the primary way to deal with compositionality is actually not to look at absolute proportions but to look at ratios between your different taxa. So you're not initially looking at, you know, Bifidobacterium is 40 percent, 60 percent, 30 percent, 20 percent, 0 percent. The first thing you're doing is, internal to each sample, you're calibrating Bifidobacterium and everything else to the abundance of the other things, right? Is that fair, John? Is that good? Yeah, okay, cool, he knows this stuff pretty well. So really you do that calibration first within each sample, the idea being that if you have a standard that you calibrate against, that can give you a more robust approach that is resilient to the effects of compositionality. And so let's do something about this, and the thing that people generally do is what's called the log-ratio transformation, right? And this is interesting and has a couple of nice effects to it. And so the key to log-ratio transformations: well,
it's a ratio, therefore you're dividing something by something else. What is the numerator? Well, it's the abundances of each of the things in your samples. What's the denominator, what are you dividing by? Any guesses? It's in the slides, right. There are a couple of things you can do, in fact there are multiple things you can do, from really simple to really, really complicated, but one thing you can do is a magic invariant feature. Okay, so I know that there are always this many E. coli in my sample, right? Well, that's nice, then you can calibrate by dividing by that particular species, right, or OTU, function, whatever; you've got that calibration point. I don't even need to ask what the problem is, right? How do you ever know that? And it's never true. So this sort of what's called the additive log ratio is not really an effective approach. Okay, another one that people use, and this is actually the most widely used one, is dividing by the geometric mean of the taxa to give you the centered log ratio. Okay, so this is not placing the emphasis on one magic thing in each of your samples; it's based on the overall distribution of things in each of your samples, right? So you're going to take the bar from sample number one and divide the abundance of everything in that bar chart by an aggregate measure of the abundances of everything in that bar chart, and it just so happens that using this geometric mean gives you some really nice properties. So that's the ratio, and then you take the log, which gives you a different distribution of your abundances, right? And so things with really high ratios get scaled down relative to the things with lower ratios. Okay, so ALDEx2, and this is like the lightning round because I'm not spending a huge amount of time on this, but do I even want to? Let's see. Sure, why not. So start with your counts, right, your counts of OTUs or ASVs or species or whatever. And the first thing you want to do: you know in advance that
you're going to take a logarithm, and you know that a lot of things are going to be zero, we already addressed that, so you don't want to take the log of zero because that crashes your software. And so add 0.5, right? And this is Greg Gloor's method, right? Like, I think he actually came up with this. Yeah, yeah, so this is Greg Gloor's method. So add a tiny amount, 0.5, just to fudge the fact that the log of zero is awkward. And then, and this is really important, you get your counts and you actually sample from them. So you're not just taking your actual counts, you're sampling from them using this Bayesian approach, because by sampling you get an estimate of the variance of your sample and a sense of how unstable your estimates are, which is really important. What do you do after that? Then you do the magic centered log-ratio transform, which gives you these nice properties that I talked about before. Once you've got that, then you can go to town with your statistical tests, right? In particular, ALDEx2 does a couple, right, Welch's t and the Wilcoxon rank test, a couple of different types of standard statistical tests. And then, very important, and hopefully most or all of you know this: if you're doing 16 billion different tests of significance, one of them is going to give you a really impressive p-value for no particular reason. So, multiple test corrections, right? How many of you use the Bonferroni? Okay, how many of you use something better than the Bonferroni? He raised his hand sheepishly. What do you use? Not correcting. Bonferroni is great if you want to be really, really, really conservative, right? It's like, I divide by this very large number to get a very, um, yeah, well, sorry, yeah. So like the Storey FDR, for example, or Dunn-Šidák, or something like that. Yeah, so there's a bunch of different options, and actually you'll see in STAMP, the software from Donovan Parks, that there's a range of options for multiple test correction, including Bonferroni, which is like a warm hug for many
people, but also including other methods that are a bit more, you know, well suited to the multiple test problem. Fiona? Benjamini-Hochberg, yeah, so it's in STAMP, it's another FDR, right? I think it's a, what's that? Yeah, yeah, that's right, I think it was the first one I learned in biostatistics. I forget how it works. What's the formula? I mean, oh my god, I forget the Benjamini formula. This, I think, is the Storey FDR that was published in PNAS maybe 10 or 15 years ago now, and what's really cool about that false discovery rate is that it kind of looks at, you know, under the null hypothesis you expect an even distribution of p-values from zero to one, right? The Storey FDR basically says, what's the expected distribution? It tries to estimate that from all of the different bars in your graph, and then it says, well, at this end, right, the significant end, do we see some sort of spike, and corrects for that. So it's a really nice method, and I think, is it more widely used than Benjamini-Hochberg? I think it is, but... Is q an exponent, or is it i over m times q? Q's an exponent? Because I think it's... So there's no exponentiation. Oh, what am I thinking of? The i, it's the ranking of the p-value, you know. Okay, sorry about that. So anyway, the point of this is that there are a range of different approaches you can use. The Bonferroni is overly conservative. I've never actually, to be honest, done a rigorous comparison of other false discovery rates and how it impacts on the data sets that we've generated. So STAMP implements a lot of these; if you wanted to muck around with this in the tutorial on our bumblebee data set, not our bumblebee data set, the bumblebee data set, then you could kind of test this for yourself and say, well, does it really make a difference in the number of significant results I get? Any questions up till now? Yeah. So, and you
choose two or five or one, adding something so that it's not zero, because I guess it could have a big impact? Why did they choose 0.5? John, why did they choose 0.5? I mean, it's smaller than any actual count you would get. That's a good question. So if you went from 0.5 to 5, for example, then how does it impact your sampling, right? Okay, so splitting the difference between zero and one, basically. Well, exactly, yeah. So yeah, I mean, you want it to be small relative to the observations, but we're adding this to count data, right, so it's going to be a really tiny amount relative to that. But if it changed by an order of magnitude? That's a really good question, I don't know the answer to that. That's a very good question, actually. It's really good, isn't it? Yeah, yeah. And you know, I don't know how long it's going to last, especially the, oh my god, so there are four flavors, right? There's the cherry one, which is the best, there's the orange one, which is quite good as well, and then there's the, and I apologize if this induces any sort of physiological reaction in people, mango. I learned that there was mango and it actually made me kind of physically ill for a while, because I just tried to imagine those things together and it wasn't working out well. And then I was in Sobeys, I'm like, hey, they have the mango stuff, let's try it, and actually it was not as bad as it seemed. I never bought it again, but it was not as bad as it seemed. And then there's the really weird fourth flavor, which is, what, I can't remember. No, no, it's not mint, but it's like some green herb thing plus, what's that? Ginger lime, that's exactly what it is, yeah. Someone got fired. So this is going up on the web, maybe I'm getting fired. So, okay, what about correlation? You've got a bunch of abundances of things, again ASVs, OTUs, don't use OTUs, species, functions, whatever, and so what you want to think about now is what is the correlation structure among this,
right? Maybe you want to consider all possible pairwise correlations, so you can look at stuff that goes up together and stuff that goes down together, right, because maybe, just maybe, there's some biological relevance to that. Again, causality requires experimentation, but the fact that two things co-vary, when one is up the other's up and when one's down the other's down, at least raises some interesting questions, right? So for example, maybe they are mutualists, maybe they get along really well and one produces something that the other really likes to eat, and so that's why they go up together. Maybe there's a tiny little lag, but that lag might be less than the resolution of your sampling, so maybe it takes an hour and you're sampling every day, so you're not going to see that. Or another possibility is that, you know, what's a good example, you take a person and they have a certain microflora, and then you feed them, I don't know, Krave cereal or something, and the conditions induced by whatever treatment favor both of those organisms independently. So it's not about them getting along together, it's just about the same ecological response, right? Either of these is interesting. You can't really tease these things apart, but if you know something about it you can propose hypotheses and so on for later testing. And so we're thinking about ecological interactions, and there are terms, and this is, oh, by the way, Weiss et al. is this comprehensive test of many, many different correlation network inference methods. It's really good, you should read it. And so they looked at the various different types of models, right? So this is the example I gave you before: plus and plus is mutualism; plus and nothing is commensalism, where one is apathetic towards the presence of the other; plus and minus is a negative interaction, right, if it's an interaction at all. Maybe it's not an interaction, but if it is, it is a negative one, right? Like negative
correlations between bacteriophages and the things that they love to kill, right, which actually has some really important, interesting effects on dynamics in a lot of systems and is sometimes overlooked. And so you've got all of these, and you'd like to be able to infer these by looking at correlation networks. And so what's everyone's favorite method, on the next slide? Well, we'll get to that. But the idea here is: compute correlations between all pairs of entities and then threshold by some statistic or p-value to build a network, right? So you don't want to have a network that is completely connected, because you can't do anything with that, you can't really interpret it visually, so you want to have some sort of threshold, right? Maybe the Pearson correlation is 0.3, maybe the p-value is 0.1 or less; that's the approach that's used for thresholding. All right, so the first correlation method that pretty much anyone learns is the Pearson approach, and essentially it's looking at the covariance divided by the product of the standard deviations, and there are some assumptions behind it. Okay, and so it's quite intuitive: it's a parametric method that looks for things going up at the same time or things going down at the same time. You've also got the Spearman, which is applied to ranks, so it's less sensitive to outliers and things like that. Okay, but again, correlation: I think it's fair to say that the sensitivity of correlation measures to compositionality problems is awful, which may have driven this to be one of the first areas where compositional methods started to make inroads, because if you do not account for compositionality then you will have all sorts of spurious correlations all over the place. Okay, and so various methods were developed to try and deal with this in different ways, and one of the first ones, if not the first one, published in 2012, is called SparCC, whether that's pronounced "spark" or "sparse" or whatever. But the idea here is that, and one of the key
assumptions of SparCC, is that you've got lots of pairwise comparisons but very few interesting correlations. Okay, that assumption is used to pave over some of the parameter-related issues that I'm not going to get into. And the key here is something called Aitchison's test, which is correlation based. The idea, not the implementation, the idea is somewhat similar to ANOVA: it looks at the correlations among ratios of pairs of things, and by aggregating these it can say there's something here, there's enough of a deviation from some sort of null hypothesis that this is interesting. It will not tell you what is different from what else, and it will not tell you the magnitude of the effect size, but this is your starting point. And this was published in 1981, the first version of Aitchison's test. And so how do you infer statistical significance? You're like, okay, I've done n squared comparisons, right, or n choose two comparisons, and Aitchison's test says there's something. Okay, great. And so you assess the statistical significance based on simulation of many variables with no correlation, right? So this is like, we're going to have a whole bunch of abundances that are not meaningful, but obviously there are going to be some differences from zero, right? But what you hope to see is that the correlation between any pair of taxa is way bigger than most of your simulated null distribution, right? That is potentially indicative of something real. Does that make sense? Question? Okay, but feel free to interrupt, because as I say, if you have a question you're guaranteed to not be the only person in the room with that question, so it also helps to wake people up once in a while. Okay, so SparCC, and here's kind of where the rubber hits the road in this chart, right? This is a figure from the SparCC paper, and what they did: they had real data, right, and this is very pixelated, but five different body sites: mid-vagina,
retroauricular crease, buccal mucosa, gut, supragingival plaque. And so each of the dots around the perimeter is a taxonomic group, I believe it's OTUs, and green edges connect dots that have a statistically significant Pearson correlation. A lot of edges. And there are some red ones, which are significant negative correlations, right? Okay, so then this is the real distribution, and you can simulate data with similar structure but no correlation in it, and this is what you get. So this is kind of nice in most cases, right? So let's look at supragingival plaque on the bottom. Okay, we have this many connections, right, and our fake data has very few connections, which means that we expect relatively few statistically significant correlations by chance. So that's okay. But look at the top there, right: even under the fake data condition you still have a lot of connections, and I believe that has a lot to do with the relatively low diversity of the vaginal microflora, right? It's dominated by various types of Lactobacillus, a very different diversity profile, and so you have different null distributions here. And then they ran SparCC, and you can see quite clearly that the impact of SparCC relative to Pearson is really dependent on the type of data you're looking at. And so this was validated on simulated data and tested on real data, obviously, and once again, this is something I said earlier on: the impact of your choice of method really depends on the underlying structure in the data. So one thing that we like to do is, you know, this sort of 19th century principle of concordance of evidence, right: try it a few different ways and see whether the same type of result pops out. Which is great, but make sure that those methods are not just doing the same thing in different ways, right? So, you know, great examples from phylogenetics, right? It's like, I've built a tree
using maximum likelihood with a Whelan and Goldman substitution model, right, and I got a tree and I got statistical support. I built a tree using maximum likelihood with a Jones-Taylor-Thornton model and got the same tree. Well, guess what, those models are almost the same; the fact that you've got two trees that are very similar means nothing. So when you're testing this, if you're using different methods, make sure that the foundations of those methods are somewhat different, right? And that's the nice thing about that Weiss 2016 paper: they really do give information about the contrasts between the methods, so that you can say, well, I don't want to try all nine, but I'm going to try two, and the two I try I know can give me different results. If they can and they don't on a particular data set, well, that's interesting. Make sense? And so just to make it a bit, I was going to say fun, but whatever, this is actually some of our data. This is data that we generated, from a study of 47 individuals in an assisted care facility, and what we had was 47 individuals, roughly four to five weekly time points for each individual, and so that was about 210 16S samples. So Mike in my group, who also designed the bumblebee tutorial, applied certain statistical methods to generate these networks. And so what you're looking at here, it's kind of interesting, this is a correlation network of positively correlated OTUs generated using SparCC. The colors are phyla, right, for whatever phyla are worth, and the size of the node in the network is proportional to the overall abundance in the population. And so, you know, it's OTUs, so that's a fundamental limitation of this, but it's still interesting to look at it and see things like connections: Verrucomicrobia has some positive connections with itself but also with Firmicutes. And Verrucomicrobia are pretty much our favorite phylum, because if you see Verrucomicrobia that usually means Akkermansia, which, if you've heard of it, is a
very, very interesting bacterium that people are studying for probiotic potential, right? So it's a really, really neat organism. Mike has demonstrated that there are actually different types of it that all show different temporal properties and co-variation; that's another story for a different day. Anyway, I thought it was pretty, so I included it. What time is it now? I'd like to go through this. It means that you'll still have time to do some of the tutorial; you might encroach on the coffee break, but machine learning is fun and interesting, so bear with me. So it was a lie: I said we wouldn't get through the slide, because we didn't really get through the slide. So we've talked about statistics, now let's think about machine learning. So it's an interesting sort of philosophical discussion: is there a difference between statistics and machine learning? Terminology is a big part of this, right, and so there are methods that are quite similar between the two. Um, a new version of Java is available, and I just ignore that. Remind me later. Okay, do statistics have a monopoly on probability density functions, right, and normality, whatever? No, lots of machine learning methods also exploit the fact that your variables have probability distributions. Okay, so that's not really a differentiator. Is iterative training exclusive to machine learning? The idea here is that machine learning methods are big and fancy and complicated and therefore have no closed-form solutions. What does that mean? It means that you cannot take most machine learning methods, say here's the data, and have the machine learning method immediately go, okay, data come in, model go out, right? What typically happens with machine learning methods, and I'm just going to talk about supervised methods here: data go into the machine learning method, support vector machine, random forest, artificial neural network, blah blah blah, and the machine learning method has a model inside it and says, using this model I
predict this. And then you look at the accuracy of the model and you say, bad classifier, and the classifier goes back and says, okay, I'll try again. Some parameters get updated and it gives a new accuracy score, the idea being that if the training method is good, over many iterations the machine learning classifier will converge on the optimal solution. Okay, right, but that's the idea. Is iterative training exclusive to machine learning? No. ANOVA is not iterative, there's a closed-form solution; linear regression is not iterative, there's a closed-form solution; but as you get to fancier and fancier statistical methods, then there are iterative procedures as well. Is machine learning alone concerned with predictive accuracy? No. Again, machine learning methods are usually evaluated in terms of the accuracy of the predictions they make, but that's true of some statistical methods as well. So I tend to be a bit nihilistic about this and not worry too much about the distinctions, but it's certainly the case that when people think of machine learning they typically think of far more complex models that are being fitted with many, many parameters, which creates great opportunities on complex data sets, but it also creates some pretty significant risks as well. So why use machine learning? Well, free parameters, right? So you can have these very, very non-parametric models that can fit different aspects of your data, right? An ANOVA has a parameter, right, that's all it can work with, whereas a machine learning method can have all sorts of different parameter values that allow it to fit the training data beautifully, right? And so it has the capacity, right, if it's like, these 15 things interact to determine the outcome of this, pretending causality again, then the machine learning method potentially has the ability to discover those interactions and give you a model that does really well on the data overall. Of course, one problem with that is that the more parameters you have,
the more opportunity you have to learn the data set exactly; I'll give you an example of that in a moment. And the other thing to watch out for is that many machine learning methods are really complicated, and at some point with many methods it becomes magical, right? Support vector machine: data go in, something's happening inside, and then predictions come out, right? Your accuracy is really high, but actually determining what the support vector machine is doing, it's not impossible, but it's really difficult and completely non-intuitive. Some methods are more interpretable and some are hopeless. Different methods perform well on different types of data, right? TANSTAAFL stands for, damn, thank you Larry Niven, there ain't no such thing as a free lunch. Oh, last one, good, I can actually get on the plane now. So yeah, there's no free lunch, that's actually the name of the theorem; it basically says that there is no single machine learning method that dominates across all classification problems. Okay, so where does that leave us? Well, things to keep in mind: there are a lot of different classifiers to consider, right, you can't use them all, but there are certain criteria you can use to think about these things. One is what's called the bias-variance trade-off, and it's got a fancy name but it's very simple. A classifier with high bias has very few parameters that it can fit, therefore it's not good at learning complicated things. That could actually be a good thing in many cases: if the problem is not that complicated, do not use an overly complicated classifier, right? So that's bias. Variance is at the other end, where, you know, some gigantic artificial neural network has a million free parameters and it can therefore fit the hell out of your training data, which can lead to overfitting. Do you care about interpretability, right? If you do, use something like a decision tree; if you don't, there are lots of other methods to try. Do you want training to finish this decade, right? And so there are examples of
classifiers that scale horrifically with the data, right? A support vector machine on 10,000 or 100,000 cases, well, unless there are variants that I'm not aware of that do well on that, I've never had any success at that scale. However, my student Donovan developed a method based on naive Bayes classification, which is very simple and fast, that he was able to train and test, I told him he was crazy and it wouldn't work, but it did, on all k-mers, all words of length 15. I was like, that's not even going to fit on the hard disk, that's a ridiculously large data set, but it worked, it actually worked very quickly. So naive Bayes, because of its simplifying assumptions, actually works really well on large data sets. Does anything about the problem suggest a particular choice of classifier? So that's big and complicated, but already we're sort of getting into this with these other bullet points there. So, generalization: you've trained a classifier on a set of data and, as I said, you can overfit the data such that the classifier knows those data really well, but when you show it something it hasn't seen before it completely plotzes, because it hasn't learned the general rules that define the data, right? And so this is one of my favorite quotes, that most people don't really like, but I use it anyway: a machine with too much capacity, read: a classifier with a high degree of variance, lots of free parameters, is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before, right? That's overfitting. A machine with too little capacity, read: bias, is like the botanist's lazy brother, who declares that if it's green, it's a tree. Neither can generalize well. So this is where you get into stuff, and this is where I might start jumping ahead a little bit. Long story short: you have data; don't use all the data to train the classifier,
use some of it to train, and test with other data that it hasn't seen before, and if it does well on the stuff that it hasn't seen before then it's probably doing pretty well. Blah blah blah, skipping ahead: cross-validation is a fancier way of doing that. Support vector machines: there's some really cool theory behind it. Here's drawing a line that separates the blue category from the red category; there are different ways to do it, and support vector machines try to find the optimum way to do it. It's really cool. Sorry, it's just got to stop somewhere. SVMs, I'm not even going to talk about linear separability; if you're interested in it we can talk about it during the tutorial or afterwards. Training is iterative and interpretation is not a thing. Okay, last thing. So this is a paper we published about three years ago, my PhD student Jesse and I, and we had a bunch of HMP data sets. This is a principal component plot of nine different oral microbiome sites, different oral cavity sites, right? Apparently there are nine oral cavity sites. And so this is an ordination plot of various samples from each of these, including 300 from supragingival plaque, which is plaque above the gum line, and then 300 from subgingival plaque, which is below the gum line. And so the ordination plot tells you something that you could probably have guessed anyway: these things are really close to each other, they're both on the tooth, they're pretty hard to differentiate, right? You can sort of see some tendencies, like there's more green up here and more blue down there, and they're actually quite nicely separated from the rest of the oral sites. So we're like, here's an interesting case. So we decided to tackle this two-class problem. Data encoding is really, really, really important for classification, right? There are many, many different ways to represent your data set, and so we're like, ah, I like biology, what can we come up with? So the first one was OTUs, right,
because that's like, you've got to try OTUs. This was really before ASVs kind of took off, so we focused on OTUs. We also built phylogenetic trees of all of our sequences, and this is where I was talking before about how it's kind of like LEfSe, right? The classifier can actually choose to have a really tight association of bacteria, like a really small clade in the tree, as one of its differentiating features, or it can take a huge set of bacteria, like all of the Proteobacteria for example, to try and differentiate supragingival from subgingival plaque. So the classifier has a lot of latitude to pick what it wants to use in the model. And long story short, the groups that the classifier really liked were this really wide swath of Bacteroidetes, smaller subclades within the Firmicutes, and some stuff within the Actinobacteria, and no Proteobacteria at all. And so even just with this initial screening of the data, we start to get a sense of which things might be interesting and useful for differentiating these. All right, I'll just emphasize: we didn't really care about what body sites we were looking at, we just wanted a hard problem to try this out on.
The other thing we tried, actually, was PICRUSt, right? Take our 16S sequences and predict the functional distribution. All right, we tried that. And so, long story short, the performance of clades was a little bit better than OTUs. The thing about clades is that we had a lot of features we generated, and so without feature selection we had a somewhat lower result. PICRUSt functions really kind of did not work very well, all right? And so 80 percent is like the bane of bioinformatics classification. It's like, we used a classifier and a data set, and we got 80 percent accuracy. Unless of course you're using PSORTb, in which case you get much higher accuracy than 80 percent.
But there's a question here, and this is I think what I'm going to finish with. So we have two groups of samples we've tried to classify, and we tried a whole
bunch of other stuff that I haven't shown you. We've tried to classify, and our ceiling with clades was actually about 82 percent, depending on how we ran it. But how good can you do? Is there some magic data encoding and method out there that can give you 100 percent accuracy? Probably not, right? Intuitively, it's probably not the case. Think about somebody sampling the subgingival plaque, right? Oh, sorry, breaking things now.
And so here's what we did. This comes back to something I said before. We said, all right, well, SVMs are a thing, random forests are also a thing, and SourceTracker, which I don't have time to explain, is another thing again, and they are very different in the approaches they take to the data. So we had done the SVM and we had an accuracy profile across these, and now we tried random forest and we tried SourceTracker. Long story short, out of the 100 percent of our samples, 80 percent were relatively easy, and most classifiers got up to 80 percent or close to it. 10 percent of the samples were never classified accurately by anything, ever, so I would argue that we should just not even worry about them. I can say these are not... okay, I'm not gonna. And so these are the 10 percent of samples that are hopeless. You're going to tell me that you have a classifier, you applied it to the same data set, and you got a hundred percent accuracy, aren't you?
So, 10 percent of samples: maybe our ceiling of accuracy is 90 percent. Now, we didn't get anything close to 90 percent. Well, we combined the classifiers together and said, maybe if we combine the predictions, we can get 90 percent. So we tried it. We got 80 percent. So that was pretty much the end of that. Okay, long story short: microbial data are hard, and all methods have limitations. The end.
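A minimal Python sketch of the ceiling argument above, using made-up toy predictions rather than anything from the paper. The classifier names, the ten samples, and the helper functions (`with_errors`, `accuracy`, `oracle_ceiling`, `majority_vote`) are all hypothetical and chosen only for illustration. It shows how samples that no classifier ever gets right cap the achievable accuracy, and why a simple majority vote can still fall short of that cap when the classifiers make overlapping mistakes.

```python
from collections import Counter

def with_errors(truth, wrong_idx):
    """Toy predictions: copy the truth, flipping the label at each index in wrong_idx."""
    flip = {"supra": "sub", "sub": "supra"}
    return [flip[t] if i in wrong_idx else t for i, t in enumerate(truth)]

def accuracy(predictions, truth):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)

def oracle_ceiling(all_predictions, truth):
    """Fraction of samples that at least one classifier got right."""
    hits = sum(
        any(preds[i] == t for preds in all_predictions)
        for i, t in enumerate(truth)
    )
    return hits / len(truth)

def majority_vote(all_predictions):
    """Most common label per sample across all classifiers."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*all_predictions)]

# Hypothetical toy data: 10 samples, two classes.
truth = ["supra"] * 5 + ["sub"] * 5

# Each made-up classifier is wrong on two samples; sample 9 defeats all of them,
# and two of the three share a mistake on sample 8.
predictions = {
    "svm": with_errors(truth, {8, 9}),
    "random_forest": with_errors(truth, {8, 9}),
    "sourcetracker": with_errors(truth, {7, 9}),
}

per_classifier = {name: accuracy(p, truth) for name, p in predictions.items()}
ceiling = oracle_ceiling(list(predictions.values()), truth)
combined = accuracy(majority_vote(list(predictions.values())), truth)

print(per_classifier)
print("ceiling:", ceiling)
print("majority vote:", combined)
```

In this toy setup every classifier scores 0.8, the ceiling is 0.9 because one sample is missed by everything, and the majority vote stays at 0.8 because two of the three classifiers share a mistake. Real classifiers will overlap in their errors differently, so the numbers here only mirror the shape of the result described in the talk.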