I'm really pleased to be here, and it's great to see that you're all here, because presumably you care about microbiome research. First, I just want to mention that a version of these slides is available through bioinformatics.ca, where we present a version of this course. These have changed a little bit, but they're still under the same Creative Commons license. Today what we're going to do is talk about what biomarkers are and their utility, go over the basics of identifying them, and become aware of some examples; I'll go through in particular one case study of biomarkers identified from microbiome data. And, really, I can't emphasize this enough: it's about appreciating the importance of careful, conservative analysis. We'll be bringing this up a lot, and I think you've already heard a bit about that. For example, with the issue of OTUs, I just want to add a comment that there will be times where OTUs are mentioned here, but you could easily substitute any kind of measure or grouping of sequences. A key thing is to avoid push-button bioinformatics, and that's really why you're here: to learn how to do things properly and to consider all of the issues and pitfalls to watch out for. Anyways, before I start, I would love to get to know the audience a little bit, and I heard that you haven't actually been asked some basic demographics. So I'm just wondering: how many of you would consider yourself primarily a computer scientist, primarily a microbiologist or molecular biologist, or really a mix of both? So who would say they're a bit more computer science? There are a few; we weren't sure, I was just chatting with the other faculty. And who would say they're more of a microbiologist? Okay. And who would say they're really a mix of both? Great, I'm glad to see that group is increasing. That's awesome.
So, that said, everybody comes at this from these different perspectives, and that can be hugely valuable, and I will try to bring up a few points relevant to the different groups, because you really are at different points on a spectrum. Okay, but first I just want to step back for a second and say: microbiome research is exploding. I could show some graph about how much it has increased, but really, notable developments have occurred. One of my favourites is this idea that humans can now be identified by their own microbial cloud. I remember a relative of mine saying it's like an aura: we've got this microbial cloud around us. And these have huge implications for the concept that there are potentially markers of people, of things, of diseases, of other states. So this has really exploded, and what's happened is that a lot of researchers outside of traditional microbiology (in cancer research, heart disease research, depression research) are starting to look at the microbiome and its role, and they're particularly interested in markers of disease and of disease prognosis, predicting in particular how sick somebody is going to get. As a sign of how much this has arrived, the Merriam-Webster dictionary has now defined "microbiome", so it officially exists. And I think the really great sign that microbiome research has arrived is that there's now a microbiome game; when something gets its own game, that's a sign it has really arrived. So I encourage you to check out the Gut Check Microbiome Game, which you can actually purchase through a laboratory supplies company.
So some labs have apparently been purchasing it, claiming it as lab supplies, but I haven't done that. Basically, this research has really taken off, and there's a lot of interest in biomarkers for a few reasons I'll mention. But first let's step back: what are biomarkers? Very generally, a biomarker is a measurable biological property that can be indicative of some sort of phenomenon, such as infection, disease, or environmental disturbance. A key thing to appreciate is that primarily we're interested in either functional biomarkers, looking at biological functions, genes, proteins, or metabolites that are specific to one organism or a group of organisms, or taxonomic biomarkers, which can be specific to a particular species or category of organisms, including the OTUs that, you know, there are some concerns about. And why do we want to identify them? One of the primary reasons these days is to detect and diagnose phenotypes more quickly: there really is a benefit to doing this quickly, cheaply, and more accurately versus, say, metagenomic sequencing. There's the idea of predicting the prognosis of a disease, being able to treat certain people differently based on their prognosis. There's also the interest in "bugs as drugs", using microbes as therapeutics, which raises interesting issues, because bacteria sit in a sort of grey zone: if a bunch of bacteria are put into you, say in a fecal transplant, is that considered more like a tissue transplant of cells, or is it a drug, a therapeutic? The rules around patenting and protection for drugs versus tissue transplantation are very different, so there are a lot of interesting issues around that. But there really are growing success stories.
It's still early days, but one study I wanted to highlight, which I thought was neat, looked at a bunch of infants in the first hundred days of life, comparing children who went on to develop asthma with those who didn't, and asking what changes were occurring in their gut microbiome. They found four genera whose initials spell FLVR, so they called them the "flavour" bacteria, and these flavour bacteria are decreased in early life in the children who developed asthma versus those who didn't. So this was literally a biomarker of asthma, but notably, when they added these flavour bacteria to germ-free mice, they decreased airway inflammation, implying you could even use them as a therapeutic. So again there's that bugs-as-drugs concept: finding biomarkers of things that have changed, and then maybe those microbes can be re-added if there's a dysbiosis, a change in the microbiome, that could basically be corrected. But there are lots and lots of applications, and I'm not going to go through these much, just to say there's differentiating inflammatory bowel disease from related diseases, detecting colorectal cancer, looking at COPD in the lung, looking at the human milk microbiome protecting against mastitis, and then a whole suite of environmental biomarkers: markers of pollution, ecosystem health, et cetera. So there are really a lot of different applications, but what I'm going to focus on more is biomarker selection. I want to emphasize this is going to be a bit of a high-level overview; the goal is to get you thinking about issues, pitfalls, and things to watch out for, not to go into a lot of detail. Basically, biomarker selection is the process of removing non-informative or redundant sequences and identifying the ones that are differential between a couple of conditions, right?
And so to find those, we've got a combination of bioinformatics, taking sequence data through quality control and quantification of the different sequences, and then applying statistical methods to those sequences to help find the useful biomarkers. And of course the key part of finding biomarkers is validation: you can find something, but the key is to validate it. Just for the more computer-science-oriented people: primers are short sequences that pick out your sequence of interest from a sample, and often we're interested in qPCR for validation, looking at how many times those primers snag our biomarker of interest. But in terms of how we find them, first things first: you really have to have a plan for what kind of biological data you're looking at and what kind of marker you're looking for. Are you looking at viral data? Are you looking at 16S data versus metagenomics data, et cetera? Then of course there's the issue of obtaining biological samples and getting the DNA, and then you get into the more biomarker-specific part, which is identifying the potential biomarkers, things that are more or less abundant, and then validating them, usually in silico first and then in vitro, as I'll go over. And then you usually optimize further by checking for other closely related sequences as well. In terms of options for biomarker identification, one thing I do want to emphasize is that you can look at bacteria, viruses, or eukaryotes, but I really encourage thinking about combinations, as I'll talk about more later when I show some actual data from a time course. Bacteria are by far the most widely studied, by shotgun (and by shotgun I mean metagenomics) or 16S amplicon analysis: the best studied, with the most methods developed, sort of the easy route.
There's viral data, either shotgun or amplicons; RdRp and g23 are markers used for viruses. Certainly one of the big issues there is that it can be challenging to get enough DNA. But as I'll show you in an example dataset, this idea that viruses potentially have more specificity holds promise: sometimes you can see things in the viral analysis, versus the bacterial analysis, that might be useful for biomarker identification. For eukaryotes, again, not as much has been done, because the large genomes really make shotgun difficult, and there are concerns about the marker-based analyses like 18S or ITS, though there certainly have been a lot of methods developed for eukaryotes. So you've got that organism side, but you've also got the marker side: what kind of marker do you want? There are real considerations here, and again, combinations are recommended. You've got taxonomic markers, where you can use your amplicon or shotgun data to identify taxa. But the problem is you can have that kind of strain-level diversity; you can get false positives and false negatives if you don't really know what your taxon is like. And, as I think has been alluded to, what is a taxonomic group? It's a difficult and somewhat arbitrary concept in some cases. Most notably, taxa can be more variable across some environments, which can be useful in some cases, but in many cases that level of variability can be really problematic if you just want to identify something consistently. Gene-based markers are becoming of increasing interest; the big problem is that you really need shotgun data for that, DNA or RNA if you're doing transcriptomics. So it becomes more expensive than amplicon data, because you really do need that metagenomics.
But as I'll allude to later, I really encourage doing at least some metagenomic sampling in your experiment, because I always find people end up finding it the most useful. You can use 16S or something similar to get a broad feel for your data: get an idea of the variability, get a sense of how the taxa are generally behaving, and what taxa are there. Also, as we'll be talking about later, Morgan Langille, whom you've seen lots of and are going to see more of, is going to talk about PICRUSt, a tool that takes 16S data a level further by predicting function from it. But metagenomics data really does give you a lot of things you can't get from 16S, so it's always good to at least maybe do a survey with 16S but then delve into taking a few metagenomic samples. The one issue is you can have some complications with domain-based gene architecture, but generally you'll sometimes find there are gene functions or markers you can identify that are more consistent across different samples than taxa necessarily are. This is also where PICRUSt comes into play, because again you're moving away from taxa specifically to functional groupings. Other types of markers are diversity metrics; don't forget about looking at alpha and beta diversity. Have you guys actually, I can't remember, did they cover alpha and beta diversity at all? Yeah, okay. And then there's using microbiome analysis to suggest other metabolite markers; that's the way we went with one project. But let's first talk about just getting a marker, with some very basic statistics. For many of you who have taken a statistics course this is obvious, but I just want to remind you about some basics here. The basic concept of a good biomarker is that you're looking at signal to noise: basically, your class means have to be far apart, okay? So this example talks about OTUs, but again, this could be any kind of grouping.
So you might plot your sample frequency against abundance, and a good biomarker of your condition of interest, say the blue condition (it doesn't show up very well, sorry), is an OTU like this one: it's of interest because its mean is very different from that of OTU one in the other condition, and there's also tight variance, with little overlap. And in this other case (I guess I should use a red example, sorry), this is an example of a non-ideal marker, where you have lots of overlap, the means are not far apart, and you've also got an issue with the variance not being as tight. For abundance in general, these illustrations assume you're following a normal distribution, so it pays to look at whether that actually holds. And, why am I not able to, there we go. Okay, so here's an example of looking at a normal class and a blue class. We have samples one to five, and we've measured OTUs one to three. Again, these could be any kind of classification: they could be gene sequences one to three, or k-mers one to three. Essentially what you're doing here is looking at what makes a good biomarker. If you're interested in differentiating the normal and blue classes, OTU one is looking really great: you can see a clear difference. If you tabulate these values, you'll see that OTU one is clearly and very consistently increased in samples one, two, and four versus samples three and five. OTU two is more inconsistent, sometimes higher and sometimes lower. And then you have OTU three, where there's no difference at all. So basically you're trying to look for the ones where you see that clear difference and a consistent measure. So then you basically apply statistical techniques.
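The "class means far apart, variance tight" idea can be made concrete with a simple signal-to-noise score (akin to Cohen's d). This is a minimal illustrative sketch in plain Python, with hypothetical OTU abundance numbers; it isn't from any particular package.

```python
from statistics import mean, stdev

def separation_score(class_a, class_b):
    """Rough signal-to-noise: distance between the class means
    divided by the average within-class spread."""
    spread = (stdev(class_a) + stdev(class_b)) / 2
    return abs(mean(class_a) - mean(class_b)) / spread

# Hypothetical abundances of two OTUs across five samples per class.
otu1_normal, otu1_blue = [5, 6, 5, 7, 6], [20, 22, 21, 19, 23]   # far apart, tight
otu3_normal, otu3_blue = [5, 20, 9, 14, 7], [6, 18, 10, 13, 8]   # heavy overlap

print(separation_score(otu1_normal, otu1_blue))  # large: promising biomarker
print(separation_score(otu3_normal, otu3_blue))  # near zero: useless
```

A large score corresponds to the "good" OTU in the slide: means far apart relative to the overlap of the distributions.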
Statistical techniques here can range from very simple, like a t-test to compare the classes; you can write your own statistical analysis; or you can use some really nice, more complex methods. LEfSe is a very popular one, implemented in a Galaxy workflow. Have you guys heard about Galaxy yet? Galaxy is something that allows somebody who's not a computer scientist to build a workflow, a series of analyses, in a fairly user-friendly way through a web-based interface. And certainly, if you're a biologist wanting to do more metagenomics analyses, or genomics analyses, period, getting familiar with Galaxy could be very useful. So LEfSe is one example; metagenomeSeq, implemented in R, is another. Again, I'm not going to go into a lot of detail here, but the idea is that LEfSe identifies features characterizing the differences between two or more conditions. First it identifies statistically different features among classes using the non-parametric factorial Kruskal-Wallis sum-rank test. But then, and this is key, it performs pairwise tests among subclasses using the unpaired Wilcoxon rank-sum test, to check whether those differences are consistent with respect to the expected biological behaviour. So you've got this step of identifying differential features among classes, and then pairwise tests among subclasses to see whether the differences are consistent with what you expect. It then uses LDA, linear discriminant analysis, to estimate the effect size of each differentially abundant feature. You can also do dimensionality reduction if desired; that's a little aside I want to make about microbiome analysis in general, and I would say genomic analysis in general.
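Staying with LEfSe's Wilcoxon step for a moment: the rank-sum test it relies on can be sketched in a few lines. This is a teaching sketch only, not LEfSe's actual implementation: it uses the normal approximation, skips the tie-variance correction, and the data are hypothetical.

```python
from math import erf, sqrt

def rank_sum_pvalue(xs, ys):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney) p-value via the
    normal approximation; rough for very small samples."""
    combined = sorted(xs + ys)
    # Assign each distinct value its average 1-based rank (ties averaged).
    rank_of = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        rank_of[combined[i]] = (i + 1 + j) / 2
        i = j
    n, m = len(xs), len(ys)
    r = sum(rank_of[v] for v in xs)         # rank sum of group xs
    mu = n * (n + m + 1) / 2                # its mean under the null
    sigma = sqrt(n * m * (n + m + 1) / 12)  # its standard deviation
    z = (r - mu) / sigma
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2))) # standard normal CDF
    return 2 * (1 - phi)

separated = rank_sum_pvalue([1, 2, 3, 4, 5], [10, 11, 12, 13, 14])
overlapping = rank_sum_pvalue([1, 3, 5, 7, 9], [2, 4, 6, 8, 10])
```

Cleanly separated groups give a small p-value; interleaved groups do not, which is exactly the consistency check LEfSe's pairwise stage is after.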
We really have to watch out for this: we've got very biased sampling, very biased sequencing, of our microbial world. For example, we have tons and tons of E. coli genomes and not so many genomes for certain environmental organisms. So I often find it's really helpful to do some dimensionality reduction on just the number of sequences I have: rather than keeping tons and tons of E. coli, I take subsamples of those, so I've got a more even spread of organisms in whatever database I'm comparing my sequences to. I can talk about that more later. There's some really nice description of LEfSe on the Huttenhower lab site, and I won't go through it at length, but they've got a really nice set of features and a lot of nice visuals: being able to look at plots of features that differ significantly between conditions, representation of those features on taxonomic or phylogenetic trees, and ranking things by effect size. In short, it walks you through that concept: you have your biological hypothesis about your conditions, you've got your estimates, you use LEfSe, and you get these different kinds of results. Okay, but anyways, I want to get back to the big picture. Yes, sorry? [Question: it appears LEfSe is just non-parametric based, so can we add, for example, covariates?] Ah, that's a good question. Do you know, by any chance? I don't think so. Yeah, great question. But anyways, what I want to do is step back a little bit and talk about assumptions about your data and understanding these methods. You have to understand the assumptions about your data, the statistical methods' assumptions about your data, your limitations, and how to interpret your results.
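The database-balancing idea mentioned a moment ago, capping overrepresented organisms like E. coli so they don't swamp comparisons, could be sketched like this. The record format, cap, and names here are all hypothetical; this is just one way to do the subsampling, not the speaker's exact procedure.

```python
import random

def balance_by_taxon(records, cap, seed=0):
    """Keep at most `cap` sequences per taxon so heavily sequenced
    organisms don't dominate the reference set.
    `records` is a list of (taxon, sequence_id) pairs (hypothetical format)."""
    random.seed(seed)
    by_taxon = {}
    for taxon, seq_id in records:
        by_taxon.setdefault(taxon, []).append(seq_id)
    balanced = []
    for taxon, seqs in by_taxon.items():
        keep = seqs if len(seqs) <= cap else random.sample(seqs, cap)
        balanced.extend((taxon, s) for s in keep)
    return balanced

# 500 E. coli entries versus a single rare environmental organism.
db = [("E. coli", f"ec{i}") for i in range(500)] + [("Rare sp.", "r1")]
slim = balance_by_taxon(db, cap=10)
```

After balancing, E. coli contributes at most 10 entries while the rare organism is still represented, giving the more even spread described above.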
And so, in short, you must choose, but choose wisely; it really depends on what you're doing, and I'll go through a few of the basics you need to consider. One of the biggest issues: are you looking at discrete, categorical results or continuous variables? By categorical I mean: are we trying to predict classes, like disease versus not disease? Or are you trying to handle continuous variables, like what level of, I don't know, LDL or HDL you have in your blood, some continuously measured variable where you want to work out correlations between levels. [Question: sorry, can you step back a second, how would you describe LDL?] Maybe, if that's okay, I'll come back to that at the end; I just want to go through some basics first. So there's the concept of discrete versus continuous variables, and also: do your samples involve known classes, or do you not know how many classes there are, so that you have to do what's called unsupervised learning? Generally, statistical techniques either try to predict labels, for example classification of samples as diseased or not diseased: you know your classes, it's pretty simple, you're just trying to predict them. One of the issues I have to raise with this kind of class prediction is that you really often have to step back, before you do your biomarker analysis, and ask: are those really my classes? Are those really the patients I want to be classifying as diseased versus not diseased? What about the people in the in-between part; how am I going to handle those? Because your biomarkers are only going to be as good as the actual data you're feeding in. Regression analysis is what you use whenever you're trying to predict the level of some variable.
So when you have a continuous variable, a classic example is predicting tomorrow's stock prices: you're trying to predict a level. And very often there's an interest in predicting continuous variables, doing analyses where you want the level and you don't want to just say this is class A or class B. The other thing you have to think about is whether you're doing supervised or unsupervised learning. Supervised learning is where the samples come from known classes: you know, for example, that samples one, two, and four are not diseased and samples three and five are diseased. Unsupervised is where you don't know what the classes are and you want the data to tell you, letting the data drive: you say, okay, these samples are all X, I just want to know what the groupings are, and the data might tell you that samples three and five are different from samples one, two, and four. You usually identify those groupings by clustering. Of course, those aren't really the only two categories; there are also semi-supervised methods, where you give the data a bit of a sense of what your classes are. [Question: I have a question on unsupervised clustering, because it gives a large number of classes and it's up to the researcher to decide which number to pick; do you have any suggestions for that?] Yeah, we were just talking about that, actually, coming in. Again, I'll talk more at the end, but in short: it is a big challenge, and it has always bugged me that these methods say you have to specify how many clusters you have, because you don't know how many clusters you have, right? And that's why semi-supervised approaches are very popular, or attractive: being able to give a bit of a driving sense to your analysis.
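To make the clustering discussion concrete, here is a minimal one-dimensional k-means, the classic unsupervised method where you must pre-specify k, which is exactly the difficulty the question raises. The data values are hypothetical, k is fixed at 2, and the deterministic initialization is a simplification (real k-means implementations use random or k-means++ seeding).

```python
def kmeans_1d(values, k=2, iters=50):
    """Minimal 1-D k-means to illustrate unsupervised clustering.
    Assumes k >= 2; centroids start spread between min and max."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[nearest].append(v)
        # Move each centroid to the mean of its group.
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids, groups

# Hypothetical per-sample abundances with two apparent groupings.
vals = [1.0, 1.2, 0.9, 1.1, 9.8, 10.1, 10.3, 9.9]
centroids, groups = kmeans_1d(vals)
```

With well-separated data the two groups fall out cleanly; the hard part in practice, as the question points out, is that nothing in the algorithm tells you whether k should have been 2, 3, or 10, which is why follow-up statistical validation of the clusters matters.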
But I will come back to that at the end. I do want to say that you really have to watch that issue of how you cluster; all you have to do is take a bunch of vegetables and put them out on a table and see how differently people group them. One of my favourites, and my son was referring to this the other day: this concept that computers and machine learning are getting better and better, but they still have trouble with certain classification tasks. Have you ever seen the labradoodle-versus-fried-chicken example? It's hilarious. There are pictures of labradoodles that are the colour of fried chicken, with the same kind of poofiness, and then pictures of fried chicken, and machine learning methods can have trouble with this. Just yesterday my son, who's 14, when I brought up this classification issue, said: but they can't even differentiate dogs from fried chicken, so how can they do this harder problem? So it really is a big challenge, but let me come back to that; I'm digressing. For supervised methods, I really want to emphasize that this is the easy situation: a simpler study design, so the biomarkers can be more robust. What it means is that it's nice if you have that metadata, right? So I'd really like to emphasize the importance of getting and keeping that metadata. In your own cases, if you're ever collecting data and you don't really need a certain type of metadata, try to add it to your sequence data anyway when you submit it to a repository, because you don't know when somebody else is going to find it really useful. And I'm not a big fan of the term "metadata", by the way, because it implies that the sequence data is the most important thing and the metadata is just everything else, the "other" data; really you want your sequence data with any other information associated with those sequences tagged on top.
But the disadvantage, as I alluded to, is that you really might not have well-defined classes, and it's really difficult to find biomarkers when your classes aren't clear, and maybe in some cases are flat wrong. So unsupervised learning can be really nice, because it doesn't assume anything, and you can really see what the clusters are. But how do you know they're real? I really struggle with this issue, as was alluded to: how do you know you're getting real classes? You can arbitrarily classify anything into any two classes, but are they really significant? This is where a further statistical test, showing that those classes are different enough, is really valuable, and where validation becomes really important. It may require testing very large sample sizes to do this properly, and it does tend to be a little more computationally intensive, but I don't find that's too bad these days. We've now got two really big new supercomputer clusters in Canada, one at SFU and, where's the other one? Waterloo, that's right. So at Waterloo and SFU we've got two really big compute clusters. And as an aside, I wanted to see: how many people are not from Canada? Not that many, right? There are a few of you; okay, welcome.
Yeah, welcome! And for those of you in Canada, or anyone collaborating with anybody in Canada, just a reminder: you can get a free Compute Canada account, with lots and lots of storage and CPU access in the base allocation, which is actually pretty decent. Everybody who has any kind of academic appointment or student appointment, or is collaborating with an academic, can get one of these accounts, and it can be very useful for doing certain analyses that are computationally intensive. One concern they have at Compute Canada is that not enough biologists are using it, so they are really keen to encourage that. Okay, so anyways. Say you've identified your groups; I'm going through this very generally, as you can see, because, keeping an eye on the time, I do want to get to the point of going through an example, as a better way to show this. Once you've identified your groups and you want to choose a biomarker, you need to test it, and PCR is obviously a really good example. So for example you could identify something using a tool like MetaPhlAn, a marker-based tool, cluster the reads, find conserved sequences, and verify that those sequences are selective; again, I'll talk about that more later.
You design some primers, for example using Primer Prospector from a sequence alignment, or Primer-BLAST, which designs primers specific to a clade, and basically from there you validate what you're getting. But I would like to go through a case study, just to give you a sense of an example of developing some biomarkers and some of the challenges that exist, though again not in super-great detail. I just want to acknowledge the great group of researchers we've been collaborating with on a Genome Canada study that was looking at using metagenomics to identify markers of water quality; there's some more information at watersheddiscovery.ca, though it hasn't been updated in a while. I'd particularly like to acknowledge Will Hsiao at the BC Centre for Disease Control (and Natalie should be acknowledged here too), Patrick Tang, who used to be at the BC Centre for Disease Control, and Matthew Croxen, who basically played a key role in getting some samples, and Thea Van Rossum and other graduate students in my group, who were really leading the bioinformatics analysis for this project. So, why do we care about this?
We really wanted to move towards an ecosystem approach to water quality monitoring. People tend to feel pretty relaxed about their water; many people feel confident about water quality. But water is becoming a big issue: it's a valuable resource, the climate models are not good in terms of how much water California is going to need, and they're looking towards Canada as a source. We do have lots, but we don't have an unlimited supply. So there's a lot of interest in improving water quality monitoring and identifying issues more at the source rather than just testing downstream. Even at the BC Centre for Disease Control, which is the centre for water quality testing in BC (which is a bit unusual: environmental testing actually housed at a disease centre), a lot of testing is done at the tap. What we want to do is look more at the source and nip these sources of contamination early, so we can avoid the situation in those regions of Canada that are under boil-water advisories 365 days of the year. The standard approach targets fecal coliforms, and the coliform test is woefully inadequate. There are a lot of problems with false negatives, because not all pathogens are coliforms, and also false positives, because not all coliforms are pathogens: a high coliform count doesn't actually mean the water is bad, so a lot of beaches get closed unnecessarily. Or you end up with something like a protozoan pathogen that isn't detected, and people get sick when there's a perfectly low coliform count. So there's real interest in developing a panel of qPCR-based assays, based on metagenomic surveys, to identify these pathogens and measure water quality more accurately using a greater number of markers. So what we did was look at a control
watershed, which is completely protected. In BC we have really fantastic water, largely because the water is basically distilled from the ocean, comes over, falls as rain, and lands in these protected watersheds where not even boating is allowed. So you have this control watershed; a human-impacted watershed, where there's concern about septic tank leakage; and an agriculture-impacted watershed, where there's fecal leakage occurring from agricultural use. Across these different land uses, we looked at samples collected monthly over one year, plus additional hourly time courses; in short, the hourly sampling showed the water was not changing that much, so the monthly samples turned out to be very useful. We filtered the water at different levels, looking at viral particles, bacteria, and protists (or what we call microeukaryotes), and then did Illumina sequencing of the DNA and viral RNA, looking at 16S, 18S for eukaryotes, cpn60, and shotgun metagenomic sequence, and basically did the bioinformatics analysis of this data; it was actually the first time-course survey of river water at this level. And then of course we were interested in biomarker identification. I'll focus on one site that was particularly useful, because we had a sampling site upstream of the contamination and then a site at the point of contamination and a bit downstream, again with monthly water samples. And again, I really want to encourage positive and negative controls; just remember that too many microbiome analyses early on were done without them. You should always have some sort of positive control, bacteria spiked into water (you can buy pre-prepared positive controls), and you should run that with every sequencing run, and do negative controls where you just put sterile distilled water through and run it as a sample. These controls
are really, we found those very useful we were able to detect some contamination at one point and also were able to use those positive controls to assess our methods better and make sure everything was running okay but basically we looked at this microbiome survey looking at both taxa and G profiles looked at differential features and then developing QPCR tests but again the first thing you always want to do before you start is really evaluate your methods so we actually developed some positive controls that were water-like, you know, reflecting types of conditions or types of taxa that are commonly found in water and we evaluated the methods we decided to publish a paper on the evaluation because there had been such a lack of evaluations that were independent at the time most evaluations of accuracy of methods were, you know there's all these methods out here and hey we've developed our method and hey our method is best, you know because of course the way they evaluate it right is optimized to theirs so we wanted to do something more independent and so we evaluated the methods and in short I would say I often get asked there's no easy answer but the methods are all over the map in terms of their precision their specificity versus sensitivity how much they predict at sort of high taxonomic levels and versus sort of more at the species level and in short there's no one method that's great versus others but I would say if you're ever starting out with metagenomics analysis certainly Metaflan as you've learned about is sort of a good fast marker based method just to get a feel for what your data is it has very high precision but low sensitivity and when I say metaflan I mean metaflan two or whatever like for example Megan is another method that's good but you know when I say Megan I mean I can mean Megan four, Megan five is better or Megan six but essentially you're basically this concept of developing biomarkers what we did is we did sort of fast track kind of 
approaches and more in-depth approaches. The fast-track approach let us get some markers very early in the project that we could start testing, to see, for example, whether we got the same thing in other watersheds. So here's a really fast example, where we took the bacterial shotgun data (Illumina HiSeq) and ran MetaPhlAn, which at the time had about 3,000 reference genomes and can run 3 million reads in about 10 minutes, so it's super fast. First we processed and validated the data with the positive-control validation: DNA-free water spiked with DNA from multiple taxa of lab-cultured bacteria, and we did in fact get those bacteria back. Only 7% of reads were assigned by MetaPhlAn to a species, which is very expected for this method; it does well when your data set contains a lot of known species, for example in a gut microbiome, where more of the species are in the reference databases, so this was not surprising. Of those assignments, 84% were correct. So it's not predicting a lot, but when it does make predictions it's pretty accurate.
Then you want to identify the differential taxa. It's really frustrating for me that for this particular marker they weren't keen on us revealing the data (the university was frustrated by that too), so we call them Taxon 1 and Taxon 2. The idea is that we prioritized high-abundance taxa, on the grounds that if you're going to do a dipstick-style test in water, you'd rather target the more abundant organisms. We then used White's nonparametric t-test with a false-discovery-rate multiple-test correction (very important!) to define differentially abundant taxa, and we found some obviously different taxa between the upstream samples and the samples at and downstream of the contamination site, that is, between the clean water and the impacted water.
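To make that step concrete, here's a minimal sketch of a differential-abundance test in Python. It's a simplified stand-in: a permutation test on the difference in mean relative abundance instead of White's nonparametric t-test proper, followed by a Benjamini-Hochberg FDR correction. All taxon names and abundance values are invented for illustration.

```python
# Sketch of differential-abundance testing between upstream (clean) and
# impacted samples, with Benjamini-Hochberg FDR correction.
# NOTE: a permutation test on the mean difference stands in for White's
# nonparametric t-test; taxa and abundances below are invented.
import random

def perm_test(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q, prev = [0.0] * m, 1.0
    for rank, i in reversed(list(enumerate(order, start=1))):
        prev = min(prev, pvals[i] * m / rank)
        q[i] = prev
    return q

# Relative abundances per monthly sample (invented data).
upstream = {"Taxon1": [0.02, 0.01, 0.03, 0.02, 0.01, 0.02],
            "Taxon2": [0.10, 0.12, 0.09, 0.11, 0.10, 0.13]}
impacted = {"Taxon1": [0.20, 0.25, 0.18, 0.22, 0.30, 0.21],
            "Taxon2": [0.11, 0.10, 0.12, 0.09, 0.12, 0.10]}

taxa = sorted(upstream)
qvals = bh_fdr([perm_test(upstream[t], impacted[t]) for t in taxa])
for t, q in zip(taxa, qvals):
    print(t, round(q, 4), "differential" if q < 0.05 else "not significant")
```

The key point the sketch illustrates is that the multiple-test correction is applied across all taxa tested, not per taxon; skipping it is one of the most common ways spurious biomarkers get reported.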
As a little aside, I don't really have time to talk about it, but I've become more interested in using random forests for some of these analyses, and I'd encourage that as another machine-learning approach for identifying differential taxa. Anyways, we identified 57,000 reads assigned to Taxon 1 and 2,000 reads assigned to Taxon 2, and we prioritized Taxon 1 just because we thought higher-abundance taxa would be better, and there was evidence that they tend to be more accurately predicted. We extracted the sequences from the MetaPhlAn database, literally taking the 607 marker sequences associated with Taxon 1, aligned the sequence reads against them, and then chose the regions of the MetaPhlAn sequences with the most hits. So for example, you have a MetaPhlAn marker sequence in their database, you identify reads characteristic of it, and maybe you find one region that attracts a lot of reads; that region becomes a candidate marker sequence. We then used Primer3 for primer and probe design, looking first in silico, that is, taking those sequences and checking in the computer what hit rates you get for the candidate primers, amplicons, and resulting probes. If you look at the HiSeq reads that contain these sequences, you can see that upstream you get very little, while at the site and downstream you get a good number of reads containing the forward primer, the reverse primer, or the probe sequence.
As another aside, note that we counted matches that are exact or have one to two mismatches, to reflect the fact that PCR has some tolerance for mismatches; those of you familiar with PCR will appreciate that it depends on whether the mismatch is at the 3' or the 5' end of the primer, and thankfully there are nice tools that handle that quite easily. Basically you're choosing sequences that minimize non-specific matches. What's nice is we were able to confirm we could amplify this product. We actually didn't take this one very far, because we got interested in some other markers instead, but we could amplify a product of the right size. So this is identifying a marker based on the differential abundance of a bacterial species, and it served as a pilot, which was great because it was fast: sequence data to primers in a couple of days. Inevitably with any kind of microbiome research it takes a while to get the samples collected, sequenced, and through QC, and meanwhile the people in the lab ready to validate are saying "come on, give me something"; this approach lets you get there quickly. But I really want to emphasize the limitations. It depends entirely on differential abundance of known bacteria: unless your bacteria are highly similar to those in the MetaPhlAn database, this approach won't work. You might have some really interesting markers that are simply not in MetaPhlAn, particularly if you're dealing with samples outside the gut-microbiome realm, with different conditions or sites harbouring many taxa we have not characterized yet. And it's based on taxa, which have been shown to be more variable across environments than gene functions are. So this is a low-hanging-fruit kind of approach. The alternative approach is a more complete analysis of the shotgun data.
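As a rough illustration of that in-silico matching step, here's a sketch that counts reads containing a primer sequence while allowing up to two mismatches, using a sliding-window Hamming distance. The primer and reads are invented, and real tolerance depends on where in the primer the mismatches fall, which this sketch ignores.

```python
# Count reads that contain a primer, allowing up to `max_mm` mismatches.
# A crude proxy for PCR tolerance: in reality, mismatches near the 3' end
# matter far more. Primer and read sequences below are invented.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def matches_primer(read, primer, max_mm=2):
    k = len(primer)
    return any(hamming(read[i:i + k], primer) <= max_mm
               for i in range(len(read) - k + 1))

primer = "ACGTTGCA"
reads = ["TTACGTTGCATT",   # exact match
         "TTACGATGCATT",   # one mismatch
         "TTTTTTTTTTTT"]   # no match

hits = sum(matches_primer(r, primer) for r in reads)
print(hits)  # -> 2
```

Counting such approximate hits per sample, upstream versus at the site, is essentially what the read-containment comparison above is doing.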
That data can be bacterial or viral or whatever, but the idea is to use something like Kraken, or DiScRIBinATE, another method I just wanted to highlight. And again, there's that review of the accuracy of methods; we had so many comments on it from people wanting more information, because we'd laid out "you have to consider this, and consider this", but a lot of people just wanted "tell me what methods I should be using". So we did a little frequently-asked-questions response in the paper's comments, and then the journal changed the display of all its papers (it was BMC, I think; I can't remember exactly where it was published), and all those comments, for every single paper, got lost. Just gone. Last I heard they were working to get them back, but we ended up adding our comments to the bottom of the PubMed citation, so if you go to the abstract of that paper, at the bottom is our attempt to add some FAQs and respond to the comments made there.
Anyway, we did gene-function analysis, at the time with MEGAN 4 and MEGAN 5 (things have changed, with new methods coming out, including ones described in this workshop, or summer school I guess they should call it), using the SEED and KEGG databases. Basically we took the predicted proteins, clustered them, found differential features, and then designed PCR assays. So again, looking at taxa or functions, we identified informative regions for primer design by using CD-HIT to cluster reads by identity, designed primers using Primer-BLAST or the IDT real-time PCR tool, validated those primers using PrimerProspector or Primer-BLAST, and then validated in vitro with qPCR. To put it in a nutshell, because I'm realizing I'm running out of time: the qPCR worked quite nicely for certain markers.
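To give a feel for what the identity-based clustering step does (the real pipeline used CD-HIT), here's a toy greedy clustering sketch: each sequence joins the first cluster whose representative it matches at or above the identity threshold, otherwise it founds a new cluster. The sequences and the 90% threshold are invented for illustration.

```python
# Toy greedy identity clustering, in the spirit of CD-HIT (which uses
# word filtering and is vastly faster). Sequences below are invented.
def identity(a, b):
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / n

def cluster(seqs, threshold=0.9):
    reps = []          # one representative per cluster
    clusters = []      # member sequences per cluster
    for s in seqs:
        for i, rep in enumerate(reps):
            if identity(s, rep) >= threshold:
                clusters[i].append(s)
                break
        else:
            reps.append(s)          # no cluster matched: start a new one
            clusters.append([s])
    return clusters

seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
print([len(c) for c in cluster(seqs)])  # -> [2, 1]
```

The large, well-supported clusters that come out of this step are the ones worth carrying forward into primer design.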
This is looking at Comamonadaceae (I always have trouble saying it), showing the percentage of sequences over samples for the upstream clean water, the source of pollution, and downstream, and it's an example of a taxon where, very consistently, we found those sequences more predominant in the polluted water. This also shows the number of reads in each cluster and the number of amplicons generated in silico. Most notably, when we did the qPCR for one marker gene, aspartate carbamoyltransferase from Limnohabitans, we got some good data, and we've since seen this in other samples as well. With qPCR you have a CT value, and there's a line above which the signal is essentially undetectable; for the upstream clean water we were above that line, that is, at undetectable levels, whereas the impacted water gave a signal. We did have some technical errors in runs where we got no signal for the positive control too, so again, an example of positive controls being useful. But in a nutshell, the point is that through that very simple approach we got all the way to qPCR validation of primers that are now being investigated for improving water-quality testing.
Another comment I want to make, since I'm getting to general considerations now at the end: whenever you do biomarker analysis, it's really helpful to also include some analysis of the ethical, legal, and social issues. In Genome Canada we have this GE3LS component (I'm a member of the board of Genome Canada, so I'm familiar with it), and we found it really useful that, in parallel with the microbiome and biomarker analysis, we were also asking the end users, the people doing water-quality testing, what they want to see in their biomarkers, what they care about, and what issues they have. Lots of great points came up.
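The detection logic applied to the qPCR results can be sketched roughly like this. The CT cutoff and all values are invented, and a real analysis would also consider replicates and standard curves; the point is just that a run is only interpretable when its positive control amplifies.

```python
# Minimal sketch of interpreting qPCR output: a sample is "detected" when
# its CT falls below a cutoff, and a run is invalid if the positive
# control itself fails to amplify. All CT values here are invented.
CT_CUTOFF = 35.0          # CTs at or above this are treated as undetectable

def interpret(run):
    if run["positive_control"] >= CT_CUTOFF:
        return "invalid run (positive control failed)"
    return {name: ("detected" if ct < CT_CUTOFF else "not detected")
            for name, ct in run["samples"].items()}

run = {
    "positive_control": 22.1,
    "samples": {"upstream": 38.0, "site": 27.5, "downstream": 29.0},
}
print(interpret(run))
```

That positive-control guard is exactly what caught the technical errors mentioned above: without it, a failed run and a genuinely clean sample look identical.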
For example, they didn't want to have to lug 50 litres of water back from a remote site, so they really wanted a test that would work with small volumes; obviously in some hospital environments the issue of transporting samples isn't as big a deal, and you don't necessarily have to deal with large sample volumes. It was also interesting how much interest there was in biomarkers with some known association with disease. Take that aspartate carbamoyltransferase marker: they weren't so interested in it, because there was no known association with disease. It was definitely associated with dirty versus clean water, but there was no biological information implying a reason why. So they really preferred to have at least some markers in a panel that pointed towards specific pathogens or towards disease-causing genes, like virulence factors. It's something I really want to encourage: if you're ever developing any kind of biomarker, ask the end users what they're looking for, because that can really help in the end with uptake of the biomarkers, and also with approval. In our case we have to get EPA approval, and I say US Environmental Protection Agency approval because Canada is funny in that, to get things approved environmentally in Canada, Canada looks to the US for approval. So I hate to say it, but what's happening south of the border, with cuts to environmental protection, impacts us indirectly that way, and we may have to step up our game in Canada. It's interesting how much we rely on that, and being aware of that approval process actually guided us, back at the very beginning, in thinking about which markers to prioritize. So do look into that. Okay, anyways, just to
remember that there are other kinds of markers, like community diversity. I'm not a big fan of community diversity as an indicator or marker, but I am a big believer in not limiting the taxa to just bacteria, and in looking at other things like metabolites and gene-based analyses. So I'll show you some data just to illustrate the need to look at other taxa. Here's that same watershed study, looking at bacterial 16S data and bacteriophage data, with the sites and how many kilometres apart they are. It was really striking how much the bacteria were not location-specific. This is well known: water in far-apart parts of Canada, for example, will contain the same taxa. And yet the bacteriophage were really quite distinct; you really see geographical differences with the viruses, and it's not just the bacteriophage, other viruses also show real spatial patterns across the watersheds we were looking at. One reason, which I don't have time to get into, is that we think 16S and metagenomic data do not differentiate active from dormant cells. I want to remind you that when you do metagenomics, you're sequencing a mixture of the live stuff, the dead stuff, and the dormant stuff, whereas the viruses tend to reflect activity: they bloom in response to an actively dividing bacterial population. So one theory is that the viral data partly show what's active, and we're investigating this further. And I want to note that we see the same trend if we look at lower taxonomic resolutions, at genes, at overall metagenome content, or at subsets of phage: viruses really are more distinct between sites than bacteria are.
Here's another bit of data showing other differences. Again, I don't have time to go through it, but this is a Mantel r statistic comparing bacterial taxa versus DNA viruses, bacterial taxa versus bacterial metagenomes, bacterial taxa versus RNA viruses, et cetera, with the degree of synchrony shown by this value. (I don't remember what the green dots are, so I can get back to you about that.) And the NMDS plot here, for example of the DNA viruses and RNA viruses, revealed really surprising synchrony over time. In the Vancouver area, where the samples were taken, we basically have a dry season and a wet season, and you can really see that here: the dry season, then things shift over to the wet season, looking over time. In short, we would not have seen any of this if we'd only looked at bacterial data, and we were very surprised at the degree of synchrony between the DNA viruses and RNA viruses; we're not entirely sure what's going on there. Lastly, I'll mention that there are also differences between the bacterial metagenome and the DNA viral data in how they respond to environmental variables: for shifts from the dry season to the rainy season you get big shifts in the bacterial data but not in the viral data. The viruses are not changing in response to sudden rainfall, whereas the bacteria are, presumably because all this material is being washed into the river. So one thing I want you to appreciate is this idea of looking at different kinds of taxa.
Another thing, and I just had to put in a little plug for something led by Rob Beiko, is this concept of diversity, and that we should really be supporting diversity. There's a lot of conservation biology, and theory developed for how to support diversity
on this planet, but it's all about macroscopic organisms, back to that little cartoon I had at the beginning. We really need to draw on conservation biology and think about microbiomes: how we want to protect them, look at their diversity, and have some stewardship of that diversity, not just for the human microbiome but for the other microbiomes on Earth as well.
And again, remember that markers are only as good as the data they're based on, so you really have to design these experiments carefully, including positive and negative controls. I don't have time to get into it (I keep saying that, but it's true), but low-abundance microbes are suspect, and I really want to encourage you to get rid of the very low-abundance predictions, say anything below 0.1% of the data. A lot of those you should probably throw out, because some are just random sequencing errors that result in reads being falsely predicted as one thing versus another; there's some information about this in that paper. Another paper I'll point you to talks about how average microbial genome size can impact results: if the average genome size differs between conditions, that can affect your normalization and the taxa you predict, because for the same total amount of DNA, a community with a larger average genome size carries a different number of copies of any one gene, relative to all the DNA, than a community with a smaller average genome size. So appreciate these biases, and the limitations of what's in the sequence databases; we really are only scratching the surface there. Consider the microbial life stage too, this idea of dormancy, and remember there's live and dead material in any given sample, and consider looking at taxa beyond bacteria. And here's just another article that emphasizes this point about having proper controls.
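That low-abundance filter is simple to implement. Here's a sketch with invented counts, using the 0.1% threshold mentioned above; in practice you might filter per sample, or require presence in a minimum number of samples, rather than pooling everything.

```python
# Drop taxa whose overall relative abundance is below 0.1%: very
# low-abundance calls are often sequencing errors or misassignments.
# Taxon names and read counts below are invented.
counts = {"TaxonA": 52000, "TaxonB": 300, "TaxonC": 12}

total = sum(counts.values())
kept = {t: c for t, c in counts.items() if c / total >= 0.001}
print(sorted(kept))  # -> ['TaxonA', 'TaxonB']
```

The exact threshold is a judgment call; the important part is deciding on it (and on negative-control subtraction) before you go hunting for biomarkers.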
Okay, I also want to emphasize that biomarker discovery is really just the start: validation is key. And I want to encourage you to avoid overselling the microbiome, if you've heard that phrase from Jonathan Eisen; there's a lot out there right now, a lot of, what do you call it, snake oil is the phrase in English, and we really need to watch out for that and make sure the predictions we make are as robust as possible. That said, there is a lot of promise for the future. I think we're going to see this idea of the microbiome, and an appreciation of its important role, keep growing: in our bodies, the gut microbiome is literally like another organ, a tissue that functions and produces things like the metabolites used for circadian rhythms, et cetera. Of course, this work was done with a large team of people, and I really want to acknowledge researchers from the BC Centre for Disease Control and SFU, in particular my students Thea and Mike, for playing a great role in moving this work forward. We're shortly submitting a paper that's the culmination of a lot of these efforts, and if anybody's interested in this stuff, we can certainly disseminate it.