It's really a pleasure to be here and talk to you about biomarker discovery, bringing it all home after all this microbial analysis. I want to say a special thank you to Anna, Maria, Kristin, Mike, and Thea here, and to John Parkinson, for making many of these slides. Today we're going to appreciate what biomarkers are and their utility, cover some basics of how you identify biomarkers, look at some examples, and then I'll highlight a case study of biomarkers we identified from microbial data using a couple of different approaches. Finally, and I can't emphasize this enough, I'll stress the importance of careful, conservative analysis, particularly at the end. So, what are biomarkers? A common definition is that they are a measurable biological property that can be indicative of some phenomenon, such as an infection, a disease, or an environmental disturbance; essentially, they're a marker of some sort of biology. In microbial analysis there's an appreciation that there are really two main types. There are functional biomarkers, where you're looking at biological functions, be they genes, proteins, or even metabolites, that are specific to an organism or to a group of organisms in a community that share a certain function. And there are taxonomic biomarkers, which are very popular, particularly with 16S or other amplicon-based analyses: things specific to a certain species, taxon, OTU, or other grouping of organisms. Now, we identify biomarkers for a few different reasons: for detection in the environment, and for diagnosing a clinical phenotype more quickly, more cheaply, and more accurately than full metagenomic sequencing.
When you do whole-microbiome sequencing, you have the noise of all these other organisms that may be varying more sporadically, and what you often want is to find the key ones that change in response to a certain perturbation. Being able to narrow in on those and have something like a simple PCR-based test is really valuable. Centres like the BC Centre for Disease Control, where Will works, have moved a lot of microbial tests from traditional culture-based tests to PCR-based tests, because you can run them rapidly, at high throughput, and cheaply, and there's certainly a lot of interest in moving metagenomic analyses towards biomarker discovery and towards biomarkers you can actually use as a clinical test. There's also growing interest in what we call "bugs as drugs", the idea of finding cocktails of microbes that might actually be used as therapeutics. Part of the interest is regulatory: traditionally people find drugs and then have to patent them, but there's a lot of debate about what happens with microbes. Are they actually a drug, or are they more like a tissue, so that you're effectively doing a tissue transplant? If giving somebody gut microbes counts as a tissue transplant, the idea is that your gut microbiome is acting like an organ, putting those microbes in is like transplanting that organ, and therefore it's not subject to the same rules as drugs. So there's a lot of interest in finding cocktails of microbes that may or may not fall under drug regulations. And there's a growing success story I wanted to highlight from UBC: an analysis led by Marie-Claire Arrieta, involving Stuart Turvey's and Brett Finlay's groups here at UBC.
Essentially, instead of finding bacteria that were higher in some group, they were finding things that were missing. They looked at asthmatic versus non-asthmatic children and found that in the asthmatic children certain bacterial taxa were lower or missing early in life; basically, there was what they called a gut microbiome dysbiosis occurring in these infants. They then took the gut microbiome from these asthmatic children, considered an unhealthy microbiome leading to the development of asthma, and put it into germ-free mice; another set of germ-free mice received that same gut microbiome supplemented with the four missing bacterial taxa, and those mice actually had reduced inflammation and reduced measures of asthma. So they were able to find bacteria that could act as some sort of therapeutic, or at least as diagnostics for indicating, and possibly preventing, asthma. These are very early days for microbiome analysis; as you probably know, Jonathan Eisen has his "Overselling the Microbiome Award", and there are a lot of issues with trying to move too quickly to biomarkers. But there has been a lot of interest. In the bowel, for instance, I'm on the scientific advisory board of a project that is differentiating inflammatory bowel disease from related diseases and detecting the degree of severity of inflammatory bowel disease using microbiome analysis, developing PCR-based tests that can help clinicians detect whether one patient needs more intensive therapy or monitoring than another.
Detecting colorectal cancer is another example, and in the lungs biomarkers have been identified for progression of COPD. I apologize, I meant to put references here for this work but forgot, so if there's anything you're interested in, just send me a message and I can follow up with references. As one example, a great stat from an allergy conference here a couple of weeks ago: roughly one third of adults diagnosed with asthma basically don't have asthma and are being improperly treated. There's a lot of misdiagnosis where somebody has breathing issues, gets diagnosed with asthma and given an inhaler, and walks around paying all this money for something when they actually have something more serious, like COPD or another disease. So there's a lot of interest in finding markers to ensure more appropriate treatment. There's even microbial analysis in the breast identifying markers protective against mastitis. And, as I'll discuss more in the case study, there are also environmental markers, say markers of pollution or, conversely, of ecosystem health.

So what is biomarker selection? Basically, it's the process of removing non-informative sequences. Again, there's this noise of information, and while doing metagenomic analysis directly to investigate something is still very valuable, being able to remove the non-informative or redundant sequences and identify the things that are truly differential between two conditions is of high interest. So how do we find biomarkers? There's software, as you've already learned, that takes the raw digitized information, performs QC, and then quantifies the different sequences based on taxa or genes. Then you're basically using math, statistical methods, to find these markers, and we'll go over that a bit. The other key component, of course, is validation. This slide is pitched more at a non-biological audience, since most of you will know what primers are, but basically you design primers, biological "hooks" that pick up the marker sequences, and then use qPCR (quantitative PCR) to measure how many times those primers manage to snag your biomarker of interest. Really, the whole biomarker process is that bioinformatics analysis, the math and statistics, plus that validation. Biomarkers involve both the "bio" and the "marker". One key component of finding biomarkers is the initial analysis plan: the initial design of your experiment is critical. Which biology are you looking at? What kinds of measures are you making? It's absolutely critical that those measures are appropriate: are these two patient groups really going to adequately differentiate, or should the patients be stratified into multiple groups? The other component is the marker, which I'll talk more about in a second: what do you want to look at, taxa or genes, viruses or bacteria? With that good design in place, you obtain your biological samples, extract and sequence your DNA, and identify your markers, finding the taxa, OTUs, or functional genes that are more or less abundant in your test versus control. Then you validate these markers, usually in silico first and then in vitro in the lab, and iteratively you further optimize the biomarker.
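The "math" step just described, finding features that are differentially abundant between test and control, with a multiple-testing correction, can be sketched roughly as follows. This is a minimal illustrative stand-in, not the actual pipeline from the lecture: the OTU counts are invented, `perm_test` is a simple permutation test (standing in for tests like White's non-parametric t-test), and `benjamini_hochberg` is a bare-bones FDR correction; a real analysis would use a vetted statistical package.

```python
import random
from statistics import mean

def perm_test(x, y, n_perm=5000, seed=0):
    """Two-sided permutation p-value for the difference in group means
    between samples x and y (labels are reshuffled n_perm times)."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        d = abs(mean(pooled[:len(x)]) - mean(pooled[len(x):]))
        if d >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one to avoid p = 0

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR-adjusted q-values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    for pos_from_end, i in enumerate(reversed(order)):
        k = m - pos_from_end  # 1-based rank of this p-value
        running_min = min(running_min, pvals[i] * m / k)
        q[i] = running_min
    return q

# Invented toy read counts per OTU (rows) for each sample (columns),
# "test" vs "control" condition.
test_grp = {"OTU_1": [120, 135, 110, 128],  # abundant in test only
            "OTU_2": [5, 40, 2, 33],        # noisy, inconsistent
            "OTU_3": [50, 55, 48, 52]}      # same everywhere
control  = {"OTU_1": [10, 8, 15, 12],
            "OTU_2": [30, 3, 25, 6],
            "OTU_3": [49, 53, 51, 50]}

pvals = {otu: perm_test(test_grp[otu], control[otu]) for otu in test_grp}
qvals = dict(zip(pvals, benjamini_hochberg(list(pvals.values()))))
for otu in sorted(qvals):
    print(otu, "p =", round(pvals[otu], 3), "q =", round(qvals[otu], 3))
```

OTUs with low q-values and well-separated, consistent group abundances are the candidates worth carrying forward to primer design and qPCR validation; the rest are the noise that biomarker selection is meant to remove.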
Those are the bits I'm going to focus on here, but I will say a bit about the "marker" concept and the "bio" concept. There's often this question of what you want to look at: bacteria, viruses, eukaryotes? Some big-picture comments as a reminder. For bacteria, you can do shotgun or 16S amplicon sequencing; bacteria are certainly the best studied, with the richest databases of information and the most methods developed, so there are a lot of pluses there. For viruses, you can use shotgun or amplicon approaches, with markers like RdRp or g23 for virus analysis, but it can be really challenging to get enough DNA, and quality DNA, and it can be messy, to put it mildly, as some here have learned doing viral work. That said, we have interesting data showing the value of looking at bacteria and viruses, which can include phage, together; population bursts or changes can occur that are not as well measured at the bacterial level. So I really want to emphasize that such combinations may become more and more valuable in the future. For eukaryotes it's still early days. There's amplicon analysis, either of the 18S ribosomal RNA, which of course differs from the 16S in eukaryotes, or of the ITS sequences, and there's been some nice development there, but we don't really do shotgun because eukaryotic genomes are so large that it's cost-prohibitive. Eukaryotes are well studied, with many methods developed, though not as much as bacteria. So the majority of analyses still focus on bacteria, but I can't emphasize enough: when you're doing an analysis, do consider looking at bacteria, viruses, and eukaryotes, because you can probably get results that are more robust by using those combinations, or at least, at minimum, it would be good to look and see whether you could find something more valuable. Then there's the issue of the marker itself. You've got the "bio", but for the marker you can look at taxonomic analysis, and, assuming this is a bit of a review, there's a problem with taxonomic-level analysis: strain-level diversity can cause issues, and taxa can be more variable across environments. I'm going to talk about some environmental analyses where you may be looking at one body of water and get a marker for a certain taxon there, while another body of water yields a different marker. That kind of variability in taxa can be problematic. Gene-based analyses are also highly desired, and that's where you need shotgun metagenomics data, but you need good sequencing depth to reach some genes, and I would say the biggest issue is that we don't have enough knowledge about many genes. So there's the issue of what knowledge exists, and of course the cost of metagenomics versus, say, 16S amplicon analysis; we've had a lot of success doing both. One thing I'll be more and more interested in going forward is using something like an amplicon marker analysis as an initial screen, to give you a sense of how the data is varying and what's going on, and then targeting certain samples for more in-depth metagenomic analysis to get at the community and gene variability that you miss with 16S alone. Don't forget about other markers, either. Simply looking at microbial diversity, which I'm not a big fan of, has been surprisingly useful for getting a feel for how your community changes over time, and microbial analysis can also suggest other metabolic markers; there's a lot of interest in using
metabolomics, and in using microbiome results to motivate future metabolomic studies as well. Again, combinations can be really valuable. OK, what makes a good biomarker? Basically, you want things that are differential. In basic statistical terms, you might have the abundance, the frequency in your samples, of, say, OTU-1: you want the group means to be far apart, not overlapping. Comparing OTU-1 and OTU-2, you want very different abundances: the means should be far apart and the variance should be low. You really want it to be that in any given sample you see that big abundance difference, not that it's more abundant in some samples than others. Here's maybe a better way to show it: a schematic, pseudo example where we want to find biomarkers that separate red from blue, say blue are the healthy people and red the unhealthy people. Looking at OTU-1, OTU-2, and OTU-3: OTU-1 looks really great, because it's present in samples 1, 2, and 4 and absent in 3 and 5 (or vice versa, depending on how you look at it); OTU-2 is clearly more inconsistent and is not a good marker; and OTU-3 is not useful at all because it's present in all of them. Again, what you want are things that differentiate, with a clear difference and consistent abundance. In terms of the math, once you have those abundances, approaches range from the very simple, like a simple t-test, to more complex: you can write your own analyses in R or equivalent, or use methods developed by others. One is LEfSe; I think it's pronounced "lefsey", but call it what you like. It's been implemented as a convenient Galaxy workflow. Have you talked about Galaxy much? I don't think so. Galaxy, if you're not familiar with it, is a nice way to link a bunch of analyses together into a workflow, and I recommend getting familiar with it, particularly if you don't have strong computational skills; I'm not going to get into it right now. There's also metagenomeSeq, which has been implemented in R. To focus on LEfSe just a bit: I don't have time for a lot of detail, but it's designed for high-dimensional biomarker discovery and explanation. It identifies features, genes, pathways, or taxa, that differentiate two or more biological conditions or classes. First it statistically tests features among the biological classes using the non-parametric factorial Kruskal-Wallis sum-rank test; don't worry, I'm not going to get into the statistics. What's nice is that it then performs pairwise tests among subclasses using the unpaired Wilcoxon rank-sum test, to assess whether the differences are consistent with the expected biological behaviour: it finds differences between the classes, then tests among subclasses to see whether those differences remain consistent with that expected behaviour. Finally, it uses LDA, linear discriminant analysis, to estimate the effect size of each differentially abundant feature, with dimensionality reduction if needed. There's a nice image illustrating that you can come in with data where your analysis is already partly complete and go straight into LEfSe, or start from your high-throughput experiments and compute your taxonomic abundances, functional abundances, or gene expression analysis and feed those into LEfSe, along with prior knowledge from resources like KEGG, GO, and SEED (KEGG being a functional pathway database, and so on). It then does the analysis I described, and you end up with a couple of different visualizations: features ranked by effect size, representations of features on taxonomic or phylogenetic trees, or plots of statistically different conditions. Essentially it tries to be a nice all-in-one package for these analyses. I won't go into detail, but there's a web link here with a lot more information, and if you're interested in this method I encourage you to read about it, make sure you understand it, and make sure it's appropriate for your data. I can't emphasize this enough: biomarker selection relies on statistical techniques, and it's really important to understand the methods you're using. In particular, understand the assumptions the statistical method makes about the data, for instance whether it assumes a normal distribution, and whether your data is actually normally distributed; understand the method's limitations; and understand how to interpret the results it outputs. Basically, you must choose, but choose wisely, and it really depends on your research. I'm going to go through a couple of considerations and then get to the case study to illustrate. Considerations include: do you have discrete, categorical data, or continuous variables? Do your samples come from known classes, or do you not know how many classes there are? Generally you're trying to predict labels, you know, classification into class A, class B, or
you might have continuous variables. Sadly, I was smiling, or rather grimacing, at this this morning: with continuous variables you would use linear regression, which attempts to predict some future value of a variable. A common example is attempting to predict tomorrow's stock prices, which has become really important in light of what's happened over the last 24 hours. Basically, the concept is that if you have categorical data you would do something like logistic regression, and if you have continuous data you would do linear regression. Oops, I went forward here, so I guess I'll go into this now. There's also the issue of whether you want a supervised or an unsupervised approach, that is, do your samples have known classes? Do you already know, absolutely, that these are your healthy patients and these are your affected patients? Or do you have a bunch of data where you'd like to learn how many different groupings there are, what's happening in your community, how many classes exist? Is there actually a group of unaffected patients, plus affected patients with this kind of profile, plus other affected patients with a different profile? That's where the concept of supervised versus unsupervised analysis comes into play. In supervised analysis your samples come from known classes: in this example, you would know that samples 3 and 5 are from one class and samples 1, 2, and 4 from a different class, and then you evaluate using a test set. Unsupervised analysis is when you don't know the classes and want to discover them: you have all these samples and you discover, say, that samples 3 and 5 are in one class and samples 1, 2, and 4 in the other, so you're really letting the data drive the discovery. I should say that beyond these two categories there are also semi-supervised methods. To mention the advantages of supervised analysis: it's simpler, and you can find biomarkers that may be more robust and relevant, because you clearly said "this is one class, this is the other class, give me the biomarkers that differ between them", and it's easy to validate because you know what your classes are. The disadvantage is that those classes may not be well defined, and this comes back to the design of your experiment. If you're looking at positives and negatives, what if your negatives have a lot of issues, or there's actually a subclass in there, or your positives have different degrees of severity, with different biological factors driving those different degrees of severity? You have to be very careful. You might also have a continuous situation, where what you define as affected versus unaffected has a gray area, so defining classes can be problematic. Generally, though, if you can do supervised analysis, it's good to do. An advantage of unsupervised analysis is that it doesn't assume anything, so you can find truly novel things about your data. On the other hand, it can be difficult to evaluate, and it requires more samples to get a robust analysis that finds the different classes, which can make it more computationally intensive. Still, there is huge value in looking at your data and seeing what it says about classes, because you might find, as some people have, that your affected individuals split up, like we used to talk about "cancer" and then learned that there are actually many cancers with the same phenotype but different genotypes, for example. Just to close off this part before I come back to what we did: keep that study design in mind when you do biomarkers. Am I looking at discrete or continuous variables? Do I want supervised or semi-supervised analyses? Have I investigated whether my data has a normal distribution or not? These are the kinds of considerations to look at. So, how do we validate our biomarkers? Once you identify a group and want to use it as a biomarker, you obviously need to turn it into a test, and PCR or qPCR is a good option. One approach, which I'll show an example of, is to use a marker-based tool like MetaPhlAn2, which you've learned about, where you basically already have the marker; or you can cluster reads and align them to find conserved sequences, and verify that the representative sequence (representative rather than consensus, I should say) is selective. Once you find the actual sequences that would make a good biomarker, you design primers, typically using Primer Prospector or Primer-BLAST, two popular methods that let you design primers from a sequence alignment or design primers specific to a clade, so you can do that PCR. But let's get into a case study to illustrate how this works. For this analysis, see watersheddiscovery.ca for more information; it's a Genome Canada-funded watershed metagenomics project, and the idea, which I'll say more about in a second, is basically to improve pollution
and pathogen detection and source tracking, using metagenomics to identify markers of water quality. And I need my water. I want to acknowledge in particular Thea and Mike here, who were involved in this study; I say that strategically, because if you have any really picky questions I'm going to point you to them, since they're the ones who actually did the work. Will, I should mention, will be the driver of the intellectual commentary; I have not had much sleep over the last 48 hours. I would also really like to acknowledge Patrick Tang, the PI who led this project, and Matthew Croxton, Miguel, Natalie, and others at the BCCDC, the UBC Genome Sciences Centre, and SFU. So why do we care about watershed metagenomics? We wanted to take a more ecosystem-level approach to water quality monitoring. The current emphasis of water quality testing is really at the tap, and we wanted to look at the source: when we have problems with water quality, it makes much more sense to find the source of the problem and stop it there, rather than just dumping a bunch of chemicals in to clean our water. We have big problems in BC, for example, with 24/7 boil-water advisories in some communities because the source hasn't been identified or they don't have access to clean water. Another big problem is coliform tests. You've heard about beaches getting closed for high coliform counts, but the coliform test is really inaccurate: it can't identify the source, and it's important to realize that coliforms are a type of bacteria of which not all are pathogens. So you could close a beach because of a high coliform count when the water actually isn't going to make anyone sick. The other thing is that not all pathogens are coliforms: there are protists and other organisms that can be pathogens, so you could have a low coliform count but water that can make you sick. So the concept was to analyze clean and dirty water, compare them, and find better markers based on metagenomic surveys, so that a PCR-based test against a panel of microbes could identify water quality more robustly. The case study design was: a control watershed of clean water in Victoria (anybody associated with Victoria, your water is actually really nice and clean); a human-impacted watershed, with water affected by leaking septic tanks; and an agriculturally impacted watershed. We collected samples over the course of a year, plus additional monthly samples, plus hourly time courses; filtered the microbes; sequenced microbial DNA and viral RNA; did 16S, 18S for eukaryotes, cpn60, and shotgun sequencing; and then did bioinformatics analysis and, most notably, biomarker identification. I'm going to focus on the agricultural watershed, where we had one sampling site upstream of the contamination and two downstream, one at the site and one further downstream. We did a microbial survey, both taxonomic and gene profiles, of the differential features, with the idea of developing a qPCR test. I'll bring up two approaches we used. One was a fast-track approach to marker identification done by Thea using MetaPhlAn. The idea is that MetaPhlAn has really high precision: when you find something, it's probably right. But it has low sensitivity, so it misses a lot. It will find some things, so it's good for a quick first pass; if it finds stuff, that's wonderful, because you've got
some stuff that's probably good but it's going to miss lots of things right so if you don't find anything that doesn't mean you don't have good markers there but certainly we have found metaflan really useful to get a quick pass look at what your data is like and essentially based on select clade specific gene sequences and note there's about 3,000 refresh genome so that must be higher now though isn't it like yeah we should get that updated and and basically it's fast though you can do analysis really fast so essentially step one is you sort of process and validate the data quality trim normalize your data cross samples and I'll emphasize this a couple of times this concept that we've had it really useful having this sort of mock community or positive control validation so basically take a bunch of DNA from bunch of different microbes put them in some sterile distilled the ionized water or do not even do the ionized water and you basically sequence that and that acts as a really useful bioinformatics and lab based control sequencing based control to check that your sequence quality is okay so you should be able to like if you put in poor bacteria you should get those four bacteria come out and it'll really help with evaluating your data and doing analyses so this this mock community I can't emphasize enough that this is not done enough in microbial analysis I really encourage any of you doing microbial analyses please use a positive control it's going to become more and more required I'm being really distracted by these cute little kids walking by but my kids are getting older I'm like oh they're so cute I've forgotten how much work they are but anyways so note a couple of things here that when Thea did the analysis only 7% of the reads right were assigned to metaflan assigned by metaflan to a species so this was expected I mean is low sensitivity is a fast precise approach of those 84% were correctly assigned which is pretty good but keep in mind this is water 
microbiome which is we don't know as much about this so certainly for a gut microbiome where it's better generally are the human gut microbiome bacteria are more characterized and even more in databases and so you can get a bit of better accuracy for say versus say something like looking at some water where it's a a lot of species that maybe haven't been looked at before but in short I just want to make sure you realize that so we're analyzing these markers but we're only doing it on 7% of the data right so you're confined something but you are finding it from a subset of the data but but she was able to identify something and so like this for purposes of simplicity and also because we're not apparently because of possible IP we can't mention what the tax are but but basically it's you can see there's this tax on one that's differential between this upstream sequences and at the site or downstream of sequences and also a tax on two here and I should note that we prioritized highly abundant taxa with the idea that you don't want to be finding markers that are something that's really really rare because that's going to be hard to find a test you know you you don't want to have to collect 50 liters of water to make a test to write you want to collect that little teeny tiny bit of water to do PCR based test used White's non-parametric t-test with false discovery rate multiple test correction to find these differential abundant taxa because she did look at the data and saw that it was not normal distribution so that's when you want to use a non-parametric test and I'll just note here that random forest is something that she's also done in general is also a really nice test to consider it has some built-in validation that's really nice so just another method or approach to consider but basically so you get that so of that data just appreciate that 57,000 reads were assigned a tax on one and about 2000 signed tax on two so prioritize tax on one due to this interest in 
Then, and this is the key part: you take those reads and extract the taxon-specific sequences from the MetaPhlAn database; in this case there were 607 sequences for Taxon 1. You assign your reads against those taxon-specific sequences, and then you choose the regions of the MetaPhlAn sequences that have the most hits. As a schematic: you have a MetaPhlAn marker sequence, you have a consensus of all the reads aligning to it, and within that there's a highest-coverage area, and that highest-coverage region is your candidate marker sequence.

She then collected those candidate marker sequences and used Primer3 for primer and probe design, to PCR up those sequences. The first thing you want to do, and it's absolutely critical, is check in silico, against your own sequences, whether you can actually "PCR up", in quotes, the expected marker sequence. In this particular analysis we considered matches that were exact or had one or two mismatches, to handle a little variability. There is a bit of an issue in that ideally you also want to control where the mismatches are positioned, because of the way PCR works there's a difference between a mismatch at the 3' end of a primer versus the 5' end (happy to talk about that more if needed), but unfortunately no primer design method handles all of this really well; there's a real need for software in this area. The idea is then to choose sequences that minimize non-specific matches. In the in silico results you can see that at the site and downstream you get all these amplicons, with the forward primer, then the probe, then the reverse primer, while upstream your sequences don't show up. So your sequences appear exactly as you would expect at the site and downstream, and most importantly, we confirmed in the lab that we can amplify a product of the right size, which is great.

A few comments, though. This is being used to pilot an iterative validation process, where you go back over a whole set of possible sequences or taxa, identify a bunch of candidate markers and a bunch of candidate primers, because, as anyone with experience developing primers knows, there's always a certain failure rate: primers that seem to work computationally but don't work when you actually run them in the lab. So you want a nice cohort of candidates that you whittle down to a final PCR-based test that actually works. The benefits of this approach are that it's fast, going from sequence data to PCR primers takes just a couple of days, and it doesn't require large amounts of processing power. That speed matters, because in most microbiome biomarker projects the lab work and sequencing always take too long, so by the time you get to the bioinformatics, the people doing the downstream validation are desperate to get something to start validating as soon as possible. The downside is that it really does depend on differential abundance of known bacteria: the taxon has to be something that's in the MetaPhlAn database.
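The "highest-coverage region" step above can be sketched simply: given read alignment intervals along a marker sequence, compute per-position coverage and slide a fixed-length window to find the best-covered stretch. This is an illustrative toy, not the actual pipeline code; the interval data and window length are invented.

```python
def coverage_profile(length, alignments):
    """Per-position read depth from (start, end) intervals (0-based, end-exclusive)."""
    depth = [0] * length
    for start, end in alignments:
        for pos in range(max(0, start), min(length, end)):
            depth[pos] += 1
    return depth

def best_window(depth, window):
    """Start index and summed depth of the window with the highest coverage."""
    best_start, best_sum = 0, sum(depth[:window])
    current = best_sum
    for start in range(1, len(depth) - window + 1):
        current += depth[start + window - 1] - depth[start - 1]  # rolling sum
        if current > best_sum:
            best_start, best_sum = start, current
    return best_start, best_sum

# Toy marker sequence of 50 bp, with reads piling up around positions 20-35
reads = [(18, 36), (20, 34), (22, 38), (0, 12), (21, 35)]
depth = coverage_profile(50, reads)
start, total = best_window(depth, 10)
candidate_region = (start, start + 10)   # candidate marker region for primer design
```

In practice the coverage would come from an actual aligner's output rather than hand-written intervals, but the window-selection logic is the same idea.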
And remember, we're only analyzing about 7% of the data, so there's a whole set of other markers you're missing. The approach is also based on taxa, which have been shown to be more variable across environments than gene functions. So really, it's a good low-hanging-fruit first step.

The second case study I'll mention is an approach Mike Peabody took, a more complete analysis: doing metagenomics on all of the data, well, okay, I'll correct myself, looking at more of the data might be the more accurate phrase. The idea is that with tools like Kraken or DiScRIBinATE you can do analysis on the actual metagenomic sequence data, and I'll mention the paper by Peabody comparing how different methods perform under different conditions; you might want to look at that to decide which method would be best for you. Then you can do gene function analysis, for example with MEGAN4, which uses the SEED or KEGG databases, to do more functional classification and define gene-based analyses looking at certain kinds of functional classes. And you can do cluster-based analysis: for example, you take predicted proteins, cluster them, find differential clusters, and design PCR primers from those.

Just as a one-slide overview of how you would do this: you have your discriminative taxa or functions that you've already found; then you identify an informative region for primer design, using CD-HIT to cluster your reads by identity; then you design primers against those clusters, using something like Primer-BLAST or the IDT real-time PCR tool; then you validate those primers in silico, like we talked about before, using PrimerProspector or Primer-BLAST (sorry, I keep hitting that); and then you validate your primers in vitro afterwards.

I want to emphasize that this kind of approach lets you find other markers. Mike looked at the same data, and he found the marker that Thea found, but he also found a whole set of others, which are now being put through the pipeline for qPCR or PCR validation, or at least I think they are being validated right now; we'll have to check on the status of that.

I also can't emphasize enough: remember other kinds of markers. Community diversity, I used to be sceptical about it, like, what's the point, but it really is an interesting indicator that can complement your other analyses, so it's a good idea to look at diversity, and it might suggest other types of screening tests, along with metabolites and so on. And I can't emphasize enough that markers are only as good as the data they're based on, so you have to design experiments carefully, including positive and negative controls. A couple of final comments, with a couple of papers to reference. This concept of using controls and good experimental design is just basic science, but it's so important for microbiome analyses. There's such a problem right now with people just going in and sequencing things without proper design. You've got to think about what question you're really trying to ask and make sure your design is appropriate for that question. And, apologies, I wasn't able to be here for the first couple of days because I was out of town, but I understand the issue of negative controls has been brought up, and I'd like to emphasize that idea.
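As a toy illustration of the in silico primer validation step described above, here is a sketch that scans a template for a forward primer and for the reverse complement of a reverse primer, tolerating up to two mismatches, and reports predicted amplicons. Real tools (PrimerProspector, Primer-BLAST) do far more, including thermodynamics and 3'-end weighting; the primer and template sequences here are invented.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def match_sites(template, primer, max_mismatches=2):
    """Start positions where the primer binds with at most max_mismatches."""
    sites = []
    for i in range(len(template) - len(primer) + 1):
        mism = sum(1 for a, b in zip(primer, template[i:i + len(primer)]) if a != b)
        if mism <= max_mismatches:
            sites.append(i)
    return sites

def in_silico_pcr(template, fwd, rev, max_mismatches=2, max_len=2000):
    """Predicted amplicons as (start, end) spans on the template."""
    rev_rc = revcomp(rev)   # the reverse primer binds the opposite strand
    amplicons = []
    for f in match_sites(template, fwd, max_mismatches):
        for r in match_sites(template, rev_rc, max_mismatches):
            end = r + len(rev_rc)
            if f < r and end - f <= max_len:
                amplicons.append((f, end))
    return amplicons

# Invented 60 bp template containing one forward and one downstream priming site
template = "ACGTACGTGGATCCTTAGGCTATCGATCGTTAACCGGTTAGCCTAATTGGCCAAGTCGAC"
fwd = "GGATCCTTAGGC"             # forward priming site
rev = revcomp("TAATTGGCCAAG")    # reverse primer for the downstream site
amps = in_silico_pcr(template, fwd, rev)
```

Note that this treats all mismatch positions equally; as discussed earlier, a 3'-end mismatch matters far more for real PCR, which is exactly the gap in current primer design software.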
Contamination can really affect microbiome analyses. There was a really nice paper that came out just this week, or was tweeted about by Jonathan Eisen (I can send it around), looking at placental samples. There's been a lot of interest in whether bacteria cross into the placenta and impact immune system development of infants, or fetuses, while in utero, and because the authors had controls, they were able to show that the placental microbiota they were finding was no different from the negative-control microbiota. Without those controls, you could easily have said, look at all those interesting taxa we're finding in these placentas; with them, they were able to identify the problem as contamination. I can't emphasize how important that is.

The other issue is the variation in accuracy of different metagenomics analysis methods. This is the paper I mentioned by Mike Peabody, which I encourage you to look at, just to appreciate this point. We got a lot of good feedback from people who really appreciated the effort that went into the evaluation; it's a lot of work to evaluate all of this, and it's not perfect, and it's not the only such analysis. We actually had an interesting challenge with it, by the way: we wanted to add comments to the paper, but the journal, BMC Bioinformatics, changed its format, the existing comments stopped showing up (though I believe they were later restored), and they don't allow new ones. So, because we got a lot of feedback afterwards, we ended up putting together a sort of frequently-asked-questions response; in the PubMed version of the paper there's an FAQ we wrote that people have appreciated, dealing with some of the issues.

What I want you to appreciate is that different methods have different strengths. MetaPhlAn really does look at a subset of your data. DiScRIBinATE is a really good method, but it classifies at a somewhat higher taxonomic level; it doesn't go down to genus and species, so if you want genus- or species-level resolution or finer, it's not the method of choice. So appreciate these differences between methods and which method suits your experimental design; the variation in accuracy can be significant. Then appreciate the biases and limitations of what's in the sequence databases. This is why we found doing both the amplicon analysis and the metagenomics analysis really useful: they complement each other. The amplicon analysis will find some things the metagenomics won't, and the metagenomics will find things the amplicon analysis won't, and a lot of that comes down to what's in the databases and sequences you're searching against. And lastly, there's the issue of carefully examining the data, the methods, and the biases when you're comparing across different datasets. There's a lot of interest in trying to integrate different datasets together, but you've really got to watch that: for the amplicon analysis, did they use the V3 region or the V4 region? How did they process their initial samples? Watch out for the biases that can occur. Again, positive controls are so useful for comparing across different analyses, and having your positive controls in there will also help people in the future be able to
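To make the negative-control point concrete, here is a toy sketch of a simple prevalence-style contamination check: flag taxa whose abundance in real samples is not appreciably higher than in blank (negative-control) runs. Dedicated tools exist that do this statistically (the decontam approach, for example); the fold threshold and read counts below are invented for illustration.

```python
def flag_contaminants(sample_counts, blank_counts, fold=5):
    """Flag taxa whose mean abundance in samples is < fold x their mean in blanks."""
    def mean(values):
        return sum(values) / len(values)
    flags = {}
    for taxon, counts in sample_counts.items():
        blank = mean(blank_counts.get(taxon, [0]))
        # Taxa absent from blanks are kept; otherwise require a clear excess in samples
        flags[taxon] = blank > 0 and mean(counts) < fold * blank
    return flags

# Invented read counts per taxon across three samples / two blank controls
samples = {"taxon_A": [900, 1100, 950], "taxon_B": [40, 55, 35], "taxon_C": [500, 620, 580]}
blanks  = {"taxon_B": [38, 50], "taxon_C": [5, 2]}

contaminant = flag_contaminants(samples, blanks)
# taxon_B shows similar levels in blanks and samples, so it gets flagged
```

This is essentially what the placenta study's controls enabled: the "interesting" taxa looked just like the blanks, so they could be recognized as reagent contamination.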
use your data in further, subsequent analyses. So again, careful, considered analysis can really pay off. I'll leave it at that and see if you have any questions. Happy to take questions, and, how are we for time? Oh, great. We wanted to leave lots of time to go over big-picture questions and thoughts you might have, so you can ask me questions, but we're also going to open it up to the floor for a panel discussion of sorts, and I have some ideas of things I can use as prompts. With that, thank you so much, and I hope you've really enjoyed this workshop.

[Question about the MetaPhlAn-based approach.] You know what, I'll let Thea answer that. [Thea:] For sure. The nice thing about the MetaPhlAn approach is that right away you get a marker that might be differential between two groups of samples, and for that it's a great tool. Where it can be less useful is that, if a lot of the bacteria in the sample don't have a marker gene in MetaPhlAn, then MetaPhlAn's output, in terms of giving you a taxonomic profile of your sample, might be quite off. I think that's what we saw when we compared the MetaPhlAn taxonomic profile for a whole sample against a similarity-search-based approach: there we saw a point of difference. But if you're just asking whether a group of bacteria is differentially abundant between sample groups A, B, C and the other samples, MetaPhlAn is fine; at least we saw the same thing in that part of the analysis as well.

[Speaker:] In general, my impression, and we haven't done many, many analyses, so you might want to comment further, is that you can get a sense of whether things will be different or the same, a bit of an overview of your data, very quickly, a very short analysis rather than hours. I don't know, Mike, if you have any thoughts? [Mike:] Basically, as you mentioned, it depends on how well the community you're studying is represented in the reference genomes. [Speaker:] That's a great point, thanks for bringing it up. It's a really important factor whether you're looking at something like a human gut microbiome, which people have studied a lot, or some unusual Arctic or Antarctic environment; there's going to be a difference in how well those marker-based approaches work.

[Question, partly inaudible, about whether a standardized, e.g. household, control addresses all issues, and about getting criticized for the choice of control.] You bring up a great point; there are two things there. A standardized control can be useful, because then people are all using the same thing, and I can see the point of doing that. But I have to say I believe strongly in the idea of also using some sort of customized control. If we're doing a water microbiome and expecting certain kinds of bacteria to be there, versus a gut microbiome where you're expecting certain other bacteria, it would make sense for your controls to include some of the bacteria you're expecting; I mean, why not? The challenge in microbiome analysis is that you never really have a true control, a sample where you absolutely know everything that's in it; you're always spiking known bacteria into something. But it absolutely makes sense to use bacteria you're expecting to see, or that at least reflect the level of diversity you're going to see. My bigger concern has simply been the lack of positive controls, period; a lot of work has been published without them, so I would like to at least see something being used. I'm surprised to hear you've been criticized for which controls you used; what was the criticism about? I would argue that ideally you want to do both, to address those issues, because the fundamental problem right now is that we don't have enough data. (What is it with all these kids going by? Very distracting. They're getting younger and younger on campus.) But seriously, is there a downside to using the household control and also looking at it the other way? I absolutely agree with what you're saying, but can't you look at it in those multiple ways? And isn't it really a sign that we need more fundamental data on how a household control varies and how people vary? That's what I hope: that we will get better data, so that we can literally have better controls as a result.
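Evaluating a sequenced mock community, standardized or customized, ultimately comes down to comparing the taxa you spiked in against the taxa the pipeline reports. Here is a minimal sketch, with invented taxon names, of the precision/recall bookkeeping involved:

```python
def evaluate_mock(expected, observed):
    """Precision/recall of a reported taxonomic profile against a known mock community."""
    expected, observed = set(expected), set(observed)
    true_pos = expected & observed
    precision = len(true_pos) / len(observed) if observed else 0.0
    recall = len(true_pos) / len(expected) if expected else 0.0
    return {"precision": precision, "recall": recall,
            "missed": sorted(expected - observed),       # spiked in but not reported
            "unexpected": sorted(observed - expected)}   # reported but not spiked in

# Invented example: four spiked-in species; pipeline reports three plus one extra
spiked = ["E. coli", "B. subtilis", "P. aeruginosa", "S. aureus"]
reported = ["E. coli", "B. subtilis", "S. aureus", "Ralstonia sp."]
result = evaluate_mock(spiked, reported)
```

The "unexpected" list is where run-to-run contamination (like the pre-Christmas reagent issue mentioned below) shows up, which is one reason to run a control with every batch rather than once.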
Better controls as a result, right. It is a big issue. We had batch processing of our samples, with a positive and a negative control for each sequencing run, and that was really useful. We had one case, with samples being run over a year, where the samples sequenced right before Christmas had all this contamination showing up, and then it disappeared after Christmas, I guess because they made new library reagents, I don't know. We didn't know this until we analyzed the samples later, but it was very valuable to see that temporal change. So you don't want just one positive or negative control; just as in any experiment, you have to control for batch differences and so on. They take up space and resources, but they really do pay off, I've found.

[Thea:] On designing primers from a representative OTU sequence: you might also want to do an alignment of the reads assigned to that OTU, to see whether there are positions in the sequence with variability within your reads, so you can avoid that section for your primers. And the other thing is that there are areas of the 16S variable region you're amplifying that are going to be more or less variable, so once you design primers for a subsection, check it first to make sure that the bit of sequence you chose is going to be just as discriminative as the longer sequence.

[Question:] With sequencing costs decreasing, do we need to worry about finding biomarkers and developing these primer sets, or should we just do the 16S analysis, or even metagenomics, if metagenomics is heading below a hundred dollars and 16S is already around the same price? [Speaker:] I think it really comes down to what your use is. If you're doing research-based analyses, it absolutely makes sense to just do the 16S analysis, because you get more data. But one worry I always have, with places like the BCCDC and national microbiology labs, is that while I appreciate the benefit of moving from culture-based to PCR-based diagnostics, it scares me a little. I used to work in the national lab for what were called STDs at the time, and these bacteria are evolving, and there can be impressively strong selection to avoid detection: if you're not detected as a sexually transmitted disease, you can spread more. As a little aside, if you're a not-that-bothersome disease, people won't go get treatment either, so there are some interesting selective pressures there. So there's a bit of a worry that with PCR you're focusing on one particular region. The first cautionary note is to really validate well: you can find something that looks exciting, but keep a good pile of other candidates in your back pocket to validate too, because some of them will not work out in the end. That said, for clinical diagnostics, the PCR-based approach is certainly preferred; I don't know if you'd have a comment on that.
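Thea's suggestion of checking the aligned reads for variable positions can be sketched as a per-column Shannon entropy scan over an alignment: low-entropy (conserved) columns are safer places to anchor a primer. A toy example, with an invented set of short aligned reads:

```python
import math

def column_entropy(aligned_reads):
    """Shannon entropy (bits) at each column of equal-length aligned reads."""
    entropies = []
    for col in zip(*aligned_reads):
        counts = {}
        for base in col:
            counts[base] = counts.get(base, 0) + 1
        total = len(col)
        h = -sum((n / total) * math.log2(n / total) for n in counts.values())
        entropies.append(h)
    return entropies

def conserved_positions(aligned_reads, max_entropy=0.5):
    """Indices of columns conserved enough to consider for primer placement."""
    return [i for i, h in enumerate(column_entropy(aligned_reads)) if h <= max_entropy]

# Invented alignment: position 3 varies across reads, the rest are conserved
reads = ["ACGTACGT",
         "ACGAACGT",
         "ACGCACGT",
         "ACGTACGT"]
safe = conserved_positions(reads)   # position 3 is excluded
```

The 0.5-bit cutoff is an arbitrary illustration; a real pipeline would also weight variability near the primer's 3' end more heavily, as discussed earlier.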
From a legal, approval, and policy framework standpoint, a lot of organizations coming up with tests, including the Environmental Protection Agency, really prefer things where it's very definitive what the test is, so they have trouble with the concept of metagenomics. I think that might change over time, but there's also a noise level in metagenomics, whereas a PCR test gives you a yes or no, with a cutoff. In the end, I think it comes down to cost: PCR is pretty darn cheap, less than a dollar a sample, isn't it? I can't remember exactly, but that's really powerful. If 16S amplicon analysis can stay competitive on cost, fine, but metagenomics, for now at least, will remain something that is not part of a diagnostic, simply because of cost, unless you can really show there's huge value in it. I do appreciate that from a metagenomic sample you can basically build that panel in silico: you can find those markers, get primers, and plug them into existing pipelines, these big banks of PCR machines, so it's easy to implement in large-scale environments. Getting back to what are now called sexually transmitted infections, or STIs, I think that's the number one test done at the BC Centre for Disease Control, if I remember correctly; many, many samples being processed all the time, so that cost effectiveness and existing infrastructure are really key.

I have to say, though, another area I think is really interesting is the environment. With biomarkers there's sometimes this issue of really wanting to track how things are changing over time, for example in an environmental assessment situation, and the advantage of metagenomics there is that you can detect how things are changing, and new things coming up, that you would miss with a PCR-based test. In short, things are going to keep moving, and it's going to be important to keep on top of how things are changing and adapt. We also haven't really talked about long-read sequencing and how it could transform a lot of metagenomics in the future, basically giving you better-quality analyses of your metagenome community through long reads. I'd say really good quality data there is more like five years away, but it's absolutely something to keep your eye on.

Now, a couple of other things. First, if you have particular research questions, don't hesitate to bring up challenges or things you're interested in analyzing; we can use them as case studies, give you feedback, and use this as an opportunity to bring some things together. So if you have something to bring up, now's a good time. The other thing is feedback on the course. We're going to do a survey after this, but it's sometimes nice to get general comments too: things you wish we'd covered more, things you wish we'd covered less, things you liked, things you found confusing. So does anybody have any thoughts that come to mind right away, like, oh, it would have been great if this was covered? Yeah? [Comment from the audience.] Thanks for that.
It's a big struggle with workshops of this length to decide how much material to include. We could barrage you with insane amounts of material, but we could also go too light, so it's really good to get a feel for whether we're striking a good balance. It's definitely not unusual to feel like that was a lot of stuff; the point is you have access to the material afterwards to digest it, and I think that's a good thing to do. I also encourage you to look at the other Canadian Bioinformatics Workshops; they have a lot of material posted too. If there's some other analysis you're interested in, like doing statistics with R, I think there's one on that, there's genomics, there are two R ones, and she's the one to ask, since I don't know all the workshops going on, and there's gene expression and so on. There's a lot of material posted there to check out, and I think it's a really good resource. Okay, any other thoughts? In particular, is there anything you would have liked covered more, anything that was missing a bit, or bits where you would have liked more depth, say about a particular method? Okay, that's good; make sure you put that in the survey.

The other thing I was going to ask about is timing. Can I do a little poll, a show of hands: did you find it was a little too fast going through things at times? I see a few nods, a few hands. Or did you find that sometimes it was maybe a bit too slow? No? Okay, that's notable. Yeah, you're hitting on a classic issue in bioinformatics: it's bio and informatics, and you've got people with different computational and biology backgrounds, and that's always a challenge, even teaching an undergrad class. You'll get students who already have a lot of experience. For the initial UNIX training lab in my class in the molecular biology department, I usually say, okay, here's the whole long version, but if you already know UNIX, here's the really short version: if you can do this and this and this, you're done, out of here.

But I want to emphasize a couple of things. As you move forward in research, in any kind of environment, a team-based approach can be so valuable. For this kind of work it really pays to have somebody with a more in-depth understanding of microbiology and microbiomes, and somebody with a more in-depth understanding of bioinformatics or statistical methods; if you can draw upon those different people and get that expertise involved in your analysis, there can be a huge payoff, and you get ideas from other fields. We've had a lot of success just getting people doing machine learning to apply their methods to our biological problems, and it has produced some really interesting new insights. There are a lot of computer scientists out there with cool ideas for new computational approaches and no data to apply them to, and they're usually really happy to work with you and try some problems. Sometimes it doesn't quite work out, sometimes the analysis is so simple it's not interesting to them, but sometimes there's a really big payoff. And get statistical experience; make sure your stats are good. I regret not taking more statistics when I was in school.

Actually, maybe we could go around and get a few quick thoughts from everybody. For me, I have to say, one issue is deciding where you want to be on that spectrum of computation and bioinformatics: how much do you want to be developing software and algorithms, and how much do you want to be doing analyses? So think about where your interests are. I don't know your background, so it's hard to say; I'm assuming you have a biological sciences degree or something like that? I'm going to mention something, though you might have other ideas: if you really want to get into it seriously, go for it and take some courses, get some decent CS training; doing a minor or something like that can be useful. If you want to get into it in a more casual way, there are good texts, and just diving in and teaching yourself some programming can also be very successful. I'd be curious what the rest of you would say. Actually, can I get a show of hands on the demographics: how many people have CS degrees? One, okay. How many have biological sciences degrees? Okay. Anybody with another degree? What do you have? Physics, okay, that's really common; this is actually almost exactly like a bioinformatics class I teach. So you're really at that interface, where it's probably more an issue of computing skills, but I want to encourage you not to forget about statistics, and not to forget about learning the biology of these organisms and appreciating some of that biology, especially if you're coming from a biological
sciences degree that doesn't have a lot of microbiology in it so it's just another comment but I can add some more later but do you want do you have any comments about what you think or anything to add I would just say that it does take a while to get used to command line approaching and executing your tool through command line because you know most of us grow up with using windows which is a graphic interface and then now but that's not very scalable for the type of analysis we want to do so eventually we have to revert back to command line based approach that people in the 80s 70s quite familiar with and have no problem using um and I do find that everyone eventually can pick up command line and can eventually be proficient with it average is probably at least two three months of full time or about six months of sort fairly intensive use so three days is definitely not enough and don't give up just yet but if six months of intensive use you still hate it command line then you really have to think about where you position yourself and that's the next one you don't want to torture yourself for the rest of your life to feel my programming at all but the trick there is you have to be able to collaborate with someone who will have money to pay or someone to do that so what do you say I'm going to wait till the end of that for Michael Bion for Michael Bion for Michael Bion so I don't see Bion so I don't go to either of those too often nowadays but uh but I mean the big boundary conference would be the uh the ISNB conference that's probably the largest boundary matters conference and there will be analysis of four sorts so there's two three thousands of people there and a typical boundary conference the smallest bill that would be the hundredth zone so that's the largest boundary conference but it's usually in combination with with other large conference so you get the conferences joining the course it's an international one which usually tend to be a fairly good of 
features the morning to your favorite is Lake Arrowhead yes that's what I that was that was top on my list the Lake Arrowhead microbial genomics meeting it's just awesome I just can't rave about that enough it's uh it's really cute though how it's run you know Jeff you just sort of send him your check email him I'd like to go you know you just email him a check or send him a check and uh but uh but it's just it's it's increasingly become microbial focused and it's a really like the meeting that a lot of people who were some leaders in the area will take the time to go to and travel and they will come from overseas to go to where there's a lot of people who won't come from the UK over to here very painful as I talked about people from UK traveling right now but but I have to say the this concept of just that people value that meeting so much microbial genomics and we could send like a and link to it it's only under it's two years on the character though yeah the other year it's not it's not it's not it's not it's not it's just only it yeah like Arrowhead every other year for conferences IHMC the international in microbial consortium is just this year so so many people out here they're like a biome and then Disney is a Montreal and August this year so it's microbial quality it's increasingly covered in the in the in the world and besides that there's something there's something cool to sell in the border conference do you have any thoughts what would you say to somebody who's more I mean it does seem like there's I'm going to comment on the computer scientists too but computer science yeah so I think you know there's the parts we've covered here so hopefully you have the the basics to get through to you know do what I consider the in the middle or actually the second one on upwards but I think it's pretty rare it doesn't need to talk to each other so you have sort of the feeling of being there's what you wanted to last right and so that requires probably sometimes 
just hanging away or trying to piece together the other two of our journey to be like let's bring out this profile and and I can talk about all the things I've been interested in stage or I'm really interested in looking at the two of these functions and maybe doing five of the next trees or really interested in how the things interact or something so some of those have tools and some of them just don't and sort of have to figure out how to figure out yourself but the basics of getting the you know the first pass in your data so to add to that a little bit is if if you see something stand right there then it's great and then sometimes nothing stands out you know what you mean so I think the one thing it's not covered much it's probably machine learning which I would just mention like you know a little bit it's something that we're recently using in our lab quite a bit to try to tease out you know whether you can really really classify samples policies it's interesting although it's classifying samples which is really useful but also trying to find what means are most important that requires some more learning but it's something that we're really interested in sort of diving into data and more actually if you will be more interested factors that are making it easier to decide to learn and um do you guys want to add some comments? 
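Since machine learning came up as the piece the workshop didn't cover, here is a minimal sketch of the two ideas mentioned: classifying samples and asking which features drive the separation. Everything in it is made up for illustration; the taxa names and abundance values are invented, and the toy nearest-centroid "classifier" plus centroid-difference "importance" stand in for what a real analysis would do with a proper library (e.g. cross-validated random forests and their feature importances).

```python
# Toy illustration: classify microbiome samples by group and rank which
# taxa separate the groups. All data below are invented for the example.

def centroid(rows):
    """Element-wise mean of a list of equal-length abundance vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def classify(sample, centroids):
    """Assign a sample to the class with the nearest centroid (Euclidean)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(centroids, key=lambda label: dist(sample, centroids[label]))

# Relative abundances of four taxa (columns) in labelled samples (rows).
taxa = ["Bacteroides", "Prevotella", "E.coli", "Akkermansia"]
healthy = [[0.40, 0.30, 0.05, 0.25], [0.45, 0.25, 0.05, 0.25]]
disease = [[0.10, 0.30, 0.40, 0.20], [0.15, 0.25, 0.45, 0.15]]

centroids = {"healthy": centroid(healthy), "disease": centroid(disease)}

# "Feature importance" here is just the absolute difference between class
# centroids: taxa that differ most between groups are candidate biomarkers.
importance = {t: abs(h - d) for t, h, d in
              zip(taxa, centroids["healthy"], centroids["disease"])}
top = max(importance, key=importance.get)
print("Most discriminative taxon:", top)
print("Prediction for a new sample:", classify([0.12, 0.28, 0.42, 0.18], centroids))
```

The point of the sketch is only the shape of the workflow: fit on labelled samples, predict new ones, then interrogate the model for which features mattered, rather than stopping at classification accuracy.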
We only have about five more minutes until the survey, so thank you so much, and I just want to encourage people to still do the survey.

Oh, great question. If you're interested in learning more computational skills, there are two workshops I'd point you to: Software Carpentry and Data Carpentry. They include command-line skills, git, and so on; they're very hands-on, you code along over two days, and they're typically either free or fairly low cost. Software Carpentry is the one that does more of the R and Python, and Data Carpentry is more focused on data analysis. If you want to learn something like Python, or you have a real need with your data beyond what we've been able to cover in the last couple of days, taking one of those Carpentry workshops is a good option. Thank you so much for mentioning that, because it's probably one of the top recommendations.

Mike, do you have anything to add? It depends what you want to do, and you just have to practice doing stuff. There's so much information on the internet, whether it's just googling, or the Software Carpentry material, which is also online; people make websites walking you through learning pretty much anything you want to do, except for really specialized things, where the resources are a lot fewer. But for anything like machine learning, or any programming, you can literally learn it online.

Do you have any thoughts you want to add? I mean, it's been mentioned a lot: what are you aiming at learning? You have a problem, something you want to see, and you just sit down and do it. You have a need for something, and that's what drives you to learn and gets it done. And in terms of conferences, there was a bioinformatics event hosted just a few weeks ago, and it will be hosted again, I think in Vancouver, to start building capacity in bioinformatics here. If you're interested in bioinformatics and interested in volunteering to help with that, and you're in B.C., that's something to consider.

I agree that, in general, this is a challenging research area; it's just a huge challenge. You have to remember that when you classify into actual species or genera or genes, there's this whole pile of organisms and genes you're missing, because they aren't known or aren't studied yet; there are all these hypothetical genes. So always remember you're dealing with the very small subset of knowledge that we have of the true diversity. There are huge opportunities here as we move forward, I think, to get organized and get better data, but you do have to be really careful.

I'll add one other point I wanted to mention about the importance of being careful with your analyses: garbage in, garbage out. It's just so easy to come up with "these are the taxa that are there" when you don't really have support for it, without the controls in particular, but even with controls it's really hard to say those taxa are real; you can easily get garbage coming out. So I wanted to bring this up. Without going into it very much, we're dealing with an issue right now where somebody did an analysis, made some big errors, and it ended up producing some wrong taxa. And I'm trying to remember: way back in the watershed project we had a problem too, and I can't remember what it was. Was it about the primers? Because it would be nice to share these kinds of things to watch out for; I would hate to see other people make the same mistakes.

So, the first time we did some HiSeq sequencing, our libraries were too short, and we had the last 30 base pairs of the reads being all A's. Strange, right? If we had just done our typical quality trimming and then tossed it into MEGAN without ever looking at our data, we would have gotten a taxon table and a key function table, and we would never have known it was noisier. We only caught it because we looked at the data, looked at those FastQC plots, and noticed that the frequency of A was really high at the end of the reads. So it's really, really valuable to just open your files and take a quick look at them; make sure they don't look crazy. Something I've been noticing is the importance of paying attention to the size of your inserts versus the length of your reads, whether you're doing amplicon analysis or shotgun analysis; it seems to be a common problem in sequencing that I've seen. So make sure you're on top of that: look at your data, and it'll work for you.

And do checks like, you know, something as simple as seeing whether the sequence composition of the merged reads you're analyzing is similar to the sequence composition of your entire data set. Just do these checks. Genomics, and particularly I think microbiome analysis, can be quite dangerous in that you can easily go down the road of producing results that don't hold up. So see if your data makes sense: are you getting some of the big taxa you'd expect for a given type of analysis? And if you have outliers, what's going on with those outliers? It's all common sense, but it's still important to appreciate.

But I have to come back to this issue of microbiology knowledge: you want to get somebody who has more microbiology knowledge to talk to and to look at your data. Way back when, I did a biochemistry undergrad where I had to take all these metabolism courses, memorizing all these metabolic pathways, and it was so horribly dry and boring. But it's been so valuable, because I can look at data, I know those pathways, and I can see trends. So get somebody to look at your data; sometimes they can see things that you can't. And appreciate the biases in resources like KEGG: they've classified things into pathways, but those pathways can have sub-pathways, you can have mega-pathways, and the tools will only find things based on how they're classified. What you find depends on how it's classified, so you've got to keep that in mind. Anyway, I think we have to stop there; hopefully those comments have been useful, but as mentioned, we can take more comments afterwards.
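The two sanity checks described above, spotting a run of A's at the ends of reads and comparing the base composition of the reads you're analyzing against the whole run, take only a few lines of code if you want something quicker than opening FastQC. This is a minimal sketch on made-up read strings; real code would parse actual FASTQ records, and the 90% and 5-point thresholds are arbitrary choices for illustration.

```python
from collections import Counter

def per_position_base_freq(reads):
    """For each read position, the fraction of reads carrying each base."""
    freqs = []
    for column in zip(*reads):  # assumes equal-length reads
        counts = Counter(column)
        total = sum(counts.values())
        freqs.append({base: n / total for base, n in counts.items()})
    return freqs

def base_composition(reads):
    """Overall fraction of A/C/G/T across all bases in a list of reads."""
    counts = Counter()
    for read in reads:
        counts.update(read)
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

# Toy 12 bp reads; the first three end in a run of A's, mimicking the
# too-short-library artifact described in the story above.
all_reads = ["ACGTGCATAAAA", "TCGAGCTTAAAA", "GGCTACGAAAAA", "TTGCACGTACGT"]
merged = ["ACGTGCATAAAA", "TCGAGCTTAAAA", "GGCTACGAAAAA"]  # subset analyzed

# Check 1: positions where nearly every analyzed read is an A.
freqs = per_position_base_freq(merged)
poly_a = [i for i, f in enumerate(freqs) if f.get("A", 0.0) >= 0.9]
print("Suspicious all-A positions:", poly_a)

# Check 2: does the analyzed subset's composition match the whole run?
full, sub = base_composition(all_reads), base_composition(merged)
shifted = [b for b in "ACGT" if abs(full[b] - sub[b]) > 0.05]
print("Bases shifted by >5 percentage points:", shifted)
```

On this toy input the first check flags the last four positions, and the second flags A as enriched in the analyzed subset, which is exactly the kind of "look at your data" signal the speaker is advocating for before any taxon table gets built.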