OK, so from the introduction, it sounds like everyone has some experience with microbiome and metagenomic analysis, so this lecture might sound a bit like preaching to the choir. What I'll do is go over it fairly quickly, and we'll use this opportunity as a discussion. If there are any points you want us to go into in more depth, or where you want to bring up your own perspective, please feel free to speak up. I also wanted to say that these lecture slides were mostly adapted from Rob Beiko's previous lecture. I think one year he couldn't make it, so I took over the slides and got stuck with them. But it begs the question: why isn't Rob giving this lecture today? So I'll do my best.
OK, so as you know, you're here for a three-day intensive workshop in microbiome analysis. We broke the workshop into eight modules, and I'll briefly go over them. The first module is this one right here, where we'll introduce the basic concepts, the definitions, the general approach, and some of the resources available. In module two, we'll go into marker gene analysis, mainly based on 16S, and use that to demonstrate how you can measure community or sample diversity, in other words alpha diversity and beta diversity, for the different samples and communities. In module three, we'll go into PICRUSt, a tool that Morgan developed to link marker-based analysis, in other words taxonomic markers, to functional genes and infer functions from marker genes. In module four, we'll go into shotgun metagenomic analysis, talking about both the taxonomic classification and the functional classification of the samples you get from shotgun metagenomic sequencing. This used to be two separate modules, but we condensed it into one to make room for additional topics, and also because the tools have improved over the last few years and streamlined the process quite a bit.
Module five is new, and Laura will be talking about how you can take metagenomic samples and assemble them. Sometimes you have to bin the metagenomic reads before you can assemble them, and she'll cover how you extract genomic sequences, sometimes whole genomes, from metagenomic data. Module six will be on metatranscriptomics, and John will be covering how you can do RNA-seq analysis. Module seven is also new, and Rob will be covering it, giving you some more advanced statistical analysis that you can apply, I think, mainly to marker gene data. Are you going to cover any shotgun data as well? Yeah, I mean a lot of that is kind of important. Yeah, so it'll be an extension of module two and module four to give you some more background on statistical analysis of the data sets that we'll see in this workshop. Module eight will be a lecture delivered by Fiona. Once you've carried out the sequence analysis and done your statistical analysis, the abundance analysis, differentiating the different microbiome samples, how can you use the results to select biomarkers that can be associated with different conditions, such as diseases or different environmental conditions? Any questions so far? Anything that we missed or that you think we should have covered?
Okay, so the general learning objective for this entire workshop is to be able to define the different types of metagenomic projects and process the data. There will be a lot of opportunities for hands-on use of the different tools, and we'll show you how you can run some standard pipelines for marker gene, metagenomic, and metatranscriptomic data sets. We'll also be making these tools available to you so you can replicate the analysis when you go home with your own data set, and if you're more advanced, you'll also have the opportunity to try out your own data set during this workshop.
Also important: throughout the workshop we will bring up some technical and sometimes philosophical limitations of metagenomic studies, so that you're aware of the important limitations and don't overestimate the power of metagenomic studies. For this specific module, you will learn key terms in metagenomics; for example, you will understand what microbial communities are. So how many people have been exposed to the OTU versus ASV debate? How many of you have heard of OTUs? Almost everyone. How many have heard of ASV or ESV, or amplicon sequence variant? Far fewer people. So I guess this is an opportunity to bring up that particular discussion in the next session. We'll also show you the main objectives for carrying out metagenomic studies and how to interpret the content of sequence data, and lastly I'll cover some of the common resources, reference databases and so on.
So the term microbiome has been attributed to Joshua Lederberg, and it was used by Lora Hooper and Jeffrey Gordon. He defined the microbiome as the collective genome of our indigenous microbes, which used to be called microflora. But as you should all know, bacteria and archaea are not plants, so the term microflora has fallen out of favor and is not commonly used, not in the microbiome community anyway. The idea is to take a comprehensive genetic view of the human as a life form together with its microbiome, a holistic view of the human as an ecosystem. The term microbiota refers to the actual set of microorganisms found in a particular setting. There's a bit of historical confusion about microbiome and microbiota, because some people interpret microbiome as the microbial biome, so they use it to refer to the organisms. In our case we make the distinction, but ultimately the terms are sometimes used interchangeably.
Metagenomics, on the other hand, is quite different from the terms microbiome or microbiota. Jo Handelsman in 1998 actually used it to describe the functional aspect of the microbiome, and she was referring to meta as in beyond. They described advances in molecular biology and eukaryotic genomics as having laid the groundwork for cloning and functional analysis of the collective genomes of soil microflora, again a community-based approach, which she termed the metagenome of the soil. So we make the distinction that metagenomics refers more to the functional aspect and takes a shotgun approach to identify the functional genes, rather than the marker gene-based approach, which typically doesn't give you the functional aspect of the community but gives you the taxonomic aspect.
So the goal of a microbiome study is to explore the relationships between the microbes and their habitat, including humans, and the effect on our health. To accomplish this, we use different molecular biology and computational techniques to make inferences about the community. In this workshop, we'll be talking about how you use marker genes to characterize a community and how you then use PICRUSt to go from marker genes to function. We'll also show you how you can do metagenomic analysis using shotgun metagenomic data, and we'll talk about RNA-seq data sets for metatranscriptomics. We will not be covering metaproteomics or metabolomics types of studies in this workshop. But the point here is that there are now many terms ending in -omics or -omes that refer to a community or system-based approach to understanding a community holistically. There's also culturomics, which is about how you culture different organisms and so on.
So why do we take a metagenomics approach? As you know, most organisms don't live in isolation; they live in a community.
So the traditional culture-based approach, where you try to isolate a single organism, is still the dominant practice in diagnostic labs and many other medical microbiology labs, and it is still very much a clonal or pure-culture view of pathogenesis and disease. But it often misses the intricacy of the interactions between different organisms living in the same community. A while back, a paper was published about the great plate count anomaly, comparing the number of organisms that can be successfully cultivated on a plate with what's observed under a microscope or observed molecularly. It's estimated that less than 1% of the organisms across habitats can be cultivated. This, of course, is now a bit controversial, because people have been systematically trying to cultivate organisms, especially ones found in environments that we care about, such as the human gut or other parts of the human body, and the percentage of organisms that can be cultured can be higher. More importantly, if you culture them in a bioreactor or in some way that allows them to interact with each other, it greatly increases the number of organisms that you can grow in the lab. But in any event, it's not possible to culture all organisms, at least at the present time. Taking an alternative approach, where you can interrogate the community without culturing, is why microbiome and metagenomic analysis became so popular. As of last year, when I last checked, there were about 25,000 microbiome papers published in the last 10 years, and I'm sure some of you have contributed to that count.
So when I was preparing the lecture for another course, it was during Thanksgiving, so I was looking around for examples, and I found this one, which I think is still appropriate in this context. There are microbiomes of many different things, and throughout this workshop you'll hear a lot more about other studies, but I want to highlight this one because it's a good example where the microbiome of a food product, or a bird, can be intricately linked to human health. We care about the turkey microbiome because we know that the microbiome can drive metabolism and prime the host immune system. Studies have shown that wild turkeys and domestic turkeys have very different microbiomes due to the interference of the agricultural process, and many of the organisms are unknown and unculturable. Yet we apply low doses of antibiotics as growth promoters, and this is still practiced in certain settings, even though it's not allowed in the EU. I'm pretty sure it's not allowed in Canada, but people from CFIA can probably let me know if that's the case or not. Antibiotics, as we'll see in the module two lab session, can affect the diversity of the native microbiome and give pathogens an opportunity to thrive in an environment where they would otherwise be outcompeted. So opportunistic pathogens, due to the use of antibiotics as growth promoters, can affect our food safety, and the study of the turkey gut microbiome may lead to better ways of enhancing growth without the use of antibiotics.
Just by searching for turkey, I also came across a study that looks at different Turkish fermented drinks. Not surprisingly, a lot of lactic acid bacteria are found in these fermented drinks. And I was at an Oxford Nanopore workshop; how many people have heard of the MinION or Oxford Nanopore?
Okay, so it's a small device for sequencing. I was at a one-day workshop, and the instructors went to a store, bought some kefir, extracted DNA from the kefir, and sequenced it in the workshop. So this type of study, which I think was published in 2013 and took maybe not years but months to prepare, can now be done with current sequencing technology in a single day, as a demonstration in a workshop, to show you what kinds of microbes can be found in your kefir or in your kombucha, which was the other type of drink used to extract DNA for sequencing in that workshop. It really brings home the idea that we now have the tools, both the sequencing tools and, as we'll learn in this workshop, the bioinformatics tools, to really study what types of organisms are in our surroundings, in our food, and in and on our bodies.
Most of you have probably heard the phrase "most of you is not you". It stems from the observation that most of the cells found on your body are actually non-human cells, and the ratio is approximately two to one, at least by the latest estimate. The microbes in and on your body encode 500 times more genes than the human genome, and they account for about two kilos of your body weight. So how many human genes do we have? 20,000? Yes, that's about right. So imagine you have about 20,000 or 25,000 genes. The number of microbial genes, collectively, is about two to three million different types of genes in and on you.
Okay, so I was referring to the sequencing technology available that has vastly sped up our ability to generate sequence data and therefore to use sequencing-based or molecular approaches to interrogate the microbiome. Roche 454 is rarely in use these days; it's a discontinued product with minimal support. And Sanger, of course, is the traditional sequencing platform.
The dominant short-read sequencing platform, as you may know, is the Illumina series of sequencers, ranging from the desktop MiSeq all the way to large-scale sequencers such as the different versions of the HiSeq. More recently, the so-called third-generation or single-molecule sequencing platforms have been made publicly available. The two dominant ones are the Pacific Biosciences, or PacBio, platform, which occupies an entire room and requires a reinforced concrete floor, since this thing, I think, weighs about a ton or so, and the Oxford Nanopore device, which, as you can see from the scale here, is a thumb drive-sized device that you just plug into the USB port on your laptop to generate sequence data. So these devices have drastically increased sequencing capacity in the last few years.
Okay, any questions or observations so far? How many people are using Illumina sequencers to generate their data? Almost everyone? Okay, anyone using the Oxford Nanopore MinION? Okay, a few. So if you're interested, you might want to talk to each other to share your experience with these platforms, especially the newer MinION platform, and how it goes when you try to run metagenomic samples or even marker gene samples, amplicon samples, on these devices.
Okay, so what can we answer with microbiome studies? Roughly speaking, there are four general questions. The first is just who's there, what's in the microbiome, and this can typically be answered using a marker gene-based study. But of course you can also do shotgun metagenomic sequencing and infer taxonomic information from the shotgun data, and we'll talk a little bit about that in module four. The other general question is: what are the functions that are present in these microbiomes?
The study here is drawn from an ongoing study in Rob's group looking at the different antimicrobial resistance genes in an elderly population. I'll let him add any comments, but he was pointing out essentially that along the x-axis are the different classes of antibiotic resistance genes and their proportions in the samples. The different types of resistance genes are present at different levels, but also, as you can see from the height of the error bars, there is variation across the subjects. Some genes are present in low abundance and are highly variable across subjects; I mean, of course, their microbiomes encode these genes. Some other genes seem to be found in all subjects in high abundance, such as the beta-lactamase resistance genes. And some are in between: everyone consistently has the particular gene, and there's low variation across subjects. Anything you want to add? Just one quick thing. If you've never heard of ultramycin, it's no surprise. And this is a nice illustration. It's a very, very similar, but on slides. All right.
So the next question is asking what the functions or the taxonomic profile of the microbiome correlate with. This is looking at the different characteristics of your samples, the conditions that you want to study, and correlating your microbiome with those indicators. This is the topic of our statistical analysis module, and it will be brought up in some of the other modules as well. In this particular study, for example, they looked at the correlation between the diversity of the soil microbiome and the pH level of the soil.
And you can see there's a nonlinear relationship between the two variables. Another study looked at saliva, correlating the similarity of the saliva microbiome to the kissing frequency of, presumably, couples. Who knows?
Okay. So beyond finding the relationship between microbes and their environment, it's also possible to use time series to look at how microbiomes respond over time to different treatments. This is a study in a mouse model looking at C. diff infection. The mice that are healthy and not infected by C. diff are in this quadrant here. We'll talk about this type of display, which is called a principal component analysis; it essentially projects high-dimensional data into a two-dimensional structure, so you're looking at the maximum separation between different groups. So in this corner here are the healthy individuals, and this group here are the mice that have been treated with antibiotics. What's interesting is that this is the group that is persistently shedding C. diff and has clinical signs of infection, the sick group. The researchers then introduced fecal samples from the healthy mice into these C. diff-infected mice. The small numbers here indicate the time points in the study, and you can see that as the number increases, the infected mice gradually become more and more similar to the healthy ones. By day 14, they have microbiome profiles similar to the healthy individuals, showing that the fecal transplant was able to improve their health, and they no longer look like the persistent shedders. Okay.
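To make the idea of samples "becoming more similar" concrete, here is a minimal sketch in Python. It does not reproduce the ordination plot from the study; it only computes a pairwise dissimilarity between profiles. The Bray-Curtis measure is an assumption chosen for illustration (the lecture does not say which distance the study used), and the taxon counts and time points below are entirely hypothetical toy numbers.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count profiles.

    0.0 means identical profiles, 1.0 means no shared taxa.
    Both profiles must list the same taxa in the same order.
    """
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of taxa")
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


# Hypothetical counts for five taxa (the last one stands in for C. diff).
healthy = [40, 30, 20, 10, 0]

# Hypothetical profiles of one transplant recipient at several days.
recipient_by_day = {
    0: [2, 3, 5, 10, 80],
    3: [10, 10, 10, 10, 60],
    7: [25, 20, 15, 10, 30],
    14: [38, 30, 20, 10, 2],
}

# Dissimilarity of the recipient to the healthy profile at each day.
distances = {
    day: bray_curtis(healthy, profile)
    for day, profile in recipient_by_day.items()
}

for day in sorted(distances):
    print(f"day {day:>2}: dissimilarity to healthy = {distances[day]:.2f}")
```

With these toy numbers the dissimilarity to the healthy profile shrinks at each time point, which is the same pattern the plot shows as points drifting toward the healthy cluster. An ordination such as the one on the slide is essentially a two-dimensional picture of a table of distances like these.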
So as I mentioned in the introduction, I want to give a bit of historical perspective on how metagenomic studies came about. It really started with the development of different sequencing technologies, allowing us to look at DNA, the genetic material, as a proxy for phenotypic studies of these organisms. In the seventies, Sanger sequencing was developed along with some alternative sequencing technologies. Shortly after that, it was applied to different communities, using a marker gene to identify the different organisms in the community. Towards the end of the 1970s, one of the first sequence analysis toolkits, at that time it wasn't called bioinformatics, the Staden package, was developed. People were quite optimistic about the technology, and Staden, who developed the package, had this observation that DNA sequencing is now a fast procedure and the availability of computers gives the possibility of a more efficient overall strategy for sequence determination. Of course, compared to what we can do nowadays, this is considered a low-throughput technology. I'm just encouraging you to think that five, 10, or maybe 20 years from now, with technology improvements, we will be looking back at our current technical challenges and thinking that we have achieved a lot, but the technology is moving fast. Problems that you might not be able to solve today may have a better solution tomorrow. So don't get discouraged; focus on what you can solve and what you can actually interpret with the current technology limitations, namely the short reads, inaccuracy of reads, and so on.
Okay. So by the 80s, as I mentioned, people had been looking at different communities and using marker genes as a way to characterize those communities.
This was primarily an effort out of Norman Pace's group, where his group looked at different low-complexity communities and was able to extract enough DNA, clone it, and sequence it. You see sequences like this, comparing a known sequence in the database, with a given name, to an unknown query sequence from your sample. Through this type of similarity search, you can then infer what's found in the community of interest. By the 90s, Sanger sequencing had improved, so now you have capillary sequencers and you're able to do 96 or 384 sequences in a single run. This development led, for example, to different studies using 16S as a marker to look at different communities. This is also the era of whole genome sequencing. You now have enough throughput that instead of sequencing single marker genes, you can take a shotgun sequencing approach and assemble the entire genome. 1995 is when the first bacterial genome was sequenced, assembled, and published, and metagenomics as a term was defined around the end of the 90s. This is also when Illumina was founded.
The 2000s are arguably the early age of microbiome studies. People were applying early next-gen sequencing and also Sanger sequencing to different communities. One of the very well-known ones is the Sargasso Sea expedition led by Craig Venter, which essentially went around, I believe, the Gulf Coast at that time, extracting DNA from sea water and sequencing it to identify what kinds of bacteria and archaea are found in the sea water. The acid mine drainage study is another interesting one that looked at a low-complexity community.
I think this is one of the first papers, if not the first paper, that showed you can actually assemble a complete genome from metagenomic data if the community complexity is low enough. And with sequencing becoming increasingly less expensive, there are commercial offerings, and even citizen science, non-profit offerings, that let you sequence your own gut, or sequence your cat or your dogs. There's even something called a second genome project that looks at your microbial community. For a low fee, you can pay these companies to sequence your own microbiome.
Okay, so for the last bit, I'll move into some of the major concerns with metagenomic analysis. At the top of the list are the data quality issues I alluded to. Sequencing is not error-free, and depending on the sequencing platform, you can have very accurate sequences, such as the ones generated on Illumina, with less than 0.1% error rate due to substitution. As a side note, in Illumina reads the quality drops as the read gets longer, so towards the end of your read the error rate is significantly higher, and the figure here is sort of the average error rate across sequences. The single-molecule sequencing platforms such as PacBio and MinION, however, have significantly higher error rates, ranging from 10% to 15%. So imagine one out of 10 bases in your sequence is incorrect. In those cases, you need to learn how to interpret the results and how to correct for the errors. For amplicon studies, chimeras can be an issue. There's about a 1% chance of getting a chimeric read. This is when, during the PCR reaction, two or more templates are combined artificially into a single amplicon. And there are tools that will help you detect chimeric reads.
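To put the error rates quoted above into perspective, here is a small back-of-the-envelope sketch in Python. The per-base error rates (0.1% and 10%) and the roughly 1% chimera rate come from the lecture; the read lengths, the number of reads, and the assumption that errors are independent and uniform along the read are simplifications made only for illustration. In particular, real short reads get worse towards the end, as noted above, so treat the output as a rough guide rather than a prediction.

```python
def expected_errors(read_length, error_rate):
    """Expected number of incorrect bases in one read."""
    return read_length * error_rate


def prob_error_free(read_length, error_rate):
    """Probability that a read has no errors at all,
    assuming independent errors at a uniform per-base rate."""
    return (1.0 - error_rate) ** read_length


def expected_chimeras(n_reads, chimera_rate=0.01):
    """Expected number of chimeric reads in an amplicon run."""
    return n_reads * chimera_rate


# Hypothetical read lengths; only the error rates are from the lecture.
scenarios = {
    "short read, 0.1% error": (250, 0.001),
    "long read, 10% error": (1000, 0.10),
}

for name, (length, rate) in scenarios.items():
    errors = expected_errors(length, rate)
    clean = prob_error_free(length, rate)
    print(f"{name}: about {errors:.2f} errors per read, "
          f"P(error-free) = {clean:.3g}")

# Hypothetical amplicon run of 100,000 reads at a 1% chimera rate.
print(f"expected chimeric reads: {expected_chimeras(100_000):.0f}")
```

Under these assumptions most short reads still come out error-free, while essentially no long read at a 10% error rate does, which is why error correction or consensus approaches matter so much more for the single-molecule platforms.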
The other data quality issue is associated with the metadata, the contextual information about the sequence data that you're generating. How many of you have gone into a public database trying to find a study similar to yours, or read a paper and said, okay, this looks like interesting data, I want to download it and try it? And when you go to, say, NCBI, you realize that the metadata in the paper and the metadata in the public archive essentially don't match. So you either have to contact the authors or just give up on that data set. How many of you have tried that and failed? Okay, so yeah, right. So I just want to highlight the importance of metadata. Don't fear that people might scoop your work or scoop your data; I think there are enough samples out there for everyone. It's important to be able to reuse the data being generated to help in your own study or your own interpretation. So think that way, and when you deposit your own data into the public repositories, make sure it's easy for other people to reuse.
Okay, and there are community standards that can help you make the metadata more consistent. I won't go into the details here, but roughly speaking, they consist of a minimum checklist asking you to specify some key information about your study. In addition, depending on the environment you're studying or the sample types, there are also specific environmental packages with additional data fields that are good to specify so other people can reuse your data without having to recompile the metadata themselves. If you go to this website, it will give you an Excel spreadsheet of all the data fields. What these metadata standards don't really enforce is the values you put into the fields.
Some fields are easier to enforce, such as date formats or measurements in specific units and so on. But there is still a lot of free text. So some of the work I've done in my group is trying to improve the terminologies used in these metadata standards, to ensure that you describe things consistently through the use of controlled vocabularies and what are called ontologies. Recently there was a paper published by Mark Wilkinson et al. on the FAIR principles, and it's getting a lot of notice. It defines how a data set should be stored and curated to ensure that it's findable, accessible, interoperable, and reusable by others. More and more funding agencies are looking at the FAIR principles as an indicator of what data generators should try to achieve with their data sets. Okay.
So another major concern of metagenomic analysis is the comparability or reproducibility of the data. As I mentioned already, often the public data sets that you want to use for your own comparison are essentially not usable. And in many cases, even if you want to reproduce the experiment using similar sample types and similar processes or SOPs, it is still difficult. Some of the factors affecting this are the use of different marker genes or different marker regions, which can affect the result of your microbiome study. Different sequencing platforms and sampling conditions can also give different results, and we'll talk a little bit about that later. Lastly, the workflows are often ad hoc, so a lot of the time the details of the workflows, such as the parameters used, are not kept. In this workshop, we'll show you, for example, how to publish the workflows as well as the results.
So you can keep track of the analysis you did, the data sets you used, and so on.
Okay. Another concern is the linkage and resolution issue. The current technology, especially marker gene-based analysis, only looks at a small region of the 16S gene. And even if you look at the entire 16S gene, it still often doesn't have the resolution to differentiate strains within a species. So the strain-level diversity in metagenomes will often be missed, due to the difficulty either in interpreting 16S genes or, when it comes to shotgun sequencing, in assembling and reconstituting your genomes at the strain level. I think Laura will touch on this a bit more in her lecture, and we will talk about the pros and cons of assembling your metagenomic reads and how to interpret the quality of your assembly.
Okay. Another issue concerns taxonomy and OTUs. Taxonomy is the names you give to an organism or group of organisms. As mentioned already, a lot of the organisms in your samples are unknown; in other words, they don't have a name. The approach taken to deal with that is essentially to give them an OTU, or operational taxonomic unit, as a placeholder for a proper name. The issue with OTUs is that they're not correlated to the function or the phenotypes of the organism. The threshold is also often arbitrary, often set at 97 percent sequence similarity for grouping organisms. We'll talk in module two about why that's an issue.
Okay. The last concern is with functional annotation. Again, there are many genes of unknown function, or hypothetical genes, in your data set. The study shown here found that on average, even with detailed annotation in terms of molecular function, there's still a large proportion of genes that have unknown functions. In this case, roughly 60 percent of genes have an annotation, but the rest do not.
And when it comes to biological processes, the proportion is even lower. So when you do a metagenomic study, more often than not, especially with environmental samples, you're dealing with genes that simply don't have an equivalent in the database. Therefore, you will not be able to use similarity search to identify the function of such genes. You might need to look at the correlation of that gene with your sample context to try to understand its possible functions, or you might need to do some pathway studies to try to infer the functions of genes of unknown function.
Okay. Given the time, I'll go through the resources very quickly. This is really just to highlight some of the common reference databases that you can use for your analysis. For 16S, the most common ones right now are probably the SILVA and Greengenes data sets. These are curated 16S sequences and other marker gene sequences that you can compare your samples to. Often, you might also be interested in whole genomes as reference genomes for your metagenomic data set. NCBI GenBank has a list of curated genomes. Over the last few years, the research community has been systematically trying to sequence genomes from common metagenomic samples in order to improve the reference data available in these genomic databases. PATRIC, in addition to being a repository, also provides some tools that allow you to study genomes. For metagenomics, again, there are several repositories. For the Human Microbiome Project, the data are archived in what's called the HMP DACC, the data coordination center for the Human Microbiome Project. That's a wealth of information, ranging from the SOPs to the different data sets available, and you can also request access to metadata through this portal. EBI has its own metagenomics archive, and NCBI too has metagenomic read archives that you can access.
MG-RAST is another resource, but one cautionary note is that the tools provided within MG-RAST often over-predict, so be careful when you use it. For functional studies, I'm just highlighting a few. For metabolic pathways, you can use KEGG to annotate your own genes. For protein family analysis, you can go to UniProt for reference proteins and protein family references. CARD is an antimicrobial resistance database that Justin here has helped to build, and it allows you to identify different antibiotic resistance genes that are found in your samples. As Rob pointed out, it could also have some annotation issues that might give you the wrong prediction. But overall, it's a high-quality, manually curated database for antimicrobial resistance genes. And Gene Ontology provides a consistent naming scheme for the different functional genes and proteins. Okay, any questions? If not, we can have coffee or