So, are we ready to start? Like coaching under-six soccer: are you ready? Okay, so everything here is Creative Commons licensed. I think Michelle probably went through that. It's totally open. Feel free to reuse it. This is, you know, we try to publish open access and so on. All right, okay. Module one, introduction. So this is going to be less of a technical introduction to metagenomics and more of a conceptual overview. We're going to talk about some detailed things and certain types of experiments, but my objective is really to get you thinking about metagenomics and then to start interacting with some of the resources. My tutorial, by the way, is on the wiki now. And as you'll see, it's really about trying to interact with the data, finding different ways to pull the data down from the online resources so that you can get started with other people's data, should you have the courage to do so. I want to acknowledge a couple of really key contributors. So the first, of course, is the Canadian Bioinformatics Workshops, as represented by Michelle, but also as run and tremendously supported by Francis Ouellette. If you haven't met him, you totally should. He's a really, really fantastic guy. And Michelle and Francis have been extraordinarily busy over the last two months, because this is the 10th this year, Michelle? Eleven, okay. And so now we have the Centre for Comparative Genomics and Evolutionary Bioinformatics. This is a range of researchers who work on theoretical aspects, mathematical phylogenetics, applied metagenomics, a lot of protistology, really exciting creatures, and after listening to the taxonomy for 15 years, I still haven't learned it. Genome Atlantic is our local genome centre, which tries to facilitate connections and basically bring more research funding to researchers in the region. And oh, there, that's me. So these are my contact details, and we're on Twitter.
All right, please, please make the most of your visit. And like I said, we're happy to help facilitate anything that you want to know about the region, anywhere you want to go. I know some of you are taking the opportunity to go camping. Something I wanted to mention about Halifax, before I finally shut up about Halifax, is that there's a very important person who was born in Halifax, named Oswald Avery. Right, and so some of you may be familiar with this experiment. You know, everybody's like, oh, Watson and Crick, discovery of double-stranded DNA. I mean, that's fine, right? Okay, that's great. But, so some of you have probably learned this in like first or second year university and I don't want to dwell too much on it, but it's neat. The rough strain, which is non-virulent, cannot make mice sick or dead. You've got the smooth strain, which is virulent. What happens if you kill the smooth strain? So you heat them up to some sort of temperature where they can no longer survive. They go, hey, and then you introduce them into the mice. Do you want to change the color scheme? I have no idea. I've not seen this window ever before. Yeah, let's try to do that. Right on, okay. So is that 10 minutes and we have our first room? All right, so. Deoxyribonucleic acid. So rough strain plus DNA from the smooth strain equals bad rough strain, equals dead mice. So, also changing the world, right? So we have six modules. The first one is delivered by me, and it's basically talking about what we're talking about and different approaches at a high level. And then we're going to investigate things in more depth. So Will Hsiao's at the back. I hope you got introduced to Will. You'll get introduced. Okay, great. He's going to be talking about marker-gene-based approaches, in particular the 16S ribosomal RNA gene, right? And then tomorrow is metagenome fest.
So Morgan Langille is going to start in the morning with taxonomy and the afternoon with function. I feel like we're going to end with that in a minute. Modules five and six: John Parkinson from the U of T, who's done a lot of methods development for metatranscriptomics. He's going to be talking about the field and some of the crucial tools, and then a bit of a walk-through of some of that analysis. We're going to finish off with the icing on the cake. It's going to be a good afternoon. This is a lecture that is going to be delivered by John Parkinson with a tremendous amount of input from Fiona Brinkman from Simon Fraser University, who I'm sure some of you know, who contributed a great deal to the development of this workshop but could not make it due to other travel commitments. So we're very disappointed that she can't make it, but she'll be with you in spirit and in the PowerPoint slides. I want to probably speed things up a little bit. So the basic idea is, what is the objective of your process, right? How do you deal with your own data files? What are the standard pipelines and how do you run them? What do you do for the analysis? And then of course, the crucial one, also known as the critical one, is to recognize the technical limitations of metagenomic studies, right? Because if you spend $60,000 on PacBio sequencing of a metagenome, you're like, wow, we've discovered the key to... I was hoping for something more irrelevant. We've discovered the cure for malaise, right? And then you submit it for review and they're like, this is the stupidest thing ever. You should have used Illumina instead of PacBio. Ah, okay, and then you get scooped. So these are very important considerations, and not all at the detailed analysis, command-line or GUI level, right? So I sort of want to think about this at a few different levels. Sure, for my maximum.
We're gonna start by talking about some terminology, semantics, just because this is a discussion that comes up every time you talk about metagenomics. So I just want to make sure we're all on the same page as to what it is we're talking about. Then, designing a metagenomic experiment: what's the appropriate choice of technology? Interpreting the contents of sequence files. So this is mainly, I mean, FASTA is like the simplest format ever. We're gonna take a little bit of a look at the FASTQ format, but I'm sure many of you are familiar with it anyway, during the tutorial. And then Will is going to, I think, go full on with the files and the processing and the Python and whatever else he's got. And then the one that we're gonna focus on is: what are the online resources, and what are the different modes of access you can use in order to bring these data closer to us? And if you read that really quickly, don't put your hand up. Who wants to hazard a guess as to what the microbiome is? Come on, you're not all jet-lagged, okay? Right, Christine, coffee. That's great. And I think that is the closest to the original Lederberg definition, right? We're gonna actually revisit that. Personal communication. So: the collective genome of our indigenous microbes, the idea being that a comprehensive genetic view of Homo sapiens as a life form should include the genes in our microbiome, right? Another way of looking at it is that it's somewhat equivalent to the organisms found in a particular setting, right? So hopefully you see the difference. In one case, it's all about the genes. In the other case, it's all about the things that contain the genes, right? And so either of these definitions is okay. I'll tell you right now that I tend to use this one, about the microbes, and it encompasses all of their genetic stuff and the metabolic things that are going on, and the metagenome. So that's the microbiome.
The metagenome is a term that's tossed around all the time, in the press and elsewhere. This is Jo Handelsman, 1998, right? Advances in molecular biology and eukaryotic genomics, which have laid the groundwork for cloning and functional analysis of the collective genomes of soil microflora, which we term the metagenome of the soil. Now there's one really important thing about this, and actually Pat Schloss tweeted this a couple of weeks ago. Where does it say sequencing? It doesn't, right? This is about cloning and functional analysis, right? So it's using the genetic material, but it's not necessarily sequencing it. And if you look at some of the first papers, such as Jo Handelsman looking at AMR, antimicrobial resistance genes, in, say, Alaska, it's not about sequencing. It's about: let's blow up the microbes in the environment and clone random bits of DNA into, like, plasmids, and then express them and see what we can find. No DNA sequencing required at all. So the purpose of these sort of tedious introductions is just to try and make sure we know what we're talking about and that we have the right broad perspective on these things. Because, you know, you say these terms and, oh, by the way, does it encompass marker gene surveys? The report that was published a few years ago, The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet, from the National Research Council, with Claire Fraser, Jeff Gordon, Ford Doolittle and a number of others as well, said that it did. So marker genes, strictly speaking, according to the definition: not really, because if you do cloning of 16S and then do expression tests, that's kind of boring. So it's ambiguous, right? But let's not get stuck on the point, right? Well, I tried to come up with the simplest definition of why we do this stuff. Let's explore the relationship between microbes and their habitat. That's a fair definition. And then how do we accomplish this?
Well, this is where the rubber hits the road in terms of, you know, experimental design, bioinformatics, statistics, hand waving and all that. And there are many, many, many, many different ways to profile the community, right? Here are six. And actually, here's something that we can do. So there are six. Maybe we should have a show of hands to indicate whether or not you have done this type of analysis. I think this would be a very interesting survey. So marker genes, okay? Over half the room. Metagenomes, right? So this is the environmental shotgun type stuff. About a third of you. Metatranscriptomes, now we can count, right? One, two, three, four, okay? Metaproteomes. And, okay, zero. Meta-metabolomes. We're gonna be doing some of this in a couple of months. Excellent. Culturomes. What is a culturome? Yeah, so culturomics is a term specifically designed to make Jonathan Eisen very angry with you. But the objective is to try and actually grow as many things as you can from a given sample. Why would you wanna do that? Because, hey, it's a mess in there. To the extent that we can tease certain organisms out, we might be better able to characterize their biochemistry, or grow them in more tightly controlled combinations with other organisms. And I have a beautiful example of that later on. It's not specifically culturomics, but it kind of takes you to part of the reason for culturomics. So, this is the Great Plate Count Anomaly, which should be capitalized. And the claim that has been made for decades, including in this seminal paper by Amann and colleagues in 1995, is that fewer than 1% of microbes across many habitats are culturable, right? And so you take your dirt and you do your various dirt extraction things, and the things you get to grow are boring, right? It's like, oh, these are the things that'll grow anywhere.
Whereas, you know, the Candidatus so-and-so, whatever, those are the exciting things, and they won't grow because they have very specific requirements that you do not know about, right? And so if that's the case, then if you just do culture, you get a very biased picture. And even if 50 or 70 or 80% was culturable, I mean, there's still two problems, right? One is that you still have 20% that you can't get at through these traditional methods. The other is that if we're talking about the human microbiome, whatever that means, then you're culturing 500 things, right? That's a lot of very hard-working graduate students, right? So, you know, now we have metagenomics, this sort of environmental shotgun approach, or indeed, if we use a broader definition of the metagenomic approach, then we get this broader picture of, to use the cliche, who is there and what are they doing? The definition of the microbiome was actually quite specific to humans, but it's generally accepted, at least among people that I talk to, that this is a term that is relevant to any sort of habitat, any sort of biome on earth, right? So when I say microbiome, I'm not just talking about humans. I'm talking about, you know, soil, ocean, rivers, hydrothermal vents, wastewater ponds where graduate students get to carry shotguns to scare away the polar bears, all of these things, right? And so, the scale of the human microbiome on a single host: how many genes do humans have? Well, somewhere between 20,000 and 25,000; this is about 25,000 plus or minus 50,000. Now, your typical human gut microbiome sample will have on the order of greater than 160 species. It can be several hundred "species", in scare quotes, because we lack a good definition of species. And, you know, over a million different genes.
So in terms of functional diversity, of course humans have all these confusing things like introns and shuffling and whatever that make me glad I don't work with them. So there is more complexity than that number 25,000 would imply, but I mean, this is significant, this is substantial, right? And it's true of any biome on earth. I want to get at the roots of metagenomics, just because I think it's important to take a look back and see where the field has come from and some of the techniques that have emerged over time. So I won't do that yet, we'll get to that. Okay, let's start with Fred Sanger, who I would be talking about anyway, because one Nobel Prize is not enough. He got his first Nobel Prize for determining the protein sequence of insulin, right? The 1960s actually saw the first DNA sequencing method. It was not very useful. I mean, it was amazing at the time, right? But it doesn't give you everything. It was what it was. So isn't it interesting? Because, for those of you who know about very simple homopolymeric sequences, they were actually the only things that people could sequence 45 years ago, 55 years ago. And of course, if you don't know about Margaret Dayhoff, you should. She was a pioneer in many ways. She's been referred to as the mother and father of bioinformatics. And one of the key advances was the Atlas of Protein Sequence and Structure, which was the first collection of all proteins that had been sequenced to date. I forget the exact number, but I think it was somewhere between 50 and 100. In the 1970s, we start to see additional algorithms for alignments and the first efforts at DNA sequencing, right? So Sanger plus-minus, and Maxam-Gilbert, which I understand is a horrific way of sequencing. It's chemical, and it's extremely toxic. And then the revolution in 1977 was Sanger dideoxy chain termination. It has been cited over 64,000 times. This is what got him his second Nobel Prize in 1980.
So the turnaround time there was like, okay, so this is significant. Also in 1977, the first major discovery of the Archaea, right? This sort of Carl Woese and George Fox tree of microbial life. Also in 1977, where is it? Oh, I have the other on the next slide, so we'll see that in a second. 1977, the Staden bioinformatics software. And the Staden package in 1979 added the first method for automated sequence assembly, right? So, 1979: the continuing rapid fall in the cost of computer components is making it possible for most DNA sequencing laboratories to have their own small computers. So the fact that DNA sequencing is now a fast procedure, and the availability of computers, gives the possibility of more efficient overall strategies in sequence determination. Holy crap, look at all the data, right? Every paper in bioinformatics now starts with holy crap, look at all the data. So here it is, this is amazing. So 1970s, this is what I was gonna say. Genome sequence, 1977, phiX, right? Now it's the genome that's been sequenced more than any other, because you dump it into your MiSeq run to deal with some of these clustering issues. And so here in all this glory, make sure that we're going to look at 1975 and the whole genome. 48 years ago. This one's pretty neat too. So it was determined in 1974, 1975, 76, but it wasn't through sequencing. It was through recombination analysis, right? Mutations, other really, you know, difficult and time-consuming techniques. So this map is not based on DNA sequence; it's based on, you know, at least 20 years of experimental biology. These are the first. Okay, 1980s, 1980s. Dr. Dayhoff established an online computer database and a sophisticated retrieval system accessible by phone to outside users in September 1980. So imagine, okay, and I don't know exactly how it worked. I was three when it came out. But you call in, right? And you're like, press one for insulin. M, A, R, right? So yeah, I laughed, but it's remarkable.
Okay, so we're getting to what I call the dawn of metagenomics, okay? Because in 1984, 1985, more recently, developed the first methods for sequencing genes from microbial communities without culturing, okay? This is where it all started. They did not sequence the seed in the gene. They sequenced the ribosomal RNA interaction, okay? But this is where it all began, okay? 1985, 30 years ago. To hydrothermal tests and then to talk to the spring with the industry, you know? mid-1980s, and this paved the way for everything that came after. The first data sharing agreement between NIH and US, EMBL and Europe, and JNIT and Japan, right? Let's mirror things, let's share the data, let's make it open. Thank you very much, right? By the geography, 1987, this is certainly playing an important role in various environmental metagenomics. 1989, ribosomal database project, right? So, they come along. On metagenomics, right? We have a couple of reference sequences from solapolivus, the student calderia, and the comodicus for the environmental sequences that were characterized on that. Very simple communities, but it was 1985. Holy moly. Right? 1990. CNES study. You have an on-ancient, cellular genome, 1995. Go ahead up to 1999. Who knows what a RISA is? Okay, what's a RISA? Ribosomal, yep. So, the basic idea is that it's really expensive to sequence, in fact. So, let's use restriction digestion and some sort of gel-based approach or some other-sized, typing-based approach such that, you know, your heterogenic transfer of sequence between 16S and 243, yeah. And mine are slightly different, but hopefully they have sequence variations that lead to differences in restriction sites. So, even very recently, you see papers published that are based on a RISA. Doesn't give you as much information as others do, and it can be difficult to do automatic attribution, but it's still what it is called. Okay, 2000 metapronium is called the Kepronium. 
People knew about it, and they got a little discovery of life-harvesting method of- Definitely, the microbiome. 2004, Prismus double, Acid-migrated metapronium, Gil-Batheal, Acid-migrated metatranscriptomics, Gil-Batheal. 2005, Acid-migrated metapronium, Gil-Batheal, right? So, she led the way in a lot of the stuff in the mid-2000s. So, it's a very, very deep sequence, starting to get into the human microbiome studies. And then the two major human microbiome, HFT, and then the European Medi-Hist Project. Okay, we're reaching the end for it. And so, more and more sequence variations on this policy, and that's a degree, many of you know of, has been probably the leader in the advancement of microbial ecology into the sequencing medicinal mix era. So, a lot of these sort of beta diversity methods, different statistics, different environmental correlates. In 2013, quickly biomephobiasylab, sequence, people samples from 21 mice. So, we do computational stuff, and we're like, hey, we have $6,000, what can we do with it, boom, right? So, this is just an illustration of how it's estimated to be coming. 2015, we're out of sources, and now we're working on it. And I, unreleased, we're getting ours in a few weeks, we're very excited. And the other thing about this decade is that with the greater accessibility of sequencing for very low cost, you start to see interesting things like the microbiome of roller girders, right? And that's actually, that's a good read. This thing, I'm not allowed to mention last year, this microbiome, mobile phones, American cool ship ale, Iron Rugby players, Mike just pointed me out to the sports surfaces microbiome, which was like workout tables and stuff like that. I can't do that. And retesional cheats, which actually I would love to have been involved in that. Okay, so that's history. History is history. Any questions so far? Apart from what are you gonna get to the point of? Any questions? Questions? Okay. Yeah, question. This is gonna answer. 
Who wasn't? Who was she? Anybody know? I forget. Sanger. So the early metagenomes, 10 years ago, the early metagenomes that you used to, these days, and I'm gonna show you an example later, roughly, is for doing them. So it's not restricted to Illumina or Pac-Bio or 454 or Nanocore or anything specific like that. So this is the very, very, very, very high level view of how these things are done, right? The big picture, the big picture, you have a sample of data. Well, this is the metagenomes, metatranscriptomes, metaproteomes, microgene, whatever, right? You've seen very important, I mean, this is one of the critical things of the early 2010s with the development. One of the papers that I included in your recommended package, which is your subword, Bioinformatics for the Human Microbiome Project, right? So this doesn't go into the details, but it cites the details. This is basically saying we had more data than anyone had ever dealt with before. We needed to do the analysis quickly and well, right? So they talk about some of the 16S analysis challenges they face and they did things like building mock communities. So basically, what's a mock community? It's like, okay, I have, in you go, in you go, in you go, what's on the show? Kidding me. Okay, see, I warned you, didn't I? Yeah, exactly. So, and then the downstream analysis. So, by the way, just very quickly, my plan was to go as far as I could before the coffee break or before I see that it gets to be very important product. And then after coffee break, so after you've been as coffee break, after coffee break, I figured we dive into the tutorial, maybe for an hour and then save the last half hour to finish up with this. I think that's a reasonable balance of things. Market teams extract DNA, amplify with targeted primers, right? You can look at B1, you can look at B2, B4, B3, B5, B69, filter errors, filter clusters. You don't know what I just said, you'll learn about it this afternoon. 
And then various types of diversity analysis. And for metagenomics: extract DNA, quality control, and then your various diversity and function analyses, metabolic pathway reconstruction, what have you. Metatranscriptomics, which I'm very excited for on Friday, because I haven't really done this myself. I've been really looking forward to John's session. Extract RNA, remove the ribosomal RNA, although, well, it's actually quite informative, and Jessica Green had a great paper on what ribosomal RNA levels can actually tell you, and then quality control. And then you can look at function, gene expression, taxonomy, the usual. One of the papers that I'm gonna give a very brief overview of shows that the metatranscriptome can be orders of magnitude more informative than the metagenome. It's a detail, right? It's there pretty well. We're taking samples and we're gonna have some metadata, which is like these samples in this ball. Let me be specific. These samples came from older mice, these from middle-aged mice, and these from younger mice. And so we have some sort of taxonomic representation. In this case, and again, you'll be learning about this this afternoon, so don't worry about it too much, it's a principal components plot. I think it's a principal coordinates plot. It basically tries to take all the diversity, smoosh it down into a couple of manageable dimensions, and then show you the similarity. Well, that's another matter entirely. Okay. Well, I have some papers to talk about. Michelle, how much time do I have? So just to give some motivating examples of some really cool work that's been done in the last few years, I wanted to walk through a few different examples of metagenomics, meta-omics. And I haven't put these papers up, but they are cited in the talk. And if I remember, which means if you remind me, I can also throw them up on the wiki later on. These are some of my favorite papers.
Well, some of these are my favorites, and then a couple of them. So, it's always good when you're explaining metagenomics to your mom, dad, and neighbor to start with C. difficile, you know. Because hey, that's one of the most prominent examples of the discovery, the use, the application, the success of essentially a metagenomic approach. This paper was from 2012: Trevor Lawley at the Sanger Centre and a cast of thousands, or at least a dozen or so. We had these different mice and we've taken samples from them. When you represent it this way, there's sort of some Captain Obvious stuff going on. But this is just the community profile. Who knows what Shannon diversity is? So Shannon diversity is, okay, so let me back up for a second. Richness is just the number of different things you see. Shannon diversity goes further and uses an information-based approach to try and accommodate both the richness and the evenness. What's the evenness? It's the relative abundance, right? Because a richness of 10 where the distribution is 10%, 10%, 10%, 10% and so on is very different from a richness of 10 where it's 99%, 0.1, 0.1, 0.1, 0.1 and so on, right? And so Shannon diversity attempts to capture this. They're both informative measures, but they serve somewhat different purposes. And in this case, you can see that the naive mice and the recovered mice have a much higher Shannon diversity than the others. For those of you who've looked at any papers like this: dysbiosis, things going badly in the gut, usually goes along with lower diversity in the gut. Now, this is like my favorite figure from microbiomics I've ever seen. And of course, it's a principal component plot, because you can't have a microbiome paper today without one. Go back to the diversity information from before. And again, Will's gonna talk more about this. What I showed you before is the alpha diversity. What's the alpha diversity? It's simply the diversity within a sample, right? And so you have an alpha diversity of your gut, and so on.
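To make the richness-versus-Shannon point concrete, here is a minimal sketch in Python. It is not taken from the paper or from any workshop tool; the function names `richness` and `shannon_diversity` are made up for illustration, the natural logarithm is an assumed choice of base, and the two communities are simply the 10%-each and 99%-plus-0.1%s distributions described above.

```python
import math

def richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for c in counts if c > 0)

def shannon_diversity(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h

# Two hypothetical communities with the same richness (10 taxa).
even = [100] * 10          # every taxon at 10%
skewed = [991] + [1] * 9   # one taxon at about 99%, nine at 0.1% each

for name, community in (("even", even), ("skewed", skewed)):
    print(name, richness(community), round(shannon_diversity(community), 3))
```

Both communities have a richness of 10, but the even one has a much higher Shannon diversity than the skewed one, which is the distinction the two measures are meant to capture.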
Beta diversity, which I'm sure some of you are familiar with, others maybe not: beta diversity is the similarity, or the dissimilarity, between samples. And so if my gut microbiome and Yanan's gut microbiome are very similar, then our beta diversity is low. And so what they did here was they calculated the beta diversity between all pairs of samples and then they ran this through the principal component analysis. I think Will is gonna start with the difference between principal components and principal coordinates this afternoon. He is now. Okay, so anyway, the answer is, you get this lower-dimensional one, the tight cluster there, over here is just about right. We have very, or we can give them an antibiotic, say, clindamycin, right? So let me explain what the four different colors are. And the other are a lot of the silks. So this is your probiotic yogurt, the yogurt that you eat in the morning. What is the classic one? What is mix B? Well, it's a subset, a chosen subset from the healthy microbiota. Because what's in the poop of a mouse? I don't know, you know, right? If you're a mouse, do you wanna take a donation from someone? Maybe not. For mix B, they took six things. They put them together and they said, how well does this work? So these are classic, right? So mix B was the most successful. You've got sickness, you've got health, you've got different treatments. And actually the geography of this weird abstract plot thing tells you a huge amount of information, okay? So I think I'm at time for this morning. Okay, we'll do, okay? Hopefully people's brains are not full yet. Any questions about this? Okay, good, it's all clear. It all makes sense. All right, these are some of my favorite papers. And this one is, in my meta-genomics, the environments in ecology and the results are really nice. Yes, of course, thank you for the little selected and favorite circumstances. I'm not gonna call it a competition, but I'll give you a lot of the insights.
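Going back to the "beta diversity between all pairs of samples" step, here is a rough sketch of what that calculation can look like. The talk does not say which dissimilarity measure the paper used, so Bray-Curtis is an assumption here, and the sample names and counts are invented purely for illustration.

```python
from itertools import combinations

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors over the same taxa.

    0.0 means identical composition, 1.0 means no taxa in common.
    """
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1.0 - 2.0 * shared / (sum(a) + sum(b))

def pairwise_beta(samples):
    """Dissimilarity for every unordered pair of samples, keyed by sorted names."""
    return {
        (s, t): bray_curtis(samples[s], samples[t])
        for s, t in combinations(sorted(samples), 2)
    }

# Hypothetical counts for four taxa in three samples.
samples = {
    "mouse_A": [40, 30, 20, 10],
    "mouse_B": [38, 32, 18, 12],
    "mouse_C": [0, 5, 5, 90],
}

beta = pairwise_beta(samples)
for pair, value in beta.items():
    print(pair, round(value, 2))
```

Here mouse_A and mouse_B come out very similar (low beta diversity) while mouse_C is far from both. A table of values like this is what an ordination method then compresses into the two-dimensional plot.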
And see the answers, I don't know. Because yeah, I mean, if passage worked well, then it makes sense to sub-sample from that one to try and get a simpler community. Absolutely. Any other questions about feeding various things to the rest of the questions? All right, I've got a field research point. So we like this so much, we actually implemented it in some of our software. Let me start by asking, is anyone familiar with canonical correlation analysis? Okay, how many of you are familiar with principle components beyond the thing I just showed you? So PCA is a mapping of these complicated patterns of diversity or function into a smaller set of dimensions that hopefully capture most of the information that you had before. But you can visualize because two dimensions is cool, right? 37,000 databases. And over here, this is membrane proteins, this is transporters. They focused on those specific changes in environmental conditions. That is why they focused on those. And so again, don't always think of all the dataset, focus on some of the greatest importance. So what are the two dimensions? You actually want to make sure that the variables digitalized together. You can see strong correlations between environment and the family and also technology for decades, okay? It's cool that they applied it, but they forgot to step further. Because no relation, maybe they're anti-covalent, you don't know. If they took these ordinance sections between those not in the software, what if the relationship between shipping and this phosphate transporter, right? I'll finish this one in there. So let's look at those numbers, those valuations. Let us draw an edge between them and let the green edges represent strong positive association. What do you end up with? You end up with these really nice little things. So this affects, there's a relationship. We shouldn't infer too much causality. There's a relationship between these things. 
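The idea of turning strong associations into edges can be sketched as follows. This is only an illustration of the general approach, not the method from the paper or from our software: plain Pearson correlation stands in for the canonical correlation analysis, the 0.8 threshold is arbitrary, and every variable name and number below is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def association_edges(env, func, threshold=0.8):
    """Return (env_var, function, sign, r) for every strongly correlated pair."""
    edges = []
    for e_name, e_vals in env.items():
        for f_name, f_vals in func.items():
            r = pearson(e_vals, f_vals)
            if abs(r) >= threshold:
                sign = "positive" if r > 0 else "negative"
                edges.append((e_name, f_name, sign, r))
    return edges

# Hypothetical measurements across five samples.
env = {"phosphate": [0.1, 0.4, 0.9, 1.5, 2.0]}
func = {
    "phosphate_transporter": [50, 41, 30, 18, 9],  # falls as phosphate rises
    "membrane_protein": [5, 7, 4, 6, 5],           # no clear trend
    "other_function": [1, 2, 4, 7, 9],             # rises with phosphate
}

for edge in association_edges(env, func):
    print(edge[0], "--", edge[2], "--", edge[1], round(edge[3], 2))
```

Positive and negative edges would then be drawn in different colours. As with the original analysis, an edge only says the two things vary together, not that one causes the other.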
And the phenotype-environment network makes it very clear. Any questions about that? I have three more examples, but I think it's time for me to hand over to Geven. Questions? Thank you. Thanks, welcome back. So I just wanted to follow up with, I think Daniel had this question beforehand. This is from the mouse paper, right? And so there's a collection of 18 bacterial species from the passaged fecal derivative. They had 18 different species, so-called. They split them into three mixes. Mix A was six of them, mix B was six of them, and mix C was six of them. Dr. Seussian. And this is basically showing that mix A and mix C did not reduce the colony-forming units of C. diff. Mix B did, so that was the successful one. And then they characterized the six species in mix B: Staphylococcus warneri, Enterococcus hirae, Lactobacillus reuteri, and then three novel species: sp. nov., sp. nov., sp. nov. So that's the story of mix B. Okay, so I'm going to go through the remaining three examples a bit more quickly, hopefully. This is a very recently published, beautiful example of what you can do with metagenomics plus metatranscriptomics, both in terms of working up the data and interpreting the data. And I'm not going to do all of this. Like I said, I'll put the reference up if you can remind me later. The basic idea is that you have raw metagenomic and raw cDNA, metatranscriptomic, data, and quality control. And what they did was interesting. So they split their metagenomes, because metagenomes are very complicated, right? Now this was an acid mine drainage community, so it's a lot less complicated, but it's still complicated. There's still a few different things in there. They actually split their metagenome based on k-mer abundance. What's a k-mer? Who knows what a k-mer is? K-mer, okay. So, somebody give me 10 letters of DNA. Come on. C, G. A, T. A, T. Stretch of A's. What's that? Stretch of A's. Ha, ha, ha. A, A, A. Seven, okay.
T, I'm going to throw in a G, and? C. C, perfect. Okay, so this is a sequence of DNA. One way to represent it is by doing a BLAST search, comparing it against a database, and then trying to figure out what it matches to. Another way to deal with this is to decompose it into k-mers. So, k-mers are words of length k. So, if we want to express this in terms of k-mers of length two, k-mers of length two are dimers, right? And so, we can say, how many CGs do we have? Well, we have one. How many GAs? We have one. How many ATs? One, two, right? So, you see how we're counting words of a given length. And we have two AAs, for example. So, that's the k-mer decomposition. You're just counting words of a certain length. Words of different lengths are useful for different purposes. Here they used words of length, oh, I think it was 12, but I forgot. And they basically separated them into high-abundance words, which might come from high-abundance species, and low-abundance words. So, they're taking their metagenome, and they're actually fractionating it out so that the resulting things are simpler. And then there's some garbage, well, not garbage, but there's some leftovers that they scoop up and put back together. So, they get contigs that are theoretically better than you would get from a naive metagenome assembly, right? They assemble their RNAs, they do clustering, and they build these contigs. They claimed 11 draft genomes from this AMD sample, which is pretty cool. So, what's really neat about this, in addition to that, is that, so here are 11 different things, the 11 species or types or whatever you wanna call them, identified. These are the abundances, the relative abundances of their DNA, i.e. metagenome, versus cDNA, i.e. RNA, i.e. metatranscriptome, right? And so, you can see that FKV7, which is Ferrovum, was over 98% of the DNA. And in transcriptional terms, it's about 92%.
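To make the k-mer decomposition described above concrete, here is a minimal Python sketch. The 12-letter sequence is a hypothetical stand-in, only loosely modelled on the letters called out in the room, so its counts need not match the tallies given aloud. The `split_by_abundance` helper and its cutoff are likewise assumptions: a toy version of separating high-abundance words from low-abundance words, not the method the paper used.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping word of length k in a DNA sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def split_by_abundance(kmer_counts, cutoff):
    """Toy fractionation: k-mers seen at least `cutoff` times vs. the rest."""
    high = {kmer for kmer, n in kmer_counts.items() if n >= cutoff}
    low = set(kmer_counts) - high
    return high, low


# Hypothetical sequence used only for illustration.
sequence = "CGATATAAATGC"
dimers = count_kmers(sequence, 2)
high, low = split_by_abundance(dimers, cutoff=2)

for kmer, n in sorted(dimers.items()):
    print(kmer, n)
print("high-abundance words:", sorted(high))
print("low-abundance words:", sorted(low))
```

A sequence of length n contains n - k + 1 overlapping words of length k, so the 12 letters here yield 11 dimers. In a real metagenome the counting would run over millions of reads with a much larger k, and dedicated k-mer counting tools are normally used instead of a plain dictionary.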
So, you get a relationship, and you can test that sort of naive assumption that DNA and RNA levels should be the same. It's actually kind of under-producing, right? And then look at FKV1, it's way less than 1%, but its transcriptional contribution is over 1%. It's punching above its weight in functional terms. Makes sense? And so, there's the really, really common thing, and then there are the rare things. And here's the crucial outcome of the paper. There are critical functions, such as nitrogen fixation, that are going on in the community. Who's doing them? The rare things, okay? FKV1, I don't know what's the example. Yeah, FKV1, right? Very rare, far less than 1% of the community, doing most of the nitrogen fixation, almost all of the nitrogen fixation coming from this one little dude, right? What is it? Acidithiobacillus ferrooxidans, which is interesting for a whole range of reasons, one of my favorite organisms. And then you can see sulfate reduction being carried out by FKV7, so that's pretty common. But then for other functions, you can see that other rare individuals are carrying out certain other crucial functions, right? Metagenomics wouldn't really tell you this, necessarily. Metatranscriptomics does, yeah? Plus you get 11 draft genomes, so that's really cool. Okay, this is not as recent, but it's just a very quick example of metabolomics, or meta-metabolomics, in bacterial vaginosis. And I don't wanna dwell on this too much. It's basically showing you that there's more than one approach, right? And so what they had here, they had, I forget the exact number, but each of these was an individual, either negative for bacterial vaginosis or positive. And so this is a heat map, and we've all seen heat maps, right? Hotter colors represent higher amounts of something, right? And so this, there's a really super red box here, it means that's zero for five. Individual 50, individual 50 has a lot of that metabolite. Mednailla, no, I don't know what that is.
You get the idea, right? And so these are all metabolites that were profiled using NMR, right? Not genes, not transcripts, not proteins, metabolites, which is really cool, because you get some functional information at the kind of business end of the microbiome. So they have two main clusters, each of which comprises people with and without BV. And you can see immediately that some metabolites are critically important in distinguishing both this cluster and that cluster, and then BV positive, right? So here's an example of something that distinguishes. So on balance, they concluded that catabolic pathways for things like amino acids and peptides were overrepresented because of the products they found. So, nice example. And then they did taxonomy, because hey, you got to do taxonomy. These are the kind of good guys over here, the Lactobacillus, Lactobacillus, Lactobacillus. And these are sort of the dog's breakfast of other organisms that are not as nice. So just, again, metabolites by organism. All right, last one. So this is a great paper. And I think this is the preprint. I think the paper's actually 2015. So this was a paper by a large number of authors. Senior author was Marty Blaser, who is one of the best-known microbiome types. And the purpose of this study was to investigate the effects of low-dose antibiotics, specifically penicillin, on physiological and metabolic development of mice from birth, right? So lots of people are put on prophylactic antibiotics and things like that for various reasons, early in life, later in life, and so on. And the question is, what are the consequences of this? We know that you're beating down certain parts of the microbiome when you do this. What are the effects, right? So they went a lot further than just characterizing the microbiome. You need to look at this paper. It's got five figures and a total of 92 panels. I was like, what crucial panel can I pull out?
So I chose the best 10. But here's the experimental design, basically. Control, right? No antibiotics. This is gestation, nursing, and then normal rat chow or mouse chow. Control, no antibiotics. LDP at weaning, blue, right? So the antibiotics started here at four weeks. And the B was birth, basically a little bit before birth. So during nursing and through mouse chow, they got the low-dose antibiotics. And then they characterized things in a billion ways, right? And just let me draw your attention to a couple of things. Growth rates were different. Where's the critical one at? Fat mass, right? So in male and female mice, they found that the fat mass was significantly higher for the LDP group from birth. And they found different diet responses and so on and so forth. And so this is really interesting because they're looking at causal mechanisms. They're looking at these relationships, not just who's there, but what are the effects? So that's kind of interesting. And then, let's get some taxonomy going, right? And so they had, these are the low-dose penicillin in red. This is the control group. And this is simply a phylogenetic tree with all the dudes they found, bacteria that is. And red is overrepresented in LDP and green is overrepresented in control. And so you can see, for example, that FS247G, I love the LDP taxonomy, is overrepresented in the low-dose penicillin, whereas other things over here, for example, are overrepresented in controls. So you get taxonomic views, but they have histology and they have physiological assays and all sorts of things. So it's in Cell, it's quite a remarkable paper. I strongly encourage you to look at it. Okay, so let's do another quick survey. So you're doing data analysis, that's cool. How many people have used Sanger data? Nice, most of you, that's great. Ion Torrent, okay, a couple of you, very nice. Roche 454, yeah, yeah, a classic. Illumina, something-Seq, HiSeq, MiSeq.
Nice, okay, so there's your market share right there. The PacBio, multiple freezers stacked on top of each other. Yeah, nice. And then the Oxford Nanopore. Ask me in a couple of months, I'm very excited. Michelle? Cool. Okay, so obviously in choosing a sequencing technology there are many, many, many, many trade-offs, such as sequence read length, right? So who's the winner? Depends on what you're doing with it. So arguably, so I think in terms of claimed read length, one, two, three, four, five, six. So does anybody disagree with that? So Nanopore longest, PacBio next, then Sanger, then 454, then Illumina, then Ion Torrent. Illumina and Ion Torrent are kind of tied, I think. MiSeq may exceed the Ion Torrent. Accuracy, we care about accuracy, right? So what's the most accurate? The most accurate. Illumina, Sanger, I think it's Sanger, right? So Sanger's pretty good. And then I think Illumina and 454 are pretty similar, I think. Depends what you want in accuracy. Of course, yeah. I mean, it's sort of, you know, you've got the generally reported error parameters and then there's different ways to run these things, different modes, different levels of accuracy. So I think in general, these three, 454, Ion Torrent, Illumina, you can kind of run to roughly the same error level. And then we have the last two, right? PacBio, which gives you super long reads, and Nanopore, which gives you ultra long reads, but the error rates in both cases are still hovering around 10%, depending on which papers you read and who you listen to. And so, depending on your purpose, different examples of these will be most suitable. So what would you use to sequence a genome, for example? Because we love genomes. It depends on your genome. Okay, pick a genome. So we have Sanger, 454, Illumina and PacBio, for two and a half times the size of the human, and how to put it. Which genome is that? Calculative proof and 3% of this microblast, it's not finished, that's the problem.
Okay, so you're finishing it with a range of technologies, yeah. So ALLPATHS, the assembly algorithm, the latest version I know of, combines PacBio with paired-end Illumina with mate-pair Illumina, if I recall correctly. So you've got paired-end reads that are close to each other, paired reads that are a long way from each other, and then the PacBio long reads, which are much more error-prone. So in general, if you want really good quality something, you typically don't have one technology that gives you everything you want, right? And so a lot of people choose Illumina, hey, it's cheap, it gives a lot of data, it's reasonably accurate, and especially with a MiSeq, you can get some level of assembly, right? But these are all very important considerations. PacBio seems to have settled into a certain error profile. Nanopore is getting better all the time, and people are developing new algorithms to better interpret the signals as they come off the machine. So what's this picture gonna look like in five years? I have no idea. Will we still be talking about metagenomics in five years? I have no idea. So the thing we're gonna almost finish with, and then we're gonna explore in the tutorial a little bit, are different resources. I'm just gonna run through these very quickly. So 16S, right? I mean, 16S has been the de rigueur gene of choice ever since Carl Woese and George Fox and so on said, let's sequence something, right? That should be like, you know, Nanopore, let's sequence something, trademark. And so these repositories, as I showed you, have been in place in a formal way since the late 1980s. RDP and SILVA and Greengenes all have, I think, over a million sequences in them. Greengenes is interesting because of its automated approach to taxonomy, right? And so as you'll see in the tutorial, if you take the same data set and you feed it through Greengenes, SILVA, or RDP, you will get somewhat different conclusions.
In the example I've given you, you will get very different conclusions as to whether a sample contains eukaryotes or not. Hint, no, I'm not gonna give you a hint, you'll see. This is the rRNA, the ribosomal RNA copy number database. What is the largest copy number of 16S genes known in any organism to date? What's that? 16. I know of a 15, what has 16? I don't know, I just read it. Okay, 15 I think is the record, and of course I'm blanking on which organism it is, but many E. coli typically have, I think, six or seven ribosomal operons, right? And so you're gonna learn about PICRUSt, which is predicting functions from taxonomy. And in general, if you have a taxonomic profile, an organism that has seven ribosomal RNA genes is gonna be overrepresented because it has seven, as opposed to the ones that have a single copy. Most things have single copies. For most things that have more than one copy, if you build a tree with a bunch of things, their copies will cluster together in the tree and form a clade, so it's like, okay, these are all representative of Sorangium cellulosum and they're all kind of here, right? Haloarchaea can have 16S copies that diverge by greater than 10%. Where do they go in the tree? Don't ask me. Okay, genomes, right? Because genomes are the things from which metagenomes are constructed. GenBank is a good resource, of course. Lots of modes of access and sort of the default place to put genomes, although they're not all there now. GOLD is really nice because it has information. It's designed to adhere to some of these minimal standards for a sequence specification, right? We haven't talked about metadata quality, but you should try to have some, right? And so GOLD has a wide range of fields to say, where was it extracted from? What was the method used to sequence it and so on? So that's very good. GOLD has been around forever. PATRIC is the pathogen resource, so they have a nice set of features as well, including trees of different organisms.
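The copy-number point made above (an organism with seven ribosomal RNA genes looks seven times as common as it is) can be illustrated with a small Python sketch. Everything here is an assumption for illustration: the taxon names, the read counts and the copy numbers are invented, and the simple divide-then-renormalize step is a generic correction, not a description of how PICRUSt or any other tool implements it.

```python
def correct_for_copy_number(read_counts, copy_numbers, default_copies=1):
    """Divide 16S read counts by gene copy number, then rescale to proportions."""
    adjusted = {
        taxon: count / copy_numbers.get(taxon, default_copies)
        for taxon, count in read_counts.items()
    }
    total = sum(adjusted.values())
    if total == 0:
        return {taxon: 0.0 for taxon in adjusted}
    return {taxon: value / total for taxon, value in adjusted.items()}


# Hypothetical 16S read counts and copy numbers (not real measurements).
read_counts = {"taxon_A": 700, "taxon_B": 100, "taxon_C": 200}
copy_numbers = {"taxon_A": 7, "taxon_B": 1, "taxon_C": 2}

raw_total = sum(read_counts.values())
raw = {taxon: count / raw_total for taxon, count in read_counts.items()}
corrected = correct_for_copy_number(read_counts, copy_numbers)

for taxon in read_counts:
    print(taxon, round(raw[taxon], 3), round(corrected[taxon], 3))
```

In this made-up case taxon_A accounts for 70% of the reads, yet after correction all three taxa turn out to be equally abundant. In practice the copy numbers themselves are often estimates, which is one reason such corrections should be treated with some caution.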
And then Ensembl has both eukaryotic and prokaryotic genomes as well. And there's no shortage of resources. There are others as well. We're not doing a comprehensive survey. I'm just kind of introducing you to a couple. Okay, metagenomes. So this is one I have less experience with, but Mike has been using it a lot. So if you have questions, I encourage you to ask him; he's been using it at sort of the basic level to query things, APIs and things like that. EBI is pretty good for that. MG-RAST also has a really nice API, and we will be looking at that one in the tutorial, including what an API is. And then one that we've used a fair bit is the Human Microbiome Project Data Analysis and Coordination Center, the HMP DACC, which has all of the HMP data, right? Metadata, 16S, metagenome reads, metagenome assemblies, you name it, right? It's all there. And so there are big questions about the size of the database, the reliability of the database, the completeness of the database, how do you access the database, right? So we're gonna look at a few of those things, okay? Function, so function is a great term because nobody knows what function means. Nobody knows how to define it. Nobody knows how to predict it, because even the best characterized genomes typically have about 30% hypothetical proteins, right, the predicted hypothetical proteins. And if you're familiar with the Schnoes et al. paper, 2009, even the ones that have functional annotations are sometimes wrong. And for some functional categories, they're always wrong, right? And so the key trade-off in looking at a functional database is coverage versus accuracy, okay? And so KEGG is an example of high coverage, but not the best accuracy.
Actually, I'll talk about the Schnoes et al. paper in a moment, but then you have UniProtKB, which is neat because there's the Swiss-Prot side and there's the TrEMBL side, and Swiss-Prot, as far as I know, is the most highly curated, the most accurate, most reliable resource for protein function. Therefore, its coverage is the smallest, right? Stands to reason. And then there's UniProtKB TrEMBL, which is more sort of predicted functions, which gives you greater coverage at the cost of slightly lower accuracy. Gene Ontology has decent coverage. One of the really nice things about GO is that you get evidence codes. So it tells you the function and tells you where that function came from, right? So there's inferred from direct assay or something, right? Experimental validation, and there's different classes of that. All the way down to: person came back to the lab after a pub crawl, ran BLAST, hopefully with the right settings, and assigned a function that came from E. coli via Salmonella, via Citrobacter, via Staphylococcus to Haloquadratum walsbyi, right? So there's sort of that range of evidence codes. You can say, for this analysis, I'm gonna focus on the things that had experimental evidence, right? CARD is new and very cool. This is the antibiotic resistance gene database. This is being developed by Andrew, it'll come to me, Andrew at McMaster University, McArthur, that's it, Andrew McArthur. And so it's meant to be a highly curated, highly accurate resource for antibiotic resistance genes. Obviously something that's getting a lot of attention these days. Okay, so, questions? I have one last section before we get into the tutorial, but I'm probably going pretty fast, so feel free to stop me and ask questions. Michelle, yes, five minutes. Okay, major concerns in metagenomic analysis, right? Here's one, right? Data quality, so we talked a bit about errors, right? So there's error rates, which are often quoted, and then there's error types, right?
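The idea mentioned above of restricting an analysis to annotations with experimental evidence can be sketched as a simple filter in Python. The evidence codes listed (EXP, IDA, IPI, IMP, IGI, IEP for experimental support, versus codes such as ISS for sequence similarity and IEA for electronic annotation) are standard Gene Ontology codes, but the annotation records, gene names and term labels below are hypothetical placeholders, and a real analysis would parse them from an annotation file rather than type them in.

```python
# GO evidence codes that indicate experimental support.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def experimental_only(annotations):
    """Keep only the annotations backed by an experimental evidence code."""
    return [a for a in annotations if a["evidence"] in EXPERIMENTAL_CODES]


# Hypothetical annotation records (gene names and GO terms are placeholders).
annotations = [
    {"gene": "gene_1", "go_term": "term_A", "evidence": "IDA"},  # direct assay
    {"gene": "gene_2", "go_term": "term_A", "evidence": "ISS"},  # sequence similarity
    {"gene": "gene_3", "go_term": "term_B", "evidence": "IEA"},  # electronic annotation
    {"gene": "gene_4", "go_term": "term_C", "evidence": "IMP"},  # mutant phenotype
]

trusted = experimental_only(annotations)
coverage = len(trusted) / len(annotations)

for record in trusted:
    print(record["gene"], record["go_term"], record["evidence"])
print("fraction of annotations kept:", coverage)
```

The filter makes the coverage-versus-accuracy trade-off visible: here only half of the made-up annotations survive, which is the price of keeping just the experimentally supported ones.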
So some sequencing platforms get gummed up on, you know, tracts of a certain type of nucleotide. Others have a tendency to introduce more insertions and deletions, some have biases in favor of certain types of call. So this is something that can really influence your results as well. And then there are the chimeras, right? And so those of you who have worked with 16S data are hopefully familiar with things like ChimeraSlayer, for example, or Bellerophon, I think that one's a bit older, because if you're PCRing, if you're amplifying a bunch of 16S genes, often during the process you'll get recombination, right? You'll get these chimeras, these hybrids, yeah. I was just wondering, when you were working with them, are you, is there a trend towards more labs and do they do a reducing strategy to work with them? Which I wish I was doing with my work. Not to my knowledge, I'm not sure, actually. This is a really good question. So is anyone using the dual-indexing approach? I know Pat's published on it. Pat Schloss? Yeah. Okay. I think most people are still using the trend defaults, I think. But yeah, that's a good question. All right. Comparability, reproducibility. There was a paper, and I can't remember, I think it was from Rob Knight's group, I think there have been a few papers, but the one I remember was that one, where you have a bunch of metagenomes sequenced from a bunch of different samples, and you run your PCoA, and you've seen several of these plots already, and you can see what clusters with what. And it was one of those cases where they had 454 and Illumina data. And the primary criterion for clustering is whether the sample had been sequenced using 454 or Illumina, right? And so one thing that I really like to do is meta-analysis, combining many data sets. And with sequenced genomes, especially finished genomes, this is kind of okay, although there are still assumptions.
But if you're trying to grab a whole bunch of 16S, or particularly metagenomic data, or metatranscriptomes, if you can find them, the sequencing platform can have a huge impact on the results you get, right? This is true with 16S with different variable regions, right? People have done studies where they're like, you know, your diversity profile is like this if you do V1-V3, and like that if you do V3-V5, right? So that's a problem. And I meant to include a citation to this. I'll add that to the website. But basically, building some of these mock communities and then trying different extraction strategies, DNA extraction strategies, and different sequencing strategies, you know, whether it's amplicon-based or shotgun-based, and getting very different characteristic results in the outcome, whether they were able to recapture the community or not. And then, you know, read the methods section. We used QIIME version 2.3.6 point upside-down happy face ampersand, which uses UCLUST version this, which uses, you know, RDP version 15, release 15, whatever. Reproducibility is a massive, massive thing, right? There are solutions like Galaxy for workflows, which really, you know, do encourage reproducibility. Several people have written about reproducibility of bioinformatics analyses in general. I strongly encourage you to look those things up and try to adhere to them as much as you can, as well as making your data public, but you already knew that. This is my favorite example. OK, I'm almost done, almost done. And so this is, so Morgan's presenting the metagenomic stuff more. This is our mouse poop study. I showed you a very similar plot earlier. I apologize for the plots being so small. But we had 21 samples, OK, from 21 mice: old, middle-aged and young. So a very, very small study, right, a pilot. And we said, all right, we can do these comparisons. But wouldn't it be nicer if we had a larger reference data set?
OK, we'll compare them against a larger set to see how they stack up. We're like, well, people have sequenced mouse poop all over the place, right? It's like it's a giant cottage industry. There was even a paper where they sequenced the fecal microbiome of five or six different healthy mouse strains to compare, of course, the mouse strain effects. So that's great. So I asked Morgan and my other postdoc at the time, Conner, to grab these other data sets, put them with our 21 samples, and then do like a PCoA analysis and see whether the old or the young or the middle-aged tended to cluster more closely with the mice whose poop people had sequenced earlier. And guess what? The reference samples all occupy this one little part of the plot, and ours are like, boom, all over the place, right? Comparability? No. Wouldn't it be nice? But it's a bit of a fantasy. I mean, we could have tried harder and gone back to the raw sequences and tried to sort it out, but even then, we'd probably just be asking for trouble. Okay, linkage and resolution. Strain-level diversity is often missed by amplicon and shotgun approaches, and in many cases, strain-level diversity is crucial. Often whether your wastewater treatment plant works or not depends on which strain of Candidatus Accumulibacter phosphatis you have. Does 16S resolve this? No. Does polyphosphate kinase, an alternative marker gene, resolve this? Yes. 16S gives you a certain level of resolution. The intergenic spacer between 16S and 23S gives you more resolution. Long story short, paper from 2009, and they found that if they clustered these ITS, these more variable sequences, right, they differ from each other more, these are all from Prochlorococcus marinus, the fourth organism in the ocean. If you cluster at about 97 to 99%, you find strong associations between taxonomic diversity and nitrate concentrations in the ocean.
If you cluster at 90% identity, you get associations with light; at 95%, temperature. And so depending on your taxonomic resolution, you can actually find associations with different environmental variables. RDP taxonomy is hideous, but we live with it anyway. Well, you'll be hearing enough about operational taxonomic units; there are issues, but again, we use them. And because I need to finish up, functional predictions are a pain. We can talk about this later. This is the Critical Assessment of Functional Annotation. Functional prediction methods for proteins, not doing so well, getting better, but there are some limitations there. So all these hypotheticals, still very difficult to predict. Okay, so in conclusion, we're on to a coffee break and networking session. No, we're not. I'm gonna give you a minute. If you have questions, please feel free to ask them. We're gonna take a deep breath and then we're gonna move into the tutorial.