Welcome back everyone. If you're watching this on Twitch, thank you for being here and still being here after the hour-long discussion of the assignments. Today we are going to talk a little bit about metabarcoding. This is going to be a very quick lecture again, just like the one that we had last time on R package creation. So, DNA metabarcoding. DNA metabarcoding is a relatively novel technique to identify different species. The idea is that you sequence a short fragment of a gene and then compare this short fragment to a database to identify which species were in your sample. Of course, it doesn't deal with a single species. The idea is that you take, for example, a cup of water from somewhere and then you just sequence all the DNA which is in there. And then, hey, in the end, you get all kinds of different sequences from different species. And based on the number of reads that you get from the different species, you can figure out which animals were in your sample. So it's a way to monitor which types of animals are living in a certain body of water, but it also works in soil. So if you're doing plant science, it allows you to take a little bit of soil, and then, hey, you look into that and you see which little creatures are living in there. The same thing for water. So it allows you to very easily and cheaply monitor thousands of species, or determine thousands of species of bacteria in water or in soil. So why would you want to do DNA barcoding? I actually looked this up on Wikipedia because I'm not an expert on DNA barcoding. I know how to analyze next-generation sequencing data. But there are a lot of advantages to it. One of the things is that it increases your taxonomic resolution. You can imagine that if I would take a water sample and then put a drop of water under a microscope and start looking, then, of course, I can only look at a couple of drops of water.
And every time that I look, I have to classify all these animals that I see under the microscope. So it's a lot of work. By using DNA barcoding, you can just take your sample, extract DNA, sequence the DNA, and then compare back to a database. And it will directly tell you: if you have 1,000 different bacteria in your sample, it will show you that there are 1,000 different ones in there. So you don't have to observe them by eye, which makes it a good method to look at what's in your sample. And one of the nice things is that you can actually relate this to different environmental factors. So if I take, for example, water from a lake, and then I take water from another lake where the environment is slightly different, I can see if there are differences in the animals that are in my samples. This also means that it increases the comparability among regions, because you don't have to look through a microscope, counting, and saying, well, for this sample I determined 300 bacteria, and for another sample I only did 150 because I didn't have much time. The other advantage of DNA barcoding is that you work with DNA, and DNA you can get from early life stages as well. So even though your critter might be very, very small, almost invisible under a microscope, because you're looking at DNA, you can still identify it. And the same thing holds for fragmented specimens. So if you have a specimen where the animals are already dead because it was in the freezer for like five years, then you can still do DNA barcoding. You can also determine cryptic and rare species, because if I would look through a microscope and I would count or classify like 1,000 different bacteria, then, of course, I could miss this one bacterium which is very uncommon. If 1 in 10,000 in my sample is a certain type, a cryptic or rare species, then, of course, the chances of me spotting it through the microscope are relatively small. But by using DNA barcoding, I get some reads from its DNA.
It also very dramatically increases the number of samples which can be processed. You don't have to look through a microscope. It's just taking a scoop of water, extracting DNA, and sending it in for sequencing. And it's non-invasive. So you don't have to catch fish and then release the fish again. And the nice thing, especially if you do water research, is that you don't have to have the fish in your sample, because animals will kind of leave their DNA everywhere, especially in water. So it's relatively non-invasive. Of course, not for the bacteria that are in your sample, because they will die in the process. But all the other animals living in the lake are relatively unaffected by you taking a little scoop of water out. So how does this work? Well, of course, we have the standard DNA extraction, which is not really my specialty because I'm not a lab person. I'm a bioinformatician, so I sit behind a computer. But I've been told that DNA extraction is really, really easy to do. Hey, you just take your sample, you use salt and isopropyl alcohol, and then you extract the DNA from your sample. The next step is PCR amplification. Then you do DNA sequencing. And then you do the data analysis. So the trick here is the PCR amplification. Here we see, for example, 16S rRNA. 16S rRNA is part of the ribosomal RNA of bacteria. It is relatively well conserved in some regions. In other regions, if you would look across hundreds of bacterial species, you would see that there's a lot of variability. So here, for example, we see the variability in the DNA. For example, at V1, around 15% of species have a different sequence. Here, like 6%, or 4% to 5%, have a different sequence. So we just look at how often the nucleotides differ. But what we're looking for when we do DNA barcoding is these regions where every species is more or less the same.
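To make the idea of per-position variability a bit more concrete, here is a minimal sketch, not the way any particular tool does it. It assumes the sequences are already aligned to the same length; the toy alignment and the function names `column_variability` and `conserved_windows` are made up for illustration.

```python
from collections import Counter

def column_variability(aligned):
    """Per column: fraction of sequences that differ from the most common base."""
    n = len(aligned)
    variability = []
    for column in zip(*aligned):
        most_common_count = Counter(column).most_common(1)[0][1]
        variability.append(1 - most_common_count / n)
    return variability

def conserved_windows(variability, width, max_var=0.0):
    """Start positions of windows in which every column is at or below max_var."""
    return [
        start
        for start in range(len(variability) - width + 1)
        if all(v <= max_var for v in variability[start:start + width])
    ]

# Toy alignment (invented): identical flanks, variable stretch in the middle.
toy_alignment = [
    "ACGTACGGATTCAGGCTA",
    "ACGTACCTATTCAGGCTA",
    "ACGTACGAGTTCAGGCTA",
    "ACGTACTGATTCAGGCTA",
]

var = column_variability(toy_alignment)
print([round(v, 2) for v in var])      # variability per alignment column
print(conserved_windows(var, width=6)) # candidate primer-site start positions
```

In this toy case only the three middle columns vary, so the fully conserved windows sit on either side of them. Those flanking windows are the kind of place you would look at for primers, with the variable stretch in between doing the job of telling species apart.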
These highly conserved regions need to be there so that the ribosome can work. So it might be a region where messenger RNA is bound, because that always has to be the same. If that changes, then you can't bind messenger RNA anymore, and that's one of the functions of the ribosomal RNA. So what we do is we look through the database. We check hundreds and hundreds, perhaps even thousands of different bacterial species. We align them together, and then we look for regions where there is almost no variation across hundreds and hundreds of species. And then we design our primers to target these regions. So in this case, these primers are more or less universal primers. When we had the primer design lecture, we talked about universal primers. But here, we're designing primers which not only target one species, we're targeting literally thousands of species. And we want our primers to work for all of these species. So that means that the variability in these regions needs to be very low. And then what we do is we PCR out this piece. Of course, we want to have a variable region inside of this, between the primers. And now when we sequence this piece, then based on the base pairs that we get in this region, we can assign the read to a certain bacterial species. And of course, if the base pairs would be slightly different, we would assign it to a different one. So that's kind of the basis for DNA barcoding. So which genes do we generally look at when we do DNA barcoding? Well, if we look at animals, we usually look at 12S or 16S rRNA, so that's the ribosomal RNA. Or we look at things like cytochrome b, which is a very common protein which is shared across more or less the whole animal kingdom. If we look at plants, then we take different genes. So we take things like matK or rbcL. And these are genes which, again, are very conserved across the whole plant kingdom.
So more or less every plant has a matK gene because it is fundamental to how plants work. For bacteria, we generally look at 16S rRNA as well, just as in animals. But there are other genes like COI and cpn60 that you can look at as well. Fungi, again, just a list of genes. Generally, we look at the 18S ribosomal RNA. Ribosomes are more or less shared across all living organisms. And the same thing goes for protists. So we can see that if we would target the 16S rRNA, we would be able to detect animals and bacteria. And if we take 18S, we can probably spot fungi and protists. So there are different markers that we can use to target different areas of the genome. And of course, that depends on what we want to look at. So after we have done this and we have sequenced it, we need to find out our barcode. This barcode means that we have our sample, we get a read, and we assign the read to a certain animal. But for that, we need the database, right? Because we need to know where the read came from. So there are three main databases which are used in DNA metabarcoding. The first is the Barcode of Life Data System, called BOLD, which is a database which mainly contains records for animals based on the COI genetic marker. We have UNITE, which is the reference database for molecular identification of fungi, and it uses ITS genetic markers. And then we have Diat.barcode, a reference database for diatoms. This database contains two different genetic markers: the 18S ribosomal RNA marker, but also the rbcL marker. So these are just databases where people identified specimens under a microscope, sequenced purified samples, and then put the sequences in, so that when we have our sample done with 18S rRNA, we can just ask the database: where does this read come from?
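As a rough sketch of what "asking the database" means, the snippet below matches reads against a tiny reference table and counts reads per species. Everything in it is an assumption for illustration: the species names and barcode sequences are invented, the function names are hypothetical, and a simple mismatch count on equal-length sequences stands in for the alignment-based similarity searches that real pipelines run against references such as BOLD or UNITE.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def assign_read(read, reference, max_mismatches=1):
    """Name of the closest reference barcode, or None if nothing is close enough."""
    best_name, best_dist = None, max_mismatches + 1
    for name, barcode in reference.items():
        if len(barcode) != len(read):
            continue  # this sketch only compares equal-length sequences
        dist = hamming(read, barcode)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name

# Invented reference entries and reads, purely for illustration.
toy_reference = {
    "species_X": "GGATTC",
    "species_Y": "CTATGA",
    "species_Z": "TACGCA",
}
toy_reads = ["GGATTC", "GGATTC", "CTATGA", "GGATTA", "AAAAAA"]

# Reads per species; the key None collects reads that matched nothing.
counts = Counter(assign_read(read, toy_reference) for read in toy_reads)
print(counts)
```

Reads that match nothing within the allowed number of mismatches end up under `None`, which mirrors the real situation where a species is simply not in the reference database.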
And then the database will tell us, oh, this read comes from bacteria X, and this read comes from bacteria Y, and these reads come from bacteria Z, right? And then we can do the identification of what was in the sample. So DNA metabarcoding is of course slightly different from standard DNA barcoding in the fact that we do many different taxa, right? So instead of looking at a single animal, like is normal in DNA barcoding, we actually look at many different species. Standard DNA barcoding is more or less asking the question: in my sample, does species X occur? While DNA metabarcoding asks the question: which species occur in my sample? It's most often done on environmental DNA, so we take a soil sample, a water sample, or even an air sample can be used, and then, hey, you would just do the barcoding, right? So you take your sample, you extract your DNA, and then normally you would do PCR amplification with primers specific for the mushroom species that you want to find. But in this case, we are not using primers which are unique to this mushroom that we want to identify. We use these universal primers which target many different species of mushrooms and other things, and then for each of the DNA reads that we get, we want to assign this read to one of the species, right? And then the number of reads that we get is more or less a measurement of how abundant a certain species was. So in this case, each of the colors shows up twice, but hey, you can imagine that if you find a lot of yellow reads, then you would say, no, the yellow mushroom is more commonly found in my sample than the one that is blue. So why do we do metabarcoding? Well, we do it for, for example, biodiversity monitoring, but also in paleontology. So if you are interested in things like ancient ecosystems, right, and hey, you have, for example, one of these fluffy elephants that used to live before, what are they called? Fluffy elephants, they are, come on, people.
It was very early when I woke up. Fluffy elephant, mammoth, that's it, 10 points to you, Misha. For example, if you have a mammoth which was frozen in ice, then generally you want to know what the mammoth ate and what the environment was like, right? So you would take this frozen mammoth, you would take a sample from, for example, the stomach, and then you would do DNA metabarcoding to see which things are in the stomach. You can also do it with ancient poop from animals. So in paleontology this is used a lot to get an idea of the environment that the animals lived in. We also use DNA metabarcoding when we look at plant and pollinator interactions, right? If we would take a flower and sample it, then of course most of the DNA would be from the flower, but when a bee comes there, it also leaves a little bit of DNA behind on the flower. And the same thing holds for wasps and flies. So if we look at a certain flower and sequence it using DNA metabarcoding, we could figure out which pollinators are visiting this flower. And if we would do that for all kinds of different flowers, we could build an overview and we could say, like, bees generally tend to pollinate tulips, while, for example, flies generally tend to pollinate other things. We can also use it for diet analysis, and in food safety it is used a lot as well. So, if you ever want to work in monitoring food safety in the future, there, of course, hey, you take a sample from the conveyor belt that is making the food, and then you want to know which bacterial species are in there and whether they are harmful or harmless. And you also want to know that there are not too many bacteria in there. So in food safety it's used a lot. And diet analysis is the same thing, right?
If we have humans coming into the hospital, we can just take a little bit of poop and then look into the poop to see what more or less the bacterial composition is for human one and for human two and for human three. And then we can do inference on that. So, if all of the humans that come in sick with a certain disease have a certain bacterium in their excrement, then we would say, okay, there might be a relationship between the two. So, these are the five fields where DNA metabarcoding is used a lot. But especially for biodiversity monitoring, it's really, really useful. So, there are some shortcomings, right? One of the things that DNA metabarcoding doesn't deal with very well is that there's a whole bunch of physical parameters. If we take a scoop of water out of a pond, then of course DNA in aquatic systems moves around, right? So, if you have flowing water in a brook or in a river, then if a fish swims just upstream of where you take your sample, you would detect the DNA of that fish. But if the fish lives slightly downstream of where you're taking your sample, then it would not be detected. So, it's very difficult to take a very uniform sample, especially in things like the ocean or in rivers. In lakes, it's not that big of an issue, because in a lake the water doesn't move that much. But there are a lot of physical things that influence the DNA concentration. One of the other physical parameters is, for example, the amount of environmental radiation. If you take your sample from water which is relatively shallow, and there's a bunch of sunshine, then of course the radiation from the sun will break down the DNA in the sample relatively quickly. So, you can't really compare this to somewhere where you took your sample when it was dark and there was not a lot of radiation around. There is some technological bias in there as well.
So, this depends on the primers that you use, and we will be talking about that more. If we do a PCR, then of course the primers might not exactly fit the sequence, right? We had this idea in the beginning that we design our primers on regions which are always the same. If for some species this region is not the same, then all of a sudden we can't detect this species anymore because the primers won't bind there. So, we're saying that, oh, this animal is never found in any of our samples, but this is not due to the fact that the animal is not there. It's just due to the fact that this animal might not have the exact sequence that we expect it to have. So, that's one of these shortcomings as well. One of the main things currently is the lack of standardization. DNA metabarcoding is a relatively novel technique, so there are not a lot of standards yet. Like, there's no standard for how much sample you should take, how many reads you should generate when doing next-generation sequencing, or which databases you should use. And of course there's also the conventional versus the barcode-based identification. The issue there is that in a barcode-based identification, we're just looking at sequence, and only a very small sequence of the whole animal. So, you might have two taxa, two bacteria, which have the exact same sequence in the region that you're looking at, but in another gene they might be completely different. They might also look completely different, but you're not going to see that using DNA metabarcoding. And one of the other shortcomings is that it's really hard to estimate things like richness and diversity. You get an idea of how many different bacteria, for example, are in your sample, but because of these technological biases and the physical parameters, there will be a lot of variation in there, right?
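To pin down what "richness" and "diversity" could mean in terms of read counts, here is a small sketch. The lecture does not name a specific diversity measure; the Shannon index used here is one common choice and is my assumption, as are the function names and the made-up read counts.

```python
import math

def richness(read_counts):
    """Number of taxa with at least one read."""
    return sum(1 for count in read_counts.values() if count > 0)

def shannon_index(read_counts):
    """Shannon diversity H = -sum(p * ln p) over the taxa that were observed."""
    total = sum(read_counts.values())
    return -sum(
        (count / total) * math.log(count / total)
        for count in read_counts.values()
        if count > 0
    )

# Invented read counts: the same two taxa, but very different read proportions,
# for example because one taxon amplified poorly or its DNA degraded faster.
even_sample = {"taxon_A": 50, "taxon_B": 50}
skewed_sample = {"taxon_A": 99, "taxon_B": 1}

for name, sample in (("even", even_sample), ("skewed", skewed_sample)):
    print(name, richness(sample), round(shannon_index(sample), 3))
```

Both toy samples have the same richness, but the skewed one gets a much lower diversity value. If that skew comes from primer bias or DNA degradation rather than from the actual community, the diversity estimate is off even though nothing changed in the lake.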
So, you can't directly compare sample one to sample two, unless both samples were taken at the exact same place at slightly different times. But then again, if you take samples in the winter, then these samples will be different from the samples that you take in the summer. And that has nothing to do with the diversity. That just has to do with the fact that DNA might degrade quicker in summer. So, it is a very good method to get an initial overview of how diverse a certain lake is, or how diverse a certain poop that you find is, or how diverse some soil that you take is. But it is really hard to compare these with each other. Of course, there are some other issues as well. One of the things is that species can and will exchange DNA, especially when we're talking about bacteria. Bacterial conjugation, where bacterium A gives part of its genome to another bacterium, happens quite frequently. So, when your target gene is a gene which is frequently exchanged, different taxa will seem very close together while they are actually not. If you have an E. coli bacterium which actually exchanged its 16S ribosomal RNA with Bacillus subtilis, then you would misclassify this E. coli. You would say, oh, it's a Bacillus subtilis, because it has this sequence. The method is also very sensitive to the gene that you choose. Of course, we generally choose ribosomal RNA genes because they are very common, they occur throughout the whole kingdom, and they don't change that much. But still, if you choose another gene, your results might be different, right? And that's one of these issues, because there are very different genes that we can look at. Taking the same sample, looking at gene A might give you a slightly different result than when you take another gene to look at. So, in the end, you have to do a couple of these genes to get an idea of what the truth is, because every gene will give you a slightly different answer.
One of the things that I don't like about the method, or not so much like about the method, because I think the method is really good for what it's aimed at, is the fact that DNA sequences are generally not that conserved, right? The thing that is conserved through evolution is the protein sequence, because the protein does something. So, the protein is under selection. But the DNA is not that much under selection, or has less selective pressure, because there's the wobble base: the third letter of each codon is more or less free to change. So, there are problems there with the fact that DNA sequences can change much quicker and much more than protein sequences can. And, of course, primers can and will fail, since we need a variable region to find differences, and we're trying our best to target our primers to regions which are conserved across thousands of species. But, of course, what we don't know we can't account for. So, it could be that we can never detect certain species at all, just because they are slightly different in these regions which we call more or less constant. Primers themselves are also biased. So, if I have primers, and I have this region which is relatively well conserved, but some animals have a single SNP, then this single SNP will make the primers less effective. So, in the end, when we look at a quantification, and we want to say there's 100 of bacteria A, and then there's 25 of bacteria B, and there's 10,000 of bacteria C, then because one of these might have a SNP, the primer efficiency is much lower for it, right? Because when the primer matches perfectly, DNA is amplified with 2 to the power of X, where X is the number of cycles. So, every cycle you double the DNA. But if there are one or two SNPs in the animal and your primer doesn't bind perfectly, then the efficiency is not 2 to the power of X. It can be like 1.7 to the power of X or 1.3 to the power of X.
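To see how much a slightly worse primer fit matters, here is a small worked calculation. The per-cycle factors 2, 1.7 and 1.3 are the ones mentioned above; the cycle count of 30 is my assumption, and the function name is made up for illustration.

```python
def fold_amplification(efficiency, cycles):
    """Copies made from a single template after `cycles` rounds of PCR."""
    return efficiency ** cycles

CYCLES = 30  # assumed number of PCR cycles, not a value from the lecture

perfect = fold_amplification(2.0, CYCLES)
for efficiency in (2.0, 1.7, 1.3):
    copies = fold_amplification(efficiency, CYCLES)
    print(
        f"efficiency {efficiency}: {copies:.3g} copies, "
        f"{copies / perfect:.2e} of the perfect-match yield"
    )
```

With these assumed numbers, a template that amplifies at 1.7 per cycle ends up more than a hundredfold under-represented compared to a perfectly matching one, even if both started from a single copy. At 1.3 per cycle the gap is far larger still.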
And so then you think that there are not that many of bacteria A, but this is not due to the fact that there are fewer in your sample to begin with. It just means that the primer doesn't bind too well. And in the end, when I looked into the method, and I read up on it last year because one of the students asked about it, the method is not that novel. When you look at Google, the search results for DNA metabarcoding start at around 2016, 2017. But it's actually just a buzzword, or kind of a label on a 30-year-old established technique, because we have already been doing 16S rRNA sequencing to identify bacteria since the 1990s. So the idea is not novel. The thing that makes it novel is that instead of doing it for a single bacterial species to do the identification, we now just do a shotgun approach and say, no, we're just going to sequence everything. So instead of doing a bacterial culture, making the bacteria pure, and then sequencing the 16S rRNA to identify exactly which bacterium it is, we're just kind of using a shotgun approach: shooting at all of the things that are in our sample, sequencing everything, and then trying to figure it out later and kind of reconstructing it. But in a way, it's just a new label on a 30-year-old technique, which is, of course, perfectly fine to do, but it seems a little bit buzzwordy in a way. Good. So that was actually what I wanted to tell you guys about DNA metabarcoding. Up next is literature management. So I'm just going to stop the recording now for the people watching on YouTube. If you're watching this on YouTube, then the next part of the lecture will be literature management. So see you in probably a day or something. It takes a little bit of time to cut the videos and upload them. So I will see you soon.