So once we know a sequence, in principle we know everything, but as you now know, getting from the DNA to the RNA to the protein is not entirely trivial. If we focus on bacteria first, it's a reasonably tractable problem. So in bacteria, I have a so-called start codon, which is usually ATG, or AUG if you write it as RNA. And then you have lots of triplets, right? I'm not going to draw more. And then at the end you have, for instance, TAA, or UAA in RNA. There are a couple of different stop codons. The start codon is where we're going to start expressing a protein, and the stop codon is where we're going to stop expressing it. In humans, it's much more complicated, because eukaryotes in general have this challenge with introns. And that means that if there's an ATG in the middle of an intron, that's not really the start of a gene. And they're going to need to find ways to splice things together. Again, that is far beyond this class, but you should know that basically what we have to do is design computer programs to predict the locations of genes. Out of all the DNA in our body, again, roughly 1% is actually expressed DNA.
But whether it's a bacterium or a human, assuming that I can get this sequence, there are two things I want to ask. First, how does evolution happen here? How do mutations happen? And second, should I work with this, the DNA, or should I work with the protein? And eventually, how does this get to the protein?
Well, there are a couple of different types of mutations. If I have a triplet here in the middle, let's say that that's CGT. Obviously, you know the genetic code by heart. Just kidding, I don't. That is arginine. That one I actually do happen to know. Arginine is special because I can change the third base there: I can change that to CGC or CGA or CGG. All of them will still code for arginine. This is called a silent mutation, not very much fun.
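To make the bacterial case concrete, here is a minimal Python sketch of reading from a start codon to the first in-frame stop codon, plus a check for silent mutations. It is an illustration under simplifying assumptions, not a real gene finder: it only scans the forward strand, it assumes the input contains only A, C, G and T, and the names `find_orfs` and `is_silent` are made up for this example. The codon table is the standard genetic code, and the arginine codons CGT, CGC, CGA and CGG serve as the silent-mutation example.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_orfs(dna):
    """Return (start, end, protein) for every ATG followed by an in-frame stop.

    Simplified: forward strand only, no introns, input must be A/C/G/T only.
    'end' is the index just past the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        protein = []
        # Walk triplet by triplet from the start codon.
        for pos in range(start, len(dna) - 2, 3):
            amino_acid = CODON_TABLE[dna[pos:pos + 3]]
            if amino_acid == "*":
                orfs.append((start, pos + 3, "".join(protein)))
                break
            protein.append(amino_acid)
    return orfs


def is_silent(codon_before, codon_after):
    """True if the codon changed but the encoded amino acid did not."""
    before, after = codon_before.upper(), codon_after.upper()
    return before != after and CODON_TABLE[before] == CODON_TABLE[after]


if __name__ == "__main__":
    print(find_orfs("CCATGCGTAAATAACC"))  # [(2, 14, 'MRK')]
    print(is_silent("CGT", "CGC"))        # True: both code for arginine (R)
    print(is_silent("CGT", "TGT"))        # False: arginine becomes cysteine
```

In a eukaryote this naive scan would not be enough, because an ATG inside an intron would be reported as if it started a gene, which is why gene prediction needs more elaborate programs.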
A silent mutation is still a mutation, though, and if I want to look at mutation rates, how quickly genes evolve, this is very useful, because it is valuable information. And in that case, I definitely want to look at the DNA sequence. On the other hand, if I'm interested in my proteins, it's completely pointless, because if I'm interested in protein structure, it's much better to say, look, this is just arginine. Forget about all those nasty bases. I should focus on the amino acid letters. We have collected an insane number of sequences like this. The place in the world where everybody in science deposits these is called GenBank. GenBank is actually a collaboration, and there are similar efforts in Europe and Asia. These numbers are very quickly outdated, but it's worth mentioning them anyway: GenBank today contains roughly 6.3 trillion base pairs in 1.7 billion sequences, from roughly half a million different species. That's an insane amount of information. I don't think that you understand how much this is until you really start to think about it. Compare that to the human genome, which is 3 billion base pairs in total.
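To get a feeling for that scale, a few lines of Python arithmetic are enough. The numbers are the rough figures quoted above, which go out of date quickly, and the variable names are only for this illustration.

```python
# Rough figures quoted in the lecture; they go out of date quickly.
GENBANK_BASE_PAIRS = 6_300_000_000_000   # about 6.3 trillion
GENBANK_SEQUENCES = 1_700_000_000        # about 1.7 billion
HUMAN_GENOME_BASE_PAIRS = 3_000_000_000  # about 3 billion

# How many human genomes' worth of bases does GenBank hold?
genome_equivalents = GENBANK_BASE_PAIRS // HUMAN_GENOME_BASE_PAIRS

# Average length of a deposited sequence, in base pairs.
average_length = GENBANK_BASE_PAIRS / GENBANK_SEQUENCES

if __name__ == "__main__":
    print(genome_equivalents)     # 2100
    print(round(average_length))  # 3706
```

So with these figures GenBank holds on the order of two thousand human genomes' worth of bases, in sequences that average a few thousand base pairs each.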