So the point of this class is not to teach a number of bioinformatics algorithms by heart. There are much better bioinformatics classes for that, which I would encourage you to take. It's a fascinating area. But I want to drive home the message of how bioinformatics is evolving and what type of information we're getting and are able to use. So based on what we just did using the simple sequence alignments, if we would like to improve even further, what could we do? Well, what we did with those substitution matrices is that rather than merely guessing by physics, we let ourselves be guided by nature. And we let nature tell us how common it is that we replace, say, a tyrosine for a tryptophan. But there is one stupid limitation with that. In that substitution matrix, I take all these positions and mix them together. That was reasonable in the early days of bioinformatics when we didn't have that many sequences. We have 207 million now, trust me. We have enough. For myoglobin, we probably have 5 or 10,000, if not more. And in that way, it's really stupid. Why on earth do I use this column when I calculate the probability of having lysine? Those are not necessarily the same probabilities as I have in this column for mutating lysine, right? So why on earth are we mixing them together? I have enough information in each column. Don't mix. So if I just use the simple information to line up these proteins, which I can do with the substitution matrix, then I can create a position-specific matrix. So these positions now sit in the so-called multiple sequence alignment that I did in the last slide. That's an important concept: I don't align just two sequences, but multiple sequences. From a multiple sequence alignment, I have positions 1, 2, 3, 4, 5, 6, etc. And in each position, I can now calculate how likely it is to have, say, alanine, cysteine, all the way down to tryptophan.
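To make the per-column idea concrete, here is a minimal sketch of counting amino-acid frequencies column by column in a toy multiple sequence alignment. The sequences themselves are made up for illustration; a real alignment of, say, thousands of myoglobins would work the same way.

```python
from collections import Counter

# Toy multiple-sequence alignment (hypothetical sequences; '-' marks a gap).
msa = [
    "MKLVH",
    "MKIVH",
    "MRLVH",
    "MKLIH",
]

def column_frequencies(msa):
    """For each alignment column, turn amino-acid counts into frequencies."""
    profile = []
    for i in range(len(msa[0])):
        column = [seq[i] for seq in msa if seq[i] != "-"]
        counts = Counter(column)
        total = len(column)
        profile.append({aa: c / total for aa, c in counts.items()})
    return profile

profile = column_frequencies(msa)
print(profile[1])  # position 2: K in 3 of 4 sequences, R in 1 of 4
```

Each column now has its own probability distribution over the twenty amino acids, instead of one pooled distribution for the whole protein.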
What this now enables me to ask is, in a particular position here, how likely is it for alanine to occur there? And if I only ever see lysine in a position, then I would probably want to say that, look, with 99% probability it should be lysine here. Anything else should have a really bad score. But I probably don't want to say that it's minus infinity, because it should be possible to put something else there. It should just be very bad. And instead of scoring things against a normal substitution matrix, I can now take a new sequence and score it against this position-specific substitution matrix, or scoring matrix. So we might call this a matrix, but you can think of this as a profile. So if you're the police and I have murdered somebody, then you might think that they're looking for a middle-aged man, exceptionally fit and good-looking, of course, who has glasses and fair skin and was wearing a sweater. Not particularly muscular, sadly. So the point is that this is kind of an average description of something, right? And if you now get a new suspect, you can say, well, suspect one fits that profile overall. Suspect two doesn't fit it, rather than just comparing somebody by body weight, which would be the plain substitution matrix. The idea here is that I get something that is not just a score based on the individual amino acids, but on the entire multiple sequence alignment for, say, myoglobin. So if I get a new sequence and I want to check, could this be a myoglobin? I don't just want to compare this against one myoglobin. I want to compare this against the entire pattern, the suspect profile, of all other myoglobins. And that's exactly what I do with a position-specific scoring matrix. Do you see what I did again? Apart from killing somebody, I introduced more biological information, because I now also care about the position in the sequence in the actual protein.
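A sketch of that idea: build a log-odds position-specific scoring matrix from a toy alignment and score new sequences against it. The pseudocount of 1 and the uniform background frequency of 1/20 are assumptions for illustration, not values from the lecture; the pseudocount is exactly what keeps a never-seen amino acid from scoring minus infinity.

```python
import math
from collections import Counter

# Toy multiple-sequence alignment (hypothetical sequences).
msa = ["MKLVH", "MKIVH", "MRLVH", "MKLIH"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pssm(msa, pseudocount=1.0, background=1.0 / 20):
    """Build a log-odds position-specific scoring matrix.

    The pseudocount keeps unseen amino acids from scoring minus
    infinity: they get a very bad, but finite, score instead."""
    n_seqs = len(msa)
    matrix = []
    for i in range(len(msa[0])):
        counts = Counter(seq[i] for seq in msa)
        denom = n_seqs + pseudocount * len(AMINO_ACIDS)
        matrix.append({
            aa: math.log(((counts[aa] + pseudocount) / denom) / background)
            for aa in AMINO_ACIDS
        })
    return matrix

def score(seq, matrix):
    # Sum the position-specific log-odds score of each residue.
    return sum(col[aa] for aa, col in zip(seq, matrix))

m = pssm(msa)
print(score("MKLVH", m))  # fits the profile: positive score
print(score("WWWWW", m))  # unlike the profile: negative, but finite, score
```

Scoring a candidate sequence against this matrix is the "does the suspect fit the profile?" question: the sum rewards residues common at each position and penalizes, without forbidding, everything else.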
This happened in the early 1990s or so, and at that time, all prediction methods suddenly took a 10% jump in accuracy. The take-home message here is that it's very rare that we make huge advances based on a new algorithm. It's typically that we implement an algorithm that allows us to use more biological information. The more biological information we use, the more physics we implicitly use, because again, the stability in these different positions comes from the physics you've learned, and the better predictions we will make. So while the initial improvement was in the ballpark of 10%, it got far better once people had optimized these methods. This is a small bacterium that we worked on when I was a postdoc at Stanford many years ago, just to give you a gut feeling of the improvements. It has in the ballpark of 5,000 genes. If we're only trying to detect genes based on sequence alone, we could assign a little bit over 1,000. Not bad, but again, it's 20%. Once you start using the profiles I just showed you, this doubled, so roughly 2,000, almost half the proteins could be detected. And then, using even better methods based on so-called hidden Markov models that I won't have time to explain to you, but which are kind of profiles on steroids, we jump to, say, a bit over 3,000. This is 15-year-old data. I bet if we really did this today, we would probably be at 4,900. Bioinformatics has become so exceptionally good that it's hard to improve on some of these basic algorithms.