Can we start with something simple before attacking the entire problem? What about the simple patterns in proteins? You know what some of these patterns are, in the secondary structures in particular. Take this helix. There is a unit here, right? The unit here is one turn, 3.6 amino acids. Well, you might say that it's i to i plus 3 for the hydrogen bonds in a 3₁₀ helix, or i to i plus 4 for the hydrogen bonds in an alpha helix. We're not arguing about the exact units here. Focus on orders of magnitude. Let's say 3.5 or 3.6. Similarly, in a beta sheet you have some sort of i to i plus 2 unit, because here the side chain is pointing up, then it's pointing down. That's two amino acids. And now I'm pointing up again. So after two amino acids, I'm back in roughly the same place. So maybe 3.5 amino acids per repeat here and two amino acids per repeat in the beta sheet. But what is the reason, then, why we don't have exceptionally small helices or exceptionally large helices? Well, for exceptionally small helices, we've already been through that. Remember when I told you that there is an initiation cost of at least forming the first turn. Before that has happened, the helix won't even start to form. That's equally true for the beta strand. But even when I've formed one turn, it still isn't an advantage. It's just that once I've formed one turn, I will start to form the others too. But it's probably not until I have four turns or so that the net delta G starts to be negative. So if the helix doesn't reach at least an intermediate size, there isn't going to be enough stabilization free energy in it. But armed with that, you would say it would be better if this was infinitely large. 5,000 residues, right? Here, mathematical statistics enters. So let's look at the distribution of such units and ask ourselves: how likely is it to have, say, four helical segments after each other, or, even more generally, what is the average length of a helical sequence going to be?
To do that, I'm going to need to introduce some sort of probability. And I'm going to make this easy. Let's say that there's a probability P of being a helix, but this really doesn't have anything to do with helices specifically. So if I'm just picking repeating segments here, let's say that it's three to start with. That means that I should pick whatever this is, say helix, helix, helix. And then I need to pick something that is not helix before and after it. In that particular case, the length is three. But I'm instead going to ask the general question: what is the length R here, and how likely is it to have that length? Second, I'm not really going to care about the case where it's just one. It has to be at least two, otherwise it's not repeating. So what is the probability of having a length equal to R, say three? Well, that should first be one minus P. That's the non-helix one before. And then I should pick P R times, so P raised to the power of R. And then I need to pick one minus P again for the one after it. We don't know what P is yet. It's just a value greater than zero and smaller than one. I would like to know what R is on average. And that is surprisingly easy to calculate, at least the first step. You know this. I should sum over all those weights and weigh them by the value of R. So R multiplied by the weight of that particular length. And this should start at two, otherwise I'm not repeating, and it should go to infinity. And it's R I'm summing over there. And then, to make sure that those weights sum up to unity, I should divide by the sum, again from two to infinity, of just the weight of R. I'm not sure about you, but when I was in upper secondary school I truly hated sums because they're much harder to work with than integrals. And now I love sums for the same reason. Sums are super cool because they're hard for computers. This is not trivial to solve immediately, but there are a couple of tricks.
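Before doing the sums by hand, the definition above can be evaluated numerically. The following is a minimal sketch, not part of the lecture: the function names `segment_weight` and `average_length` and the cutoff `r_max` (which replaces the infinite upper limit) are assumptions made for illustration.

```python
# Sketch: average repeat length computed directly from the weighted sums.
# segment_weight, average_length and r_max are illustrative names, not
# something defined in the lecture.

def segment_weight(r: int, p: float) -> float:
    """Weight of exactly r helical units with one non-helical unit on each side."""
    return (1.0 - p) * p**r * (1.0 - p)


def average_length(p: float, r_min: int = 2, r_max: int = 2000) -> float:
    """Weighted average of r; the infinite sums are truncated at r_max."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    lengths = range(r_min, r_max + 1)
    numerator = sum(r * segment_weight(r, p) for r in lengths)
    denominator = sum(segment_weight(r, p) for r in lengths)
    return numerator / denominator


if __name__ == "__main__":
    for p in (0.3, 0.5, 0.7):
        print(f"p = {p:.1f}: average length = {average_length(p):.3f} units")
```

Because P is smaller than one, the terms shrink geometrically, so the truncation at `r_max` should not matter for moderate values of P. The numbers it prints can be compared with the closed-form result the lecture arrives at.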
What you would normally do is look up in a book what the sum of P raised to the power of R is, summing R from one to infinity. That will only work if P is smaller than one, of course. And I know what that sum is, so I'm going to write that down here for you. So the sum from R equals one to, no, sorry, not infinity, that should be N here, of P raised to the power of R. That was the only sum I could find. That is P multiplied by one minus P raised to the power of N, divided by one minus P. Not entirely obvious, trust me. I took it from a book. Well, with that done, we can start to enter a few things here. Let's write out what those sums really were and see if that makes it clear. In the numerator, I had R multiplied by that expression, right? That is one minus P twice, so I'll square that and put it in front. And then I have R multiplied by P raised to the power of R, summed from two to infinity. And in the denominator, I had exactly the same expression, but no extra R. So that's one minus P squared multiplied by the sum of P raised to the power of R. You can see I can strike out that term, one minus P squared. It's not part of the sum. And that means that this value really is the sum from two to infinity of R multiplied by P raised to the power of R, divided by the sum from two to infinity of P raised to the power of R. I'm going to need to solve both of those, but I will use a small trick here. Do you see that the numerator is almost the derivative of the denominator? If I take the derivative of P raised to the power of R with respect to P, an R is going to fall down in front of it. I'm losing a P, but you know what? I'll save that for later and start by focusing on the denominator and see if we can fix that. That's almost the formula I had up there, right? So how can we solve that? I'll avoid using the equal sign here. If I start this one from zero, sorry, from one, what happens then? Well, at one I'm adding the value P, right? So that would be P raised to one. That's going to be an extra term, or something that gets complicated.
But maybe we should write it so we have a P in front. I want the first term to be P. So maybe I write it as P raised to R minus one, and then I'm adding an extra factor of P there. Do you see what's happening? The product is still P raised to the power of R. I'm well aware that the sum still starts at two, but that P does not depend on R, so I'm actually allowed to put it in front of the sum. That doesn't seem to simplify things, right? But now the exponent here, when R equals two, actually starts at one. So now I can change this a bit. I'll keep the numerator, the sum from two to infinity of R multiplied by P raised to the power of R. And then I'm dividing by P, and now I can start the sum at one, of P raised to the power of R. Neat trick, right? This only works if you're summing to infinity, and that's why infinite sums are so cool. That sum now corresponds to the expression up here if I put N equal to infinity. And that means that it becomes P multiplied by, well, one minus P raised to N, but if P is smaller than one and I raise it to infinity, that term is going to disappear. So I just get P divided by one minus P. And then I have the extra P there. So I now have the sum from two to infinity of R multiplied by P raised to the power of R, divided by P squared divided by one minus P. Neat. I still need to attack the numerator. So how am I going to attack the numerator? Well, this is a slightly longer expression. I'm not going to do it interactively for you here, but I'm going to use the same trick there too. Start by studying this expression, P raised to the power of R. If I take the derivative of that with respect to P, it's going to be R multiplied by P raised to R minus one, right? And if I start with that expression, then I get exactly the same property here. The exponent starts at R minus one, and then I can do the sum from one instead of from two. And then I have that solved too. That's going to be a slightly longer expression.
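For reference, the index shift and the derivative trick described above can be written out as follows. This is one way to carry the bookkeeping through, assuming 0 < p < 1 and the amsmath package. The lecture does not show its own intermediate steps, so the route for the numerator here (differentiating the closed form of the denominator) is an assumption about how to fill them in.

```latex
\begin{align*}
% Denominator: pull one factor p out, then shift the index to start at r = 1
\sum_{r=2}^{\infty} p^{r}
  &= p \sum_{r=2}^{\infty} p^{r-1}
   = p \sum_{r=1}^{\infty} p^{r}
   = p \cdot \frac{p}{1-p}
   = \frac{p^{2}}{1-p} \\[6pt]
% Numerator: r p^r = p * d/dp (p^r), so differentiate the sum above
\sum_{r=2}^{\infty} r\,p^{r}
  &= p\,\frac{d}{dp} \sum_{r=2}^{\infty} p^{r}
   = p\,\frac{d}{dp}\!\left(\frac{p^{2}}{1-p}\right)
   = \frac{p^{2}\,(2-p)}{(1-p)^{2}} \\[6pt]
% Ratio of the two sums: the factor p^2 / (1-p) cancels
\langle r \rangle
  &= \frac{\sum_{r=2}^{\infty} r\,p^{r}}{\sum_{r=2}^{\infty} p^{r}}
   = \frac{2-p}{1-p}
   = 2 + \frac{p}{1-p}
\end{align*}
```

Note that the prefactor linking the numerator to the derivative of the denominator works out to p here, since differentiating P raised to R loses one power of P that has to be put back.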
So there are going to be a few more terms here, but this is just simple bookkeeping. It's not very hard. The only trick is to realize that I can write the numerator as, I think it's going to be, P multiplied by the derivative of the denominator. If I do that and then solve for it, and there are a few extra steps here that I won't show in detail, then I end up with the fact that the average length R is two plus P divided by one minus P. This is strictly true if I'm just grouping things, but there are a bunch of approximations here. First, the fact that I can look at these as turns or repeats or so. What is P? Well, we don't know exactly what P is. It's going to depend a little bit. But if you look at genomes or so, roughly half the segments are alpha helical. And again, we're talking about order of magnitude estimates here. And if P is roughly 0.5, that's 0.5 divided by one minus 0.5, which is one. So if P is roughly 0.5, the average R is going to be roughly two plus one, so three. What does that mean? Well, remember that I said that this was not amino acids, but some sort of repeating segments. So that would mean that for an alpha helix, I would get roughly three multiplied by 3.6, right? 3.6 residues per turn. And for a beta sheet, I might get, say, that factor three multiplied by two. So this might be roughly 11, and this might be roughly six. What this says is that, again, an alpha helix would be in the ballpark of a little bit over 10 amino acids on average, and beta sheets slightly shorter, maybe six. I think the book claims that this is right. My gut feeling says that it's a bit of an underestimate. But again, we've just picked toy numbers here for P. You can check for yourself what happens if P has lower or higher values. But the point is that these toy values give us something that's probably right, clearly within a factor of two, at least.
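To check what happens for lower or higher values of P, the closed-form result can be tabulated. This is a small sketch under the lecture's toy assumptions: the function names and the list of P values are illustrative, and 3.6 and 2 residues per unit are the round numbers used above for the alpha helix and the beta sheet.

```python
# Sketch: closed-form average length, <r> = 2 + p / (1 - p), in repeat units
# and converted to residues. Names and the chosen p values are illustrative.

RESIDUES_PER_UNIT = {"alpha helix": 3.6, "beta sheet": 2.0}


def mean_units(p: float) -> float:
    """Average number of repeat units for a given probability p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return 2.0 + p / (1.0 - p)


def mean_residues(p: float, residues_per_unit: float) -> float:
    """Average length in residues: units times residues per unit."""
    return mean_units(p) * residues_per_unit


if __name__ == "__main__":
    for p in (0.3, 0.4, 0.5, 0.6, 0.7):
        columns = ", ".join(
            f"{name}: {mean_residues(p, per_unit):.1f} residues"
            for name, per_unit in RESIDUES_PER_UNIT.items()
        )
        print(f"p = {p:.1f}: {mean_units(p):.2f} units ({columns})")
```

For P equal to 0.5 this reproduces the three units, 10.8 residues for the helix and 6 residues for the sheet quoted above.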
So it turns out that the reason why you don't see longer secondary structure elements than we do is simply probability.
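The purely probabilistic argument can also be checked with a quick simulation. The sketch below is an illustration under the same toy model and is not from the lecture: each unit is independently helical with probability P, and the function name, chain length and seed are arbitrary assumptions. It collects all helical runs of at least two units and averages their lengths.

```python
# Sketch: Monte Carlo check of the average run length in a random chain.
# Each unit is helical with probability p, independently of its neighbours.
# The function name, n_units and seed are illustrative choices.
import random
from itertools import groupby


def simulated_mean_run(p: float, n_units: int = 200_000, seed: int = 0) -> float:
    """Average length of helical runs with at least two units."""
    rng = random.Random(seed)
    states = [rng.random() < p for _ in range(n_units)]
    # groupby yields consecutive runs of identical states; keep helical ones.
    runs = [sum(1 for _ in group) for is_helix, group in groupby(states) if is_helix]
    # Runs of length one are not "repeating", so they are left out.
    repeats = [length for length in runs if length >= 2]
    return sum(repeats) / len(repeats)


if __name__ == "__main__":
    for p in (0.3, 0.5, 0.7):
        expected = 2.0 + p / (1.0 - p)
        print(f"p = {p:.1f}: simulated {simulated_mean_run(p):.2f}, formula {expected:.2f}")
```

Runs touching the two ends of the chain are not flanked by a non-helical unit on both sides, but for a long chain that edge effect should be negligible.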