In this lesson, we'll be learning about another way to determine the length of the keyword that was used to encrypt a ciphertext with a Vigenère cipher. This method was developed by Elizebeth and William Friedman, who worked together at Riverbank Laboratories in Geneva, Illinois, which is actually where they met, until they left to go work for the War Department in 1921. Now, William Friedman gets a lot of the credit for the work that's published under the Friedman name, but there's actually a really interesting documentary called The Codebreaker, which you can link to here from the slides, that talks all about the work that Elizebeth did. She helped co-develop the statistical method that we'll be looking at today, the index of coincidence, but she also went on to do a lot of unique and independent work with another group in the codebreaking department at the War Department. It's really cool, so I would definitely encourage you to go and learn more by watching the documentary linked here. Now, the method that they developed, the index of coincidence, was instrumental in determining whether a particular ciphertext message was encrypted using a monoalphabetic cipher, like the Caesar or the affine, or a polyalphabetic cipher, like the Vigenère or the autokey. So let's dig in and see how it works. We know that we can identify pretty easily the difference between a monoalphabetic and a polyalphabetic cipher just by looking at their letter distributions. We saw this as we were creating them. Caesar ciphers and other monoalphabetic ciphers follow the kind of frequency distribution we see on the left. Some letters appear quite a bit, like the letters E or A, and we can see those spikes, and other letters appear hardly at all, like maybe X, Y, or Z. In the ciphertext, those spikes have just been relocated to other letters, maybe an L or a Q or an R, but it really is kind of feast or famine for a monoalphabetic frequency distribution.
You either have characters with a high frequency or a very low one, and only a few in between, so it's very spiky. But we saw that when we use a polyalphabetic cipher, the goal is to smooth that out, and in fact the ideal would be to make it as smooth as possible, every letter appearing about 3.8% of the time (1/26). And it turns out that when you change the distribution like that, you change a lot of the probability characteristics of the distribution as well. Now, that might not seem like an important detail, but we're going to see it's very helpful: with a simple calculation, a single number, a statistic, we can determine, without even looking at a bar chart, whether a particular ciphertext came from a monoalphabetic or a polyalphabetic cipher. And here's that statistic we're going to compute. It's called the index of coincidence. Now, the Friedmans often used the Greek letter kappa when computing this value, so it's often referred to in the literature as the kappa test, but the definition is that we're going to calculate a number which represents the probability that two letters chosen at random from the same ciphertext are the same letter. So the probability of choosing two letters at random and having those be two a's, or two b's, or two c's, and so on, all the way down to two z's. So we can look at this equation here, capital I for the index of coincidence. n sub a over n means the number of a's in the message divided by n, the length of the message. And when you pick the second letter, there's one less letter to choose from, and one less a, because you picked an a on the first draw, so that's why it's n sub a minus 1 over n minus 1. So that's the probability of choosing two a's. But you could also choose two b's, so it's a similar calculation: count the number of b's divided by the length of the message, and multiply that by the number of b's minus 1 divided by the length of the message minus 1, to get the probability of choosing two b's in a row.
And we could continue doing this down the line: the probability of two c's, all the way down to the probability of two z's. And then we add them all up; that's the way that probability works. Since any one of those 26 events counts toward our definition of choosing two of the same letter, the probability of any one of them happening is the sum of all the individual probabilities. So we add them up. Now, for the index of coincidence of, say, the English language, or really any monoalphabetic ciphertext, we'd use the frequency of each letter. In a very long message, we can assume that n sub a over n and n sub a minus 1 over n minus 1 are relatively the same size; that off-by-one difference won't change the fraction much. So to simplify the calculation, we'll often just take the proportion of each letter times itself, squaring it. We're fudging the calculation a little bit here, but we'll see it makes it a heck of a lot easier to do. So the probability of choosing two a's for any monoalphabetic distribution would be 0.082 squared, the probability of two b's is 0.015 squared, and so on, all the way down to the probability of two z's. And if we add those up for either English or a monoalphabetic ciphertext, we get 0.0656. Now, that might vary a little bit depending on where you got these probabilities or proportions from. As we've discussed before, there is no one gold standard for what proportion of an English message is comprised of a's or e's or whatever; the published figures tend to fluctuate. But we'll see that this index of coincidence is about 0.0656. And again, it represents the probability that if you were to go into a monoalphabetic ciphertext and pick two letters at random, those letters would be the same. It's about a 6.5% chance.
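To make both versions of the calculation concrete, here's a minimal Python sketch (the function name and the exact frequency table are my own choices, not from the lesson; as noted, published English frequencies vary a bit by source):

```python
from collections import Counter

# Approximate English letter proportions (these fluctuate between sources)
ENGLISH_FREQ = {
    'A': 0.082, 'B': 0.015, 'C': 0.028, 'D': 0.043, 'E': 0.127, 'F': 0.022,
    'G': 0.020, 'H': 0.061, 'I': 0.070, 'J': 0.002, 'K': 0.008, 'L': 0.040,
    'M': 0.024, 'N': 0.067, 'O': 0.075, 'P': 0.019, 'Q': 0.001, 'R': 0.060,
    'S': 0.063, 'T': 0.091, 'U': 0.028, 'V': 0.010, 'W': 0.024, 'X': 0.002,
    'Y': 0.020, 'Z': 0.001,
}

def index_of_coincidence(text):
    """Exact IC: sum over each letter of n_a(n_a - 1) / (n(n - 1))."""
    letters = [c for c in text.upper() if c.isalpha()]
    n = len(letters)
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

# Simplified (fudged) version: square each letter's proportion and sum.
ic_english = sum(p ** 2 for p in ENGLISH_FREQ.values())  # roughly 0.0656
```

The exact function is what you'd run on an actual ciphertext; the squared-proportion sum is the shortcut the lesson describes for a reference distribution.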
But if you did this for a polyalphabetic cipher, a perfect polyalphabetic cipher, where every character appears with proportion 1/26 in the ciphertext, you can see that when we do the same calculation, the probability of choosing two a's is 1 over 26 squared, the probability of choosing two b's is 1 over 26 squared, and so on, all the way down to the probability of choosing two z's being 1 over 26 squared. If you do that calculation, you get an index of coincidence of about 0.038 for a perfectly uniform polyalphabetic ciphertext, meaning there's about a 3.8% chance that if you pick two letters at random from a polyalphabetic ciphertext, they're going to be the same letter. So again, about 0.0656 or 0.038, depending on whether it's a monoalphabetic or a polyalphabetic ciphertext message. Now let's see how that's helpful for us. Let's say we have a ciphertext message, and we don't know if it came from a monoalphabetic cipher like the Caesar or affine, or a polyalphabetic cipher like the Vigenère or autokey. Figuring that out would be a really important first step before we analyze the message any further, since we'd use a different technique depending on the nature of our ciphertext. So let's go ahead and compute the frequencies of each letter, the same way we computed them when we were making a bar chart. But now we can just compute the index of coincidence by squaring the frequency of A, adding to that the frequency of B squared, adding to that the frequency of C squared, all the way down the line to the frequency of Z squared. And when we sum up those squared frequencies, we get 0.04171. So not 0.0656 exactly, and not 0.038 exactly, but that value is certainly much closer to 0.038.
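That "pick whichever reference value is closer" decision is easy to express in code. A sketch of the heuristic (the function name and structure are mine; the two reference constants come straight from the lesson):

```python
MONO_IC = 0.0656   # English / monoalphabetic reference value
POLY_IC = 1 / 26   # about 0.0385 for a perfectly uniform distribution

def classify_by_ic(ic):
    """Guess the cipher family by whichever reference IC is closer (a heuristic)."""
    if abs(ic - MONO_IC) < abs(ic - POLY_IC):
        return "monoalphabetic"
    return "polyalphabetic"

family = classify_by_ic(0.04171)  # the example value computed above
```

For the 0.04171 example above, the polyalphabetic reference is the closer one, which matches the conclusion in the lesson.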
So it's most likely that this ciphertext came from a polyalphabetic cipher, since the probability of picking two of the same letter at random from this ciphertext is much closer to the probability of picking two of the same letter at random from that perfectly uniform distribution. We go with the closer one. Let's look at another ciphertext that we've generated here, and we'll do the same thing: compute the frequency of each letter in the message, those are the proportions of A, B, C, and so on, and do the same calculation to get the index of coincidence, the proportion of A squared plus the proportion of B squared and so on. And we see we get 0.064, which is much closer to the 0.0656 that we saw earlier for an English message or any monoalphabetic cipher. So monoalphabetic would be the most likely candidate here. Knowing that, maybe we try a brute-force attack over all 312 keys of the affine cipher, or maybe we try a known-plaintext attack to set up some congruence equations; either way, we now have a much better idea of how to inform our next steps to crack this message. Now, we can do one more thing here with the power of our computers. Remember, the Friedmans didn't have modern-day computing. This was a very manual process, so they might only do that calculation when they needed to. With the real power of computers, though, we can try a lot of different things to help us deduce the length of our keyword. The Friedmans were just trying to figure out whether a cipher was monoalphabetic or polyalphabetic; they weren't even starting to think about the length of the keyword. But let's see how this one simple number can also get us a little more information with the power of modern computing. We've got the ciphertext here, whose index of coincidence is 0.04344, so we might confidently consider this to be a polyalphabetic cipher, maybe even Vigenère.
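As a quick aside, that 312 figure comes from counting the valid affine keys: the multiplier must be coprime to 26 for the cipher to be invertible, which leaves 12 choices, times 26 possible shifts. A one-liner check:

```python
from math import gcd

# Valid affine multipliers must be coprime to 26 (so the cipher is invertible)
a_values = [a for a in range(1, 26) if gcd(a, 26) == 1]
affine_key_count = len(a_values) * 26  # 12 multipliers x 26 shifts = 312
```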
Let's try breaking that ciphertext into groups and looking at the index of coincidence of the individual groups of characters. We'll start with two groups. Group one starts with the character z and takes every other letter, so z, n, g, m, t, and so on. Group two fills in the blanks, starting at the character v and then going every other letter. Those individual groups have indices of coincidence that also imply they are polyalphabetic. Now, remember the whole goal: if this were Vigenère, and we split it up into the correct number of groups, which corresponds to the keyword length, those individual groups should each look like a Caesar distribution. And a Caesar distribution would have an index of coincidence close to 0.065. Well, these group indices of coincidence are clearly not close to 0.065, which means they're likely not Caesar distributions, which means the keyword is probably not of length two. So maybe we try this again with three groups, guessing that the keyword of this Vigenère cipher is three letters long. Take group one, group two, and group three, calculate their indices of coincidence, and we can see they all look pretty polyalphabetic too. Their average is about 0.04383, so I don't think this is a three-letter Vigenère keyword either. If it were, each of these groups would have that monoalphabetic index of coincidence. So we keep going: calculate the IC for each group, then calculate the average. We've already done that for keywords of length two and three. For a keyword of length four, the groups have an average IC of about 0.044, still not getting up to that 0.065. Until we get to length five. If we assume this had a keyword of length five and split it up into those five groups, we can see the average jumps up to 0.06574, and it goes back down when we increase the assumed keyword length to six and seven and so on.
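This group-splitting procedure is easy to automate. Here's a self-contained sketch (function names are mine, not from the lesson) that splits a ciphertext into k interleaved groups, every k-th letter, and averages their ICs; sweeping k and watching for the average to jump toward 0.065 is exactly the search described above:

```python
from collections import Counter

def index_of_coincidence(text):
    """Exact IC: sum over each letter of n_a(n_a - 1) / (n(n - 1))."""
    letters = [c for c in text.upper() if c.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0  # IC is undefined for fewer than two letters
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def avg_group_ic(ciphertext, k):
    """Split the text into k interleaved groups (every k-th letter) and average their ICs."""
    letters = [c for c in ciphertext.upper() if c.isalpha()]
    groups = ["".join(letters[i::k]) for i in range(k)]
    return sum(index_of_coincidence(g) for g in groups) / k

# Sweep candidate keyword lengths; the k whose average jumps toward
# the monoalphabetic value (~0.065) is the likely keyword length:
# for k in range(1, 11):
#     print(k, avg_group_ic(ciphertext, k))
```

Note that multiples of the true keyword length will also produce monoalphabetic-looking groups, so it's conventional to take the smallest k that spikes.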
So it looks like we struck on something here when we assumed a keyword of length five. We found that the average IC across the groups is what we'd expect for a monoalphabetic ciphertext, which implies that five is probably the length of the key, since it created subgroups of characters that seem to follow a Caesar distribution, or at least a monoalphabetic distribution. So with the modern technology of computing, we can just try a bunch of different subgroupings of our ciphertext message, compute their individual indices of coincidence, and average them out until we find one that seems to imply we got the keyword length right. I can only imagine what the Friedmans would have done if they'd had this power at their disposal back in the early 1900s, but it certainly would have advanced their work far beyond where they were able to get back then. So we're standing on the shoulders of giants here, as we use our computers to help advance the work of the index of coincidence.