In this lesson we're going to learn about the Kasiski test, one method we'll use to break the Vigenère cipher. The test is named after Friedrich Wilhelm Kasiski. You can see he was born here and lived mostly during the 1800s. His work was first published in a 95-page book on cryptography that he wrote, shown here in the original German; translated into English, the title means "Secret Writing and the Art of Deciphering." We've mentioned before that Charles Babbage was actually the first person to break the Vigenère cipher, but Kasiski was the first person to have a general attack that would work for really any message we were to intercept. This method relies on analyzing the gaps between repeated fragments in the ciphertext, meaning the distance between repetitions.
So let's look at a sample ciphertext here. If you were to pause your screen and look at this, you might see some repetitions. Actually, why don't you try that by hand real quick? Hit that pause button, and then we'll show you some of the repetitions in three, two, one. Here are some right there. This fragment, ZVRAO, shows up not once, not twice, not three times, but five times in this short paragraph. And it's not the only fragment that repeats. Here's another repeated fragment: MBDBMVS also shows up five times in the ciphertext. So we have to figure out first why these repetitions occur at all. Is it by chance, or is there something else at play?
Let's look at a smaller example, starting with some plaintext. Here's just a piece of a sentence: "on a plane, the plane is due." Notice that the word "plane" shows up twice in this message. We're going to pick a small keyword, the word "milk," though what follows would work for any four-letter keyword.
If we write out our plaintext, create our key stream on top, and encrypt the message, we do get a repetition in the ciphertext, and it's due to a couple of things. First, there's a repetition in the plaintext: the word "plane" shows up twice. But we can also see that above that repeated word, our key stream happened to line up the exact same way, K M I L K, both times. Let's think about why that might be the case. The two occurrences in the plaintext are eight characters apart: if you start at the P in the first "plane" and count over eight characters, you get to the P in the second "plane." And since our keyword has a length of four, it lines up really nicely. We have K M I L, K M I L, and then K again, starting the sequence over.
That doesn't only happen with a four-letter keyword. Let's use a different word this time, "hospital," line up our plaintext with the key stream, and encrypt. Again, under the word "plane," we get the exact same ciphertext both times, ETTNP. And again, it's because our key stream lined up the same way over the word "plane." This time it was P I T A L. There are eight letters in the word "hospital," and because there are eight letters from the P in the first "plane" to the P in the second, the keyword repeats perfectly right on top.
Now, this won't always happen. Here's a five-letter keyword, "water." When we set up our plaintext, key stream, and ciphertext, we don't get a repetition, because the word "water" has five characters in it. We started with an E over the P in the first "plane," but we had an A over the P in the second, so it didn't line up. There are eight characters between the repeated letters in the plaintext, but five characters in the keyword.
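The three keyword experiments above can be checked with a few lines of Python. This is a minimal sketch, assuming the plaintext is written in uppercase with spaces and punctuation removed; the function name `vigenere_encrypt` and the variable names are illustrative choices, not something from the lesson.

```python
from itertools import cycle

def vigenere_encrypt(plaintext: str, keyword: str) -> str:
    """Encrypt uppercase A-Z text by adding the key-stream letter mod 26."""
    out = []
    for p, k in zip(plaintext, cycle(keyword)):
        out.append(chr((ord(p) - 65 + ord(k) - 65) % 26 + 65))
    return "".join(out)

# "on a plane, the plane is due" with spaces and punctuation stripped
PLAINTEXT = "ONAPLANETHEPLANEISDUE"
FIRST, SECOND = 3, 11  # 0-based positions of the two P's, 8 characters apart

for keyword in ("MILK", "HOSPITAL", "WATER"):
    ciphertext = vigenere_encrypt(PLAINTEXT, keyword)
    a = ciphertext[FIRST:FIRST + 5]    # ciphertext under the first "plane"
    b = ciphertext[SECOND:SECOND + 5]  # ciphertext under the second "plane"
    print(keyword, a, b, "repeat" if a == b else "no repeat")
```

With MILK and HOSPITAL the two five-letter slices match (ETTNP both times for HOSPITAL), while WATER gives two different slices, because 5 does not divide 8.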
With "water," the key stream didn't stay in sync: it was off by the time we got to the second letter P, so we didn't get a repeat in the ciphertext. You might have already drawn this conclusion, but it appears that if the length of the keyword, which we can call n, evenly divides the distance between the repeated plaintext, which we can call D, the ciphertext will also repeat, and in fact must repeat. If the distance D is large, however, a lot of different key lengths could divide it, which would still leave a lot of options for the true key size. So it helps to have multiple repeats, like we saw in the example, to narrow down the choices for n. If one gap had a length of 10 and the next gap had a length of seven, maybe that tells us more about the key length than a single gap would.
So let's look back at our example ciphertext, where we've highlighted all of those repetitions. If we look at the ZVRAO repeats, the first one is at index 102; we've counted those for you. The next one appears at index 134. So the distance between those two Zs is the difference between the indices: they're 32 characters apart. The next occurrence of ZVRAO is at index 390, so we can compute 390 minus 134 to get another distance between repeated fragments. We can do it again at 402, and again at 426. From these five occurrences of the same fragment, we get four values for D, the distance between repeated ciphertext. And now we know that the length of the key must divide all four of those distances. We can do even better by going to the green fragment, MBDBMVS.
We can see that its first occurrence is at 143, the next at 155, the next at 307, and then 411 and 439. That gives us four more distances. And since the key is the same for the entire message, the length of the keyword must divide all four of those as well. So we've got eight differences between repeated fragments in the ciphertext, and the length of the key must divide all eight of those numbers. Let's see if that helps us figure out the true key length.
Before we do this, there's a big assumption here. The assumption is that these repeated fragments of ciphertext are caused only by repeated values in the plaintext lining up with the keyword in the same position. In theory, the repetitions could occur by chance alone: maybe the keyword lines up just right to create a repetition that otherwise wouldn't have been there. But if we have a long repeat, maybe six or seven characters like we've seen, and we get multiple repeats of the same fragment, the chance of that being pure luck is very, very small. So we're going to make this assumption, but just realize it could be wrong in very specific situations.
All right, let's take a look. We've organized these indices: there are our 10 indices, five for the first fragment and five for the second. And there are the corresponding differences, which we called the values for D: 32, 256, 12, and 24, and then 12, 152, 104, and 28. Remember, the value n, the length of the keyword, must divide all of those. And we have a handy algorithm that helps us find a very specific divisor, the greatest common divisor.
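The eight distances can be reproduced from the ten indices quoted above, and their greatest common divisor computed with the standard library. This is a small sketch; the helper name `gaps` and the variable names are illustrative, not from the lesson.

```python
from functools import reduce
from math import gcd

# Start indices of the two repeated fragments, as read off in the lesson
zvrao_indices = [102, 134, 390, 402, 426]
mbdbmvs_indices = [143, 155, 307, 411, 439]

def gaps(indices):
    """Distances between consecutive occurrences of one fragment."""
    return [b - a for a, b in zip(indices, indices[1:])]

distances = gaps(zvrao_indices) + gaps(mbdbmvs_indices)
print(distances)  # [32, 256, 12, 24, 12, 152, 104, 28]

# The keyword length n must divide every distance, so it divides their GCD
key_length_candidate = reduce(gcd, distances)
print(key_length_candidate)
```

Folding `gcd` over the list with `reduce` works on any Python 3 version that has `math.gcd`, which is why the sketch does not pass all eight values to `gcd` in one call.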
If you were to find multiple common divisors for all of these values, you could probably assume that your encryptor was using the largest of them, because a longer key produces a more secure message, as we've learned in other lessons. So let's find the greatest common divisor of all of those numbers. It happens to be four. It's actually not a very large divisor at all. So that's our most likely keyword length, since it's the largest value that divides all eight of those numbers. But it could be another one. The other values that divide those eight numbers are the numbers that also divide four: one and two. Now, we know a key of length one in Vigenère is really just a Caesar cipher, so that's probably not what was used here. And a key of length two is not very secure. So it's another pretty good assumption, especially when the value is this small, that the greatest common divisor of all the values of D is the most likely candidate for the length of your keyword.
Now that we know how long the keyword is, how does that even help us? Let's take a look at our ciphertext again; I've just shown the first line of it. I'm going to color-code each grouping of every fourth character. The green ones, J Q C J Q L and so on, were all enciphered using the same letter from the keyword, assuming we guessed the length of the keyword correctly. The pink letters, A S W B and so on, were all enciphered using the same letter of the keyword, and likewise the orange ones and the gray ones. We don't know those keyword letters yet, but let's group the ciphertext letters together. So the green group is all of the letters from the ciphertext that were enciphered using the same letter of the keyword, the pink group likewise, and so on down the line.
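The color-coded grouping amounts to taking every n-th character of the ciphertext, starting from each of the n possible offsets. Here is a minimal sketch using Python slicing; the function name `split_into_groups` is illustrative, and the input string is a made-up placeholder rather than the lesson's ciphertext.

```python
def split_into_groups(ciphertext: str, n: int) -> list:
    """Return n strings; group i holds the characters at positions i, i+n, i+2n, ..."""
    return [ciphertext[i::n] for i in range(n)]

# Placeholder text so the grouping is easy to see (not the real ciphertext)
groups = split_into_groups("ABCDEFGHIJKL", 4)
print(groups)  # ['AEI', 'BFJ', 'CGK', 'DHL']
```

If the guessed key length is right, each group was enciphered with a single keyword letter, which is what makes the per-group frequency analysis possible.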
We've seen in other lessons that if all of those letters were enciphered using the same letter of the keyword, their numerical representations all had the same value added to them. That means each group should follow the same distribution as a Caesar cipher, or any other monoalphabetic cipher: in our case, the standard English distribution shifted left or right. So let's take a look at the bar charts for each of these four groups. Here they are, color-coded. Going from left to right, we can see some of the telltale spikes that we saw earlier when we were doing the cryptanalysis of Caesar. J is the most likely keyword letter for the green group, because at the end of the five-letter span starting at J we have a larger spike, so that's probably the A and E pattern right there. If A got shifted to J, the first letter in the keyword is most likely J. In the second group, the pink, we can see that O and S are most likely the A and the E from English after shifting. I don't see any other spacings like that with the double spikes. Likewise, with the orange group, K seems to be where A got shifted to, so K is probably the letter in the third position of the keyword. And see if you can guess the last one. E. It works out pretty nicely that these four letters form an English word, "joke."
If we think those are the most likely four letters of our keyword, let's go ahead and decipher the ciphertext using "joke" as the keyword. If we deduced it correctly, we should see some English pop out. And if we get a bunch of gibberish, it's back to the drawing board. In this case, it does work out nicely: we get a joke about a mathematician, a physicist, and an engineer that you can read.
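As a quick check of the recovered keyword, the two repeated fragments can be deciphered on their own. This is a sketch under one assumption that the lesson does not state explicitly: the indices quoted earlier (102 and 143) count from 0, so the keyword letter at index i is `keyword[i % 4]`. The function name `vigenere_decrypt` and its `start` parameter are illustrative.

```python
def vigenere_decrypt(ciphertext: str, keyword: str, start: int = 0) -> str:
    """Decrypt uppercase A-Z text; `start` is the 0-based index of the
    first character within the full message (assumed convention)."""
    out = []
    for i, c in enumerate(ciphertext, start):
        k = keyword[i % len(keyword)]
        out.append(chr((ord(c) - ord(k)) % 26 + 65))
    return "".join(out)

print(vigenere_decrypt("ZVRAO", "JOKE", start=102))    # PRIME
print(vigenere_decrypt("MBDBMVS", "JOKE", start=143))  # ISPRIME
```

Both fragments come out as readable English, which is the same kind of confirmation as seeing English pop out of the full message.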
But we can also see that there's a lot of repeated plaintext in this message: all of these numbers like three, five, seven, and nine, the word "math" shows up, "engineer" shows up. So it's probably no surprise that our four-letter keyword happened to line up over those repeated words in the same position several times. But it turns out that English is very repetitive in general. It didn't have to be this joke with a lot of repeated words; any message of sufficient length will have some repetitions in it, maybe the word "the," or "for," or "to," these common words that we use quite a bit. And regardless of the length of your keyword, they're going to show up fairly frequently as repetitions in the ciphertext as well.
So that's it for the Kasiski test. It's somewhat manual, but we could automate it fairly well with Python. We just need a way to find repeated fragments in our ciphertext, calculate the distances between those repetitions, and then compute the greatest common divisor of those distances. Once we've got that, we can use our bar charts to make pretty good guesses at the letters in our keyword.
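The three steps just listed can be sketched in Python as follows. This is one possible approach, not a definitive implementation: it looks only at fragments of a single fixed length, and the function names `repeated_fragments` and `kasiski_distances` are illustrative. The sample ciphertext is "on a plane, the plane is due" enciphered with the keyword "milk."

```python
from collections import defaultdict
from functools import reduce
from math import gcd

def repeated_fragments(ciphertext: str, length: int) -> dict:
    """Map each fragment of the given length that occurs more than once
    to the list of 0-based indices where it starts."""
    positions = defaultdict(list)
    for i in range(len(ciphertext) - length + 1):
        positions[ciphertext[i:i + length]].append(i)
    return {frag: idx for frag, idx in positions.items() if len(idx) > 1}

def kasiski_distances(ciphertext: str, length: int) -> list:
    """Distances between consecutive occurrences of every repeated fragment."""
    distances = []
    for idx in repeated_fragments(ciphertext, length).values():
        distances.extend(b - a for a, b in zip(idx, idx[1:]))
    return distances

ciphertext = "AVLZXIYOFPPZXIYOUAOEQ"  # ONAPLANETHEPLANEISDUE under MILK
print(repeated_fragments(ciphertext, 5))  # {'ZXIYO': [3, 11]}
distances = kasiski_distances(ciphertext, 5)
print(distances)                          # [8]
print(reduce(gcd, distances, 0))          # 8 (0 would mean no repeats found)
```

With only one gap, the GCD is 8, while the true key length is 4. That is the situation described earlier: the key length divides the distance, and it takes several repeats to narrow the candidates down.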