In the previous lesson, we saw how we could use a known-plaintext attack to reverse engineer the keys for a particular cipher. In those examples, we assumed that we knew a letter, or maybe even a word, from the plaintext that matched up with the ciphertext that we had. Now, how does that happen? Maybe we can identify some patterns in the way that messages are structured. Maybe we've got some good spies that can give us some information. But a lot of times we're going to be left to our own devices and have to make good guesses. One of the tools we can use to start making good guesses about which letters in the ciphertext map back to which letters in the plaintext is a statistical technique called frequency analysis.

Let's take a look at a selection of text from the book Pride and Prejudice. We might be able to look at this and pick up on some key patterns; maybe some letters show up more than other letters. We can use this information, and we're going to do this for the entire book, not just this one passage, to start constructing what's called a frequency bar chart, which summarizes the proportion of the entire text made up by each character. So we can start looking at how many a's there are in the whole book, how many b's and c's and d's, and so on. This would be a pretty tedious task to do by hand, but Python allows us to do this very easily and very quickly, and that's what we're going to learn how to do in the next few lessons.

If we were to do this for the book Pride and Prejudice, we'd get the following bar chart. Now, what is a bar chart? We're going to have a vertical bar, one for each of the characters in the English alphabet, so just a through z, and the height of that bar will represent the proportion of the text in the book that that character represents.
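As a rough preview of what that counting could look like in Python, here is a minimal sketch. It is not the course's own code: the function name letter_proportions is made up for illustration, and the sample sentence is just the opening line of Pride and Prejudice standing in for the whole book.

```python
from collections import Counter
import string

def letter_proportions(text):
    """Return a dict mapping each letter a-z to its share of the letters in text."""
    # Keep only the letters a-z, ignoring case, spaces, and punctuation.
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    # Guard against empty input so we never divide by zero.
    return {ch: (counts[ch] / total if total else 0.0)
            for ch in string.ascii_lowercase}

# Stand-in sample: one sentence instead of the entire book.
sample = ("It is a truth universally acknowledged, that a single man in "
          "possession of a good fortune, must be in want of a wife.")
props = letter_proportions(sample)

# A crude text-only bar chart: one '#' per percentage point, rounded.
for ch in string.ascii_lowercase:
    print(f"{ch} {props[ch]:6.1%} " + "#" * round(props[ch] * 100))
```

With only one sentence the bars will be noisy; the patterns described in this lesson settle down as the text gets longer.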
When we make the bars, we can see that just under 8% of the book is the letter a, maybe about 2% is the letter b, e jumps up to maybe about 13% or so, and letters like q, j, z, and x make up a fairly low proportion of the overall text.

And you might think, well, this is the bar chart for this book. How does that compare to other books? So let's take a look. Here we've got a bar chart for Alice in Wonderland. Here's one for Frankenstein. And here's one for Little Women. You'll see these bar charts don't really change that much from book to book, which may be surprising, because they're about different characters and they're set in different places. You would imagine a lot of the words being used would be very different in these books. However, this underlying pattern remains. And that's actually the case for all of the English language. If you were to take the entire corpus, or body of works, written in English and run a similar analysis, you'd get a bar chart that looks just like this. Now, this one is from Wikipedia. Depending on which books you use and the size of your analysis, these percentages are going to change a little bit from analysis to analysis, but they're all going to be approximately the same. And we can see that even for a relatively small passage of text, just a paragraph or two, these patterns will emerge. Now, if you had just one or two words, of course, we're not going to see the exact same distribution. But the law of large numbers really starts to kick in once we get past just a few sentences of text.

Now, something to recognize in this bar chart is that E is by far the most frequently used letter in the English language, weighing in at around 12.7% of all characters written in English.
So this could be a telltale sign if we were to look at an analysis of an encrypted message, where the distribution might look a little bit different: the most frequent letter of the ciphertext might go back to this letter E in the plaintext. This pattern here with A and E, the double spike spread five apart, is a pretty common one to recognize, followed by H and I together as a higher frequency pair, N and O as a higher frequency pair, and R, S, T as a high frequency triple. So if we were to start shifting this alphabet around, just like we do with Caesar, we would expect to recognize these signatures, or fingerprints, of the distribution, maybe just in a different location in the encrypted message.

Let's see how that might look. Here on the left is that same distribution of the English language that we would expect to see based on our previous analysis. And on the right is the distribution of a Caesar ciphertext that was created using a key, K, of three. We can see that the letter E was most likely shifted three spots to the right in our ciphertext, because the character H is the most frequent letter in the ciphertext. So there's our A, E double spike, five apart. There's our H I double spike, now shifted over to K L. There's our N O double spike, now shifted over to Q R. And then there's the R S T, shifted over to U V W. Now notice the heights aren't exactly the same, because our ciphertext uses different words and different letters to make up the message. However, they're roughly the same proportions. That's the law of large numbers weighing in again: we're going to get roughly the same shape even if it's not the exact same values. So we have several different reasons to believe that K equals three is in fact the key for this message.

We can do the same thing for K equals seven. We can see L is the most frequent letter, highlighted in red, which makes sense.
That's seven shifted over from E. And then you can look for your A E double spike and your H I, N O, and R S T spikes, all shifted as well. Here's one for K equals 22. It's probably easiest to recognize that E got shifted four spots to the left, which is the same thing as 22 to the right. Always try to remember to write the key as a positive value.

You try. Here's a distribution of a ciphertext that was created using the Caesar cipher. Just pause the video and see if you can figure out the key that was used. We'll show you the answer in three, two, one. Here we can see that the most frequent letter in the ciphertext was O, which is 10 spots to the right of the character E. So you might guess that the ciphertext was created using a Caesar cipher with a key of 10, and it was.

So I think we've got a good handle on how the frequencies in this bar chart can help us make some educated guesses about the key that was used when creating ciphertext with the Caesar cipher. However, we've seen a couple of other ciphers, so let's see if these patterns remain and can be extended to other encryption algorithms.

Let's take a look at this ciphertext that was created using a multiplicative cipher with a key of three. Notice we can still tell that the letter M was the most frequently used character in the ciphertext. So it's a pretty good guess that it came from the plaintext character E. However, you'll notice that the shape of the remainder of the distribution doesn't match up with the English language distribution just shifted over, like we saw with Caesar. Consecutive plaintext letters no longer get mapped to consecutive ciphertext letters with a multiplicative cipher. So we can't rely on those A E double spikes or the H I, N O, and R S T spikes to corroborate the guess that we're making, that E went to the letter M. We're just going to have to take a risk with our guess and hope that it plays out all right.
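To make those two observations concrete, here is a small Python sketch. The function names are invented for illustration and are not from the course. It first turns the most frequent ciphertext letter from each Caesar example into a key guess, assuming that letter came from E, and then shows what happens to the A E, H I, N O, and R S T groups under a Caesar key of three versus a multiplicative key of three.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def caesar_key_from_top_letter(top_cipher_letter, assumed_plain="E"):
    """Guess a Caesar key, assuming the top ciphertext letter came from E."""
    shift = ALPHABET.index(top_cipher_letter) - ALPHABET.index(assumed_plain)
    return shift % 26  # always report the key as a value from 0 to 25

def caesar_encrypt(text, key):
    """Shift each uppercase letter key places to the right, wrapping around."""
    return "".join(ALPHABET[(ALPHABET.index(ch) + key) % 26] for ch in text)

def multiplicative_encrypt(text, key):
    """Multiply each uppercase letter's number by key, mod 26."""
    return "".join(ALPHABET[(ALPHABET.index(ch) * key) % 26] for ch in text)

# Most frequent ciphertext letters from the four Caesar examples.
for top in "HLAO":
    print(top, caesar_key_from_top_letter(top))  # keys 3, 7, 22, 10

# Caesar keeps neighbors together; the multiplicative cipher pulls them apart.
for group in ["AE", "HI", "NO", "RST"]:
    print(group, caesar_encrypt(group, 3), multiplicative_encrypt(group, 3))
```

With the multiplicative key, H I lands on V Y and R S T lands on Z C F, so the familiar spikes are no longer side by side, while E still lands on M.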
All right, you take a try on this one. Here's a different ciphertext, also created using the multiplicative cipher, with an unknown key. Pause the video and see if you can guess what the key might have been here. We'll go over the answer in three, two, one. So again, it'd be a good guess here that the plaintext letter E got mapped to the ciphertext letter I. We can't use any of the other patterns to try and back up that guess, so let's just take that guess and run with it.

Let's remember back to our previous lessons about how we could use this mapping to help us solve for the unknown key value. If our guess is correct, then under the multiplicative cipher E, which remember has a numerical value of 4, was mapped to the ciphertext letter I, which is 8. So we know that 4 times the k value, mod 26, should be congruent to 8. There's not a real easy way to solve this congruence, because 4 does not have a multiplicative inverse mod 26. But a little bit of guess and check would determine that the k value could have been 2: 4 times 2 is in fact 8. It could also have been 15: 4 times 15 is 60, and 60 mod 26 is also 8. These are the only two k values between 0 and 25 that make this congruence true. Remember, we are working with a multiplicative cipher, so k equals 2 is not a valid key. It does not have a multiplicative inverse mod 26, since 2 and 26 share a factor of 2. So we can eliminate that as one of our two possibilities, meaning that k equals 15 is in fact the key that was used to generate the ciphertext.

Let's take this to a new cipher, this time the affine cipher. Recall that the affine cipher has two keys that get used to generate the ciphertext. We take our plaintext letter and convert it to a number. We multiply it by the multiplicative key, k sub m, and then add on our additive key, k sub a.
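The affine rule just described can be sketched in a few lines of Python. This is only an illustration with made-up function names, not the course's implementation. Setting the additive key to zero turns it into the multiplicative cipher, so the same function can also redo the guess and check for the key that sends E to I.

```python
from math import gcd

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def affine_encrypt_letter(letter, k_m, k_a):
    """Encrypt one uppercase letter as (k_m * x + k_a) mod 26."""
    x = ALPHABET.index(letter)
    return ALPHABET[(k_m * x + k_a) % 26]

def multiplicative_key_candidates(plain, cipher):
    """Try every k from 0 to 25 and keep those that map plain to cipher.

    Returns (solutions, valid): every k that satisfies the congruence,
    and the subset that is a usable key (k shares no factor with 26).
    """
    solutions = [k for k in range(26)
                 if affine_encrypt_letter(plain, k, 0) == cipher]
    valid = [k for k in solutions if gcd(k, 26) == 1]
    return solutions, valid

solutions, valid = multiplicative_key_candidates("E", "I")
print(solutions)  # [2, 15]
print(valid)      # [15]
```

Trying all 26 values is cheap enough that there is no need for anything cleverer here; the gcd check is what rules out k equals 2.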
The distribution to the right was created using the affine cipher with a multiplicative key of 7 and an additive key of 2. Just like before, we can see that the patterns of consecutive letters from the English language don't show up in the ciphertext with the affine cipher. That's from the multiplicative component. But we can still make some good guesses about which ciphertext letter the plaintext letter e went to. In this case, they happen to be the same. It looks like the most frequent letter in the ciphertext was e itself. So whatever combination of k sub a and k sub m was used here mapped e back onto itself. And that's okay. A lot of times plaintext letters do get mapped onto themselves. As long as they don't all do that, we're going to be okay with our encrypted message.

All right, one more time for you to practice. Let's see if you can make a guess about which values for k sub a and k sub m were used to generate this ciphertext, based on the distribution. Now remember, we need two keys here, so this is going to be a little bit trickier than the multiplicative one that we just did. Refer back to our previous lessons about how, if we know two correct mappings, we can solve a system of equations to determine the key values. And I already gave you a little bit of a hint there: you're going to have to get two letters correct in order to make this possible. Pause the video, take a look at these distributions, and see what you can work out. We'll go over the answer in three, two, one.

So looking at this, you probably guessed that the letter m, the most frequent letter in the ciphertext, came from e in the plaintext. Here it gets a little bit trickier, though. Maybe the ciphertext letter p was the letter t, the second most frequent letter in the English language.
But due to the slight differences between the actual text that we encrypted and this English language standard, maybe the ciphertext letter g was the t instead, and we just had a weird selection of text in this one where they flip-flopped. These frequencies are all kind of close. In fact, the letters o, s, and t in the ciphertext are all possible candidates for the plaintext letter t as well. It's going to get a little bit tricky. But let's go with the most likely guesses that we can make from this bar chart: that the plaintext letter e was mapped to the ciphertext letter m, and that the plaintext letter t, the second most frequent letter, was mapped to the ciphertext letter p.

Assuming that those guesses are right, we can set up the following system of equations to solve for the two keys, which we'll call a and b here just to simplify the work. If the plaintext letter e, which has a numerical value of 4, gets mapped to the ciphertext letter m, which has a numerical value of 12, that gives us the second equation in our system: 4 times the multiplicative key a, plus the additive key b, is congruent to 12, mod 26. And if our second guess is correct, that t, which has a numerical value of 19, is mapped to the ciphertext letter p, which has a numerical value of 15, that gives us the first equation: 19a plus b is congruent to 15, mod 26. As we learned in our previous lesson, we can solve this system by subtracting the second equation from the first to eliminate the variable b, which leaves the new congruence that 15a is congruent to 3, mod 26. We can solve this by multiplying both sides by the multiplicative inverse of 15, which, if you remember, is 7. Since 3 times 7 is 21, the multiplicative key a must be equal to 21. Now we can substitute that into one of the two equations. Here I've chosen the second one, since it has slightly smaller numbers to work with.
That gives the congruence that 12 is congruent to 4 times 21 plus b, or 12 is congruent to 84 plus b, or negative 72 is congruent to b. Then we mod that by 26 to find that b is equal to 6. Now we've got both the multiplicative and additive keys figured out for our ciphertext. We should be able to use those to recover the entire plaintext, not just the two letters that we've guessed.

So there's our setup for frequency analysis. It helps us make some better guesses about some of the mappings between plaintext and ciphertext. And while it might not always get it right, it is going to be a useful tool to have up our sleeve. In the upcoming lessons, we're going to look at how we can actually create these bar charts to kick off this analysis, and then find a way to automate this process so we don't need to keep guessing these individual letters.
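As a check on the arithmetic above, here is a minimal Python sketch that solves the same pair of congruences and then undoes the two guessed letters. The function names are made up for illustration, and it assumes Python 3.8 or later, where pow with an exponent of -1 and a modulus returns a modular inverse.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def solve_affine_keys(p1, c1, p2, c2, modulus=26):
    """Solve a*p1 + b = c1 and a*p2 + b = c2 (mod modulus) for (a, b)."""
    # Subtracting the second congruence from the first eliminates b.
    diff_p = (p1 - p2) % modulus
    diff_c = (c1 - c2) % modulus
    # pow(x, -1, m) raises ValueError if x has no inverse mod m,
    # which is a sign that a different pair of guesses is needed.
    a = (diff_c * pow(diff_p, -1, modulus)) % modulus
    # Substitute a back into the second congruence to get b.
    b = (c2 - a * p2) % modulus
    return a, b

def affine_decrypt(text, a, b, modulus=26):
    """Undo y = a*x + b (mod modulus) for a string of uppercase letters."""
    a_inv = pow(a, -1, modulus)
    return "".join(ALPHABET[(a_inv * (ALPHABET.index(ch) - b)) % modulus]
                   for ch in text)

# First equation: t (19) -> p (15). Second equation: e (4) -> m (12).
a, b = solve_affine_keys(19, 15, 4, 12)
print(a, b)                        # 21 6
print(affine_decrypt("MP", a, b))  # ET
```

If the decrypted message comes out as nonsense, that suggests one of the two letter guesses was wrong, and the next most likely candidate from the bar chart is worth trying.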