 Folks, good noon. Good noon. All right, before we get into crypto, I wanted to make some comments. I was going to send out an email about this, but I'll tell you all here now, since you're here and probably not some of the panelists. I read through some of the submissions for part two and three for the homework. Some of them are very good. Some of them are not so good. The reason why the ones that are not so good are not so good is because they are neglecting the English part of the writing assignment. Remember we said the goal is to actually write something that is reasonable and easy to understand as if you are writing a report or a letter to your boss or something. Some of the things I've seen, like a series of just one sentence bullet points are not appropriate for this type of communication. The idea is to get you thinking in terms of writing paragraphs that have sentences that logically follow each other. I'll make this more clear in the mailing list, but part of the assessment of those parts is not just the technical content of the answer, but also your spelling mistakes, grammar mistakes, are your thoughts clear and logical? Are your sentences following each other? Are your paragraphs properly structured into something that makes sense? Can you submit it already? You have time to change that, so that's good. Any questions on that? Are the user names unique? Let's say yes, they are unique. Since we're not doing any authentication in this, you'd see that the name is the identity. They would not be two users, they would be one user. And can the keys be used by multiple people? What does the policy say? Only users would not be authorized to use. So how would you take that to me? What case is? Yes, the test cases exactly follow the policy description. So it says that it will be in this form, that it will be in that form. If it says that your output needs to be in this form, then your output will be in that form. 
For part two or three, is there an upper limit of how much you want people to write? I don't know if you want to lead an essay. I mean, how long are we talking? If you write something really long, I will read it, but it better be very good by the time I get to that. Like I said, it's more about comprehensive, so if you feel like you need to take time to comprehensively answer the question, I think that's absolutely a good thing. Yeah, so no hard upper limits. I don't want to have to set a file size limit on past e-testing uploaded. Cool. Alright, let's get into writing crypto, or crypto. We're talking about crypto systems, and we model just as a kind of a refresher since we already covered this stuff on Thursday. A crypto system we talked about basically has, we can think of as having five elements. What are the elements of our crypto system? Yeah, in the back. An encryption system, so what does it do? What is the encryption system? What does it take in? It takes in a plain text, and what else? Key. We need a key. We need some kind of key, right? So we can think of it as an encryption system. So e is basically our functions that take in a plain text, a message, and a key and output ciphertext. And so already basically implicit in there, kind of in this formalism. m is a set of plain text, k is a set of all the possible keys, c is the set of all ciphertext, and an encryption function maps a message and a key into some ciphertext. And so then if the encryption function does what? What's that? So what does it do here? Ciphertext and a key? Output is plain text. Output is plain text, perfect. And so for the security, we'll talk about that in a second. So we talked about a Caesar cipher, so that a Caesar cipher actually was used by Julius Caesar, which I think is kind of cool that we still talk about it, and it's still relevant. And so we talked about there, a Caesar cipher is the messages as a sequence of letters. 
The key is some integer from 0 to 25 that tells us how far to shift all the letters in the alphabet. And so the encryption is pretty simple. For each letter of the message, shift that letter k characters forward, and the mod 26 is so that it will wrap around. So shifting by 1, Z will move to A. Shifting by 2, Z will move to B. Anybody have questions on this? Seems pretty straightforward. Can you talk about the Enigma machine? The German Enigma machine, that Turing helped break. To be honest, I don't know. I'd have to look into it. It probably is not this simple, because that would be very easy to break. But as we'll see, all crypto systems have some kind of combination of these types of things, of substitution and rearrangement. So it probably is something like that. Are we going to do that? You will be breaking some crypto, but it would be cool if it was like the Enigma machine or something. Well, maybe we can watch the movie in class. And so decryption is very simple. If you know the key, you can decrypt the message by taking every letter in the ciphertext and subtracting the key from that letter, again mod 26, and that will reveal the plaintext. So the important thing here is that the ciphertext and the message are both in the same language. I mean, they're both just letters. And this is, I would say, true for most crypto systems. You don't want to assume you have some crazy language that's not exactly the same. So even when we get to more real crypto systems, they operate on bits, right? The input and the output are the same kind of thing. Any questions here? So how do we break a crypto system? When we think of crypto, we want to think in terms of adversaries. The adversary is the person who wants to break the crypto system. So why might they want to do that? What's the motivation behind our adversaries here? What's that? Read a message. Read a message? Why would they want to do that? To win a war?
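The Caesar cipher encryption and decryption described above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the choices to uppercase the message and to pass non-letters through unchanged are assumptions, not something the lecture specifies.

```python
import string

ALPHABET = string.ascii_uppercase  # the 26 letters A..Z


def caesar_encrypt(message: str, key: int) -> str:
    """Shift every letter forward by `key` positions, wrapping around with mod 26."""
    out = []
    for ch in message.upper():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + key) % 26])
        else:
            out.append(ch)  # assumption: spaces and punctuation are left as-is
    return "".join(out)


def caesar_decrypt(ciphertext: str, key: int) -> str:
    """Undo the shift by subtracting the key from every letter (mod 26)."""
    return caesar_encrypt(ciphertext, -key)


print(caesar_encrypt("Z", 1))  # A
print(caesar_encrypt("Z", 2))  # B
print(caesar_decrypt(caesar_encrypt("HELLO WORLD", 3), 3))  # HELLO WORLD
```

Decryption reuses the encryption function with a negated key, which works because Python's `%` always returns a non-negative result for a positive modulus.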
So I mean, a large part of, if you know where people are or where they're going to be, you can plan accordingly. Yeah, that's definitely. What else? Leak information? What kind of information? Personal information. So maybe they want to read. I don't know if that's a true statement. I'm pretty sure. I think all my tax returns are on this computer in my Dropbox, but they're all encrypted. So if somebody broke into my Dropbox, they should not be able to read those files of my tax returns because they're encrypted. So if they can, that would definitely be a problem. I should verify that that's the case after this box. I just reminded you that. What else? No other motivations to break this? For the thrill? Yeah, that's true. Yeah, in the back. So if we break it, maybe we can inject our own fake messages into there and cause chaos and confusion. That's a great one. I don't know how about that. What else? Do you have your hand up somewhere over here? To expose. To expose security blocks? Yeah, so maybe we don't actually want, we're not an adversary in the sense that we will cause harm by breaking this, but maybe we are cryptographers who want to demonstrate that a known crypto system has flaws, right? That can be taken advantage of. And so, what should we assume about our adversary? So when we're designing a crypto system or we're discussing a crypto system, we can think through, well not only what's the motivation of the adversary, but I think there's a lot of reasons why they would want to break the system. But what do we assume about them? That they will use any means necessary. They'll use any means necessary. Why is that safe to assume? And what do you mean by safe? Well, if they're trying to break your cryptography, they're obviously doing something you don't want them to do. So your feelings are not important to them. And I mean, probably the law is not important to them either. So whatever they're going to do, they're going to do it. Okay. Safe. 
What do you mean by safe? Or does somebody else want to define safe? You can do it. It's okay. There's no possible way for them to get in. You want to close off every way of attacking and breaching your system. You don't want there to be some kind of problem. Right. So I guess the question is, do we assume that this person is, I'm trying to think of a good example of somebody who's not skilled, I don't know, like a 10-year-old who doesn't have a lot of math background? Right. They could become an advanced cryptographer, but they're not quite there yet. If you assume that that's who your adversary is, then why not just use a Caesar cipher? I don't think a 10-year-old would be able to break that, or it may take them a long time. I don't know. Right. So by safe we mean we're assuming a very powerful and strong adversary in that sense. Right. So going back to this, what should we assume that the attacker knows in general, not for a specific system? So when we're talking about the Caesar cipher and we're thinking about, okay, somebody wants to break this, what do they know? They know the alphabet, so they probably know the language, the set of plaintexts and the set of ciphertexts. Right. They will probably know that they are looking at something encrypted. Yeah. We'll get into steganography and other message-hiding techniques later. But that's definitely, I think, a safe assumption. Yeah. Just drawing on some of the other classes, I think it might be safe to say that they know our encryption function. If we're using a good one, the best ones out there, I mean, the function for it is known, right? Yeah. So I don't know. What does everyone think? Should we assume that the attacker knows at least E, the function that we're using? Is that too strong of an assumption?
I think that if I was encryption, I would assume that whoever I'm dealing with knows at least everything that I know about encryption. You know, and so the only information that I can keep hidden from them is literally keys that are, you know, that are protected by, you know, by design. So going back to the safe assumption, right, we want to assume a strong adversary. We should assume, I can't remember who said the encryption algorithm, but we should assume that they know E, the encryption algorithm, and D, right? I mean, we'd assume if you know the encryption algorithm, you would know how to decrypt it. But the important piece of information that we, so I guess the other question is, should we assume that the adversary knows the key? Probably not because then what are we hiding from them, right? If they have the key, they have the encryption system, they have the Cypher messages, they could not only manufacture new messages, they could easily decrypt everything, right? So the key, literally, is the key to all of this, right? So, yeah. There's a lot of encryption methods. So just because they know all the possibilities of the names and exactes can be used. True. Although the flip side, then, is whoever you're communicating the message to has to somehow know what algorithm you're using, right? So a lot of the modern crypto will actually say in the metadata the message exactly what algorithm is being used. So yeah, it's an interesting thing because you think, well, if I have to keep the keys safe, then I should just keep the algorithm safe as well. But so there is a little bit of difficulty there, but you have to keep something secure, and in this case, the keys. And so the idea is, and you have a lot of benefits by the algorithms being known, and that other people can attempt to find problems in these algorithms, right? People who are a lot smarter than you will be able to break these crypto systems. 
So this is a good discussion because it does inform how we think about crypto systems. We assume pretty much always that the adversary knows the algorithm but not the key. So I guess another way to maybe think about this is, if I create my own crypto system and, let's say, just use it to write messages to myself or to store my own data, is that a secure crypto system? A good system? I'm saying not only do I keep the key secret, but I also keep the algorithm secret. You're shaking your head. If you're storing, say, paragraphs of text, then you can statistically analyze what the output is and potentially reverse the algorithm, because you're not doing something that really obfuscates it. Right. Yeah, so kind of the key idea here is that the opposite of this isn't necessarily true, that hiding the algorithm means that the system is necessarily secure, right? And this actually comes up a lot when you talk about real-world cryptography. Companies who use crypto will say something like, hey, here's a thousand dollars to anyone who can break this crypto system, and they run this contest for, like, months, and after six months they go, nobody has broken our crypto system, that means we're definitely secure, right? And it's like, well, not really, because a thousand dollars is not that much of a cryptographer's time, I mean, of what a cryptographer would be doing to analyze this. And just because nobody has broken it publicly doesn't mean that it's not broken. So this is kind of an important thing to keep in mind: keeping the encryption algorithm secret, or claiming that nobody's broken it, is not necessarily a proof of security. So we assumed that the adversary knows the algorithm used and doesn't know the keys. What other possible capabilities could an adversary have on our system? What can they do?
I mean, think about different scenarios, like we just talked about, right, in a war environment, in a file on my computer. Yes, but then they're not actually breaking the crypto, right, if they're denying it's access to the system because presumably I can stand up a new system assuming I can safely transfer a key from one place to the other. So, one capability we can maybe say is they, and let's restrict it even more and say they only have access to the ciphertext. Right, so they only have, like let's say, thinking about maybe a war scenario, you've intercepted some message or some communication, right? You know the algorithm being used, but you definitely don't know the key and maybe this is the only message you have. Maybe you have multiple messages, but you may not know if they were encrypted with the same key or not. That may be something that is difficult. So, is this it? Any other capabilities that an attacker could have? I would say this is kind of, when you think about the adversary's capabilities, this would be on the bottom, so this is kind of the, at the very least you want some ciphertext to work with. But, yeah. Do they only have comparable systems to what we have? Yes, you would assume they not only have comparable systems, but, and this kind of comes in assuming a strong adversary, right? So you assume, it does depend on your threat modeling, right? You may not assume for a company that's a nation state may necessarily want to try to break your encryption, where if you are a nation state, you would assume that other people have your capabilities or maybe even better so that you would try to design that in. You could have the plaintext and the ciphertext associated with it. Yeah, so I may, through other means, maybe it's an incredibly simple message, maybe somehow I'm trying, I'm able to figure out the plaintext. So I have one instance of a message and a ciphertext, right? 
And the question is, can I derive the key from that in case that key was used in other messages or like we said, to encrypt our own message? So this would be intercepting maybe, or, I'm going to go too far down that route, but like, interrogation or use other means to like get a copy of the message and the ciphertext. Do you control that, that message in this case? Because it's... At your adversary created that message and you have this associated ciphertext with that message. So then, what if that's not the case? What kind of the next level up would be if we, the adversary, can choose the plaintext that they want encrypted by this system. So here not only do you have a message and a ciphertext, but usually you can make frequent queries to the system to get additional messages encrypted of your own choosing. And so this is kind of, this is basically a kind of reverse in increasing sophistication and capabilities of the attacker where, as we'll see, yeah. Possibly get whatever keys, like for instance like, say they're making an account and they get a key for themselves from that they possibly say, well we have this key Usually you would say that if they break your key it's game over. So if they get access to your key it's essentially done and your system is broken. It may not have necessarily a cryptographic weakness. It may not be something to do with the algorithm itself. It's the fact that you lost the key. So you can have and actually I think most crypto systems are, like modern ones are resistant to even chosen plaintext attacks. So even if you encrypt whatever you want that doesn't mean that you can derive the key from just the chosen message and the ciphertext. But if you leak the key then it's done everything's over. So you can now do anything you want. But you haven't broken that specific algorithm or that crypto system. You've broken that implementation with that key. So you can do this or something that would be good. Cool. 
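To make the known-plaintext scenario concrete, here is a small sketch of deriving the key from a single message and ciphertext pair. It works only because the Caesar cipher is so weak; as noted above, modern ciphers are designed so that even chosen plaintexts do not reveal the key. The function name is hypothetical and the code assumes the two strings line up letter for letter.

```python
def recover_caesar_key(plaintext: str, ciphertext: str) -> int:
    """Derive the Caesar shift from the first aligned pair of letters."""
    for p, c in zip(plaintext.upper(), ciphertext.upper()):
        if "A" <= p <= "Z" and "A" <= c <= "Z":
            # The key is the distance from the plaintext letter to the ciphertext letter.
            return (ord(c) - ord(p)) % 26
    raise ValueError("no aligned letter pair to compare")


print(recover_caesar_key("HELLO", "KHOOR"))  # 3
```

With a chosen-plaintext capability the attacker could simply ask for the encryption of "A" and read the key off the single output letter.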
So most attacks on crypto systems come down to a couple of broad categories. One is really a mathematical attack, where there's some mathematical flaw in the algorithm itself and somebody finds that and is able to derive some kind of attack. What are other types of flaws? Couldn't you have a flaw in how it's coded, like if somebody has a bug in it that can be exploited? Yep. So the algorithm itself may be secure but the implementation has flaws in it. Maybe they need high-quality random numbers but they're pulling from a source that is not high quality, and that could enable an attack. That's not necessarily the mathematics. The mathematics of the crypto system is fine, but the way that random numbers are generated is not. I should probably verify this, but it's a good story: I believe there used to be an online poker system that would randomize the deck based on a random number that, it turns out, was generated based on time, like just the UNIX timestamp. And so people figured this out, and they realized you could guess a range of a minute or something, because you don't know exactly the time on the server but you can guess. And if all it has is one-second granularity, you could then play out those games and figure out where the cards are going, and the cards you have would winnow that down to exactly what cards everybody has. At that point you've basically broken the game. So yeah, randomly shuffling a deck is not something that's inherently insecure, but how you choose that random number matters. Cool, what else?
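The poker story can be sketched as a toy simulation. Everything here is hypothetical: the deck is just the numbers 0 to 51, the shuffle uses Python's `random.Random` seeded with a made-up timestamp, and the attacker is assumed to be dealt the first five cards of the shuffled deck. It is not a reconstruction of any real poker site, only an illustration of why a time-based seed leaves very few possibilities to try.

```python
import random


def shuffled_deck(seed: int) -> list[int]:
    """Shuffle a 52-card deck (cards are just 0..51) with a PRNG seeded by `seed`."""
    deck = list(range(52))
    random.Random(seed).shuffle(deck)
    return deck


def guess_seed(my_cards: list[int], approx_time: int, window: int = 60) -> list[int]:
    """Return every seed within +/- `window` seconds whose shuffle deals `my_cards` first."""
    n = len(my_cards)
    return [
        s
        for s in range(approx_time - window, approx_time + window + 1)
        if shuffled_deck(s)[:n] == my_cards
    ]


server_time = 1_700_000_000             # hypothetical server timestamp (the "secret")
hand = shuffled_deck(server_time)[:5]   # the five cards the attacker can see
candidates = guess_seed(hand, approx_time=server_time + 17)  # attacker's clock is a bit off
print(server_time in candidates)  # True
```

Once the seed is known, calling `shuffled_deck` with it reproduces the entire deck, which is the point of the story: the shuffle is fine, the seed is not.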
Key exchange: depending on the system, how the keys are transferred from party A to party B. I'd say that's outside the scope for now, but it is definitely something we will talk about, especially when we get to asymmetric cryptography. We'll talk about why that's better, and it comes down to this exact issue. There is this problem that if I want to talk to any of you, we have to find some secure way to exchange keys so we can talk to each other. But how do we actually do that if the whole reason why we're using encryption to communicate is because we believe our communications are being monitored? So how can we then safely exchange keys? It's a difficult problem, but we'll ignore that for now. So what else? In what sense? Even if the algorithm you implemented is fine, there could be flaws in the hardware that it's implemented on. Yeah, let's file that under implementation problems, because it actually used to be a big problem with crypto systems, based on timing. If you can choose your plaintext, then the time it took to encrypt it would leak information about the keys. And so they've had to develop entirely new ways of programming, like constant-time programming, so no matter what input you give it, no matter what the key is, it always takes a fixed amount of time. You had an idea earlier, when we talked about looking at my file on my crypto system. How would you break that? So what does that mean?
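As a small illustration of the constant-time idea, here is a sketch contrasting an early-exit comparison with Python's `hmac.compare_digest`. Note that this shows the principle on a comparison of secret bytes, which is an assumption chosen for brevity; the lecture's example is about encryption time leaking key information, and the function names here are made up.

```python
import hmac


def leaky_equal(a: bytes, b: bytes) -> bool:
    """Early-exit comparison: running time depends on where the first mismatch is."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False  # returns sooner the earlier the mismatch occurs
    return True


def constant_time_equal(a: bytes, b: bytes) -> bool:
    """Comparison designed so that timing does not reveal where the inputs differ."""
    return hmac.compare_digest(a, b)


secret = b"correct-key-bytes"
print(leaky_equal(secret, b"correct-key-bytez"))          # False
print(constant_time_equal(secret, b"correct-key-bytez"))  # False
print(constant_time_equal(secret, b"correct-key-bytes"))  # True
```

Both functions return the same answers; the difference is only in how long they take, which is exactly the kind of information a timing attack measures.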
just throw statistics and deep learning at it based on the probability or based on how often a certain letter is written so we know English text English is not random some characters are used more frequently than other characters maybe that was the most frequently used character E so there's a whole thing you can analyze text to figure this out it is not a random distribution so you could analyze just the ciphertext itself to see if these same properties maybe hold or maybe there's some information in the ciphertext that is left after the encryption process so there's basically a whole category of attacks on this trying to statistically analyze the attacks if you think it's just kind of a simple cipher we'll talk about because we can do them by hand it's actually not I can't remember was it DES? one of the first major big encryption algorithms standardized they found out had some kind of statistical attack that it was still leaking information based on the encryption here so this happens in real systems it's not just these more simplified ciphers and so we talked about implementation flaws the important difference between these is mathematical attacks is basically the core mathematics is wrong statistical attacks means math is fine but the information that is left from the message or the key in the ciphertext is still there and so the encryption algorithm is not properly hiding that and so therefore we can try to make some assumptions and break it that way but there's a problem in the way that it's actually implemented so we're first going to look at classical cryptography as it's called or symmetric cryptography the idea is the sender and receiver of this message both share a common key like we said if it's just me that's pretty easy I have a password that could be the key it's in my brain nobody else knows it because when now when we have to talk we have to somehow exchange this key in an out-of-band manner where we both actually trust an important kind of 
distinction here is that the keys don't have to be exactly the same. They could be slightly different, but if it's in any way trivial to derive one key from the other, it basically falls under classical cryptography, so symmetric. So where does symmetric come from here? Exactly, the keys: we think of the sender and receiver as both having the same key. And there are two basic types that we're going to talk about, and this is actually true for simple ciphers and also true for incredibly complicated modern classical cryptosystems. First, substitution ciphers, of which the Caesar cipher is the first example that we'll look at. Based on the key, you substitute one character of the message for some other character. The other way is to move characters around. So you could make an encryption algorithm that says, for every two characters, swap them. That would be a transposition cipher. And modern classical symmetric cryptography is a combination of these things. You do a series of substitutions and a series of transpositions, and you do it kind of over and over. So we already saw this with the Caesar cipher, this substitution cipher. The idea is, if we have a Caesar cipher with the message HELLO WORLD and we want to use, let's say, a key of three, what would the ciphertext be? Every letter shifted three to the right, so X would go to A, Y would go to B, Z would go to C, so on and so forth. You either write the key as three or as the letter D, which is how it's often written in these types of cryptosystems. So the ciphertext here would be KHOOR ZRUOG. Cool, sort of. So how do we break this? Is there any problem with the math necessarily? Because you only have 26 possibilities for the shift? True, but I wouldn't say that's a problem with the math. Yeah, that's tricky. I guess it does depend on how you define the cryptosystem and how you define keys and everything. So one flaw would be that we can only send messages with no punctuation, only letters, so maybe we
want uppercase and lowercase letters, maybe we want spaces, maybe we explicitly don't. So that could be a problem. So how do we want to attack it? Will says we should just brute force it. I don't think I mentioned this here, but when you are interviewing for a programming job, this should be your first idea for everything, right? It's like, how do we do this thing? I don't know, just brute force it, try all possible combinations. If only to buy yourself more time to think about a good answer. But it also shows that sometimes brute forcing it is correct, I mean, is the right thing to do. If you know you only have 10 elements, then do whatever kind of sort you want, do a bubble sort. It really doesn't matter at that point, as long as you know that it's correct and your brute force is actually correct. So this last weekend we were playing a CTF and there was a challenge that took a string and called Java's hashCode function on it. The string that it gave you had special characters, like a dollar sign, but the input to the function could only be alphanumeric, so you couldn't actually use this dollar sign, and it would check that the hash code of the string you input was the same as the hash code of the other string. And so I looked into the math. You can look at exactly how Java calculates hashCode, it's this formula. And it was kind of late and I'm not really mathematically oriented, so I was like, I can't figure out how to make the math work. So I just said, screw it, I'll just write a brute force. I wrote a brute force and let it run overnight, and when I woke up it had found the answer. Part of that is because it's actually super quick for Java to do these hash code calculations, so you just do this in a loop, try all possible values, and at some point you'll eventually hit it. So that was fun. Yeah, so oftentimes you can completely brute force things, especially given the hardware, and this was just on my laptop, not using like a 20-core machine
or anything like that. But yeah, sometimes it's just worth it to do a back-of-the-envelope calculation, like, could I actually brute force that? In this case we only have 26 choices, right? So you just do that and look at all of them and ask what actually makes sense to me. That should definitely be kind of your first idea and your first approach to a lot of these things. We'll see that it quickly falls away the more combinations there are. So then what should we try? Let's say brute force is maybe not feasible. Look for patterns. What kind of patterns do you want to look for? Doubled letters. So let's go back to our ciphertext here, right? The idea is we could maybe do an analysis and figure out which pairs of the same letter right after each other appear in English, and which are the most common. And so you can say, okay, if I have an OO, that probably could be an LL, or whatever doubled letters are most common in English, and you could derive a key from that and use those values to prioritize which of the 26 you're going to actually try. Cool, what else? A rainbow table. What's a rainbow table? Basically you have a lot of common information and its ciphertext, and you can run them all through to see if the ciphertext that you're looking at matches anything in your table. Okay, so we haven't talked about rainbow tables yet, but yeah, that's a good point. So the idea is basically to pre-compute. Take the top 10,000 English words and encrypt all of them with all the possible keys, and then see which of those appear in the ciphertext more often. You'd probably be able to find it that way. That's a good one. As the number of keys increases, though, that becomes less and less feasible, because as the key size gets larger your table has to be much, much larger. Yeah, in the back? So how
does that work? So we can do some kind of statistical analysis. We just talked about how English has a particular distribution of letters, so the letter E is a lot more frequent than the letter Z. The fact that we have a Z in here pretty much tells us that we are not looking at plain English here. So if we peek ahead for a second and take this nice graph from Wikipedia, this is the graph of English character frequencies. How does this graph change under the Caesar cipher? I think we can all agree our input language, our message, will follow this graph, and the more characters we have the more true that will be. So what does the Caesar cipher do to this? It should just shift everything by whatever your key value is. The important thing to remember is that our message is not going to be exactly this. For whatever reason, let's say we are using code words that start with Z a lot, so there happen to be more Zs in the message. But the important thing is, whatever the distribution of the original message is, the ciphertext is only going to be that distribution shifted. So we already talked about, we skipped ahead to, a 2-gram model of English. This is basically 1-gram, so we're looking at single letters, not a letter and what follows it. And so if we have some ciphertext, let's talk about this: let's say you are just given this ciphertext. We said we assume the adversary knows that what they're getting is encrypted, but let's say we don't. What would we maybe check? Obviously if you can break it as a Caesar cipher, then yes, it is encrypted with a Caesar cipher. But without necessarily breaking it? Well, you could start just by seeing what letter is used the most, assume that that is going to be the most common letter in the alphabet, and then just kind of work your way down. Exactly. So basically take, let's say, this
graph. There is a graph that I don't have here, but it is on the web; if you just look up letter frequencies on Wikipedia, that is where this graph is from. You can sort this by most to least frequent character, so you are going to have a distribution like this. You can do the same thing to the ciphertext and see that, yes, it roughly follows that same distribution. If it is flat all the way across, what does that probably mean? Not encrypted? Not quite, but it is probably not a Caesar cipher, because that would mean that every character is used equally. If every character is used 1/26 of the time, that would probably indicate that the ciphertext itself at least looks random. But remember that this is only looking at 1-gram frequencies, so there may be other signals we can use. Cool. So what we can do is count the frequency of each letter in the ciphertext. You can do this by hand. I highly recommend not doing it by hand. It is a lot better if you write code, and this is why something like Python is great for this. Actually I use IPython when I am doing stuff like this. If you know Python but you haven't played around with IPython, it is like a better version of the prompt you get when you just type python and write code. IPython has history and it is really awesome, so I highly recommend that. So we now have the frequency of each letter in our ciphertext. What is the most frequent letter here? Was it V? Yeah, V. So it is probably not really the English model, so that is probably not where we want to be. And so we can actually perform an analysis here, and the idea is to calculate the correlation of the frequency of each letter in the ciphertext versus the frequency of a letter in English. It is really simple. We say that p(x) is the frequency of some character x in English and f(c) is the frequency of the character c in the ciphertext. These are very easy to compute: you just literally count the occurrences of a character
divided by the length of the ciphertext. Where would you get p(x)? A table. From where? Yeah, anywhere: any crypto book has one, there is probably one in our book, there are tons of these.
How would you actually derive that table if you needed to derive it? What if I take documents that are specific to, let's say, computer science? They may not have the same frequencies. Large enough numbers, right. A dictionary? So you could use a dictionary. What would be the problem with using a dictionary, though? Exactly. A dictionary would be great because it would give you the frequency distribution of the letters across all of the words in English, but it doesn't take into account the fact that "the" will be used a lot in English text, so T, H, and E should be counted more frequently. So where can we get that from?
If you were going to compile it on your own, would it be useful to take a subset of documents relating to the topic you were working with? Like, if it was war, you would just want to grab a whole bunch of war documents, and that would give you a way better estimate than taking newspapers and People magazine and a whole bunch of schoolchildren's essays. Yes, and that could actually feed back: even if you don't actually know the ciphertext, if you know what some of the messages being sent are, that could definitely inform the table.
But how are these tables generated? I mean literally, how would you do this? You live in an age where all of this stuff is available and you could go calculate these tables yourself; nobody needs to precalculate them for you. So how would you do it? Web scraping. So you can crawl the web. I don't remember the name of the project, but there are some projects that have already crawled the web, and they have terabytes of data on all of the things there. That's a little tricky because you need to strip out all of the HTML, so you need to make
sure you're only extracting the text, but you could use something like Beautiful Soup to do that. How else would you do it? Public books. What is that collection of books that are no longer under copyright and are freely available? I think that's Project Gutenberg. Yeah, so there's publicly available text of English books, and that will be biased a bit because it will include a lot of old stuff, but you could download probably tens of thousands of books and analyze those. You could download Wikipedia itself and use the English Wikipedia as input. And like we said, if you were targeting a specific system or a specific group, you would definitely want to include that kind of material more heavily. The point is that you can create all of this yourself; you don't have to have it be given to you.
The correlation is incredibly simple. You're trying to calculate it for a given i, which would be the key. So for each key, go through every character in the alphabet, calculate the frequency with which that character occurs in the ciphertext, multiply that by the frequency with which the correspondingly shifted character appears in English, and add that all up. This will give you a pretty good measure for each of the keys.
You can calculate this for our previous ciphertext, which was here. I've sorted this by the correlation, so we can see that 23, 13, and 7 all have a fairly high correlation. Does that guarantee that the one on top is definitely the one? Not necessarily. Our ciphertext is pretty small in this case, 20 letters or something like that; I don't know exactly how long it is and I don't want to count each letter. Our ciphertext is not incredibly large, so we'd expect that some character frequencies are a bit off. So then we just try each of these kind of in order and
see if they make sense, right? So we try 23: does this look like a message? It's difficult for me when there are no spaces between the words, because I keep trying to make words that don't exist. So we try 13: does this look like something? Yes: "you should never build your own crypto." It's true. And we can try 7 just to see. We can see there's actually a lot of Es there: E, E, E, E. So maybe, according to the distribution, this one is more likely, but that's just because of the particular words. There are only one, two, three, four, five, six, seven words, right? What's kind of crazy is that even seven words still follow the basic distribution. So using this you can figure it out. Questions on how to approach this?
So what are the problems with this cipher? For us there are only 26 characters. What else? The key space is small. Is that it? Another weakness is in how we just broke this: we did a statistical analysis of the frequencies. You could see that the cipher just shifts the frequencies one way or the other; it doesn't hide the distribution of characters.
Another weakness compared to more modern algorithms is that if you re-encipher your plaintext, like, a hundred times, it doesn't do anything. If you shift all the characters left by 23, then right by 12, then left by 5, then right by 27, then left by 100, if you did that a hundred thousand times on that string, you could do the same analysis you just did and pick out the top four choices. Right, and again that's mostly because the key space is small, but it also just doesn't change anything when you re-encipher your ciphertext. That's a good point: composing it with itself doesn't give you anything. Although, we'll probably touch on this a bit: even with modern cryptosystems people are wary about feeding a cipher back into itself, mainly because it's not really analyzed to do
that, and so people don't really know what to make of it. But you brought up something else that I've now lost, about cryptosystems, multiple encryption... it'll come back to me. So definitely the key is too short, so we can just search it. The statistical frequencies are not concealed very well. Oh, the repeating part. Let's say we have the word "hello" or "the" repeated multiple times throughout the message: if "the" appears in the message, those three letters will appear in the ciphertext as the same three letters over and over again.
And as an adversary, we broke this with just ciphertext, right? We didn't need to actually know the content of the message. But if we knew the content of the message and had the ciphertext, would we be able to get the key? That's a known-plaintext attack, and we would be able to get it incredibly easily. And what if we could choose the plaintext? How would you do it? What character would you choose to encrypt? I would do A. A would be the easiest: it would tell you exactly what the key is. You literally just have a message of A, and it will encrypt to whatever 13 is in this case, and then that's your key from there on out.
So it can be mildly, very mildly resistant to a ciphertext-only attack if somebody does not know how to do this, but it has all of these problems. Still, it's pretty good. I mean, for Roman times this is pretty sophisticated. I was actually reading something that said it worked really well because a lot of people couldn't read, and there were a lot of languages anyway, so if you saw it you would just assume it was some other language that you'd never heard of, not something enciphered, and you would ignore it. So yeah, it's too bad the masses aren't totally illiterate these days. So there are good and bad things.
So how would you
improve upon this algorithm, this cryptosystem, while still keeping the same kind of substitution style? Yes? Something I did in a previous class: use the key to encrypt the first n letters of the message, and then start using the message itself to shift over the rest. Yes, we'll get to that. You could feed in the key to encrypt the first part of the message, then use that part of the message essentially as the key to encrypt the next part, so you kind of feed the encrypted message back in on itself. But we'll get to that.
Using the same kind of methodology, you can put a message over a message: you use a short phrase as the key, put that over the message, and then shift the letters by that phrase every single time. Yeah, so the key idea there is just to make the key longer. Part of the problem here is that the key was one letter, one of 26 characters, which meant that there were only 26 possibilities you had to search. So an idea is to use multiple letters in a key. We'll actually talk about what happens if the key is as long as the message; that's something else. But the idea is really to smooth out the statistical frequencies of the letters in order to make this harder.
So, does anybody speak French and can pronounce this name? Vigenère? I will never get that right. Vigenère. I have to practice this at home. This person came up with this cipher, and the idea is what we just talked about: a similar idea, but use a phrase as the key. Say the message was "the boy has the ball", which, I don't know why that would be your message, but it's a really good example. Actually, let's touch on this for a second. If my key was 1, 2, 3, ... 12 characters, so my key was 12 characters long and was, let's say, randomly created, that would basically hide completely the
statistics of the underlying language, right? Everybody agree with that? If every single letter is encrypted with a random shift, each letter is going to come out as something random. It seems like if you knew somebody was using this cipher and the frequencies leveled out, you could still get at the letters from the way they level out. But in general, let's say I have a message of size n, I create a key of size n, choose the key randomly, and then encrypt with it. That way each letter is essentially randomly changed to something else. But I have to give that key to you somehow, securely. So if I'm giving you a key the length of the message, why wouldn't I just tell you the message, if I could do that securely? We'll touch on this later, but I wanted to mention it. Yeah, exactly: at that point the keys are just as long as the messages, and then you have the problem of reusing the key. If you reuse the key, you can leak information.
So let's use the key VIG. How many possible combinations are there that you would have to search if you knew that it was, let's say, a three-letter key? How do you derive that? 26 to the third: there are 26 possibilities for the first letter, 26 for the second, and 26 for the third, so 26 times 26 times 26, which is 17,576, about 17,000. That's still not impossible. Presumably, if I was paying you a lot of money, or you were in the military and you had to, you could write a computer program to print out all 17,000 or so possibilities and see which one looks like English. You could write a program like we talked about to use statistical frequencies to figure out which one of those was actually correct. So you could definitely do this.
But the idea of how this works is that you take the key and you repeat it. So the first character of the plaintext will be encrypted with V, the fourth
character will also be encrypted with V, the second character will be encrypted with I, and G would be the third character. So you get something like OPKWW and so on. Importantly, compared to our previous example, this WW doesn't necessarily mean that there are two identical letters next to each other in the message, so we wouldn't be able to use the trick of guessing that it's either OO or LL or one of the other common double letters.
So how do we break this? Yeah, but how do we know where the V is? Well, okay, actually hold that thought. One thing we can do is look at this and run through the same steps that we just did for the Caesar cipher and try frequency analysis. We can calculate what we did previously to see which keys are most correlated. We get things like 0.55 for the correlation value: 22 would get us something like LKLGSSAY (actually, it has "say" in it, right?), 4 gets us something like this, 2 gets us something like this. So even though these correlation values are very high, almost as high as the one that actually decrypted the last message, they are not actually decrypting the message, right?
So one of the core problems when we're trying to analyze a cipher like this is that we don't necessarily know the length of the key. Maybe we can do this analysis and determine that, yes, it looks like this type of cipher. But let's say we know it's this type of cipher; we still don't necessarily know the size of the key. So here are some terms that help. The period is the length of the key. You can think about it like physics, like the period of a pendulum, because the key is repeating over and over. I don't know if that's where the term comes from, but it helps me. And the other thing that we just touched on is that each nth letter, based on the period, of the
ciphertext was encrypted with the same key letter, so there are multiple alphabets. Whereas a Caesar cipher is monoalphabetic, right, it only has one alphabet, this cipher has multiple alphabets.
So the general approach is exactly what was just said. First we try to establish the period of the key. Then we break the message up into n different parts: we take every nth letter and say these were all encrypted with the same Caesar cipher, and we can just use the techniques that we just used on the Caesar cipher to try to decrypt each part.
But what's the difficulty here if we do the kind of analysis we did before, looking at the correlation, and then try to break it like this? Let's say we've somehow determined that the key is length 3. We take every third character and say that's all in the same alphabet, so it should have the same statistical distribution as English, just shifted, and we can shift it back. So what's the difficulty with that approach? One problem is going to be sample size: with a very small message, the number of letters that we have per alphabet is only a third, so the correlation may not be as high. What else? If you do them in parallel, then you're going to have several combinations to go through, and you have to iterate a lot more to get something that might make sense.
So part of the problem is that in the previous example the second key that we tried gave the actual plaintext, and we knew it because we could see that it was English. But if you take every third letter of a message, that's not necessarily going to be the case. The interesting thing is that these different alphabets are interlinked, but not in exactly the same way as in the Caesar cipher, so we may have
to use a weighted approach and try different combinations based on how likely we think they are. What else can we use? Work on them all at the same time, and then when you get the final message, compare your statistics. So we could try to solve each problem independently and come up with the top candidates. Well, no: if each individual alphabet has the right distribution, then when you bring them together and aggregate them they should also have the correct distribution. You need some way to know whether you were correct or not.
Yeah? If you know the period is n, each part is going to be a smaller sample, so you just look at the same letters in the same positions. Say that again? The same letters in the same positions: say you look at E in the second position, you find the distribution of E in the second position, and from there you start trying to break it down that way. Right, that's like three different Caesar ciphers at the same time. The difference is that when we broke the Caesar cipher, we knew we had it because the message appeared. Taking every third letter of the message, we won't necessarily know when we've actually broken that specific alphabet. Once we start to figure out part of the message, how can we start figuring out the other parts of the key? Let's think about it. Actually, let's skip over that.
So we're going to use this ciphertext. It's a little bit longer, and we could probably try to run the same analysis on it. The spaces are just for our visual reference; they don't mean anything here. So how can we find the period of this message? Yes?
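As an aside, the approach discussed above for a known period (take every nth letter, treat each group as its own Caesar cipher, and score every shift by correlating ciphertext frequencies with English frequencies) can be sketched in Python roughly as follows. This is only an illustrative sketch: the function names are made up for this example, the input is assumed to be uppercase A-Z with no spaces, and the `ENGLISH` table holds approximate, assumed frequencies that you would ideally replace with a table built from your own corpus.

```python
from collections import Counter
import string

ALPHABET = string.ascii_uppercase

# Approximate English letter frequencies for A..Z (assumed values for illustration).
ENGLISH = dict(zip(ALPHABET, [
    0.080, 0.015, 0.030, 0.040, 0.130, 0.020, 0.015, 0.060, 0.065,
    0.005, 0.005, 0.035, 0.030, 0.070, 0.080, 0.020, 0.002, 0.065,
    0.060, 0.090, 0.030, 0.010, 0.015, 0.005, 0.020, 0.002,
]))


def vigenere(text, key, decrypt=False):
    """Shift each letter by the matching letter of the repeated key (A=0 ... Z=25)."""
    sign = -1 if decrypt else 1
    out = []
    for pos, ch in enumerate(text):
        shift = ALPHABET.index(key[pos % len(key)])
        out.append(ALPHABET[(ALPHABET.index(ch) + sign * shift) % 26])
    return "".join(out)


def rank_shifts(ciphertext):
    """Return the 26 Caesar shifts, best first, scored by sum of f(c) * p(c - i)."""
    counts = Counter(ciphertext)
    n = len(ciphertext)
    scores = {}
    for i in range(26):
        scores[i] = sum(
            (counts[c] / n) * ENGLISH[ALPHABET[(ALPHABET.index(c) - i) % 26]]
            for c in counts
        )
    return sorted(scores, key=scores.get, reverse=True)


def columns(ciphertext, period):
    """Split the ciphertext into `period` groups, each encrypted with one key letter."""
    return [ciphertext[k::period] for k in range(period)]


if __name__ == "__main__":
    ct = vigenere("THEBOYHASTHEBALL", "VIG")
    print(ct)
    # With so few letters per column the top-ranked shift is often wrong,
    # so print a few candidates per column instead of trusting the first one.
    for col in columns(ct, 3):
        print(col, rank_shifts(col)[:3])
```

On a message this short each column has only five or six letters, which is exactly the sample-size problem raised above, so the candidates per column would still have to be combined and checked against something like a dictionary.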
we could try brute forcing it, but then we have to have a good way to, like we said, verify when the messages are correct. Do that with a dictionary: if you were going to automate it, run all the periods, and do all the analysis, then every time you came up with a candidate decrypted plaintext you would run it against the dictionary to see whether any of the groups of letters in there appear in an English word list. We expect that the majority of the plaintext is going to be English, so your candidates would have to have at least, say, three English words; otherwise you'd probably get a bunch of random junk. Eventually we'd find it, so that definitely would be an approach we could take if we just wanted to brute force it.
But let's go back and just look at this ciphertext. Well, we know that the length of this key is three, but let's say we just have the ciphertext. How could we try to determine that? Do you have an idea? Oh, I see, this one; it's probably longer. So why do you think that happened? What is this, O, E, Q, O, O, G? Yeah.
So let's look at this one, where we know the key length is three, and use the same idea. Any repeats here? OPK. OPKW, since the W actually repeats as well. So this is four letters of the ciphertext that repeat, but we know that the key is only three letters. Why did this repeat? It's incidental, so you picked a bad message: you should have added an extra inconsequential word into your plaintext to separate those two instances of the word "the" so that they don't both get translated to OPK. Even if that's right, and you could have written a better message that expressed the same idea without that inherent weakness of four consecutive characters repeating after a period of time, why are they repeated? Coincidentally, the same plaintext was getting encrypted with the same corresponding key
values at that exact same location. Going back and looking at the plaintext, we have T-H-E-B and T-H-E-B, and both are encrypted by V-I-G-V, right? So it just happens that we have characters in our plaintext that are repeated, which is likely to happen: something like "the" or "in" or "a" is definitely likely to repeat in text. And it happens that our key lines up the same way. It doesn't have to be the first character of our key, that doesn't matter; the fact is that the key repeats in the same way across T-H-E-B and T-H-E-B.
So what does this tell us about the period? That the key length should be 4, because there are 4 characters that happen to be encrypted the same? The sequence OPKW was repeated, not just one letter, so we know it was encrypted by the same part of the key, and then we can see how many letters are between the two occurrences. Yeah, so I would think that OPKW and OPKW were likely encrypted by the same key letters. And what's the distance here? 1, 2, 3, 4, 5, 6, 7, 8, 9. So the distance between those is 9. What does that tell me about my key? Its length is some factor of 9. It should be some factor of 9, which makes sense here because it's 3.
I believe that for a while the Vigenère cipher was considered a pretty secure cipher. Kasiski made this realization: repetitions in the ciphertext occur because the key is passing over the exact same characters in the message. Like we said, if they were offset at all, then "the" would not encrypt to the same thing, but it can be the case that they line up. So the distance here is 9, and we would think that the key length is some factor of 9: either 1, 3, or 9. We know it's not 1, because we did the distribution analysis and none of those keys worked. We could try 9, but that would be kind of a long key. That was really good noticing, by the way; I don't know that I
would notice something like this, but you would then see it. So what you want to do is try to find all repetitions in this text. We can find all the repetitions (MI, OO, OEQOOG, and so on), record the start and end positions, get the distance, and get the factors of each distance.
So how does this help you? Do you try all of these possible combinations? Start with the most common; two is very common. Start with the most common? So what's likely to happen, based on probability, that two characters next to each other just randomly come out the same? I was just going to say I would start with the longest one, because two is going to be a factor of every even number. I'm looking at the factors on the side that I would use to pick the length of my period, and I would start with the longest repetition, because the small factors are going to occur way more frequently just because they are factors of a lot of numbers. Right. So a first thing to think about is the length of the repetition. You want to start with the longest one because that's least likely to occur just due to random chance. Even if you're encrypting this message 100% randomly, you're likely to get some repeats just because of randomness. So you want to start there. All right, we'll stop here.
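The repetition search described above (find repeated chunks of ciphertext, measure the distance between their starting positions, and list the factors of each distance, longest repetition first) can be sketched in Python as follows. This is a minimal sketch under assumptions: the function names and the `min_len`/`max_len` parameters are invented for this example, and the ciphertext is the short "the boy has the ball" example with key VIG rather than the longer ciphertext from the slides.

```python
def repetitions(ciphertext, min_len=2, max_len=8):
    """Find repeated substrings, longest first.

    Returns tuples of (substring, first_start, later_start, distance),
    where distance is measured between the two starting positions.
    """
    found = []
    for length in range(max_len, min_len - 1, -1):
        seen = {}  # substring -> position of its first occurrence
        for start in range(len(ciphertext) - length + 1):
            chunk = ciphertext[start:start + length]
            if chunk in seen:
                found.append((chunk, seen[chunk], start, start - seen[chunk]))
            else:
                seen[chunk] = start
    return found


def factors(n):
    """All positive divisors of n; the key length should be among them."""
    return [d for d in range(1, n + 1) if n % d == 0]


if __name__ == "__main__":
    ct = "OPKWWECIYOPKWIRG"  # "THEBOYHASTHEBALL" encrypted with key VIG
    for chunk, first, later, distance in repetitions(ct, min_len=3):
        print(chunk, first, later, distance, factors(distance))
```

Listing the longest repetitions first reflects the point made above: short repeats, and small factors such as 2, show up by chance far more often, so the longest repeat is the most trustworthy hint about the period.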