 We've seen that hash functions can be used for authentication. We've seen several examples where we take some message and calculate the hash of the message. And the properties that we relied on for those authentication examples to work were this one-way property. That is, we can calculate the hash of the message easily. But given the hash value, it's hard to go back and get the original message. And the other property is no collisions. A collision is when you take a hash of two messages, two different messages, and you get two hash values. You get the same hash value. Two messages map to the same hash value. There's a collision. So we'd like a hash function where it's difficult to find collisions. And we saw some examples in the authentication that if we could find collisions, we would defeat or an attacker could defeat the authentication scheme. It wouldn't be effective authentication. We'll see the same applies for digital signatures. What is a digital signature? Well, we said yesterday what a signature is. Signature is to prove to anyone that a message originated or is somehow approved by a particular user. I sign a document, and I make that document available to you to look at and pass around. My signature on that document is an indicator that this document is approved by me. It came from me. So my signature means it's from me. So we want a digital version of a handwritten signature, something that we can have a file, some message in a file, and then the file across a network. And the person that has that file can confirm that this file, this message, came from one particular user. It's signed by someone and one person only. And that's what we got to yesterday, that if we tried to use symmetric key cryptography for a signature, it wouldn't work because with symmetric key cryptography, two people have the key. 
What we could arrive at was if this message is signed or is encrypted by one user, then there's no way for some third party, some other user, to confirm that this message came from A because it may have also come from B because both have the key. There's no one-to-one mapping from user to key. There's two people that have the same key. So symmetric key cryptography cannot be used for a digital signature. We cannot prove who it originally came from. And so we use public key cryptography, where remember we have two keys now, a private and a public key. They come together in a pair, they're related, and the nature of the public key algorithms is that if we encrypt with one of the keys in the pair, we can decrypt with the other. If I encrypt with a public key, I can decrypt with the corresponding private key. And the other way around, if I encrypt with a private key, I can decrypt with a corresponding public key. But if I use the wrong key, it won't work, or we'll recognize. So we can use this signature because we have such a key, the private key of a user, where there's only one person in the world who has that value. The assumption, therefore, to sign something, that user will use their private key. And then to verify, for someone else to verify that it was signed by that user, they use the signer's public key. Everyone has their public-private key pair. You have your own public-private key pair, okay? And you want to send him a message, and you want to sign the message that you send, so everyone is sure that it's from you, okay? You have your public key, you have your private key, you know everyone else's public key, you want to sign a message, what key are you going to use? You're going to use your private key, okay? Your private key. To sign a message, use your own private key, okay? Before, when we wanted to provide confidentiality to encrypt, we used the destination's public key. 
So if I wanted to send a secret message to you, I would take your public key to encrypt, such that only you could decrypt, because only you have your private key. But that's different from what we want to do now. What we want to do now is make sure it's signed, that it came from a particular person. To sign a message, use your private key. As a result, anyone can verify that it came from you, because anyone can obtain your public key. So the concept is shown on this slide. A user signs a message by encrypting with their own private key. So in this case, user A has their pair of keys, PUA and PRA, the public key and private key of A. We're not sending a private key. We're encrypting using a private key, okay? Look at this operation. I've got my message. I want to sign it. I encrypt my message using a public key cipher like RSA, using my private key as the key for that encryption. So we're not sending the key. We're encrypting using the private key, same as in symmetric key cryptography. We don't send the key, we just encrypt using it. It's an input parameter. So we can say that if we encrypt some message using the private key of user A, that produces a signature. Okay, so the thing that we attach to the message is the signature, denoted as S here. The user sends the signature and the message to the destination, not just S, but S and M. When someone receives this message, let's draw it. I think we can draw the process so it's clear. User A is sending a message to user B. User A has the message. They calculate the signature by encrypting using their own private key. And they send the signature and the message. So S in this case is not a secret value, it's a signature. And B receives it. B needs to verify the signature, and the verification steps are as on the slide: we decrypt the signature. If it was encrypted with the private key of A, we decrypt with the public key of A. 
And let's say we get some value as an output, X, and now we compare. If the value we get as output equals the message, then everything's okay. If not, don't trust it. The signature is the encrypted message, encrypted using the private key of the source. Both the message and the signature are sent. B verifies by decrypting the signature using the public key of the source and comparing to the message. Because if we're decrypting S with PUA, since S was created by encrypting M with PRA, X should be the same as M. We should get the original plaintext out as a result. So they should be the same unless something was modified along the way or we're using the wrong key. There's no confidentiality here. Anyone can see the message. Anyone can verify. The idea with a signature is that when I sign a document, anyone in the world who sees that document, if it's not confidential, can verify it. But verifying doesn't give them any advantage beyond knowing that it's from A, which is the intended purpose. So everyone has the public key of A, but there's only one person in the world that has the private key of A. Therefore, if it successfully decrypts with the public key of A, then it must have been encrypted with the private key of A. Therefore, it must have come from user A. So it's signed. What's missing? Well, what's this topic about? Hash functions. There are no hash functions in here. So this concept of a signature, in theory, doesn't rely on a hash function. It's about using the private key, using public key cryptography. But in practice, we use a hash function, and we'll see why. So for verification, the receiver decrypts using the public key of the sender, the signer, and compares the obtained value with the received message. If they match, trust the signature. That's the concept. In practice, it's slightly different. We take a hash of the message instead of encrypting the entire message. 
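To make the concept version concrete (sign by encrypting the whole message with the private key, verify by decrypting with the public key and comparing), here is a minimal sketch in Python. It assumes textbook RSA with tiny primes chosen only for illustration; the names `sign` and `verify` and the numbers are not from the lecture, and nothing here is secure.

```python
# Toy RSA key pair for user A. Tiny primes, illustration only, NOT secure.
p, q = 61, 53
n = p * q                  # modulus, shared by both keys
phi = (p - 1) * (q - 1)
e = 17                     # public exponent:  PU_A = (e, n)
d = pow(e, -1, phi)        # private exponent: PR_A = (d, n), needs Python 3.8+


def sign(m: int) -> int:
    """S = E(PR_A, M): 'encrypt' the message itself with A's private key."""
    return pow(m, d, n)


def verify(m: int, s: int) -> bool:
    """X = D(PU_A, S): decrypt with A's public key, accept if X equals M."""
    x = pow(s, e, n)
    return x == m


M = 1234                   # the message, as a number smaller than n
S = sign(M)                # A sends both M and S

print(verify(M, S))        # signature matches the message it was made for
print(verify(M + 1, S))    # a modified message does not match
```

Note that the message has to be smaller than the modulus here, which is one reason the concept version is awkward for real, large messages.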
The hash of the message ensures that there's some structure that we can recognize when we decrypt, and it makes the signature much smaller. If my message, again, is a five gigabyte file, using this approach, I need to encrypt that five gigabyte file using a private key. Encryption is slow using public key cryptography. But if I first take a hash of the message, remember, a hash of a large message produces a short, fixed-size value. So we have a small value, and we encrypt the small value. So it's much more efficient to use the hash of the message as opposed to the real message. We get the same result because of the properties of the hash function. Before, if the messages were different, the signature would not verify. Similarly, when we take the hash of the message, if the messages are different or something's changed, then again the signature would not verify. So the hash of the message is used here to get a smaller, but effectively unique, value that represents the message. So this is what's used in practice. And what we will usually refer to as a signature is: encrypt the hash of the message with the private key of the sender, the signer, send that, and the receiver decrypts the signature and compares the hash values. So we don't normally do it the way shown here, we normally use the hash value. So A calculates the hash of the message, and then the signature is created by encrypting that hash value using A's private key. On the slide, instead of using lowercase h, it's just H of M. And A sends again the signature and the message, and to verify, B decrypts the signature, again with the public key of A. If nothing's been changed, all right, we'll see, but we decrypt the signature with the public key of A. We get some value, let's say X. We receive the message, so we calculate the hash of the message. Let's say we get some value Y. These are the steps. And then we compare. If the values are the same, everything's okay. 
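The practical steps just described (sign the hash, then compare X with Y on the receiving side) can be sketched as follows. This is a rough illustration, not a real signature scheme: it assumes the same toy RSA numbers as a classroom example, uses SHA-256 from Python's `hashlib` as H, and reduces the hash modulo the tiny modulus so it fits, which a real system would never do. The function names are made up for the example.

```python
import hashlib

# Toy RSA key pair for user A. Tiny primes, illustration only, NOT secure.
p, q = 61, 53
n = p * q
e = 17                               # public key  PU_A = (e, n)
d = pow(e, -1, (p - 1) * (q - 1))    # private key PR_A = (d, n)


def H(message: bytes) -> int:
    """Hash the message to a number. The '% n' is an assumption made only
    so the value fits the toy modulus; it throws away most of the hash."""
    digest = hashlib.sha256(message).digest()
    return int.from_bytes(digest, "big") % n


def sign(message: bytes) -> int:
    """S = E(PR_A, H(M)): encrypt the hash, not the whole message."""
    return pow(H(message), d, n)


def verify(message: bytes, s: int) -> bool:
    x = pow(s, e, n)     # X = D(PU_A, S), should be the original hash
    y = H(message)       # Y = H(received message)
    return x == y        # trust the signature only if they match


M = b"Pay Bob 100 baht"
S = sign(M)              # A sends M together with S
print(verify(M, S))
```

The message can now be any length, since only the short hash value goes through the slow public key operation.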
Decrypt the signature using the public key of the signer. We should get the original hash value. Take the hash of the received message and we should get the original hash value again. So they should be the same. They won't be the same if something's changed. If someone modified the message, X will be the hash of the original message, Y will be the hash of the new message. And the hashes of two different messages should be different. That's again our desired property of the hash function. If someone tries to modify the signature, that is, they create a new message and try to create a new signature, how do they? They don't have the private key of A. So you cannot forge the signature unless you have the private key of A. And under our assumption, you can't find the private key of A. So modifying the message will be detected by the recipient. So that's how we provide verification, so long as our hash function has our desired properties. Any questions about signatures? That's quite easy, but very important because they're used everywhere, or not everywhere, but they're used in a lot of applications in the internet and in computers. With digital signatures, on the slides, I said there's the concept and the practice. The concept is encrypt the message with the private key of the user. The practice is we encrypt, using the private key of the user, the hash of the message, not the full message. It's much more efficient. So when I ask you in the quiz or in the exam to write the equation for signing a message, focus on the practical one, the one that's used most. The hash of the message, not the full message. Any questions on how to sign something? Some people have attempted the quiz. If you haven't done it, you'll see shortly when you do it that there are one or two questions about signatures in there. So if you've already attempted it, maybe go back and visit that question at least and make sure you understand. How do you create a signature? What's the notation? 
What are the steps for signing and verification? Give me an example algorithm I can use for H. Anyone? Give me a name, a real algorithm for H. I say it's a hash function. What's the name of a hash function? Something SHA, SHA by itself, just SHA, S-H-A. The Secure Hash Algorithm is the name, but SHA is one algorithm, and it has different variants. And SHA is one, what's another one? MD5 is another one that you'll see. So there are two very common hash functions, MD5 and SHA. We saw on Tuesday that MD5 is no longer considered cryptographically secure because it's easy to find, well, not easy, but it's possible to find collisions. Okay, if the hash function is MD5 or SHA, give me an example of E or D. What's the name of an algorithm you'd use for E or D? Don't be afraid to yell, anyone wanna guess? RSA, okay, that's the only one we've really studied. It's a public key algorithm. So here we're using a public key cipher, so E is encrypt, D is decrypt, but we're using a public key cryptographic algorithm. So the only one we've studied is RSA, but there are others. There are a few listed later, I think, the Digital Signature Standard and ElGamal and others. So RSA, again, you're experts on RSA, you could use it here as the encryption algorithm. Ah, here are the algorithms. RSA is one, but there are others. There's one called the Digital Signature Algorithm, which is part of the Digital Signature Standard. It uses different techniques than RSA. There's a variant of that which uses elliptic curve cryptography, which we haven't mentioned but which is important; we don't get time to cover it. ElGamal, and there are a few others. So there are different signature algorithms that can be used. They have different characteristics in terms of performance, and some are used for other things than just signatures. Similarly, there are different hash algorithms that can be used. SHA and MD5 are two that we've mentioned. 
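As a quick illustration of the hash algorithms named here, the sketch below uses Python's standard `hashlib` module to hash the same data with MD5, SHA-1 and SHA-256 and report the output length. The helper name `digest_bits` and the sample inputs are just for this example. It shows the fixed-size output: the hash length depends on the algorithm, not on the size of the message.

```python
import hashlib


def digest_bits(algorithm: str, data: bytes) -> int:
    """Return the length in bits of the hash value for the given algorithm."""
    return hashlib.new(algorithm, data).digest_size * 8


short_message = b"hello"
long_message = b"hello" * 200_000      # about one megabyte

for name in ("md5", "sha1", "sha256"):
    print(name,
          digest_bits(name, short_message),
          digest_bits(name, long_message),
          hashlib.new(name, short_message).hexdigest())
```

MD5 is included only because the lecture mentions it; as said above, it should not be relied on where collision resistance matters.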
MD5 is no longer considered secure because of the ability to find collisions. And the properties that we need for these hash algorithms, we've said, are the one-way property and no collisions. But now we're gonna look into those two properties, and in fact we'll list three, because we'll break one of them up into two parts. And we'll call them pre-image resistant, second pre-image resistant, and collision resistant. So we'll look now back at those properties, at what is required of our hash functions so that digital signatures work. And I will give a demo later. Terminology first, so we can then look at the properties in a bit more depth. The terminology we'll use with a hash function is that we have some hash value h, lowercase h. It's a bit confusing, but the lowercase h is the hash value. The uppercase H is the hash function. And if we apply the hash function to some value x, the input, and get some hash value h as output, the terminology is that we say that x is the pre-image of h. We hash x and get h. We can say x is the pre-image of h, of the hash value. x is like our message, it's just some notation. We know that the hash function is a many-to-one mapping. That is, there are many possible input messages that will all map to the same hash value, to the one hash value. That is, a hash value has multiple pre-images. We've mentioned this before. If we have a fixed size hash value, but an arbitrary size message, then we'll have more messages than hash values. Therefore, since we map messages to hash values, some of the messages must map to the same hash value. So it's a many-to-one mapping. A collision occurs if the two inputs, x and y, are different and the hash values are the same. That's the concept we know already. A collision is when we hash two different messages and get the same hash value as output. We don't want collisions. 
And we've seen examples why, when we looked at authentication using hash functions, and similarly with digital signatures. If we can find collisions, the attacker can defeat the authentication. So we don't want collisions, but we know that they're possible in theory. How many pre-images are there for a given hash value? Well, it depends upon how many hash values and how many input messages. So if we take a hash function that takes a b-bit input, the message is b bits in length, meaning there are two to the power of b possible messages. And if this hash function produces an n-bit output, so the hash value is n bits in length, where b is larger than n, then it means, on average, the number of pre-images that map to the same hash value is two to the power of b minus n. Let's give an example of that, and hopefully it will be clear where that comes from. Let's keep it simple: we have a hash function that takes some message M and produces a hash value H. M, we're now calling the pre-image of H. For a simple example, let's say the hash value is four bits, and the message is a fixed length. Normally it can be any size, but let's say it's a fixed length of 10 bits. So we've got some hash function that takes a 10-bit input message and produces a four-bit hash value. And hash functions should produce random-looking values. That is, if you hash two messages, which may be similar, the output hash value effectively should be a random number, or a pseudo-random value. That's the property of our hash functions. So this hash function maps messages to hash values. How many possible messages? We need a bit more participation, and I think some of you cannot hear me at the back, so some of you are gonna move forward, okay? Let's go to the front, and let's go to the front. And another one, come on, move forward. Okay, it'd be much easier to hear me if you're in the front two or three rows. Don't worry, other people will join you quite quickly. It's not so bad, yeah? 
You said you wanna join him, okay? Up you go. Then you can answer my questions. Just to find a seed in the front two rows and then you can help us with answering some questions. Don't worry, I just use a random number generator to pick you. Yeah, both of you come, good. I won't wake him up, don't worry. Okay, how many messages do we have? Have a hash function, keep coming. Front two rows, don't annoy these two, they're studying hard, yeah, okay? The front two rows, two, one, two, good. Just so we can have people answer some questions. I've got a hash function. We've got a hash function that takes messages as input, produces hash values as output. The length of the messages, they're all 10 bits long. Keep it simple, they're all 10 bits. The hash values are all four bits long. How many possible messages are there? Every message is 10 bits in length. How many do you think there are? Ask your partner, that's fine. Two to the power of 10, binary values, okay? We've got 10 bits, so maybe I should ask you to write them all down, okay? 10 zeros, nine zeros and a one. And then just keep writing them all down and you'll find that there's two to the power of 10 possible messages of input, okay? About how many is that? About 1,000, 1,024. Two to the power of 10 input messages. Let's list them, well, not quite, but okay, let's say there was M1, M2, M3, 563. The comic won't tell you the answer, don't worry. M1,024, okay? So you could write them down if I was mean. I would ask you, okay, all possible 10 bit values. So our hash function takes the input and produces a random hash value as output. How many possible hash values are there? How many possible hash values? Two to the power of four, okay? Our hash value is four bits in length. And that's the normal structure of hash functions that we take any, usually we allow any size input language, any size input message. In this case, I've kept it simple to say that input message must be 10 bits. 
But in general, it may be 10, 11, 1,000, a million bits. So the message is almost always larger than the hash value. So there are many possible messages. The hash value needs to be short and it's a fixed length, in our case four bits. So there are 16 possible hash values as output. 16, in this case. The hash function takes a message as input and produces a hash value as output. It's a mapping, that's what a function is. And it's a many-to-one mapping because if we need to map all of these messages into this smaller set of hash values, some of them must map to the same hash value. There's no way to avoid it, because if we've only got 16 possible outputs but 1,024 possible inputs, some of them are going to come to the same value. Now, what is the mapping? Well, the actual hash algorithm determines how each of these maps to one of those. Let's just say it's random, okay? Let's forget about the details of our algorithm. So randomly, these messages map to some hash values. So let's say M1 maybe became H4. So that would be the hash function. If we hash M1, we get H4 as output. And M2 maybe becomes H8. And M3, H2, and so on. How many messages map to H4? How many messages map to H4 on average? How would you calculate that? 64. Why 64? 1,024 divided by 16, okay? We've got 1,024 here, 16 possible outputs. On average, 64 M's must map to the same hash value. So there'll be 64 coming here, 64 to H2, 64 to H3, such that all 1,024 messages are covered. And of course, what that means is that many messages map to the same hash value. We'll not draw all 64. In this case, with 10 bits in and four bits out, we have 2 to the power of 10 in, divided by the number out, 2 to the power of 4, which is 2 to the power of 6. That's the average number of messages that map to one hash value, our 64. On average, there are 64 messages that collide on a particular hash value. So we always have collisions. We can't avoid them. 
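The 10-bit-in, 4-bit-out example can be checked by brute force. The sketch below is an assumption-laden stand-in for the hash function on the board: it builds a toy 4-bit hash by keeping only the top four bits of SHA-256 (the name `toy_hash` is made up here), hashes all 1,024 possible messages, and counts how many pre-images each hash value gets.

```python
import hashlib
from collections import Counter

B_BITS = 10   # message length in bits
N_BITS = 4    # hash value length in bits


def toy_hash(m: int) -> int:
    """Toy 4-bit hash: top 4 bits of SHA-256 of the 10-bit message.
    This truncation is only a classroom stand-in, not a real design."""
    digest = hashlib.sha256(m.to_bytes(2, "big")).digest()
    return digest[0] >> 4


# Hash every possible 10-bit message and count pre-images per hash value.
counts = Counter(toy_hash(m) for m in range(2 ** B_BITS))

average = 2 ** B_BITS / 2 ** N_BITS     # 2^(b-n) = 2^6
print("messages:", sum(counts.values()))
print("average pre-images per hash value:", average)
for h in sorted(counts):
    print(f"h = {h:2d}: {counts[h]} pre-images")
```

The individual counts will hover around the average rather than all being exactly 64, which is what a random-looking mapping would be expected to do.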
But we say the properties of our authentication rely on it being hard to find the collisions. That's the important part. So back to our slides. That's where this 2 to the power of b minus n comes from. If we have b bits in, n bits out, an n-bit hash value, then one hash value has, on average, 2 to the power of b minus n messages that map to it. And we call those messages pre-images: 2 to the power of b minus n pre-images. What we need is a hash function such that it's hard for someone to find those pre-images. Collisions are guaranteed, but it should be hard to find messages that produce collisions. Hard in terms of it taking too much effort for the attacker. So now let's list more precisely the requirements of our cryptographic hash function. Some are easy. Some we'll go through in more detail. All right, an easy one: variable size input. In our example, we had the input fixed at 10 bits. But in general, I might want to hash a Word document, which is one megabyte, then sign another document which is 500 kilobytes, and then hash a five gigabyte DVD: variable size inputs. We need to be able to accept any size input. Fixed size output. The hash function produces a fixed and small hash value. It should be easy to compute. That is, the hash of some message shouldn't take 10 years to calculate. It should be fast. Those are the practical requirements. The next three are the security requirements that we'll go through in detail. The last one is pseudo-randomness, okay. The output of the hash function should be a pseudo-random number. For example, going back to our drawing, we won't draw it, but let's say we had another case where, instead of the mapping in red, M2 mapped to H1, M3 to H1, M4 to H1, M5 to H1. Messages which are similar, okay, M2, M3, and M4 would differ by just a few bits. If they were similar messages and all mapped to the same hash value, then that's not pseudo-random. So that's normally not the case. 
They should map to a random set of hash values. So that's the randomness requirement. Otherwise, if we had this, it would be easier to work backwards. It would be easier to find collisions as well. So now we need to go through these three properties: pre-image resistant, second pre-image resistant, and collision resistant. They sound hard, but you know the basics already because they have other names. Sometimes pre-image resistant is called the one-way property. We've mentioned this before. Given a message as input, it should be easy to get the hash value, but given the hash value, it should be hard to get the original message. It's a one-way function. Going one way is easy. Going backwards is practically impossible. That's all this one means. Pre-image resistant means that for a given hash value, it's too hard to find a message such that the hash of that message gives that hash value. So going backwards is not feasible. We will not go through that one in much more detail. The next two are related to collisions. But it turns out that the middle one is also related to the one-way property, in terms of its difficulty as well. Second pre-image resistant: given some message x, it should be computationally infeasible, or practically impossible, to find some other message y, which is not x, so it's different from x, where the hash of y equals the hash of x. So this is our collision property. I give you a message x. You shouldn't be able to go and find another message y that produces the same hash value. You shouldn't be able to find collisions. We just said before that there are collisions, but when we say computationally infeasible, it means that it takes too long to find those collisions. So it's also called weak collision resistant. So resistance against collisions is the property. But second pre-image resistant is the name we use here. Now the next one is similar, and this is where most people get lost. If they're not lost already. Anyone lost already? 
One-way property is one of the easiest ones. You can't go backwards. Collision-resistant, we said we can't find collisions, but now we're breaking that into two parts. The first one says, given an value x, you can't find another message which produces the same hash value. Collision-resistant or strong collision-resistant is you can't find any pair of messages where those pair of messages produce the same hash value. What's the difference between these two? That's the hard thing for people to understand. The first one, there's a message given. So you're an attacker, you know x. Your challenge is to find another message that produces the same hash value. If you can do that, then you break this property. No longer exists, or it doesn't exist for this hash function. Whereas the second one is you're an attacker. You get to choose any two messages you like, x and y, whatever you like. And you get to choose them, and your aim is to choose them such that their hash values are the same. You're an attacker, which one are you going to try? Which one's easier? Collision-resistant, why? Think about which of these, so focusing on these two, which one from an attacker's perspective is easier? That is more likely to find the collision. Both cases trying to find a collision. In the first case, you're given some message, and you need to find another message that produces a collision. The second case, you can choose any pair of messages. Which one is easier for you, the attacker? Number one or two, louder. One or two, hands up higher for two, and again, right, be more precise. Second pre-image-resistant or collision-resistant. Second pre-image-resistant, hands up. Okay. Collision-resistant, hands up. Higher. It's this one, collision-resistant. It's easier for the attacker. No, in the second one, your aim is to find a pair, okay? In the first one, I give you a message, all right? I'm the normal user, you're the attacker. You have this message X. 
Your challenge, go and find another message that produces the same hash value as X. Whereas in the collision-resistant property, it says that, okay, you're an attacker. Your challenge is to go and find two messages, any two messages, X and Y, such that they produce the same hash value. There's more freedom for you in this case. And therefore, it's easier. That's a hard thing to get your head around sometimes. So let's consider another way to view it. There's a lecture, where there's an IT class down the corridor. There's about 40 people in there. Okay, let's assume you don't know them, right? Our IT colleagues. I've given you, there's two challenges you can choose. You can go down there and I say, either, find someone that has the same birthday as you. Now, it's birthday I mean the same day of the year, not the year, okay, all right, maybe. So if you're born on the 1st of January in 1980, then think of the birthday as the 1st of January. Don't worry about the year. So the first challenge is you go down to the lecture room and of those 40 people, find someone who has the same birthday as you do. Or go down to the lecture room, find two people in that lecture room that have the same birthday. Which one's easier for you? A or B? B. It's the same as our collision resistant. In this case, we're given, your challenge is, here's a birthday, here's one of the 365 days in the year. Go and find someone who has the same birthday. Assuming birthdays are random, yeah? Which is easier for you to go and find someone, yes. All right, so what is easier then? More chance of being successful. Okay, good. So I should be more precise here. Easy, I mean, we'll see, we'll see. So more, the way to think of it is what's your chance of being successful? There's no guarantee that you'll find someone with the same birthday as you, okay? 
I think if there's 40 students in that lecture room and you're born on the 1st of January, well, there's no guarantee that someone else is also born on the 1st of January. So there's no guarantee, but we can think about the probability that there is someone who's born on the 1st of January. The second one, let's see why, or let's at least go through some of the numbers. We will not go into the depth, but, and then we'll come back and explain how that relates to our collisions. Because it's the same, same problem. We have, how many birthdays are there? How many birthdays are there? 365, let's say. No leap years, okay? So forget about years, just think there are 365 days in the year. So the possible set of birthdays is 365. The first one is, let's say you're born on the 1st of January, okay? So the challenge is, find someone in that room that also has a birthday on the 1st of January. Let's look at the first one. The same birthday as you, for example, a group of n equals 40 students. Possible birthdays, m equals 365. Don't have to write this down, I'll put this on the website. It's just the concept that I want to make clear about eventually the collisions and which ones are easier and harder. You don't need to calculate any of this. So which one is easier? Well, let's frame that as, what is the probability that someone has the same birthday as you, okay? Because I'd like to choose the task which has more chance of being successful, the highest probability. And the way to calculate that is, you think, okay, the probability that someone has the same birthday as you is one minus the probability no one has the same birthday as you, okay? No, the opposite. So in probabilities, one subtract that one, okay? So if we look at the probability that no one has the same birthday as you, you can go through the steps. For example, if you were born on the 1st of January, the chance that one other person not having the same birthday as you is 364 out of 365. 
That is, you're the 1st of January, then there's another person. For the chance that their birthday is different, there are 365 days to choose from. It's random, their birthday. And 364 of those would not be the same as mine, okay? So the chance that another person has a different one than me is 364 out of 365. And following from that, the chance that two people have a different one, and three people, and so on, is 364 over 365 multiplied by itself, depending upon the number of people, n people. So of the 40 people, the probability no one has the same birthday as you is 364 out of 365 to the power of n, or to the power of 40 in our example. Therefore, the probability that someone has the same birthday as you is one minus this. You don't have to calculate this, you can check, just accept that one. Plug in n, you get 10%. That is, if you went down to that lecture room where there were 40 people, there's a 10% chance that you'll find someone that has the same birthday as you. Assuming you don't know them and assuming the birthdays are random, a fair assumption. 10% chance of being successful. So what if you took the other task, task B? It's a bit more complex, the analysis. Well, you see the answer down the bottom, it's about a 90% chance. Much, much more chance of finding two people with the same birthday. Okay, you can have a look through, but you end up with an equation that the probability of any two people having the same birthday is one minus the probability that no two people have the same birthday. And you can look at it, say, with three people and then extend it to four or five, and it becomes this slightly more complex equation. One minus 365 factorial, divided by 365 to the power of N times 365 minus N factorial (this hasn't come out very well on the slide). Plug in N is 40, 89%. That is, the chance of finding two people in that room that have birthdays on the same date is about 90%. That's why task B would be easier. You've got more chance of finding someone that meets the condition. 
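The two probabilities quoted above can be reproduced with a few lines of Python. This is a small sketch under the same assumptions as the lecture (365 equally likely birthdays, no leap years); the function names are invented for the example. Task A uses 1 - (364/365)^n, and task B uses 1 - 365! / (365^n (365 - n)!), computed as a running product to avoid huge factorials.

```python
def p_same_as_you(n: int, days: int = 365) -> float:
    """Task A: probability that at least one of n people shares YOUR birthday.
    1 - (364/365)^n"""
    return 1 - ((days - 1) / days) ** n


def p_any_pair(n: int, days: int = 365) -> float:
    """Task B: probability that at least two of n people share a birthday.
    1 - days! / (days^n * (days - n)!), evaluated term by term."""
    if n > days:
        return 1.0
    p_all_different = 1.0
    for i in range(n):
        p_all_different *= (days - i) / days
    return 1 - p_all_different


for n in (20, 23, 40, 80):
    print(f"n = {n:2d}: task A = {p_same_as_you(n):.3f}, "
          f"task B = {p_any_pair(n):.3f}")
```

With 40 people, task A comes out at roughly 10% and task B at roughly 89%, matching the figures on the slide.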
Why do we do that? Well, we'll see. Here's those values, but for different group sizes. So N, if there were 40 people, what if there were fewer or more? So the blue one is the probability of finding that person, oh, the red one first, is the probability of finding that person that has the same birthday as you. So if there were 40 people in the group to search through, chance is about 0.1 or 10%. If there were 80 people in the group, it's approaching 20%. The blue one is what's the chance of finding two people that have the same birthday? If there are 40 people in the group, you look and you get up to about 90% chance. That's what we just calculated. And it quickly approaches almost 100% chance. So we see it's much easier to find two people with the same birthday as opposed to finding one person with the same birthday as you. Now, so this one is easier, higher probability of meeting that condition, and 50% probability takes a little over 20 people, roughly here (23, to be exact). That is, say you have to have a bet, gamble on the fact that, okay, if you go into that room, I'll give you a million baht if you can find two people that have the same birthday. And if you can't, you give me a million baht. What do you wanna know? Well, you want your probability to be above 50%, okay? If your probability is above 50%, that is, if the number of people is 23 or more, then you've got a better than even chance of winning in that case. If it's below 50%, not good. So the 50% probability is a little over 20 in this case. So with that many people, there's a good chance that you'll find it. So our answer to which is easier: B. There's an 89% chance, and in the same conditions A is about 10%. Why do we care? It's the same as our collisions on hashes. So we said, go into the next lecture room with about 40 people and find, well, think of this.
If I give you a set of messages and some hash function: find someone that has the same birthday as you, well, for hash functions, find a message that has the same hash as some given value. So the first task, find someone who has the same birthday as a given value, that is your birthday. Same concept as find a message that has the same hash as a given value. Whereas the other one is, you get to choose any two people. It doesn't matter what their birthdays are as long as they have the same, a collision of the birthdays. Or, you get to choose any two messages. It doesn't matter what they are as long as they have the same hash value. So it's the same concept. So which is easier? Finding two messages that have the same hash value. It's good down the front, isn't it? Yes, okay, someone will join you then. Join them down the front. Maybe we should do this every lecture. Randomly choose a few people to sit down the front so I have a bit of company at the front. Yeah, you can go if you like, but you sit on that side. Okay, all right. So it's the same concept. Collisions of birthdays, collisions of hash values. If you find two people with the same birthday, there's a collision on those birthdays. So from an attacker's perspective, of these two, which one's easier? Which one has more chance of being successful? Well, the second one, finding two messages, any two messages that have the same hash value. It's easier for the attacker, more likely to be successful. And the way that we often consider it is, okay, how many messages are needed to give a 50% probability of finding the same hash? So now the attacker, they need to look through, they take any pair of messages and they want to find a pair that produces the same hash value. Well, using the equation that we developed, well we didn't develop, but is here, this was for days in the year, 365 days, and N is the number of people in the group.
And it tells us the probability of finding a collision. We think about it from the same perspective, but if we want the probability to be 50%, we know, in the other case, we knew 365, in this case, we know, well we have the number of messages M, we know, sorry, the number of hash values M, we want to find how many messages, until we get a collision or a high chance of a collision. So how many messages do we need to search through until we can find a collision? And it turns out, this is the equation from before, where we have N messages, like we had a group of 40 people, we have N messages to go through. In the previous example, we had 40 people to go through to find the match. The total number of birthdays was 365. In this case, the total number of hash values, we denote as M. If our hash length is B bits, then there are two to the power of B hash values. Let's say two to the power of B is M. So the same equation gives us the probability with N messages equals this. And approximately, if the probability is 50%, then N is about the square root of M. That is, the number of messages you need to search through to give 50% chance of getting a collision is the square root of M, where M is the number of hash values. The square root of two to the power of B, the square root of two to the power of B is two to the power of B, all to the power of a half, which is two to the power of B on two. Sorry if this doesn't come out very well, but that's, the final one is two to the power of B on two. What that says is if you have a B bit hash value to give a 50% chance of getting a collision, you need to search through N messages. So you need to try N different messages until you get a 50% chance of collision, where N is two to the power of B on two. For example, if we had a 100 bit hash value, if B was 100, we'd need to search through two to the power of 50 possible values. And this is an indicator of how hard it is to break this property. I'll put those slides on the website again. 
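To see how good the "N is about the square root of M" rule is, here is a small Python sketch, not from the lecture. The function name `messages_for_half` is hypothetical. It finds the smallest N that gives at least a 50% chance of a collision among M equally likely values, and compares it with the square root of M. The slightly tighter constant sqrt(2 ln 2), about 1.18, is a standard refinement of the same approximation and is an addition here, not something stated in the lecture.

```python
import math


def messages_for_half(m):
    """Smallest n such that n random values out of m collide with probability >= 0.5."""
    p_no_collision = 1.0
    n = 0
    while p_no_collision > 0.5:
        # Adding the (n+1)-th value: it must miss the n values already seen.
        p_no_collision *= (m - n) / m
        n += 1
    return n


if __name__ == "__main__":
    for m in (365, 2 ** 16, 2 ** 20):
        exact = messages_for_half(m)
        rough = math.sqrt(m)                          # the lecture's 2^(b/2) rule
        tighter = math.sqrt(2 * math.log(2) * m)      # about 1.18 * sqrt(m)
        print(m, exact, round(rough, 1), round(tighter, 1))
```

For M = 365 the exact answer is 23 people, while the square root of 365 is about 19, which is why the plot's 50% point sits at a little over 20. The rule gives the right order of magnitude, and that is all that matters when the exponent is something like 50 or 64.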
The idea of which one's harder is the key thing that you need to come away with. Which one's easier? The easier one from the attacker's perspective is finding any pair of messages. Much higher chance of doing that than finding a message that matches the hash value of some given message Y. And that two to the power of B over two, I think, is given here. The amount of effort to defeat the property, that is, you're an attacker. Your challenge is to find a collision. If you find a collision, you can defeat the authentication or the digital signature schemes that use hash functions. So your challenge is to find a collision. So then the question is, how much effort does it take to find a collision? It depends upon the length of the hash value. Take the pre-image and second pre-image properties, and an attack upon those properties, that is, trying to defeat the one-way property or the property that it's hard to find some message that produces the same hash as another given message. The amount of effort to defeat those properties is proportional to two to the power of M, where M here is the length of the hash in bits. A hundred-bit hash takes you two to the power of 100 operations to defeat the property. Whereas that third property, find any two messages, the amount of effort for the attacker to defeat that is two to the power of M on two. That's the two to the power of B on two from before. So if M is 100 bits, the amount of effort the attacker takes to defeat this property is two to the power of 50 operations. Is it possible? Two to the power of 50 is possible, depending upon the hardware you have. Going back to brute force attacks, depending on how long each operation takes, but two to the power of 100 or two to the power of 128, we saw, took centuries to do all the calculations, even with ultra-fast supercomputers. But two to the power of 50 would take much less time.
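To put rough numbers on "possible" versus "centuries", here is a back-of-the-envelope Python sketch. The rate of 10^12 hash operations per second is purely an assumption for illustration, not a figure from the lecture, and `brute_force_seconds` is a made-up helper name.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def brute_force_seconds(exponent, ops_per_second=1e12):
    """Time to perform 2**exponent operations at an ASSUMED rate of ops_per_second."""
    return 2 ** exponent / ops_per_second


if __name__ == "__main__":
    for hash_bits in (100, 128):
        collision = brute_force_seconds(hash_bits // 2)   # about 2^(bits/2) operations
        preimage = brute_force_seconds(hash_bits)          # about 2^bits operations
        print(f"{hash_bits}-bit hash: collision ~{collision:.3g} s, "
              f"pre-image ~{preimage / SECONDS_PER_YEAR:.3g} years")
```

At that assumed rate, 2^50 operations take under 20 minutes, while 2^100 operations take on the order of 10^10 years. The exact rate barely matters; halving the exponent is what changes the picture.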
So for an attacker, attacking the property of collision resistance is easier than attacking the other properties. This is due to what's called the birthday problem or birthday paradox, this concept of finding two people with the same birthday. So MD5, for example, is a hash function that uses a 128-bit hash value. Ignore this one at the moment. To defeat it from the perspective of a pre-image or second pre-image attack, you need two to the power of 128 operations, a brute force attack. Whereas to defeat the collision resistance property, you need two to the power of 64 operations. Much, much easier. It turns out there are other known attacks against MD5, which bring it down to 16 times less, two to the power of 60 operations. SHA, we'll see, uses longer hash values. To finish off, not all hash functions have all of those properties. So we'll list those three properties. Pre-image resistant, the one-way property. Second pre-image resistant: given a message, it's hard to find another message with the same hash value. And collision resistant: it's hard to find any two messages with the same hash value. So there are the three properties. Not all functions need those three properties. Depends on where you use hash functions. If we use hash functions for digital signatures, we generally need all three properties. There are some cases where we don't need the third one, but generally we do. But if we use hash functions for some other purpose, sometimes we don't need to meet all properties. We will not cover intrusion detection, but we will cover passwords a little bit and see hash functions used there. MAC functions have the same requirements as the hash function in a digital signature. So some functions have the properties, some don't, and whether they're strong enough depends on where they're used. Okay, returning to these three properties. We have two hash functions. One function has the first two properties, but not the third. So hash function A is pre-image resistant.
It is second pre-image resistant. Hash function B has all three properties. Pre-image resistant, second pre-image resistant, collision resistant. Which hash function is better? A or B? Again, hash function A has the properties pre-image resistant and second pre-image resistant, but not collision resistant. It doesn't have this property. Hash function B has all three properties. A and B, which one's better? Which one's stronger? B, B. Having this property means an attacker cannot find a pair. So the presence of these properties in hash functions, if you have them all, the hash function is often called a strong hash function. If you only have these two, it's usually called a weak hash function. It gets confusing because we've got weak and strong collision resistance. And again, you're an attacker. Three properties. Hash function B has all three properties. You're an attacker. Which one do you try first? Which one do you try to defeat? Put on your malicious hats. You try and defeat the easiest one, and the easiest one to defeat from the attacker's perspective is the collision resistant property. The strong collision resistant property. Easier for the attacker to defeat because it's easier to find a pair that produce the same hash value than it is to find some other message that produces the same hash value as X. That's confusing for many people to get their head around, especially these two, and to compare them. So, think about why one's easier than the other. Try to remember the properties, the difference between them. And if I ask in the exam, define this property, second pre-image resistant or collision resistant, or given this property, what's a possible attack, you should be able to give some discussion. To finish, let's just show one example with OpenSSL of signing messages and see how long it takes. So, we'll summarize this topic and just see what we've missed next week. I think we got to the end, very close.
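One way to get your head around why the collision property is the easier target is to try both attacks on a deliberately tiny hash. The Python sketch below is not from the lecture: `tiny_hash` is a made-up toy that keeps only the first 16 bits of SHA-256, and the candidate messages are just numbered strings. With 16 bits you would expect a second pre-image after something like 2^16 tries and a collision after something like 2^8, although any single run can differ.

```python
import hashlib


def tiny_hash(message: bytes, bits: int = 16) -> int:
    """Toy hash: the first `bits` bits of SHA-256. Deliberately weak, for illustration only."""
    digest = hashlib.sha256(message).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def find_second_preimage(target: bytes, bits: int = 16):
    """Try numbered messages until one other than `target` has the same toy hash."""
    wanted = tiny_hash(target, bits)
    tries = 0
    while True:
        candidate = b"msg-%d" % tries
        tries += 1
        if candidate != target and tiny_hash(candidate, bits) == wanted:
            return candidate, tries


def find_collision(bits: int = 16):
    """Try numbered messages until ANY two of them share a toy hash value."""
    seen = {}  # toy hash value -> first message that produced it
    tries = 0
    while True:
        candidate = b"msg-%d" % tries
        tries += 1
        h = tiny_hash(candidate, bits)
        if h in seen:
            return seen[h], candidate, tries
        seen[h] = candidate


if __name__ == "__main__":
    other, tries_a = find_second_preimage(b"pay Alice 10 baht")
    m1, m2, tries_b = find_collision()
    print("second pre-image:", other, "after", tries_a, "tries")
    print("collision:", m1, m2, "after", tries_b, "tries")
```

The collision search keeps every hash it has seen, so each new message gets compared against all earlier ones. That is the "any two people in the room" advantage from the birthday problem.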
MD5 is one hash function, no longer recommended, but it's still widely used. SHA is another one. It's got different variants. SHA1, SHA224, 256, depending upon the hash length; SHA512 produces a 512-bit hash value output. So, different variants. And 256 and beyond are considered secure. And there are other hash functions. Let's run OpenSSL just to see the duration of signatures versus verification. So, RSA mainly is used for signing, not used for encrypting data. So, when we do a speed test with OpenSSL using the RSA algorithm, what it does is goes away and, in this case, it uses different length keys. RSA, 512 bits, 1024 bits, maybe 2048. And it does private key RSA operations and public key RSA operations. So, encrypting with the private key is signing. Encrypting with the private key is the process used for signing. Using the public key is the process for verifying. You sign a message, send it to someone, they verify by decrypting with the public key. So, this may take a bit too long, but it does each for 10 seconds and sees how many messages it can sign. Then how many messages can it verify, to give the typical speed of the algorithm on this computer? All right, it's going to do a 4,096-bit key. So, longer key length takes longer time. You'll see the summary results at the end that will hopefully show us. Okay, these last four lines. Using RSA with four different key lengths, these are the lengths of N in RSA. The average time to sign a message and the average time to verify a message. Signing uses the private key, verifying uses the public key. So, with a typically recommended 2048-bit key length, to sign one message takes 1.6 milliseconds. So, you can do about 600 messages per second. To verify a message takes about 50 microseconds. So, you can do about 20,000 verifications per second. If you move up to 4,096 bits, stronger in terms of security, longer key, but the performance, we see, is much different.
Here it takes 11 milliseconds to sign and 185 microseconds to verify. If you compare that to DES, AES and others, RSA is much, much slower than symmetric encryption. And therefore it's only used for signing and verifying short hash values, not full messages. We'll stop there for today. By Tuesday, you'll have done the quiz, the online quiz. It has a few questions about digital signatures, hash functions, I think even properties of hash functions and MAC functions. So, do the online quiz before the lecture on Tuesday and we'll summarize this topic and we'll move on to our next topic.
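For reference, the per-operation times read out from the OpenSSL speed test can be turned into operations per second with a few lines of Python. This is only arithmetic on the figures quoted in the lecture for one particular computer; the dictionary layout and the names `QUOTED_TIMES` and `summarize` are made up for illustration.

```python
# Average time for one RSA operation, in seconds, as quoted in the lecture.
# These are measurements from the lecturer's machine, not general benchmarks.
QUOTED_TIMES = {
    2048: {"sign": 1.6e-3, "verify": 50e-6},
    4096: {"sign": 11e-3, "verify": 185e-6},
}


def summarize(times=QUOTED_TIMES):
    """Convert seconds per operation into operations per second, per key length."""
    rows = {}
    for key_bits, t in times.items():
        rows[key_bits] = {
            "signs_per_s": 1.0 / t["sign"],
            "verifies_per_s": 1.0 / t["verify"],
            "sign_to_verify_ratio": t["sign"] / t["verify"],
        }
    return rows


if __name__ == "__main__":
    for key_bits, row in summarize().items():
        print(key_bits, {name: round(value, 1) for name, value in row.items()})
```

With these figures, signing is about 625 per second at 2048 bits and about 91 per second at 4096 bits, and verifying is tens of times faster than signing at both key lengths.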