So, cryptographic hash functions. First, what do we mean by a hash function, and then we'll see how they can be used for cryptography. We'll look at general properties of hash functions, and we'll see how hash functions can be used for authentication as an alternative to MACs; we could use a MAC or a hash function. Then we'll see a very important special case, digital signatures, which combine hash functions with public key cryptography. And at the end we'll look at the specific requirements and properties of hash functions. As with MAC functions, the previous topic, we will not look at the algorithms in detail. We'll mention the names of some. Anyone know the name of a hash function? SHA, MD5. You may have seen MD5 come up when you download a file, some application, and the website lists the MD5 hash of that file. It's there so the receiver can check whether anything's been modified. That is, for the file that you download and save on your computer, you can calculate the hash and compare it to the published hash value to confirm that the file you have is correct. So we'll see how that works. But we will not see how MD5 and SHA work in detail. We'll look at general properties of hash functions. So a hash function, what do we expect? The notation we'll use, and again it can be a bit confusing, is uppercase H to indicate the hash function. The hash function takes a variable-length input M, some block of data, some message, and it produces a fixed-size output called the hash value, and that hash value we will unfortunately often write as lowercase h. So uppercase H is the function, lowercase h is the output value. We apply the hash function to produce a hash value as output. The input can be any length; the output is a defined length, usually small.
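As a small illustration of the fixed-size output and of the download check described above, here is a minimal sketch using Python's standard `hashlib`. SHA-256 is chosen here only as a convenient member of the SHA family; the helper names `sha256_hex` and `matches_published` and the sample data are assumptions made for this example, not something from the lecture.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hash value of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()

def matches_published(data: bytes, published_hex: str) -> bool:
    """Recompute the hash of downloaded data and compare it with the published value."""
    return sha256_hex(data) == published_hex.strip().lower()

# Inputs of very different lengths all give a 256-bit (64 hex character) output.
for message in (b"", b"short message", b"x" * 1_000_000):
    print(len(message), "bytes ->", sha256_hex(message))

# Hypothetical download check: the "published" value is computed here only for illustration.
downloaded = b"contents of some downloaded file"
published = sha256_hex(downloaded)
print(matches_published(downloaded, published))         # True
print(matches_published(downloaded + b"!", published))  # False
```

Whatever the input length, the hash value has the same length, and changing the file in any way makes the comparison with the published value fail.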
And the property of that function is that if we apply it to many different inputs, we should produce evenly distributed and random-looking outputs. If I take the contents of a file as input, then the hash value that comes out should look like a random number. I take a very similar file as input, maybe it differs by just one byte. The hash value that comes out should be another random-looking number with no apparent connection to the previous one. So two similar inputs should produce two random-looking outputs, and because they look random they will be evenly distributed. But it is a function, so it's not producing a random value; it's applying some algorithm to produce that hash value. When we use hash functions for cryptographic purposes, we require them to have some properties. And there are some similarities to what we talked about for MACs. We want a hash function to be such that it's very hard, we say computationally infeasible, meaning it takes too much time for an attacker, to find a message M if you only know the hash value H. That is, I take a message, I calculate the hash of the message, I get some value as output. Now I give you that value, the hash value. It should be hard for you to find the original message. And this is known as the one-way property. It should be easy to calculate the hash of a message, but given the hash value, it should be hard to calculate the original message. The function goes one way easily, but the reverse, the inverse of that function, is hard to calculate. We will require that property in a number of cases when we use hash functions in cryptography. And another property: it should be hard for someone to find two different messages, M1 and M2, that produce the same hash value. The hash of M1 gives some hash value X, and the hash of a different message M2 produces the same hash value X. It should be hard for someone to find two messages that produce the same hash value.
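To make the earlier point about similar inputs concrete, the following sketch hashes two messages that differ in a single character and counts how many output bits differ. SHA-256, the sample messages, and the helper `bit_difference` are assumptions chosen for illustration.

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

m1 = b"Pay 100 to Bob"
m2 = b"Pay 900 to Bob"  # differs from m1 in one character

h1 = hashlib.sha256(m1).digest()
h2 = hashlib.sha256(m2).digest()

print(h1.hex())
print(h2.hex())
print(bit_difference(h1, h2), "of", len(h1) * 8, "output bits differ")

# It is still a function: hashing the same input again gives the same value.
print(hashlib.sha256(m1).digest() == h1)  # True
```

For a hash function that behaves as described, roughly half of the output bits would be expected to differ, even though the inputs are almost identical.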
That is, we don't want collisions; a collision is when we have two different messages mapped to the same value. So we'd like the collision-free property in a hash function. Some later slides will return to these two properties; in fact we'll split them into three properties, so we'll see there are slight variations. But generally: one-way and no collisions. And that's similar to a MAC function. Remember with a MAC function, we assume that if we take two different messages, we'll get two different tags as output. So we don't get collisions. And in the same way as a MAC function, we'll therefore use the hash function to determine whether something's been modified or not. The next few slides give us some examples, but where are hash functions used in general in cryptography? To authenticate messages: I send a message to someone, and the receiver needs to check if it's been modified or not. Digital signatures: I want to sign a document such that someone later can prove that I signed it, that someone else didn't create it. Storage of passwords: when we store passwords on our computer system, we don't store the passwords in plain text. For example, the ICT server or the IT server, where you all have accounts. You all have passwords. The server doesn't store your passwords in a plain text file. It doesn't, in fact, encrypt the passwords. It calculates a hash of each password and stores the hash values. And in other applications of security, such as virus detection, you use hash values to check if a file matches a particular pattern. And random number generators can use hash functions as a component. This picture just shows that we take a long input, a large message, we apply a function and we get a fixed-length, usually small, hash value as output. The message is normally larger than the hash value; it can be much larger. Maybe the message is a one-gigabyte file. The hash value is a 128-bit value. The input is much larger than the output.
What does that mean, input much larger than the output? If the input of the function can be much larger than the output of the function, what does that mean with respect to these two properties? Repetition, or what do we call it? Collisions are possible. If I have a function that takes a large input and produces a small output, then it must mean that some of the inputs will map to the same output. In theory there must be collisions. We can have two different input messages that map to the same output value. But we want as a property that it is hard for someone to find those messages. The collision-free property says that we accept that there may be collisions, but it should be practically impossible for an attacker to find two different messages that do produce a collision. If they can, then they can defeat the role of hash functions in some security applications. I think we've drawn this idea of collisions before; let's draw it for hash functions. Let's say the input is not variable, but fixed at m bits. For an example, let's say our message is 1,000 bits. Then we take those 1,000 bits and we apply a hash function; the hash value equals the hash of that message. Let's say this hash function produces a fixed-length hash of h bits. The length of h is, let's say, 20, just for an example. Here we have a function that takes a 1,000-bit input and produces a 20-bit output. We can think of all the possible inputs. We can say there's M1, that's one possible input. If we fix it to 1,000 bits, there's another message we can input, M2, then M3, and so on. How many possible messages are there? If they're limited to 1,000 bits, two to the power of 1,000. That is, if we have a 1,000-bit message, there are two to the power of 1,000 possible combinations. Those are the possible inputs to our hash function. How many possible outputs?
Two to the power of 20. So we have two to the power of 1,000 possible inputs and two to the power of 20 possible outputs, because we only have a 20-bit output. We can think of the possible outputs, H1, H2, H3, up to hash value number two to the power of 20. Our function maps the input values to output values, and here, when we have more input values than output values, some of those inputs must map to the same output value. The hashes of different messages will produce the same output. This is a collision, so in theory collisions are possible. The only way to avoid them would be to make the hash value the same length as the message or larger. But we want a short hash value because we're going to attach it to the message. How many collisions on average? On average, we have two to the power of 1,000 inputs divided by two to the power of 20 outputs, which is, like you said, two to the power of 980. That is, in this simple example, on average, two to the power of 980 different messages would map to the same hash value; not three messages, but many messages. So we have many collisions here. But the property we require is that it's hard for an attacker to find collisions, to find such messages. So why do you think it's going to be hard for an attacker to find messages that map to the same hash value? First, think of these two to the power of 1,000 possible messages as, say, English messages: not many of them make sense. Let's say I have an English message 1,000 bits long; then one challenge for the attacker is to find a different message that also makes sense and maps to the same hash value. So even though there are many messages that map to that same hash value, from the attacker's perspective, most of the time they need to find a message that also makes sense, so that they can fool the receiver into believing that the message hasn't been modified.
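The counting argument above can be tried at a small scale. The sketch below builds a toy hash by keeping only the top bits of SHA-256; the 12-bit size (smaller than the 20 bits in the example, so that it runs quickly), the names `small_hash` and `find_second_message`, and the candidate message format are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

HASH_BITS = 12  # toy hash length; an assumption chosen so the search runs quickly

def small_hash(message: bytes, bits: int = HASH_BITS) -> int:
    """Toy hash: keep only the top `bits` bits of the SHA-256 value."""
    full = int.from_bytes(hashlib.sha256(message).digest(), "big")
    return full >> (256 - bits)

# Hash one more message than there are possible outputs.
# With more inputs than outputs, at least two inputs must share a hash value.
buckets = defaultdict(list)
for i in range(2 ** HASH_BITS + 1):
    buckets[small_hash(i.to_bytes(4, "big"))].append(i)
largest_bucket = max(len(members) for members in buckets.values())
print("distinct hash values used:", len(buckets), "largest bucket:", largest_bucket)

def find_second_message(target: bytes, bits: int = HASH_BITS):
    """Try candidate messages one by one until one has the same toy hash as `target`."""
    wanted = small_hash(target, bits)
    tries = 0
    while True:
        candidate = f"candidate-{tries}".encode()
        tries += 1
        if candidate != target and small_hash(candidate, bits) == wanted:
            return candidate, tries

original = b"original message"
second, tries = find_second_message(original)
print("found", second, "after", tries, "tries")
```

With such a short hash value the search finishes almost immediately. Each extra bit of hash length roughly doubles the expected number of tries, which is why the length of the hash value matters.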
So that's one advantage: even though we have two to the power of 980 messages that map to the same hash value, not many of them would make sense. But still, there can be a lot that map to that same hash value. So the challenge for the attacker is finding those messages, because if the hash function randomly distributes the values, they need to check: does this message map to the hash value H2? If not, try another one. If not, try another one. So the general approach for finding messages that map to the same hash value is to try them. And how many does the attacker need to try until one maps to the same hash value? It's in the order of the number of hash values that we have. So for an attacker to find a message that maps to the same hash value, normally they need to try about as many messages as there are hash values. If I try two to the power of 20 messages, and I unluckily get two to the power of 20 different hash values, the next message I try must produce one of the existing hash values, so we'll get a collision. So for the attacker to find messages that produce the same hash value, they normally need to try about as many messages as there are possible hash values. In other words, to stop the attacker doing that, make the hash value long. Then they need to try many messages. So the length of the hash value is one measure of the security of the hash function. We've maybe gone beyond what I wanted to do there. We'll see that come into effect after we see how hash functions are used. Remember there are always collisions, but if the hash value is long enough, it's hard for the attacker to find collisions. Let's see how hash functions can be used for authentication. I think we have three or four different schemes that we'll go through. Again, message authentication. I send a message from A to B. B wants to be sure that the data received hasn't been modified, that it's exactly as it was sent. And B wants to be sure that it came from the right sender.
It's not someone pretending to be someone else. So that's the goal here. And it's the same as with a MAC. When we receive a message, we're going to compare and see if some values match. If they do, we'll believe everything's okay. If they don't, we'll assume something's gone wrong. When we use a hash function to provide message authentication, we sometimes don't call the output a hash value. We call it a message digest. So it has a different name here. Message digest and hash value will mean the same thing in the following pictures. Here's one scheme for message authentication using hash functions. We have a message we want to send from A to B. We calculate the hash of that message. The hash function just takes a message as input. There's no key like in a MAC. We concatenate the message with the hash value and encrypt all of that, send the ciphertext, and the receiver decrypts. This is using shared secret key encryption. So A and B both know the key K, and B can decrypt. When they get the plaintext, they want to know, is this plaintext the correct plaintext or not? How do we know? You decrypt something, you get a plaintext message. How do you know if it's the correct one? Well, we can't always rely on recognizing that it is correct. So what we do is we use our hash function to verify. We decrypt the ciphertext, so we get the plaintext message. Included was a hash value. So we essentially have the received message and the received hash value. We calculate the hash of the received message. If that matches the received hash value, we believe everything's okay. So this is a way for the receiver to check that the plaintext they got after decryption is indeed correct. We don't have to rely on checking if it's the right structure of a JPEG or if it's English text. We use the hash function to do that. Why would they be the same?
Well, if nothing's been modified, then when we decrypt the ciphertext, we should get the original plaintext message back. When we hash the received message and compare it to the received hash value, if nothing's been modified, we have essentially hashed the same input, and hashing the same input should produce the same output. So they should match. What if something was modified? Let's try and look at what happens if the attacker modifies something along the way. We have user A sending to user B, and the key that they know, let's denote it slightly differently, KAB. That's the encrypt and decrypt key. They know the hash function, so we assume both sides know what function to use. What does A do? They calculate the hash of the message and then encrypt it all. So we'll write down the steps. They calculate the hash. Let's denote H1 as the hash of M1, and then they take M1 and concatenate it with the calculated hash value H1. Just join them together. Since we know the hash function, and each hash function has a defined length of hash value, the receiver is going to know how long the hash value is. Therefore they know where the split is between the message and the hash value. And we encrypt that using key KAB. We're sending to B. We encrypt all of that with, say, AES or some other symmetric key cipher. And we get a ciphertext we'll denote as C1. Then we send that. C1 is sent from A to B. These are just the steps that user A does in the previous diagram. I've used slightly different notation: M1, H1, key KAB. Let's say our malicious user in the middle intercepts. They get the message and they modify something. What can you do? You're now the malicious user. Try something. What are you going to do as a malicious user? Or maybe think about what you can't do. Can you decrypt? No, we don't have the key KAB. We can't decrypt. So we can't do anything to the internals, because we can't see what's inside.
So what could we do as an attacker? If we can't decrypt, we don't know what's inside. Can we modify anything? Well, we could modify the ciphertext. We don't know what impact that's going to have, but we can change it. There's nothing to stop us from changing the ciphertext. Let's try that and send on a modified ciphertext, C2, where C1 is not equal to C2. They're different. What happens at B? B's steps are to decrypt using KAB and then check the hash value. So we decrypt. Using what key? KAB. What are we decrypting? The received ciphertext, C2. What do we get when we decrypt C2? Or maybe easier, what do we not get? We don't get the original plaintext. We have decrypted, using key KAB, some ciphertext C2, and C2 is different from C1. We know from our properties of encryption what happens here: we haven't decrypted with the wrong key, we've decrypted a modified ciphertext. So we're not going to get the original plaintext, because decrypting C1 with KAB should give us M1 concatenated with H1. Therefore, decrypting a different ciphertext with the same key is not going to give us M1 concatenated with H1. It's going to give us some other values, some other random values. Let's just denote it as plaintext P2. Does B know at this stage that P2 is wrong, that it is not equal to M1 concatenated with H1? Hands up for yes, they know it's wrong. Hands up for no, they don't know it's wrong. In this case we're assuming no, they don't know it's wrong. We're assuming that when we decrypt something we get some plaintext, some bits. I don't know what I'm expecting to get. I don't know if it's going to be English or a JPEG or random bits for a key. So we can't know for sure when we decrypt something that what we got was wrong. And that's our problem here. That's why we also use the hash. So we get P2.
Is it good or bad? We don't know. What we do know is that P2 is some message concatenated with some hash value. B knows the hash function that A used, and I should have written that up here: the hash function is known by both sides. They agree upon the hash function, and of course the encrypt function and so on. So they know the length of the hash. So they think P2 is some message, I'll denote it as M2, concatenated with some hash value, H2. We know that the plaintext is some message with a fixed-length hash value at the end. So from B's perspective, this is the received message and the received hash value. Now B wants to check. What does it do? To check, we calculate the hash of the received message. Maybe I should have used a different subscript here; we'll see why. And we compare the hash of the received message with H2. Here's my comparison operator. Does it equal H2? That's the question. Does the hash of the received message equal the hash value received? No. Why not? Because our plaintext is essentially random. We decrypted some modified ciphertext with the key, so we got some random plaintext. So there's no relationship between H2 and M2. Just think of them as a random sequence of bits. The chance that the hash of this random sequence of bits produces this other random sequence of bits is very, very low, essentially zero. So there is practically no chance that the hash of the received message is equal to the hash value received. The only way that they'll be equal is if H2 was actually calculated as the hash of M2. But that wasn't the case here. H2 wasn't obtained by calculating the hash of M2. H2 is just the result of the decryption. And when we decrypt some modified ciphertext using this key KAB, we get some random plaintext as output.
So there's no connection between H2 and M2, despite the subscript being the same. They don't match, and there's an error at B. That is, we don't trust this value. We're assuming they would only be equal if H2 was obtained by calculating the hash of M2, which wasn't the case here. H2 is just a random set of bits at the end of M2. M2 may not even make sense. M2 may be a random set of bits. The attacker didn't choose M2 or H2. All the attacker did was modify the ciphertext. They have no influence over the values of M2 and H2. So in this case, what can the attacker do? They can't decrypt. If they modify the ciphertext, then when B checks, B will find the received hash doesn't match the hash of the received message. Don't trust the message. Any questions on this example of hash functions for message authentication? Maybe the question is, why did the attacker change the ciphertext? They were hoping to fool B into thinking that this message was the right one from A. There doesn't seem to be much purpose here because they don't know what the original message is, but they can still try to modify the ciphertext. Maybe they're just being annoying, okay? They want B to think that the message received is okay. If we didn't use a hash, if we didn't do these last steps, then we have a problem: B receives ciphertext C2, the modified value, they decrypt, they get P2, and then they use P2. They may use it for whatever application they have, but P2 is wrong here. So the role of the hash is to check P2. Maybe the message being sent was a random key. Then when we decrypt C2, we get some random value P2, which is different from the key A sent, so they will not be able to communicate securely. So that causes some inconvenience at least. Is there anything else you can do as a malicious user? We can't decrypt. If we modify the ciphertext, we will be detected.
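The first scheme, encrypting the message concatenated with its hash, can be sketched as follows. Python's standard library has no AES, so this sketch substitutes a toy XOR keystream cipher built from SHA-256; that cipher, the function names, and the sample values are assumptions made purely for illustration, and the toy cipher is not secure for real use.

```python
import hashlib

HASH_LEN = hashlib.sha256().digest_size  # 32 bytes; both sides know this length

def toy_keystream(key: bytes, length: int) -> bytes:
    """Toy keystream (NOT a real cipher): SHA-256 of key plus a counter, repeated."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """XOR the data with the keystream; the same operation also decrypts."""
    return bytes(d ^ k for d, k in zip(data, toy_keystream(key, len(data))))

toy_decrypt = toy_encrypt

def send(key: bytes, message: bytes) -> bytes:
    """A's steps: C = E(K_AB, M || H(M))."""
    return toy_encrypt(key, message + hashlib.sha256(message).digest())

def receive(key: bytes, ciphertext: bytes):
    """B's steps: decrypt, split off the hash, recompute and compare."""
    if len(ciphertext) < HASH_LEN:
        return None
    plaintext = toy_decrypt(key, ciphertext)
    message, received_hash = plaintext[:-HASH_LEN], plaintext[-HASH_LEN:]
    if hashlib.sha256(message).digest() != received_hash:
        return None  # mismatch: do not trust the message
    return message

k_ab = b"key shared by A and B"
c1 = send(k_ab, b"meet at noon")
c2 = bytes([c1[0] ^ 0x01]) + c1[1:]  # attacker flips one bit of the ciphertext

print(receive(k_ab, c1))  # b'meet at noon'
print(receive(k_ab, c2))  # None
```

The unmodified ciphertext is accepted and the modified one is rejected, because the hash recomputed by B no longer matches the hash value recovered by decryption.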
I don't think there's much else we can do as the attacker. We would need to modify the ciphertext such that when B decrypts, they get a P2 in which this random hash value is equal to the hash of this random message. The chance of doing that is essentially zero, because the malicious user doesn't know KAB and so doesn't know what will be obtained when B decrypts. So there's one example. We check that what's decrypted is correct. Let's try a different case. This is similar for checking purposes, but we don't encrypt the message. We only encrypt the hash value. In the previous case, we have confidentiality. The attacker cannot see the message. Sometimes we don't need confidentiality. Again, I send a message to B. I don't care if someone reads my message, but I care that when B gets the message, they know it's the original one. It hasn't been changed. So that's a service we sometimes need. So here's a way to do it without encrypting the message, remembering that encrypting a message, especially when it's large, takes time. If we don't need confidentiality, then we shouldn't waste our time encrypting it. So this approach involves taking the message and calculating the hash of the message. We encrypt the hash value. Then we take the encrypted hash value and attach that to the message. We send the message unencrypted but the hash value encrypted, and the receiver decrypts the ciphertext received at the end. They'll get as output some hash value, and they compare it to the hash of the received message. If they match, okay; if they don't, we have a problem. Let's see what happens when an attacker tries to defeat this scheme. User A, user B. User A calculates the hash of the message M1, and let's call it H1. And then they encrypt that hash value using a shared secret key, using symmetric key encryption. And let's say they get ciphertext C1. Note that we're not encrypting the entire message.
If our message was one gigabyte in length, the hash value is small, let's say 128 bits. Encrypting a very small message is much faster than encrypting a very large message. So with respect to performance, encrypting a hash value doesn't take much time. Encrypting a large message does, and there's no need to encrypt the message here. And we send the message combined with C1 to B: message M1 concatenated with C1. Our malicious user is going to intercept. What can they do? Let's try and change M1. See if we can modify the message here. In the previous case, we couldn't even see the message because it was encrypted, but here the message is not encrypted. So let's see if we can change M1 and send a modified message, let's say M2, where M1 and M2 are different. And we know that B is going to check when they receive, so we need to attach something at the end of M2, the modified message. What can we attach? We've got two approaches here, and we've seen them with MACs as well. We could leave C1 unchanged and just modify the message. Note that the attacker knows the length of each part; we assume that the hash function is known by A and B and by the attacker. Let's say the hash function produces a 128-bit hash value. H1 is 128 bits long. How long is C1? It's the same. When we encrypt 128 bits, we get a 128-bit value out. So the malicious user knows that the first N bits are the message and the last 128 bits are the C value, so they know where the split is. So one approach: don't modify C1. What happens? Let's go through it in detail so we can see the effectiveness of the hash function. We send a modified message but the same C1, and B receives it. So B receives the modified message and C1. They decrypt C1 using the shared secret key and compare with the hash of the received message. So let's go through those steps. B decrypts C1 using key KAB. What do they get as output? What do we get when we decrypt C1?
We get H1, because C1 was obtained by encrypting H1 with KAB. Therefore, if we decrypt C1 with KAB, we'll get the original input H1. Now we compare the hash of the received message, M2, with H1. Do they match? Does H1 equal the hash of M2? No, they don't match. So we assume there's an error. Why do they not match? Because H1 is the hash of M1, and the hashes of two messages which are different should also be different. That's the reason they don't match. M1 and M2 are different, because M2 was modified. If we have two different messages, then the property of our cryptographic hash function was that there should be no collisions. Two different messages should not produce the same hash value. And if that property holds, there is no collision, so the hash of M1 is different from the hash of M2; that is, the hash of M2 is different from H1, since H1 is just the hash of M1. They are different, therefore B knows something's gone wrong. What do you need to do as the malicious user to defeat this security mechanism? Under what conditions would this security mechanism fail? Think about the properties. What are you going to try? What would you like to be able to do? Correct. I would like to be able to choose a message M2 that produces the same hash value as M1. If the malicious user could choose M2, which is different from M1, such that the hash of M2 does equal the hash of M1, this system would fail. If the hash values of the two messages were the same, B receives, they decrypt and get H1, which is the hash of M1. They compare H1 with the hash of M2. They are the same, so B trusts this message.
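The second scheme, sending the message in the clear with only its hash encrypted, can be sketched like this. Because the standard library has no AES, the sketch uses a toy 4-round Feistel block cipher built from SHA-256 that encrypts exactly one 32-byte block (the size of a SHA-256 hash value); this cipher, the function names, and the keys and messages are assumptions for illustration only. The last lines also try the variant in which the attacker re-encrypts with their own key.

```python
import hashlib

HASH_LEN = 32  # SHA-256 output length in bytes
HALF = 16      # the toy cipher works on two 16-byte halves

def _round(key: bytes, round_no: int, half: bytes) -> bytes:
    """Round function of the toy cipher: truncated SHA-256 of key, round number, half block."""
    return hashlib.sha256(key + bytes([round_no]) + half).digest()[:HALF]

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Toy 4-round Feistel cipher on one 32-byte block (a stand-in for AES, not for real use)."""
    left, right = block[:HALF], block[HALF:]
    for r in range(4):
        left, right = right, _xor(left, _round(key, r, right))
    return left + right

def toy_block_decrypt(key: bytes, block: bytes) -> bytes:
    """Undo the Feistel rounds in reverse order."""
    left, right = block[:HALF], block[HALF:]
    for r in reversed(range(4)):
        left, right = _xor(right, _round(key, r, left)), left
    return left + right

def send(key: bytes, message: bytes) -> bytes:
    """A's steps: M || E(K_AB, H(M)). The message itself is not encrypted."""
    return message + toy_block_encrypt(key, hashlib.sha256(message).digest())

def verify(key: bytes, packet: bytes) -> bool:
    """B's steps: decrypt the last 32 bytes and compare with the hash of the received message."""
    message, c = packet[:-HASH_LEN], packet[-HASH_LEN:]
    return toy_block_decrypt(key, c) == hashlib.sha256(message).digest()

k_ab = b"key shared by A and B"
k_m = b"malicious user's own key"
m1, m2 = b"pay 100 to Bob", b"pay 900 to Eve"

packet = send(k_ab, m1)
c1 = packet[-HASH_LEN:]

print(verify(k_ab, packet))         # True: unmodified
print(verify(k_ab, m2 + c1))        # False: message changed, C1 kept
print(verify(k_ab, send(k_m, m2)))  # False: hash re-encrypted with the wrong key
```

Both modifications are rejected. The scheme would only fail if the attacker could find an M2 with the same hash value as M1.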
So the security of this scheme depends upon it being practically impossible for an attacker to find another message M2 which has the same hash value as the original message M1. We know collisions are possible, but it should be practically hard, computationally infeasible, to find such a message. And generally, how long it takes to do that depends upon the length of the hash value. Assuming the hash algorithm is secure, the only approach for the attacker is to try many different values. So we make the hash value large enough, and they won't be able to find M2 with the same hash value as M1. Questions on this authentication scheme? Just before the break, what if we changed C1 as well? That is, we modified M1 to M2, and, I will not draw it, instead of sending M2 concatenated with C1, we sent a modified value here, C2. What's the problem for the attacker? I modify the message to M2. I calculate the hash of M2 and get H2. And then I encrypt H2 with which key? I don't have KAB, so I cannot encrypt it with KAB. I encrypt it with the malicious user's key. And the problem with that approach is that when B decrypts that with KAB, they'll get a different value. So if we tried to modify C1 as well, it won't work because we don't know KAB. Remember, C1 is obtained by encrypting the hash value of the message. The malicious user can find the hash value of M2, but they can't encrypt it with KAB. So trying that won't work, in the same way as the first attempt didn't work. We'll only be successful if we can find a second message with the same hash value as the first. So we need to make sure that the hash function makes it practically impossible to do so. So that was an example of using hash functions to allow the receiver to verify whether the message has been modified, without encrypting the entire message. The next one is slightly different. Again, we have no encryption of the message, no confidentiality.
But we're using a shared secret S, a value that A knows and B knows, shared between A and B, and used really as authentication to verify that the message came from the right person. So here, we don't encrypt anything. There's no encryption. And one benefit of no encryption is performance; encrypting things takes time. Maybe we don't have the hardware or the software to encrypt. In some, at least older, cases, when there weren't so many algorithms to choose from, some algorithms had a patent. You needed a license to use the algorithm. So there were benefits to avoiding encryption. And this is a way to authenticate while avoiding encryption. We take a message, we join it with our secret value S, concatenate. We hash the message concatenated with S, get a hash value, and then concatenate that hash value with the message, send them, and we verify at the receiver. For the verification at the receiver, the received message M we join with the secret value S, hash, and compare with the received hash value. So if everything's okay, then we essentially compare the hash of M concatenated with S with the hash of the received M concatenated with S, where S is known only by A and B. It's like a key, but it's not used for encryption. It's just used to authenticate that this message came from A, because the only other person who could have sent this message with the same S that I use here is user A. So we'll use it as a way to confirm this message came from user A, not someone pretending to be A. So let's look at a couple of attacks on this approach. First, we'll try a masquerade attack. We have user A and user B. They share a secret value. I'll write it as SAB. They know that. No one else knows SAB, the secret. I won't write it as a key, because we're not using it for encryption. It's just some secret value. And first, let's try a masquerade attack where our malicious user sends a message to B pretending to be A.
So they're going to send a message using the same scheme. They get to choose the message; they're pretending to be A. So this is what the attacker, the malicious user, will do: send the message concatenated with the hash of the message combined with the secret. So they send their message, let's denote it as M1, concatenated with the hash. Now this is a masquerade attack. The malicious user chose M1. A hasn't sent anything. It's just someone pretending to be A. And they take the hash of M1 combined with what? The secret, again, is not the secret of A and B. They don't know that. It's secret between A and B only. So they use some other value. Let's say the secret known by the malicious user, or maybe the secret shared between the malicious user and B. I'll write it as SMB, the secret between M and B. But it's not SAB. That's the important point. B receives this thinking it's from A. The from address, the source address, identifies user A. So it must have come from A, B thinks, but let's check. And how do they check? They combine the received message with the secret they share with A, calculate the hash, and compare with the received hash. So they check by calculating the hash of the received message M1 combined with the secret that they share with A, SAB; they get a value, and they compare it with the received hash value. I'll write it in short. We compare it with this portion of the message. That's a question mark. Are they equal? That is, is the hash of the received message combined with A and B's shared secret equal to the hash value received, this component? No, because SMB is not the same as SAB. So no, they're not equal. Error. Don't trust the message. Even though M1 is the same in both hash inputs, the last part is different, meaning the two inputs are different, meaning the two hash values are different. So this is an example of how it's used for authentication, because the malicious user doesn't know SAB.
The malicious user cannot create this fake message and send it to user B. What if A sends a message to B? Let's consider a different case. No longer masquerade, but A sends to B. So they send using the same approach. They generate a message here. Let's again call it M1, concatenated with the hash of M1 combined with SAB, in this case, the real SAB. They send it to B, but the malicious user intercepts. So this is a different case. The malicious user intercepts. What can the malicious user learn, or what can they do in this case? Do they know the message? Does the malicious user know M1? It's not encrypted. There's no encryption here. We're not trying to keep the message secret. Of course they know the message M1. What if they try to modify the message? Let's try. We change M1, and I think we'll quickly see our hash function has the intended effect, but we'll try to modify the message and send the modified message on to B. So change M1 to M2, concatenate with the hash of what? We've got two options and, as we saw in the others, neither of them works. So we modify the message M1 to M2 as the malicious user. And now if we use the original hash value, the hash of M1 and SAB, we know, I think, that when B tries to verify they'll detect something's wrong. Because if we don't change the hash value, but change the message, what B will do is calculate the hash of M2 combined with SAB and compare it with the hash of M1 combined with SAB. If the two messages are different, then it'll be detected. So if we don't modify this hash value, we know it won't work. All right, let's write it down to be clear, but it's a repeat of what we've seen. If we use the same hash value, note that we don't know SAB here. When I write H of M1 combined with SAB, I mean the malicious user knows the bits of that value. If our hash function produces 128-bit output, they just copy those 128 bits. They don't calculate them, it's just a copy. 
So let's say they take the message, change it, and then copy the last sequence of bits and forward it on to B. B does its check. B takes the received value, this is the received value, and compares it to the hash of the received message combined with SAB. Do they match? No, the inputs to our hash functions are different. In the previous case, when we did the masquerade attack, the secret part of the input to the hash function was different. Here the message is different. So therefore the hash values are different, because we assume that, with our hash function, the attacker cannot find collisions. If the attacker could choose M2 such that the hash of M2 combined with SAB is the same as the hash of M1 combined with SAB, then this scheme is insecure. But assuming they can't find a collision, and even here they can't check because they don't have SAB, then they can't defeat this scheme. This was the case when the malicious user simply copied the hash value. The other case is when they change it. A sends a message to B. The malicious user modifies the message and modifies the hash value. Same message sent. The message and the hash of the message combined with the secret. The malicious user intercepts and modifies, sends on to B. What do they modify? M1 becomes M2, and this time they try to recalculate the hash value. So they know the hash function. Good. They know M2. Good. And they concatenate M2 with what? Not SAB, because they don't have SAB. Remember, this is what the malicious user does. They don't know SAB, so they use some other value. SMB, meaning the secret shared between the malicious user and B. But it's not SAB. And it turns out you'll get the same result as the masquerade attack. What does B do? They take the received value. That's what's received as the hash value. And they compare with the hash of the received message, hash of M2 combined with, because they think it was from A, SAB. Again, the two inputs are different. 
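Both options open to the intercepting user, copying the old hash value or recalculating it with a different secret, can be tried out in a small Python sketch. As before this is an illustration under assumptions: SHA-256 stands in for H, and the helper names and messages are hypothetical.

```python
import hashlib
import hmac
import secrets

def tag_for(message, secret):
    """H(M || S), with SHA-256 assumed as the hash function."""
    return hashlib.sha256(message + secret).digest()

def b_accepts(message, received_tag, s_ab):
    """B's check: hash the received message with SAB and compare."""
    return hmac.compare_digest(tag_for(message, s_ab), received_tag)

s_ab = secrets.token_bytes(32)  # known only to A and B
s_mb = secrets.token_bytes(32)  # the malicious user's own value, not SAB

# A sends M1 || H(M1 || SAB); the malicious user intercepts it.
m1 = b"original message M1"
h1 = tag_for(m1, s_ab)

# The malicious user changes M1 to M2.
m2 = b"modified message M2"

# Option 1: keep the intercepted hash bits unchanged (just a copy).
copied_hash_accepted = b_accepts(m2, h1, s_ab)

# Option 2: recalculate the hash, but only SMB is available.
recalculated_hash_accepted = b_accepts(m2, tag_for(m2, s_mb), s_ab)

print("copied hash accepted:", copied_hash_accepted)
print("recalculated hash accepted:", recalculated_hash_accepted)
```

In both cases the input B hashes differs from the input behind the received hash value, in the message part for option 1 and in the secret part for option 2, so the comparison fails.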
SMB is not the same as SAB, therefore the two outputs are different. And we get an error. They don't match. So in fact, it's similar to the masquerade. Because the malicious user doesn't know SAB, they cannot calculate the correct hash of the modified message. If they don't change the hash, then again, they won't match. So the attack is detected. You're the malicious user. Anything else you can try? We didn't know SAB. It didn't work because that was shared only between A and B. Can you find SAB? Brute force doesn't work. We know brute force will take too long. I just choose SAB to be a 256-bit random value. You can't brute force it. Is there some other way you could try to find SAB? What do you need to do to find SAB? Because if we did find SAB, we could defeat the scheme. Find the secret and we can pretend to be A. We can intercept and modify. How can we find the secret? Let's consider this case. Look what the malicious user receives. They know M1. They know the hash of M1 combined with the secret. So what do they know? Some message, concatenated with some hash value. Let's write the hash value as H1. They intercept this, where they know that H1 was calculated from the hash of the message and a secret. They want to find the secret. What would the malicious user try to do to find the secret? They intercept M1 and H1. H1 is the hash value that was calculated. How do you find the secret? You have H1. How can it help? What is H1? It's the hash. The malicious user doesn't know the secret, but we know how it was made: user A calculated H1 as the hash of M1 concatenated with SAB. We know the hash value. What if we could find the input? If we could do the inverse. Think of the inverse hash function. The opposite, which takes the hash value and returns the original input. If the attacker could take the hash value and apply the inverse hash function, then what they would get is the original input, M1 concatenated with SAB. We know M1 is the first N bits. 
We have M1 concatenated with SAB. We know the last X bits are SAB, and we learn SAB. The malicious user learns, or can find, SAB if they can calculate the inverse hash function. That is, given the hash value, go back and find the original input. Can they do that? What property tells you that they can't do that? The one-way property. Coming back to our very first two properties. We've seen examples of the collision-free property. The attacker mustn't be able to find another message with the same hash value. Here we see an example of why we need the one-way property. If the attacker knows the hash value, it must be computationally infeasible to go back and get the original message. And that was a case of why. Because if they could take the hash value and get the original message, then the scheme that we just went through wouldn't work. So sometimes we require the one-way property as well. The one-way property means it's easy to calculate the hash of M to find H, but it's practically impossible to go in the reverse direction and, given H, find M. So can they calculate that? Well, we say no, they can't do that because of the one-way property. That's where the property is necessary for our hash function. So when hash functions are designed, those two properties become important. You can't go backwards and you can't find collisions. And we've seen some cases as to why those properties are necessary. Questions on hash for authentication? That was scheme C. One we won't go through is a modification where we encrypt as well. That is scheme C combined with confidentiality. So there are different ways hash functions can be used to provide authentication. And there are advantages and disadvantages of each. Some may also provide confidentiality, but they rely on encryption. So we need to make a trade-off between security requirements and performance. So authentication and encryption, it depends on what we want in terms of security. Do we want confidentiality? 
Then we will need to encrypt the message. But if we don't need confidentiality, we just want authentication, then we don't necessarily have to encrypt the message. And that can be better from a performance perspective. Encryption can be slow. Or if you want fast encryption, in some cases you can use dedicated hardware, but that costs money. And having that hardware in your device may be inefficient if you don't need to use it much. And some algorithms may have licensing costs. So again, a cost of doing encryption. So those are some reasons for not doing encryption. Those were examples of hash functions for authentication. In the previous topic, we saw examples of MAC functions for authentication. So we can choose between them. There are similarities between MAC and hash functions. What's the difference? The key. A hash function just takes a message. A MAC function takes a message and a secret key. And if you remember back to HMAC, HMAC turns a hash function that just takes a message into a MAC function that also takes a key. That's the idea of HMAC. Let's look at a couple of examples of hash functions on files, and then we'll go through signatures and maybe the security requirements. So there are different hash functions. We're not going to go through the design of any. Two that you'll come across, if we jump back to the last few slides: MD5 and SHA. And SHA has different variants. The Secure Hash Algorithm. MD5 was quite popular. Developed by Ron Rivest. What else did he do? The R in RSA. Same guy. RC4. RC4 was a stream cipher. The same guy developed that. So MD5 was developed in the 1990s. It generates a 128-bit hash value. It was commonly used by applications for the storage of passwords and for file integrity. I'll show you an example in a moment. But you go to a website, you download a file. 
The website also shows you the hash value of the file, so that when you check on your computer, you calculate the hash of the received message, the received file, and compare it to the published hash value. But MD5 has some weaknesses. One of them is that the 128-bit hash is too short, and there are some weaknesses in the algorithm, such that it's no longer recommended because people can find collisions. You can find two messages that produce the same hash value. It's not so hard. So MD5, you'll still see it around a lot, but you should not use it for secure applications today. Another one is the Secure Hash Algorithm, developed over a number of years and improved over time. The first one was simply SHA, which is now called SHA-0, then SHA-1, SHA-2, and SHA-3. The original one, SHA-0, and SHA-1, Secure Hash Algorithm 1, are considered insecure as well and not recommended. SHA-2 is quite common, and SHA-3 has been in development and was maybe standardized one or two years ago, but is not so common in use. So they follow different techniques, but one thing that's different is the hash length that they produce. In this table, the message digest size is the output. SHA-1 was 160 bits, MD5 128 bits, and SHA-2 had different variations. You could choose the length, and common ones are SHA-256 and even SHA-512. So they allow for different lengths. The longer the hash value, the harder it is for the attacker to find collisions and to defeat the one-way property. Those depend upon the hash length, assuming the algorithm is well designed. So that's all we'll say about them. I'll just show some examples now of applying them to files. Let me find some files to apply them to. Let's try some simple messages first. Here's message1, it wraps around, and message2. So we have two messages, two files. Message1.txt, message2.txt, and that's the contents of the files. So what we would like to do is calculate the hash of these files. Not of the file names, but of the contents of the files. And there are different tools to calculate the hash. 
OpenSSL can do it. And if I remember the syntax, I've done it before. OpenSSL has a digest option. Digest is the name for the output when we use the hash value for authentication. It's called a message digest. Choose the algorithm, md5, the simple one here, and take the message as input. And there's the hash value of message1. It's in hex. There are 32 hex digits, so if you convert it to binary, 128 bits. So this is the random-looking hash value produced as output. So OpenSSL can calculate that. Normally, your computer will have other software to calculate it. So I have something called md5sum. And it produces the same value. So it's just a different implementation of the MD5 algorithm. md5sum is a tool that calculates the hash of the contents of the file. Not the file name, but the contents only. It produces this. What if I calculate the md5sum, it's sometimes called a checksum, the MD5 hash value of message2? What am I going to get? If I calculate MD5 of message2, will we get the same or different? Be careful. Same, different. Anyone else want to have a guess? Are they different? Yes, they are different. There's a dot missing here. Message1 has a full stop. Message2 does not. They are two different contents. Only a little bit different. Only one character or so different in the files. And when I calculate the MD5 of message2, you expect to get a different hash value. How much different? Well, the hash value produced when I calculate, do you think it will be similar to this one? It should be random. So there should be, on average, half of the bits different. So there should be no similarities. So it effectively produces a random hash value. We can also run SHA-256 on the same messages. It's longer. It produces a 256-bit hash value. Different algorithm, and again, essentially producing random hash values. The hash of two different messages produces two different random hash values, or digests, as output. So that's just an example of that. 
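The same experiment can be repeated with Python's standard hashlib module instead of the command-line tools. A minimal sketch follows. The two message strings are made-up stand-ins for the contents of message1.txt and message2.txt, which are not shown in the transcript, and the helper names are my own.

```python
import hashlib

def hex_digest(data, algorithm):
    """Hash the given bytes with the named algorithm and return hex digits."""
    return hashlib.new(algorithm, data).hexdigest()

def differing_bits(hex_a, hex_b):
    """Count the bit positions in which two equal-length hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

# Hypothetical contents: identical except for the final full stop.
message1 = b"This is the first example message.\n"
message2 = b"This is the first example message\n"

for algorithm in ("md5", "sha256"):
    d1 = hex_digest(message1, algorithm)
    d2 = hex_digest(message2, algorithm)
    total_bits = len(d1) * 4  # each hex digit encodes 4 bits
    print(algorithm, d1)
    print(algorithm, d2)
    print("bits that differ:", differing_bits(d1, d2), "of", total_bits)
```

For a well-behaved hash function the count of differing bits should land somewhere around half of the total, which is the "no similarities" behaviour described above.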
Let's try another example, just to be sure. Those were text messages, English messages. These are just binary messages. So two binary files, each 128 bytes long. Maybe hard to see. I'm going to calculate the MD5 hash of file one and then of file two. What's going to happen? I've shown the hexadecimal contents of those files because they're binary. I can't show them as text. They're not English messages. This is the contents of file one and, down the bottom, the contents of file two. Are they the same? They are slightly different. It's hard to see all of the differences, I think. Here's 2571 in hex. And here it's 25f1. So there's one hex digit different there. And I think there are a couple of other differences. So there are some differences. They're not exactly the same. So what do we expect? Two hash values which are completely different. Two different random values. So we calculate the MD5 sum of file one and file two. What do we get? They are the same. You said they'll be different. Note that we recognized those two files had different contents. With MD5, we get the same hash value. Here is a case of a collision. Someone else found the two files such that they were different, with only a small amount of difference, and they both mapped to the same hash value. So we get a collision here. So this is bad from a security perspective. Because if file one was sent by user A and we're using the hash for authentication, the malicious user could intercept, modify file one to file two, and it will produce the same hash value when B tries to validate. So here's an example of a collision. And that's due to the weakness of MD5. With MD5 today, people can find collisions. Now, this is not a very useful collision because both files have no meaning. That is, what you want as an attacker is one file that says, let's say, decrease Steve's salary by 10,000 baht. 
And then the second file can be changed to increase Steve's salary by 10,000 baht. And if they both map to the same hash value, then the attacker can take advantage of that. But it is possible to find collisions. So MD5 is no longer recommended. Last example, and you see this sometimes when you download files. Here is an Ubuntu download page where you can download the ISO image, the CD image, for different versions of Ubuntu. And I've actually downloaded one. I think, just to remind myself, I downloaded this file. It's 576 megabytes, just an image, a CD image. So I downloaded it, and that's the file. And now I calculate, let's say, the SHA-256 sum of that file. Here we have a large input, and it produces a 256-bit output. So here's the hash of the file. It didn't take long, a couple of seconds to calculate. Hash functions can be quite fast compared to encryption. It took a couple of seconds to calculate the hash of that 600-megabyte file. And if we look on the website, so I've downloaded the image, I can calculate the hash of the received message, the received image, but the website also publishes the SHA-256 sums. So in a file here, it's just a text file, I open it up, and it lists the SHA sums that the website has for those files, and my file was this one. And do they match? I hope so. If you check, the hash value published by the website matches the hash value I calculated after I downloaded the file. So that's a way for me to check, A, that the downloaded file matches the one published, that maybe I haven't lost any bits through the download. And importantly, that it hasn't been modified through the download. So you'll see that often, a role of hash functions. What's the limitation of this approach? Can an attacker have an impact here? The website had the file and the hash value published on the website. 
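The download check can also be written as a short Python sketch. The function names are hypothetical, and the sums-file layout (one line per file: hex digest, whitespace, optional `*`, file name) is an assumption based on the usual output of the sha256sum tool, not something read from the actual website.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks, so a large image need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def published_hash(sums_text, filename):
    """Look up the published digest for `filename` in the sums text.

    Assumed line format: '<hex digest> <optional *><file name>'.
    Returns None if the file name is not listed.
    """
    for line in sums_text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2 and parts[1].strip().lstrip("*") == filename:
            return parts[0].lower()
    return None

def download_matches(path, sums_text, filename):
    """True if the hash of the downloaded file equals the published hash."""
    expected = published_hash(sums_text, filename)
    return expected is not None and sha256_of_file(path) == expected
```

Usage would look like `download_matches("image.iso", open("SHA256SUMS").read(), "image.iso")`, where both file names are placeholders. The check only says that the file matches the sums file that was downloaded. It says nothing about whether the sums file itself can be trusted.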
I downloaded the file, calculated the hash of the downloaded file, and got the one on the green screen, and then I downloaded the hash value and compared them, and I see they're the same, good. What could the attacker do here? Assume the attacker can't modify the web server, the web server is secure, that is, they don't have access to the web server. Note that I downloaded the file, and then I downloaded the hash value. Now, if I downloaded the file, but a malicious user intercepted and modified it before it got to me, I got a modified file. But then I downloaded the hash value, and if the malicious user also modified the hash value, then when I check, the hash of my modified file could equal the hash value that I downloaded, because that can also be modified. So this system is not secure, because a malicious user can intercept and modify both the downloaded file and the hash value. We need to somehow encrypt the hash value, like we saw in one of our schemes. Actually, this one: if the web server encrypted the hash value, then when I download it, I decrypt, and the malicious user could not have modified it. What's the problem here? With this scheme, as we saw before, the malicious user cannot modify anything. If we don't encrypt, the malicious user could modify both the message and the hash value as it's sent. What's the problem with this scheme, especially with regard to the website? I download the file, M, I download the encrypted hash value, and then I decrypt the hash value using what key? If I use the same scheme for the website, I would need to have a secret key shared with the web server. So we have a key distribution problem here. I have to first download the key from the web server, but that's a problem, because how are we sure that the key is secure? So how can we overcome that? How do we distribute a key? How can I be sure that the key that I get is the same one that the server used? Diffie-Hellman, remember Diffie-Hellman? 
That's one way. Or we could use public key cryptography. Instead of encrypting with symmetric key cryptography, encrypt using public key cryptography. Then I can get the... So the server could encrypt with my public key. I could decrypt with my private key. That's not so convenient. It requires the server to encrypt with my public key. They don't know me. The other way is that the server can encrypt using their private key, and I can decrypt and verify it's correct using their public key. And that's what the next part is. We can provide some form of a signature. So let's move on to signatures and see how we use hash functions combined with public key cryptography. And just coming back to that website: there is a SHA256 sums file. There's another one which is encrypted. And that actually uses the signature approach. Let's see how that works. Signature or...