Let's look at what we mean by a hash function and then state some requirements on a cryptographic hash function. I usually won't say "cryptographic hash function"; I'll just say "hash function", but because we're talking about security, that implies a cryptographic hash function. Hash functions have many different uses: some are for security and some are for other purposes. The ones for security have certain requirements.
What is a hash function? Well, we can start by saying that a hash function is some function that normally takes a variable-size input. The input of that function can be of any length, in this case denoted as M, and it produces a fixed-size output, which we call the hash value. Although it can get confusing, the hash function is usually written as uppercase H. The hash of some message produces a hash value, or simply a hash, which is denoted as lowercase h. And that function, uppercase H, when we apply it to many different inputs, should produce outputs which are evenly distributed and random looking. The hash values should appear like random numbers or random strings.
Evenly distributed means that if we have, say, a thousand possible inputs and twenty possible outputs, then on average fifty of those thousand inputs should map to one of the possible outputs, another fifty to one of the other outputs, and so on, such that an equal number of inputs map to each possible output. They don't all go to one value, for example. And random looking means that we can think of the output hash value as a random number.
Cryptographic hash functions have those features, but we have additional requirements. In simple terms, it should be practically impossible for someone to find a message M, the input, if they know only the hash value. So a hash function takes a message as input and produces a hash value as output.
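As a rough illustration of the two features just described (fixed-size output and evenly distributed output), here is a minimal Python sketch using the standard `hashlib` module. The helper names `md5_hex` and `bucket` are made up for this example, and reducing the hash modulo 20 is only an assumed way of mimicking "twenty possible outputs"; it is not something from the lecture.

```python
import hashlib
from collections import Counter


def md5_hex(data: bytes) -> str:
    """Return the MD5 hash of data as a hex string."""
    return hashlib.md5(data).hexdigest()


def bucket(data: bytes, n_outputs: int = 20) -> int:
    """Map data to one of n_outputs values by reducing its hash (illustration only)."""
    return int.from_bytes(hashlib.md5(data).digest(), "big") % n_outputs


# Fixed-size output: a 1-byte input and a 1,000,000-byte input
# both give a hash of the same length.
short_hash = md5_hex(b"a")
long_hash = md5_hex(b"a" * 1_000_000)
print(len(short_hash), len(long_hash))

# Evenly distributed: a thousand inputs spread over twenty outputs
# should land roughly fifty per output.
counts = Counter(bucket(str(i).encode()) for i in range(1000))
for output_value, how_many in sorted(counts.items()):
    print(output_value, how_many)
```

The per-output counts will not be exactly fifty, just close to it, which is what "on average" means here.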
That should be easy to calculate, but going the other way should be hard. The inverse is: given the output h, find the original message M. That should be practically impossible to do. It should be computationally infeasible, meaning it would take too long to do. In simple terms (we'll see some other definitions), this is called the one-way property. The hash function should be easy to calculate in one direction but hard to calculate in the other direction.
Another property is that the hashes of two different messages M1 and M2 should be different hash values. Or, put the other way, it should be practically impossible for someone to find two different messages that produce the same hash value. So from a security perspective, it should be hard to find the original message given only the hash, and it should be hard to find two different messages that produce the same hash. That will be important when we use hash functions for authentication. We use hash functions to determine whether something has changed or not, whether our message has been modified or not. In the same way as with MAC functions, we can provide authentication and data integrity.
But there are other purposes for hash functions, in security and even in other systems. So we'll use them for message authentication. I receive a message, and I want to be sure that the message has not been modified and that it comes from the person it claims to come from. That's the same as we saw for MAC functions. A specific instance of that is what we call digital signatures. The purpose there is that I want to be able to sign a message such that when someone receives that message, they can prove that it came from me and not anyone else. So we'll look at digital signatures. We will not look at passwords, but another aspect of computer security is how to store passwords, and hash functions are commonly used in the storage of passwords.
Instead of storing the actual password of a user on a computer system, we store the hash of the password. Hash functions are used in virus detection, to create signatures of malicious software, and anti-virus software commonly uses them to detect that virus. Hash functions are used in pseudo-random number generators: because hash functions produce a random-looking output, they can be used to produce random numbers. And there are a number of other applications where hash functions are used in security and other aspects of computing.
We can think of the hash function as providing some mapping. That mapping takes an input of some length, L bits: the message, for example. So we have some message of variable length, meaning it can be a different number of bits each time. Often we'll attach the actual length to that message. That's just a practical feature. So we take some message of variable length as input to the hash function, and it produces a fixed-length output. And the output is usually small relative to the possible inputs.
Given that the input can be variable length and the output is fixed length, think about property two: is it possible to find collisions? Why is it possible to find collisions? Because the length of the hash value is smaller than the possible input, so there are more potential inputs than there are potential outputs. It's the same with any map from a larger set to a smaller one. So in theory, yes, there will be collisions. That is, two different inputs will map to the same output. But the strength of hash functions for cryptographic purposes relies on it being hard to find those inputs that produce the same output. It's not that they don't exist; it should be hard to find them.
What's an example of a hash function? The name of a hash function? MD5 is one that you may have seen and come across. Where have you seen MD5 in use? Checking file integrity.
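To make the "more inputs than outputs" argument concrete, here is a small Python sketch. It assumes a deliberately weakened hash, `tiny_hash`, made up for this example by keeping only the first 2 bytes (16 bits) of a SHA-256 digest. With only 65,536 possible outputs, collisions must exist and a simple search finds one quickly; with a full-length secure hash the same search would be computationally infeasible.

```python
import hashlib


def tiny_hash(data: bytes, n_bytes: int = 2) -> bytes:
    """A deliberately weak hash: the first n_bytes of SHA-256 (illustration only)."""
    return hashlib.sha256(data).digest()[:n_bytes]


def find_collision(n_bytes: int = 2):
    """Try inputs "0", "1", "2", ... until two of them share a tiny_hash value."""
    seen = {}  # maps hash value -> first message that produced it
    i = 0
    while True:
        message = str(i).encode()
        value = tiny_hash(message, n_bytes)
        if value in seen:
            return seen[value], message, value
        seen[value] = message
        i += 1


m1, m2, shared = find_collision()
print(m1, m2, shared.hex())
```

Because there are only 2^16 possible outputs, the loop is guaranteed to stop within 65,537 attempts, and in practice it stops after a few hundred.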
When you download a file, the website you download it from may also post a hash value, an MD5 hash value, with the idea that after you download the file, you calculate the hash of that file and compare it to the one on the website. Now, there are some security flaws with that, but in some instances it allows you to confirm that the file you downloaded is the correct file. So MD5 is one. Any others? SHA, S-H-A, the Secure Hash Algorithm. So MD5 and SHA are two widely used hash algorithms. We will not go through how they work. There are a few slides at the end that talk about their characteristics, but they are the two that we'll use in examples and that you'll see a lot.
Let's give an example. Here's our first message. Many systems will have a program to calculate the MD5 hash or the SHA hash of a file. I've got something called md5sum. md5sum usually takes a file as input, and what it does is take the contents of that file and calculate the hash value. What do you think the hash value is going to be? Can you predict it? It should be random, so we cannot predict it. If we'd done it before, we'd know. How long do you think it will be? The same as the message? No, not necessarily. In fact, in this case I have quite a small message. We'll try a longer message. The message can be any size, and we'll try different sizes soon. The hash value is usually fixed length, and the length depends upon the algorithm. What MD5 produces, we'll see.
Let's calculate it. Okay, it's done. So we calculated the hash of the contents of that file and we got this value as output. It's represented in hex. How many hex characters are there? 32 hex characters, times 4, is 128 bits. MD5 produces a 128-bit hash value. So it's actually 128 bits, but to show it on the computer the program converts it to hex and prints the hex characters. One hex character is 4 bits. Okay, let's do the md5sum of two other files.
Message 2 and Message 3. What can you tell me about Message 2? It's different from Message 1, because the hash values are different. The hash of Message 1 and the hash of Message 2 are different, which is consistent with our requirement that two different inputs give two different outputs. So this suggests to us that the contents of the file Message 2, not the file name but the contents, are different from the contents of the file Message 1. And for Message 3, the hash value is the same as for Message 1, which implies that the two files are the same. That's one way we use the hash function.
Let's check. Message 1, Message 2 and Message 3. Messages 1 and 3 are the same. Message 2 is missing a full stop. So that's what we expected from the hash values. One and two are different; one and three are the same. They differ by just one character. And when we said that the hash function should produce random and evenly distributed outputs, it means that even if the inputs are very similar, almost the same except for one character, just a few bits different, then the hash values should be completely different, if they're random. And that's what we see here: the two hash values are not similar. Just because the inputs are similar, it doesn't mean the hash values will be similar.
Let's try it on a bigger file. This is from one of our virtual nodes, our base, just a large file, 800 megabytes. Let's calculate the hash on that. A larger file takes some time, but it gets there. Okay, so that's the hash on that, just to show it works. We'll try SHA in a moment, yes.
Let's modify the file. Let's look. This is a compressed file, so think of it as a binary file. I'll just look at the first 32 bytes of that file. That's the first 32 bytes of the file; of course it's much larger than that. Let's just change one bit in the file. To change it, I'm going to run a command that changes, let's say, 96 (this is in hex) to 97. So we'll change 96 to 97.
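The point that similar inputs give unrelated hash values can be checked with a short Python sketch. The message text, the byte string standing in for the large file, and the helper names `bit_difference` and `flip_lowest_bit` are all assumptions made for this example; they are not the files or commands used in the demonstration.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count how many bit positions differ between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def flip_lowest_bit(data: bytes) -> bytes:
    """Flip the lowest bit of the first byte, e.g. hex 96 becomes hex 97."""
    return bytes([data[0] ^ 0x01]) + data[1:]


# Two text messages that differ only by a full stop (hypothetical contents).
message1 = b"This is our first message."
message2 = b"This is our first message"
d1 = hashlib.md5(message1).digest()
d2 = hashlib.md5(message2).digest()
text_diff = bit_difference(d1, d2)
print(d1.hex())
print(d2.hex())
print(text_diff, "of 128 hash bits differ")

# Binary data differing in exactly one bit (hex 96 changed to hex 97).
original = bytes.fromhex("96") + b"\x00" * 1000
modified = flip_lowest_bit(original)
binary_diff = bit_difference(hashlib.md5(original).digest(),
                             hashlib.md5(modified).digest())
print(binary_diff, "of 128 hash bits differ")
```

For a hash that behaves randomly, roughly half of the 128 output bits are expected to differ in each case, even though the inputs differ by one character or one bit.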
Everything else should be the same in the file, and then we'll do the hash. Of those 800 megabytes, we want to see the impact of changing just one bit. It doesn't matter which command we use to make the change. It just does a search and replace on that file, changing hexadecimal 96 into hexadecimal 97. It takes some time, and let's call the result base 2. Hopefully this works. It takes some time because it has to read through the whole file. We see we have two files of the same size, and now we calculate the md5sum on the second one. We'll do it on the first as well so it's easy to compare: again on the first file, and on the second file.
The file name does not matter. What md5sum does is calculate the MD5 hash of the contents of the file. That's just what this program does. The contents of the file do not contain the file name, so it's of no consequence. If I rename the file (we'll do that, a different name), we get the same hash value. It's working on the contents of the file. And again this shows that two files differing by a minimum of one bit, or just a few bits in the worst case, out of 800 megabytes, give two completely different hash values. That's what we'd like. We'll get to the SHA sum as well in a moment.
One more example, to make it clear. With MD5 again, I have a binary file instead of a text file: not a text file, but just random bits, a sequence of bits. Just to be clear, both files are 128 bytes long. What does this say about the two files? The hash values being the same suggests the contents of the files are the same. There's a program called cmp that compares the binary contents of two files. This program says the files differ. The hash values are the same, which implied the files are the same. Let's look closer. This may be hard to see, but we'll try. Zoom out a bit. There's the first one and there's the second one. Are they the same? Can anyone spot any differences? In the last row, here's one difference: 2B, AB.
I think there are a few others in there. The files are different. The contents of these two files are different, but the hashes of the files are the same. What does that tell us? MD5 considers the entire contents, so it tells us that we've found a collision. A collision is when we take two different messages as input and they produce the same hash value. That's what we have here. The two messages differ by a few bits, and when we calculate the MD5 hash of each, they get the same hash value. That's the problem. We said that it should be computationally infeasible for someone to find two different messages that produce the same hash value. MD5 is considered insecure for cryptographic purposes because people have found ways to find two different messages that produce the same MD5 hash value. That was one example. Those files were crafted specifically to be two different messages that produce the same hash value. We want it to be difficult to do that. We know in theory it's possible, but it should be hard for someone to find such messages. If they can find them easily, we say that hash algorithm is insecure, and MD5 today is considered insecure.
We'll do that again. Two different messages as input produce the same MD5 hash value, but we said the other hash function is called SHA. There are variations of SHA. There's SHA version 1, and then there's SHA version 2, and they produce different lengths of output. One produces 160 bits as output and one produces 256 bits as output. We'll calculate that one. SHA producing a 256-bit hash value for those two different files produces different hash values. That's good. The point is that, among the specific instances of SHA, there are some old ones which are not secure, but SHA-256 is considered secure, in that it's practically impossible for someone to find two different messages that produce the same hash value.
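The output lengths mentioned here can be confirmed with Python's standard `hashlib` module. This is only a small sketch: the helper name `digest_bits` and the sample input are assumptions for the example, and the algorithm names are the identifiers `hashlib` uses rather than the names of the command-line programs in the demonstration.

```python
import hashlib


def digest_bits(algorithm: str, data: bytes = b"example") -> int:
    """Return the hash length in bits for the named algorithm."""
    return hashlib.new(algorithm, data).digest_size * 8


sizes = {name: digest_bits(name) for name in ("md5", "sha1", "sha256", "sha512")}
for name, bits in sizes.items():
    # Each hex character shows 4 bits, so the printed hash has bits / 4 characters.
    print(name, bits, "bits,", bits // 4, "hex characters")
```

The length is a property of the algorithm, not of the input, so changing the sample data does not change any of these numbers.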
MD5 is considered insecure because it is possible to find collisions, although it's still used in some cases because it's so widely implemented and it's been used for a long time. SHA, as we'll see, has different variants: version 1, version 2 and version 3. Version 1 is considered insecure. Version 2, with particular hash value lengths, is considered secure for most purposes. Version 3 is secure. MD5 is considered insecure, although there are some purposes where we don't care about collisions and can use MD5, so it's still used in practice in some cases. There's also SHA-512, which is essentially the same design but produces a longer hash value as output: 512 bits.
Let's look at an example of how we can use hash functions for authentication. With message authentication, we want to check the integrity of the message that we receive. I want to make sure that the data I received is the same as what was sent and that the person who sent that data is who they say they are. It's the same as with MACs. And we use the hash function to provide message authentication. The output we'll call either the hash value or, as it's sometimes referred to, the message digest. What's the name of a hash function? The insecure one? MD5. MD, message digest: the Message Digest algorithm 5. So sometimes we call the hash value a message digest, or just the digest.
So let's see how we can use a hash function. There are a few examples here. With the time remaining, we'll stick with this one. This one's easy. In this case we use it in a similar way to a MAC function, but the difference between a hash and a MAC function is that a hash doesn't take a key as input. So in this case we have a message that we want to send to B, and let's say the message is a secret key, a random set of bits. We want to encrypt it so no one can see it, but so that B can check that they've got the right output when they decrypt, we'll also use the hash function.
So what we do in this case is take the message and calculate the hash of the message. We get a short hash value as output, and then we combine that with the message and encrypt them all with a symmetric key encryption algorithm, and send that to B. B decrypts and then checks, and the checking is similar to a MAC function: take the received message M, calculate the hash of the received message, and compare the calculated value with the received hash value. The assumption is that if they match, everything's okay. If they don't match, something's gone wrong, because it implies that the two messages that were used as input were different, and they should be the same. We may see some attacks on this or similar schemes, but let's look at one other example, which is a different one.
Here's a way of using a hash function to again prove that you're the sender, without encrypting. This approach uses no encryption. What it relies on is that both A and B have a shared secret key S. In this case, what the sender A does is take their shared secret S and combine it with the message (this is concatenation), then calculate a hash, combine the hash value with the message, and send the message on to B. There is no encryption here, no confidentiality. What B does is use their shared secret to confirm that this message came from A.
With ten minutes remaining, try to perform an attack on this scheme. So the scheme is: no encryption is used, just a hash function. There's no confidentiality. We want B to be able to confirm that this message came from A, and S is a shared secret. They both know it; no one else knows S. So see what an attacker needs to do to defeat this scheme: for example, modify the message and have it go unnoticed at B, or send a fake message to B pretending to be A. Try, and I'll start drawing it. When you ask whether it is possible, what we want to find is the requirements that make it impossible. It's possible under certain conditions, and that will define what we require of our hash
functions. A sends the message concatenated with a hash of the message concatenated with S. That's what's sent by A in this scheme. Let's say the malicious user intercepts it, and they're going to modify something and send it on to B, with the intent of tricking B into thinking it came from A and hasn't been modified. So what can the malicious user do?
First, can the malicious user see the contents of the message? Of course. The message is not encrypted, so there's no confidentiality. In this case we're not trying to achieve that. The malicious user sees the message.
What if they modify the message? Let's try. They send a modified message, let's denote it as M prime, concatenated with the hash. Let's say they don't modify the hash value; it's the same as before. They just modify the message. What does B do? B takes the received message M prime, concatenates it with their secret and calculates the hash, and let's say they get HC, the calculated value. The hash value they received, HR, is this part. Does the hash value calculated equal the hash value received? Why not? What B did was: we receive a message, we know the secret S, so we join the message and the secret S and calculate the hash using our hash function. We get HC, the calculated hash value. But we also received a hash value, this component, which I denote as HR. Because the inputs of the hash functions are different, the hash values should be different. So B will detect this by realising HC does not equal HR, therefore something's gone wrong. B detects that something's gone wrong in this case.
What else could the malicious user do? When the malicious user changes the message, what they need to do to defeat this check is find a message M prime such that, when it is combined with S, it produces the same hash value as HR. But since they don't know S, they don't know what to combine it with, so they cannot just try to find the same hash value HC. What else can the malicious user try to do? Recalculate
the hash. Okay, I think people will see that it won't work, but let's try it. Recalculate that hash value. We send M prime, in blue: M prime concatenated with a hash of M prime concatenated with... what we'd like to include is S here, and then it would work. But S is secret, so the malicious user shouldn't know S. So they use some other value, S prime, whatever you want. B checks: calculate the hash of the received message M prime with the secret shared with A, and the received hash value is just this component. S doesn't equal S prime, therefore HC doesn't equal HR. Again B detects the change. And this relies on the characteristic of our hash function that the hash of two different inputs produces two different outputs. If the attacker could find an input that produces the same output, that would be successful, but without knowing S they have a challenge there.
What about from a different perspective: can the attacker find S? How would the attacker find S? It's not encrypted. They know the hash of the message and S. There's no private key. How could they find S, or what approach would they take to find S? What does the attacker know? If they can find S, they'll defeat the scheme. They know M. They know the hash of M concatenated with S, so they know the hash value. If they can, given the hash value, find the input... So they know some hash value h. If they can do the inverse, that is, the inverse of the hash function on the output, it should return M concatenated with S. So given the hash value, try to calculate what the original input was. If they can do that, they get M concatenated with S, and since they know M, S is easy to find.
We said the one-way property of hash functions means that it should be hard, it should be practically impossible, given the hash value, to go back and find the original input, M concatenated with S. So as long as we have that one-way property, the attacker will not be able to find M concatenated with S and cannot find S. So we see that in this scheme, its security depends upon those properties. The one-way property: it's hard to go back, given the
hash value, to find the input. And the collision-free property: it's hard for someone to find two messages that produce the same hash value. If those properties hold, this scheme is useful for authentication.
What we'll do next lecture is look at some of those other schemes, but you should also try with these schemes. What I commonly ask in exams is: here's a scheme, show me why the attacker cannot defeat it. That requires you to ask yourself: what if the attacker did this, how would B detect that? What if the attacker did something else, how would B detect it? And it leads to the required properties of our hash functions.
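The shared-secret scheme and the two attack attempts walked through above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: SHA-256 stands in for the hash function H, the secret and messages are made-up values, and the names `make_hash`, `send` and `verify` are hypothetical helpers, not part of any library. Real systems would normally use a standard keyed construction such as HMAC rather than this bare form.

```python
import hashlib
import hmac


def make_hash(message: bytes, secret: bytes) -> bytes:
    """Calculate H(M || S): the hash of the message concatenated with the secret."""
    return hashlib.sha256(message + secret).digest()


def send(message: bytes, secret: bytes):
    """What A transmits: the message together with H(M || S). Nothing is encrypted."""
    return message, make_hash(message, secret)


def verify(received_message: bytes, received_hash: bytes, secret: bytes) -> bool:
    """What B does: recalculate HC from the received message and S, compare with HR."""
    calculated_hash = make_hash(received_message, secret)
    return hmac.compare_digest(calculated_hash, received_hash)


S = b"shared secret known only to A and B"  # assumed example value
M, HR = send(b"pay 10 dollars to C", S)

# No attack: B's calculated hash matches the received hash.
honest_ok = verify(M, HR, S)

# Attack 1: the attacker changes the message but keeps the old hash value.
M_prime = b"pay 10000 dollars to C"
attack1_ok = verify(M_prime, HR, S)

# Attack 2: the attacker recalculates the hash, but has to guess a secret S prime.
S_prime = b"attacker's guess"
attack2_ok = verify(M_prime, make_hash(M_prime, S_prime), S)

print(honest_ok, attack1_ok, attack2_ok)
```

Both attacks are detected because B's calculated value HC does not match the received value HR; the attacker would need either S itself or a second input with the same hash, which is exactly what the one-way and collision-free properties are meant to rule out.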