Today we're going to finish on password entropy and spend our time on storing passwords. Password entropy is a formal way to look at the strength of a password or a password selection scheme. Entropy is a general term for measuring information content: how much information is in a particular message. We can use it as a measure of how strong a password or a password selection scheme is. We finished last week by looking at, okay, if we have a selection of 10 digits, 0 through to 9, how many bits do we need to represent one of those 10 possible values? Well, 3.32 bits, because the log base 2 of 10 is 3.32. If we have the lowercase English letters, there are 26 possible values. So if you think in terms of how many bits are needed to represent 26 possible values, 4.7 bits. So this is our simple interpretation of entropy of passwords: the number of bits needed to represent a string. In this case, if it's a single letter, 4.7 bits. Most keyboards support about 94 printable characters. So ASCII characters, just in English that is. Uppercase, lowercase, digits and some punctuation characters. So if we have one character, there are 94 possible values. We need 6.55 bits to represent that in binary. Then we can use that to work out a measure of a particular password. For example, if you choose a password randomly, a randomly chosen password just of numbers, of digits, then each digit could be stored in 3.32 bits. So if your password length was 20 digits, 20 numbers, 20 times 3.32 is about 66 bits, a little over 64. So this is just one example saying, if you choose a 64 bit random value, it's about the same as 20 random digits, or 14 English lowercase characters, or 10 printable characters, 10 of those 94 characters. Because they all have an entropy close to 64, around 65 or 66, if you do the calculations. So we can say a password of 20 digits is about the same strength as a password of 14 English lowercase characters. So we can use entropy to measure and compare.
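As a rough check of the figures above, here is a minimal Python sketch. It assumes every character is chosen uniformly at random and independently; the function name `random_password_entropy` is made up for illustration. Under that assumption all three example passwords land in the mid-60s of bits.

```python
import math


def random_password_entropy(alphabet_size: int, length: int) -> float:
    """Entropy in bits of a password whose characters are each chosen
    uniformly at random from an alphabet of the given size."""
    return length * math.log2(alphabet_size)


# The three schemes compared in the lecture: (label, alphabet size, length).
schemes = [
    ("random digits", 10, 20),
    ("random lowercase letters", 26, 14),
    ("random printable characters", 94, 10),
]

for label, alphabet_size, length in schemes:
    bits = random_password_entropy(alphabet_size, length)
    print(f"{length} {label}: {bits:.1f} bits")
```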
Of course, assuming they're randomly chosen. Not everyone chooses random passwords, but if they were randomly chosen, they're the same strength. And the strength is in terms of how much effort it would take for an attacker to try them all in order to guess it. If we have a 64 bit value, it takes 2 to the power of 64 operations in the worst case. If we have a 20 digit password, it takes about 2 to the power of 64 operations. If we have a 14 character English password, it takes about 2 to the power of 64 operations to get that password in a brute force attack. If we have 10 printable characters, about 2 to the power of 64 attempts. Because there are 2 to the power of 64 possible passwords if we have 10 printable characters, approximately. It's actually closer to 2 to the power of 65, I think. So we finished on that last week. Anyone have questions on how to calculate the entropy of a particular password scheme? If you want to remember a formula, look at how many possible passwords there are; the entropy is log base 2 of that number. If there are 10,000 possible passwords, the entropy of that password or password selection scheme is log base 2 of 10,000. That's all. Unfortunately, we don't choose random passwords. Humans choose structured passwords. It would be nice to be able to compare the strength of passwords chosen by people to say, okay, this type of password is stronger or better than this. It's hard to estimate or calculate the entropy in cases where you don't choose random passwords. But different people have done studies. And one of them, it's quite old, but it's a nice example. A standards institute called NIST came up with the following data. Let's see how we interpret it. So there are some studies. If you want to see the details of the studies, they're available online, though it's getting a bit old now. But the way to interpret this is that in the right two columns are the randomly chosen passwords. So take an example.
If the user can choose from 94 possible characters, uppercase, lowercase English, digits, punctuation characters, the way to read this table is if the user chooses a password of length 10 characters, then the entropy of that password is about 65.9. So 94 possible characters, if you choose 10 characters long, then 65.9 is the entropy of that password. That's for a randomly chosen password. Let's just confirm that. If we choose randomly 10 characters from a set of 94 each time, how many passwords are possible? If you choose one letter from 94 possible values, there are 94 possible passwords. If we choose 10 letters, there are 94 to the power of 10 possible passwords (not 10 to the power of 94; I wrote it upside down at first). The entropy of such a password selection is log base 2 of 94 to the power of 10, which is, I don't have my calculator, but the table lists it as 65.9. You can check with your calculator: 10 times 6.55 gives about 65.5, so the table's figure is slightly higher than the direct calculation. 94 to the power of 10 possible passwords, therefore we say the entropy is log base 2 of that, which the table gives as 65.9. If you have an 11 character long password, then it's 94 to the power of 11, and the entropy is log base 2 of 94 to the power of 11. If we have a 12 character long password, the entropy is 79. So that's just how to read this table. Going backwards from the right column. If you chose a random password using just a set of 10 characters instead of 94, for example, you chose numbers, maybe for your ATM, to withdraw money from your bank, you must choose numbers. If you choose a PIN that is four digits long, then if it's chosen randomly, the entropy is 13.3. So this column here is for a 10 character alphabet, randomly chosen. There are 10 to the power of 4 possible PINs, and log base 2 of 10 to the power of 4 is 13.3. So that's randomly chosen passwords.
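The formula "entropy is log base 2 of the number of possible passwords" can be checked directly. This is a small sketch, with `entropy_bits` as an illustrative name; it computes the randomly chosen cases discussed above. Computed this way, the 94-character, length-10 case comes out near 65.5 bits, slightly below the 65.9 quoted from the table.

```python
import math


def entropy_bits(num_possible_passwords: int) -> float:
    """Entropy in bits of a scheme with this many equally likely passwords."""
    return math.log2(num_possible_passwords)


# Randomly chosen passwords: alphabet_size ** length possible values.
print(f"94 chars, length 10: {entropy_bits(94 ** 10):.1f} bits")
print(f"94 chars, length 12: {entropy_bits(94 ** 12):.1f} bits")
print(f"10 digits, length 4: {entropy_bits(10 ** 4):.1f} bits")
print(f"10 digits, length 10: {entropy_bits(10 ** 10):.1f} bits")
```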
The study done by NIST looked at what if people don't choose random passwords, but choose passwords based on information they know based on some structure. For example, if you choose the first letter in your password to be the letter T, then it's very unlikely that the next letter is going to be Q, because there are very few words in the English language that are T followed by Q. So they did studies, looked at, okay, given one letter, what's the probability of the next letter being a particular value? And they come up with these results and the columns, these four columns. A user chose a password based on some studies. If they chose from the 94 character alphabet, the printable characters on your keyboard, and if the user got to choose a password with no restrictions, then they come up with these approximate values for the entropy. If the user chose a 10 character long password, the entropy they calculated is about 21. Note, if the user chose a 10 character long password randomly, the entropy is about 65.9. There's a big difference in the entropy. That is, there's a big difference in the strength in that the higher the entropy, the stronger the password. If the user chooses passwords, it's much more likely that they'll choose passwords which follow some structure. They don't choose random passwords, and therefore there's much fewer passwords than an attacker would need to guess to find that password. So much weaker in terms of strength. So the lower entropy is a weaker password. Skipping the next two columns, if the user chose just numbers. So instead of getting a random pin for your ATM card, you choose a value. Then their study showed if you chose a four-digit value, the entropy is about 9. If you chose a 10-digit value, the entropy is about 15. Compared to randomly chosen or randomly generated, a 10-digit value, 10 numbers, random is about 33. Users choose much weaker passwords than random passwords. And I think we know that from... 
you think of your password compared to a random value, yours would have some structure. The last two columns, or the two columns we skipped over, are if the user was forced to choose a password following some rules. And I'd never remember the exact definition, but this dictionary rule was that they can choose a password, but it can't be a word in a dictionary. If they did, they have to try again. That makes it stronger, because they'd have to choose some different combination of letters. So a 10-character long password, the user gets to choose, but they cannot choose a word in a dictionary, gives an entropy of about 26. Better than if they get to choose any value, which gives an entropy of about 21. Because this approach requires them not to choose dictionary words. The dictionary and comp rule, what was that? I think they did some extra comparisons to some known structured words, and restricted even further. I cannot remember the detailed restrictions, but they say these things... these types of passwords are not allowed. Some rules to say they're not allowed. And that raises the entropy even further. This is all choosing a password 10 characters long from a 94-character alphabet, ranges from an entropy of 21 up to 32. If you chose randomly, the entropy is about 66. Much, much, much better. First it shows that user-chosen passwords are not strong compared to random-chosen passwords, and it gives a measure of that. And again, it's based upon some studies of how people choose passwords and how words are structured. Another way you can interpret this is if you want a password, which is equivalent to a random value, a random binary value of, say, 64 bits, then choose one that produces an entropy of 64. And you see a 40-character password, in this case, gives us an entropy of 56. So we need to be larger than 40 to get the entropy of 64. 
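To get a feel for what these entropy figures mean for an attacker, the sketch below converts the 10-character values quoted from the table into an approximate number of guesses, 2 to the power of the entropy. The dictionary keys are labels chosen here for illustration; the numbers are the ones read out in the lecture, not recomputed from the study.

```python
# Entropy figures (bits) quoted in the lecture for a 10-character password
# drawn from the 94-character alphabet.
quoted_entropy_bits = {
    "user chosen, no checks": 21,
    "user chosen, dictionary rule": 26,
    "user chosen, dictionary and composition rule": 32,
    "randomly chosen": 65.9,
}


def guesses_for(bits: float) -> float:
    """Approximate number of passwords an attacker must try in the worst case."""
    return 2 ** bits


baseline = guesses_for(quoted_entropy_bits["user chosen, no checks"])
for label, bits in quoted_entropy_bits.items():
    guesses = guesses_for(bits)
    print(f"{label}: about {guesses:.2e} guesses "
          f"({guesses / baseline:.0f}x the no-checks case)")
```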
So if you got to choose your own password, to choose a password which is equivalent in strength to 64 random bits, from this study you need more than 40 characters in your password. So this shows that for user-chosen passwords to be strong, they need to be very long. Too long to be convenient. No one has 40-character passwords. If you allow random passwords or force random passwords, you can get strong passwords, but again it's inconvenient, because it's hard to remember and type in random passwords. Maybe the main conclusion here: if you look at the typical length of passwords that people choose, in the order of 6 to 10, maybe 12 characters, the entropy, when people choose them, reaches about 30, 34. An entropy of 34 means there are about 17 billion possible passwords. And for a computer to try 17 billion possible values doesn't take long. That is, if the attacker can use a computer running at any reasonable speed, typical passwords chosen by users are not strong enough to stop the attacker guessing them. Computers can guess them. Any questions on entropy to finish this topic, this part? How do you calculate entropy? Easy: log base 2 of what? The number of possible passwords. So if I say in the exam, here's a password scheme, a simple one. A random password is generated, which is five characters chosen from uppercase or lowercase English. Five characters long, always. You think, okay, five characters, uppercase, lowercase, gives us 52 possible characters. So with 52 possible characters and five characters in length, we have 52 to the power of five possible passwords. Therefore, the entropy of such a scheme is log base 2 of 52 to the power of five. And you need your calculator for that one. So understand where entropy is used, that you can use it to compare password schemes, and understand how this table is interpreted.
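The exam-style scheme above (five random characters from uppercase and lowercase English) can be worked through as follows. This is a sketch only: `generate_password` is a hypothetical helper, and using Python's `secrets` module for the random choice is an assumption made here, not something the lecture specifies.

```python
import math
import secrets
import string

# 26 lowercase + 26 uppercase letters = 52 possible characters.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase
LENGTH = 5


def generate_password() -> str:
    """Pick each character uniformly at random from the 52-letter alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(LENGTH))


num_passwords = len(ALPHABET) ** LENGTH   # 52 to the power of 5
scheme_entropy = math.log2(num_passwords)  # log base 2 of that count

print("example password:", generate_password())
print("possible passwords:", num_passwords)
print(f"entropy: {scheme_entropy:.1f} bits")
```

The same two lines, with a different alphabet size and length, answer any question of this form.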
It gives some experimental data, based on how people choose passwords, about their entropy. One conclusion that we often can arrive at from this is that in most cases, people don't choose passwords which are strong enough to prevent a brute-force attack. If you choose a 10-character password with no checks, the entropy is 21, meaning there are about two to the power of 21 possible values, which is about two million possible values. If a computer gets to check at one million per second, it takes about 2 seconds to find your password, which isn't long at all. When we finish this topic and this course, we'll return to a few slides we missed at the start and then towards the end. The last main thing to cover is how do you store passwords? Why do we store passwords? Well, most systems need to store the password. Think of a website. You create a website and you allow users to log into your website. So what does a user first do? Before they can log in, they must do what? Sorry? Yeah, but which username and password do they enter? They must register first. So let's call this step registration. Before you can log in, you must register a username and password on that website. You create a new account on Google. You must register first. Maybe select your name, your username, a password. So this registration procedure creates your username and password and then the system stores that, let's say in a database, whether it's a file or an actual SQL database, it doesn't matter, but in some data store. Then later, when you want to log in, you submit your username and password and the system compares your submitted value with the stored value. If they match, you're logged in. If they don't match, you have to try again, or it takes some action to block you. So we must store the password in the system.
So the users register and that information about the ID, your username, ID on the slides, and the password, or we'll see something related to the password, and maybe some other information like your first name, full name and so on. But the core information is the ID or username and password that must be stored in a file or database. So that later when you log in, you want to access the system, you submit your ID and password and the system compares your submitted values against the stored values. If they match, you're okay, you're logged in. Any questions about that? Maybe some of you have implemented on websites already. Well, you will in the future. Not just websites. When you log in to Moodle, when you log in to ICT server, when you log in to the SIT internet access system, same approach. You register a username and password, then you log in and the system compares the registered versus the submitted values. So what we want to look at is when the system stores your registered values, in particular when it stores your password, how should the system store it? That's what we're focusing on. We'll look at several options and arrive at the final solution after looking at the drawbacks of some of the earlier options. The first obvious option, how do you store the password? Store it in the clear. Store the username or ID and the password of the user in some database. And for each user that's registered, just increase add another row to your database. Let's see if I have an example. So first option. And who has the handouts with them? Let me just check the handouts and see what I provided you with. Towards the end of your printed handouts, at the end of the password slides, there's a printout of a document on passwords, hashes and rainbow tables. And some of the examples that I'll show are from that. So you have it in your printed handouts. Let me bring it up. Bring it up so you can see. That's one of them that you have in the printed handout. Here's an example. 
So you've created a website, for example, and users register. And as part of that website, you have a database that stores registered usernames and passwords. So the database we can think of is some table where we have two columns, one column containing the ID or username, and the other column containing the password that the user chose, the registered password. So there's the first approach. What's wrong with it? If you store this in your database on your website, what can go wrong? So you've created your new website. Users, many of your friends and many people are registering. Thousands of people are registering. They're adding rows. Your website stores a database, adds rows to this table. What's wrong with this approach? What can go wrong? Okay, first thing, whoever has access to this database, maybe you and a few of your friends are creating the website. Whoever has access to that database can go in and see all the passwords of all the users. Well, is that a big problem? It's hard to stop that anyway. Yes, that's a problem to some extent in that, okay, let's say the three of you have created your own username and password, your developers, you can see the other person's password. Okay, so if you store the passwords in the clear, it's very easy for someone who has access to the database to go and see other people's passwords. But it's in fact hard to stop the developers from accessing the passwords. So if you create the website, it's very hard to stop you, the creator of the website, from being able to see the users, those that register a password. Okay, so that's something that's very hard to avoid. 
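The first approach, a two-column table of username and clear-text password, can be sketched as below. This is only an illustration of the scheme being discussed, which the lecture goes on to criticise; the in-memory dictionary stands in for the database, and `register` and `login` are hypothetical names.

```python
# In-memory stand-in for the two-column table: username -> password in the clear.
password_table = {}


def register(username: str, password: str) -> None:
    """Add a row for a newly registered user."""
    password_table[username] = password


def login(username: str, submitted_password: str) -> bool:
    """Compare the submitted password with the stored one."""
    stored = password_table.get(username)
    return stored is not None and stored == submitted_password


register("john", "MySecret")
register("daniel", "MySecret")

print(login("john", "MySecret"))   # correct password
print(login("john", "ABC"))        # wrong password

# Anyone who can read the table sees every password directly.
for username, password in password_table.items():
    print(username, password)
```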
Another problem with this approach is that let's say you create your database, you have your website, it becomes popular, and then there's someone out on the internet that tries to break into your website through other means, some form of security attack on your website, and it turns out that you got a low grade in CSS322, and you didn't develop your website very well, and it turns out that some attacker can access your database. And if an attacker could access your database, so someone from outside could access it through some other security attack, then that attacker has now learnt the password of every user. That's a problem, and that's a real problem. So as the developer of the website, you don't want to have to assume that no one can get access to this database. Preferably no one else can access this database, but what if there's some other flaw in your website such that someone could get access to the database? If so, let's say your database was for Facebook. You have the tens of millions of users and their passwords stored in some database. If an attacker could get access to that database, they've now learnt the password of tens of millions of people, and that's a big security problem. So that's the main problem of storing the password in the clear. You can't store it in a form that people can easily access it. So from now on, we'll assume that there's some malicious user on the internet who can read your database. Even if you didn't intend them to, that somehow they've got other means to read the database. So if a malicious user can read the database, how are you going to store the password? Encrypt is the first obvious solution. You know if you want to have some data, and if someone can access that data, the way to protect that data so no one can read the contents is to encrypt that data. So let's go back to our slides. Make sure I may have a... Let's continue with this example to see what happens if we encrypted the password. 
So instead of storing the password, we stored some encrypted value, and I'll just write some random characters here. Some unstructured values. So this would be encrypt the password. And the second password would be some other random ciphertext, whatever the values are after we encrypt, and so on. So instead of storing the password, store the encrypted form of the password with the idea that if someone can now access this database using some other attack, they can see the usernames. They cannot directly see the password because it's not stored. They can see the ciphertext. So then we need to try and stop the attacker from being able to get from the ciphertext back to the original password. Now before we see what the attacker can do, when I encrypted the password, what did I use? What key? What key should we encrypt the passwords with? We must use some key. Anyone want to... different options? How should you encrypt the password? Whose private key? My private key. But everyone has my public key so everyone can decrypt, including the attacker. Don't encrypt with my private key. That doesn't provide confidentiality. What else could you try? Sorry? Encrypt with... What do you mean by an ID? An username. Let's say we encrypt and use the pass... The key to encrypt this string use the username. Okay. So the key is the username. But now if the attacker obtains this database, they can easily decrypt because the attacker has the ciphertext. They know that you used the username so they can quickly decrypt and they get the original password back. So the idea is that we want to make it hard for the attacker, if they get access to this database, to find the passwords of users. So no, don't encrypt with the username. What can we encrypt using? What key? What type of key? Sorry? Encrypt with the password. Okay. So I take the string MySecret as the plaintext and as a key I use the same string, MySecret. Okay. Now, fine. When... When you want to check when someone logs in. Okay. 
Now, John logs in. He submits his username and password. The system must check the submitted values against the stored values. How do we decrypt this ciphertext to get the stored value? Use the password that he entered. What if it was wrong? Okay. So if we encrypt the password and use that same password for the key, we get ciphertext and we can decrypt using the submitted password. Any problems? Who cannot retrieve? The entity that stored the password. If we store the ciphertext, how do we get the original plaintext back? Yeah. The idea is that if we... Let's write it down. We encrypt where the key is the password and the plaintext is the same value. We get some ciphertext and we store that ciphertext. So the value of C is stored in the database. And then someone logs in. John tries to log in and he submits his username and his password. So the system decrypts, and what does it get as the output? The original plaintext. So there's the idea. Encrypt the password using a key which is the same password. When the user submits the username and password to log in, the system needs to check. And the check is: take the submitted value and use that to decrypt the ciphertext. It was encrypted with the key MySecret, a shared secret key. Decrypt using the same value and you should get the original plaintext as output. Does it work? Does it decrypt successfully? I think it would decrypt successfully there. What if the user submitted the wrong password? They typed in the wrong value here. They typed in ABC. Then we decrypt the ciphertext using key ABC. What do we get as output? Some random text. So we can recognize that that's wrong in that case. Any problems? Yep. You're too advanced asking about MD5. I haven't got there yet. We'll cover MD5 in a moment. You're correct. Any problems? No. Assuming the algorithm is strong, then if you use two different keys, you should get two different plaintexts. So assuming that's the case, that's okay.
We'd have a problem if people used the same password. That would be a problem. It's likely that multiple users used the same password. That is, John and Daniel both used the same password. They don't know it, but they do. What would the problem of that be? It would be that the same value would be stored in the database. An attacker may be able to take advantage of that. The fact that two values are the same. Then if John saw the database and saw Daniel had the same ciphertext as he did, then he knows that Daniel has the same key and the same password as he does. So there's a small problem. What's a general problem with encryption and decryption? One is that in terms of performance, it's slow. The other problem, the key here. So in practice, we shouldn't use a short key to encrypt because if you're using, say, AES, your key needs to be 128 bits. So here the key is only the password length. So in terms of brute force attack, a brute force attack from an attacker, if they have this database, what do they do? Try all the keys. That is, try all the possible passwords. And generally that can be done quite quickly because the number of passwords is quite small in this case. If you have a password of, in this case, eight characters, we'll see the numbers shortly. You may consider what else may go wrong with this scheme. There are some performance issues. There's these issues if you have the same password. There's the issue of a brute force attack. It is possible if an attacker has this ciphertext finding the password is not too hard because the password is the key in this case. But someone asked, well, what about using MD5? What about using a hash function? So let's look at an alternative approach. Another way to do encryption, and I will not try and draw it, is to encrypt all of the passwords using some secret key that the server has. 
So instead of encrypting using the password, and it's an approach that works in some cases, encrypt the password using some long secret key. Again, that works until we need to store that secret key somewhere. And if we store that secret key, say, on our server and the attacker can access the database and the secret key, then they can decrypt and get all the passwords. So encrypting, if we use a long key, it needs to be stored somewhere and the attacker may be able to discover that value and therefore decrypt them all. If we encrypt with the password itself, then a brute force attack is very easy in that case because the passwords are quite short. Let's try another approach. If we encrypt the passwords with some secret key k, then we must keep that secret key secret, which is hard when we want to store it on the server to be able to decrypt all the time. Someone suggested using hash functions and that's generally the way to go and we'll see how they can be used. That's the page we want. Here's an extension of our previous storage. Instead of encrypting or storing in plain text, take a hash of the password. Store the hash value. Hash functions. What are two known hash functions? You said one before? MD5 and SHA. There are others. They're not the only ones, but they're the ones you may have heard of. So when the user registers their password, the system stores the username and takes a hash of the password and stores the hash value. Let's see what happens when a user tries to log in. A user logs in. So this is the database on the server. So let's say the client and there's a server. The login involves the user, the client submitting their username and password, sending it to the server. So John tries to log in. He submits his username and password to the server. This should be encrypted using encrypted communications. They submit it to the server. So this is the, let's call it, the submitted values of the username and password. The server looks up in the database. 
Now it takes the submitted value, John finds the row in the database where the username is John, takes the submitted password, MySecret, takes the hash of that submitted value and compares the hash of the submitted value against the hash value stored in the database. And you know based on our properties of hash functions that they should be the same. If we take the hash value of the two same values as input, we'll get the same hash value as output. So the hash of MySecret matches this 0, 6, C, 2, so on. Where this value was obtained originally when the user chose their password of MySecret. So compare the hash of the submitted value with the stored hash value. If they match and they will in this example, so the hash of MySecret equals that long value, then success, log in. If they don't match in another case, maybe that's an E, doesn't matter what it is. Maybe they make a typo when they type it in. They don't enter the right password. The server takes the hash of the submitted value and compares against the stored value and we'll see that they don't match in this case. And therefore fail the login. Why don't they match? Again coming back to our properties of hash functions. The hash of two different inputs should produce two different outputs. That was one of our assumptions. When we hash the original MySecret, we get one value. When we hash something different, even if it's different by a little bit, we'll get a different hash value. So when we compare the hash of this submitted password versus the stored hash value, they will not be the same and the login fails. And that works. If our hash functions are appropriate, we have this property, then it works. Any questions on how it works at the moment? What if the hash is not good enough? Good question. What if the hash function is such that the hash of one value produces the same hash value as some other value? That is that we have these collisions. 
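The register and login steps just described can be sketched as below. MD5 is used only because it is the hash named in the lecture's example; the dictionary standing in for the database and the names `hash_password`, `register` and `login` are assumptions for illustration.

```python
import hashlib

# In-memory stand-in for the table: username -> hex hash of the password.
hash_table = {}


def hash_password(password: str) -> str:
    """Hash the password with MD5, the function used in the lecture's example."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def register(username: str, password: str) -> None:
    """Store only the hash of the chosen password, never the password itself."""
    hash_table[username] = hash_password(password)


def login(username: str, submitted_password: str) -> bool:
    """Hash the submitted password and compare it with the stored hash."""
    stored_hash = hash_table.get(username)
    if stored_hash is None:
        return False
    return hash_password(submitted_password) == stored_hash


register("john", "MySecret")
print(login("john", "MySecret"))   # same input, same hash: login succeeds
print(login("john", "MySecrat"))   # a one-letter typo gives a different hash
```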
Then yes, we can have problems, but still it's very hard for the attacker to find out which password to try to produce the correct hash value. They still need to make many attempts to do that. So yes, we should have a good hash function, but if we return to those properties, what do we have? The one-way property, weak collision resistance and strong collision resistance. It is the one-way property and weak collision resistance that are needed in this hash function. Strong collision resistance is not usually an issue. So yes, we need a good hash function. What else can go wrong? What does an attacker do if they access the database? They know the hash function, okay? It's public. Let's say we know, as the attacker, the hash function used. What does the attacker do? Sorry? Yep. Okay, let's go back. One thing that hasn't helped here: remember John and Daniel had the same password originally. They didn't know that, but they just chose the same password. Still, the hash value stored in the password database is the same. What's the problem with that? It means if John can read this database, then he's immediately learnt Daniel's password. Okay, because John knows his password is MySecret. He realises Daniel's got the same hash value as me. Therefore, Daniel's password must be MySecret as well. So that's the first problem, but it's only a minor problem in most cases, okay? Although it may be possible for users to access this database and for multiple users to have the same password, it's unlikely both happen. It's unlikely that John is the malicious user that gets access to this database. So it's a minor problem. Let's consider an example and add some numbers to see what an attacker would need to do. Let's say the hash function is MD5, which produces a 128-bit hash value. And that's what we see in this example, 128 bits written in hexadecimal, isn't it?
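Two of the points above are easy to confirm with Python's standard `hashlib`: an MD5 value is 128 bits (32 hexadecimal digits), and two users with the same password end up with the same stored hash. The usernames and the helper name `md5_hex` are illustrative.

```python
import hashlib


def md5_hex(password: str) -> str:
    """MD5 hash of the password as a hexadecimal string."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


john_hash = md5_hex("MySecret")
daniel_hash = md5_hex("MySecret")

# Each hexadecimal digit carries 4 bits, so 32 digits is 128 bits.
print(len(john_hash), "hex digits =", len(john_hash) * 4, "bits")

# Identical passwords give identical rows, so John can spot that
# Daniel chose the same password just by comparing hash values.
print("same stored hash:", john_hash == daniel_hash)
```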
Remember the one-way property, which says it should be easy to calculate the hash of the password, but given the hash value, it should be hard to go back to the original input. That was our desired property of the hash function. And generally, to defeat the one-way property, the number of attempts needed is 2 to the power of the number of bits in the hash value. That is, if an attacker has this database of hash values, to take one hash value and go backwards and find the original password in a brute force attack on the one-way property takes about 2 to the power of 128 attempts. And it quite simply involves trying random inputs and eventually you get a collision, the correct value. Let's put some numbers to how long it takes. And I did some looking up on the websites about how fast we can calculate these hash values. Let's say we can do it at a speed of 10 to the power of 10 attempts, or generally, hashes, per second. That is, an attacker knows the hash values, they've got a computer, they download this database, they have a computer that can make 10 to the power of 10 attempts per second. Or calculate 10 to the power of 10 hashes per second. And that's about typical speeds of some advanced hardware that is available today, not very expensive. So in the one-way property, they may need to make 2 to the power of 128 attempts. How many seconds? Well, 2 to the power of 128 divided by 10 to the power of 10. I'll say the time to defeat the one-way property is 2 to the 128. That's how many attempts, divided by 10 to the power of 10. Anyone have a calculator? 2 to the power of 128 divided by 10 to the power of 10 seconds. And we'll convert that into minutes, hours, days, years. So the answer here will be years. 10 to the power of 21 years. It's not possible. So when we have 2 to the power of 128 attempts, even at this speed, if we tripled the speed, if we increased by a factor of 1,000, still it's going to be many universe lifetimes. So not possible. 
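The time estimate above can be reproduced in a few lines. The rate of 10 to the power of 10 hashes per second is the lecture's assumed figure; the variable names are illustrative, and a year is taken as 365 days.

```python
ATTEMPTS = 2 ** 128            # brute force on the one-way property of a 128-bit hash
HASHES_PER_SECOND = 10 ** 10   # attacker speed assumed in the lecture
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

seconds = ATTEMPTS / HASHES_PER_SECOND
years = seconds / SECONDS_PER_YEAR

print(f"{seconds:.2e} seconds")
print(f"{years:.2e} years")

# Even a machine 1,000 times faster leaves an infeasible amount of time.
print(f"{years / 1000:.2e} years at 1,000x the speed")
```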
That's a raw brute force attack defeating the one-way property. But there's an easier attack. Suggestions? You have downloaded this database. You have the hash values. If you want to find the corresponding passwords for those hash values, what do you do? Don't try and break the hash function. We just tried to break it by attacking the one-way property. It won't work. It turns out strong collision resistance doesn't help in this case. What can I give you? Let's make it easy. The passwords that everyone chose are all 8 characters long. Generate the hashes for lots of passwords, for common passwords to start with. Let's say we have a dictionary of words, 8 characters long, or combinations of common words. Take the hash of them and compare those hash values against the stored values. So take the hash of known words and then compare those calculated hashes against these. If any of them match, we've found the password. In fact, we can extend that. If the character set is the printable characters, reasonable to assume, 94 different characters, 8 characters long, a brute force attack here would try all possible passwords. Let's see how long that would take. So a different attack. Let's draw it again. A different attack. This is a brute force on the passwords. But we need to make some assumptions. Let's assume that a password is 8 characters. When the user chose a password, they were restricted to having exactly 8 characters. And the character set has 94 characters. That is the printable characters on the keyboard. How many possible passwords? How many passwords do we have? That's correct. So the attacker knows that there are only 94 to the power of 8 possible passwords. All passwords are exactly 8 characters long and they are from the character set of uppercase and lowercase English letters, digits and the 32 punctuation characters. So what the attacker does is just take all possible passwords, calculate the hash of those passwords and then compare them against the known hash values.
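The hash-and-compare idea can be sketched in a few lines of Python. This is only an illustration under some assumptions: the usernames, the tiny word list and the function names are made up, and MD5 is used because it is the hash function in the lecture's example.

```python
import hashlib


def md5_hex(password: str) -> str:
    """MD5 hash of a password, as a hexadecimal string."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def dictionary_attack(stolen_hashes: dict, candidates: list) -> dict:
    """Hash each candidate password and compare it against the stolen hash values.

    stolen_hashes maps username -> stored hash. Returns username -> recovered password.
    """
    users_by_hash = {}
    for user, stored in stolen_hashes.items():
        users_by_hash.setdefault(stored, []).append(user)

    recovered = {}
    for word in candidates:
        for user in users_by_hash.get(md5_hex(word), []):
            recovered[user] = word
    return recovered


# Hypothetical stolen database: John and Daniel share a password, Alice's is not in the list.
stolen = {
    "john": md5_hex("MySecret"),
    "daniel": md5_hex("MySecret"),
    "alice": md5_hex("x9!Qz#7p"),
}
found = dictionary_attack(stolen, ["password", "12345678", "MySecret"])
print(found)
```

A full brute force attack is the same loop with the word list replaced by every possible 8-character string.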
The time to do this for all passwords, so the worst case, is the number of possible passwords to try, divided by our rate or speed, same as before, which is approximately... Any guesses? Anyone calculate the answer? I think I've done it before. Let's do it on our calculator. That's the number of seconds, minutes, hours, days. Okay, so it takes about 7 days. Is that okay from the attacker's perspective? Well, yeah, in some cases that's fine. I can wait a week if I can get passwords for important people. They're going to return something to me, from an attacker's perspective. I would like it to be faster, but 7 days is practical in some cases. That is, have a computer that takes this database and just leave that computer processing through, calculating the hashes of all 94 to the power of 8 possible passwords. You generate all the hashes and then just do a comparison, where comparisons don't take much time. Compare the calculated hash values against the known hash values and you'll find the password for all of those users. How can you make it faster? Any questions so far on how that attack works? 7 days is simply the number of passwords divided by the speed at which I try them. How are you going to make it faster? I'm impatient. What if I had a slower computer? I couldn't afford a fast computer. Fast computer, this one cost me, I don't know, $60,000. I don't want to spend that. Any ways to make it faster? Okay, it's most likely that users, if they got to choose passwords, chose common words. So it makes sense to try them first. But this system generated random passwords for users. Alright, it didn't, but let's say it did. Then any other way to make it faster, apart from changing the computer speed? Well, maybe get someone else to do it for you. Or, more precisely, someone does this once and then they save the results in a large database, and then the next person who wants to break a password which is 8 characters long just reuses the database.
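The 7 days figure can be checked with a short Python sketch. The inputs are the lecture's assumptions (94 printable characters, exactly 8 characters, 10 to the power of 10 hashes per second); the function name is just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def brute_force_days(charset_size: int, length: int, hashes_per_second: int) -> float:
    """Worst-case days to hash every password of exactly `length` characters."""
    candidates = charset_size ** length
    return candidates / hashes_per_second / SECONDS_PER_DAY


days = brute_force_days(94, 8, 10 ** 10)
print(f"{94 ** 8:.2e} passwords, about {days:.1f} days")
```

Changing the length or the character set size in the call shows how quickly the time grows with each extra character.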
They don't have to recalculate all the hashes. That is, the attacker who just calculated across seven days creates their own database containing password and hash value. And in there, they store for every password, I'll say P1, the calculated hash value, H1, then the second password, the corresponding hash value. That is, the first time we do this, we take seven days and we generate for all possible passwords all possible hash values. And while generating them, we store them in a database. Let's see how that helps. And we keep storing them. How many passwords? It goes up to P subscript 94 to the power of 8. That's an 8. Draw it again. And all the hash values. So this is what the attacker does. They generate this database. Now, someone else later wants to do the same attack. They have their own database of hash values that they've got from some web server. They don't need to go through seven days to find all these values. They just reuse this table. And given their hash values, they just look up in this column, find the match, and they've found the password. This lookup procedure will be much faster than calculating hashes. Searching for a value in a database is much, much faster than calculating the hashes of all those passwords. Hash functions are generally slow compared to lookups and so on. So the idea is that because hash functions are slow, calculate the hashes just once as the attacker, and then reuse this database. Because this database stores the hash values for every 8-character password. So for any other website that stores a hash value of an 8-character password, we just look it up in this database. How do I get this database? I'm the attacker. I've spent seven days generating this database. Maybe there's more than 8 characters. Maybe it takes longer. I generate the database. Then you're another attacker, some other malicious person. You have some hash values. You want to find the password. What are you going to do? Ask me for my database. Of course.
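The precomputed table of passwords P1, P2, ... and hash values H1, H2, ... can be sketched in Python on a toy scale. This is an illustration only: it assumes a 4-letter character set and 4-character passwords so the table has 256 rows instead of 94 to the power of 8, and the function name is made up.

```python
import hashlib
import itertools


def build_lookup_table(charset: str, length: int) -> dict:
    """Hash every possible password once and index the passwords by hash value."""
    table = {}
    for chars in itertools.product(charset, repeat=length):
        password = "".join(chars)
        table[hashlib.md5(password.encode("utf-8")).hexdigest()] = password
    return table


# The slow part, done once by the first attacker.
table = build_lookup_table("abcd", 4)

# The fast part, done by anyone who later has the table and a stolen hash value.
stolen_hash = hashlib.md5(b"dcba").hexdigest()
print(len(table), table.get(stolen_hash))
```

The lookup is a single dictionary access, while building the table costs one hash calculation per possible password.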
I'll give you my database for a cost. And that's what happens in practice. Someone generates these hash values, stores them in a large database, and then sells them to other people who want to use them. Because once you have this database, to look up a hash value is very, very fast. To calculate it may take seven days, but the lookup may take a matter of minutes, maybe hours worst case. You can find websites that will sell you such databases. Just the hash values, with different hash algorithms, of many different passwords. And that makes finding the password much, much easier for an attacker. Understand the concept so far? Questions? Okay. Ready for an exam question about passwords? What, next week? Yes. Take the same hash function as in our example, MD5. So I took the first eight-character password. You know. A, A, A, A. Eight A's in a row. Calculate the hash with MD5. I store the hash value here, whatever it is. Take seven A's and then a B. Take the MD5 hash. Store it here. Just do it for all possible passwords. Store all hash values. Now, someone comes to me and they say, I have a hash value such as 5fc2bb and so on. What's the password? Then all we need to do is just search through this column looking for the hash value. Once we've found it, we've found the password. Which is very fast compared to calculating hash values. How big is the database? Let's say no compression. Just the raw data. How big is it? Well, look. There's a table. How many rows are there? The number of rows is the number of possible passwords. In this case, 94 to the power of 8. Two columns. How big is the first column? We said it was an 8-character password, so let's say it's 8 bytes. To store it, we need 8 bytes. How big is the hash? MD5 produces a 128-bit hash value, which is 16 bytes. So in fact we have 24 bytes in every row. And how many rows? 94 to the power of 8. How many bytes is that? I've calculated it before, about what? We have a problem.
For the attacker to do this in this very naive approach, to store all of these values, they need a set of disks to store 146,000 terabytes. Who has that? Well, maybe some organizations do, but my option is to go and buy 146,000 or 40,000 disks and fill them over 7 days. We haven't gained much here. The storage space here is too large to do this. But this storage of the actual values without any compression is very naive. You can actually compress some data. And it turns out there are algorithms or data structures for storing this information in a very, very compressed form. That is, all of this information goes into a data structure that condenses it to be much, much smaller. I did some lookups, and you can go to websites and download or buy these databases. The data structure is called a rainbow table. We don't need to know how they work, but you may have heard of them. And this amount of data stored in this compressed form, this rainbow table data structure, takes about 576 gigabytes. Much more manageable. One disk can store all of this. So it's just a specialized data structure for storing passwords and hash values in a very compressed form. So really designed for storing this information. So the first attacker, they generate this database, this rainbow table, and it takes them seven days. They store it on one disk, one hard drive. Someone else comes along and they pay them 20,000 baht to get this hard disk from them. And now the next attacker can reuse this rainbow table for any hash values they need to look up, and find the password quickly. So really this is a trade-off between storage and processing time. The first person who generates this takes about seven days, but once it's generated we can store the information on a disk, for example, and then once we have it stored and have it accessible, the lookup time is almost insignificant: minutes, hours in the worst case. So now as an attacker, given a hash value, it takes me tens of minutes to find the password.
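The 146,000 terabytes figure for the naive table comes straight from the row count and row size. A minimal Python sketch, assuming 8 bytes per password, 16 bytes per MD5 hash, no compression, and decimal terabytes (10 to the power of 12 bytes); the variable names are only for illustration.

```python
ROWS = 94 ** 8           # one row per possible 8-character password
PASSWORD_BYTES = 8       # 8 characters stored as 8 bytes
MD5_BYTES = 128 // 8     # 128-bit hash value = 16 bytes

naive_bytes = ROWS * (PASSWORD_BYTES + MD5_BYTES)
naive_terabytes = naive_bytes / 10 ** 12

# The lecture's quoted rainbow table size for the same data, for comparison.
rainbow_terabytes = 576 / 1000
reduction = naive_terabytes / rainbow_terabytes

print(f"naive table: about {naive_terabytes:,.0f} TB")
print(f"rainbow table is roughly {reduction:,.0f} times smaller")
```

So the rainbow table is smaller than the raw table by a factor of a few hundred thousand, which is what makes storing it on one disk possible.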
Rainbow tables, you can download them, or you pay some money and you can download them, or people will send you a disk with their values on it. Again, tens of thousands of baht to buy one of these. With different hash algorithms, with different lengths of passwords. But of course, if we add one more character to the password to make it nine instead of eight, we multiply the number of rows by 94, so effectively the size increases by a factor of about 100. So instead of half a terabyte we're up to about 50 terabytes, and that becomes again more expensive. 50 terabytes on disks is not something that is going to cost a few thousand dollars. And these have been used in the past. So we're going to continue this lecture for another hour and a half. Anyone have a lecture this afternoon? No, we will not continue, but we'll spend five minutes just finishing this topic, this part. Don't worry, only joking. I'm as tired as you are, don't worry. But let's just lead to the point: what's the next step? So now, just to summarize, if we store the hash of the password, then depending upon the speed of our computer and the length of the password, an attacker can still sometimes find the correct password. In our example it takes about seven days. But to be even faster, from the attacker's perspective, once someone has generated this data they store it and then just reuse it, because they don't have to generate it again, and looking up the values is much, much faster than calculating hash values. So the next approach in storing passwords, and we won't complete it, we'll just introduce it, is to add some salt, and the example from our handouts is here. So we don't store the hash of the password. We introduce a new piece of data.
When a user registers, we store their username, and the system generates a random value, just a random number. We call it a salt; we add a little bit extra. So this random value is just this column. The salt column is just a random number; I've just written it as printable characters. And instead of storing the hash of the password, we combine the password and this random salt, hash them together, and we store this hash value. Now, when the user submits their username and password, they submit their username, John, and their password, let's say MySecret. The system combines the submitted password with the stored salt, calculates the hash and compares it to the stored hash value. If they're the same, everything's okay. What advantage does a salt give us? There's one small one but one major one. Maybe you'll see the small one up there. Remember John and Daniel? Yeah. Now, John and Daniel before, they had the same password and therefore the same hash value. Now, using this random salt, they'll get a different salt, the system chooses that for the user, and therefore they'll get a different hash value. So there's a minor benefit, but not the major one. What's the major one? You're an attacker. You have this list of hash values, you have the list of salts, you've obtained this database. How do you use the rainbow table? I've generated a rainbow table, okay, and since you're nice I'll give it to you for free. I spent seven days creating it; it's that 576 gigabytes. You're another person, you come to me and say, can I have your rainbow table? Yeah, okay, here. What's the problem? When I generated the rainbow table, I did it for one salt value. In fact, originally I didn't use a salt, but in the first case I would use one salt value. Let's say I chose all zeros as the random salt. When you come to me asking for a rainbow table, you must get the rainbow table for your salt value, let's say A4 H star 1, which means I must have already generated that one if you're going to get it from me quickly. Otherwise you have to wait for me to generate it again.
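The register and login steps with a salt can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the system from the handout: the in-memory dictionary, the function names and the 4-byte (32-bit) salt are for illustration, the password and salt are combined by simple concatenation, and MD5 is used only because it is the hash function in the lecture's example.

```python
import hashlib
import secrets

SALT_BYTES = 4  # assumption: a 32-bit random salt


def hash_with_salt(password: str, salt: bytes) -> str:
    """Hash of the password concatenated with the salt, as hexadecimal."""
    return hashlib.md5(password.encode("utf-8") + salt).hexdigest()


def register(db: dict, username: str, password: str) -> None:
    """Store username -> (salt, hash). The password itself is never stored."""
    salt = secrets.token_bytes(SALT_BYTES)  # the system chooses the salt, not the user
    db[username] = (salt, hash_with_salt(password, salt))


def login(db: dict, username: str, password: str) -> bool:
    """Recompute the hash from the submitted password and the stored salt, then compare."""
    record = db.get(username)
    if record is None:
        return False
    salt, stored_hash = record
    return secrets.compare_digest(hash_with_salt(password, salt), stored_hash)


db = {}
register(db, "john", "MySecret")
register(db, "daniel", "MySecret")
print(login(db, "john", "MySecret"), login(db, "john", "wrong"))
```

With different random salts, John and Daniel end up with different stored hash values even though they chose the same password.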
There's no gain there from using the rainbow table. So to use rainbow tables in this case requires the person who generates them to generate one rainbow table for every salt value. These salts, I think I chose to be 32 bits. Almost done for today. The salt length, let's say, is a 32-bit random value. It doesn't look like 32 bits here, it's just some printable characters, but it's a 32-bit binary value. How many possible salts? 2 to the power of 32, which is about 4 billion. Now what I would need to do as an attacker is to generate rainbow table 1 using salt value 1, 32 zeros for example, and it would be our 576 gigabytes. And I'd need a second rainbow table with a different salt value, another 576 gigabytes or about that. And I'd need 2 to the power of 32 different rainbow tables, with each salt value covered. If I did this, if I've already created all these rainbow tables, what you do is you come to me and you say, I want the rainbow table for salt A4 H star 1. I go through my rainbow tables, I find the disk with that one, and I give you the disk, and then you can find the password quickly. That would work. What's the problem? So I'm one attacker, I've generated all these rainbow tables, then you come with the salt and the hash and you need to find the password. The problem with this is I have to store those rainbow tables, and before I store them I must create them. Remember, we calculated that one takes about 7 days to create. So to create 4 billion takes 28 billion days, which is almost 80 million years to create all of these. And even if I had that long to create them, each one is about half a terabyte, so we need about 2 billion terabytes to store them. We cannot create them, and even if we could, we cannot store them. So from the attacker's perspective there's no way to create these 4 billion rainbow tables so that you could come to me and ask me for one of them. In theory I should create all of them because I don't know
what the salt will be that you need to look up. We cannot predict what it will be because it's random. And the main benefit of a salt is that an attacker cannot use pre-stored or pre-calculated hash values, and the way to store those hash values is the rainbow table. So rainbow tables are ineffective if we use a salt. If we don't use a salt to store the password, someone can use a rainbow table to quickly find the password. And this is the recommended way to store passwords. This is the way that, when you go and create a website or an application that has some user login, you will store passwords. You always store a random salt that is long enough. 32 bits may be long enough; it can be longer, it can even be shorter in some cases. And you don't store the password, you store a hash of the password concatenated with the salt. When someone logs in, they submit their username and password, the system combines the salt and the submitted password, calculates the hash and compares it to the stored hash value. If they match, log in; if not, unsuccessful. This is secure against rainbow table attacks. Of course there are still other problems with passwords, but it's the recommended way to store them. And I think that should do us for today, but some questions to finish. Okay, so you use many websites. So the question is, how do you know that the website is using this approach to store passwords? You don't. You could ask them, how do you store your passwords? Maybe they have a security policy that they publish and say, we've been audited and our auditors say we store the passwords correctly. But in theory you don't know how they store them, okay? You must trust them. So most attacks on passwords, or many attacks on passwords, have been where some attacker has got access to a website and obtained the database of passwords, and unfortunately that database was not stored like this. It was stored either as the hash of the password, which is nowadays easily defeated, or even worse, just the password in the clear, which is
immediately defeated. Okay, so many websites still do, but you want to look and see what their policy is, and maybe find out any other details, if you want to trust them. Let's stop there. We've gone a bit longer today so that we don't have to do so much on Thursday. We'll summarize then, and we'll talk about the exam on Thursday. I don't think we'll go into any more detail about passwords, just summarize on that topic, and that will finish for the course.
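As a check on the cost of defeating a 32-bit salt with rainbow tables, here is a minimal Python sketch using the lecture's figures: one table per salt value, about 7 days to generate each table, and about 576 gigabytes to store each one. The variable names and the 365-day year are assumptions for illustration.

```python
SALT_BITS = 32
DAYS_PER_TABLE = 7     # time to generate one rainbow table, from the earlier example
GB_PER_TABLE = 576     # size of one rainbow table, from the earlier example

tables_needed = 2 ** SALT_BITS                  # one table per possible salt value
total_days = tables_needed * DAYS_PER_TABLE
total_years = total_days / 365
total_terabytes = tables_needed * GB_PER_TABLE / 1000

print(f"{tables_needed:,} tables")
print(f"about {total_years / 1e6:.0f} million years to generate")
print(f"about {total_terabytes / 1e9:.1f} billion terabytes to store")
```

Both the time and the storage are far beyond what any attacker can afford, and each extra bit of salt doubles them.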