I got a pretty important phone call recently. The president of the United States called me up. I was pretty surprised. He said, we've received news of a grave threat. Unless you re-implement Ruby's hash, alien terrorists will blow up SeaWorld. Blow up SeaWorld. I think he might have said some other stuff, but honestly, I was already typing.

OK, so what is a hash? Everybody knows this, I think. In some languages it's called a map or a dictionary, but it's basically just a way to associate keys and values. I can say, hey, the key of appetizer has the value of fruit salad. I can ask for appetizer, and it'll give me fruit salad back. So this is our basic goal. We want to make our own class that behaves the same way. Easy peasy, right? We can do this.

So here's what I'm thinking. We've got some tuples, little pairs of key and value. And whenever someone asks for a key, they say, I want cat. Then I'll just go rummaging through these tuples until I find the one I want. And I'll say, oh, this is cat. The value is goto. And if someone wants to write, then I'll go rummaging through them until I find the right one, and I'll update the value. If I don't have it, I'll insert a new tuple, and we'll be good.

So I know that Ruby supports lovely syntactic sugar, like the bracket-bracket and bracket-bracket-equals methods, so that everything looks just like a hash. And I can implement it like this, which I have. And it works very nicely. So I insert the key hokey with the value pokey, and I get it back out, and everything is good. So I attach that to an email, send it off to the president. I was feeling very fine, and everything was great. Thank you all for attending my talk.

OK, OK, actually, I'm kidding. That wasn't it. So I got a call back from the president, and he said, what kind of hash is this? A hash is supposed to have O of 1 lookup time, and this thing has O of n. I said, I'm having some trouble hearing you, Mr. President. I think I have a bad connection. I'm like, what does that mean?
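The tuple-rummaging approach described above can be sketched in a few lines of Ruby. This is my own reconstruction, not the slide code; the class and variable names are made up:

```ruby
# A minimal sketch of the tuple-scanning "hash": a flat list of
# [key, value] pairs that we search linearly on every access.
class TupleMap
  def initialize
    @tuples = []  # each entry is a [key, value] pair
  end

  # Reading: rummage through the tuples until we find a matching key.
  def [](key)
    pair = @tuples.find { |k, _v| k.eql?(key) }
    pair && pair[1]
  end

  # Writing: update the existing tuple, or append a new one.
  def []=(key, value)
    pair = @tuples.find { |k, _v| k.eql?(key) }
    if pair
      pair[1] = value
    else
      @tuples << [key, value]
    end
  end
end

map = TupleMap.new
map[:hokey] = "pokey"
map[:hokey]  # => "pokey"
```

Defining `[]` and `[]=` is exactly the syntactic sugar mentioned above: callers get the familiar `map[:key]` notation even though underneath it's just a list scan.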
Why does the president know more computer science than I do? So if these words are scary to you and you don't know what they mean, you're in good company. I was in the same boat, too. And I'm going to walk you through what big O notation is about and why it's important to the problem that we have here.

So, big O notation for beginners: it's basically just a way of talking about how an algorithm grows. It's talking about rates of growth. So as we throw more and more inputs at an algorithm, how does it scale up the amount of work that it has to do in order to solve that problem? And when you look at a graph of this, along the x-axis we have how many inputs we have to our algorithm, and on the y-axis we have how much work the algorithm has to do. And you can just ask one question if you want to know what big O category something is in: how is it shaped? We don't care about angle, and we don't care about location on the graph. We only care about shape.

I'm going to give you some really simple examples to kind of help you understand what I'm talking about. So suppose that I'm running this conference, and I really want everyone to feel welcome at the conference. I could do this in several ways. My first idea is, I'm just going to walk around and shake everybody's hand. Hey, how are you doing? Welcome to the conference. Nice to have you here, right? If I do that, then that's going to scale linearly, meaning if there's 10 people, I've got 10 hands to shake. If there's 100 people, I've got 100 hands to shake. So the line is going to look something like this.

Well, suppose that I decide, you know, that's really not good enough. I want to be more friendly than that. When I talk to somebody, I want to find out where they live. I want to know what their hobbies are. I'm going to spend a little more time with each person. And that's great. It's going to make people feel good, but it's going to take longer.
So the slope is going to look like this, right? I've got more work to do per person. Okay, imagine another scenario. Suppose that I shake hands at the normal rate, but because I'm kind of nervous before I do that, I have to get backstage and go, okay, I can shake hands. I can talk to people, it's going to be okay. All right, here we go. And I go out and start shaking hands. That graph would look something like this. So the slope is the same, but there's a bit of delay. For two people, I've already done one unit of work.

Now, these differences matter in absolute terms. They all affect how much time it's going to take me to complete shaking everyone's hand. But in big O terms, these are irrelevant. All of these things are O of N. They're linear. And the reason for that is, if you think about it, no matter which one of those strategies I choose, one more person equals one more handshake. That's what causes the line to be straight and have that same slope. So if there's 10 people and I get one more person, I've got one additional handshake. If we have 100 people and I get one more person, one more handshake. It works the same way no matter how many people are already at the conference.

Now in contrast, suppose that I say, I don't just want to greet people. I want everyone to know everybody else. So instead of shaking hands, I'm going to do introductions. I'm going to walk around and I'm going to say, Carla, this is Robert. Robert, this is Carla. OK, Carla, this is Joe. Joe, this is Carla. I'm going to do this with everyone. Everyone's going to get to meet. Well, that's lovely, but that's O of N squared. Because if you think about it, if there's 10 people, I have somewhere in the neighborhood of 100 introductions to make. And if there's 100 people, I have somewhere in the neighborhood of 10,000 introductions to make. And this line doesn't go straight. It curves up, because the further along we are, the harder this gets.
So when we have 10 people and another person shows up, I've got 10 people to introduce them to. When we have 100 people and somebody shows up, I've got 100 people to introduce them to. So this is O of N squared.

OK, one more scenario. Suppose that I say, a conference this size, I don't have time for this. What I'm going to do instead is I'm just going to go up on stage and say, hey, everybody, how are you doing? Welcome to the conference. This would be O of 1. This is constant time. It doesn't matter how many of you there are. And this is kind of nice. You notice N is the number of inputs we have, in this case the number of people. In O of 1, N does not appear. Totally irrelevant. If there's 10 of you: hi, glad to have you here, I'm done. If there's a million of you: hi, might need a better microphone system, but I'm done. And also, if I take longer and give a longer introduction, that doesn't matter either. It's still constant time. This is the holy grail of scalability. If you can get an algorithm that performs like this, it's beautiful, because you can throw as many inputs as you want at it, and you'll never get to a point where you're having trouble executing anymore.

Now, at some level, orders of growth and scalability are about feasibility. Because if solving a problem takes a certain amount of work, it can get to a point where you can't actually execute and finish solving the problem anymore. So these are different orders of growth that we've talked about, and there are many others. For one of them, there's a famous problem in computer science called the traveling salesman problem, which is basically this. Suppose I want to start out at my home city, and I want to visit some number of other cities, visiting each one once, and come home. And I want to take the best path I possibly can. I don't want to go flying halfway across the country every time. I want to hop little paths from city to city. Well, how do we calculate the best path to do that?
If we use the brute force method, what we would do is we would say, I'm going to calculate every possible path through all of these cities, and then I'm going to sort them, and then I'll take the best one. Well, doing it that way is O of n factorial, meaning if there's three cities, it's going to take me three times two times one steps. If there's four cities, four times three times two times one steps, et cetera. So that grows really fast, and it's kind of hard to wrap your mind around how fast that grows.

I'm going to put somebody on the spot. So, you there. If I'm able to do this for five cities in about a tenth of a second, how long do you think 22 cities would take? 22 cities. Five cities is a tenth of a second. 22 cities, how long do you think? Just guess. An hour, okay. That's as good a guess as any. Who knows what I would have guessed, right? This is hard to guess. It would take 35 billion years. It is crazy, okay?

And this shows you that you need to know if you're in this category, right? Because you might work for a travel search site and somebody says, let's offer this trip planning feature. Yeah, that sounds great. We can offer that for four cities. We can offer that for five cities. But we can't offer it for 10 cities, because the customer will have taken their trip and come home and we'll still be thinking about it. And if we try to offer it for 22 cities, our server farm will melt and civilization will collapse and the sun will burn out and we still won't know what the best route is. So this is not a solvable problem in practical terms. Yeah, sure, it can be done, but we don't have time.

So it's important to know what category your algorithm is in, and essentially, are you gonna bog down, and is it gonna become hard to solve your problem? So what's our problem with this hash? I said, you ask me for a key, I'm gonna go rummaging through my tuples and find the one I want.
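You can get a feel for how brutally fast n factorial grows with a quick back-of-the-envelope script. The tenth-of-a-second baseline for five cities is the talk's number; everything else is simple scaling, and it lands in the same tens-of-billions-of-years ballpark:

```ruby
# Back-of-the-envelope: if brute-forcing 5 cities takes ~0.1 seconds,
# scale by the ratio of factorials to estimate larger inputs.
def factorial(n)
  (1..n).reduce(1, :*)
end

base_n = 5
base_seconds = 0.1

[10, 15, 22].each do |n|
  seconds = base_seconds * factorial(n) / factorial(base_n)
  years = seconds / (60.0 * 60 * 24 * 365)
  puts format("%2d cities: %.3g seconds (%.3g years)", n, seconds, years)
end
# 22 cities comes out around 3e10 years, i.e. tens of billions of years.
```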
Well, the problem is, the bigger that haystack gets, the longer it takes me to find the needle, right? If you've got a hash with a million items in it and you wanna look something up, it's gonna take a while. Now, if I'm reading, on average I'm gonna have to walk through about half of those before I find the one I want. But the distinction between walking halfway through and walking all the way through is kind of irrelevant. That's a difference in slope, not in shape, and we don't care about that. That's not a big O concern. It's linear. The more there are, the longer it takes. So we don't want this. This isn't how hashes really work, and this isn't how our hash should work.

And by the way, I've built this class and I've done some measurement on it, and our analysis of what's going on as we walk through these keys is borne out by this chart. So just to kinda help you understand: on the left side, I've got a hash with 150 keys in it. On the right side, I've got one with about 14 and a half thousand keys. And all I'm doing is doing some writes, measuring how long that took, doing some reads, measuring how long that took, et cetera. You can see the reads take less time, because on average we walk halfway through. That's the yellow line. Then on the right, I'm always inserting, so I'm always going to the end. But basically, it grows linearly, just as we suspected.

So how can we do this in a different way? What can we use that's O of one? Well, something that we know is O of one is array lookup by index. So if I'm gonna look up the 328th item in an array versus the fifth item, it takes the same amount of time to do that. Why does that work? Well, because of RAM. So RAM has this wonderful property that's unlike a spinning disk. If you're working with a spinning disk and you're trying to get some data, you're reading from this spot and you want data from a different spot, if it's close by, you can get to it quickly.
But if it's on the other side of the platter, you have to spin it around and move the read head, and it's gonna take a while. But RAM is random access memory. I can get any random value from anywhere in RAM just as fast as any other random value. And Ruby can figure out where it needs to look, because if it knows that I'm looking for index five, well, it knows that the array starts here in RAM, and it knows that every slot takes up the width of one pointer in memory, and it knows how many slots down I'm going. So it goes, you're going right here. This address in RAM is what you want. It asks the computer for that address and gets the value back.

So we need to make a new plan. We need to implement our hash as a real hash, using a hash table, the data structure. And this is basically how it's gonna work. We're gonna have two parts. We're gonna have a digest function. Now, these are sometimes called hash functions, but I find that gets a little confusing. You've got a hash function inside of a hash data structure making a hash table, and it's all a little confusing. So Ruby actually does the nice thing of calling these digests, just to kind of keep them separate. And a digest is a one-way function. You put something in, and you can transform it to another value reliably. It always works the same way for the same input, but you can't go backwards. You can't take the output and get back to the input. So it's a one-way function, and that's a great name, because just like digestion, digestion is one way. You can't take poop and turn it into pizza, right? You're done. At least not the same pizza.

So we have a key of appetizer. We want a function that can give us a number, because we need a number to be able to do an array lookup by index, and that's the other piece of our puzzle. We need a sparse array. So if I see that the key of appetizer needs to go in slot two, well, obviously slot zero and one need to exist in order for me to be able to put something in slot two.
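In Ruby, the role of that digest is played by the built-in `.hash` method, and you can see the "reliable but one-way" behavior directly. One caveat: Ruby seeds these values per process, so the actual number changes between runs, but not within one run:

```ruby
# Same input, same digest -- within one Ruby process.
a = "appetizer".hash
b = "appetizer".hash
puts a == b   # true: deterministic for equal values

# Different values digest to (almost always) different numbers, and
# nothing about the number lets you reconstruct the original string.
puts "appetizer".hash
puts "appetizer".hash == "dessert".hash
```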
So I'm gonna get the value of two for that key, and then I'm gonna look into my sparse array directly and say, this goes in slot two. So we're gonna implement it like this. We'll be able to instantly go to the place we want, instantly get out the value, no rummaging around, and it's gonna scale beautifully. Thank you very much for listening to my presentation. No, no, no, I'm sorry, I'm still messing with you.

There are problems with this approach. There's a problem with collisions, first off. So we can't guarantee that when we hash a particular key, we're gonna come up with a unique value. In fact, we probably won't. We'll have some collisions. So we've stored a value of fruit salad for appetizer, and along comes someone and tries to store a value for mammal. Oh no, that goes in slot two. What do we do? Well, the solution to this problem is to go back to our old solution of a tuple map. In that slot, when we get to slot two, we'll say, well, we've got several things here; rummage around through them, find the one we want. Obviously, if there's too many things in this location, in this bucket, we're going to be right back where we started. But put that problem aside; we'll come back to it.

The second problem that we have is waste. So I said, if we have a sparse array and we wanna put something in slot two, we need slot zero and one to exist, but we don't have anything to put there. Those are just nils. That's wasteful, right? We're just wasting memory with all these nils. What's the solution to that problem? Get over it, basically. We can do some things about it, but hashes are a trade-off. This is the key thing to see. What we're doing when we build a hash is we're trading memory for speed. We're saying, I'm willing to waste some of this memory and have all these blank spaces lying around just so that I have the guarantee that slot two is there for me and I can get there quickly. Trading memory for speed. So is there a grand solution?
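Here's a tiny sketch of a collision in action. The keys and the bucket count are arbitrary choices of mine, but with four keys and only three buckets, the pigeonhole principle guarantees at least one bucket ends up holding more than one tuple:

```ruby
# With more keys than buckets, some bucket must hold multiple tuples.
# (Exactly which keys collide depends on Ruby's per-process hash seed.)
num_buckets = 3
buckets = Array.new(num_buckets) { [] }

%w[appetizer mammal dessert drink].each do |key|
  buckets[key.hash % num_buckets] << [key, key.upcase]
end

buckets.each_with_index do |bucket, i|
  puts "bucket #{i}: #{bucket.map(&:first).inspect}"
end
# Lookup inside an overfull bucket falls back to the tuple scan,
# which stays fast as long as buckets stay small.
```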
Is there some way that we can minimize these issues? Minimize collisions, because the more collisions we have, the more things we have in a bucket, the slower we're gonna be. But we also wanna minimize the amount of memory that we waste, because who likes wasting memory, right? Well, here's our solution. We're gonna grow as needed. We're gonna start out with a few buckets and a few blanks in memory, and we'll grow as we need to.

What does it mean to grow as needed? You could define this in a lot of different ways. I'm using a really simple definition. I'm just gonna say, if there's 10 keys in a bucket, that's too many. We're gonna grow. What does it mean to grow? Well, the basic idea is that we wanna get more buckets and we wanna spread things out. It doesn't help to have a bunch more buckets if everything is still piled up the way it used to be. So we wanna have some way of spreading them out, so that when we're done, most buckets don't have much in them. And one key thing to see is that every key is going to need to have a new bucket calculated for it. We've gotta go back through all of our keys when we do this and figure out where they go now.

So here's our basic strategy for our hash. We're gonna start out with x number of buckets, whatever. We're gonna compute a raw digest value for whatever key we've got; I'm gonna come back to how we do that. And then we're going to say, what bucket number does it go in? We're gonna take that number, modulo the number of buckets, and that's the bucket it goes in. I think everyone knows what modulo means, but I just wanna make sure we're all on the same page. It's basically just divide and take the remainder. So any number modulo three is gonna come out with one of three values. So we have a guarantee that we're always gonna know which bucket it goes in.
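Putting those pieces together, here's a sketch of that strategy: digest the key, take it modulo the bucket count, and grow when any bucket passes ten entries. The class name, the starting bucket count, and the growth step are my own choices (this version just roughly doubles; picking good sizes is a separate question):

```ruby
# A bucketed hash sketch: digest % bucket_count picks the bucket,
# each bucket is a small tuple list, and we grow when one overfills.
class BucketHash
  STARTING_BUCKETS = 11
  MAX_BUCKET_SIZE  = 10  # the talk's simple "10 keys is too many" rule

  def initialize
    @buckets = Array.new(STARTING_BUCKETS) { [] }
  end

  def [](key)
    bucket_for(key).each { |k, v| return v if k.eql?(key) }
    nil
  end

  def []=(key, value)
    bucket = bucket_for(key)
    pair = bucket.find { |k, _v| k.eql?(key) }
    if pair
      pair[1] = value
    else
      bucket << [key, value]
      grow if bucket.size > MAX_BUCKET_SIZE
    end
  end

  private

  # Raw digest modulo the bucket count picks the bucket.
  def bucket_for(key)
    @buckets[key.hash % @buckets.size]
  end

  # Growing means more buckets, then re-placing every key, because
  # each key's bucket number depends on the bucket count.
  def grow
    old_buckets = @buckets
    @buckets = Array.new(old_buckets.size * 2 + 1) { [] }
    old_buckets.flatten(1).each { |k, v| bucket_for(k) << [k, v] }
  end
end
```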
Now, when we grow, we're gonna be growing from some number of buckets to some other number of buckets, and like I said, the whole point is for us to be able to spread things out. Imagine if we grew from size three to size six. A lot of things that divide evenly by three also divide evenly by six, so we would end up with a lot of things in the same buckets. That wouldn't be so great. But you can see, if we grow from three to seven, most things mod three versus mod seven are different values. So that would be a good way to grow. Well, how can we generalize that? How can we know, if I'm at a particular size, what size I should move to? Well, I'm not gonna go into all the gory details, but a basic strategy that works well is: double the number of buckets, then go to the next prime number. Using prime numbers means that we rarely end up with things that divide the same way, and we generally spread things out differently. So if you're at size 199, we'll go to size 401; from 401 we'll go to 809; et cetera.

All right, I just glossed over this before. Where do we get a raw digest value? Somebody's giving us a key. We wanna come up with a number that we can then modulo and so forth. Well, this is kind of hard. I spent some time thinking about this, and my first thought was, well, I can take a string key and I can turn that into Unicode code points, and those are numbers, and that's what I'll use. That'll be the number I'll take modulo the number of buckets, et cetera. But we don't just have to support string keys; in Ruby, anything can be a key. A string can be a key, an array can be a key, some hat object can be a key, so I can't just depend on it being a string. Well, what else do we have? Everything's an object in Ruby, right? And everything has an object ID, and that is a number. Well, what if we use that? Not gonna work.
Because every object that's different, like these two strings, A and A, those are different objects and they have different object IDs. But we don't wanna care about that with a hash. We don't wanna have to hang on to the same string object to be able to use it to get the value out that we set. We wanna be able to use a string that looks like the one we set it with, and that should be good enough. We want same-value keys, quote unquote, to be interchangeable.

But that's a really fuzzy concept. I mean, suppose you use a hat object as a key. How do I know if this hat is equal to that hat? What do I use? Size, color? I don't know. How would I know if they're equal? It's a subjective sort of idea. Ruby's solution to this is to ask the object. The object is responsible for telling you what its raw digest value is, and that's with the dot hash method. Every object in Ruby has this method, and it's always going to give you a number. And the contract that it has to uphold is that if two objects are dot eql?, if they're equal in that sense, then they must provide the same hash value. If you wanna use custom objects like your own hat class as keys, that's fine. You can say hats are equal if they're the same size, as long as two hats of the same size produce the same dot hash value.

So to review, here's what we've got so far. For any key someone gives us, we're going to turn that into a number using the dot hash method. Now that we've got a number, we can take it modulo the number of buckets we have, and we know what bucket to put it in. As the buckets fill up, we're going to say it's time to grow. We'll grow in order to spread things out. Spreading things out helps us stay fast, so we don't have to go rummaging through too many things in a given bucket. But it also means we're using more memory. So that's our trade-off.
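That eql?/hash contract is easy to satisfy in a custom class. Here's a hypothetical Hat class, keyed by size, so two same-size hats are interchangeable as hash keys:

```ruby
# A hypothetical Hat usable as a hash key. The contract: if two hats
# are eql?, they must return the same hash value.
class Hat
  attr_reader :size

  def initialize(size)
    @size = size
  end

  # Two hats are "the same key" if they're the same size.
  def eql?(other)
    other.is_a?(Hat) && size == other.size
  end

  # Must agree with eql?: same size, same digest.
  def hash
    size.hash
  end
end

lookup = { Hat.new(7) => "fedora" }
lookup[Hat.new(7)]  # => "fedora", even though it's a different object
```

Ruby's own Hash uses exactly these two methods on keys, which is why the second, freshly built `Hat.new(7)` finds the value the first one stored.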
Well, we've done all of this work in the hopes that we get O of one performance, and for a hash that means: I don't care how much stuff is in that hash, I want to be able to look something up in the same amount of time as I could when there were only 10 things in it. So, if we've done this right, I should be able to show you a graph with a flat line for reads and writes. As the hash gets larger and larger, it stays flat. Did we do it? Are you guys ready to see the graph? Can you please give me a hand for the graph? Woo! Yeah! Whoa! Okay, okay, okay, we're okay, we're okay.

So let's just take it one thing at a time. First thing I want you to notice: see those nice flat lines. So you can see way over on the left side, as I'm doing reads with my very small hash with only 100,000 items, it takes a certain amount of time, and way over here when I've got 9.7 million items, it's about the same. So that's perfect, right? That's really nice. The yellow line looks beautiful. The green line is our writes, and it pretty much looks nice, right? Way over on the left side it takes a certain amount of time, and it takes about the same amount of time way over here on the right side, where we've got a lot of keys in it. So that's pretty good. We just have some kind of spiky issues there in the middle.

So what's that about? Well, those spikes happen when we redistribute, when we go back through and figure out which buckets our keys go in now. We've had to grow, and we've had to go back through every key and figure out where it goes. Okay, well, how are we doing, though? How does it stack up to the native Ruby hash? Does the Ruby hash have this problem? Well, in fact, yes, it does. As the Ruby hash grows, it has these spikes. There are times when it has to redistribute its keys into its new buckets. Now, you can see the Ruby hash is about two-tenths of a second up there at five-ish million keys, so it's a pretty fast thing, and I'm wondering, how do we stack up to that?
How big are our spikes? Oh, yikes. Okay, so don't use my hash in production. This is pretty bad. But have we failed? Is this a sign of our hash being incorrect? Well, notice this: both of these have linearly-increasing spikes. The slope on the Ruby hash is much lower, but they both are basically doing the same thing, and we don't care about slope in big O terms, right? So they both have linearly-increasing spikes, because whenever you have to reorganize and redistribute, you have to walk back through every single key. That's just the way it works.

Okay, so did we get O of 1 performance or not? Well, let's think through this. When I read a key, I'm always gonna know right away which bucket to go to, and that bucket is never gonna have more than 10 items in it, because I'm gonna resize if it does. So I know I won't have to do more than 10 steps at that point. So we can call the reads O of 1. And writes are the same way. As long as I don't happen to be the unlucky write that causes a redistribution, I'm gonna be able to go right to the place I want and update it, and that'll be great.

So what about those writes at the redistribution points? Well, the growth has to happen in steps, because we have to walk through every single key and figure out where it goes. But as that process gets slower, it also gets less frequent. Remember, we double the number of buckets and go to the next prime. So what's actually happening, if you imagine that graph, is the mountains get higher and higher and higher, but they also get further and further apart. So imagine you take each mountain and squish it down into its valley. If you do that, you find that you get a flat field. So another way of thinking of this is to say that every write incurs a certain amount of debt. When you do a write, you know sometime in the future you're gonna have to redistribute everything. Maybe not yet, but eventually. So every write incurs a certain amount of debt, and some unlucky write pays it all off.
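Both pieces of that argument can be checked numerically. Below is a minimal version of the growth rule (double the bucket count, then walk up to the next prime), plus a toy amortization model of my own that charges each redistribution one step per existing key. The average cost per write stays bounded even though the individual spikes keep growing:

```ruby
# Trial-division primality check; fine for bucket-count-sized numbers.
def prime?(n)
  return false if n < 2
  (2..Integer.sqrt(n)).none? { |d| (n % d).zero? }
end

# Double the bucket count, then walk up to the next prime,
# matching the talk's 199 -> 401 -> 809 progression.
def next_bucket_count(current)
  candidate = current * 2 + 1
  candidate += 1 until prime?(candidate)
  candidate
end

next_bucket_count(199) # => 401
next_bucket_count(401) # => 809

# Toy amortization: redistributions get costlier (one step per key)
# but rarer (capacity roughly doubles), so the average stays flat.
def amortized_cost(total_writes)
  capacity = 10
  redistribution_steps = 0
  (1..total_writes).each do |n|
    if n > capacity
      redistribution_steps += n   # re-place every key once
      capacity = next_bucket_count(capacity)
    end
  end
  (total_writes + redistribution_steps).to_f / total_writes
end

[1_000, 100_000, 1_000_000].each do |n|
  puts format("%9d writes: ~%.2f steps per write", n, amortized_cost(n))
end
```

Squishing each mountain into its valley, in this model, is just that final division: total steps over total writes, and it stays a small constant.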
When we do this, we're doing what they call amortized analysis. We're saying, spread out over the life of this thing, what happens? And each write has the same penalty. It's O of one; it stays the same across the board. So we succeeded. Woo, this calls for celebratory clip art.

I was feeling pretty good about this. I attached it to an email, I sent it off to the president, and I kicked back to relax and think over some other lovely ideas, such as: could we make this faster? Well, we could implement it in C; that's what Ruby does. We could maybe redigest concurrently, using some kind of fancy algorithm or something. We could trade off more memory at a time. But why bother, right? The Ruby native hash is great. This is really a learning exercise.

More interestingly, what can we learn from this process? Well, one thing we learn is that hashes are amazing. Hashes are an O of one multi-tool. You can use them in all kinds of situations. You can keep using them and they're gonna scale beautifully. Another thing is that we have other things that work like hashes: distributed hash tables, like the Riak database, and anything based on Dynamo, like Cassandra. Those have the same trade-offs. Essentially what they're doing is using a digest to figure out where to put your key in some structure of database servers. So think about what happens when you run out of space and you have to grow. Do you wanna grow by one server and pay all that penalty of rehashing every key, or do you wanna double the number of servers? It makes more sense to double. This has a lot of implications for other things you might do.

Well, I was mulling over all of these things and stroking my beard and smoking my bubble pipe when I got another phone call. The president called me back. He said, dude, where's your hash? I said, what are you talking about? I sent you that email three hours ago. He said, I never got an email from you. I went and looked in my email, and it was sitting in my outbox.
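To make the grow-by-one versus double question concrete, here's a toy model of my own, using simple modulo placement of keys onto servers (real systems like Riak use consistent hashing rings instead, which move even fewer keys). Doubling moves a much smaller fraction of the keys than adding a single server:

```ruby
# How many keys change servers when the cluster grows? Under modulo
# placement, a key moves whenever its server number changes.
keys = (1..10_000).map { |i| "user-#{i}" }

def moved(keys, old_count, new_count)
  keys.count { |k| k.hash % old_count != k.hash % new_count }
end

old_servers = 8
puts "8 -> 9 servers:  #{moved(keys, old_servers, 9)} of #{keys.size} keys move"
puts "8 -> 16 servers: #{moved(keys, old_servers, 16)} of #{keys.size} keys move"
# Growing by one reshuffles nearly everything (~8 of every 9 keys);
# doubling moves only about half.
```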
I said, I'm sorry, Mr. President. I'll send it right away. And he said, it's too late. Alien terrorists blew up SeaWorld. Blew up SeaWorld. Blew up SeaWorld. It was all a dream. Thank you very much for listening to me talk.