I'm Tyler; I also work at Scribd. I'm going to be talking a little bit about alternative data structures in Ruby. Despite what Tim said, I don't actually know anything about NoSQL, so I'm not going to talk about that. Now, you might be asking yourself: why should we talk about different data structures? You have your arrays, you have your sets, you have your hashes. Great, that's enough for anybody, right? Well, sometimes that's enough, but sometimes you need to do something a little bit different. At Scribd, we deal with a huge amount of data, and on lots of occasions I've found that the normal data structures you use on a daily basis don't always quite work; they don't do exactly what I want all the time. So why would I use a different data structure? There are basically three reasons: speed, memory, and clarity. I'll get into that a little bit more. So, what's wrong with my favorite data structure X, whatever X is? Probably nothing. The things I'm going to talk about are just data structures that I've used in the past, and currently, and that I find interesting. The point of this talk isn't really to say you should use these particular data structures; it's more that you should use data structures in general, and to get you more interested in them. All right, let's get right into it. Let's talk about bloom filters. The point of a bloom filter is to test for existence in a set, basically to ask: have I seen this item before? It's a probabilistic data structure, which means it can fail sometimes; how much and how often it fails is a tunable thing, and we'll talk about that a little bit. But really, the point of a bloom filter is its memory usage. It's pretty awesome.
So, let's say we have 100 million strings, each about 100 characters long. If you were to put these in a traditional set, you'd be talking something like 10 gigabytes of memory. Not really feasible; that's more memory than most computers have right now. With a bloom filter, if you decide a 0.0001% chance of failure is okay, you could do the same thing in about 280 megabytes. If a higher failure chance is okay, you could do it in about 170 megabytes. These are much more reasonable numbers. So, how does a bloom filter work? A bloom filter is basically just a series of bits, a series of checkboxes which can be on or off, indexed of course. Say we want to add an object to our set, maybe the string "to be or not to be". First, we run it through a series of hash functions. Exactly how many hash functions you use is a function of how many things you're going to add and the size of the bloom filter, but for now let's say we're going to use two. We run it through the two hash functions and out pop two numbers, one and five, let's say. So we set bits one and five. Great. Now we're going to add another one, "that is the question". And mind you, an 8-bit bloom filter is not actually something you would want to use, but it fits on a slide. So we add that as well. Now we're going to query "whether 'tis nobler". We run that through the same two hash functions and out pop two numbers, two and five. One of those bits is not set, so we have definitely never seen this string before. Likewise, we go back to "to be or not to be", run it through the two hash functions, get one and five, and it's a match. Great. But we can also get a false match: a string we never actually added, whose two bits both happen to be set. That's a false positive.
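The bit-array-plus-hash-functions idea above can be sketched in a few lines of Ruby. This is a minimal toy, not anyone's production implementation; the class name, parameters, and the double-hashing trick (deriving both indexes from one MD5 digest) are my own choices for the example.

```ruby
require 'digest'

# Minimal bloom filter sketch: a bit array plus k index functions
# derived from a single MD5 digest via double hashing.
class BloomFilter
  def initialize(bits: 1024, hashes: 2)
    @bits   = bits
    @hashes = hashes
    @field  = Array.new(bits, false)  # the row of on/off "checkboxes"
  end

  def add(item)
    indexes(item).each { |i| @field[i] = true }
  end

  # true  => maybe present (rarely a false positive)
  # false => definitely never added
  def include?(item)
    indexes(item).all? { |i| @field[i] }
  end

  private

  # k indexes from one digest: (h1 + n * h2) mod m
  def indexes(item)
    h1, h2 = Digest::MD5.digest(item.to_s).unpack('QQ')
    (0...@hashes).map { |n| (h1 + n * h2) % @bits }
  end
end

filter = BloomFilter.new
filter.add("to be or not to be")
filter.add("that is the question")
filter.include?("to be or not to be")   # => true
```

Queries for strings that were added always return true; queries for anything else almost always return false, and the false-positive rate is what you tune with the `bits` and `hashes` parameters.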
Like I said, this can be tuned based on the size of the bloom filter; that's what makes it probabilistic. So what's the point of a bloom filter, then? Let's say we're running something like a file server: some file server that lives remotely, has lots of files on it, and is kind of expensive for us to query. Say our architecture looks like this: a request comes in and goes directly to the file server. If the file exists, it sends back a 200; if it doesn't exist, it sends a 404. Now maybe we find that we're getting a lot of 404s. People are querying this a lot, and our file server is becoming overloaded, so we need to do something about it. One thing we could do is add a bloom filter in between the request and the file server itself, whose whole job is to say what is not on the file server. We can't actually say what is on the file server, but we can say when something definitely is not. And in this case, even a false match from the bloom filter is fine: we let that request through, and the file server just returns a 404 anyway. But we still get rid of something like 99% of the bogus requests to the file server, which is great. So, summing up bloom filters: the point is testing for existence in a set, and the reasons you'd use one are its memory footprint and, also, its great speed. That's bloom filters. So, let's move on to BK trees. BK tree stands for Burkhard-Keller tree, which is just the names of the guys who invented it. What it actually does, though, is somewhat more interesting: it finds the best match even when an exact match does not exist in the set. And the point of this is to reduce the search space.
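The "filter in front of the file server" pattern can be sketched as below. The stub classes and names are mine; and for the front filter, anything that answers `include?` works in this sketch, because the whole point is that a rare false positive just falls through to a real 404.

```ruby
require 'set'

# Stand-in for the expensive remote file server.
class StubServer
  attr_reader :hits

  def initialize(paths)
    @paths = Set.new(paths)
    @hits  = 0                      # count expensive backend queries
  end

  def get(path)
    @hits += 1
    @paths.include?(path) ? [200, 'OK'] : [404, 'Not Found']
  end
end

# Front door: a membership filter trained on every path the server has.
# In a real deployment this would be a bloom filter over millions of
# paths; a Set behaves identically minus the memory savings.
class GuardedServer
  def initialize(filter, backend)
    @filter  = filter
    @backend = backend
  end

  def get(path)
    # "Definitely not there" requests never touch the backend.
    return [404, 'Not Found'] unless @filter.include?(path)
    @backend.get(path)              # false positives still get a real 404
  end
end

backend = StubServer.new(['/a.pdf', '/b.pdf'])
front   = GuardedServer.new(Set.new(['/a.pdf', '/b.pdf']), backend)
front.get('/a.pdf')    # => [200, "OK"], backend was queried
front.get('/nope.pdf') # => [404, "Not Found"], backend never queried
```

The design point is that the filter only ever answers "definitely not here" or "maybe here", and both answers are safe: the first short-circuits a guaranteed 404, the second forwards to the authoritative backend.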
Traditionally, if you wanted to find the best match in a particular set of strings, you would have to scan through the entire list, maybe using a priority queue to keep the ones closest to what you're looking for. The point of a BK tree is that you don't have to scan through the entire list; it reduces the search space. It only works inside something called a metric space, though. So what is a metric space? The term traditionally comes from things like Euclidean distance, actual distance between two points, but it turns out that Levenshtein edit distance also counts as a metric. And so, traditionally, BK trees are used in spelling correctors, to find the best matches for a particular word in a large dictionary. What this works off of is something called the triangle inequality. When I added this slide to my talk I thought, wow, this is going to be good; any talk with the triangle inequality in it is going to be awesome. So, say we have three nodes, X, Y, and Z, and we know two of the distances: X to Y is one, and X to Z is four. Technically what we use here is the reverse triangle inequality, but the point is that using those two distances, we can determine a lower bound for the distance between Z and Y. We plug the numbers into our formula, four minus one, and we can say the distance between Z and Y is greater than or equal to three. And so, if all we cared about was whether the distance between them was less than two, we don't even need to run that distance function; we can just skip it. So, let's take a look at an actual example of this. Let's say this is our dictionary, six words: taser, paste, shave, light, pastor, and pasta.
And so, to start building our BK tree, we pick one of the words on there, totally at random of course; let's say it's paste. That becomes the root. And it just so happens that our tree works out perfectly: pasta is one edit distance away from paste, pastor is two edits away, taser is three, and so on down the line. Now we want to query it. Say our user has typed in a particular word, meaning to type pasta but misspelling it as pastu. We want to find the words in our tree closest to this. We compare it to the root, run our distance function, and great, it's one. But what do we do about the rest of them? Using that triangle inequality, we can determine that only two of the branches off the root are even feasible. Given that we know the distance between pastu and paste, and the distances between paste and pasta and between paste and pastor, we can say that only those two children are actually interesting. We can just get rid of all the other ones. So, it turns out that pasta and paste are the only ones that actually match, the only ones interesting to us. But really, the point is that we never had to run our comparison against everything else: taser, shave, and light never had the distance function run against them at all. Here, we got rid of 50% of the distance computations we would have had to do previously. And extending the BK tree, maybe we have lots of words, and you can see that we take pastu, compare it to paste, then continue down through pasta and pastor and each of their children, and we only end up having to query a very small percentage of the tree, which is fantastic.
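The build-and-prune process just described can be sketched in Ruby. This is a hand-rolled example, not any particular gem; the class and method names are mine, and the distance function is plain dynamic-programming Levenshtein.

```ruby
# Plain dynamic-programming edit distance.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

class BKTree
  Node = Struct.new(:word, :children)     # children: distance => Node

  def initialize
    @root = nil
  end

  def add(word)
    return @root = Node.new(word, {}) unless @root
    node = @root
    loop do
      d = levenshtein(word, node.word)
      return if d.zero?                   # already present
      child = node.children[d]
      return node.children[d] = Node.new(word, {}) unless child
      node = child
    end
  end

  # All words within +radius+ edits of +query+.
  def search(query, radius)
    results, stack = [], [@root].compact
    until stack.empty?
      node = stack.pop
      d = levenshtein(query, node.word)
      results << node.word if d <= radius
      # Triangle inequality: only branches keyed by k with
      # |d - k| <= radius can possibly contain a match.
      node.children.each do |k, child|
        stack << child if (d - k).abs <= radius
      end
    end
    results
  end
end

tree = BKTree.new
%w[paste pasta pastor taser shave light].each { |w| tree.add(w) }
tree.search('pastu', 1).sort   # => ["pasta", "paste"]
```

The pruning lives entirely in that one `if (d - k).abs <= radius` line: branches that fail the bound are never pushed onto the stack, so their distance functions are never run.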
So, summing up BK trees: most often these are used for spelling correctors, but you could also use a BK tree for something like finding everything that's particularly close to a point on a map, for instance. It works in any metric space, but it only works in metric spaces, and the point is to reduce the search space, to reduce the number of distance functions you have to run. BK trees. So, let's move on to another one: the splay tree. Before I get into splay trees, I'm going to go off on a little tangent here about access patterns. Normally, when people think about data structures, they kind of assume there's going to be an even distribution across the different keys they'll be querying. I do this myself a lot, but it turns out that's actually very rarely the case. Normally the querying you do against a data structure looks a lot more like a power law. For instance, a lot of the work I end up doing is in text analysis, and in text analysis there's something known as Zipf's law, which states that basically any natural human language will have a power-law distribution of words. That applies especially to splay trees, because with splay trees, the more uneven the access pattern, the better. I'm sure you can find some immediate uses for this. How about web caches, especially web caches for anything time-sensitive? For instance, you have a very popular blog, and most of the traffic you currently get goes to your latest blog post. A splay tree would be perfect there, you know, if Memcache weren't good enough or something like that.
Anyway: a splay tree is a self-balancing binary search tree, and the point of it is that it brings the most-accessed items closer to the root. So maybe you have a perfect little binary tree, and we query for the number nine: we walk down the tree to nine, and then we do what's called a splay operation. We start doing tree rotations until nine gets up to the root. And you might be saying to yourself, well, we had this perfect binary tree before; why would I want this now extremely unbalanced binary tree? Well, the point is to get the most-accessed items toward the root of the tree, so the next time nine is queried, it comes up immediately, no delay at all. Great. Now, of course, if five is queried, it has to walk all the way down there and then rotate five back up to the top. But the idea is that this is especially good for queries with extremely uneven access patterns; in a traditional splay tree you do the splay on every access, and I have benchmarks showing it works out well under very uneven access patterns. This pattern is great for caches and garbage collectors. It isn't actually something you're likely to use on a daily basis, but it's cool to know about these kinds of things and get interested in different data structures, so you can find something that does work for your situation. So, the last data structure we're going to look at is a trie. This is actually my favorite data structure. Why is a trie cool? Well, it has order-one lookup, order-one add, and order-one removal, order one with respect to the number of items stored; the cost is proportional to the key length instead. You can do in-order traversals. You can do prefix matching.
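Before moving on, the splay operation just described can be sketched in Ruby. This uses the standard textbook recursive splay (zig, zig-zig, and zig-zag rotations); the code itself is my sketch, not from any library, and the `insert` helper is an ordinary unbalanced BST insert just to build a tree to splay.

```ruby
Node = Struct.new(:key, :left, :right)

def rotate_right(x)
  y = x.left
  x.left = y.right
  y.right = x
  y
end

def rotate_left(x)
  y = x.right
  x.right = y.left
  y.left = x
  y
end

# Returns the new root, with +key+ (or the last node on its search
# path) rotated up to the top.
def splay(root, key)
  return root if root.nil? || root.key == key
  if key < root.key
    return root if root.left.nil?
    if key < root.left.key                        # zig-zig
      root.left.left = splay(root.left.left, key)
      root = rotate_right(root)
    elsif key > root.left.key                     # zig-zag
      root.left.right = splay(root.left.right, key)
      root.left = rotate_left(root.left) if root.left.right
    end
    root.left.nil? ? root : rotate_right(root)    # final zig
  else
    return root if root.right.nil?
    if key > root.right.key                       # zig-zig
      root.right.right = splay(root.right.right, key)
      root = rotate_left(root)
    elsif key < root.right.key                    # zig-zag
      root.right.left = splay(root.right.left, key)
      root.right = rotate_right(root.right) if root.right.left
    end
    root.right.nil? ? root : rotate_left(root)
  end
end

def insert(root, key)           # plain BST insert, no balancing
  return Node.new(key) if root.nil?
  if key < root.key
    root.left = insert(root.left, key)
  else
    root.right = insert(root.right, key)
  end
  root
end

root = nil
[8, 4, 12, 2, 6, 10, 14, 9].each { |k| root = insert(root, k) }
root = splay(root, 9)
root.key  # => 9
```

After the splay, nine sits at the root, so an immediately repeated query for nine terminates at the first comparison; the rotations preserve the binary-search-tree ordering, so everything else is still findable.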
Basically, it's like a hash table, except better in every way. That's not actually true. But the memory usage is also awesome. So, this is what an empty trie looks like: just the root node. We're going to add a string to it, the string thin, so we just add four little nodes there, one following the other. Then we're going to add another one, trap, and you can see there that we're starting to share the upper nodes: trap and thin both share the T. And we add another one, bar. The point is that as you build the trie, the strings start to share more and more of the individual nodes. So, how does a query work? We walk down one letter at a time. Say we query for bufkus: we start at the root node, follow the B, and it stops there; the U in bufkus isn't there, so that's as far as we go, and we know bufkus is not in the trie. Now, as an autocompleter: let's say you're going to make a Rack-based autocompleter. It turns out that you can do that with most trie libraries in just about a slide's worth of code. So, you can see there we have our initialize method, which loads a whole bunch of words into the trie, and then our call method to match the Rack API. We do a Rack::Request.new on the environment and get our word, which I forgot to put into a variable, that's cool. And then we return the list of children of that particular prefix as JSON. And that's really it; tries are pretty cool for that. So, it looks like I'm ending incredibly early; apparently I've been talking really fast, sorry about that. But really, my conclusion is pretty simple: data structures are cool. Maybe you won't find any particular use for these data structures, but hopefully this will get you interested enough to look at other data structures and find something that is interesting and does work for you. So, hopefully you guys have lots of questions.
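The trie and the autocompleter described above can be sketched as follows. The talk used an existing trie library plus `Rack::Request`; this hand-rolled version (nodes are just nested Hashes, and the query string is parsed by hand) only exists to keep the example self-contained, so all names here are mine.

```ruby
require 'json'

class Trie
  def initialize
    @root = {}                  # each node is just a Hash of children
  end

  def add(word)
    node = word.each_char.reduce(@root) { |n, ch| n[ch] ||= {} }
    node[:end] = true           # mark a complete word
    self
  end

  def include?(word)
    node = walk(word)
    !!(node && node[:end])
  end

  # Every full word underneath +prefix+, for autocompletion.
  def children(prefix)
    node = walk(prefix)
    node ? collect(node, prefix) : []
  end

  private

  def walk(str)                 # nil as soon as a letter is missing
    str.each_char.reduce(@root) { |n, ch| n && n[ch] }
  end

  def collect(node, prefix)
    words = node[:end] ? [prefix] : []
    node.each do |ch, child|
      words.concat(collect(child, prefix + ch)) unless ch == :end
    end
    words
  end
end

# Rack-compatible app: GET /?word=th returns completions as JSON.
class Autocompleter
  def initialize(words)
    @trie = Trie.new
    words.each { |w| @trie.add(w) }
  end

  def call(env)
    # A real app would use Rack::Request; parsed by hand here to
    # avoid the dependency.
    word = env['QUERY_STRING'].to_s[/word=([^&]*)/, 1].to_s
    [200, { 'Content-Type' => 'application/json' },
     [@trie.children(word).to_json]]
  end
end

app = Autocompleter.new(%w[thin trap bar this])
status, _headers, body = app.call('QUERY_STRING' => 'word=th')
body.first   # JSON for ["thin", "this"]
```

The failed-lookup behavior from the bufkus example falls out of `walk` returning nil the moment a letter has no child node, and `children` is the whole autocompleter: walk to the prefix, then gather every word below it.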
The idea with the bloom filter is to not rely on it being correct 100% of the time, which is why I like the file server example: even when it's wrong, the request just passes through and still gets the correct answer. So really, the point is to use it purely as a filter, and to expect false positives; the search engine that we use at Scribd works that way, for instance. As for implementations: I'm using these myself, and I know several other people who work on graphics-related stuff at Scribd use them too. I have a couple of implementations in C with Ruby bindings on my GitHub, github.com/tyler, and there are also a few other ones out there. If you just search GitHub or Google for "ruby splay tree" or "ruby bk tree", you'll find quite a few good hits there.