Go ahead and start. This is Alex from Indeed. He's done some cool stuff in AI and natural language processing back in his days as an engineer at Google, but today he's going to be talking about SQL, NoSQL, and then a third option in the middle: minimal perfect hashing, MPH. He's going to tell you all about it, because I know nothing, so I'll let him take it away.

I'm Alex Jim, a software engineer at Indeed, and I'm talking about MPH, a key-value store based on minimal perfect hashing. I'll explain a little later what that means, but the important thing to remember is that it's a read-only key-value store. I'm going to go through the use cases for this, then talk about the implementation, and conclude with some benchmarks.

So: I help people get jobs. Indeed is the number one job site worldwide, with 200 million unique visitors a month, in more than 60 countries and 29 languages. That's not the sales pitch; you can get that at our booth downstairs. It's just to emphasize that we're a global company. We want to serve our users worldwide with very fast response times, which means web servers near them, and those web servers need to talk to backend services in the same data center. With that comes the problem of data distribution at scale: we're pushing a lot of data back and forth between our data centers.

We distribute a lot of different types of data: the search index itself, job recommendations, ranking models, company reviews, all kinds of statistics, the deployables themselves, and tons of logging. All of these have different ways to be stored. The typical solution is SQL or NoSQL. Both are fine, with different pros and cons: SQL can be a single point of failure, NoSQL can have synchronization issues. In fact we use both of these, but I want to talk about an alternative today, which is to ship the data directly to consumers. It's not a new idea, but it's often overlooked.

What we're doing is giving each client a local, read-only copy of the data. This gives very low latency: you can mmap it or load it directly into memory, and you're not accessing any external services. It's simple and robust. There are no writes (think functional programming) and no moving parts, so nothing can break in production. And if you do need to update the data, you can just push a new copy, either ad hoc or on a regular schedule.

So, a read-only key-value store. The priorities are that the data must be small, because I want to minimize network usage, disk usage, and memory usage; you can't push around terabytes every minute. Read time must be fast; it doesn't help to save on the network if we're spending a lot of time in CPU. Write time is less of a concern: this is a write-once, read-many-times scenario.

So how do we build a key-value store? Pretty much everything boils down to some kind of tree or some kind of hash table. Trees tend to be more compact but incur a log(n) lookup, so you might need to hit disk multiple times. Hash tables take a little more space but are faster. Your file system will be a tree, something like gdbm is a hash table, and SQL databases use a mix of both. Our solution today is based on hash tables.
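To make the "ship the data to consumers" idea concrete, here is a minimal sketch of the client side, assuming the read-only file has already been pushed to the host. The path is hypothetical and the code uses only standard Java NIO, not any Indeed-specific API.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public final class LocalCopyExample {
    public static void main(final String[] args) throws IOException {
        // Hypothetical path: a read-only data file pushed to this host by a build/distribution job.
        final Path data = Paths.get("/var/data/recommendations/table.dat");
        try (FileChannel channel = FileChannel.open(data, StandardOpenOption.READ)) {
            // Map the whole file read-only; pages are faulted in lazily and shared via the OS page cache.
            final MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // From here on, every lookup is a local memory read: no network call, no external service.
            System.out.println("mapped " + buffer.capacity() + " bytes");
        }
    }
}
```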
So what is a hash table? It's a table based on a hash function, which takes a key and produces a deterministic but ideally random-looking value. Multiple keys might hash to the same value, so you need some kind of collision resolution, and you might have empty buckets.

A perfect hash table is a hash table with no collisions. By definition this can only be done when you know all the keys in advance, so it's only possible for a read-only table. An easy way to get a perfect hash function would be to use a cryptographic hash: you could take the md5 of every key, but then you can't realistically build a table with an index for every possible md5 value. Ideally you want a perfect hash function that maps to a small range of numbers, and the holy grail is the minimal perfect hash function, where the n keys are mapped exactly to the numbers zero to n minus one, with no collisions. That's perfect storage, and you're guaranteed a single seek.

Generating these hash functions is kind of complicated, but there's been good research on this in recent years involving solving graph problems, and there are libraries that will do it for you in linear time.

That gives you the hash function, but you still want to store the data on disk somehow. If all the key-values are fixed width, you can access them directly: take the hash value, scale it by the size of an entry, and read at that offset. More often they're variable width, so you need a separate table of offsets to index into the data. You'll notice the metadata here includes the hash function itself: the techniques used to generate these perfect hash functions don't produce constant-sized functions, they scale with the number of keys, but only at about two bits per key, which is pretty nice compared to the rest of our data.

One problem is that if your key-values are very small, the offset table can be a significant fraction of the index. So we also try another approach: instead of an offset table, we keep a bit vector, storing a one in the bit corresponding to each byte in the data file where an entry starts. If your key hashes to k and you want to find where the k-th entry starts, you look up where the k-th set bit is in the bit vector, and there exist data structures and algorithms (rank and select) that compute this in constant time. We then compare the size of the offset table with the size of this bit index and use the smaller one; the bit vector has an upper bound of one eighth the size of the data itself.

The benefits of this approach: as mentioned, there's no collision handling, so we get optimal storage to begin with, and you're guaranteed a single seek. You compute the hash, look up the offset, and read directly from that one location. You also don't need to verify the key: we're talking about a fixed set of keys known in advance, and if you're looking up a key that you know was in that set, you can just compute the hash, get the offset, and read the value. So we might not even need to store the keys with the data.
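Here's a minimal sketch of that lookup path, assuming a precomputed minimal perfect hash function and a variable-width layout with an offset table. The MinimalPerfectHash interface is a hypothetical stand-in for whatever library generated the function, not the actual implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public final class MphLookupSketch {

    /**
     * Hypothetical stand-in for a generated minimal perfect hash function:
     * maps each of the n original keys to a distinct slot in [0, n).
     */
    interface MinimalPerfectHash {
        int slotOf(byte[] key);
    }

    private final MinimalPerfectHash mph;
    private final long[] offsets;  // offsets[i] = byte where entry i starts; offsets[n] = end of data
    private final ByteBuffer data; // the serialized values, e.g. a memory-mapped file

    MphLookupSketch(final MinimalPerfectHash mph, final long[] offsets, final ByteBuffer data) {
        this.mph = mph;
        this.offsets = offsets;
        this.data = data;
    }

    /**
     * Single seek: hash the key to its slot, then read the value between two offsets.
     * No collision handling and no stored key to compare against.
     * (The offsets array could instead be replaced by select(slot) on a bit vector
     * marking entry starts, as described in the talk.)
     */
    byte[] get(final String key) {
        final int slot = mph.slotOf(key.getBytes(StandardCharsets.UTF_8));
        final int start = (int) offsets[slot];                    // sketch assumes data fits in one buffer
        final byte[] value = new byte[(int) (offsets[slot + 1] - offsets[slot])];
        final ByteBuffer view = data.duplicate();                 // independent position, shared contents
        view.position(start);
        view.get(value);
        return value;
    }
}
```

The contrast with an ordinary hash table is the point: there is no probing chain and no stored key to compare against, so every get is one hash plus one read.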
We can also handle lookups probabilistically. If you're not storing the keys and you look up a key that wasn't in the original set, it still hashes to some slot and resolves to some value, which is a false positive. You could use something like a bloom filter to filter those out in advance, but in fact we can do better than a bloom filter: we can store a small signature of the original key with the data. Say we store a 10-bit signature; at read time you compute the same signature for the key you're retrieving, and if they match, you assume it was an original key. With 10 bits that's a one in 1024 probability of a false positive, which is a much better rate than you would get from a bloom filter with the same number of bits. Incidentally, you can use the same technique to build a better bloom filter for any static set of keys.

There's also implicit compression: we might have a very large key space, long strings, and these are all compressed down to the numbers zero to n minus one. If we want to reference those keys from some other table or data structure, we can just store that hash value.

Now, the results; we've switched to this in production. There aren't a lot of solutions aiming at this exact immutable use case, so this is a bit of a grab bag. We looked at SQLite; I did say earlier we're looking at non-SQL solutions, but SQLite is a library that maintains a single file on disk, and that's perfectly suitable for pushing to your consumers. We looked at our in-house LSM trees (log-structured merge trees), and at LevelDB, another open source LSM tree. Then the hash tables: the MPH tables I've described here, DiscoDB, which is another minimal-perfect-hash-based solution that I found after implementing this, and TinyCDB, which is a non-perfect hash table.

As for the data, you can just think of it as clusters. We've got hash values, eight-byte hashes, each mapped to a cluster of items. The items are 80-bit integers, actually stored as base-32 strings, 16 bytes each. The first table is the map from a hash to its cluster of items, and there's an auxiliary table, which is the map back from an item to its cluster.

For the first table, hash to items, the trees are more compact, as expected, and the minimal perfect hash table is in line with the others when everything is stored as text. An important aspect of both of the Indeed solutions is that we allow pluggable serializers, so you don't have to store raw text. We can, for instance, store the eight-byte hashes as eight-byte binary longs (we do that automatically for SQLite as well), and for the items, instead of base-32 strings we can store 80-bit binary values. If you apply both of those, the MPH tables are the smallest, even smaller than our own trees using the same serialization.

The second table is where it gets really dramatic. Even without the additional optimizations, MPH is smallest. But we can also omit the keys: this table is the reverse lookup from items back to their hashes, and we don't have to store the items in it at all. We can just retrieve the hash and then check in the first table whether the item really was in that hash's cluster. So we can omit the key and cut our size in half.
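As a rough illustration of the signature idea described earlier (storing a few bits derived from the key instead of the key itself), here is a sketch; the FNV-style hash and the 10-bit width are illustrative assumptions, not necessarily what the library uses.

```java
public final class SignatureSketch {
    private static final int SIGNATURE_BITS = 10; // about a 1/1024 false positive rate, as in the talk's example

    /** Illustrative signature: any hash independent of the one used for slot assignment will do. */
    static int signatureOf(final byte[] key) {
        int h = 0x811c9dc5;                    // FNV-1a, chosen here purely as an example
        for (final byte b : key) {
            h = (h ^ (b & 0xff)) * 0x01000193;
        }
        return h & ((1 << SIGNATURE_BITS) - 1);
    }

    /**
     * At build time, store signatureOf(key) next to each value instead of the key itself.
     * At read time, recompute it and compare: a mismatch means the key was not in the original set.
     */
    static boolean probablyPresent(final byte[] key, final int storedSignature) {
        return signatureOf(key) == storedSignature;
    }
}
```

For the reverse table in the benchmark, the same effect is achieved without any signature at all, by verifying the result against the first table.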
And we can go further: the values in this table point to the hashes, which themselves have a minimal perfect hash function, so they compress down to four bytes each, the index in the zero to n minus one range. Suddenly this is one-tenth the size of anything else, a really dramatic reduction.

The next thing we're concerned with is read time. Here all the hash-based solutions are fastest, and MPH was the fastest of all; these are the times for reading from the two different tables. Write performance is less of a concern for us, but again, both of the Indeed solutions were pretty competitive.

In summary: smaller and faster, so if you can use an immutable solution, try it out. We've open sourced this; it's on GitHub, and there's a link to a blog post with more details about the implementation. The library is in Java, but it also has a command-line API, so you can try it out without writing any code. Any questions?

Yes. You mentioned CDB, and as far as I'm aware, the CDB implementation can't have a database larger than four gigabytes. Is there a limitation like that here?

No, there's no such limitation. I've tested this with more than four billion keys and files up to around 10 gigabytes; you could probably go larger, and if you wanted to go much larger you'd probably want to shard the solution. Actually, we had to leave TinyCDB off the write benchmarks: it was writing really fast, but it was not generating a valid table. Its limit is four gigs, but above three gigs we were getting data loss.

How long does it take to generate a perfect hash table for, say, a billion keys?

For around four billion keys it was about an hour. The minimal perfect hash function generation itself is very fast; the bottleneck is in the serialization to disk.

What's the nature of the data? You said immutable data; what kind, email addresses?

This was actually user clusters in this case, but we use it for job recommendations, we use it for all sorts of things. It's really good for machine-learning-type models: you can store the feature values without even storing the feature keys, and just leave them out probabilistically.

Is it possible to do better? You mentioned you're storing zeros and ones; I'm assuming rank and select on that?

Yeah, rank and select.

You mentioned constant time. Is it possible to do better with techniques along those lines?

I don't know of any low-hanging fruit to improve on this, but I'm sure you could find something.

For the benchmarks, how many records were you testing with?

This one, I think, was 50 million. The sizes are in megabytes, so these are three-gigabyte tables, and you see pretty much the same thing at other sizes; I tried one through ten gigabytes and it's pretty consistent.

Okay, thanks. Thank you, Alex.