Alright guys, welcome to, I think, session number four in the databaseology lectures. We are super excited today to have the Facebook guys who have been working on RocksDB. RocksDB is a fork of LevelDB, a better version if you will, and Facebook has been doing a lot of work to optimize it and make it do things that LevelDB cannot do. They're actually using it in production today for some of the things they're going to talk about. The great thing about Igor Canadi and Mark Callaghan is that they come from the fabled University of Wisconsin-Madison school of databases: they both have master's degrees from Madison, where they worked on databases and learned all of the skills they need to work on RocksDB. So eventually Madison will not be the champ in producing awesome database graduates, CMU will be, but that's another story. I think that's all I got. I should also say that Mark is the chief database recruiter at Facebook, and he also recruited Domas, who is working on MySQL there among other things. Is that correct? Okay, awesome. Thank you so much, guys.

So I've been dealing with open source. Which mic am I speaking into? Okay. So I've been working on databases for about 20 years. The first half was at Informix and Oracle, so closed-source databases, which meant I never got anywhere near a production deployment of a database, which wasn't the greatest way to learn about what is required. The second half of my career has been MySQL and now RocksDB: a lot of open-source databases, definitely web scale, so Google and now Facebook, very large sharded deployments of databases. The purpose of the title is to show that some of the motivation has changed over time with RocksDB. Initially we were definitely chasing performance, and now we are chasing performance with quality of service as we try to expand the use cases. Personally, I advise a couple of projects: RocksDB, Mongo plus RocksDB, MySQL plus RocksDB. And Igor is one of the leads of the RocksDB effort, definitely writing much more code than I write.

For the overview, we have a short summary of how RocksDB came to be and where it is now. Igor will cover the architecture of RocksDB, then I will cover work-in-progress problems we're trying to solve today, and then open problems that I definitely hope the team solves, but where I think there is some interesting research to be done.

For the story of RocksDB: at the time, in 2011 or so, I was still focused full time on InnoDB for MySQL. Someone else decided to try out LevelDB and got some interesting performance results. The key question was how many page reads per second we could do from fast storage. We had a few servers in production with dual PCIe flash devices; between the two devices they could do about 100,000 to 200,000 reads per second for, let's say, a 4K or an 8K read. So being able to saturate that IO capacity was interesting. Now, what happened was I ran some tests on LevelDB in a similar setup and my results were lousy. And so I said, oh, LevelDB, it's not so good, there's too much mutex contention here, it's slow. The reason for talking about this is that I had a performance result from LevelDB that was lousy, and I made a conclusion based on that, but I didn't take the time to figure out why the performance was lousy. And it turned out in this case I was using mmap.
The database was larger than memory, and we were running Linux 2.6.38, which had some mutex contention issues in virtual memory management. So if you're mmapping a database larger than RAM, you're constantly getting into the VM system to change page mappings. And the trivial fix was just to not use mmap, just to use pread. It's a common thing I see when people publish performance results: if you publish results that you don't understand, you're likely to make a few mistakes.

We had some internal uses of Tokyo Cabinet, and generally those users were unhappy to discover that it wasn't crash safe. The lack of crash safety is documented, but it's kind of buried in the documentation. So we had embedded users of databases who just were not happy with the systems they were using. Eventually LevelDB, once we got past the mmap performance problem, found some workloads. The big problem with LevelDB is that it's not meant for server workloads. By design, it's meant to be simple, and that's a compromise they're making, which is a good choice for their target workload, but compaction is single-threaded. So if you need to sustain a lot of IO throughput for compaction, LevelDB will fall over. And it's funny that a couple of projects that have forked LevelDB have learned this the hard way. Basho, the Bitcask people, tried a LevelDB fork, and Cornell has a LevelDB fork, and they all seem to rediscover that, oh, single-threaded compaction is not a good thing. We've done a lot of optimization to get better IO throughput, better monitoring, better control over the write amplification we see from compaction, and just a lot of features. We recently added transactions, so we have optimistic and pessimistic concurrency control. For pessimistic, we do repeatable read; it's similar to Postgres-style semantics. We're using that for MySQL, but we also hope that some pure RocksDB clients use it. And a lot of other features continue to get added to RocksDB. So now it's Igor's turn to speak.

So I joined the team in about 2013, and this is when the project was still called LevelDB, but we had already made many improvements, like multi-threaded compaction, backups, the merge operator, a lot of new features, optimizations for in-memory workloads and also for flash-based workloads. And we switched the name to RocksDB, and in November 2013 we decided to open source it. By open sourcing it, we opened it up to external users, and by that point we also had many users within Facebook. I think today we have something like 10 billion QPS across all the different services. As for external users, very early on LinkedIn and Yahoo picked it up. LinkedIn is using RocksDB as a storage engine for their follow feed, and also for Apache Samza, which is a LinkedIn project. Yahoo is using RocksDB as a storage engine for Sherpa, which is their distributed key-value store. And CockroachDB is a very exciting new database coming from some guys in New York, ex-Googlers, which we are very excited about. Then more recently, Microsoft contributed a port of RocksDB to Windows. Airbnb is doing something we don't know much about, but we see Airbnb engineers posting on our mailing list. And Pinterest is just now building some system that uses RocksDB. So we saw a lot of explosion, both within Facebook and in external users, and of course we are very happy about this. So now, what's next, right? We see a lot of success. What are our new challenges?
What we're currently focusing on is bringing RocksDB to general-purpose databases. Up to now we've been using it as embedded storage for different applications, both at Facebook and externally, but we also want to make it so that you can use it with MySQL and MongoDB. Those are the first two databases where we are targeting a storage engine replacement with RocksDB.

On the status of MongoRocks, which is what we call our RocksDB plugin for MongoDB: it's been running in production at Parse. Parse is a part of Facebook that does a mobile-backend-as-a-service kind of thing: you write your mobile application and it stores the data for you for some amount of money. We've been working on it for about the last year, and we've been running it in production for the past six months. We saw huge storage savings compared to old, pre-WiredTiger MongoDB. We saw compression from five terabytes down to 285 gigabytes. That was very surprising, but it turns out that Mongo does over-provision some storage space, and they also didn't used to have compression. So it's a big win. This is with Snappy, but even without compression it's, I think, about 10x. (Power of two?) Yeah, because Mongo has some padding. But Snappy goes even further down. This is comparable with WiredTiger, a bit less than WiredTiger, but WiredTiger also obviously has big savings versus non-compressed Mongo. And we also support document-level locking, versus the database-level locking in old Mongo or the collection-level locking in 3.0, which means our P99 latencies go way down and there are far fewer problems on high-concurrency workloads. So Parse has been happy with this. This is all open source, so we also have some external users, though not many. We've also been happy to learn that Percona is offering enterprise support for MongoRocks and for RocksDB in general. So yeah, this is an ongoing project, but so far it's been pretty successful.

And then we have another project that's trying to bring RocksDB into MySQL, which we call MyRocks. As you might know, most of Facebook's data, the primary user data store, is MySQL; that's where all of your likes and comments live. We've been experimenting with how MySQL would do if it switched from InnoDB, which is what we currently use as a storage engine, to RocksDB. We had two interesting findings, and this was very surprising to me. We took one of our production instances and replicated it to MySQL on RocksDB. We found that the database size on disk is 2x less: let's say InnoDB was two terabytes, RocksDB took only one terabyte for the same data. And the same was true for bytes written. Let's say InnoDB was sending 100 megabytes per second to flash storage; with RocksDB we are sending, let's say, only 15 megabytes per second. What this means is we can use fewer flash devices, and with bytes written reduced, we can also make those flash devices last longer. It also means we can add one more billion users on the same hardware we have today without having to buy new hardware, which is pretty cool. So this was the initial experimentation, and once we saw those results, we decided to invest much more resources and effort into this project. Currently it's at a very, very early alpha stage, but we hope to have something very stable, let's say, by the end of next year. So now that we've told the story of RocksDB and how it came to be, let's dive a bit into the architecture.
RocksDB is based, same as LevelDB, on log-structured merge trees, or as we call them, LSM trees. In an LSM tree, data is organized in many files, and in a leveled LSM tree, the files are organized by levels. So you have level zero, level one, level two, and so on. There are two things to note about this structure. The bottom level holds the oldest data and is also the largest level. In this example configuration, level four might be set to, let's say, 500 gigabytes; then your 500-gigabyte database is basically in level four. The upper levels, levels three, two, one, and zero, are just deltas from newer updates flowing down toward level four. So data flows top down. At the very top you can see the memtable. The memtable is just an in-memory write buffer. All the writes go into the memtable, then they get flushed down to level zero, and eventually, via a process called compaction, they end up in level four. Each level is usually configured to be 10 times bigger than the previous one. So in this case, if you set level four to be 500 gigabytes, level three would be 50 gigabytes, level two five gigabytes, and so on.

So let's trace what happens when a key-value pair gets written into RocksDB. You have a key and a value. The first thing we do is write it to the write-ahead log. That way, even if your process crashes, we can still recover the data and get to the same state. If your machine crashes, you lose some data if you don't fsync, but you can still recover most of it from the write-ahead log. Once the key-value pair is written to the write-ahead log, we apply it to the memtable. The memtable is a skip list by default; we have some other implementations, but in most cases it's a skip list. So we add the key-value pair to the memtable and then we're done. Our writes are pretty fast: you just append something to the write-ahead log, which, if you don't fsync, is just a memory copy, and you insert the key-value pair into the memtable. And basically we're done; in the foreground, almost nothing happens. Then in the background, once the memtable gets full, in this case, let's say, bigger than 64 megabytes, we flush it to level zero. In level zero, we organize files by time: there's the oldest file and the newest file that just got flushed from the memtable. So flush is one process. The other process that pushes files down is called compaction. What compaction does is take one file from, let's say, level two, plus all the files from level three that overlap with that file, and merge them together. It does a merge sort and outputs new files, and those new files then go on to live in level three. So once a level is full, we do compaction and push the one file that we choose down to the next level. And at some point, all the new data ends up in level four. To reiterate: in the foreground, in the user thread, the write just goes to the write-ahead log and then into the memtable. In the background, when the memtable is full, we flush it to level zero, and when a level is full, we do compaction so the levels stay within their target sizes.

So how does a read work? Let's say I want to get the value for some key. If there has been a recent write, the key will be in the memtable. If you find the key in the memtable, you know it's the latest value, and you can just return it immediately.
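To make that write and read path concrete, here is a minimal sketch against RocksDB's public C++ API; the database path, the keys, and the 64-megabyte write buffer are illustrative choices, not settings prescribed in the talk.

```cpp
#include <cassert>
#include <string>
#include "rocksdb/db.h"

int main() {
  rocksdb::DB* db;
  rocksdb::Options options;
  options.create_if_missing = true;
  // When the memtable grows past this, it is flushed to a level-0 file.
  options.write_buffer_size = 64 << 20;  // 64 MB, as in the example above

  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_demo", &db);
  assert(s.ok());

  // Put appends the record to the write-ahead log and inserts it into the
  // memtable (a skip list by default); nothing else happens in the foreground.
  s = db->Put(rocksdb::WriteOptions(), "key1", "value1");
  assert(s.ok());

  // Get checks the memtable first, then level-0 files newest to oldest, then
  // at most one file per lower level.
  std::string value;
  s = db->Get(rocksdb::ReadOptions(), "key1", &value);
  assert(s.ok() && value == "value1");

  delete db;
  return 0;
}
```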
However, if you don't find the key in the memtable, then you have to go read the files. There's one thing I didn't mention before: level zero is special. It's special in that files in level zero can cover the whole key space. Each file there can contain keys from, let's say, A to Z, if you consider the English alphabet. Levels one, two, three, four, and so on, however, are partitioned: each file only contains a subset of the key space. So in level one we could have one file from A to B, another file from C to D, and so on. When you read a key, you need to go through all of the level-zero files in time order: you first go to the newest level-zero file and check if the key is there; if it's not, you go to the second-newest level-zero file, and so on. But once you fail to find the key in any level-zero file, you only have to go to one file in level one, one file in level two, one file in level three, and so on.

This looks like a lot of reads, but what actually happens is we have bloom filters. A bloom filter is a probabilistic data structure that takes a set of, let's say, keys and produces a structure that's pretty small in memory, and it can tell you either "no, this key is certainly not in this set" or "yes, this key might be here." So we use bloom filters: say I want the value for key A. I go to a file's bloom filter and ask, hey, is A in this file? And the bloom filter says no, it's not here. So in most cases we read just one file. We consult many bloom filters, but in the end we usually do only one IO for point queries.

In RocksDB we also support range lookups, or range queries: start at some key K and then go next, next, next in sorted order. For range queries, bloom filters will not help us. A bloom filter tells us whether a given key is here, but for a range query we don't care whether this exact key is here; we need the next key in sorted order. So we do need to read all of those files. What helps there is memory. Let's say level four is 500 gigabytes and level three is 50 gigabytes. If you have 100 gigabytes of memory, then everything above level four can be cached. If you have a bit less memory, then maybe level four and level three are not cached and you do two IOs, but everything above them is cached because those levels are pretty small. If you have more memory, and usually our server hardware has quite a bit of memory, we do only one IO even for range queries: one IO for level four, and everything else we find resident in memory. To reiterate: for a point query, get me the value for this key, bloom filters let us avoid IO, and in most cases we do only one physical read from storage. For range scans, bloom filters don't help, and for short scans we do one or two physical reads, depending on the amount of memory and the amount of data that's cached.

Cool. So now that we've gone over the LSM tree and how writes and reads work, let's talk about RocksDB's files and our data format. If you go to a RocksDB directory and you do ls, you'll find something like this: a MANIFEST-something file, a couple of log files, files with the extension .sst, a log file called LOG in capital letters, and so on.
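Before the file tour, here is a sketch of how those read-side pieces surface in the API: bloom filters are configured per table through the block-based table options, and range queries go through an iterator. The 10 bits per key is a common default, an assumption rather than a number from the talk.

```cpp
#include <memory>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/filter_policy.h"
#include "rocksdb/table.h"

rocksdb::Options OptionsWithBloomFilters() {
  rocksdb::BlockBasedTableOptions table_options;
  // ~10 bits per key gives roughly a 1% false-positive rate, so point reads
  // usually touch only one data block on storage.
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}

// Range scans cannot use bloom filters: the iterator merges the memtable and
// all levels and returns keys in sorted order.
void ScanFrom(rocksdb::DB* db, const std::string& start) {
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));
  for (it->Seek(start); it->Valid(); it->Next()) {
    // use it->key() and it->value()
  }
}
```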
So we'll go over all of these files, what they mean, and how we use them. The most important file is the MANIFEST. The MANIFEST is the file that contains our database metadata; it describes what our LSM tree looks like. That's obviously very important, because if you lose it, you don't know where anything is and your database is basically corrupt. If you delete your MANIFEST file, you can't open your database. What the MANIFEST enables is atomic updates to the database metadata. In the example here, we start with the initial state, which tells us: this is the state of the LSM tree; these are the filenames of the L0 files; these are the filenames of the level-one files; and so on.

Now let's say a flush happens. What a flush does is take the memtable and flush it to a file, so now we have to add that file to our LSM tree: we have to record that this file is now part of the LSM tree. But there's one more thing we have to do. Our updates are not idempotent, which means we can't apply the same update twice and get the same result. So when we do a flush, we also need to mark the write-ahead log as persisted. At the same time, we write two updates: one saying "don't recover this log" and one saying "this file is added to the LSM tree." If we die before the flush commit happens, we will delete the file; we'll say, okay, we don't know this file, it's not part of the LSM tree, and we will recover the same data from the write-ahead log. If we die after the flush commit, then this file, file nine, is part of the LSM tree, and we don't replay that log. That way we make sure that after a flush is done, the log's updates will not get replayed during recovery. Compaction takes some files, merges them together, and produces new files, so a compaction record in the MANIFEST looks like: remove some files from the LSM tree and add some files to the LSM tree. Of course, all of those updates also have to be atomic: you don't want to add a file and not delete the old files. And then there are some other updates, like adding new column families and so on, that we also want to record in the MANIFEST.

The second file we have is the write-ahead log. I already talked a bit about it. It's basically the same format and same data structure as our MANIFEST: it supports appending a record in an append-only fashion, and it supports appending a couple of records atomically. So in the first case here, I write key A with value B, and I append this update to the write-ahead log. In the second case, I have two writes, write key C with value D and write key E with value F, and those two updates happen atomically: either both of them happen or neither does. And this is pretty cool, because let's say you have a table in MySQL or in Mongo and you have an index on it. You want to make sure that when you insert something into the table, you also insert the corresponding record into the index. In RocksDB we can make those two updates happen together, so you know your index is always consistent with what's in the table; you don't have to worry about re-scanning the table if you crash. So that's pretty cool. And then of course we have deletes and some other record types. Is there a question? (Question: is this the same format as LevelDB?) Correct, yeah. So the question is whether this is the same format as LevelDB. Yes.
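A minimal sketch of that atomic table-plus-index write using RocksDB's WriteBatch; the key layout and the helper name are hypothetical, but the batch semantics are exactly the all-or-nothing behavior described above.

```cpp
#include "rocksdb/db.h"
#include "rocksdb/slice.h"
#include "rocksdb/write_batch.h"

// Insert a row and its secondary-index entry atomically. Both records enter
// the write-ahead log as one batch, so recovery sees either both or neither,
// and the index stays consistent with the table.
rocksdb::Status InsertRowWithIndex(rocksdb::DB* db,
                                   const rocksdb::Slice& row_key,
                                   const rocksdb::Slice& row_value,
                                   const rocksdb::Slice& index_key) {
  rocksdb::WriteBatch batch;
  batch.Put(row_key, row_value);
  batch.Put(index_key, row_key);  // index entry points back at the row
  return db->Write(rocksdb::WriteOptions(), &batch);
}
```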
We did make some changes to the format and to the classes, but most of it is still the same as LevelDB, yes. And then of course we can mix and match writes and deletes, basic stuff.

And then the meat is the table files; that's where we keep the data. When you flush your memtable, or when we do compaction, we produce table files. Table files are divided into blocks, and we have a couple of types of blocks. The block that contains the actual data, the keys and values, is called the data block, and a data block is just a list of keys and values. Something noteworthy here is that they are compressed: each data block is individually compressed. They're also prefix encoded. What that means is that if two keys share the same prefix, the second key will just say, hey, I share a prefix of length eight with the previous key, and it will not repeat that prefix. So we first prefix-encode the keys, then we compress the blocks, and we write them out in an append-only fashion. After we've written out all the data blocks, we write a couple of meta blocks. The most important meta block is the index block. The index block contains an index over the data blocks: for each key it has a pointer to a data block. So when you read a file and you want key B, say, you first go to the index block and find which data block contains key B, and then you go read that data block. We also have the filter block, and filter blocks are just persisted bloom filters; we persist them so we don't have to recreate them every time you open the database. Then we have a couple of blocks like statistics: the statistics contain, for example, the number of deletions in this file, and we use that to guide our compaction strategies. And then we have the meta-index block, which points to the index block, the filter blocks, and so on. So when you read from a table, say you want key A: first you go to the filter block and check whether the key even exists in this table; if it might, you go to the index block and ask which data block could contain this key; and then you load that data block from the file system and find your key. In most cases, we keep both our bloom filters and our index blocks in memory, because if you don't, then every time you access a file you have to load something from disk. Our bloom filters and index blocks can be quite big, but we try to always keep them in memory.

And then the LOG file is just debugging output, that file named LOG in capital letters. We record all our tuning options, all the flushes and compactions that happen, and some metadata, so we can debug what's happening: are our compactions too frequent? We also record some performance statistics: how fast did a compaction run, what's our write throughput. It's pretty cool, because if somebody calls you and tells you, hey, my RocksDB is slow, you can just tell them: give me your LOG file. You can read everything from there; everything that's happening in the database is in the LOG file. Yes, there's a question. (Question: are you logging all the information you just spoke about in production clusters also?) Yes, so the question is, are we logging everything in production? Yes. (Question: and how much performance overhead does that logging add?)
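To illustrate the prefix encoding just described, here is a simplified standalone sketch; it is not the actual RocksDB block builder, which also inserts restart points so a block can be searched without decoding every key.

```cpp
#include <string>
#include <vector>

// Each encoded key stores only the suffix that differs from the previous key,
// plus the length of the prefix it shares with that previous key.
struct EncodedKey {
  size_t shared_prefix_len;  // bytes shared with the previous key
  std::string suffix;        // the rest of this key
};

std::vector<EncodedKey> PrefixEncode(const std::vector<std::string>& sorted_keys) {
  std::vector<EncodedKey> out;
  std::string prev;
  for (const std::string& key : sorted_keys) {
    size_t shared = 0;
    while (shared < prev.size() && shared < key.size() &&
           prev[shared] == key[shared]) {
      ++shared;
    }
    out.push_back({shared, key.substr(shared)});
    prev = key;
  }
  return out;
}

// {"user:1000", "user:1001", "user:2000"} encodes as
// (0, "user:1000"), (8, "1"), (5, "2000").
```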
We never really measured it, but flushes usually happen, let's say, every couple of seconds, maybe every 10 seconds, and compactions happen maybe every couple of minutes or even every tens of minutes. So when one happens we log something, but it's not much data and it happens infrequently, so it shouldn't add much overhead. But yeah, we never measured it; it's not that much data, just a line every couple of seconds. (Question: do you think it contributes significantly to the write amplification of the system?) Definitely not. It does not contribute much; it's very little data. Much more data is coming into the machine than is written for flushes. And it's very useful for us: if something happens, we just log into the machine and take a look at the LOG file.

One cool thing about LSM trees is that table files are immutable. If you think about it, everything that's changing is changing in the memtable. Flushes generate whole files, and compactions generate whole files and delete whole files. So once you write a file, it never changes. That makes backups very easy; it's easy to do fast, incremental backups. What you do is just check: what files do I have, what files are already in remote storage, and you send only the files that are new. We've been using this a lot in various services, and we even open-sourced a tool called Strata, which we use at Parse to back up MongoDB clusters. You can also do really cool stuff: you send the files to remote storage and then you can do live, queryable backups. What that means is you can open a Mongo shell, switch to the backup, and just query the backup without restoring the full data. So you can run queries on top of your backups: you can check, hey, what was the value of this document a month ago, if you have a backup from a month ago. So yeah, backups in an LSM are easy, and it's very cool, and we have built tooling around it that's all open source.

Cool, and with this I'll give Mark the mic to talk about what we are currently working on. (Question about benchmarking: not write amplification exactly, but just the size of the debug log.) By default it's set to info, the informational level. It's too much for me, but I think for real workloads it definitely isn't a problem. And something important for embedded applications: the internal RocksDB users are not RocksDB experts, and generally they're not even properly monitoring RocksDB. We, the provider, are expected to do that, but their application is not running under our control, and neither is the database. So we generate a log file without their intervention, and that just makes it easier to support production.

The big work in progress for me is MySQL plus RocksDB. We have another workload where we're trying to get RocksDB performant on pure disk. The log-structured merge tree, in the early 90s when the first paper came out, was pitched for pure disk solutions, because you avoid the random IO of page write-back. Fast forward 20 years or so, and at Facebook we've been mostly talking about RocksDB for pure SSD workloads. But now we have a big in-house user for disk. Then tiered storage: some combination of emerging storage technologies. We have memory; eventually we'll have NVM, then NAND flash, and then disk arrays. Efficiency is a big thing for us. We have a lot of data.
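Incremental backups of those immutable files are exposed through the BackupEngine utility that ships with RocksDB; a minimal sketch, with a hypothetical backup directory.

```cpp
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/env.h"
#include "rocksdb/utilities/backupable_db.h"

// Because table files are immutable, an incremental backup only has to copy
// the SST files that are not already present in the backup directory.
rocksdb::Status BackupTo(rocksdb::DB* db, const std::string& backup_dir) {
  rocksdb::BackupEngine* backup_engine;
  rocksdb::Status s = rocksdb::BackupEngine::Open(
      rocksdb::Env::Default(), rocksdb::BackupableDBOptions(backup_dir),
      &backup_engine);
  if (!s.ok()) return s;
  s = backup_engine->CreateNewBackup(db);  // copies only the new files
  delete backup_engine;
  return s;
}
```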
Do we have to use SSD for everything, or in the future do we have to use NVM? If you ask the internal user, yes, they want a pony: they want NVM, they want SSD. They don't want disk; operations doesn't want disk. Disk makes life hard, and mistakes are slower to recover from. So being effective with tiered storage can be a big deal. And then just another thing: is anyone from the MICA paper here? Okay. If you look at LevelDB, or even RocksDB, reads on the memtable are concurrent with everything; they're lock-free. For writes, there's a mutex. If you look at a lot of the academic work, and even production-quality code like Hekaton, everyone is doing lock-free or highly concurrent in-memory structures that can be updated. We don't have that yet.

So, web scale for MySQL at Facebook, and this is the tier that I work on. These numbers are from earlier this year: about 175 million queries per second at peak; I won't say how many servers; 12 billion rows read per second. The interesting thing is 140 million disk reads per second, and this is really PCIe NAND flash; we're not using disk on this tier anymore. We're doing, at peak, almost one read from storage per query. So even though we have a lot of data in memory and good caching, and a very effective cache tier above the database, we are seeing colder queries and we are definitely using a lot of IO capacity. The size is many tens of petabytes, and this is after getting about 2x compression with InnoDB, so we've already shrunk the tier in half with InnoDB. The reason for MyRocks is that hopefully we can shrink the tier in half again. It's a really big deal when you're using that much NAND flash. Availability: many nines. Unfortunately, legal won't let me say how many, although we did, I think, increase it by an order of magnitude. We deployed a solution with automated failover, so within 30 seconds of a master dying, automation kicks in with no human intervention, does the promotion, and we think it's lossless. It's not Paxos or Raft, but we're claiming it's close enough to lossless to be called lossless. When you're dealing with production, even with synchronous replication, it's hard to say that anything is lossless; things just go wrong. But the fact that it is not losing commits means we're willing to automate it and let it run fast.

The primary workload is social graph transaction processing: likes, status updates, comments. We don't store video or pictures in the database, but we do store the metadata for video and pictures. So even though in public we've said we don't use MySQL for those, whoever does pictures at Facebook is benefiting greatly from using MySQL for the metadata. Messaging was deployed last year; it used to be on top of HBase, and the move improved availability and response time, quality of service. We're also using MySQL with Presto to do parallel query: the data store gets trickle updates and then Presto does the parallel query on top of InnoDB. And we have other users that just haven't been described in public.

MySQL plus RocksDB: again, the big win is 2x better compression compared to compressed InnoDB. But as we do a lot of performance work on it, there are some interesting problems that you don't hear described much in the LSM community. The first one is that we're sharing one LSM tree across many indexes and many tables, and drop table and drop index have to be fast; I mean, they have to be instantaneous.
Now, if we have multiple indexes in one LSM tree, we can't just issue a delete for every key-value pair of the dropped index. So we have a way to reclaim the space in the background using a callback function run through the compaction filter (there's a sketch of this below). Every key-value pair put into the LSM has an index ID as the leading part of the key, so we can identify the keys to be dropped. Optimizer statistics per SST file: as compaction runs, it's dropping old files and creating new files, and we have a callback that runs during compaction when an SST is written. We use that to compute the optimizer statistics we need, and then on startup we can collect this metadata to rebuild the stats for a given table. Tombstones: we've had a lot of performance problems with tombstones. It turns out that LevelDB is really efficient at dropping old versions of keys during compaction, but the algorithm it uses for dropping tombstones generally means tombstones are not dropped until they reach the base of the LSM. So we had queries that would encounter millions of tombstones and take seconds to finish, which was really lousy for performance; I think we've recently fixed this for MyRocks. We are mapping different indexes to different column families for configuration differences: some column families are tuned for range scans, others are tuned for point operations. Byte-comparable keys: we want to use memcmp to do the key comparisons in the index, but we have composite keys, character-set issues, and variable-length keys, and all of this must be combined into a single string that's byte comparable. And then transactions: we've added support for optimistic and pessimistic concurrency control for pure RocksDB applications, and MySQL uses that.

Another point is that there are many dimensions to "better." In a lot of the papers I read, the focus is on throughput. A lot of the focus for me has been on efficiency. We don't need 10x lower response time, because the response time we get today is already sufficient. If we could get 10x better efficiency, and that would be hard to do, that would be amazing. So performance is one notion of better; for a lot of us collecting user data that we promise to store forever, efficiency is a big win. And then manageability: it's really hard to hire operations people, so if your database minimizes the size of the operations team, it will be more popular. A big reason MongoDB is popular is that it makes it easy for small startups to scale out a database. They're not winning the performance game versus MySQL; they're winning the manageability game. And the last one is availability. If you're on call, do you have to wake up every time a server dies? That's a lousy on-call experience, which means your on-calls are going to quit, so it's related to manageability. The other benefit is that your users are happier when there's less downtime.

For storage efficiency, we can talk about read, write, and space amplification. Read amplification, the simple definition, is how many physical IOs I do per query. I would really split this into read amplification for point queries and for range queries, since you consider them differently, and you can also talk about it for in-memory structures separately from persistent structures. Write amplification: how much am I writing per transaction or per query?
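Here is what that background drop-index mechanism might look like as a RocksDB compaction filter; the 4-byte index-ID key prefix and the set of dropped IDs are assumptions for illustration, not the actual MyRocks code.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include "rocksdb/compaction_filter.h"
#include "rocksdb/slice.h"

// Drops every key whose leading 4-byte index ID belongs to a dropped index.
// Compaction calls Filter() for each key-value pair it rewrites; returning
// true removes the pair, so the space comes back in the background.
class IndexDropFilter : public rocksdb::CompactionFilter {
 public:
  explicit IndexDropFilter(std::set<uint32_t> dropped)
      : dropped_(std::move(dropped)) {}

  bool Filter(int /*level*/, const rocksdb::Slice& key,
              const rocksdb::Slice& /*value*/, std::string* /*new_value*/,
              bool* /*value_changed*/) const override {
    if (key.size() < 4) return false;
    const auto* p = reinterpret_cast<const unsigned char*>(key.data());
    // Hypothetical key layout: big-endian index ID, then the index key.
    uint32_t index_id = (uint32_t{p[0]} << 24) | (uint32_t{p[1]} << 16) |
                        (uint32_t{p[2]} << 8) | uint32_t{p[3]};
    return dropped_.count(index_id) != 0;  // true = drop this key-value pair
  }

  const char* Name() const override { return "IndexDropFilter"; }

 private:
  std::set<uint32_t> dropped_;
};

// Registered via options.compaction_filter = &my_filter; before opening the DB.
```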
Now, what we mean by a write here is going to differ based on the technology. If it's disk, I probably care more about seeks, random operations. If it's SSD, I care a lot more about bytes written, because write endurance is an issue. So we have some wiggle room in how we define write amplification. Space amplification is just the size of my database versus the size of the data. For MyRocks, the big deal is space amplification cut in half compared to InnoDB. So we have different concerns depending on the workload.

Tiered storage is just another way to pursue efficiency. A few years back, Facebook did this thing called Flashcache. It's a persistent write-back cache that lets you map one storage device on top of another. It's transparent to the application; it's a Linux kernel module. We did this to put PCIe NAND flash on top of a disk array: a fraction of the database was cached in flash, the entire database was on the disk array, and the flash also absorbed a lot of writes and gave us better elevator scheduling from flash to disk. We now have NVM, so there's a hierarchy forming between NVM, NAND flash, and disk, and eventually you'd like to move data down the hierarchy. The question is how much awareness the database algorithm requires, and how flexible this is: if you make the wrong decision when putting data on disk, you're going to get some lousy response times as the workload changes. The other thing is that the efficiency you care about differs across the devices. If you want to do sequential or large requests for reads and writes, disk is by far the cheapest. If you want to do a lot of write IOPS, NVM is the best. If you want to do a lot of read IOPS, NAND flash is the best. For me, an LSM makes it easier to navigate these choices than an update-in-place B-tree, so one of the exciting things about RocksDB is that we have an easier way of pursuing these efficiency wins.

Reinforcing this, I used retail data: public specifications and list prices from Amazon for disk and for TLC and MLC SSDs. The interesting metric was write megabytes per second. I don't use the peak write rate for an SSD; I use the sustained write rate. My workload is 24-7; there's no downtime. There's a daily curve, but the device is always in use. So with an SSD, you need to consider what average daily write rate you can sustain, in write megabytes per second, to the device. For TLC, which does about 1,000 device writes over its lifetime, you get about one device write per day, which per terabyte is about 10 megabytes per second. For MLC, which is moving down to about 3,000 lifetime device writes, over three years you get about 30 megabytes per second of sustained writes per terabyte, which is kind of low compared to what you get from a disk array. So the really big difference between disk and SSD: if you want to sustain writes over several years, the disk is going to be much, much cheaper. If you want random reads, or random operations generally, we're looking at SSD. But we really want to use the SSD for reads; if possible, we'd like to use the disk for writes.

Multi-threaded memtable: today the memtable is a skip list with a mutex, meaning reads are lock-free and writers serialize on the mutex. And the last point: we can fix this later.
Judging what to focus on now and what to punt on for later: I think we assumed we would have a solution by now for the multi-threaded memtable. It's not there yet; we are working on it, and it will be fixed next year. Some of these decisions to punt performance work to later are good ones, but this one I think we need to fix. (Question: how do you support reverse scan?) You just have pointers to the next entry in the interface; you don't need to store more, OK?

So, open problems. The first big one is managing ingest. Ingest times write amplification should not exceed compaction throughput. If compaction can process 100 megabytes per second and your write amplification is 10, then your ingest over the long run cannot exceed 10 megabytes per second. If it does exceed that, your LSM is going to get into bad shape. We can estimate or measure at runtime what the compaction throughput is: we can see how much we can read and write to the disk array or the SSD, and how fast compaction can process data. We can also estimate write amplification. From that, we can figure out a target ingest we want to sustain. The challenge is: how do you shape the ingest? You need to slow down writes at certain points in time, and writes are cheap in RocksDB; they're just the write-ahead log and the memtable. So we're still working on better ways to shape and predict ingest, to make sure it doesn't exceed what compaction can sustain. (Question: is it because, if you have a very skewed workload, you may need less compaction, so what do you target for ingest?) Let me repeat the question for the video. The question is about a skewed workload versus a synthetic uniform distribution of key updates. I would say that will show up in the measurement, because you can measure your compaction throughput, you can measure how fast you can move data through, and you can also measure your ingest; if you're in a steady state, from those you can compute the write amplification. With skew, compaction doesn't go as far down the tree; it can hopefully stop early. However, there was a bug in RocksDB where it wasn't doing that. In the benchmark client we have an option for a skewed update pattern, and I noticed that with one of the skewed update patterns, compaction was still going down to the base of the tree, when in theory it should stop at the first level that holds the working set. But all I would say is: you can measure compaction throughput and ingest, and from that you can compute the write amplification, which would be smaller in this case. So you can measure this dynamically.

Another one: adaptive algorithms. We have a lot of complexity in RocksDB. It's excellent if you want to do auto-tuning work; it's excellent if you want to sell professional services; it's hard to configure. Some of this is because we have a wide variety of workloads. But can we move the cleverness into the algorithm rather than into the person writing the configuration file? It's also important because real workloads might have a mixture of behaviors. With RocksDB we have different column families we can use, and we can tune each column family separately, but it's still static. So can we move the cleverness into the algorithm to reduce the complexity of the configuration? And depending on the workload we might have different goals: we might want to reduce read, write, or space amplification.
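For context, this is what the static per-column-family tuning looks like today, as a sketch; the family names and option values are hypothetical, not recommendations from the talk.

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Two families in one database, tuned differently at creation time.
void CreateTunedFamilies(rocksdb::DB* db) {
  rocksdb::ColumnFamilyOptions range_cf;  // tuned for range scans
  range_cf.compression = rocksdb::kZlibCompression;

  rocksdb::ColumnFamilyOptions point_cf;  // tuned for point lookups
  point_cf.compression = rocksdb::kSnappyCompression;
  point_cf.write_buffer_size = 128 << 20;

  rocksdb::ColumnFamilyHandle* range_handle;
  rocksdb::ColumnFamilyHandle* point_handle;
  rocksdb::Status s1 =
      db->CreateColumnFamily(range_cf, "range_scan_index", &range_handle);
  rocksdb::Status s2 =
      db->CreateColumnFamily(point_cf, "point_lookup_index", &point_handle);
  // Reads and writes then take the handle, e.g.
  // db->Put(rocksdb::WriteOptions(), point_handle, key, value);
}
```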
Igor mentioned that we want the bloom filter and index metadata to be in cache, and that's easy to do if the ratio of database size to RAM size is not too big. But as you grow database size relative to RAM, some of this changes. You don't want to do a disk read of the bloom filter to figure out whether you can avoid a disk read of the database file; you're still paying one disk read at that point. So how do you manage the case where the database is, let's say, 100 or 200 times RAM? Some of your caching strategies have to change; the number of bits you allocate per key in your bloom filter has to change. We want this to be dynamic, with the goal of reducing disk reads for metadata. One interesting topic is the clustered index: with a clustered index, you need in cache one key per database block; with a non-clustered index, you need in cache one key per row in your database. So with a clustered index, the amount of data you need in cache is much less than with a non-clustered index (there's a small worked example below). So there's just some thinking to be done here. Since I work on MySQL and now RocksDB, I read a lot of people doing benchmarks against RocksDB and MySQL, and usually the other system is claimed to be faster. A non-clustered index is something that has been used by some of the systems that are faster than RocksDB, and the authors haven't always acknowledged the impact of requiring a non-clustered index: your database-to-RAM ratio cannot grow as large as it could with RocksDB. "X is faster than RocksDB": we love to read these papers. They are interesting, and it means you're talking about RocksDB, which makes us happy. We also like academics; I only got part of the way through grad school, and I do go to database conferences. So if you are working on this, please talk to us. We'll help, even if helping means you get better results on your side; we'll help explain the results you get from RocksDB. But just talk to us. And that's it. So thank you.

(Question: I have a couple of questions. You started working on LevelDB; when you started making RocksDB, what's the first thing you had to change to make it not be bad?) Stop using mmap; that was the first, and that was a trivial one. The big thing was multi-threaded compaction. Initially, with LevelDB, there's one thread for memtable flush and one thread for doing all the rest of the compaction. And because of write amplification, the ingest gets amplified: if you want to sustain, let's say, 50 megabytes per second of ingest with a write amplification of 10, you need to be writing out 500 megabytes per second. Even worse, if you're doing zlib compression, it's hard to have a single thread doing the I/O and the zlib compression at 500 megabytes per second; that's just not going to happen. Snappy helps a lot, but multi-threaded compaction was a really, really big deal to overcome the compaction stalls. And then we added, early on, throttling to try to keep the LSM in good shape even when compaction couldn't keep up; we had a very primitive way of stalling writes so that compaction could keep up.

(Question: so MyRocks is sort of on your one-year horizon; what's a five-year thing, if you had a magic wand, that you wish could happen?) So, I market. I tweet, I blog a lot. I market bugs to upstream vendors, usually MySQL, and I also market projects to academics; I just throw out ideas. And this one I'm pushing to WiredTiger as well.
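(Returning for a moment to the clustered-index cache math above, a sketch with hypothetical sizes; none of these numbers are from the talk.)

```cpp
// Hypothetical sizes to make the clustered-index point concrete:
// a 1 TB database with 8 KB blocks and 100-byte rows.
constexpr double kDatabaseBytes = 1e12;
constexpr double kBlockBytes = 8 * 1024.0;
constexpr double kRowBytes = 100.0;

// Clustered index: one cached key per block -> ~125 million keys.
constexpr double kKeysClustered = kDatabaseBytes / kBlockBytes;

// Non-clustered index: one cached key per row -> ~10 billion keys,
// roughly 80x more cache needed for the same database size.
constexpr double kKeysNonClustered = kDatabaseBytes / kRowBytes;
```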
Back to the five-year wish: WiredTiger has an LSM and a B-tree. Dynamically, have an algorithm that figures out what part of my index should be using a B-tree, update-in-place or even copy-on-write, and what other part of my index should look more like an LSM, depending on the workload and the goals I set for read, write, and space amplification. Getting a dynamic algorithm that adjusts itself based on the costs you want to reduce and the workload you see, getting that from a storage-efficiency perspective, is a big deal, and I don't see anyone doing it yet. I think there's some interesting work there.

All right, last chance. Yeah, you. (Question: you said mmap was a problem because of the kernel bug; did you try it again once the bug was fixed?) Let me repeat the question: the question was, have I revisited the mmap issue since the kernel bug was fixed? Every time I do a lot of IO performance testing, I've always had pread end up looking faster than mmap, maybe only 10% in terms of the IO rate I can sustain. The big problem for me with mmap, or a big problem, is: does it know how much you want to read? mmap will do a file-system-page-sized read. If I need more than a file system page, is my database informing the file system that I actually want 12k or 8k of data? Do I have test data on this? Sure, but, and I haven't done this exact test, if I have a pause between the two reads, because I wait for the first 4k to come back, and let's say there's no file system read-ahead, then I wait for the first 4k, I decompress it, and then I ask for a little bit more. That's always going to be slower than just saying: give me the 8k or the 12k. And when you say fast, that's latency; but I also care about throughput, and I don't want to waste random IOPS on the storage device. If I need a lot of throughput, I don't want to do two IOs for something that really could have been done in one.

So, in two weeks I think we're having Ivan Bowman from SAP to talk about SQL Anywhere, so definitely check that out. The video will be posted online later today. Let's thank Mark and Igor one more time.