So our speaker today is Keith Bostic. When Keith sent me an email with his talk information, I tried to convince him to modify his bio, because he had two paragraphs about his long history, and I just wanted him to replace all that with the single word: legendary. Right? He refused. The reason Keith is legendary is that when he was at Berkeley in the Computer Systems Research Group, he helped write BSD, the main open source operating system around at the time. It's on every single Mac: the userland is BSD and the kernel is Mach. Keith was the lead architect for BSD 4.4, and for those of you that don't read the history in your operating system textbook, that was the version of BSD that AT&T decided to sue the University of California over, which stymied it for two years. That left a vacant hole in open source operating systems, which then allowed Linus to go off and make Linux. So part of the reason Linux is available today is because Keith decided to ruffle AT&T's feathers. Then after working on BSD, so not only is he an awesome kernel hacker, he went and did a startup called Sleepycat that commercialized the open source Berkeley DB database. They were bought in 2006 by Oracle, and then Keith was right back at it again with a new startup called WiredTiger, which was bought by MongoDB last year, and that's what he's here to talk about today. So we're really happy to have Keith here.

Is that on? I feel just a little bit oversold here. Legendary. Yeah, I don't think I'd go that far. Accidentally in the wrong place at the wrong time, maybe. So I'm going to talk about WiredTiger. This is a new set of slides. I didn't want to give you guys the MongoDB talk; I wanted to focus on the embedded parts of WiredTiger. I'm going to hand-wave on some stuff because it's a big slide deck, and I want you guys to get to lunch today. Please feel free to interrupt at any time. I'm happy to answer questions, I'm happy to dive into something that interests you. Just yell it out. So my name is Keith Bostic. I am a co-architect of WiredTiger. I am a senior staff engineer at MongoDB at the moment, and please feel free to send me email. You'll have a copy of these slides, and we're happy to talk about anything you want to talk about. If I don't know, I can probably find out the answer somewhere.

So from a top-level viewpoint, what's WiredTiger? It's an embedded database engine. Our goal is to be a general-purpose toolkit. There are kind of two ways you see database engines going these days: there's "I want to handle a random general-purpose workload," the toolkit, or "I'm solving a very specific problem," OLTP or whatever. We are a general-purpose toolkit. High-performing: we want to be scalable with low latency. We're a key-value store, no SQL, of course. We do have a schema layer, unlike a lot of traditional embedded database engines, so we have data typing and we maintain secondary indexes for you. We are single-node. If you think about it, a general-purpose MongoDB deployment has replica sets, it's got sharding, it's doing its durability through networking. We're single-node. We don't try to solve any of those problems.
Standard object-oriented APIs: currently we support Python, C, C++, and Java. We're open source, of course. Okay. Right now we have a couple of fairly large customers. Amazon AWS: a big chunk of the cloud database support that Amazon sells goes through a WiredTiger engine. We also sell to people doing financial trading solutions. So Amazon are the people that really, really don't want to lose their data, and Orc and Tbricks are the people that really, really want queries to happen very, very quickly. And then MongoDB, which is a general-purpose document store, or I should say a next-generation general-purpose document store. You may have seen this: MongoDB acquired WiredTiger about a year ago now, and this was the announcement. 7 to 10x better performance than the engine they had before, and 80% less storage than the engine they had before. That was all WiredTiger coming into their release. Or this. This is not something you get very often, so I'm pretty proud of it: almost ran out of disk space on a replica set, swapping all members to the WiredTiger engine saved the day. And it's kind of a nice drop. That's compression kicking in.

Okay. I'm going to talk for a second about MongoDB's storage engine layer, because it gives a good sample of how WiredTiger fits into a complex application. They have a plug-in structure a lot like MySQL's, right? It allows different storage engines to plug in. What's cool about that is, depending on your workload, you pick the storage engine you care about. They had an mmap-based storage engine originally, and the problem with mmap is that as soon as you have a bunch of writers, it usually turns into a giant lock problem, and things slow down. So it's great for read-only, but it's not too good if you're writing. One of the cool things that MongoDB wants to do and can do is that you can actually have members of a replica set using different engines on the same data. So what you can do is route a query or a set of queries, or say: at a certain time of day we do mostly read-only queries, so we're going to spin up three or four more replicas with this storage engine, and all of a sudden our workload goes a lot faster.

Yes. Yes. I'll talk a little bit about that on the next slide, but yes. And obviously you have an opportunity to innovate further. We'd like other storage engines in here to handle other workloads. I mentioned that we're a general-purpose toolkit, right? Which means that our responsibility is to handle every possible workload reasonably. We'll get it done, but maybe not as fast as you'd like. If it's worth it to you, you write the specific-purpose engine, plug it in, and your queries run faster. We're the general-purpose workhorse of MongoDB. And here's the example. The MongoDB query language handles all of the queries coming into the engine. It's got native drivers for pretty much every language you can imagine. Underneath it, there's a data model, and underneath that is a set of engines. MMAPv1 was the one they had until they bought us and plugged us in. There's an experimental in-memory one that parts of MongoDB are working on at the moment. The idea is, if I never have to evict anything from the cache, maybe things go faster.
Other ones that are available: Tokutek, if you know the fractal tree technology they had; they've been purchased by Percona, who markets and supports the Tokutek engine for MongoDB. They were here last year. Ah, cool. Facebook has something called RocksDB, which was derived long ago from Google's LevelDB work; it's an LSM-based engine, and it handles some very, very interesting data problems that Facebook has. For example, they want 300,000 open tables at the same time in the same application. We don't handle that so well, I'll admit it.

So that's the overview of WiredTiger. What I want to do for the rest of the talk is pick some specific places in WiredTiger where I think we've done something interesting, or that might be interesting to you. If I have a goal in this talk, it's to convince you guys to start working with WiredTiger. All right? You've got research things you want to do. You want to try different kinds of block managers. You want to try different kinds of transactional semantics. Hey, we're open source. We want to work with you. It would be great. You find out where we're screwing up? That's even better. All right? So that's my goal here.

So I want to talk about in-memory performance for a second. Why did we start doing this? Well, traditional engines struggle with modern hardware. We mentioned Berkeley DB. If you think about the machine that I had when we started working on Berkeley DB, we had two cores if we were lucky. I remember the day we got four cores; that was pretty cool, and we already had Berkeley DB running. Right now, we've got lots and lots of CPU cores, and lots and lots of RAM. Thread contention: when I started working on Berkeley DB, we didn't even have a thread model. There was no portable model for writing threaded applications. Now I want to avoid thread contention for resources, so I'm going to spend a lot of time on lock-free algorithms, for example, hazard pointers. Concurrency control without blocking: if you block, you're dead. It's over. So I've got to avoid blocking. How do I get more work per I/O? I want big blocks. If I have to go to disk, I really want to go to disk. If you look at traditional database engines, you're looking at a 4K, 8K, 16K block size. If I'm going to disk, I want 128 megabytes. Think about running a cursor through a big table for a join: I really want big I/O. I also want compact file formats, because it turns out that, relative to going to memory, it is more expensive to go to disk now than it has ever been before. So if I can get as much information as possible from disk for that I/O, that's a principal goal.

So this is the big picture of the WiredTiger architecture. I'll spend a couple of minutes on it here and then later we'll talk about different parts of it. We have the API, of course, and we dive into the engine. Pretty much everything in WiredTiger is a cursor, including hot backups, including statistics. Everything's a cursor. Wrapped around the cursors is a schema layer, so we do data typing, we do secondaries, all your indexes for free. On the side we have transactions. So it's a very traditional key-value store approach. You have a cursor that is going to return data, it's going to read or write, and you can specify a transaction around that cursor or other cursors. So you say begin transaction, do a bunch of updates, do a bunch of reads, and you commit or you roll back.
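To make that concrete, here's roughly what that looks like against the WiredTiger C API; this is a minimal sketch along the lines of the examples that ship with the source, with an illustrative table name and error handling omitted:

```c
#include <stdlib.h>
#include <wiredtiger.h>

int
main(void)
{
	WT_CONNECTION *conn;
	WT_SESSION *session;
	WT_CURSOR *cursor;

	/* Open a connection to the database, creating it if necessary. */
	wiredtiger_open("WT_HOME", NULL, "create", &conn);
	conn->open_session(conn, NULL, NULL, &session);

	/* The schema layer: declare key/value types for the table. */
	session->create(session, "table:access", "key_format=S,value_format=S");

	/* Everything is a cursor, including ordinary reads and writes. */
	session->open_cursor(session, "table:access", NULL, NULL, &cursor);

	/* Wrap a transaction around a set of cursor operations. */
	session->begin_transaction(session, NULL);
	cursor->set_key(cursor, "key1");
	cursor->set_value(cursor, "value1");
	cursor->insert(cursor);
	session->commit_transaction(session, NULL); /* or rollback_transaction() */

	conn->close(conn, NULL);
	return (EXIT_SUCCESS);
}
```

The `table:` URI goes through the schema layer; the same cursor model is how you get at statistics, backups, and the rest.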
Beneath those layers, we have what we call the access methods, and I'll talk a little more about those in a second: we offer row storage, column storage, and LSM. Beneath that we have a very traditional cache. These are copies of file blocks that used to live in the operating system; they were read off disk, and now we have our own local copies. We do page reads and writes to the files, of course, and underneath that we have what we call the block manager, which is designed to handle a traditional Windows or POSIX file system. On the side, underneath transactions, we do logging, a very traditional write-ahead log; I'll talk a little more about that at the end of the talk. And then on disk we have a bunch of database files, which are just key-value pairs, and log files.

I mentioned we have multiple access methods, and this is something we have that pretty much nobody else has, and it's kind of cool. We have row store, and row store is a very traditional B-tree: a byte-string key, a byte-string value, exactly the way you've seen it done for the last 30 years. We also offer column store. Integrated inside the same B-tree implementation, we do 64-bit record number keys. We don't actually store the keys; it's the value's location in the B-tree that tells you what its record number must be. We support both variable-length and fixed-length values. The reason for fixed length: if you think about the whole point of a column store, it's that a lot of columns don't change, and they're very small and can be easily represented, so with fixed-length values we can do things like bitmaps. There really aren't that many choices, and so we have fixed-length values inside column store as well. LSM: we wanted to do LSM, right? We wanted some way to do very fast inserts. And how do you do that? Well, we've already got this indexing structure; we've got a fast B-tree implementation. So the way we did LSM is as a forest of B-trees. Every LSM chunk is a separate B-tree, and we merge those B-trees and we compact those B-trees. In front of them, of course, we put bloom filters, and a bloom filter is a fixed-length column store.

An SSTable? What do you mean by SSTable? LevelDB, for example, implements LSM with an SSTable format. Right, which is a different format from a B-tree. We're agreeing. Okay, I agree completely: we're saying that's different.

The one thing we can do because of this is we allow you to mix and match, right? For a sparse, wide table, you would use a column store primary, and where your random inserts are happening, in the indexes, that's where you use LSM. This allows us to handle workloads that other engines cannot handle.

So that's the overall architecture, and those are the structures we're using. Let's talk about in-memory performance. How do you run fast in memory? We've all seen the traditional B-tree picture, right? There's a root, there's some number of internal levels, and at the bottom there are leaf nodes, and those leaf nodes are linked for cursor traversal. The reason they're linked for cursor traversal is that it allows me to evict the internal pages, because I can still run a cursor through leaf pages without needing all the internal pages. Well, it also leads to deadlock. That's called a B-link tree, yeah. And it leads to deadlock.
So what we did is we got rid of those links and we put in a pointer to the parent. When a cursor is going through one of our trees, it goes up to the parent, finds the next location below it, and then descends back down. It forces us to keep internal nodes in memory that we would not otherwise have to keep in memory, but on the other hand, we don't have nearly as many deadlocks.

The second thing I want to say about our trees in cache is that we're not using file system offsets as pointers. You'll hear hazard pointers, pointer swizzling, and so on; it's all pretty much the same idea. In a traditional engine, the pointer between two levels of the tree is a reference to either something that's in memory or something that's on disk, and to figure out which, there's a path that goes through the cache that says: oh, it's in memory, here's the pointer; or oh, it's not in memory, I'll read it in, and now here's the pointer. We don't do that. For us, everything in memory is, in fact, a standard pointer, and you just dereference it, and this gives you a lot of performance. We do it using what we call hazard pointers. The trick is I want my application worker threads to never, ever, ever block, and I don't want to offload any work onto them, because that's where I care about the latency. So what happens for a reader or writer thread coming into WiredTiger is that it does a single memory flush, essentially saying: I am going to go look at this pointer. Once it's done that memory flush, it looks at the pointer, and if the pointer is valid, it can use it. And that memory flush is not to shared memory; it's to very specific memory only used by this one thread, so we're not touching anybody's shared cache lines when we do this. The other side of this dance is an eviction server: the thread or threads that are trying to get pages out of memory because we filled up the cache. It's doing all the work. What it has to do is go mark that pointer, and then check every thread of control that might be in the system at the time and make sure they're not currently using the page. So we've offloaded a bunch of work to a server thread, which is a good thing, because I really don't care too much about server threads' latencies. All a reader or writer has to do in WiredTiger is a single memory flush per level of the B-tree; essentially, for every page it's going to look at, it has to do this process. But this is one of the ways that we get really good speed-ups inside WiredTiger.

RCU? I'm not familiar with that. Yeah, yeah. No, it's pushing out, essentially, a record saying: I'm going to look at this pointer. There's a state associated with that pointer. What the reader or writer is doing is saying, well, the state's in memory, and I'm going to push out something that says I'm going to look at this pointer; whereas what the eviction thread is doing is turning that state off, so that any future thread coming in will notice, and then it's going to survey all the threads to see if it raced with anybody. So if it raced with somebody, it backs off and doesn't evict the page? Exactly, exactly. I mean, the assumption you're making is that your LRU is working. If your eviction thread picks a hot page, that would be very bad. But ideally, you picked it for eviction because it wasn't hot.
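Here's a rough sketch of that hazard-pointer dance. All the names are hypothetical and the real WiredTiger code is considerably more careful; this is just the shape of the protocol:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-thread hazard slot: only this thread ever writes it,
 * so publishing to it never touches a shared cache line. */
typedef struct {
	void *volatile page;
} HAZARD_SLOT;

extern HAZARD_SLOT slots[];	/* one slot per thread of control */
extern int nthreads;

/* Reader/writer side: publish the pointer, do the one memory flush,
 * then re-check that eviction didn't race with us. */
void *
page_acquire(HAZARD_SLOT *my_slot, void *volatile *ref)
{
	void *page;

	do {
		if ((page = *ref) == NULL)
			return (NULL);	/* not in memory: take the slow read path */
		my_slot->page = page;
		__sync_synchronize();	/* the single flush per B-tree level */
	} while (page != *ref);		/* eviction raced us: retry */
	return (page);
}

void
page_release(HAZARD_SLOT *my_slot)
{
	my_slot->page = NULL;		/* drop the hazard pointer */
}

/* Eviction-server side: after clearing *ref, scan every thread's slot;
 * if anyone has published this page, back off and leave it alone. */
bool
page_evictable(void *page)
{
	int i;

	__sync_synchronize();
	for (i = 0; i < nthreads; ++i)
		if (slots[i].page == page)
			return (false);
	return (true);
}
```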
So there must be something built in such that once you're done using a page, you have to drop that pointer and go back and get it again? Yes: you publish this record when you go to use the page, and then you discard that record when you're done, which means the pointer is no longer protected.

Okay, the second thing. When we have a clean page on disk, remember I said we have really compact file formats: what's going on there is that the only thing we write to disk is the actual key-value pairs. Our overhead right now for a file is about a byte per key or value. If you look at traditional database engines, they tend to put indexing information on the page itself, and the reason is that when you read the page in, the indexing information is already there, which is great. Well, I don't want to do that. So what we do is put the absolute minimum amount of information that we can on the disk page, and we build an index for that page when it comes into memory. Once the page goes dirty, though, we need some other place to put updates. As soon as you start changing values or adding key-value pairs, we've got another place that we're going to put updates, and those updates are skip lists. If you've run across this data structure, it's really interesting. If you haven't, it's an ordered linked list with forward skip pointers. So if you think of a big linked list, the A's are here, the B's are here, the C's are here, the D's are here; it's ordered. If you walk that linearly, it's really expensive. If instead you have a set of pointers saying, oh, the records starting with P are here, you can skip ahead, and it's a lot faster. That's the guts of a skip list. William Pugh, 1989, I believe, is the paper. His claim is that it's simpler. It is: it's 150 lines of C to write a skip list implementation, and if you've ever written a B-tree, that's a little bigger. It's as fast as binary search, is the argument. Well, okay: it's likely binary-search performance. Binary search guarantees you a certain level of performance; skip lists mostly give you that, but if the skip list goes bad, you've got a linked list. The other thing is that binary search can be written such that cache prefetch starts to work for you toward the end of the search, and you can't do that with a skip list, because it really is a linked list, so cache prefetch is buying you nothing at all. And the claim of less space: it's more space for an existing data set. If I'm bringing something in off disk, the indexing information that I create when I bring the page in is a binary search index: it guarantees me a level of performance, it's less space, and it's easier to set up. So the index over what I read off disk is a traditional binary search, but once I start doing updates, all of our updates are in skip lists. Now, why is this bad? Well, if you think of a cursor now walking through this chunk of information, it's got to ask: where's my key-value pair? Where are the updates for that key-value pair? Do I take it from the original page? All that stuff. It's complex, but it's worth it. Now, why do we use skip lists? Because I can insert without locking. I can do forward and backward traversal without locking. The backward traversal is hard, but you can do it. The only thing I really have to lock for is removal.
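For flavor, here's the heart of a skip-list search in C. The node layout is hypothetical, but the drop-down-through-the-levels loop is the standard one from Pugh's paper, and it's why an insert can be published with one pointer store per level:

```c
#include <string.h>

#define	MAX_LEVEL	16

/* A node carries a key and an array of forward pointers. */
typedef struct NODE {
	const char *key;
	struct NODE *next[MAX_LEVEL];	/* next[i] skips ahead at level i */
} NODE;

/* Return the first node with a key >= the search key, or NULL. */
NODE *
skiplist_search(NODE *head, const char *key)
{
	NODE *node;
	int i;

	node = head;
	/* Start at the highest level; drop down a level on overshoot. */
	for (i = MAX_LEVEL - 1; i >= 0; --i)
		while (node->next[i] != NULL &&
		    strcmp(node->next[i]->key, key) < 0)
			node = node->next[i];
	return (node->next[0]);
}
```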
And if you think about a traditional page in memory for a database, removal is when I evict the page.

Yes? Lock-free access: it's great fun, it stretches your mind. It's a lot of fun for the programmer, and it's always got more bugs in it than any other kind of code. You know what they say: every lock-free algorithm is a PhD thesis. Yeah. So, really, it's kind of: don't use this unless you have a demonstrated performance need? Yes, absolutely. So where would you say the most important place for it is? The updates. There are certain workloads, for example, that are frantically updating a single key-value pair; it's a counter. And I absolutely have to be able to insert from multiple threads at the same time while readers are in that chain. That's where it really matters. We use it in other places, mostly because once you have one, it's pretty easy to go to a second one. But you said earlier there are certain places where you absolutely have to have it; you wouldn't use lock-free on every data structure in your code? Oh, no. I'm a big fan of locks. As an engineer, the first thing I want to do is put a lock on something, and only if I'm forced will I pull it apart into lock-free. Sure, I'm going to stand up here and tell you we don't do locking, and we don't: if it's a performance path, if it's a cursor in the tree, we're not going to lock. But when I'm doing lookups in the data dictionary, I'm going to lock it.

Is the skip list just one per page? No, there will be a number of skip lists per page. If you think about it, it's not unreasonable for you to insert key-value pairs in between two existing key-value pairs, so that's going to be a skip list per existing key-value pair, potentially, if you update it enough. If it's a key-value item that we're updating, it tends to be a skip list, because there's somebody with a workload out there that just pounds that one particular point. So yes, bunches of them per page.

I guess I'm confused: you talked about the B-trees. When are you using B-trees and what are they used for? The whole structure is a B-tree. Then there's a binary-search index on the page, which covers the original information that I brought in from disk, and then there are the updates to all that information, and those will be in a skip list. Per page? Not necessarily per page; per that particular item on the page.

Okay. What are the semantics for inserts with respect to cursors? Does an insert invalidate cursors that are doing a range scan or something like that? No. And specifically, we don't handle phantoms, at least at the moment. But no, it doesn't invalidate the cursor; the insert simply lands before or after the cursor's position. So it's sorted out by the versioning of the data? Exactly. Snapshot isolation? If you want that snapshot, sure, you can get it.

All right, summary: in-memory performance. It's really important to build true in-memory trees, and follow pointers to traverse them. No locking to read or write the data. We keep updates separate from the initial data; we use skip lists, and the updates are atomic. Appending to the skip list does still require an actual lock, though.
It turns out that if you have enough threads doing nothing but appending to a skip list, we can't do that atomically yet, but we're working on it. And the other thing: all the structural changes, so evictions, splits, all of that, happen in background threads. We will not task an application worker thread to do any of this unless we've already got problems. If we're out of memory and you're trying to read more data into the cache, then we will in fact task you with doing some eviction as well. But other than that, a worker thread is not touched.

Yeah. We do a lot of random workloads in testing. The reason is that we found that if we only did targeted workloads, where we're trying to test real workloads for real people, we could make those work quite quickly, but the edge cases presumably don't get found. So we have a random test that rolls the dice and does really, really stupid things. It does things like start a cursor in a snapshot kind of transaction, wait five minutes, and then go see if the data still looks correct, that kind of thing. You've never been tempted to use formal methods? Yeah, Amazon does a lot of work in that area, and I've talked to those guys a lot. You need a bigger engineering team. I think it's great stuff and it's really valuable, but, you know, you've got five guys doing nothing else but that. I don't know. No, no: you pound the hell out of it and hope. I mean, we're kind of like every other application out there.

Okay. So that's in-memory performance; that's what we do in memory to make things go fast. Let's talk briefly about concurrency. We all know what concurrency control is. We have multiple CPU cores, we have multiple I/O paths, we have all these operations happening in parallel, and the job of concurrency control is to keep the data consistent: you have to get to see the data that you should see. Two common approaches: the traditional one is locking, and then there's multi-version concurrency control. Everything in WiredTiger is MVCC. Now, I'm talking mostly about the cache here, and a little bit about transactions and snapshots, because the transaction and snapshot machinery tells you what data you're supposed to see, and the cache maintains the multiple versions of the data. So multiple versions of records are maintained in the cache. By default, readers see the most recently updated version. We do offer read-committed, and we do offer snapshot isolation, on either a per-transaction or per-handle basis. And because we're using skip lists, writers can concurrently create new versions while readers are in the key-value pair, so you can have many readers inside a key-value item at the same time you're updating it. Concurrent updates to a single record will fail: we can fail now or we can fail later, and failing now is easier. One of the updates wins, and the other one has to retry or abort or whatever. No locking, no lock manager. One of the things we were really surprised about in Berkeley DB was just how hard it was to write a fast lock manager; it had actually not occurred to us that it was going to be a problem. But yeah, right behind the log manager that wrote the write-ahead logging records, the lock manager was our number one problem. So naturally, second-system syndrome: we were not going to do a lock manager this time, no matter what it took. Okay, so it turns out the problem here is allocating the transaction ID.
If you think about all these threads, they've got to have a unique transaction ID as soon as they start a transaction, and it is really, really hard to quickly allocate a unique transaction ID. Anyway. We have looked at jumping the counter forward, and that's kind of the next thing we're going to try; right now it's linear. Right. And how do you recover the IDs after a crash? Right.

So you've seen this; I don't think there's a lot of new information here. Obviously, we're doing a skip list up there, and this is MVCC in action. We start out with an on-disk page image. We build an index for it. We decide we want to update it; at that point, we create a skip list, and each entry has a transaction ID and a value. Why doesn't the on-disk item have a transaction ID? Because we don't write a page until everything on it is globally visible. The update, we write it: here's its transaction, here's its state. It may or may not be stable yet; it may or may not be committed; we don't know. As more updates happen, that chain is going to extend. It is LIFO, because by default you want the last committed transaction. Transactions tend to be short-lived, and therefore the first item on the skip list is generally what a reader wants to see.

Okay, that's all I'm going to say about concurrency right now. I want to talk about compression and checksums a little, because it's a big feature, and it lets me talk about I/O, too. I'm going to talk mostly about the page reads and writes in the cache, and about the block manager. So the block manager, cleverly enough, is responsible for block allocation: it owns handling fragmentation, and it owns the allocation policy. If you think about a good database system, ideally you want to do best-fit allocation, because that minimizes your fragmentation. But once you start doing database compaction, table compaction, you actually want to shift to first-fit, because you want all the blocks that you're writing to be at the beginning of the file, so you can truncate the end of the file when it's no longer in use. So those are the things the block manager does. It also owns checksums. We do compression at a higher level, in the cache, but the block manager owns all the checksums. Unfortunately, the block manager also has to be involved in checkpoints. Periodically, we make sure there's some snapshot of the data for the whole database that is durable, in an operation we call a checkpoint. The block manager has to know about it because the block manager is the only piece of code that knows when something is actually on disk. If you think about the block manager, it's really three big chunky parts: finding a block to write, which is really kind of messy; handling a checkpoint, which is really kind of messy; and everything else. One of the things we did to make block managers pluggable is that they just hand back an opaque address cookie. What they hand back to the upper layers of the B-tree, and what gets stored on internal pages, is just a seven-to-ten-byte cookie that means: if you hand me this, I will hand you back some data. And that allows us to have pluggable block managers.
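To sketch why the opaque cookie makes block managers pluggable: the B-tree stores bytes it never interprets, and only the block manager knows they decode to, say, an offset, size, and checksum. The struct below is illustrative; a real cookie would pack these fields into those seven to ten bytes with variable-length integer encoding.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical address cookie. The upper layers never look inside;
 * they hand it back to the block manager to get the data.
 */
typedef struct {
	uint64_t offset;	/* where the block lives in the file */
	uint32_t size;		/* how many bytes to read */
	uint32_t checksum;	/* expected checksum of the block */
} BLOCK_ADDR;

/* Pack an address into the opaque bytes stored on internal pages. */
void
block_addr_pack(const BLOCK_ADDR *addr, uint8_t *cookie)
{
	memcpy(cookie, addr, sizeof(*addr));
}

/*
 * Unpack a cookie. A different block manager could put anything in
 * here, an object ID, a key into a remote store, anything that
 * round-trips back to a block of data.
 */
void
block_addr_unpack(const uint8_t *cookie, BLOCK_ADDR *addr)
{
	memcpy(addr, cookie, sizeof(*addr));
}
```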
So the read path and the write path are obviously pretty symmetrical: the things I do on reading, I'm going to do in reverse on writing. So I'm just going to wave my hands at the read path, which only kicks in when pages aren't already in memory, and talk about the write path. On the write path, a bunch of stuff is happening. Remember, we've got this messy, messy in-memory data structure, right? If you think about what has to happen, either because we're evicting the page or because we're doing a checkpoint, we have to go through that page and figure out what we're going to write to disk. That's a process we call reconciliation. It is probably the single trickiest piece of code we have. Because if you think about it, of those updates, half of them got aborted, three of them aren't committed, two aren't yet visible because there's an older reader in the system. All of that has to be sorted out; we've got to find a disk image that we can write, and that happens during reconciliation. That's also when we do all our compression. We kind of have two kinds of compression. There's the compression I do while building the page image, so that I don't write redundant data in the first place, and then there's block compression: I've decided what I can write, and now I use the block compressor, an LZ-style algorithm, to really squeeze it down to the minimum I have to write. So after reconciliation there's block compression, and then a checksum, and then it goes to disk.

So let's talk a little about in-memory compression: what can you do in memory to minimize the amount of data you write? I call it in-memory compression because it has a dual benefit. Block compression in WiredTiger, at least, is all or nothing: a page is either compressed on disk or uncompressed in memory; we don't decompress partially. It's a single chunk that is block-compressed or decompressed. But in memory we do prefix compression. If you think about keys, especially index keys, they often have a common prefix. There are really two ways you can do this. Most people do it traditionally with a dictionary on the page. The nice thing about that is you can say: I want to look at key 37; look it up in the dictionary; its prefix is this. Pretty fast. But the problem is you haven't done as much compression as you can, because how big is that dictionary, right? So what we do is rolling, per-block prefix compression. The nice thing about that is we get better compression, and again, if I'm talking 128-megabyte pages, all of a sudden really good compression matters. What we do is store, with each subsequent key, the number of bytes it has in common with the key in front of it, which means we get the best possible compression we can. But in order to actually look at a key in memory, you have to go back and find the start of that prefix chain: I've got to find what we call a fully instantiated key, so that I can then roll forward to figure out what this key really is. Very slow, very painful. If you think about a cursor going through that page, it's easy, because the next key shares a prefix with the previous key, which I just returned. That's great. But if I'm doing random lookups inside the page, it's incredibly expensive.
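A sketch of that rolling prefix scheme, with a hypothetical cell layout; the roll-forward loop is exactly the cost he's describing:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical on-page cell: each key stores how many leading bytes
 * it shares with the key before it, plus only the differing suffix.
 * cells[0] is fully instantiated, so its prefix_len is 0.
 */
typedef struct {
	size_t prefix_len;	/* bytes shared with the previous key */
	const char *suffix;	/* remaining bytes of this key */
	size_t suffix_len;
} CELL;

/*
 * Materialize key i by rolling forward from the start of the prefix
 * chain, laying each suffix down at its prefix offset. buf must be
 * large enough for the longest key; returns the reconstructed length.
 */
size_t
key_instantiate(const CELL *cells, int i, char *buf)
{
	size_t len = 0;
	int j;

	for (j = 0; j <= i; ++j) {
		/* The first prefix_len bytes of the previous key remain. */
		memcpy(buf + cells[j].prefix_len,
		    cells[j].suffix, cells[j].suffix_len);
		len = cells[j].prefix_len + cells[j].suffix_len;
	}
	return (len);
}
```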
Every position in the binary search, I'm going to have to roll all the way back to an instantiated key and then roll all the way forward to figure out what the current key's value is. So what we do is instantiate: as soon as we notice that you're doing random lookups on a page, we will instantiate a set of keys, so that you never have to move too far to find a key to roll forward from.

What about Huffman encoding? If you really, really, really care about your cache footprint, you can do Huffman encoding. It burns CPU. Absolutely, it burns CPU. But if you want to minimize your cache footprint, it pays off. Dictionary compression: we store a single copy of each value per page. So in that 128-megabyte page, you only write a value once; if we see the value again, we simply reference the previous one, which gives us really good compression for certain workloads. And finally, run-length encoding. The whole point of a column store is the ability to do run-length encoding on columns that share the same value. That also gives us sparse trees, if you think about it. Because if you have a column store where you put in key 37 and then you put in key 1 million, you really don't want us instantiating all the keys in the middle, so what we'll do is put a single entry there that is a run-length encoding of empty keys.

Once you've done all your in-memory compression, you do on-disk or block compression. Currently, we offer snappy and LZ4. Google's Snappy is good compression, low overhead; it's kind of the workhorse out there, everybody uses it. LZ4 is good compression, low overhead, and what's kind of cool about it is it gives you better page layout, which really pays off for SSDs. If you think about what you don't want to do with an SSD, it's write 9K bytes; you really want to write in fixed chunks. And if you look at all the compression engines, with the single exception of LZ4, you give them a block and they hand you back the compressed block, and you don't know in advance how big that compressed block is going to be. If it's 9,000 bytes, you're screwed. What LZ4 allows you to do is say: here's a bunch of data, give me back an 8K chunk. And that allows us to get really good page layout. By switching to LZ4, with certain workloads we get about 20% more file system compression. It's not that we're compressing better; it's that we're getting better page layout, and we're avoiding all the write amplification. Does it come back and say, I can't do that size? No, it will consume input up to that size. If you give it 5 bytes and ask for 8K, it says that worked pretty well; what you actually do is give it a megabyte and ask for 8K, and it comes back and says, I took this much of the data. Obviously our compression is pluggable, and it's optional, because you might have a compressing file system underneath. Pretty much everything in WiredTiger is pluggable; you can even plug in your own data sources. If you implement our cursor model, you can plug in your own data source, and we're perfectly happy to implement transactions on top of it.

Checksums. Briefly, checksums are exactly what you'd expect: we store a checksum with every page, validated during page read. It is not cryptographically secure; the assumption is that we want the speed, and if you're doing something that really needs to be cryptographically secure, you've got an encryption engine going as well somewhere.
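Putting the LZ4 trick and the checksum together, here's a sketch of the tail of the write path. `LZ4_compress_destSize()` is the actual LZ4 API that consumes as much source as fits in a fixed-size output; the checksum function is a stand-in for whatever fast, non-cryptographic checksum is plugged in:

```c
#include <stddef.h>
#include <stdint.h>
#include <lz4.h>

/* Stand-in for a fast, non-cryptographic checksum (e.g., CRC32C). */
extern uint32_t checksum(const void *data, size_t len);

#define	BLOCK_SIZE	(8 * 1024)	/* fixed output chunk for the SSD */

/*
 * Compress as much of the source as fits into one fixed-size block,
 * then checksum the block we're about to write. Returns the number
 * of source bytes consumed; *out_lenp and *cksump describe the block.
 */
int
block_write_prepare(const char *src, int src_len,
    char dst[BLOCK_SIZE], int *out_lenp, uint32_t *cksump)
{
	int consumed = src_len;

	/* "Here's a bunch of data, give me back an 8K chunk":
	 * LZ4 updates 'consumed' with the source bytes it took. */
	*out_lenp = LZ4_compress_destSize(src, dst, &consumed, BLOCK_SIZE);

	*cksump = checksum(dst, (size_t)*out_lenp);
	return (consumed);
}
```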
One thing we do that's kind of cool and actually really pays off: we store a copy of the checksum with the page's address. Remember that opaque cookie we hand back to the upper layers, the one stored on the internal pages? It includes the checksum. And the reason is that, not only do you know at the bottom that the page image you're reading is correct, you know it's not stale: it's not only a valid page, it's the page I wanted. Because the problem you see in database systems, especially in fixed-page-size database systems, is not that the page on disk is corrupted; it's that it's the wrong one. We had a race, and suddenly we lost a page, or we freed a page we shouldn't have freed. That's the problem you see. By doing this, you catch that. Really nice technique. Obviously it's optional, because, hey, you might have a checksumming or otherwise safe file system.

And this is compression in action. MongoDB with their first storage engine, MMAPv1: it's about a 2.5-gigabyte table of flight data; I think it was US Airways, but I'm not sure. You throw it into WiredTiger with no block compression and it's half the size, simply because we're not writing indexing information to the disk, we're compressing values out, key prefixes, that kind of thing. Then you turn on block compression with snappy or zlib and you can get it down to a couple of hundred megabytes. So 90% compression. All right, that's everything I was going to say about compression and checksums; let me talk a little bit about durability and the journal.

Yeah, very good question: do I have numbers on the cost of decompression? I do not, but I'm happy to talk about it. Generally, the big-ticket item is the block decompression; everything else is noise compared to block decompression. If you're talking snappy, maybe 30%. And that 30% is measured in isolation; it doesn't take into account the fact that, while I was reading that big chunk of data, I wasn't writing and evicting stuff from the cache, because my I/O subsystem was busy. So it's really hard to tease apart what's really going on there. In general, in most workloads, you usually don't go wrong by writing less stuff. Yes: users bought their machines, they just want to squeeze out as much as they can, and they're willing to spend the CPU if it uses the hardware better for that extra transaction percentage. Oh, absolutely. The question is also how many point queries you're doing, right? If your database application is all about point queries, you care about this a lot more than if your database application is all about running cursors through big tables, because then you're doing a decompression on every single one, and that's going to hurt. But if I'm just running a cursor through a big table, then I'm accessing a few million key-value pairs for a single decompression. Right, amortized over how many key-value retrievals. Yes.

So, durability and the journal. Let's talk a little bit about that. If you recall, one of the things I said early on is that WiredTiger is single-node; we're not trying to solve sharding. We're trying to run fast on a single node. Well, that means we really have two kinds of customers. We have people writing applications for a single node, where they really just want blazing performance, and they do care about durability. Okay: we do very standard write-ahead logging.
If you've seen a write-ahead logging implementation, ours is going to look very similar. One thing: we only write redo records; we don't want to write undo records. The tradeoff is that in every transaction, all of the updates have to fit in memory at the same time. Most big-memory machines can handle that, I would think. And it enables two things. Number one, we only write redo records, so we don't write as much to the log. Number two, we get much better compression when we compress those log records: we're not compressing individual log records, we're saying here are all the updates for the transaction, compress the whole thing together. We support group commit, automatic log archival and removal, all the stuff you'd normally expect. Hey, it's write-ahead logging. And at startup, we look in the metadata, we find a checkpoint, and we say, hey, great, and roll forward.

But here's our other customer. The other customer is handling durability at a higher level: MongoDB. What they're going to do after a crash is bring up that single node and use another node to roll it forward. And if you think about it, the logging subsystem is, bar none, the biggest performance problem out there, so we want to get rid of that logging subsystem. What WiredTiger offers is durability without journaling. In a traditional database engine, the problem is that after a crash you don't know where the corruption is; that's why you need the log, to tell you where the corruption might be. And if it's an overwrite system, one that overwrites the blocks it's changing, and you've got torn writes, you may not even be able to fix it even if you have a log. So WiredTiger is a no-overwrite data store. That means that when we write new stuff out to the file, we never, ever, ever overwrite a page that is currently in use by any checkpoint. With no journal, that means that if you haven't done a checkpoint in a while, you can lose all the updates since the last checkpoint, but the data will still be consistent. And in fact, this pays off on a single node as well. There are lots of cases where I don't really care about the last 15 seconds of data: this table, I just don't care about 15 seconds' worth of data. So what the heck, why pay the performance cost of a log if the data is durable back to a minute ago? So we checkpoint every N seconds by default, or the application can checkpoint. And for MongoDB, this is where replication guarantees durability: you get the performance without the log, because we're durable even if we crash without a log.

Writing a checkpoint is pretty much what you'd expect. We write the leaf nodes first, because we can do that without actually having to lock anything down. I mentioned reconciliation, the process that figures out what to write to disk: one of the key features of reconciliation is that it does no locking at all. It is just another reader of the page, so it doesn't block concurrent readers and it doesn't block concurrent writers. But as we climb up the tree and start having to write internal nodes, then we start having to lock things down, so we do that in a second pass. Then we flush the file to stable storage, write the new root address, and we're done.
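In API terms, both flavors of durability, the journal and the checkpoint-every-N-seconds approach, come down to configuration. A sketch, using configuration strings from the WiredTiger documentation (the values are illustrative, and you'd pick one of the two opens, not both):

```c
#include <stddef.h>
#include <wiredtiger.h>

void
durability_config(void)
{
	WT_CONNECTION *conn;
	WT_SESSION *session;

	/* Customer 1: single node, write-ahead logging for durability. */
	wiredtiger_open("WT_HOME", NULL, "create,log=(enabled=true)", &conn);

	/*
	 * Customer 2: no journal. Take a checkpoint every 60 seconds;
	 * a crash loses at most the updates since the last checkpoint,
	 * but the no-overwrite data files are always consistent.
	 */
	wiredtiger_open("WT_HOME", NULL,
	    "create,log=(enabled=false),checkpoint=(wait=60)", &conn);

	/* Or the application can checkpoint explicitly whenever it likes. */
	conn->open_session(conn, NULL, NULL, &session);
	session->checkpoint(session, NULL);
}
```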
And here's the picture. If we have the old root page on disk with old internal pages, there's a bunch of new stuff we have to write; that's on the right. Once we write all that, we swap the on-disk address of the root page, and then we can reclaim all of the old checkpoint's pages that are no longer in use. This means, in the worst case, if you dirty everything in the table, we're using 2x the file space. An overwrite system won't in fact use nearly as much space; we're going to use twice as much as they will. On the other hand, we can handle torn writes, which most systems can't.

One thing that we do in here that's kind of interesting: we offer named checkpoints. Any of these checkpoints you create, you can name, at which point they stay until you explicitly delete them. So you can have the midnight checkpoint, or the once-a-quarter checkpoint, or whatever. And the nice thing about it is you can read those: you can open the system and say, I want to see the last stable checkpoint, or I want to see the midnight checkpoint, and it shows you that data. You can do time travel in the database. Why would you do that? Because checkpoints are read-only, and they tend to be mapped into memory, so you minimize cache interference. One of the problems I mentioned earlier is what happens when you have an old reader in the system: it can prevent eviction of a page. The page has updates on it, and there's an old reader in the system that, because it's doing snapshot isolation, wants to see an earlier version of a committed change, and therefore I can't evict the page while that reader's hanging around. Well, if that reader were reading from a checkpoint, it wouldn't interfere: I can evict the page and the whole system runs more smoothly. Why do we make checkpoints read-only? Because it's hard not to. If you allow branching from arbitrary checkpoints, things get really, really rough; there are some ZFS papers that talk about it. It gets hard, and I try to avoid hard problems just as a policy.

Okay, so that gets me through durability and the journal. Future features: let me tell you a little about where we are and what we're doing. We are now the default storage engine in MongoDB. Up until now we've been kind of the "well, if you need the performance, use WiredTiger" engine, but now we're the default, so I'm sure we'll see lots more bug reports. Since being acquired by MongoDB, we've spent a big year of tuning. Before MongoDB, we were working with very specific applications, and those engineers knew what their applications did, and we tuned for that. Wow, applications do really, really, really silly things. And if they pay enough money, you've got to make it run fast anyway. So, you know, yeah, checkpoints are a problem; they're really hard to make fast. MongoDB has something called capped collections, which aren't allowed to grow beyond a certain size, and that's been fairly tricky to work with. Encryption: there are big enterprise customers that MongoDB wants to sell to who want encryption, so we've been working on that. It's actually there; we'll be rolling it out fairly quickly. And then more advanced transactional semantics. For example, I mentioned phantoms: we've been looking at ways to at least minimize the number of phantoms that you see. So more advanced, I won't say Oracle-style, but more advanced transactional semantics to support different workloads.
How long are the checkpoint stalls? It can get to seconds, it really can. Can it get to seconds now? No. In February, yeah. It's whack-a-mole, right? Oh, that workload does all this. But yeah, we had not really worked that much on checkpoints, and so we spent a lot of time there. I would be shocked if I saw one of more than a second now; back when we started, yeah, multiple seconds. In a very simple implementation, if you think about a checkpoint, you write all the leaf pages and then do a complete second pass, and for that second pass you just say: no more updates. That's the worst case, and that's your stall. You would stall for the entire time it takes to traverse the whole tree, and since it's a database-wide checkpoint, that's the amount of time it takes to write every dirty page in the entire cache. So then you say, well, what if I allow writes on trees the checkpoint has already walked? Okay, so now you're allowing writes on pages I've already gotten past, and you just keep tightening down what the checkpoint really, really has to freeze while still letting us write a stable point for the whole database. You know, once you get to a 100-gigabyte cache, there's a lot of dirty data. For example, one of the things we spent a lot of time on is ways to stop the whole cache from getting dirty. You basically say: we're 50% dirty; if somebody fires off a checkpoint now, things are going to go to hell; so we're going to start frantically pushing out dirty pages, because otherwise the checkpoint will just be too expensive.

Okay, LSM. LSM is something we have not been working on lately, because we've been mostly tuning the row store. But LSM is kind of cool, because it gives you really fast random-insert workloads. Use it when a dataset much larger than cache shouldn't ruin your performance, when query performance is less important, because the bottom line is that for the reads that come along after the inserts, LSM will be slower, and when the background maintenance overhead is acceptable. You've got your merging threads, your compaction threads; oh, we've got to eliminate all the tombstones in the system; you're updating all the bloom filters. There are literally 20 threads of control behind that LSM engine frantically trying to get everything working together, and that's a performance problem. Bloom filters, of course: we include bloom filters.

So here's an interesting graph. The blue is a B-tree, and we have this nice degradation as things fall out of cache, very, very predictable; it's exactly what a B-tree does when things fall out of cache. Then if you look at the pure LSM lines, the indexes and the primary: we don't start out as good, but we don't degrade nearly as badly. Essentially, if you think about an LSM, it's kind of a single B-tree in memory, so it gives you this nice steady degradation, but you never degrade too badly. If you mix and match, it's even better: if I can use a B-tree in the places a B-tree makes sense and LSM in the places LSM makes sense, I get the best of both worlds. Oh, I'm sorry. The one other thing I want to say about this one: that's kind of the holy grail.
That's what we want: for it to be seamless, where if you're doing lots of inserts, it's an LSM, and if you're doing a read-only workload, it's a B-tree. There's a lot of work left to do that, but it would be just really, really cool, and at that point we can handle workloads that nobody else can handle.

I never give out my own benchmarks, so this is Mark Callaghan at Facebook. Is he coming here? Cool. Yeah, he is so cool; I want to be him when I grow up, I tell you. Anyway, he's got really, really interesting benchmarks. At Facebook, they use MySQL a lot; outside of Oracle, they probably do more work on MySQL than anyone else. They're also big, big consumers of LSM in the form of RocksDB, and they funded a bunch of work on WiredTiger. So these are guys that are basically willing to use whatever tool they can find that works, and they really are thoughtful about the tools they use. He's got some really interesting benchmarks, especially if you want to think about LSM versus a B-tree, and especially InnoDB, because they do a lot of work on InnoDB. So yeah, there's really cool stuff there. He comes October 22nd. Yeah, catch that talk, seriously.

No, I think the issues are kind of different. I think the way WiredTiger does LSM, which is a forest of B-trees, brings into sharp relief the real point here, which is that if you're updating something small in memory, it goes fast, and everything else is kind of orthogonal to that. I don't believe you can make LSM approach a B-tree for read queries; I just don't think you can do it. And so I think this hybrid approach is really the right solution. But it's hard. You kind of have to have this moment where everything switches, and yeah, it's going to be hard to do. I'm not sure we can do it. Right now you have one or the other. Right. And I think that's it. You've got my email address; I am absolutely delighted to take questions offline. And seriously, as you start looking at this stuff, if there's a project you want to do that WiredTiger would be ready for, we want to help you. We'll support you. We're interested in the stuff you're doing. So yeah, use this technology if you can. Two questions. Yes?

Can you elaborate on how Amazon is using WiredTiger? No, I cannot; they would get my children. Sorry: how does the Amazon AWS system use us? And yeah, I cannot talk about it at all. You should see what we have to sign. In short, we are a principal component, we're going to be there for a long time, and they've gotten really, really good results. The number we are allowed to talk about is a 50% reduction in costs for their cloud infrastructure.

Thank you, nice talk. I saw that you went for a copy-on-write design in the checkpoint; did you consider that design at higher levels of the system? I'm sorry, I missed that. Did you consider using this copy-on-write design at higher levels of the system? Let me repeat the question for the room: we use a copy-on-write style of checkpoint; did we think about pulling copy-on-write up to a higher level of the system? And no, because above the place where it lives, which is essentially the cache slash checkpoints, everything is a key-value pair. There really is no aggregation above it. Grab me on the side if I'm not answering that question. Anything else?
Way back, you talked about concurrent writers: someone is going to be rolled back. Have you observed how often this happens in real user applications? Yes, good question. We can get it up to 50% by having too many threads for the cores in the system hammering on a single key-value pair. So if you think of a counter where I've got 100 threads on 24 CPUs, I can get it up to about 50% rollback. That would, of course, be insane. But application writers are not known for their sanity.

Okay, thanks everybody for coming. Next week, on Tuesday, we have Mike Stonebraker coming to give a talk here in the school of computer science. I think it's at 3 o'clock, so definitely hit that up. On Thursday next week, we have Richard Hipp, the creator of SQLite, which, again, is the most widely deployed database system in the world. You have a cell phone? It's on it. Ah, you don't want to miss that. So let's thank Keith again.