Carnegie Mellon vaccination database talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and PostgreSQL configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

We're super excited today to have Viktor Leis. Victor is a professor now at Erlangen, it's a German school, and he just moved there a few weeks ago. Victor got his PhD from TU Munich under the fabled Thomas Neumann, the most German of all the Germans. And today he's here to talk about LeanStore, because he's been building it with his students. So as always, if you have any questions for Victor as he's giving the talk, please unmute yourself, say who you are and where you're coming from, and ask your question; feel free to interrupt him at any time so he's not talking by himself for an hour. Victor, it's 10:30 at night for you in Germany, so thank you for staying up with us. The floor is yours. Go for it.

Thank you for having me. I'm very happy to be here. Actually, it's now four years ago that I also gave a talk at CMU on LeanStore. At that time it was very early in the project; we just had a very early prototype, and it didn't even have that name, it didn't have any name at that time. So I'm happy to be back and tell you what has happened during that time. I'll also tell you a couple of things that are in between, because there have now been a couple of papers on LeanStore, but I'll tell you a little bit between the papers, some stuff that's not published and that helps give you a better understanding of what we're aiming at. And maybe I'll also implicitly tell you what's ahead and where we are in the project.

So, LeanStore. I like to motivate it by showing this graph from a 2008 SIGMOD paper from Stonebraker. There they took a traditional storage engine, which used a traditional architecture with a buffer manager, a B-tree, standard latching, locking like two-phase locking, and ARIES-style logging. And they ran TPC-C on it. At that time you already had pretty large main memory capacities. So what you see there is that they ran this experiment with the working set in RAM, and then they profiled where the time goes, or in this case where the CPU instructions go. And they basically saw that when you execute this New Order transaction, you spend time all over the place. How Stonebraker in his great way summarized this: there's no single high pole in the tent, which means all of these components are very, very inefficient. Only, depending on how you count, 7% or even just 2% of the work is actually useful work. The rest is just overhead of these legacy components, which were designed for a very different world where you were waiting for disk all the time. So the conclusion was that disk-based systems are really hopeless, and that led to the emergence of in-memory database systems. And that's what I spent my scientific youth on, and Andy as well: I worked on HyPer and Andy worked on H-Store. These systems have a radically new architecture; they look very different from traditional systems like the Shore system that you see on the left. And just as a rough number:
Here you see on the left that you require about 1.7 million instructions for one New Order transaction. And with these in-memory systems, it would be less than 100,000. So it's more than one order of magnitude lower instruction count, and you also have better scalability and so on. So it's a big deal: they're much, much faster, and there's been a tremendous amount of research there. But there is one caveat, I would say, and that was really the starting point for LeanStore: all these in-memory systems, I would claim, don't really have good support for large data sets. They basically assume that the data fits into RAM. As long as it does, they work great; when it doesn't fit into RAM, it's problematic. There are some extensions, but they don't work that great. So this was really the starting point for LeanStore. In addition, what you also saw, and this has continued, is that for decades DRAM prices had been decreasing very quickly, or capacities had been increasing, as you see in this plot; it's a logarithmic scale here over the last 20 years. But what's really striking is that around 2012, a couple of years after the paper we just saw was published, DRAM actually stopped getting cheaper, or larger, conversely. And I think that's a really major change. Just look at the slope before 2012 and after 2012; it's very different. And data sets just keep on growing, so it's not like data will always fit into DRAM; you want to process ever larger data sets. What comes to the rescue is another trend, which I think is an underappreciated one. I don't know why, but I personally like to advertise the importance of flash. As you see in this graph, flash has become really, really cheap. Ten years ago there were very many papers that said, okay, you need flash as an additional cache in front of disk. But now flash has become so cheap that you basically can get rid of disk. And the nice thing is that flash is now about 20x larger than DRAM for the same dollar amount. I think that's the world we're in now: we have to use flash; it's not enough to just think about DRAM. To make this more concrete, let's look at the hardware that we're thinking of when we design and test LeanStore. We almost have this machine; we had one at the old place, and at the new place we're getting an upgrade, basically. The spec that we'll be getting is what you see here: a 64-core CPU, about 500 gigabytes of RAM, and, that's the important thing, ten of these new PCIe 4.0 SSDs. Each of these SSDs has four terabytes of capacity, so if you have ten, you have 40 terabytes of capacity. And each of them has seven gigabytes per second of read bandwidth, so in sum you get 70 gigabytes per second, which I think is absolutely amazing; it's actually getting close to main memory bandwidth. Even if you look at random I/Os, you get 15 million random I/Os per second with four-kilobyte random reads. And interestingly, this number is comparable with what AWS S3 does, since we talked about the cloud earlier with Andy.
And the entire Amazon S3 service, they recently had a press release saying that at peak loads they handle tens of millions of requests per second. So this you can do almost in one server with these super fast SSDs. And they're not just fast, they've also become really, really cheap: it's just about $200 per terabyte, about 20x cheaper than DRAM. So this is the hardware that we want to build the storage engine for. So let's talk about LeanStore. LeanStore is designed for this hardware. And what is LeanStore? Well, it's a high-performance storage engine. Right now we're focusing on OLTP. In principle, I think many of the techniques and even the implementation are not OLTP-specific and could be used for any general-purpose system, but this is just what we start with. And we say it's not a database system, it's a storage engine, because at the moment we don't have any SQL layer, we don't have query optimization, and so on. We basically just have a C++ interface, something like RocksDB, which also has a C++ interface: you link it into your application and then you can execute key-value-style operations like get key, put key, and range scans, these kinds of things, and hopefully soon transactions as well. We try to be very scalable on multi-core CPUs, and we're optimized, obviously, this is why I talked so much about flash, for these very fast NVMe flash arrays. We have, of course, an index structure, and I'll talk about that, it's a B-tree. We have logging, checkpointing, and recovery; I'll also talk about that. We don't yet have concurrency control, but we're working on it. But I think once we get these things together, it will already be useful, and once it's actually stable, it will be a useful piece of technology. So that's the scope. So let's talk about the components that make LeanStore LeanStore and that implement all these features. The first one, and this is really where the project started (this was published in 2018), is the buffer manager. Because, as I mentioned, the initial motivation was really that it's not enough to keep stuff in RAM; you also need, like old-school systems, to support data on disk or SSD. So the first decision was to say, okay, if you want to store stuff on flash, you have to have page-based storage, which means four-kilobyte pages. Actually, the original paper was talking about 16-kilobyte pages, but it turns out four kilobytes is actually better with these newer SSDs in terms of read and write amplification, so we switched to four-kilobyte pages. In memory you would want slightly larger pages, you get slightly better performance with 16-kilobyte pages, but because the out-of-memory performance is so much better with four kilobytes, we switched, or will be switching, the default to four-kilobyte pages. Smaller really doesn't work; it doesn't help you any more, because the flash SSDs only improve read and write amplification down to four kilobytes.
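To make that key-value-style interface concrete, here is a minimal C++ sketch of the kind of embedded API described above, similar in spirit to RocksDB: you link the engine into your application and call get, put, remove, and range scans directly. The class and method names are illustrative assumptions, not LeanStore's actual API.

    #include <cstdint>
    #include <functional>
    #include <optional>
    #include <string>
    #include <string_view>

    // Hypothetical embedded storage-engine interface, sketched to illustrate the
    // "link it into your application, call get/put/scan" style described in the
    // talk. This is not LeanStore's actual API.
    class StorageEngine {
    public:
        virtual ~StorageEngine() = default;

        // Point operations.
        virtual void put(std::string_view key, std::string_view value) = 0;
        virtual std::optional<std::string> get(std::string_view key) = 0;
        virtual void remove(std::string_view key) = 0;

        // Range scan: invoke the callback for every key >= start until it
        // returns false.
        virtual void scan(std::string_view start,
                          const std::function<bool(std::string_view key,
                                                   std::string_view value)>& cb) = 0;
    };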
So the first design decision is that everything is stored on these fixed-size pages. The second thing, and this is the main trick with the buffer manager, is that we use pointer swizzling, which means that when you reference a page, you have an eight-byte identifier that can be either a pointer or a page ID, and you use one bit of that reference to say which it is. In this example you see a small tree with a root reference, and the root has three children; two of them are swizzled, and one of them, P3, is a page ID, so that page might still be on SSD. The nice thing, of course, is that you can now directly follow these pointers when the page is in memory, and that is super fast, much cheaper than a traditional buffer manager. So that's one part. The second part is that at some point, of course, you run out of memory and you need to evict pages. And the point of LeanStore's replacement algorithm, which is maybe a bit unusual, is that it's really, really optimized for hot accesses. You want a hot access to have basically no overhead. We achieve that through a combination of two replacement strategies, and one way to explain it is through this state diagram. Initially all pages are on SSD; that's the state at the top, all pages are cold. Then you start loading them, and when you load them you swizzle them, which means you can now reach them directly through these pointers, and then they are hot. And of course, after a long time your buffer pool will be full, and then you have to start thinking about which pages to evict. We do this through an artificial extra state, which we call the cooling state. What we do is randomly pick a page and say, okay, you are a candidate for eviction, and when we do that, we unswizzle that page, and then it's in this cooling state. Then two things can happen. Either the page is actually a hot page, in which case it gets swizzled back in and it's hot again. Or, if it really was a good candidate, it will eventually be evicted. And what we do is try to keep a certain percentage of pages, let's say 10%, in this cooling state. In this way you can distinguish hot pages from cool pages. The key to understanding this algorithm, and why it is the way it is, is really this first property: there is no overhead for hot pages. If you access a hot page that is swizzled, you don't do anything; you don't even set any bit. So that was the original idea. Actually, I've come to suspect that we might be looking at the replacement algorithm again. This was a very early idea in the LeanStore work, and it has worked pretty okay. But my guess is that, in particular as we look more and more at out-of-memory workloads, you might want an even more sophisticated replacement strategy, because this one is really optimized for hot accesses but doesn't do that much to avoid I/Os. So this might be something we'll look at in the future. But except for that, I think this design has worked out so far.
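To make the swizzling idea concrete, here is a minimal C++ sketch of such an eight-byte reference (often called a swip): one tag bit decides whether the value is a raw pointer to an in-memory buffer frame (hot) or an on-SSD page ID (cooling or cold). The class name, bit layout, and BufferFrame type are illustrative assumptions, not LeanStore's exact implementation.

    #include <cassert>
    #include <cstdint>

    // Sketch of a "swip": an 8-byte reference that is either a raw pointer to an
    // in-memory page or an on-SSD page ID, distinguished by one tag bit.
    struct BufferFrame;  // holds a cached page plus its metadata (version, latch, ...)

    class Swip {
        static constexpr uint64_t kPageIdBit = 1ull;  // lowest bit tags "unswizzled"
        uint64_t raw = kPageIdBit;

    public:
        // Swizzled: store the pointer directly (pointers are at least 2-byte
        // aligned, so the low bit is free to use as a tag).
        void swizzle(BufferFrame* bf) { raw = reinterpret_cast<uint64_t>(bf); }
        // Unswizzled: store the page ID shifted up, with the tag bit set.
        void unswizzle(uint64_t pageId) { raw = (pageId << 1) | kPageIdBit; }

        bool isSwizzled() const { return (raw & kPageIdBit) == 0; }

        BufferFrame* framePtr() const {  // only valid when swizzled (hot)
            assert(isSwizzled());
            return reinterpret_cast<BufferFrame*>(raw);
        }
        uint64_t pageId() const {        // only valid when unswizzled (cooling/cold)
            assert(!isSwizzled());
            return raw >> 1;
        }
    };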
Okay, so that was the original work, let's say, that started the project. But there are of course other things you need if you want to build a storage engine, not just a buffer manager. You need some kind of data structure to store your data in, and in particular you need index structures. And from the beginning, although the implementation changed, so I'll talk about the current implementation, we have always been using B-trees. The B-tree that we're using at the moment is what I'd call an almost-textbook B+-tree. One important thing here is that it's not an LSM-tree. Nowadays a lot of modern storage engines are LSM-trees, and I personally don't believe that's a good idea. LSM-trees may have their use cases, but I don't think they're the default best solution for most workloads; that's just my personal bias. So that's why we have a B+-tree. We support variable-length keys and values, and we have a couple of optimizations. All of these optimizations you can actually find in the literature, so at least for those on this slide there's nothing fancy or new here. One thing we do is extract the common prefix from the page. This is what you see in the example below: if you have a B-tree page that's storing URLs as keys, it's not unlikely that a lot of them, or all of them on that page, start with this "https" prefix. So why store it repeatedly? You can extract it. That's a pretty well-known optimization that saves space and actually also speeds up the key comparisons, so that's pretty cool. The second optimization, which you can also find in the literature, concerns the slots at the beginning of the page that point to the heap-like key/value area at the bottom of the page. There we extract the first four bytes of the key into the slot itself, and that makes your binary search faster, because before, you always had to take a cache miss into the heap at the bottom of the page; now you can always compare those first four bytes, and that speeds up the comparison. Oftentimes that's enough, and only if those are equal do you have to fetch the rest of the key. Now, I've worked on a couple of very fast, super optimized and very sophisticated in-memory data structures, and this data structure, I mean, it's not totally trivial, but it's not as fast as the very fastest in-memory structures. That's not really our goal, though. You still get very robust and decent performance: you certainly get something like at least one million operations per thread per second. And as I said, it's very robust; it works for all kinds of data types and so on. What's also really important, and this is why we need this page layout, is that you can just directly evict a page to disk. There are no pointers to somewhere else; all the references are internal, they're actually offsets and not real pointers. Everything fits on fixed-size pages. That's all part of this optimization for flash and for fixed-size storage, because this is what SSDs want: fixed-size pages.
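Here is a rough C++ sketch of the two node-layout tricks just described: the per-page common prefix and the four-byte key head stored in each slot, so that binary search rarely has to touch the key heap at the bottom of the page. The field names are assumptions, and a real fixed-size page would use offsets into a 4 KB buffer rather than std::vector.

    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <string>
    #include <string_view>
    #include <vector>

    struct Slot {
        uint32_t head;    // first 4 bytes of the key (after the prefix), big-endian
        uint16_t offset;  // where the remaining key bytes live in the page heap
        uint16_t keyLen;  // length of the remaining key bytes
    };

    static uint32_t makeHead(std::string_view s) {
        uint8_t b[4] = {0, 0, 0, 0};
        if (!s.empty()) std::memcpy(b, s.data(), std::min<std::size_t>(4, s.size()));
        return (uint32_t(b[0]) << 24) | (uint32_t(b[1]) << 16) |
               (uint32_t(b[2]) << 8) | uint32_t(b[3]);
    }

    struct NodeSketch {
        std::string prefix;       // common prefix shared by all keys on this page
        std::vector<Slot> slots;  // sorted by (head, remaining key bytes)
        std::vector<char> heap;   // remaining key bytes, referenced by the slots

        // Lower-bound search: compare the 4-byte heads first and only fall back
        // to the heap when the heads are equal.
        std::size_t lowerBound(std::string_view key) const {
            std::string_view rest = key.substr(std::min(prefix.size(), key.size()));
            uint32_t head = makeHead(rest);
            std::size_t lo = 0, hi = slots.size();
            while (lo < hi) {
                std::size_t mid = (lo + hi) / 2;
                const Slot& s = slots[mid];
                bool less;
                if (s.head != head) {
                    less = s.head < head;
                } else {  // heads equal: fetch the stored key bytes, compare fully
                    std::string_view stored(&heap[s.offset], s.keyLen);
                    less = stored < rest;
                }
                if (less) lo = mid + 1; else hi = mid;
            }
            return lo;
        }
    };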
And as I said, it's not as fast as the very fast in-memory stuff, but it's still plenty fast, I would say, and probably fast enough. So that's one important building block, a pretty standard B-tree. Now let's go to another part. I mentioned that with LeanStore we want to scale very well on multi-core CPUs, and in particular you have to think about how you do synchronization. In traditional old-school disk-based systems, you just have latches all over the place, and the problem is that they basically destroy your scalability. Before, we saw the numbers for instructions, but these systems don't just have high instruction overhead, they also don't scale, because they have latches everywhere. So how do we solve this in LeanStore? Because we still have pages, and our latching granularity is also the page. The way we do this, and I'm actually really happy about this design now, because it's relatively simple and also very robust, is with a hybrid scheme. Each page has two things. It has a version, which is just an atomic 64-bit counter, and we'll see what we do with that. And each page also has a standard OS mutex. With these two things, which interact in a particular way, we can implement three page access modes. First, an optimistic page access mode, which only looks at the version: when it wants to read the page, it just reads the version, checks that the page is not latched, and if it's not latched, it reads from the page optimistically. Then, after the read, it checks that the version didn't change. That's all you do; you never acquire any latches. That's the optimistic mode. Then we also have the shared and exclusive modes, and those two just acquire the read-write lock in the standard way, either exclusively or in shared mode. So you have this combination of two different approaches: the traditional read-write lock, plus the optimistic one. And the optimistic one turns out to be really, really important; we'll see an example of how it's used.
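Here is a minimal C++ sketch of such a hybrid per-page latch: a 64-bit version counter for optimistic reads plus an ordinary reader-writer mutex for the shared and exclusive modes. How the "locked" state is encoded in the version (odd means write-locked here) is an illustrative assumption, not necessarily LeanStore's exact scheme.

    #include <atomic>
    #include <cstdint>
    #include <shared_mutex>

    struct HybridLatch {
        std::atomic<uint64_t> version{0};  // even = unlocked, odd = exclusively locked
        std::shared_mutex mutex;

        // Exclusive mode: take the write lock and bump the version so that any
        // concurrent optimistic reader fails validation and restarts.
        void lockExclusive() {
            mutex.lock();
            version.fetch_add(1, std::memory_order_release);  // now odd: "locked"
        }
        void unlockExclusive() {
            version.fetch_add(1, std::memory_order_release);  // even again
            mutex.unlock();
        }

        // Shared mode: plain read lock, no version change needed.
        void lockShared() { mutex.lock_shared(); }
        void unlockShared() { mutex.unlock_shared(); }

        // Optimistic mode: read the version, let the caller read the page without
        // any latch, then validate that nothing changed in the meantime.
        uint64_t optimisticReadBegin() const {
            // If the returned version is odd, a writer is active; a real
            // implementation would spin or fall back to shared mode.
            return version.load(std::memory_order_acquire);
        }
        bool optimisticReadValidate(uint64_t v) const {
            return (v & 1) == 0 && version.load(std::memory_order_acquire) == v;
        }
    };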
Another thing that I want to mention, which has actually been one of the great happy moments of my life, if I may say so: in the original LeanStore paper from 2018 there's a section on memory reclamation. Because if you have these optimistic reads, and we had them even in the original paper, the one problem you have is that you're never sure when you can evict a page; you can always still have a reader, and you can never be certain there are no readers anymore. So we had epoch-based reclamation, and that's usually the way to solve this memory reclamation problem, to find out when you can finally evict a page. And at some point, maybe one or two years later, Michael Haubenschild, who did a lot of the LeanStore work, asked the question: do we actually need that? We thought about it, and at some point we realized that if you're careful and do two things, it works out. First, never give memory back to the operating system, which we don't do anyway; we're a buffer manager, so once we have the memory, we keep it. And second, make sure that the version in a buffer frame keeps increasing, even if you put a different page into that frame. If you do those two things, it just works: you don't need epoch-based memory reclamation, you don't need any memory reclamation at all. That was a pretty cool thing, because it simplifies the design and it's also more robust. So that was a very happy moment.

Was this from a new student, or more like a veteran?

No, that was the original Michael, who was also on the original ICDE paper and who did the implementation. It was really just the realization, after some time, that with a very tiny change in the implementation we just didn't need it at all.

I like that. That could be an entire PhD: "you don't need to do that."

Yeah, the thing is, there's a bias in academia for complicated solutions, right? You can't write a paper just saying that all this stuff you're doing is pointless. A similar thing, in my opinion, is this optimistic lock coupling idea, which we'll talk about next. You have all these papers about very complicated synchronization protocols, but it turns out that in very many cases you don't need that at all. It's sufficient, and this is why we use this optimistic lock coupling idea, which I'll explain on the next slide. It's similar in that it's so simple that you can't really write a whole paper about it, because it's just too simple. But in my opinion it works beautifully, and if you build an actual system, I'm a very big proponent of that idea. We actually didn't invent it. It came from another student; at that time we were looking at how to synchronize the Adaptive Radix Tree. He did a master's thesis with me, and he came up with all these complicated techniques, and at the very end we realized, hey, why don't we just use these versions and interleave them in a way that detects conflicts? And then later, and that was also an amazing, happy moment of my life, unfortunately in that case it turned out the technique had actually been published before. But even that is an interesting story, because the papers that published this idea before used it as part of very complicated schemes as well; nobody could quite believe that such a simple idea is enough for synchronization. So that's, again, I would say, this bias in academia toward complicated solutions. So you've seen that anecdote. Okay, but anyway, let me explain what this idea that I advertised so much actually does. On the left, you see how normal lock coupling works. This is the traditional way you synchronize, let's say, a B-tree; you can use it for other data structures as well, but let's talk about B-trees. This shows a B-tree with four pages and four locks, and these brackets show the ordering, the interleaving, in which you latch, or lock, this thing. So that's normal lock coupling.
The problem with this is that with a normal lock you have to physically write to that cache line, and then your scalability is totally destroyed. So this is not a good idea. But remember, we have these versions, we have this optimistic locking mode. So what we can do is translate the same synchronization idea almost one-to-one into optimistic lock coupling, which looks almost the same. Every lock acquisition basically becomes a version read: check that the page isn't locked and read its version. Then you do your optimistic read of that page, and then you validate. That's the optimistic read. And as you see, you still have these overlapping brackets, and that's the coupling part: you overlap the validations. With that you get a very simple but also very effective synchronization approach that works for all kinds of data structures, so I'm a very big fan of it. The important thing to realize, since it looks almost the same on this slide, is the difference: as I said before, you don't write to shared cache lines, which means you don't invalidate any other caches. Think about the root node, for instance. If you latched the root node, even for just a very brief moment, you would invalidate all the other caches that would otherwise have that root node in cache. So that's really one of my favorite techniques. Unfortunately, I cannot claim to have invented it, but we can at least try to get people to appreciate it.
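Here is a C++ sketch of what an optimistic-lock-coupling descent through a B-tree can look like, assuming each node carries a version counter like the hybrid latch sketched earlier (even means unlocked). The node layout and the findChild helper are simplified assumptions; the point is the interleaving: read the child's version before validating the parent's, and restart on any mismatch.

    #include <atomic>
    #include <cstdint>
    #include <string_view>
    #include <vector>

    struct RestartException {};  // thrown when validation fails; the caller retries

    struct Node {
        std::atomic<uint64_t> version{0};
        bool isLeaf = false;
        std::vector<Node*> children;  // real nodes also store separator keys, etc.

        // Assumed helper: choose the child to descend into for `key`. A real
        // implementation does a binary search; here we only illustrate locking.
        Node* findChild(std::string_view /*key*/) const {
            return children.empty() ? nullptr : children.front();
        }
    };

    static uint64_t readVersionOrRestart(const Node& n) {
        uint64_t v = n.version.load(std::memory_order_acquire);
        if (v & 1) throw RestartException{};  // a writer currently holds the node
        return v;
    }

    static void validateOrRestart(const Node& n, uint64_t v) {
        if (n.version.load(std::memory_order_acquire) != v) throw RestartException{};
    }

    // Descend from the root to a leaf without writing to any shared cache line.
    // The "coupling" is the overlap: the child's version is read before the
    // parent's version is validated, mirroring classic lock coupling.
    Node* traverseToLeaf(Node* root, std::string_view key) {
        Node* node = root;
        uint64_t v = readVersionOrRestart(*node);
        while (!node->isLeaf) {
            Node* child = node->findChild(key);  // optimistic read of the node
            if (child == nullptr) throw RestartException{};
            uint64_t childV = readVersionOrRestart(*child);
            validateOrRestart(*node, v);         // node was stable while we read it
            node = child;
            v = childV;
        }
        return node;  // caller re-validates v after reading the leaf's contents
    }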
Okay. So we have this pretty standard B-tree, and we have optimistic lock coupling, which also looks almost like normal B-tree synchronization. But there are also a couple of tricks we can teach even a traditional B-tree, and we presented these this year at CIDR. I think they're two pretty cool ideas. The first observation is that in a B-tree, because we have page-wise storage, multiple hot tuples can end up on a single page even though they are unrelated. Since we latch at page granularity, you then get unnecessary contention, unnecessary write contention. I distinguish that from read contention, because read contention we don't have: with optimistic lock coupling, reads don't physically latch at all. But with writes you get this unnecessary contention. The technique we presented here, called contention split, uses probabilistic per-page counters: whenever you want to latch a page and you don't get the latch because somebody else is holding it, that's a candidate for contention. You record how often that happens on that page, and also on which tuple, which slot in that B-tree node, the contention happened. Once you have this meta information, which isn't shown on the slide, you can decide, since we're a B-tree, to split the page even though there is no other reason to split it. We just split it in order to get rid of the contention. And then, as you see in this example, these two hot tuples may end up on different pages. You can do this multiple times, so if you have one page with lots of extremely update-heavy counters, contention split will eventually put each of these counters on a separate page, and then you have reduced the contention as far as possible. Of course, if everybody goes to the same tuple, there's nothing we can do, at least with this approach. But at least you get rid of the write contention that comes from squeezing lots of tuples onto the same page. So that's the first technique. But of course, what can now happen is that these pages become underfull, because we split them all the time. And that's in general a problem with B-trees, that they can have low space utilization, because that's just how the algorithm works. It's also an advantage often advertised for LSM-trees that they can have higher space utilization, because they don't have these underfull pages that are maybe only 60% full on average. But it turns out you can also implement a trick here for B-trees, which we call X-merge. The way we integrate it is as follows, and you could do it in other ways as well. The idea is that whenever you want to evict a page, let's say the blue page there on the left-hand side, then instead of directly evicting it, because you only want to evict it to free up space in the buffer pool, you could also think about compacting a range in the B-tree. That's exactly what we're doing: X-merge looks at a couple of neighboring nodes of the blue node, our eviction candidate, and checks whether we can merge them. In this case, because we have space for three keys per node in this toy example, we can't just merge two nodes into one, as you normally would in a B-tree, because it just wouldn't work; it wouldn't help us at all. Which is why X-merge takes X nodes and merges them into X minus one nodes. In this case we merge three nodes into two, and we save one node, so we don't have to evict anything; we saved an eviction, basically. It's a pretty simple idea, and there could be different ways to implement it precisely, but I think it's also a pretty interesting twist on the pretty standard B-tree that we have here.
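As a rough illustration of the contention-split bookkeeping described above, here is a small C++ sketch of a per-page tracker that samples write accesses, counts contended latch acquisitions, remembers the contended slot, and eventually tells the caller to split the page there. The sampling rate and thresholds are made-up illustrative values, not the ones used in the paper.

    #include <cstdint>
    #include <random>

    struct ContentionTracker {
        uint32_t sampled = 0;            // write accesses that were sampled
        uint32_t contended = 0;          // sampled accesses that hit a held latch
        uint32_t lastContendedSlot = 0;  // where to split if contention is high

        static constexpr double kSampleProb = 0.01;     // sample ~1% of accesses
        static constexpr uint32_t kMinSamples = 64;
        static constexpr double kSplitThreshold = 0.3;  // >30% contended -> split

        // Called on a write access to the page. `latchWasContended` is true if
        // the exclusive latch could not be acquired immediately. Returns true if
        // the caller should split the page at `lastContendedSlot`, even though
        // the page is not full.
        bool update(bool latchWasContended, uint32_t slot, std::mt19937& rng) {
            std::bernoulli_distribution sample(kSampleProb);
            if (!sample(rng)) return false;
            sampled++;
            if (latchWasContended) {
                contended++;
                lastContendedSlot = slot;
            }
            if (sampled >= kMinSamples &&
                double(contended) / double(sampled) > kSplitThreshold) {
                sampled = contended = 0;  // reset; the caller splits the page here
                return true;
            }
            return false;
        }
    };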
Okay. The next thing I want to talk about is a big topic, and I won't go into details, but I want to mention what we have here: logging, checkpointing, and recovery. At this point we're starting to look like a traditional system: we have fixed-size pages, we have B-trees. So how about ARIES? That's the standard way of doing these things, and ARIES has many nice features: you can have arbitrarily large transactions, you have fuzzy checkpoints, so you don't have to stop the world to take a checkpoint, and you have fast index recovery. That last one is actually very important because we're an out-of-memory system. A lot of in-memory logging approaches, and there have been a couple of proposals, have in common that they rebuild the indexes after recovery: you first recover your tuples, and then you rebuild the indexes from the tuples. That's okay if you assume everything fits into RAM, because then it's pretty fast. But in an out-of-memory system your index might be larger than main memory, and if on recovery you start building a 10-terabyte index, that's not a good recovery time. So once we go in this direction of optimizing for out-of-memory, we also need many of the features that ARIES actually offers. But the problem with ARIES is that it really doesn't scale on multi-core CPUs, because it has a single global write-ahead log with a latch around it. That just won't scale at the transaction rates we want to achieve. So what do we do? This was the paper that was published last year, and it again looks like a tweaked version of a traditional technique, in this case of ARIES. The same as ARIES, we're using physiological write-ahead logging with redo and undo logging. Physiological means we are logical within the page but physical across pages: a write-ahead log entry tells you page ID 57, and then it says, insert key five. And we have undo and redo information in the write-ahead log. The thing that's actually different from ARIES is that we don't have just one write-ahead log, not just one log partition as shown in this picture, but multiple: for instance, one per worker thread. And what you see in the picture is the design we argued for in the paper: the best thing is to have each of these log partitions, or at least the tail of it, on persistent memory, on byte-addressable persistent memory, because then you can implement very low latency commits; with persistent memory you can commit in about a microsecond or so. However, there is one trick you need, which is also described in the paper: it's not enough to just have multiple write-ahead logs, because they can't commit independently; you still need the illusion of a single global log. If you have seen something on a page, but the corresponding write-ahead log entry is in a different partition, you might get unrecoverable schedules. This problem comes simply from having multiple write-ahead logs instead of a single one, but we show an optimization, which we call remote flush avoidance, that fixes this issue at low overhead. With that you get very high scalability and very low-latency commits. Unfortunately, this only works if you have persistent memory; if you have SSDs, the latency of the SSD is so high that you're back at group commit.
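To illustrate the logging design just described, here is a small C++ sketch of a physiological write-ahead log record (physical page ID, logical operation within the page, with both redo and undo information) and of a per-worker log partition that can be appended to without any global latch. The field names and structure are illustrative assumptions, not LeanStore's actual log format.

    #include <cstdint>
    #include <vector>

    enum class LogType : uint8_t { Insert, Update, Delete };

    // Physiological WAL record: physical with respect to the page ("this change
    // belongs to page 57") but logical within the page ("insert key 5").
    struct WalEntry {
        uint64_t lsn;           // log sequence number within this partition
        uint64_t transactionId;
        uint64_t pageId;        // physical part: which page the change applies to
        LogType type;           // logical part: what to redo inside the page
        std::vector<uint8_t> redo;  // e.g. the inserted key/value
        std::vector<uint8_t> undo;  // what is needed to roll the change back
    };

    // Per-worker log partition: appending requires no global latch; only the
    // commit protocol (group commit, or remote flush avoidance with PMEM) has
    // to reason about dependencies across partitions.
    struct LogPartition {
        std::vector<WalEntry> tail;  // in reality: a buffer flushed to PMEM/SSD
        uint64_t nextLsn = 0;

        void append(WalEntry entry) {
            entry.lsn = nextLsn++;
            tail.push_back(std::move(entry));
        }
    };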
So we now actually also have a group commit implementation, because most systems don't have persistent memory. But conceptually, I think this is one of the places where persistent memory is really great: for these low-latency commits, I think this would be the best design. You don't even need a large amount of persistent memory; you just need it for the tail of the log, and then, as shown in the picture, you can stage it to SSD, and that all works. So that gives us scalability. What the paper also talks about is that we manage to bound the recovery time. My claim is that you don't really need extremely low recovery times; you don't need to recover in five seconds after a crash, because provisioning a new server takes minutes anyway. What you want is bounded recovery time: you want to know that if you recover, it will take, I don't know, 15 minutes or something, but not 15 hours. And what you also don't want, and this is what a lot of legacy systems have, are these extremely invasive checkpoints: your system is very fast, maybe, and then at some point you get latency spikes, or suddenly the system becomes very, very slow because the checkpointer is running. This is also something you can solve, and we describe a very simple approach in the paper. The idea is basically that you interleave the checkpointing with the write-ahead log volume. Let's say you first limit how much write-ahead log you write, say 20 gigabytes, and you say: whenever I crash, I only want to have to replay 20 gigabytes of write-ahead log. Then you couple this write-ahead log volume with the checkpointing progress, and that gives you exactly this smooth behavior. You don't have any hiccups anymore, and you bound the recovery time. It's a pretty simple idea, I would say, but it's very effective and it gets rid of these crazy spikes that you see in a lot of systems.

That technique is pretty common. I think MemSQL does that.

Yeah, I totally believe that, because it's so simple and makes so much sense. But it's true that we haven't actually found it in any paper. So maybe it is somewhere, and yet a lot of systems don't do it, which is terrible; I don't understand why not everybody is doing that. Anyway, there's still some low-hanging fruit; sometimes people just do things in overly complicated ways, it seems. So, we have fuzzy checkpointing, and the recovery can be implemented in a multithreaded way, so we have basically most of the feature set of ARIES. And all this stuff is not free; implementing it has instruction overhead, and the paper shows you the breakdown. But I think it's not as bad as in legacy systems, because you can implement it in a more efficient way if you just do better engineering, I guess. And you get all these nice features that you need in an out-of-memory system. So this is what we've done, and I don't see many realistic alternatives if you really optimize for the out-of-memory case.
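As a rough illustration of coupling checkpointing to log volume, here is a small C++ sketch of a pacer that, for every chunk of WAL written, tells the incremental checkpointer how many dirty pages to write so that checkpointing progress keeps up with the configured WAL budget. The linear pacing rule and the names are assumptions for illustration only.

    #include <cstdint>

    struct CheckpointPacer {
        uint64_t walBudgetBytes;   // e.g. 20 GB: bound on WAL replayed at recovery
        uint64_t totalDirtyPages;  // dirty pages at the start of the current round
        uint64_t walWrittenBytes = 0;
        uint64_t pagesWritten = 0;

        // Called after a worker appended `bytes` of WAL; returns how many dirty
        // pages the (incremental, fuzzy) checkpointer should write out now, so
        // that checkpointing finishes exactly when the WAL budget is used up.
        uint64_t onWalAppend(uint64_t bytes) {
            walWrittenBytes += bytes;
            // Target progress is proportional to the consumed WAL budget.
            uint64_t target = (walWrittenBytes * totalDirtyPages) / walBudgetBytes;
            if (target <= pagesWritten) return 0;
            uint64_t toWrite = target - pagesWritten;
            pagesWritten += toWrite;
            return toWrite;
        }
    };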
Okay, so those were actually all the techniques I wanted to talk about. What's missing is some performance numbers, and I have to add a couple of caveats. We still don't have concurrency control, so for that reason this very first experiment, which is an in-memory experiment, is just for calibrating the in-memory performance, in a way. We compare LeanStore with Silo, and Silo is one of these in-memory systems that are optimized purely for in-memory performance. As I said, it's not fair, because Silo already has concurrency control and we don't, so it could be that once we implement concurrency control we end up slower than Silo. The point is not that we're faster than Silo; the point is that we're in the same league, even though we actually have all these out-of-memory features. And that's the very point of LeanStore: it shows that you don't have to be in-memory-only to be fast in memory. By the way, these transaction rates are of course totally insane. If anybody knows what TPC-C does: your DRAM capacity would be full after 30 minutes at most, or something like that. So in a way, all these super optimized in-memory systems are like Formula One cars. They're not really something you can use that much in production; they're really made for writing SIGMOD papers, because you can get, I don't know, three million transactions per second.

I mean, you're generating over a gigabyte per second in the log, too.

Exactly, right? And also there's no network front end that can push that many transactions into the system, even if each one is a single-shot transaction. So it's not really realistic, I would say. But it might become more realistic if you can support out-of-memory workloads, and this is what we see here. What we do now is fix the buffer pool to 10 gigabytes and then dial up the data size by increasing the number of warehouses in TPC-C. This is a setup with seven PCIe 3.0 SSDs; the setup I mentioned before would have more than 2.5x the bandwidth we have here, so this is already not quite state-of-the-art anymore, but it's what we had so far. And what you see is that you quickly fall below one million transactions per second, but even in a setting where your data size is more than 30 times larger than your buffer pool, you still get more than 100,000 transactions per second; that's the green curve here. And in the right plot you see what happens with the I/O: in total you do more than 10 gigabytes per second of I/O, split between reads and writes. You see that reads keep increasing. TPC-C is interesting here: as the ratio between the data size and the buffer pool becomes larger, the ratio of reads to writes also increases. The writes per transaction are actually not really decreasing; it just looks that way because the transaction rate is also decreasing. So the point of this experiment is that you can actually run extremely large OLTP installations on flash, with these very big arrays of NVMe SSDs. You could have, I don't know, a 20-terabyte OLTP workload there. And I think that's pretty cool.
This is, by the way, still preliminary work, so we're trying to make it even better. One thing I should also mention, which was not in the original 2018 paper: back then we said, oh, I/Os are still relatively expensive, and at that time we had only a single SSD and I/Os really were pretty expensive, so we said, okay, it's fine if, whenever you do an I/O, you briefly acquire a big global lock, not while you do the I/O itself, but just briefly, to manage the I/Os. It turns out that once you start adding multiple SSDs, that of course becomes a scalability bottleneck, so we had to get rid of that too, and that happened very quickly once we had more than one SSD. So you need to do a couple of things to achieve these numbers. And if you compare with WiredTiger, which is actually one of the fastest systems, as far as I know faster than RocksDB on this workload, for example, it is much, much slower. In this plot it's hard to distinguish from zero; it's not zero, but it's more than 10x slower, even at the very large data sizes here. Okay, so those were the performance numbers. Let me conclude with what we've seen. This is again the plot from the beginning, and if you followed the talk, you'll realize that we actually covered a lot of the things in that plot today. We don't need any hand-coded optimizations, because LeanStore has been implemented from scratch and everything is efficient anyway. We talked about logging, we talked about latching, we talked about the buffer manager. The only thing we didn't talk about is concurrency control and locking; as I said, that's work in progress. So what have we learned here, and what is this LeanStore thing? I think it's kind of funny, because in a way it looks very old school. I like the analogy to Back to the Future: it's both futuristic and old school at the same time. Many of the techniques we're using are conceptually really old-school database stuff, but often with a twist that optimizes them for modern hardware. And my personal goal is to make it as simple as possible. Whenever we manage to find an idea that simplifies something, that makes me really, really happy; for instance, the epoch story or optimistic lock coupling, those two things make me really happy. I think we've made lots of progress, but there's still lots of stuff to do, and we'll continue working on it. I mentioned a couple of those things throughout the talk: the concurrency control part, the network front end, which we haven't really done anything on but which obviously needs some work, the cloud story, and many other things. So stay tuned. There's a link here where you can find all the papers if you're interested. And there's also now an open source release. I should mention that it is not in a shape where you can actually use it for anything in production; this is really still a research prototype, but we're trying to make it into something useful at some point. Okay, that's it.
That's what I wanted to tell you, and I hope you have some more questions.

Okay, awesome. So thank you, Victor. I have questions too, but I'll hold them until after everyone else. So we'll open it up to the audience. Anybody have a question for Victor? Please unmute yourself and fire away.

Hey, Victor, this is Matt, one of Andy's PhD students. I noticed on one of your slides you were benchmarking some of the Zen 2 stuff, the AMD Romes. Have you encountered any interesting artifacts, from a system design standpoint, when you switch from targeting Intel systems to these new AMD systems with their massive caches and different interconnects and stuff like that?

So, not really. Let me think. One difference, and maybe that's what you were referring to, is that single-socket, non-NUMA Intel boxes still have a shared L3 cache and AMD does not, which means that something like cache-line ping-pong is a little more expensive on AMD than on Intel, for instance if you're fighting over one global lock. But that's something we're trying to prevent anyway. So the answer is no, it's been pretty painless, I would say; there's not a big difference there. And I would even go one step further. We haven't done a huge amount of work there, but Adnan, one of my students, who actually implemented this version of LeanStore with the numbers you're seeing here, was also benchmarking LeanStore on one of these fancy ARM boxes in EC2, the Graviton 2, I believe they're called. And interestingly, we didn't do anything there; we had never run on ARM before, and even that worked beautifully. You get scalability curves almost like these ones. So it seems to me that these CPU architectures are not that different in the end, even with different instruction sets, at least for the code we used here.

I was going to say, does that mean you're relying on fewer x86 intrinsics as you build up the system, and more just on C++, you know, the STL doing compare-and-swaps and stuff like that?

Yeah, exactly, the library handles it. There are no intrinsics in here at all, I believe. Let me think, do we use any intrinsics at all? I don't think we use any SIMD or anything; this is straight C++ code, basically. I think there were a couple of compilation issues when Adnan ported it to ARM, but it was really five or ten minutes of work, probably. There was no fancy stuff.

Thanks.

Hi, Victor. This is Lin.

Hi, Lin. Nice to see you again. Unfortunately it's online, but yeah, nice to see you.

So my question is this: you mentioned that you use non-volatile memory in one of the techniques, but only a small part of it, to shorten the commit latency a little bit. More broadly, what's your take on the role of non-volatile memory in LeanStore? Is it possible that at some point non-volatile memory plays a bigger role in the storage stack, or maybe even replaces SSDs, given that it is non-volatile?

Yeah, absolutely. So my take is that at the current price point, I see it as very difficult for PMEM to take off, because it's basically as expensive as DRAM.
For the same price you can always also buy DRAM; or maybe PMEM is a little bit cheaper, maybe 2x cheaper, but that's not enough. For the same price you can buy DRAM, and DRAM has better performance and lower latencies. For instance, if I go back to this plot, if you were to plot PMEM here, you wouldn't be able to distinguish it from DRAM. So for me it's really about economics. Once it gets cheaper, things are radically different, and then the LeanStore design might not make that much sense, or might need to be significantly changed. Alexander van Renen had a paper in 2018 that showed how you can add an extra PMEM layer between flash and DRAM, and I think that's something you'd need to do. But for that to make sense, the price needs to go down. Until that happens, I don't see that many use cases. For instance, in the cloud there's still no PMEM; I actually don't see many software products at all that use PMEM.

Yeah, makes sense. Thank you.

Yeah. They market it as a DRAM replacement, but yeah, nobody's shipping it as far as I know. SAP maybe, because they need the very high-end main memory capacities, as far as I know. There are other people I can't name here. Okay. Anybody else? Okay, I have a few questions. First question: what are your thoughts about where you want to go with concurrency control? MVCC is what everyone chooses in this world, and you sort of already have basic version tracking in your pages, but you don't actually maintain the versions. What are your thoughts? Do you think you're going to support multi-versioning, and how do you think you're going to do it?

Yeah. So the answer is yes. And the challenges, to me, seem to be in the details; there are many things there. If you just say multi-version concurrency control, as you know, there are many design decisions, and that's basically exactly what we're doing: we're looking at these different options. For instance, garbage collection is a big, big thing there, but also the physical storage: what does it mean to have multiple versions of a tuple? Hekaton and Postgres copy the entire tuple; that's probably not what we want to use. Then there's HyPer's idea of delta chains. All these things are subtle, and the implementation matters. It's interesting: there are so many papers on MVCC, but if you actually want to build it, there are still many decisions you have to make, and it's still not obvious how you want to build it. That's our state at the moment; we're trying to build something that actually works and is robust.

Right, that's why we did that one paper where we looked at the four design decisions, because the indexes are the other part of that: what do you actually store in the indexes to track the different versions?

Yeah, exactly. Do you store the versions in the indexes, or do you store them in some other place? Exactly; you could enumerate even more dimensions.
And for all of them you can make plausible arguments for and against. For instance, LeanStore as conceived is a steal system, which means you can evict dirty pages, uncommitted pages, to SSD, and you can have arbitrarily large transactions. What implications does that have for commit? All these questions are what we're thinking about.

Okay, and then my last question, and this is more of a philosophical one. For better or worse, you were the trie guy, right? You did the radix tree; you were the one who revived it from the ashes of computer science, in some way, at least for databases. So why is LeanStore built on B-trees? Why didn't you build it off a trie?

That's also an interesting question. It turns out that, because I'm the trie guy, if you want to call me that, I worked for a long time, and I still have it on my disk here, on a data structure called BART, which is basically a hybrid between a B-tree and an ART. The thing is, if you want a buffer-managed trie, it needs to look a little bit like a B-tree: you need fixed-size pages, and you encode the trie into that. The reason we're not using that in LeanStore is that it's pretty complicated, and LeanStore is called LeanStore, not ComplicatedStore. I've basically come to the conclusion that it would be a little faster, maybe even 2x faster, but it's just not worth it in the grand scheme of things. So that's basically my answer. It's not always about absolute peak performance, because you can always spend that effort on other optimizations instead; you have to be economical.

Yes. I mean, it's a very German way of thinking about building the system. I like it.

Right, and that's why we end up with a VW. So with that picture, thank you so much for spending your time with us at night. We appreciate you staying up late for this.