Carnegie Mellon Vaccination Database Tech Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and PostgreSQL configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

All right, guys, let's get started with another Vaccination Database Tech Talk. We're super excited today to have Jia Shi. She's a vice president on the Exadata team at Oracle. Jia's background is interesting, but it's only two things, right? She's been at Stanford and she's been at Oracle: she has a bachelor's degree and a master's degree from Stanford, and she's been at Oracle for 18 years, since 2002. So that's awesome. As always, if you have any questions for Jia as she gives the talk, please unmute yourself, say who you are and where you're coming from, and ask your question. Feel free to interrupt at any time; we want this to be interactive, so that Jia's not talking to the empty space that is Zoom for an hour. With that, Jia, thank you so much for being here, and the floor is yours.

Thank you. Thank you so much, Andy. I'm super excited to be here today to be part of this database vaccination seminar. What I bring to you is, I hope, a journey: let me take you on a ride under the hood of an Exadata transaction. The highlight of the day is how we harnessed the power of persistent memory. With that said, let's go ahead and get started on the journey.

All right. Some of you may have heard of Exadata, and some of you may not have, so I thought I'd start off by setting the stage and introducing what an Exadata is. Let's meet one. Andy said I've been working at Oracle for 18 years; I've actually been working on Exadata for 10 of them. So it's a product that's over a decade old, but this is the latest and greatest in that family: we call it the X8M. As you can see on the screen, at the top we have the X8M-2, which is the database server; I have a database icon here. It runs the latest Intel Xeon processors, and it comes in two models. One is a two-socket model, and the other is an eight-socket, which is more like a small SMP box packed with a lot of sockets and a lot of cores. The two-socket is the most popular one, because you can have a bunch of them and create a cluster. The bottom of the slide shows a storage server. A storage server is where the data actually resides, and we have two types of storage server. The first kind we call high capacity, and as the name indicates, it has a lot of storage: 168 terabytes of hard disk per storage server, paired with 25 terabytes of flash, which we use to build a flash cache on top of the hard disks for IO acceleration. The other option is extreme flash, which is all-flash storage; it's a little shy on the capacity side because of the high cost of flash. So this is very straightforward, right? Standard two-tier computing: you have your database, you have your storage. And how do I connect the two? We have 100 gigabit Ethernet connecting the database and the storage.
And a quick word about the need for networking here: our database scales out because we have a shared-disk architecture for Oracle RAC clusters. As you need more database processing power, you just add more database servers. That's pretty simple. On the storage side it's the same scaling story, because we take all of our database data, chop it into smaller allocation units — we call them extents — and then we stripe and mirror everything everywhere: we place the extents across all the disks in the system, and we also create mirror redundancy for HA. That lets our storage scale basically linearly as we add storage servers and capacity grows. On top of that, because we have a layer of storage-side caching, our flash cache also scales linearly, both in capacity and in performance. So having a very fast network connecting the two tiers is extremely important, and we're running on 100 gigabit Ethernet. There's a little catch to it called RDMA; I'm going to talk about that a little later, but this is a quick intro to the concept. And last but not least, I want to introduce the famous actor of this show, which is persistent memory. We have 1.5 terabytes of persistent memory in each one of our storage servers, and I'm going to take you on a journey to see how we actually make use of it.

All right, with that said, I thought it would be useful for the students to have a ballpark on what this costs. Okay. I am actually not a product manager, so I may not give the right price, but I've been told that the list price for a full rack — which is what you see here on the slides, or in my background as well; it's a standard 42U rack, and you can pack eight database servers and 14 storage servers in there, with two redundant network switches — is a million dollars. I think it's a nice round number. Ashish, is that right?

Yeah. What you're talking about, Jia, is the on-premises version, and it can scale from a really small unit to what Jia was referring to. But in the cloud version, you can consume it literally by the minute, by the hour. The cost becomes substantially different in a consumption model.

Right. All right. Thank you, Ashish. So, Andy and everyone, as you see, there are these nice graphics that show what Ashish said: on-prem, you can grow from an eighth rack to a quarter rack, elastically adding database servers and storage servers as you see fit, and it grows all the way to multiple racks, and similarly for the eight-socket. And in the cloud, like Ashish talked about, we just charge you by OCPU and storage, so it's a lot more elastic that way; you can start out really small.

All right. Having seen what an Exadata looks like, I want to quickly talk about the value we provide to customers. We really think it is the best platform to run Oracle Database: we have the best database hardware, plus the special sauce that we have built on top. I'm a software developer, so the software is really near and dear to my heart.
We build a lot of really smart, interesting software to harness the power of the hardware — we want everything to be hardware bound — and I'm going to talk a lot about that for OLTP applications today. We also have a management component that's fully automated. And just to give you folks some perspective on who actually uses Exadata: 86% of the Fortune Global 100 run Exadata. So it's not some toy platform or very bleeding-edge technology; it's very mature, very stable, very important for our customers. You may ask: who are the 14% of the Fortune Global 100 that don't run Exadata? Most of them are our direct competitors — Microsoft, IBM, HP — so it's quite understandable that they don't choose Exadata. One more word about Exadata: it is a converged database platform. By that we mean it runs both data warehousing and OLTP. For the majority of this presentation I'm going to focus on OLTP, but I want to make sure you have the right notion that it can do both.

All right. So now I'm really excited to get started with the show: let's go under the hood of an OLTP transaction. Everybody in this seminar knows what OLTP is, right — online transaction processing. So a very naive question, one a person who hasn't taken Andy's introduction to database internals class or any database course might ask, is: what does an OLTP application do? I give a very simple example here. For lack of a better word, I just call it a super-critical OLTP application. The classic textbook example is a banking application. Let's say Ben wants to deposit $1,000 to his bank account, another user Alice wants to withdraw $500, and Bob wants to transfer money. When Ben wants to perform his transaction, it really translates into: go get my bank account, I'm going to deposit $1,000 in there. So the OLTP application sends an update SQL statement to the database: look, database, please process and commit this transaction for me.

When the database gets a SQL statement, the first step is to parse it — it tries to figure out how to even run the SQL. There are usually two cases. One is that somebody else has run a similar SQL statement before, so I've already built a cursor, which has a full execution plan in it. I don't need to repeat the work, and that cursor happens to be in my library cache, so I can just use it and execute the SQL. When that happens, it happens at CPU and main-memory speed, right? Super fast. Otherwise, you have to compile the statement and figure out how to execute it, and that requires metadata on the schema, tables and whatnot, which is usually very fast too, because those are very frequently accessed blocks and they'll be readily available in the buffer cache of your database instance. So you compile a cursor and say, okay, off we go. The cursor's in memory, so let's move on to the next step. What happens next? How do you deposit $1,000 to Ben's account?
You typically use Ben's user ID, or maybe his account ID, whatever you may have, to traverse a B-tree index to find his record: where is his balance sitting, in which data block? That B-tree index traversal is usually very fast, because those are frequently accessed blocks; they're readily available in the buffer cache in the main memory of the database. So no big deal — I walk down the index tree and I find the data. Now I've finally gotten to my destination: I've identified the row where Ben's account resides. I need to go get that block and update the balance. So where is that block?

Now this is an interesting question, because not everybody is accessing their data all the time, right? If somebody else happened to access that block shortly before, you may find it in the buffer cache. But oftentimes you'll find you have a miss in your buffer cache, which means it's not in memory. It gets interesting with a clustered database, because you may have multiple database instances in the cluster: the block may not be in your local main memory, but it could be in somebody else's buffer cache. So we have this mechanism called Cache Fusion. It goes and probes my neighbors' caches: do you have it? If you have it, I'll get it from you instead of going to storage, because that's much faster. But in this case, let's say the block isn't in any of the neighbors' caches either. Then I have no option: I have to go to storage.

For this operation, I really wanted to create the effect of: look, I have just fallen off an IO cliff. What I mean is that for the previous three steps, all the data I needed to do the parsing, traverse the B-tree, and find the block was in main memory, so I'm running at memory, perhaps CPU cache, speed. Super fast execution. But as soon as I have a miss, I've got to go to storage. And as we learned from the prior slide, my storage is sitting on the other end of the network, over on the storage server. At that point my execution comes to a grinding halt, because I have to wait for the IO to come back. That's what we call the IO cliff.

So a quick word about this random data read IO cliff. As we said earlier, as soon as I identify which row has Ben's account, I need to fetch that block to perform the update, right? So naturally we say, okay, you've got to go to storage; let's go do the job and figure out how I get that block from the storage server. Naturally I say: look, storage server, this is the block I need, please give it back to me. You may remember we have tiered data storage on the storage server side: 168 terabytes of hard disks, plus a flash cache. Let's say in this case I got really lucky and Ben's data is actually in the flash cache — not just on disk, because if it's on disk, the latency is really high. Let's say it was previously accessed and it's in my cell-side flash cache. Okay, that sounds good: I probe my flash cache, I find the data, I've got the 8K block, and I send the block back to the user. Okay.
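To make the fallback order just described concrete, here is a minimal sketch in C — not Oracle's code; every helper name below is a hypothetical placeholder for the real machinery:

```c
/* A minimal sketch (not Oracle's code) of the read-path fallback chain:
 * local buffer cache, then a Cache Fusion probe of the other instances,
 * then the storage server. All helper names are hypothetical. */
#include <stddef.h>
#include <stdint.h>

typedef struct block { unsigned char data[8192]; } block_t;

/* hypothetical helpers standing in for the real machinery */
extern block_t *local_buffer_cache_get(uint64_t dba);
extern block_t *cache_fusion_probe_peers(uint64_t dba);
extern block_t *storage_server_read(uint64_t dba);
extern void     local_buffer_cache_insert(uint64_t dba, block_t *b);

block_t *read_block(uint64_t dba /* data block address */) {
    block_t *b;

    if ((b = local_buffer_cache_get(dba)) != NULL)
        return b;                      /* memory speed, ~100s of ns     */

    if ((b = cache_fusion_probe_peers(dba)) != NULL)
        return b;                      /* a network hop, but no disk IO */

    b = storage_server_read(dba);      /* the "IO cliff": ~200 us on a  */
    local_buffer_cache_insert(dba, b); /* pre-PMEM Exadata              */
    return b;
}
```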
And once the database server gets the block into its own buffer cache, it can continue to process the SQL, right? We're back on the happy track again. So now I have a first question for the audience, because Andy said this is an extremely interactive session: how long do you think the super-critical OLTP application had to wait for this 8K block read? Any guesses? Please feel free to unmute and talk.

It's hitting the flash cache, and it's going over RDMA?

No RDMA here, right? This is the messaging model: the database sends a message over to the server; the block is in the flash cache, so I have to issue a local read from my flash device and then send the block back.

200 milliseconds?

Okay, all right, 200 milliseconds — and the reaction is, oh, that's too high, right? I agree, and we can break it down. As you know, there are so many different kinds of storage out there, and the mileage really varies. I don't want to pick on a competitor — they may well have 100 or 200 milliseconds. I'm just going to pick an Exadata from before persistent memory: hard disks plus a flash cache in the storage server. Let's see how long it takes. For the message to get from the database to the storage, you can imagine: I make a call and send a message, so there's a user-to-kernel context switch, the message goes out on the wire, arrives at the storage server, a hardware completion comes in, an interrupt wakes up a storage server thread, and that thread does the local read. From the flash point of view, depending on the flash protocol — nowadays everybody uses NVMe flash — we're pretty confident an 8K read on a relatively new flash generation gives you less than 100 microseconds of latency. That's the local latency, local to the server. After that, you have to send the message back to the database and ship the block over. So, to complete the entire cycle — I'm going to reveal the answer — it's about 200 microseconds end to end on an Exadata. And mind you, that's already a thousand times faster than the 100 or 200 milliseconds number we heard, right? So this is already super-fast, low-latency IO. By simple math, you can tell that the whole overhead — the user-to-kernel context switches, all the interrupt processing — takes about 100 microseconds end to end. So the end-to-end latency is about 200 microseconds, roughly a 50/50 split between the raw flash latency and everything else involved in completing the network IO. Not too slow, right? But compare it with CPU cache or main memory access, which is maybe tens to hundreds of nanoseconds: this is orders of magnitude longer. So here comes our first challenge: how do we conquer this random read IO cliff?
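To recap the arithmetic in that answer (rough, illustrative numbers only):

    end-to-end 8K read over the wire  ≈ 200 us
    raw NVMe flash read               ≈ 100 us
    everything else                   ≈ 100 us  (user/kernel context switches,
                                                 wire time, interrupt handling,
                                                 thread wakeup, reply)

So roughly a 50/50 split between the flash device itself and the messaging machinery around it.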
Now, another question. I put up this jigsaw puzzle; there are just two guys in there — I call them the dynamic duo. Does anybody recognize these icons? I briefly showed them on the very first slide, when we met an Exadata. Squint your eyes and yell out: what do you think the guy on the left is? What does that look like?

Persistent memory on the left, and on the right is your network card, probably.

Wow, 100 points. Very good. The left one is the persistent memory, and it looks like a DIMM, right? That's how we use it: we pop it into a DIMM slot on a CPU socket. So, a quick word about persistent memory. Persistent memory is brand new — it was released in 2019 by Intel — so it's a relatively new silicon technology, with very distinct capacity, performance, and price behaviors compared to DRAM and flash. The persistent memory we use is the Intel Optane Data Center Persistent Memory Module — quite a mouthful, but that's what we use — and it sits on the CPU socket: when you populate the persistent memory DIMMs, you take your CPU and insert the persistent memory DIMM into a socket, just like a DRAM DIMM; it's the DIMM form factor. As far as reads are concerned, it's really fast: maybe two to three times slower than DRAM, but a heck of a lot faster than flash — much faster than the 100 microseconds we just looked at. And the interesting thing about persistent memory is that writes, the stores to those DIMMs, survive a power failure, unlike DRAM. That makes it extremely attractive for a database application, because having persistence and durability, as we know, is critical to a lot of database operations.

However, the catch with persistent memory is that you may think: okay, I'll just pop in this DIMM and all my stores to it are persistent. But underneath the covers, it requires quite sophisticated handling to make sure the data on the PMEM actually persists across a power failure. I'm going to touch on those exact points later in the presentation; just a quick heads-up: the CPU cache on the server is not persistent. So if you think you just did a store, your update is done, and it's persistent — you're delusional, because that update is still sitting in the CPU cache. Imagine you lose power right then and there: your update is gone; you're not going to see it when the server powers back up. The trick is that you have to be careful about flushing the data out of the CPU cache, or bypassing the CPU cache, when you want to make data persistent. The second thing is that we're so used to disk and flash, right? They have sector atomicity — 512-byte, 4K, whatnot. But persistent memory has very different write-atomicity characteristics, and that has a profound impact on how we can use it for database applications. I'll talk about that later, but these are the two things to keep in mind.
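Since the talk keeps coming back to this flush-or-bypass rule, here is a minimal sketch of the flush side of it — not Oracle's code, just the general x86 technique, assuming a CPU with the CLWB instruction and a PMEM location already mapped at `p`:

```c
/* Minimal sketch of making a CPU store durable on PMEM: the store only
 * reaches the (volatile) CPU cache, so it must be explicitly written
 * back and fenced before it can be considered persistent. Assumes a
 * CPU with CLWB support (compile with -mclwb on gcc/clang). */
#include <immintrin.h>
#include <stdint.h>

void persist_u64(uint64_t *p, uint64_t v) {
    *p = v;          /* store: lands only in the volatile CPU cache     */
    _mm_clwb(p);     /* write the cache line back toward the PMEM DIMM  */
    _mm_sfence();    /* order: don't proceed until the flush is done    */
    /* only now is *p guaranteed to survive a power failure, because it
     * has entered the ADR safe domain discussed later in the talk */
}
```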
All right. I think the prior gentleman already answered the other half of the puzzle perfectly, so I'm not going to re-pose the question: the guy on the right is RDMA, the network card. A quick word about RDMA. RDMA stands for remote direct memory access. It's actually not a new notion; it's been around in the networking world for a while, but it started with InfiniBand, so it was more of a niche technology, less well known. Nowadays you'll hear about RDMA a lot more; it's going mainstream. Let me quickly break down what RDMA is. If you look at a server — say the database server on the left — a memory region is basically a piece of memory sitting in your main memory, your DDR memory, whatever memory you may have. From the server's point of view, there are really just two ways of accessing that memory. One is that a CPU core can do loads and stores to it. The second way — which you don't usually think about, but which is there enabling your day-to-day networking — is that a network NIC can access that memory through the PCIe bus. What RDMA does is exploit that second path: you can authorize your RDMA-enabled NIC, saying, look, you have access; you can pin this piece of memory and register it with the card — imagine it's tethered to the card. What that enables is that a remote peer who connects to you can perform a memory access, be it a read or a write, directly against your local memory, bypassing all the software handling that usually happens in a messaging protocol. That's what happens with RDMA. To create that RDMA capability, the two NICs have to support RDMA to begin with — they're called RDMA-enabled NICs. RDMA also requires a lossless L2 layer, and doing that on top of Ethernet is an additional challenge: you also need a switch that can help you with RoCE, which I'll talk about shortly. Quickly putting things together: if you have two endpoints — say a process running on the database server on the left and a process on the storage server on the right — you form an RDMA connection between them, you do the handshake, and they can authorize each other to access their memory regions through control-plane messaging. You can pre-authorize access to a memory region, and later on the server can update its own memory region, and the database server — the client in this case — can directly read that update without involving any of the software stack running on the server. That is a very crucial piece of information to keep in mind, because that is exactly what enables us to harness the power of persistent memory.
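For readers who want to see what those two building blocks look like in practice, here is a minimal libibverbs sketch (error handling elided; `pd` and `qp` are assumed to be an already-created protection domain and connected queue pair, and the remote address and rkey are assumed to have been exchanged earlier over a control channel). This is the standard verbs API, not Oracle's internal interface:

```c
/* Minimal verbs sketch: register local memory with the NIC, then issue
 * a one-sided RDMA READ against a pre-authorized remote region. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

struct ibv_mr *register_region(struct ibv_pd *pd, void *buf, size_t len) {
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |     /* let peers read us  */
                      IBV_ACCESS_REMOTE_WRITE);    /* ...and write to us */
}

void post_rdma_read(struct ibv_qp *qp, struct ibv_mr *local_mr,
                    uint64_t remote_addr, uint32_t rkey, size_t len) {
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_mr->addr,
        .length = (uint32_t)len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;  /* self-service read:   */
    wr.sg_list             = &sge;              /* no server CPU runs   */
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED; /* completion on our CQ */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    ibv_post_send(qp, &wr, &bad);
}
```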
With that said, let's move on to what's special about RDMA over Converged Ethernet. Like I said earlier, RDMA is not new: it was invented decades ago, and InfiniBand has always had it. In fact, when Exadata started out more than 10 years ago, we were already using RDMA, but on InfiniBand. The reason: it gives you really low CPU utilization, very high throughput, and very low network latency. With those three beautiful attributes, we're big fans of RDMA. But back then, InfiniBand was the only player in town, because to get the throughput you want, Ethernet was nowhere close. Over the last 10 years, though, that gap has completely closed: Ethernet today has the same kind of bandwidth as InfiniBand. That's why we switched over to the latest and greatest 100 gigabit Ethernet, even for RDMA. Think about it: you pull an Exadata into a data center, be it on-prem or in the cloud. What are the other machines there running? What kind of networking? Are they running InfiniBand? Most likely not, right? They're all going to be running Ethernet. So having RDMA over Ethernet is extremely attractive, because it lets you integrate your RDMA network into the bigger data center network. That's why we cut over to using RoCE, which is short for RDMA over Converged Ethernet.

But you can't use a standard Ethernet switch for RoCE, right?

Yeah, that's absolutely true. A standard Ethernet switch lacks the RoCE features: if congestion happens in your network, packets get dropped and whatnot. You need a RoCE-capable switch that supports things like ECN, explicit congestion notification, to avoid the packet drops and enable a working RDMA stack. So you're absolutely right: not only does the NIC have to be RDMA-enabled, the switch has to be as well.

All right. Meanwhile, back at the ranch: what happened to Ben's transaction? We had a nice segue for the dynamic duo, so let me bring those two threads together. The theme of the story is that we're looking to persistent memory and RDMA to help us conquer the IO cliff. Let's bring back this picture again; we've seen it before. It shows that I have a flash cache on the storage server side, and I can do a local read. One very natural reaction is: okay, Jia, you tell me PMEM is very fast, it's on the memory bus, you get memory latency, it's persistent, you don't lose data across a power failure — so why don't I treat it like I treat flash, and just build a PMEM cache out of it? Which is very reasonable. So let's give it a try and see what happens. Say I take all my flash cache code — it's all caching code anyway — drop it onto PMEM, and build tiered caching: PMEM on top of my flash cache. Sure, I have a PMEM cache working. And when it comes to latency, it's very close to memory: if I read an 8K block from PMEM local to the same CPU socket, it's about one microsecond; if you have to go to the second socket — remember, it's a two-socket system — you go through a UPI link, which adds a little latency, so it's about two to three microseconds. Both are very reasonable, right? So you're thinking: I have low single-digit-microsecond latency, plus the 100 microseconds for all my networking, sending the messages, and copying the block back.

So I'm going to ask you: what happens to the IO cliff? Well, you'd think: I got rid of that 100-microsecond flash latency and replaced it with a couple of microseconds from the PMEM cache. Great — I cut my latency in half. My IO cliff used to be this high; now it's half as high, a 50% reduction in latency. But somehow it just doesn't feel that gratifying, right? You'd say: look, I get to read 8K from persistent memory within a couple of microseconds, but Jia, you're telling me that from the database point of view, end to end, I'm still seeing 100 microseconds of latency. Can we do better? And the answer is: yes, we think we can do better.
Especially with the help of RDMA, because here's a fairly radical take, in our opinion: the traditional messaging approach is that you wake up the other guy on the other end — look, I need this block, you know where it is, go fetch it and send it back to me. All that talking, shall we say, is expensive when it comes to IO latency. So the way we address this is to get rid of the chit-chat. Don't talk to the other guy; it's just too slow. Self-service is the way to go: it gives you the result with the lowest latency. So how do we self-serve? How do we say, I want that 8K block, let me go get it myself? We do RDMA over 100 gigabit RoCE, as we talked about before. Before we look at the internals, I just want to share a quick result: we were able to get not just the 2x latency reduction we just discussed, but a 10x latency reduction. The end-to-end latency from the database point of view drops to under 19 microseconds. We put that number up because that's what's in our documentation, our data sheet, and it's achieved with multiple millions of IOPS happening in the system. If you're running at lower IOPS — some other flash vendors quote 10K, 60K, 250K IOPS, whatever they may have — at those lower rates I've actually seen much lower latencies, like 13, 14, 15, 16 microseconds. So it really is quite impressive.

So let me go under the hood and show how we get to this 19-microsecond latency. Remember, we had a miss in the buffer cache; that's why we had to go to the storage server. The first step is that we want to perform that read through RDMA. The biggest challenge is: let's assume you have some mechanism for populating your data into the PMEM cache — how do I get to it? Sitting on the database server side, I have no clue where on the PMEM my data — that 8K block for Ben — resides, because I'm not managing that data myself at all. The trick is that we break the read into two steps. The first step is an RDMA — again, self-service, no waking up the server process: you go straight to the server's memory and perform an RDMA read of an RDMA-able hash table. That probe tells you whether you're going to have a hit or a miss. Let me talk about the happy case first, the hit case. If you do have a hit — meaning Ben's account balance, that record, that block actually is in the PMEM cache — then the probe result you just read also tells you exactly where on the PMEM your 8K block is: its virtual address, and the memory key you need to access it via RDMA. Then you just go read that 8K block from that location, and now you have your block in your buffer cache. So this is extremely powerful.
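Here is a hedged sketch of that two-step read. The hash-bucket layout and the `rdma_read` helpers below are invented for illustration — Oracle's actual on-PMEM format isn't public — but the shape of the protocol follows the description above, with the helpers built on one-sided reads like the earlier verbs sketch:

```c
/* Hypothetical sketch of the two-step RDMA read: probe an RDMA-able
 * hash table to locate the block, then RDMA-read the block itself.
 * On a hit, no storage-server software runs at all. */
#include <stdbool.h>
#include <stdint.h>

struct pmem_hash_entry {          /* what the bucket probe returns      */
    uint64_t dba;                 /* data block address (the cache tag) */
    uint64_t pmem_vaddr;          /* where the 8K block lives on PMEM   */
    uint32_t rkey;                /* RDMA key authorizing access to it  */
    uint32_t valid;
};

/* hypothetical helpers wrapping one-sided RDMA reads */
extern uint64_t bucket_addr_for(uint64_t dba);
extern void rdma_read(uint64_t raddr, void *dst, uint64_t len);
extern void rdma_read_with_rkey(uint64_t raddr, uint32_t rkey,
                                void *dst, uint64_t len);

bool pmem_cache_read(uint64_t dba, void *dst /* 8K buffer */) {
    struct pmem_hash_entry e;

    /* step 1: RDMA-read the hash bucket for this block */
    rdma_read(bucket_addr_for(dba), &e, sizeof(e));
    if (!e.valid || e.dba != dba)
        return false;        /* miss: fall back to a regular message    */

    /* step 2: RDMA-read the 8K block from where the probe says it is */
    rdma_read_with_rkey(e.pmem_vaddr, e.rkey, dst, 8192);
    return true;             /* hit: two RDMAs, NIC to NIC, ~19 us      */
}
```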
But you may also say: okay, that's the happy case; if I have a hit, I perform two RDMAs and get my 8K block back. But you said your PMEM cache is 1.5 terabytes and your flash is 25 terabytes — not everything fits in the PMEM cache, right? And I totally agree with you. So the way we handle that is: if you have a miss, no worries, it's okay. You go back to the conventional way of sending a message to the storage server. The storage server does the usual lookup, tracks down the block — perhaps it's in the flash cache — and ships it back to you. So for that very first read, you may end up with the longer latency, like 200 microseconds, but that's okay. Because of what happens next on the storage server: since you came to me through messaging, I realized you were interested in this block but it wasn't in the persistent memory cache. So not only do I give you back the block, I also post-populate it into the persistent memory cache. That's how the persistent memory cache gets populated, so that next time you have the same need — say the buffer ages out of your buffer cache and you need to read it again — you'll find it readily available in the persistent memory cache and can fetch it with RDMA. That's how we handle both misses and hits. And the key thing is that when you do have a hit, there's no software involved: it's really NIC-to-NIC communication, and you get your 8K block back. That's how we significantly bring down the IO latency for 8K random reads.

Now let's look back at this picture. Our goal for OLTP: we like how it runs on the CPU and in memory, and then it falls off the IO cliff. What I do here is what I call a trampoline — jump, jump, jump on the trampoline — and in maybe under 19 microseconds we get our block back from the storage server. A very natural question people ask is: look, I launched the RDMA — say I have a hit. What does the database process do at that point? Do I get off the CPU and wait for an interrupt to wake me up so I can read the block? Or do I just busy-spin on the CPU, because my RDMA completes so fast? I'd say the answer varies by application. In the case of an Oracle database process, our process unfortunately has a pretty big CPU cache footprint, which means that if you get off the CPU and get rescheduled back, you incur a huge context-switch cost. What we learned when we ran the tests — pleasantly — is that if you just busy-spin on the CPU waiting for the RDMA completion, you get lower latency and, surprisingly, even less CPU usage. If you do it the other way around — yield and get woken up later through interrupt processing — not only is the latency much longer, you actually end up spending more CPU. It's counterintuitive, but it's an interesting finding. So what we do is simply spin on the CPU, get the block back into our buffer cache, and then we can update the balance and complete the change. We're very happy with that.
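The busy-spin choice described here looks roughly like the following with the standard verbs API — a minimal sketch, where `cq` is the completion queue the RDMA read was posted against:

```c
/* Minimal sketch of the busy-spin completion wait: poll the completion
 * queue in a tight loop instead of arming an interrupt and sleeping. */
#include <infiniband/verbs.h>

int wait_for_rdma_busy_spin(struct ibv_cq *cq) {
    struct ibv_wc wc;
    int n;
    /* spin: for a ~19 us round trip, burning the core is cheaper than
     * a context switch plus refilling a large CPU cache footprint */
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;                                   /* nothing yet, keep polling */
    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return -1;                          /* poll or completion error  */
    return 0;                               /* block is in our buffer    */
}
```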
And this is a slide that summarizes the whole persistent-memory-plus-RDMA technology we just talked about. On the storage server side, I have a pyramid here. It illustrates that we keep all the data in the cold tier, which is the hard disks; the warm data goes in the flash cache; and for the cream of the crop — the hottest of that 25 terabytes — we put it into the persistent memory cache. So the most frequently accessed blocks in storage get accessed the fastest way, with the lowest latency. That is really nice for an OLTP application, because instead of just sitting there waiting for reads to come back, we transform an IO-bound application into more of a cached application. So that's super nice.

All right. So now we've handled the 8K data read problem: I trampoline over and RDMA from my persistent memory cache. What's the next step? A transaction has to commit, right? If the transaction doesn't commit, it's not a durable change. So, very naturally, the super-critical OLTP application says: look, I want to commit the transaction. How do you commit a transaction in a database? Very simple: you issue your log writes. Everybody who's taken a transactions course knows you've got to write your commit to the redo, and that redo has to be written to persistent media. In this case you might do two separate log writes — one for the balance change and a second for the commit — or you can optimize and combine them into one shot, piggybacking the commit along with the change itself, so you issue just one log write. But even if you optimize down to a single log write, I'd argue you fall off the IO cliff again, because you have to send that write to storage, and that's another pretty long wait.

So here comes challenge number two: can lightning strike the same place twice? For Ben's transaction, yes, lightning strikes again: on the log write, we just fell off the IO cliff as well. What happens here is very similar to what we saw before: the database server has to package up the redo and send it as a message over to the storage server. On the storage server side, we actually have a lot of interesting innovations to accelerate log writes, even before persistent memory, so let me spend a minute on that. When you look at log writes, many people say: look, I'll just put my log writes on flash; that gives me the lowest latency, the fastest commit. And that works well. But flash is not just a piece of hardware. You look at a flash device and see a PCIe card, but as for what's inside — as I like to say, software runs everything. Flash is not like disk: you can't do an in-place update. Every cell has to go through a program/erase cycle; you never write in place. So there's a whole bunch of software running inside a flash card, constantly doing remapping in the background, and that's what gives you low-latency flash and high IO bandwidth. But the problem with software, as we all know, is that we always have bugs, right?
Bugs are inevitable, just like death and taxes, I guess. For us, it's bugs. There are cases where the code can get into some sort of glitchy state, and if you happen to land a redo log write on the flash at that moment, you see outliers. By outliers I mean: normally your write finishes in about 100 microseconds, but occasionally it can take milliseconds to complete. And that has profound implications for OLTP transactions, because in Oracle Database in particular, we have one log writer that aggregates commits from many foreground processes, batches them, and writes them down to storage. Imagine that single write gets held up: it's not just your own commit; the delay percolates and cascades to many other transactions. So we said: sending the write to a single destination that's prone to very rare outliers is not good. Here's what we do instead. There's another thing on our storage server: the disk controller that sits in front of all of our disks. Inside the disk controller is a persistent DRAM cache that accelerates writes. Again, it's a bunch of software running inside that piece of hardware, managing the cache, coalescing the smaller writes, and dumping them back to disk. Normally it's also very fast, but it can run into its own bugs or glitches or stalls. So we play a pretty clever trick: we send the log write to both destinations at the same time. They're two separate pieces of hardware with very independent failure characteristics: if one flash card is having a problem with its garbage collection, chances are it won't coincide with a problem in the controller cache, which has its own management. Two independent, low-probability outlier events — we overlay the two, report whichever write completes first, and we've eliminated the single-outlier situation. We'd only see a slow log write on a double outlier, which is practically negligible, because it's two independent low-probability events.

Doesn't that really complicate your recovery processing, if you get unlucky and you get out of sync between the disk controller and the flash log?

That's exactly why this is a very sophisticated feature we implemented inside the storage server. At runtime it's all happy path; that's easy. The difficult part is exactly what you pointed out: recovery. If your server crashes and comes back, how do you reconcile the data and make sure you never lose any of the log writes? That's really where the challenge is.

It also sounds like there's custom hardware — the disk controller and all that — is it all custom, home-grown stuff?

No, no, these are actually standard components. With Exadata we take commodity hardware: we get flash from multiple vendors, controllers from different vendors, and we put them together in the server. So it's not custom hardware — but we learned this the painful way in the beginning. We were all naive: okay, flash is fast, the disk controller has a write-back cache, what more can you ask for? And then you start seeing those log writes stalling, and customers complaining, and it's like: okay, we've got to do something to eliminate those stalls. And so we invented this dual-destination write feature.
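An illustrative sketch of that dual-destination write (not Oracle's implementation): issue the same redo buffer to two independent devices with POSIX AIO and treat the commit as durable as soon as either completes, so a rare stall on one device cannot delay it. `fd_flash` and `fd_ctlr` are assumed open with synchronous-write semantics (e.g. O_DSYNC):

```c
/* Fire the same redo write at two independent devices; ack on the
 * first completion. A double outlier is the only slow path. */
#include <aio.h>
#include <errno.h>
#include <sys/types.h>

int dual_log_write(int fd_flash, int fd_ctlr,
                   const void *redo, size_t len, off_t off) {
    struct aiocb a = {0};
    a.aio_fildes = fd_flash;
    a.aio_buf    = (void *)redo;
    a.aio_nbytes = len;
    a.aio_offset = off;
    struct aiocb b = a;
    b.aio_fildes = fd_ctlr;

    if (aio_write(&a) != 0 || aio_write(&b) != 0)
        return -1;                            /* fire both writes       */

    const struct aiocb *list[2] = { &a, &b };
    for (;;) {
        if (aio_suspend(list, 2, NULL) != 0 && errno != EINTR)
            return -1;                        /* wait for either one    */
        if (aio_error(&a) == 0 && aio_return(&a) == (ssize_t)len)
            return 0;                         /* flash persisted first  */
        if (aio_error(&b) == 0 && aio_return(&b) == (ssize_t)len)
            return 0;                         /* controller cache first */
        if (aio_error(&a) != EINPROGRESS && aio_error(&b) != EINPROGRESS)
            return -1;                        /* both writes failed     */
    }
}
```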
And even with that feature in place, like we said, you have to issue the write to both destinations, report whichever one comes back first, and then send the commit back to the OLTP app from the database side: my log writes are persisted, your transaction is good. And now I'm having deja vu — another question: how long did we wait for that log write? I'll just reveal the answer: it's the same 200 microseconds, because the characteristics are very similar to an 8K random read. A whole 100 microseconds on the networking and the context switches and all that, and another 100 microseconds on actually persisting the write. So the question becomes: can we do better here? I don't like that 200 microseconds; can I run at much lower latency? The answer is yes: the same dynamic duo we just talked about comes to the rescue again.

Let's break down how we use RDMA and persistent memory to accelerate log writes. It's very similar to the earlier RDMA picture. On the right-hand side you have the storage server, with a NIC, and the database server sends an RDMA request to the storage server saying: look, I really want to write this piece of redo. On the storage server side, we take some amount of persistent memory — actually very little of it; less than 1% of the total persistent memory on a storage server is used for this purpose. We take that small amount of PMEM real estate, carve it into separate receive buffers, and create this thing called a shared receive queue on the storage server. That allows multiple RDMA connections — from different databases, from different log writers — to all RDMA-send their log writes to the storage server through the same shared receive queue. Once a log write lands in a PMEM log buffer, the NIC can simply send the ack back to the database server. No software processing is needed, because as soon as your redo is deposited in persistent memory, by definition it's persistent; you don't have to worry about durability from that point on. The NIC sends the ack back, and right away the database server can tell the super-critical OLTP application: your commit is done, you're off to the races.

In the background, when a send lands in a receive buffer on that shared receive queue, the hardware generates a notification to the software running on the storage server: one of your receive buffers has been consumed; you've got to do something with it. Upon getting that notification, our storage software takes that piece of redo and de-stages it to the backing store — the flash and the disk controller cache we talked about. As soon as the redo is copied to the backing store, that persistent memory receive buffer is completely freed up: I don't need it anymore, because persistence is now accomplished by the disk or flash on the other end. So what I do is repost that buffer back to the same queue, which enables a very small set of receive buffers to be reused over and over for a very high throughput of log writes.
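A minimal verbs sketch of that storage-side receive-buffer cycle — carve a small PMEM region into fixed-size buffers posted to one shared receive queue, then de-stage and repost on each completion. The buffer size and count are made-up illustrative values, `destage_to_backing_store` is a hypothetical helper, and `pd` / `pmem_log_mr` are assumed to be a protection domain and a registered MR over the PMEM log area; this is the standard SRQ API, not Oracle's internals:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define REDO_BUF_SZ 4096
#define NUM_BUFS    256

extern void destage_to_backing_store(struct ibv_mr *mr, int buf_id);

static void post_recv_buf(struct ibv_srq *srq, struct ibv_mr *mr, int i) {
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr + (uint64_t)i * REDO_BUF_SZ,
        .length = REDO_BUF_SZ,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id   = i;              /* identifies which PMEM buffer landed */
    wr.sg_list = &sge;
    wr.num_sge = 1;
    ibv_post_srq_recv(srq, &wr, &bad);
}

struct ibv_srq *setup_log_srq(struct ibv_pd *pd, struct ibv_mr *pmem_log_mr) {
    struct ibv_srq_init_attr attr = {
        .attr = { .max_wr = NUM_BUFS, .max_sge = 1 },
    };
    struct ibv_srq *srq = ibv_create_srq(pd, &attr);
    for (int i = 0; i < NUM_BUFS; i++)
        post_recv_buf(srq, pmem_log_mr, i);   /* arm all PMEM buffers   */
    return srq;
}

/* on a receive completion: the redo is already persistent in PMEM, so
 * the NIC has acked the database; de-stage to flash/disk, then recycle */
void on_log_recv(struct ibv_srq *srq, struct ibv_mr *mr, int buf_id) {
    destage_to_backing_store(mr, buf_id);
    post_recv_buf(srq, mr, buf_id);           /* buffer reusable again  */
}
```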
We call that having our cake and eating it too: we want the majority of the persistent memory — 99% of it — to be used for caching, because that lets us cache as big a working set as possible, and we take a tiny bit, less than 1%, and use it as a persistent memory log buffer in a circular fashion to facilitate and accelerate log writes.

And if you ask me: what happens if there's a power outage, Jia? I want my durability — how is my redo safe across a power outage? That brings us to the next slide. Earlier I said persistent memory is not a piece of cake: you don't just pop it into a DIMM slot and expect persistence. This is a great illustration of why. Look at this very simple diagram — I didn't draw a server, but imagine one on the left-hand side. You have a bunch of CPUs, maybe some shared L3 cache, a NIC card that receives the RDMA from the network, and a PCIe bus that connects the NIC to the CPU and the memory. On the Intel platform we run, there's a technology called Data Direct IO, DDIO for short. When your network card receives packets, DDIO lets it directly perform a write-allocate update into the L3 cache, the last-level cache. That is generally a very favorable feature: you land the received data in the L3 cache, wake up the process that was waiting for it, and that process finds all the data it needs readily available in the L3 cache. A great performance-enhancing feature.

But when it comes to persistent memory, that feature gets in the way. The reason — I drew a little dotted line around the persistent memory — is something called the ADR safe domain. ADR stands for asynchronous DRAM refresh. It's a hardware term, and essentially it means your write has to land inside that dotted box before it can be made persistent across a power failure. If DDIO is turned on, all your log writes wind up in the last-level cache — and as we discussed earlier, if you pull the power plug right then and there, that cache is gone. You'd be like: I thought I committed that $1,000 deposit — what happened? Not acceptable. So how do we ensure the write is actually persistent? We had to make a very difficult choice: we turn DDIO off, so that all NIC receives go directly to memory. And this applies to DRAM-based receives as well as PMEM-based receives, because unfortunately, on the platform we have, there's just a single global knob controlling all receives on a NIC card. I know the NIC vendors and CPU vendors are working closely on this, trying to do better in the future, so that you can tag each network receive — and with different tags you could say: this one is going to volatile memory, no need to worry about persistence.
It can just land in the CPU cache. And this one is going to persistent memory: bypass the CPU cache and go straight to the ADR safe domain. We did take some negative impact from turning off Data Direct IO, and we had to come up with other creative software engineering tricks to mitigate it — I just wanted to flag that as an important point.

The second point is that the network can also get congested, right? The log write is super low latency — I really want it to get there fast; those are the lowest-latency IOs I have. Then there are medium-priority IOs, for example my OLTP data reads. And then there are workloads like reporting, backup, and batch: higher throughput, but definitely the lowest priority. I don't want those big-throughput batch workloads getting in the way of my super-low-latency IOs. So we carve out different traffic classes for our network packets: from the database to the storage server and back, we make sure different traffic takes different lanes. Imagine a highway: everybody may be stuck in the regular lanes, but you get to drive in the VIP lane reserved for, say, emergency response vehicles. You get super low latency even when the network is saturated. And this has to be done every step of the way: on every switch and on every NIC. That's another very important feature we put in to guarantee low-latency IOs.

So let's put all of this together. We have the RDMA write; we jump on the same trampoline again and do a super-fast RDMA write to the persistent memory log, and we get much faster log writes. So what happened to Ben's transaction? Well, Ben is pretty happy, right? I deposited my money and my transaction committed. Are we done? The database, I want to say, is still left with one more challenge: it has that 8K block, which is now dirty. Eventually it gets cold and has to be written back to storage, or maybe there's a checkpoint to meet and I have to write it out. So what do I do with that dirty block? I have to write it back to storage. And here comes my last challenge of the presentation: what is so difficult about writing data back to the PMEM cache, back to PMEM on the storage server?

One thing I wanted to ask is whether everybody knows the word splinch — I know we're running short on time, so I'm just going to complete the story. If you're like me and love Harry Potter and the wizarding world, you'll know that splinching is something that can happen during Apparition — magically disappearing and reappearing at a new location, a magical form of teleportation. A splinch happens when the wizard or witch isn't careful: they move to the new location but leave part of themselves behind — maybe an eyebrow, or a piece of clothing — at the original location. In database parlance, we call that a torn block. In a database, you never want your block to be fractured, right? Because if it's fractured, it's gone — what can you do with it? You have to do media recovery and all sorts of other crazy stuff to get it back.
So having sector atomicity is extremely important for the durability part of database transaction processing. And persistent memory poses another challenge here: PMEM only gives you an eight-byte atomic write guarantee. It's not like the 512-byte or 4K sector atomicity you normally get from disk or flash. And we certainly don't want our database writes to be fractured, because that would mean torn blocks and data loss. So the question is: is RDMA to persistent memory a good choice here? I would say no. Why not? Think about it: you write an 8K block through RDMA — you know where it goes on PMEM, you land it there — and the server dies right then and there. What happens? Your block gets fractured. Not to mention that there's the network MTU underneath the covers, so what looks to you like one logical block write can get chopped into multiple fragments sent over the network. Even more chances of hitting a splinch.

So how do we avoid a splinch? The good thing is, if you go to Hogwarts and take the Apparition class, the teacher will tell you there are three principles to follow, the three D's: destination, determination, and deliberation. We were very inspired by that, and we feel those 3D principles can prevent a database block splinch as well. First, destination: from the database point of view, I know where I need to send my PMEM writes, and what I do is send a regular message — no fancy RDMA trick this time. Then, when the storage cell software gets that message, it makes the determination: this block is on my persistent memory, so I've got to be really careful; I don't want a splinch. And the deliberation part is where the trick happens: we introduce what we call a staging write. You could do this many other ways, but the way we happen to do it is to carve out a separate area we call staging buffers, and when we need to write to persistent memory, we perform two writes. The first write goes to the staging area: I land the block — 8K, 512 bytes, whatever it is — in the staging area, set up a barrier, and make sure that write is complete. Then I issue the write to the actual location in the PMEM cache where that 8K block resides. You may ask: okay, Jia, why bother with two writes? How does that prevent a splinch? Say there's a power outage, and it hits at a point where the second write — the one to the actual location — is only half persisted. In that case we perform what we call staging buffer recovery: remember, I already wrote my entire update to the staging location before issuing the second write, so I just copy it over — rewrite the whole thing and make sure it's complete. That's how we guarantee we never end up with a fractured block. And if the server dies during the very first write, to the staging area, there's even less to do: nobody has touched the PMEM cache lines at the original location, so no recovery is needed; that block is still consistent. So this is what we do.
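Here is a hedged sketch of that staging-write idea. The layout and helper names are illustrative, not Oracle's; `persist()` stands for the clwb-plus-sfence routine sketched earlier, and the eight-byte `seq` field exploits exactly the eight-byte atomicity PMEM does guarantee:

```c
/* Persist the whole 8K image to a staging slot first, fence, then write
 * it to its final PMEM home. If power fails mid-way through the second
 * copy, recovery replays the intact staging copy: no torn blocks. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 8192

extern void persist(const void *p, size_t len);  /* clwb + sfence */

struct staging_slot {
    uint64_t dba;            /* which block this staged image belongs to */
    uint64_t seq;            /* "staged" marker, 8-byte atomic on PMEM   */
    unsigned char img[BLK];
};

void pmem_block_write(struct staging_slot *slot, void *final_loc,
                      uint64_t dba, const void *img, uint64_t seq) {
    /* write 1: full image into the staging slot, made durable first */
    memcpy(slot->img, img, BLK);
    slot->dba = dba;
    persist(slot, sizeof(*slot));
    slot->seq = seq;                    /* atomically mark slot valid    */
    persist(&slot->seq, sizeof(seq));   /* barrier before second write   */

    /* write 2: now it's safe to overwrite the real location in place */
    memcpy(final_loc, img, BLK);
    persist(final_loc, BLK);

    slot->seq = 0;                      /* done: staging slot reusable   */
    persist(&slot->seq, sizeof(seq));
}

/* crash recovery: any slot with a nonzero seq holds a complete image
 * whose final copy may be torn; replaying it restores atomicity */
void staging_recover(struct staging_slot *slot, void *final_loc) {
    if (slot->seq != 0) {
        memcpy(final_loc, slot->img, BLK);
        persist(final_loc, BLK);
        slot->seq = 0;
        persist(&slot->seq, sizeof(slot->seq));
    }
}
```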
I would say that's the doublewrite buffer in MySQL. The same trick.

Okay, yeah. Thank you. All right. There's still one more trick we have to apply, though. Let's take a closer look at how you guarantee that your write to persistent memory is actually persisted across a power cycle. We learned how to do that for a PCIe-initiated write, like a network RDMA. But a different trick is needed when you initiate the update — the store — from a CPU core, which is the case here: you've got to either flush your writes out of the CPU cache, or bypass the CPU cache. One of the two has to happen. In our case, we're not interested in inspecting that 8K block: it's the database writing it back, and the database may want to read it later, but as the storage server, I don't care what's in it. So instead of polluting my CPU cache by writing through it and flushing at the end — which is actually painfully slow — we use non-temporal store instructions to bypass the CPU cache completely, avoiding polluting it. Then we add an sfence as a memory barrier to make sure the store has actually made it to the persistent memory.

Okay, one last quick word about writing to PMEM: we hit a very interesting performance problem early in our journey. Normally, with a disk or flash device, if you ask how to extract the best throughput, the trick is to load it up, right? Queue up as many IOs as you can afford; the device gets saturated and you get the highest throughput. But when we played the same trick here, we saw a precipitous fall — the yellow line here — a 10x drop in the performance of writing to persistent memory. We were really puzzled: what happened? Our partners at Intel educated us about something called the IO directory cache. There's directory state that the CPU has to maintain to track writes, in order to avoid snoops. Now, even though PMEM reads are fast — only two to three times slower than DRAM — PMEM writes are seven to ten times slower than DRAM; significantly slower. So you have a lot of writes hitting the PMEM DIMM, and the directory update is a separate additional update to the PMEM, because that directory state is co-located with the PMEM data cache line. That extra update degrades write performance. In addition, as more threads jam in, there's a limited write buffer inside the PMEM DIMM, so you fragment that write buffer, cause thrashing, and the whole performance just drops — a really precipitous cliff of its own. The trick is to enable the IO directory cache, so the directory update can be combined with the actual data update: you piggyback on it and eliminate the extra write. That gives you not just a 2x improvement but a very significant improvement in total write throughput, by avoiding the thrashing inside the PMEM. That was an important finding, and it really helped our PMEM write throughput.
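A minimal sketch of the non-temporal store technique described a moment ago — the general x86 approach, not Oracle's code. It assumes `dst` and `src` are 16-byte aligned and `len` is a multiple of 16:

```c
/* Stream an 8K block toward PMEM while bypassing the CPU cache,
 * then fence, as described above. */
#include <emmintrin.h>
#include <stddef.h>

void nt_copy_to_pmem(void *dst, const void *src, size_t len) {
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < len / sizeof(__m128i); i++) {
        /* non-temporal store: goes around the cache toward memory, so
         * it neither pollutes L3 nor lingers in a volatile cache line */
        _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
    }
    _mm_sfence();   /* drain the write-combining buffers: the data is
                     * durable once it reaches the ADR safe domain */
}
```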
So the key takeaway, I guess, for this talk is really: if you have a database doing transaction processing, let's not fall off the IO cliff. Do the RDMA read for your 8K random data reads, and do the RDMA write for your PMEM log. And I often get a question saying: Jia, you tell me this bank transaction example is silly, right? If I'm on my phone, for me to enter that transaction it probably takes, I don't know, milliseconds or even longer for my phone to find the nearest cell tower to even send the bits over. Who cares about this 200 microsecond or 19 microsecond latency? My answer is that this is just a presentation example to illustrate the point. We really have a lot of super critical, real OLTP applications that run on Exadata. For example, they do fraud tracking, or they do real-time analytics: when you click or navigate on a screen, a lot of back-end processing happens concurrently. For those low-latency, high-throughput OLTP applications, this kind of RDMA-to-PMEM low-latency IO is extremely crucial.

So with that said, this really summarizes everything we talked about in this presentation. We have a persistent memory tier in the storage, we pair it up with 100 gigabit RoCE, and that's how we accomplish our very low latency IOs and very, very fast log writes. And the good thing about this story is that it actually scales: it's tiered, and it's shared across databases. And if you remember the earlier slides, we can have multiple racks. So it's not limited by, say, PCIe extenders or other kinds of hardware technology that constrain how far you can extend this; we can hop across network switches, no problem. And that really allows this architecture to scale very nicely on Exadata.

So that's all I have for OLTP. And Andy, I know we talked earlier about what an Exadata does: is it just OLTP, or do you guys do data warehousing? So before I end this presentation, I'll spend the next 30 seconds going really fast through data warehousing, because that deserves its own talk somewhere else, maybe at another time or location. But we also do amazing data warehousing analytics on Exadata. This is from an Exadata user who wants to read 600 gigabytes of table rows, then has the storage index trim 500 gigabytes of it, then scans through the flash cache and returns only 280 through a smart scan. So that is a whole different story, Exadata analytics. I don't have time to go through it, but I wanted to give you a quick teaser and sneak preview of what we have accomplished there, just to complete the story that we do both OLTP and data warehousing really well on Exadata. All right, so that's all I have. Back to you, Andy.

Okay, awesome. So I will applaud on behalf of everyone else. We're a little bit over time, so maybe I have time for one or two more questions from the audience.

Hi, Jia. This is Lin. Thanks for the great talk. I have a question, which is that you mentioned you're using roughly 1% of the persistent memory for redo log, which is really leveraging the persistency property of PMEM. It seems that the remaining 99% is not really leveraging the persistency property per se. So is it possible to just replace that 99% with DRAM?
Or is there any reason that PMEM is still better in this case? Yeah, that's really a great question. So the PMEM cache, as you pointed out, that's where we actually enabled RDMA. At some level, you could argue it has nothing to do with persistent memory, right? You could have done that 10 years ago: if you have memory and you have an RDMA network, you could build that, a write-through cache, and if I lose all my data across a server failure, I don't care. But what is actually really important about persistent memory, and one of the things I didn't spend much time on because we're limited by time, is that PMEM also brought in a capacity differentiator compared to DRAM. When we deployed the system back in 2019, we were able to pack 1.5 terabytes of PMEM into a storage server. And if your friend was Intel and you had a big wallet, you could actually pack six terabytes of PMEM into a single server. So it brings you much higher memory density and allows you to build a real cache on that kind of budget. Because if you simply look at DRAM: oh, I've got 200 gig, oh, I've got 500 gig. How am I going to build a cache with that? As soon as a block pops into the LRU, before the next IO happens it's already out of your cache, right? So the fact is that PMEM gives you the capacity advantage while still sitting on the memory bus and allowing you to do RDMA, and that's what gives us the benefit. That's why, even though we wanted to do this many years ago, we couldn't quite pull it off until PMEM came along. Does that make sense? And to answer the persistence question, what I would say is that we can definitely put the PMEM cache in persistence mode. So you could just restart your servers and have everything in there warmed up, ready to go. That's still nicer than a DRAM cache, where you have to repopulate everything after a restart.

Yeah, got it. Thanks for the answer.

You're welcome.

Hey Jia, this is Ishwar here. That was a great presentation. One question: are all these parameters configurable at the cluster level, or is it all autonomous, as you guys call it, handled behind the scenes?

Okay. I guess parameters, to me, means how much I use for PMEM cache, how much I use for PMEM log, like I have listed here, right? These are all created automatically with defaults. But if you want to tinker with them, you're more than welcome; we have a proper API you can use on the storage server to configure that.

Thank you.

You're welcome.

All right. We have a question from Eduardo, who couldn't unmute himself, so I'll read it from the chat. The question is: what is recommended with this technology, small redo logs and several switches, or large redo logs and a few switches, as has been handled with other Exadata models?

Several switches. Andy, do you have any guess what "switches" means here?

This is redo log to archive log switches, log switching. That's what he's talking about.

Oh, okay. Yeah. I was asking because I wasn't sure whether it's a network switch or a log switch, so I wanted to address that. What happens is that a log switch doesn't happen in Oracle until your redo log fills up. So it really has nothing to do with how big your log writes are. Because you can imagine a log file is typically, I don't know, 10 gigs or 50 gigs, whatever size it is. Every write is much, much smaller, right?
Because you've got to commit a transaction. Even in TPC-C, I think the average log write size is around maybe 100K to 200K. Those are much smaller. Not to mention, if you have a small transaction, you can even have a 4K write or even a 512-byte write. So it really has nothing to do with log switches. It's just accelerating every single individual log write using the PMEM and RDMA technology. And whenever that log fills up, sure, a log switch happens and you move on to the next log. So it really has nothing to do with how frequently you switch the log. Although, I guess, because our writes are so much faster, you might end up switching your logs faster. That could happen. But it's really orthogonal.

Okay. So my last question would be: this talk was great because you showed how, through careful database software engineering, you're able to take advantage of all the new hardware that Intel and everyone else is throwing at you. So I guess the question is, what's the next bottleneck? You're already shaving microseconds off your IO. What would be the next target? If you had a magic wand and could fix one thing, if you had to redesign your data structures up in the database server now that you know disk IO is minimized, what's the next mountain to climb?

Yeah, yeah, it's a great question. And we ask that question all the time: are we done? Do we declare victory? What we really feel is that this is just the tip of the iceberg of using persistent memory for databases. Because when you think about it, a perfect example is the B-tree index I talked about. Have you ever wondered why a database uses a B-tree index? Why do we not just store a hash table directly on disk? Wouldn't that be nice? You read the hash table out and you've got everything, without having to walk down a tree. It turns out that the B-tree index is very nice because it's very friendly to disk block storage, right? You can store it, read it back, populate it, walk down the tree. It's great. But with a hash table, it becomes much more challenging to store it on disk. So what I wanted to say is that this acceleration in the IO path is still very much within the block storage mindset we have about a database: how to accelerate access to it. But there are so many other new possibilities with persistent memory, especially when it comes to databases. What happens if you use persistent memory as the database's native storage, forgetting about block storage entirely? Can you do something with that? So to me, we're on the first step of this big epic journey of exploiting, harnessing the power of persistent memory, and there are bound to be many, many new innovations in the database space with this.

My first PhD student's thesis basically answered that as a question: if you throw out DRAM and throw out flash, how do you build a new system? Okay, awesome. Jia, thank you so much for doing this. This was an excellent talk.