Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org. Thank you everyone for being here. We're excited today to have Joe Victor. He's a software engineer working at SingleStore, and he's here to talk to us a little bit about the new version of SingleStore that deviates from the approach it started with as MemSQL. Joe has a master's degree and an undergraduate degree, both from Stanford. As always, if you have any questions for Joe as he's giving the talk, please unmute yourself, say who you are and where you're coming from, and ask questions at any time. We want him to be interrupted so it's not just him talking at us. So Joe, thank you for being here. Go for it, the floor is yours. I'll just start with the takeaway so that we have the takeaway up front. One of the downsides of separation of storage and compute is that if you write one row or update one row in BigQuery or Snowflake, you are going to hit the blob store, and that can mean random, high latency. So I'm going to talk about a slightly different architecture that doesn't do that, so you can still do transactional workloads with the benefits of separation of storage and compute. So let's do it. I'm going to spend most of this talk actually talking about durability, an old-school replication thing, and I'm going to talk about why this particular replication design makes it really easy to separate out storage and compute. Then we're going to do some applications of separation of storage and compute, like fast provisioning and continuous backups.
And at the end, if we have time, which maybe now we won't, we can talk about some HTAP query processing stuff that kind of brings everything together and shows why this architecture, holistically, can really do transactions and analytics with separation of storage and compute in one cloud-native system. Cool. So yeah, let's talk about durability. The high-level architecture of SingleStore is that it's a scale-out database with essentially shared-nothing partitions. Each partition has one primary replica and one or two secondary replicas, and durability is managed locally on each partition. So each partition has fast synchronous replication, which I'll talk about in detail, and the data is committed when it's committed in memory on each of the replicas. And the nice thing about the commit is that it doesn't touch the blob store and it doesn't touch the local disks. Everything's very fast: it's in memory on three nodes, you're good to go. The data is stored in an LSM tree on each partition, which is fairly standard. The top level is a skip list and the lower levels are HTAP-optimized column stores, which is how SingleStore got its name, but they're just fancy column stores. And then of course all the data in the partition is logged. So when you write data to the skip list, we log it row by row. And the column store data is stored in immutable data files. If you want to update a row, we just change the metadata for that segment of rows, so the data files themselves are immutable. And this immutable property is going to be super important. The fact that the blobs are immutable is the way that we're actually able to get the separation of storage and compute in a way that's really reasonable and really kind of easy to think about and work with. And finally, the data is asynchronously uploaded to a remote blob store. And we have Clock-SI. It's not quite Clock-SI, it's kind of our own thing, but it's fairly similar.
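The commit rule just described can be captured in a tiny sketch. This is not SingleStore's actual code; it's a minimal illustration, with made-up class names, of the idea that a write counts as committed once it is in memory on the primary and on every secondary, with no disk fsync or blob-store call on the commit path.

```python
# Minimal sketch of the in-memory commit rule (illustrative, not real code).
class Replica:
    def __init__(self):
        self.in_memory_log = {}            # page_number -> payload

    def ack(self, page_no, payload):
        self.in_memory_log[page_no] = payload  # just an in-memory copy
        return True

class Partition:
    def __init__(self, num_secondaries=2):
        self.primary = Replica()
        self.secondaries = [Replica() for _ in range(num_secondaries)]
        self.next_page = 0

    def write(self, rows):
        page_no = self.next_page
        self.next_page += 1
        self.primary.ack(page_no, rows)
        # Synchronous replication: wait for every secondary's in-memory ack.
        # Note that nothing here touches local disk or a blob store.
        acked = all(s.ack(page_no, rows) for s in self.secondaries)
        return "committed" if acked else "pending"

p = Partition()
assert p.write(["row1"]) == "committed"
```

The point of the sketch is just what the commit path waits on: memory on a handful of nodes, nothing slower.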
So yeah, let's get started. SingleStore durability: I call it durability integrated with compute. And I'm going to tab over to just make sure I'm still connected to you guys, which I think I am. All right, cool. Yeah, you're good. Sorry, just a little paranoid after what just happened. OK, cool. So as I mentioned, each table is going to have an in-memory skip list, which is the top level of the LSM tree: new rows, recently updated rows. We're going to have data file metadata, so for each of the column store files that we have, we'll store the metadata for that in memory. On disk is a paged log. The log is separated into 4K pages, and each of the pages gets replicated in whatever order they happen to reach the secondaries, so they can be acknowledged independently. So short transactions don't have to wait for long transactions. And of course, we have the actual data, which sits in these files that are ready to be scanned and have super fast column store scans, vectorized execution, et cetera. Cool, let's insert some data. So I insert one row. It writes to the log that I've inserted that row, and puts the row into the skip list. Nothing special here. I insert three rows. I write a second page of the log that says I've inserted these three rows, and put those rows in the skip list. One nice little thing: I mentioned the replication is out of order. There's no data dependency between these two pages of the log. So if the primary sends these to the secondaries in whatever order and the secondaries ack in whatever order, that's fine. That's just how we keep each little transaction fast, no matter what's going on. So you're ready to commit once you know that your chunk of the log has been acked. All right, cool. Let's say that the system decides that these four rows that we have in the skip list are ready to be made into an actual data file. So it flushes them to disk, and the system will write two pages to the log.
So it'll write a create-data-file page, and it'll write a delete from the skip list, so the data is no longer in the skip list. This create-data-file page is on log page three, and that actually gives us the name of the file. Here's the file: it's file three, and it's got the rows in it, ready to scan in column store format. And we also have the metadata for that file. So again, the LSN of the file is three. That's how we know the name of the file, and that's how we know whether or not it's committed. And also, because we just created the file, none of the rows are deleted. We have a little bit vector there that tells us which of the rows are deleted, and it starts out with none of them deleted. So if we want to delete a row, say we delete B, we simply update the metadata. We write that update to the log, we update it in metadata, and we don't touch the file. If we want to update a row, say we want to update C2 to C7, we simply update the deleted bit vector, as seen here, and we insert. So by setting the second bit, we interpret the third row in the file as not being there, and we move the new version to the skip list. And obviously, this is a dramatically simplified version, but this is essentially how our durability works. You want to do onesie-twosie updates to the column store data? You can. And we don't change the column store data; we just move the data to the skip list. And we have some fancy things around locking individual rows in the column store, which I'm not going to get into, but it's one of our big technical advantages. So are there any questions about how we do durability? Yeah. Well, I had a question: didn't you say that the log is written asynchronously and that transaction commits are independent? No, the log is synchronous replication. Each page of the log needs to be in memory on however many replicas you want it to be on before the thing is considered committed. I understand that.
But as far as this transaction 1 and transaction 2, couldn't 2 have gotten to the three replicas before 1? Or are you saying that 2 wouldn't be processed until 1 was committed? No, that's correct. 2 could be sent to the three replicas before 1 was, and that would be totally fine. OK, so that's how I understood you saying it. And then what about 3? 3 cannot happen until 1 and 2 have made it to all the replicas. So what is it that guarantees 3 can't happen until then, just as 4 cannot happen until 3 has happened? Just like here, we have a logical data dependency. The fact is that the flushing process can't commit on the primary until the things that it's trying to flush are committed. So while you're waiting for the acks from the secondaries, those rows would have row locks on them, or they wouldn't even exist yet if they haven't been sent. So the fact that the thing was able to happen on the primary guarantees that if the secondary tries to replay what it does have, it's still consistent. So suppose that 2 was committed but 1 was not. If you fail over to the secondary, you're just going to have 3, 2, 5. But the user would never have received an acknowledgement that 1 was written. On the other hand, 3 could not have happened, because the primary just didn't have access to this row. So you're saying 3 couldn't have happened, but 3 doesn't seem to reference A, B, C, D. Or is that actually in the metadata of that log record for the create data file? It's not in the metadata, it's in the data. The process is going to scan these things and form them into a column store file, and it's not going to take the ones that aren't committed. The iterator is just not going to see them. Yeah, OK, so it's scanning the log to find out... It's scanning the skip list. OK. Yeah, the flushing process is going to scan the skip list, and it's just not going to see things that aren't committed. OK.
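The exchange above, where page 2 can be acked before page 1 but a flush only ever sees committed rows, can be sketched concretely. This is a toy model under my own naming, not SingleStore's implementation: pages of the log are acked independently and in any order, a page commits once enough replicas hold it, and the flush iterator simply skips rows from uncommitted pages.

```python
# Toy model of the out-of-order paged log (illustrative names and shapes).
class PagedLog:
    def __init__(self, quorum=3):
        self.quorum = quorum
        self.acks = {}        # page_no -> replicas (incl. primary) holding it
        self.rows = {}        # page_no -> rows written by that page

    def append(self, page_no, rows):
        self.rows[page_no] = rows
        self.acks[page_no] = 1            # the primary's own copy

    def receive_ack(self, page_no):
        self.acks[page_no] += 1           # acks arrive in any order

    def committed(self, page_no):
        return self.acks.get(page_no, 0) >= self.quorum

    def flushable_rows(self):
        # The flush iterator only ever sees rows from committed pages,
        # which is what makes out-of-order replication safe.
        return [r for p in sorted(self.rows)
                if self.committed(p) for r in self.rows[p]]

log = PagedLog(quorum=3)
log.append(1, ["A"])
log.append(2, ["B", "C", "D"])
log.receive_ack(2); log.receive_ack(2)    # page 2's acks land before page 1's
assert log.committed(2) and not log.committed(1)
assert log.flushable_rows() == ["B", "C", "D"]   # flush skips uncommitted page 1
```

Here transaction 2 commits first, exactly as discussed, and the flush that would create a data file never picks up the unacked row from transaction 1.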
And the skip list is only maintained on the primary until some kind of a takeover happens? Or is that correct? The skip list will actually be replayed on the secondary, but it won't be replayed synchronously. So there's a process on the secondary that is taking whatever prefix of the log it does have and replaying it. And then if there's a failover, it'll also replay whatever it has, even if there are holes in the log. And the skip list is pinned in memory; it never goes away. OK, thank you. Yeah, these are great questions. So the skip list is always there, and the data file metadata is always there. The background process that's replaying the skip list is also replaying the data file metadata, including those updates to the deleted bit vector. I had maybe one more quick follow-up question, which is that I believe you mentioned that pages are replicated in whatever order they reach the secondary. It's like there's no guarantee. Yeah, that's right. So are the secondaries doing some kind of buffering of their own before they apply the pages? The pages are at a known offset, so they'll write them straight to the log at the correct offset. We don't fsync right then and there, though, so they'll be buffered by the Linux file system cache. Cool. OK, so yeah, recap. The log is used for the manipulation of individual hot rows, but also file metadata, which is kind of funny, but that's the way it is. Bulk data is stored in column store data files, which are immutable. And the data is committed when it's replicated in memory to a handful of nodes. The replication is lock-free and out of order, so it's super fast: low latency, predictable latency. And I haven't said anything about cloud storage yet. So the write latency is low and predictable. Cool.
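To recap the metadata-only delete and update mechanics from earlier, here is a minimal sketch, again with hypothetical names rather than SingleStore's real structures: the data file itself is immutable, a delete flips a bit in the deleted bit vector, and an update is a delete plus an insert of the new version into the skip list.

```python
# Sketch of immutable data files with a deleted bit vector (illustrative).
class DataFileMeta:
    def __init__(self, file_lsn, rows):
        self.file_lsn = file_lsn          # the log page that created the file
        self.rows = tuple(rows)           # the file itself is immutable
        self.deleted = [False] * len(rows)

skiplist = {}                             # stand-in for the in-memory skip list
meta = DataFileMeta(3, [("a", 1), ("b", 2), ("c", 2), ("d", 4)])

def delete_row(meta, idx):
    meta.deleted[idx] = True              # flip a bit; never touch the file

def update_row(meta, idx, key, value):
    delete_row(meta, idx)                 # old version masked out of the file...
    skiplist[key] = value                 # ...new version lives in the skip list

delete_row(meta, 1)                       # delete "b"
update_row(meta, 2, "c", 7)               # update c: 2 -> 7
assert meta.rows == (("a", 1), ("b", 2), ("c", 2), ("d", 4))  # file unchanged
assert meta.deleted == [False, True, True, False]
assert skiplist["c"] == 7
```

Because the file bytes never change, the blob uploaded to remote storage never has to be rewritten, which is the property the rest of the talk builds on.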
All right, so let's now actually talk about how to make this classical structure into a modern structure. We already have this highly durable, low latency store in which all the data is recorded in a single log for a single partition. And we can easily copy this data to a remote blob store. Simply: after a data file is committed, you upload that data file. Every minute or so, you upload whatever tail of the log there is in a log chunk. And you periodically upload a checkpoint file, which just has all the known data file metadata in it and all of the in-memory rows in it. And only the primary replicas take the checkpoint, of course. Our checkpointing process is very simple. You simply scan the skip list, you scan the metadata, you serialize it into some format, and you upload it. There's no merging of B-trees, none of that nonsense. It's all just taking something in memory and serializing it. Okay, so the nice thing about this is that S3 has 11 nines of durability. We're uploading things in LSN order, and any data file at an LSN lower than what's been uploaded can simply be deleted from local disk. Or to say it another way, the local disk is just a cache now, and once something is below that LSN, you can evict it from the cache, because you know you can just download it from remote storage. And so we say that the storage is bottomless, because you can just keep loading and loading and loading, and as long as your queries are mostly staying within a working set, everything still works. Are there any questions here? What does your cache eviction policy look like in this case? It's sort of this trade-off between cost and performance, because every S3 fetch costs you money. Yep. So you obviously don't want to churn the cache. It takes you time more than it costs you money.
Because our service runs in EC2, we don't have to pay for the actual number of bytes, but we do have to pay per API call. But because these blobs combine multiple columns, and there are some heuristics to see which columns should be combined, the number of GET requests is actually fairly small. A 32-node cluster doing a pretty aggressive workload, one that doesn't have a working set and is constantly churning the cache, costs about $2 a day worth of GET requests. So it's not free, because you can have a lot of nodes, but S3 is cheap; the time is what's horrible. If you have to download a lot of data from S3, you're looking at, I don't know, hundreds of milliseconds per fetch. So that part of it is bad. Anyway, to answer your question, we use LRU-2. And in certain cases, we'll just store the thing in memory instead of storing the thing on disk. So if I do one scan, it doesn't totally destroy your cache. But yeah, this is a really subtle problem, and we are still working on tuning it. This is a very active area of development right now. But I think the most important thing out of this is that you said the cost is something you don't care about; it's really small. Even though you're paying for a GET call, like, who cares? Yeah, I believe it's half a cent per thousand GET requests. A thousand GET requests is 32 gigabytes, which, you know, half a cent per 32 gigabytes is not crazy. Yeah, that's right. Cool, okay, yeah. So what are the benefits of separation of storage and compute? Well, they're the same as the benefits for our competitors. You can store more data than you have local disk. You can turn off the cluster when it's not being used, because you can just bring it back from the blob store. Another cool thing is you can do fast provisioning. To provision a new replica, all you have to do is replay the snapshot and whatever log chunks come after it.
You don't have to download any of the data files. You obviously have to download the data files if you want to run queries, but in order to just get up and running, ready to ack writes for redundancy, you don't really need anything. And the snapshot files and the log chunks are small relative to all of the data files. You can also burst reads, and you're only going to have to download the files that you're actually going to read. And if you're a little bit careful about it, you can burst writes. And of course, by moving all of the data to one place, you have automatic continuous backups and point-in-time restore, with 11 nines of durability. The thing's not going anywhere. It's awesome. So you never have to think about taking a backup again. I'll talk a little bit about how we do that as well. So these are the benefits that we have, and that Snowflake and BigQuery have, and that Aurora has. If you have separation of storage and compute, you get these things, and it's cool. And if your database doesn't, it's not a modern database in my opinion, or at least not in the cloud. I just think this stuff is table stakes. And it's also vaporware for SingleStore; we don't have it generally available yet, so I'm kind of making fun of myself a little bit. But yeah, I think this stuff's a big deal. What are the benefits of SingleStore, though? Well, you have low, predictable latency, and it's resilient to temporary blob store outages. So if my blob store goes away for 10 minutes, which it's allowed to do because it doesn't have that many nines of availability, I can keep writing, because the writes only talk to the cluster. I can't read data that's not in my local cache; that's the only bad thing that happens. It's super easy to handle bursts in the workload. And I can do unique key constraints, because we don't have multi-master.
In Snowflake you can write from anywhere, but the writes have to take all these global locks, and it's a little bit more complicated. I just think that this design is really easy. Of course, the trade-off is that only one cluster is writable. You can have readable secondaries, but only one writable cluster. And if you lose an entire region, then you will lose data. So you don't get the advantage of: oh, I committed the row, it's in S3, I will never, ever, ever lose data. Any questions here? Cool, all right. All right, application: fast provisioning. So I talked a little bit about fast provisioning, but I'll talk a little bit more. The question is, what do you actually need in order for a replica to come online? Well, you need all the data file metadata, and you need all of the in-memory rows, all the rows in the skip list. So provisioning a new replica is really easy. You just download the checkpoint and whatever log chunks come after it, and replay them. And then you can just replicate the live data from wherever the current primary is. And so this is an operation that can happen in seconds to minutes, rather than downloading many terabytes, which would take longer. So if I want to provision something for redundancy, I literally just provision it and it's ready to ack writes. If I want to burst reads, I have to download the new working set and then I'm ready to run reads. It's trickier to burst writes, but you can still do it. All you have to do is reshuffle your partitions. So suppose you have 100 partitions across 10 nodes. You could split that to 100 partitions across 20 nodes just by spinning up 10 new nodes, provisioning new secondaries there, preheating the cache, and then failing over. If you don't preheat the cache, your workload will stop working after the failover. What is the definition of, like, burst reads, burst writes? What does that term mean to you, Pat? So bursting reads: let me start with what I do not mean.
Okay, okay, that's a really good question, and that's a really polite way of saying: hey, if one query comes in that's a little bit slow, this isn't gonna work. So it's more like, oh, I have this dashboard that I wanna create, and so I'm just gonna create a whole new set of partitions and this dashboard is gonna run there. Or, oh, I know that for the next hour this analyst is gonna be playing with some data, and they just wanna be able to run whatever queries they wanna run. Cool, spin up a new thing. And it scales pretty much indefinitely; you can have hundreds of these readable secondaries if you want. What it does not do is minute-by-minute scaling. Yeah, so you mean there's some provisioning you can do ahead of time, not necessarily on a permanent basis, but with some lead time. The idea is you can spin up a new partition or new set of partitions that services one application, away from the main store. That's right, yeah. It's more akin to Snowflake's warehouses. It's like the same thing. Yeah, bursting writes is more like, oh, it's six o'clock, everyone just turned on their TVs, whatever it is. Are there any other questions about fast provisioning? Hi, sorry, a quick question. Why do you need to warm up the cache for writes? So the reason you need to warm up the cache for writes is because writes can only happen on the primary cluster. If the primary cluster isn't doing very many reads, then you don't have to warm up the cache. But the assumption is that there's some workload happening on the primary set of partitions, and you wouldn't want them to suddenly all have a cold cache at the same time. So it's also true that, let's say I just have redundancy 2. I've got one primary, one secondary, but no one's querying the secondary. So the secondary's cache is cold.
If the primary dies and I do a failover, whatever workload was running on that primary is just gonna grind to a halt, because it has to download however much local disk the primary had. So actually, even our redundancy secondaries do a best effort to keep their cache in sync with the primaries. They get stats from the primaries, because you don't want to use up all that much space, but you want the cache to be warm, even if it's not super hot. Does that answer your question? Yep, got it, thanks. Is it really availability if, after the failover, you have to download 100 terabytes? Like, that's kind of messed up. So it's asynchronous, it's best effort, but most of the time when you fail over, your workload's not gonna be affected. All right. Hey Joe, question on that? Yes. You were just saying that in replication, we do our best to replicate to the secondary as well. I thought that only data replicates; it sounds like you're saying the plan cache replicates, or what exactly? How's that done? It's not the plan cache, it's the blob cache. It doesn't replicate; there's just a little heartbeat that gives you statistics on the blobs: hey, these files are hot, you might want to keep them around. Because no one's querying the secondary, the secondary doesn't know what files are hot. Okay, cool, thanks. And again, the main reason that you do this is for availability, because something that's queryable but slow is not particularly available. How much time do I have? Yeah, you have like 17 minutes. 17 minutes, great. All right, cool. So I'm just gonna finish up by talking about some loose ends. This has really been a talk about storage, but I just want to talk about our query layer and some of the query features that we have that make SingleStore really unique. You could dedicate entire talks to them, but we just don't have time.
And there are features which also let you do transactions and analytics in the same system. So, of course we have a scale-out cluster, and you have many shared-nothing partitions on several hosts. So one of the issues that you have... oh, before I talk about the QP features, I wanted to talk about one more storage feature, sorry. So all of the log and LSN uploads happen independently. An LSN on one node could be different from an LSN on another node. So what we do is give each transaction a logical timestamp, and it's inspired by Clock-SI. The timestamps can be used to find an LSN in each partition. So you get a whole bunch of LSNs that allow a user to restore the cluster to a consistent state. It's like: okay, I have all the data ever for each partition, but I want all the partitions to be at the state they were at at a very specific time. And this allows you to do that, down to making sure which transactions are in and which transactions are out. These timestamps can also be used to run snapshot isolation transactions. And we use 2PC to maintain atomicity for multi-partition transactions. So from the transactional perspective, this really is an ACID database. We've got each letter, and it actually works. We also have a bunch of really cool QP features which are used to give us both transactions and analytics in one system. Each table has an optional sort key and an optional shard key. We have secondary hash indexes on our column store tables, which is cool because it allows you to seek to an individual row. We also allow locking of individual rows one at a time, even if they're in those column store segments. So you can use the hash index to find a row, seek right to it with our seekable encodings, lock the row, and we even have unique hash indexes. Some things are still gonna suck.
You wanna do an upsert, you're gonna have to jump all over the universe, decoding each of the different columns. But if you just want unique key constraints, insert-ignore, or even just replacing a row, these things can all happen super fast. And if you combine these facts with the fact that the hottest rows tend to stay in the skip list, our seekable column store can perform within an order of magnitude of a row store database on point-write transaction workloads. I was pretty shocked that this all came together and works, but we've got benchmarks of us running TPC-C, and okay, obviously it's not as fast as a row store, but it's only like 10 times slower. And I thought that was pretty cool. So are there any questions on the last two slides? I guess I would add, you sort of set yourself up: why is it 10x slower? Column store, like, are you competing against Postgres or MySQL running on a single box? What was the reason? Sorry, yes. SingleStore has full row store tables, which essentially just means everything's in the skip list. And if you pin everything in the skip list, then in addition to the things always being hot, you also don't have to open those column store secondary indexes. If you want to do a unique key check, you just check it on the skip list. It's going to be crazy fast. Yeah, I have seen Andy post on Twitter some comparisons of skip lists and Bw-trees and stuff like that. That's my next question, but you really teed it up. Why? Like, when you guys rewrote the whole engine from the row store to the column store, back when SingleStore was still called MemSQL, why did you keep the skip list? We kept the skip list so that the hot rows stay hot. But it could have been a B+ tree. With all due respect, I think that your skip list is not as good as our skip list. That's fair. Yeah.
It wasn't our skip list, it was the Australian skip list. We found that it was the best one in academia. So that we can take offline; I'm curious. I mean, this is not the point of the talk, but did you guys do internal benchmarks saying, oh, should we consider a B+ tree? No, I mean, our skip list is pretty simple. We do have an engineer that wrote a Bw-tree at Microsoft, and he swears that they're better, but everyone knows they're more complicated. So that's the Bw-tree; I don't recommend that. I'm saying a good solid B+ tree. At least the literature suggests, not just mine, other work too, that the B+ tree is comparable with skip lists. I mean, there are other things you have to worry about in your system; it's only one facet of the work. I'm just curious whether, when you said, okay, let's go build the entire thing, we're writing a whole new engine, you considered maybe looking at an alternative to the skip list. And then someone posted about the skip list being doubly linked. So again, it's not our skip list. We found it from this guy in Australia; it was the best skip list that was out there. Yeah, that was years ago, but I was like, was this the guy with the doubly linked skip list? Yeah. And it wasn't towers, it was wheels, it was some Australian thing. Yeah. I think that just, most data is gonna be in the column store segments. For most people that have these huge data sets, the in-memory part just has to be fast enough for these point workloads to be a bajillion times faster than the column store. You want one row? You're not gonna beat these very simple data structures. That's that. And if you're gonna be scanning a lot, a column store is just better for scans.
So the skip list is just absolutely terrible for scans, but you don't scan it, and that's okay. Okay. Sorry. All right, sorry. Is that the end of the talk? We can open it up to questions from the audience if you're done, or if you have more slides you wanna go through. Yeah, yeah. No, I have one more important slide. Yeah. Okay, so conclusion. We get the benefits of separation of storage and compute, and integrated durability means we're not sacrificing transactional latency. Our query processing feature set lets us do point workloads on the hybrid column store table without sacrificing scan performance, because the scans happen on the column store. So it's a good candidate for an HTAP database. And the most important slide: we are hiring. So now, questions. So I will applaud, and everyone else can as well. Sorry, Mike, I had written down a question about the skip list; I was saving it up for the end and you kind of jumped into it. So, okay, we'll open it up to the audience if anybody has any questions. Hi, I'm Ling. Thanks for the nice talk. I have a little bit of a higher level, or application-perspective, question: you mentioned that you have synchronous replication, but then you will commit the data as long as the data reaches the memory of the replicas, right? Instead of waiting for it to reach disk? Yeah. So I'm just wondering, from an application perspective, have you ever faced concern from customers saying, hey, my data has not reached disk yet; what if more than half of the replicas go down? Have you ever faced any pushback from customers on that? Yeah, so that's just the default. We also support sync durability, and some of our customers do use it. It just depends on how paranoid people are.
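The distinction the answer draws between the default async durability and opt-in sync durability can be sketched as two commit predicates. The mode names and record shape here are my own illustration, not SingleStore's configuration API: async durability commits once the log page is in memory on the replicas, while sync durability additionally waits for the page to be on disk on those replicas; the upload to S3 stays asynchronous in both modes.

```python
# Hedged sketch of async vs. sync durability commit predicates (illustrative).
def is_committed(page, mode="async"):
    replicas = page["replicas"]
    in_memory_everywhere = all(r["in_memory"] for r in replicas)
    if mode == "async":
        return in_memory_everywhere        # default: memory on replicas suffices
    on_disk_everywhere = all(r["on_disk"] for r in replicas)
    return in_memory_everywhere and on_disk_everywhere  # sync: also wait on fsync

page = {"replicas": [{"in_memory": True, "on_disk": False},
                     {"in_memory": True, "on_disk": False}]}
assert is_committed(page, "async")         # commits without waiting for disk
assert not is_committed(page, "sync")      # sync durability still waits on fsync
```

This is why, as the follow-up discussion notes, sync durability only costs a modest amount over async: the extra wait is a local disk write, never an S3 round trip.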
Oh, then I actually have a quick follow-up. I'm pretty curious to know how many of your customers are using synchronous durability versus the asynchronous one. Async is the default, so I'm going to guess like 99% of people use async. I'm only aware specifically of one customer that does sync durability. All right, interesting, yeah, thank you. Hi, I guess I have another question. You mentioned at the start of the talk that you started off with fast synchronous replication, but you also have asynchronous replication, and you said most of your customers use asynchronous durability by default. So suppose I'm running something like TPC-C with asynchronous durability, what kind of performance hit will I see with synchronous replication as opposed to asynchronous replication? We have definitely done that experiment. Sync versus async replication is almost nothing, like five to 10%. On TPC-C it might've been undetectable, but don't quote me on that. Our sync replication is basically as fast as our async replication. Sync durability was a bit slower, noticeably slower, but not 10x slower, maybe just a bit slower. I can't give you the numbers, but I could find the numbers and follow up. When you say it's not 10x slower, you mean throughput, not latency, right? I mean both. I think the throughput is basically identical and the latency is like 10% slower or something. But how is that? From personal experience, I feel like S3 is going to be orders of magnitude larger latency when it comes to commit, right? The commits never touch S3. So what do you mean by synchronous durability then, if it doesn't draw the line at S3? Synchronous replication means that you have acks from as many nodes as you need. In this experiment, it was just two nodes, so you have acks from one node.
That's sync replication. Sync durability means you're on disk on two nodes. And the replication to S3 is always asynchronous; we never do sync replication to S3, because there's no advantage there. So, yeah, sorry, go ahead. So the durability is managed by the cluster. You have live nodes that are managing the durability, and if you lose one node, who cares. If you lose a whole region, you're gonna lose data, so that's bad. Okay, I guess, but in this case, just to be clear on your storage hierarchy: you have in-memory, which is obviously volatile, and you have S3, which is presumably very durable. So what role does local disk play here? Because you seem to assume that if the region goes out, local disk data is gone. Is it basically just slower, cheaper main memory at that point? Why does there have to be a special level for it? Yeah, actually, we have literally discussed considering all disks to just be, you know, just a cache. Disks are a hell of a lot cheaper than memory, and sequential scans of disks are quite fast. So yeah, you've got these local SSDs, and it's essentially just a cache. Yeah, but somehow you seem to be differentiating local disk when it comes to durability, when you mentioned synchronous durability? Yes, unless you turn on synchronous durability. But in the cloud, I'm kind of assuming that if you lose a node, you're losing everything. I'm not counting on, for instance, Kubernetes to put my pod back next to the same disk. So it really is for the paranoid; it doesn't have much practical use. Well, we used to sell databases on-prem, and in fact, we still do. Okay. Got it. Yeah, the vast majority of our business is on-prem today. Got it. Cool. Thank you very much. This technology I'm hoping will be generally available like this summer, but none of this technology,