The Carnegie Mellon Vaccination Database Tech Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and PostgreSQL configuration at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org. Alright guys, let's get started. Welcome to another talk in this summer's Vaccination Database Tech Talks series. We're excited today to have Mikael Ronström. He is the head of product at Hopsworks, where he's been working on RonDB. Mikael got his PhD from the University of Linköping (I'm butchering that one) in Sweden. He worked on NDB, which is the backbone of MySQL Cluster, for over 20 years now, and since then they've spun NDB off as RonDB, part of this larger Hopsworks platform that they're building. So, as always, if you have any questions for Mikael, please don't save them for the end of the talk. Please unmute yourself, say who you are, and ask a question anytime; that way he feels like it's a conversation. And we thank Mikael for being here, because he's in Sweden and it's 10:30 at night. So we appreciate him staying up late to talk about his database. Mikael, the floor is yours, go for it. Okay, thank you very much, Andy. So, RonDB is, as mentioned, a fork of something called MySQL NDB Cluster, which originated within Ericsson in the 90s and has been developed for a long time since then, and is actually still developed inside of Oracle. So I'm going to go through today a little bit more about the requirements that started the project and continue to guide the project, so to say. I'm going to talk a little bit about LATS, what that is. The architecture of RonDB, use cases, how we do high availability, some of the basics about data distribution, and a lot of other things. And then at the end, I'm going to do some deep diving; particularly I'm planning to do a deep dive into how we do checkpoints in RonDB. So that's where we go into details.
So a little bit about NDB in the beginning. I mean, I've always been very interested in numbers, still am. And when you do benchmarks, you get numbers. So I've always had this keen interest in getting things to work faster and faster and faster. And so one of the things about RonDB is that we try to always make it very fast: high throughput, low latency, high availability, and now also scalable storage. So we have had a lot of interesting benchmark competitions in the past. And the luckiest one was one where the competitor was organizing the benchmark, and we still won the benchmark. That was a pretty nice one, when they actually changed the rules three times just to try to win it. But we still won three times. Can you name names, or do we need to bleep it out? Well, some of the others were internal databases inside of Ericsson. And then there was a database called Clustra, which was bought by Sun about 20 years ago, and nowadays it's gone from the earth. And ESA was one of them. Okie doke. So let's move into requirements. So the original requirements came from telecom systems. And the idea here was that you had tens of interactions. When you, for example, start a mobile telephone call, you get a number of interactions, 10, 20, 30 interactions with a database. You have to do all of those in 10 milliseconds, even at high load. That's the requirement, because people are waiting to connect their telephone calls, so they don't appreciate waiting too much. Obviously extreme availability. Networking applications are another category: for example, DNS servers, DHCP servers, AAA servers, you name it. They all need extreme availability, but they don't have the same requirements on throughput and latency; availability, always on. Gaming applications are more about complex interactions and high availability. And the latest thing that RonDB is focusing on now is machine learning applications.
And they can have tens to hundreds of interactions within a few tens of milliseconds. And again, there is a user waiting, so you have to complete those even at high load; it has to complete within a few tens of milliseconds. And you can have tens of thousands of such interactions. So there was a benchmark which was performed by Spotify, where they tried this out. And they had 40,000 interactions per second, and they had 200 records that they retrieved in each interaction. And another thing about machine learning applications is that here the size of the storage can grow all the way up to terabytes. So if we put that together at a technical level, we have the latency requirements, which are basically driven by users waiting for the database transactions: mobiles, finance, gaming. Availability: downtime equals no phone calls, no social media interaction, no internet, no financial transactions, no business. So obviously, downtime is completely forbidden. Throughput is driven by the many tens of thousands of interactions leading to many millions of key lookups per second. And scalable storage is driven by data and many, many millions of users. Okay, so that was the last property. So what do we do in RonDB with that? We have a key-value lookup latency which is around 100 to 200 microseconds, and that's at high load. So that's not just at low load; even at high load we have these latency numbers. And then we have to get down to about 30 seconds of downtime per year, at least in some systems. I'm going to talk a little bit more about what you actually need to do in order to get there. And then each CPU in the data nodes has to handle about a couple of hundred thousand lookups per second. That leads to each server being able to do many millions of lookups per second, and that leads to the cluster being able to do on the level of billions of lookups per second. So we're talking per second here all the time.
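The throughput arithmetic above can be sketched with a quick back-of-the-envelope calculation. The per-CPU rate is the rough figure quoted in the talk; the 32-core server size and the ideal linear scaling are illustrative assumptions, not RonDB specifications:

```python
# Rough throughput arithmetic from the talk: each data-node CPU handles a
# couple hundred thousand key lookups per second, so a many-core server
# reaches millions, and a large cluster approaches billions. The 32-core
# server and perfect linear scaling are illustrative assumptions only.

def cluster_lookups_per_sec(lookups_per_cpu: int, cpus_per_node: int, data_nodes: int) -> int:
    """Ideal linear scaling: per-CPU rate x CPUs per node x number of nodes."""
    return lookups_per_cpu * cpus_per_node * data_nodes

per_server = cluster_lookups_per_sec(200_000, 32, 1)    # one 32-core data node
per_cluster = cluster_lookups_per_sec(200_000, 32, 64)  # max 64 data nodes

print(per_server)   # 6400000 -> "many millions per server"
print(per_cluster)  # 409600000 -> approaching billions with bigger servers
```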
And when it comes to storage, RonDB scales to 16 terabytes of in-memory storage per data node. And you can at the same time have hundreds of terabytes of disk storage per data node as well. So obviously you can scale to fairly large sizes, because we can have up to 64 data nodes in RonDB. So then we go into the architecture. This slide is something that anyone who wants to build a database has to understand fully and completely. I could probably give a lecture on this for one hour, but I'm not going to do that. But essentially, you have to define: where do we want to put the data server APIs? Where do you want to put the query server APIs? Do you want to put them inside the database? Do you want to have a distributed architecture? So if we look at what we have in RonDB, you can see here that we have the data in the data nodes, and the actual application code or server code is in another layer. So we have, for example, a MySQL server that can access the data. You can access the data from a C++ application, from a Java application, from a Node.js application. You can also write an LDAP server on top of it. And in June, we're also going to provide a REST API towards RonDB. So as you can see, it's an architecture that is not just meant for an SQL database. It's meant for SQL, it's meant for LDAP. It basically has a generic data server; not so much a database, it's more of a data server. So it can serve many purposes, and it can actually serve those purposes at the same time. Okay, so use cases. Here's a good example of how you can combine all those things. In a telco system, you can have LDAP servers and at the same time have SQL statements being executed. And you can have special telco applications that take care of HTTP, CLI, RADIUS and so forth. So as you can see here, you have many ways to access your data.
And this is the kind of tradition from the telco arena that NDB took and has then developed further into RonDB today. This is a very simple system, a DNS server. So looking up, translating domain names to IP addresses; obviously you need to have very high availability. But this application is written directly towards the C++ API. So here we only have one API. Here's another application, a financial application, the stock exchange. So you have all the changes coming in in real time on the stock order feed, where the actual updates come from. And then you can get the real-time stock quotes from the Java application. So this is a pretty typical application of NDB, and could of course also be a typical application for RonDB. This is an example where we actually had some researchers at KTH who started with the Hadoop file system, a distributed file system which actually had a bottleneck in that there was only one metadata server. And then you could use JournalNodes and ZooKeeper to make it highly available. But they wanted to have a more scalable architecture, so then they brought in RonDB for that. So after RonDB was added, they had a completely scalable architecture with any number of NameNodes, any number of RonDB data nodes, and any number of file server data nodes where the actual large files were stored. You could even store some of the files in the RonDB data nodes. So this is a file system that we're actually using in Hopsworks. So everything we do in machine learning actually depends on this file system. So we have a back-end file system, and this is also working on top of RonDB. So when we are selling a feature store to customers, they can store their machine learning features for online usage directly in RonDB, but they can also store things as offline features. They can store those in this distributed file system, so we can have really massive amounts of data in Hopsworks.
So I mentioned that we wanted to support 30 seconds of downtime per year, less than that. So for many years we actually even had downtimes at a pretty high level, down to less than one second per year. But of course humans program the systems, so that means that eventually you always get this large crash. So after a while you always get to around 30 seconds of downtime per year, because humans cannot really make things better than that, I would say. So that means that in order to get there you actually have to have replication both within a cluster, but also between clusters, so that you can survive earthquakes and other things. And we have a pretty advanced system for handling this. And one of the main authors behind this part, Frazer Clement, I noticed is here today. But I'm not going to talk so much more about that today; I just wanted to mention it. You said it's quite advanced. Is there one thing that makes it special? And is it the engineering, or is there something more fundamental, like theoretical? It has conflict detection, which is pretty advanced actually. I don't really know of any other system that has such an advanced conflict detection system. So you can more or less run transactions and discover exactly which conflicts you interfere with, and based on that you can write an application that resolves those conflicts. So quite advanced I would say, but probably too advanced for normal users. But from a research point of view it's probably very interesting to study. But again it would take hours to explain all of that. And if you're really interested, I wrote a few chapters about it in a book, which is called MySQL Cluster 7.5 Inside and Out. There are some chapters on that. Okie doke. So in the cloud you have to sort of, I mean, you have virtual machines in a cloud.
In order to get a highly available database cluster in the cloud you actually have to make sure that you put the virtual machines in the correct availability zones. Actually I mentioned failure zones here as well; some clouds can also have those within an availability zone. Oracle Cloud can do that. I'm pretty sure that Azure can do that. Not 100% sure about all the others. So what is needed here, and I haven't really introduced node groups yet, but think about a node group as a shard, is that you have to make sure that all the nodes within a shard are placed in different availability zones. And then the MySQL servers you can actually scale any way you want. As long as you can access the MySQL server, it can access the rest of the cluster. So we also provide managed RonDB. So this is for when our customers want to use RonDB. The only thing they need to do is to create a cluster. They need to specify the number of replicas, the VM type of the data nodes, the number of node groups, so shards, the block storage size, or basically the file system size, and the number of MySQL servers and the virtual machine type for those. And they can also add API nodes if they'd like to run some special applications. So it's really simple to create the cluster. You just specify those things and then say start. And even more interesting is what happens when you discover that your cluster is actually too small. Let's say that you wanted to start really small. You started with one replica and four gigabytes of memory, and now you want to grow to two replicas and 16 gigabytes of memory instead. So the only thing you need to do is to change the number of replicas, specify the new VM type, and then basically just say submit. And then the software will make sure that this change happens, and it happens online.
So you can continue sending transactions to the cluster and it will still operate. And if you want to go down from two replicas to one replica, that can be done. You can go up to three replicas; you can change it. Though to go down is always more difficult than to go up, because you could actually end up in a situation where you don't have enough memory. But if you have enough memory you can go down as well. So this is the interface and how it looks; I'm not going to go into any details. But as you can see, you don't need to be a database expert to start a cluster. You only need to know something about what you want to do. And this is the interface when you want to change the cluster. You get the list of the changes to be submitted, and then our software will make sure that these changes happen in an online fashion. And it will actually not do all of them at once. It will do one thing at a time to make sure that the cluster stays online all the time. Okay, so let's go into some basics. Let me just keep track of time so that I don't talk too much. So I've already talked a little bit about node groups. At the very beginning when NDB was designed there was actually a choice of how to distribute data. I mean, there's one way of distributing data where you optimize for quick recovery, and you can also choose to optimize for surviving the maximum number of failures. So I decided to go for the variant where we survive as many failures as possible. So this means that if I have two node groups, for example, with two nodes in each, I can survive up to two node failures, actually even at the same time. If I have one node group with three replicas, I can survive two failures as well, but they have to happen one at a time. So I cannot survive two at the same time.
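The failure rule described above can be sketched in a few lines: the cluster stays up as long as every node group (shard) still has at least one live replica. This is a minimal illustration, not RonDB's actual failure-detection code, and the node names are made up:

```python
# Sketch of the survivability rule: with replication inside node groups, the
# cluster survives any set of failures that leaves at least one live node in
# every node group. Two node groups of two nodes can thus survive two
# simultaneous failures, as long as they hit different node groups.

def cluster_survives(node_groups: list[set[str]], failed: set[str]) -> bool:
    # Every node group must retain at least one node not in the failed set.
    return all(group - failed for group in node_groups)

groups = [{"n1", "n2"}, {"n3", "n4"}]  # two node groups, two replicas each
print(cluster_survives(groups, {"n1", "n3"}))  # True: one failure per group
print(cluster_survives(groups, {"n1", "n2"}))  # False: a whole group is gone
```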
So we can actually survive quite a few failures at the same time, given that we have focused on high availability rather than on the fastest possible recovery. So that means that when you recover, you are always using nodes in the same node group. That means that when we distribute the data, we split the data into partitions, horizontal partitioning. And then we decide that this partition goes into this node group, this partition goes into the second node group, the next one goes into node group two again, the next one goes into node group one again. But we also distribute the primary partitions in an efficient way, so that we have a fair load on the data nodes and on the CPUs. So this is perfectly classic data distribution using hashing. We have a distributed hash function, which we use to perform the partitioning. Very briefly about the commit protocols: RonDB also supports transactions, so actually everything that happens in RonDB is performed by transactions. If you have studied two-phase commit, you might be aware that there is one variant of two-phase commit which is called linear two-phase commit. And what you can see here is that we do linear commit for each row. That means that we first send the commit to the primary, then to the backup, and then sort of back to the transaction coordinator, and then we go back again, so that means that we actually save some communication. But if we have a transaction that involves multiple rows, then those happen in parallel. So we do sort of have an optimization, but we don't do linear commit completely; we do a mix of normal two-phase commit and linear commit. And one problem that you have probably heard about is that two-phase commit is a blocking protocol. We actually solve that problem in the following way: whenever a node fails, whenever that TC, transaction coordinator, fails...
We will take that transaction coordinator and rebuild it; we will basically rebuild the state of that transaction coordinator so that we can abort or commit all transactions that it was handling at the moment when it crashed. And we can do that because all of the nodes, the participants, are still operating, or not all of them, but enough of them are operating that we can still get the state. So that's why we actually do have two-phase commit, but it's non-blocking. Do you log all your two-phase commit messages to disk? We don't log the actual two-phase commit; we log later on when we do a group commit. So we essentially have two phases, where the first phase is what we call network durable. That means that it will survive node crashes, but it won't survive a cluster crash. And then every second we make sure that all those transactions that committed in the last second are also forced to disk. So that means that we are disk durable after one second, but we are network durable as you commit. Okay, and now we go a little bit into the row structure. So we don't really have any advanced data structures, I wouldn't say; I mean, it's fairly standard. We have a fixed-size part, which is obviously for fixed-size in-memory columns. That's obviously the very fastest that you can read and write. We also have a variable-sized in-memory part. It also has a dynamic part. The dynamic part is very important because it means that you can add columns without any downtime; you can do that as an online operation. And one of the things that I'm adding right now, which is almost ready, is that we will also have variable-sized disk columns. That means that you will be able to add disk columns dynamically as well, without any downtime for the system or for the tables. On the index side we have two index types. We have a distributed hash table.
If you're interested in learning more about that, you can search for my PhD thesis; it's called LH3 because it's hashing in three steps. It's very, very much intertwined with CPU caches. And it's been very, very successful over the years, so that's at least one of the reasons why RonDB has good performance. We also implemented an ordered index, which is an in-memory ordered index using a T-tree. You can read about that; there's an article from the 1980s on T-trees. It's basically not that different from other tree indexes like B-trees and so forth, but it's specialized for in-memory use. So if you look here. Have you done, I mean, I don't know if you've done a bake-off with the T-tree. Like, weren't there a bunch of papers that came out in the 2000s that basically said the indirection of having to look up the rows all the time in T-trees is detrimental on superscalar architectures? We did benchmarks when we implemented T-trees and compared it to a B-tree, and actually the result was that they had the same performance. How long ago was this benchmark? That was 20 years ago, or by now 15 years ago. Okay. So there were superscalar CPUs, but not very much more than that. I guess you guys are one of the few I know of still using T-trees; I know there's eXtremeDB, and TimesTen used to use T-trees but they got rid of it. Everybody uses B-trees by default, so this is rare. This is awesome. Yeah, that's probably true. I mean, we had a choice: do we want to go with the T-tree or do we want to go with the B-tree? They had the same performance, and so we just decided to go for the T-tree. And I wouldn't say that the T-tree has been the focus. It's really key lookup through the hash index that has really, really been the focus.
But nowadays I would say that we also do scanning very, very efficiently. So I'm pretty sure that the T-tree is not the optimal solution, but it's also implemented fairly efficiently, so it's definitely not a significant problem for us. So basically NDB has it, and therefore RonDB has it. Correct. Oh yeah, implementing a T-tree is definitely a few years of work. So you don't sneeze it out of your nose, like that. So, okay. So now I'm going into what we call the virtual machine architecture, but I actually removed the name virtual machine here, because you're probably thinking about virtual machines as virtual machines in clouds and operating systems, and virtual machine here is more the abstract term; it's not that. Basically, here we are just having a layer that takes care of sending messages. So we have three layers here. We have blocks, which are essentially modules of software. We have threads, which contain a number of those modules. And we have nodes, which are essentially processes. So whenever you send a message in RonDB, you send it to a node, a thread, and a block. And we have this architecture that makes sure that you can direct it: it can go internally within the thread, it can go to another thread in the same node, and it can do a send and receive over to another node. And the nice thing is that from a programmatic point of view, you don't really see which of those you're doing. It's only the address that tells you where it's going. So that's an interesting part of the RonDB implementation. Another thing that we've added lately, and this actually started in NDB but was completed in RonDB, is the memory architecture, where we have now completely taken care of all the data so that we manage the entire data memory inside something we call shared global memory.
So effectively when you start a data node, the idea is more or less that you take care of all the memory in the machine, except what the operating system needs and so forth. And then we automatically decide how much memory is going to be used by the different parts, but there is also flexibility. So if for some reason you need more transaction memory at the moment, then you can steal that from some other parts. And if you need a lot of query memory, which is used for complex queries, well, then you can use that. So we manage memory much more efficiently nowadays. Okay, and then we have the thread architecture. This is an interesting topic that again could be discussed for a long time, and this is actually one of the most important reasons why RonDB is fast. So I have actually written two or three blogs about something I call thread pipelining. As you can see here, when we handle a key lookup, it's not handled by one thread. It can be handled by up to four threads. So you come in to the receive thread, which takes care of the IP sockets. You send it over to a TC thread, which takes care of transaction handling. You go into a data manager or query thread to take care of the actual database operation. And then you go into a send thread to actually send the response back, or to another node to handle the update of the other replicas. So we had some experiments done to see how this architecture compares to a single-thread architecture where you do everything in one thread. And it turns out that this is much more efficient. Read the blog if you're interested; there's a lot of data about why it's faster and so forth. This is basically the staged event-driven architecture from Matt Welsh from the early 2000s. Yeah, I mean, it's a message passing architecture where we actually have message passing even between threads internally. I'll post it. I put the link in the chat here.
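The four-stage pipeline just described can be sketched with each stage as a function pushing work onto the next stage's queue. This is a toy illustration of the structure only; the real implementation uses per-thread message queues, batching, and actual threads, and the store contents here are made up:

```python
# Sketch of the thread pipeline: a key lookup passes through up to four
# threads -- receive (socket handling), TC (transaction coordination),
# LDM/query (the actual database read), and send. Each "thread" here is just
# a function draining one queue and feeding the next.

from collections import deque

tc_q, ldm_q, send_q, out = deque(), deque(), deque(), []

def recv_thread(msg):   tc_q.append(msg)                       # parse from socket
def tc_thread():        ldm_q.append(tc_q.popleft())           # coordinate the txn
def ldm_thread(store):  send_q.append(store[ldm_q.popleft()])  # do the actual read
def send_thread():      out.append(send_q.popleft())           # write the response

store = {"key-7": "value-7"}  # stand-in for a table partition
recv_thread("key-7"); tc_thread(); ldm_thread(store); send_thread()
print(out)  # ['value-7']
```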
We can cover this later. Yep. Okay, another thing we added lately. I mean, most databases that do key-value operations are often designed in such a way that they partition the data, and one partition is taken care of by one CPU. And we had that architecture in NDB and in RonDB for a long time. But lately, I've actually added the possibility that key reads can also be performed by what I call here query threads. So query threads can execute read operations for any partition. And obviously that means that there have to be some mutexes and things like that. So that means that we have to have a scheduler for those read operations. And that scheduler actually takes care to make sure that we only use query threads that share an L3 cache with the actual data-owning thread. Because writes are still performed by the data-owning thread, we have to make sure that query threads reading the data don't go and read data from the wrong CPUs. It's mostly on AMD CPUs that this matters, because they basically developed their CPUs in a different way compared to Intel and ARM. So we'll see what happens with Intel and ARM CPUs if they go the same route. Can we go back to the previous slide? Sorry, what? Go back to the previous slide. Yeah. So there are these operations and you have this LDM thread, and you're basically saying that when the LDM thread has nothing to do, you're pushing onto its work queue to go do something. Right? Is that what this is? Yeah. So actually it's kind of, I mean, again, this is something you could talk about for around an hour, because it's really philosophical. The TC and receive threads don't really know on a detailed level how loaded those threads are. So what we do is that every 50 milliseconds we distribute data about the load level. But that's a fairly long time in a CPU.
So we also take care of looking at queue levels. That means that we at least know what happened one millisecond ago. So based on that information, we choose whether we go to the LDM thread or whether we go to a query thread. So we have very, very detailed information about the load level and the queue levels of each thread that can handle the query. And then we put it into the queue, and once it's in the queue, it will be handled by that thread. Sort of like coroutines, right? Sort of like everything's saying: all right, I'm the TC or receive thread, this item is in my hands, but let me, if I can, push it down to the LDM thread; if I think it's idle, it can take care of it. That's the gist of it. Yeah, more or less: if the LDM thread is sufficiently idle, go there. And then we have a special query thread; if you look at this picture, you can see that there is a query thread that operates on the same core. And if that's available, then that will be used. If both of those CPUs are very heavily used and there is another one that's not so heavily used, then we can put it on a different CPU core, but still under the same L3 cache. I understand. Yep. So that was kind of interesting. It's the first time I've read a physical book about time and philosophy and applied that within a computer server. Kind of interesting; at least I thought so. So, some RonDB resources before I dive into the advanced stuff about checkpoints: rondb.com, obviously. And if I have time, I will also show you a little bit about this YCSB benchmark. It's kind of cool, because it's the first time, as far as I know, that somebody has published numbers on a benchmark while doing recovery at the same time as running the benchmark. So we provide numbers both on throughput and on latency while the recovery is ongoing and also when the crash is happening.
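The read-scheduling heuristic described a moment ago can be sketched as a priority order: writes always stay with the data-owning LDM thread; reads prefer an idle LDM owner, then the query thread sharing its core, then the least-loaded query thread under the same L3 cache. The threshold values and function shape here are illustrative assumptions, not RonDB's actual numbers:

```python
# Sketch of read scheduling between the data-owning LDM thread and query
# threads, based on queue lengths (the 50 ms load reports are omitted).
# The idle threshold of 2 queued items is an arbitrary illustrative value.

def pick_thread(is_write, ldm_qlen, same_core_qlen, same_l3_qlens, idle=2):
    if is_write:
        return "ldm"                    # writes always go to the data owner
    if ldm_qlen <= idle:
        return "ldm"                    # owner is idle enough: best locality
    if same_core_qlen <= idle:
        return "query-same-core"        # query thread sharing the owner's core
    best = min(range(len(same_l3_qlens)), key=lambda i: same_l3_qlens[i])
    return f"query-same-l3-{best}"      # least-loaded thread under the same L3

print(pick_thread(False, ldm_qlen=9, same_core_qlen=1, same_l3_qlens=[4, 7]))
# query-same-core
print(pick_thread(False, ldm_qlen=9, same_core_qlen=8, same_l3_qlens=[4, 7]))
# query-same-l3-0
```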
So go there and have a look and see if you get some interesting ideas. Okay. Now I'm going to spend the rest of the time on one topic. I guess it's about 10 minutes left or something like that? Yeah, like 15, 20 minutes. Okay. So partial checkpoints is something that I actually spent almost two and a half years on as a project. Recovery architecture is inherently complex, not so much because it's more complex than other code, but mostly because the crash happens not when you introduced the bug; it happens much later. So the only way to find the bug is to do extremely extensive logging, and that of course means that you're changing the timing and so forth. So it's sometimes pretty hard to find the problem. But the reason for changing the checkpointing scheme is pretty easy. We used to have full checkpoints, and I looked at other databases that are in-memory databases, and as far as I've seen, most of them still use full checkpoints. And if you look at what that means: if you have an in-memory data size of, for example, 16 terabytes, it's pretty obvious that it's going to be hard to do a full checkpoint, because that takes ages to write to disk. So we wanted an architecture that supports in-memory sizes up to 16 terabytes; we want to be able to survive with a redo log of 64 gigabytes, whatever the workload; we want the checkpoint size on disk to be no more than 60% larger than the actual data size, and with compression it should be even smaller than the data size. And this is obviously driven by memory sizes growing and the introduction of SSDs. And the 64 gig log size, that's just a number you came up with, right? That's not something specific to RonDB? I came up with that simply because that's what was required when I was running at very, very high load. Okay, with 32 gigabytes I wasn't really able to get it completely stable in all workloads.
Most workloads, even four gigabytes would be okay. But I always test with very, very tough workloads, and with 64 gigabytes, more or less everything works. It's not the right time for this conversation, but is YCSB a really tough workload if you configure it a certain way, or is that not even in the ballpark of what you guys consider a tough workload? So, the workload that I used for testing was loading; I can show that. I think it's here. So the test case that we used was actually loading data into TPC-C with DBT2, where we had 53,000 warehouses. So basically five terabytes of in-memory data that we load at about one million records per second or something like that. Yeah, so that was the test case. We were fortunate enough to have one of Intel's persistent memory servers; otherwise it's kind of hard to get hold of a machine with five terabytes of memory. So what we concluded is that the full checkpoint isn't really practical beyond 200 gigabytes of memory. And when we were thinking about how to solve this, obviously one solution would be to simply use page-based checkpoints the same way a disk-based database does. And I was thinking, well, why not? But then I made some calculations. What would happen if I have a database of 16 terabytes, or even only one terabyte here, and I do a page-based checkpoint with one million rows committed per second? And it turns out that the page-based checkpoint has to write 320 gigabytes, whereas the row-based checkpoint only has to write four gigabytes. So that's 80 times more efficient with row-based checkpoints. Obviously row-based checkpoints are a lot more complex. But if it's 80 times better, well, you go down the drain and do the complex stuff. So a row-based partial checkpoint means that you write all the changed rows since the last checkpoint. So basically, in a sense, the checkpoint is sort of an extra redo log.
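The page-based versus row-based comparison above can be worked through explicitly. The talk quotes 320 gigabytes versus 4 gigabytes; the parameter values below (32 KB pages, 400-byte rows, one million random row updates per second, roughly ten seconds of changes per checkpoint) are assumptions chosen to land near those figures, not numbers taken from the slides:

```python
# Worked version of the checkpoint write-volume comparison. With random
# updates, a page-based checkpoint must rewrite every touched page in full,
# while a row-based checkpoint rewrites only the changed rows. All parameter
# values are illustrative assumptions.

PAGE_SIZE = 32 * 1024       # bytes per page (assumed)
ROW_SIZE = 400              # bytes per row (assumed)
UPDATES_PER_SEC = 1_000_000 # the one million commits/sec from the talk
WINDOW_SEC = 10             # changes accumulated per checkpoint (assumed)

changed_rows = UPDATES_PER_SEC * WINDOW_SEC

# Worst case for page-based: each random update touches a distinct page.
page_based_bytes = changed_rows * PAGE_SIZE
row_based_bytes = changed_rows * ROW_SIZE

gb = 1024 ** 3
print(round(page_based_bytes / gb))         # 305 -> on the order of 320 GB
print(round(row_based_bytes / gb, 1))       # 3.7 -> on the order of 4 GB
print(page_based_bytes // row_based_bytes)  # 81 -> roughly the quoted 80x
```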
And you could probably, if you really spent time on it, combine it with the redo log, but we didn't do that. But you also have to write a subset of all rows in each checkpoint. And what we did: RonDB uses a row ID, which consists of a page ID and a page index, and we basically divided the page IDs into 2048 parts, and in each checkpoint a subset of those parts is fully checkpointed. Each table partition is handled independently of the others. Recovery applies checkpoints based on a control file per table partition, and you always start with the oldest of the data files and go through to the last one. So that means that some rows will actually be restored more than once. I'm not going to go into details, but there are some interesting mathematical intricacies in that. That was kind of interesting. I already mentioned that we need to avoid running out of redo log. So again, in RonDB we have quite a few adaptive algorithms, and the checkpoint speed is one of those. We keep track of CPU usage, we keep track of how much redo log we have used, how much writing we have actually done to the database, and all of those things are put into a computation, and then we come up with a checkpoint speed. So that means that if you write faster, we will checkpoint faster. If you have problems with the I/O, we will slow down for a while. If you use the CPUs quite heavily, we will also temporarily slow down the checkpointing. So we always try to provide something which is good. But obviously, if you're almost running out of redo log, we will spin up the speed quite heavily. And the test case, as mentioned, was DBT2 with 53,000 warehouses, where each warehouse is about 100 megabytes of data. There are some interesting problems. Checkpointing is complicated, since pages can actually be dropped during scans and between checkpoints.
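The 2048-part scheme above can be sketched in a few lines. This is my simplified reading of the description, not RonDB's actual code: parts are derived from the page ID, each checkpoint fully writes a rotating window of parts, and every row changed since the previous checkpoint is written regardless of its part:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of partial-checkpoint row selection (assumed scheme,
// simplified): row IDs map to one of 2048 parts by page ID; each
// checkpoint fully writes a rotating window of parts and additionally
// writes all rows changed since the previous checkpoint.
constexpr uint32_t kParts = 2048;

uint32_t part_of_page(uint32_t page_id) { return page_id % kParts; }

// Is `part` inside the window of parts fully written by checkpoint `cp`?
bool part_fully_written(uint32_t part, uint64_t cp, uint32_t parts_per_cp) {
    uint32_t start = static_cast<uint32_t>((cp * parts_per_cp) % kParts);
    uint32_t offset = (part + kParts - start) % kParts;  // distance from window start
    return offset < parts_per_cp;
}

bool row_written_in_checkpoint(uint32_t page_id, bool changed_since_last_cp,
                               uint64_t cp, uint32_t parts_per_cp) {
    return changed_since_last_cp ||
           part_fully_written(part_of_page(page_id), cp, parts_per_cp);
}
```

With, say, 8 parts fully written per checkpoint, every part is guaranteed to be fully written at least once every 256 checkpoints, which bounds how far back recovery ever has to reach.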
So in the middle of this long development, I actually had some doubts whether this algorithm would actually work. So I had to write a proof, not so much for everybody else; even for myself, I had to prove to myself that this algorithm actually works. So if you look into the code, you can actually find a 10-to-30-page-long proof of the algorithm. So if anybody's interested in seeing whether it actually works, feel free. Basically this is just to give you an idea of the complexity you can find. Let's see. So when we're scanning a page, we have on each row a timestamp. And the timestamp is implemented by the thing I talked about previously, the global checkpoint, which is a sort of group commit executed once per second. So that's the granularity: if something has happened in the last second, we will write that row. If not, we will ignore it. And there's also some work to do when you're writing a row. You have to decide whether the checkpoint has actually already passed it, or if it still has to do the checkpointing, because the checkpoints are fuzzy, but they are actually transaction-consistent at the same time. I probably don't explain that enough. Yeah, what do you mean by that? They're fuzzy, but they're also consistent? Well, fuzzy means that you don't have to lock anything when you're doing the checkpoint. Yes. Transaction-consistent, and actually even action-consistent, means that the checkpoint restores a very, very specific point in time. So you have to restore exactly what happened when you started the checkpoint. And the reason for that is that we have pointers that point from in-memory data to disk data. So if we didn't have a complete synchronization of that checkpoint, those pointers would point to the wrong place.
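The per-row timestamp check described above reduces to a single comparison. A minimal sketch, with the data layout as my assumption rather than RonDB's actual row format:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the per-row timestamp filter: each row carries the global
// checkpoint ID (GCI) of its last modification, assigned at commit by
// the once-per-second group commit. A scan writes a row as "changed"
// only if it was modified after the GCI captured by the previous
// checkpoint; rows in the fully-written parts are handled separately.
struct Row {
    uint64_t last_modified_gci;
};

bool include_changed_row(const Row& row, uint64_t prev_checkpoint_gci) {
    return row.last_modified_gci > prev_checkpoint_gci;
}
```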
You, I think, are you switching to like a multi-version mode when you start the checkpoint, or do you have a consistent snapshot? Yeah, basically we're doing a consistent snapshot, and that means that every time you make a write, you actually have to check whether a checkpoint is ongoing. So it's sort of similar: when you have a page-based scheme, you basically do a consistent snapshot of pages, but here it's for rows instead. When you check to see if a checkpoint is occurring, what does the writing thread do? Just not overwrite and make a copy, or does it wait? What does it do? And that's what it says here: it basically writes the row ID into a checkpoint keep list. So that means that when you commit the transaction, you actually save the old version of the row, until you have time to actually write it to disk. Okay, InnoDB does something similar, I think. That makes sense. Yeah, yeah. It's definitely not unique to RonDB, but it's the kind of problem that, if you make the solution in a certain way, you have to solve. And I'll probably skip this. So this is also an important point: if we have one terabyte of 300-byte rows, that's about three billion rows. So if we needed to check each of those rows, that would take several seconds of CPU time just to do that. So we had to do a little bit of optimization for that. Each page has a two-level change map. That means that we're using 136 bits in the page header to optimize the checkpoints. So if you only have one write in a page, it's sufficient to check those 136 bits and check one part out of 128 of the rows in that page. This means that we can more or less skip 99% of the rows, sometimes even more than that. So this is important to make sure that the checkpointing itself doesn't consume all the CPU power.
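The 136-bit two-level change map can be sketched as follows. The exact split is my reading of the numbers in the talk (128 fine bits plus 8 coarse bits), so treat the layout as an assumption:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of a two-level per-page change map: a 128-bit fine bitmap
// where each bit covers 1/128 of the rows on the page, plus an 8-bit
// coarse bitmap where each bit summarizes 16 fine bits. A checkpoint
// scan tests the coarse bits first and skips whole untouched regions.
struct ChangeMap {
    uint64_t fine[2] = {0, 0};  // 128 fine bits
    uint8_t coarse = 0;         // 8 coarse bits, one per 16 fine bits
};

void mark_row_changed(ChangeMap& m, uint32_t row_index, uint32_t rows_per_page) {
    uint32_t fine_bit = row_index * 128u / rows_per_page;  // which 1/128th of the page
    m.fine[fine_bit / 64] |= 1ULL << (fine_bit % 64);
    m.coarse |= static_cast<uint8_t>(1u << (fine_bit / 16));
}

bool region_untouched(const ChangeMap& m, uint32_t coarse_bit) {
    return (m.coarse & (1u << coarse_bit)) == 0;
}
```

If only one row on a page was written, seven of the eight coarse bits are clear, so 7/8 of the page is skipped immediately, and within the remaining region only 1/128 of the rows needs to be visited, matching the "skip 99% of the rows" figure.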
And it could be necessary to go even further and have change maps at the megabyte level and so forth, but this seemed to be sufficient for us. Okay, let's see. I think this is a good time to see if there are questions, if anybody wants to dive into something, or has something specific that they want to know a little bit more about. Sorry, is this the end of the presentation, of the slides? No, I mean, I actually have... I can talk a little bit more about at least one more thing that's quick, and this one I think is kind of cool as well. We have something called adaptive CPU spinning. So again, this is adaptive. CPU spinning means more or less that you have a thread, and you don't have anything to do. So what do you do? Do you go to sleep or do you stay awake? In the past we had fixed, static CPU spinning. So basically we said, let's spin for 100 microseconds independent of what happens. But what we did now is adaptive CPU spinning. We measure how much activity we have, so we have a sort of guesstimate of how long it will take until we will be woken up again. And we also calculate how much time it takes to actually wake up. And then we have three different levels of how active you want to be. Do you want to be active only to the point where you are more or less cost-efficient? Do you want to focus on low latency, but at a reasonable cost? Or do you want to be a dedicated database machine and more or less stay awake as much as possible to give the best latency whatsoever, even if it costs a bit? Another nice thing about spinning is that on a hyperthreaded CPU, you actually use special CPU instructions, and when those are executed, the other CPU in the same core becomes much more efficient. So CPU spinning on a hyperthreaded CPU almost comes for free. That's another nice thing about it.
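The spin-versus-sleep decision above can be sketched as a simple policy function. The three levels and the multipliers are illustrative assumptions, not RonDB's actual tuning:

```cpp
#include <cassert>

// Sketch of an adaptive spin decision: spin only when the estimated
// time until the next wakeup is small compared to the measured cost of
// sleeping and waking up, scaled by how latency-hungry the configured
// policy is. (On x86 the spin loop itself would execute PAUSE, which
// is what makes the sibling hyperthread run more efficiently.)
enum class SpinPolicy { CostEfficient, LowLatency, DatabaseMachine };

bool should_spin(double estimated_wait_us, double wakeup_cost_us, SpinPolicy p) {
    // Willingness to burn CPU, as a multiple of the wakeup cost (assumed values).
    double factor = (p == SpinPolicy::CostEfficient) ? 1.0
                  : (p == SpinPolicy::LowLatency)    ? 4.0
                                                     : 16.0;
    return estimated_wait_us <= factor * wakeup_cost_us;
}
```

The cost-efficient level spins only when spinning is no more expensive than a sleep/wake cycle; the database-machine level keeps spinning even for much longer expected waits, trading CPU for latency.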
I also have lots of slides on how we do networking. I think that would take us pretty far into the night if I started on those. But maybe one thing I could say is this. When we are communicating inside a node, we communicate through memory buffers. And we implemented that in such a way that each input buffer has a single-reader, single-writer concept. So that actually means that we don't need any mutexes to send a message from one thread to another thread. It does, however, require the use of memory barriers. Memory barriers are required to make sure that the other side actually sees the operations in the same order as we wrote them. And there are two types of memory barriers: the write memory barrier, which ensures that all writes up to a certain point become visible to the receiving CPU, and the read memory barrier, which ensures that the receiving CPU reads in order, so it sees all the writes written by the sending CPU. And in order to have a proper thing here, you have to follow this protocol that you see here: you use a write memory barrier on the sending side and a read memory barrier on the receiving side, and you have to make sure that you write the message before the memory barrier and update the head pointer after the memory barrier. So you have to be a bit careful when you're writing, but in this way you can actually communicate without locks. And you can even implement this using distributed shared memory. We used that in a previous version of NDB, but nowadays that happens in the device driver instead, so we don't support it internally in NDB anymore. I guess take your time and wrap up. Yeah, I think that's... I mean, I could talk for five, ten minutes more, but it's up to you if you want to. I have a two-year-old downstairs. I can't, sorry.
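The protocol described above (write the message, then the write barrier, then publish the head pointer; read barrier on the receiving side) is the classic single-producer/single-consumer ring buffer. A generic, minimal sketch in modern C++, not NDB's actual buffer code, using release/acquire atomics as the two barrier types:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Minimal lock-free single-producer/single-consumer ring buffer.
// The producer writes the message first, then publishes the new head
// with a release store (the write memory barrier); the consumer loads
// the head with an acquire load (the read memory barrier) before
// touching the message, so it observes the writes in order.
template <typename T, size_t N>
class SpscRing {
public:
    bool push(const T& v) {                      // producer thread only
        size_t h = head_.load(std::memory_order_relaxed);
        size_t next = (h + 1) % N;
        if (next == tail_.load(std::memory_order_acquire)) return false;  // full
        buf_[h] = v;                                   // write the message first...
        head_.store(next, std::memory_order_release);  // ...then publish it
        return true;
    }
    bool pop(T& out) {                           // consumer thread only
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire)) return false;  // empty
        out = buf_[t];
        tail_.store((t + 1) % N, std::memory_order_release);
        return true;
    }
private:
    T buf_[N];
    std::atomic<size_t> head_{0}, tail_{0};
};
```

Because only one thread ever writes `head_` and only one ever writes `tail_`, the release/acquire pairs are all the synchronization needed; there is no mutex anywhere on the message path.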
I mean, I'm already into the section that I didn't really think I would get into, so... Okay. All right, do you have, like, a last slide, a "we're hiring," something like that, or does it just end? No, I don't really have... let's see if I have something at the end. There are CEOs on the call. You've got to, you know, you've got to push it. Well, we're definitely hiring. Got it. Okay. Yep. I'll clap my hands for everyone else. So we'll open it up to the audience, with time for one or two questions if anyone wants to ask. Yep. So my question would be: RonDB, or sorry, well, NDB and now RonDB, this was your PhD dissertation, right? From the nineties. You've been working on this for 26 years, which is rare, right? It's older than SQLite, and, like, there's one guy doing SQLite. Monty has MySQL, but there have been a lot of people touching that code. This has kind of been your baby. So what's the vision? Like, what do the next 20 years look like for RonDB, for you? What's the ideal scenario? So I think the feature store is obviously the focus for the first couple of years. And what that means is that we want to be able to also handle much more data on disk. We want to be able to scale to more nodes, to scale down the number of nodes, to do even more operations online. Another thing I'm hoping, and this is something that is mostly worked on inside of Oracle, is that the Oracle team will also deliver some work on complex queries. Actually, when I was at Oracle, we did quite extensive work on the optimizer to make sure that NDB can work efficiently by pushing down joins. At the moment we can do 12-way joins inside the data nodes, but we still haven't got support for aggregation. We don't have support for max and min and things like that.
So I'm hoping that will come, and then obviously the next step would be to do a little bit more of a hybrid architecture. We still want to be a sort of online database, but it would be nice to also be able to do some more analytical queries, more efficiently. That might mean that we have to implement some columnar-level optimizations. Compression is obviously something else. I have satisfied myself with thinking 10, 15 more years ahead. I'm not thinking that I'm going to work more than 15 years at most. So I don't think 25 years ahead; somebody else has to implement the last 10 years at least. Yeah, actually, that's one aspect of RonDB you reminded me of that I didn't quite realize: there is actually execution engine logic. It's not just get, set, and delete; there is an execution engine down there that can do some joins. It's quite an advanced thing, actually. It's called linked joins, I think. So some queries can be quite efficient, actually. But unfortunately, not all queries.