The Carnegie Mellon Vaccination Database Tech Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and PostgreSQL configuration at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org. Hi guys, welcome to a new chapter of the Vaccination Database Tech Talks seminar series. We're super excited to kick off this new semester with Oren Eini, who's the co-founder and CEO of RavenDB. It's a database system he's been working on for the last 15 years, so he's finally come here to talk about what he's been doing. Oren, as I said, is the co-founder of Hibernating Rhinos, which is the main company backing RavenDB's development. He's also an MVP at Microsoft and has written several books about databases, in particular about RavenDB. We also want to thank him for staying up late: he's in Israel right now and it's 11:30 p.m., so we appreciate him staying up late. I will say the record we've had for the seminar series is someone who was, I think, in India, and it was about 4:30 a.m. their time. So 11:30 is decent. It's still late, and we appreciate you staying up with us. As always, if anybody has any questions for Oren as he's giving the talk, please unmute yourself, say who you are, and ask your question, and feel free to do this at any time. We want Oren to feel like he's actually talking in a room with us and not just talking at a blank screen on Zoom. So Oren, thank you so much for staying late. The floor is yours, go for it. Thank you very much. So I'm here to talk about how you build a storage engine, and specifically how you can implement features such as ACID and MVCC, multi-version concurrency control, without going crazy. This is most relevant because I had to build a storage engine from scratch, because we needed it for RavenDB, which is the database that we created. And what I basically did is go over, I believe, every single embedded storage engine that is out there. I started from SQLite, LMDB, LevelDB, and all of the other ones that came after that. There's a whole big list, and I actually wrote about that on my blog, so you can read a whole bunch of reviews there. But it turns out that if you're looking into storage engines, the actual mechanics of what you're trying to do matter a great deal, especially if you also want a high-performance, reliable system. So just to give you some background, this is the database that we created, and Voron is the storage engine that we use. It was built completely in-house. Just to give you some details: most storage engines are written in low-level environments, typically C/C++. RavenDB is written in C#. And that's quite interesting, because C# is not typically thought of as a low-level language. But it turns out that C# has a lot of the facilities that we need, including the ability to work with unmanaged code and the ability to do our own manual memory management, plus structs, value types, and a bunch of other things like that. And that's actually really interesting, because there are certain things that you get from running on a managed runtime. For example, concurrent data structures are significantly easier if you have a GC to back you up. There are a lot of edge cases you don't really have to think about; I can write a finalizer and have it clean up for me, things of that nature.
Just to give you some context, I started thinking about what eventually became RavenDB in 2007, actually started developing it in 2008, and it went into production in 2010. It's been quite a journey and we've learned quite a bit. So here is a pet peeve. Right now there are about five to six hundred different database products, and there are a lot more database projects out there. I ran into one of these projects recently and I just saw this: it's meant to perform atomic operations with "reasonable durability". And that basically pressed every single hot button that I have in my head, because this is not how I think a database should behave. There is no "reasonable" durability. You are either durable and ACID, or you're not. So I want to spend some time talking about why this matters. Basically, it matters because if you don't have an ACID system, you cannot make any assumptions about the state of the data in your system, which means that the level of complexity that the application developer has to face is enormous. Combine that with distributed systems, where you have to deal with multiple nodes, latency between nodes, things like that, and it goes absolutely bonkers. So ACID, from my perspective, is the only way to go. The only problem is that it is tremendously hard to get there, because you effectively have to fight the entire system. A lot of that you can see here: this is an HDD, a hard disk drive, and this is an SSD, the typical disks that you run on. This is where you eventually store your data. And the problem is that even an SSD, even an NVMe drive, is so slow compared to main memory and compared to CPUs that you have to orchestrate your entire system around accepting that slowness. In fact, if you are interested in high-performance systems, you typically have to think very carefully about the memory hierarchy of the system. You have the L1, L2, L3 caches, main memory, and then you have disk; sometimes you have multiple levels of disk, with different speeds for each one of them. And when you try to think about what it means to create an ACID system, you realize that the disk creates a really big problem for us to deal with. If you're talking about a really good hard disk, we're talking about 10,000 or 15,000 RPM. That gives us at most 200 to 300 writes per second. Now, I'm talking about individual writes, not the size of what we're writing. And the reason this matters is that effectively you have to wait for the platter to spin. So now I want to make a write, and I want to make sure that my data is durable, but I have a physically limited number of times per second that I can do that, and that can create horrendous problems. At that point, what you typically hear is: oh, I don't need that, maybe I have a battery-backed disk, maybe I have something else. But eventually you realize that, because this is so slow, many people put buffering and caches in front of the disk, and it's actually hard to know when you have actually stored the data durably on disk. It gets even worse when you realize that you're not the only player in town. In a typical system or server, you're usually not the only process that is running, and you're certainly not the only process that is writing to the disk. Even if you are the sole process running, there is also all of the background work being done by the operating system. For example, I write something to my file. OK, I issue the write system call. Is the data on disk? There's no way to tell. I can call fsync. That would ensure that the data is on the disk. We're good, right? No, because if I increased the size of the file, that changed metadata about the file, and that lives in the parent directory. So in some cases I also have to open the parent directory and fsync it. And that is the simplest scenario, where I just want to write a single value to disk: I have to fsync multiple times to make it happen, and that is assuming a perfect system.
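To make that "simplest scenario" concrete, here is a minimal sketch in C of what durably writing one value can involve (POSIX/Linux assumed; the paths are made up for illustration, and real code needs more careful error handling):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void die(const char *msg) { perror(msg); exit(1); }

int main(void)
{
    const char *dir_path  = "/var/data";            /* hypothetical paths, for illustration only */
    const char *file_path = "/var/data/values.bin";

    int fd = open(file_path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) die("open file");

    const char value[] = "hello";
    if (write(fd, value, sizeof(value)) != (ssize_t)sizeof(value)) die("write");

    /* Flush the file's data and its inode metadata (size, etc.) to the disk.
     * If this fails, the on-disk state is unknown; see the fsyncgate story later. */
    if (fsync(fd) < 0) die("fsync file");
    close(fd);

    /* If the file was just created, the entry that names it lives in the parent
     * directory, so that directory has to be fsync'ed as well for the file to
     * survive a crash. One logical write, multiple fsyncs. */
    int dfd = open(dir_path, O_RDONLY | O_DIRECTORY);
    if (dfd < 0) die("open dir");
    if (fsync(dfd) < 0) die("fsync dir");
    close(dfd);
    return 0;
}
```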
So I have to think about: how do I get to an ACID system? How do I make sure that when a transaction is committed, it is always there? Notice that I haven't even touched on atomicity or isolation; consistency is more at the level of the application, so I'm not touching that. The first thing we have to realize is what communication protocol we have. Which is really strange: I'm talking about building a storage engine, and a storage engine is used to build a database. Why am I considering the manner in which the database talks as a key aspect of how I'm going to make things durable? The answer is that I have two different models. This is the chatty model: begin transaction, okay, the transaction was created; update, okay; update, okay; commit, okay, we're done. Each one of those is a separate statement, this is how SQL works, for example, and each one of them is a separate network operation. On the other hand, I may have this model, which is effectively a batch: here is the set of commands that I'm running, I send that as a single unit, it gets processed by the database, and then it sends a confirmation. Why does it matter so much that this is the first thing I have to consider? It matters because it's an issue of when and how I need to keep a transaction alive. If I'm running something like this, that means I don't really have to think about a lot of issues, because the transaction's lifetime is limited: I got a bunch of operations that I need to run as a transaction, and I'm going to execute them. I may be getting additional transactions, but I don't have network latency inside my transaction. In the other model, here, between this statement and this statement there may be multiple seconds. There may be an actual human sitting there and typing things; it may take a minute to complete the transaction. That leads to a very different model of how we work with the database. If you're familiar with some of the basic terminology in databases, then we have the notion of transaction isolation. You have all of those levels with stricter and stricter guarantees about what you get, and higher and higher costs. And this is almost entirely related to the communication model, because you have to deal with transactions that span seconds or minutes. They're going to be interleaved, which means that you have to deal with locks, and with transactions that are very long-lived compared to the other option. There is this really nice paper, "OLTP Through the Looking Glass", where they analyze the internal costs inside OLTP database engines, and about 40% of the cost was around locking, all of which is there to support this scenario. So one of the first decisions that you have to make is which option you want. Now, interestingly enough, the choice of model also affects how fast we can make things. Why is that? Because with the chatty model I have all of this time from accepting a statement until it has to be written to disk, which typically means that I need this notion of a redo log or an undo log.
So the issue here is: typically, if databases want to be durable, they say, I'm going to write what I'm about to do to a log. And the question is, what am I going to write to the log? What do I need in order to redo this operation, or what do I need in order to roll back this operation? And there is the ARIES model for how a write-ahead log should look for a relational database, and it is complex. On the other hand, for something like this, I can execute the entire transaction in one shot; I don't have to wait for the network. And that actually gives me a lot of really interesting optimization opportunities. So let's look at some of the options that I have. First thing, I want to make my life easier, so I'm going to use the batch model: I'm getting all of the operations in a transaction in one shot, and I don't have to wait for the network. I'm going to have only a single writer at a time, and I'm going to allow multiple readers. And ACID is an absolute requirement that I have. If you think about it, that sounds like a really limiting environment. How can I have concurrency in this case? How can I have good performance if I have a single writer? And we know that durability is expensive, so how can I make that work? Well, it turns out that a single writer has huge implications for the complexity of your code. You don't have any granular locking; you don't have to lock per page or per row. The code is much simpler. The number of concurrency concerns that you have to juggle in your head in order to make something work is vastly reduced. But how do you handle this? How do you actually implement it? Okay, I'm writing something, but I also have readers that want to read that data. How do I make that happen, and how do I implement durability? So the first thing I have to deal with is how I'm modeling the data. Let's imagine that we have a data file. The data file is some binary data, and I'm going to divide it into pages. So far, this is extremely standard database behavior. Each page, in my case, is 8 kilobytes in size, and it looks like this. Now, a key aspect here is that whenever a write transaction, or to be more exact, the write transaction, because I have only one of them, needs to modify some data, I'm not modifying the data on disk. Instead, I'm creating a copy of the data and modifying that copy. What's the impact of that? I can go and modify pages, and at the same time a read transaction can go and read. Because I modify a copy, that's safe; I don't actually have to do any coordination between readers and writers, and there's no blocking between them. That actually turned out to be a really huge decision. Now, it also means that, OK, I'm done, I can commit the transaction. But what does it mean to commit a transaction? Let's talk about this. I made some modifications; now I want to commit the transaction and start a new one. This is roughly what it looks like. I have the notion of a page translation table. What does that mean? You can see here, I'm at transaction 32, and I want to get page number one. Here are the page translation tables for 32, 31, and 30. It's not in here, not in here, not in here, so I get the page directly from the data file. Now I'm getting page number four. I look at the page translation table for 32; it's not there. But I go to the previous version, transaction 31, and here is the copy of that page that transaction 31 modified.
So I'm getting the latest version of this page. If I were to modify it, I would create another copy and install that in the current transaction; you can see this is what happened for page number five, which I modified in multiple transactions. And now I can make it all work. Consider that I currently have a running read transaction at transaction 31. It is using this version, so it's going to see this version of page five, this version of page two, et cetera. This really simple concept of having a hash table of the modified pages and linking it back to the previous transactions is all I need in order to create MVCC, multi-version concurrency control. I also get snapshot isolation effectively for free. Now, I mentioned that I'm doing this in order to get durability and concurrency. So I've touched a little bit on concurrency, but how does it help with durability? Well, let's see how I roll back a transaction. And this is basically it: free all of the pages in the transaction, free the page translation table, and you're done. There is no persistent modification that we made, and MVCC is effectively free in this scenario. Multiple versions of pages exist at the same time, readers have a snapshot, and the writer doesn't block them. It's also really important to understand that at this point I'm dealing with raw pages. There is no structure to how I'm working with this. If you look at something like Postgres, Postgres uses the chatty model with SQL, and the way it implements things is that each page is a list of tuples with from-transaction and to-transaction markers, and the entire database engine has to be aware of those. It has to know about them when it's scanning through things, and you have to deal with vacuuming and ordering and a bunch of other work around that. In this case, the model I'm talking about is below the level where you put meaning on the bytes. We just divided the data into pages, and everything else follows from that. You're actually doing an append approach for multi-versioning, right? You're creating new pages. I'm not, though initially I did. If you're familiar with how LMDB works, this is how it works: it always does copy-on-write. But the problem in this case is that when you do that, you have to copy upward. Let's say that I have a typical B-tree: you have the root, you have page one, page three, page four. Now I want to make a modification to page four. I have page eight, which is a copy of page four, but in order to point to it I also have to copy page three, page one, and the root. I actually have to make four copies here in order to deal with this one modification, because effectively I'm working with immutable data. Now, LMDB deals with that in a really nice manner, because it keeps track of a free list and reuses pages, but as the data size grows you still have that write amplification, and that's not something I wanted. So instead, this is the data on disk, and I have this page translation table that points to the modified copies in memory, and that's it. Obviously, that doesn't really work that well over time, because you get bigger and bigger lists, and the more time goes by, the further you have to scan back through all of the pending transactions. That's horrible.
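To make the page translation table lookup concrete, here is a minimal sketch in C (not Voron's actual code; the names and the simple linked-list shape are just for illustration, and the data-file read is elided). A read transaction resolves a page by walking the chain of committed tables from its snapshot backwards and falls through to the data file if nothing in that history modified the page; this is also exactly the chain that grows without bound if old tables are never applied to the data file, which the rest of the talk addresses.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry in a page translation table: "page N was modified, here is the copy". */
typedef struct ptt_entry {
    uint64_t          page_number;
    void             *modified_copy;   /* in-memory copy written by the transaction          */
    struct ptt_entry *next;            /* next entry in this table (a list here, not a hash) */
} ptt_entry;

/* The page translation table published by one committed write transaction. */
typedef struct ptt {
    uint64_t    tx_id;                 /* transaction that produced these modifications */
    ptt_entry  *entries;
    struct ptt *prev;                  /* table of the previous committed transaction   */
} ptt;

/* Reading a page from the data file is elided here. */
extern void *read_page_from_data_file(uint64_t page_number);

/* A read transaction resolves a page against its snapshot: scan the chain of
 * page translation tables, newest first, starting from the table that was
 * current when the transaction began. */
void *tx_get_page(const ptt *snapshot, uint64_t page_number)
{
    for (const ptt *t = snapshot; t != NULL; t = t->prev) {
        for (const ptt_entry *e = t->entries; e != NULL; e = e->next) {
            if (e->page_number == page_number)
                return e->modified_copy;   /* newest committed version visible to this snapshot */
        }
    }
    /* No transaction in the snapshot's history modified it: use the data file. */
    return read_page_from_data_file(page_number);
}
```

Rolling back, in this model, is just freeing the modified copies and the table itself, which is why the talk calls rollback essentially free.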
So let's talk about how I'm actually going to commit. There are two parts to committing a transaction. First of all, I need to make all of the changes that happened in that transaction public to any new transactions. The way I do that is by publishing the current page translation table: I install it as the new starting point for all future transactions. Now, that by itself doesn't work that well, because I still have to deal with how to ensure durability, and how do I cut down the chain that grows over time? If I have 100,000 transactions, I would have a chain of 100,000 page translation tables to scan through. And as I mentioned, IO is really, really slow, and in pretty much any system we are never writing to the disk directly. There are quite a lot of details in these two links that show how complex it is to actually write durably to the disk. In 2018 or so, there was something called fsyncgate, where it turned out that Postgres, MySQL, Mongo, and a bunch of other databases were using fsync wrong on Linux. The underlying issue was that fsync can fail, and if fsync fails, then the state of the data, what's in memory versus what's on disk, is effectively undetermined. About the only thing you can safely do at that point is crash the database and run recovery from scratch. What Postgres did was retry the fsync: it would issue an fsync, it would fail, then at some later point it would fsync again and it would succeed, but some data that was supposed to go to disk never reached the disk, and you ended up with data corruption. It was a big, high-profile issue in the database community at the time. So let's talk about how I can get to a durable system. Writing to the disk is really slow; even worse, writing to the disk at random locations is slower still, so I have to do something about that. Now, I cannot assume that a write to the disk will be atomic, even if I'm writing just a single page. If I'm writing four kilobytes, the underlying disk may be working in sectors of 512 bytes, and those don't necessarily have to be written atomically; for that matter, I don't even know that a sector write is atomic. So we cannot assume basically anything, and we have to handle all sorts of really, really nasty errors: I wrote to the disk successfully, then I closed the file, and the close failed. As I mentioned, writing to disk is hard; don't try to do that yourself. The lucky thing from our perspective is that we're working in pages, which means that we have a very clear model of what changed. In fact, if you think about it, this thing here, the page translation table, is exactly the list of pages that were modified by this transaction. So what can I do with that? I can just do this: use a write-ahead log. The transaction commit is basically: take all of the modified pages and write them to the log. On startup, we go through this log, read all of the modified pages from it, write them to the data file, and you're done. This is about as simple as you can get in terms of programming models. You have a transaction, it modifies some pages; on commit, you write the modified versions of the pages to the log; on startup, you just replay them and you're done.
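As a rough illustration of that replay loop, here is a sketch in C with an invented journal record layout (this is not Voron's on-disk format, and the checksum helper is elided); it also folds in the per-transaction checksum check that comes up a bit later in the talk:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 8192

/* Invented journal record layout: a header followed by the modified pages. */
typedef struct {
    uint64_t tx_id;
    uint32_t page_count;     /* how many modified pages follow this header   */
    uint64_t checksum;       /* checksum over the page payload that follows  */
} journal_tx_header;

typedef struct {
    uint64_t page_number;            /* where this page lives in the data file */
    uint8_t  data[PAGE_SIZE];        /* the full modified page                 */
} journal_page_record;

/* Checksum over the logged pages; the talk suggests a strong checksum. */
extern uint64_t checksum_pages(const journal_page_record *pages, uint32_t count);

/* Replay the journal into the data file on startup. Stop at the first record
 * whose checksum does not match: that transaction never fully reached the disk. */
void recover(FILE *journal, FILE *data_file)
{
    journal_tx_header hdr;

    while (fread(&hdr, sizeof(hdr), 1, journal) == 1 && hdr.page_count > 0) {
        journal_page_record *pages = malloc(hdr.page_count * sizeof(*pages));
        if (pages == NULL)
            break;
        if (fread(pages, sizeof(*pages), hdr.page_count, journal) != hdr.page_count ||
            checksum_pages(pages, hdr.page_count) != hdr.checksum) {
            free(pages);
            break;   /* truncated or torn write: not a committed transaction */
        }

        /* A valid checksum means the whole transaction made it to the journal:
         * copy its pages to their final locations in the data file.
         * (A real engine would use 64-bit offsets rather than fseek/long.) */
        for (uint32_t i = 0; i < hdr.page_count; i++) {
            fseek(data_file, (long)(pages[i].page_number * PAGE_SIZE), SEEK_SET);
            fwrite(pages[i].data, PAGE_SIZE, 1, data_file);
        }
        free(pages);
    }
    fflush(data_file);   /* the real engine fsyncs the data file before clearing the journal */
}
```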
Now, there are some really important aspects that you have to take into account. First of all, there are many ways to ensure that the data is actually written to the log properly. You can call the write system call and eventually call fsync on the log file, but that leads to interesting complexities. What I prefer to do is to use this model: direct IO, including the data sync flag. The idea is that this forces the operating system to send a command directly to the disk with the FUA, force unit access, flag turned on, which says to the disk: I want you to write this to the persistent medium, and don't come back to me until that is done. Now, this is important, because what does fsync do? Fsync effectively flushes all of the buffers for a particular file, and in many cases this is implemented as: let's flush everything for this particular disk. There was a famous case in 2008 where Firefox used SQLite to store some browser state, which meant that it was running about 30 transactions per second, and SQLite calls fsync. If you were running Firefox on Linux and trying to do something like compile some code, which caused a lot of IO, all of those fsyncs would effectively freeze the system, because it would basically wait for IO and stop everything else. If you're using direct IO, you're effectively skipping the "put it in the cache and then flush the cache" step. You're effectively saying: go directly there, don't wait in line. There are certain limitations to this; you have to have aligned memory and a bunch of other things like that. But it turns out that because you're working with pages, which are naturally aligned, this is not an onerous requirement for a database; this is how you think about it anyway. Another consideration is: don't try to grow the file dynamically. Preallocate the file to some reasonable size; we typically use 256 megabytes, because that makes a lot of sense for our scenarios. Basically, you preallocate the file, fsync the directory, and then you can make effectively direct calls to it without buffering or anything else. That gives you a single disk write that you have to wait for, which is also nice. And because you preallocated the file, you're creating sequential IO, which is insanely optimized. This is actually really funny if you think about it: sequential writes are fast on hard disks because you're effectively writing in sequence on the platter. That is not how an SSD works, but because all of the operating systems were optimized to run on hard disks and wrote in sequential patterns, there was huge pressure on SSDs to optimize sequential writes, and they do. So we're still carrying this notion of sequential writes being the fastest way to write. We also have to take into account that a write may be partial. When we write to the journal file, we do that with a checksum, ideally a cryptographically strong checksum. We consider the transaction to be committed only if the checksum is valid, because then we have a really good indication that what we wrote and the expected value are the same. In other words, when we read the transaction log, we say: okay, here is a transaction, let's get its checksum, let's get its size, read it from the disk, check that the checksum is valid, and if so, this is a committed transaction, and we are going to apply it to the data file. And we have a single write, which will be important in a moment, and that write forces us to wait until the disk comes back, which is super important from our perspective, because once this comes back from the disk, we know that the data has been saved durably.
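As a rough sketch of that journal setup (Linux assumed; the file name, sizes, and flags are illustrative, error handling is minimal, and this is not Voron's actual code): preallocate the journal once, fsync the file and its directory, then reopen it with O_DIRECT | O_DSYNC so each journal write goes straight to the device with force-unit-access semantics instead of through the page cache plus a later fsync.

```c
#define _GNU_SOURCE                       /* for O_DIRECT on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define JOURNAL_SIZE (256L * 1024 * 1024) /* preallocate the whole journal up front        */
#define ALIGNMENT    4096                 /* O_DIRECT wants aligned buffers, offsets, sizes */

static void die(const char *msg) { perror(msg); exit(1); }

int main(void)
{
    const char *path = "journal.0001";    /* illustrative name */

    /* 1) Create the journal at its full size once, and fsync file + directory,
     *    so later writes never change file size or allocation metadata. */
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) die("create journal");
    if (posix_fallocate(fd, 0, JOURNAL_SIZE) != 0) die("preallocate");
    if (fsync(fd) < 0) die("fsync journal");
    close(fd);

    int dfd = open(".", O_RDONLY | O_DIRECTORY);
    if (dfd < 0 || fsync(dfd) < 0) die("fsync dir");
    close(dfd);

    /* 2) Reopen for writing with O_DIRECT (skip the page cache) and O_DSYNC
     *    (the write returns only once the device reports the data is on
     *    persistent media, i.e. FUA-style semantics). */
    fd = open(path, O_WRONLY | O_DIRECT | O_DSYNC);
    if (fd < 0) die("open journal direct");

    void *buf;
    if (posix_memalign(&buf, ALIGNMENT, ALIGNMENT) != 0) die("memalign");
    memset(buf, 0xAB, ALIGNMENT);         /* stand-in for one block of journal data */

    /* A single aligned write at an aligned offset: when pwrite returns,
     * the data is durable, with no separate fsync needed. */
    if (pwrite(fd, buf, ALIGNMENT, 0) != ALIGNMENT) die("pwrite journal");

    free(buf);
    close(fd);
    return 0;
}
```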
Now, let's talk about this again: a transaction that is written is a set of modified pages, but because of MVCC, we also have the old pages. An important observation is that in many cases we don't modify the whole page, so we can do a diff and write only the modified bytes. On top of that, we can compress the data, and in many cases, especially considering that we're writing lots of very similar records, it compresses really well. That can give us significant savings in terms of how much IO we do. This actually turned out to be critically important. Why is that? Because IO is expensive, especially now in the age of the cloud, where disk speed is horrendously limited. You can go and take an off-the-shelf NVMe drive and you should be able to get 100,000 or 200,000 IOPS, easy. You go to the cloud, you get 100 IOPS. This is basically like taking your database and killing it. Not even killing it softly, just killing it. So the reduction in the amount of IO that you write is absolutely worth paying for the compression and diffing. And finally, there is an interesting observation here. We have an issue of a single writer and slow IO, but, and this is a really nice trick, I don't actually have to wait for the disk to complete the operation. I can do something called async commit and still maintain all of the transaction guarantees. How do I do that? I start the commit process, which essentially means running the diff, running the compression, and writing to the disk, and I do that in a background thread. At the same time, I'm already starting to process the next transaction. Now, it is possible that the async commit will fail. If it fails, I just abort the current transaction and abort the previous transaction, because, hey, it wasn't written to the disk, it failed, I'm done. On the other hand, in the common case the transaction is successful, and remember this point: the transaction being successful means that we were able to actually write it to the disk, and being unable to write to the disk is a fairly catastrophic scenario. So we can parallelize the IO cost of writing to the disk with doing the in-memory processing for the next transaction. This is basically what it looks like. One option is: get the write lock, execute the transaction, and commit; get the write lock, execute the transaction, and commit. This is horrible, because what happens is that you can execute one transaction per however many IOPS you get from the system, and that's not a lot. On the other hand, look at what we have here. We are doing something really interesting: transaction merging. Each one of those is a set of commands that run as an atomic transaction. I have something that listens to the network, reads a transaction and all of its commands from the network, and then throws that into a queue. And then I have a background thread that takes the write lock, reads from the queue, and executes those commands in the same transaction, so multiple of them happen one after the other in the same storage transaction. Then I wait for the previous transaction to complete, and after the previous transaction has completed successfully, I start the async commit process of the current transaction and go right back to pulling items from the queue. Now, it's interesting, because we are still maintaining all of the usual guarantees. Each one of them is still running as an atomic transaction, but we don't guarantee the order between transactions.
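A rough sketch of that transaction-merging loop follows (the queue, transaction, and commit functions are hypothetical stand-ins, not RavenDB's API): a single writer thread drains whatever command batches have accumulated, executes them all inside one storage transaction, waits for the previous transaction's asynchronous journal write to finish, and only then starts the async commit for the current one and goes back to the queue.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not RavenDB's API: a queue of command batches coming off
 * the network, the single-writer storage transaction, and an async journal write. */
typedef struct command_batch command_batch;   /* one client transaction's commands     */
typedef struct storage_tx    storage_tx;
typedef struct async_commit  async_commit;    /* handle for an in-flight journal write */

extern command_batch *queue_dequeue_blocking(void);          /* wait for at least one batch     */
extern command_batch *queue_try_dequeue(void);               /* NULL if nothing else is waiting */
extern storage_tx    *storage_tx_begin(void);                /* takes the write lock            */
extern void           storage_tx_execute(storage_tx *tx, command_batch *batch);
extern void           storage_tx_abort(storage_tx *tx);
extern async_commit  *storage_tx_commit_async(storage_tx *tx);  /* diff + compress + journal write */
extern bool           async_commit_wait(async_commit *prev);    /* false: the write failed         */

/* The single writer thread. */
void writer_loop(void)
{
    async_commit *previous = NULL;    /* journal write of the previous transaction */

    for (;;) {
        storage_tx *tx = storage_tx_begin();

        /* Transaction merging: take at least one client batch, then drain whatever
         * else is already queued, all inside this one storage transaction. Each
         * batch stays atomic; we just don't promise an order between batches. */
        command_batch *batch = queue_dequeue_blocking();
        do {
            storage_tx_execute(tx, batch);
        } while ((batch = queue_try_dequeue()) != NULL);

        /* Before making this transaction durable, wait for the previous one to
         * reach the journal. If that failed, abort both it and the current one;
         * as the talk notes, not being able to write to disk is catastrophic. */
        if (previous != NULL && !async_commit_wait(previous)) {
            storage_tx_abort(tx);
            previous = NULL;
            continue;
        }

        /* Start writing this transaction to the journal in the background and go
         * straight back to building the next transaction in memory. */
        previous = storage_tx_commit_async(tx);
    }
}
```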
So we are free to execute them as we wish. We could actually try to be smart about it, look into the commands inside the transactions, and execute them in some optimal order. In practice I found that this is too complex and doesn't give good benefits, but the difference between this version and this version is amazing. Just to give you some idea, the first time that we implemented this, that was the difference, and currently RavenDB is able to process something on the order of 150,000 to 200,000 individual writes per second based on these concepts. Now, so far most of what I've talked about is: okay, I have the page translation table, copy-on-write, MVCC, and writing to the log. But what about updating the data file? How does that work? The whole idea is that we have this notion of copy-on-write, and a transaction runs with a snapshot of the database. Eventually, all of the read transactions have proceeded to the point that no one will look at the old versions of those pages on disk anymore; all of them will see the modified pages in memory. That means I can now go and update those pages in the data file, because there is no read transaction that can observe them. That also leads to another interesting detail: we now have three separate steps for the IO that we run for a transaction. We have the commit portion, where we write the modified pages to the journal; this is how we decide whether the transaction is committed or not. Then, once all of the read transactions have proceeded beyond that point in time, we can write those pages to the data file, and once we have written them to the data file, we can delete them from memory. But writing to the data file doesn't write them to the disk; it writes them to the in-memory buffer of the operating system, the page cache. There is a background process that will write them to the disk, and that's actually something that we want, because it gives the operating system the chance to do IO coalescing: instead of doing individual writes to the file, it can do them in one big bulk. Eventually, I'm going to call fsync, and once I've called fsync, I can actually go and clear the log. The typical way that I do that is to call fsync every time I switch log files. And why does that matter? It matters because if a log file is 256 megabytes, then I can push a lot of transactions before I have to call fsync, which means that the operating system has had the chance to write quite a lot of those buffers to the disk already, and I'm not paying a lot of cost on the fsync. So we are back to: IO is a huge issue, and the cloud makes this extremely painful. This is especially the case because many cloud systems are actually using burstable IO. For example, the GP2 and GP3 disks on AWS will give you up to 3,000 IOPS for some duration, and then that basically drops to effectively nothing. On Azure, you have the standard disks, which have really high latency and low IOPS, and that can be a killer. So as much as possible, we want predictable IO behavior, and we want to push as much of the IO as we can outside of the critical path of the system. So, a high-level question: of your RavenDB customers, how many of them are running on provisioned IOPS versus GP2/GP3, or whatever the burstable one from Amazon is? We have RavenDB as a service, and you basically choose your own system.
Basically, when you select what edition you want, you have multiple options to go through. Let me see if I can show it. So one of those options is here, which affects the number of cores, the memory that you have, etc., and the other option is the size of the disk and the IOPS that you have. And this reflects exactly the system that you have on the cloud, because this is an io1 or io2 disk in AWS with 5,000 IOPS, and this would be GP3 or GP2, with however many IOPS you get for that. I have a question: what percentage of your customers that are running on the cloud are using provisioned IOPS versus not? I would say 40-plus percent run on provisioned IOPS. And that's actually interesting, because IOPS matter a lot, but only once your working set goes beyond the memory capacity. It actually turns out that in some cases it's cheaper to get a bigger machine with more memory than to get a better disk. It's also funny: you can literally get a machine with more RAM than the maximum size of disk you can get in the cloud. The maximum RAM you can get is 24 terabytes; the maximum disk you can get is 16 terabytes. Going back to the IO patterns, something that we ran into quite painfully is that even though you can optimize for the steady state, you have to take into account all sorts of annoying edge cases. For example, we had a customer that said what they cared about most was running as quickly as possible with low latency. So they had a really big machine, they had good disks, and they set their journal size to be very high, which meant that we would fsync very rarely. The way they did that, they actually ran on two separate disks: one disk was the journal disk, which was very, very fast, and then they had an HDD, which was a lot bigger but much, much slower. That meant that, effectively, the critical path was going to the fast disk, and in the background we would write the data to the data file on the slow disk, basically at leisure. That worked great until they ran into a case where they did a big bulk data import and then restarted the database server. What actually happened was that they ran out of disk space and had to shut down to increase the disk space for the system. And then the system would not start; the database would not start. Typically, database startup time is measured in 30 seconds or so on large databases, and it's very busy, but in this case you saw that there was zero CPU, almost zero IO, negligible memory, and the system was basically not proceeding. What turned out was that there were large journal files, and we used compression of the data. So we start reading from the journal file, and we have to do random reads and random writes into the data file to run the recovery, and this is, unfortunately, a sequential process. So, basically, we ran into the worst-case scenario for that setup: read here, read here, read here, and wait for it to happen; read here, wait for that to happen, etc. That was an incredibly painful scenario for us, and we actually had to change things around to make sure that we handle that edge case properly. But, basically, building a storage engine is all about thinking through what cost is so hard to avoid paying, and what kind of optimal access pattern you can build around it.
Notice that everything I said so far doesn't really touch on how you read from the disk. You can use your own pager to pull data from the disk and manage that, but eventually everything has to go to the disk and be persistent. And in every case, you have to make the distinction between written to the data file and actually synced to the data file as two separate steps. I added some resources; a lot of those are about the details of Voron and a whole bunch of details about how to build a database. This is a book that I'm still writing that covers all of this in detail, including executable code in C, written from scratch. And if you want, this is the source code for Voron that you can go over. That's it. Any questions? So, I will clap on behalf of everyone else. I have a couple of questions, but if you have questions for Oren, unmute yourself and fire away. I'd like to ask a question, this is Steven. How do you test any of those mechanisms that you spoke of earlier? How do you test them, besides just writing very brutal integration tests? Writing brutal integration tests. Let me put it this way: I have worn out multiple NVMe drives doing integration tests. More to the point, there are a couple of good tools that you can use. There is ALICE and there is BOB. ALICE is a tool that analyzes the system calls that you issue, then reorders them and sees if the system still makes sense after that. BOB is meant to test file system implementations, because it works on the block device, reordering blocks; I don't remember the details of that one. Those are the primary tools that you would use. Beyond that, there is a lot of just reading the documentation, understanding what sort of guarantees you can have, and then going and reading other database engines and seeing what they are doing and what they rely on. If something isn't 20 years old, I don't want to touch that API, mostly because you get into interesting edge cases. For example, there are parts of the API that work on ext4 and on Btrfs but don't necessarily work on other file systems, and all sorts of things like that. You want to be really conservative about what sort of API interactions you rely on. Something that we did: we had an environment where we could do programmatic power interrupts on a physical machine, and we ran hundreds of cycles of that in order to test that this works. You can sort of do stuff like that on virtual machines, aborting midway and things like that, but we wanted to verify on physical hardware that all of the things involved actually hold. Some things that I haven't mentioned, which are important: everything that I talked about assumes that the disk is reliable, and that comes in two parts. One, that if you send a write to the disk with force unit access, that write is actually going to be durable across a power cycle. Second, that what you wrote and what you read back are the same, which is not always true. This is an interesting observation, but in general, for every 5 petabytes or so that you write, you will see some form of wrong data. Everything I talked about up to this point was about the raw page layer, so we don't care what data you actually have, but we also don't trust the disk at all, and we utilize checksums heavily in order to validate that the data is consistent. So when we write to the write-ahead log, the transaction is checksummed, and we also have a checksum on each individual page that validates that the page data is consistent.
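Since both the journal records and the individual pages carry checksums, here is a minimal sketch of the per-page variant (the header layout and the FNV-1a placeholder hash are illustrative only; this is not Voron's actual format or checksum): compute the checksum over the page when it is written, verify it when it is read, and treat a mismatch as corruption to report rather than repair.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 8192

/* Illustrative page header; a real engine stores more (flags, versioning, ...). */
typedef struct {
    uint64_t checksum;      /* covers everything in the page after this field */
    uint64_t page_number;
} page_header;

/* FNV-1a 64-bit as a placeholder hash. This is NOT what Voron uses; a real
 * engine would pick something faster or stronger, e.g. CRC32C or XXH64. */
static uint64_t hash_bytes(const uint8_t *data, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

static uint64_t page_compute_checksum(const uint8_t *page)
{
    /* Hash the whole page except the checksum field itself. */
    return hash_bytes(page + sizeof(uint64_t), PAGE_SIZE - sizeof(uint64_t));
}

/* Called just before a page is written to the journal or the data file. */
void page_seal(uint8_t *page)
{
    ((page_header *)page)->checksum = page_compute_checksum(page);
}

/* Called whenever a page is read back. A mismatch means the disk returned
 * something other than what was written; the engine surfaces that to the
 * user rather than silently trusting the data. */
bool page_verify(const uint8_t *page)
{
    return ((page_header *)page)->checksum == page_compute_checksum(page);
}
```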
Now, in some cases, there are persistent data structures that duplicate important data. For example, the master file table on NTFS is stored in multiple locations on the disk. We decided not to do that for RavenDB and for Voron in general. What we do instead is detect that there is an error and raise it to the user. The reason is that in many cases an error at the disk level is catastrophic, because you have a hardware issue, and at that point the reaction is: if I'm working with effectively malicious data, I don't know if I can recover from it, and even if I can recover from an error here, there is probably going to be an error somewhere else quite quickly. Another issue is that, again, thinking about the cloud: if you're in the cloud, AWS disks are typically redundant and supposed to be protected at a lower level than the disk interface that we see. And in many production systems, you would run on RAID 1 or RAID 5 at least, so you have that level of redundancy; you don't need to add it at the database level as well. Thank you. I have one more question, otherwise I'll take it. You mentioned at the beginning that you love writing Voron and RavenDB in C#. It is rare, right? RavenDB is probably the most well-known C# database that I'm aware of; there are a bunch of other small hobby ones, but you guys are certainly the biggest. Is there anything about C# that you don't like? I'm not saying you're going to go switch to C++, but is there something about C# that really bothers you, where it's a pain having to do this in this environment? Yes or no? So the primary issue is that it is incredibly easy to write C# code that allocates, and allocates heavily. One of the things that we did in RavenDB in general was to take a lot of ownership over memory allocation. We use our own allocation model and do a lot of manual memory management, and that gave us about a 10x speed boost compared to the previous version, where we utilized a lot of managed memory. At the same time, it's incredibly easy to create code that allocates, and for a long time the base class library APIs would basically force you to allocate; we had to do a lot of work around that. There has been tremendous improvement there, and if I were writing Voron from scratch today, I would probably be able to do almost no allocation at all in the common case. Allocations are bad because of performance and all of the usual things around that. On the other hand, I'm writing this book that I mentioned, and I wrote the code for it in C, and I spent an inordinate amount of time dealing with edge cases, memory management, recovery, all those sorts of things. It is amazing how trivial a lot of those issues become when you don't have to think: I know this would leak memory here, so I have to write ten times more code just to recover from that. I mentioned concurrent data structures, and I think this is a really, really important aspect. Having something like ConcurrentDictionary in C# is just something that I can reach for and use trivially. In something like C, C++, Rust, Zig, those sorts of languages, a concurrent dictionary basically does not exist, or if it exists, it's hazard pointers or epoch-based reclamation, and the level of complexity is just so much higher. A really good example: if I wanted to implement a skip list in C#, which is a very common data structure for memtables, it's easy, it's natural, I don't have to really think about how to do that.
In C++, a concurrent skip list, where you also have to manage when memory can safely be freed, is non-trivial in the extreme.