Today's class, again, we're going down the stack in our storage architecture, and now we're going to start talking about logging. We have to do this because we're an in-memory database, and we want to make sure that any time we make changes, when we come back after a restart or a crash, all our changes are still there. So we'll start off talking about the different types of logging schemes you can have. Then I'll do a quick high-level overview of what ARIES is, which is the predominant logging scheme or implementation for a disk-oriented database system. I bring this up because it lets us contrast what a disk-oriented system does with what an in-memory system does: we'll see that there are certain things we don't have to do, because we don't write dirty pages to disk the way you would have to under ARIES. Then we'll talk about two logging schemes for in-memory databases: physical logging from the SiloR paper, which is the paper you guys read, and command logging from VoltDB. These are actually variants of physical logging and logical logging.

Everyone should have the requisite background from your introductory database course, so I shouldn't need to explain why we want recovery or why we need a logging scheme in our system. But at a high level, this is what allows the database system to ensure that, as transactions modify the database, we maintain the atomicity, consistency, and durability guarantees even though there may be failures. Every logging and recovery implementation in a database system is essentially comprised of two parts. The first part is what you do at runtime as you process transactions normally: as we run queries and update the database, we have to record something when those changes occur so that, at some later point, if we crash and need to recover the system, we can come back and use the information we collected during normal processing to restore the database to the correct state it was in at the moment of the crash. The correct state is the one where all the transactions that committed before the crash are persisted, and any transaction that did not commit before the crash is rolled back, with none of its changes persisted in the database.

We'll see this later when we contrast ARIES with SiloR: recovery ends up being a lot easier for us, because in a disk-oriented system you have to worry about writing out dirty pages from uncommitted transactions during regular processing, and then being able to reverse those writes on recovery. In an in-memory database, the beauty of memory is that it gets wiped. So we don't have to worry about dirty pages, and we only need to be able to redo things. That's the key distinction to understand between disk-oriented recovery and in-memory recovery: we only have to do redo, not undo.

So at a high level there are two types of logging schemes: physical logging and logical logging. The SiloR paper is another example where someone who doesn't normally publish in database conferences refers to things quite differently than the regular database literature does.
So in the SiloR paper, they refer to physical logging as value logging and to logical logging as operation logging. But in most database papers you'll see this dichotomy. With physical logging, the idea is that as transactions make changes to the database, we record in the log exactly what those changes were at a low level, like the byte level. If I update a tuple's attribute and set some integer value to 10, then I record in my log that for this particular tuple, this particular attribute is now set to 10. The idea is that you store the actual value you wrote.

In logical logging, instead of storing the actual low-level changes we made to the bits and bytes of the tuples in our database, we just store the high-level operation that the application invoked to cause that change. For example, if you're doing SQL-level logging, you store the actual SQL statement — the actual string that was passed to you by the client — in the log. That's enough, because upon recovery you can just replay the query and put the database back in the correct state.

At a high level, logical logging seems like what you would want to do, because you end up writing a lot less data when you make big changes to your database than you would under physical logging. With physical logging, I have to write out a log record for every single low-level change I made to the bytes of the tuples. So say I execute a single SQL statement that updates a billion tuples: I have to write out a billion log entries, one for every tuple I modified. Under logical logging, if I'm just recording the SQL statement, all I need to record is that single SQL string that updates a billion tuples, and that's all I need to replay it.

So it seems like this would be a good thing and everyone would want to implement logical logging. But in practice, only one system I'm aware of — and that's VoltDB, which we'll talk about here — implements logical logging. Everybody else implements physical logging. Part of the reason is that the time it takes to execute the query the first time is the same time it takes to execute it on recovery. To do that one-billion-tuple update, if it took me a minute the first time, it's going to take me a minute again upon recovery. There's no magic setting that says "I'm in recovery mode, go fast." So if you have a log trace covering ten minutes of execution, it might take ten minutes to put the database back online, whereas with physical logging you can potentially do it more quickly, depending on your hardware.

The other weird thing about logical logging is that at some isolation levels it ends up being difficult to figure out how to put the database back in the correct state, because you may have had concurrent transactions updating the same record at the same time, and you don't know which one came first. Under serializable isolation this is easy: you just take the transaction ID and replay in serial order. But when you run at something lower, you may not have this information.
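Just to make the dichotomy concrete, here is a minimal sketch — hypothetical record layouts, not any real system's wire format — of what the two kinds of records carry, and why a billion-tuple update means a billion physical records but a single logical record:

```cpp
// A minimal sketch contrasting physical ("value") and logical
// ("operation") log records. Field names are illustrative only.
#include <cstdint>
#include <string>
#include <vector>

// Physical logging: one record per tuple actually modified, capturing
// the new bytes. A statement touching N tuples emits N records.
struct PhysicalLogRecord {
    uint64_t    txn_id;     // who made the change
    std::string table;      // which table
    uint64_t    key;        // which tuple (primary key)
    std::string attribute;  // which column
    int64_t     new_value;  // the actual value written, e.g. 10
};

// Logical logging: one record per statement, no matter how many
// tuples it touches.
struct LogicalLogRecord {
    uint64_t    txn_id;
    std::string sql;        // e.g. "UPDATE emp SET salary = salary * 1.10"
};

int main() {
    // The billion-tuple raise: physical logging would need ~1e9 entries,
    // logical logging needs exactly one.
    std::vector<PhysicalLogRecord> physical;
    LogicalLogRecord logical{1, "UPDATE emp SET salary = salary * 1.10"};
    (void)physical; (void)logical;
}
```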
So let's look at a simple example. We have a simple table of employees with a name and a salary, and two transactions running at exactly the same time. For this, assume we're running under read committed isolation. The first transaction starts running, and it wants to do a sequential scan over the entire table and update everybody's salary to give them a 10% raise. The first thing it does is write an entry into the log with the SQL statement it executed — we're doing logical logging, so it's an exact copy of what the client sent us. Then the scan starts. Say we update the first two rows, but then there's a context switch and this other transaction starts running. The thread switches over, we write an entry to the log that says here's a new transaction doing this update, it goes down, finds my record, gives me a 10% raise, and commits. Then the first transaction starts running again and now it updates my entry.

At this point we're fine — well, is this technically still serializable? Sort of. But under read committed, this is allowed to happen. The problem is that if we now crash and come back, when we replay this log, we won't know that the first query updated the first two tuples but not the third one, and that the other transaction got to the third one first. There isn't enough information in this log to put us back in the correct state. So when we replay the log, we'll actually end up with a completely different database state than we had before.

For this super simple example, you might think that's OK. But imagine other transactions then read the values in this first version of the database and took some action based on them — did some other update, sent out an email, whatever. When you come back and replay everything, it's a completely different database state, so the state of the database will diverge dramatically depending on what your transactions are actually doing. Again, the issue is that we don't have enough metadata to know that this query actually updated the first two tuples and not the third one, and that the other transaction updated the third one first, because we're only recording what the SQL queries were.

Yes? The question is: the replay is still considered correct because we're at read committed, right? If you only look at the replayed state in isolation — yes. But it doesn't match up with what I had before. Think about it this way: I run my system, I see this value is set to 990, I turn the system off, turn it back on, and now my value flipped to 900 because I replayed the log. That's an inconsistent database. It violates your ACID guarantees. You don't want that. Now, under serializable isolation this would not happen, because you tag the queries with the transaction ID and ensure a serializable order. But running at the lower isolation levels, this doesn't work — and this is why nobody actually implements this.
To the best of my knowledge — and this may be something I'm interested in pursuing as a future research project — only VoltDB does logical logging. No other commercial in-memory system does this, and maybe we want to figure out why. Is there a way to resolve this issue? Maybe the answer is that you should always run at serializable isolation. I don't know.

All right, so now let's talk about what disk-oriented systems do. If you take any introductory database class — here at CMU we teach ARIES now — or read any database textbook, they will describe the standard protocol people use in a disk-oriented system for logging and recovery, and it's almost always derived from or based on this project called ARIES. Even if the textbook doesn't say ARIES, it's probably based on ARIES. ARIES is considered the gold standard, the canonical implementation of how to do logging and recovery in a disk-oriented database system. It was developed in the late 1980s and early 1990s in the database research group at IBM Almaden, in the context of DB2. The paper is amazing: it's 70 pages long and very dense, but it covers everything you need to worry about to do logging and recovery correctly, with checkpoints and fuzzy checkpoints, in a disk-oriented system. Any disk-oriented system out there that says it uses a write-ahead log, even if it doesn't cite ARIES directly and doesn't implement the protocol exactly as the paper defines it, is basically doing some variant of it. That's not to say people weren't doing logging and recovery in databases before ARIES came along; this paper just laid out in excruciating detail exactly how to do it. So when someone says "I'm doing ARIES in my database system," we roughly know that means write-ahead logging with checkpoints and log sequence numbers.

The key facet of ARIES is that it relies on a steal, no-force buffer pool policy. Steal means we can write out dirty pages that were modified by uncommitted transactions before those transactions commit. No-force means that when a transaction commits, we don't have to flush all the pages it modified out to disk. The write-ahead log is what they use to make sure that if you crash and come back, you know what should actually be there and what shouldn't.

The team that built ARIES was led by this guy named Mohan — that's just what he goes by. If you say "Mohan" in the database research community, everyone knows who you're talking about. He's super, super nice; he's one of my favorite people in databases. This is a picture of me and him at, I think, SIGMOD 2012, hanging out — you can tell he's out for a night on the town because he's got his shirt open; he's ready to party. And here's a picture from two or three weeks ago of my PhD student, Joy, giving an interview presentation at IBM Research. Right there in the front row, kind of hard to see, is Mohan, ready to ask hard questions.
And his question is always: didn't I already solve this problem in ARIES? He always asks that. All right, so this is the gold standard. I wouldn't have you guys read the paper because it's super long, but it really lays out exactly how to make sure you don't lose any data, no matter what happens in your database system. I really like it.

So what are the main ideas in ARIES? As I said, they use a write-ahead log to record all the changes transactions make to pages in the buffer pool, and that log has to be written out to stable, non-volatile storage before transactions are allowed to commit. The way they keep track of the ordering of the changes as they're written into the log is that every log record is assigned a unique log sequence number (LSN), a unique identifier, and that's what determines the order of these things. This is independent of the transaction ID, which is what Silo uses to order things. They use the log sequence number because it allows them to handle any possible isolation level — you're essentially replaying exactly what happened when the system ran the first time.

Then on recovery — and we'll see this when we go through the phases — they essentially make multiple passes over the log to figure out what was going on, what needs to be redone, and what needs to be undone. The first idea is that you repeat history during the redo phase: you take all the actions in the log and reapply them to the database. Then you figure out what you need to reverse because those transactions didn't commit, and in the undo phase you reverse all those changes. The key thing they do — which, again, we don't have to do in an in-memory system because we don't have dirty disk pages, or disk pages at all — is write out special log records any time they undo something, to record that the undo was applied. That way, if they crash during recovery and come back, they don't try to undo the same things multiple times.

So ARIES recovery has three phases. In the first phase, you go through the log from the last checkpoint to the end and figure out what was happening inside your system. Then you go back and redo all the changes from the appropriate point in the log. Then you reverse direction and undo all the changes from anything that didn't commit. Contrast this with SiloR, where they do only one pass over the log, and they go in reverse order. Again, this is because in a disk-based system there may be dirty pages that were flushed out after being modified by transactions that didn't commit — you may come back up and find pages with changes that shouldn't be there, so you have to bring them into your buffer pool and figure out how to undo them. We don't have to do this in an in-memory system: memory gets blown away after the system restarts, and even if the whole machine doesn't crash, the operating system isn't going to hand you back the same chunks of memory and let you read things. So we basically only have to replay the log; we don't have to undo anything.
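To make the shape of this concrete, here is a minimal sketch of the three-pass structure over a toy log. Real ARIES additionally maintains dirty page tables, transaction tables, and fuzzy checkpoints, all elided here:

```cpp
// A minimal sketch of ARIES's analysis/redo/undo passes over a toy log.
#include <cstdint>
#include <set>
#include <vector>

struct LogRecord {
    uint64_t lsn;       // log sequence number: total order of changes
    uint64_t txn_id;
    bool     is_commit;
    // ...redo and undo payloads elided...
};

// `log` holds everything from the point the last checkpoint dictates.
void recover(const std::vector<LogRecord>& log) {
    // 1. Analysis: scan forward to learn which transactions committed.
    std::set<uint64_t> committed;
    for (const auto& r : log)
        if (r.is_commit) committed.insert(r.txn_id);

    // 2. Redo: repeat history. Reapply every logged change in LSN order,
    //    even changes from losers, to rebuild the exact crash-time state.
    for (const auto& r : log) { (void)r; /* reapply r's redo payload */ }

    // 3. Undo: walk backward, reversing uncommitted transactions' changes
    //    and writing compensation log records (CLRs) so that a crash
    //    during recovery never undoes the same change twice.
    for (auto it = log.rbegin(); it != log.rend(); ++it)
        if (!committed.count(it->txn_id)) { /* undo + write CLR */ }
}
int main() {}
```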
But the other key thing ARIES does, that Silo doesn't, is tag everything with a log sequence number. Every time you create a new log record, you give it an LSN — think of it as a global counter you just increment — and then those records get written out into the log. What makes ARIES so complex, in contrast with an in-memory system, is that the entire system has to carry this notion of log sequence numbers, because it has to know when it's safe to write out dirty pages from the buffer pool, based on whether the log record that caused the modification has been written out yet. So these LSNs get used all throughout the different parts of the system, even parts that have nothing directly to do with logging.

To give a high-level example with this simple diagram: we have our in-memory buffer pool with the tail of the log, and on non-volatile storage we have the complete write-ahead log, plus some disk pages. The first thing you see is the log sequence numbers, assigned in monotonically increasing order; in memory, we only need to keep a subset of the latest log entries, so as we write them out to disk we can truncate, or cut off, the top of the log, free that memory, and reuse it. Each page in our buffer pool also has a pageLSN. This corresponds to the log sequence number of the last transaction that modified the page — it's actually more complicated than that, there are multiple LSNs per page, but we can ignore that — and it points to some log entry. The idea is that if we know the pageLSN is less than the flushedLSN — the LSN of the last log entry we've written out, the durable end of the log — then it's safe for us to write this page out; otherwise we can't. And it doesn't matter whether it's a table page or an index page: we always have to make sure the log record has been written out first. Then on disk there's a piece of information called the master record, which is another LSN pointing to the entry in the write-ahead log for the last checkpoint that was taken. On recovery, you look at this to figure out where to start replaying transactions.

I'm being very hand-wavy here; I just want to convey that these LSNs are all throughout disk-based systems, and you have to coordinate them — check that this one is less than that one, make sure I flush things before I write this. This is part of the reason why, in that pie chart I showed at the beginning of the semester breaking down where CPU instructions go in the different parts of the system, logging is 12% — and that's not even counting flushing things out to disk, just the instructions for maintaining the log records. And all the other parts of the system actually include some of this time as well, because they're doing these checks all over the place to see whether the LSN of the log record or page they're dealing with is greater than or less than the flushedLSN.
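The invariant itself is small. Here is a minimal sketch, with hypothetical names, of the check a buffer pool manager has to make before it can evict a dirty page:

```cpp
// A minimal sketch of the write-ahead rule: a dirty page may be written
// back only if the log covering its last change is already durable.
#include <atomic>
#include <cstdint>

struct Page {
    uint64_t page_lsn;  // LSN of the last log record that modified this page
    // ...tuple data elided...
};

std::atomic<uint64_t> flushed_lsn{0};  // highest LSN durably on disk

bool safe_to_evict(const Page& p) {
    // pageLSN <= flushedLSN: the change is already in the durable log,
    // so writing the page can never expose an unlogged modification.
    return p.page_lsn <= flushed_lsn.load();
}

void evict(Page& p) {
    if (!safe_to_evict(p)) {
        // force the log up through p.page_lsn first (write + fsync)
    }
    // ...then write the page out to disk...
}
int main() {}
```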
In SiloR we're going to be able to get rid of LSNs entirely, because we don't have to worry about ordering the write-out of pages against the write-out of log records — we just write out the log records, and that's enough. So not only are we faster at runtime on the concurrency control side of the system because we assume everything's in memory, but when we go to prepare our log records there are fewer checks to do, because we don't have to worry about disk pages. Another way to think about this: when we talked about concurrency control schemes a few lectures ago, at no point did I even mention logging, because you didn't have to. The same concurrency control scheme works exactly the same in an in-memory database whether you're logging or not, whereas in a disk-based system there are things you may or may not want to do depending on whether you have logging turned on. Does that make sense? There's less work to do because we don't deal with disk pages, and a big part of that is not having LSNs.

But even though we can speed up the software by assuming in-memory storage, the disk is still going to be slow — that doesn't go away when we write out our log records. A key observation is that we obviously don't want every single transaction, when it commits, to wait for its own flush before we can send the acknowledgment back to the client. To ensure the database provides its durability guarantee — that all changes are durable — the system is not allowed to notify the client that its transaction committed until it knows all of that transaction's log records have been written out to disk. So if I open up my terminal and type COMMIT, the client blocks, waiting for the server to respond and say "I've flushed all your changes." We want to be smart about how we organize and schedule flushes so that we're not doing one per transaction — we can batch them together.

We discussed this in the intro class in the fall, but the key optimization everyone does is called group commit. The idea is that we maintain a set of buffers that transactions append their log records to, and either when a buffer is full or when a timeout expires — say, five milliseconds — we write them all out together in a batch to the log file and do a single fsync. When that finishes, everyone waiting on that flush is now committed. If you're the first transaction to add something to the log buffer, you wait the longest; if you're the last, you wait about as long as you would if we were flushing on every single commit. The idea is to amortize the cost of the fsync across multiple transactions.

This was originally developed by IBM in the early 1980s, in a storage engine for IMS. Remember, IMS was the hierarchical system they built for the Moon missions in the 1960s. In the same way that Hekaton is an in-memory-optimized engine for SQL Server, this thing, called Fast Path, was an in-memory-optimized engine for IMS, and one of the things it introduced was this notion of group commit.
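Here is a minimal sketch of the group commit pattern, with hypothetical names: committing transactions enqueue their records and block, and a dedicated flush thread covers the whole batch with one fsync.

```cpp
// A minimal group-commit sketch. Committing transactions append records
// and wait; one flusher thread fsyncs the whole batch at once.
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

class GroupCommitLog {
    std::mutex m;
    std::condition_variable flushed;
    std::vector<std::string> buffer;  // pending log records
    uint64_t flush_count = 0;         // bumped after every fsync

public:
    // Called by a committing transaction: append the record, then block
    // until a flush that covers it has completed.
    void commit(std::string record) {
        std::unique_lock<std::mutex> lk(m);
        buffer.push_back(std::move(record));
        uint64_t my_flush = flush_count;
        flushed.wait(lk, [&] { return flush_count > my_flush; });
    }

    // Called by the dedicated flush thread when the buffer fills up or a
    // timeout (say, 5 ms) expires. One write + one fsync covers everyone
    // waiting; a real system would do the I/O outside the lock.
    void flush() {
        std::lock_guard<std::mutex> lk(m);
        // write(buffer) + fsync() would go here
        buffer.clear();
        ++flush_count;
        flushed.notify_all();  // wake every transaction in the batch
    }
};
int main() {}
```

The counter-and-wait structure is what lets one fsync wake every transaction in the batch at once, which is exactly the amortization described above.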
It sort of seems obvious now, but I guess in the 1980s this was considered novel. It's actually hard to pin down what the hell Fast Path really was, because the papers from the 1980s are difficult to read — these old tech reports are very archaic — and if you try to Google Fast Path now, you just land on the IBM corporate page for IMS, and there's nothing really meaty there to understand.

Another optimization we've already talked about before is early lock release. The idea is that when the client says "I want to commit my transaction," we can internally mark the transaction as committed even though its log record may not have been written out to disk yet. The transaction can then release all the locks it holds on any tuples or other objects in the database, allowing other transactions to start running and read its writes, on the assumption that it's going to commit. Only when the transaction's commit record is actually flushed out to disk is it truly committed, and only then can we start exposing its results to the outside world. So if transaction A makes a change, goes ahead and commits, and is now waiting for the fsync on its log buffer, and transaction B comes along, B is allowed to read the changes made by A — but B can't commit until it knows A committed. Internally, you maintain a data structure that records which transactions are waiting on which other transactions, and when the first transaction commits, everyone behind it can commit as well.

This is a standard optimization used in almost every single system. We saw this earlier with the speculative read case in Hekaton: if you did a compare-and-swap and saw that the transaction that wrote the latest version you're reading hasn't committed yet, you're allowed to read it, but you can't commit yourself until you know that first transaction committed. It's basically the same idea we talked about in Hekaton, except now the delay has to cover not just the other transaction committing but also its log records being flushed.

Yes? The question is what benefit this actually provides — does it let us avoid delaying the second transaction? Think about transaction A: it does a modification, goes to commit, and if we had to wait until it flushes before releasing any of its locks, then no other transaction could make forward progress until that flush finishes. But that first transaction isn't doing any work during the flush — it's literally blocked, waiting on the flush. So rather than having everyone block behind it, and assuming that in the common case the flush succeeds, you just release its locks and let anybody read whatever it wrote. You free the resource instead of holding it while you flush. And in terms of what's returned to the client, nothing changes — it just goes faster.
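A minimal sketch of the bookkeeping this requires, with hypothetical names: each transaction remembers which not-yet-durable transactions' writes it read, and it can only be acknowledged once all of those are durable too.

```cpp
// A minimal sketch of commit-dependency tracking for early lock release:
// B read A's pre-durable writes, so acknowledging B must wait for A.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Txn {
    uint64_t id;
    bool durable = false;              // commit record fsync'd yet?
    std::vector<uint64_t> depends_on;  // txns whose writes we read early
};

std::unordered_map<uint64_t, Txn> txns;

// May we acknowledge `t` to the client? Only if t itself is durable and
// every transaction whose early-released writes it read is durable too.
bool can_acknowledge(const Txn& t) {
    if (!t.durable) return false;
    for (uint64_t dep : t.depends_on)
        if (!txns.at(dep).durable) return false;
    return true;
}
int main() {}
```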
But we still guarantee correctness, because again we never expose any uncommitted changes. The client says commit; internally, in memory, we've committed — we've set flags, done validation, all the things we talked about for regular concurrency control — but we still have to wait until the flush occurs before acknowledging.

So now we can talk about in-memory databases. As I alluded to earlier in the lecture, the great thing about in-memory databases is that recovery is easier than in a disk-oriented system: we don't have to worry about tracking dirty pages in a buffer pool, because we don't have one. And that means we don't need to undo anything, so the only information we need to store in our log records is the redo information. Again, if our system stops and comes back online, memory is wiped. We load in the last checkpoint we took — which we'll talk about on Wednesday — and that puts us at some earlier state of the database, and then we replay the log to get to the point right before we stopped. For all of this, as we said, we're still going to be blocked by the slow fsync times of writing out to non-volatile storage. That can't be avoided, at least on current hardware, though we'll talk later about how to deal with it on newer hardware.

One thing I'll say: if you go back and read some of the earlier papers on in-memory database logging and recovery, they make the big assumption that the system would run on non-volatile memory. Non-volatile memory basically means it looks like DRAM, it smells like DRAM — you do loads and stores to it at a byte-addressable level — but when you pull the plug, all your memory is persisted, and when you come back online everything is still there. You can essentially get this now with battery-backed DRAM, but you can't get that on Amazon, and battery-backed DRAM is notoriously — I think "finicky" is the right word. It works until it doesn't. The batteries wear out, and the form factor is larger, so you lose real estate on the motherboard. You can buy systems with battery-backed DRAM, but it's certainly not commodity hardware. So for what we'll talk about in today's class, we're going to assume we're still running on spinning hard drives (HDDs) or NAND flash SSDs.

That said, real non-volatile memory — not battery-backed DRAM, but a special new type of storage medium that is truly persistent — is coming. I normally would say it's four years away, but as of phone calls I had last week: it's here. You can't buy it yet, but it's coming, probably publicly available in 2019, with Intel putting out the first parts — that's not a secret, they talk about it publicly. I'll have a whole lecture on non-volatile memory at the end of the semester, because this is the future of what's happening, and we're going to get access to it pretty soon. And once you have NVM, the protocol you want to use may actually change versus what SiloR is doing. But we'll talk about that later. For this class, assume today's hard drives; NVM we'll deal with later. All right, so Silo.
So SiloR is the persistent, recoverable version of the Silo system we read about before. The reason I had you guys read the Silo paper for its OCC implementation rather than the Hekaton one — although I like both papers — is that I wanted you to read the SiloR paper, because I think it's sort of the standard, canonical implementation of logging and recovery for an in-memory system, and to understand SiloR you need to understand Silo. That's why I had you read both papers.

The key idea in SiloR is to get really high performance for writing out log records by parallelizing as much as possible. Traditionally, in a disk-oriented database system there's a single log file — or at least the database system sees a single log file. You can parallelize underneath that with RAID or redundant storage devices and things like that, but at a high level the database system sees one log file, writes to that, and recovers from that one log file. SiloR, on a single node, writes out multiple log files at the same time. (In a distributed database you obviously have parallel logs as well, but here we're talking about a single-node system.)

The way it's organized is that the database system assumes there can be one dedicated storage device per CPU socket, and for each socket, one thread is dedicated as the logger thread for that socket. Some of the cores on that socket will be worker threads, some will be checkpoint threads, and one will be the logger thread, responsible for writing things out to the log and doing the fsync. As the workers execute transactions, they generate new log records for all the modifications made to the database, and they hand them off to the logger thread. The key thing, which we saw in the first Silo paper, is that they go to great lengths to make sure there's no centralized bottleneck in the system from having to coordinate all these threads with each other. That's why there's a single logger thread per socket, and only the cores on that same socket interact with that logger thread.

So the first question is how to manage memory in this environment. The logger thread maintains a pool of log buffers — essentially byte arrays — and a worker thread comes along and says, "I need some buffer space, because I'm about to run a transaction that will start making changes." The worker asks the logger thread for a buffer, gets one, fills it up, and when it's full hands it back to the logger thread, which provides the worker another buffer while it flushes the full one out to disk. At some point, if you run out of memory — because otherwise log records could take up all of your space — the logger thread denies the request, blocking the worker thread from getting another buffer. That essentially stalls its transactions until whatever is in the pipeline gets flushed out. Once that's done, the memory is freed, buffers can be handed back to the worker threads, and they can proceed.
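Here is a minimal sketch of that handoff, assuming a fixed, preallocated pool of buffers per logger thread. Note how a worker simply blocks when the free list is empty — that is exactly the back-pressure described next.

```cpp
// A minimal sketch of the per-socket log buffer handoff. Workers check
// out buffers, return full ones, and stall when none are free.
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

using LogBuffer = std::vector<uint8_t>;

class LoggerThread {
    std::mutex m;
    std::condition_variable avail;
    std::deque<LogBuffer*> free_list;  // preallocated, bounded pool
    std::deque<LogBuffer*> full_list;  // waiting to be written + fsync'd

public:
    explicit LoggerThread(size_t n) {
        for (size_t i = 0; i < n; i++)   // pool never grows (and is
            free_list.push_back(new LogBuffer);  // leaked in this sketch)
    }

    // Worker side: blocks (throttling its transactions) when empty.
    LogBuffer* acquire() {
        std::unique_lock<std::mutex> lk(m);
        avail.wait(lk, [&] { return !free_list.empty(); });
        LogBuffer* b = free_list.front();
        free_list.pop_front();
        return b;
    }
    void hand_back_full(LogBuffer* b) {
        std::lock_guard<std::mutex> lk(m);
        full_list.push_back(b);
    }

    // Logger side: write + fsync full buffers, then recycle the memory.
    void flush_once() {
        std::lock_guard<std::mutex> lk(m);
        while (!full_list.empty()) {
            LogBuffer* b = full_list.front();
            full_list.pop_front();
            // write(*b) + fsync() would go here
            b->clear();
            free_list.push_back(b);
            avail.notify_one();  // a stalled worker can proceed
        }
    }
};
int main() { LoggerThread lt(4); }
```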
And this gives you a nice throttling mechanism in your system, because if the worker threads are running transactions as fast as possible and producing nothing but log records, the logger thread won't be able to keep up: you'd have this huge chunk of memory allocated just for log records that you can't flush out to disk in time, and you'd get huge pauses in your system. By doing it incrementally — flushing as you go and stalling only when there are no more free buffers — you smooth out the performance over time. I think the SiloR paper mentions they allocate around 10% of memory to these logging buffers. Different database systems do different things; usually this is a knob you can tune.

All right, so now out on disk, as the logger thread writes out these files: after 100 epochs — they're using the same epoch mechanism we talked about with their OCC implementation — they stop writing to the current log file on that device and start a new one. The reason they do this is that it's easier to manage the log files if they're broken up into smaller chunks rather than one giant log file. And this is not unique to Silo; it's pretty much how every system works. This is a screenshot from a MySQL installation here on campus, and what you see is these two files, ib_logfile0 and ib_logfile1, which are essentially the write-ahead log for MySQL. When a log file reaches 500 MB, they stop writing to it and start a new one. The advantage is manageability: once I start writing this new log file, I may not need the previous one anymore, so I can go ahead and just delete that file, whereas if it were one giant multi-gigabyte file I wouldn't be able to clean it up as easily. Also, for historical reasons, some older file systems only allow file sizes of up to four gigabytes, so you could easily run out of space in a single file. That's why SiloR does this — to make the files easier to manage.

Now, within the log file itself, each log record has the ID of the transaction that modified the record. Again, under Silo, you're given that transaction ID from the epoch-based scheme run by the coordinating thread on your socket, so it's guaranteed to be unique across all sockets. Then you have a bunch of triplets that say: here's the table I modified, here's the primary key or identifier for that tuple, and here are the values I wrote. The value can be a single attribute or key/value pairs. A simple example: say I want to run a query on the people table that sets the isLame flag for Prashant and myself to true. Inside the log file, you'd have an entry that says: for transaction 1001, people table, here are the two tuples, and here are the changes I made. This is physical logging, because I'm storing the actual low-level modifications to the tuples themselves — I update two tuples, so I need two log entries, whereas with logical logging I would only need to record the statement.
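A minimal sketch of what such a record might look like — field names are illustrative, not SiloR's actual on-disk format:

```cpp
// A minimal sketch of a SiloR-style record: the committing transaction's
// TID followed by (table, key, new-value) triplets, one per tuple written.
#include <cstdint>
#include <string>
#include <vector>

struct TupleWrite {
    std::string table;      // e.g. "people"
    uint64_t    key;        // primary key of the modified tuple
    std::string new_value;  // full value, or attribute/value pairs
};

struct SiloRLogRecord {
    uint64_t tid;                    // epoch-embedded Silo transaction ID
    std::vector<TupleWrite> writes;  // one triplet per tuple touched
};

int main() {
    // The example update: set isLame = true for two tuples in `people`
    // (keys 88 and 99 are made up for illustration).
    SiloRLogRecord rec{1001, {{"people", 88, "isLame=true"},
                              {"people", 99, "isLame=true"}}};
    (void)rec;
}
```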
So let's look at an example of the whole architecture. Say I'm running on a single socket: I have my worker thread, my logger thread, and the logger thread writing out to a single log file. Somewhere else in the system there's an epoch thread that keeps advancing the epoch every so often to move time forward. My worker thread gets a request to execute a transaction — in Silo they call these one-shot transactions, which just means stored procedures. The stored procedure wants to start running on the worker thread, so the first thing the worker does is go to its logger thread and say, "give me a free buffer." Once it has that, it can start appending all the changes the transaction makes to that buffer, as log records. At some point the buffer gets full, so we hand it back to the logger thread, say "this one's full," ask for another one, and continue executing the transaction, appending changes there.

Then the epoch changes. At this point, the logger thread needs to make sure it flushes all the changes it has, from any active or committed transaction, out to disk, because when we do this across multiple cores and multiple sockets simultaneously, everyone needs to be synchronized: everyone has to write out their log records whenever an epoch changes. In this example, the transaction filled up the first buffer, so we wanted to write that out anyway. The second buffer didn't get full yet, and the transaction is still running, but we still hand it off to the logger thread and have it go ahead and write it out. Now we don't have any more free buffers, so the worker has to stall while the flush completes. Once everything has been written out, we get our free buffers back and can hand them to the worker again.

The idea is that all the sockets are doing these same steps in parallel, but because each logger thread is dedicated to one socket, the workers only use memory that's local to that socket — they never take memory from another socket. That avoids sending any cache-coherence or validation messages from one socket to another; everything is localized, which avoids any kind of cross-system coordination.

The one extra thing they throw into the system is this notion of the persistent epoch. There's one thread in the entire system, across all sockets, that keeps track of the highest epoch that has been flushed out by all of the logger threads. A transaction that executed during a given epoch can only be fully committed and acknowledged back to the application once a persistent epoch greater than or equal to its epoch has been written out.
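A minimal sketch of that computation, assuming one durable-epoch counter per logger thread: the persistent epoch (pepoch) is just the minimum across all of them.

```cpp
// A minimal sketch of the persistent-epoch computation and the
// acknowledgment rule it enforces.
#include <algorithm>
#include <array>
#include <atomic>
#include <cstdint>

// One durable-epoch counter per logger thread (3 sockets here); each
// logger bumps its slot after fsync'ing all records up to that epoch.
std::array<std::atomic<uint64_t>, 3> durable_epoch{};
std::atomic<uint64_t> pepoch{0};

// The dedicated pepoch thread: the persistent epoch is the minimum
// epoch that *every* logger has made durable.
void pepoch_thread_step() {
    uint64_t min_e = UINT64_MAX;
    for (auto& e : durable_epoch)
        min_e = std::min(min_e, e.load());
    pepoch.store(min_e);  // also written out to the pepoch file on disk
}

// A transaction that ran in epoch e may be acknowledged only once
// pepoch >= e, so no result escapes that recovery couldn't reproduce.
bool can_ack(uint64_t txn_epoch) { return pepoch.load() >= txn_epoch; }
int main() {}
```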
So again, they do everything in batches: all the transactions commit and validate in a batch, and the loggers can flush things out ahead of time — I don't want to wait until the very end of an epoch before flushing everything. But I can't truly say my transaction is committed until I know the persistent epoch covering it has been written out.

So now let's look at a multi-socket environment like this. We have three different devices; assume three different sockets. Each of these devices has its own logger thread, and each socket has its bunch of worker threads coordinating with its logger thread. Then we have one thread — it doesn't matter which socket it's actually on — that we designate as the persistent epoch master. It's in charge of writing out a persistent epoch file, and I don't think it matters what storage device that file is on. Every time the epoch advances and we know all the loggers have flushed out through, say, epoch 200, we can write that out.

Now, to the best of my knowledge, you don't actually need this file in SiloR; it's only done for convenience. So what's going on here? Before I wrote out the persistent epoch, all of these loggers flushed out epoch 200 — here are all the changes for that epoch, written out. If I now crash and come back, all I really need to do is scan across all the log files and ask: what's the highest epoch that all of you have written out? Say one logger only wrote out 199 and the others wrote out 200: I know that logger hasn't finished 200, so the highest epoch I can restore is 199. That's essentially what the persistent epoch is doing — it's saying, all of these loggers wrote 200, so now I can record that 200 is persistent. But you could still get that information from the log files themselves. They do this to make recovery easier: you look in one location, see the highest epoch you need to deal with, and then you can do recovery in parallel without coordinating or doing extra checks — you just blast off and say, start at this epoch and recover everything. OK?

Yes? The question is whether this still works if you release results only based on the persistent epoch. You can, right — you don't strictly need the pepoch file. If I have a transaction that has log entries on two different sockets, then I know I have to wait until both of them are flushed before I can acknowledge, so I'm not leaking any information. The pepoch file is just saying that all the loggers have flushed out 200. When I come back, if I don't have it, I can ask each log file what's the highest epoch it flushed, and a transaction would know: I'm in epoch 200 and I made changes managed by these two logger threads, so I had to wait until both of them flushed out 200. But the student's point is: what if the transaction's writes land on only one logger thread? So his example is: could you end up with incorrect state if you don't have this file? And I'm saying no.
The only way I can think of it being a problem is if you have transactions that touch two of these loggers. If it's only on one — I see your point — if that logger flushed out 200, could you just give the result back right away? You'd have flushed out 200, but you'd still have to wait until everyone else flushes out 200 before you release. Why do you have to wait for the other loggers? Because say this other logger only flushed out 199, and you crash. If you told the outside world that a transaction in epoch 200 committed, and I crash, come back, and look at my logs, one of them says it's only at 199 — so I'm going to restore the database to 199. The transaction from epoch 200 that I told the outside world committed is not going to get replayed, because it occurred after the epoch that everyone is in sync on. And you can't just recover that one transaction separately, because a transaction here could have read something written over there, whose log entry landed on the other logger — from one log file alone you don't know what a transaction read.

The student suggests the paper describes a shared-nothing architecture, meaning each logger would be coupled with its own partition of the database. Is that what the paper says? I'll have to go read that again. Silo, at least when we talked about its OCC, is not a shared-nothing system: any transaction can read anything. It's a shared-everything system. Now, the original Silo paper does talk about a partitioned Silo — that was when you had more of a stored-procedure-style protocol where, if you needed to read something from another partition, you had to do extra work. I believe any transaction can read anything; it's just that all the writes go through your socket's logger. I'll go back and double-check. Again, the most important thing is to understand, at a high level, what these epochs mean — and that this pepoch file, I think, is only done for convenience.

OK, so let's talk about how to do recovery. This is the standard technique used in every in-memory database — actually in any database system: you always load the last checkpoint you took. When you come back online, whether from a crash or a clean shutdown, there's nothing in memory when the in-memory system boots up; you have to populate it. So you load the last checkpoint, and from there you can start replaying the log to put yourself back in the correct database state. Note that the checkpoint is not needed for correctness: you could keep the log for the entire lifetime of the database — if your database ran for a year, you'd have a one-year log — and replaying it would put you back in the correct state. But of course, that makes recovery really long. That's what the checkpoint is for: it's a way to truncate the log, to say "I don't care about anything that comes before this point, because I know all those changes are already captured."
Another key thing to point out — another big difference between in-memory and disk-oriented database recovery — is that, as far as I know, in-memory database systems do not log any changes to indexes. That means if I insert a million tuples into a table with an index on it, then after a crash the system rebuilds the index entirely from scratch. The reason is that it isn't worth the overhead of logging index changes while running transactions. You have to load the checkpoint anyway, which is essentially a sequential scan from disk to read all the blocks of the checkpoint back into memory — that's the most expensive part — and then the model is to rebuild the index, which, because it's in memory, is going to be fast. So as far as I know, every in-memory database always rebuilds its indexes rather than recovering them.

Yes? The student points out that rebuilding can also fix problems like poor balance in the index. Yeah, that's another good point: when we rebuild the index from scratch, we can essentially bulk-load it, and we end up with a nice compact index with fewer holes, if you're using a B+tree or something like that. Absolutely.

So the recovery protocol is: check the persistent epoch file to see the most recent persistent epoch, and then any log record that occurs after that epoch is completely ignored, because you know it wasn't actually committed — you never told the outside world those transactions committed. (Again, you can derive this from the different log files, or just read it from the persistent epoch file.) Then, the way this works in SiloR — and this is much different from a disk-based system — is that they can actually start from the end of the log, the entry that's latest in wall-clock time, and go backward in reverse order, applying the changes. As we replay, we keep track of which tuples we've already updated, so if we see another log entry that wants to change a tuple we've already installed from a later entry, we don't need to replay it. Contrast this with ARIES redo, where you start from the oldest point in time and roll forward: you may end up modifying the same tuple multiple times, because going from oldest to newest you don't know what the final state will be, so you keep reapplying the change over and over. In SiloR, going from newest to oldest, once you update a tuple once, you know that's the final value that should persist and be visible after recovery, so you ignore every older entry for it. That's another big optimization you can do in an in-memory database system. I think only Silo does this; every other system I know of replays in redo fashion, from oldest to newest — and we'll see that VoltDB certainly does.
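Here is a minimal sketch of that newest-to-oldest replay, assuming the records are sorted by Silo TID: the first value installed for a key wins, and all older entries for it are skipped.

```cpp
// A minimal sketch of SiloR-style reverse replay with a "seen" set.
#include <cstdint>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct ReplayRecord {
    uint64_t tid;                          // Silo TID: higher = later
    std::pair<std::string, uint64_t> key;  // (table, primary key)
    std::string value;
};

// `log` holds only records from epochs <= the persistent epoch,
// sorted ascending by TID.
void replay(const std::vector<ReplayRecord>& log) {
    std::set<std::pair<std::string, uint64_t>> installed;
    // Walk from the latest record backward in time.
    for (auto it = log.rbegin(); it != log.rend(); ++it) {
        if (installed.count(it->key)) continue;  // newer value already in
        // ...install it->value into the table heap...
        installed.insert(it->key);
    }
    // Indexes are then rebuilt from scratch rather than recovered.
}
int main() {}
```

With that replay rule in mind, let's walk through recovery again at a high level. We crash and come back.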
The persistent epoch gets read in from the file on disk, and that basically instructs all of the worker threads: load in the database from this point in time. You load in the checkpoint, then you replay the log, and at that point you know you're at the correct state of the database. During recovery, you can't actually process any transactions — essentially the system is stalled — because you don't want anybody coming along and reading or writing the database when you don't know whether they'd see the correct state. There are techniques where you can be clever about recognizing which portions of the database have been recovered, so you can stall only the transactions that try to touch not-yet-recovered data while letting other transactions keep running. But that's more complicated, we don't need to worry about it for our purposes, and I don't think anybody actually does it in practice. Everyone pretty much does it like this: load the last checkpoint, then replay the log.

OK, any questions about SiloR at a high level? Again, I like this paper because it goes into great detail describing all the different parts and how they fit together. It also fits nicely with the concurrency control scheme we talked about before: this notion of epochs being written out to the log is the same epochs we saw in the concurrency control paper.

OK, so in the last 20 minutes I want to talk about another in-memory logging scheme, from VoltDB. Actually — hold up. Before we get there, performance numbers. For this, they have measurements using YCSB and TPC-C, run on a four-socket machine with 256 gigabytes of RAM. This machine was back at MIT, and it's, I guess, six or seven years old now. The key thing you might overlook is that it had three Fusion-io drives. This was back in 2013, and these drives were pretty much state of the art and really expensive — each was something like $5,000, so they're basically running this experiment on $15,000 worth of storage devices. Nowadays a bunch of other hardware vendors sell drives that perform just as well, and you maybe don't pay the premium these cards commanded back in the day. And because the machine only had three PCIe slots, they ended up using a RAID-5 array of spinning disks for the fourth logger, to get roughly the same performance as each of the Fusion-io drives.

So here I want to show basically two sets of experiments. The first is running YCSB-A, which is 70% reads and 30% writes. The three lines you see are MemSilo, which runs without any logging or checkpoints at all; the green line, LogSilo, which runs with logging but no checkpoints; and the red line, SiloR, with both logging and checkpoints. The gray bands represent when a checkpoint is actually being taken. Basically, SiloR is taking checkpoints all the time: when one checkpoint finishes, it pauses ten seconds, for whatever reason, and then starts taking the next one, so the system is constantly checkpointing over and over. And what you see is only a very minor dip in performance while a checkpoint is being taken versus not. That's actually really, really good.
Typically, in a lot of systems you see maybe a 10% to 15% overhead while taking a checkpoint, just because there are threads scanning things and writing out to disk, interfering with the regular workload. Overall, it looks like roughly a 10% difference in performance with logging versus without, so that's pretty good. When we look at the breakdown for TPC-C, which is a very insert-heavy workload, there's a bigger gap between SiloR and LogSilo without checkpoints, and obviously a greater difference still against MemSilo with no logging at all. Again, this just shows that the system can sustain high throughput even though it still has to write its log entries out. You also see that with logging plus checkpoints there's greater variation — a larger standard deviation in performance — because there's simply so much more to write out to disk. And I'll just skip over this at a high level, but the point is it's writing out a lot of data: the log file for TPC-C is 180 gigabytes after running for a few minutes. That's a lot; this is generating a lot of data. You might say: wouldn't I burn out my SSD very quickly, or run out of space? Well, if you're running at whatever they were doing — half a million transactions a second — that's a serious application that needs serious hardware, so you can afford a device that can handle this. That's why they're running $15,000 worth of hardware. The main takeaway from these numbers is that you always pay a penalty for logging things out to disk, because disk is always slow. But in SiloR's case, it's manageable — a suitable trade-off for having your data be persistent.

All right. So now I'm going to talk about another logging scheme, in VoltDB. The motivation for how VoltDB does logging is the idea that failures in an OLTP database are actually rare. On modern hardware in general, failures are rare. I have a machine that the database group web server runs on; it's ten years old now and it's never gone down. That's not to say hardware isn't faulty, but it's not the common case that your database is running on a system that fails all the time. So rather than optimizing for fast recovery, why not optimize for the best performance while you're actually running? You can also mitigate the issue of always having to recover from a checkpoint plus a log by doing replication — so you also want a logging scheme that can replicate very efficiently across a cluster. Another thing I'll point out is that OLTP databases typically aren't big. In HTAP environments, certainly, and in OLAP systems, the databases are much, much bigger, so maybe you want a large cluster of machines. But most OLTP databases fit in memory and only need to run on maybe a dozen machines, maybe fewer. With a small number of machines, these things aren't going down all the time.
If you scale up to a Google MapReduce level, where you're on 1,000 machines, then yeah, things crash. But on a small OLTP database, it's not going to be common. So the way command logging works in VoltDB is that it's a logical logging scheme where we only need to store the transaction invocation request from the client. In VoltDB, all transactions have to run as stored procedures. So the only thing we need to log is the name of the stored procedure the transaction invoked, the input parameters for that stored procedure, because it's essentially a function call, and then some additional metadata, which we'll see in a bit, to make sure that if we come back and replay this stored procedure, we get the right version. So the way to think about command logging is that it's another name for transaction logging: the command is the invocation of the stored procedure. Now, there are some caveats to this approach, and I'll show what it looks like in a second. But the key thing is that it only works if you're using a deterministic concurrency control scheme. We didn't really talk about this when we covered concurrency control before, but all the schemes we talked about, OCC or MVCC, are non-deterministic schemes, meaning if we run the same transactions today and again tomorrow, we may not get the exact same order. Unless we run everything in serial order, we may get arbitrary interleavings of queries. Now it's up to the database system to enforce serializability, to make it as if the transactions executed in some serial order. But there are multiple schedules that are equivalent to a serial ordering, and we don't know which one we're actually going to get. That's a non-deterministic scheme. So if we want to do command logging, where we don't record any information about what queries we executed, only the stored procedure invocation, then a non-deterministic concurrency control scheme may interleave those queries any way it wants, and we would end up with an incorrect or inconsistent database state when we recover the system. It's the same issue I showed with logical logging before, where I execute the query the first time and get one database state, and upon recovery I end up with another database state. So the problem exists because the concurrency control protocol is non-deterministic. VoltDB instead uses a deterministic scheme. What they're going to do is figure out the order of the transactions, and essentially their queries, before the transactions start running, log that, and then execute the transactions. Then when you come back, you know the order they ran in the first time, so you can just replay that same order the second time, and this guarantees you end up with a consistent database state. So what do I mean by this? Let's say I have a single value in my database, a = 100, and I have three transactions: a = a + 1, a = a * 3, and a = a - 5. As long as I execute these transactions in the same order, I'm always going to put the database back in the same state, right? But if I jumble the order, the value is going to end up being different. So that's one part of this.
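To put numbers on that example, here is a tiny Python check, with the three transactions written as functions and the starting value a = 100 from above:

    def t1(a): return a + 1    # a = a + 1
    def t2(a): return a * 3    # a = a * 3
    def t3(a): return a - 5    # a = a - 5

    a = 100
    for txn in (t1, t2, t3):       # the logged serial order
        a = txn(a)
    print(a)                       # ((100 + 1) * 3) - 5 = 298

    a = 100
    for txn in (t2, t1, t3):       # a jumbled replay order
        a = txn(a)
    print(a)                       # ((100 * 3) + 1) - 5 = 296, a different state

Same transactions, different order, different final state, which is exactly why the serial order has to be pinned down in the log before execution.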
We've got to make sure that whatever order we figure out the first time, we replay the second time the same way. The other issue you've got to deal with is that the queries themselves, the operations that modify the database, have to be deterministic too. Because if I come back the second day and replay this, I don't want to get a different result. For example, instead of a = a * 3, suppose I made it a = the current timestamp. I run it on Monday, I get one timestamp; I run it again on recovery on Tuesday, and now I have another timestamp. Now the value is completely different and my database state is inconsistent, right? That would be bad; we can't do this. So in the case of VoltDB, they do some extra checks to make sure you don't do this. If you use a timestamp, they record the timestamp as part of the transaction, so on replay you know what timestamp it actually ran with. If you use a random number generator, it makes sure you always have the correctly seeded generator. But you could still do things like call the outside world, ask what the current temperature is, and use that as input to your database. That would be bad, inconsistent, nondeterministic, so you don't want to do this. All right, we didn't talk about VoltDB before, so here's a quick high-level view of how it works. VoltDB is like the partitioned systems we saw before: it's a shared-nothing system where the database is split up into partitions that each hold a unique subset of the data. Each partition has a single-threaded execution engine with exclusive access to the data at that partition. Now when a transaction comes along, it passes in the procedure name and the input parameters. The procedure looks like the stored procedures we saw before: a bunch of predefined SQL statements and a run method with input parameters. And before the transaction starts running, we write out to the log all the information the client passed us, along with the transaction ID that determines its place in the serial ordering. We don't have to sit idle waiting for that write to become durable, because it gets flushed while we're actually running: we write the entry out, flush it in the background, run the transaction, then double-check that the flush is done, and only then acknowledge the commit. We do this for all transactions, so when we crash and come back, we just replay them in transaction-ID order, and that guarantees we end up in the same database state. Yes? So his question is: if the transaction aborts, do you have to go back and log that? No, because, again, we're deterministic. If it aborted today, it should abort tomorrow, right? If you call the outside world and say, is it raining, and abort if so, that's bad, because the answer might be different on replay. So you don't have to do any of that. Now, you could add some extra metadata and say: transaction one, which I logged the first time, aborted; but I don't need to write the abort.
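Here is a minimal sketch of that runtime write path. The record layout and the helper names (log.append, log.wait_until_flushed, run_procedure) are assumptions for illustration, not VoltDB's actual API; note how non-deterministic inputs get captured in the record so replay sees the same values.

    import json, random, time

    def execute(log, db, txn_id, proc_name, params):
        # Log the invocation, not its effects: procedure name, parameters,
        # and the transaction ID that fixes its place in the serial order.
        record = {"txn_id": txn_id, "proc": proc_name, "params": params,
                  "now": time.time(),                  # pinned timestamp
                  "seed": random.randrange(2**63)}     # pinned RNG seed
        lsn = log.append(json.dumps(record))   # buffered; flushed in background
        result = run_procedure(db, record)     # run while the flush proceeds
        log.wait_until_flushed(lsn)            # durability check before acking
        return result                          # now safe to answer the client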
Transaction two, if it knows that transaction one aborted, could piggyback that information and say: oh, by the way, transaction one, which you saw before, is going to abort. But that requires you to do one pass through the log on recovery to figure out what you actually need to skip. Depending on whether you have a lot of aborts, that might be a fair trade-off. Okay. They also do snapshots and checkpoints; we'll talk about that next time. So again, the database system is going to write out the transaction command to the log before the transaction even starts running. And just like in Silo, where different sockets have different logs, every partition is going to have its own log file. If a transaction has to touch multiple partitions, there will be a base partition that's sort of the home of that transaction, and that's the one responsible for writing the log. Now, this makes things complicated when we do recovery, because if we touched multiple partitions and we crash and come back, and those partitions are on other nodes, then in order to replay the log we have to crash those other nodes and make them replay the log too, because our transaction may have touched them. We'll see an example of that in a second. But another advantage we get out of this is that we can do replication very efficiently, because we're only passing around this command information, and you can't do that easily with physical logging. All right, to do recovery, you load the last checkpoint, just like in SiloR, and then all the nodes re-execute all the transactions in the log in the same order they were added, based on those transaction IDs. So you find the last checkpoint, replay the log from oldest to newest, and essentially you're re-executing all those transactions from beginning to end. And because we're deterministic, this puts the database back in the exact same state. This is what I was saying before about logical logging: if I have a one-hour log file that I need to replay, and it took exactly one hour to run at full speed, then it's going to take me one hour to recover the database. In VoltDB, I think the default is to take checkpoints every five minutes, and you can turn that up or down based on how much you're willing to trade off runtime performance against recovery time. But they would argue that you're almost never really going to have to replay the log and do a complete stop-and-restart of the system, because of replication. The master may go down, and you just immediately fail over to a replica and appoint that as the new master. And replication is super efficient because you're just passing around the commands. So VoltDB's command logging scheme enables an active-active replication scheme, where the transaction runs to completion, in its entirety, on every single replica, without having to do any coordination as you go along. Contrast this with active-passive: in active-passive, the transaction runs on the master, and the master sends out update logs to the replicas, which replay them on their copy of the database. Under active-active, I'm going to run the transaction here and there at exactly the same time.
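A sketch of that recovery path might look like the following, again with hypothetical helpers (load_checkpoint, run_procedure): merge the per-partition command logs into one global order by transaction ID, then re-execute everything after the last checkpoint.

    import json

    def recover(db, checkpoint, partition_logs):
        load_checkpoint(db, checkpoint)
        # Every partition kept its own log; gather all the logged
        # invocations and re-run them oldest to newest by transaction ID.
        records = [json.loads(line)
                   for log in partition_logs
                   for line in open(log)]
        for rec in sorted(records, key=lambda r: r["txn_id"]):
            run_procedure(db, rec)   # deterministic, so the state converges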
And again, because I'm deterministic, I don't have to do two-phase commit to coordinate or synchronize them. Assuming the databases are in the exact same state when the transaction starts, if it commits on the master, then it should commit on the replica. If it doesn't, then there's some hardware or software failure, something bad happened, and I need to crash. So the application sends over the command. The master immediately passes it along to the replica; it doesn't wait for an acknowledgement, it just immediately starts processing it, and the replica does the same when it receives it. Now when the transaction commits, all the replica needs to send back is an OK: yes, I did this. As long as you get that from all your replicas, you know everybody's in the same state. You don't need to check what that state is, just whether it's OK. Once that's done, you're fully committed: you've written out to the log and you've written out to your replicas. That's the unique thing you can do with command logging that you can't do with physical logging. Yes? So his question is: doesn't this break under network failures? If the master sends the command over and the replica never gets it, then that acknowledgement will never come. Well, if there's a complete network partition, then yes, that command gets lost and you have to crash and recover, right? But say you send out two commands and the replica gets the first one but not the second. You can piggyback information and say: I'm sending you transaction ID two; the last one I sent you was one. If the replica sees that it only got one, we're out of sync and we should resynchronize, we should fix that. So you don't need two-phase commit in the common case. You still need Paxos to do leader election and all that, right? That doesn't go away. But as you actually execute transactions, you don't need to synchronize on a per-transaction basis as you would with physical logging. All right, so this sounds amazing, right? Command logging sounds like the right thing to do. What are the problems? Well, as I said, if a transaction updates multiple partitions on multiple nodes, then if one node goes down, you've got to take down the whole cluster, because in order to replay the log, I need everyone back in the same state. A real simple example: say I have three partitions running on different nodes, and my transaction runs at this partition here and then starts making updates to the other partitions: it does a select on P2 and then updates P2 or P3. If P1 crashes, I don't know how to put this back in the correct state, because all I have in my command log is how to invoke the procedure again. I can't pick out, oh, this query executed but that one didn't, so leave that one alone, because I'd need the information collected from those queries to make the decisions in my program logic about what query to execute next, right?
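A rough sketch of that active-active exchange, under the same caveats (the send/recv interface and run_procedure are hypothetical): forward the raw command, run it everywhere, and wait only for lightweight acknowledgements rather than a two-phase commit.

    def commit_transaction(master_db, replicas, record):
        # Fire-and-forget the command to every replica before running it
        # locally; no coordination happens while the transaction executes.
        for r in replicas:
            r.send(record)
        result = run_procedure(master_db, record)
        # Determinism means a one-byte "ok" is enough: if it committed here,
        # it committed there, so there is no state to compare.
        for r in replicas:
            assert r.recv() == "ok"
        return result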
So I don't know how to be smart and say, oh, I only need to update this and not that. There will also be cases where I do a select on partition two and, based on the value, do one thing versus another. But if I'm recovering the database state for partition one an hour later, I don't know what partition two looked like an hour ago, because it has since moved forward in time. So I have to crash everything, everyone loads the checkpoint, and everyone restarts. That's expensive, because you're bringing a bunch of machines back online and replaying the log everywhere. But VoltDB would argue that you avoid that through replication: the likelihood that you'll need a complete recovery across the entire cluster is rare, as long as you're doing enough replication. All right, so a couple of parting thoughts. Physical logging is a general-purpose approach that works for all the concurrency control schemes we talked about in this class, plus two-phase locking and other things. It works for everything, whereas command logging only works for deterministic schemes, and only when you're based entirely on stored procedures. And I was sort of being coy about non-volatile memory; I'll send an email on Piazza about some information that has come in. But the thing I'll say is that when non-volatile memory comes out, an in-memory database system architecture will be better suited to run on it than a disk-oriented one. All the extra ARIES stuff you do in a disk-oriented system is not going to work well on non-volatile memory, because non-volatile memory wants to read and write things at a smaller granularity than a disk page. So the architecture, in terms of how we organize the logging component as we described here, is exactly what we're going to want when non-volatile memory comes out, not the ARIES stuff with LSNs and all that. Again, we'll discuss experimental hardware, or new database system hardware, later in the semester. But just because it's non-volatile memory doesn't mean the SiloR-style work goes away; all of that we still have to do. All right, any questions? All right, 4:20 on the dot, awesome. So next class we'll do checkpoints, which is the second half of this logging and recovery story that I didn't talk about today. And then we'll also talk about a really cool technique from Facebook for doing fast restarts of the system using shared memory for the in-memory data, which I think is really clever. Okay, all right guys. If you want, we're going to head over now to the Oracle talk at 4:30, okay?