All right, Joy's here, so we can get started. Today we're going to talk about more logging and recovery stuff for a database system, but we're differentiating between what we're talking about today and last class. Last class was all about physical logging; now we're talking about alternative methods that aren't as common. Before we get to that, I want to spend some time on course announcements for the project and some other things that are coming up. Then I want to spend a little time clarifying some of the questions that Matt and Alex had about physical logging that we didn't quite know the answers to, and we'll clear those issues up. Then we'll spend time on the paper you were assigned to read, the command logging stuff that's in VoltDB. Then we'll talk about different methods of doing in-memory checkpoints. And we'll finish off talking about a neat little trick in a paper that Facebook wrote a year or two ago, about how to do fast restarts of an in-memory database system using shared memory.

So the first major announcement is that Project 2 will have an automatic extension for everyone. It's no longer due this Wednesday; it's due a week from now, during spring break, on March 9th at midnight. This is automatic for everyone. And because we're giving you an extra week of time to work on this, we expect the implementation to be very good. That means we want to see a lot of test cases. We want to see that your thing performs well and scales, and achieves better performance than it would have if you had submitted it two days from now.

Now, just because Project 2 has been extended by a week, that doesn't mean everything else has been pushed back. After we come back from spring break, on that first Monday, every group is going to have to present their project proposals here in class. That's still due on March 14th. I'll explain more on Wednesday: this Wednesday we'll spend time talking about different topics you can work on for your project and what I expect in a proposal. Then, going forward after spring break, we'll spend time talking about how to develop in a large database system and how to work on legacy code. Also, because we're now spending this Wednesday on the project proposals, the compression lecture we were going to do is moving to after spring break. So there's no assigned reading for this Wednesday; everything has just slid down the schedule by one. And we're going to drop one of the lectures on hybrid systems; there were two lectures allotted for that, and we're just going to drop one of them. So everything has been shifted down. Any questions about this?

Also, because we're giving you more time to do Project 2 and we expect your implementation to be better, I'm shuffling around the grade breakdown. It used to be that Project 2 was worth 20% of your grade and the final exam was 15%. I've moved 5% of the points out of the final exam and put them into Project 2. The final exam is going to be based on the reading reviews, so combined together those give you 20%. So we're giving you more time to work on Project 2, but we're making it count more toward your final grade. Any questions about this? Who here has a working Project 2 Bw-Tree implementation so far? Raise your hand.
I know somebody submitted it, and they passed everything but the memory leak check. What's that? You guys? Ish, it sort of works. All right. So again, both Joy and myself will be around during spring break next week, so if you want to meet up with us, just shoot us an email.

All right, so the other thing is the clarifications for the questions we had last week. Alex had a question about whether you could get logical logging to work with snapshot isolation, and Matt had a question about the way SiloR did the persistent epoch stuff. So I want to go through both of these and clear things up.

In the example we had before about why logical logging wouldn't work, I originally showed two transactions, each with one query, running at a lower isolation level like READ COMMITTED. Now suppose we were running this under snapshot isolation. What happens is that this first guy starts running, and we append the exact query we're running to our logical log. Then the cursor starts moving through the tuples and updating them one by one. We get to the first one, update it; get to the second one, update it. But before we get to this third tuple, there's a context switch, and now the other query starts running. Of course, we append it to our logical log too. It gets in, modifies this tuple, sets Joy's salary to $900, and then commits. And when we go back and try to continue running our first query, we get to this tuple and see that it's been modified in the future relative to when we started. Therefore this transaction cannot complete; it has to abort and roll back its changes. And what I didn't really talk about, but what we'd have to do in this particular example, is remove this entry from the log. Because the transaction hasn't committed yet, we would just keep its entry in memory, and when we realize it aborts, we blow it away and nothing ever gets written out to disk.

So if you were doing this under snapshot isolation, my intuition is that logical logging would still work, but only if the log entries included information about the order in which the transactions finished. If we're appending entries as the queries first arrive and get executed, the actual execution order might differ based on how things get scheduled. So we would have to record in our logical log that this guy came before that guy, and from that we could infer in what order we should replay things. So snapshot isolation would help us here. But if we had to run other queries at lower isolation levels in a mixed environment, which happens all the time, running analytical queries or less important things at lower isolation levels, then this certainly wouldn't work at all, because we still have this anomaly where, if we re-ran the queries on recovery, we could end up with a different state of the database. So the main takeaway is that I think snapshot isolation could work. I thought through a bunch of different examples where you could have a problem, and it only works if you can capture the commit order in the log, and not just the execution order.
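To make that concrete, here's a minimal sketch, in Java with hypothetical names that aren't from any real system, of a logical log entry that captures commit order in addition to the query itself. Entries are buffered in memory while the transaction runs, discarded on abort, and only stamped and flushed on commit:

```java
// Hypothetical sketch: a logical log entry that records commit order.
// Entries are buffered in memory while the transaction runs; on abort they
// are simply discarded, and on commit they get a sequence number and are
// flushed. On recovery you replay entries in commitSeq order, so re-execution
// matches the original serial order rather than the arrival order.
final class LogicalLogEntry {
    final String queryText;   // e.g. "UPDATE emp SET salary = 900 WHERE name = ?"
    final Object[] params;    // the bound parameter values
    final long commitSeq;     // assigned at COMMIT time, not at arrival time

    LogicalLogEntry(String queryText, Object[] params, long commitSeq) {
        this.queryText = queryText;
        this.params = params;
        this.commitSeq = commitSeq;
    }
}
```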
So now, Matt's question about SiloR had to do with this persistent epoch file. Remember that the architecture looks like this: we have our different storage devices, and each one has a logger thread. Each logger thread has a set of worker threads that are making changes to the database and appending log records to that logger's buffers, and the logger eventually flushes them out to disk. Then we have one logger thread that's designated as the special persistent epoch thread, and it's responsible for writing out this persistent epoch file to disk. What happens is, when the epoch thread says we have a new epoch, all these guys start flushing everything that corresponds to that epoch out to their storage devices. When that fsync completes, they notify the persistent epoch thread that they've successfully stored the information up to that epoch. And once it gets the acknowledgments from all the different logger threads, it's able to write that epoch out to the persistent epoch file.

So Matt's question was: is this thing even necessary? All the logger threads already know that they've flushed their logs up to a given epoch. So on recovery, all you have to do is check whether everybody has successfully flushed up through, say, epoch 200, and if so, you know that would be equivalent to what was in the persistent epoch file. And the reason you have to wait until everybody has flushed up to that point is that you may have a transaction whose modifications span multiple worker threads, each writing out to separate logger threads.

So I actually emailed Eddie this weekend and asked him whether this is even necessary. He responded 15 minutes ago, so I'll do the best I can to summarize what he said. He basically says that Matt is absolutely right, that you don't need this at all, because everything you need to know about the last persistent epoch you can get from the individual log files. And he makes the point that avoiding this additional fsync lets you improve latency by roughly 20 milliseconds, because you're not waiting for these guys to fsync and then for this guy to fsync on top of that. But he fully admits that they're seeing about 90 milliseconds of latency per transaction on average, and removing this only gets you down to 70 milliseconds, which is still not very good. The main reason I took from his email for why they have it is that it makes the implementation of the recovery protocol a lot easier: you don't need to do any special scans of the log files to figure out what the last persistent epoch actually is. You just look at one location, and then you use that to go through and replay the logs as needed. So it's not for anything special; they only did it because it makes the software engineering easier. Their argument was that they're already running at 70 milliseconds of latency, which is not that good for a transaction processing system, so adding another 20 milliseconds for this file that makes recovery easier was a trade-off they were willing to make. So Matt's intuition was absolutely right: you don't need the persistent epoch file; they only use it to make the implementation easier. I'll forward you the email from Eddie.
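As a sketch of what Matt's alternative looks like on recovery, here's the minimum-over-loggers computation. This is hypothetical code, not SiloR's actual implementation: you'd scan each logger's file for the highest epoch it durably flushed, and the persistent epoch is the minimum of those:

```java
// Hypothetical sketch of recovering the persistent epoch without a separate
// pepoch file: each logger thread's file tells you the highest epoch it has
// durably flushed. A transaction's writes can span multiple logger threads,
// so only the minimum across all of them is safe to replay.
static long persistentEpoch(long[] durableEpochPerLogger) {
    long min = Long.MAX_VALUE;
    for (long e : durableEpochPerLogger) {
        min = Math.min(min, e);
    }
    return min; // replay records with epoch <= min; ignore anything newer
}
```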
So now let's get to the new stuff. Remember that last class we made a distinction between physical logging schemes and logical logging schemes. We said that physical logging is where you're storing the low-level information about what was modified in the database on a per-query basis: you can say, at this offset for this tuple, here's the new value that I installed for these different attributes. And remember, we said that in an in-memory database we don't need to store any undo information on disk, because as soon as the transaction commits we blow away all the in-memory undo information for that transaction; we never write it out, because we're never going to have to reverse the transaction's changes. So physical logging is going to be slower for execution. It makes the runtime system go slower, because we have to prepare and copy all this information that we want to write into our log buffers. But it makes the recovery protocol much faster, because we don't have to re-execute any queries; we just take whatever bits we have for the different attributes in each record and apply them to our database. Then we distinguished this from logical logging, where we store high-level information about what queries or transactions we were running, but not what changes they made to the database itself. We said this is faster for execution at runtime, because the amount of information you have to store in the log is much less than with physical logging, but it's slower for recovery, because you essentially have to replay and re-execute all the queries and transactions.

Given this distinction, the paper you read about command logging makes the observation that failure in an OLTP database system is actually kind of rare, because OLTP databases aren't that big. They're usually measured in the hundreds of gigabytes; the largest one I know about is 10 terabytes. So they're not that big, and therefore they're not running on hundreds of machines. We're not talking about a Hadoop cluster at Google or Yahoo running on 5,000 machines, where, yes, at any given time one of those machines is always going to be down or failing. Most OLTP databases run on around 10 machines, and then failures happen a lot less often. In that case, the likelihood that your hardware is going to fail is not that high. So it's better to optimize your database system for the common case, which is that you're not going to crash, and get the best performance you can for all your runtime operations, your execution of queries and transactions. And if you ever crash, yes, recovery will be a bit slower, but that's the rare case and you just deal with it when it happens. We'll also talk about how replication fits into this, which can let you avoid the recovery protocol in its entirety.

So the logical logging scheme you read about in the paper is called command logging. The basic idea of command logging is that you store the invocation of the stored procedure that the client makes to the database system. Remember, we talked about how stored procedures are like RPC mechanisms where you can declare a function that has procedural program logic intermixed with SQL statements. The application says, I want to execute that function, here are the input parameters, and then that gets invoked as a transaction by a worker thread.
So in command logging, the only things we need to store are the name of the stored procedure that we executed and the input parameters that we provided to the function itself. There are also some additional safety checks that we want to log, to make sure that when we recover from the log we re-execute exactly the same stored procedure that we had before. For example, you could store a checksum or a version ID, saying: this is stored procedure foo, and here's the version of it that I executed. That way, if something comes along and changes foo, either by changing its SQL statements or by changing the actual program logic itself, you don't re-execute it and get a different result. You're guaranteed that whatever executed the first time is what executes the second time. So essentially you can think of command logging as just transaction logging for stored procedures. That's all it is. We don't store the individual SQL statements the procedure executes; we just store that one single stored procedure invocation.

Now, one aspect of why command logging can work is something we haven't really talked about yet, and that's this idea of a deterministic concurrency control scheme. The idea is that if I have the same state of the database that I had before, and I execute the same stored procedure with the same input parameters, then I'm guaranteed to get exactly the same new state of the database that I got the first time. So if I ran the database system and executed a transaction that updated the database, and then I crash, come back, get back to that same starting state, and execute that same transaction again, I'll get the same new state of the database.

This relies on two things. The first is that, assuming we're doing command logging, we have to have the order of our transactions predefined before they start executing. It's not like MVCC, where you let things start whenever they want and interleave the operations as they come along; we have to know exactly what the order is going to be before we start running. The second thing is that the transaction logic, the program logic in the stored procedure itself, has to be deterministic. What I mean by that is: say we have a database with one value, a = 100, and we have three transactions. We satisfy the first condition by ordering them ahead of time. Now if you look at the logic for each of these transactions, it's all deterministic: a = a + 1, a = a * 3, and a = a - 5. That means no matter how many times I execute these guys, over and over again, starting from this database state, it always comes out to the same result, because these are obviously deterministic. An example of a non-deterministic transaction would be something like a = a * now(). If I run that today and I run it tomorrow, the now() function is going to return a different result each time, so I'll get a different value for my database state. That's an example of non-deterministic program logic in a transaction. Is this clear? So in command logging, they're going to disallow stuff like this, because if I ran it today, and my database crashes and I rerun it tomorrow, I'm going to get back a different value and end up with an inconsistent database state.
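Here's that same example written out as code; the first two statements are safe to replay, the last one is not:

```java
// Deterministic vs. non-deterministic transaction logic, starting from a = 100.
// Replaying the first two always yields the same state; the third reads the
// wall clock, so replaying it after a crash produces a different database.
long a = 100;
a = a + 1;                           // deterministic: always 101
a = a * 3;                           // deterministic: always 303
a = a * System.currentTimeMillis();  // NON-deterministic: differs on every replay
```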
And therefore I'm not going to be able to recreate exactly the state I had before. So this is bad; we don't want this. So now I want to spend a little time talking about the architecture. Yes, question? Your question is: if I have three transactions, do I have to order them one, two, three, or could I order them one, three, two? It's not a predefined order in that sense, right; you can do scheduling. Say that on my node, transaction one arrives first, three arrives second, and two arrives third. I could flip the order of those, yes, that's fine. You just have to be sure that once the order is fixed, everything runs in that order. Yes, correct.

All right. So I want to spend some time now describing the actual VoltDB architecture. And I guess, full disclosure: when I was in grad school, I built a system with some people called H-Store. They took H-Store, forked our code, and made VoltDB. So the high-level architecture I'm describing is the same thing we developed in H-Store that's now in the commercial product VoltDB. I say full disclosure because I'm biased in some ways about the design of this system, but I'll be totally upfront about where I think the deficiencies are. We've already talked about the H-Store concurrency control protocol; now I want to talk about what the actual physical architecture of the system looks like.

Assume we're running on a single node. Our database is going to be split up into disjoint subsets called partitions that are stored entirely in main memory. For each partition, we assign a single-threaded execution engine on a single core that has exclusive access to all the data in that partition. In our earlier work we didn't really worry about NUMA regions, but you can imagine one core running on one NUMA region and another core running on another NUMA region, so each accesses memory that's local to it.

So now the client sends a request. Again, we're primarily running with a stored procedure API, so the client passes in the name of the procedure it wants and its input parameters. In VoltDB, all the stored procedures are essentially Java classes, and they have two parts. The first part, at the top, is a set of pre-declared queries that the stored procedure is going to invoke. These are essentially the same thing as prepared statements: where there would be constants, you put in these question marks, which say there's going to be a value inserted for this argument in the query, and I'll give it to you at runtime. The database system can do all the query planning and optimization ahead of time, even though it doesn't yet know what the values for these are going to be. All of these queries get a unique name within the context of the stored procedure. Then we have a run method, and the run method takes in the input parameters, the same parameters the client sends us over the wire as part of the request. Within the stored procedure logic itself, regular Java code is intermixed with calls that invoke the predefined queries declared above, and of course we pass in the input parameters that get substituted for the question marks.
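To make this concrete, here's roughly what such a procedure looks like using VoltDB's public Java API (org.voltdb). The table, procedure, and parameter names here are made up for illustration:

```java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// A sketch of a VoltDB stored procedure. The SQLStmt fields are pre-declared
// so the planner can compile them ahead of time; the '?' placeholders get
// bound at runtime. Table and procedure names here are invented.
public class GiveRaise extends VoltProcedure {
    public final SQLStmt getSalary =
        new SQLStmt("SELECT salary FROM emp WHERE id = ?;");
    public final SQLStmt setSalary =
        new SQLStmt("UPDATE emp SET salary = ? WHERE id = ?;");

    // run() receives the same input parameters the client sent over the wire.
    // The body must be deterministic: no clock reads, no RNG, no outside RPCs.
    public VoltTable[] run(long empId, long raise) throws VoltAbortException {
        voltQueueSQL(getSalary, empId);
        long current = voltExecuteSQL()[0].asScalarLong();
        if (current < 0) {
            throw new VoltAbortException("bad salary"); // deterministic abort
        }
        voltQueueSQL(setSalary, current + raise, empId);
        return voltExecuteSQL(true); // 'true' marks the final batch
    }
}
```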
So for a stored procedure written like this, the requirement that it be deterministic means that, just as in the example I showed before, you can't have anything based on the current time or date inside the function. You can't make a call to a random number generator. And you also can't make a call to an outside system. You can't make an RPC request to some remote service, get back a value, and then have an if-branch that says: if the return value is this, execute this query, and if it's that, execute that query. Because, again, if we come back and run this another time, that external service may give us a different answer; we would not get back the same result, and we'd end up with an incorrect database state.

So in the early days of VoltDB, I went on what I guess you could call a sales call with Stonebraker and some other people to visit PayPal. In PayPal's transactions, when you send money from one person to another, in the middle of the transaction they make a call to an outside fraud detection system that comes back with a result, and based on that result the transaction decides whether it should commit or not. That wouldn't work in this kind of environment. Say the database crashes, you come back online, and you start replaying all your transactions, which make calls to the fraud detection system; but now that fraud detection system is down. This external service is now preventing our database from ever recovering, because we can't invoke that RPC request again. By decoupling the two services you avoid that problem, but then you can't use a system like VoltDB for PayPal's workload.

OK. So we get a request. The transaction gets a timestamp, and then it gets queued up at whatever partition has the data it needs to access. Before the transaction starts running, though, we write out its entry to the command log on disk, and we do a group commit and flush. Once we know the entry is durable on disk in the command log, our transaction is allowed to start running. What we store in the command log entry is the same procedure name and input parameters the client sent us. We also store a transaction ID, because that corresponds to its serial order, the order in which it should get invoked; that way, if we have to recover from the log, we know in what order to replay these transactions. So once we know the entry is durable, our transaction gets queued and allowed to start running. At some point it executes its queries and makes changes to the database, and when it finishes, we can commit right away and send the result back to the application. While this is all going on, in addition to the command logging, there are background threads taking asynchronous copy-on-write snapshots, or checkpoints, of the database and writing them out to disk as well. So now if there's a crash, we do the same thing we did under physical logging: we load the last checkpoint we took and then replay the command log for all the transactions that ran up to the point where we crashed.
So now I want to go through the recovery protocol and the checkpointing protocol in a bit more detail, and we'll talk a little about how command logging makes it really easy to do replicated nodes as well, much easier, I think, than with physical logging. The logging protocol is that, again, we log the command the client sent us before the transaction starts running. That means that regardless of whether the transaction commits or not, we always log it: if a transaction aborts, we've still logged it, because we wrote the entry before it started running, and at that point we didn't know whether it was going to abort. Most of your transactions in VoltDB should only touch a single partition, but if they touch multiple partitions or multiple nodes, you don't want the log records for that transaction scattered across a bunch of different locations, because then you wouldn't know where it actually ran when you recover the database. So the base partition is where the transaction's stored procedure actually runs, and that's the partition responsible for storing the log record in its local command log on disk. Remote partitions and nodes don't log anything for the transaction. But if we have replicas, each replica has to maintain the same log the master does, because that way, if the master dies, the slave has an exact copy of the log and the same state of the database as the master.

For checkpoints, the way this works is that, unlike SiloR, which took fuzzy checkpoints, where the checkpoint itself could be internally inconsistent, in VoltDB the checkpoints are consistent, meaning they are a correct snapshot of the database when written out to disk. The way they do this is to invoke a special transaction that gets queued up like any other transaction in the system, but this transaction locks all the partitions and tells them to switch into checkpoint mode. Because we're running single-threaded engines, when we invoke this special transaction, it blocks any other transaction from running at the same time. We switch into checkpoint mode, release the locks, and then a separate thread in the background starts taking the checkpoint while we continue to execute other transactions. The way we avoid an inconsistent snapshot is that we switch the database into a copy-on-write mode, really a copy-on-update mode: instead of doing the in-place updates you normally do, where you overwrite the current value of a tuple directly where it lives in memory, any time you do an update you make a copy of that tuple and flip a bit in its header that says this tuple was modified, or inserted or deleted, after the checkpoint started. As the checkpoint thread scans the table, if it sees a tuple with one of these bits set in its header, it knows that tuple didn't exist when the checkpoint started, so it ignores it entirely. And when it finds old versions, it can clean them up as it goes along. Once it's done scanning all the tuples, it invokes another transaction that says: all right, we're done with the checkpoint, switch out of copy-on-update mode. And then we add an entry to our command log that says we've just completed a successful checkpoint, and here's its location on disk.
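As a rough sketch of that copy-on-update behavior, with hypothetical data structures rather than VoltDB's actual storage code:

```java
// Hypothetical sketch of copy-on-update during a consistent checkpoint.
// Normal mode updates in place; checkpoint mode writes a fresh copy and marks
// it as created after the checkpoint began, so the checkpoint scan skips it.
final class Tuple {
    long[] values;
    boolean dirtySinceCheckpoint; // header bit: made after checkpoint began
}

final class Table {
    boolean checkpointActive;

    Tuple update(Tuple old, long[] newValues) {
        if (!checkpointActive) {
            old.values = newValues;          // normal in-place update
            return old;
        }
        Tuple copy = new Tuple();            // copy-on-update: new version
        copy.values = newValues;
        copy.dirtySinceCheckpoint = true;    // checkpoint scan ignores this one
        return copy;                         // old version stays for the scan
    }
}
```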
And then on recovery we don't have to do anything extra. We don't have to consult the log to figure out how to rectify the snapshot; we know it's fully consistent. So this is the distinction from SiloR: SiloR did fuzzy checkpoints, VoltDB does non-fuzzy checkpoints. I would say most database systems do fuzzy checkpoints; this is rare, and VoltDB can only do it because of the logical logging and the single-threaded execution engines.

All right, so again, for recovery: you load the last checkpoint you took from disk, and then you re-execute all the transactions in your command log that occurred after the checkpoint was created, or at least after it was started. This is, again, different from SiloR. SiloR could start at the newest entry and go back in time, and do this trick where, if something has been modified in the future, it can ignore the older modifications it finds as it goes further back. In the case of VoltDB, they have to start from the oldest log entry and move forward in time, because they don't know which tuples were actually modified by each transaction; they have to re-execute everything. This also means they re-execute any transaction that ended up aborting the first time they ran it, because they don't log anything saying a transaction aborted. They only log it when it starts, and they re-execute it just as before.

So this means that in the case of VoltDB, because you have to re-execute all the transactions over again, the amount of time it takes to recover the database depends on the amount of time that has elapsed since the last checkpoint. Say you take a checkpoint every hour. That means you potentially have one hour's worth of command log to replay when you recover the database. If you were running at 100% utilization, at the max speed the database can support, it would take you about an hour to chew through that log. If you were only running one transaction a minute, then that's only 60 transactions in your log, so even though it's a one-hour log, you can blast through it in milliseconds. So recovery time depends partly on the time that's elapsed and partly on how fast the database system can actually process the log entries.

Now, I think you could actually store some hints in the log that say: oh, by the way, this transaction ended up aborting. That way, when you replayed the log, you could find these aborted entries and just skip them. The problem is that this requires two scans of the log: a first pass to figure out what actually aborted, and a second pass to replay. And I think that's a lot of software engineering work for not much gain. It'd be an easy optimization, but I don't think it's worth it.
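Setting the abort hints aside, the core replay loop is simple. Here's a minimal sketch; all the names are hypothetical, and this is not VoltDB's actual recovery code:

```java
import java.util.Comparator;
import java.util.List;

// Sketch of command-log replay: load the checkpoint first, then re-execute
// every logged invocation in transaction-ID order, oldest first. Transactions
// that aborted the first time get re-executed too; determinism makes them
// abort again.
final class CommandLogEntry {
    final long txnId;        // defines the serial replay order
    final String procedure;  // which stored procedure to re-invoke
    final Object[] params;   // the client's original input parameters
    CommandLogEntry(long txnId, String procedure, Object[] params) {
        this.txnId = txnId; this.procedure = procedure; this.params = params;
    }
}

interface ProcedureRunner { void invoke(String procedure, Object[] params); }

final class Recovery {
    static void replay(long checkpointTxnId, List<CommandLogEntry> log,
                       ProcedureRunner runner) {
        log.sort(Comparator.comparingLong(e -> e.txnId));     // oldest first
        for (CommandLogEntry e : log) {
            if (e.txnId <= checkpointTxnId) continue; // already in the snapshot
            runner.invoke(e.procedure, e.params);     // may abort again; fine
        }
    }
}
```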
OK. So those are the basics of command logging. You log the stored procedure invocation, and that's enough to let you re-execute that same transaction with the same input parameters the second time around. And because our stored procedures are deterministic, we're guaranteed the state of the database will be the same no matter how many times you crash and restart. Yes? Your question is whether a transaction that aborted will abort again on replay, because it's all deterministic? Yes, correct. An abort here would be something like someone trying to buy a book that's out of stock: if stock is less than or equal to zero, abort. Every time you replay, the state of the database at that point is the same, so you're always going to abort again.

The second question: at a higher level, is command logging similar to logical logging? Command logging is an example of logical logging, yes. And logical logging doesn't necessarily have to be query-only, right? It's just the high-level information about what you need to do to re-execute whatever you were doing the first time, with no information about where things are actually stored on disk or in memory.

Yes? So the question is: if you provided hints in the command log that a transaction aborted, why would you have to do two scans? The issue is that we write the log entry before the transaction starts running, so at that point we don't know whether it's going to abort. We would have to write another record that comes later, saying: by the way, that transaction one-two-three you saw a couple of milliseconds ago, it's going to abort, so don't actually execute it. And the follow-up question is: on that partition, no other transaction is going to run until we know whether our guy aborts, correct? Well, on a single node you have multiple partitions, and they're all writing to the same command log. But what you could do is read ahead a little and process the log in batches, and within a batch you could recognize: OK, this guy aborts. Yeah, that actually would work. You wouldn't have to scan the entire log to find the abort entry, because the initial log record for the transaction and its abort record would be pretty close to each other in the log. And it wouldn't be that expensive, either, because you're not logging much to say a transaction aborted; it's not tens of megabytes, you'd just say, this transaction ID aborts. So yeah, you actually could do this. I should double-check whether they do that in the real version. We didn't do it in H-Store, but we didn't really care about recovery time.

Any other questions? Yes? The question is whether we rebuild the indexes based on the checkpoint and the log, or whether we actually store the indexes in the checkpoint. As far as I know, in every in-memory database that I'm aware of, nobody logs or stores indexes on disk. You always rebuild them. It's simply not worth the performance overhead to write an index out durably to disk; you just rebuild it quickly on restart. So in the case of VoltDB, you only store the tuples when you take the checkpoint; you don't store anything about the indexes. Any other questions?

OK. So let's talk about some things that command logging can do that are not as easy to do with physical logging, or, I dare say, even impossible. Although this course is all about single-node databases, I want to talk a little bit about replicated environments. The basic idea, the way everyone does replication for the most part, is master-slave replication, or primary-secondary. You have some master node that takes the command or request from the client and is the primary location where you execute it. And then there's a replica that you want to keep as tightly in sync, as strongly consistent, with the master as possible. That way, if this guy dies, you can elect this guy to be the new master.
And then all the requests go to that new master. So with command logging, the client sends the master node the procedure name and input parameters. What we can do is immediately forward this request over to the replica once we've assigned it a transaction ID. We don't even need to wait for it to start executing on our end; we can immediately send it over to the other guy. Because we have the transaction ID, we've already figured out what order our transactions are going to run in, so all the replica needs to do is look at the transaction ID and make sure it runs things in the right order. And what's awesome about this is that when the replica finishes executing the transaction, it just sends back a single acknowledgment to the master: that transaction you told me about, I executed it and it succeeded. It doesn't need to do two-phase commit, an atomic commit protocol, to make sure everybody is in sync. You don't need the whole exchange of: hey, can you do this? Yes, I can. OK, let's commit it. OK, I will. It's one message back over the wire. And we can do this because, again, our stored procedures are deterministic. If both nodes start with the same database state, and our stored procedures are guaranteed to modify the database in the same way and leave it in the same state, then as long as this guy says "I succeeded," you know he has exactly the same copy of the data that you do. And then you can send the acknowledgment back to the application.

The way you do this with physical logging is that the transaction runs on the master, and as it modifies the database, you stream the physical log to the replica as you go along: I updated this tuple, here's its new value, go. The replica doesn't actually execute any of the queries; it just replays that log on its side. In that case, you do have to do two-phase commit, because you have to make sure everyone got all the same log messages, applied them in the same order, and is OK with applying those changes. So command logging allows you to do what's called active-active replication, meaning the transaction is actively running on all the replicas in your cluster, versus active-passive, where the transaction runs actively on the master and the replicas passively apply the updates the master generates. Is this clear? So it lets you avoid two-phase commit, and you can run much faster than you would under a physical logging scheme.

So where does command logging fall apart? The problem is that if you have a transaction that spans multiple nodes or multiple partitions, and one of those partitions goes down and you don't have a backup or replica for it, you have to restart the entire database system and recover every single node. Under physical logging, you wouldn't have to do that: you could pause the database, and when the node that went down comes back up, it recovers its checkpoint, replays its log, and gets back up and running. You only have to do that for the single node. With command logging, you have to do it for all nodes. So let's look at an example. Say we have a cluster of three partitions, and just assume these partitions are running on different nodes, and say we have some pseudocode for a transaction that looks like this: the first thing we do is execute a query on partition two.
We get back some value, and then, based on that value, we execute a query on either partition two or partition three. And let's say this transaction is running on partition one, where it can make other changes and modifications. So we send our updates to these other guys. Now this guy dies, and we have to reload the last checkpoint we had and replay our log, which means replaying this exact transaction. Since it modified these two other partitions, and we don't know exactly how it did that, we have to crash both of those guys and replay them too. Because if we re-executed these queries, where y = y + 1, they might run it again even though they don't have the same database state we recovered from the checkpoint. We can avoid this with master-slave replication: if we have multiple replicas for each partition, it's very unlikely that we'll ever have to do a complete restart and recovery on every partition. But if we don't have that, if the final master for partition one goes down, then we have to recover everybody, because otherwise we won't all have the same state across all nodes. Is it clear why this is one particular problem with command logging?

The other big problem with the H-Store protocol, the concurrency control scheme, is that it's terrible for analytical queries. If you have a query that spans multiple partitions, you have to hold the locks on those partitions while you run the query and combine all the results to produce the final answer. That means that while your transaction or query is running on one partition, it's holding the locks on the other partitions, and they're basically sitting idle, doing nothing. So the VoltDB stuff is amazing for OLTP, but it does have problems with very complicated OLAP stuff, problems that the HANA guys, the HyPer guys, and MemSQL can avoid because they're using MVCC. OK?

So now, in the time we have left, I want to talk about different ways to do in-memory checkpoints. We've basically talked about two schemes so far. We talked about SiloR, which basically does a scan and tries to avoid writing tuples it doesn't need into the checkpoint. And we talked about the VoltDB case, where you switch into a copy-on-write mode, a copy-on-update mode, and have the snapshot skip new tuples. I want to talk about four different approaches to doing snapshots. The main thing I'll say is that the choice of which approach you want to use in your database system is tightly coupled to which concurrency control scheme you're using. We'll see this in a second, but, for example, if you're using MVCC, then you probably want the copy-on-update method, because you already have multiple versions and can just skip them as you scan along. But if you're using two-phase locking or OCC, then you want a different approach. The basic idea of checkpointing is that you have a separate thread that runs in the background, scans through your table heap, and writes out every tuple it finds, one by one. And if your database system gives its internal threads direct access to the table heaps, you don't have to go through the indexes at all, and you don't have to worry about any locking or latching on those data structures. You can just go through the individual tuples.
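Here's a minimal sketch of that background thread, with hypothetical types; the point is that it walks the heap directly and never touches an index:

```java
// Hypothetical sketch of the background checkpoint thread: it walks the table
// heap directly, tuple by tuple, so it needs no locks or latches on the
// indexes, which it never reads.
interface TupleWriter {
    void write(long[] tupleValues); // append one tuple to the checkpoint file
    void finish();                  // fsync, then record the file's location
}

final class CheckpointScan implements Runnable {
    private final Iterable<long[]> tableHeap; // direct access to the heap
    private final TupleWriter out;

    CheckpointScan(Iterable<long[]> tableHeap, TupleWriter out) {
        this.tableHeap = tableHeap;
        this.out = out;
    }

    @Override public void run() {
        for (long[] tuple : tableHeap) {
            out.write(tuple);
        }
        out.finish();
    }
}
```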
So the four different approaches we're going to talk about are naive snapshots, copy-on-update snapshots, and then two specialized methods called wait-free ZigZag and wait-free Ping-Pong. As far as I know, everybody usually implements one of the first two; I don't know of anybody that actually implements the wait-free ones in any of the big systems, but they're at least worth discussing.

The naive snapshot is pretty straightforward, pretty easy to understand. You basically have the database system pause, blocking all transactions from executing, and then you have your thread take a complete snapshot of the database and write it out to disk. It's super easy to implement. The problem, obviously, is that you're blocking transactions while you do this. The way you block transactions is basically to have your worker threads stop taking new ones. So if you have one transaction that's really, really long, you pause all your other worker threads, but that guy keeps on running, and now you have this big time period where you're completely idle. The way you usually implement something like this is with a little timeout that says: I tried to pause all my worker threads, but this guy is taking too long to finish, so I can't do it this time; let transactions execute again, maybe that long-running transaction finishes, and then come back around and try the naive snapshot again.

One interesting approach, which the HyPer guys originally proposed and which I actually think is kind of cool, is that instead of blocking all transactions, you just fork the database system process, and you have the child process of the fork be responsible for writing out the checkpoint. What happens is you fork the process, and the same transactions that were active in the parent are now active in the child; you then have the child go through and abort all those transactions and roll back their changes, and now you have a consistent snapshot of the database that you can write out directly to disk without worrying about interleaving from active transactions, because all the new transactions get fired off in the parent process, not the child. This was one of the original ideas they proposed in the first incarnation of HyPer, which I thought was really cool. We actually tried to implement it in H-Store, but it turned out to be a terrible idea: if you fork something that runs on a JVM, then when the parent process runs garbage collection, it starts reorganizing all its pages, which causes the operating system to copy those pages, since the fork is copy-on-write, and your performance basically completely tanks. You also have the problem that when you fork the JVM, it doesn't respawn things like the garbage collection thread and the other internal threads, so you end up with this weird child JVM that has a lot of problems. So I think this is a good idea, but it only works if you're not using Java or Scala.

As for copy-on-update snapshots, we've already talked about this; it's basically what VoltDB does. You switch into a special mode, and instead of doing in-place updates, a transaction always copies whatever it wants to modify to a new location, a shadow copy, and that way, when the checkpoint thread scans through, it just ignores anything created after the checkpoint started. And you can do this at different granularities.
VoltDB does it on an individual tuple basis; in other implementations you can do it on a per-block basis.

So now, the two issues with the naive snapshot and the copy-on-update snapshot are these. In the naive snapshot case, you have to wait for the checkpoint thread to finish before you're allowed to start executing transactions again, so you get this pause in your throughput. In the copy-on-update case, you may have to acquire latches that are being held by the checkpoint thread, to make sure nobody mucks around with the layout of tuples in indexes or other structures while it's going through the data. So you can have little mini-pauses in the copy-on-update case that you'd like to avoid.

The way to get around this is with the two wait-free approaches proposed in this paper from the guys at Cornell. I'll go through both quickly. The main idea is that, instead of keeping a single copy of the entire database in memory, they keep multiple copies, and they trade that additional memory overhead for avoiding the waiting and the locking and latching of the other approaches. We'll talk about wait-free ZigZag first, and then wait-free Ping-Pong.

In wait-free ZigZag, they maintain two bitmaps that keep track of which copy of the database a transaction should modify and which copy the checkpoint thread should read from and write out to disk. This lets you avoid copying things at the moment a transaction modifies them: you just assume you have these two copies from the beginning, and the bitmaps tell you, for every individual tuple, which of the two copies to look at. So we start at the very beginning with our two copies of the database in memory, plus our two bitmaps saying, for each tuple, which copy a transaction reads from and which copy it writes to. In this case, if we're a transaction and we look at our read bitmap at offset zero, it says that for this tuple we should look at the first copy; and if we want to modify it, the write bitmap says we should write to the second copy. Now the checkpoint thread starts. At the very beginning it looks at the write bitmap, and that tells it where to find the consistent version of the database at the point in time when the checkpoint starts: the write bitmap tells you where the writes go, so the opposite of it tells you where the consistent reads go. You take the write bitmap, negate everything, and that tells you which copy of each tuple holds your consistent snapshot. So while the checkpoint runs, transactions look at the write bitmap, which tells them, for each tuple position, which copy to modify; and after a transaction modifies a value, it updates the read bitmap so that the next read of that tuple knows this is the copy to look at. So we have to keep these two in sync, this one and this one.

Then, when the checkpoint finishes, we've written a complete consistent copy out to disk. When we start the next checkpoint, we do the same thing: we flip the write bitmap so the writes now go to the other copy of each tuple, and we negate it to find the consistent view, which now zigzags across the two copies, because we negate wherever we wrote, and that tells us where to read from. And then, same as before, whenever a transaction reads or writes anything, the bitmaps tell it where to go. So in this case, if we want to modify the first tuple, we modify it in the first copy, and if we want to modify the second tuple, we modify it in the second copy. Is this clear? So this is a lot of bookkeeping: we have two copies of the database, and we use these bitmaps to say where the checkpoint thread should be looking and where transactions should be reading and writing.
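Here's a compact sketch of that ZigZag bookkeeping, treating each record as a single long for brevity and assuming the bitmap flip at checkpoint-begin happens at a quiescent instant; the paper by Cao et al. has the full details:

```java
// Sketch of Wait-Free ZigZag (after Cao et al.): two full copies plus two
// bitmaps. mr[i] says which copy transactions read for record i; mw[i] says
// which copy they write. The checkpointer reads the copy writers never touch.
final class ZigZag {
    static final int N = 4;                 // number of records (tiny example)
    final long[][] copies = new long[2][N]; // the two in-memory copies
    final boolean[] mr = new boolean[N];    // read-from bitmap (false = copy 0)
    final boolean[] mw = new boolean[N];    // write-to bitmap

    long read(int i)          { return copies[mr[i] ? 1 : 0][i]; }

    void write(int i, long v) {             // every write touches ONE location
        copies[mw[i] ? 1 : 0][i] = v;
        mr[i] = mw[i];                      // readers follow the newest version
    }

    long[] checkpoint() {                   // runs in the checkpoint thread
        for (int i = 0; i < N; i++) mw[i] = !mr[i]; // writers zigzag away
        long[] snapshot = new long[N];
        for (int i = 0; i < N; i++) {
            snapshot[i] = copies[mw[i] ? 0 : 1][i]; // copy no writer touches
        }
        return snapshot;                    // consistent as of the flip
    }
}
```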
So in wait-free Ping-Pong, we bring back a third copy of the database. We have a master version where we apply all our current writes, and a shadow copy from which the checkpoint thread can read a consistent snapshot of the database and write it out. The idea is that we can do all our updates in the master version without interfering with the state of the snapshot in the shadow copy. The problem with this approach, as we'll see in a second, is that unlike wait-free ZigZag, where every write a transaction does only has to update one location, in wait-free Ping-Pong every single write has to update two locations, in two different copies.

So here we have our master pointer saying that copy two is where all new updates go. We have our complete, original copy of the current state of the database. And copy three is the shadow; this is what the checkpoint reads. Our checkpoint thread starts here: it has a consistent snapshot of the database, and it just scans through and writes all these tuples out, one by one. Meanwhile, transactions show up, and any time we do a write, we update the complete copy of the database, but also update our master copy, and we flip a bit that says this slot has been modified since the checkpoint started. Now the checkpoint thread finishes, and we want to switch: this copy becomes the new master, that one becomes the shadow, and the checkpoint thread is going to read through the new shadow and write it out as the next checkpoint. Before that, we flip all the bits on the new master to zero, to say nothing has been modified there since this checkpoint period started, and we can zero out the memory locations as well, so we don't see dirty data when we come back to it the second time. Then we flip the master pointer to say that copy three is the current master, and the old master is what the checkpoint thread will look at. But the problem is that we don't have a complete view of the database there, because we didn't modify every tuple during that period; some of the slots are still empty. So what we have to do is either go out and copy all the things we're missing from the complete copy of the database into the shadow copy we're checkpointing, or go to disk, find the last checkpoint we took, and fill in the holes from there. Then we can scan through the whole thing and write it out as a consistent snapshot. So this lets us avoid the overhead of locks and the bulk bitmap flipping, but we have to do this extra work, or go fetch things from disk, to fill in the gaps. And this is why I don't think anybody actually does this.
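And here's a matching sketch of Ping-Pong under the same simplifications. Note how every write now touches two locations; for the hole-filling I keep the previous snapshot in memory, which stands in for the "copy from the complete copy or from the last checkpoint on disk" options described above:

```java
// Sketch of Wait-Free Ping-Pong (after the same paper): a live state plus two
// extra copies that swap roles between master (absorbing new writes) and
// shadow (being written out). Every update touches TWO locations, which is
// the extra cost relative to ZigZag.
final class PingPong {
    static final int N = 4;
    final long[] current = new long[N];            // the live database state
    final long[][] extra = new long[2][N];         // the two checkpoint copies
    final boolean[][] written = new boolean[2][N]; // per-copy dirty bits
    int master = 0;                                // which extra copy gets writes
    long[] lastSnapshot = new long[N];             // previous checkpoint contents

    void write(int i, long v) {                    // transactions call this
        current[i] = v;                            // location 1: live state
        extra[master][i] = v;                      // location 2: master copy
        written[master][i] = true;
    }

    long[] checkpoint() {                          // swap roles, drain shadow
        int shadow = master;
        master = 1 - master;                       // writes ping-pong over
        java.util.Arrays.fill(written[master], false); // reset its dirty bits
        long[] snapshot = new long[N];
        for (int i = 0; i < N; i++) {              // fill holes from last time
            snapshot[i] = written[shadow][i] ? extra[shadow][i] : lastSnapshot[i];
        }
        lastSnapshot = snapshot;
        return snapshot;
    }
}
```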
All right, so the different ways we can implement all of these involve bulk state copying, locking, bulk bitmap resets, and paying a bigger penalty in memory usage. These are the design decisions you have to make about what your checkpoint method actually does. I'm not going to go through all of them, but the basic idea is that wait-free Ping-Pong and wait-free ZigZag pay an additional penalty in higher memory usage to avoid some of the bottlenecks you can hit in the naive case or the copy-on-update case. There's a table here that shows, for each method, which of these design decisions or trade-offs it has to make. As you can see, in the case of wait-free Ping-Pong, you have to keep three copies of the database, which I think is not a good trade-off. I went through this quickly just to show you that there are different ways to implement the checkpoint method in your system, other than just the Silo way and the VoltDB way.

So up until now, we've been talking about checkpoints, logging, and recovery, all assuming that the database has crashed in a bad way. But not all database restarts, where you have to recover from a log and a checkpoint, are due to hard crashes. It could be that you want to update the OS kernel or its libraries, you want to upgrade the hardware, or maybe you want to update the database software itself. The first two cases are probably going to require you to restart the node where the database system is running; in those cases, yes, you'll have to shut the system down and then load back up from the last checkpoint and replay the log to get back to a correct state. But the last one, updating the database software, doesn't require a shutdown or restart of the operating system, just a shutdown and restart of the database system itself. So ideally it would be nice if we didn't have to pay the huge penalty of replaying our logs and reading everything back from disk just because we want to upgrade the database software. That's the motivation behind the Facebook method for doing fast restarts using shared memory; they're focusing on this third case.

They're doing this in a system called Scuba. It's not a publicly released system; it's an internal database system developed at Facebook to let them do really fast event detection and anomaly analysis on log data generated by all their different services. The basic way to think about this is: take something like the Facebook timeline. There are all these different services that get invoked in order to generate the HTML that's sent back to the user's browser.
What they want to be able to do is look at cases where, say, timeline generation took a long time, look at all the different services in the stack, and figure out which one of them was causing things to run slowly. All of their software generates these log events about how long every individual request took, and you want to throw that into a database and run analytical queries against it to identify where you have problems. Scuba is a system designed to do this. It's not doing transactions; it's not doing stuff that directly faces the user; it's all internal back-end stuff. So they can relax some of the protections a database system would normally provide, because they don't care so much whether the queries executed on Scuba are 100% accurate. They have millions and millions of people using their systems, so if some log events get dropped, if you lose a couple of events a day, yes, in the back? Right, so if a bunch of events get lost here and there, it's not a big deal. So Scuba is a system for doing this kind of log analysis; it's not guaranteeing all the protections that MySQL would.

They actually have an interesting architecture. It's a heterogeneous setup: there are leaf nodes in the system, where they actually store the data in an in-memory database with an on-disk log, and then there are these aggregator nodes that sit above them in the hierarchy, which are responsible for sending query requests down to the leaf nodes; the results come back up, the aggregators combine them, and push them further up the tree. This is actually the same architecture that the MemSQL guys use. So it looks like this: you have these leaf nodes at the bottom, and that's where the actual database is, maintained for all the new events coming in. Every time you click on something in Facebook, it generates a bunch of log events that go through Scribe, which is their version of Kafka, their message broker system. Those messages arrive at the leaf nodes and get stored in the database. Now if you, as an internal developer at Facebook, execute a query, it first lands on the root of the aggregator tree and then percolates down; the actual scans happen at these lower guys, and the results get combined on the way back up. You can think of it like a MapReduce model, but doing SQL queries, GROUP BYs and aggregations. So when we talk about doing fast restarts, we're talking about doing it on the leaf nodes, because that's where the data actually is. The aggregator nodes are stateless; they don't maintain any information about what's in the database, they just know how to take results and combine them together. So these guys are CPU and memory bound.

OK. So at Facebook they have this coding practice, or edict, or philosophy, that they want to push out updates all the time. I think their turnaround time is every two weeks or four weeks or something like that. They're making changes to their products and their systems all the time, and they're deploying those changes on a really short interval. Rather than taking years for every update, they're doing things in weeks.
So the problem is, if you restart a database every two weeks and you have to reload from disk, it's going to be really slow, especially if you're doing a mass upgrade across all the nodes of a really large cluster. What they want is for the database system to store the current contents of its memory in shared memory, restart, come back up, look in shared memory, and pull everything back into its heap. You're allowing the contents of memory to outlive the lifetime of the process, so the new process can reuse the old memory and doesn't have to scan everything from disk again. That could take hours, whereas in this scheme you can restart in seconds. And that matters a lot when they're making changes all the time, because the amount of time any node in the cluster is down is reduced significantly. Again, they don't care so much about having 100% accurate queries, since they're okay with losing log events, but you still don't want your nodes churning and taking a long time to recover. So there are two different ways to do the shared-memory restart. The first approach is to rewrite your memory allocator to use shared memory: instead of calling the libc malloc, you have your own version of malloc that knows how to carve things out of shared memory, and when your process restarts, the allocator knows how to go find the things it left behind the first time. What's really fascinating when you read the paper is that Facebook actually employs the guy who wrote jemalloc, which is a faster version of malloc, and they spent a lot of time talking with him about whether this would actually work. The conclusion was that you don't want to do this: it would be a huge pain in the ass to use shared memory inside malloc, because you'd have to subdivide everything into separate segments; it would be bad for performance; and it's very tricky to maintain thread safety, because if you have multiple processes all allocating out of the same shared memory regions, how do you make sure they don't clobber each other? Another problem with shared-memory heaps is that, unlike regular malloc, where a chunk of allocated memory isn't physically backed yet because the OS waits until you actually touch it and take a page fault, when you allocate in shared memory, it gets allocated right away. And that would be slower and more problematic for what they wanted. So what they propose instead is this: when the database system is about to restart, just copy whatever is in the heap out to shared memory, restart the process, and then copy it from shared memory back into the heap. That's easier because you avoid all that runtime overhead. You do have to add some extra machinery to make sure everything is safe and nothing goes wrong on restart, but it's not that much more work. That's essentially what Facebook is doing with these SCUBA restarts. The idea is: the system is told it needs to restart, it makes an in-memory checkpoint of the database, writes it out to shared memory, and when that completes, it writes a little log message saying, I took this checkpoint and I know I completed everything.
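Here is a minimal sketch of that copy-out / copy-in approach using POSIX shared memory. The segment name and the single flat buffer are simplifications made up for illustration; the real system snapshots its whole heap and, as described next, also validates the segment before trusting it.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/scuba_restart_demo"   /* hypothetical segment name */

/* "Shutdown" path: copy heap state into a shared memory segment that
 * outlives this process. */
static void copy_out(const void *heap, size_t len) {
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); exit(1); }
    if (ftruncate(fd, (off_t)len) != 0) { perror("ftruncate"); exit(1); }
    void *shm = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) { perror("mmap"); exit(1); }
    memcpy(shm, heap, len);      /* heap -> shared memory */
    munmap(shm, len);
    close(fd);                   /* the segment sticks around after exit */
}

/* "Restart" path: the new process pulls the state back into its heap. */
static void copy_in(void *heap, size_t len) {
    int fd = shm_open(SHM_NAME, O_RDONLY, 0600);
    if (fd < 0) { perror("shm_open"); exit(1); }
    void *shm = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) { perror("mmap"); exit(1); }
    memcpy(heap, shm, len);      /* shared memory -> new heap */
    munmap(shm, len);
    close(fd);
    shm_unlink(SHM_NAME);        /* done with the segment */
}

int main(void) {
    char state[32] = "table contents live here";
    copy_out(state, sizeof state);       /* old process shutting down */
    char restored[32] = {0};
    copy_in(restored, sizeof restored);  /* new process starting up   */
    printf("restored: %s\n", restored);
    return 0;
}
```

On Linux this may need -lrt when linking with older glibc. Note that no disk I/O is involved anywhere, which is the whole point.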
That way, when the database restarts and comes back, it won't try to pull in contents of shared memory that are incomplete because the process crashed partway through the copy. They also have some checks so that if the shared-memory layout changed from one software version to the next, you don't come back, read it in, and end up with corrupted data. So they have a bunch of mechanisms in place to make sure these restarts are seamless and fail-safe (there's a little sketch below of what those checks might look like). And if you ever restart and find that your shared memory is corrupted, or isn't what you expect it to be, then you just recover from disk the way you normally would, because that path is always going to be durable. So again, they're not doing this as a checkpoint the way we would for recovery; they're doing it so you can restart the database system itself and reuse what was already in memory. In the back, yes? No, this is all on a single machine; it has nothing to do with any other node. It's a single node: I'm restarting, I copy everything to shared memory, I restart the process, and I copy it back in. Correct, yes, so this is not a crash. This is: I have a new version of my database system, I want to push it out right away, so I issue the shutdown request, it writes everything out to shared memory, I restart with the new version of the software, and I copy everything back in. It's not for recovery at all; it just avoids having to read everything back from disk. That's a good point, though. Okay, any questions about this? I think it's a clever idea and it was a neat little paper, but, how do I say this, it doesn't guarantee your database is durable; it just makes it easier to cycle things through.
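The completion marker and version checks could look something like this header-before-data layout. All the field names and constants are invented; the only thing taken from the description above is the ordering: mark the segment incomplete, do the copy, flip the completion flag last, and on restart fall back to the durable disk path if any check fails.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical header stamped onto the front of the shared-memory
 * segment. The exact layout is made up; the order of operations is
 * what matters. */
typedef struct {
    uint32_t magic;          /* "is this segment even ours?"          */
    uint32_t layout_version; /* bumped when the in-memory format changes */
    uint8_t  complete;       /* set LAST, after the copy finishes     */
} ShmHeader;

#define SHM_MAGIC      0xDBDBDBDBu
#define LAYOUT_VERSION 7u        /* this build's expected layout */

/* Writer: stamp the header with complete = 0, copy the data, and only
 * then flip the flag. A crash mid-copy leaves complete = 0. */
static void begin_checkpoint(ShmHeader *h) {
    h->magic = SHM_MAGIC;
    h->layout_version = LAYOUT_VERSION;
    h->complete = 0;
}
static void finish_checkpoint(ShmHeader *h) { h->complete = 1; }

/* Reader on restart: only trust the segment if every check passes;
 * otherwise recover from the on-disk checkpoint and log as usual. */
static bool shm_usable(const ShmHeader *h) {
    return h->magic == SHM_MAGIC
        && h->layout_version == LAYOUT_VERSION  /* layout changed? */
        && h->complete;                         /* copy finished?  */
}

int main(void) {
    ShmHeader h;
    begin_checkpoint(&h);
    /* ... copy database contents into the segment here ... */
    finish_checkpoint(&h);
    printf("segment usable on restart? %s\n", shm_usable(&h) ? "yes" : "no");
    return 0;
}
```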
Okay, so what are my parting thoughts? Logical logging is obviously faster than physical logging at runtime, but as we saw, there are certain corner cases and scenarios where it can make recovery more problematic. If your application has enough hardware to replicate things as much as possible, then maybe that's okay, and you don't pay that huge penalty because you always have a backup node. It's my opinion that the copy-on-update checkpoint method is probably the best way to go if you're using a multi-version system. I don't know if anybody is using the wait-free ping-pong or zigzag stuff, because I don't think most places would be comfortable having their database double or triple in size just to take faster checkpoints. And as we saw in the Facebook restart case, shared memory actually has some use after all. I think it's a neat little trick, and actually something I think would be cool to add to our system.

Okay, so just as a quick checkpoint of where we are in the course: we have gone through all of these topics so far. Again, these are what I'd call the core fundamental principles of database system implementation, the things you need to have if you're ever building a new database system; right before spring break, this is what we've covered. For the rest of the semester we're going to talk about the remaining topics. These are the added bonuses, the extra stuff, the things you'd have in a modern system to help you get better performance. Not every system out there is going to have these, whereas most systems do have the fundamentals we've covered. But these are the kinds of things you want if you're building a new system that can take advantage of new hardware, new problems, and new applications, and I'd say they're the kinds of things you could consider for your Project 3 to make it kind of cool.

Okay, so Project 2 is not due this Wednesday; it's due the Wednesday during spring break, March 9, at midnight. Everyone's proposal for Project 3 is still due the Monday we come back after spring break, March 14, when every group will come up here and present for five minutes on exactly what they're going to work on. And there's no reading required for this Wednesday's class, because I'll spend the time talking about topics for Project 3: I'll come up and propose different things you could work on, and then you go figure out what you actually want to do. I'll also talk about the extra credit we're going to have in the course; it's not up on the website yet, but I want to discuss it so people can get started on it. Okay, any questions? All right, don't use the bathroom on the 9th floor, because there's no water. Have a good day.