OK, folks, I think it's time to start. Welcome. So today we have a very long lecture. It's actually the first of two lectures. We are going to discuss transactions. But first, I have some announcements and some other discussion. So Homework 3 is posted. This is a programming homework. You are asked to write a small application in Java. For those of you who are developers, this is a very naive application. The purpose is not to do application development. The purpose is to understand how to connect to a database from Java, and how to do dependent joins. A dependent join is when you have to join data from two separate databases, and then you essentially use a nested loop: you have to implement the join in the application itself, to understand what it means to integrate databases, and also to practice transactions. You are required to add start transaction and commit or rollback instructions. So this is where the focus should be. Did anyone look at the homework yet? OK, so it's over there. It's posted, and there is starter code. And again, don't judge the quality of the Java program, but focus on the database part, because I wanted it to be as small as possible. The number of lines is reduced to a minimum, such that you can practice connecting to the database with a minimum number of lines. What you have to write is about 180 lines of Java. That's all you need to write. But think about what it means to connect to two different databases. How can you optimize it? Where are the pain points? What are transactions for? How do you use transactions in a database application? I didn't prepare any slide about Homework 2. Are there any issues with Homework 2? OK, so in addition, we had a reading assignment for today, namely the paper "Answering Queries Using Views" by Alon Halevy. How many people have read it? Oops. OK, then the rest of you should respond to this discussion by next week. I would really ask you to read sections 1 and 2 only; section 3 is optional. We discussed homomorphisms and query containment, in case you did not understand those from the last lecture. But do read sections 1 and 2; they give you a very nice perspective on query answering using views. And then we will discuss these questions next time. OK, so today we will discuss transactions. The way I want to proceed is this: I'm going to discuss some basics, what a transaction is. Then I hope to finish covering, completely, a simple log-based recovery algorithm, which is based on the second textbook, the book by Garcia-Molina, Ullman, and Widom. And I will start on the basic definitions of concurrency control. We will discuss concurrency control mechanisms next time. Next time we will also discuss an advanced recovery mechanism, the ARIES system, which is described in your main textbook. So, for this lecture, I would ask you to read chapters 16, 17, and 18 in our main textbook. But a warning: the material on recovery, and some of the material on concurrency control that we will cover next time, is described only in the second textbook. The reason I chose to present that material is that our main textbook jumps directly into a discussion of the ARIES system with all its complexity. What I like about the approach taken by Garcia-Molina, Ullman, and Widom is that they have a very simple recovery method that uses very simple principles, and it's much easier to understand what's going on.
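[A side note on the Homework 3 setup from the beginning of the lecture. Here is a minimal sketch of what a dependent join over JDBC, wrapped in a transaction, can look like. The connection URLs, table names, and column names are hypothetical placeholders, not the actual homework schema.]

```java
import java.sql.*;

public class DependentJoin {
    public static void main(String[] args) throws SQLException {
        // Hypothetical URLs and schema: the real homework supplies its own.
        try (Connection db1 = DriverManager.getConnection("jdbc:postgresql://host1/movies");
             Connection db2 = DriverManager.getConnection("jdbc:postgresql://host2/rentals")) {
            db2.setAutoCommit(false); // plays the role of START TRANSACTION on the second database
            try (Statement outer = db1.createStatement();
                 ResultSet movie = outer.executeQuery("SELECT mid, title FROM movie");
                 PreparedStatement inner = db2.prepareStatement(
                         "SELECT customer_id FROM rental WHERE mid = ?")) {
                // The "dependent join": a nested loop in the application,
                // one parameterized query against db2 per row coming from db1.
                while (movie.next()) {
                    inner.setInt(1, movie.getInt("mid"));
                    try (ResultSet r = inner.executeQuery()) {
                        while (r.next()) {
                            System.out.println(movie.getString("title")
                                    + " -> customer " + r.getInt("customer_id"));
                        }
                    }
                }
                db2.commit();   // COMMIT
            } catch (SQLException e) {
                db2.rollback(); // ROLLBACK: undo everything on error
                throw e;
            }
        }
    }
}
```

[The pain point to notice in a sketch like this: one network round trip to the second database per outer row, which is exactly the kind of cost the homework asks you to think about.]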
So, as I said, when we later discuss ARIES, you will have a deeper understanding of these mechanisms. OK, so that's the reading assignment. Now, transactions. Here is what the problem is. The problem is that we have multiple applications that talk to the same database, and they all want to do updates, possibly multiple updates. We need to allow these to run concurrently. The updates of multiple applications might interfere, and moreover, the system might crash at any time. This is worrisome because it might compromise the consistency of the database. The solution to this problem, which is a major problem, is an abstraction called a transaction, which bundles together instructions from an application and treats this bundle as one unit, a transaction. And then we need to worry about how to implement transactions in the system to guarantee that they don't interfere. That is the idea. Transactions were really invented in the database realm. And I mention here the three Turing awards that were ever awarded for contributions to databases. The Turing award is like the Nobel Prize for computer scientists; it is the highest award that can be given to a computer scientist. And there have been only three given for databases. The vast majority go to theory, to mathematicians who prove theorems, and most of the rest go to systems. Only three went to database researchers. One went to Charles Bachman, ironically at a time when his approach to databases was dying: he was the main promoter of the CODASYL data model. In 1970, Ted Codd proposed a competing model, the relational data model. And for ten years, the database researchers debated and disagreed about which model is better. In the end, the relational data model won because it's conceptually clean. But Charles Bachman got the Turing award for the CODASYL data model in 1973. Then Edgar Codd got the Turing award for inventing relational databases in 1981. And Jim Gray received the Turing award in 1998 for inventing the concept of transactions, which today are really ubiquitous. They are everywhere. They are used in all applications. And Jim Gray, yes, question? The normal form used in normalization, Boyce-Codd normal form, is that the same Codd? Yes, the Codd in Boyce-Codd normal form is Ted Codd. Once the relational model was proposed in 1970-71, people started to look at various issues, various problems. One of them was how to best represent data in the presence of functional dependencies, and one of the theoretical results was Boyce-Codd normal form. I suppose Boyce and Codd worked together on this. So, Jim Gray. He was a researcher at Microsoft, and he disappeared a few years ago in very, very bizarre circumstances. It was a great, great loss for the computer science community. OK, so what are these transactions? Well, to better understand what transactions are, let's imagine for a while how the world would look if we had not invented transactions. Could you still write an application that modifies data in a database? Well, a database is just a collection of files. So you could simply write data to files to ensure that whatever updates your application is doing are durable, and you would just rely on the operating system's concurrency control mechanism to handle concurrency. But as you will see immediately, and as you can probably guess, two things can go wrong.
Either the system crashes, and then we get in trouble, or the interference between multiple applications gives rise to some anomalies that I will describe. Three kinds of anomalies are famous, and they have names. Many other things can go wrong too, and they might not necessarily have names. So here is what happens if the system crashes. Imagine you wrote an application that moves $500 from Fred's account to Joe's account. And after you subtract the $500 from Fred's account, the system crashes. Obviously, we are in big trouble, right? Because we can restart the system, and the file is still here, but we don't know if the system crashed before or after we managed to subtract the money from Fred's account and put it into Joe's account. So there is no way to recover from the system crash. The concurrency control anomalies are fun because they have these names. Here is one anomaly. This is like in Homework 3: you're writing an application that allows a customer to rent a movie, which means you need to keep track of how many movies this customer has rented. So the first client updates the customer and increases the rentals by one. But at the same time, Fred operates from a second terminal, and he also increments the number of rentals by one. This is a second application. So what can go wrong here? We add one twice, but we may end up adding only one. And this is because of the way file systems work. The first application reads the block containing Fred's rentals into main memory. The second application reads that block. Then the first application increments, the second application increments, and they write the block back. Only a single write wins; the first write is overwritten by the second. So depending on how the low-level operations of these two applications are interleaved, we might end up losing an update. Do you see why we lose an update? The second famous anomaly is called inconsistent reads. It should be pretty clear what's going on here. The first application wants to do something simple: it wants to move five products from Gizmo to Gadget, because they are probably being renamed from Gizmo to Gadget; they have been re-labeled. And the second application also wants to do something very innocuous: it just wants to compute the total quantity of all products in the database. What can go wrong here, depending on how these two applications are scheduled? Will the first client succeed in transferring five products? Sure. I can't see anything that can go wrong there. The second client doesn't write anything; the second client just reads. But what can happen to the second client? What the second client reads is five more than what is actually there. Right. The second client might be scheduled exactly here. So maybe we execute this piece first, then this piece next, and then this one last. Then the second client will see five more products than there should be. And of course, then the boss complains, and when the database administrator checks the same queries, he sees the correct results. So what's going on? It's very, very difficult to catch such a bug. So this is an inconsistent read. The third famous anomaly is called a dirty read. And this arises in the context of transaction rollbacks, or aborts. But it's easy to understand, even if you don't know yet what an abort or a rollback is. It's when we change our minds, when we want to cancel the actions that we did so far.
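[Before the dirty read example continues, here is a minimal sketch of the first anomaly, the lost update, in JDBC terms: the same read-modify-write that both clients run, outside any transaction. The schema customer(cid, rentals) is made up for illustration.]

```java
import java.sql.*;

// Sketch of the lost-update anomaly: two clients run this same
// read-modify-write concurrently, with no transaction around it.
public class LostUpdate {
    static void incrementRentals(Connection conn, int cid) throws SQLException {
        int rentals;
        try (PreparedStatement p = conn.prepareStatement(
                "SELECT rentals FROM customer WHERE cid = ?")) {
            p.setInt(1, cid);
            try (ResultSet rs = p.executeQuery()) {
                rs.next();
                rentals = rs.getInt(1);   // both clients read the same value, say 7
            }
        }
        try (PreparedStatement p = conn.prepareStatement(
                "UPDATE customer SET rentals = ? WHERE cid = ?")) {
            p.setInt(1, rentals + 1);     // both clients write 8: one increment is lost
            p.setInt(2, cid);
            p.executeUpdate();
        }
    }
}
```

[If two clients call this concurrently, both can read 7 and both write 8, and one of the two increments disappears.]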
So, back to the dirty read example. The first client wants to move $100 from account one to account two. And it does this in an unexpected way: first, it puts the money into account two, and then it checks that there was enough left in account one. If there was, it is done; otherwise, it takes the money back from account two. It's like rolling back, like aborting, canceling the previous action. And it's a perfectly correct application. You might wonder why we wrote it this way, but sometimes it actually makes sense. It makes sense to write applications this way in the third homework: because the conditions you need to check are so complicated, it's actually much, much easier to act on the request first and only check at the end that your invariant is satisfied. The third homework is actually designed to show you such a case, where it's easier to check, at the very end after you do the update, that the invariant is satisfied. So the second client does exactly the same thing, moving money from account two to account three. Now imagine a worst-case scenario in which these are interleaved in a bad way. What can happen? Yes? Account two can end up with a negative balance. Account two can end up with a negative balance. And this is because we execute this one first, then we execute this block of instructions second. Now x has been read, and we put $100 into the second account. But now we check x, and we see that there was not enough money to move, even though we already moved $100 into the second account. So when we execute the fourth group of instructions, now yes, y is greater than 100, because that is what we put there in step one, and therefore the second application will succeed. So now it kind of depends, because there is a race between subtracting the $100 back out of account two and these two instructions. And depending on how they are ordered, we will end up with a negative balance, or we might even miss the negative balance if there is also a lost update. So this is called a dirty read, in the sense that the value of account two that we read here, let me show this, might not be the final value. It might be a temporary value that will never really exist, because it's been undone here. That's the essence of a dirty read: you read a value that might be only temporary and might be rolled back. So this is a summary of the famous anomalies. They are not deep theory, but people refer to them. They are called lost update, dirty read, and inconsistent read. I have the definitions on the slide; I'm not going to read them, you can read them at home. So of course, we need transactions to solve this. And here is my slide with the definition of a transaction. A transaction is a set of one or more operations that either happen altogether or not at all. They either happen all or none. That means that either all the updates to the database happen, or none of the updates happen at all. And usually, a transaction reflects a real-world operation, a real-world transaction. For example, it can be transferring money between two accounts: that's one transaction. Actually, the term is borrowed from financial transactions. It can be renting a movie. Or maybe you want to purchase a group of products: a single product alone is not what you want; you need a group of products that work together.
And if you end up purchasing a single product, that is not what you wanted. Maybe you want to register for a class, and so on. So that's the definition of a transaction. It's a very simple concept, and it's also very simple to use. It's used like this: there is a start transaction instruction, and at the end, you place either a commit or a rollback. Rollback means abort. And transactions only refer to the updates to the database, and to the reads as well; only to the interactions with the database. In between start transaction and commit, there can be an arbitrary number of SQL statements, which are supposed to be executed in their entirety, or not at all. Is this concept clear, the concept of a transaction? Yes? Is it right that it's not only that they all happen or don't happen, but also that the rest of the world views them as all or nothing at the same time? There are different isolation levels. Are you referring to the fact that different transactions might interfere? But yeah, that's also a required attribute: they must happen in isolation. That's the right term. Transactions must be executed in isolation, as if they were the only application interacting with the database at that time. And we'll go slowly over these properties. So this is how you use transactions, with start and then commit or rollback. Just to give a short example: when you transfer money, you place a start transaction here, and you end it either with a commit or with a rollback. And similarly here: start transaction, commit or rollback. It's very simple. So now let's discuss at a deeper level what properties we expect from transactions. And this is the acronym ACID, which has been established since the 70s or early 80s. Later, as the theory of transactions developed, that theory did not match these letters very well. But the acronym is nevertheless part of our culture now, so I need to spend a few minutes discussing it. We say that a transaction has the ACID properties, and what we mean by that is that a transaction is atomic, consistent, isolated, and durable. Let's go over each of them in turn. Atomic means that it's all or nothing. It means that either all the updates of the transaction are reflected in the database, or none of its actions is reflected at all. Concretely, it means that the system is not allowed to crash in the middle of a transaction. Now, you can't tell the system, please don't crash in the middle of a transaction. We need to take some measures to make sure that it's as if the system never crashes in the middle of a transaction. So that's atomicity. The next key property (and these are the two key properties, atomicity and isolation) is isolation. Isolation says that when a transaction executes, it's as if it were the only transaction interacting with the database at that time. Its effect should be isolated from that of other transactions. Now, the theoretical concept that we will hopefully cover even today, and which replaces isolation, is called serializability. We will do a much more thorough study of serializability towards the end of this lecture. But the popular term for this is isolation. The other two properties (so far we've discussed only A and I) are less crisp. The next one, consistency, the C, is actually somewhat obsolete. But it says something important. It starts by recognizing that the database must satisfy certain constraints.
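[To make the start/commit/rollback usage concrete, here is a minimal sketch of the money transfer as JDBC code; in JDBC, turning off auto-commit plays the role of the start transaction instruction. The table account(no, balance) is a hypothetical schema for illustration.]

```java
import java.sql.*;

// Sketch: the money transfer as one transaction.
// The point is the commit/rollback pattern, not the SQL.
public class Transfer {
    static void transfer(Connection conn, int from, int to, int amount) throws SQLException {
        conn.setAutoCommit(false);              // START TRANSACTION
        try (Statement s = conn.createStatement()) {
            s.executeUpdate("UPDATE account SET balance = balance - " + amount
                    + " WHERE no = " + from);
            s.executeUpdate("UPDATE account SET balance = balance + " + amount
                    + " WHERE no = " + to);
            conn.commit();                      // COMMIT: both updates happen
        } catch (SQLException e) {
            conn.rollback();                    // ROLLBACK: neither update happens
            throw e;
        }
    }
}
```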
So the database must satisfy constraints, and these can be simple constraints, like key or foreign key constraints. But they can also be application-level constraints that might not be enforced explicitly, but that hold implicitly. For example, the sum of all accounts should be greater than 0. Or the number of products in the warehouse is greater than 0. Or the sum of the accounts is equal to 0, if you have debits and credits. So we must ensure that the database is consistent at all times. Now imagine an application that's written correctly. The application writer, the programmer, has ensured that if it starts from a consistent state of the database, then after the application executes, the end state is also consistent. The role of the transaction, the duty of the transaction, is to ensure that the database indeed remains consistent at the end. Now, this is actually a consequence of the transaction being atomic and isolated. If it's atomic, it means it's all or nothing: it can't crash in the middle and break this consistency requirement. And if it's isolated, it means that it's as if the transaction were the only one interacting with the database, so there is no interference from other transactions. So consistency is implied by atomicity and isolation. And the last one, durability, actually has an ambiguous interpretation. I found two different interpretations in the literature and textbooks. Actually, the research literature never refers to ACID; it refers to more concrete terms. But I found these two different interpretations in textbooks. One interpretation is just that transactions have to write data to disk: the updates should not sit just in volatile main memory, but should be written to disk. The second interpretation also means recovery. It means that once a transaction commits, even if some of its updates have not yet been forced to disk, it's the responsibility of the transaction manager to ensure that if the system crashes, those updates will somehow still be written to disk, or that updates of uncommitted transactions will be rolled back. But once the transaction commits, there is no more rollback. I prefer to have the detailed discussion when we get to more rigorous definitions; right now, we're not there yet. Just a quick discussion of the reasons for rollbacks. There are two reasons why transactions might be rolled back. One is because this is how the application is written: the application has an if statement, and if something goes wrong, it says rollback. And then, of course, all the updates need to be undone. But there is another reason: the system might decide to roll back some transactions, which is very unfortunate. For example, if it detects a deadlock, then there is no hope, there is no way out, so the system must choose a transaction and kill it. And then the application needs to restart that transaction, because it was aborted by the system. So there are two different reasons for rollback. OK, so that was a superficial, high-level overview of transactions. What we are going to do today is two things. We are going to discuss a simple but complete log-based recovery system. And then we are going to discuss the basic definitions of serializability. The system, again, is taken from the textbook by Garcia-Molina, Ullman, and Widom. And these are the chapters. I think these are the chapters from the latest edition; if your edition doesn't match this one, you'll find the corresponding chapters. They are about undo logging, redo logging, and undo/redo logging.
And what I like about this is that it's a very simple log, and it's complete. It's easy to understand. ARIES is way too complicated to fully understand in its totality. OK, so before we can actually discuss a recovery manager, we need to discuss the reason why we recover. Well, the reason why we recover is that the data sits on disk, and our updates are in main memory. So when we do updates, there is a delay between the moment when we do the updates and when they actually get written to disk. So I need to discuss these characteristics a little bit. There are a few orders of magnitude difference in performance between accessing main memory and accessing disk. You probably know this quite well. The disk latency, which is the term for how long it takes from the moment we request something to be read or written until this is actually performed, consists of the seek time plus the rotational latency. Now, what does that mean? What is the seek time for a disk? Yes? The time it takes to find a piece of data? The time it takes for the heads to go find a piece of data, exactly. The heads are like this, and they need to move and find the corresponding cylinder; it's called a cylinder. And the rotational latency should be clear: you have to wait until the platters rotate and the heads can actually read the data. So the disk latency is very high; it's a lot of time. Once the heads are positioned over the data that you want to read, the throughput you're getting is better. Correct me if I'm wrong, these numbers might be a little bit off: in the range of 40 to 100 megabytes per second. OK, let me ask you something more interesting. If you think about a period of, say, 20 years, which of these two numbers is likely to evolve, and which one is not? Both of them? What about the disk latency? What about the transfer time? Should we expect the latency to improve dramatically in 20 years? It's three orders of magnitude worse than main memory right now. For the latency? No; these are the numbers I learned in school many, many years ago, and you're looking at flash memory. I'm talking about traditional, old-style disks, which are still the cheapest way to store a large amount of data today. Well, if you want to know where it's going to go: with flash you can get 65 microseconds of latency, but that's a different thing. That's not the head latency; that's a different technology. Does it actually matter that it's a spinning disk versus multi-level flash? No, I just asked a simple question: can the speed of the head increase over the years? And the answer is no; the mechanical latency of the disk won't decrease. But the transfer rate did increase over the years. So why did the transfer rate increase? Did the disks start packing data more densely? Because of density, yes. The rotation speed doesn't increase much, but the density increased over the years, and this is why the transfer rate increased. On the other hand, your colleague has a point. There is now new technology that will definitely shake up the way we store data. But exactly how that is going to influence database systems is still not well understood.
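[To put rough numbers on the latency formula, here is a back-of-the-envelope calculation; the 9 ms average seek and the 7200 RPM rotation speed are typical values I am assuming, not numbers from the lecture:]

$$
t_{\text{latency}} \;=\; t_{\text{seek}} + t_{\text{rotational}} \;\approx\; 9\,\mathrm{ms} + \frac{1}{2}\cdot\frac{60{,}000\,\mathrm{ms}}{7200} \;\approx\; 9\,\mathrm{ms} + 4.2\,\mathrm{ms} \;\approx\; 13\,\mathrm{ms}.
$$

[Compare that with roughly 100 nanoseconds for a main-memory access: the gap is around five orders of magnitude, which is the whole point of buffering and of recovery.]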
Back to the new technology. The solid state drives, the solid state memory, are much better at append-only writes than they are at random writes. For disks it doesn't matter so much: if you write one block at a time append-only and you miss the rotation, it's not much better than a random write. But for solid state drives, I've heard of people doing experiments, and the difference is dramatic. So the thinking is that the new architecture of database systems based on solid state drives will be an append-only database, where you always append whenever you want to do an update. The entire organization of the database will be different. But the jury is still out on what those systems will look like, and the jury is still out on what exactly the solid state technology for future disks will be. All I want to point out is that there will still be a major difference between the access time to main memory and to disk. And this still justifies transactions, and thinking about the fact that there is a delay between the time we do the update in main memory and the time it propagates to disk. So, the buffer manager; that's what I wanted to get to. Because disks are much slower than main memory, copies of the blocks on disk are kept in a buffer that looks pretty much like the operating system's buffer. It's something like this. Every block that is read into the buffer is called a page, a disk page or block. And every slot in this buffer is called a frame. If there is no page sitting in a frame, then it's a free frame. So how does it work? At the top here, there is the application, which in our case is the database server. This application executes a query, or executes a request for an update. And at some point, it will say: please read for me block such-and-such. That operation is called a read. Now the buffer manager receives this read request and looks up its buffer pool, where it keeps a set of pairs of the form page ID and frame number. If it sees that the page is in the buffer pool (it's like a standard cache), it will just return it. But if it's not there, then it needs to read it from disk, and that operation we call an input. So that's an important distinction; it's just notation, notation that I'm going to use for several slides. Remember: read is a request to read something from the buffer pool; input is a request to read a physical page from disk. And similarly, write and output. Now, what happens if there is a read request, the page is not in the buffer pool, and there is no room left? Which is the typical case, right? Because in steady state, the buffer pool is full. So what do we do? We need to evict a page to make room. The algorithm that chooses which page to evict is called a page replacement algorithm. Which algorithms do you know for page replacement? First in, first out. First in, first out is a perfectly valid algorithm. Do you know another one? Least recently used. Others? Which one is optimal? There is a notion of an optimal algorithm from operating systems: you evict the page that is going to be used furthest in the future. Least recently used approximates this: the page that was used least recently is, heuristically, the one that will next be used furthest in the future.
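[Stepping back to the read path for a moment, here is a minimal sketch of the buffer manager logic just described, keeping the read-versus-input distinction; the page contents, the disk I/O, and the eviction policy are stubs for illustration, not a real system's code.]

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the read path: read() serves from the buffer pool when possible,
// otherwise falls back to input(), the physical disk read.
class BufferManager {
    static class Page { byte[] data = new byte[8192]; boolean dirty; }

    private final Map<Integer, Page> pool = new HashMap<>(); // pageId -> frame
    private final int capacity;                              // number of frames

    BufferManager(int capacity) { this.capacity = capacity; }

    Page read(int pageId) {
        Page p = pool.get(pageId);
        if (p != null) return p;                 // hit: served from the buffer pool
        if (pool.size() >= capacity) evictOne(); // pool full: page replacement kicks in
        p = input(pageId);                       // miss: physical read from disk
        pool.put(pageId, p);
        return p;
    }

    private Page input(int pageId) { return new Page(); }  // stub for the disk read
    private void evictOne() { /* pick a victim (e.g., LRU); output() it first if dirty */ }
}
```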
But LRU, as far as I know, is provably optimal in some sense. I don't want to dig deep into LRU; I just want to remind you what LRU does, so we can get a sense of a page replacement algorithm. There is another one, which is a lot of fun and is actually used a lot in practice, called the clock algorithm. I invite you to read about it in the book; try to learn as much as you can. The clock algorithm is a simplification, an approximation, of LRU, and it's much more efficient to implement. So let's see LRU in one slide. The essence of LRU is that it keeps a list of the order in which pages have been accessed. If you look at the first list here, it says that page 5 was the most recently accessed, and page 7 is the oldest one, the one not accessed for the longest time. Let's see how this works. Suppose the application says: I want to read page 6. Is page 6 in the buffer? Yes, it is. So page 6 will be returned to the application. But now we need to move page 6 to the beginning of the list. (I don't know why it's drawn there; I changed the font and this one is misplaced. It should be here.) So we move page 6 to the beginning, and here it is. The order of the rest remains unchanged; we just moved page 6. The next thing is that the application says: read page 10, which is not in the buffer pool. So what do we need to do? We need to evict. Whom are we going to evict? Page 7, because we know it was the oldest one. Of course, had we read page 7 instead (suppose we said: read page 7), then page 7 would have been promoted to the beginning, and then, when we request page 10, which one would be evicted? Page 3. The point is that there is no escape from keeping the entire order. You need to keep an ordered list of all the pages, which means more bookkeeping than you would like to do in the buffer manager, because a simple read operation now needs to manipulate some pointers in a linked list. And that's too much overhead. This is why people have looked at approximations like the clock algorithm. So, database management systems could, in theory, rely on the operating system's buffer pool, because the operating system has a buffer pool for the file system. But they prefer to re-implement it; as far as I know, all systems do. Of course the commercial ones do this, but even the free ones, MySQL and Postgres, all implement their own buffer pool, for two reasons. The main reason is that they get better control for transactions: they can force pages to disk if needed, or they can pin them in memory if those pages need to stay there for a while. The second reason that many textbooks give is that you can improve on the LRU algorithm if you know the query workload. If you know what query you're executing, then you can do better than LRU. The problem I have with this statement is that I only know of one single improvement. Can you think of an improvement? Think about typical queries. How can you improve on LRU for a simple query? The query that every textbook gives is a sequential read, like select star from a table, where the table is huge. If you use LRU, every single read request will be to a page that isn't there. So every single read request incurs the overhead of replacing a page. How can you improve over this?
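[As a sketch of the bookkeeping just described: in Java, a LinkedHashMap in access order implements exactly this list. Every lookup moves the entry to the most-recently-used end, and the eldest entry is the eviction victim. The capacity and page type are placeholders.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A compact LRU directory for a buffer pool, keyed by page ID.
class LruPool<P> extends LinkedHashMap<Integer, P> {
    private final int capacity;

    LruPool(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true: get() moves the entry to the MRU end
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Integer, P> eldest) {
        return size() > capacity; // evict the least recently used page on overflow
    }
}
```

[With the pool from the slide, a pool.get(6) followed by a pool.put(10, page) would evict page 7, exactly as in the example.]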
If you knew that you have to read the entire table, how can you do better than LRU? Cache the whole table, all the pages? Read ahead; that's one thing you can do, yes. You could partition your buffer cache between streaming and random access, or implement something where you preferentially evict pages that have been read but not modified. So you said partitioning into streaming and random access, and preferentially evicting pages that have not been modified. Clearly, if this sequential scan is done concurrently with random accesses, then the partitioning makes a lot of sense, because in one partition you could read ahead, and the other partition you could dedicate to the random accesses. I'm not sure about the second idea, preferentially evicting pages that have not been modified. Yeah, maybe. I don't know. But that's the one reason people give: you could potentially improve the performance of the buffer manager if you know the workload. So this is the picture. We have the buffer pool. The updates are done in the buffer pool, by many transactions concurrently. But right now, we focus on the fact that if the system crashes, we have some pages sitting in the buffer pool that did not make it to disk, and maybe some pages that made it to disk and shouldn't have made it there in the first place, because their transactions were not supposed to commit yet. So now let's start discussing recovery. But before we discuss recovery, let's have a short, high-level discussion of the types of failures, the types of crashes. Many things can go wrong with a database. Here are a few. Wrong data entry: if somebody types in the wrong amount for an account, then the data is wrong; if you type in the wrong customer name, then it's wrong. In the latest CACM, there is a very funny article by somebody complaining about the simple checks that today's applications don't do. He bought a computer, and he bought an extended warranty. And when they scanned the barcode, there were three barcodes on the box, and they scanned the wrong one. So now his computer was bought, he paid for the warranty, but the wrong barcode was entered. It took him I don't know how many days to sort this out. The way to solve it was that he had to return the computer and then buy it again. But then they mistyped his last name, and for some reason this could not be corrected. So it was a mess. How can you protect against that? There are three different barcodes on the box: one comes from the shipping company, one from the PC company, one from who knows what. You have to read the right one. What should you do? How can you prevent the clerk from inputting the wrong barcode? Make barcodes different lengths? They probably are of different lengths already, because they come from different realms: one comes from FedEx and the other comes from, whatever, the PC manufacturer. So they're pretty clearly different. If they have the same length, then you just check the code: check that it's a computer code, not a shipping code. Some simple checks prevent you from having such nightmares. How can you check for misspellings in names? What can you do? It's actually not that hard. Have a reference database of standard names, and check if that name is there. Nobody has my name. Nobody has my name either. That's OK.
So maybe you ask the clerk to confirm with the customer that the name is indeed correct. There are ways to protect against that. The second thing that can go wrong is a disk crash. Disks always crash after a while. And I'm still talking about the old mechanical disks; the solid state drives are even worse. They are supposed to support only a fixed number of writes. I heard the number 100,000. And I haven't heard anyone talk about what happens if you exceed 100,000 writes to the same block of a solid state drive. 100,000 writes? Yep. And then it dies, forever. Now, the physical layout of solid state drives is append-only and circular, so you probably need to do lots and lots of writes before you reach 100,000 writes to the same blocks. So when the solid state drive dies, you put in a much larger one. It could take you about two years. Yeah, that's the idea: you throw away your laptop and you buy a new one. So what happens when a disk crashes? Here we don't use database recovery. Instead, we use redundancy: we copy data, and we recover the lost data from the copies. Then, catastrophic failures: against these we can only protect using remote copies, remote backups. And the only failures we can protect against using database recovery are system failures, which can be due either to hardware (maybe the power went down) or to software: maybe there was an operating system crash, or a database crash, or an application crash, or the system hangs and you need to reboot it. Everything works fine, nothing has been destroyed, except that the state of the database is inconsistent. OK, so the key idea in all recovery systems, in all recovery algorithms, is to use a write-ahead log file. Imagine all those blocks sitting in the buffer pool that have been written but not yet sent to disk. The application thinks the blocks have been written, but they have not yet been sent to disk. Every update, everything that the buffer manager does, it will also write in this write-ahead log file, and it will force the log to disk. It will force every single log entry to be written to disk, such that when the system crashes, you know that the log, at least, is up to date. And that's the only guarantee you have from a write-ahead log. Now we need to use a lot of cleverness to figure out how to actually recover the database from that log. But the key principle is that there is a log file, this log file is append-only, and it is always forced to disk. OK, so for the purposes of recovery and concurrency control, transactions are modeled as very simple read and write operations on so-called elements. Think of an element as being a block. The abstraction that we will use is that a transaction is a sequence of reads and writes of blocks on disk. An element can also be a record, or it can be an entire relation. But elements are typically blocks, and in the next lecture we will have a more refined discussion about the choice of elements. For now, let's just imagine that the elements are blocks. So the operations that we will examine are read, which means read element x and store it in a local variable t, and write, which means write the local variable t to the element x. And remember, these happen between the application, the database system, and the buffer manager. If the buffer manager decides to go to disk, it will use the operations input and output. So let me make this clear with a very simple example.
Here is a transaction that does something very naive. It wants to read element a from disk, multiply it by 2, and write it back. Then it wants to read element b, multiply it by 2, and write it back. It's a single transaction. There is a simple invariant that we will always check: namely, if a and b were equal at the beginning of the transaction, then they should be equal at the end of the transaction. If not, we know immediately that something went wrong. So here is a very close look at what happens behind the scenes. On the left, you have all the actions, both of the transaction and of the buffer pool. Here we have the value of the local variable t. On the right, we have what happens on disk. Initially, we assume that both a and b are 8, and we expect them to be 16 at the end. So here is what happens. First the transaction says read a. But a is not in the buffer pool. So the buffer manager issues an input a, and as a consequence, a is read into the buffer pool. Now we can answer the read a request: 8 is moved into t, t is multiplied by 2, and it's written to a. So a in the buffer is 16. But notice that it's still 8 on disk; the update only happened in main memory. Now the transaction wants to read b and update it, so we are here. The buffer manager first inputs b; now we have b in main memory as well. We return b to the transaction, to the application; it's multiplied by 2, now it's 16, and it's written back. So now both a and b are updated in the buffer pool. And at this point, the buffer manager decides to write these pages to disk: first a gets there, and then b. OK, it was a simple example, and maybe I spent too much time on the details. Here is an interesting thing. Suppose the system crashes right here. What happens now? What are a and b? They're 16 and 8. They are no longer equal. That's the worst place for the system to crash, and that's what we want to protect against. At this point, we want to read the log and put everything back together again. OK, so I have here a picture with a crash. The log will have entries for every single update done to the database, and these updates are interleaved between multiple transactions. The idea is that when we recover from a crash, we will read the log, and we will figure out how to do the recovery. Good. So let's look in detail at what exactly we write in the log. We will discuss three kinds of logs; actually, two in detail, the undo log and the redo log, and then we will discuss how to combine them into undo/redo, which will be very, very easy. So, the undo log. The thinking in the undo log is that if the system crashes, all the transactions that have not committed need to be undone: their actions, whatever they've written to disk, need to be undone. For that, we write three kinds of entries in the log. Whenever a transaction starts, we write that it started; t here is the name, or the number, of the transaction. If it commits or aborts, we write that. And whenever the transaction t modifies an element x, we write the old value of x. The reason we write the old value is that we are doing an undo log: if the system crashes, we want to restore the old value of x, and that's what we care about. So the log is not that cheap. If the database makes one update to a record that sits in a block, it means that the old block needs to be written to the log.
This is not a cheap operation. Supporting recovery adds significant overhead to the database operations. Good. So let's see exactly what happens in the log. Here is our example. A start t is written here. At some point, when the transaction writes a, we write in the log the old value of a. Later, when the transaction writes b, we write in the log the old value of b. And when the transaction commits, of course, we write commit. OK, so now let the fun begin. Here is an interesting thing. Suppose the system crashes right here. This is our bad-case scenario, where the database is corrupted. So look at the log; here it is. The commit didn't make it into the log. How do you read it now? You look at the log. What can you tell me immediately, if you read this log, about the transaction t? Did it commit, or did it abort? Yeah, I would say it didn't commit. It didn't finish anything; it was still executing. In an undo log, the rule is: all the transactions that have not committed need to be undone. So what do we do? We have no clue where, in this picture, the system crashed. But we can see the log. So what would you do? Write a back, and write b back. So now a is 8, and b is 8, and life is good. They are back to where they started. Does the order matter? Should we write them in this direction, or in that direction? In reverse order, because it's undo. In this example, it doesn't matter. But if a was written multiple times, you want to undo the writes in reverse order, so that you are left with the oldest value of a. And the last thing: if there are multiple entries from multiple transactions in the log, which is the typical case, we only need to pay attention to those that belong to transactions that have not committed. So that's the basics of the undo log. Now I want to show you the recovery rules. Here are the rules written formally. If the system crashes, when we do recovery, we examine the log, and we proceed in two steps. First, we decide for each transaction whether it committed or not. If we see a matching start-commit or start-abort pair, then we are happy; that transaction is not our concern. But if we only see a start, then that transaction is something we need to worry about. And once we've decided which transactions we need to undo, we simply undo them, reading from the back. In fact, we can do this in a single pass: we just read all the entries from the back, and we undo. Let me show you an example. And don't read the questions yet; I'm going out of order over some slides because I think this is the best logical order. Imagine that the system crashed, and somebody gives you the log. Here it is; here is the end of the log. And now you need to recover. Let's recover. We read from the back, from the end. You see this entry. Do we update X2? Yes. So X2 will be V2. Next one. Do we update X3? Yes, X3 becomes V3. Now we see commit T5; nothing to do. Next we see T4, which has modified X4. Do we update? Yes, X4 becomes V4. Next we see this: X5 equals V5. Do we update? No, because T5 has already committed. So we skip this one. X1 equals V1. Do we update? Yes, X1 becomes V1, and so on. Now you've got the idea. So question one we have answered: which updates need to be undone? Namely, exactly the updates of the transactions that have not committed yet. Is there a question? OK.
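[Here is a minimal sketch of that undo rule as a single backward pass, matching the example we just walked through. The in-memory LogRecord form of the entries &lt;START T&gt;, &lt;COMMIT T&gt;/&lt;ABORT T&gt;, and &lt;T, X, old value&gt; is an assumption for illustration, not any particular system's format.]

```java
import java.util.*;

// Sketch of undo recovery: one backward pass over the log.
class UndoRecovery {
    enum Kind { START, COMMIT, ABORT, UPDATE }
    record LogRecord(Kind kind, String txn, String element, Object oldValue) {}

    // db maps element name -> current value on disk.
    static void recover(List<LogRecord> log, Map<String, Object> db) {
        Set<String> terminated = new HashSet<>();
        // Read from the end: remember who committed or aborted, undo the rest.
        for (int i = log.size() - 1; i >= 0; i--) {
            LogRecord r = log.get(i);
            switch (r.kind()) {
                case COMMIT, ABORT -> terminated.add(r.txn());
                case UPDATE -> {
                    if (!terminated.contains(r.txn()))
                        db.put(r.element(), r.oldValue()); // restore the old value
                }
                case START -> {} // nothing to do
            }
        }
    }
}
```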
Now the second question; we can contemplate this one too. As we are recovering happily, the system might crash again. What do we do? Are we in big trouble? OK: just redo, just start again and do it all over, because they're all idempotent operations. Just start again and do it all over. Exactly. We can repeat the recovery as many times as we want, and there is no harm done, because we always replace the values with the old values. Is there a question? Are we keeping a log while we're doing the undo? That's a great question. Are we writing anything to the log when we undo? No, we don't write anything to the log. That's why we can start again; that's why we can start again, exactly. OK. And the last question: how far back do we need to go? So, I had an open question in my mind. I never completely understood how long these transactions are in practice. The rule of thumb is that you need to write short transactions; transactions are expensive, and they consume system resources. But what does short mean? Are transactions a few seconds, or a few minutes? I asked a colleague who works on SQL Server: how long can transactions be? And he told me, oh, we have customers who run transactions for months. Wow. This is probably not the best practice, and he didn't tell me more. But apparently, this is what you need to prepare for: some people run transactions for months. Which begs the question: how far back in the log do we need to go? Until the checkpoint? We did not get to checkpoints yet; yes, until there. But without checkpoints, until the beginning, because maybe that transaction that started many months ago did one update, and we haven't heard from it ever since. And you need to go all the way back to undo that update. There is no place to stop. This is why checkpointing was introduced: to be able to shorten the recovery process. But before we discuss checkpointing, I need to discuss something much more important and, I would say, much more subtle. For that, I want to go back to the picture where I first showed you how the undo log works. And I want to ask you the following basic question. Look at this write, and look at this log entry. When does this log entry have to be written to the log? I said that it's being forced to the log. But we don't have to be that aggressive; we can postpone it a little bit. But how far? Clearly, we can't write it before we see the write operation. But can we write it later? Up to the point of output A? It must be before you force the data itself to disk. Exactly. That's exactly it; it's a very subtle observation. We can postpone it, but not past the output. Once we output A, the log entry must already be in the log. Otherwise, we run the risk that we output A, the system crashes, and we don't have the information to undo. So that's how far we can go. Now, what happens with B? Of course, the same: it has to be done before output B. Is it good to output early, to write the pages to disk early, or is it better to postpone writing them to disk? Postpone, right? Because maybe other transactions might want to read the same pages. So how far can we postpone these two outputs? Until you feel like it? No, there is a very strict rule. Think in the context of the undo recovery manager. Before the commit? Before the commit, that's the point. We must output these values to disk before we commit. Once we commit, we can only undo transactions that are not committed. But the transactions that committed, we have no way to redo.
If these outputs were updating an account from which we have dispensed money, the money is gone. So if we omitted writing the update to disk, and we said commit, which means we dispensed the money, then we are in trouble. So these are the two important rules in the undo recovery manager. They are written here, but I prefer to look at them on a slide like this. Namely: every entry in the log must be forced to the log before the corresponding output, and the outputs must be written to disk before we commit. Any questions about the undo recovery manager? Because now we are going to start discussing how far back we need to go. And by the way, the answer to question 2 is: somebody said the magic word. The reason why we can repeat the recovery process is that the operations are idempotent. So let's spend a minute on question 2. What happens if there is a second crash during recovery? We simply repeat the recovery process. The reason why this is possible is that the undo operations are idempotent. What does idempotent mean? It means that if you apply the same operation twice, it's the same as applying it once. In algebra, an operation star is called idempotent if it has the property that x star x equals x. Do you know of such an operation? Yes? Is plus idempotent? For what number? No way. Is multiplication idempotent? No. What is idempotent? Which one? Assignment. Assignment of a constant, like a equals something. That's true. So these operations are assignments of constants to a state. I'm just trying to come up with an example from algebra, a binary operation that is idempotent. Logical AND is idempotent: x and x is the same as x. Or set intersection: x intersected with x is x. Or set union. These are examples of idempotent operations. OK. So now back to the question: how far back in the log do we need to go? Well, without any precautions, we need to go to the beginning of the log. To optimize this, we use a concept called checkpointing. The idea is that from time to time, the system will decide to write stuff in the log, and maybe to force pages to disk, in order to guarantee that it doesn't have to read the log beyond that point during recovery. Now, in the case of the undo log, checkpointing can be done very, very simply. Here is one way to do it. When the system decides it's time to checkpoint, it stops accepting new transactions. It tells everyone: we are closed for business for a while. And it waits until all the transactions have committed. Then it can write a checkpoint record, which is a guarantee that all the transactions so far have committed, so you don't need to continue reading the log past that point. And then it can reopen the system for business; it can accept new transactions. OK, is this idea clear? Let's see how it looks. Here is an example of a log where we use such a checkpoint. The system does stuff here, and at some point it decides to checkpoint, which means no new transactions. And when all the transactions terminate, it can write the checkpoint. Therefore, if the system later crashes somewhere here, then as we read backwards, we don't need to go past the checkpoint record, because we know there are no more transactions waiting. OK, what's wrong with this? Is there a logical error or a performance error? What's wrong with this? Your customers with three-month transactions are pretty much out of luck?
Actually, the customers who run three-month transactions will be very happy. I'm afraid the other customers will suffer, because if we decide to checkpoint, we are going to wait for that customer who runs for three months to finish. And this is just one database. Yes? You're suggesting that you can only checkpoint after you've closed out all the current transactions? Can you repeat? You must wait for all transactions to end before you can checkpoint. Yes, that's what this says: you must wait for all transactions to end before you write this checkpoint. So this kind of crashes three months of whatever you were doing, or you have three months to read your log? Yes. With an undo log, well, with any log, if your transaction takes three months and the system crashes, then you lost it. But the problem here is that the system freezes during the checkpoint, and it freezes until the last transaction terminates. If one transaction really wants to run for three months, then you have to wait for three months. It's not a good idea. For this kind of customer, would the redo also take three months? And do we allow new transactions to come in when we're doing recovery after a crash? That's a good question. So during recovery, do we allow new transactions? It's a question independent of the checkpoint. It's kind of related: if you do not allow new transactions to come in, and you're redoing a log which has been there for, like, three months, will it take around the same amount of time? I think you're asking multiple questions, and they're all interesting, so let's take one at a time. When we do the recovery, do we allow new transactions to start? No. Recovery is a process that is done before we make the database available again. So during that time, we do not allow new transactions. And the second question you asked, which I find quite interesting, is: suppose the database runs for three months. The transactions are short, only minutes or whatever; actually, they might be long or short, it doesn't matter. The system crashes, and now we need to undo. Will it take three months to undo, or can it be dramatically faster? Yes? Hopefully much, much faster, because it's just restoring values. Exactly. The transactions may take a long time because, well, they interact with customers, they do some expensive computations; there are many reasons why they take a lot of time. But during undo, or during redo later, we just need to write values to disk. So it's much, much faster. Yes? Could you just carry open transactions forward across a checkpoint? We do this; we get to such a checkpoint in a second. Actually, right now. So here is the solution to the problem of freezing. Instead of freezing the database, there is something called a nonquiescent checkpoint. I had to look up in the dictionary what quiescent means: quiescent means quiet, still, at rest, inactive. And nonquiescent is the opposite, like my kids. My kids are both nonquiescent. And this is how we want checkpointing to be: nonquiescent. The idea is actually quite simple. Whenever we decide we want to do a checkpoint, we write in the log a special entry called start checkpoint, and in it we record all the transactions that are currently active. This is information that any database system has.
The database system knows how many transactions are currently active, and it knows their names. So whenever it decides to make a checkpoint, it writes all their names in the log. Then it continues normal operation, but it watches these active transactions, and when all of them have terminated, it writes the end checkpoint. So let's see how this works. This is a log; the system crashes here. We start to do the undo from the end of the log, and we keep track of the transactions that have committed; we don't undo them. But we never know if we'll find a new operation of a transaction that has not committed, so we need to continue to read. Where can we stop reading? When do we know that no action of any uncommitted transaction lies beyond a certain point? Beyond what point: beyond the end checkpoint, or beyond the start checkpoint? So let me ask you this. Suppose we stop here, at the end checkpoint. Can it be the case that there is a transaction that we need to undo here? Can there be an entry right here, say T9 modifies x with old value v, that we still need to undo? Where did that transaction start? Obviously earlier. But did it start before the start checkpoint, or after, or can't we tell? Before. That transaction must have started before the start checkpoint. So, start T9. But then why didn't it terminate? Is it possible that it did not commit? Yes, it's absolutely possible that it kept running. The end checkpoint was written even though this new transaction had not committed yet. That's OK. But why don't we have to read beyond the start checkpoint? Because that is where we can stop; we can stop at the start checkpoint. Could it be the case that there is a T11 that is still updating something? No, because it would have been written in the start checkpoint as one of the open transactions. We know everything from that list is done. Exactly. Either it committed, or, if it did not commit, it must appear in that list; and actually, then it must have terminated, because we see the end checkpoint. Does the system force the data of the open transactions to disk at the checkpoint? So the question is: at the checkpoint, is the data of these open transactions already written to disk, or is it still in main memory? The data is written according to the two rules that we discussed. The outputs are forced to disk before the transaction commits. If the transaction has not committed yet, you have no guarantee; then we don't know. OK. Here is a question for you, and then we take a break. Suppose the picture is the same, but there is no end checkpoint. What happens now? How far back do we need to go? Can you still use the checkpoint, or do you need to go to the beginning of the log file, or to the previous checkpoint? You know all the open transactions from the start checkpoint, so you don't? So, can we stop here? Can we stop at the start checkpoint? No; but you only have to unwind T4, T5, and T6. Exactly. You only have to go back for those. The only transactions that still matter here are T4, T5, and T6. Maybe there is an update done by T4, and T4 has not finished; it's not committed. But we only need to trace back T4, T5, T6, and actually only those among them that have not committed yet, which is something we can see from here. OK, this is a good time to take a break. After the break, we discuss the other kind of recovery manager, and serializability.
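[To summarize the stopping rule just derived, here is a sketch of how far back the undo scan must go with a nonquiescent checkpoint. The checkpoint record format follows the same assumed LogRecord shape as the earlier sketch; the update fields are omitted since only the scan bound is computed.]

```java
import java.util.*;

// Sketch: where the backward undo scan may stop, given nonquiescent checkpoints.
class CheckpointScan {
    enum Kind { START, COMMIT, ABORT, UPDATE, START_CKPT, END_CKPT }
    record LogRecord(Kind kind, String txn, List<String> activeTxns) {}

    static int scanStopIndex(List<LogRecord> log) {
        for (int i = log.size() - 1; i >= 0; i--) {
            Kind k = log.get(i).kind();
            if (k == Kind.END_CKPT) {
                // Safe to stop at the matching <START CKPT ...>: nothing
                // before it can still be an uncommitted transaction.
                while (log.get(i).kind() != Kind.START_CKPT) i--;
                return i;
            }
            if (k == Kind.START_CKPT) {
                // Crash hit mid-checkpoint: must reach the earliest <START Ti>
                // among the transactions listed as active at the checkpoint.
                int stop = i;
                for (String t : log.get(i).activeTxns())
                    for (int j = 0; j < i; j++)
                        if (log.get(j).kind() == Kind.START && t.equals(log.get(j).txn()))
                            stop = Math.min(stop, j);
                return stop;
            }
        }
        return 0; // no checkpoint at all: read the log from the very beginning
    }
}
```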
OK, this is a good time to take a break — say, three minutes. After the break, we discuss the other kind of recovery manager and serializability.

OK, let's start again. We still have about half of the slides to cover, and it gets denser. This material is actually a lot of fun — I don't know how much you enjoy it, but while it's one of the harder parts of the course, it's also one of the most fun parts.

One last thought about the undo recovery manager: we can use it to implement rollback. Rollback in database systems is implemented by reading the log. How do we do this? We simply read the log backwards and undo the actions of the current transaction. And there is a simple optimization that you could probably have imagined yourself; we will discuss it in more detail in the next lecture. Every entry in the log has a unique log sequence number, called an LSN — that's standard terminology. And the log entries that belong to the same transaction are linked through their LSNs, so that it's easy to read only the log entries that belong to a given transaction if you need to undo it. So that was the undo recovery manager.
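Here is a small sketch in Java of that rollback optimization — following a per-transaction chain of LSNs instead of scanning the whole log backwards. The field names (prevLsn and so on) and the example LSNs are illustrative choices of mine, not a real system's layout.

```java
import java.util.HashMap;
import java.util.Map;

// A sketch of rollback via per-transaction LSN links: each record carries the
// LSN of the PREVIOUS record of the same transaction, so rollback can jump
// straight from record to record of that one transaction.
public class Rollback {
    record LogRecord(long lsn, long prevLsn, String txn, String element, int oldValue) {}

    static void rollback(Map<Long, LogRecord> log, long lastLsnOfTxn) {
        for (long lsn = lastLsnOfTxn; lsn != -1; lsn = log.get(lsn).prevLsn()) {
            LogRecord r = log.get(lsn);
            System.out.println("undo " + r.element() + " = " + r.oldValue());
        }
    }

    public static void main(String[] args) {
        Map<Long, LogRecord> log = new HashMap<>();
        // T1's records are at LSNs 10 and 30; an unrelated record sits at LSN 20.
        log.put(10L, new LogRecord(10, -1, "T1", "a", 8));
        log.put(20L, new LogRecord(20, -1, "T2", "x", 5));
        log.put(30L, new LogRecord(30, 10, "T1", "b", 8));
        rollback(log, 30);   // prints: undo b = 8, then undo a = 8 -- skips LSN 20
    }
}
```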
The problem with the undo recovery manager, remember, was efficiency. It requires that we output to disk — that we force to disk — all updates done by a transaction before we can commit. Before we can commit, everything the transaction has done must be on disk. And that is inefficient, because it means, essentially, that we can't exploit the buffer pool to its full potential. We can't implement LRU, for example, because we are forced to send pages to disk before we commit a transaction.

To address that, the other approach to a recovery manager is to do only redo, and that's called redo logging. The idea is that if the system crashes, we don't need to undo anything; instead, we might have to redo the transactions that have committed. In this log, the entries look similar: there is a start entry, and there are commit and abort entries. The entries that correspond to updates also look similar, but now V is the new value: whenever a transaction modifies an element x, we write in the log the new value that should be written to x.

Let's see this in action in our example. Remember, here T modifies A, and this modification only made it into the buffer pool; but what we write in the log file is the new value of A, which is 16. Similarly, when T modifies B, we write in the log the new value of B. And finally, the transaction commits, and we are happy.

So, question — now let's play games. Suppose the system crashes right here, and you see this in the log. What do we need to do with a redo recovery manager? Nothing, indeed. Because with a redo recovery manager, transactions that have not committed have not written anything to disk. Very simple rules. And indeed, 8 and 8 are still the old values on disk.

Now let's let the transaction commit. Suppose the transaction commits and we dispense the money. This is my best example for pointing out how important it is to realize that a commit is a commit: you can't change your mind later. Once you tell the application commit, that's it — you can never say, please let me change my mind. So it's committed, and the system crashes right here. We tried to write things to disk; A made it to disk, but B did not. Now we see this log. What do we do? Yes? Just step through, rewriting the values, making sure both writes happen. Exactly. We step through and make sure that both writes happen, and indeed we set A and B to 16. For obvious reasons, we now need to go from the beginning of the log towards the end and do the redo — note that this is the opposite direction from the undo recovery manager, where we read from the end towards the beginning.

Let me recap what we discussed. There are two distinct steps. First, we need to discover which transactions have committed and which have not, exactly as before. Then, in the second step, we read the log from the beginning to the end and redo all the actions of the committed transactions. A quick example: for which transactions do we need to redo the actions here? Forget about these entries; this is where the crash happened. Just T2, right? So we go through the log, we see the entries for T2 — just one — and this is what we do: we update x2 to v2.

OK, so now back to the same question as before. Where am I? Here. This entry to the log: we can postpone it for a while and not write it to disk — but for how long? When must we force this entry to disk? This is referring to the log entry. Definitely before commit. Right — before commit, exactly. That's the correct answer, because once we commit, we must be able to recover that value in case the system crashes. So this must happen before commit. I don't know how I drew the arrows earlier, but they looked nicer. What about these outputs? When can we write them — can we write them later, can we write them earlier? After commit only. Only after commit. Before the transaction commits, we are not allowed to output anything on behalf of that transaction.

So that's the picture. I did not draw these arrows, which are actually quite obvious. Now think about LRU. What will this rule do to the LRU algorithm — and actually to any page replacement algorithm? The buffer manager decides it wants to evict a page; it would like to output A. This happens maybe somewhere here. The buffer manager says: now I'd like to output page A, because I need that frame for something else. And you say: no, no, you can't do this, because we cannot output A until the transaction that has modified A has committed. So what can happen to the page replacement algorithm? You can run out of memory. Because if all the pages belong to transactions that have not committed, you're stuck. So this recovery manager does not actually work in isolation; you need to combine it with an undo recovery manager. But it's a nice example to study independently. Does everyone understand why the buffer manager can get stuck with this recovery manager? If it's not allowed to evict any page, then it's stuck, no matter which page it chooses to evict. Good.
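Putting the redo rules together, here is a minimal sketch in Java of the two-pass redo recovery just recapped. The string record format ("UPDATE txn,element,newValue") and the example log are my own illustration.

```java
import java.util.*;

// A sketch of redo recovery: pass 1 finds the committed transactions,
// pass 2 replays their updates from the beginning of the log forward.
// Redo logs store NEW values, so an update record is "UPDATE T,x,new".
public class RedoRecovery {
    public static void main(String[] args) {
        List<String> log = List.of(
                "START T1", "UPDATE T1,a,16", "UPDATE T1,b,16", "COMMIT T1",
                "START T2", "UPDATE T2,c,99");  // crash: T2 never committed

        // Pass 1: which transactions committed?
        Set<String> committed = new HashSet<>();
        for (String rec : log)
            if (rec.startsWith("COMMIT ")) committed.add(rec.substring(7));

        // Pass 2: replay committed updates, in log order (beginning to end).
        for (String rec : log) {
            if (!rec.startsWith("UPDATE ")) continue;
            String[] f = rec.substring(7).split(",");
            if (committed.contains(f[0]))
                System.out.println("write " + f[1] + " = " + f[2]);
        }
    }
}
```

T2's update to c is ignored, because under the redo rules an uncommitted transaction cannot have written anything to disk.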
Now, the same problem as before: with a naive redo log, we need to redo from the very beginning. And it's actually even worse than with an undo log, because all the old transactions that committed many months ago — they ran for a few seconds, they committed, and the system crashes months later — we need to redo all of them. Because, who knows, they might still have some pages sitting around in the buffer pool that have not yet been written to disk. Even more than with the previous log, we need a checkpoint.

And here, please pay attention, because this checkpoint is completely different from the undo checkpoint, and it's also different from the checkpoint done by ARIES. But it makes sense in this particular context. I'm going to discuss it as it's presented in the book by Garcia-Molina, Ullman, and Widom. The idea of the non-quiescent checkpoint for the redo recovery manager is this: when we want to do a checkpoint, we write start-checkpoint, and then we start inspecting all the pages in the buffer pool. If a page belongs to a transaction that has already committed, we write that page to disk. And when we finish inspecting all the pages in the buffer pool, we write end-checkpoint.

Let's look at an example. The system crashes here, as always. The problem we face without checkpointing is that there are these very, very early transactions that committed a long time ago and whose dirty pages might still be in the buffer pool — dirty meaning that they have been updated. They might still be in the buffer pool, and therefore we would need to redo those transactions. But now let's look at what the checkpoint does. What happens behind the scenes between these two points, the start-checkpoint and the end-checkpoint? Yes? I guess some of them are going to commit? Let me ask more concretely: what happens to the pages modified by T1 between the start-checkpoint and the end-checkpoint? They will be written to disk. They are guaranteed to be forced to disk; even if they continue to stay in memory, they have been forced to disk. That is the guarantee we get from the end-checkpoint.

OK, so now let's think about recovery. The rule, again, is that we need to redo only transactions that have committed. So — let's say the crash happened here. In the first phase, we identify the transactions that have committed and those that have not; in the second phase, we redo all the actions of the committed transactions. And there are three kinds of committed transactions: those that started and committed before the start-checkpoint; those that started and committed around the checkpoint — let me write here a T9 and its commit; and those that start and end after the checkpoint, which I'm not going to draw.

Question: do we need to redo T9? Let me ask a simpler question first: do we need to redo T1? No — that's the whole purpose of the checkpoint. We don't need to redo T1, because during the checkpoint its pages were already forced to disk. OK, so do we need to redo T9? Yes — it modified pages while the process of forcing pages to disk was already in progress, so we have no clue whether that process managed to force some of T9's pages to disk or not. So we had better redo T9. And basically, that's it. What if T9 ended after the end-checkpoint — any change? No, there is no change; we still need to redo it.
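Here is a sketch in Java of that checkpoint step. The DirtyPage structure, the idea of tagging a buffered page with the transaction that dirtied it, and the println "flushes" are all simplifications of mine for illustration — real buffer pools track dirty pages differently.

```java
import java.util.*;

// A sketch of the redo-log checkpoint just described: between START CKPT and
// END CKPT, flush every buffered page whose updates come from an
// already-committed transaction.
public class RedoCheckpoint {
    record DirtyPage(String txn, String pageId) {}

    static void checkpoint(List<DirtyPage> bufferPool, Set<String> committed) {
        System.out.println("<START CKPT>");
        // This flush can take a long time on a large buffer pool;
        // normal transaction processing continues while it runs.
        for (Iterator<DirtyPage> it = bufferPool.iterator(); it.hasNext(); ) {
            DirtyPage p = it.next();
            if (committed.contains(p.txn())) {
                System.out.println("flush " + p.pageId());  // force the page to disk
                it.remove();
            }
        }
        System.out.println("<END CKPT>");  // only now is the flush guaranteed complete
    }

    public static void main(String[] args) {
        List<DirtyPage> pool = new ArrayList<>(List.of(
                new DirtyPage("T1", "p7"),    // T1 committed long ago: gets flushed
                new DirtyPage("T9", "p12"))); // T9 not yet committed: stays in the pool
        checkpoint(pool, Set.of("T1"));
    }
}
```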
OK, but now let's see what happens if the crash happens here instead. The difference is that we see an older start-checkpoint with its end-checkpoint, and a newer start-checkpoint that did not terminate. So let's imagine the same thing: there is a T9 here — start T9 and commit T9. My question is, how can we exploit... actually, T9 is a bad choice, because T9 committed up there; let me call it T8. So T8 committed, and normally we would have to redo it. But there was a start-checkpoint here — can we use it? T8 has been flushed, so you don't have to worry about it? Well, clearly this checkpoint initiated a process that was supposed to flush T8's pages. But do we know for sure that that process terminated? No. Isn't the flush guaranteed to be done at the start-checkpoint? No — the flush is guaranteed to be done only at the end-checkpoint. So no, we cannot use it. Actually, now I see it's written here: we cannot use it, because the only guarantee we have is when we see the end-checkpoint; that is when we know the flushing process has terminated. You should think of this flushing process — which is also used in ARIES, but in a less critical role than here — as an expensive process. It might take a long time to flush a very large buffer pool to disk; maybe not minutes, but definitely many seconds. It's not something that happens instantaneously.

OK, so that is checkpointing for the redo log. We need to read back to the last checkpoint that terminated successfully, but not beyond that point, and we only need to redo the transactions that committed after that checkpoint.

So, a quick comparison. In undo logging, we had to output to disk before we commit. That dramatically decreases the performance of any LRU algorithm, because, essentially, we have to do lots of outputs eagerly. In redo logging, we are not allowed to output to disk until the transaction commits. That's fine for performance, but we might end up stuck: if we have no frames left and no page is allowed to be evicted, we're stuck. To make both constraints more flexible, the idea is very simple: you combine the two logs into something called undo/redo logging. ARIES is actually an undo/redo log. For the example we're discussing today, the undo/redo log has entries like this: whenever a transaction updates an element, we write both the old value and the new value.

So now, if the system crashes, how does recovery work with an undo/redo log? Very simple: we do both recoveries. For the uncommitted transactions we do undo, and for the committed transactions we do redo. And the log contains both pieces of information that we need in order to undo and to redo.

Here is an example. Suppose the system crashes here, so our log is this — and then I'll show you a second example. In this case, the transaction has committed. What do we need to do to T: redo it, undo it, or nothing? We need to redo it, because it committed. Think about it this way: it committed, we dispensed the money, but some of the updates are still hanging around in memory, and we need to write them to disk. So we go to the log and just replay: we update A to 16 and B to 16. Now, the second scenario is when the crash happens here. Then the log looks like this, which means we did not dispense the money — the customer did not take the money. So what do we need to do now? Undo. We go backwards and undo, and we write the old values: A is 8 and B is 8.
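Here is a minimal sketch in Java of both passes over an undo/redo log. The record format ("UPDATE txn,element,old,new") and the example log are my own illustration of the scheme described above.

```java
import java.util.*;

// A sketch of undo/redo recovery: update records carry BOTH values.
// Committed transactions are redone (forward pass, new values);
// uncommitted transactions are undone (backward pass, old values).
public class UndoRedoRecovery {
    public static void main(String[] args) {
        List<String> log = List.of(
                "START T1", "UPDATE T1,a,8,16", "UPDATE T1,b,8,16", "COMMIT T1",
                "START T2", "UPDATE T2,c,5,7");  // crash: T2 never committed

        Set<String> committed = new HashSet<>();
        for (String rec : log)
            if (rec.startsWith("COMMIT ")) committed.add(rec.substring(7));

        // Redo pass: beginning to end, committed transactions get NEW values.
        for (String rec : log)
            if (rec.startsWith("UPDATE ")) {
                String[] f = rec.substring(7).split(",");
                if (committed.contains(f[0]))
                    System.out.println("redo " + f[1] + " = " + f[3]);
            }

        // Undo pass: end to beginning, uncommitted transactions get OLD values.
        for (int i = log.size() - 1; i >= 0; i--)
            if (log.get(i).startsWith("UPDATE ")) {
                String[] f = log.get(i).substring(7).split(",");
                if (!committed.contains(f[0]))
                    System.out.println("undo " + f[1] + " = " + f[2]);
            }
    }
}
```

On this log it redoes a = 16 and b = 16 for the committed T1, then undoes c back to 5 for the uncommitted T2.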
OK, and a final question: when do we have to write this log entry to disk? Can we postpone writing it? Before commit? Before commit — but I'm not going to draw that arrow; it's kind of redundant, because in the log we write the entries in the right order anyway. Before what, then? Before the output. And the reason is that if we ever have to undo that transaction — if we did the output and now we need to undo — then the log had better have the undo information written in it. But what about the outputs? When do we have to output: before commit or after commit? As you can see, we can do both; it doesn't matter. So now we have full flexibility: we can output before commit or after commit.

OK, so let me see — nothing new here. We need to do two passes, a redo pass and an undo pass, and this is illustrated right here, but I'm not going to go over it; it's a very simple example. So that was my discussion of a very simple recovery manager. Any questions about the recovery manager so far? I should stress that this is a significant simplification of the real recovery managers used in database systems, and we will discuss some of the advanced issues next time. They are described quite well in the Ramakrishnan and Gehrke textbook, and I'm going to follow that textbook when discussing ARIES. All the data structures are right there, well described — but essentially ARIES manipulates all the concepts that we have discussed today, put together in a very careful way.

Good. So finally, the last topic for today is the basic definitions of concurrency control. I'm not going to discuss concurrency control mechanisms — that's something we will discuss next time. What's the problem in concurrency control? The problem is that we have many transactions, and a transaction is not a single update — just as with recovery, a transaction consists of many updates. For efficiency reasons, we can't wait for one transaction to complete its execution before we start the next one; we need to allow their operations to be interleaved. Why is that? What do we win by interleaving the actions of multiple transactions? We win big time — how? You increase the concurrency level. And that helps with what? Performance. Performance, because whenever an I/O request is issued to the disk, there is a long delay — the disk latency — between issuing the request and the time the data arrives from disk. During that time, a different transaction can perform some actions. So we interleave the actions of transactions for performance, in order to cope with disk latency.

But the problem is that if they interleave, they may hurt each other, and we have seen examples — the famous anomalies. So the basic idea is that there is a scheduler, and the scheduler decides which transaction goes next. But in order to understand how to design this scheduler, we need to understand what it means for a schedule to be correct. The right formal definition is called a serializable schedule. And as you will see, there are multiple mechanisms for serialization; in the next lecture, hopefully, we will discuss actual mechanisms that ensure schedules are serializable. There, again, I'm going to use the other textbook, because it has a better description of concurrency control mechanisms than Ramakrishnan and Gehrke.
These chapters refer to Garcia-Molina, Ullman, and Widom. OK, so we discussed the problem; let me start with definitions. I'm going to define three kinds of conflicts. These are conflicts between actions done by two transactions, or by the same transaction, and they're called write-read, read-write, and write-write conflicts. Let me illustrate them with examples.

Think about a lost update, like this one here: transaction one reads and then writes a new value for A; transaction two reads and then writes a value for A. Here we have a lost update, and the conflicts are right here: between read and write, and between write and write. Think about what "conflict" means — not that something is wrong, but that you can't switch the order. If you switch the order of these operations, the result might change. That's the essence of a conflict.

Let's see another example: inconsistent reads. Here is an instance. Transaction one writes two things, and transaction two reads both, but it reads something that shouldn't have been visible: transaction one intended to write the same value to both, and transaction two might end up reading different values for A and B. So what are the conflicts here? A conflict between write and read, and a conflict between read and write. Two more, I think. Dirty reads — a write followed by an abort. Here is the conflict: write and read. And one more: unrepeatable reads. This is an anomaly that we did not discuss at the beginning of the lecture. T2 wants to read A twice. Why would it do that? We don't care — this is what T2 wants to do, and it should be allowed to do it. And because T1 has written A in between, because of the read-write and write-read conflicts, T2 will get different values for the two reads.

OK, so we have seen conflicts. Now, here is the definition of a schedule. A schedule is a sequence of the transactions' actions — an interleaved sequence of all the actions of all the transactions. Our goal for the remainder of this lecture is to define what a good schedule is, because some schedules are not good: schedules that result in those anomalies are not good. So what is a good schedule? That's going to be our main question.

I'm going to use this example, in which T1 wants to add 100 to both A and B, and T2 wants to multiply both A and B by two. So T1 reads A, adds 100, writes A; then reads B, adds 100, writes B. T2 reads A, multiplies by two, writes A; then reads B, multiplies by two, writes B. The invariant is: if A and B were equal before we execute the transactions, they have to be equal when we finish executing them.

Here is a schedule — a very simple one, in which the scheduler allows T1 to execute to completion, and then executes T2. Is this a good schedule? This is the gold standard of a good schedule; it doesn't get any better than this. If this were what the scheduler always did, the transactions would run in isolation — if you want, that's the informal definition of isolation. So this is clearly a good schedule. Can you think of a different good schedule? After T1 writes A, T2 could start doing its read of A and operate up to the point where it wants to read B. So we let this part of T1 go first, then let T2 do its actions on A — oops, put it there — and then? B from T1, then B from T2? Must be tired — three and four. Yep, that's a very good schedule.
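To see the invariant concretely, here is a tiny Java simulation of the running example — T1 adds 100 to A and B, T2 doubles both — comparing the serial schedule with a bad interleaving. The initial values are my own choice for illustration.

```java
// T1 adds 100 to A and B; T2 doubles A and B.
// The invariant: if A == B before the transactions, then A == B afterwards.
public class Schedules {
    static int a = 10, b = 10;

    public static void main(String[] args) {
        // Serial schedule: all of T1, then all of T2.
        a += 100; b += 100;      // T1
        a *= 2;   b *= 2;        // T2
        System.out.println("serial:      a=" + a + " b=" + b); // 220, 220 -- invariant holds

        a = 10; b = 10;
        // Bad interleaving: T2 runs between T1's two updates.
        a += 100;                // T1: A := A + 100
        a *= 2; b *= 2;          // T2: both of its updates
        b += 100;                // T1: B := B + 100
        System.out.println("interleaved: a=" + a + " b=" + b); // 220, 120 -- invariant broken
    }
}
```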
Now, how do we say that this is a good schedule? How would we define formally, mathematically, that a schedule is good — that there are no anomalies? The concept is amazingly simple, but it took a while for people to understand it. Here it is: we say that a schedule is serializable — that's the term for "good" — if it is equivalent to a serial schedule. Serial schedules are our gold standard. So if you can interleave the actions of your transactions such that what you get is equivalent to a serial schedule, then you're good: that schedule is called serializable.

OK, so let's see an example. Here is exactly the schedule that you proposed: it is not a serial schedule, but it is serializable. And here is another example — a schedule that is not serializable. That's because if you first add 100 to A, then multiply both A and B by two, and then add 100 to B, you're not going to get equal values.

OK, but let me share something a little bit annoying. Look at this schedule here. I modified the transactions, just for this particular slide: now the first transaction adds 100, and the second transaction adds 200. This schedule is serializable, right? Because it's equivalent both to doing T1 before T2 and to doing T2 before T1. But do you expect a scheduler to schedule operations in this order? Wouldn't it have to actually examine the data transformations to do something like that? Yes — it would have to examine the transformations in order to identify such cases, and it would have to be able to reason about them; it would need to prove that they commute. We don't expect a scheduler to do this. And this is why we end up with a more restrictive definition: that of a conflict serializable schedule.

Now, what happened to this slide? I'm sure it was correct — of course you can't read what's there, and I don't think I can edit it here. OK, so the idea is that we assume the worst. Whatever the transactions do to the data, we assume that they modify it in such a way that we can never switch operations between transactions. And as a consequence, we don't care what the transactions compute — we assume the worst anyway — so we can write transactions simply as sequences of reads and writes. Transaction one reads A, writes A, then reads B, writes B. And transaction two does exactly the same thing: reads A, writes A, reads B, writes B.

OK, so now I can define conflicts in a schedule formally. Imagine a schedule of actions belonging to different transactions. Two actions are in conflict if you can't swap them. That means either they belong to the same transaction — then of course you can't swap them, because we want to preserve the order of actions within a transaction — or they access the same element X and at least one of them is a write: write-write, write-read, or read-write. For any two such actions, we say they are in conflict. It doesn't mean the schedule is bad; it only means we can't swap them. Yes? So a read and a read of the same element by two different transactions are OK, right? Yes — two reads are not in conflict; we can swap reads. Very good observation. That's exactly what we can do: we can swap reads, and we can also swap operations by different transactions on different elements.
So now here is the definition that we — and everybody — will use. A schedule is called conflict serializable if you can swap adjacent non-conflicting actions and obtain an equivalent serial schedule. For example, look at this schedule: we can start swapping to make it serial. Look at the ones and the twos — all the ones are blue and all the twos are red — and I want to move all the ones to the beginning and all the twos to the end. Can I swap these two? Yes, they are not in conflict, because one refers to A and the other to B. Can I swap these two, now that they are next to each other? A read and a read are not in conflict, so yes. And we need a few more swaps — we also need to swap these two, which are not in conflict — and then I think we're done. We obtain the schedule below, and this is a serial schedule, because transaction one completely precedes transaction two.

Question: which one implies the other? Does conflict serializable imply serializable, and does serializable imply conflict serializable? True or false: every serializable schedule is conflict serializable? Well, you don't need to transform it — right, you don't need to transform it, but you still need to prove that it's equivalent. So take a schedule that is serializable; is it also conflict serializable? Look at this one. It is serializable — but is it conflict serializable? No: you can't do the swaps, because there are conflicts. Which transaction would you like to place first? If you want to place T1 first, you can never swap these two writes. But don't they write different variables? Ah, but they read the same one. So this is an example of a schedule that is serializable — because of the particular operations the transactions perform — but not conflict serializable, because we can't swap these two writes. We could swap them only because we know that swapping gives the same answer; if we didn't know what the transactions were computing, we could not have serialized it. And what about the converse: is every conflict serializable schedule also serializable? Yes, the converse is true.

Good, so you got this. Now, how do we check it? Suppose I show you just such a sequence — I give you a long schedule and I ask you, say on the final: is this schedule conflict serializable? (I will never ask whether it is serializable, because you don't know what operations the transactions are doing; but I will ask whether it is conflict serializable.) How do you check? Well, this is why we have a few more slides: the answer is not immediate, but it's not too difficult. We construct what is called the precedence graph, and we do it as follows. Yes, question? Are serial and serializable the same thing? No — serial and serializable are completely different. Serial means we execute one transaction after the other. Serializable means we interleave them, but the result is equivalent to a serial schedule.
So what you're saying is: you basically pick two transactions and try to see whether you can separate them — or you see that you can't, because there's some conflict in there? Yes. So you don't actually have to do the sorting; you just look for a conflict, and if there is one, you make an edge. You're trying to guess the algorithm for checking conflict serializability — and that's exactly the idea. We construct a graph, called the precedence graph, in which every transaction becomes a node, and whenever we find a conflict, we draw an edge between the two nodes. And then you'll tell me what happens next.

Let's see how this is done on this example. Here is a schedule, and we want to find out whether it is conflict serializable. So we need to find conflicts, and a conflict means that two operations from two transactions refer to the same element, at least one of them a write. Look at element A: what conflict do you see, between which operations of which transactions? A write by two — this one — and which other operation? Ah, you want the write by three. OK, here is a conflict, a write-write conflict, which means that in any equivalent serial schedule, which transaction must come first? Two must come first: you can never move three before two, because of this conflict. That's the idea. So we draw an edge from two to three — and I like to label it with an A, to remember why we drew it. Anything else about A? We also have this one — let me use green — but it's the same edge, so no progress. Other conflicts — conflicts introduced by B? A write, by which transaction? By one. And a read by two. So here is another constraint: one must come before two, because of B. Any other conflicts? Yes, but they introduce exactly the same edge.

So that is the precedence graph, and now comes the question. Looking at this precedence graph, what can you tell? Is the schedule conflict serializable or not? No cycles, right? Exactly — there are no cycles. Therefore we can look for a serial schedule in which the transactions are executed in the order given by the precedence graph: one, two, three. You can swap adjacent non-conflicting operations until all the operations of one come first, then two, then three. And we can do this because the precedence graph tells us that we won't get stuck — there are no conflicts in the opposite direction.

OK, that was the first example; here is a second one. Let's work it out — can you quickly read off the conflicts for me? Look at A: what does A introduce? Here is a write, and a read by three — so that's one edge. Any other conflict on A? I don't think we have any other, no. Now B: what conflicts does B introduce? Here is a read by two before a write by one — excellent, so two must come before one. I'm sure we have more. Look at this: a write by one and then a write by two, so one must come before two. And I think that's it. So what can you tell me about this schedule? It is not conflict serializable — the precedence graph has a cycle. This is not a good schedule; the scheduler should never produce it. Good.
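Here is a minimal sketch in Java of this test: build the precedence graph from the read/write actions, then look for a cycle with a depth-first search. The Action encoding and the example schedule are illustrative choices of mine.

```java
import java.util.*;

// Conflict-serializability test: an edge Ti -> Tj whenever an action of Ti
// conflicts with a LATER action of Tj; serializable iff the graph is acyclic.
public class PrecedenceGraph {
    record Action(int txn, char op, String elem) {}   // op is 'r' or 'w'

    static boolean conflictSerializable(List<Action> s) {
        Map<Integer, Set<Integer>> edges = new HashMap<>();
        for (int i = 0; i < s.size(); i++)
            for (int j = i + 1; j < s.size(); j++) {
                Action x = s.get(i), y = s.get(j);
                // conflict: different transactions, same element, at least one write
                if (x.txn() != y.txn() && x.elem().equals(y.elem())
                        && (x.op() == 'w' || y.op() == 'w'))
                    edges.computeIfAbsent(x.txn(), k -> new HashSet<>()).add(y.txn());
            }
        Set<Integer> done = new HashSet<>(), onPath = new HashSet<>();
        for (int t : edges.keySet())
            if (hasCycle(t, edges, done, onPath)) return false;
        return true;
    }

    static boolean hasCycle(int t, Map<Integer, Set<Integer>> edges,
                            Set<Integer> done, Set<Integer> onPath) {
        if (onPath.contains(t)) return true;     // back edge found: cycle
        if (done.contains(t)) return false;      // already fully explored
        onPath.add(t);
        for (int next : edges.getOrDefault(t, Set.of()))
            if (hasCycle(next, edges, done, onPath)) return true;
        onPath.remove(t);
        done.add(t);
        return false;
    }

    public static void main(String[] args) {
        // r1(A) w1(A) r2(A) w2(A) r1(B) w1(B) r2(B) w2(B): every edge is 1 -> 2
        List<Action> good = List.of(
                new Action(1, 'r', "A"), new Action(1, 'w', "A"),
                new Action(2, 'r', "A"), new Action(2, 'w', "A"),
                new Action(1, 'r', "B"), new Action(1, 'w', "B"),
                new Action(2, 'r', "B"), new Action(2, 'w', "B"));
        System.out.println(conflictSerializable(good));  // true
    }
}
```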
So that was the easy part. Now, a few refined definitions of serializable schedules — and it gets much more interesting. Here is an example of a schedule that is not conflict serializable, yet it is equivalent to a serial schedule. I want us to examine this in a little detail, because some scheduling mechanisms actually produce schedules like this. So look at it: transaction one writes X, then transaction two writes the same X; then two writes Y, then one writes Y, and then three writes Y. Let me draw the precedence graph. What are the conflicts? Looking at X, which transaction must come before which? One before two. Looking at Y, which must come before which? Two before one, one before three, and two before three. So this is not a conflict serializable schedule: there is no way to place one before two, or two before one, without contradicting some conflict.

However, I claim that this schedule is equivalent to the one here, which is serial: we execute one, then two, then three. How can that be? Well, the answer is right here, but can somebody explain it in a longer sentence? Because the last write of X is by the same transaction in both schedules, and the last write of Y is by the same transaction in both. Right. So let's examine exactly which conflict we ignored. The conflict we ignored was between these two — let me erase and use red. Transaction two wrote Y before transaction one; nevertheless, in the serial schedule we allow one to write Y before two does. Why is this OK? Because three wins anyway. Three is the winner right here: no matter in which order two and one wrote Y, three writes last and overwrites everything.

So this introduces a new kind of equivalence. We say that two schedules are view equivalent — the definition is long, it's right here, but it's very simple to understand: the two schedules must agree on what every read sees and on who writes last. First, whenever a transaction reads the initial value of an element in one schedule, it reads the initial value in the other schedule as well. Second, if a transaction reads a value written by some other transaction in one schedule, then in the other schedule it reads the value written by that same transaction. And third, if the final write of an element is done by some transaction in one schedule, then the final write of that element is done by the same transaction in the other schedule. That's the definition of view equivalence. By the way, this is covered well in our textbook, in Ramakrishnan and Gehrke.

Here is a more complicated example. On the left is a schedule that is not conflict equivalent to the serial schedule on the right, because of the following conflict — which one? This one: transaction one writes Y after transaction two, whereas in the serial schedule transaction one writes Y before transaction two. But that's OK, because transaction three wins anyway. Good. So we have the definition of view equivalence, and now a schedule is called view serializable if it is view equivalent to a serial schedule — which means that every conflict serializable schedule is also view serializable, but the converse is not true.
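As a sketch of how one might test view equivalence mechanically — entirely my own illustration, with the simplifying assumption that a transaction reads each element at most once — here is the three-condition check in Java. Writer "0" stands for the initial database value, which handles the first condition; the surviving last-writer map at the end of the scan gives the final writes.

```java
import java.util.*;

// View-equivalence sketch: two schedules are view equivalent iff they have
// the same reads-from relationships (with 0 = initial value) and the same
// final writer for every element.
public class ViewEquivalence {
    record Action(int txn, char op, String elem) {}

    static List<Object> viewOf(List<Action> s) {
        Map<String, Integer> lastWriter = new HashMap<>();
        Set<String> readsFrom = new HashSet<>();
        for (Action a : s) {
            if (a.op() == 'r')
                readsFrom.add(a.txn() + " reads " + a.elem() + " from "
                        + lastWriter.getOrDefault(a.elem(), 0));
            else
                lastWriter.put(a.elem(), a.txn());
        }
        return List.of(readsFrom, lastWriter);  // conditions 1+2, and condition 3
    }

    public static void main(String[] args) {
        // w1(X) w2(X) w2(Y) w1(Y) w3(Y): not conflict serializable,
        // but view equivalent to the serial schedule T1; T2; T3.
        List<Action> s = List.of(
                new Action(1, 'w', "X"), new Action(2, 'w', "X"),
                new Action(2, 'w', "Y"), new Action(1, 'w', "Y"),
                new Action(3, 'w', "Y"));
        List<Action> serial = List.of(
                new Action(1, 'w', "X"), new Action(1, 'w', "Y"),
                new Action(2, 'w', "X"), new Action(2, 'w', "Y"),
                new Action(3, 'w', "Y"));
        System.out.println(viewOf(s).equals(viewOf(serial)));  // true
    }
}
```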
Again, you may wonder why we study this. We study it because next time I will show you an interesting scheduler based on timestamps that cannot always guarantee conflict serializability, but can guarantee view serializability.

OK, and now, for the last ten minutes, it gets really, really hard, because now I'm going to throw in recovery. Suppose a transaction aborts. We can, of course, undo its actions — but remember the dirty reads: maybe some transactions have read values that were modified by this aborted transaction. What do we need to do with those transactions? We need to abort them too. However, maybe such a transaction has already committed — what do we do then? Remember the money-transfer example, where we temporarily deposited some amount into one account; another application read that value, made a determination based on it, dispensed the money, and committed — commit and dispense, one single atomic operation. And now the first transaction wants to abort. What do you do? You quit, you change your job, you say: not me. That's not a good schedule. We should not allow such a schedule, because if we do, we won't be able to recover — we won't be able to abort. So here is the example: transaction one wrote A, which is read by T2; T2 does something that we can't undo, then it commits and goes away; and now we want to abort T1. We cannot allow this to happen.

So here is a definition. A schedule is called recoverable if — let me say this carefully — whenever a transaction commits (for example, to dispense money), all the transactions whose values it has read have already committed. Think about it this way: if you want to dispense money and commit, you must ensure that every transaction whose values you read has committed; then the schedule is called recoverable. Here are examples. The first one is not recoverable, because this transaction here commits, but the value it read was written by a transaction that at this point has not committed yet. The other one is recoverable. Because — well, does it actually get to commit? What happens when this one aborts? What do we need to do when T1 aborts? We undo T2, which we can still do, because T2 hasn't committed. Exactly: if T1 aborts, we need to unwind T2, because it has read a dirty value, so we undo all of its actions. It's not fun to abort a transaction just because another one decided to abort — but at least we can do it; the database stays consistent.

It would be much nicer if we never had to abort transactions this way, and the schedules that guarantee this are called schedules that avoid cascading aborts — right here. The definition is that whenever a transaction reads an element — not when it commits, but already when it reads — that element must have been written by a committed transaction. That means no other transaction's abort can force my transaction to abort, because everything I read comes from committed transactions.
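Here is a sketch in Java of the recoverability check as just defined: every transaction may commit only after all the transactions it read from have committed. The action encoding ('c' for commit, with the element field unused) is my own illustration.

```java
import java.util.*;

// Recoverability sketch: track who each transaction read from; when a
// transaction commits, all of its sources must already have committed.
public class Recoverability {
    record Action(int txn, char op, String elem) {}

    static boolean recoverable(List<Action> s) {
        Map<String, Integer> lastWriter = new HashMap<>();       // element -> last writer
        Map<Integer, Set<Integer>> readFrom = new HashMap<>();   // txn -> txns it read from
        Set<Integer> committed = new HashSet<>();
        for (Action a : s) {
            switch (a.op()) {
                case 'w' -> lastWriter.put(a.elem(), a.txn());
                case 'r' -> {
                    Integer w = lastWriter.get(a.elem());
                    if (w != null && w != a.txn())
                        readFrom.computeIfAbsent(a.txn(), k -> new HashSet<>()).add(w);
                }
                case 'c' -> {
                    if (!committed.containsAll(readFrom.getOrDefault(a.txn(), Set.of())))
                        return false;   // commits before a transaction it read from
                    committed.add(a.txn());
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // w1(A) r2(A) c2 c1: NOT recoverable -- T2 commits before its source T1
        List<Action> bad = List.of(
                new Action(1, 'w', "A"), new Action(2, 'r', "A"),
                new Action(2, 'c', ""), new Action(1, 'c', ""));
        System.out.println(recoverable(bad));  // false
    }
}
```

The stricter property — avoiding cascading aborts — would tighten the 'r' case: a read is allowed only if the last writer of that element has already committed.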
OK? So here is an example. The schedule on the left is not good, because here transaction two reads the value written by transaction one, and it is not yet clear whether transaction one is going to commit or abort — if it aborts, we need to do a cascading abort. But if we wait for transaction one to commit, as on the right, then we're fine, because what we read is now guaranteed not to be dirty; it's guaranteed to be clean.

OK. I know you're confused. Let me review this — and I will review these last schedules again next time — but here is a quick recap. The schedules we discussed can be classified along two dimensions. One is whether they are serializable. We discussed serial schedules; serializable schedules, which are difficult to check because serializability depends on what the transactions compute; conflict serializable schedules, which is the most commonly used definition; and a variant called view serializable, which is more flexible and admits some schedules that are not conflict serializable. The other dimension is recoverability: we must insist that schedules be recoverable. But even recoverable schedules are not necessarily efficient, because they might force us to do cascading aborts — so there is a stricter class, the schedules that avoid cascading aborts.

Good. I see some very tired faces, and I can only empathize with you. These were pretty dry concepts, but I'm going to review them at the beginning of the next lecture, and you will see them in action as we discuss concurrency control mechanisms based on timestamps — which are a lot of fun, kind of more fun than the ones based on locking. So, for next time: we will discuss concurrency control mechanisms based on locks, timestamps, and validation; then the ARIES recovery manager; and then transactions in more depth. Any more questions? Good. Then have a good evening. See you next week.