Okay, hello everyone. Can you hear me at Microsoft? Yeah. Hello, welcome. So today, here in Seattle, and you probably don't see this from Microsoft, we have the industrial affiliates meeting going on. There is a lot of activity in our building, we have food and drink, and there is a big party going on. We are kind of isolated here, but the building gets wild. So if you hear any noise or see anything unusual, it's because of this meeting.

Okay, so let me start with announcements. Homework 3: you're probably already working on it, and it's due next week. There have been some questions. First, each customer has exactly one rental plan. That's what the homework says, and we'll stick to this, which means you have to think about how to implement it. A customer cannot have zero rental plans; it has exactly one. Just think about how to represent this. Another thing that I really insist on: whenever you have a many-one relationship, you do not implement it with a new table. A many-one relationship means just a foreign key. A many-many relationship requires a new table, but a many-one relationship does not; that's just a simple foreign key. That was also true for homework 2. I realize I did not emphasize this strongly enough, but it is stated on the slide; it's right here. Again, you have to pay attention in homework 3: you do not implement a many-one relationship using a table. Also, some people would like to use Postgres, and it's now available in the department. Param will post instructions on how to access it. If you have any questions about this, please send an e-mail to Param and to me. Any other questions about homework 3? I have a question. Yes. The TAs sent an e-mail saying that every table should have a key. But then how can we keep a history where I rent the same movie five times in a month? Because I like that movie: I want to watch it and return it, watch it again and return it. So the question is: how can a customer rent the same movie several times and still keep the history? Well, what would be an obvious solution here? Yes. Add a timestamp? Add a timestamp, or store the date when that rental happened. Does this answer your question? Yes. We need to be a little bit creative; it's not 100% smooth, you have to do something about this. Yes. What about keeping a single record? That's another possibility. You could keep only one record per movie and user; then, if the user rents the same movie a second time, you essentially lose the history. But it's much nicer to keep the history around, and then you need several entries for the same user and movie; see the sketch at the end of these announcements. Any more questions about homework three? Homework 4 is now posted, or should be posted; Param is supposed to link it from the main website. It's a collection of problems from both textbooks: the Ramakrishnan and Gehrke textbook and the Garcia-Molina, Ullman, and Widom textbook. The problems from the other textbook are written out in full: they are copied, you have the full text in the homework, and you don't need the book to do the homework. However, the book is recommended reading, because the questions are based on the material in the book. Everything you need to know to solve this homework we covered in class. So technically, you don't need the book.
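Coming back to the rental-history question: here is a minimal sketch of the timestamp idea. The table and column names are invented for illustration; the homework may call things differently.

    -- One row per rental event, so the same customer can rent the
    -- same movie many times and the full history is kept.
    CREATE TABLE Rental (
        customerID  INT       NOT NULL REFERENCES Customer,
        movieID     INT       NOT NULL REFERENCES Movie,
        rentedAt    TIMESTAMP NOT NULL,
        returnedAt  TIMESTAMP,            -- NULL while still rented
        PRIMARY KEY (customerID, movieID, rentedAt)
    );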
But some people would like to have the book, and I have two copies sitting in my office. I was supposed to bring them here; I'll bring them during the break. You can borrow them from me and share them if you would like to read the book. Okay, so I don't have anything else to say about the next homework. It's due in two weeks. No programming, which is good news.

Okay, so where are we? We discussed transactions last time and we are going to continue discussing transactions today, perhaps even next time if we don't finish what we have to cover today, which is quite likely. There are two aspects to transactions. One is recovery, recovery from crashes. Here we discussed last time a very simple undo log and a very simple redo log; they just highlight the basic principles in a recovery manager. That's what we discussed last time. Then we started to discuss concurrency control, and we only discussed serializability, the definition of serializability. Today we will continue with concurrency control, and we will start by discussing a lock-based scheduler. After that, we are going to discuss some more advanced topics in transactions. First, I'm going to show you the weak isolation levels in SQL. They're not that advanced, but we need to discuss them. Then we are going to discuss ARIES. I definitely want to finish the discussion of ARIES today. We'll stop either right after ARIES, or we might start discussing the advanced concurrency control mechanisms, the optimistic concurrency control algorithms. We'll see how far we get; we might finish them, or we might discuss them next time. So that's where we are.

Now before we start today's topic, I want to have two discussions. The first is the one we postponed from last time: the paper on answering queries using views. By now I hope everybody has read sections one, two, and perhaps three of the paper. So let me ask you this. What is answering queries using views? What is this problem? Who can define it? You have to reformulate the query so that you get an equivalent query that uses only views. Right, that's exactly what the problem is. You have to reformulate the query to get an equivalent query that only uses the views. So let's take a step backwards. First of all, you have these views sitting around. For some reason you have precomputed some views; you have materialized some views over the data, which took some effort. Maybe this involved computing joins, maybe some aggregates, some group-bys. It took a few minutes to compute these views, or maybe hours if they are very big. So now you have them, they are there on disk, and you get a new query. This query refers to the base tables; it doesn't refer to the views. The question is: can you, in a smart way, rewrite this query such that it uses the views you have already computed, as opposed to redoing the work and computing again the same joins and the same aggregates? That is the problem of answering queries using views. Now, there are actually two variants. One is called query answering, and the other is called query rewriting. Did anyone understand what the distinction is between query answering using views and query rewriting using views? The paper talks mostly about query rewriting using views, which means exactly what we said: we rewrite the expression of the query to use only the views, or to use the views and some base tables.
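Here is a tiny example of the kind of rewriting we mean; the table and view names are invented for illustration.

    -- Materialized earlier, at some cost:
    CREATE VIEW PartSupplier AS
        SELECT p.pno, p.pname, s.sno, s.city
        FROM   Part p, Supply s
        WHERE  p.pno = s.pno;

    -- New query, formulated over the base tables:
    SELECT p.pname
    FROM   Part p, Supply s
    WHERE  p.pno = s.pno AND s.city = 'Seattle';

    -- Equivalent rewriting that reuses the precomputed join:
    SELECT pname FROM PartSupplier WHERE city = 'Seattle';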
In query answering using views, we relax this a little bit and say: maybe we can answer the query by using some arbitrary algorithm that reads the views, not necessarily a query. Then we get slightly more expressive power. It's a subtle distinction, and I don't want to go any deeper into it. So is it clear what the problem is in answering queries using views? Now, we find this problem in very many places, and I wanted to discuss just two applications. The first one is physical data independence. What is physical data independence, regardless of the query answering using views question? What is the principle of physical data independence? Yes? It doesn't matter how you store your data; you should be able to have some logical structure that you query against. Exactly. You should be able to have a logical structure that you query against. Underneath, you should be free to modify the physical layout of the data, to add indexes, to drop indexes, to modify indexes, to modify the way the data is stored, without affecting the application. The application should see exactly the same logical representation of the data. This is physical data independence, or data independence. Now, what does this have to do with answering queries using views? The physical data underneath changes, the tables underneath the view change; you just have to change how the view is calculated. Where is the view here? Yes? The logical structure. You're querying this abstract concept, and when you have to evaluate that, you're getting data from the underlying physical representation. Right, that's the thing. There are two pieces here. There is a physical representation of the data, and then there is a logical representation of the data. Now, the logical representation is not stored, right? Nobody has that table. We only have the files that hold the data, from which we can reconstruct the table. So the physical representations are the views. Maybe our table is vertically partitioned into several tables, maybe even distributed. These are views over what we would like to query, over the logical table. So when a query is formulated over the logical table, we can't answer it directly; we have to rewrite it in terms of the physical tables. And that is the connection to answering queries using views. It's a very elegant concept that was introduced by people from the University of Wisconsin in the early 90s, before the answering-queries-using-views problem was well understood from a theoretical perspective. They observed that various kinds of indexes and various physical representations of data behave like views. Now, the second application is data integration. How come? What is data integration, actually? What is the goal of data integration? Yes? We have a bunch of disparate repositories of information, and you'd like to bring them together so you could query them without having to worry about all the different sources. Exactly. To take disparate sources of information and to offer a single unified mediated schema, the term is mediated schema, which gives the illusion that there is a single database. But when we formulate queries over this database, those queries need to be rewritten in terms of the physical information sources. Is there a question there at Microsoft? Okay, I thought I saw a hand. So that's the connection. We have this mediated schema that might not be populated; it's only a virtual representation of the integrated data, and all the physical, real sources are described as views over the mediated schema.
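For instance, in this style a source might be described like this; the schema and names are invented for illustration.

    -- Mediated schema (virtual, not stored): Movie(title, year, director)
    -- Source S1 only has movies from 2000 on, and no director information.
    -- It is described as a view over the mediated schema:
    CREATE VIEW S1 AS
        SELECT title, year FROM Movie WHERE year >= 2000;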
And the task of answering queries using views is to take a query over the mediated schema, issued by the application, and rewrite it into queries over the sources, where the data can actually be extracted from. Okay, any more questions about this paper? Yes? In the conclusion, there was a comment that there's kind of a gap between the two styles that we talked about, and I didn't understand it. It said the first approach was very bad with a large number of views. GAV versus LAV? Are you referring to these two keywords, global-as-view versus local-as-view? Well, the algorithms for answering queries using views are already being incorporated into systems. Maybe we should take... yeah, I'll take it offline. No, that's okay. It's great to have a discussion, but we need more participation, I guess. Then I have more concrete questions where I really would like everyone to participate. This is about reviewing the material we covered last week. What is a schedule? Who can tell me what a schedule is? In concurrency control, this is the object of our investigation, the schedule. What is a schedule? An order of reads and writes to the database? Yeah, it's an order of reads and writes to the database. Perfect. But each read and write needs to be characterized by two things. By what? The element being read, like the page or the record that moves between disk and memory. So we need the element name. And we need to know something else: which transaction did that read. And that's all we need to know. A sequence of reads and writes, and for each of them we need to know which transaction it belongs to and what element is being read or written. The schedule essentially represents an interleaving of these operations coming from different transactions. Perfect. What is a serializable schedule? That's basic; everybody should know. Thirty years from now, you should still know what a serializable schedule is. What is it? It's a schedule that is equivalent to a serial schedule. It's a schedule that is equivalent to a serial schedule. But what is a serial schedule? Yes? It's a schedule in which the transactions are run one after the other. Exactly. It's a schedule in which first we run one transaction, then the next transaction, then the next transaction. That's a serial schedule. And if our interleaving is equivalent to some serial schedule, then we call it serializable. Okay. So looking at a schedule, what do we call a conflict? A dependence between two transactions? And how should this dependence look? Write-read or write-write. Right. And the book actually has a wonderful way to summarize this: it's any pair of operations of which one is a write. If one of them is a write, then you have a conflict. Moreover, they have to be on the same element. So let's recap: a conflict is a pair of operations in a schedule, done by two different transactions, on the same element, of which at least one is a write. And what a conflict essentially says is that you can't swap the order of those two operations; their order is fixed to be what we see in the schedule, that conflict order.
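To have the notation in front of us: writing ri(A) for "transaction i reads element A" and wi(A) for a write, a serial schedule and an interleaved schedule of two transactions might look like this.

    Serial:       r1(A) w1(A) r1(B) w1(B) r2(A) w2(A) r2(B) w2(B)
    Interleaved:  r1(A) w1(A) r2(A) w2(A) r1(B) w1(B) r2(B) w2(B)

In the interleaved schedule, w1(A) and r2(A), for example, are a conflict: two operations from different transactions on the same element, one of them a write, so their order cannot be swapped.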
So what is a conflict serializable schedule? Conflict dependencies between transactions... That is one way to test it: it's a schedule where there are no cycles in the, I always forget the name, the precedence graph. That is the theorem that characterizes conflict serializable schedules, but we did not define them like this. How did we define them? It's a schedule that may have conflicts, but where we can find a serial schedule by swapping adjacent operations that are not in conflict. So we swap them; we are sure this does not affect the semantics of the schedule. And if we end up with a serial schedule, then we call the original a conflict serializable schedule. Okay? So now let's get to the hard part. What is a view serializable schedule? Is every view serializable schedule also conflict serializable, or vice versa? Let's start with the simpler question: is every view serializable schedule conflict serializable? No, that's false. It's the opposite: every conflict serializable schedule is view serializable. So what is a view serializable schedule? View serializable is when you can find a serial schedule, based on certain properties of the data that you process. Okay, so let's do this together. We look at a schedule. We call it view serializable if there exists a serial schedule such that this serial schedule is, in some sense, equivalent to ours: so-called view equivalent to our initial schedule. What does view equivalent mean? It's not that we get from one to the other by swapping; no more swapping. We have conflicts, but these conflicts don't actually affect the final result. Right, so we can't get from one to the other by swapping, because there will be conflicts that cannot be resolved; however, the two schedules are view equivalent. What was the example that we gave? Two consecutive writes in different transactions to the same value. Two consecutive writes, well, almost: they weren't exactly the same. Suppose we have two writes, and there is a third write that overwrites what both of them did. Then we can swap them, provided that nobody has read the result of the first write, the one that was overwritten by the second write. If in the end everything is overwritten by a third write, then we can swap them. The full definition of a view serializable schedule is longer; you have to read the slides, and it's also defined in our main textbook. But it essentially captures the following intuition: two schedules are view equivalent if every transaction that reads an element in the first schedule, reading the value written by some other transaction, reads the value written by exactly the same transaction in the other schedule. In essence, you have to show that in both schedules every read ends up reading the same value, and moreover that the end result, the final value of every element after all transactions finish, is the same in the two schedules. I have a much nicer formulation on the slides, so it didn't come out nicely here, but I hope you get the right intuition behind view serializable schedules.
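For reference, here is the classic kind of example, with blind writes (writes not preceded by reads); it is in the spirit of the textbook's example rather than copied from the slides.

    w1(A) w2(A) w2(B) w1(B) w3(A) w3(B)

This is view equivalent to the serial schedule T1, T2, T3: there are no reads at all, and in both schedules the final value of A and of B is the one written by T3. But it is not conflict serializable: w1(A) before w2(A) forces T1 before T2, while w2(B) before w1(B) forces T2 before T1, a cycle in the precedence graph.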
Okay, so now it gets really hairy. What is a recoverable schedule? It's about aborts. When a transaction wants to abort, what do we worry about? Right: maybe it has written something, and another transaction has read that thing, and now our transaction aborts. So what do we need to do with the second transaction, the one that read? We need to abort it too. If it read something that was dirty, a dirty read, and the transaction that wrote that value aborts, then we need to abort the transaction that did the dirty read. What can go wrong? When is it the case that we cannot abort that transaction? Exactly: if it has already committed and we gave out the money, then we are in trouble. So transaction one writes a value A, then transaction two reads A, and transaction two commits. And now transaction one wants to abort. This is not a recoverable schedule, because we can't handle transaction one's abort. So what is a recoverable schedule? It's a schedule in which, whenever a transaction commits, like T2 here, all the transactions whose values T2 read have already committed. So in order to make this schedule here recoverable, what needs to happen? We can't allow T2 to commit until T1 commits or aborts. If T1 aborts, then we are going to abort T2 as well, and if it commits, then we are going to allow T2 to commit. I know it was late last time when we discussed this, but I really insist that now everybody is on the same page with recoverable schedules. Are recoverable schedules clear? Then what is a schedule that avoids cascading aborts? That's the other, stricter condition. What are they? No reads come from uncommitted transactions. Exactly. Let's put it in a positive way: whenever a transaction wants to read, the value it reads must have been written by a transaction that already committed. So if you look back at my example here at the top, which I realize is not very nicely written, what needs to happen to make this a schedule that avoids cascading aborts? Exactly: T2 can read A only after T1 commits. So you're waiting even earlier for T1 to commit; it's an even stricter condition.
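Writing ci for "transaction i commits" and ai for "transaction i aborts", the three situations we just discussed look like this:

    w1(A) r2(A) c2 a1    -- not recoverable: T2 committed after a dirty read from T1
    w1(A) r2(A) c1 c2    -- recoverable: T2 commits only after T1, but the read was still dirty
    w1(A) c1 r2(A) c2    -- avoids cascading aborts: T2 only reads committed data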
Okay. So with these definitions fresh in mind, let's now talk about how to implement a scheduler. A scheduler is a module that simply watches the read and write requests coming from the transactions and decides who goes next: who is allowed to proceed, who has to wait, or maybe who needs to be aborted. There are two approaches to schedulers. The pessimistic schedulers are the ones we're going to discuss first. These are based on locks, and I suppose everybody knows what locks are. Then, either towards the end of this lecture or next Wednesday, we will discuss much more fun schedulers. They are called optimistic schedulers, and they use timestamps or validation or snapshot isolation, which is a term you have probably heard. So what I'm going to do over the next half an hour or an hour is discuss a locking scheduler, which is a very simple idea. The idea is that every element that can be read or written has an associated lock, and if a transaction wants to read it or to write it, it needs to acquire that lock. If it cannot acquire the lock, because the lock is held by a different transaction that is doing something to that element, then the transaction has to wait. When the lock becomes available, the transaction gets the lock and can proceed. Let me ask you this: does everyone here know what a lock is? Should we repeat the definition of a lock? Okay, so I will assume that everyone knows what a lock is. For the examples I'm going to give next, the notation is: Li(A) means transaction i acquires the lock on A, and Ui(A) means it unlocks A. So, lock and unlock; I'll mostly use uppercase. Okay, so let's take an example. Remember our non-serializable schedule. Here are two transactions. The first wants to add 100 to A and then add 100 to B. The second wants to multiply A by 2 and then multiply B by 2. If they were to run in isolation, remember I from ACID, I means isolation: the effect of the transactions should be as if they executed in isolation on the database. Well, if T1 were to execute in isolation on a database where A and B are equal, then at the end of the execution A and B are still equal; they are just increased by 100. Similarly, if you execute T2 in isolation, you start with A equal to B and you end up with A equal to B. But if you interleave them like this, then we are in big trouble: A and B are no longer equal. So obviously the transactions have not executed in isolation, and this is a non-serializable schedule. Good, I hope this refreshes your memory. Here is how locks are supposed to solve the problem. Before it reads, T1 must acquire the lock. So it locks A, reads A, writes A; when it's done, it can unlock A. Now it wants to work on B, so it's going to acquire the lock on B. In the meantime, the scheduler decides: you've spent too much time here, so why don't you stop for a while, I'm going to schedule transaction 2 now. Transaction 2 acquires the lock on A, reads A, writes A, and unlocks A. Now it would like to move on and do the operation on B. But when it tries to acquire the lock on B, it's going to be put on a waiting list; it will wait for this lock to be released, because the lock is held by transaction 1. So transaction 1 proceeds, it unlocks B, and now transaction 2 can continue, and life is good. If you look at this schedule, it is serializable: we have prevented the non-serializable schedule from the previous slide. Is everybody with me? Because now I'm going to show you something quite surprising. What if transaction 2 takes the lock on B? When? Before transaction 1 gets the lock on B. That's exactly the problem. She had a comment starting with "but", and I just want to emphasize: my next slide also starts with "but". That's exactly the issue. Maybe the scheduler switched to transaction 2 before transaction 1 got a chance to acquire the lock on B. So now, when transaction 2 wants to acquire the lock on B, it's okay, it can proceed. It unlocks B, and then transaction 1 proceeds. But is this a serializable schedule? No. That's exactly our non-serializable schedule. If you think about it, this is actually quite bad news. You probably know locks from operating systems. Locks are supposed to be the magic; locks are supposed to prevent multiple threads or multiple processes or multiple transactions from accessing the same resource at the same time. But look, they did not do the job here. They did not ensure that the resulting schedule is serializable. Okay? So how do we fix this?
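Before we discuss fixes, here is the failing schedule from the "but" slide, written out in our notation as a sketch:

    L1(A) r1(A) w1(A) U1(A)
    L2(A) r2(A) w2(A) U2(A) L2(B) r2(B) w2(B) U2(B)
    L1(B) r1(B) w1(B) U1(B)

Every lock request is granted immediately, every element is locked while it is accessed, and yet the result is exactly the non-serializable schedule, because T1 released A before it acquired B.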
Do you have a question? Add a lock on the locks, so that your unlocking and locking become one atomic operation; if unlocking A and locking B were themselves inside a locked section... That would work here, but I wonder if it generalizes to three elements. Maybe if you have three elements, I'm not sure it works. Our main goal is to keep A and B equal. Then lock the pair: one lock for both A and B. That's a good idea. But in practice, you cannot implement locks on pairs, because every transaction may want to read several elements, and you don't have a lock for every subset. Yes? You could lock every variable being used by a transaction when it starts. You could do that here. Everybody seems to be heading in the right direction, so let me show you what the buzzword is; you will recognize it. The technique that ensures serializability is called two-phase locking, 2PL. I'm sure you have heard about this. The rule in 2PL is that all the lock requests must precede all the unlock requests. In other words, once you start unlocking, you're no longer allowed to lock anything. If you look back here, this rule is violated, because transaction one unlocks A and then later wants to lock B. So this here is not two-phase locking. Does this make sense? A small thing: suppose we have n elements, and one transaction locks them in order 1 to n, and the other one starts from n and goes down to 1. That's an interesting phenomenon. Here is what she says: one transaction wants to lock elements one through n, so it locks one, two, three, four, five... Transaction two wants to lock the same elements, and it goes in the opposite order: n, n minus one, n minus two... What happens here? Do we get a non-serializable schedule? We get a deadlock. That's a different, it's a big headache, but it's a different headache, and I'm going to talk about it later. Right now, I'm only worried about ensuring that the schedule is serializable. If we get a deadlock, then we don't get a schedule at all; we just stop. And in this case transaction one doesn't hold all the locks it needs. Yeah, I think we agree this is a bad situation, but a different one. The interesting thing is that with this very simple rule, the scheduler can guarantee that the resulting schedule is serializable. And that's actually quite amazing. It's not necessary; this is not the only rule that can guarantee serializability, and I will briefly mention a second one later. But this is what works, it's simple, and this is what database systems implement. Yes. So that means that anything that locks early on in a transaction is going to hold that lock a lot longer? Exactly. Let me repeat what he said: this means that if you lock something early in the transaction, you are going to hold on to that lock, most likely until the end of the transaction. So are there any other equally valid methods, optimized for shorter lock times? To my knowledge, there is no other such strategy that ensures serializability. What database systems do is relax the serializability requirement: they have weak isolation levels, and we are going to talk about them today; we definitely need to get to them today, and I will show you what happens there. But if serializability is our gold standard, then we have to implement two-phase locking. Again, it's not the only way; if you wanted to investigate the theory of serializability, it's not necessary that you do this through two-phase locking. But it's a simple rule that works. So let's see it in action.
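In our notation, a transaction obeying 2PL has a growing phase, where it only acquires locks, followed by a shrinking phase, where it only releases them. For example:

    L1(A) r1(A) w1(A) L1(B) U1(A) r1(B) w1(B) U1(B)

All lock requests come before the first unlock; in particular, T1 here acquires the lock on B before it gives up the lock on A.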
Yes, there is a question. What happens if transaction two is reversed in order, first B, then A, with the lock on B not acquired until... well, the same way as in one of the slides? Well, as long as all its locks still precede all its unlocks, it's still 2PL. But will that not cause problems? Let's actually examine the 2PL slide, because we haven't looked at it yet; let me describe it first, and then I'll come back to your question. This is the slide that illustrates 2PL. Here, in T1, I decided to lock both elements at the beginning. When we are done with A, we can unlock A, and later, when we are done with B, we unlock B. Now, when transaction two would like to lock B, it's going to be denied, because this lock is held, and therefore it's going to wait until transaction one finishes and has unlocked both of them. And now the schedule is serializable. So, getting back to your question: you wanted to lock them in a different order. So if you move lock B lower in transaction one, L1(B), you move it lower, yes, and you move the unlock together with it, so they're both a little lower. And then you move the second half of transaction two before the first half, so you move B before A. You want this whole thing to be moved earlier? Yes? Then you might get a deadlock in this case, because transaction one will acquire the lock on A first and transaction two will acquire the lock on B, and then each of them will wait for a lock held by the other transaction. So 2PL does not prevent deadlocks; it only ensures serializability. It does not prevent deadlocks, and actually, nothing practical prevents deadlocks; we'll discuss this in a few slides. There are techniques that can prevent deadlocks, but they are expensive and they are not worth the effort. Good. So let me make this a formal statement: two-phase locking guarantees that the resulting schedule is serializable. Proof. Let's prove this. Suppose not. Suppose we run two-phase locking and something bad happens, like in our example when we used plain locks, and the resulting schedule is not serializable. Well, if the resulting schedule is not serializable, it's not conflict serializable; I didn't say this earlier, but 2PL actually guarantees conflict serializability. If the resulting schedule is not conflict serializable, then there is a cycle in the precedence graph. The cycle can be long; I illustrate here a cycle of length 3. So let's examine this cycle and let's prove that such a cycle cannot exist; there must be a contradiction. I'm going to start by looking at this first conflict. If we have an arrow from T1 to T2 in the precedence graph, it means that there is a conflict in the schedule between some operation done by T1 and some operation done by T2, and the element, A, is written here. Now, with locks, T1 must have acquired the lock on A, and only after T1 releases that lock can T2 do something to A, that is, acquire the lock on A itself. And this is why I wrote this first line: because of the exclusivity of the lock, the time when T1 releases the lock on A must precede the time when T2 acquires the lock on A. Is this first line clear, why I wrote it here? Did you have a question? Good. Now let's look at the second line. I'm claiming that the time when T2 acquires the lock on A strictly precedes the time when T2 unlocks B. Why? 2PL. Definition of two-phase locking, exactly. This is because of 2PL and only because of 2PL. So now we are one step further: we have reached the unlocking of B.
Now let's look at what happens after we unlock B. Look at the second conflict. Transaction 3 had a conflict on B with transaction 2, and transaction 3 came later. Therefore, when transaction 3 acquires the lock on B, this can only be after transaction 2 has released that lock. So why is this temporal ordering here? Because of mutual exclusion; because this is what locks ensure. You can't acquire the lock on B until the previous transaction has released it. Okay, we have a cycle here of length 3. Now I'm making the claim that transaction 3 acquires the lock on B before it releases its lock on C. Why? Yes? 2PL. 2PL, exactly. And from here on it's the same thing. There is this conflict from T3 to T1, and because of that, the unlocking of C by T3 must precede the locking of C by T1. This is because of mutual exclusion. And finally, because of 2PL, T1 must first acquire the lock on C before it releases its lock on A. Now, everything I underlined here means that the time when the operation on the left happened is strictly smaller than the time when the operation on the right happened. But look, we have a cycle. This can't happen in time. This is not a cycle in the precedence graph; this is a cycle in time, and we can't have a cycle in time. So this is the proof that two-phase locking guarantees serializability. Any questions about this proof? I think it's insightful. And something to look for when you see a proof: where is the extra assumption used? Did we use the fact that the schedule is a 2PL schedule? Yes, we used it in several places.
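Here is the whole chain we just built, written compactly; "<" means "happens strictly earlier in time":

    U1(A) < L2(A)    (mutual exclusion on A)
    L2(A) < U2(B)    (2PL for T2)
    U2(B) < L3(B)    (mutual exclusion on B)
    L3(B) < U3(C)    (2PL for T3)
    U3(C) < L1(C)    (mutual exclusion on C)
    L1(C) < U1(A)    (2PL for T1)

Chaining these gives U1(A) < U1(A), a cycle in time, which is impossible.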
Okay. But now we have a completely new problem. We are running 2PL, and look what transaction 1 wants to do: it wants to abort. Transaction 1 aborts. So what is wrong with this schedule? It is 2PL: the locks happen here, the unlocks happen here. And it is serializable. Yes? So you add a similar constraint, where you can't unlock until you commit or abort? Yes, but first let me bring everyone to the same page. The problem here is that the second transaction has read a value A that was written by transaction 1, and actually it also read the value B written by transaction 1. But these values will be undone when the transaction aborts: the transaction manager will undo those actions and restore the elements to their original values. And now we are in trouble, because transaction 2 has read them. And moreover, things can actually go even worse: if transaction 2 commits before transaction 1 aborts, then we are in big trouble, because then we can't abort it anymore. So does everyone understand the consequence of this example? Two-phase locking guarantees serializability, that's fine, but the resulting schedules may not be recoverable. We can't run them. And the fix, again, is extremely simple; this slide just summarizes it. The policy that database systems apply is called strict two-phase locking, and it adds one simple rule: transactions are not allowed to unlock until the end. Unlocking happens atomically with the commit or the abort. So if you go back to our example, this one here, this schedule will no longer be possible, because these two unlocks need to be done at the very end, when the transaction commits or aborts, and therefore this will prevent T2 from reading dirty data. So what is interesting about strict 2PL? The schedule is guaranteed to be recoverable, and in fact it's guaranteed to avoid cascading aborts. Remember, a schedule that avoids cascading aborts ensures that whenever a transaction reads a value, the transaction that wrote that value has already committed. We know that under strict 2PL that transaction must have committed, because otherwise it would not have released the lock: it hangs on to its locks until it commits. So this is why it avoids cascading aborts. It's actually even more. This schedule is what's called strict, which I didn't define, but here is something fun for you to do: read in the book what a strict schedule is. It's even more restrictive than avoiding cascading aborts, and it turns out that the schedules enforced by strict two-phase locking are guaranteed to be strict. So this is what database systems do. They use locks, but with strict two-phase locking, in which locks can be acquired at any time but are held until the end of the transaction; they are only released when the transaction commits or aborts. Any questions about strict two-phase locking? Yes. There seems to be this tension between the desire to have very short transactions and this rule, which pushes the lifetime of locks out even more. So the example you had a couple of weeks ago, where some guy had transactions that lasted several months, seems more and more dangerous. So it's all on the application developers, then, to ensure that transactions close promptly and that transactions are small? Yes, this is on the application developer. There is also the other class of concurrency control mechanisms, the optimistic ones. But usually they only prevent schedules from happening; they don't allow more schedules. So in some sense, strict two-phase locking is the most general scheduling mechanism that ensures serializability. So yeah, it's up to the programmer and to the database administrator, who controls admission control, how many transactions are allowed to run concurrently in the system. Okay, so more thoughts about locks. We discussed a single kind of lock, no matter whether transactions want to read or write. Now, you probably know this: it makes much more sense to have two kinds of locks. Shared locks, which are used for reading, and exclusive locks, which are used for writing. So what's the rule? If I hold a shared lock and he wants to read, is he going to wait for me? No. Shared locks are not in conflict with each other. But if I hold a shared lock and she wants to write, what happens? Then she will wait until I release my shared lock. What happens if several transactions hold a shared lock and she wants to write? Then she will wait for all of them to finish, to terminate. And vice versa: if she holds the exclusive lock and we want to read, we all have to wait until she finishes with the exclusive lock. It's very simple.
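The rules we just walked through fit in a little compatibility matrix. A row is the lock already held, a column is the lock being requested; "yes" means the request can be granted:

                 requested S    requested X
    holding S        yes            no
    holding X        no             no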
So then there are other kinds of locks that some database systems support. One is called an update lock, which starts like a shared lock and can then be escalated to an exclusive lock. So during the beginning of your transaction you can tolerate other transactions reading this element, but at some point you want to update it, and then you acquire an exclusive lock. There is even something called an increment lock, which allows for commutative operations. So I can increment, and you can increment, and we can switch the order, and it's still fine, because increments commute. So database systems go to great lengths to try to reduce the contention on common elements. Other techniques control the granularity of the locks. Locks can be fine-grained: just on the record, which is kind of the logical thing to do. But if you do this, then the data structure that maintains all the active locks becomes very large and expensive to access, and then every lock request or unlock request is going to be more expensive. The alternative is to use coarse-grained locking, so locking entire tables. And actually, database systems do both: they have both kinds of locks, and they do lock escalation, which means maybe you have a read lock on a table and now you narrow down to a write lock on a record. Or vice versa; what's the right term? The opposite of escalation. De-escalation? I don't think so. So I should be honest: I do not understand the details of how these locks are used. But the principle is pretty straightforward: they try to minimize contention. Okay, deadlocks. Everybody in this room and in the other room wanted to talk about deadlocks, so let's talk about deadlocks. A deadlock is when there is a cycle in the waits-for graph. Transaction one waits for a lock that is held by transaction two, and transaction two waits for something held, not necessarily by transaction one, but by a different transaction; in the end, you close a cycle, and then you have a deadlock. Deadlocks are bad. They are one of the unpleasant consequences of using locks in order to do concurrency control. There are two ways to deal with them, and what database systems do is the second one, deadlock detection. What do they do? They have timeouts: if a transaction doesn't make progress in a certain amount of time, they assume that it's deadlocked. Or they trigger a more expensive algorithm that constructs the waits-for graph of all the transactions; they want a clear picture of who waits for what. And in this graph, they look for a cycle. If there is a cycle, then we know there is a deadlock and we know we need to abort one of the transactions on the cycle. And that's exactly what they do: they abort a transaction on the cycle. I know SQL Server has an interesting strategy for how often to run this deadlock detection algorithm. It runs it every, I used to know this, ten seconds or five seconds. Does anyone know how often SQL Server runs the deadlock detection algorithm by default? Periodically. Sorry? Periodically. Periodically, yes, but I used to know the exact period, and I think it's about ten seconds or so. The interesting thing is that if it detects cycles, then it decreases the period, because it assumes that the system is overloaded, and this is why there was a cycle; so with high probability there will be another cycle after a short period of time. And whenever it detects a cycle, the system will choose a transaction on this cycle and abort it, and hope that now there is no more cycle. In order to try to avoid deadlocks, one strategy would be to acquire all the locks at the beginning. If you know which elements you're going to read or write, then if you acquire all the locks at the beginning, you don't create these cycles. But sometimes that's not possible, because we don't know what the application code will do, which elements it's going to read or write.
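Here is the classic deadlock in our notation, as a sketch:

    T1: L1(A) r1(A) ...  L1(B)?   -- blocked: B is held by T2
    T2: L2(B) r2(B) ...  L2(A)?   -- blocked: A is held by T1

Each transaction waits for a lock held by the other, so the waits-for graph contains the cycle T1 -> T2 -> T1, and the system has to abort one of the two. Notice that T1 and T2 acquired A and B in opposite orders, which is exactly what the ordering technique below rules out.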
The technique that guarantees deadlock avoidance is to have an order on all elements, and to strictly enforce that locks are acquired in this order. This guarantees no deadlocks. But what does that mean in the context of databases? It means that you have to impose a global order on all the records in your database. And if a transaction acquires a lock on record number 50,000, once it gets that lock, it's not allowed to go backwards: it may not request a lock on record number 30,000. It's only allowed to increase the element number for which it acquires locks. It's completely impractical; you cannot enforce this total order in a database setting. So this is why database systems are stuck with deadlock detection, and with aborting a transaction whenever a deadlock is detected. The second one is deadlock avoidance, acquiring the lock set at once. That's what we're doing in Plochial. Sorry? Actually, let me ask this to everyone. She said this: acquiring all the locks at once. Is this 2PL? Or is this strict 2PL? What does strict 2PL say? It says that we have to unlock everything at the end, but it doesn't say when we lock them; we can lock at any time. In strict 2PL, we are forced to hold the locks until we commit. This is the opposite, because here you should acquire all the locks at the beginning, in one single atomic operation, and that's more difficult to enforce. Which actually raises the question: when you write an application and you update the database, you don't worry about locks. Your Java code or your C# code will simply issue update or insert or delete statements to the database. No locks. Where are these locks coming from? Who places these locks? That is the transaction manager, or the locking scheduler. It actually has two functions. The first one, task one, is that it needs to insert the lock requests, the locks and unlocks. So how does it know to do this? It's actually very simple. Whenever your application code requests a read or a write, the locking scheduler will first insert a request for the lock corresponding to that element. So, if you imagine the sequence of operations of the transaction, the locks are inserted right before each read or write. And this ensures strict 2PL, because the scheduler never releases them until the very end: it just acquires the locks, and when the transaction commits, it issues a global release, a global unlock, for all the locks held by the transaction. So it's really a very simple way to enforce locking. And I hope you have this picture in mind. Try to imagine yourself implementing the locking scheduler: you just have to monitor the requests made by the transactions; whenever there is a read request or a write request, you inject a lock operation right before it, and whenever there is a commit or abort request, you do all the unlocks. Yes? I've seen one example in MySQL where you can get a write lock up front: you can do a SELECT FOR UPDATE. I just want to mention it, because it seems like one of the few places where you have access to a more serious form of locking up front. So you could almost do the "get all your locks up front" deadlock avoidance. I see. So you're referring to a programming practice. MySQL has this thing: you can SELECT FOR UPDATE and it takes write locks. I see. So it's not just a programming practice; it's a command. It's a command in MySQL. But it also suggests a good trick: if you didn't have it, then you could issue a dummy update at the very beginning to acquire all the locks.
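For the record, the statement being discussed looks roughly like this; the table is invented, and the exact locking behavior varies between systems, so check your system's documentation:

    BEGIN;
    -- Take write (exclusive) locks on the matching rows up front:
    SELECT balance FROM Account WHERE owner = 'Alice' FOR UPDATE;
    -- ... application logic ...
    UPDATE Account SET balance = balance - 100 WHERE owner = 'Alice';
    COMMIT;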
That's just standard SQL, not specific to MySQL. Oh, I see. What's the name again? SELECT FOR UPDATE. SELECT FOR UPDATE. So we have a comment here in this room that there is a command called SELECT FOR UPDATE that you can issue at the beginning of the transaction, and the effect is that it will acquire all the locks for that particular update. And they are write locks, I suppose: exclusive locks. What's the downside of doing this? Why not always do this? If you roll back... If you roll back, it's the responsibility of the transaction manager to undo things anyway. I mean, it's an exclusive lock, so that means no one else can read from that point on. Exactly. The downside is that you're reducing the amount of concurrency in your system, because once you acquire these locks, nobody else can touch those rows. Did you have a comment here that I interrupted? Oh, I was going to say, yeah, FOR UPDATE will put an exclusive update lock on all rows that are returned. Right, so it will put an exclusive lock on all those rows. If you don't do this, there is a chance that other transactions might proceed in the meantime. Is there any other way to do it if you want to update? The alternative is to postpone acquiring the lock until the moment when you actually need it. What this SELECT FOR UPDATE does is force the locks to be acquired at the beginning of the transaction; the alternative is to let the locks be acquired only when you do the actual update, which could be significantly later in the transaction, if the transaction is long. Okay. So this is one thing that the locking scheduler needs to do. On the other hand, it needs to enforce the lock semantics. And for this, you have to picture yourself having to mediate between the transactions and the database system. You need to know which transactions hold which locks, and whenever a lock is acquired or released, you need to update the data structure that maintains, for each transaction, the set of locks, or, for each element, the set of transactions holding or waiting for that particular lock. This is the lock table, and it is big. It's a critical data structure in any database system: the table where the information about locks is stored. It has to be a main-memory table, and this is where much of the time goes. When a transaction wants to acquire a lock, even if the lock is granted, there is a significant overhead just to update this data structure, to record the fact that the transaction now holds a lock on this element.
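Conceptually, one entry in the lock table records something like this; a sketch, not any particular system's layout:

    element A:  mode = exclusive, held by = {T7},
                wait queue = [T3 wants shared, T9 wants exclusive]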
Okay. So here is a great picture that I essentially copied from the book. This shows you the performance of a locking scheduler as a function of the number of transactions in the system. Actually, first: what is throughput? How would you define the throughput of a database system? It's measured in TPM. What does TPM mean? Or TPS? Transactions per second. Transactions per minute, or transactions per second. So this is how many transactions per second the system can serve. Think about a banking system: how many transactions per second can it serve? The more the better, because it means it can serve more customers. And this graph is the throughput as a function of the number of transactions. You expect that as the number of requests increases, the throughput should increase, because now we have more requests. But at some point it doesn't increase anymore. It flattens, and it's actually worse: it starts to decrease. You're asking for more transactions, but the number of transactions that are executed per second decreases. Why? What happens behind the scenes? Why can't the system keep up with the increased number of transaction requests? The number of uncommitted transactions goes up. So they are waiting; for what are they waiting? They're waiting for locks. Now more transactions hold locks, and they're waiting for each other. And something even worse might happen: a deadlock. Maybe there is a deadlock, and then the system has to abort some transactions, and the throughput decreases even further. So if you're a database administrator, this is your sweet spot: you want to be here. You don't want to admit more transactions than the system can handle. Of course it's difficult to tune this; you don't know, it depends on your workload. So, I mentioned that strict two-phase locking is not the only way to guarantee serializability. I'm going to mention one more, because it is described in the book. This is the way database systems handle locks for tree-like structures, for example, indexes. When you need to update an index: an index is a tree, it looks something like this. If you do an update, you first want to read the record, so you traverse the tree top-down to get to the corresponding leaf. Whenever you read a node, of course, you need to acquire the lock on that node. Now look at what happens. All these hundreds or thousands of transactions per second, they all contend for one particular thing. For what? You know, another transaction might not want to read the same element; maybe the other transaction goes this way. So where is the contention? For the root node. Everybody wants to go through here; there is high contention. Sorry? If it's a read, the lock is shared. If it's a read, it is shared, yes. But if these transactions want to do an update, and we will study B-trees, in the worst case a node split can propagate up to the highest level. It happens very rarely, but you don't know when it will happen, so you would need to hold all those locks in order to ensure that you can split the node. And updates are quite frequent, even if they land in different places among the leaves of the tree. So database systems do not implement strict two-phase locking for accessing trees. There is a different protocol, which is also a lot of fun. It's called the tree protocol. I think I describe it on the next slide, but let me show it here. It says something like this: if you're holding the lock on a node of the tree, you can acquire a lock on any of its children, and then you can release the lock on the node. So it goes down the tree, like this. In fact, the rule is written differently. It says that in order to acquire the lock on any node of the tree, you first must hold the lock on its parent; but once you have the child's lock, you can release the parent's lock, because you only go further down the tree. Now, this doesn't quite solve the splitting problem, where we might need to go back up and split; this is something to discuss when we get to B-trees. But I can tell you what the trick is: they split aggressively. If the root is full, then they split it right away, because they don't want to hold the lock until they come back, and they apply the same rule to every single node on the way down.
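Down one root-to-leaf path, the protocol looks like this in our lock notation; this hand-over-hand pattern is sometimes called lock coupling, or crabbing:

    L(root) L(child) U(root) L(grandchild) U(child) ... L(leaf)

Note that it deliberately violates 2PL, since locks are released while new ones are still being acquired, and yet, for tree-shaped accesses, it still guarantees serializability.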
So, to summarize: in order to avoid contention on the root and the upper levels of the tree, database systems split aggressively, preemptively. Yes? What if you took, for example, a very high-traffic table, and had multiple roots, accessed by a hash of the row ID? But you still need locks: if you want to update the nodes, then you still need to lock them. I'm saying that the tree locking structure would have multiple roots, one for each hash value of the row ID. So are you referring to an index structure? Yeah. So there are two index structures, and we will discuss them both. One is the B-tree, the B+ tree, and the other is the hash table. In a hash table there is some contention too that you need to worry about; I don't know exactly what kind of protocol database systems use there. But for B-trees, this is the protocol that's being used. It's called the tree protocol. Okay. I'm not doing very well on time, so let me discuss the phantom problem before we take a break. And please pay attention: it's a subtle problem. So let me show you the issue directly. I'm showing you here two transactions, and, as you can see, I didn't write them as read and write operations; I wrote them as SQL queries. Look at what transaction one does. It just wants to select all the blue products; maybe it wants to count them or display them or something. Then it does some other stuff, and then it reads all the blue products again. Why a second time? I mean, who knows. It's a beginner programmer, none of you; they just wanted to read the blue products twice. Now, the other transaction wants to insert a product, which is blue. And now my question to you is: if the schedule is, as here, the first SELECT, then the INSERT, and then the second SELECT, is this a serializable schedule or not? It's not serializable, no? What happens with the two SELECTs on the left? Yes? In the second one we see an extra product. So this is not a serializable schedule. But actually, if you look deeper, it gets worse, because if I translate this into a sequence of read and write operations, then it is not just serializable, it's actually conflict serializable. So let's take a look. Let's suppose there are two blue products; call them X1 and X2. So what does the first SELECT do? Well, it's going to read X1 and read X2. What does the INSERT do? It's going to write neither X1 nor X2; it's going to write something new, called X3. What does the first transaction do now? It's going to read all of them: it's going to read X1, X2, and X3. Where is the conflict? This is a conflict serializable schedule, which is equivalent to what serial schedule? Who goes first; which transaction should be executed first? T2. Because we do have this conflict here, between the write of X3 and the read of X3. But something goes wrong. What goes wrong here? Obviously, when we look at the SQL queries, this schedule doesn't seem serializable. But if you write it down as a sequence of reads and writes, as we learned, it turns out to be serializable. What did we do wrong? There are more reads in T1 than you've written down. I mean, it's actually reading the entire table, which is an element that you haven't labeled. We're assuming there are only two things that are blue when you start. Right, but still, it's not fair to just say it's reading X1 and X2. It's reading something else, which T2 changes. Exactly. That's what happens. This X3 appeared out of the blue. It appeared out of nothing; it wasn't there before.
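In SQL, the schedule on the slide looks roughly like this; the table and values are invented for illustration:

    -- T1, first read:
    SELECT * FROM Product WHERE color = 'blue';    -- sees X1, X2
    -- T2, in between:
    INSERT INTO Product(pname, color) VALUES ('gizmo', 'blue');
    COMMIT;
    -- T1, second read:
    SELECT * FROM Product WHERE color = 'blue';    -- sees X1, X2, and the phantom X3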
So the assumption that we made so far, and this is what the phantom problem is about, the assumption that doesn't hold, is that the database is static. We assumed that the database consists of elements and that this set of elements is fixed: these are all the elements there are. But in practice, when an element is inserted or even deleted, the database is dynamic. The set of elements is no longer fixed; it can grow or it can shrink. And in our example we had this new blue element, an element that the second transaction inserted, and it just appeared out of nothing. This is called a phantom, and dealing with phantoms is difficult; there are no simple solutions. Here are some things that database systems do. They can lock the entire table: whenever you have to read or to write, you lock the entire table. This will guarantee that the schedules are serializable, and it deals with phantoms correctly. A more clever way is to lock the index entry for the attribute on which we select. If there is an index on color, for example, then we would lock the entry for blue, and this essentially prevents anyone from doing anything to the blue products. They won't be able to insert a new blue element, because that requires access to the index. Or there is something called predicate locks. These are not locks on an element; they are locks on some predicate, and whenever somebody wants to perform some action, the system needs to check that the action doesn't satisfy any predicate for which there is a lock. That can also prevent phantoms. But, bottom line, there is no clean way; there is no simple way that ensures serializability and deals correctly with phantoms. Database systems probably use the first option, or a combination of the other two, and it's expensive. I have a question here. It's more of a trivia item: MySQL does the second one by default. So MySQL does this: it places a lock on the index entry, if there is an index. It will create indexes. Any more questions or comments? Then we'll take a four-minute break, and afterwards we'll discuss levels of isolation.

Okay. So, as I promised, I brought here the two copies of the other book. I can lend them to people in this room, and please share them among yourselves. Who would like a copy? I'm going to place them here. Good. So let's continue. I would like to cover the remaining two topics. One is degrees of isolation, and the other is the ARIES recovery manager. So let's start with degrees of isolation. The default isolation level in SQL is the serializable level of isolation, or at least that should be the default. But, as you saw, especially because of phantoms and because of the duration of the locks, this can sometimes be too expensive. It is quite often the case that we can relax, we can weaken, the requirement that transactions be ACID. This depends from application to application, but we may be able to weaken this requirement and, as a consequence, improve performance. And this is what database administrators often have to do in order to improve performance: weaken the degree of isolation. So what are these degrees? There are actually four degrees defined by the SQL standard. They are called read uncommitted, read committed, repeatable read, and serializable.
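The SQL syntax for choosing among them, which we'll see on the slide in a moment, is:

    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;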
So the other transaction might abort; who knows? It's a dirty read. The next is read committed. Now you're only reading committed values, written by committed transactions. However, if you read a value a second time, you might get a different one, because another transaction might have overwritten it. The third one is repeatable read, which guarantees that you get the same value if you read it multiple times. And the last one is serializable transactions. And there is a trade-off: the level of consistency that we get is being traded off for performance. We get more performance for the weaker levels of isolation. Okay. So I essentially have the same information here a second time. The syntax is this: we say SET TRANSACTION ISOLATION LEVEL, then we say what isolation level we want, and that's the isolation level that the system will implement. And ACID is only the last one; that is what ensures the ACID properties. Now I have a warning for you. This is a very fuzzy area of database systems. Every vendor feels free to break the rules a little bit in order to get better performance. So you should not assume that, by default, a database system implements serializable transactions. Do read the documentation to understand exactly what they are doing. Yes? Is there an isolation level which will avoid phantom records? Yes, that's the serializable isolation level. That should avoid the phantom problem. Okay. So let's discuss them one by one. The first one, dirty reads, is called read uncommitted: isolation level read uncommitted. Here is how it's implemented; the implementation is actually easier to describe than the semantics. It is like this: transactions have locks for writes, and for those the scheduler implements strict two-phase locking. But there are absolutely no locks for reads. You want to read something? There is no lock. The consequence is that if you have an application where the transactions are read-only, that application will run blazingly fast, because no transaction ever needs to acquire a lock. If you have an application where most of the transactions are read-only and one or two do some writes, then most of the read-only transactions will run very, very fast. However, you should be aware that, the reads being dirty, you might read data that is not consistent. Second isolation level: read committed. The implementation here is again very straightforward. The locks for writes are long duration, which means strict two-phase locking. Now the reads do have locks, but they are short-duration locks. That means that whenever a transaction wants to read an element, a read lock is acquired first, but once the element is read, the read lock is released. This means that you will never read a value written by a transaction that has not committed yet, because that transaction still holds the write lock; you will wait until the transaction that wrote the element commits. However, since you release the read lock immediately after you read, another transaction might overwrite the element, and next time you read it you get a different value. So this is the read committed isolation level. The next one is called repeatable read, and here the rule is very simple: strict two-phase locking for both reads and writes. And this is still not serializable. Why does this isolation level not ensure serializability? Because of the phantom problem. Exactly. So this ensures serializability over a static database.
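Before moving on, here is the syntax from the slide, in standard SQL, annotated with the lock-based implementation just described for each level:

```sql
-- The four levels from the SQL standard, weakest to strongest:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; -- no read locks; strict 2PL for writes
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;   -- short-duration read locks; strict 2PL for writes
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;  -- strict 2PL for reads and writes; phantoms still possible
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;     -- strict 2PL plus phantom handling (e.g., index locks)
```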
And repeatable read is a good isolation level, because ensuring complete serializability can get quite expensive. Yes? For the two database systems that we are using in class, what are the default isolation levels? Postgres and SQL Server. For Postgres, as far as I know, it's serializable. Read committed. Read committed; that's the default if you don't specify anything. Does anyone know what the default isolation level is for SQL Server? For SQL Server I'm sure it's read committed. Yeah, the same as for Postgres. Okay. The other question to ask is whether database systems indeed offer the serializable isolation level. Again, something to read in the documentation. And the last comment on this part: sometimes, if you declare transactions to be read-only, you can get more performance from the system. You give the transaction manager more options to schedule those transactions. What systems actually do is use an optimistic concurrency control mechanism for transactions that are read-only, and locking for transactions that both read and write. And the syntax is SET TRANSACTION READ ONLY. Okay. Any more questions about isolation levels? Yes? Consider a scenario where we are not inside a transaction and we are running individual queries. If we provide locking hints in the SQL query, does that bypass the transaction isolation level that has been set? Oh, I just don't know. I don't even know the syntax of locking hints. Something like WITH (NOLOCK). Oh, if it says no lock, then I'd be pretty convinced that this will not be serializable. Now, what I'm asking is: will it bypass the isolation level that was set? I suppose so; I suppose that's why they accept those hints. But again, I would have to read the documentation of these hints; I don't know exactly what they mean. But that's the only thing they can do: they can bypass, they can weaken, the isolation level. Okay. So what I want to do in the remainder of this lecture is discuss one of the two advanced topics that I had in mind, which is the ARIES recovery manager. Next time we will discuss timestamp-based concurrency control mechanisms, which are optimistic concurrency control mechanisms. They are based both on the other book, by Ullman, Garcia-Molina, and Widom, and on some material on snapshot isolation that I didn't find in any book. And you need those for the homework. You need both topics for the homework, which is due two weeks from now. Okay. So, talking about ARIES, let me start by revisiting some basic principles of the recovery manager. We discussed last time a very simple recovery manager based on undo, or redo, or a combination of undo and redo. That was a great way to get exposed to the basic principles of a recovery manager, which I want to recap now before we discuss ARIES in detail. There are two dimensions along which you can implement transactions, which gives four combinations. One is to allow steal or no-steal. The steal policy is when you allow a transaction that has not committed yet to overwrite an element that was written by a transaction that committed. Under no-steal, you don't allow this; you do not allow the overwrite to happen until the transaction commits. So with no-steal you are postponing the outputs until the transaction commits. Steal does the opposite: it allows the transaction to overwrite even if it has not committed yet. The other dimension is force versus no-force, and it's easier to understand.
Force means that whenever a transaction modifies an element, you force that output to disk. And no-force means you allow the update to stay in the buffer pool until the buffer manager decides to output it. Now, the easiest combination for recovery is no-steal with force, which means you don't allow a transaction to overwrite until it commits, and when it commits, you output all its updates to disk. That's about the easiest case for recovery; actually, there is then nothing to recover, because the commit means that everything has been written to disk. But you get the highest performance if you allow the opposite combination: steal with no-force. It's just a matter of terminology; there is no new concept on this slide. Now, what the write-ahead log does is allow us to implement a steal, no-force policy through this concept of a log, the write-ahead log. And the idea is that we will use a force policy only for the log, which is much more efficient than forcing all the transactions' data to disk when they commit. Okay, I'm not exactly sure why we are having this discussion; it's a very high-level discussion, and my plan is to go deeper into the details anyway. So why don't I move ahead and show you the details. There is an important concept here. When we discussed elements last time, we said that an element can be a page, or it can be a record, or we left it open; it can even be an entire table, we said. Now we're going to look deeper into this. If the element is a page, then this is called physical logging: we log the physical page. If the element is a record, then we call this logical logging. Think of a record as being smaller than a page; you usually fit multiple records in a page. A log might combine both, and then it's called physiological logging, and this is what ARIES does. It has both physical and logical logging, and we will discuss why. Finally, before we discuss ARIES in detail, let me remind you of the rules of the write-ahead log. These are essentially the relaxed rules of the undo and redo logs, and there are two or three of them, depending on how you count. The first rule is that before you output a page to disk, the log entry referring to that output must have already been sent to disk. Remember, we had these arrows in our table: this says that the log entry must precede the output. It's a simple idea, but it's really critical that we keep it in mind. Otherwise you can postpone: you don't have to force the log immediately, you can postpone it, but not longer than the output. The second rule says that before a transaction commits, you must force its log entries to disk. Let me represent it like this: every log entry that belongs to a transaction must be forced to disk before the transaction commits. The last rule is actually not a rule; it's more like a definition. When does a transaction actually commit? When can we dispense the money? The exact moment when the transaction commits is when the commit log entry is written to disk. That's when the transaction commits, by definition. Good. So with this in mind, let's discuss the ARIES recovery manager in detail. Let me start by saying that it is a redo/undo log, so you will see both kinds of actions, both redo and undo. But what is really surprising is that it uses physical logging for redo and logical logging for undo. So after the system crashes, during the redo phase, as we will see, it will redo entire pages; but during the undo phase it will undo individual records.
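To keep the three kinds of logging apart, here is a sketch in Python with invented record formats; ARIES's real log records carry more fields than this.

```python
from dataclasses import dataclass

@dataclass
class PhysicalLogRecord:   # logs the page itself: easy to redo blindly
    lsn: int
    page_id: int
    before_image: bytes    # old page contents
    after_image: bytes     # new page contents

@dataclass
class LogicalLogRecord:    # logs the operation on a record: can be undone selectively
    lsn: int
    txn_id: int
    operation: str         # e.g. "insert record r into table R"

# Physiological logging, which ARIES uses, combines the two: a record is
# physical with respect to the page it lives on, logical within the page.
```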
Any ideas why? It's not a simple answer. Does anyone have an idea why we do it this way? Which one do you think is easier, easier to reason about and easier to implement? The physical one. We don't have to worry about who owns those records; we don't have to worry about whether the records come from indexes, or are data records, or whatever they are. Whatever is written on the page is what we redo. Why do we do logical-level undo? Yes? Is the presumption that there is going to be more redoing than undoing? Actually, you said it the other way around: there are many more undos than redos. If the database never crashes, there are no redos at all, but there might be plenty of undos. This might give you a hint. Exactly. Whenever a transaction aborts, we need to undo it, and this is when we use the log. And now it should be obvious that we cannot do a physical undo for a single transaction, because its records are packed together, on the same page, with records written by other transactions. In order to undo a single transaction, to undo selectively, we must undo at the logical level. First great insight. And by the way, do read chapter 18; actually, not just 18, do read all three chapters in the book. They are great for advanced study. They are not great at giving you a first, detailed look at concurrency control and recovery; for that, the other book is better. But once you understand the basic concepts, these are great chapters. Let's continue with ARIES. The next concept, which is actually common to all log-based recovery managers, is the notion of a log sequence number. Very simple: every entry in the log has a number. It's like a key, and it's increasing. If you want to refer back to a log entry, you refer to it by the LSN, the log sequence number. Very simple. But here comes the part in red, which is much more interesting. Every page, be it in the buffer pool or even written to disk, every page has a page LSN, which represents the latest log entry that refers to that page. So imagine a log that records many updates to the same page: the last update is what we store in the page itself. Just something to keep in mind; at the end everything will fit together, but we need to look at all these data structures. And here are the three big data structures in ARIES. I'm going to discuss them first, and then I'm going to show you a picture with them. First, the transaction manager maintains in main memory a table of all the active transactions. Let me use blue. It's quite easy to understand: there is an entry for every transaction that is active. When a new transaction starts, we create a new entry; we remember the name, the identifier, of the transaction. And in addition, for every transaction we keep one important piece of information, which is the last log entry of that transaction. Of course, when transactions update, they generate log entries, and what we maintain in this table, for each transaction, is the latest one. Good. The second data structure is the so-called dirty page table. Of course there is the buffer pool, I don't have it on this slide, where we keep pages in memory; but in addition to that buffer we maintain a separate table where we list all the pages in the buffer that have been modified and are waiting to be sent to disk. And what we maintain there, for every dirty page, is the earliest log entry that caused that page to become dirty; if later updates continue to modify the page, we don't care about those log entries, we only keep the earliest one. And finally there is the write-ahead log itself, whose tail sits in main memory. The one interesting thing here is that the entries in the log are linked by transaction: all the log entries belonging to the same transaction are linked backwards through pointers.
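Here is a minimal sketch of these three structures in Python; all names are invented for illustration, and ARIES's real structures hold more fields than this.

```python
# Sketch only: the three in-memory structures described above.
active_txns = {}   # txn_id -> lastLSN: the latest log entry of each active transaction
dirty_pages = {}   # page_id -> recLSN: the EARLIEST log entry that dirtied the page
log = []           # the tail of the write-ahead log, kept in main memory

class Page:
    def __init__(self, page_id):
        self.page_id = page_id
        self.page_lsn = 0          # LSN of the latest log entry that touched this page

def append_log(txn_id, page_id, kind):
    """Append a log entry; entries of the same transaction link backwards."""
    lsn = 100 + len(log)           # LSNs just increase; 100 echoes the slide's numbering
    log.append({"lsn": lsn, "txn": txn_id, "page": page_id,
                "kind": kind, "prev_lsn": active_txns.get(txn_id)})
    return lsn
```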
Let's review everything we said on an example. Here is the buffer pool, right here. As you can see, there are some pages, p5, p6, p7, and every page has a page LSN, the most recent log entry that refers to that page. Let's see: page 5, for example, has page LSN 104, which is this one right here. Now look at the dirty pages table. Page 5 was modified, and therefore it appears in the dirty pages table. But what is the recovery LSN, the recLSN? Yes: the first LSN that modified that page. The first LSN that made the page dirty, and it is always less than or equal to the page LSN. Every time we update the page, the page LSN increases, but the recLSN remains the oldest one. And then there is the active transaction table, which maintains, for every transaction, its last log entry. So all these things form a kind of spaghetti structure; they point to each other, and we need to fix this picture in our minds to understand what is going on and how to recover. Okay. So let's first see what we need to do in order to maintain these data structures. There is a transaction T, and it wants to write a page called P. What do we need to do? Let me go back. So transaction T writes P. What do we need to update? The buffer pool? Let's start with the buffer pool. What do we need to do here? Sorry? The page LSN. Actually, that's not the best place to start. What do we need to do first? Write the log entry. In the log entry we need to refer to this page, and now we know what to put in the page LSN: we put the LSN of this new entry. What else do we need to do? The active transaction table. Exactly: in the active transaction table we record that this transaction has modified something, and its last LSN is now 105. What else do we need to do? If this is the first time this page becomes dirty, then we add an entry to the dirty pages table recording the fact that the page was made dirty by LSN 105; if it was already dirty, we don't add anything.
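Continuing the sketch from above, the write path in order; again, the names are invented, and a real system would also log the before and after values.

```python
def write_page(txn_id, page):
    """Transaction txn_id writes page, in the order discussed above."""
    lsn = append_log(txn_id, page.page_id, "update")  # 1. the log entry comes first
    page.page_lsn = lsn                               # 2. stamp the page with this LSN
    active_txns[txn_id] = lsn                         # 3. this is now T's last LSN
    dirty_pages.setdefault(page.page_id, lsn)         # 4. recLSN only if newly dirty
```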
Okay, second thing: suppose the buffer manager... yes? Are things removed from the dirty pages table once the transaction commits, or something? When is an entry removed from the dirty pages table? When? When the page is forced to disk. When the page is forced to disk. And who would force a page to disk? A commit? No. We don't want commits to force pages to disk; commits should only force logs to disk. Who forces the page to disk? The buffer manager. Remember the LRU policy: the buffer manager has a completely different logic. It sometimes decides, because of LRU, to take a page and write it to disk, and then it removes its entry from the dirty pages table. And this is actually my second question: the buffer manager wants to evict page P. What do we do? Let me go back; I'm going to erase this and think about evicting a page. Let me take P4. P4 is here, and now we want to write it to disk, so it will no longer be dirty. What do we need to do in this case? Write it to disk and remove the record from the dirty pages table. Remove the record, yes, but there is something much more subtle; we need to do something else. Where is this log sitting? This is the tail of the log; it sits in main memory. We don't force it to disk too aggressively. So what do we need to do now? Did you want to comment? Force all the entries that refer to the page. Yes, we need to force the log. Remember the rules: the first rule I showed you is that whenever we output a page to disk, the log entry for that page must have been sent to disk first. Now imagine this log: the tail of the log can be pretty long, and all of these are log entries that have not yet been forced to disk. You can't selectively force entries; you have to send an entire prefix of the tail to disk. But from the beginning up to what point? Yes: up to the last modification of the page. And it is right there, exactly: the page LSN. We go through the log, and everything up to 104 needs to be flushed to disk before we output the page; then we can output it. If the tail of the log is longer, we can keep the rest in memory, but up to 104, up to the page LSN, we need to flush. And I think that's all we need to do. And finally, when a transaction wants to commit, what do we have to do? What was the rule that referred to a transaction? Yes: all the entries that refer to updates of that transaction must go to disk. So again we take the tail of the log and force it to disk, up to what point? Yes: up to the last LSN of the transaction that wants to commit. So these data structures talk very nicely to each other during normal operation. Now, let me see if I got everything right. When a transaction wants to write, we update these things; when we evict, we flush the log. Yeah, I did say something wrong: the last one. When a transaction wants to commit, you can't leave anything from the log behind in main memory, because the last entry is going to be exactly the commit entry of your transaction, and that needs to go to disk as well. So in this case you really have to flush the entire tail of the log to disk. And that's okay; that's what the log is supposed to do, it's supposed to be forced. Good. So now let's look at recovery, and before we talk about recovery, I'm going to discuss checkpointing. Last week we discussed a simple recovery manager based on undo and another based on redo, and the checkpointing for the redo log there was pretty complicated. The checkpointing for the ARIES system is actually much simpler; it's incredibly simple. What checkpointing does is write into the log the entire active transaction table and the dirty pages table. It just writes them out; that's all it does. But in addition, there is a background process, or maybe the page replacement manager, that occasionally flushes some of the pages to disk, and that also impacts recovery.
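Extending the same sketch: eviction, commit, and checkpointing, each respecting the write-ahead log rules recalled above. The flushed_lsn bookkeeping is an assumption of this sketch.

```python
flushed_lsn = 0   # everything in the log up to this LSN is already on disk

def flush_log(up_to_lsn):
    global flushed_lsn                     # force the log tail up to this point
    flushed_lsn = max(flushed_lsn, up_to_lsn)

def evict_page(page):
    flush_log(page.page_lsn)               # WAL rule 1: log reaches disk before the page
    # ... the buffer manager writes the page to disk here ...
    dirty_pages.pop(page.page_id, None)    # the page is clean again

def commit(txn_id):
    lsn = append_log(txn_id, None, "commit")
    flush_log(lsn)                         # WAL rule 2: whole tail, commit entry included
    del active_txns[txn_id]                # committed the moment this entry hits disk

def checkpoint():
    append_log(None, None, "checkpoint")   # ARIES checkpointing: just log copies
    log[-1]["active_txns"] = dict(active_txns)
    log[-1]["dirty_pages"] = dict(dirty_pages)
```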
So now we can finally discuss recovery. The system crashes. We analyze the log; we find the latest checkpoint, and that checkpoint gives us the two tables, the dirty pages table and the active transactions table. From there we first do an analysis, then the redo, and then the undo. The analysis, at a high level, just figures out what happened between the last checkpoint and the time of the crash; we will look at it very closely. The redo is a standard redo; the only question is how far back we need to go. And the undo, because it is a logical undo, will turn out to be more complicated. So let's do the analysis. I'm going to skip this picture; let me start right here. The log goes from left to right, this is time, and here is where the crash happened. We read the log and we look at the most recent checkpoint. What we find there are these two tables, the dirty pages and the active transactions. This immediately tells us how far back in the log we need to go when we do the redo recovery. How far back do we need to go for the redo? The smallest LSN in the dirty pages table. The smallest recLSN in the dirty pages table, exactly, because it refers to the earliest update to any page that was still dirty in the buffer pool when the system crashed. That is how far back we need to go. And now comes something very, very fun. These were the dirty pages and the active transactions at the time of the checkpointing; we need the same two tables at the time of the crash. How can we compute these two tables at the time of the crash? It's actually a trick question; it's almost not a question. We just read the log, and every single record tells us what to do to these tables. This is called replaying the history. We read the log and we simply replay the history. What can we find in the log? Maybe we find an update. It says transaction T has modified page P, and the new value is V. Then what do we do? Well, if P was already in the dirty pages table, then we don't do anything. If P was not there, then we create a new entry in the dirty pages table and we insert this LSN as its recovery LSN. What else could we find in the log that affects these two tables? A commit, maybe: transaction T commits, and then we remove it from the active transactions table. What else can we find in the log that might be of interest? An abort, which is like a commit with the opposite outcome, the same bookkeeping. And a start: start T, and then we create an entry in the active transactions table. We replay the history, and, this is what I find amazing, we are able to reconstruct exactly what was in main memory at the time of the crash, namely these two tables, the dirty pages and the active transactions.
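A sketch of this analysis phase, assuming the log parameter holds the entries after the checkpoint, in order; the record layout matches the earlier sketches.

```python
def analysis(log, ckpt):
    """Replay history from the last checkpoint to rebuild both tables."""
    dirty = dict(ckpt["dirty_pages"])
    active = dict(ckpt["active_txns"])
    for rec in log:                        # entries after the checkpoint, in order
        kind, txn = rec["kind"], rec["txn"]
        if kind == "start":
            active[txn] = rec["lsn"]
        elif kind == "update":
            active[txn] = rec["lsn"]                   # T's new last LSN
            dirty.setdefault(rec["page"], rec["lsn"])  # only the FIRST dirtying LSN sticks
        elif kind in ("commit", "abort"):
            active.pop(txn, None)
    return dirty, active
```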
Now we are almost ready for the redo phase, but let me ask you this. When we computed the first LSN, we used the dirty pages table at the time of the checkpoint. Is that the right thing to do? Should we maybe use the dirty pages table at the time of the crash? Which of the two should we use when we compute the first LSN? Can we have, after the replay, a recovery LSN that goes even further back? No, we cannot, because all the updates we replay come later, and they can only insert into the dirty pages table entries whose recovery LSN is later. And conversely, we have no information about the pages that were forced to disk; that information is missing from our replay of the history. So we are forced to use the checkpoint's table as our starting point for the redo. So, the second phase, the redo phase: we start from that first LSN and we redo, exactly the way you know it. But ARIES is actually smarter. It tries to be very efficient, and it aggressively avoids doing redos that are not needed. Here are the rules it uses. The first rule: it sees an update record in the log, and if that page is not in the dirty pages table, it means the page has already been flushed to disk, so there is no need to redo. But suppose the page is dirty. Then there is a second rule: maybe the page was made dirty, but it was made dirty by a log entry that comes later than the one we are looking at. The update we would redo is an old one that already made it to disk; the page was dirtied again later, and that is why it is in the dirty pages table. So if the recovery LSN is larger than the LSN of the current entry: don't redo. And the third rule is really aggressive. At this point ARIES thinks this might be a redo that actually needs to be done, but before it does a write operation, which is more expensive than a read, it reads the page from disk and checks the page LSN on disk. Maybe the page was flushed after all, and the disk already has newer information; then it doesn't do the write. It essentially trades a read for a write. And if all these checks fail, then, yes, it redoes the update. What exactly is the recovery LSN? The recovery LSN comes from the dirty pages table. I don't know if you noticed my color code: I'm using exactly the same colors for these data structures, so the dirty pages table is in red. The recovery LSN is the earliest LSN of any log entry that made that page dirty. What happens is that if the page is flushed to disk, it is made clean and disappears from the dirty pages table; when it is written again, it gets reinserted into the dirty pages table with a much newer recovery LSN. And what happens in rule number two is that the updates that came before this most recent write are skipped. Does this answer your question? Maybe your question was much shorter. Yes: it was just named recLSN previously, so I missed that. Got it; your question was much shorter indeed. Recovery LSN and recLSN mean the same thing. Thanks. Good. So that's the redo phase. It's actually very easy, but it's very smart; it uses all this smartness to improve efficiency.
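A sketch of the redo phase with the three skipping rules just described; read_page_from_disk and apply_update are hypothetical helpers, not real API calls.

```python
def redo(log, dirty, first_lsn):
    """Redo phase with the three skipping rules described above."""
    for rec in log:
        if rec["kind"] != "update" or rec["lsn"] < first_lsn:
            continue
        if rec["page"] not in dirty:              # rule 1: page already flushed
            continue
        if dirty[rec["page"]] > rec["lsn"]:       # rule 2: dirtied only later
            continue
        page = read_page_from_disk(rec["page"])   # hypothetical helper
        if page.page_lsn >= rec["lsn"]:           # rule 3: disk is already newer
            continue
        apply_update(page, rec)                   # hypothetical helper: redo it
        page.page_lsn = rec["lsn"]
```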
Okay, so now the undo. Once it finishes the redo, ARIES starts the undo phase, which it also needs outside of the recovery process: whenever a transaction aborts, it needs to be able to undo the entries of that transaction. Okay, this is more difficult; let's see how best to describe it. The idea is that the undo is not replaying history. Undoing means performing some updates, real updates, and therefore they need to be written into the log. So during the undo phase, ARIES starts from the end and undoes the transactions that need to be undone, but during this process it writes into the log some special entries called CLRs, compensating log records, with the idea that if the system crashes during the undo, these are the records that need to be redone in order to redo the undo. Okay, so let's see in some more detail how this works. The first thing is to figure out which transactions have not committed. We discussed this last time; it's exactly the same concept. In the case of ARIES these transactions are very easy to collect: they are exactly those in the active transaction table. And moreover they have this funny name: they are called the loser transactions. The loser transactions are those that are uncommitted. Now, conceptually, what we need to do is this. Here is our log, and we need to visit only those entries that belong to the loser transactions. If there were a single loser transaction, it would be very easy to imagine what goes on: we read the latest log entry of that transaction, undo it, skip to the previous log entry, undo it, skip to the previous one, undo it, and so on. By the way, as we do this, we append CLRs at the end of the log, such that if we crash, we will not undo these things again, but we will redo the CLRs. Okay, but the problem is that we don't have a single transaction to chase; we have a set of transactions, the loser transactions. This is why the logic is slightly more complicated, though conceptually there is nothing new. ToUndo is the set of all transactions that we need to undo. Actually, sorry, it's not a set of transactions; it's the set of their last LSNs. At each point we choose the largest such LSN, which is this one here, the latest log entry of any uncommitted transaction, undo it, and then add its previous LSN to the ToUndo set, and again choose the latest one, the largest one. So we chase down these backward lists, but in a slightly more complicated way, in which we watch a set of transactions and always pick the one with the biggest LSN. So now let's look in a bit more detail at what we do during the undo. If the LSN we are looking at, let me use a different color, is a regular update record, then we undo it, as we expect, we write a CLR, and we link this CLR to the previous entry of this log entry, such that if the system crashes during the undo phase, then, as we redo the CLR, we know how to continue the undo from that last blue arrow. On the other hand, if the LSN is a CLR record, then we don't undo it; instead we insert its back pointer into the ToUndo set. And at some point we also need to mark the end of the transaction. Now, I wonder, yeah, I wonder whether in the other case we should also insert an end-transaction record; maybe this is something we can discuss on the next slide.
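A sketch of this undo loop; undo_update, write_clr, and write_end are hypothetical helpers, and the log indexing assumes the LSN numbering of the earlier sketches.

```python
def undo(log, active):
    """Undo all losers at once, always taking the largest LSN still to undo."""
    to_undo = set(active.values())           # the last LSN of every loser transaction
    while to_undo:
        lsn = max(to_undo)                   # latest entry of any uncommitted transaction
        to_undo.discard(lsn)
        rec = log[lsn - 100]                 # sketch: LSNs were 100, 101, ...
        if rec["kind"] == "update":
            undo_update(rec)                 # hypothetical: the logical undo action
            write_clr(rec)                   # hypothetical: CLR pointing to rec["prev_lsn"]
            nxt = rec["prev_lsn"]
        elif rec["kind"] == "CLR":
            nxt = rec["undo_next_lsn"]       # already undone: just follow its pointer
        else:
            continue
        if nxt is None:
            write_end(rec["txn"])            # hypothetical: nothing left for this loser
        else:
            to_undo.add(nxt)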
Let me try to summarize this undo phase; maybe it's easier to imagine a single transaction. So this single transaction has these entries in the log, and they are linked like this: LSN 30 points back to LSN 20, which points back to LSN 10. Now we need to undo. We undo the entry for LSN 30, and we write a compensating log record that points back to 20. It's clear where we get the 20 from: it's exactly this pointer here. Then we move to the next entry: we undo the action of log entry 20, we write a CLR, and that CLR now points back to 10. And at this point the system crashes, and we have not yet finished undoing this transaction. So then what do we do? We read back from the log again. Let me use a different color: we find this CLR, we don't undo anything for it, but instead we follow this pointer here, and now we know that this is where we have to continue our undo. So we continue, we undo LSN 10 and write its CLR, and now this one points to null; there is nothing to point to, because there is no other entry. And this is when we know that the transaction is terminated; this is when we write an end-transaction record. Every time you do a sub-undo, do you also insert the CLR as the last entry of the transaction? Every time you do an undo operation you have to insert a CLR, yes. But do you make the CLR the last entry for the active transaction? Because otherwise you will read through them again, and 20 and 30 are still going to be in the log. So why do you look at the CLR, number 50 here, instead of going to 30 again after the system recovers? That's the question. Well, when the system recovers it still sees 30 as an entry of this transaction, unless you're saying that the CLR effectively becomes the last entry of the transaction. So the question is: how do we avoid reading LSN 30 again? And the reason we can avoid it is that we restart the undo from here, from the end, and since the first entry we see is the CLR, it points us all the way back to record 10. Does this make sense? Maybe we should look again at these rules. So in that case, if we restart the undo process, the first record we see is a CLR, so we are here. Do you mean to say that the active transaction table will list the CLR record as its last LSN for the transaction? So the question is whether we introduce the CLRs into the active transaction table. It's a good question. Are you sure they are being introduced into the active transaction table? Once a CLR is written to the log, we know that its undo action has been done, that it has been written out, and therefore... Are the CLRs actually redone? That's a good question. Redone or undone? We processed the CLRs during the undo, but we only processed them in order to find the pointers to where we left off in the undo process. I thought the redo phase was reprocessing the CLRs; that's what I said, but now I don't see the place where they would be reprocessed. Let's think a little bit. Actually, the undo of an update can be complex. It can remove a record, this record that was inserted; and when you remove a record, that means complicated updates to the B-tree. Those updates, when you undo them, yes, you're right: whoever asked this question from Microsoft is correct. The undo action can consist of multiple writes, because it is a logical undo. The undo could be: you have inserted this record into the database; now undo this, remove it, take it back. Which means remove it from the index, which could mean that many, many index blocks are being restructured. And these are physical writes; they are written as actions that need to be redone if the system crashes. Okay, let's go through this again, because it's really an important point, and it makes the distinction between physical logging and logical logging. And I'm not going to start a new topic; we are going to end on this slide.
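As a worked trace of the single-transaction example above, assuming, as the discussion here tentatively concludes, that CLRs are redone but never undone:

```python
# The single-transaction example as log contents; LSNs follow the slide.
#
#   LSN 10: update by T, prev_lsn = None
#   LSN 20: update by T, prev_lsn = 10
#   LSN 30: update by T, prev_lsn = 20
#   LSN 40: CLR undoing 30, undo_next_lsn = 20
#   LSN 50: CLR undoing 20, undo_next_lsn = 10
#   ---- crash, before 10 is undone ----
#
# On restart, the redo phase replays the CLRs at 40 and 50. The undo phase
# starts from the newest loser entry, the CLR at 50, follows its
# undo_next_lsn straight to 10 (never revisiting 20 or 30), undoes it,
# writes a CLR with undo_next_lsn = None, and logs the end of transaction T.
```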
So we are undoing this transaction, which starts here, but the first undo action consists of several operations that need to be written to the database. Those operations, I'm not exactly sure whether they need to be written to the log before or after the CLR. Any suggestions? Let's try writing them before and see what happens. So, to undo this, we need to write the following things to the database, and we mark them as regular updates in the log. When we finish, we write the CLR: if you crash, continue from here. Next operation: now we need to undo number 20. Again we write to the log all the entries that correspond to this undo operation, we finish, we write the CLR, and point back here. Now comes the interesting part, where it becomes quite obvious that I wrote them in the wrong place. We start undoing number 10, so we write these entries here, but now the system crashes. During recovery, all these things will be redone. So if we failed to send to disk the updates to the index that reflected removing the record, that's okay: they will be redone during recovery, because they are right here in the log. But here we are in trouble. We don't want these to be recovered, because this is where we will continue; we will continue the undo from here. So that means those entries need to be written after the CLR. So let's go again; we still have a few minutes, so I think it's fine if we go over this. We need to undo LSN 30. We first write the undo record, to make sure... no, this doesn't work either, because now the system can crash exactly here. Okay, so how does it work then? We write these entries to the log, and then we do what? Do we have two records, one at the beginning and one at the end? No, it has a single record, I'm pretty sure, and I'm also pretty sure that there is a redo on the CLRs, so the undo actions are subject to being redone if the system crashes. So we wrote these to the log, then we write the CLR; same here; and now, when we want to undo number 10... Good question. Why don't we read the book and figure this out for next time? Okay, with ARIES it's like this: we have a lot of fun reading the initial part, and then when we get to the CLRs it's like, oh no, too much. The CLRs are really the part that makes ARIES less attractive. But by next time we'll figure it out, exactly how recovery goes if the system crashes during the undo phase. Good. So I'd like to finish here; I have one more slide on ARIES. It was celebrated in the database community in the early 90s, when it was first published by Mohan, who is a researcher at IBM Almaden, because it put together, it didn't have new ideas, but it put together many ideas that the research community had discussed about recovery, in one comprehensive package. He really thought about all the details and how they fit together, and this is what's remarkable about ARIES: all these things fit together, the whole package is efficient, and you can't remove one thing without breaking the whole. That's nice. And it is used in all modern database systems; as far as I know, all of them use it, and if they are commercial systems, they have some kind of licensing agreement with IBM. And that's all I have to say about ARIES. I think this is a good place to stop, and next time we will figure out exactly how the CLRs work; I mean, you will tell me. Any more questions? Remarks? Have a good evening, have fun with homework 3, and I'll see you next week. And don't forget, there are two books here; if you want to grab one, just bring it back in two weeks, after we finish the homework.