Now, let us move on to our next topic, which is transactions. What are transactions, conceptually? Practically, how do we tell a database that this is a transaction? What are the issues in concurrently running transactions? These are the topics we are going to see in this lecture. Hopefully, towards the end of the lecture, I will also cover a bit about concurrency control today. Tomorrow, I will wrap up concurrency control and also recovery. All of these deal with the basic concept of a transaction. What is a transaction? It is a unit of work, which may access data and may update data, but a transaction should have a few properties. First of all, if there is any failure of the system, the transaction should either be completed to its end, or whatever it did partially should be undone. So, in the face of failures, a transaction should be atomic. Now, why is this important? If a database system never fails, it is not such a big issue. But in reality, systems do fail; not often, but when they do fail, it should not be a disaster. If there is a failure at some point, you should be able to recover the system and take care of anything incomplete. Some of the early database systems did not do this. For example, FoxPro was used in IIT earlier. In FoxPro, if you had a power failure in the middle of something, it would leave the database in an inconsistent state. Worse still, it would leave indices in a state where you could not even start up the database. Somebody had to manually go and delete the index file, start the database, recreate the index and so forth. A lot of manual work was required just to come back up after a power failure. These days, with current databases, that almost never happens; it is extremely rare.
So, there is a bunch of tricks which they use to make sure that if you were running a transaction before a failure, when the system comes back up, whatever was done partially will be rolled back. That is to deal with failures of various kinds: power failures, hardware failures, software issues and so forth. The second major issue, as you all know, is concurrent execution of multiple transactions. We are going to see that in detail. So, a transaction is shown here, which transfers $50 from account A to account B. It reads A, subtracts 50, writes it back, reads B, adds 50, writes it back. This is one transaction. If there is a failure after you have removed money from A but not yet added it to B, the database is in an inconsistent state. Worse still, you have a customer who is very unhappy: the money has been deducted but not credited. If, on the other hand, A is not updated but B is updated, you have a customer who is too happy. They just got $50 free, and the bank is unhappy because that $50 appeared from nowhere. Both of these are problematic. The transaction has to complete fully or not run at all. That first requirement is atomicity. If any failure happens before the last write completes successfully, we need to undo the transaction. What do we mean by undo? We have to restore the old values of A and B. If a failure happens after the transaction completes, that is, after we have told the user, you are done, the failure should not roll back A and B. If you roll it back, the user will be unhappy. He will say, look, I did a money transfer and you told me it was okay, and now you are saying you did not do it, and I am in trouble. That is the durability requirement. Once the transaction is completed, its effect should persist. It should not ever be undone, either partly or fully, by the database system. So, we have seen atomicity and durability.
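The all-or-nothing behaviour described above can be sketched with SQLite's Python bindings, which run every group of updates inside a transaction. This is a minimal illustration, not the lecture's actual system: the `account` table, the account names and the insufficient-funds rule are all made up for the example.

```python
import sqlite3

# Hypothetical schema for illustration: one table of account balances.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 100)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically: both updates happen or neither."""
    try:
        with conn:  # begins a transaction; commits on success, rolls back on error
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            bal = conn.execute("SELECT balance FROM account WHERE name = ?",
                               (src,)).fetchone()[0]
            if bal < 0:
                raise ValueError("insufficient funds")  # forces a rollback
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

ok1 = transfer(conn, "A", "B", 50)    # succeeds: A = 50, B = 150
ok2 = transfer(conn, "A", "B", 1000)  # fails mid-way: the debit is rolled back
```

Note that the failed second transfer leaves the balances exactly as they were after the first: the partial debit of A is undone, which is precisely the atomicity requirement.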
The next problem is what is called consistency. Now, what does consistency mean? We have seen a number of integrity constraints: primary key constraints, foreign key constraints. These are all things which help ensure that the database is in a consistent state, and they guard against programmer errors. Beyond that, given a program which is correct, the program itself enforces a number of constraints. Our earlier program which transferred money actually makes sure that money is neither created nor destroyed; it is preserved. Any accounting system has this property: there has to be a double entry. Money can neither be created nor destroyed. If it comes in from the external world, it decreases from the external world. If you pay it out, it is added to the external world. So, the sum of all the entries should be preserved in any accounting system. If the transaction is coded wrong, it is going to make the system inconsistent. But as long as transactions are coded correctly, they will take the system from a consistent state to a consistent state. Now, suppose the initial state itself was wrong and, in the accounting system, the account balances do not add up correctly. The way accounting systems, double-entry systems, work, the balances should actually add up to 0. If they do not add up to 0, there is a problem. So, if the database is consistent, that is, the entries add up to 0, any transaction which is correct can run, and at the end the balances will again add up to 0. The important thing here is that while the transaction is transferring money, the balance is going to be inconsistent, but it will become consistent again when the transaction finishes. So, consistency is really a property of the transaction. But it is important from our perspective because we want to show that even if transactions run concurrently, the system will remain consistent. How do we prove this?
Basically, we are going to argue as follows. If transactions run serially, meaning one after another, what can we say? Initially, the database is consistent. Transaction 1 runs; it starts from a consistent database, and its code guarantees that if the database is consistent initially, it will be consistent at the end. So, it was consistent initially, and it will be consistent after the transaction finishes. Similarly, every transaction in the system, if coded properly, gives this guarantee: initial state consistent, final state consistent. So, if I run them serially, the database will be consistent. Now, the key insight is this: if transactions run concurrently, and we can somehow show that the result is equivalent to running them serially like this, then everything is consistent. That is where consistency comes in. But this issue of concurrent executions ensuring consistency is going to come up next in the context of isolation. To understand isolation, let us see an example where isolation fails and something goes wrong. Let us say that the same transaction we have been using so far is executing. It has subtracted 50 from A but has not yet added 50 to B. The transaction is correct. In the middle of this execution, another transaction comes in and computes the sum of the balances. In this case, we are not reading all the other balances; let us just read A plus B. We know what A plus B was initially, and after the transfer it should be the same. But if T2 reads A and B in between the two updates, what is it going to see? If it adds A and B, the sum will not be correct. What has happened is that T2 has not been isolated from T1. T2 is only reading, but even that reader is not isolated from T1. What does isolation mean in this context? If T2 runs, it should see the state before T1 executed or the state after T1 executed.
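The isolation failure above can be simulated in a few lines. This is a toy model, not database code: a Python generator stands in for T1 so we can deliberately pause it between its two updates, and a plain dictionary stands in for the database.

```python
# Shared "database" state; the names A and B follow the lecture example.
db = {"A": 100, "B": 100}

def t1_steps(db):
    """T1 as two steps: debit A, then (later) credit B."""
    db["A"] -= 50          # first update
    yield                  # pause point where T2 can sneak in
    db["B"] += 50          # second update

def t2(db):
    """T2 only reads: it computes the sum of the balances."""
    return db["A"] + db["B"]

t1 = t1_steps(db)
next(t1)                   # T1 has debited A but not yet credited B
mid_sum = t2(db)           # T2 runs in the middle: sees 150 instead of 200
try:
    next(t1)               # T1 finishes its second update
except StopIteration:
    pass
final_sum = t2(db)         # after T1 completes, the sum is 200 again
```

Run before T1 or after T1, T2 would see 200 either way; only the interleaved run exposes the inconsistent intermediate state of 150.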
Now, if T2 ran before T1, the sum of the accounts would be whatever it was initially. If it ran after T1, in this case the sum would also be the same. What T2 is seeing in either case is a consistent state of the database. If it were an accounting system, the sum of all balances would be 0. But if it runs in the middle like this, it has not been isolated. So, isolation is something you have now seen by example. Defining it is a little more tricky; we will come to that. But one condition which ensures isolation is to run transactions serially. If you run them one after another, there is no way a transaction can see an intermediate state of another one. That is very easy, so why don't databases just do that always? Why even bother running things concurrently? Think of a queue. There is some guy ahead of you in the queue in a shop who has purchased 1000 items. You are twiddling your thumbs waiting for those 1000 items to be billed, so you wait for a very long time, and you are going to be unhappy. What you would like is to somehow run concurrently with that guy: you have one more person generating bills, so you are not stuck behind the guy buying 1000 items. You have 5 items, you get billed, you leave quickly. So, that is one reason: you do not want to wait. If you have to wait like this, the average waiting time shoots up and people are unhappy. There are other reasons which have to do with efficiency of using the system; I will come to that. The bottom line is that running things concurrently is actually very important for multiple reasons. But if you do so, you have to be very careful. You can have a very conservative person who says, I will not risk anything, but then that person has to spend a lot more money on hardware and so forth, and gets poor performance by running things serially.
On the other hand, you have the clever person who says, let me figure out how to run things concurrently, and it turns out it is not easy to do things correctly when you run them concurrently. But luckily for us, there have been a lot of very clever people who have figured out how to do this right, and databases do very well with respect to making sure things run correctly. Well, there is a footnote to that; I am going to come to it in the concurrency chapter. So, moving on: a transaction is a unit of program execution, and the database must guarantee the following properties for a transaction. Atomicity: either everything the transaction did is reflected in the database state, or nothing is. Consistency we have already seen. Isolation is where it is hard to give a definition. I said serial execution guarantees isolation. The general definition of isolation is from Jim Gray and Reuter's book, Transaction Processing, which is considered the Bible of transaction processing. Incidentally, Jim Gray won the Turing Award for his work on transaction processing. He also, unfortunately, vanished mysteriously when his boat was lost sailing in the Pacific. He was an amazing person, the world's leading expert on transactions, and his definition is as follows. For every pair of transactions Ti and Tj which are executing, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished. So, if you take any pair of transactions, whatever one sees about what the other did, it should appear that either one ran first and then the other, or the other way around. For every pair, this property holds. It turns out, for those of you who are familiar with the notion of serializability, that this property is not serializability. It is weaker than serializability. If you do not know what serializability is, I will come to it.
But most of you probably know this. Insisting on isolation equal to serializability turns out to be too strong. There are weaker notions, and that is the reason Jim Gray recommends this as the minimum level: if you do worse than this, you definitely have a problem. So, this is the minimum requirement for isolation, although it does not directly guarantee serializability. Finally, durability: after a transaction completes, the changes it has made to the database persist, even if there are system failures. Now, a transaction which is running is initially active. It does its updates, and then it says, I am done, I want to commit. At that point, we say it is partially committed. In the normal situation, the database will then do some work and say, I have done enough to ensure that all your updates are reflected in the database; now you are committed. Or the database may say, well, I tried to do the work to complete you, but there is some problem; therefore, you have failed, I cannot commit you. Or, while the transaction is running, the transaction itself may decide that something is wrong and it cannot go ahead and commit. At that point it is failed, but failed does not mean the database is done with it yet. If a transaction fails, the database has to abort the transaction. What does that mean? Aborting a transaction, or rolling it back, means whatever changes it made have to be undone. The database has to make sure a failed transaction is aborted. After it is aborted, the database can make a choice: it can retry in some cases, or it can give up and tell the programmer, whoever ran the query, sorry, I could not run your query. Then it is the job of the application program or the data analyst or whoever submitted the query to decide what to do next. The application program may go back and tell the user, sorry, the transaction aborted, retry if you want.
Then the user resubmits the query and it is tried again. Maybe this time it will succeed, or maybe it will fail again. So, this diagram shows the possible transaction states. Initially, the transaction is active. At some point, it is done with whatever it wants to do and it is partially committed. In the normal situation, at that point the database writes out its updates and it is committed. If something goes wrong after that point, it can fail; or, while it is still active, the transaction may decide it cannot continue, so it is failed, and then, after undoing everything, it is aborted. So, coming back to concurrent execution, there are two reasons for it. The one I told you about was reduced average response time, so that people do not sit waiting behind long transactions. The other is improved processor and disk utilization. What is this? If you think of a computer system, it has multiple resources. It has a CPU; in fact, today more or less all computers have multiple CPUs with multiple cores. Then it has at least one storage device, a disk, and usually many storage devices. If a single transaction is running on the system at a time, its logic is sequential. It may be using the CPU now; then it wants to do an IO, so it is using the disk. While it is running on the CPU, the disks are idle. While it is waiting for IO from one disk, the CPU and all the other disks are idle. This is ridiculous. You have so many resources and all of them are idling most of the time. That is very poor utilization. Can't you use this wasted time to run another transaction concurrently? That is the second reason concurrent execution is very important. This is particularly important because if you have data on a disk, the IO takes time, say 10 milliseconds. In those 10 milliseconds, your CPU could have done a huge amount of work for another transaction. Then, when the IO is done, the transaction runs on the CPU for some time.
Then there is another IO. While one transaction is doing IO on one disk, another can be running on the CPU, and a third can be doing IO on another disk. So, with two disks and a CPU, you may get triple the performance with concurrency compared to without. The problem is that when things run concurrently, they can go badly wrong. If you simply let things run concurrently, what can go wrong? You can easily create examples of what can go wrong. The question is: how can you show that a particular system ensures that things will go right, whatever you do? To understand this, we need some abstractions of what is going on in a system. The most important abstraction is the notion of a schedule. What is a schedule? A schedule is a sequence of instructions that specifies the chronological order in which the instructions of concurrent transactions are executed. A schedule assumes that one thing runs at a time. In reality, today with multiple CPUs, things can actually run at exactly the same time, in parallel. It turns out that is not a big problem theoretically. If two operations access nothing in common, the order in which they ran does not matter; we may not know the order because they ran concurrently, but since they did not access the same thing, their order is irrelevant. If they did access the same thing, one of them did the access first and then the other, say one wrote it and the other read it; so, depending on which happened first, you know their order. So, a schedule is going to have a strict ordering of the instructions of the different transactions. A schedule for a single transaction is just the sequence of operations it performed; for a set of transactions, it is all the instructions across all the transactions. And the order in which instructions appear within an individual transaction must be preserved in the concurrent schedule.
Obviously, within a transaction it is the same order in which things are done. Now, a transaction which successfully completes will have a commit instruction as the last thing in the schedule. We are often going to omit the commit instruction, under the assumption that the commit happened immediately after the last update and succeeded; that is the default when we omit the commit instruction. And if the transaction fails, it has an abort instruction as its last statement. So, here is a schedule for the same transaction we saw before, which transferred $50 from A to B. But this time there is another transaction as well, which takes 10 percent of the amount in A and transfers it to B. What are these two transactions? The first one is here on the left: read A, subtract 50 from A, write it; similarly, add 50 to B; commit. The second one reads A, computes 10 percent of A, subtracts it from A, adds it to B and commits. Here they are serial: T1 completed before any instruction of T2 executed. That is a serial schedule. Here is another serial schedule where the order is flipped: T2 ran first, then T1. Is the final result going to be the same in the two cases? The answer is no. In the first case, the 10 percent was computed after transferring 50 from A; in the second case, the 10 percent is computed before removing 50 from A. The final result is different in the two cases. Now, the key point is that the database system does not care in which order you ran the two transactions; either is fine for the database. But if things are running concurrently, the result should be equivalent to one of the two. The results of the two orders, T1 then T2, or T2 then T1, are different. But if they ran concurrently, the final result should be equivalent to one of these two. We do not know which one, but it should be equivalent to one of them. That is an important principle. So, here is a schedule where the instructions are interleaved; they are concurrent. What is happening here?
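The point that the two serial orders give different, yet both acceptable, results can be checked directly. A small sketch, modelling each transaction as a pure function on the pair of balances (the starting balances of 100 are an assumption for illustration):

```python
def t1(a, b):
    """T1: transfer $50 from A to B."""
    return a - 50, b + 50

def t2(a, b):
    """T2: transfer 10 percent of A from A to B."""
    tmp = a * 0.10
    return a - tmp, b + tmp

a0, b0 = 100.0, 100.0
a1, b1 = t2(*t1(a0, b0))   # serial order T1 then T2
a2, b2 = t1(*t2(a0, b0))   # serial order T2 then T1
# The sum A + B is preserved in both orders, but the final states differ:
# T1 then T2 gives (45, 155); T2 then T1 gives (40, 160).
```

Either final state is a legal outcome; what a concurrent execution must guarantee is that its result matches one of these two, not some third state.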
T1 reads A, subtracts 50 from A, writes it. T2 reads A, computes 10 percent, subtracts it, writes it. Then T1 reads B, adds 50, writes it. Then T2 reads B, adds its 10 percent, writes it. What is the final result? It turns out the final result here is going to be exactly the same as if T1 ran and then T2 ran. How do we know this? Look at these instructions here. One group is reading and writing B; the other is reading and writing A. Now, what you do to B has no impact on A, and vice versa. So, the idea is that we can actually flip these things: we can do one first and then the other. We can move one instruction up and the other down, because they do not conflict with each other, meaning they access different items. We can shift them, and it will not affect anything: it will not affect the execution of the transactions, and it will not affect the final result. That is why we know these two schedules are equivalent. That is one way of understanding the equivalence. Another way of understanding the equivalence is to look at the result after executing this schedule. You can see that the result would be identical to a schedule where T1 ran first, then T2. The final result is the same, and if the transactions printed out values in between, they would see the same values. It would be identical. That is another way of thinking about equivalence, which is almost the same, but slightly different. Now, here is another schedule, which is actually problematic. What is happening here is that T1 reads A and subtracts 50 from A, but has not written it back yet. Now T2 comes in. It reads the old value of A, remember, computes 10 percent, and updates A. It also reads B. Look here: T2's last instruction in the first part is read B, so it has read the old value of B. Then T1 writes A, reads B, adds 50 to B and writes B. So, let us say B was 100. T1 has subtracted 50 from A, but T2 read the old value of A.
So, let us say A and B were both 100. What is T1 going to do? It is going to write A as 50. Before T1 wrote A as 50, T2 took $10 away from A and wrote A as 90, but T1 has clobbered that and set it to 50. What about B? T1 added 50 to B, making it 150, and wrote it. What does T2 do? It read B earlier, so it read 100; it adds 10 to that and writes 110. So, what is the final state? The final state after both of these is that A has been set to 50 and B has been set to 110. T2's deduction from A has been overwritten, and T1's addition to B has been lost: $40 has simply vanished. And depending on exactly how things interleave, you can get other intermediate states which are messed up. For example, if the read of B came a little further down, you could have other situations. So, this is the kind of problem which can happen with a concurrent schedule. How do you say that a particular schedule is safe? The notion of serializability was a huge step forward here. If you look at what happened before that, there were certainly some very smart people who understood this intuitively, but average programmers would have fumbled through it. They would have said, well, this causes problems, that causes problems, let me do this; I think it works because I can show that this particular problem does not occur. But just because you showed that one problem does not occur does not mean a different problem cannot occur because of concurrent execution. To be absolutely sure that a concurrent execution is safe, the only way is to show that it is equivalent to a serial execution. That is the idea of serializability. We know that serial execution is safe; it ensures consistency. So, we are going to say that a possibly concurrent schedule is serializable if we can show that it is equivalent to some serial schedule. Now, what do we mean by equivalent? There are different notions.
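The bad interleaving can be replayed step by step, following the lecture's description, with local variables standing in for each transaction's private copies of the data. The variable names are my own; only the read/write order comes from the example.

```python
# Replaying the problematic interleaving (A = B = 100 initially).
A, B = 100, 100

a1 = A            # T1: read(A)
a1 -= 50          # T1: A := A - 50 (not yet written back)
a2 = A            # T2: read(A) -- still sees the old value, 100
tmp = a2 * 0.10   # T2: temp := 10 percent of A, i.e. 10
a2 -= tmp         # T2: A := A - temp
A = a2            # T2: write(A) -> 90
b2 = B            # T2: read(B) -- old value, 100
A = a1            # T1: write(A) -> 50, clobbering T2's write of 90
b1 = B            # T1: read(B)
b1 += 50          # T1: B := B + 50
B = b1            # T1: write(B) -> 150
b2 += tmp         # T2: B := B + temp
B = b2            # T2: write(B) -> 110, clobbering T1's write of 150
# Final state: A = 50, B = 110. The sum is 160, so $40 has vanished.
```

Neither serial order produces this state, which is exactly why the schedule is unsafe.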
One is conflict serializability, and we are going to see that. Another is view serializability, which we are going to skip for lack of time; it is there in the book. So, we are going to make one more simplification. We know that transactions execute as a bunch of code. It is too much for the database to analyze what code is running in the transaction; it cannot. All the database knows is what the transaction does when it interacts with the database. How does the transaction interact with the database? Well, it runs SQL queries, but for our purposes we are going to assume an even lower level of interaction, with reads and writes. Even an SQL query can be broken up into a number of reads and a number of writes. Conceptually, as far as serializability is concerned, what matters is the lowest-level reads and writes that happen. Below the SQL query itself, the SQL execution engine eventually reads and writes data tuples. So, let us look at this level below SQL. SQL is something the database can see; above SQL is the application program. But we look at a lower level of the database, to which all of this is a black box, and all it knows is that there are read and write instructions to the database. This is the level at which we are going to ensure serializability, and this is all we know about what is going on. So, a schedule now consists of reads and writes. Between one read or write and another, we have no idea what happens. How did the transaction compute the value which it wrote? We have no idea; it is all a black box above. With just this much knowledge, we have to ensure serializability, and we can. The good news is that even though we do not know what else happens, our notion of serializability is very powerful and works just fine, and the idea is based on conflicting instructions. The only instructions are reads and writes of a specified data item. Two instructions can conflict only if, first of all, both are on the same data item.
If they are on two different data items, the order in which they execute does not matter, so there is no conflict. But if both are on the same data item and both are reads, how does it matter which one read first? The result is the same. What matters is that at least one of them is a write. So, take the case where a write was followed by a read. If you do them out of order, what is going to happen? If you do the read first and then the write, the read will get the old value. So, a read and a write on the same item conflict; you cannot swap them. Similarly, if the read happened first and then the write, and you swap them, the value seen by the read changes. So, regardless of which happened first, you cannot move them around. Now, what about a write and a write? As far as each transaction is concerned, it did its write and went away. The problem is that if one write was followed by the other, the final state is based on the second one; if the writes are flipped, the final state is based on the other write. The order in which they occur affects the final state of the database. So, a write and a write also conflict; in other words, you cannot swap them. Intuitively, a conflict between two instructions forces a temporal order between them. If they happened in a particular order and you flip the order, the result is going to change, so we cannot swap them. But if two instructions in a schedule do not conflict, we can flip them, and the modified schedule is completely equivalent; it has no impact on the results of what happened. This is the key observation. So, we take a schedule and swap instructions. What do we mean by swap? Say this operation is in T1 and that one is in T2. Swapping does not mean exchanging them between transactions; it means doing them in the opposite order. The one operation is still in T1, the other is still in T2.
But now, whereas this one was done first and that one second, after the swap the other one is done first. That is swapping. So, if a schedule S can be transformed into another schedule by a series of swaps of non-conflicting instructions, then the two schedules are conflict equivalent. And we say that a schedule is conflict serializable if it is conflict equivalent to some serial schedule. This is the key idea. If a schedule is conflict serializable, we know that the result is consistent, because it is entirely equivalent to some serial schedule, and all serial schedules are fine: they take the database from a consistent state to a consistent state. This may seem simple, but it is actually very fundamental. This is the basic premise on which a database can say: look, we are running your transactions concurrently, but don't worry, we are taking care of things, we will guarantee conflict serializability. Well, at least that is what we think databases guarantee. It turns out that many databases lie, and people are living with this. Luckily, they do not get into trouble too often, but occasionally they do, and we will come back and see what these lies are. But meanwhile, let us see an example. Here is the schedule with reads and writes only. The first transaction did read A, write A, read B, write B; the second did the same thing. If you go back to our previous example, where one transferred 50 and the other transferred 10 percent, it turns out that at the level of reads and writes they are identical. This one read A and wrote something to A; I don't know what. That one also read A and wrote something to A; we don't know what it did in between. Similarly for B. But using conflict serializability, we can recognize that this read B and this write A don't conflict, so we can swap their order. Similarly, this read B and this read A don't conflict anyway, so we can swap further. In other words, we can move the read B up without any conflict.
Similarly, write B doesn't conflict with write A or with read A, so it moves up. What we have got, by a series of swaps, is that this schedule was transformed into that schedule. And because the swaps were of non-conflicting instructions, the two are going to give the same result; they are equivalent. And the second one is serial. So, what we have just proved is that this particular schedule is serializable. Of course, there are schedules which are not serializable. Take this example: T3 reads Q, T4 then writes Q, then T3 writes Q. Why is this not serializable? Suppose T4 ran first; then T3 would have read a different value, and the final value would be different. If T4 ran entirely after T3, the final value would again be different. So, it is clear that the two orders, T3 then T4 or T4 then T3, both result in a different final state from this schedule. In terms of conflicts: the read and the write of Q conflict, so I cannot move the write up; the write and the write conflict, so I cannot move it down. In other words, there is no way I can swap instructions to get to a serial schedule. Therefore, this schedule is not conflict serializable. Either way you look at it, it comes to the same thing. Very short quiz break. Centres, please set up the quiz software and make sure it is up. While the quiz is being set up, let me explain the question. I have a small sample schedule here: T1 reads A, then T2 writes B, then T1 writes A. Same two transactions, T1 and T2. The question is: is this schedule conflict serializable, is it not conflict serializable, is it actually serial, or none of the above? Just hang on a second. Make sure your ST buttons have been pressed; we are starting the quiz. Now press a button to answer the question; choose an option. We are almost out of time. This is actually a very simple question, but there is a slight trick, because some of you may have failed to notice that A and B are different items.
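The swap-based argument above has an equivalent mechanical test: build the precedence graph (an edge from Ti to Tj for each conflicting pair where Ti's operation comes first) and check it for cycles; the schedule is conflict serializable exactly when the graph is acyclic. A sketch, with my own encoding of operations as (transaction, action, item) tuples:

```python
from itertools import combinations

def conflicts(op1, op2):
    """Two ops conflict if they are in different transactions, touch the
    same item, and at least one of them is a write."""
    (t1, a1, x1), (t2, a2, x2) = op1, op2
    return t1 != t2 and x1 == x2 and ("W" in (a1, a2))

def is_conflict_serializable(schedule):
    """Build the precedence graph and test it for cycles by repeatedly
    removing nodes that have no incoming edge (Kahn-style)."""
    edges = set()
    for i, j in combinations(range(len(schedule)), 2):
        if conflicts(schedule[i], schedule[j]):
            edges.add((schedule[i][0], schedule[j][0]))  # earlier -> later
    nodes = {op[0] for op in schedule}
    while nodes:
        sources = {n for n in nodes if not any(v == n for _, v in edges)}
        if not sources:
            return False  # every remaining node has an incoming edge: a cycle
        nodes -= sources
        edges = {(u, v) for u, v in edges if u in nodes and v in nodes}
    return True

# The lecture's serializable example: both transactions read/write A then B.
good = [("T1","R","A"), ("T1","W","A"), ("T2","R","A"), ("T2","W","A"),
        ("T1","R","B"), ("T1","W","B"), ("T2","R","B"), ("T2","W","B")]
# The non-serializable example: r3(Q), w4(Q), w3(Q) forms a T3<->T4 cycle.
bad  = [("T3","R","Q"), ("T4","W","Q"), ("T3","W","Q")]
# The quiz schedule: r1(A), w2(B), w1(A) -- no conflicts at all.
quiz = [("T1","R","A"), ("T2","W","B"), ("T1","W","A")]
```

On these inputs, `good` and `quiz` come out serializable and `bad` does not, matching the swap arguments in the lecture.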
If you did notice it, you should have got the correct answer: since A and B are different items, we could take the write B and flip it with the read A, letting T2 run first. That gives a serial schedule equivalent to the given one, because there is no conflict; the operations are on different items. For that matter, we can take the write A and the write B and swap those, because they are also on two different items. So, we have two serial schedules: one is T1 first, then T2; the other is T2 first, then T1. Both of them are equivalent to the given schedule, and even one equivalent serial schedule is enough. So, the schedule is conflict serializable. It is not serial, because the instructions are interleaved. So, the answer is option one: it is conflict serializable. Let's see the results of the quiz. This time most centres have managed to upload, but only 98 people have answered. This is interesting; most of your answers have not been received, and I am not quite sure why. The top option is the correct one, but a close second is "it is not conflict serializable". That is wrong; what you probably missed is the fact that one operation is on A and one is on B. A few people said it is serial. That is obviously wrong: serial means one transaction completed everything before the next started, and this schedule is not serial. Let's move on. There are a few more properties that a schedule should satisfy, and this one, recoverability, is a very important property. What is that? Take this schedule here, with T8 and T9. T8 is reading A and writing A; it has just written A. Now T9 comes in and reads A. What value of A does it get? It is going to get the value which T8 wrote. It reads it and it commits. Maybe it displays the value to the user. But T8 is not done yet; it is doing some more work, maybe reading B, whatever.
At some point, T8 may run into trouble and have to be rolled back, or the database system may crash after this commit, in which case T8 also has to be rolled back. But if T8 is rolled back, the old value of A is restored. Now, you have a very interesting situation. You have T9, which has read and displayed a value which, according to our atomicity, never existed. That is a big problem. You say the transaction never executed, but you allowed T9 to see its result. So, this is a schedule which is not recoverable. The fact that you allowed T9 to commit means it is too late; you have a problem. You have allowed it to see uncommitted data and commit before the writing transaction, T8, commits. That is a very, very bad idea. So, a recoverable schedule is one where this cannot occur. How do you ensure it does not occur? Well, the minimum you have to do is prevent T9 from committing until T8 commits. So, if T9 read a value which T8 wrote, T9 is not allowed to commit until T8 commits; and if T8 aborts, then T9 also has to abort. A stronger condition which many databases impose is that when T9 tries to read the value here, it will actually be prevented from going ahead until T8 commits. That guarantees recoverability: if T9 cannot even read the value till T8 commits, then it is safe. But recoverability does not insist on this. All it says is that T9 can commit only after T8 commits, and if T8 aborts, so must T9. The problem here is that if T8 aborts, T9 also has to abort, and in general you can have a whole series of aborts. Here is an example. T10 wrote A. T11 read the value which T10 wrote and wrote A back. T12 read A. So, now if T10 aborts, so must T11; and if T11 aborts, so must T12. What you have is a cascading rollback. This is also bad. So, what you would want is a schedule which is cascadeless: there cannot be any cascading rollback. 
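The recoverability condition can be stated as a simple mechanical check over a schedule that includes commit operations. A sketch in Python — the representation, (transaction, operation, item) triples with 'C' marking a commit, is made up here for illustration:

```python
from collections import defaultdict

def recoverable(schedule):
    """schedule: list of (txn, op, item) triples, op in {'R', 'W', 'C'}
    (item is ignored for 'C').  Recoverable means: if Tj read an item
    last written by a not-yet-committed Ti, Tj may not commit before Ti."""
    last_writer = {}               # item -> transaction that wrote it last
    committed = set()
    reads_from = defaultdict(set)  # Tj -> uncommitted writers Tj read from
    for t, op, x in schedule:
        if op == 'W':
            last_writer[x] = t
        elif op == 'R':
            w = last_writer.get(x)
            if w is not None and w != t and w not in committed:
                reads_from[t].add(w)
        elif op == 'C':
            if reads_from[t] - committed:
                return False       # t commits ahead of someone it read from
            committed.add(t)
    return True

# The T8/T9 example: T9 reads T8's write and commits while T8 is still running.
print(recoverable([("T8", "R", "A"), ("T8", "W", "A"),
                   ("T9", "R", "A"), ("T9", "C", None)]))  # False
```

If T8's commit is moved ahead of T9's, the same function reports True, which is exactly the minimum condition described above.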
And the way to ensure a schedule is cascadeless is to prevent a transaction from reading an item until that value has been committed. So, you cannot read an uncommitted value. That is what this condition says: the commit operation of the transaction Ti which wrote the item appears before the read operation. Practically speaking, what will happen is the database will prevent this transaction from proceeding with the read. It says: wait, let the writer commit; after it commits, I will give you the value. So, every cascadeless schedule is recoverable, and this is what databases normally enforce. Now, another quick quiz to make sure everybody is in sync. This time there is no trick with A and B, but look at the schedule. T1 reads A. T2 then writes A. Then T1 writes B, and then T2 reads B. The question is: is this schedule conflict serializable, is it not, is it serial, or none of the above? Same options as before, but the schedule is different. There is read(A) of T1, followed by write(A) of T2, followed by write(B) of T1, followed by read(B) of T2. So, go ahead and choose your answer. Okay, the timer has run out now. While we wait for the results, let me explain the solution. How do you know it is conflict serializable? If you can swap operations to make it serial. Can you? Let us look at the middle two operations here: T2 has written A, T1 has written B. Can you swap these two? The answer is yes; they are on two different items, so they are not conflicting. What we land up with is the serial schedule T1 followed by T2. So, very clearly this schedule is conflict serializable. Option 2 is wrong. As before, it is not serial; the operations are interleaved. So, the choice is option 1. Let us see how many people responded. A little better in terms of centres, but the number of responses from people is dropping. Apparently 150 clickers are not alive, and of those which are alive, half the people have not responded. 
Again, if you are not sure of the response, try guessing. It is okay; it gives me an idea of whether, on average, people have understood or not. So, anyway, coming back here, it appears that half the people did not respond because you were not sure. But of those who responded, unfortunately, the majority choice was wrong. The majority said option 2, it is not conflict serializable. Again, these are reads and writes to different data items. You have to note which data item is being written or read. This is similar to the previous one. It is not a trick; this is a basic thing. So, the answer is option 1, as we said, and very few people actually got it. What this means is that, on average, most of you need to go back and read this material again if you are covering it in your syllabus. Talking of syllabus, by the way, we went over the syllabi of many universities before we ran this course to make sure we were in sync, and it turned out they were fairly in sync. If you look about 10 years ago, most universities did not have any internals in their database course syllabus. They basically just covered SQL, ER diagrams, normalization; that is about it. But starting about 10 years or so back, more and more universities started adding database internals, including transactions and so on. So, it may be that your particular university doesn't have it currently, but the general trend is towards adding it. We also see this in the MTech students who come here; they are a nice sample of BTech syllabi from earlier on. When we first ran our MTech database course, we had to start from the basics of internals. These days, most of our students coming in already know the basics. Whatever I am talking about now, they have done in their BTech. So, our course is actually now a second-level course, starting from more complicated topics. What this means is, even if you don't cover it currently, please pay attention and learn this, because it will appear sooner or later in your syllabus. 
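Before moving on: the cascadeless condition from a moment ago is even simpler to check mechanically than recoverability, because it forbids the dirty read itself rather than constraining commit order. A small Python sketch, assuming (purely for illustration) a schedule given as (transaction, operation, item) triples with 'C' marking a commit:

```python
def cascadeless(schedule):
    """schedule: list of (txn, op, item) triples, op in {'R', 'W', 'C'}.
    Cascadeless means no dirty reads: a transaction may read an item
    only if the item's most recent writer has already committed."""
    last_writer, committed = {}, set()
    for t, op, x in schedule:
        if op == 'W':
            last_writer[x] = t
        elif op == 'R':
            w = last_writer.get(x)
            if w is not None and w != t and w not in committed:
                return False  # dirty read -> a rollback could cascade
        elif op == 'C':
            committed.add(t)
    return True

# The T10/T11/T12 chain: T11 reads T10's uncommitted write, so an abort
# of T10 would cascade to T11 and then to T12.
print(cascadeless([("T10", "W", "A"), ("T11", "R", "A"),
                   ("T11", "W", "A"), ("T12", "R", "A")]))  # False
```

Because the check rejects the read itself, any schedule it accepts is automatically recoverable as well, which mirrors the containment stated in the lecture.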
So, what we have seen is that some schedules are good and some are bad. What's the use of finding out whether a schedule is good or bad? What you want is a mechanism that will ensure that bad schedules never occur, that only good schedules occur. That's the job of the concurrency control mechanism of a database system. So, pretty much all database systems have a concurrency control manager whose job it is to ensure that schedules will be serializable. Well, they almost do that. We will see why I use the word almost. The word almost is because many databases provide, optionally (or sometimes they force it on you), a weak level of consistency. What does this mean? First of all, let us see why a weak level of consistency may be fine in some applications. Take an application which wants to find out how much money in deposits the bank has. This number is actually changing second by second. I put money in the bank, the total changes; you withdraw money, it changes. So, it's hard to define how much money the bank has at this exact moment. But for many purposes, to get a rough idea of how big the bank is, it's fine to have an approximate result. Another example is statistics for query optimization. If the statistics say that there are 95 distinct values versus 99 distinct values, it probably doesn't have a huge impact on the plan chosen. So, some approximation is okay. What this means is that if, as a result of doing something non-serializable, you get an answer which is slightly off, it's acceptable. But the first question is: why even do this? Why not stick with serializability? And the answer is that it may affect performance, because if you make things serializable, then concurrency goes down. Certain sequences of operations are prevented, so transactions have to wait, and as a result performance may go down. 
If you have a transaction which is reading all the balances in the bank and you want it to be exactly consistent, then nobody will be allowed to update a balance while it runs. This may not be acceptable. The bank may find that all its customers are waiting, unable to get their updates processed, because one transaction is just reading the balances. This used to happen at one time: many databases would suddenly appear to hang because somebody ran a large query like this which read all the values. That is actually a terrible situation, because no processing can occur. So, many people are willing to live with a weaker level of consistency. In fact, this is formalized in SQL by having a level called serializable, which is supposedly the default. It is the default in the standard; it turns out no database implementation actually keeps it as the default. What are the weaker levels? One weaker level is repeatable read, which basically says the following. You can only read committed records; that part is actually the same as the next level, read committed. But in addition, if you read the same record again, several times during the transaction, you will get the same value. That is what is repeatable. What it does not guarantee is this: if you read the records in a relation now, and after some time you check what is in the relation again, you may suddenly see a few new records appearing (so-called phantoms). That may be possible here, which would not be possible in a truly serializable execution. Then there is read committed, which is even weaker. It says only committed records can be read, but if you read the same record again, it may have changed in between. And finally, read uncommitted, which says anything goes: you read, you get whatever is there currently. So, these are the levels which SQL defines. 
It turns out there is one more level which is used widely in practice, called snapshot isolation, which we will see later. In fact, things are even worse. The default is read committed on most databases. But if you set the transaction isolation level to serializable on two of these databases, Oracle and PostgreSQL, they will say: fine, we will run in serializable mode. It turns out they don't actually run in serializable mode; they run in a different mode called snapshot isolation, which is not actually serializable. This comes as a big surprise to many people, because they take a database course, they learn what serializability means, and they assume that is what systems provide. It turns out systems don't provide it by default, which is bad enough, but you can change the default. Worse still, even if you tell them to change it, two of the databases don't actually do it. And this can cause problems; we'll see that later. Now, the next question is: how do you define the boundaries of a transaction? How do you say that these updates form one transaction? The SQL standard has a begin atomic ... end block, which nobody implements. What people implement, though, is a way to tell the database: here is the beginning of a transaction (say, begin), then run a bunch of SQL statements, and then either rollback or commit, one of the two options. But there is another catch. On pretty much all databases, by default, if you run two SQL statements, the first one will run as one transaction and commit, and the next one will run as another transaction. This is not the same as running both as one transaction. So, in the example where you subtracted 50 from A, that update will commit immediately and make the new A visible outside, and then the add 50 to B will run as a separate transaction. That's very bad; you don't want that. How do you prevent it from happening? There are two ways. 
The most frequently used way is this: these queries are anyway running from an application, so the application can tell the database, please don't commit each statement immediately. How do you do this? In JDBC, if you have opened a connection, you can say connection.setAutoCommit(false). There are equivalent ways of doing it in SQL, but this is what is commonly used. Autocommit means each SQL statement is committed automatically; by setting it to false, what you are telling the database is: don't commit an SQL statement as soon as it is executed, wait till I tell you to commit or rollback. And in JDBC, how do you say commit or rollback? You can say connection.commit() or connection.rollback(). So that's how you turn off autocommit, run a sequence of SQL statements, and then commit or rollback at the connection level. So that's it for this chapter.
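The same pattern — don't commit per statement, run the statements, then commit or rollback as one unit — can be sketched with Python's built-in sqlite3 module, whose commit() and rollback() play the role of the JDBC calls just described (the account table and the $50 amounts are only the lecture's running example, not a real schema; by default sqlite3 opens an implicit transaction at the first update and waits for an explicit commit, which corresponds to setAutoCommit(false)):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('A', 100), ('B', 100)")
conn.commit()

try:
    # Both updates run inside one transaction...
    conn.execute("UPDATE account SET balance = balance - 50 WHERE name = 'A'")
    conn.execute("UPDATE account SET balance = balance + 50 WHERE name = 'B'")
    conn.commit()        # ...and become visible together, atomically
except Exception:
    conn.rollback()      # on any failure, both updates are undone

# A half-done transfer can also be undone explicitly:
conn.execute("UPDATE account SET balance = balance - 50 WHERE name = 'A'")
conn.rollback()          # the lone deduction disappears

balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)          # A ends at 50, B at 150: one committed transfer only
```

The key point is that neither customer ever observes the state where A has been debited but B has not: the pair of updates is either committed together or rolled back together.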