So, let us move on to questions on any topic we have covered so far, not necessarily things from yesterday. We have Srinath Ji. We are not getting audio from you, can you please check your mic? Yes sir, my question is, what are combiners in Hadoop, and when should combiners be used?

If you remember, yesterday we talked about multiple levels of reduce. What happens is, when you run a mapper on a particular machine, it processes lots of records, potentially millions or tens of millions of records per machine. Now, if you take a typical reduce scenario, it is going to do some kind of aggregation across all these records, and you have a lot of records locally which you can already aggregate. Suppose we were doing word count. Obviously, the same word will occur across many documents being processed at one particular site, and if you go across the many documents spread across multiple machines, the word will recur again. So, the idea is that you can do partial aggregation locally before you ship data across the network, and that will reduce the amount of network traffic. If you recall, I mentioned that the bottleneck in many of these computations today is the network, because the speed of disk has been increasing quite a lot, and the speed of flash storage has increased even more, but the speed of the network is still typically 1 gigabit per second, or 10 gigabit if you pay more money. So, you want to reduce the amount of network traffic, and you can do that with local aggregation. The job of the combiner is to do this local aggregation.

So, you have an option in Hadoop to turn a combiner on or not. If you set a combiner, it will be run locally on whichever machine is running the map: it will do a local sort and then a local reduce on whatever data is available locally, and the reduce function used there is not the original reduce function but the combiner function. In fact, Hadoop takes a shortcut: if you do not specify a separate combiner function but you ask for a combiner to be run, then the same reduce function will also be run in place of the combiner function. So, that is basically what happens. You do local aggregation, ship less data across the network, and then do aggregation again. You can count locally and then add up the counts later on.

Bharati Vidyapeeth, do you have a question? Sir, what is the procedure to calculate the cost of a query? How do you compute the cost of a query? I did not go into that in detail for lack of time, but when we covered the individual operations, if you recall, we talked about the cost of some of them. I said this is the number of seeks it will do, this is the number of block transfers it will do, and so forth. Those cost estimates for individual operators were based on the sizes of the inputs, the number of matching records, and so forth. So, the issue is how you estimate the sizes of the various inputs, as well as the number of records that match a selection condition. For lack of time, I did not go into it. It is described in a lot of detail in chapter 13 of the book, and also in the full version of the slides. But to give you a brief picture: suppose I am doing a full file scan. If I know the size of the relation, I know how many blocks would be transferred, and so I know the cost of the file scan.
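To make the combiner answer above concrete, here is a minimal sketch of the standard Hadoop word-count job with the combiner turned on. This follows the stock WordCount example from the Hadoop tutorial; the key line is setCombinerClass, which reuses the reduce function as the combiner, exactly as described above.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits one (word, 1) pair per word occurrence.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Sums counts for a word. Run as the combiner on each map node (partial,
    // local sums) and again as the reducer (final sums across the cluster).
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            ctx.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The line that matters here: run the reduce logic locally on each
        // map node, so only (word, partial-count) pairs cross the network
        // instead of one (word, 1) pair per occurrence.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```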
If I am doing an index lookup on a key, I know roughly how many I/Os will be required, assuming the index pages are not memory resident. In the worst case, every single page of the index has to be fetched from disk. But more realistically, given today's memory sizes, although the formulae in the book state the cost as proportional to the height of the tree, for most medium-sized relations we can assume the inner nodes of the B+-tree are likely to be in memory, because they get accessed more frequently than the leaf-level nodes. If you have a fanout of 100, an inner node is 100 times more likely to be accessed than a corresponding leaf node. So, many practical databases assume the inner-level nodes are all in memory, and then charge just one I/O for the leaf, and maybe one more I/O for the data for a key, and so forth. So, that is for an individual operation.

But now, if I have a selection A < 10, how do I estimate how many records will satisfy it? If I have a selection age = 23, how do I know how many records will satisfy that? For that I need some extra information. I need to know, for example, a histogram of the distribution of ages in the relation that I am querying. If it were students, the age distribution would be concentrated between, let us say, around 17 and 22 or 23 for undergrad students. So, if I ask for a student with age equal to 40, I can estimate that there would be very few for a normal college; open universities are different. So, I can use histograms to estimate the number of records that satisfy a particular condition.

The next issue: if I have a join, how do I estimate the output size of the join? Again, there are formulae for that. A special case is when the join is on a foreign key: then the size of the output will be the size of the referencing relation, because it is a foreign key join. Assuming there are no nulls, every tuple in the referencing relation will have exactly one match in the referenced relation, so that is our estimate for the size of the join. What about join conditions which are not on foreign keys? There are formulae for estimating these too. You have to make some assumptions: you could assume a uniform distribution, or if you have histograms, you can actually get a better estimate. You thereby get an estimate of the size of a join, which in turn helps you estimate the cost of the next operation up in the tree. Similarly, I will need the number of distinct values, or histograms, computed on the output of a join. Again, there are statistical techniques for doing these, and some are described in the book.

If you want more information about how a real database does it, there is a lot of information about PostgreSQL's cost estimation; Postgres has nice documentation that way. So, if you want to know more about it, you can go to the PostgreSQL website. Another alternative is the course Moodle page, under teaching and research resources. Go in there and you will find a link to CS 631, Implementation Techniques for Database Systems. That is our course on database internals, and the book chapters that I cover in that course are listed there. I cover these in a lot of detail there; in this workshop, I am covering most of these chapters only in bits and pieces.
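Before moving on, here is a rough sketch of the size estimates just described, using the standard textbook formulas; the class and method names are made up purely for illustration.

```java
// Back-of-the-envelope cardinality estimates (standard textbook formulas).
// n_r = number of tuples in r; V(A, r) = number of distinct values of
// attribute A in r.
class SizeEstimates {

    // Equality selection sigma_{A=v}(r): assume values of A are spread
    // uniformly over its V(A,r) distinct values.
    // e.g. equalitySelection(10_000, 50) estimates 200 matching tuples.
    static double equalitySelection(double nR, double vAr) {
        return nR / vAr;
    }

    // Join r |x| s on a non-key attribute A: the usual estimate is
    // n_r * n_s / max(V(A,r), V(A,s)).
    static double joinSize(double nR, double nS, double vAr, double vAs) {
        return nR * nS / Math.max(vAr, vAs);
    }

    // Foreign-key join: each referencing tuple matches exactly one
    // referenced tuple (assuming no nulls), so the estimate is simply
    // the size of the referencing relation.
    static double foreignKeyJoinSize(double nReferencing) {
        return nReferencing;
    }
}
```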
In this implementation course, I cover it in detail, and correspondingly, on the right side there is a lot of information about how PostgreSQL does these things, because that is a real database and it is nicely documented, better than most other databases. So, you can read these up to understand what is going on.

So, let us get back and take one last question before I start today's topic. Hello, good morning sir. Sir, I wish to know the difference between timestamp locking and two-phase locking. There is no such thing as timestamp locking. It is the timestamp-based concurrency control protocol versus the locking-based concurrency control protocol, and that is a good question to get started with, because that is exactly the topic I was going to cover right after the last question. You have asked the right question to lead into the timestamp protocol; that is the next topic. So, let me get back to the slides and answer your question in great detail.

Well, timestamp-based concurrency control is coming up in just a second, but before that I want to wrap up on locking protocols. So far, we have assumed that we lock individual rows, that is, tuples of the relation: when we want to access a particular data item, we lock that tuple. Now, locking can be at different granularities. In theory, you could even lock at the level of individual attributes of individual tuples, although I do not know of any database system that actually supports that currently. But you can certainly have other granularities of locking. For example, consider a select query which is going to read every single row in a relation. That is going to acquire a lot of locks, and if that relation is very large, where do you store information about the locks? I briefly mentioned yesterday that there is a data structure called a lock table which keeps track of which transaction has which locks. But if there is a relation with 10 million or a billion rows, and I have to keep track of 10 million or a billion locks, it is going to take a lot of space, and that is not really feasible. The lock table will become extremely large; I do not want that.

The solution to that problem is what is called multiple granularity locking. The idea is that instead of getting locks on separate records, you could get locks on an entire page or on an entire relation, and multiple granularity locking lets you choose adaptively between these. If you go back to the early databases, they did not do tuple-level locking; they actually did page-level locking. After a while, they realized that page-level locking hurts concurrency, because many times people want to access different tuples in the same page, but only one of them can lock the page at a time. So, although the other tuple is not being used, it cannot be locked by the other transaction. From page-level locking, people then started introducing row-level locking. But if you do row-level locking naively, you have the problem I just mentioned: too many locks. Multiple granularity locking allows you to adapt between these two in a nice way. What you have to do is, if you want to lock an individual record, you do not lock the whole relation, but you leave a marker on the relation saying, I am locking individual records below. That marker will block somebody else from locking the whole relation. Otherwise, the problem is that I have locked a record while you go and lock the whole relation; we are apparently locking two different things.
The lock manager might allow both of us to get exclusive locks, me on a tuple and you on the whole relation, and obviously they conflict. So, what needs to be done is: if I want to lock a tuple, I will leave a kind of marker at the relation level. It is called an intention lock: I leave a marker on the relation saying, I am locking individual tuples. If you come in and say, I want an exclusive lock on the relation, the lock manager will see that somebody already has an intention lock on the relation, that is, his transaction has locks on individual rows, so your request must wait. On the other hand, if you come in and say, I also want to lock individual rows, your first step will be to get an intention lock on the relation, and both of us can have intention locks on the relation at the same time. Then you will lock a tuple, and as long as the tuple you want to lock is not the same as the tuple I have locked, you can go ahead. So, multiple granularity locking has the nice property that you can have your cake and eat it too: you can choose to lock at the appropriate granularity.

Now, how do you choose which granularity? If you are using SQL, this is easy, because the query optimizer knows what is going on; it has the plan. If it is doing a full relation scan, it is going to say, lock the relation. If it is doing an index scan, it will get an intention lock and then lock individual rows. It puts that as part of the plan. So, that is a very brief introduction to multiple granularity locking.

There is one more issue. So, timestamps are being pushed back a little in time. I told whoever asked the last question that it is the next topic. Well, I lied. It is not the next topic; it is two topics down. Let me cover the phantom phenomenon before we move on to timestamps.

The phantom phenomenon can be explained as follows. Let us take a transaction that scans the relation. Say T1 says: find the list of students taking CS 101. Let us say the way this is done is by scanning all the records of the relation; even if you have an index, the same thing happens. You find all those records that correspond to CS 101, and you can get shared locks on them, because I am not updating anything; I am just finding the list of students. Here is another transaction, T2, which inserts a new student into CS 101. In other words, if you remember our schema, T1 would look at the takes relation, and T2 would add a new record to the takes relation. Now, let us say T1 has already run. So, look down here: T1 has found all students taking CS 101, and it has locked all the tuples corresponding to CS 101. Now, T2 comes and inserts a new student into CS 101; that is, it adds a new record to the takes relation. Now, do these two conflict? T1 shared-locked all the records it found, which had CS 101. The insert, by the way, exclusive-locks records; the same issue can also happen with updates, and I will come back to that. So, this insert exclusive-locked the particular takes row that it inserted. Now, do these two conflict on any record? The answer is no. The inserted row is a new record: T1 never saw it, and never even tried to lock it. It is a brand new record, so T2 can get the lock. So far, so good. Now, suppose T2 also updates the total credits of the same student, with ID 101, and then it commits. When it commits, it is going to release all its locks, and it goes away. So far, so good. Now, suppose T1 reads the total credits of the student 101.
It is going to get a lock at this point, and what is the value it sees? T2 has already updated that value and committed, so T1 will see it. Now, something very strange has happened. T1 has seen one of the updates of T2, namely the update of the total credits, but it has not seen the insert of the new record which T2 did. Clearly this violates serializability; we have a problem.

Let me emphasize this once again, in case you did not understand it; let me use the whiteboard. We were looking for those students who had taken CS 101. So, there are a number of students; actually these are not exactly students, they are rows in the takes relation. So, takes has one row for CS 101, another one for CS 101, and so forth. T1 S-locks all of these. Now, T2 inserts a new row. What has happened here? It is a brand new row, so T1 has not locked it. Therefore, T2 is allowed to create the row, lock it, and insert it. Next, T2 updates the student with ID 101. That is a different relation and a different record; it updates it, so it gets an X-lock on it. And then T2 commits; T2 is done. Now, T1 reads student 101. Is it blocked? No, because T2 has released all its locks and gone away. So, T1 is allowed to read the updated value for this student. Therefore, if you look at the precedence, T2 has written something which T1 read. On the other hand, there is another thing which T2 wrote, namely the row inserted into the takes relation, which T1 did not see. If you looked at only that row, it would appear that T1 ran before T2. So, with respect to that row, the serialization order is T1 before T2; with respect to the update on student, T2 is before T1. So, you have a cycle. This is not serializable; that is the problem. And that problem is called the phantom phenomenon.

It turns out it can happen with inserts, as we saw, but also with updates. For example, suppose T2, instead of inserting a new record into takes, took a particular record in takes which corresponded to, say, CS 102 and updated it to CS 101. That record would not have been locked earlier by T1, and such an update is more or less the same as deleting the old record and inserting a new one. So, the problem also happens with updates.

This particular problem, it turns out, arises because we are only locking tuples, while T1 is actually reading more information. What it is doing is finding out which records have course ID equal to CS 101 in the takes relation. Now, in order to know this, it is not enough to read only the records which have the value CS 101; it reads some other information which helps it decide that that is it, there are no more records with CS 101. Where is this information? One way is that it reads all the records in the takes relation; another is that it might use an index to find them. Whatever it is, it has read information that told it there are no other records. Now, T2 has gone and updated that very information which T1 used to decide that there are no more records. So, the issue is that pure tuple-level locking is not enough; you have to do something more. There is other information which needs to be locked, and the book describes details of how to solve this problem. There is something called index locking, which is widely used to prevent the phantom phenomenon.
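To give a rough feel for index locking, here is a conceptual sketch (the names are made up; this is not any particular engine's API). The idea: a scan that looks up all records for a key also takes a shared lock on the index bucket for that key, and an insert must take an exclusive lock on the same bucket before adding an entry, so it waits for the scanning transaction.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Minimal sketch of index-key locking. Locks are held until commit,
// per two-phase locking; lock-table bookkeeping is omitted.
class IndexKeyLocks {
    private final Map<String, ReadWriteLock> buckets = new ConcurrentHashMap<>();

    private ReadWriteLock bucket(String key) {
        return buckets.computeIfAbsent(key, k -> new ReentrantReadWriteLock());
    }

    // Scanner: lock before reading all index entries for the key.
    void lockShared(String key)      { bucket(key).readLock().lock(); }
    void unlockShared(String key)    { bucket(key).readLock().unlock(); }

    // Inserter/updater/deleter: lock before touching entries for the key.
    void lockExclusive(String key)   { bucket(key).writeLock().lock(); }
    void unlockExclusive(String key) { bucket(key).writeLock().unlock(); }
}
```

With this in place, T2's insert of a new CS 101 row must take the exclusive lock on the CS 101 bucket, which blocks until T1's scan, holding the shared lock on that bucket, commits; that is what removes the phantom.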
Now, interestingly or shockingly, depending on how you look at it, Oracle, and PostgreSQL versions before 9.1, both did not prevent the phantom problem. And not only did they not prevent it, they lie about it. If you go read the Oracle documentation, they calmly say that Oracle does not have the phantom problem. It turns out that the term phantom problem is used loosely in different ways by different people, and one of the early papers in this area defined it in a specific way which, in fact, the snapshot isolation scheme we will see shortly, and which Oracle uses, does not violate. So, if you look at it in that very narrow sense, okay, maybe Oracle does not violate it. But in the sense of violating serializability, under the more general definition of the phantom problem, Oracle definitely is vulnerable to the phantom problem. And so was PostgreSQL, but they fixed it in release 9.1; they fixed this particular thing using index locking. The actual problem is very old, and the solution is also very old: IBM had realized this many moons ago and fixed it in their databases decades ago.

Okay, now let us move on to timestamp-based protocols, which was the question. This is a different class of protocols which do not depend on locking. Well, actually, they may do some kind of locking internally, but let us not worry about the implementation; let us look at the logical level. Logically, each transaction is issued a timestamp when it enters the system. You can read the value from the clock, as long as two transactions do not get the same timestamp, or you can have a counter which keeps getting incremented: each time a new transaction comes, the counter gets bumped up by one. So, it is a logical timestamp; it does not have to be an actual time of day, we do not care. Now, the following things happen. The idea is that transactions should be serialized in the order of whatever timestamps they were assigned when they entered the system. Each transaction has a timestamp; we denote it by TS(Ti) or TS(Tj) for transactions Ti and Tj. If a transaction came first, its timestamp will be less than the timestamp of a transaction that came later.

Now, in addition to transaction timestamps, we also have timestamps for data items. If you take a data item Q, it actually has two timestamps. The write timestamp of Q is the largest timestamp of any transaction that executed write(Q) successfully. Whenever a transaction succeeds in writing Q — and what do I mean by succeeds? There are some rules coming up; if those are satisfied, the write is allowed to proceed — the write timestamp is set to the timestamp of this new transaction, which will be the largest among all transactions that have written Q up to that point. A read timestamp is also stored for every data item, and it is the largest timestamp of any transaction that executed read(Q) successfully; again, there are rules for reading which are coming up. So, every data item has two timestamps: a read timestamp and a write timestamp.

Now, here are the rules, and what the rules ensure is what is shown here: any conflicting read and write operations are executed in timestamp order. What does this mean? Supposing a conflicting read and write arrive out of timestamp order, then the transaction is going to be rolled back. Let us see exactly how this is done. Suppose a transaction Ti issues a read(Q), and suppose its timestamp is less than the write timestamp of Q. What does that mean?
It means a transaction with a later timestamp has already written Q. The value which this transaction should have read should have been an older value; if it reads the new value, it is going to be serialized after that other transaction, but that would violate the timestamp ordering, since this transaction's timestamp is less than the timestamp of the one that wrote Q. So, the only thing you can do is roll back Ti: if the reader's timestamp is less than the write timestamp, the reader is rolled back. Otherwise, the read operation is executed, and one more thing is done: the read timestamp of Q is set to the maximum of the current read timestamp of Q and the timestamp of the reading transaction. So, you can have a situation where T70 comes and reads a data item, and the read timestamp of Q is set to 70. After this T72 comes, and the read timestamp is set to 72. After this T68 may come and read Q; at this point you do not update the read timestamp, it stays at 72. It is the highest of the current read timestamp and the timestamp of the current reader. So, that was straightforward.

Write is also straightforward. The transaction issues a write. The first thing that is checked is whether the timestamp of the transaction doing the write is less than the read timestamp of Q. What does this mean? It means somebody with a later timestamp has already read Q and got a particular value. If I were to allow this write to proceed, that other transaction should have seen the value being written now, but it is too late; that transaction has already read the old value. Therefore, the only solution is to roll back the transaction and reject the write operation. So, whenever the writing transaction's timestamp is less than the read timestamp of Q, the write operation is rejected and the transaction rolls back. Otherwise, the next step: if the timestamp of Ti is less than the write timestamp of Q, then Ti is attempting to write an obsolete value; that is, a later transaction has already written Q. Again, the write is rejected and Ti is rolled back. There is actually a variant, which we won't get into, that allows this particular write to be ignored rather than rolling back the transaction. And if both these conditions are okay, that is, they do not require a rollback, then the write operation is executed, and the write timestamp of Q is set to the timestamp of whoever did the write. And that's it. It is a very simple protocol.

There are a few more details which I am omitting. In particular, you have to make sure that transactions do not read uncommitted values, so you have to keep track of which data items have been written; in other words, you can keep a lock. In fact, many systems which do timestamps additionally keep locks to ensure that uncommitted values are not read. But the locking is not necessarily two-phase, especially for reads; the locking is there only to ensure that uncommitted values are not read. There are other solutions that avoid locking, but locking is usually the simplest solution for preventing uncommitted reads. So, that is the timestamp protocol. Even the timestamp protocol suffers from the phantom phenomenon, and again, solutions exist to prevent it.

So, let me cover the next protocol also, the validation protocol, and then maybe we will take a few questions after a few more slides. The validation protocol works slightly differently from timestamps. Every transaction goes through three phases. First is a read and execution phase, where transaction Ti writes its updates to local variables.
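A minimal sketch of what that local buffering might look like (the names here are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Read/execute phase of the validation protocol: writes go to a
// per-transaction buffer, and a read checks that buffer first so the
// transaction sees its own writes. Nothing touches the database until
// the write phase, after successful validation.
class OptimisticTxn {
    final Map<String, Integer> readSet  = new HashMap<>(); // item -> value read
    final Map<String, Integer> writeSet = new HashMap<>(); // item -> local copy

    int read(String item, Map<String, Integer> db) {
        if (writeSet.containsKey(item)) return writeSet.get(item); // own write
        int v = db.get(item);   // assume the item exists, for simplicity
        readSet.put(item, v);
        return v;
    }

    void write(String item, int value) {
        writeSet.put(item, value); // buffered locally, not applied yet
    }
}
```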
Nothing is written back into the database, per se. And if a transaction does a write and then tries to read the value which it wrote, the read will be served from the local copy; it won't go back to the database. Now, when the transaction has finished, it does a validation phase, to determine whether it can be allowed to commit. If validation succeeds, then the updates are applied to the database; otherwise, the transaction is rolled back. So, this is the basic idea, and the key question is how you do the validation.

The trick is as follows. Basically, let's assume that validation is done serially, that is, only one transaction does validation at a time, and that it also does the write phase along with validation. This can be relaxed, but to simplify understanding, let's suppose that transactions run, and when they enter the validation phase, only one transaction at a time is allowed to validate; as soon as its validation succeeds, it does its write phase too and finishes up. What is the write phase? All the writes which were stored in local variables are now written back into the database. So we'll assume these two are done together, and only one transaction is doing this at a time.

Now the basic idea is as follows. We want to make sure that if any transaction that ran concurrently had any conflict with this transaction, then there's a potential serialization violation. If there is a conflict only with transactions that committed earlier, we don't care: they are committed and gone, and any value which I read will be a value which they wrote. That's not an issue. The issue is conflicts with concurrent transactions. And the way this protocol handles it is by keeping track of which data items the transaction read and which data items it wrote; it's there on the next slide. This protocol is also called the optimistic concurrency control protocol, since as long as the transaction is running, it's never blocked. It keeps running optimistically, assuming that all will go well during validation.

Now, the actual validation is based on two principles. As I said, you want to track what each transaction read and wrote, so the read set and the write set of every transaction are kept track of. The validation step will check for conflicts of the read and write sets with all concurrent transactions. What do I mean by a concurrent transaction? Actually, it's not going to check against other transactions which are still executing. They're also concurrent, but that check is deferred. Suppose I have transaction T70 running now, and it wants to validate. There's another transaction, T59, which has not even started validation. T70 is not going to validate against T59; T59, when it wants to validate, will validate itself against T70. T70 will only validate itself against transactions that ran concurrently with it and have already validated. So that's the thing here: concurrent transactions here means transactions that committed between the time this transaction started and the time it is validating. Transactions that committed even before it started, we don't care about; they couldn't have had any influence whatsoever. So how do you detect the concurrent transactions? Well, we have a start time and a validation time: for each transaction, we keep track of when it started and when it validated. So we have an interval, and this interval is used to identify the concurrent transactions.
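Putting these pieces together, the validation test being described might look roughly like this (a sketch with made-up names, assuming serial validation as above):

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Sketch of the validation check. 'concurrentWriteSets' holds the write
// sets of transactions that validated and committed between this
// transaction's start time and its validation time.
class OptimisticValidator {
    static boolean validate(Set<String> myReadSet,
                            Set<String> myWriteSet,
                            List<Set<String>> concurrentWriteSets) {
        for (Set<String> theirWrites : concurrentWriteSets) {
            // They wrote something I read: I may have seen a stale value.
            if (!Collections.disjoint(theirWrites, myReadSet)) return false;
            // They wrote something I also wrote: the write order may be wrong.
            if (!Collections.disjoint(theirWrites, myWriteSet)) return false;
        }
        return true; // safe to apply the buffered writes (the write phase)
    }
}
```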
Now, the key thing is to check whether there is a conflict with any concurrent transaction. What do I mean by conflict? If a concurrent transaction that has committed wrote something which this transaction read, there is a conflict. If it read something which this transaction wrote, there is also a conflict, because maybe the order should have gone the other way; this case can be relaxed. And if it wrote something which this transaction also wrote, again there could be a conflict: it was concurrent, it committed first, but there could be an issue in the serialization order, because it might have done an update which this transaction did not see. So there are various conditions which have to be checked here; the read and write sets have to be validated. If all goes well, the transaction is allowed to commit; if the conditions are violated, the transaction is rolled back. That's the idea of the validation check. The key thing to note is that a concurrent transaction may have done an update which this transaction did not see, and if this transaction in turn does a write, that could be a problem. The details are in the book; I'll skip them.

And I will maybe take just a couple of questions. I'll see if there are any chat questions first. Somebody has asked about different types of timestamps, and what the need for timestamping is. I'm not going to get into great detail on timestamping; it would take me a lot of time to explain, and I want to wrap up a few topics today. But I hope I've conveyed the intuition. The timestamp protocol's intuition is: each transaction has a timestamp, and the serialization order must respect the transaction timestamp order. If there is any operation which is out of order with respect to the transaction timestamps, then the transaction has to be rolled back. That's the key intuition. Similarly, validation keeps track of read and write sets, and if there is any conflict with a concurrent transaction, the transaction must be rolled back; the timestamps there are used only to detect which transactions are concurrent, and the validation itself is done using the read and write sets.

Maybe I'll take one or two questions; there are a lot of hands raised here. Let's start with Sairna. What is the tree locking method? So, I didn't cover that here, but there are protocols called tree locking and graph locking. There's a simple case of this that I think I discussed yesterday, if you remember: if one transaction locks in the order A, B and the other in the order B, A, then they are likely to deadlock; if they both lock in the order A, B, then there won't be any deadlock. That was in the context of two-phase locking. Now, a variant of that, which is not actually two-phase but does ensure serializability, is to order the items totally; then I can lock whatever items I want, but I can only lock them in that order. There are variants of this protocol which have a partial order based on a tree, and then there are graph-based ones and so forth. The idea is that these rules don't insist on two-phase locking, but they have other rules about when you can lock a data item: to lock a data item, you must have locked some other data item before it. So those are the ordered locking protocols. In general, it's not enough to just lock in an order; if you want to lock an item, you should have locked its parent data item. Again, I don't have time to get into the details.
But the idea is that these protocols do ensure serializability without requiring two-phase locking. You can actually release a lock and then acquire some other lock, subject to some other rules; there are rules there too. So, the tree locking and graph locking protocols are all variants of this basic idea. Again, for lack of time, I won't get into the details; they are briefly sketched in the book. There was a little more emphasis on this once upon a time, but in practice it's not very widely used today, so we mention it only briefly in the book, and there are some exercises which go into it in a little more depth. The solutions are available online, so you can see them. Any follow-up questions?

Yes, sir. Sir, we use the cursor technique in PostgreSQL. How can we apply this cursor technique in Oracle? By cursor technique, you mean a database cursor which lets you step over one row at a time; is that the cursor you're talking about? Right. Okay. So, first of all, so far I've been saying we run SQL and declaratively query a whole relation. There is another interface, supported by many databases, which is called a cursor. You can open a cursor on a relation and step through it, that is, step through the records one at a time. Essentially, when you do this, your program can operate on the current record: read it, perform updates on it, then move to the next record, and so forth. Along with this, the locking technique used by default is that the lock is kept only on the current row which you are reading; or you can also have a cursor for updating, in which case it gets an exclusive lock. So it's not purely declarative, but more procedural, in that you're stepping through the records of the relation one at a time and performing updates. There is syntax for this: an update where current of, which updates the record the cursor is currently looking at, and so forth. So yes, all databases support some form of cursor. We don't necessarily recommend using it, because it's kind of procedural and low-level, but it can be useful at times. It may be more efficient because it releases locks early; so to get higher concurrency, that's one option. Any follow-up on that? Thank you, sir.

There's a question on chat which is interesting: how is the timestamp-based protocol handled in a distributed transaction? That's a good question. The key to timestamp protocols is that timestamps should be increasing and they should be unique. It is possible to allocate timestamps based on the local clocks at each node, but then you could get into trouble because the clocks are not in sync. One of the nodes may keep issuing transactions with very old timestamps, which keep getting rolled back repeatedly, because other transactions have later timestamps; the clocks are skewed. Logically, it is fine for sites to issue their own timestamps, and whatever is issued by a site is used, except that timestamps should be unique. That's a minor detail which is easy to take care of: you can append the site ID to the timestamp, and that will make it unique even if two sites generate the same timestamp. But the issue, like I said, is a practical one: if a particular site has a clock which is slow, its transactions may keep getting rolled back all the time. So you can do this if the clocks are kept tightly in sync. Now, this is something which people didn't pay much attention to for all these years.
But very recently, in October 2012, Google published a paper on a system called Spanner. Let me use the whiteboard. Spanner is a distributed database. It's scalable, meaning you can keep adding nodes to it and it grows in a seamless manner. And it is a database in the sense that it supports transactions, it supports recovery, it has a notion of relations, it supports SQL, and so forth. This system actually uses timestamps, and it does some very interesting things to ensure that timestamps are in sync across machines; it goes to great lengths to do this. One of the things it does is use GPS. GPS is the global positioning system which many mobile phones have today, which can tell you where you are. But the core of GPS is a clock which is extremely accurate. The clock essentially measures delays: from the delay, you can find out how far you are from a satellite, and from that you can figure out your position. So GPS inherently has to have an extremely accurate clock; the satellites broadcast the time, so from your GPS sensor you can get the time very, very accurately. They use this to keep all the nodes of a geographically distributed system in sync on time. And then they have a bunch of other tricks also: they use atomic clocks, which are very accurate, so that even if GPS dies tomorrow for some reason, they can still continue working. So there are a bunch of cute things they do to ensure that all the nodes in a distributed system are more or less in sync with respect to the timestamp, and this property is critical for them for certain other things. I don't have the time to get into the details, but if you're interested, it's a nice paper to look at.

The next chat question is: where is multiple granularity locking used practically? The answer is, many databases support multiple granularity locking. You may not see it externally, but they do support it internally. It is very, very important, because otherwise the lock tables become very large; if you have a transaction that does a lot of updates, you get into trouble. I think I'll stop there. I want to wrap up concurrency control soon, so I'll go back to my slides.

So, the next key concept is something called a multiversion scheme, where, whenever you do an update, you create a new version of the data item, and these versions are labeled by timestamps. The timestamps must correspond to the commit order in general, or the serialization order. Let's say transactions with timestamps 1, 11, and 45 successfully updated a particular data item. Then there are correspondingly three versions of the data item, labeled 1, 11, and 45. When a read comes in, you look at the timestamp of the reading transaction. Supposing it is 34, it's going to read the version labeled 11. On the other hand, if a transaction with timestamp 50 comes along, it's going to read the version labeled 45. So, keeping in mind that there are multiple versions and that a transaction can read an appropriate version, the timestamp ordering protocol, which we briefly looked at, has been extended to create the multiversion timestamp ordering protocol. Similarly, the two-phase locking protocol has been extended to create what is called the multiversion two-phase locking protocol. The idea here is that update transactions do two-phase locking as usual, but they also have timestamps denoting when they commit, and that timestamp is used to label the data items which they update. And you keep multiple versions.
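The version lookup just described is easy to picture in code (a sketch; the names are made up):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of version selection in a multiversion scheme. Versions of a data
// item are kept ordered by write timestamp; a reader with timestamp ts gets
// the latest version written at or before ts.
class VersionedItem {
    private final TreeMap<Long, Integer> versions = new TreeMap<>(); // writeTS -> value

    void install(long writeTs, int value) {
        versions.put(writeTs, value);   // called when an update commits
    }

    Integer readAsOf(long ts) {
        Map.Entry<Long, Integer> e = versions.floorEntry(ts); // largest writeTS <= ts
        return e == null ? null : e.getValue();
    }
}
```

With versions installed at timestamps 1, 11, and 45, readAsOf(34) returns the version written at 11, and readAsOf(50) returns the one written at 45, exactly as in the example above.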
On the other hand, read-only transactions are given a timestamp, which says: I want my data as of this point in time. A read-only transaction will read an appropriate version of the data, and it will never block. If some transaction is doing an update, it will create a new version; it won't update the existing version in place, and the read-only transaction is allowed to proceed by reading an older version. Because the timestamps are assigned suitably, it will never try to read the latest version; it will only read a version which is already done and no longer locked. So, multiversion two-phase locking is a very nice protocol. Its biggest benefit is that read-only transactions never block; moreover, read-only transactions never block update transactions. And typically, you have two kinds of transactions: transactions which are read-only but read a lot of records, and transactions which do updates but are typically much smaller. They don't update all the records; occasionally you have such things, but they are rare. So multiversion two-phase locking is a very nice protocol.

There's only one catch with it, which is that transactions must declare themselves as read-only or as update transactions, and the problem is that most code does not do this. They start a transaction, say begin transaction, and then execute one thing after another. The first query might be a read-only query, but the second query may do an update. Now, if the first query is read-only but we don't know what the second query will be, we cannot treat the transaction as read-only. Therefore, even the first query must do two-phase locking, and that causes problems; the benefit of multiversion two-phase locking is lost.

So let me repeat it. The idea of multiversion two-phase locking is that you have timestamps associated with the different versions of a data item. A read-only transaction also has a timestamp, and it reads an existing version based on its timestamp, the latest version as of its timestamp. So a read-only transaction with timestamp 34 will read the version labeled 11; one with timestamp 50 will read the version labeled 45. There may be another update transaction currently going on. Update transactions use two-phase locking, and they conflict with each other. Read-only transactions don't do two-phase locking; they just read the appropriate version of a data item and keep moving, so they never block. And correspondingly, since they don't do any locking, they don't block update transactions. So it's a very nice protocol. But like I said, the problem is that when you have a system running transactions written in Java or C or whatever it is, you don't know what the transaction will do in the future, so there's no way to label a transaction as read-only.

So Oracle looked at this, and others also looked at this at that time, and they came up with another protocol called snapshot isolation, which I'm going to talk about very briefly. There are some implementation issues in multiversioning which I'm going to skip. So let's come to snapshot isolation. Multiversion two-phase locking can be thought of in another way: when a read-only transaction comes, you give it a logical snapshot of the database state, and it reads things from that logical snapshot. What is a snapshot? You take a picture.
If you are in a classroom and you take a picture, everybody is frozen as they are; if they move subsequently, that is not reflected in the snapshot. So the read-only transaction reads data from that snapshot. Now, since we don't know which transactions are read-only, the pessimistic thing is to say everything is an update transaction, but then there's no benefit. So the snapshot isolation protocol is overly optimistic. What it says is: let me satisfy the reads of all transactions from a snapshot. I don't care whether they're going to do updates or not; if they do a read, I will serve it from the snapshot. Now, this is a logical snapshot, not a physical one. It's not as if a copy of the database has been created; there are efficient ways of creating a logical snapshot, and many databases support this today. PostgreSQL, SQL Server, Oracle, they all support efficient ways of getting a snapshot.

The issue now is that if an update transaction reads from the snapshot, it might read some data item and update some other data item, which can actually lead to non-serializable execution. If it used two-phase locking, it would be fine. But here the problem is that it is not reading the current version of a data item; it reads from a snapshot, which might be a bit old, and then it may go and update a data item based on that. So there could be problems. The first published description of snapshot isolation was by Berenson et al., in SIGMOD '95, where they said: hey, this protocol has a bug, it is not serializable. The goal of that paper was to point out some flaws in how SQL had defined isolation levels; they said the SQL definition had some holes in it, and they pointed out the holes by introducing snapshot isolation as a protocol that exposed them. But guess what: Oracle, soon after, actually implemented snapshot isolation as part of Oracle. I'm not quite sure how snapshot isolation came to appear in that paper. It's possible that these people knew Oracle was going to implement it, or maybe Oracle read the paper and got the idea; I'm not very sure. But nevertheless, here was a protocol shown in the published literature as not being serializable, and yet Oracle went ahead and implemented it. And they also misled people into thinking that the protocol ensured serializability. So in Oracle, you tell it to run at the serializable isolation level, and it will run at the snapshot isolation level, which doesn't actually ensure serializability.

Now, I have several slides on snapshot isolation; I will just skim the basic idea of what it does. Here are three transactions, T1, T2, T3. T1 starts, writes y, and commits. Then T2 starts and reads x and y; it works off a snapshot of the data as of the time it started. At that point y is 1, from T1's committed write, and x has not been updated; let's say the initial values were 0. Now T3 comes along and writes x and z, and T3 commits also. But T2 does not see the updates of T3, because it's reading from its snapshot, and in its snapshot z is 0, x is 0, and y is 1. So even though T3 has already committed and set z to 3, when T2 now does a read of z, what value does it get? It gets 0, because it's reading from its snapshot. Read y? That was updated before T2 started, so T2 sees the value 1; even if the initial value had been 0, it would see 1. Now comes the last part, T2's write of x equal to 3. What happens? Now it requests a commit.
So here, what snapshot isolation does is check: was there a concurrent transaction which wrote a data item that T2 also wrote? In this case, T2 has written x, and there was a concurrent transaction, T3 (T3 was concurrent with T2), which also wrote x. At this point T3 has already committed, so when T2 makes its commit request, T2 is rolled back. So this step essentially does validation. Snapshot isolation does a kind of validation similar to the optimistic concurrency control protocol, but the exact validation is different: it only validates writes, it does not validate reads. And this can get it into trouble, as we will show over here.

We have two transactions. One of them reads y and assigns the value to x; the other reads x and assigns the value to y. Let's say x is 3 and y is 17. In any serial execution, what would x and y be? If T1 ran first, it would set x to 17, and both would be 17; then T2 would set y to x, but both are already 17, so T2 in effect changes nothing. On the other hand, if T2 ran first, it would set y to 3, and then if T1 ran subsequently, it would set x to y, which is 3, so both would be 3. So in any serial execution, the final values of x and y would be either both 3 or both 17; no other outcome is possible. But under snapshot isolation, suppose both start at the same time. They both get a snapshot, and in that snapshot x equals 3 and y equals 17. Now T2 sets y equal to 3 while T1 sets x equal to 17. They don't update any common item, so both are allowed to commit, and the net result is that x and y are swapped. In case you didn't follow that: the commit check in snapshot isolation only looks at what items the transactions write; it does not look at what they read. And if you look at what they write, T1 and T2 write two different things. Since they both read from a snapshot, you get a situation where x and y swap values, which is impossible in any serial execution of T1 and T2. Therefore, snapshot isolation does not ensure serializability. That's the take-home message here. There are other ways this can happen, with inserts and so on; I will skip the details.

Now, the bottom line is that Oracle, even if you tell it to run at the serializable isolation level, will run snapshot isolation, and the net result is that transactions may not be serializable. You might wonder, how on earth is the leading database in the world getting away with this? You tell it to be serializable and it is not. The answer is that they get away with it because most people don't notice. In fact, somebody showed that in one of the widely used benchmarks, serialization problems will not occur under snapshot isolation because of the particular way its transactions are written. So it turns out that serialization violations are not common; but that doesn't mean they can't occur. In fact, at IIT Bombay, we had our financial system written by TCS. The people who wrote the system were pretty smart, but they didn't fully understand snapshot isolation, and we used Oracle. A couple of years after the system was launched, a financial auditor said: hey, your account books have a situation where two different transactions have the same ID. That is a violation of financial rules; you can't have two people being given receipts with the same number. That's like you issued a duplicate receipt and ate up the money. So they pointed out a violation.
And the question was, how did this happen in the code? After a bunch of sleuthing, we finally realized that it was because of snapshot isolation: it does not ensure serializability. And it turned out that the TCS people had heard about this issue somehow, I think from their seniors, and they had tried to solve it, but they had not succeeded; they didn't know how to solve it. There is a partial solution, the for update clause, whose details I won't get into; Oracle supports it, and PostgreSQL also supports it. But it's not complete, because even if you use it, there is no clean way to deal with the phantom phenomenon. So even with this, it's not good enough. That is an issue.

Now, what happened is that PostgreSQL 9.1 changed its concurrency control. It implements a variant of snapshot isolation which they call serializable snapshot isolation. I won't get into the details, but the bottom line is that if you use PostgreSQL 9.1 or later, serializability is assured. It also does what is called predicate locking to prevent the phantom problem. So it actually is clean now; there is no problem with PostgreSQL from 9.1 onwards, and that is the system which you will use. Interestingly, when I ran this lab in 2010, the latest version of PostgreSQL at the time did not have this fix, so the same lab exercise which you are running today would have returned slightly different results back in 2010. We have changed the exercises a little bit, so what you're running is not exactly the same as what they ran, because one of the exercises meant to show lack of serializability now fails, in the sense that the execution is actually serializable. We dropped that exercise, because it didn't help you much unless you fully understood snapshot isolation, and I have only skimmed it, so we'll not worry about it. The point of today's exercises is to see what happens when you run transactions concurrently. What you will see is that PostgreSQL uses snapshot isolation; there are some exercises which reveal this. At the same time, you will also see that PostgreSQL uses some kind of locking internally, and the exercises will demonstrate that to you.

So with that, I will stop on concurrency control. There are a couple of things on weak isolation levels which I have already covered, and there's also something on concurrency in index structures which I'm going to skip. So that's it for concurrency control. I'll take one or two questions and then move on to recovery. Now we have Suresh from Sarvajanik, please go ahead. Good morning, sir. My question is: when concurrent transactions are executed, does the database take into consideration shortest job first, or some kind of round robin? Or does it depend upon the operating system's process management for the execution of concurrent transactions? That's a good question. This question is about scheduling of transactions: if you have a long transaction and a short transaction, should the database give priority to short transactions over long ones, and how is this priority dealt with? The answer is that most databases do not provide very much control over priority here. What happens is that each transaction runs in a separate thread or process, and it's left to the default scheduling to give priority to different processes. However, there is an area called real-time databases where you want to give priority to certain transactions over others, and people have built prototypes, at least, of real-time databases where you can set priorities.
These have been researched quite extensively, and different prioritization schemes have been shown to work better than others. Now, practically, do any of the current generation databases support these kinds of priorities? I don't think so, although some of them may have some underlying features. They don't do a great job of scheduling per se; they just use the default. As you said, PostgreSQL just runs a number of processes, and the OS priority defines the order in which things run; there's no other prioritization done by the database. Do you have a follow-up on that? Hello? I have one more question. Query optimization, concurrency control, recovery: all of these involve cost. Is there any situation where the database decides to avoid these and just execute the transactions serially?

That's a good question. So the question is that concurrency control has a cost; is there any situation where databases decide the cost is not worth paying and run things serially? That's a very good question. Now, if you have data on disk, concurrency pays off a lot, because when you run a transaction, if it does disk IO, it can block for 10 milliseconds, and that's a significant amount of time in a high-performance database system. You really don't want the whole database to be blocked in such a situation. On the other hand, if you have a main-memory database where there is never any IO, everything being resident in memory, this kind of blocking won't happen. The second issue is when you have a very long transaction that blocks other transactions behind it. If you have a main-memory database where every transaction is guaranteed to be very short, with no long transactions, then it is not clear that the price you pay for concurrency control is worth it. In fact, there have been studies which have shown that in this situation, main-memory databases with only very short transactions, it is not worth doing concurrency control; you might as well just run things serially. Again, this was in an era when processors were single core, so some time ago. Today you have multi-core processors which can run many things in parallel, so I'm not sure that result holds today, even on a regular single system, because it has so many cores. There is overhead to concurrency control, but you can run transactions in parallel, and the overhead of locking will probably pay off on such a multi-core system. But it's a good question; people have asked it and done research on it.

Good morning, sir. This is Akhilesh Sharma from Sir Padampat Singhania University. We would like to congratulate you, first of all. And we have one question: how are view serializable schedules good for maintaining concurrency between transactions? So, first of all, if you remember, I mentioned that there is something called view serializability, but I didn't go into its definition. Now, the question is, is it ever useful? It turns out that in the timestamp protocol, I briefly mentioned a particular check; let me go back and show you that slide. On this slide, if you see step number two, this is the timestamp protocol when a write operation is issued. What happens here is: if the timestamp of the transaction which is trying to do the write is less than the write timestamp of Q, then Ti is attempting to write an obsolete value, so the write is rejected and Ti is rolled back; allowing that write would violate the conflict order.
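In code form, that step of the write rule looks roughly like this (a hedged sketch with made-up names; the read-timestamp check from the first step is included for completeness):

```java
// Sketch of the timestamp-ordering write rule for one data item.
class TimestampedItem {
    long readTs;   // largest TS of any transaction that has read this item
    long writeTs;  // largest TS of any transaction that has written this item
    int value;

    // Returns false if the writing transaction must be rolled back.
    synchronized boolean write(long txnTs, int newValue) {
        // Step 1: a later transaction already read the old value.
        if (txnTs < readTs) return false;
        // Step 2: a later transaction already wrote; this write is obsolete.
        if (txnTs < writeTs) {
            return false;
            // (the optimization discussed next would silently skip such a
            //  write and let the transaction continue, instead of rolling
            //  it back)
        }
        value = newValue;
        writeTs = txnTs;
        return true;
    }
}
```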
Now, there is an optimization of this particular step which says: yes, the transaction is trying to do an obsolete write; however, instead of rejecting it, let us just ignore the write and let Ti continue. This is a very, very specific optimization that has been proposed (the book calls it Thomas' write rule); it's an interesting observation that obsolete writes can be ignored. Now, if you do that, it turns out that the schedules you get are not conflict serializable, but they are view serializable. The idea is that for that particular write there was a conflict, but you could ignore it, because the value it wrote would anyway have been overwritten by somebody else subsequently. And that's the key difference between conflict and view serializability: view serializability ignores certain conflicts if they have no impact on the final state. Essentially, the write which you did would have been clobbered by somebody else without an intervening read. Now, this is kind of rare; it's not a hugely important thing in practice. So yes, this is a cute optimization, but in practice I don't think it has a huge impact, and so view serializability is not viewed as something terribly important. Back to you for your next question.

Sir, this question is not related to this topic. I just wanted to ask: you spoke about big data, but there is another concept of data warehousing and data mining, where there is multi-dimensional data. If this is in a distributed environment, how do we manage concurrency and indexing in that context? If you can highlight some of this. So, data warehousing, like the big data scenarios which we saw yesterday with Hadoop, is for decision support, where you're not doing updates on the fly. The updates have already been done somewhere, and those updates are coming to you: the actual transaction processing system that needed to deal with concurrency has already done so, and it is just telling you that these things happened. So there is no need for concurrency control at this level. Essentially, you can apply the updates periodically; or even if you apply them continuously, read-only transactions could perhaps use a snapshot, by which I mean a logical snapshot, not a physical one. You can keep timestamps and make sure that readers ignore things with a later timestamp. So this is basically how data warehousing systems avoid the issues of concurrency control: they either use snapshots, or they don't allow updates all the time, doing updates periodically when other transactions are not running. And when you have a distributed setting for decision support, there is no real change; concurrency control is not an issue.

However, you may have a regular transaction processing system which itself has to be distributed, running across many machines for performance or other reasons. It may not just be performance: you can have a company which has two databases, and you need to run a transaction which spans those two databases, with reads and writes in both. So how do you manage this? Again, in this short course I have not had time to get into distributed transactions, but in the book there is a chapter on distributed databases, and it talks about these issues. The first issue, let me write it down here: in distributed databases, the first issue is atomic update. When you have two different machines running database software, you might do an update on one, and then there is a failure before the update happens on the other one. You have a problem.
How do centralized databases avoid this? They have a log which they run through on recovery; we are going to see that as the next topic. But in a distributed database this becomes harder, because there is no central control over recovery; each site is independent. For this, there is a technique called two-phase commit, which is very widely used. In fact, on this same whiteboard I mentioned Google Spanner, and I said it's a distributed database; Spanner implements two-phase commit as one of the options for transactions that span multiple sites, along with other options it supports. And this is a very old idea; many databases support it. Don't get the idea that Google invented it; Google merely uses it. The second thing is distributed concurrency control. Again, there are techniques which have been developed for this. If each site does two-phase locking, it turns out that everything works fine except for deadlock detection: locking can go ahead as usual, but you need a distributed deadlock detection mechanism. Again, these are all described in the book. If each site does its own thing, one does locking, another does snapshots, and so on, then life is more complicated, and there are alternatives which people have proposed to ensure serializability in spite of transactions spanning multiple sites. So that's a brief reply. I will stop here on questions on concurrency control.
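For a flavor of the two-phase commit mentioned above, here is a minimal coordinator sketch (a hypothetical interface; a real implementation also logs every step and handles timeouts and site failures):

```java
import java.util.List;

// Two-phase commit, stripped to its essence.
interface Participant {
    boolean prepare();   // phase 1: vote yes only if this site can commit
    void commit();       // phase 2a: apply the transaction permanently
    void abort();        // phase 2b: undo any work done for the transaction
}

class Coordinator {
    static void runCommit(List<Participant> sites) {
        // Phase 1: collect votes; stop early on the first "no".
        boolean allYes = true;
        for (Participant p : sites) {
            if (!p.prepare()) { allYes = false; break; }
        }
        // Phase 2: everyone commits only if everyone voted yes;
        // otherwise everyone aborts (an abort sent to a site that was
        // never asked to prepare is harmless).
        if (allYes) sites.forEach(Participant::commit);
        else        sites.forEach(Participant::abort);
    }
}
```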