So, the database system must provide a mechanism to ensure that all possible schedules, the order in which things actually happen in the database, are serializable. And as we saw, a policy which executes only one transaction at a time generates serial schedules, but provides a poor degree of concurrency, poor utilization. Therefore, we run things concurrently and use concurrency control protocols which ensure serializability. Now, there are actually many different concurrency control protocols. A lot of intelligent people have figured out many different ways of achieving this goal. They provide different tradeoffs between the level of concurrency, the amount of overhead and so forth. Again, we will not look at all of them. I will just mention the key idea of the one which is the most widely used, but others are also used in practice; this is not the only game in town. So, locking. Again, these are words which you will see very often when you use databases. So, what is locking? Basically, before you access an item, a piece of memory, a part of the database, you get a lock on it. There is a component called the lock manager which you ask: give me a lock. "You" here does not mean you the programmer; the database is doing this internally. So, when you read a database item, the database internally tells the lock manager: please lock the data item which the transaction wants to read. Your SQL query reads a bunch of data items, and internally the database gets locks on all those data items. And there are two types of locks: read locks and write locks. Now, the lock manager does the following. Once a transaction has been granted a write lock on an item, then until it releases that lock, no other transaction can get either a read lock or a write lock on the item.
And if a transaction has read locked an item, then no other transaction can write lock it. However, another transaction can read lock it. If I am reading, it is fine if you read the same item; that does not conflict with me. But if I am reading, I should not allow you to write at the same time, and if I am writing, I should not allow you to read or write. That is the idea of locking. Now, the lock manager supports locks, but internally the code implementing the SQL query has to do the following. It has to read lock any item before reading it, and before writing an item it must write lock it. And these locks must be kept until the transaction completes, that is, until it either commits or aborts. At that point you can release the locks and allow somebody else to lock the same items. So, two things can run concurrently, but they are not both allowed to update the same item. They can run concurrently as long as they update different items. If the second transaction wants to update an item which the first has already read or written, then it is not allowed to proceed; the lock manager tells it to wait. Only after the first one finishes can the second one continue. Is the lock on the entire tuple, or can we lock individual attributes? That is a good question: what is the unit of locking? That depends on the database system. You do not know, and generally you do not care. Like I said, as a programmer you are not doing any locking. Typically the database system will lock an entire tuple, or it may even lock not just the tuple but the disk page on which the tuple resides; it locks the whole page. In fact, some of the old database systems would lock the entire relation. Of course, then the concurrency goes down sharply. If you lock a tuple, somebody else can lock another tuple, so you get a lot of concurrency. If you lock a page, there may be 100 tuples on the page, and anyone who wants to read one of those 100 tuples will have to wait.
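The read/write lock compatibility rules just described can be sketched as a toy lock manager. This is only an illustration of the rules, not how a real database engine implements them; the class and method names are invented.

```python
# Toy lock manager illustrating shared (read) / exclusive (write)
# lock compatibility. All names here are invented for illustration.

class LockManager:
    def __init__(self):
        # item -> {"mode": "S" or "X", "holders": set of transaction ids}
        self.table = {}

    def acquire(self, txn, item, mode):
        """Try to grant txn a lock on item in the given mode ("S" or "X").
        Returns True if granted, False if the transaction must wait."""
        entry = self.table.get(item)
        if entry is None:
            self.table[item] = {"mode": mode, "holders": {txn}}
            return True
        if txn in entry["holders"] and entry["mode"] == "X":
            return True  # already holds the stronger (write) lock
        # Read locks are compatible with other read locks only.
        if mode == "S" and entry["mode"] == "S":
            entry["holders"].add(txn)
            return True
        return False  # conflict: the lock manager says "wait"

    def release_all(self, txn):
        """Release txn's locks, done only at commit/abort."""
        for item in list(self.table):
            entry = self.table[item]
            entry["holders"].discard(txn)
            if not entry["holders"]:
                del self.table[item]

lm = LockManager()
print(lm.acquire("T1", "acct_101", "S"))  # True: first reader
print(lm.acquire("T2", "acct_101", "S"))  # True: readers do not conflict
print(lm.acquire("T3", "acct_101", "X"))  # False: writer must wait
lm.release_all("T1"); lm.release_all("T2")
print(lm.acquire("T3", "acct_101", "X"))  # True: all readers released
```

Note that `release_all` is only called at transaction end, matching the rule that locks are held until commit or abort.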
But if they want to read some other tuple on some other page, they can go ahead. So, there is a tradeoff between the level of concurrency and the overhead of locking: the lock manager has to keep track of locks, so there are overheads. Database systems trade these off. They have some very nice techniques to deal with this; in fact, they can decide dynamically whether to lock a tuple or a page and so forth. Is it always the database's choice, or can the application choose? Some database systems do let the user do it. You can write SQL, not standard SQL, but PostgreSQL or Oracle and so on let you specifically say: lock this table, or lock this row. But you do not really need it, and you should not mess with it unless you really understand what is going on. So, you generally do not do anything, and things just happen quietly underneath for you. Sir, what is the use of locking a page? If you lock a page, that means you lock all the rows in that page. But what is the concept of a page? Why a page? Let me explain what a page is. Conceptually, a page contains multiple records. Each record is inside one page, and a page can have multiple records. So, it is about how data is physically stored on the disk. The disk logically consists of many pages; a page is the unit of transfer as far as the disk is concerned. And multiple records can be in a page. So, if you lock the page, that means you have locked all the records in that page. Nobody else can get a conflicting lock on that page. That means, if you have write locked a page, nobody else can get a read or write lock on the page. They cannot read or write any tuple in that page, any record in that page. I am using the words row, record and tuple interchangeably; they all mean the same.
Sir, the thing is, you said some old databases lock a relation; for example, in COBOL you would lock the entire file, or take a record level lock. But a page is a set of unrelated tuples, so locking it does not really seem to make sense. Okay. So, in COBOL, what kinds of locking do you normally do? You use file locking, right? And you have record locking. There the programmer has to deal with it. In a relational database, you, the programmer, typically don't have to care about it at all; whether to get a record lock, a page lock or a file lock is all up to the database. So, what sense does it make to get a page lock, is your question, right? Like I said, it is something in between a record lock and a file lock. There are trade-offs. It gives you more concurrency than a file lock but less concurrency than a record lock. On the other hand, the locking overheads, the space required by the lock manager and so forth, are less than for record locks but more than for a file lock. So, it's somewhere in between these two extremes, and it is useful in certain scenarios. Okay, good question. So, the question is: if there is a foreign key dependency or something similar, will a lock on one thing result in a lock on the other? The answer is no, nothing happens. If you lock a row, you've locked just that one row. Now, if somebody goes and concurrently updates related data, the database system will still figure out that there is a conflict. But my transaction is already over? No, no. What happens is this. Let's say you have a foreign key from A to B, because A has a foreign key referencing B. This means that A should make sure that the corresponding record in B does not vanish.
Or equivalently, B should make sure that it does not vanish if there is a record in A referencing it. Okay? So, this is done in a different way. If these two transactions just read the two records, there is no issue. The only problem comes when somebody is trying to delete a record in B, which might affect a record in A which references it. So, what happens at that point is, instead of just deleting the record in B, the database will check: is there a record in A which references B? And what you have done in effect is that this check will get a read lock on A. You did not, in your transaction, explicitly touch A; you just said, delete this record from B. But internally, the database will look at A to verify that it is safe. In the process, it will get a read lock on the respective parts of A. And thereby, if there is a conflict, if somebody else is doing something over there, one of them will have to wait; one will happen first, then the other. Yes, if there is a conflict like this, it will be detected at this point. But suppose in my policy table there is one agency, and I am updating the agency in one table which has a foreign key relationship with another table B. Since it is an agency, it is a foreign key here and a primary key there. If you just update the details of the agency, there is no issue. The problem comes when you try to delete the agency; that is where this problem will arise, when the other record is not related over there anymore. And will the database see any inconsistency between the attributes of the other tables, even though it is a primary key elsewhere? Correct. Now, here the database will guarantee it.
Once you have declared a foreign key, the database guarantees that it holds, regardless of concurrent executions. What I told you was what happens internally. You as a programmer don't have to worry about that; you can rest assured that the constraint will be maintained in spite of concurrent execution. In fact, this is a big, big advantage of using a database over using files in COBOL, where you always have to worry about what to lock and what the effect of locking is. You have to, in effect, worry about serializability yourself. How many of you knew about all these concepts of serializability beforehand? You are using it. How many of you already knew these ideas of serializability? Please raise your hand. Let's take a quick poll. How many of you did locking in practice? How many of you have written programs which locked files? Please raise your hand if you have written any program which locked a file. Locks on files, or any other kind of lock you have applied. In COBOL. I am not talking about databases. Almost all of you have done it, without knowing formally what serializability is. I hope this helps you in understanding what is going on in COBOL, because you are actually implementing serializability on your own. In a database system, you don't have to worry about it; the database takes care of it. So that's a big win for the programmer: one less headache. We have had problems in our old system in IIT. Long back, we had a FoxPro system where the same kind of thing was there as in COBOL: you had to lock the file when you did updates. I believe there was a bug in one of those programs which led to problems with people returning books; I think a book could be returned and the return would not be recorded, or the other way around. One of the two would happen. So there were race conditions which led to consistency problems. It was rare, but it would happen once in a while. So that is a big risk which is avoided here, typically.
However, let me say on the next slide that there is something called truth in advertising. So the truth in advertising is coming up now: not all databases tell the truth. Some lie, and occasionally they can get you in trouble. So let us see what goes on. Serializability is all great, but it does mean that concurrency goes down. And many times people are willing to take a hit on serializability just in order to get better performance. Take some very well-motivated examples. Suppose you want to get an idea of how much total balance this bank has: you want to sum up the balances on all the accounts. You don't care if it is exact. Maybe there are one or two transactions going on which you will double count or not count, but the final result may be close enough. Typically a bank is not going to transfer 100 crores at a time between accounts. But if you add up the balances across the accounts, it may come to thousands of crores. So you may have an error of maybe 1 crore in 10,000 crores, which will be perfectly acceptable because you only wanted an approximate idea. So there you are willing to not actually read lock each item for the entire duration of the transaction. Note that normally transactions get a read lock and release it only at the end. Here you may get a read lock, read the item and immediately release the lock. That is not going to guarantee serializability; you might see inconsistent things. However, that may be acceptable for you. That is one kind of example. Another example, which we will see tomorrow, is query optimization, something the database does. It needs to have some statistics about relations: a rough idea of how many records there are, how many different branches there are and so forth. Those statistics can be approximate; they don't have to be exact. So you may be willing to have a weaker level of consistency for those. So SQL actually allows you to specify the level of consistency that you want for a transaction.
So the default is serializable, which we saw. Let me make it clear: it is not that every database uses serializable as its default; in the SQL standard, the default is serializable. There is a level of consistency which is slightly lower, called repeatable read. What this means is: you cannot see a record which has been written by a not yet committed transaction, so only committed records can be read; and if you read the same record multiple times, you will get the same value. However, it is still possible that transactions are not serializable. Why? A transaction may find some records inserted by another transaction but not find others. So you have a transaction which is inserting 10 records, and another transaction which may see 5 of those records and not the other 5. It is not quite serializable; it is running, in effect, in the middle of that transaction. But some applications may be willing to live with repeatable read. There is an even lower level, which is just read committed. It says you can only read committed records: you can only read a value if the transaction that wrote it has committed. However, if you read the same record multiple times, you may see different values. You read it once and got a value; meanwhile somebody else came and updated it; if you read it again, you will get another value. So that is not even repeatable read. It is called read committed, and it gives even lower guarantees about correctness. Serializability guarantees correctness; these levels do not. It is up to you to figure out whether the correctness level is good enough for your application. And the lowest level is read uncommitted: read anything, just read whatever is currently there in the database, whether the value was written by a transaction which has committed or one which is still active and may yet abort; read it anyway.
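The non-repeatable read just described, which read committed permits and repeatable read forbids, can be shown with a tiny sequential simulation. The "database" is just a dict and the account name and balances are invented; this illustrates the schedule, not a real engine.

```python
# Sequential simulation of a non-repeatable read: T1 reads the same
# record twice, and T2's committed update lands in between. Under
# READ COMMITTED this is allowed; REPEATABLE READ would forbid it.

db = {"acct_101": 500}   # toy "database": one account record

# T1's first read of the balance.
first_read = db["acct_101"]

# T2 updates the same record and commits, between T1's two reads.
db["acct_101"] = 300

# T1 reads the same record again and sees a different value.
second_read = db["acct_101"]

print(first_read, second_read)  # 500 300 -- the read did not repeat
```

Under serializable or repeatable read, T2's write lock request would instead have blocked until T1 finished, so both reads would return 500.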
So in theory you should use these lower levels only if you know what you are doing, and the default should be serializable. However, serializability has a price in terms of performance, and database systems are often sold based on performance. They will tell you: hey, my database system can handle 300,000 transactions per hour; my competitor can handle only 150,000 on the same hardware, so buy me. But you may be able to do this only by compromising on one of these things. So many database systems cheat, for example Oracle and PostgreSQL. By default they use a degree of consistency called snapshot isolation, which is not even one of these; it is not part of the SQL standard. However, that is what they support by default. To make things worse, even if you tell Oracle to run in serializable mode (there is a way to tell it to do that), it will say: fine, I will run things in serializable mode, and then go ahead and run them in snapshot isolation mode, which is not really serializable. In fact, in IIT we found problems due to this. There was some somewhat non-standard logic which resulted in some numbers being allocated in a certain way, and two different bills were not supposed to get the same voucher number. But because of the way the logic was written, it happened. The logic was perfectly fine in serializable mode, but Oracle went ahead and ran it at snapshot isolation, and now two different bills could get the same voucher number. It happened rarely, but it did happen. And in fact our financial auditors caught it. They told us: we found this place where you have two different bills with the same voucher number; your software is buggy. It is not our software; it was developed by TCS. We contacted TCS and told them, look, there is a bug. They were completely clueless; they did not know what was going on. So we sat down and looked at the code and finally figured out that it was because of this.
And their programmers were completely clueless about all this. They had never heard of snapshot isolation. In fact, I think the current programmers may not even have heard of serializability. Maybe the people who coded it originally knew about serializability but didn't realize that Oracle was running in snapshot isolation mode. That's why the trouble happened. So in practice, other databases will often run in repeatable read mode by default, or even read committed. If you do nothing, you just run the database out of the box, begin a transaction and run updates, it may actually run not at serializable but at repeatable read or read committed. Then you have to use, I don't have the syntax on the slide, but in SQL you can say SET TRANSACTION ISOLATION LEVEL SERIALIZABLE: start a transaction, set the isolation level to serializable and then run the rest of the statements. Then you're safe. Or you can edit the configuration file of the database and set the default to serializable. Either way you can guarantee that things are safe; it's in your hands for other databases. But for these two, there's nothing you can do directly. Well, there are ways to work around it; we have done some research on that. It's not so simple, but it's possible to work around this flaw and still guarantee serializability. You have to do a lot more work; in effect you do locking to avoid this. So the last topic is replication. But before I come to replication, this is a good time, since all of you have dealt with controlling concurrency by using locking and are somewhat familiar with this topic, so I'm sure you have questions. Please ask any doubts you have. Looking forward to not worrying about locking? It doesn't matter; it's easy enough. So, is the main difference between repeatable read and read committed that in the first one we expect the same result if we read the record again? Yes, if you read it again.
But then what is that qualifying clause: "however, a transaction may not be serializable"? That is the difference between the serializable level and the repeatable read level, and it explains why repeatable read is not the same as serializable. The explanation is by example. You may have a transaction which is inserting five records, and another transaction which is running at repeatable read. If the second one reads the same record twice, it will find the same value for that record. But it may see two of the records inserted by the first transaction and not the other three. For those two records, if it reads them again, it will see the same values. The other ones it may not see at all, forget seeing the same value; it doesn't even know those records exist. So this is an example of why, even though you read only committed records, and even though you are guaranteed to read the same value however many times you read a record, in spite of all this guarantee it is still not the same as serializable. Is the issue only with inserts? It is actually not only with inserts. There is also an issue, which I have not described in these slides, with what is called a predicate read. If I want to read all the records that satisfy some predicate, somebody may do an update such that the set of records which satisfy the predicate changes in the middle. With repeatable read, that is not prevented. What this means is: I read all the records at the Powai branch, and I have locked all those records. Now somebody else can come and update the branch field of a Matunga record and change it to Powai. I don't know about that, but actually there is a conflict. Because if that person commits before me, is serialized before me, then I should have read that record: that Matunga record which became Powai, I should have read it, because that transaction committed before me, is serialized before me.
But with repeatable read, I will not see it. With serializable, it is guaranteed that if that transaction tries to update, it will not be allowed; it will have to wait. If I have issued a SQL statement that says, read all records at Powai, and somebody tries to update a record and set its branch to Powai, it won't be allowed to continue; it will block until this transaction commits. That's called predicate locking. What we looked at so far was locks on tuples and files; this is a different kind of thing, which locks predicates. This is a problem which would also arise in COBOL if you use row level locking. Just think of the same thing: you lock all the Powai records, and somebody comes and updates a Matunga record to Powai. You are not going to see it. Suppose that Matunga-to-Powai record is the eighth record; with repeatable read, we are only seeing seven Powai records. Even after that Matunga record is updated to Powai, it will keep showing only the seven records. You will think there are only seven records, because you went through all the records and found all those with Powai, you found those seven and are happy with those seven. And then you are still running, and you are serialized after that transaction, but meanwhile one of Matunga's records became a Powai record. Any other questions? So concurrency control is really an interesting topic; there are a lot of details which are not obvious up front. If I am reading a record from a particular table, is the other person also allowed to read? If you have a read lock on the record, somebody else can also read lock the same record, no problem. So reading is allowed. But if you write lock the record, then nobody else can get a read lock; they cannot even read it. Or if somebody has a read lock and you ask for a write lock, you have to wait until all the read locks are released before you can get the write lock. And only after you get the write lock can you proceed to write.
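The Powai/Matunga scenario above can be sketched as a sequential simulation of a phantom under a predicate read. The table contents are invented; the point is that record-level locks on the rows already found do not protect the predicate itself.

```python
# Sequential simulation of the phantom problem: T1 evaluates the
# predicate "all Powai accounts" twice, and T2 moves a record from
# Matunga to Powai in between. Data is invented for illustration.

accounts = [
    {"id": 1, "branch": "Powai"},
    {"id": 2, "branch": "Powai"},
    {"id": 3, "branch": "Matunga"},
]

def powai_ids(rows):
    """T1's predicate read: ids of all records at the Powai branch."""
    return sorted(r["id"] for r in rows if r["branch"] == "Powai")

before = powai_ids(accounts)      # T1's first predicate read
accounts[2]["branch"] = "Powai"   # T2 commits: Matunga record -> Powai
after = powai_ids(accounts)       # same predicate, different answer

print(before, after)  # [1, 2] [1, 2, 3] -- a phantom appeared
```

Read locks on records 1 and 2 never conflict with T2's update of record 3; only a predicate lock on "branch = Powai" would have blocked it.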
So if I have obtained a read lock, the other person cannot even write? He cannot get a write lock; he has to wait, and will finally be granted the write lock only after all the read locks are released. Only then can he do the write. So even a normal select statement will obtain a read lock for me? Yes, select statements obtain read locks. And in fact, if the implementation is done properly, as is the case not with Oracle but with IBM DB2 and SQL Server, the predicate locks are also obtained, so that the system gets enough locks and you are safe. If you set the mode to serializable, it will be truly serializable. What is the exact behavior of this snapshot isolation in Oracle and PostgreSQL? We were told how read committed behaves; what about this? It is a little harder to describe; I did not try to do it in one slide. If you are curious, it is not there in the current edition of the book. However, if you go to the book slides (the URL is there in the book, db-book.com) and look at chapter 16 on concurrency control, at the end of that chapter there are several slides on snapshot isolation. You can read up more on that if you are interested. Sir, when we are executing a select statement, I think it generally reads from the buffer if the record is available there, not directly from the disk; it first reads from the disk into memory and then returns the value. So is the lock taken on the record present in the buffer or the one on the disk? Good question. The answer is: both. The locking is on the logical item; whether it is in memory or on disk is irrelevant. Logically, you have locked the record. In fact, the lock manager has no clue whether the particular record is in memory or on disk. You give it a record identifier and say, lock this record identifier. That is all it knows. It has no clue where the record is. It can lock a page or a record identifier, whatever.
It does not know anything more about it. So wherever you read it from, once you have the lock on that item, it is fine. Of course, if somebody has updated the data item in memory and has not yet written it to disk, but has released the lock, you cannot just go to disk and read the old value; the database system takes care of that. If the copy in memory is the latest, it will read the latest version; it will not read an older version from disk. Now, here is something which is very commonly used, especially when transactional consistency is not as critical. For example, you will have frequent changes in policy rules: what policies are available, what the rate for a certain policy is, or for a bank, interest rates, currency conversion rates. All of these change, but they are not exactly transactional changes. Yesterday maybe you did it by sending a telegraph or something to branch offices; today you may be doing it over the network. Before core banking, banks would probably get a daily update saying: today's exchange rate is this, today's interest rate is this for this category of deposits. Now, this kind of thing you might want to put in the database and make available to remote databases. As and when you update, it goes to those databases; it does not have to go immediately as part of the same transaction. You do the update locally, and within a short period of time it is propagated. And how is the delay dealt with? Well, you are not generally going to update today's exchange rate; you have already said what the exchange rate for today is. Your next message will say what tomorrow's exchange rate is. At the end of today, you will send tomorrow's exchange rate, and it will arrive before any processing is done at the beginning of tomorrow. So, here you do not have to worry about a single transaction updating all the remote copies of the database.
You do not worry about that. You just update one copy, and all the other copies are replicas of it, which will be updated soon, but not necessarily immediately. So, that is the idea of replication with weak consistency. There is no serializability guarantee, but the update will propagate fairly soon. And the most popular scheme is master-slave replication, where all the updates are performed at a single master site and propagated to the slave sites. As I said, the propagation is not part of the original transaction. The transaction runs at the master site locally, commits and goes away, and then the update is sent to the slave sites. It could be immediately after the commit, or you might have a policy which sends it every hour, or once a day at night, or whatever you want. And at a slave site, data can only be read; it cannot be updated. So you do not have to obtain locks for your own updates there. At the slave site, you are not writing anything; you only read. The only one who writes is the replicator bringing updates from the master site. Actually, it is not quite right to say there is no need to obtain locks; rather, the only locking required is to deal with the updates coming from the master site. So, this is widely used for distributing information from a central office to branch offices. And since LIC is planning to have multiple databases at branch offices, you will almost surely need to do this. The way you do this is (I am not showing it here) there are systems which support replication, and depending on the database you use, you will use an appropriate tool. For PostgreSQL, for example, there is a tool called Slony. Oracle, I think, has some built-in facilities for replication which ship with Oracle. Similarly, SQL Server and the others have built-in facilities. The way they work is, you set up a subscription: the slave sites subscribe to something from the master site. They subscribe to a relation, or maybe a relation with a selection on it.
Not the whole relation, but some subset of the relation perhaps. Then, whenever an update happens at the master site, the database there knows who has subscribed to those relations, and if the update is relevant to a subscriber, that is, if it satisfies the selection, the update is sent there. And it is sent reliably: even if the slave site is down when the update happens, it doesn't matter. When the slave site comes back up and gets in touch with the master again, all the pending updates will be sent. So, in this mode the slave is guaranteed to get the update, not immediately, but after some time, once the slave is back. This is very widely used, so I would recommend that you use it. If you are doing, say, daily replication, this mode of replication, which is supported by Oracle and others, works very well. In Oracle you can create a transaction consistent snapshot of the database, which means a state of the database reflecting the effects of all transactions up to some point. There is a serialization order; everything up to some point is reflected, and anything after that is not reflected in the snapshot. So, you can create such a snapshot. A physical copy is not actually made; it is a logical snapshot which you create. And now you can replicate the snapshot somewhere else, and that site will see a consistent snapshot and can run on it. So, you send a snapshot of these relations at the end of the day, and the slave site can work on that snapshot for the next day and see a transaction consistent state. As long as they are only reading, the queries are read only, they are all serializable: serializable as of yesterday. They are running today, but with respect to the master at the main office, they are serializable as if they were run at the end of yesterday, or whenever the snapshot was taken.
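The subscribe-and-propagate mechanism described above, where a slave that was down simply catches up on pending updates later, could be sketched roughly like this. This is a toy in-memory model, not how Slony or Oracle's replication actually work; all class, method, and branch names are invented.

```python
# Toy master-slave propagation sketch: the master queues updates per
# subscriber, filtered by that subscriber's selection predicate. A
# slave that is down just accumulates a backlog and drains it on
# reconnect. All names are invented for illustration.

class Master:
    def __init__(self):
        self.subs = {}   # slave name -> (predicate, pending updates)

    def subscribe(self, slave, predicate):
        """Slave subscribes to a relation with a selection on it."""
        self.subs[slave] = (predicate, [])

    def update(self, row):
        """A local transaction commits; queue the update for every
        subscriber whose selection it satisfies."""
        for predicate, pending in self.subs.values():
            if predicate(row):
                pending.append(row)

    def drain(self, slave):
        """Called when the slave (re)connects: deliver and clear
        all pending updates, in order."""
        predicate, pending = self.subs[slave]
        delivered, pending[:] = list(pending), []
        return delivered

m = Master()
# A branch subscribes only to Powai rows (selection on the relation).
m.subscribe("powai_branch", lambda r: r["branch"] == "Powai")
m.update({"branch": "Powai", "rate": 7.1})
m.update({"branch": "Matunga", "rate": 7.3})   # filtered out
m.update({"branch": "Powai", "rate": 7.2})
print(len(m.drain("powai_branch")))  # 2 -- both Powai updates arrive
print(len(m.drain("powai_branch")))  # 0 -- nothing pending afterwards
```

The key property the sketch captures is the guarantee discussed above: delivery is not immediate, but no matching update is lost while the slave is unreachable.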
So, they are not serializable with respect to the current point in time, but they are still serializable. So, this is a mode which can be useful. So, first of all replication can be useful with or without this feature. And with transaction consistent snapshot, you can give somewhat better guarantee at the slave sites. So, I will stop here.