This is Chapter 15, which is on concurrency. So, the next question is: we have said so much about serializability and about what schedules are; the important question now is, how do you enforce some control on transactions so that whatever schedules get generated are serializable, or meet whatever other degree of isolation we want? The answer is that there are a number of ways of doing this. The most widely used is based on locking, so there are lock-based protocols, which we are going to see in detail. There are also protocols based on timestamps and on validation; we are not going to spend much time on these, for lack of time. Then there are many optimizations of locking, one of which is called multi-granularity locking, which is actually very useful. Again, for lack of time we will not get into it here, but it is something you should be aware of; it is up to the teacher whether or not it is included in your course. Then the next one is something called multiversion schemes. I am going to spend a little time there, because it may be new to most of you. In particular, I am going to focus on snapshot isolation, simply because even those of you who have taught this subject many times have probably not seen much about snapshot isolation, and if you deal with Oracle or PostgreSQL, that is what you get. So I am going to spend a little time on that; it will probably happen tomorrow, not today. So, that was an overview.

So, what is locking? A lock is a conceptual thing which you acquire on a data item before you do anything to it. In operating systems there is often just one kind of lock, the exclusive lock: while you hold the lock, nobody else can access the item; when you release it, somebody else can. In database systems it is very useful to have two kinds of locks, exclusive and shared. Exclusive means that while you hold the lock on a data item, nobody else can get any lock on it. Shared means that while you hold it, somebody else can also get a shared lock, but nobody can get an exclusive lock; so it allows sharing. Exclusive locks correspond to write operations, and shared locks correspond to read operations. When you read, you do not care if somebody else also reads; they can also get the shared lock. But if somebody is trying to write at the same time that you are trying to read, bad things can happen. So, to write you get an exclusive lock, and nobody else can be reading or writing once you have it; to read you get a shared lock, which allows other readers but no writers.

All of these lock requests are typically made by the SQL engine, because that is the layer which knows which data items are fetched. The SQL engine makes calls to the concurrency-control manager saying: lock this item in X mode, or lock this item in S mode. If the lock cannot be granted, execution waits until it can be. In certain cases the lock cannot be granted at all, and the engine is told: sorry, roll back your transaction.

So, there is the notion of a lock compatibility matrix, which is very important. With just these two lock modes, S and X, the compatibility is as follows. S with S is true: if one transaction holds an S lock and another requests an S lock, they are compatible, so the request is granted. On the other hand, S and X are not compatible. What does this mean?
It means that if somebody holds the S lock and another transaction requests the X lock, the request cannot be granted until the S lock is released. It is also symmetric: if somebody holds the X lock and a new transaction asks for the S lock, that cannot be granted until the X lock is released. Either way, compatibility is symmetric: S with X is false, and X with X is also false. In fact, one half of the matrix is redundant, because it is the mirror image of the other half; whether you read it as S-X or X-S does not matter. This compatibility matrix is used by the concurrency-control manager to decide when a lock can be granted. With only S and X it seems obvious, but real-life concurrency-control managers have other lock modes: with multi-granularity locking there are what are called intention lock modes, the compatibility matrix gets more complicated, and it becomes particularly important in those cases.

So, here is an example of a transaction performing locking. Again, if you are running SQL, this is what happens below the SQL interpreter: the interpreter issues read, write and lock calls to the lower layers of the database system. What is this transaction doing? It locks A in S mode, reads A, and unlocks A; then it locks B in S mode, reads B, and unlocks B; and then it displays A + B. If you are familiar with locking from operating systems, this is how OS locking typically runs: before you access a resource you lock it, you use it, you unlock it. Unfortunately, locking like this is not very effective in a database system; it does not guarantee serializability. Why is that? Take this case, where we read A and B, and the intent is that even if some transaction transfers money from A to B, the sum of A and B should not change. Call the reading transaction T2, and look at the gap in it between unlock(A) and lock-S(B). Because things run concurrently, the engine may run the first three operations of T2 and then put T2 to sleep for a while. While T2 is sleeping, another transaction T3 comes in, transfers money from A to B, commits, and goes away. Now T2 wakes up and reads B. What has happened? It read the value of A from before T3 ran, and the value of B from after T3 ran. So the sum A + B that it displays is inconsistent; the value is wrong, and it does not correspond to any serial execution.

So, just locking items in S or X mode before you access them is not enough; that does not guarantee serializability. What is required is a set of rules about when you can release locks. You must acquire an S lock before you read and an X lock before you write, but there are additional rules which tell you when you may release locks, and those rules constitute what is called a locking protocol; that is what is required to ensure serializability. We are going to see what is called two-phase locking in a little bit, but before that, you should be aware that while locking is important for ensuring serializability, a few problems arise which you have to deal with. One of the problems is the deadlock situation. Here is an example: T3 took an X lock on B and wrote it; then T4 came along, took an S lock on A, and read it; that is fine so far. Now T4 requests an S lock on B.
So, T4 wants to read B, but it cannot get the lock, because T3 already holds an X lock on B, and T3 has not committed, so its lock has not been released. T4 has to wait for T3. At this point T3, which had been sleeping for a bit while T4 had the CPU, comes back and requests an X lock on A. Can it get the X lock on A? It cannot, because T4 already holds an S lock on A. Now what has happened? T4 cannot proceed until T3 releases its lock, and T3 cannot proceed until T4 releases its lock. So you have a deadlock: neither can proceed until the other does, which means neither can proceed. That is it; the system would halt at this point, and that is a problem. Any system which uses locking can land in a deadlock state. That does not mean you do not use locking; it means you need a way to detect deadlocks when they happen.

When a deadlock happens, luckily, in a database system we can always roll back. So what do we do? We pick one of the two transactions as a victim; say we pick T4. We kill it, that is, we abort it. When it aborts, it undoes any updates (in this case there were none), releases its locks, and is gone. Now T3 can proceed, get the lock, and complete; after that, T4 can be rerun, and this time it gets both locks and completes. So that is a good outcome, and that is how you deal with deadlocks.

Another potential problem is starvation, and there are two kinds. One kind is this: a transaction asks for an X lock while somebody else holds an S lock; fine, it has to wait. Now another transaction comes along and asks for an S lock; the lock currently held is an S lock, so the request is compatible and is granted. The first transaction, the one that wanted the X lock, is still waiting; the original S-lock holder goes away, but the newcomer now holds an S lock. Then one more reader arrives, and another, one after another, taking turns, and because somebody always holds an S lock, they never let the waiting writer in. That transaction is starving. This can happen if you do nothing about the order in which locks are granted. It turns out this particular problem is easy to solve. What do you do? If somebody is waiting in the queue for an X lock, a request that arrives later is not allowed to jump ahead of it. No queue jumping: anyone who came after the waiting transaction waits behind it; but if the transaction at the head of the queue wants an S lock and the lock currently held is an S lock, the head of the queue can proceed. So a very simple rule, never granting lock requests out of order, prevents this kind of starvation.

There is another kind of starvation which is actually trickier. Because of a deadlock, one of the two transactions is rolled back. That transaction tries again; there is another deadlock, due to some other transaction, and it is rolled back again. So this poor transaction runs a bit, is killed, comes back, runs a little, is killed again; potentially this can repeat any number of times, and the transaction starves. That is also bad; it is not fair, and you want it to have its chance too.
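Before moving on, here is a minimal Python sketch that ties together the lock compatibility matrix and deadlock detection just described: the matrix decides whether a request is granted, and a wait-for graph catches cycles. The names here (LockManager, request, and so on) are hypothetical simplifications, not any real system's API; a real manager also handles lock upgrades, request queues, and the no-queue-jumping fairness rule discussed above.

```python
# Compatibility matrix for the two basic modes.
COMPATIBLE = {
    ("S", "S"): True,    # two readers can share an item
    ("S", "X"): False,   # a holder in S blocks an X request...
    ("X", "S"): False,   # ...and a holder in X blocks an S request
    ("X", "X"): False,   # two writers never share
}

class LockManager:
    def __init__(self):
        self.holders = {}     # item -> list of (txn, mode) currently granted
        self.waits_for = {}   # txn  -> set of txns it is waiting on

    def request(self, txn, item, mode):
        blockers = {t for t, m in self.holders.get(item, [])
                    if t != txn and not COMPATIBLE[(m, mode)]}
        if not blockers:
            self.holders.setdefault(item, []).append((txn, mode))
            return "granted"
        self.waits_for.setdefault(txn, set()).update(blockers)
        if self._cycle_through(txn):
            return "deadlock: pick a victim and roll it back"
        return "wait"

    def _cycle_through(self, start):
        # Depth-first search: does any wait-for path lead back to start?
        seen, stack = set(), [start]
        while stack:
            for nxt in self.waits_for.get(stack.pop(), ()):
                if nxt == start:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

# The T3/T4 example from above:
lm = LockManager()
print(lm.request("T3", "B", "X"))   # granted
print(lm.request("T4", "A", "S"))   # granted
print(lm.request("T4", "B", "S"))   # wait: T3 holds X on B
print(lm.request("T3", "A", "X"))   # deadlock: T3 waits for T4 waits for T3
```

Running the T3/T4 sequence through it, the last request closes the cycle T3 → T4 → T3, which is exactly the situation where a victim must be chosen and rolled back.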
Coming back to that second kind of starvation: what happens in most databases is that they keep track of such internal resubmissions. For an external resubmission from the user's side they cannot know what is going on, but if the database reruns a transaction on its own, it takes care that the same transaction is not repeatedly chosen as the victim, so eventually it is allowed to proceed. So much for deadlock and starvation.

Now let us come back to the problem I showed you: one transaction was getting S locks before reading, others were getting X locks before writing, and still there was a problem. How do you avoid it? The answer is to follow what is called the two-phase locking protocol. What is it? There are two phases. There is a growing phase, in which a transaction may obtain locks but cannot release any. Then there is a shrinking phase, in which the transaction may release locks but cannot acquire any. Note that these phases are per transaction: each transaction has its own growing phase and shrinking phase. While one transaction is growing, another may be shrinking; that is not a problem. For each individual transaction, though, the moment it releases its first lock it cannot acquire any more; it can only keep releasing. That is the idea, and you can prove that this protocol, two-phase locking, guarantees serializability.

In fact, you can show that the equivalent serial order is defined by what is called the lock point. For simplicity we define the lock point as the point where the transaction acquires its last lock; after that it can only release. It turns out you can alternatively define it as the point just before the transaction releases its first lock; that is also fine, the two definitions come to the same thing. So the lock points define the serial order.

Two-phase locking is widely used, but it turns out that bare-bones two-phase locking is not enough, because cascading rollback is possible. And what about recoverability? That is also a problem: you can have a situation where a transaction releases a lock before it commits, another transaction acquires that lock, reads the data, and commits first. So even recoverability is at risk, not just cascading rollback. So, at a minimum, what you want is that if you take an exclusive lock to write an item, you cannot release that exclusive lock until you commit. This is called strict two-phase locking, and it is absolutely essential to make sure that the schedule is recoverable. Whenever a database says it implements two-phase locking, it is not the bare-bones version we saw on the previous slide; it is this, with the extra requirement that exclusive locks are held until the transaction commits.

In reality, most systems implement something even stronger: they generally do not release any locks in the middle of execution. The growing phase extends to the end of the transaction; locks keep being acquired and none are released until all the work is finished, at which point the writes are completed and all locks are released together. This is called rigorous two-phase locking: all locks, whether read or write locks, are held until commit. There is some terminological confusion in the industry: what we call rigorous two-phase locking, many people call strict two-phase locking.
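Here is a small sketch, again with hypothetical names, of how the three variants differ in when locks may be released. The rules encoded are the ones just described, from a single transaction's point of view; a real engine enforces this inside the lock manager, not in application code.

```python
class TwoPhaseTxn:
    def __init__(self, variant="basic"):     # "basic", "strict", or "rigorous"
        self.variant = variant
        self.locks = {}                      # item -> "S" or "X"
        self.shrinking = False

    def lock(self, item, mode):
        if self.shrinking:
            raise RuntimeError("2PL violation: acquire after first release")
        self.locks[item] = mode              # growing phase: acquire freely

    def unlock(self, item):
        mode = self.locks[item]
        if self.variant == "rigorous":
            raise RuntimeError("rigorous 2PL: all locks held until commit")
        if self.variant == "strict" and mode == "X":
            raise RuntimeError("strict 2PL: X locks held until commit")
        self.shrinking = True                # first release starts the shrinking phase
        del self.locks[item]

    def commit(self):
        self.locks.clear()                   # all remaining locks released together

t = TwoPhaseTxn("strict")
t.lock("A", "X"); t.lock("B", "S")
t.unlock("B")                                # fine: S locks may be released early
try:
    t.unlock("A")                            # not allowed before commit
except RuntimeError as e:
    print(e)
t.commit()
```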
Back to the terminology: different people use the same term to mean different things. In our textbook, following the original papers which introduced these terms, strict two-phase locking means exclusive locks are held until the end while shared locks may be released in a two-phase manner; rigorous is what I just described, which some people call strict.

So, I think this is a good point to break for today. I will wrap up with a quiz, followed by some announcements, followed by a question-and-answer session. First, the quiz. The question: there is a lock-X(A), then unlock(A), then lock-S(B), then unlock(B). This is just one transaction, T1; that is all that is running. The options are as follows. One: the schedule is two-phase; remember the rules about growing and shrinking, and check it. Two: the schedule is recoverable; it has to satisfy the property that nobody can read an uncommitted value and then commit first. Three: the schedule is cascade-free. There is just one transaction here; what I mean is that if other transactions came in, recoverability would still be guaranteed, and similarly that whatever T1 does, nobody else would have to roll back if T1 rolls back. And four: none of the above. So please press the button to start and be prepared; just wait a few seconds for synchronization. You should now be able to run the quiz, so please select an option. You have ten more seconds; make sure your answer is entered; and time is up.

So, let us see the answer. First of all, is the schedule two-phase? Two-phase means that once you release a lock you cannot acquire any more locks. But here T1 unlocks A and after that acquires a lock on B, so it is not two-phase. Is it recoverable? No: somebody could read A after the unlock, because the lock has been released, and that reader could commit before T1; so it is not recoverable. Is it cascade-free? No: if it is not recoverable, it is not cascade-free. Cascade-free means that not only is the schedule recoverable, but in addition, if T1 aborts, nobody else is required to abort; here it is not even recoverable. So the answer is option four, none of the above.

Let us see the statistics. This time we had a good response: 133 people answered, and the correct answer was actually chosen by the smallest number, apart from the people who pressed E by mistake, which is not even an option. A lot of you thought the schedule is recoverable; that was the main error. Most of you understood that the schedule is not two-phase, which is good; the number who said it is two-phase is less than a third. After that, I think the question was a bit confusing, so I can understand why people got confused between recoverable and cascade-free; as I explained, it is neither. So the result is better than it looks: it looks like everyone did badly, but in fact most of you avoided the obvious wrong answer and were only confused by two options which are indeed a bit confusing. So that is it for concurrency; tomorrow we will cover more about locking, snapshot isolation and so on, and then move on to recovery.
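Before the questions, here is a tiny cross-check of the quiz answer: a schedule is two-phase only if no lock is acquired after any lock has been released. T1's operations are represented as hypothetical (action, item) pairs.

```python
def is_two_phase(ops):
    released = False
    for action, _ in ops:
        if action.startswith("lock") and released:
            return False          # acquired a lock after releasing one
        if action == "unlock":
            released = True       # first release starts the shrinking phase
    return True

quiz = [("lock-X", "A"), ("unlock", "A"), ("lock-S", "B"), ("unlock", "B")]
print(is_two_phase(quiz))         # False: T1 locks B after unlocking A
```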
Now let us take the first question: while executing a view in SQL Server Express Edition, I get a query execution timeout when the number of records is large; can you tell me the solution? The Express Edition is a constrained edition, usually meant for fairly simple embedded applications, so it has constraints. Now, if query execution times out, why did it take so long? You should ask that question. One possible reason is that you defined a query which genuinely takes a long time: maybe the query was written wrongly and was generating a wrong result which was very large. So the first step is to check whether you wrote the query correctly; if the query itself generates a huge result, the optimizer cannot do anything, because that is what you specified. If you wrote the query correctly, to specify what you actually wanted, and it still takes too much time, there may be a few possible reasons. One is that the statistics may be wrong, so the plan the optimizer chose was bad. How do you deal with this? You can look at the plan that was generated, and you can make sure statistics are up to date. I think regular SQL Server computes statistics automatically; in the Express Edition I am not sure whether it does that. If all this fails, then use a different database. You may also find that building an index works in some cases.

The next question was: why have you not projected course_id from teaches before the hash join? That is a good question. I gave you a few example plans, and I did not claim they were the most efficient. In general, an optimizer avoids copying attributes unnecessarily: if an attribute is not needed later, it projects it out early. So the optimal plan would indeed do what you suggest and remove the course_id attribute if it is not used after that point.

The next question: what is a theta join? A theta join is simply a join with a join condition. Suppose I say: join r and s on the condition that r.A = s.B. That condition, r.A = s.B, is what is referred to as theta; in the theoretical database community, join conditions are usually denoted by the Greek letter theta, and correspondingly a join with a condition specified like this is called a theta join. This is to contrast it with a natural join, where the join condition is fixed: attributes with the same name must be equal.

Next question: which is the most efficient programming language for implementing query evaluation? That is a good question. All the widely used database systems I know of are based on C or C++. The only exceptions are a few systems like Apache Derby, which are not intended for really high-performance data processing but for handling small amounts of data effectively; some of these are written in Java. Java is a very nice language, but the kind of efficiency you need to build a database system is not one of its strong points, so you would not use Java, and the same problem applies to C# as well. So C or C++ is what everybody uses.

The next question, also from Amrita Bangalore: how do you use a hash join in a query? I understand this question as saying: I ran a query on the database, I ran EXPLAIN, and it said it is using an index nested loops join; how do you force it to use a hash join?
The answer depends on the database. In Oracle you can give a hint which says: use a hash join. In PostgreSQL you cannot do that, but you can turn off the other join methods for the session, with settings along the lines of set enable_nestloop = off, and then submit the query. If it was using an index nested loops join earlier, it is now forced to choose an alternative; maybe it tries a merge join, then you turn off merge join as well, and then it will try hash join.

Next question: is generating the query execution plan done dynamically or statically? What does this mean? When you submit a query, is the plan pre-computed at some point and reused, or is it computed every single time a query is submitted? This is actually a very good question. Finding a plan for a query is an expensive operation: it takes time that is exponential in the number of relations. It is not cheap; it is not terribly expensive either, it may take a few hundred milliseconds, which is not bad. But if every small query submitted to the database repeatedly takes a hundred milliseconds or a second for optimization, the database will run very slowly, and you are wasting time: the same query comes again and again, with the only difference being that some parameters change. I run a query for student 1, student 2, student 3, student 99; how does it matter which student I ran it for? It is the same query except that the student ID changes.

So what every database does today is look at the template of the query, that is, it removes the constants. Remove the student ID and you have a query template. When a query arrives for the first time, the database computes a plan for it and stores that plan. The next time a query arrives which matches the same template (template matching can be done fairly fast), a query with a different ID but otherwise the same, it reuses the pre-computed plan. This feature is known as plan caching, and it is very important; every database does this.

There is a slight risk with plan caching: the best plan may depend on the parameter. The parameter was a student ID, but let us take something else: take a database which is widely used as a benchmark, the TPC-D database. There is a certain query there which can be run for different countries; you can run the same query for the USA, for Iceland, and so forth. As you know, the US is a big country, Iceland is tiny, and India is another big country. If you run this query for Iceland, maybe the best plan is an index nested loops join. If you run it for the USA, index nested loops may be very, very slow, because there are so many customers from the US, and you should be doing a hash join; same for India. So the problem you run into is that if the cached plan is the wrong one, it can be very slow. There is a bunch of tricks to deal with this situation: there is an area called parametric query optimization which tries to pre-compute several plans, one for small countries like Iceland, one for large countries like the US, and pick an appropriate plan at run time. This was one example, but it could apply to any query. Some database systems provide a few such features.

Next question: when does the database compute and store statistics on data, and how do you tell the database to store records in sorted order? There are two questions here.
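To make the plan-caching idea concrete before answering, here is a minimal sketch in which constants are stripped out to form the template. Both template_of and optimize are hypothetical stand-ins; real systems do this through prepared statements and a shared plan cache.

```python
import re

plan_cache = {}   # query template -> previously computed plan

def template_of(sql):
    """Strip constants so queries differing only in parameters match."""
    sql = re.sub(r"'[^']*'", "?", sql)    # string literals
    return re.sub(r"\b\d+\b", "?", sql)   # numeric literals

def optimize(template):
    return f"<plan for: {template}>"      # stand-in for the (expensive) optimizer

def get_plan(sql):
    tpl = template_of(sql)
    if tpl not in plan_cache:             # first occurrence pays the
        plan_cache[tpl] = optimize(tpl)   # optimization cost once
    return plan_cache[tpl]                # later occurrences reuse the plan

get_plan("SELECT name FROM student WHERE id = 1")
get_plan("SELECT name FROM student WHERE id = 99")  # cache hit: same template
print(len(plan_cache))                              # 1
```

Note that the sketch also exhibits the risk discussed above: one plan is kept per template, so a plan chosen with an Iceland-sized parameter in mind gets reused for US-sized ones.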
On the first part: as I told you, many databases compute statistics only when you tell them to. On PostgreSQL you say analyze; in older versions of Oracle you also had to say analyze, while I think newer versions decide internally at some point to re-compute them even if you do not ask. So that part is answered.

The second part is a good question: how do you tell the database to store records in sorted order? The SQL standard has no way of specifying this, but each database has its own way. On some databases, when you create a table you can also say cluster on a particular attribute, which means the table will be kept sorted on that attribute; subsequently, if the relation has inserts or deletes, the database maintains the sort order. On other databases, like PostgreSQL, there is no way to ask it to maintain the ordering; what you can say is: take this relation, right now, and cluster it on a particular attribute. It will take the current contents, sort them, and store them in sorted order. What it does not guarantee is what happens on subsequent inserts, or updates for that matter; those will be out of order. So PostgreSQL gives you only approximate, not up-to-date, sorting. It is still very useful for queries which run selections on a particular attribute: if the relation is sorted on that attribute, all the matching records are together; if it is not, the records are scattered all over. We have seen this happen in many applications, and the difference in performance can be enormous: in one case, for one query, a thousand matching records sit together and you can fetch them all very fast; if the relation is not sorted, those thousand records are on a thousand different pages, a thousand I/Os, a huge difference. So an important tool in the bag of a database performance tuner is to tell the database to sort a relation on attributes that are useful for common queries. Another way of doing this is to create an index on that attribute which also includes the other attributes of the relation; then queries can use the index to fetch the attributes they want rather than go to the underlying relation. That is another option.

Next question: practically, what order of B+-tree is used for indexing? That is determined by the size of the database page and the size of the index entries, so it is computed on the fly. A database will typically be configured with pages of, say, 16, 32 or 64 kilobytes; that is fixed. Now, for a particular index we know the average key size. With variable-length keys, the notion that a node must have some minimum number of pointers, say 50, is not actually adhered to; instead, the rule is that a page must have at least 50 percent of its bytes used. If it so happens that all the keys in a page are very large, the number of pointers out of that page will be small; if the keys are small, the number of pointers may be larger. So with variable-length keys, the order is defined by what fraction of the page's space is used, not by how many keys or pointers the page must have. Realistically, with a 16-kilobyte page, keys are usually not that big: an 8-byte key is already fairly big, and an 8-byte key with an 8-byte record identifier makes a 16-byte entry.
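Here is that arithmetic spelled out as a quick sanity check, ignoring page headers and other per-page overheads; the sizes are the lecture's round numbers, not any particular system's.

```python
# B+-tree fanout estimate for fixed-size index entries.
page_size  = 16 * 1024        # 16 KB page
entry_size = 8 + 8            # 8-byte key + 8-byte record identifier

max_fanout = page_size // entry_size   # entries that fit when the page is full
min_fanout = max_fanout // 2           # the "at least half the bytes used" rule

print(max_fanout, min_fanout)          # 1024 and 512
```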
So with a 16-kilobyte page, you have just figured out that you can store about a thousand key-pointer pairs in one page; that is a lot. The order is very large: even a half-full page in this case gives a fanout of several hundred. So very wide fan-out is common.

Moving on: for a primary index, sorting is done on the indexed attribute; is this sorting done automatically when the index is created, or should it be specified explicitly? The answer is that you decide, when you create the relation, that it should be stored sorted, and that fixes the sort order; then you build an index. It is not that you say create a clustered index on this; you just say create an index, and if the index is in the same order as the storage order, it is a clustered index. The one exception is if you use a B+-tree file organization, in other words, you store the records in the B+-tree itself; then the relation is automatically sorted on the tree's key. So the order of the tuples is decided when you decide how the relation is stored, not by creating a B+-tree index.

One question says: selection operations are commutative, but does that matter; does it make any difference to the processing cost if the order is changed? If a selection simply reads the tuple, does a small lookup on the value, and then decides whether to pass it on, there is probably no huge impact. But here is the reason commutativity is very useful: say I have a condition on two attributes, one of which has an index and one of which does not. If the selection on the indexed attribute comes first, followed by the other one, I can use the index to fetch the relevant records and then check whether they satisfy the other condition; that could be very fast. But if it is flipped, with the selection on the non-indexed attribute first, there is no index to use and I have to scan the whole relation. That is essentially why commutativity is useful for selections. There are a few cases where the selection operation itself involves some external function which is very expensive; in that case the order of selections matters too, but I do not have time, so I will not get into it.

Then the last question I am going to take: the centralized version control systems used in industry, are they based on two-phase locking? That is a very nice question. I hope all of you have seen version control systems in use; if you have not, I strongly urge you to play around with them. On Linux there is a very simple one called RCS, and there is another called CVS, and so forth. These are used to control updates to code when multiple programmers are working on a project: you can lock a particular file, update it, check it in, and so forth. So, is this based on two-phase locking? Two-phase locking does not really make much sense here, although the typical cycle, where you lock some files, check them out, update them, and check them back in, does look like two-phase locking. You could think of it that way, but in reality people do not unlock everything at once: they lock a file, update it, unlock it, lock another file, in whatever order they like. It is under human control, so you have no idea whether the sequence is two-phase or not; the human decides. So then, what is the notion of consistency in a centralized version control repository like this?
The answer is that there is no notion corresponding to serializability in databases. There are other, more fundamental problems: I used a certain function which was defined somewhere; I had read it and assumed something about it; meanwhile you updated it, and when I compile my code against your function I am in trouble. So there are a lot of issues here which I do not have time to get into. People have looked at version control systems from a database concurrency-control perspective, but the issues are a little more complicated; I will just leave it at that. So, we are well behind the morning schedule; let us break here.