playing at Goldmark in Lawrenceville. I'll send you some information on Wednesday, so if you're free, come down. It's gonna be 11 to 1 at night. And you said you have a bouncer that's gonna ignore the IDs, or you have fake IDs that you're gonna hand out on the side as a business? All of that information I cannot say in this class. Please don't do that. Please don't do that. Okay, great. Excellent. And we'll have a trip report if some of you guys make it to that.

All right. Sorry about taking a little bit of time to get started; I had a bunch of people at office hours. We're going to talk about concurrency control today, so let's jump right into it. You've seen this diagram of what a database engine looks like as a modular architecture with different components. Today we're going to talk about concurrency control mechanisms, which cut across these different layers, largely the operator execution and access method layers. Two classes from now, we'll talk about recovery methods, which sit at the buffer pool and disk manager layers. All of this will become a lot clearer as we get into what transaction management and recovery actually mean, so let's go straight to that.

Transaction management allows us to do things safely in a database system when multiple things are getting updated all at the same time. Let's start with a simple example. Here is a schematic of database operations; these could be fired through a SQL query. From the perspective of the transaction management component, we are largely going to look at the operations that happen at the object level. So imagine we fired up a transaction that reads an object; it could be a record, it could be a page.
We're just going to refer to them as objects in this class. The transaction does some checking of that value, does some action like pay, updates A by subtracting 25 from it, and eventually writes it back. For the purposes of transaction management, we are largely concerned with the read and write operations on each of these database objects. You can think of them as records, pages, tables, whatever; it won't matter. Just assume that reads and writes of objects in the database are what we care about, and we want certain properties of those reads and writes to ensure notions of correctness, as we'll define it. That's what transaction management does.

So what could go wrong? This is a simple set of actions presented to the database in response to a query an application might have sent, and a couple of things could go wrong. On the right side, you see the bank balance starts out with a value of 100. The first thing that happens is the check, and everything looks okay: the check passes, so there's sufficient balance to pay $25. This is you going to an ATM asking for a withdrawal; it pays you the $25, the new balance is calculated, and that balance gets written to your bank account. That last action, writing the balance of 75, is the write operation you see in the code on the left side. But what if, after the ATM machine spit out the $25, you went and yanked the ATM machine's power cord so that it could not send the write back? Imagine you were that fast, or there's a natural power failure. The paying action and the write to the database are not atomic, right? They're happening in different systems; they might even be geographically spread apart, and things can happen in between those actions. So how do we make sure this transaction is correct?
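As a concrete picture of this read/check/write structure, here is a minimal sketch in Python. The dict standing in for the database and the `withdraw` helper are invented for illustration; they are not any real system's API.

```python
# A toy "database": objects (records, pages, ...) keyed by name.
db = {"A": 100}

def withdraw(db, amount):
    """One transaction: READ(A), check the balance, pay, WRITE(A - amount)."""
    balance = db["A"]            # READ(A)
    if balance < amount:         # the check (invisible to the txn manager,
        return False             # which only sees the read and the write)
    # ... physical action happens here: the ATM pays out `amount` ...
    db["A"] = balance - amount   # WRITE(A)
    return True

withdraw(db, 25)
print(db["A"])  # 75
```

The transaction manager only sees the READ(A) and WRITE(A) lines; the check and the arithmetic in between are the application's business.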
Obviously we don't want that. The bank would be very upset if that's how database systems worked, because they would be losing money every time this happens. So we want this type of stuff to be protected, and we'll talk about how.

Here's a different scenario showing a different type of problem we also need to worry about. This is not a power failure or anything of that sort, but concurrent actions. Imagine you and your significant other share a bank account, and you have individual debit cards that let you debit from that same account. You both go to two separate ATMs simultaneously and run this transaction. Both of you run this sequence of actions to withdraw $25, and sufficient balance is checked on both sides. You get paid, the ATM spits out $25, your significant other gets paid; but because both transactions read the original bank balance, both of them calculated the new balance as 75, right? Everything looks correct, and then the first one writes, the second one overwrites it, and now the bank is short $25. Again, not an ideal situation for the banks. Banks, and enterprises in general, would never use database systems if these types of things were allowed to happen.

Semantically, the application was sending these read and write requests, written by some application code that fires SQL queries at the database system. Through no fault of the application code writer, who is effectively just writing code that ends up sending those read and write operations, it feels like the database system is corrupting the data. That's obviously not what we want of a database system; we want it to be well behaved. We want the final balance in this case to be $50.
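The interleaving described above can be replayed step by step. This hypothetical Python sketch hard-codes the bad schedule, both reads happening before either write, to show the lost update; the variable names are made up for the example.

```python
# Simulate the "lost update": both transactions READ before either WRITES.
db = {"A": 100}

t1_balance = db["A"]   # T1 (you):           READ(A) -> 100, check passes
t2_balance = db["A"]   # T2 (partner):       READ(A) -> 100, check passes

# Both ATMs pay out $25, then both write back what they computed.
db["A"] = t1_balance - 25   # T1: WRITE(A) -> 75
db["A"] = t2_balance - 25   # T2: WRITE(A) -> 75, silently overwriting T1

print(db["A"])  # 75 -- $50 was paid out, so the bank is short $25;
                # the correct final balance is 50
```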
So how do we deal with these bad system behaviors and keep the database coherent? Ultimately the database is the keeper of the records, the master copy of what you have in the bank account, and it has to stay consistent in spite of all these different failure scenarios. One simple way, assuming you don't have power failures, to handle the second problem, where concurrent actions corrupt each other, is to say: I'm going to have a very simple database system. Even if two people submit queries at the same time, I will only run one query at a time. I'll queue everything up, all the requests I get, run one query at a time, and I get sensible, correct behavior for the scenario we just talked about. I can also do things like: before a transaction starts, I make a copy of the entire database and make all the changes there, and if the transaction completes successfully, no power failure, none of that other stuff, I overwrite the database with the new copy. Both of these will give you some form of correctness, but they're really bad from a performance perspective. It would be a very, very slow database system, and you won't really be able to use it, right?
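The one-query-at-a-time idea can be sketched with a single global lock. The names `txn_lock` and `withdraw` are made up, and a real system would use a request queue rather than a thread lock, but the effect is the same: whole transactions are serialized, so the lost update cannot happen.

```python
import threading

db = {"A": 100}
txn_lock = threading.Lock()   # the "queue": only one transaction runs at a time

def withdraw(amount):
    # The whole read-check-write sequence runs while holding the lock,
    # so two transactions can never interleave their reads and writes.
    with txn_lock:
        balance = db["A"]                # READ(A)
        if balance >= amount:
            db["A"] = balance - amount   # WRITE(A)

t1 = threading.Thread(target=withdraw, args=(25,))
t2 = threading.Thread(target=withdraw, args=(25,))
t1.start(); t2.start()
t1.join(); t2.join()
print(db["A"])  # 50 -- correct, but we gave up all concurrency
```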
Today, when you go to Amazon and check out your shopping cart, something is ultimately issuing database transactions to add records to your shopping cart; eventually another transaction gets sent to your credit card account to record the amount that needs to be debited. And all of that works correctly even if multiple account holders on the same credit card are doing transactions at the same time and multiple shopping carts are getting created. So we really want to be able to do this at scale. You don't want to do one transaction at a time; you want to be doing thousands, tens of thousands, if not millions or billions of transactions at a time. So how do we do that? That's a really hard problem. We want better utilization of the database hardware; we want higher throughput and better response times; and of course we want it to be correct, so that the right thing happens, and we'll define what that notion of correctness is. In many cases you also want it to be fair: if multiple transactions come in at the same time, you want all of them to be given an equal chance to complete, and not just say, I'm going to hold two of you in the back forever and let the others go forward, right?
That's not fair; you want some notion of fairness. Some systems may not have that; sometimes you have priorities, where you say I want high-priority transactions to go through first. But assuming you want fairness, you at least want the capability of fairness. How you balance that, in terms of partitioning your workload to let higher-priority stuff go first, is a different issue, but you do want the mechanism of fairness built into the database system. Okay, questions so far on what we're trying to achieve?

Yep. We will try to solve all of these issues except the power failure issue, and the power failure issue will be solved too; we'll get to that solution later on. There's a very subtle issue in the power failure case: with all the stuff we'll talk about in the next few lectures, we can make sure that the database record ends up in the right state. But if the bank has paid you $25 and there's a power failure, extra action is needed, because that $25 payment was a physical action; everything else is a digital action. To undo the action of paying you $25 when they should not have paid, the bank would have to do a separate transaction in physical space, where they send you a letter saying, whoops, we accidentally paid you $25, and here's what your bank account actually is. We will be able to undo all the reads and writes if we don't want them to happen, but if the transaction has paid out $25, that's a physical action; you can't undo it digitally. You undo it by doing other actions in the physical world. Similarly, if a transaction fired a missile, you can't bring that missile back; you'd have to send an apology letter, or something really bad happens. The undo for physical actions in transactions requires undo in the physical world, which we won't cover, though there are mechanisms to do that that are business
related actions. You'll sometimes get a letter from a company: whoops, we made a mistake, sorry about that, here's what really happened, and by the way, here's a $10 gift card to make up for it. But all the digital actions, the reads and writes, we will cast into a very strict structure to guarantee well-defined properties of correctness, which we will uphold through the mechanisms we'll talk about today and in the next few lectures. Great question. Other questions?

All right. So we want to allow these arbitrary interleavings of operations and still be correct. Before we go further, I just want to solidify the notion of transactions. As we just discussed, the database is only concerned with the read and write operations on database objects. The physical-world actions, paying an amount, firing a missile, things like that, we won't be able to repair if bad things happen. In the advanced database class, depending on the material, we sometimes talk about things like compensating transactions and what people might do in those situations; for the purpose of this class, we'll stick to this more limited scope. Reiterating what we talked about: we think primarily about read and write operations. You noticed that in the code I showed there was also a check, a subtraction, some math. From the perspective of what we want to cover today for transaction management, all we care about is: did an object get written, did it get read, and do those reads and writes end up interfering with each other in bad ways? What are those bad ways, and how do we prevent them while allowing the maximum number of transactions to work on the system concurrently? That's the hard part, right? Because we already saw there's an easy way
which is to just do one transaction at a time; we want to do better than that. All right, so it's only these objects, which we'll refer to as variables A, B, and C. If you want a simple model, think of each as a record. Inside a database system, and we cover this in the advanced database class, A could be a column, doing things at a finer level of granularity, or it could be a page, or a file, or the whole database; some earlier versions of databases, like Mongo at one point, would lock the entire database to do concurrency control. They don't do that now. But whatever the size of that object, A, B, and C just refer to database objects. For the purposes of today, think of them as records; everything today and in the next two classes will make a lot more sense with that mental model.

All right. How does the database system know that a transaction has started and ended? In SQL, you can explicitly put a begin-transaction statement and an end-transaction statement to tell the database system that all the stuff in between is one transaction, and in between could be multiple SQL queries. If you don't have a begin and end, like all the stuff you've been doing in your homework, then when you fire up a SQL query, the database system implicitly puts a begin at the start of the query and an end at the end of the query. But a transaction can be multi-query, and you can make that explicit with begin and end. Instead of end, it's actually called a commit, which says: everything I've done from the begin to here, please commit it and make its changes permanent. The other way to end a transaction is with an explicit abort statement. Maybe I'm trying to make vacation plans, so I do a transaction to book flight tickets, then I do a bit of searching in the code to find hotel reservations, and whoops, there's no hotel. So I could say,
abort this transaction. Abort says: I did some work, reads and writes, but I don't think I can go further, something doesn't look right, abort this transaction. The application could then try again; that's an explicit abort. Sometimes the database aborts for you: multiple transactions are happening at the same time, it detects some unsafe condition and says, whoops, I'm going to abort this transaction for you, and you get an abort code back. From the transaction manager's perspective, the module we're trying to understand and build today, it will be presented with a begin, a bunch of read and write actions, and eventually a commit or an abort. That's what we have to build in the system code to do transactions. So transactions end by committing or by aborting. Aborting means undo everything that might have been done; commit says make everything I did permanent. And the abort could be self-inflicted, as we talked about, where the application explicitly put in an abort call, or the database system may have to abort for some reason to provide the safety guarantees we talked about, and we'll look at the different ways aborts happen inside a database system over the next couple of lectures.

Okay, many of you might have heard of this thing called ACID. It's a cool acronym that a couple of people in the community came up with to describe the properties we want of these transactions: atomicity, consistency, isolation, and durability. Let's go through each one. Atomicity says: I have a transaction that did a whole bunch of reads and writes, and when I say commit, everything should have committed to the database, or if I say abort, nothing should have happened. I want this all-or-nothing property of the transaction, even though lots of reads and writes might have happened in this
transaction; it should feel like it's atomic. It's very much like atomic instructions in processors, but now for multiple reads and writes, and those reads and writes may be spilling data to disk, so it's at a much higher granularity; a much harder problem. Consistency is the weird one, and I'm actually going to cover it last; let me go to isolation and come back to consistency. Isolation says: if two transactions run, like those two debit transactions withdrawing $25 from the bank account for you and your significant other, we want them not to interfere with each other. It should feel like each transaction happened by itself. Even though we don't want to actually run one transaction at a time, we want the illusion of the system running one transaction at a time: it should feel like when my transaction ran, I had the whole system to myself and nothing interfered with my work. Does that make sense? Durability says: if the database comes back to me and says I've committed your transaction, then even if there's a failure afterward, the disk fails or the memory fails, I should be able to recover the state of the database with all the committed information in it. If I changed the final value of the bank account to 75, the database said I've committed your transaction, that commit call came back with a green signal, and then there's a power failure, then when I bring the machine back up I should see 75 in there, not the 100 it started with.

Now, back to consistency, and I waited on it so I could put some labels around how we go about these things. Consistency says: if the database starts consistent, as per some definition, then it should end in that same consistent state. That seems vague, so what does it mean that the database started consistent? In SQL, you can define a primary key, a foreign
key; you've already done that. SQL also has things called constraints and things called checks. A check might say: I've got a price field in the table, and that price should never be greater than 100. The database is required, whenever any updates happen, to make sure those things hold true. So consistency really says: if the application has specified, in the SQL DDL, all the things they want the database to keep correct, then database transactions shouldn't mess that up. The main ways that happens are checks and primary keys; you can also define constraints outside tables in other ways, and you want all of that to hold. It's sort of weird, but it says: the application has defined the structure of the database correctly using everything SQL allows, with these checks, and the transaction management, everything we do here, shouldn't mess that up. If I said A plus B must always equal 100, and someone modified A and modified B in the appropriate way within a single transaction so that the sum still equals 100, the transaction shouldn't violate anything or change that constraint; it should maintain that correctness. We'll come back to it. It sounds a little vague, and in many ways that C was fitted into the ACID definition because the real key things the database does are A, I, and D. But if the acronym were AID, it wouldn't sound as good as ACID, and this was done when hippies were ruling the Bay Area and Europe and they wanted acid everywhere; that's my understanding of how the C got plugged in.

We'll talk about the different mechanisms, and those are what you see in those blue bubbles. Atomicity and durability will require mechanisms to redo and undo stuff: oh, I've changed A and pushed it out from my buffer pool to disk, but I need to abort that transaction, so how do I undo that? Oh, I committed a transaction
with a new value of A, but it's still sitting in my buffer pool, it didn't make it to disk; I need to redo the disk copy. So redo and undo mechanisms are what A and D require, using things we'll see today and in the next lecture. As I said, consistency is provided by making sure the integrity constraints defined in your SQL DDL hold, and the transaction management system should not violate them. Isolation is done by something called concurrency control, which we will also look at, and locking and other mechanisms will come into play there. We're going to dig into each one of these, so let's get started. That's our agenda: dive into each of these four components.

Atomicity of transactions. There are two possible outcomes of executing a transaction: it either commits or it aborts. The abort could be explicitly triggered by the application, or the database could say something's unsafe, I need to abort you, and then the transaction could restart and reissue the SQL statements. The DBMS needs to provide this all-or-nothing, atomic property for transactions. Let's make it a little more concrete; we've been playing around with examples, so we'll look at a couple of scenarios as we dig into this. Scenario one: we take $100 out of an account, but the database system aborts the transaction before we transfer it; we want to make sure that is reflected correctly. Scenario two: we take $100 out of an account, but there's a power failure before we transfer it. We've kind of seen both of these, but these are more examples, and we have to determine the correct state of the account after both of these transactions abort.

So how can we do this? The atomicity part can be done in one of two ways: logging and shadowing. One way is to log everything we do. If I'm recording all my actions, then the minute I change a value from 100 to 75, I can record that my value was 100, I change
it to 75; my before value was 100, my after value is 75. I can maintain those log records, store them in memory, and at appropriate points move them out to disk. Once a record is on disk, I know it's going to be there; it will survive a power failure. If I don't think I can trust a single disk to hold that log, people often use disk mirroring; remember, we talked about disk mirroring a couple of lectures ago. There are two disks holding a copy of the same file: every write actually goes to two places, so that if one disk fails, I still have the other copy. If both could fail, you'd make a third copy, and so on; you have to decide what failures you can tolerate. But once a log record hits the disk, you know you can recover from it and reconstruct the state of the database, and we'll talk about the protocol details two lectures from now. Logging is the mechanism most database systems use to record the changes that are happening. Okay, all right, does that make sense? Questions?

[Student: how does this relate to storage-level, log-based storage?] See, don't confuse this with a log-structured file system, if that's what you were thinking; it will look similar, but it's different. A lot of this originated in the 70s and 80s. It's a file of records, but the records are going to be variable length, and the structure will have a certain set of common fields; think of it as a file with some fixed-length components and a whole bunch of variable-length ones, sitting in a file that will often be called a log file. It's basically written sequentially from start to end, and it has its own buffer management, so the recent pages will be in the buffer pool. At certain points we will have to tell the buffer pool: I cannot
commit this transaction until this page you have here in memory hits the disk; so we might hold the transaction's commit, not returning it, until then. But you can think of it as a file of records, just like a regular file, with its own special structure and schema.

[Student question about log-structured storage and redundancy.] So the question is: if I have a log-structured file system, do I still need logging? A log-structured file system works per file. In a database system, I may have 100 tables, and a transaction may have updated five different places; I need a global logging mechanism to deal with that. So traditionally I have one log file that keeps the changes happening across everything, not just a single file. Log-structured file systems do have similar elements to the redo and undo mechanisms we're trying to build, but they're on a per-file basis; they may not be the right structure for this, and they don't necessarily have all the buffer management. They're good for certain applications in file systems, but for database systems we need something different, and we need it across different files. We'll talk about all of that as we go into the details, but that's a good point: the operating systems folks talk about log-structured file systems, which have very similar elements, so can we use those? We need to worry about transactions that update records across multiple different tables, which may be in multiple different files in the file system, and still be atomic across those files. So it's different in many ways; we have a harder problem to solve. Great question, though. Other questions?

All right, so logging is used by nearly every database system. What's the alternative, you might ask? In the
very early days, before logging and all its details were really worked out, a very easy way to do this was shadow paging; it's a copy-on-write type of thing, and you'll see similar concepts in many applications. By the way, those of you taking operating systems, or who have dug into that, will find a lot in common between database systems and operating systems. We already hit latches and locks; they do different things, but a lot of the database mechanisms operate on larger things, lots of files, lots of data, and in some sense the mechanisms are different, and in my view often richer. So shadowing says: okay, I'm going to make a change to a record on a page; you know what, I'm just going to make a new copy of the page and make the changes there. That way, if I need to undo stuff, I just go back to the old page, and I've got both my before and after copies for the entire page. Some systems still do that, but it's not a good idea; it's less efficient. If I just wanted to change one byte in one page, I'm going to make an entire copy of the page, which is obviously wasteful, and I still need to go and clean things up and merge things back. Worse, if I've got one page with a hot record, say it's the counter for how many SKUs I have of the pink Barbie doll that's really popular, then every time someone buys that Barbie doll, the counter changes and I make another copy of the page. If I've got a million Barbie dolls, I'll have a million copies. By the way, counters in database systems aren't actually kept that way; they're done with more semantic components using commutative actions. That's a side comment, and an encouragement to take the advanced database class, which covers more complex mechanisms to make transactions go even faster and better. But you get the idea: shadow paging is a cheap mechanism, easy to implement, but
it's problematic; logging is superior, and that's what most database systems use.

So let's get to the C part, which has always been a little difficult to explain, but hopefully with this slide it will become clearer. It's essentially saying: I, as an application programmer, have a contract with you, the database system. I'll tell you what I want through my SQL statements, my primary key constraints, my checks, and the other things I want in the system, and you don't mess them up. If those constraints held before the transaction started, then after the transaction ran, and it may have read and written millions of objects across hundreds of tables, once you say it's committed, all those constraints must still be true. If one of my constraints was that the sum of all the values in the price column is less than 100, that should still be true. If I said all the values in column A should add up to exactly a million, and lots of changes happened to different values in that column, the sum should still be a million at the end of it; I've specified that through my constraint, and it should still hold. Now, the transaction management is not going to do anything different here, as you'll see; we just have to make sure the A, I, and D components hold, and consistency gets taken care of because we don't mess it up. That's why the C is kind of plugged into this ACID stuff. The transaction mechanisms we'll talk about today and in the next two lectures won't directly address C; it's the responsibility of the application programmer to define those constraints correctly.

To make matters even more confusing, a while back, close to a decade ago, there was all this excitement about a thing called eventual consistency. Even though it has the word consistency in it, it has nothing to do with the consistency we just defined here as the
C in ACID. That eventual consistency was a model that said: transactions are hard in distributed systems, it becomes very tough to do this; though by now we have figured out how to do things well in distributed systems too. At that time, people were building these systems, and largely the systems were being built by people who hadn't taken the database course; that was the problem. They said: the best way we can make this happen is to be eventually consistent. Let the changes happen, and if two transactions made changes to A, eventually the copies will agree. So with that $25 withdrawal from two ATMs happening simultaneously, eventually the balance will end up at 50, but for a while, even though both have been paid $25, if you read the value you will see 75. The eventual consistency model says: don't trust the value; eventually it will be right. But the question from the application's perspective is: how long do I have to wait, and how do I know when something is right or not? In the early days, when people were building applications on eventual consistency, you'd have your Facebook app, you'd post a message, your friend would see it, you would refresh, and it would be gone, because eventually it would be there, but your refresh fetched from a different server that hadn't gotten your message; your first connection went to server one, your refresh went to something else that hadn't gotten the update. As you can imagine, that became very hard for application programmers. The whole point of having database systems or data platforms deal with transactions is so that I, as an application programmer, don't have to deal with this. So it's now pretty universally accepted that it's a bad idea, unless you have very strict performance requirements that leave you no other option. It is not the go-to way to build data platforms, telling the application person to start worrying about transactions
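A toy illustration of that stale-read window, with two dicts standing in for two replica servers and a list standing in for the asynchronous replication channel; all names here are invented for the example.

```python
# Two replicas of the same bank account; replication is asynchronous.
replica1 = {"A": 100}
replica2 = {"A": 100}
replication_queue = []   # updates waiting to propagate to replica2

def write(key, value):
    replica1[key] = value                    # applied locally right away
    replication_queue.append((key, value))   # shipped to replica2 ... later

def replicate():
    while replication_queue:
        key, value = replication_queue.pop(0)
        replica2[key] = value

write("A", 75)          # your withdrawal commits on server 1
print(replica2["A"])    # 100 -- a refresh hitting server 2 sees the old value
replicate()             # "eventually" the update arrives...
print(replica2["A"])    # 75 -- now the replicas agree
```

Between the `write` and the `replicate`, the two servers give different answers to the same read; that window is exactly what the application programmer is left to cope with.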
Can you tell me about solutions to eventual consistency? A bunch of systems still use eventual consistency, but I'll briefly flash the Spanner paper, where Google is doing things globally that are even stronger than ACID, and again, if you take the advanced database class, we talk about stuff like that there. So it's not that systems don't use it today. I'm just saying that, in my view, it's not a good idea to use it as the default way to build a data platform. There are still some reasons why you may want it, but you shouldn't say: I'm going to make my database eventually consistent by design because I didn't think of other ways of doing full transactions. There's performance, obviously; you can get more performance with eventual consistency, there's no doubt about that. Sorry, for this question I'll close my ears, everyone close their ears: what is eventual consistency? Eventual consistency says that eventually all the values will be the right values. It will feel like you got your transactions, but for a while you may see inconsistent values. So an application reading the value might just be wrong? Yeah, as I said, you may see 75 for a while when it should be 50. Both of you are standing at two side-by-side ATM machines, both of you got your 25 bucks, and you will say, oh, 75 is in the bank, which is not true, and the bank will say, you know what, it was actually 50, we gave you that money; eventually it'll show the right amount. But as you can see, it's hard; you don't want that in your application. Yep, another question? Yes, Dynamo has eventual consistency, other systems too; Dynamo put this term out. There was a whole time when people were saying we don't need SQL, we don't need transactions: NoSQL, eventual consistency. That hasn't quite worked out for those people. Okay, so I'm not saying there's no call for eventual consistency, I'm saying that's
not what you should design for by default, unless you know what you're getting into, because now you're going to have to put the complexity in the application. Okay, so there's still some place for it, but it's not the default setting. All right, let me go find my mouse again, there we go. All right, so: isolation of transactions. We are still on the I. Users submit transactions, and each transaction executes as if it were running by itself. Obviously it's an easier programming model, as we talked about, and the DBMS is going to do its work to give you this one-at-a-time feel, isolation as a principle. So how does it do that? There are two classes of methods, and today I'm just going to outline them at a high level; we'll get into the details in the next lecture. The first is pessimistic, which says: even before I let a read or write happen, I'm going to be pessimistic. If I think something bad is going to happen, I'm going to hold back the transaction that I think is going to start that bad action. We'll see how that happens in a little bit, with a couple of mechanisms. The second way is something called optimistic concurrency control, which says: I think life is good, I'm going to let every transaction go through. I'll still provide the isolation principle so they don't interfere with each other, but I'm going to do that by assuming everything will go well and checking. It's like: everyone, go do your reads and writes, I'll keep track of them, go make your changes, but before you're ready to commit, I will do some checks to see whether you interfered with each other, all of you transactions that were running simultaneously. So optimistic says: most transactions don't conflict with each other, so I can get a higher-performance system by being optimistic that most things will work out. It will still be correct, it will make sure bad things don't happen, but it's a different philosophy. Okay. All right, so let's
start digging into a couple of examples. As we know, transactions are going to be cast in terms of these read and write operations, and now you're starting to see the begin and commit calls coming in. T1 is a transaction that's moving a hundred dollars from bank account A to bank account B, and T2 is a transaction that's adding 6% interest to all the bank accounts. Assume this is a small bank, so it has just two accounts; that also keeps the example simple. What are the possible outcomes of running these two? Assume both A and B start with a thousand dollars. If I run T1 followed by T2, or T2 followed by T1, I'm going to get two different execution results, but either way, A plus B should be two thousand plus the 6% interest, meaning the total across those two accounts should be two thousand one hundred and twenty. Right? That's what we want to end up with. Now, what we're going to look at next, and I'm just going to jump into this, is that with just these two transactions, I could have a correct isolation property by either running the first transaction first or running the second transaction first. There are two possible outcomes: T1 runs first followed by T2, or vice versa. Here you can see one example where the value of A is 954 and B is 1166; that corresponds to having done the hundred-dollar transfer first and then adding 6%, and the other outcome is doing it the other way around. Either one of those, by the isolation principle, is correct. If both transactions are in the system at the same time, we, as the transaction management system, are okay picking either order. So let's look at it more visually, and that might make more sense. There's this notion of a serial execution, and in the serial execution diagrams we'll see over the next 10 or 15 slides, time always goes from top to bottom. Imagine you are the machine and you're watching transactions come at
you. Isolation basically says: if this execution is such that you can prove to me that, whichever way you interleave the transactions, it feels like one happened entirely before the other, then it's fine. That could be all of T1 followed by T2, which is the example on the left, where you end up with A equal to 954 and B equal to 1166, or T2 followed by T1; both of them are correct. But if you are going to interweave the reads and writes of these two transactions, the database had better end up in one of those two correct end states. Which one doesn't matter, but it should be one of those two, not something else. Question? Why are we allowed to do this; obviously this is an example, but if it ran one at a time, one of them would have been picked, right? No, that's the whole point. The question is: why are we doing this in parallel? Right now there's a serial execution, so T1 happens entirely after T2 in time. But imagine I've got two cores in the server, and today's servers have 40 cores, and database installations sometimes have hundreds of machines. I don't want to run just one transaction at a time; I want to do as many actions as possible at any given point in time, because I want to use all the hardware I have access to. That way I can get more transactions through the system: higher throughput, and potentially lower latency. No, no, we will not lower the standard, so hold on. We will not lower the standard; we are changing it a little bit, to say any permutation is allowed, T1 followed by T2, or T2 followed by T1, as long as we can prove one did all of its work before the other. So hold on for two slides. You're right that if the only acceptable answer were T1 followed by T2, then we would allow less parallelism in the system. Okay, so we're going to relax the strict notion of what
is correct. And this is where there's this notion of strict serializable, which says the transactions should appear to commit in the order in which they actually happened in the physical world. We'll talk about that a little in the next class, and a lot more in the advanced database class. First let's get the basic notion right, and this is, by the way, what most database systems do today: they take this slight liberty of relaxing the order. Because, also, if T1 and T2 are issued at the same time, what does it even mean to be issued at the same time? They may come from different cities. Even if you're sitting next to each other, it's not as if you'd press the button at exactly the same instant; that doesn't happen, right? What does simultaneous even mean? So we're going to be a little more relaxed: all we require for proper isolation is that the final values are one of those two outcomes, not a third value. Look at this slide here first. Yep. So here T1 starts and does its subtraction of a hundred, so A already has a hundred dollars removed. T2 starts and gives six percent interest to the two bank accounts, but the value of A that it's looking at is a hundred dollars short, and so we end up with 2114 as the sum total, which is six dollars off from what the total should be. So what we want to allow is the interleaving on the previous slide, which is safe, where you end up with a correct value even though there's interleaving; but this is an example of an unsafe interleaving, where you end up with a wrong value. Sorry, I was on the wrong slide; this one is fine. And the question is: how do we determine that this type of interleaving of actions across different transactions is okay, but that one is not? So let's look at how we go about doing that. Here's the formal notion of a serial schedule: a schedule that does not interleave
any of the actions. We saw that, right? An example of a serial schedule is one where you just have all of T1's actions together: if we go back, this is a serial schedule, T1 followed by T2, or T2 followed by T1. Now the harder thing we're trying to get to is an equivalent schedule. Here was a schedule we saw that was correct; it is not serial, but as you can see, it is correct, and we want to look at this schedule and say the one on the left is equivalent to the serial schedule on the right. We want to be able to show that, in which case we'll say that schedule is safe and it's allowed. So the notion of an equivalent schedule says: allow some interleaving, as long as you can prove that the interleaving is safe, and the proof of safety is by showing all these actions are equivalent to some serial schedule, in this case one of the two, T1 followed by T2 or T2 followed by T1. Then you're okay: you will end up with a correct answer. It sounds like magic, but there's a very simple way to figure that out. And this is the part you were asking about: why are there two correct states of the database? That's really what we're relaxing a little in this notion of serializable. As I said, there's a notion of strict serializable, where the real-world order is taken into account, and it's stricter, but for the purposes of this class we'll work with this notion: we just have to prove equivalence to one of the possible serial schedules. If you have two transactions, there are only two possible serial orders; if there were three, T1, T2, T3, the first could be any one of the three, then there are two choices, then one, so you get that factorial effect: 3! = 6 possible serial orders. We are going to think about these interleavings in terms of conflicting actions on the objects, the A's and the B's and the C's. So we're going to have, we call these things,
anomalies, and there are three different types. A read-write conflict: a transaction reads an object and another transaction writes to it. A write-read conflict: a transaction writes an object and another transaction reads it. The third: one transaction writes to an object and another transaction also writes to it. What's missing from this combination is read-read, which is obviously not a conflict: if two transactions are just reading a read-only copy, many transactions can read it without conflicting with each other. All right, so let's take these read-write conflicts and look at them. Here is a schedule in which there's a read of object A followed by a write of object A as the actions got interleaved, so that's an example of a read-write conflict. There are other conflicts in this schedule too, but what's the downside of this one? Transaction T1 read the value of A, T2 ran, and then T1 reads the value again and gets a different value. It was 10 before; T2 made it 19; T1 reads it again and obviously isn't seeing the isolation principle: within the same transaction, if I read a value twice, I should see the same value, unless someone else interfered with me. It violates the isolation principle, so this is bad. This is called an unrepeatable read, because when you have a read-write conflict, the same transaction reads the value twice and sees different values. A dirty read is: I read a value, say it's 10, I write it to be 12, someone else reads the value and sees 12; so far, no problem. But a little later, transaction T1 aborts, and when T1 aborts, that value 12 should never have been read, because it never got committed to the database; it was an intermediate value that's getting thrown away, which is what an abort should do. So now transaction T2, when it read the value of A, read a dirty value,
which should never have been in the database, because of that abort. That's called a dirty read. A write-write conflict: T1 is writing a value, T2 is also writing that value. Imagine A and B are two columns of a record: A is the account amount and B is the name of the person. T1 thinks it's writing 10 dollars for Alice. But the other transaction writes 19 and Bob to that same record, and then T1's write of Alice lands: T2's update of 19 and Bob just got overwritten. Transaction T2 did its work, but its value got overwritten; it lost its update, and it never made it to the database. Okay, so we'll use these conflict types to define how to make this equivalent-schedule idea work. There are two types of serializability: conflict serializability, which we'll talk about next, and something called view serializability. Since we're running out of time I might just skip that piece, but I'll briefly allude to what it does; I won't grill you with exam questions on view serializability, so it's okay if you don't totally get it. Conflict serializability is what database systems implement, so let's jump into that. Let me go with the graph form here to make it even easier to understand. We'll do the following: we'll take the operations in the schedule we have, go through it, and enumerate all possible conflicts: read-write, write-write, and write-read. From those three conflict types we'll create a graph called the dependency graph. I've got two nodes in that graph, T1 and T2; every time a transaction comes in, a new node is added, so the graph has as many nodes as there are transactions active in the system at any point in time. Now I'll just start going through the schedule and start marking all
the conflicts. I start with the first conflict I see, which is a read-write. It doesn't matter which kind it is: as soon as I have one of those three types of conflict, read-write, write-write, or write-read, I'm going to put an arrow in. The first conflict is on object A, and it goes from T1 to T2 because T1's action happens before T2's in the schedule, so I draw a directed arc from T1 to T2. Okay, does that make sense? Really simple: I'm just going to walk through the schedule, and for every conflict I'll draw a directed edge in the order in which the conflict happens. Now I find the next conflict, which is this write followed by a read, and I do the same. So the test is really simple: you take a schedule, walk through it, draw an edge for every read-write, write-write, or write-read conflict, and if there's a cycle in that graph, you have a bad schedule. That schedule violates the isolation principle; it should never have been allowed to run in the database system. In the next class we'll see mechanisms you can use to prevent that from ever happening. Okay, does that make sense? If you didn't get anything else because of that bug in the slide, this is the main thing I want you to get about the isolation principle. What direction do the edges go? The edges go in time order. Look at the write followed by read, the red line shown here: it goes from T1 to T2 because the write of A in T1 happened before the read in T2. I draw a directed edge from T1 to T2 saying some action in T1 happens before an action in T2, which means T2 had better not do things the other way around, because then I've ended up in a contradictory state. So the time axis on the left is the order in which the database sees those actions, and it constructs this graph, and what we'll see in the next class is that the minute it sees the second
conflict, which is going to complete this cycle, it will stop; it will basically not let things proceed beyond the write of B, because it says: if I allow this write to happen, this arc will form and I will have a cycle. So it will pause it; it will stop transaction T1 from proceeding, with the techniques we'll talk about in the next class. Okay, yep? What if T1 is actually the start? Yeah, then it may need to abort something or delay something; we'll talk about that. There are times when it may just say: you know what, I need to abort, I can't go any further. All right, so it's really that simple: we construct this graph, and now we have this beautiful proof of how to guarantee the correct isolation property. All right, here are three transactions, to show this in a slightly more complicated way. As I walk down, the first conflict is that read-to-write. There are other conflicts in there, for instance from the W(A) in T1 to a later read of A; I'm not showing every possible conflict, but both of them induce the same arc. Other conflicts would just redraw that line, and you don't have to redraw a line multiple times: one arc is enough. That's all this is showing. As you go further down, we see another conflict, inducing an arc from T2 to T1, and that's all there is in here. So this says: in this schedule, even though there's a bunch of interleaving going on between the three transactions, you can prove it to be correct. And can you guess the equivalent serial schedule? Look at the graph. Yeah, it's the topological order of that graph: it's as if T2 happened first, then T1, then T3, as if that were the serial order in which they were executed and we were just running one transaction at a time. That's the whole idea. Once you get why this graph makes sense, it's trivial, but it took a little while for people to figure this out.
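The walk-through above can be written as a few lines of code. This is a minimal sketch, not BusTub's actual implementation: a schedule is modeled as a list of (transaction, operation, object) steps, and the dependency graph gets an edge for every read-write, write-write, or write-read conflict between two transactions on the same object.

```python
# Sketch of the conflict-serializability test: build the dependency graph
# from a schedule, then check it for cycles. Illustrative code, not a real DBMS.

def precedence_graph(schedule):
    """Return the set of directed edges (Ti, Tj) induced by conflicts."""
    edges = set()
    for i, (ti, op_i, obj_i) in enumerate(schedule):
        for tj, op_j, obj_j in schedule[i + 1:]:
            # Conflict: same object, different transactions, at least one write.
            if obj_i == obj_j and ti != tj and "W" in (op_i, op_j):
                edges.add((ti, tj))          # earlier txn -> later txn
    return edges

def has_cycle(edges):
    """Depth-first search; a back edge to a gray node means a cycle."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def dfs(u):
        color[u] = GRAY
        for v in graph.get(u, ()):
            if color.get(v, WHITE) == GRAY:
                return True
            if color.get(v, WHITE) == WHITE and dfs(v):
                return True
        color[u] = BLACK
        return False
    return any(color.get(u, WHITE) == WHITE and dfs(u) for u in graph)

# The unsafe bank interleaving: T1 touches A, T2 does both accounts, T1 then
# touches B. Edges form in both directions -> cycle -> not serializable.
bad = [("T1","R","A"), ("T1","W","A"), ("T2","R","A"), ("T2","W","A"),
       ("T2","R","B"), ("T2","W","B"), ("T1","R","B"), ("T1","W","B")]
print(has_cycle(precedence_graph(bad)))      # True

# The safe interleaving: every conflict points T1 -> T2, so no cycle, and the
# schedule is equivalent to the serial order T1 then T2.
ok = [("T1","R","A"), ("T1","W","A"), ("T2","R","A"), ("T2","W","A"),
      ("T1","R","B"), ("T1","W","B"), ("T2","R","B"), ("T2","W","B")]
print(has_cycle(precedence_graph(ok)))       # False
```

A pessimistic system runs this check incrementally as each operation arrives and blocks the operation that would complete a cycle; an optimistic system lets everything run and does the check at validation time.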
It wasn't that trivial, and that's how many of these beautiful ideas are: they seem simple only in retrospect. Okay, so forgiving everything I said around that bug in the slide: if you get this, you understand how isolation works. Okay, yep? Yeah, we'll talk about aborts and other kinds of special handling there. But what is true is that even before the abort, if you start to see a cycle, you know you'll end up in a bad situation, so you need to stop it. Because time is evolving: as the database's transaction manager, if I'm at this point in time and I've been asked whether to admit this read of B, I have to make a decision. If I'm a pessimistic transaction management system, pessimistic isolation management, I have to decide right then whether to let it happen or not. If I'm optimistic, I'll say let it go, and at the end I'll figure it out. That's the difference between pessimistic and optimistic. Yep? Is this graph created before it happens, or is it being created while it's happening? It's created while it's happening. You can imagine something like this happening in the system: we start with the begin on T1, then the read of A is done, the write of A is done, the system says I can let you go; then when the conflicting write comes, the system says, you know what, I need to put an arc in this graph. So as time progresses, these arcs are added as the reads and writes are presented to the system. And that's the difference between the pessimistic and optimistic methods: pessimistic says, the first time I think there's a problem, I'm going to stop you, and we'll see how in the next class; optimistic says, I'll let everything go, but I know how to get you back to a safe place if bad things happened. The graph itself is neither optimistic nor pessimistic; it's just how we detect whether something is good or bad, it's the mechanism. Yep: if there's a cycle, it is not okay. How do I find the cycle, and when, is where the difference lies: do I find the cycle as
soon as it's formed, or do I find the cycle after I let everything go? Yeah, that's basically the whole theory of saying: is this equivalent to some serial order or not. So here's another example, and this is where we get into view serializability. What's happening here is that A is getting $10 removed from it, and there's a sum over the accounts being calculated and printed out. Of course there are problems with this schedule: one conflict causes an arc from T1 to T2, and another causes an arc back, so this is not conflict serializable. But if, instead of computing the sum of the accounts, the code in the second transaction were only counting the number of accounts with values greater than 100, then the cycle wouldn't matter given the semantics of that second transaction. At a very high level, that's what view serializability is: if you knew something about the semantics of what was happening, you would admit a few more types of schedules than you would otherwise. And as I said, that's all I want you to know for this class; if you didn't get that, that's okay, let it go. We're not going to talk much about view serializability; it allows certain kinds of cycles because it can tell the application is still okay. There's a formal definition, if you're interested in looking at it; in terms of what it allows, it basically admits a few more schedules than conflict serializability does. All right, so conflict serializability is the main thing we want you to know about. Pictorially, here's the universe of all possible schedules, including the bad ones we don't want to happen. Serial schedules are that strict serial stuff we talked about: T1 followed by T2, and that's the only thing allowed. Conflict serializable allows a few more combinations, and view serializable is where you might allow a
little bit more, by way of application semantics. Okay, and there are many more levels in there; we'll uncover some of them in later classes, and again a lot more in the advanced class. So we still have one more letter to cover, which is the D: durability. We did atomicity, consistency, and isolation, which is where we spent most of our time today, and the last one is durability, which we'll cover really fast; we have a full lecture coming on it, two classes from now. It's about making sure the changes we make to the system stick: we're making changes to the A's, the B's, and the C's just in memory, right? If you're updating a column value or a record, it's sitting in the buffer pool. But what happens if you have a power loss? Memory is volatile, so those changes never made it to disk, but you might have committed that transaction. Durability is the property that says: if the database commits a transaction... oh, and by the way, the database is allowed to use a buffer pool, because that's the efficient way to build data processing systems; you don't want to go to disk every time, since accessing data in the buffer pool is so much faster than accessing it on disk. But if the database says it's committed, you want to make sure that even after a power failure you can get the right values back from the database. Durability does that by making sure that, when the commit happens, certain things get written to disk, and it tries to do only a very small amount of forced writing to disk to provide this property. Okay, and as I said, two lectures from now we'll talk about that. Yeah, we can talk about that offline; the answer is the principles will probably apply: I have volatile storage and non-volatile storage, and I can apply this principle of durability across any of those systems. Yeah, the same
principles do apply. Yep, the logs? Sorry, yeah, the log records that we create will be kept in memory too, but at appropriate times we will actually flush them out to disk to guarantee the durability property; we'll get to that in two lectures from now. What happens if the log write fails? That's the whole lecture, two lectures from now; I can't give you a 10-second answer. The 10-second answer is that we will make sure that what absolutely must be written to disk is written, and we'll do one of two things with the stuff we write on disk: either redo the operation, because what's on disk is inconsistent with what was in memory, or undo it, because we wrote something we should not have written. So we'll do redo and undo logic on those. Okay, so here are the conclusions: concurrency control and recovery are amongst the most important functions of a database system. Transactions are super important; they're a pretty essential, definitional component of database systems, and if you go back to the early days, that's why databases started getting adopted in enterprises: they allowed all this reliable record keeping. Now, I need five minutes to go over a couple of things, but before that: we've talked a few times about how there's still a lot more you can do with transactions. This is a breakthrough paper that came out from Google, called Spanner. By the way, before that, they were doing eventual consistency stuff in many parts, and they realized, wow, application programmers can't quite deal with that. So they actually built a very hard thing: a globally distributed system that can do transactions where a transaction can touch objects anywhere in their distributed database. It might touch an object in London, touch an object in the US, and commit across the globe as one transaction, and do that fast and efficiently. And the
reason they did that is that when they kept the eventual consistency stuff, all kinds of application programmer bugs were showing up. For example, the ad system would report wrong things: it would tell the same advertiser, Procter & Gamble say, in London, this is how many impressions we showed, and the same campaign in the US would show a different number, and these two people get on the phone and say, what the heck is happening, what impressions did we really show? You want that answer to be precise, because someone's paying money for it. That was one of the big reasons they built it. It's a beautiful system; it requires atomic clocks, with satellite syncing across the data centers so that the clocks don't drift out of sync. CockroachDB was formed by people from Google who worked on this, and they have a version that doesn't require satellite-synced clocks. Fascinating field; there's still a lot of new work happening. Bonus time: I'll leave this link in the slides. We've talked about all kinds of different models and levels of consistency, but there's a lot more, and you can play around with that chart. But I want to spend a couple of minutes on Project 3, which is on query execution. The overview of this project: these are the different components of BusTub, as you've gotten to know and love over time. Project 3 is about the optimizer and query execution; Project 4, which is coming, will be the transaction management piece. Okay, so what is Project 3? You're going to add two different access methods, sequential scan and index scan. You will also do insert, delete, and update; you're going to add these as new operators in the system. Two different types of joins: nested loop join and hash join. And there are some miscellaneous operators: window functions, aggregation, limit, sort, and top-K. You'll also touch the optimizer. There's already
an optimizer in BusTub, and, for example, if a query has both an ORDER BY and a LIMIT, you can convert it, with a simple transformation, into a top-K query. That'll give you a chance to look at how optimizers work. If you see a nested loop join, that's often a bad idea, so in the optimizer you will write a rule to convert nested loop joins to hash joins, because those are typically much faster. Similarly with sequential scan to index scan: if an index exists, don't do a sequential scan, use an index scan. Think of these as the heuristic stuff we talked about in the optimization lectures: not cost-based, just rule-based. The leaderboard will require making deeper changes to the optimizer, so even if you have awesome code from Projects 1 and 2, you're guaranteed not to get high marks on the leaderboard unless you add new optimization rules; that's all described in the project. You'll be adding rules around window aggregation and top-K. Quick tips: start with the easy stuff, insert and sequential scan. Get those right before you go do other things. And the key thing is, you can now actually run BusTub in the browser. Let me close this here, whoops, quit PowerPoint, don't need that anymore. If you go look at BusTub now, you can run it in the browser; let me find the link. Yep: Chi, who's just an awesome programmer, has written this version that runs completely in the browser, and it's already loaded with some tables, so you can start to do things like SELECT * FROM the mock table, whoops, where's my semicolon, and run stuff. You can use this to test whether your code works; our solution is sitting behind that. Yeah, use this as a reference. Don't use Gradescope as a way of doing your debugging; write your own tests, otherwise you
will not do well in this project. Okay, so hopefully it's a fun project. It may seem like a lot; start with the simple stuff, and you'll be surprised how quickly you start knocking things off. Okay, all right, thank you. DJ Shubham, hit it!