Okay, so real quick administrative things. A reminder for everyone: project number one is due this Tuesday at midnight. We're also going to be releasing project number two on Tuesday in class. Again, project number one is done individually; project number two will be done in a group of three. I know some of you have already joined together and formed your groups, so what I'll do is send out a link on Piazza to a Google spreadsheet where you can fill in who is going to be in each of your groups, just so we know who still needs to be assigned to a group. If you can't find a group, we'll figure out something for you, okay? So any questions about project number one? Any questions about what's expected for forming a group for project number two? I'll also say it's very likely that whatever group you use for project number two is probably the group you'll use for project number three. Last year we had some issues, some people got into fights, it got kind of ugly, and we had to split them up. You guys seem much nicer than last year, so let's try to avoid that. If I have to intervene, I will, but this is the group you'll be working with for the next couple of months to finish up the semester, okay?

All right, so for today's agenda, I'm going to focus again on concurrency control. I'm going to start off talking about isolation levels and stored procedures. These are things that may have been alluded to in the intro class, but we don't really describe why they actually matter in the context of concurrency control and high-performance systems. Then in the rest of today's lecture we'll discuss the optimistic concurrency control protocol, in particular in the context of the Silo system from the paper that you guys read, and then we'll finish up talking about the TicToc protocol that I was involved in. Part of the reason why I'm focusing on concurrency control so early in the semester is that I want this to be in the back of your mind as we go through the rest of the semester and read about all these other topics: thinking about how you would actually do the thing we're talking about when you know there are other queries or transactions running at the same time that could be updating the database. Certain things are really easy to do if you assume the database is read-only, if you assume that nobody's actually trying to modify things. But if you have transactions updating data at the same time you're trying to read it and do analytics and machine learning, then this becomes problematic. So that's why I'm front-loading concurrency control, because it's going to color everything we do as we go along, okay?

All right, so I'll make two observations to motivate the first two parts of the lecture. The first thing I'll say is that in the last class, for the paper we read, that thousand-core evaluation, all the transactions were running under the serializable isolation level. And the reason we did this is that serializability is sort of the gold standard of what you want to achieve when you're doing concurrency control, because it allows application developers to write their applications, queries, and transactions without worrying about dealing with concurrency issues inside of the database system, right?
You can write your application assuming that when you invoke a transaction it's going to have exclusive access to the database, and therefore you don't have to worry about reading dirty data, you don't have to worry about getting back vector clocks and other things that require you to figure out what the correct result of a particular query should be. But the downside of serializability is that enforcing it places extra restrictions on what transactions can run at certain times inside the database system. This can inhibit the amount of parallelism we have, which in turn can reduce the performance of the system. So, as we'll see, almost every single database system you can think of that supports transactions doesn't run serializable by default, and some of them don't even support serializability at all, right? They run at weaker isolation levels. And the reason is that weaker isolation levels are going to let you get better performance and improve the scalability of the system by allowing some anomalies to occur that don't occur under serializable isolation. So isolation levels are the way we're going to declare in our application what anomalies a transaction should be allowed to be exposed to, right? And the lower the isolation level, the better the performance you're going to get, because there are going to be fewer restrictions on what operations can run at certain times. You're worrying less about the conflicts you would normally worry about if you care about serializability.

So again, everyone here should have taken an introductory database course, so everyone here should be able to explain to me what these three anomalies are in isolation levels. What's a dirty read? What does that mean? You're reading data from a transaction that has not committed yet. Correct. What's an unrepeatable read? Somebody other than him. A transaction reads some data, then later on it tries to read the same data. Exactly. We can be more specific, but she's right: it's when a transaction reads an object, then reads it again in the same transaction, and now it gets back a different value, right? If you say "data" it's a bit nebulous, but you're absolutely correct. And then this one's more tricky. What's a phantom read? How does this work? Other than her and him in the front. When a transaction reads data that is then deleted by another... Okay, you got it half correct. Yes, a transaction reads data that is then deleted by another transaction, and when you go back and try to read it again, it's not there. It actually goes the other way too: you read something, it wasn't there at the beginning, and then when you read it again, it's now there. The example to use for a phantom would be this: say I'm doing a range scan, getting all the students enrolled in this class aged 18 to 25, right? I do my scan the first time, I get a count, I compute an aggregate. Then someone comes along and either inserts a tuple or deletes a tuple in my range, and now when I go back and do that same aggregation, that same range scan, my count ends up being different, right? And it's different from the other anomalies because those first two cases were talking about specific objects. This is why I qualified the statement she made: a phantom is not about reading an object, we're now reading at a higher level, like a range, right?
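To make that range-scan example concrete, here's a minimal sketch in Python. This is my own illustration, not something from the lecture or any particular system: the "table" is just an in-memory list, and all the names are made up.

```python
# Minimal sketch: simulating a phantom read on an in-memory "table".
# All names here are invented for illustration.

table = [
    {"name": "alice", "age": 20},
    {"name": "bob",   "age": 23},
]

def range_count(rows, lo, hi):
    """Aggregate over a range predicate, like COUNT(*) WHERE age BETWEEN lo AND hi."""
    return sum(1 for r in rows if lo <= r["age"] <= hi)

# Transaction T1 runs its range scan the first time.
first = range_count(table, 18, 25)   # -> 2

# Transaction T2 commits an insert that falls inside T1's range.
table.append({"name": "carol", "age": 22})

# T1 repeats the exact same range scan within the same transaction.
second = range_count(table, 18, 25)  # -> 3, a "phantom" row appeared

print(first, second)                 # 2 3: same query, different answer
```

The point is that no individual object T1 read the first time was modified, which is why locking the objects you read is not enough to stop phantoms; the change happened at the level of the range.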
So again, when you use lower isolation levels, you're telling the database, I'm allowing my transaction to be exposed to these anomalies. The four isolation levels defined in the ANSI standard are these here; this came out with the SQL-92 standard. Going from top to bottom, this declares the strength of the isolation level, so serializable is at the top. Serializable means you have no phantoms, all your reads are repeatable, and you can never have the dirty read anomaly. Under repeatable read, phantoms may happen; this is usually because you're not checking the ranges. Read committed allows phantoms and unrepeatable reads, and then read uncommitted says that all of these things could possibly happen. And again, we're saying that they may happen. It doesn't necessarily mean they will happen, right? If you have a database server that has one core and your application submits transactions in serial order, one after another, then there's never a concurrent transaction. So in that case, even though you declared your transactions to be running under read committed, you're essentially getting the serializable isolation level, because that's just the nature of what your workload looks like, right? I'll also say that these are not global across the entire workload. You can declare one transaction to be serializable and declare another transaction to be read committed, and the database system will enforce these isolation levels on a per-transaction basis. I'll show a quick sketch of what that looks like in a moment.

Another way to look at this is as a hierarchy in terms of protections. Again, going from top to bottom, serializable at the top gives you the most protections, and read uncommitted gives you the least. Now, when you go look at real database systems and see what isolation levels they support, you see one really interesting thing. All of last class we talked about how to do serializable transactions, right? But when you look at real systems here, only two of them provide serializability as the default. Most of them are doing read committed. And then when you look at the maximum, some of these systems, in the case of MemSQL, don't even support serializable; they only do read committed. This table was generated by Peter Bailis, who's now a professor at Stanford. He basically went looking at all these different database systems to see how prevalent serializability actually is. So again, the main takeaway here is that you don't see serializable as the default. MySQL's default is repeatable read. That's what Facebook runs at, and they're fine with that, and they have the world's largest MySQL cluster.

The two things worth noting here are that there are actually two isolation levels that aren't in the list of four that I just showed. DB2 claims that their default isolation level is something called cursor stability. And in the case of Oracle, their maximum isolation level is this thing called snapshot isolation. So what's going on here? There are some isolation levels that we didn't see in the standard, and if you go read most database textbooks, they won't talk about these things; they'll talk about the four that I just showed.
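Coming back to the point that isolation is declared per transaction: here's a minimal sketch of what that looks like from application code. This is my own illustration, not from the lecture; it assumes a PostgreSQL server reachable through psycopg2, and the connection string and the accounts table are made up for the example.

```python
# Minimal sketch: declaring a different isolation level for each transaction.
# Assumes a running PostgreSQL server and a hypothetical "accounts" table.

import psycopg2

conn = psycopg2.connect("dbname=test user=app")  # made-up connection string
cur = conn.cursor()

# Transaction 1: ask for serializable for this transaction only.
cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
cur.execute("SELECT SUM(balance) FROM accounts")
total = cur.fetchone()[0]
conn.commit()

# Transaction 2: same application, but this one only asks for read committed.
cur.execute("SET TRANSACTION ISOLATION LEVEL READ COMMITTED")
cur.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
conn.commit()

conn.close()
```

The `SET TRANSACTION ISOLATION LEVEL ...` statement is standard SQL and only applies to the transaction it runs in, which is exactly the per-transaction behavior described above.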
So what happened was the standard came out in 1992, and then a few years later, in 1995, some really famous and smart database people wrote a paper that is a critique of the ANSI isolation levels. They basically said the standard as written was focused on database systems that use two-phase locking concurrency control, and all the anomalies we just talked about are defined in the context of a two-phase locking system. But there are other isolation levels, and other anomalies that can occur, if you're not using two-phase locking the way you would learn it in an intro course. So in addition to the four that I showed here, there are two other isolation levels they talk about in this paper, called cursor stability and snapshot isolation. This will be relevant for our discussion both today and next class, because snapshot isolation is what you get when you use multi-version concurrency control. We'll see what this means and what anomalies can occur that won't occur under two-phase locking. So we'll go through both of these.

Cursor stability is the default isolation level in DB2. It's kind of difficult to think about if you're used to how two-phase locking works, so throw all that out for right now; assume we're not doing two-phase locking. What happens is that the database system maintains an internal cursor that points to whatever item or object it's trying to access during a query, and it holds a lock on that item. When you're done doing whatever it is you need to do, you release that lock and move on to the next item. So it's not two-phase locking, because you're not acquiring locks in the growing phase and giving them up in the shrinking phase. Think about it on a per-query basis: if I have a single select statement, I acquire the lock on the item or tuple that I'm reading, I do whatever it is I need to do, that's my cursor lock, and when I'm done I give the lock up and move on to the next thing. So if I was doing a query that had to read two tuples, I would grab the cursor lock on the first tuple, do whatever I need to do, release it, then grab the cursor lock on the next tuple, do whatever I need to do, and release that. And no deadlock can occur, because any transaction only holds one cursor lock at a time.

It turns out that cursor stability is a stronger isolation level than read committed, but weaker than repeatable read, so it's sort of in between. In particular, it's sometimes going to prevent what's called the lost update anomaly. And I'm saying sometimes here because, again, it depends on the workload; the cursor locks aren't always going to prevent this problem, and you'll see why in a second. Let's say that we have two transactions, T1 and T2, and they're really, really simple. T1 is going to do a read on A and then a write on A, and T2 is just going to do a write on A. For this, assume we're running on a database system that only has one CPU core, which means only one transaction and one operation can ever be executing at any given time. And again, we're not going to run this under two-phase locking; we're going to run this under cursor stability. The sketch below simulates this schedule, and then we'll walk through it step by step.
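Here's a minimal sketch of that two-transaction schedule, again my own illustration rather than anything from the lecture or DB2 itself. The `CursorLock` class and all the values are made up; the interleaving is written out by hand since we're assuming a single core.

```python
# Minimal sketch: the lost update schedule under cursor stability,
# with the interleaving spelled out step by step on a single "core".

import threading

class CursorLock:
    """One lock per object. Under cursor stability it is held only for the
    duration of a single read or write, not until the transaction commits."""
    def __init__(self):
        self._lock = threading.Lock()
    def acquire(self): self._lock.acquire()
    def release(self): self._lock.release()

db = {"A": 100}          # the lone object A
lock_A = CursorLock()

# T1: read(A) -- grab the cursor lock, read, release, move on.
lock_A.acquire()
t1_a = db["A"]
lock_A.release()

# Context switch: T2 does its write(A) and commits.
lock_A.acquire()
db["A"] = 999            # T2's update
lock_A.release()
# T2 commits here.

# Context switch back: T1 writes A based on its earlier (now stale) read,
# then commits.
lock_A.acquire()
db["A"] = t1_a + 1       # clobbers T2's committed write -> lost update
lock_A.release()

print(db["A"])           # 101: T2's committed value 999 is gone

# If T1 had held the cursor lock from its read all the way through its write
# (or had used two-phase locking), T2's write would have blocked instead.
```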
So to start off, we have a single program counter, again because we have a single-core system, so we're only doing one thing at a time. T1 wants to do a read on A, so it goes and finds the object A, assume it's a tuple, grabs the cursor lock on it, does the read, releases the lock, and moves on. Now we do a context switch and T2 starts running. It finds A, grabs the cursor lock on A, updates it, and then releases that lock. Then we switch back over to T1, which does its write on A and then goes ahead and commits. So what's the problem here? The lost update anomaly: T2 finished before T1 did, but T2's update to A gets lost, because T1 is going to commit afterwards. That's the lost update anomaly. T2's write is supposed to be persisted; whatever transaction two wrote to A here is what the final value should be. But because we're only holding cursor locks per operation, T1 was allowed to write it and commit without any problems.

The way a cursor lock would help you is that when T1 does the read on A, since it's going to do a write on it later, it just holds the cursor lock until after it does the write, and then it commits. So even if we do the context switch and T2 tries to do its write on A, it would try to acquire the cursor lock on object A and it would not be able to, because T1 is holding it. That's how it avoids the lost update anomaly. Now, under two-phase locking we don't have this problem either, because T1 would take a shared lock on A and hold it. T2 would try to acquire an exclusive lock, which is incompatible, so it would have to stall. T1 would then upgrade its lock to the exclusive lock, do the write, and commit, and then T2 would be allowed to proceed. So now you see why cursor stability sits below repeatable read but above read committed. And again, this is the default in DB2. Now, the reason why it wouldn't always prevent this problem is because...