At the very start of the unit, I talked about how, in a database, the interactions of clients with the database are done in units of work called transactions. And while we haven't talked much about transactions thus far, all the SQL statements we've covered actually have to be performed in the context of a transaction. So how do you do that? How do you put things in a transaction? Well, first off, you just write start transaction, then you list your statements, whatever work you want to do. And then when you're done, you end with a commit statement, which triggers all of the changes you made to be committed to the database, that is, made persistent in the database and made visible to subsequent transactions.

Now, I did mention there are certain scenarios that might arise in the course of a transaction which require the transaction to effectively be aborted, for all of the changes you've made in the transaction thus far to be discarded. When that happens, it's called a rollback: all the changes you made get rolled back. Usually this is triggered by something in the database, which then notifies you, the client: hey, we aborted your transaction, we rolled it back. In some cases, though, you, the client, may have reason to roll back the transaction yourself, to abort it. For that purpose, you can end your transaction not with the commit statement but with the statement rollback.

Now, earlier when we discussed transactions, we talked about ACID: atomicity, consistency, isolation, and durability. And when we talked about isolation, we mentioned that in practice we don't always want perfect isolation, because it tends to degrade performance on a high-demand database, a database which is being taxed by clients making requests of it. So the SQL standard actually specifies four different levels of isolation, with the most isolated at the top, serializable, and at the bottom what's called read uncommitted, which is basically no isolation.

What characterizes the difference between these levels of isolation is the locks, if any, which these levels will acquire on the rows of a table in the course of a transaction. At the highest level of isolation, serializable, a transaction will acquire all three kinds of locks: write locks, read locks, and range locks. Locks are a tool for resolving concurrency issues, and they work by simply acquiring a so-called lock on a piece of data. That is, staking a claim on a piece of data such that all other actors who wish to use that piece of data either are prevented from using it while it's locked, or are expected, if not required, to check whether a lock has been acquired on the data before attempting to use it. So these locks are not necessarily like going into a room and locking the door so no one else can go in. It's more like hanging an occupied sign on the outside so no one else tries to come in. The difference between a write lock and a read lock is that a write lock, once acquired, prevents others from reading or writing the same data, whereas a read lock, once acquired, still lets others read the data but prevents them from writing it. And a range lock doesn't lock individual rows so much as it locks a whole range of rows, basically fixing the selection of rows matched by a query's conditions.
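To make that concrete, here is a minimal sketch of the transaction syntax just described, plus the statement the SQL standard provides for requesting one of the isolation levels just mentioned. The accounts table and its columns are purely illustrative assumptions, not from any particular schema, and the exact spelling and placement of these statements, as well as the default isolation level, vary a bit from database to database.

```sql
-- A minimal sketch of the transaction syntax described above. The "accounts"
-- table and its columns are hypothetical; some databases also accept BEGIN
-- in place of START TRANSACTION.

START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;      -- both changes become persistent and visible to later transactions

-- Ending with ROLLBACK instead discards all of the transaction's changes:
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
ROLLBACK;    -- abort: account 1's balance is left untouched

-- The SQL standard also lets the client request one of the four isolation
-- levels; exact placement and syntax vary by database.
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 1;
COMMIT;
```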
The idea of the range lock is that, say, in the course of a transaction you perform a query where you filter for the rows where column X is less than 100; you want to make sure that the range of rows selected is the same in subsequent queries of the same transaction. You don't want the selection of rows to be affected by something that happened in some other transaction that overlaps with your own. So that's what range locks are about.

To understand the effect of these different isolation levels and of these locks, it's actually easiest to think in terms of what kind of undesirable situation each isolation level avoids. There are three effects that arise if we lack isolation, and these are what are called dirty reads, non-repeatable reads, and phantom reads. And notice how they correspond to the locks. When we use all three locks, including range locks, we avoid phantom reads. When we use just write locks and read locks, we also avoid non-repeatable reads. And when we use just write locks, we avoid at least dirty reads.

So what exactly are these phenomena? Well, a dirty read occurs when, in a transaction, let's call it transaction X, we read some data which has been updated by some other transaction, let's call it transaction Y, but that data has not yet been committed by transaction Y. So transaction X is, in effect, reading data which has not really been committed: it's been modified in the database but not properly committed. This is generally undesirable because that uncommitted data might get rolled back for whatever reason. And there's also the possibility that within transaction Y, when that data was updated, there's some other piece of data that's meant to be updated along with it somewhere else in the database. So you have a consistency problem where you have this half-updated data, and transaction X might end up seeing one part of that update without seeing the other parts.

Say we have the database for a flight booking system, and in the course of an update, when some customer buys a seat, we're recording that, hey, here's this new customer, and oh, here also is the seat on the plane which is no longer available. If at the same time we have another transaction triggered by a customer trying to find out whether a seat is available, they might erroneously see that the seat is still available even though, in another table in the database, it's been recorded that this customer has purchased the seat. So this could lead to a situation where two people end up buying the same seat.

The phenomenon of a non-repeatable read is a situation in which, during a transaction, we read the same rows more than once but get back different values, even though we in our transaction haven't modified those rows. This can happen if, in another overlapping transaction, the data is being modified: the first time you read the data was before the other transaction modified it, and the second time was after. Arguably this scenario is not a huge concern, because why are you reading the same data twice? That sounds kind of inefficient; why didn't you just read it the first time and stick with that? Well, it does sometimes come up, and sometimes it matters whether or not the values are the same.
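To see what a dirty read and a non-repeatable read actually look like, here is a sketch of two sessions interleaving against a hypothetical seats table for the flight booking example. The schema, the flight, and the seat are made up, and whether each anomaly can actually occur depends on the isolation level the sessions are running at.

```sql
-- Dirty read: only possible when the reader runs at READ UNCOMMITTED.

-- session Y: a customer buys seat 12A
START TRANSACTION;
UPDATE seats SET available = FALSE WHERE flight_id = 42 AND seat = '12A';
-- ... Y has not committed yet ...

-- session X: another customer checks availability
START TRANSACTION;
SELECT available FROM seats WHERE flight_id = 42 AND seat = '12A';
-- at READ UNCOMMITTED, X may see FALSE here even though Y never committed

-- session Y: the purchase falls through
ROLLBACK;
-- X has acted on data that was never committed: a dirty read


-- Non-repeatable read: possible at READ COMMITTED, because changes committed
-- by other transactions become visible between X's two reads.

-- session X
START TRANSACTION;
SELECT available FROM seats WHERE flight_id = 42 AND seat = '12A';  -- sees TRUE

-- session Y, overlapping
START TRANSACTION;
UPDATE seats SET available = FALSE WHERE flight_id = 42 AND seat = '12A';
COMMIT;

-- session X, same query, same transaction
SELECT available FROM seats WHERE flight_id = 42 AND seat = '12A';  -- now sees FALSE
COMMIT;
```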
And as is always the case in concurrency, whether these bad reads, the dirty reads, non-repeatable reads, and phantom reads, are really a problem depends on what you're doing in your application. But there are situations where you do have legitimate cause, in the course of a single transaction, to query the same data twice, and there are situations where you don't want that data to read any differently across the multiple times you read it.

Lastly, the phantom read scenario occurs when, in the course of a transaction, you perform the same query multiple times and get back different rows. So where the non-repeatable read situation is getting back different data in the same rows, a phantom read is getting back different rows entirely. This may occur because of the conditions upon which you're filtering, the predicates: they may say something like column X less than 100. Well, if data changes, the rows in which X is less than 100 may change, so you get back different rows. And this is why, at the highest isolation level, serializable, most databases employ what are called range locks: a lock on a selection of rows such that if you perform the same query with the same semantics, you'll get back the same set of rows.

Now, to make it absolutely clear why a range lock is necessary: when you perform a query and get back a certain set of rows, even if you have a write lock and a read lock on all of those individual rows, some other transaction might modify data in other rows such that those rows then fit the criteria of your filtering predicate. They would then get included in your query, and the same query performed the next time would actually return different rows. Range locks prevent that by fixing the set of rows returned by the query, such that the next time you perform the same query, you get back the same rows.

The last thing I'll note about these concurrency issues and isolation levels is that some databases don't actually use locks; they use a different technique called multi-version concurrency control. Postgres, for example, uses multi-version concurrency control, and that's actually why, for a long time, it only provided two distinct levels of isolation: the highest level, serializable, and read committed; requesting either of the other two standard levels simply got you one of those two behaviors. The gist of multi-version concurrency control is, as the name implies, to provide a different version of the state of the database to each transaction, what's called a snapshot. So when I initiate a transaction, I get a fixed snapshot, a fixed view of the current state of the database, for my very own, and when I work with the database, I just see things in that state. When I make changes in the course of my transaction, those changes are generally given a timestamp. And then when I try to commit, there's a reconciliation process: my transaction gets reconciled with the other transactions that may have overlapped with it. At the time I commit, there may turn out to be a conflict, which may call for rolling back my entire transaction, but more often than not, it will go through.
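To round things out, here is a sketch of the phantom read scenario and of what requesting serializable isolation looks like in a multi-version database like Postgres. The orders table and its values are hypothetical assumptions for illustration, though the error code mentioned in the comments, SQLSTATE 40001 (serialization_failure), is what Postgres reports when it decides a transaction has to be rolled back and retried.

```sql
-- Phantom read: possible below serializable in a purely lock-based scheme,
-- because read/write locks protect existing rows but not the range of rows
-- a predicate covers.

-- session X
START TRANSACTION;
SELECT * FROM orders WHERE amount < 100;          -- returns, say, 3 rows

-- session Y, overlapping
START TRANSACTION;
INSERT INTO orders (id, amount) VALUES (17, 50);  -- a new row matching X's predicate
COMMIT;

-- session X, same query again
SELECT * FROM orders WHERE amount < 100;          -- may now return 4 rows: a phantom
COMMIT;

-- Under serializable isolation in Postgres, each transaction reads from its
-- own MVCC snapshot, and conflicts with overlapping transactions are detected
-- as the transaction runs or when it tries to commit:

BEGIN ISOLATION LEVEL SERIALIZABLE;               -- Postgres syntax
SELECT * FROM orders WHERE amount < 100;
UPDATE orders SET processed = TRUE WHERE amount < 100;
COMMIT;
-- If this work cannot be reconciled with an overlapping transaction, the
-- COMMIT (or an earlier statement) fails with SQLSTATE 40001, the whole
-- transaction is rolled back, and the client is expected to retry it.
```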