As I said, I can't be on campus this week; I'm still here in Europe. Right now I'm actually in Munich visiting Thomas Neumann at TUM, which is where they invented HyPer and the new database system they've been building, Umbra. So part of the reason for going to visit Thomas is to get a better understanding of why the Germans are so good at databases. As part of this, they took me up here in the Alps, somewhere in the mountains, I don't know exactly where we are, and it's freezing. The idea is that you go out in the woods and think about your thoughts for an hour or so, and somehow that's going to magically make you understand databases better. So whatever, I'm just sitting here now, it's freezing cold, there's some hunting dog running around fighting things, you may see that in the background. I thought it would be a good time to start discussing the next lecture. Let's see how far we can get before it gets too cold.

In the last class, we went through a high-level history of database systems. What I primarily focused on was thinking about the problem in terms of data models, whether it was the CODASYL stuff, the hierarchical data model, or the relational data model. That was the main discussion of last class. Another way to think about it is that an overarching theme throughout the entire history of databases is the constant struggle of building database systems that have to deal with the limitations of the hardware at the time. It was the same in the 1960s and 1970s as it is now: we're always trying to run databases on new hardware and get the best performance we can out of it. Back in the 1970s, when they built the first relational database systems, the hardware was a lot different from what we see today. At a high level the basic idea is still the same, but the limitations of the hardware are not. Back then you didn't have machines with a lot of sockets or a lot of cores; you had a uniprocessor, a single-core CPU that could really only execute one thread at a time. RAM was also severely limited and very expensive, so you were lucky to get a machine that had DRAM at all, and maybe only a megabyte of it. Now we can go into the terabytes on a single box. Because RAM was limited, you had to store the entire database on disk, and so the whole architecture of the database system was predicated on retrieving data from disk. Of course, back then disks were way slower than they are now, and sequential access was much faster than random access, so you tried to design algorithms and data structures that could maximize sequential access.

Now, in the modern era, in the 2020s, we have machines with enough DRAM that most databases can probably fit entirely in main memory. There are always going to be outliers: the Googles, Facebooks, and Amazons of the world, where the databases can be petabytes in size. But for probably 99% of applications, the database is measured in gigabytes or maybe a few terabytes. At that size, you can certainly have a database fit entirely in main memory, whether it's a distributed system or a single-node system.
To understand why databases aren't as large as you might think they are, you have to understand the difference between structured, unstructured, and semi-structured data. Structured data sets are the things we normally think about with databases: there's a well-defined schema with attributes, and every record has to have those attributes. These data sets are typically smaller. Unstructured data sets are things like video files, sound files, or images, where the contents of every single record in the data set are different, or there isn't really any structure at all; you can't run queries directly on video data. Semi-structured data would be things like log files that are generated to be human readable but don't have the exact same fields in every single log record; you can parse them and extract fields, but some log records will have some fields and others will have different ones. Unstructured and semi-structured data sets are typically the larger ones. Think of YouTube: the MySQL database they use (managed through Vitess) is the structured database, and it's going to be a fraction of the size of all the video they're storing, the unstructured stuff. For our purposes in this class, we're primarily going to focus on structured data, because that's really the only data you can apply the techniques we're talking about to. There's no magic way to do vectorized query execution on video data; you have to convert the video data into structured data and then run your queries on it. So that's going to be our focus here.

The other thing we need to understand is how we're going to get the best performance in a database system running on structured data. You may think: if most databases can fit entirely in main memory, can I take a traditional disk-oriented database system, like the one I described last semester in the introduction class, run it on a machine with enough DRAM, and set the buffer pool size to be big enough that everything fits in main memory? Is that going to be enough to get the best performance? The spoiler is no, and we need to understand why. So for today's lecture, seeing how far we can get before it gets too cold, we're first going to discuss what a disk-oriented database system is, so that when we start talking about in-memory database systems we understand how they avoid the bottlenecks and slowdown issues of a disk-oriented system. Then we're going to go a little deeper into concurrency control and see what bottlenecks we have to overcome to get these concurrency control protocols to run well in in-memory systems.

The definition I always like to use for a disk-oriented database system is one where the architecture of the system is predicated on the assumption that the primary storage location of the database, meaning all the contents of the database: the records, the indexes, the materialized views, everything we have to store, is on some kind of non-volatile storage, whether that's a spinning hard disk drive or an SSD.
That means the database is designed based on this assumption, and therefore all the algorithms and data structures inside of it have to be aware that at any time we could have to go get something from disk. The database itself is organized into a set of fixed-length pages, sometimes called blocks, and then we use an in-memory buffer pool manager to cache the pages in memory as we retrieve them from disk. Again, we're assuming a von Neumann architecture, which means we can't operate directly on data as it exists on disk. That can change with some modern hardware, but for our purposes here we'll assume that's not the case. Therefore, this buffer pool manager is really all about deciding how to move data back and forth between disk and memory as needed, and we want to do it in a way that minimizes stalls, the number of times we have to go out to disk. This buffer pool is the key thing that differentiates a disk-oriented database system from an in-memory database system in terms of architecture.

So what happens when a query executes in our system? Any time it accesses a tuple, we have to find the page the tuple is located in. The database system checks whether the page the query needs is already in memory. If it is, we just hand back a pointer to where it's located in our buffer pool and the query can execute. If it's not, then we need to do some extra work. First, we have to decide which frame we want to copy the page into in our buffer pool. A frame is just a location in the memory allocated to the buffer pool manager; we use this term to differentiate it from a page or a block because the same frame gets used over and over again for different pages. If there's a free frame, our job is easy: we pick that frame, fetch the page from disk, copy it in, and we're done. But if there are no free frames, then we have to evict a page from an existing frame, and that's when things get complicated. If the page we want to evict was never modified by a transaction, meaning it's not dirty, our job is easy: we just drop it and reuse the frame. But if it is dirty, then we have to write it out to disk and flush it, and once that's done, we can reuse the frame. So you can see why this gets complicated: we're running LRU or clock or whatever eviction policy you want to use to decide which frames to evict, and we have to balance that against what other queries or transactions are running at the same time and which pages are dirty. This all gets very complicated. Now, once the page is in memory, we can try to be smart and quickly translate any on-disk references to that page into an in-memory address, so that maybe we don't have to go through the full process of checking the buffer pool manager every single time, but not every system actually does that.

At a high level, what I just described looks like this. For our purposes here, say we're running a query that does a lookup in an index; the index wants to find a record, the record is inside a page, and we have to go fetch it from disk.
To simplify our discussion, we'll just say the index is not backed by buffer pool pages; it's sitting entirely in memory. In most systems that's actually not true: the index pages themselves would be backed by the buffer pool manager, so we'd have to check whether those pages are in memory as well and go through this entire process, but we'll keep it simple and say the index is entirely in memory. The first thing we do is a lookup on our index to find our record, and what the index gives us back is a page ID and a slot number. We can then use that page ID to do a lookup in a page table and find the location of the page we're looking for. Say we're looking for page one: either we don't find an entry in our page table, or we see an entry that says it's not in memory, it's on disk, and here's where to find it on disk. In order to bring it into memory, we have to pick an existing page to evict. We have to latch the page table to make sure nobody else is trying to bring a page in at the same time we are, and then we pick one of the cached pages to evict. Say we pick page two, but page two is dirty, so now we have to write it out to disk and flush it, and once that's done, we can use the now-free frame to copy in page one. At this point we update our page table to say: if you're looking for page one, here's the frame in the buffer pool where you can find it. Once that's done, we release our latches and anybody can access it. This is a gross simplification of how the eviction process works; for example, I'm not showing that if there was an entry for page two, you'd also have to update it to say that page two is no longer in memory, it's on disk. But at a high level, every single database system that uses a buffer pool is doing something similar.

So what's the problem with this? If we go back to our example and give our database system a lot of memory so that everything fits in memory, we're still going through this entire process of looking up the page table and translating a record ID to its memory location every single time we access a tuple, and we still have to take latches to protect things, because we don't want to be accessing a page while another thread tries to evict it. But as I said, if we have enough memory, then we're never actually going to write anything out to disk, so pinning these pages and latching them is not necessary at all; it's just wasted work. And running the eviction policy and updating all the internal metrics about how pages are being accessed is wasted work too, because nothing is ever going to be evicted. This answers my strawman question from the beginning: can't we just give a traditional disk-oriented database system a lot of memory, and wouldn't that be enough to get the same performance as an in-memory system? The answer is no, because you're doing all this extra work just to access a single page. And this cascades into other issues as well, like concurrency control.
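To make that page-fetch path concrete, here is a minimal sketch of what a buffer pool's fetch logic roughly looks like. None of this code is from the lecture; all of the names (BufferPool, Frame, fetch_page, and so on) are made up for illustration, and a real system adds page pinning, reference counting, dirty-page tracking on writes, and a real eviction policy instead of the toy round-robin one here.

```c++
// Minimal sketch of the page-fetch path described above (illustrative only).
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

using PageId = uint64_t;

struct Frame {
    PageId page_id = 0;
    bool   dirty   = false;
    char   data[4096];                                       // fixed-length page
};

class BufferPool {
public:
    explicit BufferPool(size_t num_frames) : frames_(num_frames) {}

    // Returns a pointer to the in-memory copy of the requested page.
    Frame* fetch_page(PageId pid) {
        std::lock_guard<std::mutex> latch(table_latch_);      // latch the page table
        auto it = page_table_.find(pid);
        if (it != page_table_.end())
            return &frames_[it->second];                       // hit: already cached

        size_t victim = pick_victim();                         // eviction policy picks a frame to reuse
        Frame& f = frames_[victim];
        if (f.dirty)
            write_to_disk(f);                                  // flush the dirty victim before reuse
        page_table_.erase(f.page_id);                          // evicted page is no longer cached

        read_from_disk(pid, f);                                // copy the requested page into the frame
        f.page_id = pid;
        f.dirty   = false;
        page_table_[pid] = victim;                             // record the new mapping
        return &f;
    }

private:
    size_t pick_victim() { return clock_hand_++ % frames_.size(); }   // toy "policy" (not LRU/clock)
    void   write_to_disk(Frame&)          { /* disk write omitted in this sketch */ }
    void   read_from_disk(PageId, Frame&) { /* disk read omitted in this sketch */ }

    std::mutex table_latch_;
    std::unordered_map<PageId, size_t> page_table_;            // page id -> frame index
    std::vector<Frame> frames_;
    size_t clock_hand_ = 0;
};
```

Even in this stripped-down form you can see the wasted work for an in-memory workload: every tuple access pays for the latch, the hash table lookup, and the eviction bookkeeping, even when the page is always going to be a hit.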
Traditional disk-oriented systems are designed such that at any time a transaction could touch data that's not in memory and only on disk, and therefore that transaction has to stall while the disk manager goes and fetches it. While it's stalled, other transactions can run on other threads, or on the same thread, so that the system keeps making forward progress even though one transaction got stalled waiting for disk. We do this because otherwise, if we could only execute one transaction at a time and stalled any time we had to get something from disk, the system would look unresponsive, since it would stall on every page miss. Because we're now allowing other transactions to run at the same time, we have to use a concurrency control protocol, which could mean setting locks on records or other objects in the database, to make sure we provide the ACID guarantees you want from transactions. If a transaction modifies a page and that page gets written out to disk before the transaction commits, because some other transaction evicted that page, then we have to keep track of extra information so that if we crash and come back, the uncommitted transaction's changes don't persist. In a disk-oriented system that uses locking, the locking information is maintained in separate data structures, an in-memory hash table in the lock manager, to avoid the lock information getting swapped out to disk. That way, I don't have to go to disk just to figure out whether I can even acquire the lock on a tuple; the lock information is always in memory.

The other differences for an in-memory system are with logging and recovery. Most database systems that use a buffer pool manager use the steal + no-force buffer pool policies. This basically means that all the modifications a transaction makes have to get added to a write-ahead log, and those log entries have to be flushed to disk before the transaction is allowed to commit. And for any update to a dirty page, the log records corresponding to those updates have to be written to disk before the dirty page itself can be written to disk. These are things we covered in the introduction class last semester. Now, in an in-memory system, we don't have dirty pages anymore, so maybe we don't need the exact same protocol, and maybe our log entries don't need to store the exact same information as in a disk-oriented system. If no dirty pages are ever going to get written to disk, then maybe it doesn't make sense to store the before-image, the undo information, for a transaction's modifications, because that dirty page never gets written out. So there's a lot of extra stuff we have to do in a disk-oriented system that maybe doesn't make sense anymore in an in-memory system, like keeping track of log sequence numbers and maintaining undo information, because, again, dirty pages don't get written to disk when everything fits in memory.
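To make those two write-ahead-logging rules from the disk-oriented world concrete, here is a minimal sketch under the steal + no-force policies just described. The names (LogManager, flush_up_to, page_lsn) are hypothetical; a real ARIES-style implementation tracks far more state, but these are the two ordering constraints the lecture is referring to.

```c++
// Sketch of the two WAL rules under steal + no-force (illustrative names only).
#include <cstdint>

using Lsn = uint64_t;

struct Page { Lsn page_lsn = 0; bool dirty = false; /* ... page data ... */ };

struct LogManager {
    Lsn flushed_lsn = 0;                               // highest LSN known to be durable
    Lsn append(/* log record */) { /* append to in-memory log buffer */ return ++next_lsn_; }
    void flush_up_to(Lsn lsn) { if (lsn > flushed_lsn) { /* fsync the log */ flushed_lsn = lsn; } }
private:
    Lsn next_lsn_ = 0;
};

// Rule 1: a transaction's log records (including its commit record) must be durable
// before the transaction is acknowledged as committed. We force the log, not the
// data pages (no-force).
void commit_txn(LogManager& log) {
    Lsn commit_lsn = log.append(/* <COMMIT> record */);
    log.flush_up_to(commit_lsn);
    // ... report success to the client ...
}

// Rule 2: before a dirty page is written back early (a "steal"), the log records
// describing its changes must already be on disk.
void flush_dirty_page(LogManager& log, Page& p) {
    log.flush_up_to(p.page_lsn);                       // log first, then the page
    /* ... write the page to disk ... */
    p.dirty = false;
}
```

The point for this lecture is that the second rule, and the undo information it exists to protect, only matters because dirty pages can reach disk before commit; once nothing is ever stolen, that machinery is candidate for removal.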
All right, so to get a better understanding of what this overhead actually is, there's a study that was done at MIT in 2008, so over ten years ago now, where they took an OLTP database system and instrumented it so they could measure the number of instructions the system was spending in different components during query execution while running TPC-C. The idea is to break the system down into the components I've been talking about and measure how much time is spent in each of them. This is for a database where everything fits in memory: nothing needs to be evicted, nothing gets flushed to disk in the write-ahead log, so this is just the cost of accessing data that's in memory using a disk-oriented architecture. The first overhead is the buffer pool manager: about 34% of the CPU instructions are spent doing lookups and updates in the page table and maintaining all the metadata for the eviction policy. 14% of the instructions are spent on latching; this could be for internal data structures such as the page table or the lock manager, any low-level construct that needs to be protected. 16% of the instructions are spent on locking; this particular system was called Shore, and it uses two-phase locking, so this is the overhead of updating lock information for transactions while they run. 12% of the instructions are spent in the log manager; this is not the cost of writing to disk, it's the cost of constructing the log records that will get written out. Then 16% of the instructions are spent comparing keys and doing traversals in the B+tree; this is basically unavoidable, it's just the cost of finding the record you want through the B+tree. That leaves a paltry 7% of the CPU instructions doing what they would call real work: executing the logic of the transactions, getting back the data, performing commit operations, things like that. So again, this shows that if you take a disk-oriented system and give it all the memory it wants so that everything fits in memory, you're not going to get the best performance, because you still pay the penalty for all this internal architecture that assumes data could be on disk, and all the protection mechanisms built for that assumption are not actually necessary.

So, the battery on my tablet shut down earlier today when I was recording this outside; it got too cold and the whole thing just shut down. It was like 20 degrees Fahrenheit, which is around negative seven degrees Celsius. That was kind of weird. So now I'm back inside in this weird German, I don't know, deprivation chamber thing that they have here. Again, I'm supposed to come in here and think deep thoughts about databases and somehow become a better programmer, like a German. I don't know if that's going to work. Let's just continue with the lecture.
Where we left off just now, we were talking about how disk-oriented systems, because they make the assumption that the primary storage location of the database could be on disk, have all this architecture set up so that any time you read something, you have to check whether it's on disk, and if it is, you stall that thread and do other stuff. Now we're going to switch over and talk about in-memory database systems. This is one where the system assumes that the primary storage location of the database is always, permanently, in memory. That means that any time a transaction or a query goes and reads a tuple, it can assume the tuple is in memory, and therefore it doesn't have to go through the page table and the buffer pool and check anything; it just reads or writes whatever it wants. That's not to say that everything we're going to talk about this semester only targets in-memory databases, but we'll see that certain design decisions become easier if we make this assumption.

The idea of an in-memory database is not new. The first proposed systems go back to the 1980s, but they've only really become a viable option in roughly the last ten years, because DRAM prices and capacities have reached the point where we can store really large databases in memory. So there were these initial ideas proposed in the 1980s, and we'll talk a little about them as we go along, but the first commercial in-memory databases didn't really come out until the 1990s. The three most famous are probably TimesTen, which was originally called Smallbase, then forked off from HP and became TimesTen, and Oracle bought them around 2005; DataBlitz, a system from Bell Labs in the 1990s that was originally called Dali and was sold for telecom switches and things like that, which I don't think is still around today, or if it is, it's in legacy mode; and Altibase out of South Korea, another of these early in-memory databases, which is actually still around today, and within the last two or three years they open-sourced it, so you can go check it out on GitHub.

So even though the database is now entirely in main memory, we're still going to organize data into blocks and pages. There are no slotted pages, though, because we don't have to worry about indirection within a page itself. The system architecture is slightly different now, because instead of dealing with record IDs we can deal with direct memory pointers. The way we handle fixed-length versus variable-length data is also slightly different from a disk-based system, again because we don't have slotted pages. And although not many systems do this, some do: there's a concern that, because everything is in memory, any thread can write to anything in the address space of the process. So to make sure we don't have error-prone software corrupting our data and causing permanent damage, we can use checksums on these blocks throughout the system to keep track of the status of a page and detect these kinds of errors.
Underneath the covers, the operating system and the hardware are also organizing memory into pages. That's not really going to be our concern for most of this semester; we'll talk about it a bit later in a few more lectures, because you need to understand how the OS and hardware actually lay out data so that when we lay out our own data we can align with the underlying page representation. For our purposes now, we can ignore it.

So let's return to our high-level example, where a query wants to access a tuple and goes through an index to look it up. Now, instead of returning a page ID and an offset, the index gives us back a block ID and an offset, and this block ID could either be the direct memory address of a fixed-length block, or there could be an additional mechanism that lets us convert that block ID into a memory location. The primary storage location of the database is memory, every tuple is in memory, but we organize tuples into fixed-length records. For this it doesn't matter whether we're assuming a row store or a column store; the basic idea is the same: we have a set of blocks that store the fixed-length data for each tuple. Anything like ints, dates, floats, reals, all of that can be stored in the fixed-length block, which means the size of every tuple in the block is the same. So once we've converted the block ID to a memory address and we want to find the tuple at some offset within that block, we just do simple memory arithmetic: take the size of the tuple, multiply it by the offset, and that tells us where to jump within the block to retrieve it.

To handle variable-length data, we do something much different from a disk-based system. Instead of storing the data inline in the fixed-length data block, most of the time we store a pointer to some other memory location in a variable-length data pool, which points directly to the data corresponding to that attribute of the tuple. Again, the idea is that we can guarantee that all the tuples in the fixed-length data blocks are fixed length, and anything variable length gets shoved into the variable-length data pool. This is different from the slotted-page design you'd see in a disk-oriented system: there, we're trying to reduce the number of disk reads, so we try to pack all the variable-length data for a tuple in with the tuple itself, with all the fixed-length data. That doesn't always happen, you can get spillover to another page, but most of the time we try to make it happen. In the in-memory world, we actually want to store the variable-length data separately, so that we can do these deterministic lookups to find the memory addresses of tuples.
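As a concrete illustration of that layout, here is a minimal sketch of fixed-length blocks addressed with plain pointer arithmetic, plus an out-of-line variable-length entry. The struct names, field choices, and block size are all made up; a real system also packs headers, null bitmaps, and versioning metadata into the tuple.

```c++
// Sketch of the in-memory tuple layout described above (illustrative only).
#include <cstddef>
#include <cstdint>

struct VarlenEntry {              // lives in the variable-length data pool
    uint32_t    size;
    const char* data;             // the actual bytes, stored elsewhere in the pool
};

struct Tuple {                    // fixed-length portion only; every Tuple is the same size
    int64_t      id;              // e.g., an integer attribute
    int64_t      balance;         // another fixed-length attribute
    VarlenEntry* name;            // pointer into the variable-length pool for a VARCHAR attribute
};

struct Block {
    static constexpr size_t kTuplesPerBlock = 256;
    Tuple tuples[kTuplesPerBlock];
};

// Given a block's base address and a tuple offset, finding the tuple is just arithmetic:
//   address = block_base + offset * sizeof(Tuple)
inline Tuple* get_tuple(Block* block, size_t offset) {
    return &block->tuples[offset];
}
```

Because every slot in the block is the same size, the lookup is a multiply and an add; there's no slot array to consult, which is exactly the deterministic addressing the lecture is describing.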
All right, there are some other things that are going to be different with in-memory databases that I want to touch on quickly, and which we'll come back to throughout the entire semester. The first one is how we're going to store indexes, or rather what data structures we're going to use for indexes. When the first in-memory database systems were proposed in the 1980s, the hardware was a lot different from the hardware we have today, or at least from how hardware is laid out today. In particular, back then cache and memory access speeds were about the same, but now that's not the case at all: CPU caches are way faster than main memory. Back then, people designed data structures assuming that reading from cache was the same as reading from memory, so they organized them a certain way; in the modern era we don't want to do that. We want indexes that know they're dealing with memory and caches and try to minimize cache misses when we access things. The spoiler is that a B+tree turns out to be the best data structure to use for an in-memory database: even though B+trees were originally designed for disk-based databases, they're still really good for in-memory databases as well.

The other major difference with indexes is that in a disk-based system, you would also write log records and write out index pages to disk so that you can recover them after the system restarts. In an in-memory system, most systems don't record any log records for the index or write index pages out to disk, because the cost of rebuilding the index after restarting the system is super low. Think about it: when I restart an in-memory system, I have to bring the entire database back into main memory anyway, and reading that data from disk is what's expensive, whereas building the index once the data is already in memory is cheap, because it's just CPU computation. For this reason, and to avoid having runtime logging of index updates slow us down, we just rebuild the index after we restart. We'll cover this in more detail when we talk about indexes.

For query processing: in a disk-oriented system, disk I/O is always the most expensive thing, so for the most part nobody cared how you computed on the data or organized access to it once it was in memory. In an in-memory system, we now care about the overhead of function calls and branches and things like that, so we need to be more careful about how we organize the system and do query processing. Sequential scans are also not significantly faster than random access in an in-memory system, so maybe there are algorithms and join methods where we don't have to worry about maximizing sequential access, because random access is good enough. Again, we'll cover this in more detail as we go along.

For logging and recovery, I mentioned this already, but now that everything's in memory there are no dirty pages to flush out to disk. We can use a more lightweight scheme and end up recording less data than we would normally need in a disk-oriented system, because we know everything is in memory.
Standard techniques like group commit, where you batch log records to amortize the fsync cost, are applicable to a disk-oriented system too, but being able to use a more lightweight logging scheme is a definite advantage for an in-memory system, because again there are no dirty pages, we don't need undo, and anything we write to disk as part of a checkpoint is, for the most part, already from a committed transaction.

All right, so if disk I/O is no longer the slowest resource, then we can change what protocols, methods, and algorithms we use to process queries and transactions in our database system. But we need to be mindful of the other bottlenecks that come to the forefront once disk I/O is no longer on the critical path. This is essentially what the entire semester is about: how to deal with these other issues when we design database systems. Locking and latching: the concurrency control methods and the low-level primitives that protect our data structures. Cache line misses are a big deal, because a memory access is much more expensive than a cache access; chasing pointers and jumping to random locations in memory causes more cache misses, and that becomes problematic. Evaluating predicates: taking the WHERE clause and evaluating it for a billion tuples is expensive. Moving and copying data: data movement would be having to move data from one CPU socket to another, which is expensive; copying data would be materializing intermediate results or copying data between sockets to pass things around. These are things we want to avoid. And of course networking is always a big issue, primarily between the application and the database system rather than between different nodes of the same database. It matters for a disk-oriented system too, but now that the disk is gone, it's even more prominent in an in-memory system.

For the rest of this lecture I want to discuss locking and latching, so we're going to focus on concurrency control, mostly to set us up to understand, going forward, how to think about what makes a database system slow, and what other bottlenecks and problems we're going to have as we scale up to more CPU cores. As we covered in the introduction class, concurrency control is essentially the protocol the database system uses to execute multiple transactions at the same time, where each of those transactions has the illusion that it's executing on the system by itself, so it doesn't have to worry about seeing the effects of other transactions running at the same time. This is what you want your system to achieve, because it's the easiest way to program your application. Of course, this is not easy to do: if you interleave operations between different transactions, you can violate this ordering. So the concurrency control protocol of any database system provides the atomicity and isolation guarantees in the ACID acronym.

For an in-memory database system, there's a key observation we have to make about how we acquire locks to access tuples, and that is that the cost of a transaction acquiring a lock is essentially the same as the cost of just accessing the data.
What I mean by that is this: in a disk-oriented system, we said that all the locks are stored in memory in some kind of data structure, separate from the actual tuples. But if everything is in memory, then the cost of going and accessing the lock table is the same as the cost of accessing the tuple itself. So ideally I want to be able to do both at the same time, so I'm not paying the penalty of two memory reads, and this is the core idea we're going to try to exploit when we do concurrency control in an in-memory database system. The other important thing to understand is that in a disk-oriented system, the stalls are due to transactions trying to access data that isn't in memory and having to go out to disk to get it; now we're not going to have those kinds of stalls anymore. Yes, there will be memory stalls, but those are much, much shorter than disk stalls. What we do have, which the earlier systems didn't, is way more cores, and so now the contention in the system comes from many transactions trying to read and write the same objects at the same time. They're not stalling because of a disk; they're stalling because they can't acquire the locks on certain things.

To understand how we're actually going to maintain locking information along with the tuple, we need to understand the basic hardware primitive we're going to use to modify things. We're not going to use mutexes to protect tuples, because they'd be too slow; instead, we're going to use an atomic operation called compare-and-swap. I think this is covered in other classes, so I just want to give you a quick overview of what it is, because it's going to come up multiple times throughout the semester and it's good to see it once. We're not going to go into detail about how it's implemented in hardware; just know that the concept exists. Compare-and-swap is an atomic instruction that pretty much every modern CPU will provide you.
It's not a new concept; I think it goes back to the 1970s. Every Xeon, ARM, or PowerPC architecture will give you this. The idea is that it's a single instruction that looks at a memory location and checks whether that memory location holds a certain value you provide. If the value in that memory location is equal to the value you're comparing against, then you're allowed to install a new value; otherwise, the operation fails. In this example, __sync_bool_compare_and_swap is a GCC built-in you would use from C or C++ on Linux; different platforms have different intrinsics, and modern code has easier-to-use functions that do the same thing, but the idea is this: we give it a memory address, a compare value, and a new value. The memory address that M points to currently contains the value 20, so in a single instruction we check whether the memory location equals 20, and if it does, we install the new value 30. In this case we check that M equals 20, it does, and so we install 30; otherwise the operation would have failed. In this example, the function returns true or false to say whether it succeeded. There are different flavors of compare-and-swap instructions: some return the value that was actually in the memory location, and you can change the width of the memory you're operating on, 32 bits or 64 bits, but the basic idea is always the same. Within a single instruction we can do this check-and-update, and this is a core primitive we're going to use all throughout the semester to do these kinds of lock-free or latch-free operations very efficiently.
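Here is the same behavior written with the portable std::atomic interface rather than the GCC __sync built-in; this is just a sketch of the semantics, and this same primitive is what lets us, for example, flip a lock bit embedded in a tuple's header without taking a mutex.

```c++
// Compare-and-swap via std::atomic (portable equivalent of the __sync built-in).
#include <atomic>
#include <cassert>

int main() {
    std::atomic<int> m{20};

    int expected = 20;
    // Atomically: if (m == expected) { m = 30; succeed } else { expected = m; fail }
    bool ok = m.compare_exchange_strong(expected, 30);
    assert(ok && m.load() == 30);

    // A second attempt with a stale expected value fails and leaves m untouched;
    // 'expected' is updated to the value that was actually in the memory location.
    expected = 20;
    ok = m.compare_exchange_strong(expected, 40);
    assert(!ok && expected == 30 && m.load() == 30);

    return 0;
}
```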
Now I want to quickly talk about the two categories, or classes, of concurrency control protocols we're going to be working with this semester. This is a refresher from what we covered in the introduction class last semester. The first is two-phase locking. These are pessimistic schemes, where the database system assumes that transactions are going to conflict, and therefore they have to acquire locks on any objects before they're allowed to access them. Timestamp ordering is an optimistic scheme, where you assume conflicts are rare, so you don't require transactions to acquire locks on database objects; instead, when the transaction goes to commit, you check whether there was a conflict and correct things as needed. Let's go through each of these one by one with really high-level examples, and then we'll get into the paper, which discusses how you actually implement them on modern systems.

Here's a simple example for two-phase locking. We have a transaction T1, and it wants to do a read on A followed by a write on B. Under two-phase locking, we have to acquire the lock for any object we want to read or write, so in this case we have to get the lock on A followed by the lock on B. This is a simplified example because there's only one type of lock on A and B; in a real system, as we covered last semester, you would have different lock modes: a shared mode so that multiple transactions can read the same object, and an exclusive mode so that only one transaction can write it at a given time. With two-phase locking, the first part of the transaction is called the growing phase, where we acquire the locks we're going to need during the execution of the transaction. As soon as we release one lock, we're in the shrinking phase: we're not allowed to acquire any new locks, but we can still operate on the objects we hold locks for. In a real system based on SQL, you wouldn't actually have explicit lock and unlock commands; the system does this for you automatically under the covers. Typically you don't release the locks until the transaction actually commits, which is called rigorous two-phase locking, but for our purposes in this example we're not doing that: we can unlock A and then do the write on B, and that still follows the original two-phase locking protocol.

All right, say now another transaction T2 comes along, and it wants to do a write on B followed by a write on A. Say these two transactions are running at the same time on different threads, so they can do these things in parallel with each other. T1 first gets the lock on A, and T2 gets the lock on B; this is fine, assuming no other transaction is running at the same time, so both transactions can acquire those locks. In the next step, T1 does the read on A, which it's allowed to do because it holds the lock on A, and T2 does the write on B, which it's allowed to do because it holds the lock on B. But now we get into trouble: T1 wants the lock on B, and T2 wants the lock on A, but each of these is held by the other transaction, so they both have to stall, essentially waiting for the other transaction to give up its lock so they can make forward progress. Of course, we have a deadlock here, and now we have to do something to break it, because otherwise the system would be stuck forever.

There are two ways to handle deadlocks in two-phase locking. The first is deadlock detection: you have a separate background thread that periodically wakes up, checks the transactions that are running, and if it finds a deadlock, uses some kind of heuristic to decide which one to kill, for example the transaction that has done the least amount of work, or is the oldest, or holds the most locks; different systems use different deadlock detection algorithms. Deadlock prevention, instead of having a separate thread, has a rule so that when a transaction tries to acquire a lock and can't get it, it decides what to do other than just waiting. For example, if a transaction tries to acquire a lock that is held by somebody else, it can either kill itself or kill the other transaction and steal its locks. We just have to make sure we always make these decisions in a consistent order so that there are no cyclic dependencies.
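As a rough sketch of what those two deadlock-prevention choices look like on a single per-tuple lock word using compare-and-swap, under the simplifying assumption of one lock mode (real systems have shared/exclusive modes and wait queues), something like the following. The names NO-WAIT and WAIT-DIE match the variants evaluated later in the lecture; everything else here is made up, and transaction timestamps are assumed to be non-zero.

```c++
// Sketch of the no-wait and wait-die deadlock-prevention policies (illustrative only).
#include <atomic>
#include <cstdint>

constexpr uint64_t kUnlocked = 0;      // 0 means "no owner"; txn timestamps start at 1

struct TupleLock {
    std::atomic<uint64_t> owner_ts{kUnlocked};   // timestamp of the owning transaction
};

enum class Result { Acquired, AbortSelf };

// NO-WAIT: if the lock is held by anyone, abort immediately instead of waiting.
Result acquire_no_wait(TupleLock& lock, uint64_t my_ts) {
    uint64_t expected = kUnlocked;
    return lock.owner_ts.compare_exchange_strong(expected, my_ts)
               ? Result::Acquired
               : Result::AbortSelf;
}

// WAIT-DIE: an older transaction (smaller timestamp) may wait for a younger holder;
// a younger transaction requesting a lock held by an older one aborts ("dies").
Result acquire_wait_die(TupleLock& lock, uint64_t my_ts) {
    for (;;) {
        uint64_t expected = kUnlocked;
        if (lock.owner_ts.compare_exchange_strong(expected, my_ts))
            return Result::Acquired;
        if (my_ts > expected)          // holder is older than us -> we die
            return Result::AbortSelf;
        // Holder is younger than us -> we are allowed to wait; spin and retry.
    }
}
```

Because priority always flows one way (older waits, younger dies), a cycle of waiting transactions can never form, which is why no separate detection thread is needed.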
All right, the other type of protocol is called timestamp ordering. This is a category of schemes that use timestamps instead of locks to figure out the order in which transactions should be allowed to complete. The two basic protocols are, first, basic timestamp ordering, which is confusing because it's usually just called the timestamp ordering protocol even though it sits inside the broader category of timestamp ordering protocols, so I'll call it Basic T/O to differentiate it. The idea is that we check for conflicts on every read and write, using timestamps to determine whether there is a conflict, and we copy tuples into a private workspace for each transaction as they read them, to ensure that if the transaction goes back and reads the same tuple it gets the same value; otherwise it could end up reading something that was written in the future, which shouldn't happen. Optimistic concurrency control is where, in addition to copying the things you read into your private workspace, you also make copies of any tuples you modify, and all your writes go into the private workspace; then, when you go to commit, you run a validation that checks whether there were conflicts with other running transactions, and if not, you merge your private workspace changes back into the global database. We'll go through each of these one by one.

For the basic timestamp ordering protocol, say we have a transaction T1 that wants to do a read on A, a write on B, then some kind of stall, and then a write on A. When a transaction starts, we assign it a timestamp, because we're going to use that to determine the serial ordering of transactions. There are different schemes you could use: a hardware clock, a logical counter, or a hybrid of the two, but for simplicity let's say we use a logical counter, and we increment that counter every time we start a transaction and assign the new timestamp to it. So this transaction starts and gets timestamp 1001. Now, inside the database, for every single record we store, we maintain two additional fields: the read timestamp and the write timestamp. The read timestamp is the highest timestamp of any transaction that successfully read this tuple, and the write timestamp is the timestamp of the last transaction that wrote to this tuple. The idea is that these timestamps always move forward in time; they can never go backward, because that would violate the timestamp ordering of transactions.

So our transaction starts and does the read on A. The first thing it needs to do is check A's write timestamp to see whether it's greater than ours, meaning we would be reading something from the future that we shouldn't be allowed to see. In this case, the write timestamp for record A is 1000 and our timestamp is 1001, so we're allowed to read it. Then we check the read timestamp to see whether it's less than ours, and if it is, we update it, so in this case we set it to 1001. This tells other transactions that may want to update this tuple that a transaction with timestamp 1001 has read it, so they must not write something into the past that this transaction would have missed.
Then we do the write on B. Same thing: we first check the write timestamp to see whether it's in the future relative to where we are, in which case we would be overwriting future data with past data, which is not allowed, and then we check the read timestamp to make sure nobody in the future has already read this record, because if we wrote to it now, they would have missed our write. In this case our timestamp checks out for both, so we go ahead and update the write timestamp field. Then say our transaction stalls for some reason, maybe it's computing the one-billionth digit of pi, maybe it's accessing some remote system; during this stall, another transaction comes along, modifies record A, and updates its write timestamp to 1005. Now we have a problem: when our transaction wakes up and tries to write to record A, that write timestamp of 1005 is greater than our timestamp of 1001, so we shouldn't be allowed to do this, because we would be overwriting a logical record that was updated in the future with a value from the past. This violates the timestamp ordering, so our transaction has to get killed and aborted, and we roll back any changes it made.
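Here is a minimal sketch of the read and write checks we just walked through. It omits the private read copies and the rollback machinery, and the names and structure are purely illustrative.

```c++
// Sketch of the Basic T/O read/write timestamp checks (illustrative only).
#include <algorithm>
#include <cstdint>
#include <string>

struct Record {
    uint64_t    read_ts  = 0;   // highest timestamp that has successfully read this record
    uint64_t    write_ts = 0;   // timestamp of the last transaction that wrote it
    std::string value;
};

// Returns false if the transaction must abort.
bool read_record(Record& r, uint64_t txn_ts, std::string& out) {
    if (r.write_ts > txn_ts)                     // record was written "in our future"
        return false;                            // abort
    out = r.value;
    r.read_ts = std::max(r.read_ts, txn_ts);     // advance the read timestamp
    return true;
}

bool write_record(Record& r, uint64_t txn_ts, const std::string& new_value) {
    if (r.read_ts > txn_ts || r.write_ts > txn_ts)   // a future txn already read or wrote it
        return false;                                // abort
    r.value    = new_value;
    r.write_ts = txn_ts;
    return true;
}
```

In the example above, the write on A fails precisely because write_record sees write_ts = 1005 > txn_ts = 1001.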
Most systems out there don't actually use the basic timestamp ordering protocol; the more common approach is OCC, optimistic concurrency control. Again, the naming is confusing, because timestamp ordering protocols are themselves optimistic schemes, but there is a protocol specifically called the optimistic concurrency control protocol. OCC is an old protocol; actually, most of the basic concurrency control protocols are old. This one goes back to 1981 and was invented here at Carnegie Mellon by a professor, H.T. Kung. Kung is no longer here, he's now at Harvard, and he's not even a database professor, he does networking, but this is probably his most famous work, a database paper, so it's kind of cool that it came out of the computer science department here at CMU.

The basic idea is that, like basic timestamp ordering, we maintain a private workspace and copy all the data we read into it, so that if we go back and read the same record we get the same value; but we also write all our changes into that private workspace. When a transaction commits, we verify that there are no conflicts, and if there aren't, we install the changes into the global database.

So again, here's a simple transaction: it wants to do a read on A, a write on A, and a write on B. In our database now, we don't need the read timestamp field on records; we only need the write timestamp field. When our transaction starts, unlike in the basic timestamp ordering protocol, we don't assign it a timestamp yet; we do that later. Any time we read or write something, we make a copy of it in our private workspace and set its timestamp to infinity. OCC has three phases. The first one is, unfortunately, called the read phase; even though we also do writes in this phase, for whatever reason they call it the read phase of the transaction. If I had my choice I'd probably call it the run phase or the execute phase, which I think makes more sense, but it's called the read phase. So we do the read on A in the read phase: we copy that record from the global database into our private workspace, so that we can always re-read it and see the same value. Now when we do the write on A, we're not modifying the global database; we're modifying the copy in our private workspace. We don't have a write timestamp yet because we haven't been assigned one, so we set it to infinity in our workspace and update the value. Same thing for B: we first copy it from the global database into our private workspace, then update it with the infinity timestamp and the new value.

Now, when the transaction goes to commit, it doesn't commit right away; there are two additional phases, the validate phase and the write phase. The validate phase is where we look at our private workspace, see what records we modified, and check whether there are transactions still running that have read this data and therefore didn't see our updates, because they were in our private workspace, or transactions that already committed and modified this data whose changes we didn't see, in which case we'd have a conflict. That's either backward validation or forward validation; we covered that in the introduction class, and it's not really important right now, but the basic idea is that you're making sure transactions always commit in the right order. If we pass the validate phase and there are no conflicts, we enter the write phase, where we're finally assigned a timestamp and we update the global database with the changes we made in our private workspace, stamped with our new timestamp. At that point the transaction is considered done, and it goes ahead and commits.
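Here is a minimal sketch of the three OCC phases just described: run against a private workspace, validate at commit, then install. The validation is deliberately simplified (one global latch serializing commits and a write-timestamp check on the read set only); all names are illustrative and not from the lecture or the original paper.

```c++
// Sketch of OCC with a private workspace (illustrative only, C++17).
#include <cstdint>
#include <mutex>
#include <set>
#include <string>
#include <unordered_map>

struct Version { uint64_t write_ts = 0; std::string value; };

std::unordered_map<int, Version> g_db;     // the "global database"
std::mutex g_commit_latch;                 // serializes validate + write phases
uint64_t   g_ts_counter = 0;

struct Transaction {
    std::unordered_map<int, Version>  workspace;  // private copies of everything touched
    std::unordered_map<int, uint64_t> read_ts;    // write_ts of each record when first read
    std::set<int>                     write_set;  // keys this transaction modified

    // Read phase: copy into the private workspace on first access.
    std::string& read(int key) {
        if (!workspace.count(key)) {
            workspace[key] = g_db[key];
            read_ts[key]   = g_db[key].write_ts;
        }
        return workspace[key].value;
    }
    void write(int key, const std::string& v) { read(key); workspace[key].value = v; write_set.insert(key); }

    // Validate phase + write phase.
    bool commit() {
        std::lock_guard<std::mutex> latch(g_commit_latch);
        for (const auto& [key, ts] : read_ts)
            if (g_db[key].write_ts != ts)          // someone committed a newer version since we read it
                return false;                      // validation fails -> abort
        uint64_t commit_ts = ++g_ts_counter;       // timestamp is only assigned now
        for (int key : write_set) {                // install our changes into the global database
            g_db[key].write_ts = commit_ts;
            g_db[key].value    = workspace[key].value;
        }
        return true;
    }
};
```

Note how the timestamp is handed out only at commit, matching the "infinity until the write phase" behavior described above.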
The one important thing to understand about this is that when there isn't much contention, optimistic schemes like OCC will perform better than two-phase locking, because since we assume conflicts are rare, we spend less time checking for conflicts that don't actually exist. It's like the disk-oriented system case: if everything fits in memory, then we waste time checking whether we have to evict something to make space in our buffer pool, so we're better off optimistically assuming everything fits and jumping right to what we need. But the issue is high contention: when we have a lot of transactions trying to read and write the same records, all the protocols I've just talked about essentially degenerate into serial execution of transactions, meaning only one transaction can effectively run at a time, and we spend a lot of time in our protocols doing work that's essentially useless, because our transactions are never going to be able to commit.

To better understand this issue, this is what the paper you read is about. This is a study I did a few years ago with a former student of mine, who is now a database professor at the University of Wisconsin, where we implemented a testbed system that allows us to evaluate different concurrency control protocols at extreme levels of parallelism. What's interesting about this project is that rather than taking, say, Postgres and MySQL, or whatever database systems you want, and doing a bake-off between them, we implemented a single system with a pluggable API that lets us plug in different concurrency control protocols without making major changes to the rest of the system. The idea is to strip the system down to the bare minimum you need to execute transactions, without all the additional bells and whistles and features that a full-featured system like MySQL or Postgres would have, so we can focus on what the overhead of these concurrency control protocols actually is. The other idea was to run these systems in a highly parallel environment, to really expose the main bottlenecks in the implementations of the different protocols in a way you couldn't see running on a machine with 32 or 64 cores, which was about the most you could get around 2014 when we wrote this paper.

The system we used for this evaluation is called DBx1000. Again, this is the system the student wrote for this paper, written from scratch just to have this pluggable API that lets you drop in different implementations of these concurrency control protocols. It's a stripped-down system: there's no network access, there was no logging at the time we wrote the paper, and it didn't support concurrent indexes; it really was just focused on how fast you can run concurrency control protocols. All the transactions execute as stored procedures, which means all the server-side logic needed to execute a transaction is contained in the database system itself; you never have to go over the network and ask the application or the client what to do next. We ran this in a chip simulator developed at MIT called Graphite, which modeled a single-socket, tile-based CPU with a NUCA architecture. NUCA stands for non-uniform cache access.
We can contrast that with the NUMA architecture, non-uniform memory access, which is what we'll mostly be talking about during the semester with things like Intel Xeon chips. NUCA basically means that the cost of one core accessing the cache of another core is not always the same, because in this tile-based architecture you have to communicate over a mesh network to do cache invalidation and reads and writes between different cores. Some cores are closer to you on the network, so reading and writing their memory locations or caches is much faster than for a core on the completely other side of the network. I'm not really an architecture person; the reason we went with this tile-based approach is that when you talk to architecture experts, the conventional wisdom is that once you get up to a thousand cores, you'd want a tile-based architecture like this. The main takeaways from the study don't depend on using a tile-based architecture and are still applicable to the Xeon-type systems we'll be talking about this semester; it's just the hardware model we had available at the time.

Another thing to understand is that this CPU simulator runs really, really slowly: it's about 10,000x slower than wall-clock time compared to a system running on bare metal. That means the student had to do a bunch of optimizations in DBx1000 to get it to run in a reasonable amount of time in the simulator, so if you look at the code on GitHub, you'll find a lot of assembly in different parts of the system; he had to do that to get the experiments to finish in time.

The target workload for this paper is YCSB, the Yahoo! Cloud Serving Benchmark. It's basically a key-value store workload meant to model the access patterns of web-based applications. The database has two million tuples, and each tuple is one kilobyte. Every transaction we run executes queries that read and write 16 different tuples at a time, and we vary the skew of the access patterns, so sometimes there are hotspots and sometimes the access is uniform. All the transactions run as stored procedures at the serializable isolation level.

The six schemes we compare are based on the types of protocols I talked about before. For two-phase locking, there's deadlock detection, and then deadlock prevention with no-wait and wait-die; these are emblematic of some of the biggest database systems out there today. On the timestamp ordering side, there's basic timestamp ordering, OCC, and a multi-version protocol: the original multi-version protocol described in the 1979 work that first described MVCC, which uses timestamp ordering. Next class we'll see that you can have different variants of multi-versioning that use OCC or two-phase locking, but for this study we went with the original protocol proposed in the 1970s.
The six schemes we're going to compare are based on the different types of protocols I talked about before. For two-phase locking, there's deadlock detection, and then deadlock prevention with no-wait and with wait-die; these are emblematic of some of the biggest database systems out there today. On the timestamp-ordering side, there's basic timestamp ordering, there's OCC, and there's multi-versioning, specifically the original multi-version protocol described in the 1979 paper that first described MVCC, which uses timestamp ordering. Next class we'll see that you can have variants of multi-versioning that use OCC or two-phase locking, but for this study we went with the original protocol proposed in the 1970s; multi-versioning is emblematic of most of the newer systems today, as well as some classic database systems.

All right, so the first experiment is the baseline performance of these protocols on a read-only workload. Along the x-axis, as we scale up the number of cores, we add more concurrent transactions: every time we add a core, there's one more transaction executing part of the workload, so at 200 cores there are 200 concurrent transactions at the same time. This is a read-only workload with uniform tuple access, so it's the best these protocols can do, because there's zero contention. The first thing we see is that deadlock detection and no-wait scale almost linearly and perform the best, because these protocols are so simple, there are no deadlocks, and therefore the overhead on executing transactions is minimal. The next thing you see is a knee at 800 cores where wait-die and MVCC start to dip down, and this is due to the overhead of allocating unique timestamps for transactions as they run: when you're running millions of transactions per second on this tile-based architecture, timestamp allocation and coordinating it across all the threads becomes a bottleneck. Lastly, you see that OCC actually does the worst, and this is the combination of having to copy data into a private workspace for every single transaction and then copy it back into the system, which becomes a big bottleneck.

Next we see a write-intensive workload, where transactions actually update the database, under medium contention: 60% of the transactions try to access a hotspot of 20% of the database. The first thing to see is that deadlock detection is now the worst protocol for this environment. In the last slide it was the best, but now that there's a higher amount of contention there are more deadlocks in the system, and it takes longer for the deadlock detection thread to come around and break them, so you essentially have threads stalling while they wait for that thread to come through and clean things up, and that hurts performance. Next, you see that no-wait and wait-die do the best, and this is because these protocols are so simple: any time there's a conflict that could turn into a deadlock, they immediately kill the transaction and restart it, so you don't waste time deciding whether to wait or hoping the other transaction will give up its lock; you take some action right away. No-wait, the simplest one, basically says: I tried to acquire my lock, I couldn't get it, so I immediately kill myself. And again, because we're using stored procedures, there's essentially no overhead to restarting a transaction in an in-memory environment; it's super fast. In the middle you see the three timestamp-ordering protocols, basic T/O, MVCC, and OCC, all doing roughly the same.
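Since the no-wait behavior comes up again later, here is a minimal sketch of what it looks like on a per-tuple lock. This assumes a single exclusive lock word per row, which is a simplification; real implementations track shared/exclusive modes and owner lists. The names are illustrative, not DBx1000's.

```cpp
#include <atomic>
#include <cstdint>

// NO_WAIT sketch: try to grab the lock; if it is not free right now, abort
// immediately. No transaction ever holds a lock while waiting for another
// one, so deadlocks cannot form.
struct RowLock {
    std::atomic<uint64_t> owner{0};      // 0 means unlocked, otherwise a txn id

    bool try_acquire(uint64_t txn_id) {  // never blocks
        uint64_t expected = 0;
        return owner.compare_exchange_strong(expected, txn_id);
    }
    void release(uint64_t txn_id) {
        uint64_t expected = txn_id;
        owner.compare_exchange_strong(expected, 0);
    }
};

enum class RC { OK, ABORT };
inline RC no_wait_access(RowLock& lock, uint64_t txn_id) {
    // On ABORT the executor rolls back and simply re-runs the stored procedure.
    return lock.try_acquire(txn_id) ? RC::OK : RC::ABORT;
}
```

Wait-die differs only in the decision made on a conflict: instead of always aborting, the requester compares timestamps with the lock holder and either waits (if it is older) or kills itself (if it is younger).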
OCC is a bit slower, again because of the extra overhead of copying things in and out of the private workspace over and over, but they're roughly about the same.

So now we get to the final graph, which is the most important one in the paper. This is a write-intensive workload where 90% of the transactions update 10% of the database. Now you see that all the protocols basically crash down to zero as we get up to a thousand cores; none of them actually scales. Again, the reason is all the overhead of transactions having to check for conflicts or copy things around, and there are just so many conflicts that you can't make any forward progress and get things completed. One interesting thing is that no-wait does okay relative to the other protocols until about 800 threads, and then it crashes down to zero. It's not doing that well compared to the previous slides, it's only executing on the order of a couple thousand transactions per second, whereas in the other slides I showed you it can do millions of transactions per second, but it holds up because the protocol is so simple: the moment the system sees there could be a deadlock, it kills the transaction and restarts it, and the cost of doing that is super cheap relative to the other protocols. Of course, at a thousand cores you're aborting transactions so often that you just can't get any work done. The other interesting thing to point out is that OCC is the worst at a small number of cores but actually does the best on the far right, and that's because it basically degenerates down to serial execution of transactions. The way the protocol works, at any point one transaction is in the validation phase, and that transaction will always be allowed to commit. So running at a thousand threads or a thousand cores is essentially the same as running on one core: at least one transaction is always guaranteed to commit, and that's why it gets roughly the same performance as it does on the other side of the graph. But the main takeaway, again, is that at a thousand cores everything is doing badly. Things have gotten a little better since we wrote this paper in 2014, but this is roughly where we are today: modern systems run on a single box with something like 32 or 64 cores, so we haven't yet hit the danger zone of high contention at really large core counts on a single box; we're just not there yet. This graph basically shows that the existing methods we use today aren't going to be scalable.

And real quickly, where are we actually spending our time? This is a breakdown of what the different protocols are doing while they execute transactions. You see that in the case of no-wait, a large percentage of the time goes to aborting transactions, because as soon as it finds a conflict it says this is bad and restarts, whereas the other protocols spend their time waiting to acquire locks, waiting to acquire timestamps, or waiting for the deadlock detection thread; again, that's all just wasted work.
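To show where OCC's copying overhead and its serialization point come from, here is a very simplified sketch. This is not the paper's implementation and not full Kung-Robinson validation: reads here copy whole tuples into a private workspace, and commit happens inside a single global critical section, which is exactly why at most one transaction validates at a time. All names and the single-mutex design are assumptions made for illustration.

```cpp
#include <cstdint>
#include <cstring>
#include <mutex>
#include <vector>

constexpr size_t kTupleSize = 1024;      // 1KB tuples, as in the YCSB setup

struct Tuple { uint64_t version = 0; char data[kTupleSize]; };

struct WorkspaceEntry {
    Tuple*   row;
    uint64_t version_seen;               // version observed at read time
    bool     is_write;
    char     local_copy[kTupleSize];     // private copy the txn reads/writes
};

struct OccTxn { std::vector<WorkspaceEntry> workspace; };

std::mutex validation_mutex;             // at most one txn validates at a time

void occ_read(OccTxn& txn, Tuple* row, bool will_write) {
    WorkspaceEntry e{row, row->version, will_write, {}};
    std::memcpy(e.local_copy, row->data, kTupleSize);   // copy IN (the overhead)
    txn.workspace.push_back(e);
}

bool occ_commit(OccTxn& txn) {
    std::lock_guard<std::mutex> guard(validation_mutex);
    for (const auto& e : txn.workspace)                  // validate: nothing changed
        if (e.row->version != e.version_seen) return false;
    for (const auto& e : txn.workspace) {                // write back: copy OUT
        if (!e.is_write) continue;
        std::memcpy(e.row->data, e.local_copy, kTupleSize);
        e.row->version++;
    }
    return true;
    // Note: in this toy version, reads outside the critical section can race
    // with a concurrent write-back; a real system uses per-tuple latches or
    // atomic version checks instead of a single global mutex.
}
```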
All right, so let's quickly talk about some of the bottlenecks we identified in this paper, and some potential solutions for getting around them. The idea is to understand these issues at a high level, so that when we start talking about MVCC in more detail next class, in the back of your mind you'll be thinking: this is how you would handle this, or this is where it could become a problem once we hit larger core counts than what we actually have today. The bottlenecks are lock thrashing, timestamp allocation, and memory allocation, and we'll go through them one by one.

Lock thrashing is a phenomenon you get in a two-phase locking system. If a transaction ends up waiting longer to acquire its locks, the transactions queued behind it end up waiting even longer for the locks that first transaction is holding, and that in turn makes the transactions behind that second batch wait longer still. It's a convoy effect: because one transaction waits longer, everybody else waits longer, and it gets exacerbated as you add more transactions. One way we can measure this in the system is by removing all the overhead of deadlock prevention or deadlock detection and having transactions do nothing but acquire locks and wait. We do that by forcing transactions to acquire their locks in primary-key order; think of locks one, two, three, four, five, six, which every transaction must acquire in that sorted order. In this setup, deadlocks are impossible, so there are never any stalls due to deadlocks, only stalls from waiting to acquire locks. If you run that experiment, you get a graph like this. The theta variable represents the amount of skew in the workload, that is, how much the transactions concentrate on a small number of tuples; the most extreme case is theta 0.8, and theta zero means there's no contention. What you see are these knee bends where lock thrashing kicks in and transactions wait longer and longer to acquire locks as contention grows, and past some point the lines just keep going down; you don't recover from the thrashing effect. What I really like about this experiment is that the graph looks exactly like the theoretical diagrams in textbooks that describe lock thrashing: they always show an idealized curve of "here's the effect of lock thrashing," and our measured graph matches exactly what you'd expect according to the theory, which is kind of nice.
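Here is a minimal sketch of that experimental setup: acquire every lock in sorted primary-key order, so that no cycle of waiting transactions can ever form, and any slowdown you measure is pure lock waiting. The per-key mutex map and the function names are illustrative assumptions, not DBx1000's code.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <vector>

// One blocking lock per primary key; assume the map is populated up front
// with an entry for every key in the table.
std::map<uint64_t, std::mutex> row_locks;

void run_locked(std::vector<uint64_t> keys, const std::function<void()>& body) {
    std::sort(keys.begin(), keys.end());            // global order = primary-key order
    keys.erase(std::unique(keys.begin(), keys.end()), keys.end());

    std::vector<std::unique_lock<std::mutex>> held;
    held.reserve(keys.size());
    for (uint64_t k : keys)
        held.emplace_back(row_locks.at(k));         // may block, but can never deadlock
    body();                                         // transaction logic runs with all locks held
}                                                   // locks released when `held` goes out of scope
```

Because every transaction acquires locks in the same global order, the only cost left is the waiting itself, which is exactly the thrashing curve the theta graph shows.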
All right, timestamp allocation. This is not necessarily going to be a big issue for us in our system, because we can get by with either batched addition or atomic addition, which are compare-and-swap-style methods; the point is that at really large core counts, having every transaction acquire a unique timestamp can become a real bottleneck. The mutex approach is always going to be bad, so we avoid it wherever possible; atomic add and batched add are the compare-and-swap-based methods. Then for particular CPUs, some, like Intel's, have a hardware clock you can read with a single instruction, but it's not clear whether Intel is going to keep that around, so it's not something we want to rely on. And the hardware counter approach was something the student came up with and added to the Graphite simulator; that's not hardware that actually exists today. This graph just shows when the bottlenecks kick in for the different timestamp allocation schemes, and again, batched atomic addition is probably good enough for what we need: it can allocate on the order of a hundred million timestamps per second, which is more than we'll ever need.
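For concreteness, here is a small sketch of batched atomic allocation: each worker thread bumps the shared counter once per batch with a single `fetch_add` and then hands out timestamps from its batch locally. The batch size and class name are illustrative choices, not the paper's exact code, and the intent is one allocator instance per worker thread.

```cpp
#include <atomic>
#include <cstdint>

// Batched timestamp allocation: with a batch size of B, contention on the
// shared counter drops by roughly a factor of B compared to one atomic
// increment per transaction.
class TimestampAllocator {
public:
    explicit TimestampAllocator(uint64_t batch_size = 64) : batch_(batch_size) {}

    uint64_t next() {
        if (local_next_ == local_end_) {                        // local batch exhausted
            local_next_ = global_.fetch_add(batch_, std::memory_order_relaxed);
            local_end_  = local_next_ + batch_;
        }
        return local_next_++;
    }

private:
    static std::atomic<uint64_t> global_;     // shared across all threads
    const uint64_t batch_;
    uint64_t local_next_ = 0, local_end_ = 0; // this thread's range [next, end)
};
std::atomic<uint64_t> TimestampAllocator::global_{1};
```

The trade-off is that timestamps are no longer handed out in strict global arrival order across threads, which is fine for allocating unique identifiers but is something a protocol that depends on timestamp order has to account for.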
All right, the memory allocation issue. The main problem is that since we now have to copy things into private workspaces for some protocols, or copy tuples to guarantee repeatable reads, that copying can be expensive. If we use the default libc malloc, it's always going to be slow, and we never want to use it. We're not going to discuss memory allocators in much depth this semester, other than to say we don't want the default malloc; we'll end up using jemalloc or tcmalloc or one of the other allocator libraries that are out there.

All right, so again, I realize it's kind of weird to give the rest of this lecture from this gold-plated chamber thing, but I wanted to get this out there so we can start talking about MVCC in more detail on Wednesday when I'm back on campus. The main takeaway from this lecture is that the design of an in-memory database system is much different from a disk-oriented system. Conceptually it's the same: you execute queries, there are indexes, you run transactions, but the implementations of the components are going to be much different. The good news is that in-memory databases aren't considered super bizarre anymore. When I first started grad school, maybe over ten years ago now, around 2006 or 2007, they were considered sort of exotic systems, but nowadays, especially with systems like Redis, they've become more common and people are more comfortable with the idea of in-memory-only systems. The one thing I will say, though, is that in recent years DRAM prices and capacities seem to have stalled compared to the gains we've seen in SSDs. So whereas before I was pretty bullish that in-memory databases would take over the world and you'd never want a disk-oriented system, I don't think that's actually true anymore, and I think it's worth looking at how to bring SSDs back into a database system without slowing it down and dragging in all the overhead of a disk-oriented architecture. We'll talk about that a bit at the end of the semester, but it's something in the back of my mind, and I think it'll come up a couple of times during the semester.

Okay, so just to finish up, that's it for today. I'll be flying back this Friday and I'll be on campus on Tuesday next week, when we'll have the recitation discussing the architecture of the system. Then in Wednesday's class next week we'll start discussing multi-version concurrency control, because again, this is the dominant method, or protocol, that everyone uses, and it's more than just a concurrency control protocol; it ends up encompassing all the different aspects of the database system, so we'll start understanding it in more detail. Okay, that's it for today. I'm going to go hang out with the Germans a bit more, and then I'll see everyone next week in class. All right, take care.