We got cut off early last class before we could actually finish, so I'm going to spend a little bit of time talking about the stuff we missed in the context of database storage, and then we'll switch over to talking about buffer pools and memory. So if you remember from the last class, I showed all these examples of storing tuples in pages, and the way that we were going to store the tuples on those pages is in contiguous order. And it's usually in the order that you specify when you create the table. So if I have attributes A, B, and C, for one tuple in one page, I'll store A followed by B followed by C. And at that point, the tuple ends, and then the next tuple starts. These are stored in contiguous order. So again, it's important to note that the relational model doesn't say anything about how data should actually be physically stored. You're only defining what the logical schema looks like and what the application has to deal with. And at no point does it require that you store the data the way that you define it in the table. And as we'll see in a second, it actually may be the case that storing tuples the way I was showing, where you just put them one after another in sequential order, may not be the best thing to do for all workloads. All right, so to motivate this example, I'm going to use a three-table database, a real simple one, that's actually derived from Wikipedia. So if you go download the MediaWiki software, there'll be a MySQL DDL file, and it'll have roughly these create table statements in it. So we'll have a user account table with the user ID and name. There's a bunch of extra stuff that you have for a user account that we don't care about. Then we'll have a pages table where we keep track of all the different articles in Wikipedia; it'll have a page ID and a title. And then there'll be a revisions table that actually contains the foreign key references to the pages table and the user table. So for each revision, you keep track of what user made that change and what page that revision belongs to. So again, it's a real simple schema with this hierarchy. Now, we didn't really talk about this too much in the beginning, but I mentioned in the first class the categories of database workloads that are out there. And if you remember, the first one I talked about was online transaction processing, or OLTP, and the way to characterize these particular workloads is through three key properties. We'll talk later about what a transaction actually means, but for now, just think of it as applying some change to the database in response to some user action. So for example, if you go to Amazon and you add something to your cart, it's going to store that in the database. That action of adding things to the cart in the database is a transaction. Or in the case of Wikipedia, updating a page and adding a new revision, that's considered a transaction. So in OLTP workloads, the transactions are going to be really short-lived, meaning they go into the database, make some change, and then immediately commit. Again, think of Amazon: you add something to the cart, and it comes back right away. The amount of data you're actually going to read and write in a transaction is going to be pretty small as well. Again, think of Amazon: you're not allowed to go manipulate other people's accounts, right? You can only add things to your cart.
You can only make purchases for your items. So the total Amazon database is really big, but the amount of data that you're touching per transaction is actually really small. And the last one is that these transactions are going to be very repetitive. Again, think of Amazon: you add something to the cart. You can't really do arbitrary things on the Amazon store interface, right? You can add things to the cart, you can buy things, you can update your payment information. And everybody else is invoking those same transactions. So it's the same code being invoked over and over again for these transactions. You're just doing the same operations over and over again. The other way to think about OLTP is that this is usually the kind of application you'll build when you first start building a new system or a new application. If you're a new startup, the first thing you end up building is usually an OLTP application, because you don't have any data to analyze yet; you haven't collected anything. So this is the part where you actually collect that data. And then once you have that data, you actually want to start asking questions about it. Then you enter the realm of online analytical processing, or OLAP, workloads. So these are usually read-only queries where you're going to read large segments of the tables in the database and try to extrapolate new knowledge. You're going to find things like, what's the most bought item after a hurricane? You're asking questions about your data. So these queries usually touch a lot of data, so they're probably much longer running than the OLTP transactions. And you're often going to do joins between multiple tables to combine them together, right? You follow the foreign key references or whatever, you put the tables together, and you find the data you're looking for. And sometimes you also see what are called exploratory queries, where maybe you don't know what you're actually looking for just yet. So you're going to execute a bunch of ad hoc queries that are just exploring the data. Think of a tool like MicroStrategy or Tableau or any kind of visualization tool: you're in the user interface clicking on some buttons, looking at different charts. Those are all considered exploratory queries. So OLAP is actually much different for all these reasons than the OLTP side, as well as the fact that nothing's really repetitive. Maybe you have a dashboard that you occasionally refresh, but in that example where you're clicking around a visualization tool, those are ad hoc queries. So the system can't really do too much to pre-compute the things you may be looking for. So I have some sample queries here. On the OLTP side, these are things like getting the last revision for a particular page, updating your user account to record when you logged in, and inserting a new revision record. And then for the OLAP query, you see it's actually something more complicated, where it's going to scan all of the user accounts and find any user account that logged in with a .gov hostname. This was sort of famous a few years ago, where there were politicians on government computers going and updating their Wikipedia pages to say flattering things about themselves or remove any scandals that they had. So there was a published report about this a while ago, where people found accounts with .gov hostnames modifying Wikipedia.
So this again basically does a group by to count the number of logins per month for government accounts. So given that we have these different workloads, now we can talk about how we actually want to store our tuples. And the precise term for what we've been assuming so far this semester is called the n-ary storage model. If you recall from maybe the first or second lecture, I said n-ary tuples are when you have an arbitrary number of attributes. So in this case, for the n-ary storage model, you're storing all those attributes together for each tuple. And the next tuple doesn't begin until you have all the attributes for the first tuple. It's exactly as we've been showing throughout the semester. So basically we're going to store all the attributes for a single tuple one after another. In this case, each row corresponds to a single tuple, and you have the values for all the attributes in line with each other. And when it's finished, then you go on to the next tuple. So in an NSM database system, you'll store the tuples like this in a single page. This is basically saying that, with the exception of the overflow stuff we talked about last class, all the data for a single tuple will be packed together in a single page. So for simple queries, like selecting the user account to get a single record matching a username and password, this is actually a pretty good thing for us, because we'll do a lookup in some kind of index, whether it's a hash table or a B+ tree, we don't care. And that's going to identify the record ID, which is the page ID and offset or slot, for the particular page that we need. And then we go to that offset and grab all the values that we need. And the reason why this is really good for OLTP workloads is because most of the time in OLTP workloads, the queries need all the attributes for a tuple. We're not just trying to get one single column. In this case here, it's a select star. So that means when we brought this page in, we used all the data that was contained in this page. We didn't have to go to another page to get something else. And yes, we end up reading tuples that we weren't going to use in our query, but that's sort of unavoidable, as we said, because we can't address tuples at the level of single cache lines or single bytes. Everything has to be organized in these pages. So we always have to get the whole page, with the other tuples on it, when we're only going to use one. And if we do an insert, same thing. It's really easy. We just grab a page, and we can do a straight copy of all our data directly into one slot. It's essentially one memcpy operation. So let's look at an example now where this is actually a bad idea. Let's use that OLAP query that I showed before. If we look at the query, we can identify what attributes or columns we actually need. So in this case, in our where clause, it's doing a comparison, running a like on the hostname. And assume we don't have an index on that. That means we basically have to scan every single page in our database. We have to do a sequential scan to look at everything, to find tuples that match our predicate. But then it's actually worse. Say we bring in one page now. When we actually look at what columns we're accessing in our query, we know we need to do a comparison on the hostname. So we're going to have to look at all these attributes here.
So that basically means that we're going to scan these tuples and jump to the different offsets one by one to go get that single hostname attribute. And then up above here, the output we want to produce is based on the last login. So we do the same thing: every time we have a matching tuple, we jump to the next offset, grab the attribute for the last login, and then put it in whatever hash table we're using to compute this aggregation. So what's the big problem with this? What's that? Right, page faults, sure, but besides that. Correct, yeah. So there are five attributes in this table, and all these other ones here we brought into memory, and we're not even using them. So this is relevant to when we start talking about the buffer pool stuff: we're bringing in data that we're never actually going to need, and it ends up taking space in memory in our buffer pool. And so we're going to end up evicting other pages to make space for this data, and then we're just going to get terrible, terrible performance. Question or no? OK. So the advantages of the NSM model are that it's really good for fast inserts, updates, and deletes when you're accessing entire tuples. But it's really bad when you want to do analytical queries, when you want to scan maybe only a portion of the attributes in the table, but you want to scan the entire table, or a large subset of it. So what's one solution to this? Column store. Correct, yes. The precise term for it is the decomposition storage model. And basically what's going to happen is, instead of storing everything in rows, we're going to store all the data in columns in our page. That means that all the values for a single attribute will be stored together, aligned with each other. And when you run out of values for that attribute across all the tuples, then you start the next attribute. So it would look like this. For a single page, we're going to have only the hostname attribute for this table. And internally, there's some metadata we're keeping track of to say, well, we know that this is the hostname page, and there are other pages for the other attributes, and we know how to jump to particular offsets in those columnar pages to find the tuple that we want if we want to stitch it back together. So now, if we go execute that query that we had before, we see that to do the scan on the hostname, since we don't have an index, well, it's just a single page to go get the things we need. And we identify at which offsets in our column we have the matching value that we want. Then we go fetch the last login page, and it's the same thing. We just grab a single page, and we can jump to the offsets and get the values that we need to compute the answer. So before, in my simple example, when I ran this query under the NSM model, the row store, I had to scan all six pages. But now, with the column store, for this particular example, I only need to scan two. Yes? If you had a large number of attributes, wouldn't there be a lot of storage overhead for storing the primary keys for each of these values across all of your columns? All right, so his question is, and actually I don't have a slide for this, but I can maybe cover it later, his question is, is this going to be wasteful, because now I need to store the primary key for every single attribute of a tuple in my column store? For the integers, no, because what they actually do is base it all on implicit offsets.
So hostname's a bad example because it's a varchar, but let's say you always have 32-bit integers. If I'm the 12th tuple in one column, I know how to jump to the 12th tuple in the next column. So you don't actually need to store extra primary keys; you just have to compute the offset using simple arithmetic. So the offset will be the same across each of the columns? Correct, yes. Now, for the varchar, that's when things get tricky, and typically what they do is dictionary encoding. So you'll convert the varchar into a unique integer, and there's a hash table, a mapping table, that maps that integer back to the original value. So then in your column store, you only store the fixed-width value. There are other tricky things you have to deal with when you do compression, like arbitrary compression with gzip or Snappy, because those things aren't always going to be fixed width, fixed length. So you have to do some padding to make sure the offsets still work out. Right, so in most column stores that I know of, you don't actually store any extra data; you just compute the offset. That's a good point. All right, so the advantage of using a column store, the DSM model, is that it's going to reduce the amount of I/O we have to do when we want to do OLAP queries, because we're only going to read exactly the data we need. We never end up reading more than we actually need. And then, again, we'll cover this later, but we're going to get better query processing and better compression, because all the data for a single attribute is now packed together; it's not spread out. A really simple way to think about this is, say we have a column that stores the sex of a person, and it's binary, male or female. So now you're going to have a bunch of M's followed by a bunch of F's, followed by a bunch of M's. And you can do run-length encoding to pack those guys in really efficiently. And by better query processing, I mean that some systems we'll talk about later on can actually delay the decompression of the data as far up the query plan as possible. So now you're actually operating on compressed data rather than decompressed data. And that allows you to get much better performance, too, because now you're not blowing out your buffer pool. And the obvious downside of this is that, traditionally, DSM systems are going to be much slower for any OLTP operations, because now you need to take your tuple, split it up into all the different attributes, and do separate writes to separate pages. And that's going to be much more expensive. And then, if you ever want to do a select star and put a tuple back together, you've got to do all the fetches and stitch it back together. So column stores are kind of standard technology now, although I don't think the textbook covers it, which is surprising. But it's certainly not a new idea. Back in the 1970s, the first known description of a column store database system was this thing called Cantor. There's only one paper on it, because it wasn't a commercial system; it was built by a Swedish defense contractor or something like that. But they basically talk about how you can build a column store database. Then in the 1980s, there was a more formal academic proposal of the DSM storage model. And then in the 1990s, Sybase came out with a system called Sybase IQ, which is actually still available today.
And that was sort of like an in-memory accelerator, a caching server that sits in front of your database system. But it was really in the 2000s that column stores finally took off and became more commonplace, because the internet came along and everybody started collecting a lot of data really quickly. And so you needed something that could handle this. Traditionally, before the internet, only a very small number of corporations actually had really big databases, places like Walmart, for example. But once the internet came along, you could put an app up and start collecting a lot of data really quickly, and you needed something to be able to crunch it. So in the 2000s, there was Vertica, probably the most famous column store, which was built by my advisor, Mike Stonebraker. Vectorwise was a system out of Europe that got bought up by Actian. And then MonetDB was an academic system that is actually still available today. But now, in the 2010s, everybody basically has a column store. Cloudera has Impala, Amazon has Redshift, and then Oracle, SQL Server from Microsoft, and IBM will all sell you column store extensions. And they basically do exactly what I described here. So any questions about column stores? Now, the SQLite system you guys are working on, and the storage manager you're building for your first project, will be an NSM system. It'll be a row store. And this is because SQLite is inherently a row store. It's more than just saying you're going to store things in columns versus rows; there's actually a bunch of extra stuff you have to do when you process queries, like the late materialization that I mentioned before, to actually get the benefits of a DSM system. So now let's switch to today's lecture. Today we're going to talk about the memory side of things. The previous lecture, again, was all about how the database system is going to store data on disk in files. And now the question we're going to deal with is, how is the database management system going to bring data into memory to be able to crunch on it and possibly write it out later? We have to do this because the database system can't operate directly on files on disk. It has to copy things into memory first, make changes, and then write them out as 4K blocks or possibly larger. So our goal, what we want to achieve now with our buffer pool, is for the database system to give the illusion to the user or the application that the system has more memory than is actually available. Another way to think about the problem we're trying to deal with is in terms of spatial control versus temporal control. Spatial control is what we sort of covered a bit last class. It has to do with how the database system is going to store or organize pages physically on the storage device. We said that if you're using a spinning-disk hard drive, you care about having a lot of sequential reads and writes. So you try to organize your data spatially to be in contiguous order. And if you go down to within a single page, you want to put tuples in a page that are actually going to be accessed together, so that, again, it's one disk read, one fetch, to go get the thing that you need. And so now what we're talking about is essentially how the database system is going to have temporal control of its data. And that has to do with when it's going to read pages into memory, and when it's going to write them out to disk. We're not going to talk too much about writing things out.
At this point, and this will come up later when we talk about concurrency control and recovery, we're really talking about how we copy data in, and then, when we have to make room for new data, how we make decisions about what to move out. The one thing I'll say, too, is that this is all very interconnected with things that will be discussed later in the class. And so I don't want to talk too much about writing data out, because for that, you have to understand how the concurrency control stuff works, and you have to understand the logging protocol. But suffice it to say, it's related to the question that somebody had last class about why you can't just use RocksDB to manage your log for you. And the answer is no, because all these different parts of the system need to know what each other is doing. You don't want to have data written out to a page before you write out the log. And you have to be very careful how you order these things, to make sure that when you crash and come back, you can recover the database to a correct state from non-volatile storage. So last class, I also railed on about how using the OS to manage your memory is going to be a bad idea. And the two things that we pointed out were that the database system needs to always have complete control over what's going on, meaning what data is brought in and when things get evicted. And the other issue that we have is that the OS will stall any thread when it touches a page that's not in memory. You get a page fault, and the OS blocks you while it goes and fetches the thing that you need. Related to that, it's more than just stalling your thread that's bad. Any time you actually go into the kernel, that's bad too, because it's really slow. The kernel has mutexes and locks inside of it, so going down into the kernel is always an expensive operation. So we try to avoid syscalls as much as possible. We try to do everything at the user level. Now, there are some ways to get around some of these limitations of mmap. I don't spend an entire lecture on mmap, which I think is a bad idea anyway. But just in case you think, oh, mmap is completely opaque to the database system, there are some API calls you can make to the operating system to get it to maybe do exactly what you want. So first of all, there's a function you can call named madvise, where you give the operating system hints about how the database system, your process, expects to read certain pages in memory. You can tell it things like, I'm going to do sequential reads on this range of pages, or I'm going to read this page once and then never read it again. You can also use mlock to tell the operating system that certain pages can't be evicted. This is essentially doing the same thing as the pin operation that we'll talk about later in the class, where you're saying that this page is being operated on by a thread right now, and no one's allowed to go evict it until I'm done. And the last one is msync, which is where you tell the operating system, I want to block my process until these pages have actually been flushed out to disk, because you want to do this for durability reasons. I'll show a little sketch of these calls in a second. So again, this sort of seems like it's going to solve all our problems. And there have certainly been a lot of attempts to actually use mmap in database systems. I don't want to make it sound like nobody actually does this.
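As that sketch, here is roughly what those three calls might look like on Linux. The file name, sizes, and offsets are made-up placeholders, and error handling is omitted for brevity:

```c
/* Minimal sketch of the mmap hint APIs discussed above (Linux).
 * "test.db" and the sizes are hypothetical; error handling omitted. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define DB_SIZE (1 << 30)  /* pretend the database file is 1 GB */

int main(void) {
    int fd = open("test.db", O_RDWR);
    char *data = mmap(NULL, DB_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);

    /* madvise: hint that we will scan this range sequentially */
    madvise(data, DB_SIZE, MADV_SEQUENTIAL);

    /* mlock: keep the first page resident so the OS cannot evict it,
     * similar in spirit to pinning a frame in a buffer pool */
    mlock(data, 4096);

    /* ... modify some pages here ... */

    /* msync: block until the dirty pages in this range have been
     * flushed to disk (needed for durability) */
    msync(data, DB_SIZE, MS_SYNC);

    munmap(data, DB_SIZE);
    close(fd);
    return 0;
}
```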
So MonetDB, the system I mentioned before that is a column store, is entirely based on mmap. And then LMDB is an embedded database system, sort of like SQLite, that uses mmap to manage its buffers. MonetDB is an OLAP system, meaning you don't want to run transactions on it, so this is actually kind of reasonable. For LMDB, for reasons I can't say on video, because the guy knows who we are and doesn't like us, I'll talk about that later. With LMDB, they're trying to be an embedded system, and they only have a single writer, so using these tricks might be good enough. More famous is probably MongoDB: when it was first built and first came out, it used mmap entirely. And a lot of the way they got around concurrency issues is that they had a single database lock, meaning only one thread could ever read or write to the database at a time. Again, you have to do this because you have to order your operations very carefully. And although these hints that you can give the operating system seem like the right thing, the operating system doesn't always follow them precisely. And they're certainly not portable; some of them don't actually work on Windows. MemSQL, last I heard, uses mmap for their column store, which again is the read-only side of the system. SQLite has an option to use mmap if you want to; it's turned off by default. Their manual basically reads like: here are all the problems with mmap, and here's why it's there if you actually need it, but we don't recommend using it. And part of that is because they need to run on all sorts of crazy operating systems and hardware, and all of these embedded devices may not have the same support that Linux does. And then the InfluxDB guys we had last week use mmap just as the buffer pool for read-only pages. So they write out pages without using mmap, and then later on they read them in using mmap. That essentially ends up being the same situation as LMDB: you have a read-only database. So again, the main issue is that these APIs are not going to be portable. It still doesn't help us with the OS blocking our thread whenever we have a page fault. And as we'll see later on, and I guess we can run experiments if you're actually curious, it's really tricky to make sure the OS actually orders your writes. Just because you say, write these pages, and you do an msync, it doesn't mean they're going to be written in exactly that order. The operating system can do anything it wants. And from a database system perspective, that's bad, because we want to make sure that we don't have anything written out to disk until we have a log entry written first. So the main takeaway from all of this is, again, that the database system is always going to know better than the operating system, and therefore we want to manage memory ourselves. And so the buffer pool is essentially the in-memory cache that sits above the disk manager and below all the other parts of the system. It's essentially going to be the memory manager for us. Question? Or no? OK. OK, so again, as I mentioned before, we have to make sure that our buffer pool is going to work nicely with our concurrency control scheme and our logging and recovery scheme. But for now, we don't have to worry about that. So another point, and something I should have mentioned earlier but want to talk about now, is the distinction between locks and latches.
So you might have seen on the first project, I said make sure you protect your data structures with a latch. And this is sort of an ongoing debate between the operating system people and the database system people. In the database system world, we refer to things that protect data structures as latches. In the operating system world, they refer to these as locks. And the reason why I have to make this distinction is because in the database world, a lock is used to protect the logical contents of the database: a page, a tuple, a range of tuples, a table, things like that. A lock will be held for the duration of an entire transaction. Think of it like: I started a transaction and I want to make a bunch of updates, so I lock my tuples for that transaction. And then we also need to make sure that if we make any changes, but the transaction aborts, we're able to roll back those changes before we release those locks. So locks are a higher-level construct that is exposed to the application. Underneath the covers, though, when we're actually building our database system, we'll use what are called latches. These protect the critical sections in the internal data structures that the database system uses to manage data, execute queries, and do other things. The way to think of latches, if you're coming from the OS world, is that they're like a mutex. They're short-lived, meaning you're going to go into an index, make some change, and then release your latch; go update your page table or your extendible hash table, and then release the latch. And we don't need to care about rolling back any changes, because, again, this is all being done internally; it's not exposed to the application. That means, for example, that if you try to acquire a latch on an index and you can't, then the database system code itself has to retry. Whereas if you try to acquire a lock in the application and you can't, then it's up to the application to retry; the database system won't always do that for you. So just to be clear, when I say latch, I mean something like a mutex or a spin lock, which is also confusing terminology, but a low-level protection primitive for our internal data structures. We'll cover logical locks later on when we talk about concurrency control. So today, the things we want to focus on are, at a high level, what the buffer pool manager does, and then we'll spend some time talking about replacement policies and allocation policies, and then I'll finish up talking about what other types of memory pools the database system can maintain. Because for our purposes here, we're focusing on what I'll call index data or tuple data, but there's a bunch of other stuff we need memory for as well. So the buffer pool, at a high level, is essentially some large region of memory, or I shouldn't say large, because for an embedded system it won't be that large, but it's some region of memory that's going to be split up into a fixed-size array. The size of every element of that array is going to be the size of our pages. Now you see, again, why we have to have fixed-size pages that are always the same: because then it's easy to jump to offsets in that array. So every entry in that array where we can store a page will be called a frame.
This is more database parlance: we talked about slots before, where a slot is the entry where you store a tuple on a page, and now in memory we refer to a frame as the entry where you store a page that's been copied in. So what happens is that when the database system comes along and wants to request a page, it first has to copy that page into one of these free frames. So in this case here, page one is copied here, and then page three would be copied there. The key thing to point out here is that when we make this copy from disk into a frame in our buffer pool, it's going to be an exact copy. Meaning, we don't do any deserialization or marshaling of the contents of the data; it's an exact copy of the bytes from disk into our buffer pool. This might be different from things like JSON or protobuf, where you do some deserialization, some marshaling and unmarshaling, when you go from disk or off the network into memory. In our world, it's an exact byte-for-byte copy into memory. So the buffer pool will also maintain what's called a page table. And the page table is essentially just a table that keeps track of what pages are in memory. So if the upper-level parts of the system say, I need to access a tuple in page one, then you go to the page table and say, find me the frame in my buffer pool where I have this page. And now you're given a pointer to memory, and the other parts of the system, the threads, can then process that data directly. Once the thing's in the buffer pool, you don't make another copy of it; you just pass around pointers to it. Now, in our page table, we need to keep track of some additional metadata. The first piece will be a dirty flag that gets set whenever a thread modifies a page. We need to know this because if we ever need to evict this page in order to make space for a new page copied from disk, the database system needs to know that this thing has been modified since it was copied in, and therefore it needs to write it out. If the dirty bit is not set, then you can just immediately throw the page away, because you don't care about writing it back; you know you still have an exact copy on disk. And then there will also be a pin flag or pin counter. This is essentially telling the buffer pool manager that some thread is doing something with this page right now, so you're not allowed to evict it. Either it's reading it or writing it; it doesn't matter which, you always need to pin it. Sometimes you can set this as a flag; other times there's a counter, because that helps you understand how highly contended a page is, how actively it's being used. But I would say the pin count is not the same thing as the latch on the page. Because if you have a thread that wants to make a modification, you want to take a latch on the page as well, because then you prevent other threads from writing it at the same time. You can have multiple threads accessing the page and reading it if they're not making changes; that's not an issue, because they don't conflict with each other. But if you have a thread modifying a page, then you need to take a latch on that page. You also need to take a latch on the page table any time you want to add a new entry. Roughly, the per-frame metadata looks like the sketch below.
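This is a rough sketch of that bookkeeping. All of the names and sizes here are hypothetical, and a real system would also keep a hash table mapping page IDs to frame IDs, which is omitted:

```c
/* Rough sketch of per-frame buffer pool metadata. Names and sizes
 * are made up for illustration, not from any particular system. */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE  4096
#define NUM_FRAMES 1024

typedef struct {
    int32_t         page_id;   /* disk page currently in this frame (-1 = empty) */
    bool            is_dirty;  /* set when a thread modifies the page; a dirty
                                  page must be written back before eviction */
    int32_t         pin_count; /* >0 means threads are using it; cannot evict */
    pthread_mutex_t latch;     /* protects this entry (a latch, not a lock);
                                  initialized with pthread_mutex_init at startup */
} frame_meta_t;

/* The buffer pool itself: a fixed-size array of page-sized frames.
 * Fixed-size pages mean locating frame i is just pointer arithmetic. */
static char         frames[NUM_FRAMES][PAGE_SIZE];
static frame_meta_t frame_table[NUM_FRAMES];

/* Pointer to the in-memory copy of the page held in a given frame. */
static inline char *frame_data(int frame_id) {
    return frames[frame_id];
}
```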
So say I do a lookup in my page table to see whether some page is there, say page two. I take a latch on that entry, then go check the buffer pool to see whether the page is there. If it's not, then I do my copy into the buffer pool, which means I have to take a latch on that data structure as well. Then I can do my update to the page table and add the pointer to where that page now exists in the buffer pool. And once that's done, I can release the latches. Now, we won't talk about this too much in this class, but this will come up in the advanced class on in-memory databases. Every single time you access a page, you have to take a latch on the page table to check the pointer, to see whether the page is in memory. So if you have a lot of memory and your database fits entirely in memory, then every page is always in memory, and therefore you're taking an essentially useless latch every single time just to check that. An in-memory database makes the assumption that everything is always going to be in memory. Therefore, it doesn't have a page table like this, and you don't have to take a latch to find the offset you're looking for; you know how to jump exactly to that memory location. So in an OLTP system, if you have an important application and you have money, and usually if you have customers, you have money, then you can afford enough RAM to put your entire database in memory, or most of it. And therefore, doing this page table check every single time is expensive and essentially superfluous. OLAP systems deal with much, much larger databases, so there you have to do the check, because you're scanning large segments of the tables and you're always updating the page table. So that latch is necessary. OK. So in the example that I showed here, we had a single buffer pool for the entire database system. But again, there's nothing about how we design our system that says we have to do that. So there are a couple of different ways you can do this. You can have multiple buffer pool instances for your database system. MySQL does this. Essentially: I have a page ID, I run a hash function on it, I have a fixed number of buffer pool instances, and that hash tells me which instance to find that page in. And they do this so that you don't have to take a latch on a single page table that all your threads may be contending on for all your pages. Now you can reduce contention, because you can partition the accesses across the different buffer pools. Yes. The question is, does the latch apply to the entire page table or just the entry? Typically just the entry, because otherwise you're locking the entire thing. And I think, again, for the first project, we said that you can actually do either one. The lazy thing is just taking a mutex on the entire data structure, but if you're a bit smarter about it, you latch just the entry. So, to avoid the contention on the page table, MySQL can split it up into multiple instances. And this is something you define in your MySQL configuration file: the number of buffer pool instances you want. But in the commercial and enterprise database systems, you can do way, way more things. There are way more knobs, way more things you can configure. You can do things like have a single buffer pool per database, or a single buffer pool per table.
And you can also split up your buffer pool based on different page types: a buffer pool for indexes, a buffer pool for tables, a buffer pool for sorting buffers, and things like that. You can do all sorts of crazy things. So IBM DB2, Oracle, Sybase, SQL Server, and Informix, I would call these the enterprise systems; they're typically seen in older companies or large corporations. There are all sorts of knobs you can use to tune all these things. Again, the idea is that we want to reduce the latch contention, but we also want to improve our locality. Because if we have, say, a table that we know is very important and we're going to be accessing it all the time, maybe we don't want to pollute its buffer pool with pages from other tables that we don't care about so much, where it's fine to go to disk to get things. Again, it provides the DBA with more control to tune the system to get better performance. Now, another thing we can do in our buffer pool manager, which I think is kind of cool, is pre-fetching. Operating systems can try to pre-fetch pages. If they know you're scanning along an extent where you have a bunch of contiguous pages, they can say, well, I know you're going to read a couple of pages ahead, and go fetch them into memory for you. And we can do that too in our database system, but we can also do something the OS can't do, for cases where you have indirection in your data structures and the pages you're going to access aren't physically aligned. So for a sequential scan, it's really simple, right? You have some query, and it wants to scan every single page. So it gets the first one; it's not in memory, so it fetches it into the buffer pool. It goes to the second one, and it's not there either. And now the database system can recognize that I'm going to keep scanning the rest of the pages in my table. So it'll go ahead and pre-fetch the next couple of pages and bring them into memory, so that when I go along and do my scan, the pages that I need are already there, and I don't hit a page fault. And the database system can do this because it knows exactly, well, roughly, what your query is trying to do. Because it has the query plan, it knows that you're going to do a sequential scan without a where clause, it knows you're going to hit every single page, so it'll go ahead and pre-fetch these things for you. And we can do this because we can do our I/O in a non-blocking manner with our buffer pool manager. But the operating system can sort of do this already, right? The cool thing the operating system can't do is when you want to do index scans. So let's say now that I have my index, and essentially every node in the index is a separate page. And I first want to start at the root of my index, find some entry, and scan along the bottom. So I start at the top, I hit this page. It's not in memory, so I fetch it into my buffer pool. That's fine. And then I go down to the second page, because I'm traversing the tree. That's not in the buffer pool, so I fetch that. But now what's going to happen is I'm going to hit this bottom page here, and then scan along the leaf nodes to find the data that I'm looking for. This is a very common thing to do with range scans. But the pages that I'm going to access are page 3 and page 5, which are not stored contiguously on disk.
So the OS could maybe pre-fetch page 3, but it's not going to be able to pre-fetch page 5, because that's physically farther away from where page 3 is. The operating system can't do this at all, because it doesn't know that you're scanning along some data structure where, logically, at the data structure level, things are stored contiguously, but on disk, they're not. So this kind of pre-fetching can only be done in the database system, because, again, the database system knows exactly what your query is trying to do. Is this clear? OK. So another cool thing you can do in your database system is called scan sharing. Let's say that I have multiple queries running at the same time, and they're both scanning the entire table. In all the examples we've shown so far, we were treating each query in isolation: if it needs a page and it's not in memory, then you go fetch it. But in a real system, we have a lot of users and a lot of applications accessing it. You have multiple queries running at the same time, and two of those queries could be doing the same low-level disk operations. They may not be running exactly the same query, but they may be running the same scan. And therefore, we don't want to have to fetch things in for one query, throw them out, and then fetch the same things again for another query. So if we have queries that show up at the same time that are doing something somewhat similar, then we can piggyback them off one another and get better performance. The basic way to think about this is that there will be a cursor that the DBMS maintains for each query, which says what page it's looking at and how many pages it still has to go in its scan. So if you start one query and its cursor is walking along, and then another guy shows up, rather than him starting at the beginning, you can have him pick up where the other guy already is. And he knows that he can come back around afterward and finish the rest of the data that he needs. So this is fully supported in DB2 and SQL Server, meaning you can have queries show up at any time and jump in at any point in the scan operation. For Oracle, as far as I know, in the latest version, they only support what's called cursor sharing. That's where, if you have two queries that run exactly the same SQL statement, then you can piggyback them off each other. DB2 and SQL Server can actually recognize that, while two queries are different, they're doing the same sequential scan, so they can share their cursors. So this is an example of what this looks like. I have one query that's doing a scan of the entire table A to compute some aggregate value. It starts at the beginning. Nothing's in our buffer pool; nothing's in memory. So we get the first page, then we scan along, get the next page, and move on down. Now we get to page three, and that's not in memory, and we need to evict something. So it takes out the first page, because, for simplicity, assume we're throwing out the least recently used page. That would have been page zero. But now query two comes along, and it wants to compute a different aggregate, but it's still a complete sequential scan of this table. So if we had to start its cursor at the beginning and go fetch page zero, that's the worst thing to do, because that's the page we just threw away.
So instead, what some systems can do is piggyback Q2 on top of Q1. It goes along and reads all the same data that Q1 reads. You can sort of think of it as a pub/sub mechanism, where your thread says, attach me to this cursor, start scanning over the tuples, and notify me when a new page comes in. So you have this asynchronous I/O effect here. Then, when Q1 is done, because it has scanned everything, Q2 can come back around and read all the pages that it missed the first time. Again, this is something that the database system can do because it knows what queries you're trying to execute; it knows exactly what the application is trying to do. Now, I sort of hinted in that last example at what the policy would be for evicting pages from memory to make space. So now we can go into more detail about how this actually works. The buffer replacement policy, which again you have to build in your first project, is essentially some algorithm that the database system implements that allows it to make decisions about what pages to evict when it needs more space. If you have enough memory for everything, then you never have to run this. But if you do have to evict data, then you have to have some procedural way of deciding what to throw away. And the things that we care about in our buffer replacement policy are, first, correctness: we don't want to throw away exactly the next thing we're going to need. We want to have good accuracy; we want our approximations to be reasonable, and that's related to correctness. We actually care about speed as well, meaning we can't solve some NP-complete problem every single time we need to decide which page to evict, because we're holding latches and locks in our database system while we're doing this. And we also want to try to minimize the metadata overhead. This won't be too much of an issue, but some of these replacement policy algorithms require you to maintain extra data structures, and we don't want those to get too big, because they become expensive to traverse and expensive to maintain. So the most clearly understood and widely implemented algorithm is LRU. And I would also say that this buffer replacement problem is one of the oldest concepts people have explored in computer science research; it goes back to the 1950s and 60s. So there's been a ton of work done on this, and there's been a ton of work done in the context of database systems. But I'm going to focus at a high level on just the major ones, so you know what's out there. So LRU is the simplest one to understand. Basically, every single page is going to have a timestamp that keeps track of when it was last accessed, and you maintain this timestamp in your page table. When the database system needs to decide which page to evict, it just selects the one that has the oldest timestamp, because that was the one least recently used. And because we want this eviction decision to happen very quickly, typically you try to maintain the page IDs in sorted order, so that finding the page to evict is just popping off the front of the queue. So this is very simple, but it does have some overhead, because now you need to maintain this timestamp for every single page. Yes.
His question is, if you have two transactions that are read-only and they're accessing the same page, could a newer timestamp get overwritten by an older one? So, without getting into this too much, he's referring to timestamp ordering, where transactions are assigned a unique timestamp as a transaction ID. You wouldn't actually use the transaction's timestamp to update the page for LRU. That's a logical time, the order of the transactions. This is a physical time, the clock timestamp. So that's not an issue. Now, the issue is there is some overhead involved in LRU, because, again, we have to maintain the sorted order of our pages, and we have to update that timestamp every single time a page is accessed. A good approximation, an easier way to implement something like LRU, is called clock. I think I mentioned last class that some variant of clock is what Linux uses for its memory management replacement policy. Basically, instead of maintaining a timestamp for every single page, we just store a single bit that says whether that page was referenced, accessed by a query or transaction, since the last time we checked it. Then we organize all our pages in a circular buffer, and there's essentially a clock hand that ticks through and checks this reference bit whenever we need to evict a page. If the bit is set to zero, then you know the page has not been accessed since the last time you checked it, so you can go ahead and evict it. If it's set to one, then you flip it to zero, and that way, when you come back around, you check whether it was flipped back to one, and then you know it was accessed. So let's do a really simple example; a little sketch of this sweep in code follows below. We have four pages laid out like this, and every page has a reference bit, and for now we'll initialize them all to zero. Say that some transaction accesses the first page, so we flip its reference bit to one. And now some other thread needs space, so we need to decide which of these pages to evict. The clock hand starts here; we check this first page, and its reference bit is set to one, so we can't evict it yet, but we set its reference bit to zero and move on. Then we move over to this guy here. His reference bit is set to zero, so we know he has not been accessed since the last time we checked, so he's safe to evict, and we can put a new page in, always initialized with the reference bit set to zero. So now, say these other two pages get accessed; again the clock goes around and checks the reference bits, but then we get back to that first page we had. The first time we passed through, its bit was set to one, but we set it to zero, and when we came back around, it was still zero. So now it's safe for us to evict. So again, this is an approximation of LRU; you're just giving the page a second chance via this reference bit. If it was accessed since the last time you checked, the bit will be one, and therefore you keep the page. If not, it's zero.
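Here is a minimal sketch of that clock sweep in code. The frame count and array names are hypothetical, and it assumes at least one frame is unpinned:

```c
/* Minimal sketch of the clock ("second chance") sweep described above.
 * NUM_FRAMES, ref_bit, and pinned are made-up names for illustration. */
#include <stdbool.h>

#define NUM_FRAMES 1024

static int  clock_hand = 0;
static bool ref_bit[NUM_FRAMES]; /* set to true whenever the frame is accessed */
static bool pinned[NUM_FRAMES];  /* pinned frames can never be evicted */

/* Sweep until we find a victim frame: a set reference bit buys the
 * page a second chance (we clear it and move on); a cleared bit means
 * the page has not been touched since the last sweep, so evict it. */
int clock_find_victim(void) {
    for (;;) {
        int f = clock_hand;
        clock_hand = (clock_hand + 1) % NUM_FRAMES;
        if (pinned[f])
            continue;            /* some thread is using this page */
        if (ref_bit[f])
            ref_bit[f] = false;  /* second chance: clear and keep going */
        else
            return f;            /* not accessed since last sweep: evict */
    }
}
```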
And so you don't really care about the exact ordering, knowing precisely which page was least recently accessed. You just care that, within an epoch, since the hand last came around the clock, this page was not accessed, so you throw it away. So LRU and clock do have some problems. And the most important problem that we care about in database systems is what's called sequential flooding. Sequential flooding is the problem where it may be the case that the least recently used page, or least recently accessed page, is actually important, and the reference bits or the timestamps got polluted because some query did a complete sequential scan of our entire table. This happens because we're not actually tracking any metadata about how pages are being used, or how often they're being used; it's just this coarse-grained timestamp that says when a page was last used. To show an example of this, say that we have some OLTP query doing a lookup on a single record, where ID equals 1. Let's say that this tuple exists in page 0. So when the query executes, we need a tuple in page 0, so we fetch that in. Then that query ends, and now we have an OLAP query that wants to do a sequential scan over all our pages. As it starts scanning along, it grabs all the pages that it needs. But when we get to page 3 here, if we're running LRU or clock, page 0 is the one that was least recently used, so that's the one that's going to get evicted. But now, if that popular query shows up again to grab that ID 1 record, it's in page 0, which is exactly the thing I just evicted. According to LRU, this is correct. But what we really wanted to do was maybe evict page 1, because, in terms of this scan, we know we're not going to scan it again. So what's missing from clock and LRU is information about how pages are actually being accessed. And so there's a bunch of newer algorithms (I say new, but I mean like the 1990s), newer variants of these things, that do keep track of the access patterns of individual pages and use that to make decisions about which page is the best one to evict. So there's an extension of LRU called LRU-K. Basically, think of it as having an extra data structure where you keep the history of when pages were accessed. You can also do additional things like priority hints, where the database system tells the buffer pool that a certain page is important. For example, the root node of an index: every thread is always going to have to go through that. So maybe the database system never wants to evict the root pages of its indexes, but the leaf pages maybe we don't care about so much. Or if we know that we're always inserting into the right side of the tree, maybe we don't need to keep the left side of the tree in memory. So there are additional hints the database system can provide to the buffer pool manager that it can use to make decisions about our indexes. And the last one, the technical term for it is called localization. This is where the database management system keeps track of the read/write set of pages for individual transactions, and then limits the amount of memory, the number of pages, that a transaction is allowed to bring into the buffer pool.
So for example, in my example here, I only had three pages in my buffer pool, and query two was allowed to use all of them. But maybe I could have avoided that problem where page zero got evicted if I had said, query two is only allowed to have two pages. Therefore, it can only swap in and out the pages that it touches. In this case, it wouldn't have evicted page zero, because page zero was being used by other threads, other queries. It was the only query that actually needed page one and page two, so when it needed to evict a page, it would not choose one being used by other people; it would choose one that only it is using. So again, this is another aspect of database systems where the commercial guys actually have very sophisticated algorithms for all this management stuff, using heuristics and other things to try to infer what the best decision is for the eviction policy. Whereas things like Postgres and MySQL, they're not just using vanilla LRU, but they're definitely not as robust or as complicated as the commercial guys. The other thing we can talk about now is what are called allocation policies. This is related to the localization stuff that I talked about before: how the database system makes decisions about what it should do when a particular thread or transaction or query makes requests for data or memory. Everything shown so far is what we call global policies, where, again, there's a single buffer pool running some algorithm that tries to look at everything and maximize the throughput or the locality of pages for the entire system. But you can also have what are called local policies, where you look at a single transaction, try to understand what it actually wants to do, and make decisions about what to evict based on what you expect to happen. Again, if you know that your transaction is going to be accessing the same index over and over again, maybe you try to keep the pages for that index in memory for that particular transaction. So it's sort of a yin and yang thing, where sometimes the best policy for a local transaction or local query may not be the best policy for the global system. And typically, the way the commercial guys support these different kinds of policies is that you can set transaction priorities. You can designate some transactions or queries as being more important or more mission-critical than others, and then the database system internally can make decisions that favor those higher-priority queries over the riff-raff. You see that very commonly in enterprise applications. Think of it like this: in an OLAP database system, when the CEO runs a query, you want to get the answer back right away; when Joe Schmo in accounting runs a query, you make a best effort. And again, the commercial systems allow you to do this. So the last thing to talk about now is: what are the other types of memory we want to use in our database system? We've already talked about index pages. We've already talked about tuple pages. But as we'll see going forward when we talk about query processing, there's a ton of extra stuff that needs memory. Some of these things will be backed by disk pages, meaning we can write into them and spill out to disk if we run out of space. Other things we can consider to be ephemeral, and it's OK that if we run out of space, we just delete things.
So this will come up a lot when we talk about sorting and join algorithms. You want to be able to sort tables that are larger than the amount of memory you have, or the intermediate result for your query may be larger than the amount of memory you have, so you want specialized sort buffers or join buffers that let you spill those out to disk without clobbering the regular pages in your buffer pool. Query caches are where you do something like memoization: you record the exact response of a single query. This is an example where maybe you don't care so much about keeping it around, because if you run out of space you just blow away the query cache; the next time you see that same query, you just have to recompute it. There are things like maintenance buffers, for reorganization like we saw in Postgres. There are log buffers, which we'll talk about when we do logging and recovery. And dictionary caches would be for compression. All of these things will typically also be managed by a buffer pool manager, because again, that's the thing that's managing the memory for the system, so you try to avoid just calling malloc anywhere you want. A lot of database systems spend a lot of effort to make sure there are absolutely no memory leaks at all, because a leak would be taking memory that could otherwise be used for your buffer pool. So every major database system has very rigorous QA tests and other testing to make sure there's never any memory leak in the database system.

OK, so the main takeaway from all of this is that the database system is always going to be able to manage memory better than the operating system. And if anybody tells you otherwise, they don't know what they're talking about. Of course, now I'm going to get emails about that. So again, the buffer pool manager is what the system will use as the interface to get data in and out from disk, and it can also be used for ephemeral memory for things like intermediate query results, OK? Any questions about this? So next class, and I actually want to give a quick demo before we leave, next class we're going to talk about hashing, because now we're going to start going up the stack, building the data structures we're going to need to do query processing and make these things run efficiently. We'll talk about open hashing, extendible hashing, linear hashing, cuckoo hashing. And then on Wednesday, we'll start talking about order-preserving tree indexes, OK?

All right, so real quick. Last class, I sort of glossed over this topic quickly, and we have some extra time, so I want to show it. Last class, we talked about how the database system is going to represent types. I blew past this very quickly, but I said that there are basically two kinds of decimals: floating-point decimals and fixed-point decimals. And I said the floating-point decimals are actually handled by the processor, the CPU. But the issue with floating-point numbers is that they're inexact, so you may end up getting different results than you expect. I showed an example where, if you run a simple C program that just takes 0.1 plus 0.2, you can get wildly different results depending on how you print it, even though technically it should be the same thing. And again, this is because CPUs can't precisely represent these decimal values in binary floating point. So we have the same issue in our database system, where we have different types.
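The slide's example was a small C program, but the same behavior is easy to reproduce in Python, since both use IEEE 754 doubles; what changes between the "wildly different" outputs is just how many digits you ask the formatter to show:

```python
x = 0.1 + 0.2
print(f"{x:.1f}")   # 0.3 -- looks exact at low precision
print(f"{x:.20f}")  # 0.30000000000000004441 -- the stored value is inexact
print(x == 0.3)     # False: neither 0.1 nor 0.2 has an exact binary form
```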
So if you're OK with having those rounding errors, then you can use real or float. But if you need precision in your decimals, then you want to use numeric or decimal. What happens is that the database system stores this essentially like a struct, where it maintains extra information about exactly where the decimal point is, what's on the right-hand side, what's on the left-hand side, and what the scale is. So in this case here, this is Postgres: for a single numeric value, you end up having to store four 32-bit integers plus a pointer to the actual digits. So where a 32-bit floating-point number is stored in just 32 bits, in this case we essentially have to store the value in roughly 20 bytes. And this has a big impact on performance.

So if we go over to Postgres here. I've loaded in Postgres, oh, you can't see it, sorry. Still no? This might ruin the demo. Of course, now I can't see it. Awesome. All right, so this is Postgres running on a machine in my office. I've loaded two tables with, I think, 50 million tuples each, and each table has two fields, two decimal values. In the first table, I'm storing them as reals, as floating-point numbers. In the second, I'm storing them as decimals, the actual numeric type. So first we run on the reals table, and EXPLAIN ANALYZE actually runs the query and then tells you the query plan and how long it took. This query, running on the reals, took eight seconds. Now if I run that same query on the decimals, it's going to take much longer, because the system actually has to look at that struct, figure out exactly what the value is, and work out how to add these numbers together. This should take roughly 20 seconds. It took 16 seconds, twice as long. So this is an example of how, if you're using the native types that the processor supports, arithmetic and other operations are really, really fast; if you need to do something more complex, it's going to be much slower. And the numeric type is a perfect example of this.

And then, just to show you that the answer is actually correct: this is essentially the same thing in Python code, where we just add these two numbers together and print the result. This will take roughly 30 seconds; the database system with the native type can do this in exactly eight seconds. Actually, I guess the question is whether Python is doing floating-point numbers or fixed-point numbers here. So what we can do is go back to Postgres and run that same query on the decimals and see what answer we get. I should also point out that I already ran the query before, so everything is already in the buffer pool and caches; this is all pure computation of having to process those structs. So it's that number. And if we go look at what Python gives you, it looks like it's off, right? Yeah, it's missing some of the rounding part. So again, how the system represents the types in your tuples can matter a lot for performance.

All right, any questions? OK, so again, homework two is due tonight, homework three will be released tonight, and then we'll have class on Monday, where we'll start talking about indexes. All right, guys, take care.
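For anyone who wants to reproduce the gist of that demo at home, here is a rough Python sketch; the row count is scaled way down from the 50 million tuples in the live demo and the absolute times will differ on your machine, but hardware floats should come out far ahead of the software fixed-point Decimal type, and the two sums will disagree in their low-order digits:

```python
import random
import time
from decimal import Decimal

N = 1_000_000  # scaled down from the 50 million rows in the live demo
pairs = [(random.random(), random.random()) for _ in range(N)]
dec_pairs = [(Decimal(str(a)), Decimal(str(b))) for a, b in pairs]

start = time.monotonic()
float_sum = sum(a + b for a, b in pairs)
print(f"float sum:   {float_sum}  ({time.monotonic() - start:.2f}s)")

start = time.monotonic()
dec_sum = sum(a + b for a, b in dec_pairs)
print(f"decimal sum: {dec_sum}  ({time.monotonic() - start:.2f}s)")

# The decimal pass is much slower (software fixed-point arithmetic versus
# hardware floats), and the float result drifts in its low-order digits.
```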