 So, real quick, two quick announcements, I want to discuss administrative things. The first is that everyone should be aware that homework one is due next week on Monday at midnight. Some of you have already started; the people that were trying to get enrolled in the course have already finished, obviously. Actually, whoever finished it first, I don't know who it is yet, they weren't even on the waitlist. They were already enrolled in the course, and they beat all of you, right? So whoever that was, congrats. And then the other thing is that next week we will be releasing the first programming project. That'll be due, I think, two weeks after this. I'll send out a notice about this, but we've adjusted the deadlines slightly for these projects, because last year the feedback we got was that the first project was considered too easy and the second project was considered really hard, so everyone thought the second project would be like the first project, and a lot of people ended up failing because they didn't allot themselves enough time. So we're giving you less time for project one and more time on project two, okay? The other thing to announce is that there are a bunch of database talks coming up in the next week. So tomorrow in CIC on the fourth floor, we'll have the co-founders of Kinetica, which is a GPU-accelerated database system out of Virginia; they'll be coming and giving a talk at noon. Again, that's open to everyone, you don't have to register for this, and there will be pizza available for that one. And then on Friday we'll have a talk from Pat Helland from Salesforce. Pat is sort of a legend in the database systems community. He's not an academic; he's just somebody that's been around working on databases for a long time, since the 1980s. So he's going to come give a talk, also at noon in the CIC on the fourth floor. There should be pizza at this one as well. So I encourage you to come to either one of those. 
The first one, the Kinetica talk, will be very technical, but they're going to talk more about the engineering side of making database systems run on GPUs. And this goes well beyond anything we'll talk about in this course. And then the Salesforce talk should be a bit more lofty, a bit more high-level. I guess Pat's much more senior. It's about thinking big about distributed systems, distributed databases. And then on Wednesday next week at 4 p.m. we'll be having the co-founder of RelationalAI come give a talk. I don't know this person; this is actually being organized by somebody else in Tepper. But the co-founder of RelationalAI had a previous database system startup called LogicBlox, which is a bit of an esoteric system; it didn't operate the same way that the database systems we're talking about here operate. He's sort of a semi-famous guy, so he'll also be giving a talk at 4 p.m. So the first one will be on YouTube, the other two will not be, okay? The first two will have pizza; the third one, I don't know, because I'm not running it. You can plan your meals accordingly. All right, any questions? Any questions about homework one? Okay, let's jump right into it. Okay, so where we're at now is that we spent the first two lectures talking about higher-level things like what the relational model is and how you would write queries against data stored like that. And that was all sort of lofty, in terms of being high-level and abstract, right? We didn't describe anything about how we would actually implement code that could do those particular things. So this is the point in the course where we start describing how you would actually build software to execute queries on a relational database system. And that's really what we're going to be talking about from now until pretty much Thanksgiving. 
And so if you go back to the course outline that I showed in the first class, we've already covered the first part. We already covered relational databases. And now we're getting into all the implementation details. And the way to think about this as we go along, at a high level, is that we're essentially going to be describing different layers of the database management system software, which we build on top of to do more complicated things or get more functionality, right? So at the storage level, we're starting at the bottom, right? We're talking about how we're actually going to read and write data and what that data looks like on disk. And then we define an API in our disk manager or our storage manager to say, here's how to interact with the data that I manage. And then on top of that, we can build our buffer pool. And on top of that, we can start building our execution engines or access methods, right? So we're going up one by one and describing how you would build each different layer, what the tradeoffs are, and what the different design decisions are that you have to think about for each layer. And then once that's done, we can assume that we're good to go, right? As long as we implement whatever that layer wants, or we define what the API is, we can build other stuff on top of it. So we'll see in a couple of cases today where I'm going to describe different ways to organize pages or organize data in pages. And what happens above that actually doesn't matter in some ways, right? We can still use maybe the same concurrency control schemes or the same indexing schemes on top of this because we have this abstraction down below. And this is not new to databases, or not unique to databases. This is the standard practice in building any large software system. So again, today's class and next class, we're focusing on the disk manager. 
And again, this is just how we're going to actually store data on disk. So for the course, I sort of mentioned this in the beginning, but I didn't really define what I meant. So I want to say that this course is focused on what I would call a disk-oriented database management system architecture. And what I mean by that is that the database management system is designed such that it assumes the primary storage location of the database is on some non-volatile disk. And I'll define what I mean by non-volatile disk in the next slide. And therefore the components of the system are responsible for figuring out how to move data back and forth between the non-volatile disk and volatile memory, right? Because the database system can't operate on data directly on disk, right? There are some newer disks that can do these kinds of things, because there are CPUs down there. But in practice, with a commodity disk you would buy on Amazon or Newegg, you're not going to be able to run code down there, right? So you always have to bring the data you need to operate on, to either read or write it, into memory, and then do whatever it is that you need to do. So the database management system is all about managing that movement back and forth at this level of the architecture. And this is the classic von Neumann architecture from the 1950s. So in order to understand the implications of this movement of data back and forth, we first want to understand what the storage hierarchy looks like. What are we actually dealing with in our real systems? So the way to think about it is as a tree structure like this, right? And every single layer is a different kind of storage-class device. You can only move data directly from one layer to another, right? You can only take data out from, well, that's not exactly true, but you can only take data out from an SSD and put it into DRAM before you can put it into your CPU caches, right? Again, that's not entirely true. 
In most cases, though, for our purposes, this is good enough. So the way to think about this hierarchy is that the storage devices at the top are really fast, but they have much smaller capacity and they're also much more expensive, right? And at the bottom, you have larger storage devices that have larger capacities and are cheaper per gigabyte or per megabyte, but they're going to be much, much slower. So if you go look in the textbook, they'll describe this hierarchy as well. The one thing I'll point out is they don't put network storage at the bottom. Network storage would be something like EBS or HDFS, like a distributed file system. Below this would be things like tape archives, right? I mean, that's really only used for disaster recovery. You can get that from Amazon. Amazon has a thing called Amazon Glacier, right? That's like super, super slow. We're not even going to worry about that here. So in our course, the way we're going to differentiate between these different layers is not so much between SSDs and HDDs. It's really: is it volatile versus non-volatile? So anything above the line is considered volatile, right? That means if you remove power from the machine, like you pull the power cord, then all your data is gone, right? DRAM needs power every so often to refresh its cells and maintain the charge. Same thing for CPU caches and CPU registers. And so the way we're going to access volatile data is going to be different from how we're going to access non-volatile data. So volatile storage is going to support what's called a random access API, meaning we can jump to any single byte address and get the actual data that's there. And it's going to be able to support this fast random access, right? We can jump around anywhere in our memory space, and it's going to be roughly the same speed no matter where we go. Non-volatile memory is going to be better off doing what's called sequential access. 
So that's trying to read a lot of data that's contiguous in storage, right? So you just read all the data at once. You can still jump around, but it's going to be slower. And instead of accessing a single address or cache line to go get the data we need, we're going to access the data in what are called blocks or pages, which are typically about four kilobytes, right? So in volatile storage, I can say, go get me this single 64-bit memory location, and I'm going to read the value. In non-volatile storage that's block-addressable, I have to go get the four-kilobyte block that has the data I need. And then inside of that, I can get the 64-bit value that I want, right? So again, the way to think about this is that throughout the course, I'm not going to differentiate between SSDs and HDDs or network storage. I'm really going to say that DRAM will be considered memory, and then anything below that will just be considered disk, right? Because at a high level, the algorithms we're going to use to access what we call disk are roughly the same whether it's an SSD, HDD, or EBS. The other thing I'll also say is that in this course, we're not going to care about anything above DRAM, right? So CPU caches like L1, L2, L3, or CPU registers. Because in our world with a disk, the disk is so slow that all of our focus is really about how we can speed up or hide the latency of these storage devices down here. And once it's in memory, that's good enough. If you take the advanced class in the spring, 15-721, we're not going to do disk-oriented database systems. We're going to do memory-oriented or in-memory database systems. So then the disk goes away, except for logging. And since everything's in memory, now we care about the stuff at the top, right? 
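To make the block-versus-byte distinction concrete, here is a toy sketch (the file layout and names are hypothetical, and 4 KB blocks are assumed): to fetch one 64-bit value from block-addressable storage, you read the whole block containing it into memory, then pick the value out of that in-memory copy.

```python
import os
import struct
import tempfile

BLOCK = 4096  # typical block/page size: four kilobytes

# Build a throwaway file of 64-bit little-endian integers spanning two blocks.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    for i in range(2 * BLOCK // 8):
        f.write(struct.pack("<q", i))

def read_value(path: str, byte_addr: int) -> int:
    """Fetch the whole block that holds byte_addr, then extract the
    8-byte value -- block-addressable access, not byte-addressable."""
    block_no = byte_addr // BLOCK
    with open(path, "rb") as f:
        f.seek(block_no * BLOCK)
        block = f.read(BLOCK)          # read one full block, not one byte
    return struct.unpack_from("<q", block, byte_addr % BLOCK)[0]

val = read_value(path, BLOCK + 16)     # third value in the second block
os.remove(path)
```

Reading the same value from DRAM, by contrast, would just be a single load of that 64-bit location.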
And you can actually get pretty good speedups if you start aligning things to cache lines or try to keep as much data in CPU registers, actual individual registers, for as long as possible. But again, for our purposes here, we don't care. The other thing I want to bring up is that there is actually a new storage technology coming out called non-volatile memory, which sits in the middle here between SSDs and DRAM, right? This is sometimes called persistent memory or storage-class memory. The way to think about it is that it's like DRAM in that it's fast, it supports random access, and it's byte-addressable. But it's persistent like an SSD, right? Meaning if I write to non-volatile memory and I pull the plug, all my data is still there when I come back. For the last couple of years, as I've been teaching this course, I always kept saying, yeah, non-volatile memory exists in the labs, but you actually can't get it yet, so we're not really going to worry about it here. That's actually not true as of this year. Intel announced that they actually have non-volatile memory in production, and we actually have access to it here at CMU. My PhD student that just graduated, he's now at Georgia Tech, his whole thesis was on non-volatile memory databases. So we're not ready yet to throw away the entire textbook. But I think a lot of the things that we'll talk about during the semester, some of these things will go away, some of these things will get tweaked, because now you're going to have non-volatile memory. So I think Intel's announcing it's going to come out later this year, supposedly. I think they're calling it 3D XPoint or Optane memory. There are a bunch of different marketing terms for it, right? But at a high level, it works like this: it looks like DRAM, but it's persistent like an SSD. 
So to give an idea of why we're going to spend a lot of work moving data back and forth between DRAM and disk, we have to understand what it takes to actually get data off disk and bring it into memory. So there are a bunch of different tables on the internet that have roughly the same information, sort of like this. The exact numbers might be slightly different, but the magnitudes of the differences are roughly the same. So to read data from a single cache line in L1, which is the closest cache, right above the CPU registers, takes about half a nanosecond. If you've got to read from L2, it's seven nanoseconds. DRAM is 100 nanoseconds, and so forth, right? So when you show numbers like this in nanoseconds, people don't really know how to put that in perspective of real-world clock time. So instead of saying that, for this case here, an L1 cache reference takes half a nanosecond, what would it be if it took a second? Now you can actually see why reading from these slower devices is so bad, right? So if we say an L1 cache reference is half a second, then reading from L2 cache is seven seconds, DRAM is 100 seconds, reading from an SSD is 1.7 days, reading from a spinning-disk magnetic hard drive is 16 weeks, reading from network storage like EBS is 11 months, and reading from a tape drive is 31 years. The other metaphor people like to use is, say I want to read a page in a book: reading from my CPU caches is like the book being right in front of me, reading it from DRAM is like walking across the room to get it, reading it from an SSD is like walking down to the library and getting it, but reading it from a tape drive is like flying to Pluto to read the book and then flying back, right? So these are orders of magnitude slower, and again, this is why we're gonna have to be smart about how we keep our data in DRAM as much as possible to avoid ever having to go to disk. 
Now in some cases, we can't avoid this, right? We have to actually write data to disk to make sure things are durable, but we can be smart about how we actually want to do this. So our system design goal for how we're gonna build a database system, at a high level, is essentially that we want to provide a software system that has the illusion that the entire database fits in DRAM even though it doesn't, right? And we're gonna be careful about how we move data back and forth, and we wanna do it in such a way that we avoid large stalls, because if we have to go fetch something from disk, we don't wanna have that slow down everything else. Again, it's unavoidable: if the data's not in memory, you've got to go to disk and get it, but ideally you wanna have the database system be able to do other things while you're fetching that data. Like, be processing other queries. Of course, and we'll see this later on when we talk about transactions and concurrency control, if you have multiple threads doing multiple things at the same time, because maybe one thread stalled to go out to disk, now you have to make sure that you order their operations correctly so they don't read data that they shouldn't be reading. Okay, we'll worry about that later. For our purposes here, we're just trying to say: how can we deal with reading and writing data from disk, and how are we gonna organize things? I mentioned this before when we talked about the difference between non-volatile and volatile memory, but just to bring it up again: the other big difference we're gonna see in how we design these components is dealing with the fact that sequential access on non-volatile storage is much faster than random access. So that means that we're gonna try to maximize our ability to read and write data sequentially, meaning reading large blocks of data at a time to get the things that we need. 
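As a small aside on encouraging sequential layout: on Linux you can ask the filesystem to reserve a large contiguous run of blocks up front, so that later sequential scans touch neighboring blocks rather than seeking around. A sketch (Linux-specific; the one-megabyte size is just an illustrative choice):

```python
import os
import tempfile

PAGE = 4096
EXTENT_PAGES = 256                      # 256 * 4 KB = 1 MB reserved up front

fd, path = tempfile.mkstemp()
# posix_fallocate reserves the space immediately; the filesystem can then
# hand back (mostly) contiguous blocks, which favors sequential scans later.
os.posix_fallocate(fd, 0, EXTENT_PAGES * PAGE)
size = os.fstat(fd).st_size             # file now spans the whole reservation
os.close(fd)
os.remove(path)
```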
If you have to jump around, say you have a spinning-disk hard drive and it has a mechanical arm that moves around on the platter, if you have to move it every single time you wanna read a single tuple, then that's gonna be really slow. But if you can pop the arm down and let it read a large segment of data, then bring that into memory, that's gonna be much faster, because again, there are fewer movements of the physical arm. So operating systems usually support the ability to allocate multiple pages or blocks in a row; these are called extents. So you can say, I wanna allocate a one-megabyte extent, and then the OS will try to align those together and make them contiguous on the actual storage device itself. So as I said in my design goals, we wanna have our database system provide the illusion that it has more memory than it actually has. If we have one gigabyte of RAM and we have a 10-gigabyte database, we wanna make it appear that the entire database fits in memory. So what does that sound like? Who here has taken an operating systems course? What's that? I heard it. Virtual memory, exactly, right. So this is essentially what virtual memory does, right? With virtual memory, the operating system says, all right, here's my address space, but then I have this little swap section where I can write out pages when I run out of space in memory. And to the application, it looks like everything's actually in memory when it's not. You don't have to change your code in any way to deal with the fact that the page you wanna read is out on disk. The operating system hides all that for you. So, well, if this is what I want in my database system, I want the ability to make it appear that I have more memory than I actually have, and the operating system can already do this with virtual memory, why not just use the operating system for this? So the way you would actually write this is through a syscall called mmap. 
Who here has heard of mmap? Two in the back, okay, three. All right, so mmap means memory-mapped file, and the way to basically think about this is that you have a file on disk, and you tell the operating system to mmap it, you load it in, and it's basically gonna map the physical pages of the file on disk into virtual memory addresses inside of the OS. And then from your application's standpoint, you can jump to any location in that file in your memory address space, and underneath the covers, the operating system will page things in and out as needed, right? So the operating system is essentially responsible for moving data in and out. So just to give a quick example, it's like this: say this is our file on disk and it has four pages, right, one, two, three, four. And then in memory, in the operating system, we're gonna have our virtual memory, so we mmap it in, so we have the four pages in memory, but at this point, at the very beginning, there's nothing in there, right, it's empty because we haven't gone to disk to get anything. And then we only have actually two pages of space in physical memory. So now if my process comes along and it wants to read the first page, I'll get a page fault, which is an interrupt. The operating system stalls my process while it goes down to the disk, fetches the page that I want, puts it in physical memory, and then maps the virtual memory to point to that physical page, right? Again, if I access page three, same thing, I get a page fault, the OS stalls my process, and then it goes and fetches what I need, right? Stalling the process essentially means that you're blocking your thread, because now you're down in the kernel, and the operating system knows not to schedule it for a quantum to actually run, because you can't do anything while you're blocked waiting for this page, right? Of course, now we have this problem of, well, we want to access this page here. It's not in memory. 
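The flow just described can be sketched in a few lines with Python's mmap wrapper (a toy demo on a throwaway file; Unix assumed). Touching a mapped address that is not resident triggers a page fault, and the OS transparently pulls the backing page in:

```python
import mmap
import os
import tempfile

PAGE = 4096

# A throwaway "database file" with four distinguishable pages.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    for i in range(4):
        f.write(bytes([i]) * PAGE)         # page i is filled with byte value i

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4 * PAGE)   # map the file into our address space
    # These reads look like plain memory accesses; under the covers the OS
    # page-faults, fetches the page from disk, and fixes up the mapping.
    first = mm[0]                          # touches page 0
    third = mm[2 * PAGE]                   # touches page 2
    mm.close()
os.remove(path)
```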
We don't have any more free space in physical memory, so what do we actually do, right? We need to evict one page and make space for it, but now there might be other threads or other processes doing the same thing, so we may have a long stall, right? So this is bad from our point of view. Remember I said earlier that ideally, whenever we have to go to disk to get something, we don't want to stall the entire database system. We want to have something else keep running and maybe make forward progress on other queries while we go fetch the thing that we need. But in this case here, with the operating system using mmap, it doesn't do that, because the operating system stalls our thread. Now if we have other threads, they can do other things, but that's going to cause problems, right? So if we have multiple threads that access the same mmapped file at the same time to hide these page fault stalls, now we've got to deal with the problems of what happens if they're reading and writing to different pages, and how to make sure things actually get written out to disk correctly. So I don't want to go into the details of how mmap sucks. This has been a long-standing beef that I have with mmap, if you want to call it that. I will say also that underneath the covers, if you call malloc, it's essentially doing mmap, right? But it's not actually being backed by a file on disk, right? It's backed by physical memory that's always there. It could get swapped out, but that's just virtual memory in the OS. So mmap works really great, or reasonably great, for read-only workloads, but when you have writers, then you have problems. So the way to sort of handle this is that the OS does provide you some hints to overcome this, right? These three syscalls: madvise, mlock, and msync. These are basically ways of telling the operating system how your process, your database system, is actually going to access the pages that you mmapped, right? 
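Those hints look roughly like this through Python's stdlib wrappers (a sketch; Linux and Python 3.8+ are assumed for madvise, and note that mlock has no direct stdlib wrapper here, so only the other two are shown; flush() is the msync analogue):

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"\x00" * (4 * PAGE))

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4 * PAGE)
    # madvise(MADV_WILLNEED): tell the OS we plan to read these pages soon,
    # so it can prefetch them instead of waiting for page faults.
    mm.madvise(mmap.MADV_WILLNEED, 0, 2 * PAGE)
    mm[0:5] = b"hello"                  # dirty the first page
    # flush() wraps msync(): force the dirty range back to the file on disk
    # now, rather than whenever the OS gets around to writing it.
    mm.flush(0, PAGE)
    mm.close()

with open(path, "rb") as f:
    on_disk = f.read(5)                 # the write is durable in the file
os.remove(path)
```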
So madvise lets you say, I plan on reading these certain pages now, or how I plan on reading them; mlock can tell the operating system, whatever you do, don't swap this out to disk, because I'm always going to be doing something with it; and then msync allows you to flush things out to disk. So there are some systems that actually use mmap. The ones that I'm aware of in terms of full usage would be MonetDB and LMDB. The LMDB folks don't like me because I don't like mmap. That's another story. And MonetDB is an analytical system that's mostly read-only, so this is fine, but when you actually want to start doing writes, that's when you have problems with mmap. Now there are some systems that also partially use mmap. So MongoDB, the original engine that it used was based on mmap. It was so bad that they threw it away and built what I'll call a correct storage manager, called WiredTiger, that does the things that we're talking about in this class. MemSQL uses mmap for its column store, for sort of read-only things, same as InfluxDB, and then SQLite has different engines you can use, and I think one of them is mmap. So the bottom line I want to get out of this, and we'll see this later on when we talk about the buffer pool manager in two lectures, but the recurring theme throughout this course will be that the database management system can always do a better job than the operating system, because it knows exactly what's happening. It knows what query is executing, it knows what the data looks like, it knows how it's going to access that data. So therefore, it can always do a much better job at scheduling these things than the operating system, right? And there are certain precautions you have to take to make sure that you write things out in the correct order, which are actually difficult to do and not really portable with the mmap syscall. So if I die, you can put two things on my tombstone. 
One is: the operating system is not your friend. It's like a frenemy, right? You need it to survive, but it kind of gets in the way. And the second one is: never use mmap for your database management system, right? Now, I'll send out an announcement about this, but this has been a longstanding beef. There are some really good researchers in Germany that tried to use mmap for their database system; they gave up on it, and we want to write a paper to finally prove that mmap is a bad idea for a database system. So if you're interested in this, and you want to ruffle some feathers, email me and we can talk about it. All right, so again, we'll see this a couple of times throughout the semester where it seems like what I'm describing to you is functionality that the database system already, or sorry, the operating system can already provide, but we're having to implement it ourselves. And every major commercial database system does this, right? Because they can always get much better performance and have much better control, and guarantee correctness and safety of data in ways that may not always be possible with the operating system, right? So again, what I'm describing for this class will seem in a lot of ways like, oh, we're building a little mini operating system. Yes, that's because we want to do things ourselves. And then you know what, if we go do this at a company, they pay us a lot to do it because it's hard, right? Which is always a good thing. All right, so what are the two problems we have to deal with in database storage? The first is: how are we going to actually represent the data in files on disk, right? And the second one is: how are we actually going to manage the movement of pages back and forth between disk and memory? So for this class and the next class, we're going to focus on the first problem, right? 
So we're not going to worry about, you know, if we have these pages and we copy them into memory, how do we actually manage that? We'll cover that when we talk about the buffer pool, and that's what you'll do in the first project. We're really focused on the problem of: we have these files on disk, how do we actually represent the data that's stored in them? So for today's agenda, I'm going to talk about three parts, right? And you can see, again, what I said before about how we're going to build up these layers to do more complicated things. So we're going to start with how we're actually going to maintain these files on disk, and then we'll build up from that and talk about how we actually organize the pages in those files. And then from that, we can talk about how we actually organize the tuples inside of those pages, right? So again, we'll go through these one by one and see how these things are built on top of each other. So in its most basic form, the database management system is essentially just going to store the database as files on disk, right? There's no magic to it, there's just a bunch of files in your file system, right, there they are. Some systems will store the database as a single file; SQLite does this, right? If you create a database in SQLite, it creates a single .db file for you, which is nice because it's portable, you can move it around anywhere. But typically the bigger systems actually store multiple files in various hierarchies. So if you have Postgres installed on your system, if it's on Unix, go look at /var/lib/postgresql; they're gonna have all these directories, where they'll have different directories for tables, different directories for indexes, different directories for metadata. And inside of those will be these different database files. So these files are typically in a proprietary format that is specific to the database management system. 
So it's not like you can open up a text editor and actually see any data, right? It's gonna be some binary format that only that database management system can read and write. Now, we'll see this in the cloud systems later on, but there are now these sort of universal database file formats, things like ORC or Parquet or Arrow, right? These are things that are designed to be usable or ingestible by different database systems. But in general, usually the actual raw files themselves can only be understood by the database system that made them. So because they are just files on disk, the operating system doesn't know anything about them, or it doesn't know that they're special; they're just like any other files. So all the standard file protection mechanisms that you get from regular files, the OS provides those for you as well, right? You can just run your database system on top of ext4, ext3, whatever Windows provides, right? And the upper levels of the system, for the most part, maybe don't care. I would say in the early 1980s, there was sort of a moment or a trend where the database vendors decided that all the file systems were crap and they wanted to go implement their own database-specific file system. What that means is you just take the raw blocks on the actual storage device, and the database system knew how to build a file system on top of that and just read and write the raw blocks. As far as I know, nobody actually does that anymore. Certainly you wouldn't build a new system today and start doing that. The conventional wisdom is that you really only get maybe a 10 or 15% speed-up in performance if you run your own file system over something like ext4, and for that kind of speed-up, it's just not worth the engineering effort, right? And it also makes your database system less portable. 
So in general, most major database systems are just gonna run on existing file systems, and that's good enough for us. It's not great, but it's good. So now the first thing we need to build is what we'll call the storage manager. And this is essentially responsible for maintaining the database files, whether there's one or multiple. The storage manager is responsible for saying: here are my files, and I know how to get pages in and out of them. So the files are gonna be organized as a collection of pages, right? You take the file and split it up into different chunks, and each of these chunks is called a page. And the storage manager keeps track of what data has been read and written to pages, and how much free space there is in pages. If you wanna do compaction, maybe remove empty pages that aren't being used and use less space, right? This is what the storage manager provides for you. So a database page, as I said, is a fixed-size block of data, and it can contain anything that relates to our database, right? It can contain tuples, it can contain metadata about the tuples, it can contain indexes, it can contain log records, it can contain intermediate results from a query as we're running it, because we have to spill to disk when we don't have enough memory, right? For our purposes in this class, we'll just assume that we're only dealing with pages that contain tuples. Typically you also don't mix the types of data you store in a page. So you're not gonna have a single page that has both index data and tuple data, right? Mostly this is done from a software engineering standpoint: you wanna have each page only store data of a specific class or specific type. Now some systems actually require pages to be self-contained. And what I mean by that is, all the data you need to know about how to read and interpret the data that's stored inside that page has to be stored in that page. 
An example of this would be, let's say I'm doing dictionary compression, right? Dictionary compression is where you take a string that occurs a lot, and instead of storing the actual raw characters, you just store an integer as a placeholder for that original string. So you need a dictionary that says: here's how to map that integer back to the original string. If you store that dictionary on one page, and then store the data that's encoded with that dictionary on another page, and that dictionary page gets trashed, now you can't decode anything in the original page. So some systems like Oracle — they're worried about disaster recovery — they actually say no, everything you need to know to interpret the data has to be stored on that same page. That way, if for whatever reason a page gets trashed, the damage is isolated to that single page, right? That's actually specific to Oracle; I don't think anybody else does this, but I think it's a really good idea. It does cause them to make certain other design decisions that may seem inefficient from a performance standpoint, but from a never-losing-data standpoint, I think it's a good idea. So now, every page is gonna have a unique identifier — we'll call it the page ID. The database system is gonna have to maintain some kind of indirection map that lets it resolve a page ID, because that's what the upper levels of the system are gonna ask for. The upper levels are gonna say, give me page 123, and the storage manager needs some kind of mapping layer to take that page number, find the file and the offset where that particular page lives, and then give the bytes back up to whoever asked for it.
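To make the self-contained-page idea concrete, here's a small Python sketch of dictionary compression. The encoding format and field names are made up for illustration — no real system stores pages like this — but the key point matches the lecture: the dictionary travels with the encoded values, so the page can be decoded even if every other page is lost.

```python
def dict_encode(values):
    """Map each distinct string to a small integer code; keep the dictionary with the codes."""
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    # Invert so code -> string, and store it alongside the codes (self-contained page).
    return {"dict": {c: s for s, c in dictionary.items()}, "codes": codes}

def dict_decode(page):
    # Everything needed to interpret the codes is on the page itself.
    return [page["dict"][c] for c in page["codes"]]

page = dict_encode(["WuTang", "WuTang", "GZA", "WuTang"])
assert dict_decode(page) == ["WuTang", "WuTang", "GZA", "WuTang"]
```

If the dictionary instead lived on a separate page, losing that one page would make every encoded page unreadable — which is exactly the failure Oracle's design avoids.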
The reason why we wanna keep pages fixed size is that, in our file, if we say we want page 123, we know exactly how to calculate the offset to jump to in that file to get the data that we need. If we had variable-length pages, then we'd need an indirection layer on top that says, oh, you want page 100, it's at this offset — and now you have to maintain that thing. And you'd have problems if you delete a variable-size page: now you have a hole that you maybe can't fill back up. So pages are always gonna be fixed size, and this makes everything much easier. We'll see later how we handle tuples that are variable length within a page. One thing I'll confess is that as I go along, sometimes I'll say page, sometimes I'll say block — the words are in some ways interchangeable, but I'll try to use page as much as possible. And there are different concepts of pages in our system that we have to be mindful of as well. At the lowest level you have what's called a hardware page: this is what the actual storage device itself provides to the operating system when you ask to read or write a particular block. Different storage devices have different page sizes, but in general they're usually four kilobytes. Then the operating system provides its own notion of pages as well, and these are usually four kilobytes by default — Linux and Windows both use four kilobytes — but there are ways to turn on what's called huge pages, where you can have it organize memory in pages of up to one gigabyte. You do this because it reduces the number of entries in the page table, so the TLB covers more memory and you get fewer TLB misses.
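The fixed-size property is what makes page lookup pure arithmetic. A minimal sketch in Python, assuming a single database file and a hypothetical 4 KB page size (real systems range from 1 KB to 16 KB, as discussed below):

```python
import io

PAGE_SIZE = 4096  # assumed page size for this sketch

def page_offset(page_id):
    # With fixed-size pages, locating page N needs no lookup table: it's multiplication.
    return page_id * PAGE_SIZE

def read_page(f, page_id):
    f.seek(page_offset(page_id))
    return f.read(PAGE_SIZE)

# Simulate a three-page database file in memory.
fake_file = io.BytesIO(b"A" * PAGE_SIZE + b"B" * PAGE_SIZE + b"C" * PAGE_SIZE)
assert read_page(fake_file, 1) == b"B" * PAGE_SIZE
```

With variable-length pages, `page_offset` would have to become a maintained mapping — the extra indirection (and the hole problem on deletes) that fixed-size pages avoid.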
But in our world, again, we care about the database page. Essentially the way to think about this is that we're gonna organize our database into our own pages, those pages will get mapped to OS-size pages, and the OS-size pages get mapped to hardware pages. You'd think database pages would always be four kilobytes, because that's what the hardware wants and that's what the OS wants, but in practice it varies a lot. At one end of the spectrum we have SQLite, which has one-kilobyte pages, and this makes sense because SQLite is designed to run on really small embedded devices that may not have four-kilobyte pages. DB2 and Oracle use four-kilobyte pages, Postgres and SQL Server use eight-kilobyte pages, and then MySQL in uncompressed form uses 16-kilobyte pages. I actually don't know why they do this — I have a hunch it's because they wanna do index-organized tables, which we'll talk about later on. When you do compression they can also store pages in smaller sizes, but we're not gonna talk about that here. What I will say, though, is that the hardware's page size is what it can guarantee as what I call a failsafe write, meaning the device can guarantee that if you tell it to write a page, and that page size is four kilobytes on the hardware, either all four kilobytes get written or none of them get written. That means that if you have larger database pages and you need to flush them out to disk, you may have to do extra work or add extra protection mechanisms to make sure you don't have torn writes. If you have a 16-kilobyte page in MySQL and the hardware can only provide four-kilobyte atomic writes, you don't wanna write the first eight kilobytes, crash, come back, and have only half of your update.
We'll see this when we talk about logging and recovery: systems like MySQL have to do extra work to make sure they're writing their data out safely, that it's been staged, and that everything is always there. All right, so now if we have these files and we know they're gonna be split up into pages, we need a mechanism to figure out how to actually find the page that we want, right? The upper levels of the system are gonna say, give me page 123, and now we need to go figure out where page 123 is. Or the system could say, hey, I wanna write this amount of data, give me a page that has a free slot, and now our storage layer has to figure out how to find that. So there's a couple of different ways to do this: heap file organization, sequential or sorted file organization, and hashing file organization. I'm gonna focus on the heap file because it's the most common. The sequential or sorted file organization we'll see when we talk about indexes — that's typically how you implement clustering indexes; basically the heap file is unsorted, sequential or sorted files are sorted, and you need an index to make that happen, and we'll talk about that later. The hashing file organization is just an easy extension of the heap file one. So we'll focus on the heap file. The other thing I'll say at this point is that we don't really care what's inside of our pages yet. Again, we're only trying to say give me a page, or give me a free slot within a page — we don't care what's actually stored in it, right? And that makes our job a bit easier here, because we don't care about finding exact things; we can let the upper levels of the system do that for us. So a database heap, or a heap file, is an unorganized collection of pages where the tuples — or whatever data we're storing — can be stored in random order, right?
We have a chunk of data we wanna write to a page, and if we're using a heap file, we don't care where it actually goes as long as it goes somewhere. So the basic API we need to support in a heap file storage manager is the ability to get and delete a page, as well as to iterate over every page, right? If I have 100 pages, I want a way to get the chunks of data for each of those 100 pages. Now, we're gonna need to maintain some metadata to keep track of what pages exist, obviously, and we need to keep track of where we have free space, because again, we may wanna say: get me a page with enough free space to store this amount of data. There are two ways to organize this. The first is a linked list and the second is a page directory. The spoiler is that everyone does the page directory approach, but it's good to understand what the linked list one is so you can see why you need the page directory. The linked list approach is that you have a single header page that contains just two pointers, and those pointers point to two different linked lists. The first linked list is the free page list: these are pages that have at least some free space where you can actually store data. The second linked list is the data page list: these are pages where there is no more free space, right? So what happens is, if you wanna say get me page 123, you actually have to traverse these linked lists and figure out where the page you want is. Or if you need to say, I wanna insert a new record and I need to find space, you have to follow along the free page list. And there's additional metadata that we need that we're not actually gonna track inside of the header page — like exactly how much free space each page has. All of that is gonna be stored directly inside the pages themselves, right?
There's no reason we couldn't track it in the header — this is just a strawman implementation where we don't, right? The other problem we have is that, say we wanna get page 123: we end up having to do basically a sequential scan of both of these lists until we find the one we want, right? So the better approach, and what pretty much everyone does, is to use the page directory. Now you're gonna have a special directory page, usually at the beginning of the file — in SQLite, the first page is always this directory. It just tells you: here are all the pages I have, at these locations in my database files, and here's information about how much free space they have. So now you can easily find every single page, because you just look in the directory. Say you wanna get page 123: you look in the directory and it tells you exactly where it is. And if you wanna find a page with free space, the directory can maintain that as well. Of course, now you have to make sure the directory page is always in sync with the actual data pages themselves, right? Because if a data page has free space but the directory page thinks it doesn't, then you're never actually gonna use that space, right? So you always have to make sure that you flush out the directory page any time you add new pages or modify these things. So what would happen here if the directory page gets trashed? Are we screwed? I see someone shaking their head no — why? Exactly: if the first page gets blown away for whatever reason, we can always go back, scan through every single page, and reconstruct the directory, right? For small files, no big deal. If I have a one-petabyte database, that's a problem, because it's gonna take me days to actually do this. So there are extra protection mechanisms we can have in our system — writing out to the log all the updates we make to our directory page — to avoid having to recreate everything from scratch, right?
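Here's a rough Python sketch of what a page directory tracks — page locations plus free-space counts. The class and field names are made up for illustration; a real system packs this into the directory page's bytes rather than a dictionary.

```python
class PageDirectory:
    """Toy page directory: maps page IDs to file offsets and free-space counts."""

    def __init__(self):
        self.entries = {}  # page_id -> {"offset": byte offset in file, "free": free bytes}

    def register(self, page_id, offset, free_bytes):
        # Must be kept in sync with the data pages, or free space goes unused.
        self.entries[page_id] = {"offset": offset, "free": free_bytes}

    def locate(self, page_id):
        # One directory lookup instead of walking a linked list of pages.
        return self.entries[page_id]["offset"]

    def find_free(self, needed):
        # Return some page with at least `needed` free bytes, or None.
        for pid, e in self.entries.items():
            if e["free"] >= needed:
                return pid
        return None

d = PageDirectory()
d.register(0, 0, 120)
d.register(1, 4096, 4000)
assert d.locate(1) == 4096
assert d.find_free(2000) == 1
```

Compare this to the linked-list strawman: both lookups here are direct, where the linked list forced a sequential traversal.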
So this approach is resilient to failures even if the directory page fails. Even though, again, remember I said that some systems don't like to have the data you need to reconstruct a single page spread across multiple pages — in this case we can always go back and recreate the directory page. Ideally it'd be nice if we didn't have to, but we always have that fallback mechanism. Okay, so at this point we know about files, we know how we're gonna split the files up into pages, and now we have a page directory component that allows us to find the pages that we want. So the next question is: how do we actually organize the data inside of these pages? This is what the page layout is. Every page is gonna contain a header, regardless of what size it is, and this header is gonna contain metadata about that particular page — things like how much free space there is. You can also maintain checksums: when you write out the page, you take a checksum of it, and that way when you read it back in, if the checksum doesn't match, you know the page got corrupted in some way. You also wanna maintain the version number of the database system that created the page, so that if you change your page layout and you load in an old database file, you know what layout to expect. We'll talk about this later on when we cover concurrency control, but sometimes the header can contain transaction visibility information about the data on the page, so that you know whether an active transaction should be able to see any of the tuples or data inside this page. There can also be compression information, like the dictionary encoding that we talked about before, right? So all of this is gonna be stored in the header.
And as I mentioned, some systems will actually require the pages to be self-contained. So now, within a page — every page has a header — we're gonna talk about what the data block actually looks like. What are we actually storing? For this, we're just gonna focus on tuples. We'll see later on, when we talk about indexes or log records, that those are also stored in pages, but for our purposes here we just care about how we're gonna represent tuples inside of these pages. There are essentially two broad approaches. The first is a tuple-oriented approach, where we're actually storing the data for individual tuples, and the second is a log-structured approach. I'll go through both of them. Okay, so how do we store tuples in a page? I first wanna propose a strawman — a bad idea — and then we'll see how to make it better, right? So let's say that in our page, the only thing we keep track of in our header is the number of tuples that we have in the page. And for this, we're gonna assume that all our tuples are fixed length, meaning they're always gonna be 100 bytes or some number like that. So what's the easiest thing we can do to insert a new tuple? Yeah, exactly. I wanna insert a bunch of new tuples: I know where my header stops in my page, and I just start appending tuples one by one until I run out of space, right? Simple. And then I update my header to say I have three tuples, so that when I come back, if I wanna write a fourth tuple, I know where to jump to. What's the problem with this approach? What happens if I delete a tuple, right? I delete the middle guy here, so I have two tuples. Now if I keep appending, I'm gonna miss that hole. So maybe what I need to do is always start from the beginning and keep scanning until I find a tuple slot that's not used, right?
Well, there's another obvious problem with this, right? I said the tuples were fixed length. What happens if they're variable length? Does this work? No, right? So how would I store variable-length data in this? If I wanna insert the next tuple, how do I figure out where to put it? I either need something in the header that says here's the offset of every tuple, or I sequentially scan until I find a free spot — and then I have to figure out where the next tuple starts so I know where I can start storing data, right? So this is what the slotted page architecture, or layout, fixes. What I'll say is that slotted pages are essentially how the majority of disk-oriented database systems are implemented. The exact details may be slightly different, but at a high level they all work the same way. So what happens now is that in our header — along with the things we talked about before — we're gonna have this thing called a slot array. The slot array contains a mapping from slots to the starting offsets of the tuples in the actual data space. Down below we have our fixed-length and variable-length tuples all stored in line with each other, and these offsets point to where they start. So this is adding another indirection layer inside of our page. We can say things like, at a high level: give me page 123, and give me tuple five inside that page — and I go look at my slot array, and that tells me where I can find tuple five, right? The way we're gonna add new tuples is that the slot array always grows from the beginning towards the end, while the actual data of the tuples grows from the end towards the beginning. At some point the two meet, I can't store anything else, and the page is considered full, right?
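The slotted page described above can be sketched like this in Python — a toy layout with made-up header and slot-entry sizes, not any real system's format. The slot array grows from the front, tuple data grows from the back, and the page is full when they would meet.

```python
PAGE_SIZE = 4096   # assumed page size
HEADER_SIZE = 8    # assumed fixed header size
SLOT_SIZE = 8      # assumed bytes per slot entry (offset + length)

class SlottedPage:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.slots = []               # slot i -> (offset, length) of tuple i's data
        self.free_end = PAGE_SIZE     # data area grows downward from here

    def insert(self, tuple_bytes):
        """Append a (possibly variable-length) tuple; return its slot number, or None if full."""
        slot_array_end = HEADER_SIZE + SLOT_SIZE * (len(self.slots) + 1)
        if self.free_end - len(tuple_bytes) < slot_array_end:
            return None  # slot array and data area would meet: page is full
        self.free_end -= len(tuple_bytes)
        self.data[self.free_end:self.free_end + len(tuple_bytes)] = tuple_bytes
        self.slots.append((self.free_end, len(tuple_bytes)))
        return len(self.slots) - 1

    def get(self, slot):
        off, length = self.slots[slot]
        return bytes(self.data[off:off + length])

p = SlottedPage()
s = p.insert(b"andy|salary=2")
assert p.get(s) == b"andy|salary=2"
```

Note that callers only ever hold a slot number; the page is free to move tuple bytes around (compaction) as long as it updates `slots` — which is exactly the indirection the next paragraph relies on.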
Again, this is another indirection layer that allows us to reorganize pages without having to modify or change anything up above in our system. So say I delete tuple three, and now I have a bunch of free space here: I can compact and slide over tuple four to be contiguous with tuple two, and all I do is update my slot array to say where tuple four now starts — everything else in the system doesn't know and doesn't care, right? It's like doing compaction at the page level as you write it out to disk. So as I said, this is the most common approach. How systems organize the slot array can be slightly different, and what additional metadata they store in it can be different, but at a high level everybody works like this. So a different architecture, instead of storing actual tuples in pages, is to do what's called a log-structured page layout, or log-structured organization. Last year I presented this same material in a different part of the lecture — I'm not quite sure when the right point is to discuss log-structured page layouts, because it's a combination of a bunch of different things; I think this is the right point, but we'll see how it goes. The way to think about log-structured file organization is that instead of storing tuples inside of pages, we're gonna store log records. We haven't discussed what logging actually is yet, but basically think of it as a record of the changes you made to the database. If I updated Andy's tuple — say my salary went from $1 to $2 — I would have a log record that says: update Andy's tuple from $1 to $2, and I can always replay that log and recreate the actual data itself. So in a log-structured database system, you don't store any tuples at all. You only store the log records of the changes that transactions or queries make, right?
So now, in a single page, I'm just gonna keep appending log entries from beginning to end that correspond to the changes the queries have made — think of these as ordered in time, right? I never go back and delete any log record inside of a page, so I'm never gonna have any holes, right? These records can be variable length, because my updates can contain variable-length data, but I just keep appending into my page until I run out of space, and then the page is considered full, right? So for these log records: if I do an insert, the record just contains the entire tuple. If I do a delete, I have some way to mark that the tuple got deleted. And if I do an update, the record is just the delta of the change that got made, right? So what's the tricky thing about log-structured pages? What's slower? Exactly, reads are slower, right? Updates are easy: I find a page that has a free slot and boom, I append to the end, I'm done. But if I'm doing a read, I potentially need to go back through the log and reconstruct the tuple into its original form from its log records. To do reads, I have to scan in reverse order, right? So that's a problem. And this is within the context of a single page — in a real log-structured database, these records are spread across multiple pages, so I may need to go read through multiple pages in order to reconstruct the tuple into the form that I need for the read. The way to avoid having to read everything is to build indexes that basically say: if you need tuple ID 123, here's where to go get it, at these different locations. So log-structured databases are not new — they date back to the late 1980s, early 1990s — but they're really in vogue now, and there are many more systems that actually implement this.
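Here's a toy Python sketch of what reconstructing a tuple from log records looks like: scan the records for that tuple ID newest-to-oldest, stop at an insert or delete, and let newer deltas override older values. The record format (dicts with `op`/`id`/`delta` keys) is invented purely for illustration.

```python
def reconstruct(log, tuple_id):
    """Rebuild one tuple by replaying its log records newest-first; None if deleted/absent."""
    fields = {}
    relevant = [r for r in log if r["id"] == tuple_id]
    for rec in reversed(relevant):           # newest record first
        if rec["op"] == "delete":
            return None                      # newest relevant record says it's gone
        if rec["op"] == "insert":
            base = dict(rec["value"])
            base.update(fields)              # newer deltas win over inserted values
            return base
        fields = {**rec["delta"], **fields}  # keep the newest value seen per field

log = [
    {"op": "insert", "id": 1, "value": {"name": "andy", "salary": 1}},
    {"op": "update", "id": 1, "delta": {"salary": 2}},
    {"op": "update", "id": 1, "delta": {"salary": 3}},
]
assert reconstruct(log, 1) == {"name": "andy", "salary": 3}
```

This is why reads are the slow path: a single lookup may touch every record for that tuple, potentially across many pages, unless an index or compaction shortens the walk.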
Things like LevelDB, RocksDB, Cassandra, HBase — and this is probably because they're running on things like HDFS, which is append-only storage, right? You can only write blocks in HDFS; you can't go back and update things. And that works perfectly for a log-structured system like this. So to mitigate the problem of having to read everything, we can try to be clever and do compaction: recognize that instead of having to replay every single log record, we can take a chunk of them and compress them down to just the single record that corresponds to the change that actually got made, right? If I update my salary to go from $1 to $2, and then the next day I get another raise and go from $2 to $3, instead of having to read those two log records, I can just compact them down into a single log record. And now reads are much faster. Again, these are the systems that are actually doing this. So what's the problem with compaction? Every time I do a write, I append my log record — and then at some later point, I may compact those pages, or compact the log records in the same page, and write them out again. This is called write amplification. If I do a write to a record, my transaction committed, my query's done, my write is in my database — but in the background, I may be doing compaction and rewriting that same log record over and over again. In the tuple-oriented pages that I talked about, for the most part, if I write to my page, write to my tuple, it's done — once it's on disk, I'm never gonna go back and write it out again. Different systems do different things, but in general that's the case. With compaction, the problem is that we end up writing the same record over and over again, and we amplify the amount of writing we do.
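The delta-collapsing idea can be sketched like this in Python. This is not RocksDB's actual algorithm — it only handles update records with per-field deltas, a format invented for illustration — but it shows the principle: many records per tuple in, one record per tuple out, so later reads replay less.

```python
def compact(log):
    """Collapse consecutive update records per tuple ID into a single merged record."""
    merged = {}   # tuple_id -> one merged update record
    order = []    # preserve first-seen order of tuple IDs
    for rec in log:
        tid = rec["id"]
        if tid not in merged:
            merged[tid] = {"op": "update", "id": tid, "delta": dict(rec["delta"])}
            order.append(tid)
        else:
            merged[tid]["delta"].update(rec["delta"])  # later delta overwrites earlier
    return [merged[t] for t in order]

log = [
    {"op": "update", "id": 1, "delta": {"salary": 2}},
    {"op": "update", "id": 1, "delta": {"salary": 3}},
]
assert compact(log) == [{"op": "update", "id": 1, "delta": {"salary": 3}}]
```

And the write amplification is visible right here: the $1→$2 record was already durably written once, and compaction writes its content out again inside the merged record.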
And if you're using SSDs, they have a limited number of writes before the cells wear down, so you could burn out your device in a year, depending on how many writes it supports. So there are different types of compaction you can do. I wanna talk about two methods that are used in RocksDB. You might have heard of LevelDB from Google. LevelDB is an mmap-based embedded database that does log-structured storage. Facebook got ahold of it, renamed it RocksDB, got rid of mmap, and made it much, much better. So if you want a log-structured embedded database, you wanna use RocksDB, not LevelDB. RocksDB supports two types of compaction. The first is called level compaction, where the basic idea is that all your writes first end up in log files at the first level. At some point, when you've written a certain number of files in that level, you compact them — combine them into a single larger sorted log file — and put that into the next level. You keep doing this: the first level keeps accumulating log records and gets too big, you move them down to the second level; at some point the second level gets too big, and you move them down into giant files in the next level. This is what I meant by write amplification: I do a single write into level zero, but then as compaction runs across the different levels, I end up writing that thing out over and over again, because I read it from one file, combine it with another, and write it back out. The other type of compaction, universal compaction, has a single level, and basically you just take two files that are contiguous to each other and combine them into a single file. We can talk more about RocksDB and LevelDB maybe next class, but at a high level this is what they're doing.
They're log-structured page layouts, and the upper levels of the system know how to replay the log records to put the tuples back into their correct form. All right, so the last thing I wanna talk about is the tuple layout. Again: we have files, files have pages, and within those pages we organize our data, which can be tuples or log records. Now we wanna understand what these actual tuples look like. A tuple in our world is essentially a sequence of bytes — that's all it really is — and it's the upper levels of the system that know how to break those bytes up into attributes with different types and expose them in a more programmatic way to the application. It's essentially the job of the database system to interpret those bytes, to interpret those values, in the manner that the programmer defined for that table. So within every single page we're gonna have tuples — again, this assumes we're doing a tuple-oriented page layout, not the log-structured stuff. Every tuple is gonna have a header, and the header is gonna contain metadata about the tuple itself. It can contain visibility information about what transactions read and wrote that tuple. It can contain a bitmap for which attributes are null. It usually doesn't contain anything about what the data actually looks like, meaning we're not gonna store any metadata about the schema in the tuple — all of that is handled separately by other parts of the system. That means we don't have to redundantly store the same information in every single tuple over and over again. So it looks like this: you have a byte array for your tuple, you have your header, and then you have a contiguous list of attributes that are stored just as bytes, right?
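Since a tuple is just bytes and the schema lives elsewhere (the catalog), interpreting it looks roughly like this. The two-column schema here — an int32 `id` and a float64 `salary` — is a made-up example table, and real systems pack null bitmaps and headers around the values; this sketch shows only the bytes-to-attributes step, using Python's `struct` module.

```python
import struct

# Hypothetical catalog entry: attribute names and struct format codes,
# in CREATE TABLE order ("<" = little-endian, "i" = int32, "d" = float64).
SCHEMA = [("id", "i"), ("salary", "d")]
FMT = "<" + "".join(code for _, code in SCHEMA)

def serialize(values):
    """Pack attribute values into the tuple's raw byte array."""
    return struct.pack(FMT, *values)

def deserialize(raw):
    # The schema comes from the catalog, not the tuple: the bytes alone are opaque.
    return dict(zip((name for name, _ in SCHEMA), struct.unpack(FMT, raw)))

raw = serialize((123, 2.0))
assert deserialize(raw) == {"id": 123, "salary": 2.0}
```

Because `FMT` is derived from the catalog, changing the column order at the physical layer would only change this format string — the application-visible logical schema stays the same, which is the logical/physical separation the next paragraph talks about.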
Typically what happens in most database systems is that however you define the ordering of the columns when you call CREATE TABLE, that's how they're gonna be written out to disk, right? So inside the page, if I have my five attributes A, B, C, D, E, I will have those five attributes A, B, C, D, E lined up in that order inside the tuple itself. Now, there's nothing about the relational model that says you have to do that, right? That's the beauty of the abstraction between the logical and the physical: at the physical layer we can reorganize these any way we want, and at the logical layer you would still see them in the correct order. As far as I know, in disk-based systems — except for the column store stuff, which we'll talk about next class — pretty much everyone writes the data out in the order you defined it, and that's just done for simplicity reasons. In the advanced class, we'll talk about in-memory databases, where you actually do wanna reorganize how you lay out data in memory, because that matters for getting word alignment with cache lines, and you can get much better performance that way. But on disk, again, we're just reading four-kilobyte pages from disk; the alignment of what's inside the pages doesn't really matter, because the disk is always gonna be the main bottleneck. The other thing we can do with tuples is, instead of just storing the data for a single table inside of a tuple, we can combine data from other tables inside of our tuples, inside our own page. Typically, if you're doing tuple storage, every page will only contain data for one table. So if I have table foo, then there'll be pages for table foo. But there's another technique where you combine data from different tables inside of a single page to help you possibly speed up performance. This is called denormalization.
So I didn't do a lecture on normal forms and functional dependencies this year, because it's really long and people always hate it — people's eyes start bleeding because it's boring. The only thing you really need to know about normal forms is that they exist in the textbook, but nobody actually goes through the formal exercise in the real world. The way to think about it at a high level is that it's a way to combine tables or break tables apart. And you sort of get this naturally if you have foreign key dependencies, right? Remember I talked about artists and albums: you could combine those into a single table, but then you're duplicating everything. So the normal forms say to split them up, so that you have single records for albums and single records for artists, right? That's the gist of normal forms that you really need to know. So with denormalization, instead of splitting tables apart, you're putting them back together. The user can define their tables in the proper normal form, as separate tables, but underneath the covers, inside of our pages, we can combine these things to get better performance. And again, it's completely transparent to the actual application. So I have two tables here: foo and bar, and bar has a foreign key dependency on attribute A. If I'm storing these with every page containing tuples from only one table, then I would have the tuples for foo in one page and the tuples for bar in another page. Physical denormalization is essentially like doing a pre-join: I know there's a foreign key dependency between these two tables, so I'm likely to join them together all the time in queries. What I can do instead is inline all the values from the bar table that relate to a particular foo tuple inside that tuple. So inside the tuple I have my attributes A and B, but then I'm gonna have this list of all the attribute C values that correspond to the bar table.
So now when I wanna run a query, I know what portion of the tuple belongs to foo and what portion belongs to bar, and depending on what the query wants, I may read one or the other, right? Again, this speeds up joins, because instead of one page fetch for foo and one page fetch for bar when I'm joining them on the foreign key, it's a single page fetch to get all the data that I need. I will say that this makes queries potentially faster, but it can make updates more expensive, because if I run out of space for inlining the attribute C values, then I need to reorganize my pages. So, as in all cases in databases, what seems like a cool new idea is not new — it's almost like that South Park episode where they always say "the Simpsons did it." In databases it's always: IBM did it. IBM did this way back in the 1970s with System R, the first relational database they built. It turned out to be really difficult to maintain, and they abandoned the idea when they went off and built DB2, right? As I said, updates become more expensive because you have to reorganize things. But now, 30, 40 years later, a bunch of companies are trying to do this again because it makes joins go faster. Google does this in Spanner for storing protocol buffer data: they inline the dependent values from one table inside of another table, so again, you reduce the number of page fetches you have to do. There was a startup called Akiban from about ten years ago that was doing this for MySQL. They got bought up by FoundationDB, which then got bought up by Apple, and as far as I know they ended up abandoning this. This technique also looks a lot like how the JSON or document databases — RethinkDB, MongoDB, Couchbase — work, right?
Because you don't really define tables, you define these JSON documents, and you can pre-join related information inside a document so that you don't have to do separate fetches to go get it. At a high level this is the same thing as physical denormalization; it's the same technique IBM did in the 1970s. All right, we have a few minutes left, so I want to talk about one last thing: how we're going to keep track of where our tuples are. As I said before, every page has a page ID, and there's a page directory that says, if you want page 123, here's where to go get it. Now for tuples, we'll talk about indexes for finding them later on, but internally, different systems can represent individual tuples in different ways. These identifiers are usually called record IDs. The most common approach for a record ID is the page ID plus an offset or slot inside that page. Remember, if I want the fifth tuple in page 123, I know I have to find that page, and then my slot array tells me where to jump inside the page to find that individual tuple. Some systems actually expose this information to you at the application level. In Postgres it's called the CTID, and they store it in six bytes. In SQLite it's the row ID, which is eight bytes, and Oracle has its own ROWID that takes ten bytes. The Oracle one is bigger because it stores additional metadata about what table or file the tuple came from. But in practice, this is a logical number that you're not really supposed to use in your application because it can change at any time. So I want to give a quick demo of this so you can actually see them. We're going to run Postgres at the top and SQLite at the bottom, and I have one table.
I just made a simple table called R. Postgres at the top has three tuples, 101, 102, 103, and same thing in SQLite at the bottom. As I said, Postgres stores what's called the CTID, and again, this is an internal thing that you're not really supposed to use in your application because it can change at any time. Inside my query here I can say R.ctid, and Postgres knows I'm referring to its internal identifier. You can see it represents this as a pair of two numbers: the first number is the page, the second number is the offset in that page. Now let's say I do a delete and go back. It's gone, and what used to be the third tuple is still at position (0,3) because Postgres didn't reorganize anything. Now say I insert another tuple and do my read again. You can see that rather than reusing position (0,2), where the tuple I deleted was stored, Postgres put the new one in the fourth position; it's appending to the end of the slot array. We won't talk about the vacuum in detail just yet, but you can think of the vacuum as doing something like garbage collection in the JVM: it goes through, reorganizes every page, and compacts them to free up any space that isn't being used anymore. So now when I look at my table again, remember I had (0,1), (0,3), (0,4); now I have (0,1), (0,2), (0,3), because the vacuum reorganized the layout of the tuples inside the slots of the page and wrote it back out to disk, and now my CTIDs are different. Last year somebody asked whether you can write a query that says WHERE ctid = something, and Postgres is amazing and it lets you do it. But you're not supposed to do this. You're not supposed to use the CTID in any way in your application, because, as I showed, it can change at any time.
So in SQLite, down below: SQLite has an analog of the CTID that they call the row ID, and you can see it's stored as just 1, 2, 3. So it's not a page ID plus an offset; it's an internal sequence that SQLite maintains. Now if I do a delete and go back, the second one is gone, and if I do a new insert, it didn't fill in the hole; it just appended to the end. I don't remember whether SQLite guarantees that the row ID is always increasing and always stays with the tuple, but let's see: if I delete every single tuple, so there's nothing there, and then insert that same tuple back... it started over at one. So the row ID can change too, right? Yes? [Student asks whether the vacuum has to be run manually.] So the question is, in the case of Postgres, does this thing called the vacuum have to be run manually? In new versions of Postgres, you do not have to do this manually. We'll cover this when we talk about multi-version concurrency control. Basically, what Postgres is doing is, any time you update a tuple, it doesn't actually overwrite the existing tuple; it makes a new version. And it maintains an internal linked list that says, here's how to go get the version of the tuple that you want. The vacuum is the way to prune out the older versions that no one can see anymore. Let's see that again. I have three tuples. Let me first check the CTIDs; there they are. Now I'll run an update: UPDATE R SET val = 'yyy' WHERE id = 103. So we're going to update what showed up as the second tuple. I do my update, and... yeah, actually this is a good example of how the relational model is different.
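The SQLite half of this demo is easy to reproduce yourself. Here's a Python snippet using the standard sqlite3 module, with a made-up table r matching the one in the demo; it shows the row ID skipping the freed slot on insert, and then restarting after the table is emptied:

```python
import sqlite3

# Reproduce the SQLite demo: row IDs come from an increasing internal
# sequence, and without AUTOINCREMENT they can restart after deletes.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE r (id INT, val TEXT)")
db.executemany("INSERT INTO r VALUES (?, ?)",
               [(101, "a"), (102, "b"), (103, "c")])
print(db.execute("SELECT rowid, id FROM r").fetchall())
# rowids are 1, 2, 3

db.execute("DELETE FROM r WHERE id = 102")
db.execute("INSERT INTO r VALUES (104, 'd')")
rows = db.execute("SELECT rowid, id FROM r").fetchall()
print(rows)  # the new tuple got rowid 4, not the freed-up 2

db.execute("DELETE FROM r")  # empty the table entirely
db.execute("INSERT INTO r VALUES (101, 'a')")
fresh = db.execute("SELECT rowid, id FROM r").fetchall()
print(fresh)  # back to rowid 1: the identifier is not stable
```

Note that declaring the key column as INTEGER PRIMARY KEY, or adding AUTOINCREMENT, changes this behavior, which is part of why you shouldn't build application logic on row IDs.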
So before the update, when I did a select, that tuple appeared in the second position. After my update, it shows up as the third tuple, because again, the relational model is unordered. What happened was my update appended the new version of the tuple, so what was at (0,2) is now at (0,4). As I do my scan, I'm just reading tuples in based on their offsets, so the tuple in the last offset ends up showing up last in my output. Again, there's no notion of ordering in the relational model. The point I wanted to make here is that the tuple that was at position (0,2) is, after the update, now at (0,4). In the background, Postgres knows it has to run the vacuum. I think it's triggered by some combination of having changed the table enough, running out of space, and a periodic timer. Again, this is something a database administrator can tune for you. The vacuum can be an expensive operation, because it's essentially reading every page that has changed since the last time you ran the vacuum, reorganizing it, and writing it back out. You may not want to do that during the day while you're trying to keep up with the workload, so maybe you do less vacuuming during the day and more at night. That's something the administrator can control. All right, any questions about record IDs? Some systems will store the record ID directly in the tuple itself; I think SQLite does this, and Oracle does this. Other systems like Postgres don't store the record ID in the tuple because it's a waste of space: you can derive it from the page ID and the offset. All right, so what did we cover today? We covered how to organize the database into pages.
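Here's a toy model, in Python, of why the CTID moves on an update and changes again after a vacuum. This is a heavy simplification of what Postgres actually does (real pages have headers, transaction IDs, and visibility rules), but it captures the append-then-compact behavior from the demo:

```python
# Toy model (not Postgres's real format) of append-on-update + vacuum.
# An update marks the old version dead and appends a new version at the
# end of the slot array; vacuum later compacts the live versions,
# which is what reassigns the record IDs.

page = [("101", "aaa"), ("102", "bbb"), ("103", "ccc")]  # slots 1..3
live = [True, True, True]

def update(slot, new_val):
    # Don't overwrite in place: kill the old version, append the new one.
    live[slot] = False
    page.append((page[slot][0], new_val))
    live.append(True)

def vacuum():
    # Compact: keep only live versions; slot numbers change.
    global page, live
    page = [t for t, ok in zip(page, live) if ok]
    live = [True] * len(page)

update(2, "yyy")  # tuple 103's new version lands in the last slot
assert [t for t, ok in zip(page, live) if ok][-1] == ("103", "yyy")

vacuum()  # back to three slots; the record IDs have all shifted
assert len(page) == 3
```

A sequential scan of the live slots after the update reads tuple 103 last, which is exactly why it moved to the end of the query output in the demo.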
We covered how to track those pages, so that when the upper levels of the system say "give me this particular page," the storage manager knows how to find it. We talked about different ways to store those pages, and about how to organize the tuples inside of those pages. Next class, we'll talk about more complicated things that don't deviate from what we've covered here, but build upon it. We'll talk about how to represent the values inside the tuples themselves, going even further inside the page. And then we'll talk about different storage models. A storage model is a way to organize the data within a table. Everything I've shown you today is called a row store, meaning every single tuple is stored contiguously across all of its attributes: one tuple doesn't begin until the previous one finishes. Another way to do this is to organize things as columns, where all the values for a single attribute across multiple tuples are stored contiguously. The page layout machinery doesn't change; it's just how we represent the data inside those pages that changes. Okay?
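As a preview of that storage-model discussion, here's a minimal sketch of the row-store versus column-store layouts for the same three tuples; the (id, val) data is made up:

```python
# Row store vs. column store for the same three (id, val) tuples.
tuples = [(101, "a"), (102, "b"), (103, "c")]

# Row store: each tuple's attributes are contiguous, one tuple after
# another, like the slotted pages in this lecture.
row_store = []
for t in tuples:
    row_store.extend(t)  # 101, "a", 102, "b", 103, "c"

# Column store: all values of one attribute are contiguous.
col_store = {"id":  [t[0] for t in tuples],
             "val": [t[1] for t in tuples]}

assert row_store == [101, "a", 102, "b", 103, "c"]
assert col_store["id"] == [101, 102, 103]
```

Same data either way; only the physical arrangement inside the pages differs, which is the point the lecture closes on.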