So far in this course, we've talked about application-level things, right? We talked about the relational model, we talked about the normal forms, we talked about SQL. These are all the things that you, as an application programmer, have to deal with in your application. And now at this point in the course, we're switching over to talk about how you actually build a database system that's going to manage the database and do all the things we talked about before. How are we going to execute SQL? How are we going to load data into our database and make sure everything's safe? So we've sort of covered this. This is the course outline that I showed at the beginning of the class. We covered the first part, relational databases, and everyone should know at a high level what the relational model is. And now we're going down this outline and essentially talking about all the different parts you need to build in order to have something you could call a database system. And the way to think about a database system is that it's a bunch of layers on top of each other that provide different abstractions or APIs to the other parts, right? At the very top, you have query planning. This is where the SQL query comes in. You turn your query into a relational query plan tree, and then you execute it. The operators that execute the query need to read data, and that's your access methods. And those access methods need to read data from disk, and that's your buffer pool manager and disk manager. So the pipeline, or the hierarchy, of how queries execute goes from top to bottom in my diagram here. But we're going to start from the bottom and go up, right? Today we're really going to talk about how you actually store a database on disk and manage that.
So this course is focused on what I'll call a disk-oriented database system architecture. And the definition I'll give you for a disk-oriented system is one where the database system assumes that the primary storage location of the database is on disk, right? That means essentially all the different components of the database system have to account for the fact that at any given time, you could be trying to access data that's not in memory, and therefore it's on disk, and therefore you have to go copy it from disk and bring it into memory. So it's really about this trade-off: how can we create the appearance in our database system that everything's in memory, even though it's not? And the reason we have this issue is that if we store things in DRAM, in memory, it's volatile. If you pull the plug, you lose everything. The disk is non-volatile storage, meaning we can write stuff to it, if we're careful about it, and we can come back later after we restart the machine, or after someone trips over the power cord or whatever, and our data should still be there, all right? Disks aren't magical; you have to use them correctly, otherwise you can lose stuff, and you have to make sure you order your writes correctly. But that's the kind of thing we'll be covering in this course. So to understand this, we first want to discuss what the storage hierarchy looks like. And again, the way to think of this is going from the bottom to the top. At the bottom you have slower devices, but with much larger storage capacities. As you go up, you get faster storage devices, but with much, much smaller capacities. And implicit in this is that the things at the top are really, really expensive, and the things at the bottom are cheaper. So again, it's this trade-off of trying to decide how you manage the movement of data between the different layers.
So the dividing line that we're going to care about as we design our database system is right here where I'm showing, between DRAM and SSD. The things above this line are volatile storage devices. And again, I said that means if you write something to it and then you pull the plug on the power, everything goes away. The reason they're volatile is that the way DRAM and these other technologies work, you periodically have to give them power so that they retain the ones and zeroes you're storing in them. The things below that line are non-volatile storage. These are sort of passive devices that you can write to, and then if you pull the power, you can come back and everything's still there. But there are two other characteristics of these two classes of storage devices that we're going to care about, and they're going to come up throughout the semester in how we design the algorithms in different database components. The first is that the things at the top, the volatile storage devices, are byte addressable. Meaning you can address a single byte or a single cache line and grab individual elements in storage. Contrast that with the non-volatile storage devices: these are block addressable. You can't go get just a single thing; I mean, you can, but the hardware actually goes and gets a bunch of things all at once, a block or a page. So at the top I can access a single 64-bit integer. At the bottom, if I want a single 64-bit integer, I've got to go fetch a four-kilobyte page, then jump to the offset that has the data I'm looking for. The other important aspect, related to being byte addressable, is that the things at the top also support random access: I can go grab exactly the single things that I want.
Whereas at the bottom, these devices typically want you to do what's called sequential access. Meaning you want to grab a bunch of data that's all stored contiguously on the storage device, so that it's one fetch, or contiguous fetches, to go get the thing you need. This is mostly an issue on old spinning disk hard drives, because there's a mechanical arm that has to jump to different parts of the platter to start reading data off the spinning disk. So if you're doing random access and jumping to different locations on the platter, there's a physical movement of the arm: it swings around, jumps around, and has to copy out four-kilobyte blocks every single time you do a different seek. In DRAM, you don't really have to worry about this. SSDs are sort of a hybrid between the two: if you do sequential access within a four-kilobyte page, that's good, but beyond that you don't see as big an improvement as you do on a spinning disk hard drive. Now, the textbook might show you this same hierarchy, and at the bottom, instead of putting what I'll call network storage, they'll put magnetic tapes, right? This is the old technology of tape archives. Amazon will sell you a service called Amazon Glacier, which is basically the same kind of thing, but it's really for archival storage. If there's a complete disaster, you go back to your tape archives, and someone has to walk to the shelf, pull the tape off, and put it in the machine, right? It's a really long latency. By network storage here, I mean something like Amazon S3 or Amazon EBS. Underneath the covers, those services are using spinning disk hard drives or SSDs, but because there's a network latency to go get the data you need, they're going to be slower than these other devices. Okay, so for our purposes in this course, we're going to consider DRAM as just memory.
Any time I say memory in this course, I really just mean DRAM. And when I say disk, I mean any of these three things at the bottom: SSDs, spinning disk hard drives, or network storage. So again, for our purposes in this class, we're really focused on how we can manage the movement of data from disk up to memory. And we said the database system is going to assume the primary storage location of all the data is down here at the bottom, so it's really about copying things back and forth. Now, some advanced systems get clever and actually worry about how you keep things in CPU registers or CPU caches. We're not going to cover that at all here. Part of the reason is that if everything's on disk, then trying to be really smart about getting things into CPU caches is not going to be a huge win, because the disk is always going to be the slowest thing. In the class in the spring, we're going to assume the database is in memory, not on disk, and therefore managing the things at the top actually makes a big difference. But we won't talk about that much in this class. Another thing I just want to mention, which is coming but which we also won't cover in this class, is a new storage technology called non-volatile memory, which is this weird hybrid that spans between the two categories, right? Non-volatile memory is sometimes called phase-change memory, or memristors, or resistive RAM; there are a bunch of different technology names. The basic idea is that you get the best of both worlds: you'll be able to do random byte-level access like you would with DRAM, and be almost as fast as DRAM, but it will have the durability of an SSD or spinning disk hard drive. That's actually a big deal, and it really changes a lot about how you design a database system.
And this is actually what the research group has been working on for Joy's PhD thesis. For this course, we're not going to focus on this, because it doesn't actually exist yet; you can't go buy it online anywhere. Depending on who you talk to, it might actually come out in 2018. But in the meantime, DRAM is not going away, and SSDs and spinning disk hard drives are not going away, so we're not going to worry too much about it. At some point, maybe ten years from now, we might throw away everything we do in this course and teach it using non-volatile memory instead. All right, so the goal of what we're talking about today: because the database is assumed to be on disk, that means we can support databases that are larger than the amount of memory available to the system. If I have a one-gigabyte stick of DRAM in my machine, I can have a ten-gigabyte database. If you have an in-memory database, it has to fit in DRAM, so you couldn't go larger than that. So again, everything we'll be talking about for the next two lectures is really about how we read and write data on disk and manage the movement of data back and forth between disk and memory. Now, if I say we want the illusion of a system that has more memory than it actually has, what does that sound like? What's that? Say it louder, sorry. Virtual memory, exactly, right. The way virtual memory works is that you have this swap area on disk, and the operating system is allowed to move data back and forth between disk and memory to give the illusion that your system has more RAM than it actually does. So maybe this is something we want to use, right? Maybe we want to let the OS manage memory for our database system, since it can already do the same thing with virtual memory. Who thinks this is a good idea or a bad idea? Not a good idea; why? Less choice in your design space.
Right, so he says if you rely on the operating system, then you have less choice in your design space inside the database system. He's correct, but let's understand why. So who here knows what mmap is, the syscall? One, two, a couple. All right, mmap stands for memory-mapped files. The way it basically works is that it allows you to take a file that you have on disk, which is broken up into pages, and map it into the address space of your process. Then any time you jump around in memory and try to access something within the region of that file: if it's in memory, you just access it. If it's not in memory, you get a page fault, and the OS goes and fetches the page you're trying to access and brings it into memory. Right, so it sort of looks like this. Here we have an on-disk file, really simple, with four pages. And then I have my virtual memory and my physical memory. With virtual memory, I can allocate all four pages, even though there's actually nothing in them yet. But in physical memory, I only have two pages. So if my database system comes along and accesses a tuple in this page, it's not in memory, so the OS will go copy it, but it copies it into a physical memory location and then maps the virtual memory page to point to that physical memory page. So this is fine. Now I access page three, and the same thing happens; it goes and fetches the data I need. But now let's say I access page two. What happens here? Right, you have to replace one of the pages that's in physical memory. But which one? What does Linux do? What's that? Yeah, he said LRU. I actually looked it up yesterday; it's not LRU, it's LRU-like, it's clock, which we'll cover on Wednesday. But I think the more important thing is not just that it has to pick a victim to swap out. It's actually going to put your process to sleep.
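The mmap mechanics described above can be sketched with Python's standard-library `mmap` module. This is a minimal illustration, not anything from a real database system: we write a small four-page "database file", map it into the process's address space, and then touch a byte in the middle of it. The OS faults the containing page into memory transparently; the program has no way to tell whether the access was a cache hit or a multi-millisecond disk fetch, which is exactly the problem discussed below.

```python
import mmap
import os
import tempfile

# Build a tiny "database file" of four 4 KB pages on disk.
# Page i is filled with the byte value i so we can check what we read.
PAGE_SIZE = 4096
path = os.path.join(tempfile.mkdtemp(), "test.db")
with open(path, "wb") as f:
    for i in range(4):
        f.write(bytes([i]) * PAGE_SIZE)

# Map the whole file into our address space. No page is read yet;
# the OS lazily faults pages in the first time we touch them.
with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    # Touching the first byte of page 2 may trigger a page fault;
    # if so, the OS blocks this thread while it fetches the 4 KB page.
    # From here, that stall is completely invisible.
    value = mm[2 * PAGE_SIZE]
    mm.close()

print(value)  # 2: the fill byte of page 2
```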
And it does that for any page you touch that isn't in memory: when you try to touch a page that isn't there, the OS blocks your process because you hit a page fault, which is an interrupt. Then the operating system goes and fetches the page you need, and once it's in physical memory and the virtual memory mapping is set up, you get control of your process back, right? So for a regular application, say you're building Photoshop or something on your laptop, blocking your process is not a big deal. You've got to go to disk anyway; who cares? But in a database system, when we start talking about concurrency and all these other things, this is going to be a big, big problem. Because the process that got blocked when it hit a page fault may actually be holding locks on other tuples, or latches on other data structures, and now it's blocked for a long time. All these other threads running other queries and other transactions could be doing useful work, but they can't, because you're holding all the locks. So to give you an idea of how long these stalls can actually be, I think it's important to understand the access times for the different storage devices I showed in that hierarchy. The fastest storage you have is the L1 cache; a typical system can access it in half a nanosecond. Then as you go down: seven nanoseconds for L2, 100 nanoseconds for DRAM, and then the SSDs and spinning disk hard drives. Spinning disk hard drives are about the same whether it's a read or a write; for SSDs, reads are faster than writes. And then we get down to the network storage, and this is a range; Amazon roughly says 20 to 40 milliseconds for EBS. And then again, the example I said was the worst case is the tape archive, where somebody actually has to go put the tape into the machine.
They have robots to do these things, but even then, it's a mechanical process and it's really, really slow. Now, it's hard to wrap your head around these numbers when we're discussing them in terms of nanoseconds. But let's say that instead of nanoseconds, we talk in seconds. Now you can start to understand why stalling the thread while you go fetch something from disk is a really bad idea. Say one nanosecond now equals one second. If I have to go to my hard drive to get something, that's the equivalent, in this scaled-up time, of waiting something like 16 weeks. So fetching things from non-volatile storage is really, really slow, and we want to avoid it at all costs. But in some cases it's unavoidable; sometimes we do have to go to disk. And if we're going to be blocked for 16 weeks, ideally we want that thread to be able to do other stuff, or at least not hold locks and other things that block other threads from making forward progress. So now you may say, all right, maybe the way to solve this is to have multiple threads. Assume we built a strawman database system on mmap with only a single thread, the only thing reading and writing. Then yes, if it has to stall, nobody else is worried about the locks it holds. But that looks bad, because now any time you run a query and go get something from disk, there's a long pause while you fetch everything, and the system looks unresponsive. So you say, well, maybe I'll have multiple threads so I can mask all this. Okay, but now things get even more complicated, because at any point, you don't know whether something's in memory or not. That's the whole point of mmap: the operating system is hiding this from you.
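The "one nanosecond becomes one second" rescaling from a moment ago is easy to check numerically. The latencies below are the rough figures from the lecture; the SSD and HDD numbers are assumed typical values (about 150 microseconds and 10 milliseconds respectively), not measurements.

```python
# Rough access latencies, in nanoseconds (SSD/HDD values are assumptions).
latency_ns = {
    "L1 cache": 0.5,
    "L2 cache": 7,
    "DRAM": 100,
    "SSD read": 150_000,                  # ~150 microseconds
    "HDD seek": 10_000_000,               # ~10 milliseconds
    "Network storage (EBS)": 30_000_000,  # midpoint of 20-40 ms
}

# Rescale: pretend 1 nanosecond is 1 second, then express in weeks.
SECONDS_PER_WEEK = 7 * 24 * 3600

for device, ns in latency_ns.items():
    weeks = ns / SECONDS_PER_WEEK  # scaled seconds -> scaled weeks
    print(f"{device:22s} {ns:>12,.1f} ns  ->  {weeks:8.4f} scaled weeks")
```

Running this shows an L1 hit is half a scaled second while one HDD seek is about 16.5 scaled weeks, which is where the "waiting 16 weeks" figure comes from.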
It's unlike when you build your own buffer pool manager: when you go to access a page, you'll know whether that page is in memory or not, and based on that, you can make decisions about what that thread should actually do. It's hard to do that when you're using mmap, because the operating system is hiding everything from you. This is going to come up multiple times throughout the semester, and if I die, you can put it on my tombstone: the bottom line is that you never want to use mmap in your database system. There are rare occasions where it actually makes sense. We had the InfluxDB guy give a talk last Thursday, and he said, yeah, we use mmap. At first I thought that sounded like a bad idea, but when he told me how he's actually using it, it's okay, because they're not writing data through mmap, they're only reading data. In that case, it doesn't matter. If you're read-only, you avoid the normal problem; if you have to write things, then mmap is going to be a bad idea. And as was brought up when I asked earlier, by letting the operating system manage memory for us, we're giving up the opportunity to do a bunch of optimizations and to control how the system is actually going to work. I talked about the buffer replacement policy, how you decide which page to evict. But also, if you know you're going to have a query scanning a bunch of pages, maybe you can do prefetching and bring things into memory ahead of time, because you know you're going to keep going down the list. We care about thread and process scheduling. And when we talk about recovery and transactions, we're going to care about when we actually flush data to disk. In my example before, I just read things in and let the OS swap things in and out. But if I want to modify the file, I want to make sure that it actually makes it out to disk in a safe manner.
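The contrast just drawn can be made concrete with a toy buffer pool. Everything here is illustrative (the class and method names are invented for this sketch, not taken from any real system): the key point is that `get_page` returns whether the access was a hit, so the database system can react to a miss, whereas with mmap that information never surfaces.

```python
# A toy buffer pool that, unlike mmap, *knows* whether a page is resident.
class ToyBufferPool:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk     # page_id -> bytes; stands in for the file on disk
        self.frames = {}     # page_id -> bytes currently held in memory

    def get_page(self, page_id):
        """Return (page_bytes, hit). On a miss the DBMS can choose what
        to do: evict, prefetch neighbors, or schedule other work."""
        if page_id in self.frames:
            return self.frames[page_id], True
        if len(self.frames) >= self.capacity:
            # Simplistic eviction: drop an arbitrary page. Real systems
            # use a policy like LRU or clock, covered on Wednesday.
            self.frames.pop(next(iter(self.frames)))
        page = self.disk[page_id]
        self.frames[page_id] = page
        return page, False

disk = {i: bytes([i]) * 8 for i in range(4)}  # four tiny fake pages
pool = ToyBufferPool(capacity=2, disk=disk)
_, hit1 = pool.get_page(0)   # first access: a miss, page is fetched
_, hit2 = pool.get_page(0)   # second access: a hit, page is resident
print(hit1, hit2)            # False True
```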
This idea that the operating system is going to be a problem for us as a database system is not new. It goes back to a paper from 1980 written by Mike Stonebraker, where he talks about how the operating system gets in the way of a database system. The bottom line, the mantra if you're going to be a database system developer, is that the operating system is not your friend. It's your enemy, and it's always going to get in the way, so we try to avoid it as much as possible. That means we end up having to build operating-system-like things inside our database system: we have to build a thread scheduler, we have to build a buffer pool manager. But the database system always knows better than the operating system, because it knows exactly what your data looks like, and it knows exactly what queries you're trying to execute on it. So it can make the best decisions about all these things, not the operating system. Again: the operating system is not your friend. Okay, so given that setup, now we want to talk about how we actually store our database on disk. And there are essentially two problems we have to deal with. The first is: how are we going to represent the data in files on disk? And the second: how is the database system going to manage its memory and move data back and forth as needed, based on the queries you execute? For today's class, we're going to focus on the first problem. The second problem we'll cover on Wednesday, and it's also related to the first project we're giving you. In the project, we give you the source code that does the first part for you. So another important thing we need to address, which I mentioned earlier when we talked about the hierarchy, is that traditionally database systems make a big deal about preferring algorithms, data structures, and components that try to maximize the amount of sequential access to disk.
Again, this is because on a spinning disk hard drive, moving the arm is a mechanical thing, so if you can move it once and read a lot of data, you'll get much, much better performance. As we go along, you'll see that a lot of the algorithms we discuss care about this and make decisions to do this, all right? Allocating a bunch of contiguous pages together on a storage device is called an extent, and I think the POSIX API provides support for doing this. Now, as I said, if you have fast random access, whether with SSDs, non-volatile memory, or DRAM, then this is not entirely necessary. If you're interested in that kind of thing, it's what we'll focus on in the advanced class in the spring, 15-721. So again, the way to think about it is that this course covers the traditional disk-oriented database system. Most systems you think of when you think of a database system are like this: Postgres, Oracle, MySQL, DB2, SQL Server. All of these are the disk-oriented systems we're talking about today. The more modern, state-of-the-art systems would be the in-memory ones. All right, so today's agenda has four parts. First, we'll deal with how we organize our database as a bunch of pages and files. Then we'll talk about how you represent or organize the contents of those pages and keep track of where you have free space. Then we'll talk about how you lay out the tuples inside those pages. And then hopefully, if we have enough time, we'll finish with a more advanced topic that's not in the textbook: the different storage models you can have in your database. Okay, so in its most basic form, the database management system is essentially going to store the database as a bunch of files on disk, right? Some systems use a single file, like SQLite.
Some systems, most systems actually, store multiple files, and usually the layout of the database is inherent in the hierarchy of directories. This means the operating system doesn't know anything about our database files; they look like regular files, like any other file. Only the database management system is able to decipher their contents, extract meaning from them, and actually show you your data. So if you open a Postgres file or a MySQL file in a text editor, you're just going to see a bunch of bytes, because they don't mean anything on their own; you need the database system to interpret them. Now, because these are regular files in the file system, all the normal protections that the file system provides, like journaling and other things, will be applied to our database files. And that's another example of the OS getting in our way, because the database system is going to do its own journaling, its own logging, when it makes changes, so it can recover from them. But the operating system is essentially going to do the same thing, right, to make sure your files are okay. So you may think, all right, I just want to get rid of the file system entirely and make a specialized database file system that only my database system knows how to read and write. And I would say this is actually how the first database systems in the 1980s all did it. They said that whatever file system Unix or VAX provided, it was all crap, and what they did instead was allocate a big block of bytes directly from the storage device and build their own storage subsystem on top of that. But nobody does this anymore, because it's a lot of engineering, and although the file systems aren't great, they're good enough for what we want. Yes?
So his statement is: the operating system is journaling the metadata about the files, whereas the database is journaling or logging the actual contents. Correct. But the point I just want to make, as I said, is that there's no reason we have to use ext4 or ext3, those file systems. The earlier systems built their own file-system-like thing that did all that for them, so they didn't have this redundant logging and so on. But nobody does that anymore, because it's a huge pain, it makes your system not portable, and the performance gain you'll get from it is roughly 10%. So nobody builds their own file system; everyone just uses what's already there. It's good enough, right? Okay. So now the thing we're talking about here is usually called the storage manager. It's the component responsible for managing the database files on disk. And the way it organizes the data in those files is through pages. One more thing: sometimes I'll say blocks, sometimes I'll say pages; I essentially mean the same thing, some chunk of bytes stored in our file. The storage manager is responsible for keeping track of all the data that's read from and written to these pages. And it has to track the available space in these pages, so that if you need to write new data, it knows a location where it can actually store it. For now we're only going to focus on reading and writing tuples to our pages, but the same things we're talking about in this class will come up again when we talk about writing out index files, log files, and other things. To keep it simple, we're only talking about tuple pages, but there are other types of pages you have to maintain as well, and we want the same protections for them that we have for data pages. So a page, again, is essentially just a fixed-size block.
And it can contain any kind of data. Typically, as far as I know, most systems do not mix page types. So if you have one page that stores tuples, it stores only tuples; you won't sprinkle in log records. You keep those things separate, because it's easier to build the software to maintain them. Some systems also require that pages be self-contained. What I mean by that is that all the information you need about the data stored in a page has to be contained in that page. For example, say you were using a dictionary that maps string values to integers, so that instead of storing string values inside the page, you just store integers; a basic compression scheme. If the dictionary is stored on one page and the data on another, and somehow the hardware gets corrupted and that dictionary page gets blown away, then you have no way to restore the original data, because you don't know how to reverse the dictionary encoding. So systems like Oracle, for example, require that all the information you need to be able to recover a page is stored in that page itself. The reason they do this is that one of the worst things you have to do as a DBA is go look at a bunch of pages on a messed-up disk and try to recover any data you can. That means looking at raw bits to figure out what's in the page header and what's actually being stored. If everything is self-contained, the system becomes less brittle: you can lose one page, and it doesn't reverberate and blow away all the other pages. Now, every page is going to be given a unique identifier, and the database system is going to use an additional indirection layer to map a page to an actual physical location on disk. That could be some directory, it could be some file, but somehow there's an internal map.
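The self-contained-page idea from the dictionary example above can be sketched as follows. The layout here is purely illustrative (JSON is used only to keep the sketch short; a real system would use a binary page format): the point is that the dictionary travels inside the same page as the encoded data, so the page can be decoded even if every other page is lost.

```python
import json

def encode_page(values):
    """Dictionary-encode a list of strings into one self-contained page."""
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    # Store the dictionary *inside* the page, alongside the codes.
    return json.dumps({"dict": dictionary, "codes": codes}).encode()

def decode_page(page_bytes):
    """Recover the original values using only this page's own contents."""
    page = json.loads(page_bytes)
    reverse = {code: value for value, code in page["dict"].items()}
    return [reverse[c] for c in page["codes"]]

page = encode_page(["cmu", "mit", "cmu", "cmu"])
print(decode_page(page))  # ['cmu', 'mit', 'cmu', 'cmu']
```

If the dictionary were stored on a separate page instead, losing that one page would make every data page that referenced it undecodable; keeping it in-page trades some space for that robustness.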
It doesn't really matter what it is; something maps the page ID to a location on disk. Now, one thing we have to talk about is that there are different notions of pages when we're dealing with a database system. At the lowest level, in the actual storage device, you have what's called a hardware page. This varies per vendor, but usually it's about four kilobytes. And when I say hardware page, I mean that this is what the hardware can guarantee will be an atomic write to that storage device. Meaning if I write a 4K block, the drive will come back and say it wrote that 4K block, and if it says it's there, it's there. If I try to write an 8K block, it may only be able to write the first half and then crash, losing the second half, so it can't be an atomic write. Then we have operating system pages, which is how the OS maps things in memory; by default, Linux and Windows use 4KB pages. There are options to use larger pages, and the advantage of larger pages is that the TLB, the mapping from virtual memory to physical memory, can be much smaller: you can reference more memory with a smaller table. But by default they're 4K, and for our purposes in this class we're not going to mess with that; we can discuss it in the advanced class later. And then we have our database pages. The database page does not necessarily need to be a one-to-one mapping to the OS page or the hardware page; different database systems do different things. At the smallest end, you have SQLite, which uses one-kilobyte pages. Then Oracle and DB2 use 4K, SQL Server and Postgres use 8K, and MySQL uses 16K. And there's no right or wrong answer for what page size you should use.
But what I'll say is that if you're using a page size that's larger than what the hardware can guarantee to be an atomic write, then you have to do a bunch of extra work to account for that, because you don't want torn writes. In the case of MySQL, when we talk about logging, we'll see what it actually does when it flushes out dirty pages, pages that have been modified by transactions and queries, and wants to write them to the database on disk: it writes them sequentially to a doublewrite buffer first. Then, once that's durable, it goes and updates the actual database pages. It has to do this because it can't guarantee that a 16K write will be atomic. Okay, so now we should also note that every record, every tuple, in the database is going to have a record ID. This is essentially a unique identifier that the database system assigns to a tuple, which allows it to track the tuple in its internal bookkeeping mechanisms. The most common approach is to set the record ID to be a combination of the page ID and an offset or slot: the page ID tells you where on disk the page is, and the offset or slot tells you where to jump within the page to find the actual data you're looking for. Some systems, like Oracle, have really big row IDs, ten bytes, because they store a bunch of extra information: what file the tuple is located in, what directory it's in, what tablespace it's in; they embed all of this in the row ID. In some systems, record IDs are actually stored inline with the tuple itself; in other systems, they're simply derived. And it's important to note that although you can see this identifier in your application, you don't actually want to do anything with it, because it's not reliable.
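The page-ID-plus-slot record ID just described can be sketched as simple bit packing. The 48/16 bit split below is an assumption chosen for illustration; real systems pick their own widths (and, as noted, Oracle packs in even more fields).

```python
# Pack a (page_id, slot) pair into one 64-bit record ID.
# The 16-bit slot width is an illustrative assumption, not a standard.
SLOT_BITS = 16
SLOT_MASK = (1 << SLOT_BITS) - 1

def make_record_id(page_id, slot):
    """Combine page ID and slot number into a single integer."""
    return (page_id << SLOT_BITS) | slot

def split_record_id(rid):
    """Recover (page_id, slot): where the page is, and where to jump in it."""
    return rid >> SLOT_BITS, rid & SLOT_MASK

rid = make_record_id(page_id=3, slot=7)
print(split_record_id(rid))  # (3, 7)
```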
Because the relational model doesn't say anything about the position of tuples on physical storage. The database system can move any tuple around whenever it wants, and then all these row IDs will change. So I can give you a quick demo of what this looks like. And bear with me, cuz this is awesome, this is trying to run SSH inside of Windows, which is never fun. Nope, you guys never typed a password wrong. Okay, okay, all right, so this is Postgres. So I think I have a table here, R, right? So let me recreate this, drop table R, and then we'll create it. And then we'll insert a bunch of tuples. So now I do select star from R, and I have three tuples. So in Postgres, what you can do is you add in what they call the CTID, which is the tuple identifier, right? And then I'll do R dot star. So now in my output for this table, I see that I have this CTID. And now you see exactly what I talked about before, how it's a combination of the page ID and the offset, right? So the zero is the page number. And then the one, two, three is the offset for each tuple, right? So now let's say I delete a tuple, right? Delete from R where, I'll delete the middle guy, where ID equals two, right? And now let's say I insert a new tuple, insert into R values four and then x, x, x. So now, when I say, give me the internal ID for this tuple, I just deleted the second tuple, so does it go in position (0,2)? Where do you think it's going to be stored? (0,4), right? So internally Postgres knows there's a hole there, but it preferred to use the next slot. Now, we're not going to discuss what the vacuum is right now. But think of the vacuum as, I'm going to say disk defrag, but the defrag for the database system in Postgres. It's going to go through all our pages and prune out all the old stuff that's not needed anymore and possibly reorganize stuff. So now when I go back, right?
And I do that same query on the same data, now the row ID has changed, right? Now the third record is at two. And the fourth record, which used to be at four, is now at three. So this is an example of how the row ID will be exposed to you in your application, but the database system is allowed to do whatever it wants to reorganize things at any time. So you never want to build your application assuming that this thing is unique or always going to be the same. Cuz at any moment, it can move things around. And again, this is allowed in the relational model cuz the relational model doesn't say anything about how data is actually stored. Yes? So the question is, if you use the, you're talking about Postgres in particular? So if you use the Postgres vacuum, can you do what, sorry? The question is, if you use the vacuum, can you refer to the tuple by the exact address? Yeah, cuz there was one, three, four, and then after the vacuum they got reorganized to one, two, three. Yes, so who's "you"? The application or the database system? The application doesn't know anything. What about querying using the underlying ID? Oh, let's find out. So the question is, here I'm showing a select query where I'm just getting the CTID, and again, this is an internal thing that Postgres is maintaining. The question is, can I do my query now and say where CTID equals something? Yeah, let's see what happens. It's not gonna like this. Let's see whether I cast it correctly. I did it. Okay, so, but don't do that, right? We have the same thing also in SQLite. Right, I'll create the same table, load a bunch of data. And then in SQLite, they call it, I think, rowid, right? So it's the same thing, one, two, three. And then the question is, can we do where rowid equals one? Who says yes, who says no? Raise your hand yes, raise your hand no. All right, so again, I just wanna show you that when I defined my schema, I didn't say there was a rowid or CTID.
The database system is gonna do this internally for us, right? All right, let's see, okay. And something worked in Windows, that's great, okay. Okay, so now we wanna talk about how we're actually gonna manage our pages. And this is where the database systems are gonna differ a lot, more than just what the page size is and things like that. Internally, how they're gonna keep track of pages will be really different. So I'm gonna focus on the first one and the last one; the sorted file organization and the hashing file organization will come up, I think, next week when we start talking about index data structures and things like that. The heap file is sort of a generic approach to doing this. And then the log-structured file organization is sort of the wild card in all of this. I don't think it's in the textbook at all. But this is actually not a new idea, and it's showing up in a lot of newer systems. So I think it's worth discussing a little bit. And again, it'll deviate a lot from how we're talking about storing tuples in pages, but I think at a high level, you should be able to follow it. All right, so a heap file is an unordered collection of pages that the database system is gonna use to store things. And the only thing that the database system needs to guarantee with its heap file is that the other components of the system, the things that could be accessing data, just need to be able to scan through it and find all the tuples they need. It makes no guarantees about how things are sorted. It makes no guarantees that you can jump to a particular offset and always find the thing you're looking for. The main thing we're dealing with is that if I need page five, I know how to find it in my heap file. So there's two ways to represent this. The first is the linked list, which is sort of the straw man that every textbook discusses, but nobody actually does this. And then the more common one is the page directory.
So in the linked list approach, essentially, there'll be some special page in our heap file that's gonna keep track of two pointers. The first will be a pointer to a linked list called the data page list. These are the pages that are completely full of tuples; we can't store anything new in there. And the other pointer will be to the free page list, where there's at least one slot in each page where we could actually store data. And so that means that if you need to scan everything, you have to traverse both lists. They usually have pointers going in both directions because you need to walk back and forth. And if the free page list is empty, you just allocate new pages and append them to the end of that list. So what are two obvious problems with this? Seems really inefficient, cuz you're just walking through this linked list. So that's sort of obvious. The second problem is, where do we actually store the data for how much space is available in each page? Cuz the header page that I mentioned just has pointers, right? So if I need to store a tuple that's really big, I ideally wanna find a page that has enough space for me. So I need to embed that information in the actual data pages themselves. That means I have to walk that list every time to find a page that can handle the tuple I want to insert, right? And you have to store this inside of pages cuz you want all the same protections you have for regular data. Because if you keep additional metadata about the pages in an in-memory data structure, like how much free space is in each page, then when you crash, or you turn the system off and come back, now you need to scan through all your pages, count how much space they have, and repopulate that data structure, right? So a better alternative is called the page directory or the page table.
And essentially what this is is just a sort of hash table or directory that's gonna map page IDs to the actual locations in a file where they can be found. And then what you'll also do is embed the amount of free space that's available in each page in the page directory itself. So if you need to find space for something new, then you can look directly in the page directory and say, well, here's a page that can accommodate me, right? Again, it's important that the database system makes sure that the directory pages are always in sync with the data pages, right? Cuz you don't want the problem where I write something out to a data page, and therefore I use all its space, but then I haven't updated the free space in the directory page, and therefore when you come back, it's out of sync. So anytime we modify the contents of a data page in terms of changing how much space is allocated in it, we also need to update the directory page. And anytime we allocate a new data page, we have to update the directory page as well. It's not that big of a problem, because you don't create pages that often, right? If I insert a lot of tuples and they can all go to the same page, it's not like I need to allocate a new page for every single tuple most of the time. So flushing out the directory page every single time we update the number of entries is not that big of a deal. But I need to make sure that they're always in sync. So obviously, going forward in the class, we're gonna assume that we have a heap file using the page directory architecture. Yes? Are pages basically just locations in a file? The question is, are pages just locations in a file? In the most simple form, yes, all right? But they could be broken up into different files, right?
I think if you look at Postgres, for example, if you just look at /var/lib/postgres and look at all the directories, there are different files for the page directory, for indexes, for data pages, for different tables and everything, all right? But at its most simple form, yes, the page directory is like, it's this file, and here's the offset to go get it, all right? And this is why we have to have fixed-size pages, because that way we know how to do the arithmetic and jump to the location we want for a given offset. Cuz all the pages will be the same size. All right, so again, as I said, for this class going forward, we're gonna assume we're gonna use heap files with the page directory layout. Cuz again, most database systems are built like that. But I wanna discuss a really different one called the log-structured file organization that doesn't actually have tuples at all, per se, at the physical level. So instead of storing tuples in pages, in a log-structured system, we're actually gonna store only log records, right? So let's say I have a log file, and every single time I make a change in my database, I insert a record, delete a record, update a record, I'll append a new log record to the end of the file, right? And yes, you can organize the log file into a bunch of pages, right? But we're not actually storing raw tuples. And so anytime we have an insert, we just put the whole tuple in a log record. For a delete, we'll just mark the tuple as being deleted. And for an update, we only have to store the delta, the things that actually changed. So if I have 20 attributes and I only update one, I only have to store the change for that one attribute, right? So inserting data, or making changes to data, is really fast because you're just appending to the log file.
And this is another example of what I said before, where people design database systems in some ways to try to maximize the amount of sequential access to the storage device. In this case here, appending to the log file is by definition sequential, because you're appending new records and they're always being written out in sequential order. So what's an obvious challenge in a log-structured organization? Right, reading, exactly, right? So the way you have to read is, you essentially have to go backwards in the log and find the entry that you're looking for. So let's say I have a select query, or I want to find, give me the tuple where the ID equals one. I have to look at every single entry in reverse order and find the latest entry that corresponds to one. Now in this case here, I only have an insert for one. So as soon as I find that insert, then I know that I have the whole tuple. But it may be the case that I come across an update for that record first. And then I have to keep going, because I don't know what the rest of the values are until I find the insert record, right? So log-structured databases are really, really fast for writes, but potentially slower for reads. The way they can overcome this is that you can build indexes in memory that actually point to offsets in the log, so that if you know you're looking for a particular tuple, instead of having to do a brute-force scan to look at all the log entries, you just jump to the one that you're looking for. So we can use this index to jump to where ID equals one and not have to look at the other ones. And then another thing that they'll do is periodically compact the log. Basically just take all the log entries and flatten them down to just be the actual tuple itself, right? So this is much different than everything else we'll talk about in this course in terms of storing things on disk, because we don't really have tuples in the same sense as in the heap file.
But I would say this is actually becoming much more common now. So HBase and Cassandra use this, LevelDB was a log-structured database system built by Google, and then RocksDB is sort of the Facebook-optimized version of LevelDB. Yes? Would you use this log-structured file organization just for the logging part of a database that otherwise used a page directory? So the question is, could you use a log-structured file organization to just manage the logging component of a regular heap file database system? You could, but there's a bunch of things that it provides that you don't need. You don't need indexes to jump to locations in the log. Typically also, the logging scheme, the database storage on disk, and the concurrency control scheme are so tightly intertwined that it's really hard to say, all right, I'm going to take this piece from this system and that piece from that system. It's really hard to break these things apart. So essentially everybody just writes their own, right? We're going to talk about this later. But a lot of times when people do new database startups, they take Postgres and sort of try to rip out the pieces they don't need, because maybe you don't need all the other stuff. We tried this. It's really hard, because things are just so intertwined, right? So yeah, it wouldn't make sense to say, I'm going to build a new database system, but I'm going to use RocksDB for the logging. It doesn't make sense. RocksDB does way more than just that, right? It's a good question though. Okay, so I'll maybe talk a little bit more about the log-structured stuff later on in the semester. But I think it's good for you guys to be aware that it exists and how it differs from the sort of heap file approach that we're talking about.
And when we talk about logging and recovery later on, a lot of the same concepts will apply here. There's a bunch of extra metadata that I'm not showing in these log records that they'll have to maintain as well, and we have to maintain it also in a heap file system. Okay, so now we know how to design the files and keep track of our pages. The next question is, how are we actually going to store data in these pages, right? And for this, we're not going to talk about the log-structured stuff; this is only for the heap files. So every page is going to contain a header, and this header is going to contain metadata about its contents. For example, it'll contain the size of the page, and a checksum of its contents, so that when you come back, you know whether you actually have a valid page or not, which again the file system also does, but the database system has to do this as well. It can keep track of maybe what version of the system actually wrote to this page, so that if you upgrade your system and you update your page layout, you know how to convert it. There's also a bunch of extra stuff about transaction visibility, about whether the contents of this page are visible to other transactions or queries. That's more of a focus when we talk about concurrency control later on. And I already talked before about how some database systems, like Oracle, for example, want these pages to be self-contained, so everything you need to know to understand what's in a page has to end up being stored in the header. All right, so now the question is, how do we actually want to store tuples in this page? We have a page header, we're always going to have that. Now the question is, where do we actually put the tuples? So a simplistic straw man idea would be, let's just append new tuples from the top to the bottom.
Every single time we want to insert something new, right? So what's the problem with this? What's that? Deletion. Right, deletion, right? So say I delete this middle guy here, and now I want to insert a new tuple. Ideally, we want it to go there, because otherwise we'll just have a bunch of empty holes in our pages and we're wasting space. But how do we keep track of where these holes are? Well, maybe I have a bitmap that keeps track of which positions have free space, and I check the bitmap when I write into it, right? So there's ways to get around that. But what's another obvious problem with this? Exactly, yes. So in my example, these are all fixed-length. In the case here, when I deleted tuple two and I put tuple four in, it was the same size, it was fixed-length. But if I have a variable-length tuple, it may not fit. Tuple four may not fit in that slot, and I may have to put it at the end. Now I can't have this clean offset calculation where I know how to jump exactly to a particular tuple, because they're not always going to be the same size, right? We said we could do that with pages, because the pages are always going to be fixed-size, but clearly tuples can't be. So the most common technique people use to solve this is what are called slotted pages. The way to think of a slotted page is that we're going to have the header we had before, but now we're going to have this extra array at the top that corresponds to slots. And these slots are going to specify offsets into the page of where to find the tuple data. So the slot array will start at the top, and the fixed-length and variable-length tuples will start at the bottom, right? The slot array will grow in one direction, and then the tuple data will grow in the other direction.
And you can keep putting new tuples in your page as long as you still have space, right? Again, these are just pointers to the starting offset of each tuple. So in the example that I showed before with Postgres, where it had a record ID and then after I ran the vacuum it reorganized the page, it could then change that offset. The record ID is essentially referring to the slot there. In that particular example, it actually did a major reorganization, so the tuples got assigned new slot IDs. But in theory, I could reorganize the page, keep the same tuples in the same slots, and just change their offsets. And I wouldn't have to change any other part of the system, right? So this is just again adding another abstraction layer in the disk manager, where the other parts of the database system don't have to care exactly how to go find the tuple that you're looking for. They just know, I have a slot number, and the slot array will tell me where to find the tuple that I want, right? So again, it allows us to have indirection and reorganize things without having to break other parts of the system. All right, so now we understand how to store pages in files. We understand how to keep track of them. We understand what the pages are gonna look like. But now we gotta talk about what tuples look like, right? In the example I showed you before, the tuple is just a box, right? There's nothing there to say what's actually inside of it. So now we're gonna talk about what the tuple is actually gonna look like. Again, the way to think about a tuple at the implementation level inside of our database system is that it's just a sequence of bytes.
And then it's up to the database system to interpret those bytes based on the schema of the table, to know it's this type and it has this length, and to convert those bytes into the actual value that you're looking for, the human-readable value. So in every tuple, there's gonna be a header. Again, the page has a header, and the tuples have headers of their own. And the header's gonna keep track of things like visibility, for concurrency control, or whether it's been marked as deleted. It's also very common to keep a bitmap that says which attributes in the tuple are set to null, right? Cuz null is a weird value; you can't just set all ones to say that's null, cuz that's actually a valid value. So most systems store a separate bitmap, and if the bit for the attribute at a given offset is set to true, then we know that attribute is null. It's important to note here that in the header, we're not actually storing anything about the schema of our tuple, the types of its attributes, right? Because all that information, when we call create table, gets stored in the database system's catalogs, which we'll talk about later. It's this other location that keeps track of the metadata about what the data looks like. So we don't need to embed in every single tuple that you have this number of attributes and they're these types and they have these names. That's all in another location. In some of the NoSQL systems, when we talked about the column family data model, or JSON systems like MongoDB, they can't do that, cuz every tuple could have a different schema. They have to embed the schema in the header. In our case, we don't have to do that, because we have a rigid schema that's defined on the table. So now we're gonna talk about how we're actually gonna store the actual underlying bytes for every attribute.
So typically what happens is, whatever order you list your columns in when you call create table, that's the order the database system is gonna store them in within the tuple, right? So if I have columns A, B, C, D, E, then in the byte array for my tuple, it'll store them in that order. Now, there's no reason you have to do this, right? This is simply done for software engineering reasons, so that you don't have to do an extra indirection mapping every time to figure out where the attribute you're looking for is located. But again, the relational model doesn't say that you always have to do this, right? And in the case of an in-memory system, there are cases where you actually maybe wanna change the ordering of how things are physically stored so that you word-align the bytes for your tuple. In our case, we'll ignore that and just assume that everything is fine. So now for the individual attributes, the way they're gonna be represented is usually defined by what's called the IEEE 754 standard. Who here has heard of this? One, two, three, a few. All right, so there's a standard that's been out for a while that specifies how a processor represents things like floating point numbers in bits. So all the processors know how to write data out in memory that looks like this, and every database system just uses those representations. Like when you declare an int in C or C++, in memory it's stored as bits representing that integer, using the processor's native representation. So for most database systems, the bits will be stored exactly the same, right? Although obviously, depending on whether you're big-endian or little-endian, they may get reversed around. So integers, bigint, smallint, and tinyint use the native representation. And for varchars, usually the first couple of bytes specify the length of the varchar field, followed by its contents.
And then for dates, times, and timestamps, that can vary between different database systems. Usually they store it as, if it's a Unix system, the number of seconds, or microseconds, or milliseconds, since the Unix epoch. Sometimes there are more complicated structs for these kinds of things, but for our purposes, we don't care about that. The one I sort of want to focus on is how we're actually going to store decimals, because there's two types of decimals in databases. There are the IEEE 754 standard decimals, the floating point numbers. And then there are what are called fixed-point decimals that the database systems can support. So for floating point numbers, we'll just use the native types in C and C++, and then we're storing them exactly how it's specified in the standard, right? If you don't care about rounding errors, this is okay. If you do care about rounding errors, then you want to use what are called fixed-point numbers. I don't have time for a demo, but take my word for it. If you use fixed-point numbers, they're slower, but they're accurate. Floating point numbers are faster, but you can get rounding errors, right? And to give an example of what those look like: say I take a little simple C program and I run it on my laptop. What should be 0.3 as the output ends up being these bizarre things. And you get different things in different ways, right? You can try to take 0.3 and cast it to a floating point with printf, or you can actually do the arithmetic, and you end up getting wildly different numbers, right? This is because the IEEE 754 standard has to allow for rounding errors, because you can't store exact, precise decimal numbers in binary floating point on your processor. So the way we fix this in our database system is to use what are called fixed-precision numbers, where instead of storing a single native number, we'll store a bunch of metadata about what that number is, and then we know where the decimal point should be.
We know what's on the right-hand and left-hand side of the decimal point, and we know how to convert it. So the way to think about this is like storing a varchar, sort of a byte sequence, but we know how to interpret those bytes to get the exact value that we're looking for. To give you an example of what this looks like in Postgres: this is the actual Postgres code where they define the numeric type. They define it as a struct, and there's all this extra stuff that they're storing just for a single number, right? Four 32-bit integers, and then a pointer to a char array. So I'll just say that when you actually do arithmetic on numerics, these fixed-point numbers, it'll be much, much slower, but it's guaranteed to be accurate. And then depending on what you want, the database system will store those bytes in the tuple for you. So we're out of time, but the last thing I'll say is, how do we actually handle large values, right? Most database systems don't allow you to have a tuple size that exceeds the size of a page. So if you have really large attributes, then you don't store them directly in the page with the tuple; you store them in what are called overflow pages. Think of these as like extra pages we have in our database, where in the regular tuple in the regular pages, there'll be a pointer, some page ID and offset, to some location in an overflow page where we actually have the varchar data. In Postgres, this is called TOAST, and you get this for free anytime the thing you're trying to store is larger than two kilobytes. In MySQL, if the thing you're trying to store is larger than half the size of a page, so anything larger than eight kilobytes, it'll be stored in this separate thing, right? Yes? The question is, where would an overflow page be located? On disk? Where else, right? But you typically wouldn't store it in the exact same file as the data pages.
The question is, does this mean you have to read extra pages? Of course, yes. So we could do a demo of this, but we're out of time. If I store something really large and then I do a select and it's not in memory, it has to first go get the data page where the tuple is located, and it says, aha, I have something that's in an overflow page, and then it has to do another fetch to get the overflow page, right? I don't know whether this is still true, but MySQL used to have a problem where it couldn't share overflow pages across multiple tuples. So if you had two tuples and each needed overflow storage, you had to allocate separate overflow pages for each of them, even though they maybe could have fit in the same one. All right, so we're out of time, but as a preview of what I'll pick up on in Wednesday's class: just like before, when I said the relational model doesn't specify how things are ordered on disk, it also doesn't specify how we're actually going to store tuples at all. In all the examples that I showed before, and I'm showing in this class, all the attributes for a tuple are stored contiguously, one after another, right? You have all the data for one tuple, and then when that tuple ends, the next tuple starts. But there's nothing about the relational model that says that's the right way to do this, right? Cuz the relational model is just dealing with the logical level of the database. And the reason why this matters, which we'll pick up on next class, is that storing the data in contiguous order like that may not actually be the best thing to do for all workloads, right? And the preview is that maybe we actually want to store it as columns: all the data for a single attribute together in a single page, and then all the data for the next attribute stored in another page. So we'll stop now, and when we come back on Wednesday, we'll pick up with why this actually matters.
And then we'll also talk about buffer pool management and moving data back and forth between memory and disk, okay? Any questions? All right guys, thank you, see you on Wednesday.