All right, DJ2PL, thank you. I heard a wild rumor. What? I heard he used to date Larry Ellison's girlfriend. [crosstalk] All right, we'll come back to that later. Larry Ellison is the founder of Oracle. He is the fifth richest person in the world. He owns a Hawaiian island, all paid for by databases. That's why this class exists.

All right, administrative stuff. Homework one is due this coming Sunday, the 10th. Project zero is also due on the 10th. Who here has not started project zero? One, why not? Two, do you know C++? That's a no. Do you know C++? No, it's okay. You should really start, because if you know zero C++, it will be a struggle. And again, we're not doing this to torture you. It's really meant to show you what the rest of the course is going to be like in terms of the projects. If you don't know how to write C++ and how to debug it, you're going to have problems. printf or std::cout is not a debugging method; you want to use a debugger. And then project one will be released. It was supposed to go out today; it will probably come out Friday. That will be on the buffer pool, which we'll start teaching in a week and a half. Any questions about homework one or project zero?

All right, let's jump into the material. Last class, we spent time talking about SQL and all the modern things you can do with it: CTEs, lateral joins, nested queries, window functions, and so forth. Prior to that, we talked about the relational model and relational algebra. At this point, that's the logical view of the database system that we're going to mentally construct throughout the rest of the semester. We're not going to go back and discuss those things again, but we'll see how we need to know what SQL and the relational model look like in order to build the various parts of the system we'll be discussing going forward.
So as I said at the end of last class, from this point forward in the semester, up to around Thanksgiving, we're going to cover how to build a classic or canonical relational database management system.

Here's the outline going forward. We've already discussed what relational databases are. The next four topics are storage, execution, concurrency control, and recovery. These are the aspects you need to build a full-featured, safe, reliable database management system. We'll assume it's going to run on a single node, because that makes our life easier. Don't go distributed until you have to. Once we understand what a single-node database system looks like, we'll discuss how to expand this into distributed databases. And then at the end of the semester, we'll talk a little bit about what other additional features and optimizations we can apply, which will be a segue into the advanced class, if you want to take that in the spring.

The way to think about a database management system, and the way the course is laid out, is as a bunch of layers. The different layers provide different functionalities and capabilities for the database system, and each one exposes an API to whatever layer is above it. The topics we'll discuss are basically how to construct those layers, put them all together, and end up with a full-featured database management system.

The most simplistic viewpoint is that the application comes along and issues a SQL query. The string of text of the SQL query first gets parsed. Then we run it through the query optimizer. Below that, we start executing whatever the query plan is. There will be access methods to actually talk to the tables or indexes or whatever we're trying to access.
There will be a buffer pool manager to manage the memory for our database system. And at the lowest level, there will be a disk manager responsible for reading and writing data to disk. In today's class, we're going to start at the bottom of this stack and work our way up over the semester. When we get to something like concurrency control and recovery, that's going to permeate the entire system, so we'll have to come back and revisit all these layers. When we're running transactions, we need to know what's on disk, what we're accessing, how we're accessing it, and what queries we're executing. So storage and execution will get us through the entire stack, and then we'll touch it all over again when we talk about concurrency control and recovery. It's the same thing with distributed databases: we have to know about all these things to build a reliable, safe, distributed database system. Okay?

All right. For the system and methods we're going to discuss this semester, think of it conceptually: we're going to construct a database management system in our minds, and BusTub is one implementation of it. We're going to assume we're building what is called a disk-based database system, or a disk-oriented architecture. This is where the database management system itself assumes that the primary storage location of the database is some non-volatile disk. It could be an SSD or a spinning disk hard drive; if you're running in the cloud, it could be S3. But we're assuming it's disk-based. And all the things we're going to build in our database management system are really designed to coordinate or orchestrate the movement of data back and forth between disk and memory.
This should not be news for anyone. This is the classic von Neumann architecture, where the data is at rest on disk; we can't operate on it unless we bring it into memory, and then the CPU can do whatever it needs to do with it. So this is the overall theme of what we're trying to build. This is obviously going to be super hard, and if you're an application developer, you don't want to be doing this yourself in your application code. You want a database management system that knows how to do this reliably, safely, correctly, and efficiently for you.

The way to think about what storage looks like, from the perspective of a database management system, is in terms of a hierarchy. You might have seen this in other classes or textbooks. Going from the bottom to the top, the storage devices get faster, but smaller and more expensive. At the bottom layer, you have something like network storage; this would be EBS or S3. In the textbook, I think there's a layer below this for tape drives, but nobody runs database systems off those anymore. As you go up from network storage, you have a locally attached spinning disk hard drive, then maybe an SSD, then DRAM, then CPU caches, and finally CPU registers. CPU registers are the fastest kind of storage you can have, but you have maybe 32 registers on your CPU, each about 64 bits. You can't store a lot in there, but they're super fast.

From our perspective this semester, the only thing we really care about is this dividing line between volatile and non-volatile storage. What does volatile mean? Yeah, it means the data is gone when the power is gone.
If you pull the plug on your DRAM or your CPU registers, everything gets wiped out; when you boot the system back up, nothing is still there. Non-volatile basically means that if you write data to the device, and assuming you make the right calls to tell it to flush, or you get back an acknowledgement (we'll cover that later), then we assume the data is going to be persistent. No matter if we restart the system, pull the power, or take the machine out and put it in another location, when we come back, our data will be there. Of course, in database systems we don't trust the hardware, we don't trust the OS (we'll get to that in a second), we don't trust any of this. So we're going to do a bunch of extra stuff to make sure that if we write something, maybe we write it to multiple locations or write a backup of it, a bunch of things to make sure we truly get non-volatile storage. But for the design of the architecture itself, we'll just assume volatile versus non-volatile.

Another key difference between volatile and non-volatile storage is how we can access the data. Volatile storage is considered byte addressable. What does that mean? He says you can access each individual byte. Correct. Say I have one megabyte in memory; if I want to get 64 bits at some random offset, I can do that. There are cache lines, so it's not exactly true, but for now we can ignore that. Non-volatile storage, like an SSD, is different: you can't go get exactly 64 bits out of a one-megabyte file. You've got to go get the block that those 64 bits are in, bring that into memory, and then do whatever you need with it. You can only address blocks, not individual bytes.
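To make block addressability concrete, here's a small C++ sketch. The 4 KB block size and the function names are illustrative assumptions, not taken from any real system: the point is just that to read even 64 bits at some byte offset on a block device, you have to compute and fetch the whole enclosing block.

```cpp
#include <cstddef>

// Hypothetical block size: 4 KB is a common choice for disk-oriented systems.
constexpr size_t kBlockSize = 4096;

// Which block does a byte offset fall in? A block device can only transfer
// whole blocks, so reading 64 bits at `offset` means fetching this block.
constexpr size_t BlockIdFor(size_t offset) { return offset / kBlockSize; }

// Where inside that block the requested bytes start.
constexpr size_t OffsetInBlock(size_t offset) { return offset % kBlockSize; }

// Byte offset in the file where the transfer of the enclosing block begins.
constexpr size_t BlockStart(size_t offset) {
  return BlockIdFor(offset) * kBlockSize;
}
```

So a request for 8 bytes at offset 5000 turns into "fetch block 1 (bytes 4096 to 8191), then look at offset 904 inside it."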
The reason this matters is that there are certain algorithms we'll choose in the design of our system because we know we're fetching blocks instead of single bytes at byte offsets, and therefore we'll pick an algorithm that is better for block-addressable data. Related to this, we're also going to choose algorithms that maximize the amount of sequential access to our data. What do I mean by this? Yes. He said accessing adjacent blocks is cheaper, so we want to do that as much as possible.

So say I want to get 10 megabytes, broken up into one-megabyte blocks. If those one-megabyte blocks are scattered in different locations, it's random access to jump to those different locations and get the data. Alternatively, if it's all contiguous, then in theory I can do one fetch command (I'm not saying what the device is) to go get those ten one-megabyte blocks, and that's going to be way more efficient. Think of it like downloading something from the internet: if I can open a single connection and get all the data I need, rather than opening up several different connections, the one fetch is going to be faster. At the hardware level, think of a spinning disk hard drive. Laptops don't come with these anymore, but they still exist in the enterprise world. There's a physical arm moving over a spinning platter, like a vinyl record that old people have. If you have to move the arm to get data, that's a physical action, and it's expensive; you're actually moving something through helium, moving the arm across the platter. So if I can move the arm once and then read a bunch of data without moving it again, that's sequential access, and it's going to be faster. If I have to pick the arm up and move it over and over again, that's going to be much slower.
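One way a system might exploit this, sketched in C++ (the `CoalesceRuns` name and interface are hypothetical, not from any particular system): take a sorted list of block ids to read and merge adjacent ids into contiguous runs, so each run becomes one sequential fetch instead of one random seek per block.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Merge a sorted list of block ids into (start_block, count) runs.
// Each run can be fetched with one sequential request; each gap between
// runs would cost a seek (or a separate request) on the device.
std::vector<std::pair<uint64_t, uint64_t>>
CoalesceRuns(const std::vector<uint64_t> &block_ids) {
  std::vector<std::pair<uint64_t, uint64_t>> runs;
  for (uint64_t id : block_ids) {
    if (!runs.empty() && runs.back().first + runs.back().second == id) {
      runs.back().second++;  // adjacent block: extend the current run
    } else {
      runs.push_back({id, 1});  // gap: start a new run
    }
  }
  return runs;
}
```

Reading blocks {1, 2, 3, 7, 8, 20} becomes three requests instead of six: a run of three starting at block 1, a run of two starting at block 7, and a single block 20.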
Again, this will come up in the execution algorithms: there will be certain algorithms we choose because they maximize sequential access, and we'll prefer those over something that's more random access. This is different from how you may think about algorithms in intro classes, because in that world, they assume every memory access costs the same. In our world, we're dealing with real hardware, so we can't make that assumption.

So in this class, when I say memory, I'll just mean DRAM. And when I say disk, I'll mean anything below that: an SSD, a spinning disk hard drive, or network cloud storage. The levels up top, the CPU caches and registers, we won't discuss in this class. In the advanced class, we'll talk about different algorithms and methods to try to maximize the amount of processing we can do on data in CPU registers or the L2 and L3 caches, and in that world it can make a huge difference.

It's also worth noting that there is some emerging hardware available today that spans different layers. You can get fast network storage, disaggregated storage, or disaggregated memory, where it potentially looks byte addressable, but you're going over a physical network, so it's a little bit slower; it straddles the layers in between. And then there was something called persistent memory that people had been dreaming about for a long time: it would have the byte addressability of DRAM but the persistence of an SSD. It would sit in a DIMM slot so you could write to it as if it were memory, but if you pull the plug, everything gets retained. When I first started at CMU 10 years ago, we were spending a lot of time researching this. It was something that was very interesting to us.
If we had this persistent memory, basically all the stuff I'm going to talk about in two weeks, and project one in this class, goes away, because I don't need to worry about moving things in and out of disk; all my memory is persistent. Does anybody know if anyone actually managed to make this persistent, non-volatile memory? Yes? [Student suggests emulating it with an SSD or a hard drive, or something like Intel Optane.] Boom. There you go.

So he said you could try this with an SSD. Yes, people do that, but that's not true persistent memory. And then you said Intel Optane. Intel Optane was not just an SSD; it was actually phase-change memory. It was a real physical device that was persistent memory. HP had memristors, and IBM was rumored to have something, but Intel was the only one that actually made this. Who here has heard of Intel Optane? Well, he obviously has. Very, very few. It's already dead. Intel killed it last year. Basically, Intel hired a new CEO, and they cut a bunch of divisions, and unfortunately they cut this. And this sucks, because to me this would have been a game changer, but Intel couldn't make any money off of it. What also sucks is that now no one's going to try this for another decade, because if Intel couldn't make money off of it, who will? There were various projects at different database companies trying to build database systems just around persistent memory, because again, a bunch of the stuff we have to do to move data back and forth between disk and memory goes away with it. So that's a shame.

Okay. So the reason we have to be cognizant of what the storage looks like is that the performance characteristics of these different devices are dramatically different.
We're going to try to maximize the amount of work we can do on data while it's in memory. When we bring something off disk into memory, we want to do as much work as we can on that data before we throw it away and bring something else in. In an ideal world, our database would fit entirely in memory (even then, you still have to write out to disk), but in some cases that's not possible.

Here's how to think about it. An L1 cache reference is about one nanosecond. Getting data out of DRAM is about 100 nanoseconds. An SSD is about 16 microseconds, or 16,000 nanoseconds, which is actually pretty good. A spinning disk hard drive is about 2 million nanoseconds. EBS can fluctuate: sometimes you get 50 milliseconds, sometimes 500 milliseconds, depending on how hot the data is. And then tape archives, again, this is glacial; you don't want to build any system off of this. This particular data comes from a Berkeley website (there's a link there) that shows you the trends in hardware speeds over time. The table has been attributed to Jeff Dean from the early 2000s, though I think it might predate him.

As humans, it's hard for us to reason about nanoseconds. One nanosecond, what does that actually mean? Or 2 million nanoseconds, is that a long time? There's a simple trick to realize how bad this actually is, how much slower things actually get: just change one nanosecond to one second. This is a trick Jim Gray used to do in the 90s. If you change one nanosecond to one second, now you see how massively slower these other devices are, and you see why you want to keep everything in memory as much as possible. Think of it like reading a page from a book: an L1 cache miss would be like me walking to this table and looking in the book.
Reading from DRAM or an SSD would be like walking over to the library and then finding the book. But reading from a tape archive would take 31 years; that's equivalent to flying to Pluto and then reading one book. So you want to avoid all of this as much as possible. Okay?

Sequential versus random access: we've covered this already, but it's going to be a recurring theme throughout the entire semester. The database system will prefer sequential access over random access, for both reads and writes. With spinning disk hard drives, it makes a huge difference. But even on SSDs, because of how they actually work under the covers, with their controllers doing garbage collection and compaction and so forth, you're better off doing batched reads and writes sequentially as much as possible.

The other system design goal we have is that we want to give the illusion that we're operating on a database that sits entirely in memory. Most databases aren't that big; most are less than 10 gigabytes. But for really massive databases, in the terabytes or petabytes, you ideally want to give the appearance that everything is already in memory, even though it actually isn't. There are tricks we can do to hide the disk stalls and so forth. And since reading and writing to disk is so expensive, we want to use a bunch of other tricks in the design of our system to avoid prolonged stalls or having the system appear unresponsive. For one, that frustrates the application or the user, because they think the system is stuck when you're really just fetching things from disk.
But stalls also cause other problems: if we're holding a lock on something and we stall because we have to read something from disk or write something out to disk, that's going to slow down everybody else behind us and create a convoy effect. So for these reasons, we want to avoid stalls as much as possible. And again, because random access is slower than sequential access, we want to maximize sequential access.

So what does this all sound like, having the appearance of more memory than we actually have? Virtual memory. All right, we'll get to this in a second. I'll explain why we don't want to use virtual memory from the OS, and why, as database system developers and engineers building the actual system, we always want to do as much as we can ourselves and not rely on the OS to do anything.

So here's our diagram. This is what we're building. We have some database file, or files, plural, it doesn't matter, we can discuss the differences, that's on disk, and we're going to break it up into pages. I'll describe what a page is in a second. There will be some directory that says: here are the pages I have, and here's the offset where you can find each one. And then there will be some buffer pool of memory that the database system has allocated; it basically called malloc against the OS and got some memory. We'll use that as the staging area where we bring pages in from disk.

Now say the execution engine, the thing that's going to run our query, comes along, and it wants to get page number two. We can ignore how it knows it wants page number two for now; assume that's what it wants. The very first thing we've got to do is bring in the page directory, because that's going to tell us where the pages are on disk. Then it'll make a call to the OS, or whatever device is storing the database file.
And it'll bring that page into memory. The buffer pool then gives the execution engine back a pointer, a 64-bit pointer in memory, to where this page exists. Now it's up to the execution engine, or the access methods, or the operators, to interpret what's inside that page, because all of that is opaque to the rest of the system. (That's not entirely true, but at this point we can assume the rest of the system doesn't know what's inside.) Then let's say it wants to do a bunch of updates: it makes changes to whatever is in page number two. I'm not saying whether it's a tuple or an index; it doesn't matter. And now the database system is responsible for writing this back out to disk to make sure any changes are persistent.

This is effectively where we're going for the next three or four lectures; this is the architecture we're going to be building. We'll discuss what pages look like in lectures three, four, and five. We'll discuss how to write things out to disk and how to manage memory in lecture six. And then we'll discuss how to execute the queries up top in lectures 12 and 13. Okay? Our focus today is on what these things on disk actually are.

So, as I said before, what does this sound like? Everyone said virtual memory. And you might say, okay, well, why do any of this? You need this lecture regardless, but why take the two lectures after it to talk about how to manage memory back and forth from disk when the OS can do that for us? Does anyone know the syscall you would use to use virtual memory in this way? mmap. Beautiful. Excellent. The answer is mmap, memory-mapped files. This is in the POSIX standard; Windows has its own version of it. It allows you to take the contents of a file that's on disk and map it into the virtual memory of your process, into its address space.
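Stepping back to the page-fetch flow just described, here's a toy C++ sketch of it. The `Page` and `BufferPool` types here are simplified stand-ins I made up for illustration, not BusTub's actual interfaces: the execution engine asks for a page id, and on a miss the buffer pool copies the page out of a (fake, in-memory) disk file and hands back a pointer to the in-memory frame.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

constexpr size_t kPageSize = 4096;  // illustrative fixed page size

struct Page {
  uint64_t page_id;
  bool is_dirty = false;       // set when the engine modifies the page;
                               // dirty pages must be written back to disk
  char data[kPageSize] = {};   // raw bytes; their layout is opaque here
};

class BufferPool {
 public:
  // Return an in-memory pointer for `page_id`, loading the page on a miss.
  // `disk_file` stands in for the on-disk database file.
  Page *FetchPage(uint64_t page_id, const std::vector<char> &disk_file) {
    auto it = pages_.find(page_id);
    if (it != pages_.end()) return &it->second;  // hit: already in memory
    Page &p = pages_[page_id];                   // miss: allocate a frame
    p.page_id = page_id;
    std::memcpy(p.data, disk_file.data() + page_id * kPageSize, kPageSize);
    return &p;
  }

 private:
  std::unordered_map<uint64_t, Page> pages_;  // page table: id -> frame
};
```

A real buffer pool also has a fixed number of frames and an eviction policy; this sketch just grows forever, which is exactly the part lecture six fills in.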
Now the process can jump to any offset in that address space in memory, and the OS is responsible for deciding: is the thing you need in memory or not? If not, it goes and fetches the page you need and brings it into memory. So the database system is not doing any of the reads and writes itself; it just mmaps the file, and the OS does all the management of moving the data back and forth for us.

It looks like this. We have an on-disk file with a bunch of pages. We call mmap to open it up, and then we have the concepts of virtual memory and physical memory. Virtual memory is what I see in my process address space; at some starting location, I'll get the mmapped file. That virtual memory has to be backed by physical pages, so as I touch a page, the OS has to put it into some space in physical memory and then update the mapping in the virtual memory table.

So let's say my process wants to touch page one. We get a page fault, because the OS recognizes it doesn't have page one in physical memory. It goes out to the disk, fetches it for me, puts it in, and updates the virtual memory mapping to point to it. Then my process can touch it and do whatever it wants with it. Same thing if I want page three: it's not in there, I get a page fault, the OS blocks my process when I access it, fetches the page I need from disk, updates the mapping, and then my thread can start running again.

Now, what happens if I want to touch page two? What's that? You have to get rid of one of the others, right? But what happens while the OS is getting rid of it? My process stalls. The OS is going to block me while it says: okay, I don't have any more physical memory; of page one and page three, which one should I throw away?
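The mmap behavior just described can be seen with a few lines of POSIX code. This is a minimal C++ demonstration for Linux/macOS (the file path, payload, and helper name are arbitrary choices for the demo): we write a small file, map it read-only into the address space, and read a byte through the mapping. The first touch of a mapped page is exactly where the page fault / fetch cycle happens, invisibly to the program.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <string>

// Create a small file at `path`, mmap it, and read the byte at `offset`
// through the mapping. Returns '\0' on any failure.
char ReadByteViaMmap(const std::string &path, size_t offset) {
  const char payload[] = "hello, mmap";  // 12 bytes including the NUL
  int fd = open(path.c_str(), O_CREAT | O_RDWR | O_TRUNC, 0644);
  if (fd < 0) return '\0';
  if (write(fd, payload, sizeof(payload)) !=
      static_cast<ssize_t>(sizeof(payload))) {
    close(fd);
    return '\0';
  }

  // Map the file; from here on, the OS owns moving its pages in and out.
  void *base = mmap(nullptr, sizeof(payload), PROT_READ, MAP_SHARED, fd, 0);
  if (base == MAP_FAILED) {
    close(fd);
    return '\0';
  }

  // This dereference may trigger a page fault that the OS services for us.
  char c = static_cast<const char *>(base)[offset];

  munmap(base, sizeof(payload));
  close(fd);
  return c;
}
```

Notice there is no read() call anywhere: that convenience is exactly the trap, because it also means there is no place for the database system to intercede in scheduling, eviction, or write ordering.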
The OS has its own internal statistics about how these pages are being accessed, and it's going to make a decision about which page to evict. But it doesn't know anything about what we're doing inside the database system, because it only sees reads and writes at a coarse grain. It doesn't know what the queries are. It doesn't know what's in these data pages or these files. So the OS is going to try to make a decision on its own about how to swap things out. And that's just eviction; there's a whole bunch of other problems we're going to face if we rely on the OS to do this for us.

In my first example, I showed one thread, or one process, accessing the data. But we're trying to build a modern database system that can take advantage of multiple cores or multiple CPUs, so we need to have multiple threads accessing these pages. Now what if one of them touches something and writes it, and then another thread tries to read it but gets stalled because the page got evicted? The ordering can get kind of screwy, and again, the operating system doesn't know anything about what's running inside the system at the same time. If everything is read-only, it's okay, because we're not dirtying any pages; the OS can swap things out and it'll be good enough. And there are some database systems that do use mmap just for the read-only parts of the system. But if you need to have multiple writers, which in a real system we will, then there's a bunch of other problems.

The first one is transaction safety. This is the one I already mentioned: if I have a transaction that updates multiple pages, I need to make sure these pages are actually written out in the right order. The OS doesn't know that, because it just sees dirty pages. It doesn't know whether this page needs to be written before that other page.
You can do some things like pinning the page with mlock, but that only prevents the OS from swapping it out; it doesn't prevent the OS from writing it out. So the OS may write out a dirty page that it shouldn't have; I crash and come back, and now I have changes on disk that shouldn't have been written, and I have to figure out how to reverse that or deal with the inconsistent data.

We've already talked about stalls. If you try to access something that's not in memory, you get a major page fault, which blocks your thread and deschedules it. The OS goes and gets your page, brings it in, and when it's available, your thread gets rescheduled. But in the meantime, your thread is blocked and doing nothing, and there may be other queries you could be running while you're waiting for that thing to get fetched from disk. So then you say, okay, maybe I'll build a disk dispatcher or scheduler, so there's only one thread that fetches things, and if there's a page fault, that thread gets blocked and I can run another one. But now you're building more infrastructure around the limitations of mmap.

The next problem is how you handle errors. With mmap, if I try to access a page that for whatever reason is corrupted or not available, some hardware problem, you don't get an error as you would in user-space code. You get a SIGBUS signal. Now you need a signal handler throughout the rest of your system, because you may be in a critical section that you don't want to get interrupted or broken, so you have to be able to get back to whatever you were doing before when you handle the signal. This is the only way the operating system tells you things are going bad: you don't get back an error code, you get a signal.
That's a lot of engineering to handle this. You don't want to do it. And then there are the performance issues. The operating system has its own internal data structures about what's in memory and what's getting scheduled, and it has to protect the critical sections inside those data structures with mutexes or whatever, and that becomes a contention point. Whereas in a database system, since we know what the queries are trying to do, because SQL is declarative and we have the physical plan, we know what data the queries are trying to access, and therefore we're in a better position to decide how to schedule things.

Yes. The question is whether database systems have the privilege of working with the hardware directly. There are methods called kernel bypass, and NVMe is sort of like this, where there are ways to make calls to hardware without having to rely on the OS. The problem with those is that you basically have to re-implement a bunch of the OS inside the database system. As a tangent: if you don't want to use the OS's TCP stack for networking, there's something called DPDK from Intel, the Data Plane Development Kit. Basically, it's a way to hook directly into the hardware, and you get raw packets out. But again, they're raw packets: if it's a TCP connection, now you've got to keep track of the TCP headers and all that. Very few database systems do this; I only know of two. One is Yellowbrick; we'll discuss them throughout the semester. They do a bunch of amazing stuff. They basically only use the OS to turn the machine on and never call malloc again; they allocate all the memory at the beginning. They wrote their own PCIe drivers. They do a bunch of amazing engineering. Few people do that.
The other was ScyllaDB, but DPDK was so difficult to handle, such a huge pain, that they don't do it anymore. So the answer to the question is: for some things, you do have to go through the OS. In the 80s, people got really crazy, and instead of using a file system, they wrote their own storage layer on top of raw block devices. People have tried this over the years. From an engineering perspective, oftentimes you do have to rely on the OS, but you want to minimize your contact with it, because the OS is going to be your enemy. Other questions?

All right, I won't spend too much time on mmap. Suffice it to say, it's a bad idea; don't do it. If I die, you can put it on my tombstone: never use mmap for your database. The reason I bring this up is that we invite a lot of these database companies and startups to come give talks at CMU, and surprisingly, over the last couple of years, a lot of them have mentioned they're using mmap. We ask them why, and they say, oh, because it's quick and easy to use. Then when we talk to them a few years later: oh yeah, that was a huge mistake. We should not have used mmap. We should have done what lecture six is going to teach you.

So here's a listing of some systems that I know use mmap. There are a couple of others I'm sure I'm missing, plus a bunch of hobby projects. The ones at the top are full users: systems that entirely use mmap to manage all data movement back and forth between disk and memory. The most famous of these is probably Elastic. MonetDB was a column store out of CWI, the same place that built DuckDB. The LMDB guy is probably the exact opposite of me: where I'm saying never use mmap, he says always, always use mmap. He's been banned from a bunch of different database mailing lists because he would email them and say, you guys should be using LMDB, you should be using mmap. He's wrong. He's weird.
And here's the ones at the bottom that were sort of partially using it. Actually, I would put Mongo as a full user. They should really be at the top. But the ones at the bottom here, they all got rid of it. Because of all the issues that I mentioned before, you can get something up and running pretty quickly if you use mmap, because you don't have to build your own buffer pool manager, but because you're relying on the OS to move data back and forth, it's gonna make horrible decisions, right? I don't wanna pick on Mongo, but Mongo is the best example of this. Mongo started off with an mmap-based storage engine, storage manager, and they were the hot database in the 2010s. They raised a ton of investor money and they had really good engineers. And they were betting at the time on mmap. If mmap was the right choice, then they could have made that work. But what did they do? They threw it all away and they bought WiredTiger, which is a storage manager that doesn't use mmap, right? So again, just to reiterate: we don't want to use the OS to manage memory, because the database system is always gonna be in a better position to do this. And it's not just for managing memory, it's basically for everything. We don't want to use the OS for scheduling, we don't want to use the OS for caching writes, and the network stuff you sort of need in some cases, but there's tricks for that. But the OS is always gonna be a problem. And again, for mmap, I'll send this link on Piazza, we wrote a paper about this and why you don't want to use it, and then there's a 10 minute YouTube cartoon about all the problems, okay? So again, the OS is gonna be a problem for us and we have to design our system to deal with it. All right, so for database storage, there's two problems we gotta deal with. How are we gonna represent the database in these files on disk?
And then once we have those, how do we move data back and forth from disk into memory? So today's lecture is gonna be on the first problem here, and then we'll cover the second problem in the upcoming lectures, okay? All right, so there's gonna be sort of three layers of what data's gonna look like on these disk pages. So the first question is, what do these files actually look like? And then within a file, there's gonna be pages, because we're gonna have to break it up into different chunks, and then we'll discuss what these pages look like. And then within a page, we're gonna have tuples, the data itself of the tables; we can ignore indexes for now. And you have to decide what do those tuples actually look like? So we're gonna sort of start at the top and then go deeper and deeper inside of these files to understand what they actually contain. So as I said before, the database system is gonna maintain a database as one or more files on disk. SQLite, DuckDB, those are single-file databases. All the enterprise databases, or Postgres, MySQL, every other more full-featured system, is gonna be maintaining multiple files for your tables and databases. And the format that these files are gonna be based on is typically gonna be proprietary or custom to whatever the database system actually is. Right, meaning you can't take the Postgres database files and open them up in MySQL, or open them up in SQLite. Now DuckDB, because they wanna be portable and compatible with SQLite, they have connectors to allow you to read SQLite files and other things. But in general, all the major database systems are gonna have their own proprietary format. And the OS doesn't know anything about what's inside of these files. It doesn't know what's inside of a page, it doesn't know where the indexes are, where the tables are, it just knows nothing, it just sees a bunch of files.
So next class we'll talk about portable file formats, things like Parquet, Avro, ORC, Arrow. These are gonna be open source specifications for what a database file could look like, and then there's a bunch of different systems that know how to read them and access them and write them. But we'll work with those later. And as I said before, in the 1980s, a lot of the earlier database systems decided that not only were they gonna have customized file formats, they were also gonna have customized file systems. And they weren't gonna use ext3, which didn't exist, Btrfs didn't exist back then, but whatever the equivalent was in the 80s, they didn't wanna rely on what the OS said the file system was, they wanted to do everything themselves. But that's a lot of engineering work and nobody does that today. And it usually has a marginal benefit. The only systems that still do this would be what I'll call enterprise systems. So this would be the Oracles, the DB2s, the Teradatas, these are like million dollar database systems that are trying to get as much performance as they can. These systems will support this, in addition to running off the generic OS file systems. So then the part of the database system that's gonna be responsible for maintaining and coordinating these different files is what we'll call generically the storage manager. Sometimes it's called the storage engine, right? It's the same idea. And it's gonna be the part of the system that communicates with the hardware, or whatever the storage device is, either through the OS or using direct access, to retrieve data and bring it into the database system's memory. And, so we'll discuss this next class, I keep saying this, but there's so many things to discuss, a bunch of these systems will maintain their own sort of disk scheduler or dispatcher that decides when and what pages to read, and in what order.
Because otherwise, if you just issue a bunch of reads directly to the OS, the OS is gonna figure out how to order things. But again, the database system is in a better position to know what it actually needs and in what order. So various systems can have their own thread decide how to schedule their own disk reads. And you wanna do this because you wanna minimize the amount of thrashing, or bringing things into memory only to throw them out right away. So if you know two queries need the same page, maybe you bring that in first, before some other pages. And then you throw away the first page once you know those two queries are done with it. So the database files are gonna be broken up into what we call pages. And the database system is gonna be responsible for keeping track of what data has been read and written to these various pages, and then keeping track of how much space is available in each of them. Because again, if I need to insert a new tuple, I need to find a page that has space for me. So I'll keep track of some directory that says this page has this amount of space, and go put it in there. Now the database system itself, the storage manager layer, the part we're talking about here, is not gonna maintain multiple copies of these pages for redundancy or replication purposes. We assume that's gonna happen either above or below this part of the system in the stack. So above would be something that knows, if a query shows up and wants to do a write, to send it to another physical box or another node and have both of those machines do the write. And below would be like, if you're running RAID or some kind of storage appliance that knows how to replicate pages, then down below it'll do that as well. Typically, database systems will not maintain multiple copies themselves, because it's a bunch of extra work that ideally you don't want to have to do. All right, so what is a page?
So in our view, from a database perspective, a page is gonna be a fixed size block of data, and it can contain data from any part of the database itself. Again, for this lecture, we assume it's just tuples or records, but it can contain indexes, log information, additional metadata, the catalog, statistics. It doesn't matter, but it's still gonna be broken up into these fixed size blocks. Most systems are not gonna mix page types, meaning you don't take like a one megabyte page and put in data from this table and that table and an index and stuff like that, right? For simplicity, we're gonna assume that one page belongs to some object in the database, a table or index and so forth, and it will only contain data for that particular object. Some systems are gonna require every page to be self-contained, meaning all the information and all the metadata you need to understand what's inside that page has to be included in the page itself. So Oracle is probably the most famous one that does this. So within a page, you have to keep track of, like, it belongs to this table and has these columns of these types and so forth. The reason why they wanna do this is because if there's some corruption in the database files, you don't wanna have some page that contains the metadata about the table get blown away and then now you can't understand what's in any other page, right? Replication can solve this problem. Hardware has certainly gotten a lot more reliable in modern times than it was before. Like, hard drives used to be super flaky. They're still not ideal, but they're much better than they used to be. So maybe having every page be self-contained is less of an issue today, but it is a design choice that Oracle made very early on. Now every page in our database file is gonna be given a unique identifier, like a page ID, like some number, a 64-bit integer, 32-bit integer.
And then there'll be some method or some mechanism that the database system is gonna use that allows it to map a page ID to some physical location on our storage device. And again, that could be a file name inside of a directory at some offset. If we're running on cloud storage, it could be an S3 bucket at some offset, and so forth, right? For my purposes, it doesn't matter. We just have a way to say, if we're looking for page 123, there's some method to say, here's where to go find it. So now, what's sort of confusing in the context of databases is that there are three different notions of what a page actually is. And so at the lowest level, you have what is called a hardware page. And this is typically four kilobytes. And this is gonna be the largest size of a page or block of data that the hardware can guarantee it can write atomically. So what do I mean by that? What do I mean by atomic writes? It means all or nothing, right? So that means that if I tell the hardware I wanna write four kilobytes, and I get back an acknowledgement that, yes, I wrote four kilobytes, then I can assume that it made it. If I need to write eight kilobytes, and I send that as two four-kilobyte blocks down to the hardware, I may write the first four kilobytes and then crash and come back, and the second four kilobytes didn't make it. The hardware can't guarantee that it can do that all atomically, all or nothing. And so because of this, we have to do a bunch of extra stuff inside of our database system to deal with that. We'll get to that later. And then above the hardware, the operating system's gonna have its own notion of a page. And in Linux by default, this is four kilobytes. And again, this is like mapping something that's on the hardware to something that's in virtual memory. Now on x86-64, they also support two different sizes of huge pages, as they're called.
So you can get page sizes of two megabytes and one gigabyte. And again, the hardware still can't guarantee atomic writes beyond four kilobytes. This is just to reduce the amount of bookkeeping that the OS does for the pages that it brings into memory. And then above that, the database system's gonna have its own notion of a page as well. And typically this is gonna be anywhere in the range of 512 bytes, which is what SQLite does, up to I think 32 kilobytes; some systems let you go up to 64 kilobytes. And so the page ID is the way we'll represent where to find, at what offset in some file, for a given page size, the data that we're looking for. So again, as I said, in most systems the default's gonna be four kilobytes; in SQL Server and Postgres, the page size is gonna be eight kilobytes. And then in MySQL, they go up to 16 kilobytes. For something like DB2, an enterprise system, you can actually change the page size on a per table basis. Would anybody take a guess why a larger page size might be a better idea? What did I say in the beginning? What do we wanna try to maximize? Sequential access, correct, yes. So if I'm organizing the disk in 16 kilobyte blocks and I need to read 16 kilobytes of data, then it's one sort of call to our dispatcher in the database system, which makes one call to the OS to go grab a contiguous 16 kilobytes. If I'm using four kilobyte pages, then I gotta make separate calls, or do separate lookups, to go get the data in potentially random locations. And there are syscalls you can make to the OS, or to the device itself, so that when you allocate data you can ask for things to be, ideally, contiguously aligned.
You can pre-allocate an extent, to say, allocate me a 10 megabyte block of data, and then the database system can divide that 10 megabyte block into eight kilobyte chunks or whatever block size it wants. Yes. Right, so he says, and he's correct, doesn't this mean that larger page sizes make writes more expensive? Because now, if I only have to write one kilobyte, but I'm storing it as a 16 kilobyte page, I gotta write all 16 kilobytes. Absolutely, yes. So this is a good point. This is something I would say throughout the entire semester. I mean, it's not just for databases, it's computer science in general. There's no free lunch. There's gonna be pros and cons to each of these, and in different situations, one approach might be better than another. If my workload is entirely read-only, if I never write anything, then yeah, I want large page sizes, assuming I'm doing large sequential scans. If I'm doing a bunch of writes, then maybe I want something smaller. But what the right amount is depends. Yes. But if you have some kind of read-coalescing layer on top of that, so when you're writing, you still have atomic writes, but you still have the optimization of reading in groups? Yeah, so his statement is, if I do 16 kilobyte reads, but I'm still using four kilobyte pages, assuming that the 16 kilobytes is contiguous, then can't I make a single call to get the 16 kilobytes that are contiguous? And then if I have to write an individual four kilobyte page, I can still write those out individually. Absolutely, yes, and this is what I was saying before. We will then choose algorithms or methods that try to write that data out so that it is contiguous, so that we can do those fetches, right? And again, this is something the database system can do for us, because we know what the query is, we know what data you're gonna potentially read, and so we can read ahead for you and try to fetch things before you actually need them.
Now the OS can do that with prefetching as well, but it can only prefetch things that are contiguous, in both directions I think in Linux, yes. But if I need to prefetch things that aren't contiguous, it can't do that for me. All right, so is this clear? So again, there'll be some page ID, and we'll see in a second how this is being used, that's gonna allow us to say, for page 123, here's where to go find it. All right, so now we're gonna talk about how we actually keep track of that mapping of page IDs to locations. And this will be one of the big differences in how the database systems organize their pages. And again, I'm not saying one way is better than another. I'll describe the heap file approach, as it's the most common one, but certainly a bunch of other systems are doing different things, and there's trade-offs for all of these. So at this point in our discussion, we don't need to know anything about what's inside of our pages. Like again, we don't care whether it's indexes or tuples, we just need to know, for a given page, how do I go find it, and how do I keep track of what pages I actually have. So tree files: you can store, sort of in the leaf nodes, the actual pages themselves, or you can have a hash table. ISAM, or sequential files, is something from the 70s; MySQL used to do this by default. It's not that common anymore, but again, it's another way to keep track of things. And then with a hashing file, again, you use a hash table for lookups. The heap file is going to be the most common one, but these other methods exist. And for the log-structured stuff, we'll see that next class. There the directory is really what's keeping track; you still potentially want to use a heap file to keep track of where things are. Again, because we don't care at this point what's inside the pages.
All right, so a heap file is just going to be a collection of unordered pages where our tuples will be stored in random order. And again, the relational model allows us to do that, because the relational model doesn't say that the data has to be stored in any exact order. Some systems might pre-sort them to make things faster for other effects, but the relational model doesn't require you to do that. And so the only API we need in our storage manager to support a heap file is basically to create pages, get a page, write to a page, delete a page, and then an iterator API to allow us to sequentially read pages, to get the list of all the page IDs that we have and read over them. So managing this heap file is really easy to do if your database is a single file, like in DuckDB or in SQLite, right? Because all you need to do to find a given page is just know what the page number is. You know the size of the pages, because they all have to be the same size. And now you just do simple arithmetic: take whatever ID I'm looking for, multiply it by the size of the page, and then I can jump to whatever that offset is in the file. And I know exactly what I'm looking for. And if you go look in the SQLite documentation, they talk about how, in the header of their database file, they keep track of all the metadata to basically do this trick. Where things get tricky is if you have multiple files, which again, most systems do: Postgres does, MySQL does, Oracle does. And now we need a way to say, all right, for a given page number, what file, and at what offset within that file, is the page that I need, right? And this is what a heap file page directory can give us. Again, think of it like a hash table where you're just keeping track of a mapping from the page ID to the pages in the data files.
And this typically would be a sort of special file, either in the header of the single database file, or at some special location inside of the database system's data directory. And you can sort of think of it as the database within the database: it's the database that keeps track of what's in your database, right? The catalog is part of that too, but this is keeping track of where the physical locations of things are. And so this has to be kept in sync with the actual files on disk, because I don't want to create a bunch of pages, not update my page directory, crash and come back, and now my page directory doesn't know about these other pages and I can't get to them, right? So there's a bunch of extra tricks we have to do to make sure that these things are kept in sync. Again, just think of it: I have a bunch of pages, they're at some location, in a bunch of files, doesn't matter where, and then it's just a mapping to tell me where to go find them. Additional metadata we can keep track of: for every single page we can keep track of, in a central location, the number of free slots that it actually has, or the free space it has. So now, again, if I want to insert a tuple and I gotta find a page to put it in, I don't want to have to scan them all and figure out who's got free space; my page directory will tell me that, right? And then if I run out of pages, I know how to allocate them and then update my page directory to point to these as well. Again, just think of a hash table that gets written to disk that keeps track of the pages that I have. But then I can also iterate or scan through and say here's page one, here's page two, here's page three, because we're gonna need this if we have an access method like a sequential scan operator, right? If we don't have an index, we need to iterate over every single tuple in the table. So the page directory needs to expose that API to us. Yes?
So you mentioned in case of a crash, so that means, because the directory is brought into memory, do I also have to write it to storage somewhere? Yeah, so the question is, because the directory has to be brought into memory in order to read it, it's in volatile memory. So now if I crash, I don't want to lose it. Does that mean any changes I make to it have to be written to disk? Yes. But it's not as bad as it sounds, like, you're not updating this thing all the time, right? So if you run out of space in your database file and you allocate more pages, you're not gonna allocate just one page, right? Because then you could potentially do that for every single query. You'll allocate like, you know, a gigabyte of data, update your page directory, that gets written to disk, and once you make sure that's persistent and it's safe, then you proceed with running the query. So that means any update to the directory has to be written? Yeah, so the statement is, does any update to the page directory have to be written to disk? Absolutely, yes. Because otherwise you don't know what you have. Yes. So this question is, is this stored in a special database file or is it stored along with other pages? What do you mean by special? Just like a separate file? Yeah, so some systems will store it as a separate file. SQLite will store this in the header of the file. Typically it's stored separately. Yes. So this question is, when I say a database system could use multiple files, do I mean one table can span multiple files, or the database contains multiple files? So like DuckDB uses one file for all the tables? Yes, so DuckDB uses one file for all the tables, same as SQLite. But if we pop open Postgres and look in the data directory, there's a bunch of files in there that have numbers in them, and they're various data files. And sometimes they'll be for indexes, sometimes they'll be for tables, right?
And the various systems do different things. And again, not to keep repeating myself, this is the beauty of SQL. I don't know, and don't care, in my SQL queries whether I have one file or a thousand files, right? The database systems can decide how to do that. It just knows how to run your query for you. So again, different database systems do different things. This question is, how do you know whether you wanna use multiple files for one table? Again, it depends. So, we won't talk about large columns yet, but say you have a table that has a blob field or a text field, and that's like 10 megabytes. You'd wanna store that in separate pages, and maybe you wanna store that compressed, because it's a bunch of text data. So you have the regular columns of integers and floats, whatever, in sort of one file, and then your large stuff is stored in another file, right? So that's one approach, but you can imagine also, maybe you just have some space in a single file where the top part's the fixed-length stuff and the bottom's the variable-length stuff. It depends. I think MySQL up to 5.6 used to store, I think it was one file for all tables in a database, and then they split it up to be one file per table, and separate files for indexes. Okay. All right, so now we know sort of roughly, at a high level, what the files look like, or how the files are laid out, and how we keep track of where they exist. So let's now talk about what's actually inside the pages themselves. So every page is going to have a header that's going to tell you something about what the data actually is. So a common thing would be like the page size, or a checksum, right? So if you crash and come back, or you restart the system, actually any time you fetch something from disk, you can compute a fast checksum to make sure that the data isn't corrupted.
Maybe you keep track of the version of the database system that actually created the page. That way, if you put out a new version that breaks compatibility, you can have some code that knows how to still read the old data. We'll talk about transactions after the midterm, but you can keep track of what thread or what transaction wrote to what data in the page, and whether it's actually visible to whatever query you're running. If the data is compressed or encoded in a certain way, which we'll discuss next week, there'll be metadata about what the compression scheme actually is. There'll be information about what the table schema is, as we talked about Oracle doing. Sometimes there's additional statistics about what's in the data itself. So like, for a given column, what's the min value and the max value? Because maybe I just need to read that instead of actually reading the data to figure out whether there's something I need. And again, as we already discussed, Oracle is famously self-contained, but not all systems do that. So now, within the page itself, we need to decide how we actually want to organize the tuple data. So at this point in this lecture, we're gonna assume that we're only storing tuples in our pages, we'll discuss indexes later. And then we're gonna assume that we're storing tuples in a row-oriented manner, meaning like, if I have five attributes, I will have a tuple and I'll have those five attributes contiguously before I see the next tuple. Next week we'll see column stores, which do this slightly differently. But for our purposes here, we assume it's row-oriented, and we'll revisit this next week. All right, so there's three different approaches for what could actually be in our pages.
So there's tuple-oriented storage, where we're only storing tuples and the exact values that those tuples have. There'll be a log-structured approach, where we just store deltas of what changed since the last time a tuple was updated. And then there's index-organized storage. Index-organized storage could be like a tree structure where, in the leaf nodes, I'm actually storing the data itself. So in today's lecture, we're only gonna talk about the first one, tuple-oriented storage, and then next week we'll talk about the two other approaches. Okay? All right, so let's think about how we actually want to store tuples in our pages. All right, so let's say we have a really simple approach where, in our page header, we just keep track of the number of tuples that we have. All right? And any time you wanna insert a new tuple, we just append to the end, right? So we assume our data's fixed length. So if I wanna insert a new tuple, I just go look at the header, see the number of tuples, multiply that by the size of the tuple, and that tells me at what offset I wanna write in the page. Which one? Oh, a lanternfly. Just kill it. Wait, does anybody want to keep that alive? For those on YouTube, we have a lanternfly infestation. All right, so is this a good idea or a bad idea? I said strawman, so it's obviously a bad idea. Well, why is it a bad idea? Right, so if I delete a tuple, say I delete tuple two, it throws everything out of order. Right, because now what I wanna be able to do is insert a new tuple, and I don't wanna put it at the end, I wanna use this spot here, right? So the number of tuples isn't enough. Maybe I keep track of where things are located on the side, so I know where to put them in, right? But that's going beyond what I'm showing here. What's another obvious problem? What if the tuple's too big to be stored on the page? You said one of the tuples is too big to be stored on the page. Yes, but also, what if they're not fixed length? Most data is not fixed length, right?
Email addresses aren't the same size, addresses aren't always the same size. Now I could just store it as a char type, but what does that do? That pre-allocates the space I need, and if the largest email address is one kilobyte, then I have to store one kilobyte for every single email address, even though most aren't gonna be using that space, right? Then his comment is, well, what if the tuple can't fit in this page? How do I span multiple pages? We'll discuss that next week, but this approach also would not work for that, right? So clearly this is not enough for us. We need additional metadata to keep track of how we're gonna store this. Another problem too is, again, if four couldn't go here, right, I couldn't fill the gap when I deleted two. If four needs to go at the end, then I'm wasting space. But if I want to maybe move three up, now I have to tell the rest of the system that I moved three, right? Because, I haven't told you how I'm pointing to three, how I find three, but assume it's gonna be like some offset within this page. But now if I'm moving three, then I have to go update every possible index that is maybe pointing to it, and that's gonna be super expensive. So the most common approach to handle this problem is called slotted pages. And what I'll describe here isn't gonna be exactly how every system does this, but at a high level, this is what everyone is doing: if you're a row-oriented database system that's using tuple-oriented pages, so not log-structured, then you're doing something that looks like this. So we're gonna have a header, keeping track of all the metadata we talked about before. Then after the header, we'll have the slot array, where every position in that slot array is gonna point to some tuple in our page, and the tuples will be starting at the bottom, at the end of the page, right?
So the bottom will have all the fixed-length and variable-length tuple data. For now, assume that everything is kept together, right? Meaning, if you have a really large value, it isn't stored in a separate page, the entire tuple has to be stored inside this page. And so the slot array is just gonna be storing fixed-length offsets to the starting location of individual tuples, and maybe you also store the size of the tuple in the header if you wanted to, right? So now what's gonna happen is, as we need to update the table and add new tuples inside the page, the slot array is gonna grow from the beginning towards the end, and then all our tuple data is gonna grow from the end towards the beginning. And any time I add a new tuple, I update the slot array to tell me where to go find it, right? So now if I go back to the problem I had before, where say I deleted tuple three, well, this is fine, because I didn't move any other tuples; the slot array entry for four still points to it, so I don't have to tell any other part of the system that anything moved, right? And even if I now wanted to reclaim the space where three used to be before I deleted it, if I wanna slide four over to compact it, all I need to do now is just update the slot array to point to the new offset, which is easy to do. Again, the relational model says I don't have to do this; some systems do, some systems don't, we'll see this in a second, right? And this is all fine. So this is what, actually I don't know if SQLite does this, but this is what Postgres does, SQL Server does this; MySQL doesn't do this. Slotted pages is the most common approach, this is what everyone does. Yes, so this question is, can I reuse slot position three if I insert a new tuple? Yes. But I don't have to put that in front of four, right?
I can put it anywhere. Other questions? Yes, so her question is, wouldn't this be wasting space if the tuples are variable length? Yeah, so again, the tuple data is growing from the end to the beginning, and the slot array is growing from the beginning to the end. At some point I'm gonna run out of space, and there might be a little bit of space in the middle that I can't use for anything. Is that wasted? Yes, but the advantage we get of not having to update other things any time we shuffle things around inside the page is worth that cost. Okay, so now, assuming we're thinking in terms of slotted pages, we need a way to identify tuples. And for this we're gonna use the notion of a record ID; different database systems might call this the row ID or the row number, right? But at a high level, the way to think about it is a way to uniquely identify some logical tuple based on its physical location inside of a file, inside of a page. And it's typically gonna be a combination of a file number or an ID number, the page number, and then the slot number that corresponds to where it exists in that slot array. So when you wanna do a lookup, say I need this tuple, if you have the record ID, you would then look in the page directory and figure out what page has it, go grab that page, then use the slot number inside the slot array to figure out what offset inside that page has the data you're looking for. Most databases will not store this record ID; it's something that's synthesized, materialized, again, based on the page directory or how you keep track of how to find things. Meaning, within the tuple data itself, I'm not storing this record ID. SQLite does store this as a separate column that you're not supposed to see, but you can get to it.
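That lookup path, record ID to page directory to slot array to tuple bytes, can be sketched like this. Everything here is hypothetical glue for illustration: the page directory is just a dict, and each page is a dict with a `"slots"` map of slot number to (offset, length).

```python
from typing import Dict, Tuple

RecordID = Tuple[int, int]  # (page_number, slot_number)

def fetch_tuple(page_directory: Dict[int, dict], rid: RecordID) -> bytes:
    page_no, slot_no = rid
    page = page_directory[page_no]            # page directory: which page has it
    offset, length = page["slots"][slot_no]   # slot array: byte offset inside the page
    return page["data"][offset:offset + length]
```

Note the record ID itself never appears inside the tuple bytes; it is derived entirely from where the tuple happens to live.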
And the reason why they do this is because they're gonna use this as the primary key to allow them to identify individual tuples. So like, we haven't talked about secondary indexes yet, but if I have an index that's not the primary key index, my value when I do my lookup on a key is gonna be that row ID, which I then use on the primary key row ID index to find the data I need. Different systems do different things. So the size of this is gonna vary based on implementation. Postgres is gonna be six bytes. SQLite is gonna be eight bytes, or 64 bits. SQL Server has an eight-byte one, and then Oracle has a ten-byte record ID. Again, you can see this directly in the database system, so we can do a quick demo just to show you all this. You're not supposed to use this in your application, right? Even though this would uniquely identify a tuple, again, it's the physical location of it, and it could change. Meaning, if I insert a tuple and I get a record ID, and my application then references it, the problem could be that I run compaction or garbage collection, or in Postgres it's called the vacuum, where I could reorganize that page, and maybe now the slot number changes or the page number changes, and now the thing I'm looking for isn't there anymore. So again, this is a physical aspect of the database system that we're not supposed to really use in our application, but it's exposed to us because if we need to administer or maintain these systems, we need to know where the data actually is. So let me log in real quickly, sorry. I'm gonna need to reconnect, sorry. All right, so we'll do Postgres first. So we're gonna create a really simple table R that just has three tuples, right? 100, 101, 102. Postgres has something called the CTID, and this is gonna be a pair that gives you the page number and then the slot number. So these tuples here are page zero, slot one, slot two, slot three, right?
So now, if I say I delete the second tuple, I delete 101. Now when I do my scan, you can see that Postgres deleted the tuple but it didn't move things around, right? It left the data where it actually lives. Then I can run the garbage collection. Actually, let me insert a tuple back first. So I'll insert 103. Now you can see again, it didn't take that (0,2) slot from the tuple I deleted; it just appended to the end, put it in slot four, right? So then I can now run, again, in Postgres it's called the vacuum, so the command is vacuum. Again, this is a Postgres idiom. So vacuum full is gonna have Postgres basically compact every single page and write out new pages, new files. So if I have a bunch of pages that are empty, it'll release them when it creates a new version. So now when I do that same query as before, right, now you see it compacted it, right? So (0,1), (0,2), (0,3), right? Make sense? Yes. Is this referring to the page number and offset, or the slot number? So is it the slot number in the slot array? It's in the slot array, because then you use the slot array to say what offset it is within the page where I find what I need. Yes. So the question is, does the slot array start at zero index, and does the page start at zero index? The slot array starts at one index; the page starts at zero index. Is there something in slot zero? Let's find out. So you're not supposed to do this, but you can actually query this, right? CTID equals (0,0). That doesn't have anything there, but I can get the other one. So I don't know why they do that. Yes. So the statement is, the CTID is gonna be six bytes, so it's probably a four-byte page number and then a two-byte slot number. Doesn't that limit the number of slots we can have in a page? Yes, but Postgres by default has eight-kilobyte page sizes, so you can't have a billion tuples in a single page. All right, so let's look at other systems.
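That six-byte layout, a four-byte page (block) number followed by a two-byte slot number, is easy to see with a little byte-packing. This is just an illustration of the arithmetic, not Postgres's actual on-disk code, and it assumes big-endian order for readability.

```python
import struct

def pack_ctid(page: int, slot: int) -> bytes:
    # ">IH" = big-endian 4-byte unsigned int + 2-byte unsigned short = 6 bytes total
    return struct.pack(">IH", page, slot)

def unpack_ctid(raw: bytes) -> tuple:
    return struct.unpack(">IH", raw)
```

The two-byte slot field caps a page at 65,535 slots, which, as noted, is never a constraint with eight-kilobyte pages.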
So, again, SQLite is different. So SQLite has this row ID. Again, it's actually storing this, right? It's a 64-bit integer, and it actually stores this. So it uses this as the primary key, yes. So the question is, is the size of the tuple stored somewhere? Typically in the header of the tuple, yes. All right, so if we delete a tuple and run the same query, right, it doesn't reuse the row ID, because it's actually a physical thing, the primary key. All right, so let's do now SQL Server. So SQL Server has this different syntax. It has this double, that's hard to see. If I highlight it, can you see it? There are percent signs there, trust me. Sorry, let's go back. All right, there you go. There are percent signs, sorry. So when you run this, you get back some hex data like this, right? What does this mean? So there is an undocumented command. Actually, you're not supposed to do this, but you can; it's just not documented, meaning Microsoft doesn't officially support this. But there is a command, this function here, where you pass in the physical location, and then you'll get back the formatted file number, the page number, and then the slot number, right? And, it's interesting, I learned this today, you can actually get back what that function is doing, so you can get it to spit back what it actually does, and you see here how they're taking that physical location and jumping into different bytes to get the page ID, file number, and the slot number, right? So let's do the same thing we did before. Let's delete 101, then we'll run the same query to get the page number and offsets, right? So in this case here, it didn't move anything, right? So now if we insert our tuple back, insert a new tuple, run that same query, now look what it did.
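The SQLite part of the demo is easy to reproduce with Python's built-in `sqlite3` module. The point being shown: the rowid is a real stored 64-bit integer acting as the primary key, and when you delete a middle row, its rowid is not handed out again; a new insert gets one past the current maximum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (a INTEGER)")
conn.executemany("INSERT INTO r VALUES (?)", [(100,), (101,), (102,)])

# The hidden rowid column is queryable even though it isn't in the schema.
print(conn.execute("SELECT rowid, a FROM r ORDER BY rowid").fetchall())
# [(1, 100), (2, 101), (3, 102)]

conn.execute("DELETE FROM r WHERE a = 101")  # frees rowid 2
conn.execute("INSERT INTO r VALUES (103)")   # gets rowid 4, not 2

print(conn.execute("SELECT rowid, a FROM r ORDER BY rowid").fetchall())
# [(1, 100), (3, 102), (4, 103)]
```

(Without `AUTOINCREMENT`, SQLite picks the new rowid as one larger than the largest rowid currently in the table, so the freed middle value stays unused.)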
So my tuple before with ID 102, that was at slot two, but when I inserted the new tuple, it moved that second tuple, 102, into slot two, and then put the new tuple into slot three. Postgres didn't do that, right? Postgres just kept appending at the end. Is this wrong? Who knows. Is it better? Who knows, right? And the reason why you can do this is because when you fetch a page, bring it into memory, and start inserting a tuple into it, you're holding the latch or the lock on that page. The database is just gonna decide whether it wants to do compaction or whatever optimization it wants, because it knows that no other thread can write to that page at the same time. So we can decide whether or not we want to compact it as we go. Postgres doesn't do it; SQL Server does. All right, let's look at everyone's favorite, Oracle. I need to create the table first, I think. I don't think it's gonna let me do that. No, it did. So in Oracle, they have a row ID, but again, you get some binary data here, right? And again, this is from Stack Overflow, this is not me, but there's a bunch of functions you can run on this, right? And then you see now they're storing a logical ID, a file number, a block number, which is the page number, and then the row slot, right? So again, we're taking something that's defined in the textbook, described at a logical or theoretical level, here's how to organize your database system, and then through SQL you can see how the real systems are storing things in slotted pages, and the different limitations of it, all right? All right, so in the sake of time, I think I'm gonna skip ahead; this will segue into the next class, but the tuple itself is just gonna be a sequence of bytes, right? There's some header and then a byte sequence, and then it's up to the database system to know how to interpret those bytes based on the type and the values that you're looking at.
So again, we'll cover this next class, but the way to think about this is, there'll be some header that contains information about the tuple, like a bitmap to keep track of which columns have nulls, and then the execution engine will know how to jump to different offsets within the tuple based on the schema. So again, we'll cover this next class, but just to finish up: what have we discussed today? There's a database system, it's gonna maintain a database, that data is gonna be tracked across different files broken up into pages, and then we have different ways to keep track of those pages and keep track of how to store things in those pages. And then in the next class, how to actually store the tuples, okay? So again, a lot to discuss next class. Have a good weekend, see ya.
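As a preview of that header-plus-bytes idea, here is one way a tuple of fixed-length four-byte integers could be laid out: a one-byte null bitmap as the header, then the non-null values back to back, with the engine using the schema plus the bitmap to find each column's offset. The field sizes and function names are purely illustrative, not any particular system's format.

```python
import struct

def serialize(values):
    """values: list of int or None. Returns header byte + packed non-null ints."""
    bitmap = 0
    payload = b""
    for i, v in enumerate(values):
        if v is None:
            bitmap |= (1 << i)             # mark column i as null in the header
        else:
            payload += struct.pack(">i", v)  # 4-byte big-endian int
    return bytes([bitmap]) + payload

def deserialize(raw, ncols):
    """Walk the schema: null columns take no space, others are 4 bytes each."""
    bitmap, out, off = raw[0], [], 1
    for i in range(ncols):
        if bitmap & (1 << i):
            out.append(None)
        else:
            out.append(struct.unpack_from(">i", raw, off)[0])
            off += 4
    return out
```

Notice that a null column consumes zero bytes of payload; the reader can only find column offsets because it knows the schema, which is exactly the point made above about the execution engine interpreting raw bytes.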