Let's get started. Again, DJ Mushu is back in town, thank you for being here. I think we've finally worked out all the kinks: I should be recording audio, the DJ is playing over the overhead speakers, I got the clicker working, and I bought new shoes so it doesn't sound like I'm on a basketball court all the time. So we should be good to go. Why the hell did you come to my house on Friday night? I need help getting my equipment back. You're going to come at midnight? What am I going to do? I'm just getting a bunch of people, we've got to do whatever it takes. Please don't come to my house at midnight, I don't care what you're trying to do, just don't do that. For you guys in the class, the things that are on your schedule: Homework 1 and Project 0 are both due this Sunday at midnight. Project 1, which will count toward your real grade, will be released later this week. And then, as we posted on Piazza, tonight we're having an info session where we will go through a bunch of different ways to set up your laptops, for Windows, OS X (both Intel and ARM), and Linux, to get the development environment set up for the projects, okay? Questions? Yes. So the question is, if you can do Project 0 on your laptop with your development environment, will you be okay for the rest of the semester? Yes. Question: will you be okay without Valgrind? I mean, can you drive a car without a seatbelt? Yes, but do you want to? He says yes. No, don't do that. Any other questions? Is the question about Valgrind? Yes, go ahead. The question is, will the info session be recorded if you can't make it? Yes, it will be. And we'll post it on Box only, because there will be students talking and we can't put that on YouTube, so it'll be on Box. We'll post a link on Piazza. And all the project recitations going forward will also be recorded and made available as well. Okay? Okay. Thank you. Sorry, there's an audio issue. There we go, that's better. Okay. Yes, let's double check that. As far as I can tell, sound is recording. All right, thank you. All right, so over the last two lectures, we spent time talking about what a database looks like at the logical level, the high level, what the application programmer sees. We learned how to write SQL, and we learned how to declare tables in the relational model. From this lecture going forward, pretty much for the rest of the semester, it's all about how you build a database system: how you build the software that's going to expose that SQL API, expose tables and relations, to programmers, to the end user. So it's really about the internals of the system: how do we actually materialize and build a database system that can achieve the SQL capabilities and the storage that we want? We're hoping to have BusTub support SQL in the projects going forward, but up to this point, you don't really need to understand these internals for the stuff we're doing in Homework 1 and SQL. Now we'll see how to actually implement these things, okay? So the overall course outline is as follows, and the way to think about what we're going to go through is as the different layers of a database management system.
So this is a high-level abstraction, a high-level view of what a database system looks like. You can basically think of it as a bunch of layers in software, where each layer makes certain guarantees or exposes a certain API to the layer above it and uses whatever comes below it. We've already talked about relational databases, so the order we're going to go through for the rest of the semester is: first, how do you actually store data in a database system; then how do you execute queries; how do you run multiple queries at the same time; how do you deal with a crash and make sure you don't lose any data; and then we'll finish up the semester talking about distributed databases. At a high level, from now until around Thanksgiving, we're only going to be talking about a single-node system, because trust me, you want to deal with a single node first before you bring in multiple nodes or multiple machines; it's going to make your life a lot easier. And once we understand the fundamentals of how to build a single-node database system, we can then see what we need to change to make it distributed, to make it run on multiple machines. And then we'll finish up with a potpourri session talking about a bunch of random stuff that I think is important and interesting, which goes beyond the fundamentals. So for today's lecture and the next two or three lectures, we're going to start at the lowest level, the disk manager. How do we actually represent a database on disk? What does it mean to go get a page from disk? What's inside of it? And so forth. Because this is going to affect how we architect the rest of the system. So one key idea we need to understand for this semester is that we'll be talking about what is called a disk-based architecture, a disk-oriented database system. This basically means that the software is designed to assume that the primary storage location of the database is on some kind of non-volatile storage, non-volatile disk. SSD, spinning-disk hard drive, it doesn't matter at this point. And the components we're going to build in the different layers of the system are going to allow us to manage the movement of data back and forth between the non-volatile disk and the volatile storage, memory. This is a classic Von Neumann architecture, where we have the data at rest sitting on disk; anytime we want to manipulate it or do something with it, we have to bring it into memory so the CPU can operate on it. Now, some GPUs and things like that can break that abstraction, but for our purposes here, let's assume it holds. So the big thing we're going to see going forward, not just this class but over the next few weeks, and what you'll be building in Project 1, is that a lot of the system will be designed with the assumption that the data you need is on disk. And that means that anytime you go read a page, go read something, it may not be in memory; it's on disk and you've got to go get it. You've got to coordinate the system to deal with that back and forth. So I always show this every year. The way to think about this is the storage hierarchy, which looks roughly as follows. You might have seen this in the textbook, or in other diagrams elsewhere. The way to view this hierarchy is that at the bottom layer, we have storage that's very slow, but much larger and much cheaper.
And as we go up the stack, the size of the device, the amount of data we can store in it, gets smaller and smaller. It gets much faster, but it's also way more expensive. The most extreme storage you can have is CPU registers. These are small things that can store 64 bits, literally sitting on the CPU, and that's what the instructions operate on. At the bottom you have something like network storage. I think in the textbook they have magnetic tapes at the bottom. I'm assuming nobody knows what that is; it's how they did things in the old days. You primarily only use it for long-term archival. Amazon has a service called Glacier; it's literally tape drives, cassette tapes, that a robot arm puts in and out if you ever need to retrieve anything. But it has super long latency. The exact speed differences between these different levels are not going to matter for this semester. The thing that we are going to care about is this division line here, where anything above it is considered volatile. That means that if I pull the power, I lose whatever was stored in it. But these storage devices support random access, meaning I can jump to any offset in the storage device and get it at roughly the same speed as any other offset; by that I mean it's byte-addressable, right? Anything below that line will be non-volatile. It's obviously going to be slower, and it's going to be block-addressable, meaning I can't jump to a single exact memory location and just get the 64 bits there. I've got to go get the block or the page that contains the data that I want. And even though I may only want 64 bits, I've got to get the whole thing. In some cases, too, the speed at which I can access data on these non-volatile devices will be faster if I'm doing sequential access versus random access. Anyone want to take a guess at what I mean by that? Yes. Right, so what he said is absolutely correct. It basically means that if I need to get one megabyte of data and it's exactly one megabyte, all contiguous on the actual physical device — that's not exactly true in practice, but for our purposes assume it is — then that's going to be way faster than getting one megabyte in one-kilobyte chunks where I've got to jump around to different locations. The easiest way to think about this is a spinning-disk hard drive. No laptop comes with them anymore, but they still exist in large storage appliances. A spinning disk has what looks like a vinyl record, and there's an arm that jumps around to do seeks and things like that. So if that arm has to jump around to go get my one megabyte from a bunch of locations, that's physical movement, something happening in the real world. That's going to be much, much slower than plopping the arm down once and reading things in big chunks. Even with SSDs, because of the way they're actually implemented underneath — even though there are no moving parts, it's solid-state storage, just electronic communication between different circuits — it's still going to be faster to do sequential access versus random access, right? And then there's a whole bunch of other stuff above that in the operating system, how you queue the requests and so forth. In general, sequential access is going to be preferable over random access.
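To make that concrete, here's a minimal sketch using POSIX `pread` — purely illustrative, with made-up sizes and offsets — of the two access patterns: one big contiguous read versus the same amount of data fetched in scattered one-kilobyte chunks. On a spinning disk, the second version pays a seek for every chunk.

```cpp
#include <unistd.h>
#include <cstddef>
#include <vector>

constexpr size_t CHUNK = 1024;         // 1 KB
constexpr size_t TOTAL = 1024 * 1024;  // 1 MB

// Sequential: one seek, then a single contiguous 1 MB transfer.
void ReadSequential(int fd, char *buf) {
  pread(fd, buf, TOTAL, 0);
}

// Random: the same 1 MB as 1024 scattered 1 KB reads. On a spinning
// disk, each read may force the arm to physically move to a new track.
void ReadRandom(int fd, char *buf, const std::vector<off_t> &offsets) {
  for (size_t i = 0; i < offsets.size(); i++) {
    pread(fd, buf + i * CHUNK, CHUNK, offsets[i]);
  }
}
```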
And we'll see, as we design our algorithms for executing queries and doing other things in our database system, that we're going to want to choose algorithms or implementations that on paper, if you've taken an algorithms course, look like the slower thing to do, but because we can maximize sequential access, we're actually going to get better performance. So for this class, this semester, whenever we talk about memory, we mean DRAM. We're not going to worry about CPU caches or CPU registers; those are things we'll cover in 15-721, the advanced class. For our purposes here, if it's in memory, that's good enough for us. There are ways to take advantage of registers and CPU caches to do certain operations more efficiently, but at the end of the day the disk is so much slower that as long as it's in DRAM, that's good enough for us. And throughout the semester, when I refer to disk, I mean anything down here below the division line. Now, there are devices that try to span these different levels and try to get the best of both worlds. There is fast network storage, things like InfiniBand, or RDMA over TCP or Ethernet. There are things that kind of look like you could treat them as random access and non-volatile, if you can get fast network storage. The thing that I was super excited about was this thing called persistent memory. Who here has ever heard of this, or of Intel Optane? Okay, a few of you guys. This thing was amazing, because the promise was: it looks like DRAM, it smells like DRAM, it even goes into the DRAM DIMM slots on your motherboard, but if you pull the plug, you don't lose any data, right? This is actually something that, when I started at CMU, I worked on with my first PhD student, who's now faculty at Georgia Tech; we ended up writing a book about this kind of stuff. To me, this was a game changer, because all the stuff we talk about this semester, moving things back and forth with the buffer pool — a lot of the algorithms we talk about this semester would all go away if you actually had this device, a storage device that looks like DRAM but is non-volatile, right? And lo and behold, Intel actually had this device. They called it Intel Optane. This was amazing. Again, it looks like DRAM, but from the application standpoint, you can rely on the fact that it's non-volatile. Can you guess what happened? Do you know the story? Intel killed it. Yeah, this was two or three weeks ago, at the end of July. Intel was not making any money off it. They hired a new CEO, and he cut a bunch of things, and the one thing that, sadly, they killed off was the Optane memory. A bunch of other manufacturers claimed they had devices that looked like this. IBM had something, HP was supposed to have these things called memristors — they bet the farm on it, but that fell through. Hynix was supposed to have something. Intel was the first one that actually stepped up and shipped it. And there basically wasn't a market, which is kind of sad. This would have been a radical change if you had it; as I said, you'd basically change how you build all your software. We were actually building a database system that relied on this, but now it's dead.
So this is sad. And I don't see this happening again in the next decade, because basically everybody's going to be gun-shy: if Intel couldn't make money off it, who else will try? So we'll say maybe it's another 10 years. All right, so this is gone. All right, so you may have seen this chart as well. It's oftentimes labeled as "latency numbers every programmer should know." It's sometimes attributed to Jeff Dean, I think from the mid-2000s; I've seen Jim Gray, another famous database researcher, present similar numbers. But this is a way to understand how expensive it is to get stuff from disk, and why we're going to design algorithms that try to reduce that back and forth. The exact numbers change every year — there's a website at Berkeley where you can move a slider and see how these numbers evolve over time — but roughly, think of an L1 cache reference as taking one nanosecond. Reading from DRAM is 100 nanoseconds. Once you get to non-volatile storage like an SSD, now you're at around 16 microseconds. A spinning-disk hard drive is two milliseconds, which is actually pretty good. And then network storage would be something like EBS or S3; this can fluctuate from maybe 50 milliseconds to 500 milliseconds depending on what's happening. And then again, there are the tape archives that basically no one would actually use. It's kind of hard to wrap your head around these numbers, because humans don't typically think in nanoseconds. If you just scale the numbers so that one nanosecond equals one second, now you actually see how huge the differences are. The metaphor that Jim Gray liked to use, which I like too, is to think about what it would take to read a page in a book. Reading from the L1 CPU cache, from SRAM, would be like having the book right in front of me and just finding the page and reading it. Whereas an L2 cache miss means I've got to walk over to his desk and go read it there. And all the way down at the bottom are the tape archives: 31 years, the equivalent of flying to Pluto to read one page in a book. It's orders of magnitude slower. This is why we're going to avoid disk at all costs. Even reading from a fast SSD — 16 microseconds is pretty good, but on this scale it still takes you four hours compared to one second. So we're going to try to maximize the amount of work we can do with the data we have in memory and reduce our reads and writes to disk. These numbers, I think, are for reads; on some devices, writes are actually slower. SSD writes are slower because there's more work to do, but it's roughly the same order-of-magnitude difference between these levels, okay? And this next point we also covered already; I'll just say it again: non-volatile storage is much better to access using sequential I/O. It doesn't matter whether it's reads or writes, sequential is always going to be better, and we're going to try to choose algorithms that maximize it. Again, this may look different from what you see in an algorithms course, where you just assume everything's in memory and everything has equal access characteristics, but on real hardware it makes a difference, okay?
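Scaling the numbers above so that one nanosecond becomes one second, as suggested, gives roughly the following (the tape figure is back-computed from the "31 years" metaphor):

```
storage            actual           scaled (1 ns = 1 s)
L1 cache ref       1 ns             1 second
DRAM               100 ns           ~1.5 minutes
SSD                16 us            ~4.5 hours
spinning disk      2 ms             ~3 weeks
network storage    50-500 ms        ~1.5-16 years
tape archive       ~1 s             ~31 years
```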
And I'll try, as we go along, to point out some of the assumptions the textbook makes, or that some of the algorithms we'll talk about make, assuming you have really old, slow disks, and how SSDs make this a little bit better — but random access will still be problematic. So the overall goal in a disk-oriented system is to build software that can manage a database that exceeds the amount of memory available to the system, right? Again, because reading and writing to disk is super expensive, we want to be clever about how we move data back and forth, to avoid large stalls and performance degradation as our database grows larger and larger. And again, because sequential access is better, we'll choose algorithms and methods that maximize sequential I/O. So at a high level, the database system is going to look like this. We have some kind of non-volatile storage, and on it we have a database file, and the database file is broken up into pages. Sometimes I'll refer to these as blocks, sometimes as pages; they basically mean the same thing. And then above our division line, we have memory, and this has a buffer pool manager that manages some region of memory where it can copy pages in and out. The textbook might sometimes call these buffer managers or page caches; again, they all basically mean the same thing. It's memory that's being managed by the database system to move pages in and out from disk. And then there's some upper-level part of the system — we'll just call it the execution engine. When I showed that course outline in the beginning, these are the upper parts. We don't care what it is; we just know that something is going to come to the buffer pool manager and say, "I want page two." Then we'll look at this page directory, which we'll talk about in a second, and it tells us: okay, if you need page two, here's the location on disk where you go find it. We copy it into our buffer pool and give back to the execution engine a pointer to page two in the memory that we've allocated. And now the system can do whatever it wants with that page: interpret the layout, make changes to it, do whatever it needs to do to execute queries, right? So in some ways, this part of the system doesn't even really know what's inside these pages — is it an index, is it table X or table Y? It doesn't matter; it's the upper-level parts of the system that handle that for us. Now, in cases where we can give hints from above and say, hey, I'm accessing this page and it's going to be used in a certain way, there are some optimizations we can do at these different levels, but at a high level, it doesn't matter. Yes? The question is whether, between the execution engine and the buffer pool, only bytes are being passed. You'd get a 64-bit pointer to a page in memory, right? Think of this as all happening within the process space of the database system. So when you say, give me page two, you get back a pointer to page two. Yes? The question is, is the database file one file on disk or multiple files? It depends on the implementation; we'll get to that in a second. The short answer: SQLite is one file, most systems use multiple files.
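As a rough, hypothetical sketch of that flow in C++ (none of these names come from a real system; the eviction policy and disk read are deliberately stubbed out): the execution engine asks the buffer pool for a page ID and gets back a pointer into memory that the DBMS itself manages.

```cpp
#include <cstdint>
#include <cstring>
#include <unordered_map>

using page_id_t = int32_t;
constexpr size_t PAGE_SIZE = 4096;  // assumed page size
constexpr size_t NUM_FRAMES = 64;   // assumed buffer pool capacity

class BufferPoolManager {
 public:
  // Execution engine calls this: "give me page 2" -> pointer into memory.
  char *GetPage(page_id_t page_id) {
    auto it = page_table_.find(page_id);
    if (it != page_table_.end()) {
      return &frames_[it->second * PAGE_SIZE];  // already cached in memory
    }
    // Naive round-robin eviction; a real system would use a smarter
    // policy and would also erase the evicted page's table entry.
    size_t frame = next_victim_++ % NUM_FRAMES;
    // A real system consults the page directory for the file + offset
    // and reads the page from disk; stubbed out here as a zero fill.
    std::memset(&frames_[frame * PAGE_SIZE], 0, PAGE_SIZE);
    page_table_[page_id] = frame;
    return &frames_[frame * PAGE_SIZE];
  }

 private:
  char frames_[NUM_FRAMES * PAGE_SIZE];               // DBMS-managed memory
  std::unordered_map<page_id_t, size_t> page_table_;  // page id -> frame
  size_t next_victim_ = 0;
};
```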
All right, next question — you're actually jumping ahead. His statement is: hey, this sounds like virtual memory. It does, and we'll see why we don't want to use the operating system's virtual memory in a couple of slides. That's a very good point. So, just to outline where we're going: we'll talk about this piece over the next two or three lectures, including today; the buffer pool, which Project 1 will be based on, will be lecture six next week; and all the other stuff, how to build the execution engine, we'll cover later on. All right, so he basically hit the nail on the head. What does this sound like? It sounds a lot like virtual memory, right? So why would we spend time in our database system reinventing the wheel if the operating system can potentially already do this for us? In particular, on Unix-like systems, you have what is called memory-mapped I/O, and there's a syscall called mmap to do it. With mmap, you give it the location of a file on disk, and the operating system sets up pages of virtual memory in your process that map to the pages of that file. Now if my thread tries to read some location in the file through that memory address, the OS is responsible for bringing it into memory, and I get a pointer to it to do whatever I want. And if I run out of space in my process, the OS can page things out as needed. So roughly it looks like this: you have a file on disk, you call mmap, and you get a bunch of virtual pages set up. If your thread comes along and says "I want page one," the operating system says, okay, page one is at this offset in the file, copies it into your virtual memory, and the thread can do whatever it wants with it; there's a mapping from virtual memory to physical memory. If I need page three, same thing: copy it in, and I can do whatever I want. Now the challenge comes when I need page two. What happens here? Well, the operating system says, I don't have any more space in physical memory, I've got to go make some. Which page it actually removes from physical memory to make room for the new one depends on a bunch of different factors, and it may not be the right choice for your database system. And what's really going to happen here is that your thread is going to stall. Whatever thread says "I want this" touches a memory location, tries to do whatever it wants with it, and the OS raises a page fault: the page you want is not in physical memory. It stalls your thread, goes and fetches the page from disk, and puts it in place for you. So now your thread is just sitting there when it could be doing useful work. The big problem also comes up when we have multiple threads trying to access mmap-ed files. Because of these page stalls, maybe I'll have one thread be the dispatcher that says, okay, go do whatever I need to do on this page, so that if there's a stall, that thread gets blocked and I can still go do more useful work elsewhere. You can do a bunch of these tricks to get around these problems, but essentially you end up building a database system's buffer manager in the end anyway, right?
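For reference, here's roughly what that mmap usage looks like in code (a minimal POSIX sketch; the file name is made up). Note there's no explicit read call anywhere: touching the mapped bytes is what triggers the page fault.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
  int fd = open("my.db", O_RDWR);  // hypothetical database file
  struct stat st;
  fstat(fd, &st);

  // Map the whole file into our address space; the OS now owns paging.
  char *data = static_cast<char *>(mmap(
      nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

  // Touching any byte may page-fault: the OS silently stalls this
  // thread while it fetches the backing page from disk.
  volatile char c = data[0];
  (void)c;

  munmap(data, st.st_size);
  close(fd);
}
```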
So I've had a long-standing — obsession is maybe not the right word, but — obsession with why mmap for databases is a bad idea; I'm even writing a paper about it. So let me go through the four key problems of why you don't want to use the operating system for this. And this will be a recurring theme throughout the entire semester: the operating system is not your database system's friend. It's always going to get in the way, and we're going to try to avoid it as much as possible. All the major database systems do this. Postgres has one exception that we'll talk about, but in general, you don't want to rely on the operating system to do anything other than give you memory and make sure your process doesn't get killed; we want to do everything ourselves, inside the database system. The reason is that the database system is always going to know better than the operating system. We know what the queries are, we know what's actually in the data — we being the database system — while the operating system just sees a bunch of blind reads and writes on files or memory locations. It doesn't know what you're trying to do, doesn't know that there's another transaction running that needs to touch this page and that you've got to wait for it. There's a whole bunch of complications, knowledge we have inside the database system about why we want to do certain things, that the OS simply can't know. So we're going to avoid it as much as possible. All right, so the first problem we have if we use mmap is transaction safety. We'll get into what a transaction means later, but think about this: if I write to a page because I did an update query, when should I actually write it out to disk? Ideally, I want to write it out when I know the transaction has committed, but the operating system is free to write it out any time it wants. You can call mlock on the page to try to pin it in memory, but that doesn't prevent the OS from writing out dirty data. So you might have the case where I do an update, but I don't want to persist my changes yet because I'm updating multiple pages, and the OS flushes out one page, you crash before you update all the other pages, and when you come back you have torn writes. All right, so that's the first problem. The next one we already talked about: I/O stalls. The database system doesn't know which pages are in memory, because the operating system hides all this from you transparently. So if a thread tries to touch data that's not in memory, there's a page fault, the thread gets stalled, and it doesn't get scheduled again until the data it wants is available. So again, you have to do a bunch of extra work to prevent this, or have other threads do asynchronous I/O, but by then you're basically building all the buffer pool machinery we're going to talk about this semester anyway. The next problem is error handling. The database system maintains checksums on every page to make sure the data is correct when it moves in and out, and if you control the reads and writes, you can actually check whether a page is correct when it comes in, and throw an error if it's not. You can't do that easily with mmap, because the OS is hiding the I/O from you.
And then anytime there's a failure — the file is corrupted, or you can't get something from the file system — you don't get an exception, because it's the operating system: you get a SIGBUS signal. And now, throughout your entire database system, you have to have handler code for those signals, because some thread might be doing something at any given time when the signal shows up, and you need code that actually handles that. Whereas if you build it with the abstraction layers that I talked about, you can isolate the checks for whether data is corrupted to just the disk manager portion, and not have signal handling everywhere else. The last one is performance issues. For really small databases and really light workloads, mmap might be okay, but if you want to build high-performance systems, the OS has its own internal data structures for its page table, and that becomes a bottleneck for your threads, for your system. There are also TLB shootdowns: if you have to remove something from the translation lookaside buffer in hardware, you've got to send a message to all the other cores so they can update. So at large scale, mmap is a problem. So as I said — yes? So he asks: even if you build a buffer pool, where you call malloc once and get memory, it's still going to use virtual memory; how do you avoid all these issues? Because the contention problems will all be in our system, right? We can basically build our own scheduler that does all this access, so it's only one or two threads doing it rather than multiple threads fighting each other. Oh, sorry — no, the OS won't page that out. You mean swapping? Yeah, you basically set up the system and turn all that off: you turn off swappiness, you tweak the kernel parameters. Yes? The question is — sorry — don't use mmap, but what about bare metal, like a unikernel? Yeah, so his question is: if the operating system is so terrible, has anybody actually tried to build databases that run on truly bare metal? When you say bare metal now, it usually means something like running on EC2 where you don't get a VM, but you're talking about something even more extreme: running your own unikernel that only has the database system, that doesn't run a sound driver or whatever else Linux does. Nobody does that, because the engineering effort would be quite significant — you'd end up building your own operating system. I haven't seen anybody recently try to build unikernels for databases. There are other inefficiencies in the operating system that you'd want to fix before you throw all of Linux away. To give an example: the file system is usually a problem. In the '80s, people used to build their own file systems on raw devices. Actually, you can still do that with DB2: when you buy an SSD or some storage device, you don't format it with ext4 or ext3, you literally take the raw bytes and build your own file system on top of that. People have done that. People also write their own thread scheduling in the database itself rather than relying on the OS; Microsoft does that, Yellowbrick does that.
There's a bunch of other things you can rewrite before you go to the extreme of removing the whole kernel; the engineering effort to do that is so significant, it's probably not worth it. Take my example of the file system: I've seen numbers from the '80s that basically say, if you remove the file system, you maybe get a 10% boost in performance. That's not worth it; you're not going to get a 100x speedup for that effort. Okay. So I've dwelt a lot on mmap to say it's a bad idea. The reason I'm banging on it is that we have all these people come and give talks at CMU, and to my surprise, a lot of them say they're using mmap. And I always ask why, since it seems like a bad idea. And usually the response comes back: oh, it was quick and easy in the beginning. But then later on, they find out it was a bad idea and they have a lot of problems. So the most famous systems that use mmap would be MonetDB, which is an analytical system mostly for read queries; LMDB, which is an embedded key-value store; RavenDB, which is a document store like Mongo; and LevelDB, which was implemented by Google as a key-value store. And Elasticsearch and QuestDB use it partially as well. Then down here are the past users: MongoDB, SingleStore, and InfluxDB. These systems all got rid of it; they no longer use mmap, they now manage memory themselves in the way we'll be talking about this semester. And again, it's because of all the problems that I've talked about: mmap got them up and running quick and dirty, an MVP of the database system, but as they pushed the system further and further, mmap fell apart and betrayed them. I don't want to pick on Mongo, but to use them as an example: they were the hot system in the late 2000s and early 2010s, and they had all the money in the world, right? And they were based on mmap. They had really great engineers and all this money; if mmap was really the right choice, they could have made it work. Instead, what did they do? They threw away the mmap engine, bought a company called WiredTiger whose engine didn't use mmap, and that's what they use today. SingleStore has a blog article that basically says: hey, we started off with mmap, and it turns out it was a bad idea, here's why. This is the recurring theme, over and over again. The LMDB guy is the exact opposite of me: I think mmap is a terrible idea, he thinks mmap is the greatest idea, and he sends me emails. Anyway. All right, so again, the main point — and we'll see this in other places throughout the semester — is that the database system almost always wants to control everything itself, because we can almost always do a better job than the operating system. Think of the operating system as a generic pickup truck: yeah, you could take it racing against Formula One cars, but you're going to lose. Whereas if we build a system that's customized for exactly what queries we're going to execute and what the data is going to look like, we can always do things in a better way. There are a bunch of examples we'll see throughout the semester, like prefetching, which we can do in a way the operating system really can't. mmap can do some prefetching, but it only sees a sequence of accesses.
It doesn't know the high-level meaning of the data structure and how pages are connected together, so we can prefetch much better than it can. Right, so again, the OS is not your friend. If I die this semester, put "never use mmap in your database" on my grave. And this is the paper that we actually published this year that goes through all the problems I'm talking about here today. If you follow that link, there's a 10-minute YouTube video that describes all these problems in further detail. Okay? All right, so that was the long answer to his question: we don't want to use virtual memory, it's going to be a problem. One caveat: Postgres doesn't use mmap, but it does use the OS page cache through the file system. That'll cause problems later on, but we can cover that as we go further. All right, so for database storage, there are two questions: how are we actually going to represent the database in files on disk, and how are we actually going to manage memory to move those pages back and forth from disk? Today's lecture and the next lecture will be about the first question, and then the buffer pool lecture will be next Thursday; again, that's what Project 1 is about, okay? All right, so today's agenda: we're going to first talk about file storage, what these files actually look like; then we'll go inside the files and ask what the pages look like; and then we'll go inside the pages and ask what the tuples look like, okay? And I'll also say that what I'm describing here is the canonical implementation of a database system, with these files and these pages and so forth. There's another architecture we'll talk about on Thursday this week, called log-structured storage, and that's going to look slightly different, but at the end of the day it's still files and pages and so forth, okay? Pretty much all the classic database systems look like what we're talking about here — although that's actually becoming less true because of RocksDB — okay? All right, so the database system is software that we're going to build that manages the database as one or more files on disk, and typically the file format is proprietary to the database system, meaning I can't take a SQLite file and have Postgres interpret it. It's not a CSV file or plain text that anyone can parse; it's a format specific to the database system. Now, there are open-source file formats that we'll talk about next week, like Parquet or ORC or Avro, which try to be generic file formats that any database can interpret, but in general, for the high-performance systems, it's always going to be a proprietary format. Furthermore, the operating system doesn't know anything about what these files are. It doesn't know what's inside of them, doesn't know how things are connected; it just sees files. You can open up a terminal, look at your file system, and you just see a bunch of files. This last point brings up the question he had about whether you can ever just go bare metal. Again, people tried this in the 1980s, but pretty much no one does it today. Everyone assumes you're building files on top of a generic file system — NTFS, ext4, whatever you have — or, if it's cloud storage like S3 or Azure Storage, whatever those APIs are.
So the storage manager is responsible for maintaining these database files. It keeps track of where the files are located, what the directory hierarchy is. Some systems have their own dispatcher, so that when a thread or process inside the database system needs a page, the dispatcher can schedule that read for it, or you can let the operating system do the read for you. And there are optimizations you can do based on the sequential-versus-random-read distinction we talked about: the dispatcher can be clever about scheduling requests together in batches to improve performance. Within our files, we break them up into a collection of pages, and essentially the storage manager is responsible for getting the pages requested by the upper levels of the system. It keeps track of which threads are reading or writing a given page at a given time. It also keeps track of available space, and it may need to run compaction to defragment the pages as we go along, but we'll talk about that later. For this discussion, we're going to assume that the pages only contain tuples, the actual records in the tables, but they really could contain anything: log records, index information, other data structures, right? For our purposes here, just assume it's tuples, and everything still works. So a page is a fixed-size block of data. And again, it can contain anything. Typically, most database systems will not mix page types, meaning within one page, I can't have both index data and table data. For simplicity, a page is usually for table XYZ and only contains data for XYZ; it's not going to mix in data from other tables. Some systems require pages to be self-contained, and what I mean by that is that all the metadata, all the information you need to interpret what's inside the page, has to be contained in the page itself. We'll see why you'd want this in a second. Every page is given a unique identifier, simply called a page ID or block number. And there'll be some piece of the system that we'll call the page directory that allows the storage manager, for a given page ID, to know what file and what offset that page can be found at. Okay? The storage manager at the lowest level we're talking about here is typically not going to do replication. It's not going to write the page multiple times; there are upper parts of the system that can do that — if I know I need to write to different locations because I have to do things transactionally, that's above us. And if there's file-system-level replication, like ZFS or RAID, all of that is below us and transparent to the storage manager we're talking about here. Okay? So this part gets confusing, but there are essentially three notions of a page in a database system. There's the hardware page — this is what the device actually exposes to us. Then there's the OS page, which is typically four kilobytes. And then there's the database page, which is what we care about. The hardware page is the smallest chunk of data that the storage device can guarantee it writes out atomically. So what do I mean by atomically? Yes — say that again? Yeah, so what does that mean, though? Like one transaction?
Yes — he says that if you tell it to write out this page, the hardware guarantees that either all of it is written or none of it is written. So if I want to write a four-kilobyte page, either it fails and I get back an error, or when I come back, I won't see partial data: all of it will have been written. If I need to write out two four-kilobyte pages — eight kilobytes — I can write the first one and the hardware will guarantee that, but it can't guarantee that it also writes the second one. I'll have to do extra stuff up above to make sure that actually happens. For the OS page, on Linux and Windows the default size is four kilobytes. You can do huge pages on Linux, where one page in the OS represents a much larger chunk of memory, but again, the hardware can't guarantee those are written atomically. The thing that we're going to care about is the database page, and its size varies across database systems. Some systems, by default — SQLite, DB2, and Oracle — use four-kilobyte pages. SQL Server and Postgres use eight kilobytes, and MySQL uses 16 kilobytes. I think MySQL's is the largest default I've seen so far. Can anyone guess why you'd want different page sizes for your database system? The hardware can do four kilobytes; why would I maybe want to do something larger, like MySQL does? He says you can make fewer system calls with larger page sizes — sure, yes. With a larger page size, I can make fewer system calls to do reads. Right: sequential I/O again. If I have 16-kilobyte pages in MySQL, I can say to the OS, at this offset in this file, give me 16 kilobytes, and that's sequential I/O. In the high-end systems — again, the enterprise systems like Oracle, DB2, and SQL Server, systems that cost millions of dollars and have a lot of ways to configure and tune them — you can actually change the page size on a per-table or per-index basis. You can say: this table is going to be sequentially scanned all the time, so give it 16-kilobyte pages, and this table is going to be random access, so give it four-kilobyte pages. You can't do that in SQLite, and you can't do it in any of the open-source systems that I know about. Again, we can have larger page sizes in the database system, but the hardware can't guarantee the writes are atomic. So part of this semester is about doing the extra work to make sure we don't lose data when we write out pages that are larger than what the hardware supports. There'll be a bunch of different ways we can do that — yes, question? Good question. The question is: if batching 16 kilobytes together is faster for sequential I/O, why doesn't everybody do this? Well, think back to what we talked about in the beginning: random access versus sequential access. If I need one tuple, and that one tuple is 512 bytes, I've got to read 16 kilobytes to fetch it in. Then I've got to keep 16 kilobytes in memory to hold that page, even though I may only want one record. So there are trade-offs.
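To put numbers on that mismatch between database and hardware page sizes, a tiny sketch (sizes assumed): a 16-kilobyte database page spans four hardware pages, and only each individual 4 KB write is atomic, which is exactly why the DBMS needs extra machinery to avoid torn pages.

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t HARDWARE_PAGE = 4096;  // atomic unit the device guarantees
constexpr uint64_t DB_PAGE = 16384;       // e.g., MySQL's default page size

int main() {
  // One 16 KB database page = four separate 4 KB hardware writes.
  // A crash between them can leave a torn (partially written) page.
  std::printf("hardware writes per db page: %llu\n",
              static_cast<unsigned long long>(DB_PAGE / HARDWARE_PAGE));
}
```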
This is going to be a recurring theme for the entire semester: there's no free lunch. There are pros and cons to all these different techniques, and everybody does something different. Yes? The question is whether this has to do with memory segmentation — in terms of what? Wait, what do you mean by waste? Like my example there: if I only need one record, I've got to bring in a 16-kilobyte page. Yeah, there are trade-offs to this. But with a larger page, in the database system, one page ID now represents 16 kilobytes, so my page table — basically the hash table where I look up page IDs to find frames in the buffer pool — can be much smaller and take less space. Again, pros and cons to all these techniques. All right, so there are different ways to manage the pages in files on disk. For our purposes, we're only really going to talk about the first one, heap files. ISAM is kind of archaic; it's how they did things in the '70s, though some systems still do it. The tree file we'll see with index-organized tables — MySQL works this way; we'll come back to that later. And log-structured is slightly different from all of these. For this class, we'll assume it's a heap file. So the basic idea of a heap file is that the tuples are unsorted, right? It's a heap: just a bunch of pages, and tuples — the data — can be placed anywhere inside them. The data won't be pre-sorted in any way. The tree approach gives you sorting for free; ISAM gives you that too. Well, not really free — you have to maintain it — but you don't have to sort up above in the execution engine when the data comes back to you. Again, at this point we don't care what's inside the pages; it's just about how we find page one, two, three, or four based on a request from above. So the heap file is a collection of unordered pages, stored in no particular order, and it supports high-level operations like get a page, create a page, write to a page, and delete a page. The other important thing we need is support for iterating over the pages: if I want to do a sequential scan on a table and I don't have an index or a data structure, I want to say, start at the first page and keep giving me pages until I say stop. So we need to be able to iterate as well. This is super easy to do with a heap file if there's only a single file, as in SQLite: you get a request like "give me page two," and you know the starting location of the file is address zero, so you just take the page number you want, multiply it by the page size — because all the pages are fixed size — and jump to that location to find what you need. Easy. Yes? The question is, how does the DBMS know which tuples are in which page — how do I know that I should be getting page two? We'll get to that in a second. Right, so it's easy if it's a single file. There's typically going to be some header stuff at the front of the file, but we can ignore that. The SQLite documentation actually has a really great explanation of what their single file looks like; I'll post a link on Piazza afterwards.
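That single-file lookup is literally one multiplication; here's a minimal sketch (the header size and page size are assumptions):

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t PAGE_SIZE = 4096;      // assumed fixed page size
constexpr uint64_t FILE_HEADER_SIZE = 0;  // pretend there's no file header

// In a single-file heap file, finding a page is pure arithmetic.
uint64_t PageOffset(uint64_t page_id) {
  return FILE_HEADER_SIZE + page_id * PAGE_SIZE;
}

int main() {
  // "Give me page 2" -> jump to byte 8192 and read PAGE_SIZE bytes.
  std::printf("page 2 lives at byte offset %llu\n",
              static_cast<unsigned long long>(PageOffset(2)));
}
```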
The challenge, of course, is: if I have multiple files and someone asks for page two, where do I go? Where do I jump to? Yes — sorry, what's the question? The question is: what's the advantage of a database system creating multiple files versus one giant file? Well, you can take file locks on smaller files and not have them block other threads. It can reduce the blast radius: if one file is deleted or corrupted, you can potentially still recover the other parts of the system. Those are the first two I can think of, and pretty much everybody does it this way. There are other things you can do at the file-system level too: if you have one directory for this database and another directory for that database, then on Linux you can symlink them to different hardware devices. Maybe you have a table that's super important, so you make sure its directory goes onto a really fast disk, and another one goes onto a slower disk. You want that kind of freedom. Good question. Okay. So, assuming the database is not in a single file, we need to keep track of what pages exist across multiple files and which ones have free space. The textbook talks about a few different ways to do this; as far as I know, the most common one is called a page directory. Basically, the database maintains some special pages — they kind of look like database pages too, but they're just for bookkeeping about our own database — where this directory keeps track of the data pages and which files, at which offsets, to go get them from. And we need to make sure these directory pages stay in sync with the database. We don't want to create a bunch of pages in the database heap file, fail to update the page directory, crash, come back, and find we've allocated a bunch of space we no longer know about because we never updated our bookkeeping. So there's actual work we need to do to keep the bookkeeping in sync. So now I have a bunch of pages, and I don't actually care what file they're in; in my page directory — think of it like a hash table lookup — I say I want page two, and it knows where to find that page on disk, or at least what file and what offset it should be located at. The page directory can be located in the database file itself, as in SQLite, or, in the case of Postgres and other systems, it'll be in separate files. We also keep track of things like the number of free slots per page — I'll explain what a slot is in a minute; for now, think of it as free space — and we keep track of any pages that are free or empty. That way, if someone says, "create me a new page," I can check this list and say, oh, I have one that's not being used, and give you that rather than allocating a new one. Okay? Again, just think of it as a giant heap of pages, plus a data structure that tells you where to find the data you want. Okay. So let's talk about what's actually inside the pages. Every page contains a header with some metadata about what's inside it. Obviously there'll be things like the page size, and a checksum — what we talked about for detecting whether the data got corrupted on its way to or from disk. Typically you'll also see the version of the database software that created it.
Every time you upgrade Postgres to a new major version, they may change the page layout, but they still need to maintain compatibility with older pages, or force you to upgrade to the new layout. So you want to know what version of the software created the data you're looking at in a page. We'll talk about transactions and visibility later in the semester, but basically you've also got to keep track of whether your thread, your transaction, is allowed to see this page — or is it something from the future or the past that it shouldn't see? And if the system supports compression, you also maintain information about how the data is compressed and what algorithm was used. As I said before, some systems require the page to be entirely self-contained; Oracle is famous for this. That means the header has to contain things like: here's the table this page came from, here's the schema, here's what the columns actually mean. That way, if the system gets trashed — the file system gets corrupted, and the page over there that held the metadata for your table's schema gets lost — I can still look at this page and figure out what data is in it. So when people talk about DBAs being expensive, or enterprise systems being expensive, it's for insane things like disaster recovery, where the machine caught on fire but they were able to pull out the disks before they melted, and maybe something got corrupted. A DBA will literally look at these page headers and figure out, for each page, what table it belongs to, and try to reconstruct the table as much as possible. Nowadays that's less of an issue because you can replicate off-site, and the hardware has gotten much better, but in the old days, disaster recovery was literally like looking for the black box after a plane crash: looking at the pages and figuring out what happened. All right, so now, for any page-oriented architecture, we need to decide how we're actually going to organize the data inside the page. There's always going to be a header that says what the page contains, and now we need to talk about what's actually inside the rest of it. As I've sort of alluded to already, the two basic approaches to storing data in a page — ignoring indexes, just dealing with tuples — are a tuple-oriented approach and a log-oriented approach. For this class, we'll talk about tuple-oriented, and next class we'll talk about log-oriented, okay? So let's start with a really simple page layout scheme. If we want to store tuples in a page, we have our header, and maybe we just keep track of the number of tuples in the page. Then, for a really simple approach, any time the database system wants to store a new tuple in our page, it looks at the counter — oh, it's zero — increments it by one, jumps to the next offset, and starts writing the data. Do you think this is a good idea or a bad idea? I'm setting it up, so obviously it's a bad idea. But why? Yes. He says: if I delete a tuple from the middle, and I'm only tracking the number of tuples in the page, that doesn't tell me where the free space actually is. So how am I going to know I can put another tuple there?
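Here's a minimal sketch of that strawman layout, with fixed-size tuples assumed, just to make the append-only counter idea concrete; the deletion problem above falls straight out of it.

```cpp
#include <cstdint>
#include <cstring>

constexpr size_t PAGE_SIZE = 4096;
constexpr size_t TUPLE_SIZE = 64;  // assume fixed-size tuples

// Strawman: header is just a tuple count; tuples are packed in order.
struct NaivePage {
  uint32_t num_tuples = 0;
  char data[PAGE_SIZE - sizeof(uint32_t)];

  // Appending is easy: write at the next offset, bump the counter...
  void Append(const char *tuple) {
    std::memcpy(data + num_tuples * TUPLE_SIZE, tuple, TUPLE_SIZE);
    num_tuples++;
  }
  // ...but deleting a tuple in the middle leaves a hole that a bare
  // counter cannot describe, so we can't find reusable free space.
};
```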
What's another obvious problem? Yes. He says: if you want to find a tuple, you have to do what — a linear scan? Ignore that for now; assume something up above magically knows the right offset to jump to in the page. I'm assuming here that all the tuples are fixed size, the same length. So if they're all 64-bit integers, then sure: I know I have five 64-bit integers, so I take five times 64 bits — ignoring nulls for now — and I can jump exactly where I want. Yes? Question: why can't you use a null character to mark the end of a tuple? So his proposal is that for each tuple, I put a null terminating character at the end — assume no tuple actually stores that character — as a demarcation to say this tuple has ended. Do we think that's a good idea or a bad idea? Bad idea. Why? Correct: he says you basically have to do a linear search. And it's worse than an ordinary linear search — well, it is linear search, but you've got to examine the data byte by byte until you see the null terminator, and only then do you know you're done. It's basically like CSV parsing, which is super slow; we don't want to do that. We can be smarter than this. So for all these reasons — deleting things, finding and filling free slots, keeping track of all that — and because variable-length data is also going to be problematic: you want to store email addresses, you want to store names and addresses, and they're all going to be variable length. You could use the CHAR data type instead of VARCHAR, where you always allocate, say, 32 characters no matter what size the string is, but that's wasteful. What if I want a one-megabyte field — which you can do, or even larger — but some of the values are only 10 kilobytes? Do I want to allocate one megabyte for every single tuple? No. So to handle all this, the most common scheme in the tuple-oriented page storage we're talking about here is called slotted pages. The exact implementation varies from one system to the next, but the general idea is always the same. The idea is that we have a slot array at the beginning of the page that maps slots to the actual locations of tuples in the page. The slot array is always at the beginning, and all the tuple data — fixed-length and variable-length — starts from the bottom, the end of the page. The slot array tells you the offset where you'll find the tuple you're looking for. So now, if I want to retrieve tuple one, all I need to know is the page ID and the slot number, and once I get to the page, I look at the slot array, and it tells me what byte offset to jump to to find the data I'm looking for. As I insert more and more data into this page, the slot array grows from the beginning toward the end, and the tuple data grows from the end toward the beginning, and at some point I run out of space: I can't fit another tuple, I can't extend my slot array, and the page is considered full. Of course, this can mean I end up with a little unusable empty space in the middle, but that's the cost I'm willing to pay to have this indirection — to be able to support variable-length tuples.
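A hedged sketch of what a slotted page might look like in C++ (the field names and widths are mine, not any particular system's): the slot array grows forward from the header, tuple data grows backward from the end, and a lookup is one level of indirection.

```cpp
#include <cstdint>

// Sketch only: real systems differ in header fields and slot encoding.
struct Slot {
  uint16_t offset;  // byte offset of the tuple within the page; 0 = empty
  uint16_t length;  // length of the tuple in bytes
};

struct SlottedPage {
  // --- header ---
  uint16_t num_slots;       // entries in the slot array (grows forward)
  uint16_t free_space_ptr;  // tuple data grows backward from here

  // --- slot array immediately follows the header ---
  Slot slots[1];  // variable-length in practice

  // Fetch a tuple by slot number: the indirection means tuples can be
  // moved within the page by updating only their slot entry.
  const char *GetTuple(uint16_t slot_num) const {
    const Slot &s = slots[slot_num];
    if (s.offset == 0) {
      return nullptr;  // deleted or empty slot
    }
    return reinterpret_cast<const char *>(this) + s.offset;
  }
};
```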
So now — yes, question. The question is: how do you know the size of the tuple? You would store it in the slot array — though you could compute it as well, yes. And yes, each slot in the slot array is fixed size, like 16 bits, but the tuples themselves can be variable length. All right, next question: if you're deleting tuples — say you delete tuple three and now there's empty space there — how do I manage that fragmentation? It depends on the implementation. You could just leave everything where it was, or you could slide tuple four over and then just update your slot array to say tuple four is now found at this new location. It varies per database system — we'll do a demo — as to when they do it; eventually everyone does compaction, it just depends on when. Yes — her question is: when you slide tuple four over, does that mean you have to update the other slots as well? Yeah, within the page, yes. This goes back to what I was saying before: we want to do as much work as we can while stuff is in memory. I can move anything around I want inside this page, because I have it in memory and I have it latched — we'll talk about latches next week, but I basically have a lock on the page. I know nobody else is updating this page while I'm doing this, so I can do as much manipulation as I need. And the key important thing is that because this slot array is an indirection from the rest of the system, I can move where a tuple is located anywhere inside the page, and I don't need to tell any other part of the system. It's not like when I slide tuple four over, I have to go tell whatever indexes I have on this table, "oh, by the way, I moved tuple four" — they don't know about that. All they know is that tuple four is at this page and this slot, and within the page I can move the bytes anywhere I want. So who cares if I have to move some data over? This is way cheaper than having to go fetch something from disk later on. All right, next question: if tuple three was pointed to by this slot, how do I mark it as deleted? There's a bit up above in the slot to say the slot is empty, yes. Question: is the slot array a fixed size? It's implementation-specific; it doesn't have to be. Because again, if I have variable-length data — I could declare my table as VARCHAR(1024), so one-kilobyte fields, but only put one character in each of them — I don't want to pre-allocate, because I don't know going in what the data is going to be. So the slot array will typically grow. Yes — will we reclaim space in the slot array if it's unused? Actually, that one I don't know. But the slot array is typically not going to be the bulk of the space in the page. All right, cool. So let's get into this thing I've been alluding to: how do we actually identify tuples? It goes back to the point I was making before, that I can change the location of a tuple within a page, but it keeps the same slot number and I don't have to update indexes. So a database system is going to maintain what is called a record ID.
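Here is a rough sketch of what such a record ID might look like — a hypothetical page-ID-plus-slot pair; real systems pack these differently and, as we'll see, sometimes add file or tablespace numbers:

```cpp
#include <cstdint>

// Hypothetical record ID: the physical address of one logical tuple.
struct RecordId {
  uint32_t page_id;   // which page the tuple lives on
  uint16_t slot_num;  // which entry in that page's slot array

  bool operator==(const RecordId &other) const {
    return page_id == other.page_id && slot_num == other.slot_num;
  }
};

// Indexes map keys to RecordIds. Because a RecordId names a slot, not a
// raw byte offset, the page can slide the tuple's bytes around internally
// without the indexes ever needing to change.
```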
Different database systems call them different things. In Postgres it's the ctid, in MySQL it's the row ID, in SQLite the rowid — but the high-level idea is always the same: it's a unique identifier that says here's how to find the physical location of a single logical tuple. We don't need to get into this just yet, but you could have a single logical tuple with multiple physical versions; how you find the version you want, we'll cover later in the semester. For now, assume a one-to-one mapping from one logical tuple to one physical tuple. The record ID tells you where to go find this thing. Page ID plus offset is the simplest way to think about it, where the offset is just the slot number in the slot array on the page the tuple is located on. It can contain additional information: we'll see that in SQL Server it includes what file the tuple is in, and in Oracle it keeps track of things like what tablespace it's in, and so forth. So in Postgres the ctid is six bytes, in SQLite it's eight bytes, and the ROWID in Oracle is ten bytes. SQL Server, I think, is eight bytes as well — I might be wrong about that. I'll show you some demos now where we can actually see what these IDs are. But we don't want to rely on them in our applications, because the database system is free to change them at any time. And again, this is the separation between logical and physical: I still have my logical tuples and I know what they are, but where and how they're physically stored, I shouldn't care and I don't want to care, okay? All right, so let me switch over — the question is, can you actually access these? Yes, let's do it. So let's do Postgres first. We're going to create a simple table with three tuples — that's embarrassing, okay — right, three tuples with these values. In Postgres, you can query the ctid; think of it as a virtual column where I get back the pair of page number and slot number. So you can see here I inserted three tuples, and the ctids are (0,1), (0,2), (0,3): page zero, slot one; page zero, slot two; and so on. Now if I delete one of these guys and run the query again, you can see I deleted the second tuple, but it kept the other ones where they were: (0,1), (0,3). Someone wants to answer the next one — yes, he's very excited, sure. Okay, so we insert a new tuple. We have (0,1) and (0,3), so there's a free slot at (0,2). Do we think it's going to put it in (0,2), or append it at the end and make it (0,4)? He says (0,4). Raise your hand if you think (0,2) — yeah, a small amount. Raise your hand if you say (0,4) — more people, okay. It's (0,4). Is that correct or wrong? Who cares, right? It's fine — it did what I wanted it to do. Now, Postgres has this thing called the vacuum. Again, we'll cover this later, but think of it as the garbage collector, or the defragmenter, for the database system. We run it on this table, and now when I run the query again, it moved the tuple to fill in the free space: (0,1), (0,2), (0,3), where before it was (0,1), (0,3), (0,4). Yes — the question is, does that mean in Postgres we can't have a column called ctid? Let's find out. Nope, it doesn't let you do it. So in Postgres, when you SELECT *, you get back the user-defined columns, but there are other hidden columns in the tuple header that you don't get by default — if you know what they are, you can ask for them. Another pair, I think, is xmin and xmax, and there are a bunch of other ones related to transaction timestamps and visibility that you can also get, but SELECT * typically won't return them. And in Postgres you can actually address tuples by their ctid in a query — you shouldn't, but you can. Why is this a bad idea? Because I ran the vacuum, and the locations of the tuples changed underneath me. All right, let's go to SQLite — I need to create the table first. SQLite has this thing called the rowid. So the row IDs come back 1, 2, 3; then if I delete a tuple and insert a new one, I get back 1, 3, 4. Again — is that the right choice? It doesn't matter, we don't care. All right, the last one we'll do is SQL Server. Oracle also has an interesting one, but I couldn't get it to work correctly in Docker, so we'll pass on that. Okay, create the table. SQL Server doesn't have a row ID column like this; they have a function you can call to get it back. And you see its version of the row ID isn't just page and offset — it's file number, then page, then offset. Oracle's would have even more things in it. So now if I delete and run this again, it didn't do compaction. If I insert and run again, now it changed — actually, not only did it insert the new tuple, but the tuple that was at slot two moved to slot one. What SQL Server is doing here is saying: I know this page is in memory, I have a latch on it, so let me tighten it up and remove dead space before I insert the new tuple. Yes — the question is, when systems do compaction, do they do a whole page at a time, or incrementally within a page? Would you ever do compaction on just a subset of a page? No, because again, a page is four kilobytes or eight kilobytes — it's super small. Where you see incremental behavior is in Postgres's autovacuum: it keeps track of which pages got updated since the last time the vacuum ran, and it only tries to compact those. And I think only VACUUM FULL, which is more expensive to run, will go further and say: this page is half full, that page is half full, let's merge them into a new page. Systems typically try to clean up pages individually, because it's just faster. Yeah — his question is, for the systems that do vacuuming, which is pretty much only Postgres, do they require you to run the vacuum manually, or do they run it for you automatically? For Postgres, the answer is they do try to do it automatically, but — how do I say this — they don't want to interfere with queries, right?
Because the vacuum is going to take latches, it'll take locks — it's expensive to run. So they have backoff mechanisms: if the vacuum wants to work on a page but some query holds it, it'll back off, maybe sleep for a while, and come back again. But at some point you definitely have to vacuum. We'll talk about this later in the semester — I keep previewing things that are coming up; there's a lot to cover — but you can get into a bad situation called transaction ID wraparound in Postgres: if I run out of 32-bit numbers for transactions, I've got to run the vacuum to reset them. So if you don't run the vacuum and you hit that, the system can't accept new transactions and it grinds to a halt. In that case, the vacuum couldn't keep up because it kept backing off; you can tune the knobs to make it more aggressive about vacuuming. It'll slow down your queries while the vacuum runs, but at least you can clean things up and avoid that transaction ID wraparound cliff. The question is: since Postgres is the only one doing this kind of vacuuming, do the other ones just find some time to do their maintenance? Yeah — the VACUUM command in Postgres is not in the SQL standard; that's a Postgres thing. In Oracle, I think by default they run maintenance at something like 11 PM at night and do the compaction then. I don't know what SQL Server does, but typically it's a background maintenance job that runs in cron every so often and cleans things up. There's also a command called OPTIMIZE in some systems — MySQL has OPTIMIZE TABLE — which is basically like the vacuum, and you can do that for indexes too. You can run it manually; it's not in the standard, and it varies per system. We'll get to this later, but the way Postgres does multi-versioning is that every time you update a tuple, you copy the whole tuple and put the new version in a new location — so the vacuum has to scan through, find all the dead tuples, and clean them up. That's the wrong way to do multi-versioning. MySQL and Oracle and others do it better: when you update a tuple, you copy the old version — like a diff — into a single separate location and update the one tuple in place. So when I want to clean things up, it's one file, one piece of memory: I just blow it away and remove all the old versions. Postgres has to scan to find them. All right, cool, any other questions? Awesome. So again, the main takeaway is that everybody does something a little different, but the high-level idea is the same. So, just to finish up quickly: what's inside a tuple? A tuple is essentially just a sequence of bytes, and it's up to the other parts of the database system to interpret those bytes and make sense of them — to know what the schema is, what the attributes are, and what the types are. In every tuple there's going to be a header with metadata about the tuple itself. There'll be visibility info — like, is this tuple deleted or not — which we'll cover later. And nulls are typically represented with a bitmap: for each attribute, if it's null, there's a bit set. There are other ways to represent nulls, which we'll see next class, but this is the most common one. And within the tuple itself, you typically don't store the metadata about what's inside of it.
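As a rough illustration of the null bitmap idea — this is a made-up minimal layout, not any real system's tuple format:

```cpp
#include <cstdint>

// Hypothetical tuple header for a table with up to 16 attributes.
// Real headers also carry visibility info (e.g., transaction IDs).
struct TupleHeader {
  uint16_t null_bitmap;  // bit i set => attribute i is NULL
};

// Test whether attribute i is NULL without touching any value bytes.
inline bool IsNull(const TupleHeader &header, int i) {
  return (header.null_bitmap >> i) & 1;
}

// Note: the schema itself (attribute names, types, order) is not stored
// in the tuple; the system interprets the value bytes after the header
// using the catalog (or, in some systems, the page header).
```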
That schema metadata is either kept at the page level, like in Oracle, or separately in the catalogs. And again, all the data for a tuple at the logical level doesn't necessarily need to be stored in what I'll call its primary storage location in the page. For variable-length data, the tuple itself can instead hold pointers to some other page that has the data you want — for really large fields, you can have a pointer to somewhere else. We'll cover that next class, all right? One clever thing you can do: some systems allow you to store data from multiple tables in a single tuple. This sort of violates what I said earlier, that every page typically only stores data for one table — there are systems that let you embed data that's related to a tuple within the tuple itself. This is called denormalization. The basic idea is: say I have two tables, where one has a foreign key reference to the other. I could just store them in separate pages, but then any time I want all the data for a record of foo, plus all the rows related to it in the bar table, I've got to do a join and fetch two separate pages. Some systems instead let you embed the tuples of the related table inside the base tuple itself. So in this example, for foo I have A and B, the attributes that belong to foo, but then I have this variable-length piece holding all the bar tuples that are related to it. Logically, to the application, these still look like two separate tables; but underneath the covers I can embed them to make things faster, because now fetching one record of foo along with the bar records that relate to it is a single page fetch. Yes — you're saying this makes writes more expensive? Right: instead of two separate pages, I take all the bar tuples and embed them inside the foo tuple. It does potentially make updates more expensive, because now if I want to update even one attribute in foo, the tuple is bigger since I've embedded all this other stuff; and if I keep adding related records in bar, I keep extending that array and may have to resize my pages. It doesn't come for free, but it makes the reads go faster, yes. Next question: how do you figure out whether this is a good idea for your application? In theory, a relational database system could figure this out for you automatically; as far as I know, nobody does. The systems that do support this — Spanner is probably the most famous one, and Akiban was a database startup that could do this too (it's sometimes called a pre-join) — make you specify it when you create the tables: you say, I want this table embedded inside this other one. As far as I know, none of those systems figure it out for you automatically. This technique also looks a lot like the document model we talked about in the first class, from all the NoSQL systems — that's essentially what they're doing, but manually.
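To picture the physical idea, here's a tiny conceptual sketch of the foo/bar embedding described above. Bar's attributes (c, d) are invented for illustration, and std::vector stands in for a variable-length inline region:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical attributes of the bar table.
struct Bar {
  int32_t c;
  int32_t d;
};

// Conceptual denormalized foo tuple: its own attributes plus all the
// bar tuples whose foreign key references it, stored inline.
// (A real page would store these bytes contiguously within the tuple,
// not behind a heap pointer the way std::vector does.)
struct FooDenormalized {
  int32_t a;
  int32_t b;
  std::vector<Bar> bars;
};

// Reading "foo joined with its bars" is now one page fetch; an update
// that grows `bars` may force the tuple (and maybe the page) to grow.
```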
In the document model, you're saying: I know these two collections of documents are related — for a customer, here are all their orders — and I embed them manually inside the collection itself. The difference in a relational database is that I still have two logically separate tables and I can still query them separately; but underneath the covers, since I know I'm always going to join foo with bar, it makes sense to embed them ahead of time, right? Yes — does this require a lot more storage? Why would it? Going back to my example here: when I created the tables, this is a one-to-n relationship. For one tuple in foo, I have n tuples in bar, and those n tuples can only belong to that one tuple in foo. So I'm just copying them directly in — I'm not going to have duplicates in this example. In the document model, if you just do this blindly, then yes, you could end up with a lot of duplication. And remember, we're at the physical level here: the application doesn't know and doesn't care that things got embedded — today I have to tell the system I want it embedded, though in principle the system could figure that out for you. All right, just to finish up on this: again, this is called denormalization, and the NoSQL folks would call it a pre-join. This is not a new idea — IBM did this in the 1970s and abandoned it because it was a pain to maintain — and several systems can do it today if you define it in your CREATE TABLE statement. Okay. So in this class today, we talked about how to organize a database in pages, how to keep track of those pages with a page directory, the different ways to actually lay out data inside pages, and the different ways to represent tuples. Next class, we'll go through log-structured storage, how we actually represent values and attributes within tuples, and the catalogs that keep track of everything.