All right, guys, we're just getting started. Welcome. Again, sorry for being away last week, but I'm in town now. So real quick, a round of applause for our DJ, DJ Drop Tables. I don't think people know how bad you are, right? So why don't you show the people a little bit of what you can do? Let her rip. [music plays] So this is real, yes, we do have a course DJ. I've always wanted one, and now I have one. How was your summer? It was good. It was good? What'd you do out there? You know, just slicing around, you know. Just doing your thing? Something like that. OK, awesome, all right. What else do you do? Not much, you know. I got hired to DJ this class. Yeah, there's that. All right. How was your summer? I mean, we're still building a database system here at Carnegie Mellon, so we're still grinding on that. That's just taking all my time, right? You know how it is. OK, I already got someone pregnant. Yes, that is true. I did get somebody pregnant. OK, so let's go through this real quickly. All right, so I only care about two things in my life. I say this every semester. Number one is my wife. Number two is databases. I don't give a f*** about anything else, right? My family voted for Trump. I don't talk to them. I have a dog, that's fine. But I don't have any other hobbies. It's just databases, right? And again, people think I say crazy stuff. So if I start talking fast, I want you to please interrupt me and tell me to slow down, right? Just tell me to repeat myself. Say I'm going too fast, slow down. You can also interrupt me when you don't understand what I'm talking about. And I actually encourage you, and I want you to do this, because I know this material, but I don't know whether you guys do. So please stop me if you don't understand what I'm talking about or you have a database-related question.
What you can't stop me about are stupid things, like questions about blockchain, like in previous years. I don't care about blockchains, OK? This course is not about blockchains. That's not what we're going to talk about here. Also, if you have a weird rash on your body, don't share it with the class in the middle of the lecture. This happened the first year. It's weird, don't do that, OK? So what does this mean? It means that I want you to interrupt me as much as possible, right? Don't be embarrassed; there are no stupid questions. But if you have a question about the lecture, don't come up to me at the end of class and say, oh, what about slide three? What did you mean on this? I will not answer those questions, because if you have a question while I'm talking, then somebody else probably has the same question. And I'd rather just discuss it in front of everyone than have a line of people saying, what about slide four, what about slide six, over and over again, OK? So again, I will not be offended, I don't care. Just interrupt me if you have questions, OK? So with that, any questions? OK, I also posted on Piazza that Homework One is out. It'll be due on the 11th, next week. And then we'll also be releasing Project One on GitHub next week as well. Just keep in mind what's coming up on your schedule. Who here has finished the first homework? I think somebody already did and got a perfect score. And they're not here. Awesome, even better, OK? So again, we're giving you a SQLite file and making you write some SQL queries. It's to force you to do some of the things we talked about in the second lecture. And this will actually be the only time you write a SQL query for the rest of the semester. All the projects will be in C++, and all the homeworks will be pencil and paper, OK? So I want you, if you take a database course, to at least write some SQL once in your life, OK?
All right, so let's jump right into this. In the last two classes, we spent time talking about what a database looks like at a high level, at the logical level. What does the application programmer see? They see relational tables. They see SQL. So now what we want to talk about is how we're actually going to build the software that manages this database, right? This is essentially what the course is about: how are we actually going to store a database, allow queries to execute, and derive new data from it? So in the overall outline for the course, we've already covered the logical part at the top, the relational model. Now we're going to go through the different parts of a database system one by one, as if they're different layers. We'll start talking about storage, then query execution, concurrency control, recovery, and then we'll get to distributed databases and other topics at the end. And again, the way to think about this, and this is a gross approximation of what a database system looks like, is as a bunch of layers built on top of each other, right? So in this lecture and the next lecture, we're going to focus on the disk manager. How do we actually store data in files on disk? Then, once we know what API we're going to expose to the upper levels of the system, we start adding those extra levels, until at the end we have a full-featured database management system. So that's the way to think about what we're going to talk about at this point. No more SQL stuff, no more relational model stuff. Aspects of it are going to be important for how we make different design decisions in our system, because we have to figure out how to actually run SQL queries. But we're not going to worry about how to write complex SQL queries, because we've already done that.
So again, we're focusing on the different levels of the system, the different layers, one by one, going up the stack. As I said in the first lecture, this course is about building disk-oriented database management systems. Just to reiterate what I mean by that: a disk-oriented database system is one where the software assumes that the primary storage location of the database is on disk. That means any time we execute a query, it may want to access data that's not in memory, and we've got to go out to the disk and get it. And there are a bunch of components in our software that are going to be designed around this assumption, to protect ourselves from losing data or having invalid or incorrect data. So this assumption is going to permeate the entire system, and we need to be aware that at any given time, something we're trying to read may not be in memory. To understand this a bit further, we want to make the distinction between volatile and non-volatile storage. Essentially, what we're trying to do is have the database system manage the movement of data from non-volatile storage into volatile storage. So what do I mean by that? Think of the storage hierarchy of computers as a spectrum. At the very top, you have things that are very fast, very expensive, and very small: CPU registers and CPU caches, L1, L2, L3. These are very fast, but very small capacity, because they're literally sitting on the CPU itself. Then we have DRAM, and below that SSDs, spinning-disk hard drives, and network storage. At the bottom, devices are very large, but much slower, and cheaper. So again, the dichotomy that we care about is the dividing line here.
So anything above this line is volatile. What does that mean? Yes. Yes, she says it's not persisted when you lose power. Absolutely. All these storage devices require constant energy, electricity, to maintain whatever they're storing. When you pull the power from your computer, everything in DRAM gets wiped, everything in your CPU caches gets wiped. Everything below this line is non-volatile, meaning it doesn't require constant power to persist whatever was stored in it. So at a high level, that's the major thing we care about: we have to move data from down here up into here. There are other aspects that are going to affect how we design our software, and they have to do with how we can actually access this data. Volatile storage supports fast random access, meaning we can jump to any address in the storage device very quickly, and we get roughly the same performance no matter what order we access things. If I jump to this location, then this location, and maybe back to another location, I get approximately the same latency, the same speed. Non-volatile storage, instead of being byte-addressable, is block-addressable. With byte-addressable access, if I want to read 64 bits at some storage location, I can go read just those 64 bits and get exactly what I want. I'm oversimplifying, but from the perspective of us as database system developers, that's essentially what we see. With non-volatile storage, we can't go get exactly the 64 bits we want. We have to fetch the block or the page that contains that data, along with everything else on that page. So even if I only want to read 64 bits from non-volatile storage, I have to go get the four-kilobyte page it's stored in and then pick out just the piece that I want.
Another aspect is that these devices usually have faster sequential access, meaning if I read a bunch of contiguous blocks on the storage device, I can do that much more efficiently than reading random locations. The easiest way to visualize this is to think about a spinning hard drive. Most laptops don't come with a spinning hard drive anymore, but basically the way it works is that you have an arm that physically moves over the platter, like a turntable, like a vinyl record. So every time you've got to jump to a location, you have to pick the arm up and move it to another location. That's a physical movement, and it's very expensive and slow. SSDs don't have this problem because they're solid state, but they have other issues. So on these storage devices, we want to maximize the amount of data we read sequentially. In volatile storage, we don't care as much. For the purposes of this course, we're just going to say that anything in DRAM is "memory". That's what we mostly care about: how we actually put things into memory. And anything below this line here we're just going to call "disk". For most of the algorithms and methods we're going to talk about in this course, we don't care which one of these it actually is. That's not entirely true when we talk about joins, because sequential access matters a lot there, but we'll come to that later. I don't think the textbook talks about network storage. Usually when you see these kinds of hierarchies, they have tape drives at the bottom, but nobody runs databases on those anymore; those are only for disaster recovery. Network storage would be something like EBS or S3 on Amazon. The reason we're not going to talk about the things at the top is that we focus on those in the advanced course in the spring.
For our purposes in this class, this semester, disk is so slow anyway that who cares how fast we can put things in CPU registers. In the advanced class, we assume the database is always in DRAM, and therefore those things actually matter, but for this entire semester, we won't really worry about things sitting in CPU caches, because going to disk is so slow that it doesn't matter. Now, I talk about this every year: there is actually a new class of storage devices that straddles the line, called non-volatile memory. Who here has heard of this before? Who here has heard of Intel Optane memory? One, two, three, all right. Yeah, so Intel is actually the first manufacturer to release this. Everybody's been working on this for 15, 20 years, and Intel has put out the first devices. It's like DRAM in that it sits in the DIMM slot and it's byte-addressable, but it's like an SSD in that if you pull the power on the machine, it persists all your data. All right, so this is super cool. This is the future of what computers are going to look like. Eventually we'll have to rewrite this class to take it into account, but we're not there yet; it's not widely available. Now, again, this is something I've been working on for a while. This is a book that I wrote with my first PhD student; it's basically his dissertation put into book form. But I think this is the future. We're just not there yet. It's not widely available; you can't even get it on Amazon yet. People talk about a lot of things that could go away if you have non-volatile memory. All right, so let's talk about how slow these things are. There are different tables on the internet with different numbers, but they're approximately the same, roughly in the same ballpark. The thing that matters most is the orders of magnitude of difference between these storage devices.
So let's say I need to read 64 bits from different storage devices. If it's in L1 cache, it takes about half a nanosecond. If it's in L2, it's seven nanoseconds, and so forth until you get to really long delays. So again, this is why we're going to spend all this time in this class worrying about how to minimize the impact of reading data from disk, because there's a pretty big difference between 100 nanoseconds and 150,000 nanoseconds. For every single query, if we always have to go out to the disk, then we're screwed. The whole thing is essentially going to grind to a halt, and we're not going to get any work done. Now, I realize putting this in terms of nanoseconds is hard for us as humans to wrap our heads around. So if you just replace nanoseconds with seconds, you start to realize how long these numbers actually are. Another metaphor I like to use, from Jim Gray, a famous database researcher, is: say I want to read a page in a book. If it's in L1, it's like reading the book right in front of me on this table. If it's in L2, maybe it's going across the room to read it. If it's in DRAM, I've got to walk to the library. And then you start to get to larger and larger orders of magnitude, until if you have to read it from a tape drive, it's like flying to Pluto to read a page in a book. It just takes forever. So again, this is why people don't want to store data on those devices, but in the old days they had to. So the goal of our database system is to provide the illusion to the application that we have enough memory to store their entire database in memory. Essentially, there's a finite amount of memory on our machine, and we want to store a database that exceeds the amount of memory available to us, without grinding to a halt every single time we read or write something.
So again, that's the focus of this course, and the next three lectures are really about how we can be careful so that any time we've got to read something from disk or run a query, we minimize that impact. We're going to use a bunch of different tricks to mitigate this problem: allowing different threads or different queries to run at the same time, caching things, pre-computing some data. There's a whole slew of tricks in our database system that are essentially built around avoiding this latency problem. So let's look at a high-level diagram of what a disk-oriented database system looks like, from the perspective that we care about at this point in the semester. Then we'll see in the rest of the lecture how we're actually going to fill in and design these pieces. At the lowest layer, we have the disk, and we have our database file or files; it doesn't matter which. We're going to represent these as different blocks or pages. "Page" is the canonical term used to describe this; I sometimes slip and say "block", but at a high level they mean the same thing. Now, in memory, we're going to have what's called a buffer pool. We'll focus on this in the lecture next week, and it's what you'll be implementing in the first project. Then there's some higher-level layer in the system, an execution engine, a query engine; we don't care what it is. It's going to make requests to our buffer pool and say, hey, I want to read page two. Page two is not in memory, so we've got to look in the page directory on disk, which says: here's the list of the pages that I have, and here's where to find them. So now I can find where page two is, bring it into memory, and hand my execution engine a pointer to page two in memory. Then it can interpret it and do whatever it wants; we don't care. So this is what we're focusing on for the next three lectures.
This is how we actually build this part here. So today and next week, we'll discuss what the data files look like on disk. Next week will be the buffer pool, and later we'll talk about how we actually represent the directory. So what does this look like? If you've taken an operating systems course, what does this sound like? I'm trying to make it appear that I have more memory than I actually do. Virtual memory, exactly. So now you may be thinking: all right, I've taken the OS course here. Why do I want my database to manage memory like this? It seems like a big waste of time. The OS can already do this. Well, it's not a good idea, and here's why. In operating system parlance, we would call these memory-mapped files; there's a syscall called mmap in POSIX. Essentially, what this does is take a file on disk and tell the operating system: map the file's pages into the address space of my process. Now I can read and write those memory locations, and if a page is not in memory, the OS brings it in. I can write to it, and eventually I can tell the OS to write it back out to disk by doing an msync. So we're giving up control of the movement of data back and forth between disk and memory, and letting the operating system manage it for us. At a high level, it looks like this. We have a bunch of pages in a file on disk. Then, in memory, the OS has its virtual memory page table, and we have physical memory. The application says, hey, I want to read page one. It looks in the virtual memory page table, gets a page fault, and says: this thing's not backed by physical memory, it's still out on disk. We go fetch it, back it with a physical memory page, and update our page table to point to that memory location. If I then come along and want to read page three, I go through the same process and fetch it into memory.
And then the application can do whatever it wants. But now let's say I read page two. What's the problem? There's no free physical memory page to put it in. So I need to make a decision about which of these pages to evict. And while this is happening, I eventually have to stall the database system, stall the thread that requested this page, because the operating system's disk scheduler is going to go out to disk, fetch the page, and bring it into memory. There are ways to figure out, from the application's perspective, whether I'm about to read something that's not in memory. Maybe I can hand the request off to another thread, so it stalls and not me, because I always want to keep doing useful work; I want to mitigate the stalls when I have to go out to disk. But essentially, the operating system doesn't know what the hell we're doing. It doesn't know anything about what the database system is doing. It just sees a bunch of reads and writes to pages. It doesn't understand any of the high-level semantics about what a query is or what data it wants to read. So by going with virtual memory, with memory-mapped files, we're giving up knowledge that we have inside our database system and handing it over to the OS, which is blind and doesn't know anything. If we're only reading data, there are a bunch of syscalls we can use to mitigate some of these problems. But if we start writing things, it becomes problematic, because now the OS doesn't know that certain pages have to be flushed out to disk before other pages. Again, we'll cover this later when we talk about logging and concurrency control. The OS just sees: yeah, I need to write some data out, so it goes ahead and writes it out. It doesn't know whether that was an OK thing to do or not.
You can get around this by giving the OS hints, like using madvise to tell it how you're going to access certain pages, whether sequentially or randomly, or using mlock to try to prevent pages from getting paged out, although mlock doesn't prevent a dirty page from getting written out, which could still be a problem. And msync is how you tell it to flush. So memory-mapped files and virtual memory sound like a seductive thing to use in our database system, and every year some student asks: why are we doing all this buffer pool stuff? Why can't we just let the OS do this for us? Trust me, you don't want to do this, because you'll have performance bottlenecks and you'll have correctness problems. There are not very many systems out there that actually use mmap. The most famous two are probably MonetDB and LMDB. LevelDB, have you ever heard of that, from Google? That's another one. Elasticsearch is a search engine, or document store. And RavenDB is a JSON database out of Israel. So all these systems use mmap, but there's a bunch of extra stuff you have to do to prevent the OS from doing things that are incorrect, or certain limitations or assumptions you have to make about what the OS is allowed to do. There are a few more, but not very many. So what's missing here? We're missing all the major database systems: Postgres, MySQL, Oracle, DB2, SQL Server. None of those use mmap, because it's a bad idea: you're giving up control, and the database system can always do better than what the operating system can figure out. There are some systems that still use mmap in very limited cases. This slide is actually out of date; I talked to the guy last week, and MemSQL got rid of mmap entirely. SQLite has a special engine where you have to tell it you want to use mmap; for some embedded devices that's actually what you want, but by default you don't get it.
InfluxDB only uses it for read-only caches. But the example I always like to give is MongoDB. Everyone here has heard of MongoDB before, right? It's a famous JSON database system. When they first started, their default storage engine, their storage manager, used mmap. And there was a bunch of crap they had to do to make that thing actually work. It was a super pain in the a**, a big bottleneck for them. Then they raised a bunch of money, and one of the first things they did was get rid of mmap and buy this thing called WiredTiger, which was a non-mmap storage engine. So if mmap were a good idea, these guys, who had all the money in the world and some top engineers, could have figured it out, but it became untenable. So if I die during this class and you have a memorial or something, just say Andy hated mmap. You can publicly say these things, okay? We're actually working on a paper this year actually proving that it's a bad idea. All right, so the main takeaway I want you to get from this is that the database system can always do better. It knows exactly what the queries are trying to do, it knows what the workload looks like, and therefore it can make the best decisions. The operating system doesn't know anything; it just sees a bunch of read and write calls. Some of the things we'll talk about later in the semester that we can do if we're not using mmap are prefetching, better replacement policies, and better scheduling. Again, the OS is a general-purpose pickup truck, whereas we can tune our system like a Porsche or a Ferrari to do exactly what our application needs. So another main takeaway is that the operating system is not your friend. We don't want to rely on it; we want to avoid it as much as possible, because it's going to make decisions that could be hurtful to our database system.
It's like a frenemy: you need it to survive, but ideally you don't want to talk to it, all right? All right, so database storage is what we're going to focus on today. There are two main problems we have to take care of. The first is how we're going to represent the data in files on disk, and the second is how we're going to manage the movement of data back and forth between the disk files and the buffer pool. For this lecture today and the next class, we're going to focus on the first problem, and then starting when we talk about buffer pools on Wednesday next week, we'll focus on the second problem, okay? All right, so in today's lecture we're going to deal with that first question: how are we actually going to represent the database in files on disk? We'll first talk about how we organize the database across files, then how we store the pages inside those files, and then what the tuples look like inside those pages, all right? So we're going to start at a macro level and step down into the data that we're actually storing. All right, so at the end of the day, the database is just a bunch of files on disk. Some systems store the database as one file; SQLite does that. For the first homework, you download that .db file, and that's the entire database encapsulated in a single file. Most other systems, however, store things across multiple files. If you ever look at the data directory for MySQL or Postgres, you'll see a bunch of different directories and files, right? You do this because databases can be very large, like petabytes, and you don't want to hit the file system's limitation on the size of a file. Again, the OS doesn't know anything about what's in these files; they're just a bunch of binary data to the operating system, nothing special.
But the format of these database files is typically proprietary, specific to the database management system. Meaning you can't take a SQLite file, drop it into a directory for MySQL, and expect MySQL to be able to read it, right? They're always specialized to the software. These database files are typically stored on top of the regular file system that the OS provides: ext3, ext4, whatever Windows has now, I forget. The OS just sees a bunch of files, and we're going to rely on the file system to provide us with basic read/write APIs. In the 1980s, people did try to build database systems that used custom file systems on raw storage devices. So, say you plop in a new hard drive: instead of formatting it for NTFS or ext4, you say screw all that, just give me the raw storage device and I'll manage what's actually stored on it myself. Some of the enterprise systems, meaning high-end ones like Oracle, DB2, and SQL Server, will still do this, but most of the new database startups, anything new that's come out in the last 10 or 15 years, don't, right? Because the engineering effort to make your own custom file system for your database system is not worth it. You maybe get a 10% improvement, but now you're managing your own file system, which is a big pain in the ass, and it makes your system less portable, because now you can't easily run it on Amazon or on other hardware vendors. So what we're building is what's called a storage manager, sometimes also called a storage engine. This is the component in our database system that is responsible for maintaining our database files on disk. Now, we could just do reads and writes and let the OS schedule things.
Some of the higher-end database systems will actually have a shim layer right above the file system that lets the database system do some of its own disk scheduling. You do this because, if I know I have a bunch of threads writing to blocks that are close to each other, I can maybe combine them into a single write request. The OS can kind of do these things too, but again, it doesn't know what's above it; it doesn't know the semantics of the query. Most systems don't do this, and for the project we'll be working on here, we don't do this either. It's typically just for the high-end ones. Yes? [student question, partially inaudible] The question is: I said that most database systems split the database up into multiple files because you don't want to hit the file size limit of the operating system; is there any similar optimization for putting things in memory? No. [follow-up question] Your question is: does a file have a limit on how big it can be in memory? With virtual memory, no, up to whatever swap size the OS lets you have, but it's essentially limited by the physical memory available to you. [follow-up question] The question is: would it be better to have a single file, because then you get rid of the overhead of having multiple files, like the i-node you have to find to open a file, the name, the location of the file, and so on?
Okay, so you're talking about the metadata. Her statement is: if I have one file, then I have one file name and one i-node in my file system pointing to it. If I have multiple files, then I have multiple i-node entries, each with its own file name and a bunch of metadata referencing it. But that's what, maybe a kilobyte of metadata? It's nothing, right? If your database is one petabyte, who cares if you have a bunch of file names? At really large scales, it doesn't make a difference. For modern file systems, it's not really an issue anymore; you can have single files that are exabytes in size. But back in the 90s and early 2000s, when people were running FAT32, you could only have a four-gigabyte file, right? So back in the day it mattered more, not so much now. But even then, the metadata doesn't matter. Yes? His statement is: doesn't this limit the number of files that you can have open? It's usually open file handles, the number of things you can create, and you have to have permissions to change that. Absolutely, yes. If you go look at the tuning guides or setup guides for a bunch of different database systems, they'll tell you to tune this kernel parameter to allow a certain number of i-nodes or file handles open. Absolutely, yes. Okay, awesome. All right, again, we're trying to build a storage manager, and the storage manager is responsible for maintaining these files on disk. Whether it's one file or multiple files doesn't matter. Now, within these files, we're going to organize the data as a collection of pages. Our storage manager is going to keep track of all the reads and writes we do to these pages, and it's going to keep track of what space is available to store new data in our pages.
So, a page is essentially just a fixed-size chunk or block of data. We're gonna organize our file into these chunks. A page can contain anything, right? It can contain the actual tuples of the database itself, it can contain metadata, indexes, log records. From the storage manager's perspective, it doesn't really matter. But we always have to store things within a single page. Now, some database systems will require you to have the page be self-contained. And what I mean by that is: all the information you need to know how to interpret and comprehend the contents of a page has to be stored within the page itself. So let me give you an example. Let's say that I have a table, and the table has 10 columns with different types, right? I call create table, and I create the table with different attributes. So I could have the metadata about what's in that table stored in one page, and then all the tuples for that table stored in another page. The problem is now, if I have a disk failure, like my data center catches on fire, my disks melt, and I lose that one page that tells me what the layout of the schema is, now I don't know how to easily interpret the contents of my tuple pages. And so some systems like Oracle, for example, require that all the metadata that says, here's what's in this page, has to be within the page itself. That way, if you lose one page, it doesn't affect any other pages. You might think there's a bit of overhead to this, this seems crazy. Well, they do it for disaster recovery. Again, the machine catches on fire and you lose a bunch of pages, you can literally open up a hex editor and try to reconstruct what the database was by looking at one page at a time. And all the metadata you need about what's in that page is stored within the page itself.
All right, so the other thing that's important to understand is that we're not gonna mix different types of data within a page. There are some research systems that do this, where you can have, you know, one page holding tuple data and log record data. For our purposes here, and in most systems, they don't do this. It's like: here's a page that only stores tuples, here's a page that only stores index information. So now each page is gonna be given a unique internal identifier that the database system generates for us, a page ID, right? And we're then gonna have an indirection layer, and this is gonna be a recurring theme when we talk about storage. We're gonna have an indirection layer that allows us to map a page ID to some location in a file at some offset, right? And we wanna do this because now, underneath the covers, we can start moving pages around, you know, if we start compacting the disk or shuffling things to another disk, and it doesn't change our page ID, because we have this page directory that says: you want page one-two-three, here's where to go find it. So there's a bunch of different page concepts we need to talk about, again, to put it in the context of how real computers work. At the lowest level, we have what's called a hardware page. This is the page access level you get from the actual storage device itself, what an SSD or spinning-disk hard drive exposes. This is typically four kilobytes. Then above that, you have an operating system page. As you take things out of the storage device and put them into memory, the OS represents that as a page as well, and that's typically four kilobytes by default in Linux and Windows. There are things like huge pages, where you can take, say, a one-gigabyte page that gets broken up into small four-kilobyte hardware pages, but for our purposes, we don't care about that. The thing we care about is the database page here, right?
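The page-ID indirection just described can be sketched as a simple in-memory map. This is a toy illustration, not any real system's implementation; all names (`PageDirectory`, `move_page`, the file names) are invented.

```python
# Sketch of the page-ID indirection layer: the rest of the system only ever
# asks for a page ID; the directory maps it to (file, byte offset). All names
# here are made up for illustration.

PAGE_SIZE = 4096  # a typical database page size in bytes

class PageDirectory:
    def __init__(self):
        self.locations = {}  # page_id -> (filename, byte offset)

    def register(self, page_id, filename, offset):
        self.locations[page_id] = (filename, offset)

    def lookup(self, page_id):
        # Callers never see the physical location directly.
        return self.locations[page_id]

    def move_page(self, page_id, new_filename, new_offset):
        # e.g. after compaction: only the directory entry changes;
        # the page ID handed out to the upper layers stays valid.
        self.locations[page_id] = (new_filename, new_offset)

directory = PageDirectory()
directory.register(123, "data.db.1", 2 * PAGE_SIZE)
assert directory.lookup(123) == ("data.db.1", 8192)

# Compaction moves the page; page ID 123 still resolves.
directory.move_page(123, "data.db.2", 0)
assert directory.lookup(123) == ("data.db.2", 0)
```

The point of the sketch is the stability of the page ID: moving a page only touches the directory entry, never the callers.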
And this is gonna vary between different systems. At the low end, at 512 bytes, that's something like SQLite, an embedded system. At the high end, you'll have something like 16 kilobytes, which could be MySQL. So different database systems do different things, and there are different trade-offs for all of these. The main thing we're gonna care about, though, is that the hardware page is the lowest level at which we can do atomic writes to the storage device, and that's typically four kilobytes. What I mean by that is: say I need to modify a bunch of data. The hardware can only guarantee that if I do a write and flush to the disk, it's only gonna be atomic four kilobytes at a time. Let me clarify what I mean by that. If I say I need to write 16 kilobytes, I tell the disk, hey, write 16 kilobytes for me. It might write the first eight kilobytes, then crash before writing the next eight kilobytes, and then you come back and now you have a torn write: you only see the first half and not the second half. Because the hardware can only guarantee four kilobytes at a time. All right, this will come up again; we'll talk about this more later when we discuss logging and concurrency control, but this is something we need to be mindful of. Again, different systems do different things. The high-end systems, like Oracle, SQL Server, and DB2, you can actually tune, so you can say, I wanna store things with four-kilobyte, eight-kilobyte, or 16-kilobyte pages. You can even vary it, say, storing index pages with larger page sizes and data pages with smaller ones. You can go crazy and do a bunch of different things. All right, so now we wanna talk about how we're gonna represent the page storage architecture. Again, there are different ways to do this, with different trade-offs. The most common one is gonna be the heap file organization.
So we'll focus on that. But the thing to understand is that at this point, at this lowest level in the storage manager, we don't care about what's actually in our pages. We don't care whether it's index data or tuple data. You ask us for a page, we'll get you that page, right? Or you ask us to delete a page, we'll delete it. So a database heap file is an unordered collection of pages where the tuples of the data can be stored in random order. Again, the relational model doesn't have any ordering. So if I insert tuples one by one, I'm not guaranteed that they're gonna be stored that way on disk. And it doesn't matter, because I write SQL queries that have no notion of ordering. So the API we need to have, again, is being able to read and write pages, as well as being able to iterate over every single page that we have, in case we need to do a sequential scan across the entire table. We'll have some additional metadata to keep track of what pages we have and which ones have free space, so that if we need to insert new data, we know where to find a page to go ahead and do that, right? And internally, we can represent this heap file in a bunch of different ways. Again, at the lowest level, we're gonna organize things in pages, and then we can track those pages with different data structures. So let's first talk about doing linked lists, because that's sort of the dumb way to do this, and nobody actually does this, but it exists. And then we'll see the page directory way, which is a better approach. So what we're trying to figure out here is: within my file, I have a bunch of pages; where do those pages exist, and do they have data, or do they have free space for me to store stuff? So in the header of this heap file, for this linked list approach, we're just gonna have two pointers.
We have one pointer that says, here's the list of the free pages that I have in my file, and here's the list of the pages that are completely full or occupied. Again, this is just a linked list. So it doesn't matter where these pages are stored, it doesn't matter whether they're contiguous or not. I now have this pointer that says, hey, here's the first page in my linked list of occupied pages, and here's a pointer to the next one. So if I wanna say, find me a free page that can store stuff, I can follow the free page list and traverse along it until I find something that has enough space for what I wanna store. And because we may need to iterate in reverse order, we need pointers going the other way as well. Yes? Why is the heap file unordered? The question is, why is the heap file unordered? Think at a high level: the data we're storing does not need to be ordered as we insert it, right? So if I insert three tuples, inside the pages I could have tuple three, tuple two, tuple one. I'm not required to put them in the order they're given. But if you have to look for a particular page, then you have to traverse the whole list, right? Right, so this question is: if you had to look for a particular page with these linked lists, you'd have to traverse potentially the entire linked list to find what you want. Absolutely, yes, this sucks. This is a bad idea, that's what I'm saying. If it is ordered, then you can always search faster, right? The question is, if it's ordered, you can always search faster. Right, so there are different trade-offs. Say I have no metadata to tell me where I have free pages, and I need to insert something now. Where am I gonna insert it? I essentially have to do a sequential scan and look at every single page until I find one that has free space.
Whereas in this approach here, and I'm not saying at a high level this is the right way to do it, I'm saying this is how it works: I can just go follow this pointer to find the first free page and see whether it has enough space for what I want to store. It's a trade-off, right? I could do something like a binary search to find exactly the page that I want, or I can do this linked list. Can you maintain ordered sets for both free pages and data pages? The statement is, can you maintain ordered sets of free pages and full pages? Yes, just order on the page ID, right? Instead of a linked list, just use a tree. Right, the statement is, instead of using a linked list, use a tree. A linked list can still be ordered, right? So you could say, all right, say I delete all the tuples from this page and it's page two, and this is page one and this is page three, so I could insert it in between these two guys, sure. So why do we not keep two different ordered sets, free pages and data pages, ordered on the page ID? But again, the page ID is basically just an offset, right? This is sort of a logical thing built on top of the heap file. We can take this offline if you don't understand it. Let's keep going; if you have questions, we can talk about it further. Okay, the main takeaway of this is: this is a bad idea, nobody does this, so I don't want to dwell on it too much. What people typically do is have a page directory. For this one, we have a page, again in the header of our file, that's going to maintain a mapping from page IDs to their offsets. And then we can also maintain some additional metadata in this directory to say, hey, here's the amount of free space that's available in a particular page.
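The two organizations being contrasted can be sketched side by side. This is a toy illustration under invented names: the linked-list heap file walks pages one by one to find free space, while the page directory answers the same question from its own metadata without touching the pages.

```python
# Contrast of the two heap-file organizations from the lecture. All class
# and method names here are invented for illustration.

class Page:
    def __init__(self, page_id, free_bytes):
        self.page_id = page_id
        self.free_bytes = free_bytes
        self.next = None  # link to the next page in the free list


class LinkedListHeapFile:
    """The 'dumb' approach: walk the free list page by page."""

    def __init__(self):
        self.free_head = None

    def add_page(self, page):
        page.next = self.free_head
        self.free_head = page

    def find_space(self, needed):
        cur = self.free_head
        while cur is not None:            # O(pages) traversal
            if cur.free_bytes >= needed:
                return cur.page_id
            cur = cur.next
        return None


class PageDirectoryHeapFile:
    """The common approach: free-space metadata lives in the directory."""

    def __init__(self):
        self.directory = {}  # page_id -> {"offset": ..., "free": ...}

    def add_page(self, page_id, offset, free_bytes):
        self.directory[page_id] = {"offset": offset, "free": free_bytes}

    def find_space(self, needed):
        # Consults only the (small, cacheable) directory, never the pages.
        for page_id, meta in self.directory.items():
            if meta["free"] >= needed:
                return page_id
        return None


ll = LinkedListHeapFile()
ll.add_page(Page(1, free_bytes=64))
ll.add_page(Page(2, free_bytes=4096))
assert ll.find_space(1000) == 2

pd = PageDirectoryHeapFile()
pd.add_page(1, offset=0, free_bytes=64)
pd.add_page(2, offset=4096, free_bytes=4096)
assert pd.find_space(1000) == 2
```

Both return the same answer here; the difference is that the linked-list version would have to read every candidate page off disk, while the directory version never leaves its metadata.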
So now when I want to insert some data, I don't have to go scan that list. I can just look at my page directory and find everything that I need, right? Again, my pages are just laid out sequentially like this, and the directory is just a mapping to where they're located. So here's the important thing about this, going back to what we talked about with atomic writes in hardware. I now have a bunch of metadata that's a summarization of what's in my actual pages themselves, and I have to keep them in sync. But I actually can't guarantee that, because the hardware can't guarantee that I can write two pages at exactly the same time. So let's say that I delete a bunch of data in my page, and then I want to update my page directory to say, oh, I now have this amount of free space. I delete a bunch of data, write that out, and then, before I can update my page directory and write that out, I crash. So now I come back online and say, oh, I think this page is full and therefore I can't write any data to it, but in reality it's not. So you could say, all right, well, when I boot back up, I'll just scan through all my pages and figure out what's actually really there. But again, think of the extreme: if I have, say, one petabyte of data, that's gonna take forever to actually do. So there are a bunch of extra mechanisms we'll talk about later on, for how we can maintain a log and additional metadata in special files, so that if we crash and come back, we know how to reconstruct what's inside all these things. Think of this as just a hash table: I want page one-two-three, here's where to go find it, and I can just get it. Yes? Each page has the same size, right? Yes, each page has the same size, and it's specified differently in different systems. What do you mean? Say it again? Like, what's the size of the page?
The question is, what is the size of the page in this world? Right, so this goes back to this diagram here: different systems do different things. Is the failure case tied to the hardware page size? Yeah, the failure case is: we can write four kilobytes, and the hardware guarantees that's atomic. But say my pages themselves are four kilobytes, and I need to update one page, say, clear out a bunch of data, and also update the page directory to say, all right, that page has been cleared out. I can't guarantee I write both of those pages atomically. I can write one, then crash before I write the second one. Right, so again, this is what the pages are gonna look like inside the files. Then why do some systems use pages bigger than four kilobytes? So he asks, why do some database systems use larger pages? There are trade-offs. Internally, inside my database system, I have to have this page directory in memory, mapping pages to some location, either in memory or on disk. Now, if I can represent a larger amount of data with one page ID, then the size of that table goes down. Think of this like the TLB, the translation lookaside buffer on the CPU: if I'm trying to map a whole bunch of pages, my page table is gonna get really large, and now I'm gonna have cache misses. So you can represent more data with a smaller number of page IDs. Furthermore, going back to the difference between random and sequential access: if I can write out, say, four-kilobyte pages contiguously to represent a 16-kilobyte database page, then when I do a read, I just read all of that sequentially and bring it in, and now I'm potentially getting more of the data that I actually need. But it makes doing writes more expensive, because I now have to stage a bunch of stuff ahead of time to prevent myself from getting torn writes. So there are pros and cons to both.
And this is why, again, the commercial systems allow you to tune them in different ways, based on what your application wants to do. Let's go to the question. Yes? Would self-contained pages solve this kind of issue? So his question is, for self-contained pages, would that solve this particular issue here? No. Self-contained pages mean that inside the page, I have all the metadata that I need for its contents. I still have to have a page directory to tell me where to find that page if I want page one-two-three or four-five-six. So it's not entirely self-contained at the higher levels in the system; that's unavoidable at the bottom level. That's where we're at here, right? Again, we haven't talked about what's actually inside the page. But the page directory itself, we can't make that self-contained; it doesn't make sense, right? Yes? Is there any way you can verify that? Say you have a partial write, like you're trying to write 16 kilobytes, you write the first half, crash, and don't write the rest. Can you verify that that crash happened and retry, or something like that? Yeah, so his question is, if a crash happens, when you come back, can you identify that the crash happened? Yes, you use checksums, right? So say that my database page is three hardware pages here. In the header of the first page, I'll put a checksum and say: from my starting point here, the checksum of the next three pages, like a CRC or MD5, should be this value. So I come back online after a crash, and I look and say, oh, when I compute the checksum, it doesn't match, because the last page didn't get written out. So therefore I know I have an error. Right? And then we'll talk about logging in a second, but you can also log the operations you do to modify the pages. That's essentially what the database mostly relies on.
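The checksum idea can be made concrete with a short sketch. This assumes a made-up layout where a logical database page spans four 4 KB hardware pages and the checksum covers the whole body; `zlib.crc32` stands in for whatever CRC the real system uses.

```python
# Sketch of torn-write detection with a checksum: a 16 KB database page is
# written as four 4 KB hardware pages, and a checksum stored with the page
# lets us detect that only some of them made it to disk. The layout here is
# invented for illustration.

import zlib

HW_PAGE = 4096

def write_db_page(chunks):
    """Compute a checksum over all chunks; returns what goes 'on disk'."""
    body = b"".join(chunks)
    checksum = zlib.crc32(body)
    return checksum, body

def verify_db_page(checksum, body):
    return zlib.crc32(body) == checksum

chunks = [bytes([i]) * HW_PAGE for i in range(4)]  # 16 KB logical page
checksum, body = write_db_page(chunks)
assert verify_db_page(checksum, body)

# Simulate a crash after only two hardware pages were flushed: the tail
# still holds stale data (zeros here), so the checksum no longer matches.
torn = body[: 2 * HW_PAGE] + b"\x00" * (2 * HW_PAGE)
assert not verify_db_page(checksum, torn)
```

On recovery, a page whose stored checksum fails to verify is exactly the torn-write case described above, and the system falls back to its log to repair it.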
All right, cool. So now let's talk about what it actually looks like inside these pages. Again, every page is going to have a header, and this is sort of what he asked about. We're going to have some information on what the size of the page is, the checksum, what version of the software wrote this data out, right? What happens is database companies put out new releases, Postgres puts out new releases, and every time, the page layout may change. So when you want to upgrade, you want to know: am I looking at pages that were created by the new software or the old software? And you can have different code paths to interpret them. If you're doing compression, like dictionary compression, or LZ4 or gzip, you store information about that. We won't talk about this at this point in the semester, but there's also information about what transactions or what queries modified this data and whether other queries are allowed to see it. And we've already talked about the option of making pages self-contained. All right, so now within a page, we can represent data in two different ways. We can do a sort of tuple-oriented approach, and I'll explain what that means in the next slide, or we can do a log-structured approach. So again, this is within a page now, assuming we have a page directory that tells us how to get to that page if we want a particular page, one-two-three. Now we're talking about: what does it look like when you look inside the page? What is the database system actually going to see? For this one, let's just assume that we're storing tuples. And let's take a really simple case here, a really simple strawman idea: in our page, all we're gonna do is just insert tuples one after another.
Start from the beginning: we have a little header that says here's the number of tuples we have, so we know what offset to jump to if we want to insert a new one. It's super simple, we just insert one at a time, right? So say I insert three tuples; assuming they're all fixed length, every time I insert one, I just jump to the next offset and then update the counter. So this is a bad idea. Why? You have to delete a tuple, you have to move everything below. Right, so he says if you delete a tuple, you have to move everything. Well, not necessarily, right? I could just do this, right? Free the space up. He says external fragmentation. Well, why? Why can't I just insert something in there? What's that? It may not be of fixed length. So I made the assumption that they're fixed length, and he's absolutely right. This works great if everything's fixed length: I could just shove the new one where the old one was. But if it's not fixed length, then this location may not be big enough for what I want to insert, and now I gotta try to put it down here instead, right? So that's one issue. The other issue is, every time I delete something, I either need to maintain metadata at the top that tells me, hey, here's a location in this page where you can write some data, or I gotta sequentially scan and look at every single tuple to figure out where I can go, right? So this sucks, nobody does this. Instead, what you do is what are called slotted pages. This is the most common scheme that every disk-oriented database system uses. The exact details of how they represent these pages will be slightly different, but at a high level, this is what everyone does.
So the way this is gonna work: we always have our header, and the header again stores that basic metadata about the checksum or access times and things like that. Then we have two regions of data we wanna store. At the top we have what's called the slot array, and at the bottom we have the actual data we wanna store; we're assuming we're storing tuples here. And these can be fixed-length or variable-length tuples, it doesn't matter. What the slot array basically is, is a mapping layer from a particular slot to some offset in the page, the starting location of the particular tuple that you want, right? And the reason why we wanna have this indirection layer is because now we can start moving these tuples around within a page any way that we want, and again, the upper levels of the system don't care, right? The record ID is gonna be the page ID and the slot number, and all we need to do is move these things around and just update the slot array to say, here's where you're actually pointing to. And the way we fill up the page is that the slot array grows from the beginning to the end, and the data grows from the end to the beginning, and at some point they meet in the middle, where we can't store any new information, and that's when we say our page is full. So yeah, this means that there could be a small little gap in the middle where we can't store anything, but, you know, because we wanna support variable-length tuples, we have to do this, right? We could do what's called a vacuum or compaction: we could just scan through and reorganize, like defragmentation in old file systems. We could do that in the background in the database system. But for our purposes, this is what we end up with. Yes? So are we assuming that a page can have tuples from various tables, and if so, how do we associate them?
Good point, so his question is: are we assuming here that within a page we could have tuples from different tables? In practice, nobody does that, because you would have to maintain some metadata to say this is from table one, this is from table two. We'll see at the end that there are some systems that do do this, but in general, nobody does. Like, if you open up SQLite or Postgres or whatever, when you call create table, it'll create pages, and only tuples from that table will go in those pages. It's a good question, we'll come back to that in a second. So these are tuple-oriented pages. At the end of the day, we're trying to store tuples inside these pages. And so when I do an insert or an update, I take the contents of that tuple and just write it out in its entirety in this page. Next class, we'll talk about really large data, like if you have attributes where you want to store a video file in the database, don't do that, we'll explain why later. But in that case, you couldn't store it like this, because it won't fit in a single page. So you have some extra metadata and some pointers that say, here are the pages that have the rest of the data you're looking for. But in general, we want to pack the entire tuple into a single page, because then when we need access to the tuple, it's one page read to go get it, and not a bunch of different ones. Again, we'll break that assumption next class, but for our purposes here, it's fine. So another way to store data in pages is... is that a question? Oh, on the left side, say you remove the third tuple, do you push everything back? So the question is: say I remove the third tuple here, what happens? Well, it depends. I'll give a demo at the end of the class. Some systems will actually compact it before writing it out to disk, some systems will just leave the gap there.
And then if it gets full and you say, oh, I have some free space, maybe I try to do compaction. Yes? So the slot points to the starting position? Yes, the slot points to the starting position of the tuple. And your question is what? What's the order of the tuple storage, does it go from higher addresses to lower addresses? So the question is, what is the ordering of the storage addresses within the slots? Does it allocate from lower to higher, or go from higher to lower? Like, say this page is four kilobytes, and I want to store a one-kilobyte tuple as tuple one. Starting from the end offset, I jump back one kilobyte, and then my slot array points to that. All right. Yes? If you delete one tuple in the middle, do you still need to move the slots? The question is, if I delete one tuple in the middle, say I delete tuple three, what do I do up above in the slot array? Nothing. Again, different systems do different things: the header could have a bitmap that says here are the slots that are empty that you could point something at, or I could just sequentially scan and read it. It doesn't matter. The key thing, though, is that the other parts of the system don't know and don't care where the tuple is actually physically stored. So for tuple one, it comes out of this slot here. The upper parts of the system would say, oh, tuple one, you can find it at page one-two-three at slot zero, or slot one, depending on what your starting offset is. So now, no matter how I reorganize my page and move tuple one around, I know that I always want to go to the first slot to find where it's actually located. And if I reorganize, I don't have to update my indexes, I don't have to update anything. And this is sort of what the page directory is trying to do as well.
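The slotted-page layout described above can be sketched in a few lines. This is a minimal toy under invented names: the slot array grows forward from the header, tuple data grows backward from the end of the page, and the page is full when the two would meet.

```python
# Minimal slotted page for variable-length tuples. The sizes and names here
# are assumptions for illustration, not any real system's layout.

PAGE_SIZE = 4096
HEADER_SIZE = 16
SLOT_SIZE = 4  # say, a 2-byte offset plus a 2-byte length per slot

class SlottedPage:
    def __init__(self):
        self.slots = []            # slot -> (offset, length); grows forward
        self.data_end = PAGE_SIZE  # tuple data grows backward from here

    def free_space(self):
        slot_array_end = HEADER_SIZE + SLOT_SIZE * len(self.slots)
        return self.data_end - slot_array_end

    def insert(self, tuple_bytes):
        # Full when slot array and data region would collide.
        if self.free_space() < len(tuple_bytes) + SLOT_SIZE:
            return None
        self.data_end -= len(tuple_bytes)
        self.slots.append((self.data_end, len(tuple_bytes)))
        return len(self.slots) - 1  # the new slot number

page = SlottedPage()
s0 = page.insert(b"x" * 100)
s1 = page.insert(b"y" * 200)
assert (s0, s1) == (0, 1)
assert page.slots[1] == (PAGE_SIZE - 300, 200)
```

Compaction in this model would just repack the tuple bytes at the end of the page and rewrite the offsets in `slots`; the slot numbers handed out to the rest of the system never change.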
So no matter where I move the pages in the file, either on disk or over the network, the other parts of the system don't care where they actually got moved to, because I have page one-two-three, I have the page ID, and I can always use the page directory to find where it's actually being stored. These indirection layers avoid having updates propagate through all the parts of the system. Yes? Can you explain how exactly you know which slot tuple one is stored at? So the question is, how do I know that tuple one is stored in slot one? I jumped ahead too much, but we'll go ahead and do that now. This is actually the last slide, but I'll just talk about it here. The way we identify tuples is through these record IDs or tuple IDs. It's essentially a unique identifier that gives the logical location, or a logical address, of a tuple. It's really a blend of logical and physical, but it's usually the page ID and the offset or the slot. So all the parts of the system that want to address tuple one, they don't know what tuple one is; they just have a page ID and a slot number. And so I go to the page directory and say, I want page one-two-three. The page directory says, oh, it's in this file at this offset, jump to that. Then I get to that page and say, oh, I want slot one. I look at my slot array, and that tells me where in that page to go find the data that I want. So all the parts of the system, like the indexes, log records, and other things, are gonna address tuples through these record IDs. So is it, like, separate from the page? What do you mean, separate from the page? Like, you basically have to go through it to tell you what page it's in and what slot it's in as well. Yes. So is that record ID information stored separately from the page? I'll give a demo, and hopefully this will make more sense. Say I want to find the student record, or the professor record, with the name Andy.
I look at the index on the name, and it's gonna say, oh, there's a professor named Andy, and he has a record ID of page 123, slot one. That's what the index gives me. So then I go to the page directory: okay, where do I find page 123? Go get it for me. It goes and gets it. Now I have the pointer to the page and say, oh, I want slot one. I look at my slot array, and that tells me what offset to jump to within that page to find what I need. Okay? Right? And different database systems do different things. The most common approach is the page ID and the slot number or the offset. And again, the advantage of this is that if I start moving data around, either moving the page around or moving data within the page itself, the indexes and all the other stuff don't have to get updated, because they're still looking at page 123, slot one. So again, I'll give a demo and explain it in a little more detail. Different systems do different things: in Postgres, it'll be four bytes; Oracle is 10 bytes, there's a bunch of extra metadata they store; and SQLite is eight bytes. So let's do a demo, because that's always fun. Because I hate typing on the Surface, I'm gonna use my other laptop, and I can see it better. All right. So I'm just gonna give an example of how we can actually see what these pages look like, because again, the database system stores this internally. You're not supposed to see it, but there are different commands you can run to actually get at it. Can everyone in the back see that, or is it too dark or too light? Is that better? All right, so we're gonna make a simple table that has two columns, ID and value, and we're gonna insert some tuples into it. So this is Postgres. Postgres has this thing called the CTID that represents the physical location of the data. So I can add this virtual column, CTID, and I get it in my output. This is a pair that's gonna have the page ID and the offset. So again, here's the data.
I inserted three tuples. So now it's saying that page zero, slot one is the first tuple; page zero, slot two is the second tuple; page zero, slot three is the third tuple, right? It's not actually storing this data; it can derive it when it runs the query, because it says, oh, I look at my page, I see the slots, here are the tuples that are actually there, right? So let's go ahead and delete one of these tuples, say the second one. I go back and look at my data, and you can see that it deleted the second tuple but didn't reorganize the page, right? The third tuple is still at page zero, slot three. So let's say now I insert a new tuple. What do you think it's gonna do with that? Again, I deleted the second tuple, so now I have a tuple at slot one and slot three. I insert a new tuple. Is it gonna be at slot two or slot four? Raise your hand if you say slot two. Half. Slot four? Less. Slot four, right? And that's not wrong, right? Because for our purposes, the relational model doesn't say anything about the order in which we insert tuples. Postgres just decided to put it at the end. Now, Postgres has this thing called the vacuum. Think of this as the garbage collector for the database. It's gonna go through and reorganize all the pages. It actually may take a while, so we'll come back to that. But when it does this, it's gonna say, oh, well, I know that I have free space in slot two, so I'm gonna compact the pages and write them out sequentially, right? It's gonna take a while, so let's look at the other ones. So we can do SQL Server. Same thing, I have three tuples. Actually, let me drop the table and start over. So now I have three tuples, select star from R, right? One, two, three. SQL Server doesn't have the CTID; it has this other built-in function like this.
And it tells you: here's the file ID, here's the page number, and here's the slot: zero, one, two. Okay, so let's do the same thing. Let's delete a tuple and insert a new one, then query it again. It didn't compact it, it still says zero, one... actually no, it did compact it, sorry, take that back. Let's do this over again; I missed it, sorry. I always forget which system does it which way, and again, it's not that one way is correct and the other isn't. So we insert three tuples, do a select, right? Zero, one, two. Now I delete the second tuple, do the same select, and I get zero, two. It hasn't compacted anything yet. Now let's insert another one: zero, one, two again. So look at this: the tuple that was at slot two is now at slot one, and the new tuple got slot two. Because what SQL Server did when it updated the page was say, oh, I have free space, let me compact the page and write it out, right? Again, from the higher levels of the system, we don't know and we don't care; internally it can do whatever it wants. So going back to Postgres: when we inserted the new tuple, it put it at the end, but then I ran the vacuum, and that basically does compaction. Now it reorganized the slots to one, two, three, where before they were one, three, four, right? So the last one I want to show is Oracle. And although Oracle is sponsoring the class, I will say this: their terminal is s***. You can't hit up-arrow; there's no way to go back. Drop table R, create the table again. By the way, here's a nice shortcut: most database systems let you write one insert statement with multiple comma-separated rows and insert them all at once. Oracle doesn't let you do that, so I have to go one by one. So Oracle has something called a ROWID. Again, this is an internal thing that Oracle maintains. Normally when you run queries you don't see it, but if you add the ROWID column (it's like an internal virtual column) it exposes this information. And it's some ten-byte gibberish.
We don't know how to interpret this. So there's a command, a bunch of functions you can run (I found this on the internet, I didn't write it), that basically deciphers it, and then you get: here's the file number, the block number, and the row slot, the same as we saw for SQL Server. Okay, let's do the same thing. Let's delete the second one. And again I can't hit up-arrow to go back. So it was zero, one, two for the slots, and now we have zero and two. Let's go insert our new tuple. Who says it's going to go in the hole at slot one? Who says slot three? Even fewer; nobody knows. It's slot three. So again, the main takeaway: SQL Server compacted the page before it wrote it back out, while Oracle and Postgres just leave the empty slot there. Okay? And it doesn't matter to the other parts of the system; this is just how the system internally organizes tuples or slots within its pages. Okay, so let's go back. Did that answer your question? I realize that was long, okay, yeah. So let's keep going. Yes? The question is, why expose this API to the user? So database systems are very complex pieces of software, and people get paid a lot of money to maintain them. By exposing as much metadata as you can to the end user, say an administrator, you could potentially help them decipher why the system is doing certain things. That's what I would say. But you don't want to write your application using any of this; it's not reliable, right? Let me show this in Postgres, because I know how to do it in Postgres. So going back to this: we've already done the compaction with the vacuum, but Postgres will actually let you do this. You can say where CTID equals (let me put it at the top, sorry) where CTID equals (0,1), right? So I can access a tuple exactly based on its storage location.
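You can replay the delete-then-insert experiment from these demos on your own laptop using SQLite's rowid, the eight-byte record ID mentioned earlier. One caveat: SQLite's rowid is a logical key rather than a (page, slot) pair, so this is an analogy, but it shows the same behavior of not reusing the hole:

```python
import sqlite3

# Replaying the delete-then-insert experiment with SQLite's rowid.
# Caveat: the rowid is a logical record ID, not a physical (page, slot)
# pair, so this mirrors the demo's question, not its internals.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (id INT, value INT)")
conn.executemany("INSERT INTO r VALUES (?, ?)", [(101, 1), (102, 2), (103, 3)])

conn.execute("DELETE FROM r WHERE rowid = 2")   # delete the second tuple
conn.execute("INSERT INTO r VALUES (104, 4)")   # where does this one land?

rows = conn.execute("SELECT rowid, id FROM r ORDER BY rowid").fetchall()
print(rows)   # [(1, 101), (3, 103), (4, 104)]: the hole at rowid 2 is not reused
```

By default SQLite picks the new rowid as one more than the largest rowid currently in the table, so the new tuple gets rowid 4 rather than filling the hole at 2.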
I don't want to do this in my application, because, again, at any time the database system is allowed to reorganize the data, and I could end up with a different CTID. So yes, it's unique; I could uniquely identify a tuple with it, but I don't want to, because I could get screwed later. Again, I think the answer is that it's just exposing the internals of the system to help administrators understand what's going on. Yes? The question is, what happens if you try to name a column or a table CTID? Let's see. So: drop table XXX, create table XXX with ID int, CTID int. Who says it'll allow this? Raise your hand, yes? Raise your hand, no? No, you can't do that. Let's try Oracle: create table XXX with ID int, ROWID int. Nope, didn't like it. So yes, these are reserved names. There are a bunch of other things you can't name columns either. Actually, let's see if I can name a column "int" in SQL Server. I don't think you can. Oh, you can. Different systems do different things. Yeah, okay. Don't do that, it's a bad idea. All right, how are we doing on time? We're short, so let me quickly talk about tuple layouts, and that'll set us up for next class. So a tuple in our world is just a sequence of bytes, right? We get a slot offset, we write a sequence of bytes, and we're done. The job of the database system is to be able to interpret what those bytes actually mean, and that's where the schema comes in. The schema is going to say: I have an int that's 32 bits, I have an int that's 64 bits. So when I look at my sequence of bytes, I know how to jump to different offsets to find the columns that I want, right? So again, it just looks like this. Here's our tuple, starting at some offset within our page.
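That offset computation can be sketched quickly. The column names and types here are made-up examples, and the sketch assumes fixed-length, little-endian attributes with no padding:

```python
import struct

# Sketch of using a schema to find columns inside a tuple's bytes.
# Example schema: a 32-bit int "a" followed by a 64-bit int "b".
schema = [("a", "<i"), ("b", "<q")]   # (column name, struct format)

def column_offsets(schema):
    # Walk the schema, summing fixed type sizes to get each column's offset.
    offsets, pos = {}, 0
    for name, fmt in schema:
        offsets[name] = pos
        pos += struct.calcsize(fmt)
    return offsets

# Serialize one tuple as a byte sequence...
tuple_bytes = struct.pack("<iq", 7, 1 << 40)

# ...then jump straight to column "b" using the schema-derived offset.
offs = column_offsets(schema)
(b_value,) = struct.unpack_from("<q", tuple_bytes, offs["b"])
print(offs)      # {'a': 0, 'b': 4}
print(b_value)   # 1099511627776
```

Variable-length attributes (VARCHARs and the like) complicate this, since their offsets can no longer be precomputed from the types alone.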
We'll have a header where we keep track of different things, like what transaction or query modified this tuple, then metadata like whether we have null values, and then the actual tuple data. We typically don't need to store the metadata about the tuple's schema within the tuple itself, right? When we store a tuple, we don't want to say, hey, I have four columns and they're of these types. We store that as higher-level metadata, either within the page or within the catalog pages themselves. You do have to do this in JSON databases or schema-less databases like MongoDB, because every single record could be different, every document could be different, so you have to store metadata about what's actually inside each one. Inside the tuple data itself, you typically store the attributes in the order in which you created the table. So if I say create table with columns A, B, C, D, E, most systems will just store them in the order they were declared, right? You don't have to; the relational model doesn't require this, but typically most systems do it. If we're talking about in-memory systems that are trying to be cache efficient, you can reorder attributes so that you're word-aligned for cache lines, but for our purposes, we don't care, right? So the last thing I want to talk about is the question from before about storing data from different tables inside the same page, where I said most systems don't do this. The reason you don't want to do this is that you want pages to be self-contained, and you don't want a bunch of extra metadata about these different tables. Where it does show up is when you denormalize or pre-join tables. Now, we're not talking about normal forms or functional dependencies in this class. You don't know it yet, but you'll thank me when you're older, because they're terrible. People cry every single time you try to teach them. Most database classes teach them.
I don't think they're important. Almost nobody does this in practice; maybe some DBAs do. When I first taught this class, we spent two whole lectures on normal forms. We don't need to do that. You just need to know that they exist and what they mean, and that's what this one slide is. So we're going to encapsulate two lectures in one slide. Normalization is basically how we split up our database across different tables, and it happens naturally when you have foreign keys. Like, I have artists and albums; I can use foreign keys to break them up. This happens naturally as we design our application. Now, there are some cases where we actually want to embed one table inside of another. If we want to avoid the overhead of doing a join, say to get all the albums an artist put out, we can just inline the albums in the artist's own tuple. And in that case, within a single page, we could have data from two different tables packed inside the same page. So let's look at a really simple example. I have two tables, foo and bar. Bar has a foreign key reference to the foo table. Normally I would store my tuples completely separately: all the data for the bar table is stored in its pages, and all the data for the foo table is stored in its own pages. But if most of the time I'm joining these two tables together, if for every query I want to say, give me all the bars for a given foo, then maybe what I want to do is embed the bar tuples directly inside of the foo tuple. Before, I was replicating the A attribute for every single bar tuple; now, if I pack them inside, I don't need to repeat it. I just keep the columns that are unique to the other table. This is called denormalization. Another way to think about it is pre-joining: I'm packing tuples inside of each other.
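Here's the foo/bar example done logically, as one sketch of the idea. The table and column names follow the slide; the pre-joined table `foo_bar` is a name made up for illustration:

```python
import sqlite3

# The foo/bar example, done logically: bar has a foreign key on foo's
# "a" column. The pre-joined table "foo_bar" is a made-up illustration
# of denormalization; a system doing it *physically* would keep two
# logical tables but pack the pages this way under the covers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (a INT PRIMARY KEY, b INT);
    CREATE TABLE bar (c INT PRIMARY KEY, a INT REFERENCES foo(a));
    INSERT INTO foo VALUES (1, 10);
    INSERT INTO bar VALUES (100, 1), (101, 1);

    -- Denormalized / pre-joined version: one row per (foo, bar) pair,
    -- so the join is already materialized.
    CREATE TABLE foo_bar (a INT, b INT, c INT);
    INSERT INTO foo_bar VALUES (1, 10, 100), (1, 10, 101);
""")

joined = conn.execute(
    "SELECT foo.a, foo.b, bar.c FROM foo JOIN bar ON foo.a = bar.a ORDER BY c"
).fetchall()
prejoined = conn.execute("SELECT a, b, c FROM foo_bar ORDER BY c").fetchall()
print(joined == prejoined)   # True: same answer, but no join at query time
```

Note the trade-off: the denormalized version repeats foo's `b` value for every matching bar row, which is exactly the replication the lecture describes avoiding when the embedding is done inside a single tuple.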
I can either do this logically, by rewriting my application and creating my tables that way, or I can do it physically, which is what we care about here: underneath the covers, we store our pages like this. The application can still say, hey, I have two separate tables, but underneath the covers, in our pages, we'll actually combine them and pack them together. So this is the only time I think systems actually try to store data from two different tables inside the same pages. This is not a new idea; it's super old, going back to the 1970s. IBM first did this in System R, the first relational database system. But it turned out to be a huge pain to maintain in the software, and they abandoned it when they went off and made DB2. So System R was one of the first relational databases out there, but IBM never commercialized or sold it; they took some of the code in the 1980s and created DB2, which is still around today. And the idea shows up in more modern systems too. If you use Cloud Spanner from Google, you can have it pack two different tables together in the same tuple. There was a startup about ten years ago called Akiban that sold a storage engine for MySQL that did this kind of denormalization automatically. They got bought by FoundationDB, and then FoundationDB got bought by Apple, so Akiban doesn't exist anymore. And then a bunch of these document databases or JSON databases are essentially doing the same thing: when you define your JSON document, you can pre-join and pack related attributes within the document itself. Again, that's the only time people actually store tuples from different tables inside the same pages. OK, so we covered this, and we're done. So again, what did we talk about today?
Databases are organized in pages. There are different ways to track the pages that are in our files, and within those pages, there are different ways to lay out and store our tuples. For the first programming project, we already take care of the page layout for you; we've written the disk manager for you too. When we get into the second project, you're actually going to have to define the page layout for the index that you're going to build. OK? So next class, we'll talk about how to actually represent values inside the tuples. We'll go inside the byte sequences of tuples and talk about what individual attributes look like. And then we'll talk about storage models: how are we actually going to organize the tuples within a table itself? All right? Any questions? Yes? How can you tell that you're getting the correct information back? The question, which is an awesome question, is: what does test-driven development look like for databases? How do I know my queries are running correctly? Huge topic. Let's talk about it afterwards; I'm extremely interested in that. I'm actually trying to hire somebody full time to do that for us. It's f**king hard. If you can do testing for databases, I can get you a job yesterday. OK? Not just with me, anywhere in Silicon Valley. Any other questions? All right, guys, enjoy your weekend. See you on Monday.