Okay, so real quick, before we dive into the material, the usual administrative reminders: homework one is due tonight at midnight, and if you're not going to make tonight's deadline, check the syllabus. For both the projects and the homeworks you're given four slip days, so if you're one minute over whatever Gradescope says is midnight, it's considered late and you're docked one of those four slip days. The second announcement is that in class on Wednesday I will be releasing and presenting the first project, and we'll spend a little time outlining how the four projects you're going to be working on fit together and how they relate to the discussions we've been having in lecture so far.

And then the last thing I want to mention is a confession: I sort of lied at the beginning of the semester when I said the only two things I care about are my wife and databases. I really, really love vaccines. I even got mine earlier today; my shirt says vaccinate f***ing kids. Everyone should go get the flu vaccine, they're free. Again, I realize some people have medical conditions and can't, but if you're healthy there's no excuse. To give you an idea of how obsessed I am about vaccines: two years ago I tried to get them to set up a flu vaccine table at the CS department in Gates, and they wouldn't do that. Then my first year I tried to give extra credit if you got the flu vaccine and had a selfie of them sticking the needle in your arm; apparently that's illegal and I couldn't do that. So then last year we tried to buy some flu vaccines from Mexico and give them out to students; apparently that's illegal too. So I went through all this effort, and for you it's super easy: you can make appointments or go whenever they have the free clinics, okay?

Okay, so let's jump into databases now, the second love of my life. There were some questions at the end of last class from people grappling with where I was going with database storage at the file level and breaking things up into pages, trying to understand the big picture and how it fits into the things we'll talk about in subsequent lectures. So I'm going to spend a little time going over again what I mean by a disk-oriented database system architecture, and then I'll show an overview of the high-level semantics of the system we're trying to build and how the various things we've talked about so far fit into it. So I said this course is about disk-oriented database management systems, and the definition I gave for disk-oriented was a DBMS whose architecture is predicated on the assumption that the primary storage location of the database is on disk. That means any time a query needs to read something, read a tuple, and we've already talked about how tuples are stored in pages and pages are stored in files, it's going to assume that the tuple is not in memory, and therefore it's going to go out to the disk to get it, copy it into memory, and then do whatever it wants to do with it.
And the reason we say this is disk oriented, again, is this major assumption, because a lot of the things we're going to talk about as we go along require adding certain protection mechanisms or certain functionality in the components of the database management system, just because any time we go read something, it may not actually be in memory. We'll see this when we talk about concurrency control, allowing multiple queries or transactions to run at the same time. At any given time, I could read something that's not in memory, and I'm going to stall because I have to go to the disk and get it. Meanwhile, other queries are going to start running at the same time as I'm running, and therefore the database management system needs to protect the data, and when I say protection, I mean at a logical level, from other queries doing the same thing, trying to operate on the same data at the same time. This will make more sense when we talk about concurrency control and locking and latching and other things like that. But that's the high-level idea of what we're trying to do here: we're essentially trying to manage the movement of data back and forth between disk and memory so that we can handle databases that are larger than the total amount of memory we have. So is that clear for everyone here?

And as a precursor, or as a teaser for the advanced class: in the advanced class in the spring, the disk goes away. We assume everything is in memory for reading and writing data, and that allows you to do a bunch of optimizations that we can't do here. So you can take what we cover here as the classical database system architecture, and it's still the most widely deployed implementation that's around. We can make it run faster in some cases if we assume everything's in memory, but for our purposes here we assume there's a disk, and we'll see how we actually deal with that.

So to better understand what we talked about in the last lecture and how it's going to fit into the subsequent lectures, I want to present this high-level diagram of what a database management system is going to look like, and you'll see how the various pieces fit together. Last class we talked about database storage. We assumed the database was going to be one or more files on disk: SQLite is a single file, other systems will break it up into multiple files, sometimes one file per table, sometimes one file per database. For our purposes, it doesn't matter. And then we said that the way the database system is going to organize data inside those files is in pages, and we said there had to be a special page, called the page directory, that is essentially a map to say: if you want page 123, here's the file and the offset to go get the thing you want. So this is what we see when the database is at rest, just sitting on disk. You can't actually operate on it there; we've got to bring everything into memory. So when we start operating on it, as I said, you have to find the page you want and copy it into memory. And we'll discuss this in the next class, but the thing we're going to end up building to manage this movement of data back and forth is called the buffer pool.
Now remember I said that we could let the OS do this and use virtual memory with mmap and memory-mapped files. But we said that for performance reasons, and actually for correctness reasons as well, we don't want to do this; we want to manage everything ourselves. So this is where we're going to end up having to build this buffer pool thing. When you first turn the database system on, there's nothing in the buffer pool. We have these empty frames, slots where pages can go, but there's no data actually in them yet. That's not entirely true: when you boot the system, it usually preloads a bunch of stuff, but for the simplistic diagram here, there's nothing in there. So then we come along with this execution engine. This is the thing that's going to execute queries, execute transactions, and read and write data. I'm drawing it as a cloud, to be deliberately nebulous, because we don't actually care at this point what it is; we'll cover that later on. Just assume that there's something up here that has a SQL query and wants to read and write data. The way it's going to interact with the storage layer we're building in this first part of the semester is to go to the buffer pool and say: I want the page with this particular page ID. How it got that page ID, whether it read it out of an index or somehow magically pulled it out of the air, doesn't matter; this is the interface it uses to talk to the buffer pool. So now the buffer pool manager is going to go down and say, well, I need to figure out where this page actually is, so let me go get the page directory page. It knows where that is, because it's a special page; it has to know where it is in order to bootstrap anything. It copies the directory page into memory, and it now takes up a slot, a frame, in our buffer pool. Now we can say: we want page number two, here's where it is, so let me copy that into memory and put it into another free slot. Then we hand back a pointer to that page to the execution engine, and the guarantee is that any time the execution engine is operating on that page, it will be pinned in memory. Again, we'll see how this works in the next class, because obviously it would be bad if I started reading data and the buffer pool manager all of a sudden swapped out the page I was reading while I was reading it; I would have problems. And then there's some additional logic up in the execution engine that knows how to interpret whatever the layout of that page is, to do whatever it needed to do. Now, we talked about the different page layout schemes. You could have a log-structured one, which is appending log records. You could have a slotted page, where we have actual tuples and slots that point to where they are inside the page. From our point of view down here in the storage layer, we don't actually care what the layout is. We were just told, hey, get us page two, and we hand it off to something else in the system that knows how to interpret it, okay? So now, to map this to the schedule for the semester: we already covered pages in the last class and we'll cover them a little more today, so that's lectures three and four. The buffer pool manager stuff will be covered in Wednesday's class, because there's a lot going on there that I'm being super high level about right now.
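At the level we've been talking about it, the interface between the execution engine and the buffer pool boils down to something like the following. This is a minimal sketch with made-up names, not the actual project API:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t page_id_t;
#define PAGE_SIZE 8192            /* every page is the same fixed size */

/* One frame in the buffer pool: a slot that can hold any one page. */
typedef struct {
    page_id_t page_id;            /* which page currently lives in this frame */
    int       pin_count;          /* > 0 means someone is still using the page */
    bool      is_dirty;           /* page was modified and must be written back */
    char      data[PAGE_SIZE];    /* the raw bytes of the page itself */
} frame_t;

/* The execution engine only needs two calls at this level of detail:
 * ask for a page by its ID (which pins it in memory), and release it. */
frame_t *buffer_pool_fetch_page(page_id_t page_id);
void     buffer_pool_unpin_page(page_id_t page_id, bool is_dirty);
```

Everything interesting, finding the page on disk via the page directory, picking a frame, deciding what to evict, is hidden behind those two calls.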
How do you actually copy things in? How do you make sure that things don't get swapped out? How do you decide what to swap out? We'll cover all that next class, and that'll be the first project you guys implement. Then in lecture six we'll talk about how we actually represent the page directory in memory with hash tables. And then we'll cover how you actually interpret the bytes in lecture ten, and actually in a bunch of the other lectures that cover query processing, okay? This makes more sense now, right? We're starting with files on disk. We then copy them into memory, and there's some layout to the files and some layout to the pages that other parts of the system know how to deal with. Again, we're going up the layers of the database system. Ideally there would be a clean abstraction from one layer to the next, but we'll see in a second that that's not always the case, because how you decide to lay out pages, or the data in pages, will affect how you build your execution engine. So it's not a clean abstraction in some ways, but at a high level it's good enough, okay? All right, so this is the roadmap for us for the next couple of weeks. Question? No, okay.

All right, so now in today's class we're going to pick up where we left off last class, where we were talking about files. We talked about the page directory, we talked about the page layout, and then we talked about how you actually lay out tuples inside those pages. So now the first thing I'm going to talk about is how you actually represent individual values for tuples inside of a page. Then we'll talk about the system catalogs at a high level, which are the internal metadata the database system maintains to figure out what data is actually being stored and how to interpret the bytes inside of tuples. And then we'll finish off talking about storage models, which is a way to think about how to take data in tables and map it to pages, or how you actually organize the values in pages; that's a higher-level concept that goes beyond the things we've talked about so far, okay?

Okay, so as I said last class, from the database system's perspective there's nothing magical about tuples: a tuple is just a sequence of bytes, and the database management system knows how to interpret those bytes to derive structure and extract values from them. I could just open up a char array and write a bunch of junk into it, but the database system will know how to take different offsets in that byte array and say, well, this is an integer, I know it's represented as 32 bits, and I know how to do whatever operations or manipulations I want, assuming it's 32 bits. The CPU and the operating system don't see that; they don't know about the schema. They just see a bunch of byte arrays. And we'll see this in the catalogs: the catalogs, again, are the metadata we're going to use to figure out what the values in that byte array actually mean. So here we want to talk about how we actually represent basic types in this byte array. There are four broad categories of types inside a database system; we'll cover user-defined types and complex types later in the semester. But at a high level, there are essentially four things a value could be, right?
You can have integer types, and the different types you can define determine how many bits you want to use to store the data. So you have INTEGER, which is 32 bits or four bytes; BIGINT, which is 64 bits; SMALLINT, which is 16; and TINYINT, which is a single byte or eight bits. For these, we're just going to use the built-in C or C++ representation of the data. If I open up a C program and declare an int, the CPU internally knows how to interpret those bits as an integer, and we're just going to leverage the same thing inside our database system; there's nothing magical there. Again, in C and C++ you can have 32-bit integers, 64-bit integers, and all of those map to the SQL types. Where things get tricky is when you want to start having decimals or floating point numbers, and there's a trade-off between floating point numbers, which use the built-in hardware primitives, and fixed-point decimals, which we manage ourselves inside the database system to ensure that we get accurate calculations, accurate numbers. So as a quick show of hands, who here has ever heard of the IEEE 754 standard? A few, actually a lot of you, awesome. The 754 standard defines how hardware should represent these floating point numbers, and it tells you what should happen when you manipulate them. We'll see in a second that with variable-precision numbers you don't get guaranteed accuracy in your calculations, but they're much faster, because you can do the math directly in hardware. Whereas if you want to guarantee accuracy using fixed-point decimals, we have to manage that ourselves in the database system, and of course we pay a performance penalty to do that. Then you can have variable-length values: VARCHAR, VARBINARY, TEXT, and BLOB. A VARCHAR is just like a string, and it can be any arbitrary length. You can also have a CHAR field, which is always the same size, padded with zeros or spaces at the end. With a VARCHAR, what typically happens is you put the size of the value in a header, and that tells you how many bytes to read until your value is complete. We do that instead of something like C's null-terminated strings because we need to know exactly where to jump from one value to the next. Then you can have timestamps, dates, and times; very often these are represented as just the number of seconds since the Unix epoch. I actually don't know what SQL Server on Windows does for this. And we saw, I think a few lectures ago, how the various systems can represent these things in different ways, with different properties and different accuracy, just because everyone does something slightly different; the standard says one thing, but everyone deviates a little. So I want to spend more of our time talking about variable-precision numbers, because again, this is a good example where we could just let the hardware manage these data types for us, but if we do it ourselves we get better accuracy at the cost of a performance penalty. So, variable-precision numbers: as I said, these are defined by the 754 standard, and they'll be FLOAT, REAL, or DOUBLE; sometimes REAL and DOUBLE are just mapped to the same thing.
These are synonyms for different sizes: a FLOAT is typically a 32-bit floating point number, and a DOUBLE is 64 bits, with REAL mapping to one or the other depending on the system. Again, the problem with these native types is that they can't exactly represent all decimal values, because the hardware rounds to the nearest binary fraction it can store. To give you a really simple example, say we have a little C program like this: we take two 32-bit floats, x and y, make one be 0.1 and the other be 0.2. If I add these two numbers together and print the result with enough precision, I get a different number than I get from just printing 0.3. We as humans know that 0.1 plus 0.2 should equal 0.3, but the two printed values don't match, because the hardware can't exactly represent these decimal values. Doing the math this way is really fast, but if your database is maintaining things like money or measurements from scientific instruments, this is problematic, because now you're going to have weird rounding errors and inaccurate results. The way to solve this is to use fixed-point precision numbers, typically called NUMERIC or DECIMAL. The idea is that we store some extra data for the type inline in the tuple that tells us everything we need to know to exactly represent one of these decimals. In some systems, like Oracle as far as I know, you can declare that you want a float or even an integer, but it always ends up getting mapped back to one of these managed values, because they want to guarantee accuracy in your calculations. Again, you pay a performance penalty, but for a lot of cases that's okay. We'll see in a second how we actually represent this, but at a high level, the way to think about it is that you're basically going to store the actual number almost like a string, and then you're going to have some extra metadata stored inline with the tuple that tells you where the decimal point should be, what the scale is, and things like that.

So I want to do a quick example in Postgres to show you what the performance difference is between these two kinds of types. Let me move my SQL window to the bottom. This is Postgres 10, which has a new parallel query execution engine; I'm turning that off so that the performance difference is easier to see. I'm going to create two tables: one with REAL values, so floating point numbers, and one with DECIMAL values, the built-in Postgres type that does fixed-point precision. Then I ran a simple Python script that generated a CSV with two columns and 10 million rows of random decimal values, and I can load that CSV into my two tables using the COPY command. COPY is basically a command that tells the database system: go to your local file system, get this file, it's a CSV, parse it, and load it into the table. So now if I do a select star from the decimals table, just to show you what it looks like, each row is two numbers like this. Again, we have one table storing them as decimals, fixed-point precision, and one table storing them as floating point.
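As an aside, to make the little C example from a moment ago concrete, here's a sketch of that program. The exact digits you get depend on your compiler and how many digits you print, but the mismatch is the point:

```c
#include <stdio.h>

int main(void) {
    float x = 0.1;   /* nearest 32-bit float to 0.1 */
    float y = 0.2;   /* nearest 32-bit float to 0.2 */

    /* Print with enough digits to expose the rounding error. */
    printf("x + y = %.20f\n", x + y);
    printf("0.3   = %.20f\n", 0.3);
    return 0;
}
```

On a typical machine the two printed lines disagree after only a handful of decimal places, even though mathematically both should be 0.3.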
So now what I'm going to do is, for every single row, add the first column to the second column and compute the sum over all of that. I'm putting two keywords in front of this query, in case you've never seen them before. EXPLAIN is a command that tells the database system to return the query plan that the optimizer generated. We'll discuss what query plans look like later in the semester, but you essentially get something like this: a tree that says do a sequential scan on the table and then compute an aggregation. And you can see, we'll cover this later on, it shows you things like the number of rows it thinks it's going to read and how long it thinks it's going to take. If I add ANALYZE in front of it, that tells it to actually run the query: EXPLAIN alone just spits out the query plan without running anything, while EXPLAIN ANALYZE actually runs the query and shows you the plan along with what happened. So I run it like that, it runs for a bit, and it tells me my execution time is 1.4 seconds: it planned for a fraction of a millisecond and then ran for about 1,400 milliseconds. That's the table with REAL, so that's using floating point. Now let's use the fixed-point precision: same query, same data, different table representation. And now it took about 2,800 milliseconds, twice as long. And just to show you that it doesn't matter whether it's cached, I can run the same query again; actually, let me turn on timing. It got a little bit faster, but if I keep running it over and over it should stay roughly the same, because now everything's in memory and it's just reading as fast as it can and doing these calculations. Same thing for the reals, and so forth.

All right, the next thing to point out is that when we run on the decimals, we get a sum that starts with 9998, but when we run it with the reals, we get 1.014-something in scientific notation. We can see what that value actually is a bit more easily by casting it to a decimal. So with the decimals at the top we got the 9998 number, and with the reals we got 1.001. Same data, technically the same calculation, two different data type representations, and we're getting two different results. So let's figure out what the actual correct value is. We can run a really simple Python script: I load in the same data set that I loaded into Postgres, add the two numbers together for every row, casting the values as floats, and print out the total, which starts with 998. So in this case Python is doing something to make sure it gets an accurate calculation, just like Postgres's decimal type is, which tells you the correct value is the one we're getting at the top here, which, again, you can't see me highlight. So again: if we use the built-in primitive types of the hardware and C++, we get much better performance, but we have inaccurate calculations because of rounding errors. If we care about that, then we want to use fixed-point decimals, but we're going to be much slower, okay?

Okay, so let's understand what's actually going on, how Postgres is actually implementing this. This is from the real Postgres code: it's the internal type representation they define to represent a decimal or numeric, right?
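The struct on the slide looks roughly like this. I'm paraphrasing it from memory of the Postgres source (utils/adt/numeric.c), so treat the exact field names as approximate:

```c
#include <stdint.h>

typedef int16_t NumericDigit;   /* one base-10000 digit (an int16 in Postgres) */

typedef struct NumericVar {
    int           ndigits;      /* number of digits in digits[]                */
    int           weight;       /* weight of the first digit                   */
    int           sign;         /* positive, negative, or not-a-number         */
    int           dscale;       /* display scale: digits after the decimal point */
    NumericDigit *buf;          /* start of the allocated digit buffer          */
    NumericDigit *digits;       /* the digits themselves, stored like a string  */
} NumericVar;
```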
So you can see what I was saying before: at the bottom here, the numeric digits array is essentially, almost like a varchar, the string-like representation of the actual value. And then there's all this extra stuff we're storing to say what the weight is, what the scale factor is, whether it's positive, negative, or not a number, where the decimal point goes, things like that. So something we could have stored as a 32-bit integer is now stored as, what, 4, 8, 16, 20 bytes of metadata plus whatever the arbitrary digit string is. It went from four bytes to possibly 30 or more to store the same value. And then when we look at the code for how they actually add two of these decimals or numerics together, it's way more complicated than just taking two numbers and adding them, which is a single instruction on the CPU. Again, this is from the real Postgres code, I think 9.6 or 9.5. This is just the add function, just to add two of these numbers together, and there are all these cases checking whether each value is negative or not a number, and only at the very end does it actually get around to adding the two numbers together. So in my example from before, this is all the code we were running when I was using decimals just to add two numbers together, and there are obviously the same kinds of routines for multiplication, subtraction, and the other aggregations as well, okay? Again, the main takeaway is that you want to use fixed-point precision when you care about accuracy, anything with financial data, anything with scientific data; if you don't care about those things, then using floating point numbers is the way to go. And you see the same kind of idea in machine learning: instead of using 32-bit floats for all the weights in your neural networks, maybe you can drop down to 16 bits and still get good results, because exact accuracy doesn't matter that much in that environment.

All right, so now we want to talk about how we store large values. We know how to store integers, we know how to store dates, which are usually just integers, and decimals and varchars. So far we've been assuming that all these things fit in the slot in the page where our tuple is. Typically, most database systems don't let the size of a tuple exceed a single page, but that's problematic when you have really large values, like a VARBINARY or BLOB, because those can often be larger than a single page. To handle this, we use what are usually called overflow storage pages: basically a place where you can store large values that are no longer inline with the tuple itself. So in this case, say my tuple has this attribute C, it's a VARBINARY, and it's bigger than a single page. Instead of storing the value inside the page with my tuple, I'll just store a pointer, a page ID and an offset, to an overflow storage page that tells me where my actual value is. In Postgres, they do this if the value you're trying to store is larger than 2 kilobytes. In MySQL, they do this if the value is larger than half the size of a page. And there are actually a bunch of other optimizations you can do for these overflow storage pages that you can't do in the regular tuple pages, right?
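Before getting to those optimizations, one way to picture the inline pointer that replaces the oversized value is something like this; it's a made-up sketch, not any particular system's on-disk layout:

```c
#include <stdint.h>

/* Stored inline in the tuple in place of the oversized value itself. */
typedef struct {
    uint32_t page_id;   /* which overflow page the value starts on      */
    uint16_t offset;    /* where the value begins inside that page      */
    uint32_t length;    /* total length of the value, possibly spanning
                           multiple linked overflow pages               */
} overflow_ptr_t;
```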
So in the case of Postgres, they have this thing called TOAST, I forget what it stands for, and that's where they store these large values. They can actually compress these pages on disk, because for these things it's not like you're inserting and updating them all the time: I write a tuple, I write a large value, and I'm usually not going back and updating that large value over and over again, because that's not what a database system is good at. I was also going to say, and we'll cover this when we talk about logging, that all the protections you have for your regular data in the regular tuple pages, you have to have for these overflow pages as well, because from the outside, from the application's standpoint, you don't know whether your tuple is stored across two pages. You just know that if you crash and come back, everything should still be there.

Now, in some cases you actually don't want to store large files in your database at all, because it's expensive to do. Instead, you have what's called external value storage, and it looks almost exactly the same as the large value storage on the last slide. The difference is that the pointer stored with the regular tuple is not pointing to another page managed by the database system; it's literally just a file path to somewhere accessible to the database system where the data is actually stored. So think of it this way: say I have a photo application, and my database is storing photo information. Instead of storing the really large photo files inside my database, I can just have them strewn across a directory, and inside my database I just have a pointer to where each file is. From the query execution standpoint, these external files are essentially treated the same as a VARBINARY or BLOB: the upper levels of the system don't know that you read this data from the file system, nor do they need to know. They just know that if they need to read it, internally the system knows how to open that file, stream the bytes in, and hand them off to whatever part of the system needs them. In Oracle, they have a special type for this called BFILE; in Microsoft SQL Server, they have something called FILESTREAM. I spent some time looking at other major systems, and as far as I can tell they all have something similar, but these are the two main ones I know about. The thing to point out, though, is that unlike the overflow values on the last slide, the database system is not going to be able to protect or do anything with these external files other than read them. In the case of Oracle, they don't let you manipulate or update these files at all; that all has to be managed outside of the database system. That also means you don't get transactional protections: if I manipulate this thing through the database system, commit the transaction, crash, and come back, the database is not going to know how to replay the log to recreate the changes to that external file. It's considered external. You see this a lot where database systems and their storage are really expensive, and you can put these external files on a distributed file system or a SAN outside the database system that's much cheaper.
A lot of times you see this where you're paying per gigabyte on your database system, and some vendors actually charge that way: instead of storing all your large files in the database, you just point to these external files, and you don't pay extra to store them. Again, inside the tuple we're now just storing the file path. Question, yes. The question is, for overflow pages, will each overflow page always be a fixed size? Yes. Typically in a database system, and maybe this wasn't clear from last class, if I say my page size is eight kilobytes, then every page is eight kilobytes, regardless of whether it's a logging page or an overflow page; they're always the same size. The reason is that when we saw the buffer pool manager, I had those slots where I wanted to put my pages, and I don't want variable-length slots. Now, in some systems, I think DB2, you may be able to tune this and change the page size per buffer pool, but in general pages are always the same size. So the next question you might ask is: if my database system has four-kilobyte pages, and the value I want to store is larger than four kilobytes, what do I do? We just have a pointer inside the overflow page that says, I can only store four kilobytes here, but here's a pointer to the next page where you can continue reading. And now we get into the optimizations you can do to make streaming this data efficient: if I know I'm going to store a one-megabyte value in one of these variable-length fields, maybe I want to allocate a bunch of four-kilobyte pages contiguously on disk, so that yes, there's a pointer to go to the next one, but I can read everything ahead of time without jumping around to different locations. I wish I had numbers for this, but there is a paper from, actually 10 years ago now, where folks at Microsoft basically tried to figure out the right cutoff point for deciding when to store things as large values inside the database and when to store them in external storage, and I think the number they came up with was really small, like 256 kilobytes. That seems too small. When we had the SQLite guy come here a few years ago, they had done experiments showing that for phone apps storing one-megabyte thumbnails, you were actually better off storing the thumbnails inside the database, because it was super fast to read them given that you already had the database file handle open. So there's no conventional wisdom or hard-and-fast rule for when to store things inside the database versus in external files; different people say different things. I will say, from a management standpoint, if you store it inside the database, then when I do a dump of the database, everything is always there. If I have these external files and I want to copy the data from one machine to another, I have to make sure I copy the database plus the files, and make sure all the file paths still point to the right place. So there are different trade-offs for each of these. From a performance standpoint, don't store one-gigabyte files in a database. You can, some systems will let you, I think Teradata allows something like two terabytes for a single value, but don't do that. So this was a good question: those pages are always the same size.
And for external files, it's whatever the operating system or file system actually uses; we don't care, we just know how to stream the bytes in.

All right. So as I said at the beginning, tuples are just byte strings, byte arrays. There's no meaning to them other than what the database system imparts on them by interpreting those bytes, and the way it's able to do that is through its system catalogs. The system catalog is the internal metadata about the data: it tells you what tables you have, what columns they have, what their types and names are, what the ordering is, everything about indexes and views. We need all this for query planning, because we need to know, if you do a read on a table, whether that table actually exists, and when we read tuples from that table, how to find the data we need and interpret it correctly. In general, almost every database management system stores the catalog in its own database format: it stores information about tables inside of tables. We'll see in a second that this actually makes it really easy to find information about these tables, because there's a standard location, a standard view if you want to call it that, where you can see all this metadata. But internally, what happens is the database system knows that when you query a table, that table was defined with, say, four fields, and those fields have particular types, so when I read a tuple I know what those types are. All right? So the ANSI standard, and I think it's in the SQL standard now, says that every database system that has catalogs should expose them to the user through what's called the INFORMATION_SCHEMA catalog, an information schema view. Again, this is basically all the information about tables and columns and everything else that the database system uses internally; this is just a way to get at it from your application code. As we'll see in a second, although there is a standard way to get this information, every database system that I'm aware of also has its own little shortcuts to make it easier, because writing queries against INFORMATION_SCHEMA is kind of cumbersome and long. So say you just want to list all the tables in your database. The SQL standard way is to do a select star from information_schema.tables and pass along the name of the database as the table_catalog. But Postgres, MySQL, and SQLite each have their own idioms to get this quickly: \d in Postgres, or \d+ if you need more information; SHOW TABLES in MySQL; and .tables in SQLite. And say you want to see the schema for the student table: that's \d student in Postgres, DESCRIBE student in MySQL, and .schema student in SQLite. Again, the main takeaway is that when you call CREATE TABLE, we store all this information in our catalogs, and it then gets used by the execution engine, the part that was up in the cloud in my diagram, which says: all right, if you're reading this page, it's from this table, and here's how to find the offset of the attribute you're looking for. We'll cover more about this later in the semester. But I find catalogs actually very interesting, because they vary widely in implementation, and it's something the research doesn't really cover because it's not sexy, right?
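To put those idioms side by side (a sketch; student is just an example table name):

```sql
-- The portable way, via the ANSI INFORMATION_SCHEMA:
SELECT *
  FROM information_schema.tables
 WHERE table_catalog = '<database name>';

-- Postgres:  \d              -- list tables
--            \d student      -- describe the student table
-- MySQL:     SHOW TABLES;
--            DESCRIBE student;
-- SQLite:    .tables
--            .schema student
```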
Writing a catalog for a database system is one of those unglamorous things: you need to have it, but nobody actually does it really well. MySQL used to just store files on disk, and that was how you could figure out what tables you actually have. We can do a little demo of that if you want to see it. So this is Postgres at the top; I might have to connect again. As I said, I can do a select star from information_schema.tables and get a bunch of stuff back, and I can add a where clause like table_catalog = 'pavlo', and this tells me all the tables I have; anything starting with pg_ is an internal thing that Postgres sets up for you. I can also just do \d, and it shows me all the tables I have, or \d with a table name, and it shows me what that table's schema is. There we go. All right, MySQL, same thing: I can do select star from information_schema.tables and get a bunch of stuff, it's hard to see, and in this case they're showing every possible table in the entire system. Their shortcut is SHOW TABLES, or SHOW DATABASES if I want to see what databases I have. Now, this is MySQL 5.7. In MySQL 8 they've actually fixed this, and as far as I can tell they now store all the catalog information in tables themselves, but in this earlier version they store it in the file system. So if I go to /var/lib/mysql, I see a bunch of directories here, and they match up with the databases it has: it says I have databases called information_schema, performance_schema, sys, and test, and up above there are directories with those same names. What MySQL basically does when you say "show me the databases you have" is look in this directory, and any directory it sees, it assumes is a database. So we can fake it out: let's make a directory called xxx down below, run SHOW DATABASES, and voila, xxx is there. Let's see what happens if I try to use it: nothing's in there, and if we look, there are no files in the directory. Let's see what happens if I call CREATE TABLE; I actually don't know what's going to happen here. Yeah, it didn't like it, because there's nothing there. Anyway, SHOW DATABASES shows xxx; come back here, delete the xxx directory, and xxx is gone. So the main takeaway is that Postgres, and most database systems, do a much better job of storing all the catalog information that describes what the data looks like inside tables themselves. MySQL got better at this in version 8. There's still a bunch of weird stuff they don't do correctly, which we'll talk about when we get to transactions: although the catalog looks like a regular table, and you would think it would have the same transaction semantics as regular data, in some cases it doesn't, because the catalogs are a special case in the system and they don't always get it right. Again, I'm extremely interested in catalogs, I just find them fascinating, which is sort of my own weird thing, okay? So now I want to talk about storage models.
And I think this is a really important concept, and in some ways it's one of the design decisions you'll have to consider when you go out into the real world: maybe you're not building a database system from scratch, but you do need to pick one, and the thing we're talking about here can actually make a big difference. So remember that there is nothing in the relational model that defines how we have to physically store tuples inside our database, inside our pages. What I mean by that is, in all my examples so far, when I said let's store a tuple, I showed an array with the attributes contiguous to each other, and when most people think about a row or a tuple in a database, they picture it exactly like that: one attribute after another. But the relational model doesn't say anything about this, because the relational model is a high-level logical concept, which means we can store the data in any physical way we want, as long as we expose it to the user as tuples with these attributes and values. The reason this matters is that storing everything as rows, as contiguous values the way we've done so far, may not actually be the right thing for some workloads. To understand this better, I'm going to use this sample schema. It's derived from the real software that runs Wikipedia: you can download this thing called MediaWiki, it's written in PHP, it uses MySQL, and if you look at their DDL files you'll see something that roughly looks like this; we've simplified it slightly. We're going to have three tables: user accounts, pages, and revisions. Revisions have foreign key references to the user account table and the pages table, and the pages table has a foreign key reference back to the revisions table. So users have accounts, there are pages, like articles on Wikipedia, those pages have revisions, and those revisions are created by users; that's the basic schema.

So the two classes of workloads we're going to talk about are OLTP and OLAP. OLTP stands for online transaction processing. Think of this as the kind of application where people are interacting with a website, filling out forms, submitting things, and you're recording new data. On Wikipedia I can go create an account: I fill out the form, click submit, and the database system takes that data and stores it as a new record. In these workloads, the queries are usually pretty simple; they're not doing very complex joins or very complex aggregations, because each query operates on a single entity in the database at a time. Think of Amazon: I have my Amazon account, I can add things to my cart, I can make purchases, I can make payments, and the actions I take only affect my account. The transactions and queries that execute because of my clicks on the website only touch my data, my entity. This is typically what people build first: if you're going to be a startup and you want to build something that does some kind of analytics, you need to get data first, and this is essentially how you collect that data.
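For reference, here's a rough sketch of that Wikipedia-style schema in SQL. It's simplified, and the column names are my guesses from the description, not MediaWiki's real DDL:

```sql
CREATE TABLE useracct (
    userID    INT PRIMARY KEY,
    userName  VARCHAR(64),
    userPass  VARCHAR(64),
    hostname  VARCHAR(128),     -- where the user logged in from
    lastLogin TIMESTAMP
);

CREATE TABLE pages (
    pageID  INT PRIMARY KEY,
    title   VARCHAR(255),
    latest  INT                 -- the page's latest revision (FK added below)
);

CREATE TABLE revisions (
    revID   INT PRIMARY KEY,
    userID  INT REFERENCES useracct (userID),   -- who made the edit
    pageID  INT REFERENCES pages (pageID),      -- which page it belongs to
    content TEXT,
    updated TIMESTAMP
);

-- The circular reference from pages back to revisions:
ALTER TABLE pages ADD FOREIGN KEY (latest) REFERENCES revisions (revID);
```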
You set up a website, people use it, and that inserts new data into the database. So in the case of our Wikipedia example, the queries that fall under this category are simple things like: get all the revisions for a page by doing a join, update my user account by setting the last login to the current time, and insert new records into the revisions table. These are simple operations that touch a small amount of data and operate on a single entity.

Now contrast this with OLAP, online analytical processing. These are more complex queries that read large segments of the database, or tables that span multiple entities. Again, using the Amazon example: under OLTP I make purchases, I buy items, and I'm only operating on my account; under OLAP I'm doing calculations like compute the five most-bought items over a one-month period for these geographical regions. I'm reading large segments of data to compute some kind of aggregation; I'm deriving new data from the data I collected on the OLTP side. You typically do this after you've built the OLTP side and collected your data, and now you want to infer some meaning or derive new knowledge from it. In our Wikipedia example, the sample OLAP-style query I'm showing here does a sequential scan over every single user account record, looks at the hostname, checks whether it ends with the .gov domain suffix, and then computes the number of logins for each month (there's a rough SQL sketch of both kinds of queries below). This is actually a real query you could run: you can see which government employees are going on Wikipedia and modifying things. A few years ago there was a congressman who had his assistants go modify his Wikipedia page to make it more flattering; a query like this could find people like that. Is this clear? OLTP is for transactions: ingesting new data, updating data, modifying the database. OLAP is the more analytical queries.

A high-level, really simplistic view of the difference looks something like this; it's from an article Mike Stonebraker wrote a few years ago. Along the y-axis we have operation complexity, how complex the queries are, and along the x-axis, whether the workload is doing more writes versus more reads. In the bottom corner you have OLTP, where you're doing simple operations that write data, and in the top corner you have complex operations, OLAP queries, that are typically read-only. Social networking is this weird thing in the middle because it's a mixture of both: you're updating your timeline, you're making posts on Twitter, but they're also doing calculations to figure out who you should be friends with and things like that. For our purposes we'll just focus on OLTP and OLAP.

So now, the storage model is the way we're going to lay tuples out across our pages, and how we want to do it for OLTP is actually going to be different from how we do it for OLAP. As I said, so far we've been assuming that tuples are stored as contiguous rows. The formal term for this is the n-ary storage model: you store all the values of a tuple's attributes contiguously inside a single page. Again, you can still have the overflow pages or the external files; for our purposes that doesn't matter at this point.
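Roughly what those two classes of queries look like against this schema; again, this is a sketch with the exact column names assumed:

```sql
-- OLTP: simple operations that touch one entity at a time.
SELECT p.*, r.*
  FROM pages AS p
  JOIN revisions AS r ON p.latest = r.revID
 WHERE p.pageID = ?;                           -- fetch a page's latest revision

UPDATE useracct
   SET lastLogin = NOW(), hostname = ?
 WHERE userID = ?;                             -- record a login

INSERT INTO revisions VALUES (?, ?, ?, ?, ?);  -- add a new revision

-- OLAP: scan a big chunk of a table and aggregate.
SELECT EXTRACT(MONTH FROM lastLogin) AS month, COUNT(*) AS logins
  FROM useracct
 WHERE hostname LIKE '%.gov'
 GROUP BY EXTRACT(MONTH FROM lastLogin);
```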
So, back to NSM: if I have a tuple with five attributes, I'm going to have some location inside a page where the header for that tuple is, and then all five attributes inline, one after another. This is ideal for OLTP workloads because, as I said, most of the queries operate on a single entity, and you typically read all the attributes for that entity. If I want to get my account information, I do a select star from the account table where username equals andy, and I get all my attributes. When things are inline like this, that's ideal: it's one fetch to get the page we need and one read across it to get everything we want. So it looks like this: say we have a page with four tuples; we can divide it up so the first tuple starts, then the second, then the third. No data from the first tuple is mixed in with the second tuple; only when the first ends does the second start. And this is the slotted page design we saw before. So I can represent this in a single page that's stored in my database alongside a bunch of other pages; say these are all the pages for the user account table. Each tuple has all of its values together in one place, so if I need to get the data for a single entity, it's one fetch to go get it.

Let's see how we'd actually write queries against this. Say we want to do a select star from the user account table where the username equals something and the password equals something. There's some index we're going to use to map a user account to the tuple ID, the page ID and offset, that we want. For now we don't care what the index looks like; you can just think of it as a hash table or a tree, and we'll cover this in lecture seven. But it's going to tell us, for this query, for this user account, here's the one page you need to go get, and then we know how to jump to the offset inside that page, land at the starting point, and read all the data straight across. Really simple. If you want to do an insert, that's easy too: we find a free page using our page directory, jump to some offset inside that page, and write all the data out one value after another. It's simple because we assume we have a byte buffer of the values we want to insert from the insert statement, and we just write them contiguously with a single write or memcpy into our page.

So for OLTP this works out great. For OLAP, though, not so much. Take that query we had before: we want to scan the user account table and find all the records where someone logged in from a .gov hostname. This is a sequential scan; there's no index that's going to help us, because we need to look at every single user account record, which means we're going to touch every single page. Say we grab the first page, fetch it, bring it into memory. What are we going to do? We want to execute this query, so we look at the where clause and figure out which attributes we actually need to read. We care about the hostname, and that's stored here.
So what's going to happen is we fetch this page in, scan through it, jump to the offset where the hostname is, as defined by our catalog, and check whether it matches the predicate in our where clause. And if it does match, what do we need next? We need the last login, because we want to do the group by. So as we go along checking the hostname, whenever it matches we also read the last login, to build out our hash table or, if we're doing a sorting aggregation, to fill out some intermediate buffer saying: here are the values that passed our predicate, and we can use them for the aggregation later on. So what's the problem with this, for this query, given what I've talked about so far? What data did I have to read? I only access two attributes, right? Last login and hostname. Yes, exactly. I only need two columns, but there's this huge chunk here that I had to fetch in anyway, because it's stored together with the tuple. It's unavoidable: even though my query doesn't need it, it comes along for the ride when I fetch the page. So this is all useless data. The way to think about it is that in order to read the data I actually need to process this query and produce the result, I also have to read a bunch of useless data, just because it's packed into the page alongside what I want. And that's the problem with row storage, the NSM storage model, for OLAP queries.

So the advantages are that it's really fast to insert, update, and delete, because all the data I need is right there and I can do whatever I need without fetching other pages, and it's good for queries that do a select star because, again, everything is already together. But as I said, it's bad for OLAP queries, which typically only need a portion of the columns but have to look at the entire table. The alternative is called the decomposition storage model, more colloquially known as column stores or columnar storage. The idea is that instead of storing one tuple's attributes and then the next tuple's attributes, we take a single attribute, a column, and store all the values of that column for all the tuples contiguously in our pages. Then, when we go to read data for a query, we only end up reading the data we actually need; we don't read stuff we don't need. Go back to our example from before, when we were storing things as a row store with NSM. Now we break up all the attributes into columns and store each column in its own pages: if the row store looks like this, the column store stores things like that. So I'll take the hostname column and make a page that only has hostnames: the hostname for the first tuple, then the hostname for the second tuple, and so forth, all in a single page. And I'm not going to mix in data from other attributes just because there's space; this page only stores data for this column. I'll do this for all the attributes, with different pages for each. So now, go back to the query that gave us problems in the row store: to execute the scan with the where clause on hostname, all I really need to do is get these two pages, right?
The first page has the hostnames, and I just rip through it really quickly and check which tuples actually match. Then, for the ones that do match, I go over to the last login pages and get just the values I need. In the case of the row store, I didn't know which tuples actually matched until I looked at the hostname, so I ended up reading not only columns I didn't need, but also the last login values for tuples that didn't match. Here, I only go get the last login values for the tuples that satisfied the first predicate. Is this clear? And you can use the same slotted page architecture we talked about before; it's just that now the slots point to where you can put the next value.

So what's the big problem with this, then? Yes? Right, he says that if you want to do a select star, you've got to go to a bunch of pages and stitch things back together. And it goes in reverse too: if I want to do an update or an insert, I've got to take the tuple, split it apart, and write it out across all those pages. Absolutely, yes. Anything else? As I said here: as I'm ripping through my hostname column, some values will match and some won't, and now I need to go to the last login pages and find the values that correspond to the tuples that matched, yes? Exactly. So the question, and what I was alluding to, is: how do you reconstruct the tuple? Absolutely right. There are two ways to do this: you can use offsets, or you can use embedded tuple IDs. With offsets, the way it works is that you assume the attributes are all fixed length. Say our column is an integer; the values are always going to be 32-bit integers. So if I'm ripping through the hostname column, assuming it's fixed length, and say the fifth value matches, then when I want to get the corresponding value from the last login column, I know how to jump straight to the fifth one. When you have variable-length fields, that gets tricky: you either have to pad them out so they're always the maximum possible size, or you use dictionary encoding, where you store fixed-width integers and keep a mapping table that maps those integers back to the original values. Different systems do different things. So you can use offsets; the alternative, which shows up in some of the literature but I don't think anybody actually does, is to embed the tuple ID alongside every single value inside the column itself. So for tuple zero in column A, I store the tuple ID next to the value, and the same in all the other columns. I don't think anybody does this, because you still need an index that says: if I want offset five, or tuple ID five, how do I jump to it? As far as I know, all the column stores use the offset approach.

All right, so the advantage of DSM is that for OLAP queries we only end up reading the data we actually need, and this makes things much, much faster. We'll see this later when we talk about query processing. There are also some additional optimizations we can do: if we know we're just operating on offsets into the columns, maybe we don't actually retrieve the data until the very last moment we need it.
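Here's a minimal sketch of the offset idea, assuming two fixed-length integer columns. Everything in it is invented for illustration, and in a real system the qualifying offsets would be carried up the plan rather than printed right away:

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_TUPLES 4

/* In DSM, each column's values are stored contiguously, in tuple order. */
static int32_t col_a[NUM_TUPLES] = {10, 20, 30, 40};
static int32_t col_b[NUM_TUPLES] = { 1,  2,  3,  4};

int main(void) {
    for (int i = 0; i < NUM_TUPLES; i++) {
        if (col_a[i] > 15) {
            /* Because every value is fixed length, the i-th value of any
             * column lives at offset i, so reconstructing the matching
             * tuple is just indexing the other columns at the same offset. */
            printf("tuple %d: a=%d b=%d\n", i, (int)col_a[i], (int)col_b[i]);
        }
    }
    return 0;
}
```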
We can just manipulate these offsets as far as we can up the query plan and then only go back and get the data when we actually have to show an answer to a human. For compression, there are some optimizations we can do because now we know all the values in a page are going to be from the same domain, so we can do more sophisticated encoding schemes beyond just GZIP or dictionary compression. So for example, if I know that I'm storing a sex field and it's only ever going to be one of two values, now maybe I can do run-length encoding or other things to compress that down further. And then the disadvantage is what he pointed out: any time I want to do an insert or a select star, I either have to break everything apart across the columns to store it or put it back together to produce the final result, right? Now there are some optimizations we can talk about later on for how to make inserts and updates really fast, by having a row store in front of the column store, and this is the general approach that most people use. But if you have a pure column store, you would have to split it up each time.

All right, so any high-level questions about this? Yes? How do you deal with databases that need to support both workloads? Like two separate databases? All right, so his question is: how do you support databases that need to handle both types of workloads? That's a whole other lecture we could do. There are a couple of different ways. One is that you essentially have two storage engines or two execution engines inside the same database system. MemSQL does this: you can declare a table as a row store and you can also declare a table as a column store. In their case, as far as I know, you can't automatically migrate data from one to the other; you have to dump out the data from the row store and then reinsert it into the column store. Now there are some systems that try to be clever underneath the covers and do that migration transparently. We were originally doing this for our system at CMU; we ended up abandoning it because it was too much engineering overhead. The other approach is that you basically maintain a redundant copy of the database in a column store format. This is called fractured mirrors, and this is what Oracle does. So I insert everything into my row store, and then in the background I'll make a copy of it as a column store, and any analytical query that could use it is then redirected to that thing. So there are a bunch of different approaches. The simplest way to think about it is you have a little mini row store, a delta store, that you just append changes to, and then every so often there's a background process that merges it into the column store. But in that case, you have to be careful that if anybody does a read, and they're reading something that hasn't made it to the column store yet, you don't miss it just because it's still sitting in the delta store. I'll put a rough sketch of that delta store idea below.

Do you always use twice the storage? So this question is: do you always use twice the amount of storage? If you're doing the fractured mirror approach, if you're making the full copy, then yes. It's not always going to be exactly twice as much because, again, in a column store you can compress much better than you can in a row store. But in general, if you're not compressing, in that case, yes. If you're doing the delta store, no.
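Here's that delta store sketch, a minimal Python illustration of the general pattern rather than any particular system's design: inserts append to a small row-oriented buffer, a background merge splits those rows into per-column storage, and scans consult both sides so recent writes aren't missed.

```python
# Minimal delta-store sketch (my own illustration, not any real system's API).
class HybridTable:
    def __init__(self, columns):
        self.columns = columns                        # column names
        self.column_store = {c: [] for c in columns}  # one value list per column
        self.delta_store = []                         # row-oriented recent writes

    def insert(self, row):
        # New tuples just get appended to the row-oriented delta store: fast.
        self.delta_store.append(row)

    def merge(self):
        # Background step: split buffered rows apart into the column store.
        for row in self.delta_store:
            for c in self.columns:
                self.column_store[c].append(row[c])
        self.delta_store.clear()

    def scan(self, column):
        # Reads check both sides, or they'd miss inserts that haven't merged yet.
        return self.column_store[column] + [row[column] for row in self.delta_store]

t = HybridTable(["hostname", "last_login"])
t.insert({"hostname": "host-a.cmu.edu", "last_login": 100})
print(t.scan("hostname"))   # visible even before the merge runs
t.merge()
```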
The delta store is usually much smaller, and then you migrate things to the column store, where they live permanently. I'll check the schedule; maybe we should talk about this a bit more later on.

So column stores are hot now. Pretty much every newer database system that does analytics and has been developed in the last eight or ten years is a column store. But the core idea is not new; it actually goes back to the 1970s. There's an early system called Cantor from the Swedish military. This was never released publicly; there are only one or two papers that describe it, and they're in English. They don't really describe it as a database system, because back then that term wasn't as common as it is now, and they're mostly manipulating files, but they talk about those files being in a column-oriented format. Then in the 1980s there was the first academic paper that describes what the decomposition storage model looks like from a theoretical standpoint. But it wasn't until the 1990s that somebody actually implemented what we now understand to be a column store or DSM system. Sybase IQ was an in-memory query accelerator, like the example I gave before: you have your row store where you do all your updates, and then you could use Sybase IQ to make an in-memory columnar representation, a copy of the database, to make those queries go faster. Sybase IQ actually still exists. Sybase was a database startup in the 1980s that got bought by SAP, and as far as I know people still use the system.

The 2000s are when column stores really took off. There was a big movement of a bunch of newer systems that were built from scratch not to be query accelerators or copies of the database, but to actually be the primary storage location of the database. Vertica is probably the most famous one. In full disclosure, Vertica was a startup founded by my two advisors before I started grad school; it got bought by HP and then sold off again to a holding company. Actually, Vertica has an office here in Pittsburgh and they're hiring, if you want to go talk to them. VectorWise and MonetDB are two systems that came out of Europe. But the 2000s are when people recognized that if you're doing analytical queries, column stores are the way to go. And part of the reason these took off in the 2000s is because this is when the internet was really starting to take off, and now a lot of people had a lot of data. Back in the 1980s and 1990s, only a small number of organizations, like the Fortune 10 companies, maybe some of the big banks and the big retailers, had big data sets. But now you can whip out a Twitter app pretty easily and start collecting a lot of data. So people were hitting these problems with existing systems, and this is when column stores really took off. Then, as I said, now in the 2010s, if anybody's building a new analytical database system and it's not a column store, it's a non-starter. You're not gonna get the same performance benefits you can get using the DSM storage model versus a row store. And then this is just a smattering of a bunch of column store databases that I'm aware of; there are way more than this, but I quickly put up a bunch of them. Okay, any questions about this? Column stores are hot, you'll see this all the time, but it's good to understand what's actually going on.
Okay, so the main takeaway at this point, before we start moving to the memory side of the database system, is that the storage manager ideally would be independent of the rest of the system, but in actuality that's not always the case, because we'll have to expose some information about the layout, or how we're actually storing these pages, to the upper-level parts of the system. They can then use that for optimizing our queries or doing other, more sophisticated operations on the data, rather than just treating it as a black box sitting on the file system. And then, as I said, choosing the right storage model for your database system is really important at the very beginning, because you want to make sure you can get all the optimizations and target the right sort of workload based on what you think you need to do. And what I mean by this is that it's going to be very difficult to build out a row storage system and then come back later and say, I want to be a column store. It's not just changing how you're storing tuples in the pages; there's a whole bunch of other stuff we'll see later on that is predicated on knowing whether you're a row store versus a column store. Some people try to do this with Postgres, right? There are extensions for Postgres that store data as a column store instead of a row store, but the actual execution engine itself is still row-oriented, whereas something like Vertica, which was based on Postgres, actually ripped out all the execution engine stuff and the storage stuff and built it from the ground up to be explicitly a column store, and you get much, much better performance, right? So it goes beyond just the storage model stuff, but the main takeaway here is, again, you have to know this ahead of time because there are a bunch of design decisions we're going to make later on that'll be predicated on this decision, okay?

All right, so next class: as I said, for database storage there were these two parts, and we've covered the first one. Now we know how the database system is going to represent files on disk, the actual database itself, and now we want to start talking about how we actually bring this into memory in our buffer pool and move data back and forth, okay? Any questions? All right, guys, enjoy the rain and I'll see you on Wednesday.