Where we're at in the course: we've spent all this time talking about transactions, and all this time on how to actually build indexes, so now we're going down to the bottom of the storage layer to talk about how we're actually going to store data. Tuples are no longer these abstract things we point to from our indexes; let's actually talk about how we lay out the bits. Then for the next couple of lectures we'll look at how to make storage even better, then we'll have spring break, and after spring break we'll focus for most of the rest of the semester on how to actually execute queries. Again, these are all the different pieces of the system that you need. So just as a reminder, I showed this diagram at the very beginning of the semester. This is what we're looking at in an in-memory database. We have an index, and it has pointers, which can be block IDs and offsets, that then point to some tuple at an offset within a fixed-length data block. I'm showing this as a row store, but the idea still applies if it's a column store. Then if you have any attributes that are larger than 64 bits or variable length, you don't want to store them inline in the fixed-length data blocks; you just store a 64-bit pointer to some offset within a variable-length data block. Part of today is explaining why we have to do this, why we want to do this. Actually, that's the next slide. The way to think about what an in-memory database actually is, at a high level, is that it's just a bunch of arrays of bytes, right? It's just this giant byte buffer. So what we're going to do in our system is look at the schema that was defined for our tables when you called CREATE TABLE, and use that information to figure out, when we follow a pointer and jump to a tuple's offset, how to interpret the bytes we're looking at at that offset. The schema is going to tell us this. So at the very lowest level, when we allocate memory for our database, we're just malloc-ing a big byte array, but then we impose structure on top of that as defined in the catalogs we populate when we create tables. That's essentially what we're focusing on today: what those bytes actually look like. Obviously every tuple has a header in front of it, and this is going to store all that metadata we talked about for MVCC, like the begin and end timestamps, and we'll see some other things later today, like keeping track of which attributes are null. The reason we want to store most of our data, the tuples, in these fixed-length data blocks is that it makes it super easy to figure out where one tuple stops and the next one begins: we know exactly the length of every single tuple. So back here, the block ID gives us a pointer to the starting location of the block, and the offset says something like "you want the fourth tuple" or "you want the sixth tuple." Since every tuple inside the block has the same length, because these are all fixed-length attributes, I just do simple arithmetic: I want the sixth tuple, times 100 bytes per tuple, and that's the offset I jump to.
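To make that arithmetic concrete, here's a minimal sketch of the block-ID-plus-offset lookup. The names and sizes (BLOCK_HEADER_SIZE, TUPLE_SIZE, blocks) are made up for illustration; they aren't from any particular system.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-length block layout: every tuple in a block has the same
// size, so a (block id, tuple offset) pair locates a tuple with simple math.
constexpr std::size_t BLOCK_HEADER_SIZE = 16;  // per-block metadata
constexpr std::size_t TUPLE_SIZE = 100;        // tuple header + fixed-length attributes

std::vector<char*> blocks;  // one malloc'd byte array per block

// Return a pointer to the start of the tuple identified by (block_id, offset).
char* GetTuple(uint32_t block_id, uint32_t offset) {
  char* block = blocks[block_id];
  return block + BLOCK_HEADER_SIZE +
         static_cast<std::size_t>(offset) * TUPLE_SIZE;
}
```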
So again, what we're essentially doing is taking these large byte buffers we allocate with malloc from the OS and using our schema to put structure on top of them so we can manipulate them as needed. For today's agenda, we're going to start at the lowest level with how we represent the individual types in a tuple, then go a little bigger and worry about how we lay out that data and make sure it's word aligned, and then we'll talk about how to take multiple attributes from the same tuple, or the same attribute from multiple tuples, and store them as a row store or a column store. So we start with a single attribute, then get bigger and bigger: multiple tuples, then multiple blocks of tuples. Then, hopefully with time left, we'll finish with an aspect of database systems that I find super fascinating, which is system catalogs, and we'll do a quick high-level pass over what they actually are. I actually want to do a whole lecture on system catalogs, but there's no good paper on this; we're trying to write that now, so eventually we will have one. All right, so let's jump into this. The way the database system represents a single attribute value in a tuple is either whatever the hardware gives us or some higher-level code we write in user space. At the lowest level, think of integers, the basic types you get in the SQL standard: INTs, BIGINTs, SMALLINTs, TINYINTs. We store these the same way the hardware represents them when you declare a variable in C, because this is defined by the hardware itself; there are instructions that know how to do operations on integers of different sizes. For floating-point or decimal numbers, we have a distinction between, again, what the hardware provides for us, FLOATs and REALs, 32-bit and 64-bit reals defined by the IEEE 754 standard, a universal standard for how to represent these values in hardware, versus fixed-point decimals like NUMERIC and DECIMAL, where it's all inside our database system and we have to write the code to handle them. Then for timestamps, dates, datetimes, and these sorts of time types, different systems do different things. Sometimes TIME, DATE, and TIMESTAMP are just synonyms for each other, and underneath the covers they're stored exactly the same, as either 32-bit or 64-bit integers. Most modern systems are based on the number of milliseconds, microseconds, or seconds since the Unix epoch, which is January 1st, 1970. Older systems from the 1980s, and things that run on Windows, obviously don't do that, so they have their own internal representation, or they just use whatever the operating system uses. Sometimes a system will let you say DATE without a time component and store that in a smaller size, but again, for simplicity, a lot of systems just use the same underlying representation and then expose different properties through SQL.
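As a rough illustration of the "use what the hardware gives you" point, here's what the mapping between SQL fixed-width types and native C++ types might look like in a storage engine. The widths are the usual convention, not any specific system's definition, and the timestamp encoding is just one common choice.

```cpp
#include <cstdint>

// Typical mapping from SQL fixed-width types to native C++ types.
// The hardware already has instructions for arithmetic on all of these,
// so the storage engine just stores the raw bytes.
using sql_tinyint   = int8_t;   // TINYINT   (1 byte)
using sql_smallint  = int16_t;  // SMALLINT  (2 bytes)
using sql_int       = int32_t;  // INT       (4 bytes)
using sql_bigint    = int64_t;  // BIGINT    (8 bytes)
using sql_real      = float;    // REAL      (IEEE 754, 4 bytes)
using sql_double    = double;   // DOUBLE    (IEEE 754, 8 bytes)
using sql_timestamp = int64_t;  // e.g., microseconds since the Unix epoch
```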
Then for VARCHARs and TEXT/BLOB types: if the value is smaller than 64 bits, we can just store it inline with the fixed-length data, but if it's larger than 64 bits, we store a pointer to some location in the variable-length memory pool. At that pointer there'll be a header that says here's the length of the data you're about to read, and maybe a pointer to the next piece if it's broken up across multiple chunks. I'll show a diagram of what this looks like. The thing I want to focus on is this decimals-versus-reals distinction. As I said, you can have either floating-point numbers or fixed-point numbers. The floating-point numbers are what you get from the hardware, as defined by the IEEE 754 standard, and these are going to be faster than the fixed-point numbers because the hardware has instructions to take 32-bit or 64-bit floating-point numbers and do whatever arithmetic you want on them very quickly. The standard specifies how to handle overflows, rounding, and things like that, because we can't represent exact values with these basic types. To show you what this looks like, here's a really simple C program. I declare two floating-point variables, 0.1 and 0.2. In the first case I just add them together and print the output; then I take 0.3, which I know should be the answer, and print that out with a lot of trailing decimals. When you run this on x86, neither one gives you the exact value you'd expect. Neither one is 0.3: one is 0.30000 followed by a bunch of stuff, and the other is 0.29999 followed by a bunch of stuff. That's because the way we represent floating-point numbers in the hardware can't be 100% precise, so you're going to have these issues. Now, obviously if you only print one significant digit, you'd get 0.3 and 0.3, but underneath the covers the bits are represented like this, so if I actually ask whether one is equal to the other, the comparison comes back false.
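The program itself isn't in the transcript, but a minimal reconstruction of what it likely looks like is below; this is my sketch, not the exact code from the slide.

```cpp
#include <cstdio>

int main() {
  // Two IEEE 754 doubles that cannot be represented exactly in binary.
  double x = 0.1;
  double y = 0.2;

  std::printf("x + y = %.20f\n", x + y);  // prints 0.30000000000000004441...
  std::printf("0.3   = %.20f\n", 0.3);    // prints 0.29999999999999998890...

  // Different bit patterns, so this comparison is false.
  std::printf("equal? %d\n", (x + y) == 0.3);
  return 0;
}
```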
So in a lot of database applications, especially when you're dealing with money, with things that have value, where you don't want these kinds of rounding errors, you have to switch to fixed-precision numbers, the fixed-point decimals. Some systems call these NUMERIC, others call them DECIMAL. The way to think about what happens is that the database system stores the exact value almost like a string, the exact decimal, and then maintains extra metadata that says where the decimal point is, what the scale is, whether it's signed or unsigned, all the extra things you need to guarantee that these values are exact. But the issue is that these values are way more expensive to use, because the hardware can't do this natively for us; we have to write the code ourselves. So I always like to give this demo just to show how big the performance difference actually is. I'm gonna run this over here, can you all see that? It's Postgres at the top, then MySQL, and then this is actually SQL Server running on Linux. What I did was write a simple Python script that generates a CSV file. Let me log in, sorry. So I generated this really simple CSV file, can you see that? Yeah, over here, let me close this. It's just a bunch of random decimal numbers, two per row, and I think there are 10 million rows, so it's a two-column table. I created two tables, I'll do this in Postgres, drop the old ones first, where one stores the columns as decimals and one stores them as reals, and then I load the CSV file into both. So let me create the two tables: the first one is with reals, and the second one is with decimals. Then I load both of them from the CSV file with the COPY command. It shouldn't take too long. So that one's loaded, and again, you run COPY and it tells you how many rows it loaded. The query I'm going to run scans every single row and adds A and B together. It's super simple, it looks like this, and I'm computing the SUM of the results. So it looks at every single row, takes A plus B, and computes the aggregate. I want to show you that it's more expensive to do this with the fixed-point numbers than with the floating-point numbers. To make sure there's no interference from reading things from disk, I'm going to use this nice Postgres extension called pg_prewarm, which basically goes and fetches every single page of these two tables and brings it into the buffer pool. So now everything's in memory. I still have to do buffer pool lookups, and we talked about how you have to take latches to do that; that doesn't go away. But at no point will any of these queries stall because of disk. The other thing I want to do, because this is Postgres 10, is turn off parallel workers. All right, the first query we'll do with decimals. Here's the query we're going to run; we put EXPLAIN in front of it, followed by ANALYZE and BUFFERS. What EXPLAIN normally does, without any extra flags, is return the query plan that the optimizer generates without executing the query. If I add ANALYZE, it says give me the query plan but also run the query, so you can see how long it takes. And BUFFERS is a Postgres thing that tells you how much of the data you read came from memory, from the buffer pool, versus from disk; in theory, everything should be in memory. Right, so the first query runs; there's a dirtied page in there, that's weird, whatever, and that's one page we had to read. The time it took is up there: 2858 milliseconds, so about 2.8 seconds. That's with decimals. We run the same query with reals, and now it took 1.2 seconds. So the decimal version was roughly twice as slow as the reals, because we're doing extra work to make sure we get exact results with fixed-point numbers. Let's see the same thing in MySQL.
I've already loaded the data, so we don't need to load it again. This one may actually be reading from disk. Yeah, 3.74 seconds. Let's do it again and make sure that time is right. All right, 3.6 seconds. And then we'll run it on reals: 2.3 seconds. So first of all, in this case Postgres is faster than MySQL, but again, the reals version was about a second faster than the decimal one. All right, the last one, which I just got working this morning, is SQL Server. Same thing, same data set loaded in. Let's do decimals first. You can't see the time, you've got to jump out of it. All right, it took 1.1 seconds, and then reals, 0.6 seconds. So SQL Server is faster than all of them; SQL Server is actually really good. But again, the decimal version was about twice as slow as the real one. So this sounds like you'd always want to use reals, right, not decimals, because they're faster. But no, because I showed you before that you have rounding problems. We can actually see this; let's go back to Postgres and look at what the answer actually is. Here it is with reals, one point something in scientific notation, and here it is with decimals, one point something with a bunch of other digits. Let's take the real result and cast it as a decimal so we can read it more clearly, and round it. Now you can see it clearly: at the very top it's one point five-zero-seven-three-six and so on, and the other ends in one-zero-one; clearly different numbers. The real one is not going to be 100% accurate because the hardware does rounding, whereas with decimals, because we control exactly how the summation is calculated, we get a precise result. So again, if you're dealing with money, you don't want the rounding version, because you lose stuff; you always want to use the exact one. All right, so we can quickly look at what Postgres is actually doing. This is the struct from Postgres's implementation of NUMERIC, and you can see there's a bunch of stuff in here. This is the extra metadata they maintain per value in a tuple. So instead of storing a single 32-bit real in my system, now I have four integer fields plus this other thing, which is basically like their variable-length type, a typedef defined up above, and that's where they actually store the digits of the decimal itself. And when you look at the Postgres code, here's the addition routine: here's how to take two numerics and add them together, and you see there's a bunch of stuff they have to deal with, like overflow, or one value being negative and the other positive, or one being null. We have to run all of this code to compute A plus B in my example, and do it 10 million times. That's clearly way more instructions than the single hardware instruction that adds two numbers together. So numerics you have to have, but the idea is that for all the other data types, as much as possible, we want to use the CPU instructions. We don't want our own exotic integer representations; Oracle does something like this, but in general we want the hardware to do everything. For fixed-point decimals, though, we have to do it ourselves.
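To make the overhead concrete, here's a hedged sketch of what a fixed-point decimal representation and its addition path can look like. This is not Postgres's actual NUMERIC struct or code, just an illustration of why there are so many more instructions involved than a hardware add; it also assumes both operands are non-negative and already share the same scale.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative fixed-point decimal: a scale (digits after the decimal point)
// and the digits themselves stored explicitly, least significant first. Real
// systems keep similar metadata plus the digit string per value.
struct FixedDecimal {
  int16_t scale;
  std::vector<uint8_t> digits;  // base-10 digits, least significant first
};

// Simplified addition: even this stripped-down version needs a loop with
// carries and branches, versus a single hardware ADD for a float or double.
FixedDecimal Add(const FixedDecimal& a, const FixedDecimal& b) {
  FixedDecimal out{a.scale, {}};
  uint8_t carry = 0;
  std::size_t n = std::max(a.digits.size(), b.digits.size());
  for (std::size_t i = 0; i < n; i++) {
    uint8_t da = i < a.digits.size() ? a.digits[i] : 0;
    uint8_t db = i < b.digits.size() ? b.digits[i] : 0;
    uint8_t sum = da + db + carry;
    out.digits.push_back(sum % 10);
    carry = sum / 10;
  }
  if (carry != 0) out.digits.push_back(carry);
  return out;
}
```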
This is actually a very interesting project for someone to do for the class project at the end: add support for numerics, make it as efficient as possible, and support vectorized execution on it, which I'll explain later, but basically use special CPU instructions to process values in batches. I'll talk about that later. All right, so we now know how we want to store the individual bytes for our attributes; now we have to worry about how we lay them out next to each other. Again, think of a tuple as just a byte array, a char array. When we have our schema, the sizes of the attributes we define specify how much space they occupy in that byte array. We always have our header; it's drawn kind of small here, but as we saw with MVCC, we're storing 64-bit timestamps and things like that, so the header will be much bigger than this. Then we have our 32-bit integer ID, and then our value, which is a 64-bit integer. So if I want to access this thing here, I go through my index, which gives me a block ID and an offset. I land at the location of the block header, and I do the math: the size of every tuple is the header size plus 32 bits plus 64 bits, multiply that by my offset, and now I can jump to this location. Say I want to access the first attribute; I know my header is, say, 32 bits, so I know how far to jump past it to get the single attribute I want for this tuple. Now the code that accesses this wants to do something with it; it wants to treat it as a 32-bit integer. But as I said, we're storing everything as giant byte arrays, char stars. So in the code we use reinterpret_cast to take whatever address we're pointing at and tell the compiler that we want to treat it as a 32-bit integer, and then whatever code is on the other side of that can access it as if it were a 32-bit integer. reinterpret_cast is a compiler directive, not something we do at runtime; you're just telling the compiler that you're going to read this address and treat it as, in this case, a signed 32-bit integer pointer. All right. For variable-length data, again, we don't want to store it inline, because in order to do that nice and easy math to jump to tuple offsets, we don't want anything of variable length in the fixed-length blocks; everything there is fixed length. So in this case, say we have a single attribute for our tuple and it's a VARCHAR(1024). When we jump to this thing here, we know it's always going to be a 64-bit pointer that points out into the variable-length data blocks. The value could be a single chunk, or it could be broken up into multiple chunks. The way you handle multiple chunks is that you have a pointer in the chunk's header that says: if you want the rest of this value, here's another location in the variable-length data pool. Whether that pointer is 32 bits or 64 bits, or something even smaller, depends on whether the next chunk is guaranteed to be in the same variable-length data block. For simplicity, if you always make it 64 bits, then you don't care where the second chunk gets moved around to.
If you make it something smaller, then you have to make sure the next chunk is always in the same block as the first one. Now, as we're scanning along, maybe evaluating a WHERE clause, checking whether my varchar equals this, or whether my varchar starts with this byte sequence, we hit the same issue we saw last class with B+trees: to quickly evaluate whether a tuple matches our predicate, we'd always have to follow this pointer out to the variable-length pool and read the first couple of bytes just to execute the predicate. So what some systems do is increase the size of the variable-length data pointer, double it to, say, 128 bits, and store a prefix of the first couple of characters of the value that lives down in the variable-length data pool right next to the pointer (in front of it or behind it, doesn't matter). Now when I'm scanning along looking for WHERE value = 'XYZ', I can do a quick comparison right there: does 'XYZ' equal 'Andy'? No, and now I don't even have to follow the pointer. As far as I know, for in-memory databases, HyPer is the only one that does this. You don't see it in disk-based database systems because they mostly store the variable-length data in the same slotted page as the rest of the tuple anyway. Yeah, I should have said this at the beginning: in a disk-based system, all the fixed-length and variable-length data for a tuple is usually stored together in the same page; in in-memory databases, we store them separately, so to avoid chasing this pointer we can put the prefix there. Okay, so now we know how to store variable-length data and fixed-length data, whether it comes from the hardware or from user-space code. The next thing we need to handle is how to store nulls. There are three ways to do this, as far as I know. The first way is that, within the domain of the value type, you designate a special value to represent null. Say you're using 32-bit integers: if you go look in limits.h in libc, it defines the min and max values, so you could take the minimum value for a signed 32-bit integer, defined by that #define, and have it represent null. The nice thing about this is that you don't have to store any extra space to keep track of whether an attribute is null, which you have to do in the other two approaches; you only lose one value from the domain, one fewer possible value you can store in a 32-bit integer. This is the way we did it in H-Store, and as far as I know VoltDB still does it this way; MonetDB does this as well. You don't see it that often, because there's a bunch of extra code you have to write in the layers above the storage manager to make sure nobody actually tries to insert INT32_MIN, because they may think they're storing that real value, but when they read it back they'll get NULL in SQL, which is not what they'd expect.
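A minimal sketch of that special-value approach and the guard code it forces you to write is below; using INT32_MIN as the NULL sentinel matches the description above, but the function names are made up.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Reserve the smallest int32 as the NULL sentinel: no extra storage is needed
// to mark a value as NULL, but that value is now off-limits to applications.
constexpr int32_t kNullInt32 = std::numeric_limits<int32_t>::min();

bool IsNullInt32(int32_t stored) { return stored == kNullInt32; }

// Layers above the storage manager must reject the sentinel on insert,
// otherwise a user-supplied INT32_MIN would silently read back as NULL.
int32_t EncodeInt32(bool is_null, int32_t value) {
  if (is_null) return kNullInt32;
  if (value == kNullInt32) throw std::out_of_range("value reserved for NULL");
  return value;
}
```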
So you have to add extra code to make sure nobody actually tries to insert this value, and throw an error saying it's out of bounds. You also have to handle overflow, to decide whether you wrap around when someone subtracts one from the minimum, because you don't want to accidentally land on the sentinel and have the value magically become null. The most common approach is to store a separate bitmap that keeps track, for every single tuple, of which columns in that tuple are null. Basically, think of it as living in the header of every tuple: if I have 10 columns, I have a bitmap with 10 bits, and the bit at a given offset tells you whether the column at that offset in the tuple is null. Pretty much everyone does it this way. All the disk-based systems do it this way: Oracle, Postgres, MySQL, SQL Server, everyone. This is the way HyPer does it, and it's the way we do it now. In the old system, Peloton, we used the special-value approach because we inherited a lot of the design from H-Store/VoltDB; in the newer system we use the bitmap because it's actually faster. It's faster when you're doing scans over large segments of columns, because you can just look through the bitmap, figure out what's null, and then go read the actual values. If you're looking for something equal to some value, I can do a quick vectorized lookup over this bitmap, which I'll explain later on, but basically a very efficient lookup across a lot of tuples, find the ones that aren't null, and then do the lookup on the actual data. The downside of this approach is that you're obviously storing more data: an extra bit per nullable attribute per tuple. That isn't so bad; if you think of the MVCC stuff, we're already storing two 64-bit timestamps per tuple, so this is not a big deal. The last option, which is way less common but still possible, is that instead of storing a single bitmap per tuple with flags for all the columns, you store a separate flag next to each attribute that says whether it's null or not. So I store my attribute and prefix it with a little flag in front of it. The tricky thing is that you can't just store one extra bit in front of the attribute: if I have a 32-bit integer, I can't make it a 33-bit integer by sticking a single bit in front of it. We'll explain why that screws things up in a second, but to give an example of how you have to pad this out, this is a screenshot from the MemSQL documentation, from version six; I should have checked whether they still do this. MemSQL is the only system I know of that does it this third way. The thing I want to point out is that they list the different data types from the SQL standard, and they have two columns for what the size will actually be: the size when the column can't be null, and the size when it can. The size when it can't be null is exactly what you'd expect: a tiny int is one byte, then two bytes, three bytes, four bytes, eight bytes; a 32-bit integer is four bytes, a 64-bit integer is eight bytes.
But when the column can be null, the size they have to store is, in some cases, almost double: to store a 32-bit integer that can be null, you have to store it in 64 bits. I'm taking a guess at why they have to do this; I already sort of said why before. Yeah, but why are they storing eight bytes instead of, say, five bytes? "Because the alignment of memory affects how quickly you load it." Yeah, he said exactly what we're going to talk about next. The alignment of this data in memory, again, think of it as just a byte array, affects the performance of the system when you actually load it, because this would all mess up word alignment. So who here knows what word alignment is? Okay, perfect, maybe 80% of you. Let me go over it quickly. For this third approach, again, MemSQL is the only one I know of that does it; I think it's a bad idea, but whatever. A couple of systems use the special-value approach. The bitmap is the most common, and the bitmap takes space. I also wanted to mention that sometimes you see systems with limitations on the number of columns you can have per table. Postgres and SQL Server have hard limits (Postgres's is 1600, for instance), and those numbers exist for a reason: they're tied to how these bitmaps get allocated. Oracle is 1,000 columns per table; there's no real technical reason they did it that way. It used to be 100 in the 1980s, and then some dude changed it to 1,000, had to fix a bunch of stuff, and they told me it's just too much of a f***ing change to go higher, so it's 1,000. For our system, we obviously don't want those artificial limits. All right, so let's talk about word alignment, why this is problematic, and why they had to pad things out. What I'm about to describe is not 100% accurate, because I'm going to describe alignment in terms of 64-bit words. In actuality, the hardware cares about the cache line, which is 64 bytes, and the machine word, which is eight bytes on x86-64. I'm showing the idea in terms of 64 bits because it fits on my slide, but in a real system the units are larger; the basic idea is the same. So again, our tuple is just a byte array, but we can divide those bytes up into words, and when we start storing the attributes of a tuple, those attributes can span word boundaries. Say I want to store this table here, with four columns: I have a 32-bit integer ID, which fits nicely in 32 bits here; I'm ignoring the header for now, it would be over here. Then my timestamp is 64 bits, I have 16 bits for my char, and then 32 bits for my zip code integer. Now say I want to do a lookup on the creation date. I do my arithmetic to figure out the offset of my tuple, then I look at my schema, which is stored in my catalog, and it says: for the creation date timestamp, what columns come before me? I have the 32-bit ID, so when I want to jump to this address to get the data, I jump 32 bits over. Of course, what's the problem now? The creation date spans two word boundaries. So what happens when I do a lookup on it? Well, it depends; there are three possibilities.
The first thing the hardware could do for us is be nice: do the extra reads across those two words, stitch the data you want back together, and hand it back to you as a single contiguous piece. The second possibility is undefined behavior: maybe it reads the first 32 bits, maybe the second 32, because it was only trying to do one word read; it gives you some random stuff and it's up to you to figure out whether it's correct. The last possibility is that it just rejects it: you're trying to do an unaligned access, reading across two words from a single memory location, so it throws a fault, and it's up to you to fix your code to read aligned data. x86 and the more recent versions of ARM will do the first one for you. x86 goes out of its way to make your life easier as a programmer: it keeps everything cache coherent and stitches reads across boundaries back together, so you don't have to worry about this in your application. In terms of correctness you'll get the right data, it just might be slow. So when we're building our in-memory database we want to avoid those extra reads; the hardware will do them for us, but we know it's going to suck, and we want to avoid it as much as possible. Rejecting the access is what ARM used to do in the past; it's technically still correct, your program won't produce incorrect results, it just fails because you wrote it incorrectly, while the newer versions of ARM do the stitching. And the undefined-behavior case is old, exotic hardware; some embedded devices do stuff like this, but we don't worry about it here. So what's one obvious way to fix this? Yes, padding, exactly. You know this because you did it, right? When we define our schema, remember we're a relational database, so we have this information ahead of time; the schema-less databases of the world have to figure it out on the fly because the schema might change per record, but we can prepare accordingly. We say: we have 64-bit words, the first attribute is 32 bits, the second attribute can't fit in the remaining 32 bits, so we pad the rest of that word with zeros or garbage. Then the next attribute starts at a word boundary and fits in a single word, the following two attributes fit together in a single word, and we pad the end out so the next tuple starts at the beginning of a word. Now to do that lookup on the creation date, I jump to one location, and it's one word access, one read into memory; the hardware is happy with us and it runs efficiently. And if you run perf, one of the events you can collect is unaligned memory accesses; it tells you how many times you're doing unaligned lookups, so if you're trying to figure out why your program is slow, that's a good tool for it. Callgrind won't show you this, because it doesn't know anything about what the hardware is actually doing; this is a low-level hardware performance counter.
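Here's a small sketch of the padded layout for that example schema. The field names and the use of a C++ struct are just for illustration (a real engine computes these offsets from the catalog rather than declaring a struct), but the padding the compiler inserts mirrors what the storage manager would do by hand.

```cpp
#include <cstdint>

// Example schema: id INT, cdate TIMESTAMP (8 bytes), a 2-byte char field, zip INT.
// Declared naively, cdate would start at byte 4 and straddle an 8-byte word.
// With alignment rules (here the compiler's), 4 bytes of padding are inserted
// after id so cdate starts on an 8-byte boundary.
struct PaddedTuple {
  int32_t id;     // bytes 0-3
                  // bytes 4-7: padding inserted for alignment
  int64_t cdate;  // bytes 8-15
  int16_t code;   // bytes 16-17
  int32_t zip;    // bytes 20-23 (after 2 more padding bytes)
};

static_assert(sizeof(PaddedTuple) == 24, "padding makes the tuple 24 bytes");
static_assert(alignof(PaddedTuple) == 8, "aligned to the 8-byte word");
```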
All right, so padding is one approach, and the execution engine we build above this would be aware of the padding, so it knows how to jump accordingly to find the data it wants. What's another alternative? Yes, in the back. "I guess you can reorder the attributes and put them together in one word." Beautiful, yes. So he said you can reorder things. In this case, the zip code was the last attribute, but now I move it up to be physically next to the ID field, because those two fit together in a single 64-bit word. Again, the beauty of the relational model is the separation between the logical definition of what your table looks like and its physical representation. The application just said, hey, create a table with these four attributes, and defined them in this order, but we said: we're going to reorder your attributes and put this one over here, because we know that will run faster. Now, for this example I ran out of attributes; if there were more, we could fill in the rest of the word, but going from padding to reordering here, we technically took the same amount of space. With more attributes you can pack things in more tightly, but we always want to make sure the next tuple starts at the right boundary. And again, the upper levels of the system know how this thing is physically laid out and can read it accordingly. Remember, in the relational model the tuples themselves are unordered; it's bag algebra, so if I insert tuple one, tuple two, tuple three in that order, when I do a SELECT * I'm not guaranteed to get them back in that order, because the system can store them any way it wants. I actually don't know whether relational algebra specifies that you have to get back columns in the order you declared them; I don't think it does. But as far as I know, most systems, even if they reorder columns internally, will give you back results in the order you specified, because otherwise it would throw people for a loop. A lot of times you'll write application code where, instead of accessing a column in the result set by name, you access it by the offset you defined when you called CREATE TABLE. So the point I'm trying to make is that even though in theory we could return columns in any order, we always put them back in the order you specified in CREATE TABLE. So when we were building our new storage manager for our new system this summer, we had to do all of this; we had to go through and make sure everything was actually word aligned, as shown in the sketch below.
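One common way to get a layout like that automatically is to sort the attributes by size, largest first, when the table is created, so every value starts on a boundary that is a multiple of its size. This is a hedged sketch of that idea, with made-up types; it is not the exact policy any particular system uses.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Simplified catalog entry for one column of a table.
struct ColumnInfo {
  std::string name;
  uint32_t size;         // size in bytes of the fixed-length value
  uint32_t logical_pos;  // position the user gave in CREATE TABLE
};

// Physically order columns largest-first so each one lands on a boundary that
// is a multiple of its size, minimizing padding. logical_pos is kept so result
// sets can still be returned in CREATE TABLE order.
std::vector<ColumnInfo> PhysicalOrder(std::vector<ColumnInfo> cols) {
  std::stable_sort(cols.begin(), cols.end(),
                   [](const ColumnInfo& a, const ColumnInfo& b) {
                     return a.size > b.size;
                   });
  return cols;
}
```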
To give you an idea of the performance difference you can get, here's a simple microbenchmark one of the students ran over the summer; Wayne was helping with this project. This is the very first result, without doing any alignment. The workload just tries to insert tuples as fast as possible, so the performance metric is the amount of data inserted per second. Without any alignment, we were getting something like 500 kilobytes per second. Then we did two rounds of optimizations. I think the first one was padding and the second was reordering plus padding; I don't remember exactly, but we went from 500 kilobytes per second to 800 megabytes per second just by changing how the attributes were laid out. So alignment makes a big, big difference, and every in-memory database will do this. The way we do it is almost like a bin-packing problem: we try to figure out how to reorder the attributes to take the least amount of space while keeping everything aligned. Okay, any questions about layout or type representation? All right, so let's jump into the storage model stuff. We now know how to store single attributes within a tuple, and we know we want to lay out memory contiguously so that we're word aligned. Now let's talk about how we want to represent multiple tuples. We can store them either in a row-oriented format, where all the attributes for a single tuple are together, or in a column-oriented format. I'll discuss these two first, and then I want to talk about some approaches people have pursued to try to get the best of both worlds; these are called hybrid storage models. As a spoiler, that's what Peloton used to do in our old system; we threw all that away, and now we're only a column store, and I'll explain why in a second. The paper I had you read is a bit old, but I like it because it really gets at the core differences between row stores and column stores. It wasn't in the context of in-memory databases, which is fine, but the basic ideas are still very sound; I really like this paper a lot. Daniel Abadi won the SIGMOD best dissertation award for his work on column-store databases, and he's now faculty at the University of Maryland. I love this paper; it's really easy to read and describes all the core issues we care about. My biggest problem is the title: the fact that they put the word "really" on a line by itself. They should have put the line break after the colon and kept "how are they different" together. But whatever, it's a great paper besides that. All right, so the n-ary storage model, the row-oriented storage model, is what people normally think of when they think of a database. When we teach the intro class, or when people describe databases, they're almost always describing the NSM scheme. The basic idea is that all the attributes for a single tuple are stored contiguously with each other: my tuple begins, I store all my columns, and I don't start the next tuple until I've stored all of that tuple's data. There may be pointers out to the variable-length data, which is kept separately, but within the fixed-length data all the columns are together. We talked about the difference between OLAP and OLTP at the beginning of the course. The row storage model is ideal for OLTP or transactional workloads, because those queries access a small number of tuples and usually want all the attributes. So when I do a lookup and say, get Andy's tuple, say I'm logging into Amazon or whatever, it's a single jump into memory at an offset to find the starting location of my tuple, and then I read across every single column all together, and that may all fit in a single cache line.
The other aspect of this, which we'll talk about later in the semester when we cover query processing, is that the way people typically execute queries over a row storage model is tuple-at-a-time, because in OLTP workloads I only care about getting a single tuple or a small number of tuples; I'm not doing large scans over an entire table segment. The way you actually store a row storage model is with heap-organized tables, which is essentially what we've been talking about so far: these fixed-length data blocks, large chunks of memory where the tuples are contiguous. You can also store them as index-organized tables, where the leaf nodes themselves, in your B+tree or skip list or whatever you're storing your database in, are where you actually store the tuples, but the tuples are still laid out in a row-oriented manner. That's what MySQL does if you use its row store. I'm not saying one is better than the other; these are just two different ways to organize things, and it depends on the workload. So the advantages of NSM are that it's fast for inserts, updates, and deletes, because you go to a single memory location and get all the data for a single tuple, and you can use index-organized storage in a way you can't really do with a column store. Now, it's not good for OLAP, because when we start scanning large segments of the table, an analytical query is usually not looking at every single attribute in a tuple. In an in-memory database we're not worried about disk reads anymore, but still, as I scan along, each cache line fill I do to get at a tuple brings in a bunch of crap I don't care about. Maybe I'm only doing a computation on a single attribute, but my cache line fill brings in all the attributes for that tuple, and that's wasted space. It's the same idea as wanting to reduce disk I/O: here we want to reduce the number of memory fetches, and we can't do that with a row store if we're running analytical queries. This is where the column store is better, because I organize all the values for a single attribute, across all tuples, contiguously in memory. I still have to worry about layout and make sure everything is word aligned, but now, within a stride or block of memory, I just have all the values for one attribute. So if I'm scanning and my lookup only touches one attribute, it's one jump into memory to find the starting location, and then I rip right across it, only fetching data from memory into my caches that actually belongs to my query. Again, we'll talk about different query processing models later, but the vectorized, vector-at-a-time execution model works better for column stores with this approach; I'll explain what that is later.
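To contrast the two layouts concretely, here's a toy sketch of the same three-column table stored NSM-style versus DSM-style; the names are invented for illustration.

```cpp
#include <cstdint>
#include <vector>

// NSM (row store): one record per tuple, all attributes contiguous.
// Great for "give me all of tuple i" -- one memory location, one cache line.
struct Row {
  int32_t id;
  int64_t balance;
  int32_t zip;
};
std::vector<Row> nsm_table;

// DSM (column store): one array per attribute, values contiguous per column.
// Great for "sum balance over all tuples" -- scan one array, touch nothing else.
struct DsmTable {
  std::vector<int32_t> id;
  std::vector<int64_t> balance;
  std::vector<int32_t> zip;
};

int64_t SumBalances(const DsmTable& t) {
  int64_t total = 0;
  for (int64_t b : t.balance) total += b;  // only reads the balance column
  return total;
}
```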
Now, for some background on column stores: it seems obvious now that you'd want to organize data this way for analytical queries, but it's really only in the last, what is it, 2019, 15 or 20 years that people recognized column stores are the way to go. Some of you are 20-something years old, so this was when you were in kindergarten, but you have to understand that back in the early 2000s, column stores weren't really a thing. Now they're everywhere, but back then they were almost unheard of. The very first column store goes back to the 1970s, a thing called Cantor, which came out of the Swedish military as a prototype system for processing data. It wasn't really a full database system as we think of it today; it didn't support SQL, it was more of a file-processing, batch-processing system. But it organized data as columns, because it recognized that if you're doing analytical operations, going through single columns at a time instead of having to look at the entire tuple is much faster. In the 1980s came the first known paper proposing what's called the decomposition storage model, the column-store storage model. In the 1990s we see the first commercial column-store system, Sybase IQ. It wasn't a standalone database system; it was more like a query accelerator, an in-memory query cache you could buy for Sybase. You would have your regular Sybase row-store system, and you could buy this thing and put it in front to try to accelerate analytical queries. This is similar to the fractured mirrors idea, which I'll describe in a second. In the 2000s, when column stores came into vogue, Vertica was the commercial implementation of C-Store, the system from the paper you read; VectorWise was the sort of better version of MonetDB, which we'll cover later; and MonetDB is an academic system, actually pretty good, that came out of CWI in Europe. These were the first specialized column-store systems, and then everyone recognized that column stores are the way to go, so in our decade every single major database vendor has a column-store add-on, and there are a bunch of standalone systems entirely based on a column-oriented architecture. If you're building an OLAP system today, whether it's in-memory or on disk, and you're not a column store, you're not a contender; you just wouldn't do it. All right, the one thing we need to worry about, though, is how we find our tuples in a column store. In the row-store case, when I did my index lookup, I got back a block ID and an offset, so I knew how to jump to the starting location of the tuple, and all the attributes were stored contiguously. But now, in a column store, the starting locations for the different columns are in different regions of memory. So we need a way, as we're scanning along or as we jump around to find individual tuples, because we still want to support that, to reconstruct the tuple and put it back together. The two approaches are to use fixed-length offsets or to embed tuple IDs with the values. Obviously, as I've already spoiled, fixed-length offsets are the way to go; that's how everyone does it. In the paper you read, they describe the embedded-ID scheme as what Oracle does, the "System X" thing: they can't use Oracle by name, so it's called System X in the paper, but it's Oracle.
The reason I know this is that the third author, Nabil, was an Oracle DBA in Boston that they would always hire to help with papers. So the way Oracle would do vertical partitioning, not the way Oracle does a real column store, which we'll see in a second, but the way they would approximate a column store with vertical partitioning, was to prefix each value with the tuple identifier. So I'd have a 32-bit identifier saying this value of column A is for tuple zero, this value of A is for tuple one, and I'd store that over and over again for every single column. The better way to go, as we already said, is fixed-length offsets, because now as I'm scanning through, I know the starting memory location, I know what position I'm at, and I know the size of every single value because it's fixed length, so it's simple math to figure out my offset. If I'm at tuple two in column B and I want tuple two in column C, I just do the arithmetic to jump down there. Yes, in the back. "When you have a null attribute and you're using the bitmap, where does that bitmap get stored?" Right, so the question is: if you're using the second approach to store nulls, the null bitmap, but now I'm using a column store, where do I store that bitmap? You treat it as a separate column that's stored in the same block as these columns here. Same thing; it's fixed length because it's one bit per tuple, so I know how to jump to any offset. Does that make sense? Okay. So in the paper you read, they talk about a bunch of extra stuff about query processing that we're not going to cover here: late materialization, columnar compression, which is the next lecture, and vectorized or block-based processing models. The first and third we'll cover when we talk about query execution, and compression we'll cover next class. The main takeaway is that the strategies you use to store data, compress data, and run queries on column-store data are dramatically different from how you'd do it in a row store, and those strategies let you take advantage of the fact that you're storing things as columns. Part of the reason I had you read that paper is that, although it covers a bunch of stuff we won't get to in this lecture, it lays out the differences between the two quite clearly; I like it a lot. So what are the advantages of a column store over a row store? The key one we care about is that we reduce the amount of wasted work, because we only read data we actually care about for our query. Going back here: I have four columns, and if my query only needs to access two of them, like B and C, I don't worry about A and D; I never even read them, I don't pollute my cache, I don't spend time crunching on them. I can do that easily because we're a column store.
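A minimal sketch of the fixed-length-offset scheme described above: because every value in a column has the same size, the same position works across all columns, so reconstructing a tuple is just index arithmetic, with no tuple IDs embedded next to the values. Names are illustrative.

```cpp
#include <cstddef>

// Each column is a contiguous array of fixed-size values in its own region of
// memory. The tuple at position `pos` is simply element `pos` of every column.
struct ColumnBlock {
  char* data;              // start of this column's values
  std::size_t value_size;  // fixed size of each value in bytes
};

// Compute the address of the value for tuple `pos` in one column.
// Reconstructing a whole tuple means calling this once per column at the same pos.
char* ValueAt(const ColumnBlock& col, std::size_t pos) {
  return col.data + pos * col.value_size;
}
```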
Next class we'll see how we get better compression, but the main way to think about it is that all the data stored in a single column is going to be quite similar. Say I have a zip code column: there are only about 40,000 zip codes in the United States, so I know every single value in that column is going to be one of those, not anything from zero to two to the 32. There are a bunch of compression techniques I can use to take advantage of that. The downside is that, because the data for a single tuple is decomposed across these different columns, in theory it's going to be slower for OLTP. I say in theory because it's maybe not as bad as I originally thought, but it will be slower for point queries, inserts, updates, OLTP-style operations, because if I need to stitch a tuple back together, say a SELECT * where I want all the attributes, I have to do separate memory fetches to find the pieces of the tuple and put them back together. Or if I'm doing an insert, I have to take my tuple, break it apart, and scatter it into the separate columns, whereas in the row-store case I jump to one memory location and do a single memcpy. So let's talk about the motivation for hybrid storage. When we talked about HTAP systems, the idea was that we want to run transactions and analytics on the same database instance: transactions come in and update the database, and then we want to analyze that data as we ingest it and extract new information. The common access pattern in these applications is that data is usually hot when it first enters the system, and hot means it's more likely to be accessed again by a transaction in the near future. Then over time it cools off: you're less likely to update it, but you still want to access it through analytical queries. So the basic idea is that if you can recognize when data is hot and when it's cold, you may want to store the hot data as a row store, because that's faster for the transactions, which are super latency-sensitive, and then, as it cools off, convert it to a column store, because that's faster when you run analytics on it. That's essentially what the hybrid storage model is trying to do: get the best of both worlds. You get new data into the system as fast as possible by storing it as a row store, and then you take advantage of compression and faster query execution for the colder data by turning it into a column store. There are two basic approaches to implementing this, and now we're talking about actual implementation. The first is that inside the system you have two separate execution engines, one for row-store data and one for column-store data, and when queries show up you figure out which side they need to run on: one engine controls the data stored as rows, the other controls the data stored as columns. From the outside, from the application's standpoint, it looks like a single database instance; you don't know there are two separate engines.
You have a connection, you throw SQL queries at it, and there's this little middleware piece in front that says: all right, this is an OLAP query, you go here; you're an OLTP query, you're updating something, you go over there. I'll show different ways to do that next. The other approach, which again is what we did in Peloton before we threw it away, is to have a single architecture that can handle both kinds of operations on the same data: a single code base, a single execution engine that can do both. For the separate-execution-engine approach, at a high level, one way to think about it is that you essentially have two different database systems running simultaneously inside the same system. From the outside world it looks like a single logical database, but underneath the covers it can store things quite differently, and any time you have a query that spans both engines, you need some mechanism or coordinator above them to make sure everything is synchronized. For SELECT queries it's easy: you run one query here, one query there, and stitch the results together to produce the final answer. If you're doing updates, you may have to update both sides, and you have to use something like two-phase commit to make sure everyone is in sync. There are two ways to do this. The first is called fractured mirrors, which I mentioned before, and this is what you get if you buy the column-store accelerators from Oracle and IBM: Oracle calls it the In-Memory Column Store, and IBM calls it DB2 BLU, B-L-U. These use the fractured-mirror approach, while SAP HANA uses the delta-store approach, which will look like the time-travel tables from MVCC. All right, so with fractured mirrors, the basic idea is that we store a second copy, an entire copy of the database, in the column-store layout, and any time there's an update on the row-store side, we propagate it to the column-store side. Think of Oracle: you give Oracle money and download it, and Oracle by default, the original Oracle, is a row store, an NSM system; that's the primary copy of the database, and all your transactions go there and update things. Then there's a background process that recognizes the changes being made and applies them to a column-store copy that lives in memory. Now if an analytical query comes along, the system can recognize: oh, I can run that on my column store, I have the data you're looking for over here, and it runs the analytical query on that copy. It's much faster, one because it's in memory, and two because it's a column store. In Oracle's case, and I actually don't know what IBM does, this copy is ephemeral or transient, meaning if the system crashes it gets blown away and you just rebuild it when you come back up. And they have a way to invalidate things: if I update a single tuple on the row-store side, they make sure the stale copy is removed from the column store before any analytical query touches it.
And again, if there's an analytical query that wants to span data, or wants to look at the entire database and you haven't propagated all the changes from one side to the other yet, then the system knows how to go get that data from the row store side and merge or coalesce it with the results you've collected from the column store side. This is very expensive. I don't know how much it costs, but there's a reason why Larry Ellison has his own superyacht, right? All right, so the other approach is the delta store. Instead of having the column store side be a copy of the row store side, the column store side is actually where you store all your cold data. So again, there's this background process where you take data that's stored on the row store side, at some point it's cold, you say I'm not gonna update it anymore, and you migrate it over to the column store. When you bring it over there, it is no longer on the row store side, right? Any transaction still always goes into the row store first. So again, think of the time travel table from SAP HANA: they sort of had two separate tables, one with older versions and one with the latest versions, right? You could think of the row store side as always holding the newest versions for MVCC and the column store side as always holding the oldest versions, except that when you create the old version you convert it to a column store. That's not exactly how they work, but at a high level it's the same thing. What I'm describing here is also not specific to in-memory databases, right? In the fractured mirrors case, the mirror was in memory and the primary side was on disk. For this one, HANA is entirely in memory, so both sides are in memory, but Vertica does this as well, and Vertica is a disk-based system. For Vertica, the column store side is all on disk, and the delta side is backed by disk because you have to log it, but it hangs out in memory. And again, you have some mechanism to run queries on both sides and coalesce the results. All right, I'm gonna jump through this really quickly. We will cover this later on when we talk about larger-than-memory databases, but basically there's a bunch of stuff you have to do to figure out what's cold, right? The easiest thing is to have the DBA tell us. MemSQL does this: in MemSQL the row store and the column store are completely separate tables, and if you wanna go from the row store to the column store, the DBA has to pull data out of one and put it into the other. There are also mechanisms to track the access patterns of queries on tuples, and you can compute this either offline or online. Again, we'll cover this, I think, in two weeks. But note that with MVCC we're already keeping track of timestamps for when tuples are accessed or modified, so we can use that to figure out what's cold and what's hot. So let me tell you what we used to do and say why it sucked, but it is what it is. What we used to say is that you actually don't wanna maintain two separate execution engines; you want a single engine that's flexible enough to operate on both row store and column store data together within a single query. It didn't care what the layout actually was. In the case of the fractured mirrors from Oracle or IBM, those are actually two separate engines running that have to combine results. In the case of SAP HANA, there are also two engines, because at least in the original version they bought a bunch of database companies, put them together as this Frankenstein monster, and called it HANA.
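For the MemSQL-style setup where the row store and the column store are just two separate tables, the DBA-driven migration described above is plain SQL. A minimal sketch, with made-up table names and a made-up cutoff predicate:

```sql
-- orders_hot is a row store table, orders_cold a column store table
-- (hypothetical names). Cold tuples are copied over and then removed
-- from the hot side, ideally in one transaction so a concurrent query
-- never sees a tuple in both places or in neither.
BEGIN;
INSERT INTO orders_cold
    SELECT * FROM orders_hot
     WHERE order_date < DATE '2019-01-01';
DELETE FROM orders_hot
 WHERE order_date < DATE '2019-01-01';
COMMIT;
```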
It was like TREX, P*TIME, and MaxDB all stitched together, and they did two-phase commit to keep them in sync. The newer version doesn't do that, but the old version did. So we argued that the better way to do this is to have a single engine that can handle both layouts. What we would do is have a mechanism that would recognize how queries are accessing data, right? Here's what columns you care about, here's the hot data, here's the cold data; then we would store the hot data as a row store, and the cold data we would break up as a column store. And again, this is all transparent to you as the user. You just define your table; you don't specify whether you want things as a row store or as a column store. We decided which portions of the data got which layout for you automatically. The way we did this, and you'll see it if you look at the old Peloton documentation or the Peloton papers I think we'll end up reading, is what we called the tile architecture. The idea was that we would have these different tile groups within a single table, and you can think of these as blocks of the table. So one block, tile group A, would store things as a row store, and another block would store things as a column store, and in this case here, these two attributes were used together often, so we would combine them together. Then there was this additional header that would tell us what the layout is for every single column we're looking at. So when we executed a query, it would get back a result set which ended up just being pointers, offsets into these columns. We could pass this result set, this vector of offsets, up into different parts of the engine, and it would then do the translation to figure out that the data you want at this offset, given the way the tile group is laid out, is at this memory location. So this was all abstracted away from the actual code that processes queries; we had an indirection layer that could do the mapping for us. So I will show you a graph here that actually does work. For this one, we're running a workload that has a diurnal pattern of scans followed by inserts. Think of it like you're a business: you insert a bunch of data during the day, say you're a trading company on the stock market, nine to four you're trading so you're doing a lot of inserts, then at night you're doing risk analysis so you're doing a bunch of scans. So we have scans, inserts, scans, inserts, back and forth. In the row store case, the inserts are fast and the scans are slower. In the column store case, the scans are faster but the inserts are slower. The idea with the hybrid layout, the adaptive layout in Peloton, is that everything always started off as a row store, but then as we saw a bunch of scans on it we would recognize that you're scanning this data and not updating it, so we convert it to a column store, and over time that made the scans get faster. And for the inserts, we always inserted things as a row store, so we got the same speed as the pure row store. But then when we did the scans, in this case we were doing better than the pure column store, because we could store attributes that were always used together inline in the same tile group, and that improved cache locality; that's why we were doing better here.
Right, so I saw this graph and thought, this is great, we wanna do this; it turns out it was actually a pain to maintain and implement. Every student that came along to work on the system would look at this pile of stuff and couldn't figure it out. So we decided to abandon it, and the system you guys are working on now is actually a pure column store. The justification is that in the old version of Peloton we were doing append-only version storage in MVCC, not delta storage. So every single time you updated a tuple you would create a new version that was a complete copy of that tuple. In that case, when you're doing scans, you're scanning over a bunch of stuff you actually don't need because it's not visible to you, right? And making things a column store sort of helped with that. We are now a pure column store with delta versioning. In that case, you can think of the delta versions as almost the same thing as a row store. If I do an update, I still have to split my update up across the different columns, but I'm only updating the fields that I modify, right? So if I have a thousand columns and I only update two of them, I'm just doing two memory writes. In the old Peloton code, because we were doing append-only storage, you had to write out all 1,000 columns because you're making another complete copy of the tuple. There are some other tricks we think we can do to speed up the version chain operations for transactions in the newer system, but we're already much faster than what the old system could do, so I don't care, right? So I think a well-written column store with delta versioning is the right way to go, and you don't have to do this hybrid stuff. And from a software engineering standpoint this is way simpler, right? Actually, going back to this indirection layer: it's a nice abstraction, but we paid a penalty for it, right? Because we had to follow these pointers and do additional lookups, whereas in the current code we just jump right to the memory location to get the data we need, okay? All right, so I have like five minutes, so I'm gonna teach you guys catalogs in five minutes, okay? Again, I could do a whole lecture on this, but there's no paper for you to read, so this is the quickest brain dump I can give you, right? Almost every single database system out there stores the system catalogs in tables themselves, right? You sort of eat your own dog food. The system catalogs are where you store metadata about the database: you keep track of all the tables you have, all the columns, all the attributes, everything, right? The issue is when you actually want to use this internally inside your system. You're storing a bunch of things as tables, but you don't wanna have to do lookups into those catalog tables by running SQL queries, because that would be super slow. Say I have a select statement, select star from table foo. I need to do a lookup in the catalog to figure out what the string foo maps to: where in memory is this table actually being stored? I don't want to run another SQL query against the catalog, inside my original query, just to figure out where that thing is. So you have an abstraction layer on top of this that allows you to manipulate the catalogs more directly. I sort of think of it as an object-oriented interface to the catalog tuples themselves, right?
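In Postgres you can see that the catalogs really are just tables by querying them directly; roughly something like this, where pg_class and pg_attribute are real catalog tables and foo is a placeholder table name:

```sql
-- One row per table/index/sequence lives in pg_class.
SELECT oid, relname, relkind
  FROM pg_class
 WHERE relname = 'foo';

-- One row per column of foo lives in pg_attribute
-- (attnum > 0 skips the hidden system columns).
SELECT attname, atttypid, attnum
  FROM pg_attribute
 WHERE attrelid = 'foo'::regclass
   AND attnum > 0
 ORDER BY attnum;
```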
So bootstrapping the catalogs is a bit tricky, right? You need a mechanism that doesn't rely on SQL to install a bunch of stuff, like here's my database, here's the user names I need to have, in order to turn the thing on. So every system that stores the catalogs in tables themselves has to have a separate initialization function that does this. The way Postgres does it is that they generate macros from the actual source code itself, and then they have an interpreter for these macros that reads them and populates the catalog, which I think is a nice way to do this. What's really awesome about this, and I haven't tested the commercial systems, and MySQL and Postgres don't do this 100%, is that if all your catalogs are stored as tables and you manipulate those tables with transactions, your catalogs become transactional. And this makes your life way easier for a bunch of stuff. When you start making changes to your database, like I wanna add a column, if that's done in the context of a transaction, then under snapshot isolation no query that started before can ever actually see that column, even though it's physically stored there. So you don't need extra protection mechanisms to make sure nobody reads data from tables they shouldn't be reading, as long as you always go through your catalogs and your catalogs follow the semantics of serializability or snapshot isolation. So when we wanna do schema changes, there's actually a difference now in how hard it is depending on whether you have a row store or a column store. Adding a column in a column store is super easy: I don't care about all the other columns, I just create a new column segment, poof, I'm done, and I update some catalog information to say if you want this column, here's the offset to it. In the row store case, you gotta scan every single tuple, copy it over with the new column added, and at some point invalidate the old one. So it's almost like a select followed by a delete, or really an update where you're making a physical change. There are tricks you can do to do this lazily: if you have a default value, then maybe you don't actually update tuples that already exist; any new tuple that gets added has the new column with the default value in it, and if you read an old tuple you just report the default because it's not set yet. There are optimizations like that you can do to speed this up; for the column store it's super easy. Same thing with drop column: if it's a column store, you just blow away that column segment and it's gone, although obviously you need to coordinate with the garbage collector to make sure you're not doing this while any active transaction is reading it. Again, if everything is running under MVCC this is easy to do. For drop column in a row store, you could copy the tuples into a new region the same way you did for add column; the alternative is to just mark the column as deprecated, so any new tuple doesn't have the column, and as old tuples get garbage collected you go ahead and remove it. This is essentially what Postgres does, and I think other systems do this too.
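Here is a small sketch of those schema-change ideas in SQL; the table and column names are hypothetical, and the lazy-default and logical-drop behaviors are what the lecture describes rather than a claim about any particular engine:

```sql
-- Because the catalogs are transactional, a schema change can be
-- rolled back like any other write: nobody ever observes the column.
BEGIN;
ALTER TABLE foo ADD COLUMN discount NUMERIC DEFAULT 0.0;
ROLLBACK;

-- The lazy-default trick: existing tuples are not rewritten; old rows
-- report the default at read time, and only newly inserted tuples
-- physically carry the column.
ALTER TABLE foo ADD COLUMN discount NUMERIC DEFAULT 0.0;

-- Drop can be purely logical: mark the column deprecated in the
-- catalog and let garbage collection or later rewrites reclaim space.
ALTER TABLE foo DROP COLUMN discount;
```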
Change column depends on the change you're making. If, say, you're changing something from a 32-bit integer to a 64-bit integer, that means you need more memory, so you have to reallocate the space and copy things over. But if you're going from 32-bit to 16-bit, maybe you just leave the old 32-bit values where they are and accept some wasted space for now. Indexes are still really tricky, even with transactions. There's a bunch of different ways to do this in a non-blocking manner, but the easiest way to build an index is: you lock the table, make sure no transaction is touching it and no other transactions are running, do a complete sequential scan, populate the index, and then unlock it. You obviously don't want to do that because it'll be slow and block transactions. So there's a bunch of techniques where you can do this in an online or concurrent fashion. Postgres and MySQL and Oracle all do different things, but the basic idea is that you scan through the table and populate your index while keeping track of any changes that transactions make to that table as you're running, so once you've built the index you go back and fix it up based on the changes you missed, right? At a high level, that's what you do. For drop index, all you have to do is logically drop the index in the catalog, right? That basically means deleting the entry in the catalog table, and then it's gone: no query that comes after you, if you commit, can ever see it again, and at some point the garbage collector says, well, I know there's no query touching this index, so it's safe for me to go ahead and remove it, right? You also have to make sure you invalidate any cached query plans so that no cached plan can still do lookups on it. So catalogs are super interesting. We're in the process of still building this in our system now. We decided not to use any of the old Peloton code because the way we were doing catalogs was kind of hacky; they were still stored in tables, but it wasn't 100% the way I wanted. So we're still working on the new version now, and a bunch of these schema change problems are actually super interesting, super complicated, and would make a great project for the class as well. All right, the last thing I wanna talk about is sequences, which I also find super fascinating. Sequences are the auto-increment numbers. If you're familiar with MySQL's auto-increment or Postgres' SERIAL, you can define a column to have this auto-incrementing number, so every single time you insert a new tuple it increments a counter by one and guarantees that you always get a unique number, right? What's super fascinating about sequences is that although we store them in the catalog with all of our other metadata about the database, they don't actually have the exact same transactional semantics you would expect of regular tuples. And this is because sequences always need to be marching forward, always incrementing; you never want to roll one back, even if the transaction that incremented the counter failed. Right? So let me give a quick demo, and I realize we're out of time, but let's do this. We're gonna do this in Postgres; Postgres is at the top. I'm gonna kill these other ones, sorry. Okay, so what we're gonna do is create a table that has one key. Kill this too, sorry. All right, we have a table, xxx, that has one attribute, and we're using SERIAL.
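Reconstructed from the narration that follows, the demo is doing roughly this in psql (xxx_id_seq is the sequence name Postgres generates for a SERIAL column):

```sql
CREATE TABLE xxx (id SERIAL);      -- SERIAL = integer column + a sequence + a column default

SELECT nextval('xxx_id_seq');      -- 1
SELECT nextval('xxx_id_seq');      -- 2
SELECT nextval('xxx_id_seq');      -- 3

INSERT INTO xxx VALUES (DEFAULT);  -- default calls nextval(): id = 4
INSERT INTO xxx VALUES (DEFAULT);  -- id = 5

BEGIN;
INSERT INTO xxx VALUES (DEFAULT);  -- id = 6
ROLLBACK;                          -- the tuple disappears...

SELECT nextval('xxx_id_seq');      -- ...but the sequence kept the increment: 7
```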
So SERIAL is a shortcut in Postgres. It says this thing is an auto-increment column: it creates a sequence for it and then sets the default value for that column of the table to be the next value of the sequence. Right, so this is getting the description from the catalog of what this column, this table, looks like, right? I have here my id column, and it says the default value is the invocation of this function here. nextval gets the next value for a sequence and increments it, and Postgres automatically fills in the name of the sequence it created for me: xxx is the table, id is the column, and then it tacks seq onto the end, right? So I have this nextval function here, right? When I invoke it, it gets the next value and then increments the counter. So I do this again, I get two; I do it again, I get three, right? So now if I insert into the table, and for this one I'll do an insert and set the value to DEFAULT, that just tells it to use whatever the default value is as defined in the catalog, and I do a commit and a select star from the table. I should get the next value after three, because that's where I left the sequence, right? If I insert another tuple, it'll have five, right? So again, it's updating this counter in the catalog, and we have to write it out to the log, because if I crash, I wanna make sure my sequence picks up where it left off. So the reason I find this fascinating is what happens if I run this in a transaction, right? Say I run the insert into xxx; I should see that I now have a six, right? I roll back; the tuple's gone, as expected, because I rolled back. But what should I get back for nextval? Seven, right? So even though my transaction said roll back any changes, I sort of leaked a change outside my transaction, because the sequence is maintained separately, right? And even though my transaction aborted, I still have to write a log record out to disk saying the sequence is now at six or seven, right? So that if I come back after a crash, I get the correct value. This is why I find these things super fascinating: they work completely differently than other parts of the system, right? Yes? [Student]: Does this not violate the atomicity part of the system? Yeah, so he asked: does this not violate the atomicity aspect of the guarantees of ACID transactions? Absolutely, right? So I can do this: I can run two transactions, both in Postgres. Transaction starts here, transaction starts here. This guy is gonna insert... actually, we can just do select nextval, right? Then run the same query in the other one. What should I get? Well, as he said, I should get eight; under serializability, in theory I should not see the effects of this other guy, but lo and behold, I get nine. This guy then rolls back, right? And the sequence didn't roll back. So sequences are awesome. We had students implement this last semester in the old system, and we threw away all that code, so we'd like to do it again. And I think there's a paper here; there's no good paper that describes how to actually do this efficiently, right? Okay, I realize I'm over time, sorry. So: we abandoned the hybrid storage model. I think delta version storage plus a column store is the right way to go, and that's what our current system does. Alignment matters a lot; you saw the performance differences, so make sure everything's aligned. And numerics are interesting; that might be another interesting project, figuring out how to make those efficient.
Catalogs are hard, and catalogs are awesome. And making them transactional makes your life a lot easier; we'll see this in a bunch of other places. All right, I've rushed at the end. Any questions? Take care.