Other database companies are still coming to campus, so Oracle is giving a tech talk here this Wednesday. I think the speaker is from the in-memory database group, the same group as Shasank Chavan, who came and gave a guest lecture in the intro class last semester. So I suspect it's going to be a very similar talk to what we got last semester. That's at 4:30 over here in New Simon. There'll probably be pizza, because if it's at 4:30 there's usually pizza; the 12 o'clock ones, unless it's on a Thursday, usually don't have food.

Anyway, before we get started: Matt posted on Piazza last night the spec for Project One that you need to target, and he'll be setting up Gradescope today so you can submit. So again, what'll happen is you'll submit your project on Gradescope, that'll just prove that it actually builds, and then we'll run it offline on the same EC2 instance type we're telling you to use when we do the grading. He also posted on Piazza last night that if you use the instance store of the instance type we give you, it runs out of space. This is because we do static linking in our system, so when you build everything, every single test case is like one gig and you run out of space. There's no way to build the benchmarks without building the tests. But if you just target building the slot iterator, it may not actually build all the unit tests, it may just build that one binary, so you should still be okay. You can pass a flag in CMake to turn off building the benchmarks. You can also turn off building the tests, but that's currently broken because there's a dependency between the two of them, okay? Any high-level questions about Project One? Okay, and again, we'll try to answer things on Piazza as the week progresses.

All right, so we spent the last couple of weeks talking about the more OLTP aspects of a database system. We talked about how to do transactions, we talked about how to do indexes. Now we're going to go to the bottom layer of the database system and talk about storage, and then going forward for the semester we'll build up new layers on top of that. So we'll do database storage, then execution, query optimization, and so forth. The reason I started with transactions and indexes first is that, in the back of your mind as we go forward, when we discuss other aspects of the database system, you should be thinking about them in the context of an MVCC system. Whatever technique or method we're discussing, ask whether it would actually work if you're doing MVCC. Some things work great if they're single-version and not so great when they're multi-version. That's why I front-loaded the transactional stuff at the beginning.

All right, so this is the chart I showed at the beginning of the semester of what an in-memory database looks like at a high level. Say we have our index, a Bw-Tree or whatever, the index doesn't matter. We're going to do a lookup to find a given tuple, and this thing is going to spit out a block ID and an offset. The block ID could be a pointer to the beginning of the block and the offset could be within that block, but that would be kind of wasteful. This thing could instead just be a straight pointer to the tuple itself, but then you get issues when you start moving things around.
So in our system, what we actually do, and a student came up with this and it was pretty clever, is use a feature from C++11 called alignas. This tells the compiler to align the starting address of a memory block at a certain boundary. All our blocks are one megabyte, and we use alignas to enforce that. So the block ID plus offset we get is really a 44-bit pointer to some one-megabyte block, and then we have 20 bits left over to jump to a location within that block. So instead of having an extra indirection layer to say how do I get from my block ID to a block address, we can pack this all into a single value. The only reason I bring this up is that in the old system, in Peloton, this was a train wreck. This thing was actually a pointer to a pointer, we had a thing called a tuple pointer pointer, and that became a nightmare. We had so many layers of indirection, whereas now this one value tells you exactly the block location and then the offset into it.

The main thing to point out, though, is that all of our tuples are stored as fixed-length, contiguous pieces of memory. Whether it's a column store or a row store doesn't matter. These values need to be fixed length so that we can do this arithmetic very easily: I want the fifth tuple, I know the size of every tuple in my block, so I know how to jump to the starting location for that tuple. Anything that could be variable length we'll instead store in a separate data pool, where we have variable-length data blocks. This looks a lot like what malloc does under the covers: it's a bin-packing problem, trying to stick different-sized things into different regions of memory. So for a variable-length attribute, the fixed-length part would just hold a 64-bit pointer to that location.
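Just to make the trick concrete, here's a minimal sketch of the idea, with illustrative names and sizes rather than our actual code (in practice the blocks would come from an aligned allocator):

```cpp
#include <cstdint>
#include <cstddef>

constexpr std::size_t kBlockSize = 1 << 20;  // 1 MB blocks, as described above

// alignas forces every Block to start at a 1 MB-aligned address, so the low
// 20 bits of a block's address are always zero.
struct alignas(kBlockSize) Block {
  char data[kBlockSize];
};

// Pack a block address and a within-block offset into one 64-bit slot value:
// the upper 44 bits locate the block, the lower 20 bits are the offset.
inline uint64_t PackSlot(const Block *block, uint32_t offset) {
  return reinterpret_cast<uint64_t>(block) | (offset & (kBlockSize - 1));
}

inline Block *SlotBlock(uint64_t slot) {
  return reinterpret_cast<Block *>(slot & ~uint64_t{kBlockSize - 1});
}

inline uint32_t SlotOffset(uint64_t slot) {
  return static_cast<uint32_t>(slot & (kBlockSize - 1));
}
```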
So this is what we're going to talk about today: how we actually represent stuff in the fixed-length blocks and in the variable-length pool. At a high level, you can think of an in-memory database as just a large byte array. We malloc a bunch of chunks of memory, and then we need a way to interpret regions of those memory chunks based on what the schema actually is. Because we're relational databases, we're told the schema ahead of time. JSON stuff is usually just treated as variable-length data, but we know our columns have these types and these sizes. So when we jump to an offset in that fixed-length data pool, we know that this is the starting location of a tuple and we know how to interpret the bytes we're looking at, whether as an integer or a float or a timestamp. Again, we already talked about this before: every tuple is prefixed with a header that includes metadata such as visibility. We'll talk about nulls and sticking those in there too. Basically, when I jump to the offset that is the starting location of the tuple, I'm really jumping into the header, and I can look at that header and figure out what comes after it. We're not going to store any of the schema information in the header, because that would be wasteful to do for every single tuple. The JSON databases that are schema-less have to do this, but in our world every tuple has the same structure, the same layout, so we don't need to repeatedly store that in the header.

So today we're going to talk about type representation: how we actually represent individual elements or attributes in a tuple. Then we'll build on top of that and talk about how we lay out those attributes for a single tuple. Then we'll talk about different storage models, and if we have time we'll finish up with system catalogs. That last part is additional stuff I want you to be aware of, because there are actually kind of cool things you can do because you have a relational catalog in a relational database with a schema. But the bulk of the material is going to focus on storage models and above, okay?

So this is the same thing I've shown before in the intro class. We know all the types that our database system can support; the SQL standard specifies what the basic types are, and we know how to represent the bits for each individual value of an attribute. For integers, bigints, smallints, tinyints, we just use what C or C++ provides when we allocate an int32 or int64 or so on. There's nothing special there: whatever bits we get when we allocate a variable like that, that's what we store in the database.

For floating-point numbers, we'll have this distinction between variable-precision floats and fixed-point decimals. Floats, reals, and doubles just use the IEEE 754 standard, which specifies at the hardware level how floating-point numbers are represented, but this is going to have rounding and accuracy issues. Then there are the fixed-point decimals, the NUMERIC and DECIMAL types specified in the SQL standard. Every database system is free to implement these any way it wants; some are more efficient than others, the Oracle one is better than the Postgres one. But the main idea is that these don't have the same rounding errors you get with floating-point numbers.

Timestamps, dates, and time types also vary per implementation. The SQL standard doesn't specify how you actually represent these attributes, but it does specify what kinds of operations you can do on them, like can I add two times together or subtract them, things like that. One common approach is just to use the Unix epoch, the number of seconds or milliseconds since January 1st, 1970, and use that to calculate the current date and time. In C, there's this thing called timespec, which gives you a seconds offset and then a nanoseconds offset, and that's actually 16 bytes, larger than this. In our current system, I think we're using 64-bit integers holding the milliseconds since the Unix epoch. But again, at the SQL level you don't know and you don't care; you just know you have timestamps you can stick things in and do comparisons on.

For the VARCHARs, the binaries, text, and other variable-length fields, again, in the fixed-length portion of the tuple we'll store a pointer to some block of data in the variable-length pool. If the size of the data we're trying to store is smaller than the pointer itself, 64 bits, then we'll just store the data inline rather than a pointer to it. Anything larger than that goes in the variable-length data pool, and we keep some extra metadata to keep track of what we're looking at, and if it overflows to another data pool we have pointers to those as well. I'll cover that as we go along.
So here's a quick demo of the difference between variable-precision floating-point numbers and fixed-point decimal numbers. Again, if you go with a float, a real, or a double, you're going to get the hardware representation of a floating-point number, specified by the IEEE 754 standard that every CPU has to follow. So if you store a floating-point number on a POWER CPU, it'll be represented the same way as on x86. The endianness might be different, the bit order might be switched, but how you keep track of the decimal point and what's before and after it is all specified by the standard. This will be really fast because the hardware has low-level instruction support to take two floating-point numbers and compare them or add them together. It's not going to be as fast, potentially, as doing addition on integers, but it's certainly faster than anything we can manage ourselves in the database system.

But the downside, again, is that you're going to have rounding errors. If you write a really simple C program like this, where you take two 32-bit floating-point numbers, 0.1 and 0.2, and add them together, you'd think the answer would be 0.3. But when you start looking at the deeper positions in the decimal, you end up with all sorts of weird stuff, because the hardware can't precisely store 0.3. Even if you just print out 0.3, you get a bunch of stuff.
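Here's roughly what that little demo program looks like, written as a hedged sketch rather than the exact code from the slide:

```cpp
#include <cstdio>

int main() {
  float x = 0.1f;
  float y = 0.2f;
  // Neither 0.1, 0.2, nor 0.3 has an exact binary representation, so printing
  // with extra digits exposes the rounding error hidden in IEEE 754 floats.
  std::printf("x + y = %.20f\n", x + y);  // e.g. 0.30000001192092895508
  std::printf("0.3   = %.20f\n", 0.3);    // e.g. 0.29999999999999998890
  return 0;
}
```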
For some applications this is fine. If I'm storing the temperature of this room, I don't need a super accurate measurement; 98.6 or so is good enough. But if I'm dealing with money or scientific measurements, where I don't want these floating-point rounding errors, then I want fixed-point decimal numbers. The idea there is that the database system itself keeps track of the exact value we're trying to store, and it handles the rounding and all the other issues that come up when you start doing multiplication and other manipulation. Now, what's on the slide here isn't quite accurate; this is actually what Postgres does. Postgres stores the decimal number essentially as a varchar, like a string, and then it has extra metadata to keep track of where the decimal point is, whether it's negative or positive, and rounding information. They have these giant switch statements to deal with all the different variations of operations you might want to do, and it runs about twice as slow as floating-point numbers. The HyPer guys have a low-level bit-manipulation method for storing fixed-point decimals that is very, very efficient, and in some cases can actually run faster than floating-point numbers, which I don't know how that's possible, but that's what they show. I'm not going to have time to teach that, and I don't fully understand what they're doing just yet. The German guy recommended this book called Hacker's Delight that tells you how to do all this bit manipulation. If this is something you want to pursue for project three, then we should talk. Yes? I think with banks, they use like four bits and use that to represent a single decimal digit, instead of using a full word. For a single decimal position?

Only four bits? Yeah, just four bits. They use really old CPUs. Oh, okay. Yeah, that I don't know anything about. I do know that the banks never overwrite data. So if a rounding issue could occur because of the way their application was written, when you computed interest or something like that, you could always go back and fix it by rerunning the computation. So it's not like you would lose any data by having imprecise values. This is something I'm very interested in, but I don't know enough about it to opine on what people are actually doing, other than that the Germans have solved it, and I don't fully understand what they're doing yet. Okay? Again, if this is something you want to do for project three, I'd be happy to help out, and I'll point you to the chapter in the book that he recommended.

Okay. So, as I said, the database essentially is just a byte array, a char array or byte array, and now we need to start putting meaning onto the bits we're storing in those bytes. Say I have a simple table here with two fields: id, a 32-bit integer that's the primary key, and value, a 64-bit integer. The layout of the tuple would look like this. I have my header of some size, where I'm storing the timestamps and everything we talked about before. When the header ends, I have the id field, and when that ends, I have the value field. So if my database system needs to read this field, I know how to find the location of this tuple, because the index got me there. All the headers are the same size, so I know how to jump past that, and at this position here, everything I read for the next 32 bits is the id field. So how do we actually access this data when, in our C or C++ code, all we see is a byte array? We need a way to convert this into something we can interpret as a 32-bit integer. Take a guess how you do that in C++. Reinterpret cast, right? All that says, and this is a compiler construct, is: when I read this memory address, treat whatever you're reading not as a byte array but as a 32-bit integer. So now the compiler knows that when it accesses this, it's just going to access those 32 bits.
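As a minimal sketch of that, with a made-up header size and offsets (in a real system these come from the catalog), reading the two fields of that table looks something like this:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical layout for the example table (id INT32, value INT64).
constexpr std::size_t kHeaderSize  = 16;
constexpr std::size_t kIdOffset    = kHeaderSize;
constexpr std::size_t kValueOffset = kIdOffset + sizeof(int32_t);

inline int32_t ReadId(const char *tuple) {
  // Treat the 4 bytes at the id's offset as a 32-bit integer.
  return *reinterpret_cast<const int32_t *>(tuple + kIdOffset);
}

inline int64_t ReadValue(const char *tuple) {
  // (A real layout would pad this offset up to an 8-byte boundary;
  // that's the alignment discussion coming up later.)
  return *reinterpret_cast<const int64_t *>(tuple + kValueOffset);
}
```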
Now for the variable-length data, say I have a VARCHAR(1024). Again, in the fixed-length data array, I have my header and then I have a 64-bit pointer, and this points down to some memory region in the variable-length data pool, the starting location where I can find the actual value I want. In this case, each entry in the variable-length data pool has its own header that keeps track of the length of the data stored contiguously in that chunk. And if the value overflows that chunk, there's a next pointer that jumps to some other memory location where we can read the rest of the data. This is not unique to in-memory databases; the disk-based databases do the same thing. If your VARCHAR or text field overflows between different pages, you need a way to connect them all together.

The other thing that I'm not going to talk about too much is that this data is typically not null-terminated, because otherwise you're wasting a byte for no reason. So in our code, we can't just call strcmp or strlen from the C library on this thing, because it won't have a null at the end; it's not truly a C string. We have to write our own string functions to interpret it correctly.

Okay? The other optimization we can do here: say I'm trying to find all the values where the string starts with the word Andy. How would I have to do that? With this setup, I'd have to scan through my table and, for every single tuple, follow this pointer to some random location and then look at the first few bytes to see whether it starts with Andy. That would be really expensive because there'd be a lot of indirection; not branch misprediction, there are no conditionals here, but basically I'd be scanning one part of memory, jumping over to another part, then going back and scanning more, back and forth. So I'd have bad cache locality, because instead of scanning through a bunch of stuff I brought into my caches all at once, I have to keep jumping around over and over again. A really simple optimization is to pad out the pointer portion that you're storing in the fixed-length data to include a prefix of the string from down below. Now when I scan across trying to find things that start with Andy, I can just look at this inline prefix and see whether it matches, and I may never have to touch the variable-length pool at all. HyPer does this, and we actually do this in our own system. The pointer portion of a varlen field in the fixed-length data pool is 16 bytes, or 128 bits: we store a 64-bit pointer to the actual data and then use the remaining 64 bits to store a prefix. You could also store a hash of the entire string; there are different methods you could use, okay?
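A minimal sketch of that 16-byte entry, with made-up field names (the full string length would live in the pool entry's header, which I'm leaving out):

```cpp
#include <cstring>

// The value kept in the fixed-length data for a varlen attribute: an 8-byte
// pointer into the variable-length pool plus the first 8 bytes of the string.
struct VarlenEntry {
  const char *data;   // points at the full string in the varlen pool
  char prefix[8];     // first 8 bytes of the string, zero-padded
};

inline bool StartsWithAndy(const VarlenEntry &e) {
  // "Andy" fits entirely in the inline prefix, so a scan applying this
  // predicate never has to follow e.data into the varlen pool.
  return std::memcmp(e.prefix, "Andy", 4) == 0;
}

// Predicates longer than 8 bytes would check the prefix first to reject early,
// and only then chase the pointer (using the length from the pool's header).
```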
All right, so now we know how to store basic scalar values, the fixed-length integers, floats, and timestamps, and we know how to store variable-length data. The last thing we've got to deal with is nulls. There are three ways to do nulls, at least as far as I know. The first approach is to designate a special value in the domain of a type to represent null. For example, you could say that the smallest 32-bit integer you could ever store, which libc defines as the INT32_MIN #define, will be null. If I ever see this value when I'm looking at a tuple, I treat it as a null, not really as INT32_MIN. Now, above the storage layer, the database system has to account for this: if someone tries to insert this value, you have to throw an error, because if they stored it and then read it back and it said null, that would be confusing. You basically need extra boundary code up above to make sure this value is considered out of range. We did this in H-Store and VoltDB, MonetDB does the same thing, and we originally did this in Peloton, the old system we were building here at CMU. But as far as I know, MonetDB and VoltDB are the only ones that do this.

The more common approach is to use a null column bitmap. Here we store, in the header for our tuple, a bitmap field where, for every single attribute in the tuple, the bit is set to one if that attribute is null. So if I have a thousand attributes per tuple, my bitmap has to have a thousand entries. Now as I'm scanning the tuple and applying predicates, I do a lookup in this bitmap to see whether the bit for the attribute I'm looking at is set, and if so, it's null. This is way more common, both for in-memory databases and disk databases; MySQL, PostgreSQL, SQL Server, Oracle, everybody does it this way. This is what we now do in our own system, except that because we're a column store, we're not storing this bitmap per tuple; for a single column, there's one bitmap covering every tuple in a block.

All right, the last one, I'll just say it for you: I think this one's stupid. Only one database I know of does this, and that's MemSQL. Here you store, for every single attribute in your tuple, a separate flag that says whether it's null or not. So if I have a 32-bit integer that could be null, I need a little flag in front of it that says whether this thing is null or not. So why is this stupid? Yes? It ruins your offsets. He says it ruins your offsets. Well, if every attribute has the same flag, the offsets still work; essentially it just means your 32-bit integers are now 33 bits, and everything is slightly off the whole way down. No, that by itself isn't the issue, right? If I have four attributes and every one is 32 bits plus one bit for the flag, then if I want the third attribute I take 33 times three and jump to where I need to go. Bingo, that's it. He says you don't have bit-addressable memory. Exactly: you can't just make a 32-bit integer 33 bits, because that makes everything unaligned, and you can't access memory at bit offsets, only byte offsets.

So if you go look at the MemSQL documentation for the sizes of all the different data types they support, this is from MemSQL 6; this chart from the documentation is maybe two years old, but I recently looked and they're still doing this. For every fixed-size data type, they list two potential sizes: the size if it could be null and the size if it's not null. Take boolean: boolean is true or false, so you store that as one byte, because we can't store single bits. But if it could be null, then it takes two bytes. Same for the other ones: tiny int doubles in size. Medium int is three bytes, so they add a single byte. For int and bigint they have to add four bytes, because that's how word alignment works on x86.
So, going back and comparing: the special-value approach is the smallest, because you don't take any extra space; you have to store the attribute anyway, and you're just using one value in its domain to say, oh, this is null. The bitmap gets a little bit larger, because now I need a single bit for every attribute in my tuple, but at least it's just a bit, and I only have to pad out the bitmap a little bit. The last one pads out every single attribute. Yes?

So what is null being used for in this instance? Is it just like a tombstone marker? What do you mean? Like, I know if you delete something you might just put a null in there instead. No. In SQL you have three-valued logic: a value can be true, false, or null, and null means unknown, I don't know what the value is.

But how do you know, just from them telling you the sizes? Okay, so his point is correct, and I agree with him: am I just looking at this and assuming they're using these extra bytes only to store null, or are they doing our VARCHAR trick and caching some extra stuff in there? But what would you need to store for a single fixed-length attribute? For a VARCHAR, like my example where I padded it out, I showed the prefix, I showed the hash, I could maybe store the size up there if I'm looking for things of a certain size; there are things that would be expensive to materialize for a VARCHAR that you might want to cache. But any operation on one of these fixed-length attributes is going to be super fast. You could say, all right, maybe they're also storing some precomputed value; let's be stupid and say the negation of the absolute value of something. First of all, that's cheap to compute, it's a few instructions, but I wouldn't be able to store it in there anyway, because I still need that bit to mark the attribute as potentially null. So I don't know what they could be precomputing, and everything is already aligned. Now, MemSQL put out a new version of their column store late last year, so I should double-check whether they fixed this, but I don't know. They're the only ones I know of that do, or did, this. Yes?

What's the bit data type and why is it eight bytes? That is a good question, I don't know. Let's go find out. This is not a good use of our time, but... it's probably a bit field or maybe a bitmap. The bit data type is an integer data type that can take a value of zero, one, or null, and SQL Server stores it as one byte. I don't know why it's eight here; that's a good question, let's look at that later, okay.

All right, so again, the bitmap is the most common approach, and this is what we do now.
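A minimal sketch of that bitmap check, with illustrative names (one bit per attribute, or, in a column store like ours, one bit per tuple within a column's block):

```cpp
#include <cstdint>
#include <cstddef>

// Bit i of the bitmap is set iff attribute (or tuple) i is NULL.
inline bool IsNull(const uint8_t *bitmap, std::size_t idx) {
  return (bitmap[idx / 8] >> (idx % 8)) & 1;
}

inline void SetNull(uint8_t *bitmap, std::size_t idx) {
  bitmap[idx / 8] |= static_cast<uint8_t>(1u << (idx % 8));
}
```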
There are also some other optimizations we can do on the bitmap because everything is contiguous. If I'm asked to find all the tuples where this value is null, then I can just take this bitmap, which, stored as a column store, is a single contiguous bitmap for a single attribute, and now I can do SIMD or vectorized operations to do counts and other things more quickly. With the other two approaches, you still have to look at the attribute itself.

Okay, so let's understand why they have to do this padding, and I think he's already said it: because of alignment issues, because you can't access things at funky 33-bit offsets. So what I'm going to describe now is word alignment for the tuples in our database. I'm going to use examples with 64-bit words; on real x86 the cache line is 64 bytes, but for simplicity, to make everything fit on a slide, I'm going to use 64 bits as an example. Just understand that the high-level concept is the same; only the length of the word or cache line will be different.

So say I have a table with four attributes: an int id, a timestamp, a char(2), and a zip code. In the case of a char versus a varchar, some systems will store a char as a regular varchar, but if you know it can only ever be two bytes, then you can store it directly in the fixed-length tuple slot. So again, here's our alignment, assuming 64-bit words. When I start writing out this data for the tuple, the id is 32 bits, the timestamp is 64 bits, then this char is two bytes or 16 bits, and then I have my zip code. Now say I do a query that does a lookup on the date field. Again, I do my arithmetic: I know what the schema is, I know the sizes of these attributes, so I know how to do the simple math to jump to my offset and read this. But what's going to happen here? What happens if my processor reads something that's not word-aligned? Bring in the whole cache line. He says bring in the whole cache line, two cache reads; that's one, what else could it do? Someone else could come in and write into it when you read the first half. He says someone else could come in and write the other half and I'd only see the first half; that's also an issue, but we're not going to worry about it. Let's assume we have transaction protection above this, so there are no two writers.

There are actually three things that could happen. They got the first one right, and this is what x86 and the newer versions of ARM will do. With x86, they're trying to make your life as a programmer as easy as possible: if you try to read something that's not word-aligned, they'll do it, they do multiple reads for you, stitch the value back together, and hand it back to your program. But as we said, that's going to be slow, because what should have been one read into memory or into our CPU caches is now two reads, and we bring in a bunch of other stuff we don't actually need. If we just needed this date field, we end up bringing in the id field, part of the zip code, and this char, even though we only wanted those 32 bits.
So we read 128 bits just to get 32 bits, and that's bad, we don't want to do that. For this one, under perf, one of the events you can record is unaligned accesses. We actually run with the ASan stuff from Google on our test cases, so if you do an unaligned memory access it'll throw an error and say you can't do that. But you can also use perf to take a binary, run it, and count the number of unaligned accesses you have, and that can help explain why your program is running slow. Callgrind's not going to give you that, because Callgrind only sees what instructions you execute and what lines of code you run; it doesn't tell you what the hardware actually did.

The second possibility is that the read just gives you garbage: you get some random combination of data, and your program has to figure that out on its own. This is obviously way faster than doing two reads, and it's also easier to engineer on the actual chip because you're not doing extra work to reassemble things. This is what ARM used to do in the very beginning; they now do what x86 does. Of course, this is problematic because you do a read and you think you're getting that one field, but you're actually getting a bunch of other junk. The third alternative is to just reject the request: you try to read something that's not word-aligned, the hardware denies the load operation and throws back an error. So again, the more common CPUs you're going to encounter in the wild do the first one. I don't know who actually rejects requests anymore; that's mostly in super small embedded devices, low-power things, where you're probably not going to be building a full-fledged database system anyway. Actually, SQLite would probably run on those, and of course they've already dealt with this.

So the first case is what we care about in our world, and we obviously don't want unaligned reads because they're slow, so we need a way to resolve this. How can we make our layout word-aligned? What's an easy fix? Padding, exactly, yes. So what we'll do here is: we know in our database system what the word alignment is for the hardware we're running on, so as we lay out the data for our tuples, we recognize, oh, I can store 64-bit words, this attribute is 32 bits, the next one is 64 bits, so rather than spanning the word boundary, I'll just put in a bunch of zeros for these 32 bits to pad it out so that the next attribute starts exactly at the beginning of the next word. And now, in the upper layers of the system, depending on how you give data back to the execution engine, either the storage manager says, all right, you need these fields, I don't want to send you a bunch of garbage, so let me stitch them together in a memory buffer and hand that back to you; or your execution engine has to be aware of the padding going on in the storage manager, so that when you need to jump to the date field you know to skip over these 32 bits that are just padding. What's another thing we could do? Is it possible to rearrange the fields? Perfect, yes, reordering, that's the other one.
So we can also run basically the same kind of bin-packing problem, where we say: I have 32 bits I could store here to fill out my word, so instead of storing things in the order they were defined when the table was created, can I find another 32-bit attribute and stick it in there? Then I can pack the rest, and only at the end do I have to pad things out so that whatever follows starts on the right word boundary. Again, the upper levels of the system need to be aware that this reordering and padding is going on, or you need to do some copying to hide it from them.

We actually do both in our current system today. Just to give you an idea of the improvement you can get, here's a micro-benchmark from one of the students who helped build out the new version of the system two summers ago. This was the first implementation we ever did, without any alignment, and the benchmark just tries to insert as fast as possible. Without any alignment, you can insert maybe 400 or 500, sorry, not megabytes, 500 kilobytes of data per second. If you do padding, this bumps up to 11 megabytes per second, and if you do padding plus sorting, now everything is nicely cache-aligned and you can insert at a very high rate. This is just how fast we insert things into memory; it's not writing any log records to disk, it's just showing the performance advantage you get from this. I don't think we have numbers that show what sorting alone provides; we only have padding, and padding plus sorting.

So what happens in our current system is that when you create the table, we run a little algorithm that tries to figure out the optimal ordering of the columns. The execution engine is aware of this ordering, that's what the projected-columns stuff gives you, but before we hand the result back to the application, we make sure we put it back in the order specified by the CREATE TABLE statement. I think the SQL standard specifies that you have to do this. So if someone calls SELECT *, they get the attributes for each tuple ordered as they defined them. Because what happens a lot of the time is people write their applications with these SELECT * queries and then access attributes by offset: give me the first one, the second one, the third one. If I start shuffling the order of these things because I ran some bin-packing algorithm, that's going to break programs. So we do an extra step to make life easier for the end user.
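Here's a rough sketch contrasting the two fixes, using the example schema from above with made-up names and treating each attribute's size as its alignment (an assumption, not how any particular system defines it):

```cpp
#include <algorithm>
#include <cstdio>
#include <cstddef>
#include <vector>

struct Attr { const char *name; std::size_t size; };  // size == alignment here

// Lay out the schema, optionally reordering largest-first, and print offsets.
std::size_t LayOut(std::vector<Attr> schema, bool reorder) {
  if (reorder) {
    // Largest alignment first: with power-of-two sizes this keeps every
    // attribute naturally aligned with no interior padding.
    std::stable_sort(schema.begin(), schema.end(),
                     [](const Attr &a, const Attr &b) { return a.size > b.size; });
  }
  std::size_t offset = 0;
  for (const Attr &a : schema) {
    offset = (offset + a.size - 1) & ~(a.size - 1);  // pad up to alignment
    std::printf("  %-8s at offset %zu\n", a.name, offset);
    offset += a.size;
  }
  return offset;  // bytes used by the tuple body
}

int main() {
  std::vector<Attr> schema = {{"id", 4}, {"cdate", 8}, {"color", 2}, {"zipcode", 4}};
  std::printf("declared order, padded:\n");
  std::printf("tuple size: %zu bytes\n", LayOut(schema, /*reorder=*/false));
  std::printf("reordered largest-first:\n");
  std::printf("tuple size: %zu bytes\n", LayOut(schema, /*reorder=*/true));
  // The execution engine has to know about whichever layout is used; results
  // are projected back into CREATE TABLE order before going to the client.
}
```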
Any questions about this? Okay, so now let's talk about storage models. The storage model specifies how the database management system organizes tuples and their attributes internally in storage. The two main approaches are the n-ary storage model and the decomposition storage model: the row store and the column store. Then there's a hybrid approach, which we'll talk about a little bit, where you try to get the best of both worlds in a single unified system. The row store, the NSM approach, is what people normally think about when we talk about relational databases.

The idea here, like in all the examples I've shown so far, is that all the attributes for a single tuple are stored contiguously in some region of memory, and only when we get to the last attribute does the next tuple begin. So again, if I want to access the third attribute of the fifth tuple, I know how to do the address arithmetic to jump to the offset of the beginning of that tuple and then jump to that attribute. This storage model is ideal for transactional or OLTP workloads, because in that environment the transactions or queries only touch a small number of tuples, and they typically touch all of the attributes of a tuple. There are a lot of SELECT * queries where account name equals Andy, and I want all of the attributes; if it's stored as a row store, I can just jump to that one region of memory, scan across, get all the attributes I need, and hand them back to the execution engine or whoever else needs them. This is also where you typically use the tuple-at-a-time iterator model, which we'll discuss more in a few weeks, but it's basically how the execution engine accesses tuples. This approach is ideal because I'm just jumping to the starting location of every tuple and handing back that chunk of memory to whatever operator or executor needs it.

This is also really good for insert-heavy workloads, which is what you see a lot in transactional workloads. Every time I log in, I insert a new record saying this person logged in; every time I place a new order on Amazon, I insert a new order record and insert all the items I bought. In a row store this is super fast, because I can just jump to some free slot in my table and write out all the attributes contiguously. So, as I've already said, the advantages are that it's fast for inserts, updates, and deletes, and good for queries that need the entire tuple. This one we can ignore for now. Where it becomes problematic is analytical workloads, where you want to scan large segments of the table but only need a subset of the attributes. If I want to compute the average price of an item bought by anybody who lives in the city of Pittsburgh, I only need the city field and the price they bought the item for; I don't need any of the other attributes of that tuple. So with a row store that's wasteful, because I'm accessing chunks of memory I don't actually need, and I have to keep jumping over and over again to find the right offset of the attribute I want for every single tuple.

That's what the column store, the DSM approach, solves. The idea is that for a single attribute, we store the values for every single tuple contiguously in memory. Now, a block of data might not hold only the values for that one attribute; we could keep all the attributes together in a block, but internally the goal is always to store things as a column store. This is great for read-only queries, because you scan over large portions of the table at a time but only access a subset of the attributes. This is the stuff we've already covered. We'll also get better compression with a column store, which we'll cover on Wednesday.
For point queries and updates, this is where the column store gets potentially expensive, because now you have to take the tuple you need to insert or update, break it apart into individual attributes, and do multiple memory writes to update all of them.

The history of column stores is kind of interesting. Column stores are super common now: any new OLAP system developed or released in the last decade will be using a column store, and if they're not, they don't know what they're doing and it's not worth your time to talk to them, right? But what seems obvious now wasn't always the case. Back in the 1970s, the first known implementation of a column store system was a thing called Cantor, which came out of a Swedish defense research group. It wasn't actually a database system as we know it today; it was more of a batch processing system, but it organized the data as columns. In the 1980s, there was a paper that formalized the idea of what a column store database looks like. The first well-known commercial implementation was Sybase IQ, which was an accelerator for Sybase: you had your regular Sybase ASE database system and you bought Sybase IQ to sit in front of it, so when your analytical queries showed up, they would run on that copy of the data. They were using what is called a fractured mirror approach, which I'll cover in a few more slides, but basically you still had the row store, and this thing sat in front of it and made the analytical queries go faster. I've heard some horror stories from the 1990s that it never worked correctly, but it's still around today as part of SAP.

In the 2000s, the column store idea really took off. The three main ones in the space were Vertica, VectorWise, and MonetDB. VectorWise was an improved in-memory version of MonetDB; it later got bought by Actian, got renamed to Vector, and it's still around today. It's actually still really good: if you benchmark it, it does really well, and there are certain aspects of VectorWise that we'll cover later in the semester. MonetDB is still around as well; it came out of CWI, the same research institute that built DuckDB. They were early in the space as well. And like I said, now everyone recognizes the advantages of a column store, and they're widespread.

All right, so the paper I had you guys read: the reason I chose it is that it shows how to do updates and support hybrid workloads on a column store. In our system, we still support transactions, we still support updates as well as analytical operations, but we're designed to be a column store. This particular paper, on a system called Casper, is meant to show how you can maintain a column store and still get good support for OLTP execution. The paper discusses some high-level design decisions you need to be mindful of when building a column store system: tuple identification, which is how to find tuples in the fixed-length data array; how you're going to organize the data you're storing in your columns; and then the update policy and buffer location. We already covered those last two a little bit when we talked about MVCC; it's basically asking, if you have a delta store, where do you actually store those things?
And then how do you apply them later? So I want to focus on the first two, and then we'll talk about the different approaches for building a hybrid store, where you have a row store plus a column store. Casper is a pure column store that supports transactions, but then we'll talk about things like the fractured mirror approach for keeping row stores and column stores together.

So the first issue is figuring out how we're going to identify our tuples, or how we get to the starting location for an attribute. Again, we assume all the values are fixed length; the VARCHAR stuff is stored in a variable-length data pool, but at least the pointer itself is fixed length. If everything is always the same length, then you don't need to store any additional information to identify which tuple you're looking at; you just use the implicit offset of where you are in memory as the identifier for the tuple. If I want tuple number two in my table, I know the type of each of these columns, so I can multiply the size of the attribute by the position I want, jump to that location, and get this tuple's value, and I can do the same thing for all the other columns.

Some systems, however, can't do this, or don't want to, and instead they explicitly store the tuple identifier alongside every single value in a column. So for column A, for the first tuple, I store its tuple ID; for the second tuple, I store its tuple ID; and so forth. This is sort of a bastardized version of a column store, because you're not getting the full advantage of the fixed-length arithmetic; you're just wasting extra space to store this. Yes? But the offsets are still fixed, even if you store the IDs. The question is, are the offsets still fixed? I think so. But then why would you do it? I think Oracle used to do this because they were grafting a poor man's column store onto the system: they were trying to keep the row store execution engine but operate on columns, so they had to embed these extra identifiers. In the newer versions of their column store they don't do this; everyone uses the implicit offsets. But some systems really do do this, and it's a bad idea: if everything is fixed length, you don't need to. Yes? If you want to sort the column... The question is, if you need to sort your column, don't you still need these IDs? We'll get to that in two more slides. But yeah, if I sort the column here, I could sort just A and not affect B, C, and D, assuming that for a given tuple identifier I have a way to jump to the exact offset in each column. I think that's why Oracle did it. Vertica does sorting where, when you sort one column, the sort cascades to all the other columns, so that going across horizontally, at least in my illustration here, you're always looking at the exact same tuple. Most systems don't do that pre-sorting, though.
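To make the implicit-offset arithmetic concrete, here's a minimal sketch with illustrative names, assuming every value in a column has the same fixed size:

```cpp
#include <cstdint>
#include <cstddef>

struct Column {
  char *base;              // start of this column's fixed-length value array
  std::size_t value_size;  // every value in the column has the same size
};

// The tuple id is never stored: the value for tuple i simply lives at
// base + i * value_size.
inline char *ValueAddr(const Column &col, std::size_t tuple_id) {
  return col.base + tuple_id * col.value_size;
}

// e.g. reading a 32-bit attribute of tuple 5 from column 2:
//   int32_t v = *reinterpret_cast<int32_t *>(ValueAddr(columns[2], 5));
```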
Actually, that's the next topic: how we're going to organize the data when we have to make changes or add new entries. The most common approach is insertion order: we just find any slot that's free in our blocks and insert our tuple there. There's no locality, there's no ordering of the attributes; it's just sort of random. You don't necessarily always have to go in order from beginning to end, either. In our system, we actually do a compare-and-swap on a random location in a bitmap and try to find the first free location from there, which may not be the first free slot overall, but at least that way you don't have all your threads trying to insert into the same location at the same time. The other approach, which he was referring to, is to insert the tuples into the columns based on some ordering scheme; again, Vertica is the most famous one that does this, but as we'll see on the next slide, this can be problematic: if I have a bunch of inserts showing up and I need to enforce that ordering, I don't want to pay the penalty of reshuffling things every single time, because that would be expensive. And the last approach is from the Casper paper, where you break up the blocks into ranges according to some partitioning scheme; within each block the data doesn't need to be sorted, but globally there's what they call a shallow index that at least lets you jump to the location that will have the data you're looking for within a particular range.

So let's look at the first one, insertion order. This is my column store, I want to insert this tuple, so I just find whatever the first free slot is for my new tuple and add my entry there. Easy. When this block gets full, I create a new block and keep inserting into that.

Now for the sorted table: say I take all these columns and sort them according to some scheme. I'm going to first sort all the tuples by A ascending, then by B descending, then by C ascending. Again, we need the fixed-length offset for every tuple to be the same: if I'm looking at tuple four, I want to be able to jump into B at the same position and see B's attribute for that same tuple. So if I sort on A, I have to propagate that sort order over to the other columns. I start with A and sort by that value; then within B, for all the tuples with A equal to one, I sort according to my sort order for B; and then within that, I have the sort ordering for C. So again, this guarantees that if I'm looking at this tuple here and going across, I'm seeing all the attributes for the same tuple. Can anybody guess why you'd want to do this sorting? Yes? Secondary indices? She says secondary indices, what do you mean? You're saying if I jump to a certain place in column B... but in this case I sorted on A, right? So if I have a predicate that says find me all the tuples where A equals A1, I can just jump to this and scan down, and once I see A2 I know there's never going to be another A1 and I'm done. But for B, if I want all the tuples where B equals B2, I can't just look at this one region, I actually have to scan the whole thing. Right, so she's saying, once I get here, I'm looking for B2, and once I see a B1 I know there's not going to be another B2 within this region of A, but then I need to know how to jump to the next region of A. I think what she's saying is that you get better cache locality on secondary lookups, because the matching values are more grouped together.
Like if you try to find all the values between B1 and B2, with a secondary index built on that or something like that, you grab all the B1s from the first chunk, you grab the second chunk of B1s, then the first chunk of B2s, the second chunk of B2s, and those are all relatively grouped together. I see what you're saying, so it's not really a secondary index, it would just make the scan go faster. And assuming you're doing partitioning and you know where the boundaries are, so this is the boundary for A equals A1, then if I'm looking for the B2s, once I scan past the B2s and see a B1, I know there's never going to be another B2 after this in this region, so if I knew where the next partition starts, I could just jump down there. Yeah, that'd be one benefit, yes. I was going to say the other benefit is that it essentially acts as an index. For B though? For A. For A, yes.

The answer I was actually looking for we'll see on Wednesday: we're going to be able to compress this way better if everything is sorted. One common kind of encoding is called run-length encoding. Say we only have two values for B, B2 or B1. Instead of storing copies of B2 over and over again, I can say I have B2 two times and B1 three times. Think of male versus female, assuming you just have the two sexes: if I have a million students and I sort them on that attribute, then I can just say here's 500,000 males and 500,000 females, and I only need to store that small encoding, not every single instance of the value. We'll see this next class. So Vertica does this; Vertica is the strongest proponent of sorting. They call these projections. The idea is that I sort everything on A, and then within the boundary of a value of A I sort on the next column, and so forth; you keep narrowing the window. That actually is Vertica's index: Vertica does not have any B+tree indexes, they only have sorted projections.
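Here's a minimal sketch of the run-length encoding idea just mentioned, as a preview of what we'll cover next class (names and types are illustrative):

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// A sorted column of repeated values is stored as (value, run length) pairs
// instead of one entry per tuple.
std::vector<std::pair<std::string, uint64_t>> RunLengthEncode(
    const std::vector<std::string> &sorted_column) {
  std::vector<std::pair<std::string, uint64_t>> runs;
  for (const std::string &v : sorted_column) {
    if (runs.empty() || runs.back().first != v) {
      runs.emplace_back(v, 1);   // start a new run
    } else {
      ++runs.back().second;      // extend the current run
    }
  }
  return runs;
}

// {"B2", "B2", "B1", "B1", "B1"}  ->  {("B2", 2), ("B1", 3)}
```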
All right, so the problem is going to be: if I keep everything globally sorted like this and now I want to insert this tuple here, it wants to go in this position, but there's some other tuple being stored there. In my case, A2 is less than A3, and I'm sorting on A, so I need to move these guys down before I can insert my tuple. That's obviously expensive to do on the fly: if I have a billion tuples and my new tuple lands at the beginning of my column, I have to shift everything over, and even within a single block this would be expensive. This is what the Casper paper is trying to solve. The idea is that you have this index above your columns and you split things into partitions based on ranges. Now when I do an insert, I just need to make sure the tuple I'm inserting lands in the right range; it doesn't need to be sorted within that range, as long as there's space I can put it right in there. And I can still easily find the data I want when I'm doing point queries. If I have to do sequential scans or long scans, like in OLAP queries, then I can ignore the index and just rip through the columns anyway, although you can use it to guide the scan.

There's an additional component of Casper with an offline algorithm that examines the workload and the dataset and tries to come up with an optimal partitioning scheme, such that tuples that are mostly accessed through analytical operations get larger partitions, because you're not likely to update them and have to reshuffle things, and the hot data that may be updated soon gets smaller partitions, so that you can find things more quickly through the shallow index.

All right, so let's look at an example. Again, we have our column store, and they have these partition markers. They don't have to be physically partitioned; in their example it looks like one giant contiguous space and these are just demarcation points, but obviously you're not going to allocate some giant array, so this would still be organized as blocks. The important thing to point out is that these are variable-length partitions, and they can grow and shrink based on how the tuples are accessed. So if I want to insert this tuple here, it goes into this partition, because we're trying to insert A2 and I have space to store A2 right here; that's fine, that's really fast, and we're done. But now say I want to insert another tuple with A2. It needs to go into this partition, but within the partition I don't have any more space. So they use this technique called ripple insert. The way I understood it is like dropping a pebble into a lake and watching the ripples propagate: you start moving things around, going from one partition to the next. For this one here, we want to insert our tuple into this partition, but we don't have any space, so we need to steal some space from the partition below us: we move this guy to the end of its partition, slide the demarcation point down, readjust the range partitions and the shallow index, and now I can insert the tuple. Now, it may be that when we moved that tuple to its new position, that partition didn't have any space either, so we'd have to steal from the next partition, and so forth; that's the ripple effect. The idea is: best case, I insert into my partition and don't have to move anything; worst case, I have to reshuffle every single partition. But in practice, and I don't want to say how often this occurs, you shouldn't have to reorganize the entire database, and it's certainly way less work than with that globally sorted scheme I showed in the beginning.
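Here's a simplified sketch of the shallow-index idea, not the paper's actual ripple-insert algorithm: partitions cover key ranges, an insert only has to land in the right partition, and within a partition nothing is kept sorted.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

struct Partition {
  int32_t lo, hi;                 // key range [lo, hi) covered by this partition
  std::size_t capacity;           // bounded number of slots
  std::vector<int32_t> values;    // unsorted values within the partition
};

// Returns true if the value fit into its partition without any reshuffling.
// (The real system would instead ripple a value into a neighboring partition
// and adjust the range boundaries when the target partition is full.)
bool Insert(std::vector<Partition> &parts, int32_t key) {
  for (Partition &p : parts) {
    if (key >= p.lo && key < p.hi) {           // shallow index lookup
      if (p.values.size() < p.capacity) {
        p.values.push_back(key);               // best case: just append
        return true;
      }
      return false;                            // would need a ripple insert
    }
  }
  return false;                                // key outside all ranges
}
```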
Yes? The question is, is this like sorting based on one attribute only? No, you can still sort on different attributes. The follow-up is, wouldn't I have to re-sort on B and C, like on all the attributes in the previous example? You wouldn't have to re-sort on B and C if I'm sorting on multiple attributes. The paper doesn't really give an explanation of this, but this thing could be multiple attributes or a single attribute — all it's going to say is that this is the range of the domain that's being represented by a partition. Within the partition, though, I don't have to re-sort things. Okay? I'm not keeping these things sorted. This is just telling you: hey, the data you're looking for is going to be in this range, but when you land here, go figure out where it is. Okay, yeah. And again, if it's OLTP, then I don't want this partition to be huge, because then I've got to scan a bunch of the column to find exactly what I wanted. If my partition is smaller, then it'll be faster to find what I need, because there's less data to look at. Again, they have this offline algorithm to resize these things based on how they're being accessed. Okay.

So I've already said this, but I just want to reiterate it. For these HTAP workloads, these hybrid workloads, there's this dichotomy between hot data and cold data. The hot data are things that were just recently inserted, just recently updated, or just recently accessed through an OLTP transaction. The rest is all this cold data — things we need to keep around to do scans and analytics on, but that we're very unlikely to go update. So if we're aware of this — and this is what the Casper thing is trying to exploit — then instead of just resizing partitions, we could also reorganize or store the data in different ways, such that the hot data is in a row store, because that'll be faster for transactions, and over time, as it cools off, we move it to a column store, because that's where we want to store our cold data. This is the hybrid storage model approach. I'll say this up front: this is how we originally did Peloton, and it's what I used to think was the right idea. I actually now think that a pure column store — whether you're doing the Casper approach or the HyPer approach that we do — is the better way to go. It's not worth the engineering overhead to have all these extra storage managers in place to make this work. But you should at least know how it works, because a lot of systems actually do this.

So the idea is that we have a single logical database instance where, underneath the covers, we store our hot and cold data differently. I call create table, and any time I do a lookup I see one table, but underneath the covers the database system is managing the tuples differently. Again, the hot data will be in the row store, the cold data will be in the column store. There are two approaches to doing this from an engineering perspective. The first is that you maintain two completely separate execution engines and storage managers in your system, and then you have something that stitches the answer together up above. The other approach, which is what we did in Peloton, is that you have a single execution engine and a single storage manager that can operate on both row store and column store data at the same time — you have an indirection layer that knows how to read the data in the two different models and can unify it in a single engine. The separate execution engine is the most common one, because what mostly happens is that people take two existing database systems, slap them together, and put a little middleware thing on top to unify them.
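Since this whole discussion keeps contrasting the two layouts, here's a tiny sketch of the same logical table stored both ways, assuming fixed-length attributes; nothing here is specific to any particular system:

#include <cstdint>
#include <iostream>
#include <vector>

// One logical table with schema (id INT, balance INT), stored two ways.

// Row store (n-ary layout): each tuple's attributes are contiguous, which is
// what you want for OLTP-style "read/update the whole tuple" access.
struct RowTuple {
  int32_t id;
  int32_t balance;
};
using RowStore = std::vector<RowTuple>;

// Column store (decomposed layout): each attribute is its own contiguous
// array, which is what you want for OLAP-style "scan one column" access.
struct ColumnStore {
  std::vector<int32_t> id;
  std::vector<int32_t> balance;
};

int64_t SumBalancesRow(const RowStore& t) {
  int64_t sum = 0;
  for (const auto& tuple : t) sum += tuple.balance;  // strides over ids too
  return sum;
}

int64_t SumBalancesColumn(const ColumnStore& t) {
  int64_t sum = 0;
  for (int32_t b : t.balance) sum += b;  // touches only the balance column
  return sum;
}

int main() {
  RowStore rows = {{1, 100}, {2, 250}, {3, 75}};
  ColumnStore cols{{1, 2, 3}, {100, 250, 75}};
  std::cout << SumBalancesRow(rows) << " " << SumBalancesColumn(cols) << "\n";
  return 0;
}

The hybrid storage model is just the claim that the same table can have some tuples living in the first layout (hot) and some in the second (cold) at the same time.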
So the two ways to do this would be fractured mirrors, which is what Oracle and IBM are doing, and the delta store approach, which is what HANA used to do. The delta store one sort of looks like the time travel table we talked about before — remember, they had the main data table and the time travel table, and the main data table always had the oldest version while the time travel table had the latest one. That's because the main table was the column store and the time travel table was the delta store. I'll go through both of these.

The fractured mirror one is: we basically have a complete copy of the database, but stored as a column store. It's like a mirrored copy, but it's called fractured, meaning I broke the mirror because I reorganized the rows to be columns. The row store will always be the primary storage of the database. All my reads and writes go here, all my transactions update this portion, and then in the background, through some process, I copy the data out, convert it to a column store, and store it in the mirror. Now when an analytical query shows up, if it only needs to touch data that's already in the column store, I can route it over there and it'll run more efficiently. Yes? The question, or statement, is: this data could be a bit stale, right? It depends. Sometimes it can be, sometimes you want it to be, and sometimes you don't want that. If it can't be stale, then when an analytical query shows up, you need to recognize, oh, you need data that I don't have in the column store yet, so I also need to execute that query on the row store and then put the results back together. Oracle does this. It's more like they figure out what portion of the query needs to run on the row store and what portion needs to run on the column store, and they'll run both and then put it all together. Sybase IQ, as far as I know, didn't do that. It was sort of separate from the row store — it didn't know about the row store other than getting updates from it. So if you ran an analytical query on Sybase IQ, it would only touch the column store; it wouldn't go back and get the rest of the data from the row store. Whereas Oracle combines everything. But staleness can be fine, right? In what real-life example... So the statement is that staleness is okay — I think he brought this up earlier — like, in what scenario would you need serializable execution of an analytical query with up-to-date information? People want that. People wanted it enough that Oracle decided to build it. The important thing to point out, though, in Oracle's case, for their in-memory column store — which the guy coming in on Wednesday can talk about — is that this thing is ephemeral. If you crash and come back, or stop the system and turn it back on, it'll repropagate the column store by reading from the row store. So again, all the updates have to go to the row store; that's the primary storage, and the column store is propagated lazily.
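Here's a minimal sketch of the fractured-mirror data path, just to illustrate the flow: writes land in the row store, a background pass rebuilds the columnar mirror, and reads that can tolerate staleness get routed to the mirror. All the names and structures are made up for illustration; this is not Oracle's or IBM's actual interface, and a real refresh would only copy changed blocks:

#include <cstdint>
#include <iostream>
#include <vector>

// Primary storage: a row store that all transactions read and write.
struct Row {
  int32_t id;
  int32_t balance;
};

// The "mirror": the same data reorganized into columns, refreshed lazily.
struct ColumnMirror {
  std::vector<int32_t> id;
  std::vector<int32_t> balance;
};

// Background task: rebuild the mirror from the primary. A real system would
// propagate only the blocks that changed since the last refresh.
void RefreshMirror(const std::vector<Row>& primary, ColumnMirror* mirror) {
  mirror->id.clear();
  mirror->balance.clear();
  for (const Row& r : primary) {
    mirror->id.push_back(r.id);
    mirror->balance.push_back(r.balance);
  }
}

// Analytical query routed to the mirror; it may be stale if writes hit the
// primary after the last refresh.
int64_t SumBalances(const ColumnMirror& mirror) {
  int64_t sum = 0;
  for (int32_t b : mirror.balance) sum += b;
  return sum;
}

int main() {
  std::vector<Row> primary = {{1, 100}, {2, 250}};
  ColumnMirror mirror;
  RefreshMirror(primary, &mirror);
  primary.push_back({3, 75});                // new write: only in the row store
  std::cout << SumBalances(mirror) << "\n";  // 350, i.e. stale until refreshed
  return 0;
}

If staleness is not acceptable, the router has to split the query between the mirror and the primary and merge the results, which is the extra work Oracle takes on.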
Yes? The question is, is MemSQL doing this? No, MemSQL does the delta store approach. The way this one works is that, again, you have this delta store, which is like a row store, and it holds the pending updates, right? And actually, the newer version of MemSQL does this; the old version didn't. In the old version, you'd call create table and you'd say whether you wanted it to be a row store or a column store. But then if it started off as a row store and now I want it to be a column store, I've got to dump it out and load it back in, right? I couldn't take the data from a row store table and have it seamlessly converted to a column store; I had to have two separate logical tables. The newer version actually does this, where you apply the updates to the delta store and then they get appended to the column store. So again, you have the same issue we had with fractured mirrors: if I have a query that needs to touch data on both sides, then I need a way to execute on the row store plus the column store.

Now, if you're doing MVCC, with the delta store you kind of get the same thing for free. You already need to be able to traverse back in the version chain and find the right version of a given tuple, and that essentially becomes the same thing as the delta store. It's just that in our system we're going newest-to-oldest rather than oldest-to-newest, so when we look in our column, that's always the newest version, but we still have a way to go back and reconcile the right version by replaying updates in the delta store. The question is, are some of these old delta stores ever... In MemSQL or just in general? In MemSQL. So the question is, in MemSQL, are these things ever flushed out to disk? As far as I know, no, other than the log, right? But that's separate from this.

The reason I brought up that MVCC stuff is that, going back to the fractured mirrors case, you basically build two database systems: you need one that operates on row store data and one that operates on column store data. You need an execution engine that knows how to efficiently operate on columnar data, and that's going to be different from how you do things on row store data, so you're basically maintaining two separate engines. The delta store one is a little less work, because if you're doing MVCC you need to handle this anyway. This is actually, I think, the better approach. Whether the delta chains, the version chains, we have in our system constitute a full delta store, that's debatable — I think that's just semantics. But I think this is the better approach, because now you don't have this engineering monstrosity where you're trying to maintain two separate database systems together, right?

So with Peloton, what we did was — we were trying to be clever — we tried to have both a row store and a column store together, with a single execution engine above them that knew how to operate on both. We did this by having an abstraction layer that we called tiles: the engine just operates on tiles and doesn't know whether, underneath the covers, a tile is a row store or a column store, and the access methods for tiles of different types know how to put things together or access them efficiently. The idea was that we would look at our data, see how the transactions and analytical queries were accessing it, recognize which regions were hot and which were cold, and then store the hot data as a row store and the cold data as a column store. And then, when I execute queries, I know how to put things together as needed.
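To give the flavor of that tile abstraction — and to be clear, this is not Peloton's actual API, just a sketch of the idea under my own made-up names — the execution engine reads values through a common interface and never sees whether the tile underneath is row-oriented or column-oriented:

#include <cstdint>
#include <iostream>
#include <memory>
#include <vector>

// Common interface the executor sees. Layout is hidden behind it.
class Tile {
 public:
  virtual ~Tile() = default;
  virtual int32_t GetValue(size_t row, size_t col) const = 0;
  virtual size_t NumRows() const = 0;
};

class RowTile : public Tile {                 // hot data: row-oriented
 public:
  explicit RowTile(std::vector<std::vector<int32_t>> rows)
      : rows_(std::move(rows)) {}
  int32_t GetValue(size_t row, size_t col) const override {
    return rows_[row][col];
  }
  size_t NumRows() const override { return rows_.size(); }
 private:
  std::vector<std::vector<int32_t>> rows_;
};

class ColumnTile : public Tile {              // cold data: column-oriented
 public:
  explicit ColumnTile(std::vector<std::vector<int32_t>> cols)
      : cols_(std::move(cols)) {}
  int32_t GetValue(size_t row, size_t col) const override {
    return cols_[col][row];
  }
  size_t NumRows() const override {
    return cols_.empty() ? 0 : cols_[0].size();
  }
 private:
  std::vector<std::vector<int32_t>> cols_;
};

// One executor that sums a column across tiles of either layout.
int64_t SumColumn(const std::vector<std::unique_ptr<Tile>>& tiles, size_t col) {
  int64_t sum = 0;
  for (const auto& tile : tiles)
    for (size_t r = 0; r < tile->NumRows(); r++) sum += tile->GetValue(r, col);
  return sum;
}

int main() {
  std::vector<std::unique_ptr<Tile>> tiles;
  tiles.push_back(std::make_unique<RowTile>(
      std::vector<std::vector<int32_t>>{{1, 100}, {2, 250}}));   // hot tile
  tiles.push_back(std::make_unique<ColumnTile>(
      std::vector<std::vector<int32_t>>{{3, 4}, {75, 30}}));     // cold tile
  std::cout << SumColumn(tiles, 1) << "\n";  // 100 + 250 + 75 + 30 = 455
  return 0;
}

The engineering pain comes from making that indirection fast and from testing every operator against every combination of layouts, which is part of why we eventually dropped it.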
I think I have one graph of how this would work. Basically, everything always starts off as a row store. Say you're doing a workload where you go scan, insert, scan, insert — so basically OLAP, OLTP, OLAP, OLTP, and so forth. With the row layout, the scans will be slower and the inserts will be faster. With a columnar layout, the scans are faster but the inserts are slower, because now I'm breaking up a tuple and storing it in multiple locations. And the idea with the hybrid layout was that everything starts off as a row store, but over time, as you start doing more scans and the data gets colder, it gets converted to a column store, and that speeds up the scan operations. So the inserts are just as fast as a row store, because it is a row store, but then the scans get faster. We ditched all this because it was an engineering nightmare — new students couldn't understand what the hell we were actually doing, and having to write test cases and performance regressions for this hybrid thing became a big problem for us. So we ditched it, kept things simple, and just went with a clean column store. I think that's actually the better approach. Whether you also have the shallow index that the Casper guys put above that, that's debatable. I don't like the idea of having additional background mechanisms that go and try to reorganize data for you by looking at the workload; I think that should also be centralized, and for this implementation we didn't have that yet. So that was kind of rushed at the end, but the main takeaway is that there's a bunch of different ways to do row stores and column stores together. I argue that, at least for the workloads we're looking at, a pure column store with MVCC is the way to go, but a lot of times you're going to encounter the fractured mirror approach. That's pretty common for all the major vendors. Okay?

All right, I want to quickly talk about catalogs. We're not gonna have time to do demos, but catalogs are super interesting. Remember, we've been talking about this since the very beginning: how do we figure out the layout of a tuple? How do we know how to jump to different offsets? Well, when you call create table, we store the metadata that you specified when you created that table in a catalog, and then when we go access a tuple, we read the catalog and figure out what the layout of our data is. The tricky thing about catalogs, though, is that we actually want to eat our own dog food, meaning we don't want to have a separate database to maintain just to store the catalog. We want to store the catalog in our own database, in our own tables. But now it's a chicken-and-the-egg problem: I need to store metadata about tables in the catalog, but I need to store my catalog in tables, right? So we have to have specialized code in order to bootstrap the catalogs. The Postgres one is actually kind of interesting: they basically have this DSL that they convert into C code for doing the bootstrapping, basically inserting all the things you need to have in order for the catalog to turn on. I'm not aware of what other systems do, because Postgres is the one where I can actually see the source code. In the early versions of MySQL, like MySQL 5.7, they just stored the catalog as flat files on disk, which I think is the wrong thing to do. You want to store everything as regular tables, because then you get all the ACID guarantees you would expect for regular tuples in data tables.
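To illustrate the chicken-and-egg part, here's a toy sketch of bootstrapping a catalog that is itself stored as tables. The names are loosely modeled on Postgres's pg_class and pg_attribute, but none of this is real Postgres code or its actual bootstrap DSL:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Toy catalog stored "in its own tables": one table of tables, one table of
// columns.
struct TableEntry { uint32_t oid; std::string name; };
struct ColumnEntry { uint32_t table_oid; std::string name; std::string type; };

struct Catalog {
  std::vector<TableEntry> pg_class;       // table of tables
  std::vector<ColumnEntry> pg_attribute;  // table of columns
};

// Bootstrap: before any user table exists, the catalog tables need entries
// describing *themselves*, otherwise nothing can interpret their own rows.
// This is the hard-coded step that Postgres generates from its little DSL.
Catalog Bootstrap() {
  Catalog c;
  c.pg_class = {{1, "pg_class"}, {2, "pg_attribute"}};
  c.pg_attribute = {
      {1, "oid", "int"},       {1, "name", "text"},
      {2, "table_oid", "int"}, {2, "name", "text"}, {2, "type", "text"}};
  return c;
}

// After bootstrap, CREATE TABLE is just more inserts into the same tables,
// so catalog changes get the same transactional guarantees as user data.
void CreateTable(Catalog* c, uint32_t oid, const std::string& name,
                 const std::vector<ColumnEntry>& cols) {
  c->pg_class.push_back({oid, name});
  for (const auto& col : cols) c->pg_attribute.push_back(col);
}

int main() {
  Catalog c = Bootstrap();
  CreateTable(&c, 100, "students", {{100, "id", "int"}, {100, "sex", "char"}});
  for (const auto& t : c.pg_class) std::cout << t.oid << " " << t.name << "\n";
  return 0;
}

The point of the sketch is just that once the self-describing entries exist, everything else — including schema changes — is ordinary inserts, updates, and deletes on catalog tables.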
So let's go through some examples of how to do schema changes, and you'll see how some things are easier if you're a row store and harder if you're a column store, and vice versa. If I add a column, then in a row store I basically need to scan through every single tuple, copy it to a new location, and insert it with the new column in it. Some database systems don't let you add the column at an arbitrary position — you can only add it at the end; Postgres does this. Others, like MySQL, let you specify exactly where you want to insert it, because basically what they're doing is copying every single tuple from the old schema to the new schema. For a column store, it's potentially super easy. Yes, question? Columns are super easy, because I just make a new column, right? I don't need to modify any of the other columns. Now, it depends how I organize my blocks, but in general this is way easier than a row store, because I don't have to touch any existing data.

For drop column, same thing. If I'm going to take the column out to save memory, then I again go through every single tuple and copy it over. In some systems you can just mark it as deprecated — Postgres does this. If I drop a column, it doesn't actually get dropped; it just gets recorded in the catalog as dropped. You know to ignore it if any query comes along and tries to read it, but physically it's still there, and then at some point, when they run the garbage collection for MVCC, they'll go ahead and compact it. For a column store, again, if I'm storing my columns separately in memory, then I just go ahead and free that one up and I'm done, and it doesn't affect anybody else.

Change column — like if you change the column type or change constraints on the column — depending on what change I'm making, I may have to scan through and look at every single tuple before I can make that change. And then, depending on whether I'm a row store or a column store, I could leave the data as it exists, or I might have to go change things. Like if I go from a 32-bit integer to a 16-bit integer, I have to scan through to make sure I'm not gonna have out-of-bounds values if I make that change, but maybe I don't need to actually copy everything into a new 16-bit column — I can leave it as 32 bits and clean it up at some later point, so the conversion is just done at a logical level.
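Here's a tiny sketch of why adding a column is cheap in a column store and expensive in a row store. Again, these are made-up structures for illustration, not how any particular system lays out its tables:

#include <cstdint>
#include <iostream>
#include <vector>

// Row store: adding a column means rewriting every tuple into a new, wider
// layout, i.e. copying the whole table.
using RowTable = std::vector<std::vector<int32_t>>;

RowTable AddColumnRowStore(const RowTable& old_table, int32_t default_value) {
  RowTable new_table;
  new_table.reserve(old_table.size());
  for (const auto& tuple : old_table) {
    std::vector<int32_t> wider = tuple;   // copy the existing attributes
    wider.push_back(default_value);       // append the new attribute
    new_table.push_back(std::move(wider));
  }
  return new_table;  // touches every attribute of every tuple
}

// Column store: adding a column is just allocating one new array; none of
// the existing columns are touched.
struct ColumnTable {
  std::vector<std::vector<int32_t>> columns;
};

void AddColumnColumnStore(ColumnTable* table, int32_t default_value) {
  size_t num_rows = table->columns.empty() ? 0 : table->columns[0].size();
  table->columns.emplace_back(num_rows, default_value);  // one new column only
}

int main() {
  RowTable rows = {{1, 100}, {2, 250}};
  rows = AddColumnRowStore(rows, 0);
  ColumnTable cols{{{1, 2}, {100, 250}}};
  AddColumnColumnStore(&cols, 0);
  std::cout << rows[0].size() << " " << cols.columns.size() << "\n";  // 3 3
  return 0;
}

Drop column is the mirror image: the row store has to rewrite tuples (or mark the column deprecated in the catalog, Postgres-style), while the column store can just free the one array.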
Indexes are super tricky. This one I don't think we'll have time to discuss this semester. I will say, this is something that people have tried for two years now in this class: we've always tried to do create index. We can create indexes now in our system — if you call create index before you load any data into the table, that's fine, and then when you insert data we'll update the index. It's going through and populating an index on a table that already has data in it, that's when things get tricky. Because you want to do this without blocking inserts — I could just lock the whole table, scan it, and populate my index, but that could take a long time, so there are ways to do index population without having to block everything while you do it. And this is something that people have tried two times in this class without success. I think it'd be fun to try this semester again. Drop index is super easy, right? You just drop it logically from the catalog, and then in the background, at some point when you know nobody else is looking at it, you go ahead and drop it. And we have this deferred event framework in our system now that does garbage collection for any kind of memory we allocate in the system, so we can handle this already, right?

The last one will be sequences. Sequences are auto-increment numbers, like the serial type in Postgres. Think of an attribute where, every time I insert a tuple, if I don't specify a value, there's some counter that gets incremented by one, and that way the next tuple that gets inserted will have the counter plus one, and so forth. What's super interesting about them is that they don't follow the same transactional guarantees and isolation protections you get from regular transactions, right? We're going to store this counter in the catalog, but if I insert a tuple, the counter gets increased, and then if I roll back my transaction, the tuple gets rolled back but the counter still goes up. So what needs to happen is you actually have to log that counter to disk, because that way, if I crash and come back, I know where the counter left off, even though the transaction that modified it may not have gotten written to disk. This is actually something that students did do in the old system, which was a lot of fun, and I think this would be good to do again in this class. There are obviously optimizations we could look at from Postgres and other systems: maybe you only update the counter on disk for a batch of values — like every 10, write that to disk, then hand out the new values in memory, and only when you hit the upper bound do you increment the counter in the catalog again. Okay? We're out of time, so I can't do a demo, but we can do it after class if you're interested in this kind of stuff.

All right, so just to finish up. As I said, we abandoned the hybrid storage model. I think simplicity, at least for this part of the system, is a big win and the right way to go. The delta versioning we use under MVCC, plus a column store, is roughly equivalent to the different approaches that we saw before. Catalogs are really hard. This is actually something we got right in our system now — I'm actually happy to say this. We followed the Postgres schema for the layout of the catalogs, but everything's stored as tables, so we get all the transactional guarantees you would normally get for regular tuples. There's a lot of cool things we can do now that our catalogs are transactional, but we can cover that more when we talk about project three. Okay? Any questions?