So now, again, here is an overview of where we're going in the course. We're going down the stack: from the higher-level parts like concurrency control and index management down to the lower levels of the storage manager. This is about how the database system actually stores data in memory. For today's lecture, we're going to first talk about how we represent the values of different data types in memory. Then we'll talk about how we lay out our data in memory. And then we'll finish off talking about the different types of storage models you can have. One high-level comment I'll make is that a lot of the things I'll be talking about today are not specific to in-memory databases. I'll be describing them in the context of an in-memory database, but they're certainly still applicable to your disk-based systems as well, okay?

So we saw this diagram at the beginning of the semester when I showed you how an in-memory database lays out its actual data. We said we were going to have some kind of index — whether it's a skip list, a Bw-tree, a B+tree, it doesn't matter — and the value portion of the key/value pair in the index would be a memory address pointing to the location of the tuple. You see this in your skip list implementation: the value type of your skip list is an item pointer, and the item pointer is essentially this thing here. What the item pointer actually is is a composite key comprised of a unique block ID, which tells you which one of these blocks you want to go to, and then an offset within that block. You use the block ID to figure out the starting address of that block of memory, and then use the offset to jump to the right position. And we can do this because all the tuples here have fixed-length attributes, so we know exactly how to jump to whatever position we need to get to. For things that are variable length, we'll have a pointer that points to some block of memory in a variable-length pool. In our current implementation in Peloton, the ephemeral pool is just a wrapper around jemalloc — this is why you're not allowed to write malloc and free in your code; we do all that for you in that pool. When we start talking about doing analytical queries on NUMA architectures, it actually becomes important to know where your memory is, and maybe jemalloc is not exactly what you want to use; maybe you want to manage memory a little more yourself.

So you can think of an in-memory database as essentially a bunch of large blocks — arrays of memory, arrays of bytes. We just malloc these large blocks, and the database system uses the schema that was defined for the table to understand how to interpret the bytes within those arrays as whatever type it is we're looking at.
And what's going to happen is, for every single tuple in our block, we'll have a little bit of extra space in front of it — a prefix, a header — that tells us whatever additional information we need to know about that tuple. We saw this when we talked about MVCC: those protocols store the versioning information, or the pointer to the next/previous version, in the header. We'll have that in front of the tuple, and everything that comes after the header is the actual data. Because we're always going to store tuples in fixed-length data blocks, it's really easy for us to jump to whatever tuple we need, since we know exactly what the size of each tuple is. If we know we want the fifth tuple, we take the starting memory address of the block and add the slot number times the length of the tuple plus the header — that is, address = block start + slot × (header size + tuple size) — to jump to the location we're looking for. And this is essentially the same thing a disk-based database system is doing: a disk-based system manages pages and moves them in and out of disk, whereas in our case we're just managing in-memory blocks.

So the way you actually represent the low-level data types will obviously depend on what the type is. For the basic integers — INTEGER, BIGINT, SMALLINT, TINYINT, MEDIUMINT — the 16-, 32-, or 64-bit integers, we just use the native C/C++ representation of these numbers, because that's the native representation of numbers in the underlying architecture. We don't have to do anything special for these: we declare a 32-bit integer in C or C++, and that ends up being exactly what gets stored in memory. That's pretty straightforward.

When it comes to floating point numbers or decimals, that's when things get a bit tricky. Some of you saw this when you implemented the extract function in project #1 and hit weird rounding errors. There's a distinction between the FLOAT or REAL types, which use the native C/C++ representation of these values, and the NUMERIC or DECIMAL types, which allow for fixed-point or arbitrary-precision decimals to avoid the rounding errors you were hitting. I'll spend more time talking about this, but it explains why some of you hit weird rounding errors when you did the math for the extract function.

And as we said before, for variable-length values — VARCHAR, VARBINARY, TEXT, BLOB, anything that can be variable length — if the value is less than 64 bits, we can store it directly inline in our fixed-length block of data. If it's larger than that, we'll have a pointer to a location in the variable-length pool. At that location there'll be another header that says what the length of this blob is, followed by all the data bytes. And if your blob is really big and exceeds the size of a slot in the variable-length pool, you can also have a pointer to the next chunk, so you can chain these things together. Typically you don't want to store really, really large items in a database — you don't want to store a 10 gigabyte file — but you could if you wanted to, and the way you'd do that is through these pointers.
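To put those pieces together, here is a minimal sketch of the item pointer, the fixed-length slot arithmetic, and the variable-length indirection just described. All of the names and sizes here are illustrative stand-ins, not Peloton's actual types:

```cpp
#include <cstddef>
#include <cstdint>

struct ItemPointer {
  uint32_t block_id;  // which fixed-size block of memory
  uint32_t offset;    // slot number within that block
};

struct VarlenEntry {
  uint32_t length;    // number of data bytes in this chunk
  VarlenEntry *next;  // next chunk if the blob exceeds one slot, else nullptr
  // ...the raw data bytes follow this header in memory
};

// Assume these come from the storage manager and the table's schema.
extern char *block_table[];        // block_id -> starting address of the block
constexpr size_t HEADER_SIZE = 8;  // per-tuple header (e.g., MVCC metadata)
constexpr size_t TUPLE_SIZE = 24;  // fixed-length data portion of each tuple

// Jump straight to a tuple: block start + slot * (header + tuple size).
char *GetTupleAddress(ItemPointer ip) {
  return block_table[ip.block_id] + ip.offset * (HEADER_SIZE + TUPLE_SIZE);
}
```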
For timestamps, dates, and times, it really depends on the implementation of the database system; different systems do different things. In the case of Peloton, we follow what we did in H-Store and VoltDB, where we always store these attributes as a 64-bit integer: the number of milliseconds or microseconds since the Unix epoch. In other systems, if you say you want a TIME field versus a DATE field, it can use a smaller number of bits, because it doesn't have to store information about things you don't care about. For example, if you want a DATE field, you don't care about the time of day, so you can just store it as a 32-bit number. I'm not going to talk too much about this. Things get kind of weird when you bring in time zones, which is part of the reason why we didn't ask you to handle time zones in the extract function. In that case you may need to keep track of additional information to say that this timestamp is for this time zone. That's a whole other ball of wax, which gets into ISO standards that I don't want to talk about.

So let's spend more time on the floats and reals versus the numerics and decimals. FLOAT and REAL (or DOUBLE) are inexact, variable-precision floating point numbers that again use the "native" — and I put native in quotes — C/C++ data types. These essentially map down to the native types of the underlying CPU architecture. The way CPUs store these follows what's called the IEEE 754 standard. This is a standard that came out a couple of decades ago that describes exactly how floating point, variable-precision numbers should be represented in hardware, and therefore this is what C and C++ adopt. These are going to be much, much faster than the arbitrary-precision numbers we'll see in a second, but the downside is that, as you saw in the extract function, you sometimes get weird, unexpected rounding errors. The standard specifies exactly how the data is stored — the number of bits, where the mantissa is, where the actual underlying value is — and it tells you what the hardware should do when you hit rounding cases. Because if you think about it, you can't represent certain numbers, like one divided by three, in an exact form, so the standard has to specify what to do in cases like that.

Everyone here should have taken an intro systems course, so this should not be groundbreaking news to anyone. But here is a really simple example in C where you can see why this is problematic. Say we declare two floats, x = 0.1 and y = 0.2. If we add x and y together, we'd expect the answer to be 0.3; likewise, if we just print 0.3 directly, we expect to see 0.3. But when you actually run this code — because I have a qualifier in my printf statement saying show me 20 digits after the decimal point — you get this result: x + y equals 0.3 followed by a whole bunch of extra digits, and 0.3 printed directly equals 0.2999... all the way over. This is because the IEEE 754 representation can't store these numbers exactly. So this is why some of you were getting rounding errors: when you used printf to look at the output, the values looked like they matched exactly, but when you actually did an equality comparison between the two numbers, it would always come up different, because at the very low end of the value there are a bunch of extra decimal places.
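Here is a minimal reconstruction of the snippet from the slide (the exact code may differ); you can compile and run it to see the mismatch yourself:

```cpp
#include <cstdio>

int main() {
  float x = 0.1;
  float y = 0.2;
  printf("x+y = %.20f\n", x + y);  // prints 0.30000001192092895508...
  printf("0.3 = %.20f\n", 0.3);    // prints 0.29999999999999998890...
  // Both look like "0.3" at default precision, but (x + y) == 0.3 is false.
  return 0;
}
```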
So if I got rid of the .20 precision qualifier when I printed these, they would both come out as 0.3 and look exactly the same when printed as strings, but when you do the comparison they would be different. Again, that's because variable-precision numbers cannot store decimals exactly. So this is bad: if you need to store something where you can't tolerate rounding errors, you can't use these types.

This is where the fixed-precision numbers come in; these are sometimes called arbitrary-precision numbers or fixed-point decimals. The basic idea is that we're going to store decimal values in their exact form inside the database, and in exchange we have to maintain some extra metadata to keep track of all the precision information about the decimal. To get this, you use the NUMERIC and DECIMAL data types. Sometimes when you read the database manuals they'll say these two are synonyms, equivalent to each other; when I was poking around Postgres last night, they are actually different — DECIMAL carries less metadata than NUMERIC. What's going to happen is that rather than storing the value as a floating point number, we're going to store it sort of like a VARCHAR: the actual digits as they were given to the database system, in their exact form. So it's like storing a string without really being a string.

This is an actual snippet of the code inside Postgres for how they represent the numeric data type. Postgres was written in C in the 1990s, so they don't have classes, they have structs, but it's essentially the same thing. This is the struct for representing a single numeric value inside Postgres, and as you can see, there's a bunch of extra metadata they keep track of for just a single value. First of all, they have to know the number of digits that was provided. Then you have the weight, in case you need a new base value for the number to store it in a smaller form. Then you have the scale factor for how to convert it from a decimal to an integer. Then you have the sign information that tells you whether the number is positive, negative, or not-a-number. And then you have the actual bytes for storing the digits: they have a digits field that's just an array, and the numeric digit type is just an alias — there's a typedef above — for unsigned char. So for every single digit in my number, they're storing eight bits, one byte. Whereas before I could take a whole number and store it as a single 64-bit value, now I have to store a byte for every single digit. This is how Postgres — and all the database systems that support numeric data types — avoid the rounding errors that come with inexact representations.
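A lightly paraphrased version of the struct described above looks roughly like this; the field names track Postgres's NumericVar, but check the actual source for the current definition:

```cpp
/* Paraphrase of the Postgres struct shown on the slide; not the exact
 * current source. One byte per decimal digit. */
typedef unsigned char NumericDigit;

typedef struct NumericVar {
  int ndigits;           /* number of digits in the digits[] array */
  int weight;            /* weight of the first digit (the base shift) */
  int dscale;            /* display scale: digits after the decimal point */
  int sign;              /* positive, negative, or NaN */
  NumericDigit *digits;  /* the actual decimal digits, one per byte */
} NumericVar;
```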
But now the problem is that this is slow. When you actually look at the Postgres code, here is just a snippet of their addition function — the one that takes NUMERIC plus NUMERIC and produces a NUMERIC. I'm not going to read all the code, but you just want to see that there are all these conditional statements: if one's positive and one's negative, then you have to convert it a certain way; if one's not a number and the other one is, then do something different. So this is going to be way, way slower than the native IEEE 754 data types, because in that case the hardware supports it — you just put the value into a floating point register and do your addition — whereas here we're executing many instructions just to do one computation.

To give you an idea of how much slower this is, I have a quick demo with Postgres. What I did is create a simple table in Postgres with 10 million records and two columns, A and B, both floating point numbers. I loaded the dataset once into a table that uses the DECIMAL type — the arbitrary-precision one — and loaded it again into a table that uses the REAL type, which is the native format. So this one has 10 million rows and this one has 10 million rows: the exact same data, stored differently in Postgres.

What we'll do is compute an aggregate — the sum of A plus B — first on the REAL table, using the IEEE standard. It produces some answer. Then we do the same thing on the DECIMAL table, and it produces a different answer, and is obviously much slower. Notice that the REAL result is in scientific notation with a lot rounded off, whereas this one looks like an exact number. And you might have seen that doing it on the reals was much faster, which we can confirm with EXPLAIN ANALYZE: running the query on reals takes 1.7 seconds, while the same query on decimals takes almost four seconds — 3.9 seconds. And just to show you this isn't caching in any way, I can execute this multiple times and it always takes roughly the same time; there's no caching effect going on here. So this is showing you that in the decimal case it has to go through all that code I showed you before to do the addition, checking all the cases so you don't get rounding errors, whereas in the real case the hardware just adds the two values directly.

Now, I showed you that we got two different answers. So we can execute some Python code that rips through the CSV file: it reads the file in and calculates the total by casting the values. It produces the answer 99,506,874.7, and if we go back, that's roughly the same answer that Postgres got when we did it as decimals. So this is showing you that Python is also doing a little extra work to avoid the rounding errors you get with the native types. So again, the main takeaway here is that floating point numbers use the IEEE 754 standard and are faster but have rounding errors, whereas NUMERIC and DECIMAL do extra work to make sure we don't have rounding errors but are slower. All right, so you would use these numeric types any time you're dealing with money.
You don't want to lose money — and any time you're dealing with any kind of scientific data where you want exact answers. One of the big problems with using the native types is that if I load my database on one machine with one kind of CPU and compute an answer, and then run the same query on another machine with the same data, it may come up with a slightly different answer, and that's obviously bad. So again, any time you're dealing with something where you can't tolerate rounding errors, you would use a numeric type.

We saw another way to do fixed-point decimals in the paper you read last class, where we saw decimal encoding in the SQL Server columnar indexes. I skipped over the exact protocol then, but I want to show it now as another example of how you can do this kind of conversion — do extra work, but get exact results. Say we have a list of values: 0.5, 10.77, 1.33. The first thing we're going to do to convert these decimals into fixed-point numbers is produce an exponent such that multiplying each value by 10 raised to that exponent converts them all from decimals to integers. In this case it's three, because we need to shift all of these numbers over by three decimal places, so 10^3 would do that: you'd end up with 500, 10770, and 1330. Now we pick the lowest value as the base for these converted numbers in the initial encoding — in this case the lowest value is 500 — and we go back and subtract 500 from all these values to put them into an even smaller form. The idea here is that we're doing an encoding to go from a floating point number to a fixed-point number, and we're also doing this additional compression so that if we have really large numbers, instead of storing them as full 32- or 64-bit integers, maybe we can store them in a smaller data type. And then to do any kind of calculation — when we add two of these numbers together — we can just do it natively on integers, and then do the reverse steps to put the result back into its correct form.
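Here is a sketch of that encoding. The function names and container choices are mine; the steps — scale by 10^exponent, then subtract the minimum as a base — follow the example above:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct EncodedColumn {
  int exponent;                 // values were multiplied by 10^exponent
  int64_t base;                 // smallest scaled value, subtracted out
  std::vector<int64_t> values;  // small, exact, non-negative integers
};

EncodedColumn Encode(const std::vector<double> &in, int exponent) {
  EncodedColumn out;
  out.exponent = exponent;
  std::vector<int64_t> scaled;
  for (double v : in) {
    // Shift the decimal point right by `exponent` places: 10.77 -> 10770.
    scaled.push_back(std::llround(v * std::pow(10.0, exponent)));
  }
  out.base = *std::min_element(scaled.begin(), scaled.end());
  for (int64_t s : scaled) {
    out.values.push_back(s - out.base);  // re-base against the minimum
  }
  return out;
}

// For {0.5, 10.77, 1.33} with exponent 3: scaled = {500, 10770, 1330},
// base = 500, stored = {0, 10270, 830}. Decoding reverses the steps:
// original = (stored + base) / 10^exponent.
```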
Currently in Peloton we use the IEEE standard — the native floating points — for our decimals, but I think a pretty good project #3 for a group would be to actually support these arbitrary-precision numbers and look at ways to speed them up using query compilation and LLVM. We can talk about that later.

Okay, so now that we know how to lay out individual values, let's talk about how you actually lay out a series of values for a tuple together in memory. As I said, we can think of the data space for an in-memory database as just a giant array of bytes, a char array. And we said before that we're always going to have our header at the beginning, but then for all the different data types we have in our table specification, we lay them out contiguously, one after another, in this array. So in this case the first thing we have is an ID field, and that's a 32-bit integer, so the next 32 bits are stored here for the ID; after that we'd have a 64-bit value type, so the next 64 bits go there.

So the way this works inside our system: when we want to read the ID field for a particular tuple, we know how to find the tuple we're looking for, we jump past the header — we know the size of our header, typically four bytes or so — and then we jump to the offset where the starting point of this value is. Now, to actually read this, we use the reinterpret_cast function in C++. Does anybody know what reinterpret_cast does? Yes, exactly — as he said, you keep the binary representation but change the type. Absolutely. We have some address here that's a char*, pointing to some location in memory, but we want to start doing integer operations on it, so we need to cast it. reinterpret_cast essentially says: take whatever is at this address, whatever type it was before, and now treat it as this type — in this case a 32-bit integer pointer. Then everywhere else in the code we can operate directly on that data as if it were a 32-bit integer. This doesn't actually do anything at runtime; it doesn't emit new instructions. It's a compile-time construct: since C++ is strongly typed, it makes sure that whatever is on the other side accessing this data expects a 32-bit integer. You could do the same thing in C, where you just hand-cast everything, but again it's dangerous, because there's nothing preventing us from treating this address as a 64-bit integer and writing whatever we want to it. This is why we do a lot of testing and are very careful when we write code inside our database system: nothing is going to prevent us from trashing this storage location. So we use reinterpret_cast and other things to make sure the code is doing what we expect it to do. You'll see this a lot when you look at the type system in our database system: a bunch of reinterpret_casts that take arbitrary pointers into memory and convert them to whatever type we expect them to be.
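A minimal sketch of that attribute read looks like this; the header size and attribute offset are illustrative stand-ins for values the system would derive from the table's schema:

```cpp
#include <cstddef>
#include <cstdint>

int32_t ReadIdAttribute(char *tuple_start) {
  constexpr size_t HEADER_SIZE = 4;  // assumed per-tuple header size
  constexpr size_t ID_OFFSET = 0;    // ID is the first attribute after the header
  char *addr = tuple_start + HEADER_SIZE + ID_OFFSET;
  // Same bits, new type: treat the bytes at this address as a 32-bit int.
  return *reinterpret_cast<int32_t *>(addr);
}
```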
So now let's talk a little bit about how we represent NULLs in our database. Again, as I said before, nothing I'm about to describe pertains only to in-memory databases; these are the same techniques people use for disk-based databases as well. There are essentially three ways to represent NULLs, as far as I know.

The first is to designate a special value in the domain of a type to represent NULL for that type. For example, for 32-bit integers you would use the lowest value a 32-bit integer can store — INT32_MIN, as specified in limits.h — to represent NULL. Any time we need to store NULL, we'd always just store INT32_MIN. But in the upper levels of the system, if someone tries to insert a value equal to INT32_MIN, we throw an out-of-range exception to say we can't store this, because we're not going to let the user manually insert the sentinel. So basically the number of unique values you can have for a 32-bit integer is reduced by one, because we're using one special value. This is what we actually use in Peloton, because it's what was originally used in H-Store and VoltDB. What I like about this approach is that it doesn't require any extra memory. Yes, you have to check in the upper-level parts of the system that people don't try to insert the sentinel — it would be bad if someone inserted INT32_MIN, we let them do it, and then we treated it as NULL when they thought they stored a real value; that's why you throw the out-of-range exception. But again, there's no extra storage overhead, which is why I like it.

The second approach, which is actually probably more common, is to maintain a bitmap in the tuple header that keeps track of which attributes in that tuple are NULL. This is the approach used in Postgres, MySQL, TimesTen, and SQL Server — this is what pretty much everyone does. The downside, obviously, is that if you have a really wide tuple with a lot of attributes, you can have a really wide bitmap. This is why you see limits on the number of columns you can have in tables: I think Oracle's is around a thousand, and I think Postgres and SQL Server can support something like 2^16. If you think about it, you'd have this huge array to keep track of for every single attribute in a tuple, and that would be expensive.

The last approach is, instead of maintaining a single data structure for the entire tuple, to maintain a flag per attribute. For every single attribute in your byte array for the tuple, you prefix it with a little extra space to keep a flag that says whether this value is NULL or not. You really only need a single bit, since it's either true or false; but because we need to care about word alignment, we actually have to store a lot more than a single bit. If you look at the manual for MemSQL — because MemSQL, as far as I know, is the only system that actually does this — the documentation for the integer types lists the sizes in memory and the min and max values, and there are two columns: the size when the value can be NULL and the size when it can't. So take a 32-bit integer: when it can never be NULL, it's just four bytes, 32 bits. But when it can be NULL, they have to put a little flag in front of it that says true or false, whether it's NULL or not, and in that case they have to store another four bytes. So even though it's just a 32-bit integer, you have to store 64 bits because it might be NULL. Basically all the fixed-size data types double in size — with BIGINT you go from eight bytes to twelve. They have to do this because of word alignment.
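As a sketch of the second option above — the header bitmap — here is roughly what the per-attribute check looks like; the names are illustrative:

```cpp
#include <cstdint>

// One bit per attribute, stored in the tuple's header.
bool IsNull(const uint8_t *null_bitmap, int attr_idx) {
  return (null_bitmap[attr_idx / 8] >> (attr_idx % 8)) & 1;
}

void SetNull(uint8_t *null_bitmap, int attr_idx) {
  null_bitmap[attr_idx / 8] |= static_cast<uint8_t>(1u << (attr_idx % 8));
}
```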
Now, when I say word alignment, does everybody know what I'm talking about? Okay, some people are saying no. So I'll describe what word alignment is and why we care about it in an in-memory database. One caveat: what I'm about to say is not exactly the way real CPUs do it. I'll describe word alignment in the context of 64-bit words, but in real CPUs you care about cache lines, which are 64 bytes. I'm just showing you what word alignment means at 64 bits because it's easier to see at a smaller scale; when you go out into the real world on a real system, you care about 64 bytes.

The idea behind word alignment is that we want to make sure the CPU doesn't have to do extra work in order for us to read data. So let's say we have our table here, and we have four attributes we need to store in a byte array, and in our byte array I'm going to break things up into 64-bit words, 64-bit chunks. As I start adding values in, you'll see that we start crossing these word boundaries, which is problematic. (For this example, ignore the header; assume it's already there — it doesn't change anything.) First we start off with our ID; it's cut off on the slide, but this is 32 bits, so we store the ID field in the first half of the first word. Then we store the timestamp — say it's a 64-bit timestamp — and it covers the latter half of the first word and the first half of the second word. Then we have our color field, a fixed-length char, so that's always 16 bits. And then we have our zip code, 32 bits, which again spans a word boundary.

So now if I want to do a lookup on the date field, what happens in this case? How does the hardware actually work when I say I want to read a word at some address, since everything is byte-addressable? Someone says two fetches — actually it depends, but on x86, yes, it's two fetches. There are a few different things that could happen when you want to read a 64-bit value that isn't word-aligned. The first is that the hardware is being nice to us: it recognizes that you're trying to read something that spans the word boundary, does the first fetch and puts the upper half of the bits you need into a register, does a second fetch to get the other half, and then combines them and provides them to you together in a single register. So for what is logically a single load operation, you're doing two loads from memory to bring things into the CPU caches. The second possibility is that it's simply undefined behavior: it's left to the CPU to decide whatever it wants to do. It could read the first word and give you that without aligning it correctly; it could read the second one and start dropping bits; you just don't know. And the last possibility is that it throws an exception to say you're trying to do an unaligned memory access: I can't let you do this, this is going to be slow, and your operation fails. The first approach is what x86/x64 will do. The last approach is what ARM does, as far as I know from last year. And the undefined-behavior one — as far as I know, nobody still does that; that's what some hardware did maybe in the '80s and '90s. So this is actually problematic: assuming we're running on x86 hardware, any time we need to do an unaligned read, the CPU does twice the work for that one read.
So with word-aligned tuples, the basic idea is that as we write out the attributes of our tuple into memory, if the next attribute cannot fit within the remaining space of the word we're looking at, we just add a little extra padding so that the next attribute starts at the next word boundary. In this case, the ID is a 32-bit integer, so we pad it out with 32 bits so that the C_DATE field starts exactly on a word boundary. So now when I want to do the read on C_DATE, I just jump to that word and get everything I need. Is it clear what's going on?

Now, as I said before, 64-bit words are not exactly what real hardware cares about. In actuality, you care about whether your tuple spans cache line boundaries. With 64-byte cache lines, if one tuple were, say, 50 bytes, the next tuple would span that boundary, so you'd put the padding at the end of the tuple so that the next tuple starts after the cache line boundary. And the padding is always the same size, because these are fixed-length tuples — that's another advantage of fixed-length tuples. So when I want to jump from tuple one to tuple two, I just account for this padding to get to the correct location. This is the approach most in-memory databases actually take. We don't do this yet — it's on our to-do list — but any serious in-memory database will do this for you, because the performance benefit you get from it is actually quite significant.
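Here is a sketch of that padding rule, using 64-bit (8-byte) words for illustration; a real system would do the same thing at cache-line (64-byte) granularity, and the names here are mine:

```cpp
#include <cstddef>
#include <vector>

std::vector<size_t> ComputeAlignedOffsets(const std::vector<size_t> &attr_sizes) {
  constexpr size_t WORD = 8;  // bytes per word in this example
  std::vector<size_t> offsets;
  size_t pos = 0;
  for (size_t sz : attr_sizes) {
    size_t remaining = WORD - (pos % WORD);
    if (remaining < WORD && sz > remaining) {
      pos += remaining;  // pad so this attribute starts at the next word
    }
    offsets.push_back(pos);
    pos += sz;
  }
  return offsets;
}

// For the slide's schema {4, 8, 2, 4} (id, cdate, color, zipcode) this
// yields offsets {0, 8, 16, 18}: the 8-byte cdate gets pushed to the next
// word instead of straddling two of them.
```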
Yes? His point is that in addition to caring about padding, you want to make sure that the physical address where the tuple — really, the block — starts is itself word-aligned. Yes, and the hardware, via the allocator, does that for you when you call malloc.

Yes? The question is whether any database systems actually rearrange your data types at the physical layer to reduce this extra padding. An example could be recognizing that the color field here could be packed into the slack space: it wouldn't be stored directly in declared order, but it would use less space. I don't know if anybody actually does this. There's no reason you couldn't — this is the nice thing about SQL and the relational model: there's an abstraction layer between the logical layout and the physical layout, so there's no reason I couldn't do that. I don't think OLTP systems do it, though, because the overhead of transforming the tuple at lookup time might be too expensive for transactions — and you won't have to do this at all once we get to column stores, which we'll see in a second. Yeah, I don't know the answer; it actually seems like it'd be kind of easy to do, so we should look into that. It's a good question.

Another question: what's the point of that last bit of padding at the end — the empty space — is it necessary? That empty space is where the next tuple starts: it's cache line alignment so that the next tuple, for example, won't span two cache lines. And would it be beneficial to keep that last 64 bits? Again, in this example I'm using 64-bit words, so I'm assuming the cache line starts here. In a real system with 64-byte cache lines, yes: if the tuple is not a multiple of 64 bytes, then you may want to start the next one later, and it's just additional padding.

All right, cool. So now we know how to actually store things in memory: we know how to take the data types, put them into memory, and read and write them, and we know how to lay out a tuple in memory. So now we'll go a level higher and talk about how to organize multiple tuples for a table. The three storage models I want to talk about are the NSM, the n-ary storage model; the DSM, the decomposition storage model; and we'll finish off with the hybrid model that we actually use in our database management system.

When you took the intro database class, we made this huge assumption for almost the entire semester that everything was in the n-ary model, without actually telling you what that meant. It's easier to understand databases if you think of them as Excel spreadsheets, where you have rows and the values go across. In the n-ary model, the basic idea is that you always store all the attributes for a single tuple contiguously in memory. This is what you want for OLTP workloads, because the queries in these workloads are doing lookups that grab single tuples at a time, and they usually need all the attributes. Think of logging into your Amazon account: there's some account table or user table, and the system wants to fetch all the attributes for that user and then use them to generate the web page. So storing things in a row-oriented manner is good for OLTP workloads because it's really easy to look up everything for a tuple. It's also really easy to do inserts, because you just take the tuple that was provided by the application and do a single straight write out to memory, without having to break it up and do multiple writes. Yes, you may have to add padding, but it's not like you have to write to this location, then that location, then another — it's a single sweep. The other aspect is that these systems use the tuple-at-a-time model we've talked about before — the Volcano approach from the beginning of the semester — because, again, that makes more sense in an OLTP environment where you grab single tuples.

There are two ways to actually store NSM data. The first is heap-organized tables, and this is what I showed at the beginning: our tuples are collected together into blocks called a heap, and our item pointers — our memory addresses — point to the block location and an offset within it. The heap doesn't define any order for our tuples, so we can insert, delete, and update things any way we want. Now contrast this with index-organized tables, where the actual data for the tuples is stored in the leaf nodes of the index. This is what MySQL actually does.
In MySQL's InnoDB engine, there's a B+tree, and the actual leaf nodes are where you find the real tuples. This is not quite the same as a clustered index, but it's similar. With clustered indexes, you can use them with either the heap or index-organized storage models; if you use index-organized storage, the index essentially just becomes the clustering index. In MySQL you always get this, and the way they do it is that whatever you declare as the primary key for the table ends up being the clustering index for that table. If you don't have a primary key — if you call CREATE TABLE without one — it makes a hidden one for you, a record ID, and uses that as the clustering index. You can't see this from the application, but internally this is what it's doing. Think of this as the same thing as the logical pointers we talked about before in the context of secondary indexes for MVCC: some logical identifier that allows you to identify a single tuple and that doesn't change no matter how often you update the tuple.

Not all database systems can use clustering indexes. In the case of Postgres: because Postgres uses the append-only storage method in its MVCC, any time you modify a tuple, a new version gets appended to the end. It would be very difficult to have a clustering index and maintain the sort order across the entire heap or table, because every single time you updated something you'd have to reshuffle everything. Postgres does have a CLUSTER command where you can force it to sort your table as if you had a clustering index, but it's not a permanent thing — it's a one-shot operation, and as you keep updating the table, things drift out of order again. In the case of systems like Oracle, they expose to you as the DBA whether you want your table to be index-organized or heap-organized, and depending on what you choose, the optimizer knows which optimizations it can apply during query planning based on how it knows the data is laid out. This shouldn't be anything new; this is the standard stuff covered in the intro class.

So with the n-ary storage model, the advantage — regardless of whether you use index-organized or heap storage — is that you can support fast inserts, updates, and deletes, because you only have to modify the data at a single memory location. And typically, if your tuple is smaller than a cache line, you can do a single fetch from memory to bring it into your CPU caches and operate directly on it. So this is really good for queries that need the entire tuple. The downside of NSM is that it's really bad for OLAP queries that have to scan large portions of the entire table but only look at a subset of the attributes. As you scan through the table, if you only need one column out of ten, you're bringing all the other nine columns into your CPU caches, polluting them and wasting space. This is what the DSM model is designed to overcome. The idea with DSM is that we take a single attribute of our table and store the values of that attribute for all tuples contiguously, together in a single block.
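In code, the contrast between the two layouts looks roughly like this, for a hypothetical table (id INT32, val INT64); the names are illustrative:

```cpp
#include <cstdint>
#include <vector>

// NSM / row store: all attributes of one tuple sit together.
struct TupleNSM {
  int32_t id;
  int64_t val;
};
using TableNSM = std::vector<TupleNSM>;  // array of structs

// DSM / column store: all values of one attribute sit together.
struct TableDSM {
  std::vector<int32_t> id;   // every tuple's id, contiguous
  std::vector<int64_t> val;  // every tuple's val, contiguous
  // id[n] and val[n] belong to the same logical tuple n.
};  // struct of arrays
```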
This is sometimes called vertical partitioning — same idea. Instead of having things be row-oriented, we store them in a column-oriented manner. And again, this is really great for OLAP workloads, where you don't have to modify the table much and you just want to read large portions of the data across a subset of the attributes. Contrast this with the NSM model, where we assumed we were doing point queries on single tuples. The example would be: on Amazon, I would use the NSM model for the front-end operations — go get a single order record for a customer, go update payment information for a single person. The DSM model is for OLAP queries like: compute the average sale price of some item over some large segment of the calendar year. In that case I'm not looking at all the attributes I have for the product, just the sale price. We'll talk more about how to optimize query execution on the DSM model later in the semester, but the bottom line is that you can use the vector-at-a-time approach to grab chunks of data at a time, rather than the tuple-at-a-time approach you would use with NSM.

The column store, or DSM, approach is not new; it's actually been around since the 1970s. The first database system that could plausibly claim to be a column store was a thing called Cantor that came out of the Swedish defense ministry in the 1970s — sort of an internal project they used for their own datasets. There are only two papers that describe it, and people typically overlook Cantor as the first column store because when you read the papers, they don't talk in terms of columns or databases — it was really early, in the '70s — but in terms of files: they were doing transformations of files from a row-oriented format to a column-oriented format. But the high-level idea is exactly the same as the DSM approach here. Then in the 1980s there was an academic paper by some folks in Texas where they laid out the principles — the high-level concepts you'd want in a DSM database and the advantages you could get from it. But nobody really tried to implement a DSM database until the 1990s, when Sybase came out with this thing called Sybase IQ. This wasn't a full-fledged column store like we have now; Sybase sold it as an in-memory query accelerator, an in-memory cache, for their regular row-store database. So you'd have Sybase ASE, or whatever it was called back then, as the primary row-store NSM storage for your database, and you could use Sybase IQ to stream values out of the database and store them in the same kind of columnar index we saw in SQL Server in the last class. It wasn't really until the 2000s that we started seeing specialized database systems that were pure column stores, pure DSM architectures. Probably the most famous one was Vertica, which Mike Stonebraker and other people at Brown and MIT created. VectorWise was another famous one, out of the MonetDB project in Europe. There were some others that were actually kind of crappy, like Aster Data; Greenplum is still around, that one's okay.
But Vertica, in my opinion, was the most state-of-the-art one, and it really was groundbreaking back then. VectorWise is still around. HP bought Vertica, and then in HP's wisdom they sold Vertica to somebody else. VectorWise got bought by Actian — the rebranding of the original Ingres company — and they renamed it Vector. And then unfortunately they've tried to kill it, which sucks: if you Google VectorWise, it takes you to a page that you think is going to be the Vector page where you can download it, but then it redirects you to the homepage. That's kind of terrible, because VectorWise is actually really, really good. So, in my opinion, from the 2000s Vertica was really the best system.

The reason this was significant — and why it took until the 2000s for all these column stores to take off — is that before the internet started getting really big, before applications could have a lot of users generating a lot of data, most people didn't really have big databases. Maybe a few of the Fortune 50 or Fortune 100 companies did, but most people didn't have really, really large datasets. And those that did had a lot of money and could pay Oracle or Teradata to ship them a crate of machines to do all the computation they wanted to do. Stonebraker got the idea for building Vertica after he spent some time working at Walmart Labs and saw how they were struggling with their Teradata installation; he came back to MIT and created Vertica.

So after this, everyone realized that column stores — the DSM model — were the right approach, and pretty much everyone today has their own column store, or a column store add-on. Oracle, IBM, and Microsoft all sell some kind of column store extension to their database systems: Oracle has the fractured mirror stuff we'll talk about in a second, you read about SQL Server's columnar indexes last class, and IBM has their DB2 BLU accelerator. And then there are also the specialized systems — well, I guess Cloudera's is the only one that's open source — but there are now other databases rebuilding on the ideas that were worked out in those early systems. Cloudera Impala is a column store database that runs on top of HDFS, and we're actually going to have the creator of Impala, who is also a student of Mike Stonebreaker, give a guest lecture at the last class at the end of the semester. Amazon Redshift is a rebranded version of ParAccel, and they're now rewriting it as a brand-new architecture, headed up by a CMU alum. SAP HANA — I'll talk a little more about it in a second — is the thing SAP sells now instead of trying to sell you Oracle. And MemSQL, which we've talked about a bit, is an in-memory hybrid system that can support transactions and analytics in the same box, and they now have a column store extension as well. So the ideas behind column stores are now much more common than they were maybe ten years ago, and any time you want to do analytics, it's clear that you want to use them.
All right. With column stores, you wouldn't necessarily have a clustered index like you would with a B+tree in the NSM model. What you normally do instead is just pre-sort the data in your columns, because that allows you to get better compression. In the case of Vertica: Vertica doesn't allow you to create any indexes at all. You specify, for segments of data — the column projections — how you want them to be sorted, and the idea is that if you choose a good sort ordering, you'll get better compression, as we saw before with run-length encoding and other techniques. The downside is that any time you update, you have to fix up the sort ordering, and they have a way to do that, which I'll talk about in a second.

The last thing we need to care about in a column store, in the DSM model, is how we actually identify tuples. There are basically two approaches. The first is to always have fixed-length offsets in every single column, so that you know how to jump to a particular position in any column to find a particular tuple. In this case, since the data for every single tuple in a column is always the same size, if I want the value at position two, I multiply the size of each value by the offset I want to jump to the right location — and that works across every single column. The other approach is to embed the ID of the tuple inside every single value: if I want position two, I scan and find the entry tagged with position two. You'd have to do this if you had variable-length fields, because then you can't jump to an arbitrary location and know you've landed in the right spot. As far as I know, nobody actually does the embedded-ID approach: when you read the papers, the column store literature talks about how you could do it, but in practice I think everyone uses fixed-length offsets. The trade-off, which we'll discuss when we talk about compression next class, is that with fixed-length offsets you won't get as good a compression ratio as you could with variable-length values, because you may have to pad things to keep them aligned.

So the advantages of DSM compared to NSM: we reduce the amount of wasted work in the system when we process queries, because we only read the data we actually need. If we have four columns in our table but the query only needs to access two of them, we read just those two columns. We also get better compression — which we touched on last class and will cover more next week — because all the values within a column in a block of memory are of the same type, and they're all roughly in the same range. And of course the downside is that you don't want to run transactions on a DSM table, on DSM storage, because every single insert forces you to break up the tuple and write it to different memory locations, and any time you need to do a SELECT * and put all the attributes back in the final output, you have to stitch the different memory locations back together, which is additional memory copying.
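To make the fixed-length-offset idea from a moment ago concrete, here's a minimal sketch; because every value in a column has the same size, tuple n's value in any column lives at the same slot index, and no per-value tuple ID is needed:

```cpp
#include <cstddef>

template <typename T>
T ReadColumnValue(const T *column_base, size_t n) {
  // Equivalent to *(column_base + n), i.e. the bytes at
  // column_base + n * sizeof(T).
  return column_base[n];
}
```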
One observation I want to make is that whether people use DSM versus NSM is typically based on how long ago the data was modified or inserted. Typically, when a tuple is first inserted into the database, it's more likely to be updated again in the near future; then over time, as it ages, it's less and less likely to be updated or used in a transaction. Think of how you use websites: I go to Reddit or Hacker News, I look at the top posts, I go in and add a comment. Maybe I go back and tweak my comment to fix a spelling mistake, or maybe I upvote things. Those are transactions updating data that was just recently inserted. You almost never go back to a post from two years ago and try to add a new comment that nobody is ever going to read. So it's very typical in database applications that the hot data — the data your transactions touch — is the stuff that was recently inserted or modified, and over time it becomes colder. And when it becomes colder, that's when you want to do analytics on it, to try to extrapolate new knowledge and information from the data you've collected. That's the decision support stuff we talked about last class. But we want to be able to combine the hot data with the cold data when we run our analytical queries, so that we're using the freshest data we have and aren't a couple of days behind in our decision making.

So I showed this diagram last class of the bifurcated environment: we have our OLTP silos, where all our new data and updates come in, and then we stream the changes into an ETL component, which feeds our data warehouse. But this is not actually a one-way street; there's another path going back the other way. You do your decision support processes or analytics here to learn what your customers and users are actually doing, and then you compute new answers that you feed back into the front end to help it make different decisions.

The example I always like to give: assume you're Zynga, or whatever the hot gaming company is now. All the new updates come in as people play your game. You load them into your data warehouse, and you do some kind of machine learning or analytics to figure out what you should do to get people to play more and buy more stuff; then based on what you find, you have the front end tweak how it presents the game. There's one example — it's Candy Crush, right, the hit-the-little-button thing — where they watch how you play the game. If they give you a hard puzzle and you can't figure it out, you end up putting the game down because you get frustrated. So they see all these updates coming in, they see that this person played a game and stopped playing because it was too hard, and then they compute a model that says: when this person comes back three days later, make sure you give them an easy game, because if they get really frustrated the second time, they'll say screw this and never play again.
But if you give them an easy game, that gets them addicted. So there's this feedback loop mechanism: we're going back and forth, getting new updates, doing analytics, and pushing new changes to the front-end application to change its behavior. The downside is that this whole pipeline is actually slow. In some cases, people don't move the hot data from the OLTP side to the OLAP side until the end of every day, or maybe every hour, and depending on what kind of application you're trying to support, that may be too slow. And again, it's also not free to push the results back to the front end.

So this is what the hybrid storage model that you read about is trying to solve. The idea is to have a single logical database instance that uses both the NSM and DSM storage models to support both hot and cold data: transactions run on hot data stored in NSM, and OLAP queries run over both the hot data and the cold data, all within one single database. I'm pointing out here that I'm saying single logical database and not single physical database, because for the paper you read — and in our current system — we only run on a single node, but other systems like MemSQL or Splice Machine are distributed. To the application it looks like a single logical database, but underneath the covers there may be multiple physical ones. Again, we store all new data in NSM when it first arrives in the system, because that's fast — that's what you need for OLTP — and over time we migrate it to DSM once it's unlikely we'll ever have to go back and update it. We do this because we get better compression with DSM than with NSM, so we can store much more of our application's history in the database in a compressed format.

There are a couple of different ways to do this. The first is to maintain separate execution engines inside your database system: one engine optimized for NSM and one optimized for DSM. This is actually probably the most common approach. It's what SAP HANA does: SAP basically bought a bunch of different database startups — P*TIME, TREX, MaxDB — and mashed them all together into this Frankenstein thing they called SAP HANA. When transactions show up, they first go into P*TIME, and over time the data migrates to TREX; and now you can hook Sybase IQ up at the bottom as well. So there are separate engines written explicitly to operate on data in each of these two storage formats, and then you need an upper-level part of the system that knows how to take a query, decide which part runs on the NSM side and which part runs on the DSM side, and then combine the results to produce a final answer. This is what SnappyData does: SnappyData takes GemFire and Spark. Splice Machine does this: they take HBase and Spark. This is probably the most common approach — you take legacy database systems, write a little middleware layer on top, and combine them together. So again, it looks like a single logical database, but physically it's multiple ones.
Another approach is to have a single flexible architecture that is able to operate on both NSM and DSM data directly, without having to do any kind of coordination between two separate engines. This is the approach that we actually use, and as far as we know, this is what MemSQL does as well. So if you have separate engines, as I just said, you're basically running two separate internal database management systems inside of your single system: one operates on NSM, one operates on DSM. And then you basically have to use two-phase commit if a transaction needs to modify data in both engines, because you need a way to know that your changes are durable on both of them. So two ways to do the separate-engine approach are the fractured mirror approach, which I think we talked about before, and the delta store approach from HANA, SnappyData, and Splice Machine. With fractured mirrors, the basic idea is that the database system maintains a complete separate copy of the database in memory, but in a DSM layout. So the primary storage would be the row store, and the column store would be the mirror. When a transaction comes along with new updates, those updates always hit the NSM side first, and then there's a background process that migrates them over and inserts them into the column store. Now when an OLAP query comes along, the database system has to figure out: can I operate directly on the DSM mirror, or do I need to get data from both sides and combine it? This sounds kind of wasteful, and in the case of Oracle, this DSM mirror is actually not durable. It's just a memory cache. So if you kill the system and come back, it has to rebuild the whole mirror all over again. It's an interesting approach, though, because it has minimal impact on the rest of the system, right? To make the mirror durable, you would have to make major changes to the rest of the system to support the column store side. But if you treat the mirror as an ephemeral cache, you don't have to change any of that. So if a customer doesn't need to do OLAP queries and doesn't need the mirror, nothing changes. And if they do give Larry Ellison some more money and they do get the mirror, they don't have to change any of their application, because this is completely transparent to the upper-level parts of the system, right? It's up to the optimizer and the query planner to figure out what part to run here and what part to run there. This is essentially the same thing you saw in the paper you read on the columnar indexes from SQL Server, right? They left the NSM side alone and made in-memory mirrors of the indexes. Now, in that early version of the SQL Server paper, adding the columnar indexes made that side read-only, but in later versions they allowed you to modify both. So again, this is what Oracle does, this is what IBM does. This is pretty common. It originally came from a paper written at Wisconsin in 2002, and everybody basically follows a very similar model. Yes? Question: why would the OLAP query span both of the databases, since the DSM should have an entire copy of the data? Her question is: why would the OLAP query potentially have to span both sides? Because this migration is done lazily.
So there may be data you just inserted that hasn't been migrated over yet but that you want incorporated in your query, so you have to go back and do the read on the NSM side as well. So that would just be a small percentage? Right, a small percentage, but it's not nothing. So there's extra bookkeeping that I'm not talking about, where you have to know what portion of the database on the NSM side has been migrated over to the DSM side, so that you know you can skip those things when you run the query over here. Right, yeah, so I'm being very hand-wavy, but there's a whole bunch of extra smarts going on. There's a whole bunch of intelligence needed to know that my query does have to touch data over here, but that I don't have to read everything all over again; I only read the few pages that I actually need. This is part of the reason why, in the SQL Server case, they made the table read-only: at that point they didn't have the functionality to span both sides. When it was read-only, you knew everything was always in the in-memory columnar indexes, so you just fired your query and executed everything there. So there's extra bookkeeping going on here. All right, the more common approach is the delta store. The idea here is that all our updates first land in this NSM storage space, and then over time, in the background, we migrate them to the DSM side, right? This is different from fractured mirrors because, for any given tuple, it's either in the delta store or in the column store. It's never in both. So I don't need any extra bookkeeping to say: all right, I read something here, now when I run my query over there, make sure I don't read the same thing twice. If you can read it here, then you know it's not over there yet, right? And this is what pretty much everyone does. This is essentially what the HANA guys do, what MemSQL does, what Vertica does. When I said before that Splice Machine was HBase plus Spark, it's basically the same thing: think of HBase as the delta store side and Spark as the DSM side, right? And again, the reason you would do this is that your hot data is usually very small, so you can store it in the row storage format here, which is better for transactions, and then as it gets cold, you migrate it over to the column store side and do your analytics there. Yes? Isn't this just basically the same as ETL, though? Yeah, so his statement is: isn't this the same thing as ETL? But now it's a single database instance rather than separate instances. Absolutely, yes. But the difference is how you actually decide how to move things, right? In the ETL case, you as the application programmer had to put in a manual process to migrate things over, and that's essentially the first approach here: you have to decide what should be in DSM, what should be in NSM, and how to move things back and forth. Another approach is to look at the logs of how queries access data, and from that you can specify what data to move. I think this is essentially what HANA does, right? In HANA, the system will figure out what it should move, but then you as the administrator still have to tell it: go ahead and start moving things.
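Here's a minimal C++ sketch of the delta store idea, under the assumption of a fixed four-column table of integers. Everything here (DeltaStore, Migrate, SumColumn) is hypothetical and drastically simplified compared to what HANA or MemSQL actually do; the point is just that a tuple lives in exactly one of the two stores, so a scan can union both sides with no dedup bookkeeping.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical, simplified: a 4-column table of 64-bit integers.
constexpr int kNumCols = 4;
using Row = std::array<int64_t, kNumCols>;

struct DeltaStore {                 // hot side: row (NSM) layout
  std::vector<Row> rows;
};
struct ColumnStore {                // cold side: column (DSM) layout
  std::array<std::vector<int64_t>, kNumCols> cols;
};

// All writes land in the delta store first.
void Insert(DeltaStore &hot, const Row &r) { hot.rows.push_back(r); }

// Background task: migrate rows that have gone cold. Here we just move
// everything; a real system would pick victims by age or access stats.
void Migrate(DeltaStore &hot, ColumnStore &cold) {
  for (const Row &r : hot.rows)
    for (int c = 0; c < kNumCols; ++c) cold.cols[c].push_back(r[c]);
  hot.rows.clear();  // each tuple now exists only on the cold side
}

// An OLAP scan over one column reads the cold side sequentially, then
// tacks on the (usually small) hot side. No tuple is counted twice.
int64_t SumColumn(const DeltaStore &hot, const ColumnStore &cold, int c) {
  int64_t sum = 0;
  for (int64_t v : cold.cols[c]) sum += v;  // cache-friendly column scan
  for (const Row &r : hot.rows) sum += r[c];
  return sum;
}
```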
The last approach is the online approach, which is what MemSQL does and what we do: we track the access patterns of tuples at runtime, and then, when we have idle cycles or when we know we're running out of space, we can make a decision about how to move things from the NSM side to the DSM side, right? I think this is what we talk about doing in the paper you guys read. Right, so in the final few minutes, we'll talk about what we're doing in Peloton. So in Peloton, the difference from all these other systems is that we have a single execution engine that knows how to operate directly on both NSM data and DSM data, without needing two separate systems to figure these things out. This means we don't need to store two copies of the database, and we don't need to sync different portions of the database any time a transaction updates either the NSM side or the DSM side, right? With this, we're essentially still doing the same delta store approach I talked about before, but we no longer need one engine for the delta store and another engine for the column store. It's a single architecture that can handle everything. So to give you an idea of how it works, say we have two queries here: an OLTP query that's gonna do an update, and an aggregate query that's gonna do a scan across a large segment of the table. The first thing you see is that the OLTP query accesses all four columns, but the analytical query only accesses B and C. So the database system would recognize that here's the portion of the data that is cold and not being updated anymore, and here's the portion of the data that is hot and getting updated a lot. So it would create different layouts for the same table, where the hot data is stored in the NSM format and the cold data is stored in the column store format. In this case, it's packing B and C together into a single tile rather than breaking them into individual columns, because B and C are accessed together here. Depending on the selectivity of the predicate on C, it may make sense to keep B and C together, because now it's a single cache miss to go fetch the data you need, and you can evaluate the predicate directly on that, right? Yes? When you say hot data, you mean data that's updated often, but even if some data is just read often, is it regarded as hot data? So the question is: I'm defining hot to be data that's updated often, but could data still be hot if it's read often? It depends on how it's being read, right? If it's a point query doing a select star to grab a single tuple with all its attributes, then yeah, that would still be considered hot. If it's being read in a lot of analytical queries, you would want to put it in DSM, because then you're not doing point queries to grab a single tuple, you're scanning segments, right? Okay, so real quickly: the way we're gonna make this all work with a single execution engine is that we need an abstraction layer that hides the actual physical layout of data and tuples in memory from the query operators that we use to execute queries and produce answers.
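Before going through the tile example, here's a rough C++ sketch of what that kind of indirection can look like. The names Tile and LogicalTile mirror Peloton's terminology, but the fields and methods here are my simplification, not the real implementation: a logical tile is just a position list plus references into physical tiles, so an operator can read an attribute without knowing whether it lives in a row-oriented or column-oriented tile.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A physical tile stores some subset of columns, row-major. With one
// column it degenerates to a pure DSM column; with many, it's NSM-ish.
struct Tile {
  size_t num_cols;
  std::vector<int64_t> data;  // row-major values
  int64_t At(size_t row, size_t col) const {
    return data[row * num_cols + col];
  }
};

// A logical tile: per-attribute references into physical tiles, plus a
// position list of the rows that survived earlier operators' predicates.
struct LogicalTile {
  struct ColumnRef { const Tile *tile; size_t col_in_tile; };
  std::vector<ColumnRef> columns;  // one entry per attribute needed
  std::vector<size_t> positions;   // qualifying row offsets

  int64_t Get(size_t row_idx, size_t attr) const {
    const ColumnRef &ref = columns[attr];
    return ref.tile->At(positions[row_idx], ref.col_in_tile);
  }
};

// An aggregate operator consumes the logical tile the same way whether
// the attribute came from an NSM tile or a DSM tile.
int64_t SumAttribute(const LogicalTile &lt, size_t attr) {
  int64_t sum = 0;
  for (size_t i = 0; i < lt.positions.size(); ++i) sum += lt.Get(i, attr);
  return sum;
}
```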
So in this example here, this is the same table I had before, with columns A, B, C, and D. What I'm gonna do is divide the table into two sections called tile groups, where tile group A has a single tile with everything stored in the NSM format, and tile group B has three tiles: A stored by itself, B and C stored together, and then the third tile is just D. In addition to this, we're also gonna have a tile group header, sort of a separate column packed in alongside, and this is where we store the versioning information you need for MVCC and the other stuff, right? So that's implicitly stored along with the tiles. Now take the aggregate query I had before, and say I wanna generate the relational query plan for it. What's gonna happen is that instead of sending whole materialized tuples or vectors from one operator to the next, I'm instead gonna have these logical tiles, which are essentially lookup tables or position lists that map a tuple at some position to the memory location where we have the actual attribute. So in the case of this first tile group, I have my pointers out to where B and C are located, and I don't care whether they were stored in NSM or DSM. So when I pass this logical tile up the query plan to the aggregate operator, it can just do the lookup and go directly to the data it actually needs, without having to copy things into a uniform format. Because otherwise, I'd either have to copy the two attributes I need out of the tuple and convert them into the column format, or convert things the other way around, in order to have a single code path that can execute on both. All right, so when you start looking at the operator code inside H-Store, or not H-Store, sorry, inside of Peloton, you'll see these logical tiles, and they're essentially position lists being passed around from one operator to the next. Question: does this mean it would work better if your clustering index or primary key were, for example, a timestamp, so that hot data ends up close to other hot data? So his question is: would this work better if you were sorted on the primary key, which would be a timestamp? In other words, does this work better when hot data is stored together in memory with other hot data, and cold data is stored with other cold data? Yeah, so what happens is, when we have these tile groups, we're keeping track of them, so how do we know what data is actually stored in each one? Since we're using MVCC, any time we update a tuple, the new version ends up going into the NSM tile group. So all the data within a tile group has essentially the same hotness or coldness, because tuples are migrated from one layout to the next a tile group at a time, right? And any time you update a tuple and it becomes hot again, it moves out of the cold tile group and into the hot tile group. So implicitly, all the hot data is stored together in a tile group, and all the cold data is stored together in a tile group. But on the other hand, couldn't you have tuples that are stored in a very bad order? Well, if all the tuples are stored in a very bad order, what's a bad order?
Like cold, then hot, then hot, then very cold. But that can't happen, right? Because within a tile group, everything has the same layout. So tile group A, I'm showing two tuples here, but it could be a thousand tuples. They're all gonna be hot because they were all inserted or updated recently, and they're all gonna be in the NSM format. Then as things get cold, the background process migrates them to the DSM layout, so hot and cold are not intermixed. Now, if a cold tuple becomes hot again, we're using MVCC, so we would append the new version into the hot tile group, okay? Now, we could talk about compaction and other things. Eventually, if tuples get deemed cold and then get updated and moved back into the hot tile group, we're gonna have a bunch of cold tile groups with holes in them, because the way we do the migration now is to take a whole tile group and convert it from one format to the next, and we don't insert other tuples back into those holes. So you may wanna compact two tile groups together to save space. Assuming that there's no clustering? Assuming there's no clustering, correct, yes. What I'm describing would be the append-only method from MVCC, right? Yeah, you'd have to do more work if you had in-place updates or master versions. That's a good point. All right, we don't have much time left, so I'll skip ahead to the demo, or the example. We want to demonstrate how using an adaptive layout, a hybrid storage layout where you have both rows and columns together in a single system, gives you a benefit over a single static architecture. So in this case, we're gonna have a workload that models an application with different day and night phases. During the day, we're doing all the OLTP operations, because we're inserting new information. Think of trading at a financial company: you have all these inserts for all the new trades in the stock market. And then at night, you switch over to do risk analysis, so now you're doing analytical queries that scan a lot of data. So we're gonna compare, all inside of Peloton, this workload running on a pure row store (NSM) layout, a pure column store, and then our adaptive hybrid layout. With the row store, what you see is that the scan queries obviously take more time, because you're reading a lot of data, and then when you switch over to the inserts, those are much faster. But you see this sort of step function where the scans get slower and slower over time, because the database is getting bigger and you're scanning more data. Then we compare this with the column store. The column store is faster for scans but slower for inserts, because again, you have to take your single tuple and do multiple writes to different column locations in memory. Now compare both with our hybrid layout. What you see is that when we first start off, we're just as fast as the row store layout, because when we first loaded the database, we didn't know anything about the workload, so everything defaults to the NSM layout. But then over time, as we start doing analytical queries, we recognize that we should be migrating the data to a column store.
So that's why you see this little tick, this downward trend where we get faster and faster as we start moving the data into a column-oriented format. Then when we switch over to do inserts, we're just as fast as the row store, because we're just inserting into the regular row store portion. It's overlapping here, but you can see that the red line overlaps with the row store line. Then we switch back to the scans, and there's a little uptick where we have to migrate the data we just inserted in row store format over to the column store. But over time, that migration is less and less work compared to the amount of data we have to read, so we end up getting better performance than both the row store and the column store here. This is showing you that within a single execution engine, we get the benefits of a row store and a column store without the user having to do anything special, because we can migrate the data from one layout to the next transparently, and we can adapt the layout based on how the queries are actually using the data. Okay, yes? Why is it even faster than the column store? The question is: why is it even faster than the column store? Because in the pure column store, everything is stored strictly as separate columns. In our case, it's the example I showed before, where the two columns get packed together in a single cache line. Again, that only works when the selectivity is very high, and I think in this example it is. Okay, so there's another optional paper on a system called H2O, and this is a precursor to what we were doing here. The bottom line is that they could not support hybrid layouts for both transactions and analytics. Actually, I shouldn't even show this, because they can't even do updates. Basically, they were doing something like fractured mirrors on the fly. You would have the original data, and then you would look at what the query you're executing wants to do, and as you scan the data, you would transform it into the layout that's more optimal for that particular query. So if you came back and executed the same query again, maybe with different parameters, the data would already be laid out for you in the format that you want, right? In my opinion, this is wasteful, because you're doing the transformation on the fly as you execute the query, and they can't support updates; it's a read-only approach. This came out a year or two before our paper, from the guys at Harvard and MonetDB. All right, so to finish up. It's my opinion that the flexible architecture I showed you here is the better approach. Hybrid architectures are certainly where everyone's going; this is the major trend in database systems today, right? You see all these HTAP systems that talk about supporting analytics and transactions in a single logical database, and they're essentially doing what I'm describing here: they have both the row store and the column store represented together and presented to the user as a single database. So it's my opinion that if you have the hybrid layout, this is gonna allow a relational database system to support all the database workloads that we know about. Document stores, row stores, column stores, key-value stores, graph stores: all of these things can be represented in a relational database system and get performance just as good as the specialized systems.
The only thing that we're not gonna be able to support is machine learning workloads, because that's doing linear algebra on matrices, and it's difficult to represent matrices in a relational model. So again, with the exception of machine learning, we can do everything you'd wanna do in a database in a hybrid store. Okay? All right, so next class we're gonna talk about compression.