Cool, okay, let's get started. Today's class is about database storage, part two, a continuation of last week's lecture where we talked about the low-level implementation details of storing data in the database management system.

Before we get started, a couple of administrative things to check off. Project number one is going to be released today. I'm not going to talk about it in today's lecture, because the piece of the system you'll be implementing, the buffer pool, is what we're going to discuss in the next lecture on Wednesday. But it should be released by the end of the day today if you want to take a look. Second, project number zero was due last night at 11:59 p.m. As we made explicit in the first lecture, the last two lectures, and on the website, if you didn't or weren't able to complete the project by the deadline with a score of 100%, you're not going to be able to continue in the course. One other thing: the database tech talks start today, this afternoon right after class. We're going to have different people come every week, all on Zoom, so you can just join the Zoom call; all the information for how to join is on the website. They're not marketing talks, they're really low-level tech talks, so you can see how some of the topics and concepts we cover in the course are applied in real-world systems. If you're interested, feel free to join.

Recall from last lecture that the focus of the course is this disk-oriented architecture. We assume the DBMS has its primary storage on some kind of non-volatile disk; that could be a regular hard disk, it could be an SSD, but the key idea is that the storage medium that is the primary location where the database lives is non-volatile. If we remove power, we keep all of the data in the database; we don't lose it. Whereas if the data were only in memory or the CPU cache and we lost power, all our data would be gone. And it's the DBMS, rather than the OS, that handles all of the data transfer between non-volatile and volatile storage; we talked last time about why it's important that the DBMS manage this process. So from disk to memory and back again, whenever we want to persist our data.

To recap one more topic from last time: slotted pages are a common page layout scheme in database systems where we map individual tuples in a page to slots. The slot array along the top holds the offsets of the tuples, and the fixed- and variable-length tuples at the bottom of the page are what those slot entries point to. Having pointers in the slot array lets us fill up pages with variable-size tuples without having to compute the location of individual tuples inside a page. The slot array and the tuple data grow towards each other from the two ends of the page and meet in the middle.
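To make that concrete, here is a minimal C sketch of a slotted page. The field names and widths are invented for illustration; they are not taken from any particular system's page format.

    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096

    /* Slot array grows down from the header; tuple data grows up from
     * the end of the page; free space is whatever is left in between. */
    typedef struct {
        uint16_t num_slots;   /* entries in the slot array */
        uint16_t data_start;  /* offset where the tuple data region begins */
        uint16_t slots[];     /* slots[i] = byte offset of tuple i's data */
    } PageHeader;

    /* Fetch tuple i by indirecting through the slot array. */
    static inline const uint8_t *get_tuple(const uint8_t *page, uint16_t i) {
        const PageHeader *hdr = (const PageHeader *)page;
        if (i >= hdr->num_slots) return NULL;  /* no such slot */
        return page + hdr->slots[i];           /* jump to the tuple's offset */
    }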
An alternative storage approach is what we call log-structured file organization. Rather than storing the tuple data directly in the pages, the DBMS only stores log records: logs of what data we're inserting and how we're updating the data in the system. Every time we insert a new tuple, we store that as a log record in a page. If we want to delete a tuple, we just append a record to the page marking that tuple as deleted. And if we want to update something, we append a log record that specifies which data values we updated.

What this looks like is, again, we have a page, and we just go through and start appending new entries as we make changes to our data. You'll see here we insert a tuple with ID 1 and value a, we insert another tuple with ID 2 and value b, we delete the tuple with ID 4, and so on. Rather than storing the actual values, we're just storing log records that describe the changes we're making to the tuples in the database.

So can anyone tell me why we might prefer this storage layout over the slotted-page layout we talked about? What advantage would this sort of storage layout have? ... That is correct. You get really fast writes with this, because rather than having to go and update individual tuples in a page, you can just append log records to the end. It's also great for disk I/O, because if you recall from last class, we talked about the advantages of sequential reads and writes versus random I/O. Since we're always appending to the end of the page, you get much better sequential performance.

Of course, if we get fast writes, then on the other side of things reads become a little trickier. To read a record, we essentially have to scan the log backwards in order to recreate the tuple we want to query. For example, to reconstruct tuple ID 2, we start reading the log backwards and say, okay, here we updated the value to y, then we go back further and pick up the other attribute values for that tuple from earlier log records. So we do this reverse scan over the log, applying the log records along the way, to reconstruct the tuple. Of course, we can apply different optimizations: for instance, we could build an index that lets us jump into the log and find just the log records that apply to each tuple, so we don't have to do a full scan over all of the log records. If we want all the log records that apply to ID 2, we can go pick out those individual ones. That's okay, but now you have to maintain this additional data structure that holds the offsets of the exact log records for each particular tuple.

Yes? I'm sorry, could you repeat the question a little louder? So the question is: why is the scan through the page backwards and not forwards? I guess you could implement it either way; it doesn't really matter. I think there are certain conditions where it may be more efficient to implement it one way versus the other.
For example, if you're scanning backwards, as soon as you hit the insert, you know that you're done; you don't have to scan any further back. Whereas if you're scanning forwards, you would have to scan all the way to the end of the file, because you don't know whether there are other changes or update log records coming later. But scanning backwards, as soon as you see the insert, or perhaps even as soon as you see a delete, you can abort your scan early; you don't have to continue all the way to the beginning. Does that make sense?

Yes? So the question is: if the database is replicated on several machines in a distributed setting, would a log-structured file organization be more advantageous than the slotted-page layout? I think it depends on a lot of factors. Probably the biggest reason you would choose a log-structured layout is the read-versus-write ratio of your application; I can't think of anything off the top of my head where you would prefer it specifically for a distributed setting. Maybe it's a little easier to have something like eventual consistency, where your distributed nodes aren't necessarily all updated simultaneously and updates trickle out to replica nodes over time; this might be a little better, because to apply an update you just append a new record rather than trying to lock things across a distributed setting. But I don't think there would be any inherent advantage in a distributed setting.

So you can build indexes that help you navigate these log pages. Another option is to periodically compact the log. Again, we have all of these log records building up, and we have to replay them in order to reconstruct individual tuples. If we perform a compaction, we can get something that looks like this: we squash down all of the log records for each tuple into single values. Now we don't have to reconstruct tuple ID 2; we can just go look it up in the page. We can perform this periodically: lock the page, make sure no one can do anything concurrently, perform the compaction, and then release the lock. It simplifies things a little: we don't have to scan through the whole page again, we can just look at the individual tuple values.

There are a whole bunch of systems that take this approach instead of the slotted-page approach we talked about, and it's become a lot more popular in the last 10 years or so; a lot of these systems are more recent. You may know LevelDB, which is from Google. RocksDB is from Facebook: they took LevelDB, and the first thing they did was rip out mmap, because, as we talked a little bit about last time, mmap has some problems. So RocksDB is a fork of LevelDB that Facebook is currently using; they had a huge MySQL deployment, and the idea was to replace the default storage engine for MySQL with RocksDB. Again, this has really taken off in the last 10 years, but the idea isn't new; there were papers published 20 or 30 years ago detailing these techniques.
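Before going further into compaction, here is a hedged C sketch of the log-record idea from above. The record layout is invented for illustration (real systems like RocksDB store key-value entries in sorted files, not structs like these), and the tuple carries a single attribute so that the newest matching record is always authoritative, which is what lets the backward scan stop early.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef enum { REC_INSERT, REC_UPDATE, REC_DELETE } RecType;

    typedef struct {
        RecType  type;
        uint64_t tuple_id;
        int32_t  value;  /* one attribute, to keep the sketch small */
    } LogRecord;

    /* Scan the log backwards: the newest record for a tuple wins.
     * Returns true and sets *out if the tuple currently exists. */
    bool read_tuple(const LogRecord *log, size_t n, uint64_t id, int32_t *out) {
        for (size_t i = n; i-- > 0; ) {
            if (log[i].tuple_id != id) continue;
            if (log[i].type == REC_DELETE) return false;  /* tuple was deleted */
            *out = log[i].value;  /* insert or update carries the value */
            return true;          /* stop early: older records are superseded */
        }
        return false;  /* never inserted */
    }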
To go a little further into detail about how the compaction actually works: basically, as I said, we want to coalesce or squash these log files together into compacted files by removing unnecessary records. For example, if you have a bunch of updates to the same tuple, you can squash those down into one record that specifies the tuple's current value.

The way this usually works is leveled compaction, where level zero is the top level. We have these sorted log files building up, and we accumulate pages over time. What we'd like to do is perform a periodic compaction: perhaps when two of these files fill up, we squash them down into a larger file in the next level. We take those two files, replay them to perform the compaction, and get a compacted, larger, sorted log file that's now in level one. More log files keep building up in the top level, and we keep repeating this process, compacting the files into single larger files, until eventually we get down to some bottom level where everything is fully compacted. You can think of it like a merge sort, where you have different runs that were sorted independently, and now you want to merge them together into one larger run that's all sorted together. So that's how level-based compaction works. There's an alternative called universal compaction, where we can just take any two adjacent files and merge them together. Rather than working in levels, where the top level's files fill up and get merged downwards, we just start merging adjacent files; there's no level concept here. This was actually created for RocksDB by Facebook.

So that's it for page-based storage: we looked at the slotted-page approach and the log-structured approach. Now we're going to move on to the next level down, which we started talking a little bit about last time: data representation at the level of the tuple. If we have an individual tuple inside a page, how do we store it? How do we interpret the data that's stored inside of it? And how do the different layers of the DBMS extract the data they need from the tuple storage? That's the first piece. The next piece we're going to talk about is system catalogs: how does the DBMS as a whole store information about the data that's stored in tables? If you want to know, for example, that a particular table, like the student table, has an ID, a name, an email address, et cetera, how is that stored within the system? And finally, we're going to wrap up by talking about a few different high-level storage model alternatives.

Again, I showed this slide in the last lecture. At a high level, a tuple is essentially just a sequence of bytes: a series of bytes stored inside a page that has some data in it. The various pieces of the storage manager tell us how to extract individual data values from that sequence of bytes. The DBMS figures out how to interpret these bytes and give us values back.
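As a toy example of that interpretation step, suppose the catalog says a tuple is an int32 id followed by an int32 age. This layout is hypothetical, just to show that the bytes mean nothing until a schema is applied to them.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical schema from the catalog: [int32 id][int32 age]. */
    typedef struct { int32_t id; int32_t age; } DecodedTuple;

    DecodedTuple decode(const uint8_t *bytes) {
        DecodedTuple t;
        memcpy(&t.id,  bytes,     sizeof(int32_t));  /* attribute at offset 0 */
        memcpy(&t.age, bytes + 4, sizeof(int32_t));  /* attribute at offset 4 */
        return t;
    }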
So the catalog, as I mentioned, contains the high-level schema, which is the information about the tables; it tells us exactly how the bytes are laid out inside a tuple. The different data types we have to worry about are probably these most common ones. There may be some others, and of course there can be user-defined types, but this is pretty much what you'd expect to be working with.

First, integers. There are different integer sizes described in the standard, based on the range of distinct values they should be able to support: integer, bigint, smallint, tinyint. These are basically stored in their C or C++ representation. If you have an integer type that only needs 256 distinct values, you can store it in a single byte; if you need a larger range, you may need a four-byte or eight-byte value. So these map more or less directly from SQL data types to C or C++ data types. Floating point values are a little trickier, and we'll talk about those in more detail, but essentially these are things you'd store in a 32-bit or 64-bit float. Numeric and decimal types are special fixed-point decimal types, which we'll cover in a moment. Varchar, varbinary, and text are all basically variations on strings. Usually they're stored as a header that specifies the length of the string, followed by the series of bytes that represent the character data. I suppose you could also store them null-terminated, like in C or C++, but most systems, I think, implement the length header followed by some number of bytes. And finally time, date, and timestamp, which I think were pretty important in the homework. You can think of these as a 32-bit or 64-bit integer storing the number of seconds or microseconds since the UNIX epoch, January 1, 1970, and you can convert back and forth between that integer value and a date.

The trickiest ones, as I mentioned, are the numeric types: arbitrary precision numbers. They're the most difficult to deal with, so we're going to talk about some specialized ways that systems tackle them. The big problem with variable-precision numbers is that they're inexact and difficult to represent in computer architecture. As you may know, the float and double data types in C or C++ are approximate representations of decimal numbers; the IEEE 754 standard specifies how they're stored. They're typically much faster than the arbitrary-precision numbers, the decimal or numeric types, that we implement inside the database management system, but they can have rounding errors, and that's a big problem. If you're just collecting, I don't know, temperature data about my office or something, a little imprecision might be okay; I probably don't need finer-grained values than whole degrees. But if you're storing the balance of a bank account, you might care whether there are rounding errors.
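To make the problem concrete, here is a small C program along the lines of the one on the slide; the exact code on the slide may differ slightly, but the behavior is the same.

    #include <stdio.h>

    int main(void) {
        float x = 0.1;
        float y = 0.2;
        printf("x+y = %f\n", x + y);     /* default precision: looks fine */
        printf("0.3 = %f\n", 0.3);
        printf("x+y = %.20f\n", x + y);  /* ask for more digits... */
        printf("0.3 = %.20f\n", 0.3);    /* ...and the error shows up */
        return 0;
    }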
As an example of what this looks like, take the simple C program above. We have two floats, x and y; x is 0.1 and y is 0.2. The first print statement prints the sum of x and y, so it just prints out 0.1 plus 0.2, and the second print statement prints the constant value 0.3. So who can tell me what the output will look like if I compile and run it? It's not a trick question, and you can try it on your own machine. With the default formatting, x plus y gives 0.3 with a bunch of zeros after it, which is what you'd expect, and printing the constant 0.3 also gives 0.3 with a bunch of zeros. That looks fine. But if we ask for more precision, changing the print statements to control the number of significant digits that get printed, we end up with something that looks a little different. The first one, 0.1 plus 0.2, I think we can all say is not quite correct: after several zeros, we get this garbage at the end. And printing 0.3 gives 0.29999..., which is not quite 0.3.

So again, depending on the type of data you're working with, if you can tolerate a little imprecision, these floating point values might be okay, and they'll be fast. But if it's something important, like a bank account, where you want to make sure you get the values right, we can't tolerate this loss of precision in the database. That forces us, as the developers of the DBMS, to implement numeric data types that guarantee a fixed amount of precision, potentially with arbitrary precision and scale. We want to use these when rounding errors in the data are unacceptable. These are the numeric and decimal data types, and there are many, many different implementations of how exactly they work in real systems. Typically the commercial ones are much more sophisticated, since business applications have a lot of need for fixed-precision numeric values. But there's no hard and fast rule for how all systems implement this; everyone does something a little different. The trade-off is that ensuring the amount of precision you need requires more processing overhead than just using the basic 32-bit or 64-bit floating point values that C or C++ provide.

As a few examples, we'll look first at how Postgres implements its numeric data type. This is the struct they define in their code, and we can look at the different pieces: the number of digits in the value, the weight of the first digit, the scale factor, the sign (positive, negative, or NaN), and then the actual digits themselves as the last piece. And if we look at what the digits are actually implemented as, they're basically just a char array: char bytes stored in a variable-length array that the system decodes to get the individual values.
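Here is roughly that struct, paraphrased from Postgres's numeric implementation. Treat the field names and the digit representation as approximate, since the details vary across versions (newer Postgres stores base-10000 digits in 16-bit integers, while the slide shows a char array).

    /* Paraphrased from Postgres's numeric code; details approximate. */
    typedef unsigned char NumericDigit;  /* per the slide; newer versions use int16 */

    typedef struct NumericVar {
        int ndigits;           /* number of digits in digits[] */
        int weight;            /* weight of the first digit */
        int dscale;            /* display scale */
        int sign;              /* positive, negative, or NaN */
        NumericDigit *buf;     /* start of the allocated digit space */
        NumericDigit *digits;  /* the variable-length digit array itself */
    } NumericVar;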
Now compare the code that actually has to handle this. If you have floating point values, adding them together is just one instruction. For the numeric type, and the slide is cut off at the bottom, you have nested if statements, switch statements, all kinds of stuff going on in there, just to add two of these numeric values together. You have to check the sign of the first value you're trying to add, the sign of the second value, and all this kind of stuff that has to execute every single time, for every single tuple, where you want to add two values together. So you can see this quickly gets a lot more expensive than just using floating point values.

Yes? So the question is: floating point values are fast because there's dedicated hardware for them; has anyone tried making dedicated hardware to add fixed-point values together? The short answer is, I do not know. There could be, especially in other communities outside of data management, things like hardware-software co-design; there may be research that looks at that, but I'm personally not familiar with any of it. It's an interesting question. There are some DBMS vendors that have tried to fabricate specialized hardware for their DBMS product, but I'm not familiar with anyone who has specifically designed hardware for fixed-point addition, or fixed-point computation in general. I can look into it and get back to you next class. Thank you.

So that was Postgres. This is MySQL. This is the struct that defines their decimal type, and again we have these pieces that tell us the number of digits before the decimal point, the number after the decimal point, the length in bytes of the value, the sign (whether it's positive or negative), and then the actual digit storage. It's similar, except the digits are stored as an int32 array rather than a char array, but the end result is basically the same: you're storing a variable-length array of digits that you then need to decode, again with all of this difficult logic. And this slide shows a subset of the add function from MySQL. Again, you're going through all of these complicated if-else conditionals and switch statements to check things and to perform the actual computation by decoding the values stored in the digit array. Does anyone have any questions about the fixed-point arithmetic before we move on?

Okay, so the next piece I want to talk a little bit about is large values. I think someone in the last class asked what happens if an individual value, or an individual tuple, is too large for a data page. I said that in theory there's nothing stopping a tuple, or an individual value, from spanning multiple pages. In practice, though, most systems don't allow a tuple to exceed the size of a single page; it's constrained in the size of the columns, or the number of columns, or both, so you're not able to define a single tuple that's larger than one page. So now the problem comes up: what if we have individual values that are larger than a page? Imagine a really long string or something. The way we store that is in what's called an overflow page. Basically, in the tuple itself, the attribute, in this case c, which holds a really large value, just stores a pointer to the varchar data that lives in this overflow page.
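Here is an invented sketch of what such an in-tuple reference might look like. The actual encoding is system-specific; the tags and widths here are made up for illustration.

    #include <stdint.h>

    /* Invented encoding for a variable-length attribute: small values sit
     * inline in the tuple; large values leave behind a pointer to an
     * overflow page (which may itself chain to further overflow pages). */
    typedef struct {
        uint8_t  is_overflow;  /* 0 = value is inline in data[] */
        uint32_t length;       /* total length of the value in bytes */
        union {
            uint8_t data[8];       /* inline storage for small values */
            struct {
                uint64_t page_id;  /* first overflow page */
                uint16_t slot;     /* where the data starts in that page */
            } ref;
        } v;
    } VarlenAttr;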
Now, of course, maybe the value is too big even for one overflow page, so this could in theory span multiple pages; you can chain overflow pages together to hold an arbitrarily large varchar value. This is called different things in different systems. Postgres calls it TOAST, and the overflow pages kick in if the value is greater than 2 kilobytes. In MySQL, I think the threshold is greater than half the size of a page, and in SQL Server it's greater than the size of a page.

An alternative way of handling these large objects is to store them externally to the DBMS. We talked in earlier lectures about why it's a good thing for the database management system to have full control over all of the data, so it understands everything that's going on and can enforce the safety properties we want for the data. But for sufficiently large data, there are certain cases where you may want to store it externally to the DBMS. Some systems let you store this type of really large file as an external binary large object, or blob. Different systems call it different things: it's a BFILE in Oracle and FILESTREAM in Microsoft SQL Server. But basically, the way this works is that the tuple just stores a reference that points to some external object.

So can anyone think of a use case where this might come in handy? What sort of data might you want to store external to the DBMS? Netflix is a good example. Basically, multimedia data: images or videos that don't fit well with the data model in our DBMS. An individual Netflix movie file can be gigabytes; that doesn't fit well in our DBMS, so we just store a pointer to that data, it lives somewhere else on the file system, and we reference it when we need it. The drawback is that there's no explicit durability protection. Since the file is outside the DBMS, we can't ensure it's stored safely; part of the guarantee we normally get from the DBMS is that once our data is inserted, it's persistent, and we're not going to lose anything, but if it's stored externally, there's no way to control for that. There's also no transactional protection: if some other program is concurrently modifying the file, we can't guarantee it will be in a consistent state when we go to read it.

So a question I'd have is: when do you want to take this approach, versus storing something physically inside a DBMS page, in one of those overflow pages we talked about? Outside of file types like images or videos that don't fit the processing model we have, the other reason you might want to do this is if the object is just large enough. And by "large enough," there's a paper on this from Jim Gray; I think I mentioned him before, a famous database researcher who won the Turing Award. The paper evaluated the size at which it becomes more beneficial to store a blob externally rather than directly inside overflow pages. I think it came out about 15 years ago, and the conclusion they reached was about 256 kilobytes: above that, it's more beneficial to store the object externally. Because again, remember, if it's stored inside the database, then we have to read pages, and we get these huge objects written to and from disk every time.
So that's the trade-off they measured, and the exact numbers may be different now, but at a high level there are cases where it's more beneficial to store a large value directly inside the DBMS, and other cases where it's more beneficial to store it as an external object.

The next piece we're going to talk about is the system-level catalogs. In order for the DBMS to know what's going on with the data stored in the database, it keeps all sorts of metadata about the database internally so it can perform different tasks. As I mentioned, one of the most important things is knowing how to encode and decode the data stored in the raw bytes that represent tuples. So we store information about tables, columns, indexes, views, all of that structural stuff. DBMSs typically also store information about users and permissions, meaning access permissions: which users should be able to view or modify which pieces of data. And finally, they keep around a lot of internal statistics, things like the number of distinct values, join cardinalities, and data ranges. These are going to be really important for query execution, which we'll talk about in later lectures; for planning the most efficient way to execute a query, it's helpful to know statistics about the data you're storing.

Almost every DBMS that I know of stores the database catalog inside of itself: it stores the catalog using the same abstractions, things like tables and indexes, that it uses to store regular data. Now there's a chicken-and-egg problem: we need to write a SQL query to read the metadata, but we need the metadata to answer a SQL query, by decoding values and referencing objects, so it's not clear where we would start. What systems do is have special wrappers around the catalog's metadata objects that the system can use to directly encode and decode the values stored as part of the system catalog; they bootstrap the catalog tables that way.

As the user, you can query the DBMS's internal catalog, usually exposed as INFORMATION_SCHEMA, to get information about the database. You can see all of the objects stored in the database; it will tell you information about tables, metadata, statistics, all kinds of stuff. INFORMATION_SCHEMA is defined by the ANSI standard as a set of read-only views that provide information about everything stored in the database. Before this was standardized, it was sort of a mess, where every system had its own way of exposing the metadata. To make it more manageable and portable, they codified it in the standard, so that all of the systems expose the same metadata to users. There are also a lot of non-standard shortcuts; if you were playing around with SQLite at all, you may have used some of them to show what tables were stored in the schema.

I'll show a few examples. If you want to list all of the tables in the current database, you can write this SQL-92 compliant query: SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE table_catalog = '<db name>', where the table catalog is the name of the database you're interested in. And these are some of the shortcuts different systems have for the same thing: in Postgres you can just type \d, in MySQL it's SHOW TABLES, and in SQLite it's .tables.
Again, to get the schema for an individual table, say all the information about the student table, you can run SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE table_name = 'student'. And again, the different systems have shortcuts that give you the same data as the standard SQL query.

Okay, are there any questions about schemas or catalogs before we move on to the last part of the lecture, which is about data storage models?

Okay. An important piece to cover before we actually look at the different high-level storage models is the different types of database workloads we can have. The first one is what's called online transaction processing, or OLTP. What this means is that you have a lot of fast, short-running operations that each read or update only a small amount of data. Think of your bank account: if you want to get the balance, you're just reading one value, and if you want to deposit or withdraw money, it's a fairly short transaction. Another way to think about it is an online store like Amazon: you browse different products, add them to your shopping cart, check out, pay, and have the order shipped. The amount of data you access relative to all of the products Amazon offers is relatively small; you're just buying a few individual products. But if you think about all of the people around the world concurrently making purchases, it starts to add up. So the amount of data accessed or modified by an individual transaction, your individual purchase, is relatively small, but the aggregate across all of the concurrent transactions can get quite large. That's online transaction processing.

The other side of the spectrum is what's known as online analytical processing, or OLAP. These are analytical queries that read lots and lots of data: they scan whole tables, produce aggregates, and do lots of joins. They're typically used for things like decision support or business intelligence, where you want to look at all this data you've accumulated and see if you can get any insights out of it. You're executing large-scale queries that provide some type of useful or meaningful answer. This is similar to what the homework assignment was: you wanted to figure out high-level information about different products, answering queries like, in the Amazon example, what are the top five most popular products purchased in the past month by CMU students? Rather than updating or reading individual records, you scan a large sample of the data and see if you can extract useful insights. These are typically longer-running queries; they might run overnight, or be run to populate a report that you give to a higher-level executive decision maker, and they're oftentimes ad hoc. Unlike transaction processing, where every new-order transaction sent to Amazon basically has the same template, OLAP queries are usually one-off, ad hoc, and more specialized.
So that's OLTP on one side and OLAP on the other. There's a newer buzzword that's become more popular recently, called hybrid transaction and analytical processing, or HTAP. The goal is to be able to do OLTP and OLAP together on the same database instance, so you have your transaction processing and your analytics running concurrently. The idea is that you'd be able to get faster, more real-time insights from the transactional data that's coming in.

If you think about the spectrum here, on the X axis is the workload focus. OLTP is more heavily geared towards writes. There are obviously some reads you need to do, but think about ordering a new product: you insert a record saying you ordered it, you insert information about your shipping address, maybe you update some aggregated data that Amazon stores about you. The analytical queries, on the other hand, are more heavily read-focused: they spend most of their time scanning tables and performing joins to pull together the data needed for the analysis. Then on the Y axis is the notion of operation complexity: how complex is the query or program you're running for each of these workloads. OLTP is towards the simpler side, just inserting new records or updating individual values in tables; updating your payment information or your shipping information just changes a few values and fields. OLAP is usually a lot more complex: joining many tables, performing complex aggregations, windowing functions, all the kinds of stuff you saw in the homework. HTAP sits somewhere in between: you want to do both of these things at the same time.

The way this usually looks in practice is that a company or organization sets up two separate environments. On one side you'll usually have multiple OLTP data silos: database instances accepting concurrent connections from different clients to perform the transactional side of the workload. On the other side you have this really big OLAP data warehouse where you dump all of your data for analysis. So all of the transactions run on the OLTP side, inserting and updating on the individual silos, and then we go through a process called extract-transform-load, or ETL, where we take all of the data out of these different silos. It may be formatted differently across silos, and there may be redundant data. For example, the same student could be stored in multiple databases under different names: it could be my name as Andrew in one and Andy, my nickname, or Drew in another, and you want to merge these records into a single logical record during the ETL process, so they're cleaned, formatted, and ready to go into the data warehouse. This ETL process might run overnight or periodically throughout the day: we gather all of the data from the different siloed OLTP databases, format it, and shove it into the data warehouse where we run the different analytical queries. Then, based on the results of the analysis we perform, we might want to push some data back out to our OLTP data silos.
In the Amazon example, maybe we run the analysis to figure out what products CMU students have been buying, and then we push product recommendations back from the data warehouse to the OLTP side, so they appear when you visit the webpage. Again, the idea of moving the analytical queries so that they execute concurrently with the transactional workload is what this HTAP idea is about.

So why is all this important? Why does it matter that we have these different types of workloads? Well, think back to the relational model: it doesn't specify how we store the data. It gives us rules and requirements for the operations we should be able to perform on the data, but it doesn't tell us how we physically need to store it. And in fact, as we've seen, different systems store data in all sorts of different ways: there are the different page layouts, slotted pages, log-structured pages, all different options there, and all kinds of options for storing individual data types. But going a step further, there's nothing that says we have to store a tuple's attributes together in a single page as consecutive bytes, the way we've been doing. I think I mentioned early on that there's this notion of a "row" and why that might be different from a tuple or a record: the row storage layout assumes that all of the different attributes in a tuple are stored consecutively. So if you have the student record with an ID, name, and email address, you'd store all of those as consecutive values. Based on the different workloads we just talked about, OLTP and OLAP, the row layout might not actually be best in all cases. I'll go through and explain what that means.

As an example, we'll use this simplified version of the Wikipedia schema, cut down to just three tables. You have a user account table for the Wikipedia users, who have a user ID and a name. You have the pages table, which stores the Wikipedia page data. And then you have a revision history table, which says which edits or revisions were made by which users to which pages; those are the two references there, user ID referencing the user account table and page ID referencing the pages table. And I'll point out there's one other cyclic reference here, from pages back to revisions. It's just an optimization Wikipedia does: they store the ID of the most recent revision for a page right in the page record, so getting the latest revision is cheap.

If you think about the OLTP types of queries we might run against Wikipedia, again, they're simple queries that read or update small amounts of data, usually single entities. It might be getting the data for a page, updating the last time a user logged in, or adding a new record to the revisions table. Really short-running, simple operations that only read or write a few values in the database. And if you have a new company or a startup, this is usually the type of application people build first: the transactional pieces that manage the simple aspects of your new app. On the other side, the analytical side, we're issuing complex queries that read large portions of the database.
In this example query, we're counting the number of distinct logins over the last month from users where the hostname is like '.gov', so people logging in from .gov hosts. This example comes from a scandal 10 or 15 years ago, where politicians were having staffers log into Wikipedia and remove unflattering or scandalous things from their pages; a whole bunch of people, like Mike Pence and Joe Biden, were having staffers do this. So if you wanted to know how many times staffers were logging into Wikipedia and making these types of edits, you could issue this kind of analytical query against the data collected on the transactional OLTP side of things: all of those revisions and logins that were being added can be queried later to get insights out of them.

So I hinted that this was leading towards a discussion of storage models, and the key piece is that we can store tuples differently behind the scenes, without exposing this to the application or the user, in ways that are perhaps better for either OLTP or OLAP. What we've been talking about so far, this row-based storage model where all of a tuple's values are stored consecutively, is what's referred to as the n-ary storage model, or NSM. Again, that's the row format, where a tuple is just an array of bytes with everything stored contiguously in a page, and it's ideal for OLTP workloads in particular, because those queries tend to operate on individual records, or in this case rows. If we need to read or update an individual row, we can go grab it, and everything is stored together in row format; if we want to do an insert, we just write a bunch of values that are all stored contiguously together in a page. This is the classic way we've been talking about it: if we want to get tuple number one, everything is stored together right there in the page, the header and all the different attributes, and the same goes for accessing any of the other tuples. Each page is stored like this, in a row-based format, in the database.

So if you want to execute a simple query, say a login query, where a user logs in and we load their profile, we need to go get all of the information about that user from the user account table. We send this query to our system, and it hits an index. We haven't talked about indexes yet; we're going to cover them in lecture seven. But you can think about an index a bit like the page directory: where the page directory maps page IDs to physical locations, the index maps values, in this case a username or something, to a page ID and record ID pair. So the query goes to the index, the index says, okay, hey, go get this page, and then tells us exactly which tuple to locate in that page. And again, if we want to insert something, it's pretty straightforward: we just add a new row to that page right there. Here's a rough sketch of that record-ID indirection.
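This sketch is hedged, since we cover indexes properly in lecture seven; the struct is invented for illustration, but it captures what an index lookup hands back under the NSM layout.

    #include <stdint.h>

    /* Invented for illustration: what an index lookup returns. */
    typedef struct {
        uint64_t page_id;   /* which page holds the tuple */
        uint16_t slot_num;  /* which slot in that page's slot array */
    } RecordID;

    /* Hypothetical flow: the index maps a key (e.g. a username) to a
     * RecordID; the page directory maps page_id to a physical location;
     * the slot array inside that page maps slot_num to a byte offset,
     * where the whole row sits contiguously. */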
So that was the OLTP side. If we come back to the query that gets all of the logins from .gov hosts, we can see how this works in a similar way. We have to scan through all of the pages, because we don't know exactly what we're looking for; we want to look at all the pages in the database. We go through and first perform the filtering, where the hostname is like '.gov', which requires us to look at the values in the hostname column. Then we have to look at the last-login attribute to do the count and the extract parts of the query, which means reading the values in that attribute too. The big problem is: what happens with all of these other columns that we never touch? We don't need to look at them for this query; they have no impact on anything we're doing. It's just useless data that we're having to read in from disk, even though we never look at it.

So to summarize, the n-ary storage model has the advantage of really fast inserts, updates, and deletes, and it's good for queries that need to access the entire tuple; again, when loading your whole profile at login, we need all the data in the tuple. But the disadvantage, as in that previous query I showed, is scanning large portions of the table, or a subset of the attributes: when you read a whole page, you end up reading in a bunch of columns or attributes that you don't need, and they're eating up your I/O bandwidth, your disk bandwidth, with no benefit.

The alternative approach to the n-ary storage model is what's called the decomposition storage model, or DSM. Basically, the DBMS stores the values of a single attribute for all tuples contiguously in a page. Rather than storing all of the attribute values of a single tuple contiguously, we extract all of the values of a single column, which is where the name "column store" comes from, and store those contiguously for many tuples. This is ideal for OLAP workloads, where you have a lot of read-only queries, you don't have to do many updates or writes in general, and you want to perform large scans, particularly over subsets of the attributes. If all of the values for a particular column are stored together in a page, you only need to read that page, and you don't have to worry about the other, unused columns that don't appear in the query. As an example, going back to our schema, we store each attribute individually in its own pages: say the hostname page stores the hostnames for all of the tuples together in one page, and we do the same thing for user ID, last login, username, and whatever other attributes the table has. If you have enough data, you'll end up with multiple pages per column, but the point is that for each unique column in the table, we store the values of all the tuples together, consecutively and contiguously.
Now, going back to this query: we only need the hostname and the last login, none of the other attributes from the table, so we can just go get the pages that correspond specifically to those two columns. We don't touch the other columns, we don't waste any bandwidth; we read exactly the pages we need.

Yes? So the question is: if you had some kind of partitioning strategy, where you split up the data, say organized by year, so that more recent data is grouped together and separated from older data you might not care about, which storage model would that fall under? I'd say it's orthogonal to this row-versus-column question. For example, you could take your row-based data in the n-ary storage model, sort it by year, and partition it by year, so that you still get temporal locality for the data items; and you could do exactly the same thing with the columnar storage in the DSM, sorting and partitioning the whole data set by year. So it's an additional optimization you can apply if you know there's some kind of, for example, temporal locality. Another way to think about it: if you have clients in different regions, European, North American, Asian, you can partition your data that way. Any partitioning scheme or strategy you apply on top is orthogonal to the row-versus-column layout. Does that answer your question? Are there any other questions about the NSM-versus-DSM trade-off?

So the next piece we need to worry about in a DSM implementation: I've told you we're going to split the tuples up into individual column values and store all of the values of each column together; how do we reconstruct an individual tuple? If the username and the email address and all that stuff are split up, and now I need to query an individual student, how do I put the record back together? The answer is that there are two different approaches that are typically applied. I think the offset-based one is more common, so we'll talk about that first. The idea is that you can use fixed-length offsets, because the values within a column all have the same type. For example, say user ID is just an integer; we know it's 32 or 64 bits or whatever, so to access the nth value we just compute the value size times n, which gives us the offset of that ID in the column. And likewise, if we want to go get the corresponding email address, we can jump directly to the matching offset in the email address column. Now, you might run into a problem with variable-length strings, and there are different ways around that. One is to pad strings out to a particular length. Or, if there's a lot of repetition in the strings you store, you may be able to use some kind of encoding, like dictionary encoding, where you replace each string with a fixed-length integer code. A sketch of this offset-based reconstruction follows.
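This is a minimal sketch under the assumption that every column has been made fixed-length (padded strings or dictionary codes); the column names follow the Wikipedia example, and the layout is invented for illustration.

    #include <stddef.h>
    #include <stdint.h>

    /* Each column lives in its own contiguous array (its own pages). */
    typedef struct {
        const int32_t *user_id;     /* fixed-length int32 values */
        const int64_t *last_login;  /* fixed-length int64 timestamps */
        const int32_t *hostname;    /* dictionary codes standing in for strings */
    } Columns;

    /* Reconstruct tuple n by jumping to position n in every column:
     * a fixed-length column's i-th value lives at byte offset i * width. */
    void reconstruct(const Columns *c, size_t n,
                     int32_t *id, int64_t *login, int32_t *host) {
        *id    = c->user_id[n];
        *login = c->last_login[n];
        *host  = c->hostname[n];
    }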
We'll talk more about that kind of encoding in later lectures, but the key idea is that if you split things up into columns, you can use these positional offsets to jump directly to the values you need to reconstruct individual tuples. The other option is to store the ID of each tuple embedded directly in the columns themselves: rather than storing column values alone, each column entry stores the tuple ID followed by the data value. Here they're all sorted in the same order, but you could imagine having individual columns sorted in completely different orders and then using the tuple IDs to reconstruct the tuples. You might ask: why would I want to sort columns in different orders? Again, we'll talk about this in later lectures, but it may allow you to apply better compression. For example, if a column of dates is sorted and your dates all happen to be one day apart, you may be able to use a delta encoding, where rather than storing the full date you store a much smaller value that says: add one day to the previous entry. That gets you much better compression. The other thing is that sorted data lets you do a binary search, or some more intelligent search, for individual values, instead of scanning the whole column with a sequential search algorithm.

So those are the two ways you can rebuild individual tuples. To summarize, compared to the n-ary storage model, the DSM has its own advantages and disadvantages. DSM reduces the amount of wasted I/O, as I mentioned, because you only read the data for the columns you specifically need to answer your query, and you can get better query processing and compression on the stored columns. The disadvantages are that it's slow for point queries: if you want to go reconstruct an individual tuple, it's a lot easier when the tuple is laid out contiguously as a row, whereas here you have to do some reconstruction, either using the offsets or using the IDs as in the previous slide. And it's a lot harder to do inserts, updates, deletes, that kind of stuff, because you have to go write values into multiple columns on multiple pages, rather than writing everything at once.

DSM systems are not new; they've been around for a while. The first one I know of is from the 1970s, a system called Cantor. It wasn't actually a DBMS, I think it was more of a file system, but if you look at what they're describing in the paper, it's a column store. In the 1980s came the first theoretical underpinnings, the proposals for DSM storage. In the 1990s there was a product called Sybase IQ, essentially an in-memory accelerator for the Sybase row store. I don't think it ever became that popular, but what it was doing was storing data in memory in a column layout to accelerate certain types of queries. Columnar storage really took off in the early-to-mid 2000s with these three systems: Vertica, VectorWise, and MonetDB. If you're interested, you should check them out; they're the first popular, commercially successful column stores, and they paved the way for a lot of the techniques that are now commonplace in columnar storage. And the slide for the 2010s just says "everyone," because, and this list is only a small subset, ...
...pretty much everyone nowadays offers some columnar storage option if they're in the analytics or OLAP market. If you only have row-based storage as your storage model for analytics, you'll get crushed. So everyone, including the big commercial systems like Oracle, implements this kind of columnar storage model for analytics-type queries.

To wrap up: I've been talking about a DBMS as having a stack comprised of independent pieces. As we've seen, they're not entirely independent, since sometimes one layer needs to know about other layers in the stack; for example, the storage manager might need to know about the higher-level application workload in order to pick NSM versus DSM, versus some other partitioning strategy. So there's not always a clean cut between layers in the stack. And it's really important to choose the correct storage model for the workload you're targeting. If you get nothing else out of this lecture, just remember: for OLTP you want row storage, and for OLAP you want column storage, because otherwise you'll get crushed.

So with last lecture and today, we've finished problem number one: how the DBMS represents the database in files on disk. Next class we're going to talk about how you move pages back and forth between disk and memory, and that's going to be the topic of the first programming project. Thanks, and I will see you on Wednesday.