Let's get started. DJ 2PL. So you've got a show this weekend, right? Yeah, yeah, yeah. You don't know where it is? No, not yet. You don't know when it is? It's at Rangos, at 8:30. It's on campus? It's on campus, yeah. Is it real? Yeah, I mean, as real as it could be. OK, all right, cool. All right, so we'll post about that on Piazza, and I'll talk about it in class on Wednesday. We have a lot to discuss, a lot to go over, so let's jump right into it. So on the docket for you guys: obviously Project Zero was due last night. We haven't gone through and looked at results for everyone yet; I think we had about 150-something people complete it, which is good. Homework One has been out for a while, but we bumped the deadline, so it's now due four days from now; that should be reflected in Gradescope. And then Project One is out, and that'll be due on October 1st. Any questions about Homework One? Somebody posted on Piazza that the leaderboard assignment for Project One doesn't work yet. We're fixing that and we'll push it to GitHub later this week, and I'll announce on Wednesday what the leaderboard is, what the implications of it are, and why it matters. OK? All right, the other thing that's going on, additional things if you want to learn beyond the stuff we're talking about in the course: there are a couple of database talks coming up. Today, after class at 4:30, we're having the Qdrant guys out of Germany. They're one of these vector databases that target LLM or ChatGPT kinds of setups. They'll be talking about the internals of their system, and that'll be, again, at 4:30 over Zoom. Tomorrow at 6 PM, the Databricks people are giving a talk somewhere in the Gates building. It's a recruiting talk, but that means they're probably going to feed you, and then you can talk to them about getting jobs there. Databricks has hired pretty much all my best students in the last two or three years; they've all gone to Databricks. I was there in July, and they're all doing great. They have a lot of money, and they don't give us any. That's another problem. And then OtterTune is actually my startup; my former PhD student and co-founder, Dana Van Aken, will be giving a talk next week about what we're doing using machine learning to optimize database systems like Postgres and MySQL. Yes? Will we have information about the location of these things? So my talks, the Qdrant and OtterTune ones, those are on Zoom. And if you go on the slides, the link there will take you to whatever it is on my calendar. It's somewhere in Gates for the Databricks one. OK, so look at the slides. Yep. Other questions? Again, these are optional. These are for if you want to go beyond the stuff we talk about in the course. What I like about these kinds of talks is that even if you don't understand everything right away, we'll hit a lot of these topics throughout the semester, and then you realize, I'm not crazy. Well, I'm crazy, but not that crazy. I'm not making stuff up. The things we're talking about this semester that I want you to know really are applicable to building real systems. OK? All right. So last class, we talked about the initial framework we're going to have in our minds for describing how we're going to build a data management system.
And we discussed how it was a disk-oriented architecture, where all the components in the system are really built around the key premise that the primary storage location of the database, when it is at rest, will be on some non-volatile disk. An SSD or a spinning hard drive, it doesn't matter. And the components of the system are really about moving the data back and forth between disk and memory, because it's a von Neumann architecture: you can't operate on the data while it's at rest on disk. So that's the big picture of what we're trying to achieve. And of course, since disk is slow, we need a bunch of tricks and techniques to hide the stalls of going to disk by maximizing the amount of sequential IO. And we'll see that right away today: we'll talk about an alternative to the method from last class that tries to maximize sequential IO. And then there'll be other things like filters and indexes, ways to reduce the amount of data we actually have to look at when we run queries. We also talked about a page-oriented storage scheme, the slotted page architecture, which allowed us to store tuples of arbitrary, variable-length sizes across these heap files, and we could expand the size of a tuple as needed depending on whether it fit in the page or not. So what we were describing last time is what I loosely termed a tuple-oriented storage scheme. What that really means is that the system is organized around: I've got a tuple, I've got to put it somewhere. The layouts of the pages are really based around this idea of, I've got a tuple, let me store it. And so in this architecture, if you wanted to store a new tuple, the way you would do it is to go look in the page directory and find a page in your heap files with a free slot. We said that the page directory maintains metadata about what space is available. Once we have the page we want to insert the tuple into, if it's not in memory, we've got to go to disk and fetch it in (I'll talk about how we do that next week, but think of reading a file and bringing it into memory). Then once we have that page, we look in the slot array and ask, what's the next free slot where we can store this tuple, update the slot array, put the tuple inside the page, and then we're done. To update a tuple in this environment is basically the same thing, where we have some way to get the record ID of a tuple. We said this is typically the page ID and the offset or the slot number. Ignoring how we got that record ID, which is what an index will do for us, we go to the page directory again and find the location of this page; if it's in memory we're good, if not we've got to go to disk and get it. Then we look at the slot array in the page, find the offset, and if the updated tuple we're trying to install is the same size as the existing tuple, we just overwrite it. If not, then maybe we've got to find another page that can accommodate it, because there's no space in the page we're looking at. Right? Again, this is the core idea of heap files with the page-oriented architecture based on tuples. This is basically how any system like this would actually work.
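To make that insert path concrete, here is a minimal C++ sketch of a slotted page and a page directory. The struct layouts, sizes, and names are made up for illustration and are not from any particular system; it just shows the slot array growing from one end and tuple data from the other, with the directory tracking free space.

```cpp
#include <cstdint>
#include <cstring>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

// A record id is a page id plus a slot number, as described above.
struct RecordId {
    uint32_t page_id;
    uint16_t slot;
};

// Simplified slotted page: the slot array grows from the front,
// tuple data grows backward from the end of the page.
struct Page {
    static constexpr size_t kPageSize = 4096;
    std::vector<std::pair<uint16_t, uint16_t>> slots;            // (offset, length) per slot
    std::vector<uint8_t> data = std::vector<uint8_t>(kPageSize);
    uint16_t free_end = kPageSize;                                // tuple data lives at [free_end, kPageSize)

    size_t FreeSpace() const {
        size_t slot_bytes = slots.size() * sizeof(std::pair<uint16_t, uint16_t>);
        return free_end > slot_bytes ? free_end - slot_bytes : 0;
    }

    // Copy the tuple into the page and return its new slot number.
    // Assumes the caller already checked FreeSpace().
    uint16_t Insert(const std::vector<uint8_t>& tuple) {
        free_end = static_cast<uint16_t>(free_end - tuple.size());
        std::memcpy(&data[free_end], tuple.data(), tuple.size());
        slots.emplace_back(free_end, static_cast<uint16_t>(tuple.size()));
        return static_cast<uint16_t>(slots.size() - 1);
    }
};

// Simplified page directory: which pages exist and how much room they have.
struct PageDirectory {
    std::unordered_map<uint32_t, Page> pages;

    std::optional<uint32_t> FindPageWithSpace(size_t need) const {
        for (const auto& [pid, page] : pages)
            if (page.FreeSpace() >= need) return pid;  // otherwise the caller allocates a new page
        return std::nullopt;
    }
};
```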
So what are some problems with this? We touched on some of these last class. Is it efficient? For reads, maybe, because if I need the entire tuple, I go to one page and get it. That's okay. But if I start updating things, making writes and doing inserts, updates, deletes, I could end up with fragmentation in my pages. I could have pages that are not fully utilized, meaning I have a little empty space that I can't fit any new tuple into: it's not big enough for a new tuple, so I can't use it, but it's just wasted space sitting there. Or even before I run out of space: if I have to insert a new tuple and I have nothing in my table, I allocate a page and insert one tuple into it, and there's nothing else on that page. Depending on the size of my pages, which can differ per system, there's a bunch of empty space that's not being used. The next challenge is that there's a bunch of useless disk IO. If I've got to update one tuple and it's not in memory, I've got to go to disk and fetch it. But what else is inside that page? We're not storing one tuple per page. I mean, you could, but you typically don't want to do that. So now if I've got to update one tuple, I've got to fetch that entire page and bring in a bunch of data that may not even be what I need, because there's a bunch of other tuples in it that I'm not updating. Same thing when I go to write: if I'm only updating one tuple but I had to bring the 20 tuples in that page into memory, now I've got to write those 20 tuples back out. And the last issue is that we're potentially going to get a lot of random disk IO. Again, the cop-out answer when people ask which approach is more efficient is that it always depends in databases, right? If your workload only updates a single tuple per query, then maybe this architecture isn't so bad. But if I'm updating 20 tuples at a time, and those 20 tuples are in 20 separate pages, I've got to read 20 separate pages from disk into memory, update them, and write 20 separate pages back out from memory to disk. That's random IO, and that's going to be slower. And then, not necessarily a problem with the architecture itself, but it may be the case that we're operating in an environment where we can't do the in-place updates that we assume we can do in a slotted page architecture. Meaning I can't fetch a page from disk, bring it into memory, update it, and then write it back to where I got it from. You can't do this in some cloud storage systems, right? With S3 you can kind of trick it out using versioning, but I can't do in-place updates in some cloud storage systems. And the Hadoop file system, it's not that common anymore, but that's another good example: it's a distributed file system where, again, I can't do in-place updates, I can only do appends. So this tuple-oriented slotted page architecture wouldn't work in that environment, because I can't modify a page and write it back where I got it. So this is why we need to look at alternative methods. And in particular, the problems I just talked about are addressed by a log-structured storage scheme. Beyond the heap file with the slotted page architecture, log-structured storage is probably the second most common approach people take in database systems. It's probably even more common today because of embedded storage managers like RocksDB, which are log-structured.
So if you've ever seen a database system that's using RocksDB, it's inherently log-structured, because RocksDB is log-structured. And then we'll talk about another approach that's not exactly log-structured, sort of an amalgamation of the two, which is index-organized storage. That's what MySQL and SQLite and others use. And then we'll finish off talking about how to actually represent the data of the attributes in tuples. So again, we were working out what a file looks like, what a page looks like, we're still sort of in that world, and then we'll spend more time talking about what the individual tuples look like, the values in the individual tuples, okay? All right. So log-structured storage is an old idea. It's loosely related to log-structured file systems, which predate it by about 10 years; log-structured file systems were in the 1980s, and log-structured storage was first proposed in the mid-90s. In the textbook, they call these log-structured merge trees. I'm not going to describe what the actual log-structured merge tree looks like, because I don't think you need to know the details of the tree part of it. I'll describe basically the same idea, but without bringing in the tree, because that makes things slightly more complicated, and the high-level idea is what I care about and what I want you guys to understand. So the basic idea of log-structured storage is that instead of storing individual tuples, we're going to maintain log records of the changes to those tuples. Think of it like a key-value store, a key-value system. I'm going to have some operation, either put or delete, and then a key-value pair, with the key corresponding to some tuple identifier. We can't use the record ID we had before, because we're not going to have pages, we're not going to have offsets and slots, so it won't be that. But it'll be some key identifier, and the payload will be the actual tuple itself that I'm trying to install with the put. And as the application inserts data or makes changes, we're going to append new log entries to an in-memory buffer, just in the order that they arrive. And then at some point that buffer is going to get full, and we're going to write it out to disk. Pretty simple, right? All right, so let's see the example here. Again, the only two operations we're going to have are put and delete. There's no insert, because that's just a put. There's no update, because that's just a put doing a blind write over top of whatever was there before. And in our log file in memory, we're going to go from oldest to newest. So at the beginning of the file, or the beginning of the buffer, are the oldest entries, and we just keep appending to it as we make changes. All right, so the application may say, I want to go ahead and do a put on record 103. Where did the 103 come from? We don't care; assume some upper part of the system figured that out for us. And the payload in the log record says we're setting the value of record 103 to A1. Same thing, the next one comes along, a put on 104, and then another put that updates that record. And if we have a delete, we just append a delete operation as a log record with, again, the same tuple identifier, and we keep appending to the log as we go along. So in this example here, we don't ever need to go read what the original record, the original tuple, was.
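As a rough illustration, here is what those log entries and the append-only in-memory buffer might look like in C++. This is a sketch of the idea, not RocksDB's actual format; the buffer size and names are made up.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Each entry records a change to a key; nothing is ever updated in place.
enum class LogOp : uint8_t { kPut, kDelete };

struct LogEntry {
    LogOp op;
    uint64_t key;       // tuple identifier (not a page id / slot number)
    std::string value;  // full tuple payload; empty for deletes
};

// In-memory buffer: strictly append-only, oldest entry first.
struct MemBuffer {
    std::vector<LogEntry> entries;
    size_t capacity_bytes = 1 << 20;  // e.g. 1 MB before flushing sequentially to disk
    size_t used_bytes = 0;

    bool Put(uint64_t key, std::string value) {
        used_bytes += sizeof(LogEntry) + value.size();          // rough size accounting
        entries.push_back({LogOp::kPut, key, std::move(value)});
        return used_bytes >= capacity_bytes;                    // true => time to flush
    }
    bool Delete(uint64_t key) {
        used_bytes += sizeof(LogEntry);
        entries.push_back({LogOp::kDelete, key, {}});
        return used_bytes >= capacity_bytes;
    }
};
```

Calling something like buf.Put(103, "a1"), then buf.Put(104, ...), then buf.Delete(102) never has to read the old version of any of those tuples.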
Anytime you want to update a record, again, think of the lowest levels of the system. Obviously, if I'm running a query like UPDATE table SET value = value + 1, I've got to know what the original value was, right? That's essentially a read followed by a write. But at this lowest level of the system, we don't have to know what the original value was for a given key. And that's different from the tuple-oriented architecture, where I had to go fetch the page that had the original tuple before I could update it. I don't have to do that with this. So again, at some point this in-memory page will get full and we've got to write it out to disk, and that's literally just taking the entire contents of the memory page and plopping it down into a bunch of pages on disk. Then we clear out the memory buffer and start filling it up with new log entries, and when that gets full, same thing, we write it out. Now, there are two important things to point out about this write. First of all, this is all sequential IO now. My in-memory page could be, say, a megabyte or 10 megabytes; when it gets full, I write those 10 megabytes out sequentially to the file on disk. So whereas in the tuple-oriented, page-oriented architecture I might have 20 tuples split across 20 different pages, in this setup those 20 tuples are always going to be in the same pages when I write them out, because we're just appending log records. The other important thing in this architecture is that once a page is written to disk, it's immutable. You can never go back and do in-place updates. We'll compact it, we'll see that in a second, basically doing garbage collection, but we can never overwrite a log record that was already there before. We're not covering distributed databases just yet, but there is some advantage in making sure your files are immutable, beyond the issue of, oh well, if I'm on cloud storage I can't do in-place updates. It makes things easier if you're just appending to the log, because this is essentially what Paxos or Raft is doing: they append log records and never go back and make changes. If something changes, it becomes a new log entry. So this makes the architecture a lot simpler: once it's on disk, you don't update it. Now, for now we're going to ignore what happens if I need to write the memory buffer out before it's completely full. Like, if I'm running a query or a transaction that wants to make sure its changes are written to disk before I tell the outside world that the data is safely written, I may write this log buffer out before it's full, but I'll write it to a separate location, like a local disk where I can do those kinds of writes. Again, we'll ignore that for now. So in the log-structured architecture, this is going to make our writes really fast, much faster than in the tuple-oriented architecture, because we're just appending log records and writing them out sequentially. What's potentially going to be slower now? Reads. Again, in computer science and database systems there's no free lunch, so we're making the writes go faster, but now the reads potentially go slower. So to do a read, what do we have to do?
Well, again, assume something in our system has figured out the ID or the key of the log record I want, like 102, 103, 104; we can ignore that for now. In order to find the log record for a given key, we first want to check the in-memory page: start at the end, because that's where the newest records are, and then scan sequentially in reverse order back toward the beginning until we find the log entry we want. If it's not there, we may have to go to disk. We'll cover that in a second. So is this efficient? No, right? So the way to get around this, and this is where the log-structured merge tree in the textbook comes in, but again we don't have to worry about the details, is that we're going to maintain some kind of index. For every single record ID, it'll tell you where it's located in the in-memory buffer page, or, if it's not in memory, where it is on disk. So to get record ID 104, I would just do a lookup in this index. I'm not telling you what data structure it is; it doesn't matter. Typically it's going to be a B+ tree, but in some systems it's a trie, in some systems a skip list, it doesn't matter. I do my lookup, find 104, and it tells me what offset in the memory buffer has the data that I'm looking for. In the case where I want to look up 103, I've got to go out to disk and get it. So far so good? Yes? Without getting into details about what the index actually is, is it possible to implement it with an append-only file system? So your question is, is it possible to implement the index with an append-only file system? Yeah. So the way you would do this is, you can sort of treat the index as a log itself, and then in memory you build a data structure on top of it. Like with a B+ tree, the way it's typically done is that when you write the pages out to disk, you're also maintaining the data structure itself, the pointers between the children and parents and so forth. In this environment, you would basically reconstruct the in-memory index by replaying the log. So you could do it with an append-only file system. Actually, I don't know what RocksDB does. Yes? Is this index the same index as what we get when we run CREATE INDEX in SQLite? The question is, is this index the same index you would get when you run CREATE INDEX in SQLite, specifically SQLite? No. Well, SQLite's not log-structured. You're basically asking, is this the same as a primary key index, right? Potentially yes, but not always. SQLite's kind of complicated: the tables themselves are index-organized, and then you have the non-table indexes, the secondary indexes, and the primary key table indexes. Give me a sec, we'll get to that. Think of this as the internal bookkeeping to find records, almost like the page directory. It's not something you would necessarily expose to the SQL queries themselves, but you could use it for that. Yes? I think the only thing we're ever looking at is what the index points to in the memory page. So why don't we just maintain a bunch of pointers to the latest update, rather than storing everything that's happened? So your statement is that this index only points to things in memory, which is not true, right? It could point down to disk. Sorry, it only points to the latest version of each key, yes. So why would we store everything down here? Yeah.
But in this case here, I have, what was it, ID equals 103. It's not in memory, it's somewhere on disk. But where, right? I can't just blow away the whole file; I would have to pull the record out, and that's expensive. That's compaction, we'll get to that in a second. Yes? So the question is, why do we need to store this delete record? If it's been deleted, why even store that? Because there's going to be an earlier put for, say, 102, right? Say there was a put that already got written out to disk. Think of it like going back in time: if I don't have that delete, then the record still appears to exist, right? Because I can't go back, find the page the old put is in, and pull it out. I can't do that. So I just append a log record that says, okay, if you're going back in time and you see 102, no, it's been deleted. And then we'll coalesce things in a second to remove the extra entries for those keys. Okay. So as both of these questions sort of alluded to, some of these log records we don't need to maintain forever, right? The delete was an example of this, or puts to the same key over and over again. And so in a log-structured database system, what they're going to do is periodically run some background job that compacts the pages, coalescing them to reduce redundant operations. So in this case here, I have page one and page two. Think of this as going from oldest to newest, so this one is older than that one. And if I want to compact them, all I need to do is recognize the latest entries that I care about for the keys that are referenced in these two pages: the puts for 103 and 104, the deletes for 101 and 102, and then the put for 105. There's a put for 105 in both pages, and because this one is newer than that one, we want this one, not that one. So instead of storing two puts for 105, we only need to store one in our coalesced page. And as was also brought up, it may be the case that I actually don't need to store the deletes at this point either, because some other upper part of the system says, all right, I've removed 101 and 102 from my index, so anybody doing a lookup will see key not found, and therefore I don't need to store the log entries for them. All right? So this is called compaction. And this is, again, no free lunch: the log-structured approach makes the writes much faster because we're just appending to the log, but at some point we're going to have to go clean things up. So again, the idea is that we do this compaction and now we're down to a compressed form of the log records. Again, this is only on disk; we can't do in-place updates. So this is literally taking one disk page and another disk page and writing out a new one. We can't overwrite an existing one. Another important thing to keep track of is that once we have a page on disk and we've compacted it, removing the redundant operations on the same keys, that means that within a disk page we've compacted, each key is only going to be referenced once.
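Here's a minimal sketch of that coalescing step, assuming the two input pages are in temporal order (older page first) and that tombstones can be dropped because nothing older still references those keys. The types mirror the earlier log-entry sketch; this is just the idea, not any system's actual compaction code.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Same shape as the log entries in the earlier sketch.
enum class LogOp : uint8_t { kPut, kDelete };
struct LogEntry { LogOp op; uint64_t key; std::string value; };

// Compact two pages of log entries (older first) into one page that has
// at most one entry per key, keeping only the newest operation for each key.
std::vector<LogEntry> Compact(const std::vector<LogEntry>& older,
                              const std::vector<LogEntry>& newer) {
    std::map<uint64_t, LogEntry> latest;            // keyed by tuple id
    for (const auto& e : older) latest[e.key] = e;
    for (const auto& e : newer) latest[e.key] = e;  // newer entries win

    std::vector<LogEntry> out;
    for (auto& [key, entry] : latest) {
        if (entry.op == LogOp::kDelete) continue;   // optionally drop tombstones too
        out.push_back(entry);                       // std::map iterates in key order
    }
    return out;
}
```

Note that because the result comes out of an ordered map, it is already sorted by key, which is exactly what the next step wants.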
At this point, we don't care about the temporal ordering in the log anymore. We don't care about newest to oldest, right? So now, if the operation we need to support is go find me key 103, or 104, or 105 on disk, the temporal ordering doesn't help us, and what we actually want to do is sort the log records in the disk pages based on their keys. So we do something like this. Because all I need to know now is that this disk page is older than that one, so I keep some metadata like that, but for each individual log record I don't need to know whether one is older than another. When you do this compaction and then sort the records by their key values, these are sometimes called sorted string tables, or SSTables. I think the term was coined by Jeff Dean and Sanjay Ghemawat at Google when they built Bigtable in the mid-2000s, and it carried over into LevelDB. And again, the advantage of this is that when I have to go fetch this disk page, I'm not looking for "give me the put for 103 at this timestamp"; I'm just looking for the put for 103. And you want to do a lookup to find that record as quickly as possible. So if it's sorted, you can build an index or a filter, some way to quickly jump to the record you're looking for, rather than having to do a binary search across the entire file. So there's some metadata in the header of each of these SSTable files (they're comprised of multiple pages) that keeps track of where the offsets are for the different keys. Yes? But wouldn't the index that we talked about... The question is, wouldn't the index that we talked about already point to the exact location of where something is? Not necessarily. You may want to keep a more coarse-grained index that doesn't give you the exact offset of the thing you're looking for, but tells you the file that has it, and once you get to that file, it tells you where to find it. Yeah, so I may not have drawn a good example here: I'm saying disk page, but this could be multiple pages for an SSTable file, because these things get big, so it's not going to be a single page. In the back, yes? Is that for an SSTable, or the one back in memory? Back in memory, yes, because you don't want to have to recreate it upon restart. And as I was saying before, either you could write the pages of the index themselves to disk, or you could maintain a log record that says here's how to rebuild the index. Yes? Why don't we write the pages in sorted order in the first place? Because while it's in memory, we could sort it there. Yeah, so the question is, why don't we write the pages out in sorted order to begin with? That's what they do, yes. Yeah? Wouldn't the compaction have an impact on the read performance, because you might have to take some locks? Absolutely, yes. So the statement is, and he's right, isn't compaction going to have an impact on the performance of the reads? Because not only are you taking locks, you're doing disk IO. We'll get to different types of compaction in a second, but you're potentially bringing in gigabytes of files, compacting them, and writing them back out. So absolutely, yes. Again, no free lunch. Okay.
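Before we get to the compaction strategies, here's a rough sketch of why the sorted layout helps lookups: with a small sparse index in the file header that records every Nth key and its position, a lookup only has to binary search one small block instead of the whole run. The layout and names here are made up for illustration; real SSTable formats differ.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// One sorted run ("SSTable"), simplified to in-memory vectors.
// Entries are sorted by key and each key appears at most once.
struct SortedRun {
    static constexpr size_t kBlockSize = 16;  // keys per indexed block

    // (key, tuple payload), sorted by key.
    std::vector<std::pair<uint64_t, std::string>> entries;
    // Sparse index kept in the header: sparse_index[i] is
    // (entries[i * kBlockSize].first, i * kBlockSize).
    std::vector<std::pair<uint64_t, size_t>> sparse_index;

    std::optional<std::string> Get(uint64_t key) const {
        // 1. Use the sparse index to find the only block that could hold the key.
        auto blk = std::upper_bound(sparse_index.begin(), sparse_index.end(),
                                    std::pair<uint64_t, size_t>{key, SIZE_MAX});
        size_t start = (blk == sparse_index.begin()) ? 0 : std::prev(blk)->second;

        // 2. Binary search only within that block, not the whole run.
        auto last = entries.begin() + std::min(start + kBlockSize, entries.size());
        auto it = std::lower_bound(entries.begin() + start, last,
                                   std::make_pair(key, std::string{}));
        if (it != last && it->first == key) return it->second;
        return std::nullopt;  // not in this run; check an older run next
    }
};
```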
So there are two main ways you can do compaction, and the terminology I'll use here is what's used in RocksDB. The simplest form is called universal compaction, where you just take adjacent sorted log files on disk (again, these can be multiple pages; think megabytes, gigabytes, terabytes) and compact them. So I would take these two, and since they're already sorted, I'm basically just doing a merge, figuring out whether one version of a key is subsumed by another. Assuming this one is older than that one, if I see a put for key 103 here and a put for key 103 there, then I know I want the newer one and I can throw the other one away. And I can do the same thing for any possible combination of these sorted log files; I can keep coalescing them into more compact forms. The other approach is what's called level compaction, and this is where the "level" in LevelDB comes from. Actually, who here has heard of LevelDB? Very few. Who here has heard of RocksDB? More, okay, not much more. RocksDB is Facebook's fork of LevelDB. Google wrote LevelDB, RocksDB forked it. The very first thing they did was remove mmap, right? And then they expanded it and did a bunch of other stuff. So this level compaction comes from LevelDB. All right, so you have your sorted files on disk, and at level zero they're going to be a certain size, and you keep adding more sorted files until at some point you run compaction and combine them down into a larger file at the next level. You keep making more of them at the top level, and at some point those get merged together; and once I have enough at the next level, I run compaction for that one and produce something at the level below that. So, cascading down, I'm getting larger and larger files as I go down. All right, so as I said, because RocksDB has become the default choice for a lot of database vendors, people building database systems, as the underlying storage manager, those systems are essentially log-structured. What they're building on top of RocksDB is the SQL parsing layer, the SQL execution, the indexes, all the additional things we'll talk about throughout the semester, and RocksDB is essentially providing a key-value API. In my examples here, I just said here's the value, here's the payload I'm storing in the log; it has no notion of attributes or columns. So even if I have 10 columns in my table but I only update one of them, my put record has to contain all 10 columns. We'll see multi-versioning later in the semester, after the midterm, where we can be smarter about this, and that essentially looks a lot like log-structured storage, but for now we can ignore that. This is almost how Postgres was originally envisioned in the 1980s; it looked a lot like this. So as I said, RocksDB is super popular, and it's a fork of LevelDB, and this is just a sampling of different companies that are using log-structured storage. Some are based on RocksDB. CockroachDB started off using RocksDB, then threw it away and wrote their own thing in Go called Pebble. Cassandra has its own log-structured storage. TiDB has TiKV, and I think Dgraph uses Badger, but there are a bunch of these log-structured systems. So we already said the reads are slower, but what are some other problems we would have with log-structured storage?
We said reads are slower and compaction is expensive. There's one more core issue with this approach. Yes? It seems less disk-efficient. What do you mean, disk-efficient? Like, you have to store extra copies of every tuple, and when you compact you have to use other parts of the disk to create the compacted picture. So the statement is that it's less efficient because you potentially store multiple copies of a tuple, since there's a bunch of puts for it, and when you do compaction you basically need a staging area, where you have the two or more original files you're trying to compact and then you're writing out a new one. Yes, I would say that's an issue. But related to this point about compaction, what am I actually doing? At some point earlier, I had these log records in memory and I wrote them out to disk. Now, for compaction, what am I doing? Reading them back into memory and writing them back out to disk. This is called write amplification. The issue is that for every logical write my application does, like inserting a tuple or updating a single tuple, how many times am I going to read it and write it back to disk? In a log-structured approach, potentially an unbounded number of times, right? I just keep compacting it over and over again. Obviously it's not infinite in practice, but for a single logical write I could do dozens of physical writes, because I keep bringing it back into memory and writing it back out. In the page-oriented architecture with slotted pages, we don't have this problem: when I update a single tuple, I bring it into memory, update it, and write it back out, and if I never update it again, I never write it out again. We can ignore backups, we can ignore the write-ahead log, we'll get to those later in the semester. But if I'm not using a tuple, I'm not bringing it into memory and writing it back out. In a log-structured approach, you have to, okay? So again, if you want to go beyond this, there's the log-structured merge tree part of the textbook. I think it's a bit overly complicated, because it's really about how you merge the trees; it almost looks like level compaction, but described at the level of the data structure. The key thing I want you to understand is that this is a different approach to storing tuples, through these log records. And we'll see this idea pop up again when we talk about multi-version concurrency control and when we talk about distributed transactions and distributed databases. Yes? The question is, why is level compaction preferred over universal compaction? I don't know that it actually is, and I don't know what the trade-offs are between the two of them, other than I think level compaction is sort of a cleaner architecture: I know that at this level, I'm going to run compaction and the result goes to the next level. Whereas with universal compaction, you have to have some additional logic on the side: okay, I could merge this one and that one, or that one and that other one, which should I do? The RocksDB manual and blog articles have a lot of good information on this; I can post it on Piazza afterwards if you want. Yes? In the previous slide, you said it does periodic compaction; what does that mean? Is it at some fixed frequency, or is it on every read?
Yeah, so the question is, what do I mean by periodic compaction? You would have some kind of trigger or threshold that says it's time to compact, right? If it's level compaction, it could be, okay, I've got three of these files, go ahead and run compaction; or it could be, I've done this many writes, go ahead and compact. And so it can basically be done whenever? It doesn't have to be triggered by a read? Correct, it doesn't have to be triggered by a read, it can be done whenever, yes. But it's like changing the oil in your car: you can go a long time beyond the mileage when you're supposed to do it, but you really shouldn't, right? The best practice is to make sure you do the upkeep you need to do. But of course, if you're running it every second, that's going to make your reads go slower, so it's a balance to figure out when to do it. And again, we'll see this when we talk about Postgres and multi-version concurrency control: there's this thing called autovacuum. When should it run? How long should it run? It depends on the workload and the hardware. So the two approaches we've talked about so far, the log-structured storage and the page-oriented storage, are both tuple-oriented storage. In these approaches, they all rely on indexes to find individual tuples, and those indexes are separate from the core storage of the tables, the tuples themselves, right? In the page-oriented storage there are these pages, and they're unordered; and to get the record ID, which gives us the page number and the slot number, there's some other magical data structure, an index, I said, that's going to get us there. Same thing for the log-structured storage: we need an index that tells us, for a given record ID, where to go find the data we're looking for. And so an alternative approach is: what if we just keep the tuples sorted automatically, by putting them inside the index itself? Then you don't have a separate distinction between here's the log-structured storage and here's the index, or here's the slotted pages and here's the index. It's all just indexes. And this is what's called index-organized storage, or index-organized tables. The idea here is that, assuming we have some tree data structure (it could be a hash table, but for now we'll assume trees), instead of having the leaf nodes in the tree store values that give us the record ID telling us where to go find the page with the data we're looking for, what if the leaf nodes themselves were just the data pages with the tuples? So now when I want to do a lookup, say find me key 102, I follow this index and then, poof, I'm at the bottom, and there's the data that I'm looking for, right? So this is a rough diagram of a B-tree, which we'll cover soon. There are essentially the inner nodes and the leaf nodes. The inner nodes are basically guideposts that tell you, for a given key, should I go left or right. And the leaf nodes themselves will look a lot like slotted pages, but the difference is that we're going to sort the tuples within the page itself based on the key, and not just put them wherever there happened to be free space in the slot array.
So now again, when I want to do a lookup, find me key 102, I traverse the index, I get to a leaf node, I pop over here, and then I do a binary search on the list of keys, and that gives me an offset to go find the data I'm looking for. So this is what MySQL does when you use the InnoDB engine, and this is what you get with SQLite as well. I think I said last class that SQLite has this internal primary key called the row ID, right? We can see it through SQL, but it's slightly different from the primary key you may define in your table itself, because they're using index-organized storage and the row ID is the key you use to look things up inside this index. So for the real, logical primary key, say a customer or a student email address, there's a separate index that maps the email address to the row ID, and then you do a lookup in the primary key index to get the actual tuple that you're looking for. Yes? Are there two keys, one key to get to the page and then one key inside the page? The question is, are there two keys, one to get to the page and one inside the page? No. If I ask SQLite, find me row ID equals one, I just traverse this index; the keys are based on the row ID. I land in the page, and now I need to find, within the page, where row ID one is, and I do my lookup on that. Also, the entire tree is sorted? The entire tree is sorted, yes. It has to be, because it's a balanced tree. So you get this in SQL Server and Oracle, but not by default; you have to tell them you want it. If you use MySQL or SQLite, you get this by default, and I don't think you can turn it off. Yes? Does this approach still suffer from downsides like fragmentation and random IO? Yes. So the question, and it's a good point, is does this approach still suffer from the things we talked about before, like fragmentation and random IO? Well, for fragmentation, yes, it's unavoidable, because in a B+ tree a node only needs to be at least half full, so you're going to have a bunch of leaf nodes that are partially empty. That's unavoidable. In terms of random IO, if the updates go to random locations in the leaf nodes, yes, that's unavoidable too. But if you're just inserting, then, using SQLite's row ID as an example, the row ID is just an internal counter that gets incremented by one for every new tuple: one, two, three, four, five. It's monotonically increasing. So if I just keep inserting into SQLite, I'm just going to keep appending to one side of the tree and not touching the other side. So it's not as bad as doing a bunch of random IO. It's not as good as the sequential IO you get with log-structured storage, but it's better than what you may get with tuple-oriented storage, because at least now the tree is guiding me to only update pages on the one side over here. So there is some benefit to it. Yes? The question is whether the row ID is for finding a page and the declared key is for finding a tuple within the page. What I was trying to say is that in SQLite, there's the primary key index that stores the tuples in the leaf nodes. But instead of the key being whatever primary key you give it in your CREATE TABLE statement, they have an internal row ID as the primary key. So if you do a lookup of WHERE email address equals Andy, there's some other index that gives you the row ID, and then you use that to traverse the primary key index. In MySQL with InnoDB, they don't do that with a row ID.
There, the key will be the real primary key you declare in the CREATE TABLE statement. That'll be the key you use here, and that's the lookup you have over here. Again, the pages look like the slotted architecture, where the keys and offsets grow in one direction and the tuples grow in the other direction. All right, so again, the three major approaches for storing tuples within files are going to be heap storage with slotted pages, log-structured storage with the appends and the SSTables being written to disk, and this index-organized storage. There are others, like ISAM, but those are archaic or legacy and we don't need to worry about them, okay? All right, so now that we can get to a tuple, let's talk about what's actually in it. A tuple is just a sequence of bytes, and it's the job of the database management system, based on the schema it's storing in its catalog (from when you ran CREATE TABLE saying I have these attributes of these types), to interpret what those bytes actually are and how to do whatever operation it is you want on them. If I have two columns and I compute column A plus column B, the database system knows, okay, column A is a 32-bit integer, column B is a 64-bit integer, therefore I need to use the addition operator for those two types. And so you can think of a tuple as just a byte buffer, a char array. There'll be some header that keeps track of things like the size and the nulls (more on that in a second), and then after the header, at the first offset, you'd have the first column, the ID column here. We know the ID is an integer, so that's going to be 32 bits; then after those 32 bits, we'll have the value column, which would be 64 bits. And so internally, if you want to do this in C++, the database system gets the starting location of the tuple, using the slot array or however we jump to that offset in a page. The header is always going to be the same size for every single tuple, so we know how to jump past it, and then we just do simple arithmetic: I know that the offset of the first column I'm looking for is so many bytes after the header, or the second column is so many bytes after that. Varchar is a little more complicated: you've got to store the length of the field, and that could be in the header or inline; for now it doesn't matter. But essentially, what you're doing is taking some address and doing a reinterpret cast, saying the system should treat the value starting at that address as a 32-bit integer or a 64-bit integer or whatever the type is. Yes, yes, who else would be doing it? You're writing SQL, right; there's no reinterpret cast in SQL. This is the implementation, yeah. Again, in this class we're the ones building the system, not the application programmer, right?
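A crude sketch of that interpretation step might look like the following, with a made-up header layout and an assumed two-column schema (a 32-bit id and a 64-bit value); the names and offsets are illustrative only.

```cpp
#include <cstdint>

// Hypothetical fixed-size tuple header; the null bitmap, sizes, etc. live here.
struct TupleHeader {
    uint16_t size;
    uint16_t null_bitmap;
};

// Assumed schema for this example: CREATE TABLE foo (id INT, value BIGINT);
// The catalog tells the DBMS that 'id' starts right after the header and
// that 'value' starts 4 bytes after 'id'.
int64_t EvalIdPlusValue(const uint8_t* tuple) {
    const uint8_t* attrs = tuple + sizeof(TupleHeader);
    int32_t id    = *reinterpret_cast<const int32_t*>(attrs);      // bytes [0, 4) of the data
    int64_t value = *reinterpret_cast<const int64_t*>(attrs + 4);  // bytes [4, 12) of the data
    // The DBMS picks the right addition operator based on the catalog types.
    // (A real system might memcpy into a local variable instead, precisely
    // because of the alignment issue discussed next.)
    return id + value;
}
```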
All right, so someone brought this up last class, which is a good topic and I wanted to include it: one of the things we need to be careful about now, as we start storing these bytes, is dealing with alignment, making sure that the data we're storing aligns with how the CPU actually wants to operate on data. So what do I mean by this? I guess the reason I put "Andy sucks" on the slide is that people take my slides and don't know what "Andy sucks" means, so I can Google it and find who copies them. All right, so we need to make sure that all the attributes are aligned based on the word boundaries of the CPU, or whatever architecture we're running on, to ensure that we don't end up with unexpected behavior when we do operations on this data and that the CPU doesn't have to do extra work. So let's say I have a table with four columns: a 32-bit integer, a 64-bit timestamp, a four-byte char, and then a zip code. And assume that we're going to break up the char array representing this tuple into 64-bit words. Cache lines are 64 bytes, but Postgres aligns based on 64 bits; I don't know what SQLite does, but they're all doing some variation of this. All right, so the first thing we do is store the ID column, which is 32 bits. Then we have this timestamp, the creation date, which is 64 bits, so we store that right after it, and so forth with the other ones. And now, when I want to do some operation on the byte array I've gotten for this tuple, say on the creation date, the problem is that that attribute spans two words, right? Each word is 64 bits, the first ID field was 32 bits, so this 64-bit value spans two consecutive words. Does anybody know what happens in the CPU when you do this, when you try to jump to a memory address to do some operation on something that spans a word boundary? What does x86 do? So x86, Intel, likes to make your life easy and not make you worry about these things, so they'll do the extra reads for you. They want to hide all the complexities of the architecture. So they do extra work, but now this is going to make your database system run slower, because what should have been one register read, one cache line read, to fetch something into a CPU register is now going to be two cache line reads. But again, there's no error; Intel takes care of it for you. Not every architecture will do that, though. Previously, ARM would reject it: it would recognize that you're trying to do a misaligned operation and throw an error, hoping you would catch it. In newer versions, I think since ARMv7, they handle it like Intel does, but it's just slower. This is rare, but what could also happen is that the CPU will do the reads for you, but there's no guarantee the bits are going to land in the right order. So going back here, I have to do two reads, this word and this word, to put together the date attribute, and it may put the last bits in front of the other ones. Seems like a terrible idea, but older CPUs would do that. Of course, that means your program now has random errors and messed-up data, and people are going to notice and complain, and that's bad. That's part of the reason why Intel tries to hide this from you, even though it makes your program run slower. So we need to make sure that none of the attributes in our tuple, in our byte array (because now we're talking about things we've brought into memory), span these word boundaries.
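You can see the same issue in an ordinary C++ struct holding this example table's columns: force a packed layout and the 64-bit timestamp straddles a word boundary, or let the compiler have its way and it inserts padding, which is exactly what the DBMS is about to do. This sketch assumes #pragma pack, which is compiler-specific but supported by GCC, Clang, and MSVC, and the exact column types are my guesses at the slide's example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// The example table's columns, once packed and once naturally padded.
#pragma pack(push, 1)
struct PackedRow {
    int32_t id;          // 4 bytes
    int64_t created_at;  // 8 bytes, misaligned at offset 4
    char    code[4];     // CHAR(4)
    int16_t zipcode;     // 2 bytes
};
#pragma pack(pop)

struct PaddedRow {
    int32_t id;
    int64_t created_at;  // compiler adds 4 padding bytes before this
    char    code[4];
    int16_t zipcode;
};

int main() {
    std::printf("packed: size=%zu, created_at offset=%zu\n",
                sizeof(PackedRow), offsetof(PackedRow, created_at));
    std::printf("padded: size=%zu, created_at offset=%zu\n",
                sizeof(PaddedRow), offsetof(PaddedRow, created_at));
}
```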
So the two approaches to handle this are padding and reordering. With padding, the basic idea is that, as I'm breaking the tuple up into 64-bit words and laying attributes across them, if I recognize that the next attribute doesn't fit within the current word, I just put a bunch of zeros there to pad it out. And then internally, the bookkeeping of the system, when it's interpreting these bytes, knows that, okay, after this ID here, the date attribute starts at the next word, so just ignore those 32 bits of padding. The other approach is reordering. Most systems don't do this automatically; some of the academic systems do (we built one that did this), but most systems will lay out the attributes exactly in the order you declare them and then put padding in to make it work. The idea here is that the logical view of the table, whatever you defined in the CREATE TABLE statement, still presents the attributes in that order, but underneath the covers the system moves things around so it can pack them in better, and then, if necessary, pads bits at the end like that. Yes? How are varchars handled here, especially for reordering and padding? The question is, how are varchars handled in this case, especially for padding? So in the systems that do automatic reordering, you don't store the varchars inline unless they're 64 bits or less. Instead you store a pointer to some other location, these oversized-attribute or overflow pages that are kept separate (we'll see those in a second). So you can do this reordering and not worry about variable-length things. Yeah. The question is, do I need this last bit of padding here? No, yeah. So we can see this in Postgres. Postgres will not do automatic reordering, but it will do padding. And with some simple examples we can see that if we reorder the attributes in the CREATE TABLE statement, or reorder the values in a tuple, we can store things in less space. So, just more Postgres syntax here: Postgres has this nice little function called row, and essentially it takes the comma-separated list of values you give it and makes a row out of it. And then we can cast the values: we can add this double-colon thing at the end of our values, which casts each value to a given type. So I can do a smallint, a regular int, and a bigint: a two-byte int, a four-byte int, and an eight-byte int. Postgres also has a nice little function called pg_column_size that tells you the size of this record, this tuple, in bytes. So in this case here, it's telling me the size of the row I created is 40 bytes. Now I go back to the previous one, the row I had before without casting to those types, and run that, and I get 36, right? Which makes sense, because in the cast version I made that last value a 64-bit integer, whereas in this case Postgres is storing them all as four-byte, 32-bit integers, and then there's a little extra space for padding. Okay? So now let's make a row that has some chars and some two-byte, four-byte, and eight-byte integers, where I'm intermixing the chars with the integers. So now if I ask Postgres, what is the size of this?
Okay, 48 bytes. But if I redo it and put all the integers first, basically reordering as I was showing before, and put the chars all at the end, now I get down to 44 bytes. Again, that's because Postgres has to pad things out to make sure everything is 64-bit aligned, but it doesn't do the reordering for you automatically; you have to do it, you have to tell Postgres this is what you want. Again, there are some systems that can do this for you automatically. Make sense? Again, I like this because, just through SQL commands, we can get a view into the internals of the storage manager of a database system and see how it's actually laying things out. Okay? All right, so we've talked about integers and we've talked about varchars a little bit; now let's talk about the other core SQL data types and how the database system actually represents them. For all the integer data types, these are essentially going to be the same thing you get when you allocate a variable of an integer type, a long int or whatever, in C++. It's going to be the same representation, because that's what the hardware supports. There's a standard representation, two's complement integers, either signed or unsigned; whatever you get in C++ follows that standard, that's what the hardware supports, and that's what you get in SQL. For real numbers, there are the variable-precision floating point types, float or real, and those are defined by the IEEE 754 standard, which specifies how hardware should represent these decimal numbers. But every database system is also going to have what are called fixed-point decimals, numeric or decimal, where each implementation is going to be different per system, and we'll see the performance difference between the two approaches in a second. For varchar, varbinary, text, and blobs, these are typically stored with a header that tells you the length, followed by the bytes of the actual value; or, if the value is too big to be stored inline in the tuple within a page, there'll be a pointer to some other page that has the data you need for this attribute. I said before that if it's 64 bits or less they'll store it inline, and if not they store a pointer; in a disk-based database system the threshold is going to depend on the implementation, and we'll see that in a second. For timestamps, dates, intervals, and so forth, these are typically going to be 32- or 64-bit integers that are just the number of milliseconds or microseconds since the Unix epoch, January 1st, 1970. And if you want to store this with time zone information, typically they'll store it based on UTC, GMT+0, and then store additional metadata saying what time zone you're in, and they can convert it as needed; the system handles that for you. Now, for these integer types up here, because we're relying on the hardware to store the data however the hardware wants to represent it, that typically means you can't just copy the raw database files that you generate from one architecture to another. Think big endian versus little endian: x86 is little endian, POWER and ARM are big endian. You can't take the binary files from a database system on one and drop them onto another, because the byte order is going to be flipped and everything will get messed up.
SQLite avoids this problem because they actually store everything as varchars, and at runtime they cast things based on the type of the attribute; that way they get the portability guarantee that no matter where you drop the file, it'll always be interpreted in the right order. All right, so I'm going to spend a little time talking about floats and reals and numerics, and this will be a good example of where database systems have to do something different and can't just rely on the hardware to do things for them, because we care about the correctness of data and the hardware can't guarantee that for us. All right, so for variable-precision numbers, just like with integers, we're going to rely on the C++ implementation. If you declare float, real, or double in SQL, you get the same float or double you would get in C++. And typically these are going to be faster than the fixed-point numbers, as we'll see in a second, because the hardware natively supports them. But the problem is, they can't guarantee exact values when you start doing larger calculations, because of rounding: you can't represent decimals exactly in hardware. Everyone's probably seen a simple test program like this when you first learned C or C++, right? I have two 32-bit floating point numbers; I want to store 0.1 and 0.2, and then I just want to add them together and see what the output is. In the first version, I just call printf to dump out x plus y, and I get something that looks like 0.3; when I run it, that's what I get, and that looks okay. But if I increase the number of digits I print out in my printf statement, now I end up with something that looks like this, because the hardware can't represent 0.3 exactly; it's going to be some approximation of it. So okay, if I'm writing a simple program like before, where I'm just doing x plus y and printing it out for a human, then sure, maybe that's not a big deal. But if I'm doing complex calculations because I'm trying to land something on the moon or put a satellite in space, or if it's your bank account and you're doing interest calculations, then this rounding error actually matters, and people are going to notice and complain.
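The little test program looks roughly like this; the exact digits you see will depend on the platform and compiler.

```cpp
#include <cstdio>

int main() {
    float x = 0.1f;
    float y = 0.2f;
    std::printf("%f\n", x + y);     // prints 0.300000 -- looks fine
    std::printf("%.20f\n", x + y);  // prints something like 0.30000001192092895508
    return 0;
}
```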
So okay, if I'm doing a simple program like before where I'm just computing x plus y and printing it out for a human, sure, maybe that's not a big deal. But if I'm doing complex calculations because I'm trying to land something on the moon or put a satellite in space, or if it's your bank account and you're doing interest calculations, then this rounding error is actually going to matter, and people are going to notice and complain. So for this reason, database systems also provide fixed-precision numbers, or fixed-point decimals, where the database system does a bunch of extra work to make sure you don't get these rounding errors. You can get this in Java with BigDecimal, and in Python, I think, with the Decimal type as well. All the different systems do something slightly different, but at a high level they store a variable-length representation of the number you're trying to represent, plus additional metadata that tells you where the decimal point is, whether it's signed, whether it's negative, whether it's not-a-number, and so forth. Again, we have to do this extra work because the hardware can't guarantee this for us.

So here's what Postgres does. This is the numeric type in Postgres, and this is actually from the source code itself. You can see they represent a numeric as a struct with a bunch of additional metadata about what the number actually is. But the core thing they're storing internally, along with that metadata on how to interpret the number, is this numeric digit array here, which is just a typedef to an unsigned char up above. So they're literally storing your decimal as a string of digits, and then they use the metadata to figure out how to interpret that string into the correct form. The hardware doesn't know anything about this; this is what the database system has implemented. So we can't just do x plus y like we can in C++; we have to do more complicated arithmetic when we want to use these numeric types in queries. This is just a brief snippet of the addition function for two numerics in Postgres, and as you can see, there are a bunch of checks against that struct: whether it's zero, negative, signed, and so on, and this is just to add two numbers together. It's obviously going to be way more expensive than a single add instruction on the CPU doing x plus y. I don't want to give the impression I'm shaming Postgres; MySQL has the same issue, they're doing the same thing. They store their digits, not as a varchar but as 32-bit integers, they keep additional metadata to track what the numeric value actually is, and just like Postgres they have their own implementation of addition that does all the extra checks. It's not sexy, but you do need it. In the interest of time, if we have time at the end, we can do a demo to show you the performance difference, but it's about 2x: the database-implemented versions of these decimals are about 2x slower than the hardware versions.
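To see why exact decimals cost more than hardware floats, here's a toy sketch of the same idea; this is not Postgres's or MySQL's actual representation, just an illustration where the digits live in a character array plus metadata, and addition walks the digits by hand instead of issuing one CPU instruction.

```c
#include <stdio.h>
#include <string.h>

/* A toy fixed-point decimal -- not any real system's layout, just the idea:
 * digits stored as characters, plus metadata for the decimal point and sign. */
typedef struct {
    char digits[32];   /* "030" with scale 2 means 0.30 */
    int  scale;        /* number of digits after the decimal point */
    int  negative;     /* sign flag */
} decimal_t;

/* Add two non-negative decimals with the same scale and digit count,
 * digit by digit from the right, carrying by hand. No floating point,
 * so no rounding error -- but clearly more work than a single add. */
static decimal_t decimal_add(const decimal_t *a, const decimal_t *b) {
    decimal_t out = { .scale = a->scale, .negative = 0 };
    int n = (int)strlen(a->digits);
    int carry = 0;
    for (int i = n - 1; i >= 0; i--) {
        int sum = (a->digits[i] - '0') + (b->digits[i] - '0') + carry;
        out.digits[i] = (char)('0' + sum % 10);
        carry = sum / 10;
    }
    return out;  /* ignores overflow past the leading digit, for brevity */
}

int main(void) {
    decimal_t x = { "010", 2, 0 };   /* 0.10 */
    decimal_t y = { "020", 2, 0 };   /* 0.20 */
    decimal_t z = decimal_add(&x, &y);
    printf("0.10 + 0.20 = %c.%s\n", z.digits[0], z.digits + 1);  /* 0.30, exactly */
    return 0;
}
```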
All right, nulls. The most common way to handle these is that for every single tuple, the header contains a bitmap that keeps track of which attributes are set to null for that tuple. The size of this bitmap will vary based on the number of attributes you have, and we know whether each one can be null or not because it's in the create table statement. Again, this is an advantage of having a schema instead of just storing JSON or whatever in there: we have a schema, we know whether a column has been declared as not null, and if it has, we don't need to store an entry for it in the bitmap. So this is the most common approach, but it does mean there's some overhead: every single tuple now has to carry this bitmap in its header. A less common approach is to use special values, where you say there's some value within the range of values for each type that, if present, I'll assume means null. So if I want to know whether a 32-bit integer is null, I'll take the 32-bit minimum value, negative whatever it is, and if my value is that, I'll treat it as null. There's one fewer value I can actually store, and now there's a bunch of extra stuff I have to do in the rest of my system: if I'm looking at a 32-bit integer and it is that minimum value, I know it's null, and I can't let people insert it arbitrarily.

The worst choice, and I don't have a screenshot of this, actually I might, the worst choice, which I've only ever seen one system actually do, is that for every single attribute in the tuple, you have a little flag in front of it that tells you whether it's null or not. The reason this is terrible is because of alignment: I can't take a 32-bit integer and put one bit in front of it to say whether this thing is null; I have to store at least another byte. So if I want to stay 64-bit aligned, to store a 32-bit integer with this flag in front of it, I may have to burn another 32 bits just to have one bit saying whether it's null. Do I have a screenshot here? Let me see. The only system I know that actually did this was MemSQL, which is the earlier name of SingleStore. So despite them sponsoring the class, yeah, I don't have the screenshot here; I'll post it on the slides. That was the shittiest idea, one of the shittiest ideas I've ever seen. But they got rid of it, because it's super wasteful, and they use the header bitmap now.
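As a minimal sketch of that common bitmap-in-the-header approach (again, not any particular system's real tuple layout), one bit per attribute in the tuple header is enough:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One bit per attribute in the tuple header -- a sketch of the bitmap
 * approach, not any specific system's tuple format. */
#define NUM_ATTRS 10

typedef struct {
    uint8_t null_bitmap[(NUM_ATTRS + 7) / 8];  /* 2 bytes covers 10 attributes */
    /* ... rest of the tuple header, then the attribute data ... */
} tuple_header_t;

static void set_null(tuple_header_t *h, int attr) {
    h->null_bitmap[attr / 8] |= (uint8_t)(1u << (attr % 8));
}

static int is_null(const tuple_header_t *h, int attr) {
    return (h->null_bitmap[attr / 8] >> (attr % 8)) & 1;
}

int main(void) {
    tuple_header_t h;
    memset(&h, 0, sizeof(h));
    set_null(&h, 3);                                 /* mark attribute 3 as NULL */
    printf("attr 3 null? %d\n", is_null(&h, 3));     /* 1 */
    printf("attr 4 null? %d\n", is_null(&h, 4));     /* 0 */
    return 0;
}
```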
For really large values, variable-length values, most database systems are not going to let you store them directly in the page itself. Again, the page size is defined by the database system, and every page within a table has to have the same size. There's an experimental system out of Germany that can support variable-length pages, but we can ignore that, nobody else does it. So that means at some point the system has to decide: should I store this large varchar, this large string, inline in my tuple's page or not? Every system has a different threshold for when a value can't be stored inline and has to go into what's called an overflow page. In Postgres they call this TOAST, I forget exactly what it stands for, but any attribute larger than two kilobytes gets stored in a separate page, and then in the actual tuple itself they just keep a pointer, a record ID and an offset, that points to where to go find the actual value you're looking for. And you as the SQL programmer don't know this, you don't care: you run a select star query and the database system is responsible for reading the base tuple, recognizing that it points to an overflow page, going and getting that data, and copying it into the buffer it produces as output for you. It hides under the covers that it actually did this. So Postgres's threshold is two kilobytes, and their page size is eight kilobytes; I think you can tune the threshold, but obviously you can't exceed the page size. In MySQL, the overflow threshold is one half the page size. And in SQL Server, the default is that if the value exceeds the size of the page, then it overflows: if the data you're trying to store in this oversized attribute, plus the regular data, together exceed the size of a page, they'll put the oversized data on another page. And you can chain these things together. So if, say, you want to store a one-gigabyte or ten-gigabyte video for whatever reason, your database system could let you do that: since overflow pages all have to be the same fixed size too, each one can hold a range of the data for the attribute along with a pointer to the next page, and you follow that linked list, get all the data, and put it back together.

The last thing you can do is what's called external file storage, or external value storage. This is where the database system is not going to store the large attribute in pages that it manages; it writes it out to the local file system and then internally stores the URI or URL of where that data is located, so that when you query the table and touch that attribute, it goes to the OS, gets that data, copies it into its buffer, and hands it back to you. I think only Oracle and SQL Server can do this: in Oracle they're called BFILEs, and in Microsoft SQL Server it's FILESTREAM. Again, it's just a URI to some data on disk, and the system does the syscall to go get it from the operating system. In Postgres you can do something similar with what are called foreign data wrappers, and there are additional mechanisms to store data in cloud storage, so that within a single SQL interface I can go fetch the data and have it appear as if it was in the table itself. Now, when things are written out to these external files, the database system cannot make any changes to them: it's not going to go and make in-place updates, it can only read them in, and if I delete the tuple that's pointing to the file, there are mechanisms to decide whether I actually want to delete the file as well. The reason you might want to do this is because, as I said, you don't want to store a ten-gigabyte file in your database system for management reasons: it ends up in the log records you have to maintain, and that gets expensive. But also, database management systems typically run on higher-end hardware, and that makes storage expensive. If you use Amazon RDS, I think they charge about 4x more for storage than you pay for EBS, and EBS is in turn more expensive than a locally attached disk. So you don't want to be storing these large files, which may be read-only anyway, directly in the files managed by the database system. Let the OS, on some cheaper storage, handle that for you.
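Pulling the large-value options together, here's a hedged sketch of the decision the storage manager makes for a variable-length value; the 2 KB overflow threshold echoes the Postgres number above, while the external-storage cutoff here is purely hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Sketch thresholds: 2 KB mirrors the Postgres TOAST threshold mentioned
 * above; the 64 MB external cutoff is a made-up illustration, not any
 * system's default. */
#define OVERFLOW_THRESHOLD  ((size_t)2 * 1024)          /* spill to overflow pages     */
#define EXTERNAL_THRESHOLD  ((size_t)64 * 1024 * 1024)  /* hand off to the file system */

typedef enum {
    STORE_INLINE,     /* bytes live right inside the tuple               */
    STORE_OVERFLOW,   /* tuple keeps a pointer to a chain of pages       */
    STORE_EXTERNAL    /* tuple keeps only a URI/path; the OS holds the data */
} storage_class_t;

static storage_class_t classify_value(size_t len) {
    if (len > EXTERNAL_THRESHOLD) return STORE_EXTERNAL;
    if (len > OVERFLOW_THRESHOLD) return STORE_OVERFLOW;
    return STORE_INLINE;
}

int main(void) {
    printf("%d %d %d\n",
           classify_value(100),                          /* STORE_INLINE   */
           classify_value(16 * 1024),                    /* STORE_OVERFLOW */
           classify_value((size_t)1024 * 1024 * 1024));  /* STORE_EXTERNAL */
    return 0;
}
```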
There's a paper, it's about 15 years old now, from Jim Gray, who is one of the godfathers of databases; he invented a lot of the stuff we're talking about this semester. He wrote a paper at Microsoft about whether you should store large data in a database or not, and I think their recommendation was that anything larger than 256 kilobytes should be stored externally. Again, that was a while ago, and I wouldn't go by that number anymore. And actually, we had the guy who invented SQLite come to CMU and give a talk about five years ago, and he mentioned that in his experience it's actually better to store things in SQLite if you're running, say, a phone app. If your application has a bunch of thumbnail images, you're better off storing them in the database system, because when you go to retrieve them your application already has a handle to the database open; the file is already open. So it's much faster to get those thumbnails directly from the database system than to do a bunch of fopens and freads on a bunch of files on disk. So I would say, and this is pure conjecture, 50 megabytes or less is probably okay; anything beyond that, you want to use external storage. And ORMs and application frameworks like Django have mechanisms to handle that for you.

Okay, so next class we'll continue on storage: we'll talk about storage models, columns versus rows, and this will again explain to you why DuckDB is faster than SQLite. And on that note, the DuckDB people sent me stickers; if you want one, come get one. I have pins too. Hit it.