Before we get started: it's the beginning of the semester, so people keep leaving stuff behind. Someone left an umbrella last time, and someone left Zig-Zags. Wait, those are mine. Sorry, give me yours. Okay, thank you. And someone left brass knuckles. You want me to keep these? One, two. All right. Why do I have those? Because I do databases.

For those in the class: homework one and project zero are due this Sunday, and we hope to release project one later today or early next Tuesday. Any questions about homework one or project zero? Yes. Do you go to the website or to Gradescope? We'll fix that. Everything will always be due at 11:59 p.m. on the day it's due. Yes. The question is whether you're allowed to use SQL keywords we haven't covered yet. Yes, go for it. If there's a way to make the SQL better, do it.

One other thing, and I'll post a reminder about this on Piazza: we have a seminar series this semester where we're inviting outside speakers, pretty much all from companies, to come talk about their database systems. The first speaker, on Monday, is a famous database researcher in Germany, probably the best database researcher in all of Europe, if not the world. He built a database system that he sold to Tableau, which is now owned by Salesforce, so if you download Tableau today you get his database system. Now he's building a new one. So Thomas can talk about his systems, and there are a bunch of other speakers coming later this semester. I'll post a reminder on Slack. This is optional for the class; you're not required to attend, but if you want to go beyond the things we're covering this semester, I highly encourage you to check it out.

All right. In the last class we spent time introducing the idea of a disk-oriented database system. We said a disk-oriented system is one where the software is written in such a way that it assumes the primary storage location of the database is on disk. That means at any time you go to read a page, that page may not be in memory; it's on disk, and you have to do something to get it. We'll talk about how we move things back and forth next week, because that is project one. We also spent last class on what the actual files and pages look like on disk, in what I loosely call a page-oriented architecture: the database heap file is broken up into pages, and we use record IDs to get a page number and an offset into a slot array. You look the page up in the page directory, bring it in, and then find the exact location of the data you're looking for. At a high level, the way you insert a new record, a new tuple, into this architecture is that you first look in the page directory.
That directory will have some kind of information saying which pages are free or have empty space. You go retrieve that page: if it's in memory, great, you just use it; otherwise you go to disk and read it. Then you look at the slot array to find the first empty slot where you can put your tuple. If you want to update an existing tuple that already resides in a page, you do the same thing: look in the page directory, find where the page is located, and retrieve it, going to disk if necessary. Then you use the slot array to find the offset within the page that has the tuple you want to modify. If your new tuple fits, you just overwrite the old one. Otherwise, you find new space in the page where it fits, put it there, and delete the old one. And if your updated tuple doesn't fit anywhere in the existing page, you have to go put it in another page. That's the high-level architecture we'll be talking about for most of the semester. It's becoming less common, but this is traditionally how database systems have been built for the last 30, 40, 50 years.

So what are some problems with this approach? We touched on these a little bit last time. Yes: disk retrieval could be expensive. True, but in a disk-oriented architecture that's somewhat unavoidable; if you've got to read it, you've got to read it, nothing you can do. There are really three things. First is fragmentation: we may not be able to fit the entire tuple in the page, and we may have to update multiple pages, which means a bunch of random writes, and that's bad. Second, the page we bring in may have a bunch of tuples we don't care about. We were trying to update one tuple, but the page holds a bunch of others, and we have to read them too, because they're along for the ride: non-volatile storage is block-oriented, so we can't fetch exact byte ranges of the data we need, we have to bring in the entire page. Third, if I'm updating a lot of tuples all at once, those tuples may reside on different pages. If I have to update 20 tuples and they're stored in 20 different pages, I have to read those 20 pages in, update them, and write them back out, and that's much more random I/O. We talked last class about how non-volatile storage is generally much faster if we do things sequentially rather than with random access.

Now let's also consider a different world where we can't do in-place updates on pages at all. The architecture we just described won't work in a world where you can't grab a page, modify it, and overwrite the old page. And there are some cloud platforms where you can only get append-only storage.
HDFS, the Hadoop Distributed File System, is basically like this: you can only append to files, you can't overwrite existing data. You do this for replication and consistency reasons we'll talk about later in the semester, but it makes your life a lot easier when you're building a distributed system, because you don't have to worry about whether everyone will see your update in time. If you write a page and say, all right everyone, here it is, in perpetuity, forever, this is the contents of this page, then you never have to propagate changes to it later on. That makes your distributed system way easier to build. Yes, the question is: can't we just think of updates as a deletion plus an insertion? Yes, and that's where we're going next. That's basically how you handle this.

So for today I'm going to talk mostly about log-structured storage, which is an alternative to the page-oriented architecture we've covered so far. This is not how systems were typically built; Postgres sort of looked like this in the 1980s, and we'll talk about what it actually does in a second. But it's becoming more common now with things like RocksDB and LevelDB; there are now storage managers specifically designed around log-structured storage. What I'll describe is a 50,000-foot view of what log-structured storage looks like. The textbook goes on about log-structured merge trees; we can ignore all that, and I'll explain what that is in a second. I think the book goes into too much detail and it's not a good explanation, so I'll stay at a high level, just to distinguish this approach from the page-oriented one. Then we'll talk about what the data inside tuples actually looks like: how we represent integers, floats, strings, and so forth. And if we have time, we'll finish up with the system catalogs, where we keep track of what our tables actually look like, how many columns they have, and so forth. Okay?

All right. In a log-structured storage manager, there are pretty much two operations we can do on data: put and delete. I'm not calling it insert, I'm saying put, because we can logically overwrite things just by issuing put after put after put. For every operation we do on this storage manager, we have to provide a unique identifier. This won't be the page number and offset we talked about before, because those don't exist in this world; it's some unique identifier that something up above in the system is assigning: this is tuple 1, this is tuple 2, tuple 3, and so forth. You can think of it as a simple counter. And for every single operation that modifies the table, we just insert a new entry into a log file that we maintain in memory, containing the information about what that operation was.
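To make that concrete, here's a minimal sketch of what those log entries and the put/delete interface might look like. All the names here (LogRecord, log_tail, put, del) are made up for illustration; this isn't code from any real system:

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One entry in the in-memory log. A PUT carries the new value;
// a DELETE only needs the key (no old value required).
struct LogRecord {
    enum class Type { PUT, DELETE } type;
    uint64_t key;       // the unique identifier assigned up above
    std::string value;  // empty for a DELETE
};

std::vector<LogRecord> log_tail;  // tail of the log, in memory; append-only

void put(uint64_t key, std::string value) {
    log_tail.push_back({LogRecord::Type::PUT, key, std::move(value)});
}

void del(uint64_t key) {
    log_tail.push_back({LogRecord::Type::DELETE, key, ""});
}

int main() {
    put(103, "a1,b1");  // insert or overwrite tuple 103: same operation
    put(104, "a2,b2");
    del(102);           // a logical delete is just another log entry
}
```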
You can think of the beginning of the file as having the oldest log records; over time we keep appending new things and extending the file, so the newest records are at the end. It goes oldest to newest within the file. Here's the example: if I want to update or modify or insert the tuple with identifier 103 (at this point I don't know or care which), I just append a put record with that identifier and the values of the tuple. For simplicity, say this table is a key-value store where the key is 103 and the value is just some string. I can keep doing this: put 104, delete 102, and so forth, until I fill up the file. For a delete I don't have to include the original value; I just say I'm deleting this tuple. So as the application makes changes to the database, all we're doing is appending to this file. Pretty simple.

Now when a page gets full, I've got to write it out to disk. That's pretty simple too: I have some kind of page directory that says here's the next location on disk where I can write a page, because my in-memory one is full. I write it out, clear out the buffer, and just keep inserting things, over and over again. And once pages are on disk, they're immutable, meaning we never change them again; we never go back and modify a log record at some offset. Once it's on disk, it's persistent forever.

So what's the one obvious big advantage of this approach so far? Yes: you batch everything up into a single write. And it's sequential, too. If I update 20 tuples, I just put those 20 log records in my in-memory page and write it out all at once, versus having to update the 20 different pages where those tuples might live in the other architecture. Yes, the question is whether the write is constant time. For our purposes here, yes. In actuality, there's algorithmic complexity and then there's reality, where constants matter. The algorithms people say, who cares if it's 2x slower; in our world that matters a lot. So again, we just keep doing this: when the in-memory page gets full, we flush it out and keep going. That's how you do updates and deletes, with inserts as puts. Something up above is responsible for keeping track of things like: if I delete a record, did I really have that record? At this lowest level, we don't know and we don't care.

Yes, the question is whether we need to back up the in-memory page. To her point, which we'll cover later: this is the challenge. If you have a notion of a transaction up above, and the application does that first put and wants to be told, yes, I persisted your change, then you want to flush that to disk. So you could keep flushing over and over again, or you could have a separate log. What I'm trying to say is that the page may actually be written to disk before it gets full. Yes.
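Here's a minimal sketch of that write path, assuming a fixed page size and a hypothetical flush_to_disk; again, illustration only, not any real system's code:

```cpp
#include <cstddef>
#include <fstream>
#include <string>

constexpr size_t kPageSize = 4096;  // flush granularity; real systems vary

struct MemPage {
    std::string buf;  // serialized log records, appended in arrival order

    // Returns false when the record won't fit, i.e., the page is full.
    bool append(const std::string& record) {
        if (buf.size() + record.size() > kPageSize) return false;
        buf += record;
        return true;
    }
};

// One big sequential write per page. Once written, the page is immutable:
// we only ever append new pages to the end of the file.
void flush_to_disk(MemPage& page, std::ofstream& file) {
    page.buf.resize(kPageSize, '\0');  // pad out to a full page
    file.write(page.buf.data(), kPageSize);
    page.buf.clear();                  // reuse the buffer for new records
}
```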
Okay, your question is how you would handle this in a distributed setting. Let's ignore all that; let's learn to walk before we run. But it does make things easier distributed. This is basically how something like Paxos works: jumping way ahead, but you want to put 103 in the log, you get everybody to agree, okay, put 103 goes in the log, then you move on to the next one. It's basically the blockchain stuff we talked about before; this is the ledger.

All right, so that's how to do updates, pretty straightforward. Reads are where it gets a bit more tricky, because now what you need to do is go back in time and find the newest log record for the thing you're looking for. Maybe you luck out: you're trying to find record 105, you start at the end of the log and go in reverse order, and lo and behold, the first log record is a put for 105. I know that's the newest version of 105 that could exist in my database (ignoring distributed systems, ignoring multiple threads), so now I know the value of 105. That's fine, but obviously this is going to be slow if I have to scan the in-memory page over and over again; and furthermore, maybe the thing I want isn't in the in-memory page, and now I have to go to disk to find it. That sucks.

So what you do instead, and this is the log-structured merge tree stuff from the textbook that we can ignore for now, is that in addition to maintaining this log, you maintain an index. We'll explain indexes properly in two weeks, but think of the index in the back of a textbook: you look up a keyword, it tells you the page number, and you jump straight to that page instead of scanning page after page. So you have an index that, for each ID, gives you the location of the latest log record for that ID. Now if I want record 102, I look in my index, follow the pointer, and it takes me to the latest log record, so I know what happened to it. Yes.

The question is whether this index is transient or ephemeral, meaning it's just in memory, and if I crash and come back I have to rebuild it. No, because that would be slow: these things get extreme, petabytes, and I don't want to scan a petabyte of data to rebuild my index. It'll be persistent. And yes, correct, we have to make sure the data and the index stay in sync; we'll get to that later. Yes, your question is why a hash table isn't being used as the index. We're getting ahead of ourselves, but the question is why tree data structures (B+ trees, the merge trees, the Cassandra folks are looking at tries, people have used skip lists too) are used instead of hash tables for these lookups. Because with a hash table you can't do range scans: give me all the log records for IDs 101 through 103. You can't do that in a hash table.

Okay, so what's the obvious problem with this? The log is going to grow forever, right? All I'm doing is inserting new entries over and over again, and eventually I'll run out of space.
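Before tackling that, here's a minimal sketch of the read path just described, with an index from key to the newest log record. I'm using a hash map to keep the sketch short; as just discussed, real systems use ordered structures so they can also do range scans. Names are hypothetical:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct LogRecord {
    enum class Type { PUT, DELETE } type;
    uint64_t key;
    std::string value;
};

std::vector<LogRecord> log_tail;
// Maps each key to the offset of its newest log record, so a read
// never has to scan the log in reverse.
std::unordered_map<uint64_t, size_t> index;

void put(uint64_t key, std::string value) {
    index[key] = log_tail.size();  // the record appended below is newest
    log_tail.push_back({LogRecord::Type::PUT, key, std::move(value)});
}

std::optional<std::string> get(uint64_t key) {
    auto it = index.find(key);
    if (it == index.end()) return std::nullopt;  // deleted, or never existed
    const LogRecord& rec = log_tail[it->second];
    if (rec.type == LogRecord::Type::DELETE) return std::nullopt;
    return rec.value;
}
```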
Back to the growth problem: if I have a table with a million tuples but I only ever update one of them, I'll keep generating log records for all those updates, and the log file grows forever. In a page-oriented architecture I don't have that problem, because I just overwrite the old version in place over and over (not exactly how they work, but we can ignore that for now). So all of these log-structured systems have to do what is called compaction. The basic idea is that you take two or more pages and consolidate them. Essentially, you do the same thing as a read: assuming the pages are sorted in temporal order, go back in reverse order (this page is newer than that one) and figure out the latest log record for each tuple; that's the only one you need to store in your new consolidated page. At that point you can throw the old pages away, because the new page has everything you need from the updates that were stored in them.

Yes, the question is whether timestamps are stored across these things. In memory, typically yes; on disk, not really. I'll explain in one second. And his follow-up: if you start compacting, won't the timestamps get messed up? That's exactly the point: you don't actually need the timestamps, which is the next slide, thank you. Once a page is compacted, we don't need the temporal ordering anymore, because we know that for any given tuple ID there is at most one log record for it in the compacted page. So we can get rid of the temporal ordering; we'll keep timestamps around for other bookkeeping, but we don't have to keep records in temporal order anymore. (And yes, that slide should say "does not"; let's fix that now. There we go. Still recording, good.)

Instead, what you can do is sort the records on the identifier, the key. If you do that, you may not need to maintain a fine-grained index with the exact location of everything; you can use a coarser-grained index that jumps to different offsets based on key ranges, and then binary search from there. If you store sorted, compacted pages on disk like this, they're sometimes called SSTables, sorted string tables. The term comes from the original Google Bigtable paper in the 2000s; they coined the phrase, but the approach is used in a bunch of different systems.

Yes. His comment, which is correct, is that I don't even need to retain the "delete 101" and "delete 102" records, because some index up above simply won't have entries for 101 and 102. So when I do a lookup asking whether 101 exists, it comes back with nothing: key does not exist, and therefore I assume it's been deleted. Actually, I don't care whether it was deleted or just never existed. So to your point, yes, I'll fix that too: after compaction we don't need to keep track of deletes.
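Here's a minimal sketch of compacting two sorted runs: walk both in key order, and when a key appears in both, keep only the record from the newer run; tombstones get dropped entirely, per the discussion above. Hypothetical names again:

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct Record {
    uint64_t key;
    bool tombstone;  // true if this record is a delete marker
    std::string value;
};

// `newer` and `older` are each sorted by key (think: two SSTables).
// The output is one sorted run with at most one record per key.
std::vector<Record> compact(const std::vector<Record>& newer,
                            const std::vector<Record>& older) {
    std::vector<Record> out;
    size_t i = 0, j = 0;
    while (i < newer.size() || j < older.size()) {
        Record rec;
        if (j == older.size() ||
            (i < newer.size() && newer[i].key < older[j].key)) {
            rec = newer[i++];
        } else if (i == newer.size() || older[j].key < newer[i].key) {
            rec = older[j++];
        } else {  // same key in both runs: the newer record wins
            rec = newer[i++];
            ++j;
        }
        if (!rec.tombstone) out.push_back(rec);  // drop deletes entirely
    }
    return out;
}
```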
The type of compaction I'm describing here is what the RocksDB world calls universal compaction. The basic idea is that you look at two contiguous files (you write these out as a bunch of different segments), and since they appear one after another in temporal order, you can combine them, delete the old ones, and end up with a single compacted file that is slimmer than the original two. Likewise, you can take two files of different sizes, and as long as they're contiguous in time, do the same compaction.

RocksDB also has an approach called leveled compaction, and this is where the word "level" comes from. RocksDB is Facebook's fork of Google's LevelDB; Google wrote LevelDB, I think, for Bigtable. It was Jeff Dean and the other guy, Sanjay. And actually, when Facebook forked it, one of the first things they did was get rid of mmap, because Google used mmap and Facebook said that's a bad idea. Anyway, the basic idea is that you have different levels holding compacted files of different sizes. As you keep creating compacted files at one level, once you have two or more of them, you compact them into a larger file at the next level. Those merge away, you generate more at the level above, eventually you have enough there to consolidate those, and then enough at the next level to compact below that. The idea is that you keep cascading down, producing larger and larger SSTables that get more and more consolidated over time.

Yes, the question is whether the log files are always on disk or could be partially in memory. If you need to read them, you bring them into memory, but once they're in memory you never write to them. There's still what they call the memtable, the in-memory page where I'm appending new records; but I may also need to retrieve something that's on disk, so there's another memory space where I can bring in the ones that are out on disk. Correct, same for all of these: once it's written to disk, I may read it back into memory, but I'll never insert new things into it.

Yes, the question is: after consolidation, couldn't I have some empty space at the end where a few more records would fit, but I don't use it? Yes, that can happen. But a hardware page is like four kilobytes; if I waste one, two, three kilobytes, who cares? Think big, think petabytes. And no, these are on disk. You have one staging area in memory, the memtable (it's usually larger than 16 kilobytes), where you append, append, append; when it gets full, you write it out. All of this compaction happens on files that are out on disk.

Yes, the question is whether these are still sorted after they've been compacted. Yes; why wouldn't they be? You're just doing a merge sort, like the left/right merge in merge sort: you compare the heads of the two runs, you know which key comes first, and when the same key appears in both, you know which record is newer.
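And a toy sketch of the leveled idea: when a level accumulates enough runs, merge them into one bigger run at the next level, which can cascade further down. The fan-out and names are made up; real RocksDB has many knobs governing when this triggers:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Run = std::vector<int>;  // stand-in for one sorted SSTable of keys

// Merge several sorted runs into a single sorted run. A real system does
// a k-way merge that resolves duplicate keys newest-wins, as sketched above.
Run merge_runs(const std::vector<Run>& runs) {
    Run out;
    for (const Run& r : runs) out.insert(out.end(), r.begin(), r.end());
    std::sort(out.begin(), out.end());
    out.erase(std::unique(out.begin(), out.end()), out.end());
    return out;
}

std::vector<std::vector<Run>> levels(8);  // level 0 holds the newest, smallest runs

void maybe_compact(size_t level) {
    const size_t kFanout = 2;  // compact once a level holds this many runs
    if (level + 1 >= levels.size() || levels[level].size() < kFanout) return;
    levels[level + 1].push_back(merge_runs(levels[level]));
    levels[level].clear();
    maybe_compact(level + 1);  // compaction can cascade down the levels
}
```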
If the same key exists in the older run, I can just drop it, and the newer record goes into the new file. Yes, the question is what the condition is for deciding when it's time to compact. There are a bunch of different thresholds: it could be by size, could be by time; RocksDB has a bunch of flags you can set for these things. Yes, the observation is that these compacted files can get really, really large (the lower levels can be gigabytes or terabytes), which means I can't bring the entire thing into memory. And the answer is yes: usually the header of the file contains index metadata saying, jump to this offset if you're looking for values within this range, so you can bring in partial chunks of the larger compacted file. Yes, the question is whether the compacted files overwrite the previous files. No, they're entirely new files; again, you can't overwrite anything existing in this architecture.

Okay, cool. So as I said before, log-structured storage managers are more common today. The terms log-structured merge tree and log-structured storage, as I'm using them here, are pretty much interchangeable. RocksDB is probably the most famous of these log-structured storage systems, and a bunch of other systems either use RocksDB or have a clone of it written in Go or Rust or whatever. This approach is pretty common these days, and as we'll talk about later, from a distributed-systems perspective it does make things a bit easier. But Postgres, SQLite, Oracle, MySQL: all of these are page-oriented systems.

So what's one obvious downside of this approach? Yes: he says that if you're doing a lot of reads, it could potentially be worse, because you've got to find the records you're looking for, and they may not be close by or contiguous in the compacted files. And here's another. Say I write a record, it sits in the in-memory page, gets written out to disk, and then we never change that tuple again. My log record is always the latest version, so I have to keep it, but it keeps getting compacted over and over: I read it in, compact it, write it back out, read it back in, compact it, write it back out. This is called write amplification: for a single logical write to a record or tuple, how many times am I going to have to physically write it out over and over again? That's one of the big downsides of these log-structured storage systems, in addition to reads potentially being slower. If your disk is fast and cheap, who cares; if you're paying Amazon for your IOPS, you're going to care. Amazon has its own log-structured version of Postgres called Aurora (there's one for MySQL too), and they claim it's faster, though not always. Speaking now from our startup, OtterTune: we see people running regular Postgres versus Aurora Postgres, and they don't realize how expensive switching to Aurora can sometimes be because of this log-structured stuff.

Yes, the question is where the write amplification comes from. Say I insert tuple 103 and never update it; I just insert it once.
That log record goes from the in-memory page out to an on-disk SSTable page, but then over time it gets compacted over and over again. Due to compaction, I have to read it back in, compact it with another page, and write it back out; if you're doing the leveled scheme, I'm reading and writing it repeatedly. That's write amplification. Yes, the statement is that we could use this log-structured approach for any database, relational or NoSQL, and absolutely yes. We're down in the lower bowels of the system; we don't care what the data model is up above. It goes back to the blockchain stuff: all this is, basically, is a key-value store. I do puts and deletes on key-value pairs and get key-value pairs back; it's up to the upper parts of the system to interpret what the bits in the value mean. And yes: page-oriented versus log-structured, correct. TiDB is a relational database that speaks SQL; CockroachDB is a relational database that speaks SQL, and it uses RocksDB. We're in the bottom layer of the system. Okay, cool. Yes, go ahead.

Correct. His point is that, with what I've shown today, if you update a single attribute of a multi-attribute tuple, say a table with 10 columns where I update one column, I still have to write out all 10 columns in the log-structured storage. Yes, and it depends on how it's actually being stored, because in a column store you could get by without that. Typically these log-structured key-value stores don't know what's in the value portion; it's just a bunch of bytes. To do what you're proposing, you'd need additional metadata up above saying, I know I only wrote this attribute, so to reconstruct this tuple I go here for this attribute and there for that one. You could do it, nothing precludes it, but nobody does it that way. At the lowest level it's just a key-value store: put, get, and delete on key-value pairs. That's all we see.

All right, cool. So now let's talk about what's actually inside tuples. This is independent of whether it's log-structured or page-oriented storage. A tuple is essentially a sequence of bytes, which doesn't necessarily have to be contiguous (that alludes to what he was asking about, and I'll talk about it next class, so let's ignore it for now). Given a sequence of bytes, we want to be able to interpret them: we know this is a tuple from this table, this table has this schema with these attributes and these types, so we know how to reason about what those bytes mean. The information about the schema is stored in the catalog, which we'll cover at the end; it basically instructs the database system: at this offset in this tuple you'll find an integer, here you'll find a float, and so forth. The four basic groups of data types most systems support are integers, floats, varchars, and dates/timestamps. Integers (bigint, int, smallint, tinyint) are what are called exact precision numbers, or exact numerics.
For these, think of the C++ implementation of an integer: that's what the database is storing. The bit layout is two's complement (or whatever your hardware uses), so for a 64-bit integer there's an exact specification of what the bits should look like. We can ignore endianness for now; depending on how portable you want your database system to be, you may have to deal with flipped byte orders, but we can ignore that. You can literally think of it like this: if I write a C++ program and declare a variable as int, that's essentially what the database system does when it stores an integer for you, whether it's 8, 16, 32, or 64 bits.

For decimals, there are going to be two variants, and we'll spend some time on this. There are the inexact ones, floats and reals, which are defined by the IEEE 754 standard. But the database system will also maintain what are called fixed-point numerics or fixed-point decimals, whose implementation varies from one system to the next. The idea there is that you want exact values, exact numbers; you use these for money, scientific measurements, and so forth.

For variable-length strings (varchar, varbinary, text, blobs), typically the way this is implemented is that the value begins with a header giving its length, followed by the actual bytes. For strings, text, and characters, the system also has to worry about collations and sorting. You don't store that with the value itself, but there's some setting in the system that determines that A comes before B and B comes before C, because that can change depending on language or locale. There was a bug in Postgres a few years ago where they relied on libc for collation and sorting: a libc upgrade changed how it sorted Unicode, so strings that compared less-than in the old version of Postgres flipped after the upgrade, because they were relying on the operating system to figure these things out. The enterprise systems write their own collation and locale handling themselves, because that's more portable and more exact.

For timestamps and dates, there are a bunch of different implementations: 32 or 64 bits, whether you track milliseconds, microseconds, or nanoseconds since the Unix epoch, how you store time zones, and so forth. We can ignore all that; different systems do different things, and it's not that interesting. What I do want to spend time on is the distinction between floats and reals versus numerics and decimals, because to me this is the interesting part. If you ask the database system for a float, a real, or a double, what you get is the same thing you get when you declare a float or real variable in your C or C++ program. Yes, the question is whether there's a database system that gives you arbitrary precision. Yes, that's what we're getting to: fixed-point precision, on the next slide. Python, I think (we can test this), uses 64 bits for its floats, but its exact numbers are basically doing the same thing we're about to talk about with fixed-point precision.

But again, and this is a difference between database systems and programming languages: if you ask for a float in SQL, in most systems you get the hardware version of a float, which is imprecise. If you ask for a decimal, you get the exact, Python-style version. We're right, they're wrong. So these native types will be really fast, because there's hardware support for them, but you're going to have rounding errors. This is 15-213/15-513 all over again, where you learn that you can't represent exact decimal numbers with floats in C or C++, or in most programming languages. If you take a real C program like the one on the slide, with two floats x and y, and just add them, x plus y, I'd expect to get the obvious answer. If I don't specify how many decimal places I want, I get what looks like the right answer; but under the covers the printf function is doing some rounding for me, hiding what's actually going on. If you print the full precision, all the trailing digits, you get values like this.
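The program on the slide isn't in the transcript, but here's a small C++ reconstruction in the same spirit; the exact values are mine, not the slide's:

```cpp
#include <cstdio>

int main() {
    float x = 0.1;
    float y = 0.2;
    // Default formatting rounds the output, so this looks "right":
    printf("x + y = %f\n", x + y);     // prints 0.300000
    // Ask for full precision and the hardware approximation shows through:
    printf("x + y = %.20f\n", x + y);  // prints 0.30000001192092895508
    return 0;
}
```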
Again, that's because the hardware can't exactly represent these kinds of decimal numbers. This is fine for some things: if I'm recording the temperature in this room, who cares if it's not exactly 72 degrees as measured. But if it's your bank account, you'd care. If you're trying to do something very precise, like shoot a rocket into space, you want exact values. To handle this, database systems support fixed-precision numbers. The idea is a numeric data type with potentially arbitrary precision and scale: you specify what you want, and you use these data types when rounding is unacceptable.

Yes, the question is what happens when you check for equality with these types, in C++ or in the database system. You'd basically get the same behavior you'd get in C or C++. Yes, the question is: if all these database systems use the IEEE 754 standard for floats and reals, why did the example I showed last week have MySQL giving different results than everyone else? That's a MySQL issue; I don't know what that was. Sometimes systems play games: in Oracle, I'm pretty sure (we can test this out), float and real are aliases for their numeric decimal type, so even though you think you're getting the variable-precision hardware float, you actually get their own decimal implementation, which is actually pretty fast. MySQL doesn't do that. We'd have to go back and look at that example; I don't remember what was off, but I'd chalk it up to MySQL rather than the hardware.

All right, so the basic idea here: you can almost think of it as storing a string representation of the decimal. That's not exactly how it works (we'll see what it looks like in a second), but you keep track of where the decimal point is, what comes before it, what comes after it, whether it's negative, and so forth. There's a bunch of metadata to keep track of for every single value in order to get exact precision, and that doesn't come for free.
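As a toy illustration of that metadata idea (digits stored explicitly, plus sign and scale), here's a made-up fixed-point type; real implementations, like the Postgres one we look at next, are far more involved:

```cpp
#include <cstdint>
#include <vector>

// A made-up exact decimal: digits are stored explicitly, so nothing
// ever gets rounded into a hardware float format.
struct FixedDecimal {
    bool negative;
    int32_t scale;                // digits that fall after the decimal point
    std::vector<uint8_t> digits;  // base-10 digits, least significant first
};

// Adding two values, assuming (for brevity) the same sign and scale:
// grade-school addition, one digit at a time. This is why it's much
// slower than a single hardware add instruction.
FixedDecimal add(const FixedDecimal& a, const FixedDecimal& b) {
    FixedDecimal out{a.negative, a.scale, {}};
    uint8_t carry = 0;
    for (size_t i = 0; i < a.digits.size() || i < b.digits.size() || carry; i++) {
        uint8_t sum = carry;
        if (i < a.digits.size()) sum += a.digits[i];
        if (i < b.digits.size()) sum += b.digits[i];
        out.digits.push_back(sum % 10);
        carry = sum / 10;
    }
    return out;
}
```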
The operations will be more expensive, because the database system now does a bunch of additional work itself that it would normally let the hardware do in a single instruction or a handful of cycles. I'm going to skip this slide, but this is something I had a former student work on: our own fixed-point decimal library that tried to get performance close to, or almost as good as, the IEEE 754 hardware types. We never finished it; the student is at Oracle now, doing awesome work there.

All right, let's look at what Postgres does. If I declare a numeric type in Postgres, this struct here is from the source code; it's what they actually store. You can see there's a bunch of stuff: a 32-bit integer for the number of digits, the weight, the scale, the sign, and then this numeric-digits buffer, which is basically a string. So they're essentially storing the decimal literally as a string, with additional metadata that says how to interpret that string, and that's how they guarantee exact precision. But now look at the source code just to add two numbers together. It's not number-plus-number as a single instruction done fast in hardware; it's all this stuff. This is the function to add two numerics, and I'm cutting it off here because it doesn't fit. So it's obviously going to be a lot slower. The SQL standard says you have to provide this accuracy, this precision, but how you implement it is up to the database system. Here's how MySQL does it: they store different things, and the digits themselves are stored as some kind of integer rather than a character string. And just like Postgres, they have a function to add two numbers together that is way more complicated than literally adding two numbers.

To get a sense of how slow this actually is, let's do some quick demos. Any questions before we jump in? Okay, cool. What I've done is create two tables in a bunch of different database systems from a pre-generated file of 10 million rows: a single table with two columns of decimals. There are two versions, one using real types and one using decimals. This is Postgres: I can do select count(*) on the reals table and get 10 million, and the same for decimals. We're going to run a query that scans the entire table. Actually, let me turn a bunch of stuff off first. This is Postgres, so I'll disable parallel queries and the JIT, which is how they do just-in-time compilation for expressions. With all the fancy stuff off, this is raw C performance: how fast can you read the data in memory and add two numbers together. First the decimals: that takes about two seconds, and it's just computing the sum of the first column plus the second column over all 10 million rows. And here it is for reals: less than a second, around 700 milliseconds.
Again, that's because of the function I showed you that has to do a bunch of extra work to perform the addition, checking all the contents of the numeric type. Let's jump to MySQL. MySQL by default is single-threaded per query: unlike Postgres and many other systems, it can't take a single query, break it into pieces, and run them on different threads. One query is one thread, always, in MySQL. Let me double-check I'm not crazy: select count(*) from the decimals table, and from the reals table. Now, on the decimals, MySQL is going to be slower than Postgres: that took four seconds. I can run it again. Wait, why is that shocking? Four seconds versus two seconds is 2x; it's not an order of magnitude. You have high standards, that's good. Now the reals: three seconds. But again, it's the relative difference between reals and decimals that we care about.

Yes, in the back. The statement (and first of all, he doesn't want to be rude, I appreciate that) is that this time is not dominated by the work of actually adding numbers together. Before class I preloaded all the data into memory, so there's no disk overhead. The one-second gap, I'd say, is the difference between interpreting the decimals versus interpreting the reals. To your point, yes, there is all this extra overhead: getting the page from the page table even though it's in memory, interpreting the page, looking at the slot array. All of that sucks, which is why I don't care about the exact numbers; it's really the difference between the two that matters. Both queries do the same work in terms of fetching pages and jumping to where the data is located; that one extra second is the additional cost. Going back to Postgres, we can do explain analyze (I think you have to ask for buffers; I'm forgetting the syntax). There's a way in Postgres to show how much time is spent in different operators within the query plan, and also to determine whether it's ever going to disk. So this run is, actually, sorry, a bit slower; that could be from the analyze part. But again, the performance difference is this overhead of interpreting what the values are and doing that computation.

All right, I just got Oracle running right before class, which is always a bad sign. Let's see if it works. We'll run the same query. In Oracle you have to turn timing on like this, and manually set the terminal to not round the results. There's the first query: that took about 60 milliseconds. And for decimals... oh, I must not have loaded the data. That's my fault. Well, you would actually get the same result, because again, they alias reals to decimals. Last one: SQL Server. The syntax in this select statement is kind of hard to read because of the coloring.
But there's that little option, MAXDOP, the max degree of parallelism; that's basically how you turn off multi-threading in SQL Server. We set it to one, meaning the query runs with a single thread. That took one to two seconds; run it again, same time. Now on the decimals: a little slower, but not by much. That's because in the commercial systems, the decimal implementation is very highly optimized, whereas the Postgres and MySQL ones are correct but, I think, not as optimized as the commercial ones. Okay?

All right, any questions about this? Yes, the question is whether multi-threading will make a big difference here. Probably not for 10 million rows, but we can try it in Postgres: set the parallelism to four and see what it does. Four threads, and it cuts the time roughly in half on this one. Or I can set it back to zero. The follow-up is: if I do multi-threaded, parallel query execution, will the ratio change? It's pretty much CPU-bound, because you have to invoke that function; you have to do CPU work to do the addition. So if that's the long pole in the tent (it's not entirely, because there are other costs), but if you can parallelize it, the ratio should stay about the same. Let's just try it: set it to four, run the decimals. Before, it was about 3x slower; now, in parallel, the ratio is about the same, yeah. All right, no more questions.

So let's deal with large values now. Most database systems will not let you have a tuple that exceeds the size of a single page: if the page is eight kilobytes, I can't have a tuple on that page that's larger than eight kilobytes. But obviously there are tables with attributes that are, say, one megabyte. The way systems handle that is through what are called overflow storage pages. In the base tuple, the one your page number and offset point to, instead of storing the actual one-megabyte chunk, you store a pointer (again, a page number and an offset) to an overflow page that holds the larger data; see the sketch below. Typically these overflow pages are larger than the regular table pages, to accommodate larger attributes. And this is the great thing about the relational model and the independence between the physical and logical layers: you don't know and don't care whether your one-megabyte value is stored inline with the base tuple or in one of these overflow pages. It's the database system's responsibility to go get that overflow page and stitch things together as if they were stored contiguously.

The criteria the different systems use to decide when things go to an overflow page vary. In Postgres, these are the TOAST tables: if the value you're trying to store is larger than about two kilobytes, they shove it off into a separate page.
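Here's a toy illustration of that layout: the base tuple holds a small fixed-size pointer instead of the giant value. The names and struct layout are invented; real schemes like Postgres's TOAST are far more involved:

```cpp
#include <cstdint>
#include <string>

// A page number plus offset, like the record IDs from last class.
struct PageRef {
    uint64_t page_no;
    uint16_t slot;
};

// What a variable-length attribute might look like inside the base tuple.
// Small values stay inline; large ones are swapped for a pointer to an
// overflow page, so the base tuple stays under the page size.
struct VarlenAttr {
    bool inlined;              // false once the value is moved out of line
    uint32_t length;           // total length of the value, in bytes
    PageRef overflow;          // where the bytes live when not inlined
    std::string inline_bytes;  // used only when `inlined` is true
};
```

The DBMS follows the overflow pointer transparently, which is why queries never see the difference.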
In MySQL, it's when the value is larger than half the size of the page; in SQL Server, it's when it's larger than the page itself. So again, you can store larger things out in overflow pages, but the base tuple can't exceed the size of the database page. That means I can't make a table with a billion columns, even if each column is one bit, because that won't fit in a single page; Postgres makes it a hard error, and you can't have more than 1,600 columns per table.

The other big advantage of overflow pages is that you can compress these larger values separately from the base tuples in the original table. Postgres will do this with gzip or zstd: they compress the overflow pages, but not the base table pages. Typically, the larger a piece of data is, the less likely you are to update it. If I store a giant image, like your profile photo on a website, how often do you change that? But you might change your login timestamp all the time, and that's much smaller. So you compress the big things; you can still update them if you want, but typically people don't. And overflow pages can be chained together: if your large chunk of data doesn't fit in one overflow page, you store the rest in another overflow page and link them.

The other thing you can do is what's called external value storage, where you tell the database system that the data for this value or attribute doesn't exist in any pages the system controls: it's a file on disk, or at some network address on a remote server or cloud service. The idea is that the attribute holds some kind of URI or URL pointing to the data's location, and the database system knows how to go open that file, suck it in, and return it to you through the terminal as if it were actually stored in the database. Not all systems support this. Oracle and Microsoft do; they have BFILE and FILESTREAM. In Postgres you can sort of do this with what are called foreign data wrappers, but it's less granular: the whole table has to be external, whereas in Oracle and Microsoft a single column or attribute can be external. In this world, the database system is not allowed to update the file, because it's outside the system's walled garden. There are no guarantees: if the file is being opened and modified by something else outside the system, we don't know about it, they might trash the file, and we can't recover it. So you lose a bunch of the system's guarantees when you do this.

Yes, the question is: what's the difference between these native types that know how to point to external files versus just storing a varchar that holds the URL of where the file is located? With the native type, the database system can go read the data, because it knows it's one of these types, and return it in your select queries as if it were natively stored as a blob. And yes, when I say you can't manipulate it, I mean I can't run an UPDATE on that value through SQL. If it's a local file on disk, I can open a terminal and modify it myself, but that's outside the control of the database; out there, you can do whatever you want.
I'm going to let you take a guess at why you'd actually want to do something like this. What's a good example? Yes, perfect. He said that beyond images, say multimedia files, videos, sound files, you could have different backup and storage strategies for these file types rather than having them embedded with all your database data. Absolutely. And that goes back to what I was saying before: these large files are typically write-once, read-a-lot, and never updated. So maybe you don't want these images in the nightly backups with all your user data; one good backup of them is enough, and maybe you store them on a different server. That's the typical use case for this.

There's a paper that came out years ago by Jim Gray when he was at Microsoft. Jim Gray is a name you'll hear throughout the semester: he won the Turing Award for databases in the 1990s, worked on the System R project, and invented a lot of the stuff we'll be teaching this semester. (Sorry, I'll bleep that. I should do a blooper reel; I've dropped this thing all the time over the years, and it's all on YouTube.) Anyway, he wrote a paper about the conditions under which you'd want to store data as a blob inside the database versus as an external file, as our questioner recommended. The paper is from around 2005, so it's pretty old now, but back then they said: if the data you're trying to store is larger than 256 kilobytes, store it externally; otherwise store it as a blob. I think that number has changed, but certainly for a very large file, you probably don't want it in your database. Think about storing on Amazon: database storage, like RDS, charges something like two to four times more than EBS, even though it's the same backing storage system. So you pay a premium for keeping large files in the database, and as he said, you want separate backup and storage policies for these larger things anyway. In some cases I think 256 kilobytes is actually too small now; three or four megabytes is probably the right number. A few years ago we had the inventor of SQLite come give a talk, and he described how, on cell phones, it's actually faster to store thumbnails inside SQLite than to scatter them across the file system, because you don't pay the fopen syscall penalty to open each thumbnail: the database file is already open, and you can retrieve them much more quickly. So in that case, for those thumbnails, putting them in the database was faster than keeping them external. But anything beyond, say, 10 megabytes, you probably don't want to do this. Okay?

All right, so the last thing to talk about is the system catalogs. We're not going to go into the details of how this is actually implemented, but basically this is the metadata the system maintains about what your database actually looks like.
All the things we talked about before (tables, columns, views, indexes, ACLs, permissions) are stored in the database itself, and you can almost think of the catalog as its own database inside the database. The correct way to implement this is for the system to store the catalog in its own tables: it's just another table. Now, the code that actually builds the database system is written in Rust or C++ or Go or whatever, and you don't want that code issuing SQL queries to get metadata about the tables, because that's a chicken-and-egg problem: to run a SQL query I have to look things up in the catalog, but to look things up in the catalog I'd have to run a query, which needs the catalog. So typically the way this is implemented is that you have wrapper code inside the database itself that can access the catalog tables directly. But the catalogs themselves are stored as tables, because that way you get all the durability and safety guarantees for your catalog data that you get for regular data. And of course there's the question of how you bootstrap the system with catalogs when nothing exists yet; all of that is done through procedural code.

This used to be a good example of what separated the commercial enterprise systems from the open-source ones: the enterprise systems had really robust catalogs that were transactional, held all the safety guarantees, and could do schema changes without blocking queries, and so forth. That has changed a lot in recent years; MySQL and Postgres have gotten much better. (SQLite has a single writer thread, so we can ignore it.) The open-source systems have really caught up with the commercial ones. I don't have an example I can show for MySQL because this was the old version, 5.7, but it used to be that if I went to the directory where MySQL stores all its data and created a subdirectory inside it, MySQL would think it had a new database, because all this metadata was being stored outside the system. They fixed all that when they moved to MySQL 8.

The standard way to get information about what's in your catalog is through what is called the information schema. This is defined in the SQL standard: a set of read-only views that give you a relational view of what's in the database. But writing queries against it is sometimes awkward, so the different database systems all have their own shortcuts to retrieve this information. To list the tables in your database, you'd write a SQL query against the information schema, but in Postgres, MySQL, and SQLite you can use shortcuts in their terminals. Basically, if I type \d in Postgres, the psql client converts that into a select statement that knows how to retrieve the catalog information. I can show that quickly: if I call \d, I get this, and I can basically do the same thing with the raw query. It's a lot of stuff and you can't see it all, but this is basically what the catalog looks like.
And you can see here the tables we created; the catalog is just another table. Actually, they have aliases too: pg_tables is a shorthand view of this, showing all the tables we have. And if I run explain on it, you can see it really is just a query: explain shows the query plan, and it's querying pg_class and pg_namespace, scoped to a schema. \d is basically doing the same thing. This is how you'd get all the attributes within a table, and every system has its own version of this. All right, any questions about the catalog? Yes, the question is why this isn't in a human-readable format. There are just a lot of columns, so it overflows the screen. There's a way in the psql terminal to get a more vertical view; I turned that off because I don't like it. But if you scroll over, it's all there. It's just a lot.

All right, any other questions? Cool. So: the log-structured approach we talked about today is an alternative to the page-oriented architecture. For our purposes going forward, we don't care what's underneath. We'll talk about row stores versus column stores next class; you typically don't build a column store on log-structured storage, although you could have the write portion of a column store be log-structured. You'll see this idea recur throughout the semester: an append-only structure where we put log records in and apply the updates later, through compaction or merging. And the notion of multiple versions of a tuple existing as different log records will show up again when we talk about multi-version concurrency control; that's basically how it works, and it's basically what Postgres did back in the 1980s. We'll also see that the storage manager won't be entirely independent of the rest of the system; there are some assumptions we'll have to make about what we know is underneath. But for the most part, we should be okay going forward. Okay? All right guys, hit it. Super snakes.