Let's get started. So DJ Drop Tables is not here today. He should be here next week. He's a bit of a grifter, so it's his girlfriend's birthday this week, and he blew all his money last week in Vegas. So he lied and told her that he's going to visit his family for Thanksgiving. That's why he's gone this week, but it's really just a ruse to get out of buying her a present. So he'll be back next week. All right, so again, just a reminder for what's on the docket for you guys. Homework 5 is due next week. Project 4 will be due after that. The extra credit for the feedback submission was due last night. Some people emailed about an error because of some funky Unicode issue, but everybody else should have submitted. The final exam will be on December 9. Actually, they announced the room, but I haven't looked to see where it is. Has anybody looked? It doesn't matter, we'll figure it out. And then for next week, again, the Oracle talk will be on Monday. The systems potpourri and the final review will be on the Wednesday. So we'll do the final review in the beginning, and then we'll cover three or four different database systems that you guys voted on, and I'll just give you a 10-minute overview of what's interesting about each one, why it's good, why it's bad, and so forth. So if you haven't voted yet, please go in here and vote. The numbers look similar to previous years. And so again, you can go look at the videos from last year to see what we're going to discuss. But go vote first, then go look it up. And as I said, on Monday and Tuesday next week, our friends at Oracle are coming, and there are going to be three different talks. So in Monday's class in here, we'll have Shashank come and talk about what his group is building. Again, this is not a recruiting talk. This is a scientific or systems discussion of what they're building. And basically, he'll use all the same keywords and buzzwords that I've used throughout the entire semester so you realize that I'm not just making shit up. So he'll come and talk about their system. There's a more recruiting-oriented systems talk that will be on Monday at 4:30 in Gates, and at that one there'll be pizza. And then a pure research talk will be on Tuesday, December 3, the following day, at 12 PM on the CIC floor. And Hideaki Kimura, as I mentioned on Piazza, I went to grad school with him. He's probably one of the most hardcore systems programmers I've ever met in my life. He is also the most stubborn man that I've ever met in my life. The first summer we were building H-Store, we had to build our expression trees, like for the WHERE clauses. I had my way of doing it, he had his way. We literally had a four-hour debate in my office just yelling at each other about what to do. And he just broke me down, so we did it his way. He was wrong, and we removed what he did later on. My code's still there, so I was right. But in general, he's awesome. So he's just going to talk about some of the non-volatile memory stuff that they've been working on for the system, because he's part of Shashank's group. So again, if you're interested in doing internships or full-time positions with them, he will come tell you the kind of things you could be working on. And I'll send reminders about all these things on Piazza. Any questions about any of these? Okay, and then I'll also make arrangements if you want to meet these guys one-on-one to talk about internships and full-time positions as well.
Well, I'll send that email. Okay, so last class was the second lecture we had on distributed databases. The first class was just defining what a distributed database system looks like from an architecture standpoint: where is the data relative to where the compute that's actually going to run on it is? And then last class was all about taking these distributed databases when you want to do transactions, and making sure that we provide all the ACID guarantees that we want on a single-node system, but now doing this in a distributed environment. And we spent most of our time talking about the atomic commit protocol and replication, because again, that's the hard part of actually having a distributed database when things are split up and now you have transactions doing writes on a bunch of different nodes at the same time. How do you keep it all in sync? How do you avoid losing data? How do you have reads not see stale data, if you care about those things? So for today's class, we're going to leave all the transaction stuff that we talked about before, and now I'll start talking about how to do analytics. We're not doing a lot of writes, we're not doing transactions, we're mostly going to be doing reads. But the amount of data we're going to be reading is going to be much larger than what the OLTP transactions from last class were touching. So I just want to show what a typical setup would be for an analytical database. This does not necessarily have to be a distributed database, but this is a common arrangement. So on the front end, you have your OLTP databases. This is where you're ingesting new information from the outside world. And this could be distributed, could be single node, it doesn't matter. And then you want to get all the data from these front end data silos into your back end analytical database, sometimes called a data warehouse. So there's this process called ETL, or extract, transform, and load. There are tools you can buy that do this, or you can just write Python scripts, or do whatever you do manually. But the idea is to take all the data from these different front end OLTP databases and put it into a universal schema inside your data warehouse. So for example, let's say your front end applications all store customer names differently: this one has fname for the first name, that one has first_name. You can't just chuck that all into a single database because the database doesn't know that fname equals first_name. So this is where you do that cleanup, here in the ETL step. So this is a very common setup, right? If you're gonna build a startup, you typically start with this because you need to get data first. And then once you have a lot of data, then you want to put it into your back end data warehouse. And the idea is that we don't want to do analytics on the front end because that's gonna slow us down, slow down our transactions. So you put this into a back end data warehouse. So this is what we're focusing on today. How do we actually do this? So I use the term OLAP, online analytical processing. Sometimes you'll see these types of systems referred to as a data warehouse, or more traditionally, sometimes they're called decision support systems, DSS.
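As a rough sketch of that ETL transform step, here's what loading two differently named front end customer tables into one universal warehouse schema might look like in SQL. The table and column names (app1_customers, app2_customers, customer_dim) are hypothetical, just to illustrate the fname versus first_name cleanup:

    -- Hypothetical front-end schemas:
    --   app1_customers(id, fname, lname)
    --   app2_customers(id, first_name, last_name)
    -- Universal warehouse schema: customer_dim(customer_id, first_name, last_name)
    INSERT INTO customer_dim (customer_id, first_name, last_name)
    SELECT id, fname, lname FROM app1_customers
    UNION ALL
    SELECT id, first_name, last_name FROM app2_customers;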
And again, the idea is that these are applications we're gonna write on our back end data warehouse that are gonna analyze the data that we've collected from the OLTP side, extrapolate new information, and then guide our decision making processes for the business, for the organization, or for the OLTP applications. So a very common setup. The one example I always like to use is Zynga. Zynga has all their stupid Farmville games up here. These are all OLTP databases, because every click in the game is another transaction or another update to the database. But then they're gonna shove all those clicks into the back end data warehouse and do some analysis on this, whether with decision support systems or machine learning, to try to figure out some bullshit about how to make you buy more stuff on the front end, right? Like the one example I always heard was the Candy Crush game. So if you play the Candy Crush game, you get all these updates on the OLTP side, and then say you get a hard puzzle and you can't beat it, right? And so you put the game down. So they're gonna collect all this clickstream data to see how you played the game, and then they'll learn that, oh, if you come back after not having played the game for like a day because you got frustrated, they make sure they give you an easy puzzle that you can solve right away so you get hooked again, right? And then you keep playing it, because they know if they give you a hard puzzle you'll get frustrated and never come back ever again, right? So that extrapolating of information, that, oh, this is how I give the person a game that they'll be able to beat, you figure that out on this side and then you push the update to the OLTP side. So in general, at a high level, there's two different ways you can model a database application on a back end data warehouse or analytical database. You could take the standard schema that your application would have. Typically it's a tree schema, because you have this hierarchy: I have customers, customers have orders, orders have items. But those schemas can be quite messy and they're not gonna be very efficient for analytical queries. So instead you would model your database using either what's called a star schema or a snowflake schema, and sometimes you'll see analytical databases that say, hey, we only support star schemas, you can't do a snowflake schema. One is basically a subset of the other, but you'll see, let's discuss why this actually might be better for some analytics. So a very common arrangement would be something like this. This is a star schema, and you have two types of tables in a star schema: fact tables and dimension tables. The middle of the star is the fact table. Think of this as, whatever the event is you're trying to model, this is where you store all the occurrences of it. So if you're like Walmart and your data warehouse keeps track of every single item that anyone has ever bought at any Walmart store at any given time, every item getting scanned at the checkout counter is another event that we put in our fact table. So this thing is gonna be massive, like hundreds of billions of records. Same thing with Amazon: every single item that anyone's ever bought on Amazon goes in your fact table. But we're not actually gonna store any information about what these items are that someone bought; instead we're gonna have foreign key references to our outer dimension tables, which maintain that additional information.
Like I said, this thing is gonna be massive. We want it to be as trim as possible because we're gonna have billions of rows. So we put all the actual metadata in the dimension tables. But in a star schema you can only have one level of dimension tables out from the center of the star, right? So there's no additional tables over here that these guys can join with, right? In this case here I have a category name and a category description. So I could extract that out, normalize it, and store it in another dimension table with another foreign key going from this to this, but under a star schema you're not allowed to do that. Can anyone take a guess why? Yes, exactly. So you said the time it takes to traverse or join those different tables is gonna be expensive, because again, we're not doing "find all Andy's items". We're saying find all the items that the state of Pennsylvania has bought within this date range. It's gonna be hundreds of millions of rows. So we wanna avoid doing as many joins as possible. So a snowflake schema is where you're allowed to have multiple levels of dimension tables, right? So again, going back up here, I can now break out my category information. I have a foreign key in the product dimension table, and then I have what is called a lookup table, which is a table beyond the dimension table where I have that normalized information pulled out. And as I said, some database systems, some OLAP systems, will say explicitly you can't have lookup tables, you can't have multiple levels beyond the first dimension table. So the main two issues in this world, as she sort of said, one of them is performance. The other one's actually gonna be the integrity of the data that we're storing. Again, going back here, if I collapse down my lookup table into a single dimension table, well, I'm gonna be repeating the category name over and over again. So now if the category name changes, I need to make sure in my application code that I go update all the records that have that same category name so that everything's in sync. If I'm normalized out like this in the snowflake schema, I don't have that problem because I only have one entry for the category, right? So if you do a star schema, then there's extra work that you have to do in your application to make sure that your denormalized tables are consistent. You're also potentially storing more redundant information that's unnecessary, and so the size of your data could be larger. It's not that big of a deal, because again the fact table is the main juggernaut in this model and we have ways to compress that down. So the storage overhead of denormalized tables is not that big of a deal; it's the integrity stuff that's more important. And then as she said, the complexity of the queries with a star schema is gonna be significantly less than the complexity of the queries with the snowflake schema, because there's only so many joins I could possibly do. I only have to go one level deep. And as we talked about when we talked about query optimization, having more tables to join against just makes everything much harder when we have to figure out the join ordering, so by restricting ourselves to a star schema we may end up finding the optimal plan, whereas with the snowflake schema we may not be able to. So again, when you go out into the real world, if you come across data warehouses you're likely to see either of these two approaches, because they're better for analytics.
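To make the star versus snowflake distinction concrete, here is a minimal SQL sketch. The table and column names are hypothetical, not the exact schema on the slides; the point is just where the category information lives:

    -- Star schema: one level of dimension tables around the fact table.
    -- The category name/description are denormalized into the product dimension.
    CREATE TABLE product_dim (
        product_id    INT PRIMARY KEY,
        product_name  VARCHAR(100),
        category_name VARCHAR(100),   -- repeated for every product in the category
        category_desc VARCHAR(500)
    );
    CREATE TABLE sales_fact (
        product_id INT REFERENCES product_dim (product_id),
        store_id   INT,               -- foreign key to another dimension table
        sold_at    TIMESTAMP,
        price      DECIMAL(10,2)      -- keep the fact table as trim as possible
    );
    -- Snowflake schema: pull the category out into a lookup table instead,
    -- so the name is stored once and product_dim just holds a category_id.
    CREATE TABLE category_lookup (
        category_id   INT PRIMARY KEY,
        category_name VARCHAR(100),
        category_desc VARCHAR(500)
    );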
And this distinction between the dimension table and the fact table will come up when we start talking about doing joins, because we need to decide how we're gonna move data between nodes if we can't do the joins locally at our own machines, okay? All right, so let's talk about the problem we're trying to solve today, which I already briefly mentioned just now. So a query shows up at our master node that wants to join R and S, and let's say that the two tables R and S are just split across the different partitions on the different nodes uniformly. So what's the stupidest way for me to execute this query in this setup? Yes? Pull all the tables back? Exactly, right. The dumbest thing I could do, and it would work, it would still be correct, is that I know I need to touch data at partitions two, three, and four, so I just copy them in their entirety up to the node where partition one is. Now all my data is local, I do my join and spit back the result. Why is that stupid? Yes? You may not need all of the data. He says you may not need all the data. Potentially yes, but that's not the main reason it's stupid. Yes? You're not doing any computation on any of the other nodes? Absolutely, you're not doing any computation on any other nodes. This defeats the whole purpose of having a distributed database. Think about what I did. I bought a bunch of machines, I partitioned my table across these machines, but then my query shows up and I just copy things back over to a single machine anyway. So I would have been better off just buying this one machine and doing the join there. So this is the problem we're trying to solve today. We're trying to say, all right, if a query shows up and it wants to do a join, and we need to access data that's split across multiple resources, how do we actually do this efficiently? What do we need to be mindful of when we decide whether we move data or copy data, or push the query or pull the results? These are all the design choices we have to deal with today. The naive example also assumes that my database can fit on a single node. Again, think of the Walmart database: it's whatever, hundreds of petabytes, it's not gonna fit on a single machine. So we're gonna have to run in a distributed environment in order to get any work done. OLTP is not an issue, because again, in OLTP I'm only touching Andy's data. Andy's data is maybe a couple hundred kilobytes or megabytes, right? I can easily fit that on a single box and do all my transactions on that one box. In analytics, I'm trying to touch the entire table or large portions of the table; I'm not gonna be able to do everything on a single node. So today our focus is gonna be on, first, discussing the execution models we have in a distributed database system for analytics. We've already briefly touched on this a little bit in the first lecture, but now we'll talk about it more concretely and see why it matters. Then we'll briefly talk about the issues of doing query planning. Then we'll talk about how we do distributed joins. The spoiler is that all the algorithms we talked about earlier in the semester are still germane; we still do them. There's no magic distributed join algorithm that doesn't exist on a single node. The question is again, where do we move the data, or where do we move the computation? And then I'll finish up with a quick smattering of what the state of the art of cloud databases looks like in the world today.
And just sort of, again, you'll see how just because it's in the cloud doesn't mean we don't still care about all the things we talked about the entire semester. Okay, all right, so as I said, I briefly touched on this first issue, the execution model, when we did the introduction to distributed databases, but the two approaches we have to execute a query are either push or pull. So with a push, the idea is that we wanna send the query, or a portion of the query like a plan fragment, to the location where the data is located, then run that portion of the query on that local data, and then just send back the result to whoever asked for it, like the home node or the base node that's coordinating the query. The idea here is that, just as we would do projection pushdown or predicate pushdown on a single node system, we wanna filter out and remove as much useless data as early as possible before we send anything over the network. So if we can send a portion of the query to where the data is located, crunch on it there, do some early filtering, then when we transmit the data back to another node we're not just blindly copying all the data that the node has, we're limiting it to the subset that we actually need for this particular query. Now, we'll see in a second that the lines get blurred in a shared disk system as to whether you're doing one or the other, because in a shared disk system, in general the disk can't do any filtering; it's just reading a single page, it can't do anything special. So for shared disk systems, again, the lines are blurred. The other approach is to pull the data to the query, which is what a shared disk system normally would do, where we grab whatever data we need for this query. We recognize, based on the query plan, these are the pages I wanna access, we pull the data out, make a copy of it, transmit it over the network, bring it to the node where our query is located, and then we can process it and crunch on it. And of course, again, the issue here is that in an analytical system, the size of the data relative to the size of the query is gonna be quite stark, quite different. Like a query, say it's just a SQL query, is a couple kilobytes. The most I've ever heard from like Google or Facebook is that sometimes they have queries where the SQL text itself is like 10 megabytes, right? That's a large-ass query, but still, compared to reading a terabyte of data, it's nothing. So the thing we have to be mindful of, when deciding whether we wanna do one versus the other, is where the data is located, how I can access it, and whether my query is gonna be larger or smaller in size to transmit over the network than the data I'm trying to access. In analytics, it's always the case that the data you're trying to crunch on is larger. So let's see this in the context of a shared nothing system. Shared nothing systems typically push the query to the node where the data is. So my query shows up at this node here, and it's in charge of coordinating with the other node to process the join. So we do our query planning on, say, this node, and we recognize, oh, we need to access the ID field of the R and S tables. And I know that this node down here has a partition that I wanna access. So I'll just send the query plan fragment down to this guy and say, hey, I know you have the data between 101 and 200, compute this join and then send me back the result.
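Roughly, the plan fragment pushed down to that node behaves like a query scoped to its own partition. This is only a sketch using the R and S tables from the example and a hypothetical 101-200 range:

    -- Fragment executed locally on the node that stores ids 101-200:
    SELECT R.id, S.value
      FROM R JOIN S ON R.id = S.id
     WHERE R.id BETWEEN 101 AND 200;
    -- Only this joined, filtered result gets shipped back to the
    -- coordinating node, not the raw pages of R and S.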
And the node up above is responsible for taking the result that this guy sent, plus its own local result, combining them together to produce a single result back to the application. Again, we have this transparency guarantee where the application doesn't know and doesn't care where the query actually executed, as long as it gets back a single result. So in a shared nothing system, it's quite obvious that you'd wanna push the query to the data, because it makes no sense for us to have to copy this data up and process it there. It's just better to send the query down and crunch on the data locally. Beyond this example, we'll see some scenarios where maybe you do wanna do some copying. Yes? If S is distributed, then what? Your question is, if S is distributed as well, then what? Well, in this case here it is distributed; I'm just saying it's partitioned. We'll come to your question later on. Because you may need to join it with some other partition also, right? Correct, we'll come to that later. In this simple example, we'll assume that R and S are both partitioned on ID, and the ranges of the values at this partition are exactly the same. So therefore, I know when I'm doing a join at this node, I don't need to look at any other partition in my cluster. Everything I need to compute the join for a single tuple in R is located here. If you're joining on the partition key? Yeah, exactly right, I'm joining on the partition key. We'll see the other scenarios in a second. Yes? Question? Oh, I'm sorry. The question is, in my example here, the tables are partitioned on my join key, which is the best case scenario. How often does that happen? Usually for the fact table, pretty often. I would say it's pretty common. I mean, think about it in a real system. Say I partitioned on user IDs, and the fact table says, this person bought this item. Actually, that's a bad example. So say I want to do a join between, here's the user IDs and here's all the items that they bought, and then here's the session of how they visited the webpage, to figure out what items they looked at before they bought something. And you want to try to learn, like, oh, if they go look at this bunch of items, then they're more likely to buy something. And so in that case, the user ID would be the partition key and that would work out nicely. So it's not always the case, but I would say it's common enough. But we'll see in a second how we handle the case where it's not like this. The main thing I'm emphasizing here is that where we can, we push the query to the data, and that's gonna be better for us because this data is gonna be larger than this query here. Plus we also get the additional benefit that we parallelize the computation, because now the top node doesn't have to do all of the join. This guy can do a portion of the join and this guy can do the other portion of the join, and we just combine them together. And the combine part is cheap relative to the join cost. So the other approach is to pull the data to the query. And again, this is where I'm saying that in a shared disk system, the lines are blurred. So we send our query to this node here. This node would recognize that we've logically partitioned the data such that this node down here is responsible for this range, and that node is responsible for that range.
So then they go to the shared disk, access those pages, pull back the pages they need, then they compute their local join, and then this guy ships the result up to the other one. So again, this step here would be pull the data to the query, because I'm just blindly asking for pages where the data is located and I have to copy it over here. But then certainly this part here was push the query to the data, because this guy was able to compute the join locally and send the result up. So would you say that this is a pull versus a push? Again, it's both. I don't think I have a slide about this, but the cloud vendors are recognizing that having dumb disks, if you wanna call it that, in shared disk databases is a bad idea, because again, I'm just always copying this page without checking to see whether I actually need any of the data on the page. I just know that I think I need to look at it, so I say, go give me this page. On Amazon S3, they now have a filter capability where you actually can do predicate pushdown, so when you say go get this page, you can also say, oh, but also check this filter for me to see what actually matches inside the page, and only send me back what does. So again, that's pushing the query to the data. So again, the lines are blurred. Okay. One thing to talk about, though: last class we made a big deal about how, if we have a transaction commit and it touches multiple nodes, I wanna make sure that all the nodes agree that this transaction is allowed to commit before we tell the outside world that it committed. Right, because we're modifying the database, we don't wanna lose any changes. But in an OLAP database, we're just doing read-only queries. So we're not really worried about updating the state of the database in multiple locations and keeping those in sync. But now we have to deal with the case where a node could crash while we're processing the query, and we have to figure out how to handle that. So the important thing to understand is that when we go request data from another machine or the shared disk, and we receive the copy of that data, it's gonna get stored in the buffer pool just like any other data we read from disk. But it's stored in a temporary buffer space, meaning it could get paged out to disk because we run out of space, but if we crash or restart the system and come back, all that temporary disk space just gets blown away, because the query that was running that needed that data is now gone because I crashed, and so I don't need to persist anything. So these OLAP queries for really large databases can take a long time. It's not unheard of to have queries that take hours. I've also heard of queries that take days. For column stores it's gotten much better, but back in the old days it was certainly very common for a query to take days, like you run your reports once a month and it takes a week to run. So if we have this long-running query and our node crashes, what should we do? It's not a correctness issue because we weren't updating anything, but ideally it'd be nice if we didn't have to restart the whole thing from scratch if it's gonna take days. So the design decision that most shared nothing distributed databases make is that they're not gonna actually support query fault tolerance.
Meaning if a node crashes while your long-running query is running, unless there's a replica that has the data you need and can fill in the missing piece depending on where you are in the query plan, they're just gonna abort the query, throw back an error, and tell you to restart it. Can anyone take a guess why they make this decision? It's expensive, right? I'm running a long-running query, it starts generating a bunch of intermediate results, and now I gotta make sure I flush them all out to disk and make sure they're durable across replicas in case there's a crash. That's gonna make your query run a lot slower, because the disk is super slow. So in the traditional data warehouse world, they would say, oh, you just paid me $10 million for this very expensive database system software. I assume you're not running on machines you found at Goodwill, right? I'm assuming you're running on high-end hardware. So the likelihood that that high-end hardware is gonna have a catastrophic failure where it crashes in the middle of a query is gonna be low, so therefore I'd rather not pay the penalty of taking snapshots or writing intermediate results out to disk as I run. So in general, in most OLAP systems, if a node crashes during query execution, and let's say it's not just reading from a disk that you have copies of, like it's in the middle part of the query where you have intermediate results, then the query fails. When Hadoop came out in the mid-2000s, they were actually taking snapshots; they were writing the output of every step of a query plan out to disk, but that made it super slow, right? Because back in the old days, and even still now, when Google was building MapReduce, they talked about running on cheap hardware, where if you're running a thousand-node cluster and your query's running across a thousand nodes, the likelihood that some node's gonna crash during that time is pretty high. So in their world, they would rather take the snapshots to avoid this. Yes? What if one node fails, I mean, a node crashes and no longer responds, or what if a transaction fails? No, okay, we're not talking about transactions here. No transactions. Read-only queries, analytical queries, right? In last class, we were talking about distributed databases for transactions. If a node fails while we're executing a transaction, we just abort that transaction, because who cares? That transaction ran for 50 milliseconds. Who cares? If now I have an analytical query that's gonna take days to run, if it takes five days to run and I crash on the fourth day, then I just wasted four days of work. Some people would actually get pissed about that, because now you gotta start it up and run it all over again. And so you could take a snapshot. Some systems allow you to take snapshots as you run the query: as the output of one operator gets fed into the next operator, I'll write all that out to disk as well, so that if I crash at that point, I can bring it back. There's ways to turn that on, but by default, most systems do not, because they don't wanna pay that performance penalty overhead. So again, on a shared disk system, it's easy to visualize, and we're doing that same join. We send the plan fragment down here, and then as it computes the join result, it's gonna write that out to the shared disk.
There's some notification to a coordinator to say, hey, if you're looking for this join result, here's where to go get it on the shared disk. And that way, if this guy crashes and goes away, this guy knows he can just pull it from there and not worry about having to recompute anything. So again, this is an important design decision that distributed databases make. They're not gonna provide you with what is sometimes called query resiliency, or fault tolerance while the query executes. They may restart the query for you, but they're not gonna try to pick up where a node was at the moment of the crash. Okay? All right, so the other thing we gotta worry about now is how to do query planning. So we have all the same issues that we had before: join orderings, how to do predicate pushdown, early projections, all those same decisions we had to make on a single node system we still have to make in the distributed system. But now we have an extra level of planning, where we need to reason about where our data is located, how it's partitioned or replicated, and account for the network communication cost in our algorithms. Right? Again, this'll make more sense in a second when we talk about the different scenarios, but the join ordering still matters, and now it's also like, all right, should I join these two tables first because they're on the same node and partitioned the same way, and therefore that can be really fast, even though if I was on a single node I might not wanna join those two tables first? That just gets included in your cost model to help you make a decision about what the most optimal query plan for the system is. Again, just as I said before, doing query optimization on a single node is hard; this is even harder, okay? Sometimes you can make that decision at a single, centralized location, and sometimes you can make it distributed across all the nodes, but now you have to make sure the statistics on all your nodes are as up to date as possible to help you make decisions about what the best plan could be. Everything is just much harder when it's distributed. But now, say we have a query plan. What do we actually wanna send to the different nodes that are gonna participate in executing a join query? There are two approaches. One is we could send the physical plan, or plan fragments, to the nodes to execute, by taking the query plan that we've generated on the base node or the home node and then carving it up to say, well, I know this portion of the query plan needs to execute on this node, and this other portion of the query plan needs to execute on that node. You distribute those physical operators to the other nodes, they just take them and immediately execute them, without reasoning about whether that's the best thing to do for their local data, and then they send back the results. As far as I know, most distributed databases that are doing analytics do this first approach. They run the query through an optimizer once, generate a global plan for the entire cluster for the query, then carve it up into plan fragments and divide it up. The other approach is to take the SQL query that came in and rewrite it on a per-partition basis, so you have a SQL query for each individual partition, and then send those out.
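A sketch of that second approach, assuming two hypothetical partitions with id ranges 1-100 and 101-200: the home node tacks each partition's range onto the original query and ships plain SQL instead of a physical plan.

    -- Original query from the application:
    --   SELECT * FROM R JOIN S ON R.id = S.id;
    -- Rewritten query sent to the node that owns ids 1-100:
    SELECT * FROM R JOIN S ON R.id = S.id WHERE R.id BETWEEN 1 AND 100;
    -- Rewritten query sent to the node that owns ids 101-200:
    SELECT * FROM R JOIN S ON R.id = S.id WHERE R.id BETWEEN 101 AND 200;
    -- Each node runs its query through its own optimizer and picks its own
    -- join algorithm based on local statistics.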
Then when the node where the partition is located gets that SQL query, it runs it through its own query optimizer to generate the physical plan that it wants to execute. The idea here is that if we did a global plan on a single node, the statistics and information about what's best at each partition might not be up to date or fresh. And therefore, rather than me trying to reason at my home node about the best thing to do at this other node, I'll just say to this other node, hey, I need you to execute this query for me. And then that node can do all the local optimization and local planning when that SQL query arrives. So again, look at an example here. So my query is like this, and in this case there's no real join ordering decision; well, it's either R then S or S then R. But I just say, all right, I need to execute this query and I need to touch data at these three partitions. So I'll rewrite my SQL statement to now include a where clause that says, here's the portion of the data you need to look at, which matches what the partition key or the partition range is. Then each node gets its individual query, runs it through its own optimizer, and they can decide: oh, based on the data I have here, is it better to join R then S or S then R? Should I do a hash join, should I do a sort-merge join? I can make a local decision here because I have the best view of what data I'm actually storing, whereas the home node, again, could be out of date. So the only database that I'm aware of that actually does something like this is MemSQL. Everybody else sends the physical plan. I mean, to me, it seems convincing that you could do this. I think you still have to make a higher-level decision about the possible join order if you have more than two tables: like if I have R, S, and T, should I join R and S first, or T and S first? I think there is some information there that you may have to reason about at the upper level, but when it comes time for the local node to make this decision, it doesn't worry about communicating with anybody else. I'm not saying one is better than the other. I think this has interesting implications that have not been explored in the research. Okay. So now we want to talk about joins. Again, joins are the most expensive, most important thing you have to do in a single node database, and the same holds in a distributed database. For an analytical workload, the large portion of the time that the system spends executing a query will be reading data from disk or doing joins. And as we said, for reading data from disk there are some methods you can use to speed that up, but at the end of the day, you're usually bound by how fast you can get things off of the physical device. But for joins, we can be a bit smarter about things. So as I showed in the very first example, the easiest way to do a join is just to get all the data we need for our join, put it on a single node, and run our join algorithm. But as I said, you lose all the extra parallelism and the extra resources you have from dividing the data across multiple nodes, you lose all those benefits, and you may not be able to actually put everything in memory to run things fast. So there are gonna be four different scenarios you need to handle when you want to do a distributed join.
And again, a distributed join is gonna work the exact same way we would do a join on a single node system. But the idea is that we need to figure out how to get the data we want to join together onto a node, could be one node, could be multiple nodes, such that we can do the join locally without having to coordinate with any other node while we're doing the join. So when we do that local join, it's all the same algorithms, the sort-merge join, nested loop join, the hash join that we talked about before; all the same optimizations apply. It's the step you have to deal with before you do that join, of how to get the data to the node that's gonna need it. So again, we're gonna talk about four different scenarios. These have already sort of come up in our earlier examples, and we'll see how we actually want to handle them. And again, the main takeaway is that there's no magic here. There's no magic distributed join algorithm that's gonna perform so much better than a single node one. It's all about how to get the data to where it needs to be. All right, so the best case scenario would be one where one of the tables we're joining is partitioned on the join key and the other table is replicated in its entirety at every single node. So in this case here, the R table is partitioned on its ID field, which is in our join clause, and then the S table is replicated at every node. Again, this could be the fact table, and this could be the dimension table, which is gonna be small enough that we can replicate it on every single machine. So in this case, all we need to do is have every single node do a local join to produce the join result for the data that it has in its local storage, and then we just need to transmit the output of the join on each node to some centralized location so that we can combine the results and produce a single answer for our application. All right, so again, this is the best case scenario, because the transfer to get the join result from this node to that node is unavoidable; we have to do it. But when we actually did the join, we didn't need to coordinate or communicate with any other node, because everything we needed was local to us. So this gives us the benefit of a distributed database, because now we can run this join in parallel on every single node without any coordination, and then everyone just sends the result back to the head node. The next best case scenario is where both of the tables are partitioned on the join key. In this case here, S, which on the last slide was replicated, is now partitioned on the ID field, and the range of the partition for S is the same as the range of the partition for R. So again, just like before, we compute our local join. It could be a hash join, it could be a nested loop join, it could be a sort-merge join, it doesn't matter. And then this guy transmits the result to this other node where we combine it together. Yes? So in this partitioning scheme, isn't it necessary that R and S are equally distributed according to the partition key? It may happen that R has most of its tuples in one range and S has all of its tuples in another. So his question is about data skew.
So in this example here, I'm showing you that the ranges are the same, one to 100 and 101 to 200, but that doesn't necessarily tell you how many tuples exist in that range for this partition. So say this ID is the primary key for R, so there's exactly 100 tuples for R, but say this ID is not the primary key for S; I could have a billion entries within this range and then a billion entries for that range. How do I handle that? Well, in that case, you would not want to partition on the same ranges of the ID field. So in that case, you will have to shuffle some data around, which we'll talk about in a second. In this best case scenario, these are uniformly distributed, exact same ranges, and I don't need to coordinate. Sure, I agree with you, it may not happen in the real world. I'm laying out the best case to the worst case scenario, and we'll see how to handle things as they get worse. All right, so related to his question, the next issue is gonna be: let's say that one of our tables is not partitioned on the same attribute that we want to join on. So in this case here, S is partitioned on the value field. So in this case, we can't compute our local join, because for every single value of R.id here, I don't know whether a matching tuple will exist on my local partition. There may be one over here with the same ID. So this case is called a broadcast join, and the basic idea is that you copy the portion of the table that's missing from each partition to every other partition, so that now this node has a complete copy of the table, and then you just compute your local join and send the result over to the other guy. Again, this assumes that S is small enough that it can reside in memory, or at least not overwhelm the machine you're copying everything to. Right, the broadcast means that you're basically giving every node its own replicated copy of the table. So sometimes you'll see this referred to as a broadcast join. It just means there's this initial step where they transmit the data across the different nodes so that everyone has a complete copy, and then you do the join. The last scenario, the absolute worst case, is where both tables are not partitioned on our join key at all, and now we need to reorganize the layout of the data so that we can compute our join more efficiently. This is called a shuffle join, or shuffle hash join. So basically we recognize that, oh, we really want things to be partitioned on the ID field, so let me just start copying the data that I need from these two tables to these two different nodes. If this has to spill to disk because we run out of space, that's unavoidable, so we just go ahead and do that. And then once I have everything partitioned the way I want it partitioned, I can compute my local join and generate the result. Yes? What if the space is limited? How do we do it? So his question is, when the space is limited, what can you do? So the point I'm trying to make here is, think about what you end up doing here, right? I have to now make another copy of R and store it on this node. And so if it doesn't fit in memory, just as I said, it's a temporary result for the query, and it could get spilled to disk. That's unavoidable, right? But the more important thing is that, now when I compute the join, I want that to be as fast as possible, because that's the most expensive thing.
So by copying the data around, when I do the join, everything is partitioned nicely the way I want it, and it's more efficient. Do we always assume that the disk is enough? The question is, do we always assume that the disk is enough? For our class, yes; in the real world, no. What will happen is the database recognizes that if I get to here and I can't copy any more data to this node, the query would fail. You just say I ran out of swap space or temp space, you throw an error, and the query fails. And actually, related to his question about the distribution of data: if, say, it was partitioned on the ID field, but the distribution was highly skewed so that most of the data for S was on this node instead of that node, I can still again do this reshuffling to realign my data, maybe move some data from R and S over here so that things get balanced. It's still called a shuffle process. Question? How does this work in practice? For example, if some of the data is replicated, you don't want to send it twice, right? So how do you manage that, what do you do? The question is, how do I make a decision about what data to send? Because if things are replicated, I don't want to waste network transmission sending data that I don't need to send. You know everything ahead of time, right? SQL is declarative, we know what the query is, we know what data you're trying to access, we then look in our system catalog, and our system catalog is going to tell us how this data is actually partitioned. We know this ahead of time. So the query planner can make a decision: oh, well, this data is partitioned this way and it's this size and it's on this node, so therefore I need to move or not move, copy here, don't copy there. I can do all of that ahead of time. It's not like as I'm going along I say, oh, well, maybe I should copy this. You figure out everything ahead of time, except for the issue he brought up, where I ran out of disk space; you just fail, nothing you can do. And then how efficient the database system is in making those decisions about what to copy or not to copy depends on how good your query optimizer is, which is why people pay a lot of money to have people that can work on query optimizers. Question? What if the data is sorted by the partition key? The question is, what if the data is sorted on the partition key beforehand? Partition key or join key? If it's sorted, for what I'm showing here, who cares? Right, because what I care about is the locality of the data. I need to join R.id on S.id. So I want to make sure that when I do that join locally, all the data that I need to join the outer table with the inner table is at my local node. It can't be at some other node that I don't know about, where I could end up with a false negative or a false positive. So sorting doesn't matter here. Sorting would matter when we make a decision about whether I want to do a sort-merge join versus a hash join. We're a step above this; we're deciding how we move the data or not. Yes? Okay, so the question is, what am I actually transmitting here? And to be clear, I'm not swapping this; I should have a divider line.
So this is the permanent data on this node. It does not go anywhere. I just make extra copies as temporary data to do my join, and then I throw it away when my query is over. So even though I'm shuffling here on ID, at the end of the day I'm still gonna be partitioned on name and value here. This stays. But then your other question is, what am I actually transmitting? Am I transmitting the whole tuple, or am I transmitting some identifier? Next slide. We'll get to that. Yes? So on the local node, how do you make sure that there are no duplicates? The question is, on a local node, if the data is not sorted, how do I make sure that there are no duplicates? Duplicates of what? For the join? For the partition key, you said there could be some. Ah, is the partition key the primary key or a unique ID? Making sure that the primary key is unique? Hold on. Well, all right, you had two questions there. I think what you're asking is actually: if I have a primary key, then I need to guarantee it's unique, but my partition key is not the primary key, so how do I make sure that I enforce that? That is a transaction thing, right? That's not us. We're just doing analytical queries. We assume that someone else has solved that for us when they stored the data as part of a transaction. So in your example, if I have a partition key that's not the same as the primary key, and I insert a new record, how do I make sure that's unique? Well, I'd either need to maintain a sort of centralized index that I can look up to see whether this key already exists, or I'd broadcast the query to every node to say, hey, I'm about to insert this key, do you already have a copy of it? I'm saying you could, right? I mean, how could that be? Here's a node, here's an index. We're building a database; we can do whatever we want, right? We can do this. Again, in this world, for these queries, I'm not enforcing integrity constraints. I'm just running these read-only analytical queries to figure out how to compute the join efficiently on a large data corpus. The transactional side cares about integrity constraints, because you're modifying the state of the database. So you are assuming that the whole database is only read-only data? Everything is read-only? No, no, no. Going back to my initial example here, the ETL thing. This thing here, you're bulk loading a bunch of data, or it's sort of streaming; it's not like all at once, here's a bunch of data. So you're streaming incremental updates from the front end to your back end data warehouse. And the back end data warehouse could choose or not choose to enforce those integrity constraints as the data comes in. But it won't be on the critical path when we execute our queries, because I'm running a select statement, I'm not checking to see whether my primary key is unique. So now, how you enforce that integrity constraint as you ingest the data into your back end data warehouse, well, that's the same thing we talked about in the last class about how to do transactions, because that is a type of transaction: insert something, make sure it's unique. I need to coordinate across multiple nodes, if everything's not on a single node, to make that check: does everyone agree that we can go ahead and make this change?
So a lot of times these analytical data warehouses will have a separate engine or storage area that is designed to do more efficient updates than what a traditional columnar storage database system would do. If you take the advanced class, we will cover that. We don't cover that here. So again, for our purposes here today, we don't care about enforcing that integrity constraint. We assume that it is already handled for us. Some database systems, I mean, some data warehouses, just turn all that crap off. They turn off foreign keys, they turn off unique keys. Your data's a little dirty. Who cares? It's for analytics. So even if the front end layer, I mean the OLTP database, handles that situation, what if they want to update a value and that value must be updated in the OLAP database as well? Right, so the question is, you have your bank account or whatever your game information, and you update your user account on the front end OLTP database, because the users are touching this part and they don't touch the back end data warehouse. They make updates here. That update gets propagated and you want to go update the record here. How does that happen? Advanced class. In general, what I'll say is: you buffer a bunch of changes in a sort of write-optimized storage layer or staging engine in these data warehouses, and then they periodically merge the changes into the main column store tables used to do analytics. Different systems do different things. So, any other questions about the stuff here? Yes, go ahead. But if you run out of space when copying, the query fails, so couldn't the optimizer figure that out ahead of time and plan around it? The question is, I said here that when we go to copy this data for this particular query, if we run out of disk space, the query crashes; couldn't the query optimizer figure that out ahead of time and say, oh, I'm not going to have enough disk space, let me maybe not run it a certain way here? Query optimizers usually do not reason about what other queries are running at the same time. It assumes your query runs in isolation. It can reason about certain things: some systems let you specify how much temporary buffer space or temporary disk space a query is allowed to use, and then if you exceed that, your query fails and it throws you back an error saying, hey, increase this parameter if you want to keep running. So it could potentially figure out at planning time, oh, I'm going to copy over more data than I have space for, for this particular query, and throw an error. But if your disk physically runs out of space, even though you haven't exceeded that limit on a per-query basis, there's nothing it can do, because the query optimizer usually does not think about, or cannot reason about, what queries are running at the same time. Because that just makes your life harder. This query shows up first, I plan it, I say, all right, nothing's running now, let me go ahead and choose one kind of plan because I'm running by myself, then start running. And then this other query now shows up. I don't want to go back and modify that other query's plan to say, hey, now you're also going to be running with this other query at the same time. That's way too hard. I'm sorry, are you saying the OLAP database doesn't enforce integrity constraints? No, no, no, no.
My comment was that sometimes you see in data warehouses that the database system itself could support enforcing integrity constraints: foreign keys, primary keys, referential integrity constraints, stuff like that. It could support those, but the application developer, the person building the data warehouse, may say, I don't want to pay the penalty to check for foreign keys, let me turn all that off. So does the OLAP database support recovery and all of this? You just said it doesn't support query fault tolerance. His question is about how I said on the previous slide that, in general, most shared nothing, and actually even shared disk, distributed OLAP databases do not support query fault tolerance or query resiliency, where if a node that's responsible for executing a query crashes halfway through the query execution, they are not able to recover the intermediate results that that query had computed and then pick the query up where it left off. In general, they just kill the whole thing, throw you an error, or maybe restart it silently for you. That has nothing to do with logging and recovery. We absolutely still need logging and recovery, and we still do that. This is more about, while the query is running, can I take snapshots as I go along so that I can pick up midway through where the query was running if a node goes down? And the statement I made was that, to the best of my knowledge, most distributed OLAP systems do not support that query resiliency, because taking the checkpoints or the snapshots of the intermediate results is expensive. What if you do the logging, and the logging takes a lot of space? Logging of what? Logging for recovery of the query? Of the database? The database. We are still doing all the durability and ACID stuff that we talked about before. We are still making sure that if we ingest data from the outside world into our database, we don't lose it. And so they will all provide durability guarantees. Again, Walmart has a hundred petabytes; they don't want to lose that data, right? And the database system will guarantee that they won't. But there's a lot of read-only data, so is it possible that we don't log any operations on this data? When you say operation, what do you mean, like an update to it? Or a query, like a select query, like the join operations. Let's sit down afterwards, okay? We'll go through this. I think there's something you're missing here. Okay, so to his question about what am I actually sending when I do these distributed joins: am I sending the whole tuple, or am I sending an identifier? And I would say the answer is that, at the very least, you're sending the minimum amount of information that you need in order to compute the join, and in the worst case scenario you're sending the entire tuple. And in general, the high-end, good distributed databases that are doing analytics will minimize the amount of data that you actually need to send. And this particular type of query would use what is called a semi-join. A semi-join is like a regular join, but the idea is that we're not really gonna do a full join with the right table, or the inner table. We're only gonna check to see whether, if we did a join, a tuple would match.
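As a rough SQL sketch of that idea, using the same R and S tables from before: the semi-join form only needs the join-key values to cross the network, not whole S tuples.

    -- Join form: matching S tuples (or R tuples) have to be shipped around.
    SELECT R.id FROM R JOIN S ON R.id = S.id;
    -- Semi-join form: just an existence check, so only the ids need to travel.
    -- (If S.id can repeat, this form also avoids duplicate output rows.)
    SELECT R.id FROM R WHERE EXISTS (SELECT 1 FROM S WHERE S.id = R.id);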
So the query optimizer can recognize that it may not need any values or attributes from the columns of the inner table, and therefore it can rewrite the query to just do an existence check and send the minimum amount of information back and forth between the nodes to do that semi-join, rather than copying entire tuples. With a natural join, you would do the join and the output would be the concatenated tuples from the left and right tables. In a semi-join, the output is just the attributes from the outer table that are needed to compute the join. So in this case here, say I have a query like this: select R.id from R, doing an outer join on S, and we're just matching on id. And this is a poorly written query, because it's basically saying where R.id is not null, then match up on S. If we didn't do a semi-join, we would either have to copy S over here or R over here, and we'd be copying entire tuples, which would be expensive. But we could rewrite this query so that we just check whether there exists a tuple in S that has the same id as R.id, and if that's true, we spit out all the R.ids that match. So in this case, the only thing I need to send over is maybe just the ids, because that's the bare minimum information I need in order to compute this join. Some systems, like Cloudera's Impala, actually have an explicit semi-join keyword you can give it; otherwise you can try to fake it with EXISTS. The high-end systems can actually try to rewrite your query into a semi-join as part of figuring out what data needs to be transmitted between the different nodes, and they include that in the optimizer's cost model to decide how much data is being moved between nodes and whether one plan is better than another. What do you mean by faking it with EXISTS? So I could say select R.id from R semi join S, explicit SQL that says, hey, you're doing a semi-join, or I can rewrite it with EXISTS. In most systems you have to do the latter, because I don't think semi-join is in the SQL standard. Yes? Who does this rewrite? The question is, who does this rewrite? In this example here, it's the application programmer. The high-end enterprise systems that have good query optimizers could potentially figure out how to rewrite this for you. Is the query optimizer on another node? The question is whether the query optimizer is another node. It doesn't matter whether the query optimizer is running on this node here or another node; that's not what we're talking about. This is just about the semantics of the semi-join. From a relational algebra standpoint, it looks like this: I'm joining R and S with the semi-join operator, and the output is just the a_id and b_id used to compute the join, and nothing from the inner table. So is this clear? There's a homework question on it.
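To make that rewrite concrete, here is a minimal sketch in SQL; the table and column names mirror the R/S example above and are illustrative only, and whether the optimizer performs this rewrite for you automatically depends on the system.

```sql
-- The (poorly written) original: an outer join where we only ever return
-- columns from the outer table R.
SELECT R.id
FROM R LEFT OUTER JOIN S ON R.id = S.id
WHERE R.id IS NOT NULL;

-- Rewritten as an existence check (semi-join style): the nodes only have to
-- ship the join key values, not entire tuples of S.
SELECT R.id
FROM R
WHERE EXISTS (SELECT 1 FROM S WHERE S.id = R.id);
```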
All right, so we have 10 minutes. As I said, this is just a crash course on the main design decisions and issues in building an analytical database. If you're not going to be building an analytical database management system yourself, actually working on the internals, and you just want to be a user of one, the main issue you're going to deal with is that partition key: how to pick it so that most of your joins can operate on a local node without having to do a broadcast or shuffle. And the various systems out there, mostly the enterprise ones, have tools to help you try to figure these things out. Okay?

All right, let's talk about cloud databases. The definition of a cloud database is a bit nebulous, no pun intended. Typically what it means is some vendor offering you what is called a database-as-a-service, or DBaaS. The idea is that you give Amazon or whoever your credit card, and they say, here's your JDBC connection, port number, and host name, and you just start shoving queries into it. You don't worry about how to manage the nodes, you don't worry about how to do backups; they take care of all of that for you. I already sort of mentioned this before, but with the major cloud vendors, the people that control the whole stack, like Amazon, Google, and Microsoft, the line between what is a shared-disk system versus a shared-nothing system is starting to get really blurry, because they can push database logic up and down the different layers of the system stack in a way that you typically can't unless you control the hardware yourself. So for example, Amazon has something called Aurora. Aurora is a shared-disk version of MySQL and Postgres, but they actually push logic to do transaction management down into the storage layer, into EBS, at the shared-disk level. So it's not a pure shared-disk system anymore; it starts to look a bit more like a shared-nothing system. So I think the cloud systems are where we'll see the most innovation over the next ten years, and I think it's going to be really interesting to see what they can do. All right, I think I already talked about this. In general, a cloud system can either be a managed database or a cloud-native database. A managed database is just taking an off-the-shelf database system that was written to run on premise, on dedicated hardware, and running it for you as a service. Take MySQL, take Postgres, plop it into an EC2 instance, and have people connect to it. They don't know and don't care that you're managing EC2 for them; they could have done that themselves, but you're providing a service to handle all the backups and other management stuff for them. So most of the time when you get a cloud database, it's going to be the first kind. Now there are some systems that refer to themselves as cloud-native database management systems, and these are ones designed to explicitly operate in a cloud environment. Typically they're going to be a shared-disk architecture, because they don't want to have to build and replicate something like EBS or S3 themselves; they build on top of the existing storage infrastructure that the cloud vendors provide, and they provide the compute layer on top of it. You still have to do query planning, you still have to do all the fault tolerance stuff we talked about before, but they don't worry about how to actually persist things to disk; they let the cloud vendor handle that. And there is now also a new class of systems that label themselves as serverless. Yes?
Yeah, the question is, what is truly the difference between a managed database and a cloud-native one? So for the managed one, I'll use MySQL as the example: I take MySQL and run it without making any changes to it. I run it in a container or in a VM, and it's the same software I would run on my local machine, but now I'm running it in the cloud for you, and the service provider also deals with the backups and recovery and all the other stuff for you. The cloud-native one would be: I build a new database system from scratch, or I take an existing one and make heavy changes to it, so that it is designed to work in a cloud environment. So up here, I'm running MySQL. MySQL doesn't know anything about S3. It doesn't know anything about the performance implications of reading from EBS and things like that; it just has a disk with certain properties. Down here it would be, oh, I've designed the system explicitly to work on S3. S3 provides me these guarantees and these performance properties, and that information permeates all throughout the system: my query optimizer's cost model can reason about the speed of writing to S3, or about what you can do in S3 that you can't do on EBS, stuff like that.

All right, so the buzzword going around now is serverless, and of course there are serverless databases. Everything is the same as what we talked about before; the idea is just that when your machine goes idle, or your database connection goes idle because your application is no longer sending queries, they will try to deprovision, or have you not pay for, hardware you're not actually using. So let's say I have, again, a managed database system: a single node running MySQL. I pay for the instance, I pay for the storage, and I have to provision it to be running all the time. My application sends queries to this guy and results come back. But now if my application goes to sleep, or walks away, or goes to the bathroom, whatever, then I'm paying for resources I'm not actually using, because I have to provision the hardware: I have to provision the EC2 instance, I have to provision the EBS storage. So I'm paying for this stuff to run and not actually do anything. The idea with a serverless database, and it's almost always a shared-disk architecture, is that I can run all my queries as before, but when I go to sleep, I decommission the compute side of things. That goes away, but before it does, I basically take a snapshot of what pages are in my buffer pool: I take a checkpoint, record the page IDs of what's in my buffer pool, write that out to shared disk, and then kill off my compute node. So now the only thing I'm paying for is storage; I'm just paying for my database to sit idle on disk. And if I ever wake up and come back and execute a query, the very first thing I'm going to do is say, all right, before I shut down last time, what was in my buffer pool? Then I go try to fetch that in and make it look as if I had been running the whole time. So this is one way to do it, if you assume there's only one customer running on a single node.
Another common setup is that you run multiple customers, or multiple tenants, on a single node, and then the system recognizes that a given customer has not sent a query in a while and writes out that tenant's buffer pool page contents to disk. Yes? What's the advantage of writing out what's in the buffer pool, as opposed to just rebuilding the buffer pool from scratch when you start back up? So his question is, why am I doing this extra step? Why am I writing the buffer pool page contents out to disk beforehand; why do I care? Because you want to make it look as if you never shut down. The most expensive thing is fetching data from disk. So ideally, if the next query shows up a minute later, I don't want everything to have been evicted from my buffer pool, and to have to pay this big upfront cost of warming my data and bringing it back into memory. You record all this information so that when you come back a minute later, it looks like the thing was running the whole time. You still pay a penalty, because you have to fetch the data in, but you don't have to wait to fetch everything on demand; you can try to prefetch some of it. It's just pre-warming. Yeah, pre-warming the cache. Every database system does this: when you call shutdown and do a clean shutdown, they're already doing this. So if you don't write the buffer pool page table to storage, it still works, right, when you start a new machine? Correct. The question is, if I don't do this step, will it still work? Yeah, it'll still be correct, it'll just be slow. This is just an optimization to get more out of the cache. For the page table you write to storage, could the mapping of memory pages be different? No, we're just storing the page IDs that were in the buffer pool at the moment we did the shutdown. So then the first query shows up and says, go get me pages one, two, three, and the system says, oh, I also had four, five, six, seven, eight, nine in there, let me go fetch those in too. Yes? Her question is, isn't serverless usually supposed to mean stateless? Well, it is and it isn't. It's serverless from the perspective of the end user, meaning I don't have to say, provision me this machine. With a lot of the cloud vendors, if they're not doing a serverless architecture, you basically say, I want to pay for these dedicated resources to be available to me at all times. If you're not using them, they're still charging you, and they're happy. But if I have one query a minute, I don't want to provision a whole machine just to execute 60 queries an hour. Whereas in this architecture, I can still have my database behave as if it was running all the time, but I'm not paying that penalty; I pay on a per-query basis, plus whatever I'm storing over here. So the main vendors in this space: Amazon has a serverless version of MySQL, and FaunaDB is a separate database startup that does this. None of these guys do exactly what I described, where you kill the machine; they just recognize that a customer has not sent a query in a while and then write everything out to disk. Azure can do this too, and then Google, I think, these icons are so useless because they don't show the name, but I think it's Google Firestore. So not Spanner, not the BigQuery stuff, only Firestore.
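As a hedged aside on that pre-warming point: the trick isn't specific to serverless systems. PostgreSQL, for example, ships a pg_prewarm extension whose autoprewarm worker does essentially the same thing on a single node.

```sql
-- Single-node pre-warming sketch (assumes PostgreSQL with pg_prewarm).
-- In postgresql.conf:
--   shared_preload_libraries = 'pg_prewarm'
-- That enables the "autoprewarm" background worker, which periodically saves
-- the list of blocks currently in the buffer pool and reloads them after a
-- restart, i.e. the same "remember what was hot" idea described above.
CREATE EXTENSION pg_prewarm;

-- You can also warm a relation manually after a cold start
-- ('users' is a hypothetical table name):
SELECT pg_prewarm('users');
```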
All right, so another interesting thing is that you could potentially build a database system without having to write every piece of it yourself. There's enough open-source software and enough services out there now that you can cobble a bunch of these things together and make a new cloud database without writing everything from scratch. We've already talked about part of this: as I said, her question was, what is a cloud-native database? It's one where the software is written to assume you're writing to something like S3, and you take its guarantees and performance implications into account in the design. So that's one example: you don't have to write the disk manager, you just let S3 handle that for you. You may not have to write your catalogs; you can get metadata as a service through a bunch of different pieces of software. You may not have to manage your cluster; you can rely on Kubernetes or YARN or other tools from the vendors to handle that for you. And you may not even have to build your own query optimizer. There actually are some open-source optimizers-as-a-service, where the optimizer runs on a separate node, you feed it a bunch of XML or JSON metadata that says here's what my data looks like and here's what my query plan looks like, and that separate machine crunches on it and spits out a potentially optimized plan for you. I don't know how good these things actually are; there haven't been any studies to evaluate them yet. We looked at Orca when we were building our system and we passed on it because, at the time, the documentation was terrible. Calcite is written in Java, so that was a non-starter for us as well.

All right, the last thing I'm going to briefly talk about is file formats. Pretty much until very recently, every single database management system had its own proprietary binary file format, meaning MySQL writes a bunch of files to disk that you can't read into Oracle, because Oracle has its own file format, and so on. It's the same thing when you build your projects on BusTub: BusTub has its own page format that can't be read by anybody else. This is problematic now that you're in a cloud environment where a bunch of different services may want to share data: maybe the data I'm generating in my OLTP databases is data I also want to run through Spark, or through Vertica, or some other distributed database. Right now, if everything is based on a proprietary format, the only way to get data out of one system and into another is to make copies and put it into one of those human-readable text formats. So instead, there are now open-source binary formats that a bunch of cloud vendors, distributed databases, and data science ecosystem tools support, so that I can write out to S3 or EBS or my distributed file system a bunch of files in the format my database system generated, and then another database can read them in without having to do any deserialization or conversion. Some of these you may have heard of, but these are the main ones. Parquet and ORC are probably the two most common: Parquet came out of Cloudera and Twitter, and ORC came out of Hive.
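As a hedged illustration of the interoperability these formats give you, here is a minimal sketch in DuckDB syntax (DuckDB comes up again at the end of the lecture); the table name is made up.

```sql
-- Export a table to the open Parquet format:
COPY users TO 'users.parquet' (FORMAT PARQUET);

-- Any engine that understands Parquet (Spark, Impala, Pandas, another DuckDB
-- instance, ...) can then scan the same file without a conversion step:
SELECT count(*) FROM read_parquet('users.parquet');
```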
And again, think of these as binary column-store formats that are not tied to any one specific database system. They're open specifications, so anybody can modify their database system or their application to read this data natively. CarbonData is a newer one, from around 2016 I think; it's similar to ORC and Parquet and came out of Huawei in China. Iceberg is another new one, from Netflix, that somebody just pointed me to recently. I haven't looked into it too much yet, but they claim they support schema evolution: I can change columns, change names, change column types, in a way these other formats can't. The others are read-only: I create the file, it's frozen, and I can't go back and change it. HDF5 is not typically used in cloud systems or in traditional Silicon Valley database and tech companies; it's mostly found in HPC and the scientific world. It's for array data; your particle collider will spit out a bunch of files in this format. And then Arrow is an in-memory column format that came out of Pandas and Dremio; think of it as like Parquet and ORC, but for in-memory data. The database system we're building here at Carnegie Mellon actually uses Arrow as its native storage format, so you can take data that our system generates, dump it out, and feed it into Pandas or scikit-learn or anything else that understands the Arrow format. I think this is the right way to go. You do end up at a sort of lowest common denominator: the compression schemes and everything else in these formats may not be the best for every possible application, and if you wrote a custom format you could probably get better compression or better performance, but this gives you interoperability.

All right, so just to finish up. OLAP: if you need an analytical database that has to scale, then you have money, because you're getting data, right? People are actually using whatever application you have, and you're able to process the data. But the more data you get, the more problems you're going to have, because with these distributed databases there are all the additional management concerns and problems of a distributed system that you have to account for. To his question from last time about what are some interesting, I won't use the word good, but what are the main OLAP systems you could look at: in the cloud, Redshift and Snowflake are the two key players; Oracle, Microsoft, and Google also have their own respective services, but I think those two are probably the biggest ones. If you want to run on premise, the one I'm actually super interested in is ClickHouse. It's a distributed column-store system out of Russia, from the Yandex guys, and when you read their webpage, the list of things they support is actually amazing. And it's open source too. Presto runs on top of, I think, Spark and Hadoop. Splice Machine is HBase plus Spark. Greenplum is a forked version of Postgres that somebody made distributed; it was a startup back in the 2000s, they got bought by EMC, EMC decided they didn't want to be a database company, so they merged with VMware and became Pivotal, but it's out in the open source now. Vertica was founded by my advisors back when I was in New England; they got bought by HP. One of their people came and gave a talk a few weeks ago, early in the semester.
So the way to think about this is: if you have no money, start here; if you have a lot of money, you could start over here. Oracle Exadata and Teradata are super, super expensive. With Exadata, I don't think you can buy a machine for less than something like two million dollars, because you're buying custom hardware, okay? And if you don't want to go down the route of setting up a whole distributed database, there's a newer system I'm also very excited about called DuckDB. It's out of Europe. Think of DuckDB as SQLite for analytics: you can run it as an embedded database, you can actually connect to it through the SQLite terminal, but it's a column store and you can do analytics in the same kind of way you can with these other systems. So if your data fits on a single node, DuckDB might be the right thing. And it's open source, okay?

All right, so that's it for the main material of the semester. The next class after the holiday we'll have a guest speaker from Oracle, and then we'll have the final review and the potpourri session on Wednesday. So at this point, you should be confident enough to go out into the real world and either manage your database or use your database, and know enough to opine on whether the system you're working with is making good decisions, okay? We've covered a lot, and hopefully you guys have followed along. So I'm confident you can go out in the real world and discuss databases, okay? And if not, talk to me after class. Okay, see you. No, no, no, no, no, we still have two more classes.