Okay, class stuff. We're almost at the end. Homework five is due December 4th. Project four is due seven days later on the 11th. Next Tuesday's class on December 6th, the Snowflake lecture, will be virtual. We'll have some former CMU students who are at Snowflake building their system, which is relevant to today's discussion, and they'll be calling in over Zoom. I'll send out details and post that on Piazza. We could sit in here and watch Zoom on the big screen, but that seems kind of weird, so let's just do it from your dorms or houses or wherever. And then, as I posted on Piazza this weekend, the call-in lecture will be on December 8th. That will be in person, but people can call in, and you can ask questions live. It can be anything you want to ask about databases from the entire semester, and we'll answer. I posted a form on Piazza: if you have a complex question that needs a multi-part answer, submit it ahead of time; otherwise we'll just shoot from the hip. It's live, so we can't bleep anything out; we'll keep it somewhat professional. But any database question, go for it. And then the final exam will be on Friday, December 16th. Any questions about any of these things? Okay. So we've gotten some great feedback emails again. The first one is actually about you: this guy wrote in and said you're fine, but you've got to stop giving money to people, which is relevant to what we just talked about. Too many people owe you money. So there's that. But then I got another email from a guy who is an old-school database veteran. In the last class, when we talked about distributed OLTP systems, I mentioned that nobody really knows where two-phase commit came from; I thought Jim Gray invented it, and then I found something else. He pointed out that in the Jim Gray textbook there's a paragraph saying there really isn't a notable inventor of two-phase commit, because it's basically just contract law: you get two parties to agree to something and then you go ahead and commit the contract. There is an Italian, Nico Gorzada, who is sort of credited with first implementing it in a system, but Jim Gray was the first one to document it. That's why I thought Jim Gray invented it, but it actually predates him. It's on page 575 of this book here, which is sort of the transaction Bible, written by Jim Gray. It's from 1992, so it doesn't have Raft, it doesn't have Paxos, but the core concepts we cover when we talk about concurrency control and some of the distributed execution stuff can be found in this book. It's fantastic. Again, this is why I put things on YouTube: if I'm wrong, people point it out and send emails, and if anyone has problems, they send emails too, which is good. All right. So today we're going to be talking about OLAP systems. We have this distinction between the OLTP systems, where you're ingesting new information from the outside world, running transactions, and updating your database, and the analytical systems, where we want to analyze all the data we collected from the OLTP systems and extrapolate new knowledge.
What I'm about to describe here is not specific to distributed databases; I just want to show you how you're probably going to come across these two different kinds of systems in the real world. The OLAP system could be distributed or a single node, and the OLTP system could be single node or distributed; it doesn't matter. So you would have this bifurcated architecture where you have all your OLTP databases. This is where people are going through the application, the website, the phone app, whatever it is, making updates or looking things up, and they're hitting these databases. But now I want to analyze all these different databases and run data science algorithms or whatever to extrapolate new knowledge. I don't want to do that on the front-end nodes themselves, because if I run a really expensive analytical query, it's going to take locks, slow down transactions, and interfere with people using the application. So typically people separate these two. What you would have is a process called ETL: extract, transform, and load. There are a bunch of tools to do this kind of stuff; you might have heard of Fivetran, or GoldenGate from Oracle. These tools know how to take the data out of the OLTP systems, do some kind of cleanup, and then shove it into your back-end data warehouse. The reason you need this cleanup, the transform step, is that oftentimes the schema or the naming of things will differ from one OLTP database to the next. The example I always like to use is Zynga: Zynga buys a lot of these different phone apps and usually leaves the front-end database alone. But if you want to run analytics across all the different games, on the data you collect from people, you've got to clean up the schemas and make things uniform. One database might have a table of users where the first-name column is called fname, another calls it first_name, another firstname. From a database system's perspective, those are different column names; we humans know they correspond to the same thing. So you clean this up in the transform phase and put it into a standardized schema. There's another style that's becoming more common now, which we'll see a little bit in a second, with data lakes: instead of extract, transform, load, you do ELT, extract, load, and transform. You just dump all your files into a data lake, and then when you run your queries, you clean things up and put them into the standardized schema. So ETL is the classic way of doing this, but ELT is becoming more common. The main thing I want to convey is that you would have separate systems for these things.
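To make that transform step concrete, here's a minimal sketch of the kind of cleanup you might do, assuming two hypothetical per-game user tables with inconsistent column names being mapped into one standardized schema; all table and column names here are made up for illustration.

```sql
-- Hypothetical per-game tables with inconsistent naming:
--   game1_users(id, fname, lname)
--   game2_users(user_id, first_name, last_name)
-- Standardize them into one uniform schema during the transform step.
CREATE VIEW all_users AS
SELECT id      AS user_id,
       fname   AS first_name,
       lname   AS last_name,
       'game1' AS source
  FROM game1_users
UNION ALL
SELECT user_id,
       first_name,
       last_name,
       'game2' AS source
  FROM game2_users;
```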
Now, we haven't really talked about what the database in these OLAP systems looks like at a logical level. We talked about it being a column store; we talked about shared-disk versus shared-nothing; we talked about how to execute parallel queries. But how you actually model your data in an OLAP system is slightly different from how you would do it in an OLTP system. I'm using the word OLAP, but sometimes you'll see these things called DSS, decision support systems, or analytical systems, or data science systems. They all mean more or less the same thing: this is the repository where you store all the data you've collected over the entire history of your application, and you want to find new things in it. There are basically two modeling approaches for this: the star schema and the snowflake schema. And everyone always wants to know why Snowflake is called Snowflake; it's because they support snowflake schemas. So let's go through the star schema first; it's the simplified form, and the snowflake schema is an expansion of it. In an OLAP system, a common approach to modeling is this idea of a fact table and dimension tables. Remember before when I talked about schemas: you would have a table for customers, customers would have orders, and orders have order items. That's a tree structure, and it's very common in OLTP systems. But in analytical systems, oftentimes what you want to do is run analytics on one giant fact table, and then join against dimension tables to fill in supplemental information. So say in a really simple Amazon or Walmart data warehouse, you have a fact table of every sale that ever occurred: every single item anyone has ever bought. That thing is billions and billions, probably trillions, of tuples at this point; there's an entry for every single item anybody has ever bought. And instead of inlining in this fact table the product name, the product price, the location of the user, and all these other things, you store foreign keys to dimension tables that emanate out from the fact table. So you'd have a product dimension table with the product's category, name, and description; a location dimension with information about where the person bought the item or where the user is; a time dimension so that instead of just the exact timestamp of when somebody bought something, I can group things up per hour, per day, per week, per month, and so forth; and the same thing for additional information about the customer itself. So you have all these foreign keys, and now when you run your analytics, say, find the top 10 items bought by people who live in this location at this time, you're basically scanning through the fact table and joining with the dimension tables. You'd go look up the product dimension and do a join on that, using the join algorithms that we've talked about so far.
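Here's a minimal sketch of what that star schema could look like in SQL, with one fact table and the dimension tables it points to; all names and columns are illustrative, not from any particular warehouse.

```sql
-- Dimension tables (one level out from the fact table).
CREATE TABLE dim_product  (product_id  INT PRIMARY KEY, name TEXT, category TEXT, description TEXT);
CREATE TABLE dim_location (location_id INT PRIMARY KEY, city TEXT, state TEXT, zip TEXT);
CREATE TABLE dim_time     (time_id     INT PRIMARY KEY, hour INT, day INT, week INT, month INT, year INT);
CREATE TABLE dim_customer (customer_id INT PRIMARY KEY, first_name TEXT, last_name TEXT);

-- The giant fact table: one tuple per item ever sold, holding only
-- measures plus foreign keys into the dimension tables.
CREATE TABLE fact_sales (
  product_id  INT REFERENCES dim_product,
  location_id INT REFERENCES dim_location,
  time_id     INT REFERENCES dim_time,
  customer_id INT REFERENCES dim_customer,
  quantity    INT,
  price       DECIMAL(10,2)
);

-- "Top 10 items bought in a given location during a given month."
SELECT p.name, SUM(f.quantity) AS total_sold
  FROM fact_sales f
  JOIN dim_product  p ON f.product_id  = p.product_id
  JOIN dim_location l ON f.location_id = l.location_id
  JOIN dim_time     t ON f.time_id     = t.time_id
 WHERE l.state = 'PA' AND t.year = 2022 AND t.month = 11
 GROUP BY p.name
 ORDER BY total_sold DESC
 LIMIT 10;
```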
So why is this a good idea? The early OLAP systems, the early analytical systems, whether they were distributed or not, weren't really good at joining a lot of different tables; once you got past, say, five joins, the query plans usually got pretty bad. So if you can denormalize the tables down into one fact table plus a single outer ring of dimension tables, you get better performance because the joins are simpler. Yes? So the question is: how does this simplify the schema if you're merging things together? You still have five tables here, but think of it this way: take the product dimension. I could have a separate table for the category and another separate table for the product, and keep spanning everything out into multiple levels. The star schema condenses that down so you only have one level of dimensions going out. The alternative is exactly what I was just describing: the snowflake schema, where you do span things out and have multiple levels of dimensions going out like this. You still have the giant fact table, the thing you're basically going to sequentially scan, and you still try to do the joins against the dimension tables as quickly as possible to find the data you're looking for, but now you're allowing multiple levels of foreign keys coming out. So in the case of the product table, instead of having the category information and the product information in a single denormalized table, it's now normalized and broken out, as in the sketch below. Okay. This may seem kind of weird; it's a different way of thinking about how to design a database, and we haven't really talked about data modeling in this class (that's done over in Heinz College). But now we can build a database system that's optimized for ripping through a huge fact table and doing these joins as quickly as possible.
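As a sketch of the difference, here's the same product dimension in both forms, with made-up columns. The two versions are alternatives (you'd pick one), so they're given different names here just so the example is runnable: the star version keeps the category fields denormalized inside the product dimension, while the snowflake version breaks the category out into its own table one more level away from the fact table.

```sql
-- Star schema: category info is denormalized into the product dimension.
CREATE TABLE dim_product_star (
  product_id    INT PRIMARY KEY,
  name          TEXT,
  category_name TEXT,
  category_desc TEXT
);

-- Snowflake schema: the category becomes its own dimension, one more join away.
CREATE TABLE dim_category (
  category_id   INT PRIMARY KEY,
  category_name TEXT,
  category_desc TEXT
);
CREATE TABLE dim_product_snowflake (
  product_id  INT PRIMARY KEY,
  name        TEXT,
  category_id INT REFERENCES dim_category
);
```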
All right, so the first issue with these two approaches is normalization versus denormalization. We don't talk about the normal forms in this class because when you go out into the real world you'll never see them. We used to teach it; we don't anymore; it's a waste of time in my opinion. It's basically this idea of how much you can break tables up into separate pieces with foreign keys between them, and there's a whole bunch of theory and prior work on the optimal way to break things up. Again, nobody follows those rules exactly; it just sort of happens naturally. But a database administrator setting up a database for analytical workloads would follow either the star schema or the snowflake schema paradigm and model things that way. The issue is, if you start denormalizing and creating giant tables so that you don't have to do these multi-way joins out to the dimensions, you end up with a lot more redundant information, and now it's your responsibility in the application to make sure everything stays correct. Whereas in the relational model, the idea was that we break things up into atomic pieces to minimize the amount of redundant information, so that when we update a category name, everything pointing to that table sees the update correctly, rather than us having to update multiple entries. The other issue is complexity: snowflake schemas require more joins to get the data needed for a query, whereas star schemas usually require fewer, and can therefore be faster, with more accurate statistics and better query plans. This is still an issue in modern systems, but for the most part people did the star schema stuff in the late 90s and early 2000s. Since then, to support the more flexible nature of the snowflake schema, pretty much every OLAP system supports it. It used to be that systems couldn't support multiple levels of dimensions and would restrict you to a star schema; every modern system today supports snowflake schemas, and that's what people normally do. Again, I'm just exposing you to this so that when you see it in the real world, you'll recognize it: data warehouses have this giant fact table and these dimension tables. It's a common setup, and we can design our OLAP system to rip through it really quickly. This is also why a column store is super useful: going back here, if I'm trying to find all the items bought within a given location, then when I scan the fact table I only need to look at the location foreign key to do that lookup; I don't care about the other columns. So there are a lot of optimizations we can apply here. All right, so here's the problem we're trying to handle today. We have an application server, we have a bunch of data, and we want to run some join query: join table R with table S. Let's say both of these tables are partitioned on some column; we don't know what it is at this point. We need a way to produce the result: the data is stored across different partitions, but we need to produce a single result as if it were sitting on a single logical node. The easiest thing we could do is take the data from the different partitions stored on different nodes, copy or transfer it over the network to one node, do the join of R and S there, and produce the result. It would work; it would be correct. But it's obviously a bad idea and defeats the whole point of having a distributed database, because I'm not taking advantage of the additional resources. I'm drawing them as database icons here, but there are additional CPUs on those nodes, so maybe I could run the joins in parallel. I'm also assuming the node up here has enough memory, or even disk, to store all the different partitions of data. So what we really need is to take a query that scans a lot of data, farm it out to a bunch of different nodes, have them run in parallel, and then coalesce the results and put things back together.
And joins are the main thing we're going to worry about, because we want to make sure we don't have any false negatives. We don't want the case where there's some value in R that we try to join with S, and a match does exist, but because we sent that portion of the query to the wrong node, or didn't send the data to the right location, we come back with incorrect results. We want to avoid that. So today we're going to talk about execution models for distributed OLAP systems. We'll talk at a high level about how we do query planning, and then we'll spend most of our time on distributed joins, because that's the most important thing we do and where the system spends most of its time. Then we'll finish up talking about cloud systems and data lakes, just to expose you to these concepts: when people say data lake, it's not something radically different or brand new, it's just distributed OLAP system ideas applied to a bunch of random files in S3. I'll also say that just because we're running our databases in the cloud doesn't mean we don't care about all the things we've talked about this semester. We care about performance, data correctness, integrity constraints, all those things. Running in the cloud doesn't magically make things better; we still have to worry about all the same things, it's just that somebody else is running the hardware for us. All right, so the first design decision we have to make is whether we want to push the query to the data or pull the data to the query. I think we talked about this before, but it matters a lot for OLAP systems because we're reading a lot of data. The first approach is to push the query to the data: we send the instructions for the task, some part of the query to execute, to the location where the data is actually being stored. Now, in a shared-disk system we can't always do that, because S3 doesn't expose raw CPU where we can run whatever tasks we want, so it may be the case that we have to pull the data out to a compute node and run some portion of the query there. But we don't want to do what I showed before, where we copy all the data to a single node; we want to break things up as much as possible. The main goal is the same as when we did query planning before: we want to filter out and throw away as much useless data as possible before we start transferring data. It's ideal if you can push the query to the data, because we can do as much processing as we can to filter things out before we have to send anything over the network to another node or back to the client. The alternative, the reverse of this, is to pull the data to the query: as I was saying, the data is stored somewhere else, and we have to copy it to where we're going to run and compute the result. An obvious way to think about this is that I usually want to push the query to the data, because the size of the query itself is small relative to the amount of data I have to access or send around.
Most SQL queries are maybe a few kilobytes, tens of kilobytes at the largest, for the actual SQL string itself. The largest SQL string I know about: someone from Google told me they had one query that was 10 megabytes by itself, because it was for a dashboard that cobbles a bunch of stuff together with these giant IN clauses. That's not the entire story, though, because if you don't have a way to run code or run that query where the data lives, then you can't push the query to the data. And the lines actually get kind of blurred in the cloud, because some of the cloud vendors will let you push the query to their distributed object storage in a limited fashion: it does some local filtering, and then you pull the result to the node where you actually want to do further processing. In Amazon S3, there's a thing called S3 Select, where you can send simplified SELECT queries along with a request to S3 when you get data, and it will actually run your query and examine the data locally before it sends you the result. They natively support parsing CSV files, JSON files, even Parquet, which is a column-store file format we'll talk about in a second. Microsoft has something similar for Azure Blob storage: you can post a request to Azure Blob with a SELECT statement and it will do some localized filtering. I looked, and I don't think Google actually supports this. So this is what I mean when I say the lines are blurred: it's a shared-disk architecture, but you can push some portion of the query to the data. Oracle does this with Exadata, where they have FPGA accelerators on the shared-disk storage layer itself; it's like a separate rack, and they can push down projections or predicates. When they send requests for pages, they can send the filter along as well, and then the storage layer only sends back pages that have matches. The way I think about it is: I want to get page 123, and I pass along the predicate saying only give me this data if there's a match for this record. I think for all of these, they filter out anything you don't need and just send back the matching data; I might be wrong about that, they might have to send back the whole block and you do additional processing to find the thing that actually matched. But the idea is that this is ideal: you don't bring back data you're never going to need, because you let Amazon or Microsoft do the filtering for you.
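As a rough idea of what pushing a filter into the storage service looks like, here's the kind of SQL expression you can hand to S3 Select along with a GET request (S3 exposes the object as S3Object inside the expression); the bucket path and column names here are made up for illustration.

```sql
-- Hypothetical S3 Select expression sent with a request for
-- s3://my-warehouse/sales/2022-11.csv. S3 scans the object locally and
-- returns only the matching rows/columns instead of the whole file.
SELECT s.product_id, s.price
  FROM S3Object s
 WHERE s.location_id = '412';
```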
All right, so here's another way to illustrate what I've been talking about: pushing the query to the data. Assume we're a shared-nothing system. My query shows up; I want to do this join; R and S are partitioned across two different nodes. I send the query to the first node. The first node says, I know I need to do a join on R and S, but I don't have all the data, so it pushes a plan fragment down to the other node: hey, do the join on R and S for these IDs that I know you have. That node runs the computation locally and sends back just its result. The first node combines the results and sends them back to the client. We can do this because, in a shared-nothing system, we have compute and memory on each node, so each node can access its local disk and do this processing. Yes? The question is about how the top node knows the bottom node has IDs 101 to 200: could it be the case that the partition ranges don't match up exactly, or even worse, what if they're partitioned on completely different keys that aren't the join keys? We'll get to that in a second; you have to handle it. And yes, to your other point: this assumes range partitioning, so you know the boundaries. If you're doing hash partitioning, which I think is more common than range partitioning, especially in OLAP systems, you wouldn't pass an ID range because you wouldn't have one; the assignment is effectively random. The vision here is: I do the join on the bottom node for 101 to 200, and when it sends up its result, I union the join output from the top node with the join output from the bottom node, and that's the complete result with no false negatives. Yes? The next question is whether I'm assuming the top node has enough memory to store the intermediate result from its portion of the query plus the intermediate result from the bottom node. It doesn't have to; we can spill to disk. That's all the same buffer pool stuff we talked about before. Okay, so now pulling the data to the query. I'm showing this as a shared-disk system, but it doesn't necessarily have to be. The query shows up and the top node sends the plan fragment down to the node below, so that part is actually pushing the query to the data. Then each node goes out to the shared-disk storage, fetches the pages it needs, computes the join locally, and the bottom node sends its result up. So this is actually doing both: pushing the query from the top node to the bottom node, and then each node pulls the data it needs from shared disk, processes it locally, and sends results up. Question? What does pulling the data to the query look like in a shared-nothing system? It would be my example from before, where you copy the data from where it primarily resides to the node that needs to process it, and we'll see that we have to do this for some joins. So these are not mutually exclusive; you'd use some combination of both, and the database system should be in the best position to decide: given that the data is over here and I could run the query here or there, what's the most efficient thing for me to do? As I said before, data that arrives at a node from remote sources is just cached in the buffer pool, and we can write it out to disk as needed if we run out of memory. We want that, because for really large OLAP queries processing a lot of data, it might be a petabyte of data and we only have 100 gigs of RAM; we can't retain all of that in memory.
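Going back to that shared-nothing example, a sketch of what those plan fragments could look like if they were expressed as SQL, assuming R and S are both range-partitioned on id with the boundaries shown on the slide (table and column names are just for illustration):

```sql
-- Fragment executed on the node holding ids 1-100:
SELECT r.id, r.value, s.value
  FROM r JOIN s ON r.id = s.id
 WHERE r.id BETWEEN 1 AND 100;

-- Fragment pushed down to the node holding ids 101-200:
SELECT r.id, r.value, s.value
  FROM r JOIN s ON r.id = s.id
 WHERE r.id BETWEEN 101 AND 200;

-- The coordinating node then just combines the partial results,
-- conceptually: (fragment 1 result) UNION ALL (fragment 2 result).
```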
But now the problem is: if you have a long-running query, what do we do if one of the nodes crashes during execution? When we talked about transactions before, with two-phase commit, if the coordinator or one of the participants crashes before we get far enough along in the two-phase commit process, we just say the whole transaction aborted, throw an exception back to the client, and the client is responsible for rerunning the transaction. But these OLAP queries can run for minutes, hours, or days. It's less common than in the old days before column stores. I had a friend who was an early employee at Vertica; he went down to Australia to set Vertica up (Vertica is a column-store system; it still exists). For the Australian phone company, they replaced some row-store system that was taking about five days to run the monthly job that generates phone bills for customers; they switched over to a column store and it took hours. Nowadays computing has gotten so much better, and these systems are so good, that those kinds of queries probably take minutes or seconds, depending on the resources. But still, the query isn't running in 10 milliseconds like a transaction; these run for long periods of time, and therefore it's possible a node goes down. Since we're not running updates, we're not necessarily worried about the correctness of the data; it's more about whether we restart the query or start over. Typically, in most shared-nothing systems, the traditional approach is that as soon as a node goes down, the whole query fails. The reason for that design choice is that they don't want to pay the penalty of writing intermediate results out to disk as checkpoints along the way, because that slows things down, and if your cluster isn't that big, you assume your nodes aren't going to fail, so it's not worth paying the performance penalty. This changes in the cloud setting, where the fan-out can be quite large, so systems may take intermediate snapshots of results as they go and write them out to shared disk. That way, if some node goes down while running a task or the query, you spin up a new instance or node that can pick up where the other one left off. But of course you pay a performance penalty for this, so you want to be strategic about when you do it. The basic idea is: say I'm running on a shared-disk system, my query shows up at this node, and I send a portion of the join down here to be computed. But maybe instead of sending its result to the node at the top, the bottom node writes it out to shared disk. That way, if it crashes and goes away, the node at the top says: well, I know I asked this other node to compute this join; I'll go get the result from shared disk, and I don't have to recompute it from scratch. I assume most people have heard of Hadoop or MapReduce; it's still a thing, but it's not how you'd want to build a distributed OLAP system anymore.
But when that first came out, the original MapReduce model designed by Google, they were taking checkpoints and writing out to shared disk at every single step of the query pipeline, which was super slow. They did this because Google assumed they were running on hundreds and hundreds of machines where at any time one could go down, and instead of restarting the entire query, you had the checkpoint on shared disk with HDFS, so you picked it back up and kept running. Modern systems are a bit more strategic, as I said: you don't want to write everything out to disk at every single step of the query pipeline, but you also don't want to restart from scratch if there's a crash. So there's a middle ground where you do some checkpointing, but not always. Okay. The next question is how we actually do query planning for this. All the optimizations we talked about before, the search for the right join ordering, predicate pushdown, projection pushdown, all those techniques still apply in this world. The thing that changes is that now we have to be cognizant of the network cost of sending data from one node to the next. So in the query optimizer, before we maybe just cared about how many blocks we have to read and write from disk when we do a join; now we also have to be aware of where the data is located and where the result is going to go or come from, and account for that in the cost model. As I said, query optimization and query planning are hard; distributed query planning is even harder, and there's no magic bullet to make it work. There are systems like DB2 that, when the system boots up, will run some quick micro-benchmarks to measure the latency between different nodes, and they can use that in their calculations of query cost. All right. I showed before that one node sends a request down to another node, and I said it was a plan fragment, but I didn't really define what that physically looks like. There are two approaches to sending plan fragments to different nodes in a distributed system. The first is to send physical operators, like we showed before when we talked about pipelines: I break the query plan up into its physical operators and send those down, and the other node just takes that query plan and executes it as defined on the data it has locally. This is what most systems do: the SQL query shows up once, you parse it, plan it, generate a bunch of query plan operators, break them up, assign them to different physical nodes, and send those out. The alternative is to take the SQL query on the node where it shows up, parse it, plan it, generate physical operators, and then convert the plan back into SQL again for each of the nodes that is going to execute the query. The idea here is that if you do all your parsing and planning at a single location, the optimizer's decisions are based on whatever statistics that node has collected, which may not be an accurate representation of what's on the other nodes.
So instead, if I can convert my plan fragment back into SQL and send that over to the node that's actually going to run it, that node can run its own optimizer with its own locally collected statistics and potentially make better decisions, because it has a local view of the data and how it actually wants to produce the result. It may be the case with this second approach that I send different SQL fragments to different nodes, and one node says, my data looks like this, I want to use a sort-merge join, while another node says, my data looks like this, I want to do a hash join. At the end of the day it doesn't matter, as long as I get the correct join results when I combine things; each node can make its own local decisions. The only two systems I know of that do this are SingleStore and Vitess. SingleStore is a distributed real-time analytical system; it used to be in-memory and used to be called MemSQL. When they send the SQL query, they annotate it with non-standard SQL constructs to say things like: run this SQL query, and by the way, send the result to this other node here, or send it back to a single location. So you can put hints into the SQL statement itself to tell the local node where to send things. And Vitess is a middleware system that does distributed MySQL; it came out of YouTube. What's that? You're surprised by this? YouTube's huge, right? It runs MySQL. That's a weird reaction; you make it sound like they wouldn't have a big database. Okay, all right. It's since been spun out as PlanetScale, which is a newer startup, but it's the same thing: the SQL query shows up, they generate the distributed query plan, and then they convert it back to SQL because the different nodes in the cluster are just running stock MySQL; those can't accept query plans, they have to take SQL and run it. So the basic idea, if you're sending SQL queries, is this: my original join arrives at the planner node, but my data is broken up into different partitions. I generate the physical distributed query plan that I want to run for this join, then I reverse it back into SQL and send each node a version of the query that targets the data it has. Each node can then parse it, plan it, and optimize it just like before, as if it were a single-node query, and send its results back to a single location where some coordinator node unions all the results together. I'd also say the advantage of this is that you don't need super accurate statistics at the centralized coordinator node, because each node is going to analyze the data it's responsible for and can therefore make better decisions about how to do query optimization.
Okay, so obviously joins are the most important thing we want to do in an OLAP system, and the efficiency of the joins is going to depend heavily on what the query actually wants to do and how the data is actually partitioned. As I showed in the beginning, the easiest way to do a join is to copy all the data from the different nodes back onto a single node and run it there, but of course that's not realistic; it won't scale. So we need a better way. There are going to be four different scenarios we have to deal with for a distributed join. The basic idea is that we want to join two tables R and S, and we need to make sure the right tuples from R and the right tuples from S end up at the same location, the same node, so that when we do a local join using any of the join algorithms we talked about before (usually it'll be a hash join, but it doesn't really matter), we're guaranteed that the data from R that could potentially match data in S is located at the same node. Otherwise, if the tuple with id = 1 for R is at this node and the tuple with id = 1 for S is at another node, then when I do the join I'm going to get false negatives, and we want to avoid that. So once we get the data to the right nodes, ideally while keeping things distributed, we can run a local join algorithm, produce results, and send them back to some location to coalesce. We're going to go through the four scenarios from best case to worst case, and the technique will be the same whether it's a shared-disk or shared-nothing system: in the shared-disk case it's about what you send from shared disk to which compute node; in the shared-nothing case, assume the compute node already has the data locally, but it may need to send it somewhere else. All right. Best case scenario: one of the tables we want to join is replicated at every single node in the cluster. Going back to the dimension tables I showed in the beginning, say there's a dimension table on zip codes; there are about 40,000 zip codes in the US, that table is super small and only updated by the post office a few times a year, so I can easily replicate it on every single node. Now when I do my join, all the data I need from one of the tables is right there, and I don't need to communicate between nodes to send data around. So R is partitioned on id, 1 to 100 and 101 to 200, and S is replicated, meaning there's a complete copy of table S on every single node. To do the join, each node does the join locally on R and S, and then we coalesce the results: this node transfers its result to the other node, we union them together, and that produces the final, correct result. If it's shared-nothing, assume each node has a local copy; if it's shared-disk, every node pulls S into its local memory or local disk.
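Some systems let you declare that replication when you create the table. For example, Redshift has a DISTSTYLE ALL option that keeps a full copy of a table on every node (other systems have similar notions, sometimes called "reference tables"); here's a rough sketch using that style of syntax with made-up columns.

```sql
-- Redshift-style syntax: DISTSTYLE ALL stores a complete copy of this small
-- dimension table on every node, so joins against it never need to move data.
CREATE TABLE dim_zipcode (
  zip   CHAR(5),
  city  VARCHAR(64),
  state CHAR(2)
) DISTSTYLE ALL;
```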
All right, the next case is when the tables are partitioned on the join key and the partition ranges line up. In this case, R is partitioned on id and S is partitioned on id, and although the ranges aren't identical, if we make R the outer table it will cover everything we need. Because I'm doing an inner join and not an outer join, I only care about tuples in R that match something in S, so every node has all the data it needs: I do my local join, send the result from one node to the other, coalesce them together, and I'm done. There's no intermediate transfer between the two nodes. Yes, the question is whether we're potentially missing data here since this is an inner join; that's a typo on the slide, it should say 100, so you would have a match. I'll fix it in the slides. Again, it's an inner join, so I only care about what's in R on the first node, and I'm guaranteed that if a matching tuple exists, I will see it in S. Okay. The next scenario is when the tables are partitioned on different keys. What we do is figure out which of R and S is smaller, and for the smaller table, each node sends a copy of its portion of it to all the other nodes in the cluster. This is called a broadcast join. It's typically used with hash joins, so if you ever see something called a broadcast hash join, or somebody says broadcast join, it's usually a hash join that's doing this broadcasting. In this case, R is partitioned on id, which is what we want to join on, but S is partitioned on value. So we can't just do the local join, because we don't know whether the S tuple with id 1 is on this node or the other node; we'd get a false negative. Because S is the smaller table, this node sends its portion of S to the other node, and likewise that node sends me its portion. Now I'm back where I was in the first scenario, where S is effectively replicated on every node. I do my local join, every node produces its local result, the first node sends its result to the second, and I coalesce. So we had to transfer data, but only for S, not for R.
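Some engines let you ask for this behavior explicitly. For instance, Spark SQL accepts a broadcast hint that tells the planner to ship the named (presumably small) table to every node and do a broadcast hash join; a hedged sketch, with illustrative table names:

```sql
-- Spark SQL-style broadcast hint: ship the smaller table s to every node
-- so each node can join its local partition of r against a full copy of s.
SELECT /*+ BROADCAST(s) */ r.id, r.name, s.value
  FROM r JOIN s ON r.id = s.id;
```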
Yes? The question is: what is the objective of partitioning by value? Why would I want to partition by value? Maybe there are a thousand other queries that also do lookups on value, but this one query shows up and wants to join on id. The larger question, which we've touched on all semester, is that the relational model gives you this nice abstraction where you don't care how or where the data is actually stored, but that means someone has to figure out what indexes to build and how to partition the tables. It's an NP-hard problem to figure out, for a given set of queries, the optimal partitioning key for a table, and as I said before, even if it's optimal for most of the queries, it may be the worst choice for a bunch of other queries that you still have to support. So in this case, someone decided to partition table S on value, but this query shows up and wants to join on id; the database has to figure out where the data is actually located and compute the result. There are some tools that will try to figure out the optimal partitioning key for you, but again, it's not guaranteed to be the best choice for all possible queries. Okay. The last scenario, the worst case, is when neither table is partitioned on the join key. Then I need to reorganize the data, or shuffle it, as it's called (a shuffle join), and resend it across all the different nodes so that we can compute the join locally like we want. In this case, R is partitioned on the name attribute, S is partitioned on value, and I want to join on the id field, so I can't do a local join yet. I basically have to do a reshuffle, a repartitioning: each node does a sequential scan over its local data, partitions it by id, and sends those buffers out to the appropriate nodes. So R's tuples within one id range go over here, the tuples in the other range go there, and then I do the same thing for S. Now I'm where I was before: the data for R and S at each node matches up by partition range, I compute the local join, and then send the results to be coalesced. Yes, the comment is that you'd have to figure out how many nodes you have for the join, figure out what the value ranges are, and then send the data out to those nodes. This is where hash partitioning makes things easier: if I have 10 nodes, I just hash the key mod 10, and that tells me where it needs to go. Yes, the next question is: if I have to do a shuffle, who decides who gets what range? You'd have a coordinator node, the centralized planner, the node that got the query, does the parsing, and runs the optimizer. It will see that the data it wants to join is not partitioned on the join keys, and therefore it has to do this shuffle step. So as part of the query plan, it says: this node sends data in these ranges to that node, or it's hash partitioned that way. This is what I was saying before: if you send the query plan fragment as a SQL statement, you can also instruct it in the plan to send the output of this scan, or this portion of the query plan, to this other node, because that node will be waiting for it. A sketch of one node's side of the shuffle is below. And this all ties together with the query fault-tolerance stuff: the shuffle phase is not cheap, especially if the data is really big. So if one node goes down and I've stored my shuffled, repartitioned data in some intermediate storage, I don't have to go through the entire shuffle phase again. Think in extremes: if it's 10 petabytes of data and I have to do this shuffle, I don't want to redo it every time a node goes down.
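As promised, here is a minimal sketch of what one node's side of the shuffle could look like if you wrote it as SQL, assuming a two-node cluster and range repartitioning on id; the table names, ranges, and staging-table idea are all illustrative, since real systems do this routing inside the executor rather than with literal CREATE TABLE statements.

```sql
-- On each node: scan the local partition of R (partitioned on name, not id)
-- and split it by the join key's target ranges, staging the pieces that
-- belong to each destination node.
CREATE TEMP TABLE r_for_node1 AS
  SELECT * FROM r_local WHERE id BETWEEN 1 AND 100;    -- stays on / goes to node 1
CREATE TEMP TABLE r_for_node2 AS
  SELECT * FROM r_local WHERE id BETWEEN 101 AND 200;  -- shipped to node 2

-- With hash partitioning you would instead route each tuple by something
-- like MOD(HASH(id), num_nodes); the hash function's name varies by system.
```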
Question? The question is: going back to the beginning of this example, why partition both R and S; why not just partition R and then do the broadcast thing from before, where I send S everywhere? Again, think in extremes. If R is 10 petabytes and S is 10 petabytes, I don't want to send 10 petabytes to every single node. The broadcast join, going back here, sends S to every single node. I'm showing two nodes, but imagine 1,000 nodes: this one node has to send its portion of S to 999 other nodes, so there's a lot of transfer going on. With the shuffle, it's a targeted transfer: each node sends only the data within its partition to the specific nodes that should own it after the shuffle. I'm not sending a complete copy to every node; I'm saying this node gets this portion, that node gets that portion. Whereas with the broadcast join it's (N minus 1) times the data size in transfer. So there's one optimization we can do in some scenarios, called a semi-join. The idea is that it's almost like vertical partitioning or projection pushdown: instead of sending a complete copy of a table from one node to the next as I'm doing my join, I send just the portion of the data that I actually need to compute the join, and the result I get back doesn't need to be a complete materialized tuple; it's just the IDs, the primary keys, that actually matched for the join. Say I'm doing a join where I join R and S on id, and the output of the query only needs R's id; I don't actually care what value S has, I just need to know that there is a match. Without a semi-join, the way I'd have to compute this is to send all of S to this node, or all of R to that node, to compute the join, and you're sending the entire materialized tuples around. But with a semi-join, I really only care about whether there's a match at the other node for something in R, my outer table. It's essentially this query: select R's id from R where there exists some tuple in S with a matching id; it basically just asks whether there's a match. So now when I want to do the join at the other node, I only send the R id column to it; it does the join, which is basically this matching, and I only send back the R ids that matched, not the entire tuples. In SQL there's no standard way to declare a semi-join the way you can declare an inner join versus an outer join; in most systems, almost all systems, you can't ask for a semi-join explicitly. Impala does support it, and I think there's another system from Microsoft that does too. But underneath the covers, query optimizers can figure it out and say: I know you don't actually care about the values from the inner table, so they have a semi-join implementation they can use to minimize the amount of data transferred from one node to the next. This is a very common optimization in an OLAP system because, again, network transfer is usually the most expensive thing. If it's in the same data center or the same rack it may be less of an issue, but certainly if you're going across a wider network, it's going to be slow, so the semi-join optimization is the way to go.
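The rewrite being described looks roughly like this in SQL (column names are illustrative); some dialects, such as Impala and Hive, also expose it directly as a LEFT SEMI JOIN:

```sql
-- Equivalent formulation of the semi-join: only R's ids come back,
-- and only the id column of S ever needs to be consulted.
SELECT r.id
  FROM r
 WHERE EXISTS (SELECT 1 FROM s WHERE s.id = r.id);

-- Dialect-specific explicit form (e.g., Impala / Hive):
-- SELECT r.id FROM r LEFT SEMI JOIN s ON r.id = s.id;
```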
All right, I want to finish up now by talking about cloud systems. Again, this is just to expose you to a bunch of keywords that people use in the modern cloud setting for databases, the stuff you'll see if you read Hacker News or whatever. I'd say Twitter, but Mastodon... and Twitter runs MySQL, by the way; yes, that's true. We talked before about shared-disk versus shared-nothing, pull the data to the query versus push the query to the data; the lines get very blurred in modern systems, so there's not always a clean divide. A very common approach now is to expose what are called database-as-a-service offerings or products to users. The idea is that instead of you spinning up your own EC2 instance, downloading Postgres or MySQL, and managing it yourself directly in the OS, you give the vendor your credit card and they give you a connection URL, a JDBC connection URL or whatever you want to use, and the vendor runs the database system for you. You can't SSH into the box, but for all intents and purposes it's a database: from the application's standpoint, it doesn't know or care that it can't SSH in; it just knows here's the database I can talk to. There are basically two approaches to this. The first is a managed database system: some vendor runs an off-the-shelf database system in a virtualized environment for you, gives you a nice UI, maybe takes care of backups and things like that, but they didn't make any modifications to the database system itself to make it cognizant that it's running in a cloud environment, on S3, or on EBS. It just sees local disk, memory, and compute; it doesn't know anything about the cloud or about disaggregated storage. This would be like RDS from Amazon, or Cloud SQL on GCP, where I think they just take stock MySQL or stock Postgres, run it inside a container or a VM, and it writes out to EBS as if it were local disk. There's another class of systems that fall under the term cloud-native database systems. Cloud-native is not a scientific term; I can't say it has to have exactly these features to count; it's sort of a marketing term. The basic idea is that they've either written the system from scratch or made heavy modifications to what used to be an on-prem system, something you'd otherwise run as a managed database system, so that it is now cognizant that it's running in the cloud. Usually that means switching over to a shared-disk architecture, and maybe also being aware that there's local disk on the EC2 node that can serve as a fast cache, while the primary location of the database is on S3 or something like that. Snowflake is the best example of this; it was probably the first distributed OLAP system built specifically for the cloud. BigQuery is based on Dremel, an internal system that Google developed; I'm not saying it looks like Snowflake, but it's designed to run on Google's own cloud infrastructure. Redshift didn't start off cloud-native: Amazon bought a license to ParAccel, which was a fork of Postgres, a shipping version of Postgres from the 2000s, from a startup that didn't really go anywhere. Amazon licensed it, slapped it up there, and it was a shared-nothing system.
They made a ton of money on it, and then they realized they should probably make it look more like Snowflake, and since then it's been largely rewritten to be cloud-native. SQL Azure is again a sort of hacked-up version of SQL Server that's been made cloud-native. So is one approach better than the other? If your primary business is to run in the cloud and not support on-prem, you'd want to start with the cloud-native approach. Snowflake doesn't sell an on-prem version; they only sell it in the cloud. Another term that's being thrown around is serverless databases. Typically when people think of serverless, they think of something like AWS Lambda functions: you install a function into AWS, there's an endpoint URL you send data to, it fires up your function in a container, runs it, sends back the results, and then the container dies and goes away. So now we apply the same idea to a database system. Of course, superficially a serverless database doesn't make sense, because you kind of need a server to run a database. But what it really means is that they're able to turn off the compute layer. Let me show you what I mean. Say I'm running on either a shared-everything or a shared-nothing system. When the application sends a query request, it lands on this node, the node processes the query and sends back the result. But if the application then goes idle, because the user falls asleep, goes to the bathroom, checks their phone, whatever, since this node is storing the data at its location, I have to keep the instance up. I provisioned it, it has to run forever; the clock is running even though I'm not running any queries, and I'm paying for the compute and the storage because it's waiting for the next request to show up. In a serverless architecture, you typically (I think always) separate compute from storage. So I run my queries just like before: the query goes to the node, the node gets the data it needs from the shared disk, and sends back the result. But now, when the application falls asleep and I'm not running any more queries, I can kill the compute layer: terminate the VM on EC2, or turn off the pod in Kubernetes. Because the state of the database is stored on shared disk, I can flush out the dirty pages in my buffer pool and page table, the intermediate state of the system, write that out to shared disk, and kill the compute node, so I'm no longer paying for it. I still pay for the storage, because I don't want to delete my data. Then whenever the application wakes up and sends another query, I spin the node back up, page the contents of the buffer pool back in to make it look like it never turned off, run the query, and produce the result. That's the extreme example, where you literally shut down the compute layer and retain the state of the database system, as captured in the buffer pool, on disk so you can bring it back in later. Oftentimes what actually happens is that they run multiple customers' databases on the same node, so they just flush the pages belonging to one customer out to disk and page them back in later, while the server itself keeps running. So this is becoming more common now.
So there are systems like Fauna and Neon that I would say were written from scratch to be serverless. Amazon has gone back and retrofitted Aurora to have a serverless version, same with Azure; CockroachDB and PlanetScale did the same thing. They were originally not serverless architectures, and they now provide it as an offering. The basic idea is, why pay for the compute if you're not going to run any queries? So they'll basically turn your node off; whether that's logical or physical doesn't really matter from the application's perspective. Of course, the first query pays the penalty: if I go to sleep for an hour and you flush out my buffer pool, the first query that shows up afterward has to fetch all that and bring it back in, so it'll have higher latency because I'm fetching things from disk, and then subsequent queries that come immediately afterwards benefit from things being hot in the cache. So this is a huge cost savings if I'm not hitting the database all the time; I can turn things off and save a lot of money. But if you're blasting it with a million queries a second, you pay a premium for this architecture, and you're probably better off with provisioned instances.

All right, so the next common trend we see in a lot of systems is what's called a data lake. Basically it just means there's some kind of centralized repository at your company, your startup, your organization, whatever it is, where people can dump in all sorts of different types of data: structured data, the things that have been parsed; semi-structured data, maybe log files or something; and unstructured data like binaries, images, and video. You dump these all into a centralized repository without having to define a schema ahead of time or ingest them into the database system and convert them to its internal format, and then I can come along and run queries on top of these files as if it were a full data warehouse that had ingested everything.

So to say that again: without a data lake, the kind of system we've talked about so far, a data warehouse or OLAP system, would look like this. Before I can put any data into the database, I've got to call CREATE TABLE. That goes to the compute layer, which updates the catalog and says, okay, now I know about table foo, it has these columns and these types, and it figures out where it's actually going to store them in files or on disk in the storage layer. Once I've done that, now I can call INSERT. The nodes look up in the catalog: okay, I want to insert into foo, what do the columns look like, where is it actually stored? And then the node knows where to actually write things. Same thing with SELECT: I go to the catalog and then I can scan all those files. So even if the storage layer is S3 and it's just writing out blobs, the database system itself is responsible for the binary format of what goes into those blobs, and that structure is defined by the schema.

In a data lake, that structured storage layer goes away. The idea is that I have these S3 buckets with a bunch of CSV files, JSON files, Parquet files, column-store files, whatever. Now when my SELECT query shows up, this node still has to figure out, from some kind of catalog, okay, you want table foo, it's located in these files in these S3 buckets or these blob locations, right? Maybe there's some information about what the schema is; maybe there isn't. But I go to this catalog service, ask where the data I want is, and then the node knows how to go to the data lake and get it.
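To make that concrete, here's a minimal sketch of the data-lake pattern: run SQL directly over files sitting in object storage, without defining the table or ingesting anything first. I'm using DuckDB here only because it fits in a few lines; Trino, Presto, or Spark do the same thing at cluster scale. The bucket path is hypothetical, and real access would also need S3 credentials configured, which I'm omitting.

```python
# Sketch: SQL directly over raw Parquet files in a data lake (no CREATE TABLE,
# no ingest). Assumes a DuckDB build with the httpfs extension available;
# the s3:// path is a made-up example and credentials are not shown.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET s3_region='us-east-1';")

# The "catalog" here is nothing more than a glob over files in the lake.
rows = con.execute("""
    SELECT user_id, COUNT(*) AS num_events
    FROM read_parquet('s3://my-data-lake/events/*.parquet')
    GROUP BY user_id
    ORDER BY num_events DESC
    LIMIT 10
""").fetchall()
print(rows)
```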
Yes? So the question is, if the data lake allows unstructured and semi-structured data, does that mean there's no longer a single data format or table format? Yes. Is it like MongoDB? No, think of it as even worse. Think of it like a directory on your laptop, like Google Drive or Dropbox or whatever the kids are using these days.

Yes? So this question is, how do you actually translate a SQL query into something that can hit unstructured data? You'd potentially have to have functions built in that know how to parse CSV, parse JSON, and so forth. And you'd have to say, okay, if I'm looking for this column of this table in some of these JSON files, and a JSON document is missing that attribute, then I ignore it. And if it's a bunch of random binary images, you'd just ignore those. So this catalog thing, and I'm being vague here, is a catalog service that says, okay, you want table foo? It's defined by these files in your data lake. There may be a bunch of other files in there, like images or whatever, and it just knows to ignore them.

So your question is, instead of this data lake approach, what if you had a key-value store where, for a given key, you get back data? That's basically S3 or Azure Blob, yeah. But again, someone needs to understand the contents of what's inside those objects.

So a bunch of these logos you're probably familiar with: Snowflake, Databricks, Google BigQuery, Amazon. Trino is a fork of Presto, which came out of Facebook. Again, I screwed this up before: there's PrestoDB and PrestoSQL, and they don't like each other, because Facebook owns one and not the other. So the people who aren't at Facebook made Trino, but they're basically the same thing. Which one's better? It's unclear. And then Hive also came out of Facebook. Hive compiled SQL queries into MapReduce jobs, which is a terrible idea; Facebook realized that and then built Presto. But anyway, to his point, someone has to figure out what's actually in these files, and these different OLAP systems have engines that can parse Parquet and these other data files and then run the SQL query on them as if it were the system's internal format. The question is, why is it a terrible idea to build SQL on top of MapReduce? Let's save that for next week.

Yes? So his question is, what's the advantage of storing the raw data and processing it on the node, rather than pre-processing it once and then reusing it? That's the ETL versus ELT thing I mentioned before. You could take unstructured data, say log files from a web server, parse out the date field and so forth, and write out new files. But that assumes that whatever the query is, it's going to want the data processed that certain way. The argument for the data lake is: just put all your data here, and then everybody who wants to consume this data can write their own transformation pipeline to convert the data into the format that they want.
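Here's a toy illustration of that consumer-side parsing, the "T" happening at read time: scan a pile of newline-delimited JSON files, skip documents that are missing the attribute you want, and skip files that aren't text at all, which is roughly what the engine has to do for you. The file names and the attribute are made-up examples, not anything from a real pipeline.

```python
# Toy ELT-style scan over semi-structured files in a lake directory.
# Documents missing the requested attribute are ignored; binary blobs
# (e.g. images) are skipped entirely. File names are hypothetical.
import json
from pathlib import Path

def scan_column(files, column):
    for path in files:
        try:
            text = Path(path).read_text()
        except UnicodeDecodeError:
            continue  # not text at all (an image or other binary): skip the file
        for line in text.splitlines():
            try:
                doc = json.loads(line)
            except json.JSONDecodeError:
                continue  # a line that isn't valid JSON: ignore it
            if column in doc:  # documents missing the attribute are skipped
                yield doc[column]

# e.g. total up a "purchase_amount" field across whichever files happen to have it
total = sum(scan_column(["events_2022.json", "events_2023.json"], "purchase_amount"))
print(total)
```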
You could also have your data scientists do some processing to transform the data, put it back in, and make it available to other people, who could then discover your processed data through the catalog. That's the dream, the vision of this. At the end of the day it comes down to humans: how good are your engineers and data scientists going to be about keeping track of the provenance of the data they're writing out and providing documentation so other people can use it? It's hit or miss. The advantage of this is that you could have your application or some microservice writing raw files out to S3, and then something running Presto or Trino could just run SQL queries and parse them without having to first ingest everything through a whole pipeline. You just plop it in S3 and let everyone else consume it. Again, there are pros and cons; I'm not saying one is better than the other. It's certainly a lot faster to just dump files into S3 without defining a schema, and worry later about keeping the schema in sync between different versions of the files, than to define the table ahead of time and do a bunch of work to clean things up before putting it in.

Now, we've been dancing around this; I've mentioned it a little already. The data we store in a data lake or in an OLAP system doesn't always have to be just text files or CSVs. Traditionally, most database systems have their own on-disk binary format for pages. Think of BusTub: it wrote pages out to disk in this little .db file, which we should probably call .bustub. We should fix that. There's a bunch of .db files, and that's a proprietary format that only the database system knows how to maintain. Actually, my favorite story about SQLite is about the internal files SQLite writes out to disk. The creator of SQLite used to have them be .sqlite, but then he started getting phone calls from people, because a bunch of different desktop applications were using SQLite, Photoshop or something like that. An application would write out .sqlite files and then crash. It wasn't SQLite's fault; the application was buggy. But it would crash, the user would look in the local folder, see a bunch of .sqlite files, and assume that was the problem. Then they'd find his phone number and call him to complain about SQLite, and he's like, I don't know what you're talking about. So in the code you can see that the temporary files are named with SQLite spelled backwards, etilqs, and that stopped people from calling him because they don't know what it is. Anyway, that's the SQLite proprietary format. Postgres has its own proprietary format; everyone has their own thing. The problem now is that if I want to share data between these different systems, I have to take it out of the SQLite format or the DuckDB internal format and convert it through some process into another internal format so that some other database system can consume it. Or, alternatively, I just write it out as a bunch of CSV or JSON or XML files, but that's going to be slow because you're taking binary data, converting it to text, and then converting it back into binary data. So that brings us to formats that aren't really new anymore; they've been around over 10 years now.
There's a bunch of open-source binary file formats that these systems can natively consume and generate, so that I can write out a bunch of files from my one system, write them to S3, and then have some other system know how to consume them and run queries on them. The most famous one is probably Apache Parquet, which came out of Twitter and Cloudera. ORC is one that came out of Facebook and Hive. It's essentially the same thing; think of it as a columnar format that does compression, zone maps, and a bunch of other things. The idea is that there's a standardized format, and then a bunch of client libraries written in different languages like Rust or Java or C++ that you can use to read data out of these files and write them. CarbonData came out of Huawei. I'm not saying it's a scam, but it doesn't work, and I don't think anybody uses it. You try to run the code and it doesn't work; the open-source version might be different from what they have internally, but the open-source version doesn't work. Iceberg came out of Netflix. This is Parquet plus some additional files for ingesting data, because these file formats are write once, read many, meaning once I create the Parquet file, I can't make incremental updates to it. Again, think of S3: I can't make fine-grained changes to an object, I can only put it, get it, or delete it. So I want to generate one of these Parquet files and write it all out at once, and if I need to make incremental updates, I'd use something like Iceberg to keep track of them and merge them in later. HDF5 is an older format from the '90s that's primarily used in HPC or scientific workloads. Think of the Large Hadron Collider: you smash atoms together, it produces an enormous volume of measurements, and those would be stored in HDF5. Apache Arrow is a newer one that came out of the Pandas and Dremio folks, and it's an in-memory representation of columnar data. The idea here, which is kind of cool, is that you can keep your data in memory in the Arrow format and either do fast transfers over the network to send it to another node, or potentially have another process running on the same box peek into your memory and consume the Arrow data directly, without having to serialize it or copy it out. So Parquet is probably the most common one of all of these, and ORC is probably the second most common. Again, all these different OLAP systems will have their own connectors or know how to parse these files and consume them.
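As a small sketch of that interchange idea: one system writes a Parquet file with a client library, and any other system with a Parquet reader can consume it without a round trip through a slow text format like CSV. This uses the pyarrow library; the file name and the toy data are arbitrary.

```python
# Sketch: write Parquet from one "system" and read it back from another,
# using the pyarrow client library. File name and data are made up.
import pyarrow as pa
import pyarrow.parquet as pq

# "System A" produces some columnar data and writes it out as Parquet.
table = pa.table({
    "user_id": [1, 2, 3],
    "first_name": ["alice", "bob", "carol"],
})
pq.write_table(table, "users.parquet", compression="zstd")

# "System B" (here pyarrow again, but it could be DuckDB, Spark, Trino, ...)
# reads the same file, pulling only the columns it needs.
subset = pq.read_table("users.parquet", columns=["user_id"])
print(subset.to_pydict())
```

Reading only the columns you ask for is the point of the columnar layout: another engine can scan just `user_id` without touching the rest of the file.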
The other pieces we've talked about are about breaking the system up even further. For system catalogs, HCatalog is the most common one; it came out of Hive. Again, it's the metadata about the data: where the data is located, what format it's in, plus potentially some additional statistics. Google, Amazon, and Microsoft have their own catalog services. To run the nodes, Kubernetes is the hottest thing now; it's the de facto standard, so everyone uses that, and the cloud guys have their own stuff on top of it.

And then we didn't really talk about query optimizers as a service. In some systems, since building a query optimizer is super hard, instead of everybody replicating that work over and over again, you pull the query optimizer out and make it a standalone component, so I could build a new database system that reuses one of these query optimizers to parse the SQL and generate the query plan, and then I have an engine that actually runs that query plan itself. The most famous one is Calcite, which came out of LucidDB. LucidDB was a startup that failed, but then they pulled out Calcite and that became the optimizer-as-a-service piece. Orca came out of Greenplum. Calcite is more common, but Orca is another alternative. So in theory now, you could kind of build your own OLAP system by reusing this piece and this piece and this piece and munging it all together in a cloud service, without having to write everything from scratch over and over again, which is kind of cool.

Okay, so just to finish up: I think there's a lot more activity and a lot more money sloshing around in the OLAP market than there is in the OLTP market, and I think the cloud has made this so much easier. So there are a lot of vendors, they're all hiring, and there are a lot of interesting problems still in this space. But of course, now that you have a lot of money and a lot of data, that makes things a lot harder. So go get jobs and help these guys out. All right, so the question is, why is there more money in OLAP than OLTP? Because there are more people looking to ramp up their OLAP data solutions than their OLTP ones. For both legacy applications and new applications, people want to get OLAP stuff up and running, right? In OLTP, people are way more conservative about running transactions. If I'm a 100-year-old bank and my IMS database is running all my transactions, and it's been that way since the '70s, if it works I don't want to touch it, because if I break it and can't run transactions, I can't make money. So people are more open to new OLAP systems than new OLTP systems.

Okay, so next class will be the last lecture, and then next week will be all the fun stuff. We'll talk about embedding application logic into the database server. And I will fully admit that I'm going to tell you all these cool things you can do, and I think it's the right thing to do, but when you go out into the real world, nobody will actually do it. Okay? So enjoy that. Hit it.