And a round of applause for DTGPL, you're back. You feeling better? Yeah, kind of. Apparently your mom saw the video and was concerned. I was doing it all by myself. So your mom watches these videos? Yeah, she just watches them to make sure that I'm alive. Okay. But does she know everything that's going on with you? I mean, I try to tell her the parts that I'm okay with her knowing. Does she know about the pregnancy scare? All right, all right. All right, guys, let's get started.

Again, this is the same as last time. Homework 5 is due on Sunday. Project 4 is due the week following that. As posted on Piazza, for Monday's class we'll not be here; it'll be on Zoom, and that's the guest speaker from SingleStore. And please also go vote for the systems for the speed run on Wednesday next week. Some people are voting for SingleStore, but we have a whole lecture on SingleStore, so you don't need to vote for it again; we're not going to cover it, so put other systems. The final exam will be on the 12th, and as always, we'll post details on Piazza as well. You can sign up to be a TA. And please also fill out the faculty course evaluations. I say this every year: please be brutally honest. Tell us what sucks about the course, what you don't like, because we actually listen to it and incorporate your feedback into future versions of the course. In previous semesters, Project 2 used to be, hey, let's build a concurrent B+tree, and students complained that that project was super hard and that three and four were comparatively easier. So we started to spread out the load throughout the entire semester. That's a good example of something we listened to. One year, a student started to give me a psychological evaluation of myself, and that I did not listen to. But if there are other things you don't like about me, please go for it, okay?

Any questions about what's remaining or expected for you guys for the rest of the semester? All right. So today's class, we're going to talk about distributed analytical database systems. Before we get into that, I want to talk a little bit about how these analytical databases are being used in modern application scenarios or setups, and that will help motivate why we may need a distributed database.

Sorry, yes. The question is whether the final exam is comprehensive. No. But there are things you obviously need to be aware of in order to do the questions. If you're brain dead and completely forgot SQL, you're going to have problems, right? Or if you don't know what a buffer pool is used for, you're going to have problems. But the core questions will be from the lectures from the midterm going forward, and we'll post a study guide online. Yes. Is there a practice exam? There will be a practice exam, just like for the midterm. Yes. Question: is it a full three hours? What do you mean by that? It's scheduled for 8:30 to 11:30, so is it expected to be a three-hour exam? Oh, the question is: will it be twice as long as the midterm? No. So it's going to be an hour and a half? You can take up to three hours, as long as you need. But is it going to be double the number of questions of the midterm? No. Okay.
So we used to keep track of the time when people turned in the final exam and their grade, similar to the midterm. In previous years, if you took all three hours, you typically were in the lower end of the distribution curve. But in recent years, people get perfect scores and take the whole three hours. So it's not meant to be twice as long or twice as hard as the midterm. Other questions? All right. Cool.

So what I'm going to talk about first is what these OLAP systems look like and how they're incorporated into your application environments, and that will then motivate why we may want or need to build a distributed database system to do analytics on. In most cases — and this is purely a conjecture of mine — most people are going to need a distributed analytical system, because you're potentially going to have a lot of data, enriched not just with the data from your applications but from outside sources. And for those kinds of things, a scale-out architecture for an OLAP system makes more sense than maybe for an OLTP system. But again, as always in databases, it depends.

So this is a very common setup right here, where you have this separation in your application between the OLTP databases and the OLAP database. And I'm using OLAP database in the singular because the term data warehouse is meant to be: here's where all the data from your front-end OLTP databases goes, and you're going to run all your analytics on there. And even though I'm showing multiple database drums for the front end and one giant database drum for the OLAP system, these are just logical views — it could be spread across multiple nodes.

So a very common setup is that you have your application running with these OLTP databases in the front end. Think of these as individual silos: there's one application running with data in this one, another application running over there, and they don't really talk to each other. Then you want to combine them together into your giant OLAP system, and you'll use a technique called extract, transform, and load, or ETL. The idea is that you're going to extract data from your front-end OLTP databases, transform it in some way — to clean things up, to put it into a uniform schema — and then load it into your data warehouse. The reason you have to do this is that these applications could have been written by different people, at different times, using different naming conventions. Maybe the one at the bottom has a table of users, and for somebody's first name they use fname or f_name, but in another database they use first_name. We as humans know these correspond to the same thing, but if you just look at the raw schemas, they're different. So that's the transform phase: you clean things up.
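Just to make that transform step concrete, here's a minimal sketch of what the schema unification might look like in SQL. All the table and column names here are hypothetical, not from the slides:

```sql
-- Two source applications store the same information under different names;
-- the transform step maps both into one uniform warehouse schema.
INSERT INTO warehouse_users (user_id, first_name, last_name)
SELECT id,      fname,      lname      FROM app_a_users
UNION ALL
SELECT user_id, first_name, last_name  FROM app_b_users;
```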
The example I always like to use is Zynga — the FarmVille people, if that dates the reference. They would always buy these different game studios, buy their online games, but then they would leave the front-end database alone. Whatever the front-end database was when they acquired these companies, they just let them keep doing whatever they were doing. But then they obviously wanted to put everything into a giant data warehouse, so they would do this ETL process to clean things up, because everyone used different naming conventions. So there's a bunch of tools that will help you do this. Informatica is probably the most famous of all of these, but all the various database vendors like Oracle and SQL Server have their own versions of this. This is how people traditionally did data warehouses, starting maybe in the 1990s.

The current modern trend is a variation of this: instead of ETL, it's ELT — extract, load, and transform. The idea is that you're still going to pull data out of your OLTP databases, and you're still going to load it into your OLAP system, but you're not actually going to transform it before you load it. You just load all the raw files into your data warehouse, and the transform process occurs inside the data warehouse itself — there's going to be a bunch of SQL queries that convert it and then write it back out as more data files. dbt is probably the most famous of these; there's Airbyte, and a bunch of other ones can do this kind of stuff. And again, you might think, okay, is it actually really different doing one versus the other? It depends on how the architecture is set up. It's not that the tools at the top can't do ELT; it's just that the ones at the bottom are specifically designed for doing those transformations. So now you can see why you could have a ton of data in your data warehouse, because you're getting data from all these other things. And tools like Fivetran, and I think a bunch of the others, can pull data from outside sources — say you're getting weather information, you can have a pipeline that pipes that data into your data warehouse so you can enrich the data you do have.

So I'm going to use the term OLAP. Sometimes you'll see these kinds of databases referred to as BI or business intelligence databases, or sometimes decision support systems, DSS. They all pretty much mean the same thing: you're trying to extract new knowledge from the data you've already collected from your front-end database systems. You're trying to identify things like, here's the item that most people buy in the city of Pittsburgh in December when the weather goes below some temperature. Another famous example: Walmart, in the early 2000s, wanted to figure out, right before a hurricane hits in the South and right after it hits, what does everyone go buy? So when they see a hurricane coming, they start stocking the warehouses near the stores in the South — but obviously outside the danger zone of the hurricane — and then immediately after the hurricane hits, they send in the trucks with all the supplies they know people are going to buy. That's the idea of what we're trying to do here: we have the information from all our front-end databases, and we're trying to process it to help us make decisions.
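A decision-support query along those lines might look something like this — a rough sketch against a hypothetical, not-yet-normalized sales table (all names and dates are made up):

```sql
-- "What do people buy around a hurricane?" — top ten items by units sold
-- in Florida stores during a hypothetical three-day window.
SELECT item_name, SUM(quantity) AS units_sold
FROM sales
WHERE store_state = 'FL'
  AND sale_date BETWEEN DATE '2004-08-10' AND DATE '2004-08-12'
GROUP BY item_name
ORDER BY units_sold DESC
LIMIT 10;
```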
So the way we're going to model the database in our OLAP system is with potentially two different techniques, called the star schema and the snowflake schema. Anyone want to know why Snowflake is called Snowflake? It's because they support snowflake schemas. Star schemas were bigger in the 90s. A star schema is a subset of what you can do in a snowflake schema — the snowflake schema is more general — but traditionally, doing joins was really expensive before column stores and all the other analytical acceleration stuff you can do now, so you would say, I want to do star schemas. Nowadays pretty much everyone does snowflake schemas. There used to be some systems where you couldn't actually load a database unless you put it into star schema form. Now, we don't really talk about data modeling in this class because it's really about how to build a system — I think there are courses in Heinz College that teach you those things — and we don't teach normal forms because that's a waste of time; nobody actually uses them in the real world. But I'm going to talk a little bit about what things look like, and that will help motivate why we want to do efficient joins, or why we can use certain join techniques with some tables but not others.

So this is what a star schema looks like. The basic idea is that in the middle you have what is called the fact table. Using the Walmart example again, this fact table is going to hold every item that anyone has ever bought at a Walmart store — anything that was ever scanned at a cash register — or think Amazon, everything anyone ever bought online. The fact table has a single row for every single item, and you keep track of things like the price and the quantity as attributes directly in the fact table. But then you have all these foreign key references to what are called dimension tables, arranged around the outside of the fact table. This should be sort of obvious: for every single item ever bought, I don't want to store the name and the description of that item over and over again. So I put it in a dimension table, and any time I need it, I just do a foreign key lookup. And now when you're doing a bunch of joins — say, the calculation I mentioned before, find what item people bought right before or right after a hurricane in this geographical region — I just rip through my entire fact table and do additional filtering on whatever dimensions I have. But for the star schema, I can only have one level of dimension tables around the outside, because you want to reduce the number of joins you potentially have to do. In a snowflake schema, you can have an arbitrary amount of nesting or levels.

And this gets at a term we don't really talk about, called normalization. Normalization is splitting a table up into smaller atomic units to reduce the amount of redundant information you're keeping track of. Going back to the product dimension here: I'm keeping track of not just the name and description of the product, but also the category name and the category description. So if I have multiple products in the same category, I just repeat that category information over and over again. The idea is that because you've denormalized it — put it all into one table — you don't have to do a join to go get it.
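Here's a rough DDL sketch of what that star schema might look like — the table and column names are made up for illustration:

```sql
-- Dimension table, denormalized: the category columns are repeated for
-- every product in that category.
CREATE TABLE dim_product (
    product_id    INT PRIMARY KEY,
    name          VARCHAR(100),
    description   TEXT,
    category_name VARCHAR(50),
    category_desc TEXT
);

CREATE TABLE dim_store (
    store_id INT PRIMARY KEY,
    city     VARCHAR(50),
    state    VARCHAR(2),
    zip_code VARCHAR(10)
);

-- Fact table: one row per item ever sold, with foreign keys out to the
-- dimension tables and the measures (price, quantity) stored inline.
CREATE TABLE fact_sales (
    sale_id    BIGINT PRIMARY KEY,
    product_id INT REFERENCES dim_product (product_id),
    store_id   INT REFERENCES dim_store (store_id),
    sale_date  DATE,
    price      DECIMAL(10,2),
    quantity   INT
);
```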
But then the downside, of course, is that any time I update the name of a category, I've got to make sure I update all the tuples. That's basically what the normalization versus denormalization stuff is about. We don't teach you these things, and you'll see it in the real world, but not described in terms of normal forms, which is this whole theory of how to design and model your database. We can ignore that; nobody does it that way in the real world.

So, going back here: if I do the snowflake schema, I can normalize the category information out of my products so that I now have a separate category lookup table. And if I want to get the name of a category for a given product, I just do a join against that. This is a lot of what I've already said, but with a star schema the advantage is that queries will potentially run a lot faster, because I don't have to do a bunch of joins, but I'm going to have duplication of data, because I've flattened my tables down, combined tables into single tables. Snowflake schemas are going to require more joins, and the queries will potentially be more difficult to plan well, but I get the advantage that I'm reducing the number of copies of the data, and it makes it easier to reason about what's going on. So when you go out in the real world, for analytical systems you will probably see a snowflake schema. Nobody really does star schemas anymore. Again, this is why Snowflake is called Snowflake.
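And the snowflake version of that same made-up product dimension would just split the category out into its own lookup table, at the cost of one more join:

```sql
-- Normalize the category out of dim_product into its own dimension table.
CREATE TABLE dim_category (
    category_id   INT PRIMARY KEY,
    category_name VARCHAR(50),
    category_desc TEXT
);

-- dim_product now just references it (and drops the duplicated
-- category_name / category_desc columns).
ALTER TABLE dim_product
    ADD COLUMN category_id INT REFERENCES dim_category (category_id);

-- Getting a product's category name now costs an extra join:
SELECT p.name, c.category_name
FROM dim_product p
JOIN dim_category c ON p.category_id = c.category_id;
```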
So with that said, here's the problem we're trying to solve today. An application server comes along to our data warehouse, our analytical database system, and wants to do a join on R and S. Say R is the fact table and S is a dimension table, but for now it doesn't matter. I've partitioned these two tables across four different partitions. It doesn't matter whether it's shared disk or shared nothing at this point; we still have the same problem: we want to do a join. The dumbest way to do the join would be: well, I know I need data at these four partitions, so let me just copy all of it back to a single node so I can do the join there. But that obviously defeats the purpose of having a distributed database, right? Because if my database is 10 petabytes and I only have a terabyte of memory on that one node, then I'm going to read everything from disk from the other nodes, put it into memory, and now I'm basically a single-node machine — I don't get any benefit from having multiple resources. So this sucks, and we want to avoid it. And obviously we want to distribute the query across these multiple nodes in such a way that we don't have any false positives or false negatives. We want to appear as if we're running on a single node, even though we're not.

So we've already gone through the snowflake versus star schema stuff. We're going to talk about the execution models you could have in a distributed database for doing OLAP, then a little bit about query planning and what changes when we're in a distributed environment. The TL;DR for that is: everything's the same, we just now need to account for the location of the data and how it's partitioned. All the stuff we talked about before — join ordering, predicate pushdown — all of that is still here. Then we'll talk about how we actually execute joins: again, it's still going to be either sort-merge join or hash join in a distributed environment. Most distributed databases are going to do hash joins, because most of the time you're going to be hash partitioned, but we'll get to that in a second. Again, we need to account for where the data we need is actually located and, if it is partitioned, what key it's partitioned on — is it the thing we're trying to join on or not? And then we'll finish off with a quick overview of what cloud database systems look like, specifically in the context of OLAP systems. I'll confess now that this is a precursor to what 15-721 will cover next semester. Okay?

All right. So the first thing we've got to do is execute queries, obviously. As I said already, it's basically going to be the same as in a single-node database system. The query plan is going to be, most likely, a DAG of physical operators that move data from one operator to the next, and whether we're doing push versus pull in our processing model doesn't matter. But now, since we know we're distributed, we know the data might not be physically located where the operator is actually running, so we have to consider where the data is coming from and where it needs to go next, and this is all stuff we embed inside our query plan. So for table scans: we need access to this table — is it partitioned or not? If yes, where is the data located? If it's shared disk versus shared nothing, is there a central disk I can get the data from, or do I need to go to the node that has it? For joins — we'll cover those in a second. Aggregations and sorting, all of this is still the same.

So the next question is: for a given query plan, how are we going to get the data we need to execute these operators? We talked a little bit about this before with the OLTP systems: should I push the query to where the data is physically located, or should I pull the data to where I want to run my query? In the OLTP world, the query and the data it touches aren't going to be that big. But as in my earlier example, if my database is 10 petabytes, then maybe I don't want to pull the data to some node; I may want to send my execution request over to where it's located. Then there's also the question of what happens with the intermediate results that I generate. Should I store them on my current node and let someone come get them from me? Can I decide, okay, I know where this needs to go next, so start sending it there now? Or should I write it out to a shared disk, so that if my node crashes, another node can say, okay, I know they processed the thing I asked for, let me go get it from shared disk instead of having to restart the entire query? So let's go through these one by one.

Pushing the query to the data: the idea is that we're going to send the query, or some portion of it — a query fragment — to the node that contains the data. This doesn't matter whether it's shared disk or shared nothing, because we have this notion of logical partitioning. If the data for this node isn't physically stored there yet but the node is responsible for it, it knows how to go to shared disk and pull it in.
And the advantage of this is that we do as much filtering and processing of the data where it resides before we send it off to the next node. Traditionally, the network was always the bottleneck compared to local disk. That's changing a little bit now — both networks and disks are getting really fast — but just think of what you pay for the network traffic itself within a data center or to go outside the data center; Amazon charges you when data leaves their data center. I want to throw away as much useless data as I can before I send it along. It's the same idea as predicate pushdown when we did query planning on a single node: we want to filter things as much as possible before we start sending them along. So if we can send some portion of our query to where the data is located and do as much processing as we can there before it moves on to the next stage, we can get a big win.

In some cases, though, you may actually want to pull the data to the query. If it's shared disk, you may have to, because maybe I can't run my query on the object store in a shared-disk architecture — there are no compute resources there, nothing I can tell, hey, execute this piece of code for me where the data is located — so I have to pull it in. That's not entirely true, though, and the reason the lines get blurred is that in modern cloud systems, the object stores can actually run what look like queries. In S3, per their documentation, you can basically run SQL queries against S3. It's not full SQL, obviously, and it's pretty basic, but this is essentially the predicate pushdown we talked about before: I have this filter, my WHERE clause, and when I request the data from the object store, I also pass along that WHERE clause and let it do some filtering, so it only sends me back the data that I need instead of copying the whole thing over and filtering it later. And it's not just Amazon — Microsoft also has this for their blob storage, something that looks like SQL. What's actually really cool about this, since we'll talk about file formats in a second, is that these object stores have native support for CSV and JSON files and Parquet, which is a binary format we'll cover in a second. It's not just raw text files; they store things in an efficient, compressed binary format, and they know how to process it on the fly for you and run your queries. I don't know whether you can do aggregations; I know you can do filtering. So it's not a full database engine down there where you can do whatever you want, but you can clean some things up before you send the data over. I haven't looked this year, but GCP didn't support this last time I looked. And this is actually not unique to the cloud vendors and object stores — it's an old idea for shared-disk systems from before the cloud. Oracle Exadata basically has an FPGA on their storage node, so you can send the WHERE clause to the FPGA and it filters the data as it's coming over the wire to you, which is pretty cool. And obviously, since Oracle controls the whole thing, it's all in Oracle's proprietary format. So the lines get blurred when you ask whether it's shared disk or shared nothing; the shared-disk systems, especially in the cloud, are getting very good at this kind of stuff.
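As a rough idea of what that looks like, an S3 Select request is a restricted SQL expression evaluated against a single object (it gets passed through the SelectObjectContent API along with the input format); something along these lines, with hypothetical column names:

```sql
-- Filter rows inside the object store so only matching records come back
-- over the network; the object is always referenced as S3Object.
SELECT s.item_id, s.price
FROM S3Object s
WHERE s.category = 'electronics';
```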
And I don't know whether Snowflake and the others take advantage of this; I know Redshift does — that's public. So again, this is just repeating what I've already said, but we want to push the query to the data. The query goes to this node here. It recognizes that it wants to join R and S, but R and S are partitioned on the ID column, which is what we're joining on. So instead of the top node telling the bottom node, send me all the data you have, it sends down a query plan fragment. There's some metadata we keep track of that says, for these tables partitioned on this column — in this case we're doing range partitioning — here are the ranges of that ID that are located at, or controlled by, the different nodes. So I can send my query plan fragment down: hey, join R and S for this range. The results get sent back up to this node, which then does a union of the local result it computed and the result it got from the other node.

Shared disk: same idea, but now we're pulling the data to the query. In this case, the query shows up at the top node, which knows the bottom node is responsible for this range, but they both have to go to shared disk and get the pages they need to do the processing. Then the bottom guy sends his result up to the top node, which again just unions the results. In the example I've shown here, when you get the result from the other node that processed some portion of the query for you, that node just stores it in its buffer pool. That way, if it runs out of memory because it's getting too much data from the other guy, it just spills to disk until it can combine the result and send it back. Now, in some cases, for some queries, you can immediately start sending the data back out to the client as it comes in, but in other cases, if there's another stage in the query plan, you may need to store it locally and then send it along to the next stage — you can't immediately send it out.

In the last lecture, we made a big deal about crash recovery, doing two-phase commit across the different nodes so that if we want to make a change, everyone agrees that it's going to happen. We're not really worried about that in this world, because we're not making changes to the database; we're just doing some kind of read-only SELECT query that's trying to derive new information. But the challenge now is: what happens if my query runs for a really long time? Think hours, or days — less common now, but in the old days this was an issue. If my query is going to run for five hours, but after hour three one node goes down, what happens? If I'm just storing things in an ephemeral cache in my buffer pool, or written to the local disk at that node — again, whether it's shared disk or shared nothing — then the whole query has to restart. So there's another notion of fault tolerance for databases, but it's really about query fault tolerance: if my query is spread across multiple machines, all processing in parallel, I want to avoid one node going down crashing the whole query and forcing a restart. And again, this has nothing to do with whether the data itself is replicated or not.
You wouldn't really want to replicate the intermediate results at different nodes, or have two nodes compute the same answer in case one of them goes down, because that would be really inefficient. So we need a better way to record the intermediate results as our query runs, so that if there is a crash — one of the nodes does go down — we don't have to restart the whole thing. The idea is that we just need a place where we can store data for the query while it's running, so that if one node goes down, we can go pick up those results. What can we use for that? The shared disk, right? Because that thing's not going anywhere. Amazon S3 going down — I mean, it does go down, but it would take the entire Internet down with it. So we can use shared-disk storage as a way to keep almost like a checkpoint for our queries while they're running.

So again, same setup here. I want to run this query, and I ask the bottom guy to do the join. But instead of immediately sending the result back up to the node above, I write it to my object store, write it to my shared disk, and then tell the guy up there — or you would coordinate this ahead of time — hey, by the way, you asked me to run this query plan fragment; here's the location on shared disk where you can go get my result. So now if this node crashes, the other node can just retrieve that result and pick up where it left off. There's a bunch of coordination going on — this node went down, let me spin up another node in Kubernetes or whatever you're using to replace it — but I don't need to recompute everything it actually did.

Who here has heard of Hadoop? All right, less than half. Or MapReduce — who here has heard of MapReduce? Okay, same people. So in the 2000s, Google came out with a paper, 2004-ish, on this technique called MapReduce. It's basically a distributed programming paradigm, a framework where you write these specialized map and reduce functions that let you do data processing. It was basically arbitrary Java code, at least in Hadoop. And in their implementation, because they assumed you're running on cheap hardware — think thousands of cheap single-unit servers — they would do this checkpointing after every single map or reduce phase, and they would replicate it three or four times. It was super expensive. So in my example here, I'm showing: I run this join, and the results get written to shared disk. They would do that for everything. But you can actually be a bit smarter about it and recognize, okay, maybe I'm going to do this join and then immediately do something else right after, but I can do that locally on my node, so maybe I don't need to write the result out to shared disk. So modern systems do this kind of checkpointing, similar to what Hadoop was doing, to avoid having to restart the entire query if there's a failure, but they're not checkpointing blindly after every single step as you would in MapReduce. I could go on a long tirade about the problems with MapReduce, but let's skip it. Basically nobody runs MapReduce now anyway; nobody runs Hadoop. That's all been deprecated. And as I was saying, with Hadoop you'd write raw Java functions — raw Java code — to process data as if it were a query.
People realized that was a bad idea, so then they put SQL on top of it with this thing called Hive, which was invented by Facebook. That was a terrible idea too, because you're basically converting SQL queries into MapReduce jobs, so all the crappy problems of MapReduce — the inefficient architecture — you inherit, even though now you're at least writing SQL. And then everybody realized that was a bad idea, so then there are things like Presto or Trino, which we'll cover in a second; they're more efficient replacements for running analytical queries on top of shared-disk storage. All right. But this query checkpointing for fault tolerance is one thing that came out of the MapReduce world and has permeated throughout distributed databases, distributed relational databases.

All right, query planning. Again, it's all the stuff we talked about before: we still want to do projection pushdown and predicate pushdown, and we still need to figure out the optimal join ordering. None of that goes away. But now we need to consider where data is actually physically located, and also the network transfer costs, because just as reading something from disk doesn't come for free, sending things over the network doesn't come for free either — there's a cost to it. There are some systems, like DB2, for example, where when the database system boots up and knows it's in a distributed configuration, it actually runs a bunch of micro-benchmarks — basically sending packets over the wire to the different nodes and measuring the latency — and then uses those measurements as values in its cost model to decide how expensive network transfer is versus reading from local disk or reading from memory. I would say that's the right way to do it. Most systems just have a hard-coded value that says, here's how much it costs to send a byte or a megabyte over the network. But of course the network connections aren't always symmetrical, so doing some kind of micro-benchmarking like DB2 does is the right way to go.

All right, so the question now is: what are we actually going to send between the different nodes to tell them to do work on behalf of our distributed query? The most common approach is to send physical operators. This is basically the same thing you'd have with parallel execution in a single-node database system: you break up your query plan into plan fragments and distribute them amongst the nodes, and you may annotate each fragment with information about where the data is coming from and where it needs to go next. What you're sending to the different nodes are these physical operators, because the plan has already been decided by some kind of centralized coordinator or centralized optimizer. Most systems use this approach. An alternative is that you take the output of the query optimizer running in the centralized coordinator — which is going to be physical operators — break it up into the fragments you want to send to the different nodes, but then convert those physical operators back into SQL, and send SQL to the different nodes.
The idea here is that you've already done some of the global optimization — figuring out which query plan fragments need to execute on which nodes — but then, rather than telling each node, here's exactly the query plan I want you to run, you give it SQL, which it can parse and optimize locally, because it may make a better decision on the node it's running on based on what it sees in the data it actually has. Again, instead of trying to maintain a global view of how to optimize everything, you get far enough to say, okay, here's the work I want you to do, and then they each make their own local decisions. This is rare; very few systems do it. SingleStore, I think, still does this — I don't think they would have changed — and then Vitess. Vitess isn't really an analytical database system, but it is a distributed system: it's a sharding middleware for MySQL that was developed by YouTube. YouTube runs on MySQL, and because the nodes they're talking to are all just MySQL nodes, and MySQL can't take in a physical plan, they have to convert the plan back to SQL and send that to the node, and then that node does all the parsing and optimizing locally.

So here's what it looks like if you want to send SQL queries. This is the join we had before, this is the catalog we keep track of, and here's the range partitioning we have for our tables. I take the original query, figure out, okay, here's the data I need to access, and then I modify the query to include a clause restricting it to the data that's local to that node. That node then gets the query and runs it through the same optimization pass it would normally do. The idea, again, is that the centralized coordinator doesn't have the complete or up-to-date view of what the statistics are at each node. And then there'll be some kind of centralized coordinator that knows how to union the results and put things back together at the end.

I think it's a clever idea. Reconstructing SQL from a physical plan is non-trivial, though, and most people don't do this. And when you think about it, oftentimes the data isn't actually changing that much — there are ways to handle that, which we can cover later — so the idea that every node has a better view of what its data looks like than the global view doesn't always hold up. Yes? Question: to this point, why do you have to send SQL? Why couldn't you send the physical plan or the logical plan and just have each node's query optimizer optimize that? Because then you'd need to build a separate code path for the optimizer to take that in as input and inject all the internal metadata about the search process. It's weird; it's not how people usually write their optimizers. Ideally, you'd want your optimizer to be able to stop, dump out its state, and then load it back in. Very few systems can do that. It's like rebuilding the stack of a search tree — you'd have to pull all of that out and then inject it back in. You could do it, but it's additional engineering.
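To make that concrete, the rewrite looks roughly like this; the id range here is made up, and in a real system it would come from the partitioning metadata in the catalog:

```sql
-- Original query submitted by the application:
SELECT * FROM R JOIN S ON R.id = S.id;

-- Fragment the coordinator might send to the node responsible for one range,
-- which that node then parses and optimizes locally:
SELECT * FROM R JOIN S ON R.id = S.id
WHERE R.id BETWEEN 1 AND 100
  AND S.id BETWEEN 1 AND 100;
```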
So now let's talk about how we actually execute our joins. As I said before, we're still going to be doing either a hash join or a sort-merge join — a nested-loop join if you're really unlucky — and the trade-offs between those approaches are still the same, even in a distributed environment. And as I said before, we could just try to put all the data that's spread across different nodes onto a single node and do the join there. That would work, and it wouldn't have any false negatives, because we're guaranteed that the data we're trying to join is co-located. But obviously that's not realistic, and it's unlikely that a single node can handle all your data.

So the way we're going to do this is: to join two tables R and S, we need to get the data we're trying to join from the two tables onto the same node, based on the join key, regardless of how the data has been replicated or partitioned. We're going to go through four scenarios, from best to worst, and we're going to see that if the data isn't partitioned or placed the way a query needs it, the database system has to move it around. This is all transparent to you as the application developer writing the query — you don't know how the data gets where it needs to go — but obviously, if you're actually building the system, you need to care about these things. And the key thing we need to avoid is false negatives: two tuples of R and S that should match on the join key, but because they're physically located on different nodes, we never join them and we get incorrect results. Everything I'm going to talk about here is the same whether it's shared disk or shared nothing; it doesn't matter.

The best-case scenario is when one of the tables we want to join on is replicated at every node. You can have this in your database: some tables are partitioned and some tables are replicated. Coming back to the snowflake schema stuff from before, those dimension tables are usually pretty small relative to the fact table. The fact table is going to be huge — every item everyone's ever bought from Amazon, that has to be billions if not trillions of rows. But your dimension tables — think of something like zip codes. What is it, about 40,000 zip codes in the U.S., and the post office changes them four times a year? So you can take that 40,000-tuple table and replicate it on every single node as a dimension table, and if you want to join it against your fact table, the data is just right there to do the join. So all you need to do in this scenario — I want to join R and S on ID, S is replicated everywhere, and R is partitioned on ID — is have each node do its local join, and then this node sends its result to the other node, which unions the results. You don't actually even need to inspect the results in this example, because the partitions of R are non-overlapping, so I can literally just concatenate the byte buffers on top of each other and send that back to the client. So this is the best-case scenario: I did no data transfer in order to compute the join, and obviously I had to send the result, but that depends on the selectivity of whatever I'm joining on — it's the bare minimum I need to send over to process the query.

If the tables aren't replicated, the next best thing is basically the same story, but where the two tables are partitioned on the same attribute that you're trying to join on.
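Just as a sketch of what that second scenario looks like at the DDL level — this uses single-node Postgres range-partitioning syntax and made-up ranges, but in a distributed system you'd place the matching ranges of R and S on the same node:

```sql
CREATE TABLE r (id INT, val TEXT) PARTITION BY RANGE (id);
CREATE TABLE r_p0 PARTITION OF r FOR VALUES FROM (1)   TO (101);
CREATE TABLE r_p1 PARTITION OF r FOR VALUES FROM (101) TO (201);

CREATE TABLE s (id INT, val TEXT) PARTITION BY RANGE (id);
CREATE TABLE s_p0 PARTITION OF s FOR VALUES FROM (1)   TO (101);
CREATE TABLE s_p1 PARTITION OF s FOR VALUES FROM (101) TO (201);

-- Because the partition boundaries line up on the join key, a join on id
-- can run partition by partition (r_p0 with s_p0, r_p1 with s_p1) with no
-- data movement, and the per-partition results are just concatenated.
SELECT * FROM r JOIN s ON r.id = s.id;
```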
So in this case, again, I want to join R and S on the ID field, and it just so happens that the range partitions for R and S are exactly the same. So each node processes its local join in parallel, then this other node sends its result over, we concatenate the results, and we send them back to the client. This is nice, but it doesn't always happen; it's not always the case that you're partitioned exactly on the thing you want to join on. We're not going to talk about how you actually pick the partitioning key — that's a whole other problem, and it's been shown to be NP-hard. For an arbitrary set of queries and an arbitrary set of attributes you could partition on, figuring out the ideal partitioning scheme for your tables is non-trivial. In some cases, maybe for 99% of your queries you get a nice layout like this, but of course there's some query that shows up that isn't joining on the partitioning key.

That's the third scenario. In this case, the R table is partitioned on ID, but now my S table is partitioned on VALUE, some other attribute. If I just do the join on the local data, I can end up with false negatives, because if there's some tuple with ID equals one, it may be on this other node over here — I don't know, because S was partitioned on a different attribute. So when this occurs, you have to do what is called a broadcast, where you basically reorganize one of the tables: you take all the values of S for some ID within some range and send them over to that partition. Sorry, I take that back — you're going to send whatever data you have here from S to every other node that's involved in the join. So now S is effectively replicated at every single node, just as it was in the first scenario. The reason it's called a broadcast is that you're saying, hey guys, here are all the values I have for this table, and everyone gets a copy of it. And again, you can do this if it's something like a dimension table that's much smaller than the fact table. Sometimes you'll see this in the literature or in documentation called a broadcast join, and that really just means they're doing this step to send the data around; they'll usually say it's a broadcast hash join or a broadcast sort-merge join.

So, question: if you can do this anyway, why not just go back to scenario one? I mean, this does effectively put you into scenario one. It may be the case that someone picked partitioning on VALUE because most of their queries join or do lookups on VALUE; it just happens that for this one query, the data isn't laid out the way I want, so I've got to move things around. In other cases — we'll talk about data lakes in a second — in that world, you're just loading a bunch of data files in, and you're not doing any reorganization or partitioning on them at all. So you basically have to look through the file and say, okay, here's the data I'm seeing: do I want to repartition it and send it around, or do I want to broadcast it and send it around? Yeah — I didn't really say this ahead of time.
Everything I've been talking about so far is what we'd call managed storage: it's literally me running insert queries into the database, and the database says, oh, I know what this data is, I know what table it belongs to, and it can decide how to move things around. In the data lake world, you don't have that full control; we'll cover that in a second.

Yes, over there. The question is: if you have n nodes, is every node sending to the other n-1 nodes? Yes, you're broadcasting everything you have to everyone else. Your suggestion is: instead of doing the n-squared broadcast, all the nodes pick one node, everyone sends S to it, and then it sends it out? You can do that. The big idea is that we're basically replicating S after the fact, when it wasn't replicated — that's the broadcast phase. How you actually do it depends on the implementation. His mom's calling, probably. Anyway.

And then — yeah, sorry. Question: what if each of those nodes just runs out of storage? Sorry, is this in a shared-disk system? No, it's shared nothing. Shared nothing, and you combine a bunch of stuff onto a single node — what do you mean by stuff, you mean S here? Yeah, sure. You bring in S and then you run out of memory? Okay, so let's do memory first. If I run out of memory, again, the results get staged in my buffer pool; they just get written out to disk, and then I pay the cost of swapping them back in — all the stuff we did before. What if I run out of disk? Well, that's the same thing as on a single node: if I run out of disk, I crash. There's nothing you can do.

Is there not a way to have R sort of hop around the nodes, try to join with pieces of them, and aggregate the result in some way? So your question is: if I recognize that putting everything on a single node will run me out of memory or out of disk, can I try to rebalance things on the fly? No — we can come back to you. Yes?

I think my question is similar to that: if S and R are both too big for an individual node's disk, is there a way to break S and R up such that you can perform the join across the different nodes by only loading part of S and part of R at a time? So the question is: if the partitions of R and S are too big to fit on a single node, could you do basically a streaming thing, where you bring in some of the data and do a portion of the join with what you have so far? Yeah. Think back to our hash join example: I have to build the hash table on the build side of the join first, but then when I do my probe, I can stream the data in and do that incrementally. So yes, you can do that. I'm showing this at a high level — okay, you move data around, and then you can do the join, as if they're two separate steps, because that's how it looks in PowerPoint — but again, using the hash join example, I could build the hash table on R, and then as the tuples of S are coming in over the network, I do the probe, send the intermediate results out somewhere, and then go back and get more as I bring things in. It's the same idea as on a single-node system.
Going back to a single node: for a hash join, I wouldn't bring everything on the probe side into memory — although some systems do that, bring everything into memory and then do the probe — I can do it with get-next, pulling in a batch of things at a time.

And then the worst-case scenario is when both tables are not partitioned on the join key. Now I have to reorganize things, basically shipping a complete copy of the tables across the nodes. In this case, since we're not going to replicate the two tables, we don't need to do the n-squared broadcast; we know exactly where data needs to go. It's basically like dumping the tables out and then loading them back in, but this time with a different partitioning key. So R is partitioned on some other attribute and S is partitioned on VALUE, but I need IDs. So I'm going to send all the data for some range of R's IDs over there, and the same thing for R over here, and then I send over S for both of them, and now the data is in the form that I need, where I'm guaranteed there are no false negatives, because when I do my join, everything for a given ID is local. Everyone does their join locally, we ship back the results, and we produce the final answer.

Yes, question: why can't you broadcast one of the tables across all the nodes and do the join on the IDs that way? Well, I still need to get all the S values for a given ID over here, so that when I do my join I don't end up missing something that lives over there because I didn't bring it over. Okay, so if we just broadcast the entire S table — say S is 10 petabytes — whereas in this case I'm sending only a partition of the data to each of the other nodes... are you proposing that S in its entirety gets replicated on every single node?
Right — I mean, we could do some sort of streaming to partition it. Another question is why we need to reorganize at all. Well, this is actually more efficient than broadcasting, because if you're broadcasting to everyone, then — again using S as an example — I'm potentially sending values of S whose IDs are never going to match anything in R on that node, so I've shipped that data for nothing. But aren't you moving the whole table? No — each node only sends the portion whose IDs correspond to the node that needs them. Correct. And again, I can do this because it's SQL: I know what the join clause is, I know what my data looks like, I know where it needs to go.

So, just like the broadcast join, you'll sometimes see this called a shuffle join, and again it's typically still doing a hash join underneath the covers. Usually they'll say it's a shuffle hash join, but sometimes they might just say shuffle join; it's really a shuffle hash join doing this extra reorganization step.

So in this example, for this query, it's a SELECT *, meaning I want all the columns of R and all the columns of S — sorry, yes, the question is whether you broadcast versus shuffle depending on the size of the table. Right. So in this example, they're all SELECT * queries, so I need all the columns of R and S, and therefore I have to send all the data over. But you can do basically a kind of projection pushdown and say, okay, I'm only going to send the columns I actually need, and this technique is what's sometimes called a semi-join. The SQL standard doesn't define a semi-join, but some systems actually have it in their syntax as an explicit SEMI JOIN clause, like INNER JOIN or OUTER JOIN. The basic idea is that instead of sending over all the data from the tuples that might match — like you would if it were all on a single node — I send over the bare minimum of data I need to actually do the join. Again, this is basically just projection pushdown, but for whatever reason it's explicitly called a semi-join.

So in this case, I'm joining R and S like before, but I only want R.id, and I only want matches where R.id is not null. Now, if I'm split across two nodes like this, instead of sending all of S over to R to do the join — and likewise for R — what I can do is just send over: here are the IDs that could match. This is equivalent to rewriting the query with a SELECT 1: for this given R.id, something does match. I'm not telling you what the rest of the tuple is; I'm just saying something is there. It's doing the join, but instead of getting back the full result, you're getting back essentially a true/false — the set of IDs that did match. Again, some systems have explicit clauses for this. You just send IDs. Yeah.
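As a sketch, the two forms look roughly like this; the EXISTS version is the "SELECT 1" idea — it returns each qualifying R.id once, which is exactly the semi-join semantics:

```sql
-- Query from the example: we only need R.id, not the rest of S's columns.
SELECT R.id
FROM R JOIN S ON R.id = S.id
WHERE R.id IS NOT NULL;

-- Semi-join flavor of the same thing: we only care that *some* matching
-- S tuple exists, so only the join keys ever need to move between nodes.
SELECT R.id
FROM R
WHERE R.id IS NOT NULL
  AND EXISTS (SELECT 1 FROM S WHERE S.id = R.id);
```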
All right, so in the meantime, I'm going to talk about cloud systems. Again, there's a lot more activity, at least in the marketplace, for analytical systems running in the cloud, and part of it has to do with chasing all that Snowflake money after the Snowflake IPO — and Databricks will be there pretty soon. So cloud vendors offer what is called database-as-a-service, abbreviated DBaaS. The idea is that they provide you with a managed database system environment: instead of you allocating an EC2 instance, downloading MySQL or Postgres or whatever you want, running it locally, and managing that whole VM along with storage and backups and recovery and all that stuff, they provide you with just a URL that you connect your application to and interact with the database system through. You can't SSH into the box, because that's all hidden from you — for most people, who cares — and they manage everything for you. And as we said before, whether it's shared disk or shared nothing underneath the covers, a lot of it looks the same from the outside.

So there are two ways to run a cloud database system. The first is what I'll call a managed database system, where they basically take some off-the-shelf software like Postgres or MySQL, and instead of you running it on EC2, they run it on EC2 for you, with some management interface in front of it. They handle backups, snapshots, and recovery kinds of things, but for the most part it's going to be exactly the same as what you would download and run locally. This means the database system itself is not aware that it's running in a cloud environment — I mean, running in a disaggregated architecture with shared disk. And this is what most vendors do — less so more recently, but most of the time when you see a hot open source project backed by a startup that you can download off GitHub, and soon after they have their cloud version of it, this is typically what they're doing: they're just taking it and running it for you.

There's another category I'll call cloud-native systems. That's not a scientific term, but it's sort of what people mean when they say cloud-native database: the database system has been built from the ground up, or modified significantly, to be aware that it's running in a cloud environment, specifically with shared disk, and it can take advantage of all the flexibility, scalability, and elasticity of a cloud-based environment. Snowflake was probably the first in this space — well, maybe not: BigQuery was an internal Google project called Dremel that started around 2006, and Snowflake is 2012-ish, 2013 — but Snowflake is the one that really made this architecture popular.

Now, if you're running in the cloud, one of the things you can do is support what's called a serverless database system. This seems like a weird thing to attach to a database system, because obviously you need servers to run it, right? What they really mean is that you're not going to provision servers ahead of time for each customer, each tenant, and if a tenant ever becomes idle — meaning they're not running any queries — you can actually turn off the compute nodes, those resources, for them, and when they come back later and run a query, you spin it all back up. So you're turning machines off for them and not charging them for those compute resources, but the database system is always available.

So assume we're a shared-everything system: we allocate some EC2 node, we have memory and disk and CPU, and say the application is running queries on it — that's fine. But then say the application goes away — someone falls asleep, they're not running any queries, nothing's happening. In this environment you still pay for this node, because, basically just like in BusTub, there's a while loop where this thing is
So in a serverless environment, assuming it's shared disk, what you can do is run queries like before, fetching things from the shared disk as needed. But when people go away, you flush the contents of the buffer pool and the page table out to storage: basically you take a snapshot of the page table that's in memory, keeping track of what pages are there, and any dirty pages get written out to storage. Then you go ahead and kill this thing. The data is still there because it's still in shared disk, everything's fine, and you're paying less. Then when the application server wakes up and sends a query, it's as if the system is booting up for the first time, but instead of having no information about what was in the page table, you go fetch that snapshot back in and bootstrap the system to say, here's what the state of the system was before I shut down. I'm showing it as if we're killing the compute node entirely; there are some systems with a multi-tenant setup, where the same instance of the database system is supporting multiple customers, so all you need to do is look in the buffer pool, figure out what belongs to the customer that I know is idle and write that out, while the system itself keeps running.

And so there are a bunch of databases that now support what I'd call serverless operation. Amazon took Postgres and MySQL and rewrote them for this thing called Aurora, and they have a serverless offering. Fauna is a serverless database that has portions of Cassandra in it, but a lot of it is written from scratch. Neon is probably one of the more famous ones for Postgres: they took Postgres, ripped out the bottom half, similar to what Amazon did for Aurora, and reconfigured it to be based on a shared disk architecture. And PlanetScale and Cockroach have their own things; PlanetScale is the commercial version of Vitess, the thing I mentioned before that came out of YouTube, where the YouTube guys basically went off and formed a company.
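Here is a toy sketch of that pause/resume flow, with a Python dict standing in for the shared object store and a trivial page table. It is only meant to illustrate the mechanics described above, not how Aurora, Neon, or any real system actually manages its state.

```python
import json

shared_disk = {}  # stands in for S3 / shared storage

class ComputeNode:
    def __init__(self):
        self.buffer_pool = {}  # page_id -> [data, is_dirty]

    def read(self, page_id):
        # Normal buffer pool behavior: fetch from shared disk on a miss.
        if page_id not in self.buffer_pool:
            self.buffer_pool[page_id] = [shared_disk.get(page_id), False]
        return self.buffer_pool[page_id][0]

    def write(self, page_id, data):
        self.buffer_pool[page_id] = [data, True]

    def pause(self):
        # Tenant went idle: flush dirty pages and snapshot the page table,
        # then the compute node can be shut down and we stop paying for it.
        for pid, (data, dirty) in self.buffer_pool.items():
            if dirty:
                shared_disk[pid] = data
        shared_disk["__page_table__"] = json.dumps(sorted(self.buffer_pool))
        self.buffer_pool.clear()

    def resume(self):
        # Tenant came back: bootstrap from the snapshot instead of starting cold.
        for pid in json.loads(shared_disk.get("__page_table__", "[]")):
            self.buffer_pool[pid] = [shared_disk.get(pid), False]

node = ComputeNode()
node.write("page1", "hello")
node.pause()    # compute goes away; state lives on in shared storage
node.resume()   # spin back up with the old page table
print(node.read("page1"))  # hello
```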
All right, so the other thing we talked about and mentioned before is these data lakes. This is the modern buzzword to describe people basically using an object store as their data warehouse, instead of the system managing its own proprietary storage as before, so it's almost always going to be a shared disk architecture. In a traditional data warehouse, if I want to load any data into the warehouse, I have to call CREATE TABLE, which updates the catalog, and then I insert a bunch of data, and all of that goes through a compute node that's controlled by the database system. So it knows: here's the data you're trying to insert into this table, it's going to go into this location on storage; it looks at the catalog and figures out where to write stuff, and any query does the same thing. But in a data lake architecture, the idea is that I don't have the database system be the gatekeeper for all new data coming in. Instead I have this object store where any application can start writing a bunch of files: CSV files, JSON files, Parquet or ORC, which we'll cover in a second. They just start throwing whatever data they want into this object store. I have to update some catalog somehow to keep track of it, and then when a select query comes along on this node, we look in that catalog, figure out what's there, and then we know how to go get the data we actually need. Again, the idea is we want to remove the database system as the gatekeeper, let anybody write stuff in here, and then figure out what's actually in it when we run queries. This goes back to what I said at the beginning about extract, transform, and load: this is that setup where anybody can take data from whatever front-end application and shove it into S3 as a bunch of files, and then someone else comes along, cleans things up, and figures out how to make sense of it.

So there are a bunch of people in this space. Databricks is probably at the forefront, at least in marketing, talking about this, and they have a term for this idea: yes, there's an object store, but then you have this execution engine and this catalog infrastructure on top of it, and they call it the lakehouse, a play on words with data warehouse. It's basically what I'm describing here. Databricks has this, and everyone has their own variation of it. Redshift actually didn't start off being a cloud-native system; it was a fork of ParAccel, which is a fork of Postgres, and it was very much a shared nothing system. Over time they've rewritten a lot of it to be shared disk and look very similar to this architecture here.
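Here is a small sketch of that flow, using pyarrow and a plain Python dict as a stand-in catalog. The table names, file names, and catalog layout are all made up for illustration; a real lake would put the files in an object store like S3 and register them with an actual catalog service rather than a dict.

```python
# Sketch of the data-lake flow: applications drop Parquet files somewhere,
# a lightweight catalog records what landed, and queries consult the catalog
# to find which files to scan.
import pyarrow as pa
import pyarrow.parquet as pq

catalog = {}  # table name -> {"files": [...], "schema": ...}

def ingest(table, path, arrow_table):
    # Any application can write a file and register it; no gatekeeper node.
    pq.write_table(arrow_table, path)
    entry = catalog.setdefault(table, {"files": [], "schema": arrow_table.schema})
    entry["files"].append(path)

def scan(table):
    # Query side: look up the catalog, then read whatever files it lists.
    entry = catalog[table]
    return pa.concat_tables([pq.read_table(f) for f in entry["files"]])

# Two different "applications" writing into the same logical table.
ingest("users", "users_app1.parquet",
       pa.table({"id": [1, 2], "first_name": ["Ann", "Bo"]}))
ingest("users", "users_app2.parquet",
       pa.table({"id": [3], "first_name": ["Cy"]}))

print(scan("users").num_rows)  # 3
```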
All right, so the last thing I'm going to talk about is a segue into what 721 will be about, if you continue on with this stuff. One very interesting trend we've seen in databases in the last decade is that instead of having these giant monolithic database systems, where everything inside the system is written by the same vendor or the same group or organization, over the last couple of years people have been breaking off components of the system and turning them into standalone services that you can then connect together in some way to build a larger database system or data warehouse system, essentially like the lakehouse stuff I'm talking about here. What's interesting about this is that a lot of times these components are not being built by a database system vendor, which is traditionally how this software has been written for decades. Instead you have big tech companies, or in some cases even smaller startups, that have some need for some piece of a database system, and they end up building it, open sourcing it, and other people pick it up and start using it. And again, the company building that piece of software doesn't make its money with that software; they make their money doing something else. A good example: Facebook has an open source execution engine. Facebook is not a database company, but they need the execution engine for their own internal needs, and then they open sourced it. Actually, Facebook has put out a lot of its software in the context of databases in the last decade or so. So we'll go through examples. The idea here is that if you want to build a new database system, you don't have to build everything from scratch anymore, or take Postgres or ClickHouse and fork it and try to modify it; you can take these different components and start putting them together. And what I mean by commoditization is that what made Snowflake unique ten years ago is not really significant anymore, because everyone has it: everyone has a vectorized query engine, everyone is going to be using a column store. The things that matter most now are the user experience, the front-end stuff, and certainly the query optimizer. So you can put these things together and make a system, but then to differentiate yourself you mostly have to focus on the front-end stuff. DuckDB is actually an example: DuckDB is an amazing piece of software, but the core ideas are well known and not new; they put them in a great form factor and connected it to things like pandas. That's not the core database engine stuff, that's the user experience stuff.

All right, so catalogs we've talked about; again, this is how we keep track of what data we have, where our files are located on storage, what the schema is, and so forth. Probably the most famous standalone catalog system is this thing called HCatalog that came out of Facebook, out of the Hive project; the H stands for Hive or Hadoop. Again, the idea is that I can write a bunch of files into S3 and then update HCatalog to say, hey, by the way, here are the files I have, here's the schema, and maybe some basic statistical information about what's in them. And certainly all the cloud vendors will sell you something as well; Databricks has their thing called Unity. It's not open source, but they all provide some mechanism to make sense of what data you have in your lakehouse.

Query optimizers we've talked about; again, this is the hardest part of building a database system, and to no surprise most people don't want to build it themselves. Instead, there are at least two open source projects that I know about that are meant to be standalone, sort of query optimization as a service. Calcite is probably the most famous one. There was a data system in the 2000s called LucidDB, I think it was a European startup; the company failed, and then they pulled out the query optimizer they had built and made it a standalone project that became an Apache project. Calcite has the query optimizer, but it also has the ability to ingest queries, to parse SQL statements in Postgres, MySQL, a bunch of different dialects. And then Orca is another one of these optimizer-as-a-service projects that came out of Greenplum, which was bought by Pivotal, which I think was bought by VMware, which has been bought by Broadcom, I think, yeah. But this is open source. It's less common than Calcite, but Greenplum built it originally because there was the Greenplum data warehouse system they were building, a fork of Postgres, but then they had another system, I think called HAWQ with a Q, that was like Hive, SQL on top of Hadoop. So instead of having to build two separate query optimizers for these two different database systems, they put everything into one. As far as I know nobody uses Orca, but I think this is the part of the system I'm most interested in right now in terms of research.

The other cool thing that's happened in the last 10 years is that there are now these universal, open source file formats that we can use across different database systems. Until maybe 10 years ago, most databases had their own proprietary binary format. Think of your BusTub project: when you write out a file, it writes out a .db file. Did we rename it to .bustub yet, or is it still .db? Still .db, right; I'm assuming, it doesn't matter. BusTub has its own proprietary format that only BusTub knows how to use.
Oracle, MySQL, Postgres, they all have their own proprietary formats too. But if you want to start sharing data across different systems, again thinking in the cloud, I have my front-end applications writing some data that I want to put out to S3 for others to consume. I could store it all as JSON or text fields or CSVs, which would be very inefficient; but if I have a file format that's a columnar binary encoding with compression, and I can have the application write that out, then any database system I use can read it. So there are a bunch of these file formats that make it easier to access data generated by one application and share it across other systems.

The most famous one of these is Parquet; this one came out of Cloudera and Twitter. The next most famous one is ORC, which came out of Facebook. These are basically the column stores we talked about before, but now there are libraries written in whatever programming language you want, Rust, C++, Java, Python, that can read and write these files. So again, I can have my application generate these files without having to talk to the database system, and the database system can then parse them later. CarbonData is out of Huawei; it's a fork of Parquet. We did some experimentation on it and it doesn't work, at least the open source one doesn't. Apache Iceberg is Parquet files but with additional metadata being tracked, so you can do incremental updates and schema changes; whereas plain Parquet files are sort of write-once, read-many, Iceberg can keep track of those things for you. This came out of Netflix, and it's another good example: Netflix is not a database company, they're not making money selling you a database system, but they wrote Iceberg, open sourced it, and a lot of people have picked it up and started using it. HDF5 is not very common in our world of data systems, but it's an array format that's common in high performance computing, scientific computing; all the satellite images or the telescope stuff is usually written in HDF5. And then Apache Arrow is an in-memory columnar format that lets you do data exchange between applications running in memory. Think of when DuckDB reads a file in and you want to query it and get that data into your pandas code: instead of having a proprietary in-memory format, if it puts the data in the Apache Arrow format, pandas can just read that memory directly without having to write it out to its own disk format, parse it, and bring it back in.
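As a small concrete example of that interchange, here's a sketch using the pyarrow library; the file name and columns are made up. One side writes a columnar Parquet file, and the other side reads it back as an Arrow table that pandas can consume directly, instead of round-tripping through CSV or some proprietary format.

```python
# Producer side: some application writes its data out as Parquet.
import pyarrow as pa
import pyarrow.parquet as pq

events = pa.table({
    "user_id": [1, 2, 3],
    "first_name": ["Ann", "Bo", "Cy"],
})
pq.write_table(events, "events.parquet")

# Consumer side: any engine or library with a Parquet reader can pick it up.
# Because the format is columnar, you can also read just the columns you need.
arrow_table = pq.read_table("events.parquet", columns=["user_id"])

# The Arrow table is the in-memory columnar representation; pandas consumes it
# directly, without re-parsing text or going through another disk format.
df = arrow_table.to_pandas()
print(df.head())
```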
Why is it all called Apache? Why are they all Apache, what is Apache? The question is why they're all called Apache. So the Apache Software Foundation is a non-profit open source software foundation. Think of it as wanting an independent party in control: there's the Apache license, but there's also the Apache Software Foundation, and it's even more confusing because there's also the Apache web server; it's all the same organization. Apache is one of them, there's also the Linux Foundation, there's a cloud computing foundation; they're basically these non-profit consortiums that give you a governing body for open source software. So even though Netflix wrote Iceberg, people may say, I want to use this software but I don't want Netflix to be in charge of it, because think of what happens when Netflix's competitor wants to start using the software and Netflix starts getting weird about it. So you donate the software to this consortium, the foundation, and then they have control of it. That's why you see Apache in front of things.

In a similar vein, the question is why Apache continues to maintain five of these instead of just picking one winner. Because that's not the model of Apache; it's not a company where somebody up above is making decisions about what's going to succeed or not. There are criteria you have to meet to get your software to be part of the Apache foundation, and you have to elect leaders to decide who controls the commits and the testing and how you vote on things; there's a bunch of bureaucracy they provide, but they don't choose who wins or not. Things do get deprecated, certainly: you can become a top-level project in Apache, but if nobody uses it, it gets relegated. Actually, Jignesh at Wisconsin was building this thing called QuickStep, and it went through the incubation process for Apache, but then it didn't go anywhere and it got killed off.

Yes, the question is, suppose someone puts out an open source file format but they don't put out the surrounding system they built around it, what would you be able to use? That's a good question: if someone makes an open source file format but there's no software to access it, what's the use of it? It would be useless, who cares, right. And so for Parquet, for example, or for ORC, there isn't a database system that is the Parquet database system. Instead they provide low-level libraries, for example in Rust, to parse the data and do some basic manipulation of it, and then you build a larger system around it, potentially using the components I talked about before.

Yes, the question is how hard it is to cobble these components together; if they were all built independently, how would you actually put them together? This is the sort of thing we'll cover in 721, but Apache Arrow helps a lot here, because now I can send data between the services in a universal format. There are still a bunch of other semantics about what things should look like, what it means to be a query or what a query really looks like, that may differ from one component to the next. I would say it's no different than having different abstraction layers inside your database system, like we talked about before; it's just that now you may not have full control over one component. If you don't want to do a hard fork of Parquet, even though Parquet does some things you think are wrong, you just have to live with it.
Actually, a good example: a few years ago we had the founders of BlazingSQL come give a talk. They had their own proprietary storage format, and they were running a data system on GPUs, but everyone was coming to them saying, my data is going to be in the Arrow format. So they had to drop their own proprietary format and switch to Arrow, even though they felt theirs was better. So you have to make sacrifices. Yes? Yeah, people do this all the time, absolutely. I think DuckDB does this: it'll read any of these file formats, but when it brings data into memory it puts it into its own format, and when it spits the result back out to you, it puts it back into Parquet or whatever. Yeah, we have a paper that came out that basically shows that Parquet and ORC were designed 10 years ago, the hardware has changed significantly, and a bunch of the design decisions they made are a bad idea now.

Let's finish up. Execution engines: again, the things you had to build in BusTub, there are now libraries you can download and use. Velox is from Facebook, DataFusion I think is from the Apache Arrow guys and InfluxDB, and then there's this Intel thing, OAP, which I don't think anybody uses. Those first two are going to be the bigger ones, and this is what we're going to play with in 721.

All right, so the main takeaways from all this: the cloud has definitely made distributed databases for OLAP systems way more common than they used to be, and I think that's a byproduct of just having online applications and the internet; you get more data very quickly, and it doesn't take that much work anymore to write an application that picks up a lot of users, starts scaling up, and collects a lot of new data. So there are a lot of vendors in this space, a lot of VC money; some of it is dying down and shifting focus to vector databases, but there are still a lot of problems to be solved in this space.

All right, so again, next class will be the Single Store speaker; that will be on Zoom. As posted on Piazza, just go to that Zoom link, and if you want you can come watch it in my office, and if we run out of space we can spill over into the database lab. Any questions? Yes? No, no, you can sit in your house, in your bathtub, whatever you want to do. With all these open source components, is there like a consortium, like IEEE standards or something, to agree on these? The question is whether there's something like an IEEE standard to specify, sorry, you want your output format to follow an agreed standard, like with CPU architectures, for example, where you have these companies agreeing on things. I think for the things that go outside the internals of the system, it's typically going to be Arrow; that's the lingua franca for exchanging data between different services. But like I was saying, that's just the encoding; there are semantics about what's actually in there that could still differ. All right, hit it.