Carnegie Mellon University's Advanced Database Systems course is filmed in front of a live studio audience. All right, yeah. Let's get through this, get out early, and go see the eclipse. So from this point in the semester, we're going to start reading papers about individual systems. Obviously the paper you guys read today was on Google BigQuery, but the purpose of this part of the semester is to examine how companies are building real systems based on the technologies and methods and techniques and algorithms that we talked about throughout the entire semester. And the goal of this is to learn how to read papers from industry, some of which are a little bit marketing heavy, some of which are actual true systems discussions, and basically understand how they've applied the things we've talked about to solve real-world problems, and also for you to be able to interpret and recalibrate how they describe certain things based on the fundamentals we covered. So for example, sometimes you'll see systems talk about technique ABC in slightly different language than we used, but because we've read certain papers, you know what it's really describing. Not to pick on Dremio, but we'll see Dremio in a second. They talk about having these things called reflections. Like, what the hell is a reflection? I've never heard of that before. You go read the documentation a bit more, and oh, it's just materialized views. The idea is, again, now that you have this core background knowledge about how these real-world systems are actually built, you can cut through the BS and understand what's going on. They call it reflections. Yes? Okay. And then also, you now have this internal catalog you'll be able to build, so that in the good papers, especially the Databricks one, they'll say, oh, we had this problem so we solved it this way, we had that problem so we solved it that way. So when you go out in the real world, you'll be able to apply those same lessons you've learned from reading these papers to other systems and different situations. And of course it's always nice to see that I didn't just make this stuff up. Everything we talk about this semester is real. So as I said, we're going to start off with Google Dremel, aka BigQuery, today. Wednesday we'll talk about Databricks, Spark SQL and Photon, and then Snowflake, DuckDB, Yellowbrick and Redshift. Because of the stomach virus we had to drop the last one, and I decided to drop the Microsoft paper instead of the Yellowbrick one because, as he was asking about before, the Yellowbrick one is wild: they do all sorts of low-level system optimizations and they report real numbers in their papers, whereas in these other papers you're not going to see that. Part of it is because these big companies don't want any numbers reported in these papers that could be used against them in a judo marketing move by their competitors. But Yellowbrick, they don't give a f**k, they put all the numbers in. It's awesome. Okay, so the recurring themes we're going to see throughout the papers we're going to read are all the things that we've talked about this semester. Obviously there's resource disaggregation, separating compute and storage. This is the lakehouse / data lake model, where a bunch of data is on S3, or whatever your object store is, and then we're putting query engines on top of that.
We're also going to see the challenge of dealing with the lack of statistics. The BigQuery paper certainly talked about this. We'll see this over and over again: it's a bunch of files that were uploaded into the object store outside of the control of the database system. So now a query shows up and you want to start planning it. If you're a cost-based optimizer, what are your costs going to be based on if you don't have any statistics? Obviously everything's going to be columnar, but we also want to handle non-relational data, like JSON, or in the case of BigQuery the protobuf files, and then vectorized execution as we talked about before. Again, this is pretty much standard in every OLAP system today, so none of the papers are really going to say anything unique there; it'll always be that. All right, so let's talk about the setup for Google's Dremel BigQuery paper. So maybe for the people that are younger here, you don't think of Google as the hot tech company; maybe OpenAI is what everyone's excited about these days, something like that. But in the 2000s, Google definitely had a sizable influence on how people thought about and designed and developed database systems. And even to this day, I would say their pushes are influential in many ways, just not as strong as they used to be, certainly because the technologies have spread out and proliferated much farther. Now, obviously, LLMs are the hot thing. But back then, pretty much any time Google put out a research paper that said, hey, here's this system we built internally at Google, everyone read it, everyone got really excited, and then people started building open source clones of these things. Because the mindset in some ways was: well, Google's super successful, Google can operate at scale, so if our company wants to be successful, we basically need the same stuff that they're building. Because Google didn't release these things as open source, at least in the beginning, and didn't release them as services. It was like, hey, here's this paper written by Jeff Dean and others, who are obviously very smart people, and everyone went, OK, and scrambled to reinvent everything. So this is a somewhat incomplete list of the database systems or data systems that Google has developed over the years that have been very influential. And it's sort of subdivided into two groups. At the top here, you have all the NoSQL systems. Because again, Google, without maybe coming out and saying, yes, we're a NoSQL company, was certainly at the vanguard of the NoSQL movement. And you saw this in the paper you guys read. And then there are all these other systems in the early 2000s and going forward, where Google realized, oh, SQL was actually a good idea, and started building systems around this. Again, in the paper you guys read, there's this paragraph right here that talks about how the conventional wisdom at Google was that SQL didn't scale. And then everyone else sort of jumped on the bandwagon and was designing systems based on some of these early ideas. MongoDB is probably one of the biggest ones, where people were saying, we don't want to use SQL, SQL doesn't scale, we don't want to use joins. And now the tide has turned. And so the Dremel guys talk about how they see themselves as the ones that actually made SQL cool again, or important again, or matter again at Google.
So a lot of these systems have had sizable influence on a bunch of different data systems. And as I said, a lot of these things started off as early research papers that Google put out. And typically the way companies put out papers, it's usually about two or three, maybe even four or five years behind what the actual state of the art is internally. So they'll build a system, get it up and running, file the patents for it, and then they'll write the paper. By the time the paper comes out, it's a couple years old. But MapReduce obviously was very influential; it made Hadoop, made Spark. Bigtable was cloned as HBase, Accumulo, Hypertable. And LevelDB, this is the one they actually did open source, and then it got forked off as RocksDB, and RocksDB is more common now. And then there are other examples down here. And there are a few others, like Procella from YouTube. Again, that's just a paper; it's not open source. There are a couple other systems I could list here that I sort of forgot, or just didn't have room for. Yes? Why is it that we just look at Dremel and not the other ones? So why are we looking at this one and not these others? The question is, why do we care about Dremel in this class and not these other ones? Because Dremel is the only one that's doing analytics, the OLAP stuff. Megastore was an early one to do transactions. Vitess is sharded MySQL out of YouTube. Spanner, which you know about, is transactions. These are all transactional ones. Napa is sort of getting into the realm of doing analytics. We can talk about it a little bit maybe next class in the context of Delta Lake or Iceberg. But Napa is all about, I can do my Dremel-style analytics on historical data, but I'm also incorporating newly ingested data, and I can make a trade-off between how much I want to read from the newly inserted data versus the historical data. And Delta Lake and Iceberg are doing something similar. Yeah. But again, part of the reason, to answer your question of why we're using Dremel: this is a very influential paper. This is pretty much how every lakehouse engine is built today. Yes? What do you mean by "only"? So I have an asterisk for CockroachDB and TiDB: only for marketing. Because when these systems first came out, the CockroachDB guys certainly worked at Google, but they didn't work on Spanner. And the TiDB guys had nothing to do with Spanner. Spanner was like the hot thing for transactional databases. So these guys, maybe they didn't explicitly say it, but others were saying that these are open source variants of Spanner, and they didn't correct them, right? But in the case of CockroachDB and TiDB, that's not really the case, because Spanner relies on this TrueTime service with GPS clocks and atomic clocks, the TrueTime hardware service. CockroachDB is doing everything in software. So certainly now they're not gonna say that they're the open source version of Spanner. But if Google were to open source Spanner, you'd probably get something like CockroachDB and TiDB. Yes? So he said that every lakehouse comes from this, but the Databricks guys coined the lakehouse term. So his question is: I'm saying that all these lakehouse systems are based on the high-level architecture of Dremel, but the Databricks guys are the ones that branded the lakehouse term, right? But that's just marketing again, right?
The Trino guys say, no, you don't want a lakehouse, you want an icehouse, right? So it's just marketing, but the idea of disaggregated storage with a vectorized execution engine that can read data it's never seen before, all of that comes from Dremel. There are other aspects, though. In the case of Snowflake, Snowflake was doing vectorized execution, or actually more significantly VectorWise; the VectorWise guys wrote the paper on all that vectorized execution stuff. That's prevalent everywhere and common now in lakehouse systems, but they weren't calling it a lakehouse back in the day. It's just a marketing term. Yes? Sorry, the original was 2010, in VLDB. So the original paper is like 2011, right? I think in the paper they say it came out, it was like a side project, in 2006 or something. The original paper's story, the 20% thing. Yeah, I'm putting the date here based on, yeah, so maybe the dates are off. This is when it was first known. So the paper you guys read is the 10-year retrospective of the original paper. The original paper is 2011, but I think that paper mentions that somebody was building it in 2006, originally as a shared-nothing system. Right? It just got the test-of-time award at VLDB. Yeah, so whatever, I'm off by a year. But again, the paper didn't even say they started it earlier. 10,000 years ago. Yeah, okay. Whatever. Again, Napa, I know those guys had been building it since 2017, 2018, but I was under NDA and couldn't say that; the paper came out in 2021. Napa? No, it's... Does it do ingestion then? Because you're saying it's analytics. It does, like you ingest data. You ingest data and it gets appended into the system. I don't know whether that ingestion process is transactional or not. And then they have this notion of, when I run queries, I specify: do I want to run as fast as possible but maybe give up reading the freshest data, or do I want to read the freshest data and pay a little extra to go faster? Cost. They have an objective function, sort of a triangle, not just cost and performance but also freshness. Oh, okay. Yeah. Again, for this class, let's focus on Dremel. They gave a talk with us a year or two ago during the pandemic about Napa. It's a good talk. All right. All right, there you go. 2006, right? So again, this was a side project at Google, like 20% time, one day a week they were allowed to work on this. And the original problem they were trying to solve was that there were all these artifacts being generated from different tools and services all throughout Google that showed up on GFS, their internal file system. And the idea was they wanted to be able to run queries on top of this data through SQL rather than writing C++ MapReduce jobs. Again, going back to the mid-2000s and Google saying they don't want to use SQL, everyone was writing these MapReduce jobs. Hadoop, the open source version of MapReduce, was written in Java; the Google version was all in C++. So you had to write C++ code to do scans and joins on data. It's terrible, right? Oh yeah. So the idea was they wanted to be able to just have a bunch of files sitting around on disk, sorry, in shared storage, and query them.
Although the first version actually was a shared-nothing system, meaning you had to ingest the data into the system and then it got cataloged internally. And then the rewrite, this should say late 2000s, not 2010, they rewrote it to be disaggregated storage, where you're just reading data directly off of Google's file system, GFS, right? And then the first paper came out in, I think, 2010, and then it got commercialized as BigQuery in 2012. And the reason I had you guys read the follow-up paper rather than the original paper is that the original paper doesn't talk about the shuffle service while this one does, and that's actually a key thing that separates BigQuery from other systems and allows them to do some interesting optimizations that other systems can't easily do. Does anybody know what a dremel is outside of a database? What is this hand gesture? Yeah, it's a tool, right? There's a footnote in the paper: it's a brand of power tools that primarily rely on speed as opposed to torque, right? More or less, it's this miniature rotary tool, like a grinder and other things. I'm always surprised that their lawyers let them put out a paper that says, hey, we have this internal service for a multi-billion-dollar company and we've named it after another company, right? That's asking for a lawsuit, but they did it. And then the commercial version they smartly renamed as BigQuery. So all the documentation you'll see online for what Dremel's actually doing will just call it BigQuery, but for whatever reason the papers still refer to it as Dremel. All right, so there's this notion of in-situ data processing. We've already covered this many times this semester. It just means that I have a bunch of files sitting out in some storage that's separate from, or not under the control of, the database system, something else is going to be putting files there, and people then want to run queries on top of them. So obviously I need to have a way to know what the files are, some kind of catalog, right, and reference them to some table or some logical identifier, so that if you want to read this collection of data, or table foo, whatever you want to call it, here are the files to go get. But other than that, the database system doesn't necessarily need to know anything. Now, when we read Snowflake next week: Snowflake had what we call managed storage, where you ingest data into the data system and then Snowflake is responsible for deciding how to chop it up and where to store it, and it understands everything about it. In the newer versions of Snowflake, they now have to support this lakehouse architecture, so they support reading data from Iceberg files. Same thing with Redshift: they originally started out as a shared-nothing system where everything was managed storage; now with Redshift Spectrum you can read files on S3. Yes? So Snowflake, do they also charge the ETL cost of moving that data into their proprietary format? The question is, does Snowflake charge the ETL cost of getting data from remote storage into their proprietary storage? I'm sure they do, right? Well, doesn't that make it a much worse product because of the extra steps they have to do? His question is, does it make it a much worse product because of the extra step they have to do? Well, let's pause on that for now, right?
Again, it's not just that cost matters; performance matters too, and there are so many different factors that go into saying whether it's a bad product. And I'm not trying to cop out or, you know, suck up to our corporate masters or whatever. I'm just saying, depending on the scenario, that may or may not be a good idea. But the fact that they do support it now is a good thing, right? And as we've seen throughout the semester, in all these systems, when you have a bunch of files in some format, and we've seen this with the teams working on the caching service, the I/O service here, you're going to convert it into Arrow or some other internal format and then process it anyway, right? So who pays for that cost? It depends on the pricing model. So again, this is the idea of Dremel: what they were trying to do was read files where they exist, right? This is what we do in the data lake or the lakehouse stuff, again, just a marketing term, but Dremel was doing it long ago. And in the paper, they point out that one of the key reasons they went with this capability of just reading files where they exist is that their users were willing to sacrifice the performance of natively managed data in exchange for flexibility and ease of use, meaning I don't have to define a schema, then load the files into my schema, then run queries on it, right? Because there's a human cost to that, a labor cost. It's rather: yeah, my query's going to run a bit slower because I'm reading a bunch of files that may not be in the best format for my data system, but that's okay, because I can get to it really quickly. And from my perspective, yes, I think this is the right trade-off. And SQL is typically the right abstraction for doing this. So for all the systems we're going to look at for the next two weeks, we're going to use the same kind of summary page like this, hitting all the high-level aspects of the system as they relate to the things we talked about throughout the semester. A lot of this is going to be table stakes, things you would expect any modern lakehouse or OLAP engine to support. So shared-disk, disaggregated storage, that's to be expected. Vectorized query processing: as we said, these papers aren't going to say anything deep about it, other than I know that BigQuery is using SIMD intrinsics because we asked them; the paper doesn't say that, though. The shuffle-based distribution we'll get into in a second. Google has their own proprietary format called Capacitor, and we'll see that in a second; there aren't a lot of details about it, but it's basically going to look like Parquet and ORC, and I'm sure people are generating Parquet and ORC files internally at Google too. For the columnar storage, they're going to use all the tricks we talked about: zone maps, filters, dictionary and RLE compression. The only index they support in the BigQuery service is inverted search indexes, to do LIKE and regular-expression lookups on strings.
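As a quick refresher on one of those tricks, here's a minimal zone-map pruning sketch in Python. This is illustrative only, not Capacitor's actual layout; the chunking and function names are made up.

```python
# Toy zone-map pruning: keep min/max per column chunk so a scan can skip
# chunks that cannot possibly match the predicate.
def build_zone_map(chunks):
    return [(min(c), max(c)) for c in chunks]

def scan_gt(chunks, zone_map, threshold):
    # Only decode chunks whose max value exceeds the threshold.
    for chunk, (lo, hi) in zip(chunks, zone_map):
        if hi > threshold:                 # chunk may contain matches
            yield from (v for v in chunk if v > threshold)

chunks = [[1, 3, 5], [10, 12, 14], [2, 4, 6]]
zmap = build_zone_map(chunks)
print(list(scan_gt(chunks, zmap, 9)))      # [10, 12, 14]; chunks 0 and 2 skipped
```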
They're only going to support hash joins, and then there's a combination of a heuristic optimizer and a very light cost-based optimizer for when you have some statistics, which usually you don't, and they're going to rely heavily on the ability to adapt the query plan while it's running based on the data it sees. So we're going to spend most of our time talking about this, because this is going to allow them to do things that we couldn't easily do before. And this is also the transition from what we've been talking about the entire semester, where we discussed how to build the single-node execution engine first, and now we start gluing things together. And in these systems, the shuffle especially is one way to glue it together. So when a query shows up, the data system is going to convert it into a logical plan and then divide that into stages, which roughly correspond to pipelines, but not always. And then within the stages, you're going to have multiple parallel tasks that get distributed across the workers. One key aspect of their query plans is that they need to guarantee that every task is deterministic, meaning if I execute it over and over again with the same data, I should produce the exact same result, and idempotent, and this gives them the ability to restart or kill a straggler, a task that's running slow, and then re-execute the task on another worker and be guaranteed to produce the same results. So think of things like: if I call the random function in my query, I need to guarantee that no matter what worker I run on, when I invoke that random function, I get the same sequence of values every time. Yes? Doesn't that seem like it could be a problem? Yes, deterministic in terms of how it's run and also producing the same result; that's the idempotent part, yes. All right, so there'll be a root node, or the coordinator, that's going to be responsible for dispatching all the tasks. They talk about having a centralized scheduler, so the coordinator is sort of setting things up and then handing things off to the scheduler. And what's interesting, and we'll see this I think in the Snowflake paper as well, is that if you have all the workers going out to the catalog, the metadata, to ask where the files they need are when they start executing their tasks, then you could have thousands of workers all of a sudden flooding the catalog with all these requests. So instead, the root node is going to do a batch request to the catalog to get all the metadata about the files the query is going to scan ahead of time, and then embed that in the logical plan. So when you hand the tasks off to the workers, they don't have to do a lookup in the catalog; they have everything they need to process from the beginning. Every worker is going to have its own local memory and local disk, and if they run out of memory while processing a given task, they'll be able to spill to that disk and spool it back in as needed. But also, as we'll see in a second, they're going to write out their results to a remote memory service. So this is a really simple query plan; it's doing a lookup to get the number of articles in Wikipedia with my name, Pavlo, in it. There's some other asshole Greek singer named Pavlo too, and he might be in there. It used to be, when I was younger, you'd Google Pavlo and I would come up first. Now this other guy does, but whatever.
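As a quick aside before the walkthrough, here's a tiny sketch of that determinism requirement. The seeding scheme is my own assumption for illustration, not Dremel's actual mechanism.

```python
import random

# Toy sketch: seed any randomness from stable identifiers (query id,
# task id) so that re-running the task on another worker produces
# identical output, which is what makes kill-and-retry safe.
def run_task(query_id: str, task_id: int, rows):
    rng = random.Random(f"{query_id}:{task_id}")   # stable per-task seed
    return [(row, rng.random()) for row in rows]

a = run_task("q42", 7, ["x", "y"])
b = run_task("q42", 7, ["x", "y"])   # simulated retry on a different worker
assert a == b                        # deterministic, so safe to re-execute
```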
All right, so we have our distributed file system with a bunch of data we want to access. In the first stage, the coordinator says, okay, I'm going to fire up a bunch of these workers, say to do a partial group-by. These workers are responsible for pulling the data they need from the shared-disk storage and then doing the processing on it. And now the output of these workers is not going to go directly to the next stage of workers. Instead it's going to go to this in-memory shuffle service. So all of the worker nodes are going to be writing out their data to this, think of it like an in-memory key-value store that's partitioned, scaled horizontally. So I can hash whatever data I'm looking at and decide which shuffle node to send it to. And then the shuffle nodes can send additional metadata to the coordinator, like here's the data I saw for the first stage of this query, and the coordinator can decide on the fly how many workers it should use for the next stage. Then it sends that request to the scheduler, the scheduler fires up these workers, and these guys are going to pull data from the in-memory shuffle. So workers are not going to communicate directly with each other across the stages; they always use this in-memory shuffle as an intermediary. Yes? Two questions. Is the in-memory shuffle a single-node thing or is it spread out? The question is, is it single node or scaled out? Scaled out. Second thing, why have that? Why not just make it like MapReduce does it, where the workers talk to each other? The question is, why am I doing this? Why have this extra step versus having a worker just pull the data from another worker directly? Right, there are performance implications. Now I can kill all these guys and reuse them for other things while this thing is just maintaining the data. Otherwise, I have to keep these workers around to make sure the next stage gets all the data it needs. Because what happens if, say, one of these guys goes down? Then I have to go back to the previous worker and get the data again. And then, as I said, we'll see in a second: having this intermediate step means that once I get all the data from the first stage, I can decide what to do in the next stage because I've seen the data, because I now have it in a sort of central location whose statistics I can pass along to the coordinator. Does it live for longer than just one cycle? Like, do you keep that memory for a very long time? The question is, do I keep this memory around for a very long time? You mean the contents of the memory, or the service itself? The service is always running. Right, but I mean the contents of the memory, in the sense that we're waiting for all three of those workers to finish before we can erase that in-memory shuffle data, so when you get to the next stage... Yes, at what point does this get blown away? The coordinator would come back and say, all right, I've completed this stage, everybody's got the data. You still need to keep it past this stage, right? Because again, these guys could crash and then you'd need to fetch it again. But once you know that nobody else is going to go back to that data, you can blow it away. Another thing, that needs like a shit ton of memory.
And also, the worker can die, but this can die too, right? Yeah, but it's just a key-value store. There are known techniques to replicate and scale this thing itself, right? It's even crazier: they actually fab custom hardware to make this go as fast as possible. Oh, fuck. Yeah, it's awesome. Yes. The question is, why not keep checkpoint metadata rather than stream all the data here? But then where do you keep the data? On the workers, right? That means these guys can't go away until you've gotten the data over there. And you'd need the data to get flushed to disk because the intermediate result might be too big. Yep. Whereas here they assume the memory is big enough to hold all of it. And then isn't that what Spark also does? Like, didn't they implement a similar method in Spark? So there are two things. One is, in Hadoop, you would write to local disk here, because you might run out of memory, whereas in this case the service is just so massive that you're not going to run out of memory, and this thing can spill to disk too. We'll see that in a second. And actually, it'll spill to GFS or Colossus, whatever the Google file system is. And your second comment, isn't this what Spark is doing? Spark still, I think, maintains the shuffle data on the worker nodes. Yeah, it basically does the same thing MapReduce does; the shuffle happens between the stages. Obviously, I'll say that Dremel, or even MapReduce and Hadoop, didn't invent the idea of this shuffle step. That comes from distributed databases, parallel databases from the 80s and 90s, right? What's unique about BigQuery and Dremel is that they explicitly do this between every stage. Snowflake can do shuffles as well, but they only use them as needed. These guys do it for everything. Yes? Are there disadvantages to always doing this? The question is, are there disadvantages to doing this? What if you always want data on the same worker? Ah, okay, so this is what I was saying before. I think I said back here: they call them stages, but they're not always pipeline breakers. In some cases, you can have the second stage kick off while this stage is still running, right? And you can start processing ahead of time. So that's one advantage: you could have this thing get fired up and start reading the data before these guys even finish, right? From a software engineering standpoint, you also no longer have to embed the logic of how to scale up or scale down, or do the other optimizations we'll see in a second, in all your workers. Now the coordinator just says, okay, I need more workers, do it this way, or move the data here and there. From a software engineer's perspective, the worker implementation is much simpler now, right? So again, at the end, the last stage is doing a sort with a limit, and one worker can handle that. As far as I know, these are just containers running in Borg, which is the precursor to Kubernetes, right? So they're meant to be stateless, and these things can get killed and swapped out at any time. Wait, is that why they're doing the in-memory shuffle? Because they're supposed to be stateless and they don't want to keep them alive, keep something stateless alive? I mean, the question is, is that why they're doing the in-memory shuffle, because these things are stateless and you don't want to keep them alive?
Potentially, I mean, it's one of the ideas, yes. But again, there are database query-plan advantages we can leverage if we have this extra stage. We'll see in a second. So the shuffle is basically a producer-consumer model. It's just a way to send the intermediate results from one stage to the next using this dedicated service. As I was saying, the paper talks about how this in-memory service isn't used just for Dremel. I think Dremel is the main consumer of it, but it's used in other services within Google as well. All right, so again, the workers just send their output to the shuffle nodes, and if the shuffle nodes run out of space, they can spill to GFS if necessary. And then it's just a get and set API. So the workers in the next stage just say get, get, get, and get more data from the shuffle nodes. Right, so in this case here, say all the workers are consuming data from the previous stage. It could be from the distributed file system, the data I'm reading, or it could be from the shuffle service itself. And then they're processing the data and writing their output out. If I run out of memory, I can always spill to the distributed file system. And then another key advantage of doing this is that I no longer have to do N-to-N communication from one stage to the next: because this data is partitioned, I only need to get the data from a subset of the nodes rather than talking to all the possible workers, right? With the shuffle service, the coordinator is going to tell us, here's the data you need, here are the shuffle nodes to go get it from. Without that, I'd potentially have to poll all these guys and ask, do you have any data I should be consuming? So from that perspective, this is way more efficient in terms of communication traffic. And then, I think, they can also pull from the distributed file system rather than having to go get it from the shuffle service if anything gets spilled to disk. So again, the shuffle is basically like a checkpoint in the query plan. And this part is actually unique to Dremel, because historically parallel distributed databases didn't do checkpoints and they weren't fault-tolerant within the query itself, meaning if I had a two-hour query that was going to run and one node happened to die partway through, then the whole query dies and I have to restart. From the database system's perspective, the disk was so slow that it was just not worth it to write out intermediate results, right? Whereas Hadoop, as he mentioned, was doing that for every shuffle, always writing things out to local disk and then replicating things on HDFS. And that was really slow, but that was Google's model: okay, we're running on thousands of cheap pizza-box machines that could die at any time. Whereas from a parallel database system perspective, it was better to design the system assuming you're running not on cheap $1,000 rack machines but on high-end machines that aren't going to crash that often. So you get better performance, but you're not fault-tolerant if one of the nodes goes down.
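Pulling the get/set API, the hash routing, and the spilling together, here's a toy sketch. None of these names are Google's; it's just to make the producer/consumer flow concrete.

```python
NUM_SHUFFLE_NODES = 4

class ShuffleNode:
    """Stand-in for one node of the partitioned in-memory shuffle service."""
    def __init__(self, mem_limit_rows=1000):
        self.mem = {}        # partition id -> rows held in memory
        self.spilled = {}    # stand-in for rows spilled to GFS/Colossus
        self.mem_limit_rows = mem_limit_rows

    def set(self, partition, rows):
        held = sum(len(v) for v in self.mem.values())
        # Spill once the in-memory budget is exhausted.
        target = self.spilled if held + len(rows) > self.mem_limit_rows else self.mem
        target.setdefault(partition, []).extend(rows)

    def get(self, partition):
        # A consumer only contacts the nodes holding its partitions,
        # instead of polling every producer in the previous stage.
        return self.mem.get(partition, []) + self.spilled.get(partition, [])

nodes = [ShuffleNode() for _ in range(NUM_SHUFFLE_NODES)]

def emit(partition, rows):
    # Producers route a partition to a shuffle node by hashing its id.
    nodes[hash(partition) % NUM_SHUFFLE_NODES].set(partition, rows)

emit(7, ["a", "b"])
print(nodes[hash(7) % NUM_SHUFFLE_NODES].get(7))   # ['a', 'b']
```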
So the in-memory service allows them to get that fault tolerance by taking a checkpoint between the different stages of the query plan, but because it's an in-memory service, it's not going to be as slow as writing to disk. Now with NVMe drives, maybe it's less of an issue because disks got really, really fast, but 10 years ago this obviously was a big concern. So you get fault tolerance because any time a node crashes, you just get the data you need from the in-memory shuffle and run the task on another worker; because it's idempotent, you can run it again without any side effects. If a task is running too slow because the worker node is, for whatever reason, slow, and the BigQuery guys told me that one of the big problems they face is sometimes a query lands on a node where another container is doing encoding for YouTube, and they can look at the traffic and actually know it was YouTube, and that'll slow down the query. So if you have a straggler, you can go ahead and just kill it and assign the task to another worker that can run faster. And then also, as I alluded to, because we have this explicit stage, we can take a step back, look at what our query has done so far, look at what the data looks like, and then decide for the next stage whether we need to scale up or scale down the number of workers to process the query. So both of these examples here: the workers are running, producing data that they're sending to the shuffle nodes, and say for whatever reason this node is just falling behind, it can't keep up. We decide to go ahead and kill it and just reassign the task to this other worker here, who again is just getting the data either from the distributed file system, it'll be there, or from the shuffle service, which again will always be there. And then once I collect all my data in the shuffle storage, I post some information to the coordinator, which can look at the statistics of what the data actually looks like and then decide, based on the SLA requirements for the query, do I have too many workers or not enough? And if I want to, I can regenerate, add more, and I don't have to move any of this data around; I just reassign which worker is going to read which data from the shuffle service. Yes. I just want to clarify what you mean by checkpoint. I think you said that the shuffle service only spills to disk when it has to. So what does checkpoint mean exactly? So it's not a checkpoint like we think about in the intro databases class, where I'm taking all the contents of memory and writing them out to disk. I don't want to use the word staging point, because these are already called stages. Pause is also not the right word, because it's not like you're stopping anything. Savepoint means something explicit too. It's hard to name. It's more from a logistical standpoint: before I start executing the next stage, I can decide whether I need to change my query plan, or the topology of the query plan, or the number of workers I have in the subsequent stages, because I've seen the data that got generated by the previous stage. Yeah, so by checkpoint I don't mean that everything in here always gets written to disk, because they want to keep things in memory as much as possible.
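To make that stats-driven resizing concrete, here's an illustrative coordinator-side sizing rule. The threshold is invented; the paper does not give the actual policy.

```python
# Toy policy: pick the next stage's worker count from the amount of data
# that actually landed in the shuffle (one worker per ~256 MB of input).
def workers_for_next_stage(bytes_in_shuffle, target_bytes_per_worker=256 << 20):
    # Ceiling division, with at least one worker.
    return max(1, -(-bytes_in_shuffle // target_bytes_per_worker))

print(workers_for_next_stage(10 << 30))  # large intermediate result -> 40
print(workers_for_next_stage(1 << 20))   # tiny result -> scale down to 1
```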
Again, these are intermediate results; I don't need the data beyond the query I'm actually trying to run right now. There are some papers about reusing data structures from one query to the next, sort of like a mini materialized view, like keeping the hash table for a join around from one query to the next. They're not doing that. Literally it's just: I get all my data in this location, I can then have a global view of what's going on and decide where to go next, or what to do next. Yes. The question is, how is data stored across the different shuffle nodes? I mean, this is just in memory; think of it as an in-memory hash table. And how is it split if you have two or three of them? Oh, yeah, yeah. So think of it like this: my task processes some data, I have a key on the data that I hash, and I mod it by the number of shuffle nodes. Think of it like a consistent hash table. Yes. Does this distributed file system also exist on the same worker nodes, or is it completely separate? This is a completely separate service. This is Colossus, GFS. This is their S3. So I guess from the perspective of the worker, they don't care whether it's in memory, right? I mean, you do care that it's in memory, because you want to hand it to whoever asks for it as quickly as possible. That's the whole point; these are large-memory machines. This is Google, so it's not actually S3, but like you said, it's their S3, that's fine, it's an object store, right? I thought somebody said S3 at least keeps its metadata in MySQL. But again, for this we don't care. Okay. So I keep alluding to this: now that we have this staging area in the in-memory shuffle store, and we collect statistics about the data we've gotten from the previous stage, we can start making decisions about what we want to do. But at the very, very beginning, obviously we don't have any of that information. It could be a bunch of files we've never scanned before. I think the paper even says that a large percentage of the data that Dremel queries are processing is files that the data system has never seen before. Right? So there are no statistics. So how can we actually try to generate an optimal query plan without any of this? They also talk about the ability to run queries against other data sources or database systems, and this is oftentimes called connectors. We'll see this in other systems as we go along. But the idea is that I have a single logical view within BigQuery Dremel over a bunch of disparate database systems. And now when I run my query, I can say, go read this Postgres table, and the system is responsible for writing the corresponding query to go against Postgres and get the data it needs. But at that point, if our query gets converted to another query that runs on another system, we have no statistics. We have nothing, right? Worst-case scenario, we'd be doing a SELECT * against the other table and then do the processing once we get the data into our system. Best-case scenario, we can do some kind of predicate pushdown to the other system. But again, at this point you don't have any stats.
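Here's a hypothetical sketch of that connector trade-off: worst case the engine does SELECT * and filters locally, best case it pushes the predicate into the query it sends to the remote system. The function name is made up for illustration.

```python
# Toy connector: rewrite a remote scan into the SQL sent to, e.g., Postgres.
def remote_scan_sql(table, predicate=None):
    sql = f"SELECT * FROM {table}"
    if predicate is not None:
        sql += f" WHERE {predicate}"   # best case: filter runs remotely
    return sql

print(remote_scan_sql("orders"))                        # worst case: SELECT *
print(remote_scan_sql("orders", "o_totalprice > 100"))  # predicate pushdown
```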
So the way Dremel does query optimization is a stratified approach, with a rule-based optimizer and a cost-based optimizer, where the cost-based one only does basic analysis on the data you're trying to access, and only if you actually have some information about it already. So for the rules, it's all the classic stuff we talked about: predicate pushdown, primary key / foreign key hints, some very basic join ordering. They have custom rules to do constraint propagation for star schemas, like you can propagate the constraints from a dimension table into the fact table. Or the system will detect that if you have a star schema, a fact table with all the dimension tables, then when it generates the stages it knows to build the hash tables on the dimension tables and then have this single pipeline where you read the fact table all the way up and just do probes into all these hash tables. So they have basic rules to check for these things, but they only trigger the cost-based analysis and optimizations if you have some stats, which they only generate if you have a materialized view. But materialized views are not the common case; they have to deal with not having any statistics. So instead, to avoid the kind of bad cost-model estimates we saw before, they're going to apply adaptive query optimization techniques, and they're going to do this by relying on that shuffle stage as a way to say: okay, stop, look at what's going on, and then recalibrate as needed. We'll see various other techniques used to do adaptive query optimization in Snowflake and Databricks and so forth. They're not going to be as aggressive or all-encompassing as some papers we discussed. They're not doing plan stitching, and they're also not embedding those choose-plan nodes that decide between this query plan and that one. It's going to be more like changing the number of workers they have and maybe changing what join algorithm they want to use based on the data they've seen, but not recalibrating or reorganizing the entire query plan. We can do this because we have the shuffle staging point where we can look at the data we've collected, and the idea is that we want to fix things as we go along. So obviously we've already seen how to change the number of workers in a stage. If I recognize that the data showing up is much larger or smaller than I anticipated, because maybe there's a very selective filter that I didn't anticipate was going to throw out most of the data, then I can scale down the number of workers I have at the next stage. You can decide whether you want to do a shuffle join or a broadcast join based on the data that shows up in the shuffle phase; we'll see that in a second. You can also change the implementation of the operator you're going to use. This one I don't fully understand because they don't talk about it in the paper, but they have notions like an operator implementation for small partitions versus large partitions. I'm guessing they do things like unrolling loops if you know you're only going to read a small amount of data in each partition. And then dynamic repartitioning is a way to split the data up further if you have a hot bucket, again based on the data you've seen. I'll go through these two examples. All right, so say I have a query here that's going to read data from table A and table B, and I want to do a join.
In the very first stage, I'm going to have a bunch of workers read data from table A and a bunch of workers read data from table B. And again, maybe there's some filter I push down into these workers that will start pruning data. These workers start submitting all the results from these scans into the shuffle stage. And think of it like, internally, we just build a histogram or a counter that says how much data we put into the partitions for these two tables. And say, for whatever reason, table A is much smaller than we anticipated. In that case, maybe we don't actually want to do the shuffle join where we're repartitioning the data on the join key. You can recognize that this data is actually small, small enough to fit on every single node, and change what join you want to use. So say the original plan was that we were going to hash the data and then send it to individual workers. Really they're pulling from the shuffle, but I'm drawing the arrows to show the data flow. But again, if this thing is super small, then I can just change it to a broadcast join, where now every worker will go get the entire contents of table A from the shuffle service. And I still do the shuffle on B, partition it up, but now when I join B with A, I have all the data I need to do the join locally. The other choice is to do dynamic repartitioning. So say I'm scanning my data, right, and say again, for whatever reason, partition two is much larger than I anticipated. So this thing is going to spill to disk, and that's going to be slow. So what I can do is, as I'm running, I'm passing statistics to the coordinator, and then I can say, okay, create two new partitions. That's my biological daughter. All right, sorry. So the coordinator says, okay, well, this partition is going to run out of space, so now go send a message to the worker and say: all right, anything you were going to send to partition two, hash it again and send it to these two new partitions that I just added. Basically recursive partitioning, from the Grace hash join algorithm we saw in the intro class, right? So these guys keep running and they start filling up these two new partitions. And then when this stage is done, I introduce a new task in my stage to do the repartitioning that then goes and reads the data. Sorry, this drawing is not professional, all right. It goes and reads the data from partition two and then just rehashes it and puts it in partitions three and four, all right. Yes? In the previous one, for the join, does one of them always need to be a broadcast, and it's just which one you want to make the broadcast? When does it always need to be a broadcast? Oh, sorry, yeah, you can do the shuffle join. The shuffle join is just: everything gets repartitioned on the hash key, right? The alternative is to do a broadcast join, where one of them gets broadcast to everybody and then you just scan the other one locally. Actually, yeah, there's a stage I'm missing: if you do a broadcast join, I broadcast A, but I don't need to do the partitioning on B. So you go ahead and kill these workers here, and then in the next stage they just read the data directly from the table's files. Yes, I'm missing the arrows to draw that, but that's how you do a broadcast join.
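Here's a sketch of the two adaptive moves just walked through, with invented thresholds: switch a planned shuffle join to a broadcast join when one side turns out tiny, and rehash a hot partition into two new ones, Grace-hash style. This is illustrative, not BigQuery's actual policy.

```python
BROADCAST_LIMIT = 64 << 20     # invented: "small enough to ship everywhere"

def pick_join(bytes_a, bytes_b):
    # Decided *after* the scan stage, using sizes observed in the shuffle.
    if min(bytes_a, bytes_b) <= BROADCAST_LIMIT:
        return "broadcast"     # ship the small side to every worker
    return "shuffle"           # repartition both sides on the join key

def split_hot_partition(rows, next_id):
    # Recursive partitioning: rehash the hot partition into two new ones.
    out = {next_id: [], next_id + 1: []}
    for row in rows:
        out[next_id + (hash(row) & 1)].append(row)
    return out

print(pick_join(32 << 20, 8 << 30))               # 'broadcast'
print(split_hot_partition([1, 2, 3], next_id=3))  # spread over partitions 3 and 4
```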
Because the idea is, if one table is small, you can send it around everywhere and leave the other table where it originally resided. So as I said before, they're going to rely on an internal distributed file system called Colossus. Originally they started off with GFS, but then they switched to Colossus for scalable storage. Again, think of it like an object store, like S3 and the other ones we've talked about. The idea is that this is a service external to the data system; you just let it manage all the storage for you. The paper also talks about how they rely on a file format called Capacitor, which is internal to Google. This link here will take you to a blog article that mentions it. There isn't much documentation about it, and it's not open source, but it more or less looks like ORC and Parquet when you talk to the Google people. One thing that Capacitor does that ORC and Parquet do not is that you can do predicate pushdown and partial query evaluation, expression evaluation, within the access library itself, directly on the data. So on S3, again, you can do some pushdown of WHERE clauses for SELECT statements on Parquet and CSV files, or JSON files as well, but it's pretty limited. And certainly if you access Parquet through the Arrow libraries, it decompresses everything as you're iterating the data, whereas this thing can do filtering directly on compressed data without decompressing it first. There's another file format called Artus that was for the YouTube Procella system that has similar capabilities. But at a high level, this is just going to look like Parquet and ORC, except that you can do better early filtering. And we saw before how they handle repetition and definition levels to deal with nested data, think JSON data, but again, it's the Google world, so it's protocol buffers. These Capacitor files are going to be self-describing, meaning, just like Parquet and ORC, there'll be something in the footer that says here's the schema you should expect to see. And they talk about how the metadata for the schema is stored as columnar data as well. So even though I may have 10,000 attributes in my file, I don't have to deserialize the entire thing like you would in Parquet and ORC; I can do all the optimizations and lookups directly on the metadata and find the things I'm looking for. Again, this by itself is not mind-blowingly amazing, but it's certainly better than what's in Parquet and ORC, the current state of the art.
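To make those repetition and definition levels concrete, here's a worked example in the style of the Dremel encoding (this is my own toy encoder, not Google's code). For a column a.b, where `a` is a repeated group and `b` is an optional int, the repetition level r says at which repeated ancestor a value repeats (0 means a new record), and the definition level d says how many optional/repeated fields on the path are actually present (here the max is 2).

```python
docs = [
    {"a": [{"b": 1}, {}, {"b": 2}]},   # doc 1: three elements of a, one missing b
    {"a": []},                         # doc 2: a is empty
]

def encode_a_b(docs):
    out = []
    for doc in docs:
        if not doc["a"]:
            out.append((0, 0, None))          # record exists, a absent: r=0, d=0
            continue
        for i, elem in enumerate(doc["a"]):
            r = 0 if i == 0 else 1            # repeats at `a` after the first
            if "b" in elem:
                out.append((r, 2, elem["b"])) # full path defined: d=2
            else:
                out.append((r, 1, None))      # a present, b missing: d=1
    return out

print(encode_a_b(docs))
# [(0, 2, 1), (1, 1, None), (1, 2, 2), (0, 0, None)]
```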
The last interesting thing to talk about in this paper is similar to what we saw in the Velox paper. Dremel was one of the first big systems Google built that brought back SQL, and once SQL came back into fashion at Google, there were a bunch of different random projects where people started adding their own support for SQL. But the problem is, all of these internal projects had their own dialect of SQL. So there was an effort in the late 2010s to unify this across the entire corporation with a single SQL dialect, called GoogleSQL, that all these systems would then incorporate. That way you didn't have to deal with the weird nuances of one SQL dialect versus another; across the entire corporation everything was always the same. Again, in the Velox world they talked about how there were all these duplicated functions and everyone was reinventing the wheel over and over, and Velox standardized those implementations. Same idea here. Now, GoogleSQL itself is not open source, but there's an open source variant of it called ZetaSQL. Who here has ever heard of ZetaSQL? Nobody, okay. So this thing is supposed to be the open source version of GoogleSQL. And the idea was, people could start building ZetaSQL-compatible database systems that would smell a lot like GoogleSQL. So if you're comfortable running on whatever one-off system based on ZetaSQL, you could easily transition your application over to BigQuery, Dremel, or Spanner as well. As far as I can tell, this thing is basically dead. There was some update like a month ago on GitHub, but it says at the bottom it's not officially supported by Google, and there are a bunch of pull requests and issues that aren't being responded to or answered, right? And there's only one data system I know of that actually supports ZetaSQL: Apache Beam, a stream processing system. Nobody else is actually using it. And to me this is interesting, because Google was, and still is, huge, right? Google is still very influential in the tech community, but if they put out a SQL dialect and say this is what the standard should be, and no one follows it, that shows you that the SQL marketplace is so diverse and fractured that no one company, even a major tech giant, can bend the world to its whims. The last time this was actually done was IBM, right? IBM came out in '82, '83 and said, okay, we're putting out a new database system and it's going to be based on SQL, SQL's the standard. And everyone got in line and followed along, and SQL became what it is today. But now things are so diverse that I don't think you could ever do that again. I mean, there's the ISO SQL standard, but as I said, nobody actually follows that. The closest you can get to a dialect that everyone is based on is Postgres, because everyone takes the Postgres parser or the grammar file and uses it. DuckDB did this, a bunch of other systems did this. Google, I'm not saying they tried and failed, but no one's going to use ZetaSQL. Okay, so again, since the 2011 paper, as I mentioned, there are a bunch of systems that have come out that in some cases are wholesale copies, they even claim to be copies of the architecture, while others would more likely say they're inspired by it. So I want to go through these four here. What's also interesting is that in the last three or four years there are now separate shuffle-as-a-service components, or architectures, or systems that you can use, which maybe don't exactly replicate all the capabilities of Dremel's in-memory shuffle service, certainly not the hardware acceleration, but again, there are now separate projects that do nothing but shuffles, which I think is kind of cool. And so we'll talk about the Celeborn one from Alibaba, because that one's the farthest along, more than Uniffle and the Uber one. I mean, I'm sure they're still using those.
Uniffle is still an Apache Incubator project, so it's early. This is the big one. So again, we're going to go through each of these systems, I'll cover this one, and then we'll finish up and go out for the eclipse. All right, Apache Drill: again, this one claimed to be a straight-up copy. Like, a Dremel is a drill, so, Apache Drill, right? No imagination there. This started right after the Dremel paper came out, as a way to build a query engine on top of HDFS. And it started at a tech company called MapR. In the late 2000s and early 2010s there were three major Hadoop companies, MapReduce companies: Cloudera, Hortonworks, and MapR. Cloudera and Hortonworks were based on the open source version of Hadoop, the Java one. MapR had their own proprietary C++ version that was meant to be faster. And so MapR started building Apache Drill. But this one was actually in Java. So what's interesting about it is that they do code-gen query compilation using this thing called Janino, which is basically an embedded Java compiler where you can give it Java code and it compiles it in-process. So this project is not dead, but the number of commits and the engagement and usage of it have certainly gone down. MapR was on the market a couple of times, finally got acquired for not much by HPE, and HPE announced in 2020 that they're basically stopping all development on this. At least HPE is not paying developers to work on it, but other people still are. So I would say there are better alternatives now, especially in the open source world, but this was the first one that came out directly after the Dremel paper. Did it do the in-memory shuffle thing? I think this one did the in-memory shuffle, yes. Again, but not the hardware. The next one is PrestoDB. This was started at Facebook. I wouldn't say this was directly inspired by Dremel, because I think they were already working on it when the Dremel paper came out, but Facebook was building it to replace Hive, which was a way to do SQL on top of MapReduce. It would take your SQL query and literally convert it into MapReduce Java jobs and then run those. And obviously that would be super slow because MapReduce was slow, but the motivation was the same: they have a bunch of files, sort of a data lake, in this case on HDFS, or I think Facebook has their own internal distributed file system. And they had a way to do a bunch of connectors to different storage systems and data systems, similar to Dremel. A few years ago, Facebook announced that they're getting off the Java-based runtime engine and switching everything to Velox. The Velox paper talks about this project called Prestissimo; this is one of the targets they were building Velox for, to replace the Java engine with the C++ execution engine in Velox. There's also another version of Presto called Trino, previously called PrestoSQL. So the first version of the project was called Presto, then it was called PrestoDB, and then there was a fork called PrestoSQL, which then got renamed to Trino. And this was done by the Starburst guys, who came out of Teradata. There was a project called HadoopDB that became a company called Hadapt; Hadapt got acquired by Teradata.
The next one is PrestoDB. This started at Facebook. I wouldn't say it was directly inspired by Dremel, because I think they were already working on it when the Dremel paper came out, but Facebook was building it to replace Hive, which was a way to do SQL on top of MapReduce: it would take your SQL query and literally convert it into MapReduce Java jobs and run those. Obviously that was super slow, because MapReduce was slow. But the motivation was the same: they have a bunch of files, sort of a data lake, in this case on HDFS, or I think Facebook's own internal distributed file system, and they had a bunch of connectors to different storage systems and data systems, similar to Dremel. A few years ago Facebook announced that they're getting off the Java-based runtime engine and switching everything to Velox. The Velox paper talks about this project called Prestissimo; that was one of the targets they were building Velox for, replacing the Java engine with the engine in Velox.

There's also another version of Presto called Trino, previously called PrestoSQL. So the first version of the project was called Presto, then it was called PrestoDB, then there was a fork called PrestoSQL, and then that got renamed to Trino. This was done by the Starburst guys, who came out of Teradata. There's a fun lineage there: there was a research project called HadoopDB that became a company called Hadapt, Hadapt got acquired by Teradata, and then Teradata spun out Starburst. Anyway, they didn't like how Facebook wouldn't give up control of the source code. Hive came out of Facebook and that's an Apache project; for whatever reason, Presto was not an Apache project and Facebook wasn't giving up control. So these guys forked it and renamed it Trino. And I think Trino went to its own foundation, and Facebook eventually gave up source-code control and donated PrestoDB to the Linux Foundation. So not Apache, but similar kinds of foundations. What's interesting here is that with PrestoDB, Facebook is again trying to get rid of the Java stuff in favor of Velox, while the Trino guys are very explicit that they don't want to give up Java. They had a blog article, or a podcast a year or two ago, where they say explicitly that they'd rather spend their time making the query optimizer better than spend all that engineering effort replacing the execution engine with something like Velox or even DataFusion, all right?

Is Hive the query engine, or is that the...? Hive is the query engine, in the same way that Dremel is the query engine, Presto is the query engine, Trino is the query engine. So how are Presto and Hive connected? The question is, how are Presto and Hive connected? Facebook first built Hive because they had all this MapReduce infrastructure, MapReduce is slow, and people were writing Java code to run queries instead of SQL. So they built Hive, which is a front-end query engine that takes your SQL query, converts it to a MapReduce job, and runs that. That's slow, because MapReduce is slow and the Hadoop model is slow. So then they said, okay, let's get rid of that: keep HDFS, the shared file system, and build a query engine that takes SQL and runs the actual query plans directly. That's Presto, similar to Dremel. And does that one also have the in-memory shuffle? Actually, for Presto, I don't know. I should look at that. Good question.
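If you've never touched Presto or Trino, the thing to internalize is that it's just a SQL query engine with pluggable connectors underneath; it has no storage of its own. Here's a minimal sketch of talking to it over JDBC, assuming a local Trino server with the built-in TPC-H connector mounted as a catalog named tpch (that part is setup-dependent):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TrinoSketch {
  public static void main(String[] args) throws Exception {
    // URL format: jdbc:trino://host:port/catalog/schema.
    // "tpch.tiny" is the TPC-H connector's built-in sample schema.
    String url = "jdbc:trino://localhost:8080/tpch/tiny";
    try (Connection conn = DriverManager.getConnection(url, "demo", null);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT nationkey, name FROM nation ORDER BY nationkey LIMIT 5")) {
      while (rs.next()) {
        System.out.println(rs.getLong("nationkey") + " " + rs.getString("name"));
      }
    }
  }
}
```

From the client's perspective it's just another JDBC database; the interesting part is that `nation` here isn't a table Trino stores, it's whatever the connector behind the `tpch` catalog serves up.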
All right, another project that was definitely inspired by Dremel is Impala, which came out of Cloudera. This was founded by people Cloudera hired from Google, who didn't work on Dremel but used it and were inspired by it. The key thing Impala did, and I think it still works this way, is that rather than have the workers pull data from shared storage and do all the processing on the worker nodes, they wanted to do more predicate pushdown than you could do on S3 or GFS at the time. So on the shared file system, you'd actually install a little execution engine down there. I think this part runs in the JVM, so the system could do predicate pushdown and other pushdowns and run that part of the query directly where the data was stored. This was HDFS at the time, so on your HDFS node you'd also install the Impala executor, which would process the data locally before sending it back. That's not true disaggregated storage the way we've been talking about it all semester, but they did it because they wanted to be able to do the predicate pushdown. I think they also did query compilation, and actually, let me take that back: that part was not Java, it was C++. I think they were doing compilation of the predicates, the WHERE clauses, and running those down there, along with things like CSV parsing. We'll see more about Impala next class when we talk about Databricks, right? Because Cloudera was the big MapReduce company and they were pushing Impala very heavily, but then everyone started asking for Spark, so they had to start supporting Spark too. But then the Spark guys said, hey, let's add SQL, and Cloudera didn't like that, because they wanted people to buy Impala. The Spark guys did this one trick, we'll see next class, to get SQL into Spark: basically they embedded it instead of having it be middleware. And Databricks basically destroyed Cloudera. We'll discuss this next class.

All right, Dremio is probably, of all the open source ones we've talked about, the one most directly inspired by Dremel. It's backed by a VC-funded company, actually founded by a CMU alum; I think he did his master's here, but he wasn't my student as far as I know. As I said before, they're doing all the things we talked about, very similar to Dremel, but one of the things they do to speed things up is precompute these things they call reflections, and as far as we can tell, they're just materialized views, right? They're doing Java-based code-gen, I think for the entire query, not just the WHERE clauses, plus vectorization as we talked about before.

All right, and then the last one is Apache Celeborn. This, again, is just shuffle as a service; it came out of Alibaba. The idea is that in Spark and Flink you can actually specify what shuffle service you want. There's a default built-in one where the worker nodes send data directly to other worker nodes, but you can use this as a standalone service as an intermediary, and it does all the things we've talked about so far: it can spill to disk when it runs out of memory, it can do block compression on the data when it writes it to disk, and so forth. And again, it's basically a fault-tolerant key-value store; I think this one uses Raft internally. There's another one, Uniffle, that's based on ZooKeeper. It's just a key-value store, but it's only meant for moving data back and forth between the different stages of queries.
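To give you a feel for how pluggable this is, here's roughly what pointing Spark at Celeborn looks like. The property names are from the Celeborn docs as I remember them, so double-check against whatever version you deploy; the point is just that swapping the shuffle path is a config change, not a code change.

```java
import org.apache.spark.SparkConf;

public class CelebornConfSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("celeborn-shuffle-demo")
        // Route shuffle writes/reads through Celeborn's client instead of
        // Spark's built-in worker-to-worker shuffle.
        .set("spark.shuffle.manager",
             "org.apache.spark.shuffle.celeborn.SparkShuffleManager")
        // Where the Celeborn master lives; the client registers shuffles here.
        .set("spark.celeborn.master.endpoints", "celeborn-master:9097")
        // The node-local external shuffle service is redundant now.
        .set("spark.shuffle.service.enabled", "false");
    // Build your SparkSession/SparkContext from `conf` as usual; stage
    // outputs now flow through the Celeborn cluster between stages.
  }
}
```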
All right, so to finish up: Dremel is very, very influential. As I said, the combination of Vectorwise-style single-node query processing plus Dremel for the overall architecture, not everyone does the shuffle piece, as we'll see as we go along, gives you what we would call a modern lakehouse. And although the shuffle stuff seems wasteful, it actually makes things better, because I can free up worker memory as much as possible and I can disconnect the workers at one stage from the workers at the next, right? There are a bunch of advantages to this, not just performance but also engineering, because it simplifies the implementation of all the workers. And this is another good example, like the projects you guys are working on, of decoupling the system architecture: one group spends all its time optimizing this one piece, and then other parts of the system can take advantage of it. I think that's the right way to build a modern cloud-native system today, okay? All right, again, next class we'll talk about Spark SQL and Photon. This is going to be different from the Dremel paper, because the Dremel paper covers an entire system. Photon, you're going to see, is going to look like Velox, right? It's something you embed inside the JVM for the Spark runtime rather than being some standalone system, okay? All right, let's stop now and go check out the eclipse.