So going forward for the rest of the semester, as I said last class, we're going to start discussing real systems. We're no longer going to read papers that say, here's how to do this one technique, or that focus on one aspect of a database system. The goal from this point forward is to read industry papers and understand how they apply the techniques and methods we've been talking about so far this semester to these real-world systems. So there are roughly three goals. The first, as I said, is to see how to take the things we talked about, the individual components and concepts, and put them together into a full system. The second is that you'll appreciate, as you read things, either in an academic paper like the Dremel one you guys read or in marketing literature from companies, that they're going to describe things slightly differently than the way we've talked about them, but you'll see how it basically comes down to the same thing. By understanding the fundamentals and principles of how people build modern database systems, if some marketing guy comes along and says, we have this groundbreaking new technique, yada yada yada, and it's just SIMD, you'll be able to cut past all the bulls*** and figure out what they're actually doing. The other thing you'll get from this is that these papers will say, we did it this way because of reason XYZ, for these different people building full systems. So you build up a catalog in your mind of all these different scenarios where someone tackled a problem a certain way. And if we're lucky, the paper will discuss, we tried this, it didn't work, but then we tried that and it did work. So when you go out into the real world and you have a database problem to deal with, either as a user of a database system or as a developer of one, you can recall the things we've talked about this semester: OK, Google had this problem and they solved it this way; Amazon had this problem and they solved it another way. And of course it's always fun to make sure that I'm not making things up, so I get something out of this too. All right, so here's the agenda for what we're going to read. Obviously today is about BigQuery and Dremel. Next class will be on Spark SQL; the paper you'll read, on their engine called Photon, came out in SIGMOD last year. Then we'll discuss Snowflake. We'll have a guest lecture from Mark at DuckDB next week. Then we'll discuss Velox at Facebook. And then we'll finish up with a guest lecture from Ippokratis on Redshift. OK, so are these the only OLAP systems out there? Obviously not, as you've seen with dbdb.io. Are these the best ones, and is that why I'm picking them? No. A lot of it has to do with which systems have, I think, good papers that discuss what they've done. And obviously the bigger companies are in a position, with money, to have people who help write papers, so they do the best job, at least in a research venue like SIGMOD or VLDB or CIDR, of disseminating their methods and ideas. One side thing I'll also say is that when you read these papers — well, the Dremel paper you guys read is the ten-year retrospective, so it's looking back in 2020 on what they did around 2010.
Oftentimes, whatever you read about in a paper, if it's something like "here's Google's new system," it'll be about five years behind, because of the time it takes to actually build a system, flesh it out, get it deployed, and write all the patents — even if they're not going to publicly enforce those patents, they want to protect themselves from other people. By the time you read the paper, it's about five years old. In the case of the Dremel paper, it's a retrospective, so it was pretty current. For Snowflake, you'll see a little bit of this. Photon isn't open source, but it's pretty straightforward. But at least in retrospect, their papers are always looking back a few years. And for DuckDB, there really isn't a canonical DuckDB paper, and that's why Mark is going to talk about it. So as we go through these systems, there are going to be some recurring themes that I want you to pick up on. The obvious one is this notion of disaggregation, or separating the compute from the storage. In the case of the Dremel paper, they're separating the compute from the storage, and in one case, for one particular operation, they're going to separate the memory as well and treat that as a service. But we'll see this recurring theme of: we're running in the cloud, we have these object stores, we want the compute nodes to be stateless, and the final resting place of the database is going to be on some object store, a shared disk. We covered this at the very beginning of the semester: this is how people build modern OLAP systems today, which is much different than the conventional wisdom of using a shared-nothing architecture before this. Another big thing that's going to show up is the lack of statistics about your data, meaning you're not going to be able to run ANALYZE, collect summarizations or histograms, and then have a cost model predict the expected selectivity of some filter operation. You'll see that the Dremel paper, which I think does the best job of any system here, is very aggressive about supporting adaptive query optimization — adaptive changes based on what the workers see in the data as they process it — and tries to do late binding, late decisions in query execution, as much as possible. Snowflake does a little bit of this, Photon does a little bit of this, and DuckDB is running on embedded devices, but even then it may not have run ANALYZE on the data. We'll see a little of this in Redshift as well. Of course, everything is also going to be a column store. But in the case of the Dremel paper, and we'll cover this a little bit, they also need to support things that don't look relational. That means nested values, repeated values, hierarchical structures like JSON and XML. In real-world data sets these formats are very common. We can't assume we're going to have nice, beautiful binary columns in a PAX format like we've been talking about so far this semester, so you've got to be able to account for that. And the last one is vectorized execution. Pretty much all of the systems we're going to cover are doing vectorized execution — Velox is not a full system, and with the exception of Redshift, none of these are doing query compilation; they're all doing vectorized execution. And in the Spark paper next class, you'll see why they say:
we used to do codegen, but that was a huge pain, so now we only do vectorized execution. Again, these are the themes we're going to see over and over again. It's just a matter of seeing how the techniques we discussed throughout the semester all apply in some shape or form. OK? I'd also say that in the experimental sections you're not going to see very interesting results, because these are industry papers. They're not going to say, hey, Visa is a big customer, here are their queries — they can't share that. And they're also not going to share absolute numbers, because they don't want competitors using the research papers they put out in marketing against them. So you'll see normalized values. I don't think this paper even had any numbers. So don't expect any real deep analysis in the experimental section. It really is a discussion of what they've done, what their contribution is, and if you're lucky, the scenario or what led them to make certain decisions over others. The bar at a premier venue, like SIGMOD or VLDB or really any other area, is typically lower for industry papers, because they want to encourage industry people to talk about what they've done instead of just keeping it hidden. Now, I'll say this — I'll bleep this out — I was an associate editor for VLDB one year, and a paper came in from a company that had never submitted a paper before. It was OK, but it wasn't as good as it should have been, and it got accepted because they wanted to encourage them to submit more. So again, the bar is lower for industry papers. That's not to say you can just put a poop emoji on something and submit it; that's not going to work, it has to be real. But it's not as rigorous as a regular research paper. OK, so let's talk about today's paper. The first thing I want to do is make an observation to put this in context, so everyone understands the significance of the Dremel paper. You've got to go back to around 2010, when the original Dremel paper came out. This was around the height of Google's influence, Google's impact, on the database industry. For roughly a ten-year period in the 2000s and early 2010s, Google really was at the vanguard of building modern database systems. Oracle still existed, Teradata still existed, there were a bunch of other systems still around, Postgres was gaining popularity, MySQL was widely used. But at the cutting edge of large-scale, cloud-based database systems, Google was at the forefront. And so what would happen is that any time Google released a paper — not always in a database conference, sometimes in the systems conferences, because people like Jeff Dean are more distributed-systems, OS kinds of people, not database people; I'm not saying he picked where they submitted papers, but they would submit things to SOSP and other venues — other people at other tech companies would take those papers and write open-source clones of them. The mindset was: Google is so successful and so big, whatever they're doing is probably the right way to do it, so we should do the same thing — basically copying whatever they did. In some cases I think this was warranted; for other things, maybe less so.
But now, when you think of who's at the cutting edge for database systems, you probably think of Amazon, because they're the biggest vendor, and maybe some of the smaller startups. I still think Google is at the forefront of a lot of things today, but there have been a lot of papers coming out of Google in the last couple of years where nobody has said, oh my god, this is amazing, I'm going to make my own version of it. Part of that is because there's enough tooling out there now that you don't need to build everything from scratch. But back in the 2000s, everything Google did, there was always an open-source clone of it. So this is a quick, incomplete list — these are the ones I could think of off the top of my head and fit on a slide — of the systems that Google has released papers about publicly that I think had a lot of impact. One thing to point out is that the 2000s were when the NoSQL stuff took off, because Google was pushing this idea that SQL is bad, it doesn't scale, we don't want to do this. So the systems they were building internally and then publicly talking about weren't using SQL. Chubby was a lock service, so you wouldn't necessarily need a relational database for that, but MapReduce and Bigtable were very influential. And then in the next decade, the 2010s, here are all the systems they put out that were using SQL. If you picked up on it in the Dremel paper, there's a sentence that talks about how, in the beginning, Google said SQL doesn't scale, and that's why they built all these NoSQL systems. But Dremel was one of the first ones that brought SQL back into Google, and then people realized, oh yeah, this is actually a good idea — they figured out the stuff we figured out in the '70s — and they started building SQL-based systems. Another one I'm not including here is the file system. Technically a file system is a database system, but we can ignore that. GFS and Colossus inspired HDFS, Ceph, Gluster — so these are open-source implementations based on things that were developed at Google. Probably the most famous ones are Hadoop and Spark, off of MapReduce. And ZooKeeper, a clone of Chubby, was a big deal for doing distributed state management. But the one we're going to focus on here is obviously Dremel, and the open-source clones — the systems that are heavily inspired by it — are Drill, Impala, and Dremio. We'll cover these in a second. Actually, for this list of all the Google systems, does anybody know which one is actually open source? No — he says MapReduce — no, MapReduce was never open sourced; Yahoo implemented Hadoop as a clone of it. It's Vitess. Who here has ever heard of this? It's a sharding middleware for MySQL. This was built by YouTube, then YouTube open sourced it, and then it got commercialized as PlanetScale. I think — and this is purely conjecture on my part — because YouTube was making so much money on the side, they were kind of autonomous, and the big Google lawyers didn't come in and say, don't open source that, so they were able to get away with it. So of all of these, the only one that's actually public is Vitess. And of course, a lot of this stuff would require major rewrites to not use internal Google services.
Spanner, for example, can't run in the outside world because it relies on TrueTime, their atomic clock stuff. I'm not saying Google should have stopped what they're doing and open sourced everything; I understand why they wouldn't. And Amazon is certainly not any better — as far as I know, they have zero open-source database systems, and at least Google open sourced one. Yes — we'll cover this a little bit at the end. The cloud basically changes this: people talk about how open source is important, but in the end, if you're selling on the cloud, it matters less. This is my opinion. For some things, like a key-value store such as RocksDB, something embedded, I think it's very hard to sell something closed source and proprietary. But for a large service like Dremel, or even Redshift, it doesn't matter as much. One thing that is interesting: Google announced, was it last week, AlloyDB — that's basically Google's version of Aurora; it's based on Postgres. They actually have a version that you can run on prem from a Docker file. The idea is that you can do development locally and then deploy it in the cloud. That part is interesting; I don't think anybody else has done that. Okay. Another one I'm not including here is that they had a product called App Engine — I don't know if it's still around; it was sort of like Heroku before Heroku — and inside that they had, I think, a JSON database, which I think was the inspiration for MongoDB too. So again, Google's influence on databases was massive and still is today. Napa is a really interesting system; we won't have time to cover it, but I don't see anybody building an open-source clone of that anytime soon. Okay, so Dremel. This was originally developed in 2006 as a side project — they mention it was a 20% project of an engineer there. The problem they were trying to solve is that they wanted to do quick analysis on data files or artifacts that were generated by other tools or their batch jobs, in particular MapReduce. So you'd run your MapReduce job to generate some data set, it would dump out a bunch of files on the Google file system, and then you wanted to be able to run SQL queries directly on them to do some quick analysis and extract some information. That was the original goal. They used the term "interactive" to mean you want to run the queries directly on the data and not have to ingest it and import it into an existing database system, define a schema, or do any manipulation or transformation of the data. You just want to run the queries directly on the data where it exists — and this is what they mean by in situ data files. The original version did not support joins, and the original version was actually a shared-nothing system. In the paper you guys read, they talk about how this became problematic: as more and more people internally at Google started using the system, because it was shared-nothing, with the compute tied to the disk, it became hard to scale out, because you had to provision resources. So they rewrote it in the late 2000s to be a shared-disk architecture built on top of the Google file system.
And then the paper you guys read came out later — the original Dremel paper came out in 2010, and then they exposed it as a commercial product called BigQuery outside of Google in 2012, available to the outside world. There was this little footnote in the paper which says Dremel is a brand of power tools that primarily rely on their speed as opposed to torque. Who here actually knows what a Dremel tool is? It's a little rotary tool — not a drill you use to drill a hole, but a little grinder you use for woodworking and other things, to cut things off. I imagine this is a lawyer's nightmare: hey, we have this important product inside our important service inside our multi-billion-dollar company, and we named it after another company's product. That can't be a good idea. I'm surprised they didn't kill the name entirely. If you Google "Dremel database," you don't get the tool, you get the system. But this is above my pay grade. Anyway, internally it's called Dremel, and for most of this talk I'll refer to it as Dremel, but publicly it's BigQuery. BigQuery has other stuff around it — Dremel is the engine, and then there's the infrastructure, the pretty interface, and all that, which is part of BigQuery. But the core engine is what we care about, and that's still Dremel. The reason why I had you guys read the 2020 paper instead of the original paper is because, and we'll see this in a second, one of the big concepts that came out of the follow-up paper is this notion of the in-memory shuffle, whereas in the original paper they hadn't built that part yet. If you had read the original one, I'd be talking about a paper you guys didn't read, saying, oh, they tried this in the past in this other paper. It's the shuffle piece that's the important thing, and it's really only BigQuery and Spark that do this shuffle operation. We'll go into more detail — it is a unique aspect of what they're doing. And remember, I said at the beginning of the semester that we were going to talk about how to do query processing on a single node, get that part right first, understand how that works, and then when we go to a distributed environment, we'd see how to stitch it all together. The shuffle piece is how we're going to do that. Now, that's not to say other systems don't do shuffles, but they do shuffles in specific cases — like if you're doing a shuffle join for a distributed join, you do a shuffle operation, a repartitioning step, just to do that join. They're not doing it the way BigQuery or Dremel does, at every stage of the query. That part is unique. So the first thing I actually want to do is quickly go over what this in situ data processing stuff is. I think we talked about this at the beginning of the semester, but this is the idea that the database system is no longer the center of the universe and doesn't have complete control of all the data files at your organization or company. Think of a traditional data warehouse: I have some giant, very expensive machine that I paid $100,000 for, and all the data in my company has to go into that database, into that machine. That means I have to define the schemas, I have to do cleanup, and then I bulk import it into the database system, and that's the data's final resting place.
But in a modern scenario, or a modern organization, especially if you're running in the cloud, people don't want to do that. People aren't going to do that. It's too expensive, and it doesn't scale. Instead, you have all these different units within your organization generating their own data files, and rather than provisioning a slice of storage space on this giant monolithic data warehouse, you just write it to S3 as a Parquet file or an ORC file or whatever you want, and then you want a query engine or database system to be able to execute queries directly on that data without having to import it — again, using Dremel's use case as the motivation for this. So this is typically what people mean when they say they have a data lake: it's an object store with a bunch of random files. If you're lucky, you have a catalog that says here's what the files are and here's their schema, but that's not always the case. And then "lakehouse" is a marketing term you'll see, I think in the Spark paper; it just means the database system that sits on top of the data lake, the thing you run the queries with. So technically Dremel is a lakehouse system. Again, the goal is to minimize the amount of prep time a user needs before they can start analyzing the data. And in the Dremel paper, they mention that their users are willing to sacrifice a little bit of query performance in exchange for not having to do all this prep work ahead of time. I'm willing to run a little slower because the data is not formatted in exactly the best way for the system ahead of time; the system may have to do some discovery to figure out what's actually in the files when it starts scanning, but that's OK, because then you don't pay the penalty of spending human capital cleaning up the data. So here's a quick overview of the key features of Dremel, and again, these are the things we've talked about throughout the entire semester. We already mentioned that it's built on disaggregated, shared storage: they rely on an object store to hold the files. In some cases the storage is managed, meaning you let Dremel define the encoding scheme for the data; in other cases it could be a bunch of JSON or CSV files, and that's fine. They do vectorized execution, or vectorized query processing. There's actually nothing to discuss about this because the paper doesn't really talk about it; they say they're vectorized, and we know they're using intrinsics because we asked the people. This is table stakes at this point — everyone does this. The shuffle-based stuff we'll talk about in a second. The encoding scheme is entirely column-store based. They use zone maps and filters to try to prune things out before you actually have to start reading the data. They do dictionary and RLE compression; there was a little bit in the paper about how they figure out the optimal encoding scheme, and they sort the data within a partition so you get the best benefit from RLE — we won't talk about that too much. And then — this is not in the paper, but in the commercial version of BigQuery — the only indexes they support are inverted indexes, for LIKE queries and string lookups. So you can't build a B+tree on any of the data.
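To make the zone-map pruning just mentioned concrete, here's a minimal sketch in Python. This is an illustration of the general technique, not BigQuery or Capacitor internals; the block layout and names are made up.

```python
# Minimal zone-map sketch: per-block min/max metadata lets a scan skip whole
# blocks before reading or decompressing their values. Illustrative only.

from dataclasses import dataclass

@dataclass
class Block:
    values: list   # column values stored in this block
    min_val: int   # zone-map metadata kept alongside the block
    max_val: int

def make_block(values):
    return Block(values=values, min_val=min(values), max_val=max(values))

def scan_with_zone_maps(blocks, lo, hi):
    """Return all values in [lo, hi], skipping blocks whose zone map
    proves they cannot contain a match."""
    out = []
    for block in blocks:
        if block.max_val < lo or block.min_val > hi:
            continue                                 # pruned: data never touched
        out.extend(v for v in block.values if lo <= v <= hi)
    return out

if __name__ == "__main__":
    col = [make_block([1, 3, 7]), make_block([90, 95, 99]), make_block([40, 42, 45])]
    print(scan_with_zone_maps(col, 40, 50))          # only the third block is read
```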
If you use their data encoding — and Parquet and ORC have these things as well — the system will maintain zone maps, Bloom filters, and dictionaries to keep track of what data might be within a column or within a block, but there won't be a global B+tree index you can use for lookups. They only support hash joins, so they don't do any sort-merge joins. Again, there's nothing to discuss here because it's not public, and as far as I know they're doing the non-partitioned version of this within a single node; obviously, once you start shuffling things around, it's partitioned, but that's sort of a multi-level thing. Now, the last one: they use a stratified approach to query optimization. There's an optimizer with some very light cost-based decisions for corner cases, but then they try to get most of the benefit by doing things at runtime, by adaptively changing the query plan on the fly based on the data they're seeing. I'm going to spend most of my time on the shuffle piece and this last one, because these parts are unique and very interesting in Dremel. So the way it executes queries is just as we talked about before: some SQL query shows up, you parse it, run it through the binder, bind the names to identifiers, figure out where the file locations are because you're running on a shared-disk architecture, and then it slices the logical query plan up into stages, which as far as I can tell are roughly equivalent to the pipelines we talked about before. Within each stage, there are multiple parallel tasks. One key thing about the tasks, though, is that they want the work each task does to be deterministic and repeatable — this is important. Meaning if the task runs and produces some output, and then I run it again, I should get the exact same result and can just overwrite whatever the previous result was, and it shouldn't have any side effects or other problems. If you're running a SELECT query, a read-only query, this is easy to do because you're not modifying tables. But the determinism part also matters because if I run the task part of the way and then, for whatever reason, it gets killed or I have to shut it down and restart the task from scratch, I want to get the same result. They don't talk about this, but I have some experience doing this in other systems. There are things like: if you have a random number generator in your query, if you call the RANDOM function, you need to make sure you use the right seed at the beginning of the query so that no matter where you run it, you produce the same result. There are things like that, and similar issues with time. So the root node in the execution plan gets designated as the coordinator, and before you start running any of the query, the coordinator goes and gets all of the file locations for the data you want to read, as one giant batch request to the file server. The paper talks about how, before they did this batch approach, all the workers at the leaves of the query plan would make individual requests and the file server would get overwhelmed. So you do a batch request at the beginning, get everything you need, and store that inside the query plan, so when the workers start running, they don't have to do any lookups; they know exactly where they need to go ahead of time.
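As an aside on that determinism point, here's a hypothetical sketch of the seeding trick. The paper doesn't describe how Dremel does this; the identifiers and the sampling task here are invented just to show why a stable, task-derived seed makes a re-run overwrite-safe.

```python
# Hypothetical sketch: seed RANDOM() from (query_id, task_id) so that any
# re-execution of the same task -- after a failure or a straggler kill --
# produces byte-identical output that can simply overwrite the earlier attempt.
# All names here are made up for illustration.

import hashlib
import random

def task_rng(query_id: str, task_id: int) -> random.Random:
    # Stable seed derived from identifiers known before the task runs,
    # not from wall-clock time or the worker the task lands on.
    digest = hashlib.sha256(f"{query_id}/{task_id}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

def run_task(query_id: str, task_id: int, rows):
    rng = task_rng(query_id, task_id)
    # e.g. a 10% Bernoulli sample inside the task: same rows on every re-run
    return [r for r in rows if rng.random() < 0.10]

if __name__ == "__main__":
    rows = list(range(1000))
    a = run_task("Q42", 7, rows)
    b = run_task("Q42", 7, rows)   # simulated restart of the same task
    assert a == b                  # identical result, safe to overwrite
    print(len(a), "rows sampled, deterministic across retries")
```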
So let's look at a really simple query plan like this: I do some lookup to find all the articles with my last name in them — I actually don't know how many there are. The data at rest sits on the distributed file system; they're using Colossus for this, which is an internal thing at Google, but think of it like Ceph or S3 on the outside. You have the coordinator node, which is responsible for scheduling and firing up all the workers we want for the first stage — say we're doing the partial group-by. These workers are responsible for retrieving the data from the file system and doing whatever computation they need on it, and then the output is written to this shuffle service. The goal is to store everything in memory so we're fast. The next stage then reads data from the shuffle rather than reading from the workers themselves. So they all write to this thing — I'm just showing it as an amorphous blob; it could be individual nodes, it could be one single node, for our purposes right now it doesn't matter. The shuffle can also provide information to the coordinator about what data it was given — whether we're underutilized or overutilized, whether we have skew, things like that. And the coordinator can then decide, OK, for the second stage, here's the number of workers I'm going to need, here's the data they're all going to process. The workers then know how to fetch that data from the shuffle service instead of going to the distributed file system and instead of going directly to the workers in the previous stage. The paper doesn't talk about this, but that's why I was saying it's not exactly a true pipeline breaker when you have these shuffle stages: there are some cases where you could have the next stage's workers start speculatively executing and retrieving data that these guys are generating before they even finish. So this model here — it wasn't invented by MapReduce or Hadoop; distributed databases, as I was saying, were doing this back in the late '80s, early '90s — but the idea that you're going to do this between every stage is something that is somewhat unique to MapReduce, and the Dremel guys obviously took inspiration from it. But the reason why MapReduce sucked for this kind of stuff was that instead of writing into an in-memory key-value store, which is essentially what this is, it would write the data back to the distributed file system. And by default in Hadoop, HDFS would make three copies of every write. So as I ran my query, a worker produced output, wrote it back to HDFS, and it made three copies. Yes, question — the statement is, for the shuffle phase, didn't MapReduce write to local disk, and only write to HDFS after an iteration? If you ran a chain of jobs, it would write to HDFS between them — yeah, I'd have to double-check; this was like 2008, 2009, and they might have fixed that since, I forget. But the point is still the same: you were hitting the disk, and that sucked, right? So anyway, as we'll see in a second, they have dedicated hardware just to do this, which can spill to the distributed file system or disk if they run out of space, but because it's in memory, it's fast. Actually, I don't have to bleep this out either.
They actually have custom hardware on these dedicated shuffle nodes, just to do the fast hashing and partitioning. So this is a good example where Google feels something is super important, so they'll actually fab hardware for it, similar to the way they fab TPUs, to handle this particular step. And that's unique to them. All right, so then in the next stage, these workers produce output, they write everything to the shuffle store, and then we fire up the last stage to do the sort and the limit, and it produces the result in the distributed file system. The coordinator can then pass the result along to the client, depending on how it was accessed. One thing they also talk about is that these workers are running all the time. Again, I'm not just trying to compare a modern OLAP system against MapReduce or Hadoop from over a decade ago, but in that environment, for every map job you would spin up a JVM, do whatever task you were given on the data, and then go away. That spin-up time was not negligible, and it could actually eat into the total time of the query. So these workers are always running, and they're not dedicated to you as a customer — meaning these workers are always running in the giant cluster for BigQuery or Dremel, and they can be used for any particular customer. And they have the ability to do custom thread scheduling on the actual node itself, where if they know a customer has paid for dedicated, provisioned resources, and the node is running a job from some random user, that job can get preempted, put aside, so they can give higher priority to the customer that's actually paying for dedicated or expected performance. They can do that because they control the whole stack. We'll see something similar in Snowflake next week. But again, the idea is that you're not provisioning any resources; this is one giant cluster. And I've been told that, because they're running on Borg, their giant container farm, they sometimes have problems because the queries land on boxes that are also doing YouTube encoding, which is CPU-intensive, and they have to make sure they preempt that and take over the box. That's actually outside the worker; that's on the box itself. So, OK. So there's this notion of scheduling across the giant cluster, but then there's also scheduling within the worker itself. Within one worker, it's multi-threaded; it could be doing a bunch of stuff for a given task — it's not like one thread does all the work — so they'll do scheduling inside the worker as well. Yes? So the question is, if Borg is running a bunch of other things, isn't that too general — do all Google database services run on top of Borg? I don't know. It wouldn't surprise me. For the container coordinator? Yeah, I don't know. Borg is basically Kubernetes before Kubernetes. So it's not like Kubernetes or Borg is saying, OK, this query is going to go here and this query is going to go there.
In terms of assigning tasks — yeah, there's a scheduler in the coordinator that does that. But Kubernetes or Borg is just deciding, OK, I have a worker running on this box, and then something else tells that worker what to do, and that's the database system. So maybe another way to ask your question is: is something like Borg or Kubernetes impeding the performance of the database system in the same way that the Linux operating system can impede us on a single box? That I don't know. For individual queries, you can imagine there are probably always better scheduling decisions you could make; I just don't know what they would be. Yes — I'm guessing most of the services are running on Borg. I don't see why you would want to run bare metal for anything. Even if these things are running on custom hardware, there's no reason they can't be managed by Borg; the pod is just dedicated to the service you're trying to run. The shuffle. So as I said, this is unique to BigQuery, in that they're one of the few data systems that does this. Again, they didn't invent it — it comes from other areas of distributed computing and distributed processing — but it's interesting how they apply it. And we'll see that it opens up opportunities, different optimizations, that would otherwise be difficult if we weren't doing the shuffle step. So the shuffle is basically a producer/consumer model, where it waits for the workers at one stage to disseminate and send out the results from their processing on to the next stage. The workers do some execution, they produce output, and they send it to the shuffle nodes. The shuffle nodes store this in in-memory hash partitions, where the hash key could be the group-by key or whatever it is in your query plan. Then, when the next stage gets fired up, those workers retrieve the data from the shuffle service and not from the individual workers. Again, everything is in memory, but it can spill to disk storage if it gets too big. I don't think the paper says what percentage of queries spill to disk, but I imagine it's low. Again, the shuffle paradigm goes back to the 1980s — at least, I know there were early distributed databases doing this — but it's mostly used for joins. All right, so the idea is this. Here's our worker in, say, our current stage, and the worker has a consumer API and a producer API. The consumer side retrieves data from whatever the previous stage was, or from the distributed file system if we're reading the original files. Then, as they execute, they produce output; you hash the data on some key that everyone agrees upon, and then you send the data to the different shuffle nodes. If it gets too big, they can spill to the distributed file system. Then in the next stage, those workers can start running once the previous stage is finished, or they can start retrieving data speculatively from the shuffle nodes and begin whatever processing they want.
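To make the producer/consumer shuffle concrete, here's a toy sketch. It mimics the idea — producers hash-partition their output into an in-memory store, and the next stage's consumers pull whole partitions by id, never talking to the producing workers — but it is not Dremel's actual interface; the class and function names are invented.

```python
# Toy sketch of a shuffle layer between two stages of a GROUP BY COUNT query.
# Stage 1 workers write hash-partitioned partial counts into the shuffle;
# stage 2 workers each own one partition and merge the partials.

from collections import defaultdict

class ShuffleService:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)   # partition id -> list of (key, value)

    def produce(self, key, value):
        # Producer API: route the row by an agreed-upon hash of the key.
        # (Python's hash() is stable within one process, enough for a demo;
        # a real system would use a stable hash function across machines.)
        self.partitions[hash(key) % self.num_partitions].append((key, value))

    def consume(self, partition_id):
        # Consumer API: the next stage reads a whole partition at once.
        return self.partitions[partition_id]

def stage1_worker(rows, shuffle):
    # Partial aggregation on this worker's slice of the input files.
    partial = defaultdict(int)
    for key in rows:
        partial[key] += 1
    for key, count in partial.items():
        shuffle.produce(key, count)

def stage2_worker(partition_id, shuffle):
    # Final aggregation: merge the partial counts for one partition.
    totals = defaultdict(int)
    for key, count in shuffle.consume(partition_id):
        totals[key] += count
    return dict(totals)

if __name__ == "__main__":
    shuffle = ShuffleService(num_partitions=2)
    stage1_worker(["a", "b", "a", "c"], shuffle)   # worker 1's slice
    stage1_worker(["b", "b", "c"], shuffle)        # worker 2's slice
    for pid in range(2):
        print(pid, stage2_worker(pid, shuffle))
```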
And so on: those workers produce output too — or they know they need to go read from the distributed file system — and their output then goes to the next set of shuffle servers, or becomes the final output for the query. So the shuffle phases are essentially checkpoints in the query's lifetime. Going back here: once all of the workers within the current stage are done and have produced their output, they can all be reassigned to work on other tasks. And now, in the second stage, if one of these nodes or workers goes down, I don't have to restart the entire query, because I have the ephemeral intermediate results stored in my shuffle servers. That is actually a big difference between traditional shared-nothing OLAP systems and a system like Dremel. This is actually one of the benefits the MapReduce guys claimed when they were promoting their system against traditional distributed OLAP systems: in a traditional system, if one node goes down, the entire query is killed, because they were not storing the intermediate results, for performance reasons. Yes? So your question about the distributed memory is: what is the API, and what is the overhead of, say, this worker here going to the shuffle to retrieve the data it needs — again, over TCP/IP, nothing fancy — versus getting it directly from the previous worker? It's going to be slower, but as they talk about in the paper, because you have this abstraction layer between the producer and the consumer, it makes the software engineering easier. There's less state you have to maintain on this side over here, because if I'm getting data back from the workers directly, I've got to know who those workers are. In this case, I don't have to know that; I just know here's some identifier I use at the shuffle servers to tell me what data I need from my previous stage. All the detail of how the data got to you is hidden. So they talk about how it seems counterintuitive — it seems like this extra step makes things slower — but it actually turns out to be a benefit. Yes — we'll get to that in a second; there are software engineering benefits you get from this beyond the performance. And the point is they're all in the same network, and the network is fast because it's Google. Yes — just to repeat what he said: these workers aren't going to have a lot of memory; the shuffle nodes are larger-memory machines. And yes, there's an extra hop, but they're at least in the same data center — they're not going to be on the other side of the country, which would take too long — they're going to be close enough, and the internal bandwidth is going to be super high with low latencies anyway. And then these shuffle nodes are going to have more memory to keep everything in memory versus over here on the workers. Otherwise you'd also potentially have to keep state like: OK, if I start exceeding memory on the workers, I've got to know what I've given out so I can start evicting it; and then if this guy fails, I've got to go get it again, but I've already reassigned those workers to go work on other stuff.
What happens if there's a transaction running and now you want to roll it back? No transactions — we're not doing transactions here. Yeah, ignore that; not in this paper. I don't know about the BigQuery service; I think they support DML, but they're not going to be optimized for that. A lot of data warehouse work just happens without transactions. You can't do multi-statement reads that are transactional, but that's not the scenario this is being used for. I'm not worried about inconsistent reads — like running this query and then running it again and getting different results — because I'm reading mostly historical data that maybe gets updated periodically by a batch job. I'm not reading the live stream as it comes in; there are other systems that can do that, but that's not what this is for. Yes? The question is, does each worker need its own cache manager — for caching what? OK, so the paper talks about how they have disaggregated memory; that's only for the shuffle. Each worker has its own local memory. So I'm retrieving data from upstream, or downstream depending on how you look at it — from whatever came before me — and I'm going to try to keep that in memory. If I need to spill to disk, I'll spill to disk; we'll talk about how they repartition to avoid that. The shuffle has its own memory, but it's ephemeral. I get whatever data I want to process for my task, I crunch on it using all the techniques we've talked about in this class this semester, and then I shove it out the door and move on to the next thing. It's not storing ephemeral state like the hash table I built to do my join on my worker node — you do that using local memory. Yeah, so Snowflake doesn't do the shuffle. Their worker nodes can write to a local cache, a local disk, as well. How do I say this — if you're reading from S3, Snowflake doesn't want to pay the cost of reading from S3 all the time, so their worker nodes are a bit more stateful than in the example here; they'll have a disk cache and can cache the results they fetch from S3. As far as I know, they're not doing that here. Yes? They're sharing memory — how is that fault tolerant? Yeah, I'll leave it at this: redundancy for a key-value store within the same data center is a solved problem. The operations this thing is supporting are basically gets and puts; it's not a novel paradigm — you can use Raft or whatever other techniques for that as well. Okay, so as I said, because they have this explicit shuffle step, it opens up a bunch of stuff — things you would want to have in a system anyway, but now it's easier to implement these additional features because you have this abstraction over the intermediate storage. The first one, which I think we've mentioned a couple of times, is fault tolerance and straggler avoidance.
So if a worker doesn't produce the task result within some threshold or deadline, the coordinator can fire up another worker and say, OK, you're responsible for executing this task now. And we can do that without having to rebalance anything upstream; it's isolated to just what happens in that stage. Again, this is why we want the tasks to be deterministic: no matter how many times I rerun one, I always produce the same result. We also get the benefit — and we'll see this when we do dynamic repartitioning — that we can scale the number of workers up and down between stages, because we know what the size of a stage's output is. We have this checkpoint saying, OK, here are all the results from this stage; once I have everything, I can decide how many workers I want for the next stage. Or I have some default or initial setting for the number of workers, and once I see the full output of the stage in my shuffle nodes, I can decide whether to scale down. And scaling down is just a matter of: when a worker is done with its task, I tell it to work on something else instead of this query. For the straggler case, it's pretty simple. Say all the workers in the stage are producing their output, but for whatever reason this one is slow, and after a while we're still waiting for it and haven't gotten its result. The coordinator says, OK, this guy is never going to finish, let me just go kill it, and then I'll reassign the task to this other node. Then, once I get all my results, the coordinator can look at the statistics of the data generated across all the partitions stored in the shuffle and decide, OK, this is actually way more data than I thought was going to come out of the previous stage, so let me get some new workers that are available in my pool and have them be involved in executing the query as well. Or the same thing in reverse: while the query is running, I can realize I want to use those worker resources for other queries and take them away. And it doesn't change anything about how I coordinate things over here; it's all isolated within a single stage, because workers just get tasks that say, go fetch the data you need from the in-memory storage. Is that clear? Yes, sorry — the question is, if resource allocation is an internal decision, how do they charge customers? I think the BigQuery pricing model, ignoring provisioned and guaranteed capacity, is that you pay based on the amount of data you read. I might be wrong on this, but you don't pay for compute. So if I have a query that reads one terabyte of data and I do Bitcoin mining while I run the query, versus I just do SELECT * and produce the output, there's no pricing difference. They'll definitely charge you to get the data out — don't forget there are egress and ingress charges for networking, which they mark up a lot. But for the actual query itself, they charge by the data you read. I think other vendors' serverless models work the same way. And then you can get provisioned, guaranteed performance; you pay extra for that.
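Before moving on, here's a minimal sketch of the straggler handling described above. The deadline, the data structures, and the names are all invented for illustration; the only point it captures is that reassignment is safe because tasks are deterministic, so a late duplicate result can simply be overwritten.

```python
# Minimal coordinator-side straggler sketch: if a task has not reported its
# result within a deadline, hand it to an idle worker. Last writer wins, which
# is only safe because tasks are deterministic. Illustrative, not Dremel code.

import time

class Coordinator:
    def __init__(self, deadline_sec):
        self.deadline_sec = deadline_sec
        self.running = {}     # task_id -> (worker_id, start_time)
        self.done = {}        # task_id -> result

    def assign(self, task_id, worker_id):
        self.running[task_id] = (worker_id, time.monotonic())

    def report(self, task_id, result):
        # Duplicates from a killed straggler are harmless: same result, overwrite.
        self.done[task_id] = result
        self.running.pop(task_id, None)

    def reassign_stragglers(self, idle_workers):
        now = time.monotonic()
        for task_id, (worker_id, started) in list(self.running.items()):
            if now - started > self.deadline_sec and idle_workers:
                new_worker = idle_workers.pop()
                print(f"task {task_id}: {worker_id} too slow, retrying on {new_worker}")
                self.assign(task_id, new_worker)

if __name__ == "__main__":
    c = Coordinator(deadline_sec=0.05)
    c.assign("scan-part-3", "worker-17")
    time.sleep(0.1)                              # worker-17 never reports back
    c.reassign_stragglers(idle_workers=["worker-42"])
```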
Okay, so now I want to talk about how they do query optimization. I think this part is interesting — you'll see a little of this in Spark as well, at least in Photon, next class. We spent over a week discussing query optimizers, and we made this big assumption that a lot of the techniques rely on having a cost model that lets you compare whether one query plan is better than another, or whether one physical operator is better than another. But now assume we're in this in situ data analysis world, this data lake world, where we have a bunch of files that we've never seen before and never ran ANALYZE on, because they're outside the purview of the database system — the system doesn't control them. Some other job is loading them up to S3, and then somebody asks us to run a query on them, but we haven't run ANALYZE yet. How are we actually going to support this? In the paper, they don't have exact numbers, but they mention that a large percentage of the Dremel queries they run are over data the system has never seen before. So it's like: I've got a bunch of random files, I don't know what's in them, how do I even start doing query planning on them? The other challenge — we didn't really talk about this this semester — is that a lot of these systems support what are called connectors, where if you have an existing Postgres, MySQL, whatever database system, you can connect your OLAP system, like Dremel, to it. Dremel then says, OK, I see these tables, here's the schema, and I can now run queries on my Postgres database through BigQuery. Depending on the expressiveness of the API, you can either push down some of the filtering to the other data system itself, or it's going to do something stupid like SELECT * from the table, pull everything in, and do the processing on that. So again, in this environment, how do I do query planning when there's some black box system I don't control, where I have to make estimations about the selectivity of predicates in order to plan? The way BigQuery, or Dremel, works is that they take a stratified approach: heuristics and a very simple cost-based optimizer generate a preliminary physical plan for the query, and then at runtime, as the query runs, they look at each stage and decide whether to change the query plan as they go along. The hard-coded or static rules — the ones that aren't cost-model based — are the standard things you would expect, like predicate pushdown and primary key lookups. They have special code, custom checks, to see whether you have a star schema or a snowflake schema, and they can propagate the constraints from the dimension tables to the fact table, to try to do early filtering on the fact table. These are all standard techniques. They do some basic join ordering — again, things like: if I notice it's a star schema, I'll have the fact table drive the pipeline and join it against the dimension tables. Standard tricks everyone does. As far as I can tell, the only cost-based optimization they're doing is when you actually have statistics, which makes sense, because how do you do cost-based estimation if you don't know anything about the data? Where they do have costs is for materialized views.
So if you define a materialized view in Dremel for an existing dataset, obviously the system is going to run that query and materialize the view — think of it like result caching — and now it knows something about the result of that materialized view. So it can use the standard techniques to figure out whether one view is better than another if your query can be answered by it. But that's not the common case; most people are running directly on raw tables. So the basic idea is: use a bunch of rules to produce a rough sketch of the initial query plan, and then at runtime, look not at estimates but at the actual data each stage produces, and make further decisions. What Dremel is able to do is, before each stage starts, look at the results of the preceding stage and then make different decisions. Now, I won't show an example of this, but that's not entirely true: you could actually have a stage start running — you could have two stages running in parallel, like if you're doing a join — and then realize that one of the stages is not needed anymore and go kill it. The example would be joining two tables: if I think I'm going to do a shuffle join, where you repartition both sides and then do the join locally, and then I recognize that one table is really small, I can kill one side of the shuffle and switch to a broadcast join. But the basic idea is still the same. A bunch of these we've already covered — like changing the number of workers in a stage. They can also do things like recognize that they have different implementations of physical operators — one for really big partitions, maybe one for small partitions, or by object size; they're completely separate code paths — and pick which one to use based on the intermediate results of the previous stage. Dynamic repartitioning is an interesting one as well. Say we have a query where one stage, a bunch of workers, is producing some results, and they start filling up, say, two partitions on our shuffle nodes. The coordinator is getting periodic statistics from the partitions, and it recognizes that the second one is about to overflow and spill to disk — for whatever reason the data is skewed and everything is going to this one partition. What it can do is instruct the shuffle nodes: you actually have more partitions now, allocate that memory; and instruct the workers: don't hash anything to this partition anymore, do another round of hashing on it — basically recursive partitioning, like in Grace hash joins — and write that data to partitions three and four instead. Then, once this stage is done, you fire off another worker that does a repartition step: it goes back, reads the data that was previously written into the partition that was about to overflow, rehashes it into the two new partitions you made, and kills off the old one. Again, this is a very powerful concept, because they're trying to avoid going to disk as much as possible.
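Here's a toy sketch of that dynamic repartitioning idea. The budgets, partition ids, and salted re-hashing are all invented assumptions; the point is just the mechanism: a hot partition gets split, future rows for it are routed through a second round of hashing, and a small repartition step drains what had already landed there.

```python
# Toy sketch of dynamic repartitioning on an overflowing shuffle partition.
# When a partition nears its in-memory budget, two fresh partitions are
# allocated, future rows for it are re-hashed into them (like recursive
# Grace-hash partitioning), and the rows already sitting there are drained.

class Shuffle:
    def __init__(self, num_partitions, budget):
        self.base = num_partitions            # first-level hash fan-out
        self.parts = {i: [] for i in range(num_partitions)}
        self.budget = budget                  # max rows per in-memory partition
        self.redirect = {}                    # split partition id -> replacement ids

    def _route(self, key):
        pid = hash(key) % self.base
        while pid in self.redirect:           # follow splits, re-hashing each level
            subs = self.redirect[pid]
            pid = subs[hash((pid, key)) % len(subs)]
        return pid

    def produce(self, key, row):
        pid = self._route(key)
        self.parts[pid].append((key, row))
        if len(self.parts[pid]) >= self.budget:
            self._split(pid)

    def _split(self, pid):
        # Allocate two fresh partitions; all future rows for `pid` take a
        # second round of hashing into them.
        new_ids = [max(self.parts) + 1, max(self.parts) + 2]
        for nid in new_ids:
            self.parts[nid] = []
        overflow = self.parts.pop(pid)
        self.redirect[pid] = new_ids
        # "Repartition task": drain the rows that already landed in the hot one.
        for key, row in overflow:
            self.parts[self._route(key)].append((key, row))

if __name__ == "__main__":
    sh = Shuffle(num_partitions=2, budget=5)
    for i in range(24):
        sh.produce(f"key{i}", f"row{i}")      # 24 distinct keys, 2 initial partitions
    print({pid: len(rows) for pid, rows in sh.parts.items()})
```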
Think about it this way: when we talked about hash joins, if our hash table filled up because we didn't predict its size correctly, because our estimates were wrong, what did we have to do? Stop the join, throw away the old hash table, and create a new one at double the size. In their case, they don't want to do that. They can spin up the new partitions, do the recursive partitioning, the recursive hashing, to make the data go to the new ones, and not have to change anything below, right? So again, this is very powerful. This is not something that most systems can do. And because you don't know what the data is going to look like when it comes in, you don't know what the skew is going to be, you have to have this dynamism in order to support it, because, again, it's data you've never seen before. All right, I want to quickly finish up and talk about the storage stuff. As I said, they're relying on a distributed file system; in the case of Google, it's Colossus. They're using it to scale the storage capacity. And the paper talks about how, even though they're relinquishing control of the storage infrastructure to the file system team, the benefit is that as the file system gets better — because other teams still work on Colossus — as Colossus puts out new improvements, their system gets better; it becomes more responsive and faster without them having to do anything, which is fantastic. Think of it like: as Amazon improves S3 and makes S3 faster, if our system is built on top of it, we get that benefit for free. So that's a really important lesson from this as well. When they do manage the storage, when they have their own database encoding and file format, they're relying on this thing called Capacitor, which is not open source. Think of it as basically equivalent to Parquet or ORC: there's a file format, a specification, that defines how to encode the data, how to do compression, and so forth, and there are also utility libraries for accessing data from it, similar to what you guys built in project one. They talk about how, because they have to deal with these nested data structures — all the random JSON stuff — when you convert that to the Capacitor format, they keep track of all the information about how fields are repeated and whether they exist or not within the columns themselves. So you can scan along a column without necessarily having to go scan down into the ancestor fields within a tuple to figure out whether the data actually exists, and you only go read the data from the columns that are stored separately if you actually need it. It's a way to take a naturally row-based data model like JSON and convert it to a columnar format, and that's very powerful and very important. As for how they store the schema — again, we haven't talked too much about catalogs, but they're super important.
So they talk about how, in their datasets at Google, sometimes they have tables with thousands of attributes, thousands of fields, but most queries, and again, this is OLAP, only need a subset of them: scan along, find all of Andy's orders, or find everything within some date range or whatever. So if you store the catalog within the files themselves as a column store, a column-store catalog describing column-store data, you can apply all the same tricks and techniques that make OLAP queries fast on columnar data to the catalog as well, to find the things you're looking for. That makes it so that when I open a file for the first time, I can quickly figure out where the data I actually need is, without having to parse the entire schema (there's a little sketch of this idea below). And the last thing I think is interesting and important to talk about is the internal transition at Google from being a NoSQL tech giant, the opposite of a startup, to being one of the biggest proponents of SQL-based database systems in the world. So as I said, the Dremel paper talks about how Dremel was one of the first systems pushing SQL as a good idea within Google, and then that spread: people realized that Codd was a genius, and it spread throughout the other parts of Google. And as all these different units in the company were building their own database systems with SQL support, they all had their own dialect, their own flavor of SQL, with the things they cared about for the product or service they were building. And that obviously became untenable. So there was an internal effort called GoogleSQL, where instead of all these systems building redundant code, redundant parsers and binders and analyzers and type systems and syntax stuff, everybody was gonna use the same infrastructure. Dremel was part of that, Spanner is based on this, and a bunch of other systems are now using it. They also open sourced this code as something called ZetaSQL. As far as I know, hardly anything outside of Google uses it, and I don't think Google actually uses the open source version either, because they have their own thing, right? They always have this pattern: the open source version, like Kubernetes, and the internal version, like Borg, and maybe they share the same original lineage, but Google still uses whatever they have on the inside. So they open sourced a library called ZetaSQL; I don't even think Google maintains it anymore, I think it's all external to Google now. And the only system that I know of that supports it is Apache Beam, which is a stream processing system. I'm not aware of any database system that's based on it. To me, this is interesting, because think about the early 1980s, right? When IBM said, okay, we're gonna use SQL, that became the de facto standard because IBM was huge. Oracle copied it, and then everyone had to follow along with SQL. Of course there are dialects, there are flavors of it, but SQL is considered the gold standard and everything's gonna be based on that. Google's a massive company, it's widely influential, they put this thing out, and nobody else uses it, right?
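Going back to the column-store catalog point for a second, here's a rough Python sketch of that idea with completely hypothetical structures (this is not BigQuery's or Capacitor's actual metadata layout), just to show why laying the schema out as parallel arrays lets a reader find the couple of fields a query needs without deserializing thousands of entries.

```python
# Hypothetical sketch: store a file's schema/catalog as parallel (columnar)
# arrays so a reader can locate just the fields a query needs, instead of
# parsing a schema with thousands of attributes.

from bisect import bisect_left

class FileCatalog:
    def __init__(self, fields):
        # fields: list of (name, type, byte_offset) for every column in the file
        fields = sorted(fields)                    # sorted by name once at write time
        self.names   = [f[0] for f in fields]      # one "column" of the catalog per attribute
        self.types   = [f[1] for f in fields]
        self.offsets = [f[2] for f in fields]

    def lookup(self, wanted):
        # Binary-search only the names the query asked for; never touch the rest.
        out = {}
        for name in wanted:
            i = bisect_left(self.names, name)
            if i < len(self.names) and self.names[i] == name:
                out[name] = (self.types[i], self.offsets[i])
        return out

if __name__ == "__main__":
    catalog = FileCatalog([(f"col{i:04d}", "INT64", i * 4096) for i in range(5000)])
    print(catalog.lookup(["col0042", "col4999"]))
```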
So what I'm saying is, I think it's very unlikely, I won't say never, but very unlikely, that everyone will ever converge on a single pure SQL dialect. There's always gonna be these slight variations and fragmentation in the industry, and that's probably gonna be okay. The only standard I would say exists now is probably Postgres, because there's a bunch of systems that fork the Postgres parser. DuckDB did this, a bunch of other systems did. That's probably what would be considered the de facto flavor of SQL now, but even then, that's only a small segment of the market. There's Oracle, SQL Server, everyone has their own thing. Okay, so I wanna finish up talking about three systems that are inspired by or based on Dremel: Drill, Dremio, and Impala. I'm not gonna talk about Uniffle, but that's an incubating Apache project where they're trying to build an open source version of the shuffle service, similar to what Dremel uses, and it supports MapReduce and Spark. So that space is interesting, but it's still very early. All right, so the most, "famous" is probably not the right word, but when people say, what's the open source version of Dremel? Drill is usually what people think of, right? Because Drill, Dremel, a Dremel is a drill, the naming is close enough. So this started in 2012, a year after the Dremel paper came out, at a MapReduce company called MapR. Who here has ever heard of MapR? Nobody, right? They failed. Well, they got bought, but barely. They had their own proprietary version of Hadoop, think of it like that, except instead of Java it was written in C++. It was considered high-performance Hadoop, that was the thing. So what's interesting is that the basic tenets of how Dremel works apply to Drill, but they also do codegen using this thing called Janino, which is an embedded Java compiler. So they do vectorization, they do all the stuff we talked about before, but they also compile their queries into Java bytecode. Now, I don't wanna say Drill is dead, because people watch this and then email me and complain, but MapR got bought for peanuts by HP, and HP announced in 2020 that they're not gonna spend resources maintaining it anymore. So it's still an open source project, people are still working on it, but there's no company backing it in the same way Dremio is backed, and there's not a vibrant open source community like there is around Presto or Trino. Why did it get killed? HP had to cut costs. So Apache Drill is still alive, it's an Apache project, it'll live forever, right? It's still being actively maintained, it's just that there isn't a company behind it; a lot of these Apache projects get spun out of a company, and the company still sells a commercial version to make money and keeps maintaining the open source one as goodwill to the open source community. HP said they didn't wanna spend money on that. Dremio is probably the most vibrant one that's inspired by Dremel. What's interesting is this was founded in 2015 by somebody who did his master's here at CMU. It's based entirely on Apache Arrow; the CMU guy was actually, I think, one of the co-creators of Arrow.
So their big claim about why they're fast is that they rely on what they call reflections, but as far as I can tell, those are just materialized views with a fancy name, used to speed up query execution on external data files. The idea is that instead of reading every file as if it's the first time you've ever seen it, they can cache some additional information, or cache the actual results of certain queries defined by the user, and then run the execution on those materialized views, or reflections, rather than parsing everything from scratch all over again. They also rely on Java-based codegen, I don't know what compiler they're using, and of course everyone does vectorization at this point. The last one is Impala. This was started in 2012 by somebody that worked at Google. He didn't work on Dremel, he worked on F1, but obviously the F1 people work with the Dremel people. So the idea, again, was to take the techniques that were built in F1 and Dremel and build an open source version at Cloudera, outside of Google. They do codegen at the executor node, which, as I'll explain in a second, is embedded inside the data node, but they only do it for expression evaluation and for parsing things like CSV files, to speed that up. So what they did, at least in the original version, and I don't know if this still exists because you obviously can't do this on S3, is that they didn't treat the distributed file system as a black box. You would actually run a little program, I think it was C++, maybe a little JVM, on the same node where the data was located, so that you can do predicate pushdown and filtering directly where the data is rather than having to send it over the network. And obviously this breaks the entire paradigm that we've been talking about with these object stores, where you can't do that: you only get basic select operations, like in S3, whereas this was a whole piece of code that the system could coordinate and send work to, and you obviously can't do that in S3. So Impala is still around, it's an Apache project, and I think they're still developing it, but Cloudera got eaten alive by Databricks. Cloudera, despite having the word Cloud in the name of the company, did not pivot to cloud-hosted services. They stuck with this on-prem service model, whereas Databricks sold the cloud as the thing; yes, they had Spark, but they sold Databricks as the Spark cloud service from the start. Databricks is gonna go IPO, and Cloudera went IPO and then got reacquired, by a venture firm, no, private equity, and became private again, and they're gonna try to IPO again, but they've clearly been eclipsed by Databricks. All right, so to finish up: I had you guys read the Dremel paper first because it predates all of the other systems that we'll talk about, including Snowflake, not just because they started working on it in 2006; the paper came out in 2011, and like I said, it was very influential. The shuffle phase, again, seems like it's wasteful, but it opens up a bunch of engineering opportunities, makes things easier, and in some cases you actually get better performance.
And I would say, overall, what I like about this paper is that it shows the benefit of decomposing a data system's internal components into separate services, which allows them to be developed independently by separate teams and then scaled and improved separately, and it also abstracts away code that you would otherwise have to embed into different parts of the system. If you expose something as a single service that handles everything for you, that makes it cleaner to write the worker code and the other parts of the system. So again, with the shuffle node stuff, the interface is basically just get, set, and maybe delete, and you don't worry about how to get the data to the next stage. You just say, if I write it here, someone will take care of getting it to the next stage for me. So I think that part is very unique and interesting. Okay. Any questions?