Carnegie Mellon University's Advanced Database Systems course is filmed in front of a live studio audience. Today's class is the beginning of the lecture material for the semester. It's still sort of a high-level overview of OLAP databases, but the idea is to set the foundation for all the various parts of the systems and the papers we're going to talk about, like how to build these individual components in a modern system. Obviously this is highly related to the projects we're going to be working on, which we'll discuss at the end. This sets the context in which we'll build everything we talk about throughout the semester.

So we're going to first talk about the basic background of how we ended up with the prevailing architecture for a modern OLAP system. We'll talk about some high-level choices and issues in this space: what some historical systems looked like, how we ended up with what people build today, and then we'll finish off with a quick overview of what a query goes through, what actually happens when you execute a query in one of these systems. Again, this is a grad-level class: stop me and ask questions as we go along.

If you recall from the intro class, we made this distinction between operational databases, front-end databases, OLTP databases, and analytical database systems, or OLAP, online analytical processing systems. The distinction was that the OLTP system is the thing that faces the outside world, either humans or computers, through a little web interface or a REST interface, and it's ingesting new information. You're getting new state, you're getting new changes, and you want to store them as quickly as possible. Once you've accumulated a bunch of this data, you want to start extracting new information from it. You want to extrapolate new knowledge that allows you to make decisions, or decide how to do certain things, or provide justifications for whatever it is you want to achieve in your business or institution or organization. That's the goal this semester: take a bunch of data we've accumulated and run queries on it to pull out new information about what our database actually contains, ideally finding trends that we as humans wouldn't easily spot.

In the old days, people would run these analytical workloads on what I call a monolithic database system, meaning a system where all the components and all the subsystems needed to execute queries and store data were built inside one piece of software. If you've ever run SQLite, DuckDB, MySQL, or Postgres, those are considered monolithic systems. Embedded databases less so, because they don't have threading and so forth, so think of Postgres: you install Postgres, put it on your laptop or your server, and start creating tables in it. Everything you need to actually execute queries and store the data is inside of Postgres. These monolithic database systems were how people stored data in the old days.
We'll talk about what I mean by this, but it's centralized, managed storage: the database system is completely in charge of what bits get written to disk, where they go, and how to pull them back in.

People started really paying attention to analytical workloads in the early 90s, maybe late 80s. I shouldn't say that was the first work, because there were earlier systems built in this space, like Teradata, going back to the late 1970s. But the prevailing architecture at the time for how people built database systems was the classic architecture we talked about in the intro class: row store, pages on disk, a buffer pool fetching things in, because that's what they were building for operational workloads. A row store is exactly what you want if you want to ingest data very quickly in a transactional manner. But obviously for analytics that's going to suck, because OLAP queries only read a subset of the data, and you're fetching in the entire page, the entire row, including a bunch of data you don't actually need.

So people realized these systems were kind of slow, and they started building what were called data cubes. Think of these as like a materialized view: a pre-computed aggregation query, a group by and so forth, across a bunch of different dimensions. You would generate this array, more or less, store it in your database, and then any analytical query that came along would try to target that data cube, because it had already done a bunch of the computation for you. And you could store these things in an array manner that was better than a row store.

These things were not automatic. An administrator had to specify: I want these pre-computed cubes, just like materialized views or regular views. And because materialized views (certainly back in the 90s, and to a large extent still now) are difficult to incrementally update, a human had to manually run a refresh, like a SQL command, to populate the data cube. So you would do something like a cron job at night that runs the refresh to build the data cube. These were introduced in operational databases as a way to handle analytical queries faster than you could over row-oriented systems.

This is the logo for Essbase, which Oracle bought in like 2003 or so. SQL Server, DB2, Essbase, Oracle, Informix: all these guys had their own variation of data cubes. Teradata did as well, but Teradata was primarily an OLAP system back in the day, one of the first ones, actually the first one.

So the basic idea is this: you have your OLTP databases, your operational workloads, which is where you're getting new data, and then somebody wants to run a query like this, with a CUBE function in the GROUP BY clause. All you're going to do is a sequential scan on each node to populate the cube, and then when this query shows up, if you defined this as a view, you run the query against the cube. Again, think of pre-computed aggregations; that's all it really was. There's a concrete example of such a query in the sketch below.

What really changed and got us on the path towards where we're at today was the early-to-mid 2000s, when people started building specialized database systems called data warehouses that were specifically designed for analytical workloads.
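Before moving on, here's that cube query made concrete as a minimal sketch in DuckDB (the table and columns are made up for illustration). GROUP BY CUBE computes the aggregate for every combination of the listed dimensions: per (region, product), per region, per product, and the grand total. That's exactly the kind of pre-computed result an administrator would materialize as a data cube.

```python
# Minimal sketch of the data-cube idea using DuckDB (table/columns are made up).
import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE sales (region VARCHAR, product VARCHAR, amount INT)")
con.sql("INSERT INTO sales VALUES ('east','widget',10), ('east','gadget',5), ('west','widget',20)")

# One row per (region, product) combination, plus subtotals and a grand total.
con.sql("""
    SELECT region, product, SUM(amount) AS total
    FROM sales
    GROUP BY CUBE (region, product)
""").show()
```

A nightly cron job would just re-run this aggregation and overwrite the stored result, which is the manual refresh described above.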
So even though a lot of these data warehouse systems started off as forks of Postgres (we'll go through that a little bit) and were mostly derived from row-storage systems, they ripped out a lot of the storage internals, ripped out the execution engine, and replaced them with something targeting column-oriented data. All of these, except for DATAllegro and MonetDB, are forks of Postgres. ParAccel is what Redshift is based on; we'll cover the Redshift paper later on. Basically, Amazon bought a license to the source code, hacked it up a lot, and that became Redshift. MonetDB was written from scratch out of CWI. DuckDB is originally derived from MonetDB: there was a version before DuckDB called MonetDBLite, and then they threw all that away and rewrote it as DuckDB. Vertica was started by Stonebraker and others back at MIT and Brown; that's a fork of Postgres. DATAllegro, as far as I'm aware, was a fork of Ingres. Microsoft bought it for, I think, a couple hundred million, and then immediately threw it away because apparently it was garbage. And then Netezza was the early one. This was actually pretty cool: a version of Postgres that had an FPGA accelerator to speed up sequential scans. Greenplum is still around today, pretty widely used; that's a fork of Postgres.

So these were monolithic systems, designed to run analytical workloads and to control the storage layer of the system. They used their own proprietary formats. That'll make more sense in the next class, but basically, they were in charge of what the bits looked like on disk for the pages they were storing. The other thing to point out is that these were all shared nothing systems. We'll cover that in a second, but they assumed that every compute node in your database cluster had disk, memory, and CPU, and each node was responsible for storing some portion of the entire database.

The way you would use it is like this: you have your OLTP databases, and now you have your giant data warehouse, and the idea is that you want to get all the data from the operational databases into your single data warehouse, because then you have a single, complete view of all the data across all your databases. The way you do this is with tools for extract, transform, and load: ETL tools. You use change data capture to periodically get updates from the OLTP databases and do some amount of transformation to clean up the data, like entity resolution: if one record says "A. Pavlo" and another says "Andy Pavlo", you can figure out that they refer to the same person. All that happens here, and then you load the result into your data warehouse.

But because the data warehouse wants complete control of everything it stores, you've got to set up the schema ahead of time and provision the hardware ahead of time. Everything has to be set up before you start putting data into it. And it was a shared nothing system, so if you want to scale the capacity of the system, you have to add more nodes, and now you've got to start moving data around. That's going to be one of the limitations we'll see throughout the semester.

So then we hit the late 2000s, early 2010s, when these things took off, and we enter this new era that we're in today of shared disk engines.
And the idea here is that instead of having the database system manage its own storage layer, we offload that to some other piece of software, some other service. In the cloud setting, that's going to be an object store like S3. Because we no longer have to be responsible for managing the storage of the data, we can optimize the compute layer as much as possible. We're still going to have proprietary data formats, meaning if you're using regular Snowflake (and Snowflake was sort of the first one in this space), when you store data into it, it stores it in the Snowflake format that only Snowflake understands. But it stores it in S3. So that was the first generation of these systems: they manage all the files themselves.

The newer generation, what was in the paper you guys read, falls under this branding label or moniker of lakehouse systems. The idea is that it looks just like the shared disk system from before, but now, instead of always using a proprietary storage format and only allowing data to be added by going through the database system, you allow anybody to write a bunch of files out to S3 and tell the database system, the lakehouse system: hey, here are my files, here's what's in them, and now you can run queries on top of that data.

Yes? The question is: I'm showing all these logos at the bottom here; why are there so many, and what are they competing on? I mean, the reason there are so many of them is that there's so much money in databases, right? There's Linux and then what else, right? But there are a lot of database systems. One key difference is that some of these are hosted services you can only get through a certain cloud provider, like Amazon Redshift: you can only get Redshift on Amazon. With BigQuery, I think they call it Omni or something, you can now run BigQuery stuff on AWS and Azure and so forth. But for whatever reason, people keep building database systems because they think they can do a better job than what already exists. And oftentimes some of these projects are spin-outs of larger tech companies that decided to build this stuff in-house, it turned out to be useful, and then they converted it to an open-source project and got people outside the company to use it. Presto started at Facebook. Pinot started at LinkedIn. Trino is a fork of Presto. Druid, I forget, came out of some company as well.

And I would say that at a high level, for the most part, as we'll see throughout the semester, they're all going to be basically the same. The real difference, the thing that actually matters in some cases, is going to be the top layer, the front end: what the user experience looks like and how good the query optimizer is. In my opinion, that's the part that really matters. All the stuff we're talking about throughout the semester becomes almost commoditized, table stakes: everyone has it, everyone knows how to build it. How to build it, why systems do certain things, and why they perform the way they perform is still super important, especially if you want to work on the internals of these systems. But to the average user, it's really the top part that matters.
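To make the lakehouse pitch from above concrete, here's a minimal sketch using DuckDB (the path is hypothetical): no load step and no proprietary format, the engine just queries Parquet files where they already sit. With DuckDB's httpfs extension the same query can point at s3:// paths instead of local ones.

```python
# "Here are my files, go query them": the engine reads Parquet files in place.
import duckdb

duckdb.sql("""
    SELECT product, SUM(amount)
    FROM read_parquet('warehouse/sales/*.parquet')
    GROUP BY product
""").show()
```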
So another way to say it is: are there too many databases? Maybe. Like I said, there's still a lot of money in the marketplace for this kind of stuff, and these things don't die. Impala didn't really take off the way Snowflake did, but they're still maintaining Impala, and there are still people using it. Would I recommend anybody start off using Impala today? No. But it's still there.

Yes? The question is whether what I'm describing here is the same as a data lake. The term data lake basically meant: okay, here's S3, anybody can store data in there, right? The lakehouse architecture, as we'll see in a second, does have the ability to ingest data through the lakehouse and keep track of things, but it also provides additional schema control and metadata stuff as well. With a data lake, the idea was: dump your files in S3, and whoever did that is also responsible for telling the catalog, here are my files. With the lakehouse architecture, it's supposed to be a unified front-end interface for the developer: you say, okay, here's my new data, and the lakehouse can put it where it wants, or you tell it where the data is, and it keeps track of these things. So it's not that you couldn't do what a lakehouse does on a data lake. Ignoring the incremental updates, which we'll talk about in a second, they're essentially the same thing; the lakehouse just has more services to help you keep track of what's going on.

His question is: when I say shared disk, do I mean a data lake? No. Shared disk is the distinction from shared nothing: you separate compute and storage and rely on the object store to store your data. You could do that in some systems before Databricks and the lakehouse stuff, right? But there's much more manual work involved. If you just dump files in S3 and nobody knows about them, you can't run queries on them. You would have to update the files in S3 and then tell some catalog service, here's where my data is, so queries can run on it. It's a manual process, and the lakehouse is trying to do all of it automatically for you. It's a marketing term in some sense: the data lake just means, okay, here's S3, put my files there.

But the important thing is what this allows you to do. Because it's a data lake or an object store, I don't have to go get approval from the DBA to store data, and nobody spends time setting up the schema, figuring out what's going to be in there, and provisioning hardware. You just start uploading files. For better or worse, that makes it easier to put data in, but then somebody has to figure out what's actually in there. You're pushing the burden of interpreting the data and its contents further down the pipeline. In some cases that's a good idea, in some cases it's a bad idea; that's almost a philosophical discussion. But the key thing here is that we're separating compute and storage by using a shared disk architecture.

So we go back to our diagram from before. Now we have all the databases.
They're going to send all their data to an object store, and maybe there's an ETL thing or some middleware that does some transformation before it puts the data in there. Then we'd tell the catalog: here are the files I just put in there, and here are their contents, or maybe the schema. And now if I want to run queries, the query engine on this side is no longer responsible for the storage, so it has to go to the catalog and ask: where are the files located in S3? Once I have that information, I can run my queries against the object store.

So for this semester, this box here is what we care about. This is the thing we're actually going to design. This is what we call an OLAP system, whether or not it's a lakehouse system that comes with the additional stuff Databricks wants to sell you. This is the thing we'll describe conceptually and learn how to build. And this is how people have really been building these systems for about 10, maybe 15 years now. It started off with Dremel at Google, and then Snowflake was the one that really commercialized it. I'm going to call it the classic architecture, and I realize you guys are in your 20s and I'm talking about 15 years ago. In the scope of databases, that's actually not a lot of time.

Yes? Yeah, the question is about how we get the updates from this side into this side. It's called change data capture, and we're not going to talk about that process specifically; I can point you to some previous lectures where we've had guest speakers talk about it. We are going to care about what the data looks like when it goes in there, and that's the next class: Parquet and ORC files. Yes? Next class. It's on the next slide.

In the lakehouse paper from the Databricks guys, they talk about the big problem with these data lake systems being stale data: we're getting continuous updates from the operational side of things, so how do we integrate those into our database so that we're always looking at the freshest data? What these lakehouse systems additionally provide is the ability to do transactional updates: creation, deletion, updates, insertions into this database. And the way you basically do this is the fractured mirrors stuff we talked about last semester: you do a log append to a file, here are all the latest changes, and then there's a background job that periodically takes that log, coalesces it, removes the stale data, and stores the result as, for example, a Parquet file in the case of the Delta Lake thing from Databricks. They take that Parquet file, store it in your object store, and update the catalog to say: here's the latest version. They can also keep track of schema changes for you. It's providing, as I said, more infrastructure so that the object store can keep track of what's actually in it; the sketch below shows the basic mechanics.

We're not going to talk much about this part this semester. Delta Lake is one example of such a system; I don't know whether the paper mentions Hudi and Iceberg. Hudi came out of Uber. Iceberg came out of Netflix. Snowflake has their own sort of thing, I think they call them hybrid tables, that does incremental updates, and they support Iceberg as well.
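Here's a minimal sketch of that log-append-then-compact flow, assuming made-up file paths and record layout (this is the general mechanic, not the actual Delta Lake protocol): writes go to a cheap append-only log, and a background job folds the log into a new immutable Parquet file, since the object store doesn't allow in-place updates.

```python
# Append-only ingest plus background compaction (hypothetical paths/layout).
import json
import pyarrow as pa
import pyarrow.parquet as pq

def append_change(log_path, row):
    with open(log_path, "a") as f:           # cheap append-only write path
        f.write(json.dumps(row) + "\n")

def compact(log_path, parquet_path):
    latest = {}
    with open(log_path) as f:
        for line in f:
            row = json.loads(line)
            latest[row["id"]] = row           # last write per key wins
    rows = list(latest.values())
    table = pa.table({k: [r[k] for r in rows] for k in rows[0]})
    pq.write_table(table, parquet_path)       # new immutable object; the catalog
                                              # is then pointed at this version

append_change("changes.json", {"id": 1, "amount": 10})
append_change("changes.json", {"id": 1, "amount": 15})   # supersedes id=1
compact("changes.json", "table-v2.parquet")
```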
This logo without a name is actually Google Napa. They have a paper on it, but Google doesn't put names next to logos. Amazon's like this too. How am I supposed to know what this is? In their defense, it's actually an internal system that they're only now publicly talking about. We're not going to talk about these things this semester, because we really want to focus on how to run OLAP queries as fast as possible. Once we have that, we can go beyond it and build this delta-update stuff on top.

Okay. So the paper you guys read makes a bunch of different observations. I mentioned the one about stale data; here are the three things we want to keep in the back of our minds as we go through the semester to understand and guide how we make decisions in building a system. Even if you don't go off and build database system internals after this semester, these are the things to think about when choosing an OLAP system for your next project, at a startup or wherever you go after you graduate.

The first thing to point out is that in modern organizations, people want to execute more than just SQL queries. Now, I realize that as a SQL maximalist who thinks everything should be SQL, this seems like heresy to me, right? But I'm not naive. I know people want to run PyTorch and TensorFlow and all these other ML workloads that you can't easily express in SQL. There are projects like PostgresML that give you UDFs that make calls into PyTorch, but most people aren't writing that; it's data scientists writing stuff in notebooks. And the access patterns for ML workloads are going to look a lot different than OLAP queries or SQL queries. We'll see this later in the semester when we talk about networking protocols for database systems: sometimes you want to do a bulk export of data without taking it exactly as it exists in memory in the database system and converting it to a row-oriented result format like you'd get going through JDBC or ODBC. Maybe you want to get things out in the Apache Arrow format, which we'll cover during the semester, and which is relevant to the projects; there's a small example of this below. So we'll design most parts of the system from the planner level down knowing that ML workloads are going to look like Python workloads, not like SQL workloads. That's what I was saying before: we want to expose different APIs for how people want to get data and run queries.

The other important thing, as I already said, is that because of these shared disk architectures, the database system no longer has rigid control over exactly what data goes into the database and how people get data out of it, because now it's just files in S3, ignoring any governance or security permissions on those files. If it's just files in S3, we don't have to go through the front end of the database system to do things with our data. That doesn't mean we don't still want to track schemas, schema versions, and what files exist; the catalog is the pivotal thing that makes this all work. But because anybody can put things in S3, in theory you don't have to go through the full bureaucracy I mentioned before.
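Here's that bulk-export point as a quick sketch (using DuckDB purely as a stand-in engine): instead of dragging results row by row through a JDBC/ODBC-style cursor, you ask for the result as an Apache Arrow table and hand the columnar batches straight to an ML or dataframe tool.

```python
# Bulk export as Arrow: columnar batches, no per-row result conversion.
import duckdb

arrow_table = duckdb.sql("SELECT range AS x, range * 2 AS y FROM range(1000)").arrow()
print(arrow_table.schema)   # a pyarrow.Table, ready for pandas/Polars/ML tooling
```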
And the last one is that, as they point out, and just think of your own behavior on the internet, most data is unstructured or semi-structured. Unstructured would be something like an image or a video file; I think most of the traffic on the internet is YouTube or video files. And a lot of data is also going to be semi-structured: something like JSON files, or a combination of structured data, like a tuple, with some JSON portion, or maybe raw text from a log file. A lot of data arrives in this form.

For unstructured data, we're not going to talk about that at all this semester. Handling it means taking a transformer or some ML framework and extracting information about what's in the image or video file, and the thing that gets spit out after that transformation is going to be structured. So for SQL purposes, there's not much we can do with it. Semi-structured data, though, is a big issue we have to care about, because people are going to have a bunch of JSON in S3, and even if it's inside a structured file format like a Parquet or ORC file, which we'll read about next class, we still have to be able to make sense of it. This is a good example of where different systems do different tricks to make it work efficiently: Snowflake, I think, assumes the same fields are usually going to be there, and they generate columns for them; Databricks, I think, does parsing on the fly. There are a bunch of different ways to handle it (see the sketch below), and we'll cover them throughout the semester. So again, we want to design our system keeping these things in the back of our minds, and we'll see throughout the semester how we address each of them.

All right, the other interesting trend of the last decade is that we've gotten away from monolithic database systems: people now build services or individual components that are separate from the full system, and that's basically how we've laid out the projects this semester. In theory, these different services can be developed independently, as long as they expose and maintain a stable API that the other services can understand and use. Then you can start swapping pieces in and out, and you don't have to build the entire system from scratch; you can use off-the-shelf tools to build a full-fledged database system. Again, everything in Postgres or everything in DuckDB or SQLite is written by those developers (ignoring third-party libraries for SSL and things like that, which is obviously not what I'm talking about). The query optimizer, the catalog, the parser: all of that is built by the system developers. But now, in theory, we can take off-the-shelf tools, cobble them together, and still get a full-fledged, full-featured database system.

The challenge, though, is the basic software engineering principle that the more abstraction layers you put in place, the more inefficient the software becomes. There's intimate knowledge about what the system wants to do at any given level, and if you don't expose that information up and down the layers of the software stack, you end up with the lowest common denominator.
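To illustrate the semi-structured point from above, here's a toy sketch of the two strategies just mentioned (my own illustration, not either vendor's actual implementation): shred commonly occurring JSON fields into real columns up front, or keep the raw JSON and parse it per query.

```python
# Two ways to handle semi-structured records (toy data).
import json

records = ['{"user": "andy", "amount": 10}',
           '{"user": "bob", "amount": 5, "coupon": "X1"}']

# Strategy 1: "generate columns": shred the union of observed keys up front.
keys = sorted({k for r in records for k in json.loads(r)})
columns = {k: [json.loads(r).get(k) for r in records] for k in keys}
print(columns)   # missing fields become None (NULLs)

# Strategy 2: parse on the fly, per query, straight off the raw strings.
total = sum(json.loads(r)["amount"] for r in records)
print(total)
```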
[Student] Do we know how costly it is to add these separation layers between the different components, as opposed to having a full stack? The question is: do we know how costly it is to use these different services versus having everything written from scratch? No. I mean, it's hard to study, right? It's like, hey, have two teams build a multi-million-dollar project just to see which one's better at the end.

Some parts are super hard, too. The query optimizer is the hardest part of the database. Most people can't build one, and if they do, the first version is usually a bunch of heuristics, if-and-else statements. It's garbage. In the case of query optimizers, and again, we have a project on this in this class, there have been two notable attempts to build standalone query optimizer services. There's Calcite, which came out of LucidDB and is probably the most common one, and there's Orca from the Greenplum/VMware people. And there's actually a paper about the effort it took to get Orca to work in MySQL. Apparently it was a huge pain, because of all the assumptions baked into these things.

This is my last comment here: there are a bunch of challenges in gluing these things together. It's not like HTTP, where there's one protocol, every web server speaks it, and it's easy to combine things. Have you ever used a REST API from some service? They want the data this way, you want it that way; it becomes a train wreck real quickly. Making these things actually talk to each other, and making them fast, is non-trivial. So we'll see how it goes for the projects this semester.

Another important thing we're going to cover is what the intermediate representation, the IR, looks like for the various parts of these systems that talk to each other. I'll show what I mean on the next slide, and I've already mentioned it in the project write-up: okay, there's the query optimizer, and it generates a plan that it hands off to the scheduler. Well, what is it actually handing over? The scheduler needs to know what's inside the query in order to decide what goes where. So how do you represent that? And then, how do you represent the data; what are the data types that need to be synchronized across these components? If I'm using off-the-shelf components, one might have 32-bit ints while another assumes all ints are 64-bit, and fixed-point decimals are another challenge. How they actually store the data itself can become jumbled up and difficult. There's a sketch of this hand-off below.

We'll talk about file formats next class, and later about execution engines and a little bit about execution fabrics: think of things like Ray, that kind of stuff. People have said, hey, there are all these existing pieces, can I just cobble together, you know, a Java database out of all this Java stuff? None of those have really taken off; they've mostly been toy exercises. And the paper here that I'm referring to was optional. It's from the Facebook guys and the Voltron Data people, and basically they're arguing that this is how people should build database systems today: standalone components that interoperate.

All right, so here's a high-level overview of what the internals of one of these OLAP systems look like, given the context we've described. Essentially, it mirrors how we're envisioning the projects working out this semester.
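As a toy example of that optimizer-to-scheduler hand-off (all names here are hypothetical): if both sides agree that a plan is a little tree of tagged operator nodes, the IR crossing the boundary can be as simple as a serialized JSON blob.

```python
# A toy plan IR: the optimizer hands the scheduler a serialized operator tree,
# so both sides must agree on what the nodes and data types mean.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class PlanNode:
    op: str                       # e.g. "scan", "filter", "agg"
    props: dict                   # operator-specific properties
    children: list = field(default_factory=list)

plan = PlanNode("agg", {"keys": ["region"], "aggs": ["sum(amount)"]}, [
    PlanNode("filter", {"predicate": "amount > 100"}, [
        PlanNode("scan", {"table": "sales", "columns": ["region", "amount"]}),
    ]),
])

# The IR crossing the optimizer/scheduler boundary is just this JSON blob.
print(json.dumps(asdict(plan), indent=2))
```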
So at the very top you have some user, and they send a query, assume it's SQL, to the front-end part of the system. That front end has a language parser, a SQL parser, that converts the SQL query into a bunch of tokens in some internal form. Then we send this intermediate representation of the SQL query to a planner. The planner has a bunch of different parts. There's the binder, which is responsible for figuring out what names refer to: there's a token that refers to a table name, so does that table actually exist in my catalog? There's a rewriter: can I rewrite the query into a better canonical form? And then there's an optimizer that can do a cost-based search, using cost models derived from the data itself, to figure out the best plan.

For this part, the planner needs a catalog, right? Because we've got to ask: okay, I have this token, it's table foo. Is this really a table? What columns does it have? What are the data types? Where is the data actually stored? Do I have any statistics about that data that I can feed into my cost models?

Now I have a physical plan that I can actually execute. I hand that off to a scheduler, again represented in some intermediate form. The scheduler looks at it and says: you want to run this plan over this data, so let me go to the catalog, figure out where the data is physically located, and who in my cluster is responsible for executing on it. Then I dispatch it to an execution engine (ignoring how I distribute the query out, and who's responsible for making sure the compute nodes are always running; all of that we can ignore). As I execute my query plan, my operators may need to get data from storage, so I have an IO service that I make requests to, and that IO service is responsible for going out to storage. We're not defining what that storage is: maybe it's S3, maybe some distributed file system, maybe a local disk; it doesn't matter. It fetches the blocks and hands them back up to the execution engine, which computes whatever it needs to compute. And when I produce my final answer, it goes all the way back up the stack to the end user.

There are two other things happening at the same time. Again, the catalog is super important. As the execution engine scans data (someone asked on Slack: should my query planner be responsible for looking at the files and figuring out what's in them? No, because then you'd have duplicate, redundant code for scanning data), the execution engine can just do this, because an ANALYZE command in SQL is basically just a sequential scan. It can then update the catalog with the results. Whether it goes through the scheduler or the coordinator sends it back over doesn't matter; I'm just showing that the execution engine can derive new information, not just the query results but information about the contents of the files, and send it to the catalog. Yes? Sure, yes. But that's... I mean, it's a command, right? Say I have some files and I don't know what's in them: scheduler, go tell someone to do this for me. It's another query, except the query comes from the system rather than from somebody at their desk. Let's cover that after class. Other questions?
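Here's a toy sketch of that ANALYZE flow (the API is hypothetical): the catalog tracks which files back each table, the execution engine computes statistics in one pass while scanning, and it writes them back to the catalog so the optimizer's cost models can use them later.

```python
# Toy catalog plus the "ANALYZE is basically a sequential scan" flow.
class Catalog:
    def __init__(self):
        self.tables = {}   # table -> {"files": [...], "stats": {...}}

    def register(self, table, files):
        self.tables[table] = {"files": files, "stats": {}}

    def update_stats(self, table, stats):
        self.tables[table]["stats"] = stats

def analyze(catalog, table, scan_rows):
    # Compute stats in one pass over the scanned rows.
    values = [row["amount"] for row in scan_rows]
    catalog.update_stats(table, {
        "row_count": len(values),
        "amount_min": min(values),
        "amount_max": max(values),
        "amount_ndv": len(set(values)),   # distinct-value count
    })

catalog = Catalog()
catalog.register("sales", ["s3://bucket/sales/part-0.parquet"])
analyze(catalog, "sales", [{"amount": 10}, {"amount": 20}, {"amount": 10}])
print(catalog.tables["sales"]["stats"])
```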
Okay, so it's the same thing with the IO service: how do we actually find out what files are in there? Again, you could have a command that goes through the front end and tells the catalog, or the IO service sees things and sends them over; it depends on the implementation. But going back to the command I mentioned before: the thing responsible for actually knowing what's inside the blocks I'm getting from disk or from my object store is going to be the execution engine. Otherwise you have a bunch of redundant code: something else spinning up a bunch of threads up here and reading data, which may interfere with what's going on down here.

As for the catalog itself, we'll see how people implement this as we go through the semester. Snowflake uses a whole other database system for it: they use FoundationDB. It's basically a whole separate database. You want it to be transactional, you want it to be fail-safe, you want it to be high-performance. It's a whole other problem in itself. And I want to talk about how we do this first, before we get to the lakehouse stuff with updates on top, because once you build this, you can add that incremental-update part.

Okay. So that's the context of what one of these systems looks like. What actually happens when I execute one of these queries? For this semester, even though the system we're describing is assumed to be distributed (Snowflake's distributed, Redshift's distributed; all the systems I've been showing are scale-out distributed database systems), we want to walk before we run. So most of the papers, almost all the papers we're going to read, are really about single-node execution. Because at a high level, distributed query execution is the same thing you do on a single node. Modern CPUs have a bunch of cores; sometimes you have multiple sockets, and you have to care about NUMA regions, where the memory actually sits relative to each socket. All of that stays the same. When you go distributed, there's a bit of extra work to say: okay, now I need to send data from this node to that node. But that's not much different from sending data from this CPU core to that CPU core. It's obviously potentially slower over the network, but the key concepts we'll describe throughout the semester are the same on a single node as across multiple nodes.

So the query plan is going to be, ideally, a DAG of physical operators. In some systems (DataFusion is one of them) it's actually not a DAG, it's a tree. And we'll see later in the semester that this causes problems with nested queries, or subqueries, because you want to be able to rewrite, or maybe reuse, computation from one part of the query in another part. If it's a tree, you can't do that. So ideally you want plans to be a DAG, and not all systems actually do that. The database system then looks at the query plan and figures out what data each operator needs to access and where its output needs to go next. We figure all of that out ahead of time, so that when you run the query, the system knows exactly how to orchestrate and schedule things and where to send stuff.
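A minimal sketch of that tree-versus-DAG point (hypothetical operator names): in a DAG, one operator's output, say a subquery computed once, can feed two different consumers, whereas in a tree every node has exactly one parent, so the subquery would have to appear, and be evaluated, twice.

```python
# Tree vs DAG: a shared subquery node has two parents in a DAG.
class Op:
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)

subquery = Op("agg_subquery", [Op("scan_orders")])

# DAG: the same subquery node is shared by two consumers.
join = Op("join", [Op("scan_customers"), subquery])
having = Op("filter_on_subquery", [join, subquery])   # second parent!

def count_nodes(root, seen=None):
    seen = set() if seen is None else seen
    if id(root) in seen:
        return 0                       # shared node: counted (computed) once
    seen.add(id(root))
    return 1 + sum(count_nodes(c, seen) for c in root.children)

print(count_nodes(having))   # 5 distinct operators; the subquery runs once
```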
We'll see a little bit how we can sprinkle some adaptivity into this process, where we make changes on the fly to the query plan and how we move data, or maybe scale things up and down based on what we see in the data. That's going to be a big theme throughout the semester, because in this data lake or lakehouse world, the object store world, you may not actually know what's in the files, since you haven't scanned them yet. So your estimations might be wrong: maybe you underestimate or overestimate different parts, and you want your system to adapt a little bit.

Yes? If a query plan isn't a DAG, what else can it be? A tree. Postgres is a tree. A tree is where every node has only one parent; in a DAG, you can do some computation here, for a nested query say, and send it to two different parts of the query plan.

Okay. So again, this is a high-level overview of what actually happens in the execution engine, with the IO service, when we execute a query. These are our worker nodes; they have local CPU, local memory, local disk. When the query starts (think of the leaf nodes of the query plan, like sequential scans and so forth), they retrieve what we're going to call persistent data. These are the underlying tuples in our tables, in our database. Whether this comes through the IO service from local disk or from the object store doesn't matter at this point. Now the worker nodes do some computation for our query plan and produce intermediate results. These intermediate results are artifacts that an operator generates that need to go to the next stage of the query plan. As we'll discuss throughout the semester, we'll assume the unit of work for our worker nodes when executing a query is a pipeline, and we have to stop at a pipeline breaker and then potentially distribute data around as needed.

One way to distribute data around is through shuffle nodes. The idea is that you hash on some partitioning key in the data you're scanning or producing, and distribute the results across these shuffle nodes. Think of this as the pipeline breaker: the shuffle nodes are then responsible for distributing the data to the next stage of the query plan, the next worker nodes. There's really no computation being done here; it's basically in and out, storing things as key-value pairs in memory and sending them out to the worker nodes. I'm saying this is optional because not all OLAP systems do this. BigQuery and Dremel are probably the most famous ones that do, and Google does insane stuff here: they actually have specialized hardware on these nodes to keep everything in memory and run as fast as possible. And it allows them to do a bunch of tricks for scaling things in and out, because now, at this pipeline breaker, you can look and say: I thought the data was going to be this size, but it's actually that size. Do I need more nodes or fewer nodes? Should I change anything in the query plan?
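Here's the shuffle idea as a minimal sketch (hypothetical data and names): each producer hashes the partitioning key of every row to pick a shuffle partition, so all rows with the same key land in the same partition, and a downstream worker can process its partition without coordinating with the others.

```python
# Hash-partitioned shuffle: route each row by hashing its partitioning key.
def shuffle(rows, key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

rows = [{"region": "east", "amount": 10},
        {"region": "west", "amount": 20},
        {"region": "east", "amount": 5}]

# Same key -> same partition, so a downstream worker can aggregate its
# partition without talking to the other workers.
for i, part in enumerate(shuffle(rows, "region", 2)):
    print(i, part)
```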
[Student] So a pipeline breaker is like a place that brings sort of adaptability into the query plan? Yes. It's a stopping point, a point in the query plan where you can say: okay, let me reassess what's coming into me, and do I want to change anything upstream? Google does that here.

Some of this idea comes from the MapReduce world, which had an explicit shuffle phase. You're familiar with things like Hadoop, which I don't recommend using; you don't want to use that anymore. In that world it was all batch-based, meaning you had to accumulate all the intermediate results of the shuffle phase before you could start the next phase. In a modern OLAP system, you can do this in a streaming manner: as the data arrives, you push it up to the next worker nodes, which start executing right away, so you avoid that long pause. The other thing to point out: for the simplicity of showing this in PowerPoint, I have the same number of worker nodes as shuffle nodes. You don't need that; you can scale them up and down independently, because sometimes the amount of intermediate data can be larger than your persistent data. We'll see this later in the semester when we do worst-case optimal joins: the intermediate data balloons to be much, much larger than the persistent data. Even though the final result may be much smaller, the intermediate stuff can get quite large.

Then we do the next phase: these worker nodes produce more results, and we send them to some final node to do the final coalescing or aggregation that produces the result we send back to the user. I'm not showing it here, but the thing above all of this, keeping track of what's going on, which workers are still live, what stage of execution they're in, how much data they're generating, is the scheduler and the coordinator. That's different from the orchestrator in Kubernetes, because Kubernetes just sees whether the pod is still up; it doesn't know what's going on inside of it. We have to build that ourselves in our database to keep track of what each worker is actually doing, because Kubernetes can't see inside your query.

I've already said this, but to make the distinction between persistent data and intermediate data clear: persistent data is the source of record for our database. It could be files in S3, or it could be the system's own proprietary storage. One key thing, though, is that all these modern systems assume you're running on something like S3, and S3 objects are immutable: I can't store an object in S3 and then go back and make in-place updates to particular bytes. If I want to do that, I've got to overwrite the entire thing. That means they're all going to use append-only architectures in how they design their data encodings and storage formats. That's the thing I mentioned with the lakehouse before: it's log-structured, a bunch of changes, which I think they may store as JSON, batched together and stored as Parquet when they get written out to the object store. The intermediate data, on the other hand, consists of short-lived artifacts that we generate as we execute the query. And because they're only useful for the lifetime of the query itself, we don't have to worry about durability and fault tolerance in the same way we do with persistent data.
Now with persistent data, since we're assuming we run on an object store, the store handles all that fault tolerance and resiliency for us: one less thing to worry about as we build a database system. For the intermediate data, we're responsible for maintaining it. Ideally we don't store it on S3, because that's slow and costs money; we try to keep it in local caches, either in memory or on disk. And if a node goes down, we can handle that. We don't have to keep a bunch of copies of intermediate data around, because who cares: when the query is over, we throw it away. There has been some research on reusing intermediate results from one query to the next, but no system that I know of actually supports it, because if you want that, you just define a materialized view; it's basically the same thing.

Yes? The question is about this comment here, which says there's no correlation between the amount of intermediate data a query generates and the size of the persistent data it reads, or its execution time. Isn't there a strict upper bound on the amount of data you can generate from the persistent data you're given? This result comes from a couple of years ago: they looked at the queries that actually executed in Snowflake, and they saw there was no such relationship at all. And to the suggestion that there could be an upper bound, a max limit on the amount of data I could ever generate: well, no, because a query can do anything. I can just keep joining things a billion times and generate a bunch of random stuff. It would cost you money, it would be stupid to do, but you could do it. And the challenge is knowing when this is going to happen, when an operator is going to generate more intermediate results than the data going into it, because you want to use that to decide, okay, which join algorithm should I use: worst-case optimal or a hash join? I keep saying this; we'll cover it later in the semester. Hopefully it's like when I was taking classes back in the day: the professor would say a bunch of stuff at the beginning, and you'd think, what the hell is he talking about? And then later on you learn all that stuff and it clicks. So I'm bringing this up now so that when we hit those lectures, you'll think: oh yeah, that's what he meant. Worst-case optimal joins. Okay, now I know what that is. It's a prelude for what's to come. Hopefully you can see I'm getting excited, because databases are awesome.

Okay. The other thing to consider in our system architecture is the way we transfer data between the different operators and the nodes, and it really comes down to where the persistent data is stored. The high-level distinction: in a shared nothing system, which we'll cover in a second, you primarily push the query to the data; in a shared disk system, by the textbook definition, you'd primarily pull the data to the query. But we'll see that in a modern setting these lines blur very quickly, because for intermediate results you sometimes push the query to the data anyway, and with some object stores you can actually push parts of the query down into the object store itself.
In other cases, you start moving things around. It can get jumbled, so there isn't really a clean divide between the two models, but it's good to understand them in the context of the system we're conceptually building in our minds.

So the push approach: the idea is that the query itself, either the SQL string or the intermediate representation of the query plan, is going to be much smaller than the data. So why transfer a bunch of data over to a node just to execute on it? Instead, send the query, which is much smaller, over to where the data is actually stored, do the processing there, and send back only the intermediate results, which ideally are smaller than the persistent data. This made a lot of sense in the old days, when disks were super slow and networking was super slow, and the network was usually considered much slower than the disk. That's not really true anymore; the hardware has gotten really good. The challenge with this approach is that you may not have computational capabilities where the data is actually stored. Think of S3: if you ignore the select operator, which I'll talk about in a second, the API is just get, put, and delete. You can't say: oh, by the way, also execute this part of the query plan for me. That's usually all the API exposes to you; actually, on Google, that's all they expose. So you can't do any computation there, and you can't push the query to the data.

Instead, you have to pull the data to the query: bring over the data you need, do the processing there, generate intermediate results, and send them to the next stage. And again, the query is much smaller than the data. The largest query I've ever heard of, in the pure SQL string itself, was 10 megabytes, from Google. That's a huge-ass query. But compared to processing terabytes of data, it's not even close; there are orders of magnitude between them.

In the old days, push was considered the primary way to do it, especially in the shared nothing architecture. But in shared disk, the extra features you can get from object stores let you do some of it too. The extra feature I'm talking about: S3 has this select operator, where you can basically send something that looks like a SQL query down to S3 with your get request and say: run this filter on this data. S3 actually knows the contents of your objects, so it's not just a dumb key-value store where you say, hey, give me this object, just give me the byte stream, I don't care what's in it. It knows whether the object is CSV, JSON, or Parquet, and it can process that natively where the data is stored. I don't know how they charge you for this, whether it's just the fetch command, or they charge you for the runtime because it runs as a Lambda function; I actually have no idea. But this allows us to do predicate pushdown in a shared disk architecture, which, according to the textbook, you should not be able to do. Microsoft has their own version; I don't know whether you get full SQL, but you can see that you pass in some kind of query there. And as far as I know, Google doesn't have this. I didn't look this year, but last year they didn't have it either.
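For reference, this is roughly what that looks like through boto3 (the bucket and key are hypothetical, and AWS has been retiring S3 Select for new accounts, so treat this as illustrative of the idea): the filter executes inside S3, and only the matching bytes come back over the network.

```python
# S3 Select sketch: predicate pushdown into the object store itself.
import boto3

s3 = boto3.client("s3")
resp = s3.select_object_content(
    Bucket="my-bucket",
    Key="warehouse/sales/part-0.parquet",
    ExpressionType="SQL",
    Expression="SELECT s.region, s.amount FROM S3Object s WHERE s.amount > 100",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:                # streamed event protocol
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```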
So again, this is what I'm saying: the lines get blurred, because you can do some predicate pushdown, and projection pushdown as well.

All right, again, yes? The question is: do you always want to do predicate pushdown if the option is available? No, because it may be that the block of data you need will be used over and over again by a bunch of queries with different predicates. Then I'd be making multiple requests to S3 for different portions of the same file, when it would have been cheaper to fetch it once, cache it, and do all my filtering locally. But figuring that out is hard, right? And that's why people pay a lot of money for databases: they do all that for you. I actually don't know how many systems do this trick. I think Redshift does, because they obviously built the feature, but I don't know who else does.

All right, so: the shared nothing architecture. This is what we covered in the intro class, the classic textbook definition. The term actually comes from Stonebraker; he coined it in the 1980s. And this was the prevailing architecture for distributed databases, both OLTP and OLAP systems, for 30 years, until the 2010s. Each node has its own local CPU, locally attached disk, and memory, and any time you want to send information to or get data from another node, you go over the network with TCP or UDP requests. We'll call each of these things a single database system node; think of an EC2 instance. The database is partitioned into disjoint subsets across the nodes: you pick a partitioning key and use range partitioning or hash partitioning to divide the data up evenly across the nodes. And since the data is stored locally by the database system, I can just use the POSIX API, syscalls like fopen and fread, to get the files I actually need, because everything is just files on a disk in a file system I control.

Yes? The question is: if I add a new node, do I have to move data immediately, or do I move data as it comes in? This is going to be one of the big problems we face in shared nothing: if we want to increase capacity, I have to add a new node, but when that node gets added it doesn't have any data on it. Can't you use NFS mounts? But then, wait, NFS has to be central storage, right? You can't do a peer-to-peer file system. AFS, same thing: it's central storage. So that's a shared disk architecture. The difference, though, is that with something like AFS or NFS, the physical location and distribution of the data is opaque to the database system, because all it gets is the POSIX API: you just call fopen and fread, and you don't know where, underneath the covers, the data is actually being stored. Unless you explicitly tell NFS to partition things a certain way, so that when you read this range versus that range you know which pieces of data you'll get locally, it's opaque storage that you don't understand. But through NFS? Then you've got to get that information out of NFS, and now you're basically building a database system.
You don't want to run your database on NFS, right? Well, people do that; people run on SANs all the time for distributed file systems. But the database system can do better if it knows exactly where the data is actually stored. Maybe not physically stored, because with S3 that's abstracted away from you as well. How do I say this: the object store versus NFS gives you roughly the same interface, except you have better control. What can you control? Things like whether to geo-replicate or not, which NFS hides from you, and you get more metadata out of object stores. [Student comment] That's the catalog; you're talking about the catalog. Right: some systems use Cassandra as the metadata layer to keep track of where data is actually located. That's the catalog. Again, I mentioned Snowflake uses FoundationDB. I can't think of anybody offhand that uses Cassandra for this in a database system, but yeah, it's the same idea: it's a separate database that keeps track of the metadata. That's not this, yeah.

Okay. So in the shared disk system, we have this separation between compute and storage: we have the compute layer, and we have the storage layer. Yes; I got sidetracked. The question is: in shared nothing, if one of the nodes goes down, do you lose access to that data? Yes, and therefore you have to replicate it. Going back to before: with managed storage, the storage service is in charge of replication, so it handles all that for you. In an S3 object store, Amazon handles that, Google handles that. I don't know how they do it, but for our purposes, that's good enough. In the shared nothing world, you have to manage it yourself. And the other question was: if I add a new node or take a node away, do I have to reshuffle data? Yes, and the database system has to do it. You want to do this in a transactional manner, driven by your catalog, so that you avoid false negatives.

So in the shared disk system, again, we have the compute layer and the storage layer here, with some kind of API between them. In the cloud world, instead of the POSIX API (because you don't want to use a FUSE file system to talk to S3), you use whatever user-space API the cloud provider gives you. Amazon gives you a bunch of libraries to talk to S3. Some database systems go to extremes, throw all that away, and rewrite it themselves; we'll see one in a second. But that's how we interact with these things.

The way to think about this is that the compute nodes are stateless. In my shared nothing system from before, I pointed out that I partitioned my database, and each node (ignoring replication) is responsible for its partition of the data. So if I want to take a node away, I've got to copy whatever is on it and redistribute it to the other nodes. But in a shared disk system, if I want to take away one of these compute nodes, that's fine, because the data is down here. I could, in theory, kill this thing and not lose any data. Or I could turn them all off, stop paying for compute, and all my data is still retained. Whereas in a shared nothing system, I have to keep the CPUs running, because if they go away, the data goes away. Yeah, you can checkpoint to EBS and all that, but yes. [Student] But if you shut off the compute, you'd lose things like intermediate results.
So the statement is, and he's correct: if we shut off the compute, in either shared-nothing or shared-disk, while I'm literally running a query at that moment, then yes, there's ephemeral state for that query that I would lose. We can talk about how to handle fault tolerance for that later on. But typically, ignoring unexpected failures, when you do a shutdown you announce to all the nodes: hey, I'm going to shut down, finish whatever jobs you're doing. And once the last queries or plan fragments leave the queue, then you can shut things down. There's a process for doing that; it's nothing fancy, and it's not hard.

Okay. Again, we'll see this later in the semester, but there will be a catalog service that keeps track of all of this. The data is still going to be partitioned, and it could be a bunch of files on S3, so I keep track in my catalog of which of my compute nodes is responsible for which files. Then if I increase or decrease my capacity, I run some kind of update against my catalog that says: okay, these nodes are now responsible for these other files. And for that, in the case of Snowflake, they'll just use consistent hashing to avoid having to reshuffle everything. All the standard techniques still apply here.

All right, so I'll finish up with the distinction. Again, in shared-nothing it's hard to scale capacity. It's potentially faster, because everything is local, but the engineering and operational benefits you get from something like a shared-disk architecture are why basically every OLAP system built in the last ten years uses it. Even systems that started off shared-nothing, like Yellowbrick, have converted to it, because the benefits are so significant. And Amazon is improving S3 all the time, so as they make those changes, your database rides along the wave and gets all those updates and new features as well. Right? When S3 first came out it didn't have that select command; now it does. That's something they added that you can take advantage of without doing any engineering yourself. And S3 is pretty cheap, actually really cheap compared to EBS and other things. As for the latency, that's okay: we're database people, and we know how to do caching. It's basically a buffer pool manager, right? We can use caching to hide the slow round-trip times of going to something like S3.

All right. So again, this semester we're going to focus on this implementation. And this is not a new idea. I sort of showed that these things came into vogue in the 2010s, but it actually goes back to the 1980s, and admittedly those early shared-disk database systems were terrible. But because of the cloud object stores, because every vendor has their own version, and there are things you can run yourself like Ceph or GlusterFS, plus on-prem object stores you could use, these things are so prevalent that, again, every system is based on this.
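Here's a toy sketch of that buffer-pool idea: keep recently fetched blocks around so repeated reads don't pay the S3 round trip every time. This assumes boto3 and AWS credentials; the bucket name, keys, block size, and plain LRU eviction are all made-up simplifications, and a real system would cache to local disk with a much smarter policy.

```python
from collections import OrderedDict
import boto3

class BlockCache:
    """LRU cache of fixed-size byte ranges fetched from an object store."""

    def __init__(self, bucket: str, capacity: int = 256, block_size: int = 1 << 20):
        self.s3 = boto3.client("s3")
        self.bucket = bucket
        self.capacity = capacity      # max number of cached blocks
        self.block_size = block_size  # 1 MiB blocks
        self.cache = OrderedDict()    # (key, block_no) -> bytes

    def read_block(self, key: str, block_no: int) -> bytes:
        tag = (key, block_no)
        if tag in self.cache:                   # hit: refresh LRU position
            self.cache.move_to_end(tag)
            return self.cache[tag]
        lo = block_no * self.block_size         # miss: ranged GET from S3
        hi = lo + self.block_size - 1
        resp = self.s3.get_object(Bucket=self.bucket, Key=key,
                                  Range=f"bytes={lo}-{hi}")
        data = resp["Body"].read()
        self.cache[tag] = data
        if len(self.cache) > self.capacity:     # evict least recently used
            self.cache.popitem(last=False)
        return data

cache = BlockCache("my-warehouse")
block = cache.read_block("orders/part-00.parquet", 0)  # pays the round trip
block = cache.read_block("orders/part-00.parquet", 0)  # served from the cache
```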
So, just to give you an example of what a non-cloud version of a shared-disk architecture looks like, although it's old: this is Oracle Exadata. They ship you a rack, or a bunch of racks, and it's basically a shared-disk architecture running over InfiniBand or Fibre Channel between the compute nodes and the storage nodes. I think they can do predicate pushdown on the storage side as well. And again, this is all running in the same rack, instead of going to an object store over the network inside Amazon. But it shows that these ideas have been around for a long time.

All right, so let's talk about object stores, or at least the portion we care about. From the database's perspective, it's a disk. And instead of the POSIX API, the libc calls, we're using whatever API the cloud vendor provides us. But we're responsible for what we actually store in it. Whether that's a proprietary format custom to the database system or one of the open-source formats we'll cover next class, it doesn't matter. Most of these systems are going to store it in a PAX format. Think of that as a columnar format, but the table is divided up into row groups, or blocks: big blocks of data. Within a block, all the data for the tuples in that block is located together, but it's stored column by column. That's different from the really early column-store systems: in Vertica, for example, I think they stored each entire column as a separate contiguous file. In the PAX world, you combine things so that tuples are spatially close to each other within the file, even though they're stored in a columnar layout.

And we'll cover this next class, but there will be some kind of header or footer in these files. It contains the offsets to get to the different row groups and tuples you need, since with compression nothing sits at a fixed position, plus how things are compressed, and any additional sketches, indexes, or metadata describing what the data is, which can be fed into the catalog so the planner can use it. Again, all of this we'll cover next class. Basically what happens is you retrieve the header, or in the case of Parquet and ORC it's always a footer, because these files are written out once, front to back: you stream out the data, and only at the end do you know everything you just stored, so you write that as the footer. So you ask the object store for just the footer of the file, and from that you figure out what's actually inside it. Right? And like I said, they all have their own version of put, get, and delete.
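To see what lives in that footer, here's a small example using pyarrow, one real Parquet implementation. The file name is made up and this reads a local file, but conceptually a database does the same thing with a ranged GET against the object store; note also that the per-column statistics can be absent if the writer didn't emit them.

```python
import pyarrow.parquet as pq

# Opening the file reads only the footer metadata, not the data pages.
f = pq.ParquetFile("part-00.parquet")
md = f.metadata

print(md.num_rows, md.num_row_groups)

rg = md.row_group(0)        # first row group (a PAX-style block)
col = rg.column(0)          # first column chunk within that group
print(col.path_in_schema)   # which column this chunk belongs to
print(col.compression)      # e.g. SNAPPY or ZSTD
print(col.file_offset)      # where this chunk's pages live in the file
if col.statistics is not None:
    # Zone-map-style min/max stats that can be used to skip row groups.
    print(col.statistics.min, col.statistics.max)
```

This is exactly the kind of metadata that gets fed into the catalog: read one small footer, and the planner can decide which row groups and column chunks are even worth fetching.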
So this is a system we're going to cover later in the semester, but I bring it up now because what they did is wild. Yellowbrick was originally an on-prem, shared-nothing database system. It was an appliance: you would buy custom hardware that they put together, tuned for the database system, and you would run it on-prem. They then converted to a cloud-based database system, like Snowflake or Redshift and the others. But they found that when they moved to running in the public cloud, the object store was just so much slower than what they were used to in their on-prem, appliance version. So they ended up rewriting a lot of the things that Amazon, or the operating system, normally provides for you, and everything is custom. For example, they threw away the Amazon libraries and wrote their own libraries to call S3 using Intel DPDK, doing kernel bypass, which we'll cover later in the semester: basically, do fast lookups to S3, get the data you need, and instead of making a copy in the kernel, pass it immediately up to user space. DPDK is a nightmare; we'll cover why later in the semester. Or, instead of running over TCP/IP, they wrote their own network protocol over UDP, because it was just so much faster for them. So they rewrote a ton of stuff. They wrote their own PCIe drivers. Who does that? Database people, right? It's awesome. So even though I said S3 is slow relative to local disk, there are ways to make it faster. And again, caching is also going to help us hide the long latencies, by relying on the locally attached disks on the compute nodes.

Okay, so to finish up: today I was, again, vomiting a bunch of database stuff at you as a preview of where we're going this semester. We're basically going to start from the bottom layer. We're not going to talk about how S3 itself works, because that part we don't really care about, but we are going to talk about what the data we put in S3 is actually going to look like, and then we'll build the layers on top of that to be able to execute queries. So, the opposite direction of what I showed at the beginning: instead of going top-down, we're going bottom-up. And the idea is really about state-of-the-art implementations, state-of-the-art methods, techniques, and algorithms. The papers I picked for you are designed to expose you to a certain way people approach these problems, and then we'll also cover other papers that are related to the area or cover other techniques. Okay?

So next class, the paper you have to read is actually something I wrote with my former student. It's basically a survey of the internals of Parquet and ORC. It talks a little bit about GPUs at the end, but we're not going to cover GPUs or FPGAs this semester; that's a whole other line of work, and we're only going to focus on executing queries on CPUs for now. Okay? So next class will be Parquet and ORC, the most widely used open-source file formats. And then the following class, next Monday, will be the new variations, the new implementations of file formats that are supposedly better than Parquet and ORC: sort of the next-generation file formats that are coming out. Okay? Any other questions?