Carnegie Mellon University's Advanced Database Systems course is filmed in front of a live studio audience. Let's jump into the discussion today on DuckDB. Again, as I said last time, this is a lot different than all the other systems we've been talking about, at least in the last week or so, because all of those are these giant distributed OLAP warehouses running in the cloud. And now I had you read a paper about DuckDB, which is meant to run on a single node. But we'll talk about MotherDuck at the end, how they don't necessarily go distributed, meaning fanning out, scaling out the queries themselves, but at least are now able to leverage cloud compute infrastructure for query execution. But we'll see at the end. All right, so last class we were talking about the Snowflake data warehouse. And as I said, this was, along with Dremel, one of the first classical cloud-native OLAP engines that did all the various things we talked about throughout the semester, like precompiled primitives, push-based execution, separating compute and storage, all those nice things. And so this showed up, actually, I think yesterday, which I think is worth just looking at, because somebody asked me, like, okay, well, all these systems at a high level look the same, how do you pick them? So the data engineering subreddit is actually really good, because there are people actually using these systems and talking about, like, the pros and cons of them. So I highly recommend it. And so somebody's asking, hey, how do I pick between Snowflake, Databricks, BigQuery, and Redshift? We haven't talked about Redshift yet, that'll be next week. And this dude basically says, like, hey, don't worry about the nitty-gritty details. The way to really think about whether you want one system or another is, first of all, what cloud are you already running on? If you're already running on GCP or Google, then just use BigQuery.
If you're already running on AWS, you could probably just use Redshift or otherwise. And then if you're already using Spark, use Databricks. And then if you have a lot of money and you're looking for a good time, then that's Snowflake. And I would say that's one of the differentiators I think that Snowflake has done really well that separates them from the other systems: the user experience is much, much better and cleaner than these other cloud systems. So even though, again, the core architecture might still be the same underneath the covers at a high level, based on the things that we talked about, as I said before, the user experience is going to matter a lot. And also how good the query optimizer is. Is there, like, no tuning in Snowflake? Is that why? The question is, is there no tuning in Snowflake? Yeah, they don't really expose any knobs. You can't even do query plan hints. Just one knob, right? I think there's one knob: you can jack up the compute size. You can turn off auto-scaling. I think there are three things. Now, that doesn't mean in the actual implementation itself there aren't a bunch of other knobs. And they've told me this. They're like, yeah, we have hundreds of knobs on the inside. They just don't expose them to the users. So then typically what happens is, they said, if somebody calls a sales engineer and says, my query is running slow, the sales engineer or customer service can get in touch with a database system engineer. And then they'll recommend, like, hey, tune these three or four knobs for this one particular customer, right? But it's very ad hoc. At least this is what they told me a few years ago. So underneath the covers there are all the tuning knobs that you'd expect, but to make your life easier as the user, they don't expose them to you. So how are they actually tuning it themselves? It must be really hard to do that, right? The question is, how are they actually tuning it themselves?
What do you mean, like, how are they tuning it themselves? I mean, they have tons of knobs, right? Yes. And, like, getting those plan hints. So their design philosophy would be that you want things to be adaptive. Again, it's hard to quantify how much is actually adaptive. But you can think about it like this: instead of saying, okay, here's some value for a knob that I need to know how to set correctly for the workload, and, all right, well, it's a pain to find, somebody else will set it for me, there are ways to try to make things adaptive. It's more engineering, certainly, but it makes you more robust. Yes. Why is Redshift not a positive experience? The question is, why is Redshift not a positive experience? We'll come to that next week. Again, I think there's nothing particularly wrong with the architecture itself, because they fixed a lot of it. Again, it's the user experience, and that's hard to quantify, right? It could be like, okay, my query went slow. Am I given the right tools to figure out why it ran slow? Or it's how stable the system is in terms of performance. I run the query today, I run it tomorrow, and it's an order of magnitude difference in performance. Is that what people are seeing? Again, this is purely anecdotal, right? I thought it was an interesting quip, and like I said, the data engineering subreddit is where you see a lot of people at the bleeding edge talking about how they're using these systems and also the ecosystems around them, like Airflow, dbt, and things like that, and how they integrate with these cloud vendors. Okay? All right, so for today's discussion, we need to go back in time, just a little bit, to talk about what led to the creation of DuckDB. And you actually already read one of the papers that was a precursor or a catalyst for what set off the development of DuckDB.
And that was the network protocol paper, like, don't hold my data hostage, right? And that paper came out of this project they were building at CWI to make an embedded version of MonetDB called MonetDB Lite. Basically, MonetDB, again, was one of these early column store systems in academia, at CWI, that was then open sourced and people were actually using. But it was like Postgres or any other sort of client-server system. You had to prop it up, configure it, set it up, and you would connect to it through JDBC over the network. And so what they were trying to do was make it faster or easier for data scientists to use a database system inside of, like, Python Pandas or R — I think this was explicitly for R, what they were doing. They tried to make an embedded version that ran in process. We haven't really talked about what an embedded database system looks like, but basically there's no main function, like you would have starting Postgres or MySQL. It only runs whenever the hosting app process invokes calls down into it. Now, it can spin up its own threads in the background and do other stuff, but it's not something you would just start as a daemon on your server and run all the time. So they were trying to make this run directly inside of the R runtime or the Pandas runtime, and they were trying to reduce the cost of going back and forth between the R-language infrastructure and the database system. Because very often how data scientists use databases is they just grab a giant CSV or Parquet file, whatever, from the server itself, bring that down to their local laptop, and do all the crunching on that. And so they're not leveraging the database system to do all the fast calculations and stuff that, you know, we know how to do really well. They're basically rolling all that crap themselves inside of their user code. And Pandas isn't particularly known to be very swift or fast. So that was the goal.
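To make the in-process idea concrete, here's a minimal sketch using Python's standard-library sqlite3 module — itself an embedded database, and the same pattern DuckDB's client API follows. The table and values here are made up for illustration; the point is that the engine is just a library call inside the host process.

```python
import sqlite3

# No server to start: the "database" is just a library linked into our process.
# ":memory:" keeps everything in the host process's address space.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])

# The query runs only when the hosting application calls into the library;
# there is no daemon sitting around waiting for network connections.
rows = conn.execute("SELECT id, val FROM t ORDER BY id").fetchall()
print(rows)  # [(1, 'a'), (2, 'b')]
conn.close()
```

The same shape — connect, execute, fetch, all in one process — is what MonetDB Lite was after, and what DuckDB was later designed around from scratch.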
They were trying to have an embedded version, MonetDB Lite, that got all the advantages of the column store and would expose that to the data scientists. But the problem with MonetDB is that at this point it was 15, 20 years old and it just had too much legacy infrastructure and legacy code; it was too much to rip things out and strip it down to get it to be a more simplistic package that, again, you could then embed. So that's what led the CWI researchers to develop DuckDB. From the efforts they put into MonetDB Lite, they learned a bunch of lessons and said, okay, we should build a system from scratch, specifically designed from the beginning to be embedded inside of other applications. So you could call this an embedded database system. Sometimes it's called in-process. You could technically call it serverless, right, because, again, it's not a daemon that's always running. But the idea is that they're trying to provide a fast SQL execution engine on any possible data file you could find that you possibly want to query. Again, this is sort of the same idea as in Dremel. They want to be able to take any data file that someone may have in their object store and run queries real fast on it. So the SQL dialect that they support is based on Postgres. They did what we did in our early systems: we took the Postgres grammar file and just embedded that inside of our system. But then over the years — and this is why there isn't a single SQL standard — they've added some nice quality-of-life enhancements that are specific to DuckDB. Like, you can just type FROM and then the table name, and that does the same thing as SELECT * on that table, things like that. And the pitch, which I think is fantastic from a marketing standpoint to really understand what DuckDB is trying to be, is that it's trying to be the SQLite for analytics.
SQLite is the most widely deployed embedded database system. It's on all your phones and your laptops right here, right now. It's running in satellites in space. It's in every plane. And it's designed to do transactions. And so they wanted to have the same ubiquity as SQLite, the proliferation of it being used by everyone, but specifically for doing fast analytics. Now, Jignesh has a paper with his students, I think it came out two or three years ago, where they added some enhancements to SQLite to improve its support for analytics. But I don't think it comes close to what DuckDB can do. So another important design decision, in terms of the implementation, is that it's all going to be custom C++ code that they've written for DuckDB. So that means they're going to try to avoid bringing in third-party dependencies unless necessary. I think for encryption, SSL stuff — that stuff you want to bring in, you don't want to write that yourself. But for a bunch of other stuff, they're not going to rely on third-party packages or libraries. They're going to write everything themselves. And this is going to make the system more lightweight and easier to manage from the engineering side of things. And you have the additional capability that, because it's all C++ code that they wrote and not some third-party library, they can actually compile it to Wasm and it can run in your browser very easily. There have been attempts to get SQLite and Postgres to run on Wasm, and it always seems like a huge hack to get that to work. Whereas with DuckDB, it's quite simple. And so the way they're going to be able to expand what the system can actually do beyond the core runtime engine is to rely heavily on an extension ecosystem. We'll talk about that in a second.
So they'll have the official ones that they support, but you can also download arbitrary ones and install them. And that, again, makes the core engine lightweight and easier for them to manage. So again, like we talked a little bit about with Photon, the design philosophy there was that they have to maintain all the Spark Java crap and they want to avoid Java, so Photon was this native engine that they invoked down through JNI. Yes. By serverless, what exactly do we mean? Are we talking like a Lambda function or something? Yeah, so serverless would be like a Lambda function, meaning the process goes away if nothing is actually using it. But I think embedded is the better term. Embedded is sort of the same idea: there has to be some other process already running that then links in the database system. I don't think you hand it threads like you do in SQLite; I think it can spin up its own threads, but it's all within the same address space as the host process. It's like a library, yes? Yeah, it's a library, yes. All right, so here's, at a high level, the major features that are in DuckDB. And again, the number one difference at the top is that it's shared everything versus all the shared-disk or disaggregated storage stuff we talked about before. Because, again, it's an embedded database system. It has no notion of separating compute and storage. I mean, you'll see they can use extensions to talk to the cloud platforms and so forth, but at its core it is sort of just the query engine. That's not entirely true, because they do transactions and they have their own file format. But again, it's not the scaled-out architecture that we saw in Dremel and others. They do push-based vectorized query processing. We'll spend most of our time talking about this, because they actually started out doing pull-based and then they switched over, two or three years ago, to push-based.
And they talk about all of the challenges they faced scaling out a pull-based model and why switching to a push-based one was better. Again, given the provenance of coming from CWI, where Vectorwise was invented, they're going to use precompiled primitives — not a surprise. They have an MVCC scheme that's actually based on what the Germans do in HyPer. We're not going to cover this paper, but actually a lot of these things are based on what the Germans did. In some ways you could say they literally just took the papers and re-implemented the stuff from all the papers we talked about. For example, they're doing morsels. They do the unnesting of arbitrary subqueries. They're the only database other than HyPer and Umbra that can do this. Yes? So why would you want morsel parallelism for something that's intended to be embedded, maybe, like, on a pizza box or something? So the question is, why would you want morsel parallelism if you intend to be running on a pizza box? But a pizza box nowadays has a shitload of cores, right? When I say a pizza box, I mean, like, a one-rack-unit server; you can put multiple sockets in those things. Or even the AMD Ryzen, the latest one, has a ton of cores. I forget the exact number now. You can get a ton of cores. So why wouldn't you want to use them? Actually, I don't know how they pick the morsel size there. The paper doesn't say. But I imagine it's not hard to figure that out, right? Like, SQLite, again — I would say SQLite really was designed to run on, like, a one-core, very primitive CPU from the mid-2000s. There's an even more stripped-down database system called eXtremeDB. That's running on SoCs in, like, missiles and shit like that. You don't even have an operating system. So that system is even more low-level. SQLite's a little bit above that. And in that case, SQLite can't do parallel query execution. Whatever thread makes the SQL request, that's the one that runs it. I'm fairly certain.
Whereas in DuckDB, one thread makes the request, but then, if you tell it how many threads it's allowed to spread the query across, it uses morsels to do that scheduling. Right. So the PAX stuff, we've covered that many, many times. They have sort-merge joins and hash joins. And then their query optimizer looks very similar to what we talked about before. Again, you're running on arbitrary files that you may not have any statistics for, so they're using a bunch of rules to figure out, you know, some basic join-order heuristics and things like that. But the one thing they do well is unnesting arbitrary subqueries. They didn't support unnesting arbitrary subqueries for lateral joins, but, as I said, last year in 721 the students actually did that — like Sam Arch, the PhD student here. They got that merged into DuckDB. So they can handle all possible subqueries now. All right. So we're going to spend some time talking about how they pass data between the operators, which is something we haven't really discussed. So as I mentioned, the original version of DuckDB, prior to 2021, was using a pull-based vectorized execution model with, again, precompiled primitives. But then over the years, they found that it was turning out to be cumbersome to maintain and work on, because every single time you wanted to add a new operator and make it parallel, you had to modify this control-plane piece to say, okay, here's how this parallel thing now runs. It was more infrastructure that they had to maintain every single time they added something new. The other challenge they had is that they want to be able to support reading data from not just local disk but remote file systems. They can support HTTPS or S3 and so forth.
Now you have this challenge where some pipeline may be blocked because it's getting data over the network, but there are other pipelines you could start running because they don't need to wait for that data. But in a pull-based model, like the Volcano approach where you're calling get-next, get-next, get-next down the query plan, the call stack is essentially the state of where things are being executed. So if you reach the bottom, a leaf node, and that leaf node wants to go get remote data, you have no way to pause it, unwind it, go back up the stack, and then maybe call down another pipeline. Basically, the control flow of the execution of the query plan is implicit in that call stack. But if you switch to a push-based model, then now you have this centralized scheduler using morsels that can say, these are the pipelines or tasks that I can run right now, go ahead and run them; here are the ones where I know I'm waiting on IO for whatever reason, and we pause them and wait. So for them, not just in terms of performance but in terms of the engineering of not assuming the data is always readily available, switching to a push-based model turned out to be a much better approach. And the great thing about it is you can read the actual PR where Mark, one of the co-creators, added the push-based model, and it talks about all the great things that they were able to achieve with it. All right? What's wrong? What's that? The PR is huge. Yeah. I mean, because they ripped out all the pull-based stuff and switched to the push-based stuff. I don't know. Anyway. There are probably a bunch of PRs. Or maybe they just did it all at once. They probably did it all at once. Mark's amazing. I think at this point, 2021, he was still a post-doc. The Dutch model is kind of weird for a PhD student. CWI is not a university. It's like a research consortium affiliated with a bunch of other universities.
So technically, all the people there — like Peter Boncz, Hannes; well, Martin Kersten, he passed away — had affiliations with other universities, but they didn't really teach classes; they just wrote code at CWI. Again, this is one of the pros and cons of the American academic model versus the European one. In the case of the Germans, they get six free PhD students a year, right? Yeah. Each PhD student here costs me $125,000 a year. So they're getting a ton of money, and their top students can write a ton of code for them. Whereas it's hard for us to scale up at the same sort of level. And CWI is a similar sort of thing. Like, Mark just basically wrote code full-time on DuckDB. So it doesn't surprise me that he did this entire PR. It also means he's very smart. I'm not surprised that he did it; I'm just saying he had the opportunity to do this, even though he was a PhD student. Okay. So because they switched to a push-based model, they talk about how this opens up a bunch of opportunities to do additional optimizations that would be very difficult to do in a pull-based model. Some of these I think we've talked about, but not all of them. So the first one is that now you have explicit control of when you can pause a pipeline, right? Since all the tasks are in this centralized scheduler queue, if you will, or list, you can say, okay, well, this thing is going too fast or too slow, let me just prevent any more tasks from that pipeline from executing. So you can do things like — if my scan is running too fast, actually, that's backpressure. But you can do things like: if my scan is not producing a lot of data as it goes up, rather than having the filter send up a bunch of half-empty or mostly empty vectors, you basically pause things, have it buffer the output between the filter and the aggregate, and then when that fills up, you say, okay, now you can start executing again. Right?
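The pull-versus-push control-flow difference can be sketched in a few lines of Python. This is a toy illustration, not DuckDB's actual code: in the pull model, control flow lives in the recursive get-next call chain, so a blocked leaf stalls everything above it; in the push model, a central scheduler owns a queue of tasks and can decide what runs next.

```python
from collections import deque

# Pull-based: control flow lives in the call stack. If scan() had to wait on
# remote IO, the whole recursive iterator chain above it is stuck with it.
def scan(rows):
    yield from rows

def filter_even(child):
    for row in child:
        if row % 2 == 0:
            yield row

pulled = list(filter_even(scan([1, 2, 3, 4])))

# Push-based: a central scheduler owns a queue of runnable tasks. A task
# waiting on IO could simply be parked while other pipelines keep running.
results = []
runnable = deque()
runnable.append(("scan", [1, 2, 3, 4]))
while runnable:
    kind, payload = runnable.popleft()
    if kind == "scan":
        runnable.append(("filter", payload))  # push output to the next operator
    elif kind == "filter":
        results.append([r for r in payload if r % 2 == 0])

print(pulled, results)  # [2, 4] [[2, 4]]
```

Both produce the same answer; the difference is who holds the execution state — the call stack, or an explicit task list the scheduler can reorder and pause.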
We talked a little bit about relaxed operator fusion. We talked about vectorization and query compilation — this is a paper that we wrote where, in order to maximize the vectorization between operators in a push-based model, you have a little buffer in between. So they can introduce that, but not just fill it up and then pass it along. They can say, okay, well, this thing's not full yet, you're allowed to keep executing; or it is full, let me go ahead and pause you. They can also do scan sharing, because all the query plans are DAGs. So you could have your scan start producing results, filling out some buffers that you can then reuse for maybe this parent operator and this other parent operator. Again, in a pull-based model, if you're calling get-next, every child has to have one parent. So how do you pass along that information? It would have to be this weird sideways information passing. So in this case, they just fill the buffer up and say, okay, the buffer is full, this parent task can run; and before I throw away the intermediate result from this child, go and invoke the other one. And again, through the centralized coordinator, you can turn these things on and off as needed. And then the last one is, if you recognize that the operators at the top of the query plan can't consume or process the data you're passing up as quickly as you're producing it, you can just pause things. So you can introduce a buffer here that says, okay, when this thing gets more than 10 megabytes — even though I have more memory and could keep going — rather than letting it balloon indefinitely, I can just pause the whole pipeline and not let any more tasks execute for it.
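The backpressure idea above can be sketched with a bounded buffer between a fast producer (the scan) and a slow consumer. This is purely illustrative — the buffer cap, the single-item drain, and the counter are all made up for the example — but it shows the mechanic: once the buffer hits its cap, the scheduler stops running the producer task until the consumer drains some of it.

```python
# Illustrative sketch: a bounded buffer between a fast producer (scan) and a
# slower consumer, with the producer paused whenever the cap is reached.
BUFFER_CAP = 4

buffer, consumed = [], []
paused_events = 0
source = iter(range(10))  # the "scan" producing ten values
producer_done = False

while not producer_done or buffer:
    # Producer runs only while there's room in the buffer (backpressure).
    if not producer_done and len(buffer) < BUFFER_CAP:
        try:
            buffer.append(next(source))
        except StopIteration:
            producer_done = True
    else:
        paused_events += 1               # scheduler parks the producer task here
        consumed.append(buffer.pop(0))   # consumer drains one item

print(consumed)            # all ten values arrive, in order
print(paused_events > 0)   # True: the producer was paused along the way
```

Nothing is lost and ordering is preserved; the buffer just trades a little latency for a hard cap on how much intermediate state can pile up between the two operators.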
The other one that's also super useful is, because again you're reading remote data over HTTP, instead of having this entire task just pause, with the thread blocked while you're fetching it, you do the IO in the background, fill up some buffer, and then when the data is available, you kick off the execution again. Again, think about how you would do that with get-next. I call get-next down here, it makes a remote call over HTTP to go get some data, and then I need a way to go back up and say, okay, I don't have the data you're looking for, call me back again when it's actually available. You'd have to make another get-next call. It gets really awkward and weird. Whereas with the push-based model, because the control flow and the data flow are separate, this is all much easier to do. All right, the next thing that's interesting is what their intermediate result vectors look like. So there's the data, obviously, on disk. We'll talk about that in a second. That's going to be more heavily compressed, because you want to reduce the size of the data itself. But once everything's in memory, when you're going from one operator to the next, they want to use a lightweight encoding, similar to what we talked about before, to pass data along. And they basically have four vector types, three of which are specialized to different kinds of data. Without any compression, they call the vector uncompressed or flat, and it's just the listing, in columnar order, of the values. But if they recognize that the vector you're passing along only has one value, then rather than passing on that repeated value over and over again, they can have what's called a constant vector: just a single value that says this entire vector has 1,000 tuples and they all have one single value. They have a dictionary vector, like we talked about before.
The selection vector just says what offset in the dictionary corresponds to the data you're looking for. And they have what they call a sequence vector. This is basically a variation, a special case, of delta encoding where you have the starting value as the base, and then every value that comes after is incremented by some amount. Sometimes you have auto-increment keys as the primary key for a column, or a sequence like 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. If you recognize that, you just need to store two values: here's the base and here's how it's being incremented. So again, they'll figure out on the fly, while you're actually processing the data between one operator and the next, which of these versions they can use. And the default is to always fall back to flat, because it's just the simplest one. Their in-memory layout they actually designed in conjunction with Velox, so they're compatible with the vectors that Velox passes around. And again, at a high level it smells like Arrow, but my understanding is the memory layout isn't 100% compatible. So now the challenge, though, is that because these are precompiled primitives, if you have all these different variations of these different vector types for all the possible combinations of data types, you have a combinatorial explosion in the number of possible primitives you would need. And even if you templated everything, this is going to balloon up your code base when it gets compiled, which makes compilation slower but also expands the size of the code in memory. It makes the process heavier.
So what they want to do for three of these vector types is get them into what they call a unified format, where you have a single primitive that knows how to do whatever processing you need on that data without having to do any conversion or memory copying. For the flat one it's super easy, because it's exactly the same; they just add a selection vector that says, here's the data, and here's the offset that corresponds to the tuple it matches. And that looks a lot like dictionary encoding. So this is actually how they'll represent things in memory and pass things along. And then they don't have to decode it if they're matching it up against another dictionary one. For constant, same thing — sorry, that's backwards; yeah, that's constant. The constant one, again, is just: I have a single value as my data, as if it were a dictionary, and then here's the selection vector, which is all zeros because they're all pointing to the same one. And then the dictionary one is basically the same thing. So even though they have different compression schemes for these three different vector types, it all looks the same now, and you have one primitive that operates exactly on this data. And you don't have to do any extra memory copying. And you get the benefit that, if it's constant, you're passing along less data than before. So this is what they call the unified vector format. And again, this is for the intermediate results that go from one operator to the next. What about the sequence one? The sequence one, I think they always have to unroll to the flat one. That might be a year out of date; I haven't looked.
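The unified format can be sketched in Python. This is a hypothetical illustration of the idea (the function names and data are mine, not DuckDB's API): flat, constant, and dictionary vectors all normalize to a (data, selection) pair, sequence vectors get unrolled to flat first, and then a single primitive works on any of them without copying values.

```python
# Hypothetical sketch of the unified vector format: everything becomes a
# (data, selection) pair so one precompiled primitive covers all vector types.

def unify_flat(values):
    # data is the values themselves; selection is just the identity mapping
    return values, list(range(len(values)))

def unify_constant(value, count):
    # one value in data; every selection entry points at slot 0
    return [value], [0] * count

def unify_dictionary(dictionary, codes):
    # data is the dictionary; selection is the per-tuple code
    return dictionary, codes

def unroll_sequence(base, increment, count):
    # sequence vectors get unrolled to flat before unification
    return unify_flat([base + i * increment for i in range(count)])

def add_one(data, sel):
    # a single primitive, oblivious to which vector type produced the pair
    return [data[i] + 1 for i in sel]

print(add_one(*unify_flat([10, 20, 30])))                 # [11, 21, 31]
print(add_one(*unify_constant(7, 3)))                     # [8, 8, 8]
print(add_one(*unify_dictionary([100, 200], [1, 0, 1])))  # [201, 101, 201]
print(add_one(*unroll_sequence(1, 1, 4)))                 # [2, 3, 4, 5]
```

Note the constant case: the primitive touches one data value no matter how many tuples the vector logically holds, which is exactly the "passing along less data" win mentioned above.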
So the other interesting thing with DuckDB, in a way I haven't really talked about before, is how it can work with the non-SQL ecosystem that data scientists are coming from. So, you know, if you think of Python Pandas, people are operating on data frames, and data frames provide this API to do the manipulations. But at a high level, it's not relational algebra. And so what they want to be able to do is let people who link in, or instantiate, DuckDB within their Python or R program write to some common API based on data frames, and then have that get translated into the corresponding SQL command that can retrieve the data they're looking for. And so they support two different libraries, one for R and one for Python. The R one is called dplyr, and the Python one is called Ibis. Ibis was developed by the guy that invented Apache Arrow. I forget where dplyr came from. Has anybody ever heard of these before? No? Let me show you what it looks like. But basically, it looks like, if you've ever used Spark — PySpark, something like that — it's going to have APIs that manipulate data frames like this, right? You fetch some things at the head, and so on. So again, it's a procedural language to manipulate data frames, which are just basically relations or tables in the database. Sorry. What's wrong? Yes. The question is, with data frames, what came first? So, data frames came from Pandas, I think. And the guy that invented Pandas also invented Arrow. Yes, that's the connection there. But the idea is, through the Pandas API, or dplyr, or Ibis, you can manipulate what you think are data frames in memory that could, though, be in the Apache Arrow format. Because the whole point is, again, going back to that paper you guys read: don't hold my data hostage.
Like, they make a big deal of: okay, well, if you just use JDBC to run some query on a database, you're going to get back a bunch of rows. And then if you're using Pandas or some Python program, that's got to do a pivot, convert it now to a column store. So what you really want to do is hand things off as Arrow, and then your Python code or R code can operate directly on those Arrow buffers, even though you're still using the data frame API. With these integrations — I mean, the dplyr one had it first — you see, they have these basic primitives, and then they have different back ends, and here's the one for DuckDB, right? A drop-in replacement for dplyr that uses the same API. But what's going to happen is, when I write code against this thing, it's actually not going to generate SQL. Let's look at something. Right, yeah, so here, that's the data frame. Here's the query. So here's some query here. It more or less looks like some, you know, bastardized version of SQL, where instead of WHERE clauses, it's called filter, right? Instead of AVG, it's called mean. But at a high level, this basically looks like SQL. And so what these integrations do with DuckDB is that, instead of converting these commands into SQL, the library will convert this into a logical plan in the internal representation that DuckDB has for queries, as if it got parsed from SQL from the command line or whatever, and then it hands that off to the optimizer, which can then convert that to the physical plan. Yes? The question is, do people hate writing SQL that much? You're asking the wrong dude. So, I mean, no — it's similar to what we saw with UDFs, right? There are certain things that are hard to express in SQL. But also, there are a lot of data scientists that prefer to use Pandas and Python APIs and Python notebooks.
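The translation idea can be shown with a toy. This is not how Ibis or dplyr actually work internally — they build the engine's logical plan directly, and the class and method names here are invented — but it illustrates the lazy pattern: a chained data-frame-style API accumulates a description of the query, and only at the end is it rendered into something an engine can execute, shown here as SQL text for readability.

```python
# Toy illustration (not Ibis/dplyr internals): a data-frame-style chain lazily
# builds a query description, rendered into SQL only at the end.

class Table:
    def __init__(self, name, filters=None, agg=None):
        self.name, self.filters, self.agg = name, filters or [], agg

    def filter(self, predicate):
        # each call returns a new immutable node, like a lazy plan builder
        return Table(self.name, self.filters + [predicate], self.agg)

    def mean(self, column):
        return Table(self.name, self.filters, f"AVG({column})")

    def to_sql(self):
        select = self.agg or "*"
        sql = f"SELECT {select} FROM {self.name}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql

query = Table("trips").filter("distance > 5").mean("fare").to_sql()
print(query)  # SELECT AVG(fare) FROM trips WHERE distance > 5
```

So `filter` plays the role of WHERE and `mean` the role of AVG, exactly the correspondence described above; the real integrations just skip the SQL string and hand DuckDB the logical plan directly.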
And certainly, for some things, it doesn't make sense to run that in the database. Like, if you're going to call PyTorch stuff, it may not make sense to have that run in the database system. You could run that locally or farm it out to, like, Spark. And actually, that's one of the advantages of, like, Databricks, although Snowflake has Snowpark — I forget what the Google one is. Like, this single environment where you can run your SQL query on the OLAP engine, plus also run scale-out machine learning jobs, all together in one interface. That's very common now. But it's also very common, if you're running a one-off experiment, to just download the whole file locally, run it using Python, and then maybe upload the results or hand them off to somebody else. So the idea here is — the pandas runtime is actually very slow. So now, if you want to manipulate data frames, and you put your data in, like, Parquet files, then let DuckDB do the crunching of those Parquet files instead of pandas itself, and you get all the advantages of a modern OLAP system that we've talked about the entire semester, directly in your Python notebook. So that's the idea here, right? And again, zero-copy is the big idea. This is what was the original foundation or motivation of Apache Arrow: that you could pass data around between disparate applications running in the same address space without having to do serialization or deserialization. But now, as I said, it's basically being used as the transport protocol or format between different nodes running in a system, right? But in this case here, with DuckDB being in-process, you can get data in and out of DuckDB, back and forth between DuckDB and Python or R, through these different APIs, all through Apache Arrow. Because, again, same address space.
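The zero-copy handoff can be sketched in plain Python with `memoryview` — a stand-in for real Arrow buffers, which add a columnar layout and metadata on top. This is my own illustration of the idea, not Arrow code:

```python
# Two consumers in the same address space share one buffer through
# views, with no serialize/deserialize step in between.
from array import array

column = array("q", [10, 20, 30, 40])   # one "column" of int64 values

# "Handing off" the data is just handing off a view of the same memory.
producer_view = memoryview(column)
consumer_view = memoryview(column)      # e.g., the Python side

assert consumer_view[2] == 30           # read without any copy
column[2] = 99                          # writer updates in place...
assert consumer_view[2] == 99           # ...reader sees it immediately

# Contrast: a JDBC-style handoff would serialize rows, copy them, and
# deserialize on the other side -- bytes(column) is that extra copy.
copied = bytes(column)                  # materializes a second buffer
```

The ownership question the lecture raises — who eventually frees the buffer — is exactly what this sketch punts on, and why real systems pair shared buffers with reference counting.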
If I malloc in DuckDB and hand you some buffer — you know, how you actually keep track of who has to free it, that's a separate thing. But I don't have to do a memcpy to be able to manipulate it in Python. DuckDB also supports the execution of Substrait plans as well, but I think that's done through an extension. So I think you can take a serialized Substrait plan and hand it to DuckDB, just as if it was a logical plan generated by one of these libraries, and then it'll convert it to its physical plan that it can execute. So this goes back to what we were talking about at the very beginning — that Reddit post, quality-of-life things, the ease of use. If I have a bunch of Python notebooks and I want to use some data that I have stored out in my object store, but I'm going to read or run them in, you know, an OLAP engine or something like that — if I can reduce the friction and how much manipulation I have to do to this existing code, then that's a better user experience. Instead of saying, okay, rewrite all your Python stuff into SQL — no one's going to do that willingly. All right, so I've already alluded multiple times to the fact that DuckDB supports reading a bunch of the different file formats we've covered, but it also has its own proprietary custom file format. Similar to, you know, if you open up SQLite and call CREATE TABLE, it'll write out a .db or .sqlite file — you get the same thing in DuckDB. It has its own proprietary file format, and it's meant to generate everything as a single file. Now, when you do updates on it, just like in SQLite, they'll maintain a write-ahead log as a separate file, and temp data will get spilled to separate files as well. But the core database itself that you attach to is always going to be a single file. So, no surprise, it's going to be PAX. Their row group size is about 120,000 tuples (122,880, to be exact).
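A minimal sketch of the PAX idea behind DuckDB's file format — my own toy, not the actual on-disk encoding, which adds headers, checksums, compression, and the write-ahead log. Tuples are chopped into fixed-size row groups, and within each group values are stored column by column (the tiny group size here is just for readability; DuckDB's is 122,880 rows):

```python
# PAX layout: row groups of fixed size, columnar storage inside each group.
ROW_GROUP_SIZE = 3   # toy value; real systems use tens of thousands of rows

def to_pax(rows):
    """rows: list of equal-length tuples -> list of row groups,
    each row group being a list of per-column value vectors."""
    groups = []
    for start in range(0, len(rows), ROW_GROUP_SIZE):
        chunk = rows[start:start + ROW_GROUP_SIZE]
        # Transpose the chunk: row-oriented tuples -> column vectors.
        groups.append([list(col) for col in zip(*chunk)])
    return groups

rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
groups = to_pax(rows)
# Group 0 holds columns [1, 2, 3] and ["a", "b", "c"]; group 1 the rest.
assert groups[0] == [[1, 2, 3], ["a", "b", "c"]]
assert groups[1] == [[4, 5], ["d", "e"]]
```

This hybrid gets you columnar scans within a group while keeping all of a tuple's values physically close together.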
And the important thing to understand is that they're going to be more aggressive with the compression schemes they use when data goes to disk versus when it's in memory. In particular, they're going to do bit-packing with the frame-of-reference optimizations we talked about before. So when you actually start writing data to disk — say you're doing bulk inserts or COPYing into DuckDB, which then writes out to the database — they do something sort of similar to what we saw with BtrBlocks, where they have an initial pass to look at the data, figure out what it looks like, and use some kind of ranking algorithm to decide which is the most efficient encoding scheme for that data. And once they figure that out, they compress it and write it out to disk. And they do this on a per-column basis, I think, within a row group itself. So this table is a bit out of date — this is from 2022 — but you can see, over the years, over the different versions, where they've added different compression schemes. And even the latest release, which I think just came out in March, has the latest compression scheme from another project at CWI called ALP, which we didn't cover, for floating-point numbers. But you can sort of see how they compare against baseline compression in Parquet with Snappy and zstd for the different lightweight encoding schemes. Yes? Sorry — the columns are different well-known data sets or benchmarks. Lineitem is TPC-H. Taxi is the New York City taxi data set — every single taxi ride over, like, a one-year period, with the pickup and drop-off locations. And then OnTime, I think that's flight data, but I'm not sure. But again, this is the on-disk format. They only have those four vector types once everything is in memory, because you want that processing to go as fast as possible.
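Here is a hedged sketch of the frame-of-reference plus bit-packing scheme mentioned above: subtract the column minimum, then store each offset in just enough bits. This is my own toy encoder, not DuckDB's implementation (which also ranks candidate schemes per column before choosing one):

```python
# Frame-of-reference (FOR) + bit-packing for a vector of integers.
def for_bitpack(values):
    base = min(values)                        # the "frame of reference"
    deltas = [v - base for v in values]
    width = max(deltas).bit_length() or 1     # bits needed per value
    packed = 0
    for i, d in enumerate(deltas):            # pack deltas into one big int
        packed |= d << (i * width)
    return base, width, packed, len(values)

def for_unpack(base, width, packed, n):
    mask = (1 << width) - 1
    return [base + ((packed >> (i * width)) & mask) for i in range(n)]

vals = [1_000_004, 1_000_001, 1_000_007, 1_000_000]
encoded = for_bitpack(vals)
assert for_unpack(*encoded) == vals
# Deltas fit in 3 bits each: 4 * 3 bits instead of 4 * 64 bits,
# plus the one shared base value.
assert encoded[1] == 3
```

Large, tightly clustered values compress to a handful of bits per entry, which is why this family of lightweight schemes beats general-purpose compression on numeric columns.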
So in addition to being able to read their own proprietary file format on the local disk, as I said, they can read data in other file formats. Parquet, not surprising. I don't think they support ORC — I don't see it here. Obviously, they support Arrow. Another cool thing they can do is attach to a SQLite database and actually read it directly, and manipulate it directly. You attach to the database and you see the catalog, you see all the schemas and everything. From your perspective, you don't know whether it's actually a SQLite database versus a DuckDB database. They also support, obviously, reading JSON. And for the Postgres one, I think they connect to Postgres over JDBC. But again, it sucks in all the catalogs, so within DuckDB you can see all the tables you have in Postgres and run queries on them. But I don't know how they trade off between pushing the query down into Postgres versus sucking the data in and running it more quickly inside of DuckDB — because DuckDB's query engine is going to be way faster than Postgres for doing analytics. So these are the — if you call, it's hard to see, duckdb_extensions() — this will give you the listing of all the official built-in ones and whether they're loaded or not. And as I was saying before, they're trying to minimize the size of the core engine binary itself by not bringing in additional binaries or additional code. But if you do need it, then you get it through these extensions. One in particular is ICU. That thing is super important — that's, like, international time stamps and things like that, and date formats. That one, you don't want to write yourself. The Germans did. I'll talk about that in a second. But a lot of people just need it, so you get it through one of these extensions, because it's a third-party library. The Germans told me that one time they were going to do a sales call.
This is back when it was Hyper. They were literally on the plane flying to meet some customer — I don't know where in Europe — and they needed the ICU library for the customer, and it was super slow. So Thomas, during the flight, without internet, rewrote the ICU library, and they had it ten times faster when they landed. Insane. So the one I want to point out, though, is this one called MotherDuck. Again, if you just download DuckDB and you list the extensions, these are all the things you can get. And I think some of these are not shipped in the binary — it'll pull them down from the DuckDB website. But this one here comes along with this thing called MotherDuck. So what is that? Yes? The question is, how do you load these extensions? They're shared objects. You call INSTALL and then LOAD with the name of the extension you want. And I think it'll pull it down for you from the website when you call INSTALL, yes. The binary, you know, when you download it, it's not that big. Yeah. Correct, yes. I think some of them are shipped with it, but they're not loaded automatically. I mean, some of these say built-in. And that would be equivalent to, like, in Postgres — within the Postgres source tree, they have a contrib directory, and that has official third-party extensions that are shipped with Postgres itself. But obviously, things like pgvector, you just download from GitHub and install yourself. So it's a similar model. Yeah, you can, of course — you can do that in Postgres, too. It's just linking in a shared object, and that shared object has to implement this general extension API so the system knows the entry point when you load it. Yeah, that's this. The question is — I think maybe you're asking, when you download DuckDB and get the executable, does it come with these built in, or do you get them from the website? There actually is at least one extension...
That overrides the website one? The malloc implementation. Yeah, but that comes with it automatically. Yeah, we — sorry, we don't really talk about malloc. You never want to use libc malloc for your database system. You almost always want to use jemalloc. In rare cases, you maybe want to use tcmalloc — that one's from Google, and jemalloc is from Facebook, Meta. It's just way more efficient. It's more scalable for multiple cores — the latches are less expensive. tcmalloc is thought to be better for multi-threaded applications if you have a lot of cores, but jemalloc is always the right choice. So pretty much everyone uses it in their database system. We didn't talk about huge pages — that's another one; most systems don't do that. You never want to use transparent huge pages in the OS. That's always a nightmare, though I think it's gotten better. If people are curious about these things, we can talk about it a bit more. But when in doubt, just use jemalloc, yes. The question is, what is the big problem with libc malloc? Yeah, there are too many latches inside of it. jemalloc is basically designed around taking lightweight latches — small critical sections — to be able to scale to multiple cores. Is there any reason why it's not the default in Linux? jemalloc is not written explicitly for databases — all the databases use it because it's much better; pretty much any high-performance application is going to use jemalloc. I have no idea. Licensing, maybe — licensing issues. Maybe it's, like, MIT-licensed and the default has to be GPL or whatever. Yeah, there's a paper we didn't read this year. There's a system called Scuba at Facebook. It was basically an in-memory database. One of the things they want to do is rolling upgrades — they want to restart the server. Since it's an in-memory database, if I kill the process and start it back up, I've got to load all the data back in.
The trick they do is they write all the contents of memory to shared memory in the OS, kill the process, come back, reattach to that shared memory, and everything's there. They talk about how they tried — because they employ the person that writes jemalloc — having him add some tricks in jemalloc to make it better so they could share memory and restart. It was a huge nightmare. For that case, they just relied on the OS to do it. There are some optimizations in jemalloc that are specifically for databases doing these rolling restarts, because I know Facebook puts stuff in there. Yes? Also, libc malloc is used because it uses less memory — jemalloc uses a lot of memory. It uses a lot of memory? It eagerly allocates ahead of time. Yeah. Just use jemalloc. All right, sorry, yes. So this MotherDuck thing — what is that? So, during the pandemic, Hannes and Mark were thinking about doing a startup around DuckDB, but at least when I talked to them, what they really wanted to do was just keep building DuckDB as-is, running on embedded devices and so forth. And all the VCs wanted them to make a cloud version of it that would end up looking something like Snowflake. But that would be, again, a major rewrite and go against the ethos of the original design of DuckDB. So there is DuckDB Labs, and that's basically the spin-off from CWI that is doing most of the development on DuckDB and employs Mark and Hannes and a bunch of former students to keep building DuckDB. And that's where you get official support for DuckDB if you need it — you contract with them. Then, a year or two ago, there was a spin-off — a startup called MotherDuck — created to provide a cloud version of DuckDB. But again, it's not a scale-out version of DuckDB, like a Snowflake or a Dremel.
Instead, it's more like a remote compute option that you can now get in DuckDB, where you still run DuckDB locally, but if your data is already in the cloud, you can run a query locally that then gets shipped over the wire to MotherDuck, who's running DuckDB there, does some processing, and then sends back the result to you. So, going back here — when you download DuckDB, this comes along with it, this official MotherDuck extension. And that means everybody running DuckDB now can connect directly to MotherDuck, assuming they have an account and an API key. And so MotherDuck sends down to the local DuckDB the catalog — everything, all the files that are available — and then I can write queries on them as if they were local files, and the system figures out what part of the query needs to run in the cloud and what part needs to run locally. Yes? The question is, why would anybody want to use this versus some other OLAP system? Yeah, they all have cloud offerings. I mean, again, if you're already doing your analytics locally in DuckDB, because it connects to Python or whatever, then to stop whatever you're doing and switch over to BigQuery — the query might not actually work anymore. You might not even be writing SQL; you could be using Ibis or dplyr. So it's the seamless integration. It's still the DuckDB client-side interface, but you don't necessarily know where that query's going to run anymore. Yes? The question is, there's no intra-query parallelism going on here — it won't split across multiple nodes at all? Will it scale horizontally to multiple nodes in the cloud? As far as I know, at least in the current version, no. Right? Is there a reason why not? Because, yeah, you'd have to rewrite a lot of DuckDB to make that work. Right? Maybe you could just...
If they're just sending the query plan up to the cloud, maybe you could identify pipeline breakers, split out parts of the query plan, and then run each on its own DuckDB instance. So to this point, actually, you're right — they could do that. I don't know whether they do, though. I'll show it in the next slide. Basically, they decide which parts run remote and which run local, and the local DuckDB instance is responsible for figuring out, okay, this data's remote, and it's too big for me to suck down locally, so I'll send my plan fragment over there, process it, and get back the result. So yes, at that point, you could say, okay, I could take portions of this and fan them out across multiple DuckDB instances. I just don't know whether they do that or not. So this is from the paper that came out this year. The idea is that, again, you have the client-side DuckDB, you install that MotherDuck extension, and that can send query plans up to the MotherDuck cloud service, where they're running DuckDB inside of Docker containers. And they're doing client-side caching, similar to what we saw in, I think, Dremel — sorry, sorry, in Snowflake. And then, of course, you can always read data from your object store, whatever you want. And the MotherDuck service is aware of what data's in there — I think they can connect to Iceberg and other things to suck out that metadata as well. So the way this works is that they introduce what they call a bridge operator in the query plans, which is capable of sending and receiving data between the local DuckDB instance and the cloud version of it. And so, again, when you invoke a query on the local side, it'll do all the planning it normally would, and then, with the MotherDuck extension installed, it'll take a second pass on it and say, okay, this data you're trying to access in this pipeline is local, this data is remote.
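The decision that follows from this local/remote classification can be sketched as a tiny cost function. This is my assumption of the general shape, not MotherDuck's actual cost model, which is more involved:

```python
# For a join where one input is local and one is remote, ship the
# smaller side over the wire so total transfer cost is minimized.
def plan_join(local_bytes, remote_bytes):
    """Return where to execute the join based purely on transfer cost."""
    if local_bytes <= remote_bytes:
        return "remote"   # upload the small local side, join in the cloud
    return "local"        # download the small remote side, join locally

# 1 KB local file vs. 10 TB remote table: clearly send the kilobyte up.
assert plan_join(local_bytes=1_000, remote_bytes=10**13) == "remote"
# Huge local fact table vs. small remote dimension table: pull it down.
assert plan_join(local_bytes=10**11, remote_bytes=5_000_000) == "local"
```

Note the model deliberately ignores compute cost — as the lecture says, the decision is driven by transfer time, not computational complexity.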
And then they do a cost calculation to decide: is it better to push the query out to the data on remote storage, or pull the data down to the local machine? Say I'm doing a join between some customer table and a sales table. The customer table is remote, the sales data is local. And say the customer data is huge — it's typically the opposite of this; the sales table is usually much, much bigger, right? So DuckDB would say, okay, since the customer data is remote and large, let me send the sales data over the wire up to the remote service. The remote service then computes the hash join using the DuckDB instance running in the cloud. Then it has to get the result back to the client, so it sends the data back. So they add source and sink operators in between these things. And again, notice they don't have to do anything extra to support this in terms of scheduling and running these pipelines, because we just made a big deal about how they switched to the push-based model. They can pause things or do asynchronous I/O, because now the control flow is separate from the data flow. So I can have this thing start running, sending data up, without this weird call-stack thing where I pause and unpause as I'm pushing data out, right? And that cost model for the second pass — deciding what runs remote versus local — is based purely on the transfer cost, the transfer time of the data, not computational complexity. Obviously, if I have 10 terabytes in the cloud and a one-kilobyte file locally, I don't want to suck down the 10 terabytes; I want to send the one kilobyte up and run everything remotely. Yes? Besides DuckDB having this client layer that connects to MotherDuck, what's the difference between this and something like Neon, where Neon is sort of equivalently scaling out Postgres? I think they actually have a lot of horizontal scaling going on, right?
The question is, what is the difference between this and Neon? I mean, Neon ripped out the storage layer of Postgres and basically made a shared-disk architecture that can scale out horizontally. But I'm pretty sure the compute side, for the queries themselves, is still running on a single node. So that would look like this: the horizontal scaling is happening at the storage layer. MotherDuck still runs DuckDB as shared-nothing. So that's why — wait a minute, we'll go back here. Sorry. You're saying that — okay, well, that is clearly shared disk, but I'm just trying to understand why auto-scaling isn't viable here when something like Neon does it. So, I misspoke. I don't know whether they're actually scaling horizontally on the compute side, but you could, as you said — if I have four pipelines that I'm going to run remotely, instead of having them all on a single instance of DuckDB, you could have them run on multiple instances, in separate containers. Yeah, you could do that, yes. I don't know whether they're doing that, though. I think in the initial version it's: here's the pipeline, just run it. Because you would need an extra step to say, okay, chop it up, scale it out, and then stitch it back together. Eventually they could do that, I'm sure. The Neon architecture is more similar to Aurora. You have a single primary where all the writes go, and then you propagate the updates through the storage layer, and then you have read-only replicas that service those queries in a transaction-consistent manner. So Neon is more heavyweight, you're saying? No, I'm saying that Neon is trying to have a primary with multiple replicas. The primary absorbs all the writes, because it's an OLTP workload, and then the changes get propagated to the replicas, and you can run read-only queries on those in a transaction-consistent manner, or with snapshot isolation.
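Why a primary/replica setup like this doesn't have to mean eventual consistency can be shown with a toy synchronous-replication model — my own sketch, not Neon's or Aurora's code: the primary only acknowledges a commit after every replica has applied the update, so any replica you read from already has it.

```python
# Toy in-memory primary/replica pair; real systems ship log records
# over the network and handle failures, timeouts, and quorums.
class Replica:
    def __init__(self):
        self.data = {}
    def apply(self, key, value):
        self.data[key] = value
        return True                      # ack back to the primary

class Primary:
    def __init__(self, replicas):
        self.data, self.replicas = {}, replicas
    def commit(self, key, value):
        self.data[key] = value
        # Block until all replicas ack; only then is the commit
        # acknowledged to the client.
        if not all(r.apply(key, value) for r in self.replicas):
            raise RuntimeError("replication failed; commit not acknowledged")
        return "committed"

replicas = [Replica(), Replica()]
primary = Primary(replicas)
assert primary.commit("balance", 100) == "committed"
# Once the commit is acknowledged, every read-only replica has it.
assert all(r.data["balance"] == 100 for r in replicas)
```

Relaxing the wait (acknowledging before all replicas apply) is exactly where weaker guarantees like snapshot-stale reads come in, which is the trade-off discussed next.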
And then what Amazon does in Aurora is — where Neon does all that propagation of the updates in code sitting above the file system, Amazon puts that propagation directly within EBS itself, and they can do that because they control the whole stack. Not exactly directly in EBS, but just a little bit above it. Yes? This is a bit of a tangent — we just had a question. If you do the primary thing, aren't you doing eventual consistency rather than ACID? The question is whether, if you're doing what I was saying before with the primary, you get eventual consistency. No. Because I could do a commit on the primary and then not acknowledge the commit until all the replicas have acknowledged that they got the update. Oh, okay. Right? And then, again, for read-only queries, sometimes you maybe don't need that sort of strong consistency — the higher guarantees — and snapshot isolation might be enough. So I don't care that I'm reading data that's 10 milliseconds behind, as long as I have a consistent snapshot. That's a whole — yeah. Okay. All right. So, to finish up with DuckDB — and then I'll talk about the worst idea I've ever seen in databases afterwards — I think DuckDB is amazing, right? The amount of adoption they've had in the last couple of years is phenomenal. And I think it was a combination of three things. They were at the right place at the right time, where the pendulum has sort of swung back and SQL is now the default choice for a lot of applications. He was sort of asking why people would not want to write SQL, and that's, I think, an artifact of data scientists and the people in the NoSQL world eschewing SQL back in the day — but that has definitely changed over time. So they were at the right time for this.
They were solving the right problem — like, hey, we need an embedded database to do analytics, not another Dremel, not another Snowflake. And they basically took a lot of the ideas the Germans were developing — Hannes is German, but I mean the Munich Germans, the Hyper guys — took those papers and built an open-source implementation of them. Had Hyper or Umbra been open source — I mean, the embedded part was also a great idea: like, SQLite for analytics, an embedded database, Hyper/Umbra or not. So I think the combination of that packaging, plus the ideas from Hyper and Umbra and improvements over them, was certainly phenomenal. It was a really good idea. And for me — we were building a system that I wanted to have adoption outside of CMU, but I was putting all my eggs in the basket of in-memory databases. And that certainly did not pan out, because DRAM prices stagnated and SSDs got really fast and really cheap. And so in-memory databases aren't in vogue anymore; everything is SSD-based now. And we had other complications that I could discuss offline. So, again, I think DuckDB is a great system, and I use it. This is my default choice of, like, oh, I got a CSV, I've got to do some analysis on it. It used to be: open up Excel or Google Sheets. Now it's DuckDB, right? So you should have that in your mindset — throw DuckDB at everything for quick and dirty things. All right, so next class we're going to read about Yellowbrick. And as I said before, the reason we're reading this paper, which also just came out in 2024, is that they do a bunch of low-level things that nobody else does. We talked a little bit about that this semester, but you'll see how they're just super hardcore about it. It's a very fascinating system. And, like I said, they have real numbers in there that nobody else does.
Okay? All right, so I'll cut this. The worst idea I've ever heard.