I'm sure you're excited today. I've been promising this all week. We have Mark, one of the co-founders of DuckDB. I think he's written the most code, right? We've looked at this. Yeah, I am. OK. And he did his PhD at CWI, which, again, is one of the top schools. CWI does all the things that we talk about this entire semester. So, again, if you have a question for Mark as we go along, just stop and interrupt, and that way it's a conversation, OK? All right, perfect. Go for it. You're the best. Thank you. Thank you very much for the introduction and the invite. I can skip my first slide. Those are always the best introductions. Yeah, so I'm Mark. Hello, everyone. Happy to be here. And I'm going to be talking about DuckDB. So DuckDB is something that me and Hannes co-developed at CWI during my PhD. And it's a database system, obviously, otherwise I wouldn't be here. It's an in-process analytical database system. So essentially, there is this big standard in database systems, which is the client-server architecture, right? You have a database server sitting somewhere, and then you have your client that talks to the server. And that's very useful in some situations, like when you want to share data between multiple people, multiple clients. But there is this alternative architecture, of which I think SQLite is the prime example, that allows you to run your database as a library inside your application. And that has a number of cool advantages. And that's one of the reasons that we started building DuckDB: we realized there was kind of a gap, somewhere where database systems could be getting used, where all the cool technology that Andy has been describing to you, all the cool technology that the database community has been working on, could be getting used, but wasn't. And that's what DuckDB is. Our tagline, or what we like to say, is that we're the SQLite for analytics. The main reason is that we're also in-process, and we try to follow a lot of the same design principles as SQLite: it's easy to use, easy to set up. And I think for analytics, one of the cool features of being in-process is that you can exchange data very easily with your host process, right? Because you're right next to each other instead of having a socket connection. So you can very quickly switch from your application code to your database system and back. And that allows for some very cool use cases, specifically for analytics. So DuckDB: it's free and open source, it's MIT licensed, we have a website. If you haven't heard about it already, feel free to check it out. We have two groups doing their projects on DuckDB this semester. Excellent, excellent. That's what I like to hear. So the first question I always get about DuckDB is: why do you call it DuckDB? It's a bit weird. The reason is that Hannes, the co-creator of DuckDB and my former PhD supervisor, used to own a pet duck. Hannes lived on a boat and he wanted to get a pet. And he thought, well, being on a boat, getting a pet can be potentially problematic, right? Because there's water. What can swim? And the answer was a duck.
So this is Wilbur, and he is our inspiration for naming the system DuckDB. So that's the canonical version of why it's called DuckDB. The dog sees the picture of the duck and is freaking out. Oh, sorry, sorry, sorry. I wasn't aware there were going to be other animals. So yeah, I have also already given a talk at this course previously, three years ago, in 2020, back when the pandemic first started and everything was moving online. And it was very cool. It was quite a long time ago in terms of DuckDB years, because the system is very young. At the time it was way younger, right? And a lot has happened. So as you can see, back then I was still working at CWI full-time as a researcher. We didn't have a spin-off yet. A lot of things have changed. A lot of things have also kind of stayed the same. So I want to talk about the system, but also about the evolution of the system and what we've been up to, essentially. You were the first person I invited when the quarantine hit. I invited you first. We wanted you to come first. Oh, thank you. Thank you. I had a great time back then, which is why I was so eager to do it again, of course. So here you can see our commit history on GitHub. And it's kind of funny: you can see the CMU talk right at the beginning. And well, things have kind of picked up. And of course, the reason is that back then it was just me and Hannes working on this, right? Nowadays, many more people are using it and many more people are working on it as well. We're also getting a lot more external contributions and we have a much bigger team working on it. So there's just a lot of stuff happening. And as you might imagine, we have been doing some stuff in this big green cloud to the right of the previous talk. So yeah, one of the big changes on an organizational level is that we actually have a company now. We made a spin-off from CWI called DuckDB Labs, and it provides consulting services around DuckDB. And we actually have a team of people that work on DuckDB now. A lot of them are from CWI; people did their master's there, did their bachelor project there, did their PhD there. But we also have some people from other schools, other backgrounds. So yeah, we actually have people working on documentation, which is something that before was just done by me and Hannes. So that's a big, big difference there. All right, so let's talk about tech, of course. The database, that's why we're here, right? There's this very nice slide that I made, I think, four years ago, which surprisingly has stood the test of time. So this is a high-level overview of DuckDB, of the system, and, at a glance, what we're all about when it comes to the internals. And none of these things have really changed. They have changed internally, like we have changed the components very significantly, but at a glance it's still the same. So we're still a column store database, we haven't pivoted to a row store, and we're still using a vectorized processing engine. The storage format in DuckDB is a single-file storage format, similar to SQLite. So your whole database lives in one file instead of in one directory, which is kind of the norm for most database systems outside of SQLite. We have an ART index, and we have been making a lot of improvements to it and doing a lot of cool stuff around that. I won't actually be talking about that, though.
We use MVCC for our transaction management. It's written in C++ and we use the Postgres parser. So that's, at a glance, what DuckDB is all about. In this talk, I'm going to talk mostly about the processing and the storage. I will touch a bit on the MVCC, but unfortunately I don't have time to go in-depth into all of the different components, or I would be running out of time. I would need many more hours. So I'll be focusing mostly on the processing and storage here. We don't cover MVCC this semester anymore, so don't worry about it. Oh, all right. Okay, then. Well, I have one quick slide on it, but I will just zoom by that one. Cool. All right. So DuckDB uses a vectorized push-based model. And as I'm sure Andy has told you many times through this course, what that means is that instead of having a tuple-at-a-time, a row-at-a-time model, you have a vector-at-a-time model. That means that vectors are your unit of data that flows through the system, through the operators, right? So these vectors are kind of the bread and butter of the engine. And I think it's worth spending a bit of time talking about DuckDB's vector format because it's such an integral part of the system, right? All intermediate data is represented as vectors. So DuckDB has a custom vector format. And it's quite similar to Arrow. It has some similarities to Arrow, but it's different as well. And the differences mostly come from us designing this format entirely for execution. Arrow is a very awesome vector format, obviously, but it does a lot of different things. You can serialize it, you can stream it over a socket, right? You can write it to disk, you can read it back. Our vector format doesn't need to do any of those things. It's designed solely for execution. And we have made some changes to the format, some optimizations, to enable that. Our vector format was actually co-designed with the Velox team. So Velox has the same vector format as DuckDB. So vectors are kind of like a slice of a single column, right? They hold data of a single type. And for scalar types, vectors are basically just arrays, right? If you have an integer vector, you just have an array of integers. So in this case, an array of the integers one, two, three, four, five. One thing that's kind of different in DuckDB is that we have a vector type. And that doesn't change what a vector represents logically, but it changes how the data is represented physically. And what's cool about the vector type is that it allows us to push compressed data through the engine and operate directly on compressed data without having to decompress it in the sources. So these are some of the vector types that we have. You have the flat vector, the uncompressed vector; that's kind of the standard vector. That's what a cookie-cutter vectorized engine would operate on, right? You just have arrays. But we also have these other vector types. We have a constant vector, for example. And in a constant vector, logically, the array has the same value for every entry, right? So in this case, our logical view is just a bunch of ones. But because we know this, because we have this information, we can say: okay, physically, we only need to store this value once, right? We don't need to actually repeat it at our storage level.
And that's very useful, not just to save space. Actually, space savings on vectors are not super relevant, because it's just an intermediate layer, right? But it's important because it allows us to do optimizations. Because if we have two constant vectors and we do an operation, like we want to add them together, what we can do is just look at this one value and add it to this other one value. We don't need to look at the entire array, right? It allows us to push this compression into the actual execution. Another very important vector type is the dictionary vector. If you have dictionary compression at your storage level, you have a dictionary, a set of unique values, and you have an index into that set of unique values. And the dictionary vectors allow us to store such a dataset physically like that, right? So we have our dictionary, which holds the unique values, and we have the indexes. So in this case, we have A, B, A, A, B. But the way we store it is we store A, B as the distinct values and then we store 0, 1, 0, 0, 1, right? The indexes into the dictionary. And that's cool because it allows us to read directly from our dictionary storage, or from the dictionary storage in Parquet, and emit these dictionary vectors. And it allows our execution to optimize for this, because sometimes maybe you only want to do an operation that influences the dictionary, and you don't even need to touch the indexes, right? Or maybe you can say: oh, I need to compute the hashes of a bunch of big strings, but I know they're duplicated. So instead of doing it once for every value, I do it once for every entry in the dictionary, and that can save a bunch of time. Another vector type we have is the sequence vector. This is not very commonly used, but it's used for stuff like row IDs. That's basically if you have a sequence of values like one, two, three, four, five, right? You can store that as a base and an increment instead of storing the actual values. Now, when it comes to actually processing these vectors, you need to operate on them in some capacity. You can process them as is, right? You can say: oh, I have a flat vector, I have a constant vector, I'm going to look at the physical representation and I'm going to do the operation. However, because we have all these different vector types, you run into a combinatorial explosion problem, right? If I have two parameters to my function, now suddenly I need to handle flat-flat, flat-constant, flat-dictionary, constant-constant... you get the picture. And solving this is not only a major engineering hurdle, it's also problematic from a code footprint perspective. Because in a vectorized engine, you pre-compile everything, right? Which is great for this, because it allows you to do dynamic switching very fast: you can say, oh, now I have a flat and a constant, now I can do this specialized operator; I have a flat and a flat, now I can do the normal operator; now I have a constant and a constant, I can do this specialized operator, right? But you don't need to compile this combinatorial explosion, and you also don't really want to, because it creates this giant code footprint. So what you often want to do is just have a function that isn't specialized, essentially, that just does the operation once.
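To make those vector types concrete, here is a minimal, self-contained C++ sketch of the idea. These are not DuckDB's actual classes, just an illustration of how one logical array can have several physical representations, and of why specializing an operation per combination of input types explodes:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

enum class VectorType { FLAT, CONSTANT, DICTIONARY };

struct Vector {
    VectorType type;
    std::vector<int32_t> data;     // FLAT: one entry per row; CONSTANT: one entry
    std::vector<uint32_t> indexes; // DICTIONARY: per-row index into data
    size_t count;                  // logical number of rows

    // Logical view: what value does row i hold?
    int32_t Get(size_t i) const {
        switch (type) {
        case VectorType::FLAT:       return data[i];
        case VectorType::CONSTANT:   return data[0];
        case VectorType::DICTIONARY: return data[indexes[i]];
        }
        assert(false);
        return 0;
    }
};

// Adding two vectors (assumed to have equal counts). Specializing on the
// vector types exploits the compression: CONSTANT + CONSTANT touches one
// value instead of `count` values. But every extra combination is more code
// to write and compile, hence the combinatorial explosion.
Vector Add(const Vector &a, const Vector &b) {
    if (a.type == VectorType::CONSTANT && b.type == VectorType::CONSTANT) {
        return Vector{VectorType::CONSTANT, {a.data[0] + b.data[0]}, {}, a.count};
    }
    // Generic fallback: materialize a flat result value by value.
    Vector result{VectorType::FLAT, {}, {}, a.count};
    for (size_t i = 0; i < a.count; i++) {
        result.data.push_back(a.Get(i) + b.Get(i));
    }
    return result;
}

int main() {
    Vector flat{VectorType::FLAT, {1, 2, 3, 4}, {}, 4};
    Vector constant{VectorType::CONSTANT, {10}, {}, 4};
    Vector dict{VectorType::DICTIONARY, {100, 200}, {0, 1, 0, 1}, 4};
    Vector r = Add(Add(flat, constant), dict);
    for (size_t i = 0; i < r.count; i++) printf("%d\n", r.Get(i));
}
```

The generic Get-based fallback here is exactly the unspecialized path Mark is talking about; the unified format described next makes that path cheap.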
And there are two ways of getting such a generic function. The straightforward way is to say: oh, I can just decompress the vector, right? I can just convert it into a flat vector. However, that has the downside that I need to move and copy data around. If I have a constant vector, I need to replicate my value and actually create the logical array that I was avoiding creating before. And in order to avoid this, we have a unified view, a unified format over the vector, that allows us to process these vectors without having to do this moving or copying of data. And this works for our three primary vector types: the flat vector, the constant vector, and the dictionary vector. And the unified format, in essence, is quite simple: it's just an array and a set of indexes into that array, right? So you have two structures. And for a flat vector and for a constant vector, the key ingredient here is that they're always the same. If I have a flat vector, an uncompressed vector, I always want to access the elements in order: I want to access the first element, the second element, the third element, the fourth element, right? That's the way I would logically access the vector. For a constant vector, I always want to access the first element, right? There's only one, so I want to do first element, first element, first element. And these selection vectors, these indexes into the data, are always the same, they're constant. So basically what we can do is allocate them once, statically, and then just set a pointer to them and keep reusing them. And that way we don't have to do any sort of allocation or construction of a vector to be able to construct this view of the data. Now for a dictionary vector, it works even nicer, because it turns out that this is already a dictionary vector, right? You have the data and you have the set of indexes. So we can just use this stuff as is. And the cool thing about this unified format is that we can, without copying data, without moving data around, have this unified view over these vectors that allows us to implement our operators, our execution, without requiring any specialization and without penalizing the system by needing to move data around, right? So it allows us to create these generic operators, in case we don't want to specialize, in a fast and efficient way. Mark, so these vectors, for the different vector types: this could be coming out of the DuckDB file format, right? You get a dictionary vector because that's what's in the file. But for Parquet, are you going to convert their dictionary format to your format, or do you always flatten it? So from Parquet we also emit dictionary vectors. And some functions can actually generate these; mostly constant vectors can be generated from functions. So first off, if you have a constant in your query, it's a constant vector, right? So if you do plus one, the one is a constant. But also if you pass a null parameter, we can say: oh, now the result is always going to be null, so we turn it into a constant. So these things can come from different places. They can definitely come from our storage, but also from Parquet, or we can generate them during execution as well. My next question is: once you get further up a pipeline, would you ever have an operator spit out something that wasn't a flat vector?
Like, assuming it came in one way, would you ever say: oh, I know I can dictionary-compress this other part of the query plan and send it up as a compressed vector? So there are some operators that turn things into constant vectors, but usually not, no. Usually you go towards the flat vector, right? It's just the easier way of doing it. We don't do any sort of recompression unless we can prove statically that something is always constant, in which case we just keep it constant. For example, for our nested loop join, we use these constant vectors, because we have to match one vector on one side with one row on the other side, right? So then we use these constant vectors. But in general, yeah, it stays flat. Matt, you have a question? Yeah, I was curious: the constant vector is sort of a special case of the dictionary vector. Do you have an idea of how much performance you gain by removing that indirection layer, like between a dictionary vector with one element versus this constant vector? I imagine it's quite a bit if you special-case it like this, but... Yeah, so indeed you're right. The strength of the constant vector is, first, that you don't need to store the list of indexes. Of course, you could just put a bunch of zeros there. But I think the strength is mostly that it's easy to special-case on the constant vector. For the dictionary vector, very often we don't special-case, because there's just stuff that you don't necessarily want to specialize on for a generic dictionary. Whereas if you have two constant inputs, it's kind of obvious: it's always going to be very beneficial to specialize on that. Makes sense, thanks. I have a question. How do you choose, when you read data, which representation to store your vectors in within storage? How does that process work? How do you decide to store it as, say, flat or dictionary? So our storage is actually not vectors, right? Our storage format is very compatible with our vectors, obviously, but it's not exactly our vectors. There are a bunch of differences. For example, in our storage we do stuff like bit packing, right? But in our vectors, we will only ever have plain integers, like int32 or stuff like that. I'll talk a bit about the compression later; I have slides on that as well. But during the actual compression, we have more options than in our actual vectors. And some of those storage formats translate to vector types, but not all of them. That's not exactly my question, though. Let's say a user inserts values like one, zero, one, zero, one, zero, right? In that case, a dictionary format would be pretty good for storing this data. But how do you determine whether to use the dictionary format versus a flat format? So this happens at the storage layer. We have this compression phase that writes to disk, right? And then when we read from the storage format, we can say: oh, our storage format is dictionary-encoded. The storage has determined dictionary is the optimal representation for this, and then we emit dictionary vectors. Oh, okay, thanks. Yeah, right? Yeah.
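Here is a compact sketch of the unified-format trick from a moment ago, loosely modeled on DuckDB's UnifiedVectorFormat but with names and details simplified. The point is that a single selection-indexed kernel handles flat, constant, and dictionary inputs, and that the flat and constant selections are static, so building the view needs no per-vector allocation:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct UnifiedFormat {
    const int32_t *data; // underlying value array
    const uint32_t *sel; // row i logically reads data[sel[i]]
};

// Static selection reused for every flat vector: 0, 1, 2, ...
// (In practice these would be sized to the engine's vector capacity.)
static const uint32_t FLAT_SEL[] = {0, 1, 2, 3, 4, 5, 6, 7};
// Static selection reused for every constant vector: 0, 0, 0, ...
static const uint32_t CONST_SEL[] = {0, 0, 0, 0, 0, 0, 0, 0};

// One generic kernel instead of one per (type_a, type_b) combination.
void AddGeneric(UnifiedFormat a, UnifiedFormat b, int32_t *out, size_t count) {
    for (size_t i = 0; i < count; i++) {
        out[i] = a.data[a.sel[i]] + b.data[b.sel[i]];
    }
}

int main() {
    int32_t flat[] = {1, 2, 3, 4};      // flat vector [1,2,3,4]
    int32_t constant[] = {10};          // constant vector [10,10,10,10]
    int32_t dict_vals[] = {100, 200};   // dictionary {100,200}
    uint32_t dict_sel[] = {0, 1, 0, 1}; // dictionary vector [100,200,100,200]

    int32_t out[4];
    AddGeneric({flat, FLAT_SEL}, {constant, CONST_SEL}, out, 4);
    AddGeneric({out, FLAT_SEL}, {dict_vals, dict_sel}, out, 4);
    for (int32_t v : out) printf("%d\n", v); // 111, 212, 113, 214
}
```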
All right, so a bit more about the different types of data within the vectors. Scalars like integers are quite straightforward. For strings, we use the same format that Umbra uses, which is this cool string format that allows you to inline short strings. It's described in the Umbra paper. The way this works is that you have 16 bytes for your string. And if your string is 12 bytes or less, then you inline it instead of storing a pointer to the string. If it's larger than 12 bytes, then you store a prefix, because you have these four bytes left over: you store the first four bytes of the long string. And this is cool for two reasons. First, the inlining is very beneficial, because if you have a lot of short strings, you don't have this memory access, this level of indirection. The other reason it's very cool is that you always have the length and prefix as the first eight bytes, whether or not the string is inlined, which means that if you're comparing strings, you can compare only the first eight bytes. And if they're different, you know the strings are different. So that allows you to very quickly reject string comparisons that are going to be false, no matter the size of the strings. So that's very, very cool. All right, after the strings, we have nested types. And nested types, I think, are something that's not discussed as much in the database literature, but something that we have found to be actually very important. A lot of people use nested types. They come from JSON, obviously, but people also just like working with nested types. They do a string split and then they have a nested type, or they like to keep pre-normalized data in nested types. And while maybe it kind of violates the relational model, which is the view I had when I started going into this, nested types are very convenient, and it's important to make them fast. And that's also reflected in our vector format, because one possible solution to nested types, of course, is to just store them as blobs or as strings. And that's very slow, because you go from your highly optimized vectorized query engine to a row store, just because you're using a list or a struct, right? So instead of doing that, we store nested types recursively, using more vectors. That's very similar to what Arrow does. And that's very cool because it allows for highly efficient processing of these nested types. It really speeds these things up immensely and allows us to reuse all of our operators for these nested types as well. So there are two core nested types in DuckDB that you can generally use to compose essentially whatever you want, right? You can compose these in any way you want: you can have a list of structs, a struct of lists, a list of structs of lists. You can have very complex nested types. These are the two primitive nested types: the struct, which contains a set of values that each have a name, and the list, which is essentially an array, right? Every row can have zero to n values. So for structs, it's kind of straightforward. You just store each field as a separate vector. So in our struct here to the right, we have two vectors: we have the item vector and the price vector. And the item vector just stores the values of the item field within the struct, right?
Like pants and shoes go inside that vector, and the price vector similarly stores the values of the price field. You do have nulls at each of these different layers. That's something that's interesting: the whole struct can be null, but the individual struct elements can also be null. So you have a lot of validity masks, a lot of null values. But that's how struct values are stored. And the reason this is quite straightforward is that a struct's child vector always has the same cardinality as its parent vector, right? It's just a wrapper around a set of vectors. Lists are a bit more complex, and that's because lists can have different lengths, right? You can have a list that has a million elements or a list that has zero elements. And that's also reflected in our vector representation, in the sense that the child vector of a list vector can have a different length than its parent vector. And the way we store them is by having these offset-length pairs that basically say: okay, my first list starts at this position within the child vector and it has this many elements. And this way you can store essentially any kind of nested data you want by composing these structs and lists, right? This child vector can again be another list vector, or it can be a struct vector. And because it's all recursive, you can store any nested data you want. And because it's all vectors, you can operate on them very efficiently. How do you handle it if the type of a struct is different from one instance to the next? Say the third row has an item that, instead of a t-shirt, is a list of stuff. So we don't support that in a normal struct. The struct is by definition fixed-type. We do support maps, and maps are essentially key-value pairs. And the way maps are stored is as a list of structs. The lists themselves can contain different keys for each row. But in a struct, the schema of the struct is fully fixed. Okay. Wayne, do you have a question? Yep. Structs are typed, right? Are they named? As in, can a struct recursively contain itself as a child type? So we don't have recursive nested types. A struct cannot contain itself as a child type. Cool, thanks. Cool, right. So that was vectors. And that's the format that flows through the engine. Now let's talk about the actual engine, right? The query execution, the push-based model. So when I last presented at this course, what we had started out with in DuckDB was a pull-based system, and we switched later on to a push-based system. What we started out with was this vector volcano system. I think you probably know about the volcano model by now; I hope I'm not repeating Andy too much. What we had was this volcano-model system where, instead of having individual rows flow through the system like in the classic volcano model, we had chunks, or vectors, that flow through the system. And the way this looks from an implementation perspective is that you have a bunch of operators, and each operator implements this getChunk method. And these operators call getChunk recursively on their children. By doing this, you can construct the result of the query by calling getChunk on the root element, which will then in turn call getChunk on its children, until it reaches a scan, right? And then you can process the query like that, incrementally.
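As a reference point, the pull-based interface being described can be sketched like this in C++. This is illustrative only, with invented operator classes; the real operator API is more involved:

```cpp
#include <cstdint>
#include <cstdio>
#include <memory>
#include <utility>
#include <vector>

using Chunk = std::vector<int32_t>; // stand-in for a real vector/chunk of rows

struct Operator {
    std::vector<std::unique_ptr<Operator>> children;
    // Returns the next chunk of results, or an empty chunk when exhausted.
    virtual Chunk GetChunk() = 0;
    virtual ~Operator() = default;
};

struct Scan : Operator {
    Chunk data;
    bool done = false;
    explicit Scan(Chunk d) : data(std::move(d)) {}
    Chunk GetChunk() override {
        if (done) return {};
        done = true;
        return data; // emit everything as one chunk, for simplicity
    }
};

struct Filter : Operator {
    explicit Filter(std::unique_ptr<Operator> child) {
        children.push_back(std::move(child));
    }
    Chunk GetChunk() override {
        // Pull from the child until we produce a non-empty chunk or it runs dry.
        while (true) {
            Chunk in = children[0]->GetChunk();
            if (in.empty()) return {};
            Chunk out;
            for (int32_t v : in)
                if (v % 2 == 0) out.push_back(v); // keep even values
            if (!out.empty()) return out;
        }
    }
};

int main() {
    Filter root(std::make_unique<Scan>(Chunk{1, 2, 3, 4, 5, 6}));
    // The driving loop: keep asking the root operator for chunks.
    for (Chunk c = root.GetChunk(); !c.empty(); c = root.GetChunk()) {
        for (int32_t v : c) printf("%d ", v);
    }
    printf("\n");
}
```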
So here's what that would look like for, for example, a hash join, right? You have a method called getChunk, and it returns a chunk. And then you have a build and a probe phase, right? It's a hash join. While the build is not finished, you go to the right child and you say: hey, give me a chunk, build the hash table; give me a chunk, build the hash table. You fully exhaust the right child until your hash table is full. Then, when the build is finished, you pull from the other side and you start probing the hash table, right? So it's relatively straightforward, with the implementation of the hash table left as an exercise to the reader. And that's how this volcano model works, also in our version of it. And it's nice, it's straightforward for single-threaded execution. So there's this thing about the volcano model: it's probably one of the oldest models for database engines, and it's fully geared towards single-threaded execution. And when the volcano model was initially designed, that was all you needed. But nowadays, you really need multi-threaded execution. You cannot really get around it anymore. And the reason for that is that computers just have many, many cores nowadays, right? If you go to AWS and you want to rent an instance, you can get a machine with 200 cores. And 200, that's literally more than two orders of magnitude, right? A multi-threaded system can be up to 200 times as fast on this machine as a single-threaded system. It's not a difference of twice as fast anymore; it's literally two orders of magnitude of speedup. To give you some perspective on that: if you have a multi-threaded system that completes in one second on this machine, a single-threaded solution, given that the multi-threading was optimal, et cetera, would complete in more than three minutes, right? One second versus three minutes. If you had a multi-threaded solution that finished something in one minute, it would take over three hours single-threaded. The differences have become staggering, because there's just this massive number of cores. And we actually had people report bugs where they thought the system had gotten stuck on queries, when actually what had happened was they had made a change that forced it into single-threaded mode. It would finish multi-threaded in one minute, and after 30 minutes of single-threaded execution, they would just give up and say: there's a bug, this thing is stuck. So you can't get around multi-threading anymore in, well, the current year. It's really required. So there is this method of adding multi-threading to a volcano-based system, and it's the exchange operator. The way the exchange operator works is that you leave your operators mostly alone, right? Your joins, your aggregates, your scans are mostly left as is. They stay single-threaded. And then you have an optimizer that bakes the parallelism into the query plan. It divides the query plan up into multiple partitions, and the partitions can then be executed independently. And then there are these exchange operators, which is why it's called the exchange operator, that do the exchange in between: that say, oh, now we repartition on a different attribute and we do a split again.
And that's kind of nice, because it means that your operators don't need to be parallelism-aware. And that's great, of course, if you want to bolt parallelism onto a single-threaded system. That was also the situation we were in, because when we started building DuckDB, it was built single-threaded. But we thought about it a lot, and we thought: hey, this exchange model, we had some experience with it, and it comes with a bunch of problems as well. So one of those is plan explosion. You may have seen this picture here, right? It's splitting the plan into three distinct fragments, and it's all modeled in the plan. Now imagine the same plan, but with 200 fragments, right? There are many cores, so there are also big plans. And it turns out that a lot of the stuff database systems do with plans is not exactly linear. You often have quadratic or sometimes even exponential behavior in the plan size. So this plan explosion can cause a lot of problems; we've seen it cause a lot of problems. Essentially, it will force you to optimize your optimizers, because even if you have a simple query, the parallelism will explode the plan, make it giant, and then suddenly your optimizers can't deal with it anymore. There's also the problem of load imbalance. The thing is that because you're baking all of the parallelism into the query plan, it's not dynamic anymore. How you're going to parallelize is determined up front, and that can cause load-imbalance issues, because maybe you were wrong about your partitions or about how the data is distributed. And now you have half your threads doing nothing and the other half processing, which again causes inefficiencies. The other problem is materialization costs. The only way of communicating between these operators is materializing rows, right? There's no other avenue of communication, because the operators are all parallelism-unaware. So you have to materialize, and that has costs as well. So there's this great paper by the Munich guys about an alternative way of doing parallelism, and it's called morsel-driven parallelism. In this model, instead of having these parallelism-unaware operators, you make the individual operators parallelism-aware. And that has some very cool side effects: you can have your input data distributed adaptively, instead of having this all baked into the plan. It avoids all of these problems. So the way this model works is that you look at a query plan, a query tree, right? And you split it into pipelines at these pipeline breakers. For example, in this query we have a hash join between two tables, followed by an aggregate, a group-by. And what we do is split it into two pipelines. The first pipeline is our hash table build, right? We scan our sale table and push it into the hash table build side. And the second pipeline does the probing of the hash table, followed by the actual aggregation. And these pipelines are then parallelized internally, right? Because instead of having these parallelism-unaware operators, you have parallelism-aware operators. So the scan now knows: hey, this is how I partition my input. The hash join knows: hey, this is how I build a hash table, given that I receive input in parallel, right? So the parallelism is no longer modeled outside the plan. It's modeled as part of the operators, inside the operators themselves. So that's a very cool model. It solves a bunch of these problems.
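A toy version of that pipeline-splitting step might look like the following. The plan shape and all names are made up (the probe-side table in particular is an invented placeholder); it just shows how breakers, here the join build and the aggregate, terminate pipelines:

```cpp
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

// Toy plan: GROUP BY <- HASH JOIN <- { probe: scan(customer), build: scan(sale) }
struct Node {
    std::string name;
    enum Kind { SCAN, HASH_JOIN, AGGREGATE } kind;
    std::vector<std::unique_ptr<Node>> children; // for joins: [probe, build]
};

using Pipeline = std::vector<std::string>; // listed sink-first, source-last

void Decompose(const Node &n, Pipeline &current, std::vector<Pipeline> &all) {
    switch (n.kind) {
    case Node::SCAN:
        current.push_back(n.name + " (source)");
        all.push_back(current); // a pipeline is complete once we hit its source
        break;
    case Node::AGGREGATE:
        current.push_back(n.name + " (sink)");
        Decompose(*n.children[0], current, all);
        break;
    case Node::HASH_JOIN: {
        // Build side is a pipeline breaker: it starts its own pipeline,
        // with the join build as that pipeline's sink.
        Pipeline build{n.name + " build (sink)"};
        Decompose(*n.children[1], build, all);
        // Probe side: the join is just an intermediate operator here.
        current.push_back(n.name + " probe");
        Decompose(*n.children[0], current, all);
        break;
    }
    }
}

std::unique_ptr<Node> MakeNode(std::string name, Node::Kind k) {
    auto n = std::make_unique<Node>();
    n->name = std::move(name);
    n->kind = k;
    return n;
}

int main() {
    auto agg = MakeNode("GROUP BY", Node::AGGREGATE);
    auto join = MakeNode("HASH JOIN", Node::HASH_JOIN);
    join->children.push_back(MakeNode("SCAN customer", Node::SCAN)); // probe
    join->children.push_back(MakeNode("SCAN sale", Node::SCAN));     // build
    agg->children.push_back(std::move(join));

    std::vector<Pipeline> pipelines;
    Pipeline root;
    Decompose(*agg, root, pipelines);
    for (auto &p : pipelines) { // prints the build pipeline, then the probe one
        for (auto &op : p) printf("%s -> ", op.c_str());
        printf("done\n");
    }
}
```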
But how do you do that in a pull-based model, right? If you look at this volcano model, everything is entangled. You have a getChunk. It does a build, it does a probe, it does all the flowing of data. Everything is entangled in this one method. So what we slowly discovered is that you can't really do this very nicely in a volcano model, because you need to split this stuff into pipelines, right? And you can't, because it's all entangled. So that was the catalyst for us switching to a push-based model. And the nice thing is that instead of having this single get-data method, you have separate interfaces based on where your operator sits, right? Whether it's a source, an intermediate operator, or a sink. And you can very cleanly model this parallelism awareness in the source and the sink by basically saying: okay, we have these states. You have a global state, which is shared, and you have a local state, which is local to your thread. And you have quite a clean parallelism model where you say: this is where I need to be careful that I'm handling the parallelism correctly. And if you look at our previous plan and its pipelines, you can see where these sources and sinks are, right? Your scan is clearly a source, and your sink is the hash table build. The hash table build is a sink. In your second pipeline, the scan is again a source, and you can see the hash join; it's the same operator, but now it's an intermediate operator, right? So it gets a different, actually parallelism-unaware, implementation, because the hash probe doesn't need to know about parallelism: it's a read-only thing, right? You already have the hash table, you just need to read from it, so it doesn't have to be parallelism-aware. And then for the group-by, you again have a sink that is parallelism-aware. And this is very cool. The nice thing is that it's all very cleanly modeled: all the parallelism is there, the sources know how to split things up, and the sinks know how to do their part in parallel. And I think it's just a very clean model that's quite intuitive to work with as well. There's some other cool stuff about a push-based model, which was also a catalyst for switching. If you look at a pull-based model, all of the control flow lives inside your operators. On the one hand, that's very nice, because it's super flexible, right? In an operator, you can always call your child. You can always say: give me more data, or don't give me more data. But we discovered that you don't really need that level of flexibility in a database system. You can model the control flow of a database system centrally without really losing out on anything. And modeling this as a bunch of recursively called function calls creates a scenario where the call stack is now your state, right? When you call getChunk, that operator calls getChunk on its child, and that operator calls getChunk on its child. So your call stack is actually an important part of your operator state now. And that causes a bunch of problems as well. Then you also have the fact that all the control flow is duplicated inside each and every operator, which means optimizing around the actual control flow gets more complex.
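The source/intermediate/sink split with global and local states might be sketched roughly like this. This is simplified from DuckDB's actual API, and the parallel SUM sink is an invented example; the idea is that the hot path touches only thread-local state, and synchronization happens once per thread when local results are merged into the shared global state:

```cpp
#include <cstdio>
#include <mutex>
#include <vector>

using DataChunk = std::vector<int>;

struct GlobalSinkState { virtual ~GlobalSinkState() = default; }; // shared
struct LocalSinkState  { virtual ~LocalSinkState()  = default; }; // per thread

struct Sink {
    // Called in parallel: each thread pushes chunks using its own local state.
    virtual void SinkChunk(GlobalSinkState &g, LocalSinkState &l, DataChunk &c) = 0;
    // Called once per thread at the end to merge local state into global state.
    virtual void Combine(GlobalSinkState &g, LocalSinkState &l) = 0;
    virtual ~Sink() = default;
};

// Example sink: a parallel SUM.
struct SumGlobal : GlobalSinkState { std::mutex lock; long total = 0; };
struct SumLocal  : LocalSinkState  { long partial = 0; };

struct SumSink : Sink {
    void SinkChunk(GlobalSinkState &, LocalSinkState &l, DataChunk &c) override {
        for (int v : c) static_cast<SumLocal &>(l).partial += v; // no locking
    }
    void Combine(GlobalSinkState &g, LocalSinkState &l) override {
        auto &gs = static_cast<SumGlobal &>(g);
        std::lock_guard<std::mutex> guard(gs.lock); // lock only on merge
        gs.total += static_cast<SumLocal &>(l).partial;
    }
};

int main() {
    SumSink sink;
    SumGlobal global;
    SumLocal l1, l2; // one local state per worker thread
    DataChunk c1{1, 2, 3}, c2{4, 5};
    sink.SinkChunk(global, l1, c1); // would run on thread 1
    sink.SinkChunk(global, l2, c2); // would run on thread 2
    sink.Combine(global, l1);
    sink.Combine(global, l2);
    printf("total: %ld\n", global.total); // 15
}
```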
So that's something that's really nice about the push-based model: the control flow happens in a central location. The operators themselves often get simpler, because they don't need to know about the control flow, right? The projection now just says: oh, I have my input, I execute my expressions, and I go to my output. I no longer need to pull data from my child. And this state, which used to be stored in the call stack, is now explicitly stored in a central location. And that allows some cool optimizations. For example, we have this optimization where we have a cache in between operators. The reason for that is that you can have operators that reduce the size of your vectors, right? Like a filter, or a probe of a hash table. And a vectorized engine is built on this idea that you want to have vectors that are mostly full. Because if you have vectors with one row in them, then you have regressed to a tuple-at-a-time model, right? It's slow again. And what can happen if you have a selective filter is that everything after it is optimized for the fact that you have many elements in your vector. So if you emit only one row at a time from your filter, the rest of your engine gets slower. And this can actually get you into worse performance than a tuple-at-a-time model, because at least the tuple-at-a-time database is built for that; we're not optimized for that at all. So the solution is to add small caches between these operators that can reduce cardinality. And having the control flow in a central location makes that very easy, because you can just look and say: hey, do my vectors have a low number of elements? Then I buffer them, right? Another cool thing about a push-based model is that you control where your results go. And that means you can do scan sharing very easily. If you scan data from a source, you can then push it into multiple sinks, right? Because you control where the data goes, you can push it into one aggregate and also push it into a different aggregate. You could probably hack this into a pull-based system, but in a push-based system it's very intuitive: you just push to different things and you're done. Another cool, oh. Question. Yeah, I just wanted to ask about back pressure. What if the scan is shared by two operators, but the two operators are consuming the data at different speeds? For example, one aggregation is very simple, but the other side of the scan sharing is slower, so there's some kind of mismatch there. So that's a good question. Currently our sinks are kind of blocking in that sense, so we would execute them one after the other, right? So that's not really a problem. If you get into async-IO sort of stuff, then this can be a problem, and you need to apply back pressure somehow. But currently that's not a problem for us yet. Thanks. Yeah, so another cool thing about the push-based model: because your state is stored in a central location, you can stop, right? You can set the state aside and resume later. It allows you to pause execution of a pipeline. And that's useful for different reasons. One potential reason could be that you're pushing data into a sink and that sink has a fixed-size buffer, right? You may want to say: oh, my buffer is full, I want to interrupt the pipeline.
And in a push-based model, that's quite easy, because all your state is stored in one central location. You can just store the state and set it aside for later. In a pull-based model, it's way harder, because, like I showed before, your state is your call stack. And you can't really easily save a call stack. In a similar vein, if you want to do async IO, you need to be able to pause your pipeline while you're doing the IO, and only resume it when you get your callback, when your data is actually ready. And in a push-based model that has the stack modeled as a separate structure, you can again pause execution and resume it later very easily. So those are things that are just very hard to do in a pull-based model. No? No question. Question, yes. So in this parallelism model, is it possible that the shared state becomes a bottleneck when you have very large queries? The question is: could the shared state become a bottleneck if you have parallel queries? So that depends on the operators you're using, of course. It could definitely happen if you have things that do socket connections, stuff like that, right? Usually the only socket connections we have are in our sources, because we don't have a client consuming this over a socket, right? So we don't quite run into that, but it could happen, especially if you have sockets in multiple different locations, essentially. But even for the global state, right? The global state you mentioned for a query: every operator mutates that global state. It's there. So the global state, yeah, that's kept per pipeline, right? And that's what usually holds stuff like the hash tables. So that's where communication between threads happens. It can definitely become a bottleneck, if that's what you're asking. Yeah, all right, thank you. Cool. All right, so that's the execution. Now I want to talk a bit about storage. Can you check my time? All right, I'll try to be a bit quicker. So for storage, DuckDB uses a single-file, block-based storage format, with the write-ahead log stored as a separate file. And we also support ACID. The way we do that is that we have these two headers at the front of the file, which essentially allow us to do a switcheroo within the file. One problem, of course, if you are writing to a file, is that at any point your computer could lose power, and you need to be able to do an atomic switch between two versions. And that's what these two headers are for. We can essentially write all the data we want to the file, and then at some point do the switcheroo where we say: now we're using the new version instead. Our blocks are all fixed-size. And the reason for that is that because we have a single file, we can never really delete the file and start over, right? And we really want to avoid fragmentation within this file. So we have fixed-size blocks of 256 kilobytes. Our tables are partitioned into row groups, and every row group has around 120K rows, which is around 60 of our vectors. And these row groups are the parallelism and checkpointing unit. So when you do a checkpoint, for example, you don't want to rewrite your entire table; obviously, it might be very big. If you have a delete or an update or something, you can instead rewrite only a single row group.
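As a toy illustration of the two-header switcheroo combined with row-group-level checkpointing: everything below is invented for illustration, not DuckDB's real format. It just shows why only dirty row groups get rewritten, and why the header flip is the single atomic step, with old blocks becoming reusable only after that flip:

```cpp
#include <cstdio>
#include <vector>

struct RowGroup {
    int block_id; // which fixed-size block holds this row group's data
    bool dirty;   // were there inserts/updates/deletes since the last checkpoint?
};

struct DatabaseFile {
    int active_header = 0;        // 0 or 1: which of the two headers is valid
    std::vector<int> free_blocks; // blocks available for reuse
    int next_block = 0;

    int AllocateBlock() {
        if (!free_blocks.empty()) {
            int b = free_blocks.back();
            free_blocks.pop_back();
            return b;
        }
        return next_block++; // grow the file
    }

    void Checkpoint(std::vector<RowGroup> &table) {
        std::vector<int> freed; // old blocks; reusable only after the flip
        for (auto &rg : table) {
            if (!rg.dirty) continue;       // untouched row groups stay put
            freed.push_back(rg.block_id);
            rg.block_id = AllocateBlock(); // write new version to a fresh block
            rg.dirty = false;
            printf("rewrote row group into block %d\n", rg.block_id);
        }
        // The atomic step: flip the active header. A crash before this point
        // leaves the old header, pointing at the old blocks, fully intact.
        active_header = 1 - active_header;
        free_blocks.insert(free_blocks.end(), freed.begin(), freed.end());
    }
};

int main() {
    DatabaseFile file;
    std::vector<RowGroup> table = {
        {file.AllocateBlock(), false},
        {file.AllocateBlock(), true}, // only this row group saw changes
        {file.AllocateBlock(), false},
    };
    file.Checkpoint(table);
    printf("active header: %d\n", file.active_header);
}
```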
That's kind of the idea behind these row groups. They're also the unit at which the parallelism is done, right? The sources distribute the row groups over different threads. We also support compression. And that's kind of a critical thing, I think; with a column store, at least, it's very nice to have, because compression works very well with column stores. That's one of the big benefits of columnar storage. And compression is cool because not only does it reduce your file size, it can actually make queries faster as well. One of the reasons is this compressed execution that we're doing, right, with the vectors that I've shown. But another is that it can speed up IO. Sometimes IO is the bottleneck: you're reading from a slow disk, you're reading over a socket connection from the network, and compression can make things faster, as well as saving disk space. So when it comes to compression, there are kind of two schools. You have general-purpose, heavyweight compression, which is stuff like gzip, zstd, Snappy, LZ4. Stuff you're probably familiar with, stuff you've used. That works by just finding patterns in bits; it's data-type agnostic. And then you have special-purpose, lightweight compression methods, like RLE, bit packing, dictionary, frame of reference. Those work by finding very specific patterns within the data. Now, general-purpose compression is great because it's very simple to use, right? You just call a method called compress in the library and you're there. And it works great if you want to save space. However, there is generally a higher decompression cost there, which can slow down query execution, right? You have to decompress this stuff all the time. And in the case of DuckDB, we're actually keeping data compressed in memory. So it's not just about reading from disk, it's about reading from memory. And memory is very fast, so slow decompression there can have a significant impact on performance. You also generally have to decompress these in bulk, because they work by optimizing over blocks of data. So you can't just do random seeks or do compressed execution with them. You need to decompress them, and then it's over. Lightweight compression, because it works by detecting these specific patterns, can actually be much faster than general-purpose compression. And these patterns can then be exploited during execution, like what we saw with the vector formats. There's a downside, which is that these patterns are generally very specific. So if a pattern is not there, it has no effect, no positive effect on your compression ratio. And that means we generally need to implement many of these algorithms; it's not a one-and-done kind of thing. It also means we have to choose between them. So in DuckDB, the way compression works is that per row group, which is our checkpoint unit, we look at each column, and then we go through two phases. We do an analyze phase, where we look at the data and we ask: hey, which method is the best? And then we have a compress phase, which actually does the compression and writes the stuff to disk, right? As for these lightweight compression methods: at this point we support quite a few of them, and we have been adding more and more over time. This was from a blog post that I wrote, linked there, at the end of last year.
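A sketch of that analyze-then-compress selection, with invented interface names: each candidate method first estimates its result size on the row group's column, and only the winner actually compresses. The example data deliberately echoes the earlier audience question about inserting one, zero, one, zero:

```cpp
#include <cstdint>
#include <cstdio>
#include <set>
#include <vector>

struct CompressionMethod {
    virtual const char *Name() const = 0;
    // Phase 1: estimated compressed size in bytes (no writing yet).
    virtual size_t Analyze(const std::vector<int32_t> &data) = 0;
    // Phase 2: actually compress and write to disk (omitted in this sketch).
    virtual ~CompressionMethod() = default;
};

struct Uncompressed : CompressionMethod {
    const char *Name() const override { return "uncompressed"; }
    size_t Analyze(const std::vector<int32_t> &d) override {
        return d.size() * sizeof(int32_t);
    }
};

struct Dictionary : CompressionMethod {
    const char *Name() const override { return "dictionary"; }
    size_t Analyze(const std::vector<int32_t> &d) override {
        std::set<int32_t> unique(d.begin(), d.end());
        // dictionary entries plus one small index per row (assume 1 byte here)
        return unique.size() * sizeof(int32_t) + d.size();
    }
};

int main() {
    std::vector<int32_t> column = {1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0};
    Uncompressed u;
    Dictionary dict;
    CompressionMethod *methods[] = {&u, &dict};

    CompressionMethod *best = nullptr;
    size_t best_size = SIZE_MAX;
    for (auto *m : methods) { // analyze phase: score every candidate
        size_t s = m->Analyze(column);
        printf("%s: %zu bytes\n", m->Name(), s);
        if (s < best_size) { best_size = s; best = m; }
    }
    printf("winner: %s\n", best->Name()); // compress phase would run here
}
```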
We have actually implemented a few more compression methods since then. And you can see this cool effect where, as you add more compression methods, the size of the tables goes down, because there are more patterns to be found. And you can also see that, depending on the dataset, some are more or less effective, right? You can see that dictionary was a very big win. But, for example, the FSST algorithm made a giant difference on the lineitem dataset, but not that much of a difference on the ontime dataset. That's because these compression methods work by finding very specific patterns, and if the pattern is not there, it just doesn't help, right? It doesn't work. Cool. Okay, so that's the storage. I do have a lightning round where I have a bunch of quick one-slide things on other components. But I'm also happy to answer questions if we don't have any more time. We have plenty of time. All right. A quick question: you guys have a buffer pool, right? Yeah. Is there anything special about that, or is it classic LRU? Yeah, so I have one slide on that. Sorry, I could have talked a bit more about this. So we implement a buffer manager that's very similar to LeanStore. It has some differences, though. We have a lock-free queue where we have kind of, sort of, LRU. It's not actual LRU, because that requires a centralized data structure, which you want to avoid. But we have kind-of-sort-of LRU, and the functionality is very similar to what you would expect from a traditional buffer manager. Got it. We did cover the LeanStore paper; what's the takeaway? Cool, yeah. So with LeanStore, the key takeaway is that it works by having as few centralized data structures as possible. The actual handles are stored as pointers to the buffer handles. And that allows you to have a very high degree of parallelism, because the operations are usually on separate buffers, right? One thread reads one row group, another thread reads another row group. So yeah, it's faster for parallelism. Got it. Cool. Any more questions? Wan has a question. Wan, go for it. Hi, could you briefly describe how query cancellation works in DuckDB? For example, can you stop an ongoing hash table build, or would you wait for it to be done? Yeah, so I can definitely describe that. We do support query cancellation. That's actually one of the other advantages of the centralized control flow. Essentially, whenever a bunch of vectors comes out of an operator or goes into an operator, we return control to the central location, and that can then handle the query cancellation. And the nice thing is, because it's a vectorized model, even if you have operators that process a large amount of data, like a cross product or something, at some point you will produce one vector of data, one intermediate. And then the query can immediately be canceled. So query cancellation works very well. Thank you. Hey, I have a question about the storage. Basically about compaction of the storage, because you're using MVCC. So it's possible that older versions are kept around when you only need, say, three of the rows. Could you compact that, for example, by merging into half as many new blocks?
Because about a year ago, when I was looking into the DuckDB issues, there were some users complaining that they were using Python with DuckDB, they ingested a lot of data and deleted the full table, but there was still a lot of memory usage. So I'm not sure if that's fixed. Yeah, so indeed. I have actually been working on this just last week. We don't yet do compaction, but that's mostly just because we haven't gotten around to it yet. We do reuse blocks, right? So whenever you destroy a row group, we mark the blocks as free and they can be reused. So that works, but we don't yet actually go through the effort of making the file smaller. The file can only grow at this point. As for the MVCC: our MVCC is actually only in memory. We don't write the versions to storage. So it's a bit different from the way that Postgres does MVCC, for example. All of our transactions are exclusively in memory, and all the MVCC information doesn't really need to be stored, right? Because once the system shuts down, your transactions are canceled anyway. For the checkpoints, is that something... Obviously, if you close the process, you checkpoint and write everything out, but does it do it automatically in the background? Or when the write-ahead log is full, do you do a checkpoint then? Yeah, so we have a threshold: we checkpoint after the write-ahead log gets to a certain size. And we also do optimistic writing, so when you're loading a lot of data into the system, we won't write to the write-ahead log at all. We will write directly to the database file, directly compressed. That's actually quite new; I think that's half a year old or so. But that speeds things up a lot if you're doing bulk loading, which is a super common use case for us. Got it. He's also a special case. He likes to read the issues of open-source databases in his spare time. That makes it sound like we're stalking you. No? Done. Okay, cool. Any other questions? In the back, yes. So what do you think the role of UDFs is, and how do you see UDFs in query plans? Can you repeat the question, Mark? Yeah, please. The question is: what is the role of UDFs, and what do you think the future of UDFs is in DuckDB? So I think that's a good question. We're actually working on implementing a bunch of different, mostly Python, UDFs right now. And we do have UDFs in JavaScript already, and we're probably going to implement Wasm UDFs as well. And of course, you can write extensions for DuckDB to implement UDFs. So I think there is a bright future for them. I do think there is one thing there, which is that UDFs are actually less important in DuckDB than in a system like Postgres. And that's because you can pull data out of DuckDB so fast, and push it back in so fast, that often the stuff you might want to do with a UDF in Postgres, you can just do by pulling the data into your application instead. Right, good point. Abby, your question. So, a lot of us are implementing our respective research projects in DuckDB. How are extensions implemented? Is there support for extensions in DuckDB? So yes, we have extensions. Here you go, this is my slide on that. So actually, we have been working a lot on extensibility in DuckDB. And I think it's kind of important for us, because one of the things we like about DuckDB is that it's quite small.
It's a compact system, right? And there's a lot of stuff people don't necessarily need. For example, we have this ICU extension, the International Components for Unicode, that does stuff like implement collations and time zones. And that requires a big database of all these languages and such. It's a very useful feature, but a lot of people also don't need it. So it's quite critical for us, especially for the Wasm build that runs in the browser, that you have a small core. So a lot of the stuff that we implement ourselves is actually implemented as extensions. For example, our Parquet reader is an extension. It's an extension that's often bundled, but it's an extension. Our JSON support is fully an extension. So yeah, you can do a lot in extensions. You can build your own extension and then you can load it; it's a shared library. It's somewhat straightforward. There is one thing with extensions right now: if you're writing one now, it ties in quite heavily with our internals, and we are still changing our internals sometimes. So you can get it working with one version of DuckDB and it may not necessarily work with the next one. And we're moving a lot faster than something like Postgres, right? Things are changing at a higher pace. But you can do a lot with extensions. We actually released our geo extension last week, which does a bunch of cool stuff with geospatial algorithms and things like that. So yeah, you can do a lot with extensions. And are your extensions like the Postgres case, where you write hooks, and the runtime knows where the hooks are and calls them? Or is there something more fancy that you're doing? So our C++ extensions actually hook into almost the entire system, which is also why they're so prone to breaking when we change the system, right? We also have C-level extensibility, which is far more stable, but because it doesn't hook into the entire system, it's also less featureful; you can extend fewer things, right? We basically need to implement an API that abstracts this functionality. But you can essentially do a lot of different things. So one cool thing: we have a pluggable file system, so you can implement your own file system. Another thing that's kind of cool is that we have fully pluggable catalogs. This is something I added very recently, a few months back. We have our own catalog, right? It has tables, it has views, it has sequences. But you can have a fully custom catalog within DuckDB. And that means you can create tables within that catalog, you can create views within that catalog. We've implemented this for SQLite. And what that means is you can attach a SQLite database and run queries on it like it is a DuckDB database, create tables in it like it is a DuckDB database. But in the end, it's a SQLite database, right? So you can open it in SQLite. And you can have a fully custom catalog in your extension, essentially. Yes, thank you. Question, yes. Yeah, I wanted to ask, maybe different from what everyone else was asking, but in terms of the long term, where do you see DuckDB? Do you want DuckDB everywhere? Or what's the... Is it like SQLite, everywhere?
Question, yes. I wanted to ask, maybe different from what everyone else was asking, but in terms of the long term: where do you see DuckDB? Do you want DuckDB everywhere, like SQLite is on every cell phone, that kind of thing? What's the large, broad goal? Yeah, I mean, obviously our goal is world domination, right? That's clear. I think it will be hard to beat SQLite in terms of ubiquity; SQLite is running literally everywhere. I also think that analytical systems are probably a bit less generically useful than transactional systems. But we definitely want everything that DuckDB can solve to be solved with DuckDB. So I've already shown you we have a Wasm build; we actually also have a Swift package, for example. So you can run it on your phone, you can run it in your browser. And there is some very cool stuff you can do with that, where you can run DuckDB in places where you normally wouldn't expect a database system to work very well, and now you can use a database system there. So yeah, we definitely want to go to as many places as possible. But SQLite runs on airplanes, because they went and got avionics certification, right? It's going to take years to do that, so DuckDB is not going to run on an airplane anytime soon. It'll be on the entertainment system. Yeah. It can run those types of SQL queries. Yes. What's the most interesting implementation trick in DuckDB? Sorry, the most interesting implementation trick? Like, little things where you're building the system and go, oh, we have to do this thing to solve this problem, but it's not something you would have expected. Does that make sense? Did you run into anything like that? Oh, it's hard to answer what the most interesting one is. There's a lot of stuff that we thought would be one way, and then we had to rewrite all of our code because it wasn't. I mean, the whole execution model is one of those: we switched to a push-based model. Another one is the vector format. Initially we didn't have this fancy vector format, we just had arrays. And then we were like, oh, hold on a second, people actually want to use nested types. So we had to rewrite that. I think almost everything in DuckDB has been rewritten several times. So yeah, it's hard to answer which was the most impactful.
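As a small illustration of the nested types that forced that vector-format rewrite (a toy query of mine, not from the talk):

    import duckdb

    con = duckdb.connect()

    # Lists and structs are first-class values in DuckDB's SQL dialect,
    # which is exactly what the vector format has to represent efficiently.
    row = con.execute("""
        SELECT
            [1, 2, 3] AS xs,                               -- a LIST
            {'name': 'wilbur', 'species': 'duck'} AS who   -- a STRUCT
    """).fetchone()
    print(row)

    # Nested data can be flattened back out with UNNEST.
    print(con.execute("SELECT unnest([1, 2, 3]) AS x").fetchall())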
I guess, yeah. Okay, all right. Mark, we have 10 minutes if you want to finish up. Yeah, I'm happy to stay for a few more minutes, no worries. Oh yeah, go for it. The lightning-round slides, do you want to go through those? Sure, sure, I can. I already went through a bunch of them, but I can cover the ones I didn't. So yeah, out-of-core execution: that's kind of a staple of database management systems, right? You want to do larger-than-memory execution, and most systems support this in some capacity. Just by having a streaming engine you already support this in a lot of ways: if you have a streaming vectorized engine, you can do small aggregates over a larger-than-memory dataset, no problem. However, there are still a bunch of these materializing operators: you have a hash table for your join, you have a hash table for your aggregate, sorting requires materialization, and you have a window segment tree for your window functions. Traditionally, systems have handled this in one of two ways: either they always take the slow path, the on-disk version, or they have a fast path for when things fit in memory and then switch over to a slow path when they don't. So you have an optimizer rule that basically says, oh, I can't do a hash join anymore, I'm going to do a sort-merge join or something, and it's going to be a big out-of-core merge sort that is going to be very slow. So one of the things we have been wanting to do in DuckDB, and I think we're succeeding quite well, is to solve this problem of larger-than-memory execution but with very graceful performance degradation, where you basically say: okay, we try to use as much memory as we can, and if your data just barely doesn't fit in memory, we want to write as little as possible to disk and still use these algorithms that are optimized for memory. Laurens on our team has been working on this; he's actually doing his PhD on it. And he has some very cool tricks where, for example, in the hash join we do radix partitioning and then dynamic repartitioning based on how much data there is, to be able to keep doing a hash join. One of the things that was kind of cool there is that on our MacBooks the disks, the SSDs, are so fast that even when you go massively out of core, like you're operating on an eight-gigabyte dataset with one gigabyte of memory, it still doesn't get that slow. You don't get a performance cliff where suddenly it's a hundred times slower; it can actually still be quite performant. So it's definitely worth doing all these cool tricks. There are a bunch of papers on this, and we have some blog posts as well if you're interested in more.
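A minimal way to see that graceful degradation yourself is to cap DuckDB's memory and run something whose working set exceeds the cap. The settings are from the docs; the sizes and table are arbitrary, and on a fast SSD this spills rather than falling off a cliff.

    import duckdb

    con = duckdb.connect()

    # Cap the buffer manager well below the working set and give DuckDB
    # somewhere to spill; operators like the hash aggregate and hash join
    # then partition and spill instead of failing.
    con.execute("SET memory_limit = '1GB'")
    con.execute("SET temp_directory = '/tmp/duckdb_spill'")

    # A grouped aggregation over more data than the limit allows in memory.
    con.execute("""
        CREATE TABLE big AS
        SELECT range % 1000000 AS k, random() AS v
        FROM range(50000000)
    """)
    print(con.execute("SELECT k, avg(v) FROM big GROUP BY k ORDER BY k LIMIT 5").fetchall())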
Yeah, so the transactions, like I said: it's based on HyPer's MVCC model, but optimized for batch operations, for vectorized processing. We support snapshot isolation and we do optimistic concurrency control. I won't go too much into detail there; there's a paper you can read. And if you don't know what MVCC is, or you haven't had it in the course yet, there's more to learn in the paper. They should know. Sorry, they should know. Yeah, they know. Excellent. I think one of the killer features of DuckDB is our support for external formats. That's something that has been very powerful since the beginning, and is even more powerful now. Our table functions, our data ingestion and scans, are super flexible, and our own scans use the same extensible scan interface that everything else does. Our scans are not special. And I think that has allowed us to build very efficient integrations with a lot of different data sources. One of the cool use cases of DuckDB is that you can really go to and from very different data formats, and you can also use this to do data conversion: you can load Parquet files into SQLite using DuckDB, you can run queries on pandas data frames and convert the results to CSVs. So there's this cool combination that this enables. And I think it works very well together with our in-process nature, because we can query in-memory Arrow data structures, or in-memory pandas data frames, or R data frames directly, ingest them into our engine in this vectorized manner, and combine data from different sources as well. Although there are some unique optimization challenges that you then run into, because one thing that's very easy if you operate on your own tables is that you're in charge of gathering the statistics, so you know what you have available. Whereas if you operate on CSV files, suddenly there are no statistics. If you operate on a pandas data frame, okay, you have a cardinality, but that's it. And if you operate on a Parquet file, you have min-max indexes, but maybe not everything you'd have in your own storage. So doing join order optimization under these constrained-statistics cases is an interesting problem: how do you do that efficiently? Yeah, and the pluggable catalogs, like I mentioned, and the file systems as well. One cool thing: you can read data over S3 or HTTP in DuckDB, and that works through an extension, through our file system layer. You can inject a file system into DuckDB that intercepts all the open calls, the read calls, the write calls. And that allows us to have this extension that implements an HTTP file system and an S3 file system, and lets you read data from remote sources, which is pretty useful as well. Extensions I've been over, and Wasm too. And yeah, those were my remaining lightning slides. This stuff you already know: we have a website, we have source code, and if you want to look at our issues, open new ones, or open PRs, you're more than welcome. So once again, thanks for having me, and if there are any more questions, I'm happy to take them. Question, yes, go for it. So, DuckDB is targeting a use case close to what data frame systems target, like pandas in Python and data frames in R. I wonder if in the future DuckDB will support an API similar to those data frame systems, because from my observation, DuckDB still doesn't perform as well as those systems on some very simple queries; you have to go through the traditional database steps, parsing the query, optimization, and so on. So is it possible that DuckDB could provide a data-frame-like API, where users can treat a table as a data frame and retrieve data from it by row ID and column name, and call the DuckDB API in that data frame style? Yeah, so we actually already support this, and have supported it for a while. It's obviously not our main focus, our main focus is SQL, but we do have a data-frame-style API, and the way that works is that it essentially constructs DuckDB query nodes based on the API calls. So it's a great point: we do have it, at least in the Python and C++ clients, and I think in the R client as well, but not in all the clients. Yeah, it's very useful. I think also, in some sense, data frame libraries are competitors, because they can do kind of the same thing, but we also work very nicely together with them. That's one of the cool things about DuckDB: the interop is very nice. You can use DuckDB to run SQL over a pandas data frame, for example. You don't need to use either/or; you can use both.
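For the curious, a sketch of what that data-frame-style API looks like in the Python client (method names per the current docs; the table is made up). Each call builds DuckDB query nodes, exactly as described above, and nothing executes until you fetch.

    import duckdb

    con = duckdb.connect()
    con.execute("CREATE TABLE trips AS SELECT range AS id, range % 7 AS dow FROM range(100)")

    # Compose the query step by step; the optimizer still sees the whole plan.
    rel = (
        con.table("trips")
           .filter("dow IN (0, 6)")
           .aggregate("dow, count(*) AS n", "dow")
           .order("dow")
    )
    print(rel.df())  # materialize as a pandas DataFrame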
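Going back to the format-conversion point: the whole to-and-from flow fits in a few lines. The file names here are made up.

    import duckdb
    import pandas as pd

    con = duckdb.connect()
    df = pd.DataFrame({"city": ["Pittsburgh", "Amsterdam"], "temp": [12.5, 9.0]})

    # Query a pandas DataFrame directly and write the result out as Parquet...
    con.register("weather", df)
    con.execute("COPY (SELECT * FROM weather) TO 'weather.parquet' (FORMAT PARQUET)")

    # ...then read the Parquet file back and convert it to CSV in one statement.
    con.execute("COPY (SELECT * FROM 'weather.parquet') TO 'weather.csv' (HEADER, DELIMITER ',')")
    print(con.execute("SELECT * FROM 'weather.csv'").fetchall())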
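And the pluggable file system in action: the httpfs extension plugs HTTP and S3 into the same open/read/write hooks, so remote files look like local ones to the scans. The URL below is a placeholder, so point it at a real Parquet file before running; S3 credentials would go through settings the same way.

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")

    # Placeholder URL: any publicly reachable Parquet file would do.
    url = "https://example.com/some/file.parquet"
    print(con.execute(f"SELECT count(*) FROM read_parquet('{url}')").fetchall())

    # For S3, configuration also goes through settings, e.g.:
    #   con.execute("SET s3_region = 'us-east-1'")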
What does the query optimizer look like? Is it top-down or bottom-up? So we have a bunch of different pieces. We've actually been meaning to refactor it, because we have very structured expression rewriters, and those have nice rules and so on, but our logical operator optimizer, our query tree optimizer, is kind of messy. We have a mix of rule-based and cost-based: the join order stuff is all cost-based, of course, and then we have some rules, but it's a bit mixed and messy. We need to rewrite it. Okay. All right, next question: what's the most surprising use case you've seen somebody use DuckDB for? Ooh, that's hard. I think when the Wasm stuff came out, that was quite surprising to me, because I was like, oh wait, you can run this in a browser. And then I was like, oh wait, it's actually useful. Because people are doing things like running visualization tools in the browser, right? And then being able to load your data into your browser and do local data transformations there is very powerful, because you don't have to round-trip anymore. So that's definitely one of the things that surprised me the most; I hadn't thought of it. All the data science stuff, we had thought of that, that's why we made it. But this was definitely one of the surprising things. Okay, awesome. All right, one last question. The question is basically about the paper you gave us: you say that you built the system to keep track of its resources. My question is, how does DuckDB keep track of resources? For example, if you run out of memory, how do you handle that? Do you have a fixed memory budget that you assign ahead of time? Yeah, so for threads, we have our own thread pool that we use to manage the threads. And because DuckDB can be used as a library, if users already have their own thread pool, we allow them to integrate with ours, so their thread pool can work on the tasks. For memory, we have our buffer manager, which does indeed have a fixed memory limit, and it will start unpinning things, if possible, when that memory limit is exceeded.
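A sketch of those two knobs from the embedding application's side (setting names per the docs; the values are arbitrary):

    import duckdb

    con = duckdb.connect()

    # Cap the buffer manager; DuckDB starts unpinning and evicting past this.
    con.execute("SET memory_limit = '4GB'")

    # Size DuckDB's own thread pool to play nicely with the host process.
    con.execute("SET threads = 4")

    print(con.execute("SELECT current_setting('memory_limit'), current_setting('threads')").fetchall())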
That's my favorite answer. All right, all right, all right. Thanks, Mark!