So, we are super excited to have Mark Raasveldt from CWI come give a talk for us today. Mark did his PhD at CWI and is now halfway between being a postdoc and being a PhD student, because you haven't defended yet. "Yeah." But you are officially a postdoc. Are they paying you like a postdoc? "Sorry?" Are they paying you? "They're paying me almost like a postdoc." See, you are in between. "I'm in between. I'm a pre-doc." Got it. Nice. No, no, you're a pre-postdoc. "A pre-postdoc." Okay. So, DuckDB is, in my opinion, probably the most exciting academic database system to come out in the last two years. As the title says, it's SQLite for analytics, so I don't think there's anything else to say. I think Mark's awesome, DuckDB is awesome, so go for it.

"Well, thank you very much." And actually, the way we'll do this is that people can just unmute themselves and ask Mark questions as we go along. "Yeah, no problem. At any point, just ask. Don't literally shout, it is a bit late here, so I guess my neighbors wouldn't like that, but just ask, that's fine. All right. So I guess I don't need to introduce myself; I could have skipped this slide. Thank you for the introduction and the invite, Andy. I'm very happy to be here. So yeah, Hannes and me: in this picture you can see Hannes and me down there drinking beer at (almost) Oktoberfest. We made DuckDB, the database system, and I'm here to present it today."

Just to interrupt: go back to that picture. For everyone, in the middle, that's Peter Boncz. That's the inventor of Vectorwise, another one of the amazing systems CWI cranks out. "Yes, yes. So Peter is not super involved in actually building DuckDB in terms of the code base, but we do definitely ask him for advice quite a lot. He is way too busy with tons of other things; he has no real time to code that much anymore. But we definitely talk to him a lot. I mean, we used to share an office before all this mess.

All right. So basically I have two parts to this presentation. The first part is a talk I already gave at CIDR, because, well, it's too late to prepare anything new. And then I also have a bonus part, lessons learned building a database system, which I think is interesting, or could be interesting, to people that might be building a system right now. Like Andy.

All right, so before I talk about database systems, I first want to give a bit of motivation for why we built DuckDB in the first place, and that motivation kind of starts with data science. Well, what is data science? I'd just call it statistics on a MacBook; there are a thousand different words for it. It's a very overloaded term. But one thing that's kind of obvious about data scientists is that they work with data. It's in the name, right? And then we have us, the database researchers. In our field, we've been working for more than 50 years, I think, on perfecting database systems. So obviously data scientists must be using database systems, right? Well, unfortunately, no. They use them if they are forced to, for example if they have to get data out of a database system in the first place, but they generally prefer to avoid them, which is a bit sad, right? It makes the databases sad. So here's a quote from Hadley Wickham. I can't actually read it, because I blocked my quote with this overlay: 'If your data fits in memory there is no advantage to putting it in a database.
It will only be slower and more frustrating.' If you don't know Hadley Wickham, he's kind of the R guy. He made basically all the popular R packages, like ggplot2 and dplyr; he's a huge figure in data science. And he is saying this about our field, about our life's work in a lot of cases. So it's a bit sad.

So why is he saying that? Well, maybe data scientists just don't need database technology, right? So what do data scientists actually do? Probably the easiest way to figure that out is to look at the packages they use. If you look at the popular data science frameworks in Python, you see the top two are NumPy and pandas, which is pure data wrangling. In R it's very similar: the top two are dplyr and tibble, and down there is data.table. It's all data wrangling libraries. And what do they all do? Well, they do database operations. They sometimes do a bit more, but some of them do just database operations, or even less than that: joins, aggregates, basic transformations, filters. So it turns out that data scientists do need all of this functionality that we are building into our database systems, but somehow they're not using them.

So let's go back to this quote. Why does he think that database systems are slow and frustrating? I mean, we've been working on optimizing database systems for what, 50 years or more? Not me personally, but our field. To figure it out, probably the easiest way is to just run a data science workflow with and without a database system and see what happens. A typical data science workflow is kind of simple: you load some data from a CSV file, you do some ETL, some preprocessing, and then you run your analysis. For the example we ran here, we have some voter data from North Carolina, just a dataset we have lying around, which is around 300 megabytes in CSV format. Then we do some, in this case, nonsensical preprocessing. Oh no, this part is not nonsensical: we remove entries with a missing phone number or house number. And then we compute the correlation between phone number and house number. That is the nonsensical part. I couldn't read it again because this overlay is blocking my view. Of course, the correlation between phone number and house number is expected to be zero.

So let's do this in R. First we install some required packages, say data.table and dplyr, some basic wrangling packages. Then we run our analysis: first read the CSV file, then apply the filter, then check the correlation. Three steps, three lines of code, and indeed we see no correlation between phone number and house number. Surprise. It's quite easy and convenient.
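(For reference, the talk's demo is in R; the same three-step workflow looks roughly like this in Python with pandas. This is just an equivalent sketch: the file name and column names are made up, since the actual dataset layout isn't shown.)

```python
# The three-step workflow from the talk, sketched with pandas.
# File and column names are hypothetical.
import pandas as pd

# 1. Load the ~300 MB CSV.
voters = pd.read_csv("ncvoters.csv")

# 2. Preprocess: drop rows with a missing phone or house number.
voters = voters.dropna(subset=["phone_num", "house_num"])

# 3. Analyze: the (expectedly zero) correlation.
print(voters["phone_num"].corr(voters["house_num"]))
```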
So what happens if you do this with a database server instead? Well, for that let's just use Postgres, because it's kind of the open-source database. First we have to install the database server, and this is not actually trivial without a package manager. For us, at least, on our cluster we don't have root access, so we have to download the sources and compile. Then, after we do that, we have to set up some cryptic environment variables, initialize the database, and use this very descriptive pg_ctl command, because when this was created they had to save bytes on command names. Then we install the client. At least this is easier, right? We just install it through the R package manager. Except it doesn't work, because the client actually requires you to have the database server itself on your PATH for some reason. Okay, fine. We put the server on the PATH, install the client again, and then we run it.

Now we connect to our database system, providing the host, the database name, the client, whatever. We create a table, and I'm actually being nice here, because you see these dots: this is around a hundred different columns, because it's a very wide dataset. It's a very, very big statement, actually. Then we send a query where we use string concatenation to construct a COPY command to read the CSV file. Then we can run a DELETE on the table to apply the filter. So you can already see that all of these steps are pretty annoying. The installation is more difficult, the usage is more difficult; there's a lot of pain here. And what about the time? Well, it's a lot slower as well. And the whole reason it's slower is this client-server connection, where you are pushing this whole table, which is around 300 megabytes, over a socket connection, and that costs you like 10 extra seconds here. So it's annoying and slow. In the end, Hadley is right: for these types of workflows in R and Python, database systems are frustrating and they are slow.
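(For concreteness, here is roughly the shape of those steps from Python; the talk's demo used R's Postgres client, so this psycopg2 sketch, with made-up connection details and an abbreviated column list, is only an illustration of the ceremony and of where the slow socket transfer happens.)

```python
# The same analysis against a running Postgres server, sketched with
# psycopg2. Table, column, and connection names are hypothetical, and
# the ~100-column CREATE TABLE is abbreviated.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="voters", user="me")
cur = conn.cursor()

# Create the (very wide) table by hand...
cur.execute(
    "CREATE TABLE ncvoters (phone_num BIGINT, house_num BIGINT"
    " /* ...roughly 100 more columns... */)")

# ...bulk-load the CSV through a COPY command...
with open("ncvoters.csv") as f:
    cur.copy_expert(
        "COPY ncvoters FROM STDIN WITH (FORMAT csv, HEADER true)", f)

# ...filter server-side, then ship ~300 MB over the socket so the
# analysis can run client-side. The transfer is where the extra
# ~10 seconds go.
cur.execute("DELETE FROM ncvoters WHERE phone_num IS NULL OR house_num IS NULL")
cur.execute("SELECT phone_num, house_num FROM ncvoters")
rows = cur.fetchall()
```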
So what are the problems here, and how can we fix them? The problems we have identified are that database systems are difficult to install, difficult to set up and maintain, they have really slow data transfer to and from the client, and they have a poor interface with the client application. You have this client-server architecture, which is nice for multiple users, but for a single user it's pretty pointless; it just makes everything more cumbersome. The installation needs to be easy: if you use Python, you need to be able to install it through pip install. Anything else is already way too cumbersome, because then you're looking at OS-specific package managers and fiddling with compilers, and that's just not happening. It's too much.

Transfer efficiency is very crucial, which we also have a paper about, at VLDB some years ago; actually, the Oktoberfest picture you saw before was taken at that VLDB. The database client protocols are not optimized for this kind of bulk-transfer scenario, and the transfer is necessary, because data scientists can't run everything within the server. They need to use their machine learning framework of the week, of the month, whatever, to run their data science workload; you're not going to do a deep neural network within SQL. So they need to transfer this data at some point.

And finally, SQL. It's a lovely language, and I very much like SQL, but as an interface for clients it's kind of problematic, because you have to embed SQL as a string, which is unnatural: you don't get syntax highlighting in most editors, and it has introduced a whole class of security vulnerabilities into this interface, which we call SQL injection. It's also not that easily composable. If you look at dplyr, for example, you can chain operations: you have a filter and a projection, you can chain them, and you can look at intermediates by just pushing this head function in between at any point. In SQL you can kind of do that with temporary tables, but you have to write a lot of cruft; you see there's an extra SELECT * FROM being pushed everywhere. There's a lot of cruft around it that you just don't need in this dplyr-like interface.
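(The composability point, sketched in Python for continuity: pandas method chaining plays the role dplyr pipes play in the talk. Each step returns a new frame, so you can drop a head() in anywhere to peek at an intermediate, with no temporary tables and no extra SELECT * wrappers. Column names are the same hypothetical ones as above.)

```python
# dplyr-style composability, sketched with pandas method chaining.
import pandas as pd

df = pd.read_csv("ncvoters.csv")

result = (
    df.dropna(subset=["phone_num", "house_num"])  # filter
      .head(10)                                   # peek at the intermediate
      .loc[:, ["phone_num", "house_num"]]         # projection
)
print(result)
```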
So, to recap: database systems are not a good fit for data science. Why do we care? Well, the answer is simple: money. Data science is very popular, so we can make money, we can get grants, and then we're happy. But maybe a more fundamental reason is that data scientists are kind of reinventing database systems. They're reimplementing joins and aggregates; they're recreating a database system, and doing a fairly poor job of it. They're basically ignoring all the research that we in the database field have worked on in the past 50 years. They don't use optimizers, they materialize huge intermediates, they don't support out-of-memory computation, they don't have any parallelism, they don't scale up. So they are making a poor man's database that is winning, that gets a lot of usage, just because it has a good interface, a good connection to the R and Python communities. And databases can solve all of these problems.

"Mark, may I ask a question, if you don't mind?" Yeah, of course. Go ahead. "By the way, my name is Ling; I'm a grad student here. The previous few slides, and the early part of your talk, made me think: at the beginning you talked about the most popular libraries in Python being NumPy and pandas. From my own perspective, when I'm using those libraries, the functionality of NumPy and pandas feels a little bit different. In pandas you have lots of database-like operations: filters, string transformations, et cetera. But in NumPy, a lot of the time you are using arithmetic or numeric operations, like matrix work; that's a different side of it, the broadcasting, or you have this very convenient interface to quickly index a matrix at different rows and columns. I think that part would be a very important reason for NumPy's popularity." You definitely have the matrix stuff as well; that is an excellent point. Indeed, that is different. But still, a lot of the stuff in NumPy is also present in database systems, and that's also kind of the reason you need this convenient bridge between the two: once you start wanting to do things like matrix multiplication, for which the relational model is maybe not so suitable, you can easily go back and forth between what each system is good at, the strengths of each system. "Okay, that makes sense." Thank you for your question.

All right, so where was I? Here, I think. Yeah, this was actually kind of the bridge between the two things, indeed: databases are slow and frustrating, but they don't need to be. Cue my database, DuckDB, our database, which we call the SQLite for analytics. The reason we call it the SQLite for analytics is that SQLite is really the most easy-to-use database, if you have used it, and we really, really like it. But it has one big flaw, and that is that it's quite slow once you have to do any sort of analytics. So the core features of DuckDB are actually very similar to SQLite, with a few big exceptions. There's very simple installation; I'll go a bit more into depth there. It's embedded, so you don't have any external management. We have a single-file storage format, just like SQLite. We have fast analytical processing, and we have very fast transfer between Python and the database system, together with a convenient API for it.

The first question you always get is: why "Duck"DB? Well, ducks are amazing animals. They fly, they walk, they swim. They're resilient; they can live off anything. And Hannes used to own a pet duck called Wilbur. Here's a picture of Hannes, my former supervisor, now co-worker, still kind of supervisor, with his duck, on the boat where he lives, because he's a bit crazy, but a good guy. "What do you mean, used to own a duck? Where's the duck now?" The duck found greener pastures, to form a little duck family, we assume. "Wait, so it was a pet duck at his boat house, and then one day it was just, I'm out of here?" Well, the duck had complete freedom. It wasn't in a cage; it would just swim around in the canals, fly around, fly back. Sometimes it would be gone for a day or two, but it would usually come back, and at some point it just didn't come back. So we assume it met a lady duck and made a beautiful family of ducks. "Did the duck disappear before you guys started DuckDB?" Yeah, it did. "That's a great backstory for a name." Thank you. It was a very sweet duck; I watched it grow up. So I'm still missing Wilbur.

All right, so DuckDB is a very young project. I made the first commit one year and nine months ago, on the 13th of July 2018. As you can see, it's a very bare-bones commit: I just added a parser. The initial system was actually heavily inspired by Peloton. When I started making DuckDB, I just looked at the parser and the binder from Peloton and took about half of it. Here you can see a code snippet from the early DuckDB, and if you are familiar with the Peloton code base, maybe it looks a lot like it, because it is actually that. Many names are still present: we have a DataTable, BindContext, StringUtil. By now it's mostly been rewritten, besides some small things: the StringUtil methods are still there, and the exception types are still the same.

So here are the internals at a glance. It's a column-store database; surprise, it's for analytics. It uses vectorized processing, which is actually probably a surprise to some people who prefer JIT compilation. There are a lot of reasons why we chose vectorized.
We have had a lot of debates about this. The main reason we decided on it initially was that we wanted to avoid the compiler dependency, which is very problematic if you want the user to just install the system. But there are a lot of good sides to vectorized processing still; when you compare it with JIT, they both have their advantages. We use an ART index. We use MVCC; it's inspired by the one in HyPer, but it has been adapted to be optimized for a vectorized engine. And we use single-file storage, which looks a bit like this. It's written in C++, and we use the Postgres parser, which was heavily inspired by the Peloton wrapper of that parser.

"Can you talk about the changes you made to the HyPer MVCC to deal with a vectorized engine?" Well, it wasn't so much to deal with the vectorized engine as it was to deal with bulk updates and columnar updates. I'm actually still looking to write a paper about this, but I'm kind of procrastinating. The thing is, the HyPer MVCC has one big pitfall, which is that it always copies entire rows to the undo buffers. I don't know how familiar you are with this. "We implemented the same thing in our new system." Okay, so it has this big pitfall of copying entire rows to undo buffers, which for inserts and deletions is kind of fine, but for updates gets pretty messy, especially when you have to deal with large columnar updates. Because if you run an update that touches, say, 10% of one column, you're copying 10% of your table into an undo buffer, and it might be one column out of a hundred columns. "I think, Matt, we fixed that, right? We don't do that either. Yeah, okay. And the other issue with the HyPer MVCC is the single writer, and I think we got around that too." The single writer, yeah; they only have one writer at a time, or something, there's some aspect of it. We don't really have this; at least, we have parallel writes to different tables, and also to the same table, kind of, with some switcheroo on some latches in there. "Yeah, but one thing that we do have is one committer. Maybe that's it: one writer to the write-ahead log, one committer. That's what we do." No, we don't have that either, I'm not sure. But we are also not really that concerned with very small point updates or point writes. We are mostly concerned about bulk, and that's why we made these changes to the MVCC, to deal with bulk updates specifically. So we have different versioning as well: our versions are actually vectors, not individual rows, also to deal with this. I will write a paper about this at some point. I also have a bunch of slides I can send you if you want to take a look. "Yes, we should. The student who wrote the MVCC stuff with Matt is not here, he's off at MIT, but we should swap notes at some point." Yeah, for sure. That sounds excellent. I spent quite some time on the MVCC, first on implementing it in the first place and then on revamping it, so I'm definitely interested.
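(Mark's columnar MVCC design is unpublished, so the sketch below is only a conceptual illustration of the idea as stated in this conversation: undo data is kept per updated column vector, not per row, so a bulk update to one column of a 100-column table copies only that column's touched values. All names are hypothetical, and it is written in Python for brevity; DuckDB itself is C++.)

```python
# Conceptual sketch of per-column, per-vector undo versioning.
# Hypothetical names throughout; not DuckDB's actual implementation.
VECTOR_SIZE = 1024

class ColumnUndoEntry:
    """Pre-update values of ONE column over one vector of rows."""
    def __init__(self, column_id, row_start, old_values, txn_id):
        self.column_id = column_id
        self.row_start = row_start    # first row covered by this vector
        self.old_values = old_values  # at most VECTOR_SIZE values
        self.txn_id = txn_id          # version/transaction identifier

def bulk_update(table, column_id, row_start, new_values, txn_id, undo_log):
    # Save only the overwritten slice of this one column (a row-based
    # scheme would copy the full width of every touched row)...
    end = row_start + len(new_values)
    old = list(table[column_id][row_start:end])
    undo_log.append(ColumnUndoEntry(column_id, row_start, old, txn_id))
    # ...then apply the update in place.
    table[column_id][row_start:end] = new_values
```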
Okay, this slide I accidentally copied here again... ah, no, actually I didn't accidentally copy it. So let's go back to the database problems we had before. How does DuckDB address these problems? We had: difficult to install, difficult to set up, slow data transfer, and a poor interface.

First, installation and usage. Here we actually learned a lot of lessons from SQLite. First off, it's an embedded system. It has zero external dependencies, so all the dependencies are completely inlined. We have an amalgamation build of the system, which means we can compile all the source files you need to build the whole system into two files: a header and a source file. It takes some time to compile, like a few minutes, but it does work. We have also integrated DuckDB entirely with the package manager, so you can just type pip install duckdb and it will fetch the latest version.

"Does that actually compile it when you do pip install? What does it do?" Well, it depends on the system. I think on Linux it compiles; on OS X and Windows it fetches a binary, so it just gets a DLL and dumps it there. "Also, from my Python background at Yelp: you can actually get wheels. Wheels are distributed binaries; we did that very often to speed up our Docker build times." Yeah, we use wheels as well, indeed. I didn't actually set this part up myself, so I don't have that much insight into the setup, but we do use wheels, yes. "So for Windows that environment should be pretty consistent, I guess OS X as well, but what happens if they have some funky variant, you know, they're using the North Korean version of Linux, and they try to install this and it doesn't compile? Do you get an error and a message, or does it just fail and you don't know?" "Usually the way Python sets up this type of C++ dependency is that it will fall back on compiling: if it can't find a wheel, it invokes the system compiler to compile from source." "But the question is: if it falls back to compilation, you're running on a weird flavor of Linux, and compilation fails, does DuckDB get notified that it failed, or nothing?" Only if they open an issue. "Got it. Okay." We don't have any sort of reporting back; we don't believe in that, not invasively, let's say. "Yeah, let's keep it that way."

All right, so one more thing on simple installation: DuckDB does not rely on any external state, which is to say separate config files or environment variables or locale things, which can be pretty messy otherwise. And we have the single-file format, which users typically find easier to handle; see, for example, Microsoft Office.
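(Putting the installation story together: the end-to-end experience being contrasted with the Postgres steps above looks like this with the duckdb Python package, a minimal sketch after pip install duckdb.)

```python
# Whole install-and-use story: `pip install duckdb`, then:
import duckdb

# A single-file database; use ":memory:" for no file at all.
con = duckdb.connect("voters.duckdb")

con.execute("CREATE TABLE t (i INTEGER)")
con.execute("INSERT INTO t VALUES (1), (2), (3)")
print(con.execute("SELECT SUM(i) FROM t").fetchall())  # [(6,)]
```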
Then we also have a composable interface, which is still kind of a work in progress. This interface connects you directly to the DuckDB system without having to type SQL queries; you can still type SQL queries, we support both. And the idea is that you can compose these sorts of operations. So if you want to have, say, a table scan followed by a filter followed by an aggregate, you can write this sort of db.table(...) with the name of the table you want, which gives you a table scan. Then you build a filter on top of that with an expression, and then you can inspect any intermediate there, and run an aggregate on that again. All of this is relation in, relation out, and it's a very thin mapper over the logical operator tree we have internally.

One nice part about being embedded is that we can integrate very easily with things from the host environment. So we have a scan of a pandas DataFrame, and we can also produce a pandas DataFrame as a result. The scan of the pandas DataFrame is basically zero-copy, except for strings, which are a bit nasty. So that will run a scan of a pandas DataFrame through the entire DuckDB query engine, and on the other end it can just be a pandas DataFrame again, and you benefit from the vectorization and from our entire engine. Basically, our engine can run whatever you want on a pandas DataFrame. We can also run plain SQL on a pandas DataFrame, no problem, which is quite nice.

"Sorry for the interruption. I just wonder, how can you directly scan or append a DataFrame? Is your storage format compatible with the DataFrame format, or do you do some transformation?" It's not about the storage format; it's about our scan operator, which emits vectors, basically, and these vectors can point to an array of data. For integers and doubles and such, it's just an array of integers or doubles or what have you, and pandas has the same array internally, right? Because we also run inside the process of the user, we can just take that array and run on it in our system. For strings it's a bit more complicated, because they have Python string objects, a bit more of a mess, but for numbers you can do it without any copying: just set some pointers and go. "I think the answer to Ling's question is that DuckDB knows about the DataFrame format." Yes, yes. We have a function which is essentially a table-producing function that has a pointer to a DataFrame and then just scans that in memory. "But according to Mark, it sounds like that DataFrame format is just arrays, I mean, except for strings." Internally, a pandas DataFrame is just a bunch of NumPy arrays, right? And the NumPy arrays are just C arrays, and we also use C arrays.
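(Both pieces together, sketched with the duckdb Python package: SQL straight over a DataFrame, and the composable relation interface. The relational API was described as a work in progress at the time of the talk, so treat the method names below as indicative of today's package rather than as the exact interface shown on the slide.)

```python
# Zero-copy pandas integration plus the composable relation interface.
import duckdb
import pandas as pd

df = pd.DataFrame({"phone_num": [1555.0, None, 1777.0],
                   "house_num": [12.0, 7.0, None]})

# Plain SQL over a DataFrame: the scan reads the NumPy arrays in place,
# and .df() materializes the result as a DataFrame again.
print(duckdb.query("SELECT COUNT(*) AS n FROM df").df())

# Or compose relations step by step, peeking at intermediates as you go.
rel = duckdb.from_df(df)
filtered = rel.filter("phone_num IS NOT NULL AND house_num IS NOT NULL")
print(filtered.df())                                        # inspect intermediate
print(filtered.aggregate("corr(phone_num, house_num)").df())
```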
That was fun So Did you did you use like icu like no, we actually have um utf 8 proc it's called And it's a very lightweight alternative to icu that's created by the developers of julia if you know this um And it doesn't actually support all of the things that icu does but it comes quite close it supports like um String composition decomposition it supports like a graphene cluster detection these sort of things Oh, it's a single header. It's not Oh, yeah single header. Yeah Impressive. All right cool. So this is what we use because we really don't want to depend on icu So we don't actually support full string collation either like the full like all locales shits not exactly, but A we probably will add that as like extension at some point because well Uh, you don't want to link icu just for that. I think I think that's a big waste. It's a big thing Uh, yeah, so drb free open source burn pre-release. We have a website github feel free to try it or Take parts of it if you're by chance making a database system And if you find anything that breaks, please send us a bug report and then That was the end of my cider presentation But I think I will continue a bit because I still have some lessons learned that I think might be interesting for people That might be coincidentally also developing a database engine um potentially allegedly All right, so just some quick notes Testing what we have learned is that for making a database system. It's probably the best to always just test sequel Uh in in a way database developers We are kind of blessed because we know the requirements of our system, right? It's to run sequel correctly And the internals like the internals of the system. They're only useful if they work towards that goal of running sequel so Testing internals it's it's kind of easy to do But it's a bit of a trap we have found It makes changing them later on very difficult and you will kind of need to change them if you want to Refactor them at some point um, so some kind of uh History, maybe we refactored I refactored slash we wrote the vector implementation twice wants to add support for arbitrary vector buffers wants to add support for compressed execution I reworked the binder like three times to get arbitrary sub queries to work Uh, that's annoying That's the work. Yeah, so if you want actually to take my or Our binder feel free. It's does support arbitrary sub queries arbitrary correlate sub queries So it's it's quite i'm pretty happy with the implementation. We have now There's still a bit. I want to change but it's it's kind of funny the binding, right? It's one of those things nobody talks about but it's actually quite complex like it's a lot of things that happen Yep, um, yeah, so the planner as well. We wrote we wrote a bit And I think when we were rewriting this we were quite lucky that we did not have many internal tests And the ones we did have they actually cost us some pains Like we had a bunch of tests that tested the vector represent the vector things directly And I had to like fix those while it was refactoring was just kind of annoying I ended up just removing a lot of those So I think test sequel is kind of the holy grail And it it actually makes it a lot easier later on to refactor your system um Another point make testing fast Um, if testing is slow, you don't do it as often. It's also kind of developer time. 
you're sitting there waiting for it. So we have split testing into a fast and a slow path. The fast tests only test functionality on small datasets, and they actually test a lot. You can see here: the fast tests run in seven seconds in release mode, they trigger something like 150,000 assertions, and I think they get around 92 or 93% code coverage, something like that. But they test only small datasets, so it goes very fast. Then we have a slow path that tests larger datasets and tests concurrency, where we have parallel updaters and parallel inserters, where we test things like sending SIGKILL to the database to see if it gets corrupted, those sorts of things. That's all in the slow path; those run as well, but when you are just developing something or refactoring, you just run the fast tests, and for us that's very convenient.

"What unit test framework are you using there? Is that homegrown, or is that Catch?" It's Catch, yeah. We really like Catch because it's, again, no external dependency; it's just one header. One annoying thing, though, is that because it's one header, it slows down compilation somewhat, which is actually why we use unity builds now: any source files that are together in the same folder, we compile together as one batch, because otherwise the compilation takes about twice as long. It's not super long, but it's kind of annoying if you're running on a MacBook from five years ago; it's going to take at least three or four minutes for a full compile, so it blocks you, takes you out of the flow.

All right, so: verification code. I think that's also very important. In the fast tests, we actually run every SELECT statement four times. Once as the original query, obviously. Then we run the query with all optimizers turned off. Then we run the copied query: we have a copy method for expressions, for select nodes, for everything you can copy, so we copy the query and run it again, to verify that the copy works. And then we also serialize and deserialize and run it again, to verify that the serialize and deserialize methods work. I think this is very nice, because copy and serialize/deserialize are very hard to test for otherwise, but this basically makes it just work: if you forget to copy one element somewhere, your query is not going to return the same result, and you just catch it.

"Wait, when you say unoptimized, do you mean unoptimized C code, or unoptimized as in the plan?" Unoptimized SQL. We turn off the whole optimizer; we run whatever comes out of the planner, straight. "Got it, okay." This is why we only do this for the fast tests, obviously, because it also means we don't run join order optimization, which means we usually just run a bunch of cross products. Which is actually great for catching bugs in your join implementation, because now any join will be run against a cross product plus filter, and if they don't match up, you probably have a problem in your join. "Interesting." This has caught tons of bugs for us. It's really, really good; it has been very good for us.
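(The join-verification idea, reproduced at the SQL level with the duckdb Python package. DuckDB's actual verification runs inside its C++ test runner; this only shows the shape of the check: every JOIN should return exactly the rows of the equivalent cross product plus filter.)

```python
# Cross-check a JOIN against the equivalent cross product + filter.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE a AS SELECT range AS i FROM range(100)")
con.execute("CREATE TABLE b AS SELECT range AS j FROM range(100)")

joined = con.execute(
    "SELECT * FROM a JOIN b ON a.i = b.j ORDER BY 1").fetchall()
cross = con.execute(
    "SELECT * FROM a, b WHERE a.i = b.j ORDER BY 1").fetchall()

# A mismatch means a bug in the join implementation.
assert joined == cross
```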
So, the next advice I can give: make an embedded database, even if you're not making an embedded database. Make one anyway, for testing. In the end, the wrappers around a database server are kind of a wrapper around an embedded database anyway, and an embedded database just makes it much more convenient to write tests, and much faster, because you don't need to deal with this whole client-server business. If you're sending individual SQL statements that run on one row, two rows, all this overhead actually adds up, and it makes your tests run a lot slower, which again is annoying if you have to run them all the time. So we have gotten a lot of good out of having an embedded database. Before this I worked a lot on MonetDB, which does not have that, or did not have it until we added it, and their testing framework takes around half a second, I think, for every single test they run, because they start up a database server, start up a client, connect to the server, run the query, and check the results. As a result, they just run a lot fewer tests: their fast tests take something like 30 minutes to run; ours take a few seconds, and we actually run more tests in that time. So it really makes a difference in how much you can test as well, because there's always a limit in terms of how long you want to wait.

Then, automatic benchmarking. It's very useful. I know Andy has added that, or wants to add it. "Work in progress." Work in progress, okay. So, it detects unexpected regressions, which is very useful. Here's something we actually found this morning, because one thing that we don't have yet, and also want, is automatic notification when something actually breaks. Here was a benchmark that had been running for a year that broke in December, and we only found out now, because nobody was actually looking at this test; it's just somewhere in our microbenchmarks. But it actually found kind of a bug, and here you can actually see it by the commits. We have a list of commits; if you go on our website, you can see it, and you can see, hey, this commit actually broke it. What actually happened here is that this BETWEEN expression broke the index optimizer, which made this query not use an index scan anymore, but a regular scan. So: great, good catch. You probably don't catch these things in unit tests as much, right? Because the answer is still correct.

All right, kind of the last point: be careful with dependencies. It's something we hold to heart, partly because we are forced by our work environment to hold it to heart. Dependencies are very easy for the developer, but they can be very difficult for the user: for example, they're running a different operating system, they might not have a package manager, or they might not have root. And this is actually our scenario on our work machines. We don't have root, which means we can't use the package manager, which really sucks when you want to install things with a lot of dependencies. Otherwise, it's fine. And of course, packages can break; we all know about the left-pad fiasco.
I think you have a dependency chain Uh, just everything can break down if you have like one board packaging there Um, so dependencies there are liability Um, so you should use them with caution Don't include a dependency on boost just to parse one number from a string Please I actually found we actually found this in the d python par k reader like d python par k reader That did this it had it has like 20 dependencies, but this one was the most ridiculous It was parsing one number and it pulled in the whole of boost for that ridiculous And uh, here's a nice quote from the tinyverse people who are kind of on a mission to Uh, to have less dependencies every every dependency you add to your project is an invitation to break your project, right? Um, yeah, so yeah, like I said, we can use root on the work machines if I have not complained about this enough I actually tried to use terrier there and I got met with this message so I was not I was not I was not so happy. I couldn't use it without first compiling g email um so Maybe for you also it would help me a lot if you guys were to not use dependencies Okay, it's a little let's call a couple of things one is so you guys don't use g email We don't have it right now. No, we probably will add a Add it but if we added we might even inline it or as like an option Right, so so we find it gives us like a 20 boost. Yeah, that's why we have it. Um, the It's So you say you have no dependencies, right? So you're not But not uh, actually this is my final slide. So it's pretty thick. We have inline dependencies. So we use the postgres parser um, we use Yeah, like you you you've copied the code you need into like a third party directory and then you compile it so, uh How do you decide like is it just ad hoc? I mean, it's just you and honus like how do you decide like, all right Well, we need this feature we could roll it around or we could bring in this dependency Do you like you have a way to judging that again given your requirements? It's mostly what we find Would be a reasonable use of our time and how Our criteria is also a bit like how easy is it to inline in the first place So we are making these amalgamation builds Which means that well first off everything has to run under c++ So we actually have spent some time converting c dependencies like the postgres parser c++ But like the actual grammar and the yaw stuff all of this has to be c++ um, and The problem is if you want to include one library you have to include all of its dependencies as well So very much depends on What like like the the cost benefit trade off right and we kind of just make that ad hoc I would say the ones we use are like the postgres parser because we don't write our own parser We use the utf a proc because we don't want to write our own unicode Framework because this is a lot of effort and annoying um, we also use, uh, I think we have somewhere Snopy in there for example because why would we reinvent snopy, you know, like that's What is snap snappy snappy? Oh snappy. Sorry. It's actually a german song Um, schnee schna schnappy. So that's why pronounce it Okay, yeah, so we add um So then um, we'll also are the same to talk so We don't want to applause so you hit the applause button. I don't feel like It's an empty gesture, but you want to do that to then uh mark So why don't we go around and if anybody wants to ask questions you can and then um, just I guess you can raise your hand. We can unmute you or just chime in. It doesn't matter. Okay. I'll ask questions then. 
"So, you didn't really talk about how, well, you're not built on top of SQLite, but do you implement the SQLite API? Like, when you say it's an embedded database, is your code sending SQL over JDBC to the database system, or...?" There's no client. It's really just like SQLite: we have a function called query that takes a SQL string, and that feeds directly into our engine. There's nothing else. We do also implement the SQLite API, but that is just: we have the header, and we have our own implementation, which uses DuckDB under the hood. We have JDBC as well. But there's no client; we don't really have any protocol of any kind. We have some REST server that we use for one thing, but not really seriously, you know; it's not our focus. "Okay.

And then, for the fast and slow tests: on the command line, I call make unittest or whatever, and that gives me the fast ones, and then at night, or when you push to GitHub, you run the slow stuff? How does that work?" It very much depends on what you want to do. If you want a basic verification that your system is not broken, you just run the fast ones. If you want to debug a test, you obviously run just that specific one. And before you push to GitHub, you always run all of them; the slow ones take maybe 15, 20, 30 minutes, depending on your machine. "So before you fire off the Travis build, you run the slow stuff?" Yeah. Actually, we have a pretty complex Travis setup. We run a bunch of different compilers and systems, of course. We run our R packages and Python packages on Linux, OS X, and Windows. We run GCC 5, I think that's the lowest one, up to GCC 9; we run different versions of Clang. We also have some extra verification builds. For example, we have this vectorized engine, and we have one build where we run with a vector size of two. Our normal vector size is on the order of a thousand, but running with a vector size of two actually detects some other bugs. Then we also have different compilation options: we compile the amalgamation, and we do our normal compile. We test all of those on Travis; I think we have 17 or 18 Travis jobs. We actually also run a VM with Solaris in there. It's a bit crazy. "That is crazy." Because there are people who will need Solaris to work; there are a bunch of old people that like Solaris, for some reason.

"Okay. All right, any other questions from anybody else?" "A question. So you said you rewrote the binder a couple of times..." "You should introduce yourself." "Hi, I am someone who is currently maybe going to rewrite a binder in the near future." Ah, very good. "Time out: he's an undergrad at CMU; he's starting his PhD with me in the fall. So you will come across Mark, and Mark will come across Wan in the future." For sure. Yeah, if we still have conferences by then, we will see each other. "Sure. So, I was kind of wondering what your main takeaways were for how you should go about writing a binder so that you don't have to rewrite it again, and also whether you're annotating existing objects, or figuring out what information needs to be generated and storing that on the side." So, it's a bit tricky.
I think one thing that was kind of crucial for us was the subquery binding, which is quite tricky. We started with, I think, something very similar to your binder, which gave us a lot of problems with binding correlated subqueries. So we added this nesting of binders, where you have different nestings of binders, but also at the expression level, so you have an expression binder that has a different layering as well. For example, you can have a subquery in your SELECT clause that can bind to the variables available in the SELECT clause, right? But then within that, you can have another subquery that has a WHERE clause, and another subquery inside that WHERE clause, and that one needs to bind in the WHERE clause again. And you can bind at all of these different levels: on the outer query you need to bind the SELECT clause, on the inner query you need to bind the WHERE clause. So you have this layering of different expression binders depending on where you are, because if you are in, say, a SELECT clause, you can bind aggregates, but if you're in a WHERE clause, you can't. You have this kind of context there. I think that's the way we structured our binder now.

One other thing that's important is that, at least for us, we want to do statistics propagation throughout the binder as well, because we do overflow prevention using those statistics. So you have to be able to track those statistics throughout the plan, right? You don't want to just give up your statistics when you, say, leave the SELECT. You want to go: okay, I have my statistics from the table, I go through my filter, through my WHERE, I go to my SELECT, and you propagate all those statistics. "Your binder's doing that?" Actually, not anymore right now. We did it, but I removed it in one of the reworks. We need to re-add it, though. The reason we used to do it, and still want to, is that we need it for overflow prevention, because we want to guarantee, statically: can this expression overflow? And if it can't, then we can use a fast path in the computation.

"So by statistics, do you mean things like histograms or cardinalities of the columns?" Min/max, is-null, has-Unicode (because for string operations there's usually a fast path for plain ASCII), these sorts of things. "How about the other things, like histograms and cardinality estimation?" The cardinality estimation is done somewhere else, in the join order optimizer, not in the binder, but we don't have a very good one right now, actually. It's one of the features that's kind of missing.

"Kind of a quick question, actually, going back to what you said earlier, when you said it's based off, kind of similar to, or looks similar to, what we do: did you mean Peloton, or did you mean Terrier?" No, I meant Peloton, but I guess you have some code similar to that now. "And we also rewrote the binder." But like I said, we then rewrote it twice more, so. "Third time's the charm." Exactly.
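(A conceptual illustration of the binder nesting described above, with hypothetical names, in Python for brevity; these are not DuckDB's actual classes. The core idea: each subquery binds in its own scope and falls back to its parent binder, which is what makes a correlated column reference resolvable and detectable.)

```python
# Nested binders: each subquery gets a child binder whose lookups fall
# back to the enclosing query's scope. Hypothetical sketch only.
class Binder:
    def __init__(self, parent=None):
        self.parent = parent
        self.columns = {}  # column name -> bound column in this scope

    def bind_column(self, name, depth=0):
        if name in self.columns:
            # depth > 0 marks a *correlated* reference into an outer query
            return (self.columns[name], depth)
        if self.parent is not None:
            return self.parent.bind_column(name, depth + 1)
        raise KeyError(f"column {name} not found in any enclosing scope")

outer = Binder()
outer.columns["v.phone_num"] = "col#0"
inner = Binder(parent=outer)              # binder for a nested subquery
print(inner.bind_column("v.phone_num"))   # ('col#0', 1): correlated
```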
"Do you also use the binder to do validation? I saw that DuckDB also supports prepared statements; how does the prepared-statement code path differ from the binder code path when it comes to, for example, validating types?" So, the prepared statements: that's actually something Hannes implemented, but I do know something about it. The prepared statements need to have a type, right? And in the binder, we do all of the type resolution. This is also one thing we had to add, for the function interface we have, where you have this propagation of all the SQL types. I don't know exactly how it works for you guys; we have a separate type internally versus in SQL, and the reason for that is that we actually use the same internal type for different SQL types. For example, a date is just a 32-bit integer, which means we have the SQL type DATE and the internal type integer. And all of this SQL type information only exists during the binding phase; after that, it's all converted to the internal types. The binder does all this type resolution, and... I kind of forgot your original question. "Whether the type resolution also has to happen for prepared statements." Yeah, same code path. What happens is, we have a prepared-statement parameter that, in the binder, has an unknown type, and basically, when you have an unknown type, it goes down, and when it encounters something that gives it a type, we assign that type to the parameter. That can be, for example, a function: if you put a prepared-statement parameter inside a substring, as the first argument of the substring, it has to be a VARCHAR, right? There's no other possibility. So we add a cast to VARCHAR, and then the parameter becomes a VARCHAR parameter, and then you can only assign a VARCHAR value to that parameter, if you will. "Yeah, that makes sense. Thanks." "And DuckDB's MIT-licensed, so you should just look at that; look at the binder." Yeah, and if you have any questions, shoot me an email, or we can have a Zoom chat. I'm sitting at home all day anyway; just the time zones are a bit challenging. But I don't mind answering questions about it. "Okay, thank you."
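(The substring example from above, as it looks from the client side with the duckdb Python package. The internal behavior, a parameter of unknown type being resolved to VARCHAR because of where it appears, is as described in the talk; this snippet just exercises it.)

```python
# A prepared-statement parameter whose type is inferred at bind time:
# as the first argument of substring it can only be a VARCHAR, so the
# binder adds a cast and fixes the parameter's SQL type.
import duckdb

con = duckdb.connect()
print(con.execute("SELECT substring(?, 1, 3)", ["duckdb"]).fetchall())
# [('duc',)]  -- the parameter was bound as VARCHAR
```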
"Just a very high-level question I have, out of curiosity. Of all the components of DuckDB, like the storage, the execution, transactions, or, yeah, concurrency control we should call it, or the binding, the catalog: which part, if you can give a rough estimate, did you spend the most time on, which was the most difficult to deal with, and why?" This is difficult to say, because we have not given each component the same focus. For me personally, the most difficult component was the storage, and the storage is actually the least polished part of the system, I think. "Okay, okay." And the reason it was so difficult is that our group has not actually done that much in storage. In MonetDB, which is one of the other databases we have made, they just memory-map some arrays, and that's it. In Vectorwise they have a very complex storage layer that was written by basically one guy who left 15 years ago. I've not personally worked with the storage there, but from what I hear, it's kind of a mess. Basically, as a group we focus a lot on analytical queries, query execution, these sorts of things, so for me those came very naturally. Query execution was quite easy to do, which is also why I think that's probably the best component, the execution framework. For the storage, I had to relearn how to write a buffer pool and such. "Got it. I see. Thanks for sharing." Any other questions?

"This is Paulina; I'm a master's student in the computer science program. I was wondering if you could talk a bit more about this: you said that you have implemented compressed execution, and then you're also working on compressed storage." Yeah. So, the way we have implemented compressed execution is kind of its own thing. Our vectors, which propagate through the system, have a type, as in an integer type or a string type, but they also have a vector type, which is the actual physical representation. And the physical representation can be anything: it can be dictionary-compressed, it can be a flat array, it can be a single constant as well; a single constant is logically replicated but physically not. These vectors flow through the system compressed, in whatever type you want, and whenever you get to an operator, the operator can decide what to do with it. So we have a method that we call Orrify, because it was come up with by Orri Erling from Facebook, and it basically creates a thin wrapper over a vector that gives you an iterator over it that is quite efficient for the common cases. And then we have some specialized code to handle different types of compression, but it's very modular: for any function or operator where you want to implement compressed execution, you can just add a specific type of compressed execution to it. So we don't have everything yet; we don't have anywhere near everything, and we probably don't even want everything, but the framework is there to allow for anything.

"So in the case of dictionary compression, are you passing the dictionary along with the vector, or are you...?" Yeah, so the vectors in our system have shared pointers to a buffer, basically, and the buffer can hold anything. It's kind of misnamed: it's not just a bunch of bytes. It can also hold a dictionary. It can hold a lock as well, so we can have a lock on something there, or a pinned buffer pool page. That's all a shared pointer in the vectors, and that basically means we can have a dictionary that's referenced by a vector, and as long as that vector is alive, the dictionary will also be alive. It doesn't actually have to physically go along with the vector; it can just be a pointer, but it is kept alive as long as the vector is alive.
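(A conceptual sketch of the dictionary-compressed vector type described above, using NumPy for brevity; DuckDB's vectors are C++ and the names here are hypothetical. The vector carries a reference to a shared dictionary buffer plus per-row indices; an operator can either work on the compressed form directly or gather through the indices, which is roughly what the common-case iterator path does.)

```python
# Dictionary-compressed vector: shared dictionary buffer + row indices.
import numpy as np

dictionary = np.array(["amsterdam", "berlin", "munich"])  # shared buffer
indices = np.array([0, 0, 2, 1, 0])                       # one per row

# A dictionary-aware operator touches only the small dictionary:
# 3 string operations here instead of 5.
upper_dict = np.char.upper(dictionary)

# "Flattening" just gathers through the indices.
flat = upper_dict[indices]
print(flat)  # ['AMSTERDAM' 'AMSTERDAM' 'MUNICH' 'BERLIN' 'AMSTERDAM']
```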
"Okay, and when you say you're working on compressed storage, what are you thinking about?" Well, there are a bunch of different things. From our group, we have PFOR-delta, which will go in there, probably for sure. Actually, Peter came up with a new string compression method, which will also go in there for sure. Besides that, probably some simple things like RLE or dictionary encoding and such, for now. Currently, all our storage is just flat, but we are adding at least the simple things, you know: dictionary, RLE (the PFOR-delta is maybe not that simple), string compression. So we're not going to use generic compression; we're not going to use "Schnappi" or anything, for example. "You mean Snappy." Snappy, sorry, Snappy. "Thank you. Very good. Usually I'm the person correcting people's pronunciation at our office."

"What do your catalogs look like? Are they tables, or are they separate data structures?" Our catalog is just a bunch of unordered maps, but they are transactionally managed. So we actually have a separate code path for our catalog, but it is fully managed by the MVCC, because we really do want things like CREATE TABLE statements being tied to transactions, of course. But the main reason we went with this sort of different representation is that we had a lot of problems in MonetDB with this, because in MonetDB they use the main storage engine to also handle the catalog, but the storage engine in MonetDB is very optimistic, which means that if you so much as look at the same table from two different threads, the whole system just blows up. It starts undoing things, saying "all transactions aborted, there was a potential conflict", you know. And that got really problematic when you wanted it to handle the catalog, because if two threads were to start a transaction and create a table, they would already have a conflict. So we kind of took away from that: if you're doing optimistic concurrency control, maybe you don't want to do that for the catalog. "Okay. Ours, our catalogs, are in tables, and they are transactional. So it sounds like it was sort of an engineering limitation of MonetDB rather than something more fundamental?" No, they could be stored as tables as well. It's just that... "I'm saying, with the issue you're describing of MonetDB's problems, it sounds like it was an engineering problem, not a fundamental problem." Yeah. I think either way it's kind of fine. There is some duplicate code that we have in the catalog that deals with the transaction stuff, because it handles it in a kind of different way. I think if you're making a system that's pretty good with point updates, that's not super optimistic, it's fine to store it as tables as well. I think there's no fundamental objection to that, but with an optimistic engine like ours, we ran into these problems, and for us it's also kind of easy to just have an unordered map of tables, you know. It's kind of simple, in a way, as well. And it's easy to bootstrap, because, you know, there's a chicken-and-egg problem there.

"All right, I have two more questions and then we should let you go, because it's late for you, unless somebody else has one." Yeah, no problem. "So you mentioned that you're using shared pointers. Have you found that to ever be a performance issue?"
No, we use unique pointers almost everywhere. The only case where we use shared pointers is actually this vector buffer. And I thought about it a lot, and we ended up doing it because I saw no other good path to implement this. The reason for that is: if you have a vector that has, say, a dictionary, we have operations that just reference a vector, and you have to have some sort of shared ownership there, or you transfer the ownership, but sometimes that's a bit iffy. If you have, for example, a CASE statement, a CASE statement gets two vectors as inputs (well, more, but it could be two). If you then have strings, you don't want to copy all the strings; you just want to have pointers to them, right? So because of that, you have the CASE statement's output relying on this string area, and you have the original vector relying on the string area. So it is inherently a shared thing, because you can have multiple vectors depending on the same underlying data. But for the rest of the system, we use unique pointers everywhere, and I really think that was the right choice as well: unique pointers everywhere. "In Peloton we just... it was a wreck. We're doing something different in our new system: we don't use shared pointers; we actually made our own pointer type, and we manage memory separately.

All right, my last question: it's an embedded system, so it's single-threaded, right? Or how does that work right now?" So, right now it's single-threaded per query. "Because you're running in Python, you just can't use whatever thread you're given?" No, no, you can launch any threads you want, of course; there's nothing stopping you from launching threads from the C code in Python. Only if you want to interface with Python things do you have to actually respect this. We can easily run threads inside the database engine, which we also will do. So DuckDB is currently single-threaded per query, but we have full support for concurrent queries and such, so you can run n updaters and inserters and readers, however many you want, and most of them are not going to interfere with each other, unless you do point-update-heavy workloads; then you have a lot of interference, obviously, but that's also not what we care about. "What about background threads, then? If it's an embedded system and you're told you can only have one thread, or, I guess, they run in the background and can do whatever they want, right?" Yeah, so basically, people can do whatever they want with the system. That's kind of the deal. It's somewhat of a challenge, because people can mess with it, of course. But once we add the multi-threading, we will allow the user to limit it as well, obviously, because you don't want to burn four or eight cores on your laptop for no reason, right? Maybe the database is not always that important, you know. "So, do you have checkpoints? Like, do you have a background thread take a checkpoint?" No, no.
Yeah, so this is one thing that we kind of decided on as well: we don't want any background things. So the checkpoints are actually pretty ugly, but they will just trigger once the database reaches a certain size, and they will just run as part of whatever query made the database exceed that threshold. So we made the conscious decision not to have anything in the background: there's no background GC, no background anything. "No GC?" No, no GC. Everything runs just inside the query: if you call DuckDB's query function or whatever, everything happens end to end. That's all DuckDB will do; there's no background anything. It's kind of a contract, a philosophy, a little bit.

"Just curious, can you elaborate on the decision not to have any background threads a little bit more?" Yeah. So, there's actually... I have some extra slides on this. Okay, I'm never going to find this... yeah, resource sharing. Okay, I guess it's this slide. Can you guys still see the slides? "Oh, yes." Okay: resource sharing. It's kind of a big thing for us, and not for standalone servers, because if you run a standalone database server, the machine is yours, basically; you take up the whole computer. But we don't. We just run on someone's laptop, probably; we run inside another process. So we really want to respect this boundary somehow, and not just take over. You don't want the Python process that you are typing interactive queries into to just randomly hang for a few seconds because your database decides it needs to checkpoint, you know, or because your database decides, aha, you haven't done anything inside the database for a while, I'm going to checkpoint. Which makes a lot of sense for a standalone server, but not so much otherwise. So we have been looking a bit into doing resource sharing adaptively as well, but that's pretty tricky; it's probably not something we're really going to look into, but we have considered it.

Okay, any other questions? Last chance.