Hello everybody, welcome to the FOSDEM Lightning Talks in Building H. I want to introduce Hannes Mühleisen, who will talk about DuckDB, an embedded analytical database. Please give him a warm welcome.

Thank you, welcome everybody. So, a quick introduction. I work at CWI, which is the Dutch national research lab for computer science and mathematics. I also teach computer science students about the wonderful world of databases. But I have found that a good way of learning about databases is building them, and therefore I also do that. Today I'd like to talk to you about one of these projects, and that is DuckDB. Obviously DuckDB is not my sole creation; there are other people involved, most notably Mark Raasveldt, who is not here today.

So we're gonna talk about DuckDB. DuckDB is a database management system, and it's completely new, and it's designed specifically to be embeddable, which means not embeddable as in hardware, but embeddable into other software. And it's analytical, which means it's focused on crunching through large amounts of data, as opposed to dealing with transactions like orders in your online shop. So if you wanna do orders in your online shop, go to the Postgres people next door. If you wanna crunch large amounts of data, you can use DuckDB. Now I have to find out whether my clicker works. It does.

It is common to start these kinds of talks with a description of how terrible the state of the world is. This is no exception. The present is very bad: data management in data analytics is a huge mess. I don't know if any of you have ever tried to use things like pandas. That's great, it works with the five examples that they have on the website. But one of the problems that is really overwhelming is the data storage itself. People tend to have these text files: there's a well-known folder structure somewhere which has a bunch of CSV files in it, and then there's maybe some code on top of that which decides which CSV file should be read. Once we have loaded these files, we have these crude query processing engines, for example the one in pandas or the one in the R environment. Once people decide that CSV files are too slow, they start inventing their own crude, hand-rolled binary formats on disk and start processing those. There has been a recent push in that direction. In general, this is a zoo of one-off solutions, and that creates secondary problems: for example, changing anything about the data that you have becomes very difficult. So this is bad, we don't want this. And these are all solved problems. We have had data management systems for 50 years or so, and what we are trying to do with DuckDB is make them usable for these data analysis tasks that are so common.

So now this is the counterpoint: the future is bright, obviously, with SQLite. Sorry, with DuckDB, sorry. Who has used SQLite? Okay, that's very many people. In fact, everybody has used SQLite, because it is in every browser, every phone, and every device that you can imagine. What we're trying to do is build something similar to SQLite, but very different in the intended use, in the sense of what kind of data analysis questions you wanna ask. So with DuckDB you do data analytics, in contrast to SQLite, where you do transactional data management. And how do we do this?
We have built a very fast, so-called vectorized data processing engine; I will explain in a bit what that is. And we have stolen a lot of good ideas from SQLite. For example, DuckDB does not require you to run a separate server. You know this idea that you have to run a daemon that is your database, which you have to set up and configure and restart and whatever. No, it's database as a library: you run the DuckDB system inside your process. This has the nice side effect that data transfer between whatever you are using to talk to DuckDB and DuckDB itself becomes very fast, and for data analysis this is really a critical question. We've written a paper, which was quite fun, measuring, for example, the client protocol speed of various popular databases, and the Postgres people next door came off pretty badly. Something else we have stolen from SQLite is the idea of a single-file storage format. So basically your whole database, no matter how complex it is, no matter how many tables it has, lives in a single file. And we've also stolen the idea that it should be simple to install. More on that in a bit.

So this is the bright future. How do we make that work? DuckDB is a library. So think of just a package, a library that you embed into your application. We have zero external dependencies. This is really something that took a lot of work, but it is something we believe is quite necessary for a library to be successful: you shouldn't have to install 57 other programs before you can use it. In fact, we have a special way to build DuckDB that results in two files, one header and one implementation. DuckDB at the base layer exposes a C++ API. We have full SQL support; I went through the wonderful job of implementing things like window functions in a database system, which I can tell you is not fun, so you don't have to do it, because you can use DuckDB. We have also built a wrapper for the API that SQLite uses. So in principle, if you have an application that talks to SQLite, you can do some library preload tricks and it will use DuckDB instead. This is something we have done to make it easy to switch. We've also learned from previous projects how important it is to integrate with the tools that people are actually using, and in data analysis people use R and Python. So there are packages for R and Python, which I'll show in a bit, that include everything you need to run DuckDB. And just to wrap it up, there's a command line interface, and for the people that wanna do the web stuff, we have a REST server as well.

Let's show some examples. Here is an example for Python, which by the way was also invented at CWI, so we are kind of obliged to integrate with Python. You say pip install duckdb, that's very complicated, and then you have it installed. There's no additional software required, all batteries included. And then you can just use the wonderful Python database API, where you connect to a database; in this case the database is a file. And then you can run SQL queries, which is a required skill to work with DuckDB. Or maybe not, because in the R world we have a similar integration where you load up the package and connect to your database file, and the R people have invented this wonderful dplyr system of programmatically expressing queries, which is quite nice.
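To make the Python path just described concrete, here is a minimal sketch, assuming the duckdb package from pip and the DB-API-style connect/execute calls mentioned above; the file, table, and column names are made up for illustration.

```python
# Minimal sketch of the Python integration described above. File, table and column
# names are invented; exact API details may differ in the pre-release duckdb package.
import duckdb

# The whole database lives in a single file; it is created if it does not exist yet.
con = duckdb.connect('analytics.duckdb')

# Plain SQL for schema and data loading.
con.execute("CREATE TABLE visits (day DATE, hits INTEGER)")
con.execute("INSERT INTO visits VALUES ('2020-02-01', 42), ('2020-02-02', 7)")

# Run an analytical query and pull the result back into Python.
rows = con.execute(
    "SELECT day, SUM(hits) FROM visits GROUP BY day ORDER BY day"
).fetchall()
print(rows)

con.close()
```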
And finally, the C++ API I wanted to show you, for the people that are more in C land, is really just that. This is the actual, fully functioning, minimal integration of DuckDB into C++, where again you specify which file you want your database to be stored in, and then you can merrily run SQL queries.

So that's the outside view, right? It's not very exciting, I realize this; not many people get excited about databases, I'm one of the few. But it is a tool that you can use to store your data and, and this is the big difference, get it out again quickly, and you can run queries on large amounts of data on your local computer quite quickly. Now how do we do this? Let me talk briefly about some internals. We have something called vectorized processing. I'm not gonna talk a lot about the other things, but this is the core of the engine that makes it fast. To understand vectorized processing, you have to understand that database engines come in different flavors. Traditionally it is tuple-at-a-time; this is what Postgres, MySQL, SQLite, everybody uses. Basically, we look at one row of data at a time in the process of running queries. That's great, however, it's slow. Then we have the pandas/NumPy/R way of doing things, where we look at one column at a time, which is faster, but has issues when the data becomes bigger than memory. And then finally we have vectorized processing, which is kind of the middle ground, where you look at chunks of data at a time. This is very nice because it means that the data we are looking at during the query fits higher up in the CPU cache hierarchy. So here on the right you see a short overview of the CPU caches, and basically what we're trying to do with DuckDB is keep the data that is being worked on up here in these very fast L1 and L2 caches, and avoid going to main memory, for performance reasons. And this is very nice because it allows us to process data that is bigger than main memory. This is one of the limitations of things like pandas: once your data becomes bigger than memory, you're screwed. With a vectorized execution engine, you actually have a reasonable chance of still completing your analysis, and you don't get wonderful out-of-memory errors.

So now I'm actually gonna skip something. Then you would ask, okay, why should I do vectorization? It's great that Hannes is excited about it, but what kind of difference does it make? And this is a very crude benchmark: we ran a standard benchmark, TPC-H, on different systems, and this is based on an old version, we have gotten faster in the meantime. But basically, if you look at the bottom there, you can see the time it takes to complete these benchmark queries on the different systems, and then there's DuckDB up here, which is clearly much faster. So generally you would say that this is 40 times faster than a traditional engine that works in a tuple-at-a-time fashion.

But then you would say: yeah, Hannes, you're an academic and you have a nice pet project, but I'm interested in something that I can use, maybe even in serious settings. And this is why I briefly wanna talk about the quality assurance that we're doing with DuckDB. Basically, we have continuous integration running where millions of SQL queries are run on every single release. We know the correct result for every one of these queries, so whenever we get something wrong, it is instantly flagged.
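As an aside that is not part of the talk: the three execution models described a moment ago (tuple-at-a-time, column-at-a-time, vectorized) can be sketched with a toy Python loop. This only illustrates the chunking idea; a real engine like DuckDB operates on typed column vectors, not Python lists.

```python
# Toy illustration of the three execution models, using a simple sum over one column.
data = list(range(10_000_000))  # pretend this is one column of a large table

# Tuple-at-a-time: per-row interpretation overhead on every single value.
total = 0
for value in data:
    total += value

# Column-at-a-time: one operation over the whole column; fast, but the column and
# every intermediate result must fit in memory at once.
total = sum(data)

# Vectorized: work on fixed-size chunks ("vectors") small enough to stay in the
# L1/L2 caches, amortizing per-row overhead without materializing whole columns.
VECTOR_SIZE = 1024
total = 0
for start in range(0, len(data), VECTOR_SIZE):
    total += sum(data[start:start + VECTOR_SIZE])
print(total)
```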
We have verified benchmark results for the standard benchmarks that we also check against, and basically we went around and stole everyone's test cases. With SQL engines you can do this, because they all have the same query language; the only thing you have to do is write a parser for whatever result format they have. My favorite part was writing a scraper for the SQL Server website, because they have example queries with answers, and from that we generate a bunch of test cases as well. We also do query fuzzing, where we auto-generate queries to try to break our system, which always works if you run the fuzzer long enough, and you find very important bugs along the way. And we also have something we call continuous benchmarking, where every release is subjected to benchmarking so we can flag performance regressions quickly.

So, DuckDB is free and open source under the MIT license. We are currently in pre-release, which means that you can't yell at us if we change APIs, but it is fully functional. You can use this to run queries, to store data; it is all there. We have a website, and there's a GitHub page where you can go file a feature request if you want. We are very interested in hearing feedback, and if DuckDB doesn't do something that you want it to do, then please tell us. If you are even more database-inclined, you can send us a pull request with new features, bug fixes, whatever. We have a long list of issues in the issue tracker that are tagged with "help wanted" or "good first issue", so these are good places to start. And with that, I'm happy to take questions. Thank you.

Can I ask two questions? (We have to ask him.) Do you do something for internal data compression? As you say, it's used for big amounts of data.

The question is, do we do something for internal compression? What we are working on is two things. One is that the storage on disk is going to be compressed, so whatever we write to the single-file format is going to be compressed. But we are also, and this is really something we're working on right now, working with compressed intermediates, so that for vectors, for example, if you have a vector of 1,000 values and they're all the same, we have compression that will not actually move these 1,000 values around, but only the fact that they are all the same.

And the second question is, do you support any statistical functions, like computing percentiles and getting histograms back from the database engine?

That's a good question. Our philosophy there is that, because the data transfer between DuckDB and the host is so fast, if you want things that we don't support, you're not gonna die pulling a chunk of data into pandas, for example, and running it there. There is support for user-defined functions if you wanna add anything, and we have a fairly complete library of aggregation functions, so there are multiple options. But the general idea is that we don't punish you for pulling a large chunk out of the system; we don't hold the data hostage.

Hi, thanks for the talk, I have a question. Do you have a connector for SQLAlchemy? For example, in pandas you have a connector for SQLite, so you can write a SQL query and then...

Yeah, that has been... I'm not sure what the status on that is, but people have worked on this. I think eventually, if it's not working already, it should be pretty straightforward, because we support the exact same query language as Postgres. So I suspect it should already work, and it's just a question of plumbing the connection.
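Picking up the earlier answer about percentiles: here is a small, hypothetical sketch of the "pull a chunk into pandas and finish there" pattern, reusing the made-up table from the Python example above and assuming pandas is installed; the DataFrame fetch call (fetchdf) is taken from the duckdb Python package, with the caveat that pre-release details may differ.

```python
# Sketch of the "pull a chunk into pandas" pattern from the percentiles answer.
# Table and column names continue the invented example from earlier.
import duckdb

con = duckdb.connect('analytics.duckdb')

# Let DuckDB do the heavy filtering and aggregation in SQL ...
df = con.execute(
    "SELECT hits FROM visits WHERE day >= DATE '2020-01-01'"
).fetchdf()

# ... then compute whatever statistic is missing on the much smaller result in
# pandas, e.g. a percentile and a rough histogram.
print(df['hits'].quantile(0.95))
print(df['hits'].value_counts(bins=10).sort_index())
```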
Okay, thank you very much. If you wanna talk to me, I'm outside. Okay, perfect. Thank you for your talk.