Hello everyone, can you all hear me well? All right, I guess we can get started. Let's start off with who I am and why I'm here speaking to you about DuckDB. I did my PhD in database architectures at CWI. Is anyone here from the Netherlands? I'm going to butcher your language a little bit, sorry. CWI is the Centrum Wiskunde & Informatica, the national research center for mathematics and computer science in Amsterdam, and it's actually the place where DuckDB was born, in the Database Architectures group. CWI is also where Python was born, in case you didn't know, and it hosted the first internet connection in all of Europe, so it's quite a strong research institute.

I did my PhD there on index structures, and while doing my PhD I was already working on DuckDB. You can see it in the GitHub history: in July and August of 2018 it was basically the co-creators of DuckDB and me, so long, long ago. Today I'm a software developer at DuckDB Labs, the company that maintains and continues development of DuckDB, and I've done all kinds of stuff there. Since my PhD was on indexes, the first thing I did in DuckDB was an index. I also did zone maps, the Arrow integration, ADBC, and the Substrait integration; maybe you heard some of these words earlier today in other presentations. But the main reason I'm here is that I did a big chunk of the Python API. I'm not a Python guy per se, I'm a C++ guy, but I did my very best, and I'll show a little bit of that today. I also worked on the CSV reader and tons of other stuff.

A little outline of the talk: I'm going to go quickly over the motivation, why you really want a database system for data science and data analytics, and talk about the alternatives: classic client-server systems and data frame solutions. Then, of course, I'll talk about in-process database management systems, specifically DuckDB, and what makes DuckDB special: the design decisions that set it apart from these other solutions. Since this is a Python conference, I'll talk about DuckDB in Python land: our integrations, the different APIs we implement, and Python UDFs. Then I'll do a little demo and finish off the talk.

All right, motivation. This is a classic data science workflow. Usually in data science you start off by exploring your data: you have a question in mind, you zoom in on a piece of the data to investigate whether your hypothesis is correct, you explore it, you model it. Maybe you suddenly decide this is wrong and analyze something else entirely, and you keep repeating this process. So the workflow is simply: ask an interesting question, get the data, explore the data, model the data, visualize the data, repeat. If we think about the libraries involved in this workflow: asking an interesting question and exploring the data are basically queries, since we're trying to get pieces of our data; then we model it with, for example, TensorFlow or PyTorch, and visualize it with Plotly or Matplotlib. And the data comes in all shapes and forms: nowadays you have JSON files, CSV files, Parquet files, binary files from other database systems. The interesting thing here is that two of the most important pieces of this workflow need a database engine.
This database engine must be able to scan all these different file formats, because files come in all shapes and forms, and it must integrate with the ecosystem too: the machine learning libraries and the plotting libraries are usually available in Python, and you need to be able to get your data in and out of these tools efficiently. It must be efficient at executing analytical queries, because data science and data analysis are analytical workloads. It must support beyond-memory execution: there is nothing more frustrating than running a query on your computer and, just because your data doesn't fit in memory, the whole thing blows up. It must also handle complex query optimization, and by complex I mean not only filter pushdown or projection pushdown but also subquery flattening, for example. And one thing we noticed (there was a talk earlier comparing a relational API with SQL and the benefits of each) is that people usually use both, each in the part of the workflow where it fits best, so these engines should support both.

When we look at the alternatives for doing data science, there are of course the classic client-server relational database systems: PostgreSQL, SQL Server, Oracle. They have forty or fifty years of database research embedded in them: a full-on query optimizer; a buffer manager, so they can handle data that does not fit in memory; their own storage, so they can apply compression and keep everything in one file; a state-of-the-art execution engine; and full SQL support. But I have a question for you: has anyone ever managed to set up a Postgres instance in less than five minutes to read a Parquet file? Two people, three people? More like ten minutes, maybe. It's tough; these systems are difficult to set up. They require upfront schema creation, so you need to know exactly what your data looks like, and they have essentially no integration with the analytical ecosystem: if you want to transfer your data to, say, TensorFlow, you have to write something that queries the database and then transforms the data. And the way this works with Python is the figure down there: you have your relational database system and your Python analytical tool, but your whole result set has to go through the database connector, so you copy and transfer all of this data. That becomes a huge bottleneck. So that's a no-no.
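To make that bottleneck concrete, here is a minimal sketch of the copy-heavy path. The talk doesn't name a specific driver, so assume a local Postgres instance and the psycopg2 driver, with a hypothetical connection string and table:

```python
import pandas as pd
import psycopg2

# Every row crosses the client-server boundary: serialized by Postgres,
# sent over a socket, deserialized into Python tuples, then copied again.
conn = psycopg2.connect("dbname=analytics user=postgres")  # hypothetical DSN
cur = conn.cursor()
cur.execute("SELECT product, price FROM sales")            # hypothetical table
rows = cur.fetchall()                       # first full copy, as Python objects
df = pd.DataFrame(rows, columns=["product", "price"])      # second full copy
```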
The other alternative is data frame libraries. They're pretty cool; everyone here uses pandas, for example, and they integrate with these ecosystems very well. They use NumPy or PyArrow underneath and they're very easy to use: you can read a CSV file in one line, you don't have to set up a schema, all the magic happens for you. They have a relational API and fast data transfer, because you're running in the same Python process; they usually integrate so well with the other libraries, which also talk NumPy or PyArrow, that you don't really need to do any copying. And as I said, they have integrated scanners for CSV, Parquet, and all these file formats, with automatic schema detection.

The problem with data frame libraries is that they are not such great analytical engines. Usually they don't support SQL, or only a small subset of it. They have no query optimization, or only simple optimizations, so no subquery flattening, for example. There's frequently no beyond-memory execution: if anyone here has used pandas before, you know that if it goes over your memory, the whole thing burns, your Python process dies, and that's it. There's no storage, so you're still handling a bunch of CSV and Parquet files with a bunch of hard-coded paths, which becomes super cumbersome. And in the case of pandas there is zero parallelism; everything is single-threaded, so it doesn't matter if you have a shiny new MacBook, you're only going to use one thread.

So we decided we wanted to build something as nice as data frames, easy to use, pleasant, something people enjoy, but with everything we know from academia, all the cool stuff that makes an analytical query engine good. Where could we draw inspiration for something that is a database system but similar in spirit to data frames? That was SQLite. SQLite is an embedded database system: it runs in process, there's no external server to manage, so you skip all the setup Postgres requires, and it has bindings for basically every language. The whole database is one single file, so it's easy to transfer. It's public domain and super easy to use. It's actually, secretly, the most used RDBMS on the planet: it runs on every cell phone, every browser, even airplanes. It's great. The only problem is that it's built for transactions, not analytics.

That's where the idea for DuckDB came from. We wanted something for data science that works very well with these analytical tools, in Python and R, tightly integrated, zero-copy; that works with data visualization; that is small enough to put on any sensor you want; that has a very strong execution engine for analytics; and that does resource sharing, because it runs inside your process and needs to cooperate with it, again with fast data transfer. And that gives us the beautiful table where data frames, client-server systems, and SQLite are each lacking something, and DuckDB has it all. I'll give you a crash course in database systems in a bit to back that up.

All right, DuckDB. As I said, we wanted something easy to install and simple to use. To install DuckDB in Python: pip install duckdb, no big secrets. It's embedded, so there's no server management; it runs within your Python application. It has fast analytical processing and fast data transfer. It supports full SQL; our parser actually started from the Postgres parser. The database is a single-file format. And it's free and open source under the MIT license, which I think is the most permissive license out there.
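As a minimal taste of that, here's the kind of snippet the demo builds on later; a sketch assuming a recent duckdb release from PyPI and a hypothetical Parquet file:

```python
# pip install duckdb
import duckdb

# Run SQL directly: no server, no setup; the result prints as a table.
duckdb.sql("SELECT 42 AS answer").show()

# Query a file in place; schema and types are detected automatically.
duckdb.sql("SELECT * FROM 'events.parquet' LIMIT 5").show()  # hypothetical file
```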
We're currently in pre-release; our version right now is 0.8.1. You can check the website for documentation and more details.

So let's do a quick crash course on database systems. I'm going to go through each of these design topics and try to explain what makes DuckDB different from, for example, SQLite and pandas.

For data layout, there are basically two ways you can store your table: contiguously in memory per row, or per column. SQLite stores contiguously per row, which means that in memory you have the first row, then the second row, then the third row, and so forth. That's nice because individual rows can be fetched very cheaply, which is great for transactional databases, and it's particularly nice when you don't have a lot of memory, because you only need one row in memory at a time. The problem is that if you have a wide table and you're not actually using all the columns, you still have to fetch all of them. For example, if we're only interested in the price of a product, and not the stores where the product is sold, we still end up reading the whole table.

The other option is a column store, and basically any analytical database is a column store these days, DuckDB and pandas included. There we can fetch columns individually, because the columns are stored sequentially in memory, and that brings immense savings in disk I/O and memory bandwidth when you only access a few columns. With the same query, where we're only interested in the products and prices and not the dates and stores, we drastically reduce the number of columns we read.

To put numbers on it: say we have a 1-terabyte table with 100 columns, and a query that needs just five of them. In a row store like SQLite or Postgres, you're basically going to read the whole terabyte, which at 100 megabytes per second takes about three hours. In a column store you only need to read those five columns, about 50 gigabytes, which cuts the time to roughly eight minutes.

The other nice thing about a columnar database is that the values of a column sit sequentially in memory, and columns usually contain similar values; dates, for example, are usually increasing. That means more opportunities to apply compression. In this table I'm showing DuckDB from version 0.2.8, around July 2021, when there was no compression at all, and then we implemented a bunch of schemes: constant compression, run-length encoding, bit-packing, and compression for strings and for floating-point numbers. In about a year and a half we managed to shrink lineitem, a classic table from a database benchmark, by up to five times, and the taxi data set, pretty classic for data science, by up to three times, exactly by leveraging this similarity in the stored data. So going back to our example: we needed five columns, 50 gigabytes, eight minutes. With compression giving around five times savings (this of course depends on your data types), that time comes down to about a minute and forty seconds.
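The arithmetic behind those numbers is straightforward; a quick back-of-the-envelope check, using the talk's assumed 100 MB/s scan speed:

```python
TABLE_BYTES = 1e12          # 1 TB table with 100 columns
SCAN_SPEED = 100e6          # assumed scan speed: 100 MB/s

row_store = TABLE_BYTES / SCAN_SPEED                # reads every column
col_store = (5 / 100) * TABLE_BYTES / SCAN_SPEED    # reads only 5 of 100 columns
compressed = col_store / 5                          # ~5x compression on top

print(row_store / 3600)     # ~2.8 hours
print(col_store / 60)       # ~8.3 minutes
print(compressed)           # ~100 seconds, about a minute forty
```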
The other thing to talk about is, of course, the execution engine. SQLite uses what we call tuple-at-a-time processing: it processes one row at a time through the query plan. Pandas uses column-at-a-time processing, which means it processes entire columns at a time. And DuckDB uses a technique that was actually created at CWI called vectorized processing, which processes batches at a time: not single rows, not whole columns, but small pieces of your table.

Tuple-at-a-time processing is optimized for the days when computers didn't have a lot of memory, because throughout the query plan you only ever need one tuple at a time. However, because you're always passing a single row through each operator, you constantly flush the caches of your CPU, which creates a huge CPU overhead. Executing a query goes row by row: fetch, process, produce a result, over and over.

With pandas, because it's column-at-a-time, you get better CPU utilization; for example, you can use SIMD (single instruction, multiple data). The problem is that you have to materialize each whole intermediate column in memory, so if you don't have a lot of memory you're going to get into trouble, because these intermediates can be gigabytes each. In the example, it would process the first column, then the second column, then produce the result.

DuckDB's vectorized processing is optimized for CPU cache locality, and it still allows for SIMD and pipelining. The whole idea is that the intermediates ideally fit in the L1 cache: you take the first chunk, process it, produce results, take the second chunk, process it, and so on. I'm not sure how familiar you all are with modern CPU architecture, but basically you have the CPU core and then a hierarchy of caches up to main memory. The closer a cache is to the core, the smaller it is but the faster the access; the further from the core, the bigger, but with higher latency. By fitting your data in L1, which is the size of these batches in DuckDB, you pay a latency of about one nanosecond per access, which is pretty fast. But the intermediates in pandas are gigabytes of data: that's not going to fit in L1, not in L2, not in L3; it ends up in main memory. So suddenly, every time you access a piece of it, you're paying that huge difference in latency, and that's where the execution-engine bottleneck comes from.
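As a toy illustration of the three models, here's a NumPy sketch; this is my own illustration, not DuckDB internals, though the 2048 figure matches DuckDB's default vector size. The point is that per-batch intermediates stay small enough to be cache-resident:

```python
import numpy as np

data = np.random.rand(10_000_000)

# Tuple-at-a-time (SQLite-style): one value per iteration.
# Minimal memory, but enormous per-row overhead:
#   total = sum(x for x in data if x > 0.5)

# Column-at-a-time (pandas-style): one pass, full materialization.
mask = data > 0.5                 # intermediate: 10M booleans in main memory
col_result = data[mask].sum()     # another multi-megabyte intermediate

# Vectorized (DuckDB-style): fixed-size batches, cache-sized intermediates.
VECTOR_SIZE = 2048
vec_result = 0.0
for i in range(0, len(data), VECTOR_SIZE):
    chunk = data[i:i + VECTOR_SIZE]           # small enough to stay in L1
    vec_result += chunk[chunk > 0.5].sum()
```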
The other thing is query optimization. DuckDB does all sorts of things here: expression rewriting, join ordering, subquery flattening, filter and projection pushdown. Here I have an example of something that's done automatically in DuckDB but has to be done manually in pandas. I have a table with five columns, a, b, c, d, e, and I'm taking the minimum of a, applying a filter on a, and grouping by b. What you can see in the DuckDB plan on the right is that the scan of the table already pushes the projection down to just a and b: the table has five columns, but we're only interested in two, so only those two flow up the plan. In pandas, if you start by filtering your data frame, taking the rows where column a is greater than zero and then doing the group-by and aggregation, that filtered intermediate data frame doesn't push down any projection: you're basically materializing your whole data set. These types of optimizations have to be done manually there.
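A sketch of that comparison, with a toy five-column table standing in for the slide's example; the last pandas line is the manual projection you'd have to remember to do yourself:

```python
import duckdb
import pandas as pd

df = pd.DataFrame({c: range(100) for c in "abcde"})  # toy table: columns a..e

# DuckDB: the optimizer pushes the projection (only a, b) and the
# filter (a > 0) down into the scan of the data frame automatically.
duckdb.sql("SELECT min(a) FROM df WHERE a > 0 GROUP BY b")

# pandas, naive: the filter materializes an intermediate with all five columns.
df[df.a > 0].groupby("b").a.min()

# pandas, hand-optimized: you have to project the two needed columns yourself.
df[["a", "b"]][df.a > 0].groupby("b").a.min()
```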
DuckDB also has parallelism: parallel versions of most of its operators, scans, aggregations, joins. These can also be insertion-order preserving, meaning it guarantees that data is output in the same order it was inserted. Here I have an example of an aggregation query from TPC-H, a classical benchmark for database systems, at scale factor 10, and what you can see from the chart is that the runtime goes down quite nicely as you add more threads, which means parallelism is working very well.

It also has beyond-memory execution: it's not because your data doesn't fit in memory that everything should burn down to the ground, right? We have this never give up, never surrender mentality: it doesn't matter if it takes a little longer, we're going to execute your query; at least, that's the goal. And we want graceful degradation. It shouldn't be that your data exceeds memory by one megabyte and suddenly everything takes far longer; the less the data fits, the more impact you'll see, but there shouldn't be a huge spike. Here, for example, I have a hash join; on the x-axis is the memory limit in gigabytes, and the data itself needs about 10 gigabytes. You can see the time going up as memory shrinks, but it doesn't go up drastically.

Now, Python integrations. I would say we integrate with the Python ecosystem quite well. The thing we ourselves worked on very strongly was integrating with the data formats: NumPy, pandas, Arrow, Polars, PyTorch. We really try to have zero-copy integration with them, which means we can, for example, convert to pandas and read pandas data frames without copying any data. For TensorFlow that's not strictly true, but we still integrate with it. And then there are all these other libraries; for example, the Ibis project's data frame API: we're actually the default execution engine of Ibis as well. There's lots more.

For reference, a bit of usage. This is the number of downloads DuckDB had in the last month, about 1.3 million, and that's only for the Python client, which I think is still our biggest client; we have clients for many different languages. That gives you a feeling for a project that's maturing quite well and is being used a lot. We have over 11,000 stars on GitHub, over 200 contributors, and it's used by over 3,000 projects on GitHub.

On APIs: we support the classic Python DB-API, where you create connections and cursors, execute your queries, fetch results, and so forth. We have a relational API that's more inspired by pandas: in this example you have a connection, you create a relation that points to a table in the database, and then you start chaining operators, applying a filter, a projection, and so on, so it has a look and feel similar to data frames. We also have a Spark API; the idea is for it to be one-to-one with the actual PySpark API. It's still a work in progress, so if anyone wants to contribute, you're very welcome.

Last but not least, Python UDFs. This might be too small for the people in the back, unfortunately, but the idea is that you can have Python code executed directly from SQL. In this example I create a world cup titles dictionary, mapping the name of a country to how many times that country won the World Cup, and then a function called world_cups that just does a lookup in that dictionary, a very simple Python function. I can register that function in DuckDB under a name; I just have to declare its input types, in this case a VARCHAR string, and its output type, an integer. Then I create a table of countries, say Brazil, Germany, Italy, Spain, the Netherlands, and I can apply that Python function directly from SQL. This example is quite simple, but you can imagine doing machine learning through Python UDFs. After applying it, you get the result: Brazil with five World Cups, the Netherlands with none. Yes, but maybe in two years.
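Reconstructed from the slide as described, a hedged sketch of that UDF example; names follow the talk's description, and `create_function` is DuckDB's Python UDF registration call (available from 0.8):

```python
import duckdb
from duckdb.typing import INTEGER, VARCHAR

world_cup_titles = {"Brazil": 5, "Germany": 4, "Italy": 4, "Spain": 1}

def world_cups(country: str) -> int:
    # Plain Python: a dictionary lookup; None becomes SQL NULL.
    return world_cup_titles.get(country)

con = duckdb.connect()
# Register the Python function: name, callable, input types, return type.
con.create_function("world_cups", world_cups, [VARCHAR], INTEGER)

con.sql("CREATE TABLE countries(name VARCHAR)")
con.sql("INSERT INTO countries VALUES ('Brazil'), ('Germany'), ('Netherlands')")
# Call the Python function directly from SQL.
con.sql("SELECT name, world_cups(name) AS titles FROM countries").show()
```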
Now let's go to the demo. Is this big enough for the people in the back? No? Bigger, like this? Good, cool. I start this example, which is a Colab notebook, by installing DuckDB and Polars. They actually already ship with Colab, so I uninstall and reinstall them just to show you the sizes of these libraries. Then I download three data sets: the cab rides from New York from January 2016 as a Parquet file, the weather for that month as a CSV file, and a JSON file. That's basically it.

The first thing I want to show is that DuckDB is about 15.9 megabytes, so quite a small library. Polars is 19.1 megabytes, a little bigger, nothing too crazy. But PySpark is 310 megabytes; that's about 20 times bigger, so I expect 20 times better performance. That makes sense, right?

To execute a SQL query in DuckDB, you import duckdb, call duckdb.sql() with the SQL you want to execute, call show(), and you already have your result. And DuckDB can read all these different file formats. For example, there's a from_csv_auto function where you just pass the path of the CSV file, and it automatically detects all the types and column names and reads the file. We can do the same for Parquet files with from_parquet, and the same for JSON with read_json. So it's very well integrated with all these formats, and we can also read PostgreSQL database files and SQLite database files directly.

We can also create DuckDB tables from these files. In this example we create a DuckDB database file and create the tables csv, json, and parquet by selecting all the data from each of these paths. The Parquet file in particular is quite sizable, so you can see a nice progress bar telling us when the data is fully converted to DuckDB's internal format. It takes a little while, but Colab machines are free, so I can't complain. Done. And the nice thing is that we can now connect to this database we just stored and create relations from those tables. For example, we create a csv relation that points to the csv table we created, and call to_csv on it, so we can output CSV files as well. Same with Parquet: point at the parquet table, call to_parquet, and we write out a Parquet file. So we can both import and export all these data formats.

Now let's do a very quick performance comparison between pandas, DuckDB, and PySpark. Here I import the libraries, and here I just want to show how big the New York City data set is: a table with multiple columns of different types, 64-bit integers, timestamps, doubles and whatnot, and almost 11 million rows. To benchmark this we have a timing function that runs the same thing five times and takes the mean. And we're going to do everything from data frames: I'm not giving DuckDB the advantage of having the data in its own format; we go through a pandas data frame, created by reading the trip data Parquet file. The query we're going to execute is rather simple: per passenger count, the average tip amount for short trips, those under five miles. So basically, we want to know how much money passengers are tipping cab drivers, grouped by the number of passengers.
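A hedged reconstruction of that comparison; the column names follow the standard NYC yellow taxi schema (my assumption), the file path is hypothetical, and the five-run timing harness is simplified away:

```python
import duckdb
import pandas as pd

df = pd.read_parquet("yellow_tripdata_2016-01.parquet")  # hypothetical path

# DuckDB: runs SQL directly over the pandas DataFrame, returns a DataFrame.
duck_result = duckdb.sql("""
    SELECT passenger_count, avg(tip_amount) AS avg_tip
    FROM df
    WHERE trip_distance < 5
    GROUP BY passenger_count
    ORDER BY passenger_count
""").df()

# pandas: the same query expressed through the DataFrame API.
pandas_result = (
    df[df.trip_distance < 5]
    .groupby("passenger_count")
    .tip_amount.mean()
)
```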
We can run this in DuckDB, and it takes about 0.29 seconds. Because DuckDB can also output back to a data frame, which is what we return here, we're consuming a pandas data frame and returning a pandas data frame, so we can use the nice plot function from pandas to see the tips per passenger count. The interesting thing here is that zero passengers is actually getting quite a sizable tip amount. Not sure what's going on there, but it's New York, you know; I don't know.

Then we run the same thing with pandas: again reading its own data frame and executing the same query, but using the pandas API, producing the same plot. That took 1.7 seconds, quite a bit more than DuckDB, and remember, this is a machine that's not very powerful; if you run this on your MacBook, the difference will be more drastic. And now we run it with Spark. With Spark I actually tried to run it through the pandas data frame, but it crashed my Python process, so I had to let Spark create its own data frame from the Parquet file; otherwise I couldn't benchmark it at all on this Colab machine. So this is running.

[Audience] (question while waiting for Spark) [Pedro] Yeah, so of course, I would say that a query like this is not exactly data manipulation; it's a proper query, but a simple one for the purpose of the example. You can use DuckDB for basically whatever you want. We really designed it with local machines in mind: our initial goal is for users to get the most out of their laptops. But you can totally put it on a server, and we actually have a sister company called MotherDuck that's doing DuckDB for the cloud. The premise is a bit different, though: you start from your laptop, and you only grow to the cloud when you can no longer fit things on your laptop, and then you start paying the cloud providers.

So, Spark finished: 3.5 seconds. I guess all that library size didn't really help with the time. More megabytes, more time, apparently.

All right, a quick summary. DuckDB is an in-process database system. We really focus on fast data transfer, with none of the dedicated-server hassle, and with all the goodies of an analytical DBMS; we really try to hand users all the cool stuff academia has been doing for the past 50 years. It's designed for analytical queries, so think data analysis and data science. It's open source under the MIT license; if anyone wants to take it, fork it, create another company, and make money out of it, feel free. You can do that; it's completely free to use. It has bindings for many languages: I showed Python today, but we also have R, Java, JavaScript, Go; if there's a language out there, we probably have a binding for it. And it's very tightly integrated with the analytical ecosystem.
From the example I showed you, we can even read data frames and be faster than pandas playing their own game, kind of. And it has full SQL support. Last but not least, I brought some keychains and stickers. Please don't let me go home with these; I hate bringing swag back. So after the talk, come here and pick them up. Thank you.

[Host] Thanks, Pedro, for a good session, and for probably making ducks out of our databases, if we can say that. We have a few minutes for quick Q&A, and for that you'd have to come to the microphones here.

[Audience] Hi, thanks for DuckDB, we love it. We're using it with Lambdas; we have many Lambdas, so we split the problem into many pieces. Our quants rely on many GROUP BY operations, and AWS Lambda has limits, memory-wise and temp-storage-wise, and GROUP BY depends on a lot of temporary storage in DuckDB. Can you suggest anything to overcome the memory issues we're having with GROUP BY operations?

[Pedro] By memory issues, you mean it's crashing because there's not enough memory? I think in the latest versions there was more work done on out-of-core group-bys. I'm not sure exactly which version you're currently using. [Audience] We pinned 0.6.1. [Pedro] 0.6.1, that's already quite old, I would say; there have been about four releases after that, 0.7, 0.7.1, 0.8, 0.8.1. So I would suggest you try the latest version. [Audience] Okay, thank you. Cheers.

[Audience] Thanks so much for the great talk. We usually deal with a situation where we effectively have a DAG of expressions: a kind of root node with the first expression, and then we derive expressions from that, which you could express as multiple SQL operations. Do you think it would be sensible to use DuckDB with multiple SQL operations, effectively creating a tree of expressions? Would that be lazily evaluated somehow?

[Pedro] I'm not exactly sure how your expressions are implemented in this case, whether you're using something like a data frame or not, but if you use the DuckDB relational API, it doesn't actually execute the query when you chain operators. It creates a query plan, and that plan is only optimized and executed when you reach certain commands, for example when you try to fetch results. That's the point where we take the query, really go through the optimizer, do join ordering and things like that, and get a good result for you. So you should totally be able to do that.

[Audience] A quick follow-up: then it would probably already work for our use case if we just try it? [Pedro] I think so. [Audience] And if the data changes, say you would need to recompute, you would effectively have a new query plan? [Pedro] You would have the same plan if your data changes, right? [Audience] Say part of the data changes, in a file on disk, say in a Parquet file; would that be a fast operation? [Pedro] If the data changes in a Parquet file, you have to reread the Parquet file, yes. DuckDB itself supports updates; we do have MVCC implemented, kind of an MVCC for analytics, so if you're using the DuckDB format, that would be quite fast.
But if you're doing updates somewhere else and then dropping in a new Parquet file, you have to reread it; there's no way around that. [Audience] Perfect, really good, thanks. Cheers.

[Audience] Hi, thanks for the talk. I was wondering if you could explain a bit the relation between Substrait and DuckDB. [Pedro] Between Substrait and DuckDB, yeah. We actually have support for Substrait, but we don't use it internally in any way. There are some other database engines, like Velox, for example, that I think use Substrait as their actual query plan. What we have is an extension that allows users to read from Substrait and to output Substrait, but it's not used internally. I know there are some users and clients that use it, and what I can tell you about the current status is that we can round-trip most of the TPC-H queries; there are still some problems with correlated subqueries, for example. And we can run some queries with Ibis as well, but it's still very much a work in progress. [Audience] All right, thanks.

[Audience] Pedro, thanks for your talk, and thanks for DuckDB, which looks like a very promising thing; I've been playing with it and having very good results. The question would be: we're at release 0.8.1 or so; what would the DuckDB team call a 1.0 release? What would the conditions be, assuming that 0.x is not stable and 1.0 would be stable, which is my assumption? [Pedro] I would say it's not that the execution engine is unstable; I think it's quite well tested by now. What we would call 1.0 is when we have a consolidated storage format, to the point where we can always guarantee that we'll be able to read database files from previous versions. Our storage format has been changing quite frequently, because of new structures we ended up adding, and we didn't want the hassle of handling all of that, so currently, if you upgrade to, let's say, 0.9, you basically need to do an export and an import of your whole database file. [Audience] And is there an estimate for that, calendar-wise? [Pedro] Yeah, in the next 24 months. That was also the estimate last year. [Audience] Oh, okay, so you're perfectly on time. Thanks.

[Audience] I love the talk and the energy. My question would be, first of all, about the name, because it sounds like the focus is mainly on the compute, not the database part; the duck part is great, though, so explain that one. And does DuckDB actually provide or promise anything about fault tolerance? You said you can spin it up on a server and use it in production pipelines, for example, but what happens if there's a faulty data point in the stream of data that's coming in? How does it handle that?

[Pedro] For the first question, about the duck: the reason it's called DuckDB is that ducks are quite versatile; they can fly, walk, and swim. I'm joking, that's the corporate answer. The real reason is that one of the co-founders lives on a boat, and because of that he didn't really want a dog or a cat; I guess they're not great swimmers. He thought a duck would be the perfect pet, so he adopted a duck while it was still growing, and when the duck became an adult, it of course went off to live its own life.
But he grew very fond of his duck, he wanted to name his system DuckDB, and that's the real story.

About fault tolerance: we do have, for example, checksums in our storage to guarantee that the stored data is not corrupted, but there are definitely errors that could still happen that could potentially corrupt your data. We do have some more basic checks in place, though. [Audience] Even in the middle of a computation? For example, if I have one million rows and it fails on the 900,000th one, do I need to start all over again? [Pedro] Yeah, basically; that will most likely throw some kind of exception, and then you have to re-execute. [Audience] Okay; I guess you could have some kind of caching mechanism for that. [Pedro] Perhaps, but we currently don't. [Audience] Cool, thanks.

[Host] Thank you, Pedro, and thank you everyone joining us in the room and remotely; that'll be the end of the session. Have a nice evening. Bye.