Just to make sure people are in the right place: this is a talk on WebAssembly and databases. So, who am I? I'm a founding engineer at MotherDuck. MotherDuck was founded about a year ago with a vision of changing the core way databases are structured, by leveraging a core backing technology called DuckDB. DuckDB is an embedded database founded on the principle that current databases, especially how they're structured component-wise, are just a pain to deal with. Most people use Postgres and, really, databases that were designed in the 80s or earlier. So Hannes and Mark founded it with the idea of creating a modern database that they would actually enjoy developing in. We built a service layer on top of that which brings it into the cloud and provides it as software-as-a-service, with a vision of usability and accessibility. Apart from that, I've been in and out of databases and the data tier, working at various companies (Samsung, OpenX, Teradata) as a core developer on databases themselves, as well as growing data teams at scale.

All right, the rough structure of this talk: I'll start with the state of the world of WASM in databases, move on to the problems I see WASM trying to solve in database land, and then talk about what I see as the road ahead. So, three parts: where we are right now (which is all pretty bleeding edge), what's up next and in active development, and what's far out on the horizon. The talk is a little complicated to fit into only 30 minutes, so I'm going to dive directly into a lot of the mechanics and jargon; feel free to ask questions and stop me at any point.
If anything's unclear, I definitely prefer conversation over giving a speech, but we'll manage time accordingly.

So, the state of WASM in databases. The core place WASM is being leveraged in databases right now is UDFs: user-defined functions. There's a whole family of user-defined functions, but roughly they boil down to developers being able to run untrusted code inside the database itself. The core idea is to push logic that traditionally lives in application land into the database, for better performance characteristics. The downside is that whenever you push logic in this way, as opposed to using traditional SQL, you lose the optimizations the database would otherwise give you. Now, the natural question is: why make this trade-off and give up the optimizer? Essentially, because SQL is kind of a pain in the butt and we're developers. You want the full richness and expressiveness of a general-purpose language to solve real-world business problems. So any really complex business logic usually breaks down into some UDF running in the system, because you get variables, looping, and branching, which are traditionally very difficult to do in SQL. More advanced databases have their own answers, like Oracle's PL/SQL, which adds procedures on top of SQL, and there are other variations on the same idea, but the only practice recognized industry-wide is UDFs. That's really the only way you get cross-compatibility, or some semblance of it, between databases.

The other branch of where WASM is being used in databases is embedding an entire database in WASM. This allows the database to run on the edge as well (well, "edge" depending on whose definition of edge you use).
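To make the UDF trade-off above concrete, here is a minimal sketch using Python's stdlib sqlite3 module (an embedded engine, though not a WASM one); the table, data, and scoring logic are invented for illustration. Branching logic that would be awkward in plain SQL lives in a normal function registered with the engine:

```python
import sqlite3

def risk_score(amount, country):
    # Arbitrary business logic with branching: easy in a general-purpose
    # language, clumsy to express in plain SQL.
    score = 0
    if amount > 1000:
        score += 2
    if country not in ("US", "CA"):
        score += 1
    return score

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (amount REAL, country TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1500, "US"), (200, "DE"), (5000, "BR")])
# Register the Python function as a scalar UDF: name, arg count, callable.
con.create_function("risk_score", 2, risk_score)
rows = con.execute(
    "SELECT amount, country, risk_score(amount, country) FROM orders"
).fetchall()
```

The engine calls back into the UDF row by row, which is exactly why the optimizer can no longer see through it: to SQL, `risk_score` is an opaque black box.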
Traditionally this meant in the browser, but lately there have been pushes to get it running on edge servers too, using WASI and that whole branch of cutting-edge WASM development. As for databases on the edge: both SQLite and DuckDB have embeddings that run in the browser. On the SQLite side, Chrome is actually switching from IndexedDB to an embedded SQLite to power that functionality and the document store in the browser itself. The most striking example out there is Postgres: there's a version of Postgres that's been embedded, and it's actually Postgres running in a VM that itself runs embedded in WASM, which just shows the power of WASM, that we can even do that. How performant it is in practice remains to be seen, but it also goes to show how much we want this legacy code and these industry standards to come along. (Part of the rationale behind DuckDB trying to gain market share, by contrast, is not having to take all that legacy baggage with us.)

So why would you want to embed on the edge? Well, we talked about UDFs, which push application logic into the database. Here we're doing the exact opposite: we're pulling the database into your application, so you get the full power and richness of a database to store your application's state, without the downside of remote RPCs. The general trade-off, and really the core gap in this technology, is that you eventually hit a scalability wall. At some point you'll have too much data, or too much state, to keep running locally, and you'll want to spill out to a remote server to process it. Doing that seamlessly with two different database engines is traditionally a very difficult problem to solve.
You're incrementally better off if you have the same model on both your server and your clients, but you still have to handle the management baggage of shifting data back and forth yourself.

So what is the problem we're trying to address? I've mentioned this already, but it really is interoperability and the movement of compute and data across the system. For every node in your network topology, or your stack, we want to be able to execute and leverage the compute, as well as push and store data there, ideally with the same set of primitives. And this unlocks a whole slew of potential use cases. Here's some of the stuff we're looking at solving at MotherDuck in this space, and a rough categorization of it.

ETL agents: the idea of turning your application servers into processing nodes. The standard example would be logging, where all your application servers (network servers, load balancers, et cetera) run an agent with an embedded database that does local processing and stores logs locally, then pushes only filters and aggregate expressions up to the data marts that actually care about them. The data marts can also federate queries back down to the application nodes if you need higher granularity. This lets you leverage the full power of your servers, as opposed to centralizing everything into a data lake and then hitting a scalability wall there, on top of the cost.
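The ETL-agent pattern above can be sketched in a few lines; this is a hypothetical toy using stdlib sqlite3, where each node keeps raw logs in its own embedded database and only a small rollup ever crosses the network:

```python
import sqlite3

def node_with_logs(entries):
    # Each application node owns an embedded, in-process database.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE logs (status INTEGER, latency_ms REAL)")
    con.executemany("INSERT INTO logs VALUES (?, ?)", entries)
    return con

def local_aggregate(con):
    # Only this rollup is shipped upstream, never the raw log rows.
    return con.execute(
        "SELECT status, COUNT(*), AVG(latency_ms) FROM logs GROUP BY status"
    ).fetchall()

node_a = node_with_logs([(200, 12.0), (200, 18.0), (500, 90.0)])
node_b = node_with_logs([(200, 10.0), (404, 25.0)])
# The data mart merges per-node aggregates instead of ingesting raw logs.
rollup = local_aggregate(node_a) + local_aggregate(node_b)
```

A real deployment would also merge the per-node partial aggregates (e.g. re-averaging weighted by count), but the shape of the saving is the same: aggregate rows up, raw rows stay put.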
You have a whole bunch of servers that are, I mean, they are serving the actual request load, but by and large their disks sit idle, and meanwhile you're sending a whole bunch of network traffic up to central servers from already network-bound machines.

Client-side encryption of data: this is the idea that the client is the only party that holds keys and can query data that sits encrypted in the database. You can do this today, hackily, with application-node logic, but here you can embed it in the database itself, so it's seamless: the app just writes traditional SQL and the client side does the decryption. This is pure client-side encryption. It's not sending the key over in some encrypted form to decrypt on the server; the server has no way of ever reading the information stored on it.

Data apps: there are two use cases here. First, progressive rendering. The idea is you downsample your core dataset so it fits in the browser (or whatever your client is), the client runs queries first against the downsampled data, and then you slowly render out the full dataset. How you do that sampling and presentation obviously depends on the use case, but the idea is you get a really fast, snappy UI because the queries are all local, and then progressively full fidelity if the user needs it. Second, shallow views, the idea behind a lot of traditional data apps coming onto the market: you're working with a SaaS provider and some really large dataset, of which you really only have a shallow view.
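Going back to the client-side encryption pattern for a moment, here is a toy stdlib sketch of the property being described: the server stores only ciphertext it can never decrypt. The cipher here (an HMAC-SHA256 XOR keystream) is purely illustrative; a real system would use an authenticated cipher such as AES-GCM.

```python
import hashlib
import hmac

def keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    # Derive n pseudorandom bytes from the client-held key and a per-row nonce.
    out = b""
    counter = 0
    while len(out) < n:
        out += hmac.new(key, nonce + counter.to_bytes(8, "big"),
                        hashlib.sha256).digest()
        counter += 1
    return out[:n]

def encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    return bytes(a ^ b for a, b in
                 zip(plaintext, keystream(key, nonce, len(plaintext))))

decrypt = encrypt  # XOR stream ciphers are their own inverse

key, nonce = b"client-only-secret", b"row-42"
# This opaque value is all the server ever sees or stores.
server_stores = encrypt(key, nonce, b"alice@example.com")
```

The point the talk makes is that this encrypt/decrypt step can live inside the embedded client-side database, so the application just writes ordinary SQL.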
There's some small partition of that data you actually own and care about. This is targeted more at SaaS and application providers, but you can think of a video-game-type use case, where you care about your local stats and maybe the global leaderboard, and those are the only two datasets you'll ever see. All of that data could be pushed down to the local client: each client sees its small view, the warehouse stores the entire thing, but no queries have to go against the warehouse, which would otherwise be trying to partition and workload-balance across all these different clients. The clients just get their slice of the dataset. It's basically the model of pushing data out to S3 objects, pulling those objects locally, and querying them, but in a database setting.

Next I want to take a quick jump over to a similar technological movement, which I'm going to use as a parallel for how I see the evolution of WASM going: Hadoop and the rise of data lakes. Essentially, MapReduce, when it was created, was solving a very similar problem. It was solving it within a data center (so not across the entire network stack): the separation of storage and compute, being able to push compute down to the storage nodes, and being able to push data back up to the reducers. Some notable features: they used the JVM to solve the portability problem, because they wanted to run over heterogeneous, lower-cost hardware. And the key primitives are the map and reduce steps, where the map over keys tells you how data is funneled around the system, and the mappers and reducers push compute down to the various nodes.
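The map and reduce primitives just mentioned can be sketched in miniature; this toy word count shows the shape of the model, where map emits (key, value) pairs that determine how data is shuffled and reduce folds each key's values, logic that SQL would express as a single GROUP BY:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: emit (key, value) pairs; the key drives the shuffle.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values for the same key together.
    pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(key, values):
    # Reduce: fold one key's values into a result.
    return key, sum(values)

counts = dict(reduce_phase(k, vs)
              for k, vs in shuffle(map_phase(["to be or not to be"])))
```

In SQL this whole pipeline is `SELECT word, COUNT(*) ... GROUP BY word`, which is part of why, as discussed below, SQL front ends ultimately won.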
So you're pushing compute onto the already-sharded data on HDFS for the map stage, and in the reduce stage you can run reduce jobs on any node. That's the core funneling of compute (which for us is UDFs) and data (which for us is traditional data).

Some learnings from that era. One key thing, and really the core theme I'm going to carry forward: people genuinely love SQL, its expressiveness and ease of use, as well as the optimizer. MapReduce only really took off when it got to the Spark phase. MapReduce was a great starting point, but it didn't see industry-wide adoption because of just how difficult it was to write a MapReduce job. Anyone who had to write the really, really verbose MapReduce 1.0 mappers and reducers, with all the boilerplate code, plus the difficulty of getting the bucketing right in a distributed setting, can appreciate how far we've come with Spark and all the guarantees it gets you.

Then there's the n-plus-one-standards problem. MapReduce was basically a counter-movement against traditional data warehouses, which wrote everything in SQL and the special-purpose languages on top of it (PL/SQL, et cetera). So it came in and said, let's standardize a programming paradigm instead, which introduced yet another standard for how to write things. Even now there are still movements for Spark to become more and more like traditional SQL, but it's still a different standard. So there's contention and trade-offs in this space, which I think hampered how widespread Spark could have been. And then, though it's separate from this talk, the other place Hadoop really hurt was around how the managed offerings worked.
We're now getting Spark on Kubernetes and various out-of-the-box solutions, but the original deployments relied very heavily on users taking on the DevOps role, deploying across an entire cluster and managing it themselves. You spent more time managing the infrastructure than you did writing code.

So, the road ahead. We're really in the infancy of WASM in databases. The core movement, solving the problem of where compute and data live, we can see coming to life with these UDFs, which hearken to the map phase of MapReduce, and with the reduce side being able to run on the edge components. You can see some conceptualization of that today with UDF capabilities and the embedding of WASM on edge nodes. The core advantage of this over the traditional (let's say Spark) movement is that this is a seamless embedding into the existing technology stack; any existing stack could support it in theory. There are practical limitations, like Postgres running in your browser, but you could always have some Postgres-wire-compatible database stitching between the two, so you can conceptually have one seamless experience where you don't have to be aware of the topology of your system: the database itself handles the fact that it's X number of data nodes coordinating among themselves. And then JVM versus WASM: with WASM, any existing language can run on it. It's pure virtualization, a pure VM.
That gives you all the ease of use and cross-compatibility of a VM, but in whatever language you want. A huge problem with the push to Java was that, even though it rode the wave of Java taking off, we still have this huge fight between the Java ecosystem and what everyone else uses. So I think this is the right synergy: various established speakers have talked about how WASM would have replaced Docker; WASM definitely would have replaced Java if we'd had it at that point in time.

Okay, walking through the hierarchy I showed before: we had MapReduce, then Hive, then Spark with DataFrames, then Spark with the Catalyst optimizer and Tungsten, and finally Spark on Kubernetes. The Hive phase, I think, is where we're heading in this space. What was Hive for MapReduce? It was basically the first introduction of SQL into the MapReduce framework. What did that get you? It started to bring in a bit of the optimizer, as well as the usability aspects: people no longer had to write raw MapReduce programs; they could write Hive queries, which compiled into hideous MapReduce programs. But Hive still has huge adoption; most people who look at MapReduce stacks are using some Hive somewhere, hidden in the stack, even if it's just the metadata story in the catalog.

So where does this parallel land in WASM? With DuckDB (and you can see the beginnings at Substrait as well), we're starting to create hybrid execution models: you just write your traditional SQL, and across the edge database and your server databases it's broken into a parallelized plan. I'll show some examples in the next couple of slides of what this plan looks like, but the core point is that it naturally extends to UDFs.
A UDF can be pushed either onto the client side or the server side, so it can run locally or remotely, and there's already a natural interface language between SQL and the UDFs themselves. So the API surface, which is admittedly one of the hardest parts to get right about any new technology, especially an embedding of an existing technology, is already established for us.

It's a little cut off on the screen, but this is the query plan for the query on the right, which selects some custom polygon over a local set of suspected cars, joined with the New York City taxi open dataset. In the plan you can see these nodes down here, which are the local nodes, joining up against this remote execution, which is scanning the New York City taxi rides up here. And near the very top you can see our custom WASM function, or rather a UDF written in WASM, running right here. That detail isn't as important as how it all embeds together, but it showcases how hybrid execution works: the planner looks at where the local data and the remote data are and makes choices about where things should run. The more relevant part, once we start talking about optimization, is that the UDF could in theory get pushed to the rideshare sources themselves and run over there, if the planner finds the compute is too intensive on the server side and it makes more sense to shard it out, process on all nodes, and just apply the filter on the remote side against the dataset. If the optimizer is exposed to all these different metrics, it can in theory make those decisions. You can start to see how this could be stitched together, but where we are right now is simply having one seamless plan that runs across both the server and the client. This next one is the reverse.
I've just flipped the local and the remote. In MotherDuck you can actually run these; well, this set of SQL, you can't embed custom UDFs yet, we're working on that, but everything else here is an actual query plan from MotherDuck. So in this remote/local version you're querying the same dataset, but here you're running remotely to scan over the suspected cars, and then here you have the local side funneling up. To make it a bit more realistic: the first toy example would be something like you pulled in a CSV, or were shared a set of cars of interest, and you're probing for where they're located; whereas this one would be something like a data-science model predicting the set of cars you need to look at (much more computation-intensive), while locally you have your small dataset, say I'm a taxi company with a small set of dispatch cars I'm watching, and the full model runs remotely across all of New York City's possible taxi rides. So that's where we are right now, and what hybrid execution looks like.

The next phase is DataFrames. What did DataFrames introduce into Spark? (Sorry, just checking the time.) What DataFrames introduced into Spark was really a formalization of inputs and outputs, and this is where the component model fits very nicely on top of UDFs. This is the work that Substrait, and SingleStore as well, are kicking off: essentially, how do you model inputs and outputs in a way that's composable across different assets?
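Stepping back to hybrid execution for a moment, the local/remote split in those query plans can be sketched as a toy planner. Everything here (the catalog, table names, and the two stand-in engines) is invented for illustration; the point is only that one logical plan routes each scan to where its data lives and the final join runs on the client:

```python
# Hypothetical catalog recording where each table physically lives.
CATALOG = {"suspected_cars": "local", "taxi_rides": "remote"}

def plan(query_tables):
    # "Planning" here is just routing each scan to its data's location.
    return {t: CATALOG[t] for t in query_tables}

def execute(p, local_exec, remote_exec):
    results = {}
    for table, site in p.items():
        runner = local_exec if site == "local" else remote_exec
        results[table] = runner(table)
    return results

# Stand-ins for the two engines sharing one logical plan.
local_data = {"suspected_cars": [{"plate": "X1"}, {"plate": "Z9"}]}
remote_data = {"taxi_rides": [{"plate": "X1", "fare": 9.5},
                              {"plate": "Q7", "fare": 4.0}]}

p = plan(["suspected_cars", "taxi_rides"])
parts = execute(p, lambda t: local_data[t], lambda t: remote_data[t])
# The join over the two partial results happens client-side.
hits = [r for r in parts["taxi_rides"]
        if r["plate"] in {c["plate"] for c in parts["suspected_cars"]}]
```

A real hybrid planner would also weigh data sizes and compute cost when deciding where each operator (including UDFs) runs, which is the optimization story the talk gets to next.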
You have the UDFs, which are code compiled for execution, very much traditional WASM programs, but the inputs and outputs vary: your server could store data in a columnar format while your client stores it in a row format. Traditionally those imply very different optimization schemes; one of them will actually leverage vectorization in how it runs, et cetera. So you want modular components that can do scans, or an aggregate, or a sum over the dataset, which requires standardizing both the input format, in a way that's agnostic to how the data is physically stored, and the functions themselves, in a component format, such that your query plan can later become aware of them. That's where the DataFrame-side evolution comes in: having composable transforms, as well as composable inputs and outputs, will let the UDF bridges become properly optimized into the system, so you no longer face the trade-off of "do I write this in SQL or in a UDF?" With WASM components and proper virtualization, that's exactly what they were invented for: Substrait is working on what a query plan looks like in a way that's agnostic to the environment it runs in, and SingleStore is coming up with a standardization, on top of WASI, of how inputs of data, basically traditional data frames, would look.

Catalyst and Tungsten: I'll jump through this quickly, but essentially this is the optimization phase. What was Catalyst? Catalyst is simply Spark's optimizer.
It moved Spark from a traditional Volcano model, where every primitive executes one row at a time and pops up the entire stack for every row, to a system that can optimize, filtering out parts that don't actually need to be executed, et cetera; just a traditional optimizer. Tungsten was the creation of physical operators in Spark: instead of running in the Volcano model, you have compiled code that doesn't run in the JVM, so you get that performance boost, and it's compiled for exactly what you've optimized against, so it doesn't have to deal with things like overflow checks if you don't actually care about them. It really leveled up Spark's performance.

Where does this come into WASM? Again, as you start to standardize the UDFs and the interfaces on top of them, each implementation can differ: your server and your client can have different forms of UDFs, different components exposed as UDFs, that you plug and play as libraries in your code. Your code looks the same, in the sense that I just pull in a library that has this function, but that function is optimized differently on the client and the server, and that works seamlessly with the component model plus some intelligent stitching and optimization across the execution of the entire system.
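The "one logical function, different physical implementations per environment" idea above can be sketched as a small registry. The registry, names, and environments here are all hypothetical; the point is that the call site is identical everywhere and the binding is chosen at plan time:

```python
# Hypothetical component registry: (logical name, environment) -> implementation.
REGISTRY = {}

def register(name, env):
    def wrap(fn):
        REGISTRY[(name, env)] = fn
        return fn
    return wrap

@register("sum_floats", "client")
def sum_client(values):
    # e.g. a simple row-at-a-time loop on a constrained client.
    total = 0.0
    for v in values:
        total += v
    return total

@register("sum_floats", "server")
def sum_server(values):
    # e.g. standing in for a vectorized/columnar server build.
    return float(sum(values))

def bind(name, env):
    # Same logical call everywhere; the planner picks the physical binding.
    return REGISTRY[(name, env)]

result_client = bind("sum_floats", "client")([1.0, 2.0, 3.0])
result_server = bind("sum_floats", "server")([1.0, 2.0, 3.0])
```

In the WASM component vision, the two registered bodies would be separate compiled components behind one declared interface, rather than Python functions.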
On the WASM roadmap side: as I said, this leans heavily on WASM components as a framework. That's the biggest bottleneck, and really the key thing we need to figure out how to integrate with seamlessly in order to move forward. In addition, we need threading and a bunch of other niceties, because we want to interact seamlessly with existing stacks. You can leverage a lot of WASM today, but the hurdle of not being able to just take code and port it over directly is a big showstopper in most cases. That, plus people worrying about performance overheads; I think we can deal with the performance overheads fairly easily by showcasing the capabilities, but the porting story really is a showstopper right now.

That leaves about five minutes for questions.

[Audience question] So the question was: how does something like an aggregate function, and all its state management, map onto linear memory? A couple of different answers; I think that's a great question. By and large there's a long-running fight in the database community about how much to leverage the traditional kernel abstractions versus how much to write our own primitives. Netezza, for example, had a flat memory model for the entire space on the custom hardware they originally launched on, before the IBM acquisition, and they ran many times faster on just a single memory space. It's just a lot simpler to deal with linear memory.
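To make "dealing with linear memory" concrete, here is a toy model of an aggregate operator whose entire state lives in a flat byte buffer it indexes itself, the way a WASM module sees its linear memory; the page size is WASM's real 64 KiB, everything else is illustrative:

```python
import struct

PAGE = 65536                 # WASM page size in bytes
memory = bytearray(PAGE)     # the aggregator's private linear memory

def store_f64(addr, value):
    # Little-endian f64 at a byte offset, as WASM's f64.store would do.
    struct.pack_into("<d", memory, addr, value)

def load_f64(addr):
    return struct.unpack_from("<d", memory, addr)[0]

# Running-sum aggregate state lives at a fixed offset in linear memory,
# not in host-language objects.
STATE = 0
store_f64(STATE, 0.0)
for x in (1.5, 2.5, 4.0):
    store_f64(STATE, load_f64(STATE) + x)
total = load_f64(STATE)
```

The isolation property discussed next falls out of this layout: nothing outside the module can reach this buffer unless memory is explicitly shared.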
The great thing about WASM is that it gives you that abstraction: you can have linear memory that's isolated to the aggregator itself, and then shared memory where you need it, say, lower-level operators sharing memory with a higher-level operator. The flat memory space, and the isolation of that model, allows a lot of simplification in runtime execution, as well as letting you introduce your own custom paging, et cetera. So the question mostly boils down to how much of the industry-standard stack we want to adopt versus rolling our own, and that's really going to be case by case. In my personal experience the flat memory model works really well, but if you ask someone else you'll probably hear the opposite opinion: that it makes things harder to code against and forces you to roll your own abstractions.

[Follow-up] Yeah, that's a hundred percent correct. And that's the single biggest pain in database land: dealing with the kernel. Any time you get into that memory territory, most systems, like DuckDB when we ported it to the cloud, are just memory-greedy: whatever memory you're going to give it, it wants to run on.
So getting that to run in containers is kind of a nightmare, because you've isolated it away from knowing much about the host itself. It really does boil down to how much of the stack you want to build. My personal opinion is that the deeper you get into building out these high-end systems and runtimes, the more you end up building on top of, or around, the OS rather than just leveraging it. But it is a core blocker, and a realm where I still need to get educated. I know WASI is wrestling with the same thing: how much of the core system libraries from Linux, et cetera, do they include, and how much do they roll their own stack? I think that will help with some of these concerns. And the UDFs themselves, being modular components, could have two different models depending on what your underlying operating system actually supports, so not everyone has to make those judgment calls.

[Audience question] To the first question, about the four-gig limit: 64-bit memory is, I believe, already in the WASM framework and in most if not all of the WASM runtimes, so you get a 64-bit memory space; you're no longer limited to 32 bits. You can map much larger memories, though that still has to gain full adoption, and getting from a 32-bit to a 64-bit memory space is definitely a critical part of the runtime story. That said, as I mentioned, DuckDB originally had a 32-bit memory space as well, so you can do some tricks around memory management, but by and large, as data gets much larger, 64 bits for every operator is probably a core requirement. To your second question, could you rephrase that a little? Oh, okay.
So you're talking about traditional document stores, graph databases, the NoSQL and NewSQL movements. I'll treat those separately. The NewSQL movement in a lot of ways presented a false dichotomy: with MongoDB et cetera, what you're really doing is trading away core primitives, atomicity, transactionality, the core database guarantees (I'm blanking on the exact list here), to potentially gain some performance. But with UDFs et cetera, users can make some of those trade-offs themselves. It essentially becomes a different runtime inside the same environment: UDFs work on document stores just as well as they work on traditional relational databases. So the long and short of it is, you treat it as a different runtime, and your functions should be compatible across both environments. The slightly longer answer is that with componentization, these fundamental dichotomies become less and less a trade-off users have to make themselves. Do I run on MongoDB versus Postgres, with the characteristics of each? Instead, I just write my functions and the optimizer figures out what it can lock down, especially because you now have a full closure handling retryability and various things like that. Depending on which functions you call, the system can figure out which modes the user is actually trying to express, where traditional SQL just doesn't have the expressivity to make those trade-offs for you. Yes? Right now, well, document stores et cetera are definitely out of scope.
Relational databases, though: graph databases and the like are these days usually built on top of relational databases, as are a lot of data-science-type workloads. DuckDB itself has extensions that work with graph models as well as relational ones; in fact, one of its primary use cases is data-science workloads, which are more MapReduce-like in shape.

All right, I think we're over time, so I'm happy to take any questions offline. Thank you.