Hey everyone. Today we're going to be talking about distributed computation with Wasm and WASI. My name is Bailey Hayes. I'm a software engineer at SingleStore. I've been working with WebAssembly for several years now. At my last company, I worked on the more traditional use case for WebAssembly: a data visualization library running in the browser. Carl? Hi, my name is Carl. I've been working on databases for the last eight and a half years at a company called SingleStore, where I work with Bailey; we're both software engineers there. Over the last couple of months, I've been deep-diving into the world of Wasm with Bailey, and it's been a lot of fun. We're really excited to share our vision of distributed computation with Wasm and WASI, especially in relation to data. That's why we're here. Let's talk about apps really quick. We just had a great presentation from Liam about capabilities and how applications can become a bit more abstracted from their requirements. I think we also have to look at this from the angle of state. Applications have state; we all know that. Applications usually start off with very little state when you're developing on your own computer, but that gets a lot more interesting as apps grow. As we all know, applications tend to have a huge explosion of data, especially modern ones. We're seeing a surge of data-intensive applications, and data-intensive applications use a lot of state. If you look at any modern application, you have things like databases, blob storage, caching layers, queues: data everywhere. The interesting thing about data everywhere is that it leads to data moving everywhere. In my eight years at SingleStore, I've seen a lot of companies run into the same problem: they start moving huge amounts of data between their data layer and their application layer.
With the surge in data-intensive applications, we see this strange thing where the database isn't always able to run the right type of computation. We're starting to see apps move not just the answer to a question, but the actual data required to answer that question, into the application, where they run custom processing and custom code. This is really inefficient: it's energy inefficient, it's computationally inefficient, and it's slow. We believe there's potentially a better way. The ultimate question is: can we move compute to data? It's a simple idea, but it's really hard to execute. We believe Wasm might actually be a great way to represent computation in such a way that we can move that computation to the data, as opposed to moving the data to the computation. Today we're going to present our idea for how this might look in the world of Wasm, and we want to engage with all of you, especially later on tonight, about some of our ideas in this space and where we want to take it. Bailey, why do you think Wasm is such a great fit for this? Yeah, well, we're all here, so we already know a lot of the benefits of Wasm, but for us, we're a multi-tenant managed database service running across the clouds, and it's distributed, which means the security-by-design property of Wasm is probably the most important component. You wouldn't want your data to leak to somebody else, say from one bank to another. So that's the first feature we looked at: a Wasm module only has the capabilities we granted it. Because we're distributed, we also want a lightweight runtime and lightweight modules, so that it's easy to move them across nodes. Near-native performance, but also very predictable performance, is another key requirement for running Wasm as compute inside a database or any other kind of storage.
And the last thing, which is probably the most important one to me, because I'm not an expert in SQL like somebody else on this stage, is that I really love the idea of being able to write in the best tool, the best language for the job, and to use all of the testing and CI/CD provided by my favorite tooling in my language of choice, in a host-agnostic way. That's our goal with WASI Data, which, I should just say: introducing WASI Data. That's our proposal for how to do this. I want to be able to take any language, like Rust or Python, say I'm using WASI Data, and have that give me the framework for distributing my computation and pushing that compute down into a distributed database or some other type of distributed runtime. So WASI Data is partially something we have built and are actively working on, and partially a proposal for where we see this going. We've broken it up into four stages. In stages one and two, we're focusing on the basic building blocks of computation. In stage one, we basically just want lambda functions: functions that map a single piece of data to, potentially, another single piece of data. In stage two, we want procedural extensions, so that we can define a procedural, multi-stage algorithm or data application; we'll give you an example of that in a little bit. Stage three is about relational extensions: using Wasm to extend and augment relational query-execution systems. Inside a database, you have things like logical query plans, which describe how the answer is computed over a distributed system, and we want to be able to extend those with Wasm. And in stage four, we want to go even further and look at arbitrary computational DAGs: define those in Wasm and be able to execute them on many different data systems and distributed systems.
So we'll talk about each of these stages in a bit more detail. Let's start with the simplest one: defining functions. You have things like scalars: value in, value out. We want to operate on records, which for us might mean a row in a database, but could also be a struct in your language. And then we want to work on batches and vectors of these. This is basically the simplest unit of compute we can start with. Here's an example, and I've got three of them. The first is translating English to French: say we're streaming in comments and we want to make it really easy, within our database, to produce output in whatever language our users are using. The second example is sentiment analysis, where you read in text and decide whether the sentiment is, say, positive, negative, or neutral. If a customer had a really bad experience, we'd want to flag those negative comments for follow-up, like in some kind of Google review. The third example is risk analysis: say you're bundling up a bunch of loans and you want to decide whether it's too risky to offer that bundle. All of these would typically depend on some machine-learning model we've already trained and have now compiled to a Wasm module. That part was really simple: value in, value out. But we also want to be able to define procedures, that is, to create distributed algorithms in a multi-stage way. We need to define functions and create a plan; for us, right now the plan is just a string we pass in, which is definitely something we want to improve in later stages. In our case, the plan string will be something like a SQL query for our database, SingleStore, but it could also be, say, a MongoDB query. So let's look at that same translate example from before. We check whether we have any users whose language is French.
And if we do, let's update the comments so that they're in French. That's the idea of a really simple algorithm. Coming up with examples for this is sometimes surprisingly hard, and that was a very contrived one, but we just wanted to show that we want to be able to define a procedural set of steps in Wasm and run it. Then it gets really interesting. Once we have the basic foundation, the lambda function that can work on a unit of compute, and the higher-level idea of a Wasm module as an orchestration of distributed computation, we need to be able to define arbitrary distributed computation inside Wasm. We propose two ways to do this, and potentially both. The first is extending relational systems with Wasm. For anyone who isn't deep into databases: most databases, especially relational ones, use relational algebra, and they have things called logical query plans and physical query plans. A logical query plan basically defines a distributed computation. If in your SQL engine you say something like select star from foo join bar, internally the engine might represent it as some kind of tree, like this. But sometimes these plans are limited: you might want to do a distributed operation that's a little more unique to your particular workflow. In those cases, you usually have to pull the data out of the distributed system or the database into something like Apache Spark, where you can write something completely custom. We propose doing the opposite: allow you to augment relational query plans using Wasm and push that plan directly into the engine. We're looking at ways to do this, and we're really excited about exploring it with the community. So this is definitely future work.
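To make the stage-one idea concrete, here is a minimal sketch in plain Python of a "lambda function" that maps a record to a record and is then applied over a batch. Everything here is illustrative: the toy word-list scorer stands in for a real machine-learning model compiled to Wasm, and the record shape is not part of any proposal.

```python
# Sketch of a stage-one lambda function: pure value in, value out.
# The word-list scorer below is a toy stand-in for a real sentiment
# model compiled to a Wasm module; the record shape is illustrative.

POSITIVE = {"love", "great", "good"}
NEGATIVE = {"bad", "terrible", "hate"}

def sentiment(text: str) -> float:
    """Map a single comment to a score in [-1.0, 1.0]."""
    words = text.lower().split()
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return max(-1.0, min(1.0, score / max(len(words), 1)))

def map_batch(rows):
    """Stage one also operates on batches and vectors: apply the
    scalar function to each record independently."""
    return [{**row, "sentiment": sentiment(row["comment"])} for row in rows]
```

Because the function is a pure mapping with no side effects, a host (a database node, a Dask worker) is free to run it wherever the data already lives, which is the whole point of pushing compute down.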
And similarly, as I mentioned earlier, we want to be able to represent computational DAGs, arbitrary directed acyclic graphs of computation, which would allow you to do really complex workflows. If you could define a workflow like this entirely in Wasm, from the coordination aspect all the way down to the individual components, that would be amazing. And if you could do that in an abstract way that runs on many different distributed systems, that would be even more amazing. That goes back to what Liam was saying about continuing to abstract out those fundamental non-functional components and focusing on the thing that really makes your app work; we believe that's true for databases too. Hey, Carl, why wouldn't we just use something like Argo Workflows? Yeah, there are a million workflow engines and a million ways to do this stuff, but none of them let you push the computation all the way into the database. As you said at the beginning, Wasm is a fantastic way of representing computation, and we really believe that by defining these things inside a Wasm module and pushing it into the engine, into these distributed databases and distributed computation systems, that's where we're going to see a huge win. So, cool. All right, now a demo. This is going to be a live demo, so here we go. I think the code's still good, yeah? Okay, you're good. Remember that sentiment analysis example we keep coming back to? This is our Rust code for doing that sentiment analysis. We're building right now on wit-bindgen and the interface types proposal, so there wasn't a ton we had to do that's unique to us in order to run this in a distributed way. Carl wrote this macro, this Wasm interface macro, and it basically creates the bindings we need. So let's run it like a data scientist, which I am not.
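As a rough analogy for what that interface macro does, here is a sketch in Python of the underlying idea: annotate ordinary functions, collect them into an export table, and let a host invoke them by name. This is only an analogy; the real macro is a Rust attribute macro that generates canonical-ABI bindings via wit-bindgen, not a registry like this, and every name here is made up for illustration.

```python
# Rough analogy for the Rust interface macro: collect the functions
# a module wants to expose so a host can look them up by name.
# Illustrative only; the real macro generates canonical-ABI bindings
# with wit-bindgen rather than building a dict like this.
from typing import Callable, Dict

EXPORTS: Dict[str, Callable] = {}

def wasm_export(fn):
    """Stand-in for the attribute macro: register the annotated
    function under its own name and return it unchanged."""
    EXPORTS[fn.__name__] = fn
    return fn

@wasm_export
def sentiment(text: str) -> str:
    # Toy guest function: classify a comment's sentiment.
    return "positive" if "love" in text.lower() else "neutral"

def host_call(name: str, *args):
    """What a host (Dask, a database engine) conceptually does:
    look up an exported function by name and invoke it."""
    return EXPORTS[name](*args)
```

The ergonomic goal described in the talk is exactly this shape: the module author writes one plain function plus one annotation, and the host side gets a uniform way to discover and call it.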
I've got a Python notebook here, and I'm using a framework called Dask; if you're familiar with Spark, this is kind of the Python equivalent. The data we're operating on looks a little like this: a table of comments from Stack Overflow. We wanted to show you something with real data and maybe even do a real type of analysis on it. I've got my text, and I can see how large my DataFrame is; you can think of a DataFrame as almost like a table. I've got 1,000 rows, and that's really common: you pull a small subset of the data, iterate on it locally on your host, which might be Windows or Mac or Linux, and come up with some of your ideas. We're going to add our sentiment-analysis Wasm module, and right now I have it say "I love Wasm," which is a really positive sentiment. It's not completely positive, because it doesn't actually know what Wasm is, but if we say something negative like "COVID is bad," we see that we now get a negative sentiment. You might notice this compound field here, which basically combines the positive, negative, and neutral traits of the phrase. That just proves my Wasm module works. Now we're going to operate on all 1,000 rows, and we get output like this. That's interesting, and by playing around with it, I've formed my own hypothesis: on Stack Overflow, neutral statements should have the highest score, because Stack Overflow is supposed to give you answers, and those answers should be technical and neutral. Now, you're probably familiar with upvoting and downvoting posts on Reddit or Stack Overflow, but comments have upvotes and downvotes too. Because I now want to operate on the full data set, I'm going to connect to our database, to which we've also added Wasm support. That full data set is 82 million rows. Like before, I'm going to take the exact same Wasm module I was running in Dask.
I'm going to run it in SingleStore. The first step is letting SingleStore know about my Wasm module sitting here in this workspace, and then we do the exact same test as before: "Hello, Wasm Day, I love Wasm," and that's mostly positive; "hello" itself is kind of a neutral statement. So I'm curious about "read the docs." Do you think that's neutral? Negative? How do you feel? Anybody? Neutral, okay. It's true, it's neutral, but that just shows you that computers don't understand passive-aggressive. Now let's do something really cool: I want to run distributed computation. That was the promise of the whole talk, and that's what this query does. Our database is distributed; we're taking our Wasm module and running it across all of our nodes, over 82 million rows of data, and we very quickly get back the result of a query that does some analytics. It's pretty simple analytics: the idea is to bucket all the rows and group them by score, so if the score ranges from, say, 10 to 20, that's one bucket. When we look at this, we see scores that are both very negative and very positive, so from a quick peek at the data, maybe I can't draw any conclusions yet. The next thing you do is plot it. We render a really simple graph with a trend line using ordinary least squares (can you tell I'm not a data science person?). But we actually see two really interesting trends in this data. In this graph we're plotting our buckets on the x-axis. On the left-hand side, the y-axis, we're plotting polarity, going from zero to one, zero being completely neutral and one being very negative or very positive. When we graph this with the trend line, we first see that it supports my hypothesis that higher-scored comments are more neutral. But something else interesting came out: they also seem to lean toward neutral-positive.
Perhaps negative comments are more likely to get downvoted. I think that's pretty cool, and it shows a little of the power of being able to run the same Wasm module in two totally different frameworks. I'd love to see it show up in others, so if you're into it, come talk to us. And just something to point out about that graph: it was generated by pushing the sentiment code, which we compiled to Wasm, into a distributed system running in Google Cloud. So instead of pulling 82 million comments down to our laptop to generate that graph, we pushed the compute to Google Cloud to generate it, and all we pulled down to the laptop was essentially a couple of data points. That is the future of combining Wasm and data. It's sick. And our whole goal is to be able to do this in a runtime-agnostic way. Our example is pushed up to GitHub under SingleStore Labs; our last slide will have a resource link. We also want to say that we want to work with the other existing proposals currently in flight, like wasi-parallel and wasi-nn. We envision WASI Data modules taking advantage of wasi-parallel and wasi-nn: obviously, most of my examples were machine learning, so being able to work with neural networks and maybe even do task-parallel operations is very important to us. Sweet. So, in conclusion: Wasm and distributed computing is a fantastic combination. As Liam said, it's the future, and we really, really believe that. But we believe data has to be involved. WASI Data is really just a proposal, an idea: that we could push computation down into distributed systems, databases, and data systems so that data apps can stop pulling data out. Keep the data where it is, keep it efficient, and leverage Wasm for its superpowers. So that's our talk. Any questions?
And please scan the QR code for the GitHub repos, and also hit us up on Twitter or GitHub if you want to reach out. Thank you. So, any questions? And can we get questions from virtual as well? How does that work? Yeah, who's running that? So they should, in theory, be asking questions either in the Slack or in the chat in the virtual environment, but I hadn't seen any questions during the talk, because I think it was so interesting. Let me check again. The Slack is available all week, so if you have questions while you're chewing on everything you saw, Carl and Bailey will drop into the Slack and you can ask there. There's a question over there; go ahead and say your question, and we'll repeat it. Do you want to take that one? Oh yeah, we'll repeat the question so everybody can hear. So the question was basically: do we have strategies for limiting how much computation, or how many resources, a Wasm function uses during execution? I'll let Bailey take that one. Yeah, I love that question. Crypto has a really similar problem, where they want to compute the gas, or how many cycles will run on the CPU. That's super important for us, because you're running in a managed service, just like maybe a serverless function; if you squint and tilt your head, a lot of these cases are really the same. We haven't done anything here yet, but we'd love to take advantage of what crypto is doing in that space. Something I haven't seen yet is throttling, which is something else I'd really like to check out. If we could find a way to predict the performance: a lot of these machine-learning operations are long-running, perhaps always running, and perhaps you only want to run them when there's downtime.
And so being able to say, okay, I know what my gas is going to be, I know this is fairly low priority, I want to set it to this exact set of resource limits: yes, absolutely, that's a huge use case for us. If I may: James is asking a quick question in the Slack, and then we'll move on to the next presentation. He wants to know how this compares with the current MapReduce feature set and that kind of approach. If you have a quick answer, great; you can also elaborate in the Slack afterwards. So the quick answer is that in the current approach, with what we've implemented inside SingleStore and inside Dask, you can actually build MapReduce using a combination of your functions, which can act as mappers, and the frameworks themselves: SingleStore and Dask both provide the reductions, groupings, and shuffles you need to run large-scale MapReduce operations. In our sentiment-analysis example, we were effectively doing a MapReduce: we mapped each comment to a sentiment record containing the sentiment scores, and we reduced by aggregating those scores. So you can think of that as a MapReduce. But in the future, as we move toward relational extensions and workflow extensions, we want to be able to define arbitrary custom shuffling routines, custom repartitioning routines, and so on, and that's where you start to be able to build completely custom things. Yeah, we didn't talk about Substrait at all. That's true: we're working with a community called Substrait, and if you're interested in learning more about this kind of augmentation, about providing abstract logical plans, I definitely recommend checking them out.
We'll link to them in our resources section as well. Okay, do we have more questions, here or in the Slack? We've got just a few more minutes. We've got a question here, yeah, go ahead. No, that's all good. So, oh yeah: this code leverages WIT and wit-bindgen, one hundred percent. Essentially, this macro takes a mod written in Rust. The goal is ergonomic simplicity for people who might just want to write a simple Rust function and don't necessarily want to learn anything more than that. So we want to simplify it down to literally importing the interface crate and annotating your Rust mod (I think it's up on the screen) with some basic structs and so on. The macro walks the mod and exports all the functions using the canonical ABI, the same as wit-bindgen. The advantage this gives you on the host side, for example when we implemented the Dask side, is that all we had to do was take the existing wit-bindgen Python module and essentially load the WIT. What we're exploring is ways of eliminating the WIT stage entirely, because we really want ergonomic simplicity. That's not specific to WASI Data; it's more of a general goal for the ecosystem: we want to be able to write a simple Rust function and just deploy it, with good structural types, real types. And we did that for both Rust and Python. Really quickly, can we recap the question he asked, so everybody's got it? So hard to remember... yes: the question was about this Rust interface we built, the way it communicates with the host, and how it interacts with types and so on. Didn't you ask it much better? We do have just a couple more minutes, so if you want to launch another question, we're watching the Slack and the meeting platform.
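The MapReduce framing from the earlier answer can be sketched in plain Python. All names here are illustrative: in the actual demo, the map step ran inside the Wasm module and the reduce was a SQL GROUP BY in SingleStore, while the toy scorer below stands in for the real sentiment model.

```python
# Plain-Python sketch of the MapReduce framing: map each comment to
# a (bucket, sentiment) pair, then reduce by averaging sentiment per
# vote bucket. In the demo, the map ran inside the Wasm module and
# the reduce was a SQL GROUP BY; these names are illustrative.
from collections import defaultdict

def map_comment(row):
    """Map step: comment row -> (vote bucket, sentiment score)."""
    score = 1.0 if "love" in row["comment"].lower() else -1.0  # toy scorer
    bucket = row["votes"] // 10 * 10  # e.g. votes 10..19 -> bucket 10
    return bucket, score

def reduce_sentiment(rows):
    """Reduce step: average sentiment per vote bucket."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        bucket, score = map_comment(row)
        sums[bucket] += score
        counts[bucket] += 1
    return {b: sums[b] / counts[b] for b in sums}
```

The framework (Dask or the database engine) supplies the grouping and shuffling; the user-defined Wasm function only ever has to play the role of `map_comment`.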
So if there's a question online: how would you handle failure recovery during a batch process? Would the state of the process be stored somewhere so it could restart? How are you thinking about that? That's a great question. One of the cool things about Wasm is that I can inspect things before I ever actually run them, which is kind of how crypto systems like Ethereum compute gas ahead of time. For us, I could do things like annotate that I'm about to execute a transaction before I execute it. Wasm is basically a stack machine, so I could probably replay things if a step fails. I could also freeze my Wasm module where it is: say some kind of network or hardware failure is happening on a node, I can move that exact Wasm module to a different node and, all of this in theory of course, replay the transaction. So it's similar in spirit to Spark's RDDs, Resilient Distributed Datasets. Yeah, that's the idea, and that's one of the cool things about Wasm. Cool. Well, I think that's all the time we have for questions. Thank you all for attending our talk, and please, if you got the QR code... eh, okay, never mind. Anyway, there's the QR code, and we're all good. Thanks, everyone.