Welcome to extending a high-performance data streaming system with WebAssembly. We're going to talk about both what this system is and why you'd want to extend it with WebAssembly. My name is Tyler Rockwood. I'm the tech lead for data transforms at Redpanda Data, and we'll dig into what data transforms are and what Redpanda is.

The agenda has three main points. First, a quick introduction to what Redpanda is and what data transforms and streaming are, and we'll work through a case study. Last is the challenges of embedding a WebAssembly runtime within Redpanda. We have a high-performance computing environment, so we impose a lot of restrictions on ourselves — which I'll get into later — and those pose some challenges we'll work through.

First, the introduction: what is Redpanda in 60 seconds? Redpanda is a high-performance data streaming engine that speaks the Apache Kafka API. If you're not familiar with Kafka, that's okay — the simplest primitive I can give you is a log: you produce to the end of the log, and you consume from some offset in that log forward. It's used as a message broker to power lots of applications, decoupling the producing of events from the consuming of events. Records go into partitions, which are the smallest level of abstraction, and partitions are grouped into topics. You get ordering within a single partition, so you can think of a partition as essentially a log — a distributed log, and a cluster generally has lots of these partitions. Each partition is a Raft group: we use the Raft consensus protocol to make sure your data is replicated safely, with no data loss. And beyond being very high-performance with low TCO, one of our major selling points is that Redpanda is simple to manage and deploy. There isn't a bunch of different services to set up — it's one single binary that runs on any Linux system.

That's the quick overview of Redpanda; I'll get to the interesting bits of its architecture in a little while. Let me start with what data transformation means in a streaming pipeline, where you're producing and consuming events. Here's a quick example of an e-commerce use case on any Kafka-compatible API, whether Redpanda or Apache Kafka itself. Transactions happen in the business and get pushed into a topic called transactions, which carries the amount, who made the purchase, credit card information, all those sorts of things. That topic is consumed downstream by a separate application doing fraud detection. This is a use case we see for data streaming in e-commerce applications.

Now say your business wants to restock inventory very quickly: consume the real-time transaction events and, as things are bought, replenish stock. If you want to add this, it looks simple — you just add another component that consumes from this log and does that processing.
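For illustration, that downstream consumer can be very small. Here's a minimal sketch using the franz-go Kafka client — the broker address, group name, and topic are placeholders from this example, and any Kafka client works the same way against Redpanda:

```go
package main

import (
	"context"
	"fmt"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Connect to any Kafka-compatible broker (Redpanda speaks the same API)
	// and consume the transactions log from the consumer group's offsets.
	client, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.ConsumerGroup("restocking"),
		kgo.ConsumeTopics("transactions"),
	)
	if err != nil {
		panic(err)
	}
	defer client.Close()

	for {
		fetches := client.PollFetches(context.Background())
		fetches.EachRecord(func(r *kgo.Record) {
			// Each record is an ordered entry in the partition's log.
			fmt.Printf("offset %d: %s\n", r.Offset, r.Value)
		})
	}
}
```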
However, there's one problem: like I said, there's credit card data and other information in there that the fraud detection system needs but that you don't want to expose to the supply chain side of the company. One way to handle this is to write to a separate log that doesn't contain any personally identifiable information or other sensitive data, and have the restocking subsystem consume from that instead.

How would you do this today? You'd stand up a whole separate distributed application to read in the original transaction data, apply a really simple transform that strips out a few fields, and write it back. You can use Apache Flink — there are lots of solutions for these really simple, stateless transformations — but now you're spinning up entirely new infrastructure and a new cluster just to apply small transformations to your data.

Now you may be seeing where WebAssembly can come in. With Redpanda, within the broker itself, as events are streamed into one of the logs, we can read the records off that log, process them through whatever your custom business logic is in your WebAssembly module, and output them to a separate topic for you to consume. This lets you extend your broker to satisfy your business requirements without needing to set up extra infrastructure, figure out how to monitor it, or work out the costing and sizing — it's all built into the broker, with the metrics and logs provided for you.

At a high level: you have one log, and the log is a series of records. We take each record, push it through your Wasm transform, get records back out, and write those into the output log on a different partition.

Now a really high-level pass — I'm going to fly through this — of Redpanda's architecture. There's the Kafka API compatibility layer, which goes down into the Raft subsystem; that handles replicating data to multiple nodes, getting the appropriate consensus, and fsyncing it to disk so it's safe. That all happens today. What we've done is add WebAssembly on top of the Raft subsystem: we pull in data as it's committed to the original input topic, process it per log, and write it wherever it needs to go. So in my example, you have your transactions topic in and your clean transaction data out. As transaction records are written into the input topic, for each of the partitions we have a VM deployed on that core, and we pull those logs and push them through the WebAssembly runtime. Redpanda is a thread-per-core architecture — I'll say more about that in a bit — so we run one of these on every core where it's needed, and we run the transforms on the input partition's Raft leader. The Raft leader is the one that knows whether something is committed — successfully acked — and that's when it's safe to process. So we get the same safety and durability guarantees that we already provide.
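To make that data path concrete, here's a toy sketch of what the broker conceptually does per partition. Every name here is hypothetical — it illustrates the flow, not Redpanda's internals:

```go
package main

import "fmt"

// record is a toy stand-in for a Kafka record (key-value pair).
type record struct{ key, value []byte }

// applyTransform stands in for the call into the user's Wasm module:
// one input record in, zero or more output records back.
func applyTransform(r record) ([]record, error) {
	return []record{r}, nil
}

// runTransformLoop consumes records the Raft leader has already committed
// (acked and fsynced) and appends the results to the output log.
func runTransformLoop(committed <-chan record, output chan<- record) {
	for r := range committed {
		outs, err := applyTransform(r)
		if err != nil {
			// Real broker: halt or retry under at-least-once semantics.
			continue
		}
		for _, out := range outs {
			output <- out
		}
	}
	close(output)
}

func main() {
	committed := make(chan record, 1)
	output := make(chan record, 1)
	committed <- record{key: []byte("k"), value: []byte("v")}
	close(committed)
	runTransformLoop(committed, output)
	for out := range output {
		fmt.Printf("wrote %s=%s to the output topic\n", out.key, out.value)
	}
}
```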
So why use WebAssembly for this task? There's the obvious appeal of using any language you want, but for our use case of a high-performance broker, it's the very strict runtime limits you can impose. I'll detail at the end some of the challenges around those runtime limits — the constraints we program under within Redpanda that we're able to impose on the WebAssembly virtual machine, so that you don't get latency hiccups or performance issues while running and deploying user code. The WASI standard is also really great: it gives us a sandboxed, POSIX-like environment. We use it to deploy configuration via environment variables — "in this environment I have these policies," or whatever you want your application to do, you can inject as runtime configuration. And we use standard out and standard error as the logging mechanism, so you can just println, or however that goes in your language, and consume those through the normal logging frameworks.

Now, how do you actually use this? We have a tool called rpk, which is our one-stop shop for interacting with Redpanda, and we've built the developer experience into it. There's a way to initialize projects: you specify what language you want, give it a name, a description, some other metadata, and you get a single file. This example uses the TinyGo compiler to build a transform. You edit your transform to do whatever business logic you want — we'll walk through what that looks like very briefly — and then you build it. For some of the more obscure toolchains and compilers we'll download and manage a toolchain for you, so you don't have to; for more common things we just use the system compilers you have installed. And lastly, you deploy it against any input and output topics, and we manage all the machinery around replicating it safely, persisting it, and making sure it runs in all the right places within the broker and the entire cluster.

So here's a very short snippet of code — the hello world, so to speak, of streaming transforms. I'm going to walk through it; if you're not familiar with Go, hopefully it's still readable. I want to start with the signature, because it's the programming model we expose to people. You get a write event, which means an event has happened post-write: we've successfully confirmed it's fsynced to disk. You take that event, which contains a record, do some transform, and return a set of records or an error — a pretty generic Go pattern. The nice thing about returning a set of records is that I can do filters: one record in, zero records out. Say I want to fork just a subset of my data stream to another topic for whatever reason — I can do that. I can also flatten records: say a third party sends me an array of JSON events and I want one record per output, I can do things like that as well. And of course there are the standard one-to-one transformations: take JSON data in and spit out Apache Avro, Protobuf, whatever it is — or do what we talked about in the case study and strip out PII. Any simple transformation works.
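Concretely, the skeleton looks something like this. This is a sketch matching the shape described in this talk — the import path follows Redpanda's published Go transform SDK, but newer SDK versions hand you a RecordWriter instead of having you return a slice, so check the docs for the version matching your broker:

```go
package main

import (
	"bytes"

	// Import path as published in Redpanda's transform SDK repo; exact
	// names may differ between SDK versions.
	"github.com/redpanda-data/redpanda/src/transform-sdk/go/transform"
)

func main() {
	// Register the callback invoked once per record, after the record has
	// been written (and fsynced) to the input topic.
	transform.OnRecordWritten(filterInternal)
}

// filterInternal shows the one-to-many shape described above: returning an
// empty slice drops the record, and returning several fans it out.
func filterInternal(e transform.WriteEvent) ([]transform.Record, error) {
	r := e.Record()
	if bytes.HasPrefix(r.Key, []byte("internal-")) {
		return nil, nil // filter: zero output records
	}
	return []transform.Record{r}, nil // pass through unchanged
}
```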
Here's an example of taking XML input data. I don't like working with XML — I've written lots of Java — and it's friendlier to work with JSON these days. It's a small Go program that takes the input XML data, unmarshals it into a struct, serializes it to JSON, and outputs that, copying the key over. A record, in Kafka parlance, is basically a key-value pair; there are also headers for additional information, but for the most part it's a key and a value. That's a very simple transformation in just a few lines of code. You can deploy it to your Redpanda cluster, and we'll scale it across however many partitions you have and run it at very high scale and throughput, with all the tenets we hold to at Redpanda.
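The transform just described looks roughly like this — a hedged reconstruction, since the struct and schema here are illustrative and the exact SDK types depend on your version:

```go
package main

import (
	"encoding/json"
	"encoding/xml"

	"github.com/redpanda-data/redpanda/src/transform-sdk/go/transform"
)

// Order is an illustrative shape for the incoming XML; a real schema would
// match your data.
type Order struct {
	ID     string  `xml:"id" json:"id"`
	Amount float64 `xml:"amount" json:"amount"`
}

func main() {
	transform.OnRecordWritten(xmlToJSON)
}

func xmlToJSON(e transform.WriteEvent) ([]transform.Record, error) {
	var o Order
	// Unmarshal the XML value into a struct...
	if err := xml.Unmarshal(e.Record().Value, &o); err != nil {
		return nil, err
	}
	// ...serialize it back out as JSON...
	v, err := json.Marshal(&o)
	if err != nil {
		return nil, err
	}
	// ...and copy the key over so partitioning stays stable.
	return []transform.Record{{Key: e.Record().Key, Value: v}}, nil
}
```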
Okay, that's my whirlwind tour of what data transforms are and how they're architected within Redpanda. Now I want to dive into some of the more technical challenges of embedding a WebAssembly runtime into Redpanda.

First, a little about what makes Redpanda efficient and fast. I mentioned it's a fast piece of software; we hang our hat on really efficient, low-resource-utilization workloads at very high throughput and scale. The way we do this: we're a thread-per-core architecture — something the high-performance computing world is very familiar with — spinning up one thread per logical core in your system, and we use all-asynchronous system calls, so whether we're talking to the network or to disk, nothing ever blocks. We run a reactor loop, similar to the event loop in Node.js if you're familiar with that programming paradigm, to process tasks as they happen. And we share nothing in terms of memory between cores: you can think of each core as its own logical little computer. We allocate all the system memory up front and divide it among the cores in a NUMA-friendly distribution. When cores do need to talk — a network request may come in on core one while core two needs to actually process the write — we communicate over a mesh of single-producer single-consumer message queues. We use a framework called Seastar for all of this — a really great high-performance framework written in modern C++20, with a lot of advantages. We love Seastar; it's the foundation we've built this high-performance message broker on.

That's the architecture overview; now some of the constraints this programming model forces on us. We have these reactor loops and async IO, and we use cooperative multitasking to manage all the different tasks happening on a single core, since there's only one thread processing a logical subset of the system. That means each task is responsible for giving up its share of the CPU after some period of time. Failing to do so is what we call a reactor stall. Reactor stalls matter because they hurt your p99 latencies, which we care a lot about keeping very low. For example, say one core is processing ten network operations, and one of them takes 50 milliseconds of work: that's 50 milliseconds during which nothing can happen for the other connections — no IO, no work on their behalf. So any long computation needs to be broken up, to prevent it from blocking IO for other workloads and to keep scheduling fair. We heavily use stackless coroutines to do this, and as a mental model, our budget for executing a single task is about half a millisecond: 500 microseconds is our policy for how long a task should run before it yields control. C++20 stackless coroutines are a really nice programming model for this.

Here's a simple example: reading some data off an input stream, say from the network or even disk. In a normal system you might use blocking IO — a while loop that just eats the thread while it reads data. We don't use that paradigm. With coroutines, the co_await keyword inserts a scheduling point and tells the Seastar scheduler: don't resume this code until this event has finished. If you're familiar with Node.js, this is the async/await paradigm; Rust has the same async/await model, although Tokio uses a work-stealing scheduler, which we don't — everything is pinned to a single core. So it's of the utmost importance that we yield CPU control, because we don't allow any sharing. That's the paradigm.
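Redpanda's actual code here is Seastar C++20 coroutines, but since this talk's examples are in Go, here's the same yielding discipline sketched as a Go toy — illustrative only, using the 500µs budget from above:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// taskBudget mirrors the ~500µs policy described above: a task should not
// hold the core longer than this before yielding.
const taskBudget = 500 * time.Microsecond

// processBatch is a toy long computation with explicit yield points — the
// moral equivalent of a co_await scheduling point in Seastar.
func processBatch(items []int, work func(int)) {
	start := time.Now()
	for _, it := range items {
		work(it)
		if time.Since(start) >= taskBudget {
			runtime.Gosched() // hand the core back to the scheduler
			start = time.Now()
		}
	}
}

func main() {
	items := make([]int, 1_000_000)
	processBatch(items, func(int) {})
	fmt.Println("done without hogging the core for more than ~500µs at a time")
}
```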
Now, how do we prevent an embedded WebAssembly runtime from stalling and hogging the CPU? Say you have a transform that's relatively expensive — 50 milliseconds of compute. The average developer probably doesn't blink at a computation that long, but within Redpanda we can't allow it, because it means a ton of requests just hit a 50-millisecond latency spike. What we do is use stackful coroutines — stack switching. All of Redpanda runs as normal on stackless coroutines in the event loop, and for the WebAssembly VM we allocate a separate stack to execute on. That lets us swap back and forth between executing the VM and executing our own code. The advantage of stackful coroutines over spinning up a separate thread for the WebAssembly VM is that these context switches are very cheap — much of Redpanda's premise is bypassing the kernel for fast, efficient processing, and these stack switches take only about nine CPU cycles on modern machines.

So we allocate these separate stacks, and the nice thing is that a lot of the very advanced WebAssembly VMs, such as Wasmtime, let you inject instruction counts into the compiled code. You can say: after X instructions, yield control, VM, and switch back to my stack so I can run. We've been able to apply the same model in frameworks that don't support native stack switching as well — you can do the stack switching in userland, so to speak. That's how we prevent stalls: we run the WebAssembly VM for our task budget, then swap it out to run client IO and the other important tasks the broker needs to do.

This also lets us use the message-passing model I talked about from within the VM. There are host calls we expose in Redpanda's SDKs for reading data that lives on different cores. For example, some of our data is sharded across cores for memory reasons, because a single core doesn't have a ton of memory. So a transform running on core one may need to grab data from core two. With stack switching, we can swap the VM's stack out and run other computation while a message is in flight to core two, then bring the answer back to core one and copy it into the WebAssembly VM.
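That instruction-count mechanism is what Wasmtime calls fuel (there's also an epoch-based variant). Stripped to its essence, the idea looks like this toy — every name here is hypothetical:

```go
package main

import "fmt"

// toyVM illustrates instruction-count yielding ("fuel"): decrement a
// counter per operation and hand control back to the embedder when it hits
// zero, so the host can stack-switch to broker work and resume later.
type toyVM struct {
	pc  int // resume point, preserved across yields
	ops []func()
}

// run executes at most fuel operations, then yields; call it again to
// resume exactly where the VM left off.
func (vm *toyVM) run(fuel int) (done bool) {
	for fuel > 0 && vm.pc < len(vm.ops) {
		vm.ops[vm.pc]()
		vm.pc++
		fuel--
	}
	return vm.pc == len(vm.ops)
}

func main() {
	work := make([]func(), 100)
	for i := range work {
		work[i] = func() {}
	}
	vm := &toyVM{ops: work}
	for !vm.run(10) {
		// Out of fuel: in the real system, this is where the broker swaps
		// stacks and runs client IO before resuming the VM.
		fmt.Println("yield")
	}
}
```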
That was CPU; next I want to talk about memory and the constraints there. As I said, each core gets its own equal share of memory on startup, with a buddy allocator running on each shard, and there's no synchronization of memory between shards at all. We also don't use the page cache for any of our IO. That's a separate detail, but we do all of our caching within the application rather than relying on the kernel — we use DMA writes and the like — so we get the optimal caching strategy for our application, not the generic one from the kernel. One consequence of allocating everything as fixed pools up front is that memory allocation can fail during the execution of our program. Normally, if you malloc in a C program, the OS hands out virtual memory until it runs out, or it swaps. We don't allow that, mostly because those end up as soft-failure scenarios where your application starts swapping and performance tanks. This constrained memory model is part of how we make sure we're going very fast.

So we need to force WebAssembly to respect these constraints too. WebAssembly defines a linear memory — one large contiguous address space, largely for performance reasons — and there's another structure called a table that stores things like references to functions. Both are potentially large contiguous chunks of memory, and as Redpanda runs, memory fragmentation builds up, so we may not be able to allocate those large structures when we boot up a VM. You don't want to deploy a WebAssembly function and then suddenly not have enough memory for it, OOMing your Redpanda broker or something else. So for linear memories, which are the large pools, we have a budget: below a fixed size, we can always allocate a contiguous chunk; for anything over that, our strategy is to reserve a pool of memory up front. Since you're now sharing Redpanda's computing model when you embed your own functions, you have to carve out how much memory we can allocate for the VMs. On startup, each core reserves some amount — a few megabytes, maybe more if you're running something like JavaScript — so the VMs draw from pre-allocated memory that isn't subject to fragmentation. The tables themselves are usually fairly small, so we just cap them at the maximum allocation size we try to fit everything into.
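Here's a sketch of that reserve-up-front strategy — all names hypothetical, not Redpanda's actual allocator:

```go
package main

import (
	"errors"
	"fmt"
)

// vmMemoryPool models the per-core reservation described above: one
// contiguous chunk grabbed at startup, before heap fragmentation builds up.
// No locking, mirroring the shared-nothing, one-pool-per-core model.
type vmMemoryPool struct {
	buf  []byte
	next int
}

func newVMMemoryPool(size int) *vmMemoryPool {
	return &vmMemoryPool{buf: make([]byte, size)}
}

// takeLinearMemory hands out a contiguous slice for a VM's linear memory,
// failing fast so deploying a transform can't OOM the broker later.
func (p *vmMemoryPool) takeLinearMemory(n int) ([]byte, error) {
	if p.next+n > len(p.buf) {
		return nil, errors.New("vm memory pool exhausted on this core")
	}
	m := p.buf[p.next : p.next+n : p.next+n]
	p.next += n
	return m, nil
}

func main() {
	pool := newVMMemoryPool(8 << 20) // reserve 8 MiB per core at startup
	mem, err := pool.takeLinearMemory(1 << 20)
	if err != nil {
		panic(err)
	}
	fmt.Printf("handed a VM %d bytes of pre-reserved linear memory\n", len(mem))
}
```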
Awesome — so, takeaways. Redpanda can help you transform your data directly in the message broker, in a high-performance way. You can use any language that compiles to WebAssembly with the simple programming model we've exposed. WebAssembly is friendly to thread-per-core architectures and these more demanding requirements, and WebAssembly runtimes support advanced use cases like thread-per-core and limited memory usage.

Thanks for joining. I'd love to take some questions. We're going to be in the main section at booth F13 if you want to talk more. There's a link for feedback — I'd love to hear your thoughts on the session, the talk, the content. And if you're interested, we're hiring, on my team and other teams within Redpanda, so if this sounds like an exciting space, I'm happy to talk more. Thank you. Any questions?

I've got a question for you. I think one of the things you've really crushed is staying zero-copy — really fast transformations of data. What's the performance impact of the inline WebAssembly transformations? Right now you're pulling the data into WebAssembly, processing, and coming back out; soon you'll be able to keep it outside the guest-host barrier, I believe. What performance are you seeing?

Yeah — right now you have to copy every record into the VM and every record back out. We do this per record for lower memory overhead. We've been able to push a few megabytes per core per second through WebAssembly in our constrained environment, and I've done some benchmarking where we got over a gigabyte per second on a few medium-sized clusters pushing through the WebAssembly VM. That's the power of just-in-time compilation. There's a lot I didn't talk about — for example, WebAssembly runtimes use virtual memory tricks to avoid bounds-checking every memory access. We can't necessarily do that in our environment, so we have to enable a lot of the extra safety steps that advanced VMs have removed in more generic contexts, just so it works for our embedding. But we're still able to push the performance envelope quite well.

Great — other questions? Here in the back, Sean.

All right, I have a bit of a technical question. One of the performance bottlenecks in really highly parallel systems with WebAssembly can be on the runtime side, in how you use the store and the linker — in Wasmtime, for example. Are there cases where you're spinning up a new linker for every transform being applied? What does the parallelism model actually look like in the runtime context?

Yeah, that's a great question. We've done numerous iterations of our ABI to figure out the right model. We generate a single instance of a compiled module within Wasmtime per process. Every core then gets not a copy of it but a reference to that module, and at runtime we look up the host data through the embedding to make sure we're grabbing the information for that particular instance of the VM, if that makes sense. There's a way to embed instance-specific information in a store, and one of the pieces we embed is what we call the engine — the thing that holds the actual records being pushed through, along with metadata. So we only have to compile a module once for an entire broker, based on our ABI and how we've structured our embedding.

Okay, that makes sense, appreciate it. Yeah, of course.
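As an aside on that answer: the compile-once, reference-per-core pattern looks roughly like this expressed with wasmtime-go — a sketch from memory, not Redpanda's actual C++ embedding, so verify names and the versioned import path against the library's docs:

```go
package main

import (
	// Recent releases use a versioned path, e.g. .../wasmtime-go/vNN.
	"github.com/bytecodealliance/wasmtime-go"
)

func main() {
	engine := wasmtime.NewEngine()

	// Compile the user's transform once per process; compilation is the
	// expensive step, and the resulting module is immutable.
	module, err := wasmtime.NewModuleFromFile(engine, "transform.wasm")
	if err != nil {
		panic(err)
	}

	// Each core then gets its own store (per-instance state) referencing
	// the shared module; instantiation is cheap compared to compilation.
	for core := 0; core < 4; core++ {
		store := wasmtime.NewStore(engine)
		linker := wasmtime.NewLinker(engine)
		instance, err := linker.Instantiate(store, module)
		if err != nil {
			panic(err)
		}
		_ = instance // look up and call exports per record here
		_ = core
	}
}
```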
Just a question back here. Hi, great talk. I found your use case kind of fascinating — stripping out privacy-sensitive data and offering two copies of the record log. First off, is that transactional? And how does it handle the case where the code in the WebAssembly runtime fails — throws an exception, or doesn't deliver a record? Does the Raft leader reject the record? How does that work?

Yeah, that's a great question. First of all, we support at-least-once delivery semantics. We don't have exactly-once yet; we want to get there, and we have some ideas, but they're larger system changes. At-least-once processing is fairly common in Kafka, so that's the model we've rolled out first. On the Raft leader: this is all an asynchronous process, so we never block incoming writes and client throughput to run these. There are ideas for doing that later — validating payloads and things like that — but for now you can think of it as a background process running within the broker that consumes the data and writes it back out. That processing model means that for failure modes we can just restart and reapply, with at-least-once semantics and periodically committed progress. If your code just keeps failing — throwing exceptions, OOMing the WebAssembly runtime — we'd like to add something like a dead-letter queue so you can resume processing, which is more the standard for stream-processing applications. For now we halt processing as a safe default.

Great question, thank you. All right, please join me in thanking Tyler.