Thank you, Christina. Hi everybody, welcome to today's talk on simplifying streaming data transforms with Wasm. Today we're going to cover a little bit about what WebAssembly is and how we use it in our data transforms feature within the Redpanda streaming platform. We'll talk a bit about the data transforms architecture and the associated use cases, I'll give a live demo, and then we'll get to Q&A at the end. So let's dive in.

Let's get started by talking about WebAssembly, or Wasm as it's abbreviated. WebAssembly is a low-level, portable binary format, so it's a compilation target for multiple languages. The idea is not to write directly in WebAssembly, but to write in a higher-level language like C, C++, Go, or Rust, or any other language that compiles down into a Wasm module. It's designed to complement JavaScript on the web. JavaScript has enabled such a rich web experience; being able to ship code from the server down to the client and run that code in your browser has revolutionized web applications. WebAssembly is the next level, because the code is compiled: modules are a lot smaller and a lot more portable, so they're easier to ship around and ship down to the browser. And unlike JavaScript, which is an interpreted language, WebAssembly runs a lot closer to native speeds. So it's smaller, easier to ship around, and faster. It can also work alongside JavaScript, so you can call WebAssembly functions from JavaScript and vice versa. It's around 20x faster than JavaScript, because all the browser, or the server-side application, has to do is decode the WebAssembly module and run the code, whereas JavaScript has to be parsed and interpreted.

At Redpanda, we think Wasm can do for server-side applications what JavaScript has done for the web, and that's exactly what we're doing with our data transforms engine. We're shipping a WebAssembly engine in a server-side application and essentially flipping the model on its head, allowing developers to ship custom code into the server and therefore change the behavior of the server-side application. We'll cover the details of exactly how that works, but before we begin, let's give a brief overview of what Redpanda is.

Redpanda is a modern streaming platform for mission-critical workloads. It's fast, simple, and reliable. It's designed from the ground up in C++ to get the full performance out of modern hardware: to run across all of the cores on a multi-core architecture, use lots of memory, and benefit from faster IO subsystems, so when we're talking about SSDs and things like NVMe drives, Redpanda is able to leverage those technologies. You'll find that Redpanda is very good at scaling up as well as scaling out. It's a distributed system, so it scales across many, many machines, and this enables new use cases that require high throughput and low latency. But because Redpanda is so efficient, it can also get equal or better performance from fewer servers, and what that means to you is reduced infrastructure costs. It's simple for operators to deploy and manage, and simple for developers to adopt and use. Redpanda fully supports the Kafka API, so out of the box it supports the entire ecosystem of Kafka-related tools.
So whether it's producer and consumer implementations, client libraries written in various languages, or integrations with other third-party systems, these tools will simply work with Redpanda. And because Redpanda ships as a single binary, it doesn't depend on any external systems, which reduces the risk that comes from operational complexity.

By default, and at its core, Redpanda is essentially a distributed, durable, fault-tolerant transaction log. Redpanda uses the Raft consensus algorithm to replicate data between all of the servers in a Redpanda cluster, and it uses Raft for everything: not only data replication, but also replication of state and metadata, and failure recovery. The Raft algorithm also provides things like leader election, so if a server in the Redpanda cluster fails, Raft will automatically re-elect leaders and Redpanda will remain operational. This allows Redpanda to deliver very predictable performance, even at very high loads.

So now we know a bit about Redpanda. Let's talk about how the data transforms feature uses WebAssembly, and essentially relies on a WebAssembly engine, to run custom server-side functions within the streaming platform itself. Data transforms allows developers to write custom JavaScript functions, and soon functions in whatever language you want, as long as that code compiles down into a Wasm module. You can deploy that JavaScript, or those Wasm modules, into Redpanda, into your streaming platform, and run simple data transforms against your streaming data, essentially against the topics within Redpanda.

The premise here is that it reduces the data ping-pong in and out of the system. Without something like data transforms, without being able to run custom functions in the streaming platform, you have to operate an external stream processing application like Apache Spark or Flink, or something like Materialize, to read or stream the data out of Redpanda, out of your streaming platform, run simple functions, and then write the results back into your streaming platform, incurring the additional bandwidth and also the internal mechanics of the streaming platform, such as data replication and storage. With data transforms, being able to run JavaScript and WebAssembly modules within the platform itself allows you to perform things like data validation, data normalization, filtering, and routing without the data ever having to leave the platform.

Let's talk a little bit about the architecture of how the WebAssembly engine is used by Redpanda. Here we'll discuss the tooling, the storage mechanism for the functions, and the execution of these functions via the WebAssembly engine. From a tooling perspective, Redpanda comes with a command line tool called rpk, and the idea is to make the developer experience as easy as possible. You can run rpk wasm generate and the command line tool will create boilerplate code for you. You'll have the Wasm API that Redpanda provides, with lots of templated code, and then you can just go in, fill in your function, and deploy that module into your streaming platform.
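Roughly, that tooling workflow looks like the following sketch. It's based on what's described in this talk; exact flags and generated output can vary between Redpanda versions:

```bash
# Generate a boilerplate data transform project (JavaScript template)
rpk wasm generate transform_avro

# The generated project contains roughly:
#   transform_avro/
#   ├── package.json   # JavaScript dependencies
#   ├── src/main.js    # where you add your transformation logic
#   ├── test/          # Mocha unit tests
#   └── webpack.js     # bundles everything into a single deployable module
```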
From a storage perspective, those functions, whether they're JavaScript or Wasm modules, are stored within a compacted topic inside Redpanda. Because it's essentially a topic, we can leverage things like Raft to replicate those functions to every server in the Redpanda cluster, so the functions are available on every broker.

From an execution perspective, there are a couple of flavors here. The first is an asynchronous version of the Wasm engine. This is for stateful, one-shot transformations, and I'll go into the details behind that on the next slide. The async engine is currently out in tech preview in Redpanda, and we welcome feedback on the API and the developer experience so that we can iterate and improve it over time. The second version of the WebAssembly engine in Redpanda is what we're calling the synchronous engine, and this is where we will embed the WebAssembly engine directly in the Redpanda process and run the custom functions on the hot read path, essentially as part of a fetch request coming from the consumers.

So let's talk a little bit about the differences between these two architectures and the different characteristics they're able to deliver, starting with the asynchronous engine. As I said, this is currently out in tech preview. It sits as a sidecar process to the core Redpanda process, so on every server in the Redpanda cluster you've got the broker process running, and alongside that you also have the WebAssembly process running. When you deploy a function into Redpanda, you tell it what topic you want the function to read from, or consume from, and what child topic you want to materialize the output of your function to, and the engine handles all the mechanics here for you. It will automatically create your child topic, or child topics if you're routing your messages, and you get one-to-one partition matching against the parent. So if you have a parent topic with, say, a single partition and 3x replication, your child topic will have one partition and 3x replication, and those replicas will be located on the same servers, the same brokers, as the parent. It essentially mirrors the parent setup.
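In concrete terms, that mirroring looks roughly like this. The rpk topic flags are standard; the child topic name shown is taken from the demo later in the talk, so treat the exact naming as illustrative:

```bash
# Parent topic: a single partition, replication factor 3
rpk topic create market_activity -p 1 -r 3

# Once a transform that produces to a "result" child topic starts running,
# Redpanda materializes a mirrored child topic (market_activity._result_ in
# the demo) with the same partition count and replication factor, and its
# replicas are co-located with the parent's partitions.
```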
In the async engine we call these one-shot, stateful transformations. What that means is that for every message you write into your topic, and for every replica of that message (because if you have a replication factor on your topic, that message will be copied to the followers by the Raft consensus algorithm), the function runs locally against the mirrored child partition. So it runs once for every message and every replica of the message. It's stateful because this is running asynchronously within the streaming platform itself: you can think of it as the function consuming from the parent topic, running, and then producing onto the child topic asynchronously. The state component is that the process running the function has to maintain its offsets against the parent topic, so if there is a failure, or the broker needs to restart, the function knows what offset it has processed up to and can continue from that point on. And because the function runs on every broker against every replica of the message, the transforms are assumed to be idempotent, so that your replica sets stay in sync and remain identical.

Let's have a quick look at this from an architectural perspective. Typically you'd have three brokers in your cluster, but it's easier to visualize with just two in this case. Here we have two Redpanda brokers running on separate servers. We have our coprocessor internal topic: we've written our data transform and deployed it into Redpanda, and via the Raft replication mechanism it's copied out to every broker, so it's available for running on every broker. The green topic here is our parent topic, and you can see the producer writing into the lead partition for that topic, with Raft replicating those messages out to the followers on the other broker. Internally, within the stream processing engine itself, there's another process called pacemaker, and this is what provides the state for the one-shot transformation logic. Pacemaker consumes the messages from the parent partition, and also from the replica, runs them through the Wasm engine, which in this case is Google Chrome's V8 engine sitting in the sidecar process, and the result is then materialized, or produced, onto the child partition. The same thing happens separately on the follower, against the replica, so the same state is generated in the child partition on the replica. And because the message at the parent level has already been replicated by the Raft consensus algorithm, we don't need to replicate the results at the child level: replication has already happened, and we've assumed that the function is idempotent and generates the same message on both of the child partitions.
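To make that idempotence requirement concrete, a transform should be a pure function of its input record. The following is a hypothetical JavaScript sketch contrasting a safe transform with one that would break the model; the record shape and Buffer usage are assumptions, not the actual Redpanda Wasm API:

```javascript
// Idempotent: the output depends only on the input record, so running the
// function once per replica yields identical child partitions on every broker.
// The record shape ({ key, value } with Buffer values) is assumed for illustration.
function normalizeSymbol(record) {
  const event = JSON.parse(record.value.toString());
  event.symbol = event.symbol.toUpperCase(); // deterministic change
  return { ...record, value: Buffer.from(JSON.stringify(event)) };
}

// Not safe in this model: the output differs per run and per broker,
// so the replicas of the child partition would drift apart.
function tagWithProcessingTime(record) {
  const event = JSON.parse(record.value.toString());
  event.processedAt = Date.now(); // non-deterministic
  return { ...record, value: Buffer.from(JSON.stringify(event)) };
}
```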
So let's talk about the differences between the async engine and the synchronous engine, which our engineering team is working on right now. The difference here is that the WebAssembly engine, the V8 engine, will run inside the core Redpanda process, and it will run on the hot read path, so the functions will run and lazily materialize their results for every fetch request that consumers send into Redpanda.

How this is all tied together is via a concept called data policies. A data policy essentially creates a relationship between a topic, a data transform (your custom function), and a consumer. When the consumer connects to Redpanda and starts reading messages from the topic, Redpanda checks the data policies to see if there's an associated data transform, and if there is, it runs that function as part of the fetch request, on the hot read path, and the results are streamed directly down to the consumer. So here you're trading storage for CPU. Whereas in the async engine you're materializing the results, and therefore storing them within Redpanda itself, here you're not storing the results, you're streaming them directly down to the consumer. But because the function is run on every fetch request, this takes more processing power. So whilst you'll be able to get more performance out of the sync engine, which is aimed at low-latency transforms, you do so by incurring the processing cost rather than the storage cost: you're trading storage for processor here.

What's really nice about the sync engine is that it allows you to apply different views over the same topic. Because you can create data policies for different consumers (and here we're talking about consumer group IDs and things like that), if you have a different policy per consumer, then each consumer sees a different view of the same topic, because the messages streaming from that topic are getting modified in flight as they stream down to the consumer. If a different consumer has a different function associated with it, it will see a different view of that data.

Visually, the architecture is a lot simpler. We still have our internal coprocessor topic where our functions are stored, and we still have our parent topic, although here it's not really a parent, it's just a normal topic. We do away with the child topic in this case, and we move the sidecar process directly inside the Redpanda process. When consumers connect to Redpanda and want to read data from the topic, Redpanda checks the data policies, and if there is a function associated with that consumer, the function runs against the data as it streams down. What's nice about the V8 engine is that it has a concept called V8 isolates, which essentially allows you to run each data transform in a sandboxed area within the WebAssembly engine itself. You can put some safeguards around what that function is able to do in terms of accessing memory and processor, and therefore you can protect the functions from each other and also protect the wider Redpanda process.

Okay, let's talk about use cases. The possibilities are infinite, because you can write whatever transformations you like; they really are custom functions. But the initial ideas we have for this functionality are data validation and data normalization.
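As a rough idea of what a validation-style transform could look like, here's a hypothetical JavaScript sketch. It's illustrative only: the record shape, and returning null to mean "filter this record out", are assumptions of the sketch, not the real API contract.

```javascript
// Reject records that aren't valid JSON or are missing required fields,
// and normalize the ones that pass, so downstream consumers only ever
// see well-formed messages.
function validateAndNormalize(record) {
  let event;
  try {
    event = JSON.parse(record.value.toString());
  } catch (err) {
    return null; // invalid JSON: drop the record
  }
  if (!event.symbol || event.close === undefined) {
    return null; // required fields missing: drop the record
  }
  event.symbol = event.symbol.trim().toUpperCase(); // normalize the ticker
  return { ...record, value: Buffer.from(JSON.stringify(event)) };
}
```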
Because the schema registry is embedded in the broker application itself, you could write a data transform to pull a schema from the schema registry and make sure the messages streaming through your topic adhere to that schema, or even use it to change the schema as the messages stream through. The demo I'm going to give next actually does that: it receives messages in JSON format, applies a data transform function to turn those JSON messages into Avro messages, and materializes those on the child topic.

Data masking is another good use case. If you have, say, GDPR requirements, you could use data transforms to mask your PII as the data streams through Redpanda. Here the parent topic would have the original message, the full-fidelity view of the information, and in that case you might want to apply ACLs on that topic to restrict access for downstream consumers so they don't see the PII. The child topic could then have a different set of ACLs applied to open it up to a wider audience, because the data has been obfuscated and is therefore safe to consume. Filtering is a good use case too: filtering out messages altogether, or filtering out values from within messages. And you can also route messages: you're not fixed to writing to a single child topic, you can decide to route messages to different child topics, so it's kind of a fan-out process in that case. The synchronous engine as well, because it essentially runs for every message on the hot read path, will allow future integrations with, say, stream processing engines, where you'd be able to do things like predicate pushdown and live filtering of the data that way. We'd love to hear your ideas, and if you want to chat more about use cases, please reach out to us on our community Slack; I'll share the link with you at the end of the talk.

Okay, cool, let's move on to the demo. Just before I switch over, to give an example of the templated, or boilerplate, code that the command line tool provides (and this is exactly what I used to generate the demo code): you can run rpk wasm generate and give it a project name, which in the demo's case is the Avro transformation. You get a couple of files that are pre-populated for you: package.json, which holds your JavaScript dependencies in this case; a source directory with main.js, which is where you add your transformation logic, and I'll show you an example in a second; a test directory where, for the JavaScript template, you get Mocha tests so you can do unit testing on your code; and webpack.js, which is essentially used to package everything up into a module for deploying into Redpanda. So let's take a look at what that looks like in practice.

Okay, so all of the code has been published to the redpanda-data GitHub repository, and in there there's a repository called redpanda-examples. There are lots of examples in here, but the one we're focusing on today is in the wasm subdirectory, and here is my generated code; my project name is transform_avro. There are a couple of other nice helpers in here as well. There's a compose.yaml, so if you want to run this example on your own laptop you can do so with Docker. There's nothing complicated in there: it essentially spins up a single Redpanda Docker container on your laptop and exposes the necessary ports that you need.
There's also an associated Redpanda configuration file. Most of it is pretty standard; the only thing to note is that enable_coproc is set to true, which just tells Redpanda to start the WebAssembly engine, the sidecar process.

Let's take a quick look at the code. Here you can see the templated code: we've got the webpack and package.json files that were generated for us, I've added a README so you can follow along with the demo in your own time, and we've got the source and test directories. So let's have a look at main.js. A lot of this, probably 90% of the code in here, is boilerplate generated for you by the command line tool. I've added a couple of things: an additional dependency on an Avro JavaScript library for doing our JSON-to-Avro transformations, and I've named my parent topic, which in this case is going to be called market_activity. Admittedly, I've been lazy and hard-coded my Avro schema in here; I could, and really should, have fetched it from the schema registry, but it's easier to see in the code this way. And here is my transformation function, called toAvro, where I essentially parse the passed message, assuming a JSON message, and then encode it as a binary Avro message using the schema above.

The rest of this is pretty much the default, but it's essentially the contract with the streaming engine, with Redpanda: Redpanda will call processRecord and pass you a record batch, and then you're free to do whatever you like with that record batch. In this case I'm just mapping over the records, calling my toAvro function on each one, and writing the results to a child topic called result. What you have to pass back is a map with the key set to the child topic name that you'd like (which is also generated for you) and the transformed record batch as the value. In a routing case you can add as many of these entries as you like, so you could have result2, result3, result4, each with a different set of transformed records associated with it, and that's how the WebAssembly engine and Redpanda route the messages to different child topics.

Okay, so let's run the demo. We can follow along with the README here as well. So, we have a single container running, and now let's build and deploy the code. First of all we'll run npm install to install our dependencies; I've already run this, so it shouldn't take long. There we go. Next we'll run our tests, in this case the JavaScript Mocha tests, which pass: the unit test just runs and calls our JSON-to-Avro function, so that's working as expected. Then let's run npm run build, which calls our webpack configuration to bundle everything up ready for deploying to Redpanda. Here we should have a distribution directory, and in there it's still called main.js, but that's all the bundled code, with any dependencies also bundled into that one file. Okay, the next step is to create our parent topic, market_activity, so let me just copy that. Okay. And now let's deploy our data transform to Redpanda using the rpk tool: we can do rpk wasm deploy, pass it our bundled-up main.js file, name the function json_to_avro, and send that in. Okay.
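To give a sense of the shape of that main.js, here is a simplified sketch reconstructed from the description above. It uses the avsc package as a stand-in for whichever Avro JavaScript library the demo uses, the schema fields are illustrative, and the processRecord name and batch shape are assumptions rather than the exact Redpanda Wasm API:

```javascript
const avro = require("avsc"); // stand-in for the Avro JS library used in the demo

// Hard-coded Avro schema, as in the demo (ideally fetched from the schema
// registry instead). The field names here are illustrative, not the real ones.
const schema = avro.Type.forSchema({
  type: "record",
  name: "MarketActivity",
  fields: [
    { name: "symbol", type: "string" },
    { name: "open", type: "double" },
    { name: "high", type: "double" },
    { name: "low", type: "double" },
    { name: "close", type: "double" },
  ],
});

// Parse the JSON-encoded record value and re-encode it as binary Avro.
function toAvro(record) {
  const event = JSON.parse(record.value.toString());
  return { ...record, value: schema.toBuffer(event) };
}

// The contract described above: Redpanda hands the function a record batch,
// and it returns a map keyed by child topic name ("result" here).
function processRecord(recordBatch) {
  const result = new Map();
  result.set("result", {
    ...recordBatch,
    records: recordBatch.records.map(toAvro),
  });
  return result;
}

module.exports = { processRecord };
```

Adding more entries to that map (result2, result3, and so on) is what gives you the routing behavior described above.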
One thing to notice is that despite the function being deployed, you still only have the parent topic, although you also now have the coprocessor internal topic. So our function has been deployed into Redpanda, but the child topic isn't created, or materialized, until we start calling the function. So let's do that next. In this green tab over here I've got some consumers; the command line tool rpk can consume from Redpanda. On the left-hand side we're going to consume from our parent topic, market_activity, and on the right-hand side we're going to consume from the child topic, which is going to be called market_activity._result_, so let's set that up as well. And I have a producer here; it's written in Scala, but it's essentially going to write some data in JSON format. It's market activity data, just downloaded from a market data website.

Hi James, sorry to interrupt: it seems like some people can only see your browser window, they're not able to see the actual terminal. Is there any way that you... Okay, yeah, of course, my apologies for that. Let's see, how does that look? Looks good on my end. Okay, my apologies. That's the trouble with doing live demos, isn't it? There wasn't much to see on this side of things, but what I will do is just flip back to the code. I think you actually saw the code, right? You just didn't see the terminal windows, so that's fine.

In the terminal window, essentially what I've done is run through the README. I created my parent topic, then I used the command line tool, rpk wasm deploy, to push the code, the bundled-up JS file in this case, into Redpanda, and I called the function json_to_avro. So that was run here, and the deploy was successful. If we list the topics in Redpanda, you can see we have our coprocessor internal topic, where our function is stored, and we have our parent topic, market_activity. On the consumer side, nothing to see here yet, luckily. On the left-hand side I've spun up a consumer to read from the parent topic, market_activity, and on the right-hand side a consumer to read from the child topic, market_activity._result_. And then on the producer side, I have a Scala-based producer which will write some data into our market_activity topic. It's JSON-formatted data that I've downloaded from the website, essentially simple market data for the S&P 500.

So let me run that up. I slowed the stream down here so it's easier to read, and you can see the key here is the stock ticker, SPX, and JSON-formatted messages which just summarize the day's market activity for that stock. Now if we go back to the consumer side, you can see on the left-hand side the parent topic, with messages streaming through in JSON format, and then, via our data transform, on the right-hand side the child topic, with the messages streaming through in binary Avro format. That will run for 15 or 20 minutes, so we don't have time to sit and watch it, but essentially that concludes the demo. We can flip back to the slides now. Let me just make sure you can see the slides. Right? Thanks very much.
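For anyone following along with the README later, the end-to-end demo boils down to roughly these steps, paraphrased from the walkthrough above; the exact flags and script names are assumptions and may differ slightly from the repository:

```bash
npm install                                        # install the JavaScript dependencies
npm test                                           # run the Mocha unit tests for toAvro
npm run build                                      # webpack bundles everything into dist/main.js

rpk topic create market_activity                   # create the parent topic
rpk wasm deploy dist/main.js --name json_to_avro   # push the transform into Redpanda

rpk topic consume market_activity                  # watch the JSON input...
rpk topic consume market_activity._result_         # ...and the binary Avro output
```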
Let's go to some Q&A.

Okay, so the first question we have is: what is the performance impact of running a data transform? It's a great question. Essentially, it varies, because the data transforms are completely custom; you can write whatever transforms you want, and that's going to have an impact on performance. If the transform is really simple, there's not going to be much impact, but if the transformation function is pretty complicated, or you're doing something like calling out to an external API for, say, enrichment, then that's going to slow down the stream and you'll see the performance difference there. With the asynchronous engine, it's obviously running asynchronously within the broker itself, so the parent topic won't be affected, but the messages are being asynchronously read from the parent topic, run through the function, and produced onto the child topic, so what you'll find is that the child topic will, in that case, slightly lag behind the parent. The idea is that we will provide monitoring metrics around this, so just as you can monitor consumer lag with Redpanda today, you'll be able to monitor data transform lag as well, because it's essentially the same thing, just running internally within the broker itself.

Okay, so: are there any limitations on what you can do with Wasm, for example HTTP calls from transforms? So, no, there aren't hard limitations, but data transforms are intended to be relatively simple, essentially a one-to-one mapping of messages. The intention, and there isn't really a framework enforcing this, is not to do things like aggregations or maintaining state in your data transform. So you wouldn't want to use it for window functions and complicated aggregations; for those we would still recommend using a stream processing application like Spark or Flink, which are more geared up to do those more complicated transformations. But if you're familiar with, say, Spark, the simpler map and filter functions you get in Spark would be ideal and perfect for data transforms, and they will run nice and quickly. So there aren't really any limitations on what you can do. If you wanted to make external HTTP calls to, say, enrich your data, you're free to do so, but what we would say is that in that case it's probably best to make sure you benchmark your code: benchmark the throughput and latency you get before deploying the function, and then do the same after you've deployed it, so you at least understand the performance implications and the different characteristics of the functions.

So, are there any templates for transforms in other languages? Great question. Not right now. Like I said, the initial version of data transforms, which is out now, is the asynchronous version. This is out in tech preview, and as part of the tech preview we have essentially just released templates in JavaScript. This is so that we can work with users to understand and get feedback on the API, the templates, and the developer experience, and then we can start iterating and improving on it from there. We'll be introducing other languages via WebAssembly in the future, so not only will we have a WebAssembly engine running in Redpanda, you'll also be able to write the transformations in any language that you prefer, as long as it has an associated Wasm compiler and you can compile that code down into a Wasm module.
At that point we will most definitely be providing templates for data transforms in other languages, probably starting with C, C++, and Go.

There's another question around performance implications: will it be feasible to write stateful-type functions? And the follow-up to that was, never mind, I answered it myself. Yeah, so that's right. At least initially it won't really be feasible. I mean, it is feasible, you can do what you want, but there will be performance implications if you start maintaining state within the Wasm engine, and you don't really want to do that because it will affect your throughput and latency. There's no reason that in the future we won't expand the ability to provide more complex functions. We'll certainly be expanding the API to handle common patterns, like adding init functions, for example, so if you wanted to do any initialization before your data transforms run, that could be something like downloading a schema from the schema registry. You could also do things like adding timed functions, if you wanted to keep re-downloading the schema and checking whether there have been any schema updates, or loading reference data into memory, that kind of thing. So what I would implore you to do is, if you have any ideas on what you would like to see from the data transforms engine and our use of WebAssembly, please reach out to us on the community Slack and we'll be able to have a chat with you about it and add those ideas onto our roadmap.

Okay, will there be support to allow pipelining of functions in the future? Again, this is a prime example of that feedback: that's a great idea. We have spoken about this internally; our engineers are discussing all sorts of possible features we could add to data transforms, and pipelining functions is a great idea and something we'll definitely add in the future. Let's continue that discussion on the community Slack, and I'm going to take these ideas and also open GitHub issues. All of the code is on GitHub, and all of the issues are on GitHub as well, so if you do have any ideas for feature requests around our use of WebAssembly and data transforms, please either raise them with us on the community Slack or open the issues directly yourselves.

Okay, so, oh sorry, one final question: what are some use cases where Spark or Flink would be a more appropriate choice, and are there plans to have these capabilities be a replacement for all Spark use cases? I think I kind of answered that before, but data transforms is great for the map and filter style functions. When you're doing window functions, aggregations, and the other thing that pops into my mind here, joins, if you want to join streams together, those things at the moment are a better fit for Spark and Flink. But again, there's no reason why we wouldn't think about adding these into Redpanda in the future. Those complicated functions, aggregations, windowing, maintaining state, doing joins on different parent topics, are much more complicated to add into a streaming platform.
If you're joining two streams together, you have to think about how you join them: you have to have a key, for example, and then you have to copy data over the network to match records together, so it becomes a lot more complicated to implement, and it changes Redpanda from a streaming platform into a stream processing platform. But that's great feedback and a great discussion to have on the community Slack.

Okay, thanks Christina. Well, thank you so much to James for his time today, and thank you to all the participants who joined us. As a reminder, this recording will be on the Linux Foundation YouTube page. We hope you're able to join us for future webinars. Have a wonderful day.