 All right, thanks for coming. We're here to talk about WebAssembly and observability in production. Before I start: this work I'm presenting is really a group effort with a lot of people, and I'm not necessarily the primary contributor here, so I want to take a moment to recognize everyone who has committed code, ideas, and documentation behind this talk. A lot of this has come together in the past two months or so, so it's a pretty big group effort. About me: my name is Ben. I'm a co-founder of a company called Dylibso, and I live in New Orleans, so if you're ever in the area, come find me and we'll talk about Wasm over a beignet or something. Before Dylibso, I worked at a company called Datadog, which is a pretty big observability company. One of the things I was trying to do there was figure out how to get Datadog's tools to work on the edge, think Cloudflare Workers or Fastly Compute@Edge, and that's where I was first introduced to this problem and learned a little about it. And what did I learn exactly? First, that these tools just don't work on the edge or in Wasm. Pretty much none of them work, and it's a pretty big lift to get them to work. Second, that companies will not deploy their business-critical code to these environments without observability. I witnessed this firsthand: I saw companies lose pretty big contracts for not having this solved, and I'm not going to say specifically who. This is why I think this is a critical problem for Wasm adoption. If we don't solve it, people are going to be very hesitant to deploy code that actually matters into production. So, a little outline of this talk. First I'm going to talk about why observability tools don't work in Wasm.
Then I'll talk about how I think we can achieve parity with traditional systems, so that it's roughly as good as using a normal VM or a normal deployment architecture. After that, I'll talk a little about the future and how I think Wasm can ultimately be a better target for observability, and how what is currently its main weakness can actually be turned into a strength that lets us build some interesting things. So I need to start with: what is observability? If you were at the last talk, Ben Titzer, another Ben, started with this same slide. I've been having a lot of trouble explaining this to people, so I think it's important to explain it from my perspective. When I talk about observability, I'm coming at it from the level of application developers, SREs, and operators working in production. It's worth spending about five minutes on what that means, because when I say a word like "debug," I want you to think about a human debugging a program, not GDB. It's a process. So with that in mind, I view observability as a process or a practice, not any particular set of tools. You might formally recognize this as an OODA loop, but if you've ever been responsible for code in production or been on call, you basically do this loop every day, and you'll recognize it. Just to walk you through it: generally you wake up, and the first thing you have to ask yourself is, what is the state of my system in production? To answer that, you look at dashboards, any alerts you got that morning, monitoring, SLIs, stuff like that. And if your code is used by anyone, you will almost always find a problem.
That problem could be as serious as your entire system being down, or as minor as an alert saying you're going to run out of disk space in three months. But it's a problem, right? Once you identify a problem, you go into an orientation phase: you're doing root cause analysis, writing queries, reading the code, talking to people, coming up with hypotheses about why this is happening and how you might fix it. After that, you have to decide, either together or alone, what action to take. And then you act. You change the system, push a piece of code, run a command, change the database, whatever you need to do to resolve it. And then you just loop, right? Again, if you've ever been on call with PagerDuty, you're familiar with this loop. So to me, observability is really about this loop, and about all the tools in service of making it quicker and more efficient. It's about understanding the state of, debugging, and fixing issues in production, which is pretty different from development: you don't have a lot of the same tools available. I think there's a case for observability tools being used earlier in the development life cycle, and a lot of companies are trying to do that; they call it shifting left. But we're mainly interested in what happens when Wasm becomes part of your production system, so I want to get people out of the mindset of the normal tools you use on a day-to-day basis, like a debugger. So what does the data in an observability system look like? The industry has standardized on a few kinds of data points, which we collectively call telemetry, and there are three main flavors. First is a trace. A trace is kind of like an atomic operation in your system.
It's made up of a tree-like structure of steps, which we call spans, each representing the duration and metadata of one step. What a step is exactly is up to you to define. It can be as high level as an HTTP request or as low level as a function call. But if you're used to seeing full application tracing, get that out of your head; this is a more abstract, higher-level idea. Imagine you type a query into Google: how many spans do you think that creates? Possibly millions, since it's hitting so many different systems. Typically in the application world we're not fully tracing every execution of every function. The point of traces is really that they help you identify bottlenecks, understand what is taking up the time in an operation, and see what each piece is doing. Metrics are another type. A metric is a pure numerical measurement that represents the state of some variable in your system at a given time. Metrics can represent infrastructure stats, like CPU usage, or application-level things that you define, like a user sign-up, but it all goes into the same sort of measurements that you can query and aggregate. They scale really well because they're very small, they're easy to aggregate and visualize, and they're an ideal source for things like monitors and dashboards. The last is logs, and logs are kind of the junk drawer of telemetry. A log is basically just a timestamp and some unstructured metadata. I think logs are useful because every program since the beginning of time emits logs, so you don't really need a lot of coordination.
Your system is generally made up of things you wrote, things other people wrote, maybe things written in the 60s, and it all probably emits logs. So if you can stream logs to one place, you can get a pretty good overview of your system, albeit quite low-bandwidth. They're useful for that kind of base-level observation. So where does all this telemetry go exactly? There are a variety of commercial and open source solutions for this, and you can mix and match them however you want, but in short they're all centralized databases which store your telemetry and allow you to visualize and aggregate the data or set up alerts and monitors. Let's look at a hypothetical Python web service; I want to show you what it looks like to integrate with one of these systems. This is a Python web service that uses Flask. The first thing to note is that telemetry can come from all layers of your application stack. Here you might be getting GC stats from Python, HTTP latency numbers from Flask, and traces from your application code. It can come from all over the place: things you wrote and things you didn't. The goal of these systems is to get the telemetry out of your application, out of band, as quickly as possible. Obviously, when you install this kind of thing to observe what your code is doing, you don't want it to affect the code at all. One of the main ways people do this is to deploy a side process along with your application processes, called either an agent or a collector depending on who you ask. It's basically a process that runs alongside your application and collects, samples, batches, and sends all the telemetry to the centralized system.
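Before moving on, here's a minimal sketch in Rust of the shape of those three telemetry flavors (the type and field names are my own for illustration, not from any particular SDK):

```rust
use std::time::{Duration, SystemTime};

// A span: one step in a trace, with a duration and metadata.
struct Span {
    trace_id: u64,          // groups spans into one trace
    parent_id: Option<u64>, // tree structure: None for the root span
    name: String,
    start: SystemTime,
    duration: Duration,
    tags: Vec<(String, String)>,
}

// A metric: a pure numerical measurement at a point in time.
struct Metric {
    name: String,
    value: f64,
    timestamp: SystemTime,
}

// A log: a timestamp plus unstructured text.
struct Log {
    timestamp: SystemTime,
    message: String,
}

fn main() {
    let root = Span {
        trace_id: 1,
        parent_id: None,
        name: "GET /count_vowels".into(),
        start: SystemTime::now(),
        duration: Duration::from_micros(206),
        tags: vec![("user_id".into(), "123".into())],
    };
    println!("span {} took {:?}", root.name, root.duration);
}
```

Notice why metrics scale so well: a `Metric` is just a name, a number, and a timestamp, while a `Span` carries a whole tree position and tag list, and a `Log` is arbitrary text.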
All right, so that's observability and how it works in the traditional world. Now let's talk about Wasm. To do this, I'm going to quickly describe a minimalistic functions-as-a-service platform we're calling IOTA. So let's design and build this thing. First we create the host and the web server, using Rust, the Axum web framework, and Wasmtime. We only need two endpoints: one to upload a module and store it on disk, and another to look up that module and run it. What's in there is all pretty standard web framework code, not too interesting, but there's a link if you want to check out what the code does. Our actual functions are just going to be the standard WASI bytes-in, bytes-out thing. To keep it extra simple, we're not going to use Wagi or anything like that. We'll just map the raw body of the HTTP request to the module's standard in, and map the module's standard out to the raw body of the HTTP response. That way we don't really have to write new code; it just goes bytes in, bytes out. So let's write an example program. This is a program that reads the incoming string from standard in, counts the number of vowels, and prints the number to standard out. To use this thing, we compile our WASI module, call our upload endpoint, give it a name, and then later invoke it with some data. Here we pass in "hello world" and it tells us there are three vowels in that, which I think is correct. And that's it. Now we can go raise some money and start a serverless company. It's that easy. So how can we add observability to this? This diagram shows our new architecture.
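As an aside, that count-vowels guest boils down to something like this in Rust (a sketch, not the repo's actual code; the real demo would be compiled to a wasm32-wasi target):

```rust
use std::io::{self, Read};

// Count the vowels in a string.
fn count_vowels(s: &str) -> usize {
    s.chars().filter(|c| "aeiouAEIOU".contains(*c)).count()
}

fn main() {
    // The raw HTTP request body arrives on stdin...
    let mut input = String::new();
    io::stdin().read_to_string(&mut input).unwrap();
    // ...and whatever we print to stdout becomes the response body.
    println!("{}", count_vowels(&input));
}
```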
It's fairly similar to the Flask one, but you'll see now we have all these WASI modules. The thing I want to call attention to is this red box here. This is really where the problem lies, and it's one of the main problems with WebAssembly in general. We're calling it the ambiguous IPC layer. The problem is that all of this telemetry data has to get out through a huge mixture of syscalls, and you might be surprised to learn that there isn't one unified way to get telemetry data out of these systems and into these agents or collectors. There are a lot of reasons for that, historical and otherwise, but it's just an unpredictable set of syscalls and UDP, TCP, gRPC, files, all sorts of stuff. The ultimate problem is that baked into all of our observability tools and libraries is the assumption that the code is running in a normal, privileged environment with access to the entire system. With Wasm, you can't really guarantee that. So let's dig into these specific assumptions and why they're a problem. The first assumption is: I can make IPC calls. If I want to track metrics and send them to StatsD or a Datadog agent, I need to make calls to a UDP server. Okay, can you even do that? I don't know. Is your host going to allow it? Do you want your plugin making calls to your local infrastructure just to get some metrics out? They also assume they can do this efficiently. Like I said before, they need to get this data out of band really quickly, asynchronously, non-blocking, just get it out of the application, which can also be a challenge in Wasm. They also assume access to high-resolution clocks: every piece of data I showed you has at least a timestamp.
And some of them, like traces, sometimes need to time things at nanosecond precision. How do you know how long something took if you don't know what the time is? You don't. So you either have to give access to clocks, or you might be out of luck. For instance, up until recently, Cloudflare Workers would basically lock the clock to the last I/O event. If you're not familiar, there's a lot of danger in giving untrusted code access to high-resolution clocks. The other assumption these tools make is that there's only one tenant. We built all these tools for the multi-tenant web, and the assumption is that if you install some kind of APM tracer in your code, that's your code, you want to capture it, and it goes to your Datadog or your Honeycomb. But with Wasm, we have this future where we are kind of the multi-tenant agents now, and these tools assume that all the telemetry is your data. In actuality, if someone uploads a module to our IOTA system here, we may want to send that telemetry to their Datadog instance, which is not something these tools assume. So that's where the Observe SDK and ecosystem comes in. Observe is a fully open source ecosystem that works around these problems and serves as a base foundation for a lot of observability tools in Wasm. What does this look like? You interact with it mostly as a library that you embed in your host. The main trick is that instead of the Wasm modules having to navigate that gap, we redirect up to the host, which is at a privileged level, where you can completely control what happens with that data and how it gets over to the agent in your system.
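That redirection can be sketched conceptually in plain Rust (this is not the actual Observe SDK API; the names are illustrative): the guest calls a narrow host function, the host pulls the bytes out of the module's linear memory, and an adapter decides where the data goes.

```rust
// One telemetry event crossing the guest-to-host boundary.
#[derive(Debug, PartialEq)]
enum Event {
    Metric { name: String, value: f64 },
    Log { message: String },
}

// An adapter is host-side middleware: it decides what happens to
// events (batch, transform, forward to Datadog, Honeycomb, etc.).
trait Adapter {
    fn handle(&mut self, event: Event);
}

// A toy adapter that just collects events in memory.
struct MemoryAdapter {
    events: Vec<Event>,
}

impl Adapter for MemoryAdapter {
    fn handle(&mut self, event: Event) {
        self.events.push(event);
    }
}

// A host function like `log(ptr, len)`: the guest passes a pointer
// into its own linear memory, and the host pulls the bytes through.
// Here a byte slice stands in for the module's linear memory.
fn host_log(adapter: &mut dyn Adapter, memory: &[u8], ptr: usize, len: usize) {
    let message = String::from_utf8(memory[ptr..ptr + len].to_vec()).expect("valid utf-8");
    adapter.handle(Event::Log { message });
}

fn main() {
    let mut adapter = MemoryAdapter { events: vec![] };

    // Pretend the guest wrote "counted 3 vowels" into its memory
    // and called the imported log host function with (ptr, len).
    let memory = b"counted 3 vowels";
    host_log(&mut adapter, memory, 0, memory.len());

    adapter.handle(Event::Metric { name: "vowels_counted".into(), value: 3.0 });
    println!("collected {} events on the host", adapter.events.len());
}
```

The design point is that the sandbox never needs UDP, gRPC, or files: its only capability is this narrow host interface, and the host owns everything downstream of it.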
So it's just a level of indirection where we're saying: instead of giving the modules access to the entire system, let's give them a narrowly scoped piece of the system that's just related to the observability data, and then we can completely control that. What does this look like to the integrator, the programmer doing this? Here's the example, going back to IOTA. You create an adapter, and I'll explain a bit more about what an adapter is later, but it's basically a piece of middleware that transforms this data for different platforms. Here we create a Datadog adapter and start it by passing in a Wasmtime linker and our Wasm data. Then you call your function as normal, and afterwards you send it a signal telling it our function invocation is over. That's really all you need to do. On the guest side, the story is a little more complicated. I'm sorry, this is a ton of code, but this is the count-vowels example, refactored out into functions so that I can demonstrate more of the API you have available. Let me go through each piece. First you'll notice this instrument macro. If you decorate your function with it, it will turn that function invocation into a span. You can include it or not; it's probably a little pedantic to put it on this function, but it's there for demo purposes. You'll also notice we have functions for metrics and logs. I'm creating a StatsD metric for vowels counted, saying how many vowels we counted, and emitting a warning log saying we counted this many vowels. And you'll also notice down here we have ways to annotate our current span with metadata.
So I've exposed the span-tags function, which lets us pass in some metadata that gets attached to the span we're in. This can be useful for adding application-level metadata to your spans, and we'll see why you'd want to do that in a second. So what does this look like to the operator, the person who's actually on call? This person will log in to Datadog, and now they'll see all of our modules emitting traces for this count-vowels operation. You can see things like: this took 206 microseconds to execute. You can see every function invocation that we traced and how long each took. At the very bottom, you see our span tags made it through. This is useful because, depending on the platform you use (they're all roughly the same), you can put metadata in here that you can later use to query, aggregate, or visualize this data. You can imagine, if a particular customer is having a problem, I might want to write a query for user ID 123, or have a monitor that only alerts if it's happening for that user, that kind of thing; use your imagination. You'll also see that Datadog recognized that this thing emitted a log as well, and it knows every log that came from this function invocation. If we click that, we'll see all the logs this function emitted. We're able to do this because in our adapter we can take that raw data and shape it in a way that the end sink, the end platform, can tie back together and understand. We also see on top that it knows this is an HTTP function: it knows HTTP-level details like headers and IP addresses and status codes. That's because we have an interface for injecting metadata into these traces from the host level.
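Pulling the guest-side pieces together, the instrumented count-vowels looks roughly like this (the binding names are paraphrased and stubbed out with prints so the sketch runs standalone; the real guest bindings live in the Observe SDK repo):

```rust
// Stubs standing in for the real guest bindings, which would be
// calls into the host (names illustrative, not the actual SDK API).
fn metric(name: &str, value: i64) {
    println!("metric {name}={value}");
}
fn log_warn(message: &str) {
    println!("warn: {message}");
}
fn span_tags(tags: &[(&str, &str)]) {
    for (key, value) in tags {
        println!("tag {key}={value}");
    }
}

// In the real SDK, an instrument-style macro on this function would
// turn each invocation into a span automatically.
fn count_vowels(input: &str) -> i64 {
    let n = input.chars().filter(|c| "aeiou".contains(*c)).count() as i64;
    metric("vowels_counted", n);              // StatsD-style metric
    log_warn(&format!("counted {n} vowels")); // warning log
    span_tags(&[("user_id", "123")]);         // metadata on the current span
    n
}

fn main() {
    assert_eq!(count_vowels("hello world"), 3);
}
```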
So here we're able to inject some metadata from Axum into this trace, and in our Datadog adapter we do it in a way that Datadog can recognize, so it can tell Datadog what it's looking at. That's helpful because Datadog can generate a bunch of stuff for you once it knows it's a web service. Here's another view of this function invocation, just the spans ordered by execution time, so you can see how this would be useful for tracking down bottlenecks, things like that. Finally: we were emitting that metric, right? I wrote a little query here to take the average vowels counted over time, and vowels are going up and to the right, so our vowel business is on fire. So let's talk about how this all actually works. We're going to dig a little deeper into what it's doing under the hood. The important part to highlight here is in red. We call this the Observe API, and really it's just a set of host functions that can be used to narrow down and replace that IPC layer we were using before. These host functions are implemented in the Observe SDK, and we can do it in an efficient, non-blocking way. We also provide language-level bindings on the guest side that help you safely call this layer, with types and all that. The API itself looks something like this. It's super alpha, it's going to grow and change over time, and this isn't quite all of it, but all it is is basically passing some pointers across and letting the host pull the information through. So what languages are supported? At Dylibso we have this general principle that we only want to build on top of what's supported everywhere, because we feel that with development tools, compatibility is really the most important feature.
So in order to do that, we build as close to the Wasm spec as possible, and that allows us to build these tools so they can run in any language on any runtime. With Observe, we have this working in Rust and Wasmtime, Go and Wazero, and JavaScript and whatever runtime environment it comes with. But we've written it so it can work with any language on any runtime, and it works today; you don't need to wait for something to happen. We also built this IOTA demo in all of those languages, just to test it and prove it out. And if you want to start a functions-as-a-service company with observability built in, go take one of those demos and then go raise some money or something. On the guest side, it's basically just host functions, so there isn't really any language support necessary: anywhere you can call host functions, you can use it. But we're trying to create nice little layers for each language. Right now we have C, C++, and Rust, and I've been experimenting with dynamic languages, so I have JavaScript support kind of working, and more will come really quickly. For adapters, these are the adapters we support right now. The data all comes out in a very generic, superset form, and then we create these adapters as a plug-and-play solution for wherever you want to send it, which could be useful. For our functions-as-a-service platform, we may want to tell people: hey, you uploaded your modules here, and if you use Honeycomb, you can just do a Honeycomb integration step and we'll send it to Honeycomb, or Datadog, or whatever. If you want your own OpenTelemetry setup, we do that too. So let's talk about the future, what's next. As I mentioned at the start, our next milestone is to achieve some level of parity with traditional systems.
We don't want this to be its own little world where you write all this weird instrumentation just for our thing. There's a lot of code, frameworks, and applications that are instrumented already, and we want as much of that as possible to just work in this ecosystem. That's the next step. After that, we're also exploring how Wasm can potentially unlock some broader challenges people have with observability, because we actually think it could be a more desirable target than traditional systems. On the parity front, the first thing we're looking at is creating little shims for existing libraries that let you support the normal stuff on the guest side. Say you have some libraries instrumented with OpenTelemetry. There's a concept called exporters in that world, which basically says: when I'm done capturing the data, this is where I'm going to send it. With OpenTelemetry, the default is usually a gRPC call to a collector. So we could provide a little shim that says: instead of doing that, send it to our Observe API. We hope to provide these shims to unlock the already-existing ecosystems of instrumented code and frameworks. Then there's dynamic languages and runtime-level instrumentation. This is pretty common in the traditional world, and I have a forked version of QuickJS that can emit these metrics to the Observe SDK: things like garbage collection and memory allocation, the things you don't really control or instrument yourself because they happen at the runtime level, but that you might want blended in with the rest of your observability data.
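Going back to the exporter shim for a moment, the idea is roughly this (a sketch with made-up names; OpenTelemetry's real exporter interfaces are more involved):

```rust
// A minimal stand-in for an OpenTelemetry-style exporter interface.
trait SpanExporter {
    fn export(&mut self, spans: Vec<String>);
}

// The usual default: ship finished spans over the network (gRPC/OTLP).
struct OtlpExporter;
impl SpanExporter for OtlpExporter {
    fn export(&mut self, spans: Vec<String>) {
        // ...serialize and send over gRPC to a collector,
        // which the Wasm sandbox probably can't do anyway...
        let _ = spans;
    }
}

// The shim: same interface, but it hands spans to the Observe API
// host functions instead of making a network call.
struct ObserveShimExporter {
    delivered: Vec<String>, // stands in for the host-call boundary
}
impl SpanExporter for ObserveShimExporter {
    fn export(&mut self, spans: Vec<String>) {
        self.delivered.extend(spans);
    }
}

fn main() {
    let mut exporter = ObserveShimExporter { delivered: vec![] };
    exporter.export(vec!["GET /count_vowels".into()]);
    println!("delivered {} spans via host call", exporter.delivered.len());
}
```

Because already-instrumented code only talks to the exporter interface, swapping the exporter is all it should take; the application and its libraries don't change.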
A couple of other things. Distributed tracing support: going back to the Google query example, when you enter a query into Google, it's probably going to hit dozens of servers if not more, and along that route is essentially a bunch of traces. A lot of these platforms let you glue all those traces back together through something called distributed tracing. Right now, Wasm is kind of a black box in that picture, but distributed tracing support would mean that instead of a black box, the trace actually continues down into the module. More guest language bindings, which is simple but a must. Then, instrumented guest runtimes: I'm going to start looking at Go next, I think. More adapters, and also making it easier for people to write their own adapters; it's already pretty easy, but we need to document it and make it an official thing. And also the multi-tenant stuff, which I haven't talked much about today: creating more official ways to control this data coming from multiple tenants. So what about the future? As I mentioned, we're exploring ways to turn this weakness into a strength, so to speak, and there are a lot of challenges we want to take on. The first experiment we've been working on is an auto-instrumentation compiler. Basically, this is a piece of code that you can give an already-compiled Wasm module, and it will statically recompile it and add Observe probes and tracing inside of it. This is super useful because, if you've ever worked with a lot of application developers, they don't want to trace the code or write instrumentation. And if you find out there's a problem, you could just turn this on at runtime without having to redeploy or anything.
These things could be pretty useful for that, and we're trying to push this as far as we can, to figure out what all we can do automatically given just the Wasm module. We feel like it can go pretty far. We like to look at Wasm not necessarily as a compilation target but more like an IR, an intermediate representation. There's a lot you can do once your code is in LLVM IR; a lot of tooling can happen at that level, and we feel there are similar things that could happen at the Wasm level too. Another thing we've been exploring is real-time configuration changes at runtime. One of the biggest problems I see with observability tools right now is that you often don't know what info you need to debug a problem until after it arises. If you instrument some application code, your first tendency is to instrument everything, all the functions. But like I said earlier, if you do that, I don't care how big you are, it's going to cost you millions of dollars to store all of it, and it's mostly going to be noise. You don't need to track every function invocation in your system. So what ends up happening is people have to step back. Their CFO comes to them like, can we not spend millions of dollars on Datadog? And they say, okay, let's step back and figure out what needs to be traced and what doesn't. And what always ends up happening is you trace too little, and then when a unique problem comes up, you need to go redeploy, add some logging messages, change some things, hack around it.
So we're thinking that with tools like this, we can strike a balance: if you can change these things at runtime, in production, then you might be able to zoom in on a problem piece of code without having to redeploy, without having it apply to every operation in your system, and without having to store all this data all the time or deal with things like sampling. You can imagine you have a problem with a customer, and the code involved just doesn't have the logging messages or the information you need to figure out what the problem is. You could maybe write a little query that turns on instrumentation just for that one function, just for that one customer, record it for a while, and see what's happening. Because in production, you often can't just take that code out, bring it onto your own system, and use the normal debugging tools: sometimes the data situation is too complicated, sometimes there's sensitive data in there you don't have access to. You have to be able to debug it in production. All right, so that's it. Check out our GitHub repo and file any questions as issues or discussions. Please keep in mind that all this stuff has really come together in the past two months, so things are going to be changing really rapidly. But that's also an opportunity to help shape it. If you think there's anything that should be changed, added, or removed, whether in the APIs or maybe you don't like our open source license or something, the next month would be a great time to say so. We're willing to do whatever we can to make it useful and make sure this is something we can all build on top of. So yeah, that's it, thank you. Are there any questions? Do we have time for questions? Does anyone know when this thing is over? Five minutes? Okay, four minutes.
I have four minutes for questions. Anybody? Yes? With what? Oh yeah, this would support Grafana, probably through StatsD or something. We don't have official support for that, but it would work. Yes? It works now, you can go try it out now; just go check out our GitHub. But I will say there are fewer controls on it right now. We're working on figuring out how you can have a bit more control over it, and also learning what we should and shouldn't trace, trying to figure out how to give you a dial so that it's not just everything. Right now you put your module in there and it'll instrument everything, but we want to give you more control. Some of that control is also going to be inverted to the SDK layer, so it could be a situation where you instrument everything, but at the SDK layer you filter things out so you don't have to actually send it along and store it and, you know, piss off your boss by spending all the money. Yes? Yeah, definitely. Well, actually not right out of the box, but that's the goal. This API I was talking about doesn't automatically hook all those things up, but it's there so that we can create those indirections to that API. So it would probably be an application-layer thing where you say: when I'm running in Wasm, don't send it to standard out, send it to this thing. Does Datadog itself support this? Datadog does support OpenTelemetry, the agent does I think, don't quote me on that. But the way it works now, if you have an OpenTelemetry agent, sorry, they call it a collector, we would basically just use an OpenTelemetry adapter here. What's important to understand is that at this layer, from the Wasm to the Observe SDK layer, you can think of it as kind of a superset of all of those formats.
So it's not really specifically the OpenTelemetry format; it's just a very generic format that we can transform. Yeah, the shim will handle that, basically. It doesn't just send it to our APIs; it's also probably going to do some transformation. I haven't fully figured it out yet, but it will probably do some transformation, or if not, there will be a special API that's just a pure OpenTelemetry channel to send data through, and then the SDK level actually transforms it. So the API is still somewhat in flux as far as how to support all those different things on the client side. All right, I think we're out of time, so thanks everybody.