Hey, everyone. Hope you're all having a good conference so far and learning a lot about observability. My name is Dan Norris. I'm here with my colleague, Jonas Burgess, and we are here to talk to you about observability at the edge: instrumenting WebAssembly with OpenTelemetry. So to lead us off, I just wanted to give you a quick overview of what we'll be discussing for the next 20, 25 minutes. First off, my colleague is going to give you an overview of WebAssembly, in case you're not familiar with it, and a brief introduction to WasmCloud, which is a CNCF (hopefully soon to be incubating) project that we happen to work pretty closely with. Then I'll go into the more nitty-gritty details of why we decided to implement OpenTelemetry, why now, what our journey to get there looked like, and specifically how we ended up implementing support for OTel tracing, logs, and metrics. And we'll end with future work, where we think things will be going with these integrations. All right. So before I go into what WebAssembly is, I just want to quickly survey the room to see how many of you have heard of WebAssembly. Can I get a quick show of hands? OK, we've got some people. How many of you are using WebAssembly in production or just playing around with it, actually using it day to day? OK, cool. That's good. So for everybody else, just for your edification, a little bit about what WebAssembly is and why you should care about it. WebAssembly is a binary code format that's designed for executing programs. It was originally designed for the browser. The idea was that as applications in the browser start getting bigger and bigger, they start to slow down. There are just some limitations that you run into with JavaScript alone. So the idea was: what if we could bring more performant languages into the browser, things like Rust, C, and Go, and actually be able to run those programs inside of the browser?
So there are some interesting properties that came out of that. The idea was that we'd get a target environment where you can write the software once, and then you could run that software anywhere WebAssembly is available. But also, as you probably all know, the internet tends to be kind of a hostile environment. So the security model that WebAssembly has adopted is almost the polar opposite of containers. If you're familiar with running containers, you typically take permissions away from the container to secure it. But in the case of WebAssembly, the model is such that you add capabilities only for the things that you want the program to be able to do. So out of the box, the program can do nothing at all with the outside world. It has no ability to interact with things. But you, as the person deploying the software, get to define, hey, this needs to be able to talk to a certain file or things on the file system, but not anything else. So that's a really interesting property. And those are the kinds of things that, as it turns out, are very helpful not just in the browser world, but also on the server side. Furthermore, WebAssembly applications, compared to containers, for example, are tiny. One of the components in a demo that we've been showing around here is just 700 kilobytes when it's compiled to WebAssembly, as opposed to a couple of gigs for a container. So it has some really interesting properties from that standpoint. Adding onto the interesting things about WebAssembly (there's a theme here that you might be picking up on), WebAssembly also has this idea of components. What you can do is basically compose WebAssembly components built in different languages.
So WebAssembly components allow you to essentially define an interface of exports and imports: what your WebAssembly program does, or needs in order to function. And then you can connect those programs written in different languages. In this case, we have a WebAssembly module written in Rust that's exposing some functionality, like a library, and then we have another component written in Go that's actually importing that functionality and using it as part of running the program. This is all possible because of the interface that's used for composing these things together. So that's a pretty cool thing that you can do. Now, you might be wondering: OK, that seems interesting, but how do I actually deploy this? It's not like they're containers where you can just ship them on Kubernetes and off you go. Well, that's where CNCF WasmCloud comes in. WasmCloud was designed to be an application platform for running WebAssembly components in a distributed fashion. The idea is that wherever you can run a WebAssembly runtime, specifically in our case Wasmtime, whether it's in the cloud, on the edge, or on a small device that you might have in your house, you should be able to deploy WebAssembly components there. And furthermore, what we allow you to do is that those components, which might be deployed in different locations, can seamlessly talk to each other over RPC calls, because underlying all of this we have something called NATS, which is another CNCF project that basically acts as a message bus behind the scenes. So that's pretty cool. Just to wrap up, the takeaway from all that is: because WebAssembly has this target environment where you just compile down to a bytecode and then you can ship it and run it anywhere, WebAssembly essentially allows you to target all sorts of new environments that you might not have had available to you before.
And then specifically, WasmCloud steps that up by making it possible for you to take these WebAssembly components and distribute them globally across the world, and have those components talk to each other without necessarily knowing where they might be running. Cool, so that was a pretty good overview of the what, contextualizing where this is coming from. I'm gonna spend some time talking to you about the how and the why. So as I mentioned, all this WebAssembly has to run somewhere. We call that the WasmCloud host, and that is the program responsible for running all of these WebAssembly components. You can think of it kind of like a kubelet in the Kubernetes world. It's just a program that's sitting around waiting to be invoked by some sort of scheduler or orchestrator for the most part, or by individual control commands, to start some workloads, stop some workloads, update them, make sure everything's still running and keep things going. So again, very much like a kubelet, to make that analogy. And normally I wouldn't mention this because it's kind of trivia, but it's actually important for some of the things that we're gonna be talking about here: that whole project is written in Rust. And that has been interesting, to say the least, particularly when adding observability through OTel. We in particular use the OTel Rust SDK and also the tracing ecosystem that comes from the Tokio project. Not sure how many Rustaceans are in the crowd, but Tokio is almost the canonical way to do async code in Rust. And the SDK, like the SDKs in other languages (Go, C, Java, whatever), is mostly responsible for providing a lot of those types and also giving the client the capability to be an actual OTel exporter.
Whereas the tracing crate, which is more unique to the Rust ecosystem, is actually how we end up doing a lot of the instrumentation. It provides us all the macros that we decorate our individual functions with. And it's really handy because we can automatically do span propagation and all that sort of stuff, unlike something like Go, where all of that just attaches to a context. That all comes in through a completely separate library that is not actually provided by OTel. So the takeaway is that you actually do need both if you're intending to use OTel with Rust. Just a pro tip. But as a meta point: why did we decide to go with OpenTelemetry? I'm sure many of us in this audience are pretty much already sold on all things observability. But we ended up going with OTel in particular precisely because it is a single standard. It encompasses all of the signals (traces, metrics, logs) that most of us who are operating production systems really care about. Having that enables a whole bunch of functionality down the line, and you don't really have to worry about the implementation changing out from under you. It's also nice because inherent in the spec and the philosophy is that we decouple the source of our telemetry from what is receiving it. This whole notion of being able to instrument all of our code (in our case, this host code) with tracing or whatever, and being able to export that to just some sort of a collector, a thing that implements a standard, is really, really helpful, especially in the world that we happen to operate in, where these hosts can be running anywhere. They could be in a Kubernetes cluster in Amazon. They could be running on some bare metal in some colo.
They could be on some Raspberry Pi that's running in someone's closet or underneath their bed; we don't know. It could be an ESP32 in theory, a little tiny development board. So being able to handle that effectively, even when it comes down to the telemetry layer, is really, really interesting. Customers or consumers can decide that they wanna do a bunch of processing in line before they send it out. They could ship to big, massive collectors that live in their data center and then ship those out to downstream ETL systems, who knows. But it's nice to be able to ship that off and not be bound to a particular collector implementation like Fluentd or Vector. Great projects, don't get me wrong, we use them, but it's nice to be able to abstract that a little more. And the other benefit is that there's just an enormous ecosystem that has spun up in the OTel community, and it just continues to evolve and improve and add additional functionality. So we, as people who are contributing to an open source project, can continue to take advantage of that whole evolution. So we've been working toward this for a while, actually. We probably started backwards compared to how most people might approach this: we actually started with distributed tracing, specifically in May of 2022. And the reason that we settled on that was because we realized, as we were building out WebAssembly applications for the company that we work at, Cosmonic, but also for our customers and people who are in the WasmCloud ecosystem, that because of the way these distributed applications are being built, it was hard to debug. It was hard to understand what was going on. It was hard to be able to visualize it.
So we realized that distributed tracing in particular made it much, much easier to understand what's going on, to diagnose issues in production, and to see at a glance how these call stacks are flowing. So we decided to start there. Also, as a historical note, at the time (since that was a while ago at this point), the metrics and logs APIs weren't stable. Someone will probably correct me on this, but I know the logs one in particular was barely specced out at the time, and I don't think metrics was even in alpha yet; someone probably knows the actual dates. Anyway, it would have been a risk for us to take that on. And in particular with logging, since that's been such a core thing in observability anyway, logging to standard out and shipping that over (in our case, we were actually using a lot of Vector in production), we'd already solved that problem. So we felt that distributed tracing was good enough to start, and we would work with it as the standard evolved. We've actually recently revisited that. WasmCloud is pretty imminently going to hit 1.0, and so we decided it was time, now that things had evolved in the community. We really wanted to make sure that for a 1.0 we included all of the observability that SREs, or anybody operating these systems, would want. So we did decide to implement metrics and logs earlier this year. So I want to talk a little bit about each of the individual signals, how we ended up implementing them, and some of the nuances of the decisions that we made there. Let's start with tracing. As my colleague alluded to before, core to WasmCloud is that these WebAssembly components, these really lightweight, ephemeral things, actually communicate by RPC using NATS as a way to abstract the network, which is great.
It makes it really easy to run; stuff kind of knows what to talk to basically by topic or address. But what that means is that any time you want to invoke one of these components, or any time one of these components is reacting to some sort of an event, we need to be able to propagate span and trace IDs just like any other system. If you're writing a little Go program that is handling HTTP requests, that's handled automatically if you're using the SDKs. But since we're going through this third party, we had to figure out how to do that. The answer we settled on was similar in some ways. We have the ability, using NATS, to set arbitrary headers on a message. So, very similar to how spans are propagated in HTTP, we can attach key-value pairs of strings to each message, just as header metadata. And what's nice about that is that the Rust SDK for OTel has an Extractor trait. For anyone who's not as big of a Rustacean out there, a trait is kind of like an interface, similar to Java or Go. It's a description of the various functions you need to implement, and then libraries can consume that and just call the interface; it's called a trait in Rust. So we implemented that trait, and it turns out, as we were doing some research in the ecosystem, that the Kafka OTel instrumentation actually does something very similar. So it was nice to be able to use that as inspiration and use what we were able to do through the SDK. And it's nice because that means the host handles all of this. It's not even something a consumer, or anybody who's trying to write these applications, needs to worry about. So, an example of what this might look like: this is actually a NATS message, from just running the CLI subscribed to a topic to dump out all of the payloads.
Most of it is not that interesting. What I have highlighted in yellow down here is all the headers on this message. In this case, all it really has is the span and trace IDs; that's what the traceparent key is. If you've ever unpacked OTel HTTP headers, this shouldn't be surprising. It's pretty much the same thing. But our extractor, any time it receives a message, knows that it needs to pull out the trace and span IDs and propagate all of that data outwards. And then there's a JSON payload, but that part isn't that interesting, at least for this. So, as an example of this working, I pulled this out of Honeycomb, actually, from one of our production systems. I don't expect you to read it; it is very small. Part of that is deliberate, mostly to give you a sense of the fact that there are a lot of colors on here. Those colors are individual spans. These spans can be very, very deep in this system, because everything's traversing NATS and everything's cross-talking all the time, which is good and bad. Because, unfortunately, there are some limitations to OTel right now with Wasm. In particular, unlike some systems that you've probably worked with for distributed tracing, with WebAssembly right now and the runtimes that exist, it's kind of an open research question as to how you support tracing within the context of an individual function. So, in our case, our traces actually end right at the call to the component. You know how long it took; you don't know what it did inside the function you were actually invoking. You just know you made a call and that call did something. This is actually just a blown-up view of one of the parts of that trace that I just showed you.
So you can see where the invocations are happening, and then we make a call. And what's cool about that call is it's 81.8 microseconds, which is pretty fast. I'll take that. But I don't know what that is. It could be doing anything. So I don't really have the context; you kind of have to build that up through understanding what your system is doing. But as it turns out, yes, that's a limitation, and yes, there's ongoing work, but we have found it's actually still good enough, even in its current state, to really help diagnose issues in production, just like in any other system that you might be using distributed tracing with. Again, though, the downside is that you get these really deep call traces. So what we've learned, kind of a pro tip in this situation, which if you've worked with tracing is probably not a surprise, is that you have to sample. You just do not have a choice, unless you really, really like paying your observability vendor a lot of money, or spending a lot of money on storage with Amazon. I mean, they're cool with that, but I doubt you are. Again, it's kind of just a consequence of running these distributed systems. But what's nice about OTel is that, because these collectors are so flexible, there's tail sampling. So you can do that on your own, and as somebody who is consuming this data, you can decide how much of it you want to keep, how much you want to throw away, how much you really need at any given time to be able to understand the boundaries of your system. So let's talk about logs. And again, because a lot of this is particular to Rust, it turns out it's actually really straightforward if you've done distributed tracing. With OTel, if you've already gone through enabling tracing, that's like 90% of the work. Because if you think about what a log is, it's just kind of an event, and distributed traces are events.
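Circling back to the sampling point above: tail sampling lives in the OpenTelemetry Collector (the `tail_sampling` processor from collector-contrib), so the host can emit everything and the consumer decides what to keep. A minimal sketch of what that might look like; the policy names and percentages here are made up for illustration:

```yaml
processors:
  tail_sampling:
    # How long to buffer spans before making a per-trace decision.
    decision_wait: 10s
    policies:
      # Always keep traces that contain an error.
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Otherwise keep roughly 10% of traces.
      - name: sample-10-percent
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Because the decision is made after the whole trace is buffered, you can keep every error trace while aggressively dropping the deep, healthy ones that dominate volume.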
And so the tools that we already have at our disposal give you almost everything that you need. In our case, it was really just enabling some additional configuration to actually send those logs to an OTel remote endpoint that supported logging. Really, the hardest part about this was that there are some nuances in the SDK that make the code a little not great when you're initially configuring the exporter and structured logging. So it wasn't that bad. Now let's talk about metrics. If anybody has implemented metrics with Prometheus, actually instrumenting code and adding counters and whatnot, it's kind of the same thing if you look at it. A lot of the data types are the same. The loop's pretty simple as you're working through that mental flow chart: you find a thing you want to instrument, you decide on the type of metric that you care about (a counter, a gauge, a histogram, whatever), and then you modify it: you increment it, you decrement it, whatever. In our case, what we had to figure out was what it looks like to add additional metadata. Again, if you've worked with something like Kubernetes and you've looked at what those metrics look like, you know that there are things that get included in every metric, like a pod ID and a namespace, things to actually segment and know that this metric applies to this application running at this particular point in time. We realized early on we had to do the same thing. We included some sort of machine identifier and what we call a lattice ID, which is basically a namespace. We needed to be able to include that so that metrics are distinguishable. OTel does have some top-level metadata constructs, namely resources and instrumentation scope. Resources in particular are cool because they're kind of like top-level metadata of things that you want to include.
The problem is they don't automatically end up in labels without doing a bunch of work. And instrumentation scope is pretty useful; it's kind of a way to identify how you logically partition parts of your program. But we learned that, at least in the SDK that we're using, it wasn't really easy to get that data. So we had to figure out something else. What we chose to do was include those labels on every metric, just in our code. We wrote some functions to kind of decorate all that. But that was kind of a philosophical choice. We decided to do that because we wanted our downstream consumers to be able to aggregate all of that, or get rid of it, and really leave it up to the control of the collector to decide how they want to aggregate or not. On opting in: there is a way you can do this specifically with the Prometheus exporter. There's actually a flag you can set to take resources and add them automatically as labels, which we thought was a really good idea. But we realized it's optional, so it meant downstream consumers would have to know that that's an option you have to enable in order to get meaningful metrics. That wasn't gonna fly. We'd rather give you too much data and have you decide that you wanna throw it all away than make you figure this stuff out and become an expert. It just didn't make sense. All right, so that's a lot about how we went about it. Now, just to wrap things up, there are a few things that we're really excited about in terms of future work. I know that there's active work happening around the continuous profiling space and introducing that as a signal. So we're hoping that once that becomes a more formal thing in the OpenTelemetry space, we can just leverage that by adding an integration into WasmCloud. Then, as Dan mentioned, there's some function-level tracing that we don't currently have access to.
But the good news is that in the WebAssembly space there's WASI. I'm not gonna have time to get into WASI here, but there's essentially a group working on introducing a standard that will allow you to add function-level tracing inside of your WebAssembly components, as long as the runtime itself supports that standard. And then finally, something we'd like to implement across our entire project is ensuring we follow the OpenTelemetry semantic conventions, because if you're coming from a different system where you might be used to those semantics, it'd be really nice to be able to take the same things that you're already used to from elsewhere and apply that same knowledge to our metrics. So, with all that said, if you're interested in learning more about this, we have a booth in the solutions showcase at the back, in the startup area, K37. So come on by; we'd be happy to tell you more about WebAssembly. Also, the WasmCloud project itself, being a CNCF project, has a slot on Wednesday from three to five, sorry, three to eight, in the project pavilion, so we'd love to have you there as well. And finally, if you're interested in learning more about WebAssembly components, one of our colleagues, Brooks Townsend, and Michael Yuan from the WasmEdge project in the CNCF space are running a tutorial; the QR code we have here will put it on your calendar. And if you want to learn more about WasmCloud itself, we'd encourage you to check out our docs on wasmcloud.com, and the instrumentation setup that we have for the project, so you can easily kick the tires with this stuff, is available in our GitHub as well. So with that, thank you very much.