So let me introduce myself. I'm Liudmila, I work at Microsoft on Azure SDKs and observability. Later on I'll tell you about our experience instrumenting with OpenTelemetry, the ups and downs of it, and share some practical advice.

Hi, I'm Ted, one of the co-founders of the OpenTelemetry project, and I work at LightStep. So let's get started. Normally when I give talks about how OpenTelemetry provides value, it's slides like this about traces, metrics, and logs, how they're all braided together, and how that brings a lot of value. But that's not what we're going to talk about today. Today I want to focus on a different problem, which is that instrumenting open source software sucks. Say your library wants to log an error. Sounds like it'd be easy, but it's a surprisingly difficult journey. You don't know what logging solution the application owner picked, so the only option you can generally pick yourself is standard out, and that's probably not right, especially for metrics. So what do you do? The real problem is that your library needs to support many applications, so when it comes to picking a solution, you have to defer to the application. The application owner is the person who picks where the data goes, and there's no one right answer for that. And since all of the data needs to go to the same place, individual libraries can't each pick their own observability solution, because instrumentation is usually tied to a particular data system: you instrument with a particular library, which tends to dictate a particular format and send the data to a particular place, and every library can't pick a different solution there. So life is terrible, but that's not the only problem. The other problem is that open source libraries have to compose well with other libraries, which means they have to be very picky about their dependencies, right?
Libraries don't get to synchronize their releases with each other. The dependencies that one library takes on might conflict with the dependencies that another library takes on, and those dependencies might have dependencies which conflict with each other. Since instrumentation is always going to be out of date somewhere, version skew can be poisonous here, plus you haul in whatever transitive dependencies the instrumentation has, say a big implementation, along with it. So once again, life is kind of terrible. But we want to solve this, because instrumenting open source software is worth the effort. A lot of the heavy lifting in most applications occurs within these open source libraries: frameworks, database clients, HTTP clients. All of this stuff is doing a lot of work, and we want to know what's going on in those libraries.

So how can we fix this with OpenTelemetry? The first problem, dependency management, we address with a separation of concerns. Look at it this way: library authors only need to write the instrumentation, and application owners need to choose what to do with the data. So we can decompose observability into these two pieces, which in OpenTelemetry we call the API and the SDK. Library authors only touch the API; application owners only touch the SDK. The API contains only the tools for instrumenting libraries; it's a very thin layer of interfaces. The SDK, on the other hand, contains all the nuts and bolts for actually processing that telemetry, and it's a very extensible framework that lets you send the data in any format to any place. You then take the SDK and bind it to the API at runtime. So you aren't even forced to choose the SDK we provide; you could build your own SDK if you needed to. Why does this help? It helps because separating these concerns means that the API no longer has any dependencies, because it's just this thin layer of interfaces.
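The shape of that split can be sketched in plain Python. This is a toy model, not the real OpenTelemetry packages: the point is just that library code calls a thin, dependency-free interface with a safe no-op default, and the application binds a concrete implementation once at runtime.

```python
# Toy sketch of the API/SDK split; all names are illustrative,
# not the real OpenTelemetry packages.

class Tracer:
    """The 'API': a thin interface with a no-op default."""
    def start_span(self, name):
        return None  # no-op: safe to call even when nothing is configured

_tracer = Tracer()  # default no-op implementation

def get_tracer():
    """What library code calls; it never imports an SDK directly."""
    return _tracer

def set_tracer(tracer):
    """What the application owner calls once to bind an 'SDK'."""
    global _tracer
    _tracer = tracer

class RecordingTracer(Tracer):
    """An 'SDK' the application owner chooses; libraries never see it."""
    def __init__(self):
        self.spans = []
    def start_span(self, name):
        self.spans.append(name)
        return name

# Library code: instruments against the API only.
def library_operation():
    get_tracer().start_span("library_operation")

library_operation()          # safe before anything is configured: no-op

# Application code: binds the SDK, and the instrumentation lights up.
sdk = RecordingTracer()
set_tracer(sdk)
library_operation()
print(sdk.spans)             # ['library_operation']
```

Because the library only ever references the interface, no SDK dependency leaks into its dependency tree, which is the whole trick.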
So it can't create a dependency conflict, because it has no dependencies. The SDK, on the other hand, is only loaded once, and it's loaded by the application owner. The application owner is now managing the dependencies, and you're not going to see version conflicts within that SDK because it's only getting loaded once. So at the end of the day, there's always a way to avoid dependency conflict here, because all of the dependencies come in in one place and the application owner can manage that.

But that doesn't solve the other problem, which is backwards compatibility. We have to ensure that instrumentation is always backwards compatible with prior instrumentation, because, again, not everyone is going to go through and upgrade all of their instrumentation all the time. At the same time, we want to keep adding more functionality to OpenTelemetry. We do this by separating the API into stable packages and experimental packages. Stable APIs can only take additive changes that are fully backwards compatible. Anything else has to go into a new, separate package that is initially labeled experimental. As long as open source libraries depend only on the stable APIs, they're never going to hit version conflicts between the versions of the APIs they load. So once again, we're seeing separation of concerns making it so that the API never breaks anybody, because we're never going to break the API. And I should say that, in general, breaking APIs should never happen in any kind of widely shared library. Because at the end of the day, what takes longer: carefully designing the API so we don't create backwards compatibility issues, or fixing millions of call sites because we happened to break something? So the point is, actually none of this was specific to observability.
Any widely shared open source library should be thinking about these kinds of design concerns. It's just that at OpenTelemetry we care about stability and long-term support quite a bit. So you might be saying: this is great, but you said stability is important, so when will OpenTelemetry be stable? Well, today the tracing portion is totally stable. Metrics we're hoping to stabilize by the end of the year, and logs will hopefully be stable in early 2022. For each language, you'll want to look at that particular implementation; it lists which packages are stable and which are experimental. Also coming in the future, there are going to be more kinds of signals: RUM, eBPF profiling. We're going to keep adding stuff, and you'll be able to take advantage of it once it becomes stable. But you might be saying: great, I've got my library today, I want to do instrumentation, so what should I be doing? To answer that question, we have Liudmila here with some real-world feedback.

Yeah, thank you, Ted. So let me share the story of what we've done in the Azure SDKs. We instrumented our client libraries with OpenTelemetry. We've done it in Java, in .NET, in Python, in JavaScript, and you can find bits and pieces in Go. We've done it for almost all of our newer-generation SDKs: Storage, Key Vault, Cosmos DB (in Java only), our messaging SDKs (Event Hubs, Service Bus, Event Grid), control-plane SDKs, and much more. So when you work on something like this, what do you think about? The first thing is: what do your users want to know? What do you trace? Here I'm mostly talking about tracing, maybe a bit about logs and events on spans, but I won't cover metrics yet. So let's talk about tracing. What would your users want to know about the library? The first place to start is the complex API calls that your library has: the public API, which internally does some communication with underlying services.
Users would want to know about the outcome and the duration, and also some domain info. If your library is a database client, you'd probably want to include the database name, the table or collection name, maybe the statement, and other things. This is a good place to start. One of the benefits of these public API spans is that they aren't magical: users know exactly when they happen, they can connect each span to the code they wrote and understand what it describes, and that's beneficial. The other layer of instrumentation is outgoing network calls. In libraries you usually don't want to do this yourself, because it's covered by auto-instrumentation. The one thing you should be careful about is that your public API span should be correlated with, should be the parent of, the underlying network call, so users can see the relationship. Having those two basic layers helps users understand the flow without, say, asking for your support. They can see: okay, this is how the library communicated with the service, and maybe there were five retries, or a long call to the service. So they come to you less as a library owner, they can figure out more issues on their own, and when they do create an issue, they'll give you more context. This helps users with observability, and you as well. Note that what you're creating here is mostly a solution for users, even though you benefit from it; it's not the library's internal observability solution.

Okay, so the important part is semantic conventions. The first thing you do is check whether OpenTelemetry has a solution for you: which spans to collect, what structure they should have, the relationships between them, which attributes to put on spans. We have conventions for HTTP, gRPC, databases, messaging, some infrastructure like Kubernetes and FaaS, and more. Semantic conventions are absolutely critical for the user experience.
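As a sketch, here is what a database client's public-API span might carry. The tracer here is a toy, not the OpenTelemetry API; only the attribute keys (`db.system`, `db.name`, `db.statement`) follow OpenTelemetry's database semantic conventions, and the library function and values are made up for illustration.

```python
# Toy span to show which attributes a database client's public-API
# span might carry; only the attribute key names follow OpenTelemetry's
# database semantic conventions.

class Span:
    def __init__(self, name):
        self.name = name
        self.attributes = {}
        self.status = "unset"
    def set_attribute(self, key, value):
        self.attributes[key] = value
    def end(self, status="ok"):
        self.status = status

def query_items(statement):
    """A hypothetical library's public API call, wrapped in a span."""
    span = Span("query_items")
    # Domain info users care about, named per the conventions:
    span.set_attribute("db.system", "cosmosdb")
    span.set_attribute("db.name", "inventory")     # database name
    span.set_attribute("db.statement", statement)  # the query itself
    try:
        result = ["item1", "item2"]  # stand-in for the real service call
        span.end("ok")               # users also want the outcome...
        return result, span
    except Exception:
        span.end("error")            # ...including failures
        raise

result, span = query_items("SELECT * FROM c")
print(span.attributes["db.statement"])  # SELECT * FROM c
```

Because backends key their queries and visualizations off these exact attribute names, using the convention's names rather than inventing your own is what makes the data useful.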
Users can, say, query their traces by HTTP status code, and for all HTTP clients, hopefully in the world, in all languages, it's the same key used everywhere. All the nice visualizations you might have seen are built on top of semantic conventions: topology maps, aggregations, those beautiful Gantt charts, the cascade views of traces. They all depend on libraries following the semantic conventions, and going forward they can bring even more value.

The core piece of distributed tracing is context propagation. Libraries participate in it by propagating context from the wire, within the application, and back onto the wire on the way to the next service. Let's talk about what that means. If your library is a web framework, or maybe a messaging consumer, you should expect context to come from the wire. You either get an HTTP request, or maybe a batch of messages, and you create a child span. (Well, if you got a batch of messages with different contexts, you can link those contexts to the new span you create.) The important part is that you take this span and pass it along to the user code, and you do it through the OpenTelemetry APIs as implicit context propagation. You don't invoke something in user code to hand them the context; it just flows. It depends on the language: in Go you have an explicit context, and in other languages this implicit thing works better or worse, but that is not your concern. You don't invent a new way to implicitly pass context; you use the OpenTelemetry API, and it does its best effort to propagate this context within the process to everything else. If you can offer an additional explicit way, it's nice to have for your advanced users who maybe do some background processing, create threads, or do something else that generally doesn't work well with implicit context propagation.
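The context that "comes from the wire" is usually a W3C Trace Context `traceparent` header: a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and flags, dash-separated. A minimal parser sketch follows; real parsing lives in the OpenTelemetry propagators, and this toy skips several edge cases the spec requires (version handling, character validation).

```python
# Minimal sketch of parsing a W3C Trace Context "traceparent" header.
# Format: <version>-<trace-id>-<parent-id>-<trace-flags>
# The real OpenTelemetry propagators handle more edge cases than this.

def parse_traceparent(header):
    parts = header.split("-")
    if len(parts) != 4:
        return None
    version, trace_id, parent_id, flags = parts
    # trace-id is 16 bytes (32 hex chars), parent-id 8 bytes (16 hex chars)
    if len(trace_id) != 32 or len(parent_id) != 16:
        return None
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid per the spec
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": int(flags, 16) & 0x01 == 1,  # bit 0 is the sampled flag
    }

# Example header from the W3C Trace Context specification:
ctx = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
print(ctx["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
print(ctx["sampled"])   # True
```

The span you then start for the incoming request uses `trace_id` and takes `parent_id` as its parent, which is exactly the "create a child span from the wire context" step described above.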
So implicit is super critical, explicit is nice to have, and it depends on the language, of course. Then the client libraries take this ambient context from OpenTelemetry, or usually they don't even need to think about it; it just happens underneath the OpenTelemetry APIs. They create their child spans and pass the context along. For example, if your library writes some logs, for your own observability or maybe for users, then if you make this context current (we call it current, or implicit), your logs can get the context stamped on them. So you can use the same mechanism for your own observability and to improve your logs. Then the network clients, which happen underneath and which you don't necessarily instrument yourself, create child spans from your library's spans, because you made the context implicit, and they pass the context over the wire to the next service. There can be other layers of instrumentation too: somebody may want to trace DNS calls or TLS, so making even this low-level context current or implicit also makes total sense. And again: the implicit part is super important, but if you can also have an explicit path and take context from users, go for it. It's critical for some of your advanced users.

Okay, so let's talk dependencies. Ted gave you a great introduction to why it's not such a big deal, but I know you're concerned; I'm concerned too. You take a dependency on the OpenTelemetry API. It's a tiny package, it is stable, it follows SemVer 2.0, and it has long-term support guarantees. OpenTelemetry takes API stability seriously. But semantic conventions are experimental. They're not an API we expose; they're more like a soft contract on the telemetry: these attribute names, which attributes to put where. They may, and likely will, change. In some cases maybe even the span structure or the relationships will change. So be mindful of that. At the same time, it doesn't break your runtime; it breaks some user expectations on the backend.
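In Python, this kind of ambient "current" context can be modeled with `contextvars`, which is also the mechanism the OpenTelemetry Python implementation builds on. The sketch below is a toy, not the real API; the names are illustrative.

```python
# Toy model of implicit ("current") context propagation using
# contextvars. Names here are illustrative, not the OpenTelemetry API.
import contextvars

_current_span = contextvars.ContextVar("current_span", default=None)

class span_in_context:
    """Make a span 'current' for everything called inside the block."""
    def __init__(self, name):
        self.name = name
    def __enter__(self):
        self._token = _current_span.set(self.name)
        return self.name
    def __exit__(self, *exc):
        _current_span.reset(self._token)

def log(message):
    """A library log call: the ambient context is stamped on for free."""
    return f"[span={_current_span.get()}] {message}"

def network_layer():
    """A lower layer: it sees the ambient span without being handed it."""
    return _current_span.get()

with span_in_context("PublicApiCall"):
    print(log("sending request"))  # [span=PublicApiCall] sending request
    parent = network_layer()       # an HTTP client would create its child
                                   # span under this ambient parent
print(network_layer())             # None: nothing is current outside
```

Because `contextvars` flows through `async` tasks automatically, this is also why "it just flows" to user code without the library invoking anything in it; threads and background work are the cases where the explicit path earns its keep.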
Okay, so to mitigate some of these concerns, you can make your instrumentation opt-in, at least initially. Expect that it will take a few iterations before you figure it out, and that you'll want user feedback: maybe you over-instrumented and they don't need it, it's too costly; maybe a certain operation is too noisy; or maybe you under-instrumented. So give it some time to mature and gather feedback. Eventually, we want instrumentation to just light up: you have your library, you have the OpenTelemetry SDK in the user application, and that's all users need. With an opt-in mode, or the plugin mode that we usually have, it's easier to keep these dependency issues hidden, and you can start with that. But eventually you want to have it all in your library, so you don't need to document it and you don't need to make these plugins discoverable. Plugins usually break just because people forget to install them, or because maintainers don't even test with them. Still, this is a necessary stage for your first few iterations.

Okay, then you probably think about performance as well. The OpenTelemetry API without a configured SDK does nothing; it's a tiny package, and it does not affect performance. When the OpenTelemetry SDK is configured, there is usually some sampling, and that reduces the performance impact significantly. Beyond that, OpenTelemetry is non-blocking and consumes limited resources, but some overhead is expected. We work hard in OpenTelemetry to minimize this impact, but you should expect some, and so far users can reason about it because of the value they get. By all means benchmark your library, and if you find anything, I'm sure all the OpenTelemetry language SDKs will be happy to help with optimizations. One thing you can do on your side is: don't trace verbose operations.
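Sampling is a big part of why the configured-SDK overhead stays manageable. A head-sampling decision can be as simple as comparing part of the trace ID against a threshold, which is roughly the idea behind OpenTelemetry's trace-ID-ratio sampler; this is a simplified sketch, not the real implementation, and the 64-bit slice it compares is an assumption of the demo.

```python
# Simplified sketch of trace-ID-ratio head sampling: keep a fixed
# fraction of traces, decided deterministically from the trace ID so
# every service in the trace can make the same choice. The real
# OpenTelemetry TraceIdRatioBased sampler is more careful than this.
import random

TRACE_ID_MAX = 2**64  # we compare against the lower 8 bytes of the ID

def make_sampler(ratio):
    bound = int(ratio * TRACE_ID_MAX)
    def should_sample(trace_id_hex):
        lower = int(trace_id_hex[16:], 16)  # lower 64 bits of 128-bit ID
        return lower < bound
    return should_sample

sampler = make_sampler(0.25)  # keep ~25% of traces

rng = random.Random(42)  # fake random trace IDs for the demo
kept = sum(
    sampler(f"{rng.getrandbits(128):032x}") for _ in range(10_000)
)
print(kept)  # close to 2500
```

Deterministic ID-based sampling matters because it keeps traces whole: either every span in a trace is kept or none is, rather than each process flipping its own coin.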
If something is verbose, use logs; if something is for your own purposes, use logs. Users can tweak the log level to get more or less of it. And if you want to share something with users but it doesn't warrant a span, you don't have to create a span: add an event on an existing span. A related thing: be mindful of user costs. Users pay a lot for telemetry: for ingestion, for storage, for processing, for everything, whether they host it themselves or use some observability backend. If you're not sure the new thing will solve a problem for the majority of your users, maybe don't instrument it, or put it behind a feature flag until you learn more.

Okay, now you know everything there is to know. A quick recap. Your core functionality should work regardless of anything that happens in the tracing world; even if it's not traced, at least your library works. Dependency hell is real; you can ship instrumentation in a plugin, or opt-in, until it matures. Semantic conventions are absolutely critical for UX: please, please, please follow them, and if you don't find one, help us develop it. Context propagation is even more critical than semantic conventions; I don't know how to emphasize it more. Propagate your context implicitly and explicitly. And when you test your instrumentation, do test it: try it out in a semi-typical user application. Pick a popular web framework, enable all the other instrumentations there are, and make sure your library's telemetry correlates well with the incoming HTTP request, the underlying network layers, things like that. And come join us in the instrumentation SIG. Help us polish the last details of instrumentation, help us develop new semantic conventions for popular technologies, and share your feedback and ideas. Thank you. That's our talk.