Well, welcome everyone to this session on living the XKCD 927 dream with OpenTelemetry, by Greg Mefford. Thank you so much for joining us, Greg, and without any further ado, over to you.

All right, great. Thank you. So welcome to this talk about OpenTelemetry, here this morning, my time. Today I want to get you excited about embarking on a journey to start using OpenTelemetry in your organization. It's a really big deal that, as a technology community, OpenTelemetry has managed to unite such a broad section of the software observability space under a common banner. And it's great for everyone involved: library authors, practitioners, and even vendors.

Maybe some people out there don't even know what OpenTelemetry is. Maybe others have already tried it but want to learn more. Don't worry, we're going to embark on this journey together from the beginning and take some scenic stops along the way until we reach the end. My wife and I love to travel with our three kids. We believe that if we can get them out there to see the world, they'll have much more global awareness and more empathy as they grow up. I've included some pictures from trips we've taken to visit the western part of the United States, and even if I can't inspire you about OpenTelemetry, maybe I can at least inspire you to come visit if you get the opportunity.

Before we can really talk about OpenTelemetry, though, we need to talk for a minute about the problem we're trying to solve. As we build increasingly complex software systems, we have an increasing need to know what's going on inside live production systems. Observability is defined as the measure of how well a system's internal state can be inferred from knowledge of its external outputs. It's a concept that has its roots in control theory, a fascinating topic that was originally applied to physical systems and sensors but has more recently found a lot of application in computer systems.

I work at a company called Stord, where we build software to power what we call the cloud supply chain. As you might expect from a modern cloud-native startup, we have microservices and Kubernetes and Istio and all the things. It's as complex as the marketing graphics on our website make it seem. In a complex system, we need sensors to tell us how things are working, so that we can have insight into what the problem is when something goes wrong. Maybe you have some basic monitoring set up, like server CPU and memory consumption, or maybe you don't even have that and you're just rebooting things when they start to slow down.

So what kinds of superpowers can better observability give you? The most obvious superpower is visibility into exactly how your complex distributed systems work, in as much detail as you or your library authors care to instrument. For example, this is a trace being visualized in Datadog APM from a production system that I used to work on. Using some basic instrumentation, we can see that there are three services involved, represented by the green, blue, and orange colors. The green service is performing two tasks in parallel, modeled by the purple color. One of the parallel tasks is making a call to the blue service followed by a call to the orange service. The other task makes a call to the blue service, which seems to have errored, and then it retries successfully.
We assume, based on the structure, that after both parallel tasks completed, the overall result was returned by the green service. If I hadn't redacted the details, you could have inferred even more about what operations it was performing. You can also easily see how long each operation took in relation to the others, as well as the leverage we got by using concurrency, compared to just a status code or some server logs. This is pretty amazing.

The next superpower I want to talk about is closely related. When something crashes in your application, maybe you have some kind of exception management tool that will let you know about it. Most likely you get a stack trace to show where it happened in the code. With this superpower, you can now see the exception handling within the context of the overall distributed system. You can see that there was an error, even though the actual response from the endpoint was successful, and it was apparently mitigated by a retry from the calling service. That's all stuff that would be hard to see in just a stack trace from one service. If we click into the error details, we can see the error message and the stack trace. But we can also easily click into any of the other related details to see how we got to that point. For example, we could check the endpoint being called or the parameters being passed.

One of my favorite things to do when we first set up observability tooling on an existing system is just to browse the traces and look for pretty obvious low-hanging fruit. I'm sure none of you would ever do this, but people are still out there writing code that results in N+1 queries to the database. Can you see it? It's all those little tiny blue things down at the bottom. If we zoom in, we can see that the blue bars at the bottom of the graph are actually tiny individual queries. I contrived this example on purpose, but it's definitely something I've seen in the wild as well. Y'all, it's 2022, and there's something even better than that. In our cloud-native world of microservices that nobody actually understands, it's not even that uncommon: it's N+1 API calls. Just because you've built a bulk fetch API in your microservice does not prevent people from calling it to fetch one thing at a time in a loop.

Another thing that can quickly jump out from a color-coded visualization is multiple calls to the same service, which can indicate the potential for combining the calls. For example, here we have several requests in parallel to the same downstream service. Unless this is done intentionally to chunk a large request into smaller ones, it may be more efficient to combine these into a single call. That could potentially decrease load on the downstream service as well as decrease the overall latency of the top-level request. Now, flipping that idea around, it's also worth considering whether we could get better latency by not combining calls to the same service. For example, imagine you need to fetch a large data structure from one service and then use some embedded IDs to fetch more data from a different service. It might be faster to just query the IDs initially and then make two concurrent calls to fetch all the details from each service. This one is kind of obvious when looking at a trace, but much more difficult to see when you're looking at logs or metrics. In this trace, you can see that there are gaps in time between when the client made the call and the downstream service processing the call.
And if the green section represents the HTTP call being made on the client side, and the blue and orange sections are the downstream services, then these gaps can most likely be attributed to network delays. They could also include time spent in untraced sections of your web server or proxy infrastructure, so keep that in mind as well.

And the final superpower I want to mention is the culture shift that becomes possible when you have this kind of tooling deployed. Rather than troubleshooting based on past experience or gut feelings, our engineering teams have a common baseline to start from, based on the traces of what really happened. That ends up resulting in much more precise communication about what the problem is, by linking to an example trace that can be explored. It's a very powerful way to communicate a lot of context around a problem without having to spend time explaining it.

Now that you've seen some of what's possible once you have observability, let's get more into the details of what OpenTelemetry is and how you can use it. Yes, this is a picture of 30 Raspberry Pi camera starter kits. Isn't that typically what you take when you travel? Let's dive more into what OpenTelemetry is, along with a brief history of how we've arrived at today.

Back in the ancient before times, there was no law in the land. Everyone did what was right in their own eyes. It was a time of Nagios, SNMP, and incompatible vendor solutions. Many practitioners tried to make ends meet, building lots of different tools and scripts. Ultimately, OpenCensus and OpenTracing emerged as the top contenders in the space of open-source observability tooling. Both had achieved some amount of adoption, but they were still not quite compatible, though very similar. In early 2019, it was announced that these two projects would be superseded by the OpenTelemetry project, and work began in earnest. By early 2021, the OpenTelemetry tracing specification was christened with a 1.0 version and declared stable. But what does it mean for the tracing specification to reach stable status?

Well, OpenTelemetry is architected as a set of independent signals built on top of a platform that provides the mechanisms to propagate and collect the data from those signals. Signals work as cross-cutting concerns that need to be considered at all the software layers, from libraries and frameworks, to the server applications built on top, and through to the clients of those services. For each signal, there are higher-level concepts that need to be defined, like the semantic conventions that should be encoded in the data to support standard tooling, and developer-facing APIs for manipulating that data. There are also lower-level implementation details in the SDK to actually collect, process, and store the data for later use. And finally, there are community contributions related to each signal to integrate various libraries and frameworks in a consistent way.

Along with all that structure and long-term thinking come some great supportability promises. For APIs, the project promises to provide support for stable APIs for three years, so the instrumentation you write today should continue to work without needing frequent maintenance to keep things current. Similarly, plug-in authors and application owners contributing to the OTel ecosystem are guaranteed one year of support on those APIs.
In addition to the stable APIs for instrumentation, OTel also specifies the OpenTelemetry Protocol, or OTLP, which describes a common transport mechanism for all the data through the stack. It's an efficient binary protocol leveraging protocol buffers over either gRPC or HTTP. To top it all off, OTel provides what it calls semantic conventions, which enable consistent tooling for visualization and analysis. For example, the semantic conventions for HTTP explain when and how to set the error status on a span specific to HTTP operations, so that different HTTP client and server libraries across different languages can all describe the same operations consistently. The semantic conventions also explain what names and formats to use for common attributes for various kinds of observability data, for example, how different parts of the overall URL should be represented, so that tooling can be developed that leverages the data for aggregation, searching, filtering, et cetera.

So now that we've talked a bit about what OpenTelemetry is, I'd like to dig into what each of the signals is a little bit more. Logs, metrics, and distributed traces have historically been referred to as the pillars of observability. Rather than pillars, I like to see these as three different views of the same system, like the faces of a cube. Each aspect touches the others, but also has orthogonal concerns. For example, distributed traces can tell us a lot about what's happening in a sampling of specific requests. But each trace also represents many operational data points that can take on new meaning when aggregated across many requests and visualized as a time series or a histogram. Traces can also tell us about key operations that happened, or where in the process errors have occurred. Those facts are things you might want to collect in logs to allow searching and auditing later.

Each aspect tells a different story about what the system is doing inside. Traces tell us a story about the details of particular requests across services, containing more detail than would be typical in logs. They can have attributes that include both low-cardinality tags, like metrics have, and high-cardinality structured data, like logs. To make that practical in production, often only a small fraction of traces are sampled. Metrics also tell a story, but not about individual things happening. They talk about types of operations rather than particular instances. They describe the system in a particular time window and often have tags to break down the useful context around the metrics. Logs tell a different story about your code, usually in fine detail. Ideally, logs have a structured format with regular fields to make them easier to index and search. Usually, they're retained for a fixed duration based on simple rules like severity or the environments where they were generated.

I don't have much time to go into all the details about metrics and logs, but I feel like distributed tracing is less familiar to a lot of the developers I've talked to. So I want to briefly cover the theory and concepts, so that it will make sense when we actually go to use OpenTelemetry. At the core of distributed tracing is the concept of a span. Spans have a name and a span of time during which that activity took place, which is different from logs, which are generally treated as point-in-time events. Spans can also have metadata attributes attached. There are some semantic conventions defined for common use cases, but in general, they could be anything.
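To give you a flavor of what those attributes look like in Elixir, here's a minimal sketch. The span name and the custom attribute are made up for illustration, and the HTTP attributes shown are the kind of semantic-convention attributes that instrumentation libraries would normally set for you.

```elixir
# Requires the opentelemetry_api package; OpenTelemetry.Tracer is the Elixir wrapper.
require OpenTelemetry.Tracer, as: Tracer

Tracer.with_span "checkout" do
  Tracer.set_attributes(%{
    # Semantic-convention attributes for HTTP (normally added by the
    # Phoenix / HTTP-client instrumentation libraries rather than by hand):
    "http.method" => "GET",
    "http.route" => "/checkout",
    "http.status_code" => 200,
    # A free-form custom attribute specific to your own domain:
    "checkout.cart_size" => 3
  })
end
```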
Spans can also have events attached, which are similar to a span in that they have a name and a set of attributes, but they represent something that happened at a point in time rather than over a span of time. Each span also has a status, which is typically either unset or set to error by instrumentation, according to the various semantic conventions. This is very useful for identifying and visualizing interesting spans as they're collected. Spans also know about their parent span, which allows a hierarchy to be created so that we can visualize the execution of a program using a waterfall graph like this one. Note that unlike CPU profiling flame graphs, this represents a single actual request rather than an aggregation of where the CPU time is being spent overall in a running system. Each span also knows about the overall trace that it belongs to. Note that there's a top-level span in the trace that has no parent, and we call that the root span.

We can also represent concurrency in these trace visualizations, which tends to be tricky with CPU profiling flame graphs. Here you can see that our Phoenix controller is making two parallel HTTP calls, for example using Task.async, before proceeding with handling the responses, which includes making a database query, et cetera. Where distributed tracing really starts to shine is when we think about what happens within each of those HTTP calls. In this case, we can imagine that the first call is being made to a different Phoenix-based service, and the second one is being made to a Rails-based service. In the context of the overall trace, we can visualize which services were responsible for which part of the request by looking at the colors.

A span context needs to be propagated between services or processes, so that the child spans know which trace they're a part of and what their parent span was. The span context also lets the downstream process know whether the trace is being sampled, and it can propagate other state metadata as well. Until recently, this was a vendor-specific free-for-all of incompatible headers, but luckily there's now a W3C standard, which is used by OpenTelemetry. If we consider the tracing process from the standpoint of the originating service, we pass this trace context downstream in HTTP request headers, but we actually have no idea what the downstream services do with it, and we don't want to bloat their response payloads by requiring a trace dump from them every time. So how do we get a complete trace? Well, each of those downstream services is able to locally set up a tracing context based on the parent trace's ID and the span ID it was called from. Then each service can send its part of the overall trace as if it were the only thing that existed, and the trace collector process can merge them all together, since the top-level span within each downstream service knows what its parent span ID was, even though it doesn't know all the details about that span.

The trace collector is also responsible for sampling which traces to keep. A simple way to do this is via probabilistic head sampling: we choose some percentage of the traces to keep or drop. The sampling decision is propagated to downstream services, so they also know whether to sample in the same way. The collector still has the option to store or drop completed traces depending on its configuration.
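On the BEAM, head sampling is configured in the SDK. Here's a rough sketch of what that looks like; treat the exact option keys as an assumption to check against the docs for the SDK version you're using.

```elixir
# config/config.exs -- keep roughly 10% of root traces; downstream services
# follow the parent's decision via the propagated sampling flag (parent_based).
config :opentelemetry,
  sampler: {:parent_based, %{root: {:trace_id_ratio_based, 0.10}}}
```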
It's also possible to perform so-called tail sampling on completed traces, perhaps based on criteria like the overall latency of the process or an error somewhere in the trace. So far we've mostly been talking about synchronous client-server operations, but OpenTelemetry can also be used to model and visualize asynchronous operations as well. For example, a producer could send a message to multiple consumers, or multiple producers could send messages to a single consumer that operates on a batch. These can be represented using the span kind, which describes the relationship of the span to its parent or child or children. In addition to parent-child relationships between spans, it's also possible to declare causal links between spans. This is particularly useful when you have spans that are part of disconnected traces but you want to be able to track their relationships. For example, a bulk batch-processing operation may happen much later than the process that caused it, but it may be useful to retain a link to the original operations that were related to the batch processing.

So now that we know a little bit about OpenTelemetry and observability in general, you should consider the local law before proceeding. No, I'm not talking about GDPR, though you may also want to think about that. I'm talking about Conway's law. Depending on how your organization is set up, maybe you're on a team that owns a certain section of the code base or certain microservices that you could focus on instrumenting with OpenTelemetry, and then expand from there as you're able to demonstrate value. Before setting out, it's also a great idea to take stock of what existing observability tooling you already have in place, so that you can make sure you don't unintentionally break some tooling that someone else is depending on while you're experimenting with this new thing. And finally, consider what kind of organizational support you need. Do you need to get permission for yourself or your team to spend some time on this as an innovation project? Are you going to need help from IT operations, security, or SRE to get the data sent to an internal or external tool for collection and visualization? Are there other engineers in the org who would also be excited to give this a try with you?

So, equipped with all this great information from this presentation you saw on the internet, you're ready to actually give it a try for yourself. Like we were just talking about, every journey is better when you bring along a friend or a guide who's already been there. So far, I've been trying to keep the presentation more generically about OpenTelemetry and observability as concepts in the broader technology community, but I'm personally focused on the BEAM community and primarily on the Elixir language, so some of the upcoming content will be specific to that. But don't give up if you're joining from another language community. One of the great things about the OpenTelemetry specification is that it tries to keep all the concepts and terms consistent across all the language implementations, so most of the content will apply to you as well.

We touched briefly on this earlier, but I want to dig a little bit more into what it means that OTel provides both an API and an SDK as separate packages. OTel describes the API as the interfaces for instrumentation.
These are just the abstract things that you can call, but unless you also include an SDK, the implementation of each of those API functions is required to be a minimal stub that doesn't actually do anything. When you include an SDK, whether the official one maintained by the OTel project for your language or a third-party vendor implementation, the no-op API calls will actually begin to do what's required to collect and forward the data for that signal. The key thing to understand is that if you're writing a library or a framework, you should feel comfortable depending directly on the OTel API package as your means of providing telemetry about your code. If the application developer using your library does not intend to use OTel, then they won't include the SDK, and thus they won't incur any significant performance overhead, since the API implementations will be no-ops.

The API and SDK implementations for each of the officially supported languages are maintained in the OpenTelemetry org on GitHub. There are several directories within the OpenTelemetry Erlang Git repository that maintain the source for each of the packages, and from each of these directories, a package is pushed to the Hex repository for developers to use. In the BEAM ecosystem, the API is implemented in Erlang so that it's easy for any BEAM language to use the same implementation. For Elixir specifically, there's also a thin wrapper around the API to make it feel more like Elixir. The Elixir wrapper is truly very thin, mostly just providing Elixir-style modules that delegate to the ones implemented in Erlang. But it also provides some conveniences that Elixir developers will appreciate, like macros to easily create a span around a block of code. That way, whether you're using Erlang or Elixir, you get first-class support for your language without the overhead of maintaining two complete implementations.

Before we really dive into how to set up OpenTelemetry, we need to briefly talk about an unfortunately confusing naming collision. The BEAM Telemetry project was started approximately in parallel with the OpenTelemetry project, but with much more focused goals. The Telemetry library essentially gives you a simple, standard, and safe means to instrument your library, framework, or application code in a way that others can easily consume. It's maintained by members of the Erlang Ecosystem Foundation Observability Working Group, with input from the maintainers of many popular libraries as well as Elixir itself. The way it works is that your library code calls an execute function, describing what kind of event you want to emit, along with measurements and metadata about it. Someone else can then attach to the events that they're interested in, specifying a handler function. Then, when the event fires at runtime, the handler function is called and can do whatever it needs to do: maybe log something, send metrics to a StatsD server, or start or finish a tracing span. It's important to note that these handlers are called synchronously from the library code, in order to give maximum flexibility for various use cases. Thus, it's important for telemetry handlers to be lightweight and to somehow defer any heavier processing to another BEAM process, in order to not block the actual operation from proceeding.
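Here's a minimal sketch of that attach-and-execute flow. The event name, handler ID, and measurements are made up for illustration, and real instrumentation libraries attach named functions rather than anonymous ones, but the shape is the same.

```elixir
# An application (or an instrumentation package) attaches a handler up front:
:ok =
  :telemetry.attach(
    "my-query-logger",                # unique handler id (hypothetical)
    [:my_lib, :query, :stop],         # event name, as a list of atoms
    fn _event, measurements, metadata, _config ->
      # Runs synchronously in the caller's process, so keep it lightweight.
      IO.inspect({measurements, metadata}, label: "query finished")
    end,
    nil
  )

# Later, the library emits the event with measurements and metadata attached:
:telemetry.execute([:my_lib, :query, :stop], %{duration: 1_200}, %{source: "users"})
```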
These events can do anything you want, but conventionally libraries will emit start, stop, and exception events for their key internal operations, so that people can create consistent instrumentation libraries that consume those events for things like OpenTelemetry spans. We can also very easily consume metrics that have been prepared for us in advance by the library. So it's a relatively simple synchronous dispatch API. Under the hood, the way it works is that there's an ETS table that keeps track of the handlers that have been registered for each type of event. They're called synchronously, one after another, when the library calls the execute function. So if handler one works but handler two crashes, handler two is automatically removed from the ETS table, and it's not called when the execute function is called again in the future. So just like baseball, it's one strike and you're out. I think I got that right. So it's relatively safe. I say relatively here because you still do have to be careful not to do crazy things in your telemetry handlers. There's already quite a large number of libraries that emit telemetry in this way, including most of the popular ones that you're likely using today. So at this point, it's essentially the standard that most Elixir and Erlang developers are going to expect from their libraries.

Okay, but seriously, that's enough talking. Let's do some walking. How do we actually get some stuff instrumented in an app? The obvious place to go as you're getting started is the official getting started guide for Erlang and Elixir on the OTel website. It will walk you through some of the basics that we're covering in this talk today, so it'll be a good refresher for later. But for now, let's focus on the goal of getting an example app set up to send trace information to the console, so that we can understand how to get the basics working. The fastest way to see how that's done is probably to take a look at the example apps in the OTel Erlang contrib repo. If we look at the Phoenix example, for example, we'll see that there are three OTel-related dependencies that have been added. The Phoenix and Ecto ones will hook into the BEAM Telemetry events that those libraries expose, and the exporter will be used to actually collect and forward the OTel data, initially to the console and later to OTLP vendors. In application.ex, we need to tell each of those integrations to connect themselves to the telemetry events at runtime. In particular, in the case of Ecto, we have to tell it what the telemetry prefix will be, because unfortunately it varies the event name instead of just indicating the repo via metadata in the event. Finally, we need to configure the exporter. OTel gives us a lot of flexibility in how you want to process and export the data, so the required configuration is unfortunately somewhat complicated as a result. This is how you tell OTel that you want batches of data sent to the console. Then, when we start up the app and hit localhost:4000 in the browser, we should be able to see lots of OTLP data printed out on the console, like this. The format of the data isn't really intended to be human-readable, but we can definitely see that there are some recognizable things in here, like the span IDs and various attributes that we saw before, related to the semantic conventions for HTTP.
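Pulling those pieces together, the setup looks roughly like this. The app and repo names are hypothetical, the versions are illustrative, and the exporter configuration keys have shifted a bit between SDK versions, so check the getting-started guide against the version you're actually using.

```elixir
# mix.exs -- the OTel-related dependencies from the Phoenix example
defp deps do
  [
    {:opentelemetry, "~> 1.0"},
    {:opentelemetry_exporter, "~> 1.0"},
    {:opentelemetry_phoenix, "~> 1.0"},
    {:opentelemetry_ecto, "~> 1.0"}
    # ...plus the rest of your application's deps
  ]
end

# lib/my_app/application.ex -- wire the integrations up at boot
def start(_type, _args) do
  OpentelemetryPhoenix.setup()
  OpentelemetryEcto.setup([:my_app, :repo])  # Ecto needs its telemetry prefix
  # ...then start the supervision tree as usual
end

# config/config.exs -- batch spans and print them to the console
config :opentelemetry, :processors,
  otel_batch_processor: %{exporter: {:otel_exporter_stdout, []}}
```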
Okay, so that's awesome to confirm that we have the instrumentation set up, but in order to truly show some value, we need to send it to a tool that will help us visualize the data in a more consumable form. We could certainly set up an open-source option like Zipkin or Jaeger, but I also wanted to point out how easy it is these days to set up a free plan with one of the vendors that directly support OTLP. For example, if we want to send to Honeycomb, we just need to tell the exporter to use the Honeycomb OTLP endpoint and include the headers that will authenticate you and send the data to the right place within Honeycomb. And then, when we navigate around in the app, we'll immediately start to see things showing up in Honeycomb. Not bad for just a few minutes of effort. Clicking into an example trace, we can see the relatively simple spans that this demo app is generating for Ecto and Phoenix, along with some familiar attributes we saw earlier in the console. That's pretty awesome. Similar to the way we pointed the exporter at Honeycomb, we can also point it directly at LightStep. One issue here, though, is that LightStep currently has a non-standard API endpoint, so it has to be configured in this slightly more awkward way. My guess is that at some point they'll support the default URL structure and this will no longer be necessary. When we run the app with this configuration instead, we can see data coming into LightStep, and if we click into one of the traces, we can see that all the details are there that we'd expect. It's really cool to me that we didn't have to do anything at all in terms of the application instrumentation, and we can easily switch vendors just via configuration like that.

In the interest of time, I've only shown you how to get a very simple application hooked up with Phoenix and Ecto here, but there are a lot more contributed integrations for other libraries that you might be using in your application. For example, these are the Hex packages that depend on the OpenTelemetry API package. Some, but not all, of these are collected into the official OTel org on GitHub in the contrib repository. Also remember that many other libraries emit telemetry using BEAM Telemetry, which you could potentially integrate on your own using the generic OpenTelemetry-to-Telemetry bridge library.

As you might expect, there are some trip hazards worth mentioning. Trip hazards, because we're on a trip. Yeah, okay. First of all, you should understand how sampling works and figure out how you want to handle it. With many vendors, you have to pay based on the amount of data you collect, and honestly, there are only so many traces that anyone is ever going to look at. And also, the more data you're sending, the more performance overhead your app will end up incurring. The thing to be aware of is that you might go looking for a trace in your tooling and not find it, because it was sampled out. Related to that, you'll need to make sure that you correctly propagate the trace context headers between your services. If you don't pass a trace context, then the downstream service won't know how to connect to that trace and won't know whether the parent trace was sampled or not. When that happens, you might get only bits and pieces of the overall trace, sampled independently. Also, even when you're using NTP to set the clocks on your servers, it's likely that your clocks will drift and be slightly different. There's not much you can do about it, so it's just useful to be aware of it.
For example, here's a trace where we make a call to an external service. And as you can see, the downstream service has predicted ahead of time that we're going to call it, and it started work early. No, no, it didn't do that. There's really not any way to be sure where the downstream request falls in the timeline of the caller, but some visualization tools do use heuristics to make guesses about clock skew between services.

Context propagation is how child spans know about their parent spans and the overall trace that they're a part of. For calls between services, potentially running on different machines, it's pretty obvious that you need to arrange for that context to be sent along, for example in an HTTP header. The trip hazard here is that, particularly when running on the BEAM, each internal BEAM process also acts as a boundary across which you need to make sure the context gets propagated. We'll cover more about that in the next section.

And the last trip hazard I want to warn you about is that your instrumentation will only be as good as the telemetry data it's based on. For example, Phoenix will generate this code by default and insert Plug.Telemetry here in your endpoint module. That means that the starting point of any telemetry data you send will be after any plugs that have already run up to that point, like Plug.Static, for example. And the way it eventually sends its stop event is by using register_before_send, which fires just before sending the response to the caller. This seems fine at first, but if a response is never sent, for example because the connection timed out and the caller disconnected, then unless you do something about that, you will probably never see that request in your metrics and traces, because it never completed and also never fired an exception event.

So now that we've taken the first few steps and we're aware of some of the things to watch out for, we're ready to really travel the road to our destination. The path you take will depend on what your observability needs are, what technologies you're using, and what existing observability tooling you already have in place. For example, maybe you're using a variety of languages to build a complex web of microservices, with a service mesh and various API gateways. OTel can make sense of all that by giving you a standardized view into the services and the API calls between them, as you begin to instrument more and more of them. Or maybe you have the world's largest monolith, and it's such a black box that it's impossible to understand how it works inside. OTel can help you there too, tracing the path taken through the code as it runs in production. Maybe you don't have any synchronous API calls, but you struggle to reason about and visualize the big picture of how your system works, because you're lost in a forest of stream-processing stages without clearly defined dependencies between them. OTel can help propagate the context between those stages, so that you can see the path your data is taking and where it's consuming resources or hitting bottlenecks along the way. The off-the-shelf instrumentation will definitely get you started and give you a framework that you can use to start to understand and maintain your software system, but it can only take you so far. Soon you'll find that you want to manually instrument internal parts of your application.
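To give you a taste before we dig in, here's a rough sketch of the three pieces we'll talk about next: wrapping a block of code in a custom span, injecting the trace context into outgoing HTTP headers, and carrying the context across a Task boundary. All of the module, span, and attribute names here are made up for illustration, and the HTTP client is hypothetical.

```elixir
defmodule MyApp.SeriousBusiness do
  require OpenTelemetry.Tracer, as: Tracer

  # 1. Wrap an interesting operation in a custom span with a custom attribute.
  def do_serious_business(posts) do
    Tracer.with_span "serious_business" do
      Tracer.set_attributes(%{"posts.count" => length(posts)})
      Enum.map(posts, &summarize/1)
    end
  end

  # 2. Inject the current trace context into outgoing HTTP headers so the
  #    downstream service can join the same trace (W3C trace context headers).
  def call_downstream(url, body) do
    headers = :otel_propagator_text_map.inject([{"content-type", "application/json"}])
    MyHttpClient.post(url, body, headers)   # hypothetical HTTP client
  end

  defp summarize(post), do: post
end

defmodule MyApp.TracedTask do
  # 3. A thin replacement for Task.async/await that re-attaches the caller's
  #    OTel context inside the child process, so child spans stay connected.
  def async(fun) do
    ctx = OpenTelemetry.Ctx.get_current()

    Task.async(fn ->
      OpenTelemetry.Ctx.attach(ctx)
      fun.()
    end)
  end

  defdelegate await(task, timeout \\ 5000), to: Task
end
```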
You're in luck, because the official OTel guide for Erlang and Elixir also has an instrumentation section that walks you through how to do this. To show a simple example, let's look at the example Phoenix app we were using before. It has a straightforward controller action for fetching an index of blog posts. If we imagine that it was doing something more complicated that we wanted to see as a span in a trace, we can do that as follows. We just wrap the operations with the Tracer.with_span macro and use Tracer.set_attributes to add additional attributes if we want to, whether they're custom ones or ones defined by the various semantic conventions. And just like that, we can now see both of our custom spans showing up in LightStep. And if we click into the serious-business span, we can see that there's a custom attribute showing up there as well.

I briefly mentioned context propagation before, but now that we've covered more of the details of the OTel APIs in Elixir, let's look more specifically at how to do that. Probably the most obvious place where context propagation is needed is when making HTTP calls to downstream services. All you need to do, for each downstream call where you want your context to be propagated, is inject the context into your headers before making the HTTP call. This will use the standard OTel headers that should be honored by any OTel-compliant implementation across different languages and tooling. Perhaps less obviously, we need context propagation across BEAM processes, so that child processes will correctly connect to their parent, and so that those HTTP headers will be correctly passed downstream if the calls are performed in concurrent tasks. The way I've personally solved this problem several times in the past is just to make my own replacement for the Task module that calls the standard library implementation but wraps it with context propagation. You can see here in the async function that we get the span context from the calling process, then we call the real Task.async implementation, passing in a function that will call the original function, but first attaching to the parent context from inside the child process. Now we can rearrange things so that we start the async task at the beginning and await it at the end. Without our custom Task module, we would end up not seeing the serious-business span at all in our observability tooling. But since we propagated that context to the child process, we're able to see the async task proceeding in parallel with the database call. We can also see that our custom span is properly connected where it should be.

At this point, if you don't see an integration already available, don't be afraid to build one using the BEAM Telemetry events that the library may already be emitting. If you look into how the Phoenix integration works, for example, there really isn't much to it. You just need to attach to the telemetry events according to the library's documentation and pull out whatever attributes you want to capture between your start and stop events. Really, the hardest part will be reading through all the semantic conventions and mapping them to the metadata that may or may not be exposed by your library.

Something else that you might have noticed in some of my screenshots so far is that the service name isn't set, and we also see a warning that we don't have a version attribute set. I've found this pattern to work well for that.
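Here's a sketch of the kind of resource configuration involved; the service name and environment variable name are just for illustration, and the exact nesting of the resource keys may differ between SDK versions.

```elixir
# config/config.exs -- evaluated at compile/build time for a release, so the
# APP_VERSION environment variable only needs to exist when you build the app.
config :opentelemetry, :resource,
  service: %{
    name: "demo_app",
    version: System.get_env("APP_VERSION", "0.0.0-dev")
  }
```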
If we set the version like this, based on an environment variable in config.exs, then all you need to do is arrange for that version to be set at the time when you're compiling your application. Since the version should be based on the source code used to build the app, we should never need to change it at runtime. Most likely, that compile step is whenever you're running the mix release command. For example, if you're building a Docker container from CircleCI, you could pass it in as a build argument like this. And now we can see the service name showing up correctly, as well as the version of the code being deployed.

So it's really cool that we can switch observability vendors just by updating our configuration files. But what if we could switch without redeploying the app at all? What if we could simultaneously send to more than one vendor without imposing any additional load on the service itself? That's what the OTel Collector service can help us do. Let's see how that works. The Collector is composed of three main layers: receivers, processors, and exporters. Receivers allow the Collector to listen for various protocols on various ports, so that you can adapt any existing instrumentation you have today into the OTel ecosystem as you migrate everything toward more native OTel protocols. Exporters, similarly, allow you to plug your standardized OTel data into existing infrastructure as well as new OTel infrastructure. In between, processors allow the data to be filtered, enriched, and otherwise transformed however you need.

The first thing to know about the OTel Collector is that there are a lot of different deployment options and configurations possible. You'll probably want to look through the docs to understand what kind of setup you need for your organization. But to start at zero, this is the current state of what we've seen so far: when you're not using a Collector at all, you point each of your services directly at one vendor at a time, and you don't have much flexibility in what you're sending, other than head-based sampling to control the volume of the data. One step you could take next would be to deploy the OTel Collector as an agent on each of your hosts. I'm talking generally about hosts and containers here: it could be a Kubernetes host, an Amazon ECS host, or just a plain old Linux server without any containers at all. The basic idea is that you're running just one Collector per host, and all the services on that host essentially just send their data via a network connection to localhost. The additional value you get from this configuration is that you can offload some of the processing and configuration from your application containers to the Collector container. It can filter, enrich, and fan out your observability data to multiple vendors. If you're using a more complicated container scheduling system like Kubernetes, you can also choose to deploy the Collector as a sidecar to each application container. This can be an advantage if you want to configure the Collectors differently for different containers, or if you're concerned about accounting for the resource consumption of a service including all of its observability overhead, et cetera. Just like we saw before, these sidecar containers still have all the configuration flexibility to transform and route your data however you want.
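From the application's point of view, whether it's a host agent or a sidecar, the exporter just points at localhost. Here's a sketch, assuming the Collector is listening on the default OTLP/HTTP port; pointing at a vendor instead is the same mechanism, just with their endpoint and authentication headers.

```elixir
# config/config.exs -- send traces to a local Collector over OTLP/HTTP
config :opentelemetry, traces_exporter: :otlp

config :opentelemetry_exporter,
  otlp_protocol: :http_protobuf,
  otlp_endpoint: "http://localhost:4318"
```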
And if you're operating at the scale of hundreds or thousands of containers, or you just want additional flexibility in your configuration, you may want to look into deploying a Collector gateway. It works how you might expect: you target your agents at the gateways, and your gateways at your vendor tools. You can also consider more complex routing rules. For example, maybe you want to temporarily route an unsampled stream related to a particular set of services to an internal tool while you troubleshoot a specific problem, without racking up a huge bill with your third-party vendors.

For this presentation, though, I'm just going to demonstrate a very simple starter configuration like this. We set up an OTLP receiver and a batch processor that collects traces into a batch before forwarding them along. Then we set up OTLP exporters configured for both LightStep and Honeycomb. And finally, we have a pipeline set up to receive and process all the traces and pass them along to both exporters. You can check out the Collector documentation to learn more about what's possible, including collecting host metrics alongside the traces I'm showing here. This is just a relatively simple configuration that does what we want. We also need to remember to reconfigure our service so that it's sending OTel data to our local Collector instead of directly to a vendor. We just need to configure the exporter to send the data to wherever your Collector API is being hosted. Also note that we no longer require runtime environment variables in this configuration, so we can just put this in config.exs instead of runtime.exs.

For folks trying to follow along at home, this is how I'm running the Collector in a container on my laptop. We pass through the relevant environment variables that we used in the configuration and map the OTLP port 4318 to my laptop's network. Finally, we mount the config file into the container and tell the Collector to use it. My laptop has an M1 processor, so that's why I'm running the ARM version of the container here; you may not need to do that. And again, this is all just Docker magic. It's not really relevant to OTel itself, but I'm including it here in case you want to follow along yourself. And finally, after going through all those steps, we can actually see that we get the exact same traces in Honeycomb and LightStep at the same time.

So now that you have some things set up and you're collecting some data, the next step is to demonstrate the value to your team. My son here is playing the role of the business stakeholder who's so exhausted from going to meetings all day that he can't appreciate the amazing visualizations that I'm trying to show him. One important step is to start building out useful visualizations for your team. Unfortunately, I didn't have time to put together a bunch of impressive demos about what each tool is capable of. But the good news is that, regardless of which vendor tooling you're looking at, you're just an email away from having one of their sales engineers show you the product. And then you just have to point your Collector at their ingest API and immediately try it out with your own real data. Think about that for a second: you could try out a new vendor tool in minutes instead of weeks. As we wrap up this presentation, I hope I've been able to inspire you to give OpenTelemetry a try. I also want to encourage you that what you've seen today is really just the surface of what's possible.
If you're a Minecraft player, you'll know that getting to the End dimension is far from the end of the game. Getting your services all integrated with OTel feels a lot like getting your first Elytra or Shulker box: it's when the real fun starts. So now that you've got things all figured out, you can share these new superpowers with your teammates. Maybe you have ideas for new tooling that everyone could use to look at the same data in a new way. Maybe you want to get involved in improving and developing new semantic conventions for a specific technology that interests you, like web proxies, caches, or firewalls. Maybe you want to help build integrations for popular libraries in your language ecosystem. Huge shout-out here to Bryan Naegele, who works at SimpleBet. His contributions to the OTel contrib integrations are the main reason it's so easy to get started today. And I also want to encourage you to get involved in the OTel community in whatever way you can. Any contribution helps, even if it's just writing a blog post about your experience, or asking for clarifications in the Slack channel, or improving the documentation so that everyone else has an easier time. And again, huge shout-out here to Tristan Sloughter, who works at Splunk. Tristan has been consistently leading the charge in the OTel BEAM community and has written the vast majority of the code that powers the core of OTel on the BEAM. Anyway, thanks for watching my talk, and I hope you'll come find me later in the Elixir Slack or the Discord.

Thank you so much, Greg. That was a very, very interesting presentation. Definitely learned a lot.