But before that, I just wanted to let you all know that you can see these papers that we have here, taped to the door and to the shop board, about an after party that we'll be having. I don't know if the venue can fit this many people, but we can definitely all get together and go out for some beers afterwards. So yeah, that will be sort of the location, around there. And now we'll start with the talk by Greg Mefford. Greg has been a member of the BEAM community, specifically more involved with Elixir and with Nerves. He's a member of the EEF Observability Working Group, and he's also been heavily involved with the effort around distributed tracing within the community. And today he's going to present about OpenTelemetry. So yeah, give it up for Greg.

All right, thank you. Before I dive into specifically what OpenTelemetry is and how it's a success story, I want to talk about the problem that OpenTelemetry tries to solve, which is observability.

You may have seen one of these instructions before, where you have to push the button and then you receive the bacon when you're in the restroom. So if you push the button and you don't receive bacon, you're going to be heavily disappointed, right? So let's science this situation. What can we observe about it? First of all: no bacon. Very disappointing. One out of five stars. There is warm air coming out, though. Is this supposed to cook the bacon? That doesn't seem very safe. There's also lots of noise. This machine is making a lot of noise. Is it broken, or is it supposed to do that? I don't really know how this thing works. Okay, the noise stopped. Definitely broken. No bacon came out. So if I engage my silver-lining machine, I can say, well, at least it responded to inputs, right? I pressed the button and something happened. There was a lot of correlation there, so it was probably because I pressed the button.

So when we talk about observability, your vendors will often tell you that the pillars of observability are logs, metrics, and distributed traces. I actually like to think about it more as different aspects on a cube, looking into your system from different dimensions. Like the faces on a cube, the aspects of observability are related along one axis and different along another. If we consider distributed tracing, for example, you can learn a lot about what your system is doing in a particular request. You can see if there was some kind of an alert or an error that was raised, and maybe how long it took. And then if you aggregate those over time for different requests, you can learn a lot about the metrics for a period of time. And if you also have logs, then maybe you can learn something from an auditing perspective, like "bacon jam detected", something like that. Each aspect tells kind of a different story about what your system is doing inside.

So traces tell a story. They tell a story about particular requests. They're usually categorized with tag metadata, and what other data you attach about what happened in that request is up to the user. They're usually distributed over all of your different services, so you can see what happened in a microservices architecture, a monolithic architecture, or maybe even across different vendor middlewares. They're usually sampled, because your vendor usually charges you based on how many you're sending.
So you want to sample those down to some useful percentage, not millions of traces per hour, because you'll never look at all of those anyway.

Metrics tell you a different story. They're usually about types of operations, like how many of something happened in a given period of time, and whether they were successes or errors. And they're usually aggregated by a period of time and by some tag about the metric.

Logs tell yet another story. They're usually very fine detail: exactly what happened at a particular point in time. And hopefully, if you're lucky, they're structured so that you can search them and aggregate over different indices. And usually they're collected from all your services into some central, usually vendor-hosted platform, and you pay a lot of money for that.

Yeah, so I want to talk a little bit about distributed tracing theory, because when I'm talking to developers, that seems like the thing people are less familiar with. It all starts with the core concept of distributed tracing, which is a span. A span has a name, a start time, and an end time. It can also have various attributes, and there are some semantic conventions for those, for example for HTTP client-server interactions, databases, and things like that, but in general they can be anything you want. Each span knows what its parent span is, so you can build up a hierarchy of spans and draw a waterfall graph, or flame chart, depending on what you want to call it. And all of those spans know that they belong to the same trace, so there's a trace ID and a parent span ID. The top-level span in the trace doesn't have a parent, so that's the root span, and that's how the collector knows this is the top level of the trace.

We can also model parallelism in these waterfall charts. In this case we have two HTTP client GETs happening in parallel on the left there, and then the results from both of those are aggregated in this "handle responses" section of the processing. And you might do some kind of a database call in there, et cetera.

Where distributed tracing gets really interesting is that inside each of those parallel requests, we can see what the downstream service was doing. In this case the top-level trace is from a service that's colored yellow, and then we're making calls down to a green service that's written in Elixir and Phoenix and a red service that's written in Ruby on Rails. So you can see what each of them is doing inside, even though they're different technologies on different servers. And then when we bring it all together, you can see the entire trace: we had a Phoenix server that was calling into both of these things, and they're color-coded by where the work happened.

So the way that works is that the top-level service sends a span context to the downstream services. That can tell the downstream service whether this trace is being sampled or not, plus some other state metadata. This used to be a vendor-specific free-for-all of incompatible headers, which was awesome. And now there is a W3C standard for context propagation, called Trace Context, which is used by OpenTelemetry. One of the really neat things about distributed tracing is that the upstream service doesn't really need to care what the downstream service is doing. It just needs to send that trace context, or span context.
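To make that propagation a bit more concrete, here's a minimal sketch of what a W3C Trace Context `traceparent` header looks like. The module name is made up for illustration, and the IDs are just the example values from the spec; real tracing libraries generate and parse this header for you.

```elixir
# A minimal sketch of W3C Trace Context propagation between services.
# The "traceparent" header carries a version, a trace ID, the parent
# span ID, and flags (for example, whether the trace is sampled).

defmodule TraceContext do
  # Build the header an upstream service would send on an outgoing call.
  def traceparent(trace_id, span_id, sampled?) do
    flags = if sampled?, do: "01", else: "00"
    "00-#{trace_id}-#{span_id}-#{flags}"
  end
end

TraceContext.traceparent(
  "4bf92f3577b34da6a3ce929d0e0e4736", # 16-byte trace ID, hex-encoded
  "00f067aa0ba902b7",                 # 8-byte parent span ID, hex-encoded
  true
)
# => "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
```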
And similarly, the downstream services don't need to care about what the upstream service is doing. They don't have to return all of their span data back or anything like that. They just need to pass their piece of the trace up to a central collector, and the collector knows that this span is a sub-span of that span, so it can stitch everything back together into a cohesive trace.

Another job of the collector is to decide which traces it's going to keep. We can do that pretty simply with probabilistic head sampling. The idea there is that for each trace that started at the very beginning, that is, it's not a child of an existing trace, you can decide to just flip a coin, or keep 0.1% of traces, or whatever you want to do. You'll notice that for the top one there, we decided early on that we weren't going to sample it, so we only have a skeleton trace of each transition between services. We don't have to keep as much information because we know we're going to throw it away anyway. And then for the other ones, at the beginning we said we're going to keep both of these, but then it's up to the collector to decide how many to actually keep, based on your rate limiting or how much you want to pay your vendor, whatever you want to do. There's also such a thing as tail sampling. In this case we can decide that we only want to keep the traces that had an error thrown during processing, but it could really be anything.

So I want to talk really briefly about some tracing functionality that's built into the BEAM. I want to note that this is not used for distributed tracing, and I'll talk about why in a second. This is Erlang's trace_pattern functionality, and there's also a library called recon_trace that's really useful for interacting with it in a more ergonomic way. The way that works, and this is more of a sketch because it's BEAM internals, is that you tell the BEAM: here are some trace patterns I want to watch for, so any time a function gets called that matches these patterns, I want you to set up a tracer process that will tell me that it happened. There's a star on the tracer process here because there can only be one of those per BEAM, but then any time your application processes make a matching function call, they will sort of magically send a message to that tracer process, and it will send your interactive shell process a message about it.

Some really nice features of this: it's production-safe as well as usable in development, and it's very useful for interactive troubleshooting and debugging of your system. Some downsides: you can only have one tracer per BEAM, and it's only for local tracing, so you can't trace across distribution. The reason we wouldn't use it for distributed tracing is not the local-only part, because we could run it on all the nodes. It's mainly because there's only one per BEAM and it's designed for interactive use, so if you're using that resource in a tracing library, then people can't use it for interactive troubleshooting.

So I want to talk a little bit about the observability superpowers that you get by having this distributed tracing functionality, in addition to the standard logs and metrics that people are more familiar with. The first of those is the single-request flame graph. Here's an example from Datadog APM. This is from an example project that I put together for the Spandex library.
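In the example project, most of the spans come from the Phoenix, Plug, and Ecto integrations, but as a rough sense of what manual instrumentation looks like, here is a hedged sketch of Spandex-style tracing. The module name MyApp.Tracer, the function being traced, and the exact configuration keys are illustrative, and the API details may differ between Spandex versions, so treat this as a sketch rather than the definitive setup.

```elixir
# A rough sketch of manual Spandex instrumentation. Names like MyApp.Tracer
# and the config keys below are illustrative; check the Spandex docs for the
# current API. In a real Phoenix app, the spandex_phoenix, spandex_plug, and
# spandex_ecto integrations create most of these spans automatically.

defmodule MyApp.Tracer do
  use Spandex.Tracer, otp_app: :my_app
end

# In config/config.exs (assuming the Datadog adapter from spandex_datadog):
#
#   config :my_app, MyApp.Tracer,
#     service: :my_app,
#     adapter: SpandexDatadog.Adapter,
#     env: "prod"

defmodule MyApp.ReportFetcher do
  def fetch(id) do
    # Start a child span under whatever trace the Plug integration started,
    # tag it with a resource, do the work, and always finish the span.
    MyApp.Tracer.start_span("fetch_report", resource: "report:#{id}")

    try do
      do_fetch(id)
    after
      MyApp.Tracer.finish_span()
    end
  end

  defp do_fetch(_id), do: {:ok, :report}
end
```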
In this trace you can see a summary of three different services being involved. The first is a Plug-based gateway, so this is a Cowboy and Plug-based server, and then it calls into an Elixir Phoenix back-end service, and the blue at the bottom is making a database call. In this case we've modeled the database as a separate service, since in a distributed-system sense it's different from the application server, and it lets you see at a glance that 27.1% of this time was spent in the database layer.

So, related to being able to see what happened in a particular request, we get stack traces in context. This is an example from an actual production system where there was an exception. You can barely see it, but right here there's a red box around that part of the trace, and yet you can see at the top level that a 200 response is being returned. So we threw an exception, but we handled it and we returned a 200. In that case, if you were just looking at the exceptions in your logs, you might see a bunch of crashes and worry that there's a big problem, but actually we recovered and it wasn't a big deal. So that gives you a lot of context around how you got to the point of that error, and also whether it was a big problem at the top level. If we click into that error span, you can see some more information about what the error was. I redacted the stack trace because it's from a real system, but you would see a stack trace there as well. So that gives you a lot of context as you click through each of those spans and ask: how did I end up at this point? What life choices have led me to this place?

Another great thing that you can spot really quickly when you deploy distributed tracing is N+1 queries, because we all have these things. These ones are pretend, because I made them in my example app; I would never write an N+1 query in production. But down at the bottom there you can see that there are quite a few database calls, probably around 1,500 of them, and if you zoom in you can see that one is being made from the controller and then 1,499 are being made from the view. So you can see that there's probably an N+1 situation going on there, and also that queries are being made from the view at all. These are kind of structural things where, when you deploy distributed tracing, even if you didn't write the code, you can say: that doesn't quite look right.

Another really great thing that you can see visually and structurally is repeated calls to the same downstream service. This is another real trace where you can see that there's a bunch of things happening in parallel, but you can also see that three of these are going to the same service, based on the color, the light blue service there. So there's a chance it would be more efficient to batch those up into one call and get all the data back at once, instead of making three separate calls.

And kind of the reverse of that: sometimes you can actually get lower latency by doing things in parallel instead of batching them up. An example here is if you need to make a call to a downstream service and then use the results from that payload to make a different downstream call, you might be able to reduce your overall latency by doing a quick request to get just the IDs from that first call, and then making the original request and, in parallel, the other downstream call.
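Here's a rough sketch of that parallel pattern in Elixir using Task. The service modules and functions (IdService, ReportService, RecommendationService) are made up for illustration, and note that depending on the tracing library you use, you may need to explicitly propagate the span context into the Task processes so their spans attach to the right trace.

```elixir
# A minimal sketch of overlapping two downstream calls instead of doing
# them strictly one after the other. Service modules here are made up.

defmodule MyApp.ParallelFetch do
  def fetch(user_id) do
    # Quick, cheap call to get just the IDs needed for the second service.
    ids = IdService.fetch_ids(user_id)

    # Kick off both downstream calls concurrently; in the trace waterfall
    # these show up as overlapping spans instead of sequential ones.
    report_task = Task.async(fn -> ReportService.fetch(user_id) end)
    recs_task = Task.async(fn -> RecommendationService.fetch(ids) end)

    {Task.await(report_task), Task.await(recs_task)}
  end
end
```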
So another really great thing you can get out of these graphs is a feel for how much network latency you're dealing with. When you look at this trace, you can see that the teal there is an external request.get, so this is a client module that we've written, and then you can see the downstream processing from the service that's getting called. And there are these gaps in time where you can't really account for what happened between making the call and the downstream service receiving it. There are obviously a lot of other things at play here; this isn't all network latency. But if you had a lot of network latency, that's where it would show up, and it's something you can watch for.

I think the most important superpower you get from distributed tracing is the culture shift, where now, whenever we're troubleshooting a problem, someone will link to a trace in APM, and there's all this context that comes along with that: how does the code work, what's going wrong, what did I expect, what did I not see? All of those things come with a really simple link into a tool that gives you all of that information at once. I work at a company called Bleacher Report, and it's really changed the way that we troubleshoot things in production.

So with all those superpowers, there obviously come some pitfalls, things that you'll probably have to figure out as you're deploying this technology. The first of which is sampling: if you're not using sampling, then you're probably going to send a lot of spans to your vendor or to your open source platform, and either it will be your problem or it will be someone else's problem when the bill comes. So you should think about how many traces you actually need to keep. Incomplete traces happen when you don't have tracing implemented on one of your services, which means any downstream calls don't get attached to the upstream calls. So the distributed part of distributed tracing doesn't happen, and then it's not as useful. Another thing to be aware of is clock skew. In this case, the downstream service apparently anticipated that we were going to call it, and started working on the request ahead of time to save us a little bit of time. That didn't really happen. The idea here is that the servers don't have the same clock, right? Even if you're using NTP, the clocks are not going to be exactly in sync all the time. There's not a whole lot you can do about it other than to know that it's a thing.

So, real quickly, OpenTelemetry and the history of OpenCensus and OpenTracing. OpenTracing was a CNCF project, and this is from their website: it's not a thing that you can download, and it's not really a standard. It's more of an API spec and its various implementations. So basically, people knew that this was a thing they should build, they built some things, and there are some guidelines and pirate code around how it should work. This is where the Spandex library falls, and there's also the otter library in Erlang, and ex_ray wraps around otter in Elixir. Really interesting: the OpenTracing standard is supported by Nginx, where your proxy can actually participate in your spans, which is pretty awesome. I kind of wish more vendor middlewares did that kind of thing.

There's also OpenCensus, which is a competing standard with OpenTracing. It is a single set of libraries that you can download, and it handles both metrics and traces, and they were planning on adding logs in the future.
OpenCensus has a whole bunch of different deployment options. You can have nothing in your app, or you can have something in your app that talks to an external collector, or you can deploy it as an agent in your app or as a sidecar container, whatever you want to do. And those OpenCensus collectors support the tail-based sampling that I was talking about. You can also chain them and output to multiple different vendors or open source internal platforms. But all of that just to say that these two projects merged into a new project called OpenTelemetry, which is also a CNCF project. So yeah, that's the success story: now there are fewer standards than there were to begin with, even though temporarily there were more.

From the OpenTelemetry website: this is, again, a single set of libraries and tools that you can download and use. There are official clients for each language. It currently supports metrics and traces, with logs in the future, just like OpenCensus. On the website they have a status tracker that says how far along each of the implementations is. It's not up to date, so the Erlang one should actually be at 0.2 at this point. But the more exciting thing is that Erlang is on there at all, because it seems like Erlang usually isn't listed as officially supported by anything. So I'm excited that it's there.

I want to briefly mention that I contribute to the Spandex project along with Zach Daniel. We have easy integrations with Phoenix, Plug, and Ecto, so that's where my background comes from here. It implements the OpenTracing standard, but it really only supports Elixir, because it uses a lot of macros, and it only supports Datadog APM. So maybe people here have used it and enjoy it, but it has some limitations. And then opencensus, as I mentioned, obviously implements OpenCensus. It's written in Erlang, with Elixir support, and it has a bunch of back ends. But basically these two projects are being merged, just like the industry projects. We have an opentelemetry-beam GitHub org, and we're working on the official libraries in the open-telemetry org.

So I want to talk a little bit about the Erlang telemetry library, because on the Observability Working Group we were talking about whether we should tell people to directly use this new OpenTelemetry thing, or to use telemetry. And luckily we were at least able to steer people away from calling this new thing OTP, like "open tracing platform", because that would have been really bad. But we've decided that we're hoping that this telemetry system, which is currently for metrics and events, can also be used for distributed tracing with a very simple API, and then down the road maybe you'll need to use OpenTelemetry directly, but hopefully you can get by with just telemetry.

The main selling points for telemetry are that it's simple, it's standard, and it's pretty safe. I mean, it's about as safe as a dependency gets. The way it works is that in a library, you just fire the event; you don't need to register it ahead of time. And then in a receiver, you can say: I would like to receive events like that. Those handlers get called synchronously, which is useful for distributed tracing because you want to be able to catch those start and stop events when they happen. You can also quickly get time-series metrics out of things, because they're just embedded straight in there. So it's simple.
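To make that concrete, here's a minimal sketch of the telemetry flow. The event name [:my_app, :repo, :query], the measurements, and the handler module are made up for illustration; real libraries like Ecto and Phoenix document the events they actually emit.

```elixir
# A minimal sketch of firing and handling a :telemetry event.

defmodule MyApp.TelemetryHandler do
  # Handlers are called synchronously, in the process that fired the event.
  def handle_event([:my_app, :repo, :query], %{duration_ms: duration}, metadata, _config) do
    if duration > 100 do
      IO.puts("slow query (#{duration}ms): #{metadata.query}")
    end
  end
end

# In the receiving application (e.g. at startup): attach a handler by ID.
:telemetry.attach(
  "log-slow-queries",
  [:my_app, :repo, :query],
  &MyApp.TelemetryHandler.handle_event/4,
  nil
)

# In the library: just fire the event; no registration needed ahead of time.
:telemetry.execute(
  [:my_app, :repo, :query],
  %{duration_ms: 42},
  %{query: "SELECT * FROM users"}
)
```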
And the way that we keep it safe is that you register different handlers, and if one of them crashes, it gets removed from the table and it doesn't get called again. So one strike and you're out. So it's relatively safe. And it's pretty standard: it's already included in Plug, Phoenix, and Ecto.

So, the takeaways are: if you're a library author, you should instrument your library with telemetry. If you're an app developer, you should integrate with telemetry. And you should learn more about OpenTelemetry as it matures. That's all I got.