Good afternoon. Thanks for joining me today. My name's Dave McAllister. I'm an open source technologist for a company called NGINX, which you may have heard of. NGINX open source is probably the most widely used load balancer, reverse proxy, web server, whatever you want to call it. But I'm actually here today to talk about something near and dear to my heart: distributed tracing.

I'm going to start off by pointing out that tracing is a data problem. A lot of data comes into tracing, and because of this, observability is now also a data problem. However, the more observable we make our systems, the faster we can figure out what the heck is actually going wrong with them. Observability is usually made up of three classes of data. Metrics: do I have a problem? That's your alerting structure. Logs: why is the problem happening, or what does the application or infrastructure think is going on? And the new one, distributed tracing: where is the problem occurring?

Tracing has changed the model of how we approach monitoring, from looking for things we knew could go wrong to figuring out things we didn't think could go wrong, or didn't think of in the first place, otherwise known as the unknown unknowns. We can go back and figure out what's happening with those. That gives us visibility into the state of the system itself, lets us do predictive analysis, and reduces the mean time to clue. A faster mean time to clue may get you a faster mean time to response and resolution; tracing is what gets you to the clue.

And we've been doing this for a while. Distributed tracing is not new; the original Dapper work came out in 2010. So why now? Well, all of a sudden this thing called microservices took off. We have lots and lots of architectures built from applications that are as small as necessary, not as small as possible, linked together in loosely coupled forms. And when we add that structure to the cloud, things become a little more challenging. The nice thing about microservices is that we only change the pieces we need to change, any time we want to. Each of these pieces can be owned by a totally separate team, and sometimes those teams don't even talk to each other. The largest microservices structure I've ever seen had over 1,000 unique applications built into it. If you look at the Amazon front page when you're buying something, about 135 independent microservices run each page.

But when we go to microservices, we get all these weird challenges. This thing on the right is called the Cynefin framework. My background is in mathematics, where you learn that you never change two variables at one time, because you can never then figure out which change caused what. With cloud and microservices, we've changed two things at once. Microservices changed the complexity structure: a lot more moving parts, loosely coupled. At the same point in time, moving to the cloud added chaotic behavior, in the form of elastic and ephemeral infrastructure.
So now we don't know what's going on, we don't know where it's going on, and we have to figure out where the exact problem occurs at any given moment. That leads us into distributed tracing.

So what is distributed tracing good for? What you're looking at here is called a RED monitoring grid: rate, errors, duration. It's tracking requests as they go through our system. In this case, and by the way, this is not real time, I'm pulling them together and aggregating them in 10-second clumps so I can see what they're telling me. I'm also using a strip chart to show live verification of the data as it comes in, and then I can see what my request times look like in each of these pieces. This adds a metrics capability to our environment: we now have more metrics from our distributed tracing than we had before. And the unique thing is that it's actually looking at each individual request as it goes through the system. The system isn't accumulating them; I'm aggregating them so I can make some sense of them. This is what gives me the chance to build that mean-time-to-clue structure.

With this, a number of new monitoring structures have come into place. RUM, real user monitoring: what is the experience of a specific user as they transit the system? Synthetics takes that real user monitoring and builds a synthetic user, so we can test ahead of time what's going to happen inside those paths. NPM, network performance; APM, application performance; infrastructure monitoring. All these pieces break down the environment and take advantage of this data, and you can see, broken down by workflow, how the work is progressing through the system. All of it is driven, at least in part, by distributed tracing.

Now, tracing itself is defined by a few concepts, and we're going to start with the span. A span represents a single unit of work, and you get to define what the unit of work is. Your unit of work could be an entire microservice, or pieces inside a microservice, or a complete application put together. The trace is all of the spans together that made up the request from start to finish. If a request does not end, it does not count; it is invalid. Because of this, most tracing systems actually have timeouts for traces, since you can't see a trace until it has concluded. Just getting these numbers, though, is meaningless unless you know what's going on, where it's going on, and how things propagate through the system.

So let's take a quick look at a trace. This view is known as a directed acyclic graph, and it's one of the common ways you'll look at tracing. It has some interesting information. In this particular case, for instance, I can see that my trace came in through an API endpoint, and that the time spent inside that API was 260 milliseconds. It then passed to two different places, each pass took a period of time, and those break down into further pieces. So I can now see the path my request took through the system, I can see how long each piece took in each of the systems, and should something go wrong, I can start looking at where errors could occur. I can also see, over on the far side, what's going on in detail in each of these pieces.
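To make the span and trace concepts concrete, here's a minimal sketch using the OpenTelemetry Python API. The talk doesn't show code, so the tracer and span names here are hypothetical; the point is just that a trace is the tree of spans produced for one request:

```python
from opentelemetry import trace

# Assumes an OpenTelemetry SDK is configured elsewhere in the process;
# without one, these calls are harmless no-ops.
tracer = trace.get_tracer("order-demo")  # hypothetical instrumentation name

# One trace: a parent span for the whole request,
# child spans for each unit of work you chose to define.
with tracer.start_as_current_span("handle-order"):            # the request, start to finish
    with tracer.start_as_current_span("validate-cart"):       # one unit of work
        ...                                                   # business logic here
    with tracer.start_as_current_span("reserve-inventory"):   # a sibling unit of work
        ...
```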
This gives me the information I need about where something's going wrong, so I can now actually track every single request as it goes through the environment. Each of those pieces tells me a particular bit of information. This is one of the two principal ways people look at traces. The other one is a waterfall plot. A waterfall plot presents the same information, but it shows it as a list of how things have progressed from parent to child, and over here it tells you the amount of time spent inside each one. So this trace is made up of this span and this span, which in turn are made up of each of these different spans going across. These two spans, the principal child spans here, add up to almost exactly the duration of the parent span. There's a little bit of slop in the system, because with this loosely coupled communication between nodes, the node clocks are not standardized or stable; the best synchronization you get out of the Network Time Protocol is roughly a hundred microseconds. So it's very easy for a little slop to show up in these numbers. Again, I can see the span names, and I can start seeing what the actual performance looks like for each of these functions. This makes for a very clean and simple way of looking at a trace. However, it still doesn't tell you where things are running, what the underlying infrastructure looks like, or any of those pieces.

And that is something we thought about. Distributed tracing includes the ability to carry things along with a trace; this is called baggage, though you'll hear other terms for it. So you can carry things along the trace that say: here's the container this ran in, here's the Kubernetes node it was running on, here's the type of span we're looking at. Some of these are built into the system as semantic conventions. If you make an HTTP call, you get a series of attributes that come along with it: what type of HTTP call it was, what the response code was, whether you got a 504 or a 404 or a 301, where it came from and where its target was, all built into the distributed tracing system. With these pieces we can look at a trace, break it out, and actually see the underlying infrastructure. If you have ever dealt with a noisy-neighbor problem that's eating memory, you have to have this: the application itself is perfectly fine, and you cannot find what's going on without finding where the noisy neighbor occurs.
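As a rough illustration of baggage and semantic-convention attributes, here's what that looks like with the OpenTelemetry Python API. The key names and values are invented for the example, and in practice auto-instrumentation sets the HTTP attributes for you:

```python
from opentelemetry import baggage, context, trace

tracer = trace.get_tracer("baggage-demo")  # hypothetical name

# Baggage: key/value pairs that ride along with the trace context,
# across service boundaries, until something reads them.
token = context.attach(baggage.set_baggage("k8s.node", "node-7"))  # made-up value
try:
    with tracer.start_as_current_span("fetch-profile") as span:
        # Semantic-convention-style attributes describing an HTTP call.
        span.set_attribute("http.request.method", "GET")
        span.set_attribute("http.response.status_code", 200)
        # Anywhere downstream in this context, the baggage is visible:
        node = baggage.get_baggage("k8s.node")
finally:
    context.detach(token)
```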
So what do we need to do this tracing thing? You need a unique ID, and unique IDs need to be able to be propagated through the system correctly. You need instrumentation, and it's not simple. We'd like a single telemetry framework so we can combine all this data together. We'd like to support all possible languages and all the developer frameworks. And because we're in the cloud, it has to be capable of handling distributed environments right away. Enter this thing called OpenTelemetry. OpenTelemetry is the future of observability and the defining standard for distributed tracing; metrics are mostly there, and logs are coming. Each of these pieces comes into play for those functionalities. But everything starts with a unique ID: span IDs can repeat, trace IDs may not.

Inside OpenTelemetry, we provide three things. The tracer provider is the start point; it actually makes things happen. The tracer creates the spans and propagates the context. And the span is the operation that actually traces the activity that's going on. If you look at something like a service mesh, you get almost all of this for free: you get it from endpoint to endpoint, though you don't get the inside pieces. Currently OpenTelemetry supports almost all languages, but your mileage may vary; some are much more mature than others.

So if you're going to implement this, you have two basic options. One is traffic inspection: I just want to know how things are communicating through the system. Almost all service meshes have this built in from day one; you just propagate and you get to understand what your traces look like. The other is code instrumentation: how you put things inside the code. Code instrumentation can be auto-instrumented (again, your mileage may vary) or manual, and you can do both at the same time. When you focus on code, you're basically going to add a client library dependency. Focus on the service-to-service communication first. Then you can add spans inside your applications to get a closer look at application functionality, and then function-level calls and all those other pieces. You can start simply and build out the complexity from there. What that means in practice is that you instantiate a tracer, create the spans, enhance the spans, and configure your SDK; I'll show a quick sketch of that in a moment. With auto-instrumentation, you just add the Java agent jar file and you've automatically got trace data coming in. You've got a lot of trace data coming in, and most of it is probably meaningless to you, so you can go in and subtract out what you don't need. If you do it manually, it's a little more complex. The two can be mixed, and in our work we tend to mix them: we start with auto-instrumentation, subtract out what we don't want, and add in the new features we do want.

With tracing, performance problems are obvious, and happy users are obvious. These things combined give us context we could not see any other way. When you're doing this, be sure you understand why you want to trace. Start with your service boundaries and repeat as you need to, but be aware of information overload: we are talking about a lot, a lot of data coming in at one time. Distributed tracing gives you views into the application, the user, and the infrastructure. The win is that most third-party applications are now OTel-ready out of the box, and open source projects, if they make any kind of request, need to be OTel-ready as well. Tracing is a great proxy for user happiness: unhappy users are slow users, and they do not come back. But it does not magically solve your problems.
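Here's that manual workflow as a minimal Python sketch, assuming the opentelemetry-sdk package. The service name and span attributes are invented for illustration: configure the SDK, instantiate a tracer, then create and enhance spans.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace import Status, StatusCode

# Configure the SDK: name the service, export finished spans (to stdout here).
provider = TracerProvider(resource=Resource.create({"service.name": "payments"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Instantiate a tracer, then create and enhance spans around the work.
tracer = trace.get_tracer("payments.manual")
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("payment.amount_cents", 4200)  # invented attribute
    span.add_event("card-validated")
    try:
        ...  # call the downstream service here
    except Exception as exc:
        span.record_exception(exc)
        span.set_status(Status(StatusCode.ERROR))
```

The auto-instrumented route skips all of this: you run the unchanged application under an agent (the Java jar mentioned above, or the `opentelemetry-instrument` wrapper for Python) and then prune and extend the spans it emits.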
And so finally, I do want to leave you with a closing thought, simply because I like it. Back in 1979, Brian Kernighan, one of the Unix inventors, wrote: "The most effective debugging tool is still careful thought, coupled with judiciously placed print statements." Distributed traces are our new print statements. They give us insights, but it's still up to us to figure out what's going on and why it matters. And with that, I'd like to thank you for letting me have 15 minutes on stage today. You can find me on LinkedIn, and if you get a chance, I'd love to know your thoughts on where observability, DevOps, and tracing in general are headed. Thanks.