Hello, everybody. It's 4:30, so I figure we can get started. If you're coming in, grab a seat. This is OpenTelemetry 101: instrumenting specifically for traces. Today is part observability 101 and part tracing 101. The most important thing to know up top is that all of these slides are available, and if you don't trust QR codes, there's a Bitly link for you. This workshop will also be available after KubeCon, so if your laptop runs out of power, something goes wrong with the network, or you can't grab Podman in time, you can access this workshop at any time, free of charge, and extend or modify it for your own use case. Take it at your own pace. I'll leave this up for a couple more minutes as people find their seats. Thank you all for coming; it's really lovely to see so much interest in OTel and tracing. It's my favorite telemetry type, and I'm excited to share it with you today.

I have two helpers right over there. If you're working on the workshop and get stuck, raise your hand and one of them or I will come over and help you debug. The way this workshop will go is that I'll present some content, and we've got about five labs, but it's work at your own pace. If you want to skip ahead, or there's stuff you already know, it's really for your benefit and your learning. All right, we've got a couple more people finding their seats. Welcome, welcome.

The second important thing, other than the QR code, is that there are a few prerequisites you may want to start downloading. One is Podman. If you haven't heard of it, it's an alternative to Docker; there are a lot of other use cases for it, but it's what this workshop is set up with, so grab the link to Podman and download it for your system. Another is Python 3, since we'll be working on a Python 3 application. And the third is the sample application repo. So while folks are streaming in, make sure you've got those three set up, and if the Wi-Fi is a little iffy, the first lab is a whole intro to observability, so you can keep retrying in the background.

Okay, let's get going. If you have this loaded on your laptop, I recommend one window with the slide deck and one window with your editor or another browser. Here's how to get around the slide deck: everything in green text is a link you should click, I don't have videos or images, and there are some code snippets in there.

So we're ready to get started. Lab one: the observability primer. We start each lab with a goal, and here it's really just making sure we all have the same understanding of common terminology. Observability loves to throw around scary, academic-sounding terms, and I want to demystify those up front, and also understand where OTel fits in the landscape.

So what is observability? Ask ten different people and you'll get ten different answers. I think it's how effectively you can understand your system's behavior from the outside, using the data it generates. Monitoring, on the other hand, is the continuous process of watching and tracking system health based on a predefined set of data. I think of monitoring as the smoke alarm in your house: it checks for smoke particles, it alerts you when it senses them in the air, and it's always watching. What is telemetry?
Telemetry is the process of recording and sending data from remote components to a backend. When we talk about software or infrastructure telemetry, that typically means metrics, logs, events, and traces, and if you've been to some of the OTel talks today, you know we'll soon be adding another type: profiles. So telemetry is really just about sending this sort of data from one device to a central backend or a proxy.

What is instrumentation? It's the code that records and measures the behavior of an app or infrastructure component. I've got a slide later that breaks instrumentation down into three categories. There's auto-instrumentation, which is mostly what gets marketed and is really the first step most orgs take, especially with OpenTelemetry: out of the box you toggle on auto-instrumentation and, boom, you get some data. There's programmatic instrumentation, where you're manually bringing in libraries and setting up some configuration. And then there's manual instrumentation, where you're adding custom attributes yourself.

All of that primer brings us, finally, to OpenTelemetry. What is it, and why do we care? It's a standardized, vendor-agnostic set of SDKs, APIs, and tools to ingest, transform, and send telemetry to observability backends. If you've been to the talks, you know how real the vendor agnosticism is: there's a lot of wonderful cooperation across the vendors in the observability space, and we all play nicely in the OTel sandbox. Unsurprisingly, OTel is part of the CNCF; it joined back in 2019.

One thing that I think gets a little confused about OpenTelemetry is what it is not. It is not just a tracing tool. We did start with tracing, which was our first signal to reach GA, but we've expanded into all of the other signals. And interestingly, OTel is not a backend or storage system. It's the pipeline and the set of libraries to generate the data, transmit it, maybe transform it, and then export it somewhere else. It's also not an observability UI, which is why in this workshop we needed to bring in a UI for tracing; in this case I chose Jaeger. Storing this data long term and visualizing it is simply not OTel's purview.

When we break down the OTel components, we've got the APIs, which define the data types and how to generate the data, and the SDKs, which are the language-specific implementations plus configuration, data processing, and exporting. You're in luck if you work in one of these languages, because these are the ones OpenTelemetry supports, although I'm sure the community would be happy if you added another language to the group. We'll be looking at the registry, but if you want to take a peek now, that green text, again, is a link you can click. It takes you to the OpenTelemetry registry, where you can check whether the apps, libraries, and frameworks you use daily already have instrumentation out there. In our case, we'll be relying on the opentelemetry-instrumentation-flask library, which is built on the OTel middleware, and we'll be observing a very simple web application. Again, this is what the registry looks like; you can check whether your favorite library is instrumented.
The one component we won't be using today, but that I think is important for understanding OpenTelemetry overall, is the Collector: an open source proxy that receives, processes, and transforms your telemetry data with OTTL, the OpenTelemetry Transformation Language, and can export it out to various backends or storage.

This is a little rehash of what I covered before: I really break instrumentation down into three concepts. Automatic is what you get out of the box just by installing something. Programmatic is where you're mixing pre-instrumented dependencies with manually added metadata. And then there's manual. I think a misconception is that if you're manually instrumenting, you're manually instrumenting everything. No. What I hope you take away from this workshop is that you can mix and match automatic, programmatic, and manual. It's not one versus the others; you probably need to rely on all three types to get the best visibility. Automatic instrumentation is great because you don't have to make code changes, and if you've got a service mesh running, tracing is something you kind of get out of the box. We'll have labs on auto-instrumentation, programmatic, and manual, so no worries about catching up on the code there.

So look at us, we completed lab one. We've got a common understanding of the terminology we throw around in this observability space, and we've looked at a high-level overview of the OpenTelemetry components that are relevant for us today. Next we'll be installing and configuring OpenTelemetry in our demo app. And yes, there are resources you can find at the end of every lab. If you have questions, or issues come up, please feel free to submit them to the GitLab repo associated with this workshop, or get in touch with me on Mastodon or by email.

This is where the fun starts, the interactive portion. If you joined us a little late: you need Podman, Python 3, and the GitLab repo with the sample application. What we want to do for automatic instrumentation is get OTel set up on your machine, configure the SDK, run our demo app, and view trace data in the console. We're going to build these concepts up one by one. We'll be working with a Python Flask app. I specifically chose Python because we've got a really lovely, strong set of documentation in the Python community, the Python agent, and a lot of great code examples. You can fill in the blank with your favorite framework, but for today it's Python.

So we begin. I imagine this is where folks may run into some issues as we get to the interactive component, and again, if you hit a snag, raise your hand and I, or one of my very helpful helpers, will come around and debug with you. This first step should be pretty easy: make a project directory and cd into it. You can name it whatever you want, but it's probably best if you copy-paste. Next, download the demo app. It's a very simple Python Flask app with three endpoints, nothing too fancy. You can git clone over HTTPS or SSH, choose your own adventure. I'll leave this up for a little bit to make sure folks have time to pull it down. And now we want to explore the demo app. What are we going to be instrumenting?
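For orientation, the demo app boils down to something roughly like this. This is a reconstruction based on the route descriptions that follow, not the repo's actual code; in particular, the dog API URL and the exact response handling are assumptions.

```python
# Rough sketch of the uninstrumented demo app (the repo is the source of truth).
import random

import requests
from flask import Flask

app = Flask(__name__)
hits = 0


@app.route("/")
def index():
    # Displays how many times the page has been loaded this session.
    global hits
    hits += 1
    return f"This webpage has been viewed {hits} time(s)."


@app.route("/doggo")
def doggo():
    # Calls out to the external dog API and shows a random dog photo.
    resp = requests.get("https://dog.ceo/api/breeds/image/random")  # assumed URL
    return f'<img src="{resp.json()["message"]}">'


@app.route("/rolldice")
def roll_dice():
    # Displays the result of a randomized dice roll, one through six.
    return str(random.randint(1, 6))
```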
What do we need to get visibility into? We're starting with no instrumentation; we don't know anything about this app as it's running. We've got three routes we're going to look at today. There's the basic slash route, which just displays the count of how many times you've loaded that page in that session. There's /doggo, an endpoint that calls out to the dog API and fetches a random photo of a dog; it is very delightful, I wanted to have a little fun today. And finally there's /rolldice, which displays the result of a randomized dice roll, a number between one and six. So pretty simple, nothing too gnarly in the code, because what we really want to focus on is learning the concepts of tracing and of instrumenting with traces, so I wanted to slim down the complexity.

And so we begin. Once you've got the sample repo down, go ahead and get into that directory and we'll build our first image. I did all of this testing with Podman, so if you run into a Docker problem, I wish you luck and may try to help, but this is how the workshop was set up. It's also why I wanted to make it available after the fact, so you can hack on this at your own leisure. We'll start by building an image: podman build, and we're going to tag it. We're calling this app hello-otel, nice and simple. And, oh gosh, I did leave a Dockerfile in there; I thought I'd renamed it to Containerfile, but that's fine. You'll know you're successful when you get a message like the one below, obviously with a different ID.

Once you've got an image, we can run the container. You can copy-paste this command, but let me walk through what's going on. We've got port 8000 exposed in the container, mapping to port 8001 locally. (I may or may not have changed it to 5000 later on; I really hope I didn't. We will find out.) There's a little command that snuck in here: opentelemetry-instrument. That is the component that's going to do our auto-instrumentation for us. If you open up the source code, you'll see there are no OTel libraries being installed; we're doing this entirely from the outside with the OTel agent. We're going to export our traces to the console, and we'll export some metrics too, but we won't be working with metrics today. And then flask run is what starts the application.

Once you've got that running, open up localhost:8001 and confirm that for the slash endpoint you see "this webpage has been viewed one time." Note that I'm a backend engineer, not a frontend one, so there's not a lot of pretty CSS happening. Once we've gotten there, we've confirmed your setup is working, the app is working, Podman is working, and we're ready to move on to the next step. If this is causing a problem and you don't get to "this webpage has been viewed one time," please raise your hand and we'll come help, or you can follow along with the slides and see how they run. Just a quick show of hands: are people getting this running? I want to take a little temperature. Okay, a few. Good. And again, we do want to help you, so raise your hand and I will come over. Okay, if you've confirmed that's working, go ahead and stop that container with Ctrl-C. Now we're going to run interactively and use a little tool called opentelemetry-bootstrap.
What opentelemetry-bootstrap does is detect whatever installed libraries the app has (in this case it'll see, oh, there's a Python library for Flask), go out to the registry, and find relevant instrumentation packages to bring in. This is the magic that makes auto-instrumentation happen. So go ahead and run your container: map your ports, run the image we just built, and make sure you get into a shell. Once you're in that shell, pip install both opentelemetry-distro and opentelemetry-exporter-otlp. OTLP is the OpenTelemetry Protocol, and it's what speaks OTel traces and spans from one system to another, so we need both of those things. (pip is what Python uses for package management.) Once those are in, run opentelemetry-bootstrap -a install, and that's what goes out and grabs all those dependencies and instrumented libraries for us. You'll know you're successful because you should be dropped in as root in that container; I guess it depends on how you set up the Podman VM, but if you're doing everything vanilla from the start, you should be root.

Now you can run the auto-instrumentation agent, and this is where we lean on that opentelemetry-instrument command. This looks very similar to our app run command, right, but we're wrapping it in the opentelemetry-instrument agent. So we've changed our run command by adding this up top, and what we should get is some verification. If that's working, open up localhost:8001 and make a couple of requests to generate some traffic. What you should see in your console (no worries if this is super small) are spans appearing in textual form. This is what a span looks like when it's not visualized in a UI: a blob of information. That's how you know you've successfully wrapped our Flask app in the OTel auto-instrumentation agent and are getting span data. This is success at this point in the workshop.

I'm going to pause here because I don't want to get too far ahead. Raise your hand if you've been able to verify your auto-instrumentation, or should we pause a little, walk around, and do some helping? We've got some. And again, if you missed it at the beginning, there's a QR code and a Bitly link you can follow along with, and this is totally available after KubeCon, so no worries about fitting it all in today if you hit a snag.

Great. We were interactively in our container, so we need to get out of there: type exit, or on some systems you may need to Ctrl-C out. What we've confirmed is that without making any code changes, we were able to get span data automatically, just by wrapping our command with the OTel instrumentation agent. Now let's add a span attribute. In this case, we want to see how many times the page has been loaded; maybe that's an interesting thing for us to track. So hop into your IDE or text editor or whatever you're most comfortable writing code in, open up the sample application, and find app.py. What we're going to do now is manually import the OpenTelemetry library and modify the index method, which is what's attached to the slash route for this app.
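Here's roughly what that change ends up looking like. This is a minimal sketch using the span and attribute names from this walkthrough; your repo's code is the source of truth.

```python
# app.py (sketch of the change): only the OpenTelemetry lines are new, the rest
# of the app stays as it was.
from flask import Flask

from opentelemetry import trace

# The auto-instrumentation agent (opentelemetry-instrument) configures the SDK
# for us, so here we only use the API: grab a tracer to create spans with.
tracer = trace.get_tracer("demo-app")

app = Flask(__name__)
hits = 0


@app.route("/")
def index():
    global hits
    hits += 1
    # Start a span covering this handler and attach a key/value attribute to it.
    with tracer.start_as_current_span("load home page") as span:
        span.set_attribute("page_load.count", hits)
        return f"This webpage has been viewed {hits} time(s)."
```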
The thing you're going to bring in is from opentelemetry import trace, and then you're going to instantiate a tracer, so we've got something that's tracking all of our spans. We'll just call it demo-app; you can call it whatever you'd like, hello-otel if you want. Then we drop into the index method. When we're manually instrumenting, you've got to start the span yourself, so we say with tracer.start_as_current_span, meaning: in this method I'm in, start a new span right now. You should name it something meaningful and relevant to you; in this case "load home page" works. And I always like to type things out fully, so we'll reference that variable as span; some people shorten it to s, but I like being explicit. The next line you'll add is span.set_attribute. Attributes are key-value pairs, so we'll call this one page_load.count and give it the value of hits, which is how many times that page has been loaded.

When we've done that, we get to a loop we'll do many, many times in this workshop: make some code changes, rebuild the container image, load up the app, send some traffic to it, and then look at our trace data. We'll do this loop over and over. You're more than welcome to write a little script, or if your Podman VM is mounted to your directory you can do this stuff on the fly, but for ease of use we'll just run through these commands as they are. So go ahead and rebuild your image, same kind of command as before, and make sure you get that success message. Then run the container; I don't want to read that whole command out, so please copy-paste liberally and get the application running.

Here's how we verify this manual instrumentation: when we load up our homepage, we should see a span pop up in the console that has, in its attribute block, page_load.count of one, or however many times you've refreshed. I like to refresh like a madman, so mine always gets up to seven or eight. That's how you know we've successfully piped through that manual attribute. We won't be working with spans in textual form for long, I promise; it's just the easiest way to constrain the space and get started early on.

I'm going to pause right here and do another temperature check. Raise your hand if you've been able to verify that page_load.count is in your span payload. Oh, okay, now we're losing people. Again, please feel comfortable and free to raise your hand, work with a neighbor, or flag one of the two TAs. All right, we've got one person who needs some help; we are here. We've got another three, keep them up, and we or Vic will head over. One here and one up front. We want to make sure this is a good learning experience for you.

Perfect. When you've gotten to this point, go ahead and stop that container. We did a good job. This is the feedback loop you should get used to as you're instrumenting code: I make my change, I run it, I validate that what I expected to be there is there, and then I move on to the next thing. So at this point, we've completed this lab.
Even if you ran into some issues with auto-instrumentation, the programmatic instrumentation lab uses a different image build and everything, so you can roll forward, or we'll come over and help. If you're curious, there are more resources here: the Python auto-instrumentation agent configuration, and the opentelemetry-instrument man page, if you want to do more digging on your own. And again, the project repo and the official OTel site, though that one's pretty easy to find if you just search for OTel.

Okay, one more thing. When we talk about programmatic instrumentation: it's not that auto-instrumentation isn't great, it's just that there's a little bit of a dark side, where people think it's all they need to do. They treat auto-instrumentation as the finish line instead of the launch pad, and I just don't think you're ever going to get the full visibility you need for context or business-specific questions without programmatic or manual instrumentation. That's why I thought it was important to show the differences. So now we'll move to programmatic instrumentation with the OTel libraries, and we'll finally bring in Jaeger for some trace visualization, because working with spans in textual form is just not my favorite.

Head back to your IDE, open up app.py, and reset the file: delete what you had and copy-paste from here. This is what your file should look like, and you can see our routes are very basic. Next, we update our imports; the dots just mean there's more code there, I just wanted to focus on what we're changing. We're importing from opentelemetry.instrumentation.flask and bringing in the FlaskInstrumentor, which is what will do the programmatic instrumentation. What that means is that the authors of this instrumentation have already taken the time to add OpenTelemetry instrumentation for Flask (the spans and metrics and whatever you need), and we can just bring that in and piggyback off of it. You should not be manually instrumenting everything; really, you lean on manual instrumentation for specific attributes, metadata, and maybe some internal code paths that couldn't naturally be picked up by a framework.

From the OTel SDK we bring in two things: the console span exporter (I guess we'll still be working with the console a little) and the batch span processor. The batch span processor queues up spans and, once that queue reaches a certain size, sends them off. So if you're ever thinking, "I've been sending traffic to my app, where are my traces?", maybe you just need to make a couple more requests so there's enough to get a batch out the door.

Then we configure the OTel SDK. After our import statements and above any existing code (we're still in app.py, and we won't be anywhere else for a while), this is the first thing we drop in: set our tracer provider, get our tracer provider, and make sure we're adding that batch span processor with the console span exporter. That's not a lot of config, which is kind of nice; it's not making our app code horrible, it's very minimal. And the last part we need to do is set up our programmatic instrumentation.
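Put together, the programmatic setup in app.py looks roughly like this. It's a sketch; the repo's blessed version of the file is canonical.

```python
# app.py (sketch): SDK configuration plus the pre-built Flask instrumentation.
from flask import Flask

from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure the SDK: a tracer provider that batches spans and, for now,
# writes them to the console.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)

app = Flask(__name__)

# Let the Flask instrumentation create a span for every request to our routes.
FlaskInstrumentor().instrument_app(app)

# ... routes stay exactly as they were ...
```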
So we've dropped in the imports, we've set up a tracer, we've set up the provider; now we need to say, okay, this FlaskInstrumentor needs to run and do its thing, and what we're instrumenting is our specific app instance. Go ahead and make sure this is reflected in your app.py. And yes, it's a little bit of a scroll, so if you're following along and got a bit lost, or the lines didn't add up, you can copy-paste this blessed version of the file.

Again, this is the loop we've been talking about: rebuild the container image, run the container, send some traffic, and validate the results. This is the loop you should get used to when you're instrumenting. In this case, the only thing I've changed is that we're tagging this image "programmatic," so if you want to compare the manual versus programmatic versus auto-instrumented versions later, they're tagged differently per lab. Make sure you get the success message for the build, then go ahead and run it; the only thing that changed from last time is the programmatic tag, everything else is the same, including flask run. Then, same as before, open up localhost:8001 and confirm the app is still running. You can hit /rolldice and /doggo too.

Here's what Flask programmatic instrumentation gives you: Flask knows about routes, so it will track all of our calls to the different routes, but it doesn't know much more about what we're doing on the inside. So you won't see how many times the web page has been loaded, but you will see a span for the slash route having been called. If you just hit /rolldice, for example, what you get out of the box with programmatic instrumentation is this representation of a trace. And again, we didn't change any of our routing code; we just imported the Flask instrumentation library, which is essentially what the auto-instrumentation agent was doing for us. Go ahead and stop that container.

Now we finally get to actually look at these traces in Jaeger. Podman, like Kubernetes, has the concept of a pod, where you run multiple containers that share some resources. So open up app_pod.yaml, the YAML you know and love, and just make sure you're comfy with what we're doing here. We're bringing in the Jaeger all-in-one container. This is not production ready; it's specifically for local testing, just in case you were wondering. The ports you need to care about: 16686 is Jaeger's web UI, and 4318 is of course for sending the OTel data. We also make sure the OTLP receiver is enabled on the collector (the COLLECTOR_OTLP_ENABLED setting), so Jaeger knows to receive OTel data, and we'll be sending it via OTLP.

Okay. Your other container should be stopped, so make sure that's the case; you can always run podman ps and see what's running. The way you run a pod in Podman is podman play kube, passing it your pod file. You should get a success message saying that not one but two containers have spun up: one for our app and one for Jaeger all-in-one. Jaeger natively supports receiving OTel data, so the config we had to do here was also pretty minimal; this stuff all plays nicely together. You'll know you're successful if you can open up localhost:16686 and see the very, very cute Jaeger mascot.
That's the little gopher detective, following the footprints, the trace. This is what success looks like for this part. Now let's generate some traffic: make a few requests to localhost:8001, to /doggo, to /rolldice, and see what you get out of the box in Jaeger. You should see hello-otel, or whatever you named your service, in the drop-down, and you should see some little dots representing each request you've made. Like I said, I love to refresh, so I made ten traces; you may make one or two.

Jaeger has a lot of really beautiful ways to visualize traces. One common visualization you will see is the trace waterfall. If you see, down here, something like hello-otel /doggo, you can click it and be taken into the trace waterfall view. If you've looked at the Chrome DevTools waterfall diagrams, it's like that (not quite a flame graph). This is the visualization you turn to when you want to examine one particular traced request in detail, and you can get all sorts of helpful attributes: click on one of the spans and it opens up into a nice little table. If all of this is working for you locally, then we've completed programmatic instrumentation and successfully sent and viewed traces in Jaeger. Leave your pod running, because next up we're going to talk about the visualizations and you may want to explore them on your machine.

So what is Jaeger? It's an open source distributed tracing system. It spun out of Uber: when Uber had their death star of microservices, they really needed tracing to understand the complexity of the path a single request could take through their system. Tracing was their answer, so they built Jaeger. Originally Jaeger actually had its own ecosystem, its own instrumentation and format, but luckily Jaeger decided to join forces with OTel, or at least interoperate with it, and has actively deprecated the Jaeger-specific instrumentation in favor of OTel. So we use Jaeger today for the UI, and if you want to self-host a distributed tracing backend and UI, that's also what some companies do.

We're using Jaeger all-in-one. We're not going to go super deep into Jaeger architecture because we're just using it for a UI, but there are two components you should know about: the Jaeger collector, which, similar to the OTel Collector, receives, processes, and stores that trace data, and Jaeger query, which exposes the APIs for retrieving traces and, of course, our beautiful web UI that we're going to get very comfy with soon.

The first page you land on is the default view, the Jaeger search console: you can query for traces, you can query for spans, and you can look at traces in aggregate. You could say, maybe I want to see everything with status code 500, or only traces from this specific service. The scatter plot up top shows the traces Jaeger knows about, and in the case of all-in-one we're storing this data in memory, which again is why it's not a production-ready system. But if you see something like, wow, that took a lot of time, or why is this dot so big, click on it and see the trace that led to it.
The table view, I think, is helpful if you want to compare: each of these traces has a little checkbox. Maybe you want to compare a trace from before and after a deploy, or before and after a feature flag got flipped. That's a very common use case, and Jaeger supports it out of the box.

Once you click on a trace, you're taken back to this trace waterfall view, and when you're running in a production system, maybe at scale in a cloud native architecture, you could actually have traces spanning hundreds of services and thousands of spans, and it can get kind of overwhelming. So having a trace waterfall view where you can collapse groups of spans, or the spans from a particular service, is really helpful as you're pinpointing the source of latency or a specific error you saw. We're working with really tiny traces today, but please make heavy use of collapsing spans, because in a sufficiently big system it can be data overload.

You can even use this tiny search box (it's a little tiny). Maybe you're looking at a hundred-span trace and you know there's a specific attribute or property you want to find, like a GET request, or a 400, 500, or 200 status; type one of those into the box and it will bring you down to the span that matches. That's a helpful approach if you know what you're looking for, or if you're curious whether some attribute popped up in a trace and you don't want to read the whole waterfall.

Again, the scatter plot is a quick way to visually compare traces without getting into all the details of the spans and attributes. You can look for things that are out of the ordinary or anomalous; each bubble is a trace, and clicking on one takes you to the trace waterfall detail view. In the table view you can also sort by most recent, and you can look at a dependency graph. There are really so many different ways to interact with and visualize this trace data, which is why I was happy to leave behind the JSON text representation of spans; this is the tracing I know and love, being able to interact with it like this.
We've talked a little about the trace table: back on that home search page you can sort traces by duration or by the number of spans. So if, for a request to checkout, you normally see 50 spans and then all of a sudden there's a trace with 300 spans, something might have gone wrong. Being able to look at traces in aggregate and sort and filter them is very helpful. Then there are the trace details: you see the traces in aggregate, find one that looks interesting, click into it, and get taken to the trace waterfall. You find a span that looks interesting (maybe it's in the critical path and taking a long time, maybe it's got a wonky error message, something about it doesn't look right), and you can zoom in further and drill into the details of that span. That is where all the manually instrumented metadata comes in handy, because you can start doing some correlation.

The beauty of traces is that you can go from the aggregate view ("some of these requests look a little funky, let me look at one of them"), down to "hmm, I think this is where the problem is, it's this call to this database," zoom into that span, and then go all the way back out again. Traces are really great for that zooming in and out as you develop hypotheses about what's going wrong, or what's going right. Maybe you made a performance improvement and you just want to understand and explain how that happened and get that promotion; it doesn't always have to be a bad-news use case.

When you're on the trace page, there are a couple of other ways to look at traces. It's not super discoverable, but there's a tiny drop-down on the right-hand side (follow along on your laptop) where you can grab a graph, a span table, and a flame graph, if you're really into flame graphs and that's how you troubleshoot best. I love the graph, sometimes called the system topology view; it shows you how the services relate to each other, and there's another view you can bring up for how each method or span relates to the others. You can change the coloring to highlight what took the longest, what's in the critical path, or self-times, which shows the longest span durations that weren't waiting on children or other work. This is experimental, but it's something you'll see in a lot of vendors as well, so the more you get used to exploring spans and traces in the graph view, the easier it is to bring that knowledge to any vendor.

The spans table, again: if you're looking at a trace that has so, so many spans and you just know you're looking for error=true or some other attribute, you can search within a trace for something specific, maybe the Jinja load or the HTTP GET. There are lots of ways to inspect and query this data. And I am not a flame graph person, but if you are, please feel free to use the flame graph view to visualize these traces. Similarly to the waterfall, you can collapse the details you don't need or care about, you can copy a function name to use in a metrics or logs query in another system, or even highlight similar spans within the trace. That's why I wanted to bring you here today: there are just lots of ways to visualize the data in a way that makes sense for you. Next we'll take a quick look at comparing traces across a change. I will say that comparing traces is what I used the most when I was troubleshooting as an SRE.
Jaeger has a very interesting way of showing comparisons; I think there are some very nice options in vendors too, and I'm excited to see what the OpenTelemetry desktop viewer does for trace comparisons. That's another project that's just getting started. So why would you want to compare traces? Most of the time, when I'd get paged or someone would tell me something was wrong, my first question was: well, what changed? It was working before; what's different? So I like to look at a request from when things were normal and compare it to one from after I got alerted or somebody told me something was wrong. Being able to compare that request path is super powerful. In our case, think about our doggo app: why did one request to /doggo take 685 milliseconds and another only 281? This is something you should be able to repro on your machine. It's a little hard to see up here, but what we do now is click both of those checkboxes, hit "compare traces," and you'll be taken to this view.

They've decided to model the colors on code diffs. Gray represents spans that are in both traces, fading into the background so you don't need to focus on them; that's what was similar. Red nodes are spans that were only in the first trace you selected; green nodes are spans only in the second. I don't think I can zoom in here, but if you look closer at this trace diff, maybe on your own machine, you can see that what was present only in the first trace was compiling the Jinja template. Jinja is what Flask uses for its HTML templating under the hood, and it's only done the first time a page is loaded. So in this case, comparing traces let us say: oh, the first request to this application after it spins up takes a little longer. Obviously it's not the end of the world, we're still under a second, but we can see where that latency came from. That's a small example of why trace comparison can be super fast and helpful.

And again, why are there so many ways to visualize a trace? It's really because you need the ability to zoom out, looking at traces in aggregate, and zoom all the way down to an individual operation that took place for one request. It would be really overwhelming to try to shoehorn all of that into one specific visualization, or one single pane of glass, as the industry loves to say. For me it's all about being able to go from high level to low level and back up again, and you need different visualizations to support whatever zoom level you're working at.

Okay, you can go ahead and stop the pod, or keep playing with the visualizations if you want, but I think that means we go to the next lab. So: we've reviewed our trace visualizations and gotten a little more comfortable with the Jaeger UI, and now it's the final lab, on manually instrumenting metadata. Here's the change to our instrumentation loop: we'll make some code changes, rebuild our image, run our container, generate traces by sending requests to our app, and then load them up in Jaeger and see the results of our instrumentation. We're adding one step to the feedback loop, but it's still pretty manageable.

All right, if you were really into the trace comparison, there's a deep-dive blog linked in these slides. You can also check out the Jaeger project site, or, while you're here, go talk to a Jaeger maintainer.
Talk to the Jaeger folks about what's going on in their world. You can read about the native OTLP support, and if you're really keen on bringing Jaeger to your org, it's definitely a good idea to look at the deployment options available to you, because all-in-one is not for production. I cannot say that enough; it is just for our local testing today.

All right, here we go: lab five. It's a bit of a rehash of before, but again: automatic and programmatic instrumentation get you most of the way to visibility, and they're definitely better than nothing, but if you have the skills, and can teach other developers the skills, of manual instrumentation, you'll be able to add specific metadata to your apps so you can derive insights for your business. You'll see some examples of what we'll be manually instrumenting as we begin.

So head back to your IDE, open app.py, delete whatever we had there, and copy-paste this reset app.py. We're going to manually bring in some libraries again: reinitialize a tracer provider, which is what creates the tracer that accesses and modifies spans as we run our application. And we did like that Flask framework instrumentation; it was pretty nice to see traces attached to the routes we were making requests to, so we also want to bring back the FlaskInstrumentor. We could go through this line by line, but I do promise you can copy-paste: we're bringing in some OTel libraries and bringing back the FlaskInstrumentor.

And again, this is the loop we know: we build our image and make sure it's tagged. In this case the only change is that we tag it "manual," so you've got a version of the programmatic, the automatic, and the manual if you want to do some comparisons later on (you'd need to remap some ports to make that work, but it's pretty doable). The other change is over in app_pod.yaml, our pod spec. Because we've changed the tag for the image we built, we need to make sure the pod has the updated tag. Then comment out that command block, because we're manually instrumenting now and we don't need the auto-instrumentation agent to wrap our flask command; we're going to run the OTLP span exporter and rely on Jaeger's native OTLP ingestion to send spans over HTTP. So we don't need opentelemetry-instrument anymore. We thank it for its service, but we're moving on to manual instrumentation.

You should not have your pod running; if it's still up from earlier, go ahead and bring it down, and then podman play kube app_pod.yaml. And again we're back in our loop, making requests to our application endpoints. I like to have a variety of traces, so make a couple of requests to each of our endpoints: slash, /doggo, and /rolldice. And this is where I had 5000 in the slides; we started out using port 8000, so go ahead and swap that five for an eight. I'm very sorry, and we'll update this in GitLab tonight. You'll know it's working if you can load up Jaeger, again at localhost:16686. You don't have to select hello-otel from the menu, but note that Jaeger also instruments itself, so you'll see jaeger-query pop up as a service. We really care about the hello-otel spans, so go ahead and select hello-otel as the service and confirm that you see traces reflecting the requests you made to the application.
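If you want to sanity-check your reset app.py before we start adding attributes, the setup portion looks roughly like this. This is a sketch relying on the exporter's defaults; the repo's version (which may also set the service name that shows up in Jaeger) is the source of truth.

```python
# app.py (sketch): export batched spans over OTLP/HTTP to Jaeger all-in-one.
from flask import Flask

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# The HTTP exporter defaults to http://localhost:4318/v1/traces, which matches
# Jaeger's OTLP/HTTP port; inside the pod the containers share localhost.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))

app = Flask(__name__)

# Keep the per-route spans from the Flask instrumentation.
FlaskInstrumentor().instrument_app(app)
```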
When that's all said and done, we're ready to add some manual instrumentation. We did this before, but it's important: these are a lot of people's first steps with tracing instrumentation, where they just need to add a key-value span attribute. Attributes are just metadata to annotate a span, just more information that might help you. The API call is span.set_attribute with a key and a value, both strings in our case. Because we're manually instrumenting, we'll get hold of a span: trace.get_current_span() gives us the current span attached to index, and then we set the attribute, "hits" as the key and the hits value from index as the value. Very similar to before.

Let's stop the pod; in this case stopping a pod is podman play kube, the pod file, and then --down. Then rebuild your Dockerfile/Containerfile and run the pod again. Open your browser, make a couple of requests to localhost:8000 (we know we started with 8, we'll stick with 8000), and make sure that when you find a trace for the slash route, you click into the span details, open up that span, and see, right at the top in the span tags, "hits" and however many times you refreshed that page.

And this is where we think: okay, Flask instrumentation was instrumenting at the route level, but it didn't capture external requests or custom work. Our /doggo endpoint, we know, calls out to the external dog API. We don't control that code and we can't add instrumentation there, but that means if we only have the Flask programmatic instrumentation and we're looking at traces for /doggo, we don't really know how much time we spent calling out to the dog API versus how much time we spent internally processing, pulling out the dog breed. It would be super handy to know whether it was a dog API problem or an us problem if we needed to optimize this code path. And we don't have to stop at the Flask programmatic instrumentation library.

Let's see (I got this slightly out of order), let's see what we get for /doggo. Make a few requests, open up Jaeger, and look at what we get for /doggo out of the box. You'll see it's just a one-span trace, which doesn't tell you a lot, but it does tell you how long it took to fulfill the request. And if you look closely at the span tags, you'll see span.kind is "server," which reflects that this span was generated from Flask's point of view, our application. So again, we don't know how much time we spent waiting on the dog API. Maybe they had problems, maybe they pushed out really buggy code that took forever; all we'd see from our end is longer and longer request times, higher and higher latency, and we wouldn't be able to figure out why requests to /doggo were suddenly taking longer.

So let's instrument that. Go ahead and stop your pod. What we want to do is look at what else we're using alongside Flask, and if you look at the import statements you'll see "import requests." If you work with Python you know that requests is a pretty popular library, and it's quite possible someone was kind enough to instrument it for us already. We could be in luck: all we'd need to do is import a library, just like we did with the FlaskInstrumentor, add a few lines of config, and we get extra visibility into our external HTTP requests (actually all HTTP requests, but in this case we care about the external ones). So in a new tab, open up the OpenTelemetry registry. It's a good idea to get comfy opening up and exploring the registry, just to see what's out there; you don't want to instrument more than you have to.
You can filter down to Python, and you could filter even further down to type: instrumentation. If you type in "requests," or click this handy green link, it'll take you right to the top result, which is, yes, thank you, the instrumentation for the requests library. The work has already been done for us, so let's pull in this library.

To do that, this is where the Python 3 dependency comes into play, because we need to pip install this library and make some code changes locally. Python uses the concept of virtual environments so you can isolate dependencies from the system and from one another; you can have multiple versions of a Python library, all isolated in different directories. The way we create a virtual environment is python3 -m venv. You can call it whatever you want, but it's easiest if you stick with the copy-paste here. Then activate that virtual environment. You'll know you're set if you run pip -V and get that long (not super nice-looking) path; as long as .venv is in there, you know you're activated and good.

At this point we can do the work. We've seen there's this requests instrumentation library out there and we want to bring it into our application, so pip install opentelemetry-instrumentation-requests ("requests" being the name of the library we're instrumenting). Python projects commonly track what's installed in a requirements.txt, so go ahead and pip freeze and pipe that into requirements.txt to save your work. Then let's configure it: add an import statement near the top for the RequestsInstrumentor namespace, just the same as the FlaskInstrumentor, and then right under FlaskInstrumentor().instrument_app, add RequestsInstrumentor().instrument(). In a sufficiently large or more complex application you may want to pull this out into a config module, but we're working locally today and keeping it simple.

Again: rebuild your image (still keeping that manual tag) and run the pod. We were really curious about the /doggo endpoint, so you can make requests to everything else, but make sure you make a few requests to /doggo so we can check out what we got by adding this new library. Search the hello-otel service, and you can even drop down to the particular operation for /doggo. What we should see is that the span count has increased from one to two. So let's take a look at what we got. The trace waterfall should show that first root span, the overall time it took us to respond to the request to /doggo, and underneath it a child span for an HTTP GET request out to the dog API. If you look at span.kind, it's not "server" this time; it's "client," because we are the client making an external request. That's one way you can sort out which side of a request you're on. So all we had to do was bring in a library and add a little bit of configuration; we did not have to manually instrument the dog API call, and we got this out of the box. While yes, we're manually bringing things in, we're still benefiting from programmatic instrumentation. Go ahead and stop the pod.
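For reference, the requests-instrumentation change amounts to roughly this. It's a sketch showing just the relevant lines; everything else in app.py stays as it was.

```python
# Additions to app.py (sketch), after installing opentelemetry-instrumentation-requests.
from flask import Flask

from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

FlaskInstrumentor().instrument_app(app)
# Every outgoing call made with the requests library (like the one to the dog
# API) now gets its own client span, nested under the Flask server span.
RequestsInstrumentor().instrument()
```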
Now we'll mix in some more manual instrumentation. There's a little bit of work done in this doggo request to pull out the dog breed, so that as you're loading the page you know what kind of dog you're looking at, because maybe you're a cat person and don't know; it's helpful to have that info. You could imagine this being a computationally expensive method, maybe running some crazy regexes or something like that, and it might be helpful to track our internal work in addition to the dog API call. Say we saw that latency was up on /doggo requests and we'd confirmed the dog API was still running super performant; now we have to look internally at what's causing that latency. We need to add some more instrumentation to track this internal work, and this is where we create a new span.

So far we've been relying on spans created by the programmatic libraries; now we're creating a new span that those libraries could not have known about. This is manual instrumentation proper, in this case for the get_breed method. We want to make sure we're not just creating the span and sending it into the ether; we want this span tied to the traces in this request path, which means the new span needs to be a child of that first, overall /doggo span. In the OTel API, that's tracer.start_as_current_span, which creates a new span in the current trace context; you can attach or detach it, and we'll be attaching. All of those words to say: it's a lot easier to create a nested span than to explain it. Open up app.py, head down to the /doggo route, find the fetch_dog method, and add that line: with tracer.start_as_current_span. I just named the span after the method, but you can call it whatever you want, and we'll reference it as "child."

And again: rebuild the image and launch the pod (make sure the old pod is stopped), then generate some traces by making a few requests to the /doggo endpoint, and let's see what we get. What we're expecting is a trace going from two spans to three: one overall span tracking our request to /doggo, one for our external call to the dog API, and finally that internal work for get_breed. It's not computationally expensive in our case, but it would be helpful to be able to tell internal work from external work. After you've made some requests, open up the Jaeger UI (that should be old hat by now) and search for traces for the /doggo operation. If everything goes according to plan, we should see three spans in these traces; click on one of them to get the detail view. We get the overall doggo span (what's called the root span), we have that GET request, and now the third span at the bottom is the operation for get_breed.

Now that we've mixed programmatic and this manual instrumentation, we could step away from our computer and know, if all of a sudden we get paged because it's taking super long to fetch these dog pics (which are really, really important to your business), whether it was us or our dependency. It took a mix of programmatic instrumentation and adding that one manual span, which was not a lot of work but gave us a lot more visibility. When you're instrumenting, there's always this question of level of detail: what would you need to answer questions like that when you're on call? Or when you're deploying changes, how do you know your changes were effective? You should think about when it makes sense to add that manual span or spans.
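For reference, the nested-span change from this step boils down to roughly the following. The span name follows the talk; the body of get_breed and the dog API URL are illustrative stand-ins, not the repo's actual code.

```python
# Sketch of the nested span inside the /doggo code path.
import requests

from opentelemetry import trace

tracer = trace.get_tracer("demo-app")


def get_breed(image_url: str) -> str:
    # Illustrative stand-in for the repo's breed-parsing logic.
    return image_url.split("/breeds/")[-1].split("/")[0]


def fetch_dog():
    # The Flask server span and the requests client span already exist for this
    # request; the with-block below adds a third span, as a child of the current
    # one, to track our own internal work.
    resp = requests.get("https://dog.ceo/api/breeds/image/random")  # assumed URL
    with tracer.start_as_current_span("get_breed"):
        breed = get_breed(resp.json()["message"])
    return breed
```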
You can visualize this as a trace graph or as the waterfall, and now that we've got more spans it's a more interesting trace, so hop over to the trace graph. While it's simple with three spans, this is what the operation map looks like: /doggo starts a request, then it calls out to the API and does our little get_breed work. If you want to visualize it by time, meaning what was in the critical path and what was taking the longest, click the "T" in the right-hand vertical bar menu. That highlights the spans that are in the critical path, which is very helpful for very complex request chains; the critical path directly contributes to the slowest path, so if you want to do optimization work, that's where you need to look. So if you were working on some perf improvements, go ahead and trace it, flip on this color-by-time view, and see where you need to start honing in to do that work. And like I said, we've been using toy apps today, but here's an example with a hundred and sixty spans; looking at that trace graph you can see how powerful this bird's-eye view is, and imagine flipping on the by-time coloring to see where your path of optimization should be. While we're focusing on learning instrumentation today, know that this stuff gets more and more helpful the more complex the distributed system you're instrumenting.

So those are creating a nested span and creating a span attribute. There's also a concept called span events, which, very confusingly, show up as span "Logs" in the Jaeger UI. I do not know why; there are plenty of OTel maintainers here who can have that conversation with you. But let's go ahead and add a span event. It's basically a structured log with a name, one or more attributes, and a timestamp, which is kind of why you'd want to call it a log: it's a timestamp with some textual metadata. If we look at the API, it's pretty simple: span.add_event, whatever you want to call it (page click, page load, whatever), and then your attributes.

So let's just look at the /rolldice endpoint. Go to the roll_dice method and get the current span. We don't need to create a new span, because we know the Flask programmatic instrumentation has already instrumented this route; we just need to get hold of the span that's already going to be created. We roll our dice and get some result that we maybe care about, and so: span.add_event("roll_dice"), and maybe it's helpful for troubleshooting in the future to know what the result of that roll was, so the attributes, a map of key-value pairs, get "result" with whatever the dice roll was.

To see this, again, very similar: stop your pod, rebuild your image, run the new pod, send some requests to the /rolldice endpoint, and let's verify what this looks like in Jaeger. When we click into our trace view, boom, you see this little thing that says Logs: all of a sudden we've got the span event showing up as a log with a timestamp, and that timestamp is relative to the start of the trace itself, which is just important to know. You can click it open and see what our result was; in this case it was five. You could add a lot more events here, and maybe even tie them into the existing logs your app has, but span events are a really helpful way to enrich your traces. We can go ahead and stop this pod.
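For reference, the roll-dice change is roughly this. A sketch using the event and attribute names from the talk; the repo's code is the source of truth.

```python
# Sketch of the span-event change to the /rolldice route.
import random

from flask import Flask
from opentelemetry import trace

app = Flask(__name__)


@app.route("/rolldice")
def roll_dice():
    result = random.randint(1, 6)
    # Grab the span the Flask instrumentation already created for this request...
    span = trace.get_current_span()
    # ...and attach a timestamped event (it shows up under "Logs" in Jaeger).
    span.add_event("roll_dice", attributes={"result": result})
    return str(result)
```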
There are just a couple more attributes we'll get to, one of which is span status. A lot of times when you're looking at traces, the most important field I relied on was error=true. That means that for a given trace, one of the spans somewhere in there (maybe multiple) experienced an error, and that's probably something you want to look at and compare to a healthy trace, a green trace. But maybe you're manually instrumenting and need to manually change a span status. If we rolled a dice and a five meant something had gone terribly wrong, there's no way the programmatic instrumentation would know that, so you would need to manually set that span status. There are three: unset, ok, or error. When a span status is set to error, there are a lot of visual cues that pop up, both in Jaeger and in any vendor you're using; span status error is very, very meaningful. In our case the API is just: get the current span, set the status to either unset, ok, or error, and, if you want, add a little description of what's going on, which would be very helpful to future you.

So we will simulate having an issue with the Dog API. Open up random_pet_pic.html and update our template to include a search bar. Plenty of things can go wrong when you give folks an open search bar: there's a chance that whatever our user is searching for is not actually a dog breed that the Dog API has, and that's an error, or at least a case we want to handle, or the user could put in total nonsense, or many, many other things. So we want to pass to the template any error messages that result from the validation we do on our end. Great. We're going to import a breeds list that I've copied over from the Dog API, and then add a couple of things to our Flask import statement.

Now head back to app.py, find our doggo route, and update the fetch_dog method. We need to handle POST requests from the search bar form, validate the input, and pass any error messages to the template, so that we show the user "what you're searching for cannot be found here" and leave a little message for ourselves to look at in the future: hmm, this is what the users are searching for, and it didn't go according to plan. You can go ahead and copy-paste this block, and then we'll do our stop, rebuild, relaunch, generate traffic, open up Jaeger.

What you should see when you load up the doggo endpoint is this very beautiful search bar, and you can start searching for things. Let's start with a valid search though: husky, greyhound, those are definitely in the Dog API, and you should get back one of those adorable breeds. Now we want to test an invalid search, so something like macaw, which is a type of parrot, or tabby, which is a type of cat, or your name; throw in something that will cause an error. What we should see is this very nice error message for the user: uh-oh, no breed found. But what helps us as the operators understand what happened here? Open up the Jaeger UI and search for traces for doggo, and you'll see there are no traces marked with errors, which is kind of interesting, because we know we just handled an invalid search case. Find your trace with the invalid search: it is the one that only has one span, or you can just follow along up here. The reason it only has one span is that it was handling the POST request: the invalid search term failed our validation logic, which prevented us from even calling the Dog API.
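A rough sketch of what that POST handling and validation might look like; the breeds subset, the form field name, and the template name are assumptions for illustration, not the repo's exact code.

```python
from flask import Flask, render_template, request

app = Flask(__name__)

# Illustrative subset; the workshop imports a full breeds list copied from the Dog API.
BREEDS = {"husky", "greyhound", "akita", "doberman"}

@app.route("/doggo", methods=["GET", "POST"])
def fetch_dog():
    error = None
    if request.method == "POST":
        breed = request.form.get("breed", "").strip().lower()
        if breed not in BREEDS:
            # Validate here so an invalid search never triggers a doomed call
            # to the Dog API; surface the problem to the user instead.
            error = f"No breed found for '{breed}'"
    # The template shows the search bar plus either a dog pic or the error message.
    return render_template("random_pet_pic.html", error=error)
```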
The Dog API would have just kicked back an error for an invalid search anyway, something like a 404 not found, and given us a failed request, so we already dodged that by moving the validation into our app. We passed our error message to the user, and we did successfully respond to that invalid search request to doggo, but we know there was kind of an error along the way, or that things didn't go as planned. We know, in my case, I was searching for tabby, but even when we look at the span details we don't really see that something went awry. So we can instead set a span status, because even though we successfully responded, it still may be important for us to track what users are putting in, or which users are ending up on this error page. It will not mark the entire trace as failed if we set the span status, but it will be helpful for us as we're troubleshooting.

So we'll pop back into the doggo route, update our fetch_dog method, and create a new nested span for these POST requests handling that form search input. For good measure, let's see what users are searching for by adding a breed attribute onto that span; this will help us very much in the future to debug and understand the source of these errors. Then we will update our error handling code, and this is where we manually set our status to error. We also might want to record any exceptions that pop up within the context of that span. So it'll be child.set_status, where child, again, is what we're calling the new span we created up here; we set the status to StatusCode.ERROR and record the exception, and this is in our try/except block.

Again: bring down your pod, rebuild your image, bring the pod back up, and send some requests to doggo. Make sure you have a healthy mix of valid and invalid search terms: doberman, akita, husky, greyhound for valid; macaw, tabby, calico, fox for invalid. Then let's load up Jaeger and do a little bit of comparison. Looking for doggo traces, verify that there is at least one trace that was successful but has this little red error box; that is what we did by setting that span status to error. You'll notice that because we still technically sent a successful response back to the user (we did load that beautiful user-facing error message), we don't want to say this whole trace failed, this whole request to doggo failed, because it didn't. But we do want to know something didn't quite go right, so let's take a look at what that was. Going down into the search breed span, since we recorded the error message, you can see our custom error message, "no breed found." You'd think, okay, that's fine, someone just didn't know what they were searching for. And our breed made it in as a span attribute. That's all super helpful stuff when you're investigating and troubleshooting: having that error message not locked away in logs, perhaps in another system, but right there in your trace as you're looking at things.
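Here is roughly what that error-status piece looks like. The span name, the breeds subset, and the ValueError are illustrative; set_attribute, set_status with Status/StatusCode, and record_exception are the OTel span API calls the lab uses.

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

BREEDS = {"husky", "greyhound", "akita", "doberman"}  # illustrative subset

def search_breed(breed: str) -> str:
    # New nested span around the form-search handling, called "child" in the lab.
    with tracer.start_as_current_span("search_breed") as child:
        # Record what users are actually searching for.
        child.set_attribute("breed", breed)
        try:
            if breed not in BREEDS:
                raise ValueError("No breed found")
            return breed
        except ValueError as exc:
            # The request still succeeds from the user's point of view, but
            # mark this span as errored so it stands out in Jaeger, and keep
            # the exception details on the span instead of only in logs.
            child.set_status(Status(StatusCode.ERROR, "No breed found"))
            child.record_exception(exc)
            return "No breed found"
```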
So, oh my goodness, we've completed our labs. We've looked at auto-instrumentation, we've looked at programmatic instrumentation, and we've hopefully seen the value of manual instrumentation: that you do not need to manually instrument everything, but instead want a nice, healthy mix of programmatic and manual instrumentation. We've looked at all the different ways you can visualize traces, how you can add attributes and record errors, and, most importantly, how to self-serve and look for the libraries you care about in the OpenTelemetry registry.

We will not be taking distributed tracing with OTel into production; that is a lab I will be working on later, so keep an eye on this repo. I do want to help folks move OTel into prod, but the focus of this workshop was just getting your feet wet with tracing, learning a little bit more about spans, and what you can do today to instrument some sample apps or applications you're running at work. We've still got this room for about 13 and a half minutes, so I'm very happy if you want to raise your hand and ask a question, or talk shop, or share if you ran into issues or have suggestions. I'll be walking around, but I very much appreciate everyone for coming to the workshop and spending the end of your Wednesday hanging out and learning more about OpenTelemetry. So thank you very much.