We're super excited to be here. Now, before we get started, allow us to introduce ourselves. My name is Adriana Villela. I'm a CNCF ambassador, a HashiCorp ambassador, a blogger, and a podcaster. By day I work at ServiceNow Cloud Observability, the artist formerly known as Lightstep, and I spend most of my time doing OpenTelemetry work. Reese and I actually work together in the OpenTelemetry End User Working Group, soon to be a SIG, super excited. And by night I like to climb walls, and, completely unrelated, I'm a huge fan of capybaras.

Hi everyone, my name is Reese Lee. I'm a senior developer relations engineer at New Relic. As Adriana mentioned, we work together on the OpenTelemetry End User Working Group, where we're focused on connecting end users to each other through enablement content as well as events, and on facilitating a feedback loop between project maintainers and end users to help improve the project as well as drive adoption. I guess my fun fact is that I like anything spooky and paranormal.

So, credit for the idea for this talk actually goes to our lovely MC, Austin Parker. They had written up a "today I learned" type post where they discussed a conversation they'd had with someone about how OpenTelemetry deals with error recording, which made us wonder: oh, dude, where's my error? Where's my error, dude? So: how does OpenTelemetry handle errors, and what options do you have for recording errors using OpenTelemetry? This is what we're going to answer for you today in our session. We're going to first set the stage with some background; then we're going to get into how errors are represented and handled in OpenTelemetry. We'll then give you a demo of how the same OpenTelemetry-instrumented error is represented in a few different backends, and we'll talk about why that matters for you as an end user. And finally, we will do a quick wrap-up. So, Adriana is going to set the stage for us.

Yes! Before we talk about all this lovely stuff, let's do a little bit of background. So here's the deal: it so happens that different languages approach errors and exceptions in different ways. For example, Go doesn't really have the concept of exceptions, but languages like Java and Python have mechanisms for catching and throwing exceptions, which is awesome. But what happens when you're in a situation where you have an app made up of multiple microservices, written in different languages, and you're trying to do the observability thing? Well, we need a standardized way of capturing our telemetry, and that includes a standardized way of capturing our errors. So what do we do in that case, when we can't even agree on how we're handling errors across languages? And of course, the answer is OpenTelemetry to the rescue, our good friend.

Which leads me to the OpenTelemetry refresher, which I'm sure everyone here is familiar with, so this will be a very quick refresher. OpenTelemetry: a CNCF project, open source, vendor-neutral. It allows us to generate, ingest, process, and export telemetry data to one or more observability backends for analysis and interpretation. Awesome.
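As a rough sketch of what that wiring looks like in the Python SDK, here's a minimal tracing setup. The endpoint is an assumption on our part: a local OTel Collector listening on the default OTLP gRPC port.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Generate: a tracer provider for this service.
provider = TracerProvider()

# Process + export: batch spans and ship them over OTLP to a Collector
# (assumed here to be local, on the default gRPC port 4317).
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("example.app")
with tracer.start_as_current_span("hello"):
    pass  # instrumented work goes here
```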
So, back to errors. We talked about errors and we talked about exceptions, but what is the difference between the two? Now, there are different definitions of errors and exceptions out there, and we could probably argue all day about this, but we've come across definitions for errors and exceptions that we particularly like, so these will be the basis for our talk. Note that these are definitions for errors and exceptions within the context of technology generally, not necessarily in the context of OpenTelemetry.

So, what is an error? An error is an unexpected issue in a program that hinders its execution. An error can be something like a compile-time error: you forgot parentheses on an if statement, maybe you forgot curly braces, maybe you forgot a semicolon. Or it can be something like a logic error: the program is executing, sure, but it's not doing what you're expecting it to do. Then we have exceptions. An exception is a type of runtime error that disrupts the normal flow of a program, and that can be something like dividing by zero, or something like referencing a memory address that doesn't exist; shout out to the old days of programming in C, where that sort of thing seemed to happen all the time. So there you have it.
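As a small illustration of that distinction in Python (a toy example of our own, not from the talk's demo code):

```python
# Logic error: the program runs fine, but the result is wrong.
# This returns the sum of the values, not their average.
def average(values):
    return sum(values)

# Runtime exception: calling this with x = 0 raises ZeroDivisionError,
# disrupting the normal flow of the program.
def inverse(x):
    return 1 / x

print(average([2, 4, 6]))  # prints 12, when we expected 4
print(inverse(0))          # raises ZeroDivisionError
```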
Thank you. Okay, so now that we have set the definitions, let's get into the main course: error handling in OpenTelemetry. As Adriana just mentioned, languages have their own ideas about what constitutes errors and exceptions, as well as how to handle them. So how exactly does OpenTelemetry deal with all these conceptual differences from language to language? This is where the OTel specification, or spec for short, comes in. The spec provides a blueprint for developers who are working on various parts of the project; it standardizes implementation across the different languages. And since the language APIs and SDKs are implementations of the spec, there are general rules against implementing something that isn't covered in the spec. This helps provide a guiding principle for organizing contributions to the project. In practice, though, there are a few exceptions. For example, a language may prototype a new feature as part of adding it to the spec, usually designated as alpha or experimental before the verbiage is added to the spec. As another example, the spec allows some degree of flexibility for a language to implement something as naturally to that language as possible. For instance, most languages have implemented RecordException, and since Go does not have the conventional concept of exceptions, it has instead implemented RecordError, which essentially does the same thing; we'll talk a little more about that in a bit.

So, now that we have a unified framework for how to handle errors, let's see what options OpenTelemetry provides for us. First of all, we can record errors using either spans or logs. In OpenTelemetry, a span represents an individual unit of work within a distributed system, for example an HTTP call or a database call, and spans make up the building blocks of a distributed trace. Spans are related to each other, and to a trace, due to something called context. Context is the glue that turns a pack of data into a unified trace, and we'd like to thank Hazel Weakly for that awesome definition. Context propagation allows us to pass information across multiple systems and tie them together. You can learn all sorts of things through traces of our systems, such as how our services are connected and talking to each other, as well as what occurred in our application at a given time.

OpenTelemetry enhances spans in several ways. One of the ways you can enhance your spans is through metadata, or attributes, in the form of key-value pairs. When you attach relevant information such as user IDs or request parameters, you can gain deeper insights into what occurred within a given trace. Spans also have a span kind field, which is additional information that can provide developers with further context for troubleshooting. Span kind is determined automatically by the instrumentation libraries that you use. OpenTelemetry defines several span kinds, each of which has unique implications for error reporting. We have client, which is for outgoing synchronous remote calls such as outgoing HTTP calls; server, which represents incoming synchronous remote calls; and internal spans, which represent operations that don't cross process boundaries. Finally, we have producer and consumer, which are typically used for message queue operations.

Spans can be further enhanced with something called a span status, as well as a description of that status. For example, here we have an exception message that was captured along with a status. By default, a span status is marked as Unset unless otherwise specified; you can also set it to Error or OK. And finally, we can enhance spans with something called span events. A span event is a structured log message embedded within the span on which it occurred. Span events can help enhance spans by providing additional descriptive information, and you can also capture additional information on the span event by using custom attributes. When a span status is set to Error, a span event is created automatically that captures the error message and stack trace.

Earlier we mentioned a method called RecordException; since Go doesn't support the conventional concept of exceptions, it is implemented there instead as RecordError. However, with both of these methods, you have to make an additional call to set the span status to Error, if that's what it should be, because it's not automatically going to be set to that. This means that you can use span events to record additional information without changing the span's status. By decoupling the span status from being automatically set to Error when an exception occurs, you can support the use case where you have an exception on a span whose status is OK or Unset, and this allows the most flexibility for instrumentation authors as well as end users.
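Here's a minimal sketch of that pattern in Python. The charge_card and process_payment names are hypothetical examples of our own, not from the demo:

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("example.payments")

def process_payment(amount):
    # Hypothetical stand-in for real work that can fail.
    if amount <= 0:
        raise ValueError("amount must be positive")

def charge_card(amount):
    with tracer.start_as_current_span("charge-card") as span:
        try:
            process_payment(amount)
        except ValueError as e:
            # record_exception adds a span event carrying the exception
            # type, message, and stack trace...
            span.record_exception(e)
            # ...but it does NOT flip the span status; that takes an
            # explicit, separate call.
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise
```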
We can also record errors using logs with OpenTelemetry. In OpenTelemetry, a log is a structured message emitted by a service or some other component. Logs include a message, a timestamp, and a severity level. Severity levels represent the type of message being emitted: debug, info, warning, error, or critical. OpenTelemetry allows for the correlation of logs to traces, in which a log message can be associated with a span within a trace via trace context correlation, which we talked about earlier. So, if you spot a log with a log level of error or critical, you can navigate to the corresponding trace to find out more information about what happened. And if your backend UI allows it, you can also navigate to the log from the trace UI.
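Here's a rough sketch of what that correlation looks like with the Python logs bridge. The logs bridge was still experimental at the time of this talk, so the underscore-prefixed module paths may have moved since; this also assumes a TracerProvider was configured as in the earlier sketch.

```python
import logging

from opentelemetry import trace
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter

# Route stdlib logging records through the OTel logs bridge.
logger_provider = LoggerProvider()
logger_provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(insecure=True))
)
logging.getLogger().addHandler(
    LoggingHandler(level=logging.INFO, logger_provider=logger_provider)
)

tracer = trace.get_tracer("example.app")
with tracer.start_as_current_span("do-work"):
    # Emitted inside an active span, so the bridge stamps the record
    # with the current trace_id and span_id; that's the correlation.
    logging.getLogger(__name__).error("something went wrong")
```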
So, is one better than the other for recording errors, spans or logs? The answer is everyone's favorite: it depends. Perhaps your team primarily uses logs; perhaps they primarily use traces. Another thing to consider is the backend that you're using: does it render both logs and traces? Does it support trace and log correlation? And how easily queryable or discoverable is your data?

So, if you've been using a proprietary agent to monitor your applications and have migrated over to OpenTelemetry, you might have noticed that an error captured by OpenTelemetry instrumentation is expressed a little bit differently than the same error captured by the proprietary agent's instrumentation. This is mainly due to the fact that OpenTelemetry simply models errors differently than how vendors might have been doing it. For example, vendors have their own notion of what represents a logical unit of work within an application or system. You're probably familiar with the term transaction, which can mean something slightly different from vendor to vendor; in OpenTelemetry, this is represented by a trace, which is made up of spans. So already, vendors have had to adjust how this data is populated in their UI, because it is a different data model. And finally, we have span kind, which has the ability to impact your error reporting. Some backends might have opinions on, for example, whether only server and consumer span kinds should count toward an error rate, and not internal errors. In a second here, Adriana is actually going to take us through a little demo where we're going to demonstrate some of these differences you can see from vendor to vendor. Oh, another thing to consider, too: some backends might have created a different signal type for span events. In Jaeger, they're represented as logs, because that's basically what they are; we'll show you an example of this in just a little bit.

Demo time! So, we are going to do a demo. It is not a live demo, because in my experience, any time there are live demos, bad things happen. So it's a pre-recorded demo; you still get to hear my lovely voice as I explain the pre-recorded demo live. Okay. We have a demo application written in Python. There is a client and there is a server; it's a simple application. The client makes a request to a roll dice endpoint on the server; it's a Flask application. The server rolls a virtual die, i.e. it outputs a number between one and six, and writes it to standard out.

Now, as I mentioned, there is a client and a server portion to this application. However, the client's not so interesting; we'll be looking at the server code, and here are the notable parts. Here we have the roll dice endpoint, and this part here basically calls another function called do_roll, which actually does the work. Then we have this section at the bottom, which does the initialization for our application: it initializes our Flask application, and it initializes our traces and our metrics. This application has been instrumented with OpenTelemetry using auto and manual instrumentation. We are adding some logs for the purposes of demoing all this good stuff, and for funsies, we're also capturing a metric.

Now, inside the do_roll function, we are creating a span called do_roll. You can call it Bob for all I care, it does not matter, but it should be a name that's meaningful to you. In our span, we are doing a few things. We are creating an attribute. We are also creating a span event, and that span event has attributes of its own. We are also creating a log message; it's an info message. We are creating logs using the Python logs bridge API, aided by the Python auto-instrumentation, which means it does magical things: because we are emitting our log within our span definition, the log gets automagically correlated to the span. And then, finally, we have our one metric. It is a counter instrument, and this counter gets incremented by one every single time this function is called.

Now, this is an errors talk, so we should be throwing an error at some point. I have forced an error: basically, every time a die roll is divisible by two, it will throw an exception, and this exception is caught in the roll_dice function. Within the roll_dice function, if the exception is caught, then we create a brand new span, and on it we're capturing the exception by using record_exception. Now, record_exception is basically a span event, but it embeds the stack trace as part of it. And also, just for funsies, I'm creating a log message here too, and it is an error; the log level is error.
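The repo has the real code (link at the end of the talk), but as a condensed, hedged sketch of what was just described (our names and values, not necessarily the repo's exact code), the server side might look like this:

```python
import logging
from random import randint

from opentelemetry import metrics, trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("dice.server")
meter = metrics.get_meter("dice.server")
roll_counter = meter.create_counter("dice.rolls", description="number of rolls")
logger = logging.getLogger(__name__)

def do_roll():
    with tracer.start_as_current_span("do_roll") as span:  # or call it Bob
        result = randint(1, 6)
        span.set_attribute("roll.value", result)                    # span attribute
        span.add_event("rolling the die", {"roll.value": result})  # span event with its own attributes
        logger.info("die rolled: %s", result)  # info log, correlated via the logs bridge
        roll_counter.add(1)                    # counter, incremented on every call
        if result % 2 == 0:
            # Forced error: any roll divisible by two throws.
            raise ValueError(f"even number rolled: {result}")
        return result

def roll_dice():
    try:
        return str(do_roll())
    except ValueError as e:
        # New span capturing the exception; record_exception is basically
        # a span event that also embeds the stack trace.
        with tracer.start_as_current_span("even-number-error") as span:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            logger.error("caught forced error: %s", e)  # error-level log, for funsies
        return "error"
```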
So, here is the video of the demo. Now, this repo is publicly available, and we'll include a link to it at the end of this talk. It's available on GitHub, and I've made it available through GitHub Codespaces, so you don't have to set up a local environment to test this thing out; hopefully that'll save you some headaches if you want to play around. So here we go. First, we are opening things up in GitHub Codespaces, and it takes a couple of minutes to start up; I sped things up so you don't get bored. The other thing is the OTel Collector config for sending stuff to the observability backends: I put it in this otelcol config extras file, which I've gitignored because it's got some keys for accessing the observability backends, which I do not want in GitHub, so that's been added separately. Next, we are building the Docker images for the Python client and server; we'll be running this thing using Docker Compose in a minute. While the images are being built, the little hamster is going... and it's almost done. Okay, now it's done.

We are going to run Docker Compose, so this thing is going to start our Python client and server, the OTel Collector, and Jaeger, and we can see that it's running here. Now, Jaeger is automatically exposed through port 16686 in GitHub Codespaces, and if we go over to the Ports tab and click on the little globe thingy, we can open up Jaeger in GitHub Codespaces and take a look at our traces. As you can see, we have some traces that contain errors, represented by the big red dots, and ones that do not contain errors, represented by the little green dots. If we go to a non-error one, we can see our trace with four spans: we've got our do_roll span, and it's got our span event with the message and the two attributes. Then, going to our error trace, you can see that we have the one span with the error; it's marked in red. It's got our usual span event, but because it threw an error, it's also capturing the stack trace as a new span event, automatically captured. And because it's throwing an exception, that extra span that we created shows up there as well, and if we click on that, we should be able to see the span event that we created that captures the stack trace. Obviously, this is overkill; we don't need both of them, but it was just for the purposes of demonstrating this.

Now, we're opening up a different observability backend, and as you can see, it's a similar interpretation that looks slightly different: the red triangles represent the traces with the error spans, and the green dots represent the traces with the non-error spans. So, similar thing: we can see that all is good, and we see our span event with our message and our attributes. But we can also see the log message that we added, because this backend supports logs, and the log is actually correlated to this particular span. If we click on the log, we can see the log message, and we can see it also within the context of other log messages that belong to the same trace; so we have that correlation. Now we look at one of the error spans. Clicking on that, we can see that do_roll incurred an error, so it's showing up as red. We have our additional span event with the stack trace, which we expected, and again, we should see our same log message.
Because it was an info message, we don't expect it to look scary and red, but we can see again that there's the correlation to the overall trace. We also see our even-number span that was created because we incurred that error, which has our exception message and also our log message, which we didn't see in Jaeger, right, because Jaeger doesn't support logs. If we click on this log, we can see the log message, which shows up in red because it was an error, and again, we can see it in the context of other log messages that were part of the same trace.

So now, Reese is going to show you what that looks like in yet another observability backend, just to give you an idea; it's kind of interesting to see how different products interpret things in different ways.

Yeah, and I think it's really valuable for end users to understand and get a look at how errors are represented differently. So this is an example of how traces might be visualized. Here we've got a group of traces, where we can easily see which of them had errors, and from there we can click into one of the traces with an error and look for the span that has the error. From there, we can see metadata about the span: in this case, the span status code and the status code description, as well as the fact that there was a span event captured. When we click into the span event, you can see it's referenced here as a span event, versus a log, as it was in Jaeger. Sorry, I clicked too fast and I don't know how to go back, but also we're running out of time, so I gotta go. This backend has created a new signal type for span events, called span events: so in Jaeger it was represented as a log, and here it's represented as a span event. You can also navigate to the associated logs, and vice versa, because this backend supports trace and log correlation.

Yep. And there you have it. I believe we have one minute to wrap up. So: error handling is a challenge. OpenTelemetry provides an open-standard blueprint for how to handle these errors, as well as providing different ways for us to record errors, through spans and logs, and it supports correlation. You can enhance spans with metadata, a span status, and a span kind, among the different ways that we covered. And how your data is visualized in your backend may be a little bit different from what you may have been used to with a proprietary agent's instrumentation, simply because OpenTelemetry models errors differently than how vendors have been doing it.

Not all the images were created by humans (thank you, DALL-E), but honestly, Adriana is an amazing prompt engineer and did all the penguin images. It was fun. And we have some handy QR codes for you to check out. Yeah, I have a podcast called Geeking Out, and I've had guests like Kelsey Hightower and Charity Majors and Reese, so you should totally check it out. Also, come see us at the OTel Observatory at KubeCon starting tomorrow; we've got some really cool stuff going on, including end user feedback sessions. So please, if you have any feedback on OpenTelemetry, sign up for one of the feedback sessions; we would love to hear from you. And come join us on Thursday for a party that we're hosting with our friends Docker, Pulumi, and Tailscale. We also have some handy links here to reference from our talk today. Thank you so much!