Hello, everyone, and welcome to the tutorial on using OpenTelemetry for distributed tracing on Kubernetes. My name is Pavol. I work at Red Hat. I'm a maintainer of the OpenTelemetry Operator project, a contributor and maintainer of Jaeger, and a maintainer of the Grafana Tempo operator. I've been contributing to the CNCF's observability projects for a long time, and with me is an amazing crowd of people. Hi, my name is Bene, and I work with Pavol at Red Hat, and I'm also contributing to the OpenTelemetry project and to Jaeger. Hi, I'm Anthony Mirabella. I work at AWS, and I contribute to OpenTelemetry and to our distribution of it. Hi, I'm Anusha. I'm a software engineer at Apple. I work on the observability team there, and I'm an open-source contributor; I work mostly on metrics and traces. Hey, everyone, I'm Matej. I work as a software engineer at Coralogix, also in observability; among other things I work in the Prometheus ecosystem, as well as in OpenTelemetry. Okay, so the whole tutorial is written up on GitHub, and we will continue there. Please scan the QR code or go to this URL, and if you have any questions during the tutorial, raise your hand and we will help you on your laptop. You don't have to follow along on your laptop if you don't want to — everything is in the repository, and you will be able to replicate everything at home after the conference as well. So I will go to GitHub; the repository is pinned on my account, so you can jump there and find it as the first repository. What we have prepared for you today is a fairly comprehensive tutorial on how to use OpenTelemetry for distributed tracing. We will start with setting up the environment. Then we will cover some theory about how tracing is implemented in OTel and what the concepts are, and then we will do the instrumentation. We will use auto-instrumentation to instrument services running on Kubernetes, and you will see how the OpenTelemetry Operator is used on Kubernetes. Then I will continue with manual instrumentation: we will use the API and SDK to manually instrument one of the services of the application. After that we have a bunch of topics that are really important in tracing as well. We will look at sampling and how we can reduce the cost of the data we collect. Then we have metrics that are derived from trace data. Then we will look at the OpenTelemetry Transformation Language, which can help you do the transformations that you need — usually useful for PII, removing sensitive data from the traces you collect. Then there is a wrap-up, and we have optional sections as well. One of them is tracing core Kubernetes components like CRI-O, the API server, and so on. The optional sections also sit inside the previous chapters, and you can follow them at home. I also forgot to mention that this is not the first tutorial we are doing. The first one was in Amsterdam and covered all the signals — we were talking about traces, metrics, and logs. Then in Chicago we did a tutorial on metrics. All the content is on GitHub, so you can go there and learn from the previous ones as well. So let's start with the setup. We will need a Kubernetes cluster; there are instructions for spinning up a kind cluster. I have everything ready locally, so I will not repeat those commands.
Then we will need cert-manager, because the OpenTelemetry Operator uses cert-manager to secure its webhooks. And then we will deploy the observability backend that we will use throughout the tutorial. The backend deploys Jaeger all-in-one and a Prometheus server. After that, we will just port-forward the Jaeger UI to our host. So I will port-forward it, and the next step is the introduction to tracing. Were you able to deploy those services locally? Do we have any issues? I see one hand there, some hands there. Yeah — we just merged one pull request two minutes ago. The problem is that Docker Hub is rate limiting us. I was doing a test run an hour ago and I wasn't able to pull the image, and apparently I didn't realize I hadn't built it for all the architectures, so I apologize. In the repository you will see the change, and you can fall back to the previous Docker image for Jaeger all-in-one, which comes from Docker Hub. So it's merged, but now you might see rate limiting from Docker Hub when pulling the image. Okay, so we'll continue — if it doesn't work, please raise your hand. Okay, Matej, it's your next section anyway. So you should have a bit of extra time now to set up your own environments; as I said, if you have any issues, please raise your hands. So let's look a little into what distributed tracing is. I guess most of you already have some understanding of what this is, but we'll look a little at the model we use in OpenTelemetry, which should give us a common ground and a common understanding of what distributed tracing is. We'll start from an even earlier point. We're talking about observability, right? We're running these complicated systems, microservice architectures, and oftentimes it's not easy to know what's going on inside our systems when there are issues. So we want to understand the internal state of our system, and because we cannot magically see inside those systems, we need to get the information out of them, and for this we use different telemetry signals. You're probably familiar with most of these: we have metrics, which give us quantifiable measurements about our systems; we have traces, which we'll talk about shortly; we have logs, which give us information about specific events happening in our systems; we have profiles, which inform us about the performance of our code paths; and so on. It's an open-ended list of signals, where each gives us a different type of information about our system. As Pavol already mentioned, for metrics we did a similar tutorial back at KubeCon North America last year, so if you're interested you can check that out — there's a link here as well. Now, talking about traces, we can think of a trace as telling the story of a request going through our system, and understanding how it travels as it goes through different services. So the trace gives us the whole story, but we need to understand the parts of this story, and for this we use a smaller basic unit that we refer to as a span, where a single span relates to a specific operation.
If we look at a span — or if we imagine it as the JSON representation that we have here on the left-hand side, an example of a span that I took from the OpenTelemetry documentation — we see that it's defined by certain parameters, certain information that it contains, such as the name of the span, the context of the span that we'll talk about in a bit, some time-related information, certain attributes, events, and so on. Now, I was talking about the story of a request, but how do we understand that story, or how do we connect the pieces? How do we connect the operations and the spans that represent those operations? For this we use the span context, which tells us how spans relate to other spans within our tracing system. Each of these contexts includes the trace ID — the ID of the trace that the span belongs to — and a span ID, which is a unique identifier for the span itself. These are usually generated automatically for you, so you usually don't have to deal with them. We also have more specialized parts of the context, the trace flags and trace state, which we'll talk about a little later; these relate to so-called context propagation. We also have a parent ID, which informs us of a relation to a span in a parent-child relationship — I'll talk about this in a moment. We'll also shortly look at links and what type of relationships they represent in tracing. If you visualize a trace — no matter which system you use, and as you'll also see in the Jaeger backend we are using — you would typically visualize it as a so-called Gantt chart, which looks like this. At the very top we have our so-called root span, which spans the whole duration of the operation. So in this case, we can have some client calling our API or doing some operation, and within that client span we see that we have some API call, we're doing some authentication and authorization, we're calling some other service — a payment gateway — we might be calling a database, and we can visualize all of this with the help of a Gantt chart. Now, as I was saying, based on the context we can build certain span relations. Commonly we talk about parent, child, and sibling spans. Here we see that at the top we have a span called hello, and this is again identified by a trace ID and a span ID, and we see that the parent ID is null, which means this is the topmost span — the root of the trace. Then on the left side at the bottom, we have the child span called hello-greetings. Again, it's identified by the identical trace ID as the one above, its own unique span ID, and in this case we also have the parent ID. When you look closely, you will see that the parent ID in hello-greetings matches the span ID of the hello span — this is how we build the parent-child relationship. Then, if we see that two spans share the same parent, we can think of them as sibling spans.
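To make the relationship concrete, here is a minimal sketch of the two spans just described, shown as YAML for readability. The field names follow the JSON span example from the OpenTelemetry documentation, and the IDs are illustrative placeholders rather than the exact values from the slides.

```yaml
# Two spans from the same trace; IDs are illustrative placeholders.
- name: hello
  context:
    trace_id: 0x5b8aa5a2d2c872e8321cf37308d69df2
    span_id: 0x051581bf3cb55c13
  parent_id: null                                    # no parent -> this is the root span
- name: hello-greetings
  context:
    trace_id: 0x5b8aa5a2d2c872e8321cf37308d69df2     # same trace ID as above
    span_id: 0x5fb397be34d26b51
  parent_id: 0x051581bf3cb55c13                      # matches the span_id of "hello" -> child of "hello"
```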
For some situations, it might not be possible, or not convenient, to represent the relationship in this direct manner. This is typical for asynchronous operations, where one operation causes another operation that occurs sometime in the future, but we might not know exactly when. And since we don't want to wait for that other operation, or we don't want to track the whole thing as a single operation, we can represent the relationship in a different way: we can use so-called links, where a link represents a causal relationship between spans. It works similarly to what we've seen before. A link, as you see here at the bottom, includes the span context information of another span. So here we see that the hello-greetings span, in its links, links to the span "How are you?", since the trace ID and the span ID match. As we're seeing, we also have other parameters — maybe "attributes" is not the right word here — other information represented in the span. Most of them should be fairly self-explanatory, but let's go through them. Attributes are key-value pairs that carry additional metadata about the span; as you can see in the examples, there can be IP addresses, the HTTP method, the route being used, the user agent, and so on. We also have events, which are structured log messages: they represent things that happen during the operation the span represents, at a specific point in time during that operation. Links I mentioned previously, so we had an example of those. There is also a status code and a message, which tell us whether the operation ended successfully. There are certain technicalities in the specification: the status code can be Unset, Ok, or Error. Usually both Unset and Ok mean that the operation ended without an error; Ok is a specific case where the developer explicitly marks the operation as successful; and Error, obviously, means there was an error, and it can also include an error message. The last thing to mention is that we also have the so-called span kind, which tells us a bit more about the originator of the span, depending on the context in which it is created: it can be client or server, it can be internal, or it can be producer or consumer. Now, we talked about how we're building this story of a request, and to make sure we can build that story, we need to be able to reassemble it at the end, once we have received all the spans within the trace. But how do we do this when we are crossing process or network boundaries? If one operation is happening in service A, but service A is calling service B, that other service also needs the context of our operation. The concept that allows us to pass this information between services is called context propagation. How it's done depends on the technical solution, but a good example is doing it through HTTP headers, and there are standards for how context propagation looks with HTTP headers. There are two related context propagation standards supported by OpenTelemetry. First is trace context, which defines the traceparent and tracestate headers, where traceparent gives us the information about the parent of the trace, and tracestate can carry more vendor-specific information related to the trace context. On the other hand we have baggage, which is also related to context propagation. Baggage is about propagating custom key-value pairs — information associated with your spans that might not be included as span attributes, but that you nevertheless might want to use. A typical example is a user ID that is only available in the user service: when you call another service downstream, that information is no longer available, so if you need it somewhere downstream, you can include it in the baggage and it will be available in the downstream service.
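As an illustration of what this looks like on the wire, here is roughly how these headers appear on an outgoing HTTP request. The traceparent value follows the W3C format (version, trace ID, parent span ID, flags); the tracestate and baggage values are made-up examples.

```yaml
# Illustrative W3C trace context and baggage headers on an HTTP request.
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01   # version-traceid-spanid-flags (01 = sampled)
tracestate: vendor1=opaqueValue1,vendor2=opaqueValue2                  # vendor-specific entries (example values)
baggage: user.id=12345,feature.flag=beta                               # custom key-value pairs (example values)
```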
So with the theory out of the way, we can finally start deploying our applications and doing some tracing. Right, thank you, Matej. That was a lot of theory, and it may seem complicated to use, but it's actually not, and we will see that in the next step. So I will go to the next README, and in this chapter we will deploy our sample application into the cluster and then instrument it with the OpenTelemetry auto-instrumentations, or agents. But first, what is the application? It's a polyglot application written in Node.js, Python, and Java: the frontend is Node.js, backend1 is Python, and backend2 is Java. The functionality of backend1 and backend2 is identical: each has a single API that receives a player name and returns a number from one to six. The frontend calls both backends, gets the numbers, and decides which one is higher — it's like a dice-rolling game. So let's deploy it, and I will check that everything was created correctly. Yeah, it seems like all the services are up and running, and we're going to port-forward the frontend UI to our host. Is it big enough? Okay — so we see player one rolls one, player two rolls four, four is higher than one, so player two wins. The app is running, but at the moment it is not emitting any telemetry data, so we don't have any visibility into what is happening. So we're going to instrument it, and we have essentially two choices when instrumenting the application: we can use auto-instrumentation or manual instrumentation, and there is a big difference between them. Manual instrumentation requires a lot of work. We need to use the OpenTelemetry SDK and API, find those dependencies, and pull them into our application. Then we need to identify which RPC frameworks from our dependency stack we want to instrument; for those, we can find pre-built OpenTelemetry instrumentation libraries that we can use. Then we need to initialize it all, wire it together, and of course recompile and redeploy our application. The good thing about manual instrumentation is the flexibility: we have everything under control and can instrument only the parts that are crucial for us. At the same time, we can make mistakes — we can forget to instrument mission-critical APIs, and if we do and something goes wrong, we get no telemetry data for those parts of the application. Then we have automatic instrumentation, which doesn't require us to recompile the application. We can simply download the agent, put it next to our application, restart it, and we get telemetry data. So it's very easy to get started with auto-instrumentation. On the other hand, we don't control what the auto-instrumentation is doing or how it is doing it.
So it might be a bit less performant — in most cases probably not, because it reuses the same libraries as the manual instrumentation — but you have less control over how the whole thing is built. In this chapter we will use automatic instrumentation. Before we instrument the app, we need to send the telemetry data somewhere. We deployed Jaeger and Prometheus, which are our backends; instead of those two we could just as well use Datadog, Dynatrace, Splunk, or other providers. The instrumented app can send data directly to the vendor, but that's not really good practice. It's much better to send data first to an OpenTelemetry Collector, because it cleanly separates the data collection that happens in our environment from the vendor we want to use. On top of that, the collector has a lot of capabilities that will let us remove sensitive data, extract new data from traces — metrics, for example — and enrich the collected data with information from the environment, like the Kubernetes pod name, deployment name, and so on. So let's quickly deploy the collector. I will show you the CR from the OpenTelemetry Operator. It's very simple: there is the image, then the deployment mode — we want to deploy it as a Deployment with a single replica — and then the config. The config is a string, and you can paste in the actual collector config that you would run on your host or in Docker. What we have here is a very simple configuration. We have just an OTLP receiver, so we're going to receive OTLP data. For the exporters, we are exporting traces to our Jaeger, metrics to Prometheus, and logs — the OTLP logs — will be written to the collector output; the collector has a debug exporter that can print all the data to its console. And then there is a corresponding pipeline, which essentially enables those components. So let's deploy it. It should start very quickly; it's a very small Docker image. The collector is up and running — that looks good. And now we're going to start with the instrumentation. Auto-instrumentation on Kubernetes is a two-step process. First, we need to create the Instrumentation CR, the custom resource. The custom resource essentially defines the configuration for the OpenTelemetry SDK: where the data should be sent, which propagators to use for context propagation, and how we want to sample the data. So let's create the Instrumentation CR — small copy-paste problem — okay, the Instrumentation has been created. And it does nothing: if we just create it in our cluster, it's only a CR, it doesn't instrument the apps. As I mentioned, the instrumentation is a two-step process; the next step is to put an annotation on our workloads. The annotation needs to go into the pod spec — it doesn't go into the deployment annotations, it goes into the pod annotations in the deployment's pod template. And here we need to understand which language each application is written in: backend2 is written in Java, backend1 in Python, and the frontend in Node.js. So for the Java one we use inject-java, for the Python one inject-python, and for the Node.js one inject-sdk. The inject-sdk one is special: it doesn't inject the instrumentation libraries, it only configures the SDK that already exists in the application. You can think of it as a control plane for the instrumentation. If you go to the source code of the application — the Node.js one — you will see that it already contains some instrumentation, but it's disabled. If you inject the SDK configuration, it will enable it and configure it to send the data to our collector. So I'll just quickly put those annotations in place.
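To make the two custom resources and the annotation concrete, here is a rough sketch of what they could look like. The names, namespaces, endpoints, and exact exporter choices are assumptions for illustration; the exact manifests live in the tutorial repository.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
  mode: deployment          # run the collector as a Deployment with a single replica
  replicas: 1
  config: |
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    exporters:
      otlp/jaeger:                       # traces -> Jaeger (endpoint is an assumption)
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
      prometheusremotewrite:             # metrics -> Prometheus (exporter choice is an assumption)
        endpoint: http://prometheus:9090/api/v1/write
      debug: {}                          # logs -> collector stdout
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          exporters: [prometheusremotewrite]
        logs:
          receivers: [otlp]
          exporters: [debug]
---
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317   # where the SDKs send data (service name is an assumption)
  propagators: [tracecontext, baggage]
  sampler:
    type: parentbased_traceidratio
    argument: "1"                          # sample 100% for now
---
# Step two: the annotation on the pod template of each workload, e.g. for the Python backend.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend1
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-python: "true"
```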
So now what happens is that we have changed the pod spec of the deployment, which triggers a new pod. The OpenTelemetry Operator has a mutating pod webhook: it sees that a new pod is starting, sees that there is an annotation on the pod, and according to that annotation it injects the correct instrumentation libraries and SDK configuration. Let me go back to the console and see what has changed on the pods. What the operator injected is an init container, and what the init container does is copy the instrumentation library into a shared volume that is also mounted into the application container, so the application has access to the agent. Then, on the application container, it configures the OpenTelemetry environment variables for the SDK — the sampling, the exporter — and on top of that it enables the instrumentation by setting the environment variable for the given runtime. In this case it's Python, so it configures the PYTHONPATH to load the Python instrumentation libraries before the application. If this were Java, it would configure JAVA_TOOL_OPTIONS; the JVM would load the Java agent first, and as the application classes are loaded, the Java agent would see which classes are being loaded and inject the instrumentation code. So now we should have everything ready: the app should be fully instrumented, and we can go to the Jaeger console. I will just port-forward the frontend again, because we restarted it, so it would fail if we loaded it now. We see the app is still working, and I can open the Jaeger UI, and we should see traces from the frontend. If I go to the trace view — the visualization that Matej showed — we see the entire trace going from the frontend to backend1 and backend2. So we have successfully instrumented the application with essentially two commands: creating the Instrumentation CR and applying annotations to the workloads. The auto-instrumentation from OpenTelemetry does more than tracing: it also reports metrics and logs, depending on the instrumentation, of course. For some languages it doesn't do logs, for some it does — you need to check what the features are for the specific auto-instrumentation of a given language. So what we're going to do here is take a look at the logs. If you get the collector logs, you will see some logs printed to the standard output of the collector, and these are the logs from the backend2 application, the Java one. For instance, here we see the log that the application started in four-point-something seconds. If you get the logs of the application itself, you essentially see the same logs — you see it here: started application in 4.59 seconds. So the instrumentation is capturing the logs that the application writes and enriching them with Kubernetes attributes to identify where they were sent from.
It's sending them to the collector, and the collector is printing them to its standard output, but you can also configure the collector to send the logs to your OpenSearch or another logging system. So those are the logs; the metrics are sent to Prometheus. If you port-forward the Prometheus service, you will see the metrics sent from backend2, the Java application, as well. What we see here are the RED metrics for the server — the latency, the number of calls, and the number of errors — plus a couple of JVM metrics. We got all of this simply by instrumenting with the agent. So everything is working, but oftentimes the auto-instrumentation will need to be customized, because you will have special requirements in your organization. Let's say you need to capture more data — more HTTP headers, not just the default attributes — or you need to trace methods of your business logic. The good news is that you can do that with the Java agent with just a configuration change (a rough sketch of it is shown after this section). So what we're going to do is configure the Java agent to capture the HTTP response headers content-type and date, and we're going to instrument the main method of the application: whenever the main method is executed, the Java agent will create a span for that execution. We do that by specifying the fully qualified class name together with a method name, and we can do that for any class in our code base. So I'm just going to apply it — and now I have changed the Instrumentation CR. The instrumentation is only applied when the application starts, so for these changes to take effect, I need to restart the application; I can do that with a rollout restart. Now it will restart, and if I go back to Jaeger and search for traces from backend2, which is the Java app, I should see a span from the main class. So I am effectively tracing the start of the application, because I configured the Java agent to create a span for the main method of the app. There is only a single one, because I restarted it only once; if I restart again, I will get a second trace for the second start of the app. For the response headers, I need to search for traces that go through the HTTP API of backend2, and there I should see the response headers. Actually, I need to execute some requests first, because the API hasn't been hit yet. So this is a recent trace, and now I see the response header content-type and the response header date. To sum up: using auto-instrumentation on Kubernetes is fairly simple, and if we need to capture more data, we can often do that with a configuration change — we don't have to rebuild our applications. There is an optional section where you can use the OpenTelemetry API in combination with the Java agent to do more complicated things: you can attach more attributes programmatically, or use annotations to create new spans and attach attributes. You can play with it after the conference.
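For reference, the two Java agent customizations applied above boil down to a couple of agent properties, which the Instrumentation CR can pass as environment variables. This is a hedged sketch: the class name is hypothetical, and the exact property name for header capture has changed between agent versions, so check the agent documentation for the version you run.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  java:
    env:
      # Create a span whenever this method runs; the class name is a made-up example.
      - name: OTEL_INSTRUMENTATION_METHODS_INCLUDE
        value: "com.example.demo.DemoApplication[main]"
      # Capture the listed HTTP response headers as span attributes.
      # Env-var name is an assumption; recent agents spell the property
      # otel.instrumentation.http.capture-headers.server.response - verify against your agent version.
      - name: OTEL_INSTRUMENTATION_HTTP_CAPTURE_HEADERS_SERVER_RESPONSE
        value: "content-type,date"
```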
Okay, and now we will transition to the next chapter, about manual instrumentation. Let me just fix my keyboard layout. Next we would like to have a look at manual instrumentation, which is quite handy if there is no auto-instrumentation for your language — there will probably still be an SDK — or if the auto-instrumentation doesn't support a specific version; for example, Go 1.22 was not supported by the auto-instrumentation for a while, and we can solve that by doing it ourselves. We are a bit short on time, so we have two versions, and I will walk through them instead of doing it manually here. In the repository you will find an un-instrumented backend and an instrumented backend. We can quickly look here: this is a Go application, which is more or less a replacement for the rolldice application, backend2 — it does exactly the same thing. What we see here is that it has an error rate and a high-delay setting defined, which becomes useful later. And we have this rolldice endpoint: we roll the dice, maybe cause a delay, maybe an error — that's it. The first thing we should do is define the critical path of our application. If we have only one endpoint that is super interesting for us, it's probably the rolldice one. So let's jump over to this one, because it's easier to navigate — this is the instrumented version. The first thing we need to do, similar to the collector, is define the exporter, i.e. where we would like to send the data. This can stay empty, and we can configure it through the environment variables, as was shown before. We can also do some processing; in this case we just add the batch span processor. The last step is to make this accessible to all the different libraries in our program, and for that we register it globally: there is an option to create this new tracer provider and then set it globally, so we can access it from anywhere in the program — which is the next thing we do. Usually, per library, you then create a specific tracer — let me search for it here — and you can give it a name, to identify afterwards where your code was instrumented, and also which version it was, and so on. Then comes the interesting part: we would like to instrument the handler method we had before. The SDK offers a wrapper for the handler that creates a new span, so we don't need to write all the instrumentation code ourselves. Here we add the route attribute, so that afterwards we can see which route was hit, and then we register it — this is basically a wrapper function for registering the handler with our router. The span — whether it was continued from another service or created directly by this handler — can be accessed straight from the context, and in Go the context is propagated explicitly, so it's always attached. And what we do here, for example, is add an event, and in this event we just set the player name. We will see this reflected in Jaeger afterwards, which is quite nice because we can also search for something like this. Then we go further down this path: maybe this calls a delay method, which could be calling a database or something, which would be quite interesting for us. So we go there and do the same: we start a new span, and since there is already a span in the context, this automatically becomes a child span. Then we also add an event here, so that afterwards we can see what we actually rolled and that this method was hit, because that's when the delay happens. The same goes for the error. There is one more thing: we record the error here directly with record error, which under the hood is basically a specific error event that helps us identify more details afterwards — it also sets attributes for the error that occurred, or the stack trace, depending on the language.
And then we also need to explicitly set our span to an error state, because by default the status is unset. So maybe we go directly over to Jaeger. We prepared an image — oh, it's the other way around — we prepared an image with all the instrumented code already in it, so we can quickly take that one. It should be there in a second. Okay, started — yes, it's there. And there was this special thing in the code where we had the error percentage: we currently have an error percentage of 20%, which we can set using an environment variable, so that something interesting actually happens. Let's create some extra requests and play with it a bit. We should see some spans with errors. For example, when we go down here and inspect our span — let me check — we have the error state, which is there. And we should see... oh, this is the frontend deployment, that's something else. Let's try a few more times. Maybe this one is a new one — that's again the frontend. We set the error percentage to 20%, so the chance should not be too low that we cause some errors. Maybe we go directly to this one — okay, let's search directly for the Go service. What we can see here now is the cause-error method, and we see the logs and the events that we added. So we had a roll event, and the number was actually seven, and in the case of the error we see it was an error event, and we also see the message: the seven is lower than the 20. And that's almost it. The delays should also be reflected here somewhere — this looks good: we have four seconds, and we can go in and see the event. Oh, that's the error again — but it's the error and the delay in combination: we see it was a 13, the threshold was 20, so this call is delayed. Okay. For testing and playing around with it, we also have something prepared in this section: you can just run Jaeger locally, and when you compile the application you can configure the environment variables and get faster feedback on what you instrumented, if you would like to play around with it. In the next section, we would quickly like to discuss what happens if you instrument a lot, or use the auto-instrumentation: you produce a lot of data, which over time can get quite expensive. What we see here is how the data is delivered from our services: we had the frontend service, backend1, and backend2, and those send the data to the collector, and from there we might want to send it to some SaaS vendor or somewhere else. If we have a decent setup with a few thousand pods, this may produce tons of data, which can be quite expensive. When you go and have a look at GCP or X-Ray pricing, you see that a million traces can quickly add up to a quarter of a million dollars a month, which is probably an unreasonable price for finding tiny issues. At the same time, the majority of traces is probably something we are not really interested in: as we saw before in the frontend application, apart from this cause-error endpoint, everything was fine, and you don't want to spend that much money just to know that it works. In the next section, Anu will show you two approaches to mitigate this. Thank you, Bene, for the introduction to sampling. Now I'll talk about two different sampling approaches: the first is head-based sampling and the second is tail-based sampling.
I'll give a quick introduction to both approaches, and we'll configure both in this section. First, let's start with head-based sampling. In head-based sampling, the sampling decision — whether to keep or discard the trace — is taken at the beginning of trace creation, which is typically the beginning of the request flow. Traces are sampled randomly in this technique, based on a predefined sampling rate or probability. Because of its randomized nature, it cannot guarantee that it will always capture the important traces. As you can see in this example, each dotted line is a trace and every dot is a span; we have four traces here, and with randomized sampling it did not capture the error trace. On the flip side, in tail-based sampling the decision is taken at the end of the request, when all the spans of the trace are available — when the trace is complete — so it can make informed decisions by looking at all the trace data, and we can define policies to capture the most interesting and most actionable traces. So tail sampling is a biased technique, whereas head sampling is an unbiased one. We'll start by configuring head-based sampling with OpenTelemetry. Head-based sampling is configured at the SDK level, and OpenTelemetry ships with a bunch of built-in samplers for implementing it. For the list of all available samplers, please check the official documentation; for this tutorial we'll be using one specific sampler, parent-based trace ID ratio. This is the most widely used sampler, and it is also recommended by upstream OpenTelemetry. With parent-based trace ID ratio, we sample a random percentage of parent (root) spans based on the sampling rate we set, and that decision is propagated to all of the child spans — so we either keep the complete trace or discard the complete trace. With the auto-instrumentation in our previous section, Pavol showed how to set up the Instrumentation CR. The head sampling configuration there is under the sampler section: the type is the sampler type we want to configure, and the argument — the number we specify — is the sampling rate for parent-based trace ID ratio. Different samplers take different inputs, so the argument is essentially the input we provide to whichever sampler we configure. This is very straightforward: all you have to do is specify the sampler type and the argument, and apply the updated CR. Just a second, let me navigate back. All right. In our previous section the sampling rate was configured to 100%, so it was sampling all of the parent spans, which propagates to all the child spans — it was doing 100% sampling. You can play around with different sampling rates; all you have to do is apply the updated Instrumentation CR and restart your deployments, and you will see the updated sampling percentage under the environment variables in your pod spec.
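For example, the sampler section of the Instrumentation CR could look roughly like this — a minimal sketch, assuming we want to keep about half of the traces:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  sampler:
    type: parentbased_traceidratio   # parent-based trace ID ratio sampler
    argument: "0.5"                  # the sampler's input: keep ~50% of root spans (and their children)
```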
So this is pretty straightforward — I'll let you play with it offline — but let's also look at how we set up head sampling with manual instrumentation. You could always use the environment variables even with manual instrumentation, but if you want to set it explicitly in code, this is how we do it: we configure the sampler with WithSampler, and we can use any of the available samplers — there are different types, like always-on, always-off, and parent-based trace ID ratio. All you have to do is set it up in the sampler configuration here. Now let's look into tail sampling with OpenTelemetry. Tail sampling is configured at the collector layer — the backend that receives all the spans — so it requires us to stand up a collector and configure a processor called the tail sampling processor. The tail sampling processor defines a set of policies used for the sampling decision, and there are a bunch of available policies that you can choose from and define in the processor. Before looking at the processor configuration, we'll go ahead and update the environment variables in our backend2 deployment to generate more errors and more high-latency traces, so we can capture them once we deploy the collector with the tail sampling processor. Let's go there — sorry, I'm a Mac user. So we updated the environment variables, and then we'll go ahead and deploy the collector, which will enable tail sampling for us. Then we check whether the pods are up and running — yes, I just restarted the OpenTelemetry collector, so we're good. Now let's walk through the tail sampling configuration, which is placed under the processors section of our configuration file. Here is the tail sampling processor. The first three settings are optional. The first is the decision wait time, the time to wait before making the sampling decision; in our case it is configured to 10 seconds, which means the collector waits 10 seconds before making the sampling decision on every trace it receives — from the first span, it buffers all the child spans for 10 seconds. We are assuming that within 10 seconds we will have the complete trace to make the sampling decision. This depends on your use case; you have to understand your application to know how to configure the decision wait time. For this example we configured it to 10 seconds, and the default is 15 seconds. The other two are the number of traces and the expected new traces per second, which help the collector size its internal data structures. The next part is the policies. This has no default, so we have to configure at least one policy for the tail sampling processor to work. We have three policies defined for this example. The first is the status code policy, which makes sure we sample 100% of the errors: the status code is set to error, so it ensures we sample 100% of the traces that contain erroring spans. The second is the latency policy, with the threshold set to 500 milliseconds, so it ensures we sample 100% of the traces whose duration is longer than 500 milliseconds. And the third is a probabilistic policy: along with the errors and slow traces, we also randomly sample 10% of the remaining traces, which helps us surface other issues we might not know about, and also better understand the overall health of the system.
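Put together, the processor section described above looks roughly like this; the policy names and the num_traces / expected_new_traces_per_sec values are illustrative, not the exact ones from the tutorial repository:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s                 # buffer spans for 10s after the first span of a trace arrives
    num_traces: 50000                  # how many traces to keep in memory (illustrative value)
    expected_new_traces_per_sec: 10    # sizing hint for internal data structures (illustrative value)
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]        # keep 100% of traces that contain an erroring span
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500            # keep 100% of traces longer than 500 ms
      - name: keep-10-percent
        type: probabilistic
        probabilistic:
          sampling_percentage: 10      # plus a random 10% of everything else
```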
So these are the three policies we used for this example, but there are a lot more, and you can write more complex ones — just refer to the upstream documentation; the link to the list of all available policies is here. Now let's go ahead and execute some requests against the application. After this, we should only keep traces with errors and traces with a duration longer than 500 milliseconds. If you see it's taking a while to load, that means it actually is a trace that took longer than 500 milliseconds. I clicked a few times, and then we'll go to the Jaeger UI and look at the traces. You should only see error traces and traces that took longer than 500 milliseconds — in this case, it took three seconds. And because we have the randomized 10% policy, you will also see some regular traces. So that's all about tail sampling. We also have a couple of advanced topics here; I'll just give an introduction to both so you can try out the configurations later at home. The first advanced topic is how to scale tail sampling with OpenTelemetry. For tail sampling to work as expected, all spans of a trace should be processed by the same collector instance. In our previous example it's a simple setup, so one collector instance suffices; but as the system grows, you have to keep more traces in memory, so you have to scale horizontally by adding more collector instances. At that point, you can end up with fragmented traces: you cannot guarantee that all spans of a trace end up in the same collector instance for tail sampling. So we need a two-layer architecture to scale tail sampling with OpenTelemetry, where the first layer is a set of trace-aware load-balancing collectors, and the second layer is the collectors that actually do the tail sampling. The first layer — the load-balancer collectors — is configured with an exporter called the load-balancing exporter, which ensures that all spans of a given trace go to the same downstream collector, where the tail sampling happens (a rough sketch follows at the end of this section). This is how we scale tail sampling with a two-layer collector setup. I also have the configuration here to deploy the two-layer setup, which you can try out later at home: it deploys a load-balancer collector with the load-balancing exporter configured, plus two gateway collectors that actually do the tail sampling. The next advanced section is the Jaeger remote sampling extension. If you have used Jaeger remote sampling before, where you can centrally control the sampling configuration using the Jaeger collectors, OpenTelemetry supports the same thing with this extension. You can set up the extension to load sampling strategies from a Jaeger collector further down the pipeline, load them from a static file on your local system, or configure it to load files from a centrally managed file server. It essentially provides the same capabilities, and you can achieve the same things you would have achieved with Jaeger remote sampling. That's all I have today for the sampling section; Anthony will take over and talk about span metrics in the next section.
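Before moving on, for reference: the key part of the first-layer (load-balancing) collector from the two-layer scaling setup described above could look roughly like this. The gateway service name and the resolver choice are assumptions; the repository has the exact manifests.

```yaml
exporters:
  loadbalancing:
    routing_key: traceID            # route by trace ID so all spans of a trace hit the same backend collector
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:                          # discover the second-layer collectors via their headless service (name is an assumption)
        service: otel-gateway-collector-headless.observability
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```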
Thank you, Anu. So, sampling is wonderful for reducing the amount of data you have to process, but once you do that, you start to lose some visibility into the actual shape of your traffic. You might not see all of the spans, so you can't get an accurate count of how many operations you're performing, and you won't necessarily see all of the durations — if you're only sampling the long-duration spans, you won't know what your actual average is. We can try to get some of that back by using a component in the collector called the span metrics connector, and it does pretty much what it says on the tin: it takes spans and produces metrics. Connectors are a type of collector component that act as an exporter on one pipeline and a receiver on another, so that you can connect the two pipelines together; in this case, it acts as an exporter on a traces pipeline and a receiver on a metrics pipeline. We use the default configuration here, which is pretty reasonable if you've got an HTTP service with latency in the millisecond-to-second range. If you've got a service with a different distribution, you might want to configure the buckets for the histograms it produces, but typically you can just drop it in and start using it. So here we add another traces pipeline to our collector configuration, and this one bypasses the tail sampling processor, so the connector will receive all of the spans and be able to produce metrics from them. And then we also add the span metrics connector as an additional receiver on our metrics pipeline, so those metrics will be exported to Prometheus along with the rest of our application metrics. So let's go ahead and make this change to our collector configuration — let's see, how do I move this to the other screen; three fingers, that should be right. Okay, now that we've got that configured — three fingers this way — if we go over here and make some more requests, we should be able to go look at Prometheus. We have Prometheus port-forwarded, so we should be able to get to it — let's try this again, here we go. If you look at calls_total, we see that we are now getting metrics, and they're coming from each of our applications: we get the name of the application that produced them, what kind of span it was, and the name of the span as metric labels. So that's good and useful, but you might also want some more refined views of this. Jaeger has a new UI element that can take the metrics produced by the span metrics connector and produce APM-style latency, error, and request-rate graphs for you, and it only requires some basic configuration: you tell it what type of metrics storage you're using — here we're using Prometheus — and where it can locate those metrics, and we also have to tell it some information about how the metrics are produced. Here we're using the span metrics connector, so we're telling it that that is what's producing those metrics into Prometheus for us.
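The connector wiring described above boils down to roughly the following collector configuration; the receiver, processor, and exporter names are the ones assumed earlier in this tutorial, so treat them as placeholders for whatever your pipelines actually use:

```yaml
connectors:
  spanmetrics: {}                    # default config; tune histogram buckets if your latencies differ
service:
  pipelines:
    traces:                          # sampled traces keep going to the tracing backend
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/jaeger]
    traces/spanmetrics:              # a second traces pipeline that bypasses tail sampling
      receivers: [otlp]
      exporters: [spanmetrics]       # the connector acts as an exporter here...
    metrics:
      receivers: [otlp, spanmetrics] # ...and as a receiver here
      exporters: [prometheusremotewrite]
```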
So if we go ahead and apply this — did I not include the backend here? I'm sorry. Once we do that, we reconfigure our Jaeger deployment, and we should be able to go to the monitor tab after we make some more requests to our service and get some data in there. And if we look at the monitor tab — oh right, I killed that port-forward; it helps if I actually let it reach Jaeger — sorry, here we go. So now it gives us information about the operations that we have. Here we're looking at the backend deployment; we can see the rolldice operation. If we look at our Go backend, we'll see that we've got some error rates, and from here it links us to traces from that service directly from the monitor view, and we can see what is producing those errors. So that's a very quick and easy way to get visibility back from a collector that's doing tail sampling, without having to send all of those spans to your backend. Let's wrap that up. I just want to add something: the auto-instrumentations, as I showed you before, already report metrics as well, and you might ask what the difference is between the two — the metrics reported from the auto-instrumentation, with the same kind of latency, call count, and error count, versus the metrics we generate on the collector. Sure — the difference is that not every instrumentation necessarily produces metrics. Some instrumentations, whether auto or manual — like the Go application we have, which was manually instrumented — only produce spans, so there are no RED metrics being produced by the instrumentation. Some of the auto-instrumentation libraries that exist may likewise only produce spans and not emit any metrics, so you can derive metrics from those spans if the metrics don't already exist. Cool — so yes, sometimes it's supported and sometimes it's not supported in the auto-instrumentation. The other difference is that the collector might receive sampled spans, right — just a subset of all the trace data that was produced by the instrumentation. Can the metrics be skewed by that, if we generate them on the collector? They can, yes, and that's why we set up a separate pipeline from the one doing the sampling. If you were doing head sampling, though, then the collector only has visibility into the spans that actually made it to the collector. Okay, now Matej is going to talk to us about the OpenTelemetry Transformation Language. All right, we're almost at the end — thanks for sticking with us. Let's look at our last section, which unlocks the ultimate power to transform your telemetry: how OpenTelemetry enables this through the so-called OpenTelemetry Transformation Language, or OTTL for short. This is a language that is a standalone component within the OpenTelemetry collector, and it is reused across different collector components — mostly you would use it in processors such as the filter processor, the transform processor, or the routing processor. We'll see an example with the transform processor. As for the language itself, most of it will probably feel quite intuitive, like any other scripting or programming language. The main thing with OTTL is that it is built around statements. These statements are included as part of the configuration of your OpenTelemetry collector, and a statement is made of particular parts, such as a context and functions. Here I included a simple example, which will maybe make it more obvious: we have a statement that invokes the function set, and it sets the attribute client_error to true — but only if we also have the attribute http.status equal to 400, or the attribute http.status equal to 404. So this way you can have a statement that transforms the telemetry.
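As a concrete sketch, that example statement sits inside a transform processor configuration roughly like this; the attribute keys (client_error, http.status) are taken from the slide example and may differ from the semantic-convention names your instrumentation actually emits:

```yaml
processors:
  transform:
    error_mode: ignore          # don't fail the pipeline if a statement errors
    trace_statements:
      - context: span           # the context: we are working on spans
        statements:
          # editor function `set`, guarded by a `where` clause
          - set(attributes["client_error"], true) where attributes["http.status"] == 400 or attributes["http.status"] == 404
```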
To go into a bit more detail, let's look at some of the core concepts in OTTL. I mentioned the context: as you'll see in our example, this specifies which part of the telemetry you want to work on — you might want to work on the resource or its attributes, or on the span, or on a metric, or on a particular data point in that metric. So you need to specify what you want to work on, and then you can access things within that context through a path, using the familiar dot notation. As we've seen in the example above, if you want to set an attribute, you access the attribute with square brackets and go from there. Functions: there are basically two types of functions in OTTL, so-called editors and converters. Editors manipulate the telemetry itself — functions such as set, delete_key, the replace functions, and so on. Converters, on the other hand, just transform an input into an output and do not modify the telemetry itself; you will see an example of this as well. And then there are other features: you have operators for comparisons, literals, and so on. So let's look at how this works in action. We will see an example with the transform processor. Our application has one feature we didn't mention explicitly: it also supports recording the name of the player who is rolling the dice. But due to certain privacy concerns, we don't want to record the name of the player in our spans, right? So in order to avoid hefty fines, we'll anonymize the names of our players. First, let's see how we record the name of the player. If you don't have port-forwarding set up, you can set it up — the command is here — but I believe we have it set up on our side, so I'll just go ahead and fire some requests. You can choose from some of these names or come up with your own. So, Barbie wins in this case. Let's do a couple more. Let's go back, since I didn't open it in a new tab, and see if we have something here. Yes — we have the full name of the person who was rolling the dice. Since we're trying to avoid this, we will make sure we only record the first letter of the name, and that is our simplified way of anonymizing the person in this case. Now you can look at the configuration we have here; I took out the excerpt for the transform processor. The error mode is not that important for us — you can, for example, ignore errors. Then we have the context, as I was saying: in this instance we're working on the span itself, and then we have the statements for what we'll do with those spans. In this case we will be setting attributes. As we've seen, we're capturing the player-one attribute, but we want to keep only the first letter, so we use the converter function Substring to capture just the first letter, as you see here: we specify which attribute we're accessing and keep only the first character. And we'll only do this if the attribute exists and is not empty.
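Put into configuration, that statement looks roughly like this; the exact attribute key (app.player1 here) is an assumption based on what the demo application records, so check the repository for the real one:

```yaml
processors:
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Keep only the first character of the player name; Substring is a converter, set is an editor.
          - set(attributes["app.player1"], Substring(attributes["app.player1"], 0, 1)) where attributes["app.player1"] != nil
```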
But that is not all. If we go back to Jaeger and look more closely, we'll see that we're also recording the names in some other attributes — the ones recording the HTTP request, which contain the parameters we pass to the application — so we need to take care of those as well. So we simply extend our configuration and add a couple more statements. In this case we use the replace_all_patterns function, where we again pass certain parameters: we want to work on the attributes, and we want to adjust the value of the attribute, not the key. Then we have a pattern that we try to match against in the attribute values, and what we replace it with is a simple placeholder: instead of including the name of the player, we'll just have playerName in curly brackets as a replacement. As I said, all of this configuration is included in this collector file, so we'll just go ahead and apply it — okay, I'm also struggling apparently; luckily I have just one command. It looks like our new collector came up almost instantly, so let's go back and fire off a couple more requests, now with different names. Let's do a couple more just in case. So let's go back — not too far — and now let's look at Jaeger again. We should see some new traces; let's see what we get. We have the player-one attribute, as you've seen in the parameters of the request we're using, where player number one is called Neo, but we're anonymizing this person and shortening it to N. And then we also see the HTTP target and HTTP URL attributes, and based on the pattern that we provided, we're replacing both names with just the playerName placeholder. So that's it — you have now transformed your span attributes, and you know how to apply this to other parts of spans, or even to other types of telemetry signals. And with that, we're ready to wrap up our workshop. Okay, thank you very much for staying with us this long; we have roughly five minutes for questions. There should be two microphones, one in each aisle. Seems like there are no questions — but we will stay here, so if you want to ask something privately, we can do that as well. All right, thank you very much again. Thank you.