So next up, we have Matej and Oded. You've read the title yourselves, so they didn't ask for any special introduction. Just a round of applause, everyone.

Hi, everybody. We're really excited to be here. Thank you for hosting us at Amsterdam Observability Day 2023. My name is Oded, and I'm leading the platform group at Coralogix. With me is Matej, whom you probably know; he's an expert in observability and cloud engineering. Today we're going to take you on a tour of delta temporality in OpenTelemetry, and we'll see if that adds up. Our plan for the next 20 minutes or so: talk a bit about the history of the observability ecosystem, dive into the OpenTelemetry spec with regard to temporality, cumulative and delta, and from there do a deep dive into how this works in practice with the SDKs, the collector, and our observability backends. In the end, hopefully, we'll get to a conclusion, or maybe more of a confusion. We'll see.

So let's start. Looking at the observability ecosystem a few decades ago, companies started to be interested in measuring things. One of those things is counting. They started counting operational behavior, like the number of calls an application made to a database; counting at the application level, like the number of users currently connected to a website; and more playful metrics, like the number of coffee cups my employees drank today. The other is timing. On the operational layer, we want to understand how long a call to a database takes; at the application level, the time it takes to buy a book on my website; and again, more playful things, like the time it takes to drink coffee in the morning. And from there, we see two traditions evolve.
One of them is the StatsD tradition, and the other is, of course, the Prometheus tradition. Just to do a quick poll: how many of you are using Prometheus today? I guess most of you. How many of you are using StatsD? Nice, and a few of you probably use both.

Talking about StatsD a bit: it's a simple daemon. It aggregates data and sends it to a backend after a specific time frame, mostly 10 seconds. It runs over UDP, fire and forget; we don't guarantee delivery of the data, we just send it really, really fast. When it sends data, it sends it either as a gauge, which is the current value, or as a counter, which is something like an increment from the previous one.

Then we have the Prometheus tradition, where you have an HTTP client and Prometheus scrapes it on a specific time interval. It has a fully fledged time series database, and today it can also run in agent mode, using remote write to a different backend. Prometheus doesn't really care whether it's a gauge or a counter; in the end, it returns the current value.

Now, when we get to the OpenTelemetry specification, a new data model emerged. This data model talks about a time series model, and we have Sum, Gauge, Histogram, Exponential Histogram, and so on. The big claim OpenTelemetry makes is that popular existing metrics data formats can be transparently translated into the OpenTelemetry data model for metrics without losing the semantics or the fidelity, and they explicitly call out Prometheus and StatsD as formats that must be satisfied. To reach that, a new concept emerged: temporality. Aggregation temporality, in the end, defines how our metric aggregators report aggregated values.
It basically describes how those values relate to the time over which they are aggregated. When we talk about cumulative, which is how Prometheus behaves, the metric aggregator reports changes since a fixed start time. That means the current value of a cumulative metric depends on all previous measurements since the start time. With StatsD, which is more of a delta aggregation temporality, the metric aggregator reports changes since the last report time, so the values of delta metrics are based only on the time interval associated with the last measurement cycle.

Now, what does that look like? With delta, we start at time t0. We have a request, a request, a request, and we finish one one-second cycle: we see a count of three. Then we continue into the second cycle, from t1 to t2. We have one request; it's counted within that one-second cycle, and the count is one. Then in the next one-second cycle, from t2 to t3, we have another request and another request, so the count is two, and so on. The increment is always based on the one-cycle time frame. Looking at cumulative temporality, with sums in this case: in the first one-second cycle we again have a request, a request, a request, so from t0 to t1 we have three requests. Then we have one more request in the next cycle, and we see that from t0 to t2 we now have four requests, and so on, up to t5. Basically, I'm always getting the current total in the end. I'll now hand over to my colleague Matej to continue with a more specific example.

Thank you. Thanks for that. So just to finish this off, yes, this is the right slide, here's a slightly more practical example: a pseudo-representation of a series.
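The difference between the two temporalities can be sketched in a few lines of Python. The per-interval request counts below mirror the spoken example and are purely illustrative.

```python
# Illustrative sketch: the same stream of request events reported under
# delta vs. cumulative temporality. The interval data is hypothetical.

# Requests observed in each one-second cycle: t0-t1, t1-t2, t2-t3
events_per_interval = [3, 1, 2]

# Delta: each report covers only the most recent interval.
delta_reports = list(events_per_interval)

# Cumulative: each report covers everything since the fixed start time t0.
cumulative_reports = []
total = 0
for count in events_per_interval:
    total += count
    cumulative_reports.append(total)

print(delta_reports)       # [3, 1, 2]
print(cumulative_reports)  # [3, 4, 6]
```

Both sides see the same events; only how much history each report covers differs.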
So, how would we count, for example, HTTP requests differently with delta and cumulative? I think it's rather self-explanatory. Again, we can think of this as happening over certain intervals, t0 to t1, t1 to t2, and so on, with events happening during those intervals. The question is, in those last two columns, how are we reporting this?

With delta, we report only the changes that occurred during the given time interval. So if we've seen a GET request that we responded to with status code 200, we report that. On the other hand, as you see in the second interval, no requests come in, so with delta we don't report any new values, while with the cumulative count we still report the latest value of the counter. And if, for example, only requests with status code 200 come in, on the delta side we report just that increment, while on the cumulative side we add it to the previous value and keep reporting totals for both requests with status code 200 and requests with status code 400.

Now, what Oded showed and talked about was at the level of the whole stream. Let's look at the individual data points, and at how time works in relation to data points in the OpenTelemetry data model, or the specification. There are two time values connected with data points in OpenTelemetry. One of them we can call the time of observation; the spec refers to it by the name of the actual field, TimeUnixNano. The second one we can refer to as a start time, or the start of a sequence, and I'll explain in a bit what kind of sequence we're talking about. So we have two time values connected with each data point.
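The per-status-code table can be sketched the same way. The request events below are hypothetical; note how delta reports nothing in the quiet interval while the cumulative totals persist.

```python
# Sketch of the per-status-code example: delta reports only what changed
# in each interval; cumulative keeps reporting running totals per series.
# The request data below is hypothetical.
intervals = [
    [("GET", 200), ("GET", 200), ("POST", 400)],  # t0-t1
    [],                                            # t1-t2: no traffic
    [("GET", 200)],                                # t2-t3
]

cumulative = {}
for events in intervals:
    delta = {}  # fresh every interval: only this interval's changes
    for method, status in events:
        key = (method, status)
        delta[key] = delta.get(key, 0) + 1
        cumulative[key] = cumulative.get(key, 0) + 1
    print("delta:", delta, "| cumulative:", cumulative)
```

In the second interval the delta report is empty, but the cumulative report still carries the latest counter values for both series.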
Now, talking about the time of observation, this is probably the more common and more straightforward one to understand, right? It's the time when the actual measurement happened, or when we presume it happened. It's the decisive timestamp attached to the data point, and it's mandatory: we should include it with every data point, obviously, because we want to know when the measurement occurred. Depending on which part of OpenTelemetry you're looking at, you'll find this field called TimeUnixNano or simply timestamp, and as the name tells us, it's Unix time expressed in nanoseconds.

The second time, which the spec refers to as StartTimeUnixNano, is less important to the individual data point, but it helps us make sense of how data points relate to each other temporally in the data stream, how they align in time. It's not mandatory, but it's strongly recommended, especially for sums and histograms, and its value is again Unix time expressed in nanoseconds. This second value is required to correctly build what the spec refers to as unbroken sequences. We can think of an unbroken sequence as a span of the stream where the data points form a coherent sequence: no resets, no overlaps, no gaps, no alignment issues between the points, just like the sum example Oded showed you, which was the ideal case where all the time points aligned. This is important for getting the correct value of the counter, and maybe more importantly, for being able to calculate rates properly.
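A toy data point type makes the unbroken-sequence idea concrete. The two fields model the spec's TimeUnixNano and StartTimeUnixNano; the checker function is our own illustrative helper, not part of any OpenTelemetry SDK.

```python
from dataclasses import dataclass

# Minimal stand-in for an OTLP number data point; only the two time
# fields from the spec (TimeUnixNano, StartTimeUnixNano) are modeled.
@dataclass
class DataPoint:
    start_time_unix_nano: int
    time_unix_nano: int
    value: float

def is_unbroken(points, temporality):
    """Our sketch of the alignment check for an unbroken sequence."""
    for prev, curr in zip(points, points[1:]):
        if temporality == "delta":
            # Each point's start time must equal the preceding point's time.
            if curr.start_time_unix_nano != prev.time_unix_nano:
                return False
        else:  # cumulative
            # All points in the sequence share one start time.
            if curr.start_time_unix_nano != prev.start_time_unix_nano:
                return False
    return True

delta_stream = [DataPoint(0, 1, 3.0), DataPoint(1, 2, 1.0), DataPoint(2, 3, 2.0)]
cumulative_stream = [DataPoint(0, 1, 3.0), DataPoint(0, 2, 4.0), DataPoint(0, 3, 6.0)]
print(is_unbroken(delta_stream, "delta"))            # True
print(is_unbroken(cumulative_stream, "cumulative"))  # True
```

A gap or a reset, say a delta point whose start time doesn't touch the previous point's time, would make the check fail.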
How these points need to align differs between delta and cumulative, as Oded already alluded to. For delta, we always want to align the StartTimeUnixNano of the next point with the TimeUnixNano of the preceding point. With cumulative, on the other hand, all the points in one stream, or in one sequence, share the same start timestamp and just report different TimeUnixNano values.

Let's move on to the actual practice. I think this is where our interest in the topic came from and where we started to experiment and play with it, because we had some users who were using StatsD, and I wondered: what is this delta temporality, as someone coming more from the traditional Prometheus ecosystem? This is where things got more interesting. We started to look at the whole pipeline and to ask: what happens to the temporality of the data at the different stages as we go through the collector pipeline and process those data points? How can this affect the temporality and the actual data, and how does it come out at the end of the pipeline? So I'll move through all of the different components, from the client side all the way to the backends.

If we begin at the client side, you obviously have multiple options for which protocols or formats to use to send the data to the collector. You can use the traditional ones we already mentioned that come from outside of OpenTelemetry itself, like Prometheus and StatsD; I'll keep using these two as examples, since we mentioned them as the traditions we're coming from. Here, although we can think of them as having different aggregation temporalities, the notion of temporality doesn't really exist in Prometheus or StatsD; it's not part of the format or the protocol.
So we have no choice, no control over this; we're just providing the data to the collector. Whereas if you use instrumentation from OpenTelemetry, the OpenTelemetry SDK, temporality is native to it, so you have the possibility to choose the aggregation temporality and more options to configure it. By default, according to the specification, OpenTelemetry uses cumulative temporality, but you can switch this to delta.

Then we move on to the collector. Hopefully most of you are at least somewhat familiar with it: we have different components responsible for different parts of the pipeline. We have receivers, which are the first touch point in the collector; we have processors, which can do some sort of processing of the data; and we have exporters at the other side of the pipeline, where we send the data to our observability backend.

As I said, the receivers are the first touch point where the data comes in. In this step, the data is translated (I guess "translated" is the better term) from the protocol or format of the client into the actual representation used inside the collector. This is called pipeline data; you'll find it as pdata. And this is also where the decision about which temporality to use comes into play. Obviously it might not be a real choice: if StatsD gives you delta values, you have to use delta temporality, and on the other hand, if you have cumulative values, you need to translate them with cumulative temporality. This is all handled by the receivers; they translate the data into a temporality that's understandable within OpenTelemetry and within the collector.
Obviously this is different if you use the OpenTelemetry SDK, the native way to send data, because then it's translated one to one: if you use delta temporality, it is carried over as delta temporality in the pipeline data format.

Moving on to processors: as I said, with these we can do some manipulation of the data points. I think there are three processors of particular interest to people who are working with temporality, or thinking about how temporality works in the collector. The first is the cumulative-to-delta processor, which is usually bundled, or thought of, together with the delta-to-rate processor. The cumulative-to-delta processor simply takes cumulative values and turns them into delta values, and its output is then often piped into the delta-to-rate processor, which gives us rates directly, as gauges. So you can get the rate representation directly from cumulative values. You can use these independently or together; we'll also talk a little about when you'd use cumulative-to-delta by itself.

And then there's the transform processor. This one, I would say, requires more expertise; it gives you more power, so it requires more knowledge and more attention. It uses the OpenTelemetry Transformation Language (OTTL), a language within OpenTelemetry that makes it possible for you to define transformations of the data. With it, you can also manipulate the aggregation temporality of data points.
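A collector pipeline wiring these pieces together might look like the sketch below. The component names (statsd, cumulativetodelta, deltatorate) come from opentelemetry-collector-contrib; the endpoints and the metric name are placeholders, and the exact config schema should be checked against the processor READMEs for your collector version.

```yaml
# Sketch: a metrics pipeline that turns cumulative sums into rates.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  statsd:
    endpoint: 0.0.0.0:8125   # StatsD already delivers delta-style values

processors:
  cumulativetodelta:          # cumulative sums -> delta sums
  deltatorate:                # delta sums -> rate gauges
    metrics:
      - http.server.request.count   # placeholder metric name

exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318   # placeholder backend

service:
  pipelines:
    metrics:
      receivers: [otlp, statsd]
      processors: [cumulativetodelta, deltatorate]
      exporters: [otlphttp]
```

Dropping deltatorate from the processor list gives the "cumulative-to-delta by itself" setup mentioned above.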
There are also some helper functions in that processor, for example for converting a gauge to a sum, where you can then change the monotonicity or the aggregation temporality. But you should take caution when using these, because they don't prevent you from doing unsound transformations, or something that goes against the specification or against conventions. So I would recommend looking at them only if you have a very specific use case.

Lastly, exporting to your backends, and arguably this can be one of the decisive factors when you're deciding whether to use delta or cumulative temporality. We can split this into two parts. On one hand, we have exporters, which sit at the last stage of the pipeline. Here we're translating the pipeline format back into the format we want to output, and we need to be mindful of how the exporter will translate the data. The caveat is that not all exporters can transform all of the data. A good example right now, I think, is the Prometheus remote write exporter, which does not handle all data points with delta temporality and might drop some of the data. It will give you warnings or errors, but if something like that happens, you should be aware of it and pay attention to it.

The other side of the coin is your backends, whether you're running your own solution or sending the data to a vendor. You should also be aware of how the data ends up on the other side of the pipeline. It's good to check the documentation of the tool you're using, or of your vendor, because some vendors might support or prefer data provided with delta temporality, while others might prefer cumulative, and so on. So again, it's important to understand what happens once the data lands in your backend, and if you're interested, you can also take a look into the spec.
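When the backend is the one doing delta-to-cumulative aggregation, the core idea is just summing deltas per series since a start time. Here's a toy accumulator, our own sketch of that idea, not code from any real backend (it also ignores out-of-order points and counter resets, which real implementations must handle).

```python
# Toy delta-to-cumulative accumulator, as a backend might apply when it
# stores only cumulative series. Illustrative sketch only.
class DeltaToCumulative:
    def __init__(self):
        self.totals = {}  # series identity -> running cumulative value

    def ingest(self, series_key, delta_value):
        """Fold one delta data point in; return the cumulative value."""
        self.totals[series_key] = self.totals.get(series_key, 0) + delta_value
        return self.totals[series_key]

acc = DeltaToCumulative()
reports = [acc.ingest('http_requests{code="200"}', d) for d in [3, 1, 2]]
print(reports)  # [3, 4, 6]
```

The series key here is a placeholder string; in practice the identity would be the metric name plus its full attribute set.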
There are some guidelines there on how the transformation happens, especially delta to cumulative, because in the end, when we keep sending delta values, they need to be aggregated somewhere in the pipeline, and if that happens at the end of the pipeline, in your backend, it's good to understand how the transformation occurs. Another interesting one is to look at how the OpenTelemetry protocol is transformed to Prometheus; there's certain guidance on this in the spec as well. For example, Grafana Mimir, which supports ingestion of metrics in the OTLP format and then translates them into Prometheus, which is native to Mimir, specifically refers to this guidance. So that can help you understand, if you send data points with delta temporality, how they will be transformed and how they will end up on the other side.

Just a short conclusion. I think there's no perfect end-to-end solution. Whichever temporality you use, as you go through the different steps in the pipeline, it's good to think about what happens to those data points and how they will end up on the other side, and it's especially good to pay attention to whether any data can be dropped. And when it's reasonable, you can opt to transform to delta, whether because your backend prefers delta or because you want to do further processing of the data points, going from cumulative to delta to rate. And I think that's where we end it. Thank you.