So as I said, hopefully you all are in the right room. I was realizing that some of you may not realize that "hotel" is short for OpenTelemetry, so this is just my play on the title. Quick introduction: my name is Rhys Lee. Can you guys hear me? I can't tell. This is awesome. OK. I am a developer relations engineer by day at New Relic. I actually started in tech support, and I mention that because I really enjoy working directly with end users. That's how I landed in the OpenTelemetry End User Working Group when I was searching for how I could best contribute to the project. We have two overarching goals: to increase adoption and implementation of the project, and to maintain and facilitate a feedback loop to help improve the project. To that end, we host various activities to connect end users and contributors. In the interest of time, I won't get too much into those, but definitely feel free to come find me afterward, or come to the OpenTelemetry booth in the Solutions Showcase. A quick fun fact about myself: I'm originally from Malaysia, I have lived in the Pacific Northwest in the States for almost 20 years at this point, and the Netherlands is the 15th country that I've been to. So yay. OK. Hi, AV, can I actually use one of the handheld mics? This is killing my vibe right now. Oh, there we go. OK. So I've categorized today's content into four broad categories. We're going to do a quick metrics overview, followed by a quick OpenTelemetry overview. Then we'll get into the meat of the talk, which is the metrics dip. And then we'll look at some next steps for you to take after this session to learn more about metrics. So first, we'll start with a metrics overview: we'll look at what a metric is and why metrics are useful. What is a metric? A metric is simply a measurement about a service that's captured at runtime. And these measurements can be further aggregated over time to help us identify trends and patterns.
Also, there are broad categories of metrics: you have application runtime metrics, you've got infrastructure metrics, and you probably also want to collect some custom or business metrics as well. To illustrate this, as well as subsequent concepts, I'm going to use the OpenTelemetry community demo application as the basis for our examples. The demo application is actually based off of GCP's Hipster Shop, which I think is now called Online Boutique, so a lot of you may already be familiar with it. We have updated it so that it sells telescopes, because our logo is a telescope. So with this in mind, what are some examples of metrics we might want to collect about our application to understand its health and monitor its performance? Off the bat, we've got some standard metrics we call golden signals: things like throughput, response time, and error rate, to help us understand traffic, latency, and the errors that are occurring. We might want to look at CPU utilization. And we might also be interested in collecting custom metrics such as number of active users, total processed orders, or total processed orders of a specific item. These are all examples of metrics. Why are metrics useful? There are really two parts to this question: one, why are metrics useful in general for observability, and two, what characteristics make metrics more useful than the other two main telemetry types, which are traces and logs? Generally, all three are usually used in conjunction with each other to monitor the performance of your system or application. You can also use span data to power certain graphs and charts, but generally, metrics are much better suited and more broadly used to power graphs and charts. Numbers are optimized for processing, storing, compression, and retrieval, which makes them easier to query, as well as enabling longer retention of data.
And where metric data really shines is in helping us reduce the volume of data while still providing insight into that data. If we were to export and analyze measurements one by one, that could be really expensive; metrics help us reduce the volume of data. And finally, for alerting, metrics form the basis of SLIs, or service level indicators, which are then used to set SLOs, or service level objectives, which teams use to calculate their error budgets. OK, if you're here, I presume that you may be familiar with OpenTelemetry, but just so everyone's on the same page, we're going to do a quick refresher on what OpenTelemetry is. OpenTelemetry is an observability framework built on an open standard. It was formed back in 2019 as a result of two existing projects merging: OpenTracing and OpenCensus. As of last year, I believe, it was the second most active CNCF project in terms of contributions, after Kubernetes. The project aims to standardize how applications are instrumented, as well as how telemetry is generated, collected, and exported. Something to consider is that OpenTelemetry is not a data visualization tool, and it is not a storage solution, so you do still need to send your data to a backend platform to analyze it. How does OpenTelemetry do all this? It does so by providing a set of language-specific APIs and SDKs; tools; components such as the Collector; instrumentation libraries; semantic conventions; as well as a protocol called OTLP, the OpenTelemetry Protocol. And so going back to the demo application that we're using as our example: it is an application that consists of several microservices written in different languages, including Java, .NET, Python, and Ruby, that talk to each other over gRPC and HTTP.
With OpenTelemetry, you have one standardized set of tools that you can use to instrument everything: you can instrument all your services once and pretty much be able to send your data to whichever backend you choose, and you can update that backend simply by changing your exporter endpoint. That is one of the biggest draws of OpenTelemetry: it gives you freedom from vendor lock-in, and it also makes swapping backends out really easy. OK, so we are going to get into the meat section of this talk; hopefully I left enough time for it. Here's the scope: we're going to look at metrics in OpenTelemetry, the architecture of a metrics pipeline, and then metric instruments, their types, and their use cases. I want to quickly explain why I'm calling this section the metrics dip. The scope of this session is necessarily very brief, right, to fit into a 35, and now 25, minute time slot, and there's really so much more on this topic that you could get into; a lot of the content that I cover here could really be its own full session. So "metrics dip" is just a play on the term "deep dive": rather than putting on a scuba suit and deep diving into the lake of metrics, we're really just kind of dipping our feet into the water. But hopefully I've made this session such that you leave with a solid foundation of at least how to get started, and I'll also have a list of references for you to look into as next steps to learn more. OK, so metrics in OpenTelemetry is broken out into the API and the SDK. The API is used to instrument code, so generally application owners will use the API to install pre-built instrumentation for their frameworks or libraries. The SDK is used to implement the API, and you can also use the SDK to configure what happens to the telemetry that's collected by the API, including processing and exporting it. First up, we've got the meter provider, which is the API entry point for metrics.
We use a meter provider to obtain meters, which you can use for different scopes, a scope here being a logical unit of application code. So for instance, instrumentation for an HTTP client library would have a different scope, and therefore a different meter, than, say, a database client library. And we use meters to create instruments, which are what we use to record measurements, which consist of a value and a set of attributes. The SDK provides implementations for all of the above (meter provider, meters, and instruments), which can be used to configure what happens to the data that is collected by the API. Ooh, I feel like there was one more point, but OK, that's really all you need to know right now. Now we're going to look at some concepts at a high level, to help you understand them before we get into the metric instruments. First up, we've got aggregation. Aggregation is the process of combining multiple measurements into a single data point. A really simple example: let's say you have a set of measurements that represent the daily sales of your telescopes over 30 days. You could aggregate those measurements into a single data point, which would give you the total of all the telescopes sold over that time period. Next, we have the notion of temporality. Temporality dictates how you aggregate, and it relates to whether the reported values of additive quantities include previous measurements or not. There are two flavors. We've got cumulative temporality, which indicates that measurements are accumulated when they're exported; another way to look at it is that it always has the same start time, unless your app restarts, in which case all measurements will start from that new start time. We also have delta temporality, which indicates that measurements are reset each time they're exported; another way to look at it is that there is a constantly moving start time.
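To make those two flavors concrete, here's a minimal sketch in plain Python (not the OpenTelemetry SDK; the function names are just illustrative) of how the same stream of daily telescope-sale counts would be reported under each temporality:

```python
# Illustrative sketch, not the OpenTelemetry SDK: how the same
# measurements are reported under cumulative vs. delta temporality.

daily_sales = [3, 5, 2, 7]  # telescopes sold between each export

def cumulative_exports(measurements):
    """Cumulative: every export accumulates since one fixed start time."""
    total, exports = 0, []
    for m in measurements:
        total += m
        exports.append(total)
    return exports

def delta_exports(measurements):
    """Delta: the counter resets at each export (a moving start time)."""
    return list(measurements)

print(cumulative_exports(daily_sales))  # [3, 8, 10, 17]
print(delta_exports(daily_sales))       # [3, 5, 2, 7]
```

Note that both streams carry the same information: you can recover deltas by subtracting consecutive cumulative values, and vice versa.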
Also, I'm not sure how many of you were here for Observability Day yesterday, but there was actually a pretty great talk on temporality called "Does It Add Up? Exploring Delta Temporality in OpenTelemetry and Beyond" by the Coralogix team, so I just want to give them a shout-out. If you want to learn more about temporality, I encourage you to seek them out; they might have a booth here, I think, or I'm sure you can find them on the CNCF Slack. But yeah, the whole session really got into temporality. It was neat; I learned some stuff. Monotonicity relates to whether the value that you're recording is always increasing, which is called monotonic, or can both increase and decrease, which is non-monotonic. I also want to cover dimensions real quick. Dimensions, in the context of metrics, refer to attributes that are associated with a measurement. So let's say you are counting the number of active users of your online telescope shop, and you want to collect some information about these users, say their location, and specifically their country. The country would then be added as a dimension, or attribute, on those measurements. The last one we're going to cover on this slide is cardinality. Cardinality is generally defined as the number of unique elements in a set; in monitoring, it refers to the uniqueness of an element within a set. Using our example from just now: you're counting the number of users of your shop, and you're collecting their location, and let's say you decide to collect their country. If your users happen to all be from the same one or two countries, that's low cardinality. But if you were to collect, say, their city instead, and your users are from all over those countries, that would be high cardinality, because the uniqueness of that value has increased.
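As a quick sketch of why the choice of dimension matters (plain Python, with made-up user records), compare the cardinality of a country attribute versus a city attribute over the same set of users:

```python
# Cardinality = the number of unique values an attribute takes on.
# The user records below are hypothetical, for illustration only.

users = [
    {"country": "NL", "city": "Amsterdam"},
    {"country": "NL", "city": "Rotterdam"},
    {"country": "MY", "city": "Kuala Lumpur"},
    {"country": "NL", "city": "Utrecht"},
]

def cardinality(records, attribute):
    """Count the distinct values of one attribute across all records."""
    return len({r[attribute] for r in records})

print(cardinality(users, "country"))  # 2 -> low cardinality
print(cardinality(users, "city"))     # 4 -> high cardinality
```

Every distinct attribute value produces its own metric series in the backend, which is why the city version costs more than the country version.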
Cardinality is important to consider whenever you're collecting telemetry. Let's say you're running a sale on your site, and all of a sudden you have users from all over the world flocking to your site to get your wonderful discounted telescopes. If you're still collecting, say, their city, all of a sudden you are experiencing an increase in the load on your system, otherwise known as a cardinality explosion, which is a very dramatic term that I enjoy. Another thing to consider is that a lot of backend vendors that you send telemetry to will impose cardinality limits. So just something to keep in mind. We won't get into adding dimensions in this session, but for the reasons I just outlined, cardinality is important to consider. OK, so now that you've learned a little bit about the behind-the-scenes pieces of metrics in OpenTelemetry, let's take a look at the architecture of a metrics pipeline. You already learned about the first part: measurements are recorded by instruments. From there, a metric reader collects these metrics, and then off they go to the metrics exporter, which translates them into an output format for different protocols, and you can then send them onward to the data analytics tool or tools of your choice. Next: metric instruments, types, and use cases. You already learned that instruments are what we use to report measurements. Each instrument contains the following fields; the first two are required: the instrument name, and the kind, which we'll get into in a little bit. And then there are two optional fields, which are the unit of measure and a description. I have a very simple example here. Let's say you want to keep a counter of the number of telescopes you've sold. You might want to call it something simple like "telescopes sold," and you're going to be using a counter. The unit of measure could be "telescope," and your description, "total telescopes sold."
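As a sketch of those four fields (an illustration in plain Python, not the real OpenTelemetry API; the class and field names here are just a stand-in), the telescope example would look something like:

```python
# Illustrative model of an instrument's four fields; not the OTel API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instrument:
    name: str                          # required
    kind: str                          # required, e.g. "counter"
    unit: Optional[str] = None         # optional unit of measure
    description: Optional[str] = None  # optional human-readable text

telescopes_sold = Instrument(
    name="telescopes_sold",
    kind="counter",
    unit="{telescope}",
    description="Total telescopes sold",
)
print(telescopes_sold)
```

In the real SDKs you would pass these same fields to a meter's instrument-creation method rather than building an object yourself.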
What instruments does OpenTelemetry provide? OpenTelemetry provides six instruments: you've got the counter, up-down counter, async counter, async up-down counter, histogram, and the gauge. I'm going to have three columns coming up shortly that indicate the properties of each instrument. We're going to start with synchronicity: whether an instrument is synchronous or not. I really thought I was going to get the pronunciation that time, synchronicity. An instrument is considered synchronous when an instance of it is called when the event being measured occurs. So let's say you're observing a user clicking on a button: that would be a synchronous instrument. And an instrument is considered asynchronous when it reports measurements on a set interval. Excuse me. Next we have the additive property. Ooh, actually, that was too fast. So, whether you want to use a synchronous or asynchronous instrument really comes down to convenience for you: do you want the measurement at the time it's created, or do you want to report it on a set interval? That's really up to you and your use case. Next we have the additive property, which refers to whether the measurements are summed or not. And lastly we have monotonicity, which, if you'll remember, just refers to whether the measurements you're recording are monotonic, which means always increasing, or non-monotonic, which means they fluctuate, or go up and down. This last column here shows the aggregation strategy, or simply the aggregation, of each instrument, and it refers to the data type that each instrument produces. As you can see, the first four, the counters, all produce sums; the histogram produces a histogram; and a gauge records the last value. So, keeping the list we just went over in mind, why might instrument selection be important? You learned that each instrument has a default aggregation.
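A rough sketch of the synchronous/asynchronous difference in plain Python (the class and method names are made up, not the OTel API): a synchronous instrument is invoked at the moment the event happens, while an asynchronous one hands the SDK a callback that gets observed on an interval:

```python
# Illustrative only; real OTel instruments come from a meter, not these classes.

class SyncCounter:
    """Called inline, at the moment the measured event occurs."""
    def __init__(self):
        self.value = 0
    def add(self, n=1):
        self.value += n

class AsyncCounter:
    """Registers a callback; the SDK's reader invokes it on a set interval."""
    def __init__(self, callback):
        self.callback = callback
    def observe(self):
        return self.callback()

clicks = SyncCounter()
clicks.add()          # a user clicked "buy"
clicks.add()          # another click, recorded as it happens

cpu_seconds = AsyncCounter(lambda: 12.5)  # read a running total on demand
print(clicks.value)           # 2
print(cpu_seconds.observe())  # 12.5
```

The sync version pushes a measurement per event; the async version pulls a current value whenever collection runs, which is why it suits quantities that are expensive or awkward to record per event.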
The default aggregation reflects the intended use of the measurements. So the instrument type affects how measurements are aggregated, which ultimately impacts the type of metric that is exported, which in turn impacts the way that you can query and analyze it. Put another way, different instrument types and different aggregations support different modes of analysis. For example, let's say you want to measure the latency of search results for your telescope shop; you want to know how long search results are taking to pop up when users are trying to look for a specific telescope in your shop. A sum of these measurements would not make sense, because you can't really derive anything useful from that sum. What you would want is a distribution of the measurements, so you would want an instrument that will produce a histogram. Here is a brief framework for how to choose an instrument. Think about: how do you want to analyze the data? Do you need the measurement synchronously, or can it be reported on a set interval? And finally, are the values monotonic? These will help you decide. We're going to go into each instrument a little more in depth, starting with the counter, which, as you can see, is synchronous, additive, and monotonic; the default aggregation is the sum. Examples of usages would be number of bytes sent, total orders processed, total cart adds, and total cart add failures (keeping in mind, of course, that we're using the telescope shop as our example), total checkouts, et cetera. You would want to use the counter when you want to count things and compute the rate at which things happen, or when the sum of things is more meaningful than the individual values. An example of what a graph or chart might look like when you're using the counter would be this: it's monotonic, which means it's always increasing, which is why the line never goes down.
Next, you've got the up-down counter, for use when the measurements you're recording are non-monotonic. Example usages would be number of open connections, number of users, queue size, and memory in use. You would want to use this when you have values that are negative, or that go up and down, something like this: as you can see, the number of active users goes up and down over time, which is realistic. And now we also have an async version of the counter. Example usages are CPU time, cache hits and misses, and total network bytes transferred. You would want to use this when you need a sum of your measurements, but they may be too expensive to record synchronously, or it's more appropriate, or you would just prefer, to record them on a set interval. And since it's monotonic, the chart will look like this as well. Next up, we have an async version of the up-down counter. Example usages would be memory utilization, process heap size, number of active carts, and changes in the number of active users. You would use this when you need a non-monotonic, additive counter that reports on set intervals, or when you need an absolute value, not a delta. And again, here's an example chart of what this might look like if you're using an async up-down counter. And the histogram: you might have gathered by this point that "histogram" here refers to both an instrument type and an aggregation strategy. OpenTelemetry supports two kinds of histograms. We have the default, which is the explicit bucket histogram, where you pre-define your buckets ahead of time. And OpenTelemetry does something cool, which is that if you have measurements that fall outside the maximum upper boundary, it captures those in an additional bucket. You would want to use a histogram when you want to analyze the distribution of measurements to identify trends, or you want to calculate the min, max, and average response time. You are probably familiar with what a histogram looks like; here's a simple example.
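Here's a small sketch of an explicit bucket histogram (plain Python; the boundaries and latency values are made up), including that extra overflow bucket for measurements above the largest boundary:

```python
# Illustrative explicit-bucket histogram with an overflow bucket.
import bisect

boundaries = [10, 50, 100]           # pre-defined upper bounds, in ms
# One count per bucket: <=10, <=50, <=100, plus >100 (overflow).
counts = [0] * (len(boundaries) + 1)

def record(value):
    # bisect_left picks the first bucket whose upper bound is >= value.
    counts[bisect.bisect_left(boundaries, value)] += 1

for latency_ms in (4, 12, 75, 230, 48):
    record(latency_ms)

print(counts)  # [1, 2, 1, 1] -> the 230 ms value landed in the overflow bucket
```

Because only bucket counts are kept, the histogram summarizes the distribution at a fixed cost no matter how many measurements are recorded.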
I do want to just talk about exponential bucket histograms real quick. We won't get too deep into them, but they are pretty cool and powerful. Again, there was a talk yesterday at Observability Day; I don't think the recordings will be available for a while, but if you want to find the Grafana booth and talk to their team about their talk, they did a really good job talking about OpenTelemetry exponential histograms, specifically in Prometheus. There's also a blog post written by one of my colleagues, called "Exponential Histograms," that's available on the OpenTelemetry blog right now. Lastly, we have the gauge, which is asynchronous, non-additive, and non-monotonic, and it records a last value. Examples of use cases would be CPU utilization, temperature of hardware, and average memory consumption. You want to use this when you want to report data that isn't useful to aggregate across dimensions and you have access to those measurements asynchronously, or when you want fine-grained control of when a non-additive measurement is made, particularly when summing it would not be meaningful. Here's an example of what a chart might look like if you're using a gauge. You might have this question too, which I had when I was putting this deck together: when might you want to use an async up-down counter versus a gauge? Because they're a little bit similar. As a refresher, the async up-down counter is non-monotonic and records an absolute value; the gauge is also async, but records a last value. Essentially, it depends on whether you need to sum values across dimensions. You want to use the up-down counter when you want to aggregate, or sum, across dimensions in a meaningful way, and the gauge when you want to report data that isn't useful to sum across dimensions, or when individual measurements are important on their own and do not need to be summed together. Temperature is a pretty simple but common example.
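A small sketch of that distinction in plain Python (made-up sensor readings): gauge semantics keep the last value per attribute set, while summing those values, which is what up-down counter semantics would give you, is meaningless for temperature:

```python
# Illustrative: last-value (gauge) semantics vs. summing across dimensions.

readings = [("sensor_a", 21.5), ("sensor_b", 23.0), ("sensor_a", 22.1)]

# Gauge: keep only the most recent value per attribute set.
last_value = {}
for sensor, temp in readings:
    last_value[sensor] = temp
print(last_value)  # {'sensor_a': 22.1, 'sensor_b': 23.0}

# Summing across sensors (up-down counter semantics) gives roughly
# 45.1 "degrees": a number with no physical meaning here.
print(sum(last_value.values()))
```

For something like queue size, by contrast, summing across dimensions (per-queue values into a total backlog) is exactly what you want, which is why that belongs on an up-down counter instead.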
What are you going to do with a sum of temperature readings, right? Oh my gosh, OK, we might make it. OK, so what's next? We're going to do a quick recap, then I have some suggestions for what to explore next, and then we'll do a quick look at credits; I have some people to thank. So: we learned about what a metric is, why metrics are useful for observability, what OpenTelemetry is, and some of the utility and customization options it provides in metric generation and collection. I might have glossed over those. We do have an OpenTelemetry project booth in the Solutions Showcase, so please stop by; we have lots of lovely people who would be more than happy to talk to you about metrics, as well as anything OTel-related or observability-related. We also talked about metrics concepts as they apply in OpenTelemetry, and we looked at OpenTelemetry metric instruments and how to choose one. I've basically put this table together from the slides that we just looked at; I'll leave it up here for a second, but it's just a summary of all the metric instrument types that we just talked about. OK, what to explore next? Like I said, there's so much more to learn; there's so much I didn't cover, whether in depth or even mention at all. The first suggestion would be to try it out yourself: instrumentation and implementation. There's something called the Views API, which is really neat; you can use it to actually change the default aggregation of these instruments, so that's one of the customization options I didn't really go over. But yeah: look into push- versus pull-based exporting, and application runtime metrics. The OpenTelemetry Collector provides different processors you can use to transform your metrics data. And there's so much more; this is just a suggested reading and exploration list. And finally, credits and references. A specific thank-you to these people.
And then I have some references, including the two talks I mentioned that happened yesterday, whose recordings I believe will be up in the next few weeks. But otherwise, I believe those teams are here, so yeah, if you want to learn more or chat more about OpenTelemetry metrics, we are here. And we are at time, so thank you so much for your patience and for sacrificing this beautiful day to be here with me, and enjoy the rest of your time in Amsterdam.