So, hello everyone. For the next 25 minutes I will keep people hostage from lunch and cover a part of the observability space that we interact with every day but haven't developed much until recently. There's been a lot of evolution in OTel and in the observability landscape in general, and this area is no different. I'll be talking about metadata. Metadata is something that has bugged me quite a bit over the last few months, and I have lots and lots of thoughts about it. I might not be able to cover most of them in this talk, but feel free to find me after the talk, in the questions, at lunch, or later. I keep going.

Here's a quick agenda of what we'll be covering. Again, we only have 25 minutes, so we might stick to the basics in some of these areas, but find me after and we'll dig deeper.

Now, what are we talking about? Systems and applications have become increasingly complex and distributed, and observability is essential at, like, every company, team, and project. Everyone's working on observability now, and that's where telemetry comes in, right? There have been a lot of tools and solutions maturing over the last few years, and there are three signals that are pretty commonplace: logs, metrics, and traces.

I will warn folks in advance: when the AI was summarizing my slides to create my notes, it said I was being very pedantic. Which is true, and in the next few slides especially I will be defining some things very pedantically, for two reasons: I think it will be important when we start defining what metadata is and why it's important, and also because I am generally a pedantic person. Another warning:
I actually didn't know what images to use in my presentation, so all the images are what the AI thought I was talking about when it summarized my slides.

We've all interacted with logs in some way. Logs are time-stamped text records. They could be structured, they could be unstructured, and they have some meaningful metadata associated with them. Most programming languages have logging capabilities. We've all seen logs; here's an example of some I pulled out of my VM. Generally a log will have a timestamp; it will have a unique identifier somewhere, in this case the insertId; it will have a severity field and the other fields that each observability backend defines as part of its structure; and some JSON fields.

Logs could be independent, or they could be part of traces, as spans. We've seen spans quite a bit. Again, don't blame me, this is what the AI thinks traces look like, but it will learn. Traces represent the whole journey of a request. Spans have start times, they have a duration, and they have, again, some metadata associated with them that helps you contextualize where the bottlenecks are and what's taking time. This is what a trace looks like: you can see how long a service takes at different stages, how long a user action or a request takes across several services, and then the database. The AI is evidently not very good at spelling, but I appreciated the effort.

Metrics are the final signal we're talking about today. A metric is a time-stamped measurement of a service at runtime. It consists not only of the measurement itself but, again, some meaningful metadata associated with it. This is an example of a metric; we'll actually go deeper into this one later in the slide deck, but this is a CPU utilization metric. It'll have lots of information, and it'll again have a timestamp.
It'll have some metadata, like resource labels, instrumentation scope labels, and so on, but this will make a little more sense, hopefully, near the end of the talk.

That's kind of where we are now. Each of these signals is pretty well adopted; I expect most people here to have interacted with maybe all three of them. But there's a problem in the OSS world: there's a lot of standardization around each of these signals, but maybe not a lot of standardization to make the experience cohesive between them. Think jumping from a metric for a high-latency service, to the logs for that service, and then to the traces for that service. Much of how we make the ergonomics of querying across those things easier is what I think about when I'm working at Google Cloud.

That's me, that's who I am. I'm Ridwan Sharif. I work on observability tools and solutions for GCP. Most of my work so far has focused on agents running on VMs. I've worked on OpenTelemetry a little bit, but more recently I've also been working on compatibility between OpenTelemetry and Prometheus and getting them to run on serverless environments like Cloud Run. Much of the work we put in, and a lot of the magic that goes into the observability experience and making everything ergonomic, has mostly to do with the compatibility between OpenTelemetry and Prometheus, and how those ecosystems work with the data models that power Cloud Logging, Cloud Monitoring, and Cloud Trace. So I've been thinking a lot about how we build cohesion between these products.

We saw from the definitions earlier that there are things that tie the telemetry signals together: they're all time-stamped measurements of something that's happening at a specific time, and they all encode some metadata. I've mentioned metadata quite a few times, so I'll be pedantic again. Let's start defining what metadata is.
I actually tried very hard here; all the definitions we saw earlier were from the OTel docs and the Prometheus docs. Metadata is not a very well-defined term yet, but I've seen a few definitions out there, and I put a few on this slide. There are the arbitrary key-value pairs that tell you what is being measured; that's roughly the definition the log and metric definitions were using before. There's also "data collected by a large social media company," but that's a different kind of metadata that we're not going to talk about. And there's data collected about the telemetry signal itself, about what's creating it, the emitter of it, but not necessarily about the measurement itself. They're all kind of close, but let's be precise.

When we talk about metadata at Google, and what I mean in these slides, it's any descriptive or supplemental information that helps you contextualize your data and your signals. It's not being measured by the signal itself, and changes to that information don't affect the signal itself. It exists to provide high-level context and help you get a deeper understanding of the signals.

There are lots of reasons why metadata is important. Metadata is typically what you use to enrich the signals being collected. It allows you to make changes to the metadata without actually affecting your metrics or your logs themselves. You can associate arbitrary user labels with your metrics, and you can group and filter by metadata. Your applications don't actually have to worry about the metadata being collected; instead they can focus on the signals themselves, and you expect this kind of enrichment to happen after the fact. And, most important to me at least, we use metadata to correlate: not just signals with each other, not just logs with traces and traces with metrics, but also signals with real-life entities.

This is kind of how we think about metadata.
Often the application, the runtime, and the libraries themselves will focus on instrumenting and collecting the signals, but the application might not be aware of what environment it's running in. Your application might not be aware of what container it's part of, or what pod, or what environment, or what cloud provider. All of that information, which is not known, or maybe doesn't have to be known, by the application, but is super useful when you're querying in your observability backends, is the kind of metadata I'm talking about.

Those are the examples here, right? The geographical location things are running in; whether your application in production is alerting when your applications in test aren't; that environment information, your instance name, your version number, any tags you use on your machines. All of it is metadata that you might want to use when you're querying, but not necessarily when you're instrumenting.

Now let's look at what people actually do out in the world. What does Prometheus do with metadata? I actually quite like Prometheus. Prometheus has a very simple data model: it consists, in large part, of two things, samples and labels. Samples are just the metric value itself and the timestamp, and labels are key-value pairs that give dimensions to your metrics. The metric name is just another label, for example. Oh, that's it.
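As a concrete sketch of that model, here is roughly what the Prometheus exposition format looks like; the metric and label names here are made up:

```
# HELP http_requests_total Total number of HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="GET",code="200"} 1027
http_requests_total{method="POST",code="500"} 3
```

Internally the metric name is stored as just another label, `__name__`, on each series; everything that identifies a series is a label, and everything else is a sample.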
That's all there is. Prometheus doesn't differentiate in its data model between metadata labels, metric labels, resource labels, or instrumentation labels; at the end of the day they're all just labels, and they're all human-readable. This is the exposition format for Prometheus metrics; I expect most people here have seen some.

That doesn't mean Prometheus doesn't have metadata, though. Prometheus is very well designed for pull-based ingestion, where the ingester, the scraper, is generally aware of what it's scraping from, and it does that using some service discovery mechanism. The service discovery mechanism surfaces a lot of valuable information, a lot of valuable metadata, to the person configuring Prometheus. And this is the crucial difference: Prometheus will expose a lot of this metadata as meta labels, labels that exist while you're scraping, but the onus is on the person configuring Prometheus to actually add them to the metrics. If you don't, none of them are added by default. So in this example, if you're using the Kubernetes service discovery, there are a bunch of pod meta labels available that you can add to your metrics, except you have to add them yourself; none of them are there by default.

We're not going to go into how they're added. There's a really nice blog post by Brian Brazil with a flow chart, and I like flow charts, about how the meta labels are added to your metrics if you so choose. I didn't include it because it looks tricky and I don't like it. This is how they interact with the other labels that come from the applications themselves, and this is how, eventually, it all becomes part of the metric labels in Prometheus.

You might not always want all your metadata duplicated on every single metric and every single signal, though, so the OpenMetrics specification also has something called target info metrics, or info metrics.
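An info metric carries metadata as labels on a constant-valued series, and you join it onto your real metrics at query time. A sketch of that join, using hypothetical metric and label names:

```promql
# Attach the datacenter label from target_info to an app metric,
# matching on the shared identifying labels (job, instance),
# and keep only series from one datacenter.
http_requests_total
  * on (job, instance) group_left (datacenter)
target_info{datacenter="us-east1"}
```

The `group_left (datacenter)` clause is what copies the metadata label onto the result, so you can filter and group by it as if it lived on the metric.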
This is another way you can have metadata be part of your data when you're querying. You can have a separate, static-value metric that exists not to measure something but just to carry metadata. So in this case you have the target_info metric, which gives you metadata about the environment: what the environment is, what the host name is, what the data center is, what the region is, who the owner is. Then, if you want to use this kind of metadata metric in your queries, you take your metric and do a PromQL join with it at query time. This is an example of an HTTP request metric that you want to filter using metadata: the availability zone is not part of the metric itself, but you can perform a join and filter by the region you care about. If you want more labels on your dashboards, or more labels on your data, you just add them to your group by, and you get the benefits of the metadata.

These seem like trivial queries, though. The second you start doing something a little more complex, like here, where you're breaking down the 95th percentile latency by availability zone, your queries almost immediately become complicated. And that is how Prometheus does it.

What does OTel do? We're going to be talking about OTel quite a bit today, so a quick primer on what OTel is and what it isn't. OTel is a vendor-agnostic, tool-agnostic way of ingesting, managing, and exporting your data. Crucially, what it's not is a storage solution or a visualization solution. You can have things sent using the OTel protocol, OTLP, to observability backends; you can use the Collector to do lots of processing, like we've seen today, and send to observability backends; but it's not an observability backend itself. That said, it does have really nice data models.
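To make the "pipeline, not backend" point concrete, here is a minimal Collector configuration, sketched with a hypothetical backend endpoint; it receives, batches, and forwards, and nothing more:

```yaml
# Receive OTLP, batch it, and forward it to a backend (not store it).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  otlphttp:
    endpoint: https://observability-backend.example.com  # hypothetical

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Storage and querying remain the backend's job; the Collector only moves and transforms the telemetry in flight.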
I like those data models. Unlike Prometheus, OTel does differentiate between signal attributes, instrumentation scope attributes, and resource attributes. Signal attributes, like we talked about before, are things that come from the signals themselves: for example, in a CPU utilization metric, the state of the CPU is a signal attribute. Signal attributes aren't metadata, but the other two are.

Resources in OTel are immutable representations of the entities producing telemetry; they describe the actual source the telemetry is coming from. For example, if the thing producing telemetry is running in a container on Kubernetes, it has a pod name, it has a namespace, it's part of a deployment which has a name, and all of that information is part of the resource attributes.

OTel has another level of scoping called instrumentation scope. Instrumentation scope refers to the logical unit in the application code that's creating, or instrumenting, the telemetry. This is typically the instrumentation library, but you could put other things in there, like the service URL or whatever else is producing the telemetry.

The metric we saw earlier had all of this. The CPU utilization metric, for example, has the resource information, which tells you where it's coming from; it has the instrumentation scope, which tells you it's the host metrics receiver in OTel producing the CPU metric; then you have the signal information, which tells you, oh, what's the CPU state; and then you have the timestamp and the values.

But again, OTel is not an observability backend. So even if its schemas are nice, and I quite like schemas, we often have to transform the data to fit the conventions of the backend. Which means: if OTel is paired with Prometheus, what do we do with all this fancy
information we have in resources? OTel will take all those resource attributes and create the target_info metric we saw before in Prometheus, sending all of the resource attributes as labels on that metric. So in this case you have the namespace, the container, the pod, and the region, all as labels on the target_info metric, and then you join with it as you would in Prometheus land.

Sometimes, again, you don't want the separate metric containing the metadata; you don't want to deal with the joins in Prometheus. In OTel you can use the Collector and have these resource attributes added to your metrics, using processors like this one, where you take a resource attribute and move it to the metric attributes so that it's actually part of the metric itself. And that's kind of the problem I have with resources: I like OTel's data model, but eventually what matters at query time is the data model used at storage and at query.

There are also other problems with resources in OTel. There's a lot of commingling of entities. In the example we talked about, the resource contained the container information, the pod information, the region, the cloud, and all of those are actually different entities, and there's no way for us to define the smallest set of identifying information in OTel. There's also a lack of precise identity. You might have a lot of attributes on a signal that identify a resource, and a lot of attributes that give you information about it but don't actually change the resource itself; that's not supported in OTel yet. There is no way to distinguish between identifying and non-identifying attributes in OTel. And you can't change your attributes.
A resource is immutable, which means if I change my VM name, I'm not actually changing the underlying resource, which is the VM, but in OTel a new attribute means a new resource. And then there are always high-cardinality issues when you have too many attributes on a resource, too many labels, and a lot of backends can't deal with that very well.

Now that we've looked a little bit at the instrumentation side, with the time I have I'll go over what some of our data models look like, and how an observability backend has a different data model. Maybe surprisingly, or unsurprisingly, Google Cloud also has an idea of resources, called monitored resources. The key distinction here is that a monitored resource is the minimum set of attributes that uniquely identify an entity. In this example we're looking at the gce_instance resource, which is a VM, and it has only three attributes: the project ID, the instance ID, and the zone. All other metadata about the VM, whether that's the VM labels or even the VM name, is stored separately in a metadata API. Monitored resources in Google Cloud, for tracing, for logging, for metrics, are the primitives we use when ingesting into our backend.

This is an example of observability in context, where we use some of this information. Here I'm filtering with some of those resource labels. I'm looking at my VM logs, and when I look at them I can see the VM name, even though it's not part of the telemetry at all. I can see the resource labels, and I can see which logs are coming from this VM. Then, conveniently, I can go and start looking at metrics that come from this VM. The metrics, again, have more metadata, like what annotations the VM has and what other integrations it has, and I can drill down deeper.
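The monitored resource behind all of this is just those three identifying labels, roughly in the shape the Cloud Monitoring API uses for a time series; the ID values here are made up:

```json
{
  "resource": {
    "type": "gce_instance",
    "labels": {
      "project_id": "my-project",
      "instance_id": "1234567890123456789",
      "zone": "us-central1-a"
    }
  }
}
```

Everything else about the VM, its name, its tags, its machine type, lives in the metadata API and gets joined back in later.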
I can see all the metrics for this VM. Within metrics, too, I can drill down and look at how metadata is a first-class citizen in some of our UIs. I can group by not just the system labels but also the metadata labels that are not part of the signals, like, in this case, machine type. I can filter by them and group by them as though they were always part of the metrics themselves, except they were never there.

This is kind of why I wanted to do this talk, because it's not magic. What we do internally at Google is really similar to what the OpenMetrics specification does. We also store metadata as metrics: we write the metadata as static-value metrics, with the metadata as label fields. When you have a metric written against a monitored resource, we have metadata written as metrics, and before your metrics are shown to you in the query explorer or your dashboards, we actually do a meta-join, join them together, and show you a joined view of the table. This is what powers a lot of our dashboards and a lot of our UI, and it's why metadata in Google Cloud is treated very similarly to metric labels even though it's stored separately.

And that's kind of where we are now. There's a lot happening in the evolution of metadata, and all these ecosystems are converging. There's still quite a bit of work to be done, but I think a lot of it has to happen in open source, as the telemetry signals become more powerful and are used together. We've covered only metrics today; logs and traces have their own solutions, and their own bespoke ways of dealing with this, and we don't have time to go into that. Also,
it's almost lunch.

OTel is doing a lot of things. OTel is actually trying to deal with the problems with resources; they're working on the concept of entities and identity. Prometheus is working on improving how joins with target_info work and making them more ergonomic. There's a lot of stabilization happening across OTel components and specifications, and there are a lot of semantic conventions, which we heard about today, that are all improving this experience rapidly.

And that's kind of where we are today. We came from a world where the solutions for these signal types were pretty independent and arose from independent needs; then we started standardizing them; and now we're making them interoperable with each other's ecosystems. I think we're headed to a world where we think about observability more holistically. Profiling is becoming a popular fourth signal, and metadata is what ties these experiences together and connects logs with metrics. But this space still needs some maturing.

I was struggling to end this presentation, so I figured I'd leave you with a few tips before I duck away for lunch. Stick to conventions. I like schemas; when they don't exist, conventions are the next best thing. Also, conventions are basically schemas that we just agree on. Add metadata proactively: it helps your queries, it helps other teams, it helps make sense of your data, especially when things are stressful and you need to know what these metrics mean. Again, schemas are nice; I threw that in there because I do like schemas, fight me. Metadata is pretty loosey-goosey right now, and we need to develop schemas and conventions around it. Interacting with metadata is pretty bespoke: monitoring has its own solution, and logging and tracing have their own solutions. And we saw the definitions of the signals earlier; they all have things in common. They all have metadata,
they all have timestamps, and we really should find a way to have a single source of truth for metadata that all the signals can interact with. I know Timescale tried this with Promscale before, but maybe they were a bit ahead of their time.

And that's it, that's all I have. Thank you for listening to me ramble.