Anyway, so welcome to an update on OpenMetrics. If you could close it, oh, perfect. The way we structure this is relatively easy: we do a speed run through the history for those who don't know it, then we come to the current state, and then we look towards the future. At the end, if we have room for questions, Kim is going to run around with the mic.

So, looking at observability data: historically you had basically one standard in this space, which is SNMP. Anyone who has ever dealt with networks is probably familiar with it, and there's a little bit of a love-hate relationship. One of my major pain points with SNMP is actually that it's based on ASN.1, which is super old: seven-bit encodings, highly efficient, and an absolute mess to get right. And it's really, really, really, I see people laughing, yeah, it's really a mess and it's kind of risky. Implementations are often shabby and slow, and the interfaces are hard to work with. Yes, you can walk through them, but it's really inconvenient. The data models are vastly different between vendors, sometimes even between versions of the same software. And the thing about hierarchical data models is that the hierarchy almost never fits your needs. If you have region, data center, customer, and you want to group by customer, congratulations, your data is laid out wrong. Those are the pain points which existed before Prometheus.

After Prometheus, it is the de facto standard. Anyone in this audience who is using Kubernetes is very likely also using Prometheus or something from the Prometheus family, because otherwise you're going to have a really bad time with Kubernetes. The same is true for the Prometheus exposition format: it also became a de facto standard, but not a real standard. I'm speaking as a Prometheus team member here, and there are quite a few of us up front. There was an absolute explosion over the years of compatible endpoints, with literally hundreds of thousands of installations of Prometheus alone, and literally millions, probably tens of millions (this slide is two years old) of users who use these formats directly or indirectly every single day to keep their stuff running. We have standard exporters, we have standard libraries, in particular the Go library is one of the most used Go libraries on earth, so there is substantial success. But again, this is no official standard. And label sets: I don't have to convince you that labels are better than hierarchical data.

There's also some politics involved. In former times, and the CNCF is getting bigger and bigger every year, comparing this to 2016 and the first CloudNativeCon, what's happening here is awesome, a lot of vendors and projects were very, very torn on whether they should adopt something which carries a different project's name, this thing called Prometheus. It has gotten better over the years, but there is still some smell attached to adopting a single project's format, in particular back then. A lot of traditional vendors simply respect standards: you write it down, you say this is the specification, implement this and you're going to be fine, and that makes a lot of conversations easier. And obviously we wanted to reuse the complete install base of Prometheus. We didn't want something which diverges too much, because otherwise you split the community and that hurts every single person. You need to maintain long-term stability and long-term compatibility.
So many, many different companies chipped in to make this happen, and the result is an actually neutral standard. There are several things in there which Prometheus didn't even really need, but which others needed, in particular underscore created, which Google wanted for their Monarch installations. These are things which aren't there for Prometheus' own purposes, but to make a better general standard. Yeah, this is a shout-out to the core group. I'm here alone today, but those are the people who actually made things happen.

So what does this mean for you? If you're using the Prometheus exposition format, and there are still quite a few large installations doing that, and that's completely fine, it is largely the same. And again, this is on purpose. For quite some time, most of you in this room have probably already been using OpenMetrics without even realizing it, or without caring about it, because it just keeps working, because we spent so much time making it work. We see the occasional blip, but by and large, things just work.

There are breaking changes. You need an underscore total suffix if you have a counter, and sometimes that leads to collisions. The libraries which have migrated handle this transparently for you, but if there is a naming collision, obviously there isn't much you can do. Underscore total at the end was already an anti-pattern for non-counters in standard Prometheus, so anyone who did things by the book does not run into this issue, but the fact is that a few people did run into it. The other thing is that timestamps are now in seconds. One of the things we try really, really hard to do in Prometheus is to always use base units. One of those non-obvious things which happens if you just count joules, as a measure of how much energy you use: by putting a time component onto it, it automatically turns into watts. That's just how the SI system is designed, and physicists put a lot of effort, years and years ago, into making those different units convert seamlessly into each other. That's why we believe so strongly in base units, and that's why we made this change and don't have milliseconds anymore.

There's also a ton of cleanups relative to the Prometheus exposition format. It's a lot cleaner, it's a lot tighter. With the Prometheus exposition format you could have more or less endless amounts of whitespace; you can't anymore. You can now see whether a scrape actually completed. You can have nanosecond resolution if you need it. Prometheus doesn't support it, but you can do it through OpenMetrics; again, one of those places where OpenMetrics goes beyond what Prometheus can do. It's all 64-bit, as in Prometheus. We have a new metadata type for the unit. Underscore created I already mentioned, and we're going to see it on the second-to-last slide. And we support push explicitly in OpenMetrics, even though Prometheus does not really support it. There's been work on this and it's getting there, in particular with remote read/write you can do a lot of stuff, but OpenMetrics again goes beyond what Prometheus can do here.
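To make a few of these points concrete, here is a minimal sketch of an OpenMetrics exposition from the Go client library; the metric name, values, and port are made up for illustration, and details such as whether a created line is emitted depend on the library version.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter in base units (joules, not kilowatt-hours), with the
// _total suffix that OpenMetrics mandates for counters.
var energyUsed = promauto.NewCounter(prometheus.CounterOpts{
	Name: "device_energy_joules_total",
	Help: "Energy used by the device, in joules.",
})

func main() {
	energyUsed.Add(42)

	// EnableOpenMetrics lets the handler serve
	// "application/openmetrics-text; version=1.0.0; charset=utf-8"
	// when the scraper asks for it via the Accept header, and fall back
	// to the classic Prometheus text format otherwise.
	http.Handle("/metrics", promhttp.HandlerFor(
		prometheus.DefaultGatherer,
		promhttp.HandlerOpts{EnableOpenMetrics: true},
	))

	// The OpenMetrics exposition then looks roughly like this (a
	// device_energy_joules_created line may also appear, depending on
	// the library version):
	//
	//   # HELP device_energy_joules Energy used by the device, in joules.
	//   # TYPE device_energy_joules counter
	//   device_energy_joules_total 42.0
	//   # EOF
	//
	// Note the mandatory # EOF marker, and that any timestamps are in
	// seconds. Because the value is in joules (a base unit), PromQL's
	// rate(device_energy_joules_total[5m]) is joules per second, i.e. watts.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```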
And one of the really nice things, I don't know if you saw the documentary about Prometheus which aired on Tuesday, there's also a point in there about how just the ability to connect to an endpoint and read it with your web browser is extremely powerful, which is why we mandate that this is carried forward, so people can easily debug.

Exemplars are the one big highlight feature. If you don't know what exemplars are: many, many moons ago, when I met with a few people at Google and we were considering merging OpenCensus with OpenMetrics, way, way back, they mentioned that for them, searching for traces didn't scale. And when Google tells you that searching for something doesn't scale, you probably better listen and figure out why, because they probably did the math on this one. The thing is, there was just too much data. So the approach here is: you have your label sets on your metrics, or, with Loki, even on your logs, and the rest is just an ID. An ID which can point to a trace, which can point to a span, which can point to both, but that's it. And then you can use this to look up that specific trace directly, which means you don't have this thing where you sample a lot of data away and then can't jump into your traces: you see this error and it would be interesting, but it got sampled away. No, if I see my high-latency bucket, I can jump directly into a trace which landed in that high-latency bucket. That is much more efficient for the computers, but it's also much more efficient for the people.

Datadog also deserves a shout-out. They invested quite a bit of engineering time to make this work within the Python parser, which was really nice of them. You normally don't know them as a company hugely invested in open source, but there they really made an effort, and I still believe they deserve a shout-out for it.

OpenTelemetry: in the early years there were some bumps in the road, but those have been completely smoothed out, so to speak. I myself am by now an OpenTelemetry voting member, just based on all the contributions to make sure that everything is and remains compatible, and it is, which is super nice. OpenMetrics is also officially part of the Prometheus conformance program.

How do you spot these things? In the HTTP header you can even see the format directly. Otherwise, for example, when you see an underscore created, it's pretty certain to be OpenMetrics. If you see an EOF marker written at the end of the exposition, like this, you know it's OpenMetrics. It should not matter to you, because Prometheus and everything else handles this completely transparently and you won't even notice, but if you're doing hand debugging or something, those are the telltale signs. And underscore total, as mentioned earlier, is one of the breaking changes. If you want to transition by hand, again, you really need to be careful with your underscore total. You should make certain that you send the correct content type, not the Prometheus exposition format 0.0.4 or plain text anymore, but actually OpenMetrics 1.0. And please set the Accept headers, or else you might not have the greatest of times.

There are a few known issues in 1.0. Again, we sometimes get reports from people who have clashes in their namespace because they have an existing underscore total. There's not a lot we can do about this, because that's just how the new standard works. And it is actually a good thing: we just made a previously implicit requirement and strong recommendation mandatory.
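Since exemplars are the highlight feature here, a minimal sketch of what attaching one looks like with the Go client library; the metric name, buckets, and trace ID are made up for illustration, and this is one possible way to do it rather than an official recommendation.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Help:    "Request latency in seconds.",
	Buckets: []float64{0.01, 0.1, 1},
})

// observeWithTrace records a latency observation and attaches the current
// trace ID as an exemplar. The exemplar is a tiny label set plus the observed
// value; it is not a regular label, so it does not blow up cardinality.
func observeWithTrace(latencySeconds float64, traceID string) {
	requestDuration.(prometheus.ExemplarObserver).ObserveWithExemplar(
		latencySeconds,
		prometheus.Labels{"trace_id": traceID},
	)
}

func main() {
	observeWithTrace(0.067, "abc123def456")

	// In the OpenMetrics exposition, the exemplar shows up behind a '#'
	// on the bucket line, roughly like this:
	//
	//   http_request_duration_seconds_bucket{le="0.1"} 1 # {trace_id="abc123def456"} 0.067
	//
	// A backend that understands exemplars (for example Prometheus started
	// with --enable-feature=exemplar-storage) can use that ID to jump from a
	// high-latency bucket straight to the matching trace.
}
```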
So at some point we had to eat that naming pain. The other known issue: if you have counters and you have underscore created in there, this can double the number of samples and the cardinality. That can be a surprise to a few people, and that can also lead to pain. We might make this optional in OpenMetrics 1.1; that's not yet clear. I guess we will probably do it, because it's just, okay, I see people shaking their heads, and they're on the Prometheus team. See, it needs some more discussion. And the other thing is that we found one bug where we had one MUST wrong, but since the EBNF, the Extended Backus-Naur Form, is what most people, at least in IETF space, implement against anyway, we should be fine. These will be fixed in 1.1.

What's up for 2.0? The highlight feature is high-resolution, or native, histograms, where you don't have this le-bucket thing anymore, where you need to set your own boundaries for your buckets and everything. You can just toss data at it, and in most cases it does everything as it should. In the cases where it's not doing what it should, you can tune the basic parameters of how the bucket boundaries are auto-computed. Native histograms are really a game changer. The other thing this does is introduce, for the first time, complex data types into Prometheus. Previously you needed one single metric for every single thing you wanted to talk about. Now you can actually have more than one data point, one sample point to be specific, in one sample, and that's going to be new. Obviously we need it to be more efficient, but it also means quite some changes on the backend, and it can lead to slower query performance; there's quite some work being done to keep that from happening.

The other thing, and I just had this conversation with Chris Aniszczyk: way, way back, in the dark CNCF ages, he asked us to split OpenMetrics out of Prometheus, to have its own thing, its own standard. At the time this absolutely made sense. These days the landscape has massively changed. People have basically stopped fighting with Prometheus, because most people just accept it exists and it's the thing which won. So maybe, but this is another point of discussion, maybe we release a 2.0 and then fold it into Prometheus, or maybe we fold it into Prometheus and then release a 2.0, or whatever; we'll see. There are a few resources here for those who want to take pictures.

And now we come to the questions. Before we come to the questions, sorry, one more mask reminder, because you signed a legally binding contract. Yes, you, and you, other person with the nose out. Yes, you. Oh, that's, yes, go ahead.

All right, cool. So actually, on the last slide you had with the references, I guess I'm confused: if I'm starting from scratch right now, what format do I write in? OpenMetrics, OpenTelemetry? Maybe I don't understand the difference between OpenMetrics and OpenTelemetry.

So you're opening a much wider question than just OpenMetrics, but that's completely fine. Ideally, you don't really care, because you use a library and the library does the thing you want. That makes sense. If you know that you want to be using Prometheus on the other side, then my own personal recommendation would be to use the Prometheus client libraries, because they're much more efficient.
They do one thing and do it well, as opposed to what OpenTelemetry is solving, where they need a complete data plane in between, in the standard, to be able to transform from different formats into other formats, to allow this ubiquitous, transform-into-everything kind of model. And that does not come for free, obviously. So that's a little bit of the decision; for the rest, you can use either. All right, thank you.

Thanks, Richie. So, thanks for the concept of the roadmap; it's exciting to improve this. I was just asking: what was the historical reason for creating another metric for this created timestamp? Why not just immediately put that into metadata? Is there any blocker, maybe? Do you remember?

We had many, many discussions, and the short version is that Google really, really, really wanted to have this, because they needed it for Monarch. And honestly, back then we did some back-of-the-napkin math and we were like, okay, this should be fine. These days, with metric data exploding more and more because people keep adding more and more, it's arguably a little bit of a victim of its own success here. And yeah, that's why it's directly in there. The other reason is that we didn't have any mechanism within OpenMetrics to persist this type of metadata. There's a good reason why we could do this: unless you do something fundamentally wrong and update created all the time, it doesn't actually change. So on the cardinality side, on the index, it does impose some cost; on the actual block storage it's almost free. And I think we focused maybe a little bit too much on how it was almost free on the actual TSDB block size and not so much on the index side.

Thank you for the talk. Can you tell us something about the semantics of the created and the total metrics? Sorry, you mean the underscore created? Yeah. It's really easy. Let me see if I have an example; it's right here. The thing is, as you can see, the type of foo here is histogram, not counter, but same difference. And as you can see, it's only foo, and then you add those other extensions. You add the bucket boundaries, again, not high-resolution, not native histograms; this is the 1.0 style. You see the count, you see the sum, and you see the underscore created. And the underscore created is literally just the timestamp of when you created the thing. Okay, thank you.
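The slide example being described is roughly the following, sketched here with the Go client library; the name foo and the observed values are illustrative only.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// A classic, OpenMetrics-1.0-style histogram with explicit bucket boundaries
// (not a high-resolution / native histogram).
var foo = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "foo",
	Help:    "An illustrative histogram.",
	Buckets: []float64{0.1, 1, 10},
})

func main() {
	foo.Observe(0.05)
	foo.Observe(3)

	// The exposition for this one family hangs everything off the plain
	// name "foo" and looks roughly like:
	//
	//   # TYPE foo histogram
	//   foo_bucket{le="0.1"} 1
	//   foo_bucket{le="1.0"} 1
	//   foo_bucket{le="10.0"} 2
	//   foo_bucket{le="+Inf"} 2
	//   foo_count 2
	//   foo_sum 3.05
	//   foo_created 1.6e+09
	//   # EOF
	//
	// foo_created is literally just the timestamp, in seconds, of when the
	// histogram was created (whether a client library emits it by default
	// depends on the library and version).
}
```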
Oh, there's one over there. Hello, I had a question about, let's say, upgrading a metrics library for a language, moving it over to OpenMetrics from the existing format. I'm a little bit worried about changing the names of the metrics with the underscore total suffix. Is there a strategy for moving over without completely having to rename your metric and lose that data? Is there a way to gradually move that over?

So first you should look at your data to see whether this is actually a concern, because chances are, if you followed Prometheus best practices, it shouldn't be: you wouldn't have underscore total on anything that isn't a counter, since you're supposed to put the unit and something meaningful at the end, and that's usually not underscore total. So the likelihood of this existing is not super high. If it exists, you need to find some plan. If you write your own library, because that's how I understood you, then you can in theory just not do it and not be fully compliant. It works, but it's not nice. If you use a real library, you're going to run into problems, and ideally you start the naming migration before you do the OpenMetrics migration. The reason being: if you have two lifts at the same time and something goes sideways, you're going to have a really bad time. It's better to do one step, wait a month, see that things are stable and have settled down, and then do the next step; break it up into two smaller migrations. Great, thanks, that makes sense. Yeah, no worries.

I guess you were first. Hello, earlier this year I was looking into this for a project and saw, oh, there are protobufs that I could use, which I would have been happy to use. But I never found out what the state of things was with that and I just moved on. Is that something that's just an idea, is it implemented, is it spec'd out? You find the protobufs in the spec, and people were talking about it, but it wasn't clear to me whether it was an idea that came up and was abandoned, or whether it's going to be available in some future version; I was completely confused.

I'm just going to put Goutham on the spot here. Do we already ingest the OpenMetrics protobuf directly into Prometheus? Okay, then the answer is currently no. So, complete transparency and honesty, with my Prometheus hat on: we were kind of waiting for native histograms to settle, because we knew we needed that larger change anyway. Initially, in Prometheus 1.x, we supported protobuf, and then we took it out because it didn't have any efficiency benefits for us at that time. And we've kind of been waiting. OpenMetrics mandates the text format; it does not mandate the protobuf. But with 2.0 you'll have it, and then you also have the native histograms. It will still mandate the text and not the proto, because, again, this debuggability with just your web browser is something we highly, highly value and we don't want to lose. But there are more efficient ways, I agree.

Thank you. I just wanted to ask: you referenced remote write and the effort to standardize that interface. It is actually one of the interfaces used the most to plug Prometheus into long-term storage and things like that. Can you comment more on how this is going to impact the standardization and what is going to happen, especially on the Prometheus side?

So those are different things. When you have your exporters and your instrumented whatever here, and you have your Prometheus here, this is where OpenMetrics is spoken. If you have your Prometheus here again, and you have other Prometheus instances or a long-term storage, that is where you speak remote read/write. So it's a fundamentally different use case. In theory, you could hack together a library which uses Prometheus remote write to push from those workloads down there, but that's not really what the whole system is optimized for. We do have discussions about how we're going to evolve the remote read/write format. We might actually just steal OTLP and shove that into Prometheus, but that's an ongoing discussion; we don't yet know.
I was asking also in relation to what is happening on the OTel side, where you have this agent which is essentially scraping and then pushing things. Is there any idea on the Prometheus side, scoped only to metrics, to do something similar and use a standard interface? Thank you.

I would wait a week, and maybe you'll find something which is nice. I'm looking at you, yeah. It's going to get easier over time.

Yeah, I had another question. I'm really excited about exemplars; just wondering what the rate of adoption is. I noticed you mentioned Cortex and Prometheus itself. Are there any updates on whether Tempo is supporting exemplars, or is there some kind of additional work to get that working? Or is that kind of separate?

No, no, I get it. So first, I forgot to write Mimir on this slide, but of course it also supports exemplars. Tempo is written with exemplars in mind: Tempo is designed to just take these IDs and give you a really, really quick way to access your tracing data. The newer versions of Tempo also allow search and such, where if you want to shove labels into the thing you can do so and it will work, but if you want to run it in a really efficient manner, exemplars are how Tempo is supposed to be used. Are you referring to Parquet and TraceQL, with the search for Tempo and labels? Yes, yes. Okay. That's basically distinct from this happy path of: I have only exemplars and I can jump into stuff really, really quickly.

And are the labels you're talking about part of the OpenMetrics format as well? Sorry, what? You mentioned labels, searching by label in a tool like Tempo. Oh, so, the format of the exemplars: we deliberately did not mandate this in the specification, but all the examples which you find in the specification are taken directly from the W3C standard for propagating tracing context. Morgan McLean and others were working on this within the context of the W3C, and you'll find the same in OpenTelemetry, the same shared history. We didn't feel comfortable mandating it and writing it down, we didn't even feel comfortable writing a SHOULD, but we made certain to put the breadcrumbs in there: you will find the W3C standard from our specification. And if you do the usual thing, look at the EBNF, look at all the samples we have in there, and start implementing your thing, you automatically land precisely on the W3C standard for distributed tracing context propagation, and that's no mistake. Great, thanks, good to hear.

What do you see going forward? What are you interested in for OpenMetrics 1.2 or 2.0? What interesting ideas do you have coming up? Well, for 2.0 the main thing is, again, the high-resolution histograms. That's going to be big; with pretty much all of my hats on, this one is going to be a big one, and we've been working towards it for literally years. We also spent quite some time making certain that the approach we're taking within OpenMetrics and Prometheus is directly congruent with the approach taken within OpenTelemetry, so that you have good, 100% compatibility between all of those, in anticipation of the ecosystem at large moving towards native histograms relatively quickly.
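To give a sense of what that looks like from the instrumentation side, here is roughly how a native histogram can be declared with recent versions of the Go client library; the field names are from client_golang 1.14 and later, and the metric name and parameter values are illustrative, so treat this as a sketch rather than a recommendation.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// A native (high-resolution) histogram: no hand-picked Buckets slice.
// Bucket boundaries follow an exponential schema that the library picks
// automatically; NativeHistogramBucketFactor bounds their resolution
// (1.1 means neighbouring buckets differ by at most roughly 10%).
var requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:                            "http_request_duration_seconds",
	Help:                            "Request latency in seconds.",
	NativeHistogramBucketFactor:     1.1,
	NativeHistogramMaxBucketNumber:  100,
	NativeHistogramMinResetDuration: time.Hour,
})

func main() {
	// Just observe; no thinking about boundaries up front.
	requestDuration.Observe(0.042)

	// Note: the native histogram data is currently only carried over the
	// protobuf exposition, and Prometheus needs to be started with
	// --enable-feature=native-histograms to ingest it; a text representation
	// is part of what OpenMetrics 2.0 is about.
}
```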
It is one of the most voiced pain points with Prometheus-style metrics that we only have really flat metric types, that we don't allow complex types. So that's the thing anyone is most excited about, I think. I don't know, you'd need to come back here in two or three years. Just for the online audience, the question was: what about 3.0? Any more questions? There's one more.

Just a question regarding the exemplars. The exemplar is just one random entry out of the total? For example, the bucket with le 0.1 has a total of one, and this is one of those, one randomly chosen event, if I got it right?

So the string you see in those curly braces is what I talked about with the W3C standard. Yes, I got that, but there are several events, several IDs, contributing to this bucket, and the one with ID abc is just one of them? Oh yes, yeah. By design, per exposition you can only expose one single exemplar, because we don't want to overwhelm the storage or anything, and also we don't want to give people a large foot gun and turn Prometheus into an event monitoring system, because then you're going to have a really bad time. So you can only put one in. The other thing which you see behind there, and again this is completely optional, because we're waiting for people to figure out how to use this and how it's actually useful for them: our thinking here was that we put in the specific value of that specific exemplar. You have those buckets, and buckets are nice, but we're almost back at native histograms there, and it's also nice to have more specific hits. That's why we showed how you can have your trace ID and at the same time give the user exactly the information about the specific runtime in that one latency bucket, or whatever. So it's up to the application or the library which one to choose as an exemplar? Yes, you can basically put 64 characters of UTF-8 behind this hash sign and it's just going to be handled; the rest is, again, deliberately undefined, with a suggestion of how to properly do it if you want. Thank you. Thank you.

Hi, are there any SaaS vendors supporting OpenMetrics natively, or is it better to just use Prometheus and exporters? So, Grafana Labs, because we run Mimir. I don't know if AWS is running Cortex internally or user-facing, and whether they actually enabled it for users; I honestly don't know. So at least there is Grafana Labs. Datadog, I don't know if they have it in their service, I honestly don't know. Do any of you know of others? No? Okay, so at least Grafana. Okay, we have five more minutes for questions if anyone has any more. There's one more.

Just wanted to clarify vocabulary: when you say native histograms, is that the same thing as the sparse histograms? Yes. Okay. It shows how long we've been at this that we have half a dozen different names; exponential histograms would also be one. I think the PR which I saw merged today was from the branch sparse histograms, and the PR description said native histograms, just to give you an idea of how much of a mess this is. But going forward we are having one name, one name which is synchronized between Prometheus and OpenTelemetry, to just have... We have two names. We have two names: exponential and sparse. Okay, apparently we still have exponential and sparse. But maybe we can hash this all out with the others and... Not at all. If it's... Yeah.
Take the mic if you want. I mean, it's not really an answer. What I hear is exponential histograms in OpenTelemetry; sparse histograms I hear in other communities. I thought we had fixed it, but I don't know; I mean, that's the reason why we have a working group. The message type in OTLP is exponential histogram, and so I think we're going to stick with that for a while now. Yeah, okay. I wasn't here, sorry. Yeah, it's been part of the spec for OTLP for quite some time now, so the name was specified before. Yeah, personally I don't care about the name. I want to have one name which everyone is using, because I also get confused; the rest I don't care about. But yeah, apparently we have work to do. I thought we had fixed it, sorry. Anyone else? Or do you want two minutes of your time back?

Can you just comment on the timeline for 2.0? Are you waiting to stabilize the exponential histograms implementation before pushing the standard?

Okay, that's basically it. Even before we cut the 1.0 release, we already had consensus that exponential, or sparse, histograms would be the reason why we cut 2.0. So we've been waiting for this for quite some time. I can't give you any specific timeline, but we've been waiting for quite some time and we want to get this going. Okay, I'm getting the sign that we are out of time. Okay, cool. Ben, thank you very much.