Hi everyone, welcome to "The Future is Bright, the Future is Remote Write". My name is Tom, I'm on the Prometheus team and primarily work on the remote write code. In my day job, I'm the VP of Product at Grafana Labs; I'm still trying to figure out what that means after three years. I also started the Cortex project alongside Julius, and more recently started the Loki project, the Prometheus-inspired log aggregation system. When I'm not working, I like to make 3D printers and occasionally brew my own beer.

So without further ado, today I'd like to talk about four things. I'd like to talk about remote write: how did it start, what is it, what is it for? I'd like to talk about standardizing remote write and the efforts over the last two months to define what it means and to test that. I'd like to talk about what's next, in the immediate future, for the remote write protocol: metadata and exemplars. And finally, I'd like to talk about what's further off in the future for remote write, some of the ideas we've got. I notice the slide says "examplars"; I'd love to know what those are.

So what is remote write? Funnily enough, the story of remote write matches my own story with Prometheus. The first PR I made to the project was to switch the remote write system from gRPC over to protobufs and HTTP. This was because, almost five years ago, it was quite hard to get gRPC through an Elastic Load Balancer, and I wanted to use Prometheus remote write to send data to Cortex. And really, that is what Prometheus remote write is for: sending data to other systems. Prometheus sits there, scrapes your jobs, scrapes metrics from your instrumented applications and exporters, collects them, stores them, but can also forward them on to other systems.

We did this about five years ago, and over the past five years we've seen many vendors take notice of Prometheus and add support for Prometheus remote write to their products. They've made it so that you can send data from Prometheus to pretty much any of the metrics vendors in the world. In fact, if you look at the Prometheus docs, you'll see there are over 30 different projects that accept or send Prometheus remote write. It's amazing how popular it has become in only five years. And we're not resting there; we really want to push the interoperability story in Prometheus as far as it will go.

More recently, in the release before the most recent one, we added the ability for Prometheus to receive remote write requests as well. So you can now configure one Prometheus to send data to another. This solves the global federation problem in a very different way. Federation is the way you can set up a Prometheus to scrape the metrics from other Prometheus servers. It allows you to have a Prometheus server in, say, each region, and then a global one that scrapes all the data from the regional ones, so you get that central view where you can run your central aggregations. However, the challenge here is that this requires the global Prometheus to be able to scrape all the edge ones.
So you have to open up firewall ports and figure out ways of securing three or more different Prometheus servers. With the push-based approach, using Prometheus remote write, you only have to open up, lock down, and authenticate that one central location, and you can have the edge locations push to it. This might not sound like a big deal, but imagine that these edge locations were on changing IP addresses, or maybe on flaky networks; the remote write protocol might be a better fit for that. So this is in the second-to-latest Prometheus release. It's experimental. We'd love for you to give it a go, see what works, see what doesn't work, and see if it suits this use case.

So that's remote write: how it started and what it's used for. What we have noticed with all this adoption by many, many people is that there have been some differences in implementation, and we want to make sure that this ecosystem is interoperable and that all the users of all the different components have the best possible experience.

In particular, one of the things we've noticed is that there's a bunch of projects popping up that fill this niche of scraping Prometheus metrics and sending them elsewhere using remote write. Obviously Prometheus itself is the original one. About a year or so ago, we launched the Grafana Agent, which is a stripped-down, lighter-weight version of Prometheus using all the same code, but without any local storage; you can't do queries, but it will scrape your jobs and send them using remote write. There's also the VictoriaMetrics agent, and InfluxData's Telegraf will scrape jobs instrumented with Prometheus metrics and send them elsewhere using remote write. And then more recently you've got the OpenTelemetry Collector, which will also scrape jobs using Prometheus metrics and send them over remote write. So these are the five systems that I'm going to look at, though there are other systems out there. I have to slip at least one meme into each of my talks.

So we started this effort to standardize it. We wanted a document that described what Prometheus remote write was and what it means to say you're compatible with it. We also wanted to explain a lot of the reasoning behind the decisions, why we've done it this way; it wasn't all just arbitrary. And finally, we wanted to give some thought to how we would future-proof this protocol: how would we upgrade it, and how would we make it so that when we introduce a v2, we can be backwards compatible? I'm hoping that this will allow us to remove the experimental flag from Prometheus remote write.

One of the key things here is that we're not changing anything with this standardization. We're just documenting the current behaviour, how it currently works, because so many people use this that I don't think we can realistically change it, at least not in an incompatible way. This will also allow us to offer some kind of compatibility guarantee. And I just want to say a big thank you to everyone in the community who's commented on the doc and proposed changes; it's really been a big team effort.

So now that we have this standard, it's time to test those agents that I mentioned against this standard.
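Before looking at the test results, it helps to see what is actually on the wire. A remote write request is a snappy-compressed protobuf WriteRequest sent over HTTP POST with a few well-known headers. Below is a minimal, hedged sketch of a conforming sender in Go, using Prometheus's prompb generated types; the metric name, the target URL, and the assumption that the receiving Prometheus has the experimental remote write receiver enabled are all illustrative, and error handling is omitted for brevity.

```go
package main

import (
	"bytes"
	"net/http"
	"time"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

func main() {
	// One series with a single sample; the metric name travels as the
	// __name__ label, and timestamps are milliseconds since the epoch.
	req := &prompb.WriteRequest{
		Timeseries: []prompb.TimeSeries{{
			Labels: []prompb.Label{
				{Name: "__name__", Value: "demo_requests_total"}, // illustrative metric
				{Name: "job", Value: "demo"},
			},
			Samples: []prompb.Sample{
				{Value: 42, Timestamp: time.Now().UnixNano() / int64(time.Millisecond)},
			},
		}},
	}

	// Serialize the protobuf and snappy-compress it (block format, not framed).
	raw, _ := proto.Marshal(req)
	body := snappy.Encode(nil, raw)

	// POST it with the headers the spec calls for. The URL assumes a local
	// Prometheus with the remote write receiver feature enabled.
	httpReq, _ := http.NewRequest(http.MethodPost, "http://localhost:9090/api/v1/write", bytes.NewReader(body))
	httpReq.Header.Set("Content-Type", "application/x-protobuf")
	httpReq.Header.Set("Content-Encoding", "snappy")
	httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

	resp, err := http.DefaultClient.Do(httpReq)
	if err == nil {
		resp.Body.Close()
	}
}
```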
And to that end, over the last week or two, I've built a test harness that runs an instance of each of the agents, exports some metrics for them to scrape, and then configures them to send that data back to the test harness via remote write. In the test harness, we export various different metrics, examine the response we get back over remote write, and check it matches what we expect. This is relatively straightforward: we export a counter and check we get a counter back; we export a histogram and check we get a histogram back. But there's a lot of nuance to this protocol, a lot of different areas which some people do or don't implement, and so we wanted to figure out what level of coverage we had.

These are early results. I'm filming this talk almost a month before you'll be watching it, so hopefully over the course of the next month some of these vendors and projects will improve their compatibility with remote write; we're actively working with all of them to do that.

To start with, obviously we have Prometheus. Prometheus implements all of remote write; the specification is just a document that describes what Prometheus does, so we'd expect it to pass all these tests. The Grafana Agent uses all the same code that Prometheus uses, so it implements the same thing, and again we'd expect it to pass all the tests. Interestingly, we found a bug in the Grafana Agent using this test suite: it turns out we were not removing duplicate labels. We fixed that in the point-one release; it was one line of code.

The VictoriaMetrics agent does pretty well, actually. It has some slight inconsistencies around how it does the up metric. As far as we can tell, it doesn't send the up metric when there's a failed scrape, which is important for the test harness to tell the difference between the agent not doing anything and the agent correctly reporting a failed scrape. It also doesn't implement staleness markers, which are a really important way to tell when metrics go away.

Telegraf is not quite as complete as the VictoriaMetrics agent. In general, it doesn't do up or staleness at all, and it doesn't quite have the right job labels. You can configure it to add job labels, but it's missing some of the service discovery features that would add them for you.

Then finally the OpenTelemetry Collector, the newest kid on the block, so not as far along in its support. It doesn't propagate histograms correctly, doesn't do a lot of the job labels correctly, and has no up metric and no staleness markers. We're actively working with the OpenTelemetry team to get better support, and hopefully in a month or so we'll see where they've got to. I'll be hanging around for a live Q&A at the end, and I'll make sure I bring updated results for that.

Next, metadata and exemplars (not "examplars"). First, metadata. Not a lot of people know that Prometheus client libraries allow you to add help text and type metadata to every metric in your application. Prometheus scrapes this, stores the latest values in memory, and then has an API so that clients of Prometheus, like Grafana or the Prometheus UI, can query it and use it to help build these kinds of UIs. Grafana, for example, builds this particular UI just allowing you to see what each metric is.
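As a reminder of where that metadata comes from, here is a minimal, hedged sketch using client_golang: the Help string and the counter type end up as the # HELP and # TYPE lines on the /metrics endpoint, which Prometheus scrapes, keeps in memory, and serves through its metadata API. The metric name and port here are made up for illustration.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// The Help text and the metric type (counter) are the metadata that
	// Prometheus scrapes, keeps in memory, and exposes via its API.
	ordersProcessed := promauto.NewCounter(prometheus.CounterOpts{
		Name: "orders_processed_total", // illustrative metric name
		Help: "Total number of orders processed by this service.",
	})
	ordersProcessed.Inc()

	// Exposes "# HELP ..." and "# TYPE ... counter" alongside the value.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```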
We want to enable systems that implement remote write to have the same information, and basically the same experience, as Prometheus. So back in February last year, over a year ago, Josh added support for metadata to the remote write protocol. It's an extra field, and it's a bit best-effort: we take some of the metadata and send it periodically alongside your samples. We want to make a series of improvements to this. We want to write the metadata to the write-ahead log, and we want to send it alongside the same metrics it belongs to instead of arbitrarily sharding it. Rob from Chronosphere actually has a PR that achieves some of this, and we want to work with him to get it in over the next few months.

The next thing is exemplars, spelt correctly this time. Exemplars allow Grafana to overlay dots on a graph, and when you click on one of those dots, one of those exemplars, you can jump straight to the trace that the dot represents. Björn added exemplar support to client_golang quite a while ago, and Callum added exemplar support to Prometheus in the last release. Together they let you build this really cool experience that speeds up incident response and debugging workflows and makes the whole system feel a lot more integrated. Prometheus doesn't actually care where these traces are stored: you can store them in Jaeger, you can store them in Zipkin, or you can store them in Grafana Tempo itself.

We want to make this available to remote write endpoints: people who implement remote write should also be able to receive exemplars and offer the same APIs and the same experience. To that end, Callum from Grafana Labs is adding remote write support for exemplars. This writes the exemplars to the write-ahead log and then, as part of the remote write code, tails that write-ahead log and sends them out in batches. This will also enable long-term storage of exemplars and some really cool use cases around that. Callum assures me this is going to be merged by the time you're listening to this talk, and we really hope it will be in the next major release of Prometheus.

So finally, what's next? What are the more long-term things we want to do with remote write? The first thing I want to talk about is atomicity. Before I go into the guarantee itself: a lot of you will know that Prometheus metrics are sometimes composite, actually made up of multiple time series. The example here is a histogram, which is made up of a time series per bucket alongside a count and a sum time series. This is how we build up histograms and how we can tell latencies in your application. And so it's really important that when you run a query, you only see a complete scrape, a consistent snapshot of a scrape; you don't see partial data, or half the updates from one scrape and half from another. Prometheus offers this. It's quite a cool feature of the TSDB and it's really useful. Remote write, unfortunately, doesn't. It actually prevents downstream systems from offering this guarantee, because of the way the remote write client inside Prometheus splits up batches of requests to send them in parallel.
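To make "composite" concrete, here is a small, hedged client_golang sketch; the metric name and buckets are made up, but it shows how one histogram turns into several time series that a query (and, ideally, a remote write batch) should only ever see together.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	// A single histogram metric...
	requestDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "demo_request_duration_seconds", // illustrative name
		Help:    "Request latency.",
		Buckets: []float64{0.1, 0.5, 1},
	})
	requestDuration.Observe(0.42)

	// ...is exposed and stored as several time series on every scrape:
	//
	//   demo_request_duration_seconds_bucket{le="0.1"}
	//   demo_request_duration_seconds_bucket{le="0.5"}
	//   demo_request_duration_seconds_bucket{le="1"}
	//   demo_request_duration_seconds_bucket{le="+Inf"}
	//   demo_request_duration_seconds_sum
	//   demo_request_duration_seconds_count
	//
	// The atomicity guarantee is that a query sees all of these from the
	// same scrape together, never a mix of two different scrapes.
}
```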
So this means that remote write systems might actually get the samples for the different series in a histogram in a different order, and will actually see these partial states. That means they don't have the opportunity to implement this guarantee, which we think kind of sucks. So we want to fix this. We haven't decided yet how. One of the thoughts is that we make sure the entire batch of samples gathered in a single scrape is written to the write-ahead log in a single batch, which is pretty much already the case, and then the remote write system reads that entire batch in a single read of the write-ahead log and sends it out to the remote system in a single batch as well. Aligning this throughout the entire pipeline will at least give the systems at the other end of remote write the opportunity to offer atomicity.

Another thing we're working on solving, or at least improving, is the handling of 429s. 429 is the status code that remote systems send when Prometheus is sending samples to them too quickly; it's a rate limit, a signal to back off. Now, Prometheus is designed to back off and retry on 500s. 500s indicate there was something wrong with the system: it was the server's fault, it couldn't handle the request, please try again. But Prometheus doesn't retry 400s, and that's on purpose, because 400s indicate there's something wrong with the request, and that request will not succeed if you try it again. For example, a 400 might mean an invalid request, it's just gibberish, please don't send me gibberish; or it might be that you've hit some limit on the total number of series, and there's no point trying again because you've hit that limit. And it's weird, I think, that 429 gets treated the same way even though it's a rate limit, because if you back off and retry, the request that received a 429 response might actually succeed.

Either way, here's what happens. We talked about how Prometheus collects data from your applications, writes it to the write-ahead log, and then reads that write-ahead log and sends it to a remote system. If there's a network outage, or heaven forbid an outage on the remote system, that data starts to buffer up on disk, and Prometheus will basically wait, periodically retrying, for the system to come back up. After the outage, Prometheus will try to replay that write-ahead log to the remote system to fill in any gaps, and it will replay as quickly as it can, which is pretty quickly. So it replays many, many samples, the upstream system sends a 429 saying you're replaying too quickly, and at that point Prometheus will basically just drop the data. So after doing all this hard work to buffer the data on disk and make sure it's there during the outage, when the system comes back we replay, get a 429, and drop the data. It's a real shame we do this. We think we can improve this by backing off on 429s, and by knowing how to balance catching up against falling behind, because you never want a system to get behind and stay behind.
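Here is a rough, hedged sketch of the behaviour being described: retry 5xx and 429 with capped exponential backoff, but give up immediately on other 4xx responses. It's a sketch of the idea only, not the actual Prometheus queue manager code, and the URL, payload, and timings are made up.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// sendWithRetry posts one remote write payload, retrying 429s and 5xx
// responses with exponential backoff and failing fast on other 4xx.
func sendWithRetry(url string, body []byte) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 10 * time.Second

	for {
		resp, err := http.Post(url, "application/x-protobuf", bytes.NewReader(body))
		if err == nil {
			code := resp.StatusCode
			resp.Body.Close()

			switch {
			case code/100 == 2:
				return nil // accepted
			case code == http.StatusTooManyRequests || code/100 == 5:
				// Rate limited (429) or server error (5xx): the same request
				// may succeed later, so back off instead of dropping data.
			default:
				// Other 4xx: the request itself is bad and will never succeed.
				return fmt.Errorf("remote write rejected with status %d", code)
			}
		}
		// Network errors also fall through to a retry.

		time.Sleep(backoff)
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	_ = sendWithRetry("http://example.com/api/v1/write", []byte("..."))
}
```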
So finally, another area we're actively investigating and looking at improvements for in the remote write system is its bandwidth usage. I touched a bit earlier on how we designed the system to be pretty simple and pretty stateless, and this is a key principle that I think has really made the protocol easy to implement. There are no interdependencies between messages in the protocol, and this drastically simplifies downstream implementations. It makes things like Cortex easy to write; it makes the adapters between Prometheus and Graphite, or Prometheus and InfluxDB, simple to write. And I think this has been pretty successful: as I said, we've seen 30 different implementations of remote write, and I think one of the reasons it's been successful is that it's a relatively simple protocol.

However, this comes with downsides: it's expensive. The fact that the batches are stateless means they have to repeat labels many, many times; the same labels are sent in most batches, and this really eats bandwidth. We use between 10 and 20 bytes per sample to send via remote write, while Prometheus only uses one or two bytes per sample on local disk, so you can see there's big room for improvement here. In our internal monitoring at Grafana Labs, just to propagate samples between our internal Prometheus nodes and our internal Cortex cluster, we're doing over 90 megabytes a second of remote write. So there's probably a 10x gain here. There are various ideas; we think a lot of the work we're going to do on atomicity and batching and building these consistent batches will allow us to have a kind of symbol table in the remote write requests (there's a rough sketch of that idea below), which will probably reduce bandwidth usage quite a lot.

And that's it, really. We've covered what remote write is, its history, why it exists, and how it started. We've covered our efforts over the last couple of months to standardize remote write, to document how it works and why it works that way, and then to test implementations to make sure they work the way they should. We've talked about what's coming pretty soon, hopefully in the next release or two of Prometheus: improving the way we send metadata via remote write and starting to send exemplars via remote write. And finally, we've talked about some of the ideas we have for the longer-term future: how we want to make remote write atomic, how we want to deal with backing off and retrying on 429s, and how we want to reduce the bandwidth consumption of remote write.

So I'm going to hang around now and take some live questions. If any of this was interesting, please do ask, and thank you very much for listening. Bye-bye.
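As promised above, here is a rough, hypothetical sketch of the symbol-table idea mentioned in the bandwidth discussion: each unique label string is sent once per request and referenced by index, so repeated labels stop costing their full length on every series. This is not the current remote write format or any agreed-upon design, just an illustration of why it could save bandwidth.

```go
package main

import "fmt"

// SymbolRef is an index into a request's symbol table.
type SymbolRef uint32

// LabelRef is a label whose name and value are both indices into Symbols.
type LabelRef struct {
	Name  SymbolRef
	Value SymbolRef
}

type CompactSample struct {
	TimestampMs int64
	Value       float64
}

type CompactSeries struct {
	Labels  []LabelRef
	Samples []CompactSample
}

// CompactWriteRequest carries every unique string once; series refer to
// strings by index instead of repeating them in full.
type CompactWriteRequest struct {
	Symbols []string
	Series  []CompactSeries
}

// intern returns the index of s in the symbol table, adding it if needed.
func intern(symbols *[]string, seen map[string]SymbolRef, s string) SymbolRef {
	if ref, ok := seen[s]; ok {
		return ref
	}
	ref := SymbolRef(len(*symbols))
	*symbols = append(*symbols, s)
	seen[s] = ref
	return ref
}

func main() {
	req := CompactWriteRequest{}
	seen := map[string]SymbolRef{}

	// Two series sharing the "job"="demo" label pay for those strings once.
	for _, name := range []string{"demo_requests_total", "demo_errors_total"} {
		req.Series = append(req.Series, CompactSeries{
			Labels: []LabelRef{
				{intern(&req.Symbols, seen, "__name__"), intern(&req.Symbols, seen, name)},
				{intern(&req.Symbols, seen, "job"), intern(&req.Symbols, seen, "demo")},
			},
			Samples: []CompactSample{{TimestampMs: 0, Value: 1}},
		})
	}
	fmt.Println(req.Symbols) // [__name__ demo_requests_total job demo demo_errors_total]
}
```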