All right. Hi, everyone. I'm Julius. I'm one of the co-founders of Prometheus, and I'm also the founder of PromLabs, my company. With PromLabs, I help other companies use Prometheus, get the most out of it, and build stuff around it. So I do consulting and training, and I build products, like a training website, live trainings, and also a PromQL query builder called PromLens. If you're interested in any of that, check out promlabs.com. But today, I'm going to talk about the PromQL compatibility efforts that I've been working on since last year, so this is mostly going to be an actual Prometheus open source talk. Prometheus has become really successful. It's become the de facto standard for metrics-based open source monitoring. This is really amazing, and it has also brought a lot of other players onto the playing field who want to have a piece of this, who want to interoperate, who want to build a business around it, or who just want to offer a compatible alternative. In this diagram, I'm showing a number of different open source projects and hosted services that Prometheus can send data to. This is not an exhaustive list, just a number of examples. And in general, this is great, because the standard Prometheus server might not be the best fit for everyone. For example, for someone who wants a truly horizontally scalable alternative, Cortex might be better. Or for someone who doesn't want to run their own storage systems at all, a cloud-based service might be very attractive, and so on. So alternatives really offer different trade-offs for the different use cases of different people. The interoperability then also gives people a larger ecosystem of puzzle pieces to put together in ways that they see fit. And as long as this competition stays healthy and open, we will get better monitoring for everyone.
But at the same time, we have to be vigilant and be clear about compatibility expectations, and about the actual delivery on those expectations. If someone does not implement one of our interfaces correctly but says they do, they might be misleading users or surprising them when they least expect it. In the worst case, that's an alert that you don't get at night, which would be bad. Or you might get locked into their specific behavior and feature set, which is no longer compatible with the rest of the ecosystem, so you can't easily switch anymore. Ultimately, this leads to ecosystem fragmentation, making the ecosystem weaker. So we really want to encourage people to be clear in their communication when it comes to compatibility, and to also implement their solutions to be as compatible as possible. Originally, when we released Prometheus, we focused a lot on the implementation of the different components internally: how does the Prometheus server work internally, what does it do, how does it scrape targets, how does it track metrics, how does the Alertmanager dispatch alerts, and so on. But of course, all these different boxes, the components, also talk to each other. And especially now that people have come in and said, okay, we're going to create Prometheus-compatible solutions, or solutions that interoperate with Prometheus in one way or another, the lines between these boxes, the arrows, have become more important. The exposition format is a pretty prominent one, for transferring metrics from a target to Prometheus. That is currently being standardized, or has been standardized, as OpenMetrics, which is the standard for sending metrics to Prometheus and has its own test suite and spec. Then there is the remote write protocol, which is used to send data from Prometheus to some remote storage system, which could be hosted or another open source system that you run yourself.
There's the alerting protocol between Prometheus and Alertmanager. We haven't really seen that being implemented much, because it's pretty specifically tied to those two components and a bit low-level. But the fourth one, where we have been seeing support crop up from different people, in hosted services and open source projects, is our query language, PromQL. That is also what I want to focus on today: people reimplementing PromQL or offering it as part of their solutions, and the compatibility statements they make around that versus what they actually deliver. PromQL is really an important piece of the Prometheus story. Typically, once you collect the data and store it, you then want to run PromQL on it to do anything useful, whether that's dashboarding, alerting, ad hoc debugging, automating CI/CD pipelines, or even exporting the data for later use. And PromQL is quite a powerful language. It's a big interface, for sure. It has a lot of subtle behaviors that you can get wrong or right, so it's quite important, for users not to be surprised, that this gets implemented correctly. There's a number of projects, vendors, hosted services, and so on that now have PromQL support, or claim partial or full PromQL support. You have open source projects in here, you have hosted services, and so on. We're going to go into some of these a bit deeper later. But the question is: how do you even test compatibility? The issue here is that there is no formal spec for PromQL. You can read the documentation on prometheus.io, and it will tell you the most relevant bits and pieces from the user's point of view. But there are still a million small edge cases and subtle behaviors, and if you just read that and reimplemented PromQL completely from scratch, you would probably not get them all right. So how do you test that?
What I ended up doing is basically taking the Prometheus server itself as the reference implementation and testing against that: comparing any vendor implementation against the Prometheus server's own behavior. I wrote a tool for that called the PromQL compliance tester. It takes a test configuration with the reference API, which would typically be your Prometheus server, and the vendor or open source project's PromQL API that you want to compare and test. Then it takes a number of test cases; in the example that I used, I had over 500 different PromQL queries. It expects that both the reference and the test system have been pre-populated with the same data, so that in the best case you would expect exactly the same answer coming back for a query. The tester tool then runs the same query against the same data set in both implementations, compares the results, and writes a test output. The test output includes how many tests failed or passed and a percentage score, based on whether they really returned exactly the same results, modulo some very slight allowances in floating-point values here and there. It also pre-sorts the results before comparing, because result ordering is completely undefined in PromQL unless you use the sort functions. And for the cases where there's a failure, it actually gives you a detailed error message, or an actual diff of the sample values returned, so you can debug and figure out what is actually going wrong. A test summary might look something like this. This is just a very small excerpt of one that's currently linked from one of the blog posts in this presentation, on the PromLabs homepage in the resources section. You can basically see each query that was executed.
And for the ones that failed, you either get an error message or this kind of diff output, where, in this case, you can see that there are slight differences in the actual floating-point values for the quantile operator. This project started as a PromLabs project in the PromLabs GitHub org, but just now, in April 2021, I donated it to the new Prometheus compliance repo, where we as the Prometheus team want to collaborate on bringing more focus on compliance and compatibility testing to the various Prometheus interfaces: remote write, OpenMetrics, and so on. So it really made sense to donate this PromQL tester there as well, and now it lives there. I did two initial test runs of different vendor implementations and wrote blog posts about them last year, so you can find the full details under these two links. Today, I'm just going to talk a bit about the latest results from December 2020. One note: I didn't run tests in the months after that, so it could totally be that some vendors have different results now, hopefully better ones. So keep in mind, this is the state from December 2020. But of course, we hope to have updates again in the future as the Prometheus team. Here are the vendors that were included in the last test run. These are not all of the ones that I had initially listed. For example, I had Logz.io listed, but I think their hosted Prometheus service didn't exist yet, or I didn't know about it yet, so they were not included. And for Wavefront, I can't really say how limited or compatible their PromQL capabilities are, because the HTTP API that they had was so different that it wasn't even possible to run the tester tool against it yet. I have been communicating with them, so potentially it will become possible in the future, but I can't say anything about that yet. So for now, I tested all the ones you can see here.
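The comparison logic described earlier (pre-sorting series and allowing very slight floating-point drift) could be sketched roughly like this. This is a simplified illustration in Python, not the tester's actual code; the function names and the tolerance value are made up for the example:

```python
import math

def series_key(series):
    # Build a sortable key from a series' label set; result ordering is
    # undefined in PromQL unless you use the sort functions, so both
    # result sets are pre-sorted before comparing.
    return tuple(sorted(series["metric"].items()))

def values_match(a, b, rel_tol=1e-12):
    # Compare two sample values, allowing very slight floating-point drift.
    if math.isnan(a) and math.isnan(b):
        return True
    return math.isclose(a, b, rel_tol=rel_tol)

def compare_results(reference, test):
    # Compare two range-query results (lists of series, shaped like the
    # JSON the Prometheus HTTP API returns). Returns True on a match.
    ref_sorted = sorted(reference, key=series_key)
    test_sorted = sorted(test, key=series_key)
    if len(ref_sorted) != len(test_sorted):
        return False
    for ref_s, test_s in zip(ref_sorted, test_sorted):
        if series_key(ref_s) != series_key(test_s):
            return False
        if len(ref_s["values"]) != len(test_s["values"]):
            return False
        for (ts_a, val_a), (ts_b, val_b) in zip(ref_s["values"], test_s["values"]):
            if ts_a != ts_b or not values_match(float(val_a), float(val_b)):
                return False
    return True
```

The real tester additionally produces a diff of the mismatching sample values rather than just a pass/fail flag, so you can see what actually went wrong.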
I want to give one word of caution about interpreting the numeric test scores: don't just look at the score alone. A single difference in behavior causes one test failure, but depending on the difference, it might be a really impactful breakage or just a tiny deviation in a floating-point value. So really look at the detailed test results to see whether a difference actually matters for you and how bad it really is. Also, some differences are more general than others: they might affect multiple query types, versus one specific little bug in a particular function that I only test once. So it depends a lot, of course, on my test query set. Look at the detailed results if you really want to see what is going on and want to make an informed choice when choosing a Prometheus-compatible system or service. Here is a quick result overview table of the different systems I tested. Most of them are at 100% or close to 100%. There are two outliers, New Relic and VictoriaMetrics, which are quite far from that; we're going to see why later. Then there's the notion of cross-cutting issues that can be factored out in the tests. Sometimes an implementation has a general query bug or issue that would cause potentially all queries to differ if you didn't factor it out. For example, an older version of Cortex had a slight query input timestamp parsing bug, so every timestamp in the output would be off by one millisecond, and then none of the results would match. To work around that and still make the rest of the results comparable, I added certain query tweaks that you can turn on for test targets, which say: send the query in such a way that this bug does not occur, so you can still run all the other comparisons.
The query tweaks that were necessary to enable that are listed here, and if you click through to the details in the blog, you will also see exactly which query tweaks were needed. Okay, let's take a brief look at some of the results. The first one is Chronosphere. I don't have to say too much about them, because they got 100%: everything passed. They are based on the open source M3DB system coming out of Uber. The good thing here is that they reuse the native PromQL engine code, which is already very helpful for getting very close to, or all the way to, 100%. Cortex also reuses the PromQL engine code; it's an open source project and got 100%. One little note here: in an older version, when I was still using the legacy chunks storage mode with it, it couldn't execute queries that didn't contain a metric name, so some of the queries were failing and it only got 99.62%. But if you run Cortex now with the new blocks storage, you should get 100%. Grafana Cloud is Cortex-based, and they are using the blocks storage, so they also got 100%, modulo one little cross-cutting issue: they align the incoming query timestamps to the resolution step to enable query caching. This may or may not be a problem for you, but it's something that should be pointed out, because it's not technically correct PromQL evaluation. Factoring out that issue, they get 100%. M3, as the pure open source project: I also tried running that myself, storing data in it and querying it back out, and also got 100%. Great. MetricFire is another Cortex-based hosted service. They were still running a slightly outdated version of Cortex, at least in December, and so they didn't get the full 100%. But they are aware of the issue, and a Cortex update will hopefully bring them to 100%; maybe they are already there if I were to test them now.
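To make that Grafana Cloud cross-cutting issue a bit more concrete: aligning incoming query timestamps to the resolution step means snapping the start and end of a range query to multiples of the step, so that repeated queries over a moving window hit the same evaluation timestamps and can be served from a cache. Here is a minimal sketch of one plausible way to do this; the exact rounding they use may differ, so treat this as illustrative only:

```python
def align_query_range(start, end, step):
    # Snap a range query's start/end timestamps (in seconds) down to
    # multiples of the resolution step. Repeated queries over a moving
    # time window then evaluate at the same timestamps, which enables
    # result caching, but it is not technically correct PromQL
    # evaluation: results come back for slightly shifted times.
    return start - (start % step), end - (end % step)
```

For example, with a 15-second step, a query for start=103, end=217 would instead be evaluated from 90 to 210, which is exactly why the results differ slightly from a strictly correct evaluation.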
New Relic is a hosted monitoring, or application performance monitoring, player that's been around for a long time. Full disclosure: I have been consulting with them a bit on this, trying to improve the PromQL support they were building. The big challenge in their implementation is that they wanted to reuse what they already had. They have an existing query language called NRQL and an underlying database, NRDB, and these were conceptually not really compatible with PromQL's data model and language concepts, so it was not really possible or feasible to transpile everything faithfully into NRQL. As a result, the score they got in my test was only around 31%, with some cross-cutting issues. They don't support certain features like binary operator modifiers, which are quite important, I would say; they don't support all the functions we have in PromQL, nor staleness handling or special float values; and there are a bunch more little behavioral differences that you can look up in the blog post. So pointing a Grafana dashboard with a Prometheus data source at New Relic might sometimes give you quite similar-looking results, but there are going to be quite some differences and unsupported queries as well. Then there is Promscale, an adapter for the TimescaleDB database by the Timescale folks. It's open source, and it also reuses the native PromQL engine; again, great job getting 100%. Thanos, same case: open source, reusing the PromQL engine from upstream, getting 100%. As the last case here in the alphabet, I have VictoriaMetrics. They are an interesting case in that they really position themselves, marketing-wise, as a drop-in replacement for Prometheus, but they have their own language dialect called MetricsQL. The first sentence typically describing it says that MetricsQL is backwards-compatible with PromQL.
But then if you keep on reading, there is a bunch of exceptions listed for how they're not compatible: they have extra functions, and they do quite a few things quite differently. In this case, they got around 60%, again with some cross-cutting issues. They have different behaviors around when they drop metric names; they select one more sample in range vector selectors than Prometheus does; they don't support the staleness markers that Prometheus has; they remove NaN float values from outputs. They also don't store full-precision float values, so you can't get the full float values back out of the database, which sometimes caused the comparator's tolerance thresholds to flag a query as failed. And there are a couple more differences where backwards compatibility is not quite met. So this is just something you might want to look at if you were to choose a solution. That's it for diving a tiny bit into individual vendors and projects. There are still some open questions around this effort in general. For example, how do we deal in test reports with slight differences versus larger differences? Do we report them just as one test failure, or do we want to characterize them in different ways? And how would that even be possible in an automated fashion? Maybe it just isn't, and in general we want to encourage people to get to 100% anyway, so maybe that won't be as important. There could also be some behaviors in the native PromQL engine that we actually want to treat as undefined. For example, the subquery step alignment. Sorting is already undefined, and I already ignore it in the tester, but there could be other things that we don't actually want to compare in behavior.
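As a small aside, the VictoriaMetrics difference of selecting one more sample in range vector selectors can be modeled in a few lines. This is a hypothetical sketch of the selection semantics, not either project's actual code; at the time of these tests, Prometheus treated the range window [t - range, t] as closed on both ends:

```python
def select_range_samples(samples, eval_ts, window, include_prev=False):
    # Select the (timestamp, value) samples falling into the range window
    # [eval_ts - window, eval_ts]. With include_prev=True, also include
    # the last sample just before the window start, mimicking a dialect
    # that selects one more sample than Prometheus does.
    start = eval_ts - window
    selected = [(ts, v) for ts, v in samples if start <= ts <= eval_ts]
    if include_prev:
        earlier = [(ts, v) for ts, v in samples if ts < start]
        if earlier:
            selected.insert(0, earlier[-1])
    return selected
```

Functions like rate() then operate on a different set of samples, which is enough to change query results and fail the comparison.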
And then there's also the question of how to version the compatibility in the test results over time: with which PromQL version are you actually compatible, and to what degree? All right, there's some related and future work going on as well. As I mentioned, in this compliance repo we are linking tests and specs for OpenMetrics, and we have tests in there for remote write. Someone just proposed creating remote read tests as well. Obviously, the PromQL work is in there now too, and potentially future interfaces as well. Ultimately, we want to get to a point where people building Prometheus-compatible systems can get certification marks, potentially even being able to self-certify, but we can't say that for sure yet. So initially, these tests that give you some certification will certainly require some manual effort, likely with input from the Prometheus team as well, to say that yes, it's actually compatible. But ultimately, we should figure this out as a community. So please watch this compliance repo, contribute to it, and discuss on the usual Prometheus open source channels: the mailing lists, the chats, and so on. All right, stay compatible. Thank you, and I wish you a happy PromCon.