All right, should we get started? Sound good? All right, I just want to start by saying how tremendously excited I am to be back speaking at in-person conferences again. It's great to see all of your smiling faces, hopefully under masks, please. But today we're going to have a bit of fun talking about the strange, scary world of monitoring serverless applications. Well, is that going to let me change slide? Oh no, technology. It works.

For a quick introduction, for those of you who haven't met me yet, my name is Colin Douch. I have been doing this observability thing for arguably far too long. I started out in physical control systems before realizing that, spoiler alert, physics kind of sucks, and the less you have to deal with it, the better. I now tech lead our amazing observability platform team at Cloudflare. We build the tools that Cloudflare engineers use to monitor and debug Cloudflare.

Today I'm going to take you on a bit of a journey through Cloudflare's evolution into the serverless space, because it's kind of interesting, right? Over the last 10 years or so, we've seen this slow transition into serverless applications, and at least from an open source observability standpoint, it really feels like we haven't been able to keep up. I don't know whether that's the vendor lock-in that's inherent in serverless platforms or what. I'll be honest, I'm getting old. Part of me just wants to shake my fist at the sky and say, hey, let the young folk deal with it. But the more I'm forced to deal with serverless functions, the more I've learned to love them. Because I have been forced to use them, as I'd imagine at least some of you have. And they're great. They solve quite a few issues that our standard servered applications have. They get things like scaling for free. They get things like microservice architectures really, really easily. So what we really need to do is update our operational models to deal with them.

And that brings me to Cloudflare Workers. This isn't a sales pitch, I just want to give a bit of flavor to our experience. Back in 2017 or so, Cloudflare invented its own serverless platform, as you do. And then the annoying thing happened, because when you invent something, the absolute worst thing that can possibly happen to you is that people start using it. As an observability team, we blinked, and in that blink we had a bit of a problem, because we had this weird Cambrian explosion of serverless applications at Cloudflare. In that blink, 10, 20, 30 different serverless applications were suddenly built on top of Cloudflare Workers. And despite the circular dependency, this was a bit of a problem. I'll be the first to admit, we fumbled the ball on that one. We hadn't really offered any solution to these teams, and this led to a whole bunch of interesting solutions that we'll get into later. It became clear very, very quickly that we needed to offer some sort of actual offering to bring serverless functions back into the fold of well-supported architectures. It just wasn't very clear how to do that. There's a lot to talk about there, from metrics to logs to traces and everything in between, but let's talk about time series today, because we don't have that much time. And I think we'll all agree that the de facto standard for collecting and storing time series these days is Prometheus. It's sort of a lingua franca, right? Everything uses it. Everything uses the text exposition format.
And I know we have a few Prometheus maintainers in the audience today. So, at least from me, thank you, because you make my life a hell of a lot easier. But if we're being honest, Prometheus makes a few assumptions. It has a bit of a downside, and apologies for the image; it's really hard to find safe-for-work old Greek art, I don't know why. Prometheus has a dark side. It makes a few assumptions that were very innovative, at least for the time, but if we're being honest, they sort of force you into a particular architecture.

Prometheus, at a high level, assumes a few things. It assumes that your system lives long enough to be discovered and scraped. It assumes that your service is network-enabled and can expose things over the network. And it assumes that you can do your own aggregation. Let's break those down a bit, because they're perhaps not obvious.

Firstly, Prometheus requires that your system lives long enough. Prometheus, rather inevitably I think, uses a pull-based model for metrics collection. For those of you who haven't seen that in action, the general idea is that instead of your service actively pushing metrics out to some sort of collector, Prometheus will discover where your service is, reach out to it, and pull metrics back. And this is kind of cool, right? Because it gives us a bunch of things for free. We get health monitoring, because if we can't pull your metrics, well, maybe your service is down. We get capacity control, because we can decide how often to pull metrics and how many of them to pull. We get easy testing, because we don't have to mock out collection endpoints. But Prometheus isn't always scraping, right? It's scraping every five, 10, 15 seconds, whatever, which means that in reality your service has to live for at least five, 10, 15 seconds, which may or may not be the case.

At the same time, and this goes hand in hand with the last one, Prometheus assumes that you can expose things over the network. And again, that may or may not be the case, because in practice this requires you to be able to listen on a port. It requires you to be able to spin up a server. If you want to secure the communication, you need firewall rules, you need TLS certificates, maybe client certificates. So it gets a bit messy in the weeds.

And finally, Prometheus assumes that you can do your own aggregation. This one is a bit opaque, so it's easier to illustrate with a counter-example. Imagine some hypothetical service that only ever processes one request. We can ignore the previous two requirements for now and say that somehow Prometheus gets our metrics. So what would a request count metric look like in that case? Well, our service processes one request, right? The request count is a one. So we push a one to Prometheus, and if we get a second request, well, we start another service and it pushes a one to Prometheus. And this is a bit of a problem, because from Prometheus' point of view, that metric jumps to a one and then never goes anywhere. So we can't accurately represent things like request counts in this standard setup. We could maybe do some magic if we explode our cardinalities and do recording rules or something like that, but in practice we probably don't want to do that. We need our service to be able to do some level of aggregation.
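To make that concrete, here's roughly what every one-shot invocation would report, in the standard Prometheus text exposition format (the metric name is just for illustration):

```
# TYPE http_requests_total counter
http_requests_total 1
```

Every invocation reports the same one, so anything that simply stores or replaces the latest value will sit at a flat one forever, no matter how many requests we actually served. Something, somewhere, has to sum these up.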
So let's summarize those a bit. With those requirements, Prometheus is very well suited to a particular model of architecture: a service that is long running, a service that is network enabled, a service that is able to do multiple things, in every sense of the word a traditional servered application. So let's consider our serverless example. Well, our serverless functions live for hundreds of milliseconds, maybe seconds at worst. Our serverless functions aren't really network enabled in the traditional sense. They can't listen on a port, they can't spin up a server, and even if they could, that would be antithetical to the whole idea of serverless in the first place. And they only ever process one request. So in all three ways, our serverless functions don't hold up these assumptions. And when we were first investigating these problems, these were the issues we encountered.

So what can we do, right? We're engineers, we look for solutions, and so we can look at prior art. Serverless functions have been around for what, nearly 10 years at this point? So this problem must be solved for us already. The first thing we can look at is the push gateway. Don't roll your eyes over there. The push gateway, for those of you who haven't seen it, effectively functions as a long-lived proxy for metrics. The idea is that you can HTTP POST a text exposition and the push gateway will handle all of the difficult parts of those first two requirements: listening on the network, living long enough to be discovered and scraped. So we only need to be able to push something, and it works. What makes the push gateway interesting is how it handles repeated pushes, that is, pushing the same metric twice. And this goes back to our request count example. If we push that request count with a one and then we push it with another one, well, the push gateway replaces the old value. So we still get this repeating-one problem.

At the same time, in the same upstream documentation, we can find the aggregation gateway by our friends at Weaveworks. And the aggregation gateway is very, very similar. It functions like a push gateway, except in the case of this repeated metrics issue. If we push a one and then another one, well, the aggregation gateway, perhaps obviously, aggregates them. And by aggregate we mean sum, we're just being fancy. But ostensibly this solves all our problems, right? It can listen on a port, we can just use these push gateway semantics, but we can aggregate our values, we can do request counts. Ostensibly, we're done, right?

Well, maybe, maybe not, because at this point we decided to go on a bit of a survey through our engineering teams. Because engineering teams at Cloudflare, they're not pushing things to production without monitoring, right? So they must be doing it somehow, and we wanted to work out how they were doing it, and whether there was a consensus on what the best solution out there was. At the same time, we wanted to know what they were doing, like what sort of metrics, because each of these solutions has a different set of trade-offs. And I'll be honest, what we found was a bit of a mess. This is not to disparage any Cloudflare engineers, they were doing what was necessary, but there was really no consensus on what the solution to this problem was. We had push gateway setups where teams weren't using counters.
We had aggregation gateway setups where teams weren't using gauges. We had one team that had built this weird Frankensteinian monster of a binary that combined the two into one, with weird routing logic to route different metrics to different gateways. That one was actually very interesting. We had metrics that were being tromboned back through internal UI pipelines. We had teams that had built custom logic on top of Sentry to pull out time series metrics. There was really no consensus about where we could go.

At the same time, we wanted to know what sort of metrics people were pushing, right? How were they dealing with these trade-offs? Perhaps most obviously, teams needed counters. Counters are very boring. The general idea is we have a bunch of metrics that we can sum up together to produce a final value. Nothing too interesting there. Counters are counters.

At the same time, teams needed gauges, and gauges are more interesting, because if we're, say, processing a thousand requests and we're just replacing gauge values, we're dropping 999 of those. So is that useful? Maybe not. And this is where I want to revisit that Frankensteinian binary from before, because that team was actually doing some really interesting stuff, like taking aggregates over different gauge values as they came in, to capture those changes. We'll get back to that in a second, but it was actually quite interesting.

At the same time, teams needed histograms. Again, perhaps unsurprisingly, and as much as we didn't want to have to deal with the cardinality of histograms, they were there, and I suppose they're useful. Histograms ostensibly are just counters on steroids, right, where we're summing up all the individual buckets into a final value. But beneath the surface there is some complication there. What happens if we get a push with a different set of buckets? How do we merge those together? How do we even merge a summary? Does that even make sense to do? I don't know, you tell me. But we had histograms.

And finally, and perhaps most interestingly, we had info metrics. If you've ever instrumented anything with, say, client_golang, you've probably seen an info metric. What makes them very interesting is the fact that their behavior is not obvious. With info metrics, for those who haven't seen them, the general idea is that you have some static label set baked into your binary, and your binary exposes that as a gauge with a value of one. That's not too interesting, but what is interesting is how those behave over time, because when we deploy a new version with a new static label set, well, the old version shuts down, that label set finishes, and the new one comes up. And it's interesting because that is not implicit in the fact that it's a gauge. That behavior is purely born out of how the application handles these metrics over time.

So where does that leave us? Let's take these new requirements and revisit our upstream solutions. Well, the push gateway, maybe that works, maybe it doesn't. In fact, if you read the documentation, and who does that? Not me, who has time? If you read the documentation, the push gateway is out anyway. It very clearly states: hey, probably don't use this for serverless things. Oh well, we can abandon that one. That's sad. It promised so many things. At the same time, we have the aggregation gateway, and the aggregation gateway gets us closer, maybe. It gives us different trade-offs, right?
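As a quick aside to pin down that info metric behavior: client_golang, for instance, exposes a metric like go_info, and across a deploy the series behaves something like this (version values illustrative):

```
# While v1 of the binary is running:
# TYPE go_info gauge
go_info{version="go1.17"} 1

# After the new binary rolls out, the old series simply stops:
go_info{version="go1.18"} 1
```

Nothing in either exposition says "this replaces the other one". A long-lived binary gets that behavior for free just by shutting down, and a one-shot function doesn't.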
So, with the aggregation gateway, we can represent counters, but the way it handles gauges is slightly weird, and we still can't get our info metrics out of it, which is what we really wanted to do, because the aggregation gateway works purely on types: a metric type has one behavior. So maybe that works, maybe it doesn't.

At the same time, we were investigating event-based solutions, right? As an industry, we're moving more towards these high-dimensionality, high-cardinality events. Maybe we could do something there. We decided against those for two reasons, mostly. A, at least at the time, they were rather immature, and we didn't really want to put something that experimental into production. And B, we didn't want to fracture our environment too much, because as an internal platform team, you really have to focus on your developer experience, and having two separate systems, one for serverless architectures and one for more standard architectures, didn't really sit well with us. We wanted a consistent experience throughout the whole stack.

So where does that leave us? Because all of these upstream solutions seem out of the window, it became pretty clear pretty quickly that we were going to need to build something ourselves, as annoying as that is. We had a couple of requirements. We wanted to maintain compatibility with all of the existing client libraries and client instrumentation, and at the same time, we wanted to represent all of these different semantics that we had: summing counters, replacing gauges, but also info metrics, and again, those statistical functions over incoming gauge values. So it was pretty clear we were going to have to do something ourselves, and that's what we did. We called it the gravel gateway.

The code is there, it's in Rust, so buyer beware on that point. The main thing I want to mention, because it'll make my family proud, is that this is a mining pun. I come from a mining family. I think this is funny. In mining, we have things called aggregates, like sand, gravel, and such. I think this is funny. Literally no one else does, so if I could have a pity laugh so I don't cry myself to sleep tonight, that would be appreciated. Thank you, I appreciate it. As I say, the code is there, go have a look, buyer beware with the Rust stuff.

Some functional requirements we had when we were building this. We wanted to maintain compatibility with the existing client libraries, right? If we're already building something bespoke internally, we want to do that as little as possible, because training new engineers to use new internal things is just an extra burden that teams don't want. At the same time, we needed to aggregate, like the aggregation gateway, but we needed to support way more kinds of aggregations: all of these things we'd identified. And therein lies a bit of a problem, right? Because here's two metrics. Ooh, ah, fancy. Two gauges, ostensibly. We have a build info gauge, and we have a go threads gauge. Remember this slide. Ostensibly, these are very boring metrics. They're two gauges, but there is some implied behavior in how our applications handle them over time. The second one, a standard gauge, goes up and down. The first one is an info metric, and there's nothing inherent in that text exposition that says it's an info metric, outside of the metric name, but we'll ignore that. And this is the problem, right?
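Reconstructed from the slide (label and sample values are illustrative), those two metrics look something like this:

```
# TYPE build_info gauge
build_info{version="1.0.0"} 1

# TYPE go_threads gauge
go_threads 13
```

Side by side in the exposition text they're indistinguishable: two gauges. Yet one is supposed to be replaced wholesale on deploy, and the other just goes up and down.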
How the application handles these over time is the defining factor in these metrics, and in our serverless example, well, we don't have an "over time". We get one shot. So we need some way to communicate to something external, like our gateway, how these metrics are supposed to behave over time. And there are really only a few controllable fields we have if we want to keep all of the text exposition machinery. We have the type, but that's a very fixed set of things. We have the help text, sort of, maybe. We have the metric name, and we have the label set. Any one of those last three would have kind of worked, but ultimately we decided on smuggling some semantic information into the labels.

And that's what we did. We have the clearmode label in our gateway. The clearmode label lets a push communicate those over-time semantics to the gateway. For example, we have a clear mode that represents these info semantics. Every type has a default clear mode, so we sum counters, we replace gauges, that sort of thing, but being able to specify it directly allows that more direct control. It allows us to bring serverless functions up to almost the same level of monitoring as our more traditional servered applications.

For example, if we want to sum counters, again, that's the default behavior, but if we want to be a bit explicit for the sake of demonstration, we can use the sum clear mode. And if we push a one and a one, we get a two. Basic math, hopefully everyone can follow along with that. The main interesting idea is that we strip out the clearmode label when we re-expose these metrics to Prometheus. That makes the whole process entirely transparent from the Prometheus end. There's no way of telling, if you're looking at the metrics in Prometheus, whether a serverless function is behind them or a more traditional application architecture, and I think that's nice, because being able to tell the difference would spoil the abstraction a bit.

At the same time, we have the family clear mode, which gives us those info semantics. If we push a 1.17 and then we push a 1.18, we get a 1.18 out the other end, because we've gotten rid of that whole metric family when the new one comes in. We've gotten rid of the 1.17 and replaced it with the 1.18. Again, perhaps simple.

The more interesting one is statistical functions, and this is where I want to revisit that Frankensteinian binary, because a metric is a single number that changes over time, and means, medians, and percentiles are a bit different. We can track a mean as a single number over time, but is that useful? Because if we deploy a new version of our service that doubles the memory usage, well, we probably want to know about that without the doubling being dragged down by two weeks of data from the previous version. So in reality, what we want to be able to do is expire values over time. And to do that, we actually stole a concept from another internal project we have called Cleodora. Go listen to my o11yfest talk if you want to hear more about that. The general idea is that in Cleodora we have a concept of pebbles, which works well with the gravel gateway idea. Pebbles are effectively time-based buckets, and we keep a circular buffer of them over time. Each bucket contains a pre-aggregated value of all of the incoming values in that time slot.
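Here's a minimal sketch of that pebbles idea, in JavaScript to match the Worker example later. This is illustrative only, not the actual Cleodora or gravel gateway code, and the slot width, bucket count, and sum-and-count pre-aggregate are all assumptions:

```javascript
// A circular buffer of time-based buckets ("pebbles"). Each bucket holds a
// pre-aggregated view (sum and count) of every value pushed in its time slot,
// so old values naturally expire as the buffer rotates.
class Pebbles {
  constructor(slotMillis, numSlots) {
    this.slotMillis = slotMillis;
    this.buckets = new Array(numSlots).fill(null);
  }

  push(value, now = Date.now()) {
    const epoch = Math.floor(now / this.slotMillis);
    const i = epoch % this.buckets.length;
    // If this slot last held an older rotation of the buffer, reset it.
    if (!this.buckets[i] || this.buckets[i].epoch !== epoch) {
      this.buckets[i] = { epoch, sum: 0, count: 0 };
    }
    this.buckets[i].sum += value;
    this.buckets[i].count += 1;
  }

  // The mean over every bucket still inside the window: an aggregate of
  // aggregates, which costs some precision but is cheap and bounded.
  mean(now = Date.now()) {
    const epoch = Math.floor(now / this.slotMillis);
    let sum = 0;
    let count = 0;
    for (const bucket of this.buckets) {
      if (bucket && epoch - bucket.epoch < this.buckets.length) {
        sum += bucket.sum;
        count += bucket.count;
      }
    }
    return count === 0 ? NaN : sum / count;
  }
}
```

Something like `new Pebbles(60_000, 5)` would give you the raw material for a mean over the last five minutes.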
So in reality, we get this aggregate of aggregates out the other end, which does lose us some level of precision, but in practice we've found it doesn't really matter too much. There are some edge cases there, it's not perfect, but we get this idea: we have a rotating buffer, and we can expire things over time. And this is where we get more complicated clear modes, things like a mean over the last five minutes. We're taking a mean over the last five minutes of data, and that allows us to get better insights, I think.

At the same time, because we're not augmenting the text exposition format in any way, we get compatibility with existing client libraries for free, which is a good thing. It means we don't have to train engineers to do anything differently. We can just use standard client libraries and add another label to the label set. And because most of us, I'd imagine, are engineers and we like some code, here you go. I don't know why we like code. Why do we like code? Here's some code, excuse the JavaScript. You can see that it's basically exactly the same as standard instrumentation, right? We have a request count. We don't even need to change that one, because the default for counters is that we sum them. Easy. And we have a version, an info metric. That's pretty cool. If we deployed a 1.1, we would clear out that 1.0 in the gateway because of its clear mode. Fun code, ignore the JavaScript, please. That's actually taken from an internal Cloudflare Worker, stripped down of course, but that's the general idea. Also, that's an actual URL, please don't hack me.

So that's about it. We have the gravel gateway. Let's look towards the future, shall we? Where do we go from here? Well, in the gateway itself, we have a few things coming up. We're always trying to evolve it, because we're always trying to improve on things. We're adding some interesting aggregations, things like deltas of counters over time, more statistical functions, that sort of stuff. We want to rewrite pebbles a bit to get rid of some of those edge cases. We're also doing some interesting things with scaling it up, because it turns out, at least at Cloudflare, we get some rather large services, which kind of sucks when you're doing six, seven-figure requests per second. Turns out a little Go binary, sorry, a little Rust binary, on a server can't really keep up. But hey, we're looking at clustering it, scaling it out. Again, look at the code for the early mentions of that.

At the same time, we're looking at things like OpenMetrics. For those of you who haven't seen it, it's an attempt to codify something very similar to the Prometheus text exposition format. The main interesting thing is that it defines a whole bunch more types, and I know at least some of you out there have just been smugly going, ha ha, Colin, we've already solved this problem. And you are right, because in OpenMetrics we get things like info metrics, we get things like stateset metrics, all of these fancy types. And that is great, because maybe we can get rid of this hacky clear mode someday. Maybe. At the same time, we're moving as an industry towards more event-based things, because ostensibly that's what we have, right? We have an event pushed out of our serverless function, and we're just doing some level of aggregation on it. So again, it's a really interesting area of research, looking at these more event-based solutions and aggregating time series out of them.
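The Worker code from that slide isn't reproduced here, but a stripped-down push along the lines described might look something like this. The gateway URL, metric names, and pushgateway-style path are assumptions for illustration; the clearmode label and the sum-by-default behavior for counters are the parts that come from the talk:

```javascript
// A hedged sketch of a Cloudflare Worker pushing metrics to a gravel gateway.
// GATEWAY_URL is a made-up placeholder, not a real endpoint.
const GATEWAY_URL = "https://gravel-gateway.example.internal/metrics/job/my_worker";

async function pushMetrics() {
  // Plain Prometheus text exposition, built by hand.
  // requests_total: a counter, no clearmode needed; the gateway sums pushes.
  // build_info: an info-style gauge; clearmode="family" tells the gateway to
  // drop the old version's series when a new label set arrives.
  const body = [
    "# TYPE requests_total counter",
    "requests_total 1",
    "# TYPE build_info gauge",
    'build_info{version="1.0",clearmode="family"} 1',
    "",
  ].join("\n");
  await fetch(GATEWAY_URL, { method: "POST", body });
}

export default {
  async fetch(request, env, ctx) {
    const response = new Response("hello");
    // Push after responding so metrics don't add latency to the request.
    ctx.waitUntil(pushMetrics());
    return response;
  },
};
```

From the Prometheus side, scraping the gateway later, the clearmode label has been stripped, so these look like any other service's metrics.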
On the event side, as I mentioned, we have an internal thing, Cleodora. Again, go see my o11yfest talk if you want to know more about that. But it's definitely an interesting area of research, and I'm excited to see where we go from there.

But that's about all I have. Again, the code is there, and if you want to read more on the motivations for this, there's another QR code with a link. The final takeaway I'd like to leave people with is that even though things have been around for a long time, like serverless functions have been around for what, 10 years now? There are still open problems in them. So even in a world where you might think there are no problems left to solve, there are plenty of problems for you to solve. Please go out and solve them, because it makes my life a hell of a lot easier. Thank you.