OK, thank you so much for joining on Tuesday for our talk. We are really excited to be here in person to talk about a non-trivial challenge: combining short-lived jobs and serverless with cloud native monitoring. But before that, a short question to the audience. Raise your hand if you know what the word "fleeting" means. Do you know what fleeting means? Only a couple of hands. OK, good, thank you. I was worried that everybody would know what it means, because I didn't when I was proposing this talk with Saswata; I had never heard this word in my life. So let's start with a short explanation. The dictionary says that fleeting means something that passes very quickly and is generally gone sooner than you expect. And I found this amazing cartoon by an artist called Mickey Bach that explains the word very well, in my opinion. He actually did a word-a-day cartoon series around the 1940s, and I think I should start doing those to learn English better, to be honest. In this cartoon, we see a person called Wilbur losing what looks like a fist duel, and the woman watching says that he stood up for a fleeting moment. I think that makes it pretty clear, for me at least, and hopefully for you as well. But how can this be applied to metrics? My colleague Saswata will explain that in a second in detail, but we could summarize fleeting metrics as follows: it's the ability to collect metrics, numeric values over time, that aggregate information from fleeting moments like Wilbur's fist duel. For example, imagine Wilbur has a lot of those fights. Perhaps you want to learn the ratio at which he wins or loses them, or perhaps you want to know the median duration of those fights per day. As you can imagine, it's hard to get that information from our poor Wilbur, who has potentially passed out already after a lost fight and may have forgotten what happened. And generally, this is what we will be speaking about today: gathering metrics from those fleeting processes. A short disclaimer: we won't solve all your problems here, there is no silver bullet, essentially, but hopefully we'll give you more context on what the options are, why it's not a trivial problem, and what the potential pitfalls are of the solutions available in this space. But before that, a short introduction. Together with me I have Saswata on stage, who flew over 20 hours with two stops from India, first time in Europe. Woo! Yeah. No, no, no, try again.

Hi, everyone. My name is Saswata Mukherjee. I'm a software engineering intern at Red Hat on the monitoring team, and I was a GSoC '21 student developer under Bartek's mentorship in the Thanos project. I love working with distributed systems and observability-related technologies, and when I'm not working, I love to read and go out for long walks.

And my name is Bartek Płotka. I'm a principal software engineer at Red Hat. I maintain various projects in open source, mostly in Go, including Prometheus and Thanos, and I'm a CNCF TAG Observability tech lead. I'm also writing a book with O'Reilly called Efficient Go, and I put way too much observability content into it, so it's kind of related as well. So yeah, it's a fun talk. Let's go.

So for today's talk, let us begin by trying to answer some initial questions: what exactly are short-lived or fleeting jobs, why is monitoring them so crucial, and what are the problems that we face when we try to monitor them?
So when we talk about monitoring, we usually refer to the metrics signal of observability. With the continuous growth of system complexity and of the data we process every second, we need monitoring to understand the state of our workloads. This is accomplished by tools like Prometheus, designed to let us determine whether our systems are performing to expected levels of service by collecting and querying metrics and notifying humans about any problems. The most common use case for these metrics is to construct drill-down visual dashboards and to trigger alerts when a system is behaving anomalously. So let's first understand how Prometheus normally monitors your workloads. In any process, you can have numerous events occurring as it programmatically does the things you need it to do; you could even say that every single CPU instruction is an event. Many of these events, or groups of events, are of interest to people like us who are responsible for running these processes, as they can tell us whether the system is behaving as we expect it to. We want to generate numeric data from these events by aggregating them over time, since we know that we will need to alert on, analyze, or debug the processes that we run. Prometheus creates these aggregations of event data from time to time and generates a series of samples for us, which are pairs of numeric aggregation values and timestamps. These series are what we call metrics. So with this pull-based model of Prometheus in mind, let us try to understand how the infrastructure and the functionality that we control has changed over time and why our metrics are now fleeting. In the past, we would prepare physical servers for software to be deployed on, which involved installing the operating system, hardware upgrades, device drivers, and so on. As there was strong coupling between hardware and software, this could be called static infrastructure. Afterwards, virtual machines emerged, so you could deploy your software to a kind of simulated physical server. This provided a lot more flexibility, made deployments repeatable, started a shift towards dynamic infrastructure, and reduced the amount of functionality that the user needed to control. But there was still some overhead and some limitations, and that is when several containerization tools and technologies were born, making it possible to run several different applications on the same system without any of them interfering with each other. The surface area of user control was further reduced with this. And then we now have serverless, which abstracts a lot of this away. Users no longer have to care about where their code is running, or even when their code is running, as this is taken care of by serverless solutions. And even outside of serverless, applications with predefined completion states have existed for a while now, such as batch jobs and cron jobs. Now, this is an oversimplification of a very, very long timeline of changes, but it highlights a very important trend, which is that users like you and me now have way less control over the code that we run. This is especially relevant in monitoring, as we do not have the full overview of the situation in such environments: we author the code, but we don't run it.
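As a concrete reference point for this pull model, here is a minimal sketch of a typical long-lived target, assuming the standard prometheus/client_golang library and a hypothetical events_processed_total counter; it only illustrates the idea, not any particular production setup:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// eventsProcessed aggregates interesting events over time inside the process.
// On each scrape, Prometheus turns this into a (value, timestamp) sample.
var eventsProcessed = promauto.NewCounter(prometheus.CounterOpts{
	Name: "events_processed_total",
	Help: "Total number of events processed by this process.",
})

func handle(w http.ResponseWriter, r *http.Request) {
	eventsProcessed.Inc() // every handled request is an event we aggregate
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handle)
	// The /metrics endpoint is what Prometheus scrapes at a regular interval.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

Because the process keeps running, it does not matter exactly when the scrape happens: the recent aggregations are always there to be collected.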
So what exactly is the nature of such short-lived or fleeting workloads? When we talk about short-lived jobs, we usually mean processes like serverless functions, batch jobs, cron jobs, or even ephemeral containers. Such processes usually run and do their work for a limited but arbitrary amount of time. But "short-lived" is kind of a misnomer here. We cannot define a hard limit or boundary for when a process goes from short to long. If a job runs for 10 hours, is it long or is it short? So how do we make the distinction? Well, we can define it as a process that has a well-defined completion state. Any process which has some completion state, after which we don't expect it to perform any operations and it can be terminated gracefully, can be termed a short-lived process. The most interesting information for us usually appears around the completion of such a process, and that is what we usually want to enumerate as metrics. So now that we know what a fleeting workload really is, let us try to understand how monitoring differs based on the type of process. For longer-lived workloads, which are by far the most common workloads monitored by Prometheus, for example a web server, it is relatively easy: the process can be scraped and metrics can be collected at any point in time, unless something goes wrong, because the recent aggregations of events are always there in the process itself. It can even have several fleeting processes inside of it, but as long as they are instrumented, the aggregated data can still be found within the longer-lived process and can be scraped successfully by Prometheus. However, for workloads composed of only fleeting jobs, Prometheus' pull-based model becomes problematic and effective monitoring cannot be achieved. A scrape might miss the fleeting job completely, in which case it simply cannot collect any metrics; and if it does coincide with the running fleeting job, it can only collect some of the partially aggregated data and will miss the rest of the events. Just like Wilbur, who was a fleeting process but got knocked out, and now we cannot know how many duels he had or how many he won. Prometheus in a typical setup also has the built-in capability of monitoring the health of the discovered or configured targets. It does this by generating a dedicated time series called "up" with job and instance labels. On each scrape, it appends a sample to the series with a value of one if the instance was reachable and scraped successfully. But if the scrape fails and the instance is unreachable, Prometheus appends a sample with a value of zero: the instance is now down. That is not possible with fleeting processes because of the difficulties I mentioned with regard to how pull-based monitoring scrapes fleeting processes. So to summarize, we have three major problems when it comes to monitoring fleeting jobs using Prometheus semantics. Fleeting processes can miss Prometheus scrapes completely or partially. Health and success can't be deduced from the up-ness of the process alone. And finally, it's hard to aggregate data correctly when you're only aware of a single event, which is a point that we'll be talking about later on. Before this talk, we also discussed these same problems with the OpenFaaS and Knative communities.
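To make the scrape-miss problem concrete, here is a minimal sketch (hypothetical metric name and numbers, same client_golang assumptions as above) of a batch job that aggregates its events in memory and exposes them, but exits long before a typical scrape interval comes around:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var itemsMigrated = promauto.NewCounter(prometheus.CounterOpts{
	Name: "batch_items_migrated_total",
	Help: "Items migrated by this batch job run.",
})

func main() {
	// Expose metrics as usual, hoping a scrape happens to coincide with us.
	go http.ListenAndServe(":8080", promhttp.Handler())

	for i := 0; i < 1000; i++ { // the actual short-lived work
		itemsMigrated.Inc()
	}

	// The job is done after a couple of seconds. With a 30s scrape interval,
	// Prometheus most likely never saw this target, or saw a partial count,
	// and the "up" series says nothing about whether the run succeeded.
	time.Sleep(2 * time.Second)
}
```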
Hello, hello, yes. Okay, so what if you are a Prometheus user, or you want to be one, but you have a lot of those fleeting jobs, like serverless functions, batch jobs, or any kind of short-lived process as we defined them? What do you do? Well, initially there are two categories of solutions that are often mentioned, maybe as ultimate solutions, but the truth is they are just some of the options in the space. Let's go through them; they solve some of the challenges Saswata mentioned with fleeting jobs. Let's focus on the first one, which is simply relying on event-based observability and deriving metrics from it. As we mentioned, processes are full of events, and some of them are useful to observe and count, so that we can create alerts, monitoring, dashboards, and really use that for observability. Particularly when we look at batch jobs and fleeting processes, they are mostly about single events, so it feels natural to have observability that collects logs of events. And there are, of course, many solutions for that, like logging, tracing, and even dedicated event solutions, that allow us to capture information and context about each of those events, forward it to some collectors, and store it in some backend. With tracing it's even better, because those things are linked together in a structured way across requests. With events stored in one place, we can process that information and produce really any metric. We often call those jobs rulers, aggregators, generators, sometimes exporters. And the ruler can then deliver this data to Prometheus, or a Prometheus-compatible system, by allowing Prometheus or some kind of collector to scrape it, or maybe by backfilling through a TSDB block or something like that. To sum up, with this solution we know about all the events, so it's relatively easy to aggregate and find common dimensions for your metrics. And there is no regular scrape interval involved, so a process can have whatever lifespan: fleeting, short, medium, long-lived. As long as we are able to reliably deliver this single trace about each event to some backend, we should be fine. There are many solutions in open source that you can grab that will do that work, and the space is still growing. For example, you could use Grafana Loki with the Loki ruler. You can use Grafana Tempo, and, I think I saw this just two weeks ago, there is now a Tempo metrics generator, with ideas to have rulers there as well. You could use some very nice collectors on steroids, like the OpenTelemetry Collector or Vector, which can derive metrics from the events they consume. And there's mtail, which is an old Google project that just parses log lines and produces metrics. So there are some solutions, and there are more, and of course there are vendors who will try to solve this for some amount of money.
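To give a flavor of deriving metrics from events, here is a minimal, mtail-style sketch (hypothetical metric name and log format, standard library plus client_golang) that counts error log lines forwarded from fleeting jobs and exposes the result for Prometheus to scrape:

```go
package main

import (
	"bufio"
	"net/http"
	"os"
	"strings"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter derived from log events rather than from in-process instrumentation.
var errorLines = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "job_log_errors_total",
	Help: "Error log lines observed per job, derived from the event stream.",
}, []string{"job"})

func main() {
	go http.ListenAndServe(":2112", promhttp.Handler())

	// Read forwarded log events from stdin, e.g. "job=resize-image level=error ...".
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.Contains(line, "level=error") {
			continue
		}
		job := "unknown"
		for _, f := range strings.Fields(line) {
			if strings.HasPrefix(f, "job=") {
				job = strings.TrimPrefix(f, "job=")
			}
		}
		errorLines.WithLabelValues(job).Inc()
	}
}
```

The fleeting job only has to emit its events; this long-lived ruler-style process does the aggregation and stays around for Prometheus to scrape.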
And there are downsides, as with everything, because all of engineering is really about trade-offs. The first one is the enormous cost that usually results from the complexity of such a system. You need to gather every interesting event from all fleeting jobs, all fleeting processes, then process them and derive metrics from them, so that you can alert, monitor, and do the whole observability story. That's serious traffic per second, just for observability data, and then you have to store all of it in your backend. Now, imagine yourself as a function. That's one thing, one event. And then it emits perhaps a few tracing spans about the situation and wants to send them to some collector. That's probably a few kilobytes of data and a network call, so it feels maybe OK at first glance. But serverless is meant to scale from zero to a thousand queries per second within milliseconds to seconds, and your receiving pipeline for those events has to scale in the same way, and you have to pay for that. That's a lot of compute power and engineering time, so it means a larger cost. Not only that, but you have to store, richly index, and process that data in your backend to derive meaningful information, metrics, from it. You are very much in the big-data world these days, so you need to pay. And on top of that, you still need a metrics system that will use those metrics. So generally, it's a more costly solution. In the end, it's very tempting to use this for all of observability as a silver bullet, but it's not. I presented some ballpark calculations at a conference in the UK two weeks ago, and just the tracing backend cost for one vendor was around 30 times more than the compute cost for your application logic, and I'm not even counting the additional cost of collecting all of this and sending it to that vendor. So you can imagine it can go very sideways if you are actually trying to save some money. And I used a pretty cheap and nice vendor for this calculation; there are more expensive ones. And then, of course, there is a solution in the tracing world called sampling. It's a very popular technique of filtering information about events so you only send the ones that matter, like requests that failed, were very slow, or are otherwise interesting for monitoring purposes. Sampling is really a must-have to keep the cost of tracing events reasonable nowadays. But then it means we cannot easily derive metrics from the data, because the storage or collector only has partial information about what happened, maybe 10% of things. So you don't have the big picture and the result is inaccurate, which kind of defeats the solution. In some way, though, having a full, unsampled event pipeline for just your fleeting jobs, which may be a small part of your overall system, might make sense; maybe it's a fair solution for you. But with complexity, reliability is another problem. There is a reason why Prometheus is so popular: it's a single binary that just works and sits close to your workloads. Here, there are so many stages where something can go wrong, and that makes your monitoring just less reliable. Some of this you can mitigate, of course, by pushing in more money and engineering time, but it's a trade-off. And last but not least, something that is maybe overlooked: we treat a serverless function as a single quick job, so do you really want to make another TCP connection to send an observability signal, wait for it, and pay for that? The price of a function depends on its execution time; for example, in Lambda you pay for the megabyte-seconds of memory you use. So if the majority of the time spent in a quick function that does some logic is really spent sending observability information, that is not sustainable. But also, we are not experts in serverless jobs; we are just looking at this from a monitoring perspective. There are other problems, too.
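Before we get to those, a quick aside on sampling: here is a rough sketch (a hypothetical event type and thresholds, not any particular tracing SDK) of head-based sampling that keeps failed and slow events and only a fraction of the rest, which is exactly why metrics derived downstream become approximate:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// Event is a hypothetical observability event emitted by a fleeting job.
type Event struct {
	Err      error
	Duration time.Duration
}

var errFailed = errors.New("duel lost")

// shouldSample keeps every error and every slow event, and roughly 10% of the
// rest. Anything not kept is invisible to downstream metric derivation.
func shouldSample(e Event) bool {
	if e.Err != nil || e.Duration > 2*time.Second {
		return true
	}
	return rand.Float64() < 0.10
}

func send(e Event) {
	fmt.Printf("exporting event: %+v\n", e) // stand-in for a real exporter call
}

func main() {
	events := []Event{
		{Duration: 50 * time.Millisecond},
		{Duration: 3 * time.Second},
		{Err: errFailed, Duration: 120 * time.Millisecond},
	}
	for _, e := range events {
		if shouldSample(e) {
			send(e) // the other ~90% of healthy events are simply dropped
		}
	}
}
```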
Coming back to those other problems: for example, you cannot buffer and batch this monitoring data together to send it more efficiently, so it's much more expensive and slower than it would be with a longer-lived job. And what do you do if the tracing collector is down? How do you buffer? These things don't have state. So, yeah.

So let us take a look at another category of solutions that exist in the ecosystem, which allow us to push partially aggregated metrics instead of just events, and see how this might be useful. Let us start again at the beginning, where we have multiple fleeting processes with some events internally that we are interested in. However, as a normal Prometheus scrape does not work here, we need to collect that data somehow. Something that we can do is aggregate some of the recent data about these interesting events within the process itself. This partial aggregation cannot give us the full context that we need; it can only give us the very small context of events within one fleeting process. As Prometheus is not a push-based monitoring system, we require some additional gateway or collector component which can receive these partial metrics, possibly aggregate them even further, and then expose them for Prometheus to scrape and generate samples from. So in theory, with such a push-based approach, all of your processes can push partial metrics, which are essentially aggregations of their recent data collected from the events within the process itself. With a push-based approach, your fleeting processes are no longer dependent on the time of a scrape: fleeting jobs can simply flush their metrics to the gateway as soon as they are completed and allow the gateway to aggregate them further. Now, distributed aggregations are possible with this approach, but let's try to understand what that is and why we need it. In this kind of scenario, each of our fleeting jobs can aggregate its internal events in its own process. However, as these aggregations can only give us the very small context of a single fleeting workload, we cannot use them directly, because they are not an overall view. As you can see here with the counter metric, which is sent by each fleeting process after it has been incremented to values like one, two, or three: for us to get an overall view, the metric can be aggregated at the gateway or collector level to the value of six. This is what we can term a distributed aggregation. However, different types of aggregations are needed for different types of metrics within Prometheus. For counters, metrics where all the labels match can just be added up, so as you can see here, once a counter is pushed twice, we can expose it with a value of two. For histograms, or even summaries, the buckets can be added up, and if the bucket boundaries are mismatched, the result can be the union of all buckets. So as you can see here, when two HTTP request durations of 3.14 seconds are pushed into, say, the 5-second bucket, the sums add up to 6.28. For gauges, metrics can also be added similarly to counters, but we might want to provide different semantics for aggregations; for example, this CPU temperature gauge would benefit from keeping the last pushed temperature instead of simply adding them all up. So several open source solutions can help us in building such a partial-metric, push-based architecture for our fleeting processes, all of them with varying semantics, implementations, and configurations, and each with its own trade-offs.
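As an illustration of the push-on-completion idea, here is a minimal sketch using client_golang's push package, assuming a Pushgateway-compatible gateway at a hypothetical address and a hypothetical job name:

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	processed := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "batch_items_processed_total",
		Help: "Items processed in this run of the fleeting job.",
	})

	for i := 0; i < 42; i++ { // the short-lived work itself
		processed.Inc()
	}

	// Flush the partial aggregation to a gateway right before exiting,
	// instead of waiting for a scrape that may never come.
	if err := push.New("http://pushgateway.example:9091", "image_resize_job").
		Collector(processed).
		Grouping("instance", "run-1234").
		Push(); err != nil {
		log.Printf("could not push metrics: %v", err)
	}
}
```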
Also, functionality like distributed aggregation is complex to implement. For example, the Prometheus Pushgateway supports push-and-cache functionality but not distributed aggregations. Other solutions, like the Weaveworks prom-aggregation-gateway, do, but aren't really customizable. And there is even a new Rust-based aggregating gateway that does this via labels, built by an awesome engineer at Cloudflare called Colin Douch, whom we actually spoke to before this talk, and who is giving a KubeCon talk about the same topic. You can also rely on the OpenTelemetry and StatsD ecosystems for partial-metric, push-based architectures. Thus, you need to carefully consider what fits your workloads. Now, there are also problems around metric staleness semantics. Metric staleness is not handled well by such push-based solutions. Solutions like the Pushgateway never forget a time series that is pushed to them and will expose it to Prometheus forever, until you manually delete it via the Pushgateway's API, which might make it frustrating to use. This is also the case with the aggregation gateway, but it does not even have a metric deletion API, which means that to forget old or stale series, you would simply have to restart the service. Some solutions like OpenTelemetry allow the instrumentation to tell the collector when to delete metrics via the OTLP protocol, but it is not done automatically, and if you then pull with Prometheus, it can cause issues. We also, in a way, have to share cardinality between our fleeting processes. Each of them pushes some partial data, often the same exact metric with the same labels, but one process cannot see how the other processes are aggregating their events or how they are pushing. As a result, it becomes really easy to explode in cardinality with inconsistent labels and dimensions, and this can also lead to some pretty serious accuracy issues which can be really hard to fix. With push, reliability also takes a hit. Push requests may fail from your fleeting processes, and the gateway, not being scalable in the majority of cases, might become a bottleneck, dropping push requests or simply not being able to process them, and getting OOMed as more metrics are pushed to it. Thus, with such solutions there are way more things that can break and need to be fixed. As Bartek mentioned earlier, pushing in any form might mean an enormous latency hit, as you have to initialize some TCP or UDP connection. While partially aggregated metrics are easier to push than actual event data, this might already make it too slow for your environment. And we also need to worry about scalability. For many of the existing open source solutions, the gateway or collector components are not horizontally scalable, like the Prometheus Pushgateway or the aggregation gateway. Multiple instances often cannot be run together simultaneously, so these components will likely become single points of failure in your monitoring architecture as you start to push more and more metrics from your applications.
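To make the distributed aggregation and staleness points more tangible, here is a toy sketch (hypothetical, not modeled on any of the gateways mentioned) of a gateway that sums pushed counter values per series. Note that nothing here ever forgets a series, which is the staleness problem, and nothing coordinates label use across pushers, which is the shared-cardinality problem:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"sync"
)

// gateway sums pushed counter values per series key and re-exposes them for
// Prometheus to scrape. Stale series from long-gone fleeting jobs live forever
// unless something explicitly deletes them.
type gateway struct {
	mu     sync.Mutex
	totals map[string]float64 // key: metric name plus serialized labels
}

func (g *gateway) push(w http.ResponseWriter, r *http.Request) {
	key := r.URL.Query().Get("series") // e.g. jobs_done_total{job="resize"}
	val, err := strconv.ParseFloat(r.URL.Query().Get("value"), 64)
	if err != nil || key == "" {
		http.Error(w, "bad push", http.StatusBadRequest)
		return
	}
	g.mu.Lock()
	g.totals[key] += val // distributed aggregation: counters just add up
	g.mu.Unlock()
}

func (g *gateway) metrics(w http.ResponseWriter, r *http.Request) {
	g.mu.Lock()
	defer g.mu.Unlock()
	for key, total := range g.totals {
		fmt.Fprintf(w, "%s %g\n", key, total)
	}
}

func main() {
	g := &gateway{totals: map[string]float64{}}
	http.HandleFunc("/push", g.push)
	http.HandleFunc("/metrics", g.metrics)
	http.ListenAndServe(":9091", nil)
}
```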
Hello, hello. Can you hear me? As we learned, the two mentioned solutions, pushing events and pushing metrics, have their smaller or bigger problems, and the main theme is complexity. That might be fine for your requirements, by the way, but maybe there is something simpler we could do. I think once you have tried to solve the challenges of pushing events or metrics, it's easier, at least for me, to accept some limitations of pull-based metrics. And following that, in our opinion there are two solutions, and I think we are over time, so we have three minutes and I will just explain one of them. First of all: rich, closed-box monitoring. The terms open-box and closed-box monitoring are something you might know as white-box and black-box monitoring, the same thing with a different name, because we are all moving to more inclusive language. The premise is very simple. It's open-box monitoring if the information comes directly from the source of the event, for example from Wilbur directly. He might be the best person to tell us the details, what he planned to do in his mind and what went wrong. Closed-box monitoring is when you get similar, or good enough, information from someone or something that participated in the event, maybe a computation, maybe a network call, but is not the source of it; they just capture information along the way. So for the same questions around win ratio and duration, we might ask Wilbur's partner, perhaps his wife, who was watching all those fights. She might not know exactly what was on Wilbur's mind, but she saw the whole situation, right? And this is really a lesson from our requirements-gathering discussions with the OpenFaaS community. It's funny: we met with the maintainers, including Alex who is here, and we asked, hey, from the Prometheus community side, what do you need? What do you need to make OpenFaaS monitoring better? And they were like, what do you mean? We are happy. So I asked, okay, why are you happy? And the reason they are happy is that they built their monitoring to get this information cheaply from the surroundings of those fleeting processes, which also means there is no manual instrumentation needed; it's auto-instrumented. If you think about this more, there are many long-lived processes around your fleeting jobs. For example, with an orchestration system like Kubernetes you have kube-state-metrics, so if you have a batch job, you don't need to instrument it: you can tell its success rate or durations just from kube-state-metrics, right? Network proxies or gateways are part of your request path, so they can provide metrics from a network-traffic perspective, and you can build RED-method monitoring, rates, errors, and durations, entirely from a service mesh. Finally, the OS kernel knows about everything: every CPU instruction and every I/O that happens in your serverless function, plus memory usage, performance, and so on. So everything else you can derive from existing tools like cgroups and eBPF. The thing you might miss is open-box monitoring, the things that are really within the context and logic of the function itself. But because a function is super, super small, that context is usually very small too, so the question is: do you really need it? And second, I don't think I have time for that, we can talk about it later, but essentially it's a recommendation for serverless architectures; as a user of serverless you cannot do much more, so we have to skip that, unfortunately.
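To ground the kube-state-metrics point, here is a minimal sketch (assuming client_golang's Prometheus API client and the standard kube-state-metrics series kube_job_status_succeeded and kube_job_status_failed; the Prometheus address is a placeholder) that derives a batch-job failure ratio without instrumenting the jobs at all:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	// kube-state-metrics already watches Job objects for us, so the batch jobs
	// themselves need zero instrumentation to answer "how many runs failed?".
	query := `sum(kube_job_status_failed) / sum(kube_job_status_succeeded + kube_job_status_failed)`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Printf("warnings: %v", warnings)
	}
	fmt.Printf("batch job failure ratio: %v\n", result)
}
```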
But all of this, the recommendation I didn't get to and also the closed-box monitoring, is what you can do even if you are just a user of something like AWS Lambda, Cloud Run, or any of those existing platforms, right? In some ways, the reality is that cloud vendors may want to lock you in, so users are left without the possibility to use normal Prometheus metrics, normal code registries, normal instrumentation, and those closed-box monitoring features. So it's also about incentivizing those vendors to integrate more with open source and CNCF monitoring, and I guess it's time for some feature requests here and there if you are a user of those platforms. So really, that's the sum of it. The big learning here is that there is no silver bullet, unfortunately. Every solution, event-based observability, pushing metrics, closed-box monitoring, has its pros and cons, and hopefully you learned today about some of those trade-offs and implications. Hopefully all of this will give this important debugging, monitoring, and observability data to Wilbur, who can improve his duels, or ideally develop better negotiation skills to solve disputes in a more friendly way. And yeah, that's it, thank you very much.

We went a bit over, but we have about four minutes, so maybe we can take one question while the others are getting set up. Are you already cabled up, Alex? Okay, then we're going to use this time.

Yeah, would it make sense to run something like the Pushgateway as a DaemonSet and then push to, basically, localhost?

You can answer that if you want. So, as far as I understand it, the Pushgateway is not recommended for those kinds of setups. It does not have any distributed aggregation capability, so it's only recommended for service-level, ephemeral batch jobs, and running it as a DaemonSet, or even as multiple replicas, isn't recommended. I think there was a mailing-list thread about this too, where Brian commented, but no, it isn't safe to run it like that.

Any other questions? Maybe you can give us two minutes on the worker approach then.

Okay, the worker approach. Oh my god, I should actually have shown this, I didn't know there would be enough time, but anyway. Serverless platforms usually architect those fleeting processes as isolated processes. But if you take a small step back and schedule them in a multi-threaded environment instead, so you have one bigger, long-lived process that schedules those functions, maybe only one type of function, and then you only scale those long-lived worker processes that orchestrate the functions, that helps, because those functions should be super small and there is overhead in even starting a process. It would be good because you finally have a long-lived worker that shares some state, so you can batch different things, not only monitoring, but maybe other state as well, even for your logic, right? But the state of the world right now is that, for example, Knative uses Kubernetes, so they would need to move a little bit away from that. It's just a recommendation from someone who looks at this architecture from a monitoring perspective. Thank you. Thank you.

Any more questions? Just drop them in Slack, the Prometheus folks are there, and I'm certain they will answer them. Thank you. Yep.