I'm really excited to be here. It's my first time in Europe, and all the talks have been great so far. I've been running around between both rooms, but I'm here for now, and I hope you're excited about learning some things about serverless observability.

Let me start by telling you who I am. I'm Vridwan Sharif. I work on observability tools and solutions for different GCP environments. Most of my work has focused on agents like OpenTelemetry, Prometheus, and our own Ops Agent, and on getting them to run on VMs. More recently, I've been working on leveraging OpenTelemetry and Prometheus in serverless environments like Cloud Run. I'm here today for two main reasons. The first is that I always wanted to visit Europe, and this way Google pays for it. The second is that for the past six months I've been working on making Google's serverless offering, Cloud Run, more observable. There were lots of lessons along the way, and I think it makes for a good talk. A lot of those lessons are very applicable to the open source world, and we wanted to make sure we build these solutions in OSS as opposed to in private.

This is a quick agenda of what we'll cover. We only have about 25 minutes, so we'll stick to some of the basics, and if some of these topics feel rushed, find me after. They probably are rushed. We'll cover serverless briefly. We'll talk about some of the problems with serverless observability. We'll then cover some changes we actually needed to make to Cloud Run itself to accommodate observability. Then we'll talk about the implications for different kinds of telemetry systems: push-based ones like the OpenTelemetry Collector and pull-based ones like Prometheus. And then we'll talk about how we productionize and configure these agents in the wild.

So let's get started. What, why, and when do we use serverless? The premise of serverless is pretty simple. It gives you the promise of users only paying for what they consume, and it scales quickly and transparently as workload demands change. It's a way of providing infrastructure as a service on a pay-per-use basis. Typically, you write some code or configuration, and that's it; you don't have to worry about the underlying infrastructure anymore. Google's serverless offering, Cloud Run, offers serverless for containerized workloads. If you have containers, you can scale them pretty seamlessly with Cloud Run. It implements the Knative serving API, and like Knative, it scales to zero. What that means is that if your service sees no traffic, you pay for no CPU or memory.

Why would you use serverless? The main reason is cost. It's generally very cost-effective; you often pay a fraction of what you would pay otherwise because you don't pay for unused space or idle CPU. Developers using serverless don't have to worry about how to scale their code; it all scales on demand. It also gives you a quick time to market, because you don't have to worry about complicated rollout processes and procedures. You just write the code, and the serverless vendor usually takes care of the rest. I really like this illustration showing some of the cost benefits of serverless: the areas in dark blue are your cost savings, because you don't pay for unused space or idle CPUs. Serverless is not a silver bullet, though.
You don't get to use it for every problem you encounter. For Cloud Run specifically, and other containerized serverless offerings, there are very specific conditions you need to satisfy. For Cloud Run, your workload has to be request driven, usually HTTP or gRPC. It must not require a local file system; network file systems and network databases are fine, though. Your container should be able to handle multiple instances of the app running simultaneously, and it shouldn't have very high CPU or memory requirements per instance. And the most important one for Cloud Run specifically: it has to be containerized, because it's containers that we're scaling.

There's a perception that if you're running containerized workloads, you have to choose between Cloud Run and Kubernetes or some equivalent, but I don't think that's really true. They can and should be used together. I like this sketch because it gives a simple illustration of when to use which runtime. If you want committed resources, use VMs or Kubernetes. If you want pay as you go, use one of the serverless offerings. If you're using containers, evaluate Kubernetes or Cloud Run. That's serverless in a nutshell.

Now let's talk about why observability is tricky when you're dealing with serverless. On the spectrum from traditional on-premises instances all the way to software as a service and serverless, you trade off what you manage against what your cloud or serverless vendor manages for you. On one end, you manage everything from the hardware to the operating system; on the other end, you only manage the configuration and maybe some application code. But what does that mean for observability? If your vendor abstracts away the infrastructure you want to observe, what does observability mean in this environment? Sometimes it feels like you're looking at a brick wall. You might be flying blind, without access to the custom telemetry that you really want or are used to in other runtimes. And sometimes, when you want to add this telemetry, it feels like you're onboarding onto your cloud vendor's specific tech stack.

I don't mean to say there is no observability on serverless right now. There certainly is. Any logs that go to standard out or standard error usually end up in Cloud Logging or an equivalent backend. You also have a bunch of built-in metrics for serverless; this is an example from our backend. You get request counts, CPU utilization, counts of running jobs, and a number of other built-in metrics that are quite useful.

But say your container runs some exporter, or you have custom application code you've instrumented yourself, with telemetry about your business logic. What then? Then you're actually out of luck. You might need to get your hands dirty and add a considerable amount of complexity to your infrastructure. And that kind of sucks, from Google's perspective. We would much rather you build an image, instrument it once with your favorite instrumentation, whether that's Prometheus, OpenTelemetry, or something else, take that container, run it across any of these runtimes, and instrument and visualize it the same way. This is what we would ideally want, except that the serverless branch there is not really true.
It's not as simple, because of a few problems, so let's cover them now.

Serverless compute is designed to keep instances alive only as long as they're needed. In the extreme, an instance might live for exactly one request: it might pop up, serve the request, and then terminate. So a serverless instance needs to export and flush all of its telemetry very quickly and efficiently, and a lot of the instrumentation libraries that exist right now don't actually have that functionality: they don't flush telemetry on shutdown. Prometheus and OpenCensus, for example, don't. OpenTelemetry does support this, but we don't want to require everyone to use it for instrumentation. And oftentimes what you build for serverless ends up not being very portable; you're writing containers built specifically for serverless to get around this.

Admittedly, the second problem is more of a Google problem. Our time series database, Monarch, doesn't allow cumulative metrics to be stored without start times. What that really means for us is that we need two points before we can report the first cumulative point to our time series database, because we cache the first one and normalize every subsequent point against it. In serverless this is tricky, because, given the previous problem of instances dying quickly, an instance might not live long enough for two successful scrapes.

The third problem is quite general, though. Prometheus-style, pull-based metrics require that metrics are pulled from the instance rather than pushed by it, and that's a problem. Prometheus is pretty opinionated, as probably all of you here know. It assumes you have long-running instances that live long enough to be discovered and scraped, it's OK with missing the occasional scrape, and it assumes it can reach targets at known hosts and ports. A lot of these assumptions don't hold with serverless vendors. And even if they did, an external Prometheus doing the scrapes doesn't know the lifecycle of the instances themselves, so, as we'll see in an example shortly, scrapes will often fail. There are solutions around this: push gateways and aggregation gateways. At KubeCon Europe two years ago, there were actually two talks about adapting Prometheus to the serverless world using aggregation gateways, one from Colin Daas of Cloudflare and "Fleeting Metrics" by Bartek and Saswita. Interestingly, they're all here at this KubeCon, so hunt them down too. This is what that looks like: with a long-running instance and interval-based scraping, you get successful scrapes and plenty of aggregated data to pull. But when you take that long-running instance and split it into a bunch of short-running instances, some scrapes might miss entirely, and even the scrapes that do land only get partial results, because that instance might never be scraped again to pick up the rest of the aggregation. And that's a problem.

There's also one last, minor problem. In other runtimes, like Kubernetes, you might have nice operators to configure your agents. On Cloud Run, and on a lot of serverless vendors, we don't have that.
So we need to be more creative in how we configure and deploy these agents. This is what that looks like: the nice, portable story you wanted, we don't really have. Instead we have this very hand-wavy picture of how we should instrument and deploy our observability solutions. And we were determined to address this. How should we do it? Well, we actually cheated. We decided to partially solve this problem in the Cloud Run service itself, not just so it could run our bespoke agents, but also so it could run OTel and Prometheus. So let's cover that now: what changes did we need to make to Cloud Run to accommodate observability?

The first one is that we introduced sidecars. Sidecars let you start independent containers that run alongside the main container serving web requests. The main use case here is to support running collection agents; the OTel Collector, for example, can run in a sidecar and communicate freely with the application. All containers share the same network namespace, so they can talk over localhost and ports, and they can also share files through mounted shared volumes. To be fair, observability is not the only use case for sidecars, so we didn't have a very hard time convincing people to add them: they can also be used for nginx in front of your application container, authentication and authorization filters, or connection proxies. Sidecars are generally quite useful.

We didn't need just sidecars, though. We needed one more bit: the ability to order the lifecycles of these containers. That means we can define a dependency between containers. In this case, container A is a dependency of container B, which means container A starts up before container B starts up, and container B shuts down before container A shuts down.

Let's motivate why this is important with a couple of examples. Say you take a push-based approach to your telemetry with the OpenTelemetry Collector. Your application starts up and might want to flush some metrics. When it flushes that telemetry, it expects the OTel Collector to be up and ready to respond, so the collector actually needs to exist before your application is alive. The same thing applies on shutdown: your application might flush telemetry once again, and when it does, the OTel Collector needs to still be alive, so the collector needs to terminate after the application. This dependency relationship is quite important to make sure we don't lose telemetry.

For the pull-based approach, say you have Prometheus as a sidecar. You actually need the opposite. Because the flow of telemetry is initiated by the sidecar itself, the Prometheus sidecar might want to pull as soon as it starts up, in which case it expects the application to already be alive. And the same thing on shutdown: when Cloud Run decides there are no more requests and shuts down the instance, we actually need to shut down the sidecar first so it can do a final scrape of the application before the application itself terminates.

So this is what an instance looks like with no instrumentation, and once you add these sidecars, it looks like this, with the dependency we talked about before.
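To make that concrete, here is a minimal sketch of what declaring such a dependency can look like in a Cloud Run service spec. The service name and image paths are placeholders, and the container-dependencies annotation is shown as I understand it from the Cloud Run multi-container docs, so double-check the exact syntax before using it. This sketch uses the pull-based ordering (the collector depends on the app, so the app starts first and the collector shuts down first); for the push-based OTLP case you would flip the dependency.

```yaml
# Sketch of a Cloud Run service with a metrics-collector sidecar.
# Names, images, and annotation syntax are assumptions; verify against
# the current Cloud Run documentation.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-service                      # hypothetical service name
spec:
  template:
    metadata:
      annotations:
        # "collector" depends on "app": app starts first,
        # collector shuts down first and can do a final scrape.
        run.googleapis.com/container-dependencies: '{"collector": ["app"]}'
    spec:
      containers:
      - name: app
        image: us-docker.pkg.dev/my-project/repo/app:latest            # placeholder
        ports:
        - containerPort: 8080            # the ingress container serving requests
      - name: collector
        image: us-docker.pkg.dev/my-project/repo/otel-collector:latest # placeholder
```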
And so every time the instance pops up, the sidecars pop up with it for the purpose of observability, and when it shuts down, they shut down together. And that's it. That's all we needed to do to Cloud Run.

Now let's look at what this meant for OTLP push-based ingest, and then we'll talk about Prometheus after. It turns out nothing more is needed: OTLP push-based metrics now just work. All you need to do is specify the dependency ordering when you add the collector as a sidecar, and you'll have your metrics flowing with no lost telemetry. And that's quite nice, actually.

Pull-based metrics were not as nice; they were a little trickier. We actually needed to make a few changes to the Prometheus libraries and some OTel components to get this to work. The main thing we needed to add was the ability to scrape on shutdown: when the Prometheus collector is shutting down, it performs a final scrape. We had to add that to the libraries, and we need to make sure these scrapes are guaranteed even if the sidecar has only just booted up. The last point there is GCP-specific: we need these final scrapes to complete within 10 seconds, because with the dependency ordering, after 10 seconds the application needs its chance to shut down gracefully; otherwise everything gets killed non-gracefully.

This is what that looks like: you have your regular interval-based Prometheus scraping, plus a final scrape at the end to make sure you collect everything and don't miss any aggregated data. This is what it looks like when your instance is only alive for, say, 10 seconds: we scrape optimistically on startup with some offset, and then again on shutdown. And, as we talked about before, there's the extreme case where your instance lives for exactly one request, where the instance pops up and dies immediately after serving the request; we will still scrape once on shutdown.

I alluded to this before: our issue on GCP was that we couldn't deal with just a single scrape for cumulative metrics. So we needed to make a change to OpenTelemetry's Prometheus receiver to allow us to do some hacky things. One thing we do is synthesize a start time. We look for the process start time metric that a lot of Prometheus SDKs export, and if we don't find it, we use the collector's start time instead. There's a slight catch that's very rare and definitely an anti-pattern: if your serverless application is running an exporter that emits a metric about something that lives outside the instance, you might see an artificially high rate. But again, this is rare, and we were OK making that trade-off.

And that's it. Now we have pull-based and push-based metrics working correctly. The last bit to figure out is how we deploy and configure these collectors in the wild. This is admittedly a little hacky, but we use Secret Manager, our secret store; there are a bunch of equivalents. Essentially, we store the configs of these collectors as secrets. Secrets are nice because you get granular access controls for who can edit and view them. We then mount these secrets onto the collectors as volumes so the sidecars can access them.
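To give a sense of what goes into that secret, here is a minimal sketch of a collector config with a Prometheus receiver scraping the application over localhost. The job name, target port, and scrape interval are placeholder assumptions, and the googlemanagedprometheus exporter from the collector-contrib distribution is used only as an example backend; any exporter your backend supports would slot in the same way.

```yaml
# Sketch of an OTel Collector config that could be stored in Secret Manager
# and mounted into the sidecar. Ports, intervals, and the exporter choice
# are assumptions; adapt them to your service and backend.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: app                      # placeholder job name
          scrape_interval: 10s
          static_configs:
            - targets: ["localhost:8080"]    # app and sidecar share localhost

exporters:
  googlemanagedprometheus: {}                # example backend exporter

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [googlemanagedprometheus]
```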
This is quite nice, because any update made to the secret automatically shows up in the collector's local file system, since the secret is mounted as a volume. The collector's supervisor process then polls that file for updates and reloads the collector so it's always running the latest config. This is what that looks like: you store the config as a secret in Secret Manager, it's mounted as a volume, and the Cloud Run sidecar keeps reading the file until it sees a new config, at which point it reloads the collector.

To configure all of this, you just define the dependency ordering we talked about before, making sure the application is a dependency of the collector, then define the secret, mount it as a volume, and use an OTel Collector with the Prometheus receiver. And that's it; you don't have to make any changes to your instrumentation. We have an application where the same image runs on a VM, on GKE, and on Cloud Run, and the same metrics flow in the same way from all three runtimes.

And that's kind of where we are now. We had built-in telemetry support, and we then made some changes to Cloud Run to accommodate observability. Over the next few months, we're working on upstreaming the changes we had to make to OpenTelemetry and Prometheus, so they're useful not just to our collectors but to these projects in the wild.

Looking ahead, there are still a bunch of problems to solve with Cloud Run and serverless. One is resource overhead: the sidecar approach is nice, but customers pay directly for the CPU and memory the sidecars use, so keeping that under control is vitally important. There have already been a few talks about instrumentation overhead, and collector overhead is something we can look into as well. There are other issues, like CPU throttling. This one is tricky and annoying, and we have not figured it out yet. Essentially, if your service's traffic isn't high enough to keep it continuously serving, but isn't low enough for it to shut down, you can end up in a state where Cloud Run gives you an instance but no CPU, in which case your scrapes and flushes fail and you get a bunch of errors and limited observability. There are also other approaches we're considering that are more vendor-specific, like sidecar-less monitoring where the control plane itself does the scraping, but that's not as interesting because it's completely closed.

Yeah, that's what I have for you. I hope this was useful. I'm also curious what people will use these changes we made to Cloud Run for. That's about it; I'll open it up to questions. Thank you for listening to me ramble.