Thank you very much. Ah, sorry, my VPN's trying to connect. OK. So thank you very much, and thank you for the invitation. It's really great to be here. I'm going to be talking to you about observability at scale. I think this presentation might be a bit different because, as the title says, Open Systems is focused primarily on edge devices. Most of what we hear about nowadays is cluster monitoring and cluster observability; we're going to see how we do the same kind of thing, but for edge devices.

There's a lot to get through, so I'm not going to mess around. This is the agenda. I'm going to spend a bit of time on what we do and what we've been doing for the past two decades or so, then outline the challenges we're now facing, give you a brief overview of the action plan, and finally give you an update on where we are on the roadmap of putting it into motion.

I promise there is only one marketing-type slide, and this is it. I chose this slide because, in one picture, it really shows what we do, what our business model is. We are a SASE company; SASE stands for Secure Access Service Edge, and it basically splits into two halves. On one side you have networking: wide area networking, software-defined WAN, basically making sure our customers' offices around the world can talk to each other, VPN, that kind of thing. On the other side you have security: proxying, firewalling, zero-trust network access. These two things come together to make SASE, which Gartner coined in 2019. It turned out we'd already been doing it for two decades, but no matter. On top of that, we have the managed SASE component: a 24/7 global Network Operations Center which, on top of the SASE offering, makes sure that everything is working as it should be.

To give you an idea of how this can look, this is what a customer might see when they log into the portal: an overview of where their hosts are around the world, with a lot of metrics they can already drill into.

The start of the journey for our fleet of edge devices is that they're lovingly provisioned in a workshop before being delivered to our customers. We ship them off individually named; they're really pets, which we send out into the world hoping they'll be looked after. To give you an idea of what they might look like: this is the most commonly used box you might receive, and this is the new line of boxes, which has more options, as you can see. The customers then take the boxes and put them into some home like this, which breaks our hearts a little bit. No, it's OK.

This is how we've grown over the past two decades. Back in 2004, we had 80 physical hosts. In 2007, we reached the 1,000-host mark. And sometime between 2022 and 2023, we crossed the 10,000-host mark. So there are a lot of devices out there which need to be looked after. Now, what happens when a device gets sick? It gets nursed back to health by mission control.
Mission control, which I'll describe now, basically follows the same recipe every time: you log into the box, you grab the logs, you check the service statuses, and you do the usual Linux diagnostics. This is how mission control looks; it's a bit blurry, sorry about that. Basically, tickets come in, and a ticket can either be created by a customer or created automatically by the monitoring pipeline, which is what we're going to talk about. When something goes wrong, it ends up in a ticket. That's the thing to take away.

So how do we actually monitor the fleet? This is how we've done it up until now, let's say. The monitoring pipeline has three components: service nurses, GUMMA, and metrics. Service nurses and GUMMA go together; this is application monitoring. Nurses are small scripts which continuously monitor the status of services on the device. When something goes wrong, they emit a notification, which is a log line. Each nurse is effectively a state machine, so different events can trigger the nurse to emit different log lines. This is how such a log line looks; it's fed into syslog and labeled, with all that kind of metadata attached.

On the other side, we have GUMMA, the Grand Unified Monitoring Architecture, which gives you an idea of the importance this thing has in the whole setup. Its job is to parse, classify, and filter syslog. To give you an idea, it matches every incoming log line against a list of around 3,000 regexes and then decides whether to forward the line centrally (a sketch of that mechanism follows at the end of this section). Once a line gets central, it enters the implication engine. The implication engine is 20 years of business logic which decides whether or not to create a ticket in mission control. It's a black box, really. That was 2008, when we got GUMMA.

We also have a metrics stack; we've had metrics along the whole way. The most recent incarnation, before what I'm going to talk about later, is built on InfluxDB, and this is where most of the statistics in the customer portal come from. We scrape locally on the host with Prometheus, feed that into Kafka, and a Kafka consumer writes it into InfluxDB. This started in 2015, and this is how InfluxDB looked in 2015: two physical data centers where Influx lives. By 2021, this is how Influx looked. It wasn't doing so well: we had scaled out as much as we could, but it needed a lot of manual intervention by that point. My favorite Jira ticket ever: the trip to the data center to add more RAM to Influx.

So in 2021, this prompted a review of the whole monitoring stack: can we really go forward with things as they are? I think it's clear we cannot. We identified three big challenges which we needed to solve. One is new environments: we're very good at edge devices, but with cloud and virtual machines we are novices. The second is log-centric alerting: everything we do is based on logs, and every alert which comes into mission control started as a log line. This is not very efficient, especially when we're evaluating 3,000 regexes, and that list of regexes is very poorly maintained, let's say. This all feeds into the basic problem, which is scalability. The current solution works really well for edge devices, but it does not scale beyond that. So we need a rethink.
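To make the log-centric mechanism concrete, here is a minimal sketch of GUMMA-style classification: match each syslog line against a pattern list and forward hits centrally. This is an illustration under assumptions, not the real implementation; the patterns and the forward step are made up.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// Illustrative patterns only; the production list is around 3,000 entries.
var patterns = []*regexp.Regexp{
	regexp.MustCompile(`nurse\[\d+\]: service (\S+) entered state FAILED`),
	regexp.MustCompile(`kernel: .*I/O error`),
}

// forward is a stand-in for shipping a matched line to the central
// implication engine.
func forward(line string) {
	fmt.Println("forwarding:", line)
}

func main() {
	scanner := bufio.NewScanner(os.Stdin) // e.g. fed from syslog
	for scanner.Scan() {
		line := scanner.Text()
		for _, re := range patterns {
			if re.MatchString(line) {
				forward(line)
				break // first match wins
			}
		}
	}
}
```

Running a few thousand regexes over every log line is exactly the inefficiency called out above.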
So this is why, in 2022 and 2023, we've had an observability focus. Each challenge we need to solve maps nicely onto one part of the monitoring pipeline as it exists. For the customer fleet, we need to be able to ingest metrics from anywhere: not just edge devices, but cloud-based services, virtual machines, any cloud provider, GCP, whatever. We need to be able to take the metrics without caring where they came from. On the alerting side, we should make sure we're alerting on things which matter. There's a lot of noise in mission control because of how we generate alerts, which is from log lines, and things can happen which just make a log go crazy, so you get 500 tickets in mission control which are noise. If you change to alerting on metrics, you only alert when things are actually going wrong; when your CPU temperature is going crazy for some reason, that's something you should know about. And finally, the big one is the implication engine. We need to take those 20 years of business logic, somehow break them apart, and make them accessible to all the service teams, not just a few gurus.

Now, where are we in the process? We started back in 2021 with this project. 2022 was really focused on tackling the first part of the pipeline, building an environment-agnostic observability platform, and now, in 2023 and beyond, we're moving into alerting on metrics. Today I'm going to talk mainly about these two stacks, actually only one of them. On one side we have our logging backend, which is based on Grafana Loki; these are the stats of that cluster, roughly three terabytes per day. It's not a small cluster and not a huge cluster; it's pretty big. I would love to talk about Loki, but today I'm going to focus on Thanos, because Thanos is really what inspired this change; that's where we started, so it feels right to start there. We have 50 cores, 800 gigabytes of RAM, and 370 pods, and we typically ingest 110 million metrics into Thanos, bursting up to 240 million sometimes. There are reasons for this; different processes generate metrics. But that's a pretty typical load level we're dealing with for Thanos.

Now, a basic refresher, although probably most people here know what Thanos is. This is the classic Thanos architecture: the sidecar model, which plugs Prometheus into the Store API, exposing Prometheus so you can query it through Thanos, and which also lets you flush blocks from Prometheus into object storage, giving long-term retention of those Prometheus metrics. This is really good. We actually run both architectures, but the one which is most interesting for us is the receiver model. This is another way of running Thanos where you have routing receivers and ingesting receivers: the routing receivers accept Prometheus remote-write requests and pass them along to the ingesting receivers, which push the metrics into object storage. Of course, this is important for us because our devices are certainly not living on a Kubernetes cluster, let alone the same Kubernetes cluster; they have to reach us over remote write.
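For orientation, the client side of that path could look like this minimal, hypothetical Prometheus remote_write block pointing at a Thanos routing receiver. The endpoint, tenant value, and certificate paths are made up; THANOS-TENANT is the default tenant header name in Thanos receive.

```yaml
remote_write:
  - url: https://metrics.example.com/api/v1/receive  # Thanos receive remote-write endpoint
    headers:
      THANOS-TENANT: proxy                   # which tenant pipeline this stream belongs to
    tls_config:
      cert_file: /etc/prometheus/client.crt  # client certificate, checked at the gateway
      key_file: /etc/prometheus/client.key
```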
So we need this functionality, and this is the main reason, I think, that we chose Thanos: we were able to have the coexistence of the sidecar model and the routing-receive model in one global query view, which is really, really nice.

This is how the architecture looks for our global view of all metrics, coming from all sources. Everything is written centrally; this is a key compliance-driven decision. We're in Switzerland, and they like to have the data centrally in Switzerland. It's good, it's safe, it's OK. So we want to put everything centrally; we have to do that. What we do is run an ingress on our central Kubernetes cluster, backed by Nginx; we can call it an Nginx gateway. When an edge device wants to write something, it writes to a public endpoint and authenticates with a client certificate. All of the edge devices have client certificates, which then get checked by Nginx. We can also accept writes from external Kubernetes clusters, so maybe regional hubs, different Kubernetes clusters running in the UK, Germany, all over the world; they can also write metrics to our central cluster. Then, in Nginx, the requests get routed based on the Thanos tenant header: Nginx sends the metrics, based on the value of that header, into one of several Thanos instances (a configuration sketch follows at the end of this section). I'm not sure what the proper term for multiple Thanos instances is; Thani? Thanoses? We can debate that later. So there is a dedicated pipeline for each Thanos tenant, and we've also got the Prometheus sidecar running to collect the local cluster metrics. Everything ends up in different buckets.

So what do we mean by a tenant? For us, it's very important that we don't blow up the ingestion pipeline. I'm sure some of you have experience with cardinality explosion; it happened to us in Influx, and it ain't pretty. Our problem is that we're somewhat at the mercy of the service teams: if a service team decides to make a new metric which has a label on domain, that's horrible, and cardinality goes through the roof. We did it to ourselves once. So we actually have hard tenancy: a dedicated pipeline for each service component, each service team that may want to write metrics into Thanos, so the proxy team, the WAN team, and so on. And this also extends to cluster metrics: a similar kind of thing, a dedicated pipeline, and that's it.

We can also help mitigate cardinality explosions by doing active series limiting. This was a new feature in Thanos 0.29; I think that was back in November last year. What effectively happens is that the routing receiver constantly queries the ingesting receivers via meta-monitoring, asking how many series are coming in for each tenant, and checks that against the configured series limit. If the current active series count is greater than the limit, it blocks the request. So we have these two different ways of mitigating the effects of cardinality explosion, which is very important when we don't have real control over what the service teams are doing.
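Concretely, that limiting is driven by a limits configuration file handed to the receivers. A minimal, hypothetical sketch follows; the tenant names and numbers are made up, the meta-monitoring query shown is the documented default, and since the feature was experimental around 0.29, field names may differ in your version.

```yaml
write:
  global:
    meta_monitoring_url: http://prometheus.monitoring.svc:9090
    meta_monitoring_limit_query: sum(prometheus_tsdb_head_series) by (tenant)
  default:
    head_series_limit: 1000000   # fallback limit for tenants not listed below
  tenants:
    proxy:
      head_series_limit: 5000000 # a bigger budget for a high-volume tenant
```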
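As for the Nginx gateway itself, the header-based routing can be sketched like this. The service names, tenants, and resolver address are illustrative assumptions; 19291 is the default Thanos remote-write port, and Nginx exposes the THANOS-TENANT header as $http_thanos_tenant.

```nginx
# Map the tenant header onto a per-tenant Thanos receive service.
map $http_thanos_tenant $thanos_backend {
    default  thanos-receive-default.monitoring.svc:19291;
    proxy    thanos-receive-proxy.monitoring.svc:19291;
    wan      thanos-receive-wan.monitoring.svc:19291;
}

server {
    listen 443 ssl;
    ssl_certificate         /etc/nginx/tls/server.crt;
    ssl_certificate_key     /etc/nginx/tls/server.key;
    ssl_client_certificate  /etc/nginx/tls/ca.crt;  # CA that signed the edge-device certs
    ssl_verify_client       on;                     # reject writes without a valid client cert

    resolver 10.96.0.10 valid=30s;  # cluster DNS; needed because proxy_pass uses a variable

    location /api/v1/receive {
        proxy_pass http://$thanos_backend;  # original request URI is passed along
    }
}
```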
So how do we actually collect the metrics on the edge devices? The metrics are mapped to tenants on the device itself. We have a Prometheus instance running on the device, and we have a Telegraf instance which actually scrapes the metrics and pushes them into Prometheus. Prometheus then pushes everything to a Telegraf producer, and the producer remote-writes everything to Thanos.

I can already hear the question: why do we have this extra component? There is a reason. We need the Telegraf producer because Prometheus remote write has separate shards, separate streams, to which the metrics are written. If one of the tenants goes crazy, it can fill the buffer, fill the stream, and when that happens, all streams fail in Prometheus. So one tenant can break everything on the device, which means we cannot run our hard-tenancy pipeline directly on the device. That's a problem. What we do with the Telegraf producer buffer is constantly read from Prometheus, so the Prometheus queue doesn't get full. The Telegraf producer is then a nice buffer which keeps the metrics flowing on the device, and that's why it does the remote writing. The config is quite simple: we match on particular metric names and write them into a service-component label, and then, in the Telegraf output, we map the Thanos tenant header based on the value of that label (a sketch follows at the end of this section).

Now, for the read path, the goal is again a global view: regardless of whether a metric came from an edge device, from our cluster, or from a different cluster, we want to query it with one instance of Thanos Query, or Grafana, or whatever. The naive approach would be to just add a Store API for each bucket (the store gateways doing the querying are what actually fetch the data from the buckets), plug those into a single querier, put a query frontend on top if you like, and call the job solved. But what we have to think about is query quality of service. In this architecture, all users have the same priority. The problem is that you have one entry point for queries, and this means that if you have an important user, like the portal, which is our customer-facing side, we cannot, in this setup, say that the portal should have more importance.

So what we can do is introduce user queues: separate instances of the querier dedicated to particular users, like the portal, the ruler, and Grafana. This makes things slightly more complicated, but we can at least guarantee that the portal team will get one third of the available read capacity. Can we do better? Can we actually prioritize the portal queries if we want to? It turns out that we can, by effectively layering queriers: we introduce bottlenecks which are intended to give the portal queries higher priority. What we have here is two queriers, the ruler querier and the Grafana querier, with equal priority in their queue, but they then get bottlenecked below, so at least 50% of the available store requests are reserved for the portal querier. I didn't come up with this; it's taken from a really nice feature in Grafana Loki 2.8, which introduced a sort of cascading, stacked query scheduler, and we decided to see if we could try it in Thanos. It turned out to work pretty nicely.

So this is where we are today: we have the new metrics pipeline, and we also have the logging pipeline working.
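The producer config mentioned above might look something like the following: a minimal, hypothetical Telegraf output stage in which only metrics carrying a matching service_component tag pass into each per-tenant output, and the tenant header is set accordingly. The URL, tag name, and tenants are made up; the Content-Type and Content-Encoding headers follow Telegraf's prometheusremotewrite serializer documentation.

```toml
[[outputs.http]]
  url = "https://metrics.example.com/api/v1/receive"
  data_format = "prometheusremotewrite"
  [outputs.http.headers]
    THANOS-TENANT = "proxy"
    Content-Type = "application/x-protobuf"
    Content-Encoding = "snappy"
  [outputs.http.tagpass]
    service_component = ["proxy"]   # only proxy-tagged metrics take this path

[[outputs.http]]
  url = "https://metrics.example.com/api/v1/receive"
  data_format = "prometheusremotewrite"
  [outputs.http.headers]
    THANOS-TENANT = "wan"
    Content-Type = "application/x-protobuf"
    Content-Encoding = "snappy"
  [outputs.http.tagpass]
    service_component = ["wan"]     # only wan-tagged metrics take this path
```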
What we want to think about now is where we're going with this, and how we're going to achieve our goals. Because at the moment we're not delivering value: we have fancy pipelines, but mission control is still getting a lot of noise, and our job is done when mission control is happy. So we want smarter alerts, and we want smarter alert handling: when an alert comes in, how can we handle it better?

This is, let's say, the new pipeline. We have our environment-agnostic inputs going into the logs backend and the metrics backend, and this is all working really nicely. But we still have this implication engine, this black box which we still don't really know what to do with. The idea is that we're going to use Alertmanager. Actually, not "going to": we're currently feeding alerts into Alertmanager. When our backends generate alerts, they send them to Alertmanager, and Alertmanager channels them into our own piece of software, which holds our business logic and is called the alert handler. The alert handler is effectively an interface which allows us to insert additional logic: we can enrich alerts with information from a database (a runbook, for example), we can decide to group alerts, and we can cross-correlate alerts. One example of why this is important: an ISP outage. If a whole ISP goes down, you might expect all hosts connected to that ISP to go down too, so you're going to have 100 tickets in mission control. If you pass them through a smart alert handler, it can say: look, this ISP is down, therefore all these connected hosts will also be down, and it groups those tickets together. That reduces noise and reduces toil in mission control, which is really the main target. So this is the goal: to build the unified alerting pipeline. This is now where our focus lies, with the backends running really nicely underneath.

A key thing about the alert handler component is that we've designed it as a Kubernetes operator. What this means is that service teams can actually break open the black box and configure the alert handler by themselves. They can enrich the alerts however they like, and they can also create the alerts however they like; they have the Prometheus ruler. So we really empower the service teams to tell us how alerts should be handled, and we give them the tooling that makes that possible.

In conclusion, I think the scalability dragon is there for everybody; everybody will fight it at some point. For us, the big challenges were scalability going forward and dealing with new environments, but I think we're doing well on our implementation of the action plan. What we now need to focus on is actually making use of the raw observability data, taking the pillars of observability and actually making them valuable. Those are the next steps. Hopefully next year we can have a review, so cheers to 100k-plus hosts, and thank you very much. I just want to mention that we are hiring; we're also growing internally, so if you're interested, please come say hi. Thank you very much. Any questions?

OK, we have two here. "I'm interested to know how the integration between the Alertmanager and your custom alert handler works. Is it a webhook, or how do they interact technically?" So we have a custom channel; it's a webhook channel in Alertmanager, and that's the link: a webhook between Alertmanager and the alert handler.
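For reference, that link is Alertmanager's standard webhook receiver. A minimal, hypothetical sketch of the Alertmanager side (the alert-handler URL is made up):

```yaml
route:
  receiver: alert-handler   # hand everything to the alert handler by default
receivers:
  - name: alert-handler
    webhook_configs:
      - url: http://alert-handler.monitoring.svc:8080/api/v1/alerts
        send_resolved: true # also forward resolved notifications
```

Alertmanager POSTs batched alerts as JSON to that URL, which is the hook where the handler's enrichment, grouping, and correlation logic can run.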
"Sorry, I'm just wondering: at your scale, have you ever seen the issue where, let's say, you upgrade Prometheus and it restarts, and it consumes lots of memory and time in order to process the files in the write-ahead-log directory?" So we've seen something similar in our receivers, this kind of write-ahead-log replay. Is that the sort of issue you're talking about? "Yeah. And how do you deal with it? Instead of just increasing the memory, have you found something?" What we do is try to balance the retention: we keep the retention quite low, about six hours in the receivers. We don't have huge retention there, because that will just murder you. You can also scale out horizontally to reduce the number of series which each receiver is actually handling. We have an empirically found guideline of about 200,000, or maybe 160,000, series per gigabyte of RAM. Please don't quote me on that; well, now I've said it. So that's our rule of thumb for how much RAM you need per chunk of series, and then we scale horizontally or vertically to meet demand. And we have the active series limiting: we receive alerts when, let's say, 70% of the active series limit is reached, and then we can go and have a look and check the resources. Usually that's enough of a prompt before things actually catch fire.

Great. Do we have time for one quick question? If there's any, one more? OK. Thank you. Thank you very much.
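To put numbers on that rule of thumb and the 70% early warning (all values illustrative): at 160,000 series per gigabyte, a tenant with a 2,000,000-series limit implies roughly 2,000,000 / 160,000 ≈ 12.5 GB of receiver RAM, and the early warning could be a Prometheus rule along these lines. The per-tenant series query is the same one the Thanos limits documentation uses; the limit value and threshold are made up.

```yaml
groups:
  - name: receive-capacity
    rules:
      - alert: TenantNearSeriesLimit
        # 70% of a hypothetical 2,000,000-series limit
        expr: sum by (tenant) (prometheus_tsdb_head_series) > 0.7 * 2000000
        for: 15m
        annotations:
          summary: 'Tenant {{ $labels.tenant }} is above 70% of its active series limit'
```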