Hello, everyone. Welcome to Cloud Native Live, where we dive into the code behind the cloud native world. Hello, everyone, and welcome on my behalf as well. I'm Annie Talvasto, a CNCF ambassador as well as a product marketing manager at CAST AI. Very happy to be here, and very happy to be here with an amazing topic and presenters as well. Every week, we bring a new set of presenters to showcase how to work with cloud native technologies. They will build things, they will break things, and they will answer your questions live here today as well. So you can join us every Wednesday at this time. This week, we have presenters from OpsCruise. We have Cesar and Alok here with us today to talk about next generation observability using open source monitoring. But before we get to the topic, I want to remind you to join in for KubeCon + CloudNativeCon North America, virtual, October 11th to 15th, to hear the latest from the cloud native community. That's already next week, so now is high time to get your tickets if you haven't yet, and see you there as well. This is an official live stream of CNCF and, as such, is subject to the CNCF Code of Conduct, so please do not add anything to the chat or questions that would be in violation of that Code of Conduct. Basically, please be respectful of all your fellow participants and presenters. So with those housekeeping items done, I'll hand it over to the presenters, who have amazing content to share today, and I'll stay on to moderate the Q&A. But yeah, go ahead and kick it off.

Thanks a lot, Annie, for introducing us and for the opportunity to present to CNCF. My name is Alok, founder and CTO of OpsCruise, and my colleague Cesar is our principal solutions architect. We'll do the presentation and the demo today. The topic today is one that's always been of interest: how do you get observability for cloud native applications? And specifically, the question that I think is always on everyone's mind, given where we are, is how do we leverage all the technologies, especially the monitoring instrumentation coming out of CNCF and open source, to achieve that. So I'm going to set this up with a short introduction, state the problem, and give you our philosophy and approach to solving it, leveraging CNCF and open source technology as an example of what all of us can do as we move to cloud native or are actively running applications in production. So with that, let me start by sharing my screen and kick this off. Let's take a second here. Hopefully you can see this, right?

Perfect. We can see it very well.

So as we said, we are talking about the next generation of observability using open source monitoring. I'll just go past the legal slide. Let's start with what exactly the observability problem is. And specifically, let's talk about what cloud native is and what new sets of challenges cloud native applications create. There are a couple of different factors. Most of you are aware of this, but it's worth highlighting and pointing them out. Probably the number one thing is obfuscation, if you will. Managed microservices and cloud native services running in the application have dependencies all the way down to the infrastructure-as-a-service and platform-as-a-service layers. But now we have Kubernetes in the middle, so there is some obfuscation: you don't see those dependencies.
So that's one, and it also creates what we call multiple points of performance loss. A service can be used by multiple services, even as it's being brokered and allocated by Kubernetes or by the cloud vendor. So that creates one problem. The second problem, related to this, is dependencies. And the dependencies are at two levels. We talked about applications down to infrastructure and platform services. The other is dependencies across the services themselves. And this is because you have a very large number of objects. You could have thousands of microservices talking to each other. You have long-chain dependencies, and it's not obvious when those happen and how they may impact each other. After all, cloud native applications are fundamentally complex distributed systems. And then of course, it's not just Kubernetes-managed services; you have SaaS as well as APIs. What makes it even more challenging is the dynamism. What was great for agility is that we could add, remove, and change services and autoscale. But that means the structure and those dependencies are dynamically changing. So this creates significant visibility, or rather observability, challenges in knowing what is really going on at any time. And even setting that aside, the load is changing as well. So it's highly complex. The good news is, and this is where CNCF and open source instrumentation and all the monitoring come into play, we have data at almost every level. But if we just look at the scale and complexity, the amount of data you have, understanding what is happening becomes extremely hard. So the real challenge has become the scale and the complexity, and trying to get to the right insights, right?

So what do we need? This is where we start talking about what's required if you think about the problem in cloud native terms. One of the first things: if you talk to subject matter experts, they usually understand what is happening because they know those dependencies, those service interactions, and the dependencies across each other, even as they change dynamically. So we need to be able to extract that automatically, because at this scale we can't do it manually. So that's one: capture structure and dependencies dynamically. Second, if you want to understand what those dependencies mean, you also need to understand what those applications do. A database works differently, and does different things, than a queuing system or, let's say, a load balancer, right? This means that when you're looking at them, you can apply knowledge of typical IT operations. Shared services can have issues like noisy neighbors. Kubernetes can restart applications, and a pod has to be ready before traffic flows. There are allocation limits. All of this is the lens a subject matter expert uses, and we need to embed that when we look at the application. Third, what is the current state? If it's dynamic, we need to understand, for every component, what is going in, what's coming out, what the workload is, what resources it's using, what services it's calling. At the end of it, we really need anyone on the DevOps team, the SRE teams, even the application owners, to understand what is happening in the application: what the expected behavior of each component is and how they interact. Because only then do we know what to expect, and know when there's a problem. So then the question is, how do we get the data needed to build this application understanding? Okay, this is why we embrace CNCF and open source.
What we essentially have to do, and what we are doing, is build an analytics layer that processes the information. And it's not just about metrics. It's about traces. It's about flows between services. It's about what's happening with changes in configuration, in Kubernetes as well as in the cloud. It's about logs that provide this information. And if you look across the landscape today, think about OpenTelemetry: all of these are available now as open source instrumentation. You don't need proprietary agents anymore. You can deploy Prometheus. You can deploy Grafana Loki. You can collect logs. You can get traces with standards like Jaeger, metrics with standards like Prometheus, flows with eBPF, and so on and so forth. So our thesis, and a strong belief, is: embrace open source and CNCF, and leverage this information to do what we need to do, which is to understand contextually what's going on in the application, processing the data to get a real-time understanding of it. So what we're going to show you today in the demo is how we take the data coming in from this open source monitoring and essentially build out that structure, what we call the application graph. And as we understand the interactions and dependencies of each component that comprises the application, we build out a behavior model. Now, this behavior model is not simple. We can't pre-define which metrics to choose to build it. In fact, we don't make any assumptions. Whether it's a generic container or a known one like a database, or even, let's say, a queuing system like Kafka, we use all of the information to learn what matters and what influences behavior at any time, so we can predictably understand what to expect from that component. That will tell us, through deviations, if there's a problem. In fact, we want to do this in situ, while the application is running: observe it, understand the behavior across all the applications, and use it to detect deviations that indicate problems, and in fact emerging problems. And then, because we know the structure and we understand the behavior, how components are supposed to interact, and how Kubernetes plays a role, we do a global dependency analysis. That means checking the configuration, checking for changes, cross-checking with the events and the logs, but also looking at what the expected metrics are, because that helps you isolate the problem domains and isolate the faults. If we can pull this off, we reduce the search space the ops teams are looking at and the effort it takes to resolve problems. We can also pull in traces on top of the flows to add more granularity and visibility. So that, in a nutshell, is the approach we are taking. Think of it as real-time telemetry processing across all of that. The idea is to provide actionable insights and, if you will, the actions you can take to correct problems as you see them. And of course, the best way to do this is to show it in a demo, which Cesar is going to do. So Cesar, I'm going to pass this over to you; take it from there.

Absolutely. Thank you. Thank you, Alok. You know, that's a really important slide you have up there, and we're actually going to jump back to it in just a second. I'm going to share my screen. Perfect.
And a reminder to everyone listening as well: leave your questions so we can do Q&A at the end or throughout the session. So leave them in the chat and we will get to them.

Awesome. So this is the OpsCruise landing page. Now, before we jump into a lot of these things, what I'm actually going to do is deploy OpsCruise into a cluster while we're here, and I just want to show the simplicity of deploying not just OpsCruise, but also all the underlying tools Alok was mentioning, you know, Loki and Prometheus, et cetera. All of those are just a couple of commands away. So this is already a running cluster, but I'm actually going to deploy into a separate cluster that I have here, so we're just going to switch to that really quickly. I'll clear my screen, and first we'll run kubectl get nodes. This is just a single-node cluster; the point here is really to show the deployment. It's an EKS cluster that I've got running. Next I'm going to show the pods we have running, and it's very bare bones: just the AWS node components, CoreDNS, and kube-proxy. So if I switch over to my VS Code, you'll see we have a few commands, right? We're going to add a couple of repos and run a Helm repo update. By the way, Helm is our preferred choice for deploying most of these tools; it makes everything easy. It's the Kubernetes package manager — definitely check it out if you haven't used it. And then we have a couple of commands: one Helm upgrade --install for the OpsCruise components, and then a Helm upgrade --install for the actual underlying open source tools. Again, Prometheus is going to get deployed, as well as node exporter and kube-state-metrics. We'll talk a little more about the architecture right after this. And then we're also going to deploy Loki itself. So I'm just going to copy all of this, paste it into my terminal, and give that a go. And this is just giving a status: it's successfully updated the repositories. The OpsCruise gateway release is being deployed; it found that it doesn't exist yet, so it installs it fresh, and that's been successfully deployed. Now we're also deploying the underlying open source pieces. And then finally, Loki is being deployed as well. And we're all set. So with that kicked off, I'll look at that cluster again, and we just see the pods starting to come up now. We'll check in with that cluster in a little bit, but right now we're going to go back to the existing cluster. Actually, before that, I am going to bring back the screen Alok was just sharing, because I do want to talk a little bit about the architecture. So give me one sec, we'll bring that up. In the meantime, are there any questions, especially on the deployment?

Yeah, no questions so far, but usually people think about them and then ask at the end, so I'm looking forward to all of them. And if not from the audience, I will have plenty of questions myself; I'm not worried.

Fair enough. So I do want to share that last screen Alok was sharing, which is our architecture. We just deployed, but what is it that we actually deployed? As Alok was mentioning, the whole purpose of these tools is to observe, and to observe intelligently and easily, without the need for heavy, typical proprietary agents.
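For reference, here is a minimal sketch of the deployment flow Cesar describes, assuming the public prometheus-community and grafana chart repositories; the OpsCruise repository URL and chart name are placeholders for illustration, not the actual commands from the demo:

# Add the chart repos (OpsCruise URL is a hypothetical placeholder)
helm repo add opscruise https://example.com/opscruise-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# OpsCruise gateways (chart name assumed)
helm upgrade --install opscruise opscruise/opscruise -n opscruise --create-namespace

# Open source collection stack: this chart also pulls in node exporter
# and kube-state-metrics as subcharts by default
helm upgrade --install prometheus prometheus-community/prometheus -n collectors --create-namespace

# Loki plus Promtail for log collection
helm upgrade --install loki grafana/loki-stack --set promtail.enabled=true -n collectors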
I think the industry has really standardized on a subset of tools, a lot of them from the CNCF, and that, again, is what we leverage. Really, the monitoring layer — or I should say the data collection layer for monitoring — is standardizing and becoming really easy to ingest from. There's not always a need to go with heavy proprietary tools, so we're leveraging that. In this example of the architecture, this is a five-node Kubernetes cluster. Across the top are just your workloads, right? Whatever your applications are: you might be running Node.js or NGINX or MongoDB inside Kubernetes; whatever you're running, that's across the top. And then in the next two layers, in these reddish and blue colors, you have the open source components. You have Prometheus as well as Loki for metrics and logs. And in the blue, you have the exporters and collectors. So, for example, we leverage node exporter for node-level metrics and cAdvisor for container-level metrics. And it's of course important to not only look at the container itself, right? This is why we use both: you don't only need container metrics, but also that whole infrastructure layer of the actual Kubernetes nodes that are running the workloads. You need visibility into both. There's also the really cool kube-state-metrics exporter, which gives you the state of the objects inside of Kubernetes. All of those exporters, of course, feed into Prometheus, which makes the data collection really, really simple. Promtail, itself a component of the Loki stack, is going to grab all the logs from the actual workloads that are running — all the container logs, pod logs, and node logs as well — and feed them into Loki. So those are the open source components we're leveraging here, and this is what we just deployed with those commands you saw. And then, of course, you've got the actual underlying pieces. We've simplified it here, but the actual pieces of the cluster are running inside Linux nodes; you've of course got the kubelet running on each one of those nodes, and then you've got a Kubernetes API instance. So what we're also doing is collecting, like we mentioned, events; it's super important to have events, and super important to have the objects that are inside your Kubernetes cluster. So we will query the Kubernetes API directly to do discovery of those objects as well as event collection. So all the Kubernetes events, whether it's replica sets that are scaling, failures to pull an image, failures to schedule a pod onto a node — all different types of Kubernetes events, we'll grab all of them. And the other thing I mentioned earlier was the gateways, right? Because all this data is already collected, we need a way to grab it and feed it out to OpsCruise. So what we do is we have these super lightweight singleton pods that you see here. It's basically one pod per telemetry type. You have the metrics gateway here, which is going to leverage Prometheus's remote write capabilities: Prometheus will write metrics out. We'll also grab the Kubernetes objects using the Kubernetes gateway. And the cloud gateway is going to reach into your cloud.
Whether you're using EKS or a GKE cluster, whatever that variant is, we will go in, because you need insights into things beyond the cluster itself — that's a great starting point, but you also need insights into the other services that are tangential to your cluster, right? Things like load balancers that are handling the connections coming into your cluster. Things like maybe RDS instances that you're using, let's say on AWS: those cloud databases you're calling out to from your cluster. It's important to highlight those and be able to observe them in context as well. And again, there's the log gateway, as well as Jaeger for tracing. All of those are just super lightweight pods that package data up and send it off to OpsCruise.

Cesar, I just want to emphasize a couple of things so the audience realizes: we are not in the data plane. We are only in the monitoring plane, because we're sitting on the host, not touching the containers, not putting in sidecars, and we don't have to touch the application code. And this is, again, leveraging what anyone who's running production can deploy. All we have to do is collect from these open collectors. That's all, with the minimum amount of touch, and that simplifies the deployment process and the data collection process. Also, the data stays there. We don't store the data away and lock it up; it is all open access for everyone.

Yeah, exactly. I think one of the big things — I mean, we are talking about a mixture of OpsCruise as well as the tools themselves that things like the CNCF have enabled to exist: Prometheus, et cetera. Even though we are talking about OpsCruise, OpsCruise is pluggable. The fact is that with these modern architectures for observability tools, including OpsCruise, but even if you don't have this layer there, all this data is still available, and that's the important piece. As I was mentioning, the ease of use and commoditization of the actual collection tools has made this really impressive, because your data is there. It's so easy, and you're not tied to a specific vendor. You're not tied to some sort of proprietary implementation. All these open source tools allow you to collect and keep that data and leverage it as needed: for business analytics, in this case observability, or for whatever else — capacity planning, et cetera.

So users can use all of this data with or without OpsCruise. They can build on it, and we are just going to help them get the insights they need.

Absolutely. So I'm going to jump back in; I'll just move this out of the way.

Do you see if the deployment's ready?

Yeah, I want to make sure this is up and running, and it looked like it came up pretty fast, but I will check again. So all of these are running — and I'm going to show this to you inside the cluster as well — but you'll notice we have a couple of different namespaces. We have the collectors namespace and we have the actual OpsCruise gateways, and you'll notice things like cAdvisor, Loki, Loki's Promtail, then kube-state-metrics here. The Prometheus instance is up and running as well. Here's node exporter, and then our gateways are there as well. So let me jump into that cluster itself.
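A quick sketch of the same verification step from the command line, assuming the namespace names used in the sketch above (the demo's actual namespaces may differ):

kubectl get nodes
kubectl get pods -n collectors   # Prometheus, node exporter, kube-state-metrics, cAdvisor, Loki, Promtail
kubectl get pods -n opscruise    # the OpsCruise telemetry gateways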
This is actually our demo cluster, and we're going to jump into some more of the cool things we're doing with those open source tools and all the data we're getting, but this is just the cluster itself. Again, I really just wanted to show that deployment. I'll refresh my screen here to make sure everything's up to date. And here is our actual deployment that we were just looking at in the command line. So you'll see — again, it was a single-node cluster inside EKS — you'll see that node, and you'll see these components: you have Loki, you have Promtail, you have the OpsCruise gateways, node exporter, et cetera. We're now building this really interesting view that I'll show details on, all within a couple of minutes, right? Just while we reviewed the architecture, this was all done out of the box. So again, super cool that we're able to leverage the open source tools for grabbing all of this. So now I'm actually going to jump into the main demo cluster so we can see some more detail. Again, this is our view, and what you're seeing here is a lot of data being represented — not, I should say, collected from Jaeger: if you're familiar with eBPF, we're leveraging the Linux kernel's eBPF capabilities to actually build this view where you're seeing data flow across. We do support tracing, and there are a lot of really awesome cases for tracing, but there are also a lot of cases where you might not want to do tracing, maybe to avoid overhead, et cetera. So the eBPF capabilities of the Linux kernel really allow you to build this kind of topology and structure view without forcing the need for tracing. I think that's one of the really cool things that modern implementations have allowed us to do. Now, before going into all the details, I just want to show one more time the underlying pieces in a slightly more complex cluster. You'll notice there are a bunch of filters across the top. We're leveraging, again, the open source tools, which make this really easy, because when you deploy these tools, they're sending things like labels off to Prometheus, and we can stitch those labels together and make this data really easy to ingest. So now we can leverage the filters attached to your workloads across different entities: you can filter by app or by namespace, and all of this is just incredibly easy to do now with these modern tools. I've built a view of just the underlying pieces using these filters, so I'm going to apply this data-collection-layer filter, which shows us the OpsCruise components as well as the open source tools. So in this case, we have a five-node cluster here. These are the five nodes and some of their data. But again, a lot of these run as daemon sets: if I zoom in a little on node exporter, you'll see five instances of node exporter, five instances of cAdvisor; kube-state-metrics is a singleton. So you'll see all of this, and how Prometheus is actually going out and scraping the metrics from those endpoints. You'll notice those arrows going outbound, because that's the way the traffic is flowing. And then you'll see the OpsCruise gateways. Like you see here, Prometheus, as I mentioned, is leveraging its remote write capabilities to feed the metrics gateway for OpsCruise. And then you'll see here on this left side the Loki components.
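The remote write wiring described here boils down to a small piece of Prometheus configuration. A hedged sketch follows: remote_write and queue_config are standard Prometheus config fields, but the gateway service name, port, and path are assumptions for illustration (in the Helm-managed deployment above, this would live in the chart's values rather than a hand-edited file):

# Append a remote_write target so Prometheus forwards scraped samples
# to the metrics gateway (endpoint is hypothetical)
cat <<'EOF' >> prometheus.yml
remote_write:
  - url: http://opscruise-metrics-gateway.opscruise.svc:8080/api/v1/write
    queue_config:
      max_samples_per_send: 1000   # batch size per send
EOF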
So these are the actual pieces running inside the cluster, which is where we're getting all that data. You'll see these are posting out to our Amazon instances, which is where we're housing this particular demo. So I'm going to go back, clear that filter, and start showing some of the really cool things the underlying tool sets allow us to do. Again, we're leveraging eBPF to show you this view of the flow between the different components. You'll notice the different pods, for example. And you'll also see, as I mentioned, it's important to have a view into other pieces your infrastructure is touching, not exclusively Kubernetes, right? So, since we're running in AWS, you'll see things like this elastic load balancer, and we can actually click into it. This is what we call our quick view. If you click into the actual load balancer, you get data relevant to a load balancer: its DNS name, the IP addresses — both private and public IP addresses, I should have said — and the ports that are exposed, along with metrics. Now, this is a metric snapshot; we can look at some metrics in a bit, and I'll show you how the context for that works. But again, all of this is being pulled from the underlying source, in this case CloudWatch. And the same goes for all the entities. So here we have actual pods. Let me pick something that's maybe a little more interesting, like an NGINX pod. If I click on the NGINX pod, again, you get all the data from the underlying tools. If I hover on this, you can see connection data and architecture data, things like the performance we're seeing: this NGINX controller is calling out to the NGINX service with a response time of 57.23 milliseconds, on port 30000. So architecture validation is, again, super easy because of the data we're collecting from those tools. And then if I click on, for example, the pod itself, it brings me into that quick view again, which you'll notice is pervasive throughout the platform. And again, we're leveraging the native data from Kubernetes: the label that was attached to this pod in the manifest is automatically picked up. As you saw earlier, the view I built was based in part on some of these labels and namespaces. But why is it important to have all this data — things like the namespace, the IP address, the start time? All these things matter, and I can give you, just off the top of my head, a lot of different examples. It's important to have the start time to make sure that the latest ConfigMap you applied is actually in place, right? If you know you applied a ConfigMap on October 6th, which is today, and you see that the start time was February 16th, you know that ConfigMap is not in use, because the pod itself has not bounced. Then there are namespaces. A lot of the companies we work with have giant clusters: they might have 50- or 60-node clusters, even hundreds of nodes in a cluster. And that might be a single cluster across the entire enterprise just for their non-prod. So what happens there? You have, let's say, seven instances: prod and pre-prod and stress and QA1, QA2, QA5, and you have all these individual instances.
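A couple of kubectl one-liners for the checks just described — the start-time-versus-ConfigMap comparison, and slicing a shared cluster by namespace; the pod, ConfigMap, and namespace names are placeholders:

# When did this pod last start? Compare against when the ConfigMap was applied.
kubectl -n my-app get pod my-pod -o jsonpath='{.status.startTime}'
kubectl -n my-app describe configmap my-config

# Look at just one environment's slice of a big shared cluster
kubectl get pods -n qa2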
Well, how are you going to determine which slice of the application you're looking at? Usually that's going to be segmented by namespace. So it's important to be able to not only see that, but also leverage filtering on it inside your observability platform. The other thing is, of course, the context that Alok was mentioning. It's really important to have context, and I'll actually click on a container to get a little richer data. One of the really cool things we do is stitch all this data together. Again, the facility that's available to us — because the underlying tools do such a good job of sending over labels, et cetera — lets us stitch this data together richly, and that allows us to do some really cool things, like contextually giving you access to things like metrics. So if I'm looking at this ingress controller container, as you see in the upper left corner, I can click on metrics. And of course it's important to have metrics, because you need to know what's happening with your workloads, right? What does my CPU look like? As you can see, CPU utilization is really, really low here, around 12%. So you might be oversized on the sizing for the actual pod; you might be able to get away with reducing your resource limits. We have a view on that, and I'll show you some of what we do with all this data, how we make recommendations and highlight pieces that could use higher or lower resource settings, but we'll jump to that in a sec. Again, you get whatever data is available for this particular entity. Now here's its memory utilization, and the one thing I'm seeing just by looking at this is that this pod is really oversized. I'll go back here, and there are things like events — I won't jump into every single one of these in the interest of time — but Kubernetes events, logs; I think logs is an important one, so let me click on logs. Again, we're looking at this NGINX controller, and now we're straight into the logs for it. The context is important. Now, I'm just giving you a sneak peek of what is actually under the hood; in reality, while it is kind of fun to go in and explore all of this stuff, the ML is really what brings a lot of this together. But I just want to show you what's underneath. So again, you'll see some of the links. Connections is just a table view of what this is talking to, or what's talking to it. You'll notice by this little arrow that the elastic load balancer is upstream: the elastic load balancer is calling ingress NGINX. And you can see performance data around that, coming in from eBPF and from Prometheus: bytes in, bytes out, average latency. Now, I'll jump into the ML in just a bit and show the real magic behind that, but I want to show a couple of other views first. So let's go into the node view. This is just another view of the underlying data. It's important to be able to see what's running where, right? You might have a particular host that's problematic, and you might want to know what pods are running on top of that host. So again, I mentioned we have five nodes: one, two, three, four, and the fifth node all the way here on the right.
So here we're breaking up these views into basically doing a kubectl get pods with a filter on the individual nodes, but showing them all at the same time. So you can see the workloads: you'll see cAdvisor running on here, CoreDNS. And you can also see data related to the node itself. So if you want to check the config — maybe everything is supposed to be moving off the Docker runtime and onto the CRI-O runtime, for example — you can check the actual config of the node itself and get details: you can see that this one is still on the Docker container runtime, but also things like the kubelet version, the kernel version you're on, and the operating system image. And just like we saw for the pods themselves, it's important to have the node-level metrics. So this is again a high-level snapshot of the metrics for the node, but you can jump into it, and — just as important, you can see here a timeframe selector — you can understand how your nodes were behaving at any point in time; maybe they had some sort of spike, et cetera. Now, we have alerts that will automatically notify you of that, but it's great to be able to dig in and explore at will. So before actually jumping out, I do want to show the balancing view, because I didn't mention it. In our balancing view, we'll show you how many resources an individual pod is consuming. Uh-oh, it's spinning, hold on, let me refresh my screen. I don't think I made a sacrifice to the demo gods today.

Is the balancing not coming up?

No, it's just come up, I think it was just my... Well, there you are, yeah. So we'll show you resource data for CPU, memory, and disk. For example, let's just pick on cAdvisor: we have the cAdvisor pod, and you can see that cAdvisor has no request and no limit set for it. So that might be something to explore. We can also see that it's best-effort, and the current CPU utilization is 195 millicores, while the average is about 124 and the max is 220. So this helps when you're optimizing cluster workloads and trying to understand them.

Cesar, can you also show them who's hogging most of the memory on average? Because that's always an interesting one.

Absolutely. So what we can do is sort by current, and we can see that the load balancer heartbeat is actually, surprisingly, consuming a lot of memory. It looks like it's consuming 139.

Most users may not know that, and it's misallocated.

You know what, this is actually perfect, because I am extremely interested in the fact that the load balancer heartbeat pod itself is consuming this much memory. So this is actually a really great find.

This is where you have to crack things open to understand what's really happening under the covers. Kubernetes does a nice job hiding all that, but it can cause problems.

Yeah, so again, this view really is specifically for that: being able to optimize workloads and make sure your limits are properly set. You might have something crashing, and we'll alert on that, but when you're looking to proactively go in and identify things, this view is perfect. I mean, it works both ways, right?
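For the record, hand-rolled equivalents of the node and balancing views just shown; the node and pod names are placeholders, and kubectl top requires metrics-server to be installed in the cluster:

# What runs on one particular node (node name is a placeholder)
kubectl get pods -A --field-selector spec.nodeName=ip-10-0-1-23.ec2.internal

# Node config: container runtime, kubelet version, kernel, OS image
kubectl get nodes -o wide

# QoS class of a pod, e.g. BestEffort when no requests/limits are set
kubectl -n collectors get pod cadvisor-abc12 -o jsonpath='{.status.qosClass}'

# Who is hogging memory right now, sorted descending
kubectl top pod -A --sort-by=memory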
Either you're undersubscribed and you may have evictions, or you're oversubscribed, over-provisioned, and then you're wasting a lot of resources, auto-scaling out without realizing that's not the right fix.

Yeah, especially when you're running on cloud. Cloud is great, but cloud is not free.

That's right, that's right. You'd be surprised how many users don't do that. So again, modern applications: we know that Kubernetes has won the orchestration battle, so tons of modern applications are, of course, running inside Kubernetes. So based on the Kubernetes API data and some of the other open source tools, we built this view of Kubernetes resources, so you can see things like the deployments, replica sets, and daemon sets in your cluster. This is a nice map, but it's also clickable. For example, I can look at pods, so I'll click on pods. And now I'm looking at — again, the magic of those labels that are automatically collected by the underlying tools to allow the building of these views; you've heard me mention it two or three times at this point, but it really is almost magic, right? Being able to grab this data around namespaces: how many pods are running in those namespaces, any failed or anomalous pods — anything with an auto-detected anomaly will show up here. You'll see the distribution of your workloads. And then you also have the search bar, so if you want to find something specific, like Prometheus, I can do this search and now I'm looking at my Prometheus pod; but I'll take away that filter. And you have a tabular list of all of your workloads: I can see the namespace they're part of, their IDs, their status, the host they're attached to, and the IP addresses of the pods themselves, along with the labels attached to them, whether they're part of a replica set or a daemon set, quality-of-service data, and the created time and start time. And again, we come back to this quick view, built with the data from all those different tools, and you can look at all the labels and metadata and you're back into that. There are other views; they're very similar. But what I want to highlight is the richness of the data and the contextuality that comes from having all these tools — all the metrics, all the config, all the events — stitched together, and that richness being provided in the context of all these views. So the next thing I want to jump into is the alert view. And Alok, I think this is a lot more pertinent to some of the stuff you were talking about, so please feel free to chime in. With all this data that we are now receiving, once we stitch it all together, the one thing we're hoping to drive home is the smart layer, as Alok showed in the slide — the smart layer on top of all these tools. Because it is important to have metrics, it's important to have traces, it's important to have network data and config data and change data; but what you do with that data — that's the real challenge I think we're all facing today. The data is in many cases siloed. You might be using proprietary tools for one piece of data and open source tools for another. Or even if you're fully open source, you might be looking inside Prometheus for one thing and directly inside Loki for another.
You might be going directly to the Kubernetes command line for other things. So that is one of the challenges we're trying to solve — I think everybody's trying to solve that issue. That context switching — if you're into brain science at all, you'll know that context switching is a big problem, a big drain on us. So that's one of the things we're trying to avoid. It wastes resource time, it makes your teams less effective, it increases outage duration, which in many cases means lost money and lost opportunities. In a healthcare environment it could mean losing important time to care for patients. There are a myriad of things that could be affected, but the point is: you need to be more effective. You need to have this context so as not to waste time and cycles. So this is really what this screen represents. It's the culmination of all that data we were just showing, combined with the smart layer Alok was talking about. Now, we have the ability to set thresholds just like any tool, right? You can set thresholds directly with Alertmanager inside of Prometheus and create alerts on them, and we have that capability too — I'll actually jump out here for a sec to show it. But that's not the philosophy we want to go with. I mean, if you really want to, you can come in here, select a metric, and apply a threshold. You can say, all right, create an alert if I'm over 200 milliseconds. And for workloads that have been running, we'll even provide automatic threshold suggestions. So this is the current max, et cetera; the suggested threshold here is 0.35, because in this case the CPU doesn't go over that. This is a little bit of our ML in play, but this is not what we want to do. There are much more significant things we can do by really stitching all this data together with the behavioral models that are created from it. So let me find an alert. I think we were...

So while you're finding it, Cesar, a comment worth making. The reason you cannot rely on a fixed set of thresholds on a fixed set of metrics is that you're making an assumption about what that container or that service is doing. And you may not have seen it before; you may not have tested it locally or on the cloud or wherever. So instead of trying to guess, trying to optimize and tune it, instead of waiting for it to, let's say, hit a saturation limit, it's better to find out when the problem is actually happening and detect that. This is where you want the intelligence to understand what the behavior is — is it working correctly? — as opposed to waiting until it hits a limit and keels over and dies. That's the whole idea. Otherwise, you'll just be tuning thresholds across all the services and guessing at which metrics to pick, and with a container you have 50 metrics to choose from.

Exactly, and that's a real challenge. You and I were talking about this last night: especially with the ability to scale out that Kubernetes provides through replica sets, workloads are running in much tighter windows than they used to. So it's a lot harder to set thresholds nowadays, because workloads might be running in just a really small window. Everybody wants to maximize their resources; some might be running at capacity.
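For concreteness, this is the kind of fixed-threshold rule Cesar describes setting by hand (and that the speakers argue against relying on), written as a plain Prometheus rule file; the metric and job names are illustrative:

cat <<'EOF' > static-threshold.rules.yml
groups:
  - name: static-thresholds
    rules:
      - alert: HighResponseTime
        expr: http_request_duration_seconds{job="nginx-ingress"} > 0.2  # fixed 200 ms threshold
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Response time above a fixed 200 ms threshold"
EOF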
So when real deviations occur, a machine is going to be able to find those deviations much more easily than a human. You're going to have one of your ops folks literally scouring through dashboards and finding: all right, sometimes it's at 91, sometimes it's at 89, so maybe we should set it at 91%. And that's the thing: the machine is going to do a much better job. That's just my opinion.

Actually, let me double down on that. This is where I think a lot of people realize what has changed in Kubernetes with autoscaling. Let's say you decided that CPU at 85% is what you're worried about, so you set a threshold. But demand increases and you can autoscale, which means when you hit 85%, you increase the number of replicas. That's not a problem; it's by design. So why are you alerting, when I know I can autoscale? You're going to create all these false alerts every time you autoscale and scale back. The container is behaving fine, the application is behaving fine; it's just increased demand and increased resource usage. Alerting just because I hit the limit and I'm about to autoscale helps no one — that's a false alert. But if the application wasn't behaving correctly, and it wasn't supposed to hit 85%, that's something I'm worried about. How do you detect that? That's the key, and we can't do this by hand across hundreds and thousands of containers.

Okay, so yeah, thanks for that, Alok. So this is an example of one of our alerts. Now the machine learning has done its work, but the work it's done is based on the data being provided by Prometheus, and on top of that we're bringing in things like logs from Loki for context, and any events that are happening. So this is a whole rich construct — a really rich object model, essentially — that is being built by bringing all those open source tools together. I'll highlight just a couple of things; there's a lot of info on the screen. The important part is that, for starters, we're seeing that some metrics are not normal in this particular cartcast container inside this particular pod. We're looking at things like the namespace itself, shopping cart. And here are some of the metrics that are violated. I won't go into these here, because we have a more interesting view of the actual violated metrics, but I do want to call out what Alok was just mentioning about not being able to know in advance. In fact, you might set alerts on some metrics, but I've been in monitoring for something like ten years, and I don't recall ever seeing somebody go in and set a threshold on container file system reads, right? Or some of these more esoteric, less well-known indicators of performance. And that's a shame, but again, this is something you don't have to do, because the ML is going to do it for you. You're getting this data from node exporter, you're getting this data from cAdvisor, so —

And eBPF.

And eBPF, correct. You might as well actually leverage it, as opposed to just collecting that data and then not doing anything with it, because nobody actually knows what to do with it, right? So, yeah, this is an example of all the metrics that were taken into consideration by the ML to trigger this particular alert, but we'll go into the more interesting view of the analysis.
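Alok's autoscaling scenario above maps to a single kubectl command: a horizontal pod autoscaler targeting 85% CPU, where hitting 85% triggers a scale-out by design rather than a fault, so a fixed 85% CPU alert would fire on every normal scaling event. The deployment name is a placeholder:

kubectl autoscale deployment cartcast --cpu-percent=85 --min=2 --max=10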
And I'm actually just going to give you a little taste, and then we're going to jump out of this, because it links to a larger piece. You'll notice we're highlighting some of the issues. We have what we call our fishbone RCA, which shows you the different categories of metrics and configurations that are probably important to this scenario. File system is not involved, so it's grayed out; gray and green mean good or not involved. But I'll highlight the categories: configurations have changed, the supply side of the workload is having issues — that's why some of these are red — the demand side of the workload is having issues, and so is CPU. We'll jump back to this; I just wanted to give you a taste of an anomaly auto-detected by the ML, with all that underlying data we've collected. But let me step back for a second to show something. Here we have another alert saying that we have an SLO breach on response time for this particular service, the ingress service. It's saying we have an SLO value of 2,500 milliseconds and that we're actually responding at over 3,500 milliseconds — three and a half seconds of response time. That is of course important, right? Somebody has deemed this SLO important enough to set, and now it's being violated. So what our ML does is look up the stack, down the stack, upstream, and downstream to identify where there are actual issues that could be affecting things and causing this particular SLO violation. So even just visually, we can see there are some clear problem areas, here in red and here in red. And again, no work is being done except behind the scenes by the ML; I'll highlight what these red pieces mean in a sec. But I just want to click on this NGINX, and we get a tabular view — again, an amalgamation and integration of multiple data sources. You have the actual SLO violation, saying it's violating by over 41%. Because of eBPF, we can see the flow of the requests, and we can see which is the highest-latency path: in this case it goes from NGINX to the web server to cartcast to the cart server to the DB server, which you're seeing back here. And our ML has also learned what's normal: the expected behavior for cartcast and for the DB server is being exceeded — we're at over a second and over 2.4 seconds, respectively. We know this is not some sort of increased-request issue, because we're also bringing in data on the URL request count, and that's actually going down. So it's not an increased-load problem. So what is it really? Well, we could jump into each of these individual components if we wanted to, but we know we don't have to, and nobody's going to do that, because as operators you're trying to resolve the issue as fast as possible. So what I'm going to do is click on this red, and it takes me back into that alert we were just looking at. What happened — and I'll just go back a sec — is that the ML detected that, for this particular anomaly, that completely discrete anomaly we were looking at earlier, completely separate, is likely a contributor to, or a cause of, your SLO violation, so it brought it in. And now we'll actually go into those violated metrics we were looking at earlier, which, by the way, are charted automatically here.
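The same SLO check, done by hand against Prometheus's HTTP API: p95 response time versus the 2.5-second objective, and the request rate used above to rule out a traffic increase. The metric names follow common ingress-nginx conventions and are assumptions here, as is the Prometheus service address:

# p95 latency over the last 5 minutes
curl -s 'http://prometheus.collectors.svc:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le))'

# request rate, to rule out an increased-load problem
curl -s 'http://prometheus.collectors.svc:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(nginx_ingress_controller_requests[5m]))'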
So you don't just get a message saying "your response time is slow" and then have to go to some other tool and a different dashboard. It's all in here, and it's automatically charted for you. You can see that the response time has increased by close to 2,000%. You can see that the response bytes themselves have increased: the size went from one meg to close to eight megs. The incoming response time for requests coming into this particular cartcast container has increased by 1,500%. CPU utilization has increased by about 50%. But finally, the real piece is the image. This is now leveraging all the open source data and bringing it together to say: you have an image regression. You were running version 0.6 and now you're running version 0.4. So that's really the issue: we have a bad image there, and it's well known that 0.4 is broken. So that's really the cause. I'm out of time, so I'll hand it back to you, Alok.

Yeah, thanks. I want to cap this with a couple of comments and leave time for the folks attending to have Q&A. Look at what we just showed: that response-time change against the SLO came from eBPF, an open source component that's already there; we didn't have to do anything. The metric changes came from Prometheus; eBPF gives us the flows for every container. The configuration change we detect from kube-state-metrics, for example, right? So when we see a change — a log or an event — we can bring that in. We haven't shown the traces; that drill-down is still in the works and will be out shortly. We'll be able to pull in the trace at that moment to see exactly what happened, to confirm even further. But the point is, all of this data is there, freely available. You can install it, you can do this. The whole idea is: why should you be looking at and checking metrics across a thousand containers with 5,000 dashboards of logs and metrics? Observability requires us to quickly isolate the problem and go there. And that's what we are saying: open source monitoring has made available all the data we need for telemetry and changes, even dynamically. Let's embrace that and add this intelligence on top. You can do it; we're just showing you one way to do it and make your life simpler. That was the whole point, right? So if I were to summarize and then open up for questions, I'd emphasize this, because I think it's worth highlighting again: we have all these open CNCF projects — OpenTelemetry is there, metrics, logs, traces, eBPF. And yes, if you have these too, we can also deal with that, by the way. We don't have to touch anything in Kubernetes, even as changes are going on. Just using that, plus this workflow once you have it — that's where the intelligence is, right? That's what we're emphasizing. I'll leave it there and pause for questions.

Great — sorry, excuse me. Really great, amazing presentation, and a very in-depth demo. Always happy to see those, thank you so much. So, as said previously, now is the time to ask questions: you can leave them in the chat, and I'll be helping to moderate the Q&A. But let's get started and kick off with a few of my own questions. I know you mentioned a bit about Prometheus, as well as how you play with other CNCF and open source tooling and how these play into this.
Do you want to expand on how you work with the Prometheus monitoring framework, or did you cover everything you wanted to there already?

I think we did, and if you go back to that screen, you noticed how we deploy Prometheus itself, right? And with that, just like you would today in your own Kubernetes cluster, we run the exporters as daemon sets: enable the cAdvisor metrics, enable node exporter; that's all we are pulling together. What we've added, if you think about it, is that because we needed what we call layer-seven metrics, we leveraged eBPF as well, which is another open standard. That gives us coverage of the networking not only at the bytes-and-packets level, but also request rates and response times at the URL level — essentially the golden signals. So nothing out of the ordinary, as long as we have that coverage. We're using Prometheus as the time series database, leveraging the exporters that easily send that data, or the data is scraped by Prometheus from all those exporters.

So yeah, all the metrics that you see, whether they're network metrics, the cAdvisor container metrics, or node metrics, all of that is being fed into Prometheus, and that's where we're grabbing those metrics from. So it's actually a really key component of the observability layer.

Perfect. We don't have too much time left, but I do want to ask a question, because we covered kind of the latest and greatest of observability and monitoring here. What do you think are going to be the next steps for this scene? Either for OpsCruise — what are the upcoming features — or where will the whole space be moving in the future, and what will be the focus?

Absolutely. By the way, we will be at KubeCon in person next week, really looking forward to that after the last two years; hopefully we'll talk to more folks who are really working in cloud native. One of the things we didn't show is that we are adding tracing: OpenTelemetry with Jaeger. That brings in more. We're adding more and more capability on the causal analysis side, so teams can act and apply fixes. We didn't show, for example, that we are adding coverage for issues related to Kubernetes faults and failures, whether at start time or runtime; and at the application level, we are adding more capabilities around known-application awareness. Like, how does Kafka behave? What should you look for in those metrics? Even those come with Prometheus exporters. So we can drill down deeper to understand a specific problem with Kafka, or a specific problem with an open source database like MySQL. That gives us even more granularity to understand the problems there. And of course, we're starting to add traces.

And I think, in the overarching space, the whole concept of bringing multiple sources of data together is really where the industry is heading, right? And adding that intelligence. OpsCruise is not the only company doing that — although I think we're doing some pretty cool things — but you'll notice that more and more vendors are starting to integrate the tools together, because there's a lot of value to be had from not isolating the individual tools. I think the industry is starting to realize that. So that, and putting smarts into whatever platform you use, is kind of the next phase, the future of observability.

I do want to re-emphasize that.
Look at what has happened in the last three or four years: two years ago Kubernetes became a standard, and now we have the full ecosystem of instrumentation. For anyone going cloud native: embrace the open source monitoring, and then add the intelligence where you really need it. That's where we see the community going, and it's taken a while, but it doesn't matter if you're an enterprise customer with a lot of legacy applications in the cloud or a new startup. So we highly recommend: embrace these open source tools and build on them; that gives you more power and brings more capability into your hands.

Perfect. And perfectly on time as well, as far as the timing of the session and the Q&A goes, so perfectly wrapped there too. So thank you everyone for joining the latest episode of Cloud Native Live. It has been really great to have OpsCruise, Cesar, and Alok talking about next generation observability using open source monitoring. It has been an absolute pleasure, and we're amazed and happy to see so many attendees joining in as well. Thank you so much for tuning in. We will be bringing Cloud Native TV to you every Wednesday going forward, but next week we have KubeCon, and therefore we will have a break, because there's so much going on in the cloud native space next week — no need for our live stream. After that we have a session on supply chain security, so in two weeks, tune in to hear more about that. But thank you so much for joining us today, and see you next week, everyone. And see us at KubeCon.

Yeah. Thanks. Thanks.