Hello, everyone, welcome to Cloud Native Live, where we dive into the code behind cloud native. I'm Taylor Dolezal, a senior developer advocate at HashiCorp, where I focus on all things infrastructure, application delivery, and developer experience. Every week we bring a new set of presenters to showcase how to work with cloud native technologies. They will build things, they will break things, and they will answer your questions. In today's session, Alok and Cesar have joined us to talk about leveraging CNCF observability tools for Kubernetes troubleshooting. This is an official live stream of the CNCF, and as such it is subject to the CNCF Code of Conduct. Please don't add anything to the chat or questions that would violate the Code of Conduct. Basically, please be respectful to all of your fellow participants and presenters. In short, please be excellent to one another. With that, I'd love to hand it over to Alok and Cesar to kick off today's presentation. Thank you, Taylor. So I'm Alok, the founder and CEO of OpsCruise, an observability company built on open source and CNCF telemetry. And I'll also introduce Cesar Quintana, my colleague, who is the principal solutions architect at OpsCruise. Thank you. So the way we thought we would do this is that before we get into the demo itself, the fun part, I'll do a little bit of setup, since you may not have the context of what we do. With that in mind, let me share my screen and set the stage, if you will. Let me find the right one, and let me know if this is coming up. Cool, that looks good. Great. So as Taylor mentioned, we are talking about how to add intelligence to observability now that we have open source monitoring. We'll skip over the standard confidentiality and legal notice. I'll set the stage by revisiting what have become the challenges of cloud native application observability. Fundamentally, this has been happening for a few years now. As applications move to microservice architectures, there are three things that we know have added a lot more challenges for the ops teams managing them. Number one is just complexity: the scale, the tiering, the sheer number of moving parts. Number two is the dependencies. You've got the application pieces, which could be PaaS services, SaaS services, a Kubernetes container running application code, serverless, all of them. Then you have a dependence on Kubernetes itself orchestrating all of these, and on the underlying infrastructure, wherever it might be. These create tiered dependencies, top-down, what we call vertical, as well as across. And this is happening all the time. The third is dynamism. Great, we want to be agile: we add services, change any one component, scale out, scale in; some things get dropped, some things get brought up. Taken together, this is a highly complex distributed system, and just looking at a couple of metrics is no longer sufficient. Things are changing, things are coming up, and you have a zillion dashboards. So there's good news, and there's sort of not-so-good news.
The good news is that, thanks to the CNCF and open source, pretty much every possible kind of telemetry (real-time metrics, logs, events, configuration information, traces, flows) is now available directly from the open source ecosystem. OpenTelemetry is one example, open source that has existed for a while now. That means we have the key pieces to solve the problem. The not-so-good news is that complexity. The role of observability is changing: it's no longer about dashboards and just alerting, it's about how we can help ops teams understand what's happening in real time so they can detect quickly, find the real issues, and get back up and running. It's the same things you've heard: mean time to detect should be fast, don't waste time on false alerts, and get to the root cause, mean time to resolution, right? So what have we learned, having been born cloud native ourselves? What does ops really do? If you think about it, what ops does, beyond gathering all the telemetry, is understand the dependencies. They look at an application with multiple services talking to each other and understand what the interactions are. They know how applications and services are being managed by the orchestration. They know the dependency on infrastructure. Really savvy ops and SRE teams know that. They also know what's changing, and they apply curated knowledge. They know what a publish-subscribe or producer-consumer model is supposed to do versus a database. They know what to look for and which metrics to watch. They are aware of the app and its dependencies. That knowledge is what they use when they look through the data, whether it's the metrics, the logs, or the traces. Fundamentally, when they're chasing something that's happening, they look at every component, the inbound and the outbound, who's talking to whom, and which resources and services everything depends on. They essentially build context to understand what is happening, so they can detect the problem, isolate it, analyze it, and figure out the resolution. So if observability is to be really intelligent, it has to establish that context and understanding, and surface it from all the noise that's coming in, all the data that's sitting there. If it can't do that, then we've actually made the life of a typical DevOps engineer or SRE very difficult. So that's what we want to do. Our thesis is: leverage this data. Look at open source and OpenTelemetry: Prometheus for metrics, log information from Fluentd pulled into something like Grafana Loki, traces via Jaeger or the OpenTracing standards, flows from things like eBPF and Istio, configuration information and changes from Kubernetes, even cloud infrastructure data and PaaS information. All of that is available. Pull it together to build context across all of it, and then reduce the amount of information and focus on the right information that ops needs. This is probably the one key slide before we get into the demo: on the left-hand side, what you're seeing is all the open telemetry, all the open source, that is available today. You don't have to deploy proprietary agents or do proprietary instrumentation, thanks to the CNCF and the open source monitoring that's available.
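To make that concrete, here is a minimal sketch of standing up that kind of collection layer with the community Helm charts. The chart and release names are assumptions for illustration, not the exact setup used in the demo; adjust namespaces and values to your environment.

```shell
# Community chart repositories for metrics, logs, and traces
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

# Prometheus, node-exporter, and kube-state-metrics in one chart
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace observability --create-namespace

# Loki for logs, with Promtail as a DaemonSet tailing every node's containers
helm install logging grafana/loki-stack \
  --set promtail.enabled=true --namespace observability

# An all-in-one Jaeger for traces (fine for a demo, not for production)
helm install tracing jaegertracing/jaeger --namespace observability
```

Each of those charts ships with sensible defaults, which is exactly the commoditization point: the collection layer is a few commands, and the hard part moves to making sense of what it collects.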
So the first thing, as we said, for context, is to understand the structure of the application, what we call the application graph. Imagine being able to build out that structure automatically, in real time, so the ops folks, and even the app developers, don't have to go around trying to figure out what's talking to what, or do something exotic to get there. And that graph has to change dynamically. The graph tells you who's talking to whom, what they depend on, how Kubernetes is managing it and whether it's allocating resources appropriately, and how Kubernetes gets access to the kinds of cloud resources it needs to allocate to the services. Then, to really understand what's going on in that context, pull in the data and build what we call a behavior model. Profile every component that comprises the application to see what is expected. Profiling is kind of like building a very simple model of each component: you collect the data to figure out, across all the metrics, flows, and events, that this component, for example, is I/O-bound, this one is CPU-bound or mixed, this one makes a lot of calls. As data comes in, and this is where ML comes in, you know what to expect, because you have the real-time metrics and you have the application graph. Once you've been learning for a while, in our case within 24 hours, you get a baseline behavior, which gets better over time, and you can start looking at deviations. So you don't have to worry about setting thresholds, and you don't have to guess at thresholds and start tuning them; you don't even know which metric matters or what the level should be. Let the ML model learn and expose that, so we can then analyze it. Again, in context: we know what the drivers are for every service, because we know the application graph. Once we have these deviations, whether they come from explicit alerts like an infrastructure failure that Kubernetes detects, or from an application slowing down or degrading, you put it all in context and analyze it. We essentially do what we call local detection across the application, and then an analysis we call dynamic diagnosis, which looks at everything in context: why would this happen, given what I've seen? Think of it as almost anthropomorphic, what an ops person would do to understand. If we can put all of this in place and automate this pipeline, we've reduced the work ops spends today trying to understand: what does the application look like? Who's talking to whom? When there's a problem, instead of setting thresholds, how do I analyze it? If we can collapse and reduce that, we've really done observability the right service and reached the right level of intelligence. This flow is what we will demo today, using what you're seeing on the left: build context, understand the application graph, understand the behavior to surface detected problems, analyze them in context using all the telemetry we have, including changes, logs, and events, and help isolate the cause. That's the purpose of our demo today. And I'm gonna hand this over to Cesar, because we wanna get to the demo, and he'll tell you exactly how we leverage open source monitoring and CNCF telemetry to do this.
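As a point of comparison, this is roughly what one hand-rolled deviation check looks like if you write it yourself against the cAdvisor metrics in Prometheus: one metric, one service, one guessed threshold. The endpoint and the 3x multiplier are illustrative assumptions; the point of a learned baseline is that nobody has to write and tune rules like this per metric, per service.

```shell
# One hand-tuned "deviation" rule: is transmit traffic more than 3x yesterday's rate?
# (the prometheus:9090 endpoint and the 3x threshold are assumptions for illustration)
curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=
  rate(container_network_transmit_bytes_total{namespace="shopping-cart"}[10m])
    > 3 * rate(container_network_transmit_bytes_total{namespace="shopping-cart"}[10m] offset 1d)
'
```

Multiply that by roughly 30 metrics per container model, times every service, and the rule set becomes unmanageable by hand; that gap is what the behavior model fills.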
So I'm gonna stop sharing. Cesar, please take it away. Actually, Alok, if you could go back to that slide, we'll talk really briefly about the open source platforms we're leveraging. All right, I'll go back; I was a little too hasty there. Is that coming up, Cesar? There it is. All right, let's go. Yeah, so again, everybody, my name is Cesar Quintana, and I'm a principal solutions architect at OpsCruise. To add to what Alok was saying, the whole premise of leveraging these open source platforms is that the data collection layer has essentially been commoditized, right? The data is now easier than ever to access, thanks to these powerful open source platforms, particularly the CNCF ones. So what we set out to do is build something that leverages these amazing tools to make everybody's life easier. This is an example of our architecture, and of how we leverage all this open source data and all these open source platforms. If you focus on that Kubernetes cluster square on the right side, what you'll see across the top in green are your workloads: pod one, two, three, four. These are your own applications, whatever you're running, whether it's an e-commerce site, a financial trading platform, et cetera. This is what's running inside your actual workloads. Underneath, in light and dark blue, are the open source tools that are now so common throughout the IT landscape and in modern application environments. Toward the bottom, in dark blue, this reference architecture shows Jaeger, Prometheus, and Loki. This is just an example; we can take logs from other sources as well. I think somebody asked about Fluentd: it could be Loki, it could be Fluent Bit. We take metrics in from Prometheus, and for traces we're leveraging Jaeger as the backend in this particular architecture, but we support the OpenTelemetry libraries on the client side. That's one of the really cool things about the new standards: they're now well defined, which means you could be using a mixture of OpenZipkin, Jaeger, and the OpenTelemetry libraries themselves in your environment and still have a unified backend where you collect all that data and leverage it, even though you're technically using disparate libraries throughout your enterprise. So what you see here is how we've architected ourselves around these open source platforms, whether it's Fluentd, Loki, Prometheus, Jaeger, et cetera. They now serve as your data collection and data store. You don't have to go out and pay another vendor 10 or 15x just for storing metrics when you can store them in your own infrastructure; we're all doing the same thing, putting them in a long-term bucket, and now that's under your control. So, for example, Promtail: if you start looking up the stack into the light blue, Promtail runs as a DaemonSet and collects logs from all your nodes and all your containers. And then on top of that you have node-exporter, right?
Functioning as an exporter for Prometheus to grab the metrics from the nodes themselves. Going above that, you'll see cAdvisor collecting data from the containers running on each node. We also leverage the kube-state-metrics (KSM) exporter, which is pretty awesome at grabbing Kubernetes object status data. All of those feed into Prometheus or Loki, or, if you're using traces, into Jaeger. And really, even with just that, you've got a pretty darn functional observability layer: you have metrics, you have traces, and you can go into different places and look at your logs. But what Alok was mentioning earlier is that smart layer. You want to leverage all those pieces of data, bring them together, and do something really powerful with all that context, all that configuration data we can grab from the Kubernetes API. On top of that, you have metric data, configuration data, performance data, and event data from your cloud environments. More and more applications are hybrid, right? Whether it's VMs and Kubernetes, or serverless and PaaS, you have these really hybrid environments. Again, it's this whole proliferation of data, and you want one place and easy ways to collect it, and that's really what these open source platforms have enabled. But going back to what I was saying about cloud: you also want a place where you can grab your cloud data and bring it in, whether that's serverless and functions as a service, or the PaaS layers, which just keep growing: cloud caches, messaging services, cloud databases, et cetera. So what OpsCruise sets out to do is not only grab that open source data and leverage those collection platforms, but also bring in the cloud data, mesh it all together, build something really rich, and then provide actionable insight on top of it. So what I'm going to do is show you a demo of OpsCruise. Oh, sorry, Alok. Since I had the chance to look at the messages: someone asked, what about Fluentd or Fluent Bit? Let's address that. Yeah, so as mentioned, we can take logs from Loki, Fluent Bit, or Fluentd; those are usually the pieces we run into. The whole point is to build a modular, flexible platform where you can grab data from whatever your preferred variant is. OpsCruise in particular provides support for Fluentd, Loki, and a few others as well. Yeah, and the takeaway message before we go into the demo: as long as we have a source for getting the logs and the metrics in, this approach will work. And with CNCF open source, we don't have to do proprietary agents or proprietary instrumentation; we can sit outside without being intrusive. Think of it that way. The real intelligence of observability is not about how the metrics got to us. As long as we have coverage, that's the key, and coverage across all of these is needed. You can't just look at metrics and logs and traces independently; it doesn't give you the whole picture.
You know, otherwise we're like one of the six blind men looking at the elephant, which is in the room. All right, go ahead, Cesar. I had to say that because I think it's true. No, thanks for that. All right, so now I'll share. Alok, I think you might have to stop sharing first. Sure. All right, so let me share. Hopefully you'll be able to see my screen here. There you go, it's coming up. Excellent. Okay, awesome. So this is our landing page for OpsCruise, and you can see there's quite a few pieces of data here. This screen might look familiar to any of you who have used APM tools before: this is essentially a real-time service topology map. As Alok mentioned, we're leveraging eBPF as well. eBPF lets us grab network data and bring it in alongside the tracing, which in this case happens to be optional because we have eBPF. So it's the eBPF network data, alongside the tracing data, the metric data, the logs, and the configuration data discovered from the cloud, from Kubernetes, from the virtual machines themselves, or from serverless and other PaaS components, all brought together in a single place. And we're giving you this real-time flow of how your services are interacting with each other. Let me zoom in a bit. Now, this is nowhere close to some of the busiest environments, but you can see it gets busy really quickly. One of the cool things about having all the configuration data, and the really rich data that underlying tools like cAdvisor collect, is that we get a lot of rich object data along with the metrics and logs. Being able to understand the configuration of these pieces lets us also extract things like labels and tags. So when you have a busy environment, you might want to filter on a particular namespace: I might only want to look at, say, the opscruise namespace. That really helps you cut down on the noise when you're trying to isolate an issue. But going back to our premise: what we're showing here is a mixture of quite a few different pieces of data. You're seeing the eBPF pieces. And again, we talked about cloud. This demo happens to be running inside AWS, but whatever cloud you're running on, you're very likely going to have that PaaS layer. Being able to collect that data, bring it together with your Kubernetes environments, and monitor it all in a single place is absolutely powerful. So if I click on, for example, that AWS RDS instance: again, we're talking about metrics, so on the right side we're collecting all those individual metrics, the read IOPS, the throughput, et cetera. This is a high-level summary, but the important thing is the metrics. I can go in here and look at all the individual metrics; that's one of the pillars of observability. And that's just for one entity. Same thing for a pod: this is a pod and its container. If I click on a pod, same thing, I'm bringing back all this configuration data, all these labels, what time it was created, what host it's running on. It's important to understand all these things, because when you're troubleshooting, you ask: well, what time was this pod started?
It was supposed to have been restarted five minutes ago; did we actually perform the restart, or was there an issue doing that rollout of the application? Well, look, it's been running since a couple of months ago, so hey, that rollout wasn't successful, right? Again, we've got metrics as well, and each entity has its own pieces of data. And it's important to be able to look at that data, again, in context for whatever problem you're troubleshooting. In this scenario I clicked on this container, which happens to be one of the agent containers, and now I'm getting additional data that's contextual for that particular container, like the ports that are being exposed. On top of that, you can see how the infrastructure is layered and what is related to what. For example, we have contextual access to these different pieces: if I click on this three-layer view, it shows me this particular container and the pod it's running in, some details about it like the IP address and the image name it's using, as well as some high-level metrics such as CPU and memory. It also shows me what Kubernetes node this container is running on, along with some of the neighbors and the CPU and memory metrics for those neighbors. And then it shows what cloud instance this Kubernetes node is running on. So when you're troubleshooting, say you're running in EKS and you have some nodes in one particular subnet or availability zone that are having connectivity issues: with one little click, you can see whether your container happens to be running on one of those nodes, plus things like the region and how much storage is attached to it. Not only that: as we mentioned, the ubiquity of all this data and the ease of collecting it make it really simple to bring it all together. Now we can look at what we call the infrastructure map, which is essentially a cloud map, in the context of this particular cloud instance: we're looking at this EC2 virtual machine, its configuration, and its tags. I'm just showing, behind the scenes, all the open source data we're actually collecting, and how even that data by itself is really powerful; once you combine it with the intelligence, which I'll talk about in a second, that's where things really start to take off. As we mentioned, we're collecting data from the Kubernetes API and from the containers; that's where we grab the individual container metrics and the node metrics. Let's get a per-node view: instead of looking at it from an application-centered view, I can look at the node level. Let me clear out some of these filters. Now you see we have five nodes running, and I'm looking at each individual node and the workloads running on top of it. I can click on metrics and get the metrics for that particular node. It's loading, just a second, and then we can look at the configuration of the node itself. I have some filters on here by default. Real-time internet issues, ha, always fun. There we go.
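For anyone who wants the raw version of that container-to-node-to-cloud chain, the same walk looks roughly like this from the command line. The pod and node names here are placeholders, not the demo's actual objects.

```shell
# Which node is this pod running on? (pod name and namespace are placeholders)
kubectl get pod cart-server-6d9f -n shopping-cart -o wide

# Node capacity, conditions, and the neighboring workloads on it
kubectl describe node ip-10-0-1-23.ec2.internal

# Which cloud instance backs the node? providerID encodes it,
# e.g. aws:///us-east-1a/i-0abc123def456
kubectl get node ip-10-0-1-23.ec2.internal -o jsonpath='{.spec.providerID}{"\n"}'
```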
All right, so again, we're collecting all the configuration and metadata not only for the containers themselves, but for the nodes you're running on: things like the memory utilization, sorry, the memory capacity, and whether the node is Ready. So you'll see here max memory, max storage, and what version of Kubernetes they're running. Here we see this node is running Kubernetes 1.17, which is actually probably a little outdated, plus the kernel version of the operating system it's running on, et cetera. So we're bringing all this data together, which is really empowered by this open source layer of tools. We're not using custom agents, we're not doing anything special, just leveraging all this data and bringing it together in a single place. On top of that, I mentioned it's important to cover things like PaaS services and serverless, so we also collect that kind of data. You saw an RDS instance, and I think I showed a load balancer as well. In this environment I also have an API gateway running with an S3 call-out via serverless. So you'll see this API gateway, and again I'm grabbing the data from it, just like for the containers we saw: that particular entity's metadata, and its metrics as well. Same thing for the serverless functions: I can see the ARN of that particular function, the region, and the metrics below that. So the whole point is to bring it all together. And before I jump to something else, I also haven't mentioned traces, so let me share this screen, because I do want to show the traces. Here we go. We also have our trace map view, which we just recently announced. When you're leveraging distributed tracing, as we mentioned, we can collect all that data in a single place too. We're collecting the individual traces, and we're doing something pretty cool, which is what we call the trace map, along with identification of these trace paths. Sorry, Alok. I don't see a screen. Oh, you don't see my screen? Sorry about that, let me share again. Okay, great. And I did see one thing: a couple of comments asked if you could bump up the text just a little bit. Sure. Is this a little better? Yeah, I think that should be good enough. Gotcha. Let me know if there are still visibility issues, but I bumped it up a little. So again, I'm just showing off the tracing capabilities, bringing everything together in a single place. You can see this trace map showing the different interactions from the front end to the ad service, the product catalog service. But one of the really cool things we've been able to develop is identification: a lot of times in tracing, you'll get transaction identifications. Oh, I'm seeing that. Hopefully this is a little better; I think I've hit the limit of my zooming-in capabilities, sorry, guys. I always thought it was a little bigger. Okay, so we've got the traces, we've got the trace maps.
And it sounds like I'm getting too much noise on my machine, so sorry about that; I'm seeing that in the chat. Hopefully I've turned off the notification sounds here and that'll stop interrupting. Okay, so we've got the trace map view, but we're also discovering what we call trace paths. These trace paths are not just... sorry, guys, give me just one second. You're still on Slack, guys, that's why. Yeah. Taylor, you know how this goes. Oh, absolutely, it's always fun. I feel like as soon as anyone goes live, there must be a hidden button somewhere, because that's when everyone decides to ping you. We are definitely being pinged. Anyway, I believe I've turned on Do Not Disturb successfully, which I thought I'd done before the call, and I've just shut down Slack. My apologies to everyone. Okay, so let me head back here. So we have auto-discovery, essentially, not only of the transactions themselves, which you're used to seeing in distributed tracing platforms, but also identification of the paths themselves. You might have a transaction for one of these products, say a /checkout, but you might have different types of checkouts: maybe you're selling a class on your e-commerce site versus a product. So even though they're both called checkout, one might go to the ad service, and the other might go to the checkout service and then the product catalog service. Even though they're named the same, we identify the differences between them, perform automated anomaly detection, and profile those transactions separately from each other. So that's some of the tracing. We won't delve too far into this, because I want to show some of the magic behind what we can do now that we have all this really rich open source data. Let me stop sharing and re-share my other screen; give me a second. There we go, you should see my screen pop up in a second. All right. So one of the things we can do now that we have all this open source data is anomaly detection, and detecting misconfigurations and misbehaviors. One thing I did not show, if I go back really quickly, is that we also collect configuration data beyond this high-level metadata view: we also show the entire manifest. If I click on the detailed view of this particular pod, now I'm looking at the actual manifest for that pod. So I can see the details of exactly what's going on without having to drop into the command line and run kubectl get pod -o yaml. This is way simpler, it keeps everything in context, and it keeps you in a single place. Now, on top of all this rich data, the other thing we have is what we call curated knowledge, because you also need to understand how these systems interoperate and what dependencies they have on each other. That's why we build that relationship view leveraging all the data: we want to know what container is running on what node, and what piece of infrastructure that node is running on.
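For reference, the command-line equivalent of that detailed manifest view is something along these lines; the pod name and namespace are placeholders.

```shell
# Full manifest for a pod, the raw version of the detailed view
kubectl get pod web-server-7c4b -n shopping-cart -o yaml

# Or cherry-pick fields, for example just the container images in play
kubectl get pod web-server-7c4b -n shopping-cart \
  -o jsonpath='{.spec.containers[*].image}{"\n"}'
```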
When a piece of infrastructure is down, we know it's affecting the containers hosted on it. And there's a lot of nuance and variance to the kinds of problems that can arise, but having this richness of open source data makes it all possible. So I'll show a couple of things we do here. Let me find an alert; I think I was looking at this one a little earlier. I'll explain what this is. In this case we have a deployment problem on our web server deployment: we're supposed to have a total of three replicas, and, you know what, I'll bump up the text a little because I know that was asked before. So we're supposed to have three replicas, and we've only got two available, and this has been going on for a little while. Down here we provide some details: it's part of the shopping-cart namespace, it's the web server deployment, and here are some additional key-value details. But let's go to the fun view. I know some of you love reading JSON, but I like the UI just a little bit more. When I click on this Analyze view, it shows us what we call the contextual RCA, our fishbone RCA. What we're showing here are failure categories, across the top and bottom, that are affecting this particular deployment. All of this is collected just by querying the Kubernetes API, collecting the events and the containers, and linking all those pieces together. So we have a replica set scaling issue: we're having trouble scaling up an additional replica of that particular image, and now we're getting a back-off restart as well. But this is all really associated with the startup failure, and if I click on that, it tells me that I have an invalid image name. Opscruise is spelled with one i, and it looks like somebody spelled it here with two i's, so that's a bad image name. It took us all of, what, three or four seconds to figure out that one of our replicas isn't coming up because of a bad image name. It's those kinds of things, the richness of the data, that let us build these really quick root cause analysis pieces into something like OpsCruise. Yes, you can do this from the command line; it's a little more work, and it'll take anywhere from thirty seconds to a couple of minutes, but multiply that by the thousands of times this can happen in a month, and that's a lot of time saved for operations. You'll also notice other categories, and some are more complex. These individual problem detections and anomaly detections are building blocks for what I'm going to show you in a second. You'll see things like a missing ConfigMap: if you reference a ConfigMap in your manifest that does not exist, your pod is going to fail, so we highlight those. Or failed volume mounts, or even bad image tags. I think I actually have a bad image tag in here that I was looking at.
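As Cesar says, the command-line route to the same answer exists; it just takes longer. Roughly, with names assumed from the demo scenario:

```shell
# The Deployment shows 2/3 ready, so one replica is short
kubectl get deploy web-server -n shopping-cart

# Find the pod that isn't Running
kubectl get pods -n shopping-cart --field-selector=status.phase!=Running

# The Events section will point at the bad image reference
# (an InvalidImageName / ErrImagePull style warning)
kubectl describe pod web-server-xxxxx -n shopping-cart
```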
Yeah, so a very similar scenario, but for the cart server. If I click on Analyze: yep, same kinds of symptoms, replica set scaling issues, back-off restarts going into a crash loop. But in this case we have an invalid image tag; this particular image tag does not exist. Right. Now, the other thing I haven't gone far into, but which really is absolutely key, is the machine learning. For all the individual services you deploy onto your clusters, with the data being collected from cAdvisor, from node-exporter, and from the discovery pieces, we create a really rich behavior model: we learn what normal behavior is for your individual services. We don't just look at one or two metrics like error rate and response time; each of the entities I've shown you has its own behavior model, and there are a bunch of others I haven't shown as part of this demo. If you're using a database like MongoDB, or a JVM, or an NGINX container, plus the generic containers themselves and the nodes themselves, they all have their own behavior models, and we pick up a mixture of a lot of different metrics to understand what normal behavior is. Then, when we find what is abnormal, we raise these alerts prefixed with ML, telling us there is some sort of ML-detected performance violation. If I click on this one, I get some details of what happened; let me zoom in a bit. Network transmit bytes increased by 540%, Layer 4 bytes for outbound traffic increased, and inbound transmit bytes actually decreased, so we don't only detect increases, but abnormal decreases as well. Just like in the other scenarios, if I click on Analyze, I get a fishbone representation of exactly what is going on with the metrics and why the ML triggered an anomaly in the first place. Let me zoom out one notch. Just like you saw for the Kubernetes-specific deployment scenarios, in this fishbone RCA we're looking at a container view: the cart cache, in particular, had some deviation in its metrics. Actually, before looking at this, let me go back to the summary screen and show you down here: if I click More Details, speaking of the ML and all the metrics we take in, these are all the different metrics for just the generic container model. So again, it's not just one or two or three metrics: we're looking at transmit bytes, packets in, packets out, memory failures, CPU utilization, and all this container data is again provided by cAdvisor, an open source tool. So, going back to the Analyze button, now we're seeing the actual pieces that triggered the ML, and you'll notice the fishbone has changed from the startup-failure one: now we're showing memory, file system, and CPU. Right away we show it in red; you don't have to go look at a chart for each specific thing, it's here. I'm seeing CPU utilization has increased by close to 50%. I'm looking at demand, which is incoming requests: response time has increased by over 1,700%. I'm looking at the outbound supply side: response time has increased by 2,200% for outbound requests from the cart cache.
And on top of that, our response size has increased from one meg to close to eight megs. And then, bringing in the Kubernetes layer, there's this whole image change: using the data from the Kubernetes API, I can see there's been a recent image change that's likely contributing to this failure. Now, I'm glossing over a few details in the interest of time, because I want to show you how we bring a couple of these things together. This is an ML alert, and by the way, we automatically chart those important metrics down here, so you can see their behavior during the time of the anomaly. And as mentioned, you can drill down into any logs that might be coming in. Actually, I should show that; I don't think I did. So here's another example of an anomaly, on a database server, and you'll notice here that you have different contextual accesses. In the application state, I'll show that we have a time-travel capability: with all this data, the metric data from Kubernetes, the log data from Fluentd or Loki, the trace data, the real-time map you saw, and all the configuration data, we take snapshots every five minutes. So you can go back in time to how your system was configured at the time of a particular anomaly. In this case this goes back a day, so if I click that, it takes me back a day and shows me the entire configuration of my entire state at that time. But we'll come back to that. I can click on metrics to see the metrics for that database server, or any events related to it. In this case I want to show logs, so if I click on the pod or the container logs... maybe it wasn't logging there, but I do want to show that we have contextual access to the logs. Let me find one, because I don't think we've actually shown logs. Let me see; maybe node-exporter will have logs. Sorry. And Cesar, I saw a couple of questions come in on that front. One question was: does OpsCruise support custom metrics? Yes. Basically, it's any data that's being exposed to Prometheus. As long as data is exposed to Prometheus, it can be brought into OpsCruise; we're just leveraging Prometheus as the metric ingestion point. Awesome. And then one other question: do the ML alerts wait for a particular amount of training data before alerting? I can answer that. Typically it depends; our default is about one day, but we can speed it up so we learn faster. The reason I say that is, suppose you deploy on a weekend when there's hardly any activity: you're not going to see a lot of activity to profile, but the next day it starts increasing, so over time we essentially update the model. The default is 24 hours; it can be even less, even a few hours. We just need enough data to get an initial profile, and then we continuously update it. Cool, thank you. And no, the users don't have to do anything. I wish I could learn that fast; one day is pretty good. Exactly. If you look at that generic container model Cesar showed, there are about 30 metrics, and you have no way of knowing whether, say, the number of calls being made is the problem versus memory failures suddenly increasing.
There is no way a person can do that. That's the beauty of using the ML: it gets at the unknown unknowns. Keep going. Yeah, absolutely. So the thing I wanted to show is logs, because I actually did not show that, even though it's super important. Anything that's logging, we pick up from your standard out. If you click on anything, whether it's an anomaly, and in this case I have a pod open, from what we call its quick view, one of the links you have is for logs. I can just click on Logs, and that takes me straight into the logs for that particular service. These are pretty static logs here, but it is searchable, so I can look for "requests", for example, or "conversion". Yeah, the idea is to contextually link it to the problem. Correct, yes. I don't have a problem right now that has logs, but we surface that as well: if you're having an anomaly, you can jump straight into the logs for when it was active. Now, what I do want to show, with all of this put together, is an alert. I showed you how we collect all the different data, and the architecture: we're leveraging purely open source tools here to collect the data, whether from VMs or from Kubernetes, at the application level with MongoDB exporters or NGINX exporters, as well as traces from whatever OpenTelemetry-compatible libraries, all basically built on open source. I also showed you the anomalies, the specific Kubernetes detections, and then I showed you the ML and how we automatically detect performance deviations across lots of different metrics. So in this scenario we're tying everything together. I have a response-time SLO breach on my NGINX service, so I'm going to click on that. Here are the details: I have an SLO of five seconds, and my response time is at over 15 seconds. So I want to see what's going on. If I click on Analyze, and let me close this, we'll come back to it in a second, what this does is show a slice of the actual app map we were looking at earlier, but focused on the timeframe and the context of this particular anomaly. Your root is essentially here: at NGINX you're seeing a slowdown, but we've also identified which downstream services are involved. So we have NGINX itself: the Kubernetes service, the pod, the container. Same thing for the web server: service, pod, container. For Redis: service, pod, container. And then we've got the cart server: service, pod, and container. You'll notice immediately what's highlighted in red: we're doing fault domain isolation as well. You don't have to call the NGINX microservice team, whoever's managing that; you don't have to call the web server team. It could be the same team, or a couple of different teams, but you don't have to reach out to them, and you don't have to go into your tools and look at the metrics for those parts, because we're showing you they're healthy.
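For comparison, the hand-maintained alternative to an SLO alert like that is a static Prometheus alerting rule along these lines. The metric name assumes the ingress-nginx exporter, and the whole rule is an illustrative sketch of what a learned model replaces, not the demo's actual configuration.

```shell
# A static, hand-tuned SLO rule (metric name assumes the ingress-nginx exporter)
cat > slo-rules.yml <<'EOF'
groups:
  - name: nginx-slo
    rules:
      - alert: NginxResponseTimeSLOBreach
        expr: |
          histogram_quantile(0.95,
            sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le))
          > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: p95 response time is above the 5s SLO
EOF

# promtool ships with Prometheus and validates the rule file
promtool check rules slo-rules.yml
```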
So what the data has shown us, the data we've collected from these containers, plus the network data and the configuration data, combined with our ML, that intelligent layer on top, is highlighted in the red pieces. Our container for Redis is red, and our cart server service is red, as are its pod and container. So we'll walk the chain and see what's going on. We've identified we have an SLO failure up here: we're responding really, really slowly. Now, if I click on the next piece in the chain, it shows that the Redis container is problematic. If I click on that, it shows this is technically a separate anomaly from the NGINX one, and the ML has detected that it's very much related. You'll see a few different failures: we're getting an increase in throttling on the CPU, and user CPU seconds have increased by about 10%, but really interesting here is that the response time is normally 2.94 milliseconds, and right now we're at over two seconds. That was automatically detected. And also super important is our error rate: we usually have zero errors, and right now the error rate has jumped up to 36%, so a big chunk of requests are essentially going into an error mode. Something is wrong. So let's go back and see what our RCA is pointing at. Redis is calling the cart server. If I click on the cart server, I see the alert very clearly: the cart server doesn't have any pods to serve requests. That's a very clear indicator of why Redis is experiencing a bunch of response time failures, and now error rate failures: there are no pods behind the Kubernetes service to serve any requests. And to look in a little more detail, if I click on this particular pod on the cart server to see what's going on: looks like I'm having a back-off restart. If I look at the details of that alert, it shows me a little more; again, these are all separate but linked-together problems. If I click on the Analyze tab, now it shows me the real root cause. It's an invalid image name, like we talked about earlier: this broken image name with two i's in it. I'll zoom in a little so you can see it; you can see opscruise here spelled with two i's. So we have this invalid image name, and that's really the root cause of the issue, and it's all shown right here in a matter of seconds. Our NGINX is experiencing a response time slowdown; Redis is saying we've got an increase in response time and errors; the cart server is saying, well, I don't have any pods to serve; and the pod itself is saying, I can't start, because somebody gave me a bad image name. All this in a matter of about 20 seconds. It took me close to a minute and a half to explain, obviously, but all this understanding, from the Kubernetes level up to the application level, connecting to each other, is powered by these open source tools, plus the intelligent layer on top, which is, in my opinion, pretty darn cool. In the interest of time, there are other things I wanted to show, like time travel, but I think we're pretty close to out of time, and I want to open it up for questions. So with that, Alok, I'll turn it back to you.
Yeah, so I think we should go back to Q&A. But just to summarize, back to the premise we started with: in order to really help us in this new cloud native, microservice, Kubernetes environment, we no longer have to worry about where the data is coming from; all the telemetry is there. What observability has to do to have this intelligence is understand the full context of the application, capture everything across all dependencies, and track that. Users shouldn't have to do that themselves, so that's what we need to build. Then understand the application and profile the behavior, so users don't have to worry about how to detect problems or set thresholds; we want to take that off the table. And then contextually analyze everything, because now we have rich data across this whole distributed system. Whether an issue is infrastructure-related, Kubernetes-related, or down at the application, they all link together, and you don't need six different folks looking at traces, logs, events, and alerts to do that. That's the role of observability in this new world, and thanks to OpenTelemetry and open source, it is possible. OpsCruise is a proof point: you don't need to run multiple siloed tools to build that intelligence and reduce the amount of effort. I'll pause there. The whole point is: take advantage of OpenTelemetry and the open source tooling the CNCF has been fostering. We are firm believers in that, and we hope you can leverage it too. Awesome, thank you so much. I did see quite a few questions come in, at least three. The first one, from Ishmael, was: how easy is the installation? That's a great question. Let me see if I have it in this environment here. For a typical deployment into, say, an on-prem cluster, we leverage Helm, again, another open source tool. It's about three or four, actually five, commands. Sorry, could you bump that up just a little bit? Yeah, absolutely. Essentially, if you don't have these existing tools... because as I mentioned, these are essentially the de facto standard for open source monitoring in modern environments, and most of the environments we've gone into already have them, so it's actually a little simpler. But if you don't have these tools, I think it's this last command that deploys all the underlying open source tools for you. Essentially it's through Helm: these five commands get you from a greenfield cluster to up and running. Literally copy and paste, and you're up and running in about three or four minutes with the entire environment I showed. The only thing that's not available right away is the ML alerts; come back in a couple of hours to 24 hours, that's usually the sweet spot. But everything else you saw, within three to five minutes of deploying, you're getting all that data. So that's how simple it is. Awesome, thank you so much. The next question I saw was: is it free, what are the levels, how does it work? Is it a SaaS, or something you can host on your own? It's a SaaS service, yes. Cool. And there is a freemium offering for people who want to try it out, so you can go to our website, opscruise.com, and check it out. Cool, thank you. We've got lots of questions coming in; thank you everybody so much for submitting those, and please keep them coming. I think we have about seven minutes.
I'm happy to field those as much as possible. The next question is: is it a good idea to use this for offline troubleshooting by importing collected data? I was wondering about troubleshooting edge deployment cases where we don't have access to the cluster. Interesting. So you're asking whether, if you can collect these metrics, you can push them to us. It's a little trickier; it depends on the context, and we'd have to dig into the specifics of what data you have. Because remember, to understand the application context, we pull everything in so we can see the dependencies, and seeing data in isolation doesn't tell you that. So it's probably case-specific; we can take this offline, and if this attendee has something specific, we can follow up. I'll mention that it depends on how your edge is set up. If you flat out don't have access to export metrics: well, OpsCruise itself isn't really doing much on the collection side; it's really about having Prometheus on the cluster, and Loki or Fluentd on the cluster, to collect that data. If you don't have access to that, it's something to be explored. But speaking of edge itself, we recently published a joint blog with Verizon (I don't know what's going on with my Do Not Disturb, but I know somebody mentioned it, and I promise I turned it on) where, when you're running Kubernetes clusters or workloads at the edge, it absolutely is a supported model. That joint blog walks through launching a Kubernetes cluster on AWS Wavelength with observability built in via OpsCruise, and it functions perfectly fine. But again, if you have a really locked-down edge, that's something we can talk about offline, so feel free to reach out. It's definitely a use case, because a lot of edge applications are being deployed on Kubernetes, so we are absolutely playing in that space. Awesome, thank you. Next question: is AWS Bottlerocket supported? Can you read the second part? AWS what? Bottlerocket, Amazon's operating system; it's built for containers and running things on that front. My initial thought would be yes, because of the interfaces you've chosen to bind to, CNI, CSI, all of those things; as long as those are supported, that should be good. But the specific question might be whether that carries over to the system metrics. Yeah, correct. It should be. I don't remember whether anybody is actually using AWS Bottlerocket with us, but we actually don't depend that much on the OS: as long as the nodes are running the minimum required kernel, which I believe is Linux kernel 4.15 and up, we shouldn't have any issues supporting it. If you want to explore, I highly suggest signing up for the free version; it should work, and I don't see why it wouldn't. Primarily we look for operating systems that enable collecting flow data; that's probably our main requirement. And all those nodes run node-exporter, of course. Cool, interesting. It's fun to see all these different computes and being able to surface that information. And it's not just AWS; we work with any cloud vendor. That's the advantage of working without proprietary agents. Awesome. Next question: does the SaaS
support SSO and SAML? Again, yes, absolutely. I believe the free version doesn't have some of those quality-of-life pieces; some of those are more enterprise features. But yes, absolutely: many of our customers are using Azure AD, or they might be using Okta, et cetera. We absolutely support that. Excellent. Next question, and I promise to keep peppering you until we're out of time: what resources does OpsCruise require? I'm guessing that might pertain to Kubernetes primitives, like nodes, storage, deployments, ConfigMaps. You're talking about what's required just for the collectors we have, right? That's an easy one. I think it might be more about the platform, like installing the integration to report the data to the SaaS. So, going back to when we showed the architecture: for the actual open source collection tools, each one has requirements, but they're really small. We're talking about hundreds of millicores to run the open source collectors like cAdvisor and node-exporter; those are all really lightweight. The only piece that utilizes more resources is really Prometheus, and that depends on how many objects you have in the cluster. For a small cluster we typically recommend maybe a two-CPU, eight-to-twelve-gig machine, but as you scale up the number of objects... and I might be misremembering, so if you want more details, really hard numbers, please reach out, but I think we've seen close to five thousand containers being monitored by, at this point, maybe a four-CPU, 64-gig node to power Prometheus, and it's not fully used; it's when you scale out and get a spike in objects that it really uses that. But Prometheus is really the biggest one, and you'll find that's not an OpsCruise thing, it's a Prometheus thing. Other than that, all those components have extremely small resource requirements, really negligible on your clusters. Absolutely. Well, with that, unfortunately, we are at time. I really do appreciate everybody reaching out and asking those questions. Like all good things, streams have to come to an end, so we are at that point. But thank you so much. If anyone is looking to reach out to either of you, is there a good place to send those questions? Sure, you can reach us at info@opscruise.com, to be generic enough, and you can also ping us on the website, opscruise.com, itself; we should be easy to find. We're also on LinkedIn if you want to look at our OpsCruise page; we'd love to chat with you and get your feedback. Excellent. This was interesting and exciting, given where things are going with OpenTelemetry and open source. Well, thank you both so much, and thank you everyone for joining the latest episode of Cloud Native Live. It was great to hear from Alok and Cesar, and again, we really loved the interaction and questions from the audience. Join us next week to hear about building stability in Kubernetes with Andy Suderman of Fairwinds. Thank you all for joining us today, and we will see you soon. Thanks, everyone. Thank you.