I'd like to welcome everyone to today's CNCF live webinar, real-time troubleshooting of Kubernetes applications. I'm Libby Schultz and I'll be moderating today. I'm going to read our code of conduct and then hand over to Alok, co-founder and CTO of OpsCruise, and Nick Lee, CTO of Megazone Cloud. A few housekeeping items before we get started. During the webinar you are not able to talk as an attendee, but there's a chat box on the right-hand side of your screen. You've been saying hello and thank you there, and please continue to do so; also leave your questions in the same spot and we'll get to as many as we can at the end. This is an official webinar of the CNCF, and as such it is subject to the CNCF code of conduct. Please do not add anything in chat or in questions that would be in violation of that code of conduct, and please be respectful of fellow participants and presenters. Please also note that the recording and slides will be posted later today to the CNCF online programs page at community.cncf.io under online programs. They're also available via the registration link you used today, and the recording will be on our online programs YouTube playlist. With that, it's over to Alok and Nick to kick off today's presentation. You all take it away.

Thanks, Libby. Thanks for the introduction, and thanks everybody for joining today's live webinar and demo. Let me have Nick give a shout-out as well.

Good morning, everyone. This is Nick. Thanks for your time.

All right, I'm going to jump straight ahead to the topic at hand. Hopefully you can see my screen and everything looks good. Today's topic is going to be an interesting one: how can we do real-time troubleshooting of Kubernetes applications, especially when those applications start showing problems and performance issues. I'm going to dispense with the legal notice, et cetera, but I thought we should give you a little bit of background on who we are, in case you don't know OpsCruise. We are a relatively young company based out of the Bay Area, and our focus is almost exclusively on how to provide observability for cloud-native applications. Because it's cloud native, we are an active participant and member of the CNCF community, and as you will see, we are pretty much built totally on CNCF and open-source instrumentation. I don't want to read through this, but you can see we've been working with a number of customers and a number of partners, and we are venture-backed. The other thing I'll point out, because we are focused so much on open source and the CNCF, Prometheus being one of the first projects, is that Julius Volz, who you'll know if you're following Prometheus and CNCF instrumentation, is on our advisory board. We are glad to have him. So we'll go straight into a little bit of background on Megazone, and I'm going to hand this off to Nick. As I said, you can obviously find out more by looking at our website, opscruise.com. Nick, if you want to give a little background about yourself and Megazone Cloud.

Sure thing. So Megazone Cloud was founded in 1998. We mainly focus on helping customers utilize the cloud better. We're headquartered in South Korea, and we have offices here in Palo Alto, in Toronto, Canada, Tokyo, Hong Kong, Vietnam, and Shanghai, and we recently opened one in Australia as well; that's the latest office. So our main focus is using the cloud for the customers and helping them use it properly.
So we have AWS, Google, and Microsoft partnerships, and we're the largest AWS partner in APAC and top three globally, and we also work with regional cloud service providers such as Naver Cloud Platform and KT Cloud. We try to help the customer from day one through to continuous operations: consulting, development, any migration they need from on-prem or from one cloud to another, and operations after the migration or development. We work with various partners, such as OpsCruise and ISV partnerships in Korea. One of the things that I do here in Palo Alto is look for leading-edge technology companies and bring their technologies to Asia, to help the customers in Korea as well as reduce the gap between the US and other parts of the world. We also provide a service called HyperBilling to help customers with the billing on their cloud usage, so multi-cloud billing services, as well as SpaceONE, also known as Cloudforet, which is a Linux Foundation project that helps you manage multiple clouds in a single portal. Next slide.

So the reason we are working with OpsCruise is that internally we were facing these challenges ourselves. Like I mentioned, we have Cloudforet and SpaceONE, which are provided as SaaS products. We wanted to make sure that we stay within our SLOs and provide the right level of quality of service to the customers. And we saw the same pattern with our own customers as well. One of the largest mobile telecom companies and mobile service providers in Korea was working with us and asking for help with their Kubernetes environment. We were trying to solve their problem, because they were having to use multiple Kubernetes tools; it's siloed, and it's hard to maintain and operate. They needed to translate the metrics and perform complex correlation, and the problem with that correlation is that unless you know exactly what you want to do and you know every aspect of the metrics, it's very difficult to do. Getting that information and providing it to DevOps, to enhance their DevOps practice and skills, is not an easy thing, and it takes way too long for newcomers to get trained and start using the environment and the tools properly. Keeping up with new releases and the various open source projects is not an easy task either. And it's not just us; it's our customers, and globally too: it's hard to keep the talent. When you hire new operations people or DevOps engineers, it takes at least three to six months for them to understand what it is and what they need to do to keep the environment running, make it better, and upgrade it. So the solution we were looking for is something that we can automate but that is easily adopted and that we can train others on easily, as well as a single pane of glass with all the integration and telemetry coming into that environment. And it doesn't hurt to get machine learning and AI-assisted troubleshooting, because we all know we don't have enough people to do all the troubleshooting ourselves; if we can get help, that's always better. And easily understood SLOs and quality of service for our DevOps teams and service owners. So that's where we partnered up with OpsCruise to try to help customers with these challenges. Alok, it's all yours now.
Thanks, Nick. I think you've set the stage, and hopefully many of you can recognize or empathize with the issues that Megazone and their clients were facing in Kubernetes. Having been working in this area for six, seven years, I would say it's not an easy problem, right? So to get right to the heart of it, Kubernetes application performance troubleshooting is not just about Kubernetes. It's about everything that sits above it and below it. Think about what has happened with cloud-native services: just the number of objects, containers, services. We've seen 3,000, 5,000 containers in a single cluster, a large number of nodes. And it's not just that: you have service-to-service calls, you have SaaS entities that are not being managed by Kubernetes, there could be external calls to APIs. And then of course, the reason you go to cloud native and agile is that you can constantly make changes. You can make changes to the services, you can scale out, scale in, change the code version, and so on. So on top of this large complexity and scale you have dynamic changes that affect everything, because every time you add a new service or take one out, your dependencies and who's talking to whom have changed. And if you don't do a deployment right, how do you know that it causes a problem? Or at runtime, something happens that you didn't know about in the infrastructure, or something else changed, or the configurations you set. All of that is not just a Kubernetes problem; it goes all the way down to the dependency on infrastructure, but also all the way up to the microservices and therefore the applications. So the question is, if the application does fail, how do you know whether it's at the code level, something on a third-party service, or Kubernetes, or the infrastructure? That's what makes it complex, right?

The good news, and this is part of the reason we love working inside the CNCF ecosystem, is that all of the telemetry you need is there: metrics from Prometheus; flows, where we are looking at Layer 4 bytes and packets or even Layer 7, whether it's from Istio or eBPF, the extended Berkeley Packet Filter, which tell us request rates, response times, error rates; events from Kubernetes state metrics, so changes that you made can be captured; logs that are coming in, whether at the application level or for specific container-level issues; and of course traces, with OpenTelemetry. So if you look at that, between Prometheus, Istio, Kubernetes, Fluentd, and of course open source like eBPF and Loki, Jaeger, Zipkin, all of this telemetry, as well as information on the configuration, is all there.

What's the challenge? Just like the number of objects, for every object there are metrics, flows, events, logs, and traces. So if you think about it, there's a huge cardinality problem, technically speaking. If you analyze it, a large Kubernetes application is really a highly dynamic complex of spatial dependencies and temporal dependencies, and you're only getting some of these metrics every 30 seconds or one minute, depending on how you're scraping. So we have a fundamental cardinality problem when we are trying to debug and figure this out in real time. How do we help ops know what's going on, to troubleshoot a problem if something's slowing down in Kubernetes? That's the focus, right? Now, as I said, and I'm just re-emphasizing, all of the things that we are looking at, all of the metrics and telemetry, are already available.
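To make the metrics side of this concrete, here is a minimal sketch of pulling request-rate and latency series for one namespace from a Prometheus server over its HTTP API. The Prometheus address and the Istio-style metric names are assumptions, not something shown in the webinar; substitute whatever your mesh or eBPF exporter actually emits.

```python
# Minimal sketch: query a Prometheus server for request rate and p99 latency.
# The address and metric names below are assumptions for illustration only.
import requests

PROM = "http://localhost:9090"

def instant_query(promql: str):
    # Prometheus instant-query endpoint; returns a list of series with labels and a value.
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Per-destination request rate over the last 5 minutes (hypothetical namespace).
rate = instant_query(
    'sum by (destination_workload) (rate(istio_requests_total{namespace="shopping-cart"}[5m]))'
)

# Approximate p99 latency from histogram buckets.
p99 = instant_query(
    'histogram_quantile(0.99, sum by (destination_workload, le) '
    '(rate(istio_request_duration_milliseconds_bucket{namespace="shopping-cart"}[5m])))'
)

for series in rate:
    print(series["metric"].get("destination_workload"), series["value"][1], "req/s")
for series in p99:
    print(series["metric"].get("destination_workload"), series["value"][1], "ms (p99)")
```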
So I'm going to just quickly look at that. You recognize most of these logos for the standard metrics, but of course you also want to capture cloud-level metrics, because that gives you the infrastructure information on the VMs you are using, or the persistent volumes and the storage, right? What we want to do is leverage all of these, and this is what you should think about: all of that being available, leverage it to figure out, hey, how are these things tied together? Because that's going to tell you contextually what they mean across the different telemetry. Where is the dependency? Because dependency usually leads to what the causal paths could be. And then, if you knew what to expect, what we call predicted behavior, you know what to look for, and we'll talk about that briefly in the context of the demo. This all ties together and leads to the whole idea of trying to do causal analysis, which is not an easy problem, as we mentioned.

So what we recommend, and what we have done as an example, as you can see on the left-hand side, is leverage all of the open telemetry, especially the CNCF pieces. And effectively what you want to do is stream processing. I'm not going to go into a whole lot of detail here; you can always look this up, and maybe some of you are doing something similar. Collect all the telemetry and bring it together contextually so you know how the pieces link to each other, so you're not looking at metrics or logs or traces or flows or any events happening across the infrastructure, Kubernetes, and the app in isolation, because then you are doing all the work. If you can pull that together and get the topology, meaning who's talking to whom at the application level, down through the dependencies to the infrastructure, that gives you better context for knowing what's going on, right? And then of course, look at the flows, understand behavior, and, as the sequence says, be able to isolate the cause. So we have all this information. The context across this telemetry and the configuration changes is very important, and I'll emphasize this again: the whole idea is to get enough information across all of it that we know what the state of the application was at the time the problem was detected. This is where analytics comes in. This is where you need automation, because there is no way one or more SREs are going to be able to do this manually without some automation. And that's the whole point: all of the data is available, as close to real time as you can get it. Can you pull the insights you need, so that by the time a problem does happen, you're able to figure out what the problem is?

So let's talk a little bit about what that means for automating cause isolation. If you're a computer science geek and have looked at this problem, there's something called an NP-complete, or NP-hard, problem. What does that mean? It means that the number of possible combinations of data you have, when you're trying to get a sense of what is happening in time and space, is very, very hard to search. But you know what's interesting: I've talked to customers and people who deal with this, and they say one of their best sources for isolating a problem is their senior SRE who's seen it all. So, because it's not a simple problem, and it's highly non-linear, think about how we solve it. We leverage knowledge about the IT stack. We know that when a container is working and it is sitting on a node, it is using that node's resources.
We know that those resources are coming from the cloud. We know there are shared services, and that a shared service can become a bottleneck when multiple things are talking to it, especially things like databases, et cetera, right? These are aspects of knowledge that really good SREs use. They also, as you can see in the second bullet, follow the breadcrumbs: they will look at the dependencies, who is talking to whom, because they know that if the alert is happening somewhere and there's a slowdown, anything in that path is going to be relevant, and anything that's not in that path is probably not relevant. They also know, when there's a problem in Kubernetes, whether it's readiness or liveness; they know the meaning of that. So they're using all of this knowledge together, looking at the alerts, then looking at the metrics, logs, and traces. And if you have a service that has a certain kind of behavior, say it's IO intensive, they'll start looking there when there's a problem. So expert SREs who do cause isolation use all of this information. So why not follow that approach, instead of trying to look at everything and throwing all the information around without being able to narrow it down? The whole idea is that an automated system has to follow these breadcrumbs, use the knowledge, use the information properly, and narrow down to a very small set of objects that gets you closer to the likely cause. I would venture to say perfect cause isolation in real time is not theoretically possible. However, if you have enough information and you use that information correctly, you can get to it very, very quickly. And that decision process is what most SREs use. So the block diagram on the right is basically saying: when you see the alert, look at where the alert is, the source, and then start asking, hey, what's around it? What is its performance? Who was it interacting with? Depending on the type of alert in Kubernetes, was there an issue? Was there an infrastructure problem, was there saturation? And based on that, you start looking at, and eliminating, the possible causes that you don't have to look at, right? That's what the dynamic decision system is. And that's what we're going to talk about: how to leverage all the information and extract insights that will help us isolate the problem.

So before I go into demo mode and show you how this works as an example, I want to give you an idea of what you're going to see in the demo and what instrumentation is being used. Our deployment architecture is shown here. Effectively, all the blue that you're seeing, as you'll recognize, is open-source instrumentation sitting inside the Kubernetes cluster: cAdvisor, node exporter, your familiar Prometheus components, deployed as DaemonSets in the cluster. And Promtail is being used to collect the logs for Loki, so that's also a DaemonSet. I would add that in the node exporter, in order to understand flows, and not just bytes and packets coming into and out of the node and into and out of the containers, we are leveraging eBPF. So our node exporter also uses an eBPF collector, so we can look at what L7 metrics, sorry, request rates, error rates, response times, are happening at the level of a container within a node. We can get that information. And most of our environments also have Jaeger, so we can collect traces.
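As an illustrative aside, the collectors described here typically run as DaemonSets, so one quick sanity check is that each of them is fully scheduled across the nodes. This is a minimal sketch using the official Kubernetes Python client; the namespace and DaemonSet names are assumptions and will differ by Helm release.

```python
# Minimal sketch: verify that collector DaemonSets (node exporter, Promtail, etc.)
# are fully scheduled. Namespace and names are assumptions; adjust to your install.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in a pod
apps = client.AppsV1Api()

EXPECTED = {"monitoring": ["node-exporter", "promtail"]}  # hypothetical names

for namespace, names in EXPECTED.items():
    for name in names:
        ds = apps.read_namespaced_daemon_set(name, namespace)
        desired = ds.status.desired_number_scheduled or 0
        ready = ds.status.number_ready or 0
        state = "OK" if ready == desired else "DEGRADED"
        print(f"{namespace}/{name}: {ready}/{desired} ready [{state}]")
```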
So the primary four things that you're seeing here are Jaeger for traces, Prometheus for the metrics and flows, Loki for the logs, and then, because we want to look at changes that are coming in, we are also capturing Kubernetes state metrics. This is one way to see changes that have happened within the cluster. So what we do essentially, once this is instrumented, and you can do this yourself with the Helm chart, is we have four plus one additional pods, what we call gateways, each for a type of telemetry. We are collecting the metrics through the orange metrics gateway you see, a Prometheus gateway that collects all the metrics from Prometheus. The log gateway pod is basically collecting all the logs through the Loki collector. The Kubernetes gateway is collecting Kubernetes state metrics and changes, so we know what the configurations are and what changes are being made. And then the fifth one you're seeing is the cloud gateway, because we want to know what the infrastructure is so we understand the dependencies. All of these essentially collect data, just as you may be familiar with scraping at a classic interval that you set up, whether it's 30 seconds or one minute, depending on your scenario and bandwidth. The data is collected, compressed, and sent to a SaaS service, shown below in the orange hexagon, which is where our controller is. The controller takes all that information, which I showed in that earlier staged pipeline, processes and extracts it, brings it into context, discovers the topology, figures out what the dependencies are, both at the service level and down from Kubernetes to the infrastructure, and then looks at the data and figures out, using machine learning, what the expected behavior is. So that's the setup.

So today, the specific use case that I want to talk about in the next twenty or so minutes is an application slowdown that we have detected, and how we can analyze it, especially a Kubernetes kind of problem that got caught as an application slowdown. How do we do that? Now I want to make a note here: in the case we are looking at, we are not using tracing. Clearly there are ways of using tracing for this; that's a whole other topic and a separate issue, around something we call trace paths. In fact, there is a CNCF live webinar on that; if you're interested, you can follow up with us. But in today's case we're going to look at doing root cause analysis for a Kubernetes issue affecting an application where no tracing is available.

All right, so at this point I'm going to switch my screen to the demo tab. Let me see if I can do that. All right, let me go back to sharing. I think that's the one. Libby or Nick, is the screen showing up? Just confirm. Looks good. Excellent. So folks, what you're seeing here is what we call our application map. I'm not going to spend too much time here, because our focus is going to be on the automated cause analysis. What you're seeing here, and I can zoom in and out, is basically everything put together, shown at what I would call the service-to-service level. In fact, if I hover down, you can see this basically shows requests coming in. Let me give an example.
The load balancer is talking to an NGINX ingress, which goes into this container, which goes into another service, et cetera, all the way down to, hey, there is a Postgres database and it's actually running on AWS. So this application map that I'm showing you here is being built from the data that we are getting. And you can obviously organize it by labels on the application, or by the namespaces that are running. In fact, there are different applications running here, which I'll show in a minute. It's running on a five-node Kubernetes cluster; I'll show that in a few minutes, along with the different pods. And this is a really small application. This is the test bed that we use, with pods, containers, and SaaS services. It is in fact running on AWS, as you can see with the load balancers, and there are actually multiple clusters, but they're connected together. It's a multi-cluster environment, but we won't focus on that today.

And if you want to know what the different namespaces are, I can do that here on the screen. Let me share this tab. If you can see this, I can actually search by namespace. And what you're seeing here is the application, a shopping cart, a small e-commerce application. You can see a Robot Shop application, which is an IBM application, and the OpsCruise deployment. I'm actually not seeing anything right now. Oh, did it not update? There you go. There you go. Sorry, there is a slowness here; I might just be on a delay. Yeah. There we go. So I'll start again. What I've done is tried to show you the different application namespaces and what's being deployed in this cluster. And what I was showing is that today we'll focus on this little e-commerce app called shopping cart. There is more there: there is our OpsCruise deployment in its own namespace, Robot Shop, which is another e-commerce application, and an online boutique that's used for tracing, all of these. One reason for showing that is to tell you that we can filter down, et cetera. But for today, again, I'm switching screens and hopefully back to this. Hopefully the screen switched. Can you see that again? Always a delay. Yes. Might be my connection.

So to give you an idea of why this matters: when I look at any of these containers, for example the NGINX coming in from the service, in fact, if you look at this, as I highlighted, you can see, because we collect not only Prometheus metrics but also the flow metrics, the average response time between this and its corresponding service. So this is actually very useful, right? If you think about it, a lot of enterprises that use this kind of environment and ecosystem for monitoring and observability can see that dependency right away. And why do I say that? Because when we're doing the root cause analysis, if this cart service is not involved and there's a problem over here, then I don't have to look at it. Or in some other application service, like here, I don't need to look at this. But if the data is flowing in and there's a problem, I know what to look at. I have narrowed down the focus and I can see visually what's going on. So that's one key part: knowing the topology and the dependency is the first thing that most of us do. This is what an expert SRE will say: I don't need to look here if the problem is over there. Well, how do I know that?
In real time, as Kubernetes is changing, I should be able to see this dependency and how the data is flowing. So let's go one level deeper. I'm going to zoom in on the shopping cart app; hopefully you can see what I'm doing there. And if I look at that, and I'm going to shift my screen so we can see, the key here is being able to see, for this container, all the telemetry in context, which is what I was saying earlier. I'm going to move this up a bit. So metrics: what are the metrics coming in? Obviously, from Prometheus, et cetera, I can get that. If there are events; I know this one doesn't have any. Logs: what are the logs in there? I can look for specific logs, for example, is there a problem on this container, on this NGINX, has something happened? Anything that I'm recording, I can look for errors, et cetera. I'm not going to go back, but I had that in context, in what we call a quick view. What is it talking to? In case you're not seeing it: because we are capturing those flow metrics, and we can disambiguate Kubernetes namespaces, we can say, hey, what is the IP address it's coming from? It's inbound. You can see that it's coming from this NGINX controller, the bytes, et cetera. Sorry, Libby, I'm not seeing it again; it might just need to reload, but I just want to make sure, because I might be going ahead while things haven't updated. I could see it. I could see the screen right now. Can you see it, maybe? Yeah, I can see it. All right, let's ask the audience. Audience, if you can pop it in the chat, because I'm moving on. Okay, they can see it. It's my connection then. All right, I probably need to get a better connection, sorry, Libby.

So what I was getting at is putting things in context. We talked about metrics and logs, but also the connections, because this is what we want to know when we are trying to see who's talking to whom and where the problem is. In this case, just to quickly summarize: I know what is inbound, I can disambiguate that, I even know how much data is coming in, and what else is coming. And actually, if you look at this, if I click on this, let me see if I can find the right way; it might be hard to do while I'm presenting. On this pod itself there are multiple connections, because there are different ports involved. And outbound, as you can see, it's talking to this other NGINX controller; you can see that on the screen. So what we were trying to point out is, and if there are trace paths involved, if I want to go look at the TCP address, I don't have one explicitly here, we can get trace paths as well as in-service performance. More interestingly, what about the configuration? We can pull in the actual Kubernetes manifest, so we know what has been designated: how much the resources are, what volumes it's talking to, whether things are healthy, what the rate is at which we are scraping, and the timeout settings. All of that information, including the namespace, everything, is right at your fingertips, so you don't have to go switch contexts and run kubectl commands, right? This is important. And any changes that happen, we will update and present here. So having everything together in context is very, very useful.
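As an aside on the configuration quick view just described: the same requests, limits, probes, and image information can be read straight from the API server, which is roughly what you would otherwise do with kubectl. A minimal sketch with the Kubernetes Python client follows; the deployment name and namespace are hypothetical.

```python
# Minimal sketch: pull requests/limits, image, and probe settings for a workload,
# similar to what the demo's quick view shows. Names are assumptions.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

dep = apps.read_namespaced_deployment("cart-server", "shopping-cart")
for c in dep.spec.template.spec.containers:
    print("container:", c.name, "image:", c.image)
    print("  requests:", (c.resources.requests or {}) if c.resources else {})
    print("  limits:  ", (c.resources.limits or {}) if c.resources else {})
    if c.readiness_probe:
        print("  readiness probe:", c.readiness_probe.http_get or c.readiness_probe.tcp_socket)
```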
I'll do one more switch before I get into the root cause problem. So for this application, and as I showed, there are multiple containers in this environment: what does the node map say? The node map says where those applications are sitting. So, for example, I was looking at NGINX, and if I want to find it, I can look and figure out there are five nodes here, and I can see which of these nodes have which containers and, more interestingly, how much is being used. So another view that we're able to pull together from that is usage versus requests, and which have the highest requests, and what the request and limit are, et cetera, both for CPU and memory and on a node-by-node basis. So we can see whether you're over-provisioning or under-provisioning, right? For example, here, the request on this Prometheus node exporter is set at 200 and the usage is already exceeding it. The reason it's red is because it's Burstable, and it might be prone to eviction. So that gives you another view, to help right-size the environment.

So going back to this application: what happens when we have a problem? In order to show that, I'm going to jump into, not this, but the alert view. So I'm going to switch the screen again; let me know if it's showing up. I actually picked up an alert at 10:59:15. So Nick, I'm going to rely on you: can you see that my screen has changed to the alert window? Yes, it has. All right. So the primary example that we'll use today is an RCA analysis that we are doing on a service level objective breach. We do this automatically because we're collecting flow metrics on that ingress on the shopping cart I showed you, and it ran a little while ago, so I can just go through it. If I click on it, this is where things start getting interesting in terms of how this is automated. We capture alerts automatically based on explicit alerts from Kubernetes and the infrastructure, predictive alerts using ML, which I'll talk about in a few minutes, but also when there are delays on the service level indicators, and I will go back in a minute and show that to you. Actually, I should do that now. Let me go back to this example here and show you that for this service, the one feeding into that SLO on the ingress, there is something called SLO/SLI. You can see that something has been set: the suggested value is this one, and that's determined automatically by analysis by the system. You can also look at what the current max is. The user has manually set it at four seconds, and you can obviously change that, because if you're looking at an outbound connection that is customer facing, you may want to set it explicitly. So that's where we can set SLOs, right in this app map. And so what we are looking at now, sorry, I'm switching back to the tab here. Can you see my tab again? I think I switched the tab. Yes, it shows the alert detail. Yeah. So I'm going to share this tab. What I was going to show you, I think I've missed it: on this tab, from the data coming in on the ingress side, we can detect an SLO/SLI breach. And while we can do this using machine learning and what the expected value should be, here someone has set it at four seconds, and this is what we'll use as the example of where the breach is. So the service level objective for this application, on the ingress side, has been set at four seconds. All right.
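As an aside, the kind of latency SLO check being described, comparing recent ingress response times against a four-second objective, can be expressed very simply. Here is a minimal sketch against a Prometheus HTTP API; the server address and the flow-latency metric name are assumptions, not OpsCruise's actual implementation.

```python
# Minimal sketch of a latency-SLO check: compare the max ingress response time
# over a recent window against a 4-second objective. Metric name is hypothetical.
import requests

PROM = "http://localhost:9090"
SLO_MS = 4000.0

query = 'max_over_time(flow_response_time_milliseconds{service="shopping-cart-ingress"}[10m])'
resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    observed = float(series["value"][1])
    if observed > SLO_MS:
        print(f"SLO breach: {observed:.0f} ms > {SLO_MS:.0f} ms objective")
    else:
        print(f"Within SLO: {observed:.0f} ms")
```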
Now I'm going to switch tabs, so bear with me, and I'm back to that alert I showed you. I apologize, I'm jumping back and forth, but I'm trying to give you context on what the SLO was and what the system does automatically. So for that SLO breach on the shopping cart app, there is a breach detected on the flow. If you see it, it says it was automatically detected, because, as you can see, against the four-second SLO we are at 6.657 seconds. There's a lot more detail here. Obviously this one is a very simple rule, and you can see how it's set up: the max across what's coming in on that graph, based on the response time from the flow, and if it's more than 4,000 milliseconds, that's what triggered it. It's a latency-based alert. And all of the details and aspects that we already have in context are given here. But the most important thing from the user's perspective is: we saw an SLO breach. The question is, how do I know what caused the SLO breach? So here's what we've done in this system: the decision tree that's running in the background, the AI engine, does this analysis. And this is where it starts getting interesting. So Nick, I'm assuming the screen has changed. Can you see the screen change? I'm on the analyze tab.

So now this is interesting. What you're seeing here is done automatically in the background whenever an SLO alert kicks off. Remember, the decision plan is kicked off when we detect a problem. It is actually saying, okay, for that service, there are five connection paths, and the highest-latency path has been pulled out. The whole idea is that if you know the context, you can narrow down the path. And everything that's red here is basically saying all of these are high latency along that path. So I'm going to close that, and you can actually see laid out here, in this high-latency path, NGINX coming in, to its pod and container, to the web server service, its pod and container, to a cart cache, a caching element, its pod and container, to the cart server, its pod and container, down to the database server. And then, oh, I think, am I on the right one? I might be on the wrong one, actually. Let me go back and see if I've got the right alert here. Let me just check. Okay, I think I have a different alert here, but the one I want to show you is slightly different. So let me go back and retrace. That one is also an interesting one, but it's not a Kubernetes one. The one I want to use, I apologize, I'm going to do this in real time, I think it's this one, which has a Kubernetes-related issue. This one also has a breach, even higher: the same case at a different time, 15.454 seconds, exceeding the four-second service level. When you look at the analyze tab, the same thing happens, slightly different. What it's saying is that in this case there were three connection paths, and the highest-latency path was pulled out by the system using that topology. If I close that, in fact, you'll see one, two, three, four, five, six, seven, eight alerts detected in red, and you can see where the breach happened and during what time. Now, starting here, as I said, similar to what I was showing earlier, it goes from NGINX to its pod and container, to the web server pod and container, and so on, all the way to the back end, the cart server. What's interesting to note is that that is not the only path.
So normally a user would actually go and ask, hey, what are the possible paths that this NGINX container, sorry, this service, is dependent on? And there are three paths, right? You could have more, you could have 10 or 15, depending on the complexity of the application. But the reality is that what slowed down is the path with the highest latency, and not surprisingly, that's the one that's got the other alerts. The system can then isolate it and say, let me just show you the one that's really relevant, and that's key. That's the first thing the decision system and automatic causal analysis should do. So let's walk through it. Obviously, this is a situation where it was slow; we've already detected that. If I go to the web server, it says, hey, this is slow because its response time is higher. So the service is slow; that's not surprising. If you analyze it, it'll say, yep, it's slow because the response time coming back from downstream has slowed things down. So that's not surprising: this slowed down because the thing downstream did; that's expected. How about the next one? That is also high. This one also has an SLO breach, this cart cache. Let's look at that again. This one also has its own analysis tab. This is also higher than expected, and this one, I believe, is an ML-based alert; its response time has been higher than expected, and it captured that. It says, okay, there are further dependencies further down slowing everything down. Going back again to that source, let's go down to the next one. What about this cart cache? This says the network metrics are not normal. And if I click on it, you'll see there is something called an RCA tab. That has been analyzed, beyond just being high, meaning triggered based on the expected service level of response time. If I click on it, this is where it starts getting interesting. What you're seeing here is what we call our fishbone analysis. That means it's categorizing what could be wrong with that container based on memory, CPU, file system, any IO dependence; the demand side, meaning what is coming in and the responses to that; supply, meaning in our case what is going on downstream; and even configuration changes. And if you look at it, one of the things that sticks out is, hey, the number of errors has actually increased in this case. That's one. Second, the packets here decreased: the data that was going outbound has dropped, a 100 percent decrease. There's nothing coming in and neither is anything going out. So this is a suspicion for why the network-level metrics fired. And this has been detected just by having an expected behavior model that was learned over the past. The reason is, if you think about threshold-based alerts, they usually are high-water marks, and the challenge with that is that when things drop below normal, you normally won't detect it. But if you know the expected behavior, and data comes in but doesn't go out, or isn't received on the downstream side, that's where an expected, predicted behavior model helps. So going back to the detection: the reason we have this, and the analysis says there's something wrong, is the fact that there are more errors coming in. So going back now to that chain again: that's what we detected here. What does the cart server say? Now we're starting to see something. It says the cart server does not have any pod to serve requests. So if you are an SRE and you know this dependency: why is this slowing down, and why is there no data? Well, if there is no pod here, there is no data coming back.
Neither are any requests coming in. So that itself is telling us that the network metric anomaly detected on the ML side and this pod not serving requests are connected. So we have a problem further downstream, until we come here. So if I go to this container, it says, well, in fact, I've already given you the answer: there's an image pull back-off error here. And if you look at this, it says, hey, the cart server container is terminated. In fact, if I click on it, it will say, actually, I don't have a container for that; it's terminated. And if it's terminated, of course, that means I'm not going to have anything from the cart server responding back. If I go to the container now, we can start looking at what has happened. This analysis will now say what is going on here: there is no pod running. And if I click on analyze, we use the same fishbone, except it's not for a container; it's specific to the Kubernetes problem. And if you look at this, the same fishbone that you saw for the container side has these classes: pod schedule failures, node failures, startup failures, and runtime failures. This is predefined. Remember, we talked about curated knowledge: someone who is familiar with Kubernetes knows how startup failures can happen, deployment failures, runtime failures. And if you look at this, it says it is constantly in transition and the container is not ready. We are collecting this from the Kubernetes state events. There is a back-off restart. Why is there a back-off restart? Because the image is not loading; it goes back and tries to pull it again, the container's not ready, the pod doesn't come up. And what happens further down? The service that calls it does not respond. There is one more thing that's detected automatically and dynamically: there is an image name problem, the image name is incorrect. And because it's an invalid image name, not surprisingly, it's continuously trying to pull it and never getting ready, which means, as a result, if you go back to the problem and look at it, you can now see what has happened. The image did not load correctly: a Kubernetes issue. The pod did not come up. The cart server has no pods. The cache starts seeing network anomalies and errors. That increases the response time, and as a domino effect everything backs up all the way to the NGINX at the front.

I know we'll run out of time, but what I wanted to point out is: unless you're able to pull together all of this information in context, eliminate the irrelevant paths, look at the dependencies, and analyze each of them, you're really stuck. If you notice, we looked at flow metrics to understand why this was slow, ML-based metrics on the cart cache to understand why that problem happened, and Kubernetes state metrics to figure out why, and to analyze it and say, hey, if you fix that image, the pod will come up, the pod comes up, the cart server can respond, and then all of these issues will be straightened out. There's an old saying I think you've heard: for want of a nail, the battle was lost. The nail here was a bad image name, and that propagated all the way up to create an SLO breach. I wanted to give you that because I know I wanted to show a couple more examples, but having specific contextual knowledge of the system, analyzing things in sequence, and bringing all of that together is the key for us to be able to understand how this problem works, right?
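As a practical aside, the root cause found in this walkthrough, a pod stuck waiting on a bad image, is something you can also surface directly from the Kubernetes API. Here is a minimal sketch that scans a namespace for containers waiting with image-pull reasons and prints the related events; the namespace is an assumption.

```python
# Minimal sketch: find pods stuck the way the demo's cart server was, i.e. waiting
# with ImagePullBackOff / InvalidImageName, and print the events behind them.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

BAD_REASONS = {"ImagePullBackOff", "ErrImagePull", "InvalidImageName"}
NAMESPACE = "shopping-cart"  # hypothetical namespace

for pod in core.list_namespaced_pod(NAMESPACE).items:
    for cs in (pod.status.container_statuses or []):
        waiting = cs.state.waiting
        if waiting and waiting.reason in BAD_REASONS:
            print(f"{pod.metadata.name}/{cs.name}: {waiting.reason} ({waiting.message})")
            events = core.list_namespaced_event(
                NAMESPACE,
                field_selector=f"involvedObject.name={pod.metadata.name}",
            )
            for ev in events.items:
                print("  event:", ev.reason, "-", ev.message)
```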
So I'm going to go back and bring up the other deck here. The whole idea is that in order to automate the causal analysis, you have to really leverage knowledge about the stack: understand how containers and nodes work, how Kubernetes works, understand dependencies, understand alerts, metrics, logs, flows, and traces, if you have them, in context, because that will give you the context and those dependencies, and understand expected behavior. Once you have that, when you see a specific kind of problem, you can narrow it down and then, as in the examples I showed you, eliminate all the other paths to get to that point. So that's how we can automate this causal analysis. As I said, it can never be perfect, because you don't have 100 percent of the information at 100 percent granularity in time, but you can use all of this information to solve the problem. In this case, we were not even using traces.

So just to summarize before we go into Q&A and open it up for discussion: as you're probably well aware if you're using Kubernetes, there are multiple issues that impact application performance, and the cardinality, the complexity of space and time, and the dynamism make this RCA quite challenging. We can't solve the problem with blind correlations; that's not going to help, because the number of possible correlations is very large; we have a cardinality problem there as well. But if you look at how the people who do this well resolve it, it's really, as I said, follow the breadcrumbs and eliminate things that are not relevant. And that's where a decision system is needed, one that runs at runtime and does that automatically for you, so you're not spending the time. Leveraging curated knowledge is absolutely important; blind correlations and blind ML do not work, they will just lead you down false paths. And that means understanding the full telemetry and how the configuration is changing. And finally, the message for all of you who are following the CNCF: you can leverage all of the open-source CNCF projects; we've just pulled that information together to do this. So I'm going to open it up for Q&A and switch back. Are there any questions we can address?

Go ahead and leave your questions in the chat if you're looking to ask something. Please post your question. Oh look, there is a question from Oliver: does OpsCruise show metrics for services in the cloud provider that hosts the Kubernetes cluster, for example AWS SQS and things like that? Yes, we do. Let me share this tab. Yes, of course you can, because you can collect the data. What we do then is use the data that we're getting on the infrastructure. So the example I'm showing here, Oliver, is what we collect. For example, there's the load balancer; you have to specify to the cloud whether it's one or the other. So here's an example of collecting data from AWS on the load balancer that we showed. It's also visible, Oliver, in the app map we were talking about. For example, these are the Postgres metrics, and you can get the queue depth, et cetera, from the cloud vendor itself. Obviously it's not Prometheus open source, so we have to be able to pull that in ourselves. Hope that answers your question, Oliver. Anyone else have questions? Great, I'll give everybody just a minute. If there are questions, I'll just post this in case folks want to follow up later, want to talk to us, or have a general question on CNCF, metrics, monitoring, whatever.
Any questions for Nick? You can put them in and we'll select them, okay. There's one thing I do want to mention: OpsCruise is free, by the way. Yeah, I'll just say OpsCruise is free to use if you want to sign up; it's a freemium model. So I'll just put that in there. I know some people are concerned this is just a product pitch, but it is free. And again, it's built totally on open source. You can do what we are talking about yourself, and if you don't want to, you can obviously get the free version as well. All right, thanks guys. All right, does anyone else have a question? I think that's it. Thank you, Alok and Nick, so much. Thank you everyone for joining us. Thanks everyone for joining. Thank you very much; feel free to reach out. All right. All right, and we will see you next time. Thank you so much. Thanks, Libby. Bye. Thanks, Nick. I will sign off.