Welcome to our session. Our session is titled "Prompt: Help Me Debug a Cluster." My name is Anusha Raghunathan, and with me is Lily Van, and we're both software engineers working at Intuit. And I'm trying to figure out what a good position for this mic would be. All right, this is comfortable.

Here's the agenda for today. We'll start with a background on Intuit and its infrastructure at a glance. Then we'll dive into alerts and what we've called cluster golden signals, and how we've used them to alleviate some of the problems of a platform engineer. Then we'll dive deep into how to use AI for platform debugging and our initial experience with this. We'll have a demo following that, and we'll finish off with takeaways.

Intuit is a global fintech company that builds several financial products and services. So if you have ever used TurboTax to file your taxes or QuickBooks for accounting and payroll, know that they are all running on Kubernetes-based platform infrastructure. Lily and I are part of this platform team. And here are a few numbers to show the scale at which we operate. I'd like to highlight that we support about 7,000 application developers within Intuit, and these developers are running about 2,500 production services. We have a lot more pre-prod services, but this is just production, and they go up to a bigger scale during our peak seasons. And these are running on about 315 Kubernetes clusters that have about 30,000 Kubernetes namespaces.

Now, this is a pretty large-scale operation. What does it take to observe such a large fleet of clusters? Even when there are no big change events happening in a Kubernetes cluster, it is challenging to observe it, as you might all know, because there are constant moving parts: pods are getting resized, and nodes are getting resized constantly because of Kubernetes resource optimization strategies. Now, add to that the change events that we have. At Intuit, we have a few of these. We have the monthly cluster upgrades that keep the Intuit Kubernetes fleet up to date with Kubernetes versions. We have AMI rotations in order to comply with security. Then we have a lot of cluster add-on revisions to keep up with the different features that we have as far as cluster apps and cluster add-ons, and then we have a core set of platform features that are built on top of it. All of these require cluster upgrades. And then we have something called season readiness: how do we prepare for a peak event, such as a Super Bowl event or, let's say, a tax peak event? There's a whole bunch of testing that goes on behind the scenes in order to prepare our platforms for such events, so those are huge change events for the platform. And then there's the actual peak season itself, where there are a lot of scaling events happening across several dimensions of the platform with respect to compute, network, storage, observability, and what have you. And finally, there are changes, expected or unexpected, from our cloud provider, AWS in this case.

Now, what does this mean for the life of a platform engineer? It means that, at a minimum, they are receiving and resolving hundreds of alerts on a weekly basis. So let's dive into alerts and cluster golden signals. Now, what are some of the concerns that a platform engineer might have, especially when they go on call? There is a bunch of components that we monitor in a Kubernetes cluster, whether it's per node with respect to CPU, memory, disk, network, or processes.
There are Kubernetes components that get monitored, and then there are pod lifecycles. As we all know, there are industry standards for this. So for the metric sources: for node components we use Heapster, for Kubernetes components we use Prometheus, and for pod lifecycle we use kube-state-metrics. And there's a lot more, but these are the main ones. And all of these alert our platform engineer. When they pass a particular threshold, the platform engineer then has to work on these alerts and look at different things such as Kubernetes events, Kubernetes logs, a whole bunch of different dashboards, and runbooks to actually remediate the problem at hand, and they potentially need cluster access for such remediation. Now, this can be overwhelming if you're doing this for about 100-plus alerts for a particular cluster and you have 315 clusters to manage. You do the math, that adds up, and our platform engineer is getting slightly overwhelmed at this point.

Now, let's make this interesting and throw in an incident. Who here likes to be on an incident call? Wow, that's interesting. Do you all work for PagerDuty or incident.io? Well, Lily and I don't like to be on incident calls. When we are, there are a bunch of other business-related questions that come in. How many services are impacted in this incident? Are the clusters that are running them healthy or not? Is this issue a service issue or a platform issue? Whom do we have to page additionally? And what is the blast radius of this issue? How many more services in the upstream or downstream dependencies are affected by it? At the end of that incident call, we're like this, constantly drowning in questions, alerts, and what have you.

Now, there are fundamentally two problems here. There is a longer time to detect problems because of alert fatigue, alert overdose, and false positives, which results in increased time to detect Kubernetes platform issues. Then there is a longer time to remediate these problems, because there is an abundance of data sources, and at the point in time when you're on an incident, you want to be able to quickly get to the root of the problem. The runbooks are not fully automated, and there is not much of a streamlined correlation between the different events that are happening in the cluster. So this results in an increase in MTTR.

Now, in order to solve the first part of the problem and reduce the alert fatigue from a sea of alerts, we've defined what are called cluster golden signals. This is basically a philosophy derived from service golden signals, where you have four pillars across which you can measure the health of a service. Similarly, we measure the health of a cluster across four pillars, and these four pillars are basically errors, saturation, latency, and traffic. And what we do is we have a collection of algorithms, quality metrics, and dashboards to provide a single pane of glass to observe the health of a cluster. And once the health of a cluster becomes degraded or critical, you have the option to get alerted on it. The idea is to filter out the whole noise of alerts and give a few good-quality signals that can in turn be used for alerting, mainly because we think that these alerts will end up causing incidents. Now, how did we do this? We identified the core critical components of a Kubernetes cluster and we bucketed them across several functionalities, such as control plane metrics, authentication, autoscaling, networking, critical add-ons, and so on.
And for each of these components, we basically generate a single Prometheus health golden signal. The health is generated based on error SLAs being breached or on error counts, and the health can have one of three values: it can be healthy, degraded, or critical. The overall health of a cluster is basically an aggregation of all of the critical cluster components. Note that a cluster can be healthy only if all of the critical components are healthy; even if one of the components is degraded or critical, the status changes to degraded or critical. And then we build dashboards to surface these metrics, and we set up alerting based on the health of the cluster. The idea is to actually automatically create incidents based on cluster golden signals being degraded or critical.

Here is a quick architecture overview. Here are the high-level groupings of the metrics: we have control plane, bootstrap add-on metrics, cluster add-ons, and then AWS-related metrics. For all of these, we've written Prometheus rules that specify, as Prometheus expressions, exactly what determines the health and when it breaches the health SLA. They are deployed onto the Kubernetes cluster in the Prometheus namespaces where the Prometheus servers are running, and the alert rules are established there. Then, when the alert conditions are met, we trigger an alert, and it also gets reflected in our dashboards.

Now, here is a quick screenshot of the health of the entire Kubernetes fleet. We have about 315 clusters, so the honeycomb view basically gives you a snapshot of what the health is at any particular point in time. The screenshot on the right shows the health of a specific individual cluster and all of the individual components in it, so you can drill down and get more information if something is degraded or critical.

Now, here is a sample Prometheus rule that shows the health golden signal and how it is calculated. In this case, it's basically an aggregation of those critical components that I had mentioned. And taking a close look at a single cluster component, let's take a look at the CoreDNS add-on. In this case, we determine that CoreDNS has actually failed by looking at the error SLA: we look at all of the total responses that CoreDNS gives over a five-minute window, and we look at how many SERVFAIL errors were returned in that particular window. If that percentage breaches the SLA, let's say the success rate goes below 95%, then we determine that it's actually a bad situation for CoreDNS, mainly because, as we all know, most of these Kubernetes components reconcile on a periodic basis and they sort of get out of the error situation if possible. There are reconcilable errors, and then there are ones that you cannot recover from. So we look for those non-recoverable errors and are able to determine that and alert based on that.
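To make that CoreDNS example concrete, here is a minimal sketch of evaluating such a health expression against Prometheus from Go. The exact rule and metric names from the slides aren't in this transcript, so the PromQL below is an assumption based on CoreDNS's standard coredns_dns_responses_total counter with an rcode label, together with the 95% threshold mentioned above; the real Intuit rules live in their PrometheusRule manifests.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// Share of CoreDNS responses over the last five minutes that were not SERVFAIL.
// In the real setup this would be a Prometheus recording or alerting rule; the
// metric name here is an assumption based on standard CoreDNS metrics.
const corednsSuccessRatio = `
  1 - (
    sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
    /
    sum(rate(coredns_dns_responses_total[5m]))
  )`

func main() {
	// Address is a placeholder for wherever the Prometheus server runs.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.prometheus.svc:9090"})
	if err != nil {
		log.Fatal(err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promv1.NewAPI(client).Query(ctx, corednsSuccessRatio, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Println("query warnings:", warnings)
	}
	// Below 0.95 we would mark the CoreDNS component, and therefore the
	// cluster golden signal, as degraded or critical.
	if vec, ok := result.(model.Vector); ok && len(vec) > 0 {
		ratio := float64(vec[0].Value)
		fmt.Printf("CoreDNS success ratio: %.4f (healthy: %v)\n", ratio, ratio >= 0.95)
	}
}
```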
Now, what a platform engineer really wants is to lower MTTD by using cluster golden signals, and this part we've been able to achieve. But now that we've detected the problem, what can we do to actually speed up remediation? How can we actually get to the debugging and root cause of the problem? Hey, we know that CoreDNS has an issue, but what is actually causing those SERVFAIL errors? How can we get to that? Why is a component failing? Identifying the root cause and remediating the problem is fundamental to actually lowering the MTTR.

For this, we started looking at AI for platform debugging. I mean, when I go on call and I'm not able to find my answers in my private runbooks, I'm actually looking to the internet for a plethora of information, whether it's errors in my Kubernetes logs or events, whether it's Prometheus metrics, whether it's any knowledge-base articles that open source enthusiasts or cloud providers or anyone in between has written. So I'm constantly looking for text information out there on the internet. And similarly, we have runbooks within Intuit that have a plethora of information, but they're just not as streamlined as I would expect. So can I use the public information and the private information that's in there to streamline my on-call experience? Can I actually use the cluster golden signal to trigger something for deeper debugging? Can I use Prometheus remote-write capabilities to actually get some AI assistance? To talk about all of this, I'd like to call Lily to take over.

Thanks, Anusha. So as Anusha mentioned in the previous slides, we can summarize that to debug a cluster issue, we need three steps. First, we want to identify the failing components. Second, we want to identify the root cause. And finally, we want to know the remediation steps. So let's first take a look at an overall solution for those three steps, and we will dive into each step after that.

Let's say we have a Kubernetes cluster; how do we detect the problem today? We aggregate the metrics with Prometheus rules to detect the cluster error. So whenever there's an error happening in a cluster, a metric will be triggered, and the Prometheus server will capture the corresponding metric. Now we have detected the problem, and we want to run some deeper checks to figure out the root cause or more error details. We have deployed a debug namespace in the same cluster, with a lightweight Golang service called second check running, and we have configured Prometheus remote write to send a request to the second check service. After the second check service receives the request, it will talk to another service called K8sGPT, which is an open source tool for scanning your Kubernetes cluster and helping you diagnose and triage issues. It can scan your target namespaces and report the error messages, like the pod error logs, or that a pod is in a crash loop. And then finally, we would like to know how to fix the issue, or what the next steps are. So K8sGPT integrates with a couple of public LLMs to get the remediation steps for an error. And in addition to that, we also leverage our internal private embedding service, so that we can get remediation from our private content. Finally, the second check service will aggregate the results from the private content and the results from the public content and then upload them to an S3 bucket, so the data can be consumed by an internal user platform, or it can also be used to enrich the private content. These are the steps I just described.

So let's take a closer look at the first step. Here is the detail of how we configure the Prometheus remote write. You can see that we have a queue config there with all the details, the URL points to the second check service, and it's looking for a metric that indicates CoreDNS pod restarts.
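On the receiving side, a Prometheus remote-write request is a snappy-compressed protobuf payload POSTed over HTTP, so a service like second check needs a small handler to decode it. The following is only a minimal sketch of such a receiver under that assumption, not Intuit's actual service; it just logs the metric names it sees, where the real service would kick off the deeper check.

```go
package main

import (
	"io"
	"log"
	"net/http"

	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

// handleRemoteWrite decodes a Prometheus remote-write payload and inspects
// the metric names inside it.
func handleRemoteWrite(w http.ResponseWriter, r *http.Request) {
	compressed, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	raw, err := snappy.Decode(nil, compressed)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	var req prompb.WriteRequest
	if err := req.Unmarshal(raw); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	for _, ts := range req.Timeseries {
		for _, label := range ts.Labels {
			if label.Name == "__name__" {
				// A real service would map this metric to a namespace and a
				// filter set and then call K8sGPT for a deeper check.
				log.Printf("received remote-write sample for metric %q", label.Value)
			}
		}
	}
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/receive", handleRemoteWrite)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```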
What does that metric mean? It means that if a CoreDNS pod is restarting in my cluster, the Prometheus server will receive the signal and send a remote-write request to the second check service.

Now, how does the second check service talk to K8sGPT? We run those two pods in the same namespace, so they can talk through a Kubernetes service, over a gRPC connection. Here's a code example of how you can use the gRPC client to talk to K8sGPT. We call an API called AnalyzeRequest; it provides the target namespace; an explain flag, meaning whether you want to enable the AI or not; the filters, an array of strings that selects the analyzers, such as pod, log, or node; and the backend, which says which AI backend you want to use.

Now let's take a deeper look at what K8sGPT is. It's an open source tool for scanning your cluster, diagnosing and triaging issues. It has a good array of analyzers that pull the relevant information from your Kubernetes object specs, such as pods, nodes, Kubernetes services, Kubernetes events, and even the live logs. And it integrates with different AI platforms to enrich your error messages or to get remediation suggestions. So here you can see from the chart that it can call the OpenAI API, it can also call the Google Gemini API, and in addition you can also set up your local LLM through the LocalAI interface.

So let's take a look at the sample output of two analyzers. This output does not have AI enabled; it simply scans the object spec and retrieves the information from the Kubernetes events or the spec status. For the cron job analyzer, it checks whether your cron job is running as expected, and for the deployment analyzer, it checks whether the desired replica count matches the replicas actually running and reports the error.

Let's also take a look at an example of the pod analyzer. If you go through the code, at a high level it will fetch all the pods in the target namespaces, and for each pod it will go through each container and check the status for pending or crashing. Once it finds that a pod is not in a good state, it will fetch the latest Kubernetes events and aggregate all the filtered messages into the output. If you have the AI option enabled, it will create a prompt with all the error messages and ask the public LLM for a solution.
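As an illustration of that flow, here is a simplified client-go sketch in the same spirit: list the pods in a target namespace, flag containers that are stuck waiting, and attach their recent events. This is not the actual K8sGPT analyzer code, just a rough approximation of the steps described above.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// analyzePods collects human-readable findings for unhealthy pods in a
// namespace, roughly what would later be folded into an LLM prompt.
func analyzePods(ctx context.Context, client kubernetes.Interface, namespace string) ([]string, error) {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var findings []string
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			if cs.State.Waiting == nil {
				continue // container is running or terminated normally
			}
			finding := fmt.Sprintf("pod %s container %s is %s: %s",
				pod.Name, cs.Name, cs.State.Waiting.Reason, cs.State.Waiting.Message)
			// Pull the latest events for the unhealthy pod.
			events, err := client.CoreV1().Events(namespace).List(ctx, metav1.ListOptions{
				FieldSelector: "involvedObject.name=" + pod.Name,
			})
			if err == nil {
				for _, ev := range events.Items {
					finding += fmt.Sprintf("\n  event: %s: %s", ev.Reason, ev.Message)
				}
			}
			findings = append(findings, finding)
		}
	}
	return findings, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	findings, err := analyzePods(context.Background(), client, "kube-system")
	if err != nil {
		log.Fatal(err)
	}
	for _, f := range findings {
		fmt.Println(f) // these messages would become the error context in the prompt
	}
}
```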
After K8sGPT captures the error details, the next step is remediation. K8sGPT provides pretty good integration with public LLMs, which have rich context on general Kubernetes and AWS knowledge. However, it doesn't have any context on Intuit-specific Kubernetes cluster issues, because we have our own custom add-ons, our own networking configuration, and IAM role authentication. Therefore, in addition to the public LLMs, we also use AI with our private content, such as our runbooks and documents, which can help us solve the Intuit-specific Kubernetes cluster issues.

So let's take a look at the output of K8sGPT with the OpenAI API. We were using the GPT-4 32K model, and on the right is the prompt we are using to tell the AI what kind of format and solution we want. If you look at the example, it's the same analyzer I shared in the previous slides. It says that your deployment has one replica, but zero available. With the help of the AI, it actually added more details in the error section: it will say, okay, it may be due to various reasons, like pods not being scheduled, pods crashing, or pods failing. In the second part, it provides a step-by-step solution with example kubectl commands: you can check your deployment, check the logs of your pod, check your application health endpoint, and then it tells you how to fix the issue by using kubectl apply.

And then we also tried to deploy a local LLM with a Llama 2 33B model, which is running in the same cluster on a GPU. We were using the same prompt, and then we ran it against the same problem. It generates very similar solutions, like a bunch of kubectl commands so you can check the logs and check the deployment, but it actually added a few extra steps at the end: it says you can try to scale down and scale up your deployment again, or you can try to roll back to a previous version. So as we compare both results, they are both reasonable, and they are actually steps we take when we troubleshoot a cluster issue.

So moving to the private embeddings. Intuit has deployed a private embedding platform that uses embedding-based search to provide a service for getting an answer from your private content with AI. There are three steps here. The first step is data preparation: we can upload our runbooks and documents through its UI or API, and the service will break down your documents into chunks, create embeddings by calling the OpenAI embedding API, and store those vectors in a vector DB. The second step is data searching: when the second check service asks this service for an answer from my private content, the service will embed the query by using the same API, use the distance between the query embedding and the data embeddings to rank the content, and then return the top-N relevant content. The last step is to ask the AI: the relevant content will be added into a prompt and then sent to a public LLM to generate a response. So here, basically, we are feeding the model with relevant context through the model input. The prompt can be something like, please use the following context to answer the question X, Y, Z. These are the steps I just described.
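As a toy sketch of the second and third steps, ranking stored chunks against a query embedding and folding the winners into the prompt could look roughly like this. The vectors here are tiny stand-ins for what a real embedding API would return, and the prompt wording just mirrors the example above.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// Chunk is a piece of a runbook plus the embedding stored for it at ingestion time.
type Chunk struct {
	Text      string
	Embedding []float64
}

// cosine returns the cosine similarity between two vectors of equal length.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na)*math.Sqrt(nb) + 1e-12)
}

// topN ranks chunks by similarity to the query embedding and keeps the best n.
func topN(query []float64, chunks []Chunk, n int) []Chunk {
	sort.Slice(chunks, func(i, j int) bool {
		return cosine(query, chunks[i].Embedding) > cosine(query, chunks[j].Embedding)
	})
	if n > len(chunks) {
		n = len(chunks)
	}
	return chunks[:n]
}

// buildPrompt folds the relevant chunks into the model input.
func buildPrompt(relevant []Chunk, question string) string {
	var parts []string
	for _, c := range relevant {
		parts = append(parts, c.Text)
	}
	return fmt.Sprintf("Please use the following context to answer the question.\nContext:\n%s\nQuestion: %s",
		strings.Join(parts, "\n---\n"), question)
}

func main() {
	chunks := []Chunk{
		{Text: "If CoreDNS returns SERVFAIL, check upstream resolver health.", Embedding: []float64{0.9, 0.1}},
		{Text: "Rotate AMIs monthly to stay compliant with security policy.", Embedding: []float64{0.1, 0.9}},
	}
	query := []float64{0.85, 0.2} // stand-in for the embedded on-call question
	fmt.Println(buildPrompt(topN(query, chunks, 1), "Why is CoreDNS returning SERVFAIL?"))
}
```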
So let's look at a demo. Okay, so we have deployed two pods in the debug namespace, K8sGPT and the second check service, and we have created two services so that they can be accessed in the same namespace. Now let's check the Prometheus rules we have configured for this demo. They're in a different namespace, and we can see, under the remote write section, that we configured the URL to point to the second check endpoint, and it's looking for the metric that indicates CoreDNS pod restarts. Now we have to create an error situation where the CoreDNS pod is restarting or crashing. So let's check the CoreDNS pod status. Now we see that the CoreDNS pods are in a crash loop. We can describe the pod to get more details of why this pod is in a crash loop. Let's pick one of the pods; it's running in the kube-system namespace, and we can see it's in a crash loop because it's OOMKilled. At this point, a signal should be sent to the Prometheus server, and the Prometheus server will send the remote-write request to the second check service. Here we are looking at the logs of the second check service, and we can see that it receives the metric of the CoreDNS restart, OOMKilled. After it receives the metric, it will call K8sGPT to run a deeper check on the pod.

So then K8sGPT will scan the target namespace, which is kube-system here, check the CoreDNS pod spec, combine it with the Kubernetes events, generate the error message, and then ask the public LLM for a response. Here we're still using the OpenAI API with the GPT-4 32K model, and we are not using the private content in this demo, because OOMKilled is a pretty generic issue that general Kubernetes knowledge covers. Okay, now we receive the response from K8sGPT, and the second check service will upload these results to a remote S3 bucket. Then we can see, yeah, the results are there, and they're in JSON format. And they're pretty accurate results: it says that the CoreDNS pod is in a crash loop due to OOMKilled, you can inspect the logs with the kubectl commands, and it actually teaches you how to increase the memory limit by changing the resources section of the spec. So in the future we are thinking that, for a simple remediation step like increasing the memory limit, we probably can automate it through the Kubernetes API; a rough sketch of what that could look like follows after the takeaways. And of course, not every OOMKilled issue can be fixed by simply increasing your memory resources. So if you have a complicated use case, we would recommend you use your private content plus the public content so that you can get a more accurate answer.

We are also actively contributing back to K8sGPT upstream; there are a couple of features and fixes we have worked on. And on the roadmap, we're thinking we can add more AWS integrations to get details of, let's say, an EC2 instance status, describe the EKS API server status, and even the VPC configurations, which can help us troubleshoot the AWS issues in our clusters. So with that, I'm gonna pass the ball back to Anusha for takeaways.

Thank you, Lily. Great demo. All right, what are our three takeaways? We were able to achieve some streamlining using cluster golden signals, where we are able to get a bird's-eye view of the health of all of our clusters at the same time, and we were able to alert our platform engineers on incidents using this process. And by tying cluster golden signals together with LLM-based debugging, we are able to actually close the loop end to end by not only detecting early, but also getting remediation early. Our initial results with platform debugging using AI have been pretty promising, so we're gonna be investing more. The K8sGPT community has been pretty welcoming of our features and contributions, so we're gonna continue doing that. And as Lily said, a lot of this has to be peer reviewed. Right now, we're not applying any of the remediation that the LLM is throwing at us; it's going to be peer reviewed. We're also gonna be looking into things like RAG, as well as enriching our runbooks and using the embeddings-based Q&A, at this point in time. So thanks for attending our talk. And if you have questions, there are two mics on either side; we'll be happy to answer.
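For the curious, the kind of simple remediation mentioned in the demo, raising a memory limit through the Kubernetes API, might look roughly like this client-go sketch. The deployment name, namespace, and sizes are placeholders, and as noted above this is not something applied automatically today; it would go through peer review first.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Strategic-merge patch that raises the memory request/limit on the
	// coredns container; the names and sizes here are illustrative only.
	patch := []byte(`{"spec":{"template":{"spec":{"containers":[{"name":"coredns","resources":{"requests":{"memory":"256Mi"},"limits":{"memory":"512Mi"}}}]}}}}`)

	_, err = client.AppsV1().Deployments("kube-system").Patch(
		context.Background(), "coredns", types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("memory limits updated; the deployment will roll pods with the new resources")
}
```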
My question is: you mentioned RAG just now, and that was the thing I was thinking about. Are you planning for the tool to also, for example, while debugging, recover from something or automate something? Is that something you wanna try in the future, or do you want to just use it as it is today? So as far as automatically remediating, we wanna be careful there. We haven't tested it enough to actually apply the remediations and make sure that they work directly.

The RAG use case will be for making sure that we can check the accuracy of the results that we get. Again, these are things that we haven't explored yet, but definitely the RAG use case will be more for checking the accuracy of what we get back from the LLM, yeah. Okay, thank you. Thank you.

Thank you, great talk. Just one quick question. Are you also thinking of extending it to the logs, like scanning the logs that come on the standard out of the pods and linking these Prometheus metrics with the logs as well? And if you are thinking about that, how much does it influence the cost? Because log engines and analyzing the logs have a big cost involved. So currently K8sGPT already has the log analyzer, so it will query the last 100 lines of live log from the pod. We are also thinking to add an integration, because we are using Splunk, so we're thinking to add an integration to pull the logs from Splunk. But definitely with a limitation on the length of the log, and we also need to look for keywords, for example error, failed, and panic. So that definitely comes with a cost. I think we'll put some limitation on the context we want to pull from the logs, and also maybe there will be some threshold on the API calls if we integrate with Splunk. We need to be careful on that part as well. OK, thank you. So also, just to point out, this is near real time; we are not thinking about historic logs. There is a company SLA we're meeting, which is that every incident gets detected within about five minutes and remediated within less than an hour, or some such thing. So although the historic information and logs would be useful for training, when we are actually querying for immediate use, we wouldn't be looking too far into the history. It would be what you have locally, and then, depending on the use case, going out and getting slightly more than the time window. Yeah, makes sense. Thank you. I just want to add one more thing: I feel the logs are extremely helpful if you have runbooks that describe how to solve a specific problem according to those logs.

Hi, thank you for your talk. Have you experimented with adding business logic, like what service was affected, what, say, offering was affected, or what component was affected, and stuff like that? Yeah, so right now the components or the services that we're thinking about are limited to the cluster components, but we definitely want to extend it to business logic or business analysis. Because when there is an incident call, we get asked a lot of questions that we don't have out-of-the-box charts for, and then we have to scramble around and get some operational data from what we have. So that's definitely something we want to work on. Which clients were affected, how many? Yes, which region, which, yeah, exactly, yeah, definitely. Well, thank you again. Thanks. Thank you.

I have a question. We have one problem, but there might be a situation where multiple incidents get triggered for one particular problem, right? So when we use this K8sGPT, multiple solutions will be there, but actually the root cause is one. So then these incident calls sprawl: many incidents, but everybody's debugging different things, while the actual root cause is one. So how do the MTTR times you have mentioned hold up in this kind of situation? How do you deal with it? Because it keeps generating a lot of solutions, and you get into solution fatigue, or which one to pick? Yeah, yeah, that's a good question.
So we try and solve that using cluster golden signals, with the dashboards that we've built. Let's take a case in point: let's say we have an outage because of a CNI issue, and there are five different clusters that are affected by it, all in their own silos. When you actually look at the cluster golden signal dashboard, you can group it by the cluster component that is failing, and hopefully all five of these clusters would show up as red, because their health was degraded or critical. And they would fire alerts, so you would actually catch it a little ahead of time, rather than through K8sGPT. There is no deduplication of solutions at the K8sGPT level, but there is at the cluster golden signal level, where the CNI would be failing on the five clusters, and you could basically do the deeper debugging with that knowledge. Okay, so you use human cognitive ability on the side, on the panel; I mean to say, we do these things a little bit manually at this stage. Yeah, so you would have to look at the five solutions that are coming out of these clusters with the understanding that all five of them actually failed their CNI checks in the cluster golden signals, so far, yeah. But there's definitely room for improving that. Yeah, thank you so much, thank you. Thank you.

Yeah, also thanks a lot. My question: as far as I understood, the results from the AI analysis will be stored in an S3 bucket. How are the golden signal alerts mapped to the analyzed content, and how will the systems engineer then later on access the stuff in the S3 bucket? So basically, Intuit has some internal user platforms that can consume that data from the S3 bucket and present it to the user, maybe on a UI or something. And we can also read the data from the S3 bucket to automatically create runbooks from the public LLM and then enrich the private content for our private embeddings as well. So I think, in the future, when we get this system up and running, instead of S3 we can maybe add another DB, or MongoDB, to save it more properly, and it can be rendered through APIs and consumed by different services or other platforms within Intuit. The ideal target state we're thinking about is that when the platform engineer gets an alert, they actually have embedded metadata as part of their PagerDuty alert or something, with all of the enriched remediation steps in there as well. So the loop goes back to the alert, and we might be able to pull that off with PagerDuty metadata. Yeah. Thanks. Thank you. Any more questions? We're on time now. All right, thank you very much. Thank you. Thank you.