Welcome to this demo of using Sloop for monitoring highly available services. I'm Sana Javad, Lead Software Engineer at Salesforce. Our team provides Kubernetes-based platforms for Salesforce teams. My colleague Hemant has joined me for this demo.

Hello, I'm Hemant. I'm a Principal Software Engineer on the core infrastructure team, and my team uses Sloop for monitoring Kubernetes clusters that run a mission-critical service for the CRM application.

Who is this demo for? It's for folks who operate Kubernetes clusters or run their applications on top of them. We will briefly cover the architecture of Sloop, followed by a live-site incident that we debugged using Sloop, getting the information we needed in a matter of minutes. We hope that by the end of this talk you will have picked up some tips and techniques for investigating Kubernetes events using Sloop. So let's get started.

Here is a high-level overview of a Kubernetes cluster. It consists of multiple worker nodes, which run containerized applications in pods. These nodes and pods are managed by the kube-apiserver, which is part of the control plane. The cluster state, including where pods are running, is ephemeral in nature, and workloads can run on any node at any time. Kubernetes has a great set of tools for debugging issues currently happening in the cluster, for example Prometheus, the Kubernetes dashboard, or kubectl commands. But often, when a live-site incident occurs, the main focus is on mitigation, and root cause analysis is left for later. That becomes particularly challenging for incidents caused by Kubernetes events in the cluster, because that information is kept on the cluster for only one hour. After that, the only way to work out what happened is to pull the various control plane logs and correlate them by timestamp, which makes it hard to find the root cause.

So there was a need for a tool that could show the Kubernetes resource lifecycle with timestamps, show all the Kubernetes resources related to an event or a deployment in one place, and present it through an interface of bar charts and diagrams that is easy to understand at a glance. Our team at Salesforce released such a tool as an open source project called Sloop. It has around 2.8 million Docker pulls. Sloop is the only tool on the market that provides timelines of Kubernetes resources with timestamps, together with the payload of exactly what each resource looked like in the cluster. The way it works is that you deploy it in your cluster, it collects this information on a regular basis, and it can store it for around two weeks.

For example, in this view you can see Sloop running against this cluster, and you have the option to select different time ranges, from the last hour back to two weeks. You can also filter resources by namespace, filter by Kubernetes kind, and sort the results. All resources of a particular kind are shown in one color: here, the red bars are deployment objects and the green ones are the corresponding pods. So you can see that a deployment happened at this point in time, and exactly when each pod was created. The green pills show that a Kubernetes event occurred at that point in time in the cluster, and if you click on one you get the details. The yellow ones indicate that something was going wrong at that time, a warning or an error in the Kubernetes event, which needs investigation.
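If you want to try this against your own cluster, the quickest route when we recorded this was the public Docker image described in the project README; treat the exact image name and flags below as a snapshot of those docs rather than gospel, and check github.com/salesforce/sloop for the current instructions.

```sh
# Run Sloop locally against whatever cluster your kubeconfig points at.
# Image name and flags follow the Sloop README; verify before relying on them.
docker run --rm -it \
  -p 8080:8080 \
  -v ~/.kube/:/kube/ \
  -e KUBECONFIG=/kube/config \
  sloopimage/sloop
# Then browse to http://localhost:8080 for the timeline UI.
```

Running it in-cluster as a deployment works the same way; it just needs a service account with read and watch access to the resources you care about.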
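For contrast, this is roughly what you get out of the box: events that the API server garbage-collects after its TTL, which is where the one-hour window mentioned earlier comes from (the kube-apiserver --event-ttl flag defaults to one hour).

```sh
# Events still retained for a namespace, oldest first; anything past the
# API server's --event-ttl (one hour by default) is already gone.
kubectl get events -n <namespace> --sort-by=.lastTimestamp

# You can watch events live, but there is no history once they expire.
kubectl get events -A --watch
```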
Here is an overview of the Sloop architecture. It's a self-contained service with no dependency on external data storage. It has multiple layers. The first layer is ingestion, which keeps a watch on the kube-apiserver and regularly tracks the state of the different objects in the cluster. Once this information is received, it is processed and then stored in BadgerDB. BadgerDB is also an open source key-value store, so you don't need an external database here.

Now I will hand over to Hemant, who will share some of our favorite live-site incidents that happened at Salesforce and how we used Sloop to investigate them and get the required information in a matter of minutes. Thank you.

We're going to show some real incidents that happened in production, simulated here in a non-corporate, personal AWS account, because we obviously cannot share the corporate account setup. We have tried to give these incidents funny names, but they were not fun at all while we were debugging them.

The first incident is one where the service went down because the leader pods of the service kept getting bounced. Digging further, we found this was happening because the ephemeral storage, the emptyDir volume mounted by the pod, was being filled up by the container using it inside the pod. Right off the bat it was hard to figure out why, because the pod and container logs showed nothing; this was happening at the Kubernetes level. The various control plane logs that record the events were helpful, but combing through them takes a lot of time. Let us see how easy this was to figure out using the Sloop UI.

I'm going to switch to the Sloop UI view. The stateful set we are interested in is "eviction", and it has four pods. We see that at this moment something happened to the eviction-0 pod, and clicking on it and looking at the surrounding events, we can see it got evicted. And why? Because its ephemeral storage was getting full. This points directly at the cron job that was running at the same time and filling up the ephemeral storage. We could quickly see exactly what was filling it up and change the design so that it doesn't happen again. As you can see, the UI makes looking up and correlating these events very easy.
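One guard-rail worth knowing in this area: emptyDir accepts a sizeLimit, and the kubelet evicts a pod whose usage exceeds it, which is exactly the kind of eviction event Sloop surfaced here. A minimal sketch of the failure mode follows, with hypothetical names rather than our production spec.

```sh
# Hypothetical repro: the container writes past the emptyDir sizeLimit,
# and the kubelet evicts the pod for exceeding its ephemeral storage.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: evict-demo           # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: writer
    image: busybox
    # Write ~200Mi into the volume, double the 100Mi sizeLimit below.
    command: ["sh", "-c", "dd if=/dev/zero of=/scratch/fill bs=1M count=200 && sleep 3600"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 100Mi       # kubelet evicts the pod once usage exceeds this
EOF

# Shortly afterwards the eviction appears in the events (while they last):
kubectl get events --field-selector reason=Evicted
```

In the incident itself the design fix went the other way, bounding what the cron job writes so the leader pods' emptyDir never fills up, but a sizeLimit at least turns a mystery into an explicit, attributable eviction.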
Right, let's look at another incident. In this second one, we had unexpected restarts of pods running on nodes in a completely different zone from the one where we were running maintenance during off-peak hours. Digging further, we saw that the cloud provider had been rebalancing nodes: when the maintenance zone ran out of capacity for the instance type, it had brought up extra nodes in the other zone to hold the total desired node count, and later shrank them back. To piece this together we had to look at the control plane logs, the activity on the auto scaling group, the pod logs, and so on. Let us see how this is simplified by looking at the UI.

The stateful set we are interested in here is autobalance-sts. You can see that everything was running fine for this stateful set: it has four pods, autobalance-sts-0 through autobalance-sts-3. At this moment the maintenance happened, and we expected these pods to go into Pending state while finding another node, but they stayed down for a long time; they didn't find a node to run on. We also noticed that autobalance-sts-1 and autobalance-sts-3 had restarts, and they weren't supposed to have any restarts at all.

To see why, we move to the node view, where we can see that this node, sitting in zone 2c, got terminated at the very instant we saw the unexpected pod restart. The same is true for the other node: if you look at its details, it was also in 2c when the termination happened. We also noticed that the new nodes that came up after the maintenance were in an unexpected zone: they were supposed to come up in 2a, the maintenance zone, but the replacements here, if you look at the details, were in 2c. This was the case for all the nodes that came up as a result; all four of them were in 2c. So clearly AWS, the cloud provider here, was trying to hold the desired count by bringing up nodes in the remaining zone. Eventually, though, it found capacity in 2a, and this is the point where it brought up the 2a node. But around that same time, these two nodes that were running fine in 2c got terminated, while the newly launched ones carried on. What this shows is that the rebalancing activity took down nodes in 2c to shrink the extra capacity that had come up in the surviving zone, and in doing so it took out nodes that were running fine and needn't have been terminated. The cloud provider couldn't know that, because it has no idea what workloads run on those nodes; as far as it was concerned there were four nodes that had to come down to two, and the two it terminated happened to be the ones that took the service down. Having this UI makes correlating all of this straightforward.
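Outside of Sloop, the node-to-zone mapping that makes this story legible comes from standard node labels, which is handy for cross-checking the timeline while the raw events are still inside their TTL (older clusters may carry the legacy failure-domain.beta.kubernetes.io/zone label instead).

```sh
# Show each node with its availability zone via the standard topology label.
kubectl get nodes -L topology.kubernetes.io/zone

# Recent node-level events (launches, terminations), newest at the bottom;
# only visible while they are still inside the event TTL.
kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp
```

On AWS specifically, this termination behavior comes from the auto scaling group's AZRebalance process; suspending AZRebalance on the ASG is one way to keep rebalancing from reaping healthy nodes like this.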
Let's switch to the third incident. This one is less of an incident and more of a scenario you can debug using the Sloop UI: a pod stuck in Pending state, for whatever reason. It could be failed webhooks, failed mount points, the remaining nodes not meeting the pod's resource requirements, and so on. If that happens at off-peak hours when you're not watching, and you have to debug it later, it really helps to have that history in the UI. One such example is in this Sloop view: going back to the pod view, there was an incident where eviction-1 was stuck in Pending for a long time, and digging in we can see it was due to a failed mount point. See this message, the Multi-Attach error for the volume. These messages are right there when you pull up the UI, so you can see immediately why the pod is Pending at that moment, instead of digging through a pile of logs trying to correlate exactly what happened at that time. So in this case the mount point was failing to attach, and when this happens we noticed the pod would usually stay stuck for about 20 minutes, due to an EKS bug.

So those are the examples we wanted to show. With Sloop you can cut debugging time dramatically. Please try Sloop, join the community, contribute if you can, and happy playing with Sloop!