Okay, thanks for having me, and I hope you've had a good day so far. My name is Chris. I'm the Markets SRE Lead at JPMorgan, and I'm also responsible for the high-performance computing platform that we run within the markets organization, which powers a number of our businesses. I've got QR codes here for GitHub and LinkedIn if you want to connect, either now or later.

I'm going to talk today about operational considerations when you're supporting HPC or other GPU-type workloads in Kubernetes. I want to start by level-setting the room: I'm a site reliability engineer, so reliability to me is what I call a declarative engineering practice, meaning I always want to keep the desired state of my customers' happiness in mind in whatever I'm doing. That is much like Kubernetes being a declarative system: it's always working to maintain a desired state, and I think of site reliability engineering as the same kind of practice. Keep that in mind as we go through the slides today.

When we talk about Kubernetes in general and think of its origins, it was a lot of websites, APIs, and microservices architectures. That led us to frame customer happiness, or customer expectations, with a particular set of service level objectives: we often focused on availability, latency, error rates, and so forth, the ones you typically see in the Google SRE book. With HPC and AI it's a little different, because the workloads are different. Often you're submitting a task, training a model, rendering graphics, or doing quantitative analysis. These are long-running jobs with a beginning and an end, and sometimes the time delta between those is pretty long. So availability, saturation, duration, and freshness, which is a term for whether your data is stale, become super important for the HPC and AI workloads you're running on Kubernetes.

Underneath the hood it's largely the same Kubernetes, but how we operate, what I call day-two operations, in HPC and AI clusters is vastly different, and it poses a lot of questions. For example, are the GPU nodes in your cluster ready to take tasks through to completion? Because of the nature of those workloads, if one dies in the middle, you have to start the task all over again. That can add up to real performance problems, and a lot of cleanup if you have intermediary data stores and the like.

Another challenge with vanilla Kubernetes is that the kube-scheduler can't filter and score nodes for GPU workload placement the way it does for CPU and memory; it isn't scoring anything off the actual capacity or condition of the GPUs. CPU and memory are fine, but it does not have that for GPUs out of the box today. So you're left with two options. One is a custom scheduler that you write yourself, and there have been some great examples of that in the talks today. The other is something we're going to talk about in a little bit: Node Problem Detector, a relatively recent addition to how we determine the health and readiness of nodes in a Kubernetes cluster, and how we can apply it to GPUs.
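To make the scheduling point concrete: on vanilla Kubernetes with NVIDIA's device plugin installed, a GPU task is requested roughly like the sketch below. The GPU is an opaque, counted extended resource, so the scheduler only checks that a device is free on the node; it scores nothing about how busy, hot, or healthy that device is. The pod name, image, sizes, and node label are placeholders, not anything from the talk.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mc-sim-worker                              # hypothetical batch task
spec:
  restartPolicy: Never                             # batch-style: run to completion
  nodeSelector:
    nvidia.com/gpu.present: "true"                 # label typically applied by NVIDIA GPU feature discovery
  containers:
    - name: worker
      image: registry.example.com/quant/mc-sim:1.4 # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1                        # opaque counted resource exposed by the device plugin
          cpu: "8"
          memory: 32Gi
```

Nothing in that request tells the scheduler anything about GPU utilization, temperature, or health, which is why the node-condition approach later in the talk is useful.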
Lastly, and this goes beyond Kubernetes: your upstream data, whatever you're sending in to train a model or to run a Monte Carlo simulation, is it equally as available as your Kubernetes cluster? That's a conversation you need to have with your upstream customers, your data scientists, your engineers, whoever is sending that data. It's great to have a compute cluster ready to go and super resilient, but if the client data isn't there, you can't really do anything, and worse, you're wasting money.

Next, capacity and scheduling. We touched a little on the kube-scheduler before, but how often are your GPUs busy or overworked? We've talked a lot about splitting workloads, or even splitting up GPUs, but a card still has a single envelope of power, capacity, and temperature. Do you want to send a workload to a GPU that's overworked, or cooked? Maybe not, because you run the risk of it failing and having to start over. If your GPUs are saturated, or you have scaling challenges, or you want to run more tasks in parallel, and we heard earlier how hard it is to get GPUs, do we look at scheduling? Kueue, which is one of the newer projects, has some really interesting capabilities in that area; it effectively acts as a throttle, letting the GPUs finish the work they're doing without workloads crowding each other out. That's another option you have.

Additional capacity: do you have it in your zone, region, or data center? Whether you're running bare metal or with one of the cloud providers, is it available should you need it? Do you need to reserve it? And if so, what's your plan to provision it? It's one thing to have capacity; it's another to get it up and running and then restart whatever critical workload keeps your business processing. And on top of that, can it be automated? The best runbooks are the ones that are in code.

Then, how can we prevent GPU workloads from starving others? If you're running a mixed cluster of general compute and GPU, you can usually handle this with node labeling, affinity rules, and the like. But if that's not the case, or you haven't done that, how do you avoid clashing with, or taking out, the critical services running on those nodes that make the cluster work?

Last is resiliency. Are you prepared to fail over? This comes up particularly because of the scarcity of GPUs. I'll talk about NVIDIA specifically: with CUDA you can, to a large extent, change the card underneath. So if you need to change the instance type on Amazon, for example, as long as you're using CUDA it's largely okay unless you depend on card-specific features. If you fall into that bucket, you might want to talk to your cloud provider about making sure they have the same card in a different availability zone or region, whatever you need. And what's the plan for restarting tasks once you fail over? These are all common questions that, unfortunately, sometimes get answered in the middle of an incident, which is never great.

So let's take a use case, because we've talked a lot about telemetry and observability, and look at vanilla Kubernetes and what it provides out of the box. This is a very common screen for a lot of folks: describe a node and look at the node conditions. These are the ones you get with a base installation; this one's from k0s.
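For readers without the slide, the default conditions on a stock node look roughly like this excerpt of `.status.conditions` from `kubectl get node <name> -o yaml`; the exact set varies a bit by distribution, and the values shown here are typical rather than taken from the talk's cluster.

```yaml
status:
  conditions:
    - type: MemoryPressure
      status: "False"
      reason: KubeletHasSufficientMemory
      message: kubelet has sufficient memory available
    - type: DiskPressure
      status: "False"
      reason: KubeletHasNoDiskPressure
      message: kubelet has no disk pressure
    - type: PIDPressure
      status: "False"
      reason: KubeletHasSufficientPID
      message: kubelet has sufficient PID available
    - type: Ready
      status: "True"
      reason: KubeletReady
      message: kubelet is posting ready status
```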
Now, we have something called Node Problem Detector, which extends our ability to add checks. In some of the cloud providers this is a standard offering now; I believe GKE has it, and I think Azure's Kubernetes offering does as well. With it you can see conditions like kernel deadlock, read-only file system, Docker overlay issues, and so on. But what's missing is a way to check the health and readiness of a GPU node in Kubernetes. So let's go get some metrics and put those checks in.

This is the DCGM exporter. You can get it through NVIDIA's GPU Operator, or you can install it standalone if you like, depending on how your cluster is set up or which flavor of Kubernetes you're running. It exposes a lot of metrics, in particular the ones we care about for GPU health. We scrape those metrics into Prometheus and then Grafana, the typical stack that most operators and SREs like to have; there's an example of that running off my home lab. And then what we can do with Node Problem Detector is create a custom plugin that checks those metrics in Prometheus and tells us whether the GPU is healthy or not.

This is just a snippet of the JSON file, cut down for brevity, but it's one of the custom plugin checks you can simply add to Node Problem Detector. The result is that you get additional conditions when you describe the node. Depending on how those checks run, you can have a GPU-healthy condition with a readiness check, a temperature check, and a pressure check.

What you also get out of this, and we talked about scheduling earlier, is a way to at least cheat the scheduler a little. If a GPU is not healthy, the node goes into a NotReady state, and that prevents the scheduler from sending another workload there, whether or not you have affinity rules in place. Depending on your risk tolerance for how hot a GPU can get, or however you're set up, this helps reduce the risk of a task crashing in the middle and having to be rerun. So that covers observability.
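The JSON from the slide isn't reproduced here, but a minimal Node Problem Detector custom-plugin monitor along those lines could look like the sketch below, wrapped in the ConfigMap that would typically be mounted into the NPD pod. The condition name, source, and script path are illustrative. NPD's plugin protocol treats an exit code of 0 as healthy and a non-zero exit as a problem, and a permanent rule flips the named condition when the check fails; actually acting on that condition, for example cordoning or tainting the node, is handled outside this config.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: npd-gpu-custom-plugin            # hypothetical name
  namespace: kube-system
data:
  gpu-health-monitor.json: |
    {
      "plugin": "custom",
      "pluginConfig": {
        "invoke_interval": "60s",
        "timeout": "10s",
        "max_output_length": 80,
        "concurrency": 1
      },
      "source": "gpu-health-monitor",
      "conditions": [
        {
          "type": "GPUProblem",
          "reason": "GPUIsHealthy",
          "message": "GPUs on this node are healthy"
        }
      ],
      "rules": [
        {
          "type": "permanent",
          "condition": "GPUProblem",
          "reason": "GPUIsUnhealthy",
          "path": "/custom-plugin/check_gpu_health.sh",
          "timeout": "10s"
        }
      ]
    }
```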
However, metrics are just one piece of the whole spectrum of how we support this cluster. A good metaphor for it: think of a hospital with a ton of technology. There's so much data and so many metrics there, but you still need doctors and nurses who not only know human anatomy but also know the patient, and who can intervene, give medicine, and so on. It's similar here: you can tell me the GPU is hot, but if I've never worked with a GPU before, I'm not going to know what to do. So how do we get to that point of knowing your patient in this scenario?

Load testing is one way: understand the limits of your system, fix them, and then do it again. Part of that is how quickly you can get new infrastructure online if spiky incoming requests show up. What are the first points of failure, and how do you address them? Another good thing here: when you have incidents, especially ones that are load-based or have latent issues triggered by load, this is a great way to validate any changes you came up with in the retrospective you ran with your team after restoring service.

Second, chaos testing. I know that's a buzzword, but the best way to learn how to support your HPC cluster is to break it, and that's true regardless of Kubernetes or anything else. Tabletop exercises play a critical role here too. We talked earlier about resiliency: you can have everything written down, but once you're in the thick of an incident trying to restore a cluster, HPC or not, having the muscle memory of having done it before creates a lot more psychological safety, and you perform much better under that stress. Practice is really the key word. This is especially valuable after post-incident retrospectives for incidents that were process failures, where nobody knew what to do and you sat around for hours trying to get hold of somebody. That's probably a good time to say: let's go break the system and fix it again, let's run through those tabletops and make that happen.

Lastly, incident response. I alluded to this before, but telemetry and alerts are only good if we can mobilize and restore. Alerts still need to be interpreted; like I mentioned before with GPU temperature, I need to know what to do about it. That's muscle memory among operations teams. What's also really important is building an understanding of the types of tasks and workloads coming into your cluster; it's like understanding your application. As a matter of fact, SREs should be committing code anyway, so one way to learn it is to open a PR and help the team with reliability objectives and things like that.
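On the point that alerts still need interpreting, here is a hedged example of turning one of those DCGM signals into an actionable alert, assuming the Prometheus Operator is in place. DCGM_FI_DEV_GPU_TEMP is a metric dcgm-exporter exposes per GPU; the threshold, duration, and runbook URL are placeholders you'd tune to your own cards and risk tolerance.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-health-alerts                  # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUTemperatureHigh
          # Highest temperature reported per node and GPU index by dcgm-exporter.
          expr: max by (Hostname, gpu) (DCGM_FI_DEV_GPU_TEMP) > 85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} has been above 85C for 10 minutes"
            runbook_url: "https://runbooks.example.internal/gpu-temperature"   # placeholder
```

Pairing the alert with a runbook link is what turns the raw metric into something an on-call engineer who has never touched a GPU can still act on.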
Some closing thoughts. No matter how intelligent our systems are, the humans behind them will make or break a successful outcome. Whether it's AI or however smart we've seen systems get, they still break, and there are still people getting paged, waking up, responding, and restoring service. Nothing is foolproof; nothing is 100%.

Reliability is also a team sport. I talked about it before: work with your software engineers, your data scientists, your quantitative analysts. Have them understand that reliability is a first-class feature in whatever they're trying to do, because it will pay dividends for them in the long run.

Kubernetes can help, as we saw, with some upfront effort for HPC: we put Node Problem Detector in and were able to shortcut one of the many things we have to do to support that cluster. What I'd like to see in the future is more open standards around this, having those checks by default with some sort of implementation behind them. I know it's difficult, because it's tough to be vendor neutral when you have two main proprietary vendors in the space and a bunch of smaller ones. But more open standards would not only help teams interpret the differences between them, it would also give us a shared footing; GPU temperature is going to be the same thing across every card, right? That's something we need to monitor and think about. So I'd like to see us as a community move towards more open standards around how we support GPUs in Kubernetes.

That's it. This deck was built in Reveal.js, if anyone was wondering; it's a bit of a learning curve, but it seems to be a good ROI. Also, on the NVIDIA open-source side, with DCGM and the GPU Operator, I've seen a lot of growth in that area over the last couple of months and years, especially as the AI space has taken off. And lastly, a quick shout-out to the CNCF and LF staff: this is the biggest convention center in the US, it's a very complex thing to pull off, and I think they're doing a great job. So, feedback, the QR code is there, and otherwise I'll take questions. We have quite a bit of time.

Thank you, and thanks for the great presentation. There's a mic in the middle of the room, so feel free to head over there, and we can also pass the mic around.

I'm also happy to talk outside afterwards too. All right, sounds like we have one.

You mentioned error budgets. Have you done anything with service level objectives for GPUs?

Error budgets together with service level objectives? Yeah, that's a good one. With availability SLOs and duration SLOs you can use a lot of metrics, especially in the application stack or around the task itself, and put telemetry around processes taking too long when we expect them to complete within a certain time, things like that. We also have the common availability measures. One of the other things we were looking at was adding metrics to say how many times we've failed scheduling, so we can actually improve the cluster, or maybe drive some changes from an engineering perspective. But definitely, SLOs and error budgets play a big part in this. What I would say when constructing them is: start at the customer and end at the customer. Whatever that round trip is, that's what you want to measure, and the closer you get to the customer, the more accurate your SLO is.

Hello, nice talk. I have a question: all this SRE maintenance you described is kind of for a private cloud, or your own HPC cluster. What do you see if we're maintaining this in a public cloud? Would the challenges be different?

So the question is, if I go back, about Node Problem Detector? Is that what you were referring to?

Yes, that too. In a public cloud situation we probably couldn't reach this information as easily as the provider can.

So in a public cloud environment, you can still get at these checks. I think the challenge may be in overriding the custom configuration that some cloud providers have. That's an interesting question; we'd have to turn to folks who have done it, because there are two ways to deploy Node Problem Detector. One is you deploy it standalone yourself, or you get it as an add-on, like GKE has an add-on for it that's baked into their offering.
So I would check with the cloud provider to see how you can override some of those settings, or whether there's a facility for you to supply custom plugin checks that feed into it as well. What gets tricky is that because it's a managed service, whatever you override might interfere with what they're trying to do to maintain their own SLOs, so it can get a little delicate.

Okay, sounds good. Thanks.

You got it.

Maybe I have a question as well. Two slides back, I think, if you go down. Or the GPU health one, maybe it's the next one. Yes, this here. I was curious: this is Node Problem Detector, so it's node health. If you have a multi-GPU node, is this something that would still work? Because it's reporting on the node as a whole, I guess.

Yeah, it would still work. The trick would be in the script you're executing, so you can be flexible in how you want to look at it. You might say, if you have two GPUs on the node and one is good and one isn't so great, you can still make a determination that the node is ready or not ready, depending on the type of workload and how much it's going to consume. I will say there's probably some work on heuristics around that we could definitely take a look at. The scope of how we initially built this out was one-to-one, a single GPU per node, but that's definitely a good avenue to explore. And when we start talking about splitting GPUs too, the custom checks are going to get a little tricky.

Thanks a lot.

Thank you very much for the presentation, and I'm happy to keep answering questions afterwards.
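As an addendum to that last answer: the setup described in the talk reads GPU metrics from Prometheus, but purely to illustrate the point that "the trick is in the script", here is a self-contained sketch of a multi-GPU-aware check that reads temperatures directly with nvidia-smi, packaged the way NPD plugin scripts are commonly mounted. The names and the threshold are placeholders, and the per-GPU policy (fail the node if any card is too hot) is just one choice among the heuristics discussed in the answer above.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: npd-gpu-check-script               # hypothetical name
  namespace: kube-system
data:
  check_gpu_health.sh: |
    #!/bin/bash
    # Illustrative multi-GPU health check for an NPD custom plugin.
    # NPD plugin protocol: exit 0 = OK, non-zero = problem on this node.
    MAX_TEMP=85        # illustrative threshold in degrees Celsius
    unhealthy=0
    # Query every GPU on the node; requires nvidia-smi in the container or on the host.
    while IFS=',' read -r index temp; do
      temp="${temp// /}"
      if [ "${temp}" -ge "${MAX_TEMP}" ]; then
        echo "GPU ${index}: ${temp}C exceeds ${MAX_TEMP}C"
        unhealthy=1
      fi
    done < <(nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits)
    if [ "${unhealthy}" -eq 0 ]; then
      echo "All GPUs below ${MAX_TEMP}C"
    fi
    exit "${unhealthy}"
```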