So the reason I'm giving this talk is that I keep having the same sorts of conversations. I talk to HPC people, and they have assumptions about how the systems work; I talk to cloud people, and they have assumptions about how the systems work, and they're different. Trying to design solutions that span both spaces is difficult when you're making different assumptions. So I've been spending a lot of my time trying to figure out how to put together paradigms that work while still giving us the niceties that we have in the cloud today.

Who am I? I'm a cloud software architect. I'm also a CNCF Environmental Sustainability TAG chair, so if you want to talk about sustainability, I can do that too. I don't know how that's going to work in HPC-type systems, but we'll see. And a quick poll: who here is experienced in HPC? Who here is experienced in cloud specifically? Who has both? Fewer hands, right? If you've ever been in a room with both, one party is speaking Greek and the other Latin, and it's hard to translate.

One of the things I rant about continually is that we can pull from different thought patterns. Here's Piers Anthony's On a Pale Horse. I know it's a very old book, but it describes five different thought patterns: linear, parallel, creative, circular, and intuitive. The first, linear, is often what cloud people assume; they're not looking at parallel. The second, parallel, is how a lot of HPC people think.

So we're going to start with a generic example of an HPC system. You have compute nodes, you have some sort of ingress, which is usually one giant login node, and you have your storage. Some assumptions come from this. It starts with the infrastructure being immutable, so users assume fine-grained control: you have MPI, you have NUMA nodes.
You have CPU pinning and memory pinning. This gives you access to compute resources that are usually handled by a resource manager and a scheduling system, and options exist for common patterns. Your users are often systems experts, which means they understand the infrastructure, they have a tolerance for complexity, and they think in terms of parallel compute. The people using your HPC systems are usually very well versed in how those systems work.

Users own the compute nodes; nothing else is scheduled there, so users usually have exclusive access to all the resources on those nodes. Security is handled through very strict access to the cluster itself, and the rules within the cluster vary widely: I've seen very open systems and I've seen very closed systems, depending on what you're doing. The second kind is harder to run on. At the end of the run, the resources are made available to other jobs; you no longer own them.

Additionally, most of these systems are homogeneous, at least in the old world. HPC applications usually assume that the nodes have at the very least a very similar setup: the same hardware, the same BIOS, the same OS, the same drivers, the same firmware, and the same software infrastructure. There are tools that look for inconsistencies, for instance Intel Cluster Checker. I'm at Intel, and I used to work on that project, and I will tell you, sometimes we would get reports where people would send in very long logs from these tools, and the thing that was wrong with their cluster was that they had different Ethernet drivers. That's it. That was killing their performance.

Then we get into cloud systems. You have some CPUs; they're in some nodes. You don't necessarily know a lot about the rest of the system. In particular systems you may, but with cloud that's not the normal assumption. Your user base is different: users desire simplicity.
They don't want to know the details. They're not systems experts; they may have never racked a machine. Who here has racked a machine? And who here hasn't? Those who haven't mostly do Kubernetes, not HPC.

Users often do not think in terms of parallel applications; they know only about their specific application. They can ask for a number of CPUs or a set of resources and maybe define parameters, but they don't always know what they're getting, so the requests aren't always helpful. I've seen entire teams at various companies whose whole purpose in life is to figure out how to save money on cloud, because they're wasting resources. I was on one of those teams at one point; a single flaw could be ten thousand dollars a month.

Users share nodes. This introduces jitter, blows through your bandwidth, and hurts reliability, and scaling doesn't always work correctly in this situation. The systems are also heterogeneous: your machines can be anything. You have different CPUs, different speeds, different NICs, and different memory, and they all behave differently, so again scaling doesn't always work. The machines are less stable, which means workloads are sometimes assumed to be able to reschedule automatically, and the compute is assumed to be robust in spite of the hardware.

So let's talk about some positive patterns in Kubernetes, because there are some things in the cloud we want to keep, and one of them is the device plugin. The device plugin model is very simple: you have a device manager, you have the device plugin on the nodes, and you have various devices that get lifted into the pods so that you're able to use them. This is really neat. It's easy to write, it's easy to use, and you have gRPC.
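To make the flow concrete, here is a small self-contained mock of that idea. The class and method names here (`MockDevicePlugin`, `MockDeviceManager`) are invented for illustration; the real interface is a gRPC service the plugin serves over a Unix socket, with `ListAndWatch` streaming device health and `Allocate` handing devices to containers.

```python
# Illustrative mock of the device-plugin flow: the plugin reports devices
# with only a health state, and the manager allocates from the healthy set.
# Names are invented for this sketch; the real API is gRPC over a Unix socket.
from dataclasses import dataclass

@dataclass
class Device:
    device_id: str
    health: str  # the real API only distinguishes "Healthy" / "Unhealthy"

class MockDevicePlugin:
    """Stands in for a plugin's ListAndWatch stream."""
    def __init__(self, devices):
        self.devices = devices

    def list_and_watch(self):
        # Real plugins stream this list and re-send it on health changes.
        return list(self.devices)

class MockDeviceManager:
    """Stands in for the kubelet's device manager."""
    def __init__(self, plugin):
        self.plugin = plugin

    def allocate(self, count):
        healthy = [d for d in self.plugin.list_and_watch()
                   if d.health == "Healthy"]
        if len(healthy) < count:
            raise RuntimeError("insufficient healthy devices")
        return [d.device_id for d in healthy[:count]]

plugin = MockDevicePlugin([Device("gpu-0", "Healthy"),
                           Device("gpu-1", "Unhealthy"),
                           Device("gpu-2", "Healthy")])
manager = MockDeviceManager(plugin)
print(manager.allocate(2))  # -> ['gpu-0', 'gpu-2']
```

Note that health is the only per-device state the manager ever sees in this model; there is no channel for richer readings such as power draw or temperature.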
It's pretty quick once you get the resources; it's usually not the startup that hurts you, and you can usually track the health of the devices. There are some limitations with the device plugin, though. You don't have any specific understanding of the devices beyond healthy and not healthy, and because I'm part of the sustainability group, other things we'd like to look at are power use, voltage, temperature, and bandwidth, which isn't really sustainability, but there are other items too.

There is one thing I would really like to see fixed in Kubernetes, and that is this herd of managers. Alexander Kanevskiy, who goes by Sasha and is also at Intel, calls it a herd of managers. If you go inside the kubelet, there are four managers: the topology manager, the memory manager, the CPU manager, and the device manager. All of these have to talk to each other when scheduling resources so that they land in the right NUMA zone, and this doesn't even work correctly in most cases. Because they all have to talk to each other, any time you make changes or updates you now have to make sure it works with all of your managers.

What do we want to keep from Kubernetes? We want to separate the average user from systems knowledge. There is no reason for your average user to keep having to know the intricate details of your system, and they don't want to. If you look at where people are going, they're going into AI/ML, where they don't necessarily have to: you have these very bright mathematician-type minds going into fields where they don't need systems knowledge. You also want to keep pluggable infrastructure. You want to make it easy for admins to put in any type of infrastructure, and it's even better if users can toggle their own options according to their workload profiles and make more things available.

We still have scheduling issues. We have kube-batch, which allows batch scheduling.
This is wonderful and similar to HPC; it's used in a variety of systems, and we've already referenced it today. But it does not solve the issue of guaranteeing node consistency, of the firmware, the OS, the BIOS, and so on, so you're still going to get the performance issues I referenced from when we were using Cluster Checker and would find disparate Ethernet drivers.

Intel has a project called Telemetry Aware Scheduling, which allows scheduling according to node metrics, so you could actually look for things like this on your nodes. But the metrics are currently node-only, which we're trying to fix so you can pull things off the node as well, and parts are also internal to Intel, so we still have to upstream them. There's also network-aware scheduling. This was a neat project put out by IBM Research, and it uses a cluster's network latency and topology information to better schedule latency- and bandwidth-sensitive workloads. There's no guarantee, though, if a co-located pod kills your initial assumptions: if you have a co-located pod and it suddenly starts using a bunch of bandwidth, you're stuck. So it still doesn't really solve this problem, and various people who have tried to solve it run into the same thing.

What do we want from scheduling that we don't have today? We want guarantees of node behavior for the duration of a workload; this comes from HPC, where you do have such guarantees. We want guarantees of network behavior for the duration of the workload. And we may want guarantees of compatible versions of software on the node, which includes all your BIOS settings and your Ethernet settings.

In HPC we also have checkpointing, and we don't want to get rid of that. It's available for most ML pipelines, but it's not the default at this point, and apparently in Kubernetes 1.25 there's now new checkpointing support for containers.
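For reference, the container checkpointing that landed in Kubernetes 1.25 (alpha, behind the `ContainerCheckpoint` feature gate) is exposed as a kubelet API endpoint rather than a kubectl command. The helper below just builds the request URL for a hypothetical node, pod, and container; actually invoking it requires kubelet client credentials, and the resulting CRIU image lands under `/var/lib/kubelet/checkpoints` on the node.

```python
# Sketch of the kubelet checkpoint endpoint path (Kubernetes 1.25 alpha,
# ContainerCheckpoint feature gate). The node/pod/container names below
# are hypothetical; a real call is an authenticated POST to this URL.
def checkpoint_url(node, namespace, pod, container, port=10250):
    return (f"https://{node}:{port}/checkpoint/"
            f"{namespace}/{pod}/{container}")

print(checkpoint_url("worker-1", "default", "train-job-0", "pytorch"))
# -> https://worker-1:10250/checkpoint/default/train-job-0/pytorch
```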
I haven't played with it yet; I don't know if anyone else here has, but it looks promising as far as keeping checkpoints for more advanced uses.

We want native CPU management that users can control. All the advanced solutions are currently out of tree: you have the Nokia CPU-Pooler, you have CRI Resource Manager, and I've talked to people who are running their own daemons on the nodes. The kubelet options must be turned off in order for these to work; all those managers I showed you earlier, you have to turn things off. We've done some tests turning these off and playing with some CPU management, and we're getting over a three-times speedup, with better use of resources, from our CPU management compared to current Kubernetes. This is the thing we're running, in case you're curious, and I'm happy to talk about it. It will come out as open source in about two to three weeks, maybe; right now it's internal to Intel. We've basically made it so that we can have mixed cores: you can have pinned cores and non-pinned cores, and we're adding isolated-CPU-type cores as well.

I'd like to see the kubelet move to something more like this model: basically have a plug-in interface so that these managers are pluggable rather than external, and we do have some preliminary work on this.

Other things we don't have: we don't have guaranteed networking bandwidth with network topology. We still have the dual NICs per node; you either have to channel-bond or you have to do weird things with Multus, and it's all very hand-done. And we want something faster than the current MPI Operator in Kubeflow, maybe using a direct fabric between pods, maybe using libfabric, which is an HPC tool, as the base, so you still have the fast fabric in between.
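A rough sketch of what that mixed-core model implies: split a node's cores into an exclusive (pinned) pool and a time-shared pool, then hand out exclusive cores per workload. This is my own illustration of the idea, not the released tool's API; the names (`CpuPool`, `request_pinned`) are invented.

```python
# Illustrative mixed-core pool: some cores are reserved for exclusive
# (pinned) use, the rest are time-shared. Invented names, not a real API.
class CpuPool:
    def __init__(self, cores, pinned_count):
        cores = sorted(cores)
        self.pinned_free = set(cores[:pinned_count])  # exclusive use
        self.shared = set(cores[pinned_count:])       # time-shared
        self.assignments = {}

    def request_pinned(self, workload, n):
        if n > len(self.pinned_free):
            raise RuntimeError("not enough free pinned cores")
        picked = set(sorted(self.pinned_free)[:n])
        self.pinned_free -= picked
        self.assignments[workload] = picked
        # On a real node you would now apply the mask, e.g. via cgroup
        # cpusets or os.sched_setaffinity(pid, picked).
        return picked

    def release(self, workload):
        self.pinned_free |= self.assignments.pop(workload)

pool = CpuPool(range(8), pinned_count=4)  # cores 0-3 pinned, 4-7 shared
print(sorted(pool.request_pinned("mpi-rank-0", 2)))  # -> [0, 1]
print(sorted(pool.shared))                           # -> [4, 5, 6, 7]
```

A real implementation would also have to account for NUMA locality when picking cores, which is exactly where the kubelet's topology, CPU, memory, and device managers have to cooperate today.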
So you still have the fast fabric between We have other factors still there's file and block and container attached storage Some of those container attached solutions require a core outside of Kubernetes just to run because they're running clients all the way through they're interesting and We need to be looking at other want run times wasm is getting popularity. We're gonna have to deal with it and Singularity is also an HPC specific container. I don't know if anyone's used that It's pretty neat as a project, but it's not really mainstream. So anything during container D isn't gonna handle that Then there's two last things we're gonna talk about real quickly one thing is we want to update the math So once upon a time people put together ps3 is because they're cheap especially at the time compared to the compute nodes So they made all these clusters out of PlayStation 3's what they found was that the floating point operations were different So they're not necessarily following IEEE standards and instead of infinity or a negative infinity Sometimes they would just peg the numbers at either ends, which isn't really what you're going for It's also very slow and certain types of floating point compute But if you're using different processors How do you do the math? Because the numbers you're coming out may not be consistent for your same processors so this is something where we have to get a little bit more robust and And this is I'm calling this a side quest But this is because I don't think we have everything we need today as far as what we need to be thinking about with compute Because we need to break out of correctness and efficiency only that's what we've been doing with software for forever We want things as efficient. 
We want things correct, but we don't have any robustness whatsoever. Today, reliability is almost exclusively a hardware problem, and availability is a software problem. If you go to this professor's lecture, and this particular paper, by one of my old professors, so he has a place dear and true to me: he took quicksort, merge sort, and bubble sort and injected random events where a comparison would give the wrong answer. The only one that reliably converged on a correct answer was bubble sort, and we all know bubble sort is very slow. But we need to start thinking in terms of how to make things correct at least at the micro level, and keep falling toward correctness. That's the sum of this. So thanks for listening. Thank you. Any questions?

What's the name of this CPU management software that's going to be released? Do you have an estimate for when we'll be able to take a look at it?

It'll be two to three weeks. We're calling it the CPU Manager Control Plane, because it's a control plane. And you're welcome to contact me afterwards; I can send you an email as soon as we release. Any other questions?

Sorry, just quickly: you mentioned Singularity there, and that's something we've used a lot ourselves, but we found that there doesn't even seem to be any real desire in the community to take it up. There was a Singularity CRI, but that was dropped about two years ago.
I'm just curious if you've seen any interest there.

Yeah. So CRI Resource Manager has solutions for CPU within the runtime, and they're trying to add class-based resources into the kubelet. The shortcoming of this is that you're still going to have issues where you have different runtimes. As people try to introduce different runtimes, and Singularity is a very nice HPC runtime, it's not going to address that. So I really think we need to be looking beyond just runtime solutions; they're very good for what they do, but we should also be looking outside of that and have lighter pluggable managers, so we can plug and play them according to user needs.

Hi. In your talk you mentioned a three-times performance improvement, but usually in this world nothing comes for free. What kind of trade-offs did you have to make for that?

Can you say that again? I didn't understand.

You talked about achieving a three-times performance improvement, but in this world usually nothing comes for free. So what kind of trade-offs did you have to make for that?

So we were pinning pods; we were pinning particular pods to particular areas. There's a talk about this later, but we went from 48 percent to over 70 percent utilization of the cores, instead of the normal. We were using the Hotel Reservation benchmark for that particular result, so there are specific benchmarks behind it.

And to your question, I have a related follow-up on the CPU management and the pinning. Is this aware of all the resources on the node? For example, could you co-locate GPU and CPU, things like this?

Currently it's CPU-only, but with the way we're doing the pluggability, we're hoping that we can expand beyond this. This is just the first cut; we've been working on this since about January, so it's not a long-running project.
Right. So it's a CPU manager today, but it can be made aware of other resources. We have plenty of time for more questions.

I actually have one question. This was a great summary of all the things we should be considering for HPC on Kubernetes, and the missing pieces. What do you think about storage? I don't know if you've looked into that more deeply.

So, the thing about storage is that you want to get your storage as close to your compute as possible, and we weren't necessarily doing that in HPC, because we had our storage on the Lustre file system, right? Going across a huge network. Intel had Optane memory, which you could put on the node and preload, and that's really the way the storage solutions I've seen are currently working. If you look at Weka, they have different clients, basically a client per NIC, and they push the data across, trying to keep it co-located. So I think storage may need some more design. One of the cool things I've seen was out of Western Digital: they're actually doing modifications in the storage itself; they have compute in the storage. So maybe we shift some of the compute into the storage and do micro-changes there, for when you just have to do a plus-one or some small update, and then do something else for the heavier compute. But that will require a more sophisticated pluggable approach.

Yeah, and when do you offload? How do you express that in the spec?

Right. And currently it's not particularly clean, and I agree with you.
We need to get a little bit better on that.

Okay, I think we have a break now. One more: what's the state of accelerator devices and kernel driver support on the host system, versus going into the container and everything else like that? Is there any new research working on bypassing a lot of the OS kernel?

Trying to bypass what, sorry?

You were saying before, you know, better networking and MPI. For the GPU layer you have the RDMA GPUDirect communications. Is there anything that can be brought over from HPC land into Kubernetes that's being done today that you can comment on?

Yeah. So if you look at some of the back ends, and I'm familiar with the Habana infrastructure, they have a back-end network, so they actually have direct GPU-to-GPU behavior, and we can borrow from that. But you're still going to have the piece where you have the CPU and you're trying to get the memory over, so I think we need to start reimagining, to be honest, how we build these structures. There are pieces of Kubernetes I don't want to lose, though, including simplicity, because these systems are very complicated and users just don't want to deal with that level of complexity. But you're right, we should be looking more at that. Those are mostly on the training side; the training cards are the ones with the specific back end. I'm sure NVIDIA has something similar; I don't know, though.

I'm kind of curious if you've seen any solutions around the network, because this is often a common problem: nodes are multi-tenant, they're not isolated, so you get bursty behavior, starvation, and sometimes ultimately node unhealthiness.

So one of the things I would like to do is have a driver basically sitting there monitoring your pod behavior.
And if something is blowing through the network, you reschedule that pod. But we don't really have anything like that. We have things that look at current state, but you still need a monitoring component behind there that does the rescheduling. We could do something similar to what we do with Telemetry Aware Scheduling and play with that, and that's one of the projects on my long list, and then start monitoring the network behavior. But you have to monitor it as it goes, because as long as you have multi-tenancy, you're going to have that issue. I think we have a talk later today on networking as well.

When you talk about rescheduling things, there's going to be a lot of complexity with local data, so moving a pod off a node could mean that it's completely starting from scratch.

Right. This is where we need the checkpointing pieces. We just now have checkpointing of pods, so I think at this point, because we have that, we can probably start to leverage it: we can checkpoint the pods and save the state somewhere, so that if we have to de-schedule and reschedule, we can start from the current run state.

Thank you. Well, any other questions for Marlow? Otherwise, thank you very much again. Thank you.