Alright, thank you. My name is Daniel Milroy, and I'm a computer scientist at the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. I'm speaking on behalf of my collaborators at the laboratory, and also Claudia Misale of the IBM T.J. Watson Research Center, who could not be here today. I'll be talking about Fluence: approaching a converged computing environment.

To give everyone a little background, I want to start with an example of a pre-exascale scientific workflow that really strained the capabilities of traditional HPC resource managers and schedulers. This is the Multiscale Machine-Learned Modeling Infrastructure, also known as MuMMI, which won the SC19 Best Paper Award. It features a Message Passing Interface (MPI)-based simulation coupled with in-situ analysis and machine learning. There are three primary areas where it strained resource managers and schedulers.

The first is the co-scheduling challenge. The coarse-grained component of this workflow, and I'm only going to give a brief overview here because it's very complex, required that analysis be bound to the cores nearest the PCI Express buses on each node, because those particular cores needed to communicate and exchange data with the GPUs on the nodes. So it needed to exploit very fine-grained, node-local topology information.

The second is the job communication and coordination challenge. This particular workflow consisted of approximately 36,000 individual concurrent tasks using 176,000 CPU cores and over 16,000 GPUs on Lawrence Livermore National Laboratory's Sierra pre-exascale cluster. Coordinating that many tasks was very, very difficult with traditional workload managers and schedulers.

Finally, the portability challenge. Given all of this coordination, and given the difficulty of adapting the workflow to a particular resource manager and scheduler, porting it between different resource managers and schedulers is tremendously difficult.

In terms of the next generation, where we're going from here: we're starting to see cross-cluster scientific workflows, which demand portability as well as integration with the cloud. This diagram shows the architecture of the American Heart Association molecular screening workflow, which won the eScience 2022 Best Paper Award. It combines more traditional MPI-based simulation and analysis with machine learning, and now containerized components as well. When I say this is a cross-cluster workflow, I mean it features one component that runs on a dedicated CPU cluster, one component that runs on a CPU-and-GPU cluster, and a Kubernetes cluster that provides a message-queuing glue between those two clusters.

We're seeing this particular pattern at the laboratory and beyond. There are other examples at the laboratory, such as combining traditional HPC simulations with AI and machine-learned surrogate models, as well as potentially orchestrated databases. And despite there already being examples at the laboratory, we anticipate many, many more in the future: a recent survey indicated that up to 73% of laboratory workflows are interested in cloud integration.

So I mentioned that many of these complex workflows challenge the capabilities of traditional HPC resource managers and schedulers. And this is where the Flux framework comes in.
So Flux combines hierarchical resource management with graph-based scheduling, which addresses many of the challenges I brought up in the first two slides. The first way it accomplishes this is by enabling full workflow support via hierarchical resource subdivision. Flux can instantiate itself inside of itself, essentially ad infinitum, and this allows for specialization at each level: at each level of a nested Flux hierarchy you can change the scheduling algorithms and use divide-and-conquer approaches. The second point is that Flux can manage resources basically anywhere, from bare metal to virtual machines in the cloud, et cetera. Another huge advantage is that workflows really only need to program to Flux. I mentioned the portability challenge before; well, Flux can instantiate itself underneath external resource managers like Slurm and LSF. You also see the directed-graph diagram at the bottom: directed graphs are used as the resource model to express complex, and potentially changing, resources. Finally, the Flux framework provides rich and well-defined interfaces, which facilitates communication and coordination between the different tasks in a workflow. It provides a CLI and Python, C, C++, and Rust bindings, and we're working on Go bindings as well. (A short sketch of what programming to Flux looks like through the Python bindings appears at the end of this section.)

Given the emphasis on MPI and traditional HPC workloads, we want to declaratively create a kind of HPC slice in the cloud. This is really complicated and requires a couple of things: scalable MPI bootstrapping, and high-quality pod placement, and I'll talk about what that means in just a minute. The basic idea is that you have more complexity than in a bare-metal HPC environment: the mapping of MPI ranks to pods, and the mapping of pods to the underlying physical infrastructure.

The first thing we did to enable this capability was to make declarative MPI scale three orders of magnitude higher than it did in our initial tests. We used the Kubeflow MPI Operator, which initially scaled to only about 80 ranks in our tests. We identified and fixed some race conditions, shown in the flow diagram on the left. One was that the launcher didn't wait for workers to be ready. Another was that workers would report a state transition to Ready before SSH was actually available, which caused failures during MPI bootstrap. We also fixed a DNS flood that would overload CoreDNS. With this fixed MPI Operator, we ran an MPI Hello World at up to 16,384 MPI ranks on EKS. (A hypothetical illustration of the launcher-side readiness gate appears below.)

The second component of this is Fluence, which used to be known as KubeFlux, and which brings HPC-grade scheduling and improved performance to Kubernetes. It's implemented as a plugin for the Kubernetes scheduler plugin framework. Fluence supports CPU, memory, and GPU resource requests, and it allows for zone awareness via locality information that is embedded into the resource graph of Fluxion, the scheduler component. (A sketch of opting a pod into Fluence also appears below.) We tested this using Department of Energy CORAL-2 pre-exascale benchmarks such as AMG, LAMMPS, and QMCPACK. We found that optimized AMG ran well on IBM Cloud and AWS, and we got LAMMPS, which is latency-sensitive, to strong-scale on EKS up to 3,008 ranks. Fluence actually accelerates CORAL-2 benchmarks in OpenShift on IBM Cloud, on a ROKS cluster, and after the sketches below I'll provide a brief overview of these performance results.
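To make the "program to Flux" and hierarchical-subdivision points concrete, here is a minimal sketch using flux-core's Python bindings. It assumes the script already runs inside a Flux instance (so the handle can connect via FLUX_URI), and `./analysis` and `workflow_step.sh` are hypothetical; launching `flux start` as a job is one way a workflow can bootstrap a nested instance.

```python
# Minimal sketch of programming to Flux with the flux-core Python bindings.
# Assumes this script runs inside a Flux instance; the executables are
# hypothetical placeholders.
import flux
from flux.job import JobspecV1, submit

handle = flux.Flux()  # connect to the enclosing Flux instance

# An ordinary parallel job: four tasks of a hypothetical analysis executable.
analysis = JobspecV1.from_command(["./analysis"], num_tasks=4, cores_per_task=1)

# Hierarchical subdivision: launching `flux start` as a job bootstraps a
# nested Flux instance on the job's resources, which can then run its own
# scheduler and subdivide further.
nested = JobspecV1.from_command(
    ["flux", "start", "./workflow_step.sh"],
    num_nodes=1,
    num_tasks=1,
    cores_per_task=4,
)

for spec in (analysis, nested):
    print("submitted job", submit(handle, spec))
```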
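The talk describes the MPI Operator fixes only at a high level. As a hypothetical illustration (not the actual patch), a launcher-side gate that avoids both races might look like the following: wait until every worker pod is Running and its SSH port actually accepts connections before starting mpirun. The namespace, label selector, and worker count are assumptions.

```python
# Hypothetical launcher-side readiness gate, using the official `kubernetes`
# Python client plus a raw TCP probe of each worker's SSH port.
import socket
import time

from kubernetes import client, config


def ssh_ready(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_workers(namespace, selector, expected, poll=5.0):
    """Block until `expected` worker pods are Running and reachable via SSH."""
    config.load_incluster_config()  # running inside the cluster
    v1 = client.CoreV1Api()
    while True:
        pods = v1.list_namespaced_pod(namespace, label_selector=selector).items
        ready = [
            p for p in pods
            if p.status.phase == "Running"
            and p.status.pod_ip
            and ssh_ready(p.status.pod_ip)
        ]
        if len(ready) >= expected:
            return  # now it is safe for the launcher to exec mpirun
        time.sleep(poll)


# Example with hypothetical labels: wait_for_workers("mpi", "role=worker", 64)
```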
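And since Fluence is deployed as a secondary scheduler built on the Kubernetes scheduler plugin framework, a pod opts in the way it would for any secondary scheduler, through the pod spec's scheduler name. A minimal sketch with the kubernetes Python client; the scheduler name "fluence" and the container image are assumptions here.

```python
# Sketch: create a pod whose placement is routed through Fluence rather than
# the default kube-scheduler. The image and resource sizes are placeholders.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="lammps-pod-0", labels={"app": "lammps"}),
    spec=client.V1PodSpec(
        scheduler_name="fluence",  # assumed name of the Fluence scheduler
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="lammps",
                image="example.org/lammps:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "4", "memory": "8Gi"},
                    limits={"cpu": "4", "memory": "8Gi"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```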
You can see QMCPACK in the leftmost plot and LAMMPS in the rightmost plot. In these sets of box plots, blue indicates the default kube-scheduler and orange is Fluence. You notice right away that, depending on the rank count, the variability is much higher with the default scheduler than with Fluence, and the run times are higher as well. Part of the reason is that kube-scheduler is unable to pack pods densely onto single nodes; even when limiting or modifying affinity and anti-affinity, kube-scheduler was not able to perform as desired.

We also compared multi-app simulated workflow performance as scheduled by Fluence versus kube-scheduler. We simulated workflows by taking three CORAL-2 benchmarks and running them simultaneously within a single availability zone of an IBM Cloud cluster. This heat map represents pod placement: at the top are the kube-scheduler placements, the top 40 jobs, and the bottom 40 jobs are Fluence. Along the horizontal axis are the unique node IDs, the vertical axis is job IDs, and the color indicates the number of pods mapped to each individual node. We saw a lot of pathological mapping behavior due to kube-scheduler's random tie-breaking, which resulted in artificial resource starvation and delayed execution for some constituents of the workflow. You can see this represented in the two box plots to my right. The first is AMG, where there is a heavily skewed distribution for the default scheduler; you can't really see the color there, but you can take my word that the skewed one is kube-scheduler. With QMCPACK you see similar behavior. In general, we saw the Fluence-scheduled workflows run up to three times faster. You can take a look at this; it will be published at the CANOPIE-HPC workshop at SC22.

Finally, I want to talk really briefly about our latest work, which is to replace the MPI Operator with the Flux Operator. This is going to enable both HPC-grade pod scheduling and hierarchical management in Kubernetes. The idea is to bootstrap the Flux tree-based overlay network across pods, such that complex scientific workflows can run and make use of the same hierarchical resource subdivision and scheduling capabilities they use on bare metal. The first thing we're going to start working on is porting the American Heart Association molecular screening workflow to Kubernetes. So thank you very much. I have some links to the software available on GitHub, and let me know if you have any questions.

Q: That's really interesting, thank you. I guess one thing I'm wondering about, and this is mostly outside looking in: have there been talks in the DOE about converging on these different scientific workflow representations? It seems like every lab kind of has their own, and I know the bioinformatics community has a bunch of their own too. I'm just curious about your thoughts on that.

A: So you're talking about standardizing workflow components, or standardizing workflow architectures?

Q: Yeah, among the DOE.

A: Yeah, there's the ExaWorks project that's part of the Exascale Computing Project. I'm not part of that, but I believe there are several members of the Lawrence Livermore laboratory working on it; I believe Dan Laney is one of them.
I'm not an expert on that particular effort, but I know there are efforts underway to do so.

Q: Yeah, hi. How are you dealing with the concept of users within your MPI Operator replacement?

A: Yeah, this is something that we haven't really gotten into yet. A multi-user cluster is going to be a little bit complex. At this point, we're basically assuming that everyone is the same user. But once we allow there to be multiple users, we're going to have to contend with a lot of the usual problems, like fairness and that sort of thing. So we haven't worked on that yet, but it's upcoming.

Q: Hi, maybe we can take this offline, but I was wondering: you're doing these improvements to the MPI Operator, or even now thinking of forking it, if I understand correctly. Why not bring these changes to the existing MPI Operator in Kubeflow?

A: Yeah, I think there were some complexities with the license agreement for the MPI Operator. I think it is possible to do so, but it is rather complex given the constraints of our various organizations. So yeah, that would definitely be desirable.

Q: And if those issues were resolved, you would be willing to do it?

A: Absolutely. And I think it's more a matter of the amount of time needed to resolve it, not whether it can be resolved.

Q: In the case where this was tested on AWS EKS, did you use any autoscaling there?

A: I'm sorry, I didn't hear that.

Q: Did you use any form of autoscaling?

A: No, not yet. But this is something that we want to integrate in the future. We're working on elasticity in the Flux framework scheduler, and we want to integrate that into Fluence and allow allocations to change shape over time. So yeah, that is definitely something we're working on.

Q: Any more questions? I've got one. Your comparison with kube-scheduler clearly shows that kube-scheduler is optimized for service-type workloads: it tries to spread pods, et cetera. Did you at least try to tune it, other than using pod affinity and anti-affinity? The configuration itself, like changing the weights of the various scoring features it has inside?

A: Yeah, that would be a question for Claudia; she did that work primarily. I believe so. I believe she also tried tuning the skew, but I don't want to speak for her too much because I didn't run that. So the answer is: I think so, with the caveat that I'm not sure. Yeah, thank you so much.