Hello, and welcome to our session at the KubeCon CloudNativeCon Europe 2022 conference, taking place in Valencia. We'll be talking today, Wednesday, May 18th, 2022, about KubeFlux, an HPC, or high-performance computing, scheduler plug-in for Kubernetes. My name is Daniel Milroy, and I'm a computer scientist at the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory in California, in the United States. I'm going to be the first speaker and will introduce some of the background, and then hand off to Claudia Misale, who is a research staff member at the IBM T.J. Watson Research Center. We are representing a larger collaboration between three organizations: the IBM T.J. Watson Research Center, Lawrence Livermore National Laboratory, and Red Hat.

To give you a little background on high-performance computing, and in particular supercomputing, I want to introduce a supercomputer to you. As an example, or perhaps exemplar, I'll highlight the El Capitan machine, which is projected to exceed two exaflops, that is, two times 10 to the 18th power floating-point operations per second, and is projected to be the world's fastest machine in 2023. Now, high performance also means high power requirements, which, in the case of El Capitan, will be approximately 40 megawatts or less. With this power consumption, in order to fit everything into a relatively small space, you need quite a bit of density: the compute blades that compose the machine are packed very closely together in cabinets and typically require liquid cooling. Supercomputers are also becoming more and more heterogeneous, which means that many different types of processors and coprocessors are integrated beyond just CPUs: for example GPUs, and now we're starting to see ASICs and FPGAs, as well as dedicated machine learning coprocessors. These machines feature high per-link bandwidth, currently greater than 200 gigabits per second per link. Perhaps more noteworthy for those of you who are not familiar with supercomputing are the low-latency requirements: supercomputers often have sub-microsecond switching latency. Finally, another differentiator of supercomputing is that there tends to be very high utilization; you usually see 99% resource utilization at any time. Beyond the performance capabilities, I'm featuring the El Capitan machine because it will integrate cloud technology, and we expect more machines in the future to follow this model.

High-performance computing achieves high performance through three primary areas: efficiency, proximity, and shape. HPC applications themselves strive for high parallel efficiency, which we also call scalability. This basically means that as you add more processors to an application, the application performance keeps improving. Many high-performance computing applications have problems that are subdivided and distributed across all of the different processors of an application, and those processors cooperate and communicate with each other in a pattern where they exchange subdomain boundary data with their neighbors. These applications tend to be very tightly coupled: they need to know who their neighbors are, and they communicate with them frequently. This raises the remaining points, shape and proximity.
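For reference, the parallel efficiency mentioned above has a standard textbook definition, not spelled out in the talk: with T(1) the runtime on one processor and T(p) the runtime on p processors,

```latex
% Standard definitions of speedup and parallel efficiency (illustrative; not from the talk)
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}
% an application "scales well" when E(p) stays close to 1 as p grows
```

An application is considered scalable when the efficiency E(p) stays close to 1 as the processor count grows.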
So a major way to achieve high performance is through placement, or knowledge of topology, that is, the connectivity and shape of the application as it's placed on the hardware itself. This includes knowledge about the non-uniform memory access (NUMA) nodes, the details of the level one through level three caches of the processors, GPU devices, coprocessors, and where these devices are relative to each other. This raises the issue of distance. Link length can actually play an important role for applications, especially those that are tightly coupled and those that have sub-microsecond latency requirements. In these cases, the speed of light can become a limiting factor for the applications.

Now, cloud is really becoming a dominant market force, which is influencing high-performance computing, and a lot of this is simply a question of economics and community use. Public cloud revenue is forecast to reach $500 billion in 2022 and up to $600 billion by the end of next year. You can contrast that with HPC spending, which should reach approximately $32 billion by the end of this year. While more money is certainly going into both areas, proportionally a lot more goes into cloud computing. HPC really shouldn't become, or want to become, a technology island. As research and development in software and hardware goes into cloud computing, unless HPC responds and integrates that R&D, it's going to become increasingly isolated. We can contrast that with Kubernetes: as we see at this well-attended conference, it has a huge user and open-source developer community, so there's a ton of energy going into this space right now. HPC is responding and taking notice, as evidenced by workshops dedicated to containers, container orchestration, and related technologies at the Supercomputing conference, which I think is the largest supercomputing conference in the world, as well as at other HPC-focused conferences. However, I should note that the cloud community is starting to recognize that it can adopt some technologies from HPC as well, in order to increase its efficiency and performance and to control costs. Really, the focus of this talk is that scheduling, or placing components of applications in Kubernetes, is not yet fully realized, and there remains great potential to unite these two communities in a converged computing environment.

Workflows at LLNL are already demanding cloud technologies within HPC, and we project that demand will increase dramatically. We're seeing applications and composite workflows such as the American Heart Association molecular screening workflow, also known as AHA MoleS, which makes use of Kubernetes. There is the rapid COVID-19 small-molecule drug design workflow, a 2020 ACM Gordon Bell Special Prize finalist, which is being adapted to make use of Kubernetes as well. We have the new Autonomous Multiscale Project at the laboratory, which is going to take advantage of the elasticity and automation that Kubernetes offers. We also have partnerships with Rutgers, which performed research on COVID-19 within the RADICAL-Pilot workflow. Despite all this current interest, the 2020 laboratory application survey determined that fewer than 10% of applications are currently using cloud, but there's a huge amount of latent interest: 73% may adopt cloud in the future.
Now, as a team, we're interested in converged computing, which is really the best of both cloud and HPC together within the same environment or system. High-performance computing schedulers, the components that decide where the processes of an application reside, are complementary with Kubernetes, meaning that each has disadvantages that are strengthened by the capabilities of the other. First of all, high-performance computing resource and job management software really can't orchestrate the full lifecycle of containers, and in particular the networking. It can, to a limited degree, start and stop containers, but not to the degree of full declarative orchestration, and it's certainly not designed for elasticity or automation. Kubernetes brings its own disadvantages for converged computing in that it was originally designed for loosely coupled apps, or microservices. The scheduler itself is limited; Claudia will talk in detail about what those limitations mean for converged computing in the next segment. The resource expression in Kubernetes is also limited in terms of what HPC needs. As I mentioned before, high-performance computing applications like to know every detail of the hardware and its topology. The Node Feature Discovery project aims to provide that level of detail to the application, but it is an addition to Kubernetes rather than a core component.

So all of this is to say, and I want to re-emphasize, that fully featured high-performance computing scheduling in Kubernetes has not yet been achieved, and it's something we really need in order to bring these two worlds together: to enhance the performance of applications in Kubernetes, and to enhance the scalability, flexibility, and automation of HPC. Unfortunately, most HPC schedulers and resource managers are not suited to the challenges of converged computing. There are significant trends toward complex workflows, extreme resource heterogeneity, and converged computing, which are rendering traditional workload managers increasingly ineffective. There are five principal challenges. The first is the co-scheduling challenge, where complex workflows require component coupling and co-scheduling of different types of hardware, such as CPUs and GPUs. The second is the throughput challenge, where uncertainty quantification and other ensemble-based techniques can submit tens of thousands of short-running jobs, which overwhelm traditional HPC resource and job management software. Third is the job communication and coordination challenge, where workflows depend on data transfer between various components within a framework. Fourth is the portability challenge: to make sure an application runs across multiple platforms, application designers and engineers need to port it to a large variety of HPC resource managers and schedulers. Finally, we're starting to see extreme resource heterogeneity and cloud integration, which stretches or exceeds the resource model capabilities of traditional resource managers and schedulers. Flux, or the Flux framework, solves all five of these key technical problems. Flux is an open-source project in active development.
The GitHub Flux framework organization is composed of multiple sub-projects, such as flux-core, flux-sched, which is also known as Fluxion, and flux-k8s, which provides the scheduling interface with Kubernetes. There are over 15 contributors to the project, including many of the engineers behind the Slurm resource and job management software for HPC. Flux has two modes: single-user mode, which has been around for almost four years, and multi-user mode, also known as the system instance mode, which is the plan-of-record workload manager for the laboratory's El Capitan exascale system. The Flux framework also garnered an R&D 100 award in 2021, so there's a tremendous amount of interest being devoted to this particular framework.

Flux also pioneers graph-based scheduling to manage complex combinations of extremely heterogeneous and diverse resources. Traditional resource models and their managers can't cope with extreme heterogeneity because they were designed when systems were node-centric, and the data structures they use to define the resource types are very simplistic; in this sense, they are homogeneous, or derived from a common, simplistic, node-centric root. The Fluxion scheduler, by contrast, uses a directed graph of resources, which consists of a set of vertices, which are the resources themselves, and edges, which define relationships between those resources, in turn elevating resource relationships to an equal footing with the resources themselves. This flexibility can be extended to the myriad resources offered in the cloud, and the graph representation can model flow-based resources such as network or power. Directed graphs can also express complex scheduling logic without changing the scheduler code. A final advantage is that directed graphs facilitate elasticity, or resource dynamism, through well-defined procedures and algorithms. Fluxion itself offers a rich API for graph traversal and for resource allocation specification and fulfillment through matching.

As an example, the Flux framework solves the critical scheduling need for El Capitan's Rabbit multi-tier storage. In order to schedule SSDs in each rack, you can select different components of the graph itself. You can mount these components, the SSDs, as node-local storage for compute nodes in the same rack, and that is then used to build an ephemeral, per-job Lustre parallel file system. This whole process was deemed far too difficult for traditional schedulers, but it's easily enabled by Fluxion, since no change to the core code is required. Scheduling SSDs anywhere versus within the same rack requires just requesting different resources in the resource graph, in other words, basically just changing your job resource allocation specification. The framework expresses the end-to-end job state transitions, which exhibits the richness of the Flux framework as it integrates with the DataWarp or Rabbit service containers provided by HPE, and integrates with Kubernetes. So Flux is developed with flexibility and extensibility in mind to integrate with the cloud. Claudia will now give details on how we combine it with Kubernetes in the scheduling framework.

In this second part of the presentation, I will discuss the requirements needed to run a traditional HPC workload on Kubernetes, then KubeFlux internals, and some preliminary performance results.
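To make the graph-based resource model more concrete, here is a minimal, self-contained Go sketch of a directed resource graph and an all-or-nothing match against it. The types and the depth-first matching routine are illustrative stand-ins, not Fluxion's actual data structures or API (Fluxion is implemented in C++ with a far richer jobspec and traversal interface).

```go
package main

import "fmt"

// Vertex is one resource in the directed graph (cluster, rack, node, GPU, SSD, ...).
// Illustrative only; Fluxion's real resource graph is much richer.
type Vertex struct {
	Type      string
	Name      string
	Allocated bool
	Children  []*Vertex // directed "contains" edges
}

// Request asks for a number of resources of each type, e.g. {"node": 1, "ssd": 2}.
type Request map[string]int

// match walks the graph depth-first under root and tries to satisfy the request.
// It returns the chosen vertices only if the whole request fits (all-or-nothing).
func match(root *Vertex, req Request) ([]*Vertex, bool) {
	need := Request{}
	for k, v := range req {
		need[k] = v
	}
	var chosen []*Vertex
	var walk func(v *Vertex)
	walk = func(v *Vertex) {
		if need[v.Type] > 0 && !v.Allocated {
			chosen = append(chosen, v)
			need[v.Type]--
		}
		for _, c := range v.Children {
			walk(c)
		}
	}
	walk(root)
	for _, remaining := range need {
		if remaining > 0 {
			return nil, false // cannot satisfy the whole request; allocate nothing
		}
	}
	for _, v := range chosen {
		v.Allocated = true
	}
	return chosen, true
}

func main() {
	// A tiny cluster: one rack containing two nodes, each with one local SSD.
	rack := &Vertex{Type: "rack", Name: "rack0"}
	for i := 0; i < 2; i++ {
		node := &Vertex{Type: "node", Name: fmt.Sprintf("node%d", i)}
		node.Children = append(node.Children, &Vertex{Type: "ssd", Name: fmt.Sprintf("ssd%d", i)})
		rack.Children = append(rack.Children, node)
	}
	// Asking for SSDs in the same rack as a compute node is just a different
	// request against the same graph; no scheduler code changes are needed.
	if vs, ok := match(rack, Request{"node": 1, "ssd": 2}); ok {
		for _, v := range vs {
			fmt.Println("allocated", v.Type, v.Name)
		}
	}
}
```

The point of the sketch is only that a constraint such as "SSDs in the same rack as the compute node" becomes a property of the request and the graph rather than of the scheduler code, which is the advantage the talk attributes to Fluxion.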
To run an HPC or AI workload, we need to fulfill the following requirements, which are missing in the Kubernetes default scheduler: we need batch scheduling, so the application is scheduled as a whole; the ability to express hardware topology and use it to make better placements; the ability to let applications use resources exclusively or not; and finally the ability to choose among different placement algorithms that better suit the application requirements. Now let's go over all of these points in detail, and we'll see what we can do with upstream Kubernetes and what we can enable with KubeFlux.

By batch or group scheduling, we mean that an application is composed of different replicas that need to be scheduled as a single entity, because the application will not work if any of the replicas stays pending. To solve this in upstream Kubernetes, we can use the PodGroup CRD and the coscheduling plugin published in the official scheduler-plugins repository. KubeFlux reuses the PodGroup CRD to label pods and gather them into a batch; it then creates a request to Fluxion to schedule the batch in an all-or-nothing fashion. Since Fluxion is a batch scheduler, this was very easy to enable. For instance, if we want to schedule an MPI job with four replicas, we get an allocation if and only if there is enough capacity to start all four replicas. (A small illustrative sketch of these all-or-nothing semantics follows this paragraph.)

To enable topology awareness in scheduling, upstream Kubernetes provides the NodeResourceTopology CRD and the topology-aware scheduler plugin, also published in the official scheduler-plugins repository. This plugin limits the topology to the node level and describes the NUMA topology. Another way to define your topology is manually, via labels, and then using affinity and anti-affinity, taints and tolerations extensively in the pod spec, but that solution is not scalable. The KubeFlux resource model is graph-based, the topology is at the cluster level, and it can also be extended with software information. Every pod, or group-of-pods, spec is translated into a graph that is matched against the graph of the resources.

Multitenancy is a core concept in Kubernetes, but HPC applications are not friendly to noisy neighbors and would rather run undisturbed. In upstream Kubernetes, we can again use labels, affinity, and so on, but this is not a scalable solution. We can also use Trimaran, an upstream scheduler plugin for load-aware placement; Trimaran learns from running applications and Prometheus metrics, so it's not able to solve the problem at the source. KubeFlux is able to assign exclusive resources to applications starting from any level of the topology. Let's consider an example where we want to run two molecular dynamics (MD) simulations and some other services as well. The simulations benefit from not sharing resources with one another, and each simulation needs its MPI ranks to be close together, for lower latency. KubeFlux will use a pack policy to place the MD simulations on a subset of the cluster and place the services on the remaining nodes.

Lastly, upstream Kubernetes provides different placement algorithms via profiles, and the scheduling framework makes it very easy to create custom scheduler plugins. Alternatively, you can force placement decisions through the use of labels, but you won't be able to simulate a pack algorithm, for instance. Also, it is not always possible to spread pods using anti-affinity.
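As a rough illustration of the all-or-nothing (gang) semantics described above, the following self-contained Go sketch gathers the replicas of a group and admits them only if every member can be placed. The pod and node types and the first-fit placement are simplified stand-ins, not the PodGroup CRD or KubeFlux's actual logic, which delegates the decision to Fluxion.

```go
package main

import "fmt"

// Simplified stand-ins for pods and nodes; not the real Kubernetes or PodGroup types.
type Pod struct {
	Name  string
	Group string // value of a hypothetical pod-group label
	CPU   int    // requested CPUs
}

type Node struct {
	Name    string
	FreeCPU int
}

// admitGroup tries to place every pod of the group; if any pod does not fit,
// nothing is placed (all-or-nothing), so no replica of an MPI job is left pending.
// First-fit over the node list also happens to pack pods onto earlier nodes first.
func admitGroup(pods []Pod, nodes []Node) (map[string]string, bool) {
	free := make([]int, len(nodes))
	for i, n := range nodes {
		free[i] = n.FreeCPU
	}
	placement := map[string]string{}
	for _, p := range pods {
		placed := false
		for i := range nodes {
			if free[i] >= p.CPU {
				free[i] -= p.CPU
				placement[p.Name] = nodes[i].Name
				placed = true
				break
			}
		}
		if !placed {
			return nil, false // whole group stays pending
		}
	}
	return placement, true
}

func main() {
	// Four MPI worker replicas that must start together.
	var pods []Pod
	for i := 0; i < 4; i++ {
		pods = append(pods, Pod{Name: fmt.Sprintf("mpi-worker-%d", i), Group: "mpi-job", CPU: 8})
	}
	nodes := []Node{{"node0", 16}, {"node1", 16}}
	if placement, ok := admitGroup(pods, nodes); ok {
		fmt.Println(placement)
	} else {
		fmt.Println("not enough capacity; group stays pending")
	}
}
```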
KubeFlux already implements different flavors of pack and spread, which are useful for scheduling workloads with very different requirements, for instance network-intensive workloads versus compute-intensive ones. Also, high- or low-level constraints can be specified at the job level in the YAML file, which can also produce different placement results, and this makes it very flexible.

The KubeFlux scheduler plugin for Kubernetes is implemented on top of the scheduling framework to promote portability and ease integration with Kubernetes. The scheduling framework allows the default scheduler in Kubernetes to be extended with custom logic, which means that by using the framework we are essentially building on top of the default scheduler, which is called kube-scheduler. The kube-scheduler does two things: it filters and scores the nodes in the cluster every time a new pod needs to be scheduled. It first goes through the filter phase, where it filters out the nodes that cannot host the particular pod, and then the nodes that can host the pod are scored in the score phase. Here the nodes are given a score based on different requirements, and the node with the highest score will host the pod. The green and yellow labels in the figure on the left are called extension points, and these can be customized with your preferred logic. Since the scheduling framework is used to extend the default scheduler, it exposes the same functions the default scheduler implements. In KubeFlux we implement the PreFilter and Filter extension points, skip the scoring part entirely, and then let Kubernetes do the node reservation, pod binding, and execution.

KubeFlux uses the Fluxion scheduler module as a library, which is made accessible through a Go language binding. We run this component in a sidecar container, and Fluxion runs as a service for the main container, which implements the extension points. The Kubernetes default scheduler is always running in the cluster, and more schedulers can be added, and they will all compete for the same resources. KubeFlux has its own internal graph-based representation of the cluster resources, which is managed in the sidecar container, and it keeps track of the existing allocations made by KubeFlux. While the default scheduler is aware of the resources being used by the pods scheduled by KubeFlux, the reverse is not true unless that tracking is explicitly implemented. For this reason, and also to avoid noisy neighbors, which are bad for HPC workloads, it is recommended to use KubeFlux on a dedicated set of nodes.

Let's see quickly how the execution flow works. At startup, KubeFlux gets the information about the nodes in the cluster and builds a cluster topology with which to start the Fluxion scheduler library. At this point, KubeFlux is ready to receive pod allocation requests. When we get an allocation request, the plugin converts the pod manifest into a specification format that Fluxion can understand, and a match-allocate gRPC call is sent to the sidecar container. If an allocation is possible, the result is given back to the KubeFlux plugin, which informs the Kubernetes infrastructure that the pod can now be executed on that particular node. (A small sketch of this plugin-plus-sidecar shape follows this paragraph.)

In this slide, we show some preliminary results from recent experiments to evaluate KubeFlux on a set of traditional HPC applications. Here we show the results of running LAMMPS, an MD simulation application modeling groups of particles in different states, which is implemented on top of MPI.
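To show roughly where KubeFlux hooks into the scheduling framework, here is a compact Go sketch of a plugin shape with PreFilter and Filter steps that defer the placement decision to a sidecar scheduler service. The interfaces and the sidecar client are simplified local stand-ins, not the Kubernetes scheduling framework types or KubeFlux's actual gRPC API.

```go
package main

import (
	"errors"
	"fmt"
)

// PodSpec is a simplified stand-in for the pod manifest seen by the plugin.
type PodSpec struct {
	Name string
	CPU  int
}

// FluxClient stands in for the gRPC client that talks to the Fluxion sidecar.
// In KubeFlux the sidecar wraps Fluxion (via its Go binding) and answers
// match/allocate requests; here we fake it with a fixed answer.
type FluxClient struct {
	nodeForPod map[string]string
}

// MatchAllocate plays the role of the match-allocate call: translate the pod
// request into a resource specification and ask the graph scheduler for a node.
func (c *FluxClient) MatchAllocate(p PodSpec) (string, error) {
	if node, ok := c.nodeForPod[p.Name]; ok {
		return node, nil
	}
	return "", errors.New("no allocation possible")
}

// Plugin mimics the shape of a PreFilter + Filter plugin: PreFilter asks the
// sidecar for a placement and caches it; Filter then only accepts the chosen node.
type Plugin struct {
	client  *FluxClient
	decided map[string]string // pod name -> node chosen by the sidecar
}

func (pl *Plugin) PreFilter(p PodSpec) error {
	node, err := pl.client.MatchAllocate(p)
	if err != nil {
		return err // no allocation; the pod stays pending
	}
	pl.decided[p.Name] = node
	return nil
}

func (pl *Plugin) Filter(p PodSpec, nodeName string) bool {
	return pl.decided[p.Name] == nodeName // no scoring: exactly one node passes
}

func main() {
	client := &FluxClient{nodeForPod: map[string]string{"mpi-worker-0": "node1"}}
	pl := &Plugin{client: client, decided: map[string]string{}}
	pod := PodSpec{Name: "mpi-worker-0", CPU: 8}
	if err := pl.PreFilter(pod); err == nil {
		for _, n := range []string{"node0", "node1", "node2"} {
			fmt.Println(n, "passes filter:", pl.Filter(pod, n))
		}
	}
}
```

Everything after filtering, that is, binding the pod to the node and running it, is left to Kubernetes, as described in the talk.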
We used the MPI Operator and KubeFlux for all of our experiments. We ran the MD simulation on a small cluster, a three-node OpenShift cluster, each node with 48 virtual CPUs and 192 gigabytes of memory, for a total of about 130 allocatable virtual CPUs. We ran two sets of experiments, one with 32 MPI worker pods and one with 64. We ran KubeFlux with the pack algorithm, and the kube-scheduler as the comparison. We ran the kube-scheduler with affinity in the case of 32 worker pods, because the pods would use only two of the compute nodes, to try to simulate some sort of packing. Affinity doesn't make sense for the 64 MPI worker pod case, because the job would span all three nodes anyway. We used the availability zone labels, as the nodes are all in different zones. We ran one MPI rank as a baseline, then 16, 32, 64, and 128 MPI ranks, in both the 32 and 64 MPI worker pod cases. And we tuned the mpirun command to make the best placement possible for the MPI ranks and enforce packing regardless of which scheduler is being used. Despite the tuning, we obtained better results with KubeFlux, as it allocates the pods with a packing policy, while the default scheduler cannot do that: it does not pack as many pods as possible onto one node first before placing the rest on the next one. Since LAMMPS is also communication intensive, neighboring MPI worker pods benefit from being close to one another, and we see better performance with KubeFlux. On the x-axis we have the number of MPI ranks, and on the y-axis we have the median number of timesteps per day; higher is better. In the graph we also highlight the KubeFlux histogram bars.

We also evaluated KubeFlux with GROMACS, another molecular dynamics library implemented with MPI, which is very common in HPC and is also used as a benchmark. We published our preliminary results in 2021 at the CANOPIE-HPC workshop at Supercomputing. We ran our experiments on a 34-node OpenShift cluster, and the cluster was distributed over three availability zones; each node had 4 virtual CPUs and 16 gigabytes of memory. We were not trying to run GROMACS at its full potential; rather, we evaluated scheduling strategies and how they influence GROMACS performance out of the box. We compared the Kubernetes scheduler against KubeFlux, and we ran KubeFlux in two ways: a vanilla version, where the cluster is considered, as in Kubernetes, a flat pool of resources, and a zone-aware version, where the nodes are mapped into a topology that is aware of the three availability zones, meaning that all the nodes are mapped into their specific availability zones. We measured strong scaling with the simulated nanoseconds per day, and again, higher is better. In this graph, on the x-axis we have the number of MPI ranks, and on the y-axis we have the median simulated nanoseconds per day; the legend is the same as before. Here, we run each MPI rank in a single worker pod. We see that performance starts to degrade more drastically when we run more than one pod per compute node. But in general, we see that vanilla KubeFlux outperforms both Kubernetes and KubeFlux with zone awareness. Vanilla KubeFlux spreads pods as much as possible, while the default Kubernetes scheduler and KubeFlux with zone awareness tend to pack a little more onto the nodes, and this promotes oversubscription.

There is a significant gap to be closed between Kubernetes and HPC before we have fully converged computing. On the top branch of the diagram, HPC needs more cloud capabilities to allow it to integrate with the bottom branch.
Moving from Slurm to LSF and then to Flux, there is increasing cloud readiness. On the cloud branch, there is progress in terms of increasing scheduling sophistication, performance, and efficiency, moving from vanilla Kubernetes to third-party plugins such as Volcano, which integrates batch capabilities into Kubernetes. The KubeFlux plugin for Kubernetes is an attempt to bring these two branches together into a unified converged computing environment. We are working toward a future where environments exhibit the elasticity and automation of cloud and the performance and efficiency of HPC; that is the best of both worlds. Thanks for attending the talk, and we'll now open up for questions.