This presentation is about managed Kubernetes as a next-generation academic infrastructure. But first, who we are. I am Lukáš Hejtmánek, an IT architect at Masaryk University, and I also contribute to CESNET, the Czech national research and education network. Hello, my name is Victoria, and I am a doctoral student at Masaryk University in Brno, Czech Republic, and an IT specialist at the Institute of Computer Science at Masaryk University as well. First of all, let me introduce the research and education infrastructure in the Czech Republic. Apart from the national supercomputing center, we have two main kinds of infrastructure available to scientists: an HPC infrastructure and a Kubernetes infrastructure. The HPC infrastructure consists of 32,000 CPU cores, has 15 petabytes of storage capacity, and is used by 3,000 active users, who run about 20,000 jobs every day. It also has 360 GPUs of various kinds. This infrastructure is based on the PBS Pro batch system. The other one, the Kubernetes infrastructure, consists of 2,500 CPU cores and has 6,000 terabytes of dedicated storage capacity, backed by a flash-only storage array. It is currently used by about 200 users, who run 1,000 pods every day, and it is equipped with 50 GPUs. Some of them are NVIDIA A100s that are yet to be installed, and we will experiment with MIG technology as well. This Kubernetes infrastructure is based on Rancher and the RKE distribution. Speaking of managed Kubernetes, what can we imagine under that term? Basically, it means that a DevOps team manages the infrastructure. We offer tight integration with the rest of our infrastructure, such as the HPC systems, and we aim to offer many components that allow easy deployment of your applications. We have, for instance, several storage classes such as NFS, Samba, SSHFS, or CVMFS. We also integrated CephFS, but this storage class uses a special version of the CephFS driver.
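As a minimal sketch of how a user would consume one of these storage classes, an ordinary PersistentVolumeClaim is enough; the class name `nfs-csi` and the namespace here are illustrative placeholders, not the names actually used on our clusters:

```yaml
# Hypothetical claim against an NFS-backed storage class.
# The storage class and namespace names are assumptions for illustration.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: project-data
  namespace: my-project
spec:
  accessModes:
    - ReadWriteMany        # NFS-style storage typically allows shared access
  storageClassName: nfs-csi
  resources:
    requests:
      storage: 100Gi
```

Any pod in the namespace can then mount the claim as a volume, regardless of which backend the class maps to.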
This driver has been patched so that we are able to change the user ID and group ID that are locally visible, so it does not matter under which user ID the container runs. The patch is public as a pull request to the upstream project, but as far as I know, it is still not merged. We also have a Onedata storage class, and both of these storage classes are implemented as FUSE CSI drivers. We have a workaround so that the CSI driver can be restarted without breaking mount points. Next, we have integration with the DNS system for Ingress and LoadBalancer services: a DNS name is automatically created for such a service. We also provide Let's Encrypt certificates, both for Ingress and for non-web services. We provide a single sign-on service based just on annotations, so if users want single sign-on for their application, they just need to add an annotation to the Ingress, and single sign-on is automatically registered and provided. We also offer shared GPUs, which means that a single GPU can be shared by multiple containers or multiple users, but there are no guarantees about the resources consumed from the GPU. We also have a slightly modified GPU operator from NVIDIA that enforces GPU allocation, so no user can basically steal a GPU without letting the Kubernetes scheduler know. So let's look at managed Kubernetes from the user perspective. Users are given a project and a namespace, and we enforce a resource quota on CPU and memory. Users are allowed to run only unprivileged containers, which can be a bit limiting, but on the other hand we do not force users to use any particular user ID; users may use any user ID they want. Users also cannot install custom resource definitions or any other cluster-scoped resources. This operation is forbidden and only an administrator can perform it, which basically means the DevOps team has to install such resources.
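The quota enforcement just mentioned is expressed in standard Kubernetes terms as a per-namespace ResourceQuota; roughly like this, where the concrete numbers and names are made up for illustration:

```yaml
# Illustrative per-namespace quota on CPU and memory.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: my-project
spec:
  hard:
    requests.cpu: "16"       # total CPU cores requestable in the namespace
    requests.memory: 64Gi    # total memory requestable in the namespace
    limits.cpu: "32"
    limits.memory: 128Gi
```

With such a quota in place, every container in the namespace must declare requests and limits, and the API server rejects pods that would exceed the namespace totals.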
But we want users not to struggle with maintaining the infrastructure, maintaining Kubernetes, and maintaining all the components that need to run; users can focus only on their own applications or workloads and fully utilize the services the DevOps team provides. However, we do not offer just an infrastructure. We go a bit further and prepare some prefabricated applications, such as JupyterHub and BinderHub, as those two are famous and very popular. The JupyterHub offers integration with the HPC storage systems via SSHFS, and we also have two special instances of JupyterHub. One is RStudio running inside JupyterHub, so a user can get RStudio in one click, integrated with the HPC storage system; the other one is AlphaFold on demand. This application is based on a ColabFold Jupyter notebook, and we also integrated the Mol* viewer, which allows the user to preview the folded protein. Those two applications, JupyterHub and BinderHub, run as web applications with their own login systems, but next to them we prepared other applications that are accessible directly as Rancher applications. These are mainly based on remote desktops, and we offer applications such as KNIME, MATLAB, ANSYS, the VMD viewer, and IBM CPLEX. All these applications are based on either the VNC protocol or the WebRTC protocol. In the latter case, the user is given a fully 3D-accelerated desktop that is capable of almost anything. We also prepared containers that allow users SSH access over the network. These containers behave much like a virtual machine: although the user does not have root access in the container, using some, let's say, tricks and hacks, the user can install any package in it, so it behaves much like a virtual machine.
We also offer some web-based applications such as code-server or Neo4j, and other applications such as a personal MinIO, a ParaView server, Scipion, or a personal Samba server. "Personal" here means that the user can run MinIO or Samba on their own and connect their local computer to this service via S3 or via the popular Samba protocol, for instance from a Windows system. Here you can find some examples of our prefabricated applications. On the top left you can see RStudio running in JupyterHub. Below it you can see the form for AlphaFold on demand; most of the parameters used by the standard AlphaFold scripts can be filled in. On the bottom right you can see the Mol* viewer that offers the preview of the folded protein, and on the top right you can see the probably famous game The Witcher 2, running in the browser from Kubernetes, fully accelerated. It uses WebRTC and is based on this project. So for a while you can enjoy the gameplay. Now let me reveal some implementation details, first for the remote desktops. Our solution is completely unprivileged, meaning that none of the participating containers needs privilege escalation or runs as root; everything runs as a regular user. However, it requires a patched X server. It also requires some minor changes to the NVIDIA GPU operator. As I mentioned, we enforce GPU allocation, and this enforcement denies sharing a GPU among containers, because NVIDIA_VISIBLE_DEVICES=all is ignored if it is the only way a GPU is requested. However, we use the GPU-sharing scheduler from Alibaba Cloud that is publicly available, and with it we can share the GPU between the X server container, the desktop container, and the streamer container. I also mentioned that we offer integration with the DNS system; however, we have no solution for name conflicts. Currently, any user can select any domain name under a specific subdomain.
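For the GPU sharing between the X server, desktop, and streamer containers, the publicly available Alibaba Cloud gpushare scheduler extender exposes GPU memory as a schedulable resource, so a pod requests a slice of GPU memory rather than a whole device. A sketch, with an illustrative image name and memory figure:

```yaml
# Sketch of a container requesting a slice of GPU memory via the
# Alibaba Cloud gpushare scheduler extender. aliyun.com/gpu-mem is the
# extended resource it registers; the image and the 4 GiB figure are
# placeholders for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: desktop
spec:
  containers:
    - name: desktop
      image: my-desktop-image:latest
      resources:
        limits:
          aliyun.com/gpu-mem: 4   # GiB of GPU memory, shared with other pods
```

Several such pods can land on the same physical GPU, which is what lets the three desktop-related containers share one card, with the caveat from above that consumption is not strictly guaranteed.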
However, this subdomain is shared among all users, so name conflicts can arise, and there is currently no solution for this with the external-dns driver. Also, with Let's Encrypt certificates there is a problem with the DNS challenge, because we offer certificates for the whole subdomain that is meant for both external-dns and Let's Encrypt certificates. In this case, every user is able to obtain any certificate in this domain, because there is no real validation of the request, and we are not aware of any viable solution to this problem. Probably one solution could be to create distinct DNS zones for each user or each group of users, but this is currently not implemented. We decided to use Kubernetes also for sensitive data processing. We set up a small cluster dedicated only to sensitive data processing, separated from the public cluster. However, this single small cluster is used by all the users who want to process sensitive data. We are working on ISO 27001 certification, which is comparable to NIST 800-53 certification. But as I said, the single cluster is shared by distinct users, which brings some isolation challenges, mainly related to the usually single Ingress instance and also to the Istio instance, which is not multi-tenant by default. We do not run just a few web applications or remote desktops on our Kubernetes infrastructure; we also run HPC jobs on a pretty regular basis. Currently we run the HPC jobs via workflow managers. We use two of them: one is Snakemake and the other is Nextflow. Snakemake is integrated with the Task Execution Service (TES) from the GA4GH initiative, and Nextflow is directly integrated with Kubernetes. So how do HPC jobs work on Kubernetes? There are some bad rumors that it does not work, as I have heard, but all we can say is that it works. There are of course some limitations, but they bring some research opportunities. We also created many Nextflow enhancements.
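As a sketch of the Nextflow side, pointing Nextflow at Kubernetes is a matter of its standard `k8s` executor configuration; the namespace, service account, and claim names below are placeholders, not our production setup:

```groovy
// Minimal nextflow.config sketch for Nextflow's built-in Kubernetes executor.
// All names are illustrative placeholders.
process {
    executor = 'k8s'
}
k8s {
    namespace        = 'my-project'
    serviceAccount   = 'nextflow-sa'
    storageClaimName = 'project-data'   // shared PVC mounted into every task pod
    storageMountPath = '/workspace'
}
```

With this in place, each Nextflow task is submitted as a pod in the given namespace, and all tasks exchange data over the shared claim.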
The biggest one is adding Kubernetes Job support, which makes the Nextflow computation almost immortal, so it is pretty stable and runs just fine. As Lukáš already mentioned, there are limitations of HPC in Kubernetes. These limitations are eventually beneficial because they bring research opportunities for the community. There are plenty of areas where research can be conducted, but we started with scheduling challenges because they were the most prominent to us. I would like to present to you some of our research interests, problems we tackled, solutions we found, and new areas we would like to scrutinize. I will talk about four topics. The first is efficient resource allocation in a heterogeneous and dynamic environment, which is basically a Kubernetes cluster. The second is infrastructure comparisons of Kubernetes and traditional HPC based on batch scheduling. The third topic will be the area of green computing, which, with rising electricity prices and the global climate status, is quite an important topic. And the fourth and last topic will be about connecting Kubernetes with HPC in a hybrid way. Firstly, I am going to talk about effective resource allocation in Kubernetes. As we all attend HPC Day, I believe the majority of you have at some point asked, answered, discussed, or just come across the question of effective scheduling in a computing environment. Scheduling is an omnipresent topic, because everyone tries to come up with the best scheduling strategy that will accommodate the most jobs on all nodes, where no job will wait too long and cluster usage will be above 90% with no downtimes. Sadly, this is not the reality, and we all experience a plethora of problems. We come from an academic environment where computational resources are provided more or less for free to all researchers and academics. This is a very different approach from commercial providers, where you can prepay nodes for a desired time or follow a pay-as-you-use model.
When you have to pay for compute, you naturally do not want to pay providers more than necessary, not to mention if you have specific resource requests such as graphics cards, whose usage can be really pricey. From the opposite point of view, providers reach very high resource usage because they combine the offered plans in a very smart way and overcommit resources heavily. Our experience in academia clearly shows that users drastically overestimate their resource requests. As you can see in the image, even the best usage-to-request ratio for a namespace shows a two-fold difference. Users lack motivation for precise resource requests because, as I said, resources are free, and secondly, they either do not know how the application behaves or the resource usage of the workload is not stable over time. However, bursty workloads, as we call applications with unstable resource usage, are not the only case that makes scheduling in Kubernetes hard. We distinguish two types of these bursty jobs: one is long-running services that are used, say, three times a week for two hours, and the second is computations characterized by dynamic variation, where most of the time resource usage is low, but for some short time, perhaps during a more complex part of the computation, resource consumption spikes. Users fear their job will exceed the allocated resources, which would cause job termination, so they rather specify substantially more resources than are needed in order to avoid that situation. The second scheduling problem is the already mentioned user overestimation, which causes low cluster usage and unused resources. The reason might be just sheer obliviousness to the concept of and logic behind resource allocation. The third problem is posed by interactive jobs, which are common in HPC, for example when working with software like MATLAB or ANSYS. If an interactive workload is created, the user does not want to wait too long until the job moves from the waiting queue to running.
They want to work instantly, or within approximately two to three minutes. In Kubernetes you can set a higher priority on the interactive job, but then you must decide which pod can be terminated. You must also watch out for already waiting jobs: they might require just slightly more resources than your new interactive job, and these interactive jobs could starve others who are already waiting. Lastly, the fourth scheduling problem is tied more to academia, where you need to enforce fairness and at the same time account everyone for their resource usage. Kubernetes does not implement any built-in accounting or user fairness, if we talk about multi-tenant clusters, but these are crucial concepts. Imagine a user who spawns too many interactive jobs: this user will use all of the resources, and a new user might never get to compute. The good news is that there are some solutions to these problems. We proposed one possible solution to the need to reserve resources in the manuscript linked below. The solution is based on the existence of jobs, small or large, it does not matter, that can be evicted easily. Maybe they do checkpoints; maybe their inherent logic counts on restarts. Nevertheless, if a larger or an interactive job arrives, these jobs, which we call scavenger jobs, are the first ones to be terminated, and the freed space they occupied is instantly taken by placeholder jobs that serve just as a reservation. Once enough scavenger jobs are terminated to accommodate the new workload, all these placeholder jobs release their resources to the workload for which the reservation was created in the first place. This is actually one way of implementing forward reservations as we know them from HPC. Another, much easier solution would be to create separate clusters, where each cluster is dedicated to accommodating a specific workload type.
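The scavenger-job idea maps naturally onto Kubernetes priority classes: scavenger workloads run at a low, non-preempting priority, while interactive jobs get a higher one and may preempt them. A minimal sketch, where the names and values are illustrative rather than our production configuration:

```yaml
# Low priority for easily evictable scavenger jobs.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: scavenger
value: 1000
preemptionPolicy: Never          # scavengers never preempt anyone else
description: "Evictable filler workloads"
---
# Higher priority for interactive jobs; scheduling one may preempt scavengers.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: interactive
value: 100000
description: "Interactive desktops and notebooks"
```

A pod opts in by setting `priorityClassName: scavenger` (or `interactive`) in its spec; the placeholder-job bookkeeping described above then sits on top of this basic preemption mechanism.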
One more solution is the vertical pod autoscaler, which should be available from Kubernetes version 1.25 and is able to scale resources on a running container. This approach might solve a lot of issues, since you can change the pod requests on the fly. Now I will move from effective resource allocation to HPC in Kubernetes. We have been researching the potential of the Kubernetes platform to run big workloads, such as analyses of genomic data using a workflow manager. We asked ourselves two questions: can HPC work in Kubernetes, and will short-living tasks perform better in Kubernetes? We answered those questions by performing several genomic analysis runs on different infrastructures: a traditional HPC environment with the batch scheduler OpenPBS, and a second environment, the Kubernetes cluster. We compared NUMA-aware and non-NUMA-aware Kubernetes environments with a NUMA-aware OpenPBS environment. From our observations we can safely state that for Kubernetes to perform as well as, or even better than, a traditional HPC environment, proper NUMA configuration is the most important aspect of success. We configured just the standard Kubernetes NUMA settings, so no custom solutions or deep system administration work were needed. We also found out that the NUMA memory manager has limitations, because the Kubernetes scheduler does not see the amount of memory available on each NUMA node; it observes only the whole state of the cluster. It happened to us that many pods were rejected from the cluster due to an unexpected admission error, which is unrecoverable. This error is caused by there not being enough memory on the NUMA node assigned to the pod. This truly happened just because the scheduler thought that enough memory was available overall, but the pod was assigned to a specific NUMA node that did not have the memory.
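The "standard Kubernetes NUMA settings" referred to here are the kubelet's CPU, Topology, and Memory Managers; roughly, the relevant kubelet configuration looks like the following sketch, where the reserved-memory size is illustrative:

```yaml
# Sketch of kubelet settings that enable NUMA-aware placement
# (part of the node's KubeletConfiguration).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                 # pin exclusive CPUs for Guaranteed pods
topologyManagerPolicy: single-numa-node  # align CPU, memory, and devices on one node
memoryManagerPolicy: Static              # NUMA-aware memory allocation
reservedMemory:                          # per-NUMA-node reservation (size illustrative)
  - numaNode: 0
    limits:
      memory: 1Gi
```

Note that these managers act at admission time on each node, which is exactly why the cluster-level scheduler described above can still place a pod on a node whose individual NUMA nodes cannot satisfy it.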
Additionally, the time elapsed from a job being scheduled to the job running is much shorter in Kubernetes, because container images are cached and therefore start almost immediately, whereas in the OpenPBS environment there is a bit of setup which, with a larger number of jobs, significantly delays the whole computation. As a matter of fact, runs in Kubernetes were much more stable overall. In these figures you can see the graphical interpretation of our results. The upper left picture shows that the average duration of the long-running processes of the genomic analysis is the highest in the non-NUMA-aware Kubernetes environment. If we configure NUMA, the time is identical to, or just slightly higher than, OpenPBS. On the other side, as the upper right picture shows, if we compare short-living tasks, Kubernetes, with either the NUMA or non-NUMA configuration, performs significantly better than OpenPBS. In summary, the bottom image shows the total duration of the genomic analysis, where we clearly see that Kubernetes with the NUMA configuration delivers results faster than the PBS environment. This is caused by the combination of the long-running and short-running processes: there were many more short-running processes than long-running ones, and since these short processes were computed faster than in OpenPBS, the whole computation was faster. To sum up the infrastructure comparisons, we just saw that Kubernetes is certainly capable of accommodating HPC workloads, and its performance could improve even more. We found out that the Kubernetes scheduler acts almost as a LIFO (last in, first out) queue, because it does not preserve queue order, and the implemented exponential back-off makes just more mess in the queue. Secondly, Kubernetes does not reserve resources, which would be handy for certain workload types, such as ones that request basically the whole node.
Thirdly, low global knowledge of node resource allocations leads to fragmenting memory and CPUs, which could be used more efficiently with a bit smarter or more knowledgeable scheduler. And just to mention, this work was all done as part of a manuscript that is currently under review. This marks the end of the infrastructure comparisons, and now I will move on to green computing. Green computing is a term that everyone in IT has heard of, especially these days, when we all hear about and really feel the rising prices of electricity and listen to stories about how not deleting emails adds to climate change by keeping servers on. The majority of cluster providers would agree that there are times when huge clusters are just turned on but not utilized, or utilized with really low effectivity. There is a whole unexplored field of better scheduling strategies that would accommodate workloads with higher efficiency. Furthermore, we as infrastructure providers should educate users on the best way of utilizing the infrastructure and the best environment for their application: a small container will truly be better for a static website than starting a whole virtual machine. Moreover, we can tune the hardware based on CPU usage and power nodes on and off based on true usage. As an idea to work on, we came up with the thought that some cluster nodes could be dedicated to running specific workload types, similar to scavenger jobs or short-lived jobs. If there is a sudden spike in the number of pods of this type, a new node could be dynamically added to the cluster just for these workloads. After these workloads finish, the node could be powered off again. All these steps might look like implementations of just simple thoughts, yet they have great power in reducing power usage and increasing efficiency. Lastly, I would like to mention the concept of the hybrid cloud, which can be seen as a solution to the scheduling problem as well.
The idea is pretty straightforward and is based on connecting the HPC world with the Kubernetes world. The HPC world usually has more resources or better scheduling capabilities, and the Kubernetes world is perfect for other, let's say short-lived, workload types. We are currently working on the implementation of an OpenPBS connector, which would allow moving pods from Kubernetes to the PBS world transparently, without modifying the workload inside the pod at all. The container would be executed in the PBS environment as a container as well, probably just with more resources than are available in the Kubernetes cluster. And with that said, I would like to finish this presentation on Kubernetes as the next-generation academic infrastructure. Thank you for your attention.