Hi. Today we are going to talk about implementation challenges on the way from HPC to containers in academia. Who are we? Hello, I am Lukáš Hejtmánek. I am from Masaryk University and CESNET in the Czech Republic, and I am an IT architect of our infrastructure, which consists of both storage and compute nodes. And I am Viktória. I am also from Masaryk University, where I work as an IT specialist, but essentially I take care of our Kubernetes infrastructure.

Our Czech national e-infrastructure, e-INFRA CZ, operates an HPC environment. We have approximately 20,000 CPU cores, 200 GPUs and 60 petabytes of storage. The computational resources are accessible mostly through a batch system, PBS Pro, and the storage resources are accessible through Kerberized NFS version 4. Storage can also be accessed via S3 or Ceph RBD, but only a minority of users choose this way. We have about 1,000 active users.

In HPC we have two types of resources: compute resources and storage resources. Users interact with compute resources by writing shell scripts and running them in the batch system, PBS Pro. They have to have SSH experience, because these batch systems do not provide any graphical user interface. We try to change this with Open OnDemand, which was our attempt to provide a graphical user interface. Secondly, the storage is directly available on worker nodes. However, storage resources are spread among various Czech cities, and this creates confusion for users because they have to take care of compute and data affinity. On the other hand, these storages can be mounted on users' own computers.

HPC brings some troubles. First of all, there is no straightforward way to monitor running computations, so users can't easily check the state of their jobs. Secondly, older scripts are not compatible with today's technology; I bet most of you know the Python 2 versus Python 3 issues. Furthermore, every user must possess at least basic Unix skills, because they have to interact with batch systems that are driven from the command line. Another problem is caused by time-limited access to Kerberized storage: after a certain time, users have to renew their access token. Last but not least, setting up an NFS client is a hard task for any user who wants to access their data from their own computer.

So far I have been talking only about HPC, but as we are at KubeCon, I will move to containers now. I think the majority of you have already heard about NGC containers or BioContainers that are meant to be used in HPC, but how do we use them in HPC? In shared infrastructure, Docker is mostly prohibited due to security issues, so we can use Singularity, but this tool has problems of its own. We also have another option, Podman, but if we think about it, why don't we use native container infrastructure? As an infrastructure manager, you can choose between building a shared container infrastructure or letting users run their own container infrastructure. Building shared infrastructure has advantages mainly for users: they don't have to deploy and maintain the whole infrastructure, which is in fact much more complicated than just being able to work with SSH and write some shell scripts. Instead, they can focus on research and their work. However, some infrastructure managers just provide users with the right tools to run their own infrastructure, such as OpenStack Magnum. Here in the Czech Republic, we decided to go with the former approach and build a distributed container infrastructure.
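To make the batch workflow mentioned above concrete, a minimal PBS Pro job script might look roughly like the following sketch; the module name, resource limits, scratch variable and storage paths are illustrative placeholders, not our actual configuration.

#!/bin/bash
#PBS -N example-job
#PBS -l select=1:ncpus=4:mem=8gb
#PBS -l walltime=04:00:00

# Load a software module (hypothetical name) and move to the scratch space;
# SCRATCHDIR stands for a site-provided scratch directory variable.
module add python/3.9
cd "$SCRATCHDIR"

# Stage input data from a mounted storage path (hypothetical), compute,
# and copy the result back to persistent storage.
cp /storage/city1/home/"$USER"/input.dat .
python3 analyze.py input.dat > result.out
cp result.out /storage/city1/home/"$USER"/

The user then submits the script with qsub and polls its state with qstat, which is exactly the command-line interaction that Open OnDemand tries to hide behind a web interface.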
We operate several Kubernetes clusters that are built on Rancher Kubernetes Engine 2 (RKE2), together with the Rancher dashboard. Users can work with these clusters in multiple ways: they can interact with the native Kubernetes API, they can work with pre-deployed containerized applications or frameworks, or they can work with the Rancher dashboard. To explain a bit more: if users choose to interact with Kubernetes directly, they get their own or a shared project, depending on the use case, together with a namespace. They can utilize various persistent storages such as NFS, CephFS or S3, and they can also utilize GPUs and InfiniBand. If users do not want to interact with Kubernetes directly, they can take advantage of applications which we have pre-deployed, such as a containerized Jupyter notebook service with Binder, or an instance of the bioinformatics workflow tool Galaxy. We also offer Kubeflow and a 3D-accelerated desktop that users can use to run GPU-intensive applications such as Ansys or Matlab remotely. Furthermore, we also support various frameworks, such as the GA4GH TES standard, Nextflow or Snakemake.

Having all of these, what are the benefits for users? First of all, users don't have to be skilled in shell scripting and other nasty Unix peculiarities. They don't have to interact with Kerberos and configure NFS. They don't have to be aware of the NREN topology at all. They don't have to know about software modules and their dependencies, because we provide them in pre-prepared containers. Overall, this is a direct way to run HPC containers easily. Every coin has two sides, and from the beginning I have been talking just about one side, the brighter and better one. Lukáš will now talk about the second side, which is much darker.

So, let's look at the dark side of containers in HPC. When we think about containers in HPC, some challenges arise, and I will briefly introduce them here. The first challenge is Kubernetes and HPC integration. Why do we need it? Imagine you are running some kind of infrastructure; you just can't tell users, okay, we are finished here and from tomorrow everything is different. It just can't happen. Also, if you don't have an unlimited budget, you cannot simply build a new parallel infrastructure, so you need to make some kind of in-place transition. The second challenge: users are familiar with queues, and they also expect some fairness from the infrastructure, as queues are a natural thing in HPC. The third one is scheduling: if you have significantly more users than compute or storage resources, you need to schedule them somehow. And last but not least, we need to gain users' trust, so they are willing to use the new container infrastructure.

Okay, we are talking about Kubernetes and HPC integration. What does it mean? It means how to integrate the existing HPC infrastructure with Kubernetes. HPC infrastructure typically consists of three parts: the authentication and authorization infrastructure, compute nodes, and storage. If we want to integrate it, we have to deal with all three parts. Well, the authentication and authorization infrastructure can be shared between Kubernetes and HPC, because once you have a user database, you can naturally use it for both; the credentials can be used for both parts, so there is no problem here. Also, worker nodes are easily shared between Kubernetes and PBS Pro, because you can easily drain a node of containers, or drain it of PBS Pro jobs.
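As a rough sketch of what moving a worker node between the two schedulers can look like, assuming a generic node name and the usual kubectl and PBS Pro client tools (the exact procedure differs per site):

# Stop scheduling new pods on the node and evict the running ones.
kubectl cordon worker-01
kubectl drain worker-01 --ignore-daemonsets --delete-emptydir-data

# Hand the node over to the batch system by clearing its offline flag in PBS Pro.
pbsnodes -r worker-01

# Later, the reverse direction: take the node out of the batch pool
# and make it schedulable for Kubernetes again.
pbsnodes -o worker-01
kubectl uncordon worker-01

In practice you would of course wait for the running PBS jobs to finish before uncordoning the node.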
And once the node is free, you can assign it a new function, so you can switch it between Kubernetes and PBS Pro. Moreover, you can even run both types of workload at the same time if you utilize the PBS Pro Kubernetes connector, but we do not use it. On the other hand, storage. Storage is the real challenge, and we will look at it in more detail.

So, as I have said, HPC storage integration is a really big challenge. This is because HPC infrastructures are usually based on NFS or AFS file systems, and those file systems are meant for large data that the user stores, stages to worker nodes and computes on. You would need the same for Kubernetes. So how do you access HPC storage from Kubernetes? The short story is: you can't. This is because you have to somehow deal with user authentication. User authentication is usually done in one of two ways: either access tokens are used, or the whole authentication is based only on user IDs. The problem with access tokens is that those tokens do not understand namespaces, and as you may know, the whole container world is based on namespaces. There is also a minor problem: access tokens are usually time-limited, and you somehow need to renew them. With the other option, UIDs, the problem is that most containers run as user 1000, and you can't distinguish between users by user ID alone because all of them appear as user 1000. So you need to somehow remap user IDs between the local user 1000 and whatever user ID the person has on the remote HPC side.

So for storage, we tried, or somehow utilized, three kinds of file systems. One is the old and well-known NFS. The second is SSHFS, and the last one is the Common Internet File System (CIFS), which you may know from the Windows world as the Samba protocol. In the case of NFS, no UID remapping is possible. On the other hand, it's fast and there are many CSI drivers for NFS. But the missing user ID remapping is a real problem: you can use NFS for persistent volume claims on local storage that is dedicated to Kubernetes, but you can't integrate it with HPC storage that uses variable user IDs, where every user has a different user ID. In the case of SSHFS, it can remap user IDs, but on the other hand it's slow, and there is also the problem that the CSI driver must not restart, because if the CSI driver restarts, all mounts break and the users lose access to their data. In the case of CIFS, it can actually do user ID remapping and it has acceptable performance, at least in the latest versions, but unfortunately it is not widely supported in the HPC world. So if you come to an HPC administrator and tell him or her that you want to utilize CIFS, the answer is: no, you can't, because we don't offer this file system.

As I have said, queuing and fairness is another challenge. Queuing is currently not present in vanilla Kubernetes, but you can install add-ons such as Armada that provide a system of jobs and queues, which is really similar to an HPC system. But do we need a queuing system at all? A queuing system in HPC is usually used to distinguish between different kinds of worker nodes, such as SMP nodes or HD nodes, but for this functionality you can fully utilize the Kubernetes system of labels, and we believe that the labeling system can actually replace the queuing system used in HPC.
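To illustrate how labels can stand in for queues: nodes are labeled by class, and a pod selects the class it needs, much like choosing an SMP queue in PBS Pro. The label key, node name and image below are hypothetical.

# Label the large shared-memory machines (hypothetical label and node name).
kubectl label node smp-node-01 node-class=smp

# A pod that must land on one of those machines.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: bigmem-test
spec:
  nodeSelector:
    node-class: smp
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]
EOF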
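And going back to the storage discussion for a moment: the UID remapping that CIFS offers is done with plain mount options on the client side, which is exactly what Kerberized NFS lacks here. The server name, share and credentials file below are made-up placeholders.

# Mount an SMB/CIFS share and present every file as owned by the in-container
# user 1000, regardless of the UID stored on the HPC storage side.
mount -t cifs //storage.example.org/home-jdoe /mnt/hpc-home \
    -o credentials=/etc/cifs-credentials,uid=1000,gid=1000,vers=3.0

SSHFS can achieve a similar effect with its uid and idmap options, which is why it appears in the comparison despite its performance problems.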
However, we also need fairness: we need to make sure that some, say, greedy user does not eat the whole infrastructure, so that every user can eventually get their computation done. So we need some kind of fair-use policy, and we need some mechanism that can enforce it. In this area we are not aware of a complete solution yet, but we think that resource quotas and priorities can at least help, and maybe that is all we need.

Another challenge is scheduling. PBS Pro contains a complex scheduler that can do the complex computation of where and when a particular job will run. On the other hand, Kubernetes contains a simple scheduler that has no notion of fairness or of computing the order of pods in this way. The PBS Pro scheduler can also handle situations such as a big job that must eventually run on some node: the node is drained and the job eventually runs. This is something the Kubernetes scheduler doesn't do, and we have seen this problem in practice: a user submits a job that is greedy on resources, for instance it needs 60 CPU cores, and Kubernetes does not drain any node to satisfy this requirement. It can be solved by pod eviction, if you configure the infrastructure so that pods can be evicted from a node, but this approach is not good for the HPC case. Imagine a pod that has been running a computation for one month: you definitely do not want to evict it almost at the end of that month, because the whole computation would then start again from the beginning. Also, almost all scheduling algorithms assume that jobs have a finite running time, but many Kubernetes resources, such as Deployments or StatefulSets, have no time limit, so they potentially run for an unlimited time. This is not a bug; these resources are designed to run indefinitely. But for scheduling, this kind of pod is a real problem, because the algorithms can't deal with such a situation. So it is still an open question how to do scheduling in Kubernetes in a way that is somehow similar to PBS Pro; our goal, however, is not to reinvent a new PBS Pro that is merely fitted onto a Kubernetes infrastructure.

The last but not least challenge is how to gain users' trust. Users currently use the HPC infrastructure; they have problems with it, but they have at least something, and it eventually works. If you introduce a whole new infrastructure, they are naturally afraid of changes: will it work, is it stable, and mainly, is it just hype, or will it be a stable infrastructure for the coming years? You have to convince them that yes, it will work, yes, it is stable, and yes, it will survive next year and many years after that. We also need to build better portals to make things easier for the users.

Our future plans include continuing the transition from PBS Pro to Kubernetes, and we also plan to build an experimental setup. We imagine that worker nodes will be equipped with large SSDs, and we want to build a fast shared storage from these SSDs in the worker nodes. The challenge is how to provide reasonable data redundancy. Most solutions consist of simply replicating data, so you have at least two or maybe three copies, but in such a case you effectively reduce the capacity to one half or even one third. We want to find a way to do this more efficiently, for instance by utilizing some RAID 6 equivalent or Reed-Solomon erasure codes with a better redundancy overhead. We need all this because we are pretty sure that some of the worker nodes will be down for some amount of time, for, say, repairs, or they may have a hardware failure, and it is not acceptable in such a case that the whole storage becomes unusable.
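Before the conclusion, here is a small sketch tying together two points from the discussion above: a per-tenant resource quota for fairness, and a batch-style Job with an explicit deadline so that the scheduler sees a workload with a finite run time. All names, namespaces, images and limits are purely illustrative.

cat <<'EOF' | kubectl apply -f -
# Cap what one tenant namespace may request (illustrative limits).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 256Gi
    requests.nvidia.com/gpu: "2"
---
# A batch-style workload with a finite run time, unlike a Deployment.
apiVersion: batch/v1
kind: Job
metadata:
  name: simulation
  namespace: tenant-a
spec:
  activeDeadlineSeconds: 259200   # hard upper bound of three days
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: sim
        image: registry.example.org/simulation:latest
        resources:
          requests:
            cpu: "16"
            memory: 64Gi
EOF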
So, to conclude: we provide a unified container infrastructure within the Czech national e-infrastructure. It is multi-tenant, so many users can connect, and it is suitable both for web services and for heavy HPC loads. We are already running HPC loads: as mentioned, we utilize the Nextflow framework, and from this framework we have already run tens of thousands of jobs, just to prove that our infrastructure works. That is all from us, and thank you for your attention.