Hello, let us present how we handle fair scheduling in a private academic Kubernetes infrastructure. First of all, let us introduce ourselves. I am Lukáš Hejtmánek, an architect at Masaryk University in the Czech Republic, and my colleague Dalibor Klusáček is a researcher at CESNET, the national research and education network, also in the Czech Republic.

The Czech research infrastructure is currently built on two approaches: one is HPC, and the second is an emerging Kubernetes infrastructure. The HPC infrastructure consists of 30,000 CPU cores and 15 petabytes of storage capacity, and it is used by 3,000 active users who run 20,000 jobs each day. The Kubernetes infrastructure consists of 2,500 CPU cores, so it is smaller. It has half a petabyte of dedicated storage capacity and is currently used by 130 users, who run about 1,000 pods every day.

This talk is focused on our Kubernetes infrastructure, which consists of a single large multi-tenant cluster. That means the cluster is shared by many users; they do not have admin privileges and are only given namespaces to run their pods.

Basically, we have two kinds of jobs, interactive and HPC jobs, so what is the difference? Interactive jobs need to run as soon as possible, they usually do not have a limited runtime, and they are typically bursty; I will speak about this a bit more later. In contrast, HPC jobs can wait in a queue. They have a strict maximum runtime limit, and we usually have more jobs than resources, so they have to wait in the queue.

Of course, we want to use our infrastructure efficiently, but that presents some challenges. The first one is the bursty nature of interactive jobs. They typically run for a while and then sit idle for the remaining time, but what should we do with the allocated resources? The jobs are mostly stateful, so they cannot be easily restarted. In the graph below you can see the typical runtime of an interactive job.

Another challenge is how to prevent resource wasting. We collected statistics for several months, and we see that most namespaces show significantly overestimated resource requests. The red line represents the idle state, but as you can see, most allocations are below it. Those namespaces do not represent a single pod but usually several pods, so the situation is even a bit worse.

We also see other problems, such as that it is impossible to modify pod priority dynamically, or to enlarge or tighten a pod's allocation, because if we change the allocation, the pod has to be restarted. This is no problem for stateless microservices, but it is usually a bigger deal for a long-running scientific computation. You can easily imagine such a computation being stateful and running for one month; it is not a good idea to restart it several times a week.

Luckily, there is community work in progress. One effort is in-place resize, that is, changing a pod's resources without restarting it; this is very promising work in progress, and a rough sketch of it follows below. The other is checkpointing: once Kubernetes supports pod checkpoint and restore, we can deal with some of these problems as well.
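To make the in-place resize idea concrete, here is a minimal sketch, assuming the upstream InPlacePodVerticalScaling feature (alpha since Kubernetes 1.27); the pod name and image are hypothetical, and the exact fields may still change between releases. The resizePolicy list declares, per resource, whether a change requires a container restart.

```yaml
# Minimal sketch: requires the InPlacePodVerticalScaling feature gate
# (alpha since Kubernetes 1.27); fields may change between releases.
apiVersion: v1
kind: Pod
metadata:
  name: interactive-job                         # hypothetical name
spec:
  containers:
  - name: compute
    image: registry.example.org/compute:latest  # placeholder image
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired                # CPU can be resized in place
    - resourceName: memory
      restartPolicy: RestartContainer           # memory change restarts the container
    resources:
      requests:
        cpu: "8"
        memory: 32Gi
      limits:
        cpu: "8"
        memory: 32Gi
```

With a policy like this, an operator could lower the CPU request of an idle interactive pod and hand the freed cores back to the scheduler without killing a stateful computation, which is exactly the kind of reclaiming we are missing today.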
There are also problems bound to scheduling. A common HPC scheduler behaves in such a way that when the system is full and a new user arrives, you can always tell the user what their priority is, roughly estimate when the running jobs of other users will terminate, and even provide them with a non-destructive reservation. The scheduler also makes reservations for big jobs. That means that if a user requests a job that needs a whole node, like 64 CPUs or something like that, then the HPC scheduler will make a reservation on such a node and prevent smaller jobs from occupying it. And this is automatic.

In Kubernetes, it is impossible to estimate a pod's wait time. When we are out of resources, there are no guarantees: the pod either starts immediately, or it may never start, because there are no free resources, and since pods do not have any limit on runtime, those resources can be occupied forever. Alternatively, we can manually raise the priority of a new pod to evict some already running pod, but as I have said, this is a problem for scientific computations (a non-preempting alternative is sketched at the end). Resource reclaiming is also not solved, because there is no pod lifecycle management. There is also no such thing as fair share in Kubernetes, meaning a mechanism by which Kubernetes could guarantee a user that their pod will eventually run. The Kubernetes scheduler also does not make any reservations for big pods, except again at the cost of eviction, which is still not good for scientific computations. And there is no automation.

So this is all from our short lightning talk. If you have any ideas that could help us solve these problems, we would be happy if you reached out to us at the contacts below. Thank you for your attention.
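As mentioned above, one existing upstream mechanism that softens the priority-versus-eviction dilemma is a non-preempting PriorityClass (preemptionPolicy: Never, stable since Kubernetes 1.24): it lets an urgent pod jump ahead in the scheduling queue without evicting running computations. It does not provide fair share, wait-time estimates, or reservations, so it is only a partial mitigation; the class name and value below are made up for illustration.

```yaml
# Sketch: a priority class that reorders the scheduling queue
# but never preempts (evicts) already running pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: interactive-high          # hypothetical name
value: 1000000                    # illustrative value
preemptionPolicy: Never           # jump the queue, do not evict
globalDefault: false
description: "Interactive jobs scheduled ahead of batch jobs without eviction."
```

A pod opts in by setting priorityClassName: interactive-high in its spec; it then sorts ahead of lower-priority pending pods but waits for resources to free up naturally rather than preempting.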