Hi everyone, welcome to this session on science in cloud native. My name is Ricardo Rocha, I'm a computing engineer at CERN. Today I will talk about the usage of cloud native projects in the scientific area, give you an overview of where things stand, and also cover the challenges that still exist in using this type of software for science.

I'll start with an overview of the current status. Cloud native is rapidly changing the way scientific infrastructure is built and maintained, and there are a couple of areas where the impact is already very significant.

The first one I would highlight is reproducibility. A container is a well-defined unit that can be easily deployed in different environments, and this helps a lot with reproducibility, which is key for science. Being able to take an existing component, even an old one, and deploy it on a modern infrastructure, or on whatever infrastructure comes up in ten years, makes a huge difference when re-running analyses and making sure results can be carried forward into the future.

The second is the idea of building a container once, in a well-defined way, and being able to run it anywhere, on different platforms. Scientific infrastructure is very often heterogeneous: scientists will try to use as much of the infrastructure available to them as possible, which traditionally meant complying with many different systems underneath. Standardizing on containers and container orchestration APIs really simplifies this task for end users.

And finally, closely related to the previous point: once you have this single unit wrapping your code and data, sharing it with your colleagues is much easier. This is, of course, key to scientific collaboration, the fact that you can take your analysis and share it easily for your colleagues to reproduce.

All of this together means the infrastructure itself is much simpler, and with access to these tools scientists can spend a lot more time doing actual science rather than maintaining the underlying infrastructure. Another thing coming out of this is access to a much larger set of resources, thanks to these standardized APIs around cloud native tools.

There are still a few challenges, and I will highlight three today. The first is software distribution: scientific workloads are made of thousands or tens of thousands of individual pieces, so pushing the software to where the analysis will run, and doing it efficiently, is key. The second is rootless environments: a lot of the infrastructure scientists have access to has very strict policies on what people can run and how; these are shared environments, so running workloads unprivileged is a requirement. And the third is advanced scheduling: this is where the biggest differences compared to traditional IT show up, with batch-like workloads where queuing, priorities, and fair share are very important. I'll cover a bit more of that as well.

Starting with software distribution. Ideally, container images would be well layered and optimized, but images over 10 gigabytes, and not particularly well layered, are not uncommon in the scientific field. In reality, even with images of 15 or 20 gigabytes, the actual workload will require less than 6% of that content to run properly. So it is very inefficient to download the full image before starting a workload. These clusters can be huge, hundreds or thousands of nodes, and the images have to be pulled on all of them, which puts huge pressure on the network and on storage. And if, on top of that, you are running thousands or tens of thousands of parallel jobs, the problem gets even worse.

To help with this, the ideal would be to have optimized images, but that is not always a possibility. The second option is caching, which is particularly important if you have geographically distributed clusters or nodes, and helps a lot with efficiency. Peer-to-peer distribution of the software also helps, of course.

But one thing that really helps is the concept of lazy pulling. The idea is that instead of downloading the full image before deploying your workload, you do a remote mount of the image and gradually download only the content that is actually requested by the workload after the container is running. This means the startup time of your container is essentially flat, and the actual image contents are fetched as the workload requests them.

One example implementation is the remote snapshotter in containerd, which uses a format called stargz, a seekable tar.gz. If you know how a Docker image works underneath, it's pretty much a set of tarballs, one for each layer in the image. The smart trick here is that a tar of tars is still a valid tar, so the result is fully backwards compatible with existing container image formats; but by doing a tar of tars you end up with a seekable archive, so you can navigate it to find, for example, the individual files being requested by the workload. In practice the runtime, instead of downloading the image before launching the workload, launches the container right away and expects the data to be made available when needed.

In terms of performance, this has a dramatic impact: startup time is pretty much flat no matter the image. The benchmark images are actually pretty small, but if you extrapolate to images of 15 or 20 gigabytes, the startup time will be very similar. Workloads can then be a bit slower as they request data, but considering only a very small fraction of the total data is ever requested, there is a big reduction in network pressure and in the storage needed on the nodes.
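To make this concrete, here is a minimal sketch of lazy pulling with the containerd stargz snapshotter, assuming nerdctl and the snapshotter are installed on the host; the registry and image names are hypothetical:

```sh
# Convert an existing image to eStargz. The output is still a valid OCI image,
# so runtimes that do not understand lazy pulling can keep using it unchanged.
nerdctl image convert --estargz --oci \
  registry.example.com/analysis:v1 registry.example.com/analysis:v1-esgz
nerdctl push registry.example.com/analysis:v1-esgz

# Run with the stargz snapshotter: the container starts almost immediately,
# and file contents are fetched from the registry only as the workload reads them.
nerdctl --snapshotter=stargz run --rm registry.example.com/analysis:v1-esgz
```

The same snapshotter can be configured on a Kubernetes node's containerd, so pods pulling eStargz images get the same flat startup behavior.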
The second challenge I mentioned earlier is rootless environments, and this is particularly important for high performance computing clusters. A lot of scientific workloads are deployed on these massive supercomputers, which are machines shared between multiple end users, so what end users can do in these environments is very restricted. This is not a good fit for the way Kubernetes and the projects around it are built today and the expectations they have.

But there is an effort to have what are called rootless containers, and this can help onboard more of these resources into the ecosystem; there is a link to the project here. The goal is really to manage containers as an unprivileged user, and not just to run the container itself unprivileged, but to run the container runtime unprivileged as well. This really lowers the barrier to onboarding HPC environments into cloud native deployments. They have a very nice definition of an unprivileged user: a user that is not in the good graces of the administrator, which captures exactly what we expect here.

There is already support for this kind of deployment in Docker, Podman, BuildKit, and containerd, so quite a lot can be done using these projects today. Having support in containerd also means that tools like kind and minikube, Kubernetes itself via a distribution called Usernetes, and k3s are already options for trying out this type of workload, as in the sketch below.
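As a minimal sketch of what this looks like in practice, assuming a recent Docker with the rootless extras package installed on the host:

```sh
# Set up and start a rootless Docker daemon as the current, unprivileged user.
# Both the daemon and the containers it runs operate without root.
dockerd-rootless-setuptool.sh install
systemctl --user enable --now docker

# Point the client at the per-user socket and run a container; at no point
# do we need root, or the good graces of the administrator.
export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock
docker run --rm hello-world
```

Podman takes a similar approach and runs rootless by default, and a rootless containerd gives tools like kind or Usernetes the same property at the Kubernetes level.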
Finally, I'll mention the third challenge, which is advanced scheduling. This is really where the key differences compared to traditional IT deployments appear: these are features that are required for traditional HPC or HTC (high throughput computing) workloads.

The first is priority queues. To maximize the usage of the cluster, workloads are queued before being submitted, and the queues have priorities for higher and lower priority workloads. Since these workloads can take a few hours or even a few days, this also implies preempting running workloads to replace them with higher priority ones. This does not exist in the built-in scheduler, but there are multiple projects focusing on it.

The second requirement is fair share. The notion, again, is to optimize the usage of the cluster: you allow some teams or users to run more workloads than their usual quota would permit when other users are not fully using theirs, and over time this compensates, so that everyone gets their expected quota over a longer period.

The third requirement is gang scheduling, the idea of submitting multiple jobs at the same time. This is critical for workloads like MPI, where the different pieces need to communicate with each other, so you need to be able to schedule multiple workloads at exactly the same time, otherwise they would not run properly. This, too, is something that has to be built on top.

The last one: we already talked about distributing workloads across heterogeneous environments, and another requirement is to do this across multiple clusters. The goal is always to maximize access to whatever resources are available, and multi-cluster is one way to do that.

There are several projects putting real effort into providing this in our ecosystem. The first one I will mention is Volcano, the cloud native batch system. It really tries to offer all the functionality of a traditional batch system, queues, fair share, gang scheduling, but using cloud native APIs and tools, as in the sketch below.
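As an illustration, here is a minimal sketch of a Volcano job combining queues and gang scheduling. It assumes Volcano is installed in the cluster with its default queue present; the image name is hypothetical:

```sh
kubectl apply -f - <<'EOF'
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mpi-sim
spec:
  schedulerName: volcano   # hand the job to the Volcano scheduler
  queue: default           # queues carry priority and fair-share policy
  minAvailable: 4          # gang scheduling: start only when all 4 pods can be placed
  tasks:
    - replicas: 4
      name: worker
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: registry.example.com/mpi-sim:v1
              resources:
                requests:
                  cpu: "2"
EOF
```

If fewer than four pods fit in the cluster, none are started and the job waits in its queue, rather than running a partial MPI world that would deadlock.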
The second one is Admiralty, where the focus is more on the multi-cluster part. It works by having proxy pods on a top-level cluster, with the actual workload pods running on child clusters. The third one is Armada, which again focuses on batch workloads, specifically on scheduling and running them on Kubernetes. And finally, I will mention the virtual kubelet. This essentially masquerades as a kubelet, that is, as a node in a Kubernetes cluster, hiding the actual resources that serve the node: these can be a real node, but they can also be a remote API, including the API of an external Kubernetes cluster or a serverless platform.

All of this together really tries to meet the goal of improving scientists' access to all types of resources, using the concepts they are already used to. There is a lot more going on: one of the really promising developments comes from SIG Scheduling, where they are trying to bring these concepts into the Kubernetes scheduler itself. This will be evolving fast, and I'm really looking forward to the progress there.

That brings me to the end of my talk today. I hope I gave you an overview of the excitement around cloud native in the science area, and of the challenges that still exist. A lot of this discussion happens in groups like the CNCF Research User Group, and I've put the link here. These projects also fall under TAG Runtime, the CNCF Technical Advisory Group for runtime, which is another place where a lot of the discussion happens. I hope this was only a teaser, and I look forward to KubeCon in May in Valencia with a lot more news in this area. For everyone listening, enjoy KubeCon China, and I hope to see you all soon.