Alright, so hello again, and thank you for giving us the opportunity to present our experiences. Today we'll talk about our experiences with a centralized machine learning service at CERN built on Kubeflow, and how we've been changing it to make it better for our users. My name is Ricardo Rocha. I'm a computing engineer in the CERN Cloud team. I do a lot of work on containers and networking, and some machine learning as well. I'm also a member of the CNCF Technical Oversight Committee and I co-lead the Research User Group of the CNCF.

Hello, glad to be here. My name is Dejan. I'm a software engineer in the CERN Cloud team. I work on containers, machine learning and Kubeflow, and I will start with some introduction about our project; then Ricardo will discuss things in more depth.

So we work at CERN. CERN is a research organization for particle physics in Geneva, Switzerland, and it operates the largest particle physics laboratory in the world. The mission of CERN is to explore the origins of the universe, to answer fundamental questions such as what the universe is made of, and to understand how particles behave at the smallest scales. To understand how particles behave at the smallest scales we need very high energies, so we build particle accelerators, and the LHC is the biggest particle accelerator in the world. It's a 27-kilometre ring of superconducting magnets that runs about 100 metres underground near Geneva, in which particles are accelerated to near the speed of light and then collided at four collision points. After particles collide at these high energies, physicists gather data using detectors, the experiments, and use this data to get valuable insights into the science of the smallest particles. Here you can see what the magnets look like, and this is one of the detectors; you can see people here for scale, and this structure is essentially the electronics that extract data from the collisions.

So how do we use machine learning at CERN? The data acquisition system works in a specific way: about 40 million collisions happen every second at the LHC, but the computing infrastructure can only sustain around 1000 events per second. So we need to somehow go from 40 million to 1000, and to do that we use trigger mechanisms. Essentially these are algorithms that select interesting events to save and process further. The question then is what is interesting and how these interesting events are selected. To do that we can use either deterministic algorithms or machine learning algorithms; for the latter we would use supervised machine learning algorithms that can run either in the L1 trigger or in the high-level trigger to select interesting events. This works quite well if we know what we are looking for, but the question is: what if we are looking for some new physics? So far machine learning has been used extensively at CERN, for example boosted decision trees were used quite a lot to prove the existence of the Higgs boson, but there are other physics theories that were expected to be confirmed by LHC data and have not been, for example supersymmetry or extra dimensions; those haven't been found in LHC data yet. So the question is: what if our signal hypothesis for the trigger algorithms was wrong, what if there was some kind of bias in the supervised learning?
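As a purely illustrative, minimal sketch of that idea (this is not CERN's actual trigger code; any pre-trained binary model exposing a `predict_proba` method, such as a boosted decision tree from scikit-learn, is assumed), a supervised software trigger boils down to keeping only events whose score passes a cut:

```python
import numpy as np

def trigger_filter(events, classifier, threshold=0.99):
    """Keep only events the classifier considers interesting."""
    events = np.asarray(events)                      # (n_events, n_features)
    scores = classifier.predict_proba(events)[:, 1]  # probability of "signal"
    # The threshold is tuned so the output rate fits the downstream budget
    # (on the order of 1000 events/s out of ~40 million/s at the LHC).
    return events[scores > threshold]
```

The bias concern mentioned above is exactly that such a cut only keeps what the training labels defined as signal.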
So this calls for some unsupervised learning, some algorithms that could actually learn during online processing, that could train on experiment data and not only on simulations. Besides this high-level overview of machine learning, there are multiple groups at CERN that work on machine learning, and they all have their own local infrastructure. So our motivation is to provide a centralized infrastructure where users can actually use GPUs, FPGAs and TPUs, and where they have a user-friendly platform to run their workloads. This is our motivation for Kubeflow: to develop a centralized platform that can be used by different groups, and we have been working on that for a while. The idea is to cover the full machine learning life cycle with Kubeflow: get data from the detectors, perform data preparation, run fast-iteration jobs such as notebooks, validate machine learning models, then once we're happy with our models do distributed training and model validation at scale, and finally store the models and use them for serving, for inference in production (a rough sketch of such a pipeline is below). With Kubeflow we can do all of that, and this is why we are using it.

So we started with a single-user Kubeflow 1.0. At that point we were exploring the available features, making sure that pipelines work, that Katib jobs work, and that we can run our machine learning workloads; this was the initial stage. Then we moved to a 1.1 instance with multi-user support and integrated it with other CERN services: we integrated it with our single sign-on, we manage it with Argo CD, and we onboarded users at that point. We were still working closely with users, gathering their feedback, discussing things and providing support, and at the previous KubeCon we discussed in more detail our bursting to the public cloud. Currently we are working with a 1.3 instance where our focus is on security: we want to provide credential management, namespace management, vulnerability scans for Docker images, and also some runtime checks. The idea is to reach general availability of the service, to be able to open it to the thousands of people who work at CERN. That is our plan, and Ricardo will take over. Thank you.

I'll actually build on this one. This diagram really shows the evolution of the service at CERN. We got to the point where we could scale to the size and amount of resources that our users needed, but before opening it to production there were a couple of things we had to focus on, and that is the list you see here under 1.3; these were the requirements before making it generally available on premises. So the things I'll cover here, in addition to resource availability, are the management of credentials for the users, the decoration of namespaces with user metadata, the scans and checks for images, and then using things like OPA for policy enforcement and runtime checks of the workloads. These are requirements that are not only for a machine learning service; they're quite important for any multi-tenant, multi-user deployment.
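As a rough illustration of that life cycle, and not our production setup, here is a minimal Kubeflow pipeline written with the KFP SDK; the container images, paths and names are placeholders:

```python
import kfp
from kfp import dsl

@dsl.pipeline(name="toy-ml-lifecycle",
              description="Data preparation followed by GPU training")
def ml_lifecycle(dataset_path: str = "/data/raw"):
    # Each step runs as a container; the images below are hypothetical.
    prepare = dsl.ContainerOp(
        name="prepare-data",
        image="registry.example.org/ml/prepare:latest",
        arguments=["--input", dataset_path, "--output", "/data/prepared"],
    )
    train = dsl.ContainerOp(
        name="train-model",
        image="registry.example.org/ml/train:latest",
        arguments=["--data", "/data/prepared", "--model-out", "/models/latest"],
    ).set_gpu_limit(1)          # ask the scheduler for one GPU
    train.after(prepare)        # run training only after preparation

if __name__ == "__main__":
    # Compile to a workflow that can be uploaded through the Kubeflow UI or API.
    kfp.compiler.Compiler().compile(ml_lifecycle, "ml_lifecycle.yaml")
```

The compiled `ml_lifecycle.yaml` is what a user would upload and run, and serving would typically be a separate inference service pointed at the stored model.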
But the first thing I'll cover is resource usage. Dejan introduced that one of our motivations was to improve the efficiency of the resources we have at CERN: instead of multiple groups each maintaining several GPUs, we wanted a central pool of resources that is more efficient overall. We have an example here with different groups at CERN, say CMS and ATLAS, which are experiments at CERN; for ATLAS, SUSY is supersymmetry. We also had groups in IT doing anomaly detection, or in beam calibration doing reinforcement learning to calibrate the LHC beams. All of this is kind of inefficient, because each group still maintains its own GPUs and the resource usage is restricted to those individual groups. So moving to something like this, where we basically have a single entry point for everyone to come in and benefit from these GPUs, is a big improvement for us. We can also integrate things like FPGAs and other accelerators as required. It does pose some challenges, because you suddenly start sharing the resources between users.

The other thing we wanted to do, which Dejan also mentioned and which we presented at the last KubeCon, is to scale out. The amount of resources we have on premises is actually not enough for what we need in terms of accelerators: we don't have enough GPUs, but there are also specialized accelerators that we will probably never have on premises and that are restricted to the cloud providers, IPUs in the case of Azure, TPUs on GCP. We want to abstract all of this so that our users don't have to understand the infrastructure, they just run their workloads.

The other part, and this is where the integration starts: we rely on Kubeflow, as we mentioned, but within this big pool of resources we might have, say, NVIDIA V100s, which are quite good at double precision and very good for distributed training when you're doing things like deep learning, and we expose those via PCI passthrough. We also have NVIDIA T4s, which we use for notebooks, training and inference, again using PCI passthrough. But if you're just using a notebook to validate your model with a small amount of data, you probably don't need a full GPU. So we started looking at this idea of virtual GPUs; the previous talk was about sharing resources for model serving, and this is a similar concept. And then we also started looking at adding NVIDIA A100s, which will arrive soon, and there we can actually do physical partitioning instead of just time sharing as with the T4s and vGPUs. We want to expose this to our users when they spawn a notebook at CERN: they won't just see "I want an NVIDIA GPU", they will have a drop-down box that says "I want a full GPU" or "I want a virtual GPU" (a rough sketch of what that choice maps to on the cluster is below). And here I would highlight what we are aiming for, and we'll come back to this: rather than just listing GPU types, we want to show the user the actual availability of resources, so that they don't pointlessly try to get a resource that is not there. And again, we want to integrate the public cloud resources into the same setup, so we are about halfway there, I would say.

The other part, I mentioned the A100s: if you've played with the NVIDIA T4s you know you can virtualize these GPUs, but that is time sharing, which is kind of a fake partitioning; you're not actually partitioning the resources.
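As a minimal sketch of what such a drop-down choice can translate to on Kubernetes (the `nvidia.com/gpu` name is the standard one exposed by the NVIDIA device plugin, while the vGPU resource name here is an assumption that depends on how the sharing device plugin is configured), the notebook pod simply requests one extended resource or the other:

```python
from kubernetes import client

def gpu_resources(flavor: str) -> client.V1ResourceRequirements:
    """Map a user-facing GPU flavor to an extended-resource request.

    'full'    -> a whole passthrough GPU (standard device-plugin resource name)
    'virtual' -> a time-shared vGPU slice (resource name is hypothetical)
    """
    name = {"full": "nvidia.com/gpu", "virtual": "nvidia.com/vgpu"}[flavor]
    return client.V1ResourceRequirements(limits={name: "1"})

# Example: the container spec of a notebook asking for one full GPU.
notebook_container = client.V1Container(
    name="notebook",
    image="jupyter/tensorflow-notebook",   # placeholder image
    resources=gpu_resources("full"),
)
```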
But with the A100s there is multi-instance GPU (MIG) support, which is really exciting because it gives us a lot more flexibility in how we partition resources for the end users without compromising the expected quality of service. And then, really building on the previous talk, I would complement it with the things we are doing. I mentioned the NVIDIA virtual GPUs with time sharing; I'll just put a note here if you're using this: one thing we learned is that it was not really suitable for all our users, because the ones that need GPU profiling, or want to use TensorBoard or something similar, actually require profiling to be enabled on the GPUs. This was a limitation of the version 12 NVIDIA drivers; it is fixed in version 13, we've tested it, and we are about to deploy it to our users. This will make the use of our T4s much better. But again, the next step is what we are getting early next year, the new NVIDIA A100s, where we can do physical partitioning into up to seven instances. And this is supported directly on Kubernetes, the NVIDIA drivers are able to manage MIG devices, which is really great because vGPUs are not something Kubernetes handles natively.

And then the other part is multi-model serving. This is also a requirement we have: to make the best out of a GPU we need to be able to reuse the same GPU for serving multiple models. This is a very basic example where you create a single inference service and then have one or more actual models linked to that inference service. So this is something we are also trying to do, and it's a follow-up to the really good previous talk.

The second part I would mention is the requirement to integrate with our on-premises systems. We rely on mutating admission webhooks for all of this; we are actually using OPA Gatekeeper. If you've deployed Kubeflow you know that you can do a lot with Kustomize in terms of changing the YAML used for the deployment, but this is not enough. One example is the notebook template that customizes the notebooks in the Jupyter web app: it's quite limited in what you can do, and if you want to change the template dynamically based on some runtime information, you can't do that with Kustomize. So we do this with mutating webhooks. They are used already in different parts of Kubeflow; I'll give some examples and then go a bit deeper. One thing is, for example, starting pods, say notebooks or pipeline jobs, with the proper UID and GID of the local user; for multiple reasons we do this change at creation time. Another thing is that we need to manage credentials for the users so that they can access storage systems or any other internal system, and we inject those credentials at pod creation time as well, along with the volume mounts for the internal storage systems.

So this is an example of the credential management. We mostly have two types of credentials to handle: the first is Kerberos, the second is OAuth 2, and most of our services require one or the other, or both. In both cases they are short-lived, which means you might have a valid credential when your training starts, but if it takes hours it will expire. So we've written a small tool that manages the credential renewal for the users transparently; a rough sketch of the idea is below.
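A minimal sketch of such a renewal loop, assuming MIT Kerberos' `kinit` and the Kubernetes Python client; this is not the actual CERN tool, and the namespace, secret name, principal and keytab path are made up:

```python
import base64
import subprocess
import time

from kubernetes import client, config

def renew_kerberos_credential(namespace="user-example",
                              secret_name="krb5-credentials",
                              principal="user@EXAMPLE.ORG",
                              keytab="/etc/krb5.keytab",
                              ccache="/tmp/krb5cc"):
    """Refresh a Kerberos ticket cache periodically and store it in a Secret,
    so workloads that mount the Secret always see a valid credential."""
    config.load_incluster_config()   # the renewer runs inside the cluster
    core = client.CoreV1Api()
    while True:
        # Obtain a fresh ticket cache from the keytab.
        subprocess.run(["kinit", "-kt", keytab, "-c", ccache, principal],
                       check=True)
        with open(ccache, "rb") as f:
            ticket = base64.b64encode(f.read()).decode()
        # Update the Secret that the mutating webhook mounts into user pods.
        core.patch_namespaced_secret(secret_name, namespace,
                                     {"data": {"krb5cc": ticket}})
        time.sleep(3600)             # renew well within the ticket lifetime
```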
So when they get a notebook, they basically upload, say, Kerberos credentials to their namespace, and then the tool knows how to handle this and keeps the credentials up to date. If they then submit a pipeline, a training job or any other workload on the cluster, the mounts of these secrets make the credentials available for the workloads to use, and this is true both for OAuth 2 and Kerberos. You can see it here in the diagram: the first step is for the user to push the credentials into a secret, our job renews them, and then for the actual workloads a mutating webhook mounts these secrets and makes them available, so they can access things like storage, the internal Spark clusters, the batch cluster which is based on HTCondor, or the registry to upload their models, things like this.

The second part is what I mentioned, that we sometimes have to annotate workloads with the metadata of the user. An example again: notebooks have to have a UID and GID that match the actual user. When a new user is onboarded they get a private namespace, and we have a component that fetches all the metadata from the internal LDAP and puts it as annotations or labels on the user's namespace; the same is true for groups. This means we have all the metadata we need about the users to then do mutations when the pods are created, deploying them with the appropriate security context, with the proper user and group required. This is a requirement from our security team.

And then the last one I would mention is the internal registry. We mandate that all workloads running on the cluster come from our internal registry, which means they have been through the vulnerability scans, we use Trivy for this, and we also sign all the images. Then we have different policy enforcements at creation time, which for example prevent external images from being run, but can also check things like whether the security context is acceptable, or whether the workload has all the metadata we need, for accounting for example. We have some caching and replication from Docker Hub into our internal registry, which then triggers all these scans, and we go from there. The second part is runtime: all of this helps us at deployment time, but if you have long-running jobs we also need checks at runtime, and we use Falco for this. We do basically two main things. One is to verify that whatever vulnerability checks were done are still valid, that there is no new vulnerability affecting some long-running notebook, for example. The second is to verify that the workloads are doing what they are expected to do: no shells being spawned inside the container, no packages being installed that are not supposed to be installed, no weird system calls or network connections. We do these checks live. And one thing that is quite uncommon, at least here but I think everywhere, is that when we described this stack to our security team they were actually very happy, and having a security team be happy about a new service is not common, so this was quite satisfying. I think that's it.

So far we have been working with a couple of groups at CERN who have been using our Kubeflow instance, and in the next couple of slides I'll describe some individual user feedback. The first thing users really like is the integration from notebooks to pipelines; they really enjoy using Kale. If they can go from a notebook to a pipeline seamlessly, without writing any Docker images or anything Kubernetes-specific, that's a really big advantage, so this is something we have very positive feedback on so far. Users also really enjoy the ability to get resources on demand: for example, if they have models that need hundreds of GPUs to train properly, they can't have that locally, but on the public cloud we can provide access to that number of GPUs, so users really like that and ask for such capabilities. Then we have some advanced users: for example, a group at ATLAS, one of the experiments, has a repository with a machine learning project, and on that project they run continuous integration. Whenever they commit, they would like a CI step that triggers a pipeline on our Kubeflow instance, so they need API access to Kubeflow; this is something we also need to support (a rough sketch of what such a CI step could look like is below).
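As a minimal sketch of such a CI step using the Kubeflow Pipelines SDK (the host URL, namespace and pipeline package are assumptions, and a multi-user deployment typically also needs a session cookie or token for authentication):

```python
import kfp

# Hypothetical endpoint and profile namespace for the Kubeflow instance.
client = kfp.Client(host="https://kubeflow.example.org/pipeline",
                    namespace="atlas-ml")

# Start a run of an already-compiled pipeline as part of a CI job.
result = client.create_run_from_pipeline_package(
    pipeline_file="ml_lifecycle.yaml",          # compiled pipeline in the repo
    arguments={"dataset_path": "/data/raw"},
    run_name="ci-triggered-run",
)
print("Started pipeline run:", result.run_id)
```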
Then we need some better UI for when resources are not available, for example for notebook creation with GPUs. We implemented an additional tool that tells us whether there are GPUs available in the system and, as Ricardo was mentioning, lets you select the profile for a GPU, but if users could automatically see whether a GPU is available, that would be much better for them.

So, to sum up, Kubeflow has been very well received at CERN. We are about to open it to all users and all groups after we validate it against the security requirements Ricardo was discussing, and we want to provide credential and namespace management as well. As for improvements to Kubeflow itself, it would be great to have complete isolation, for example for pipelines: it's already done in the back end, and it would be really great for us to have it in the front end as well, and to have pipeline artifacts isolated too. We would also need better debugging of Katib jobs, because users can't see the logs in the UI, and maybe better feedback regarding resource availability. But in general, whenever we give a presentation, people at CERN are very excited about Kubeflow, and our users so far really enjoy using it. So thank you very much, and if you have any questions, let us know.