Hello everyone, welcome to our talk on building and managing a centralized machine learning platform with Kubeflow at CERN. We'll be talking about some work we've been doing over the last few months and a service that we have opened to our users.

Hello, my name is Dejan Golubovic, I am a Computing Engineer in the CERN Cloud team. My focus is on machine learning infrastructure services with Kubernetes, and I will present this talk with my colleague Ricardo.

My name is Ricardo. I'm a Computing Engineer, also in the CERN Cloud team. I focus mostly on containers, networking and, more recently, GPUs, accelerators and machine learning, and I'm also a member of the Technical Oversight Committee of the CNCF as an end user representative.

So today we'll give a talk about a service at CERN, but first a very quick overview of what CERN is about. CERN is the European Laboratory for Particle Physics, the largest particle physics laboratory in the world, and we build large scientific machines that allow us to do fundamental research. The largest we have is the Large Hadron Collider, which you've probably heard of. It's a 27-kilometer particle accelerator, 100 meters underground, where we accelerate two beams of protons to very close to the speed of light and make them collide at very specific points where we build large experiments; you can see here CMS, LHCb, ATLAS and ALICE. To give you an idea of the size, you can see the Geneva airport here in the picture. This is an image of the accelerator itself in the tunnel, and you can see all the magnets that help us bend the beam so that it circulates in the accelerator. And this is a picture of one of the detectors, the CMS detector, the Compact Muon Solenoid. It sits in a cavern 40 meters by 40 meters, also 100 meters underground, and this is where we make the proton beams collide. This detector, and the others as well, act like gigantic cameras that take something like 40 million pictures a second, and the result is a large amount of data that we need to store and analyze. We collect and store more than 70 petabytes of data every year, and this is after a lot of filtering: one detector like this can generate something like one petabyte of data per second. That's why we are constantly looking at new technologies that can help us handle this amount of data.

So the main motivation for our service is the expanded usage of machine learning in high energy physics. Different groups at CERN work on various machine learning projects in order to achieve the scientific goals of the Large Hadron Collider, and we know that setting up and managing machine learning infrastructure is not an easy task. Currently, most groups at CERN manage their own machine learning infrastructure. We have four main experiments, which all branch into different groups, and that means that a lot of people run their own machine learning infrastructure. We want to offer a centralized place, a centralized service, in order to reduce the effort physicists spend on infrastructure and to allow more time for scientific research.

One of the main applications of machine learning at CERN is particle reconstruction. During proton-proton collisions, short-lived particles are created in the detectors, for example the Higgs boson, which lives for 10 to the minus 22 seconds. To capture the events of these short-lived particles, we measure energy depositions in the detectors. The detectors can be considered as 3D cameras, which opens the opportunity to use convolutional neural networks.
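To make the idea concrete, here is a minimal, illustrative sketch of the kind of 3D convolutional classifier this suggests: a grid of energy depositions goes in, a particle-type prediction comes out. The input shape, layer sizes and class labels are assumptions for illustration, not the actual CERN models.

```python
import tensorflow as tf

# Illustrative only: a small 3D CNN that maps a grid of calorimeter
# energy depositions to a particle-type prediction. Shapes and class
# labels are placeholders, not the real CERN configuration.
NUM_CLASSES = 3  # e.g. muon, pion, photon (assumed labels)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(25, 25, 25, 1)),   # assumed 25^3 energy grid
    tf.keras.layers.Conv3D(16, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling3D(pool_size=2),
    tf.keras.layers.Conv3D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling3D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # particle ID
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```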
Besides convolutional neural networks, we can use graph neural networks, which are also very good at spatial representation. An example would be to take the output of the detector and let that be the input to a network, and the output of the network would be the ID of the particle, whether it's a Higgs boson or a muon or a pion, for example. A lot of research is now going towards graph neural networks.

Another application is in detector simulations. The Large Hadron Collider is being upgraded, there will be even more data in the future, and more sophisticated and faster solutions are needed to support the upgrade from various perspectives, one of them being simulations. Simulations are performed so that we can accurately estimate what is going to happen during the runs. The traditional methods are Monte Carlo simulations, but recently 3DGANs (3D generative adversarial networks) have started to be more commonly used. They have proved to have similar performance to state-of-the-art Monte Carlo while offering 20,000 times faster simulation. And with a 3DGAN, data can be simulated on the fly, which may reduce the need for storing the data.

So our goal is to set up a platform to support the end-to-end machine learning life cycle. We want to be able to extract data from the detectors to Spark or HDFS and operate on that data. Then we want fast iteration services, such as notebooks, because many users use notebooks daily, and notebooks are a good starting point for every machine learning user. For more computationally intensive jobs, we want to be able to perform distributed training with TensorFlow or PyTorch, and we even want to burst out to the public cloud when more resources are needed. Then, after the training, we want to store models and to be able to perform scalable serving for the trained models.

The platform that supports all of our goals is Kubeflow. Basically, with Kubeflow we are utilizing the power of Kubernetes to efficiently manage resources, and we also offer users all the desired features. The infrastructure part of Kubeflow is managed by our cloud team, and our users are physicists and scientists across all of CERN. With Kubeflow, we can offer notebooks, pipelines, distributed training and model serving, and we can also offer bursting to the public cloud when necessary. That means that basically all of our use cases are covered by Kubeflow. Ricardo will now discuss our setup and the challenges in setting up our Kubeflow instance.

Yes, I'll pick up on the nice description from Dejan. Before he does a cool demo, I'll just talk about the layout of the infrastructure we are using. This is a very simplified overview of the layout of our clusters. We rely on a load balancer as an entry point, and this allows us to simplify the deployment and, for example, to do upgrades by just adding new clusters on the back end as entry points behind the load balancer. Then there's a gateway that is our ingress gateway to the services. The main important bit here is that we have three types of nodes. The first type is virtual GPUs. This is something that allows us to have a large amount of GPU resources, although they are not as performant as a full GPU, but it gives us a much larger amount of resources for things like notebooks, for example; we rely on T4s with time sharing in this case. Then we have the PCI passthrough node group type.
This is mostly used for things like pipelines, distributed training, hyperparameter optimization, and also model serving where you want to guarantee a certain latency. We do not currently use any kind of fast interconnect or anything like that. And the last node type we have here is CPU. In this case, we have a much larger amount of resources. It's not as interesting if you're doing deep learning, but this platform ended up being used for other purposes as well, where workflows and pipelines can be useful. So you can see that we have something on the order of hundreds of virtual GPUs, tens of full GPUs for the users, and on the order of thousands of CPUs.

Just very quickly, our deployment is based on Kubernetes 1.18 clusters today. We still use Kubeflow 1.1, and one difference from the standard 1.1 deployment is that we upgraded Istio to 1.5 and Knative to 0.15. All the clusters and the deployments are managed using GitOps, and we have one repository where we define all the services and all the environments we support, all managed by Argo CD. There is one very good feature here, which is that Argo CD allows us to use Kustomize just for the Kubeflow deployment. Then, for the other components, we rely on the operators for both the Istio and the NVIDIA GPU operator deployments, and for Prometheus, Knative and cert-manager we rely on upstream Helm charts.

One of the key aspects is the integrations we do with internal CERN services. The first one is identity, authorization and authentication. We link this to the CERN SSO, which is based on Keycloak. This allows us to have not only the tokens that identify the user, but also the mapping of the users to the roles and the groups they belong to. In our clusters we have dedicated namespaces per user, where people have a default quota that is fixed and cannot be changed. But there are also additional groups people can belong to, defined in the CERN identity service, and in those groups they can request additional quota, like more GPUs, for example.

The other very important part is the integration with our storage systems. As we mentioned, data is a key aspect of everything we do, so we integrate with the three main storage systems that are interesting in this case. The first one is CVMFS, the CernVM File System, which is a read-only, distributed, caching file system that is mostly used for software distribution. The second one is an in-house developed system called EOS that holds all the physics data; the important part here is that we offer both Kerberos and OAuth2-based access, and OAuth2 is very important for things like notebooks and anything that is browser-oriented. And the last one is HDFS. Dejan mentioned that in some cases people want to do the data preparation using Spark, and in that case we access HDFS using Kerberos credentials.

I will just summarize a couple of issues that we ran into while doing this. The first one is that the Kubeflow releases were not always very consistent in terms of what supports what. 1.0 had, for example, multi-user support for notebooks but not for pipelines. 1.1 brought multi-user pipelines, but some of the components were not talking to this new API properly, like Kale from notebooks. This meant that we spent quite a bit of time downstream fixing these bits when we did the upgrade.
And this is one of the reasons why we are still on 1.1 and are slowly updating to newer versions as well. The second issue is Kustomize, or rather the way Kubeflow uses Kustomize: it's quite complex. So we decided to spend some time simplifying things and moving some of the components out of it, especially things like cert-manager, Istio and Knative, which are quite critical. In the end we deploy those in another way, and we only deploy the Kubeflow applications using Kustomize. And the last one, which is still an ongoing issue, is how to manage additional packages that people might require, both for the notebooks and then for their pipelines as well, and how they can install and add these packages easily to their containers. This is something we'll mention more later.

The last bit I will mention before we jump to the demo is how we are doing bursting to the public clouds. This is very important for us because we can get access to a much larger amount of GPUs and especially other types of accelerators like TPUs or IPUs. TPUs are very interesting for our use cases also because they're very cost effective. We have tried this using different technologies over the last few years at the lower level: we tried Federation v1 and v2, and we also have deployments using the Virtual Kubelet. We are still not very expert in Istio for this, but we are experimenting with it. For Kubeflow, however, the most promising results, and the way we are offering this, is to directly expose the other clusters to the users via their Jupyter environment. So when you get your Jupyter environment, your notebook environment, you actually get the additional clusters configured, and these clusters are configured with the same CERN SSO, so we can do something like using the Open Policy Agent to validate who is able to access which clusters, which groups can actually access each cluster. And then we do the quota management in the same way in those clusters as well. This is working quite well.

This picture is not so simple, but it's actually kind of simple given what is behind it. The key aspect is that in this Jupyter environment we have the cluster configuration, and we are able to reuse the OAuth token that the user already has from logging into the system at CERN. Normally they would just submit to the same cluster where Kubeflow is running the Jupyter environment: they would submit a TFJob that would use GPUs on premises, and once the training is done, we write the output artifacts to S3, from where the model can be served at CERN. By exposing additional clusters in the environment, people can just source those cluster configurations and then submit the TFJobs to external clusters, in this case Google Cloud, where we would have potentially thousands of GPUs or TPUs available. In the end, again, the output artifacts are written to S3 and served in the same way as if they had been trained internally. So this is quite promising, and this is what we are offering today.

Okay, now we'll go back to our demo example. So, remember the 3DGANs. Basically, the main issue with 3DGANs is the extensive training time. For our model, it would take 2.5 days to train properly, and if we want, for example, to search hyperparameters or change the model, that iteration could last for weeks or even months. This is with one GPU. To create a more scalable solution, a distributed model was created. Basically, the model is trained using TensorFlow distribution strategies.
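As a rough sketch of what that looks like in code (not the actual 3DGAN training script), a distribution strategy is created and the model is built inside its scope; the model and dataset below are placeholders, and when run as a TFJob it is Kubeflow that provides the TF_CONFIG environment each worker reads to discover its peers.

```python
import tensorflow as tf

# Minimal sketch of distributed training with a TensorFlow strategy.
# The real 3DGAN training code is more involved; the model and dataset
# here are placeholders. When launched as a TFJob, each worker pod gets
# a TF_CONFIG environment variable, which MultiWorkerMirroredStrategy
# uses to find the other workers.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are mirrored across workers.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Placeholder dataset; in practice this would stream detector data.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 64]), tf.random.normal([1024, 1]))
).batch(32)

model.fit(dataset, epochs=1)
```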
So, for example, we are using a multi-worker mirrored strategy, which uses different nodes with multiple GPUs, and we also have a script for accessing TPUs for the distributed training. TFJob and Kubeflow help us automate this distributed training process, and we are able to quickly iterate over different training configurations. Basically, TFJob encapsulates the TensorFlow distribution and manages it across Kubernetes pods. With TFJob we are able to run distributed training both locally and on a public cloud, as Ricardo was describing. On GCP we are using 128 preemptible machines for the distributed training. And now we can move to the demo. So let me share the screen.

Here at ml.cern.ch we can see our service dashboard. Basically, this is the Kubeflow dashboard, and we can see the Kubeflow features on the left: we have pipelines, notebook servers, Katib and the other features. We'll go to our notebook, which is basically where we have our demo prepared.

Here we have a couple of demos we are going to show. The first one is from our examples repository; basically, we created a repository for onboarding our users to the various Kubeflow features. We are going to show Kale. This example shows how to convert a notebook to a pipeline without writing any additional Python code. For that we're using the Kale deployment panel, and basically the only thing we need to do is to annotate every cell so that it converts properly to a pipeline component. In addition to annotating, we create connections between pipeline components, and we can also assign a GPU to any specific pipeline component. To run, we only need to click this compile-and-run button, and we can see our pipeline running.

While our pipeline is running, we can check other features which we have in our service. One of them is EOS. EOS is where all users at CERN have their personal directories, and mounting EOS allows us to access data from multiple users; basically, every user can access their own personal folder here. We can also show the usage of a GPU with nvidia-smi.

The main example which we have is the 3DGAN. Here we have our 3DGAN training, with different scripts: one of them trains the 3DGAN with the CPU, and here we train the 3DGAN with the GPU, and it's all distributed training when it comes to GPUs. What we want to check here is the strategy, and we see that we are using the multi-worker mirrored strategy for GPU training. Also, as Ricardo was mentioning, we upload the trained model to a bucket, and we see that code here: after the model is trained, we upload it to our CERN bucket. It's similar for TPUs, only here we have a TPU strategy for the distributed training. In this repository we also have a Dockerfile to build the image used to run the distributed training, but we are not building it here; instead we'll show our TFJob YAML file. To submit a TFJob, we can define our number of replicas here, the number of GPUs we are using here, and then we select the image, whether we want full training, the number of epochs, and other customizable arguments. So now we are going to submit our 3DGAN TFJob on our local cluster; all we have to do is kubectl apply the 3DGAN GPU manifest.
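For reference, here is a hedged sketch of what submitting such a TFJob could look like programmatically rather than with kubectl apply. The apiVersion, kind and tfReplicaSpecs layout follow the standard TFJob v1 API, but the namespace, image name and script arguments are placeholders, not the actual CERN manifest.

```python
from kubernetes import client, config

# Sketch only: submit a TFJob equivalent to the `kubectl apply` in the demo,
# using the Kubernetes Python client. Namespace, image and args are
# illustrative placeholders.
config.load_kube_config()  # or config.load_incluster_config() inside a pod

tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "3dgan-gpu", "namespace": "my-namespace"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 4,  # number of worker replicas, as set in the YAML
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "tensorflow",
                            "image": "registry.example.org/3dgan-training:latest",
                            "args": ["--epochs", "1"],
                            "resources": {"limits": {"nvidia.com/gpu": 1}},
                        }]
                    }
                },
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1",
    namespace="my-namespace", plural="tfjobs", body=tfjob,
)
```

An equivalent YAML manifest applied with kubectl, as in the demo, achieves the same result.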
So this one we are submitting to our local cluster, and we can check our TFJob and see it running. Now may be a good time to check if our pipeline has completed, and it has. We can see in its logs that the pipeline has completed training two models and which model was better, and now we're running the distributed training of the 3DGAN on the local cluster.

Additionally, we want to run a 3DGAN training on a Google cluster. Basically, inside the clusters folder, users get information about all the available clusters in our service; here we only have a CERN and a GCP cluster, and all users have to do to access the additional clusters is to source these files. So we source the GCP setup.sh, and now we are in the Google Cloud cluster. As we can see, in the Google Cloud cluster there are no pods in my personal namespace, but in the local cluster I have a couple of pods running and some of them completed. What we are going to submit here is the 3DGAN example, so we go to our GCP YAML file and submit that. So now we're submitting this TFJob to our Google cluster.

Meanwhile, we can check whether the training on our local cluster has completed, and yes, we can see that it has. To check the Google cluster we set up a watch, and we can see here that our workers are deployed on nodes which have a V100. In total we have 128 nodes running, with 16 workers where each worker uses eight nodes. So now our training job is actually running on the Google cluster, and this is what we see here.

We can close this now, and here we see that our local job has completed. As Ricardo was saying, after the training we upload our model to a bucket, so here we can see the trained model stored in our bucket. We have a couple of files for each model, and this is all for one model and for one epoch: we have the discriminator and generator of the 3DGAN stored in the bucket, and we have also saved the model in a format that can be used for inference, for serving. We may also want to check these metrics; basically, this is how we store metrics about our model after each epoch.

Okay, now we have covered the 3DGAN. The last thing I'd like to cover is the inference services. What we want to do here is to submit an inference service and basically serve a model by only specifying where the model is located. So we can kubectl apply, and now we have created our inference service; it was actually already there, but this is how we create it when we want to. Then, to test our inference, we can test it from here, and we see that we are getting results: as we were discussing, we get a 3D output that represents the output of the detector. This is what happens when we do one inference, but we also want to do 10 curl requests at the same time so that we can see what happens to the number of predictor pods. As we can see, the number of pods is increasing; it is autoscaling so that it can support the client requests. And with this we have covered our demo.

So basically, with this distributed training we were able to reduce the execution time from one hour to 30 seconds for one epoch, and for the full training we managed to go from 60 hours to around 30 minutes.
So TFJob really helped us speed up the development process, and we see almost linear improvement in performance for our 3DGAN model (roughly a 120x speedup). And now Ricardo will offer the closing remarks.

Yeah, so basically I hope this was a nice overview of the service we are offering and the potential that it has by offering a consistent environment where people can do their development but also interact with the services. We handle all the machine learning life cycle steps, from preparation all the way to serving. We managed to centralize the resources that are pretty scarce, such as accelerators, in this case GPUs, and we also showed how we are currently doing the integration with external resources, GPUs and TPUs, using the public clouds.

There are steps we are still working on. One of the main ones is to onboard new use cases, and among those there's a very interesting one for reinforcement learning from the people doing the beam calibration, where they want to keep the model updated live while the beam is running. The second thing I would like to mention is the need for users to be able to curate their own environments and to add packages to their environments easily. Right now the only option is to install the packages in the notebook, but that doesn't work well when you're transitioning to pipelines, for example, or to distributed training. We have some experience using a tool called Binder for Jupyter notebooks, and we are looking at integrating this with the Kubeflow Jupyter web app as well. And the last one is that we are quite involved in the ongoing work on Kubeflow improvements in metadata and artifact management, so that is something we'll also keep pushing for in the community.

So we would like to thank everyone in the Kubeflow community for the great tooling, and of course all the Kubernetes and cloud native tools that we rely on as well. And we are happy to answer any questions. Thank you very much.