Hello everyone, welcome to our session, "A Scalable Platform for Training and Inference Using Kubeflow at CERN". It's going to be presented by my colleague Philip, a senior research fellow with the ATLAS experiment at CERN, and me, Diana, a computing engineer in the Kubernetes team.

So at CERN we do many things, but at the core of everything we're doing, we are always trying to answer the question: what is the universe made of, and how does it work? It's my pleasure to take you now to CERN. It's located on the Swiss-French border next to Geneva, and it's maybe most famous for hosting the Large Hadron Collider, a particle collider that smashes together protons at the energy frontier. By doing that, it allows us to study what the fundamental building blocks of the universe are. In the Large Hadron Collider we have a tunnel where the protons circulate, crossing the border twenty-two thousand times per second at almost the speed of light, and they're brought to collision at four points. We have big cameras, particle detectors, around those points, and then we get data, three-dimensional pictures of these collisions, and transfer it to our data center.
In the data center the data becomes knowledge: we create reconstructions of these proton collisions and can then probe how the fundamental building blocks of the universe interact with each other. At the end we get data products like these plots, which give us insight into how these fundamental particles behave, such as the Higgs boson, which was discovered in 2012.

2012 was a turning point for particle physics and for computer science. This is the year when, on the 4th of July, the discovery of the Higgs boson was announced, completing the last missing piece of the theory of particle physics, the Standard Model. 2012 is also the year when the AlexNet convolutional network was published, marking the beginning of the machine learning age. As we've seen, our challenge now is to try to reconstruct and identify Higgs bosons in the 3D photos of proton-proton collisions, from the data we're getting from the particle detectors at the Large Hadron Collider. To help us on this mission we have John and Emma. John is a physicist at CERN who wants to analyze petabytes of data coming from the LHC, and Emma is a software engineer at CERN who maintains central computing resources and software services.

As you can see, in 2012 there were only 27 physics papers published featuring machine learning, deep learning, or neural-network terms together with high-energy physics. In 2023, just last year, this number grew to 445. This year it's March and we are already above a hundred papers published, which is absolutely great. So what can we do about it? Well, we start with some assumptions. First of all, we know that we want to do machine learning. Second, we need to run on premise. And lastly, we want to give preference to open-source projects. With this in mind, this sounds like a great opportunity to use Kubeflow. The question now is: is it actually enough, or do we need more than Kubeflow?
To answer this question we need to start by understanding what exactly it is that we want to do. It's time to cover some requirements.

Requirement number one: the platform should be able to manage the full machine learning lifecycle. We're talking about collecting data, analyzing it, training and retraining, evaluating whether a model is good enough, putting it into production, monitoring whether it keeps performing well, and then orchestrating all of this into a lifecycle. Of course, in reality it probably looks more like this, and this is still overly simplified, but for the sake of our mental health we're going to stick with the simpler version. We can see that Kubeflow comes with a solution for most of the steps in our lifecycle. We can use Kubeflow Pipelines for collecting data and for evaluating the models, notebooks to analyze the data, PyTorch and TensorFlow for training and retraining, and Katib for hyperparameter optimization. Then we can put models into production using NVIDIA Triton or KServe, and finally we need to monitor them. Monitoring is a tricky beast, because there are many solutions but most of them are not completely open source. From the last research I did, I saw that there is MLRun, which tries to solve these problems. It's not currently used by us, at least not yet, but we'll see how it goes. The main point here, as was already mentioned, is that with Kubeflow, whenever you have a problem, there is usually a project trying to solve it that is very well integrated with Kubeflow. So this all sounds like: one precious Kubeflow to rule them all.

Requirement number two: the platform needs to be integrated with our CERN systems. We're talking about our storage systems, EOS and the CernVM File System; integrating with our authentication and authorization systems; and quota management, because by default you get access to just a small quota, but you need to have a clear way of requesting more.

Requirement number three: the platform should be centralized, which means all the teams should have a single place where they can run their experiments and get access to a common pool of resources. This stimulates cross-department collaboration, but it also prevents a lot of effort being wasted on in-house solutions. Because I think centralizing the resources is so important, I'd like to go into a bit more depth on why. First of all, getting GPUs is hard, so it's very important to reassign GPUs that are not used properly, so-called idle GPUs. It's also important to do GPU sharing to increase the GPU offering. Of course this comes with some constraints: every user needs to understand what it means to use a shared GPU rather than a dedicated GPU, but I think it's worth it. Last but not least, when you have a common pool of GPUs, it's easier to get access to many GPUs at the same time to do distributed training. The second point is bursting into the cloud, which means the system can be more elastic, but it also means we get access to hardware that is so specialized or so expensive, usually both, that it's impossible to get on premise; as a result, we have the opportunity to get it in the cloud. And of course we get full control over the scheduling of resources, which means we can maximize resource utilization while minimizing our carbon footprint.

Requirement number four: the platform should be easy to use, because physicists are usually not very advanced in infrastructure. It needs to be easy to use from the UI, and Kubeflow does a great job at providing this: we can create resources, we can collaborate with our colleagues at CERN, and with just a few clicks and checkboxes we can mount drives and configure nodes.
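Behind those few clicks, Kubeflow creates ordinary Kubernetes objects. As a rough illustration (the name, namespace, and size below are invented placeholders, not our actual configuration), a volume that notebooks and training jobs can mount at the same time is just a PersistentVolumeClaim with the ReadWriteMany access mode, here written out as the equivalent Python structure:

```python
# Sketch of a shared volume manifest: ReadWriteMany lets several pods
# (e.g. a notebook server and a training job) mount it simultaneously.
# The name, namespace, and size are illustrative placeholders.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "shared-datasets", "namespace": "user-john"},
    "spec": {
        "accessModes": ["ReadWriteMany"],  # mountable by many pods at once
        "resources": {"requests": {"storage": "100Gi"}},
    },
}

print(pvc["spec"]["accessModes"][0])  # prints ReadWriteMany
```

Whether ReadWriteMany is actually available depends on the storage backend; classic block storage typically only supports ReadWriteOnce.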
We can enable and disable integrations, which is absolutely great. So, to recap, these are more or less our requirements: we want to cover the full machine learning lifecycle; we need to integrate with our CERN systems; we want a centralized platform, mostly to have the GPUs pooled in one place; and the platform should be easy to use. As you can see, there is a small question mark near requirement number four, and this is because, from my perspective, I think Kubeflow is pretty easy to use, but at the end of the day it's not just for me to decide but for the physicists who are using the system. That's why I'll give the stage to Philip and John to try it out and let us know.

Thank you very much, Diana. As an experimental physicist, I'll now take over and run an experiment with John. John works at CERN and wants to find out something about the Higgs boson, because it's a fundamental particle at the heart of our current understanding of how particles interact: it provides the mechanism that gives particles mass. Although it was discovered over 10 years ago, it's still a mystery and needs to be further investigated. It's a real challenge to find Higgs bosons in all of these collision events we get from the heart of the Large Hadron Collider, because the Higgs bosons decay instantly, in yoctoseconds, to b-quarks, another species of particles, and we never see these quarks either; we only see their tracks, their traces through the detector. So the challenge is to find these beauty quarks inside all of the collisions. Fortunately, they have a very peculiar property, and this is just particle physics 101, because the focus here is really how to utilize Kubeflow to do all of this. The thing is that Higgs bosons most frequently decay to these b-quarks, and quarks behave in a particular way whenever they are created in one of the collision events.
They initiate sprays of particles that go through the detector. On the right-hand side you see a picture of how this looks in the detector: the blue cones are so-called jets of particles, each initiated by one of the b-quarks. These b-quarks have the property that they live comparatively very long; scaled to a human lifetime, it would be like living for thousands of years. Other particles instantly decay to further particles, or they are stable on our time scales, but these b-quarks travel a certain distance through the detector and only then decay to other particles, so they have their own special place of decay. We see one place where all of the particles from the Higgs boson and the collision decay and move through the detector, and then, a few millimeters next to it, we see a second place of such decays, and we can use machine learning to identify this particular signature.

So we have one model that we call GN2, because we started with graph neural networks and now we use attention with transformer models. This architecture is trained with high-fidelity simulation of these collisions and of the ATLAS detector. We take one of these jets from the b-quarks, together with all of the trajectories of the particles inside the jet, the spray of particles, map them together with the jet, feed all of this into our transformer model with labels from simulation, and train its millions of parameters for a long time. As an output we get three values that are high if the jet comes from a beauty quark or from a different kind of quark.
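One common way to collapse such per-class scores into a single number is a log-likelihood-ratio-style discriminant. The sketch below is only an illustration of that construction, not GN2's actual configuration: the scores and the charm fraction `f_c = 0.1` are invented example values.

```python
import math

def b_discriminant(p_b, p_c, p_light, f_c=0.1):
    """Combine three jet-origin scores (b, charm, light flavour) into one
    b-tagging discriminant. Higher means 'more b-jet-like'. The charm
    fraction f_c tunes how strongly charm jets are penalised; 0.1 is an
    illustrative choice, not a tuned working point."""
    return math.log(p_b / (f_c * p_c + (1.0 - f_c) * p_light))

# A jet whose score is dominated by the b output gets a large discriminant:
print(round(b_discriminant(p_b=0.90, p_c=0.06, p_light=0.04), 2))  # prints 3.06
```

Cutting on this single value at different thresholds is what later traces out the efficiency-versus-rejection curve.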
There are backgrounds for us: a charm quark, or some light-flavour quark. So it's a signal-versus-background discrimination problem, and we approach it by combining the jet properties with these low-level tracks. We have up to 40 tracks per jet that we use for the categorization. We initialize all of this with a Deep Sets embedding that goes into the transformer architecture, and the beauty of this is that at the end we have different representations. We can categorize the whole jet as a b-, c-, or light-flavour-initiated jet, but we also have auxiliary tasks that, together in the loss function, help to classify the individual tracks: their origin, and also which of them come from a common decay vertex. So you have the primary vertex and then, as you remember, the special secondary vertex from the b-quarks.

This was all rather technical, just to provide the background. This is what I like to call the propaganda plot, showing how the performance evolved over the years. We want to find the b-jets and discriminate against charm jets and light-flavour jets, and you see that from the initial attempts using fully connected neural networks up to the graph neural network tagger GN1, we have improved the performance by a factor of four by leveraging this technology. Of course, the cost is that we need to train on large data sets. Fortunately, at CERN we have both the data sets and the technology to do this, and here Kubeflow comes in.

We use some software that is based on PyTorch and PyTorch Lightning. It's configurable with YAML files, we use best practices like continuous integration and deployment, and we provide ONNX support so we can deploy the trained model within the ATLAS reconstruction software for the identification of these b-jets. The nice thing is that all of this is containerized: we have a GitLab instance that provides us with images of the software, and this is ideal to be used in Kubeflow. So this is a possible workflow.
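Concretely, scheduling such a containerized training bottoms out in a Kubernetes manifest. The sketch below (image path, command line, and claim name are invented placeholders, not our real setup) shows the shape of a training job built from a GitLab-registry image with a dataset volume mounted:

```python
# Sketch of a training-job manifest: a GitLab-registry image scheduled on
# the cluster with one GPU and a dataset volume. All concrete names here
# (image, command, claim) are hypothetical placeholders.
training_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "gn2-training"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "train",
                    "image": "gitlab-registry.example.org/project/salt:latest",
                    "command": ["salt", "fit", "--config", "/configs/model.yaml"],
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                    "volumeMounts": [{"name": "datasets", "mountPath": "/data"}],
                }],
                "volumes": [{
                    "name": "datasets",
                    "persistentVolumeClaim": {"claimName": "training-data"},
                }],
            }
        }
    },
}

container = training_job["spec"]["template"]["spec"]["containers"][0]
print(container["resources"]["limits"]["nvidia.com/gpu"])  # prints 1
```

In practice a Kubeflow pipeline or training operator generates an equivalent spec for you; the point is that the GitLab-built image plugs straight in as the `image` field.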
We have this salt software in GitLab, where we create Linux container images, and these can then be directly scheduled within Kubeflow for training. The input data sets, the training and the validation data sets, we get either from storage on institute machines, from the EOS storage system at CERN, or from in-house solutions; we can also use OpenStack S3-type storage and mount it. So there are different options for how to go ahead. And once we have the training done, we can evaluate all of it using Jupyter notebooks that are also served in Kubeflow.

I think the best way to show that you trust your infrastructure is to go for a live demo, so we are going to do exactly that. I want to show first how to run this salt software. We have here some pipelines that we are going to schedule, if the Wi-Fi permits. Sorry, I shouldn't have said that, with the trust in the infrastructure, because we just tried it. What I want to show is a run of the pipeline, but while this is loading, I'll just show one of the pre-recorded fallback solutions. Would have been too nice. This also seems to be challenged, but we're getting there, and we have a little more time. Right before doing this, Ricardo said: this is very brave, are you sure you want to do this?
But let's see. So we now have the pipeline, where we schedule, with a YAML file, the image hosted in the GitLab registry, and this is the command that we use for training everything. We now create a run, schedule an experiment, and then we can monitor the run once we launch it. Let's say we train for five epochs. Here you see that a container image is now being deployed, and as soon as it's running we can see the output in the log. You see here some information about the model; it's a simplified version with only 1.3 million parameters being trained. The data is being loaded, we train for five epochs, and we get the validation and the train loss. At this point it's just to show you how easy it is to use: we have a YAML file where we configure everything, and then this is running. We have the first epoch, which completed with just a small data set of, I think, a thousand entries for this demonstration, and you see that we get the train and the validation loss with a timestamp. In principle we could use this directly, and, also important, there is TensorBoard support to see the loss as a function of epoch, decreasing if everything goes well. So that's one thing we can do. I won't bore you with watching the loss decrease and eventually converge, so I used some time beforehand to pre-train.

We now see, in a Jupyter notebook, an evaluated version of the model on some test data set. In this notebook we import some plotting libraries, you probably know pandas and NumPy, plus an HDF5 reader for the data sets. We read the output from a similar model and can directly visualize it in the notebook to see how well it performed. As I mentioned, we have three outputs from this transformer model: a score for the b-jets, for the charm jets, and for the light jets. We want to discriminate the b-jets originating from the Higgs decay from the other two classes, so we construct a discriminant and can then compute the efficiency for the signal class and the rejection of the background classes, and plot everything as a receiver operating characteristic curve. Doing that provides us with the plot as a PNG file, which I'm just going to display below. What you see is, for a certain efficiency, how many of the b-jets from the Higgs decay you select, and how well you can reject the light-flavour-quark-initiated jets and the charm jets. This is of course not as good as the real performance, but it shows that it's possible to train a state-of-the-art classification architecture for particle physics. So I think, with some excitement for the presenters, we can say that indeed this technology works; it even works on stage. With that, I give back to Diana after this experiment.

What is left is to end with some conclusions. Machine learning is becoming a key technique in high-energy physics. Kubeflow, as we can see, does meet our requirements and needs. And the example here, a transformer model for classification of jet flavour, showcases that the platform can be utilized for physics data analysis. Thank you. And I believe we have time for questions.

I think we have time for a couple of questions. There's one over there.

Thank you for the great talk, super exciting. On one of the slides you mention a shareable PVC between notebooks and training workloads. Can you speak a little more about how you use it and why it is useful for you?

Sorry, which slide was it? On one of your architecture diagrams you show that you use a shareable PVC between notebooks and training, which allows you to easily scale your training scripts and distribute them across workloads. Can you say a bit more about how you do that and how it works with notebooks today?

I'm not sure; I believe it's just PVCs that can be mounted multiple times. Can you maybe point to the slide?

Yes, please, I'll go back.
Yes, yeah, exactly. So my question is: you are using ReadWriteMany PVCs between your training workloads and notebooks. My question is exactly how you transfer the code from the notebook to your distributed training workloads, across multiple machines. Here it's ReadWriteMany, I think that's it, but do you use the notebooks directly on the training workloads, or do you just use the files there?

So the notebook we don't really use for the training. We have this salt software, which is just a wrapper around Lightning, and in it we can use many GPUs at the same time by specifying the number of GPUs and workers. So this is what's being used, and I have to say I didn't explore the training from the notebook perspective, so I can't answer that.

So you're not really taking a notebook, converting it to a Python file, and distributing it, right?

No, it's directly taking the Linux container for the training.

All right. And you don't have any containerization capabilities inside the Jupyter notebook right now to help you build these images? You don't have that right now, right?

No.

All right, thank you.

Thank you for the nice talk. Actually, we have a similar setup at a university, and I'm wondering if you found a solution for fair resource usage, because that's a huge topic in our research group. How do you actually make sure GPU resources are fairly shared in the team?

That's a very complicated question; I can't just say we found the solution, we're working on it. I think the first step we're trying to take is to make sure there are no users who just assign resources to themselves and then don't use them; what we're trying to do is watch for resources that are just sitting idle.
We make sure we put them back in the common pool so that someone else can take advantage of them. Then second, of course, we're trying to make the pool larger, and a big thing here is GPU sharing with time slicing and MIG. With time slicing there is some risk, because you're still not isolated, but it's worth it, because then, for testing setups, we can ensure more users actually have access to GPUs.

Thanks for the presentation. I want to ask if you are using multiple namespaces inside of Kubeflow, or just a single namespace.

We have one namespace per user; this is how isolation works.

Okay. I ask because I had a project some time ago where I needed to implement multi-tenancy, so that each user could use different S3 buckets and different namespaces. That's why I asked. Thank you.

Hi, just a small question regarding the GPU pooling: how do you do the isolation? Under the hood, what runs on these GPUs? Do you really do vGPU? Do you allocate specific GPU resources for a pod? How do you do this splitting and isolation?

Under the hood we're not using vGPUs; we are using time slicing and Multi-Instance GPUs, MIG. We have GPUs connected to the nodes, and when you use time slicing or MIG from the GPU Operator, they are advertised as available GPUs, either time-shared or as MIG instances, and then a pod can request them and get them. That's it. As for the isolation: in terms of time sharing, you can expect that if one tenant is greedy, then the others will be starving, so there is no isolation there.
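For context, the time slicing mentioned here is driven by a small declarative configuration consumed by the NVIDIA device plugin; a minimal sketch of its shape, written as the equivalent Python structure, looks like this (the replica count of 4 is an illustrative choice, not our production value):

```python
# Sketch of an NVIDIA device-plugin time-slicing configuration: each
# physical GPU is advertised to the scheduler as several replicas.
# Replicas share the GPU in time; there is no memory or fault isolation.
time_slicing_config = {
    "version": "v1",
    "sharing": {
        "timeSlicing": {
            "resources": [
                {"name": "nvidia.com/gpu", "replicas": 4},  # 4 is illustrative
            ]
        }
    },
}

replicas = time_slicing_config["sharing"]["timeSlicing"]["resources"][0]["replicas"]
print(f"each physical GPU is advertised as {replicas} schedulable GPUs")
```

Pods then request `nvidia.com/gpu` as usual and simply land on one of the advertised replicas, which is exactly why a greedy tenant can starve the others.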
We can, and are, thinking of having a mechanism on top of the defaults that kills processes which try to take more of the GPU than they are allowed, but other than that you have isolation only if you use MIG, which works more at the hardware level.

Was it MIG?

MIG, yes: Multi-Instance GPUs, which are available only for A100s and H100s.

And is this completely open source?

I don't think so.

Okay, good to know. Thanks. Just a question: how are you managing the number of pods per node you are running? Because I suppose you have a couple of models running at the same time, right?

Yeah, but there is an upper limit on the node; I don't remember if it's the hard limit of 110 pods per node.

So you're not reaching that limit?

No, but you also have limits in terms of CPU and memory for whether you can schedule more on a node, and I think we're not reaching the pod limit because we're reaching the other limits first.

Okay, and just another question: are you using Kale for building your Kubeflow pipelines?

Kale is a plugin for Jupyter notebooks; for building Kubeflow pipelines, I am not sure.

Thank you.

If you want to reach us, make sure to find us after the presentations and after the Kubeflow Summit. Thank you.