Hello everyone, I'm really happy to be here today. I'm Ricardo. Today we'll be talking about training and optimization of transformers, and we'll talk a bit about the platforms we have at CERN for this kind of workload, and also a real physics use case, for once with a real physicist as well, so I don't have to pretend that I know what I'm talking about today. So hi everyone, my name is Maxence and I am supposed to be the real physicist, but unfortunately my talk is mostly going to be about how to apply machine learning to a very specific use case in our experiment. Yeah, and I'm Ricardo, I lead the platforms infrastructure team at CERN. I'm also a member of the technical oversight committee at the CNCF and of the recently formed end user technical advisory board as well.

I'll start by giving a very brief overview of what CERN is. It's the European laboratory for particle physics. It's been there for several decades, and our flagship project is the Large Hadron Collider, a very large particle accelerator that is 27 kilometers in perimeter and 100 meters underground. We accelerate protons to very close to the speed of light and make them collide at specific points where we built these experiments. Because a video is better than a thousand words, I'll play this very quickly: you can see we accelerate two beams of protons around this very large ring, and we have these four experiments where we make the collisions. What I want to highlight here is that these machines act as very large cameras, and we produce a lot of data. I'll stop here, because this is where it gets interesting: we are generating petabytes of data per second. Traditionally, to handle this at the nanosecond scale, we've been building custom electronics that filter out most of this data first, and then for the very small percentage that is left we use very large computing farms, which traditionally are also CPU based. From there we finally get to an amount of data that we can store in our data centers, reprocess, and give to the physicists for analysis. This is important context for this talk because it explains a transition that is happening: all the custom hardware, all the custom electronics we got used to developing, can now potentially be replaced with GPUs and other accelerators running machine learning algorithms, and this is a huge simplification of our infrastructure. So Maxence will now talk a little bit about one of the experiments.

As I mentioned, I'm part of the ATLAS experiment, and you have a nice photo showing you the scale of the detector with a human next to it, just so it's more understandable. The point of the ATLAS experiment is to study physics at the fundamental scale, and at the fundamental scale physics is described by particles, elementary constituents that you can't subdivide, as far as we know so far. They are all listed here in these circles showing you the different particles: you have the particles that make up matter in the outer ring, then the particles that carry the interactions in the inner rings, and as you can see there's one at the center, the Brout-Englert-Higgs boson, more commonly called just the Higgs boson, or the God particle.
So the point of the LHC and ATLAS was to discover this particle, which was done in 2012, but we still have to measure a lot of its properties, and the use case I'm going to show is an advanced use of AI so that we can measure some rare decays of it. This is obviously a huge collaboration, given just the size of the detector and the size of the dataset to sift through: we are about 6,000 members scattered across 216 institutes and more than 40 countries now. I'll come back to the use case by the end of the talk.

Yeah, so now we'll break the talk into two parts. I'll start by talking about infrastructure, focusing on the journey we had to get here and the challenges we are facing, and then Maxence will talk about the physics use case at the end. If we look at what we've been building at CERN, and we've given many talks over the years at this conference and others, we had a long journey to get to where we are. We started looking at cloud native and Kubernetes back in 2016, when Swarm was also a possibility being considered; we even added Mesos at some point, and we eventually got to a production service. All the work we've been doing since then is about integration with our internal systems, things like our in-house storage systems; then usability and ease of deployment, with GitOps, secret management, things like Flux and Argo; then allowing heterogeneous clusters with node groups, autoscaling, auto healing, load balancers; and then a very strong focus on dissemination internally, mostly to get people used to handling this kind of application, and also on security, which is something we put a lot of effort into. Finally, one key aspect is ensuring business continuity and disaster recovery, and we've been building on all the features offered by our tools. So this has been a long journey, but it got us quite far, and you can see here we use these tools and a lot more.

From the top you can see the webinars we organize internally, and you can see the two different kinds of things we offer. On the services side, here at the top, you have how CERN runs critical application services on Kubernetes, and these are really the critical systems on the campus, the everyday services we need to support a campus with 10,000 people. Then you see the other side, which is more the physics analysis: you can see here analysis reproducibility with REANA and Kubernetes, this is where the physics actually happens, and then the base computing power we need, with Rucio here, moving Rucio to production on Kubernetes. Rucio is the system responsible for moving data around in the ATLAS experiment, and these are pretty large requirements. If we look at Rucio, for example, I have two plots at the top there: in any random week at CERN we are moving over 50 petabytes of data around, and you can see transfers of 7 petabytes of data as well. Then on the capacity side you can see an experiment done by ATLAS using public clouds, where they tried to scale out, in this case to GCP, and they managed to run for several days with over 80,000 cores of spot instances. So we have a lot of experience, and we built all this infrastructure that people got used to, so when we started looking at machine learning and AI it was only natural to try to build on it. We started looking around and Kubeflow was a good option: it builds on all the principles we had in-house already.
It doesn't really build anything new; it gets a lot of components working together in a way that can help the end users, and it's been quite successful in answering our needs, which go from data preparation, to some sort of fast iteration using notebooks or other means, then scaling out to distributed training and hyperparameter optimization, and then model storage and serving. I won't go into details on this, but you have the QR code of previous talks we gave in this area. What I will focus on is: we have this infrastructure, so what are the challenges we face today? And these are challenges both on the stack and on the usage of the resources.

The first challenge, and I don't know how many people here are running their own on-premises infrastructure rather than just relying on external cloud providers, but if you are running your own infrastructure, at least for us this has been a huge challenge: the usage pattern of this hardware is very different from what we call our traditional CPU workloads. The needs for power and cooling increase dramatically. If we look at this plot, we can see that the power needs of the current generations, and especially of the future generations that are coming, limit a lot the density we can have. At the same time, people are requesting this kind of density, and not only density in a single node: they are also asking for things that were traditionally only needed by HPC centers and supercomputers, things like fast networking interconnects, InfiniBand and friends. All of this comes with a huge demand for power and cooling. If we look at the diagram there, with four GPUs per node and the internal interconnects we are already putting quite a lot of demand on the power and cooling required for each rack; if we start talking about fast interconnects between the nodes, things get even more complicated.

The second one is hardware evolution, and this doesn't seem to be calming down. We got used to a fairly stable rate of evolution for CPU hardware, but suddenly GPUs came onto the scene and the rate of change is much higher. We can see here the predicted next generations from NVIDIA and what they are optimizing for: they are clearly targeting things like LLMs, which are not necessarily the main use cases we have internally for machine learning. But what this means is that people follow the trend: when technology allows you to do something, you start building new use cases, and this is what we are seeing in-house. The fact that you can fit such large models in a GPU, and have such compute power in a single GPU, means that people are considering what I was showing at the beginning: the custom electronics doing the very fast filtering in the detectors can now potentially be replaced by more commodity hardware, with GPUs and other accelerators. So these use cases are coming, and at the same time people will want the new fancy GPUs. From our side, because they're extremely expensive, we want to make them last longer, so we are already pushing the lifetime of this kind of hardware from five years, which is our standard, to eight years, while people want a much faster turnaround, because this is what the public cloud providers are giving them. What this means is that we have to make the best of our internal infrastructure, but we also want to offer the more advanced use cases the hardware they need.
So we are clearly going hybrid, both to fit the needs and the costs of these specific use cases, and because of the very long delivery times we are facing today. We could say this is a hype that will disappear, but we already saw this with Bitcoin, and now we have another hype, and maybe there will be a new one after that, so it doesn't seem like this kind of infrastructure is going away. Going hybrid brings a lot of new challenges, and luckily we have quite some experience with this, with the new needs and requirements for network and storage. Over the last 20 years we built what you can see there on the top right, the grid computing infrastructure for the LHC: 200 different centers around the world connected with very high throughput links. So we know how to distribute the data, and we know how to do the workload scheduling to put the workloads where the data is, or to move the data when appropriate. If we look at the cloud native stack, though, there are things that are missing. We don't yet have all the primitives for advanced scheduling, things like queues and co-scheduling, and we don't have an easy way to handle multicluster: the management of multiple clusters is there, the scheduling across clusters is not. This is something I find extremely interesting to focus on right now. Projects like Kueue and the new features they've been adding around MultiKueue are really interesting, and if you look at other projects like Volcano or Armada, I can see that they will start building on this base layer as well. What is really interesting here is that if we solve this problem, and we are focusing on it mostly because of GenAI, we are actually going to solve a lot of problems for HPC centers that have been using traditional tools like Slurm for scheduling, and suddenly they might be able to offer a pure cloud native API to their users. This is a huge simplification, and for us it would mean we can use the same API on premises, on public cloud providers, and targeting HPC centers as well. So this is something that will be very interesting to follow over the next year or so.

The last one I have, and this came up while discussing with Maxence for this talk, is that we probably need to start focusing on the equivalent of the Python layer we have for our C++ infrastructure. If you have ever had anything to do with high energy physics, you've probably heard of the ROOT framework, which is a data analysis framework developed at CERN. It's written in C++, and there are a lot of people who are C++ experts, who love it and wouldn't trade C++ for anything else, but there are a lot of other people who do not want to handle the complexity of C++. So there's a layer called PyROOT that simplifies the life of a physicist doing data analysis quite a lot. When we look at our own stack, we constantly see, when we offer tools like Kubeflow, people saying: okay, my job is pending, I have no idea what that means, I don't know how to debug it, I ran the kubectl command that is in the documentation but I really don't know what I'm doing. So maybe an abstraction is required to bridge this gap. And then also, for HPC, there's a lack of ability to answer HPC-style questions: when you're running very large workloads in a batch-like way, there are things you want answered to be able to do your job properly, things like how long is my workload going to take to complete, or when will resources be available for my workload to run.
The cloud native stack wasn't necessarily designed for these kinds of batch workloads; it was more for service-oriented workloads, and maybe we need to take a step back and see what primitives are missing to be able to answer these kinds of questions. And with this I hand over to Maxence to continue with a real use case.

All right, so let's go to the use case. As I mentioned, one of the things we are really interested in in ATLAS is finding the Higgs boson, and here you have a nice visualization of an event in ATLAS. This is a real event in which you can see one fundamental particle, called a muon, emanating from the main interaction, and you can also distinguish these two bluish cones, more visible on this side, which are a Higgs boson decaying to two quarks of a type called b. We want to be able to identify these quarks in our detector. You can see that this leaves a very complicated signature, definitely not something you can analyze by hand, so we do need detailed analysis techniques. What we want to do, more specifically, is classification of these b quarks versus the other types of quarks, which are primarily c and light quarks. Of course we want to do that because we want to study the Higgs, because in the end this is a physics experiment and we want to test the theory. The best way to find these b quarks, these b jets, is to use machine learning, because it gives the best accuracy, and obviously to get the best accuracy you need the best performing machine learning model, so you need to optimize your hyperparameters, which is very costly. This part of the talk is really about that: a framework to improve the hyperparameter optimization using Kubeflow and a technique from the literature.

On our side we have not only been focusing on the hyperparameters, of course, we've also been trying other architectures, and I'm showing this plot so that you get a bit of an idea of the history of the project. We're doing b versus c versus light classification, which are each types of quarks, and you can see how the models' performance evolves over the years. At a fixed b identification efficiency, the plot shows you in green the c rejection and in blue the light rejection, where the rejection is the inverse of the misclassification efficiency. You can see that through the years, by adopting more advanced machine learning, from a plain deep neural network, to including a recurrent neural network, to a deep-sets-based architecture, to graph attention, and finally to GN2, the transformer-based model I'm mostly going to talk about, we managed to get this really nice improvement in performance. This is very important for us, because these are very difficult and rare events to find; we are literally sifting through to find the needle in the haystack, except the haystack is much bigger. All of these models are trained in a similar way: they combine different types of physics inputs so that they can output, for each jet, the probability of it being each quark flavour, and then we build a discriminant based on these scores that the analyses can later use for the data analysis. They are all trained on Monte Carlo simulated data but calibrated on real data to account for some mismodeling effects.
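As a concrete illustration of what "a discriminant based on these scores" can look like, this is the usual ATLAS flavour-tagging log-likelihood-ratio form, shown here only as a sketch; the c-jet fraction f_c is a tunable mixing parameter whose value is not quoted in the talk.

```latex
% Sketch of the standard ATLAS-style flavour-tagging discriminant built from
% the per-jet class probabilities p_b, p_c, p_light output by the network.
% f_c is a tunable c-jet fraction (illustrative, not taken from the talk).
D_b \;=\; \ln \frac{p_b}{\,f_c\, p_c \;+\; \bigl(1 - f_c\bigr)\, p_{\text{light}}\,}
```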
The model I'm going to focus on is the GN2 model, a transformer-based model that we could describe as multimodal and multi-task. It's multimodal in the physics sense that it combines different types of physics input, which for us is quite nice because it combines things we historically treated in very different ways: we used to have one network that would deal with one type of information and another network that would use another type. The nice thing with a single architecture combining the different types of input is that it's much simpler to maintain and upgrade, to do studies such as hyperparameter optimization, or to handle changes when the entire software stack used to reconstruct the various sub-elements changes, and it gives us this simplicity together with state-of-the-art performance. It's really the usual trend in deep learning of having one large network able to do several things at once. And it really is able to do several things at once, because it's multi-task: you can see this from the fact that it's not just predicting the flavour, so the class, it's also predicting other physically relevant quantities that we know about from expert knowledge. So it's a way to put expert knowledge into the design, and it's working really well, with the only caveat that it is quite resource intensive for us to train: it takes roughly two to three A100 GPUs for an entire week to do one full training, mostly due to the size of the dataset. The problem is that we are not a large tech company, and most people in our collaboration do not have access to a high performance cluster with a sufficient number of GPUs to contribute to this sort of project. It also makes it prohibitively expensive for us to do hyperparameter optimization, just given the scale of one training and the fact that you have to iterate. So really this part of the talk is about how we could democratize this, and since this is a cloud native conference, we thought Kubeflow would be a nice way to do it, for several reasons.

First, the container orchestration characteristics of Kubeflow fit our workflow quite well. We always keep our code in a GitLab repository, and it's already well set up to be built into executable container images using GitLab's continuous integration tools, so it's very easy to integrate this with the Katib framework, the hyperparameter optimization component, which is provided by CERN through the website here, CERN's hosted Kubeflow service. On our side, more specifically, we just have our executable code on GitLab, and we have our data accessible through S3-like storage that we can mount locally, so that the input/output speed is higher when we train. One thing we really like with this approach is that we can actually use your work, which means that we get access to, for example, AutoML algorithms which we would normally not use, so it gives us easy access to a lot of the developments happening on this side.
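The talk doesn't show the actual experiment definition, so as a rough illustration of what driving such a scan can look like from the user side, here is a minimal sketch assuming the Katib Python SDK's tune() interface; the objective function, metric name, parameter ranges, and experiment name are placeholders rather than the real GN2 training job, and the exact SDK arguments may differ between Katib versions.

```python
# Hedged sketch of a Katib hyperparameter scan driven from Python.
# The objective and search ranges are illustrative stand-ins, not GN2 code.
import kubeflow.katib as katib

def objective(parameters):
    # In the real setup this would run the containerized training code;
    # here it just computes a dummy metric from the sampled parameters.
    lr = float(parameters["learning_rate"])
    batch_size = int(parameters["batch_size"])
    val_loss = (lr - 1e-3) ** 2 + 1.0 / batch_size  # stand-in metric
    # Katib's default metrics collector parses stdout lines of the form name=value.
    print(f"val_loss={val_loss}")

client = katib.KatibClient()
client.tune(
    name="gn2-lr-scan",  # hypothetical experiment name
    objective=objective,
    parameters={
        "learning_rate": katib.search.double(min=1e-4, max=1e-2),
        "batch_size": katib.search.int(min=256, max=4096),
    },
    objective_metric_name="val_loss",
    objective_type="minimize",
    algorithm_name="random",
    max_trial_count=12,
    parallel_trial_count=3,
    resources_per_trial={"cpu": "2"},  # GPU requests would be set similarly in practice
)
print(client.get_optimal_hyperparameters("gn2-lr-scan"))
```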
I just want to highlight what we think are the key points that really make this interesting for us. First of all, it's open source with an active community that continuously develops these tools, which means we don't have to do it ourselves; that is not our expertise on the ATLAS side. Then, for us it's valuable as a multi-platform and very flexible framework, meaning we're not really dependent on one site: we could move, if we needed to, to another cloud provider if we suddenly had to scale massively for a single one-off project. So it really gives us optimized resource usage. It also means we could more easily share hardware resources with the other experiments, because at the moment things are quite fragmented and we do not share hardware across experiments, which is not ideal: we would all like to have huge GPU clusters, and it would be easier if we shared them. And again, a strong point for us is that it's accessible to everyone, so it really democratizes access to machine-learning-heavy projects in our experiment. It also has the nice side benefit of the visualizations you get when you do hyperparameter optimization with Katib, which you're probably familiar with.

But this is not enough, of course, because it's a nice framework to run something, but it doesn't make the thing you're running any cheaper. You can use early stopping for the hyperparameter optimization, but we still need to reduce the computational complexity to be able to do hyperparameter optimization of our transformer at all, and this is why the talk now changes direction slightly and goes into a technique from the literature that lets you do the hyperparameter optimization at a lower cost, at least for some of the hyperparameters. This has to do with the parametrization of the network. Here you have a classical deep neural network, and what I mean by parametrization is really the way you initialize the weights, so the way you sample them from a random distribution, in this case Gaussian, but that doesn't matter so much. What matters is that the variance is inversely proportional to the size of the input of the layer, and that the learning rate is classically the same for all of the weights in the network, a sort of master learning rate. This is the standard parametrization, SP. One way to move to a nicer parametrization, for the purpose I'm going to show, is to adopt the maximal update parametrization, muP, from the literature, and I'm just highlighting the key differences with the standard one: the output layer's initialization variance is scaled down by an extra factor of the input dimension of that layer, and the learning rates of the hidden and output layers are scaled down with the width as well. This actually makes sense in a way. If you think of a network of a given width, the output layer is the more opaque layer; think of a cross-section of the ocean, where the surface sees most of the sun but the bottom really doesn't see much of it, and here the sun would be the loss function and the bottom would be the input layer. So this has the effect of making the output side of the network a bit more transparent, so that learning goes into the depth. This is actually provable, and I'll come back to it, but let me just give you the two key conclusions about this maximal update parametrization. First, it has the effect that the updates of the activations become roughly independent of the width of the layer; just to clarify, the width is the number of units per layer, so really the transverse dimension, not the depth. Second, it has the nice provable theoretical property that it is the maximal update parametrization, in the sense that the updates are as big as you could make them for each layer without leading to instabilities.
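To make those scaling rules concrete, here is a minimal sketch in PyTorch, assuming Adam-style muP scaling relative to a base width; the network shape, base width, base learning rate, and function names are illustrative and not the actual GN2 configuration.

```python
# Minimal sketch of SP vs muP: same network, different initialization variances
# and per-layer learning rates. Widths, base_width, and base_lr are illustrative.
import torch
import torch.nn as nn

def make_mlp(width, n_features=32, n_classes=3):
    return nn.Sequential(
        nn.Linear(n_features, width),   # input layer (fan-in fixed, does not grow)
        nn.ReLU(),
        nn.Linear(width, width),        # hidden layer
        nn.ReLU(),
        nn.Linear(width, n_classes),    # output layer (fan-in grows with width)
    )

def build_optimizer(model, width, mup=True, base_width=64, base_lr=1e-3):
    in_layer, _, hidden, _, out_layer = model
    # Input and hidden layers: variance ~ 1/fan_in in both SP and muP.
    nn.init.normal_(in_layer.weight, std=in_layer.in_features ** -0.5)
    nn.init.normal_(hidden.weight, std=width ** -0.5)
    if mup:
        # muP: output init variance gets an extra 1/fan_in factor (std ~ 1/width),
        # and the hidden/output learning rates shrink as the width grows.
        nn.init.normal_(out_layer.weight, std=1.0 / width)
        lr_scale = base_width / width
        groups = [
            {"params": in_layer.parameters(), "lr": base_lr},
            {"params": hidden.parameters(), "lr": base_lr * lr_scale},
            {"params": out_layer.parameters(), "lr": base_lr * lr_scale},
        ]
    else:
        # SP: 1/sqrt(fan_in) everywhere and a single master learning rate.
        nn.init.normal_(out_layer.weight, std=width ** -0.5)
        groups = [{"params": model.parameters(), "lr": base_lr}]
    return torch.optim.Adam(groups)

model = make_mlp(width=256)
optimizer = build_optimizer(model, width=256, mup=True)
```

The point is simply that the only things changing with the width are the output-layer initialization and the per-layer learning rates; everything else in the training setup stays fixed, which is what makes the width scan discussed next meaningful.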
There's a way to see this which is quite cool and comes from the paper on the left: take the simplest network you can think of, a single input x going through two layers, an input layer u of width n and an output layer v mapping those n dimensions back down to the output. On the left you have the SP case and on the right the muP case, and you can see some of the key differences: the way the weights are sampled at initialization, and also the way the output layer gets updated after one gradient step. If you look at what this network computes, the f(x) after one gradient update, on the SP side you get a term proportional to theta times u transpose u, which by the law of large numbers scales not with the input x but with the dimension n of the layer, so the update would not be width-independent. On the right side, thanks to the modified rules of the parametrization, you can see that this term is correctly scaled down by n, so that the scaling is with the input instead of with the dimension.
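To write that two-layer example down explicitly, here is a hedged reconstruction of the one-step calculation, assuming a scalar input and scalar output for simplicity; the exact conventions of the paper shown on the slide may differ.

```latex
% Two-layer toy model f(x) = v^T (u x), with u, v in R^n (n = width) and
% error signal \theta = \partial L / \partial f evaluated on the training point x.
% One gradient step with per-layer learning rates \eta_u, \eta_v gives
\Delta u = -\eta_u\,\theta\,v\,x, \qquad \Delta v = -\eta_v\,\theta\,u\,x,
% so on a new input x', dropping the O(\eta^2) cross term,
f_{\mathrm{new}}(x') \;\approx\; v^\top u\,x' \;-\; \theta\,x\,x'\,
  \bigl(\eta_v\,u^\top u + \eta_u\,v^\top v\bigr).
% SP: \eta_u = \eta_v = \eta and u_i \sim \mathcal{N}(0,1), so by the law of
% large numbers u^\top u \approx n and the update term grows with the width n.
% muP: \eta_v \propto \eta / n, so \eta_v\,u^\top u \approx \eta, i.e. the update
% is width-independent and scales with the inputs x, x' instead of with n.
```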
This can actually be seen in a real network. This plot shows a GN2-like architecture at three different time steps: initialization, and after one and two training steps. The first row is SP and the second row is muP, and each plot shows, for different widths, the size of the pre-activations of the different layers, with each layer as its own curve. You can see that on the SP side things quickly get unstable after a few steps of learning for very large widths, while on the muP side things stay stable. So it really scales nicely across widths, and it is the maximal update in the sense that it stays stable.

This leads to a few properties that are quite nice and relevant to hyperparameter optimization. The first one is that your trainings are stable across widths for the learning rates you want to use, which is already quite nice. Also, a wider model is always going to perform at least as well as a narrower one, and that is on the training loss; you may overfit, but this is a side I'm not going to touch here. So wider is always better, which also makes neural architecture search easier: you just take a model that is as wide as you can afford. And finally, and this is the key point here, you get similar optima across different widths. On the muP side, on the left, I'm scanning one hyperparameter, the maximum value of the learning rate scheduler, and showing the validation loss, and you can see that the curves share the same best hyperparameter. Oh, sorry, I should have said: of these three curves, the yellow one is a transformer with an embedding width of 64, the red one is 128, and the purple one is 256. The nice thing about the fact that they all share the same best hyperparameter is that you could use the small one, which is much cheaper to train, to find the best hyperparameter for the full-width one. On the SP side you have no such guarantee; in this case it seems to roughly work, but you also get this really unstable behavior when the learning rate is too high, which again is due to the fact that standard-parametrization models are not robust to width scaling.

So this is quite nice, and it leads to an algorithm for hyperparameter optimization that comes from this paper, called muTransfer, where you basically do the hyperparameter optimization on the small model and directly transfer the result to the full model, so you can avoid the cost of training the large model several times. We've tried this in ATLAS, and because our compute is limited, we tried to optimize two parameters: the maximal learning rate and the initial value of the learning rate scheduler. This is shown here with the same type of plots as the previous slide, where the different plots show different initial values of the learning rate. This makes it very efficient for us, because if you focus on the muP side, which is the bottom row, you could have found the best configuration directly from the small model, and it really avoids the very costly large-model trainings. It also makes the model more stable in general, and more performant than the equivalent SP one, thanks to this maximal update and transparency behavior.

So we think it's quite a nice technique to combine with Kubeflow. From the Kubeflow side, what we really expect is this natural multi-platform execution, so that as a user you don't really have to think about where you're going to run: it's quite portable and flexible, so if you know how to use it on the CERN instance you can easily move to another cloud, and it's very nice not to have to think about Slurm and all of that, because other people take care of it. To another extent it means we get improved resource usage: we could more easily share resources with the other experiments, and we could also use resources out there that we don't really use today, from state clouds or national laboratories and things like that. On the subject of the use case itself, it really improves the hyperparameter search thanks to the AutoML algorithms, early stopping and so on, and it also exposes us to some of the developments happening in the machine learning community. It has the nice property that you can inspect and visualize things, it makes it globally simpler to install things and to serve the model later on, and, importantly for us, it's accessible to everyone in our collaboration, so it really democratizes access to these very interesting projects. And finally we want to combine this with muP, the technique which means that a wider model will always perform better and which makes it possible for us to use muTransfer, which is very useful. Just to give you some numbers: the full model, with embedding width 256, has 2.3 million parameters, and one epoch takes 40 minutes on two GPUs, mostly due to the size of the dataset; the small-width one, which has a tenth of the parameters, can do an epoch in 20 minutes on one GPU. So at equal compute we can do four tests of the hyperparameters on the small model for one test of the large one, which means we get far better coverage.
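As a sketch of that equal-compute bookkeeping and of the muTransfer recipe itself, here is a minimal outline; the two training functions are hypothetical stand-ins for the real proxy and full-width runs (which would use a muP setup like the earlier sketch), and the candidate learning rates are illustrative.

```python
# Hedged sketch of the muTransfer recipe at equal compute, using the numbers
# quoted above. train_proxy / train_full are stand-ins for real training runs.

GPU_MINUTES_FULL_EPOCH = 2 * 40    # width 256: 2 GPUs x 40 min per epoch
GPU_MINUTES_PROXY_EPOCH = 1 * 20   # width 64:  1 GPU  x 20 min per epoch
TRIALS_PER_FULL_RUN = GPU_MINUTES_FULL_EPOCH // GPU_MINUTES_PROXY_EPOCH  # -> 4

def train_proxy(max_lr: float) -> float:
    """Stand-in: train the width-64 proxy with this peak LR, return val loss."""
    raise NotImplementedError

def train_full(max_lr: float) -> float:
    """Stand-in: train the width-256 model once with the transferred peak LR."""
    raise NotImplementedError

def mu_transfer(candidate_lrs):
    # 1) cheap scan on the narrow proxy; with muP the optimum transfers in width
    best_lr = min(candidate_lrs, key=train_proxy)
    # 2) one expensive training of the full-width model with the transferred value
    return best_lr, train_full(best_lr)

if __name__ == "__main__":
    print(f"proxy trials per full-model run at equal compute: {TRIALS_PER_FULL_RUN}")
```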
I just want to conclude by reminding you that this is quite important: the sort of performance gain we get from the best hyperparameters is equivalent to the sort of performance gain we get from adopting much more complicated architecture changes or physics-infused changes. This is shown here with the very elaborate ROC curves that are typical of the field, where you have the b-jet efficiency on the x-axis and the rejection for light and c jets on the y-axis, so we really want to be higher there. I'm showing you this for the best model in green and the worst model in purple from the scan on the right, and you can see that in the regions we're interested in you really have a 20 to 30 percent gain just from this one parameter change, so this is really something important for us. I think that's pretty much it from us, thank you for your attention.

There's time for one question. Oh, there are a couple, but I think we only have time for one.

Thank you for the talk. Hi, this is Abhishek from IBM Research, and I have two questions, both based on Kubernetes. One is, how do you manage custom hardware with Kubernetes? And I think you mentioned that you use sharing techniques, so how does Kubernetes help you in sharing your custom hardware or GPUs today?

Yeah, okay, that's a very quick question. If you look at previous talks we've given, I think last year at KubeCon Amsterdam we went into a bit more detail, and if you go to kubernetes.docs.cern.ch you will find a series of blog posts about sharing GPUs. We tried and benchmarked all sorts of possibilities: anything from pure sharing of the GPUs, with no isolation between the workloads, and the issues with that; we also documented quite well how to use NVIDIA MIG and the integration with the NVIDIA GPU operator; and we've also been trying MPS as a sharing possibility. So basically all our clusters can partition all the GPU resources, either physically or logically, and this is what we expose to the Kubeflow users: you might get a full GPU, a fraction of a GPU, or a fully shared GPU depending on what you ask for and on availability. That's the main thing: if you ask for a full A100 or H100, we don't have that many, so you won't get them very fast, and people fall back to the second best. I think there was another part of the question, how we manage custom hardware. In most cases, for now, these are FPGAs, and they are just attached to the nodes and passed through in the same way we do for GPUs. All the GPU access we give is not virtualized, we are just doing PCI passthrough, in some cases partitioning it using Kubernetes, but for the nodes themselves it is just PCI passthrough.

Really cool talk. Madhuri from Elotl. Wondering, if you had to pick one key problem in the multi-cluster scheduling space that you alluded to, what would that be?

I think the scheduling itself is the main issue: there is no notion of multi-cluster in the scheduler. There are some efforts to add this to the stack, but there's little the scheduler can do right now because it doesn't know about the availability. We tried in the past to work around this: a few years ago we implemented a virtual kubelet that would basically be an abstraction of a full Kubernetes cluster behind it, so that we could advertise the resources of many clusters to the scheduler. This was picked up by some people, but it has drawbacks as well, so it is not a perfect solution. The ideal thing would be for the scheduler to somehow be able to incorporate the resource availability of multiple clusters. I don't know exactly how we can do this, but availability-based scheduling is number one, I think. So yeah, thank you, I think that's it. Okay,
thank you very much, and if you have more questions just reach out to us.