Hello everyone, thank you for being here to listen to our presentation. My name is Dan, I'm here with my colleague Daniel, we both work at CERN, and today we are going to share a use case of a machine learning application in high energy physics and show how we can scale such workloads with Kubeflow. A bit of introduction about CERN. CERN is a particle physics research organization located in Switzerland. Its mission is to expand our knowledge of the universe: how subatomic particles interact with each other and what their properties are. It operates the largest particle physics lab in the world, and it is an international collaboration of over 17,000 part-time and full-time employees of over 110 nationalities. CERN studies subatomic particles with the help of the Large Hadron Collider. The Large Hadron Collider is the largest particle accelerator in the world and one of the most massive machines ever built. It is a 27-kilometer ring of superconducting magnets located 100 meters underground at the Swiss-French border. It works by accelerating beams of protons from opposite directions until they approach the speed of light, and then these beams collide at four collision points called detectors or experiments, and this is where all the data is gathered and analyzed. Here we can see the Globe, the exhibition building at CERN. Here is an illustration of the LHC magnets, and here we can see the CMS experiment and the magnitude of the experiment. Now a little bit about machine learning and Kubeflow at CERN. The LHC produces a lot of data: there are 40 million collisions happening every second in the LHC, and even after filtering that translates to around 90 petabytes of data produced by all experiments per year. So there is a potential for machine learning in different stages.
We can apply machine learning in data acquisition or later in the offline analysis, and we can utilize machine learning algorithms to find the patterns of interactions between particles. In this slide I will cover a couple of examples of machine learning applications at CERN, and later Daniel will cover one specific use case in more detail. In data acquisition at the CMS experiment we have so-called trigger mechanisms. These mechanisms are extremely important because they select which collisions are interesting to store and process further. It is impossible to store all 40 million collisions per second; the computing infrastructure just doesn't support that. So we have to make a selection of which collisions are interesting to store and further process. The trigger mechanisms do that, and different machine learning algorithms are deployed in the trigger mechanisms, such as boosted decision trees. But in recent years there has been an ongoing research effort to apply deep learning methods to the trigger mechanisms as well, and there are different challenges there: first, to train high-performance models, and second, to run these models for inference on FPGAs. So that's one example. Another one is in particle tracking and reconstruction. With the LHC we are observing short-lived particles that otherwise cannot be seen in nature, and they are analyzed by monitoring energy depositions in the calorimeters. So with machine learning algorithms we are doing particle tracking and we are reconstructing the events of a single collision. There is also an application of machine learning in simulations, as a faster alternative to Monte Carlo simulations. And in addition to physics use cases, we have a lot of data coming from the facilities themselves, such as the LHC, the experiments, and even our cloud computing infrastructure. So we use machine learning algorithms there to detect anomalies and to proactively monitor our systems to keep them at peak performance.
Now, there are different challenges of machine learning development at CERN. One of them is that there are multiple groups working on different projects, and mostly they work with local infrastructure. That means a couple of local GPUs, very often one or two, and that requires researchers to set up everything from scratch. They have to install GPUs, NVIDIA drivers, CUDA and Python environments, and then share all of that across all users in the group. And then, if they want to scale their workloads to more GPUs, they have to move to another platform. So that's a lot of overhead for researchers, and that was the main motivation to deploy a centralized machine learning platform using Kubeflow. With this platform we are trying to reduce this maintenance work. We are providing access to GPUs on demand, and also scaling capabilities both on our local instance and out to the public cloud. Our instance has been in production since 2021. It is an on-premise cluster using OpenStack. We have integrations with many different CERN services, such as the hardware registry, single sign-on, Manila CSI, and our network file system called EOS. And we do our GitOps with Argo CD so that we can easily deploy our instances to different public cloud providers. In our previous talks we discussed our admin workflows, how we manage our instance, and its technical details in more depth, so if you're interested in that you can have a look at our previous talks. Today we will cover a specific use case from a user perspective to demonstrate how Kubeflow is used in high-energy physics. And Daniel will take it from here. Thank you, Dan. So I'll present a use case from high-energy physics that we run on Kubeflow. What I've studied is called jet energy corrections. We get so-called jets coming out of the detector. Initially, protons collide at the Large Hadron Collider and release color-charged partons such as quarks and gluons.
And they cannot exist freely in nature due to a phenomenon called color confinement. So instead they hadronize and create color-neutral particles, which we can detect. And these are then clustered into a single object called a jet. The energy of this jet is pretty important for further physics analysis. However, the detector is far from perfect, so we need to calibrate this energy, and what I've done is used machine learning for this calibration. The method currently used in production is a non-machine-learning method, so we have a baseline that we can compare to and see whether machine learning can help with correcting the energy. Now, this requires a bit of a special kind of machine learning. These jets consist of unordered sets of numbers, basically, so you can't really apply a basic feed-forward neural network. Instead we can borrow from computer vision, because you can create a so-called particle cloud, which is analogous to a point cloud in computer vision. You use the detector coordinates of the particles, and that way you can create a graph of these particles and use, for example, a graph neural network. So here is an illustration of what I mean. You have a bunch of particles, and this is a set of particles with some unknown length; it can be anywhere from 0 to 64 particles in a jet, basically. And you want to apply some machine learning method to these particles, or to these feature vectors, and map that towards a target. In this case it's an energy value for the whole jet, so it's a single-value target. Furthermore, this also has to obey order invariance: whatever order the particles are given to the machine learning model as input, it should map them to the same target anyway. What I've used is two models that are currently used in computer vision as well. One of them is Deep Sets, and it has been adopted in the high energy physics community as the particle flow network. It takes as input these particles.
You see them at the top over here: the charged and neutral particles, and then the secondary vertices. These are all constituents of the jets. And what you do is apply a feed-forward neural network to every member of the set, i.e. every particle vector. And then you aggregate them using some sort of order-invariant pooling, such as summing or max pooling or mean pooling or something like this. Then you can take that output, which is now representative of the whole jet, and map it towards the target energy. If you want an even more complex network, you then create a graph. So here you see the KNN block over here. You represent the particles in the detector coordinates, then you take the k nearest neighbors of every particle, connect them with an edge, and then you have a graph. And you can then use that to get some sort of locality information about the particles, which provides more information and will potentially improve the results over the purely set-based method. So here's where Kubeflow comes in. We have our two models, and now we want to run them on our internal resources at CERN. In this case we created a pipeline where we begin by running an AutoML experiment to tune hyperparameters for the neural network. We then export the best model of them all, store it on S3, and serve it using KServe. The training set was 10 gigabytes; we stored it on S3 as well. The target here is to minimize some loss, and we tuned the hyperparameters using random search. It's one of the more basic hyperparameter tuning algorithms, but apparently also one of the best: just picking random input parameters and doing that over and over again will eventually find you the optimal model.
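The Deep Sets / particle flow network idea Daniel describes can be sketched in a few lines. This is a toy illustration, not the trained model: `phi` and the final jet-level mapping use made-up fixed weights standing in for the learned networks, and the (pt, eta, phi) feature values are invented.

```python
import random

# Toy per-particle feature vectors (pt, eta, phi) for one jet.
# A jet is an unordered, variable-length set of such vectors.
jet = [(35.2, 0.1, 1.2), (12.7, -0.3, 1.1), (8.4, 0.2, 0.9), (3.1, 0.0, 1.4)]

def phi(particle):
    """Toy stand-in for the shared per-particle network: maps each
    particle vector to a small latent vector (fixed weights here)."""
    pt, eta, ph = particle
    return [0.5 * pt + eta, pt - 2.0 * ph, eta * ph]

def deep_sets_predict(particles):
    """Deep Sets: apply phi to every particle, sum-pool over the set,
    then map the pooled vector to a single jet-level output."""
    latents = [phi(p) for p in particles]
    pooled = [sum(dim) for dim in zip(*latents)]  # order-invariant sum pooling
    # Toy stand-in for rho, the network mapping pooled features -> energy.
    return 1.0 * pooled[0] + 0.1 * pooled[1] - 0.2 * pooled[2]

pred = deep_sets_predict(jet)

# Because sum pooling is order invariant, shuffling the particles
# leaves the prediction unchanged.
shuffled = jet[:]
random.shuffle(shuffled)
assert abs(deep_sets_predict(shuffled) - pred) < 1e-9
```

The same structure works for any set size, which is what makes it suitable for jets with a variable number of constituents.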
So yeah, regarding scalability: using the PyTorchJob operator we can use multiple nodes, and within the PyTorch code we can also use multiple CPUs to read data in parallel and train on multiple GPUs in parallel. And you can also run many Katib trials in parallel. So this is really scalable in many different ways. We monitor the training while it happens using TensorBoard, and this is an example of that. For inference, we export the best PyTorch model to ONNX, which is a universal format that multiple different deep learning frameworks can export to. Then we can serve this using NVIDIA Triton, and once it's served we can use the Python client, or whatever client we want, to request inference predictions and statistics. We also did some analysis of how fast it is to request inference depending on the batch size that you request at once; this way you can speed up the inference a bit. So let's see if a live demo goes well. I'll try my best. What I'll show you is our Kubeflow interface. Here's the pipeline that we built that I showed earlier, which we can inspect. By the way, is the font size good enough? Yeah? It's not. Okay. All right. So here's the initial component, the hyperparameter tuning experiment. We see the Katib component to do this. You see it takes a few input parameters, and we have some output. Among the inputs are, for example, the training data and test data, et cetera, stored on S3. We can specify the number of CPUs and GPUs and how many workers we want for training. We also did some experimentation with the visualization. As you can see, you can use markdown or tabular data, which you get in the UI of Kubeflow. So this is all pretty neat. These are the optimal parameters that the hyperparameter tuning found. Furthermore, we then export this; this is a separate component, and it also has some parameters. And then we finally serve this model.
The served model is then in ONNX format, stored on S3. And then we go on to the experiment itself. So this is the user interface for the experiments, and this is what it looks like when it has finished. We can see all the trials and highlight them as we wish, and then we can inspect and compare which parameters are good and how different parameters affect the loss that we want to minimize. So this is a pretty convenient way of getting an overview of what works and what doesn't. All right. So while this trains, and afterwards, you can connect TensorBoard to the S3 bucket that the model logs to. Here we can see the TensorBoard. You can have different types of metrics that you log. So here, for example... oh, this covers it. Anyway, you can then follow along the lines and see how they measure against each other. And then we finally get to the model that we served. We can see here that the predictor is the Triton one. We had to change the runtime a bit to fit our needs, because the built-in one wasn't new enough; that's one of the issues that we came across. Then let's, for example, check the ParticleNet one. We get here an overview of the model that's served, and now we have an internal URL, which I'll copy here. We can then request inference, or predictions, from this model that we have served. So I'll just go ahead and paste it, and hope that I have an internet connection and get some predictions. First I load some jets with some features, and I start a Triton client against this specific URL that I copied. Then I try to request some predictions. So here we see the input jet; I plot it in detector coordinates, and above the jet you see what ParticleNet predicts for the energy, or transverse momentum value, versus the true one. So it seems to be working, I think, yeah. And you can also note that, for example, if there's a big bubble, the predicted pT is also bigger.
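Besides the dedicated Triton Python client used in the demo, predictions can also be requested over Triton's KServe v2 REST protocol with a plain HTTP POST. A minimal sketch of building such a request; the URL, model name, and input tensor name below are hypothetical placeholders, not the actual values from the demo:

```python
import json

# Hypothetical internal KServe URL copied from the Kubeflow UI, and the
# tensor name the model was exported with -- both are assumptions.
URL = "http://particlenet.example.svc.cluster.local"
MODEL = "particlenet"

def build_infer_request(jet_particles):
    """Build a KServe v2 inference-protocol request body for one jet.

    jet_particles: list of per-particle feature lists, all the same length.
    """
    flat = [x for particle in jet_particles for x in particle]
    return {
        "inputs": [{
            "name": "particles",
            "shape": [1, len(jet_particles), len(jet_particles[0])],
            "datatype": "FP32",
            "data": flat,  # row-major flattening of the tensor
        }]
    }

# Two toy particles with (pt, eta, phi) features.
jet = [[35.2, 0.1, 1.2], [12.7, -0.3, 1.1]]
body = json.dumps(build_infer_request(jet))

# Actually sending it requires a live endpoint, e.g. with the `requests`
# package:
#   resp = requests.post(f"{URL}/v2/models/{MODEL}/infer", data=body)
#   energy = resp.json()["outputs"][0]["data"][0]
```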
So this is a plot of the constituents, the particles that make up the jet. If a marker is a bit bigger, then it has a higher transverse momentum, so yeah, that relates to the predicted value. And that was it for the demo. In conclusion, this is the physics result that we get. We just requested predictions for one specific jet at a time here, but these are the final results that we got. We get a 10% improvement in the resolution, in other words, in how accurate the energy predictions are. We compare that to the baseline, shown here as the blue markers: we take the ratio towards the baseline, and it's a 10% improvement, which is very neat. And then on the other plot here, we see a comparison between jet flavors. These are up, down, strange, charm and bottom quarks, and then gluon jets as well. Ideally, the energy response should be similar even though the jet flavors are different; the response should not depend on the flavor. Here machine learning also improves the flavor dependence of the energy corrections by a factor of three against the baseline. Both models work very well. ParticleNet, which was the one that used graph machine learning, might be a little bit better compared to the one that used purely set-based corrections. All right. Thank you, Daniel, for this nice demo. Here I would like to mention a couple of challenges that we had while implementing this demo. The first one was finding a correct version of the Triton server image for the inference servers. The default one with Kubeflow 1.4 didn't work, so we used the latest one, and then everything worked as expected. Then we had to do some customizations to TensorBoard: we customized the TensorBoard controller to automatically pick up the secret for S3 and mount it into the TensorBoard deployment, so that we don't have to do kubectl edit.
We did it downstream in a way that fits this use case, but maybe it could be generalized in some way. Now, some advice for new Kubeflow users. If you're using Katib, then the best option for a metrics collector is the file metrics collector, because if you're using the stdout metrics collector and you have a lot of logs that are not always under your control, that may not work; the metrics collector may pick up something that is not actually a metric. And maybe the Katib UI could offer some better visualizations for multiple metrics, but in principle those are all minor things. We didn't have any major blockers in making this demo work, which was quite nice. So, to conclude: if we go back to CERN's mission to expand our knowledge about the universe and do fundamental research, we see that machine learning has its place there, that it has its applications, and that it can help us. We have seen in this example of jet energy regression that we get a 10% improvement, which is great, and there are many other applications like this. That's why we need to work on machine learning infrastructure, to make sure that it can support all these use cases. With Kubeflow we can facilitate the scalability of such workloads: going beyond Jupyter notebooks to run pipelines, Katib jobs, and model serving, so it really allows a lot of flexibility there. The mutual integration of components is great, in the sense that from pipelines we can submit Katib jobs, distributed training, and model serving, and it all works pretty nicely. And we offer customizable and reproducible environments for our scientific users at CERN. Kubeflow has been very well accepted, and here I would like to thank the Kubeflow community for all the great work. We hope this was useful for everyone. Thank you very much for your attention. If you have any questions, let us know.
No, no, we used the latest version of Triton, the official one; we didn't do any customizations there. To repeat the question: which image did we use for serving, and did we customize it or use one of the official ones? We used one of the official ones, the latest Triton server image. That seems to be it. Thank you.