All right. Hello, everyone. We're very excited to be here. Lukas has been at previous KubeCons, but for me it's my first KubeCon and also my first time on official travel in two and a half years. So we're very happy to talk a little bit about the journey that we've been on over the last years to integrate Kubernetes into ATLAS distributed computing and help with the search for new phenomena in particle physics. I am Fernando Barredo. I'm from Spain, and I studied information technology and telecommunications in Madrid. After university I moved to Geneva, and I have spent most of my professional career working on the ATLAS experiment, covering different roles. Lately I'm running projects to integrate Kubernetes and public and private cloud resources. And that's where I know Lukas from. Okay. Hi, everybody. I hope you're enjoying KubeCon. My name is Lukas Heinrich. I'm a professor for data science and physics at the Technical University of Munich. And I'm also, like Fernando, working on the ATLAS experiment at CERN. My research focus is kind of twofold. I develop machine learning techniques and statistical data analysis techniques to actually apply to the data. But I'm also very interested and excited about using cloud computing technologies to build up the actual infrastructure that enables thousands of physicists to work with the data that we're collecting at these big scientific experiments that we have at CERN. Okay. So by way of introduction, what is CERN? You might have heard about CERN already in the keynote today. CERN is one of the major particle physics laboratories in the world. And so I, as a particle physicist, view it as a particle physics lab. But the way that CERN probably impacted your lives the most is by being the birthplace of the World Wide Web. Here on the right-hand side, you can see Tim Berners-Lee, who was working at CERN when he drafted the first proposal for the World Wide Web.
And as you know, the rest is history, and now we're all using the web every day. Okay. But in some sense for us as physicists, the invention of the World Wide Web, which was originally conceived as a way to more efficiently exchange information between scientists, was just a byproduct of our scientific activity. So here you can see CERN. It's in the nice Swiss countryside. We have many projects going on at CERN, but the biggest one is the Large Hadron Collider that you might have heard about in the news. It's a 27-kilometer-long tunnel, 100 meters underground, that goes through France and Switzerland. Here we're accelerating particle beams to almost the speed of light. We have two beams: one going clockwise, the other counterclockwise. And at four specific points, we collide the beams head-on. Then hopefully something interesting happens, like the creation of a new elementary particle. Fernando and I both work on the ATLAS experiment, which in my mind is of course the best experiment at the LHC. Okay. So here you can see a view inside the tunnel. Again, it's 100 meters underground. These big things are big, strong magnets that are able to bend the beams into a circular trajectory, even though the beams are going at almost the speed of light. And here you see what happens at one of these collision points. I showed that there are four collision points, and this is the one where the ATLAS experiment sits. So this is the ATLAS experiment. It's one of the biggest scientific machines that humankind has ever built. And what is it? It's basically a huge, giant three-dimensional camera that records what is going on during these particle collisions from all angles. We want to know what is happening during these collisions so that we can analyze it later on, offline, after we've taken the data.
And these collisions are not happening once a week or once a day or once a minute. They happen every 25 nanoseconds, and they produce a couple of megabytes of data per collision. So we have 40 million collisions every second, and there's a lot of data being accumulated. While in former times we analyzed the data by eye using photographic plates, now we of course need a very large computational infrastructure in order to manage this data in an efficient way and actually extract some physics insight out of this huge amount of data. Okay, so if you actually manage to extract some interesting science out of it, nice things can happen. Here, for example, we can discover a new elementary particle. This happened almost exactly 10 years ago: the announcement of the discovery of the Higgs boson. A couple of years ago, at the KubeCon in Barcelona, we also showed how we can use Kubernetes to rediscover the Higgs. And because of this discovery that we made at CERN, Peter Higgs was awarded a Nobel Prize a year later, because he had predicted the existence of this new special particle a couple of decades earlier. And so that was a nice success not only for the physicists and the work that they do, but also for the computational infrastructure that we build in order to run these large-scale scientific experiments. Okay, but this is not a physics conference, so I'll not talk too much about the physics. I want to talk more about the computing infrastructure and how we actually manage the data. In ATLAS, we have a two-tiered data processing pipeline, so to say. We have one very large-scale pipeline, which we call the production system. The role of this production system is to take either the raw detector output or our simulation and pre-process it into a format that is useful for the actual downstream scientists. And this is a very large-scale operation.
We have an exabyte of data, roughly. This doesn't fit in a single data center, so we need to distribute it globally across something like a million CPUs, and Fernando is going to talk a bit about that. This pre-processing stage is a fairly well-organized activity, where only a few teams are in charge of running these pre-processing campaigns. We heard a lot about batch processing at this conference, and there was a co-located event about the Kubernetes batch working group. This is a classic batch workload, where we're not super interested in when the processing happens. We're interested that it happens at some point, but it doesn't need to be now. It doesn't need to be tomorrow. It can happen maybe a week from now. So this is the very large-scale system. And then the user analysis, the data scientist view of the system, is down here. It's a couple of orders of magnitude below that, at the petabyte scale. This is where you have individual teams of data scientists and physicists that look into a specific physics question in the data. It's still large scale: globally, we still have something like 100,000 CPUs, but typically a team works on one or two individual facilities. Unlike the production system, which is a highly organized activity of a few groups, here you have a much more heterogeneous setup, with hundreds of individual teams that train machine learning models and do their data analysis, statistical analysis, pre-selection, and so on. So there's a very rich bouquet of individual things that you want to do. And ideally, you want to have the answer as soon as possible. If I, as a physicist, have an idea on how to process my data to extract some physics, I want to try out this idea and immediately get the answer.
So ideally, what we want to have is a kind of interactive data analysis experience, where you can try something even though you're still roughly at this petabyte scale. The way that we're structuring this talk is that Fernando is going to talk a bit about this first tier, the production system, and then later on we'll try to do a live demo and actually do some physics that represents roughly what we're doing in this data analysis tier at the bottom. Okay, so I'll hand over to Fernando. He'll talk to you about the production system, and then we go to the demo. Thanks a lot, Lukas. Okay, so while most people started hearing about CERN and ATLAS around 2008, which is when ATLAS started to go into production and there was more media attention, also with books like Dan Brown's Angels and Demons, these experiments are actually planned decades in advance. ATLAS was being discussed in the 90s, and the computing infrastructure was being discussed in the late 90s and the beginning of the 2000s. Back in those days, industry was simply not at the level that it is now. There was no cloud computing, there were no massive storage systems, and there were no real off-the-shelf components that ATLAS could use for the processing and storage of its data. For this reason, in 2001, the Worldwide LHC Computing Grid (WLCG) was conceived. The plan was that each university and laboratory participating in the ATLAS experiment, or in the LHC experiments in general, would contribute a share of the computing power and storage of the experiment. And the WLCG also developed all of the middleware, storage elements, and compute elements that would be doing the storage and processing. Today, the ATLAS statistics are around 165 data centers distributed across 40 countries. As you can see in this image, the center is in Switzerland, around Geneva.
And since we are in Spain: in Spain you see that there are three data centers. One is in Madrid, one is in Valencia, actually quite close to this venue, and the Tier 1 is in Barcelona. While the processing and storage are done in a distributed way, we have central services that hold all of the intelligence and the management of the data and of the workloads. The first system is Rucio. It's responsible for the data management. It knows all of the datasets and files that are managed by the experiment and knows where they are on the grid. It's also responsible for interacting with the storage systems: uploading and downloading files, scheduling transfers between them, and so on. To date, the Rucio system manages around 700 petabytes of data distributed around the grid. For the workload management part, we have another animal, which is PanDA. PanDA is talking with Rucio constantly, knows where the files are, and then schedules the computational tasks to the data, also depending on the load of each site. It's also responsible for interacting with all of the compute systems and pushing the jobs into the batch systems. You see in this diagram the evolution over roughly the last seven years and how we have been growing. To date, we are at 700,000 to 800,000 virtual CPU cores. The resources are diverse. The main component is the pledged resources, which are the traditional grid resources. But then we also have opportunistic, beyond-pledge resources provided by clouds and by HPCs, with which we have very successful collaborations. Going to a high-level diagram, you see that we have our users, and they interact with the data and workload management systems. These systems make the grid look like a unified resource and abstract away all of the distributed nature of the system.
And then for the workload management part, we have this Harvester component, which is the one that talks with all of the batch systems and resources. The typical flavor is the grid sites, where Harvester talks to the HTCondor or ARC middleware APIs to submit jobs. For HPCs, we have a lot of collaborations and use different supercomputers, and usually this has to be done on a case-by-case basis, since every HPC is different. And then the one side that we are actually focusing this presentation on is the integration with Kubernetes. This originated a couple of years ago, when we were running a project with Google, and we were thinking: how could we run ATLAS jobs on Google? The typical option was to fiddle around with virtual machines and contextualize them to join some batch queue. But instead, what we thought is that the best approach is to use Kubernetes as a native resource, because we can also run Kubernetes clusters on our own sites, not only on commercial clouds. And all of the major commercial clouds nowadays offer managed Kubernetes clusters. Here we show a little bit how we integrate Harvester and Kubernetes. The one requirement that we had is that we were not starting from zero. We needed to integrate Kubernetes into an infrastructure that is already 15 years old or so, and in a way that offers all of the services that a traditional grid site offers. But at the same time, we also didn't want to install too many things on Kubernetes and wanted to keep it as simple as possible. So while most people know Kubernetes for application management and service management, Kubernetes also has native job controllers, which allow you to run batch applications directly on Kubernetes.
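To make this concrete, here is a rough sketch of what describing a payload as a native Kubernetes Job could look like. This is illustrative only: the function, job name, image, and resource values are assumptions for the example, not the actual Harvester plugin code or ATLAS configuration.

```python
# Hypothetical sketch: build a batch/v1 Job manifest for a single pilot
# payload, the kind of object a Harvester-style plugin could submit.

def make_panda_job(name, image, cores, memory_gib, priority_class=None):
    """Return a batch/v1 Job manifest as a plain dict."""
    container = {
        "name": "pilot",
        "image": image,
        # Setting requests equal to limits caps the payload's CPU/memory,
        # so a job cannot run away with the node's resources.
        "resources": {
            "requests": {"cpu": str(cores), "memory": f"{memory_gib}Gi"},
            "limits": {"cpu": str(cores), "memory": f"{memory_gib}Gi"},
        },
    }
    spec = {
        "template": {
            "spec": {
                "containers": [container],
                "restartPolicy": "Never",
            }
        },
        # Failed payloads are handled by the workload management system,
        # so we don't let Kubernetes retry them itself.
        "backoffLimit": 0,
    }
    if priority_class:
        # e.g. give larger jobs a higher scheduling priority
        spec["template"]["spec"]["priorityClassName"] = priority_class
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": spec,
    }

job = make_panda_job("panda-12345", "atlas/pilot:latest",
                     cores=8, memory_gib=16,
                     priority_class="high-prio")
```

In a real setup, such a manifest would then be handed to the cluster through the official Kubernetes Python client, e.g. `BatchV1Api().create_namespaced_job(namespace, body)`, which is also how the job status can later be polled and cleaned up.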
And we used just the native job controllers, but if you attended the batch working group session, for example, they are talking about a lot of extensions that provide additional capabilities that we did not use for the moment. So in Harvester, we wrote plugins that use the Kubernetes Python API. They submit the jobs, they check the status of the jobs, and when something goes wrong, they clean up the Kubernetes jobs. We also extended it a little bit with common Kubernetes options. We set limits so that the jobs cannot run away with CPU or memory. We use pod affinity and anti-affinity, for example when we want the small jobs to get packed together and not spread around the cluster. And we use priority classes when we want to give higher priority to the larger jobs. Maybe the one thing that sticks out a little bit from the pure Kubernetes world is this high-energy physics file system that we need to mount, and that's CVMFS. You need to imagine CVMFS like a content delivery network: ATLAS software and all of the high-energy physics software is distributed through this content delivery network. On the nodes, we have a CVMFS client that FUSE-mounts the file system onto the node. We installed this as a DaemonSet, and then CVMFS shares this file system through volumes with all of the PanDA jobs that are running in the cluster. Following a little bit the motto of this KubeCon, I think we've also been going onward and upward. If you look at the plot, we started, this is in 2020, with a handful of resources. Each one was contributing a couple hundred cores, and we had our mini Kubernetes grid. But then we've been growing significantly over the last year. This big part here that's in blue, that's actually our first early adopter, which is the University of Victoria. The site admin in Victoria was at the beginning testing the waters a bit, seeing how it worked.
And then he was actually very happy with how it worked, and he moved all of his resources into a big Kubernetes cluster and went away from the traditional grid model. He was also very convinced by the support model that you generally have with Kubernetes, like the wide support community. The spikes that we see in the plot, that's actually when we are scaling out to the cloud, and we can do this at a very large scale, as I will show you in the next slides. So some words about the elastic cloud scale. This is actually the slide we are very proud of. It shows how, during the month of April, we were trying out different configurations for our Kubernetes cluster in Google, in europe-west1, in Belgium. At the beginning we were scaling up to 20,000 cores, then we did 40,000 cores, and we ended up with almost 100,000 cores. We have also been adapting our payloads so that we are very resilient to preemptible VMs, and lately we are running on spot VMs, so that we don't have the 24-hour time limit. In this very last scale-out, close to 100,000 cores, we managed to run with a 1% failure rate in our jobs on spot VMs, which have an inherent failure rate by nature. During this day, we managed to process 100 million events on 100,000 virtual CPUs in Google. We use fairly big nodes: we try to use 80-vCPU nodes and also some 32-vCPU nodes, so the cluster overall is around 2,000 nodes, something like that. From the Harvester point of view, it's a single Harvester instance that is also submitting all of the jobs to Victoria and to the other smaller sites. And again, all of this is operated with just fractions of our time. If we zoom in on the 30th of April, it's this plot that I have down here, and here you see the contribution from all of the different sites that are contributing processing power to ATLAS. And you see where we stood on the 30th of April.
We were the second contributor, just behind Vega, which is a EuroHPC supercomputer on the Top500 supercomputer list. So it's quite impressive, the amount of compute you can bring in with not so much person power invested. The other cool thing that we can do with all of these Kubernetes clusters is provide heterogeneous architectures to the ATLAS experiment. We are living nowadays in a golden age for computer architecture development, as described in this Communications of the ACM article by Hennessy and Patterson. Or, if you watch the NVIDIA keynotes, you see, for example, how Jensen Huang pulls the hottest GPUs out of the oven. Now, for 99% of the processing that we do in ATLAS, we just need basic x86 CPUs. But we do not live completely isolated from what's happening outside, and if we are not able to modify our software to, for example, use ARM resources or use GPUs, we are going to be missing out on a lot of opportunities in the future. One example that was successful this year was with the ATLAS software team. They wanted to build their software for ARM and needed infrastructure to validate it end to end: they wanted to simulate ATLAS events on ARM. And while there are grid sites that are interested in purchasing ARM, no one really wants to be the first one to do so. So what we did in this case: we had a grant through the University of Fresno, and the grant is on Amazon. So we set up an EKS cluster with Graviton 2 nodes. And here, just for illustration purposes, you see how the first 10,000 events ever simulated on ARM were being generated, and here they are being compared to events on x86 to see whether they align properly or not. One thing that I used for this in particular were multi-arch Docker images, which really make your life much easier.
You just need to build the image once, you say which architectures you want to support, and Docker will automatically generate the different versions of the image. Then, when a client pulls the image, the correct version is sent based on the architecture of the client. Now, until now I've been focusing mostly on batch and bulk processing. Now we are shifting gears a little bit to the interactive analysis that Lukas described. Some technologies that ATLAS and the high-energy physics community are interested in are interactive analysis facilities based, for example, on Jupyter and Dask. So on our GKE cluster we also installed Jupyter and Dask and have been offering that to the users, also offering them, for example, to start notebooks or Dask clusters using GPUs. This is how the Jupyter and Dask integration looks. Lukas will show it live, so I will not get into it. But the one thing that is cool is this plot here. This was done by Lukas: he's scaling up his Dask cluster and running the same task multiple times, each time on a larger cluster. You can see that first he ran the task with 100 workers, and it took 40 minutes. Then he ran it with 200 workers, and the time he was waiting was reduced to 20 minutes, and so on, until, running it on 1,500 workers, the task is done in five minutes. And if he likes the results, then he's done for the day, and if not, he can repeat the process interactively and really focus on his science. With this kind of system, we provide him that capability. And the cool thing is that for installing this setup, there are already Helm charts available that do most of the work for you, like DaskHub, which provides a Helm chart with Jupyter and Dask already integrated.
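Those timings are close to ideal strong scaling, where a fixed amount of work is simply divided among more workers. A quick back-of-the-envelope check of the quoted numbers:

```python
def ideal_minutes(base_workers, base_minutes, workers):
    """Ideal strong scaling: the same total work split across more workers."""
    return base_workers * base_minutes / workers

# 100 workers took 40 minutes; doubling to 200 workers ideally halves it,
# which matches the observed 20 minutes:
print(ideal_minutes(100, 40, 200))

# At 1,500 workers the ideal time is about 2.7 minutes; the observed
# ~5 minutes reflects the usual gap from scheduling overhead and the
# non-parallel fraction of the task (Amdahl's law).
print(round(ideal_minutes(100, 40, 1500), 1))
```

So even at 1,500 workers the setup stays within a small factor of perfect scaling, which is what makes this interactive "scale up, get the answer, scale down" loop practical.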
The thing that I mostly needed to figure out was the configuration that I needed to add for scalability and also cost-effectiveness. I wanted to have the critical parts in a particular node pool that is guaranteed, so that a user doesn't get disconnected because his node got slashed. But the workers, where we have thousands of them, I put into cheap preemptible VMs, and like that the cost is much better. And then one thing that I didn't work on with Lukas, but where Lukas has had different projects with other people, like REANA: it's all of these workflow engines that you can also install on Kubernetes, and like that you have all of the computing capabilities that are needed in one single Kubernetes cluster. I also think that the presentation in this room after us will actually be showing a demo or a presentation about Kubeflow used for machine learning at CERN as well. And with that, I will pass it back to Lukas. Okay, thanks, Fernando. Yeah, so we're now going to focus a bit more on this analysis side, and we're actually trying to do a live demo, so wish us luck. What we're trying to do is to recreate roughly this plot. It's kind of the history of particle physics as it goes through the decades. As we're able to build more and more powerful particle accelerators, we can move from the left-hand side, which is low energy, to the right-hand side, which is high energy, and every time we cross an energy threshold, we're able to create new elementary particles. Each of these peaks that you see is an elementary particle. On the left-hand side, you see the J/psi particle, which was a Nobel Prize in the 70s. In the middle, you see the b-quark discovery, and on the right-hand side you see the Z boson, which was a Nobel Prize in the 80s. And so these are old discoveries; they're nothing new.
But what we can show is that we can rediscover these particles live in the data that we collected at the Large Hadron Collider, because not only does it have high energy, it basically creates all of these particles along the way as well. So let's switch over to this analysis facility that Fernando mentioned. We have this Kubernetes substrate. Part of it is dedicated to batch processing at a very large scale, and another part is focused on more analysis-oriented workloads. And here we have Jupyter. And so, I hope this is going to work. Jupyter is a data science IDE that a lot of people like, so it's very much used in data science and machine learning, but also by physicists. What we want to do is basically grab some data and analyze this data in a nice way. But of course, in the particle physics context, the data is much too large to just be processed in memory inside of a single node, even if it's a large node. So we need to have a scale-out system in the background. The user interface is just kind of the front, and then we have something that scales horizontally behind it. For this we will basically take our data lake: we authenticate to the storage, and we are able to grab some data from the data lake. And here we are basically scaling a cluster. Let me see. The scrolling doesn't work. And so here we see that we scaled up the cluster while I was talking. We scaled it to 500 cores. Of course, 500 cores are not the 100,000 cores we talked about. But at the same time, remember that this is a multi-tenant system. So I as a physicist want to go to this facility and request 500 cores to do my interactive data analysis. But then there might be 100 other physicists that also each want to have their 500 cores. And then very quickly you scale up very fast.
And so here we can use a lot of the auto-scaling capabilities of Kubernetes to scale the cluster to whatever size is needed, depending on how many physicists are actually trying to do data analysis. And then, if it's a quieter period, we can scale the cluster down again in order to save costs, especially on public cloud resources. So once you have a scaled-up cluster, you can actually define your physics analysis inside of the Jupyter notebook. And so, I just have a bit of trouble with scrolling. Once you have the physics analysis defined in a Jupyter notebook, it gets distributed to all the workers. And the workers are able to do embarrassingly parallel data processing, where each worker grabs a slice of the data from the data lake. They do whatever processing the user requested, so we are trying to extract some physics information from the data. And then the results get accumulated back into this user interface, and we can visualize them. What we are doing here in particular is that we use Einstein's famous energy-mass relationship, so that we can infer the mass of the original particle by measuring the energies of the decay products. And so this... Lukas, can you show us the dashboard? Yeah, I'll show this, but I'll just need to copy this URL. So this is nice. Unlike a batch system, what we see is that we basically have a real-time view of what's happening with the data. And so here we are actually visualizing the results, and we recreated the plot that we just showed in the original slide. And we can also go to the dashboard. Let me see. And so you can see a kind of live view of what's happening inside of the dashboard and what is happening on the data processing side. This gives us a very interactive feel of what's going on.
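The per-event computation behind this is simple. Here is a toy sketch (not the actual ATLAS analysis code) of reconstructing a parent particle's mass from two measured decay products, in natural units where c = 1, so that Einstein's relation reads E² = m² + |p|²:

```python
import math

def invariant_mass(p1, p2):
    """Invariant mass of a parent particle from the four-vectors
    (E, px, py, pz) of its two decay products, in natural units (c = 1)."""
    # Sum the four-vectors of the decay products...
    E = p1[0] + p2[0]
    px = p1[1] + p2[1]
    py = p1[2] + p2[2]
    pz = p1[3] + p2[3]
    # ...then m^2 = E^2 - |p|^2 (clamped at 0 against rounding noise)
    return math.sqrt(max(E * E - (px * px + py * py + pz * pz), 0.0))

# Example: two back-to-back 45.6 GeV decay products (masses negligible)
# reconstruct to ~91.2 GeV, i.e. a peak at the Z boson mass.
m = invariant_mass((45.6, 0, 0, 45.6), (45.6, 0, 0, -45.6))
```

In the demo, each Dask worker runs this kind of computation over millions of events from its slice of the data, and the resulting masses are filled into the histogram that you see building up live.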
And so here we basically processed 60 million events in just two minutes, and we recreated all of this particle physics history live on Kubernetes, in a live demo. So I'm very happy that this worked, even though the scrolling didn't work as well as we wanted. Okay, so let's go back to our slides. So this is my summary. What we showed is basically the way that we imagine interactive data analysis for physicists to work. We have a data lake that is close by to the analysis facility, at hundreds of terabytes or even petabytes. And as a user, I log on to the system and I can scale out dynamically to however many cores I want. And then it's a multi-tenant system, where a lot of people are able to use this interface, and they can use Jupyter notebooks to do their data analysis. Okay, so with this, I'll hand it back to Fernando for the summary. Thanks for the demo, Lukas. Yeah, so just concluding. I think we've shown you that Kubernetes goes far beyond pure service management. We've been using it natively for batch processing. We were not even expecting we would get to this scale. We were thinking this would be a few thousand cores, but we managed to scale up to a hundred thousand cores in a single cluster. And I think that we might not be at the limit yet, and we are already trying to convince people to let us try a higher scale. Managed Kubernetes clusters simply work. They work great. The less you look at them, the better. They auto-heal themselves: broken nodes get repaired or swapped. So it works very nicely. Also, we can run Kubernetes on-prem or on the cloud, and it's going to be the same thing. And it's very easy to integrate what I call exotic resources, which are just the resources that we don't have a lot of at our grid sites. Besides this batch processing, we also showed this kind of next-generation interactive service that users can use at interactive analysis facilities.
And also, Kubernetes provides very high elasticity to scale up and down as the users request workers. Obviously, there are other functionalities that can be added; some of them, for example, are being looked at in the Kubernetes batch working group and so on. And well, in the next years we will see how all of this Kubernetes integration is accepted in our WLCG world. And maybe one dreams: the same way there is Zero to JupyterHub, can we have a Helm chart on GitHub that does zero to grid site, where with one commit you will have a grid site installed? We will see how far we get there. And we also need to see how, in our university settings, we will start deploying more Kubernetes clusters. So yeah, it's not perfect yet. There is still a lot of work to be done, but I think that with the fairly reasonable amount of effort that we dedicated to it, we have been reaching very promising results. And well, just to conclude, some acknowledgments to the people that have worked with us. And I should also mention that Google actually gave us the funding for Lukas's demo. So thanks a lot for that. And that's it from us.