OK, good morning, everybody. I hope you can hear me. Thanks for coming. I'm going to talk about the multi-instance-capable GPUs that we recently got into our CERN infrastructure. But before I do, a few words about me. It's my first time at an OpenInfra Summit. By education I'm actually a physicist, and I've been at CERN for a long time, but I only joined CERN IT in 2005. Since then I have worked on various IT services, like batch systems and their integration with what we call the grid. Last year I joined the CERN cloud team, which is why I'm here now: I took over responsibility for the GPUs that we have.

Here's an outline of my presentation. I will start with a few words about who we are; you probably know us already, so that will probably be nothing new for you. Then I will give a short overview of the current status of our cloud infrastructure, and finally switch to the main topic, which is GPUs in the CERN cloud: example use cases (I will say a few words about our main user, and it's pretty cool what they're doing), how we deploy the GPUs in the OpenStack infrastructure, and what we do with virtual GPUs and MIG.

OK, so CERN is the European Organization for Nuclear Research. It's the world's largest particle physics laboratory, founded in 1954, with currently 23 member states and quite a lot of associate states as well. We do fundamental research in physics, so we're a non-profit. And we run the largest machine that exists on Earth, which is literally 27 kilometers around. You can see it here: this is the town of Geneva, this is Lake Geneva, and this is where the tunnel is. It's 100 meters underground, so you won't see anything if you go there and look for it. And this here is the main site.

This is our flagship machine, the LHC, with pictures of the four main experiments that we have. It runs the most powerful magnets on Earth, at 8.3 tesla, and to achieve that they have to be cooled down to about two kelvin, compared with the roughly three kelvin of interstellar space, all around the 27 kilometers of magnets. We also don't want the beams to collide with air, so we need a very high vacuum, with about ten times fewer particles than you can find on the Moon. There's a lot more that we are doing apart from that, as you have probably heard. And if you visit CERN, you can actually see things like these: the first accelerator that existed at CERN, which you can visit now; an antimatter factory; this picture is from the computer center; and this is an old detector, which you can visit if you go down to LHCb and manage to get a visit there.

OK, so to the main topic of the talk, the cloud infrastructure and the GPUs. We have been in production since July 2013, so that's ten years now. Currently we have 8,700 physical nodes, all managed by Ironic, with 1,800 of them being hypervisors running 13,000 active VMs. The software is based on RDO. It's mainly x86_64, and since last year we have a few machines running ARM64, which are mainly used for builds and for porting experiment software to this architecture. The versions depend on the component: they range from Train to Zed, the latter being what we use for Octavia, more on that in a minute. The infrastructure is managed with Puppet and Foreman, and for secret management we have an in-house developed tool.
Right now we are running a campaign to evacuate all the hypervisors and move them from CentOS 7, which we have to phase out, to RHEL 8 or AlmaLinux. The goal is to get rid of Train, the oldest release, and to catch up with Nova to newer versions, and in particular to get rid of the private patches and backports we had to do. Here's an overview of the components we are using; the one thing I want to mention is the addition of Octavia, which is coming soon for load balancing.

OK, now let me switch to GPUs; that's why I'm here. The figure on the right-hand side is a wrap-up of the number of jobs run by the different communities on the GPU resources of the batch farm. You see one community sticking out, and these are our colleagues who run the accelerators. What they are actually doing is simulating how the particles and the beams interact with the accelerator complex. Besides that, we have of course machine learning applications, which are mainly run by the experiments; they try various things where GPUs may be helpful. And there's quite a lot of development and testing ongoing, probably everywhere.

Looking at the use cases, as I said, the beam simulation is quite interesting, and not only because they are our biggest user. In the two figures on the left-hand side you can see how much they profit from using GPUs. This is a double-logarithmic plot: in red you see, for CPUs, the computational time needed against the number of simulated particles, and it rises steeply, while for GPUs it stays flat for a long, long time. The right-hand plot is also interesting. There they compared the different GPU architectures they have been using, in particular V100s on Google Cloud and our on-premise ones, and A100s. They don't have numbers for our A100s in there yet, but what you can see is that they don't actually profit much from the newer architecture. The reason is that they depend on double precision and on the number of registers in the GPU, and in that respect there's not much difference between these two GPU types. In any case, GPUs are a real game changer for them: they can do things they were simply not able to do before. So they are a very important customer for us.

Let me turn to the different GPUs that we have. We have about 70 machines with T4s; most of them have one card, but one has four, so it's quite heterogeneous. We have about 20 machines with V100s, in this case most of them with four cards per machine. And the recent addition, which came in last autumn after a year of waiting, is 72 A100 cards in 18 machines, and these are now running in production. One thing worth mentioning is that in the main data center at CERN we only have NVIDIA cards so far. Some of the experiments run their own farms closer to the experiments, and some of those also have cards from other vendors.

Our main GPU deployment model, our workhorse, is PCI passthrough. Apart from that, we have a few cases where we agreed to deploy the resources as bare metal, and we have some use cases for vGPUs, which we have been working with and testing for a while, though they are not widely used for the time being.

PCI passthrough: here's a brief overview of how this works. You basically have your hardware; in this case I just put three GPUs into it.
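In Nova terms, by the way, a passthrough setup like this boils down to a device whitelist, an alias, and a flavor property. Here is a minimal sketch, assuming a T4 (PCI ID 10de:1eb8) and illustrative names:

```ini
# nova.conf: the whitelist goes on the GPU hypervisor,
# the alias on the controllers and computes
[pci]
passthrough_whitelist = { "vendor_id": "10de", "product_id": "1eb8" }
alias = { "vendor_id": "10de", "product_id": "1eb8", "device_type": "type-PCI", "name": "t4" }
```

```console
# a flavor that requests one T4 via the alias
# (the PciPassthroughFilter must be enabled in the scheduler)
openstack flavor set g1.t4 --property "pci_passthrough:alias"="t4:1"
```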
The hypervisor just passes these on to the different VMs, as a one-to-one mapping. So the hypervisor doesn't know anything about the GPU, meaning that we have no information on whether the GPU is used or not. That all happens on the guest side, which runs the GPU drivers. This is very nice: it's fairly easy to configure, it usually works fine, and it gives full control over the GPU to the users, which usually makes them very happy. The con is that it gives full control over the GPU to the user, meaning that we have no idea what the user is doing and whether the resources are used efficiently. Since these resources are scarce and expensive, that's not optimal for us, and there's no way for us to share the resources if they fall idle for whatever reason.

We have actually seen some issues, which is a funny story. When we tried to move from CentOS 7 to 8, we changed all our default images from BIOS to UEFI boot, and at that point we started to see issues with GPUs that have more than 32 gigabytes of memory. There's actually a fix coming up in RHEL, which is not relevant for us yet because we are still running CentOS 7 on the hypervisors, and we are still far away from RHEL 9; with RHEL 9.3 this will be fixed. Fortunately, there's a working workaround, which is to just boot the guests in BIOS mode. So if you see this issue, either go to 9.3 on the hypervisor and test the patch, or use BIOS mode.

OK, on GPU provisioning: our most important customers are actually ourselves, that is, other IT services. Namely the batch service, which is HTCondor-driven; some GPUs in interactive services; a service called SWAN, which is a front end to Python notebooks and has quite a lot of our T4s in particular; and a framework for machine learning which is based on Kubeflow. There are very few direct allocations to users, and we try to avoid them, because from these internal services we can at least ask for the monitoring information that we don't get ourselves in PCI passthrough mode, and try to optimize things that way.

And this is one example of that monitoring; it's from the batch colleagues. The upper part is mainly the T4s, the middle part is the V100s, and the lower part is the A100s. You can't read it, but that doesn't matter; there are three things you can see here. First, there are periods where, in particular, the V100s and the A100s are pretty busy, mainly when our colleagues from beams become quite active. Then there are areas with gaps, which is when we take machines out for other purposes: sometimes we have to lend them out for trainings or the like, so we drain them, put them somewhere else, and at some point they come back and you see data again. And then there are these areas with some greenish parts, which basically means that these GPUs are not fully used. That's one of the things we would like to address.

OK, virtual GPUs. That is one way which we hope can help us make better use of the GPUs, by simply sharing them across applications that don't manage to fill up a full GPU. The setup has been around for a while, based on Tesla cards and on time-sliced GPU sharing. We have only two physical GPUs dedicated to this purpose at the moment.
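By the way, the BIOS-mode workaround I mentioned for the passthrough guests is a one-line image property in Glance (the image name here is just a placeholder):

```console
# guests booted from this image use legacy BIOS instead of UEFI
openstack image set --property hw_firmware_type=bios my-guest-image
```

And on the OpenStack side, the time-sliced vGPU setup looks roughly like this. A sketch, with assumptions: the mdev type name is only an example (the types a card supports are listed under /sys/class/mdev_bus/*/mdev_supported_types on the hypervisor), and on releases before Xena the Nova option is called enabled_vgpu_types rather than enabled_mdev_types:

```ini
# nova.conf on the vGPU hypervisor: which mediated device type to expose
[devices]
enabled_vgpu_types = nvidia-222
```

```console
# a flavor that requests one virtual GPU through Placement
openstack flavor set g1.vgpu --property "resources:VGPU=1"
```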
And the way it works is that you have, again, your hardware with a physical GPU in it, and the hypervisor operating system. You run one component of the software on the hypervisor itself, which is quite neat, because it allows you to configure the GPU and to get monitoring information. On the VMs you have the second part of the drivers, the virtual GPU driver, the so-called GRID driver, which actually requires a license: it calls out to our license server to get a license token. And then each of these VMs can use the GPU.

As I already said, one of the big advantages is that this way we have access to the usage of the GPU, for sharing these expensive, scarce resources. But there are some limitations. In particular, there is a performance hit when several users start to use the GPU, because of the time slicing. We actually had people complaining about that: one user was running alone on his GPU, then somebody else sneaked in, and he got only half of the performance. Users didn't like that too much.

One way out is with the new resources: the A100s allow MIG. Basically, MIG means that you can partition the GPU into smaller ones, and this is a physical partitioning. So the idea is to create these MIG devices and export them to the VMs, taking the best of both worlds, and we expect little or no performance hit when multiple users start to use the GPU. There are some limitations, though. As soon as you switch on MIG, you lose some of the monitoring information: you can still monitor the power draw of the board, but you don't have information about the usage of the individual instances, or even of the full GPU card. Maybe this will change in the future; it's a limitation of the drivers.

So we have a prototype ready using this. At the moment we have one of our hypervisors, with four cards, dedicated to this setup. GPU monitoring is done with DCGM and the collectd plugin, which sends the data to InfluxDB and then to Grafana, and you see some example plots here, where the four GPU cards are not used very much at the moment. We had a few pilot users trying this out.

To help them set things up consistently, we created a Puppet module for centrally managed VMs. If users go this way, they just have to do one little thing in their manifest: add this little line here, include the gpu module. This will install the drivers and make sure their version is consistent with what we have on the hypervisor, it will install CUDA, and it will register the client and get a license token. So it's relatively straightforward.

There are a couple of use cases we hope to cover with this approach; the goal is of course to improve the resource usage, because not all use cases can fully exploit the GPUs. One is the interactive services, which I already mentioned, where people can log in and try whether their stuff works before they send it to the batch system for larger jobs. Another possible use case is to allow users rapid access to GPU resources on request. And the last use case I briefly mentioned in the beginning: we sometimes have people running workshops and trainings which require access to GPUs, and they just say, I need 40 GPUs next week, please give them to me. Of course, with hard-allocated resources you can't easily do that.
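To make the MIG part a bit more concrete: the partitioning itself is done with nvidia-smi on the hypervisor. A minimal sketch, assuming an A100-40GB, where profile ID 9 corresponds to the 3g.20gb profile; the actual split is just an example:

```console
# put GPU 0 into MIG mode (the card must be idle; a reset may be needed)
nvidia-smi -i 0 -mig 1

# list the GPU instance profiles the card supports
nvidia-smi mig -lgip

# carve GPU 0 into two 3g.20gb instances, including compute instances (-C)
nvidia-smi mig -i 0 -cgi 9,9 -C
```

The resulting MIG devices are what we then expose to the VMs. And as I said, once MIG is on, the per-GPU utilization is gone from the monitoring; board power is about the only thing left, for example DCGM field 155 via `dcgmi dmon -e 155 -d 1000`.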
So this may be a way out, and a way to simplify such requests. However, we have seen some stability issues, which is why we haven't put it into production yet. Typical errors were synchronization errors with Placement, which may be related either to the backports we had to do in Nova, or may be driver-related. Sometimes the GPU is simply not found for some reason, and the only way I have found so far to fix this is to reboot the hypervisor, which is of course not a great way forward. So this is something we have to investigate before we can continue.

The next step would then be proper quota and scheduling for different vGPU flavors, and for this we are also looking closely at what Cyborg can do for us in the future. We will start small and see how it goes, and then we will decide whether we continue with this or stick with PCI passthrough.

So the plans are to investigate the remaining issues I just mentioned; to go on with the Nova upgrade, because we believe this lets us eliminate the local patches and backports, which are also related to the GPUs, and hopefully improves the stability of the whole thing; and to have a look at Cyborg to support different GPU flavors. We are also looking into ways to do quotas for GPUs and for alternative architectures, meaning our ARM nodes, preferably in a way that is consistent between the two. So, lots of work to be done.

Coming to the conclusions. We are running a whole zoo of different GPUs, and we have a whole zoo of different use cases, one of them sticking out, of course, though that may change in the future. Our workhorse is PCI passthrough mode. We have a proof of concept for A100s with MIG on the hypervisor, exporting the GPU instances to individual virtual machines. And the top priority for the whole team at the moment is to phase out CentOS 7 and go to RHEL 8. This is a picture of the combined Linux and cloud teams. And that's the end of the talk. Thank you. Any questions? Yes?

Why? Because we try to avoid giving out bare metal if we can, because we lose a bit of control over the resources that way, even if the nodes are still in Ironic. It's a policy that we have. Yes, resource management, exactly.

Just a quick question. There is a limitation that the version of the driver on the host and the version of the driver on the guest have to be the same, or at least on the same branch. How do you plan to deal with users who need another version, or another version of CUDA, which requires another driver version on the host? Do you plan to have different hosts with different driver versions?

For the time being, we have this module that I showed you; let me just go back to the slide, this one. This is exactly why it pins the version: this way we have control over which driver version the users are running on the VM. If users decide not to use it, for example because they don't want a Puppet-managed machine, they are basically on their own; we just have to tell them, you need to use this version. But then an upgrade is of course difficult, because that means a lot of manual work. So the recommendation is to use the module. There's actually a parallel talk on this topic by two colleagues of mine, unfortunately in the same time slot, but I can redirect you.
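Just to make that answer a bit more concrete, here is roughly what such a module can look like on the guest side. This is a schematic sketch with illustrative class and package names; the real module is CERN-internal:

```puppet
# Hypothetical 'gpu' class for centrally managed VMs: pins the guest
# driver to the branch running on the hypervisor, installs CUDA, and
# would also take care of the license registration.
class gpu (
  String $driver_version = '510.85.02',  # must match the hypervisor driver branch
) {
  package { 'nvidia-vgpu-guest-driver':
    ensure => $driver_version,
  }

  package { 'cuda':
    ensure => installed,
  }

  # license registration (a gridd.conf or a client configuration token,
  # depending on the driver generation) would be deployed here
}
```

Users then only put `include gpu` into their node manifest.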
For users, I thought the difference between MIG and vGPU is that if you slice the card in half, MIG is statically partitioned, so the user is only ever going to get half the GPU?

Yes. And the thing is that time sharing is hard to explain to users, so the user experience with time sharing is worse. I agree with your comment: it's true that if nobody else is there, they get the full GPU with time sharing. But then they usually don't understand why all of a sudden the performance drops, and they ask why that is. So this adds a lot of support load, which is why we want to try MIG and see if it works better for the users. If they know what they have, they can just measure it and estimate how long a job will take, and these kinds of things; with time sharing, when other people sneak in, that doesn't work anymore. I think there was somebody else? No? Yep.

The licensing is the NVIDIA GRID licensing, I'm assuming. Is there anything upcoming that is actually open source, or that uses AMD cards or the new Intel ones, that you're aware of and that might be worth waiting for?

For the vGPUs, not that I know of; to the best of my knowledge there is nothing coming up where they would open this. Maybe get in contact with them as well. One more?

Thanks for the talk. I would like to know, if you have tried it: is there any limitation on the number of MIG devices that you can add to a VM? Because I did some tests a while ago, and there was a limit of one, I think.

With time sharing, or with MIG?

I think I tried both; adding multiple vGPU devices to one VM didn't work.

Yes, as far as I know, at the moment it doesn't work: you can only have one virtual GPU per VM. Let's hope that changes in the future as well. Any more questions?

That's not in our hands. We just deliver the infrastructure and the GPUs, so what the users do with them in the end is up to them. These beams people can now do things they were not able to do before, because it would have taken months of computing time, and they have plans to extend that to other simulations as well, to spread it out, so they have increasing requirements for GPU resources. For the experiments it's actually similar, but I'm not aware of concrete requests at the moment. So this is nothing that we are driving; the users are driving it, and they will tell us how many resources they need in the future, and for which purpose. I hope that answers the question. OK, thank you. OK, thank you.