And we are here to talk about the work that CERN and SKA are doing on preemptible instances and bare metal containers for HPC. So this, I think, is the biggest slide that we can have: the entire universe in one slide. You can see, on your left, from the Big Bang until today, 13.7 billion years represented in one slide. CERN's and SKA's mission, basically, is to do fundamental research, to try to understand the universe, to try to understand this slide. Easy, easy. So at CERN, what we do is try to recreate the conditions of the Big Bang, that little dot on your left, colliding particles to try to understand what matter is made of. At SKA, with their big radio telescopes, what they will try to do is to understand the evolution of the universe, to try to observe from just after the Big Bang until today. So at CERN, we have the LHC program. The LHC is the largest particle accelerator in the world. It's a 27-kilometer ring that crosses the border between France and Switzerland, and it is 100 meters underground. Inside the LHC, there are two beams of particles that are accelerated very close to the speed of light, and they collide together in these big detectors. Basically, these detectors are digital cameras. However, unlike your digital camera, they take 40 million pictures per second. This puts us at up to one petabyte of raw data per second. Of course, we cannot store everything at this point. We filter the data; we only store a few gigabytes of data per second. However, in 2024, we are planning to upgrade all these machines, in what we call the High-Luminosity LHC. With these upgrades, the resolution and the number of events will increase dramatically. As a result, the amount of data that is produced and needs to be stored will increase several times over. During the same timeline, the SKA project will also start producing results and will need to start storing data. This is an unprecedented amount of data that needs to be stored.
So CERN and SKA started to collaborate around one year ago, because we have very similar problems in terms of the amount of data that needs to be stored and analyzed. So we are looking into our infrastructures to see what we can optimize, how we can maximize the utilization of our resources. In these graphs, what you can see are the preliminary studies from one of our experiments, ATLAS, about the amount of data that needs to be stored and the amount of computational resources required to analyze that data over the next years. So we are in 2018. In the first graph, what you see is the amount of storage required, the number of petabytes that this experiment requires today. And I would like to stress that this is only one experiment; the LHC program has four main experiments, so all of this is multiplied more or less by four. So in 2018, today, this year, you can see the amount of data that this experiment stores in the CERN data center, and you see the evolution. That line is what we can buy with a flat budget over the years. The blue dots that you see there are what is expected, what is required. So in terms of storage, you see 2024 is when we start the upgrade to the High-Luminosity LHC, and 2026 is actually when the machines really start. You see that the amount of data that needs to be stored increases several times over. The same goes for computing, in the graph on your right. So we need to focus on how we can optimize our infrastructure. How can we get all the computational power from our current infrastructure in order to have the capacity to analyze all of this data? The SKA faces exactly the same thing. The SKA program consists of two instruments, SKA low frequency and SKA mid frequency. The low-frequency instrument consists of thousands and thousands of these Christmas-tree-like antennas. They will be spread across around 500 sites in Australia, and they will produce more than 150 terabytes of data per second.
The mid-frequency instrument consists of about 200 dishes. It will be located in South Africa, and it will produce more than 2 terabytes of data per second. The SKA will store all this raw data. So you see, these are unprecedented amounts of data that we need to store and then analyze. We have been presenting this collaboration work over the last few OpenStack Summits. This started with the idea, the purpose of this collaboration. Then, at the last summit, we presented some of the work that we are doing and what the actual focus is. And today is a demo day. So we are going to show you two demos about the work that we are doing on preemptible instances and containers on bare metal. So bear with us. Let's start with preemptible instances. Hello, I am Thodoris, from the cloud infrastructure team at CERN. So let's start with this: how do we maximize the resource utilization of our infrastructure? The answer to this could be by providing preemptible instances. They are, in essence, servers that are created using the idle resources in your system, and they are terminated as soon as those resources are needed by higher-priority tasks or paying users, for example. By providing preemptible instances, operators can handle the demand for extra resources just by increasing the cloud utilization. Unfortunately, right now OpenStack does not provide this functionality. So we started working on prototyping an orchestrator for these preemptible instances. We call it Aardvark. So let me show you how it works. As soon as Nova understands that there are no resources available for a requested server, it sets it into a pending state. It notifies Aardvark, which in turn terminates some preemptible instances to free up the space, and then rebuilds the server that was set into the pending state. We can try to show it in action in CERN's cloud. So for this scenario, we show that we have dedicated some resources to some service VMs, as you can see here, higher-priority VMs.
Here we can see the memory that is used right now, and we can see that there are lots of idle resources. What would happen if we could fill this space with preemptible instances? Let's see. This is live from the CERN cloud. We'll try it live. Please bear with me. Everything is building, scheduling, networking, that's good. So if we get back to Grafana, we have to wait a bit more. Sorry. It's almost going. Almost there. Awesome. Over here. So, as you can see, we used those idle resources to spawn preemptible VMs, and we are now at more than 90% of our infrastructure. But what happens if a higher-priority user needs those resources back? Let's see. So, if I show this, it ended up in the pending state, as expected, and we should see Aardvark trying to free up the space. Let's give it another go. Sorry, I have to change my script quickly here. There it goes. It decided to delete two preemptibles. And if we go back to Grafana, we can see: nice, from 24 preemptibles, we ended up with 22. At the same time, we can see that another non-preemptible has spawned. And the cool thing here is that we maintain the memory usage in our system. I will try to spawn another VM as well, an even bigger VM, because the first one went so well. Again, if I show it, it should be in the pending state. Lots of servers. It deleted at least four, yeah. So, you can see that from 22, we're back to 18. You can see that we also have the new non-preemptible, the service VM we are running, and at the same time, we maintain the level of memory usage. We have also implemented some admin CLIs where, for example, you can list those reaper actions, and you can also show them to see what happened and why. So, as you can see here, this was a successful run. The instance was this one, and the victim instances, the instances that were killed in order to make space, were all of these. So, going back to the presentation.
So, in order to make it work with Nova, we need two changes that we have already proposed the specs for: we need to add this pending state for the VMs, and also to allow the rebuilding of those VMs in this state. This is also the repo where we have the code for Aardvark, for this service. We will try to have it upstream. And that's it for me. Thank you. So, that was pretty awesome. Hello, everyone. My name is John Garbutt. I work with StackHPC. And just before I start, I wanted to be clear: I'm going to be talking about the SKA work, but we're actually a subcontractor for Cambridge University working on the SKA project, rather than talking on behalf of the SKA project. So, I just want to get that out there. I'm hoping to also do a live demo. But before I get there, I'm talking about containers on bare metal, and I just wanted to give a bit of context. So, to start with, the problem: what's the actual problem that we're trying to solve? If we go back to the SKA, I'm going to talk about a particular component called the SDP, the science data processor. Now, Belmiro introduced the fact that there are going to be two telescopes, one low and one mid, with slightly different data rates. All the data coming from these will go into a supercomputer that's local to the site. So, there will be two science data processors, both the same design with slightly different requirements: one next to SKA-low, one next to SKA-mid. So, what does this thing actually do? I've tried to represent that in the most minimalist way possible. I've probably overachieved on the minimalism, but let's go through it. So, the first point is the data comes in. The data in this case is actually being generated, at least for one of the telescopes, from the voltages: the voltages from the telescopes get converted into UDP packets, and they come in on fibers onto an Ethernet network. At least that's the current design.
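Data rates like the ones quoted earlier translate into buffer sizes very quickly, which is why the storage sizing depends so much on the observation. Here is some back-of-envelope arithmetic; the ingest rate and observation duration are invented for illustration, not real SDP figures:

```python
# Back-of-envelope buffer sizing for an ingest step like the one described.
# The rate and duration below are illustrative; real SDP ingest rates
# depend entirely on the observation configuration.
def buffer_size_pb(rate_tb_per_s: float, duration_hours: float) -> float:
    """Storage needed to hold one observation's ingested data, in petabytes."""
    seconds = duration_hours * 3600
    return rate_tb_per_s * seconds / 1000.0  # 1000 TB = 1 PB

# A hypothetical 0.5 TB/s ingest rate sustained over a 6-hour observation:
print(round(buffer_size_pb(0.5, 6), 1))  # -> 10.8 (PB)
```

Even a modest fraction of the raw telescope output, held for a few hours, lands in the petabyte range, which is why the reduce step that follows is essential.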
And that gets spread over all the machines. The machines get these UDP packets, SPEAD packets, and they need to do something with them. So, what happens is we have this ingest process. This is the second stage, a real-time processing step. The basic idea is you take the SPEAD packets, you turn them into something we can actually process, and that gets written into storage. Depending on which bit of the sky you're looking at, how long you're looking at it, and a whole load of parameters that I don't completely understand, there will be a different data rate coming in. So, the size of storage that's required will change depending on the observation. In a similar way, the next step is that we need to reduce this data down to, you know, just the terabytes, so we can store them. So, the reduce step is a batch process. Again, the actual size of that batch process will depend on the kind of observation, exactly what's happening. And over time, the system is designed for average load, so depending on what's happened, you'll catch up with the big observation you did earlier. There's a sort of mix of this real-time and batch processing happening. And eventually, the idea is that the data that gets produced, these artifacts, will get sent out to regional science centers across the globe, much in the same way CERN pushes things out, and that's the delivery phase. The delivery phase is kind of ongoing, so there is long-term storage here that keeps the artifacts so you can request them in the future. So, that's the kind of flow, the kind of flows that we're having to deal with. So, how do we solve this? Well, it's an OpenStack conference, so I say: OpenStack. But there's a bigger story, and I'm trying to represent that story with this diagram. Let's first look at the different colors in the diagram. These are the different people involved and the different components that they're thinking about. So, at the very top of the diagram, we have a nice purple color.
Hopefully, that's visible for people. So, these things at the top... wow, this screen is so big, this makes no sense whatsoever. I'll put that down. So, the purple bits across the top, these are the science workflows. These are the things that the scientists intimately understand, and really, the whole job is to make sure that they can get science done as quickly and as easily as possible, and think about the science, not about all the underlying bits. The other piece here is that you can see that some of these purple workflows I've put as going directly onto OpenStack, or directly onto different pieces in this stack; at least that's what it was trying to represent. The basic idea is that as scientists come up with new and different workflows, new ideas, as new hardware comes in and different things happen, we can share those innovations to make it easier for the next person to do something similar. That's a really good analogy for what's been happening with the SKA and CERN relationship here. We've been talking about all the common challenges, and much like in many of the forum sessions that you hear, there are lots of people going, oh, I've got this problem. Oh, yeah, so have I, and I've got this thing, and I did this. And through this conversation, we start to build up this set of shared tooling, and that's what I'm hoping to demonstrate. So, another piece: I wanted to show how we can expand this. I'm a developer at heart, and the previous diagram wasn't quite complicated enough, so I thought, a bigger diagram, more boxes. So we've introduced another color here, the red color. I've put here the commercial public clouds, be these OpenStack or non-OpenStack things. Some workloads may make sense to push out, and there will be various different clouds all joining together to do this kind of thing. This is future work, but this is the way the vision is building.
How do you build things that work with these different things? And it's kind of interesting: there's a box in the middle labeled Kubernetes that sort of straddles the two. So, before I get on to the demo, I just wanted to quickly describe the environment the demo will hopefully run on. In the bottom left corner, we've got a SoftIron Ceph cluster. In the demo, this is going to be holding most of the storage. This particular system is called P3, which is the Performance Prototype Platform, I think that's the correct order, and it's "à la SKA", hence P3-ALaSKA. Not that we sit there trying to make up cool names for everything, like Aardvark. Anyway, what's happening here is that it's basically been built as a bare metal cloud. The rhombus-y shaped things in the middle represent the different available compute nodes. There's a selection of different compute nodes, but what we're doing here is using OpenStack Ironic to manage all of the bare metal provisioning. This is being accessed through OpenStack Nova, using the Ironic virt driver to reach Ironic. In the demo we're looking at, we're going to go a little bit further up this stack: we're actually using OpenStack Magnum to deploy the Kubernetes cluster on the bare metal that's controlled by Nova, that's controlled by Ironic, and we've got Kubernetes on top of that. So that's the stack that we're going to be looking at. This is based on the Queens release. The Queens release made a big step forward in making it easy to do bare metal and VMs in the same kind of way using this OpenStack Magnum stack. Well, if someone was playing buzzword bingo and had "stack", you may have just won. But let's go on to the scary bit. Okay, so a piece I probably should have mentioned is that this particular P3 environment is behind an SSH gateway, so there are enough SSH tunnels involved in this demo that I'm using my laptop. But it is real.
It is running live. Or at least I hope it's still running live. So, when I showed you the diagram of OpenStack and how we use it in this stack, one of the things that could be useful within the system is having a Slurm system in place. So here I'm actually on the login node for Slurm. Just to show you, this is a Slurm thing here. We'll get to the cool containers bit... eventually, I mean. So, just to prove there's no cheating going on, this is a Ceph mount. My current working directory is my home directory, and it's got lots of truly, truly interesting things in here. So if we go into the Berlin demo directory, we can do lots of cool command line things, like say hello to Berlin, and see what my secret propaganda messages are: "I love preemptibles." So that's all good. This is all shared on a parallel file system. So next, we'll show we can access the same things over here in a nice web GUI. "I love preemptibles. I really, really love preemptibles." Just to get the message across. And, woo, just to prove I'm not cheating. So, what's actually running here? This is an interesting question, and as with many interesting questions, we can ask Horizon for the answer. So, I was talking about Magnum. We've got several Magnum clusters running here, actually; there's some prototyping work going on in the other clusters. This is the demo one. So if you dig into the demo cluster, you can see its name here. The only real reason to show you the name is so that I can then show you this is the instance that's actually running. I'm logged in as admin here, so I see the whole world, which is nice. If you go searching in the Ironic dashboard, you can see that there is an Ironic node that has this instance on it. You can watch the video back to prove I'm not cheating with instance UUIDs, but, you know, that's up to you. And just to try and make it real, you can see it actually lists an IPMI address for the machine this is running on.
I would have created a new cluster, but BIOSes take a long time to boot. That is one of the trade-offs. So that's that little part of the story. If we go in here, we can actually see, hopefully, the Manila dashboard. In the Manila dashboard, we can see this has actually got the home directories share, and that's the Manila piece that I was trying to connect to inside the Kubernetes environment. So here we've got the Kubernetes dashboard, which is where these pieces are running. I suspect if I refresh this, my... There we go. Thought that might happen. Naturally, this was sort of part of the plan. If I can find the right window. Can't find the right window. Wrong window. I said there were a lot of SSH tunnels, sorry. This will do. So, accidentally, I'm going to show you how you set up this environment. Here, I'm looking at accessing OpenStack Magnum. So I go into my thing here; it's an isolated environment, take a picture if you want. For now, I'm getting hold of this token, which will hopefully not be in here. Sorry, super slick. I should get the Keystone integration working, shouldn't I? So, here, I was trying to prove it is actually running in Kubernetes. If we look into the persistent volumes, there's a persistent volume here that references this storage class, which references the share name of the home directories, which is me trying to prove this is working. So, what I'll do now is go into the code that's actually running this demo. This is actually using the cloud-provider-openstack Kubernetes plugins; it's actually the upstream demo with a few modifications. Smooth. So, in here, we go to the sexily named "user deploy 2". What this demo is doing is a very similar environment to the first part, except I'm asking it to actually create three shares.
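The kind of manifest driving this part of the demo looks roughly like the following. This is a hypothetical sketch modelled on the upstream cloud-provider-openstack Manila examples; the provisioner name, share protocol, and all other names and parameters are assumptions, not the exact demo files:

```yaml
# Hypothetical sketch of a Manila-backed volume in Kubernetes, modelled on
# cloud-provider-openstack examples; names and parameters are assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: manila-cephfs-share
provisioner: externalstorage.k8s.io/manila
parameters:
  type: default
  protocol: CEPHFS
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: home-directories
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: manila-cephfs-share
  resources:
    requests:
      storage: 10Gi
```

The point is simply that a claim against a Manila-backed storage class is what links a pod's mounted volume back to the share visible in the Manila dashboard.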
One share I'm going to attach to exactly the same Manila share as before, but I'm actually going to cd into the Berlin demo directory, because that's an exciting feature to show off. Another volume I'm going to attach is an auto-created Manila volume. So, I just wanted to create a Manila volume of the size I specify. So, if we go back into the share dashboard, by the magic of some good work by other people, we've got an extra share in here. This is available. The service will be exposed on this port, hopefully. If we log in here, we can see we're in the subdirectory of the same thing, which is hopefully showing the same propaganda messages. So, there we've got it. Now, I was inviting my co-speakers to heckle me to make sure it looks real. Does that seem kind of real? There we go. Oh, yeah, I suppose I should ask the audience, but you might say no. Okay. So, there we go: we're using Kubernetes on top of a bare-metal-provisioned node to attach lots of different volumes and bits and pieces. So, I'm going to jump back into the presentation, which is really an invitation to say, please do get involved in this design process. There are lots of interesting shared problems that we have, and it's been great talking to you. Thank you very much for your attention, and we'll hopefully have a bit of time for questions. Thank you very much for listening. Okay, now questions. I see a microphone there, which is hopefully a real thing. So, if you're feeling brave, feel free to go and say the question there, and it saves me repeating it. Oh, some good heckling. Hey, Julia. Is there any work at this time on reservations, combining with bare metal and possibly the container usage? Combining the reservations above and below? Basically, yes. So, the simple answer is no. Okay, we should talk about that then. Yeah.
So, for context, with Blazar, the plan is that there's going to be some new placement integration, and with that, you'll be able to reserve the bare metal node. Actually, you're able to reserve VMs and other things as well, but certainly what we should be able to make happen is that that Blazar reservation can then be fed into Magnum, which gets fed into Heat, which gets fed into Nova, which then gets fed into etc., etc., and to actually use that. That's a very good question. Thank you. Yeah, no problem. Feel free to shout out questions if you're scared by the microphone, and I'm happy to repeat the question. So, maybe I should ask a question to the other co-presenters. Oh, there we go. So, the question is related to the preemptibles: if the higher-priority task terminates, the victims, the preemptibles, do you reactivate them? Sorry? I didn't get it. The victims, do they come back from the dead? No, right now, no. It would need other orchestration to do that. But you take the instance down. So, you just kill them, you just tear them down? So, what actually happens is, the other presentation covers this better, but what we actually do is send a delete request in, which is a soft delete request. So, it actually does a soft shutdown, which allows the instance a certain grace period to actually say, oh, I'm going to be killed now, because I know I'm a preemptible. So, it can tell its job queue, I've been killed halfway through this, or maybe even get a checkpoint in. But you've got about 30 seconds before you then get the hard shutdown, before it's dead. So, in the end, you consider them disposable? Okay. A container worker would be a good example of one of those things. I'm trying to think about it, to ask the question, and it's not coming. Sorry.
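The grace-period behaviour just described is what a preemptible workload would hook into. Inside the guest, a soft shutdown typically reaches a running service as SIGTERM, leaving roughly 30 seconds to checkpoint before the hard shutdown. A minimal sketch of that pattern, purely illustrative (the `checkpoint` stand-in is made up):

```python
# Sketch of how a preemptible workload might use the soft-shutdown grace
# period: catch SIGTERM, checkpoint or notify the job queue, exit cleanly
# before the hard shutdown lands. Purely illustrative.
import os
import signal

checkpointed = False

def checkpoint():
    """Stand-in for telling the job queue we died, or writing a checkpoint."""
    global checkpointed
    checkpointed = True

def on_preempt(signum, frame):
    # Grace period starts now: checkpoint quickly, then wind down.
    checkpoint()

signal.signal(signal.SIGTERM, on_preempt)

# Simulate the preemption by delivering SIGTERM to ourselves.
os.kill(os.getpid(), signal.SIGTERM)
print("checkpointed:", checkpointed)
```

A batch worker that drains its queue on SIGTERM fits this model well, which is why container workers come up as the natural example in the answer above.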
I was just wondering how the preemptibles play with quotas, because I can envisage a situation where you might want to set a quota of non-preemptibles and then allow them to burst using preemptibles. Do those systems play nicely together at the moment? That's sort of the reason for preemptibles to exist, actually. We also argued about this in a forum session earlier in the week. Do you want to talk about it? Oh, okay. Sorry, I don't want to keep talking. So, the way I think about this in my head is that it's very common to basically say, I've got three blocks in my cloud, I give one block to each of three projects, and the middle project doesn't pull its weight; it's only using half of its resources. So, the other two projects can pour preemptibles in, and that's kind of worked into the quota system. Right now, all the preemptibles in this demo are in a designated project, so that project has its own quota. So, effectively, you can quota the preemptibles completely separately from other instances, because right now they're in separate projects. Okay. If it was all in one project, are you bound by the hard quotas for your preemptibles as well? You are, right now. We've had conversations about that, but that's probably... Because I can envisage a situation where you might run your masters on non-preemptible nodes and then have your workers on preemptible nodes in your cluster. That is true, and the problem that you've correctly pointed out is that you actually have to use Neutron shared networks, if you're doing things sensibly, to do that kind of thing. It is a use case we've spoken about; it basically requires a lot more Nova changes. So, right now, as preemptibles have been proposed, it's a generic pending-instance facility. So, really, what it's saying is that there's an external system that can add capacity, be that adding composable hardware or deleting preemptibles to make space. It's kind of that mechanism.
So, the actual marking of a preemptible hasn't yet become a first-class citizen, and I think that will be the first requirement before we can go down all those paths. Does that make sense? That makes sense, yeah. Cool. Is there any concept with preemptibles of a cell or a pool of hardware that might exist at the bare metal level? Say Organizations A, B, and C all work together in one project, but A can only share with B, and B and C can only share together, in certain cases. We did talk about that a bit, didn't we, with tagging. Yes. Right now, we haven't got any sort of concept like that for preemptibles, but the theoretical thing that we spoke about was that you can add arbitrary metadata to your preemptible instance, which theoretically could promote it above any other preemptible, or, basically, the reaper can do what it likes. Okay, but that would be upon a contract negotiated outside of OpenStack. Depends if you call Aardvark part of OpenStack. It's sort of a contract on the metadata that you're reading from the instances. Is there a scheduler policy? It's like a scheduler policy, like scheduler hints, effectively, but the hint potentially would be contributed by tags and server metadata. Okay, awesome. That's what we were thinking, anyway. Maybe; we'll see. Thank you. As far as I know, today there is no option to stop a VM and release its resources for someone else, and then spawn it back if there are resources available. So, do you see further work on this feature? So, that feature sort of exists in Nova; it's called shelve, and it's not quite the same thing. It's not the best feature in the world, but... The idea of shelve is to basically simulate an AWS-like instance stop. If you've got a volume-backed instance, it can basically release the resources on the host but keep all of the IP address resources, the ports, and the... I don't know, the existential concept of the VM still exists, if you see what I mean, everything it's attached to.
That is a feature, and I suppose, theoretically, you could make a preemptible do that, but right now, the use cases for preemptibles are things that you kill and just build new ones of when you've got the space. Okay. It's a bit more cattle-y, if that makes any sense. For the use cases we're thinking about, I think that's fair, isn't it? I just think that this feature could maybe be useful for others. It could. To spawn them back once resources are available, or something like that. Thank you. Yeah, it is a different use case, but yes, it's interesting. So, there's the idea of overcommit in various aspects of Nova, and it looked like when you were doing the demo for the CERN part, it's specifically about memory consumption, which you typically don't overcommit. Is that the design you're looking at? We're overcommitting quota, not actual resources. We were overcommitting project quota. So, the quota was overlapping, not the actual memory. Yeah, I think that's fair. So, I think, to your point, one of the problems with introducing this in a generic way is that the victim picking is going to be intimately tied to your scheduling policies, and I think that's a fundamental thing. Yeah. And you're right. We were talking about, basically, making it as simple as possible within that area of resources. Because, fundamentally, it's a set of flavors that you're launching, and they're preemptible within that project context, if that makes sense. So, if you segregate your flavors into different pools, you start to get that kind of thing. So, thank you very much, everybody. These are our Twitter handles for keeping up to date with things. Thank you very much.