Welcome, everyone. My name is John Garbutt and I'm here with my colleagues Pierre and Steve from StackHPC. Today we're going to talk to you about the crowded coral reef. This is a resource management concept: we're going to look at reservations, preemptibles and quotas, and how all these things work together.

First I'm going to talk about how HPC is typically deployed today, or certainly how it has traditionally been deployed: silos of infrastructure, each dedicated to a particular platform. For example, suppose you wanted to do some work that used technologies from big data together with things like GPUs, which might sit in an AI deep learning deployment, but structured as an HPC MPI kind of workload. It's difficult to know exactly which platform to go and use, because right now the hardware is siloed, so you're forced into adopting whichever platform has got hold of the resources.

This is where we start talking about HPC 2.0 and the converged cloud. What's happening in the background here is that the hardware used for all of these different kinds of HPC workload is converging. It's very similar hardware: if you want to access GPUs, whatever type of workload it's for, it's still the same kind of GPU and a similar kind of interconnect. So we can make use of this and actually put all the hardware in one pool, and by using OpenStack as a cloud API on top of that, we have a single unified API to access any kind of resource. As the needs shift between the different platforms that would otherwise create their own silos, we can move the hardware to whoever has the most need at that point in time and load-balance between all of those different requirements.

Now, just to be clear that this isn't make-believe, we've been quite heavily involved in a project called IRIS, which is a common infrastructure for STFC science in the UK. This is an OpenStack cloud, and it's part of a federation of various different services, including several OpenStack clouds all working together. The particular instance we've been working on is operated by the HPCS team at Cambridge University, and it builds on work we've done with them on the ALaSKA performance prototype platform that you may have heard us talk about in the past. We use OpenStack Kayobe, which drives Kolla Ansible, to deploy the system, and we've been working to ensure that we've got reference workloads so that scientists can make use of the cloud and get the work done that needs to get done. This has all been funded through the medium of digital assets as part of the IRIS bidding process, so we must thank STFC and IRIS for funding this work so we can make progress on building this HPC 2.0 converged cloud concept.

The particular part of the problem that we want to talk to you about today is how to get a good fair share of resources within this fixed-capacity cloud. This isn't specific to HPC as such, but it is a specific need within the typical HPC funding cycles: you have a fixed amount of capacity and you want to get the most research and the most science done with that particular investment. Within the HPC world it's very typical to use a batch scheduler, and lots of the concepts we're going to talk about today are well implemented by batch schedulers; they're very good at trying to deliver this fair share of resources.
So the concept of carving up the system within an HPC batch scheduler is this idea of CPU hours per time period, very frequently per quarter so you can review each quarter, or whatever period you choose. Your share determines who gets picked off the queue first: if there's no one with a higher priority than you with work on the queue, your work gets done, and if you have nothing queued, other people's work gets done instead. So you're getting very high utilization out of this finite-capacity cluster. Everything goes on a queue, but you can actually start to jump the queue by using things like reservations, preemption and the backfilling concepts.

One of the limitations of the system is that in order for it to work you have to package your work into jobs, into these units of work that go on the queue. This is particularly hard for things you want to be interactive, for example a JupyterHub service. Some systems will let you run a two or three day job, but that's not really the way you want to host a long-running interactive process, so it doesn't really work for that kind of interactive use case.

Going back to the HPC 2.0 converged cloud concept: OpenStack has quotas to try and manage this fair share of resources, but quotas don't work very well when you've got a fixed-capacity cloud. You can't just predict demand and say, "well, now we're seeing rising demand, let's buy several more racks of gear and put them into the cloud". That doesn't always work, particularly with a grant-based system where you've got a fixed-size system that you want to get the most out of.

So let's just review what quotas in OpenStack do today. I like to call it pizza slicing to give fair share: the idea is that you have a look at your cloud and you say, well, we can divide it between these three groups in these three proportions, and then what OpenStack does is try to make sure that you can't have more running at any one time than your quota allows for. This can cause lots of underutilization, because as you saw on the previous slide, not everyone is using all of their quota all of the time. One fix for this problem is to start overlapping quotas: in the knowledge that not everyone will use quite all of their quota, you hand out more quota on your cloud than you actually have resource for. This is quite a common practice. Now, the issue here is that it tends to cause bad behaviour, particularly when you start to run out of resource. People soon realize that the first person to spin up a VM keeps that VM, and no one's going to shut it down. So people start land-grabbing, creating very large VMs to make sure they've got that VM when they need it next week, and you get very bad behaviour where your system looks utilized but the VMs are actually doing nothing, because people are trying to reserve their little spot and empire-building to hold more space. So we need to look at a model that doesn't work like that, one that gives people the structures to get the resources they need. The other issue is that when you start looking at quota management across lots of different groups all wanting different bits of capacity, managing quotas at scale is really hard, so you need some level of hierarchy within the system.
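To make that pizza slicing concrete, here is a minimal sketch of handing out a fixed slice using the openstacksdk Python library; the cloud name, project ID and numbers are all placeholders, not anything from our deployment.

```python
import openstack

# Connect with credentials from a clouds.yaml entry ("mycloud" is a placeholder).
conn = openstack.connect(cloud="mycloud")

# Pizza slicing: cap what this project can have running at any one time.
# ram is in MiB; "1234abcd" stands in for a real project ID.
conn.set_compute_quotas("1234abcd", cores=64, ram=256 * 1024, instances=20)

# Read back the slice we just handed out.
print(conn.get_compute_quotas("1234abcd"))
```

Notice that nothing here says anything about when the project needs those cores, and every slice has to be handed out and reviewed individually, which is exactly where hierarchy and a temporal dimension come in.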
The idea is that you divide the cloud roughly between a smaller set of top-level units, and within each of those, people can share between multiple projects underneath. So we need some level of hierarchy to really deal with this; this slide is taken from a previous joint presentation with CERN, where they described their approach to this problem.

Another way of phrasing the problem is this: if you know that tomorrow you need to get hold of 10 GPUs, how can you do that? How can you have that conversation with the resource provider about getting hold of those GPUs? To have a look at that question, I'm going to hand over to my colleague Pierre.

So the answer to this question is to add a temporal concept to the quotas, and this can be done with Blazar reservations. Blazar is an OpenStack project which provides resource reservation as a service: effectively, the users themselves can reserve instances for a specific period of time, or they can reserve full hypervisors and deploy multiple instances on those resources. What this provides is a guarantee that your resource is going to be available when you need it in the future, so it helps to avoid those issues that John was mentioning with quotas, where people pre-allocate resources they don't actually need now but will need later. One caveat of Blazar is that it requires the Nova compute nodes to be enrolled for exclusive use by reservations. That means that if your users are not actively allocating all the resources through reservations, or within their reservations are not actually launching enough instances to fully use them, then you can end up with underutilization. And that's where we hope that autoscaling can help.

What we mean by autoscaling is platforms that can automatically add or remove capacity for themselves as required. For that we need some sort of software-defined version of the platform, because we can't be manually reconfiguring it ourselves; we need a way to change the size of the platform, and some sort of metric that captures demand. All of this is going to be very platform dependent, and a number of platforms already have at least some level of support for this. For example, both Vcycle and VMDIRAC can spin up VMs on demand for GridPP jobs, and can do things like retry jobs if VMs are deleted, or back off from creating VMs if there's insufficient quota or capacity. Kubernetes has lots of features to support dynamic operations, and the cluster autoscaler is particularly interesting here: when containers go into a pending state because there's no room left on the cluster, it can call back into OpenStack Magnum to add capacity to the cluster, which allows those pending containers to start up. Now, John mentioned interactive use earlier, and this is particularly helpful for things like JupyterHub, which spins up a new container when a user logs into the cluster.

We've also been able to integrate Slurm with OpenStack in a similar way, essentially using the power management features that have been in Slurm for a while, and we're going to explain over the next few slides how that works. We deploy a Slurm cluster with a control node, which could be a VM running the Slurm control daemon, and some persistent nodes. We call them persistent nodes but they're just normal Slurm nodes; they're always going to exist, and they provide a kind of base level of service to the cluster. We also have cloud-state nodes.
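To give a feel for how that hangs together, here is a hypothetical slurm.conf fragment; the node names, sizes and script paths are made up, but State=CLOUD and the power-saving parameters are standard Slurm configuration.

```
# Power-saving hooks: Slurm calls these when it wants capacity to change.
ResumeProgram=/opt/slurm/resume.sh     # in our case: create instances in the cloud
SuspendProgram=/opt/slurm/suspend.sh   # in our case: delete instances again
SuspendTime=300                        # seconds a node sits idle before release
ResumeTimeout=600                      # seconds a node may take to boot and join

# Persistent nodes: always exist and provide the base level of service.
NodeName=persist-[0-1] CPUs=16 RealMemory=64000 State=UNKNOWN

# Cloud-state nodes: defined here, but not provisioned until they're needed.
NodeName=flex-[0-15] CPUs=16 RealMemory=64000 State=CLOUD

PartitionName=compute Nodes=persist-[0-1],flex-[0-15] Default=YES State=UP
```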
Now, the cloud-state nodes aren't actually provisioned at this point; they're just defined in the Slurm config, as in the sketch above, and that's done in such a way that Slurm understands it can't contact them initially. The other key part is that we upload an image to the cloud, represented by the disk icon here, which defines a compute node in a fully configured state, so that as soon as it boots it can join the cluster. There are some details about how to make that work, things like keeping the Slurm config off-board somewhere, but we won't go into them here because they're not the crucial point. So now, when the scheduler decides more nodes are needed to service the queue, it runs what it basically thinks is a power-up script, but what that script actually does is talk to the cloud to launch some instances with the appropriate image. Once they boot, they contact the Slurm control node, which then starts the scheduled jobs on them as normal. And it's a very similar process to release nodes back to the cloud when the scheduler, looking at the queue, decides they aren't required.

There's a key thing here, which is to ensure that we have a single source of truth both for the original cluster deployment and for the auto-scaled nodes; otherwise you can get yourself into a real mess if there's a mismatch between the two. So you probably need some kind of image build pipeline, or something like that, to ensure you've effectively got a reproducible cluster. What this does is get us from the situation where the platform, for example Slurm, is hogging all of the resources in its quota all of the time, to one where we can hand nodes back to the cloud but still have a guaranteed quota or reservation that the platform can use. The problem is that that's not actually useful until we can put some workloads on these nodes we've handed back to the cloud, and that gets us back to the same problem: how do we do that and still guarantee we can get these nodes back when we need them?

This is where preemptibles can help. Preemptibles are instances that can be terminated at any time by the cloud infrastructure; they're also called spot instances in Amazon terminology, and that term has been adopted by other commercial cloud vendors. They can be used by workloads, such as GridPP jobs, which support being terminated at any time. This is a concept that was introduced to OpenStack by CERN, however it was based on changes to Nova, including API changes, that did not merge with the upstream community. So we are investigating an alternative approach which is implemented purely within OpenStack Blazar, and using this approach we can fill the gaps between the Blazar reservations. So let's imagine that we have a few reservations on the cloud. Here we've got a graph with time on the x-axis and the resources of the cloud on the y-axis, and four reservations. You can see that there are gaps between those reservations, which means there is underutilization. The idea is that we will schedule preemptible instances within those gaps, and Blazar will be able to terminate those instances just before the reservations are due to start. This is a concept that will help us build the entire coral reef: by combining autoscaling with preemptibles, we can have a system where a reservation for each platform bounds the maximum use of the platform and ensures that it has the availability of the resources required for its workload.
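For concreteness, the reservations in that picture are ordinary Blazar leases. Below is a rough sketch of asking for tomorrow's GPUs from Python with python-blazarclient; the session-based client setup, the GPU property filter and all of the values are assumptions rather than a definitive recipe.

```python
from keystoneauth1.identity import v3
from keystoneauth1 import session
from blazarclient import client as blazar_client

# Standard Keystone authentication; every value here is a placeholder.
auth = v3.Password(auth_url="https://cloud.example.com:5000/v3",
                   username="alice", password="secret",
                   project_name="science",
                   user_domain_name="Default", project_domain_name="Default")
sess = session.Session(auth=auth)

# We assume python-blazarclient's session-based client setup.
blazar = blazar_client.Client(session=sess)

# "If you know that tomorrow you need to get hold of some GPUs, how can you
# do that?" Reserve two whole GPU hypervisors for the day; Blazar expects
# "YYYY-MM-DD HH:MM" timestamps.
lease = blazar.lease.create(
    name="gpus-tomorrow",
    start="2020-06-02 09:00",
    end="2020-06-03 09:00",
    reservations=[{
        "resource_type": "physical:host",
        "min": 2,
        "max": 2,
        "hypervisor_properties": "",
        # This filter assumes the operator tags GPU hosts with a "gpu" property.
        "resource_properties": '["==", "$gpu", "true"]',
    }],
    events=[],
)
print(lease["id"])
```

Preemptible instances are then what fills the space around leases like this one, which is where the flexible reservation work comes in.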
So a reservation bounds each platform's use; by combining that with a new API to tell Blazar how much of its reservation each platform is actually using at any point in time, we can add the autoscaling element. With this API, Blazar will be able to schedule preemptible workloads within the reservations as well, not just in the gaps between them. And of course the platforms will be able to scale up and down when they need to, and that will trigger the removal of the preemptible instances. So with this in place we can combine all our elements, the quotas, the preemptibles, the autoscaling and the reservations, and build our coral reef.

So now, how can we make this happen? For Blazar specifically, there are a few steps that we are planning to take. First, we would like to improve the ease of use, so that the users of the system don't necessarily have to create reservations themselves; they could be created automatically by some quota management system. There's of course the work on integrating preemptibles, which is already being developed upstream, and this concept of flexible reservations, which allows Blazar to know at any time how much of the resources within each reservation are actually being used.

How can you help us to build this? There are two places where you can get involved. The first is within the Blazar project itself. We have meetings in two different time zones: the first one, for Asia Pacific and Europe, is at 9am UTC on Tuesdays every week, and we also have another meeting that is more friendly for people based in the Americas, on Thursdays at 4pm UTC every two weeks. You can also get involved with the Scientific SIG: there's a meeting every week, at alternating times, one that is more friendly for people based in Europe and another that is more for people based in the Americas. And finally, if you want to keep hearing about the progress of this project, you can follow us on Twitter, we are @StackHPC, and you can of course contact us directly by email; those are the addresses to reach out to us. Thank you for listening.