I'm Tim Randles, and I'm with Los Alamos National Lab. I'm going to talk today a little bit about some Neutron-Slurm integration work that we're doing. We'll quickly cover a little of the past work, which lays some groundwork; talk about the cluster requirements for the plugin we've written; the new plugin itself, select/openstack; and then some future work ideas.

This motivation and background slide I stole from a talk I gave in Austin at the last summit. I put it up here purposely, because I think it should just be a priori knowledge by now when someone gives a talk like this — I've heard these requirements four or five times at different talks at this summit already: users want more flexibility from their HPC resources. We tend to build HPC resources and finely tune them for MPI jobs. That solution, you hope, meets 70 or 80 percent — whatever metric you use — of your user population. But some users don't fit it, and that's not a new thing; it's been a longstanding problem.

Users — and I do think this is true — have more complex dependencies in their software than ever before. They have build-time requirements: build systems such as Ant and Maven; the Python runtime wants to go off to the internet for packages. A lot of these things expect to have unfettered internet access, and we don't typically give that to the compute nodes inside a cluster.

We've always had validated, vendor-supplied software stacks; that's not a new problem either. The problem comes when you have to build a Venn diagram of requirements for which OS you can run on your cluster so that you meet all of the vendor requirements — or you decide which vendors you just won't go to for support when something doesn't work.

And then there's legacy code with legacy requirements. This in particular is important for us at Los Alamos: we've been doing heavy computational work for 70 years now. While we're not going to try to recreate the Cray-1 or MANIAC environment, we do have codes that are constantly being updated for new architectures, and we need to validate the updated code's results against the old code — and the old code in this case is maybe 15 or 20 years old. If we can recreate the environment we ran 15 or 20 years ago — it's largely x86, so it'll be a Pentium-class processor we need to provide via something like a VM — code teams can go back and rerun the old code in a virtual environment, with the old input decks and the new input decks, and see if the new models match.

Increasingly there are data-intensive frameworks being developed: Hadoop 2 with the YARN resource manager; Spark, which can be executed through Apache Mesos, or YARN, or standalone. But notice that Slurm, Torque, PBS — these aren't resource managers these frameworks can interface with easily today. So we need some way to let users run data-intensive frameworks at the kind of scale we have in our HPC centers. Right now they rely on closet clusters: a rack in a closet somewhere, a bunch of workstations that should have been retired five or six years ago, loosely clustered together so they can run these things — but not at a scale that really means anything for them.

And users want an environment they understand and control. The OS we run is Red Hat based — Red Hat 6, so it's still supported — but users are running things much newer than that. Usually it's Ubuntu 16 or something; they want Python 3. These things exist for Red Hat 6.
You have to jump through some hoops to get them, or you're building and supporting them yourself. But these reasons are common; we see them all the time. So I think the discussion should shift: we understand why people want flexibility, but how are we going to provide flexibility in our environments? Where does the line of demarcation sit between a traditional HPC resource that runs MPI workloads and the newer runtime frameworks? Can they coexist on the same hardware with neither one the lesser for it? I think the answer is yes, and I'd be really interested in having conversations with everybody who's interested in this about how they think they want to do it. Some of the prejudices about building flexibility into an HPC system — "I/O is bad in a VM," for example — well, that may not be true anymore. We need to break down some of these barriers and start talking about tighter integration between these newer technologies and our old way of doing things.

So, past efforts to provide some of these frameworks. GLOME was an HPC-style machine — repurposed HPC hardware — but we built it as a Hadoop 2 cluster. We built it the way the Hadoop folks tell you to build it, and we ran Hadoop on it, we ran Spark on it. This was right after Hadoop 2 was released in late 2013, so the YARN support for Spark was not very good. Users played with it a little bit and didn't really like it. We're an HPC center; our users aren't used to writing anything in Java. They felt crippled — the barrier to learning enough Java to be productive was high enough that most never did more than write some hello-worlds, kick the tires, and wander away saying, this isn't for me.

To address different Linux distributions, we built another machine that was a mirror of GLOME, called Kugel, and on it we ran Charliecloud VMs. Charliecloud is an internally developed orchestration engine. You might be looking at me saying: you're at an OpenStack conference — why didn't you just install Nova and let it fire off your VMs? What we were trying to do was come up with a lightweight framework, very simple to understand, that would let the cluster run traditional MPI jobs or Charliecloud VMs without either interfering with the other. So we built some plumbing and framework in Charliecloud to fire off VMs — QEMU and KVM — such that when it wasn't running, the machine just looked like normal HPC. There was no Neutron service running, no Nova service running; we didn't want to introduce a lot of system noise and jitter for the MPI jobs that would come along later. It wasn't very successful. A user says, just give me Ubuntu, and I tell him: great — you can build a virtual machine, install these scripts inside it, copy it into our infrastructure, change your job submission script to do all this Charliecloud stuff, and then — trust us — your VM's going to run. There wasn't much uptake. We had a few brave users, but it wasn't widely used.

So our current direction is something we're thinking of as a converged data-compute machine. It's Woodchuck; we've had it around for about a year now. There we're developing something called Charliecloud containers. Once again: why are you using Charliecloud for containers? We wanted a container implementation that was, once again, small and simple, and that was unprivileged — I'm never going to give a user access to the Docker group so he can have root on my systems. So we wanted something small and easily auditable that starts up a user namespace and a mount namespace, so the user can bind in file systems he already has permission to access — and that's it.
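Just to illustrate the mechanism — this is not Charliecloud's actual tooling, only a minimal sketch of the kernel feature involved, using plain util-linux unshare, with an illustrative path:

    # Enter an unprivileged user namespace (we look like root inside it,
    # with no real privilege outside) plus a private mount namespace.
    unshare --user --map-root-user --mount /bin/sh -c '
        # Bind-mount a filesystem the user can already access; the mount
        # is visible only inside this namespace.
        mkdir -p /tmp/data
        mount --bind /lustre/scratch /tmp/data    # path is illustrative
        exec /bin/sh
    '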
So a user runs it like this: he does an salloc — sorry, not a malloc — gets a compute node, SSHes into it, runs the Charliecloud tool, and lands in his container environment. It's small. I have a link to a research paper we wrote that's in technical review; if anybody's interested, talk to me afterwards.

Am I forgetting something? Okay. So for Charliecloud VMs, we had some complex network configuration that was required to secure the thing, because this is the low-hanging fruit of security vulnerabilities: a user requests two compute nodes, his VMs start up on those compute nodes, and the compute nodes are mounting the parallel file system — in our case Lustre, or maybe at that time Panasas, but either way an IP-based access model. Large parallel file systems generally don't authenticate; they want to be fast. So the obvious exploit: the user owns his VM, he's got root. He crashes compute node two, takes its IP address, puts that IP on his VM, mounts your parallel file system — and now he owns all your data. This is really bad. We can't have that in any environment, let alone where we're at. In the schematic view here: Slurm fires up the VM — we used mpirun, which starts the virtual machine on whatever allocation the user has — the user does this, and he's sending evil packets off to the parallel file system, stealing all your data.

So we got Slurm to use prolog and epilog scripts to set up a firewall. Now we block packets, and the firewall is pretty simple at this point: if the source address of a packet isn't what the VM's should be, we just block it. But a single-node job isn't very useful for most people, so then we had to get fancier with the firewall configuration to let the good packets go off to the right places while the bad packets get stopped — and all the while we're introducing complexity, we're introducing failure modes. If a job crashes, that firewall disappears; the node might come back with the job still running. We had all sorts of problems with this, and really it was the fragility that scared us.

Using Charliecloud containers instead: the user requests a Charliecloud container, and that lands on the compute node. We have a separate physical network to mount the parallel file system over. We dedicated an Ethernet interface to the Charliecloud container, and we use Slurm to create a new network namespace and give it to the container. Then we bind-mount in the parallel file system. The user has what he needs, and it's a lot cleaner: no taps, no bridges, none of the overhead those introduce, no complex firewall setups — but the job does have unfettered access to everything over this dedicated Ethernet right now.
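Conceptually, the namespace plumbing Slurm does for the container looks something like this sketch — not our actual code; the interface name, job ID, and address are illustrative, and these commands run as root:

    # Give the job its own network namespace containing only the
    # dedicated data-network interface; the container is launched inside.
    ip netns add job1234
    ip link set eth2 netns job1234        # eth2: dedicated data-network NIC
    ip netns exec job1234 ip addr add 10.20.0.5/16 dev eth2   # statically assigned
    ip netns exec job1234 ip link set lo up
    ip netns exec job1234 ip link set eth2 up
    # The parallel file system is then bind-mounted into the container.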
So what we're looking for next is enhanced security and resource availability in this mode of operation. We want network isolation for user jobs; we want Charliecloud to have hardware-level access to the network for the job, without a bunch of barriers in its way; and we want jobs isolated so they can't do evil things to each other. So we want to use VLANs instead of firewalls and bridges. Our switch hardware supports this — let's take advantage of it; we'll get better performance all around. Even if we're going to run a VM — an old Charliecloud VM — in this mode of operation we can give the VM access to the hardware via PCI passthrough or SR-IOV, and the network itself isolates the job, allowing it access only to what it should have access to.

This should also open up the possibility of dynamic access to network-attached resources: file systems, databases, on-demand networks. A common use case we encounter is a user who has a database and runs it on his closet cluster because the program manager says that database needs to be protected for some reason. Currently, when they want to scale up, they have to buy more hardware; when the hardware needs refreshing, they have to fight with the program manager for some number of zeros of dollars to buy more. We could site their database on a network our clusters have access to today, but the only granularity we have is "this cluster can see that database" — not "that compute node can see the database." So we tell the users: of course it's a multi-user cluster, we're not going to build a cluster just for you, but be aware that anybody on the cluster can hammer away at your database and try to get into it. We want to build a system that does better than that.

So in this view we have Charliecloud with its dedicated interface, connected to 10-gigabit Ethernet — in this case it actually is; I think if we had a little more money, we'd have gone with 40. Conceptually it's a lot like IPoIB, but with VLANs, because IB partitioning isn't dynamic and there are only 16 partitions I can use, so I can't run many jobs. There's the database living on the network, and the colors denote VLANs — there are your Charliecloud jobs. Slurm talks to Neutron and says: this compute job is running and is allowed access to the database; create my VLAN — and the job can see it. When the job's done, Neutron goes to the SDN controller and says: take this node out of the VLAN, we don't need that access anymore — and the database disappears from the fabric. If the user has multiple nodes, or there are multiple users, we don't want them talking to each other, so we dynamically create a VLAN and put each user's job in its own VLAN so he can't bludgeon someone else to death. Remember, we have an Ethernet fabric here, so congestion gets to be a real problem; users have inadvertently denial-of-serviced each other by doing bad network things.

To make this all work, Neutron has to control the network the containers are being provisioned on. In our case we're doing something simple: it's a provider network, and it's just VLANs. We don't want Open vSwitch — once again, we don't want to introduce a layer of host complexity that brings performance penalties and configuration complexity. I'm the one on call for this; the other 12 people on the team think I'm crazy, and they don't want an on-call ticket saying packets aren't flowing because OVS is messed up in some way they have no idea how to debug.
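For reference, the Neutron side of "provider network, just VLANs" is mostly ML2 configuration along these lines — a sketch of an ml2_conf.ini, where the physical network label and the mechanism driver name depend on your fabric and vendor:

    [ml2]
    type_drivers = vlan
    tenant_network_types = vlan
    mechanism_drivers = arista          # whatever your switch vendor ships

    [ml2_type_vlan]
    # physical network label : VLAN pool Neutron may allocate from
    network_vlan_ranges = physnet1:100:4000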
How do we know the topology — which nodes are on which switches? The switch has to know which node is on which port, so you run LLDP on the nodes and it talks back to the switch, and the topology gets communicated through the switches back to the SDN controller. You need an SDN controller; I'm using Arista hardware, and the CloudVision stuff is just built in and works well. Arista will tell you to run a virtual machine of this, so we have EOS running as a VM, acting as our SDN controller talking to our physical fabric. Other vendors have similar solutions. You need the Neutron ML2 plugin to work — your vendor probably has an ML2 plugin available for you. And we need Nova with a fake driver; I'll get to that in a second.

What we've done with this Neutron-Slurm integration is replace the complexity of the Charliecloud VM configuration scripts. We don't rely on prolog and epilog anymore. We ran into cases where iptables would hang, so the epilog would hang — but slurmd had already told slurmctld that the job was finished and the node was available, and then jobs start landing on this misconfigured node and dying. Users don't like it when you drain the queue by killing their jobs.

This required modifications to an existing plugin: I wrote job_submit/openstack_glance back in the spring, and that was the image-management portion of this. It's now been renamed job_submit/openstack, because we're validating not only the image requests but also the network requests users can make. If a user wants that database on his network, he has to request it somehow, and we validate that in the job-submit plugin. It also required a new plugin, select/openstack, and this is what actually interfaces with Neutron to create the network, create the subnet, do the nova boot part, and delete everything when the job terminates, taking the cluster back to a known state.

So how does a user actually use this? He just puts an sbatch option in his job submission script. In the spring it was a GRES — a generic resource request — called udss, for user-defined software stack, which is what we call this internally, plus the name of an image that lives in the Glance registry. If it's not there, we don't accept the job, and the user can fix it; we're not going to load the queue up with things that are going to fail. Now we've added the network capability: a user can ask for a network, and it has to match a network name that's in Neutron. Pretty straightforward, and it handles both cases.

The job-submit plugin workflow — this is also from the spring — is kind of obvious. It looks for the requested resource: did you ask for anything? If you did, get the data about it. Does it exist? No? Then fix it and resubmit your job. Is the user allowed to use the image — is it public, is he a member of the owning tenant? Yes or no; if yes, we pass the job on. What we did to extend this for the network request was basically replace "image" with "net" and "Glance" with "Neutron" — same workflow.
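To make the user-facing side concrete, a job script looks something like this sketch — the GRES is as described above, but take the exact spelling, and especially the network option's name, as illustrative rather than literal:

    #!/bin/bash
    #SBATCH -N 2
    #SBATCH --gres=udss:ubuntu16     # image name; must exist in Glance
    #SBATCH --network=mydb           # must match a Neutron network name
                                     # (option name illustrative)

    # ...launch the Charliecloud container as usual...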
Now, the select/openstack plugin. I talked to one of our scheduler experts, and he said: why would you want to overload a select plugin to do this? I pointed him at the documentation on the SchedMD site — this is the mechanism for performing any system-specific tasks for job launch and termination. He said: well, select is supposed to be very fast; it's supposed to just come back — you get the allocation and the job should start executing. And then I said: well, Cray's already done a lot of this work for me, because Cray uses this same mechanism to handle a lot of the setup and cleanup of their blades in an XC architecture. On a Cray XC they don't tend to run Slurm through the whole stack; they run ALPS, their own resource manager, and there's a translation layer built into their plugin that takes Slurm information — what nodes am I allocated, things like that — translates it, and pushes it into ALPS. So Cray's done a lot of work for me: I use their wrapper plugin, the "other" select layering, and I use the select/cray plugin as a blueprint for my own.

In the select API there's select_p_job_begin() and there's select_p_job_fini(). Begin handles creation of the network and the subnet and does the nova boot — it does not redo validation. On job termination, we tear it all down.

But is this deceptively simple? Well, yes and no. The neutron net-create — and I'm putting this in terms of the command-line tools you may be familiar with; I'm not actually reaching outside of Slurm to run the CLI. This is written in C, so we're building JSON payloads and parsing JSON responses, calling the RESTful API directly using curl. Net-create is straightforward, and the very nice thing is this: when we first started, we thought we'd just directly control the switches from Slurm. Then we started down the road of: how do I know which VLAN belongs to which job? How do I know if it's an existing VLAN? Where do I get that information? We quickly realized we'd be building our own database describing the network on the cluster — and Neutron's already got that. You just configure Neutron properly for provider networks with VLANs: you give it a resource pool of VLANs — say it can use VLANs 100 through 4,000 — and it just takes care of it. It knows what's in use, it takes VLANs back, it talks to your SDN controller. That solved a huge problem for us.

Subnet-create is necessary for instances to use Neutron networks: if you create a network and tell Nova to boot using it, Nova comes right back and says, there's no subnet on that, I can't use it. So it's a required step that does basically nothing; it just satisfies a condition Nova needs to go ahead and create the fake instance. Nova boot: you specify the networks — you know how that works — but you also need to specify which nodes, and then: you're going to boot what? Because I'm not actually booting anything. And then nova delete and neutron net-delete — do these leave anything behind? Nova delete, I'm pretty sure, doesn't. On the Neutron side I'm not doing a subnet-delete, because I can't find the subnet after I delete the network it belongs to. If anybody knows whether that's filling up orphan rows in a subnet table somewhere, let me know — I haven't seen them, but I haven't looked real hard either.

The "boot what?" part was kind of a problem for us. We're using the Nova fake driver, so we don't care — it doesn't matter. You just need to tell Nova to boot something that actually exists in the Glance registry, with a flavor that actually exists, so it can resolve those things and doesn't think you're asking for something it can't provide.
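Spelled out with the CLI equivalents — remember, the plugin actually builds these as JSON payloads and POSTs them to the REST APIs with curl — the lifecycle looks roughly like this sketch. The names follow the conventions I describe next; the CIDR, image, and flavor are placeholders:

    # select_p_job_begin(): build the job's network and the fake instance
    neutron net-create woodchuck-job1234                  # Neutron picks a free VLAN from its pool
    neutron subnet-create woodchuck-job1234 10.30.0.0/24  # required, or nova boot rejects the network
    nova boot --availability-zone woodchuck \
              --image placeholder-image --flavor placeholder-flavor \
              --nic net-id=<uuid-of-woodchuck-job1234> \
              woodchuck-job1234-wc001                     # fake driver: nothing actually boots

    # select_p_job_fini(): tear it all down
    nova delete woodchuck-job1234-wc001
    neutron net-delete woodchuck-job1234                  # VLAN returns to the pool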
And then, which nodes? We're using availability zones: you can specify an availability zone, and we give it the cluster name, because we envision this sitting at the center of our backend infrastructure, where we may have multiple clusters being configured from one place. Then you specify the networks you need, and you give the instance a name. At LANL we use two-letter abbreviations for all our clusters, so on Woodchuck the first compute node is wc001, and we have an availability zone mapped out with all the compute nodes in it. The network name — if you get in and start poking around, say in Horizon, looking at your networks — is built from the Slurm job ID: Woodchuck job 1234 has this network. The mydb part comes from the user's request, if he asked for something. And what did I boot? Nova needs a name for the host, and it needs to be deterministic, so it's the cluster, then the job ID, then the hostname.

Other minor details. How are we going to get login-node access to the containers? We've put the containers on their own physical interface, on their own fabric, and we VLAN them — we made it deliberately hard to get to them, and now we need to make it easy. For the first use cases we'll just stick an access port on it: put an interface on our login node, plug it into the fabric, put it in trunking mode so it can see all the VLANs, and the user can just SSH into his container if it's running sshd. But then anyone, of course, can try to access your container. The argument right now is: well, anybody can try to access your compute node — what's the difference? The difference is that maybe the user created the container or virtual machine and didn't secure it: left a default password somewhere, or no password at all. So I think what we'll end up doing in the short term is using something like cloud-init to insert the user's SSH key into the container, so that only he can get into it, and not allowing password authentication. If we find containers with password authentication enabled, we'll tell the users they can't run them, in the short term.
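For that short-term key-injection idea, the cloud-init side would be a user-data snippet along these lines — a sketch; the user name and key are placeholders:

    #cloud-config
    ssh_pwauth: false                 # no password authentication, period
    users:
      - name: jdoe                    # the submitting user
        ssh_authorized_keys:
          - ssh-ed25519 AAAA...examplekey jdoe@lanl.gov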
A longer-term idea we have — and one I'd love to discuss if anybody's done anything in this style — is this: our users log in to a gateway to reach our front-end network and our login nodes. We'd like a system where they hit the gateway and, instead of SSHing to a login node, make a request — maybe with OpenStack running as just a cloud manager — and what they end up with is a VM on a physical node attached to that cluster: VXLAN, with a gateway into the VLAN. Well, not into the VLAN initially, because they haven't submitted a job yet — but an arrangement where users are sandboxed in VMs on our physical hardware, and the tunnel endpoints get set up when their jobs are submitted so they can reach the nodes in their job allocation. It should provide stronger isolation between users than we have now — obviously stronger. We get a lot of support requests about long-running processes on front-end nodes: someone's testing his code, he went to lunch, and it's just sitting there running a single-threaded simulation on the node. When it happens on a Friday afternoon, the Monday ticket queue looks really ugly, because no one else could do anything on that node while he had it CPU-bound. So we have other reasons why we want to come up with this regime, but it's definitely non-trivial to just give users VMs as their own front-end nodes.

And then, how about dynamic network access beyond the cluster? One of the use cases earlier was: what if a user wants unfettered internet access? Do we allow Neutron — which is really being controlled by Slurm, which means a user is telling it what to do — do we really want that controlling our infrastructure outside the cluster? My networking guy says, hell no, I'm not going to allow that. So do we do static VLAN assignments outside the cluster? Maybe it folds into the third point, with QoS, OpenFlow, traffic shaping — all the other things SDN gives you that we're not doing yet. We have Juniper firewalls sitting around; do we want to orchestrate our infrastructure so that users can reconfigure a larger piece of the network for on-demand stuff? This is definitely way out of scope for a Slurm plugin, but for this model to extend very far, we're going to have to come up with something that does those things. What that is, I'm not sure.

Questions? Reid's the author of Charliecloud; Chuck's my networking guy; and Steve Senator is our Slurm expert.

Thanks, Tim. How do you do deterministic IP address assignment?

In the case of containers — and we're hoping nobody really asks for Charliecloud VMs in this regime, because we've told them they're deprecated — the user is getting access to bare-metal hardware that's already configured, and the network namespace carries that in for us. We're not reassigning the IP when the container is attached to the device. No, the NICs are statically assigned. We had the same thing with Charliecloud VMs, and that was one of the reasons we could do the firewalling we did: we had a subnet set up so that your Charliecloud VM job had access to this device, to this bridge, but the external interface was statically assigned. We had an immediate scaling issue in that, the way the scripts were first written, you couldn't go beyond 255 addresses, because we ran out. Luckily we had a 96-node cluster, so that wasn't a problem.

Any other questions? I do plan to start talking to SchedMD and try to upstream some of this. I don't know that it's generally useful; it's just the direction we're taking right now. Like I said earlier, I think the prejudices HPC holds against non-HPC workloads, and against the mechanisms for running them on our clusters, don't hold up anymore; they need to fall away. We're just starting down this road. We have image management, and we have a little bit of network management now; it all needs to grow. But we're trying to prove the case that with existing technologies — with things like OpenStack components, not running a cloud but using the components as bolt-ons to our clusters — we can give users flexibility that's just as performant as a bare-metal cluster built for a single purpose.