Hello, my name is John Garbutt and I'm a Principal Engineer at StackHPC. Today I'm going to talk to you about the project we've been doing at the University of Cambridge: expanding their CSD3 supercomputer with some new hardware, and making use of OpenStack to do that.

A little bit of context on CSD3. It's the Cambridge Service for Data Driven Discovery. It's got a mixture of regular Intel Xeon CPUs, Intel KNL nodes, and NVIDIA GPUs. When it was first installed, roughly November 2017, it was one of the fastest UK supercomputers, and in fact it was number 75 in the world on the Top500 list. CSD3 has so far been provisioned using xCAT: effectively you take a running machine, create an image from that node, and then push that image out to all the other nodes with xCAT, by PXE booting each machine and writing the image down onto it. All the resources within the system are accessed via a Slurm cluster, a batch queue that you submit your jobs to.

There are some reasons they wanted to change the way they're currently working as they look to expand the cluster. The first one, which many people will recognise, is an increase in complexity. There are always more platforms that people want to run, and people want to self-service their needs, so they need help managing the complexity that comes as the workloads that need to be supported within the cluster get more and more diverse. At the same time this causes problems with knowledge sharing, so they're looking to use automation and infrastructure-as-code tooling to manage that complexity and help with resource sharing: peer review and checking of changes, and a git repository with a good history telling you why things happened in the past, but applied to infrastructure. Another part of this is removing resource silos. At the moment, when hardware is purchased it usually gets a specific platform put on it, and it's often stuck with that for its lifetime.

That can look a bit like the traditional "HPC stack 1.0": different silos of hardware for different use cases. There's a silo that's good for Hadoop, there's a silo that's good for AI and deep learning, there's a silo that does traditional HPC, and these are often quite separate. Now the hardware is getting more similar across all of these, and people in each of these buckets want hardware from the others, so it's getting increasingly complicated to deal with all the shifting hardware needs and resource requests over time. What we're looking to do is bring in OpenStack to help with that: you have a single pool of bare metal hardware resources, but we expose it through a single set of APIs. Whether you want bare metal, VMs or containers, you can come to OpenStack, spin those up, and run your science platform on top. In addition, they're also looking at building these science platforms in a more cloud-native way, so that when there are resource needs that can't be met locally because the system is fully used, there's the option to move the workloads that can move out to external clouds. It's generally more expensive to run there, but that can be a good use of money when the time saved is worth it. One of the key things we're doing here is using Ironic to enable those bare metal workloads.
So in this talk we're looking particularly at how we expand the CSD3 partition with new nodes using OpenStack Ironic. Let's just take a moment to look at the OpenStack journey of Cambridge's High Performance Computing Services. It started roughly back in 2015. They were one of the founding members of the resurgence of the OpenStack Scientific SIG, and they've done a lot of work telling the OpenStack community and the world what they've been doing, what they need OpenStack to do, and trying to make that happen; in particular using the SKA and its need for a performance prototype platform, and seeing how OpenStack bare metal could be used for that. But if you look more widely at what Cambridge has done, there are lots of varying use cases: virtualised clouds and bare metal clouds, although they each generally have a slightly specific flavour. The new cloud that's been created to expand CSD3 is named Arcus, and this is an attempt to create a more unified cloud. The particular use case we haven't been able to satisfy until now is large-scale HPC: can the main cluster that Cambridge uses be deployed using OpenStack?

So, expanding CSD3 using OpenStack. Firstly, some notes on the hardware. We're using Dell PowerEdge C6420 servers, so basically you get four dual-socket machines within a 2U footprint. For the sake of this talk, let's focus on the networking. There is a local SSD, and there's a 1 GbE NIC that can be dedicated to the out-of-band management network; this can also optionally be made visible inside the OS. The main networking for the system is a Mellanox ConnectX-6 card. One of its ports is used as an HDR100 port: the HDR (200 Gb) switch has a breakout cable giving two HDR100 ports per switch port. It's a very similar setup for Ethernet, where the 100 GbE switch has a breakout cable giving two ports of 50 GbE each. So there's 100 Gb of InfiniBand and 50 Gb of Ethernet on every one of those servers.

Job number one was finding the machines. Cambridge have quite a well-practised path for discovering servers: once a server is racked and stacked, its details are scanned with a handheld scanner and exported into spreadsheets, so you can generate a CSV file showing what you expect to be in the rack. What we do is take that CSV file and automate everything from that point onwards. One of the key pieces of information in that CSV file is the MAC address of the out-of-band management port. In this particular case we use Neutron to hand out the IP addresses on the out-of-band management network. The port isn't actually bound, but the Neutron DHCP server still happily hands out the IP address to the MAC that requests it. We use Terraform to create those ports from the CSV file, and we generate an Ansible inventory from the same CSV file; that inventory is used to drive basically all the automation from this point onwards. As soon as possible we get the node enrolled into Ironic. The first step is to enable IPMI so that we can add the node to Ironic with the IPMI driver (we'll come back to that later). So we enable IPMI, then we enrol the node into Ironic, and from there we use Ironic to track the machine through its enrolment states.
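In production this is driven by Terraform and a generated Ansible inventory, but as a rough illustration of the same idea, here is a minimal openstacksdk sketch of turning a rack CSV into Neutron ports and enrolled Ironic nodes. The CSV column names, the "arcus" cloud entry and the "oob-mgmt" network name are assumptions for illustration, not the actual names used.

```python
# Minimal sketch (not the production Terraform/Ansible): enrol nodes listed in a
# rack CSV into Ironic using openstacksdk. Names below are illustrative assumptions.
import csv
import openstack

conn = openstack.connect(cloud="arcus")  # assumed clouds.yaml entry
oob_net = conn.network.find_network("oob-mgmt")  # assumed management network name

with open("rack01.csv") as f:
    for row in csv.DictReader(f):
        # Pre-create a Neutron port for the out-of-band management (BMC) MAC, so
        # the Neutron DHCP server hands out its IP even though the port is never bound.
        conn.network.create_port(
            network_id=oob_net.id,
            mac_address=row["bmc_mac"],
            name=f"{row['hardware_name']}-bmc",
        )
        # Enrol the bare metal node with the simple IPMI driver.
        node = conn.baremetal.create_node(
            name=row["hardware_name"],
            driver="ipmi",
            driver_info={
                "ipmi_address": row["bmc_ip"],
                "ipmi_username": row["bmc_user"],
                "ipmi_password": row["bmc_password"],
            },
        )
        # Move it to "manageable" so inspection can run as the next step.
        conn.baremetal.set_node_provision_state(node, "manage")
```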
So in order to get the machine fully functional, we actually have to configure the ConnectX-6 NIC in-band, and we do this inside the inspection ramdisk. When the node first boots into the inspection ramdisk, it checks the state of the ConnectX-6 NIC. If it's not what's expected, it updates the firmware, tells the NIC which port should be Ethernet and which should be InfiniBand, reboots, and then does the inspection. From this inspection we have most of the information about the node. We then change the BIOS settings to disable the 1 GbE NIC and start PXE booting over the 50 GbE Ethernet NIC, now that it's available. Once that's all configured, we run inspection again over the 50 GbE Ethernet, which updates the PXE MAC address and all the other data, so that we're then ready to image the machine.

For building up the Slurm cluster, we create it with Terraform, again using that same CSV file, which maps a hardware name to a logical cluster name. The hardware name is the Ironic node name, and the logical cluster name is the name of the Nova instance we create on that specific bare metal node; we use the availability zone mapping to target the specific bare metal node. The key part of the image we deploy is that when it boots on the server, cloud-init, as usual via a config drive, gives the server the correct hostname and IP address. From that point it mounts an NFS share containing the Slurm configuration, Slurm starts up, and the node simply joins the existing CSD3 Slurm cluster, whose configuration has been updated to include these new expected nodes. So we've got these OpenStack-powered nodes joining the main Slurm cluster. As an aside, the idea is to use Ansible for the image build process as well as for ad hoc changes, to try and keep the two in sync.

Now, in terms of rebuilding the nodes: once the cluster is up, it's quite common, monthly or whenever the next vulnerability is found, to have to update the kernel and patch vulnerabilities in the packages installed on the system. Often it's cleaner just to reimage the node, and we really need to do that without interrupting the operation of the system. When they do this with xCAT, they change the PXE boot server to start handing out a new image, so when any of the nodes reboot they PXE boot and pick up the new image. What we've done for OpenStack is a custom Slurm reboot script. When the reboot command is sent, the reason string includes a marker saying that we want an OpenStack reimage, along with the image UUID to reimage to. The script takes the instance UUID from the config drive (via cloud-init) and uses it, together with the image ID the user specified, to send a rebuild API call to OpenStack; a rough sketch of that idea follows below. That's what actually reboots the node and goes through the whole Ironic process to rebuild and reimage it. This has proven to work really quite well: we can reimage all 56 servers in a single rack within about 20 minutes. We did have to make some changes to get there, though, which brings us to tuning Ironic for scale. The main focus has been on this rebuild to apply a new kernel update, and how to do that as quickly as possible.
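Before getting into the tuning, here is a rough sketch of the reboot-script idea described above, as a hypothetical Python RebootProgram. It assumes the reboot reason string is handed to the script (read from argv here for simplicity) and that it carries a marker of the form "openstack-reimage:<image-uuid>"; the actual script and marker format may differ, and the instance UUID could equally be read from the config drive rather than the metadata service.

```python
# Hypothetical sketch of the custom Slurm reboot script: if the reboot reason asks
# for an OpenStack reimage, send a rebuild action to Nova for this node's instance.
import sys
import requests
import openstack

METADATA_URL = "http://169.254.169.254/openstack/latest/meta_data.json"

def main() -> None:
    reason = sys.argv[1] if len(sys.argv) > 1 else ""
    if "openstack-reimage:" not in reason:
        return  # plain reboot, nothing OpenStack-related to do
    image_uuid = reason.split("openstack-reimage:", 1)[1].strip()

    # The instance UUID for this node (also available on the config drive).
    instance_uuid = requests.get(METADATA_URL, timeout=10).json()["uuid"]

    conn = openstack.connect(cloud="arcus")  # assumed clouds.yaml entry
    # Send the rebuild action straight to the Nova API; Nova and Ironic then tear
    # the node down and redeploy it with the requested image.
    conn.compute.post(
        f"/servers/{instance_uuid}/action",
        json={"rebuild": {"imageRef": image_uuid}},
    )

if __name__ == "__main__":
    main()
```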
The first part was actually getting the networking working well. We're using multi-tenant networking with Ironic, in particular using the networking-generic-switch driver. We started off using the networking-ansible driver for the Cumulus switches, but this proved too slow, so we moved to networking-generic-switch, whose Cumulus support has been contributed upstream and is now merged in the latest release. The main reason this driver is faster seems to be that it only does one commit of the switch config for every port update, whereas networking-ansible was doing two commits, and that seemed to be where it was bottlenecking on the switch. It's about an order of magnitude faster: the slowest ports took about 1000 seconds to bind with networking-ansible for a whole rack, versus about 300 seconds with the Cumulus driver in networking-generic-switch. Another advantage is that extra configuration on the port, such as marking it an edge port in the spanning tree configuration, or the port description, gets persisted and doesn't get blown away, whereas networking-ansible was actually deleting all of that information.

One extra thing we developed was using etcd to batch up the requests. Basically, when a request comes in, we write the requested switch config change to etcd; every incoming request kicks off an async job that gathers all the latest requests, batches them together, submits them to the switch, and then waits for the result (the sketch after this section illustrates the idea). With this mechanism we were able to get the time to reconfigure the switch for a whole rack down to well under a minute. And when you look at the port configuration changes that happen during a rebuild, you have to pull the port out of the tenant network and put it into the provisioning network, then pull it out of the provisioning network and back into the tenant network at the end, so speeding this up was a big help to the time it took to reimage a node.

On picking the driver: we couldn't quite get the Redfish driver to work, for some reason. We originally started with the iDRAC driver, but on closer inspection, setting the PXE boot mode with it required a reboot, so we moved back to the simple IPMI driver, which effectively avoided an extra POST cycle and saved quite a bit of time. It was also quite a bit of work deciding which deploy mechanism to use. Typically we've used iSCSI, but that has been deprecated upstream, which also pushed us towards trying the direct deploy interface. In this case we're using direct deploy over HTTP rather than Swift, just because we don't have a particularly high-performing object store ready to use right now, so the HTTP method seemed sensible. Originally, switching to direct deploy, we saw a massive increase in CPU usage in ironic-conductor. I tracked this down to the force_raw_images flag; we set that to false and the problem went away. Part of it was the image conversion, but the biggest problem was probably using Python to checksum the image: with large images, that was taking a long time when all of the instances were doing it on every deploy. Anyway, setting force_raw_images to false made a big difference.
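Returning to the etcd batching mentioned above, here is a minimal sketch of the idea, not the actual networking-generic-switch code: each incoming request records the desired port change in etcd, and an async worker drains everything pending and applies it to the switch in a single commit. The key prefix and the apply_to_switch() helper are hypothetical.

```python
# Rough sketch of batching switch config changes via etcd (assumes the python-etcd3
# client; apply_to_switch() stands in for one transactional commit to the switch).
import json
import uuid
import etcd3

PREFIX = "/ngs/pending/"
client = etcd3.client(host="127.0.0.1", port=2379)

def queue_port_update(switch: str, port: str, vlan: int) -> None:
    """Record a requested port change; an async worker batches them up later."""
    key = f"{PREFIX}{uuid.uuid4()}"
    client.put(key, json.dumps({"switch": switch, "port": port, "vlan": vlan}))

def drain_and_apply() -> None:
    """Collect every pending request and push them to the switch in one commit."""
    batch = []
    for value, meta in client.get_prefix(PREFIX):
        batch.append(json.loads(value))
        client.delete(meta.key.decode())
    if batch:
        apply_to_switch(batch)

def apply_to_switch(batch):
    # Placeholder so the sketch is self-contained: one commit for the whole batch.
    print(f"committing {len(batch)} port changes in one switch transaction")
```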
Moreover, when the IPA ramdisk pulls the image from the conductor, it can now pull the QCOW2 image, which is generally much smaller than the raw image would be, and that saves more time in the data copy. This does move work from the conductors to all of the compute nodes, but given we're trying to scale this out to lots of compute nodes, that's exactly what we want to do. The particular problem here is that we're targeting the case where we want to rebuild almost 700 nodes all at once; that's really what makes this hard.

When we were looking at the CPU usage spiking, the ironic-conductor process itself was occasionally getting up to about 100% CPU. We turned off the power state sync, and that really did reduce the CPU usage, which gave more headroom. Once we'd got everything deployed that was perhaps less relevant, and we could probably look at turning it back on, but certainly turning it off made more CPU available.

There was a particularly tricky issue to debug while we were looking at this: lots of connection issues. At first I feared that downloading the image to the IPA ramdisk was interrupting the normal communications to RabbitMQ, the database, Keystone, Glance and other services. I even increased the Glance retry parameter for Ironic, but that didn't really help with the failures getting a Keystone token. With a detailed look at the HAProxy logs, and in particular at the documentation for those logs, it turned out we were hitting the connection timeout: I believe this is the amount of time allowed for the connection to be established and the request to be completed by the client. I'm fairly sure this is related to Eventlet and its scheduling of threads. I think what was actually happening is that a connection was started, then it took an awfully long time while lots of other threads started their connections, and eventually execution got back to the first one so it could finish its connection and send the rest of the request. Certainly, raising the connection timeout from the default in Kolla of 10 seconds up to about 30 seconds got rid of all of these connection problems. The first time I bumped the connection timeout in HAProxy, we then started seeing only the database problems, in a slightly different way; on closer inspection, MariaDB has its own connection timeout, which we also bumped, and that got rid of the database problems. I also tried tweaking the database connection pooling so that the pool didn't grow quite so big, giving it more chance to deal with the connections it already had in flight, but in reality I think it was the connection timeouts that made the big difference.

One final thing, on server deletes: we also saw some RPC timeouts, with things taking slightly too long when all of the deletes came in at once. There was no throttling on that; increasing the timeouts got us over the hump, though I'm sure there's something more clever I could do (see the sketch below for one possibility). Before I leave this topic, I think it's worth highlighting the monitoring, which really helped us understand what on earth was going on.
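As a quick aside on those server deletes: one option for the "something more clever" would be simple client-side throttling, so that only a few deletes are in flight at any one time. This is a hypothetical sketch with openstacksdk, not what was done on Arcus (there, the timeouts were simply raised); the "cpu-" name prefix and the "arcus" cloud entry are assumptions for illustration.

```python
# Hypothetical: throttle a mass server delete on the client side so Nova and Ironic
# are not hit with every request at once.
from concurrent.futures import ThreadPoolExecutor
import openstack

conn = openstack.connect(cloud="arcus")  # assumed clouds.yaml entry

def delete_server(server) -> str:
    conn.compute.delete_server(server.id)
    return server.name

# Select the nodes to delete (illustrative name prefix) and keep only a handful
# of delete requests in flight at any one time.
servers = [s for s in conn.compute.servers() if s.name.startswith("cpu-")]
with ThreadPoolExecutor(max_workers=5) as pool:
    for name in pool.map(delete_server, servers):
        print(f"delete requested for {name}")
```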
On the monitoring side, we used the built-in Prometheus support in Kolla Ansible, alongside Fluentd, Elasticsearch and Kibana to collect the logs and look at them across all the controllers, and that worked really well to get visibility on what was happening on the controller nodes. In addition, we added some extra bits. We ran node_exporter on the Cumulus switches. This was particularly useful for seeing traffic flowing into iPXE and into the IPA ramdisk, where we didn't really want to run an exporter: we could get visibility into the traffic between all the compute nodes just by running node_exporter on the Cumulus switch, and that worked surprisingly well. We were also using the Redfish exporter to get hardware metrics; in particular there were metrics about things like power cables getting knocked. I'm sure in full production this will be even more useful, but during bring-up it was very useful in spotting some of the problems. As for the insights into what was happening when an ironic-conductor was using maximum CPU: the guru meditation reports were really useful in telling me exactly what all the Eventlet threads happened to be doing. We did see some issues where image downloads looked like they were stalling other ironic-conductor threads, such as the DB updates. In particular, we tuned the amount of time it takes for an ironic-conductor to decide another conductor is dead and redo the hash ring, just in case the DB was busy. Using the guru meditation reports we were able to see what all the Eventlet threads were doing, which was really useful.

So what's next? Going back to the original vision of one OpenStack to rule them all, with a single hardware pool: we're not there yet. This work has proven that we can use OpenStack and Ironic to provision a large-scale HPC system, and that's worked really well. But right now we're in the situation where half of the cluster is still using xCAT and the new bits of the cluster are using OpenStack. We could just let the old hardware age out, but it may well make sense to move some of it into OpenStack so we get the flexibility we need between the different pieces. There's some SR-IOV work we'd like to do: we're using InfiniBand rather than Omni-Path on the new cluster, so we should be able to get RDMA inside the VMs using SR-IOV. We've done this elsewhere; we just want to bring it to this system. Building on all of that, we want to use OpenStack Magnum-deployed Kubernetes clusters to make use of it, in particular targeting things like Horovod for distributed machine learning, where you want an RDMA transport for low-latency links between the GPUs attached to all the different VMs. And generally there are some loose ends around operational tooling. One key thing we want to be able to do, given we've got this overcloud Ironic that holds the whole Slurm cluster, is to convert some of those nodes into hypervisors to host VMs and various other things, and to make that really easy: taking the node out of the Slurm cluster is a well-known procedure, but we just want to be able to easily deploy the hypervisor onto it rather than having to...
Currently you typically have to move the node over and enrol it in the Bifrost that we've been using to deploy hypervisors, but really we should be able to directly use the Ironic that's running within the cloud to deploy the hypervisors, if that's what makes sense, which it should be in this case. There are some bumps around networking: we need to trunk VLANs down, and there's no support for that in the NGS Cumulus driver, but I'm sure we can work our way around that one way or another. Making this slick will really help make that "one OpenStack" vision more of a reality.

So thank you very much for listening. I kind of hate thank-you slides; I always feel like I've missed someone off. A big thank you to the OpenStack community: lots of support through OpenDev and lots of great discussions really helped with this work, and it gives a lot of confidence on the direction, seeing a lot of people doing this scale-out bare metal work. There have been lots of great conversations with the folks at CERN on what they're doing, because it's a very similar use case. A great thanks to Cambridge University for being such great partners to work with, and to all the funding bodies that have funded this work. The link at the bottom of this slide is a press release on the background of where this particular expansion of CSD3 came from, in the context of that funding, and also all the industry partners and vendors that have helped support the co-development and co-design of this project: Dell, Intel, NVIDIA, Mellanox (maybe they're the same thing now). Thank you very much everyone for listening.