Hi, I'm Scott. I work at G-Research in the Cloud Engineering department. My main role is to deploy OpenStack and Ironic and all the fun that comes with that. So, who are G-Research? We're a fintech company based in London and Dallas. We have a bunch of quantitative researchers who create algorithms to predict movements in financial markets, and then we trade against that. So we build a big research platform for the quants to build their models on and try to predict how the markets are going to move. And we're currently in the process of moving away from Windows and HTCondor over to a full Linux, Kubernetes and OpenStack open-source stack. So, this is a high-level view of our architecture; it's pretty high level and I won't go into everything. Some of the main services we use: Ironic for a start, and in there we have Inspector, PXE and the conductors. To build the bare metal nodes, we make quite heavy use of things like Inspector and PXE. Essentially, what happens with an Ironic node is you turn it on, it PXE boots into a RAM disk image with a bunch of preloaded scripts in there, and those run to do different tasks. Then, if we move into the middle, we've got Neutron. To move a bare metal node from network A to network B, say from the provisioning network to the tenant network, we use Networking Generic Switch. That essentially allows you to SSH into a switch, or use an API, or however you need to interact with the switch; NGS is the plug-in that we use. If you have a vendor-supported plug-in, that's great, but if you don't, Networking Generic Switch is a good thing to drop in there. And if you have multiple vendors in your switch layer, Networking Generic Switch can give you one plug-in to rule them all, and that's exactly what we do. You basically just supply it with the commands it needs to move a port from one access VLAN to the next, and it does the rest for you; you don't really have to get too involved in the Neutron ins and outs. Then we have Nova. Typically, when a user builds a bare metal node, they don't interact with Ironic directly. They just interface with Nova, which gives them a consistent workflow when moving from VMs over to Ironic. All they really have to do is change their flavour, and then it's hopefully a nice, easy workflow for them. At the top of the diagram, we've got Terraform and Jenkins. We run pretty much everything via automation: Jenkins runs pretty much all the orchestration side of things, and users interface with Terraform. Digging a little deeper into the architecture, we make pretty heavy use of things called conductor groups. You can think of a conductor group as a collection of Ironic nodes. In the diagram you can see we've got three AZs, each with up to 1,000 bare metal nodes. The way that typically works is that the stuff on the previous slide gets split between one to three AZs, depending on the size of the hall and how much room we've got.
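To make the conductor group layout concrete, here's a minimal sketch using openstacksdk; the cloud name, node name and group name are all invented, and this isn't our actual tooling:

```python
# Minimal sketch, assuming openstacksdk configured via clouds.yaml.
# Cloud, node and conductor group names here are hypothetical.
import openstack

conn = openstack.connect(cloud="gr-openstack")

# Pin a node to the conductor group that maps to one data centre pod,
# so only that pod's conductors manage it.
node = conn.baremetal.find_node("bm-node-0001")
conn.baremetal.update_node(node, conductor_group="az1-pod1")

# List everything in that pod, e.g. before rolling a change out one AZ at a time.
for n in conn.baremetal.nodes(conductor_group="az1-pod1", details=True):
    print(n.name, n.provision_state)
```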
But the thing we do differently with conductor groups is we split those up into an AZ with a set of conductors and up to 1,000 nodes. That maps to something like a pod in a data centre, and when I say a pod, I don't mean a Kubernetes pod; I mean a collection of racks and cooling and all that fun. So we're piling about 1,000 nodes into one of those, which keeps it all siloed and reduces our blast radius. One other thing that helps us do is roll out changes one AZ at a time. In a typical data centre, say we had five or six pods, we can roll a change out one pod at a time, and if we make a breaking change, we only take out a fifth of the infrastructure at once. So, into Ironic. The four main states we really care about are enrolling, cleaning, available and provisioning. There are some transitional states in between, but we won't go into those for now. When a bare metal node is provisioned, it's given to the user, and when it's deleted, it's cleaned and moved back into the available state; it should be in the same state it was before the user took it, ready for the next person to pick up. We use this feature a lot at GR: we try to have our nodes handed back to us once a month, and that's what we do today. The user takes the nodes, uses them, usually for Kubernetes work or something like that, and then they come back, get cleaned, and go into the pool, and the user pulls something else out of the available pool. That cycle just continually happens. Digging in a bit deeper, we've got enrolling. When a machine is rolled into the data centre, it's first enrolled, and then we have what we call the pre-inspection phase, which happens before we actually go into Ironic inspection. We create the record of the machine in the Ironic API, we set the resource class, and we reset the BIOS to baseline settings, which is really just for consistency; we also set the baseline configuration of the BMC, whether that's an iLO or an iDRAC or whatever it is. You can see the resource class there at step two. If anyone doesn't know what a resource class is, it defines a type of hardware: you'd say, I've got this model of server, it's got four disks in it, they're all of a certain size, and it's got this much RAM. Defining the hardware like that lets us confirm we got what we paid for, for a start, and then lets us make sure that the things going in are consistent, with consistent minimum versions of firmware and a reset BIOS, so everything looks the same. With this stuff, if you do lots of the same thing, it scales quite well, so setting it all here and making sure that everything's consistent really saves you a lot of time over the long run.
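As a rough sketch of what that pre-inspection enrolment amounts to (the driver choice, addresses and names below are invented, and the BIOS/BMC baselining is done outside Ironic with vendor tooling):

```python
# Sketch of enrolment with a resource class, assuming openstacksdk and an
# IPMI-managed node; all names and credentials are placeholders.
import openstack

conn = openstack.connect(cloud="gr-openstack")

node = conn.baremetal.create_node(
    name="bm-node-0001",
    driver="ipmi",
    resource_class="gr.compute.large",  # "this model of server, these disks, this RAM"
    driver_info={
        "ipmi_address": "10.0.0.10",    # the BMC / iLO / iDRAC address
        "ipmi_username": "admin",
        "ipmi_password": "REDACTED",
    },
)
print(node.provision_state)  # a freshly created node starts in "enroll"
```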
And on that point, I'd say it's quite good to make this section quite opinionated, because if you can get the stuff into Ironic and it's all consistent, Ironic does a really, really good job of looking after things from then on. If you're not checking that you've got all the disks at enrolment, and then your end user spins a node up and asks where two of the disks went, that's not great. So if we make it a little bit harder to get things in in the first place, and make sure that everything is exactly what we want before we throw it into the farm, we make sure everyone's happy in the long run. Once we're ready to go, we've got a record of it in Ironic (we've done our openstack baremetal node create) and the machine's ready to be inspected. Inspection is where the machine is turned on and booted into the RAM disk. The IPA (Ironic Python Agent) then runs a set of scripts, as I said earlier, and the machine discovers exactly what's there: how many NICs we've got, what the CPU model is, and so on, and posts that back to the Ironic API. Afterwards we can make use of things called inspection rules to verify the assumptions we're making about the hardware. Has it got the correct name as its switch port description on switch A and switch B? Is the MTU right? All things like that. We also check for cabling issues: if it's plugged into one port on the A side and a different port on the B side, there's probably a cabling error, and we tend to find quite a lot of those going through. It's good to get that out of the way at the beginning, because as soon as you throw Networking Generic Switch in there, you're logging into switches and you're changing access ports, and you want to make sure you're changing the right access ports and not just changing things blindly. Once we have all this information from inspection, we can do the next part, which is to create the bare metal port. That means Networking Generic Switch has enough information to actually move the machine. Before that, when we boot the node up for inspection, we need to make sure it's on the right network, because Networking Generic Switch doesn't yet know what to do; it discovers that during inspection. So the node just gets powered on, we do inspection, we check the inspection rules, and then we store that information as a bare metal port, which lets us move on to the next part: cleaning. Cleaning in Ironic is a set of tasks that recycle a machine back to a known state. The tasks are ordered by priority; the higher the number, the earlier the step runs. We've built our own custom hardware manager that lets us plug in our own cleaning steps without forking the main Ironic code, which is great, because maintaining forks is not fun when you want to do an upgrade. Ironic does a good job of allowing us to plug in our own hardware manager, and then we can do things like setting the hardware clock using NTP and verifying that the firmware is as we expect. We can then update the firmware, and then we can wipe the hard disks.
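For anyone wondering what a custom hardware manager looks like, here's a minimal sketch following the pattern in the Ironic Python Agent docs; the class and the clock step are simplified stand-ins, not our production code:

```python
# Sketch of a custom IPA hardware manager exposing one extra clean step.
from ironic_python_agent import hardware


class GRHardwareManager(hardware.HardwareManager):
    """Plugs custom clean steps into IPA without forking Ironic."""

    HARDWARE_MANAGER_NAME = "GRHardwareManager"
    HARDWARE_MANAGER_VERSION = "1"

    def evaluate_hardware_support(self):
        # Claim we can service any hardware; a real manager might probe first.
        return hardware.HardwareSupport.SERVICE_PROVIDER

    def get_clean_steps(self, node, ports):
        # Higher priority numbers run earlier in the cleaning run.
        return [{
            "step": "set_hardware_clock",
            "priority": 90,
            "interface": "deploy",
            "reboot_requested": False,
            "abortable": True,
        }]

    def set_hardware_clock(self, node, ports):
        # In our real step this syncs the hardware clock via NTP;
        # the body is omitted here.
        pass
```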
There are things in Ironic to let you wipe the hard disks, but one of the things we do is use an external encryption device which gives you a bunch of encryption keys. If you try to zero out a hard disk and it's a large spinning disk on a big data node, it might take you a few days; instead, we've written our own clean step which talks to our encryption device and just rotates the keys, so it's super quick and the node can get back to the users. So yeah, this is quite good. Then there are things like setting the clock via NTP, which is more of the consistency point I keep plugging: consistency really is the most important thing. And lastly, once we've configured the new keys, we take that fresh storage and set up any RAID config that's defined in the resource class. Once the node's been cleaned, we can run some tests. If you're using automation, it's always good to have tests that actually check things are as expected. We have two types. The first is burn-in tests, which let you run some kind of burn-in as a cleaning step; that's built into Ironic, and we use the CPU and RAM versions of it. The second is creating a test instance using Nova. That's useful because we can build a machine onto either the tenant's network or a test network that looks pretty much the same as theirs, and that lets us build the bonds and make sure everything works with the bonds, because when you go into cleaning you only actually bring up one side of the switch. One of the downsides is that if one side of your switch is down, you're probably going to get a clean failure. But one of the good things about this test is that the instance looks exactly like what the user is going to receive on the other end. And when they hand it back, we just clean it, put it into the available pool, and it's all ready. These steps are optional: we turn them on for some resource classes and off for others, and it really depends on the use case. If we get a bunch of new hardware in and we just want to get it enrolled really quickly, we might not want to spend a day burning it in. But if we're repurposing hardware, which is one of the things we've been doing with this move away from Windows and HTCondor, we've been given a lot of hardware that we want to check: is it actually working, is it a bit flaky, do we have disks in there that don't work or haven't been detected? Generally, if someone's going to give you a server, they're probably not going to do the due diligence to check whether it's ready for you. So this is great: it's all built in and it just works. For creating the test instance, we just use Ansible. We create an instance using Nova, it comes up, and we check that we can get to its node exporter in Prometheus and that some parameters are as expected. If that all happens, we're good to go. And what does that mean? It means everyone's got a bare metal server, it's time to go to the pub, and we can all chill out. What it actually means is we've done the first part: we don't need to worry about enrolling any more, it's in.
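That test build, stripped to its essentials, amounts to something like the sketch below; we actually drive it with Ansible, and the flavour, image and network names here are invented:

```python
# Sketch: boot a test instance through Nova, then check node exporter answers.
import openstack
import requests

conn = openstack.connect(cloud="gr-openstack")

server = conn.compute.create_server(
    name="bm-test-0001",
    flavor_id=conn.compute.find_flavor("gr.baremetal.large").id,
    image_id=conn.compute.find_image("ubuntu-22.04").id,
    networks=[{"uuid": conn.network.find_network("bm-test-net").id}],
)
server = conn.compute.wait_for_server(server, wait=3600)  # bare metal is slow

# If node exporter scrapes cleanly, the OS and the bonds came up properly.
ip = server.addresses["bm-test-net"][0]["addr"]
requests.get(f"http://{ip}:9100/metrics", timeout=10).raise_for_status()
print("test instance healthy")
```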
If it's all consistent, Ironic does do a very good job of looking after the life cycle management, moving the node between states as it's handed back to us and back out to the user: cleaning, provisioning, active, cleaning, available, and so on. So then we can move on to the provisioning part. At this point we've enrolled the node and it's ready for the user. It's come up, it's available, they can see it on their dashboards, they can check it in Prometheus, and they can see they've got nodes available to build on. And this is all we need them to know. That's one of the great things about using Nova in front of it: a user doesn't actually have to understand too much of the nitty-gritty OpenStack stuff. That's my role as a cloud engineer; we're the abstraction layer for them. When they get onboarded to OpenStack, we give them a Git repo which they can push Terraform code into, and we have automation in the back end so that when that's peer-approved and merged, it gets built in the cloud. So they need to know the flavour, the network, the AZ and the image. Really, the only difference from a VM is just the flavour. So if you're trying to repurpose a cluster from VMs to bare metal nodes, it's a one-line change: throw it away and rebuild it. All the VMs will go, you'll have more capacity on the VM side, and then, as long as you've got some bare metal nodes on the other side, it will build. The steps to provision a node: the user writes the Terraform code and specifies the flavour, the network and the AZ; Terraform does the interaction with Nova; placement reserves the bare metal nodes; and Neutron goes in and configures the port, so it logs into the switch and moves the node to the provisioning network. Then we run some deploy steps to configure BIOS settings. Some users want specific things: sometimes they want hyper-threading on, sometimes off, and having two resource classes to make that static would be quite a lot of operational overhead; we don't really want a ticket just to flip one BIOS setting. What you can do with Ironic is use deploy steps to actually go and change things at deploy time. We don't do the RAID config here (we could), but we do things like BIOS settings, and then all the user needs is a flavour with it on and a flavour with it off. They just use traits, and they can even use the same machine: one time they have it on, the next time they have it off. Deploy steps will do that for them. So the machine is turned on, Neutron has done its thing and moved it to the provisioning VLAN, and we stream the image down from Glance. When that's done, the node reports back to the conductor and says, I'm happy, I'm ready to go; Neutron goes and configures the port again, the server reboots, comes up into the OS, and the user can log in.
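As a sketch of that trait and deploy-step mechanism: you create one deploy template per trait, and Ironic runs the matching template's steps at deploy time. This assumes an openstacksdk recent enough to expose deploy templates; the trait name and the BIOS setting name are examples and will be vendor-specific:

```python
# Sketch: a deploy template whose name matches a trait a flavour can request.
import openstack

conn = openstack.connect(cloud="gr-openstack")

conn.baremetal.create_deploy_template(
    name="CUSTOM_HYPERTHREADING_ON",
    steps=[{
        "interface": "bios",
        "step": "apply_configuration",
        "args": {"settings": [{"name": "LogicalProc", "value": "Enabled"}]},
        "priority": 150,
    }],
)
# A flavour then carries the extra spec
#   trait:CUSTOM_HYPERTHREADING_ON=required
# and the same machine can come up with hyper-threading on or off,
# depending on which flavour was used.
```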
So, all the stuff I've spoken about is great, but how does it actually solve our technical issues? If it doesn't do that, we've wasted a bunch of time and money. Armada is our biggest use case for Ironic. Armada is a multi-cluster Kubernetes batch scheduler: it distributes millions of batch jobs per day across thousands of nodes and many clusters. Armada is fully open source and a CNCF sandbox project. If you look at the architecture of Armada, when you deploy it at a big scale (you can do it all on one node or one cluster if you want, but for the purposes of talking about thousands of machines, this is the architecture you'd go for), you deploy a Kubernetes cluster which serves the Armada API, and that has the ability to push batch jobs to lots of back-end clusters that are separate from the API cluster. So in the diagram you've got an Armada cluster at the front, and in the back end you've got three different clusters. What that gives you is the ability to scale out, add and remove capacity, and rebuild an entire back-end cluster without taking out the main API, so it doesn't affect the end users; it's completely invisible to them. We typically scale our Kubernetes clusters up to about a thousand nodes, and that approach works really well with the way we've built the AZs and the conductor groups in Ironic, because we build those up to about a thousand nodes too. So typically we have a one-to-one relationship between a full conductor group and a Kubernetes cluster, which allows us to go up to full capacity; and if we need to take out a whole pod for maintenance, or rebuild a whole Armada cluster because it's moving from one data classification to another, we just throw it away and rebuild it. Ironic fits right in there. The way this works when we actually deploy Armada (I'm not actually an Armada guy, but there are some people here I can point you to) is that we have Terraform, which builds the bare metal machine, and the minimum set of configuration is passed through an Ignition script if it's Flatcar, or cloud-init if we're using Ubuntu. That gives you a vanilla Kubernetes cluster with some internal customisations: it gives service accounts the ability to log in and admins the ability to do the things they need to do. Then we use a combination of Jenkins and Argo CD to go and apply Armada as a software stack on top of that. And we have some internal tooling within GR which deals with things like the day-to-day operations and the rebuilds, finding nodes that are older than 30 days and marking and draining those. There are two ways we look at rebuilds. One is: I need to take out cluster A because it needs a firmware update or a security patch, so we set it to drain, it drains the jobs, and when the last job finishes it calls back and says we're ready; then we can throw it away and rebuild it, or just throw away a portion of the nodes if we want to. The other is the constant rebuild cycle, which looks for candidates and just runs continuously. We used to do that with VMs as well, so it's been ported over, and again, with Nova in front of it, there's not too much change going on there. Armada is by far our biggest use case for Ironic, but it's not our only one: we deploy big data stacks like Apache Spark and Hadoop with Ironic, and we've also started building our OpenStack hypervisors using Ironic as well, which is great, because it allows us to drink our own champagne.
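The rebuild tooling is internal to GR, but the candidate-finding half of that 30-day cycle might look roughly like this hypothetical sketch with the Kubernetes Python client; the threshold and the cordon-only behaviour are simplifications:

```python
# Hypothetical sketch: cordon worker nodes older than 30 days so the batch
# jobs drain away and each node becomes a rebuild candidate.
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

cutoff = datetime.now(timezone.utc) - timedelta(days=30)

for node in v1.list_node().items:
    if node.metadata.creation_timestamp < cutoff:
        # Cordon only; draining and the Terraform delete/create are handled
        # by separate pipelines in the real tooling.
        v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
        print("rebuild candidate:", node.metadata.name)
```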
That said, it does come with some pros and cons, of course, like everything. The pros first. We have better support for our GPU workloads. We'd got into a situation where we ended up with a one-to-one mapping between a hypervisor and a GPU node, using PCI passthrough to get the GPUs into it. If anyone's done that before, they've probably realised it's quite hard to then do anything with those nodes: when it comes to maintenance, you can't live-migrate, because there's a GPU attached, and what's the point of having that extra virtualisation layer when you don't need it? That's one of the reasons we picked Ironic to move over to. It also allows us to use BGP peering between our worker nodes. This is research, so we have a lot of high-throughput applications that need to get to fast storage behind them to pull out big datasets; with Kubernetes we can do BGP peering between the nodes, which means we've increased our overall throughput. And we've got a simpler estate: fewer layers in the stack, bigger worker nodes, and maintenance becomes a lot easier. We still love VMs for mixed workloads, but where we ended up with lots of pets, things that aren't a great fit for virtualisation, like having a hypervisor just to run one VM, taking all that out and throwing it into Ironic is a really big pro. Then some of the cons. Slower provisioning times: no surprise, it takes a lot longer to turn on a server than to boot a VM. The need for more precise quota and capacity management: quotas with Ironic are difficult. We have a lot of internal dashboards we use for capacity management, and as soon as you throw Ironic into that, it gets tricky, because you have to give people quota in Nova. It's easier for me because I work on the private cloud side, and generally the users aren't going to do things they shouldn't (they do try sometimes), but you could have quota for a big bare metal machine, throw that away, and go and make hundreds of VMs with the same quota, and if we don't have the capacity for that, it's going to cause us issues. Bare metal nodes are also a bit less flexible than virtual machines: we obviously can't live-migrate things, so if there's a failed DIMM in a server, we have to contact the user and say, turn it off, or we're going to turn it off, or please drain it. And we find it tricky to mix and match clusters; it's easier to just start from scratch. To be honest, that's more a side effect of the fact that Ironic has made it quite easy: we can just throw away a VM cluster, change a flavour, and build it as bare metal. It's far easier than running mixed workloads and trying to migrate them across. When I talk about the mixed use case, I mean having worker nodes on both VMs and bare metal. We had a few internal issues with that; you probably could get it to work, but we made it easy enough not to need it. You can leave the old VM clusters sitting behind Armada, Armada will schedule jobs to them, and you just bring in your bare metal clusters when you're ready to use them.
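On that "just change the flavour" point: a bare metal flavour is essentially one that requests a whole node of an Ironic resource class instead of vCPUs, RAM and disk. A sketch with invented names, assuming a recent openstacksdk:

```python
# Sketch: a Nova flavour that schedules onto bare metal via a resource class.
import openstack

conn = openstack.connect(cloud="gr-openstack")

flavor = conn.compute.create_flavor(
    name="gr.baremetal.large", ram=196608, vcpus=64, disk=960)

# Zero out the virtual resources and request one node of the custom class;
# "gr.compute.large" in Ironic becomes CUSTOM_GR_COMPUTE_LARGE in placement.
conn.compute.create_flavor_extra_specs(flavor, {
    "resources:VCPU": "0",
    "resources:MEMORY_MB": "0",
    "resources:DISK_GB": "0",
    "resources:CUSTOM_GR_COMPUTE_LARGE": "1",
})
```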
A couple of lessons learned. Before the days of Terraform, we'd go in and manually delete things: you'd have a traditional sort of sysadmin who would go in, delete things and then recreate them, and that operation was pretty slow. When Terraform and similar tools came along, there was suddenly no time between the delete and the create. As long as you've got an available pool, you can do a delete and a create back to back, and the node will just go through cleaning and come out the other side. What we found is that some exceptions weren't being raised correctly, and when a node was failing, it wasn't actually retrying on another node. Every time you try to rebuild a server and the BMC can't be contacted, another node should be picked from the candidates and the build should be tried there; we found that didn't work. But we fixed it: with the help of the guy from StackHPC, we managed to fix it and push it back upstream, which was great community collaboration. Once we did that, we unearthed some more race conditions. If you have a node that's active, you delete it, and you rebuild some other nodes in the meantime, placement will hand you back those nodes in its allocations. Basically, Nova says the instance is now gone, Ironic moves the node to cleaning, but the node still looks available to Nova, so Nova goes and tries to build on it; Ironic says, I'm cleaning, throws an exception, and it comes back round in this horrible race-condition cycle. What we did to fix that was change the ordering of how candidates are looked up in placement, so that Nova doesn't try to rebuild onto a node that's already been taken. Once we did that, we were a lot more successful. There are a couple of links on there if anyone actually wants to see the work we did in collaboration with StackHPC to fix it; we could probably do a whole presentation on that, so I won't do it now. And then, unfortunately, when you've done all the stuff I've been talking about for the last half an hour, your day-to-day operations are the next thing. We started aggressively rebuilding all of our machines to work through the teething issues, rebuilding more often than the 28-day cycle, just constantly, and there wasn't much running on the clusters when we first started doing this. We were hitting a 25% failure rate for the first couple of weeks, which isn't great: you could say that 75% of the stuff worked, but 25% of it didn't. We had to do quite a lot of work to iron out those teething issues, and we found it wasn't so much things in Ironic; it was more the back-end systems that we'd never hit with a thousand rebuilds at a time. There were other parts we needed to scale out, and we worked with vendors on the switches, moving things and making sure everything was happy. Rome really wasn't built in a day, but with a decent amount of effort from Cloud Engineering and the rest of the company, we managed to find a lot of those issues, and the 28-or-30-day rebuild cycle now largely just runs itself. I was literally not even checking it, and I looked at the dashboards a couple of weeks ago and noticed it's rebuilding things in the thousands every day, with no operational work that really needs to go on there. We obviously have failures, but we also have other pipelines which will then try to rectify things and self-heal. If we have a node that's in clean failed, we have a pipeline which will find it and retry the clean, because sometimes there are intermittent issues, like someone pulled a cable or someone was doing a bit of maintenance on the network, and just a retry will fix it, with no need for us to really worry about it too much. If it gets to the end of that and it hasn't managed to fix the node, it will open us a ticket.
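At its core, that retry is just pushing clean-failed nodes back round the state machine. A hypothetical sketch (our real pipeline also tracks retry limits and raises the tickets):

```python
# Hypothetical sketch: re-run cleaning for every node stuck in "clean failed".
import openstack

conn = openstack.connect(cloud="gr-openstack")

for node in conn.baremetal.nodes(provision_state="clean failed", details=True):
    # "manage" takes the node back to manageable; "provide" re-runs cleaning
    # and, if it passes, returns the node to the available pool.
    conn.baremetal.set_node_provision_state(node, "manage", wait=True)
    conn.baremetal.set_node_provision_state(node, "provide")
    print("retrying clean:", node.name)
```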
But yeah: having things like consistency, good testing as things go in, and metrics that tell you what normal looks like for you (are things taking twice as long to go through cleaning today as they were yesterday?), and looking for those kinds of pointers, really pays dividends over time. And that's it, really. I think we've overrun, so I don't really have time for questions, but feel free to grab me if you see me walking around, or if you want to know any more about Armada, give us a shout. Thank you.