Gachnang, yeah, it's Swiss, I think. All right, guys, welcome to the talk. My name is Josh Gachnang, and this talk is on operating Ironic, in case you're in the wrong spot. I'm a developer and operator of Rackspace OnMetal, which is a bare metal cloud. In this talk, I'm gonna cover a lot of the pain points we had getting OnMetal up and running, and the ongoing pain points we have operating it day to day.

Quick rundown of what we're gonna go over: a quick overview of OnMetal; the Ironic Python Agent (IPA), a deploy system we built to make operating a little easier; the pain points we have both setting it up and dealing with it day to day; a repo we just open-sourced of all the scripts we use to manage everything, because life is a lot easier when you can run a script instead of working the CLI all the time; a dashboard we also open-sourced, which has been key to our operational success; some of the scaling pain points; and then both the release that just happened and the release that's coming up, and how those affect operators.

We've been running OnMetal for about a year now. It works with the Nova API: you make the same call you would to build a virtual machine, and instead you get a full bare metal machine. You have full root access, you can do whatever you want. It comes in three flavors. There's a compute node, the cheapest one, optimized for compute-bound problems. There's a memory node with half a terabyte of RAM, if you can figure out a way to use half a terabyte of RAM. And there's an IO node with two very fast PCI Express solid-state drives, great for running databases. All the hardware we're using is based on Open Compute, which has caused some operational issues at scale, but overall has been a pretty good platform. And it's based on Ironic. With our next deployment, it's based on actual master: we're deploying master as it goes, with some patches on top, most of which we're hoping to upstream in Liberty so that we can run straight master or pretty close to it.

So, a quick overview of Ironic. This is pretty much how Ironic works: you say you need a bare metal server, you talk to the cloud, the cloud does magic, Pixie Boots (the bear that's the Ironic mascot) gives you a bare metal server, and bam, you have a server. For a little more depth, you can think of Ironic as basically a bare metal hypervisor. It's very similar to how Nova interacts with Xen. You have a Nova API that your user talks to. With Xen, the Nova API talks to nova-compute, which runs on the actual hypervisor and builds virtual machines. In our case, it talks to the Ironic service through a nova-compute, and it builds bare metal machines. There are a couple of ways Ironic actually talks to the hardware; it's a driver-based system.

When we first wanted to build a bare metal cloud, we looked at Ironic, and the only available drivers were the PXE drivers. We looked at them and thought they were okay, but not what we were looking for. Basically, the deploy model is: it boots a Bash-based RAM disk, and that RAM disk exposes an iSCSI target back to Ironic.
Ironic downloads the image the user requested, caches it on the conductor, and then writes it over iSCSI onto the bare metal node. Then it reboots the machine, which PXE boots but chooses the local disk this time, and bam, you have a bare metal node. This worked great for some use cases. Ironic was using it for OpenStack on OpenStack, a way to deploy your cloud on top of bare metal nodes, and it was working great there, but it wasn't what we were looking for.

So we built something called the Ironic Python Agent. It's still a RAM disk based deploy system; when you're looking for it, you'll find it as the agent_* drivers, for example agent_ipmitool or agent_ilo, and it works with a bunch of different types of hardware. It's based on a CoreOS RAM disk that runs a container with a Python REST API inside, which it exposes to Ironic. Ironic talks to it via REST, just normal HTTP calls. Really, we built it not just because we didn't like the iSCSI stuff, but because we wanted a RAM disk that was focused on operability and on being extensible, so if you have different use cases than we do, you can implement and use those.

So I'm gonna dive into that a little bit. For the operability focus, like I said, there's a REST API, and that's really key for us. One thing I should note: it's always running. Any time there's not a customer on a box, we have IPA running on it, which means we can always query the hardware. The Open Compute boxes we have use a very simple out-of-band system, unlike an iLO or a DRAC that you'd get on HP or Dell hardware, respectively, so this REST API gives us a lot of control over the node. For example, we can send a request to IPA saying, please verify that the ports you're connected to are what we think they are. It listens for LLDP packets; if they match, the node's good. If not, it throws an error, and then we know that machine isn't ready for a tenant. So yeah, the REST API is very helpful for us, and it's really easy to write scripts against, which I'm gonna cover in a bit.

But IPA is also extensible. When we came to Ironic, there wasn't a concept of cleaning up bare metal nodes after they'd been provisioned and deleted, so the disks didn't get erased. And we're creating a bare metal cloud for multiple tenants; that's kind of important, right? So we built IPA to be extensible. For a while we were running our own extension that did decommissioning, and that's now been upstreamed; it's called cleaning instead. Having a RAM disk running on the node also gives you a lot of options. If you have to flash firmware for your RAID card, and that can only be done inside an operating system instead of out of band through something like an iLO, IPA will let you do that. You can write whatever you want, plug it in, and it'll run. Things like firmware upgrades happen after every single delete, and that's really key for us; it's made managing our cloud much, much easier.

And there are hardware managers. Basically everything in IPA can be replaced by a hardware manager: the way images are downloaded, the way images are written, that's all completely extensible. We have one that works for our hardware, called onmetal, and there's a proliantutils one. They're very easy to write.
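To make that concrete, here's a minimal sketch of what a custom hardware manager can look like, assuming the ironic-python-agent package. The class name, clean step, and priority are illustrative, not our actual onmetal manager.

```python
# A minimal sketch of a custom IPA hardware manager. Assumes the
# ironic-python-agent package; the manager name, step, and priority
# are hypothetical, not our real onmetal manager.
from ironic_python_agent import hardware


class ExampleHardwareManager(hardware.HardwareManager):
    HARDWARE_MANAGER_NAME = 'ExampleHardwareManager'
    HARDWARE_MANAGER_VERSION = '1.0'

    def evaluate_hardware_support(self):
        # Higher values win when multiple managers implement a method.
        return hardware.HardwareSupport.SERVICE_PROVIDER

    def get_clean_steps(self, node, ports):
        # Steps Ironic runs automatically during cleaning, by priority.
        return [{'step': 'upgrade_example_firmware',
                 'priority': 10,
                 'interface': 'deploy',
                 'reboot_requested': False}]

    def upgrade_example_firmware(self, node, ports):
        # Vendor-specific firmware flashing would go here.
        pass
```

Managers like this get registered through a setuptools entry point so IPA discovers them when the RAM disk boots.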
But it's not exactly perfect. Since we have this RAM disk running across our whole unprovisioned fleet all the time, upgrading IPA is a little harder. You have to reboot nodes into the new RAM disk, but you don't want to reboot all of your unprovisioned capacity at the same time. So we wrote a script that does this for us in a staggered way, so customers don't even notice. We get new versions out, you can just fire and forget, and it works really well.

Another big problem with IPA is that it's a 200 megabyte RAM disk. That's a pretty big thing to be booting over the network via TFTP; with any bit of flakiness in your network, it's gonna fail, and you're not gonna be able to boot anything. We've solved this with iPXE. You chainload into an iPXE bootloader, which downloads the RAM disk via HTTP, so you get a TCP connection instead of UDP. Much better. You use a little more bandwidth, and you're not caching anything on the conductors, but if you have a decently sized Swift and Glance cluster, it's not really a problem. One big thing to note: the REST API I was talking about is not authenticated at this point. So don't expose it to the internet, or people will start flashing firmware onto your boxes. Not a good thing. We keep it on a secure network any time that RAM disk is running.

So if you've followed all this, you can see that operating it can be a little painful at times. But it's not that bad. On OnMetal, all of our engineers are also on call: all the code you write, you'd better be ready to fix, because if it fails, you're getting called at 2 a.m. Not fun. This makes us engineers into operators, so we feel a lot of the pain the community feels, instead of just writing some code, throwing it over the wall, and hoping it works.

One of the things we've really noticed and have been actively trying to fix: with the Ironic deploy system, networks fail a lot. I think everyone knows that. But with Ironic, especially since we're switching nodes between a provisioning network and a public network like the internet, there's a lot of interaction between us and Neutron, and between us and Nova, and any of those things can fail and cause a node to fail to boot. Ironic talking to Neutron can fail, obviously, from the network. Neutron itself sometimes has errors. The plugin we use for Neutron talks to the bare metal switches, and that communication can fail, or there can be authentication issues. And the switches themselves are flaky too; you can always have a switch fail to do something.

So another script we wrote basically deals with all the nodes that land in any kind of network failure. We call it "fix neutron fails". It lists all the errored nodes in Ironic, classifies them based on their provision state, their last error, things like that, and then basically walks them back through the delete process they'd get if someone deleted the node normally. It powers the node off, puts it back on the provisioning network, powers it back on, and makes sure everything's working. At the end of that, you get a nice, cleanly decommissioned node. One thing we've been considering is making this more automatic.
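The shape of that script is roughly this; a hedged sketch using python-ironicclient, with hypothetical credentials, and with the Neutron port juggling elided:

```python
# A hedged sketch in the spirit of our "fix neutron fails" script,
# using python-ironicclient. Endpoint and credentials are hypothetical,
# and the real script also handles the Neutron port moves itself.
from ironicclient import client

ironic = client.get_client(
    1,
    os_username='admin',
    os_password='secret',
    os_tenant_name='admin',
    os_auth_url='http://keystone.example.com:5000/v2.0')

for node in ironic.node.list(detail=True):
    error = (node.last_error or '').lower()
    if node.provision_state == 'error' and 'neutron' in error:
        # Walk the node back through the normal delete path so it gets
        # powered off, re-networked, and cleaned like any other delete.
        ironic.node.set_provision_state(node.uuid, 'deleted')
```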
In the ops feedback session today, one of the big things that came up was that Ironic has trouble dealing with failure cases and automatically resolving them, and that's a pain we really, really feel. So we're looking at ways to fix that.

One of the other big pain points is that there's no HA story for Nova and Ironic at this point. We kind of break Nova's model for everything. Nova normally talks to a hypervisor, and hypervisors have VMs. We're kind of the bare metal hypervisor with lots of bare metal nodes, and you can fit a lot more bare metal nodes behind one nova-compute than you can fit VMs on one physical box. So we're seeing things like nova-compute taking up to five minutes after startup for the resource tracker to be ready, and that kills any chance of an active-passive solution where you have two of them going. We were discussing ways to fix this with the Nova guys earlier today; it looks like maybe in Liberty we'll have something, hopefully. But again, we wrote a script for this, because you don't want to be dealing with individual nodes by hand. It's called "fix active no instance". We would see a race condition where Nova would orphan a node that was being booted for a customer: the node would go through the whole boot process in Ironic, but at the end it wouldn't have an instance UUID and wouldn't have an actual tenant on it. It'd be provisioned, but no one's using it, and it's just wasting capacity. So we wrote a script to fix that kind of bug. But really this is just a bandaid; Nova and Ironic need a real solution to this.

One thing we see a lot is that PXE booting fails. It happens constantly. There are network issues, there are hardware issues; you tell your node to PXE boot and it says, no, I'm just gonna boot from the hard disk anyway. And since we're booting into a RAM disk, you really need that PXE boot. So we wrote a script that's basically a nuclear option. Again, it mostly steps through the deletion process Ironic would go through, but it makes sure that literally any problem that can happen gets fixed. For example, if your BMC is locked up, you can send a command that resets the BMC, wait about 30 seconds, and hopefully it's working again. We get a handful of nodes going into this no-heartbeat state every single day, and in almost all cases a script like this fixes them. Sometimes you have to go to the rack and physically pull the node out and put it back in, but usually we can fix it with code. Again, we could make this a periodic task and help Ironic automatically recover from these failures.
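For the BMC piece specifically, the reset step looks roughly like this; a hedged sketch that shells out to ipmitool, with the address and credentials obviously hypothetical:

```python
# A hedged sketch of the BMC reset step from the "nuclear option"
# script. Assumes ipmitool is installed; the host and credentials
# are hypothetical.
import subprocess
import time


def reset_bmc(bmc_addr, user, password):
    # 'mc reset cold' power-cycles the management controller itself,
    # which often un-wedges a locked-up BMC.
    subprocess.check_call([
        'ipmitool', '-I', 'lanplus',
        '-H', bmc_addr, '-U', user, '-P', password,
        'mc', 'reset', 'cold'])
    # Give the BMC time to come back before sending power commands.
    time.sleep(30)


reset_bmc('10.0.0.42', 'admin', 'secret')
```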
Another thing would be using heartbeats for scheduling. The RAM disk we run is always heartbeating to Ironic, and we could use that as part of our scheduling, to prefer nodes that have heartbeated recently.

One of the big problems with all of this stuff is that we wouldn't have identified most of these issues if we didn't have some way to track all of these failures: see trends, and really classify how the nodes are doing. If you run ironic node-list, you'll see nodes that are in error, but you have to dig into each node to figure out what the actual error is. So we asked a couple of our interns to build us a dashboard; you can't run software without a pretty graph, right? Two of our interns, Ellen and Kyle, sat down and wrote us one.

It's based on Node.js and basically just sits there querying the Ironic API. The really key thing is that it tracks all of the state changes that happen. We have a state machine, and we understand that a node should go from here to here in the good case; it can go from active to error in a bad case, and we classify that as an error. So we have a bunch of different alerts tracking those state transitions, and we emit all of these events to IRC. If you look in our alerts channel and see 400 messages, you know something's gone horribly, horribly wrong and everyone needs to be on hand to fix it.

Just quickly walking through the dashboard, which is open source and up on our GitHub (I have links at the end): this is an example showing some nodes that are provisioned, in maintenance mode, no heartbeat, neutron fail. The other part of the dashboard gives you an easy way to interact with the Ironic API through your web browser. You can list all the nodes, click on things, search. It has really made our lives a lot easier, not having to go to the CLI every time.
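The core loop is simple enough to sketch. The real dashboard is Node.js and posts to IRC, but in Python the idea looks something like this, with a hypothetical endpoint and a made-up whitelist of good transitions:

```python
# A hedged sketch of the dashboard's core loop: poll Ironic, diff
# provision states, and report transitions. The real dashboard is
# Node.js; the endpoint and token here are hypothetical.
import time

from ironicclient import client

ironic = client.get_client(
    1,
    os_auth_token='hypothetical-token',
    ironic_url='http://ironic.example.com:6385')

# Transitions we consider normal; anything else raises an alert.
GOOD = {('deploying', 'active'), ('deleting', 'available')}
last_seen = {}

while True:
    for node in ironic.node.list():
        old = last_seen.get(node.uuid)
        new = node.provision_state
        if old is not None and old != new:
            level = 'info' if (old, new) in GOOD else 'ALERT'
            # The real dashboard emits these events to an IRC channel.
            print('[%s] %s: %s -> %s' % (level, node.uuid, old, new))
        last_seen[node.uuid] = new
    time.sleep(30)
```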
So, I promised to talk about the scaling issues we've had with Ironic. Scaling anything is hard, right? But in our case, it wasn't that hard. If you want a really in-depth talk on this, Jim Rollenhagen did one at the last summit, about six months after we launched. But quickly: we haven't really run into many scaling issues with Ironic itself. It scales pretty well; it's a well-designed system. You have a layer of API nodes, those API nodes talk to conductors, and you can scale both of those horizontally. Nova has a concept of cells, and you can use cells to break up your nodes. We haven't had any issues with that.

We have had issues with things like slow hardware. Ironic is constantly talking to the hardware to make sure it's still available, that it can still check the power status, that it can control the power status. If you have a lot of really slow hardware, all of a sudden that loop that checks every node starts taking longer than the period between loops, and everything grinds to a halt really fast. So we just identify the really slow nodes, take them out of service, get them fixed, put them back in service, and we haven't seen the issue since.

One of the more recent scaling issues we found is scaling DHCP, TFTP, and serving the agent over HTTP. We hit some serious bottlenecks there. If you're rebooting a bunch of nodes at the same time, which happens pretty often in a public cloud, the 200 megabyte RAM disk being served up starts interfering with the packets coming out of TFTP, nodes fail to boot into the RAM disk, and you just see a lot of issues. To fix that, we split it up. We were serving everything from one active-passive cluster, so we broke it into multiple clusters: a DHCP cluster, a TFTP cluster, an HTTP cluster. Ironic is looking at other ways to solve this as well.

But yeah, let's talk about the actual Ironic release. Kilo just came out, and Kilo is good for operators. We pushed up fixes for a lot of the things we found that were operational problems for us, and a lot of other people have done the same. The whole beginning of the cycle was dedicated to changing the state machine. The Ironic state machine used to be: you have a node that's not being used, it's being used, it's being deleted, and that's about it. There's a whole bunch more states now to handle the things you run into as an operator. We have a manageable state. Manageable is a place you can put a node that you want to take a look at, where people can't provision it right now. It's a good place to put nodes when you first enroll them in Ironic.

We've added an inspection system. You give Ironic a basic level of detail so it can talk to the hardware, and it boots up a RAM disk and figures out what the CPU is, how much memory it has, and so on. This avoids fat-fingering and winding up with nodes that claim a tenth of the RAM you expect them to have. We added cleaning, which I alluded to before. It's a process that runs after a node is deleted, and it's extremely pluggable. We use it a lot in IPA; I think we have about 10 steps at this point, everything from erasing the root disk to erasing those PCI Express flash cards I was talking about. We upgrade firmware, make sure everything's still signed, and there's a whole bunch of other steps I can't think of off the top of my head.

We added a field on the node called maintenance reason. When you put a node in maintenance because it's broken or won't boot or something, you can now attach a reason. This is much easier than managing it with a spreadsheet, which is what we did for quite a while. We've added rolling restarts for Ironic. Before this, every time we deployed Ironic, multiple machines would end up with stuck locks: Ironic would hold a lock on a machine, the conductor would restart, and when it came back up the lock never got released, so you had to log into the actual database and clear it. Very, very painful, and it led to us not deploying as often as we'd like. Now there's a rolling restart that gracefully handles the locking, and we can do zero-downtime deploys of Ironic, during the day, and no one even notices. It's fantastic.

Ironic also really focused on untangling services this cycle. Before, to set up a cluster with Ironic and IPA, you needed Nova, Ironic, Neutron, Glance, Swift, IPA, a whole bunch of services, to do anything. We've gotten to the point where someone wrote a set of Ansible playbooks called Bifrost that can manage Ironic directly. No Nova, no Neutron. It talks to the API, and you can add nodes, boot nodes, delete nodes. It's a pretty good system, and I think that loose coupling is really handy if you just want to use Ironic to boot up a small cluster at home or in a lab. I don't know how many people have clusters at home, but...

Looking forward, Liberty's gonna be great for operators, at least I hope so. There's been a ton of specs proposed already. Some are carryover from Kilo that we didn't get done yet, but some of my favorites, the ones I think really affect operators, are these. Metrics: I mean, how can you run a scalable service without metrics? You need to be able to track how long it takes to boot nodes, identify bottlenecks, things like that. There's a good spec for that.
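As a flavor of what that could look like, here's a hedged sketch of timing a deploy with statsd. The host, metric name, and wrapper are all made up, not what the spec mandates:

```python
# A hedged sketch of deploy timing metrics, assuming the 'statsd'
# Python package. Host, port, metric name, and wrapper function are
# hypothetical; the Ironic spec may choose different plumbing.
import time

import statsd

stats = statsd.StatsClient('metrics.example.com', 8125)


def timed_deploy(node_uuid, deploy_fn):
    start = time.time()
    try:
        deploy_fn(node_uuid)
    finally:
        # Track deploy durations so bottlenecks show up in graphs.
        stats.timing('ironic.deploy_time_ms',
                     (time.time() - start) * 1000)
```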
There's also a spec to control RAID on your nodes, so you can lay down a RAID 10 on a node before it gets provisioned. We're hoping to have it so you can do it both at boot time and ahead of time. That's a pretty good spec.

Something we've been running downstream and are now proposing upstream is rescue mode. It really sucks to get called at 2 a.m. because a customer's machine's disk died and they have no way to fix it. With rescue mode, they can boot up a RAM disk, very similar to the IPA RAM disk, troubleshoot it themselves, and then call our support team, and I don't have to get woken up anymore. That makes me happy.

There's also a corollary to the cleaning thing I talked about, called zapping. Zapping will be a way to run arbitrary actions against nodes in an operator-controlled way, instead of automatically after delete. If you want to push out your firmware at a certain time, there'll be an API for that. Or, like the script I mentioned that checks all the ports on the nodes to make sure they're connected to the correct switch, that's something zapping would be able to handle.

And Ironic's gonna have iPXE support built in, in a stateless fashion, which will be awesome, especially with IPA being as big as it is. Ironic will handle the whole iPXE flow. In our environment right now, we basically have one static iPXE file that works for all the nodes, which is a very ugly way to do it. Having Ironic control this will give us a lot more options, like choosing between a rescue RAM disk and a provisioning RAM disk. Booting directly from volumes could also be supported by something like that, since iPXE supports booting from iSCSI.

Yeah, so if any of these look interesting to you, they're all just proposed specs right now in the ironic-specs repo. We very much encourage operators to come look at them. Even if you don't code, please come look at the specs; these are things that are going to impact you. We have some operational experience, but you all definitely have different operational experience, and we don't want all of Ironic working for just a handful of cases. So please review specs with us.

And yeah, that's about the end of my presentation. There are some links for you. All the scripts we use are open source now, so you can go use them, and the dashboard is also open source. They're not exactly production-grade for everyone; well, they work for our production, but they're very specific to our environment. They should give you a good starting point for figuring out how to do some of these things, though. And the obligatory: like everyone else here, we're also hiring, specifically for the OnMetal team. And yeah, that's my name and whatnot. If anyone has any questions, please use the microphone so the folks watching afterwards can hear them.