All right, should we get started? OK, this talk is about scaling Ironic. Before I get started, I'm curious: how many of you have used or developed on Ironic? Excellent, about half of you. Cool.

So what is this talk about? There's a lot of stuff I want to cover here. Essentially, this is the story of how we at Rackspace built OnMetal. OnMetal is a public cloud deployment of Ironic. It's multi-tenant. It's secure, believe it or not. And this is how that came to fruition.

So back in February, we started hacking on Ironic; that's when our first commit went in. In May, we did our initial deploy, the first deploy we really did with hardware at all. In July, we released this to the public and started getting users on it. And today, we build and delete over 1,000 instances a day. So it's running at scale. What I really want to talk about is how using OpenStack allowed us to do this, and more importantly, how using Ironic allowed us to do this. I want to get into what went well, what didn't go well, and what other projects can learn.

So who am I? Why should you listen to me? I'm Jim Rollenhagen. I'm jroll on Freenode; I'm in like a million channels, so you can find me easily. I'm a core reviewer on Ironic. I'm a developer. I'm a deployer. I'm a user.

Going back to when this first started: back in the day, some people at Rackspace got together. They had done a lot of distributed systems and deployed large software systems, and they had realized dedicated hosting sucks. There's too much human interaction. It's really slow to scale up; it can take weeks to get a new server installed and racked. You need to manage your own hardware: if RAM dies, you have to go in and fix it, or order a new server and wait a month. It's horrible. You can't scale an application that way.

But cloud hosting sucks, too. You've got noisy neighbors. You've got the virtualization tax. There are companies out there like Netflix who have built entire engineering teams working around the deficiencies of the cloud. It's really hard to build an application on it; you have to work around a lot of things.

So what if we could get the best of both worlds and get bare metal servers in minutes? Even better if we could do that with an API, with open hardware, like Open Compute, and open software. One of the people who conceived it was the founder of Mailgun, and he put it best when he said our mission and goal in life is to destroy proprietary control of the data center. And I believe OpenStack has a similar mission: we want to destroy proprietary control of the cloud. This right here is our team's kind of logo, mascot, motto. You may have seen us wearing shirts like this before.

Anyway, so we want to destroy proprietary control. Clearly, we will use OpenStack, right? Well, no. Back in those days, we looked at Ironic. This was around the Havana release, I believe. And it wasn't even close to ready. It had one deploy driver available, which tends to work, but it's a little insane in the way it does things. Furthermore, we wanted to build a product really fast and didn't want to get slowed down by the OpenStack process.

So why is that driver so insane? The deploy RAM disk is written in Bash, and it's not very easy to write good software in Bash. It uses dnsmasq for DHCP. Well, it actually uses Neutron, but Neutron only provides dnsmasq, and dnsmasq's home page on the internet clearly says this is for small-scale deployments. So we didn't want to use that. It also uses TFTP.
TFTP doesn't really scale for large images. You can easily do kernels and RAM disks over TFTP, but beyond that, you're going to get a lot of failures. It uses iSCSI: the deploy RAM disk basically boots up, finds the first disk, exposes that as an iSCSI volume, and then Ironic mounts that volume and dd's an image across the network. That doesn't seem scalable, right?

And there are stateful conductors in Ironic. There are two pieces, the API and the conductor, where the conductor does the work. The problem is that Ironic PXE boots instance images as well, and so you have the kernel and RAM disk locally on the conductor that's managing the node. Which means, one, if the conductor goes away, you're screwed. And two, if it goes away, you have to fail over somehow, and that failover now involves downloading images from Glance and getting them ready in TFTP, not just adjusting a hash ring.

So what we did was go and build our own thing. We called it Teeth. We built Teeth Agent, which is a deploy RAM disk, or I guess it's a Python application that runs in the deploy RAM disk. Then we built Teeth Overlord, which is the server side of things, the application that kind of manages everything. And that went great. So early February, we had a prototype working. We were deploying servers. Everything was good. We were on track for our deadlines.

And then this guy shows up. For those that don't know him, that's Chris Behrens. He's a Nova core. We chatted for a while and reevaluated Ironic, and we realized that OpenStack was the right thing to do. Rackspace started OpenStack; we want to support it. Ironic was close, but not quite there, so let's just help it get there.

After that, we went to the mid-cycle meetup. This was back in March. We talked about what we were doing and what our goals were. To some extent, we couldn't expose our product, but we said we wanted to do secure, multi-tenant bare metal. And it turns out Ironic had wanted to build this Python agent thing for about a year and just hadn't gotten around to it. So they were thrilled that we brought all this code with us and wanted to share it. We ended up making a new project, Ironic Python Agent, IPA for short, which is a great acronym. And then we began to take Teeth Overlord and munge it into a deploy driver for Ironic.

So we fixed most of the problems we saw here. The RAM disk is now in Python, which allows us to write decent software and gives us more flexibility in what we can do. We made it work with iPXE, which is similar to PXE but can bootstrap into HTTP, and then we can serve those large images over HTTP, which is an easily solved problem. It gave us RESTful APIs: how IPA works is essentially that it boots up and exposes a REST API with commands and command results and that sort of thing, and Ironic can just call those APIs and tell it what to do. For instance, cache an image, write an image, deploy a config drive, and get the box ready to go. And because we write whole-disk images and boot from disk, this gets rid of all the state in the conductor, which gives us a better failover and HA story.

Ironic had always had a really great driver API that was completely pluggable. It allowed us to quickly get our agent driver in there, except the abstraction was not perfect. There were some assumptions in there that worked great for the PXE driver, and they did not work for us. So we had to do a lot of patches and a lot of work ahead of time to be able to make our driver work as expected.
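To make that agent flow a bit more concrete, here's a minimal conductor-side sketch of driving a deploy through an agent's command API. The endpoint path, command names, and payload shapes are stand-ins I made up for illustration, not IPA's exact API.

```python
import requests

def deploy_node(agent_url: str, image_url: str, configdrive: str) -> None:
    """Drive a deploy by POSTing commands to the agent on the node."""
    commands = [
        ("cache_image", {"image_url": image_url}),
        ("write_image", {"image_url": image_url}),
        ("write_configdrive", {"configdrive": configdrive}),
    ]
    for name, params in commands:
        # Each command executes on the node itself; the conductor just
        # issues REST calls and checks results, so no instance state
        # (kernels, RAM disks) has to live on the conductor.
        resp = requests.post(f"{agent_url}/commands",
                             json={"name": name, "params": params})
        resp.raise_for_status()

# e.g. deploy_node("http://10.0.0.5:9999", "http://images/ubuntu.img", "...")
```

That command-style API is the flexibility the new driver bought us; the harder part was fitting it into abstractions that had only ever been exercised by the PXE driver.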
And so we quickly learned that being the second driver is hard. There are assumptions you've never seen before. Regardless, we dove in headfirst, and I think this is the greatest contributor to our success. Instead of contributing a couple of patches here and there, our entire team went 100% into Ironic: interacting with the community, putting up patches, reviewing code, fixing general architecture things, even if we didn't really care about them. Some of us spent up to 80% or 90% of our time upstream.

And it turns out upstream is hard, as I'm sure all of you know. We moved really fast, we got this out really fast, and so we had, and still have, a lot of patches downstream that we're still working to upstream. Luckily, Rackspace has been doing this for a while with other projects, so we had the infrastructure in place to maintain those patches downstream. Although downstream is even harder, because you're using crappy homegrown tools. Our patches touched a lot of different places in Ironic, making new driver interfaces and that kind of thing, so we can never automatically rebase, and we have to do it manually every so often to catch up with trunk.

Anyhow, soon we had a thing. We got things working: part of it was upstream, part of it was downstream. We had a lot of existing deploy infrastructure at Rackspace, so we quickly got a pre-production environment up. We had about 50 servers there, real bare metal. As far as we knew, that was already the largest deployment of Ironic. We found some minor issues, little bugs here and there; we ironed those out, and everything was really starting to come together.

And so on to production. We brought up a similar environment in the data center our production gear was going to. We started out with a few racks there and some beta customers, and everything seemed mostly fine. Like, I don't know, 200 servers, not a big deal. Things were good. Then we started bringing in more hardware. At first, we added one rack at a time, testing things out. Everything was still good. So then we got overconfident and just added everything. And this is kind of what it looked like.

So what happened there? There are a few major things that happened that were kind of core problems for us. First of all, it turned out our agent is really chatty. We have this patch that keeps deploy agents always on and running; that removes one POST cycle, gives faster boot times, and our customers are all happy. But we need a heartbeat for that, and so our agents heartbeat periodically. When we added all these new racks in, we added the DHCP configs. The machines were already on, so they instantly booted hundreds and hundreds of agents, tripled our API traffic, and everything just kind of fell over. But luckily that's just APIs, right? Just scale those out. We were good to go.

A second problem that's kind of core to Ironic is that IPMI is just bad and horrible and slow. Ironic has a periodic task in the conductor to go and verify the power state for each node in its database, each node that the conductor is managing, and it does this in serial. The default period for this loop is once per minute, and that's always been fine for everyone, because everybody's running it with two servers, three servers, et cetera. We were running two conductors with hundreds of servers, and one minute was not nearly enough.
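Just to show the arithmetic on that serial loop, here's a back-of-the-envelope sketch. This is not Ironic's actual conductor code, and the per-query cost is an assumption; real BMCs can take far longer, or hang.

```python
# Rough model of a conductor's serial power-state sync pass.
IPMI_QUERY_SECONDS = 2.0       # assumed, optimistic per-node IPMI round trip
LOOP_INTERVAL_SECONDS = 60.0   # default: sync power states once per minute

def pass_duration(node_count: int) -> float:
    # The conductor queried each node's BMC one at a time,
    # so a full pass costs node_count * per-query time.
    return node_count * IPMI_QUERY_SECONDS

for nodes in (3, 50, 300):
    d = pass_duration(nodes)
    verdict = "fits in" if d <= LOOP_INTERVAL_SECONDS else "blows past"
    print(f"{nodes:>3} nodes: {d:>6.0f}s per pass, {verdict} the 60s interval")
```

With a handful of nodes, a pass finishes with time to spare; with hundreds of nodes per conductor, each pass takes minutes, which is exactly what we hit.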
Also, with Python and networking and threads, when you reach out to the network and the network takes seconds and seconds to respond, things don't go so well. So that loop was taking minutes at a time. We had to back it off, which put us at risk of a server dying, us not knowing about it fast enough, and trying to deploy to it. But that was an okay trade-off for us. We also ended up scaling out our conductor cluster a bit to deal with it.

So, this is still a problem: distributed locking is really, really hard. This is a huge operational pain point for us. Essentially, what Ironic does today, whenever it performs an action on a node, is put a conductor hostname in the database, and if that column is set, then that node is locked. It turns out that you can't put a TTL on a MySQL column, right? So if a conductor dies during a deploy, or you restart it, something like that, that lock will get stuck. If there's an exception while a lock is held, it could get stuck. Most of those issues are fixed right now, but this remains a huge pain point. We have a session tomorrow to talk about whether Ironic is locking too much, or whether we should do locking better. We also have a patch in Gerrit that uses ZooKeeper for locking instead of the database, which would allow us to do TTLs and things like that. But when you're deploying things to hardware, TTLs are kind of hard.

And lastly, Ironic breaks Nova's world completely. If you ask any of the Nova guys about this, they will tell you that we're dumb. Basically, how Ironic works with Nova is that we've got this ClusteredComputeManager, which can take many nova-computes and put every Ironic node on every one of those computes. So these things end up with thousands of compute nodes according to Nova, and it wasn't really built for that. Our resource tracker loop is taking about five minutes in production right now just to collect resources. So we had to put some hacks in there: Chris Behrens went in and hacked some stuff together, made that loop longer, made it so the compute hosts wouldn't time out, that kind of thing. Secondly, since Nova only locks in-process in the nova-compute process, you end up with race conditions between the two nova-computes. So you'll get a build going here and a build going there, they both schedule the same node, and things blow up. Or you get a build here and a delete there, and that doesn't work out well either. We mostly coded around these things, but it's still kind of a pain point for us, and we hope to fix it in Kilo.

And that's really it. Ironic was really great for us in terms of scalability. It turned out that even though it wasn't integrated, it was production-ready. Integration is not a production-ready stamp. Of course, we found small bugs and needed features that can only be found by running this in production. For example, it turns out that when a conductor had an exception in its heartbeat thread, say it couldn't connect to the database, that conductor would stop heartbeating but keep running. We wouldn't notice, and suddenly your conductors are gone. Little bugs like that, that we fixed.

And so it ended in great success. We launched our product on time. We became valued members of the community. We helped push Ironic forward, and Ironic helped push us forward as developers. I can't tell you how much I've learned by becoming an OpenStack developer, doing code review with hundreds of people. You just learn a ton about good process and whatnot.
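Going back to that locking problem for a moment, here's a tiny sketch of the pattern that hostname-in-a-column locking amounts to. It's simplified, with sqlite standing in for MySQL and a pared-down schema, so treat it as an illustration rather than Ironic's actual code.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, reservation TEXT)")
db.execute("INSERT INTO nodes (id, reservation) VALUES (1, NULL)")

def acquire(node_id: int, host: str) -> bool:
    # Atomic compare-and-swap: only take the lock if nobody holds it.
    cur = db.execute(
        "UPDATE nodes SET reservation = ? WHERE id = ? AND reservation IS NULL",
        (host, node_id))
    return cur.rowcount == 1

def release(node_id: int, host: str) -> None:
    db.execute("UPDATE nodes SET reservation = NULL "
               "WHERE id = ? AND reservation = ?", (node_id, host))

print(acquire(1, "conductor-1"))  # True: lock taken
print(acquire(1, "conductor-2"))  # False: already locked
# If conductor-1 crashes here, nothing ever clears the column; a row has
# no TTL, so the node stays "locked" until someone intervenes.
release(1, "conductor-1")
```

That last comment is the whole operational pain point: a ZooKeeper-style lock can expire with its holder's session, while a database column just sits there.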
And so in the future, we look forward to scaling this even more: scaling it to multiple regions, multiple cells, making Ironic even faster and better.

So why does this matter to you all? Well, there are a couple of things you can do and learn from this. First is a shameless plug: come build this with us. Ironic's got a really, really great team; we're really fun people, I swear. If you're at all interested in it, join us: jump in our IRC channel, develop on it, run it, document it, whatever your strength is. We would love your help. And secondly, you can do the same for your project, whatever project you work on today: Nova, Neutron, et cetera. Go deploy it, run it yourself, depend on it, if only within your local development environment. That's, in my opinion, the best way to find the pain points that, one, deployers go through, and two, people who run it at scale go through. I mean, after all, what's a project without users?

And so that's all I have. I guess we're quite a bit early. Any questions?

Yeah. We use Open Compute.

Right, so actually what we do is run Ironic in its own cell. Rackspace was already running cells everywhere. We put Ironic in its own cell, and so we don't have to mix and match the two.

So actually, Jay and Josh over here have a talk on that tomorrow at 11 o'clock. Okay, 4:30. Oh, sorry. The question was: what do you do after a tenant leaves to make sure the box is secure for the next tenant? So yeah, they've got a talk on this. It's some really interesting stuff. The TL;DR is signed firmware, secure-erasing disks, that kind of thing.

Right, so you can run Ironic with one nova-compute process or more, but each of those processes will know about every Ironic node and see it as a compute node, see it as a hypervisor, so to speak. Nova was built with the idea that you would run nova-compute on the same machine as the hypervisor, so it has a single hypervisor there. So that resource tracker loop does a ton of database stuff, which is fine with one hypervisor; with thousands, it's not.

Well, right now every nova-compute sees every physical box. So I think there are a few different ways to solve it. We definitely want to shard it. We're not sure how to solve the HA problem, maybe just active-passive failover, something like that, but we're looking into that this cycle.

Yeah, so the question was: why does Nova think the bare metal nodes are hypervisors, when really Ironic is a proxy? It's for resource tracking and for scheduling; it works out best that way. I'm not sure on the specifics of that, I haven't dug too much into Nova, but I think there'd be a lot more hacking going on if we thought of it a different way. I think it is a breakdown either way.

Yep. Yeah, so we have a flavor class for OnMetal, and then there are a couple of other flavor classes at Rackspace, and cells can advertise that they can handle a given flavor class.

The current scale number: I can't say specifically how many hosts we have, but we do 1,000-something builds a day, mostly by our QE team. I don't know of many other deployments greater than 10 nodes; TripleO CI runs Ironic, and they're the only other deployment I know of that's more than 10 nodes.

Yes, awesome. Anything else? All right, thanks everybody for coming.