All right, let's go ahead and start. Hello, everyone. Thanks for joining us today. Today we're going to talk about how we run Ironic in production with thousands of servers. A quick introduction: my name is Nikita Goumenko, I'm a deployment engineer. Here with me today are Sergei Kashaba, principal software engineer, and Alex Saknov, who is a deployment lead. We are part of the professional services department at Mirantis, and our presentation is based, for the most part, on the customer engagement we're having with the Symantec Cloud Platform Engineering team.

So here's the agenda for today. First is "Why Ironic": we will discuss what provisioning tool requirements we had and why Ironic was almost the perfect fit. Then "Environment": we will discuss what production environment we have and how we deploy things. In "Customizations and missing parts", we will discuss what changes we made to the OpenStack code base so that Ironic would work for us. And the last part is production challenges: we will talk about bottlenecks and pitfalls to avoid, including an awesome story about how we almost shut down a whole data center of thousands of nodes — and some of our colleagues will hear it here for the first time.

All right, so Ironic. At Symantec, we have multiple development teams, and they all want to use bare metal differently. So we needed some kind of bare-metal-as-a-service type of system. And we tried many things: we tried Foreman, Cobbler, Crowbar, we even developed our own in-house tool. But all of these systems lacked features and the required flexibility. So let's discuss what provisioning tool requirements we had.

It obviously has to be open source. Since we have multiple hardware vendors in our data centers, we need to support them all; we don't want to depend on a single vendor and get locked in. We also have multiple hardware generations in our data centers, and we need to support them all. We constantly get feature requests from our operations guys and internal customers, so we should be able to easily develop drivers, extensions, and plugins without modifying the core components of the system. The full node lifecycle should be supported, so decommissioning and reprovisioning a node should be as easy as building a new one; this improves environment consistency, manageability, reliability, and security. The system should be able to schedule resources automatically, which means that when a user requests to build a box, the proper bare metal flavor is used, IPs are allocated, rack anti-affinity requirements are met, et cetera. An inventory system is basically the dependency for the two previous points: manual management of assets is very effort-intensive and error-prone, so we have to have some kind of automated inventory, and it also helps to keep an eye on the current state of the environment.

As I said, we have multiple development teams, so they should be isolated, which means we have to have some kind of role-based access control to restrict access between tenants — roles, role assignments, and so on. This one is easy: for automation you need a CLI or API, and for user interaction it's nice to have a comfy UI. And last but not least, we build our core infrastructure on top of VMs and containers, so these two use cases should be supported as well. So we were looking for some kind of unicorn to manage them all.

So let's jump to a quick Ironic overview. With all our expertise in OpenStack, we decided that Ironic would be a perfect fit for us. Ironic has officially been an integrated OpenStack project since the Kilo release.
It's basically the bare metal provisioning as a service we were looking for. It has multiple reference drivers which leverage common technologies like PXE and IPMI to cover a wide range of hardware. It also supports a pluggable driver architecture that allows developing drivers for other vendors to improve performance and reliability. And the provisioning process is very similar to virtual machines: for example, Ironic uses image-based deployment, which not only simplifies the provisioning process in comparison with the general kickstart/preseed approach, but also improves provisioning consistency.

Here's a quick picture of the Ironic architecture. To a generic OpenStack deployment it adds a RESTful API, which exposes functionality to operators to manage bare metal servers; Ironic conductors, which do the bulk of the provisioning work; a database to store the assets and resources; and multiple drivers to support a variety of hardware.

So let's discuss the Symantec CPE environment. It's four data centers across the globe, hundreds of racks, thousands of bare metal nodes. We have a variety of hardware and networking vendors and types, and every single node is now managed by Ironic. We are on the Kilo release right now; we are planning to upgrade to Mitaka because some of our issues are already addressed in that release. It's also worth mentioning that we moved already-provisioned servers into Ironic as well, and we will talk about that in a bit.

And this is our deployment architecture, how we deploy things. Basically, each data center is an OpenStack region. We only replicate the Keystone database across the data centers, with roles and tenants; users are stored in LDAP, which is also replicated. Here on the picture you can see that we have two types of racks: infrastructure racks and production racks. The infrastructure racks hold the OpenStack compute nodes that were provisioned by Ironic with this same OpenStack control plane, and we put VMs on top of them using the same control plane. So it's basically one system to rule them all.

So, I did a quick introduction on why we chose Ironic and how we deploy things. Next, Sergei Kashaba will tell you about the customizations we made to the OpenStack codebase and the missing parts we discovered.

Thanks, Nikita. OK, so as a software developer, I'll tell you what changes we've made to OpenStack to actually fit it to our infrastructure. A year and a half ago, we already had several data centers provisioned and deployed with different provisioning tools, and we were about to at least double our capacity. The existing tooling wasn't perfect, and I was asked to research whether we could use Ironic as the only tool to manage all our bare metal. We already had a couple of requirements coming from the previously provisioned servers: do not change the networking topology, support the existing naming convention, and others. It appeared that there was some gap between the Kilo release feature list and what we needed, so after some discussion we decided to fill that gap locally.

The first modification was because of how our networking is designed. We have several shared networks — management and data, for instance — and each server in the data center belongs to those networks. But for each rack, there is a separate L3 subnet in each network. It means that if a user asks, "OK, I want that bare metal node with two networks," the provisioning framework should pick a server from the pool and find which rack that server belongs to.
And for each requested network, it should pick an IP address from that rack's subnet in that network. Then we tried to map this onto Neutron. We created the same set of shared networks, and for each shared network we created subnets, one per rack. The tricky part was to teach Nova to pick the proper network when a node is provisioned; we had to modify the Nova network API class. We also had to introduce the assumption that the subnet name should match the rack name. Also, because we had a lot of subnets for each network and the Neutron dnsmasq driver passed all those subnets as part of the command line, the command line became too long and it didn't work, so we had to patch that a little bit as well.

Again, the same workflow: the user requests a bare metal node and several networks to be present on it after it gets provisioned. But networking on bare metal is different than on a VM — it's usually much less trivial. You can have just physical interfaces, you can have bonding, you can have tagged VLAN interfaces attached to that bond. On the slide you can see an example of one possible networking topology. By default, Nova can't do this today; it's in a blueprint, but not yet implemented. Luckily, Ironic provides an extra dictionary as part of the node definition, so when you define your node you can put whatever you want into that dictionary. We decided to put the required network configuration into that dictionary, and on the Nova side we put that extra dictionary into the Nova config drive. Then we just prepared a set of images for different operating systems that contain cloud-init, and taught cloud-init to read that networking configuration and apply it. Done.

Neutron usage and changes — actually, not much. The way we use Neutron is defined by the networking design: everything is configured on the switch side, and we use only IPAM and DHCP. IPAM is simple, just to allocate IPs. And DHCP, for several reasons, we use only for PXE boot. We had some issues with the use case where we had a relay agent configured on the top-of-rack switches; we had to remove the logic that Neutron performs around interface creation and so on when it spawns the dnsmasq process. And the configuration is simply: don't use network namespaces, nothing should be done by the interface driver, and DHCP should just listen on a predefined network interface. That's all.

Scheduling. For scheduling, we had three basic requirements. First, so-called fault-zone-tolerant distribution, which is about the way we distribute bare metal nodes of the same role. We have a lot of racks, and we don't want the failure of any one rack to take down some service, so we want bare metal nodes to be distributed over the racks as evenly as possible. Then, scheduling by type: users should be able to say, "OK, I want a compute node" or "a storage node," obviously. And finally, we had the requirement that users should be able to say, "Hey, scheduler, I want that node to be provisioned in this rack," or even, "I want to provision this exact bare metal node." That last option was really useful when we migrated servers that were already provisioned with another provisioning tool over to Ironic — Alex will talk about that later on.

At first glance, I thought it was actually not doable because of the way the Ironic driver for Nova is designed. With Nova, you can have only one nova-compute node that manages all the bare metal nodes, which means you cannot use availability zones, you cannot use host aggregates — nothing. But then I found that I can define capabilities as part of the bare metal node description.
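As a hedged illustration of that last point, defining such capabilities on a node can look roughly like this with the Kilo-era python-ironicclient; the credentials, node UUID, rack, and server type below are placeholders, not values from our environment.

```python
# Minimal sketch: tag an Ironic node with a rack and server type via its
# capabilities, using the Kilo-era python-ironicclient. All concrete values
# here (credentials, UUID, rack name, server type) are placeholders.
from ironicclient import client

ironic = client.get_client(
    1,
    os_username='admin',
    os_password='secret',
    os_tenant_name='admin',
    os_auth_url='http://keystone.example.com:5000/v2.0',
)

node_uuid = 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee'  # hypothetical node

# Capabilities live inside the node's properties as a comma-separated string.
ironic.node.update(node_uuid, [{
    'op': 'add',
    'path': '/properties/capabilities',
    'value': 'rack:r07,server_type:compute',
}])
```

The comma-separated "key:value" string is the convention Ironic documents for node capabilities, and it is what the Nova scheduler side works against.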
And those capabilities can be used for scheduling, which is great — it actually solved all our requirements. We added the rack name and the server type to those capabilities.

So, distribution: we just developed our own weight function that uses the rack name from the capabilities and provides the correct weight for any server. Then, for scheduling by server type, we created several flavors, and for each flavor, instead of scheduling purely on memory, disk, and RAM, we defined flavor metadata with the server type and turned on the ComputeCapabilitiesFilter. It matches the flavor metadata against the node capabilities, which solved that use case. And last, targeting a specific rack or server for an instance — we solved that with the JsonFilter. The syntax is not really convenient, but it does what we needed.

OK, so the last changes are supporting the naming convention and the local DNS. A naming convention is kind of a fingerprint of a data center; they are all different. Sometimes the naming convention expects the physical node location as part of the node name, meaning the data center name, the rack name, or even the position of the server inside the rack should be part of the name. It also means that the final name can be generated only after the Nova scheduler decides which node will be the target for the scheduling request. By default, we could use the instance name template, but it's not enough, so that's plus one change on the Nova side. And finally, DNS integration. We had a local DNS, which is an essential part of the data center, and it's not always possible to use the Neutron DNS service. So we had to make, again, plus one change to Nova to add and remove FQDN records in our local, corporate DNS when a node is provisioned or deprovisioned.

So those are actually all the changes we've made. All of them are on GitHub as a single package. And yes, it's possible: with Nova — actually, with OpenStack in general — you can replace almost any piece of code without making changes to the core packages, which is great. What else? So now Alex is going to tell you about the real production experience of using Ironic. Alex?

Thank you, Sergei. So let's finally talk about the production challenges we had dealing with Ironic. When we initially started our POC, we didn't really think about scalability, high availability, and other production-type requirements; we had tons of challenges just to make it work for our use cases and in our environment. After a couple of months of POC, we finally did the first deployment in production for newly arrived hardware. And, surprisingly, Ironic with our changes was working great, and users were happy and excited to use the CLI and Horizon they already knew to basically build boxes.

Unfortunately for us, that was just the very beginning of our path. We had a lot of old hardware in the same data center that we somehow had to move to Ironic. And the problem was that, since most of these boxes were already up and running, the transition process had to be seamless — we could not afford any reboots or downtime. The very first idea was to inject data about these nodes manually into the database. But when we started thinking about this idea in detail, it looked more and more crazy: you have to update three databases — Ironic, Neutron, and Nova — set proper dependencies between Ironic nodes, instances in Nova, compute nodes, and Neutron ports, allocate IPs, and probably do many other things. The chances that we were going to miss something were probably 100%. So we decided to go another route and use the Ironic fake driver.
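A minimal, hedged sketch of what enrolling one of those already-running boxes with the fake driver can look like, again assuming the Kilo-era python-ironicclient. The properties would come from our discovery data; every concrete value below is a placeholder.

```python
# Hedged sketch: enroll an already-provisioned server with the 'fake' driver
# so Ironic tracks it without ever touching its power state. All values are
# illustrative placeholders.
from ironicclient import client

ironic = client.get_client(
    1,
    os_username='admin',
    os_password='secret',
    os_tenant_name='admin',
    os_auth_url='http://keystone.example.com:5000/v2.0',
)

node = ironic.node.create(
    driver='fake',
    properties={
        'cpus': 24,
        'memory_mb': 131072,
        'local_gb': 1800,
        'capabilities': 'rack:r07,server_type:compute',
    },
    extra={'note': 'pre-provisioned, imported from discovery'},
)

# Register the node's NIC so Neutron ports can be tied back to it.
ironic.port.create(node_uuid=node.uuid, address='52:54:00:12:34:56')
```

With the fake driver, the power and deploy actions are no-ops, so the box itself is never touched.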
So we gathered all the information about the nodes through our discovery process and added them to Ironic with the fake driver. The fake driver does pretty much everything the normal driver does except the real provisioning. There was one problem with this driver as well: when you launch an instance, Nova expects the power status of the hypervisor to be some kind of good status, like "power on" or "power off." Since we used the fake driver, Ironic didn't actually gather the power status of the physical nodes, and it was reported as NOSTATE. That was kind of a problem, so we did a temporary fix and basically added NOSTATE to the list of good states, and that's how we were able to launch Nova instances on these bare metal nodes. Using the scheduler hints that Sergei described before, we were able to select the exact physical hosts we needed to "build" — basically the ones that were already up and running. And obviously, through the Nova CLI or Nova API you can specify the exact name you want to use for these boxes and allocate a specific IP. So this part was pretty easy.

The very last step was to change the fake driver to the real driver, to make these physical nodes available for users so they would be able to delete them or rebuild them. Unfortunately, there is no way to do that through the Ironic API if the node is already active, and the only way we figured out was a manual database update. It might not be the best thing to do when you deal with a production environment, but that was the only way we found. And again, it was kind of easy and it did work for us.

When we added all these nodes to the existing capacity in Ironic, we had probably more than a thousand nodes in a single Ironic instance — moreover, under a single nova-compute service. Needless to say, things were working badly. A simple nova-compute start took probably 90 seconds, and it was kind of challenging to even spin up a single instance; it would usually just time out or error out for some other reason. So it didn't work well. We started trying to debug what was going on and found out that there is a periodic task, update_available_resource, that is really slow — it took probably all of those 90 seconds to complete, and it blocked the nova-compute service from any other operations. We tried to optimize this thing and removed some of the steps, like getting the migration list, since we don't have migrations for bare metal nodes. It helped a little bit, and the time was roughly cut in half, but it was still bad for us, and somehow we had to scale horizontally.

Indeed, you can always separate your physical nodes in a data center somehow logically, set up an additional Ironic control plane with its own nova-compute service, and scale it that way. No code change is required, and this approach will probably work for everyone. The downside is that you have to manage both control planes, which is additional operational overhead, and the end users have to choose either an endpoint or a region to use — and when you deal with nodes in the same data center, that's probably not really friendly for them. It's worth mentioning that there is a blueprint in the community for multiple compute host support for Ironic. Unfortunately, it's not implemented yet; hopefully it's going to be ready in the next few releases, but that anyway didn't work for us, because we needed this feature, like, today.
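Going back to the targeted builds for a second, here is a hedged sketch of that kind of boot with the Kilo-era python-novaclient: an explicit name, a fixed IP, and a JsonFilter hint aimed at one node. We believe the JsonFilter can match on the host state's hypervisor_hostname, which for Ironic equals the node UUID, but treat that attribute name — and all the names, images, flavors, and UUIDs below — as assumptions to verify against your release.

```python
# Hedged sketch: boot a Nova instance onto one specific Ironic node by using
# a JsonFilter scheduler hint, with an explicit name and fixed IP.
# Assumes JsonFilter is enabled in scheduler_default_filters and that the
# compute node's hypervisor_hostname equals the Ironic node UUID.
from novaclient import client

nova = client.Client(
    '2', 'admin', 'secret', 'admin',
    'http://keystone.example.com:5000/v2.0',
)

ironic_node_uuid = 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee'  # placeholder

server = nova.servers.create(
    name='dc1-r07-u12',                               # naming-convention name
    image=nova.images.find(name='ubuntu-baremetal'),  # placeholder image
    flavor=nova.flavors.find(name='bm.compute'),      # placeholder flavor
    nics=[{'net-id': 'NETWORK_UUID', 'v4-fixed-ip': '10.20.7.12'}],
    scheduler_hints={
        'query': '["=", "$hypervisor_hostname", "%s"]' % ironic_node_uuid,
    },
)
```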
So we decided to reuse that partitioning idea, and we introduced an additional parameter in the Nova configuration file to specify which server type a given nova-compute service is going to manage. That server type is also specified in the capabilities of all the Ironic nodes. So basically that's how we scaled the nova-compute service in our environment, and, as Sergei mentioned, this code and all these features are available on the Symantec GitHub — so if anyone has the same problem in their environment, feel free to check it out and maybe reuse it.

The last story for today is the story about how we moved Ironic to a different host. This picture kind of represents the situation we were in while doing it. When you deal with a tool that operates on the bare metal nodes in your data centers, you're kind of working on a rope without a belay, and any wrong step can be a real disaster — and I'll tell you why. When we did our first deployment in production, we were kind of in a rush: the new hardware was already in the data center and we didn't really have a chance to do a proper deployment, so we decided to take our lab environment, clone it into production, and just reuse it for this hardware. A few weeks later, it was time to move all this setup to a better place, with automation, monitoring, and the other cool things people usually do in production. So we provisioned a dedicated server for a so-called OpenStack appliance with Ironic and started migrating the services. Glance, Keystone, Neutron, and most of the Nova components migrated without any issues, and then it was time to move Ironic with the nova-compute service — and that's where all the fun begins.

Twenty seconds after we started the nova-compute service, we lost the SSH connection to this host. Trust me, you never want to have the feeling we had at that moment, because we literally didn't know whether we had lost the connection to just this single node or had just shut down the whole data center. A few minutes later we powered the node on, logged in, and checked the logs. It appeared that nova-compute had just terminated this instance and Ironic had basically shut it down — but why? I'm not going to go through all the details of the database table relationships, but what actually matters is that when you change the hostname where you run your compute service, you have to make the corresponding changes in the compute nodes and instances tables in the Nova database. There is a kind of dependency between them, and it turns out there is a mechanism in Nova to basically shut down instances that it thinks are not supposed to be running there — and that's basically what just happened in our case. Oops, what just happened? Anyway, that was the last slide, so I'll just continue.

Yeah, so as I said before, there is a dependency between the hostname and the host field in the compute nodes and instances tables. So once one of them is changed — like the hostname — Nova basically terminates all the instances. Fortunately for us, it doesn't happen just randomly: Nova filters such instances, sorts them by creation date, and starts from the newest one. In our case, the newest instance was the one running OpenStack and Ironic, so it kind of just deleted itself and didn't delete any other box in the production environment. Yeah, so this situation could have happened with much worse consequences than just a couple of gray hairs on our heads. So whenever anyone is planning to do such a migration, just remember to either preserve the hostname or update your tables properly.
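For the "update your tables properly" option, here is a cautious sketch against a Kilo-era Nova database. Stop nova-compute first, and verify the table and column names against your own release — depending on the release, the compute_nodes table may also carry its own host column that needs the same treatment. Hostnames and credentials below are placeholders.

```python
# Cautious sketch: re-point the compute service record and its instances at
# the new hostname before starting nova-compute on the new box. Assumes
# direct MySQL access to the Nova database and a Kilo-era schema; verify
# against your release first. All values are placeholders.
import pymysql  # any MySQL client library would do

OLD_HOST = 'lab-controller'        # hypothetical old hostname
NEW_HOST = 'openstack-appliance'   # hypothetical new hostname

conn = pymysql.connect(host='db.example.com', user='nova',
                       password='secret', database='nova')
try:
    with conn.cursor() as cur:
        # Re-point the nova-compute service record at the new hostname.
        cur.execute(
            "UPDATE services SET host = %s "
            "WHERE host = %s AND `binary` = 'nova-compute'",
            (NEW_HOST, OLD_HOST),
        )
        # Re-point every instance the old host owned, so Nova does not decide
        # they are running somewhere they should not be and terminate them.
        cur.execute(
            "UPDATE instances SET host = %s WHERE host = %s",
            (NEW_HOST, OLD_HOST),
        )
    conn.commit()
finally:
    conn.close()
```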
That's pretty much it — that's all we had for today. Thank you for your time, and if anyone has any questions, I guess we have about ten more minutes for them.

Yeah, sure. I forgot to say that for all the changes we've made, we made sure there is a blueprint in the community for each one. It means that part of the changes we've made are already implemented by the community — in a more general way, of course — and the rest are coming, maybe during the next cycle. Okay, any questions? No? Okay, then thank you for your time.