Hello, everyone. My name is Steve Baker. I'm an engineer at Red Hat who focuses on bare metal provisioning. I put this talk together for people who are considering designing and deploying their own bare metal provisioning environment, or people who already have one and would like their decisions to be validated or, I don't know, would like to experience regret. But no, really, it's relevant for existing deployments as well, because it covers some relatively new features that have implications for deployment architecture, as well as some things in the pipeline that will make a difference in the future.

First, I'd like to go through a representative sample of Ironic installers. Like all OpenStack components, there are a few options for deploying Ironic, and each of those options often has some opinionated defaults about architecture. So I'd like to go through those, and as we do, we'll see different use cases for Ironic and different capabilities that the various installers expose.

But before we get specific, I want to put up this call sequence diagram of a typical bare metal node provisioning process. By typical, I mean IPMI in a traditional flat Layer 2 network scenario. When a user triggers a deploy, Ironic initially sends some IPMI commands to switch the boot device to PXE and power the node on. Once that node comes up, a DHCP request goes out and is answered by a DHCP server managed by Ironic. That response includes a payload to be downloaded over TFTP. The payload is iPXE, which is another PXE-like environment, but one with the features we require. Once iPXE is running, it is instructed to download the Ironic Python Agent over HTTP. Once the agent is actually executing, the real deployment process starts, with two-way communication between the agent and Ironic, during which an image is downloaded over HTTP. Finally, as the last action, an IPMI command switches the boot device back to the real disk, the node is powered on, and that's the end of the provisioning process. I'd like you to keep this sequence in mind as we go through the various installers and other scenarios.

The first installer I'd like to cover is Bifrost. It's more of an appliance-based approach: by default it's single node, and it's meant to be as simple as possible to get going on any machine. Notable things to look at: there is one combined Ironic service, which merges the API and the conductor. We're in the process of merging Inspector back into the Ironic code base, so in the future that separate service will disappear as well. dnsmasq serves the purpose of both DHCP and TFTP, NGINX handles HTTP transfers, MariaDB provides storage, and overall these services are managed by systemd. At the bottom there is a provisioning network interface attached to the same Layer 2 network as the bare metal nodes. We'll be coming back to that situation quite a lot in this talk.

Another appliance-like deployment is Metal3. It's deployed on Kubernetes, but it's quite different to Bifrost. For a start, the end user is not interacting with Ironic APIs at all, like they are with Bifrost.
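To make "interacting with the Ironic API" concrete, here is a rough sketch of enrolling and deploying a node directly with openstacksdk, roughly what you would do against a Bifrost-style deployment. All the names, addresses, credentials, and the image URL are placeholders; a real enrolment would also create ports, set a resource class, and so on, so treat this as an illustration of the provision state machine rather than a recipe.

```python
# Illustrative only: placeholder names and credentials, minimal fields.
import openstack

conn = openstack.connect(cloud="bifrost")  # hypothetical clouds.yaml entry

# Enroll a node with the IPMI driver (BMC details are placeholders).
node = conn.baremetal.create_node(
    name="node-01",
    driver="ipmi",
    driver_info={
        "ipmi_address": "10.0.0.11",
        "ipmi_username": "admin",
        "ipmi_password": "secret",
    },
)

# Walk the state machine: enroll -> manageable -> available.
conn.baremetal.set_node_provision_state(node, "manage")
conn.baremetal.wait_for_nodes_provision_state([node], "manageable")
conn.baremetal.set_node_provision_state(node, "provide")
conn.baremetal.wait_for_nodes_provision_state([node], "available")

# Point the node at an image (checksums etc. omitted) and trigger the
# deploy flow described above.
conn.baremetal.update_node(node, instance_info={
    "image_source": "http://images.example.com/ubuntu.qcow2",
})
conn.baremetal.set_node_provision_state(node, "active")
conn.baremetal.wait_for_nodes_provision_state([node], "active")
```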
With Metal3, instead, end users interact with Kubernetes resources, and there is a Bare Metal Operator which is responsible for starting Ironic when it needs to be started, and for converging the state so that the nodes that are deployed match what's declared in the resources. So this is immediately very different to a traditional OpenStack or Ironic deployment. But we do have a similarity on the back end: there still needs to be a provisioning network interface that the pod can attach to directly, with the bare metal nodes on that same network, which can be tricky in a Kubernetes environment. Currently, with Metal3, the solution is to run the pod in the host network namespace. You can run Metal3 with MariaDB, but in this case it's running SQLite, and in fact in some scenarios there's an effort to make Ironic as ephemeral as possible: the state lives in the Kubernetes resources, and Ironic is only brought up when operations are actually in progress. So it's really pushing the boat out on the concept of an appliance; it's quite unique.

Now I want to move on to some TripleO architectures. I know that feature development stopped with Wallaby, but TripleO is going to be around for a long time, and I think it's useful to look at both the undercloud and the overcloud. With the undercloud we start adding OpenStack components that support the deploy-an-overcloud scenario. As you can see, the Ironic API and the Ironic conductor are now split out into two separate services communicating via RabbitMQ, just like in a normal OpenStack deployment. Neutron is in the picture now, to allow for some more complex network configuration scenarios, and it's generally responsible for DHCP responses. So Ironic communicates with Neutron when it's deploying, to say that this node needs to be served this content over TFTP, which means that dnsmasq is now only serving the function of TFTP. I'm going to gloss over Inspector and its DHCP details just for simplicity. Again, at the bottom we've got the provisioning network. I'm not going to diagram out full network architectures, to keep the scope to things related purely to bare metal provisioning.

Then when we move to the overcloud, we get to what I would consider representative of most of the other installers: an OpenStack cloud focused mostly on Nova, but also running Ironic with the Nova Ironic driver enabled. There is a nova-compute service running on each of the control plane nodes, configured for Ironic. That means end users are going to be interacting with Nova in many scenarios to do the provisioning. In this case we have three HA controller nodes, with HAProxy load-balancing access to the API services; different installers have slightly different variations on this, with different combinations of services on different nodes, et cetera. But I think this is representative of a traditional Ironic deployment.

So I think we can move on from installers for now and start looking at some other areas. Scaling and constraints: people often ask how many nodes one conductor can manage. It's a difficult question to answer, because it depends on many things. We have this concept called conductor groups.
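For reference, a conductor group is just a label matched on both sides: a conductor declares its group with the `[conductor]conductor_group` option in ironic.conf, and each node carries a `conductor_group` field. Here is a rough sketch of the node side with openstacksdk; the cloud name, node name, and group name are placeholders.

```python
# Illustrative sketch: assign a node to a conductor group.
import openstack

conn = openstack.connect(cloud="mycloud")  # placeholder cloud name

node = conn.baremetal.find_node("node-01")
# Only conductors configured with [conductor]conductor_group = rack-a1
# in their ironic.conf will manage this node.
conn.baremetal.update_node(node, conductor_group="rack-a1")
```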
The intention of conductor groups was to categorize nodes by some physical aspect of the architecture, be that a data center, a rack, a different L2 network, whatever failure domain is useful to you. That was the intention of the feature, and it is used for that. But it has also ended up being used for a different purpose. Here we have a scenario where there's nothing physically separating the nodes, and yet they're still separated into groups. The reason is that nova-compute is optimized for dealing with the number of resources you could reasonably fit on a single compute node. In the Ironic scenario, though, every nova-compute has a view of every single bare metal node as a resource, and part of its function is to constantly loop through every available resource, make sure it's up to date, and update the Nova state. This becomes a scalability issue once you're managing a certain number of nodes: it takes longer for the Nova state to come up to date, and there's load across the whole cluster as it does this work. A workaround for this is to use conductor groups as a pseudo-sharding mechanism: a nova-compute is configured for a particular conductor group, that group contains a particular set of nodes, and so that nova-compute only sees those nodes. That solves the problem, but it's not what conductor groups were designed for.

I said in the abstract that there would be a specific real-world example of a working architecture, and thanks to CERN I can show the way they're using conductor groups specifically. They have around 20 conductor groups, each managing around 500 nodes, and each group has a dedicated nova-compute VM and an Ironic VM which runs the collection of Ironic services required. There's also a special group which doesn't have a nova-compute attached to it, and that's just for burn-in tasks on new hardware. With this setup they're currently managing about 8,700 nodes in a single dedicated Nova compute cell; I had to update that number after Ulrich's talk this morning. They expect that in the next year or so they'll be managing around 10,000 nodes, which is really quite impressive.

So, some things to consider when you're deciding what kind of ratio between nodes and conductors to expect. One we've just talked about: Nova responsiveness, given the resource-tracking overhead. The CPU load of the Ironic conductor, not just from Nova constantly polling it, but from the periodic jobs Ironic runs, such as polling power state, et cetera. And how dynamic your workloads are. By that I mean: are nodes constantly being redeployed, which results in API activity and conductors doing work, or are they relatively static workloads that are deployed once, after which the only overhead is periodic tasks like power management? All those points are really about how many conductors you need. But as we saw, single-node Ironic is a perfectly valid case, the appliance scenario, and if you do a bit of tuning you can manage around 1,000 nodes on a single-node Ironic. So that at least gives you an idea of the parameters for scaling, even though I can't give you an exact answer. In the real world, some operators have a firm policy of 300 nodes per conductor, some have a firm policy of 500, as we saw with CERN earlier. So that gives you the kind of range people are currently running in the real world.
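If you're trying to settle on a ratio for your own environment, it can help to see how nodes are actually distributed today. Here is a small, illustrative sketch with openstacksdk; the cloud name is a placeholder, and listing with full details is heavy on very large clouds, so treat it as an occasional check rather than something to run in a loop.

```python
# Illustrative sketch: count nodes per conductor group.
from collections import Counter

import openstack

conn = openstack.connect(cloud="mycloud")  # placeholder cloud name

counts = Counter(
    node.conductor_group or "<no group>"
    for node in conn.baremetal.nodes(details=True)
)
for group, count in counts.most_common():
    print(f"{group}: {count} nodes")
```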
So we have in the pipeline a proper solution to the conductor-groups-for-sharding situation. In Ironic there is now a dedicated shard key attached to each node, there's a minimal API to manage the shard keys for all your nodes, and when you're listing nodes you can filter by that shard key. This allows us, in the Nova case, to decouple computes from conductors while still making it possible to group nodes to computes, so that each compute has a limited view of the full pool, and we can have a proper, scalable sharding solution. The Ironic support was added in Antelope, so if you need this kind of sharding mechanism outside of Nova, you can use it now. The Nova support is still in progress; I would tentatively say it will happen this cycle, but please don't hold me to that.

Another thing to consider when you're designing your Ironic deployment is your end users. Ideally, you want to make your end users as happy as possible and empower them to do what they want to do, and one aspect of that is giving them an API they are already familiar with and like using. As you've seen with the installers, there are three quite distinct ways of interacting with Ironic. The traditional one is through the Nova server API. If your users are deploying virtual machine workloads, there's a compelling case that they could deploy bare metal workloads using the same API and the same tooling. That is traditionally how Ironic has been used, but it's not the only way. There's also working directly with the Ironic API. In some cases it's actually useful to have full access to the bare metal node's state. Say you're more in the model of wanting to provision a specific bare metal server, rather than just taking one from a pool: having access to failure reasons and to the actual state of the node can be extremely useful in that scenario. And it is possible to provide a Nova-like experience by using the client-side library and tool called Metalsmith. What I would sum up as the Nova-like experience is: one, it will manage the Neutron ports for you, and two, it gives you an allocation API, so you can say "I want a bare metal node with these constraints" and it will find a free node from the pool that meets those constraints and give it to you. And finally, we have the Kubernetes API. Increasingly, our users are going to be most comfortable managing Kubernetes resources, so being able to manage bare metal as a Kubernetes resource is very compelling for them, especially if they've had no exposure to OpenStack APIs at all. In fact, anecdotally, we hear about a lot of people who say, "we don't use any OpenStack components, we don't use Ironic for bare metal provisioning, because we use Metal3." They're not aware they're using Ironic under the hood, which is, I mean, good and bad.

So, a relatively recent feature that we believe is mature enough to start using now is virtual media boot. This is a feature of the Redfish driver, which enables quite a different deployment flow, and it has implications for networking. Coming back to the call sequence we looked at at the beginning, I can highlight what could be considered Layer 2 network services. DHCP is definitely a Layer 2 service. TFTP doesn't route well beyond the local network, and it doesn't scale well with the number of nodes or with the quantity of data, so we tend to think of it as a Layer 2 service as well.
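Before walking through how the flow changes, it's worth seeing roughly what "virtual media boot" means at the Redfish level: attaching an ISO and setting the boot override are plain HTTPS calls to the BMC, with no Layer 2 adjacency required. This is an illustrative sketch using requests against the standard Redfish schema; the BMC address, credentials, and exact resource paths are placeholders and vary by vendor, and in practice Ironic's Redfish driver makes these calls for you.

```python
# Illustrative only: placeholder BMC address/credentials, vendor paths vary.
import requests

BMC = "https://bmc.example.com"
AUTH = ("admin", "password")  # use a proper credential store in practice

# Attach an ISO as a virtual CD (standard VirtualMedia.InsertMedia action).
requests.post(
    f"{BMC}/redfish/v1/Managers/1/VirtualMedia/Cd/Actions/VirtualMedia.InsertMedia",
    json={"Image": "http://ironic.example.com/images/agent.iso", "Inserted": True},
    auth=AUTH, verify=False,  # BMCs commonly use self-signed certificates
)

# Boot from the virtual CD once, then power the system on.
requests.patch(
    f"{BMC}/redfish/v1/Systems/1",
    json={"Boot": {"BootSourceOverrideTarget": "Cd",
                   "BootSourceOverrideEnabled": "Once"}},
    auth=AUTH, verify=False,
)
requests.post(
    f"{BMC}/redfish/v1/Systems/1/Actions/ComputerSystem.Reset",
    json={"ResetType": "On"},
    auth=AUTH, verify=False,
)
```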
But what if we didn't have those restrictions? That is exactly the case with virtual media boot. Here, a deploy is triggered using the Redfish driver with virtual media boot. Ironic builds a custom ISO and, with Redfish calls, attaches that ISO as a virtual device on the node, such as a virtual CD-ROM. When the node boots, it executes that ISO immediately, and that ISO is the Ironic Python Agent. So it cuts out that whole first bootstrap part of the provisioning process, and we go straight to the actual provisioning tasks, where the agent communicates with Ironic, downloads the image, and writes it to disk. And that's it. This is quite a game changer, and it has implications for network architecture, because we no longer need direct access to a network interface on the provisioning Layer 2 segment. Here is a repeat of the Metal3 architecture: if we no longer have that requirement, it can look like this. The pod can just be a normal pod, and network reachability to the nodes doesn't have to be Layer 2; it can be routed. It's still very much desirable to have a dedicated provisioning network, but those Layer 2 restrictions have been lifted, so it's considerably more flexible.

One of the more complex scenarios we deploy our overcloud on is a spine-and-leaf network configuration. It has really good performance characteristics and better failure redundancy, so it's quite compelling for a lot of our operators. In this scenario, we've got Ironic running on the bottom-left leaf, which means the leaf switches on the right need a DHCP relay configured so that DHCP requests can reach back to Ironic. Which is fine, but it's configuration complexity, and if DHCP messages aren't coming through, it's going to be a pain to debug. It would be nice not to have that complexity on top of configuring the spine and leaf, and that is the case when we're using virtual media boot.

So, some final takeaways. I'd just like to reiterate that different deployment tools have opinionated architectures, so that's something to keep in mind. Your end users have preferences about the kind of APIs they would prefer to use, so it's nice to keep them happy. You've got a few things to consider when you're working out your conductor-to-node ratio, based on Nova usage, type of workload, et cetera. And I'd say virtual media boot is really promising, and it removes network architecture constraints, but some hardware vendors are more ready than others, so I would recommend doing your own evaluation before making firm decisions about your network architecture on the assumption that it will just work. Yeah, that's it. Any questions? Okay, thanks very much.