Welcome to Scaling Bare Metal Provisioning with Nova and Ironic at CERN. My name is Arne and together with my colleague Belmiro, we will walk you through some of the issues we have encountered when scaling up our bare metal infrastructure here at CERN.

To briefly introduce CERN: CERN is the European Organization for Nuclear Research. Its laboratory is located at the border between France and Switzerland, close to Geneva, and its main mission is to understand some of the mysteries of the universe. For this, CERN has built the largest machine ever built by mankind, the Large Hadron Collider, a massive particle accelerator built 100 meters underground with a circumference of 27 kilometers. In this machine, particles are accelerated and collided, and the collisions are detected and recorded by four main detectors, each the size of a cathedral and each producing around 10 gigabytes of data per second. This data is then sent to the CERN IT data center, where the initial reconstruction of these events takes place, where the data is permanently stored, and where it is finally fed into the WLCG, the Worldwide LHC Computing Grid, a federation of around 180 sites worldwide which help with the analysis of this data.

CERN IT runs, say, one and a half data centers at the moment. The main data center in Meyrin, close to Geneva, hosts around 13,000 servers, and in addition we have a couple of containers located at one of the experiment sites, where we host around 2,000 servers. CERN and CERN IT rely heavily on OpenStack. We have had a deployment in production since 2013, with around eight and a half thousand compute nodes, 300,000 cores and 35,000 instances. We make heavy use of cells, so we have around 80 cells at the moment, mostly for scalability but also to separate use cases, different hardware, different power feeds or different physical locations. In addition we deploy regions, also mainly for scalability, but also to ease the rollout of new features.

If you look at the right-hand side of the slide you will see a rough schema of what our deployment looks like. Starting from the bottom, we have up to 200 compute nodes per cell, each managed by one child cell controller, and we have about 30 cells in three regions. Each of the regions has a top-level cell controller and a couple of API nodes to handle requests.

As far as Ironic is concerned, Ironic is only accessed through Nova: a user requesting physical instances contacts Nova, and Nova, by means of the Ironic driver, contacts the Ironic API, which then, via the conductor, talks to the nodes in order to provision physical machines. The controllers themselves look relatively simple: they run an API (Apache) process, a conductor process and an inspector, plus a RabbitMQ node and a database. The compute nodes are managed by a specific cell, the bare metal cell, and we have around 15 of these controllers and 15 compute nodes at the moment which talk to the Ironic deployment. At the bottom you see a snapshot of the dashboard that we use to see how our Ironic deployment is doing.

Now, to see how intertwined these services are: the compute nodes in our deployment are actually physical instances on top of Ironic. So Ironic has physical nodes, Nova instantiates them as physical instances in the bare metal cell, and these physical instances then become compute nodes in a different region and in a different cell. And then on top of these physical instances we host virtual instances.
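Just to make this flow a bit more concrete, here is a minimal sketch of how a user would request a physical instance through the regular Nova API. This is only an illustration using the openstacksdk Python library; the cloud, flavor, image and network names are placeholders, not our actual configuration.

    # Minimal sketch: booting a bare metal instance through the regular Nova API.
    # Cloud name, flavor, image and network below are illustrative placeholders.
    import openstack

    conn = openstack.connect(cloud="baremetal-cloud")  # credentials from clouds.yaml

    flavor = conn.compute.find_flavor("p1.baremetal")        # a bare metal flavor
    image = conn.image.find_image("cc7-baremetal")           # image deployed onto the node
    network = conn.network.find_network("provisioning-net")  # network for the instance

    # From the user's point of view this is identical to booting a VM; Nova's
    # Ironic driver turns the request into an Ironic deployment on a physical node.
    server = conn.compute.create_server(
        name="physical-worker-01",
        flavor_id=flavor.id,
        image_id=image.id,
        networks=[{"uuid": network.id}],
    )
    server = conn.compute.wait_for_server(server)
    print(server.status)  # ACTIVE once Ironic has finished deploying the node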
The Ironic deployment at CERN has seen massive growth over the past two to three years. This is mostly because we have the policy that new deliveries only go through Ironic. Also, we had a data center that we shut down, so all the servers we had there were repatriated to the main data center, and all of them went into Ironic as well. In addition, we plan to adopt the nodes that are already in production, so while we have 5,000 nodes at the moment, we plan to go to 10,000 nodes.

Now, when scaling up the infrastructure there are a couple of issues that we basically struggle with on a day-to-day basis. Since Ironic talks to the servers mostly via the BMC, every hiccup in a BMC is of course noticeable in Ironic. We had BMCs ignoring power commands, so a node would not shut off or start, or tool incompatibilities, for instance when ipmitool used or insisted on certain ciphers that the BMC did not support. To accommodate most of these, we have written a tool called IPMI proxies, which is basically a wrapper around ipmitool and which hides some of this complexity.

Another issue that we struggled with is PXE reliability. When the PXE/TFTP infrastructure fails, the node falls back to booting from disk and is then basically blocked in its lifecycle as managed by Ironic. Another issue is the overall complexity of the setup: Nova, Placement and Ironic are all complex by themselves, but taken together it is sometimes a little bit difficult to understand why something fails. Something I always mention on these occasions is the flavor explosion and flavor quotas. We currently have flavors per hardware type and per location, more or less by rack, because users would like to control very closely where their physical machines go. The other issue is that users cannot easily see how many physical instances of a specific type they can still instantiate, because there is no per-flavor quota, and we hope that resource-class-based flavors and resource-class-based quotas will solve this.

There are three main issues, though, that we want to cover in this talk, all related to scaling the infrastructure, and we ordered them from simpler to more complex: controller crashes, API responsiveness and resource discovery.

The first thing we ran into were controller crashes. I have to say that we use the iSCSI deploy interface: upon deployment, a node exports an iSCSI device to the controller, and the controller then dumps the image downloaded from Glance onto it. This deploy interface is going to be deprecated, so one should not use it anymore but use direct instead. But we started with it, and because the images are streamed through the controller, deploying many nodes in parallel would drive the conductor into out-of-memory situations. The conductor would crash and leave the nodes in an error state. The solution is relatively straightforward: we scaled the controllers horizontally and introduced so-called wing controllers which help the current controller with handling all these requests. A better solution is, as I mentioned, to use a scalable deploy interface such as direct or ansible, which makes the node download the image directly rather than tunnelling it through the conductor.
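To give an idea of what such a wrapper does, here is a rough sketch in Python. This is not our actual IPMI proxies code; the hosts, credentials, retry counts and cipher suites are made-up values for illustration.

    # Illustrative wrapper around ipmitool: retry flaky BMCs and fall back to a
    # different cipher suite if the BMC rejects the first one. All values are
    # placeholders, not production settings.
    import subprocess
    import time

    CIPHER_SUITES = ["3", "17"]   # try these RMCP+ cipher suites in order
    RETRIES = 3                   # retry BMCs that occasionally ignore commands

    def ipmi(host, user, password, *command):
        last_error = None
        for cipher in CIPHER_SUITES:
            for attempt in range(RETRIES):
                result = subprocess.run(
                    ["ipmitool", "-I", "lanplus", "-H", host,
                     "-U", user, "-P", password, "-C", cipher, *command],
                    capture_output=True, text=True)
                if result.returncode == 0:
                    return result.stdout
                last_error = result.stderr
                time.sleep(2 ** attempt)  # simple backoff before retrying
        raise RuntimeError("ipmitool failed on %s: %s" % (host, last_error))

    # Example: check the power state of a node with an unreliable BMC.
    # print(ipmi("bmc-node-0042.example.org", "admin", "secret", "power", "status"))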
An alternative solution that is currently under review is the so-called RAM guard patch. This patch makes the conductor watch its memory situation and stall requests until the memory pressure goes down, rather than running into an out-of-memory situation. You can find all the details under the URL on the slide.

The second issue we would like to cover is API responsiveness. When scaling up the infrastructure we noticed that the API response time went up significantly. This graph shows the responsiveness over time, and note that the y-axis is actually showing tens of thousands of seconds, or at least thousands of seconds: some of the requests took very long to come back. While we were looking at this, it was actually the day before we had a massive database outage, and we were convinced that this was a kind of pre-earthquake to the database outage we witnessed the next day. But once the database colleagues had fixed their issue, the slow API responses were back, so we were a little bit puzzled about what was going on. What we saw is that all requests involving the database were slow. Looking at the request logs, however, we saw that the requests came from the controllers themselves, which is quite surprising: why should a controller talk to itself via the API? But then we realized that, of course, we run another component on the controllers, the inspector. The inspector gets a list of all nodes to clean up its database, and it does that by default every 60 seconds. In addition, when we crossed 1,000 nodes we had disabled pagination. So every one of these requests made the API assemble all the nodes into one giant response and try to hand that back to the requester. What we did is re-enable pagination and change the sync interval from 60 seconds to one hour, which is enough because the sync is basically only a cleanup of the database to make sure that forgotten nodes are not stuck there forever. At the bottom, you can see the effect on the response time when we switched pagination back on and changed the sync interval, which basically solved the issue. However, a more scalable solution is the inspector leader election which we developed together with upstream and which we have now also deployed in production. The idea is that if you have identical controllers, each running an inspector, they synchronize via a coordination backend such as ZooKeeper, so that the inspectors elect a leader and only one of them does the sync. This will be available in Victoria and you can find all the details under this URL.
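As a rough illustration of the leader election pattern, not the inspector's actual implementation, here is a small Python sketch using ZooKeeper via the kazoo library; the ZooKeeper hosts, the election path and the sync function are placeholders.

    # Illustrative sketch: several identical inspectors elect a leader through
    # ZooKeeper, and only the leader runs the periodic node sync. Hosts, paths
    # and intervals are placeholders for illustration only.
    import socket
    import time

    from kazoo.client import KazooClient

    def sync_nodes_with_ironic():
        # Placeholder for the periodic cleanup of the inspector database
        # against the list of nodes known to Ironic.
        print("leader: syncing inspector DB with Ironic")

    def run_as_leader():
        # Only the elected leader executes the periodic sync; the other
        # inspectors block in election.run() until the leader goes away.
        while True:
            sync_nodes_with_ironic()
            time.sleep(3600)  # sync once per hour rather than every 60 seconds

    zk = KazooClient(hosts="zookeeper-01:2181,zookeeper-02:2181")
    zk.start()
    election = zk.Election("/ironic-inspector/leader", identifier=socket.gethostname())
    election.run(run_as_leader)  # blocks; calls run_as_leader() once elected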
And with this, I hand over to my colleague Belmiro.

Thank you, Arne. I will continue from here. Hello, everyone. I'm Belmiro and I'm pleased to have the opportunity to talk about the scaling challenges of the bare metal service. Let's talk now about another issue; we call it resource discovery. To create a bare metal instance, the user uses the same Nova APIs as for creating a virtual machine. The workflow to schedule an instance onto a resource is very similar whether it is bare metal or a virtual machine. This means that Placement and the Nova scheduler are used, so Placement needs to know about the Ironic resources, and that is done by the resource tracker that runs in each nova-compute. This is an overview of our initial Nova and Ironic architecture.

We have a dedicated Nova cell for Ironic, because that is the standard way we partition and manage the infrastructure. In this dedicated Nova cell we only have the cell control plane, the Nova conductor and RabbitMQ, and the nova-compute. This nova-compute is responsible for all the communication between Nova and Ironic, using the Ironic API. Why don't we run more nova-computes to interact with Ironic? Isn't this a potential risk? Well, yes, it is risky; it is not fault tolerant at all. However, we can recover easily in case of an issue, because we run a standby for these nodes. When managing a large infrastructure, we try to balance risk and simplicity. There is the possibility to run several nova-computes in parallel and let them use a hash ring to manage the nodes. However, when we tested this functionality, we were not convinced about its reliability, and it would also introduce a lot of complexity.

As I mentioned, we have only one nova-compute that does all the talking with Ironic. The bad news is that the resource tracker reports sequentially, resource by resource, and because API calls are involved, it can take some time per resource. As you can guess, as we increased the number of resources in Ironic, we observed that the resource tracker cycle can take a lot of time, actually several hours if you have thousands of resources in Ironic. We have a blog post where we explain the several issues that we found and what we did to overcome them; most of them are now fixed upstream. However, with an Ironic deployment reaching 5,000 bare metal nodes, the cycle time of the resource tracker was reaching three hours. During the resource tracker cycle, all user actions are queued until all the resources are updated. This created a very bad user experience: instance creation and deletion could take a few hours for our users.

In order to have failure domains in Ironic and to allow a controlled and efficient partitioning of the infrastructure, a new feature was introduced in the Stein release of Nova and Ironic, called conductor groups. A conductor group is nothing more than a manual association between Ironic resources and an Ironic conductor. We can then select which nova-compute nodes will interact with that conductor group, partitioning the deployment in this way.

How do you configure conductor groups? It is quite easy. First, each Ironic conductor needs to be given a group name, and that is done in the Ironic configuration file. Next, each Ironic resource needs to be mapped to the conductor group that we selected for it, and this can be done using the Ironic API. Finally, we need to configure Nova: basically, this means mapping one or more nova-computes to the conductor group, and that is done in the Nova configuration file. As before, we decided to have only one nova-compute, but this time per conductor group.

So this is how our deployment looks now. The Ironic infrastructure was split between different conductor groups, and we have one nova-compute node per conductor group. In case of failure of one nova-compute, only a small part of the Ironic infrastructure is affected.
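To make those three steps a bit more concrete, here is a rough sketch. The group and node names are placeholders, the configuration options are shown as comments, and the node mapping uses the Ironic API through the openstacksdk Python library (the conductor_group field needs a Stein-era bare metal API microversion).

    # Illustrative sketch of setting up a conductor group. Names are placeholders.
    #
    # Step 1, ironic.conf on the conductor that should own this group:
    #     [conductor]
    #     conductor_group = rack-A
    #
    # Step 3, nova.conf on the nova-compute dedicated to this group:
    #     [ironic]
    #     partition_key = rack-A
    #     (plus the peer_list option when several computes share the group)
    #
    # Step 2, map the Ironic nodes to the group via the Ironic API:
    import openstack

    conn = openstack.connect(cloud="baremetal-cloud")

    for node in conn.baremetal.nodes():
        # Our own (illustrative) naming convention decides which rack a node is in.
        if node.name and node.name.startswith("rack-a-"):
            conn.baremetal.update_node(node, conductor_group="rack-A")
            print("%s -> conductor group rack-A" % node.name)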
Now the question is, how do we move the production infrastructure to conductor groups? I am going to describe briefly the steps that we followed; for more information, you can dive into the blog post that we wrote about this transition, the link is also on the slide. Step one, we stopped nova-compute, the Ironic API and the Ironic conductor to avoid any issues, since we would be updating the databases manually. Step two, we updated the Ironic and Nova configuration files with the conductor group names. Step three, we updated the Ironic database to map the Ironic resources to the new conductors. Finally, step four, we updated the Nova database to map the running instances to the new nova-computes.

And this is the result. Let's look at these graphs: they show the number of Placement requests per conductor group. As expected, the resource tracker cycle now takes only a few minutes, because all the resources were split between several nova-computes. How did we decide the number of resources per conductor group? Well, the resource tracker cycle time increases linearly with the number of resources, so we defined an acceptable cycle time for our use case. For our infrastructure, we aim for around 500 Ironic resources per conductor group, which translates into a few minutes for the resource tracker cycle.

To conclude: scaling an infrastructure is a constant challenge. We addressed various issues, but some are still open, and of course others will arise. What is really key is to have good monitoring, even if sometimes we only start monitoring the right metrics after investigating the issues. We are happy to answer your questions now using the chat tool. Thank you so much for watching.