All right, we're good to go. Thanks everyone for joining the session about hardware onboarding and burn-in in the CERN data center. My name is Arne Wiebalck, I'm the cloud and Linux team lead in CERN IT.

Just to introduce CERN briefly, in one minute, for those of you who have no idea what that is: CERN is the European Organization for Nuclear Research. Our mission is to find answers to some of the mysteries of the universe, and in order to do this we have built the Large Hadron Collider, the largest machine ever built by mankind. It is a particle collider 100 meters underground, 27 kilometers in circumference, where we collide particles, protons and ions in particular. These collisions are then recorded and tracked by so-called detectors. We have four main detectors, each of them roughly the size of a cathedral, 100 meters underground, and each of them producing on the order of 10 gigabytes per second of event data. This data is then sent to CERN IT. You see here a screenshot of one of the main rooms in CERN IT, where the initial reconstruction of these events happens, where they are stored permanently, and from where they are fed into a worldwide grid of around 170 data centers, where a second copy is stored and where part of the analysis takes place. So this is CERN in one minute.

This talk is split roughly into two parts. First, I want to give you the big picture of how hardware actually moves in and out of the CERN data center; that part will be a little more theoretical. Then we will talk about the workhorse that enables all of this, which is Ironic.

New hardware in the CERN data centers goes through several phases. The first phase goes from specification to delivery. When new hardware is about to be bought, the people who will use that hardware specify what they need. There is usually a meeting, called the pre-specs meeting, where service managers express what kind of servers they would like. Then these servers are procured, which is also a multi-stage process, because as an international organization we have to go out for tender: vendors can apply and send their offers, and at some point the hardware is procured following the delivery instructions and the specification that we give. And then, eventually, it arrives in the CERN data center. As you can see on the left-hand side, this process takes about nine months from A to Z, so it is quite lengthy and needs a lot of planning. This is also one of the reasons why we introduced OpenStack in 2013: to make the time until a service manager actually has resources a lot shorter.

Once the hardware is there, it is of course installed physically and then has to go through a process called acceptance. Before we actually pay for the hardware, it has to be checked: is everything as we ordered, does the hardware work, and so on. That takes a couple of weeks, depending on the server. If you have a large disk server with 96 drives, for instance, and you need to run it through burn-in, it may take quite a while to go through this. We then keep the hardware for roughly five years. It gets commissioned and moves into production, so it goes to the service managers and the various services that run on that hardware. We repair the hardware during that time, of course, and we monitor it.
And then at some point, after a couple of years, the hardware gets decommissioned, it gets removed from all our databases, and it is then physically removed and either recycled or offered as a donation, because the hardware is usually still good. So this is roughly the whole process in a nutshell, and many components of this process are actually driven by Ironic.

That brings us already to the second part, the underlying workhorse: Ironic as the bare metal management framework. So what is Ironic, for those of you who are not intimately familiar with it? The idea is to extend the cloud approach to bare metal servers. When we introduced it, we wanted to complement the offering of our service, which already provided virtual machines and, for instance, container clusters, by also offering bare metal servers to our users through the same interface. Why? Because that simplifies a lot of the work that has to be done: workflows, accounting, approvals. All of this is the same whether you use physical machines or virtual machines.

So Ironic provides an API service to interact with these physical servers. Originally this was a provisioning driver in Nova and was then moved out, but nowadays it can be used either within OpenStack, which is the way we use it, tightly integrated, or as a standalone tool, which many deployments also do. You can find a lot of details on the website, ironicbaremetal.org. One important point is that Ironic leverages open source standards and tooling: many, or really all, of the things Ironic uses are open source and open standards. It relies a lot on IPMI and Redfish, and on PXE and DHCP. But it also allows vendors to write their own plugins to make it more flexible, and we have our own CERN plugin that does certain things we need at CERN.

On a very high level, Ironic consists of three main components. There is a database where all the physical nodes are listed with their name, for which we use the serial number in our case, which state they are in, what kind of credentials they have, and so on. Then there are the so-called controllers, which are the service processes: an API and a conductor that act on these physical machines, an inspector that can receive information about the internal state of a server, that is, which components are in there, and a message queue that allows these components to exchange messages. And the third component is the IPA, the Ironic Python Agent, which is an image with a daemon inside that is launched on the physical nodes. With these three basic components, Ironic manages physical nodes.

On the right-hand side, you see how this is laid out at CERN at the moment. At the bottom, we have groups of 500 nodes, each controlled by one Ironic controller and one Nova controller. This way the whole deployment is split, and the number of groups is around 20, so there are about 9,000 physical nodes managed at CERN with this system at the moment.

Our initial use case was provisioning: as I said, to give out physical nodes just like virtual machines to the user. So you have the user, who talks to OpenStack and then to Nova in order to get a physical machine; a rough sketch of what that looks like from the user's side is shown below.
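To make the "same interface as for virtual machines" point concrete, here is a minimal user-side sketch using openstacksdk. The cloud entry, flavor, image and network names are invented for illustration and are not our actual ones.

```python
# A minimal sketch of the user-side view: requesting a bare metal server
# through the same OpenStack API that is used for virtual machines.
# Cloud name, flavor, image and network names below are placeholders.
import openstack

conn = openstack.connect(cloud="cern-cloud-example")  # hypothetical clouds.yaml entry

server = conn.compute.create_server(
    name="physical-worker-001",
    flavor_id=conn.compute.find_flavor("p1.baremetal.example").id,   # bare metal flavor
    image_id=conn.compute.find_image("universal-linux-example").id,  # same image as for VMs
    networks=[{"uuid": conn.network.find_network("provider-net-example").id}],
)

# Physical deploys take considerably longer than VMs, so wait generously.
server = conn.compute.wait_for_server(server, wait=1800)
print(server.id, server.status)
```

From the user's point of view this is exactly the same call that would create a virtual machine; only the flavor decides that a physical node is picked.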
And then Nova, via the Ironic driver, talks to Ironic and, together with Glance and Neutron, picks a physical node, which is then given back to the user. So this was the initial use case. Of course, physical nodes are not created, they are just instantiated; the physical nodes are already there.

Now, one of the things we ran into very early on is the need for universal images, and this is the first detour I take in my talk. In order not to have different images for physical nodes and for virtual machines, we worked a lot on having a single image that can handle both, and during that journey there were various things we needed to adapt. It all still follows the "extend the cloud to bare metal" paradigm: whether you create a VM or instantiate a physical server, from the user's point of view and from the API's point of view it is basically the same.

The first thing we needed to add was GPT support. This was driven by very large servers for Elasticsearch, where we needed GPT partition tables, so we had to add a BIOS boot partition. Then we had to move to UEFI, because the hardware colleagues preferred UEFI for various reasons, so we needed to make the image UEFI-capable, which meant adding additional partitions inside the image and no longer relying on the MBR gap. We use software RAID extensively, so we had to do something in our Ironic and also in the image in order to enable software RAID when deploying images on a physical node. And something rather recent is that we also have to keep an eye on what to do with these software RAIDs over time, because there are details you need to pay attention to when a software RAID lives for a longer time: when you replace disks, for instance, you need to relocate bootloaders and so on. A sketch of what such a software RAID target configuration can look like follows below.

Another use case we had is physical batch, and the top left corner says "back to the future". When we introduced OpenStack at CERN, we moved our batch farm from physical nodes to virtual machines, because the batch team wanted to leverage the APIs that OpenStack has in order to create workflows. Now, with Ironic coming into the game, we had the opportunity to revisit the virtualization tax we were paying: we were accepting a 3 to 5% loss in performance when using virtual machines rather than physical machines, and we basically moved back. So this is really "back to the future": we moved all of batch into virtual machines, ran it on virtual machines for eight or nine years, and have now converted everything back to physical machines. This was a big campaign, and on the right-hand side you see some screenshots of how the monitoring of Ironic sees this conversion back and forth. We use Terraform as the infrastructure-as-code tool to interface with Ironic, so this is the batch team using Terraform to talk to the OpenStack APIs in order to leverage all of this.

So this is still provisioning, still the initial use case of giving physical nodes to users. But of course there is a limit to this: physical servers are not virtual machines. That may sound silly, but if you don't realize it, you will learn it the hard way. There are limits to the resources you can provide; you cannot create physical servers out of thin air.
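Coming back for a moment to the software RAID support mentioned above: Ironic lets you declare a target RAID configuration on a node, and the software RAID is then assembled during cleaning, before an image is deployed onto it. The snippet below is only a rough illustration, not our exact configuration, and the node serial number in the comment is invented.

```python
# A rough sketch of a software RAID target configuration for Ironic.
# One RAID-1 logical disk spanning the available space, assembled by the agent.
import json

target_raid_config = {
    "logical_disks": [
        {"size_gb": "MAX", "raid_level": "1", "controller": "software"},
    ]
}

with open("raid.json", "w") as f:
    json.dump(target_raid_config, f, indent=2)

# This would then typically be applied with the bare metal client before
# cleaning builds the arrays, for example:
#   openstack baremetal node set BC1234567 --target-raid-config raid.json
```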
Another limit: you cannot create a physical server wherever you want. And the booting, the bootstrapping and the debugging are more complicated; they may involve multiple teams and multiple skill sets, so it is a lot harder. The batch conversion I just mentioned needed weeks of cleanup: BMCs not working, nodes not booting for whatever reason. There is also less flexibility, because a physical server is what it is; you cannot easily add a second interface card. There is also a talk later this week by a colleague, Marina, who will look at some of these difficulties explicitly.

Now, this is the Ironic state machine, and I only have 25 minutes, so I won't go through all of the bubbles here, only the high-level ones. You have a node that is available; it gets deployed, which means an instance is created; then it is active, which means an instance is running, so it is an in-production server with an instance; then it gets cleaned and becomes available again. That is the basic cycle servers run through. I will mostly look at the cleaning part, because most of the things I am going to talk about from now on, the onboarding and the burn-in, are based on this cleaning step. We built most of these things into the cleaning framework of Ironic.

This slide shows the whole life cycle of physical servers at CERN. There is the preparation taking place, then the physical servers are installed, they need to be registered, there is some inventory happening, some health checks: is all the memory there, are all the CPUs there, are all the disks there, and so on. There is burn-in, where the components are stress-tested, and benchmarking, to make sure they behave as we want. Some configuration, like software RAID. Then they are provisioned and given to the user. There is an adopt step with which you can roll in-production servers into Ironic, basically tricking Ironic. Repairs have to happen, of course, and at some point the servers get cleaned and retired. You see I put a bar next to most of these steps, because I have updated this slide over the past couple of years and the bar is getting more and more full: most of these steps are now handled fully by Ironic.

So how does that work? The generic work cycle is that something triggers Ironic to do something on a node; that was a very generic statement. You have the admin, or some tooling, that tells Ironic: hey, clean this node. Ironic checks its database and then talks to the node, usually via the BMC with protocols like IPMI, and PXE-boots it into the image I mentioned before, with the IPA daemon inside. The agent then calls back home to Ironic, and Ironic gives it the instructions of what to do, for instance clean the disks. This is the cleaning framework in a nutshell, and this is what we leveraged for the things we are doing.

So, second detour; you see I like detours. BMC interaction I just mentioned: one of the things we are moving to is Redfish. Rather than using IPMI, which we have been using for a long time and still use a lot in the CERN data center, we are now moving more and more to Redfish. Why? Because it is becoming the industry standard, and it has a lot of advantages over the traditional IPMI standard. So we moved newer deliveries into Ironic with Redfish.
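To give a feel for what talking to a Redfish endpoint boils down to, here is a minimal sketch: it is just HTTPS against well-known paths defined by the standard. The BMC address and credentials are placeholders, and disabling certificate verification is only a shortcut for the illustration.

```python
# A minimal sketch of a raw Redfish interaction: list the systems behind a BMC
# and print their power state and serial number. BMC host and credentials are
# placeholders, not real ones.
import requests

bmc = "https://bmc-example.cern.ch"
auth = ("admin-example", "secret-example")

# The Systems collection lists the computer systems managed by this BMC.
systems = requests.get(f"{bmc}/redfish/v1/Systems", auth=auth, verify=False).json()

for member in systems["Members"]:
    system = requests.get(f"{bmc}{member['@odata.id']}", auth=auth, verify=False).json()
    print(system["Id"], system.get("PowerState"), system.get("SerialNumber"))
```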
We found a couple of issues in the various implementations; implementation here means how the Redfish standard is actually implemented on the BMC side. One famous thing we had to handle was ETag handling: verifying that the state of the node is the same between when you last checked and when you send commands. But the basic functionality is there, and you can handle nodes with Redfish. We now have around 600 or 700 nodes out of the whole fleet that are handled entirely with Redfish, and it works for our use cases, which is basically all the cleaning, instantiating and so on.

What we are currently working on is moving this into the specification process I mentioned earlier. In the specification document that we give out for a tender, for companies to validate that their hardware matches our needs, we are looking into interoperability profiles. What is that? It is basically a list of things that you expect from a Redfish endpoint, and there is a validator that consumes such a profile and tries it against the physical node, or, in more general terms, against the Redfish endpoint. In this profile you describe what you need, you give it to the tool, and the tool verifies that the hardware is actually doing the right thing. The plan is to have these profiles as part of the specifications, so that when we say we want to buy hardware, it has to fulfill these characteristics on the Redfish endpoint and we are able to deal with it with our Ironic.

Now, the other question we had around Redfish is how users should interact with it. You may ask: why users? Why do users need to interact with BMCs? It is quite peculiar to our data center that users actually have access to the BMC credentials. It is a relatively trusted environment, so users can talk to our BMCs, and the repair team can as well. That is not the case in all clouds, but it is the case in ours. So one of the things we were wondering is how users should interact with the Redfish endpoint. For IPMI there is the well-known ipmitool, but what do we do for Redfish? Of course you can ask users to use curl, but when I wrote the documentation I realized that this was not going to fly very far, because it gets complicated very quickly, and it also hits the limits of me editing web pages full of quotes and escaping, which got very complicated. There is Redfishtool, which is like an ipmitool for Redfish, but for us it was only a temporary stop-gap solution. And then there is the Redfish Tacklebox, which is a set of tools you can install in order to interact with a Redfish endpoint, and that seems to be working very well for us. I would also like to point out that the community is very responsive: there were a couple of issues where we asked, for instance, whether we could see the system event log more easily with the Tacklebox, and they were very responsive and implemented it.

Okay, so much for the Redfish detour. So, one of the things we do now, going through the big loop I showed earlier: one of the first things that happens is that nodes are auto-registered with all the databases at CERN automatically. The node switches on for the first time in the data center and gets redirected by our DHCP server, not to Ironic itself, but to the IPA image.
So it boots into the IPA image for the first time, you can see this over here, which then registers the node in our non-OpenStack network databases, performs the introspection of the node, and sends the introspection data to Ironic, upon which Ironic itself learns about this node for the first time and registers it in its own database. This is the enroll-a-new-node step. It also forwards the inventory data of this new server to S3, which in our case runs on top of Ceph.

Now, what do we do with this introspection data? That is basically the part up here: there is data going into S3, and then there is some enrichment with benchmarking data, but in the end you see down here this openDCIM. We have our data center inventory management system based on openDCIM, where you can drill down from the room to the rack to the quad to the server, and then you can click and get at the inventory data if you want to. The inventory data we collect per node is quite extensive; you can see all the disk serial numbers, everything. So it is quite a large chunk of data that Ironic is able to collect. This is just how it is embedded; actually, I should have called this a small detour as well.

Verifying that a server actually looks the way it should can be done with introspection rules. That is a feature in Ironic that allows you to describe how a certain server, or in our case a whole delivery, should look: you can say it should have that many cores, that many disks of this size, and so on, and Ironic will automatically check this and flag nodes where it is not the case. Imagine you get 1,000 nodes and there are a couple of them where a memory module is missing; in the past this was very hard to find. We were basically sending this data somewhere and then looking at huge text files to see if there was any difference. You can do this with sort and awk and grep, and we have done that as well, but introspection rules do it for you: you describe the rules once, per delivery, and then let them run against the inventory you get from the node. You can do this online, the first time the node sends its data in, or offline, retrieving the data from S3 later on.

Then, once you are happy that all the components are there, we burn in the hardware. We do burn-in in four stages: we stress the CPU, the memory, the disks and the network. For CPU and memory we use stress-ng, and for disk and network we use fio; using the same tool for both means one tool fewer in the image, which makes the image a little smaller. This is also part of cleaning: there are cleaning steps, added upstream, with which you can trigger these stages, and this has all been released with Xena. Something we have in addition is a way to watch what is going on in real time: we added a Fluentd daemon to our IPA image that, I was about to say live-tweets what is going on, but it basically sends data into Elasticsearch and Kibana, so we can see where the nodes are in the various stages, because some of these burn-in steps may take very long and you want to see if something is breaking and how it is going.

Now, networking is a little more complicated, because network burn-in actually requires pairing.
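Since the burn-in stages are implemented as cleaning steps, kicking them off is just a manual cleaning run. Below is a rough sketch, not our exact tooling: the cloud entry and node serial number are invented, and the step names follow the upstream burn-in support, so check the exact names available in the release you run.

```python
# A rough sketch: trigger burn-in as a manual cleaning run via openstacksdk.
# Assumes the node is already in the "manageable" state; Ironic then PXE-boots
# it into the IPA image, which executes the steps and reports back.
import openstack

conn = openstack.connect(cloud="cern-cloud-example")        # hypothetical cloud entry
node = conn.baremetal.find_node("BC1234567", ignore_missing=False)  # nodes keyed by serial number

conn.baremetal.set_node_provision_state(
    node,
    "clean",
    clean_steps=[
        {"interface": "deploy", "step": "burnin_cpu"},
        {"interface": "deploy", "step": "burnin_memory"},
        {"interface": "deploy", "step": "burnin_disk"},
    ],
)

# Burn-in can take many hours; wait generously (timeout in seconds).
conn.baremetal.wait_for_nodes_provision_state([node], "manageable", timeout=24 * 3600)
```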
All the other stages you can do on a single node: you just launch them and run a script or a command. Networking is more complicated because you need pairs. We started with something static, where each node knew its counterpart; both nodes would start cleaning, the first one would wait for the other, and then they would burn in together. This has some drawbacks: imagine one of the nodes doesn't show up, then the other one just waits and times out. So we quickly moved to what we call dynamic pairing. There is a distributed arbiter in the back where a node can say: look, I want to burn in; another node comes up and says it wants to burn in as well, and then they simply pair up, go off and burn in. The advantage is that you don't have to wait for nodes that are, for instance, in repair. All of this is upstream, and we have tested it with more than 150 nodes in parallel: 150 nodes come in, go to ZooKeeper in the end, find their pairs, and it works.

Benchmarking is also part of cleaning. What happens in the benchmarking step is that it downloads a container, in our case a Singularity image, runs a specific benchmark and then sends the data into a pipeline with Elasticsearch and Kibana. This allows us, for different deliveries, so different processors, different kinds of hardware over time, to see how the nodes are doing performance-wise. For most deliveries you get a more or less nice normal distribution, but sometimes we had cases with double-peak structures, and this lets you very easily find nodes where something is wrong: you have a fraction of the nodes that don't behave well, and then you can dive in and see what is going on.

And at the very end, the talk is called onboarding, but I also want to talk about offboarding a little: we also use Ironic for retirements. What we added to Ironic, also upstream, is the possibility to tag a node as retired. What does that mean? Initially it doesn't mean anything; the node just carries that tag, it is an in-production server which is marked as retired. But the moment the user deletes the instance, the node does not go through the cycle to available anymore; it goes back to manageable with the retired flag and does not move any further, so it cannot be used for a new instantiation. The idea is that when we plan retirements, and you saw that the cycle takes multiple months, we can decide one year in advance that this row of racks needs to be retired, set the retirement flag, and then whenever a user deletes an instance the node will move to the side. And at the very end we can also re-burn-in: usually before we donate servers they are tested again, so that we don't hand over broken nodes. But this is not something we are doing yet, which is also why, I believe, that bar is only half full.

All of this together allowed us to grow Ironic, since we started in late 2017, to, okay, the graph is not brand new, but trust me, it is around 9,000 nodes somewhere up here. And you can see certain structures in the graph: the very steep increases are deliveries, new hardware coming in, and you see it is a couple of hundred nodes being added at a time.
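Coming back to those benchmark distributions for a second: the kind of check that makes a double peak easy to spot can be as simple as flagging nodes whose score falls well below the median of their delivery. This is a standalone sketch with made-up numbers, not our actual Elasticsearch/Kibana pipeline.

```python
# A toy sketch of finding performance outliers within one delivery:
# flag nodes whose benchmark score is well below the delivery median.
import statistics

def find_slow_nodes(results, tolerance=0.9):
    """results: mapping of node serial number -> benchmark score (higher is better)."""
    median = statistics.median(results.values())
    return {serial: score for serial, score in results.items() if score < tolerance * median}

# Invented serial numbers and scores, just to show the idea.
delivery = {"BC1000001": 102.0, "BC1000002": 99.5, "BC1000003": 67.2}
print(find_slow_nodes(delivery))  # -> {'BC1000003': 67.2}
```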
Back to the growth graph: in between the big steps there are also smaller deliveries, or other nodes that are added to Ironic. Equally, you can see retirements; here is a retirement of about 1,000 servers, which of course makes me very sad because the numbers in Ironic go down, but overall we are still going up. And then we have something here which is very steep, which is adoption: a campaign where we added in-production servers, so servers that were being used, to bring them under the control of Ironic. So a running mail server, say, that used to be controlled the old way is now controlled by Ironic, and we enrolled these nodes. If you want to know how that works, there is a blog post on this, and I also recommend the talk I mentioned earlier, which will briefly explain this as well. If you want to know more about how Ironic is used at CERN, we have a blog where we regularly post articles about how we do things with Ironic and in general at CERN, and also a couple of videos that describe in more detail why we are using Ironic, how we do the whole life cycle, and how we scaled to this size.

And with this, I'm done and happy to take questions. Thank you.

So if you have a question, there is a mic over there; otherwise I will just repeat the question. The question is whether we provide this to users or keep it all for ourselves. Well, I like to keep things for myself, of course. But we do both. The OpenStack hypervisors are provisioned with Ironic, which is a very nice loop: the physical node that runs a hypervisor is an instance in Ironic, and that hypervisor runs the control plane for Ironic, so let that sink in. But we also give this to users. Users in this case means mostly services within IT: for instance, the mail servers, the surf servers, the other storage servers, the DB team, they get instances in Ironic. Also the experiments, the physics experiments I mentioned earlier, sometimes need physical machines, for instance because they need something that is very stable performance-wise when they tune their code. They can't necessarily do that with a VM, because the optimizations they make are smaller than the fluctuations in the performance of the VM, so they can't really see the effect. In those cases we also give them physical nodes.

Something we are introducing now that I haven't mentioned at all is multi-tenancy. With specific deliveries we give the users even more power over the nodes; the security folks are already scratching their heads, but the users have the BMC credentials already anyway. With multi-tenancy, Ironic has the concept of owners of nodes. When a certain delivery goes to a specific service, say the mail team, and they want a specific RAID configuration, you can give them the nodes and open the API for them to change certain things on the nodes: for instance, do the RAID configuration themselves rather than us doing it. Or they can drive a node through cleaning, so if there is something wrong with a node and it gets stuck in cleaning, they can fix it themselves. Or, the main use case, they can actually see how many nodes they have, because there are no quotas for flavors. It is sometimes very hard for users to see how many nodes of which type they can actually instantiate, because they only get cores and RAM and have access to 20 flavors.
But they don't know which flavor is which, or how many of each they can create. This is one of the drawbacks of physical nodes. With multi-tenancy they can actually see: okay, there are 20 of this type and 10 of this type, and eight of these are instantiated. So the answer is: we do both.

The next question was whether we considered other tools. The short answer is no, basically because we already had OpenStack at the time, so for us it was very natural to look at Ironic next, and once it got momentum we just rolled with it. We did not do an extensive study comparing it, for instance, to Canonical's MAAS or any other system.

The next question is whether the Ironic that we use to provide the hypervisors is inside the same OpenStack deployment, or whether it is a separate deployment. It is actually the same: one OpenStack deployment which has Ironic and which also provides the hypervisors.

The mic is actually on, one second. Okay, yes. Can you hear me? Yes. A couple of questions. First, how do you deal with BMC passwords, which nowadays are often unique per node? And second, do you rerun the performance benchmarks and reanalyze them against previous results, to see if there was any regression in performance, following, say, a BIOS or firmware update or wearing of the devices?

On the second question, about benchmarks, the answer is no. We buy capacity, and the metric is basically performance per dollar, or performance per price; we try to optimize to get the most performance per dollar. So we roughly know what we should expect, and the benchmark is mostly there to confirm that this is the case, and also to give the number to the experiments: we don't give cores to the experiments, we basically give them units of this benchmark. What we do in addition is that for each delivery we take some nodes aside and they are continuously benchmarked, so we can see whether a new kernel or a new software library changes something and we don't get surprised later on. We constantly monitor this with a tiny fraction of each delivery.

The first question was about the BMC passwords: do we register them via introspection rules or directly in the Ironic database? So, setting and resetting BMC passwords is a hard problem; it is not that obvious to do. The way it works is that initially the hardware comes with a password that we tell the vendors to set, and that is the password we use initially. The passwords are then set by an external tool, so it is not Ironic that does this; the hardware team deals with it. They have a separate database, they manage the passwords via their tooling, and then they update Ironic. So Ironic is not the source of truth for the passwords. Thank you.

There is one more question. Yeah, very good question: how do we discover the physical location? My cloud answer to this is that physical location shouldn't matter, right?
That is what I tell people when they come to my office and say they want a server in a specific location: I say that's not cloudy, and you can imagine how happy they are about this answer, "ah, okay, very good then". So, yeah, that is usually not how it goes.

The way it is actually done, and this is a hard problem as well, is that the network at CERN is structured mostly per rack. So when the node discovers in which, what we call, IP service it is, that is, in which network and on which switch it is connected, it knows where it is. That is roughly how it works. This is then reflected in the resource class, which is the tag you use to do the scheduling, and also in the flavor, so users can actually see from the flavor where the node will go. But it is a hard problem, and if you come to the other talk, you will see that it is hard. Right, we could use LLDP, we just didn't do it yet; we haven't done anything super sophisticated, though we have started looking into this as well. But in most cases I also try to insist that people should not care exactly where their server is: look, we make sure we can say it is in different availability zones, and that should be enough; you should not rely on whether it is in the top position of the rack or not. That doesn't work for all use cases, but there is room for improvement, as always.

The next question, from the back, is about the image that we have. No, it is a public image, public to our users; there is nothing special about the hypervisor image. Users instantiate their virtual machines with the exact same image as we use for physical nodes; this is why we worked on extending the initial image that we had for virtual machines to be capable of handling physical nodes as well. Sorry, say it again? It is a Stream image, a CentOS Stream image. Yeah.

The next question is about the networking; that is very often asked, what we do about networking. The networking at CERN, the standard setup, is very simple: it is a flat provider network, so the interaction that Ironic has with Neutron when instantiating is basically non-existent in our case; it is all cut short. Well, it is for historic reasons, because of this infamous network database that I mentioned, so Ironic does not really touch the network in our case. So yes, it is very simple, very straightforward. We are looking into interacting with Neutron, but so far there was no need, and when we started, it wasn't needed.

The question from the back is about how we validate that servers are correctly connected network-wise, is that the question? Yeah. We don't validate this via Ironic; there are also visual inspections, and the cabling is basically done by our team, so that is not part of the Ironic validation for physical nodes. Okay. I think that's it. Thank you very much.