Manos, the floor is yours.

Hello, I'm Manos Bagakis. I work at CERN in the IT department, and today we will talk about burn-in, which is a new feature added in Ironic. So I will now share my screen. Can you all see? Yep, that works well. All right, perfect.

So first of all, let's take a look at server life cycle management with Ironic at CERN. The first step is preparing the racks, the power and the network before the servers even arrive. Then the servers are physically installed. This is the moment where the life cycle of the server begins. It starts with the registration of the server in our network databases, registration in Ironic, et cetera. Then there is a health check to make sure that each server has all the components that we ordered. Afterwards, there is the burn-in, during which we stress test the server's components to detect early failures, but more on that later. We then move to the benchmarks: we benchmark the servers to make sure that they match the performance we requested from the vendors. Then there is the configuration of the servers and eventually the provisioning to the users. There is also an additional entry point, since there can be existing servers in the data center that we want to adopt into Ironic. And of course servers may break, so we need to have a repair mechanism. Eventually, servers reach their end of life and are retired, and the last step is to physically remove them from the data center.

Now, let's take a closer look at the burn-in part. There are several reasons why we want to use the burn-in feature. First of all, we need to ensure that the hardware delivered complies with certain technical specifications. We use it to find systematic issues with all machines in a delivery, such as bad firmware. We can potentially identify failed components in single machines. And maybe the most important reason is to provoke early failures in failing components through the high load applied during stress testing.

To describe this further, looking at the server failure rate, we see that the overall failure curve consists of three cases. There is a constant random failure rate that can occur at any point in the server's lifetime. We also have failures due to wear, which increase the longer the server has been operating. But burn-in focuses on the third case, the early "infant mortality" failures: the effect of many servers failing in their first operating cycles, a rate which decreases over time. Burn-in aims to detect such cases, and this allows us to act before the servers are rolled into production. Replacing or fixing a server in production costs much more time and money than handling it beforehand.

So let's now get down to specifics. We use burn-in for four different components: CPU, memory, disk and network. As you can see, we use two different tools. The first one is stress-ng, which is designed to exercise the various physical subsystems of the computer as well as the various operating system kernel interfaces. It also has a wide range of CPU-specific stress tests that exercise, for example, floating point, integer, bit manipulation and control flow. stress-ng is intended to make a machine work hard and trip hardware issues such as thermal overruns, as well as operating system bugs.
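[Editor's note: a minimal sketch of the kind of stress-ng invocation a CPU burn-in runs on the node; the exact flags and duration used by the Ironic Python Agent are assumptions here and may differ.]

```sh
# Sketch only: roughly the kind of command a CPU burn-in runs on the node.
# --cpu 0 starts one stressor per core, --cpu-method all cycles through the
# floating point, integer, bit manipulation and control flow tests, and
# --metrics-brief prints a summary that ends up in the journal.
stress-ng --cpu 0 --cpu-method all --timeout 24h --metrics-brief
```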
fio, which is used for the disk and the network, is a tool that can spawn threads or processes and perform specific types of input/output operations specified by us. The typical use is to write job files that in essence simulate the input/output load we want.

There are several profiles that we use, and depending on the use case we alternate between them: the shortest one amounts to roughly one day of total testing time, a medium one takes roughly four days, and then there are the long burn-ins which take up to two weeks. The main parameters can be configured by setting driver_info parameters as we please, such as the CPU cores, the time we want the burn-in to run, how many loops we want to iterate over the disk, et cetera.

So what about the implementation? Hardware burn-in is featured in the Xena release for the first time. We treat it as part of the node cleaning procedure, which means that it's nothing more than another cleaning step, which we can execute through the OpenStack CLI. And of course, in case we detect some error, we can abort it at any time.

I believe we can move on to the demo. As you can see here, we will run a burn-in CPU and a burn-in disk step. Each of them will last roughly one minute. We have htop open, connected to the server, to see the processes and the CPU load as well as the disk I/O, plus the journal of the Ironic Python Agent. In a few seconds we will see the first step, the CPU test: all of the cores will go to almost 100% load. This will last for one minute, and of course we can set the parameters for a different load. Here we go. Or we can pin it to specific cores if we please. We see that the results are posted in the journal, and then we will be able to collect them and send them to whatever storage service we use. We can also see the stress-ng processes, the tool that we use for the CPU burn-in; there is one created for every thread.

Once the burn-in CPU step finishes, the next step takes place, which is the burn-in disk. For the burn-in disk, we chose to run two threads that will write, and we will see the disk input/output rise, hopefully to the maximum. As you can see, the stress-ng results are posted here. We see that the disk input/output is at 200%, which means that both read and write are at 100% of capacity each. We also see here the fio threads with a block size of four kilobytes. We can also specify the number of iterations we want to do, or which disks we want to exercise inside the server: we can do it on all of them or just on one of them specifically. This was the end of the disk input/output step as well. We can see here all the logs in the journal, which at the end of the cleaning step will be collected and sent to the conductor.

Let's move now back to the presentation. So what do we do with the burn-in logs? If we had to check each server separately, it would be very time consuming to connect to each server, look at the journal and see whether everything is all right or not. So what we do is use Fluentd. Fluentd is a service that acts as an agent and basically tails the log folders. It picks up the data and posts it to Elasticsearch, where we store it, and then we use Kibana to visualize the results and get a broad picture of what is happening.
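[Editor's note: a minimal sketch of the log-shipping setup described above, tailing the collected burn-in logs and forwarding them to Elasticsearch; the paths, host name and tag are placeholders for illustration, not CERN's actual configuration.]

```sh
# Sketch only: write a minimal Fluentd configuration that tails the burn-in
# logs and ships them to Elasticsearch. Paths, host and tag are placeholders.
cat > /etc/fluent/fluent.conf <<'EOF'
<source>
  @type tail
  path /var/log/burnin/*.log          # hypothetical location of the collected journal output
  pos_file /var/log/fluent/burnin.pos
  tag burnin
  <parse>
    @type none
  </parse>
</source>

<match burnin>
  @type elasticsearch                 # requires the fluent-plugin-elasticsearch plugin
  host elasticsearch.example.org
  port 9200
  logstash_format true
</match>
EOF
```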
For example, when there is a big delivery of a thousand computers that we need to burn in, there is a bit of a scaling issue if we want to look at each server separately. Real-time logs are important for us, and this is why we use Fluentd rather than waiting for all the data to be posted to the conductor: if we had a 48-hour benchmark, we would have to wait essentially two days to see the results, and if there were a problem that we wanted to detect early in order to abort the cleaning step, we wouldn't be able to see it.

So what is the next step? What are some ideas for improvement? In the network burn-in step we have so far a static approach, which means that we manually pair two nodes to network-test each other, so one will write and the other will read. What we want to do is to develop a dynamic network pairing algorithm. We will use Tooz, and specifically ZooKeeper, for this. The nodes set for the network burn-in will enter a "room" with other candidates, pair up among themselves and then execute the burn-in step without any manual interference.

I believe that was all. Thanks everyone for listening. Are there any questions?

Thanks a lot Manos, this was really cool. Thank you. Do we have any questions? If there are no questions, I have two comments. The first one is that in the output, as you may have noticed, the output is kind of truncated: it says "stress-n" rather than "stress-ng" with the full parameters. This is actually something that we noticed while we were developing this, and it is something we fixed in stress-ng upstream. If you Google for this string you will find it everywhere, because everyone's output looks like this, but it is a bug in stress-ng which has been fixed.

The second thing is about aborting a clean step. Aborting a clean step is maybe not as clean as it is for other clean steps, because some of the burn-in tools will continue to run in the background. Aborting the clean step will not necessarily kill the processes; it will move the node to the next stage in the state diagram. So if you say "openstack baremetal node abort", it will move the node to "clean failed", but this doesn't mean that the corresponding burn-in process has actually been killed. So this is maybe something to mention. Are there any other questions for Manos? Okay.
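[Editor's note: for reference, a sketch of how a burn-in step is launched as a manual cleaning step and aborted with the OpenStack CLI, as discussed above; the node name and timeout value are placeholders, and the driver_info option name follows the upstream documentation but should be checked against your Ironic release.]

```sh
# Sketch only: launching and aborting a burn-in clean step via the OpenStack CLI.
# "node-0001" and the timeout value are placeholders.

# Optionally tune the step through driver_info (option names may vary by release).
openstack baremetal node set node-0001 --driver-info agent_burnin_cpu_timeout=86400

# Run the CPU burn-in as a manual cleaning step (the node must be in the manageable state).
openstack baremetal node clean node-0001 \
    --clean-steps '[{"interface": "deploy", "step": "burnin_cpu"}]'

# Abort the cleaning; note the caveat above: the node moves to "clean failed",
# but the stress tool may keep running on the machine until it times out.
openstack baremetal node abort node-0001
```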