Okay, so it's ten past five; I suggest we start. As they said in the first keynote, we are in Germany, so we like to be punctual.

My name is Daniel, I work in the CERN cloud team, and I'm here also on behalf of my colleague Arne, together with whom I look after the Ironic service at CERN. Today I'm going to tell you a story that we decided to name "The Most Excellent and Lamentable Tragedy of Ironic and Nova", just to remind you of William Shakespeare's Romeo and Juliet. It's a fun way to tell you about our adventures as operators of Ironic and Nova: sometimes, due to our custom deployment, we suffer from weird issues that shouldn't be expected at all, and they cause us sometimes fun, sometimes headaches. We just want to share them with you, so I hope you enjoy it.

You probably already know about CERN because we are at almost every Summit, but here is a brief overview for those of you who don't. CERN is the European laboratory for particle physics and we are located in Geneva, in Switzerland. Our main mission, among other things, is fundamental research in the field of particle physics. This is the LHC, an accelerator that accelerates particles to very close to the speed of light, and we make these particles collide in four big experiments. This is the ATLAS cavern: it is located a hundred metres underground and it is the size of an eight or nine storey building. In these places we have huge detectors; just so you notice, that is a person down there, a hundred metres underground. In the middle is where the two beams go and collide in the centre, and with state-of-the-art electronic detectors we capture the data that comes out, analyse it afterwards and see which particles are produced. The main idea is to reproduce the first instants of the Big Bang, and we do that by bringing these particles up to a huge amount of energy.

We collect all this data in the experiments and then it all comes to our data centre over dedicated links, and that is where we happily run our OpenStack cloud. As you can see, our mascots, which we love so much, run freely across the data centre, orchestrating the machines and passing messages around. We have two data centres, one in Geneva in Switzerland and one in Hungary, with three dedicated links between them, and that is where we run our OpenStack cloud.

To give you a brief overview of the cloud, this is a recent snapshot, so here you can see more or less what we are doing. We have around 300,000 cores; after the L1TF vulnerability that we saw earlier this year, we do not have as many cores in use as are available, because we needed to disable SMT on some of our hypervisors. Among the things that are growing in our cloud, we see a huge increase in Magnum clusters.
So, many Kubernetes clusters. Users are getting into this trend, and our users somehow reflect where the market is going, so they are using it, and we offer them Magnum as a way to provision clusters very easily on our infrastructure.

I'm mainly here to talk to you about Ironic, because that is the service I look after. We are also seeing a huge increase in bare metal nodes, and that is due to the policy at CERN that every piece of hardware coming into the data centre should go through Ironic. Whether it is for an end user or for us, to use as hypervisors, everything now gets enrolled in Ironic, and this also makes our numbers grow, which is pretty cool.

Now, focusing on Ironic: this is our main Ironic dashboard, what we look at and control in our daily life. We are almost at 1,500 nodes enrolled and managed with Ironic; we are one short, and I could have worked on that to have exactly 1,500 for the presentation. The main things we look at here are log messages: if we scroll down we see error messages, so we can quickly spot what is going on in the cloud. We also like to look at unusual provision states. We like our bare metal nodes to be either "available" or "active": active means a user is running an instance, available means someone could potentially run an instance. Unfortunately, we have some nodes which end up in "clean failed" states, and a recent delivery caused a hundred nodes to go into "inspect failed". This is an issue we have and we understand it; we just need to work on it, probably after the Summit. (I'll show a small command-line sketch for spotting nodes in such states in a moment.) So this is our dashboard. I'm happy to answer any questions afterwards, or if you want to share some of your best practices: I guess we are all here to learn from each other, so that would be pretty cool.

Okay, so as I told you at the beginning, I'm going to tell you a story. For this presentation I read the book by William Shakespeare, and I have to recommend it to you; it's a nice book, even if it was written some 500 years ago. I want to share some of my favourite quotes with you. This is Romeo when he realises that he is in love with Juliet; he says to himself, in a monologue: "Love is heavy and light, bright and dark, hot and cold, sick and healthy, asleep and awake. It's everything except what it is." I don't know, this inspires me, and it's a bit of a reflection of how, even if things should work a certain way, in reality they might not be like that. It's neither good nor bad; it's just our experience.

As I told you, the CERN policy now is that all hardware that comes into the data centre goes through Ironic. We might use some of this hardware as hypervisors, so we are the main users, and then we expose these through Nova to our end users to run their VMs. We also offer bare metal provisioning to our users: there are use cases like databases or storage servers that need the full capabilities of bare metal instances. So everything goes through Ironic; what the user doesn't see is the actual Ironic API. The end users interacting with OpenStack, the physicists and the people from IT (ourselves included) to whom we offer this at CERN, are very aware of what OpenStack is, but they never interact with Ironic directly.
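A quick aside on the dashboard I mentioned: nodes in those unusual provision states can also be spotted from the command line. A minimal sketch, with the state names as used by the Ironic state machine, run once per state you care about:

    # List nodes stuck in "clean failed"; repeat with "inspect failed",
    # "deploy failed", etc. as needed.
    openstack baremetal node list \
      --provision-state "clean failed" \
      --fields uuid name provision_state maintenance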
On our side, as the people running Ironic, we are not exposed to the outside world, because every request comes through Nova. That is a bit how we feel: we don't get the visibility. But in the end, of course, the users know that the physical instances are provided by Ironic.

Okay, I need to tell you a bit about our setup, because most of the issues we have been seeing are due to our setup, and our setup is the result of years of implementing OpenStack and of things you have to do along the way that are maybe not the ideal solution, but are the only way; you know, there are some trade-offs. Currently we have around 70 cells, and Belmiro spoke yesterday in more detail about how we use cells v2 in our cloud. So we have 70 cells, each of which is basically a cell conductor plus all the hypervisors underneath, and on top of that we have the Nova "magic" that does everything: whenever a user wants to create an instance, something happens up there, and there is a lot of conversation to finally select a compute node.

Then there is another cell which, for Nova, is just another cell, but for us is our main cell, because it is the bare metal cell. We have gone through different setups, and this is our current one: we have one compute node which handles the requests for all of our 1,500, call them compute nodes, hypervisors or bare metal nodes. That is a funny story in itself. When talking about Nova and Ironic, I deal with bare metal nodes, so for me it's a bare metal node; but when debugging issues with the Nova experts from our team, they sometimes come to us and say, "Wait, why do you have 1,500 computes? You told me you had one." These are common conversations, and this really happens to us: we sometimes get lost in translation between how Nova and Ironic consider their resources.

So this is our bare metal cell. The way we implement cells is driven by the fact that we want to separate projects. We have something called project mapping: each project has settings that assign it to a cell, so whenever an instance is requested from some project, that project has target cells that may end up hosting the instance. In the case of Ironic we have bare metal flavors which are dedicated to our bare metal cell, so all requests to create an instance with one of those flavors will end up in our cell.

We have one compute node, as I mentioned. In the past we used to have five or seven for scalability reasons, but we decided to go down to only one, because with one node we can debug much more easily, since everything happens through that node. In the past, and I will explain it later, we found issues when nodes would go down, and this would cause some inconsistencies in the data. As you can see, we also have three Ironic nodes; it is not very explicit in the image, but that is why I'm here to talk to you about it. These are three Ironic controllers with the same configuration on all three: we run ironic-inspector to do the inspection when new hardware arrives, and we run ironic-conductor and the Ironic API under Apache.

In the story that follows, we have several malicious characters. One of them is the power sync.
So this is an issue that we saw back in time, maybe a year ago, and we didn't realise it, or we just took it for granted, because it is actually a nice feature. Nova checks in on its hypervisors and tries to synchronise the state of each instance, to keep an up-to-date view. The same thing exists with Ironic, because the way Nova works, if you want to deploy an Ironic cloud or handle bare metal machines, you just have one or several compute nodes with the ironic driver, and they do the same thing: this power sync.

What I'm showing on this slide, though, is not actually that, sorry for the confusion; this is the Ironic side, where the same thing happens. Ironic needs to keep control of the state of its nodes, so periodically, by default every 60 seconds, it checks the power state of every node via IPMI. It goes one by one through all the nodes, checks whether each is on or off, and updates its database. That is fine, but when you have many nodes it starts to become an issue. We didn't see it at first, even though we saw the increase in the logs; we only realised because we were sometimes losing connections to MySQL. We were talking to the DB team and saying, "Hey, look, we have this issue, this is a bit weird," and they told us that it was definitely on our side, even if it seemed like it was on theirs.

So we ended up digging into the Ironic configuration and reading a lot, and we actually found two settings that we changed after some testing (I'll show a small sketch of them in a moment). The first one is pretty self-explanatory: we increased the power sync interval from one minute to five minutes. This decreased the load on the Ironic conductors and also stopped hammering the database. The other setting we found, which seems pretty interesting, is force_power_state_during_sync. If the power state is orchestrated by someone else, in this case Nova, and Nova sees that the node should be on or off, this setting makes Ironic enforce the state recorded in its database onto the node. We ended up with the case, which I think Arne reported at the last Summit, where users were complaining: "Hey, look, I'm powering on my node, I just went to the data centre and powered it on, and it keeps going down." That was because, for Nova, the instance was down; Nova was telling Ironic "for me this instance is down", and Ironic was saying "oh, then I'm going to turn it off." This was pretty annoying for the users. By setting this option to false you make sure this doesn't happen; you might get some inconsistencies, but I think it depends on your users and whether you want to deal with their anger.

Another thing we found on the Ironic side is a hard-to-explain API memory footprint. Our Ironic database, for around 1,500 nodes, is around 50 or 60 megabytes. We noticed that whenever we start the APIs, they start with a constant amount of memory consumed, the RSS (resident set size), and whenever one of the processes or threads gets a call, its memory immediately increases by a factor of 5 or 10 for some reason. We tried to debug that and never understood exactly what causes it, but what we have seen is that by changing the configuration of the number of processes and threads, we can reduce the total memory consumed by the whole Ironic API to roughly half.
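To recap the two power sync knobs mentioned above, this is roughly what the relevant part of our ironic.conf ended up looking like; the values are the ones we settled on, yours may differ:

    [conductor]
    # Check the power state of every node via the BMC every 5 minutes
    # instead of the default 60 seconds; this took a lot of load off the
    # conductors and stopped hammering the database.
    sync_power_state_interval = 300

    # With the default (true), Ironic forces the hardware power state back
    # to whatever is recorded in its database during the sync, which is how
    # nodes that users had powered on by hand kept being switched off again.
    # With false, the database is updated from the hardware state instead,
    # at the price of possible inconsistencies.
    force_power_state_during_sync = false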
So this is what we are actually using: on each of the three controllers we run one API process with 16 threads. We have also moved the API to run under Apache, where we were expecting some improvement, but we didn't see any clear difference. This is something that was discussed upstream; I think we need to raise it again, in case it is important for you or you have seen something similar.

And, as I mentioned before, Nova is also doing these regular power syncs. So back again to the conversation I mentioned: when we talk about Nova and Ironic within the team, I say "I have 1,500 nodes", and my colleague says "no way, you have only one compute node, right?", and in the end we get to the same page. Nova considers Ironic, or rather our cell in our setup, a single hypervisor. So somewhere Nova says, "Hey, hypervisor, give me the state of your instances," and then the bare metal driver goes and asks the Ironic API, "give me the state for this instance, and this one, and this one." This was hammering our Ironic APIs: every 10 minutes we would get 1,000 calls, and in the end this caused quite some load on our Ironic controllers.

We were also confused, because the two things looked like similar issues, but they have nothing to do with each other. One is Ironic's own sync_power_state, which is triggered independently on the Ironic side, and the other comes from Nova; they seemed to be correlated, but in the end they were not. So again, by looking at the documentation and reading the docs, we found that we could reduce a parameter called sync_power_state_pool_size, which is clearly described in the Nova documentation specifically with the ironic driver in mind: you may want to turn it down to something like 10, because if you run only one compute node, as we do, to handle all the bare metal nodes, the default is maybe not the right approach. This solved our issue, because it means Nova is only making 10 API calls at a time, and only when it gets back all the answers does it start creating new ones. While debugging this we also played with sync_power_state_interval to reduce the frequency; that didn't change anything, but maybe it is useful for you. (I'll put a rough sketch of these settings at the end of this part.)

Another approach that some operators take, and we were talking to the community about this, is to disable the whole power sync. This may come at a price, which is not having consistent state between Ironic and Nova, but if you are willing to pay that price, you might want to give it a go. We know of operators doing this at massive scale, running without the power sync.

So, up to now, our two main characters, Ironic and Nova, have gone through their first challenge: they defeated the power sync, a malicious character that caused us some headaches, but in the end we were happy. And here is another nice quote from Romeo and Juliet; I love this one. It is Romeo again, talking to himself: "Did my heart love till now? Forswear it, sight! For I ne'er saw true beauty till this night." So it's something like that, right?
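For reference, this is roughly what the Nova-side tuning mentioned above looks like in the nova.conf of our single bare metal compute node; a minimal sketch, with the pool size at the kind of value the Nova documentation suggests for the ironic driver:

    [DEFAULT]
    # Number of instances whose power state Nova queries concurrently in
    # its periodic power sync. The default (1000) is sized for a hypervisor
    # with local VMs; with one nova-compute fronting ~1,500 Ironic nodes it
    # turned into bursts of Ironic API calls, so we turned it down.
    sync_power_state_pool_size = 10

    # Seconds between runs of the power sync task (600 is the default).
    # We experimented with this as well, but it didn't change much for us.
    sync_power_state_interval = 600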
Our two main characters realised what a nice team they make together, so they set out for new adventures, and this was when we needed to upgrade to Queens. Queens seemed like the solution to all our problems, but this is how it felt when we tried to upgrade. Queens seemed like a big step, but again, our two characters love each other, so they decided to continue the adventure.

Arne already reported some of this at the Summit: we found curious stuff after upgrading, because we were maybe not actually ready, or we didn't read through the whole docs. I remember we had to upgrade Ironic in an emergency, because even though it is clearly written in the Ironic docs, "please upgrade Ironic before Nova", we happened to upgrade Nova first, and then Ironic crashed. Okay, that was our fault. A good tip that I would like to give you today: read the docs before you upgrade.

The things we saw with Queens had to do with the resource classes. We discovered that a user forgot to change the project he was working in and decided to create a virtual machine with a flavor that we call "medium", something like two cores. Because he was working in a project that was mapped to our bare metal cell, he just SSH'd into the machine, checked the amount of RAM, and got something like 126 or 132 gigabytes. He was like, "whoa, that's a nice outcome for a VM of this flavor size." After careful investigation we realised that the cause was the way resource classes were consumed back then. The way Nova schedules Ironic instances is by using custom resource classes: if we had a node with a custom resource class and we used the bare metal flavor, it would consume that resource class, but it would not consume the traditional resource classes, which are VCPU, MEMORY_MB and DISK_GB. In this case, because our node was also reporting those classic resource classes, it fulfilled the request: for Nova, considering this node as a hypervisor, the node did indeed have at least two cores, that amount of RAM, and that amount of disk, so Nova said "okay, go for it, host this instance." So it ended up like this. I think we reported it, or we saw there was already a fix in progress, so we applied it and it worked. But it was fun to see that this happened, and because our users are friendly with us, the user reported it to us. It's a funny thing that you might run into, so just be aware of it.

Another interesting thing: it is the same error, but the other way around, with a virtual instance overcommitting the node. We scheduled a bare metal instance with a bare metal flavor, which consumed the custom bare metal resource class, but then a virtual flavor came along (actually it was us, doing some testing) and consumed the classical resource classes on the same node. So, as reported in this log, because of the way the resource classes were consumed, the node ended up hosting more than it should have; we ended up with this funny outcome.
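For context, the standard mitigation here, as documented in the Ironic flavor guide, is to tag each node with a custom resource class and make the bare metal flavor explicitly zero out the standard resources, so that only the custom class is ever consumed. A sketch, with a hypothetical class and flavor name:

    # Tag the node with a custom resource class; Placement exposes it as
    # CUSTOM_BAREMETAL_LARGE (uppercased and prefixed with CUSTOM_).
    openstack baremetal node set $NODE_UUID --resource-class baremetal.large

    # The bare metal flavor requests exactly one unit of that class and
    # explicitly disables the standard VCPU/RAM/disk resources:
    openstack flavor set my-baremetal-flavor \
      --property resources:CUSTOM_BAREMETAL_LARGE=1 \
      --property resources:VCPU=0 \
      --property resources:MEMORY_MB=0 \
      --property resources:DISK_GB=0

    # With the osc-placement plugin you can then check which resource
    # providers would satisfy a request for that class:
    openstack allocation candidate list --resource CUSTOM_BAREMETAL_LARGE=1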
We ended up with this funny Outcome there are some tools I just put here one one that I like because this allows us to Check whether this resource class has the appropriate candidates that you would expect from when trying to instantiate an instance there are There are more ways to check that another weird instance that also due to our how to our How are we the history of our deployment and how our infrastructure is set up is that we ended up at some point in time? Scheduling instances to notes in maintenance. So this was due to some problem in how we use the placement Nested resource provider. So we were catching some things that we shouldn't or and then we ended up sometimes scheduling notes to Scheduling instance to notes in maintenance which cost our rally dashboard to look green and red and Did we didn't like it? It's just a funny story because as soon as soon as the request gets to ironic ironic complaints quickly But it made it all the way. So I don't know interesting Okay, one One thing that we saw due to the L1TF reboot campaign we saw so we have Availability zones and when we run campaigns like this we ensure we go one by one not to Make our users have if they if they properly Define their services to run in different availability zones. They should always have a note up. So this This graph you see is when they when our database that it's hosted in our own infrastructure was down due to the reboot campaign you see all these reds there and With this we Realized that If the ironic DB is down or if ironic is not reporting properly or if you have a single Nova compute like we have and it's also On fire this might cause issues because then the way all the resource providers are reported That's not stable and then they get compromised then we might get some issues recreating them that's something else that we also saw that it was already fixed in rocky and So now we are happily running all the services both in in rocky Nova and ironic Even though we needed to roll back as Belmiro said yesterday in his talk the compute node in the bare metal cell to Queens because we had some scaling issues but Another thing we That made us that made our two characters love each other is our rally testing So we have a rally enabled for lots of different tests within our infrastructure So we run cluster creation VM creation, of course volume attachment Image creation we run this every hour for the whole infrastructure and because we have cells we quickly spot if a single Cell databases down or if we have some issues with any of the services because the error rep the errors replicate across all projects Since we run rally for ironic We are so happy because we see at a glance whether it's running or not Currently Q8 failing that was just to show you that it actually reports red, but that was tricked on purpose so Yeah, this is a quote by The friar when he agrees to marry Romeo and Juliet So as we've seen if our characters love each other they should marry and he tells both lovers These violent delights have violent ends and in their triumph die like fire and powder, which as they kiss consume therefore love moderately, so That's just to remind us that I don't know I love this quote So our new our two characters they are so happy so they now set out to rocky and You now see there that they have the support of the Queens, so they are not alone anymore So we upgraded to rocky as I already mentioned before we saw Small issues nothing serious, but for example on the ironic side when we were trying to do the DB 
For example, on the Ironic side, when we were trying to run the DB sync, we discovered that because of object versions in the database we had some issues running with the new controllers in Rocky. The ironic-dbsync upgrade was telling us that the database was not ready, and the client was kindly suggesting that we run the online data migrations on the previous release. We did that, and even though there was apparently no error and the command succeeded, we still couldn't rerun the upgrade; it kept failing. We ended up discovering that in the database we had different versions for the ports, and upgrading those manually solved the issue. I think the day afterwards we saw that a patch fixing this was already up for review. That is also a bit of self-criticism for me: maybe we should get a bit more involved in the conversations going on within Ironic, because most of these issues were already known there, and we sometimes miss them because we are focused on our day-to-day operations. But I don't think you will run into this, because it has now been backported.

So what are the new plans for the CERN cloud? In our data centre we have around 10,000 hypervisors or nodes in total, and we now have 1,500 enrolled in Ironic. One of our goals for the next couple of months is "adoption", as Ironic calls it: the way to bring existing hardware within the reach of Ironic (there is a rough sketch of the flow a bit further down). In case we want to repurpose that hardware, we can then actually do it from Ironic, because we have proven to ourselves that it is pretty convenient. So we just need to adopt all this hardware, so that the next time it needs to be either recreated or re-provisioned for another user, we can do it within Ironic and within the OpenStack context.

Another feature we have been working on, and we have a proposal upstream now that is being discussed, is support for software RAID. Our hypervisors run with a RAID 1 configuration or, depending on the disks, some other configuration. This would be super nice, because it would allow us to use Ironic and abstract away all of the configuration that we currently do after the installation. The spec is there; I will share the slides later in case you want to check it. If you have been around the last few Ironic meetings, you have probably seen that we raised this there. And this is cool, because if we can have software RAID, there wouldn't be any blocker left for using Ironic for the whole data centre.

What else? Another thing we would like to do is to migrate the way we handle the hardware. One pain that we have is keeping track of the exact delivery or the exact hardware: who is it going to, who has it now, whether there is some lease between users, because maybe they don't need the full, I don't know, 40 machines that they got and they temporarily lease some to another user. This is the way we work, and having an Excel sheet with colours is, I think, proof that it is not the proper tooling for handling this. Actually, yesterday there was a super interesting Forum session about hardware inventory. There is a proposal, I don't think it is accepted yet, but there is work being done there, and it actually has a name as well: "sardonic". This would be something like a hardware inventory, not a fully fledged CMDB, but it would solve many of our issues, because it would integrate nicely with Ironic, which is the main goal of the proposal. Whether it gets accepted or not, this would help us a lot with our management of the bare metal cloud.
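Going back to adoption for a moment, the flow we are looking at is the standard one from the Ironic documentation: enroll the machine that already has a workload on it, describe what is running on it, and then move it straight to "active" without redeploying. Roughly, with placeholder names and driver details:

    # Enroll the already-deployed node (driver details are placeholders for
    # whatever BMC and driver the machine actually uses):
    openstack baremetal node create --name existing-node --driver ipmi \
      --driver-info ipmi_address=... \
      --driver-info ipmi_username=... \
      --driver-info ipmi_password=...

    # Describe the image that is already on the disk, move the node to
    # "manageable", then adopt it; it ends up "active" with no deploy step:
    openstack baremetal node set existing-node \
      --instance-info image_source=<image-already-on-the-node>
    openstack baremetal node manage existing-node
    openstack baremetal node adopt existing-node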
One more thing that we very much look into for the future is console support. This is what we currently offer our users, and it is far from ideal. We suffer it ourselves whenever we need to log in to our machines to check whether there is some networking issue or whether something happened during inspection or cleaning: we get the credentials, we go to the BMC interface, download the console and open it. That is far from ideal. Another thing we are looking into is Redfish, because if Redfish provides these capabilities, depending on the vendor, it would be nice to use it for all these use cases.

And that's it for today. As you can see, this is a love story: in our team, Nova and Ironic work together and love each other. So now feel free to ask any questions, and thanks for coming along.

[In response to an audience question:] We don't at the moment, because we have a flat network, but that is something we are looking at migrating to SDN.

Thank you. If you have any questions, just feel free to ping me, I'm here. Thanks for coming, guys.