Thank you. Today's topic is scaling Ironic, and by scaling we don't mean climbing, although you could possibly climb Ironic in terms of learning and knowledge and so on and so forth, but I'm just rambling. Actually, before I really dive into this, I do want to thank everyone, every large operator that ever talked to the Ironic community and expressed frustration or feedback, because that has really guided us a long way. CERN, Yahoo, even the folks on this video call have provided us feedback, a lot of which we've been able to take action upon and really improve some of the user experience. So, a little bit about what we're going to talk about today: the architecture, roughly from 10,000 feet. This is a lot of theory, how it is supposed to work. I'll talk about a couple of different options that are available, and really this is up to however you wish to deploy your environment. I'll also talk about some common pitfalls and lesser-known details. After that, I'll talk about some solutions. These solutions are mostly for Ironic, and then I'll talk about Inspector. Inspector is a special topic in this case because it is different and has a different use case. So first, the architecture from 10,000 feet. Before I dive into this, I found this quote from Terry Pratchett, which I thought was very interesting: if you don't know where you come from, then you don't know where you are; and if you don't know where you are, then you don't know where you're going; and if you don't know where you're going, then you're probably wrong. That feels very appropriate in this case, because in order to administer and run a large infrastructure deployment, you really have to understand the lower-level details. These are things that tend to be forgotten, or knowledge that's actually being lost, because technology is becoming so advanced, but there's still the lower-level infrastructure, and that's kind of the reason why we're all here, actually. So, at the highest level: Ironic has an API service, of which you can run any number, and it uses a message bus and a database. Then you have an Ironic conductor, and you can have any number of conductors. The theory, at a high level, is that the database is read-only for the API; only the conductor writes to the database. The API sends commands or requests to the conductor over the message bus, and the conductor replies over the message bus. There are two different options here: you can use RabbitMQ as the message bus through oslo.messaging, or one can use JSON RPC. Ironic is very much agnostic between the two. The real driver behind adding explicit support for JSON RPC was just to eliminate RabbitMQ as a component, to enable smaller deployments. Another, probably important, aspect of this is that Ironic doesn't store any persistent data on the message bus. It's merely a pass-through communications channel, so JSON RPC works very well for this as well. Again, the message bus is for transactional information, requests and responses. The database is the persistent data store. That database has the details, the state information, everything that guides how the software works. So largely, when we look at distribution, work requests can be distributed across the APIs, and depending on how you set up your environment you can have, in essence, any number of APIs talking to any number of conductors. There are probably some scalability limits here that Arne has hit himself.
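To make the two RPC transport options just mentioned concrete, here is a minimal sketch of the relevant ironic.conf settings. The option names (rpc_transport, transport_url, and the [json_rpc] section) are the standard ones; the hosts, credentials, and port are placeholders, and you would pick one of the two blocks, not both.

```ini
# Option A: RabbitMQ via oslo.messaging (the traditional default)
[DEFAULT]
rpc_transport = oslo
transport_url = rabbit://ironic:secret@rabbitmq.example.com:5672/

# Option B: JSON RPC, no message broker required
[DEFAULT]
rpc_transport = json-rpc

[json_rpc]
# address and port the conductor's JSON RPC server listens on
host_ip = 192.0.2.10
port = 8089
```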
That being said, the way it works is: when you request something that has to go to a conductor, the API chooses the most appropriate conductor based upon the hash ring for the node. And when I say the hash ring for the node, we distribute machines by a consistent hash ring, which is a calculation that maps the node to the conductor that manages that actual physical hardware. The exception is cases like creating new nodes, which get sent to whatever conductor can handle the request; but generally, this is how requests get distributed. So your API may get a request, and it will send it along to whichever conductor is appropriate, the conductor responsible for managing that node. The conductors themselves maintain that consistent hash ring; they assign nodes via that consistent hash ring, and they are essentially responsible for all the management of state for that node and that interaction. So that's the theory. Now I'm going to dive into some common issues and pitfalls and lesser-known details about how all this works. API requests: specifically, in Wallaby we added a bunch of role-based access control logic. We measured the performance, and it was atrocious, far worse than we had previously. So we spent some time during the Xena cycle to actually clean up these performance bottlenecks and optimize the code and how we do this request processing, and we managed to increase performance quite a bit. We also backported these patches. Before Xena, Ironic only removed unrequested data from a response at the very end. So if you can imagine, you're packing a box full of information from the database, you're sending it to the web server, the web server takes this giant box of data, starts unpacking it, and throws out 90% of the data. That was the thing we were running into quite a bit. At this point, what we do now is only retrieve the requested data plus the essential fields, and then return the response based upon what was requested. So if you say, give me the node detail for all the machines, yes, it's going to be slow; under the hood it's doing thousands of data conversions, even for a list of 100 machines. So it's very resource-intensive to do a detailed list. If you do a column-based list, or say I need these three fields, it's much faster. This was previously kind of in place, but it really wasn't performant, and it actually generated more database queries when you did it, which was kind of surprising when we dug into how the backend SQL interaction worked. So, for example, here is an image provided by CERN. This is their maximum response times before, during, and after the upgrade, I believe. And you can see that the amount of time spent on API request processing, on average, dropped substantially. My benchmarks locally ran about 5.5 to 7 times faster. Arne posted something much, much larger, but it really depends on what you're doing. Go ahead. It's about 10 times; I measured about 10 times. And you can also see, on the left side of the picture, there's even a structure which shows the pagination, for instance, so you can see how we get nodes in chunks of, I think, 1,000. So there's all kinds of structure; the structure is still there on the right-hand side, but it's much more condensed. It's roughly a factor of 8 to 10 faster, I would say. Excellent. That makes me feel really good as someone who spent a lot of time on this. Where this has a huge impact, though, is the Nova-to-Ironic synchronization.
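As a small illustration of the detailed-versus-column-based listing point above, with the standard baremetal CLI (the field names are just examples):

```bash
# Expensive: full node detail, thousands of object conversions on the server side
openstack baremetal node list --long

# Much cheaper: request only the fields you actually need
openstack baremetal node list --fields uuid name power_state provision_state
```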
And the way this works, at a high level, is that Nova, or the nova-compute processes, maintain a cache of nodes. They basically constantly, or semi-constantly, re-poll and update that cache with new entries as they appear. However, this is a lot of overhead. Where these patches really make a huge difference is in getting that data, because a huge amount of the time being spent by nova-compute was just retrieval from Ironic. There is another bottleneck inside Nova that I won't really talk about, but long story short, it's data manipulation and processing that causes some bottlenecks, which is one reason you might want to run many nova-computes. We can talk a little bit more about that in a minute. But the bottom line is that the nova-compute process does do a heavy lift with all this data. Additionally, one thing we did find is that some operators were finding they needed to restart nova-compute to address issues in their environment, due to race conditions in the actual list processing, because it could take so long for the list to be processed. We then found that we were doing unnecessary work with attachments, or virtual interfaces, and we've actually stripped that out of the interaction, so now the compute process should start much faster. I know a particular operator that's on this call who was actually the inspiration for me fixing that, after we met for coffee one morning or something like that. They were having to restart, and it was taking an atrocious amount of time; I felt really bad about it. That operator may have played up the impact of that event slightly. Okay. So, before we really move on, I think it's important that we understand the process model of Ironic. Every conductor launches actions as what's called a task, in a green thread, or as periodic tasks. And these are things like: is this task done, can I resume this task, has the state changed, has RAID configuration finished, has BIOS configuration finished. All of these things are tasks. By default, the conductor can run 100 tasks. A huge consumer of these is power synchronization, and this is actually one of the huge bottlenecks that can exist in environments. Internally, Ironic also has only two reserved green threads; we've not found it necessary to extend this. One is reserved for basically the database heartbeat, saying I'm working, I'm alive, and the other is reserved for heartbeat processing from agents. But most of the conductor's time is actually spent orchestrating actions, hence the name conductor. So, I mentioned power sync, and power sync is a core feature of Ironic. It's an extremely useful feature of Ironic. We've found that hyperscalers that operate at scales of 40,000 to 250,000 nodes or so tend to have to turn this off, largely because it is an aggressive process that updates the power status so it can be reflected to users in relatively short order. Obviously you don't want a machine to lose power and have the API still say, yes, it's alive, it's working, everything's fine. That's not ideal, especially if you have other processes that may automatically kick off or launch new instances; you get the idea at that point. So a common setting is to change the interval to make it longer. We did make improvements to this back in March 2021.
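The knobs being discussed here live in the [conductor] section of ironic.conf. A hedged example of the common tuning follows; the values are illustrative only, and defaults can differ between releases:

```ini
[conductor]
# size of the green thread pool used for tasks
# (two threads are reserved internally, as described above)
workers_pool_size = 100

# how often, in seconds, each node's power state is checked;
# raising this from the default of 60 reduces BMC and CPU load
sync_power_state_interval = 300

# how many power sync operations may run in parallel
sync_power_state_workers = 8
```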
And I believe we changed the threading model a little bit around this, but largely this is still a very process-intensive operation that just runs in the background, and I think it runs every five minutes or ten minutes. So, one of the things that CERN actually found was that we were doing extra work. Surprise, surprise; we like to be very thorough. They were kind enough to implement lazy loading for the related data structures, and we can see that this had a huge impact on the performance of Ironic in terms of the number of database queries that were occurring. Because, you know, if you're just doing a list of nodes or you're just working on one task, you don't necessarily need the ports or the drivers or any other related information. So obviously, the more we cut down the number of queries, the more concurrency can occur in an environment. These patches were backported all the way through Stein, which is quite old at this point, but it should be a huge performance improvement. And this, in addition to the API performance improvements, is just another reason operators really should try to run more up-to-date versions; it's actually very difficult and resource-intensive for us to get patches down to, say, Stein. So, the other issue with power sync is IPMI. Ironic uses the operating-system-provided ipmitool binary to talk to each machine individually. This has a huge benefit of abstracting a number of details away from Ironic, in terms of retry handling and ciphers, but these are still things we sometimes need to set, or know about and pass into ipmitool, to address certain cases. The bottom line is that this process is resource-intensive to launch because it's a native C binary, and Python's execution of other binaries is also CPU-intensive. So it's CPU-intensive to launch it, and it's CPU-intensive to operate it, briefly. And if you start doing the math with default settings: if you have a cluster of 250 nodes on one conductor, it will try to launch 15,000 ipmitool commands an hour, which is an enormous workload. And if you have a cluster of 1,000 machines, this can easily reach a million ipmitool calls a day, which seems insane, but it is extremely resource-intensive. A path forward to avoid this resource-intensive usage is to use Redfish. It uses a native client that caches sessions and tries to be much more efficient, and it's not causing a system binary to be executed every single time, with libraries being loaded and processed, and so on and so forth. So if you can use Redfish, I would highly advise it for scaling any Ironic cluster, just because of the amount of performance you'll gain simply by not using ipmitool. Another reason not to use ipmitool is that IPMI is incredibly insecure. And when I say insecure, I mean exceptionally insecure. For those that aren't aware: in most cases, and it's what the specification says, if you fail authentication, it sends you a hash of the password, so that you can basically try to figure out what the password was. It's just awful. So please don't use IPMI if you can avoid it. So, going back to periodic tasks: another strategy that is often used by operators is to disable un-needed periodic tasks. If you're not doing things like RAID configuration or firmware updates, which have periodic tasks wrapped around them, those tasks exist largely to detect the completed state and move the conductor on to the next task.
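Going back for a moment to the ipmitool-versus-Redfish point: moving a node to the Redfish hardware type is a per-node driver change plus the BMC's Redfish details. The driver-info keys below are the standard ones; the address, credentials, and system ID are placeholders, and redfish has to be listed in enabled_hardware_types on the conductors.

```bash
openstack baremetal node set <node> \
  --driver redfish \
  --driver-info redfish_address=https://bmc.example.com \
  --driver-info redfish_username=admin \
  --driver-info redfish_password=secret \
  --driver-info redfish_system_id=/redfish/v1/Systems/1
```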
The queries behind these periodic tasks have been optimized to try to minimize load and minimize the creation of tasks, but it's still additional load; it's still additional database queries sorting through nodes. So if you don't absolutely need them and you need to scale, consider evaluating the periodic tasks that exist. Usually you can see these in the Ironic conductor configuration by looking at anything named with "interval". Not everything is an interval; some explicit hard pauses for tasks are intervals as well, but generally everything named interval is a periodic task timer or is used in that calculation. If you set those values to zero, that generally disables the task. Also, one thing to note: even if you don't use a driver in an environment, if it's enabled, its periodic tasks will still run. It's part of the driver design model, and it's kind of handy, but at the same time, if you're running many machines and you have 2% of extra database queries in your environment that are unnecessary work, you might want to disable them. So, here are some solutions to talk about. And I know I've been talking about solutions and how to navigate these things, but largely those were pitfalls and aspects that are somewhat known but not commonly talked about. One of the answers when we talk about scaling Ironic is: just add more conductors. The hash ring, which I mentioned earlier, will distribute nodes across the conductors. So you can increase the number of nodes and increase the number of conductors, and that will scale. However, there is a general happy place where operators tend to find an ideal conductor-to-machine ratio, and this is largely for operators running IPMI, or a large number of IPMI nodes, because of that computational overhead of launching processes. I think operators have provided feedback of somewhere between 250 and 500. I would say 250 is no longer really the happy place; in another two years, with improvements in processors, that might be 750. But again, it all depends on the workload, your drivers, everything you're doing. I will say that some operators have achieved tens of thousands of nodes per conductor. Those operators are the ones that tend to turn off power synchronization, though, which, as I mentioned, is the most resource-intensive operation. So, as I mentioned earlier, you can scale the number of APIs and you can scale the number of conductors, but at some point you start running into some logistical issues. The database is one aspect where you might run into problems. We have not tested this in the Ironic community, but an option may exist, and the community would probably be more than happy to help explore it. The oslo.db library does allow for what are called secondary databases, or read-only databases, to be configured as URLs pointing at additional read replicas. That allows read queries to go to one database, or any number of databases which could be behind a load balancer, which could allow the service to scale even more. So, when adding more conductors, you do want to sequence your restarts. This is largely because there is a hashing calculation, and there are database updates that have to occur, especially if you do a full, complete cold start where you shut everything down. That tends to cause a huge spike in database activity: loading all the nodes, starting all the tasks, and Nova starting to sync.
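On the read-replica idea just mentioned: the oslo.db option in question is, I believe, slave_connection. A rough sketch of what that could look like is below, with placeholder URLs, and with the caveat repeated that the Ironic community has not tested this pattern.

```ini
[database]
# writes and authoritative reads go to the primary
connection = mysql+pymysql://ironic:secret@db-primary.example.com/ironic
# oslo.db "secondary" database; read-only queries may be directed here
slave_connection = mysql+pymysql://ironic:secret@db-replica.example.com/ironic
```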
And you can just imagine how that cold start goes: this is a graph provided by CERN, where you can see the running environment. They shut everything down, and then they started everything up, and you can see the queries per second go through the roof. I believe the next section, until about 2pm UTC, was when they were trying to figure out what was going on. Right. And I believe the next section was when they started to introduce sequencing into it. Right. Go ahead. So basically, as you said, it's absolutely correct. It basically shows, on the very left-hand side, before the upgrade; then everything was shut down and everything was started at the same time. Then the first sequence, until 2pm, is without the patches that actually do the lazy loading, because I branched off a little bit before that patch. I had run with that patch before, so I was totally confused why all of a sudden the database went crazy; I realized it at around two. And then the last step is around seven, when I sequenced out the restarts. So the restarts of the conductors are staged and smeared out rather than all synchronized, which would all hit the database at the same time. That's the different sequence you see. In the end, if you look at the very right, you see lower database activity, and it's less spiky than on the very left; but yeah, we had to go through multiple stages here. How many conductors were there, and at what interval did you do the sequencing? How long did you delay before starting the conductors? So, for around 8,000 physical nodes, which is what we had at the time, it's 25 conductors, and the staged restart uses a random delay of between 30 and 90 seconds. So basically I go one by one and leave 30 to 90 seconds between the start of each. It takes about 40 minutes or so to restart everything. Okay, great. Does that cause any issues with the hash rings, either on the Nova side or the Ironic side? The Nova hash ring is independent of the Ironic hash ring, so it has no insight into whether conductors are alive or not. It trusts that the API service will get the request to the destination endpoint. Right, and then we have conductor groups, which map the nodes directly to a conductor. Okay. Yeah, thanks. I think the next slide is actually conductor groups. Yes, it is conductor groups. So, conductor groups are a concept we created to represent physical realities. There are scale limitations to environments, and there are geographic limitations: you don't need or want a conductor in, say, New York managing machines in London. You don't need data center row C talking to machines in row A. So the idea is to use conductor groups, where you delineate the hash ring further by stating to Ironic: this conductor is in, say, conductor group A, and it will only have nodes assigned to it via the hash ring based upon the actual configuration of the nodes. So if you have two conductors for conductor group A, then all of your other nodes will end up on the other conductors. This is useful for those physical realities and for preventing those sorts of things, but it can also be used, and queried directly from nova-compute, to limit the scope and view of the nova-compute process as well. So, kind of an idea of where this ends up heading: you almost start splitting the infrastructure.
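A minimal sketch of the staged restart Arne describes above, assuming the conductors run as a systemd unit called ironic-conductor on hosts reachable over SSH (both the unit name and the host names are assumptions about your deployment):

```bash
#!/usr/bin/env bash
# Restart conductors one at a time with a random 30-90 second gap,
# so they do not all rebuild the hash ring and hit the database at once.
for host in conductor{01..25}; do
  ssh "$host" sudo systemctl restart ironic-conductor
  sleep $(( (RANDOM % 61) + 30 ))
done
```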
So, continuing that idea of splitting the infrastructure: if you have a distinct set of nova-computes, you can have one set of Ironic APIs, or multiple sets of Ironic APIs; they can run with different configuration, and they can share the same database, or the same configuration, and share the same database. And then you can run multiple groups of conductors. Depending on your needs and your load, you may have a group that does not need power synchronization as often, that might just be doing long-term batch processing. And then you might have conductors for instances where you want to keep the state information as up to date as possible, so you can provide your users with feedback, so they know, oh, my machine is off, or, oh, I need to do something on my machine, and so on and so forth. One of the useful things is that you can run these conductors with different configurations. They'll have a configuration saying, I'm in this conductor group, but you can also say, here are your intervals, here are your timers, and here are your default boot settings, or even your drivers. So you have a lot of power here, and, going back to the hash ring, drivers are also delineators in the hash ring; it's just that conductor groups come before drivers in the hash ring calculation. So, this is a slide of CERN's layout, and you can kind of see what Arne was talking about: they have Nova configs with conductor group configuration mapped to Ironic conductors, in essence, with specific conductor configuration, so that that pool of machines, or those interactions, are limited to that environment. So you can say, oh, I want to run all these processes on one machine, or I need to run it in this physical area, or in data center A, data center B, and the sky's the limit in terms of what you end up with and how you want to work with this. Right, if I may add one additional example of what you just described. As you said on the previous slide as well, you can have different configurations, and this is also what we have here, and what I also tried to show a little bit in the picture. If you look at the conductor group which is the leading group, you could have more conductors, because this is maybe the group where new nodes are added, so there's more activity in the beginning: you have inspection, and you have cleaning, and you have deployment, and then it gets calmer for the conductor, so one conductor is maybe good enough. Another thing that we do in this leading group where we add new nodes is enable fast track. Well, it's not exactly that; we have another group, group zero if you like, where we have fast track enabled, so the nodes go through inspection, cleaning, and so on, benchmarking, burn-in, without rebooting. And there is no Nova conductor, or sorry, no nova-compute, for this group at all, and we have multiple nodes in there. So basically, the team that is enrolling hardware uses this group as a kind of waiting room for Ironic, and they do all their stuff, like benchmarking and burn-in, there. There's no nova-compute that is interfering in any way, there's fast track, and once they're done, I move the nodes to the corresponding conductor group, where they are then deployed. So these conductor groups really give you, in addition to scaling, a lot of possibilities to do additional configuration that suits your workflows.
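Pulling the conductor group pieces together, here is a hedged sketch of the configuration on the pieces involved; the group name, host names, and the fast track choice are examples rather than recommendations. Nodes are then placed into the group with `openstack baremetal node set <node> --conductor-group rack-a`.

```ini
# ironic.conf on the conductors serving group "rack-a"
[conductor]
conductor_group = rack-a

# per-group overrides are just ordinary config on those conductors,
# e.g. an enrollment group might enable fast track
[deploy]
fast_track = true

# nova.conf on the nova-compute process responsible for that group
[ironic]
partition_key = rack-a
peer_list = compute-a1,compute-a2
```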
We introduced conductor groups mostly for scaling, because of nova-compute, because of the time it takes for the resource tracker to find all these nodes; that is why we chunk it up into smaller pieces. We then discovered that we can do a lot more with them, so it's really a very powerful concept. There are some pitfalls to this powerful concept, though, specifically if you have a node already in Nova and change its conductor group. The nova-compute code doesn't understand or handle this very well. There is some discussion, and there are patches, works in progress, to try to remedy this situation. There are actually a couple of different interaction issues with the design of Nova and how it tracks resources. So, hopefully, sometime in the near future we'll have this largely resolved, and a patch that we can backport to older branches to try to make this seamless and just be a thing. And this is even an issue when you don't have an instance on the node. With an instance, I would not have dared to reshuffle things between conductor groups, but I thought, without an instance, what does Nova care? But it does care, and I screwed it up big time, so there's a big warning there, yeah. So that's one of the issues. The reason we want to fix this is because Nova thinks, oh, this is a physical compute node that runs VMs, not bare metal, and thinks, oh, this is no longer in the cluster, it's gone, I'll delete the record. That record might have appeared elsewhere, but it doesn't know that, or the one that issues the delete doesn't know it, and it just is what it is right now. Hopefully it'll be fixed soon. Now on to ironic-inspector, because this is a different topic. Why would you want to scale ironic-inspector is probably a good question, because inspector's usage is largely intended for when you add nodes or enroll them originally. So if you want to add a bunch of nodes at one time, it might be useful. Another case is active node introspection. And there are some important things to keep in mind with ironic-inspector. It has the ability to have a scaling model similar to Ironic; however, most operators only launch it as a single process. We have mixed messages about whether you can actually run it behind load balancers and have it work or not, and I'll talk about this a little bit more in the next slide, or two slides, actually. The issue with inspector is largely that it has some internal state, which is very much not load-balancer friendly. Often inspector ends up being configured without a real database, which compounds this problem. The internal caching code has been improved so that you don't necessarily run into this issue; however, if you're running a much older version, you probably don't have the patch that fixes it, and it might not work. And if you're using load balancers, depending on what you're using, you may require a bug fix. In short, inspector is not really ready to scale, but it was never really intended or designed to do so. So, back to load balancers. We recently discovered a bug in eventlet, of all places. The long story, as short as possible, is that eventlet made some bad decisions, or maybe "bad decisions" is not the right word: they read the RFC one way, and missed the next paragraph. In essence, they thought all empty replies are treated one way; in reality, empty replies are treated a little bit differently if you have a return code of 204, which is what inspector will use in a couple of cases.
This was causing some versions of HAProxy to actually fail the request processing completely. We found that Apache kind of handles this and masks it, but sometimes not, and this is where we're getting some mixed signals; we're not entirely sure. So we have created a patch to try to fix this. I've had feedback saying both that it works and that it doesn't, so I'm kind of hoping we'll figure this one out soon. But the eventlet maintainers were quite responsive when we pointed out the RFC and what the code is actually supposed to do, so I'm hopeful that a future version of eventlet will have this fixed. So now it's time for questions and comments and everything else that makes these sessions so valuable for everyone, and for ourselves. Thank you, Julia, for this great overview of all the work that has gone into scaling Ironic. Do we have questions? Speak up: comments, user experiences? At what point did you find that tuning the power sync actually became necessary for performance in your environments? Number of nodes, or processes, or what was your trigger to realize, hey, by turning down the power sync I can improve things? If somebody is encountering these types of performance issues, what would they be looking for in order to say, hey, turning this down or turning this up or off would potentially help and improve my environment? My feeling would be that task exhaustion would be a solid sign of this issue: high database load, high CPU load on the conductor, for the most part. Would looking for a "your task has outlasted its process by 27 seconds" type of message in the log be a trigger that somebody would use to look for these things? I don't believe we actually have a message for that; that's more on the Nova side, which does that. But there is the periodic task running into itself; that's at least the timing. So if it hasn't finished, I think the default, as Julia said, was 60 seconds; I don't know, we changed it now to 300 seconds or something, because at some point the task was running into itself and did not actually finish. And this was at a couple of thousand nodes already; it ran into itself, and this is when we increased it. There are also parameters that were introduced, I think two or three years ago, where you can have these power sync calls, the IPMI calls, run in parallel, and we also played with this a little bit. But, as was also pointed out, with IPMI there are all kinds of BMCs that are not always under full control. It's very unpredictable, but I think for us, at some point the power sync was running into itself, and then we increased the interval. In our particular case, we ended up seeing folks on the CLI end up with "six out of six retries failed, please retry your operation", and it was literally because the API was swamped with power checks. It's not the API that gets swamped with the power checks; it's the conductor threads, running out of conductor threads because it's waiting. Absolutely, yes, thank you for correcting me. So that's a solid indicator that it's time to scale my deployment quite a bit: when CLI commands start returning "six out of six retries failed".
Talking about scaling the deployment, do we actually have numbers, for instance, for the number of API nodes per number of nodes? Because the way we have deployed it is basically all-in-one Ironic controllers that always have an API, a conductor, and an inspector, and I just scale them out horizontally. For other components, for instance for Nova, we basically split the API from the conductor, and, depending on the needs, we have more or fewer API nodes than conductor nodes. So do we have any rule of thumb for how many API nodes you would need for a deployment with Nova, per N nodes or something? I don't think we do, and largely this is kind of an artifact of the process launch model. If you do a native launch of the API, I believe it will launch workers based on the number of CPUs you have. So say you have 90 CPUs; I think it used to actually try and launch 90 workers, or something absurd like that. I think that's been toned down by default, but another common launch pattern is to actually use WSGI, say with Apache. There you may run into a situation where, well, let me clarify: your configuration might be such that you say, I only want two concurrent requests to be processed, which is really not ideal for an API surface. And this is the workers and threads counts: basically, in the WSGI app, workers are useful in case processes die; however, that's not really been a reported issue. The threads are really where it's useful, as much as workers based on the number of processors, and the amount of workload that you have in your environment, and that's going to depend on what you have, how you're working with it, and the scale at which you're running. We don't have a hard and fast number, because there are really three or four different ways you could launch the API and run it and configure it and have it working in an environment. There's also going to be additional overhead if, say, you're terminating SSL on the end process versus terminating SSL at a load balancer or something like Apache, because running all that SSL termination through additional layers does add overhead; however, with something like Apache it's much closer to C and the CPU, with more optimized interactions. So, I wish we had numbers, and I would love to hear about operators' experiences here. As I said a couple of times, I pack everything into one controller and then add controllers as I see fit, but I never made use of the fact that I could actually split APIs and conductors and the inspector; I just replicate them, and it seems to be working okay, so I was just wondering. Inspector is probably the weakest point in all of Ironic in terms of scaling and operation, but active node inspection and inspecting large numbers of nodes were not really the original use cases for inspector; they were bolted on, or made logical sense as, oh, I need to update this one machine's data, or I need to add these ten machines. Right. One's mileage will vary, or kilometers. Just to add on this, for those who wonder what active introspection is: active introspection is when you run the inspection agent on a deployed instance in order to update the inventory data that inspector would provide.
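Going back to the Apache and WSGI point above: the concurrency is set by the mod_wsgi daemon process definition. A sketch with illustrative worker and thread counts, and an assumed path to the generated ironic-api-wsgi script (the script location varies by distribution):

```apache
WSGIDaemonProcess ironic-api processes=4 threads=16 user=ironic group=ironic
WSGIScriptAlias /baremetal /usr/bin/ironic-api-wsgi
<Directory /usr/bin>
    WSGIProcessGroup ironic-api
    WSGIApplicationGroup %{GLOBAL}
    Require all granted
</Directory>
```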
We ran this active introspection because, at the original inspection, when we ran it, there was a bug in the version of lshw that we used, and it reported zero RAM. That went into our inventory system, it went unnoticed, and the nodes went into production. At some point later, someone noticed, hey, all these nodes don't have RAM in our inventory, there's something wrong. So we re-ran introspection via a container that we built and updated that information, and this is when we needed to do introspection at scale. Well, at some scale, with 200 nodes, but still; we may run this on a couple of thousand nodes in order to update some other data that is missing, and this is where this feature of active introspection came in very handy. Joey, sorry, go ahead. I was going to say, you deployed that as a container; was the underlying operating system on those nodes capable of supporting that container natively, or did you have to install all of that support? My concern is going out and installing a bunch of stuff on users' machines. So, I was lucky: the set of nodes I needed to do this on were part of our batch cluster, and they had all the infrastructure already there. So all I needed to do was basically run a one-line Docker command which pulls the container and executes it, with the parameters that you can pass to the active introspection, which is the interval, where you can say, okay, run only once. I think if you omit the interval or something, it runs only once, then the container exits, the container is cleaned up, the command exits, and you're all done. Okay, the image is still cached there, but apart from that, everything is cleaned up. It's a very, very neat way of updating the introspection data afterwards, after the fact, actually. And since the first half of our nodes has not been inspected with all the hardware details (there's this extra hardware collector where you actually get much more detailed hardware information, and we only enabled this after we had enrolled a couple of thousand nodes), if we want to have this detailed information we will run a campaign and run this on all the other 4,000 nodes. But yeah, we may have to introduce or install some additional tools; we'll have to see how that goes. Okay. More questions or comments? I have one more comment. I think one very nice takeaway from this work and this session is also that operator feedback actually works. Many of the things that we have seen here were actually triggered by operators saying, hey, this is slow, can we do something about this, and then upstream, if they find the time or if the problem is big enough, goes along and actually fixes things. You saw some of the massive improvements in the graphs that Julia shared, and that's just an awesome collaboration, or feedback loop, between operations and how the code is improved. I find this quite amazing, and you saw multiple instances of this in the presentation today. It's really cool. It is the most powerful interaction in open source, to have a feedback loop with those operating the software, which gives us the requirements and experiences that we can take back and use as context moving forward. And I really do want to thank all the operators that have expressed frustration and feedback constructively, and have really helped us move this forward, because it's really amazing the scale some of the operators have reached.
There are obviously lots of config changes some operators had to make, and even some code changes themselves. But getting that stuff back into the community, and getting this information out there, is kind of vital in terms of making this easier, because people can run Ironic with one machine deploying another, or they can run it to deploy thousands and thousands of machines. It just really depends on how they want to run it. Okay. Well, thank you everyone. Thanks again, thanks everyone for joining, and see you next time around.