Hello, welcome everyone. This is our talk, "Ironic and Metal³: bringing two worlds together". I'm Iury Gregory Melo Ferreira, senior software engineer at Red Hat, current Ironic PTL, and also a Metal³ contributor. And my name is Dmitry. I'm a principal software engineer at Red Hat, a long-term (eight years) contributor to Ironic, and a contributor to Metal³ from its very beginning.

So what will we discuss today? We'll give an overview of the projects, then talk about the integration challenges, the architectural challenges, and the perception challenges, draw some conclusions, and open up for questions.

Overview. Ironic is the open source project we have for full-image bare metal deployment; it lives under OpenStack. Kubernetes is the system for automating deployment and scaling of containerized infrastructure. And then we have Metal³, which is the bare metal provisioning we have for Kubernetes; it's a CNCF project.

So how do they work together? We have Ironic, which has an API to manage bare metal in your infrastructure, and it's an imperative API. Then you have Metal³, which uses Kubernetes and the Kubernetes API to define bare metal hosts in a declarative way. Basically, you have the Ironic API running, and in the end the Kubernetes API talks to the Ironic API. But it's not only that. In a Metal³ environment, you have the Kubernetes API, and you create a CR for the bare metal host definition, with all the information you have and the state you want for that bare metal host. The Bare Metal Operator takes care of talking to the Ironic pod and requests the creation of the node from the Ironic API, and if everything goes well, in the end you have your bare metal machine deployed. Our reaction to all of this when we started working on it was something like: is that actually going to work?
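To make the declarative side concrete, here is a minimal sketch of the kind of information a BareMetalHost CR carries, written as a Python dict (what you would normally write as YAML and apply with kubectl). The field names follow the metal3.io/v1alpha1 schema, but every concrete value below is made up for illustration.

```python
def make_bare_metal_host(name, bmc_address, boot_mac, image_url, checksum_url):
    """Build a BareMetalHost definition as a plain dict.

    The Bare Metal Operator watches resources shaped like this and
    reconciles the physical machine toward the desired state by
    talking to the Ironic API.
    """
    return {
        "apiVersion": "metal3.io/v1alpha1",
        "kind": "BareMetalHost",
        "metadata": {"name": name},
        "spec": {
            "online": True,                    # desired power state
            "bootMACAddress": boot_mac,
            "bmc": {
                "address": bmc_address,        # how to reach the BMC
                "credentialsName": f"{name}-bmc-secret",
            },
            "image": {                         # what to deploy
                "url": image_url,
                "checksum": checksum_url,
            },
        },
    }

host = make_bare_metal_host(
    "worker-0",
    "redfish://192.0.2.10/redfish/v1/Systems/1",
    "52:54:00:aa:bb:cc",
    "http://images.example.com/os-image.qcow2",
    "http://images.example.com/os-image.qcow2.sha256",
)
```

The point of the declarative shape is that you state the end result (online, with this image) and the operator, not you, drives the imperative Ironic calls to get there.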
So, some of the integration challenges we faced while working on both projects. The first one is authentication. In Ironic, the supported authentication types were tied to Keystone, the identity service for OpenStack, plus the noauth mode. In the beginning we had only these two. When the Metal³ project started, people only wanted to use Ironic and didn't want other dependencies related to OpenStack: we only need Ironic for bare metal deployment, so let's use only that. But noauth can't really be used in production, for security reasons and so on. So how did we handle it? We went for our own solution, creating HTTP basic authentication support in OpenStack; it's available in Keystone now and also in the OpenStack SDK. The lesson we learned: we took for granted that all our authentication use cases would be covered by Keystone or noauth, but in a completely new world, HTTP basic is now there to solve our problems.

Verify steps are a new feature we included in Ironic in the last cycle. What are they? There is a new mechanism in Ironic for when you are creating a node: you enroll the node, you want to move it to manageable, and you can run some steps first to verify a few things on your machine, to confirm everything is working before you try to actually deploy it. You run predefined actions (you create a verify step) and you can check whether your host is healthy or not. Why is it important? We have cases of unresponsive BMCs: one is frozen and you can't request anything. So you want to make sure your BMC is able to receive all the requests and answer correctly. And maybe in the future a few more things will come up, like checking support for certain features. For example, virtual media support: some hardware has it, some doesn't, so maybe
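At the wire level, HTTP basic authentication is nothing more than a base64-encoded `user:password` pair in the `Authorization` header, which is a big part of why it was such a good fit for a standalone Ironic. A stdlib-only sketch (the endpoint and credentials are invented):

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header for HTTP basic auth, the scheme
    a standalone Ironic can accept instead of Keystone tokens."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}

# Hypothetical standalone endpoint; any HTTP client would attach these
# headers to its requests, e.g. GET /v1/nodes.
IRONIC_API = "https://ironic.example.com:6385"
headers = basic_auth_header("ironic-user", "s3cret")
```

No token service, no identity backend: the server only needs a password file to check against, which is exactly the kind of dependency footprint a small appliance wants.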
someone will create a verify step that checks for virtual media support before you try to deploy a node. Otherwise: oh, I want to use virtual media; well, your node doesn't support it, so you go through the deployment and only see a failure at the end. The lesson we learned: we will be helping operators automate away a few challenges they have in their setups around unresponsive BMCs, and that makes their lives easier, because in the Kubernetes context you don't really have a human operator looking at the hardware and at what is happening; it's just the Kubernetes operator taking care of everything.

Then a few more use cases showed up, related to metrics: exposing hardware metrics for monitoring. It's important for some production use cases; you want to monitor things like the fan speed and the temperature of your machine. In Kubernetes, the metrics you expose can normally be integrated with Prometheus, so that was our use case at the beginning. We created a new project, ironic-prometheus-exporter. Basically, it makes it possible to collect metrics from IPMI and Redfish and provide them to Prometheus, so they're available in a cluster. While working on ironic-prometheus-exporter, we noticed that different types of BMCs, depending on the hardware vendor, expose different types of metrics; they don't always show the same ones. You will also get different hardware metrics depending on the driver you're using, for example IPMI versus Redfish.

The most interesting integration challenge we had, I would say, was related to Redfish virtual media support. It should just work, right? Basically, Redfish is defined by the DMTF.
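The exporter's core job is turning BMC sensor readings into Prometheus's text exposition format. A toy sketch of that translation; the metric and label names here are invented, not the exporter's actual ones:

```python
def to_prometheus(node, sensors):
    """Render {sensor: (value, unit)} readings as Prometheus text
    exposition lines, one gauge per sensor, labelled with the node."""
    lines = []
    for name, (value, unit) in sorted(sensors.items()):
        lines.append(f'baremetal_{name}_{unit}{{node="{node}"}} {value}')
    return "\n".join(lines)

sample = {"fan_speed": (4200, "rpm"), "inlet_temp": (23.5, "celsius")}
print(to_prometheus("node-1", sample))
# baremetal_fan_speed_rpm{node="node-1"} 4200
# baremetal_inlet_temp_celsius{node="node-1"} 23.5
```

The vendor-variation problem mentioned above shows up exactly here: the set of keys in `sensors`, and even their units, differs per BMC and per driver, so the exporter cannot assume a fixed metric list.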
So it's a standard. People should just follow the standard and everything would work perfectly. Not in the hardware world. We started to see a lot of problems with different types of machines. Basically, the approach we had to take with Redfish virtual media is to really check whether the support on each piece of hardware works, because sometimes a machine just has a different URL for the virtual media resource, and sometimes it asks for parameters that are not actually required by the standard. That makes things difficult when you try to deploy with virtual media on many different types of hardware. The solution we went with: we have a few checks in Ironic for each type of hardware with a specific problem. If it needs a specific parameter, we try to add it; if not, we just go down the normal path. Even with that, virtual media can work, but sometimes, OK, Redfish tells you it's working, you can check that the system is there and responds to your requests, but when you try to deploy, issues can still occur. Normally they are related to firmware versions: the firmware version currently on your machine doesn't really work well with virtual media, so in the end you probably need to upgrade your firmware to get it working. And, like I said, different quirks can occur where you need to pass some specific parameter to make it work.

Another Redfish case, one that really highlights the value of collaboration between the Metal³ and OpenStack communities and all their members, is the case of Redfish ETags. For those who don't know, ETags are small, checksum-like pieces of information that an HTTP server returns with a resource, so that when you update the resource you can provide the ETag again, and the server will check that the resource was not updated in between. Great idea.
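To make the conditional-update dance concrete, here is a toy sketch of the client-side decision: which `If-Match` values a tolerant Redfish client might try, in order, given what the BMC returned. This is an illustration of the idea, not Ironic's actual code.

```python
def etag_variants(etag):
    """Return If-Match values to try, in order, for a conditional PATCH.

    Weak ETags (the W/ prefix) must not be used with If-Match per the
    HTTP spec, yet some BMCs mandate exactly that, so a tolerant client
    tries the tag as given, then the strong form, then "*" (match
    anything) as a last resort.
    """
    if etag is None:
        return ["*"]                # hardware with no ETag support
    variants = [etag]
    if etag.startswith("W/"):
        variants.append(etag[2:])   # strong form of a weak tag
    variants.append("*")
    return variants

print(etag_variants('W/"5d41"'))
# ['W/"5d41"', '"5d41"', '*']
```

The design point is that the client never assumes one interpretation of the standard; it degrades gracefully across the no-ETag, optional-ETag, and mandatory-weak-ETag hardware described below.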
Unfortunately, not everyone implements it, and the DMTF, the standards body behind Redfish, has a bit of a creative implementation of ETags, to put it mildly. You end up with some hardware supporting no ETags, some hardware supporting ETags but only optionally, and some hardware with mandatory usage of ETags: if it returned an ETag to you, you have to send it back. And the most wonderful part: somehow the standard mandates so-called weak ETags. If you read the HTTP standard, those cannot be mandatory, because a server actually must return an error if you provide a weak ETag in a conditional update. But it can be mandatory in Redfish, and some hardware actually made it mandatory. The story started with somebody just coming to our IRC asking about, I think, Lenovo hardware. Then it continued with these wonderful people from CERN, and then it ended up in our Metal³ world, and all three parties together could figure out a combination of magic in our code that makes it work for all hardware. Without this collaboration we would probably still be stuck updating and updating and updating things over and over again.

On this positive note, I want to move on to the architectural challenges of bringing the two worlds together, starting with a story about RPC. OpenStack, as you know, uses RabbitMQ. We all love it. We all love it, but in the Kubernetes world we really wanted to keep this Ironic appliance small and tiny and not maintain that wonderful piece of software written in Erlang. So we developed JSON-RPC support in Ironic; now Ironic can use a very simple RPC solution instead of RabbitMQ. And that's not only valuable for Kubernetes, of course: Bifrost benefits from it too.
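On the wire, JSON-RPC is just a small JSON body POSTed over HTTP, which is why it is so much lighter to operate than a message broker. A minimal sketch of building such a request; the method name and parameters are illustrative, not a statement about Ironic's internal RPC interface:

```python
import json

def jsonrpc_request(method, params, call_id=1):
    """Serialize a JSON-RPC 2.0 request body, the kind of payload an
    API service could POST to a conductor over plain HTTP(S)."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": call_id,
    })

body = jsonrpc_request("deploy_node", {"node_uuid": "abc123"})
print(body)
```

Compare the operational footprint: no broker cluster, no queues, no Erlang runtime; just two Python processes speaking HTTP to each other.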
We're going to talk about Bifrost later today, so come to that. So, problem solved? Well, when you start thinking about architectures, you also start thinking about how to go one step further, and a small application actually does not need an RPC layer at all, which is a fresh thought, maybe, in an OpenStack context. So we added a mode in Ironic which combines the API and conductor executables in one binary (well, you can't really say binary, it's Python, right?), and JSON-RPC is now completely optional in both the Metal³ and Bifrost contexts. So again, some rethinking of architecture, caused by researching this Kubernetes use case, gave us not only a better solution for the Kubernetes world but also more options for operators inside OpenStack.

In a similar vein, we did not really want to have a separate database. Now that we don't have the API and conductor as separate services, why do we really need MariaDB? The thing about Metal³ is that Kubernetes is the authority: BareMetalHost custom resources are what define what happens with the nodes. Ironic in this case is a utility, and its database is pretty much, you could say, a cache. It's ephemeral; it's rebuilt every time you start a new pod. The Bare Metal Operator rebuilds the database by either redoing some operations on the nodes or adopting them (adoption being the Ironic process for taking already deployed nodes and just registering them). What we ended up doing is improving SQLite support and using SQLite as a small, lightweight local database for our new combined Ironic process.

The lesson from all this, as I said, is that it's actually very useful to challenge common technologies. We get used to things. We got used to RabbitMQ, even when it makes us suffer; we got used to MariaDB as the solution. Trying to apply the common technologies in a new world actually leads to some rethinking, to new decisions, to new use cases. That was very helpful for us. And to balance it out,
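The stdlib `sqlite3` module shows why SQLite fits a single-process, cache-like database so well: relational queries and transactions with no server to run. A toy sketch with an invented schema, far simpler than Ironic's real one:

```python
import sqlite3

# In-memory for the sketch; a combined Ironic process would point this
# at a local file instead.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE nodes (
        uuid            TEXT PRIMARY KEY,
        provision_state TEXT NOT NULL,
        bmc_address     TEXT
    )
""")

# Transactions work the way a relational database is expected to:
# the 'with' block commits on success and rolls back on error.
with conn:
    conn.execute(
        "INSERT INTO nodes VALUES (?, ?, ?)",
        ("abc123", "available", "redfish://192.0.2.10"),
    )

(state,) = conn.execute(
    "SELECT provision_state FROM nodes WHERE uuid = ?", ("abc123",)
).fetchone()
print(state)  # available
```

And because the data is rebuilt on every pod start anyway, losing the file costs nothing; the authoritative state lives in the BareMetalHost resources.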
I want to tell an opposite story. In OpenShift (I'm working on OpenShift) we needed to support custom installation CDs for a certain layered product on top of OpenShift; I won't go into details. Yeah, we can do that. For unrelated reasons, we had long been talking about a mode of operation in Ironic that just boots some code, and that's it. It was driven by the HPC community, who just wanted to run some simulation on a completely ephemeral instance: just an ISO with some tools; you boot it, you run a simulation for a day, you shut it down. So, yeah, we can reuse that, right? This product disabled inspection in Ironic, disabled cleaning in Ironic, and started using this so-called ramdisk deploy interface for its purposes. And, yeah, problem solved?

The lack of integration between this so-called live-ISO flow (which is a feature in the Bare Metal Operator; by the way, we didn't remove it) and the normal Ironic flow became a problem quite quickly. The product, which is built on top of OpenShift, which is built on top of Metal³, had to actually simulate inspection. Firmware settings management, which is a great feature in Ironic, was not working, because it relies on the normal flow. Deployment status was weird: Ironic, and thus the Bare Metal Operator, reported the deployment done when it had merely booted the CD; we hadn't even made sure the CD actually booted. Oops.

So we are now bringing it back in. We have extended the custom deploy step mechanism in Ironic to be able to accommodate what they want to do. Instead of, you know, going down completely different paths, we are making it possible for them to plug into the normal deployment process with their code: replace what Ironic does for the image writing, let's say, with what they need to do, but not replace everything else. So a lesson here: yes, you can change a lot based on new challenges, but there is still some architecture work to be done.
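The "plug into the normal flow" idea can be sketched as a tiny step registry: steps carry priorities and run in order, and a product can swap out just one step instead of abandoning the whole flow. This mimics the shape of Ironic's deploy-step mechanism, but the decorator and step names below are made up for illustration:

```python
REGISTRY = []

def deploy_step(priority):
    """Register a function as a deploy step; higher priority runs first,
    mirroring how step-based deployment orders its work."""
    def wrap(fn):
        REGISTRY.append((priority, fn))
        return fn
    return wrap

@deploy_step(priority=100)
def write_image(node, trace):
    # The one step a layered product could replace with its own
    # custom-CD logic, leaving the rest of the flow intact.
    trace.append("write_image")

@deploy_step(priority=50)
def configure_bootloader(node, trace):
    trace.append("configure_bootloader")

def run_deploy(node):
    """Run all registered steps in priority order, collecting a trace."""
    trace = []
    for _priority, fn in sorted(REGISTRY, key=lambda pair: -pair[0]):
        fn(node, trace)
    return trace

print(run_deploy("node-1"))  # ['write_image', 'configure_bootloader']
```

Because only one step is replaced, everything that hangs off the normal flow (inspection, firmware settings, status reporting) keeps working, which is exactly what the separate live-ISO path lost.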
So it's good to revalidate what you're doing, but it's also good not to jump into, you know, completely new solutions, even for new problems.

And to finish up on some political notes, I'm going to talk about perception challenges. The first thing we hit when we said "OK, we'll do it with Ironic, but we don't want to install OpenStack": there is a lot of misunderstanding here, because using a part of OpenStack does not mean installing the whole of OpenStack. I think we have successfully fought this perception challenge in the Metal³ world; I would love there to be a wider understanding in the OpenStack community, the OpenInfra community, all the communities as well. Yes, I would love to see a standalone Nova. It may sound crazy, but why not? It's just a convenient API to run your virtual machines. That would be great. My biggest lesson from this: I am hoping for a reorientation, concentrating more on making small bits reusable. One monolithic product is also fine, but reusable small bits are what I would love to see, and I think we have a good example of that in Ironic, and in the Cinder community too; yeah, shout-out to them.

"Wow, why is it written in Python instead of Go?" Yes, it's interesting. In the container world you probably don't want a single-technology mindset, because containers are all about bridging stuff through a common interface. I think we did pretty well; there was some educating of people going on, and so far every attempt to rewrite Ironic has ended with people coming back and saying, you know, it's actually hard. Then there is a real thing: Python projects are harder to distribute than, for example, something written in Go. With Go you have one binary, you put it in a container or just start it, and that's all. A big shout-out to the RDO people.
We are using your packages, if you're here, to build our containers. That's another case of great collaboration between our community and the Metal³ community, both part of the big OpenInfra community. The existence of RDO enables us to overcome this distribution problem for Python projects. But it's also always useful, especially when dealing with containers (people tend to forget about development), to keep the ability to install stuff from source; your developers will say thank you. When you're testing something, the ability to build from source, bypassing packages, is a great thing, and I want to see more of that.

And yeah, on the topic of rewriting, I've heard something like: if you want a separate Redfish and virtual media component, we can probably just, you know, write a small Go thing, just a small one. Go back to the slide about Redfish compatibility first. Hardware is about subtle behaviors. That's something, I think, you only understand after working several years on Ironic. It's less about "OK, there are 2000 pages of UEFI standard, just read it and follow it" (and the Redfish standard is, I think, also hundreds if not thousands of pages). It's about subtle differences, about gaining experience and sharing that experience. Ironic has been built up, over and over again, to be a mature, usable, stable and feature-rich foundation. I would love this experience, the same kind that OpenInfra projects gained with Nova over twelve years, to be shared across infrastructures and across projects, instead of, you know, building silos. I think we should really avoid building silos, break the walls, share experience, and build foundations that can be reused over and over again.

To conclude, and essentially to reiterate what we just said: we need more integration.
I just talked a lot about it. And we need to challenge our assumptions, revalidate every now and then, and check our established practices for sanity, so that people don't end up hating RabbitMQ. Standards are important, but how they are implemented is even more important. Experience with implementations is still a great asset that cannot easily be replaced by people just reading a PDF. And OpenStack is alive; OpenStack is ready for new challenges; it's thriving. So hey, keep rocking. That's it, thank you very much. Any questions? If you have questions, there's a mic there. Yeah, if it's too far, just shout and we will repeat. I will also shout.

So, can you hear me? I guess so. Yeah. I was really interested in what you mentioned about the databases and so forth. So you basically went from some SQL, MariaDB or whatever you want to call it, to SQLite. But if you are integrating with Metal³, you've already got etcd. Why don't you just integrate with etcd and drop SQLite?

There are a few reasons for that. First, there are some architectural problems with actually getting access to etcd there, but that's not the interesting part. More importantly, it doesn't map well onto the patterns Ironic uses. Ironic was developed to work with a relational database; there are a few complex queries, a few places where we rely on transactions the way relational databases do them. And also, from what I was told, etcd is not exactly fast. I mean, it's fast when you use it the way you should be using it, and Kubernetes goes to great lengths to use etcd properly: it uses notifications, for example, which is not something we do in the OpenStack world, right? We don't get notified about objects. So it may happen one day, but it's definitely not a priority now.

Awesome, thank you guys. One question about your experience with Redfish and all the subtle differences between the hardware manufacturers: I've encountered the same.
I'm pretty sure there's a secret competition between the hardware manufacturers to implement Redfish in ways that deviate from the standard at different points. Did you have a look into OpenBMC, to get rid of the crufty vendor-specific BMC implementations and go for the nice, shiny open source stuff?

Yeah, absolutely, but we are on the other end of that. We would love it: dear vendors, please use one implementation of everything. That being said, I don't think it's going to happen, so we have to accommodate it, because we are the client; we are the clients from the BMC's perspective. We cannot just go to the big names and tell them, you know, stop.

The question was more in the direction of: have you actually tried flashing OpenBMC and replacing the vendor's BMC? I have done that on at least a few servers, not professionally but for hobby projects, and it brought me great relief, because then I at least always got the same errors, and not twenty different kinds of errors.

Right. So, OpenBMC supports IPMI and it supports Redfish, yes. I know people have used at least the IPMI side with Ironic. We have not done that; I think we haven't really tried because it wouldn't solve our problem. Our problem is customers coming and saying: I have this server model, this and that, with that firmware version, and it doesn't work with your code; we need to make it work. And if you tell your customer to just flash a different BMC, they will probably show you the door right away. It's not something we can really do, and I would say that in many cases it would invalidate their support contract.

OK, thanks. Again? Yeah, the suggestion was to use a verify step to flash new firmware. I would totally do that in my last week at Red Hat. Anything else? OK, thank you, folks.
You can find us between the breaks. We are going to have another session about Bifrost and Ironic today, a more beginner-oriented one, so come and learn about standalone Ironic and how we use it, with a nice demo. It's around 2 p.m., I think, in the back. And yeah, find us with any questions. Thank you. If you have more questions, feel free to reach out.